Abstract
How hard can it be? It’s easy if the numbers are just a few hundred bits long. After that it gets a little tricky if you care about how long it takes. Here I calculate the biggest known prime number (trivial), which is 57,885,161 bits long, then render it to decimal (the tricky bit). I use Scheme and Python but mostly C++. My code is pitifully slow compared with the MPIR/GMP libraries.
Scheme
A while ago I heard on the news that a new biggest prime number had been found: 257,885,161-1. So I thought I'd try to generate it.
(define (prime n)
(- (expt 2 n) 1))
I found I had to be a little bit careful how I used this. I first tried
(prime 57885161)
and got no response for for about an hour, so I killed it. (I'm running this on DrRacket version 5.2.1 on a 2008 vintage iMac.) But the following form executed in under one second
(define p (prime 57885161))
and the following expression executes in under five minutes
(string-length (number->string p))
17425170
and these also execute in under five minutes each
(substring (number->string (prime 57885161)) 0 40)
"5818872662322464421751002121132323686363"
(with-output-to-file "/temp/prime.txt" (lambda () (printf "~a" (prime 57885161))))
That last expression creates a 17,425,170 byte file with the first and last few digits matching what Wikipedia says is the correct value.
Python
I tried something similar with Python. It took about three hours to complete, which was longer than I expected.
$
$
$python --version
Python 2.7.1
$
$time python -c "print(len(str(2**57885161-1)))"
17425170
real 170m9.957s
user 169m10.043s
sys 0m6.367s
$
$python3.3 --version
Python 3.3.0
$
$time python3.3 -c "print(len(str(2**57885161-1)))"
17425170
real 161m43.538s
user 160m51.519s
sys 0m5.939s
$
C++
And so I thought, how do I do this in C++? There is no built-in support in C++ for big numbers like there is in Scheme and Python. Boost offers a kind of wrapper around GMP - I haven't tried it. But I thought it's such a simple equation, how hard could it be to calculate it using only what the language itself offers?
My first idea was to represent a number as a vector of char where each char would represent one binary-coded decimal (BCD) digit. I'd need two simple functions, one to perform a multiply by two and one to perform a subtract one. Then I would start with 2 in my vector and call my multiply-by-two function 57,885,160 times and my subtract one function once. This has the advantage that the result will already be in decimal and the disadvantage that it would take hours, if not days, to run. Imagine, the multiply by two function would have to multiply and carry up to seventeen million digits each time it was called. And in total that function will be called fifty seven million times. Bonkers.
I needed a way to represent an arbitrary length binary number. One obvious way to do that is with a vector of some scalar value so that when you need to store more bits than will fit into one element of the vector you just push another onto the end and let the bits overflow into that.
#include <vector>
#include <climits>
#include <stdint.h>
// our arbitrary length unsigned number will be represented by a vector of
// "fragments", each fragment will be one of these
typedef uint32_t num_frag_t;
const int num_frag_t_size = sizeof(num_frag_t) * CHAR_BIT;
// arbitrary length unsigned number; least significant bits in lowest fragment
typedef std::vector<num_frag_t> num_vec_t;
A number of the form 2n - 1 may be represented by an n-bit number where every bit is 1. For example, 24 - 1 = 15 decimal, or 1111 binary. So using our num_vec_t data structure a C++ function to calculate 2n - 1 is quite simple
// set given 'p' to the value of (2^n)-1 for given 'n'; requires n > 0
void make_prime(int n, num_vec_t & p)
{
const int len = (n + num_frag_t_size - 1) / num_frag_t_size;
const num_frag_t ones = ~static_cast<num_frag_t>(0);
p = num_vec_t(len, ones);
if (n % num_frag_t_size)
p[len - 1] = ones >> (num_frag_t_size - n % num_frag_t_size);
}
(Yes, 'make_prime' is a misnomer because the function only returns a prime number for certain values of n, such as 57885161. But let's stick with this name.)
With this function I can generate the 57,885,161-bit biggest known prime number in about five milliseconds. My next problem is converting that big binary number to decimal.
From my schoolboy maths I know that if I divide a number by 10, the remainder will be the least-significant decimal digit of that number. For example, the decimal number 123 divided by 10 is 12 remainder 3. I can repeatedly extract the next least-significant decimal digit from my big binary number by dividing it by 10 over and over again until it reaches 0. The to_string() function in this program uses this idea
prime0.cpp:
|
I think that's fairly straight forward: we can now generate the biggest prime and convert it to decimal. There's just one problem:
$g++ -O3 prime0.cpp -o prime0
$time ./prime0
17425170
real 1606m16.433s
user 1582m25.091s
sys 1m20.332s
$
It takes to_string() a day to convert the 57 million-bit number to decimal. Heh. Python's three hours isn't looking so bad now and Scheme's five minutes is looking miraculous!
For every one of the 17 million decimal digits in the number to_string() will on average have to run through the inner for loop about a million times. (To begin with the number is 57 million bits long, which means the vector will contain nearly 2 million 32-bit words. By the end of the algorithm the vector is empty. Hence the average of one million inner-loops per decimal digit.)
Just to underline this: that inner loop will execute 17,000,000,000,000 times before we have an answer. If my computer could execute one iteration of the the inner-loop in a nanosecond, which it can't, it would still take 17,000 seconds or five hours to complete. As we've seen, in reality it takes 26 hours. (The good news is that this means the processor in my machine can execute those three statements in that inner for loop in about 5 nanoseconds, which I think is pretty good.)
There is a simple way to improve the algorithm. Instead of collecting one decimal digit per outer-while-loop we could collect nine. Like this
// return given 'num' as a decimal string
std::string to_string(num_vec_t num) // pass-by-value because algorithm is destructive
{
std::vector<char> result;
// keep dividing num by 1,000,000,000 until it is 0; the remainder of each
// division is the next least significant 9 decimal digits in the result
while (!num.empty()) {
uint64_t remainder = 0;
for (num_vec_t::size_type i = num.size(); i--; ) {
remainder = (remainder << num_frag_t_size) + num[i];
num[i] = static_cast<num_frag_t>(remainder / 1000000000);
remainder %= 1000000000;
}
// discard all 0-value number fragments from most significant end
num_vec_t::size_type i = num.size();
while (i && num[i - 1] == 0)
--i;
num.resize(i);
// extract the next 9 digits from the remainder
for (int j = 0; j < 9; ++j) {
// (C++ standard guarantees '0'..'9' are consecutive)
result.push_back(static_cast<char>(remainder % 10) + '0');
remainder /= 10;
if (remainder == 0 && num.empty())
break; // don't append superfluous zeros; e.g. 000000012 => 12
}
}
// the least-significant decimal digit is first so return result reversed
return std::string(result.rbegin(), result.rend());
}
prime1.cpp is identical to prime0.cpp except for the above change to to_string()
$g++ -O3 prime1.cpp -o prime1
$time ./prime1
17425170
real 242m48.302s
user 241m57.223s
sys 0m6.608s
$
This algorithm is only slightly more complicated than the previous version but it's almost ten times quicker. I can convert a one million-bit number to decimal in about five seconds. But for our 57 million-bit number the inner loop still executes about 2,000,000,000,000 times. As you can see, it still took over four hours.
How could our original Scheme program do this in under five minutes? They use a better algorithm.
A quick look at the Racket source shows that they use the GMP library. Presumably, when I call (number->string p) the mpn_get_str() function in gmp.c is eventually invoked. It seems to require some thousands of lines of C code to do the binary to decimal conversion when you include the other GMP code used in the process. Somehow I feel disappointed that there isn't a simple algorithm that won't take me days to understand.
A better algorithm (for really big numbers)
It turns out it doesn't take days to understand the general idea: We've seen that converting one 57 million-bit number to decimal using the previous algorithm takes hours. But we also know that with the same algorithm we could convert 57 one million-bit numbers to decimal in 57 x 5 seconds, or about five minutes. So, can we split the one big number into several smaller numbers, convert them to decimal, then stitch all the results back into one? Yes we can.
Imagine it was going to take days to convert the binary representation of 123,456,789 to decimal, but would take seconds to convert 1,234 and 56,789 to decimal. We could split our number into parts by dividing it by 100,000
1. 123,456,789 / 100,000 -> 1,234 remainder 56,789
2. Convert 1,234 to a string -> "1234"
3. Convert 56,789 to a string -> "56789"
4. Combine strings "1234" + "56789" -> "123456789"
We've added the cost of the division by 100,000, but we saved more time on the binary to decimal conversions, so we win overall. This is what we're going to do with our 57 million-bit prime.
There's just one fly in the ointment, as illustrated by this example
1. 123,000,789 / 100,000 -> 1,230 remainder 789
2. Convert 1,230 to a string -> "1230"
3. Convert 789 to a string -> "789"
4. Combine strings "1230" + "789" -> "1230789" WRONG!
But this is easily overcome by requiring results to be zero-filled fixed-width: after dividing by 10n we will require the quotient and remainder to be at least n digits, filled with leading zeros if necessary. Then we'll remove any leading zeros from the final result.
1. 123,000,789 / 100,000 -> 1,230 remainder 789
2. Convert 1,230 to a string, fixed width -> "01230"
3. Convert 789 to a string, fixed width -> "00789"
4. Combine strings "01230" + "00789" -> "0123000789"
5. Remove leading zeros from "0123000789" -> "123000789"
If you're wondering why the quotient needs to be fixed width, it's because we will be applying this algorithm repeatedly, so the quotient might be the high part of the remainder from an earlier division. We break the number into two, then we break each of these numbers into two, and so on until we have numbers small enough to be converted to decimal with our O(n2) algorithm from prime1.cpp.
Here is the complete C++ program that uses this algorithm to print the biggest known prime
prime2.cpp:
|
$g++ -O3 prime2.cpp -o prime2
$time ./prime2
17425170
real 107m10.509s
user 106m18.801s
sys 0m7.216s
$
Well, that's an improvement, but not as dramatic as I hoped for. I think I'm using the same algorithm as MPIR, but it still takes hours when I hoped for minutes. I'm going to try using the MPIR library functions for multiply and divide instead of my Knuth derived ones. This turns out to be quite easy to do because my representation of a multi-precision binary number as a vector of "fragments" maps quite nicely onto MPIR's contiguous "limbs". I need only be careful to match the size of my fragments to MPIR's limbs: my Windows MPIR library uses 32-bit limbs and my OS X MPIR library uses 64-bit limbs. Bearing this in mind I modify mul() and div() in prime2.cpp to call the MPIR functions
// our arbitrary length unsigned number will be represented by a vector of
// "fragments", each fragment will be one of these
typedef mp_limb_t num_frag_t;
#define NUM_FRAG_T_SIZE GMP_LIMB_BITS
const int num_frag_t_size = NUM_FRAG_T_SIZE;
// arbitrary length unsigned number; least significant bits in lowest fragment
typedef std::vector<num_frag_t> num_vec_t;
// product <- u * v; u and v must be normalised
void mul(num_vec_t & product, const num_vec_t & u, const num_vec_t & v)
{
// use mpn_mul() to do the multiplication; this is straight-forward
// because if we ensure our num_frag_t is the same size as GMP's
// mp_limb_t then our binary representation will match GMP's
// (what we called a fragment is what GMP calls a limb)
const mp_size_t m = u.size();
const mp_size_t n = v.size();
std::vector<mp_limb_t> p(m + n, 0);
if (m >= n)
mpn_mul(&p[0], &u[0], m, &v[0], n);
else
mpn_mul(&p[0], &v[0], n, &u[0], m);
normalise(p);
product.swap(p);
}
// quotient <- u / v; remainder <- u % v
void div(num_vec_t & quotient, num_vec_t & remainder,
const num_vec_t & u, const num_vec_t & v)
{
// use mpn_tdiv_qr() to do the division; this is straight-forward
// because if we ensure our num_frag_t is the same size as GMP's
// mp_limb_t then our binary representation will match GMP's
// (what we called a fragment is what GMP calls a limb)
const mp_size_t m = u.size();
const mp_size_t n = v.size();
if (m < n || n <= 0 || v[n-1] == 0)
throw std::runtime_error("div() unsupported input");
std::vector<mp_limb_t> q(m - n + 1, 0);
std::vector<mp_limb_t> r(n, 0);
mpn_tdiv_qr(&q[0], &r[0], 0, &u[0], m, &v[0], n);
normalise(q); quotient.swap(q);
normalise(r); remainder.swap(r);
}
// return decimal representation of given 'num' zero padded to 'width'
std::string to_string_fixed_width(
num_vec_t num, // pass-by-value because algorithm is destructive
int width) // result will be at least this wide, zero filled if necessary
{
std::vector<char> result;
// keep dividing num by 1,000,000,000 until it is 0; the remainder of each
// division is the next least significant 9 decimal digits in the result
while (!num.empty()) {
uint64_t remainder = 0;
for (num_vec_t::size_type i = num.size(); i--; ) {
#if NUM_FRAG_T_SIZE == 64
// each fragment of num is 64-bits: divide high and low 32-bits in two stages
remainder = (remainder << 32) + (num[i] >> 32);
uint64_t hi = remainder / 1000000000;
remainder %= 1000000000;
remainder = (remainder << 32) + (num[i] & 0xFFFFFFFF);
num[i] = (hi << 32) + (remainder / 1000000000);
remainder %= 1000000000;
#else
// each fragment of num is 32-bits: need only one 64-bit by 32-bit division
remainder = (remainder << num_frag_t_size) + num[i];
num[i] = static_cast<num_frag_t>(remainder / 1000000000);
remainder %= 1000000000;
#endif
}
// discard all zero-value 32-bit elements from most significant end
num_vec_t::size_type i = num.size();
while (i && num[i - 1] == 0)
--i;
num.resize(i);
// extract the next 9 digits from the remainder
for (int j = 0; j < 9; ++j) {
result.push_back(static_cast<char>(remainder % 10) + '0');
remainder /= 10;
if (remainder == 0 && num.empty())
break; // don't append superfluous zeros; e.g. 000000012 => 12
}
}
// add any necessary leading zeros to make the result the required width
width -= static_cast<int>(result.size());
while (width-- > 0)
result.push_back('0');
// the least-significant decimal digit is first so return result reversed
return std::string(result.rbegin(), result.rend());
}
The rest of prime3.cpp is the same as prime2.cpp.
$g++ -O3 prime3.cpp -l mpir -o prime3
$time ./prime3
17425170
real 3m37.341s
user 3m35.766s
sys 0m0.829s
$
Wow! That is the dramatic improvement I was looking for. Switching from the multiply and divide algorithms given in The Art of Computer Programming v2 4.3.1 to those used by the MPIR library made the code run 30 times faster.
What if I use the MPIR built-in binary-to-decimal function, mpn_get_str(), directly?
prime4.cpp:
|
$sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM)2 Duo CPU E8235 @ 2.80GHz
$
$g++
i686-apple-darwin11-llvm-g++-4.2: no input files
$
$
$g++ -O3 prime4.cpp -lmpir -o prime4
$time ./prime4
17425170
real 2m25.515s
user 2m24.323s
sys 0m0.521s
$
Well, that took 33% less time than prime3.cpp. But they're in the same ball park.
Conclusion
Here is a summary of execution times of all the above experiments
language name line execution digits/ comments
count time (s) second
C++/MPIR prime4.cpp 102 144 121,000 conversion using MPIR library code only
C++/MPIR prime3.cpp 297 215 81,000 same as prime2.cpp, but using MPIR for mul & div
Scheme/GMP 1 260 67,000
C++ prime2.cpp 354 6,378 2,700 my best effort in vanilla C++
Python 3.3 1 9,651 1,800
Python 2.7 1 10,150 1,700
C++ prime1.cpp 100 14,517 1,200 nine digits at a time
C++ prime0.cpp 93 94,945 184 one digit at a time
First, much respect to the mathematicians and programmers who developed and implemented the algorithms in GMP/MPIR.
I would like to have coded something that got a lot closer to the speed of the GMP/MPIR library code, but that would have taken a huge effort and I have other ideas bubbling up. It's time to move on.
How hard is binary to decimal conversion? It depends. For a few hundred bits the simple algorithm in prime1.cpp works well enough. Beyond that either use MPIR/GMP or be prepared to invest a lot of effort. (I hope someone will tell me about a simple, fast algorithm to do it.)
prime0..4.cpp and a test harness may be downloaded from here.
Finally, I don't know why Comeau's on-line compiler isn't there any more, but I missed it.
C++ program to convert decimal to binary
ReplyDeleteThanks
C++ program to convert decimal to binary
ReplyDeleteThanks
i use a much revised mapm c library,
ReplyDeletewhich does fft multiply in base 10000
multiply speed not as good as gmp, but
building decimal string is VERY FAST
rpn 0x1p57885161 1- ?i3 k > prime
finish in under 10 seconds ! (with comma added for clarity !)
email me if interested with my revised mapm library
albertmcchan@yahoo.com
sorry about my prior comment
ReplyDeleteyour title was binary to decimal
i think my solution might be cheating ...
it does not use binary ...
base 100
-> fft multiply in base 10000
-> base 100
-> decimal string
This comment has been removed by the author.
ReplyDeleteon the same laptop, gmp did it in 26.7 seconds
ReplyDeleteInteresting, thanks for this, came across your post looking for something similar, fyi, your code (prime4.cpp) ran in 4 seconds with lib gmp version 6.1.2,
ReplyDeleteMacBook Pro (Retina, 13-inch, Early 2015), 2,7 GHz Intel Core i5, 8 GB 1867 MHz DDR3
That's a remarkable improvement: from 144 seconds to 4 seconds. Thanks for commenting.
DeleteGreetings, Anthony. Improvements were made to GMP 6.2 for AMD Ryzen processors. GMP 6.1.2 and 6.2.1 ran in 4.292 and 3.472 seconds respectively.
ReplyDeleteParallel get_str https://github.com/marioroy/binary-to-decimal
g++ -O3 -DTEST6 -fopenmp prime_test.cpp -I/opt/gmp621/include -L/opt/gmp621/lib -l gmp -o prime6_test -Wno-attributes
time LD_LIBRARY_PATH=/opt/gmp621/lib ./prime6_test
The forked repo https://github.com/marioroy/binary-to-decimal was removed some time ago. I added examples fac_test.c, prime5.cpp, prime6.cpp, and prime_test.cpp to https://github.com/marioroy/Chudnovsky-Pi/tree/master/src/extra. The GMP/MPIR routines mpf_get_str, mpf_out_str, mpz_get_str, and mpz_out_str consume up to 8 threads max via parallel recursion.
ReplyDelete