# How to Write a Program

Anthony C. Hay

## 13 May 2013

### Abstract

How hard can it be? It’s easy if the numbers are just a few hundred bits long. After that it gets a little tricky if you care about how long it takes. Here I calculate the biggest known prime number (trivial), which is 57,885,161 bits long, then render it to decimal (the tricky bit). I use Scheme and Python but mostly C++. My code is pitifully slow compared with the MPIR/GMP libraries.

### Scheme

A while ago I heard on the news that a new biggest prime number had been found: 257,885,161-1. So I thought I'd try to generate it.

``````(define (prime n)
(- (expt 2 n) 1))
``````

I found I had to be a little bit careful how I used this. I first tried

``````(prime 57885161)
``````

and got no response for for about an hour, so I killed it. (I'm running this on DrRacket version 5.2.1 on a 2008 vintage iMac.) But the following form executed in under one second

``````(define p (prime 57885161))
``````

and the following expression executes in under five minutes

``````(string-length (number->string p))
17425170
``````

and these also execute in under five minutes each

``````(substring (number->string (prime 57885161)) 0 40)
"5818872662322464421751002121132323686363"

(with-output-to-file "/temp/prime.txt" (lambda () (printf "~a" (prime 57885161))))
``````

That last expression creates a 17,425,170 byte file with the first and last few digits matching what Wikipedia says is the correct value.

### Python

I tried something similar with Python. It took about three hours to complete, which was longer than I expected.

``````\$
\$
\$python --version
Python 2.7.1
\$
\$time python -c "print(len(str(2**57885161-1)))"
17425170

real 170m9.957s
user 169m10.043s
sys 0m6.367s
\$
\$python3.3 --version
Python 3.3.0
\$
\$time python3.3 -c "print(len(str(2**57885161-1)))"
17425170

real 161m43.538s
user 160m51.519s
sys 0m5.939s
\$
``````

### C++

And so I thought, how do I do this in C++? There is no built-in support in C++ for big numbers like there is in Scheme and Python. Boost offers a kind of wrapper around GMP - I haven't tried it. But I thought it's such a simple equation, how hard could it be to calculate it using only what the language itself offers?

My first idea was to represent a number as a vector of char where each char would represent one binary-coded decimal (BCD) digit. I'd need two simple functions, one to perform a multiply by two and one to perform a subtract one. Then I would start with 2 in my vector and call my multiply-by-two function 57,885,160 times and my subtract one function once. This has the advantage that the result will already be in decimal and the disadvantage that it would take hours, if not days, to run. Imagine, the multiply by two function would have to multiply and carry up to seventeen million digits each time it was called. And in total that function will be called fifty seven million times. Bonkers.

I needed a way to represent an arbitrary length binary number. One obvious way to do that is with a vector of some scalar value so that when you need to store more bits than will fit into one element of the vector you just push another onto the end and let the bits overflow into that.

``````#include <vector>
#include <climits>
#include <stdint.h>

// our arbitrary length unsigned number will be represented by a vector of
// "fragments", each fragment will be one of these
typedef uint32_t num_frag_t;
const int num_frag_t_size = sizeof(num_frag_t) * CHAR_BIT;

// arbitrary length unsigned number; least significant bits in lowest fragment
typedef std::vector<num_frag_t> num_vec_t;
``````

A number of the form 2n - 1 may be represented by an n-bit number where every bit is 1. For example, 24 - 1 = 15 decimal, or 1111 binary. So using our num_vec_t data structure a C++ function to calculate 2n - 1 is quite simple

``````// set given 'p' to the value of (2^n)-1 for given 'n'; requires n > 0
void make_prime(int n, num_vec_t & p)
{
const int len = (n + num_frag_t_size - 1) / num_frag_t_size;
const num_frag_t ones = ~static_cast<num_frag_t>(0);
p = num_vec_t(len, ones);
if (n % num_frag_t_size)
p[len - 1] = ones >> (num_frag_t_size - n % num_frag_t_size);
}
``````

(Yes, 'make_prime' is a misnomer because the function only returns a prime number for certain values of n, such as 57885161. But let's stick with this name.)

With this function I can generate the 57,885,161-bit biggest known prime number in about five milliseconds. My next problem is converting that big binary number to decimal.

From my schoolboy maths I know that if I divide a number by 10, the remainder will be the least-significant decimal digit of that number. For example, the decimal number 123 divided by 10 is 12 remainder 3. I can repeatedly extract the next least-significant decimal digit from my big binary number by dividing it by 10 over and over again until it reaches 0. The to_string() function in this program uses this idea

prime0.cpp:
 ```/* prime0.cpp - print number of decimal digits in biggest known prime number By Anthony C. Hay, 2013. See http://howtowriteaprogram.blogspot.com/ This is free and unencumbered public domain software; see http://unlicense.org/ I believe this code to be correct, but I may be wrong; use at your own risk. Compiled under OS X with g++ 4.2 g++ -O3 prime0.cpp -o prime0 Compiled under Windows with VC++ 2010 cl /nologo /EHs /O2 prime0.cpp /Feprime0.exe */ #include #include #include #include #include // our arbitrary length unsigned number will be represented by a vector of // "fragments", each fragment will be one of these typedef uint32_t num_frag_t; const int num_frag_t_size = sizeof(num_frag_t) * CHAR_BIT; // arbitrary length unsigned number; least significant bits in lowest fragment typedef std::vector num_vec_t; // return given 'num' as a decimal string std::string to_string(num_vec_t num) // pass-by-value because algorithm is destructive { std::vector result; // keep dividing num by 10 until it is 0; the remainder of each // division is the next least significant decimal digit in the result while (!num.empty()) { uint64_t remainder = 0; for (num_vec_t::size_type i = num.size(); i--; ) { remainder = (remainder << num_frag_t_size) + num[i]; num[i] = static_cast(remainder / 10); remainder %= 10; } // (C++ standard guarantees '0'..'9' are consecutive) result.push_back(static_cast(remainder) + '0'); // discard all 0-value number fragments from most significant end num_vec_t::size_type i = num.size(); while (i && num[i - 1] == 0) --i; num.resize(i); } // the least-significant decimal digit is first so return result reversed return std::string(result.rbegin(), result.rend()); } // set given 'p' to the value of (2^n)-1 for given 'n'; requires n > 0 void make_prime(int n, num_vec_t & p) { // a binary representation of (2^n)-1 is just an n-bit number with // every bit set; e.g. if n=4, (2^n)-1=15 decimal, or 1111 binary const int len = (n + num_frag_t_size - 1) / num_frag_t_size; const num_frag_t ones = ~static_cast(0); p = num_vec_t(len, ones); if (n % num_frag_t_size) { // the bits in p[len - 1] are already correct if n is a whole multiple // of num_frag_t_size (and anyway, ones>>num_frag_t_size is undefined in C++) p[len - 1] = ones >> (num_frag_t_size - n % num_frag_t_size); } } // return a decimal string representation of (2^n)-1 for the given 'n' std::string prime_str(int n) { num_vec_t p; make_prime(n, p); return to_string(p); } int main() { // output the number of decimal digits in (2^57,885,161)-1, i.e. 17,425,170 std::cout << prime_str(57885161).size() << '\n'; } ```

I think that's fairly straight forward: we can now generate the biggest prime and convert it to decimal. There's just one problem:

``````\$g++ -O3 prime0.cpp -o prime0
\$time ./prime0
17425170

real 1606m16.433s
user 1582m25.091s
sys 1m20.332s
\$
``````

It takes to_string() a day to convert the 57 million-bit number to decimal. Heh. Python's three hours isn't looking so bad now and Scheme's five minutes is looking miraculous!

For every one of the 17 million decimal digits in the number to_string() will on average have to run through the inner for loop about a million times. (To begin with the number is 57 million bits long, which means the vector will contain nearly 2 million 32-bit words. By the end of the algorithm the vector is empty. Hence the average of one million inner-loops per decimal digit.)

Just to underline this: that inner loop will execute 17,000,000,000,000 times before we have an answer. If my computer could execute one iteration of the the inner-loop in a nanosecond, which it can't, it would still take 17,000 seconds or five hours to complete. As we've seen, in reality it takes 26 hours. (The good news is that this means the processor in my machine can execute those three statements in that inner for loop in about 5 nanoseconds, which I think is pretty good.)

There is a simple way to improve the algorithm. Instead of collecting one decimal digit per outer-while-loop we could collect nine. Like this

``````// return given 'num' as a decimal string
std::string to_string(num_vec_t num) // pass-by-value because algorithm is destructive
{
std::vector<char> result;
// keep dividing num by 1,000,000,000 until it is 0; the remainder of each
// division is the next least significant 9 decimal digits in the result
while (!num.empty()) {
uint64_t remainder = 0;
for (num_vec_t::size_type i = num.size(); i--; ) {
remainder = (remainder << num_frag_t_size) + num[i];
num[i] = static_cast<num_frag_t>(remainder / 1000000000);
remainder %= 1000000000;
}

// discard all 0-value number fragments from most significant end
num_vec_t::size_type i = num.size();
while (i && num[i - 1] == 0)
--i;
num.resize(i);

// extract the next 9 digits from the remainder
for (int j = 0; j < 9; ++j) {
// (C++ standard guarantees '0'..'9' are consecutive)
result.push_back(static_cast<char>(remainder % 10) + '0');
remainder /= 10;
if (remainder == 0 && num.empty())
break; // don't append superfluous zeros; e.g. 000000012 => 12
}
}

// the least-significant decimal digit is first so return result reversed
return std::string(result.rbegin(), result.rend());
}
``````

prime1.cpp is identical to prime0.cpp except for the above change to to_string()

``````\$g++ -O3 prime1.cpp -o prime1
\$time ./prime1
17425170

real 242m48.302s
user 241m57.223s
sys 0m6.608s
\$
``````

This algorithm is only slightly more complicated than the previous version but it's almost ten times quicker. I can convert a one million-bit number to decimal in about five seconds. But for our 57 million-bit number the inner loop still executes about 2,000,000,000,000 times. As you can see, it still took over four hours.

How could our original Scheme program do this in under five minutes? They use a better algorithm.

A quick look at the Racket source shows that they use the GMP library. Presumably, when I call (number->string p) the mpn_get_str() function in gmp.c is eventually invoked. It seems to require some thousands of lines of C code to do the binary to decimal conversion when you include the other GMP code used in the process. Somehow I feel disappointed that there isn't a simple algorithm that won't take me days to understand.

### A better algorithm (for really big numbers)

It turns out it doesn't take days to understand the general idea: We've seen that converting one 57 million-bit number to decimal using the previous algorithm takes hours. But we also know that with the same algorithm we could convert 57 one million-bit numbers to decimal in 57 x 5 seconds, or about five minutes. So, can we split the one big number into several smaller numbers, convert them to decimal, then stitch all the results back into one? Yes we can.

Imagine it was going to take days to convert the binary representation of 123,456,789 to decimal, but would take seconds to convert 1,234 and 56,789 to decimal. We could split our number into parts by dividing it by 100,000

``````1. 123,456,789 / 100,000 -> 1,234 remainder 56,789
2. Convert 1,234 to a string -> "1234"
3. Convert 56,789 to a string -> "56789"
4. Combine strings "1234" + "56789" -> "123456789"
``````

We've added the cost of the division by 100,000, but we saved more time on the binary to decimal conversions, so we win overall. This is what we're going to do with our 57 million-bit prime.

There's just one fly in the ointment, as illustrated by this example

``````1. 123,000,789 / 100,000 -> 1,230 remainder 789
2. Convert 1,230 to a string -> "1230"
3. Convert 789 to a string -> "789"
4. Combine strings "1230" + "789" -> "1230789" WRONG!
``````

But this is easily overcome by requiring results to be zero-filled fixed-width: after dividing by 10n we will require the quotient and remainder to be at least n digits, filled with leading zeros if necessary. Then we'll remove any leading zeros from the final result.

``````1. 123,000,789 / 100,000 -> 1,230 remainder 789
2. Convert 1,230 to a string, fixed width -> "01230"
3. Convert 789 to a string, fixed width -> "00789"
4. Combine strings "01230" + "00789" -> "0123000789"
5. Remove leading zeros from "0123000789" -> "123000789"
``````

If you're wondering why the quotient needs to be fixed width, it's because we will be applying this algorithm repeatedly, so the quotient might be the high part of the remainder from an earlier division. We break the number into two, then we break each of these numbers into two, and so on until we have numbers small enough to be converted to decimal with our O(n2) algorithm from prime1.cpp.

Here is the complete C++ program that uses this algorithm to print the biggest known prime

prime2.cpp:
 ```/* prime2.cpp - print number of decimal digits in biggest known prime number By Anthony C. Hay, 2013. See http://howtowriteaprogram.blogspot.com/ This is free and unencumbered public domain software; see http://unlicense.org/ I believe this code to be correct, but I may be wrong; use at your own risk. mul() and div() are modified versions of algorithms published in Hacker's Delight: "You are free to use, copy, and distribute any of the code on this web site, whether modified by you or not." see http://hackersdelight.org/permissions.htm Compiled under OS X with g++ 4.2 g++ -O3 prime2.cpp -o prime2 Compiled under Windows with VC++ 2010 cl /nologo /EHs /O2 prime2.cpp /Feprime2.exe */ #include #include #include #include #include // our arbitrary length unsigned number will be represented by a vector of // "fragments", each fragment will be one of these typedef uint32_t num_frag_t; const int num_frag_t_size = sizeof(num_frag_t) * CHAR_BIT; // arbitrary length unsigned number; least significant bits in lowest fragment typedef std::vector num_vec_t; // "normalise" given 'num'; if num is normalised the following will be true // 1. num.size() > 0 // 2. if num is zero then num.size() == 1 && num[0] == 0 // 3. if num is non-zero then num.back() != 0 void normalise(num_vec_t & num) { num_vec_t::size_type i = num.size(); if (i == 0) { // assume an empty vector represents a zero value num.push_back(0); } else { // discard all 0-value number fragments from most significant end --i; while (i && num[i] == 0) --i; num.resize(i + 1); } } // return number of leading 0 bits in given 'x' int count_leading_zeros(num_frag_t x) { // this may not be the fastest method but has the advantage // of working correctly without modification regardless of // the number of bits in num_frag_t int n = num_frag_t_size; while (x) { x >>= 1; --n; } return n; } // return total number of bits in given 'num' int num_bits(const num_vec_t & num) { return num.empty() ? 0 : num.size() * num_frag_t_size - count_leading_zeros(num.back()); } // product <- a * b; a and b must be normalised void mul(num_vec_t & product, const num_vec_t & a, const num_vec_t & b) { // adapted from Hacker's Delight; originally from Knuth TAoCP V2 4.3.1 const unsigned a_len = a.size(); const unsigned b_len = b.size(); num_vec_t p(a_len + b_len, 0); for (unsigned ai = 0; ai < a_len; ++ai) { uint64_t k = 0; for (unsigned bi = 0; bi < b_len; ++bi) { const uint64_t t = static_cast(a[ai]) * static_cast(b[bi]) + static_cast(p[ai + bi]) + k; p[ai + bi] = static_cast(t); k = t >> num_frag_t_size; } p[ai + b_len] = static_cast(k); } normalise(p); product.swap(p); } inline num_vec_t & operator*=(num_vec_t & lhs, const num_vec_t & rhs) { mul(lhs, rhs, lhs); return lhs; } // quotient <- u / v; remainder <- u % v void div(num_vec_t & quotient, num_vec_t & remainder, const num_vec_t & u, const num_vec_t & v) { // adapted from Hacker's Delight; originally from Knuth TAoCP V2 4.3.1 const uint64_t base = static_cast(1) << num_frag_t_size; const int m = static_cast(u.size()); const int n = static_cast(v.size()); if (m < n || n <= 0 || v[n-1] == 0) throw std::runtime_error("div() unsupported input"); num_vec_t q(m - n + 1, 0); num_vec_t r(n, 0); if (n == 1) { // the single digit divisor special case uint64_t k = 0; for (int j = m - 1; j >= 0; --j) { q[j] = static_cast((k * base + u[j]) / v[0]); k = (k * base + u[j]) - q[j] * v[0]; } r[0] = static_cast(k); normalise(q); quotient.swap(q); normalise(r); remainder.swap(r); return; } // copy v to vn shifting it to the left so the most significant bit is 1 const int s = count_leading_zeros(v[n-1]); // 0 <= s <= 31. num_vec_t vn(n); // (Hacker's Delight allocates 2n) for (int i = n - 1; i > 0; --i) vn[i] = (v[i] << s) | (v[i-1] >> (num_frag_t_size - s)); vn[0] = v[0] << s; // copy u to un shifting it to the left by same amount as we did v above num_vec_t un(m + 1); // (Hacker's Delight allocates 2(m + 1)) un[m] = u[m-1] >> (num_frag_t_size - s); for (int i = m - 1; i > 0; i--) un[i] = (u[i] << s) | (u[i-1] >> (num_frag_t_size - s)); un[0] = u[0] << s; for (int j = m - n; j >= 0; --j) { // compute estimate qhat of q[j] uint64_t qhat = (static_cast(un[j+n])*base + static_cast(un[j+n-1])) / vn[n-1]; uint64_t rhat = (static_cast(un[j+n])*base + static_cast(un[j+n-1])) - qhat * vn[n-1]; while (qhat >= base || qhat*vn[n-2] > base*rhat + un[j+n-2]) { --qhat; rhat += vn[n-1]; if (rhat >= base) break; } // multiply and subtract int64_t k = 0; for (int i = 0; i < n; ++i) { uint64_t p = qhat * vn[i]; int64_t t = static_cast(un[i+j]) - k - (p & 0xFFFFFFFF); un[i+j] = static_cast(t); k = (p >> num_frag_t_size) - (t >> num_frag_t_size); } int64_t t = static_cast(un[j+n]) - k; un[j+n] = static_cast(t); q[j] = static_cast(qhat); if (t < 0) { // we subtracted too much - add it back q[j]--; k = 0; for (int i = 0; i < n; ++i) { t = static_cast(un[i+j]) + static_cast(vn[i]) + k; un[i+j] = static_cast(t); k = t >> num_frag_t_size; } un[j+n] = un[j+n] + static_cast(k); } } for (int i = 0; i < n; ++i) r[i] = (un[i] >> s) | (un[i+1] << (num_frag_t_size - s)); normalise(q); quotient.swap(q); normalise(r); remainder.swap(r); } // return decimal representation of given 'num' zero padded to 'width' std::string to_string_fixed_width( num_vec_t num, // pass-by-value because algorithm is destructive int width) // result will be at least this wide, zero filled if necessary { std::vector result; // keep dividing num by 1,000,000,000 until it is 0; the remainder of each // division is the next least significant 9 decimal digits in the result while (!num.empty()) { uint64_t remainder = 0; for (num_vec_t::size_type i = num.size(); i--; ) { remainder = (remainder << num_frag_t_size) + num[i]; num[i] = static_cast(remainder / 1000000000); remainder %= 1000000000; } // discard all zero-value 32-bit elements from most significant end num_vec_t::size_type i = num.size(); while (i && num[i - 1] == 0) --i; num.resize(i); // extract the next 9 digits from the remainder for (int j = 0; j < 9; ++j) { result.push_back(static_cast(remainder % 10) + '0'); remainder /= 10; if (remainder == 0 && num.empty()) break; // don't append superfluous zeros; e.g. 000000012 => 12 } } // add any necessary leading zeros to make the result the required width width -= static_cast(result.size()); while (width-- > 0) result.push_back('0'); // the least-significant decimal digit is first so return result reversed return std::string(result.rbegin(), result.rend()); } struct power_of_ten { num_vec_t n; // n = 10^power int power; // which power? e.g. if n = 1000 then power = 3 int bit_count; // number of bits in n; e.g. if n = 1000 then bit_count = 10 power_of_ten(const num_vec_t & n = num_vec_t(), int power = 0) : n(n), power(power), bit_count(num_bits(n)) { } }; typedef std::vector pot_vec_t; // return a list of values, each a power of ten, as required by to_string_helper() pot_vec_t powers_of_ten(int s) { // - the first value in the list will be 10^1024 // - each value will be the previous value squared // - the last value will be the first power that has at least s bits // - if s is less than the number of bits in 10^512 the list will be empty pot_vec_t result; num_vec_t p; p.push_back(10); int power(1); // p = 10^power while (num_bits(p) < s) { p *= p; // the next value is the previous value squared, power *= 2; // which means the power has doubled if (power > 1000) result.push_back(power_of_ten(p, power)); } return result; } // return given 'num' as a decimal string std::string to_string_helper( const num_vec_t & num, pot_vec_t::const_reverse_iterator p, const pot_vec_t::const_reverse_iterator p_end, int width) { if (p == p_end || num_bits(num) <= p->bit_count) { // don't need to break number into smaller pieces; it's small // enough to use our O(n^2) algorithm directly return to_string_fixed_width(num, width); } else { // num is too big to convert directly; break it up first std::vector parts(num_bits(num) / p->bit_count); int count = 0; num_vec_t quotient; // first repeatedly divide by p->n until quotient is no longer than p->n div(quotient, parts[count++], num, p->n); while (num_bits(quotient) > p->bit_count) div(quotient, parts[count++], quotient, p->n); // then do same again for each part, collecting results into one string std::string result(to_string_helper(quotient, p + 1, p_end, p->power)); while (count) result += to_string_helper(parts[--count], p + 1, p_end, p->power); return result; } } // return given 'num' as a decimal string std::string to_string(const num_vec_t & num) { // construct a table of powers of 10 appropriate to the size of the given // number and use it convert the number to decimal const pot_vec_t powers(powers_of_ten(num_bits(num) / 2)); std::string result(to_string_helper(num, powers.rbegin(), powers.rend(), 0)); // remove all leading zeros from the result and return it, // or return "0" if the result was nothing but zeros result.erase(0, result.find_first_not_of('0')); return result.empty() ? std::string("0") : result; } // set given 'p' to the value of (2^n)-1 for given 'n'; requires n > 0 void make_prime(int n, num_vec_t & p) { // a binary representation of (2^n)-1 is just an n-bit number with // every bit set; e.g. if n=4, (2^n)-1=15 decimal, or 1111 binary const int len = (n + num_frag_t_size - 1) / num_frag_t_size; const num_frag_t ones = ~static_cast(0); p = num_vec_t(len, ones); if (n % num_frag_t_size) { // the bits in p[len - 1] are already correct if n is a whole multiple // of num_frag_t_size (also, ones>>num_frag_t_size is undefined in C++) p[len - 1] = ones >> (num_frag_t_size - n % num_frag_t_size); } } // return a decimal string representation of (2^n)-1 for the given 'n' std::string prime_str(int n) { num_vec_t p; make_prime(n, p); return to_string(p); } int main() { // output the number of decimal digits in (2^57,885,161)-1, i.e. 17,425,170 std::cout << prime_str(57885161).size() << '\n'; } ```

``````\$g++ -O3 prime2.cpp -o prime2
\$time ./prime2
17425170

real 107m10.509s
user 106m18.801s
sys 0m7.216s
\$
``````

Well, that's an improvement, but not as dramatic as I hoped for. I think I'm using the same algorithm as MPIR, but it still takes hours when I hoped for minutes. I'm going to try using the MPIR library functions for multiply and divide instead of my Knuth derived ones. This turns out to be quite easy to do because my representation of a multi-precision binary number as a vector of "fragments" maps quite nicely onto MPIR's contiguous "limbs". I need only be careful to match the size of my fragments to MPIR's limbs: my Windows MPIR library uses 32-bit limbs and my OS X MPIR library uses 64-bit limbs. Bearing this in mind I modify mul() and div() in prime2.cpp to call the MPIR functions

```// our arbitrary length unsigned number will be represented by a vector of // "fragments", each fragment will be one of these typedef mp_limb_t num_frag_t; #define NUM_FRAG_T_SIZE GMP_LIMB_BITS const int num_frag_t_size = NUM_FRAG_T_SIZE; // arbitrary length unsigned number; least significant bits in lowest fragment typedef std::vector<num_frag_t> num_vec_t; // product <- u * v; u and v must be normalised void mul(num_vec_t & product, const num_vec_t & u, const num_vec_t & v) { // use mpn_mul() to do the multiplication; this is straight-forward // because if we ensure our num_frag_t is the same size as GMP's // mp_limb_t then our binary representation will match GMP's // (what we called a fragment is what GMP calls a limb) const mp_size_t m = u.size(); const mp_size_t n = v.size(); std::vector<mp_limb_t> p(m + n, 0); if (m >= n) mpn_mul(&p[0], &u[0], m, &v[0], n); else mpn_mul(&p[0], &v[0], n, &u[0], m); normalise(p); product.swap(p); } // quotient <- u / v; remainder <- u % v void div(num_vec_t & quotient, num_vec_t & remainder, const num_vec_t & u, const num_vec_t & v) { // use mpn_tdiv_qr() to do the division; this is straight-forward // because if we ensure our num_frag_t is the same size as GMP's // mp_limb_t then our binary representation will match GMP's // (what we called a fragment is what GMP calls a limb) const mp_size_t m = u.size(); const mp_size_t n = v.size(); if (m < n || n <= 0 || v[n-1] == 0) throw std::runtime_error("div() unsupported input"); std::vector<mp_limb_t> q(m - n + 1, 0); std::vector<mp_limb_t> r(n, 0); mpn_tdiv_qr(&q[0], &r[0], 0, &u[0], m, &v[0], n); normalise(q); quotient.swap(q); normalise(r); remainder.swap(r); } // return decimal representation of given 'num' zero padded to 'width' std::string to_string_fixed_width( num_vec_t num, // pass-by-value because algorithm is destructive int width) // result will be at least this wide, zero filled if necessary { std::vector<char> result; // keep dividing num by 1,000,000,000 until it is 0; the remainder of each // division is the next least significant 9 decimal digits in the result while (!num.empty()) { uint64_t remainder = 0; for (num_vec_t::size_type i = num.size(); i--; ) { #if NUM_FRAG_T_SIZE == 64 // each fragment of num is 64-bits: divide high and low 32-bits in two stages remainder = (remainder << 32) + (num[i] >> 32); uint64_t hi = remainder / 1000000000; remainder %= 1000000000; remainder = (remainder << 32) + (num[i] & 0xFFFFFFFF); num[i] = (hi << 32) + (remainder / 1000000000); remainder %= 1000000000; #else // each fragment of num is 32-bits: need only one 64-bit by 32-bit division remainder = (remainder << num_frag_t_size) + num[i]; num[i] = static_cast<num_frag_t>(remainder / 1000000000); remainder %= 1000000000; #endif } // discard all zero-value 32-bit elements from most significant end num_vec_t::size_type i = num.size(); while (i && num[i - 1] == 0) --i; num.resize(i); // extract the next 9 digits from the remainder for (int j = 0; j < 9; ++j) { result.push_back(static_cast<char>(remainder % 10) + '0'); remainder /= 10; if (remainder == 0 && num.empty()) break; // don't append superfluous zeros; e.g. 000000012 => 12 } } // add any necessary leading zeros to make the result the required width width -= static_cast<int>(result.size()); while (width-- > 0) result.push_back('0'); // the least-significant decimal digit is first so return result reversed return std::string(result.rbegin(), result.rend()); } ``` The rest of prime3.cpp is the same as prime2.cpp.
``````\$g++ -O3 prime3.cpp -l mpir -o prime3
\$time ./prime3
17425170

real 3m37.341s
user 3m35.766s
sys 0m0.829s
\$
``````

Wow! That is the dramatic improvement I was looking for. Switching from the multiply and divide algorithms given in The Art of Computer Programming v2 4.3.1 to those used by the MPIR library made the code run 30 times faster.

What if I use the MPIR built-in binary-to-decimal function, mpn_get_str(), directly?

prime4.cpp:
 ```/* prime4.cpp - print number of decimal digits in biggest known prime number By Anthony C. Hay, 2013. See http://howtowriteaprogram.blogspot.com/ This is free and unencumbered public domain software; see http://unlicense.org/ I believe this code to be correct, but I may be wrong; use at your own risk. Compiled under OS X with g++ 4.2 g++ -O3 prime4.cpp -lmpir -o prime4 Compiled under Windows with VC++ 2010 cl /nologo /EHs /O2 /I C:\mpir-2.6.0\lib\Win32\Release prime4.cpp /Feprime4.exe /link C:\mpir-2.6.0\lib\Win32\Release\mpir.lib */ #include #include #include #include #include #include // our arbitrary length unsigned number will be represented by a vector of // "fragments", each fragment will be one of these typedef mp_limb_t num_frag_t; #define NUM_FRAG_T_SIZE GMP_LIMB_BITS const int num_frag_t_size = NUM_FRAG_T_SIZE; // arbitrary length unsigned number; least significant bits in lowest fragment typedef std::vector num_vec_t; // return given 'num' as a decimal string std::string to_string(const num_vec_t & num) { // use mpn_get_str() to do the binary to decimal conversion; this is // straight-forward because if we ensure our num_frag_t is the same // size as GMP's mp_limb_t then our binary representation will match // GMP's (what we called a fragment is what GMP calls a limb) num_vec_t n(num); // make a copy because mpn_get_str() destroys its argument n.push_back(0); // mpn_get_str() needs one extra high limb // allocate a buffer big enough to hold the whole decimal string // (log(2)/log(10) = 0.30102999566398119521373889472449) const size_t buf_len = static_cast(num.size() * GMP_LIMB_BITS * 0.30103) + 3; std::vector buf(buf_len); size_t len = mpn_get_str(&buf[0], 10, &n[0], num.size()); // skip any leading zeros unsigned char * start = &buf[0]; while (len && *start == 0) { --len; ++start; } // convert 0-9 -> '0'-'9' for (size_t i = 0; i < len; ++i) start[i] += '0'; // (C++ standard guarantees '0'..'9' are consecutive) return len ? std::string(start, start + len) : std::string("0"); } // set given 'p' to the value of (2^n)-1 for given 'n'; requires n > 0 void make_prime(int n, num_vec_t & p) { // a binary representation of (2^n)-1 is just an n-bit number with // every bit set; e.g. if n=4, (2^n)-1=15 decimal, or 1111 binary const int len = (n + num_frag_t_size - 1) / num_frag_t_size; const num_frag_t ones = ~static_cast(0); p = num_vec_t(len, ones); if (n % num_frag_t_size) { // the bits in p[len - 1] are already correct if n is a whole multiple // of num_frag_t_size (also, ones>>num_frag_t_size is undefined in C++) p[len - 1] = ones >> (num_frag_t_size - n % num_frag_t_size); } } // return a decimal string representation of (2^n)-1 for the given 'n' std::string prime_str(int n) { num_vec_t p; make_prime(n, p); return to_string(p); } int main() { // output the number of decimal digits in (2^57,885,161)-1, i.e. 17,425,170 std::cout << prime_str(57885161).size() << '\n'; } ```

``````\$sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM)2 Duo CPU     E8235  @ 2.80GHz
\$
\$g++
i686-apple-darwin11-llvm-g++-4.2: no input files
\$
\$
\$g++ -O3 prime4.cpp -lmpir -o prime4
\$time ./prime4
17425170

real 2m25.515s
user 2m24.323s
sys 0m0.521s
\$
``````

Well, that took 33% less time than prime3.cpp. But they're in the same ball park.

### Conclusion

Here is a summary of execution times of all the above experiments

``````language    name         line  execution  digits/  comments
count   time (s)   second

C++/MPIR    prime4.cpp    102        144  121,000  conversion using MPIR library code only
C++/MPIR    prime3.cpp    297        215   81,000  same as prime2.cpp, but using MPIR for mul & div
Scheme/GMP                  1        260   67,000
C++         prime2.cpp    354      6,378    2,700  my best effort in vanilla C++
Python 3.3                  1      9,651    1,800
Python 2.7                  1     10,150    1,700
C++         prime1.cpp    100     14,517    1,200  nine digits at a time
C++         prime0.cpp     93     94,945      184  one digit at a time
``````

First, much respect to the mathematicians and programmers who developed and implemented the algorithms in GMP/MPIR.

I would like to have coded something that got a lot closer to the speed of the GMP/MPIR library code, but that would have taken a huge effort and I have other ideas bubbling up. It's time to move on.

How hard is binary to decimal conversion? It depends. For a few hundred bits the simple algorithm in prime1.cpp works well enough. Beyond that either use MPIR/GMP or be prepared to invest a lot of effort. (I hope someone will tell me about a simple, fast algorithm to do it.)

Finally, I don't know why Comeau's on-line compiler isn't there any more, but I missed it.

index of blog posts

1. i use a much revised mapm c library,
which does fft multiply in base 10000

multiply speed not as good as gmp, but
building decimal string is VERY FAST

rpn 0x1p57885161 1- ?i3 k > prime

finish in under 10 seconds ! (with comma added for clarity !)

email me if interested with my revised mapm library
albertmcchan@yahoo.com

2. sorry about my prior comment
your title was binary to decimal

i think my solution might be cheating ...
it does not use binary ...

base 100
-> fft multiply in base 10000
-> base 100
-> decimal string

3. This comment has been removed by the author.

4. on the same laptop, gmp did it in 26.7 seconds

5. Interesting, thanks for this, came across your post looking for something similar, fyi, your code (prime4.cpp) ran in 4 seconds with lib gmp version 6.1.2,

MacBook Pro (Retina, 13-inch, Early 2015), 2,7 GHz Intel Core i5, 8 GB 1867 MHz DDR3

1. That's a remarkable improvement: from 144 seconds to 4 seconds. Thanks for commenting.

6. Greetings, Anthony. Improvements were made to GMP 6.2 for AMD Ryzen processors. GMP 6.1.2 and 6.2.1 ran in 4.292 and 3.472 seconds respectively.

Parallel get_str https://github.com/marioroy/binary-to-decimal

g++ -O3 -DTEST6 -fopenmp prime_test.cpp -I/opt/gmp621/include -L/opt/gmp621/lib -l gmp -o prime6_test -Wno-attributes

time LD_LIBRARY_PATH=/opt/gmp621/lib ./prime6_test

7. The forked repo https://github.com/marioroy/binary-to-decimal was removed some time ago. I added examples fac_test.c, prime5.cpp, prime6.cpp, and prime_test.cpp to https://github.com/marioroy/Chudnovsky-Pi/tree/master/src/extra. The GMP/MPIR routines mpf_get_str, mpf_out_str, mpz_get_str, and mpz_out_str consume up to 8 threads max via parallel recursion.