Session 31: Multiplying

Grade-school multiplying
Divide-and-conquer multiplication
    Algorithm
    Theoretical analysis
    Implementation tricks
Experimental comparison

We saw, through Mergesort, how recursion can lead to good algorithms for sorting, using a technique called divide and conquer. Today I want to look at a completely different problem to which the same technique applies: multiplying numbers. We're going to learn a new, faster way to multiply.

In particular, we're interested in algorithms for multiplying together very large numbers - like numbers with hundreds of digits. We're interested in the case where both numbers have n digits, and n is very large.

Grade-school multiplying

You probably learned to multiply multi-digit numbers in grade school. This technique didn't involve any recursion.

     2412
   x 3231
---------
     2412
    7236
   4824
+ 7236
---------
  7793172

Analyzing this algorithm using big-O analysis proceeds in two steps, just like the algorithm. First, we look at the amount of time it takes to write the n numbers below the line. Each number takes O(n) time to compute, and there are n numbers, for a total of O(n^2) time for this first phase.

The second phase is the addition phase. For this phase, we go through each of the 2n columns, totalling up the at most n digits within the column (plus an additional number obtained from the carry of the previous column). Each column takes O(n) time, and so the overall time for this phase is O(n^2).

The total time for both phases of the grade-school algorithm, therefore, is O(n^2) + O(n^2) = O(n^2).
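In code, the grade-school algorithm might look like the following Java sketch. (This is just an illustration of mine, not anything from class: it represents numbers as arrays of decimal digits, least significant digit first, and it interleaves the two phases, adding each row into a running total as the row is produced, rather than writing all the rows first.)

public class GradeSchool {
    // Multiply two nonnegative numbers held as decimal digit arrays,
    // least significant digit first.
    static int[] multiply(int[] a, int[] b) {
        int[] result = new int[a.length + b.length];  // the product has at most 2n digits
        for (int i = 0; i < a.length; i++) {          // one row of the tableau per digit of a
            int carry = 0;
            for (int j = 0; j < b.length; j++) {      // O(n) work per row, so O(n^2) overall
                int sum = result[i + j] + a[i] * b[j] + carry;
                result[i + j] = sum % 10;
                carry = sum / 10;
            }
            result[i + b.length] += carry;
        }
        return result;
    }

    public static void main(String[] args) {
        // 2412 x 3231, with the digits of each number reversed
        int[] p = multiply(new int[]{2, 1, 4, 2}, new int[]{1, 3, 2, 3});
        for (int i = p.length - 1; i >= 0; i--) System.out.print(p[i]);
        System.out.println();                         // prints 07793172 (note the leading zero)
    }
}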

Divide-and-conquer multiplication

There is a faster way to multiply, though, called the divide-and-conquer approach. The introduction of the technique is attributed to a 1962 paper by Karatsuba, and indeed it is sometimes called Karatsuba multiplication.

Algorithm

With divide-and-conquer multiplication, we split each of the numbers into two halves, each with n/2 digits. I'll call the two numbers we're trying to multiply a and b, with the two halves of a being aL (the left or upper half) and aR (the right or lower half) and the two halves of b being bL and bR.

Basically, we can multiply these two numbers as follows.

             aL        aR
         x   bL        bR
         -----------------
            aLbR      aRbR
+ aLbL      aRbL
--------------------------
  aLbL (aLbR + aRbL)  aRbR
That image is just a picture of the idea, but more formally, the derivation works as follows.
ab = (aL·10^(n/2) + aR) (bL·10^(n/2) + bR)
   = aL bL 10^n + aL bR 10^(n/2) + aR bL 10^(n/2) + aR bR
   = aL bL 10^n + (aL bR + aR bL) 10^(n/2) + aR bR
Thus, in order to multiply a pair of n-digit numbers, we can recursively multiply four pairs of n/2-digit numbers. The rest of the operations involved are all O(n) operations. (Multiplying by 10^n may look like a multiplication (and hence not O(n)), but really it's just a matter of appending n zeroes onto the number, which takes O(n) time.)
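For instance, with the numbers from the grade-school example above (a = 2412, b = 3231, n = 4, so aL = 24, aR = 12, bL = 32, bR = 31):

ab = (24·32) 10^4 + (24·31 + 12·32) 10^2 + (12·31)
   = 768·10^4 + (744 + 384)·10^2 + 372
   = 7,680,000 + 112,800 + 372 = 7,793,172

which matches the answer the grade-school method gave us.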

That's fine as far as it goes, but it turns out that it's not far enough: The running time satisfies the recurrence T(n) = 4·T(n/2) + O(n), and if you write that down and solve it, it turns out to solve to O(n^2), which is exactly what we had from grade school. And this algorithm is much more complicated. Not a very encouraging result.

But there turns out to be a very clever approach, which permits us to reduce the number of n/2-digit multiplications from four to three! This clever idea yields a better result.

The idea works as follows: We're trying to compute

aL bL 10^n + (aL bR + aR bL) 10^(n/2) + aR bR
What we'll do is compute the following three products using recursive calls.
x1 = aL bL
x2 = aR bR
x3 = (aL + aR) (bL + bR)
These have all the information that we want, since the following is true.
x1·10^n + (x3 - x1 - x2)·10^(n/2) + x2
   = aL bL 10^n + ((aL bL + aL bR + aR bL + aR bR) - aL bL - aR bR) 10^(n/2) + aR bR
   = aL bL 10^n + (aL bR + aR bL) 10^(n/2) + aR bR
And we have already seen that this last expression is equal to the product of a and b.

In pseudocode, I might write down the algorithm this way. (This is just pseudocode, even though it may look like Java. But the Java would be quite a bit more complex, without really giving you much of an idea.)

BigInteger multiply(BigInteger a, BigInteger b) {
    int n = max(number of digits in a, number of digits in b)
    if(n == 1) {
        return a.intValue() * b.intValue();
    } else {
        BigInteger aR = bottom n/2 digits of a;
        BigInteger aL = top remaining digits of a;
        BigInteger bR = bottom n/2 digits of b;
        BigInteger bL = top remaining digits of b;
        BigInteger x1 = multiply(aL, bL);
        BigInteger x2 = multiply(aR, bR);
        BigInteger x3 = multiply(aL + aR, bL + bR);  // (aL + aR)(bL + bR), matching the derivation
        return x1 * pow(10, 2 * (n / 2)) + (x3 - x1 - x2) * pow(10, n / 2) + x2;  // 2*(n/2), not n, in case n is odd
    }
}
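If you want something that actually runs, here is one way the pseudocode might be fleshed out in Java using java.math.BigInteger. (This is a sketch of my own, not a tuned implementation: the class and helper names are invented, it assumes nonnegative inputs, and it splits on decimal digits just as the pseudocode does. Note that the recombination multiplies by 10^(2·(n/2)) rather than 10^n, so it stays correct when n is odd.)

import java.math.BigInteger;

public class Karatsuba {
    // Number of decimal digits in a nonnegative BigInteger.
    static int digits(BigInteger x) {
        return x.toString().length();
    }

    public static BigInteger multiply(BigInteger a, BigInteger b) {
        int n = Math.max(digits(a), digits(b));
        if (n == 1) {
            return a.multiply(b);                        // base case: single-digit operands
        }
        BigInteger tenHalf = BigInteger.TEN.pow(n / 2);  // 10^(n/2)
        BigInteger[] aSplit = a.divideAndRemainder(tenHalf);  // {aL, aR}
        BigInteger[] bSplit = b.divideAndRemainder(tenHalf);  // {bL, bR}
        BigInteger x1 = multiply(aSplit[0], bSplit[0]);       // aL * bL
        BigInteger x2 = multiply(aSplit[1], bSplit[1]);       // aR * bR
        BigInteger x3 = multiply(aSplit[0].add(aSplit[1]),    // (aL + aR) * (bL + bR)
                                 bSplit[0].add(bSplit[1]));
        // ab = x1 * 10^(2*(n/2)) + (x3 - x1 - x2) * 10^(n/2) + x2
        return x1.multiply(tenHalf).multiply(tenHalf)
                 .add(x3.subtract(x1).subtract(x2).multiply(tenHalf))
                 .add(x2);
    }

    public static void main(String[] args) {
        System.out.println(multiply(BigInteger.valueOf(2412),
                                    BigInteger.valueOf(3231)));  // prints 7793172
    }
}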

Let's do an actual multiplication to illustrate how this works. I'm going to draw a recursion tree, labeling the edges with the final values computed by each node of the tree.
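To give the flavor of what happens at the root of that tree, with a = 2412 and b = 3231:

aL = 24, aR = 12, bL = 32, bR = 31
x1 = 24·32 = 768
x2 = 12·31 = 372
x3 = (24 + 12)(32 + 31) = 36·63 = 2268

ab = 768·10^4 + (2268 - 768 - 372)·10^2 + 372
   = 7,680,000 + 112,800 + 372 = 7,793,172

Only three multiplications occur at this level; each of the three would in turn be computed by a recursive call.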

It's probably not too convincing that this is any better than the grade-school method. It's really quite complicated. But for very large numbers, the grade-school algorithm slows down much faster than this one does. I don't recommend divide-and-conquer for typically sized numbers, though.

Theoretical analysis

We're going to do the same form of analysis we did for Mergesort: We'll draw a recursion tree, labeling each node with the amount of time consumed outside the recursive calls, and then we'll find the sum of all the nodes of the recursion tree to get the total time.

All of the calculations of divide-and-conquer multiplication take O(n) time, except the three recursive multiplications. The three multiplications are of n/2-digit numbers. (Actually, the last multiplication could be of (n/2+1)-digit numbers, but this extra digit turns out not to affect the result of the analysis. Accounting for it would complicate the analysis considerably, and so we'll ignore it.)

So here's the recursion tree. I'll label each node with the number of digits in each number of the pair being multiplied.

                                                      n
                n/2                                  n/2                                  n/2
    n/4         n/4         n/4          n/4         n/4         n/4          n/4         n/4         n/4
n/8 n/8 n/8 n/8 n/8 n/8 n/8 n/8 n/8  n/8 n/8 n/8 n/8 n/8 n/8 n/8 n/8 n/8  n/8 n/8 n/8 n/8 n/8 n/8 n/8 n/8 n/8
 :   :   :   :   :   :   :   :   :    :   :   :   :   :   :   :   :   :    :   :   :   :   :   :   :   :   : 
 1   1   1   1   1   1   1   1   1    1   1   1   1   1   1   1   1   1    1   1   1   1   1   1   1   1   1 
As with Mergesort, there are 1 + log2 n levels.

The times for the first level sum to n; the times for the second sum to (3/2)n; for the third, (9/4)n; for the fourth, (27/8)n; and so on, each level going up by a factor of 3/2. The last level will sum to (3/2)^(log2 n)·n.

We're going to simplify the total over all levels.

n + (3/2)·n + (3/2)^2·n + (3/2)^3·n + ... + (3/2)^(log2 n)·n

  = n·(1 + (3/2) + (3/2)^2 + (3/2)^3 + ... + (3/2)^(log2 n))
This last is a finite geometric series, of the form 1 + r + r^2 + ... + r^k. The following is a well-known mathematical fact (you may have learned it in calculus):
1 + r + r^2 + r^3 + ... + r^k = (r^(k+1) - 1) / (r - 1)
We're going to apply this fact now, with r = 3/2 and k = log2 n, to further simplify our expression for the total across the levels:
n·((3/2)^(1 + log2 n) - 1) / (3/2 - 1)
  = 2n·((3/2)·(3/2)^(log2 n) - 1)
  = 3n·(3/2)^(log2 n) - 2n
  = 3n·3^(log2 n) / 2^(log2 n) - 2n
  = 3n·3^(log2 n) / n - 2n        (since 2^(log2 n) = n)
  = 3·3^(log2 n) - 2n
  = 3·n^(log2 3) - 2n = O(n^1.585)
This final answer is quite a bit better than the O(n^2) result we had from the multiplication technique we learned in grade school.
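To get a feel for the gap: at n = 1000, n^2 is 1,000,000, while n^1.585 is only about 57,000.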

(The last step, where we replaced 3^(log2 n) with n^(log2 3), comes from the identity x^(logb y) = y^(logb x), which can be proven with the following reasoning.

(logb x)(logb y) = (logb y)(logb x)

logb (y^(logb x)) = logb (x^(logb y))

y^(logb x) = x^(logb y)

The first step follows from the identity m·logb n = logb (n^m).)
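(As a quick numeric check of the identity, take b = 2, x = 4, y = 8: x^(log2 y) = 4^3 = 64, and y^(log2 x) = 8^2 = 64.)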

(Incidentally, divide-and-conquer multiplication isn't the fastest known multiplication algorithm. The best known result is an O(n log n) algorithm. It's quite a bit more complicated, and I'm not sure if it's practical for reasonably sized numbers.)

Implementation tricks

It happens that I've written one of the faster divide-and-conquer multipliers out there. Making an implementation run fast is a bit of a trick. There are three tricks that my implementation uses.

Experimental comparison


[Graph of results (note the logarithmic axes)]

In January 1999, I implemented the program in C++ and timed it on a Sun SPARCstation 4. (It was an antiquated computer even at that time. A later test on a 700MHz Pentium II did about 50 times better.) The results below are for pairs of random n-digit numbers. (Yes, digits. I implemented it in decimal.) All times are in milliseconds.
# digits    divide-and-conquer    grade school
       8              0.059923        0.063902
      16              0.106360        0.121773
      32              0.278862        0.414594
      64              0.798085        1.481481
     128              2.325581        5.780347
     256              6.944444       22.727273
     512             21.276596       88.333333
    1024             63.750000      370.000000
    2048            195.000000     1650.000000
What we see is that divide-and-conquer multiplication, properly implemented, beats grade-school multiplication even for 16-digit numbers. It is significantly better at 32 digits, and of course after that it just blows grade-school away.
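If you'd like to try such a comparison yourself, a minimal harness along these lines would do. (It assumes the Karatsuba sketch from earlier; the names are hypothetical, the absolute times certainly won't match my 1999 numbers, and on a modern JVM you'd want warm-up iterations before trusting the measurements.)

import java.math.BigInteger;
import java.util.Random;

public class TimingDemo {
    // Build a random n-digit decimal number (nonzero leading digit).
    static BigInteger randomDigits(int n, Random rng) {
        StringBuilder s = new StringBuilder();
        s.append(1 + rng.nextInt(9));
        for (int i = 1; i < n; i++) s.append(rng.nextInt(10));
        return new BigInteger(s.toString());
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int n = 8; n <= 2048; n *= 2) {
            BigInteger a = randomDigits(n, rng);
            BigInteger b = randomDigits(n, rng);
            int reps = 100;
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) Karatsuba.multiply(a, b);
            double ms = (System.nanoTime() - start) / (double) reps / 1e6;
            System.out.printf("%5d digits: %10.6f ms%n", n, ms);
        }
    }
}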

Of course, this raises the question: Why would one want to multiply 100-digit numbers with exact precision? One response is cryptographic applications: Some protocols (including RSA) involve many multiplications of keys with hundreds of digits. (Another application is breaking mathematical records (largest prime, whatever), but it's not clear how practical this is.)