Parallel

Optimizing Integer Division By a Constant Divisor

By Robert D. Grappel, February 01, 1991

Speed up slow processors by decomposing a division by a constant divisor into a simple sequences of additions, subtractions and shifts.

FEB91: OPTIMIZING INTEGER DIVISION BY A CONSTANT DIVISOR

Robert holds a Ph.D in solid-state physics from Ohio University and is currently a researcher at MIT, Lincoln Laboratory. He can be reached at 28 Buckmaster Dr., Concord, MA 01742.

Division is not an easy operation for computers to perform. Those of us old enough to remember 8-bit microprocessors such as the Motorola 6800 and the Intel 8080 (or those of us still using them for embedded systems or other similar applications) can recall writing division subroutines which were slow and consumed scarce memory resources. Modern 16- and 32-bit processors have divide instructions, but these instructions are still quite slow when compared to the "primitive" arithmetic instructions such as add, subtract, and shift. Table 1, for example, lists the number of clock cycles required by the Intel 80286 and Motorola 68020 processors for some of the basic arithmetic operations.

Table 1: Timings for arithmetic instructions

  Instruction                 80286  68020
  ----------------------------------------

  Add, Subtract                2     0-3
  Shift n                      5+n   1-4
  Multiply (unsigned 16-bit)   21    21-28
  Multiply (unsigned 32-bit)   ---   41-44
  Divide (unsigned 16-bit)     22    42-47

  Notes:

All times are in processor clock cycles.
Motorola time range for best-cache-worst memory case.
Intel add, subtract, and shift for 16-bit values. Motorola add, subtract, and shift for 32-bit values.
All instructions assume operands in registers.
Shifts assume a constant shift-count.

Note that the division instruction is many times slower than addition, subtraction, or shifting. This suggests that it might be useful to find a method to convert a division operation into a short sequence of adds, subtracts, and shifts. This instruction sequence might prove faster that the computer's divide instruction. For those computers which lack a divide instruction, the ability to express a division as a sequence of instructions which do exist in the instruction set can provide a substantial speed improvement over a subroutine call. The sequence might even prove to save memory, given the overhead of parameter passing and the subroutine code itself.

This article describes a method of decomposing a division by a constant divisor into just such a simple sequence of additions, subtractions, and shifts. For simplicity, we will assume that the numerator is an unsigned value and that the divisor is an unsigned constant. (Extending the algorithm to handle signed values is straightforward -- as mathematicians say "the proof is left for the reader.") The numerator and denominator (divisor) are assumed to be 16-bit values here, although the algorithm can be extended to handle longer values. We will assume that the computer has the capability to do double-precision (32-bit) adds, subtracts, and shifts. The instruction sequence assumes the existence of two registers in which the calculations will be done.

How Does a Mathematician Boil Water?

There is an old joke about mathematicians that describes one technique they frequently use to solve problems. The problem in this case is "how to boil water."

Problem 1: The mathematician finds himself in a kitchen. There is a kettle of water on the counter next to the sink.

Solution: He carries the kettle to the stove and heats it to boiling. (QED)

Problem 2: The kettle of water is already on the stove.

Solution: The mathematician takes the kettle and places it on the counter next to the sink. Now, proceed as in the solution to the first problem. (QED)

The technique of converting the existing problem into one which has already been solved is a basic tool in mathematics. In this article, the technique will be used to convert the problem of solving the quotient function Q(x)=x/y(where"/" indicates integer division, x and y are positive, and y is a nonzero constant) into the new form Q'(x)=(ax+b)/z(where a,b, and z are derived from y). We have replaced a division by a multiplication, an addition, and a different division. At first, it appears that we have made the problem worse. However, note that if we could make z a power-of-two, then the new division could be done with a shift instruction. The addition is no problem, so we have traded a division by a constant divisor for a multiplication by a constant multiplier. This is a problem we already know how to solve! In the March 1987 issue of DDJ # 125 (pages 34-37), I described a technique for unrolling a multiplication by a constant multiplier into a "star-chain" sequence of adds, subtracts, and shifts. Now, if we can generate appropriate values for a, b, and z, we have solved the problem of generating a sequence of adds, subtracts, and shifts for division.

The math for calculating the constants a, b and z is not very complicated. Because z must be a power-of-two, we will choose the maximum 16-bit value 65,536 for our example. We define a by the formula a = z/y which can be viewed as a sort of scaled reciprocal of the constant divisor y. We define the remainder of the integer division r as r = z%y which can be viewed as the "fractional part" of the scaled reciprocal. Then we can define b as b = a+r-1 and the transformation is complete. We have converted the problem Q(x) into the equivalent problem Q'(x) which we already know how to express in adds, subtracts, and shifts. We can precalculate the required constants and generate the complete instruction sequence at "compile time." (Note that the adds, subtracts, and shifts in the algorithm must be done in double-precision registers [32-bits] to avoid overflows in the calculation steps.) The final resulting quotient ends up in the upper 16-bits of the register (before the 16-bit shift-division by z) and the low-order 16-bits of the register contain an approximation to the remainder of the division.

Close Enough for Government Work

There is only one problem with the Q'(x) algorithm for division by a constant divisor -- it doesn't always get the right answer! Sometimes the accumulated round-off error in the calculation causes the quotient to be off by one when the numerator gets sufficiently large. This may be good enough for some applications which can limit the numerator range, or where an approximate quotient is acceptable.

There is a way to improve the chances of getting the right answer from the Q'(x) algorithm. If the divisor is even, keep shifting the numerator and denominator right (equivalent to divisions by two) until the denominator becomes odd, then apply the Q'(x) algorithm. By reducing the denominator, it becomes more likely that the round-off error will not cause problems for 16-bit (less than z) numerators. This improvement is employed in the results and code shown later in this article.

Table 2 lists the first 300 divisors for which the Q'(x) algorithm is accurate for all 16-bit numerators (after applying the shift-right operation described in the preceding paragraph). The table was generated by brute-force calculation. Q'(x) can be off by one for divisors not in the table. A quick check program can be written to determine if Q'(x) is accurate for a given divisor (not in the table) over the desired numerator range. For example, if the numerator range is limited to a maximum of 32,767 (15 bits, the maximum signed 16-bit value), the additional divisors in Table 3calculate accurate results when using the Q'(x) algorithm.

Table 2: First 300 divisors for which Q'(x) is accurate

       2     3     4     5     6     8    10    12    14    15
      16    17    20    24    28    30    32    34    40    48
      51    52    56    60    62    64    68    72    80    85
      96   102   104   112   120   124   128   136   144   152
     160   170   172   176   192   204   208   216   224   240
     248   255   256   257   272   284   288   302   304   320
     336   340   344   352   368   384   400   408   416   432
     434   448   480   496   508   510   512   514   516   544
     560   568   576   592   604   608   624   640   648   672
     680   688   704   720   736   768   771   800   816   832
     864   868   896   928   960   992  1008  1016  1020  1024
    1028  1032  1040  1056  1072  1088  1120  1136  1152  1184
    1208  1216  1232  1248  1280  1285  1296  1312  1344  1360
    1376  1408  1440  1456  1472  1504  1524  1536  1542  1568
    1600  1632  1664  1680  1696  1728  1736  1760  1792  1856
    1872  1920  1952  1984  2016  2032  2040  2048  2056  2064
    2080  2112  2114  2144  2176  2240  2272  2304  2368  2416
    2432  2464  2496  2560  2570  2576  2592  2608  2624  2688
    2720  2752  2784  8216  2848  2880  2896  2912  2944  3008
    3048  3072  3084  3120  3136  3200  3216  3264  3296  3328
    3360  3392  3456  3472  3488  3520  3584  3648  3692  3712
    3744  3776  3840  3855  3904  3968  4032  4048  4064  4080
    4096  4112  4128  4144  4160  4224  4228  4288  4352  4368
    4369  4416  4480  4544  4608  4672  4736  4800  4832  4864
    4928  4992  5040  5056  5088  5120  5140  5152  5184  5216
    5248  5280  5312  5376  5440  5504  5568  5632  5696  5728
    5760  5792  5824  5856  5888  5952  6016  6096  6112  6144
    6168  6208  6240  6272  6400  6432  6472  6512  6528  6592
    6656  6720  6784  6808  6848  6912  6944  6976  7040  7104
    7168  7280  7296  7384  7424  7488  7552  7680  7710  7744

  Note: all divisors > 32767 are accurate

Table 3: Accurate Q'(x) divisors (numerator<32768)

     7    26    31    36    76    86    88   108   142   151
   168   184   200   217   254   258   280   296   312   324
   360   464   504   520   528   536   616   656   728   752
   762   784   848   880   936   976  1057  1288  1304  1392
  1424  1448  1560  1608  1648  1744  1824  1846  1888  2024
  2072  2184  2336  2520  2528  2544  2656  2864  2928  2976
  3056  3104  3236  3256  3404  3424  3553  3640  3872  3912
  4000  4016  4176  4192  4320  4384  4448  4576  4680  4681
  4896  4944  5024  5344  5472  5488  5576  5600  5664  5920
  5984  6080  6336  6368  6392  6464  6552  6898  7008  7084
  7200  7232  7360  7456  7616  7648

Note that a factoring approach can be used to deal with some divisors that are not in the table. For example, the divisor 25 is not in the table, but 5 is. Hence, applying Q'(x) for 5 twice will result in an accurate division by 25.

Just the Facts, Ma'am

Listing One contains a C program which generates an instruction sequence for the Q'(x) division algorithm. The code is very similar to the multiplication algorithm code shown in my previous Dr. Dobb's Journal article. The main program of my March 1987 article (page 35) has been made into the subroutine binmul(). Note that the code to load the machine working-register Rw has been moved to the new main program. This is the only change that has been made to the star-chain multiplication code. The program assumes that the numerator will be in machine register R1 and it returns the quotient in machine register Rw. Note that unlike the multiplication algorithm instruction sequence, the instruction sequence generated by the Q'(x) algorithm will alter the contents of register R1 if the divisor is not a power-of-two.

The main program of Listing One uses the function trim_trailing( ) to count the number of low-order zero bits in the divisor (equivalent to the number of factors-of-two in the divisor which will be shifted out). Code is generated for the initial right-shift scaling. If the divisor was a power-of-two, this right-shift does the entire job; otherwise the Q'(x) algorithm is implemented. Note that the final right-shift by 16 bits (the division by z) might be done with a 68000-family SWAP instruction or by returning only the upper half of Rw (if the computer has only 16-bit registers, such as the Intel 80286).

As an example, consider a division by 102 (one of the accurate values from Table 2). The Q'(x) instruction sequence generated by the program in Listing One is shown in Example 1 (the comments were added for additional clarity).

Example 1: Division by 102

  R1 >>= 1    scale 102 down to 51 (odd)
  Rw = R1       mult. by a = (65536/51) = 1285
  Rw <<= 2
  Rw += R1
  Rw <<= 6
  Rw += R1
  Rw <<= 2
  Rw += R1
  Rw += 1285    add b = (a + r - 1) = 1285
  Rw >>= 16   divide by z = 65536

For a 68020, the worst-case timing for this sequence would be about 35 clocks -- compared to 47 clocks for a divide instruction. If the code was in cache, the timing for the sequence would improve to about 28 clocks against 46 clocks for the divide instruction. In the "best" timing case for the 68020, the sequence could require as few as 10 clocks against 42 clocks for the divide instruction. For computers which must implement division as a subroutine, the improvement entailed in using the Q'(x) algorithm can be greater still.


_OPTIMIZING INTEGER DIVISION BY A CONSTANT DIVISOR_
by Robert Grappel

[LISTING ONE]

<a name="0079_000c">

#include stdio.h

/* Program to generate "star-chain-sequence" for division of an unsigned
16-bit integer numerator by an unsigned 16-bit integer constant denominator.
It assumes the existence of two 32-bit integer "registers", add, subtract, and
shift instructions. It uses the "star-chain" multiplication routine from DDJ
#125, March 1987 page 35 */

static unsigned int mult; /* global variable for trim_trailing() & binmul() */

/* Support subroutine to trim trailing 0s or 1s from global variable "mult". */
int trim_trailing(one_zero) int one_zero;
{ /* if one_zero   == 0, trim trailing zeros in "mult", return "count"
         == 1, ones                                       */
   int c; /* bit counter */
   for (c = 0; ((mult & 1) == one_zero); c++, mult >>= 1) ;
   return c;
}

/* Slightly modified version of multiplication routine */
binmul(m) long m;
{
   int   last_shift,   /* final shift count */
      last_cnt,   /* count of low-order zeros */
      stkptr,    /* pointer to "stack[]" */
      cnt,      /* bit counter */
      ts,      /* top-of-stack element */
      flag,      /* flag for special-case */
      stack[16];   /* stack for shift-add/subs */
   mult = m;
   stkptr = last_cnt = 0;   /* init. stack ptr. and count */
   last_shift = trim_trailing(0);   /* trim trailing 0s */
   while (1)
   { /* loop to decompose "mult", building stack */
      cnt = trim_trailing(1); /* count low-order 1s */
      if (cnt > 1)
      { /* more than 1 bit, shift-subtract */
         flag = 0;
         if (last_cnt == 1)
    /* shift "k",sub,shift 1,add --> shift "k+1", sub */
            stack[stkptr-1] = -(cnt+1); /* overwrite */
         else
            stack[stkptr++] = -cnt;
      }
      else
         flag = 1; /* need another shift-add */
      if (mult == 0) break; /* decomp. "mult", now output */
      last_cnt = trim_trailing(0) + flag; /* low-order 0s */
      stack[stkptr++] = last_cnt; /* shift-add */
   }
   while (stkptr > 0)
   { /* output code from stack */
      ts = stack[--stkptr]; /* get top stack element */
      if (ts < 0)
      { /* generate shift-subtract instructions */
         printf("\nRw <<= %d",-ts);
         printf("\nRw -= R1");
      }
      else
      { /* generate shift-add instructions */
         printf("\nRw <<= %d",ts);
         printf("\nRw += R1");
      }
   }
   if (last_shift != 0) printf("\nRw <<= %d",last_shift);
}

main()
{ /* generate pseudo-instructions for star-chain division */
   int   a,b,r, /* computed multiplier, addend, and remainder */
      i2,   /* number of bits to scale divisor */
      denom,   /* intended divisor */
      denom2; /* divisor scaled by powers-of-2 */
   int z = 65536;   /* 2^16 */
   printf("\nEnter positive integer denominator: ");
   scanf("%d",&denom);
   if (denom != 0)
   { /* scale denominator by powers-of-2 */
      mult = denom;
      i2 = trim_trailing(0); /* how many powers-of-2? */
      denom2 = mult;
      if (denom2 == 1)
      { /* divisor was power-of-2, simply scale it */
         printf("\nRw = R1");
         if (i2 > 0) printf("\nRw >>= %d",i2);
      }
      else
      { /* divisor not power-of-2, scale and more */
         if (i2 > 0)
             printf("\nR1 >>= %d",i2); /* handle scaling */
         printf("\nRw = R1");   /* load work register */
         a = z / denom2;    /* scaled reciprocal */
          r = z % denom2;   /* remainder of recip. */
         b = a + r - 1;
         binmul(a);
         printf("\nRw += %d",b);
         printf("\nRw >>= 16");
      }
   }
   else
      printf("\nCannot divide by zero\n"); /* special case */
}



Example 1: Division by 102

R1 >>= 1      scale 102 down to 51 (odd)
Rw = R1            mult. by a = (65536/51) = 1285
Rw <<= 2
Rw += R1
Rw <<= 6
Rw += R1
Rw <<= 2
Rw += R1
Rw += 1285           add b = (a + r - 1) = 1285
Rw >>= 16      divide by z = 65536

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

Optimizing Integer Division By a Constant Divisor

Table 1: Timings for arithmetic instructions

How Does a Mathematician Boil Water?

Close Enough for Government Work

Table 2: First 300 divisors for which Q'(x) is accurate

Table 3: Accurate Q'(x) divisors (numerator<32768)

Just the Facts, Ma'am

Example 1: Division by 102

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

Optimizing Integer Division By a Constant Divisor

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content