Robert holds a Ph.D in solid-state physics from Ohio University and is currently a researcher at MIT, Lincoln Laboratory. He can be reached at 28 Buckmaster Dr., Concord, MA 01742.
Division is not an easy operation for computers to perform. Those of us old enough to remember 8-bit microprocessors such as the Motorola 6800 and the Intel 8080 (or those of us still using them for embedded systems or other similar applications) can recall writing division subroutines which were slow and consumed scarce memory resources. Modern 16- and 32-bit processors have divide instructions, but these instructions are still quite slow when compared to the "primitive" arithmetic instructions such as add, subtract, and shift. Table 1, for example, lists the number of clock cycles required by the Intel 80286 and Motorola 68020 processors for some of the basic arithmetic operations.
Table 1: Timings for arithmetic instructions
Instruction 80286 68020 ---------------------------------------- Add, Subtract 2 0-3 Shift n 5+n 1-4 Multiply (unsigned 16-bit) 21 21-28 Multiply (unsigned 32-bit) --- 41-44 Divide (unsigned 16-bit) 22 42-47 Notes:
- All times are in processor clock cycles.
- Motorola time range for best-cache-worst memory case.
- Intel add, subtract, and shift for 16-bit values. Motorola add, subtract, and shift for 32-bit values.
- All instructions assume operands in registers.
- Shifts assume a constant shift-count.
This article describes a method of decomposing a division by a constant divisor into just such a simple sequence of additions, subtractions, and shifts. For simplicity, we will assume that the numerator is an unsigned value and that the divisor is an unsigned constant. (Extending the algorithm to handle signed values is straightforward -- as mathematicians say "the proof is left for the reader.") The numerator and denominator (divisor) are assumed to be 16-bit values here, although the algorithm can be extended to handle longer values. We will assume that the computer has the capability to do double-precision (32-bit) adds, subtracts, and shifts. The instruction sequence assumes the existence of two registers in which the calculations will be done.
How Does a Mathematician Boil Water?
There is an old joke about mathematicians that describes one technique they frequently use to solve problems. The problem in this case is "how to boil water."
Problem 1: The mathematician finds himself in a kitchen. There is a kettle of water on the counter next to the sink.
Solution: He carries the kettle to the stove and heats it to boiling. (QED)
Problem 2: The kettle of water is already on the stove.
Solution: The mathematician takes the kettle and places it on the counter next to the sink. Now, proceed as in the solution to the first problem. (QED)
The technique of converting the existing problem into one which has already been solved is a basic tool in mathematics. In this article, the technique will be used to convert the problem of solving the quotient function Q(x)=x/y(where"/" indicates integer division, x and y are positive, and y is a nonzero constant) into the new form Q'(x)=(ax+b)/z(where a,b, and z are derived from y). We have replaced a division by a multiplication, an addition, and a different division. At first, it appears that we have made the problem worse. However, note that if we could make z a power-of-two, then the new division could be done with a shift instruction. The addition is no problem, so we have traded a division by a constant divisor for a multiplication by a constant multiplier. This is a problem we already know how to solve! In the March 1987 issue of DDJ # 125 (pages 34-37), I described a technique for unrolling a multiplication by a constant multiplier into a "star-chain" sequence of adds, subtracts, and shifts. Now, if we can generate appropriate values for a, b, and z, we have solved the problem of generating a sequence of adds, subtracts, and shifts for division.
The math for calculating the constants a, b and z is not very complicated. Because z must be a power-of-two, we will choose the maximum 16-bit value 65,536 for our example. We define a by the formula a = z/y which can be viewed as a sort of scaled reciprocal of the constant divisor y. We define the remainder of the integer division r as r = z%y which can be viewed as the "fractional part" of the scaled reciprocal. Then we can define b as b = a+r-1 and the transformation is complete. We have converted the problem Q(x) into the equivalent problem Q'(x) which we already know how to express in adds, subtracts, and shifts. We can precalculate the required constants and generate the complete instruction sequence at "compile time." (Note that the adds, subtracts, and shifts in the algorithm must be done in double-precision registers [32-bits] to avoid overflows in the calculation steps.) The final resulting quotient ends up in the upper 16-bits of the register (before the 16-bit shift-division by z) and the low-order 16-bits of the register contain an approximation to the remainder of the division.
Close Enough for Government Work
There is only one problem with the Q'(x) algorithm for division by a constant divisor -- it doesn't always get the right answer! Sometimes the accumulated round-off error in the calculation causes the quotient to be off by one when the numerator gets sufficiently large. This may be good enough for some applications which can limit the numerator range, or where an approximate quotient is acceptable.
There is a way to improve the chances of getting the right answer from the Q'(x) algorithm. If the divisor is even, keep shifting the numerator and denominator right (equivalent to divisions by two) until the denominator becomes odd, then apply the Q'(x) algorithm. By reducing the denominator, it becomes more likely that the round-off error will not cause problems for 16-bit (less than z) numerators. This improvement is employed in the results and code shown later in this article.
Table 2 lists the first 300 divisors for which the Q'(x) algorithm is accurate for all 16-bit numerators (after applying the shift-right operation described in the preceding paragraph). The table was generated by brute-force calculation. Q'(x) can be off by one for divisors not in the table. A quick check program can be written to determine if Q'(x) is accurate for a given divisor (not in the table) over the desired numerator range. For example, if the numerator range is limited to a maximum of 32,767 (15 bits, the maximum signed 16-bit value), the additional divisors in Table 3calculate accurate results when using the Q'(x) algorithm.
Table 2: First 300 divisors for which Q'(x) is accurate
2 3 4 5 6 8 10 12 14 15 16 17 20 24 28 30 32 34 40 48 51 52 56 60 62 64 68 72 80 85 96 102 104 112 120 124 128 136 144 152 160 170 172 176 192 204 208 216 224 240 248 255 256 257 272 284 288 302 304 320 336 340 344 352 368 384 400 408 416 432 434 448 480 496 508 510 512 514 516 544 560 568 576 592 604 608 624 640 648 672 680 688 704 720 736 768 771 800 816 832 864 868 896 928 960 992 1008 1016 1020 1024 1028 1032 1040 1056 1072 1088 1120 1136 1152 1184 1208 1216 1232 1248 1280 1285 1296 1312 1344 1360 1376 1408 1440 1456 1472 1504 1524 1536 1542 1568 1600 1632 1664 1680 1696 1728 1736 1760 1792 1856 1872 1920 1952 1984 2016 2032 2040 2048 2056 2064 2080 2112 2114 2144 2176 2240 2272 2304 2368 2416 2432 2464 2496 2560 2570 2576 2592 2608 2624 2688 2720 2752 2784 8216 2848 2880 2896 2912 2944 3008 3048 3072 3084 3120 3136 3200 3216 3264 3296 3328 3360 3392 3456 3472 3488 3520 3584 3648 3692 3712 3744 3776 3840 3855 3904 3968 4032 4048 4064 4080 4096 4112 4128 4144 4160 4224 4228 4288 4352 4368 4369 4416 4480 4544 4608 4672 4736 4800 4832 4864 4928 4992 5040 5056 5088 5120 5140 5152 5184 5216 5248 5280 5312 5376 5440 5504 5568 5632 5696 5728 5760 5792 5824 5856 5888 5952 6016 6096 6112 6144 6168 6208 6240 6272 6400 6432 6472 6512 6528 6592 6656 6720 6784 6808 6848 6912 6944 6976 7040 7104 7168 7280 7296 7384 7424 7488 7552 7680 7710 7744 Note: all divisors > 32767 are accurate
Table 3: Accurate Q'(x) divisors (numerator<32768)
7 26 31 36 76 86 88 108 142 151 168 184 200 217 254 258 280 296 312 324 360 464 504 520 528 536 616 656 728 752 762 784 848 880 936 976 1057 1288 1304 1392 1424 1448 1560 1608 1648 1744 1824 1846 1888 2024 2072 2184 2336 2520 2528 2544 2656 2864 2928 2976 3056 3104 3236 3256 3404 3424 3553 3640 3872 3912 4000 4016 4176 4192 4320 4384 4448 4576 4680 4681 4896 4944 5024 5344 5472 5488 5576 5600 5664 5920 5984 6080 6336 6368 6392 6464 6552 6898 7008 7084 7200 7232 7360 7456 7616 7648
Note that a factoring approach can be used to deal with some divisors that are not in the table. For example, the divisor 25 is not in the table, but 5 is. Hence, applying Q'(x) for 5 twice will result in an accurate division by 25.
Just the Facts, Ma'am
Listing One contains a C program which generates an instruction sequence for the Q'(x) division algorithm. The code is very similar to the multiplication algorithm code shown in my previous Dr. Dobb's Journal article. The main program of my March 1987 article (page 35) has been made into the subroutine binmul(). Note that the code to load the machine working-register Rw has been moved to the new main program. This is the only change that has been made to the star-chain multiplication code. The program assumes that the numerator will be in machine register R1 and it returns the quotient in machine register Rw. Note that unlike the multiplication algorithm instruction sequence, the instruction sequence generated by the Q'(x) algorithm will alter the contents of register R1 if the divisor is not a power-of-two.
The main program of Listing One uses the function trim_trailing( ) to count the number of low-order zero bits in the divisor (equivalent to the number of factors-of-two in the divisor which will be shifted out). Code is generated for the initial right-shift scaling. If the divisor was a power-of-two, this right-shift does the entire job; otherwise the Q'(x) algorithm is implemented. Note that the final right-shift by 16 bits (the division by z) might be done with a 68000-family SWAP instruction or by returning only the upper half of Rw (if the computer has only 16-bit registers, such as the Intel 80286).
As an example, consider a division by 102 (one of the accurate values from Table 2). The Q'(x) instruction sequence generated by the program in Listing One is shown in Example 1 (the comments were added for additional clarity).
Example 1: Division by 102
R1 >>= 1 scale 102 down to 51 (odd) Rw = R1 mult. by a = (65536/51) = 1285 Rw <<= 2 Rw += R1 Rw <<= 6 Rw += R1 Rw <<= 2 Rw += R1 Rw += 1285 add b = (a + r - 1) = 1285 Rw >>= 16 divide by z = 65536
For a 68020, the worst-case timing for this sequence would be about 35 clocks -- compared to 47 clocks for a divide instruction. If the code was in cache, the timing for the sequence would improve to about 28 clocks against 46 clocks for the divide instruction. In the "best" timing case for the 68020, the sequence could require as few as 10 clocks against 42 clocks for the divide instruction. For computers which must implement division as a subroutine, the improvement entailed in using the Q'(x) algorithm can be greater still.
_OPTIMIZING INTEGER DIVISION BY A CONSTANT DIVISOR_ by Robert Grappel
[LISTING ONE]
<a name="0079_000c"> #include stdio.h /* Program to generate "star-chain-sequence" for division of an unsigned 16-bit integer numerator by an unsigned 16-bit integer constant denominator. It assumes the existence of two 32-bit integer "registers", add, subtract, and shift instructions. It uses the "star-chain" multiplication routine from DDJ #125, March 1987 page 35 */ static unsigned int mult; /* global variable for trim_trailing() & binmul() */ /* Support subroutine to trim trailing 0s or 1s from global variable "mult". */ int trim_trailing(one_zero) int one_zero; { /* if one_zero == 0, trim trailing zeros in "mult", return "count" == 1, ones */ int c; /* bit counter */ for (c = 0; ((mult & 1) == one_zero); c++, mult >>= 1) ; return c; } /* Slightly modified version of multiplication routine */ binmul(m) long m; { int last_shift, /* final shift count */ last_cnt, /* count of low-order zeros */ stkptr, /* pointer to "stack[]" */ cnt, /* bit counter */ ts, /* top-of-stack element */ flag, /* flag for special-case */ stack[16]; /* stack for shift-add/subs */ mult = m; stkptr = last_cnt = 0; /* init. stack ptr. and count */ last_shift = trim_trailing(0); /* trim trailing 0s */ while (1) { /* loop to decompose "mult", building stack */ cnt = trim_trailing(1); /* count low-order 1s */ if (cnt > 1) { /* more than 1 bit, shift-subtract */ flag = 0; if (last_cnt == 1) /* shift "k",sub,shift 1,add --> shift "k+1", sub */ stack[stkptr-1] = -(cnt+1); /* overwrite */ else stack[stkptr++] = -cnt; } else flag = 1; /* need another shift-add */ if (mult == 0) break; /* decomp. "mult", now output */ last_cnt = trim_trailing(0) + flag; /* low-order 0s */ stack[stkptr++] = last_cnt; /* shift-add */ } while (stkptr > 0) { /* output code from stack */ ts = stack[--stkptr]; /* get top stack element */ if (ts < 0) { /* generate shift-subtract instructions */ printf("\nRw <<= %d",-ts); printf("\nRw -= R1"); } else { /* generate shift-add instructions */ printf("\nRw <<= %d",ts); printf("\nRw += R1"); } } if (last_shift != 0) printf("\nRw <<= %d",last_shift); } main() { /* generate pseudo-instructions for star-chain division */ int a,b,r, /* computed multiplier, addend, and remainder */ i2, /* number of bits to scale divisor */ denom, /* intended divisor */ denom2; /* divisor scaled by powers-of-2 */ int z = 65536; /* 2^16 */ printf("\nEnter positive integer denominator: "); scanf("%d",&denom); if (denom != 0) { /* scale denominator by powers-of-2 */ mult = denom; i2 = trim_trailing(0); /* how many powers-of-2? */ denom2 = mult; if (denom2 == 1) { /* divisor was power-of-2, simply scale it */ printf("\nRw = R1"); if (i2 > 0) printf("\nRw >>= %d",i2); } else { /* divisor not power-of-2, scale and more */ if (i2 > 0) printf("\nR1 >>= %d",i2); /* handle scaling */ printf("\nRw = R1"); /* load work register */ a = z / denom2; /* scaled reciprocal */ r = z % denom2; /* remainder of recip. */ b = a + r - 1; binmul(a); printf("\nRw += %d",b); printf("\nRw >>= 16"); } } else printf("\nCannot divide by zero\n"); /* special case */ } Example 1: Division by 102 R1 >>= 1 scale 102 down to 51 (odd) Rw = R1 mult. by a = (65536/51) = 1285 Rw <<= 2 Rw += R1 Rw <<= 6 Rw += R1 Rw <<= 2 Rw += R1 Rw += 1285 add b = (a + r - 1) = 1285 Rw >>= 16 divide by z = 65536
Copyright © 1991, Dr. Dobb's Journal