ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87
Floating point from the inside out
This article contains the following executables: ASMX87.ARC
Nicholas Wilt
Nicholas is a student at Dartmouth College, Hanover, New Hampshire. His interests include computer graphics, C++, and assembler programming. He can be reached through the editorial offices at DDJ.
Conventional wisdom holds that programming numeric coprocessors in assembler is not worthwhile. After all, so the story goes, the 80x87 architecture is so esoteric and complicated, and compilers do such a good job utilizing them, that it's not worth the effort to program them in assembler.
Nothing could be further from the truth. The 80x87 coprocessor has a quirky architecture, but no more so than its integer-based siblings. And we shall see that coprocessor code is just as susceptible to optimization as integer-based code, if not more so. This article is for assembler programmers not familiar with the 80x87 math coprocessors. It highlights features of the 80x87 most useful to the applications programmer, rather than attempting an exhaustive treatise.
80x87 Registers
The 80x87 contains eight 80-bit floating-point registers. This 80-bit "temporary real" format, intended for internal use by the 80x87, is the only numeric format the chip can directly manipulate. It can load and store integers and standard IEEE floating-point numbers, but they must be converted to and from the temporary real format.
The integer and floating-point formats supported by the 80x87 are listed in Table 1, along with their size specifiers. The integer formats are the same as those supported by the 80x86, except that the 80x87 can directly load and manipulate quadwords (64-bit integers) as well as words and doublewords.
Table 1: Integer and floating-point formats supported by the 80x87
Format Operand Load type with ----------------------------------- 32-bit float dword ptr FLD 64-bit double qword ptr FLD 80-bit long double tbyte ptr FLD 16-bit integer word ptr FILD 32-bit integer dword ptr FILD 64-bit integer qword ptr FILD
The temporary real format has a 64-bit mantissa, so 64-bit integers can be represented without loss of precision.
The eight FPU registers are organized as a stack. Numbers can be loaded into and stored from the top of the stack only. The top register on the stack is called ST or ST(0). Registers are numbered by increasing depth from ST; the first register from the top is ST(1), the second is ST(2), and so on. This stack-oriented architecture is the strangest aspect of programming the 80x87; floating-point units for other architectures are organized more like integer units, with constant register names.
On the 80x87, every time you load something into the chip, all the register names for the chip's contents change -- what used to be the stack top is now ST(1), ST(1) becomes ST(2), and so on down the line. The comments in 80x87 assembler code almost always concentrate on keeping the stack contents straight.
Like the integer processor, the 80x87 has special registers that contain flags and control bits. The CW (Control Word) of the 80x87 has bits that control such things as how to round numbers during conversions. The SW (Status Word) has bits that describe the state of the machine (such as the result of a comparison). These words can be accessed through the FLDCW (load control word), FSTCW (store control word), and FSTSW (store status word) instructions. Fortunately, we don't often have to worry about the control or status words. The default values in the control word are reasonable for most applications (after all, they are the same for us as they are for the compiler), and the status word is useful to applications programmers mostly for doing floating-point comparisons.
80x87 Instructions
The most important instructions on the 80x87 are the load and store instructions. FLD pushes an operand onto the stack; the loaded operand becomes ST. FILD, the integer version of FLD, pushes integer operands onto the stack.
Executing FLD on a 32-bit (dword-sized) operand converts a 32-bit, floating-point number to temporary real and pushes it onto the register stack. Any effective address can be used for the operand -- for example, to load a 32-bit local variable, you might write: FLD dword ptr [bp-4], Load local, while to load a 64-bit value pointed to by DS:SI, you would write: FLD qword ptr [si]; Load next value in array. To load a 16-bit integer, you use the FILD instruction: FILD word ptr [bp+6]; Load first parameter.
Besides FLD and FILD, there are a number of special FLD instructions to load special constants onto the stack. These instructions don't take any operands; they just push a value onto the stack. FLDZ, for example, pushes 0.0 onto the stack. Table 2 has a complete list of these special load instructions. These instructions are useful not only because they are faster than loading constants from memory, but also because they guarantee full 80-bit precision for the constants loaded.
Table 2: Special FLD instructions to load special constants onto the stack
Instruction Loads ---------------------------- fldlg2 Log 2 (base 10) fldln2 Log 2 (base e) fldl2e Log e (base 2) fldl2t Log 10 (base 2) fldpi Pi (3.14159...) fldz Zero (0.0) fld1 One (1.0)
As with loading operands, storing operands can only be done from ST, the stack top. The FST and FIST instructions store floating-point and integer formats. To store the stack top in the form of a 64-bit, double-precision number, write something such as: FST qword ptr [bp-8]; Store temp.
Often, you want to store the stack top and pop the stack in one operation. This is when you won't be needing the contents of the stack top after storing them. Operations that pop the stack after completion have a P appended onto the instruction name. To store the stack top and then pop it off the register stack, you use the FSTP and FISTP instructions in place of FST and FIST.
The instruction mnemonics for the 80x87 were chosen deliberately to help understand what the instruction does. For instance, instructions for the coprocessor always begin with the letter F; this distinguishes them as 80x87 instructions, because no integer instructions begin with F. Instructions that begin with FI perform integer versions of the operation. The 80x87 needs separate instructions to perform many operations on integers because the operand size (for example, 32 bits or dword) doesn't always imply the operand type. A 32-bit number can be a float or a long. So when you write FLD dword ptr [bp+6], this loads a 32-bit floating-point number onto the stack. To load a 32-bit integer onto the stack, you write: FILD dword ptr [bp+6].
The two instructions both load an operand onto the 80x87 stack, but they use different conversions to arrive at the 80-bit internal representation.
Another convention is that instructions with a P appended to them pop the stack after the operation. This is useful when used in conjunction with binary operations such as addition, as well as with FSTP.
80x87 Arithmetic Instructions
There are several classes of arithmetic instruction on the 80x87. The simplest class, unary instructions, perform an operation on a single operand. Absolute value, negation, and square root all require only one operand. Unary operations are predefined to operate on the stack top. For example, FSQRT replaces ST with its square root. Operations are listed in Table 3.
Table 3: 80x87 unary instructions: For fsin, fcos, and fsincos ST must have a magnitude of less than 2{63}.
Instruction Operation it performs ------------------------------------------- f2xm1 Computes 2{st-1} fabs Takes absolute value of ST. fchs Negates ST. frndint Rounds ST to integer. fsqrt Takes square root of ST. ftst Compares ST against 0 and sets FPU flags in status word. 387/j486 only fcos Replaces ST with its cosine. fsin Replaces ST with its sine. fsincos Takes sine and cosine of ST, leaves cosine in ST, sine in ST(1).
Unary operations only operate on ST, so it is often useful to swap registers around inside the 80x87. The FXCH instruction exchanges the stack top with the register operand specified. For example:
FXCH ; Exchange ST and ST(1) FXCH ST(2); Exchange ST and ST(2)
Table 4 lists the two-operand operations available on the 80x87. These are the most general, useful operations -- add, subtract, divide, and multiply. Because they are used so often, Intel made them very powerful; there are many variations of each instruction.
Table 4: Two-operand operations available on the 80x87
Instruction Operation it performs --------------------------------------- fadd Add fcom Compare fdiv Compute destination/source fdivr Compute source/destination fmul Multiply fsub Compute destination-source fsubr Compute source-destination
All two-operand instructions can be performed between registers on the FPU. These must involve the stack top, however. You can write: FADD ST(1), ST; ST(1)+=ST, but not FADD ST(2), ST(3); ST(2)+=ST(3). You can append a P to pop the stack after performing the operation, as follows: FADDP ST(3), ST; Add ST to ST(3), then pop.
In fact, one form of this operation is so common that it is implicit. If you write simply: FMUL; Multiply ST(1) by ST, pop, it is the same as: FMULP ST(1), ST; Multiply ST(1) by ST, pop.
You can also use memory operands with these instructions. As with register-register operations, they must involve the stack top. Memory operands are more restrictive, however -- the stack top is always the destination. Also, only 32-bit and 64-bit memory operands are supported. You use a memory operand as follows: FADDdword ptr es:[di]. This adds the 32-bit, floating- point number located at ES:DI to ST. (Note: Using FADDP with a memory operand doesn't make much sense -- you would add a number to the stack top, then pop the result just computed!)
These memory-source instructions have 16- and 64-bit integer forms, as well. If the number at ES:DI is a 32-bit integer, not a 32-bit floating-point number, you can write: FIADD dword dtr es:[di].
FSUB and FDIV
I should say a few words here about FSUB and FDIV, the results of which depend on the operand's order. By default, the source operand is subtracted from (or divided into) the destination operand. Thus: FSUB ST(1), ST; ST(1) = ST(1)-ST subtracts ST from ST(1). What if we want the destination to be subtracted from the source? Then we append an R (for "reverse operands") to the instruction mnemonic. While FSUB qword ptr [bp-8]; ST=ST-[bp-8] subtracts the 64-bit floating-point number at [BP-8] from the stack top, to swap the operands you write FSUBR qword ptr [bp-8]; ST=[bp-8]-ST. Instructions that have reversed operations and also pop the stack end in RP (for example, FDIVRP and FSUBRP).
Comparisons
The 80x87 lets you compare floating-point numbers in a roundabout fashion. The FCOM instruction compares ST to a given operand. For example, FCOM ST(2) compares ST to ST(2) and sets the condition code bits to reflect the results of the comparison.
FCOM has several forms. The memory form of the instruction compares ST to a 32- or 64-bit memory operand. FCOMP compares ST to the source operand and then pops ST off the register stack. FICOM takes a 16- or 32-bit integer memory operand. FCOM with no operands is the same as FCOMP ST(1). Finally, FCOMPP compares ST to ST(1) and pops both off the register stack.
You examine the comparison results by storing the status word to a memory operand with FSTSW, then loading it into AX and performing an SAHF instruction. Intel designed the 80x87 so that the condition code bits in the 80x87 map directly to the carry and 0 bits in the 80x86. (Some other bits from the status word get mapped to flags as well, but they are not relevant to comparisons.) Then, because the comparison results are in the carry and 0 flags, the unsigned conditional jump instructions can be used in mnemonic fashion: JA jumps if ST is greater than the source operand, JB jumps if it is less then the source operand, and JE jumps if it is equal to the source operand. solve_quadratic() has an example of a floating-point comparison to avoid taking the square root of a negative number. Here is an example of how to jump if ST>ST(1):
FCOM ST(1) ; Compare ST to ST(1) FSTSW [bp-2] ; Store SW to local MOV ax,[bp-2] ; SAHF ; Flags<-AH JA Greater ; Jump if ST>ST(1)
Note that on machines running the 287 and better, you can write the status word directly to AX with FSTSW AX.
Figure 1 is a graphical depiction of 80x87 assembler in action. The state of the coprocessor stack is shown after each instruction in the sequence:
fld A fabs fld B fadd C fmul fstp D
This sequence computes |A|*(B+C) and stores it in D, clearing the register stack in the process. Actual code would probably have size specifiers and addresses instead of capital letters, of course.
I haven't covered all the subtleties of 80x87 programming here. Writing a sine or cosine routine on pre-387 coprocessors is not nearly as straightforward as any of the applications illustrated in Listings One through Three (page 88). But not covering the partial arctangent, partial tangent, and similar esoteric instructions is no great loss; these operations are exactly where it's not worthwhile to write in assembler. My own sine and cosine routines, which run on all coprocessors, are only 12 percent faster than the library routines for my compiler -- and the library routines know enough to use FSIN or FCOS when available (on the 387 and 486). The performance hit is almost entirely due to the library routine checking whether it can use the faster instruction.
So rewriting library functions isn't usually very helpful; it's when you know something about the data that speed-ups are possible. For example, in C you can take numbers to a power with pow(). But if you know the power is an integer, you can get a significant speed-up.
Integer Powers
Listing One shows intpow(), a C-callable function that takes a double-precision number x and an unsigned integer y; the function returns x^y 50-90 percent faster than my compiler's pow() library function. I have examined the library function's code; although it can tell when y is integral, it takes a while to figure that out. Our function assumes y is integral, and so can compute the power more quickly. In addition to Listing One, I've developed a test program that's available electronically; see page 3. This program is provided in both executable (TESTPOW.EXE) and source (TESTPOW.C) form. I have also provided include files and a makefile for Borland C++.
Summing Arrays
Another useful function, sumarray(), is shown in Listing Two. It sums an array of floating-point values. sumarray() works with an array of floats, but it could easily be modified to work with doubles or 10-byte temporary reals. (Just change the dword ptr qualifiers to qword ptr or tbyte ptr, and change the size of the pointer increment to 8 or 10.)
To evaluate sumarray()'s performance, I raced it against a C loop that summed the same array. I compared the performance of this loop, which did not contain any function calls or parameter passing, against calls to sumarray(). For arrays of size 1024, sumarray() is about four times as fast as C code. In fact, sumarray() is as fast as C even when the array is only of size 2! This reflects the difficulty compilers have when generating code for the 80x87. The stackbased architecture is difficult to model, and variables that belong in 80x87 registers often get put in local variables.
Another nice thing about sumarray() is that it preserves greater precision than the equivalent loop in C. While the C loop repeatedly truncates the subtotal to 64 bits, sumarray() keeps it in an 80-bit register. This could be particularly important in an operation such as summation, where small errors get larger as more additions are performed.
In addition to Listing Two, I've developed a test program that's available electronically; see page 3. This program is provided in both executable (TESTSUM.EXE) and source (TESTSUM.C) form. I've also provided include files and a makefile for Borland C++.
Quadratic Formula
The third, most complicated routine, solve_quadratic(), illustrates a few things that intpow() and sumarray() did not. For one thing, solve_quadratic demonstrates how to perform a floating-point comparison.
The other key feature of solve_quadratic is that it passes its results back via pointers, rather than as a return value. The return value of solve_quadratic is used to indicate whether roots were found. If no roots are found (if the discriminant of the quadratic equation is negative), solve_quadratic returns 0; if it finds a pair of roots, it writes them to the two pointers passed to it and returns 1.
I haven't measured the performance of solve_quadratic against compiled code; I didn't need to. Compilers never generate code that looks like solve_quadratic. If you don't believe me, have your compiler issue the assembler for a few floating-point intensive functions. You will see some fine examples of how to keep the register stack as empty as possible when working with the 80x87. Some compilers are better than others, but they are no match for a determined human. This is not the compilers' fault; the 80x87 architecture is simply difficult to issue code for.
Floating-point calling conventions vary from compiler to compiler. The functions here assume that floating-point parameters are passed as double, and that floating-point return values are returned in ST. This is the calling convention for Borland C++ and its predecessors. Calling conventions vary, however; some compilers may pass floats as 32-bit values rather than promoting them to double, for instance. Check your manual for your compiler's conventions.
As with the other programs, I've written a test program that's available electronically; see page 3. This program is provided in both executable (TESTQUAD.EXE) and source (TESTQUAD.C) form. I've also provided include files and a makefile for Borland C++.
Conclusion
If you want to learn more about 80x87 programming, or you want to clarify what we have covered here, try altering the routines presented here to work differently. Make sumarray() sum an array of doubles or integers; make solve_quadratic pass back floats or long doubles instead of doubles. Make the routines work with a different calling convention, such as that of Microsoft C. When you can easily make simple changes such as these, you will be equipped to address your own floating-point applications in assembler.
Well, we've gone over the architecture and instruction set of the 80x87, and hopefully you're convinced that the math coprocessors are worth talking to occasionally. This article has only scratched the surface of 80x87 programming, but it should contain enough background material for you to start learning on your own.
_ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87_ by Nicholas Wilt[LISTING ONE]
<a name="008a_0012"> ; pow.asm: integer power function callable from Borland C++. ; Copyright (C) 1991 by Nicholas Wilt. All rights reserved. .MODEL LARGE,C .CODE ; double intpow(double x, unsigned int y); ; Returns x^y. PUBLIC intpow intpow PROC X:QWORD,Y:WORD fld1 ; Load 1 into the 80x87 mov cx,Y ; Get y fld X ; Load x into the 80x87 jcxz Return ; If y already zero, return TestY: test cx,1 ; Is the LSB of y set? jz NextIteration ; Jump if no fmul st(1),st ; ret *= x NextIteration: fmul st,st(0) ; Square x shr cx,1 ; y >>= 1 jnz TestY ; Continue if nonzero Return: fstp st ; Pop stack. Return value is ; now in ST(0). ret intpow ENDP END <a name="008a_0013"> <a name="008a_0014">[LISTING TWO]
<a name="008a_0014"> ; sumarray(), a Borland C++-callable function to sum the values ; in an array of floats. ; Copyright (C) 1991 by Nicholas Wilt. All rights reserved. .MODEL LARGE,C .CODE ; double sumarray(float *arr, int n); ; Returns sum of the n elements in arr. PUBLIC sumarray sumarray PROC ARR:DWORD,N:WORD les bx,ARR ; ES:BX <- pointer to arr mov cx,N ; Get number of elements fldz ; Load zero jcxz LeaveSum ; Leave if count is 0. DoSum: fadd dword ptr es:[bx] ; Add value in array to ST(0). add bx,4 ; Point to next value in array. loop DoSum ; Loop until done. LeaveSum: ; Return value in ST(0). ret sumarray ENDP END <a name="008a_0015"> <a name="008a_0016">[LISTING THREE]
<a name="008a_0016"> ; quad.asm: integer power function callable from Borland C++. ; Copyright (C) 1991 by Nicholas Wilt. All rights reserved. .MODEL LARGE,C .CODE ; int solve_quadratic(double a, double b, double c, double *x1, double *x2); ; solve_quadratic takes the coefficients of a quadratic polynomial ; and finds the roots of that polynomial. If there are two real ; roots, it writes them back to x1 and x2 and returns 1. If the ; discriminant b^2-4*a*c is less than 0, the roots are not real ; and solve_quadratic returns 0. ; The quadratic formula is as follows: ; (-b +/- sqrt(b*b-4*a*c)) / 2*a PUBLIC solve_quadratic solve_quadratic PROC A:QWORD,B:QWORD,C:QWORD,X1:DWORD,X2:DWORD ; Comments show stack contents ; separated by | signs ; Stack top is at left fld A ; a Here, ST(0) contains a. fld B ; b | a Now ST(1) has a. Etc. ; Negate b -- squaring it is sign-independent, and we need b ; negated in all the other instances in the formula. fchs fld C ; c | b | a fld st(1) ; b | c | b | a fmul st,st(0) ; b*b | c | b | a ; Just plain fxch has an implicit operand of ST(1). fxch ; c | b*b | b | a fadd st,st(0) ; 2*c | b*b | b | a fadd st,st(0) ; 4*c | b*b | b | a fmul st,st(3) ; 4*a*c | b*b | b | a fsubp st(1), st ; b*b-4*a*c | b | a ftst ; Compare against 0 ; We've computed b*b-4*a*c. If negative, we return 0. ; To do the comparison, we have to store the 80x87 status ; word and use sahf to store it into the flags. Once it's ; in the flags, we can jump on above or equal (jae) to jump if the ; number tested is greater than 0 after the FTST instruction above. sub sp,2 ; Quick, allocate a local mov bx,sp ; Point BX at it fstsw ss:[bx] ; Store the 80x87's status word there mov ax,ss:[bx] ; AX <- status word add sp,2 ; Deallocate the local sahf ; Get SW into flags jae ComputeResults ; Jump if positive fstp st ; This instruction clears the stack fstp st ; we've got three values on the stack fstp st ; to clear xor ax,ax ; Return 0 - no roots found. jmp short LeaveQuadratic ComputeResults: fsqrt ; Find square root of discriminant fxch st(2) ; a | b | sqrt(b*b-4*a*c) fadd st,st(0) ; 2*a | b | sqrt(b*b-4*a*c) fld st(1) ; b | 2*a | b | sqrt(b*b-4*a*c) fadd st,st(3) ; b+sqrt(b*b-4*a*c) | 2*a | b | sq.. fdiv st,st(1) ; x1 | 2*a | b | sqrt(b*b-4*a*c) les bx,X1 ; fstp qword ptr es:[bx] ; 2*a | b | sqrt(b*b-4*a*c) fxch ; b | 2*a | sqrt(b*b-4*a*c) fsubrp st(2),st ; 2*a | -b - sqrt(b*b-4*a*c) fdiv ; x2 les bx,X2 ; Store x2 fstp qword ptr es:[bx] mov ax,1 ; Return 1 LeaveQuadratic: ret ; Return solve_quadratic ENDP END
Copyright © 1992, Dr. Dobb's Journal