Embedded Systems

Assembly Language Programming for the 80x87

By Nicholas Wilt, March 01, 1992

Contrary to popular notions, numeric coprocessor code can be optimized using assembly language in much the same way as integer-based code.

MAR92: ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87

ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87

Floating point from the inside out

This article contains the following executables: ASMX87.ARC

Nicholas Wilt

Nicholas is a student at Dartmouth College, Hanover, New Hampshire. His interests include computer graphics, C++, and assembler programming. He can be reached through the editorial offices at DDJ.

Conventional wisdom holds that programming numeric coprocessors in assembler is not worthwhile. After all, so the story goes, the 80x87 architecture is so esoteric and complicated, and compilers do such a good job utilizing them, that it's not worth the effort to program them in assembler.

Nothing could be further from the truth. The 80x87 coprocessor has a quirky architecture, but no more so than its integer-based siblings. And we shall see that coprocessor code is just as susceptible to optimization as integer-based code, if not more so. This article is for assembler programmers not familiar with the 80x87 math coprocessors. It highlights features of the 80x87 most useful to the applications programmer, rather than attempting an exhaustive treatise.

80x87 Registers

The 80x87 contains eight 80-bit floating-point registers. This 80-bit "temporary real" format, intended for internal use by the 80x87, is the only numeric format the chip can directly manipulate. It can load and store integers and standard IEEE floating-point numbers, but they must be converted to and from the temporary real format.

The integer and floating-point formats supported by the 80x87 are listed in Table 1, along with their size specifiers. The integer formats are the same as those supported by the 80x86, except that the 80x87 can directly load and manipulate quadwords (64-bit integers) as well as words and doublewords.

Table 1: Integer and floating-point formats supported by the 80x87

  Format                Operand  Load
                         type    with
  -----------------------------------

  32-bit float        dword ptr  FLD
  64-bit double       qword ptr  FLD
  80-bit long double  tbyte ptr  FLD
  16-bit integer      word ptr   FILD
  32-bit integer      dword ptr  FILD
  64-bit integer      qword ptr  FILD

The temporary real format has a 64-bit mantissa, so 64-bit integers can be represented without loss of precision.

The eight FPU registers are organized as a stack. Numbers can be loaded into and stored from the top of the stack only. The top register on the stack is called ST or ST(0). Registers are numbered by increasing depth from ST; the first register from the top is ST(1), the second is ST(2), and so on. This stack-oriented architecture is the strangest aspect of programming the 80x87; floating-point units for other architectures are organized more like integer units, with constant register names.

On the 80x87, every time you load something into the chip, all the register names for the chip's contents change -- what used to be the stack top is now ST(1), ST(1) becomes ST(2), and so on down the line. The comments in 80x87 assembler code almost always concentrate on keeping the stack contents straight.

Like the integer processor, the 80x87 has special registers that contain flags and control bits. The CW (Control Word) of the 80x87 has bits that control such things as how to round numbers during conversions. The SW (Status Word) has bits that describe the state of the machine (such as the result of a comparison). These words can be accessed through the FLDCW (load control word), FSTCW (store control word), and FSTSW (store status word) instructions. Fortunately, we don't often have to worry about the control or status words. The default values in the control word are reasonable for most applications (after all, they are the same for us as they are for the compiler), and the status word is useful to applications programmers mostly for doing floating-point comparisons.

80x87 Instructions

The most important instructions on the 80x87 are the load and store instructions. FLD pushes an operand onto the stack; the loaded operand becomes ST. FILD, the integer version of FLD, pushes integer operands onto the stack.

Executing FLD on a 32-bit (dword-sized) operand converts a 32-bit, floating-point number to temporary real and pushes it onto the register stack. Any effective address can be used for the operand -- for example, to load a 32-bit local variable, you might write: FLD dword ptr [bp-4], Load local, while to load a 64-bit value pointed to by DS:SI, you would write: FLD qword ptr [si]; Load next value in array. To load a 16-bit integer, you use the FILD instruction: FILD word ptr [bp+6]; Load first parameter.

Besides FLD and FILD, there are a number of special FLD instructions to load special constants onto the stack. These instructions don't take any operands; they just push a value onto the stack. FLDZ, for example, pushes 0.0 onto the stack. Table 2 has a complete list of these special load instructions. These instructions are useful not only because they are faster than loading constants from memory, but also because they guarantee full 80-bit precision for the constants loaded.

Table 2: Special FLD instructions to load special constants onto the stack

  Instruction  Loads
  ----------------------------

  fldlg2       Log 2 (base 10)
  fldln2       Log 2 (base e)
  fldl2e       Log e (base 2)
  fldl2t       Log 10 (base 2)
  fldpi        Pi (3.14159...)
  fldz         Zero (0.0)
  fld1         One (1.0)

As with loading operands, storing operands can only be done from ST, the stack top. The FST and FIST instructions store floating-point and integer formats. To store the stack top in the form of a 64-bit, double-precision number, write something such as: FST qword ptr [bp-8]; Store temp.

Often, you want to store the stack top and pop the stack in one operation. This is when you won't be needing the contents of the stack top after storing them. Operations that pop the stack after completion have a P appended onto the instruction name. To store the stack top and then pop it off the register stack, you use the FSTP and FISTP instructions in place of FST and FIST.

The instruction mnemonics for the 80x87 were chosen deliberately to help understand what the instruction does. For instance, instructions for the coprocessor always begin with the letter F; this distinguishes them as 80x87 instructions, because no integer instructions begin with F. Instructions that begin with FI perform integer versions of the operation. The 80x87 needs separate instructions to perform many operations on integers because the operand size (for example, 32 bits or dword) doesn't always imply the operand type. A 32-bit number can be a float or a long. So when you write FLD dword ptr [bp+6], this loads a 32-bit floating-point number onto the stack. To load a 32-bit integer onto the stack, you write: FILD dword ptr [bp+6].

The two instructions both load an operand onto the 80x87 stack, but they use different conversions to arrive at the 80-bit internal representation.

Another convention is that instructions with a P appended to them pop the stack after the operation. This is useful when used in conjunction with binary operations such as addition, as well as with FSTP.

80x87 Arithmetic Instructions

There are several classes of arithmetic instruction on the 80x87. The simplest class, unary instructions, perform an operation on a single operand. Absolute value, negation, and square root all require only one operand. Unary operations are predefined to operate on the stack top. For example, FSQRT replaces ST with its square root. Operations are listed in Table 3.

Table 3: 80x87 unary instructions: For fsin, fcos, and fsincos ST must have a magnitude of less than 2{63}.

  Instruction    Operation it performs
  -------------------------------------------

  f2xm1          Computes 2{st-1}
  fabs           Takes absolute value of ST.
  fchs           Negates ST.
  frndint        Rounds ST to integer.
  fsqrt          Takes square root of ST.
  ftst           Compares ST against 0
                  and sets FPU flags in
                  status word.

  387/j486 only
  fcos           Replaces ST with its
                  cosine.
  fsin           Replaces ST with its sine.
  fsincos        Takes sine and cosine of ST,
                  leaves cosine in ST, sine
                  in ST(1).

Unary operations only operate on ST, so it is often useful to swap registers around inside the 80x87. The FXCH instruction exchanges the stack top with the register operand specified. For example:

  FXCH     ; Exchange ST and ST(1)
  FXCH ST(2); Exchange ST and ST(2)

Table 4 lists the two-operand operations available on the 80x87. These are the most general, useful operations -- add, subtract, divide, and multiply. Because they are used so often, Intel made them very powerful; there are many variations of each instruction.

Table 4: Two-operand operations available on the 80x87

  Instruction  Operation it performs
  ---------------------------------------

  fadd         Add
  fcom         Compare
  fdiv         Compute destination/source
  fdivr        Compute source/destination
  fmul         Multiply
  fsub         Compute destination-source
  fsubr        Compute source-destination

All two-operand instructions can be performed between registers on the FPU. These must involve the stack top, however. You can write: FADD ST(1), ST; ST(1)+=ST, but not FADD ST(2), ST(3); ST(2)+=ST(3). You can append a P to pop the stack after performing the operation, as follows: FADDP ST(3), ST; Add ST to ST(3), then pop.

In fact, one form of this operation is so common that it is implicit. If you write simply: FMUL; Multiply ST(1) by ST, pop, it is the same as: FMULP ST(1), ST; Multiply ST(1) by ST, pop.

You can also use memory operands with these instructions. As with register-register operations, they must involve the stack top. Memory operands are more restrictive, however -- the stack top is always the destination. Also, only 32-bit and 64-bit memory operands are supported. You use a memory operand as follows: FADDdword ptr es:[di]. This adds the 32-bit, floating- point number located at ES:DI to ST. (Note: Using FADDP with a memory operand doesn't make much sense -- you would add a number to the stack top, then pop the result just computed!)

These memory-source instructions have 16- and 64-bit integer forms, as well. If the number at ES:DI is a 32-bit integer, not a 32-bit floating-point number, you can write: FIADD dword dtr es:[di].

FSUB and FDIV

I should say a few words here about FSUB and FDIV, the results of which depend on the operand's order. By default, the source operand is subtracted from (or divided into) the destination operand. Thus: FSUB ST(1), ST; ST(1) = ST(1)-ST subtracts ST from ST(1). What if we want the destination to be subtracted from the source? Then we append an R (for "reverse operands") to the instruction mnemonic. While FSUB qword ptr [bp-8]; ST=ST-[bp-8] subtracts the 64-bit floating-point number at [BP-8] from the stack top, to swap the operands you write FSUBR qword ptr [bp-8]; ST=[bp-8]-ST. Instructions that have reversed operations and also pop the stack end in RP (for example, FDIVRP and FSUBRP).

Comparisons

The 80x87 lets you compare floating-point numbers in a roundabout fashion. The FCOM instruction compares ST to a given operand. For example, FCOM ST(2) compares ST to ST(2) and sets the condition code bits to reflect the results of the comparison.

FCOM has several forms. The memory form of the instruction compares ST to a 32- or 64-bit memory operand. FCOMP compares ST to the source operand and then pops ST off the register stack. FICOM takes a 16- or 32-bit integer memory operand. FCOM with no operands is the same as FCOMP ST(1). Finally, FCOMPP compares ST to ST(1) and pops both off the register stack.

You examine the comparison results by storing the status word to a memory operand with FSTSW, then loading it into AX and performing an SAHF instruction. Intel designed the 80x87 so that the condition code bits in the 80x87 map directly to the carry and 0 bits in the 80x86. (Some other bits from the status word get mapped to flags as well, but they are not relevant to comparisons.) Then, because the comparison results are in the carry and 0 flags, the unsigned conditional jump instructions can be used in mnemonic fashion: JA jumps if ST is greater than the source operand, JB jumps if it is less then the source operand, and JE jumps if it is equal to the source operand. solve_quadratic() has an example of a floating-point comparison to avoid taking the square root of a negative number. Here is an example of how to jump if ST>ST(1):

FCOM ST(1)    ; Compare ST to ST(1) 
FSTSW [bp-2]  ; Store SW to local 
MOV   ax,[bp-2]  ; 
SAHF         ; Flags<-AH 
JA    Greater ; Jump if ST>ST(1)

Note that on machines running the 287 and better, you can write the status word directly to AX with FSTSW AX.

Figure 1 is a graphical depiction of 80x87 assembler in action. The state of the coprocessor stack is shown after each instruction in the sequence:

  
fld   A   
fabs   
fld   B   
fadd  C   
fmul   
fstp  D

This sequence computes |A|*(B+C) and stores it in D, clearing the register stack in the process. Actual code would probably have size specifiers and addresses instead of capital letters, of course.

I haven't covered all the subtleties of 80x87 programming here. Writing a sine or cosine routine on pre-387 coprocessors is not nearly as straightforward as any of the applications illustrated in Listings One through Three (page 88). But not covering the partial arctangent, partial tangent, and similar esoteric instructions is no great loss; these operations are exactly where it's not worthwhile to write in assembler. My own sine and cosine routines, which run on all coprocessors, are only 12 percent faster than the library routines for my compiler -- and the library routines know enough to use FSIN or FCOS when available (on the 387 and 486). The performance hit is almost entirely due to the library routine checking whether it can use the faster instruction.

So rewriting library functions isn't usually very helpful; it's when you know something about the data that speed-ups are possible. For example, in C you can take numbers to a power with pow(). But if you know the power is an integer, you can get a significant speed-up.

Integer Powers

Listing One shows intpow(), a C-callable function that takes a double-precision number x and an unsigned integer y; the function returns x^y 50-90 percent faster than my compiler's pow() library function. I have examined the library function's code; although it can tell when y is integral, it takes a while to figure that out. Our function assumes y is integral, and so can compute the power more quickly. In addition to Listing One, I've developed a test program that's available electronically; see page 3. This program is provided in both executable (TESTPOW.EXE) and source (TESTPOW.C) form. I have also provided include files and a makefile for Borland C++.

Summing Arrays

Another useful function, sumarray(), is shown in Listing Two. It sums an array of floating-point values. sumarray() works with an array of floats, but it could easily be modified to work with doubles or 10-byte temporary reals. (Just change the dword ptr qualifiers to qword ptr or tbyte ptr, and change the size of the pointer increment to 8 or 10.)

To evaluate sumarray()'s performance, I raced it against a C loop that summed the same array. I compared the performance of this loop, which did not contain any function calls or parameter passing, against calls to sumarray(). For arrays of size 1024, sumarray() is about four times as fast as C code. In fact, sumarray() is as fast as C even when the array is only of size 2! This reflects the difficulty compilers have when generating code for the 80x87. The stackbased architecture is difficult to model, and variables that belong in 80x87 registers often get put in local variables.

Another nice thing about sumarray() is that it preserves greater precision than the equivalent loop in C. While the C loop repeatedly truncates the subtotal to 64 bits, sumarray() keeps it in an 80-bit register. This could be particularly important in an operation such as summation, where small errors get larger as more additions are performed.

In addition to Listing Two, I've developed a test program that's available electronically; see page 3. This program is provided in both executable (TESTSUM.EXE) and source (TESTSUM.C) form. I've also provided include files and a makefile for Borland C++.

Quadratic Formula

The third, most complicated routine, solve_quadratic(), illustrates a few things that intpow() and sumarray() did not. For one thing, solve_quadratic demonstrates how to perform a floating-point comparison.

The other key feature of solve_quadratic is that it passes its results back via pointers, rather than as a return value. The return value of solve_quadratic is used to indicate whether roots were found. If no roots are found (if the discriminant of the quadratic equation is negative), solve_quadratic returns 0; if it finds a pair of roots, it writes them to the two pointers passed to it and returns 1.

I haven't measured the performance of solve_quadratic against compiled code; I didn't need to. Compilers never generate code that looks like solve_quadratic. If you don't believe me, have your compiler issue the assembler for a few floating-point intensive functions. You will see some fine examples of how to keep the register stack as empty as possible when working with the 80x87. Some compilers are better than others, but they are no match for a determined human. This is not the compilers' fault; the 80x87 architecture is simply difficult to issue code for.

Floating-point calling conventions vary from compiler to compiler. The functions here assume that floating-point parameters are passed as double, and that floating-point return values are returned in ST. This is the calling convention for Borland C++ and its predecessors. Calling conventions vary, however; some compilers may pass floats as 32-bit values rather than promoting them to double, for instance. Check your manual for your compiler's conventions.

As with the other programs, I've written a test program that's available electronically; see page 3. This program is provided in both executable (TESTQUAD.EXE) and source (TESTQUAD.C) form. I've also provided include files and a makefile for Borland C++.

Conclusion

If you want to learn more about 80x87 programming, or you want to clarify what we have covered here, try altering the routines presented here to work differently. Make sumarray() sum an array of doubles or integers; make solve_quadratic pass back floats or long doubles instead of doubles. Make the routines work with a different calling convention, such as that of Microsoft C. When you can easily make simple changes such as these, you will be equipped to address your own floating-point applications in assembler.

Well, we've gone over the architecture and instruction set of the 80x87, and hopefully you're convinced that the math coprocessors are worth talking to occasionally. This article has only scratched the surface of 80x87 programming, but it should contain enough background material for you to start learning on your own.


_ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87_
by Nicholas Wilt

[LISTING ONE]

<a name="008a_0012">

; pow.asm: integer power function callable from Borland C++.
; Copyright (C) 1991 by Nicholas Wilt.  All rights reserved.

.MODEL  LARGE,C

.CODE

; double intpow(double x, unsigned int y);
; Returns x^y.

    PUBLIC  intpow

intpow  PROC    X:QWORD,Y:WORD
    fld1            ; Load 1 into the 80x87
    mov cx,Y        ; Get y
    fld X       ; Load x into the 80x87

    jcxz    Return      ; If y already zero, return

TestY:  test    cx,1        ; Is the LSB of y set?
    jz  NextIteration   ; Jump if no
    fmul    st(1),st    ; ret *= x
NextIteration:
    fmul    st,st(0)    ; Square x
    shr cx,1        ; y >>= 1
    jnz TestY       ; Continue if nonzero
Return: fstp    st      ; Pop stack.  Return value is
                ; now in ST(0).
    ret
intpow  ENDP

    END






<a name="008a_0013">
<a name="008a_0014">

[LISTING TWO]

<a name="008a_0014">

; sumarray(), a Borland C++-callable function to sum the values
; in an array of floats.

; Copyright (C) 1991 by Nicholas Wilt.  All rights reserved.

.MODEL  LARGE,C

.CODE

; double sumarray(float *arr, int n);
;   Returns sum of the n elements in arr.

    PUBLIC  sumarray

sumarray    PROC    ARR:DWORD,N:WORD
    les bx,ARR      ; ES:BX <- pointer to arr
    mov cx,N        ; Get number of elements
    fldz            ; Load zero
    jcxz    LeaveSum    ; Leave if count is 0.
DoSum:  fadd    dword ptr es:[bx] ; Add value in array to ST(0).
    add bx,4        ; Point to next value in array.
    loop    DoSum       ; Loop until done.
LeaveSum:           ; Return value in ST(0).
    ret
sumarray    ENDP

    END





<a name="008a_0015">
<a name="008a_0016">

[LISTING THREE]

<a name="008a_0016">

; quad.asm: integer power function callable from Borland C++.
; Copyright (C) 1991 by Nicholas Wilt.  All rights reserved.

.MODEL  LARGE,C

.CODE

; int solve_quadratic(double a, double b, double c, double *x1, double *x2);
; solve_quadratic takes the coefficients of a quadratic polynomial
; and finds the roots of that polynomial.  If there are two real
; roots, it writes them back to x1 and x2 and returns 1.  If the
; discriminant b^2-4*a*c is less than 0, the roots are not real
; and solve_quadratic returns 0.
; The quadratic formula is as follows:
;   (-b +/- sqrt(b*b-4*a*c)) / 2*a

    PUBLIC  solve_quadratic

solve_quadratic PROC    A:QWORD,B:QWORD,C:QWORD,X1:DWORD,X2:DWORD
                ; Comments show stack contents
                ; separated by | signs
                ; Stack top is at left
    fld A       ; a     Here, ST(0) contains a.
    fld B       ; b | a     Now ST(1) has a.  Etc.

    ; Negate b -- squaring it is sign-independent, and we need b
    ; negated in all the other instances in the formula.
    fchs

    fld C       ; c | b | a
    fld st(1)       ; b | c | b | a
    fmul    st,st(0)    ; b*b | c | b | a

    ; Just plain fxch has an implicit operand of ST(1).

    fxch            ; c | b*b | b | a
    fadd    st,st(0)    ; 2*c | b*b | b | a
    fadd    st,st(0)    ; 4*c | b*b | b | a
    fmul    st,st(3)    ; 4*a*c | b*b | b | a
    fsubp   st(1), st   ; b*b-4*a*c | b | a
    ftst            ; Compare against 0

    ; We've computed b*b-4*a*c.  If negative, we return 0.
    ; To do the comparison, we have to store the 80x87 status
    ; word and use sahf to store it into the flags.  Once it's
    ; in the flags, we can jump on above or equal (jae) to jump if the
    ; number tested is greater than 0 after the FTST instruction above.
    sub sp,2        ; Quick, allocate a local
    mov bx,sp       ; Point BX at it
    fstsw   ss:[bx]     ; Store the 80x87's status word there
    mov ax,ss:[bx]  ; AX <- status word
    add sp,2        ; Deallocate the local
    sahf            ; Get SW into flags
    jae ComputeResults  ; Jump if positive
    fstp    st      ; This instruction clears the stack
    fstp    st      ; we've got three values on the stack
    fstp    st      ; to clear
    xor ax,ax       ; Return 0 - no roots found.
    jmp short LeaveQuadratic
ComputeResults:
    fsqrt           ; Find square root of discriminant
    fxch    st(2)       ; a | b | sqrt(b*b-4*a*c)
    fadd    st,st(0)    ; 2*a | b | sqrt(b*b-4*a*c)
    fld st(1)       ; b | 2*a | b | sqrt(b*b-4*a*c)
    fadd    st,st(3)    ; b+sqrt(b*b-4*a*c) | 2*a | b | sq..
    fdiv    st,st(1)    ; x1 | 2*a | b | sqrt(b*b-4*a*c)
    les bx,X1       ;
    fstp    qword ptr es:[bx] ; 2*a | b | sqrt(b*b-4*a*c)
    fxch            ; b | 2*a | sqrt(b*b-4*a*c)
    fsubrp  st(2),st    ; 2*a | -b - sqrt(b*b-4*a*c)
    fdiv            ; x2
    les bx,X2       ; Store x2
    fstp    qword ptr es:[bx]
    mov ax,1        ; Return 1
LeaveQuadratic:
    ret         ; Return
solve_quadratic ENDP

    END

Previous 1 2

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Assembly Language Programming for the 80x87

ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87

Floating point from the inside out

Nicholas Wilt

80x87 Registers

Table 1: Integer and floating-point formats supported by the 80x87

80x87 Instructions

Table 2: Special FLD instructions to load special constants onto the stack

80x87 Arithmetic Instructions

Table 3: 80x87 unary instructions: For fsin, fcos, and fsincos ST must have a magnitude of less than 2{63}.

Table 4: Two-operand operations available on the 80x87

FSUB and FDIV

Comparisons

Integer Powers

Summing Arrays

Quadratic Formula

Conclusion

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Embedded Systems

Assembly Language Programming for the 80x87

ASSEMBLY LANGUAGE PROGRAMMING FOR THE 80X87

Nicholas Wilt

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content