Max is a programmer and Ph.D. student. He can be contacted at [email protected].
It hasn't been that long since computationally intensive, real-time graphics applications required digital signal processors (DSPs) or other special processors. With the introduction of Single-Instruction, Multiple-Data (SIMD) extensions to general-purpose processors, however, things have changed. With multimedia instruction-set extensions such as Intel's MMX, you can execute up to 16 integer operations on 8-bit data per clock cycle, or up to four integer operations on 32-bit data per cycle. The introduction of 3DNow! by AMD and streaming SIMD extensions by Intel has brought floating-point performance up to speed, with number-crunching rates of four single-precision floating-point operations per cycle. These technologies open possibilities for real-time image processing, speech recognition, audio/video compression, and 3D rendering. Raw CPU speed alone, however, is not enough to make applications run faster. Programmers must use optimization techniques to get optimal performance out of critical code. In this article, I'll discuss MMX code optimization and suggest techniques for achieving maximum speed on two common PC CPUs -- the Intel Pentium II and AMD K6-2.
AMD K6-2 Versus Intel Pentium II
Both Intel's Pentium II and AMD's K6-2 are sophisticated CPUs with complex internal structures. Both CPU families employ superscalar pipelining, dynamic execution, and branch prediction -- and both can execute up to six micro-operations per cycle.
Of course, there are differences in internal architecture. The Pentium II, for instance, has three instruction decoders, while the K6-2 has two. Aimed at speeding up existing software, AMD's K6-2 is less sensitive to code selection and instruction scheduling. Specific details on the internal architecture of these two CPUs can be found in DirectX, RDX, RSX, and MMX Technology: A Jump-start Guide to High Performance APIs, by Rohan Coelho and Maher Hawash (Addison-Wesley, 1998); The Pentium II Processor Developer's Manual (Intel Corporation, Order #243502-001, October 1997); and the AMD K6-2 Processor Data Sheet (Document #21850, http://www.amd.com/K6/k6docs/).
As Table 1 illustrates, however, there is another striking difference between the Pentium II and K6-2 -- cache organization. AMD did not include L2 cache in the K6-2 chip package. Instead, the original Socket 7 was improved to work at 100 MHz and renamed "Super 7." It connects the CPU to external L2 cache via a 100-MHz, 64-bit wide bus. The Pentium II, on the other hand, integrates L2 cache in the processor cartridge, where it works at half the speed of the CPU core (full speed for the Pentium II Xeon).
At 300 MHz, the Pentium II's L2 cache works at 150 MHz, or 50 percent faster than the K6-2's 100-MHz L2 cache. At 450 MHz, the difference in speed is 125 percent. To compensate for this without risking a major processor and motherboard redesign, AMD doubled the CPU's internal cache from 32 KB to 64 KB and implemented a sectored L1 cache structure, where two 32-byte cache lines are combined in a sector. When a cache miss occurs and a cache line is filled from L2 cache, the other line in the sector is automatically fetched. Theoretically, this approach compensates for the higher L2 cache latency by reducing the number of misses by a factor of two.
While the Pentium III did not introduce any major changes in cache organization, the same can't be said about the AMD K6-III (see the AMD K6-III Processor Data Sheet #21918, http://www.amd.com/K6/k6docs/). Still fitting into a Super 7 socket, the K6-III contains a 256-KB integrated L2 cache working at the same speed as the CPU core, and supports external L3 cache. The K6-III can fetch data from L2 cache twice as fast as the Pentium III, but its cache pool is half the size, so expensive L2 cache misses happen twice as often. This creates a confusing situation where it is hard to guess whether these tradeoffs cancel each other out.
MMX Data Optimization
Efficient processing of a continuous stream of data requires both a fast CPU and high memory throughput. For instance, when you encode a video sequence, each frame has to be fetched from system memory, compressed, and stored in another location. Performance is limited by memory bandwidth, no matter how fast the CPU is. However, it is possible to reduce, if not eliminate, the compressor's computational overhead through aggressive instruction scheduling and data prefetching.
Still, whether you have a continuous data stream or static data, performance can be improved by changing the data organization (if possible) or by changing the way the data is processed. If a video compressor has to process the same frame several times and the frame is too large to fit in cache, the data will have to be fetched from memory again and again, reducing performance dramatically. However, if the frame can be split into several blocks (each small enough to fit in cache) and each block can be processed separately, then the data will be fetched from system memory only once, and all subsequent loads will hit the cache. Thus, with multipass processing (typical for most applications), it is highly desirable to partition data into the smallest blocks possible (16 KB to fit in the Pentium II/III L1 cache), do multipass processing on each block while it resides in cache, store the results, fetch another block, and so on. This approach may not work for all applications, however. Convolution filtering, for example, may create undesirable edge effects and image artifacts when applied to a block-split image.
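To make this concrete, here is a minimal C sketch of the block-wise multipass scheme (process_pass is a hypothetical per-pass routine, and the 16-KB block size assumes the Pentium II/III L1 data cache):

#include <stddef.h>

#define BLOCK_SIZE (16 * 1024)   /* small enough to stay in L1 data cache */

extern void process_pass(unsigned char *block, size_t n, int pass);  /* hypothetical */

/* Each block is fetched from system memory once; all passes then run
   over it while it is still cache-hot. */
void process_frame(unsigned char *frame, size_t frame_size, int passes)
{
    size_t offset;
    for (offset = 0; offset < frame_size; offset += BLOCK_SIZE) {
        size_t n = frame_size - offset;
        int pass;
        if (n > BLOCK_SIZE)
            n = BLOCK_SIZE;
        for (pass = 0; pass < passes; pass++)
            process_pass(frame + offset, n, pass);
    }
}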
Another important consideration is whether complex data should be organized as an array of structures (AoS) or a structure of arrays (SoA). It depends. With MMX SIMD processing, it is vital that data that can be processed in parallel is packed into 8-byte chunks. Consider the case where only the alpha values of an RGBA image must be adjusted: here, the SoA approach eliminates unnecessary unpacking. At the same time, when the intensity of the entire image has to be adjusted (and each RGBA component must be offset by the same value), the AoS approach is favorable.
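The difference between the two layouts is easy to see in C (the field names and IMAGE_SIZE below are illustrative). In the SoA form, all alpha values are contiguous, so eight of them can be loaded into an MMX register with a single movq; in the AoS form, whole pixels are contiguous, which suits uniform adjustment of all components:

#define IMAGE_SIZE 65536   /* illustrative */

/* AoS: the four components of each pixel are adjacent. */
struct PixelAoS { unsigned char r, g, b, a; };
struct PixelAoS image_aos[IMAGE_SIZE];

/* SoA: each component forms a contiguous plane. */
struct ImageSoA {
    unsigned char r[IMAGE_SIZE];
    unsigned char g[IMAGE_SIZE];
    unsigned char b[IMAGE_SIZE];
    unsigned char a[IMAGE_SIZE];
};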
In short, the necessary steps to achieve MMX data optimization are:
1. Determine optimal data packing format for SIMD processing (AoS, SoA, or data structure layout).
2. Determine how many passes the data is processed in.
3. If data is large and processed in more than one pass, determine the minimal block size the data can be split into (the preferable block size is L1 cache size).
4. When possible, process data in place, rather than out of place, to minimize the number of memory references and cache misses.
MMX Code Optimization
Once the data format and organization are selected, it is time to write code. Despite the CPU's ability to reschedule and execute instructions out of order, it is still critical to arrange MMX code correctly.
Therefore, the code optimization guidelines are as follows:
- Pair and schedule instructions to fill both pipelines.
- Maximize register usage and minimize memory references.
- Minimize branching and unroll small loops.
- Align code, branch targets, and data.
- Avoid using long (more than 7 bytes) and complex instructions (loop, enter, leave, and so on).
MMX-specific code optimization guidelines include the following:
- Parallel processing of multiple data streams along with loop unrolling can improve pairing.
- MMX instructions do not mix well with floating-point instructions.
- MMX instructions that reference memory or integer registers do not mix well with integer instructions referencing memory or registers.
- Column-wise processing can be better than sequential row-wise processing.
- MMX code sections should end with an emms instruction if floating-point operations are to be used later in the program (see the sketch after this list).
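As a minimal illustration of the last point, the following sketch uses the C intrinsics from mmintrin.h instead of inline assembly; _mm_empty() compiles to emms and frees the floating-point register stack for subsequent FP code:

#include <mmintrin.h>

void mmx_then_float(short *dst, const short *src)
{
    __m64 v = *(const __m64 *)src;   /* MMX section */
    v = _mm_adds_pi16(v, v);
    *(__m64 *)dst = v;
    _mm_empty();                     /* emms: floating-point code is now safe */
}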
In general, MMX instructions can be easily paired with each other and with integer instructions. The exceptions are:
- MMX instructions that reference memory or integer registers. These should not be paired with integer instructions referencing memory or integer registers; see Listing One.
- MMX shift/pack/unpack instructions. These should not be paired because there is a single shifter unit; see Listing Two.
- MMX multiplication instructions pmull/pmulh/pmadd. These should not be paired because there is a single MMX multiplication unit; see Listing Three.
- The destination register of the first instruction. This should not match the source or destination register of the second instruction in the pair, except for certain movq instructions; see Listing Four.
Sometimes, when it seems that it is impossible to improve instruction pairing in a block of MMX code, multiple data stream processing or loop unrolling helps.
In Listing Five, which is an example of thresholding, each instruction depends on the result of the previous instruction, and almost no MMX instruction pairing takes place. However, as Listing Six illustrates, you can improve pairing by processing two data streams in parallel. You can improve this code even further by rearranging instructions, raising pairing from 40 to 100 percent and improving code performance by a factor of 5/3 (1.7 times); see Listing Seven.
Multiple data stream processing has the effect of loop unrolling combined with aggressive instruction scheduling and reordering. For large loops, the branching overhead is virtually nonexistent because the correctly predicted branch takes only one cycle to execute. There is only one expensive mispredicted branch in such a loop, and it happens only in the last iteration.
Loop unrolling is a powerful technique for speeding up MMX code. However, excessive unrolling may result in a large code footprint and more instruction cache misses. Thus, the inner loop should be kept within 16 KB on the Pentium II and 32 KB on the K6-2 -- the size of each CPU's instruction cache.
A side effect of loop unrolling is that the processed data size or image height/width (in bytes) must be a multiple of eight times the number of loop unrolls. If such data granularity cannot be achieved, the edges must be processed separately.
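The bookkeeping can be sketched as follows (process_mmx and process_one are hypothetical routines standing in for the unrolled MMX loop and the scalar edge case):

#include <stddef.h>

#define UNROLL 2   /* number of loop unrolls */

extern void process_mmx(unsigned char *data, size_t n);   /* unrolled MMX loop */
extern void process_one(unsigned char *p);                /* scalar edge case */

void process(unsigned char *data, size_t n)
{
    size_t step = 8 * UNROLL;            /* bytes consumed per iteration */
    size_t main_part = n - (n % step);   /* largest multiple of 8*UNROLL */
    size_t i;
    process_mmx(data, main_part);
    for (i = main_part; i < n; i++)      /* leftover edge bytes */
        process_one(data + i);
}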
Data alignment is also important for efficient data processing. Misaligned data incurs a 3-cycle penalty on the Pentium II and a 1-cycle penalty on the K6-2. Thus 32-bit data should be aligned on a 32-bit boundary, and 64-bit data on a 64-bit boundary. If misaligned data access crosses the cache line boundary, the penalty is even greater (6-9 cycles on the Pentium II).
Most compilers (including Visual C++) align data automatically. However, when some addresses happen to be misaligned, they can be corrected manually, as in Listing Eight. Misalignments of 64-bit data can occur when data arrays are declared as members of C++ classes/C-structures having the default 32-bit alignment.
Code alignment is also vital. Compilers automatically align function entry points and branch targets. However, the branch targets in inline assembly code may be misaligned, thus causing a 1-cycle penalty per branch on the Pentium II. To avoid this, the code should be analyzed (debugged or disassembled), and extra nop instructions should be used to align the branch; see Listing Nine.
Likewise, data arrangement plays a critical role in MMX code performance. In the case of horizontal and vertical 8-bit image decimation with averaging, for instance, each new column/row transforms into a sum of two original columns/rows. In column-wise processing, it is easy to sum eight pixels at once; see Listing Ten.
A row-wise approach is more computationally expensive, however. Each 8-byte chunk has to be unpacked and its bytes rearranged so that bytes 0, 2, 4, and 6 can be added to bytes 1, 3, 5, and 7, and the result then packed back into an 8-byte chunk.
Another important code optimization issue is instruction scheduling. Though most instructions have a latency of 1 cycle, multiplication (pmul and pmadd) and memory referencing instructions have a latency of 3 cycles (3 cycles is a minimum cache hit latency for load instructions) on the Pentium II/III, and 2 cycles on the K6-2/K6-III. Thus, in order to avoid wasting latency cycles, the results of multiplication/load instructions should not be referenced immediately after the instruction is issued; see Listing Eleven.
Properly utilized latency cycles provide room for other instructions. Up to four such instructions can be executed on the Pentium II to fill latency cycles incurred by MMX multiply/load instructions.
Another performance issue comes from preemptive multitasking, which is commonly found in desktop operating systems. Real-time applications require immediate data processing and full CPU power.
However, the OS task scheduler may suspend the application while executing some other process. Windows 95/98/NT rely on preemptive multitasking and do not provide any means for CPU monopolization by high-level applications. It is possible to boost the priority of the currently executing time-critical thread by using the Win32 API calls:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
These calls set thread priority to the highest value of 31. On NT, the thread blocks even system processes, including mouse input and disk buffer flushing. Thus, at the end of a time-critical section, the process and thread priorities should be downgraded to their normal values:
SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);
Surprisingly, Windows 9x responds differently to the priority boost. System processes do not get suspended, and VMM activity eats up 30 percent of the CPU -- and this is without an active network or other tasks running.
You can write a VxD for Windows 9x that executes in ring 0 and blocks system processes for time-critical operations by calling Adjust_Execution_Time or Adjust_Exec_Priority; see Listing Twelve. But Windows 9x can crash if you use MMX instructions in a VxD (see How to Use Floating-Point or MMX Instructions in Ring 0 or a VxD under Windows 95, http://developer.intel.com/drg/mmx/appnotes/). My device driver often crashed until I inserted the two statements in Listing Thirteen in front of its MMX code block. These calls preserve and restore the FP state, which is otherwise corrupted by the execution of MMX instructions.
Performance Measurement
A quick and dirty way of measuring the performance of critical code is to measure clock ticks before and after the code section. To improve the accuracy of measurement, the code should run in a loop. Also, the application priority should be boosted to its maximum value (Listing Fourteen) for more accurate results. It helps to terminate all running tasks and services and wait a few minutes to let the OS settle (flush disk buffers, complete page file I/O, and so on).
The ideal running time is 1-2 seconds. Longer running times increase chances of preemption and introduce OS interference jitter. Because Windows 9x is less sensitive to priority changes, the performance measurements done under this OS are less accurate and can vary by 10-20 percent between runs.
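For cycle-level resolution, the time-stamp counter can be read directly instead of using clock(); this sketch assumes Visual C++ inline assembly and a CPU that supports the rdtsc instruction:

static unsigned __int64 read_tsc(void)
{
    unsigned __int64 t;
    __asm {
        rdtsc                     ; edx:eax = clock ticks since reset
        mov   dword ptr t, eax
        mov   dword ptr t + 4, edx
    }
    return t;
}
/* usage: t1 = read_tsc(); ...timed code...; t2 = read_tsc();
   elapsed cycles = t2 - t1 */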
Code Samples
Many of the code optimization techniques discussed here were developed while building a real-time, PC-based ultrasound imaging system. Inside the imaging system, data is continuously acquired by a specially designed PCI interface card, which notifies the CPU that data is ready by raising a hardware interrupt. An interrupt handler located in a device driver reads the data as fast as possible and initiates another data-acquisition cycle. The interrupt handler then notifies a user-level GUI application by sending a message. The application must process and display the acquired data as fast as possible to achieve the maximum frame rate.
I initially wrote image processing functions in C and operated on 32-bit data. Because they were sluggish and could barely yield 2-3 frames per second on AMD's K6-2/350, I changed the data format to 8-bit and wrote all critical code using MMX instructions. This made a big difference -- frame rates soared to over 30 frames per second. In this section, I'll discuss some useful 8-bit signed image-processing functions and detail their implementation. The complete code that implements these techniques is available electronically; see "Resource Center," page 5.
90-Degree Image Rotation (Matrix Transposition)
Matrix transposition may seem trivial, but an efficient MMX implementation of it is not obvious. A detailed discussion of 16-bit square matrix in-place transposition and 16-bit rectangular matrix transposition can be found in Using MMX Instructions to Transpose a Matrix (http://developer.intel.com/drg/mmx/appnotes/).
The basic idea is that the matrix is split into square blocks of the same size, and the transposition of each block results in the transposition of the whole matrix. For 8-bit matrix transposition, the size of each block is 8×8 bytes (Figure 1). The width of the block is selected to match the MMX register size. The transposition is effected through the use of an MMX unpack operation (Figure 2). Because only one unpack operation can be executed per cycle, they were intermixed with other instructions to improve parallelism (code that illustrates this is available electronically). Matrix transposition cannot be done in place unless the matrix is square.
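The hand-scheduled assembly is available electronically; as a rough sketch of the unpack cascade, here is an equivalent 8×8-byte block transpose written with MMX intrinsics (three rounds of interleaving, eight unpacks each):

#include <mmintrin.h>

/* in[0..7] are the eight input rows; out[0..7] receive the columns. */
void transpose8x8(const __m64 in[8], __m64 out[8])
{
    /* Round 1: interleave the bytes of adjacent rows. */
    __m64 a0 = _mm_unpacklo_pi8(in[0], in[1]);   /* r00 r10 r01 r11 ... */
    __m64 a1 = _mm_unpackhi_pi8(in[0], in[1]);
    __m64 a2 = _mm_unpacklo_pi8(in[2], in[3]);
    __m64 a3 = _mm_unpackhi_pi8(in[2], in[3]);
    __m64 a4 = _mm_unpacklo_pi8(in[4], in[5]);
    __m64 a5 = _mm_unpackhi_pi8(in[4], in[5]);
    __m64 a6 = _mm_unpacklo_pi8(in[6], in[7]);
    __m64 a7 = _mm_unpackhi_pi8(in[6], in[7]);

    /* Round 2: interleave 16-bit pairs. */
    __m64 b0 = _mm_unpacklo_pi16(a0, a2);        /* r00 r10 r20 r30 ... */
    __m64 b1 = _mm_unpackhi_pi16(a0, a2);
    __m64 b2 = _mm_unpacklo_pi16(a1, a3);
    __m64 b3 = _mm_unpackhi_pi16(a1, a3);
    __m64 b4 = _mm_unpacklo_pi16(a4, a6);
    __m64 b5 = _mm_unpackhi_pi16(a4, a6);
    __m64 b6 = _mm_unpacklo_pi16(a5, a7);
    __m64 b7 = _mm_unpackhi_pi16(a5, a7);

    /* Round 3: interleave 32-bit halves; each result is one column. */
    out[0] = _mm_unpacklo_pi32(b0, b4);
    out[1] = _mm_unpackhi_pi32(b0, b4);
    out[2] = _mm_unpacklo_pi32(b1, b5);
    out[3] = _mm_unpackhi_pi32(b1, b5);
    out[4] = _mm_unpacklo_pi32(b2, b6);
    out[5] = _mm_unpackhi_pi32(b2, b6);
    out[6] = _mm_unpacklo_pi32(b3, b7);
    out[7] = _mm_unpackhi_pi32(b3, b7);
    _mm_empty();
}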
Vertical Decimation with Averaging
Vertical decimation by a factor of 2 with averaging reduces the image size, producing a new image in which each row is the sum of two original image rows divided by 2. This process exhibits inherent parallelism (eight 8-bit points can be added at once) and can be coded efficiently (code available electronically).
Summation requires conversion from 8-bit to 16-bit; otherwise, you encounter the effects of saturation. As mentioned in the previous section, 8-bit signed numbers are unpacked. The main loop is unrolled twice for better parallelism.
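The electronically available code is hand-written assembly; assuming signed 8-bit pixels, the core step looks like this in MMX intrinsics (unpacking into the high byte and shifting right arithmetically performs the 8- to 16-bit sign extension):

#include <mmintrin.h>

/* Average eight signed 8-bit pixels of two rows. */
__m64 average_rows(__m64 row0, __m64 row1)
{
    __m64 zero = _mm_setzero_si64();
    __m64 lo0 = _mm_srai_pi16(_mm_unpacklo_pi8(zero, row0), 8);
    __m64 lo1 = _mm_srai_pi16(_mm_unpacklo_pi8(zero, row1), 8);
    __m64 hi0 = _mm_srai_pi16(_mm_unpackhi_pi8(zero, row0), 8);
    __m64 hi1 = _mm_srai_pi16(_mm_unpackhi_pi8(zero, row1), 8);
    __m64 lo  = _mm_srai_pi16(_mm_adds_pi16(lo0, lo1), 1);   /* (a + b) / 2 */
    __m64 hi  = _mm_srai_pi16(_mm_adds_pi16(hi0, hi1), 1);
    return _mm_packs_pi16(lo, hi);    /* pack back to eight signed bytes */
}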
Horizontal Decimation
Horizontal decimation by a factor of 2 is a process of packing two 8-byte chunks together (Listing Fifteen). To do this, the high byte of each 16-bit value must be separated from the low byte, the two added together, and the results packed back into a single 8-byte value (code that does this is available electronically).
Horizontal Zoom
Horizontal zoom is part of a 2× image scaling algorithm described in Using MMX Instructions to Implement 2× 8-bit Image Scaling (http://developer.intel.com/drg/mmx/appnotes/). Vertical zoom is essentially row duplication, and its implementation has nothing to do with MMX.
Horizontal 2× and 4× zoom have some interesting points. First, both operations can be done in place. For this, the image has to be processed back to front; see Figure 3. In 2× zooming, as Listing Sixteen shows, each byte in an 8-byte chunk of the original image is duplicated using the unpack instruction (complete code is available electronically).
In 4× zooming, each byte in an 8-byte chunk of the original image is duplicated four times; see Listing Seventeen (the complete code is available electronically). Data is stored in reverse order (high bytes are written first) to match the reverse, back-to-front processing direction.
Digital Image Filtering
Digital filtering is a key operation for image compression and feature detection. Digital filtering of an input sequence s_j can be expressed as a discrete convolution of the input sequence with the FIR filter coefficients f_i, as in Figure 4, where d_j is the output (filtered) sequence, N is the number of elements in the input sequence, and m is the number of filter coefficients (taps). The convolution operation in Figure 4 is a mere sequence of multiply-accumulate operations (MACs). The number of MACs for each input sequence element equals the number of filter taps m, so the total number of MACs for the sequence is m×N. Considering that a typical filter has over 16 taps, this operation is extremely computationally expensive.
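Figure 4 is not reproduced here, but the convolution it refers to is presumably the standard FIR form:

d_j = sum(i = 0 .. m-1) f_i * s_(j+i),   j = 0, 1, ..., N-1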
Efficient MMX algorithms implementing FIR filtering can be found in Using MMX Technology Instructions to Compute a 16-Bit FIR Filter, Using MMX Instructions to Implement a Column Filter, and Using MMX Instructions to Implement a Row Filter (http://developer.intel.com/drg/mmx/appnotes/). The sequential image filtering (row filtering) algorithm should be optimized to avoid misaligned-data-access penalties. When targeting AMD K6-2/K6-III CPUs, however, this optimization may be unnecessary because the misaligned data access penalty is only one cycle.
Column filtering does not suffer from misaligned access because data is processed in 8-byte chunks and the chunks do not overlap (see Figure 5). Normally, row filters utilize pmadd instructions. However, column filters are faster to implement as a sequence of pmul and padd instructions.
When the source image consists of 8-bit values, the data must be unpacked for pmulhw/pmullw operations. To avoid the overhead of signed 8-bit value unpacking (that is, to eliminate shift operations), the data can be unpacked into high-order bytes; see Listing Eighteen. This multiplies the source data by 256. If the filter coefficients can also be scaled by 256, then the pmulhw instruction can be used for multiplication and no further result scaling is needed: pmulhw keeps only the high 16 bits of each 32-bit product, so (256×s × 256×f)/65536 = s×f (code that does this is available electronically).
Finally, pay attention to pmul instruction scheduling. Although it is possible to issue one such instruction every cycle, the results are available only after a 3-cycle delay. Thus, for optimal performance, the results should be referenced no sooner than 3 cycles after the pmul instruction is issued.
Proper scheduling of the pmul instructions in the original code resulted in a dramatic (20 percent) performance increase. No actual code changes were made except for the order of instructions.
Contrast Enhancement
Contrast enhancement is a useful operation in image processing (code available electronically). In general, contrast enhancement of a pixel value is done according to the formula d_j = s_j×v + c, where d_j is the resulting pixel value, s_j the original pixel value, and v and c are constants. (Assume in this example that the value of c is zero.)
Contrast enhancement of an image consisting of 8-bit grayscale signed values makes positive pixels brighter and negative pixels darker, while preserving zero-level pixels.
The contrast enhancement routine utilizes the same trick as the column filter -- signed 8-bit values are unpacked into high-order bytes and the value of v is scaled by 256. MMX saturation prevents pixel value wraparound and effectively blocks sudden color changes.
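A sketch of the routine in MMX intrinsics, assuming v has been prescaled by 256 and broadcast into all four 16-bit words of an MMX register (for example, 0x0180 in each word for v = 1.5):

#include <mmintrin.h>

/* d = s*v for eight signed 8-bit pixels; v_x256 holds v scaled by 256. */
__m64 enhance_contrast(__m64 pixels, __m64 v_x256)
{
    __m64 zero = _mm_setzero_si64();
    __m64 lo = _mm_unpacklo_pi8(zero, pixels);   /* s * 256 in each word */
    __m64 hi = _mm_unpackhi_pi8(zero, pixels);
    lo = _mm_mulhi_pi16(lo, v_x256);             /* (256*s * 256*v) >> 16 = s*v */
    hi = _mm_mulhi_pi16(hi, v_x256);
    return _mm_packs_pi16(lo, hi);               /* saturating pack blocks wraparound */
}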
Color Keying
Overlay transparency blitting is important for many graphics applications. In some cases, black (zero) may not be the best choice for transparent color. Thus, it might be useful to have an overlay blitting routine that accepts an arbitrary color key (that is, a value corresponding to transparent pixels).
As Listing Nineteen illustrates, the implementation of arbitrary color-key blitting is straightforward. The overlay pixels are compared against the color-key value and a mask is calculated. The mask is then applied to the overlay to clear transparent pixels. The same mask is applied to the destination pixels to clear everything except the overlay background area. The two results are then combined using a logical OR (code is available electronically). An efficient sprite overlay routine with a black color key can be found in Using MMX Instructions to Implement 2D Sprite Overlay (http://developer.intel.com/drg/mmx/appnotes/).
Conclusion
MMX code generation and optimization is a complex and time-consuming process requiring an understanding of the processor architecture and the specifics of different processor families. In brief, MMX code generation and optimization steps can be summarized as follows:
1. Determine minimal suitable precision for the processed values and the corresponding packed data type for SIMD processing.
2. Arrange data in the best way for SIMD processing (SoA/AoS, row-wise, and column-wise arrangements).
3. Produce straightforward MMX code.
4. Unroll loops and reorder/pair instructions to improve parallelism.
5. Schedule instructions to avoid latency cycle wasting.
6. Use an optimization tool (such as Intel's VTune Analyzer, http://developer.intel.com/vtune/) to fine-tune the code.
Where applications require fast but not necessarily time-critical processing, consider using Intel's Performance Library Suite (http://developer.intel.com/vtune/perflibst/), which offers a variety of image processing, mathematical, primitive recognition, and DSP functions -- and can be freely downloaded.
DDJ
Listing One
movq  mm0,[esi]    ; these two instructions won't execute in same cycle
add   eax,ebx

movq  mm0,mm1      ; these two would
add   eax,ebx
Listing Two
psraw      mm0,8   ; these two instructions won't execute in same cycle
punpckhbw  mm1,mm2
Listing Three
pmullw  mm0,mm1    ; these two instructions won't execute in same cycle
pmulhw  mm2,mm1
Listing Four
paddw   mm0,mm1    ; these two instructions won't execute in same cycle
pmulhw  mm2,mm0

movq    mm0,[esi]  ; these two would
movq    mm1,mm0
Listing Five
; mm2 = threshold, all memory references are L1 cache hits
M:  movq     mm0,[esi + ebx]    ; 1
    movq     mm1,mm0
    pcmpgtw  mm0,mm2            ; 2
    pand     mm1,mm0            ; 3
    movq     [esi + ebx],mm1    ; 4
    add      ebx,8              ; 5
    jnz      M
; total of 5*DataSize / 8 cycles
Listing Six
M:  movq     mm0,[esi + ebx]        ; 1
    movq     mm1,mm0
    movq     mm3,[esi + ebx + 8]    ; 2
    movq     mm4,mm3
    pcmpgtw  mm0,mm2                ; 3
    pcmpgtw  mm3,mm2
    pand     mm1,mm0                ; 4
    pand     mm4,mm3
    movq     [esi + ebx],mm1        ; 5
    movq     [esi + ebx + 8],mm4    ; 6
    add      ebx,16                 ; 7
    jnz      M
; total of 7*DataSize / 16 cycles
Listing Seven
M:  movq     mm0,[esi + ebx]         ; 1
    movq     mm1,mm0
    movq     mm3,[esi + ebx + 8]     ; 2
    movq     mm4,mm3
    pcmpgtw  mm0,mm2                 ; 3
    pcmpgtw  mm3,mm2
    pand     mm1,mm0                 ; 4
    add      ebx,16
    pand     mm4,mm3                 ; 5
    movq     [esi + ebx - 16],mm1
    movq     [esi + ebx - 8],mm4     ; 6
    jnz      M
; total of 6*DataSize / 16 = 3*DataSize / 8 cycles
Listing Eight
short *p, *pnew;
pnew = (short*)(((int)p + 7) & -8);   // ensure 64-bit alignment
Listing Nine
    ...
    nop    ; inserted to align the branch target, notice that nop is not a part of the loop
M:         ; and is executed once
    ...
    jz     M
Listing Ten
movq   mm0,[esi]                 ; load 8 pixels from current row
movq   mm1,[esi + image_width]   ; load 8 pixels from the next row
paddb  mm0,mm1                   ; summate
movq   [esi],mm0                 ; store
...                              ; repeat for each row, then move to the next column
Listing Eleven
; poor scheduling
pmullw  mm0,mm1    ; 3 cycle latency on Pentium II
paddw   mm2,mm0    ; this instruction will stall for 2 cycles

; optimal scheduling
pmullw  mm0,mm1    ; 3 cycle latency on Pentium II
MMX inst 1         ; do something (2nd cycle)
MMX inst 2         ; do something (3rd cycle)
paddw   mm2,mm0    ; this instruction will execute without delay
Listing Twelve
        include vmm.inc
        ...
        mov     eax,Time            ; time in ms
        mov     ebx,VMHandle
        VMMCall Adjust_Execution_Time
        ...
        mov     eax,PriorityBoost   ; use Time_Critical_Boost for best performance
        mov     ebx,VMHandle
        VMMCall Adjust_Exec_Priority
Listing Thirteen
CurrentThread = Get_Cur_Thread
VMCPD_GET_THREAD (CurrentThread, MyVxD_Buff)
... MMX instructions ...
VMCPD_SET_THREAD (CurrentThread, MyVxD_Buff)
Listing Fourteen
clock_t c1, c2;
...
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
c1 = clock();
for ( i = 0; i < M; i++ )
    testfunc();
c2 = clock();
printf("%g seconds", float(c2 - c1)/M/CLOCKS_PER_SEC);
Listing Fifteen
movq      mm0,[esi]       ; Load first 8-byte chunk
movq      mm1,mm0
movq      mm2,[esi + 8]   ; Load second 8-byte chunk
pand      mm0,mask        ; Clear high bytes of each 16-bit value in first chunk
                          ; mask = 00FF00FF00FF00FF
psrlw     mm1,8           ; Clear (sign extend) high bytes of the first chunk
movq      mm3,mm2
pand      mm2,mask        ; Clear high bytes of each 16-bit value in the second chunk
psrlw     mm3,8           ; Clear (sign extend) high bytes of the second chunk
paddsw    mm0,mm1         ; Add high bytes and low bytes together
paddsw    mm2,mm3
psraw     mm0,1           ; Divide the results by 2
psraw     mm2,1
packsswb  mm0,mm2         ; Pack two averaged 8-byte chunks into one
movq      [esi],mm0       ; Write back
Listing Sixteen
; mm0 = mm2 = 76543210
punpckhbw  mm0,mm0   ; duplicate high bytes: 77665544
punpcklbw  mm2,mm2   ; duplicate low bytes:  33221100
Listing Seventeen
punpckhbw  mm0,mm0   ; mm0 = 77665544
movq       mm1,mm0
punpcklbw  mm2,mm2   ; mm2 = 33221100
movq       mm3,mm2
punpckhwd  mm0,mm0   ; mm0 = 77776666
punpcklwd  mm1,mm1   ; mm1 = 55554444
punpckhwd  mm2,mm2   ; mm2 = 33332222
punpcklwd  mm3,mm3   ; mm3 = 11110000
Listing Eighteen
pxor       mm1,mm1
pxor       mm2,mm2
punpcklbw  mm1,mm0   ; mm0 = source data; low four bytes go to high-order bytes
punpckhbw  mm2,mm0   ; high four bytes go to high-order bytes
Listing Nineteen
movq     mm0,mm4        ; mm0 = [src], mm2 = [dest]
pcmpeqb  mm0,colorKey   ; mm0 = bitmask
pand     mm2,mm0        ; mm2 = [dest] AND bitmask
pandn    mm0,mm4        ; mm0 = NOT bitmask AND [src]
por      mm2,mm0        ; mm2 = mm2 OR mm0
Copyright © 1999, Dr. Dobb's Journal