Introduction
Many general-purpose microprocessors today feature multimedia extensions that provide SIMD (single-instruction, multiple-data) instructions on relatively short vectors. By processing multiple data elements in parallel, these extensions provide a convenient way to exploit data parallelism in scientific, engineering, or graphical applications that apply a single operation to all elements in a data set, such as a vector or matrix. One effective, but tedious and non-portable, way to exploit multimedia extensions is to hand code time-consuming parts of an application by means of in-line assembly or intrinsic functions. An alternative approach is to let a "vectorizing compiler" automatically take advantage of the multimedia extensions available on the target architecture. Vectorizing compilers date back to the early days of supercomputers when, with a traditional focus on Fortran programs, the automatic conversion of serial code into vector instructions was essential to fully exploit the pipelined functional units of a vector processor. Although the target architectures of these early vectorizing compilers differ substantially from modern microprocessors with multimedia extensions, much of the vectorization methodology developed in the past can be put to use in today's compilers that target multimedia instruction sets.
To show the potential performance boost of automatic vectorization, Table 1 compares the performance (in MFLOPS) of Alfred Aburto's FLOPS benchmark [1] on a 2.66 GHz Intel Pentium 4 processor, compiled with the Intel C/C++ compiler with default optimizations enabled (using the compiler switch "/O2") and with vectorization that specifically targets SSE/SSE2 (Streaming SIMD Extensions) enabled (using the compiler switch "/QxW"). The speedup numbers show that vectorization more than doubles the performance of five of the eight modules in this benchmark. The performance of module 2 (not vectorized) is unchanged, while modules 1 and 7 (dominated by floating-point divisions) show only moderate gains.
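For reference, a compile line of the following form could be used on Windows (a sketch: icl is the Windows command-line driver of the Intel C/C++ compiler, and the source file name flops.c is an assumption):

  icl /O2 /QxW flops.c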
Table 1 clearly shows that a vectorizing compiler can boost the performance of an application without any assistance from the programmer. Even so, vectorizing compilers are still met with skepticism by programmers who believe that the best (if not the only) way to obtain high performance is hand coding, even if this implies that most of the work has to be repeated for each new generation of multimedia extensions. Although this reasoning occasionally has some merit, in this article we try to reduce the skepticism by providing a few programming guidelines that can greatly improve the effectiveness of vectorizing C/C++ compilers. Most of the guidelines apply to vectorizing compilers in general; the syntax of a few hints, however, is specific to the Intel C/C++ compiler, a compiler for Windows and Linux that targets the Intel MMX technology and SSE/SSE2.
Data-Dependence Considerations
The semantics of a sequential programming language like C or C++ impose a total order on the execution of statements in a program. Central to converting implicit parallelism in a program into an explicit form (like SIMD instructions) is the observation that statements without read-after-write, write-after-read, or write-after-write conflicts can be executed in any order without affecting the final outcome of the program. Therefore, vectorizing compilers start with some form of data-dependence analysis to find a more relaxed order on the execution of statements that still preserves the semantics of the original program. Data-dependence analysis has been well studied over the years, and current compilers use rather advanced analysis methods. Unfortunately, these methods are usually a lot more effective for languages like Fortran (with more restrictions on potentially aliased data) than for languages like C and C++ where the use of pointer variables can completely obscure the flow of data.
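As a minimal illustration (not taken from the examples that follow), the loop below carries a read-after-write dependence: each iteration reads the value written by the previous iteration, so the iterations cannot be reordered or executed in parallel without changing the result:

for (i = 1; i < n; i++)
  a[i] = a[i-1] + 1; /* reads the value stored in the previous iteration */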
Consider, for example, the function copy:
void copy(char *p, char *q, int n) {
  int i;
  for (i = 0; i < n; i++)
    p[i] = q[i];
}
Without more information, a vectorizing compiler must conservatively assume that the memory regions accessed by the pointer variables p and q may (partially) overlap, which gives rise to potential data dependencies that prohibit straightforward conversion of this loop into SIMD instructions. At this point, the compiler may decide to keep the loop serial or, as done by the Intel C/C++ compiler, generate a run-time test for overlap, where the loop in the true-branch can be converted into SIMD instructions:
if (p + n < q || q + n < p)
  /* vector loop */
  for (i = 0; i < n; i++)
    p[i] = q[i];
else
  /* serial loop */
  for (i = 0; i < n; i++)
    p[i] = q[i];
In our experience, run-time data-dependence testing provides a generally effective way to exploit implicit parallelism in C or C++ code at the expense of a slight increase in code size and testing overhead (hopefully justified by the gains obtained from using SIMD code). If the function copy is only used in specific ways, however, the programmer can assist the vectorizing compiler as follows.
First, if the function is mainly used for small values of n or for overlapping memory regions, the programmer can simply prevent vectorization and, hence, the corresponding run-time overhead by inserting a #pragma novector hint before the loop. Conversely, if the loop is guaranteed to operate on non-overlapping memory regions, this information can be propagated to the compiler by means of a #pragma ivdep hint before the loop, which informs the compiler that conservatively assumed data dependencies that prevent vectorization can be ignored. This will result in vectorization of the loop without run-time data-dependence testing. Alternatively, the programmer may use the restrict keyword in the declarations of p and q, as shown below, to inform the compiler that each pointer variable provides exclusive access to a certain memory region. (Using such a language extension may require an extra compiler switch, such as -Qrestrict for the Intel C/C++ compiler.)
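For instance, under the assumption that copy is never called with overlapping memory regions, the loop can be annotated as shown below, and the compiler will vectorize it without generating the run-time overlap test (the #pragma novector hint is placed in the same position):

void copy(char *p, char *q, int n) {
  int i;
#pragma ivdep
  for (i = 0; i < n; i++)
    p[i] = q[i];
}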
void copy(char * restrict p, char * restrict q, int n) {
  int i;
  for (i = 0; i < n; i++)
    p[i] = q[i];
}
This latter method is convenient when the exclusive access property holds for pointer variables that are used in a large portion of code with many loops, because it avoids the need to annotate each vectorizable loop individually. Note, however, that both the loop-specific #pragma ivdep hint and the pointer-specific restrict hint must be used with care, since incorrect usage may change the semantics intended in the original program.
As a final, slightly contrived example of the worst-case data-dependence assumptions a vectorizing compiler must sometimes make, consider the following function reset:
int n;
. . .
void reset(int *x) {
  int i;
  for (i = 0; i < n; i++)
    x[i] = 0;
}
Although, at first glance, this function seems to implement a loop that always resets n integers in memory to zero, note that it could also be invoked as reset(&n), which, for any initial value n > 0, causes the loop to iterate only once: the first assignment x[0] = 0 sets n itself to zero (unlike Fortran, where the bounds of a DO loop are evaluated only on entry). This alias between x[0] and n prevents straightforward vectorization of the loop. An obvious way to inform the compiler that such (or similar) situations will not occur is to use a local variable as the loop bound, so that local analysis alone can prove data independence (since the address of the local variable loc_n is never taken):
void reset(int *x) {
  int i, loc_n = n;
  for (i = 0; i < loc_n; i++)
    x[i] = 0;
}
Alignment Considerations
Most multimedia extension instruction sets are rather sensitive to alignment. The data movement instructions of SSE/SSE2, for example, operate much more efficiently on data that is aligned at a 16-byte boundary in memory. Therefore, the success of a vectorizing compiler also depends on its ability to select an appropriate data layout that, in combination with code restructuring (like loop peeling), results in aligned memory accesses throughout the program. In cases where the compiler has made sub-optimal alignment decisions, however, the programmer can use the directive __declspec(align(base, offset)), where 0 <= offset < base and base is a power of two, to allocate a data structure at offset from a certain base; that is, the property a MOD base = offset holds for the address a at which the data structure is allocated.
Suppose, as an example, that most of the execution time of an application is spent in a loop of the following form:
double a[N], b[N];
. . .
for (i = 0; i < N; i++)
  a[i+1] = b[i] * 3;
If the first element of both arrays is aligned at a 16-byte boundary, then either an unaligned load of elements from b or an unaligned store of elements into a must be used after vectorization. (Note that peeling off an iteration would not help in this case: making the stores into a aligned would make the loads from b unaligned, and vice versa.) However, the programmer can enforce the alignment shown below, which results in two aligned access patterns after vectorization (assuming an 8-byte size for doubles):
__declspec(align(16, 8)) double a[N];
__declspec(align(16, 0)) double b[N]; /* or simply "align(16)" */
If pointer variables are used, the compiler is usually not able to determine the alignment of access patterns at compile time. Consider the following simple fill function:
void fill(char *x) {
  int i;
  for (i = 0; i < 1024; i++)
    x[i] = 1;
}
Without more information, the compiler cannot make any assumption on the alignment of the memory region accessed by this loop. At this point, the compiler may decide to vectorize this loop using unaligned data movement instructions or, as done by the Intel C/C++ compiler, generate the run-time alignment optimization shown here:
peel = (size_t)x & 0x0f; /* cast: the mask applies to the address, not the char* */
if (peel != 0) {
  peel = 16 - peel;
  /* run-time peeling loop */
  for (i = 0; i < peel; i++)
    x[i] = 1;
}
/* adjusted loop with aligned access */
for (i = peel; i < 1024; i++)
  x[i] = 1;
Again, our experience is that this run-time optimization provides a generally effective way to obtain aligned access patterns at the expense of a slight increase in code size and testing. If incoming access patterns are guaranteed to be aligned at a 16-byte boundary, however, the programmer can avoid this overhead by placing the hint __assume_aligned(x, 16); in the function to convey this information to the compiler. Note that this hint, too, must be used with care, because executing an aligned data movement instruction on unaligned data raises an exception on SSE/SSE2.
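For instance, if every caller of fill is known to pass a 16-byte aligned buffer, the function can be annotated as follows (a sketch; the alignment guarantee itself must be established at the allocation sites):

void fill(char *x) {
  int i;
  __assume_aligned(x, 16); /* promise to the compiler: x is 16-byte aligned */
  for (i = 0; i < 1024; i++)
    x[i] = 1;
}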
Architecture-Specific Considerations
Familiarity with a specific target architecture can guide design and implementation decisions into a direction that eventually results in code that is much more amenable to effective vectorization. For example, since the data movement instructions of most multimedia extensions favor unit stride memory references over non-unit stride accesses (which must be "shuffled" or "packed" into place first), transposing the layout of a two-dimensional array or selecting a structure-of-arrays layout rather than an array-of-structures layout may greatly increase the number of unit stride vector loops in a program (see the sketch after the following list). Other general guidelines that can be given in this context are:
- To use the lowest possible precision for data types to maximize potential SIMD width. (If only 16 bits are significant, using a short rather than an int can make the difference between eight-way and four-way SIMD parallelism.)
- To avoid mixing data types to obtain uniform vector lengths.
- To avoid operations not supported in hardware (e.g., the Intel MMX technology does not support byte shifts).
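To illustrate the layout guideline, the following sketch (with hypothetical type and variable names) contrasts the two layouts. Accessing one field of consecutive elements in the array-of-structures walks memory with a stride of three floats, whereas the structure-of-arrays yields unit stride accesses:

/* array-of-structures: p_aos[i].x has non-unit stride */
struct point { float x, y, z; };
struct point p_aos[1024];

/* structure-of-arrays: p_soa.x[i] has unit stride */
struct points { float x[1024], y[1024], z[1024]; };
struct points p_soa;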
As a final comment, vectorizing compilers usually have some built-in efficiency heuristics to decide whether vectorization will be beneficial. The Intel C/C++ compiler, for instance, will disable vectorization of loops with many unaligned or non-unit stride access patterns. However, if experimentation reveals that vectorization will still improve performance, the programmer can override this behavior with a #pragma vector always hint before the loop, which enforces vectorization of any loop regardless of the outcome of the efficiency analysis (provided, of course, that vectorization is legal).
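As a sketch, the hint is placed directly before the loop. In the following fragment (assuming double arrays a and b whose alignment the compiler cannot determine), the potentially unaligned access patterns might otherwise cause the efficiency heuristics to reject vectorization:

#pragma vector always
for (i = 0; i < n; i++)
  a[i+1] = b[i] * 3;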
Vectorizing Compilers
Examples of academic compilers that target multimedia extensions are:
Examples of commercial compilers that target multimedia extensions are:
Conclusions
The growing popularity of multimedia extensions to today's general-purpose microprocessors has renewed the interest in vectorizing compilers. Popular programming languages like C and C++ pose new challenges for such compilers, such as dealing effectively with pointer variables. Although the ultimate goal of any vectorizing compiler is to obtain the highest possible performance without any assistance from the programmer, in this article we have provided a few programming guidelines that can greatly increase the effectiveness of a vectorizing compiler like the Intel C/C++ compiler. A detailed overview of the vectorization techniques used by this compiler is provided in [2].
Notes
[1] The FLOPS benchmark was written by Alfred Aburto ([email protected]).
[2] For more information on the vectorization methods used by the Intel C/C++/Fortran compiler, see "Efficient Exploitation of Parallelism on Pentium III and Pentium 4 Processor-Based Systems" (Intel Technology Journal, Q1 2001, <http://intel.com/technology/itj/>) and "Automatic Intra-Register Vectorization for the Intel Architecture" (International Journal of Parallel Programming, April 2002, <www.kluweronline.com/issn/0885-7458>).
About the Authors
Aart Bik received his M.Sc. degree in Computer Science from Utrecht University, The Netherlands, in 1992, and his Ph.D. degree from Leiden University, The Netherlands, in 1996. In 1997, he was a post-doctoral researcher at Indiana University, Bloomington, Indiana, where he conducted research in high-performance compilers for the Java programming language. In 1998, he joined Intel Corporation where he is currently working in the vectorization and parallelization team of the Software and Solutions Group. His email is [email protected].
Milind Girkar received his B.Tech. degree from the Indian Institute of Technology, Mumbai, his M.Sc. degree from Vanderbilt University, and his Ph.D. degree from the University of Illinois at Urbana-Champaign, all in Computer Science. Currently, he manages the IA-32 compiler team in Intel's Software and Solutions Group. Before joining Intel, he worked on a compiler for the UltraSPARC platform at Sun Microsystems. His email is [email protected].
Paul Grey received his B.Sc. degree in Applied Physics from the University of the West Indies and his M.Sc. degree in Computer Engineering from the University of Southern California. Currently he is working in Intel's Software and Solutions Group, researching compiler optimizations for parallel computing. Before joining Intel, he worked on parallel compilers, parallel programming tools, and graphics system software at Kuck and Associates, Inc., Sun Microsystems, and Silicon Graphics. His email is [email protected].
Xinmin Tian is currently working in the vectorization and parallelization team at Intel Corporation, where he works on compiler parallelization and optimization and manages the OpenMP parallelization group. He holds B.Sc., M.Sc., and Ph.D. degrees in Computer Science from Tsinghua University, and was a postdoctoral researcher in the School of Computer Science at McGill University, Montreal. Before joining Intel, he worked on a parallelizing compiler, code generation, and performance optimization at IBM. His email is [email protected].