C++ has more than caught up
Todd is a graduate student in systems design engineering at the University of Waterloo, Canada, and the author of the Blitz++ class library. He can be reached at [email protected].
C++ has a number of features that make it attractive for scientific computing, including templates for generic programming, operator overloading for expressiveness, and object orientation for abstraction and code reuse. Despite these features, the scientific computing community has been reluctant to adopt C++, partly because of performance problems. In the early 1990s, C++ programs were indeed slower than their Fortran counterparts -- typical benchmarks showed C++ lagging behind Fortran's performance by anywhere from 20 percent to a factor of 10. Users of C++ often rewrote the number-crunching parts of their programs in Fortran to get acceptable performance.
In the past five years, the performance of C++ programs has improved markedly due to a combination of better optimizing C++ compilers and new library techniques. In some recent benchmarks, C++ is even faster than Fortran. In this article, I'll explain these techniques and show benchmarks comparing Fortran to Blitz++, a C++ class library designed for scientific computing (http://monet.waterloo.ca/blitz/).
Why C++ Programs Are Faster Than They Used to Be
There are a number of reasons for C++'s improved performance in numeric programming, including expression templates, optimization of small objects, and better compilers.
Expression Templates. Operations on vectors, matrices, and arrays form the foundation of most scientific computing code. In C++, classes can customize the meaning of operators such as +, -, *, and /. This capability, called "operator overloading," enables C++ matrix/vector class libraries to provide a very natural notation. For example, w=x+y+z; might represent the summation of three vectors.
Operator overloading in C++ is pairwise: To evaluate this expression, C++ forces you to first evaluate t1=x+y, then t2=t1+z, and finally assign w=t2. The t1 and t2 objects are temporaries used to store intermediate results.
Pairwise expression evaluation for vectors is slow: Instead of a single loop to evaluate an expression, multiple loops and temporary vectors are generated. These temporary vectors are usually allocated dynamically, which causes a severe performance loss for small vectors. For example, suppose you evaluated w=x+y+z for vectors of length 3. The floating-point arithmetic takes only a handful of clock cycles; dynamically allocating and deleting the temporary vectors consumes hundreds (or even thousands) of cycles. Consequently, C++ matrix/vector libraries tended to be slow for operations on small objects.
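To see where the temporaries come from, consider a bare-bones version of a conventional (pre-expression-templates) vector class. This is my own illustrative sketch, not code from any particular library:

#include <cstddef>

// Illustrative only: a conventional vector class whose operator+
// returns its result by value, forcing a dynamically allocated
// temporary for every intermediate result.
class Vec {
public:
    explicit Vec(std::size_t n) : n_(n), data_(new double[n]) {}
    Vec(const Vec& v) : n_(v.n_), data_(new double[v.n_]) {
        for (std::size_t i = 0; i < n_; ++i) data_[i] = v.data_[i];
    }
    ~Vec() { delete [] data_; }
    Vec& operator=(const Vec& v) {
        for (std::size_t i = 0; i < n_; ++i) data_[i] = v.data_[i];
        return *this;
    }
    double& operator[](std::size_t i)       { return data_[i]; }
    double  operator[](std::size_t i) const { return data_[i]; }
    std::size_t length() const { return n_; }
private:
    std::size_t n_;
    double* data_;
};

// Each + allocates, fills, and returns a temporary vector, so
// w = x + y + z runs two separate loops and two heap allocations.
Vec operator+(const Vec& a, const Vec& b)
{
    Vec result(a.length());
    for (std::size_t i = 0; i < a.length(); ++i)
        result[i] = a[i] + b[i];
    return result;
}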
For large vectors, the main performance issue is memory bandwidth. Many scientific computing codes are constrained in performance not by floating-point arithmetic, but by the time required to ship data between main memory and the CPU. Suppose that each vector contains 1 MB of data. Reading x, y, and z and writing w moves only 4 MB, but the operator-overloaded version of w=x+y+z requires 8 MB of data to be transferred between the CPU and main memory: each temporary must be written out and then read back in, adding 2 MB per temporary. As a consequence, the speed is roughly half of what could be attained.
The pairwise evaluation problem has been solved by the expression templates technique. In an expression templates implementation, operators like "+" don't return intermediate results. From the compiler's point of view, the expression is still evaluated pairwise, but operators like "+" return partial parse trees of the expression (rather than intermediate results). The parse tree is encoded as a C++ class type using templates. When the parse tree is assigned to an object, the expression can be evaluated in one pass without having to generate intermediate results.
The technique has some overhead associated with expression parsing, but the performance for long vectors is just as good as hand-coded C. A good optimizing C++ compiler is able to eliminate this overhead completely, so that performance is reasonable for small vectors, too. (To get optimal performance, you have to unroll the loops completely for small vectors.)
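To make the idea concrete, here is a stripped-down sketch of an expression-templates implementation. The names are mine, and many details (operand type checking, scalar operands, mixed expressions) are omitted -- this is not the Blitz++ internals:

#include <cstddef>

// A Sum<L,R> node represents l + r without evaluating it;
// element i is computed on demand.
template<class L, class R>
class Sum {
public:
    Sum(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
private:
    const L& l_;
    const R& r_;
};

// Minimal vector participating in expressions (copying omitted for brevity).
class Vec {
public:
    explicit Vec(std::size_t n) : n_(n), data_(new double[n]) {}
    ~Vec() { delete [] data_; }
    double& operator[](std::size_t i)       { return data_[i]; }
    double  operator[](std::size_t i) const { return data_[i]; }
    std::size_t length() const { return n_; }

    // Assigning a parse tree evaluates the whole expression in a
    // single pass, with no temporary vectors.
    template<class Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < n_; ++i) data_[i] = e[i];
        return *this;
    }
private:
    std::size_t n_;
    double* data_;
};

// operator+ builds a tree node instead of computing a result.
// (A real library constrains L and R to vector/expression types.)
template<class L, class R>
Sum<L,R> operator+(const L& l, const R& r) { return Sum<L,R>(l, r); }

With these definitions, w=x+y+z; builds a Sum<Sum<Vec,Vec>,Vec> parse tree at compile time, and the templated operator= collapses it into one loop.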
Optimizations for Small Objects. In "Linear Algebra with C++ Template Metaprograms" (DDJ, August 1996), I described special optimizations for small vectors and matrices using the template metaprogram technique. Briefly summarized, it turns out that C++ compilers can be persuaded to do arbitrary computations at compile time by metaprograms that exploit the template instantiation mechanism. One good use of template metaprograms is creating specialized algorithms. As an example, consider Listing One, which calculates a 3×3 matrix-vector product.
Calculating the result requires 15 floating-point operations, but this bit of code will consume hundreds of clock cycles. Why? The Matrix and Vector objects must allocate their memory dynamically, which takes a lot of time. The product() function resides in a library somewhere, and likely consists of two nested loops to compute the matrix-vector product. When compilers optimize loop code, they often assume that the loop will be executed many times. The result is code that is very fast for large matrices, but mediocre for small matrices.
The solution I've adopted in the Blitz++ class library is to provide two versions of objects like Matrix and Vector. One version is optimized for large objects (Vector, for example), and the other is optimized for small objects (TinyVector). The Tiny... objects use template metaprograms to create very fast code. Listing Two is the matrix-vector product implemented with Tiny... objects, while Listing Three is a skeletal declaration of TinyVector. Putting the vector data inside the object allows it to reside on the stack, so the overhead of allocating memory is avoided. The function product() invokes a template metaprogram that creates a specialized matrix-vector multiplication algorithm. The metaprogram produces Listing Four.
Big blocks of uninterrupted floating-point code like this are exactly what's needed to get peak performance on modern pipelined, superscalar CPUs. The metaprogram version of the 3×3 matrix-vector product takes 13 clock cycles in sustained execution on an RS/6000 Model 43P. (It's this fast because matrix A is kept in registers, and combined multiply-add instructions are used.)
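To give a flavor of how such a metaprogram works, here is a minimal compile-time dot product. (This is an illustrative sketch of the technique, not the Blitz++ source.)

// The "loop" over I is unwound by recursive template
// instantiation, entirely at compile time.
template<int I>
struct MetaDot {
    static float f(const float* a, const float* b)
    { return a[I]*b[I] + MetaDot<I-1>::f(a, b); }
};

template<>
struct MetaDot<0> {
    static float f(const float* a, const float* b)
    { return a[0]*b[0]; }
};

inline float dot3(const float* a, const float* b)
{ return MetaDot<2>::f(a, b); }

The compiler expands dot3(a,b) into a[0]*b[0] + a[1]*b[1] + a[2]*b[2] -- straight-line code with no loop, no branch, and no loop counter to get in the optimizer's way.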
Better Compilers. A large part of C++'s performance problem was inadequate compilers. Happily, C++ compilers have come a long way in the past decade. To demonstrate this, consider KAI C++ (from Kuck and Associates, http://www.kai.com/), which has incredible optimization abilities. To illustrate, I'll examine some code that reflects a ray of light off a surface. I'll call the incident ray vector x, the reflected ray y, and the surface normal vector n; see Figure 1.
I'll work in a three-dimensional space, so it makes sense to use the class TinyVector, which does special optimizations for small vectors. Listing Five is the reflection function. The reflect() code looks very similar to the mathematical equation that it implements: y = x - 2(x·n)n. Behind this simplicity lies an elaborate implementation: First, a template metaprogram unrolls the dot product dot(x,n) into the code x[0]*n[0]+x[1]*n[1]+x[2]*n[2]. The vector arithmetic is then parsed using expression templates, producing temporary objects corresponding to partial parse trees. Evaluation of the expression for y[0], y[1], and y[2] is unrolled using another template metaprogram. KAI C++ is able to optimize away all the expression-template overhead using copy propagation and dead-code elimination, and produce code equivalent to Listing Six.
Examining this code, you see six multiplies, three additions, three subtractions, six loads, and three stores for a total of 21 operations. On an RS/6000, the assembly code generated by KAI C++ and the back-end compiler consists of only 20 opcodes. The data is kept in registers, and combined multiply-add instructions are used. This is very impressive considering how complex the expression-template machinery is. But it gets better! Suppose you want to calculate the reflection of a particular ray. Overloading the comma operator is a concise way to assign vector literals; see Listing Seven.
When this code is compiled, KAI C++ discards the temporary objects created by the overloaded comma operator, does the aforementioned optimizations for reflect(), and performs constant folding to evaluate arithmetic on constant values. The final code is equivalent to Listing Eight. The compiler calculates the reflection at compile time, and ditches the x and n vectors, since they're no longer needed. Now that's an optimizing compiler!
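The comma trick works because assignment binds more tightly than the comma operator: x = 1.00, 0.40, -1.00; parses as ((x = 1.00), 0.40), -1.00. Here is a minimal sketch of the mechanism, using hypothetical names (the real Blitz++ machinery is more elaborate):

// operator= stores the first value and returns a ListFiller;
// each subsequent comma stores the next value one position along.
template<class T>
class ListFiller {
public:
    explicit ListFiller(T* p) : p_(p) {}
    ListFiller operator,(T value)
    { *p_ = value; return ListFiller(p_ + 1); }
private:
    T* p_;
};

template<class T, int N>
class SmallVector {
public:
    ListFiller<T> operator=(T value)
    { data_[0] = value; return ListFiller<T>(data_ + 1); }
    T operator[](int i) const { return data_[i]; }
private:
    T data_[N];
};

With these definitions, SmallVector<double,3> v; v = 0.31, 0.20, 0.93; fills v[0], v[1], and v[2] in order.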
Benchmark Results
The benchmarks I'll now turn to were measured on a 100-MHz IBM RS/6000 Model 43P (PowerPC) with 16 KB of L1 cache and 256 KB of L2 cache. The C++ programs were compiled using KAI C++ at +K3 -O2, and Fortran 77 code was compiled with XL Fortran at -O3.
Linear Algebra. The first benchmark measures the performance of a DAXPY operation, short for "Double Precision A Times X Plus Y." This routine is the workhorse of many linear algebra operations, such as Gaussian elimination. Listing Nine is a DAXPY operation using the Blitz++ class library.
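Once the expression-template machinery is optimized away, the statement in Listing Nine should reduce to the same kernel a hand-coded DAXPY uses -- roughly:

// The single loop that y += a * x boils down to.
void daxpy(int N, double a, const double* x, double* y)
{
    for (int i = 0; i < N; ++i)
        y[i] += a * x[i];
}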
Figure 2 shows benchmark results for the Blitz++ classes TinyVector and Vector, Fortran 77, and the class valarray (Modena implementation) from the draft ISO/ANSI C++ Standard. The class TinyVector is optimized for very small vectors, so its performance is much better than the other implementations for vectors of length 1..10. The Vector class and the Fortran 77 implementation have some loop overhead, which makes their performance poorer for small vectors. As you can see, for longer vectors their performance is very similar. The sudden performance drop around N=1000 occurs because the vectors are no longer small enough to fit in the L1 cache.
The performance of valarray is typical of older C++ class libraries that use pairwise evaluation and dynamically allocated temporaries. Notice that for very small vectors, memory allocation overhead causes the performance to be awful -- about a tenth that of Vector and Fortran 77. Its peak performance is also terrible, since pairwise evaluation forces it to move twice as much memory around as necessary. The extra memory use causes valarray to experience the L1 cache crunch much sooner than Vector or Fortran 77.
Fortran 90, the successor to Fortran 77, was also benchmarked, but on this platform the compiler had difficulty optimizing the DAXPY operation -- the performance was much worse than Fortran 77.
Array Stenciling. Array stencils are common in signal processing and the solution of partial differential equations. A stencil updates an array element using a function of the elements in a local neighborhood. Listing Ten is a Fortran 77 stencil operation that smooths a 3D array.
The performance of a stencil operation is heavily constrained by memory accesses, so efficient cache use is critical. Most computers have (at least) two levels of cache: the L1 cache, which is part of the CPU, and the external L2 cache. Of these, the L1 cache is faster. So, for good performance, the L1 cache must be used as much as possible. In the straightforward Fortran implementation just mentioned, the L1 cache-hit rate is typically about 54 percent for a large array (this means that 54 percent of the data is found in the L1 cache; the remaining data has to be pulled in from the L2 cache or main memory). To get a higher L1 cache-hit rate, it's necessary to traverse the array in a different order. One approach is dividing the array into smaller subarrays, and applying the stencil to each subarray in turn. This is commonly called "tiling" or "blocking." Converting Listing Ten to a blocked version would require five (or six) nested loops. In some cases, Fortran compilers can transform an algorithm into a blocked version automatically, but often not for complicated stencils.
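As an illustration of blocking (this is my own sketch with an arbitrary block size, not code from the benchmark), a tiled version of the smoothing stencil looks roughly like this:

const int N = 64;      // array extent (illustrative)
const int TILE = 16;   // block edge, tuned per machine (illustrative)

// Three outer loops walk the tiles; three inner loops apply the
// stencil within one tile, improving L1 cache reuse.
void smooth(double A[N][N][N], const double B[N][N][N])
{
    for (int kk = 1; kk < N-1; kk += TILE)
    for (int jj = 1; jj < N-1; jj += TILE)
    for (int ii = 1; ii < N-1; ii += TILE)
        for (int k = kk; k < kk+TILE && k < N-1; ++k)
        for (int j = jj; j < jj+TILE && j < N-1; ++j)
        for (int i = ii; i < ii+TILE && i < N-1; ++i)
            A[i][j][k] = (B[i][j][k]
                + B[i+1][j][k] + B[i-1][j][k]
                + B[i][j+1][k] + B[i][j-1][k]
                + B[i][j][k+1] + B[i][j][k-1]) / 7.0;
}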
One problem with blocking is selecting the block size -- to get optimal performance, it has to be chosen carefully with the particular architecture and cache sizes in mind. In the Blitz++ library, I decided to adopt a slightly different approach, based on the Hilbert space-filling curve; see Figure 3. In this approach, there are two loops: an outer loop, which follows an approximation to the Hilbert space-filling curve (1), and an inner loop (2), which traverses the corresponding column of the array. The space-filling curve ensures good cache use, and the inner loop allows the CPU pipelines to pick up steam as the column is traversed.
This traversal ordering has a very nice property: Performance increases steadily for every extra bit of cache you have, resulting in good performance on all platforms. For most platforms, the L1 cache-hit rate is 70 to 76 percent (compared to only 54 percent for the Fortran 77 program previously described). This isn't the case for blocked algorithms, where performance makes abrupt jumps at critical cache sizes, making tuning very platform dependent.
Implementing the Hilbert curve traversal is complicated, requiring about a hundred lines of code. In Blitz++, this complexity is completely hidden inside the library. The user-level code looks like Listing Eleven.
Using the expression templates technique, this code is transformed into a Hilbert curve traversal. To implement a similar traversal in Fortran, you'd have to write a hundred lines of code, or rewrite the compiler.
Figure 4 shows a benchmark comparing the Blitz++ version of the 3D array stencil with the Fortran 77 version for a variety of array sizes. For very small arrays, Fortran 77 has the upper hand, because its code is much simpler. For larger arrays, the extra complexity of the Hilbert curve traversal pays off: The higher L1 cache hit rate translates into better performance.
Lattice Quantum Chromodynamics. The last benchmark I'll describe is drawn from Lattice Quantum Chromodynamics (QCD), a computational formulation of the theory of strong interactions in the Standard Model of particle physics. It's used to calculate particle properties (such as mass and charge) predicted by current theory. Lattice QCD is very demanding: These programs engage supercomputers for months or years at a time.
This benchmark is based on a Lattice QCD implementation from the Edinburgh Parallel Computing Centre. In the EPCC implementation, the bulk of CPU time is spent multiplying 3×3 and 3×2 matrices of complex numbers. This is an ideal application for the Blitz++ TinyMatrix class, which uses template metaprograms to create specialized algorithms for matrix-matrix multiplication. Listing Twelve is an initial version of the Blitz++ benchmark. The arguments res and src are vectors of 3×2 complex matrices; M is a vector of 3×3 complex matrices. Listing Thirteen is the initial Fortran 77 version.
Figure 5 shows the benchmark results for these two codes. The performance of the initial Fortran 77 version was disappointing due to the poor memory layout of the data, and the failure of the compiler to unroll the inner loops. The performance doubled after optimizing the memory layout and manually unrolling the inner loops (see the series "Fortran 77 tuned" in Figure 5), but still wasn't as good as Blitz++. Examination of the assembly code revealed that the Fortran compiler was suffering from "register spillage" -- it was unable to fit all the data in registers, so their contents had to be spilled out onto the stack. This is a compiler problem; with a better code generator, the Fortran version ought to be as fast as the initial Blitz++ version.
To tune the Blitz++ version, I took a step that is second nature to C++ programmers. I encapsulated the related objects in a single class; see Listing Fourteen. The revised qcdCalc() routine operates on a single vector of latticeUnit objects, rather than three separate vectors. This encapsulation ensures that related objects are stored nearby in memory, a property known as "locality of reference." Cache designs assume programs will exhibit locality of reference, so encapsulation tends to improve performance. In the QCD benchmark, encapsulation boosted performance by 15-20 percent (see the "Blitz++ tuned" series in Figure 5).
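A sketch of the revised routine might look like the following. (The update() member function is my own illustrative addition, building on the typedefs from Listing Twelve; the actual tuned code may differ.)

// One vector of latticeUnit objects keeps each site's spinors and
// gauge element adjacent in memory.
class latticeUnit {
public:
    void update() { two = product(ge, one); }
protected:
    spinor one;
    gaugeElement ge;
    spinor two;
};

void qcdCalc(Vector<latticeUnit>& lattice)
{
    for (int i = 0; i < lattice.length(); ++i)
        lattice[i].update();
}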
Conclusion
After nearly a decade of being dismissed as too slow for scientific computing, C++ has caught up with Fortran and is giving it stiff competition. The combination of new library design techniques (expression templates and template metaprograms) and better compilers means that programmers can enjoy the expressiveness and ease of C++ without paying a performance price.
Listing One
Matrix<float> A(3,3);
Vector<float> b(3), c(3);
// ...
c = product(A,b);
Listing Two
TinyMatrix<float,3,3> A;
TinyVector<float,3> b, c;
c = product(A,b);
Listing Three
template<class T, int N>
class TinyVector {
    ...
private:
    T data[N];
};
Listing Four
c[0] = A[0][0]*b[0] + A[0][1]*b[1] + A[0][2]*b[2];
c[1] = A[1][0]*b[0] + A[1][1]*b[1] + A[1][2]*b[2];
c[2] = A[2][0]*b[0] + A[2][1]*b[1] + A[2][2]*b[2];
Listing Five
typedef TinyVector<double,3> ray;

inline void reflect(ray& y, const ray& x, const ray& n)
{
    y = x - 2 * dot(x,n) * n;
}
Listing Six
double _t1 = x[0]*n[0] + x[1]*n[1] + x[2]*n[2];
double _t2 = _t1 + _t1;
y[0] = x[0] - _t2 * n[0];
y[1] = x[1] - _t2 * n[1];
y[2] = x[2] - _t2 * n[2];
Listing Seven
void test(ray& y)
{
    ray x, n;
    x = 1.00, 0.40, -1.00;
    n = 0.31, 0.20, 0.93;
    reflect(y, x, n);
}
Listing Eight
void test(ray& y)
{
    y[0] = 1.3348;
    y[1] = 0.6160;
    y[2] = 0.0044;
}
Listing Nine
Vector<double> x(N), y(N);
double a;
// ...
y += a * x;
Listing Ten
      DOUBLE PRECISION A(N,N,N), B(N,N,N)

      DO k=2, N-1
        DO j=2, N-1
          DO i=2, N-1
            A(i,j,k) = (B(i,j,k) + B(i+1,j,k) + B(i-1,j,k)
     .                + B(i,j+1,k) + B(i,j-1,k) + B(i,j,k+1)
     .                + B(i,j,k-1)) / 7.0
          END DO
        END DO
      END DO
Listing Eleven
const int N = 64;
Array<double,3> A(N,N,N), B(N,N,N);
// ...
Range I(1,N-2), J(1,N-2), K(1,N-2);
A(I,J,K) = (B(I,J,K) + B(I+1,J,K) + B(I-1,J,K)
          + B(I,J-1,K) + B(I,J+1,K)
          + B(I,J,K-1) + B(I,J,K+1)) / 7.0;
Listing Twelve
typedef TinyMatrix<complex<double>,3,3> gaugeElement;
typedef TinyMatrix<complex<double>,3,2> spinor;

void qcdCalc(Vector<spinor>& res, Vector<gaugeElement>& M,
    Vector<spinor>& src)
{
    for (int i=0; i < res.length(); ++i)
        res[i] = product(M[i], src[i]);
}
Listing Thirteen
      subroutine qcdf(M, res, src, V)
      integer V
      double complex M(V,3,3), res(V,3,2), src(V,3,2)

      DO site=1,V
        DO spin=1,2
          DO col=1,3
            res(site,col,spin) = M(site,col,1) * src(site,1,spin)
     .        + M(site,col,2) * src(site,2,spin)
     .        + M(site,col,3) * src(site,3,spin)
          ENDDO
        ENDDO
      ENDDO

      return
      end
Listing Fourteen
class latticeUnit {
    ...
protected:
    spinor one;
    gaugeElement ge;
    spinor two;
};
DDJ
Copyright © 1997, Dr. Dobb's Journal