The Benchmarks
To illustrate how array organization can affect performance, I use three examples that calculate area and volume (source-code listings are available at www.ddj.com/code/):
- Triangle area, 8 FLOPs/Triangle; see Listing Two online.
- 2D Quadrilateral area, 8 FLOPs/Quadrilateral; see Listing Three online.
- 3D Brick volume, 60 FLOPs/Brick (Hexahedron); see Listing Four online.
I apply these calculations to hundreds of thousands of mesh elements. I include enough elements in my benchmark to have a memory footprint of over 20 megabytes, but I organize elements in a cache-optimal way, so cache reuse occurs.
My benchmarks have two interacting array classes. I use a Point class to store coordinates, and a shape-specific class to store the point indices that define the shape.
The examples I've chosen have subtly different memory layouts and memory access patterns. The subtlety helps to emphasize how the interplay of algorithms and data layouts can influence the effectiveness of compiler optimizations in addition to memory latency.
The Results
Figures 2, 3, and 4 show the performance of Listings Two, Three, and Four, respectively. For each bar in the graphs, 20 runs were made, and the minimum time was used. There was little variance among the 20 runs since each run was made on a "dedicated" processor having no other users. Table 1 provides the processor/compiler details of each benchmark.
All results within a given bar color are normalized against the minimum time for that color. This lets you measure the relative performance of a given data layout for a given environment.
Comparing different bar colorings to determine the fastest compiler won't work. When interpreting results, you should pick two bar colorings, then compare how those colors interact across all four memory layouts. For example, when looking at Figure 3, you can compare the Core2/pathscale3.0 results to the Power5/xlC7.0 results to see that the fastest memory layout for one configuration is the slowest for the other. The performance difference here is as much as 13 percent. By using performance portable array classes, you can get the best performance in both environments.
Mnemonic | Processor | Compiler | Compiler Options | |
Core2/icpc10.0 | Intel Xeon E5345 | Intel | v10.0 | -fast |
Core2/g++4.1.1 | GNU | v4.1.1 | -O3 -mfpmath=sse -static | |
Core2/PGI6.2-3 | PGI | v6.2-3 | -fast -Mipa=fast,inline -Bstatic | |
Core2/path3.0 | Pathscale | v3.0 | -Ofast -static | |
Opteron/g++4.1.1 | AMD 8216 | GNU | v4.1.1 | -O3 -mfpmath=sse -static |
Opteron/PGI6.2-3 | PGI | v6.2-3 | -fast -Mipa=fast,inline -Bstatic | |
Opteron/path3.0 | Pathscale | v3.0 | -Ofast -static | |
Itanium2/icpc10.0 | Intel | Intel | v10.0 | -fast |
Itanium2/g++3.4.4 | GNU | v3.4.4 | -O3 -static | |
Power5/xlC7.0 | IBM | xlC | v7.0 | -O5 -qnoeh |
Power5/g++3.3.3 | GNU | v3.3.3 | -O3 -static |
Figure 4 is also interesting. Looking at the graph, the most efficient arrangement of data is to implement the Point class as individual arrays, and the Brick connectivity class as an array of structs. Compare this to the layout where Point and Brick are both implemented as arrays of structs. The performance degradation for using a Struct-like Point class with the icpc v10.0 compiler on the Itanium2 is about 36 percent. This is surprising because many scientific applications use Struct-like Point classes almost exclusively.
Also, the graphs show that compiling a given benchmark with the same compiler but on two different architectures gives different results. Looking at Figure 2, a comparison of the Core2/PGI6.2-3 with the Opteron/PGI6.2-3 shows a performance divergence of up to 26 percent when using exactly the same compiler, data layouts, and compiler options. This fact only becomes evident when compiling on two different machines. This example reveals a key advantage of flexible data structures. The ability to switch data structures lets you catch performance problems related to data layout that may not be obvious through profiling tools. The spikes in some of the graphs show clear performance problems for some configurations, but a profiler may not catch that effect because the usage of a given data structure may be spread uniformly throughout the code.