Tools

Performance Portable C++

By Jeff Keasler, May 07, 2008

Performance portability means that code can achieve good performance across a range of computer architectures while maintaining a single body of source code.

The Benchmarks

To illustrate how array organization can affect performance, I use three examples that calculate area and volume (source-code listings are available at www.ddj.com/code/):

Triangle area, 8 FLOPs/Triangle; see Listing Two online.
2D Quadrilateral area, 8 FLOPs/Quadrilateral; see Listing Three online.
3D Brick volume, 60 FLOPs/Brick (Hexahedron); see Listing Four online.

I apply these calculations to hundreds of thousands of mesh elements. I include enough elements in my benchmark to have a memory footprint of over 20 megabytes, but I organize elements in a cache-optimal way, so cache reuse occurs.

My benchmarks have two interacting array classes. I use a Point class to store coordinates, and a shape-specific class to store the point indices that define the shape.

The examples I've chosen have subtly different memory layouts and memory access patterns. The subtlety helps to emphasize how the interplay of algorithms and data layouts can influence the effectiveness of compiler optimizations in addition to memory latency.

The Results

Figures 2, 3, and 4 show the performance of Listings Two, Three, and Four, respectively. For each bar in the graphs, 20 runs were made, and the minimum time was used. There was little variance among the 20 runs since each run was made on a "dedicated" processor having no other users. Table 1 provides the processor/compiler details of each benchmark.

All results within a given bar color are normalized against the minimum time for that color. This lets you measure the relative performance of a given data layout for a given environment.

Comparing different bar colorings to determine the fastest compiler won't work. When interpreting results, you should pick two bar colorings, then compare how those colors interact across all four memory layouts. For example, when looking at Figure 3, you can compare the Core2/pathscale3.0 results to the Power5/xlC7.0 results to see that the fastest memory layout for one configuration is the slowest for the other. The performance difference here is as much as 13 percent. By using performance portable array classes, you can get the best performance in both environments.

Mnemonic	Processor	Compiler		Compiler Options
Core2/icpc10.0	Intel Xeon E5345	Intel	v10.0	-fast
Core2/g++4.1.1		GNU	v4.1.1	-O3 -mfpmath=sse -static
Core2/PGI6.2-3		PGI	v6.2-3	-fast -Mipa=fast,inline -Bstatic
Core2/path3.0		Pathscale	v3.0	-Ofast -static
Opteron/g++4.1.1	AMD 8216	GNU	v4.1.1	-O3 -mfpmath=sse -static
Opteron/PGI6.2-3		PGI	v6.2-3	-fast -Mipa=fast,inline -Bstatic
Opteron/path3.0		Pathscale	v3.0	-Ofast -static
Itanium2/icpc10.0	Intel	Intel	v10.0	-fast
Itanium2/g++3.4.4	Intel	GNU	v3.4.4	-O3 -static
Power5/xlC7.0	IBM	xlC	v7.0	-O5 -qnoeh
Power5/g++3.3.3	IBM	GNU	v3.3.3	-O3 -static

Table 1: Processor/compiler details of each benchmark. Each test was compiled on the same processor it was run on, except the Core2 runs that had to be compiled on a 2.4-GHz Intel Xeon using an E75xx chipset.

Figure 2: Point class/Triangle class.

Figure 3: Point class/Quadrilateral class.

Figure 4 is also interesting. Looking at the graph, the most efficient arrangement of data is to implement the Point class as individual arrays, and the Brick connectivity class as an array of structs. Compare this to the layout where Point and Brick are both implemented as arrays of structs. The performance degradation for using a Struct-like Point class with the icpc v10.0 compiler on the Itanium2 is about 36 percent. This is surprising because many scientific applications use Struct-like Point classes almost exclusively.

Figure 4: Point class/Brick class.

Also, the graphs show that compiling a given benchmark with the same compiler but on two different architectures gives different results. Looking at Figure 2, a comparison of the Core2/PGI6.2-3 with the Opteron/PGI6.2-3 shows a performance divergence of up to 26 percent when using exactly the same compiler, data layouts, and compiler options. This fact only becomes evident when compiling on two different machines. This example reveals a key advantage of flexible data structures. The ability to switch data structures lets you catch performance problems related to data layout that may not be obvious through profiling tools. The spikes in some of the graphs show clear performance problems for some configurations, but a profiler may not catch that effect because the usage of a given data structure may be spread uniformly throughout the code.

Previous 1 2 3 4 5 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Tools

Performance Portable C++

The Benchmarks

The Results

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Tools Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Tools

Performance Portable C++

The Benchmarks

The Results

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Tools Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content