Conclusion
Performance portability is likely to become a necessary part of programming in the near future. Already, some compilers generate good SSE code for the x86 processor architecture only when stride-one vectors (array-like layouts) are used. At the same time, older superscalar machines usually perform best when struct-like layouts are used as often as possible. Without the flexibility to switch memory layouts, performance almost certainly suffers as code is ported, and the penalty is likely to grow as architectures continue to diverge.
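As a reminder of what switching layouts means in practice, here is a minimal sketch in C. It assumes a hypothetical `Points` container and hypothetical accessor macros `PX`, `PY`, and `PZ`; none of these names come from the project described in this article. Compiling with `-DLAYOUT_ARRAY` selects the array-like (stride-one) layout, and the default is the struct-like layout. The loop in `sum_squares` is written once and does not change.

```c
#include <assert.h>
#include <stddef.h>

#define N 8  /* hypothetical number of points */

#ifdef LAYOUT_ARRAY
/* Array-like layout: each component is a stride-one vector. */
typedef struct { double x[N], y[N], z[N]; } Points;
#define PX(p, i) ((p)->x[i])
#define PY(p, i) ((p)->y[i])
#define PZ(p, i) ((p)->z[i])
#else
/* Struct-like layout: the components of one point sit side by side. */
typedef struct { double x, y, z; } Point;
typedef struct { Point pt[N]; } Points;
#define PX(p, i) ((p)->pt[i].x)
#define PY(p, i) ((p)->pt[i].y)
#define PZ(p, i) ((p)->pt[i].z)
#endif

/* Fill the container with simple test values through the accessors. */
void fill_points(Points *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < N; ++i) {
        PX(p, i) = (double)i;
        PY(p, i) = 1.0;
        PZ(p, i) = 2.0;
    }
}

/* Sum of squared distances from the origin; identical for both layouts. */
double sum_squares(const Points *p)
{
    double s = 0.0;
    for (size_t i = 0; i < N; ++i)
        s += PX(p, i) * PX(p, i) + PY(p, i) * PY(p, i) + PZ(p, i) * PZ(p, i);
    return s;
}
```

Because all access goes through the macros, the choice of layout becomes a build-time decision that can be made per compiler and per machine rather than a rewrite.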
The technique I describe here is used in a large multiphysics project that encompasses hundreds of thousands of lines of code. The technique was originally used more as a refactoring tool than as a performance portability tool. At one point, the project was able to quickly refactor the hydrodynamics portion of its physics code for a 42-100 percent speedup, depending on the problem being solved and the machine being used. Hydrodynamics can be the dominant portion of the runtime for many physics applications, so as much as doubling the performance with just a change of data structures is impressive. Part of the performance gain probably came from the compiler recognizing extra optimizations that could be applied (the compiler flags were unchanged), and the rest came from different cache latency characteristics.
Thanks to Brian McCandless for questioning me during a presentation on the sidebar material, and inspiring me to write this article. His group uses a variant of this technique in their software.
References
Triangle and Quadrilateral areas (softsurfer.com/algorithms.htm).
Grandy, Jeff. "Efficient Computation of Volume of Hexahedral Cells," Lawrence Livermore National Laboratory UCRL-ID-128886, Oct 30, 1997.
Next-Generation Performance Portability
To take full advantage of multicore, NUMA, and heterogeneous technologies, lots of code will likely need to be rewritten, especially in the High Performance Computing (HPC) arena.
To mitigate the impact of this shift to new architectures, I'm investigating the use of source-to-source translation as a possible salvation. I've used a sophisticated tool called ROSE (www.roseCompiler.org) to write a Tiny C language extension that implements the techniques in this article in a way that is transparent to users. All the drawbacks I mention in the article go away, and new features emerge, especially with respect to multidimensional arrays.
Users create a single schema file describing the struct/array grouping of array names used within a code. The compiler can then use the schema to generate structs or arrays as appropriate at compile time. The groupings can be hierarchical, thus subtle relationships among groups of arrays can easily be captured.
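To make the idea of hierarchical grouping concrete, the following hand-written C sketch shows one layout such a grouping might correspond to. It is not output from the ROSE-based tool and does not show its schema syntax; the names `Position`, `Velocity`, `Zones`, `init_zones`, and `kinetic_energy` are hypothetical. Position components form one struct-like group, velocity components form another, and mass stays a plain stride-one array.

```c
#include <assert.h>
#include <stddef.h>

#define NZONES 4  /* hypothetical number of zones */

/* Inner groups: components that are usually accessed together. */
typedef struct { double x, y, z; } Position;
typedef struct { double vx, vy, vz; } Velocity;

/* Outer group: each inner group is stored as its own array,
   and mass is kept as a separate stride-one array. */
typedef struct {
    Position pos[NZONES];
    Velocity vel[NZONES];
    double   mass[NZONES];
} Zones;

/* Fill every zone with the same simple test values. */
void init_zones(Zones *z)
{
    assert(z != NULL);
    for (size_t i = 0; i < NZONES; ++i) {
        z->pos[i]  = (Position){ 0.0, 0.0, 0.0 };
        z->vel[i]  = (Velocity){ 1.0, 2.0, 2.0 };
        z->mass[i] = 2.0;
    }
}

/* Total kinetic energy: touches only the velocity group and mass,
   so the position data is never pulled through the cache. */
double kinetic_energy(const Zones *z)
{
    double e = 0.0;
    for (size_t i = 0; i < NZONES; ++i) {
        const Velocity *v = &z->vel[i];
        e += 0.5 * z->mass[i] * (v->vx * v->vx + v->vy * v->vy + v->vz * v->vz);
    }
    return e;
}
```

A different schema could fold mass into the velocity group or split every component into its own array; the point of generating the layout from a schema is that such changes would not require edits to loops like the one in `kinetic_energy`.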
I plan to investigate generating RapidMind, Intel's Threading Building Blocks, or other backend code to complement the performance-portable language extension. My hope is that source translation combined with a vector-like language extension might let the power of next-generation architectures be exploited with a minimal amount of effort devoted to code rewrites.
J.K.