The Bitmap: Hitting the Bottleneck
Throughout development, Step 5 has consistently been the bottleneck of the entire process. Even at the highest degree of optimization, it visits only between 35 million and 114 million edges per second per SPE, a throughput far below that of the other steps. We started with a straightforward sequential version of Step 5 (Listing Six; available at http://www.ddj.com/code/) and optimized it extensively, obtaining Listing Seven (available at http://www.ddj.com/code/). The result is not exactly the most readable code, but it gives you a good idea of what deeply optimized source code for the Cell looks like.
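Listing Six itself is only available online, but its core logic is easy to sketch. The following fragment is our own reconstruction, not the listing's actual code (all names are hypothetical): for each incoming vertex, test its bit in the visited bitmap; if the vertex is new, mark it and append it to the next-level queue.

/* Rough sketch of the sequential Step 5 logic (names hypothetical):
   one bitmap test-and-set per incoming vertex. */
void step5_sequential(const unsigned *Qin, unsigned n,
                      unsigned *bitmap, unsigned *Qnext, unsigned *qlen)
{
    for (unsigned i = 0; i < n; i++) {
        unsigned v    = Qin[i];
        unsigned word = v >> 5;            /* 32 bits per bitmap word    */
        unsigned mask = 1u << (v & 31);
        if (!(bitmap[word] & mask)) {      /* not visited yet?           */
            bitmap[word] |= mask;          /* mark as visited            */
            Qnext[(*qlen)++] = v;          /* enqueue for the next level */
        }
    }
}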
The initial code does not compile efficiently on the Cell: only 19 of the 128 registers are used, the average CPI (cycles per instruction) is 2.18, no SIMD operations are used, only 3 percent of the issues are dual issues (SPEs have two separate instruction pipelines and can issue two instructions at a time if they don't conflict), and 48.5 percent of the cycles are spent in stalls, waiting either for memory operations or for intermediate results. Our final implementation, by contrast, is 3.68 times faster: it uses 82 registers, is optimally SIMDized, has a CPI of 0.99, achieves dual issue on 31 percent of the issues, and stalls in only 32.9 percent of the cycles.
The code in step5() (Listing Seven) is divided into two similar portions. The first processes the vertices owned by the local SPE, which are ready immediately because they did not need to be transferred in the all-to-all exchange. The second processes each incoming Qin after waiting for the associated sentinel. Since processing a Qin takes longer (4.10 microseconds) than transferring one (1.19 microseconds), and the first Qin needs no transfer, the total time actually spent waiting for sentinels is zero.
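In outline, the two-portion structure looks like the following sketch. All names here (wait_for_sentinel, process_qin, and the queue variables) are hypothetical stand-ins, not the identifiers from Listing Seven.

#define NUM_SPES 8

extern void process_qin(const unsigned *qin, unsigned len);
extern void wait_for_sentinel(int spe);      /* poll the DMA sentinel   */
extern const unsigned *Qin_local, *Qin_remote[NUM_SPES - 1];
extern unsigned Qin_local_len, Qin_remote_len[NUM_SPES - 1];

void step5(void)
{
    /* Portion 1: locally owned vertices were never part of the
       all-to-all exchange, so they are ready immediately. */
    process_qin(Qin_local, Qin_local_len);

    /* Portion 2: each remote Qin is processed once its sentinel
       confirms the transfer has completed. Processing one queue
       (~4.10 us) takes longer than transferring the next (~1.19 us),
       so in practice these waits never block. */
    for (int spe = 0; spe < NUM_SPES - 1; spe++) {
        wait_for_sentinel(spe);
        process_qin(Qin_remote[spe], Qin_remote_len[spe]);
    }
}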
As for optimizations, we inlined functions and unrolled loops. Both optimizations reduce branch penalties and create larger basic blocks. With larger basic blocks, the compiler can schedule instructions more efficiently, decreasing stalls and increasing the dual-issue rate.
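To make the effect concrete, here is an illustrative four-way unroll of the sequential sketch above (shape only, not the code in Listing Seven). The wider basic block gives the scheduler four independent test-and-set chains to interleave across the SPE's two pipelines.

/* visit() is a hypothetical helper; the compiler inlines it, so each
   unrolled iteration becomes one large basic block. */
static inline void visit(unsigned v, unsigned *bitmap,
                         unsigned *Qnext, unsigned *qlen)
{
    unsigned word = v >> 5, mask = 1u << (v & 31);
    if (!(bitmap[word] & mask)) {
        bitmap[word] |= mask;
        Qnext[(*qlen)++] = v;
    }
}

void step5_unrolled(const unsigned *Qin, unsigned n,
                    unsigned *bitmap, unsigned *Qnext, unsigned *qlen)
{
    unsigned i = 0;
    for (; i + 4 <= n; i += 4) {       /* four visits per iteration */
        visit(Qin[i],     bitmap, Qnext, qlen);
        visit(Qin[i + 1], bitmap, Qnext, qlen);
        visit(Qin[i + 2], bitmap, Qnext, qlen);
        visit(Qin[i + 3], bitmap, Qnext, qlen);
    }
    for (; i < n; i++)                 /* scalar epilogue           */
        visit(Qin[i], bitmap, Qnext, qlen);
}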
We employ 128-bit loads and SIMD operations (the spu_xxx intrinsics in Listing Seven) wherever convenient, which leads to better register exploitation and a lower CPI. We also issue four loads per iteration through "offset pointers"; that is, a base pointer plus a small literal offset. These compile to load instructions in the "d-form," which adds a register and an immediate offset without needing a separate address-computation instruction.
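The fragment below sketches what this looks like (our own illustration, not Listing Seven; the function name is hypothetical). One 128-bit load fetches four vertex codes at once, and the spu_* intrinsics compute the four bitmap word indices and single-bit masks in parallel. Because q is indexed by small literal offsets, each of the four loads compiles to a single d-form lqd instruction.

#include <spu_intrinsics.h>

/* Compute bitmap word indices and bit masks for 16 vertices at a time. */
void masks16(const vec_uint4 *q, vec_uint4 w[4], vec_uint4 m[4])
{
    vec_uint4 v0 = q[0], v1 = q[1];    /* d-form: lqd with offsets 0, 16  */
    vec_uint4 v2 = q[2], v3 = q[3];    /* d-form: lqd with offsets 32, 48 */
    const vec_uint4 one = spu_splats(1u);
    const vec_uint4 lo5 = spu_splats(31u);

    /* spu_rlmask(v, -5) is a logical shift right by 5 (the word index);
       spu_sl(one, v & 31) builds the mask 1 << (v & 31), per element. */
    w[0] = spu_rlmask(v0, -5);  m[0] = spu_sl(one, spu_and(v0, lo5));
    w[1] = spu_rlmask(v1, -5);  m[1] = spu_sl(one, spu_and(v1, lo5));
    w[2] = spu_rlmask(v2, -5);  m[2] = spu_sl(one, spu_and(v2, lo5));
    w[3] = spu_rlmask(v3, -5);  m[3] = spu_sl(one, spu_and(v3, lo5));
}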
We got rid of branches by using software speculation, since branches cost up to 24 cycles when predicted incorrectly. We set bits in the bitmap unconditionally, even when they are already set; this has no impact on the algorithm's correctness. Likewise, we queue vertex codes into Qnext even when they are not needed, then advance the end pointer of the queue only if the vertex was actually added. This lets most of the code proceed unconditionally.
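A minimal branchless version of the visit logic illustrates the idea (again, a sketch with hypothetical names, not the listing's code): the bit is set unconditionally, the vertex is stored speculatively, and only the queue's end pointer advances conditionally, via arithmetic rather than a branch.

static inline void visit_branchless(unsigned v, unsigned *bitmap,
                                    unsigned *Qnext, unsigned *qlen)
{
    unsigned word = v >> 5, mask = 1u << (v & 31);
    unsigned seen = (bitmap[word] & mask) != 0; /* 1 if already visited  */
    bitmap[word] |= mask;    /* set unconditionally: harmless if set     */
    Qnext[*qlen] = v;        /* store speculatively into the queue       */
    *qlen += 1 - seen;       /* advance only if the vertex was new       */
}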
We use "restrict" pointers. Normally, a sequence of multiple loads causes stalls because the compiler does not try to schedule them concurrently. In fact, this would break the C pointer semantics if the loads alias. And bitmap load/stores are responsible for more than 30 percent of the time spent in this step. To improve performance, we declare as __restrict__ the pointers to values in the bitmap, guaranteeing that no aliasing takes place and allowing the four loads to proceed in parallel. And at graph generation, we shuffle adjacency lists appropriately so that they do not cause aliasing.
Conclusion
In short, the Cell offers impressive performance potential. However, because of its architecture and the limited support offered by the compiler, you can't expect to exploit this potential by simply recompiling your current applications. Applications must be radically redesigned in terms of computation and data transfers: computational bottlenecks must often be analyzed and addressed manually, and data transfers must be properly orchestrated so that their latency hides completely under the computational delays.

This work has been supported by the Laboratory Directed Research and Development Program for the Data-Intensive Computing Initiative at Pacific Northwest National Laboratory, operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830.