Fitting In
Listing Two has a major issue: it assumes no limits on the size of its data structures. But the space in an LS is limited: code, stack, global data, and heap must fit in 256 KB. Therefore, we leave Q, Qnext, and G in main memory and operate on smaller buffers, called bQ, bQnext, and bG, respectively, which are allocated in the LS. Figure 1 describes the new algorithm.
Step 1 fetches a portion of Q into buffer bQ via a DMA transfer. We hide the latency of this transfer with a double-buffering technique. In short, while buffer 0 is being processed, new data are transferred into buffer 1. When we finish processing buffer 0 and transferring buffer 1, we swap the buffers: Buffer 1 comes into use, and another transfer is initiated on buffer 0; see Listing Three (available at http://www.ddj.com/code/). The same technique is used for putting bQnext into main memory (in Step 6).
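Listing Three contains the actual SPE code. As an illustration only, the swap logic can be sketched in portable C, with `dma_start`/`dma_wait` as hypothetical stand-ins for the Cell's `mfc_get()`/tag-status calls (here they just copy memory, so the sketch runs anywhere):

```c
#include <string.h>

#define CHUNK 1024  /* elements per LS buffer; an illustrative size */

/* Hypothetical DMA primitives; on the SPE these would start an mfc_get()
   transfer and wait on its tag group. In this simulation they are a plain
   copy and a no-op. */
static void dma_start(int *dst, const int *src, int n) { memcpy(dst, src, n * sizeof *dst); }
static void dma_wait(void) { /* transfer is already complete in this simulation */ }

/* Consume Q in CHUNK-sized pieces with two buffers: while buf[cur] is being
   processed, the next chunk streams into buf[1-cur]; then the buffers swap.
   "Processing" here is just summing, to keep the sketch self-contained. */
long process_queue(const int *Q, int total)
{
    static int buf[2][CHUNK];
    long sum = 0;
    int cur = 0, done = 0;
    int first = total < CHUNK ? total : CHUNK;

    dma_start(buf[cur], Q, first);            /* prefetch the first chunk */
    while (done < total) {
        int n = total - done < CHUNK ? total - done : CHUNK;
        dma_wait();                           /* chunk in buf[cur] is ready */
        int next = done + n;
        if (next < total) {                   /* start fetching the next chunk */
            int m = total - next < CHUNK ? total - next : CHUNK;
            dma_start(buf[1 - cur], Q + next, m);
        }
        for (int i = 0; i < n; i++)           /* process the current buffer */
            sum += buf[cur][i];
        done = next;
        cur = 1 - cur;                        /* swap buffers */
    }
    return sum;
}
```

The point of the structure is that the transfer for chunk k+1 is in flight while chunk k is being consumed, so transfer latency overlaps with computation.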
Step 2 scans the vertices in bQ and loads the corresponding adjacency lists into bG until bG is full. bG is also managed as a double buffer. But adjacency lists are sparse blocks in main memory, so multiple transfers are needed; we carry them out with a DMA list. The Cell supports DMA lists of up to 2048 transfers, each involving up to 16 KB. You must know the size of each transfer in advance to set it up in the DMA list, but there is no obvious way to know the size of an adjacency list before loading it. We solve this with a hack: we represent vertices with vertex codes; for instance, a 32-bit word where the upper bits contain the vertex identifier and the lower bits contain its adjacency-list length. The results are given in Listing Four (available at http://www.ddj.com/code/), where macros VERTEX_CODE_TO_INDEX and VERTEX_CODE_TO_LENGTH extract the two fields from a vertex code. With the help of the length field, Step 2 can pack as many adjacency lists as possible into bG, minimizing wasted space.
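The macro names VERTEX_CODE_TO_INDEX and VERTEX_CODE_TO_LENGTH come from Listing Four; the exact bit split and the MAKE_VERTEX_CODE helper below are assumptions for illustration, since the article only says "upper bits" and "lower bits":

```c
#include <stdint.h>

/* Pack a vertex into one 32-bit word: here the upper 20 bits hold the
   vertex identifier and the lower 12 bits the adjacency-list length
   (an assumed split; MAKE_VERTEX_CODE is a hypothetical helper). */
#define LENGTH_BITS 12
#define LENGTH_MASK ((1u << LENGTH_BITS) - 1)

#define MAKE_VERTEX_CODE(idx, len)  (((uint32_t)(idx) << LENGTH_BITS) | ((uint32_t)(len) & LENGTH_MASK))
#define VERTEX_CODE_TO_INDEX(c)     ((uint32_t)(c) >> LENGTH_BITS)
#define VERTEX_CODE_TO_LENGTH(c)    ((uint32_t)(c) & LENGTH_MASK)
```

With this encoding, reading one word from bQ yields both the vertex and the size of the DMA transfer its adjacency list will need.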
Step 3 splits the adjacency lists loaded by Step 2 into the respective Qouts. To expedite this process, at graph-generation time we store each adjacency list in a split, per-SPE format: a header specifies the offset and length of each per-SPE portion of the list.
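The article does not give the header layout, so the struct below is a plausible sketch with illustrative field names: one offset/length pair per SPE lets Step 3 locate each per-SPE slice without scanning the list.

```c
#include <stdint.h>

#define NUM_SPES 8  /* the Cell has 8 SPEs */

/* Hypothetical header of a split adjacency list: for each SPE, the offset
   (from the start of the list body, in vertices) and length of the portion
   destined for that SPE. */
typedef struct {
    uint32_t offset[NUM_SPES];
    uint32_t length[NUM_SPES];
} adj_header;

/* Return a pointer to SPE `spe`'s slice of the list and its length. */
static const uint32_t *spe_portion(const adj_header *h, const uint32_t *body,
                                   int spe, uint32_t *len_out)
{
    *len_out = h->length[spe];
    return body + h->offset[spe];
}
```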
Step 4 is an all-to-all exchange in which each SPE delivers each of its Qout queues into a Qin inside the respective recipient. To let the recipient detect when each Qin is ready, we use a sentinel: a flag that the sender sets when the transfer is complete. The recipient detects transfer completion by waiting for the sentinel to change value. Hardware support guarantees that sentinels are not transferred before their payload: we interleave payload and sentinels in a DMA list and employ a DMA list with a barrier; for instance, intrinsic mfc_putlb in Listing Five (available at http://www.ddj.com/code/).
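The idea can be sketched in plain, single-threaded C. Names and layout are illustrative; on the Cell the payload-before-sentinel ordering comes from the barriered DMA list (mfc_putlb), whereas here an ordinary store order stands in for that guarantee:

```c
#include <stdint.h>

#define QIN_CAP 256
#define SENTINEL_EMPTY 0u
#define SENTINEL_READY 1u

/* An incoming queue with a trailing sentinel word (illustrative layout). */
typedef struct {
    uint32_t data[QIN_CAP];
    volatile uint32_t count;     /* number of valid entries */
    volatile uint32_t sentinel;  /* flipped by the sender when done */
} qin_t;

/* Sender side: deliver the payload, then set the sentinel last. On the
   Cell, payload and sentinel travel in one barriered DMA list, so the
   sentinel cannot overtake the payload. */
void deliver(qin_t *q, const uint32_t *payload, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        q->data[i] = payload[i];
    q->count = n;
    q->sentinel = SENTINEL_READY;
}

/* Recipient side: spin until the sentinel changes value; the poll is cheap
   because on the SPE the sentinel lives in the local store. */
uint32_t wait_ready(qin_t *q)
{
    while (q->sentinel != SENTINEL_READY)
        ;
    return q->count;
}
```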
For maximum efficiency, we avoid the transfer of Qout[i] to the same SPE i, and we schedule transfers in a circular fashion that exploits the geography of the EIB. The send order is computed by init_all_to_all() and stored in Qout_send_order; see Listing Five. The order and coordinates of transfers are known in advance, so the DMA list can be prepared once and for all at program initialization. Thanks to this, Step 4 boils down to just a call to mfc_putlb.
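The simplest circular order that skips self-sends is shown below; the actual init_all_to_all() in Listing Five may refine this to match the EIB's ring topology, so treat this as a sketch:

```c
#define NUM_SPES 8

/* SPE `me` sends first to neighbor me+1, then me+2, ..., wrapping around
   and skipping itself. At each step all SPEs target distinct recipients,
   spreading traffic over the EIB. */
void init_all_to_all(int me, int order[NUM_SPES - 1])
{
    for (int k = 1; k < NUM_SPES; k++)
        order[k - 1] = (me + k) % NUM_SPES;
}
```

Because this schedule depends only on the SPE's own rank, it can be computed once at startup and baked into the DMA list.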
Step 5 scans the vertices in each incoming Qin queue; vertices that were not marked before are added to bQnext and marked. Step 5 is the most expensive step of the algorithm and the hardest to optimize.
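A straightforward scalar version of this step might look as follows; the bitmap representation of the marked set and all names are assumptions for illustration (the optimized SPE code is not shown in this section):

```c
#include <stdint.h>

/* Test whether vertex v is marked in the bitmap; if not, mark it.
   Returns 1 if it was already marked, 0 otherwise. */
static inline int test_and_mark(uint32_t *marked, uint32_t v)
{
    uint32_t word = v >> 5, bit = 1u << (v & 31);
    if (marked[word] & bit) return 1;   /* already visited */
    marked[word] |= bit;
    return 0;
}

/* Scan one incoming Qin: append unvisited vertices to bQnext and mark
   them. Returns the new length of bQnext. */
uint32_t scan_qin(const uint32_t *Qin, uint32_t n,
                  uint32_t *marked, uint32_t *bQnext, uint32_t next_len)
{
    for (uint32_t i = 0; i < n; i++)
        if (!test_and_mark(marked, Qin[i]))
            bQnext[next_len++] = Qin[i];
    return next_len;
}
```

The dependent load-test-store chain on the bitmap is what makes this step hard to optimize: each vertex's check can branch, and the accesses are data-dependent and irregular.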
Step 6 commits bQnext to Qnext in main memory with double buffering. The same considerations made for Step 1 apply.
To achieve the best results, we overlap computation and data transfers as much as possible. Due to double buffering, an iteration takes two "scheduling periods," but iterations are pipelined, so one iteration completes per period. With the aforementioned parameters, a period lasts 47 microseconds and processes 2700 edges on average (roughly 57 million edges per second). The latency of data transfers is completely hidden: computations never have to wait for data being transferred, and the overall latency is determined by the sum of all computations.