The Problem
To illustrate the peculiarities of Cell programming, we use the Breadth-First Search (BFS) on a graph. Despite its simplicity, this algorithm is important because it is a building block of many applications in computer graphics, artificial intelligence, astrophysics, national security, genomics, robotics, and the like.
Listing One is a minimal BFS implementation in C. Variable G contains the graph in the form of an array of adjacency lists. G[i].length tells how many neighbors the i-th vertex has, which are in G[i].neighbors[0], G[i].neighbors[1], and so on. The vertex from which the visit starts is in variable root. A BFS visit proceeds in levels: First, the root is visited, then its neighbors, then its neighbors' neighbors, and so on. At any time, queue Q contains the vertices to visit in the current level. The algorithm scans every vertex in Q, fetches its neighbors, and adds each neighbor to the list of vertices to visit in the next level, Qnext. To prevent being caught in loops, the algorithm avoids visiting those vertices that have been visited before. To do so, it maintains a marked array of Boolean variables. Neighbors are added to Qnext only when they are not already marked, then they get marked. At the end of each level, Q and Qnext swap, and Qnext is emptied.
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> /* ... */ /* the graph */ vertex_t * G; /* number of vertices in the graph */ unsigned card_V; /* root vertex (where the visit starts) */ unsigned root; void parse_input( int argc, char** argv ); int main(int argc, char ** argv) { unsigned *Q, *Q_next, *marked; unsigned Q_size=0, Q_next_size=0; unsigned level = 0; parse_input(argc, argv); graph_load(); Q = (unsigned *) calloc(card_V, sizeof(unsigned)); Q_next = (unsigned *) calloc(card_V, sizeof(unsigned)); marked = (unsigned *) calloc(card_V, sizeof(unsigned)); Q[0] = root; Q_size = 1; while (Q_size != 0) { /* scanning all vertices in queue Q */ unsigned Q_index; for ( Q_index=0; Q_index<Q_size; Q_index++ ) { const unsigned vertex = Q[Q_index]; const unsigned length = G[vertex].length; /* scanning each neighbor of each vertex */ unsigned i; for ( i=0; i<length; i++) { const unsigned neighbor = G[vertex].neighbors[i]; if( !marked[neighbor] ) { /* mark the neighbor */ marked[neighbor] = TRUE; /* enqueue it to Q_next */ Q_next[Q_next_size++] = neighbor; } } } level++; unsigned * swap_tmp; swap_tmp = Q; Q = Q_next; Q_next = swap_tmp; Q_size = Q_next_size; Q_next_size = 0; } return 0; }
On a Pentium 4 HT running at 3.4 GHz, this algorithm is able to check 24-million edges per second. On the Cell, at the end of our optimization, we achieved a performance of 538-million edges per second. This is an impressive result, but came at the price of an explosion in code complexity. While the algorithm in Listing One fits in 60 lines of source code, our final algorithm on the Cell measures 1200 lines of code.
Let's Get Parallel
The first step in adapting programs to a multicore architecture is making it parallel. The basic idea is to split loop for (Q_index=0; Q_index<Q_size; Q_index++)... among different SPEs. Then you access a lock marked by the protection of a synchronization mechanism. Locks work fine in cache-coherent shared-memory machines with uniform memory and limited threads, but scale poorly on multicore systems. Instead, we partition both the vertex space and the marked array evenly among the SPEs. Each SPE explores only the vertices it owns, and forwards the others to their respective owners. Function which_owner() returns the owner of a given vertex identifier.
Rather than synchronizing at a fine grain, we adopt a Bulk-Synchronous Parallel (BSP) approach. In BSP, an algorithm is split in steps, and all the cores execute the same step at the same time. After each step, there is a barrier; see barrier() in Listing Two (available at http://www.ddj.com/code/). At a barrier, whoever finishes first waits for all the others to complete before proceeding to the next step. The BSP approach is very common in the parallel programming community because it greatly simplifies the control flow and the communication protocols, at the expense of a negligible performance penalty.
Listing Two is a pseudo-C rendition of the algorithm in BSP form. The code is executed by each of the SPEs, numbered from 0 to Nspe-1. Each SPE examines the neighbor lists of each of the vertices in its Q, encountering neighbors that belong to different SPEs. It dispatches them, putting those owned by the i-th SPE in queue Qout[i]. Then, an all-to-all exchange takes place, which routes the vertices to their respective owners. Each Qout[i] is sent to the i-th SPE, which receives it into Qin[s], where s is the identifier of the sender SPE. Next, each SPE examines the incoming adjacency lists, and marks and adds vertices to its private Qnext as before. The entire algorithm is complete when all the SPEs find their Qs empty. This is done via a reduce_all operation, which performs a distributed addition of all the variables Q_size among all the SPEs.