In CUDA, Supercomputing for the Masses: Part 9 of this article series on CUDA (short for "Compute Unified Device Architecture"), I looked at how you extend high-level languages (like Python) with CUDA. In this installment, I examine CUDPP, the "CUDA Data Parallel Primitives Library." CUDPP is a quickly maturing package that implements some not-so-obvious algorithms to efficiently use the GPU for basic data-parallel operations such as sorting, stream compaction, and even building data structures like trees and summed-area tables. I discuss CUDPP here because it might provide some of the functionality needed to quickly speed the development of one of your projects.
I also introduce the concept of creating a "plan," a programming pattern used to provide an optimized execution configuration based on problem specification and destination hardware. Although not an optimizing compiler, the use of plans can greatly enhance the ability of programmers to create efficient software for multiple types of CUDA-enabled GPUs -- in addition to providing the ability to select problem-specific optimized code for specific problems within a general-purpose library framework. The NVIDIA cuFFT library, for example, can decide to use more efficient power-of-two FFT algorithms when appropriate. While the concept of a plan is not new to CUDA or this article series, it is a common design pattern that has stood the test of time.
Why Use CUDPP?
Most of us have a tool kit of libraries and methods that we use to do some of our work for us. In a nutshell, these libraries provide primitives that we can use to quickly and efficiently perform some of our computational tasks. Sorting is one example where it is just as easy and efficient to call something like the qsort()
routine to return a data structure in sorted order. The NVIDIA cuBLAS and cuFFT libraries provide similar functionality for some not-so-easy tasks such as programming the FFT and optimized BLAS functionality.
CUDPP uses the same ideas to provide a library of optimized "best in class" methods to perform primitive operations such as parallel-prefix-sum ("scan"), parallel sort (of numbers), parallel reduction and other methods that permit the efficient implementation of sparse matrix-vector multiply, and other operations.
A parallel-prefix scan is a primitive that can help in implementing efficient solutions to parallel problems in which each output apparently requires global knowledge of the inputs. For example, the prefix sum (also known as the "scan", "prefix reduction", or "partial sum") is an operation on lists in which each element in the result list is obtained from the sum of the elements in the operand list up to its index. This appears to be a serial operation because each result depends on all the previous values as follows:
Definition:
The all-prefix-sums
operation takes a binary associative operator
and an array of n
elements:
given: [a
returns:[a
Example
: If is addition, then the all-prefix-sums
operation on the array of n
elements
given [3, 1, 7, 0, 4, 1, 6, 3]
returns [3, 4, 11, 11, 15, 16, 22, 25].
There are many uses for all-prefix-sums
illustrated above, including, but not limited to sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. There are various survey papers that provide more extensive and detailed applications such as Guy Blelloch's Prefix Sums and Their Applications.
Obviously a sequential version of scan (that could be run in a single thread on a CPU, for example) is trivial. We simply loop over all the elements in the input array and add the value of the previous element of the input array to the sum computed for the previous element of the output array, and write the sum to the current element of the output array.
void scan( float* output, float* input, int length) { output[0] = 0; // since this is a prescan, not a scan for(int j = 1; j < length; ++j) { output[j] = input[j-1] + output[j-1]; } }
This code performs exactly n
additions for an array of length n
-- the minimum number of additions required to produce the scanned array. It would be wonderful if a parallel version of scan could be work-efficient, which means the parallel version performs no more addition operations (or work) than the sequential version. In other words the two implementations should have the same work complexity, O(n). CUDPP claims to achieve O(n) scan runtime, which should clarify the value of CUDPP because creating a parallel implementation is non-trivial. For more information, see Scan Primitives for GPU Computing by Shubhabrata Sengupta et al.
For version 1.0, CUDPP provides:
- Segmented Scan: an algorithm for performing multiple variable-length scans in parallel. Useful for algorithms such as parallel quicksort, parallel sparse matrix-vector multiplication, and more.
- Sparse Matrix-Vector Multiplication (based on segmented scan): Sparse matrix operations are important because they allow GPUs to work on matrices with many zeros (e.g, a sparse matrix) in both a space and computationally efficient way. Since most of the values are zero, most of the work can be avoided. Similarly, there is no need to waste space in storing the zeros.
- An improved scan algorithm, called "warp scan", for higher performance and simpler code.
- Scans and segmented scans now support add, multiply, maximum, and minimum operators.
- Inclusive scans and segmented scans are now supported.
- Improved, more useful,
cudppCompact()
interface. - Backward compact (reverse-and-compact) is now supported.
- CUDA 2.0 support.
- Added support for Mac OS X and Windows Vista.