Since the CUDA device is idle, the kernel immediately starts running based on the execution configuration and according to the function arguments. Meanwhile, the host continues to the next line of code after the kernel launch. At this point, both the CUDA device and host are simultaneously running their separate programs. In the case of incrementArrays.cu, the host immediately calls cudaMemcpy
, which waits until all threads have finished on the device (e.g., returned from incrementArrayOnDevice
) after which it pulls the modified array back to the host. The program completes after the host system performs a sequential comparison to verify we got the same result on the parallel CUDA device with incrementArrayOnDevice
as on the host with the sequential version incrementArrayOnHost
.
There are several variables determined at kernel startup through the execution configuration (in this example via the variables nBlocks
and blockSize
contained between the triple angle brackets "<<<" and ">>>") that are available to any kernel. The thinking behind nBlocks
and blockSize
is actually quite elegant because it allows the developer to account for hardware limitations without requiring the recompilation of the application -- which is an essential feature for developing commercial software with CUDA.
As I'll examine in future columns, threads within a block have the ability to communicate and synchronize with each other. This is a marvelous software feature that unfortunately costs money from a hardware standpoint. Expect more expensive (and future) devices to support a greater number of threads per block than less expensive (and older) devices. The grid abstraction was created to let developers take into account -- without recompilation -- differing hardware capabilities regardless of price point and age. A grid, in effect, batches together calls to the same kernel for blocks with the same dimensionality and size, and effectively multiplies by a factor of nBlocks
the number of threads that can be launched in a single kernel invocation. Less capable devices may only be able to run one or a couple of thread blocks simultaneously, while more capable (e.g., expensive and future) devices may be able to run many at once. Designing software with the grid abstraction requires balancing the trade-offs between simultaneously running many independent threads, and requiring a greater number of threads within a block that can cooperate with each other. Please be cognizant of the costs associated with the two types of threads. Of course, different algorithms will impose different requirements, but when possible try to use larger numbers of thread blocks.
In the kernel on the CUDA-enabled device, several built-in variables are available that were set by the execution configuration of the kernel invocation. They are:
blockIdx
which contains the block index within the grid.threadIdx
contains the thread index within the block.blockDim
contains the number of threads in a block.
These variables are structures that contain integer components of the variables. Blocks, for example, have x-, y-, and z- integer components because they are three-dimensional. Grids, on the other hand, only have x- and y-components because they are two-dimensional. This example only uses the x-component of these variables as the array we moved onto the CUDA device is one-dimensional. (Future columns will explore the power of this two-dimensional and three-dimensional configuration capability and how it can be exploited.)
Our example kernel used these built-in variables them to calculate the thread index, idx
with the statement:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
The variables nBlocks
and blockSize
are the number of blocks in the grid and the number of threads in each block, respectively. In this example, they are initialized just before the kernel call in the host code:
int blockSize = 4; int nBlocks = N/blockSize + (N%blockSize == 0?0:1);
In cases where N
is not evenly divisible by blockSize
, the last term in the nBlocks
calculation adds an extra block, which for some cases implies some threads in the block will not perform any useful work.
This example is obviously contrived for simplicity as it assumes that the array size is smaller than the number of threads that can be contained within four (4) thread blocks. This is an obvious oversimplification but it let us, with simple code, explore the kernel call to incrementArrayOnDevice
.
It is important to emphasize that each thread is capable of accessing the entire array a_d
on the device. There is no inherent data partitioning when a kernel is launched. It is up to the programmer to identify and exploit the data parallel aspects of the computation when writing kernels.
Figure 1 illustrates how idx
is calculated and the array, a_d
is referenced. (If any of the preceding text is unclear, I recommend adding a printf
statement to incrementArrayOnDevice
to print out idx
and the associated variables used to calculate it. Compile the program for the emulator, "make emu=1", and run it to see what is going on. Be certain to specify the correct path to the emulator executable to see the printf
output.)
Again, kernel calls are asynchronous -- after a kernel launch, control immediately returns to the host CPU. The kernel will run on the CUDA device once all previous CUDA calls have finished. The asynchronous kernel call is a wonderful way to overlap computation on the host and device. In this example, the call to incrementArrayOnHost
could be placed after the call to incrementArrayOnDevice
to overlap computation on the host and device to get better performance. Depending on the amount of time the kernel takes to complete, it is possible for both host and device to compute simultaneously.
Until the next column, I recommend:
- Try changing the value of
N
andnBlocks
. See what happens when they exceed the device capabilities. - Think about how to introduce a loop to handle arbitrary sized arrays.
- Distinguish between the types of CUDA-enabled device memory (e.g., global memory, registers, shared memory, and constant memory). Take a look at the CUDA occupancy calculator and either the
nvcc
options-cubin
or--ptxas-options=-v
to determine the number of registers used in a kernel.
Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at [email protected].
For More Information
- CUDA, Supercomputing for the Masses: Part 14
- CUDA, Supercomputing for the Masses: Part 13
- CUDA, Supercomputing for the Masses: Part 12
- CUDA, Supercomputing for the Masses: Part 11
- CUDA, Supercomputing for the Masses: Part 10
- CUDA, Supercomputing for the Masses: Part 9
- CUDA, Supercomputing for the Masses: Part 8
- CUDA, Supercomputing for the Masses: Part 7
- CUDA, Supercomputing for the Masses: Part 6
- CUDA, Supercomputing for the Masses: Part 5
- CUDA, Supercomputing for the Masses: Part 4
- CUDA, Supercomputing for the Masses: Part 3
- CUDA, Supercomputing for the Masses: Part 2
- CUDA, Supercomputing for the Masses: Part 1