Looking at the Source Code
Looking at the source code for reverseArray_multiblock.cu, you notice that the structure of the program is very, very similar to the structure of moveArrays.cu from Part 2. An error routine, checkCUDAError
is provided so the host can print out a human-readable message and exit when an error is reported by cudaGetLastError
. As can be seen, checkCUDAError
is judiciously utilized throughout the program to check for errors.
The program reverseArray_multiblock.cu essentially creates a 1D array of integers, h_a
, containing the integer values [0 .. dimA-1]
. Array h_a
is moved via cudaMemcpy
to array d_a
, which resides in global memory on the device. The host then launches the reverseArrayBlock
kernel to copy the array contents in reverse order from d_a
to d_b
, which is another global memory array. Again, cudaMemcpy
is used to transfer data -- this time from d_b
to the host. A check is then performed on the host to verify that the device produced the correct result (e.g, [dimA-1 .. 0]
).
// includes, system #include <stdio.h> #include <assert.h> // Simple utility function to check for CUDA runtime errors void checkCUDAError(const char* msg); // Part3: implement the kernel __global__ void reverseArrayBlock(int *d_out, int *d_in) { int inOffset = blockDim.x * blockIdx.x; int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x); int in = inOffset + threadIdx.x; int out = outOffset + (blockDim.x - 1 - threadIdx.x); d_out[out] = d_in[in]; } ///////////////////////////////////////////////////////////////////// // Program main ///////////////////////////////////////////////////////////////////// int main( int argc, char** argv) { // pointer for host memory and size int *h_a; int dimA = 256 * 1024; // 256K elements (1MB total) // pointer for device memory int *d_b, *d_a; // define grid and block size int numThreadsPerBlock = 256; // Part 1: compute number of blocks needed based on // array size and desired block size int numBlocks = dimA / numThreadsPerBlock; // allocate host and device memory size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int); h_a = (int *) malloc(memSize); cudaMalloc( (void **) &d_a, memSize ); cudaMalloc( (void **) &d_b, memSize ); // Initialize input array on host for (int i = 0; i < dimA; ++i) { h_a[i] = i; } // Copy host array to device array cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice ); // launch kernel dim3 dimGrid(numBlocks); dim3 dimBlock(numThreadsPerBlock); reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a ); // block until the device has completed cudaThreadSynchronize(); // check if kernel execution generated an error // Check for any CUDA errors checkCUDAError("kernel invocation"); // device to host copy cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost ); // Check for any CUDA errors checkCUDAError("memcpy"); // verify the data returned to the host is correct for (int i = 0; i < dimA; i++) { assert(h_a[i] == dimA - 1 - i ); } // free device memory cudaFree(d_a); cudaFree(d_b); // free host memory free(h_a); // If the program makes it this far, then the results are // correct and there are no run-time errors. Good work! printf("Correct!\n"); return 0; } void checkCUDAError(const char *msg) { cudaError_t err = cudaGetLastError(); if( cudaSuccess != err) { fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) ); exit(EXIT_FAILURE); } }
A key design feature of this program is that both arrays d_a
and d_b
reside in global memory on the device. The CUDA SDK provides an example program, bandwidthTest, which provides some information about the device characteristics. On my system, the global memory bandwidth is slightly over 60 GB/s. This is excellent until you consider that this bandwidth must service 128 hardware threads -- each of which can deliver a large number of floating-point operations. Since a 32-bit floating-point value occupies four (4) bytes, global memory bandwidth limited applications on this hardware will only be able to deliver around 15 GF/s -- or only a small percentage of the available performance capability. (This assumes the application only reads from global memory and does not write to it.) Obviously, higher performance applications must reuse data in some fashion. This is the function of shared and register memory and it is our job as programmers to gain the maximum benefit of these memory types. To gain a better understanding of machine balance as floating-point capability relates to memory bandwidth (and other machine characteristics), read my article HPC Balance and Common Sense.