CUDA, Supercomputing for the Masses: Part 3

By Rob Farber, May 13, 2008

Error handling and global memory performance limitations

Looking at the Source Code

The structure of reverseArray_multiblock.cu closely follows that of moveArrays.cu from Part 2. An error routine, checkCUDAError, lets the host print a human-readable message and exit when cudaGetLastError reports an error. The program calls checkCUDAError after the kernel launch and again after the final cudaMemcpy.

The program reverseArray_multiblock.cu creates a 1D array of integers, h_a, containing the integer values [0 .. dimA-1]. The host copies h_a via cudaMemcpy to array d_a, which resides in global memory on the device. The host then launches the reverseArrayBlock kernel, which copies the array contents in reverse order from d_a to d_b, another global memory array. A second cudaMemcpy transfers the data from d_b back to the host. Finally, the host verifies that the device produced the correct result (i.e., [dimA-1 .. 0]).

// includes, system
#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit, EXIT_FAILURE
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part3: implement the kernel
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
/////////////////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////////////////
int main( int argc, char** argv) 
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1: compute number of blocks needed based on 
    // array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;  

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, 
         dimBlock >>>( d_b, d_a );

    // block until the device has completed
    // (later CUDA releases deprecate cudaThreadSynchronize
    // in favor of cudaDeviceSynchronize)
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are 
    // correct and there are no run-time errors.  Good work!
    printf("Correct!\n");

    return 0;
}

// Print a human-readable message and exit if the CUDA runtime
// has recorded an error
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) 
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, 
                                  cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }                         
}
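
The index arithmetic in reverseArrayBlock can be checked on the host without a GPU. The following is a minimal sketch in plain C, not part of the original program: the helper names reversed_index and mapping_reverses_array are hypothetical, and the parameters block_dim, grid_dim, block_idx, and thread_idx simply stand in for the CUDA built-ins blockDim.x, gridDim.x, blockIdx.x, and threadIdx.x.

```c
#include <assert.h>

/* Host-side mirror of the output index computed in reverseArrayBlock.
   The parameters stand in for blockDim.x, gridDim.x, blockIdx.x
   and threadIdx.x (hypothetical helper, not in the original source). */
static int reversed_index(int block_dim, int grid_dim,
                          int block_idx, int thread_idx)
{
    /* Blocks are written in reverse order ... */
    int out_offset = block_dim * (grid_dim - 1 - block_idx);
    /* ... and threads are reversed within each block. */
    return out_offset + (block_dim - 1 - thread_idx);
}

/* Returns 1 if every (block, thread) pair sends input element i
   to output element n - 1 - i, where n = block_dim * grid_dim. */
static int mapping_reverses_array(int block_dim, int grid_dim)
{
    int n = block_dim * grid_dim;
    for (int b = 0; b < grid_dim; ++b) {
        for (int t = 0; t < block_dim; ++t) {
            int in  = block_dim * b + t;
            int out = reversed_index(block_dim, grid_dim, b, t);
            if (out != n - 1 - in) {
                return 0;
            }
        }
    }
    return 1;
}
```

Calling mapping_reverses_array(256, 1024) walks the same launch configuration the program uses (1024 blocks of 256 threads). Reversing both the block order and the thread order within each block is what makes the two offsets combine into a full reversal of the array.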

A key design feature of this program is that both arrays d_a and d_b reside in global memory on the device. The CUDA SDK provides an example program, bandwidthTest, which reports some of the device characteristics. On my system, the global memory bandwidth is slightly over 60 GB/s. This is excellent until you consider that this bandwidth must service 128 hardware threads, each of which can deliver a large number of floating-point operations. Since a 32-bit floating-point value occupies four bytes, 60 GB/s can supply at most about 15 billion values per second. A global memory bandwidth-limited application that performs one floating-point operation per value fetched will therefore deliver only around 15 GF/s, a small percentage of the available performance capability. (This assumes the application only reads from global memory and does not write to it.) Higher performance applications must reuse data in some fashion. This is the function of shared and register memory, and it is our job as programmers to gain the maximum benefit from these memory types. To gain a better understanding of machine balance as floating-point capability relates to memory bandwidth (and other machine characteristics), read my article HPC Balance and Common Sense.
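
The bandwidth arithmetic above can be written out as a small calculation. This is only a sketch in plain C: the function names are hypothetical, and the model assumes read-only traffic, as the paragraph above does. Any peak floating-point rate passed to the second function is an input you supply, not a figure taken from this article.

```c
#include <assert.h>

/* Upper bound on how many values global memory can supply,
   in billions of values per second, given the bandwidth in GB/s
   and the size of one value in bytes (hypothetical helper). */
static double giga_values_per_second(double bandwidth_gb_per_s,
                                     double bytes_per_value)
{
    return bandwidth_gb_per_s / bytes_per_value;
}

/* Number of floating-point operations that must be performed per
   value fetched from global memory to sustain a given peak rate
   (in GF/s). peak_gflops is whatever the hardware is rated for. */
static double flops_needed_per_value(double peak_gflops,
                                     double bandwidth_gb_per_s,
                                     double bytes_per_value)
{
    return peak_gflops /
           giga_values_per_second(bandwidth_gb_per_s, bytes_per_value);
}
```

With the measured 60 GB/s and four-byte values, giga_values_per_second(60.0, 4.0) gives 15, which matches the 15 GF/s bound when each fetched value feeds a single floating-point operation. The second function shows why reuse matters: the higher the peak rate of the device, the more operations each fetched value has to feed.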
