Enabling and Controlling Textual Profiling
The environment variables that control the text version of the CUDA profiler are:
- CUDA_PROFILE: Set to 1 or 0 to enable/disable the profiler
- CUDA_PROFILE_LOG: Set to the name of the log file (The default is ./cuda_profile.log)
- CUDA_PROFILE_CSV: Set to 1 or 0 to enable or disable a comma separated version of the log
- CUDA_PROFILE_CONFIG: Specify a configuration file with up to four signals
The last bullet is important because only four signals can be profiled at a time. The developer can have the profiler collect any of the following events by specifying their names on separate lines in the file named by CUDA_PROFILE_CONFIG:
- gld_incoherent: Number of non-coalesced global memory loads
- gld_coherent: Number of coalesced global memory loads
- gst_incoherent: Number of non-coalesced global memory stores
- gst_coherent: Number of coalesced global memory stores
- local_load: Number of local memory loads
- local_store: Number of local memory stores
- branch: Number of branch events taken by threads
- divergent_branch: Number of divergent branches within a warp
- instructions: Instruction count
- warp_serialize: Number of threads in a warp that serialize based on address conflicts to shared or constant memory
- cta_launched: Number of executed thread blocks
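For example, a configuration file that collects four of these signals contains nothing more than the four names, one per line. The particular selection below is only an illustration; the runs later in this article use a different set:

gld_incoherent
gst_incoherent
divergent_branch
warp_serialize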
Notes on Profiler Counters
Note that the performance counter values do not correspond to individual thread activity. Instead, these values represent events within a thread warp. For example, an incoherent store within a thread warp increments the gst_incoherent counter by 1, so the final counter value reflects the incoherent stores from all warps.
In addition, the profiler can only target one of the multiprocessors in the GPU, so the counter values will not correspond to the total number of warps launched for a particular kernel. For this reason, when using the performance counter options in the profiler, the user should always launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. In practice, NVIDIA suggests launching at least 100 blocks or so for consistent results. (The example programs in this article launch 256K / 256 = 1,024 blocks, comfortably above that threshold.)
As a result, users should not expect the counter values to match the numbers one would determine through inspection of the kernel code. Counter values are best used to identify relative performance differences between unoptimized and optimized code. For example, if the profiler reports some number of non-coalesced global loads for an initial piece of software, then it is easy to see if a more refined version of the code utilizes a smaller number of non-coalesced loads. In most cases, the goal is to make the number of non-coalesced global loads zero, so the counter value is useful for tracking progress toward this goal.
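As a small illustration of this relative use of the counters, the two store counters can be collapsed into a single "fraction non-coalesced" figure of merit. The snippet below is only a sketch (the variable names are not part of the profiler output format), and the sample values are taken from the unoptimized run shown later in this article:

#include <stdio.h>

int main(void)
{
    // Counter values as reported by the profiler (here: the unoptimized
    // reverseArray_multiblock run shown later in this article)
    long gst_incoherent = 62464;
    long gst_coherent   = 0;

    // Fraction of the profiled global stores that were non-coalesced
    double fraction = (double) gst_incoherent /
                      (double) (gst_incoherent + gst_coherent);
    printf("fraction of stores that were non-coalesced: %.2f\n", fraction);
    return 0;
}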
Profiling Results
Let's look at reverseArray_multiblock.cu and reverseArray_multiblock_fast.cu with the profiler. In this example, we will set the environment variables and configuration file in the bash shell under Linux as follows:
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=$HOME/.cuda_profile_config

The configuration file ($HOME/.cuda_profile_config) lists the four signals to collect, one per line:

gld_coherent
gld_incoherent
gst_coherent
gst_incoherent
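These runs use the default log location (./cuda_profile.log). If a comma-separated log in a different location is preferred, the other two variables from the list above can be set as well; the file name here is just an illustration:

export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_LOG=$HOME/reverseArray.csv

Each program is then built and run in the usual way, for example:

nvcc -o reverseArray_multiblock reverseArray_multiblock.cu
./reverseArray_multiblock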
Running the reverseArray_multiblock.cu executable generates the following profiler report in ./cuda_profile.log:
method,gputime,cputime,occupancy,gld_incoherent,gld_coherent,gst_incoherent,gst_coherent
method=[ memcopy ] gputime=[ 438.432 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 267.520 ] cputime=[ 297.000 ] occupancy=[ 1.000 ] gld_incoherent=[ 0 ] gld_coherent=[ 1952 ] gst_incoherent=[ 62464 ] gst_coherent=[ 0 ]
method=[ memcopy ] gputime=[ 349.344 ]
Similarly, running the reverseArray_multiblock_fast.cu executable produces the following output, which overwrites the previous output in ./cuda_profile.log:
method,gputime,cputime,occupancy,gld_incoherent,gld_coherent,gst_incoherent,gst_coherent
method=[ memcopy ] gputime=[ 449.600 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 50.464 ] cputime=[ 108.000 ] occupancy=[ 1.000 ] gld_incoherent=[ 0 ] gld_coherent=[ 2032 ] gst_incoherent=[ 0 ] gst_coherent=[ 8128 ]
method=[ memcopy ] gputime=[ 509.984 ]
Comparing these two profiler results shows that reverseArray_multiblock_fast.cu has zero incoherent stores as opposed to reverseArray_multiblock.cu, which has many. Look at the source of reverseArray_multiblock.cu and see if you can fix the performance problem with incoherent stores. Once fixed, measure how fast the two programs are relative to each other.
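One way to make that relative measurement in code, rather than by comparing the gputime fields in the profiler logs, is to bracket the kernel launch with CUDA events. The helper below is only a sketch and is not part of the original listings; it assumes the reverseArrayBlock kernel and the dimGrid, dimBlock, d_b, and d_a variables defined in the listings that follow, so it can be dropped into either program. Pass 0 for sharedMemSize with the original kernel and the program's sharedMemSize with the shared-memory version.

// Sketch: time one launch of reverseArrayBlock with CUDA events.
// Call as, e.g., float ms = timeReverseArrayBlock(dimGrid, dimBlock, 0, d_b, d_a);
float timeReverseArrayBlock(dim3 dimGrid, dim3 dimBlock, size_t sharedMemSize,
                            int *d_out, int *d_in)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                      // mark the start on stream 0
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_out, d_in );
    cudaEventRecord(stop, 0);                       // mark the end on stream 0
    cudaEventSynchronize(stop);                     // wait until the kernel has finished

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsedMs;
}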
For convenience, Listing One presents reverseArray_multiblock.cu and Listing Two reverseArray_multiblock_fast.cu.
Listing One: reverseArray_multiblock.cu

// includes, system
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>   // for exit() and EXIT_FAILURE

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 3: implement the kernel
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1: compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
Listing Two: reverseArray_multiblock_fast.cu

// includes, system
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>   // for exit() and EXIT_FAILURE

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // Part 1 of 2: Compute the number of bytes of shared memory needed
    // This is used in the kernel invocation below
    int sharedMemSize = numThreadsPerBlock * sizeof(int);

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
For More Information
- CUDA, Supercomputing for the Masses: Part 14
- CUDA, Supercomputing for the Masses: Part 13
- CUDA, Supercomputing for the Masses: Part 12
- CUDA, Supercomputing for the Masses: Part 11
- CUDA, Supercomputing for the Masses: Part 10
- CUDA, Supercomputing for the Masses: Part 9
- CUDA, Supercomputing for the Masses: Part 8
- CUDA, Supercomputing for the Masses: Part 7
- CUDA, Supercomputing for the Masses: Part 6
- CUDA, Supercomputing for the Masses: Part 5
- CUDA, Supercomputing for the Masses: Part 4
- CUDA, Supercomputing for the Masses: Part 3
- CUDA, Supercomputing for the Masses: Part 2
- CUDA, Supercomputing for the Masses: Part 1
Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at [email protected].