In Part 5 of this article series on CUDA (short for "Compute Unified Device Architecture"), I discussed memory performance and the use of shared memory in reverseArray_multiblock_fast.cu. In this installment, I examine global memory using the CUDA profiler.
Astute readers of this series timed the two versions of the reverse array example discussed in Part 4 and Part 5 and were puzzled: how can the shared memory version be faster than the global memory version? Recall that the kernel in the shared memory version, reverseArray_multiblock_fast.cu, copies array data from global memory to shared memory and then back to global memory, while the slower kernel, reverseArray_multiblock.cu, only copies data from global memory to global memory. Since global memory is roughly 100x to 150x slower than shared memory, shouldn't the significantly slower global memory performance dominate the runtime of both examples? Why is the shared memory version faster?
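For reference, here is a rough sketch of the two kernels in question. It follows the structure described in Parts 4 and 5 rather than reproducing the exact listings, and it assumes the grid of threads exactly covers the array (so no bounds checks are shown).

```cuda
// Rough sketch of the two kernels (not the exact listings from Parts 4 and 5).
// Both assume the total number of threads launched equals the array length.

// reverseArray_multiblock.cu style: read an element from global memory and
// write it straight back to global memory at its mirrored position. Note
// that consecutive threads write to decreasing addresses.
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x)
              + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

// reverseArray_multiblock_fast.cu style: stage the block's elements in
// shared memory (reversed within the block), then write the block back to
// global memory at the mirrored block offset, with consecutive threads
// writing to consecutive addresses.
__global__ void reverseArrayBlockFast(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int in = blockDim.x * blockIdx.x + threadIdx.x;

    // Load one element per thread, storing it in reversed order in shared memory.
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Wait until every thread in the block has written its element.
    __syncthreads();

    // Write the data out in forward order, but to the mirrored block offset.
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    d_out[outOffset + threadIdx.x] = s_data[threadIdx.x];
}
```

In the shared memory version, the size of the dynamically allocated s_data array is supplied as the third argument of the kernel launch configuration.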
Answering this question requires understanding more about global memory, plus the use of additional tools from the CUDA development environment -- specifically, the CUDA profiler. Profiling CUDA software is fast and easy, as both the text and visual versions of the profiler read hardware profile counters on CUDA-enabled devices. Enabling text profiling is as simple as setting the environment variables that start and control the profiler; using the visual profiler is equally easy: just start cudaprof and start clicking in the GUI. Profiling provides valuable insight, and the collection of profile events is handled entirely by hardware within CUDA-enabled devices. Note, however, that profiled kernels are no longer asynchronous; results are reported to the host only after each kernel completes, which minimizes any communications impact.
Global Memory
Understanding how to use global memory efficiently is an essential requirement for becoming an adept CUDA programmer. What follows is a brief discussion of global memory that should be sufficient to understand the performance difference between reverseArray_multiblock.cu and reverseArray_multiblock_fast.cu. Future columns will, of necessity, continue to explore efficient uses of global memory. In the meantime, a detailed discussion of global memory, with illustrations, can be found in Section 5.1.2.1 of the CUDA Programming Guide.
Global memory delivers the highest bandwidth only when the global memory accesses of a half-warp can be coalesced, so that the hardware can fetch (or store) the data in the fewest possible transactions. Devices of compute capability 1.0 and 1.1 can fetch data in a single 64-byte or 128-byte transaction. If the accesses cannot be coalesced, then a separate memory transaction is issued for each thread in the half-warp, which is undesirable. The performance penalty for non-coalesced memory operations varies with the size of the data type. The CUDA documentation provides some rough guidelines for the performance degradation to expect for various data type sizes:
- 32-bit data types will be roughly 10x slower
- 64-bit data types will be roughly 4x slower
- 128-bit data types will be roughly 2x slower
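These penalties, like the 64-byte and 128-byte transactions mentioned above, depend on the word size each thread accesses. The following sketch uses hypothetical kernels of my own (not code from this series) to show coalesced reads of 32-, 64-, and 128-bit words and the transactions they map to on compute capability 1.0/1.1 hardware.

```cuda
// Hypothetical kernels reading 32-, 64-, and 128-bit words. When thread k of
// a half-warp reads word k of an aligned, contiguous array, the 16 accesses
// coalesce into:
//   float  (32-bit)  -> one  64-byte transaction per half-warp
//   float2 (64-bit)  -> one 128-byte transaction per half-warp
//   float4 (128-bit) -> two 128-byte transactions per half-warp
__global__ void copy32(float *d_out, const float *d_in, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d_out[i] = d_in[i];
}

__global__ void copy64(float2 *d_out, const float2 *d_in, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d_out[i] = d_in[i];
}

__global__ void copy128(float4 *d_out, const float4 *d_in, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d_out[i] = d_in[i];
}
```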
Global memory accesses by all threads in a half-warp of a block can be coalesced into efficient memory transactions on the G80 architecture when the following conditions are met (the sketch after this list contrasts access patterns that satisfy and violate them):
- The threads access 32-, 64- or 128-bit data types.
- All 16 words of the transaction must lie in the same segment of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words). This implies that the starting address and alignment are important.
- Threads must access the words in sequence: the kth thread in the half-warp must access the kth word. Note that not all threads in a warp need to access memory for the accesses that do occur to coalesce; a warp in which some threads do not participate is called a "divergent warp".
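To make these conditions concrete, the sketch below uses hypothetical kernels (again, mine rather than code from this series) to contrast an access pattern that coalesces on compute capability 1.0/1.1 hardware with two patterns that do not.

```cuda
// Hypothetical kernels illustrating the coalescing rules for compute
// capability 1.0/1.1 devices. Assume d_in and d_out come from cudaMalloc
// (so they are suitably aligned) and that the block size is a multiple of 16.

// Coalesced: thread k of each half-warp accesses word k of an aligned,
// contiguous 64-byte segment -> one transaction per half-warp.
__global__ void copyCoalesced(float *d_out, const float *d_in, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_in[i];
}

// Not coalesced: a stride of two breaks the "kth thread accesses the kth
// word" rule, so a separate transaction is issued for each thread.
__global__ void copyStride2(float *d_out, const float *d_in, int n)
{
    int i = 2 * (blockDim.x * blockIdx.x + threadIdx.x);
    if (i < n)
        d_out[i] = d_in[i];
}

// Not coalesced: shifting the access by one word misaligns the half-warp
// with respect to the segment boundary, so the accesses cannot be combined
// on these devices even though they are sequential.
__global__ void copyOffset1(float *d_out, const float *d_in, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x + 1;
    if (i < n)
        d_out[i] = d_in[i];
}
```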
Newer architectures such as the GT200 family of devices have more relaxed coalescing requirements than those just discussed. I will discuss architectural differences in more depth in a future column. For now, suffice it to say that if you tune your code to coalesce well on a G80 CUDA-enabled device, it will coalesce well on a GT200 device.