While multithreaded programming is not a new paradigm, the increasing adoption of multiprocessor machines means that developers can now seriously consider multithreaded software design as a way to improve the performance of desktop applications. In this article, I'll examine the most common pitfalls in multithreaded programming and highlight issues specific to Hyper-Threading (HT) Technology, a technology developed by Intel that enables a single processor to run two separate threads simultaneously (http://www.intel.com/labs/htt/).
Parallelism, the simultaneous processing of different data or tasks, is achieved through two models: data decomposition and functional decomposition. As the names imply, these two models represent very different yet complementary methods of applying multiple threads to achieve higher performance within a single process. Data decomposition means applying the same independent operation to different sets of data; functional decomposition means performing independent pieces of work on asynchronous threads. One model can invoke the other, which is what makes the two complementary.
Some obvious advantages arise from implementing multithreaded algorithms. Threads can run concurrently on multiprocessor or multiprocessor-like hardware, yielding higher performance and better responsiveness: something your end users will notice and appreciate. Having one ready-to-run thread per logical processor on an HT system means an application can consume otherwise idle processor cycles (again working toward that goal of higher performance). Looking further ahead, threading an application today likely means built-in scalability when it runs on future hardware. With all these advantages, why not at least consider a threaded approach to implementing an application?
Three Threading Methods
The simplest means of parallelizing loops (data decomposition) is to let the compiler do all the work. The Intel C++ Compiler 7.0 (and later) can analyze loops and automatically parallelize those that it considers good candidates. The -Qparallel switch enables the feature; following a few simple guidelines increases the likelihood that the compiler will identify a parallelizable loop.
First, expose the trip count of the loop: the compiler must be able to determine that the trip count does not change once the loop begins executing. Second, avoid branching out of the loop. And third, avoid referencing global data or making procedure calls from within the loop body. Keep in mind, though, that failure to adhere to these guidelines does not automatically disqualify a loop from automatic parallelization. Use the -Qpar_report3 switch to have the compiler generate a report on which loops were successfully parallelized and which dependencies prevented parallelization of others. (-Qpar_report[n] allows varying degrees of reporting.)
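For illustration, a loop along the following lines (the function and names are hypothetical) satisfies all three guidelines and is the kind of candidate the compiler can parallelize on its own:

// Hypothetical loop that follows the guidelines above: the trip count
// (n) is known before the loop starts, there is no branching out of
// the loop, and the body touches only the arrays passed in.
void scale(float* a, const float* b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

// Illustrative compile line:  icl -Qparallel -Qpar_report3 scale.cpp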
While the aforementioned compiler technology is impressive, oftentimes developers prefer to tell the compiler what to thread. After all, you know much more about the underlying algorithm than the compiler does. The OpenMP Standard (http://www.openmp.org/) gives you that kind of flexibility while limiting the time investment necessary to thread code. OpenMP works on an explicit fork/join model, so you must specify the start and end of a parallel region. Not all C++ compilers at the time of this writing support OpenMP, but the Intel C++ Compiler 7.0 (and later) does; the -Qopenmp switch enables processing of the OpenMP directives or pragmas to generate threaded code. Without this switch, the compiler will simply ignore the directives.
OpenMP allows easy parallelization of loops or regions of your code without large-scale modifications. In fact, the original serial code is left largely intact, as Listing 1 illustrates. Efficient parallelization is achieved with minimal programming effort, and measurable performance increases over the serial version are possible on a Pentium 4 processor with Hyper-Threading Technology.
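To give a flavor of what such a change looks like, here is a minimal illustrative loop (not the article's Listing 1) parallelized with a single OpenMP pragma:

void sum_arrays(const float* a, const float* b, float* c, int n)
{
    // One pragma parallelizes the loop; without -Qopenmp the pragma is
    // simply ignored and the original serial loop is compiled as-is.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}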
The third option is to use the thread libraries (Win32 threads or Pthreads) to drive explicitly defined threads. Thread libraries are flexible enough to address both data and functional decomposition (as is OpenMP), but using them tends to be invasive to the code base, time-consuming, and error prone for the inexperienced. Computations must be separated into functions that can be mapped to threads, the work must be manually divided among the threads, and explicit synchronization must be added to guarantee correct results. That said, the benefit of the threading libraries is the powerful flexibility and direct programmer control they provide.
A word on performance regarding the thread libraries: Creating threads is expensive. Thread creation and deletion should be avoided except in one-time initialization and destruction situations, respectively. If you are using the thread libraries, it is probably worthwhile to generate a thread pool during initialization, and set events to wake up threads to perform tasks when necessary. When the threads finish, they can go back to a resting state, releasing hardware resources for other tasks. While on the topic of performance, let's consider the impact of targeting your code for Hyper-Threading Technology.
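A minimal sketch of that pattern using the Win32 API might look like the following (error handling omitted; the worker's task is a stand-in for real work):

#include <windows.h>
#include <stdio.h>

// Hypothetical worker created once at startup. It rests on an event and
// wakes only when work is signaled, so no threads are created or
// destroyed in the steady state.
static HANDLE g_workReady;        // auto-reset event, signaled per task
static volatile LONG g_quit = 0;  // set to 1 at shutdown

static DWORD WINAPI WorkerThread(LPVOID)
{
    for (;;)
    {
        WaitForSingleObject(g_workReady, INFINITE);  // rest until woken
        if (g_quit)
            break;
        printf("worker: handling one task\n");       // stand-in for real work
    }
    return 0;
}

int main()
{
    // One-time initialization: create the event and the pooled thread.
    g_workReady = CreateEvent(NULL, FALSE, FALSE, NULL);
    HANDLE worker = CreateThread(NULL, 0, WorkerThread, NULL, 0, NULL);

    // Wake the worker for each task instead of creating a new thread.
    // (With an auto-reset event, rapid signals can coalesce; a real pool
    // would pair the event with a work queue and a count.)
    for (int i = 0; i < 3; i++)
        SetEvent(g_workReady);

    // One-time shutdown: tell the worker to exit and wait for it.
    InterlockedExchange(&g_quit, 1);
    SetEvent(g_workReady);
    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    CloseHandle(g_workReady);
    return 0;
}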
Hyper-Threading Coding
On HT-enabled systems, the two logical processors share common execution resources. Even though two threads are running concurrently, they are competing for some of the same hardware resources. Keep in mind some of the following common programming pitfalls when targeting an HT-enabled system to avoid hard-to-diagnose performance problems later on during the optimization cycle.
For synchronization on variables that you know are going to be imminently available, a simple spin-wait loop may suffice. Care is required, though, because spin-waits can quickly become a hindrance to performance. A tight spin-wait executes so efficiently that it ties up shared execution resources, starving the other logical processor and stalling progress on both. Inserting an assembly-level pause instruction alleviates the detrimental effects of spin-waits, as the following code shows:
// Tight spin-wait loop that can
// lead to performance issues
while (synch_var != some_val)
{
}

// Tight spin-wait loop with pause
// instruction; fixes performance
// issue from above
while (synch_var != some_val)
{
    _asm pause
}
The pause instruction hints to the processor that the loop is a spin-wait and inserts a momentary pause into execution, freeing processor resources for other work. Processors that precede HT-enabled processors effectively interpret the pause instruction as a no-op, so there is no need to create an HT-specific code path in this case.
Another way to slow down a tight spin-wait is to put the waiting thread to sleep, using either the POSIX sleep() or the Win32 Sleep(). The argument to sleep() is the number of seconds for the thread to sleep, while Sleep() takes milliseconds. Use whichever makes more sense for your application.
// Tight spin-wait loop that sleeps
// one tenth of one second
while (synch_var != some_val)
{
    Sleep(100);
}
When you know the thread will be waiting for some indeterminate amount of time, the best way to synchronize is to tell the OS what the thread is doing. Under Windows, using the thread-blocking APIs (e.g., WaitForSingleObject()) lets you do just that.
In addition to explicit synchronization, you should be careful of implicit synchronization. For example, allocating memory off the shared heap causes serialization because the heap is protected by a lock. Avoid this by using thread-local storage or per-thread stack allocation, via APIs such as TlsAlloc() or _alloca(). Accessing global variables may also cause implicit serialization.
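For instance, a per-thread scratch buffer avoids taking the heap lock in a hot path; the following is a rough sketch (the buffer use is hypothetical):

#include <malloc.h>   // _alloca
#include <string.h>

// Hypothetical per-thread helper. The scratch buffer lives on this
// thread's own stack, so no heap lock is taken and no serialization
// occurs between threads. (_alloca is only appropriate for small,
// bounded sizes; large requests risk stack overflow.)
void ProcessChunk(const char* src, int len)
{
    char* scratch = (char*)_alloca(len);  // thread-local stack allocation
    memcpy(scratch, src, len);
    // ... operate on scratch here ...
}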
64K Alias Conflicts
64K-aliasing conflicts happen when a virtual memory address references a cache line that is modulo 64 KB apart from another that already resides in the L1 cache. Similar conflicts occur at 1-, 2-, and 4-MB boundaries. If your profiler reports an increase of 5× or more in aliasing events on an HT-enabled system versus the same system with HT technology disabled, then alleviating some of the aliasing pressure is likely to yield a meaningful speedup. Generally, two cases cause such an impact when HT is enabled:
- Thread stacks aligned on 1- or 2-MB boundaries.
- Threads accessing data structures that happen to fall modulo 64 KB apart.
It is possible to avoid these common pitfalls by following some well-known guidelines. First, try adjusting thread stack allocation, thereby changing the relative addresses of the offending accesses. The Win32 default places thread stacks on 1-MB boundaries, which has the potential to cause serious performance penalties on HT-enabled systems. In the case of interleaved or interspersed loads and stores to the same 64-KB-aliased addresses, try grouping them: do multiple loads, then multiple stores.
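One way to apply the stack suggestion is to offset each thread's stack by a different amount as the thread starts; the sketch below assumes a 1-KB granularity, which is an arbitrary illustrative choice:

#include <windows.h>
#include <malloc.h>   // _alloca

// Hypothetical thread entry point. Each thread shifts its own stack by
// a different multiple of 1 KB before doing real work, so per-thread
// stack variables no longer collide modulo 64 KB or on the default
// 1-MB stack boundaries.
DWORD WINAPI ThreadFunc(LPVOID arg)
{
    int id = (int)(INT_PTR)arg;               // thread index: 0, 1, 2, ...
    void* offset = _alloca(1024 * (id + 1));  // offset this stack only
    (void)offset;
    // ... normal thread work follows ...
    return 0;
}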
For data structures, there are several methods to try. One is to simply have your memory allocator return structures of 68 KB instead of 64 KB. Alternatively, your memory allocator might pad and return random offsets that are multiples of 128 bytes (for cache friendliness and to avoid stalls due to unaligned loads). Finally, for multiple arrays, it is advisable to pad those that are a multiple of 64 KB in size, especially those accessed with the same index, so that corresponding elements do not fall exactly 64 KB apart in memory. The simplest way to pad arrays is to artificially increase their sizes or interleave their declarations with the declarations of other variables.
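As a simple sketch (the sizes are illustrative), padding an array whose size is a multiple of 64 KB keeps a second array, accessed with the same index, from landing exactly 64 KB away:

// Two float arrays of 16384 elements each occupy exactly 64 KB apiece,
// so a[i] and b[i] would collide modulo 64 KB if the arrays were laid
// out back to back. Padding the first array by one 128-byte cache
// block (an illustrative choice) breaks the alignment.
float a[16384 + 32];   // 64 KB plus 128 bytes of padding
float b[16384];        // 64 KB

void add_scaled(int n, float s)
{
    for (int i = 0; i < n; i++)
        b[i] += s * a[i];
}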
Effective cache utilization is critical to top-end performance on both single- and multithreading processors. False sharing and poor data locality are common issues related to the cache that occur in multithreaded programming.
False sharing can cause serious performance degradation on dual- and multiprocessor systems as well as on HT-enabled systems. False sharing happens when multiple threads own private data on the same cache block. For Pentium 4 and Xeon processors, a cache block is effectively 128 bytes. Determining whether false sharing is an issue is easy with a profiler (such as Intel's VTune Performance Analyzer), and fixing the problem can be as simple as padding out data structures. Perhaps an even better fix is to structure data and data accesses in a cache-friendly manner, aligned on 128-byte boundaries. These recommendations are complementary to those for avoiding 64K aliasing, so watching out for one pitfall can actually help prevent two or more.
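For example, per-thread counters are a classic false-sharing culprit; padding them out to the 128-byte block keeps each thread's writes on its own cache block (a sketch with hypothetical names):

// Without padding, two adjacent counters share one 128-byte cache block,
// and every increment by one thread invalidates the other thread's copy.
// Padding each counter out to 128 bytes gives every thread its own block.
// (A real fix might also align the array itself on a 128-byte boundary,
// for example with __declspec(align(128)) under the Microsoft compiler.)
struct PaddedCounter
{
    volatile long count;
    char pad[128 - sizeof(long)];
};

PaddedCounter counters[2];   // one element per thread

void Increment(int thread_id)
{
    counters[thread_id].count++;
}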
HT-enabled processors share the L2 cache, so targeting a specific cache size can be beneficial or detrimental to your application's performance. The cache-blocking technique restructures loops that make frequent passes over large data arrays by subdividing the large array into smaller blocks, or tiles. Each data element in the block is reused while the block still fits within the data cache, before the code moves on to the next block or tile. This technique is widely used in linear algebra and is a common transformation applied by compilers and game programmers. Because HT-enabled processors share the L2 cache, the most effective target block size is somewhere around 128 KB per thread, or roughly a quarter of the full L2 cache size.
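The following sketch illustrates the idea; the 128-KB tile size is an assumption based on a quarter of a 512-KB L2 and should be tuned for the actual target:

// Untiled version: each of the four passes streams through the whole
// array, so by the time a pass returns to data[0] it has long since
// been evicted from the cache:
//
//   for (int pass = 0; pass < 4; pass++)
//       for (int i = 0; i < n; i++)
//           data[i] = 0.5f * data[i] + 0.25f;
//
// Tiled version: all four passes reuse a tile while it is still
// resident in the shared L2 cache.
const int BLOCK = 32768;   // 32768 floats * 4 bytes = 128 KB per tile

void smooth(float* data, int n)
{
    for (int base = 0; base < n; base += BLOCK)
    {
        int end = (base + BLOCK < n) ? (base + BLOCK) : n;
        for (int pass = 0; pass < 4; pass++)
            for (int i = base; i < end; i++)
                data[i] = 0.5f * data[i] + 0.25f;
    }
}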
Write-combining (WC) store buffers are also a shared resource on HT-enabled systems. The Pentium 4 processor has six WC buffers, and in the past it has been recommended that programmers target four of the six, as the OS and other processes are likely consuming at least two. An example of targeting four WC buffers is writing to four addresses that fall on unique cache lines. With the WC buffers being shared, the recommendation for processors with HT technology is to only target two unique WC buffers per running thread. The solution for avoiding unwanted WC buffer evictions is to break apart a loop such that it only targets the recommended number of WC buffers.
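For example (with hypothetical arrays), a loop that streams to four output arrays can be split into two loops that each write only two, matching the per-thread recommendation:

// Original loop: each iteration writes to four distinct output streams,
// competing for four WC buffers (and, with a second thread, for eight).
void write_four(float* a, float* b, float* c, float* d, int n)
{
    for (int i = 0; i < n; i++)
    {
        a[i] = 1.0f; b[i] = 2.0f; c[i] = 3.0f; d[i] = 4.0f;
    }
}

// Fissioned version: each loop targets only two output streams, the
// recommended number of WC buffers per thread on an HT-enabled system.
void write_four_split(float* a, float* b, float* c, float* d, int n)
{
    for (int i = 0; i < n; i++)
    {
        a[i] = 1.0f; b[i] = 2.0f;
    }
    for (int i = 0; i < n; i++)
    {
        c[i] = 3.0f; d[i] = 4.0f;
    }
}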
Conclusion
In this article, I've introduced models for achieving parallelism in your code and discussed some of the common pitfalls, along with their workarounds, in developing threaded code for Hyper-Threading processors. To close, I'll leave you with these last comments. Even if you keep an eye out for these issues and common pitfalls when developing threaded code, you will still find yourself debugging and fine-tuning some part of your application. Profile often, use your tools' options aggressively, and test your threaded code on multithreading hardware. Who knows, you may very well come across a few serial optimization opportunities in addition to finding threading issues, but that is a whole different topic.
William E. Damon is a software engineer with Electronic Arts. He can be contacted at [email protected].