Configuring VC++ Multithreaded Memory Management
Gerbert Orasche
Many programmers have discovered VC++ programs that actually execute significantly slower when run on a multi-processor machine. This kind of effect is usually an indication that multiple threads are fighting over the same resource. In the case of VC++ programs, the problem often lies in the details of how the VC++ memory-management functions work. This article describes the problem and suggests a cheap solution you can use to quickly speed up these kinds of bottlenecked programs when they're running on multi-processor machines. Surprisingly, you can even use this fix to speed up VC++ programs that you don't have the source code for!
The Problem
The C and C++ runtime libraries provide heap memory-management routines: malloc() and free() for C, and new and delete for C++. Whenever you allocate memory by calling malloc() or new, the function searches the heap for a free chunk of memory large enough to satisfy the request. If there is no free chunk large enough, then the runtime library has to request new pages from the operating system. Pages are the units the virtual memory manager operates on and are 4,096 bytes each under NT on Intel processors. When you free up memory by calling free() or delete, that chunk of memory is put back into the heap and can be reused for another allocation request.
While these operations seem quite trivial, they are not. See [1] for an inside view of a memory manager's problems. The problem I want to describe in more detail occurs when multiple threads allocate memory at the very same time. This happens often on multi-processor machines, but it can also happen when a thread is preempted at just the wrong time on a single-processor machine.
Consider two threads within the same process. Thread 1 tries to allocate a block of 1,024 bytes. At exactly the same time, Thread 2, which is running on another processor, requests a block of 256 bytes. The memory manager finds a free chunk of memory for Thread 1, and the very same routines find the same block for Thread 2. If both threads simultaneously try to update the internal data structures that keep track of what is allocated and what size it is, the heap will likely be corrupted. Even if the allocation routines return successfully, two separate threads will each believe they own the same memory chunk. It is only a matter of time until the program malfunctions.
Situations like these are called race conditions and are well-known enemies of all multithreading programmers. The cure is to guard the memory-manager routines with a lock. Locks guarantee mutual exclusion of the threads running the same code. If one thread is already accessing the guarded code, all other threads have to wait. This solution is also called serialization.
NT provides several lock implementations. CreateMutex() creates a system-wide lock object and is by far the slowest way to go. A critical section (created via InitializeCriticalSection()) is a faster form of lock for threads within a single process. Even better performance can be gained with spin locks, which have been available in NT 4 since Service Pack 3. See the VC++ help for InitializeCriticalSectionAndSpinCount() for more details [2]. Interestingly, although the help page mentions that these spin locks are used by NT's heap manager (HeapAlloc() and friends), the VC++ runtime library heap-management routines do not use spin locks to synchronize access to their heap. If you look at the source code of the VC++ runtime library, you will see that a simple critical section is used for all memory operations. I'll revisit this fact later: if you could somehow get the VC++ runtime library to use HeapAlloc() instead of its own internal heap, you would get the speed advantage of spin locks instead of critical sections.
By using a critical section to synchronize access to its heap, the VC++ runtime library can safely allow multiple threads to allocate and free memory. However, that synchronization can cause performance problems; this is called memory contention. If a thread tries to access the heap while another thread is using it, the first thread has to wait and will lose its timeslice, resulting in a thread context switch. Thread context switches under NT are normally fairly inexpensive because the cost of a switch is a small percentage of a thread's normal timeslice. However, when you have multiple threads fighting over access to the same heap, the result can be many more context switches than usual, enough to cause a drastic performance penalty.
Symptoms
How do you know your multi-processor system is suffering from memory contention? There is a simple way to find out. Open the Performance Monitor in the Administrative Tools and add the Context Switches/sec counter from the System object. Now run the multithreaded application you want to test and add the % Processor Time counter from the Process object for that process. You will then see how many context switches happen while the process is under high load.
Doing thousands of context switches per second is quite normal under heavy load. When the counter jumps to 80,000 or 100,000 switches, most of the processor time is wasted on switching thread contexts. A little arithmetic shows what is happening: at 100,000 switches per second, a single thread runs for only 10 microseconds before it is switched out. The normal timeslice on Windows NT Server is about 12 milliseconds, which is more than one thousand times longer.
Figure 1 shows a performance monitor graph with excessive context switches, whereas Figure 2 shows the very same process under the same load with my proposed solution (described next) applied. In the first case, the system is doing more than 120,000 context switches per second. With the fix applied, the number of switches is less than 1,000. Both pictures were taken from a run of the same test program. There were three simultaneous threads doing allocations with a maximum size of 2,048 bytes. The hardware platform was a dual Pentium II 450 machine with 256MB RAM.
Solutions
Some third-party memory-management libraries exist, such as SmartHeap by MicroQuill [3], and most of them use multiple heaps with try-locks or similar multithreading techniques. Have a look at the Hoard Multiprocessor Memory Allocator website [4] for a free source-code implementation. In the September 1999 issue of the C/C++ Users Journal [5], Kevin Manley proposes another solution to the very same problem by implementing thread-private heaps. All these methods have one big disadvantage: you need access to the source code and have to incorporate new code into your application.
My answer to the problem can be applied to existing applications, provided they were created with VC++ and dynamically linked against the C runtime. You have to make sure that version 6 of msvcrt.dll is installed and that NT has at least Service Pack 5. For VC++ v6.0 or later, the solution also works with applications statically linked against libcmt.lib.
Whenever a VC++ application starts up, the C runtime library is initialized. One of its first actions is to determine which heap manager to use. The VC++ v6 runtime library can use its own internal set of heap-management routines, use heap-management routines compatible with the previous version of the compiler, or call the operating-system heap-management routines (HeapAlloc() and friends) directly. Internally, __heap_select() performs the following three steps:
1. Check the operating-system version. If running on NT and the major version is five or higher (Windows 2000 or later), then HeapAlloc() is used.
2. Look for the environment variable __MSVCRT_HEAP_SELECT. If present, it influences which heap is used. If the value starts with __GLOBAL_HEAP_SELECTED, the behavior of all programs is changed. When a full path to an executable is found instead, the code checks whether it names the current application by calling GetModuleFileName(). Which heap type is selected depends on a value appended after a separating comma: 1 to use HeapAlloc(), 2 to use the VC++ v5 heap, and 3 to use the VC++ v6 heap.
3. Evaluate the linker build flags in the executable. If the application was built with VC++ v6 or later, the version 6 heap is used; all others use version 5.
Now, how can you speed up your applications? If an application is dynamically linked against msvcrt.dll, make sure that the DLL is not older than February 1999. I tested with a version dated 2/10/99 18:33 (266,293 bytes) with NT Service Pack 5 installed. For statically linked applications, the linker version flag should be 6 or higher. You can verify this with quickview.exe, which comes on the NT CD-ROM. To change the selected heap for applications started from a command line, type the following:
set __MSVCRT_HEAP_SELECT=__GLOBAL_HEAP_SELECTED,1
All programs started from this command line will inherit this environment setting and thus use HeapAlloc() (which uses spin locks rather than critical sections). If you want to configure all applications on your system to use the faster heap, run the System applet in the Control Panel, select the Environment tab, click once in the system variables field, and enter the variable name and value. After this, press the Set button and close the dialog. To commit the changes, reboot your computer.
According to Microsoft, the VC++ v6 heap manager is more aggressive about reusing freed memory, so buggy applications compiled with older compilers can fail even though they worked well with the old C runtime. If you encounter problems after forcing programs to use the system heap, you can unset the environment variable for the specific application by using a batch file. The finest control comes from specifying only the applications whose heap you want to change. For example:
set __MSVCRT_HEAP_SELECT=c:\program files\myapp\myapp.exe,1 c:\bin\buggyapp.exe,2
You should type this in as a single line. Here, myapp.exe uses HeapAlloc(), and buggyapp.exe still uses the VC++ v5 heap.
Test Program
To verify the behavior on your multi-processor machine, I have provided a small console-mode test program, heaptest.c (Listing 1). main() first checks the given parameters. The first parameter defines the number of threads that will allocate memory in parallel. The second parameter is the maximum size of the blocks to allocate; the third is the number of allocation operations for each thread.
After validating the input, main() creates the given number of worker threads. The threadData structure passes the count variables to each thread. workerThread() does the actual work: it seeds the random number generator and then performs the given number of malloc() and free() operations. The main thread waits for the worker threads to terminate by calling WaitForMultipleObjects() and then prints how long the threads were working. The timing is not very accurate, but it is enough to get a feeling for the magnitude.
To compile the test program, you must have installed VC++ v6.0. Open a command-line window and type:
cl /MT heaptest.c
The /MT switch links the program against the static C runtime multithreading version. To link against the dynamic C runtime, use the /MD switch instead. This is useful if you still have VC++ v5 and a newer msvcrt.dll. Now run the program, and look at the number of context switches in the performance monitor. Set the environment variable as described previously, and rerun with the same arguments.
Using HeapAlloc() caused no additional fragmentation or memory consumption that I could see, and the test program did not reveal any negative performance impact on single-processor machines. In fact, the program ran about 10 percent faster on a single-processor Pentium II machine when using HeapAlloc()!
When capturing the two figures, the test program took 60,953 milliseconds for 3,000,000 memory allocations when using the VC++ v6 heap. After switching to HeapAlloc(), the very same operations took only 5,291 milliseconds. Thus, using HeapAlloc() was more than 10 times faster in this special case. The same speedup was also seen in real-life applications.
Conclusion
Having multiple processors seems like an automatic performance gain, but as this article shows, multiple processors contending for the same resources can easily make a multi-processor system slower than a single-processor one. For C/C++ programs, the most common culprit is multiple threads performing intensive memory management. As presented here, a little environment setting can dramatically improve the performance of multithreaded applications on multi-processor computers. No access to source code and no re-linking of executables is necessary. Best of all, the speedup is completely free.
References
[1] Richard Jones and Rafael Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management (John Wiley, 1996), ISBN 0471941484.
[2] MSDN online: msdn.microsoft.com/library/psdk/winbase/synchro_20ok.htm.
[3] SmartHeap SMP online: www.microquill.com.
[4] The Hoard Multiprocessor Memory Allocator online: www.cs.utexas.edu/users/emery/hoard/.
[5] Kevin Manley. Improving Performance with Thread-Private Heaps, C/C++ Users Journal (September 1999), pages 50-62.
Gerbert Orasche is a software engineer at Hyperwave, a company producing knowledge portal and Intranet software. Please send any resumes or ideas to [email protected] (beer and pizza welcome).