A system where multiple threads coexist in the same address space is extremely vulnerable to data corruption and deadlock. While there is a trade-off between these two problems, deadlock remains the ultimate bug and is almost impossible to eradicate.
This article will discuss a technique for detecting and breaking deadlocks in a multithreaded environment. It will also present a simple and powerful way to analyze why the deadlock occurred.
A deadlock is a situation where each thread in a set of at least two threads is waiting for a resource (shared data is also considered a resource), which is locked by another thread in the set. The result is that each thread in the set is waiting indefinitely for the resources to be released. There are three major deadlock-handling strategies:
- Ignoring assuming deadlocks rarely occur and not taking any action to handle them.
- Avoidance avoiding states that may lead to a deadlock by keeping track of resource allocation.
- Detection and breaking detecting the deadlock and breaking the cycle, usually by killing one of the threads in the cycle.
Each strategy has its advantages and weaknesses, but none gives a perfect solution to the deadlock problem. This article will focus on the design and implementation of one common technique to detect deadlocks. The code was compiled using Visual C++ 6 and was tested under Windows 2000. Note that in order to simplify the code, I don't perform any error checking and all my WaitForObjects() wait infinite time.
The Algorithm
The basic idea is to track each lock request, successful lock, and release of lock that is made in our system. The application updates the locks monitor about each lock operation and the information about the locks is entered into a directed graph in the following ways:
- When the thread tries to lock a resource, insert a node from the thread to the resource.
- When the thread acquires the lock, reverse the direction of the node from the resource to the thread.
- When the thread releases a resource, erase the node from the resource to the thread.
Figure 1 represents the following deadlock scenario (T-stands for Thread, R-Resource):
- T1 locks R1
- T2 locks R2
- T1 tries to lock R2
- T2 tries to lock R1
A cycle in the graph represents a deadlock. Note that a cycle can be closed only when a thread tries to lock a resource.
Along with finding a deadlock, by keeping track of the locks in the application, we can get some extra information; for example, if a thread tries to lock a resource while already owning the resource or if the thread tries to release a resource without owning it. These situations, although not necessarily presenting a problem, can be the result of an unexpected flow in the program and it can be helpful to be notified about them.
Locks Monitor Design
The process of tracking and analyzing the locks is done by what I call the "locks monitor," shown in Figure 2. The locks monitor is controlled by a manager, which creates the locks monitor, starts it, and ends it. The manager also makes sure that only one locks monitor exists and runs in the application (running more than one locks monitor in an application will give poor results).
The locks monitor interacts with the rest of the application via the messages queue. For each lock operation (try, own, or release), the application sends a message to the queue and the locks monitor retrieves the message and processes it. Each message contains three elements: the lock operation, the thread performing the lock operation, and the resource, which is the destination of the operation.
The locks monitor consists of three main components:
- Messages processor as its name implies, this part gets messages from the queue and processes them. This part is the main loop of the locks monitor and controls its flow.
- Directed graph holds the information about the locks in a graph as mentioned in the algorithm description.
- Logger a simple logger to notify about possible problems. When a deadlock occurs, the logger will dump some useful information about the state of the application that will help to reproduce the scenario that led to the deadlock.
Messages Queue
The most important part of the locks monitor is the messages queue.
It synchronizes the messages arriving from the sending threads along with the receiving thread (the locks monitor). If it is not handled carefully, the messages queue can create a bottleneck and slow down performance and, if the queue doesn't do the synchronization right, the results will be unpredictable. In order to assure high efficiency and good performance of this important component, I decided to use the Win32 messaging mechanism.
The locks monitor is created and terminated by the locks monitor manager. The manager is implemented in the class CLocksMonitorManager. It is implemented as a singleton, meaning there could be only one instance of it in the application. (See References for a more detailed explanation of singletons.) To get this instance, you should call this static method:
CLocksManager()::GetInstance()
The manager is the object that creates the thread in which the locks monitor runs. The manager has another static method, GetLMThreadId(), which holds the ID of the locks monitor thread. This method is used by the threads in the application when sending messages to the locks monitor thread.
The locks monitor thread is implemented in the LocksMonitorThreadProc() function (Listing 1) and is quite simple: All it does is sit in a loop and pull messages from the queue. For each message received, the thread calls the appropriate method in the class CLocksMonitor, which handles the message.
CLocksMonitor uses two major data members, which are responsible for most of the work. The first one is a member of type CLocksLogger, which is a simple logger, used to dump info into a file when problems (such as a deadlock) arise.
Note that the logger can produce a full stack trace for any given thread. It does it with the help of a data member of type CStackTrace. This class produces a stack trace using the StackWalk() function, which is part of the Win32 API. In order to get a stack trace when your application deadlocks, you must make sure that you have dbgHelp.dll on your computer and that you produced .pdb files for your application. If the .pdb files are missing, all you will be able to see is a bunch of addresses instead of a meaningful stack trace. A more detailed discussion about this class is out of the scope of this article. See References for more information.
The second member is of type CDirectGraph. It holds the direct graph using an STL map:
struct Edge { HANDLE m_hID; bool m_bThread; } typedef map<HANDLE, Edge> EdgeList; typedef map<Edge, EdgeList> GraphMap; . . . GraphMap m_graphMap;
An Edge is a simple struct that contains the thread ID or a handle to a resource, and a Boolean member that tells if this edge represents a thread or a resource. Each entry in the GraphMap maps an edge to a list of edges. This edge has nodes that point to each edge in the list of edges. For example, if we have three edges: E1, E2, and E3, and the GraphMap maps edge E1 to the list of edges {E2, E3}, this represents the existance of the following nodes in the direct graph: node from E1 to E2 (E1-->E2) and a node from E1 to E3 (E1-->E3).
In the application part, I implemented the operation of putting the messages in the queue. For that purpose, I created some macros that will do the job automatically. The macros replace each WaitForObjects() function with MyWaitForObjects() and any ReleaseResource() with MyReleaseResource():
define WaitForSingleObject(hResource, dwMilliseconds) \ MyWaitForSingleObject(hResource, dwMilliseconds); define WaitForMultipleObjects(nCount, pHandles, bWaitAll, dwMilliSec) \ MyWaitForMultipleObjects(nCount, pHandles, bWaitAll, dwMilliSec); define ReleaseMutex(hMutex) \ MyReleaseMutex(hMutex);
All MyFunctions() are actually the basic functions with the addition of posting the proper messages to the locks monitor. Note that in the message, I pass the thread ID rather than the thread handle. This is because using GetCurrentThread() will return only a pseudohandle, which is actually a constant (0xfffffffe). When needed, I can obtain the thread handle from the thread ID by calling OpenThread(). (The OpenThread() function is supported only under Windows 2000/Me/XP and is part of the Microsoft platform SDK. If you use NT or 98, or you don't have the appropriate platform SDK installed, you can build a map that maps thread ID to the thread handle instead of using OpenThread().)
Listing 2 shows the implementation of MyWaitForSingleObject() and MyReleaseMutex().
Integration
To integrate the locks monitor into your application, you need to perform just two steps:
- Add to each of your files (or at least the ones that perform any lock operation) the file locksWrapperMacros.h. This is done in order to replace the regular WaitForObjects() with the MyWaitForObjects(), which updates the locks monitor.
- In the entry point of your application, call CLockMonitorMgr::GetInstance()->Start() to start the locks monitor. At the exit point from your application call CLockMonitorMgr::GetInstance()->End() to terminate it.
That's all. Now your application is ready to work with the locks manager. Listing 3 is a simple example that demonstrates how to use the locks monitor. In the example, I created two threads (Thread_1, Thread_2) and two resources (g_hResource[0], g_hResource[1]), and produced the following scenario:
Thread_1 locks g_hResource[0] and try to lock g_hResource[1] Thread_2 locks g_hResource[1]and try to lock g_hResource[0]
I added sleep in order to guarantee a deadlock. Note that the calls to WaitForSingleObject() and ReleaseMutex() are being replaced with MyWaitForSingleObject() and MyReleaseMutex(). This replacement is totally transparent to the programmer and is made by the macros defined in locksWrapperMacros.h. The output of the example is dumped into lockMonitor.log.
In the output (Listing 4) you see that a deadlock was detected, and a full stack trace was obtained for each thread in the cycle. Afterward, one thread from the cycle was killed in order to break the deadlock.
Known Bugs
The implementation presented here has one major drawback: It doesn't deal correctly with WaitForMultipleObjects() when the bWaitAll parameter is False. Think of the following situation (shown in Figure 3):
T1 locks R1, T2 Locks R2, and T3 Locks R3. Now T1 waits for R2 and T2 waits for R1 or R3
In my implementation, the locks monitor will detect a deadlock even though there is no deadlock: T3 will release R3, T2 will lock R3 and will stop trying to lock R1, and the cycle is broken.
It is possible to upgrade the locks monitor so it will handle this state correctly. The reason I didn't do it is because it will increase the complexity of the code and will make the code and the basic idea much harder to understand.
Another problem is the fact that the locks monitor will work only if you use a mutex to gain ownership over a resource. This is because of the different behavior of different synchronization objects.
If you wait on a mutex after you owned it, the second wait will return immediately and no harm is done. But, if for some reason you chose to use event instead of a mutex (not recommended), you will find that the second wait on the event causes a deadlock. Another difference is that with a mutex, a thread that didn't acquire the lock can't release it, which is not the case with events.
Again, although the locks monitor behavior can be enhanced to handle other synchronization objects, I didn't do it in order to keep things simple.
Conclusion
In this article, I presented an approach for detecting deadlocks. I also presented a way for the programmer to get useful information about the flow that led to the deadlock.
Obviously, there is no perfect solution for deadlock. Even when detected and broken, it is still hard to estimate how breaking the deadlock by killing a thread will affect the application. Still, by using the locks monitor you can minimize the damage and, based on the info that the locks monitor gives, you can get to the root of the problem that caused the deadlock.
References
C/C++ Users Journal. "A Per-Thread Singleton," May 2002, by Puneesh Chaudhry.
Windows Developer Magazine, August 2002 (Tech Tips). More info about producing a stack trace.
Microsoft Systems Journal. "Under the Hood," April 1997, by Matt Pietrek.
"Detecting and Breaking Deadlocks," by Ben Adida, http://web.mit.edu/6.033/1997/reports/r03-ben.html.
Tomer Abramson holds a BSc in Computer Science from Ben Gurion University. He has been programming for four years, mostly in C++, and currently works as a software engineer for cti2 in the telephony group.