Developing Class QLock
Next, I looked at the Pthread-w32 library. The code I discovered in that library's implementation of pthread_once
(pthread_once.c, CVS version 1.16) used an approach radically different from either of the two techniques just discussed: If multiple threads enter pthread_once
simultaneously, the first thread that has to wait (the second thread overall) creates a synchronization object (a Windows event object, in this case). The last thread to leave the routine destroys the object. An atomic reference count keeps track of the number of threads involved. All of the problems of the previous two approaches seemed to be avoided. Unfortunately, a closer examination of the code uncovered a race condition. After a few attempts at fixing the problem (see pthread-w32 mailing list archive if you are interested in the details; sources.redhat.com/ml/pthreads-win32/), it became clear that this general approach is fundamentally flawed and cannot be fixed.
Although reference counting used by the pthread_once
implementation did not work, trying to fix it gave me some ideas. What if each thread that had to be suspended (waiting for another thread to complete the initialization of a Singleton), created and then destroyed its own synchronization object? For instance, I knew how the MCS implementation of spin-locks, to reduce contention, makes each thread wait spinning on its own flag. Could I use a similar technique, but instead of spinning, make each thread wait on its own semaphore or event object? It turned out that it was not only possible to make such a technique work, but the resulting code was fairly simple, too. Listing Nine (available electronically, see mutex.) uses this specialized waiting technique to implement my Singleton.
The central idea of this new approach is encapsulated in class QLock
. Each instance of this class represents a lock held on the mutex that was passed as a constructor argument. A mutex itself, which is represented by the QLock::Mutex
type, is just a pointer, which is set to zero when the mutex is not locked. When the mutex is locked, it points to the tail of the queue of threads waiting on this mutex. Each thread in the queue is represented by the instance of class QLock
that the thread created to obtain the lock. In other words, the mutex is a pointer to the QLock
instance created by the last thread waiting in the queue. This QLock
instance is linked to another QLock
instance, which was created by the previous thread in the queue, and so on. The QLock
instance at the head of the queue corresponds to the thread that currently holds the lock.
A Closer Look at QLock
Table 1 lists the QLock
class data members, while Table 2 lists the values that d_readyFlag
and d_nextFlag
may have. The constructor QLock::QLock
atomically exchanges the mutex pointer for this. If the mutex pointer was 0 prior to the exchange, the lock is obtained and the constructor completes. If the mutex pointer was not zero, this instance of QLock
has to join the queue and wait for the ready flag.
Data Member | Description |
d_mutex | Pointer to the mutex locked by this QLock instance. |
d_next | Pointer to the next instance of QLock in queue waiting to obtain the lock. |
d_readyFlag | Flag used to indicate that the lock has been released by the predecessor. |
d_nextFlag | Flag used to indicate that the next pointer (d_next) has been set by a successor. |
Table 1: QLock class data members.
Value | Description |
0 | Initial value indicating that the flag is not set. |
-1 | Indicates that the flag was set before an event object had been installed into the flag. |
Neither 0 nor -1 | Indicates that an event object has been installed into the flag, which now holds the handle of the event object. |
Table 2: Values that d_readyFlag and d_nextFlag may have.
The destructor QLock::~QLock
, using one atomic compare-and-exchange operation, checks if the mutex pointer contains the value of this pointer. If it does, it resets the mutex pointer to zero, thus releasing the mutex. If the mutex no longer contains the value of this pointer, the queue has been formed, and the destructor must pass the lock to the next instance in the queue. The destructor first waits on the d_nextFlag
to make sure that the next pointer has been set, then sets the d_readyFlag
.
The algorithms used in QLock
's constructor/destructor are basically the same as those used to lock/unlock MCS spin-locks. The setFlag
and waitOnFlag
methods are where we make our important deviation from MCS locks. Instead of spinning (as is appropriate for a spin-lock), the waitOnFlag
routine:
- Creates an event object.
- Atomically checks that the flag has not yet been set and installs the event object's handle into the flag.
- Proceeds to wait on the event object.
Now compare our QLock
-based solution for the Singleton problem with the two previous solutions (spin-lock-based and named-mutex-based). Similar to the named-mutex-based solution, you avoid the unpredictable behavior of the spin-lock: Instead of spinning, waiting threads are suspended by the kernel, and resumed as soon as the mutex is available. Additionally, you avoid the problems with the named mutex: Event objects used for synchronization are process local, do not require artificial names, and are created only when threads need to be suspended due to contention. When there is no contention (when only one thread enters the Singleton::instance
method at the time of initialization), no synchronization objects are created: The overhead is simply that of a few atomic operations.
Observations
After using the synchronization mechanisms implemented by the QLock
class to solve the Singleton problem on Windows (and more generally, the problem with the static initialization of mutexes on Windows), I discovered that QLock
has some other important properties that made it attractive for other applications (including on UNIX). QLock::Mutex
has a small memory footprint (one pointer), and a small initialization cost (that of initializing a pointer with 0). Additionally, when there is no contention, QLock
is fast. If no wait is necessary, locking a QLock
requires just one atomic instruction and no system calls. If there are no waiting threads, unlocking is also just one atomic instruction.
Furthermore, when there is contention, the cost of constructing event objects (or semaphores on UNIX) can be virtually eliminated by using a lock-free pool of allocated event objects. This pool works particularly well because the number of event objects required never exceeds the number of waiting threadsan inherently small number.
One common scenario in which QLock
appears to be of great use is when you have a large collection of some kind of items that are accessed from multiple threads. To increase concurrency, it may be logically possible and desirable to maintain a mutex per item of the collection (instead of maintaining a single mutex for the entire collection). It is often the case, though, that the memory footprint and the initialization cost of a standard mutex or critical section would be prohibitive due to the (sometimes very large) size of the collection. QLock
, because of its small size and negligible initialization cost, may be able to provide a viable solution to fine-grain locking.
Vladimir is a senior architect and head of application infrastructure at Bloomberg LP. He can be contacted at [email protected].