Example 3: Reducing Head Contention
In Examples 1 and 2, the producer was responsible for lazily removing the nodes consumed since the last call to Produce. But that's bad for performance for several reasons, notably because it forces a producer to touch both ends of the queue, so every thread that uses the queue, whether producer or consumer, has to touch the queue's head end. Even though a producer and a consumer don't use the same spinlocks and so can run fully concurrently with respect to each other, the fact that they touch the same memory inherently adds invisible contention: updates to the memory containing the head nodes have to be propagated to all threads on other cores, not just to the consumer threads that naturally have to touch the head end to do their work.
In Example 3, we'll let each consumer be responsible for trimming the node it consumed (which it was touching anyway), which gives better locality. The first thing we notice is that we can get rid of divider, itself a source of contention because it was used by both consumers and producers:
// Example 3 (diffs from Example 2):
// Moving cleanup to the consumer
//
LowLockQueue() {
  first = last = new Node( nullptr ); // no more divider
  producerLock = consumerLock = false;
}
Consume now doesn't need to deal with divider, but it must take on the work of cleaning up the previous, now-unneeded first dummy node when it consumes an item:
bool Consume( T& result ) {
  while( consumerLock.exchange(true) )
    { }                             // acquire exclusivity
  if( first->next != nullptr ) {    // if queue is nonempty
    Node* oldFirst = first;
    first = first->next;
    T* value = first->value;        // take it out
    first->value = nullptr;         // of the Node
    consumerLock = false;           // release exclusivity
    result = *value;                // now copy it back
    delete value;                   // and clean up
    delete oldFirst;                // both allocations
    return true;                    // and report success
  }
  consumerLock = false;             // release exclusivity
  return false;                     // queue was empty
}
Next, Produce becomes simpler because we can eliminate the lazy cleanup code. However, just eliminating that code leads to a very subtle pitfall because one existing line also has to change. Can you see why?
bool Produce( const T& t ) {
  Node* tmp = new Node( t );        // do work off to the side
  while( producerLock.exchange(true) )
    { }                             // acquire exclusivity
  last->next = tmp;                 // A: publish the new item
  last = tmp;                       // B: not "last->next"
  producerLock = false;             // release exclusivity
  return true;
}
Changing Responsibilities Can Introduce Bugs
Note that line B used to be "last = last->next;". That was always slightly inefficient because it needlessly reread last (a holdover from the original code written by someone else). Now, if left unchanged, it becomes something much worse: a small race window. Now that there's no divider and consumers clean up consumed nodes, the way consumers know there's an item available to be consumed is to check first->next; if it's not null, it's okay to go ahead and consume a node and delete what used to be the first one, because that node is no longer needed. The trouble arises when a sequence like the following occurs:
- Initially: queue is empty, first == last
- The producer (from Example 2 code, without the Example 3 correction) executes:
  last->next = tmp;     // A: publish
- The consumer performs an entire call to Consume, taking the just-published node and deleting the now-unnecessary previous first node before it
- Then the producer dereferences last:
  last = last->next;    // B: update last
                        // oops: accesses freed memory
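The sequence above can be replayed deterministically on a single thread. The following is only an illustrative sketch, not code from the examples: the Node type is stripped down to its next pointer, and the live() registry and the function name lastIsLiveAtLineB are hypothetical helpers added so the demo can detect the dangling pointer without ever dereferencing it.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical bookkeeping: every Node registers its address while it is
// allocated, so we can ask "is this address still a live node?" safely.
struct Node {
    Node* next = nullptr;
    static std::set<std::uintptr_t>& live() {
        static std::set<std::uintptr_t> addresses;
        return addresses;
    }
    Node()  { live().insert(reinterpret_cast<std::uintptr_t>(this)); }
    ~Node() { live().erase(reinterpret_cast<std::uintptr_t>(this)); }
};

// Replays the interleaving and reports whether the node that `last` refers
// to is still allocated when the producer reaches line B.
bool lastIsLiveAtLineB() {
    Node* first = new Node;               // dummy node
    Node* last  = first;                  // initially: empty, first == last
    const std::uintptr_t lastAddr = reinterpret_cast<std::uintptr_t>(last);

    Node* tmp = new Node;                 // producer: work off to the side
    last->next = tmp;                     // producer, line A: publish

    Node* oldFirst = first;               // consumer: an entire Consume call
    first = first->next;                  //   fits in here...
    delete oldFirst;                      //   ...and frees the old dummy node

    // Producer is now about to run line B. The uncorrected version,
    // last = last->next, would dereference the node checked here.
    const bool live = Node::live().count(lastAddr) != 0;

    last = tmp;                           // corrected line B: uses only tmp
    assert(last == first);                // queue is consistent again

    delete first;                         // clean up the remaining node
    return live;
}
```

Calling lastIsLiveAtLineB() returns false: by the time the producer reaches line B, the node that last refers to has already been deleted, so only the corrected assignment from tmp is safe.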
The key is that the act of publishing the new node (line A) not only advertises that the new node is ready to be consumed, but also implicitly transfers ownership of the preceding node to the consumer. Hence, line B must not dereference last again, but should just assign from tmp directly.
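For readers who want to experiment, the Example 3 pieces can be assembled into one compilable class. This is a minimal sketch under stated assumptions rather than the article's exact code: the Node layout (a T* value plus an atomic next pointer), the destructor, the use of new T(t) in Produce, and the transferSum helper are all assumptions filled in here, and any cache-line padding is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

template <typename T>
class LowLockQueue {
    struct Node {                          // assumed layout, not from the text
        explicit Node(T* val) : value(val), next(nullptr) {}
        T* value;
        std::atomic<Node*> next;
    };
    Node* first;                           // head end: consumers only
    std::atomic<bool> consumerLock;
    Node* last;                            // tail end: producers only
    std::atomic<bool> producerLock;

public:
    LowLockQueue()
        : first(new Node(nullptr)), consumerLock(false),
          last(first), producerLock(false) {}
    ~LowLockQueue() {                      // assumes no threads still use the queue
        while (first != nullptr) {
            Node* tmp = first;
            first = tmp->next;
            delete tmp->value;
            delete tmp;
        }
    }
    LowLockQueue(const LowLockQueue&) = delete;
    LowLockQueue& operator=(const LowLockQueue&) = delete;

    bool Produce(const T& t) {
        Node* tmp = new Node(new T(t));    // do work off to the side
        while (producerLock.exchange(true)) { }   // acquire exclusivity
        last->next = tmp;                  // A: publish; old node now belongs to consumers
        last = tmp;                        // B: assign from tmp, never reread last
        producerLock = false;              // release exclusivity
        return true;
    }

    bool Consume(T& result) {
        while (consumerLock.exchange(true)) { }   // acquire exclusivity
        Node* next = first->next;
        if (next != nullptr) {             // queue is nonempty
            Node* oldFirst = first;
            first = next;                  // consumed node becomes the new dummy
            T* value = first->value;
            first->value = nullptr;
            consumerLock = false;          // release exclusivity
            result = *value;               // copy out, then free both allocations
            delete value;
            delete oldFirst;
            return true;
        }
        consumerLock = false;              // release exclusivity
        return false;                      // queue was empty
    }
};

// Hypothetical smoke test: one producer and one consumer move the values
// 1..count through the queue; returns the sum the consumer observed.
long long transferSum(int count) {
    LowLockQueue<int> q;
    long long sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) q.Produce(i);
    });
    std::thread consumer([&] {
        int item = 0;
        for (int received = 0; received < count; ) {
            if (q.Consume(item)) { sum += item; ++received; }
        }
    });
    producer.join();
    consumer.join();
    int extra = 0;
    assert(!q.Consume(extra));             // queue should be fully drained
    (void)extra;
    return sum;
}
```

A passing run of transferSum shows only that this particular run did not fail; a window this small can easily stay hidden in a short test.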
"But," someone might object, "will this interleaving really happen? After all, A-B is a very small window for a call to Consume to fit into." True, it won't happen often. Based on experience, however, I can report that under heavy stress on a multicore system, this tends to fail once for every few tens of millions of items moving through the queue. This was the only race I wrote (that I know of) when putting these examples together, and it was a real pain to reproduce and diagnose.
Moral: When you change responsibilities for cleanup, code that used to be innocuous can suddenly turn into a subtle race window.
Measuring Example 3
But back to the main event: How much does moving the cleanup responsibility and reducing contention on the head of the queue really help? Again, before seeing my results, consider how much, and why, you think this is likely to affect throughput, scalability, contention, and the oversubscription penalty.
Figure 3 shows the Example 3 performance results. The effects are mainly on the left-hand small object graph, with only incremental improvements for large objects. For small objects, peak throughput has improved by nearly another factor of two, and we've again improved scalability and actually get close to reaching the dashed line, which represents our capacity for getting more work done using more cores. There is some dropoff due to contention as we exceed about 20 active threads (e.g., 12 producers and 8 consumers), and for the first time we can actually see the oversubscription wall on the left-hand graph beyond 24 threads. Although we'd like to scale that wall, right now we're happy to just be able to approach it in the first place!