Example 2: Shrinking the Consumer Critical Section
Our first optimization is to have each node allocate its contained queue item on the heap and hold it by pointer instead of by value. To experienced parallel programmers, performing an extra heap allocation might seem like a bad idea at first, because heap allocations are notorious scalability busters on many of today's memory allocators. But experience is no substitute for measurement: It turns out that, even on a system with a nonscalable allocator, the benefits often outweigh the costs. Holding the queued object by pointer lets us get greater concurrency and scalability among the consumer threads, because we can move the work of actually copying its value out of the critical section of code that updates the shared data structure.
I'll show only the changed lines. In Node itself, of course, we need to hold the values by pointer, and adjust the constructor slightly to match:
// Example 2 (diffs from Example 1):
// Moving the copying work out of
// the consumer's critical section
//
struct Node {
  Node( T* val ) : value(val), next(nullptr) { }
  T* value;
  atomic<Node*> next;
};

LowLockQueue() {
  first = divider = last = new Node( nullptr );
  producerLock = consumerLock = false;
  //size = maxSize = 0;
}
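Because only the diffs are shown, it may help to see the data members the diffs assume. The following is a minimal sketch based on Example 1's declarations earlier in this series; the comments on sharing and the omission of any padding between members are my simplifications, not a verbatim copy:

// Sketch only: the enclosing class members assumed by the diffs above.
template <typename T>
class LowLockQueue {
  // ... Node, constructor, Produce, and Consume as shown in this article ...
private:
  Node* first;                // accessed by producer only
  atomic<Node*> divider;      // shared: boundary between consumed and unconsumed nodes
  atomic<Node*> last;         // shared: tail of the queue
  atomic<bool> producerLock;  // shared: exclusivity among producers
  atomic<bool> consumerLock;  // shared: exclusivity among consumers
};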
The key changes are in Consume: We can now move the copying of the dequeued object, and the deletion of the value, outside the critical section:
bool Consume( T& result ) {
  while( consumerLock.exchange(true) )
    { }                                  // acquire exclusivity
  if( divider->next != nullptr ) {       // if queue is nonempty
    T* value = divider->next->value;     // take it out
    divider->next->value = nullptr;      //  of the Node
    divider = divider->next;             // publish that we took an item
    consumerLock = false;                // release exclusivity
    result = *value;                     // now copy it back to the caller
    delete value;
    return true;                         // and report success
  }
  consumerLock = false;                  // release exclusivity
  return false;                          // queue was empty
}
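To make the effect concrete, here is a hypothetical consumer loop (not from the article; LargeObject, Process, and done are placeholder names). Each successful Consume call now holds consumerLock only long enough to unlink a node and steal its value pointer; the potentially expensive copy into result and the delete happen after the lock is released, so other consumers can enter the critical section in the meantime:

// Hypothetical usage sketch: LargeObject and Process are placeholders.
void ConsumerThread( LowLockQueue<LargeObject>& queue, atomic<bool>& done ) {
  LargeObject item;
  while( !done ) {
    if( queue.Consume( item ) ) {  // the copy and delete happen outside the lock
      Process( item );             // do something with the dequeued value
    }
  }
}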
Produce only changes in its first line, where new Node( t ) becomes new Node( new T(t) ).
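For completeness, here is a sketch of what Produce looks like with that one-line change applied, assuming the Example 1 body shown earlier in this series; everything after the first line, including the producer-side lazy trimming of already-consumed nodes, is unchanged:

void Produce( const T& t ) {
  Node* tmp = new Node( new T(t) );    // the only change: allocate the T on the heap
  while( producerLock.exchange(true) )
    { }                                // acquire exclusivity
  last->next = tmp;                    // publish the new item to consumers
  last = last->next;                   // swing last forward
  while( first != divider ) {          // lazily trim nodes already consumed
    Node* tmp = first;
    first = first->next;
    delete tmp;                        // value was already nulled and deleted by the consumer
  }
  producerLock = false;                // release exclusivity
}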
This change should allow better concurrency among consumers. But will it? Before reading on, ask yourself: How would you expect this to affect the magnitude and shape of the performance graphs? How will it affect overall throughput, scalability, contention, and the cost of oversubscription?
Measuring Example 2
Figure 2 shows the results of running Example 2 in the same test harness and on the same hardware. Again, larger circles mean better throughput on the same graph, and pay attention to each graph's "max" (largest circle) and "min" (smallest circle) notes.
Clearly, both throughput and scalability have improved across the board. Our peak throughput is better by a factor of 3 for small objects, and by an order of magnitude for large objects. From this we can deduce that Example 1's throttling of consumers was a severe limiting factor. But peak throughput improved because we improved scalability: We now reach the peak at larger numbers of producers and consumers. For large objects, we're able to saturate the 24-core machine, reaching peak throughput at 15 producer threads and 9 consumer threads.
Unfortunately, for small objects we peak too early: We can usefully harness only about half of the available cores before contention becomes the limiting factor, and adding more producers and consumers, even on otherwise-idle cores, actually accomplishes less total work. For small objects, we don't even reach the dashed line. Even for large objects, contention among consumers still seems to be a limiting factor, as shown by the thinness of the top half of the graph.
Finally, note that for large objects, not only did we reach the dashed line with throughput still rising, but we spilled across it in the lower, fewer-consumers region without losing much throughput. The cost of oversubscription is lessened, at least when adding extra producers; adding more producers doesn't get more total work done, but it doesn't hurt total work much either. Both contention and oversubscription are still costly, however, when we add extra consumers.
That's a good first step, but let's keep going.