Example 2: Shrinking the Consumer Critical Section
Our first optimization is to have each node allocate its contained queue item on the heap and hold it by pointer instead of by value. To experienced parallel programmers, performing an extra heap allocation might seem like a bad idea at first, because heap allocations are notorious scalability busters on many of today's memory allocators. But experience is no substitute for measurement: It turns out that, even on a system with a nonscalable allocator, the benefits often outweigh the costs. Holding the queued object by pointer lets us get greater concurrency and scalability among the consumer threads, because we can move the work of actually copying its value out of the critical section of code that updates the shared data structure.
I'll show only the changed lines. In Node itself, of course, we need to hold the values by pointer, and adjust the constructor slightly to match:
// Example 2 (diffs from Example 1):
// Moving the copying work out of
// the consumer's critical section
//
struct Node {
  Node( T* val ) : value(val), next(nullptr) { }
  T* value;
  atomic<Node*> next;
};

LowLockQueue() {
  first = divider = last = new Node( nullptr );
  producerLock = consumerLock = false;
  //size = maxSize = 0;
}
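Because only the diffs are shown, it may help to see the data members the diffs assume. The following is a minimal sketch based on Example 1's declarations earlier in this series; the comments on sharing and the omission of any padding between members are my simplifications, not a verbatim copy:

// Sketch only: the enclosing class members assumed by the diffs above.
template <typename T>
class LowLockQueue {
  // ... Node, constructor, Produce, and Consume as shown in this article ...
private:
  Node* first;                // accessed by producer only
  atomic<Node*> divider;      // shared: boundary between consumed and unconsumed nodes
  atomic<Node*> last;         // shared: tail of the queue
  atomic<bool> producerLock;  // shared: exclusivity among producers
  atomic<bool> consumerLock;  // shared: exclusivity among consumers
};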
The key changes are in Consume: We can now move the copying of the dequeued object, and the deletion of the value, outside the critical section:
bool Consume( T& result ) {
  while( consumerLock.exchange(true) )
    { }                                  // acquire exclusivity
  if( divider->next != nullptr ) {       // if queue is nonempty
    T* value = divider->next->value;     // take it out
    divider->next->value = nullptr;      //  of the Node
    divider = divider->next;             // publish that we took an item
    consumerLock = false;                // release exclusivity
    result = *value;                     // now copy it back to the caller
    delete value;
    return true;                         // and report success
  }
  consumerLock = false;                  // release exclusivity
  return false;                          // queue was empty
}
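To make the effect concrete, here is a hypothetical consumer loop (not from the article; LargeObject, Process, and done are placeholder names). Each successful Consume call now holds consumerLock only long enough to unlink a node and steal its value pointer; the potentially expensive copy into result and the delete happen after the lock is released, so other consumers can enter the critical section in the meantime:

// Hypothetical usage sketch: LargeObject and Process are placeholders.
void ConsumerThread( LowLockQueue<LargeObject>& queue, atomic<bool>& done ) {
  LargeObject item;
  while( !done ) {
    if( queue.Consume( item ) ) {  // the copy and delete happen outside the lock
      Process( item );             // do something with the dequeued value
    }
  }
}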
Produce only changes in its first line, where new Node( t ) becomes new Node( new T(t) ).
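For completeness, here is a sketch of what Produce looks like with that one-line change applied, assuming the Example 1 body shown earlier in this series; everything after the first line, including the producer-side lazy trimming of already-consumed nodes, is unchanged:

void Produce( const T& t ) {
  Node* tmp = new Node( new T(t) );    // the only change: allocate the T on the heap
  while( producerLock.exchange(true) )
    { }                                // acquire exclusivity
  last->next = tmp;                    // publish the new item to consumers
  last = last->next;                   // swing last forward
  while( first != divider ) {          // lazily trim nodes already consumed
    Node* tmp = first;
    first = first->next;
    delete tmp;                        // value was already nulled and deleted by the consumer
  }
  producerLock = false;                // release exclusivity
}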
This change should allow better concurrency among consumers. But will it? Before reading on, ask yourself: How would you expect this to affect the magnitude and shape of the performance graphs? How will it affect overall throughput, scalability, contention, and the cost of oversubscription?
Measuring Example 2
Figure 2 shows the results of running Example 2 in the same test harness and on the same hardware. Again, larger circles mean better throughput on the same graph, and pay attention to each graph's "max" (largest circle) and "min" (smallest circle) notes.
Clearly, both throughput and scalability have improved across the board. Our peak throughput is better by a factor of 3 for small objects, and by an order of magnitude for large objects. From this we can deduce that Example 1's throttling of consumers was a severe limiting factor. But peak throughput improved because we improved scalability: We now reach the peak at larger numbers of producers and consumers. For large objects, we're able to saturate the 24-core machine, reaching peak throughput at 15 producer threads and 9 consumer threads.
Unfortunately, for small objects we peak too early: We can usefully harness only about half of the available cores before contention becomes the limiting factor, and adding more producers and consumers, even on otherwise-idle cores, actually accomplishes less total work. For small objects, we don't even reach the dashed line. Even for large objects, contention among consumers still seems to be a limiting factor, as shown by the thinness of the top half of the graph.
Finally, note that for large objects, not only did we reach the dashed line with throughput still rising, but we spilled across it in the lower, fewer-consumers region without losing much throughput. The cost of oversubscription is lessened, at least when adding extra producers; adding more producers doesn't get more total work done, but it doesn't hurt total work much either. Both contention and oversubscription are still costly, however, when we add extra consumers.
That's a good first step, but let's keep going.