How would you write a fast, internally synchronized queue, one that callers can use without any explicit external locking or other synchronization? Let us count the ways...or four of them, at least, and compare their performance. We'll start with a baseline program and then successively apply three optimization techniques, each time stopping to measure each change's relative performance for queue items of different sizes to see how much each trick really bought us.
All versions of the code we'll consider have one thing in common: They control concurrency using (the equivalent of) two spinlocks, one for each end of the queue. Our goal in refining the code is to allow separate threads calling Produce and/or Consume to run as concurrently as possible, while also improving scalability to get more total throughput through the queue when we use larger numbers of producer and/or consumer threads.
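The spinlock idiom itself is just an atomic<bool> flag acquired with exchange and released with a plain store. Here is a minimal standalone sketch of that pattern (the names Acquire and Release are illustrative only); the queue code below embeds this same idiom directly on its producerLock and consumerLock flags.

#include <atomic>

std::atomic<bool> spinLock( false );   // false == unlocked

void Acquire() {
    while( spinLock.exchange(true) )   // spin until we flip the flag from false to true
        { }
}

void Release() {
    spinLock = false;                  // store false to release exclusivity
}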
I'll analyze the throughput of each version of the code by varying three values:
- The number of producer threads. In my test harness, each producer thread is a tight loop that calls Produce with no other work.
- The number of consumer threads. Each consumer thread is a tight loop that calls Consume with no other work.
- The size of the queued objects: I'll test queues of objects of two different sizes, "small" and "large" objects that contain 10 and 100 strings, respectively, each 26 characters long, so that enqueueing or dequeueing an object costs 10 or 100 deep string copies. This actually lets us accomplish two things: a) measure how performance depends on the size and copying cost of the queued objects; and b) indirectly simulate the performance effect of how much "other work" the producer and consumer threads may be doing in real-world workloads, which don't usually call Produce and Consume in tight loops that just throw away the values without using them. (A sketch of such a payload type follows this list.)
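To make that copying cost concrete, here is a minimal sketch of what such a payload type might look like; the names TestItem, SmallItem, and LargeItem are illustrative, not part of the actual test harness.

// Hypothetical payload type for illustration only: an object holding
// NumStrings strings of 26 characters each, so that copying it into or
// out of the queue costs NumStrings deep string copies.
#include <string>
#include <vector>

template<int NumStrings>
struct TestItem {
    std::vector<std::string> strings;
    TestItem() : strings( NumStrings, std::string( 26, 'x' ) ) { }
};

typedef TestItem<10>  SmallItem;    // "small" queued object: 10 strings
typedef TestItem<100> LargeItem;    // "large" queued object: 100 strings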
Let's dive in.
Example 1: Baseline
The baseline program is adapted from the code in [1] by adding a spinlock on each end of the queue to allow only one producer and one consumer to update the queue at a time. We will allow some concurrency among producer threads because the producer spinlock doesn't cover the entire body of the Produce function, but we will allow only one call to Consume to run at a time. This code follows the policy in [1] that only producer threads ever actually change the queue; consumed nodes aren't freed by the consumer, but are lazily cleaned up by the next producer.
The contained objects are held by value:
// Example 1: Baseline code
//
template <typename T>
struct LowLockQueue {
private:
    struct Node {
        Node( T val ) : value(val), next(nullptr) { }
        T value;
        atomic<Node*> next;
    };
The control variables include pointers to the first and last nodes in the underlying list, and a pointer to a divider element that marks the boundary between the producer and consumer:
    Node *first, *last;          // for producer only
    atomic<Node*> divider;       // shared
    atomic<bool> producerLock;   // shared by producers
    atomic<bool> consumerLock;   // shared by consumers
Note that, as described in [1], we always maintain at least one dummy node at the front, so the first unconsumed node is actually the one after divider. Any nodes before divider are already-consumed nodes that can be lazily freed by the next producer. The constructor sets up that invariant, and the destructor (in Java or .NET, the disposer) simply walks the list to free it:
public:
    LowLockQueue() {
        first = divider = last = new Node( T() );
        producerLock = consumerLock = false;
    }
    ~LowLockQueue() {
        while( first != nullptr ) {
            Node* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }
Consume copies the value contained in the first unconsumed node to the caller and updates divider to mark that the node has been consumed. Note that the entire body of Consume is inside the critical section, which means we get no concurrency among consumers in this version of the code:
    bool Consume( T& result ) {
        while( consumerLock.exchange(true) )
            { }                               // acquire exclusivity
        if( divider->next != nullptr ) {      // if queue is nonempty
            result = divider->next->value;    // copy it back to the caller
            divider = divider->next;          // publish that we took an item
            consumerLock = false;             // release exclusivity
            return true;                      // and report success
        }
        consumerLock = false;                 // release exclusivity
        return false;                         // queue was empty
    }
Produce allocates a new node and adds it to the tail of the list, then performs the lazy cleanup of consumed nodes at the front of the list, if any. Note that not all of the body of Produce is inside the exclusive critical section: many producers can concurrently be allocating their new nodes and copying their new items' values into them, which allows some concurrency among producers:
    bool Produce( const T& t ) {
        Node* tmp = new Node( t );            // do work off to the side
        while( producerLock.exchange(true) )
            { }                               // acquire exclusivity
        last->next = tmp;                     // publish the new item
        last = last->next;                    // (more about this later)
        while( first != divider ) {           // lazy cleanup
            Node* tmp = first;
            first = first->next;
            delete tmp;
        }
        producerLock = false;                 // release exclusivity
        return true;
    }
};
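To make the usage model concrete, here is a rough sketch, not the actual test harness, of how producer and consumer threads might drive this queue in the tight loops described earlier. It assumes a C++11 compiler, the LowLockQueue above, and a copyable payload type like the SmallItem sketched earlier; ProducerLoop, ConsumerLoop, and RunTest are illustrative names.

#include <functional>
#include <thread>
#include <vector>

void ProducerLoop( LowLockQueue<SmallItem>& q, int count ) {
    for( int i = 0; i < count; ++i )
        q.Produce( SmallItem() );               // tight loop, no other work
}

void ConsumerLoop( LowLockQueue<SmallItem>& q, int count ) {
    SmallItem item;
    for( int taken = 0; taken < count; )
        if( q.Consume( item ) )                 // spin until an item is available
            ++taken;
}

void RunTest( int producers, int consumers, int itemsPerProducer ) {
    LowLockQueue<SmallItem> q;
    std::vector<std::thread> threads;
    int itemsPerConsumer = producers * itemsPerProducer / consumers;
    for( int i = 0; i < producers; ++i )
        threads.emplace_back( ProducerLoop, std::ref(q), itemsPerProducer );
    for( int i = 0; i < consumers; ++i )
        threads.emplace_back( ConsumerLoop, std::ref(q), itemsPerConsumer );
    for( auto& t : threads )
        t.join();                               // wait for all producers and consumers
}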
How well would you expect this version of the code to scale for different numbers of producers and consumers, or for different kinds of queued objects? Let's find out.
Measuring Example 1
Figure 1 shows the results of running Example 1 using different numbers of producer and consumer threads, and the two sizes of queued objects. The larger the circle, the better the throughput; and each graph has a corner note showing the "max" (largest circle) and "min" (smallest circle) values on that graph. Note that the size of each circle represents relative throughput for that graph only; in this example, even the smallest circle on the left-hand graph (18,700 objects moved through the queue per second, where each object contains 10 strings) represents better throughput than the largest circle on the right-hand graph (11,300 objects per second, where each object contains 100 strings).
This test already illustrates four important properties to consider for parallel throughput. The first key property is overall throughput, or the total amount of work the system is able to accomplish. Here, we can move up to 88,000 small objects or 11,300 large objects through the Example 1 queue every second.
The second key property is scalability, or the ability to use more cores to get more work done. For both sizes of queued objects, Example 1 isn't very scalable at all: We get our best throughput at two producers and one consumer, and adding more producers and consumers doesn't get more objects through the queue.
The third key property is contention, or how much different threads interfere with each other by fighting for resources. The small-object case (left graph) shows acute contention, because total throughput drops dramatically when more producers and/or consumers are added, even when those producers and consumers are running on otherwise-idle cores; this is, unfortunately, a great example of negative scalability. For larger queued objects, the problem is still visible but much less acute; throughput does not drop off as rapidly, at least not until we get to the dashed line.
Incidentally, speaking of that dashed line, I never described the hardware I used to run these tests. If you've been wondering how many cores were available on my test machine, you can read the answer off the right-hand graph: The dashed line where we see a major performance cliff is where #consumers + #producers = 24, the number of CPU cores on the test system.
The right-hand graph's diagonal dropoff is a classic artifact of the fourth key property, the cost of oversubscription, or having more CPU-bound work ready to execute than available hardware to execute it. Clearly, we can't scale to more cores than actually exist on a given machine, but there can be a real cost to exceeding that number. In this example, especially when more than half the available threads are producers, oversubscription causes a dramatic increase in churn and contention and loss of overall system throughput. In this test case, a larger number of consumers had less effect, as illustrated by the "spillover" across the top part of the graph.
Let's see if we can improve total throughput and scalability, while driving down contention and the cost of oversubscription.