Example 4: Padding To Avoid False Sharing
The changes for our final Example 4 are extremely simple compared to the changes in Examples 2 and 3. In fact, we're not going to make any changes to the program logic at all: We're simply going to follow the advice that "if variables A and B are...liable to be used by two different threads, keep them on separate cache lines" to avoid false sharing or "ping-ponging," which limits scalability [2]. In this case, we want to add padding to ensure that different nodes (notably the first and last nodes), the first and last pointers to those nodes, and the producerLock and consumerLock variables are all on different cache lines:
struct Node {
    Node( T* val ) : value(val), next(nullptr) { }
    T* value;
    atomic<Node*> next;
    char pad[CACHE_LINE_SIZE - sizeof(T*) - sizeof(atomic<Node*>)];
};

char pad0[CACHE_LINE_SIZE];
Node* first;                  // for one consumer at a time
char pad1[CACHE_LINE_SIZE - sizeof(Node*)];
atomic<bool> consumerLock;    // shared among consumers
char pad2[CACHE_LINE_SIZE - sizeof(atomic<bool>)];
Node* last;                   // for one producer at a time
char pad3[CACHE_LINE_SIZE - sizeof(Node*)];
atomic<bool> producerLock;    // shared among producers
char pad4[CACHE_LINE_SIZE - sizeof(atomic<bool>)];
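As an aside, the same cache-line separation can also be expressed declaratively with C++11 alignas rather than hand-counted pad arrays. The following is only a minimal sketch, not the measured code: it assumes a 64-byte cache line, the LowLockQueue wrapper name is mine, and the queue object itself must land on a cache-line boundary for the member alignment to pay off (plain new guarantees that for over-aligned types only as of C++17).

#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE_SIZE = 64;   // assumed; query the target hardware in real code

template<typename T>
struct LowLockQueue {                         // hypothetical wrapper, for illustration only
    struct Node {                             // could likewise be declared struct alignas(CACHE_LINE_SIZE) Node
        Node( T* val ) : value(val), next(nullptr) { }
        T* value;
        std::atomic<Node*> next;
    };

    // alignas pushes each hot variable onto its own cache line,
    // replacing the explicit pad0..pad4 arrays shown above.
    alignas(CACHE_LINE_SIZE) Node* first;                      // for one consumer at a time
    alignas(CACHE_LINE_SIZE) std::atomic<bool> consumerLock;   // shared among consumers
    alignas(CACHE_LINE_SIZE) Node* last;                       // for one producer at a time
    alignas(CACHE_LINE_SIZE) std::atomic<bool> producerLock;   // shared among producers
};

Either way, the goal is the same: first, last, and the two locks each live on their own cache line, so a producer hammering on producerLock never invalidates the line holding consumerLock or first.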
That seems like a pretty trivial change. Is it really worth its own section? Again, ask yourself: Is this likely to have any effect on throughput, scalability, contention, or oversubscription? If so, how much, and why?
Measuring Example 4
Figure 4 shows the Example 4 performance results. Again, the effects are less noticeable for large objects, where the cost of copying dominates and was already moved out of the critical section in earlier code revisions.
For small objects, peak throughput has improved by another 20 percent over Example 3 by extending scalability right to the dashed line representing the total number of hardware cores. As a bonus, we've also reduced the contention dropoff to the dashed line and beyond, smoothing the transition into oversubscription. As in every case, the sparse top half of each graph shows that larger numbers of consumers are still causing significant contention, but we've pushed the consumer contention effect out (upwards) somewhat for both small and large objects.
Direct Comparisons
Graphs like Figures 1-4 are useful to get a sense of the shape of our code's scalability for varying amounts of different kinds of work, and allow some visual comparison among different versions of the code. But let's consider a more direct comparison by graphing Examples 1-4 together on a line graph. Because all of the graphs so far show that our code tends to favor having more producers than consumers, I've chosen to analyze the slice of data along the line #producers = 2 * #consumers (i.e., 2/3 of threads are producers, 1/3 are consumers), which goes through the fat high-throughput part of each of the previous graphs to allow for fairer comparison.
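To make that slice concrete, here is a small sketch (the helper name is mine, not part of the measurement harness) of the producer/consumer split it implies for a given total thread count; for example, at 24 total threads the slice passes through 16 producers and 8 consumers.

#include <utility>

// On the #producers == 2 * #consumers slice, 2/3 of the threads produce
// and 1/3 consume. E.g., 24 total threads -> 16 producers, 8 consumers.
std::pair<int,int> sliceSplit( int totalThreads ) {
    const int consumers = totalThreads / 3;
    return std::make_pair( consumers * 2, consumers );
}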
Figure 5 shows each example's throughput along this selected slice for small queue items, with automatically generated polynomial trend lines to connect the dots. Notice that each vertical slice contains several dots of the same example; this is because I ran each test several times, and the vertical spread in dots shows performance variability in repeated executions of the same test.
First, consider the left-hand graph in Figure 5, which shows the straightforward throughput numbers: As we now know to expect, Example 1 clearly has the worst performance, and each of the other examples successively improves both peak throughput (the magnitude of each curve's high point) and scalability (how far to the right each peak occurs). You can see how each successive change we made removed an invisible barrier and let the curve continue growing a bit further. The downslope on the right-hand side of each curve shows the throughput-slowing effects of contention. Oversubscription would show up on this graph as a discontinuity as we pass 24 threads; here, with small objects, it's not very visible except for Example 3, as we also saw earlier.
But there's another useful way to scale the graph. Still in Figure 5, the right-hand graph shows exactly the same data as the left-hand graph, only we scale it by dividing each line's data by that same line's initial value and the number of threads. This is a more direct way to measure the relative scalability of each version of the code: the ideal is the grey horizontal bar, which shows linear scalability, and we want our curve to float as high and as close to that as possible. As before, Example 1 is the worst, because it quickly dives to awful depths. Example 4 is the best: It's still delivering over 50 percent scalability until just before the machine is fully saturated; getting nearly 4 times the performance using 7 times the cores (21 versus the baseline 3) may not be linear scaling, but it's still fairly decent scaling and nothing to sneeze at.
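For concreteness, here is one way the right-hand graph's normalization could be computed. This is a sketch under my own interpretation, not the actual measurement harness: it assumes throughput samples keyed by thread count, a baseline of 3 threads, and that "scalability" means observed throughput divided by what perfect linear scaling from the baseline would deliver (so the grey linear-scalability bar sits at 1.0, and a roughly 4x speedup on 7x the cores works out to about 0.57, i.e., just over 50 percent).

#include <map>

// Relative scalability: observed throughput divided by the throughput we
// would see if the baseline result scaled linearly with the thread count.
std::map<int, double> relativeScalability(
    const std::map<int, double>& throughput,   // thread count -> items/sec
    int baselineThreads )                      // e.g., 3
{
    std::map<int, double> result;
    const double base = throughput.at( baselineThreads );
    for( const auto& sample : throughput ) {
        const double ideal = base * sample.first / baselineThreads;   // perfect linear scaling
        result[sample.first] = sample.second / ideal;
    }
    return result;
}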
Now consider Figure 6, which shows each example's throughput along the same slice, but this time for large queue items. Again, the dotted curves are automatically generated polynomial trendlines to connect the dots; we'll come back to these in a moment.
What can we conclude from Figure 6? There's much less difference among Examples 2 through 4 for larger queued items where the cost of copying dominates; or, equivalently, when the producer and consumer threads are doing more work than just madly inserting and removing queue items as quickly as possible and throwing them away. Most of the benefit came from just applying the change in Example 2 to hold the queued item by pointer in order to gain more concurrency in the consumer, thereby removing what evidently was a consumer bottleneck. On the right-hand graph it's evident that all examples scale about equally well for large objects, and all are much better than they were in Figure 5 for small objects: 70 percent scaling until right before we achieve machine saturation. And those pretty polynomial trendlines help us visualize what's going on better...
Or do they?