In the concurrent world, locality is a first-order issue that trumps most other performance considerations. Now locality is no longer just about fitting well into cache and RAM, but also about avoiding scalability busters by keeping tightly coupled data physically close together and separately used data far, far apart.
Of Course, You'd Never Convoy On a Global Lock
Nobody would ever willingly write code like this in a tight loop.
// Threads 1-N
while( ... ) {
    globalMutex.lock();
    DoWork();
    globalMutex.unlock();
}
This is clearly foolish, because all the work is being done while holding a global lock and so only one thread can make progress at a time. We end up with a classic lock convoy: At any given time, all of the threads save one are piled up behind the lock as each waits idly for its turn to come.
Convoys are a classic way to kill parallel scalability. In this example, we'll get no parallel speedup at all because this is just a fancy way to write sequential code. In fact, we'll probably take a minor performance hit from acquiring and releasing the lock many times and incurring context switches, so we would be better off just putting the lock around the whole loop and making the convoy more obvious.
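Here is a minimal sketch of that alternative, reusing the same placeholder names as the snippet above (globalMutex, DoWork, and the loop bounds are illustrative, not real APIs): one lock acquisition per thread instead of one per iteration, which drops the per-iteration lock/unlock and context-switch overhead and makes the sequential nature of the work explicit.

#include <mutex>
#include <thread>
#include <vector>

std::mutex globalMutex;

void DoWork() { /* work that touches shared state */ }

void Worker() {
    // Hold the lock around the whole loop: the convoy is now obvious,
    // and we pay for only one acquire/release per thread.
    std::lock_guard<std::mutex> hold(globalMutex);
    for (int i = 0; i < 1000; ++i) {
        DoWork();
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(Worker);
    for (auto& th : threads) th.join();
}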
True, we sometimes still gain some parallel benefit when only part of each thread's work is done while holding the lock:
// Threads 1-N
while( ... ) {
    DoParallelWork();      // p = time spent here,
                           //     parallel portion
    globalMutex.lock();
    DoSequentialWork();    // s = time spent here,
                           //     sequential portion
    globalMutex.unlock();
}
Now at least some of the threads' work can be done in parallel. Of course, we still hit Amdahl's Law [1]: Even on infinitely parallel hardware with an infinite number of workers, the maximum speedup is merely (s+p)/s, minus whatever overhead we incur to perform the locking and context switches. But if we fail to measure (ahem) and the time spent in p is much less than in s, we're really gaining little and we've again written a convoy.
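To make that bound concrete, here is a small worked example with made-up per-iteration timings for s (time inside the lock) and p (time outside the lock); the numbers and names are purely illustrative. When p is much smaller than s, the best possible speedup is barely above 1, which is exactly the convoy the text warns about.

#include <iostream>

// Upper bound from Amdahl's Law: the limit of (s+p)/(s + p/n) as the
// number of workers n goes to infinity.
double MaxSpeedup(double s, double p) {
    return (s + p) / s;
}

int main() {
    std::cout << MaxSpeedup(10.0, 90.0) << "\n";  // s=10ms, p=90ms: at most 10x
    std::cout << MaxSpeedup(90.0, 10.0) << "\n";  // s=90ms, p=10ms: at most ~1.11x
}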
We'd never willingly do this. But the ugly truth is that we do it all the time: Sometimes it happens unintentionally, as when some function we call takes a lock on a popular mutex unbeknownst to us (a sketch of this hidden-lock case appears below). But often it happens invisibly, when the hardware helpfully takes exactly such an exclusive lock automatically, and silently, on our behalf. Let's see how, why, and what we can do about it.
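As an entirely hypothetical illustration of that unintentional case, consider a logging helper that looks innocent at each call site but takes a process-wide mutex internally: a loop that contains no visible lock still serializes every thread on that mutex. All names here (g_logMutex, Log, Worker) are made up for illustration.

#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex g_logMutex;   // one "popular" mutex shared by every caller

void Log(const std::string& msg) {
    // The lock is hidden inside the helper; callers never see it.
    std::lock_guard<std::mutex> hold(g_logMutex);
    static std::ofstream file("app.log");
    file << msg << '\n';
}

void Worker(int id) {
    for (int i = 0; i < 1000; ++i) {
        // ... per-thread work ...
        Log("thread " + std::to_string(id) + " finished item " + std::to_string(i));
        // Every iteration now waits its turn on g_logMutex, re-creating the
        // convoy from the first example with no visible lock in this loop.
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(Worker, t);
    for (auto& th : threads) th.join();
}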