In the concurrent world, locality is a first-order issue that trumps most other performance considerations. Now locality is no longer just about fitting well into cache and RAM, but also about avoiding scalability busters by keeping tightly coupled data physically close together and separately used data far, far apart.
Of Course, You'd Never Convoy On a Global Lock
Nobody would ever willingly write code like this in a tight loop.
// Threads 1-N
while( ... ) {
    globalMutex.lock();
    DoWork();
    globalMutex.unlock();
}
This is clearly foolish, because all the work is being done while holding a global lock and so only one thread can make progress at a time. We end up with a classic lock convoy: At any given time, all of the threads save one are piled up behind the lock as each waits idly for its turn to come.
Convoys are a classic way to kill parallel scalability. In this example, we'll get no parallel speedup at all because this is just a fancy way to write sequential code. In fact, we'll probably take a minor performance hit from acquiring and releasing the lock many times and incurring context switches, so we would be better off just putting the lock around the whole loop and making the convoy more obvious.
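Here is a minimal sketch of that alternative, reusing the same placeholder names as the snippet above (globalMutex, DoWork, and the loop bounds are illustrative, not real APIs): one lock acquisition per thread instead of one per iteration, which drops the per-iteration lock/unlock and context-switch overhead and makes the sequential nature of the work explicit.

#include <mutex>
#include <thread>
#include <vector>

std::mutex globalMutex;

void DoWork() { /* work that touches shared state */ }

void Worker() {
    // Hold the lock around the whole loop: the convoy is now obvious,
    // and we pay for only one acquire/release per thread.
    std::lock_guard<std::mutex> hold(globalMutex);
    for (int i = 0; i < 1000; ++i) {
        DoWork();
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(Worker);
    for (auto& th : threads) th.join();
}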
True, we sometimes still gain some parallel benefit when only part of each thread's work is done while holding the lock:
// Threads 1-N
while( ... ) {
    DoParallelWork();      // p = time spent here,
                           //     parallel portion
    globalMutex.lock();
    DoSequentialWork();    // s = time spent here,
                           //     sequential portion
    globalMutex.unlock();
}
Now at least some of the threads' work can be done in parallel. Of course, we still hit Amdahl's Law [1]: Even on infinitely parallel hardware with an infinite number of workers, the maximum speedup is merely (s+p)/s, minus whatever overhead we incur to perform the locking and context switches. But if we fail to measure (ahem) and the time spent in p is much less than in s, we're really gaining little and we've again written a convoy.
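To make that bound concrete, here is a small worked example with made-up per-iteration timings for s (time inside the lock) and p (time outside the lock); the numbers and names are purely illustrative. When p is much smaller than s, the best possible speedup is barely above 1, which is exactly the convoy the text warns about.

#include <iostream>

// Upper bound from Amdahl's Law: the limit of (s+p)/(s + p/n) as the
// number of workers n goes to infinity.
double MaxSpeedup(double s, double p) {
    return (s + p) / s;
}

int main() {
    std::cout << MaxSpeedup(10.0, 90.0) << "\n";  // s=10ms, p=90ms: at most 10x
    std::cout << MaxSpeedup(90.0, 10.0) << "\n";  // s=90ms, p=10ms: at most ~1.11x
}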
We'd never willingly do this. But the ugly truth is that we do it all the time: Sometimes it happens unintentionally, as when some function we call takes a lock on a popular mutex unbeknownst to us (a sketch of this hidden-lock case appears below). But often it happens invisibly, when the hardware helpfully takes exactly such an exclusive lock automatically, and silently, on our behalf. Let's see how, why, and what we can do about it.
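As an entirely hypothetical illustration of that unintentional case, consider a logging helper that looks innocent at each call site but takes a process-wide mutex internally: a loop that contains no visible lock still serializes every thread on that mutex. All names here (g_logMutex, Log, Worker) are made up for illustration.

#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex g_logMutex;   // one "popular" mutex shared by every caller

void Log(const std::string& msg) {
    // The lock is hidden inside the helper; callers never see it.
    std::lock_guard<std::mutex> hold(g_logMutex);
    static std::ofstream file("app.log");
    file << msg << '\n';
}

void Worker(int id) {
    for (int i = 0; i < 1000; ++i) {
        // ... per-thread work ...
        Log("thread " + std::to_string(id) + " finished item " + std::to_string(i));
        // Every iteration now waits its turn on g_logMutex, re-creating the
        // convoy from the first example with no visible lock in this loop.
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(Worker, t);
    for (auto& th : threads) th.join();
}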