Breaking the Speed Limit
Even if your current application doesn't have much embarrassingly parallel functionality, the key observation is that s and p are not fixed. You can change them, including by applying not only O(N), but also O(K) concurrency:
- You can increase p by doing more of the same: Increase the volume of data processed by the parts that are O(N) parallelizable (and therefore the percentage p of time spent in computing). This is Gustafson's Law.
- You can increase p by doing more of something new: Add new features that are O(N) parallelizable.
- You can reduce s by partially parallelizing it using O(K) techniques, such as pipelining.
Let's consider each of these.
1. Do More of the Same O(N) Work
Amdahl analyzed the effect on execution time assuming that we add more processors or cores while keeping the workload fixed. In 1988, John Gustafson came along and said (in effect): But that's not what really happens! When we get more horsepower, we solve bigger problems. "Hence," Gustafson suggested, "it may be most realistic to assume that run time, not problem size, is constant." [3]
I'm sure that resonates with your own experience: When it comes to performance criteria that our software must meet, what constraint tends to be the most immovably fixed? It's how much time we're allowed to use to get a useful result. Users have finite patience. For example, they want to wait:
- Less than a second for a response to a mouse click
- Less than a minute to sync a PDA or do an incremental compile
- Less than an hour to encode and burn a DVD
- Less than 12 hours to calculate tomorrow's weather forecast or do a nightly build
and they push back hard if we don't meet those deadlines. But if our users can compile or sync or burn N times as much in the same time given N cores, they will happily accept (nay, greedily gobble) that capability and buy the hardware that matches their problem size.
If we keep run time constant and focus instead on increasing the problem size, we get a much better formula. Again assuming no overhead, Gustafson's Law says that the total work a program can perform in a fixed time on a machine with N cores is given by:
Total Work = s + Np
where "Total Work" is equivalent to the "best possible speedup" compared to hypothetically doing all that work sequentially, even if nobody would ever wait for such a sequential execution. For example, Figure 2 illustrates applying Gustafson's Law to the problem in Figure 1. That wonderful amplification of O(N) code just makes our parallel day. Importantly, note that the total program is now O(N), even though there is a sequential portion. As N increases to infinity, the total work we can accomplish also increases to infinity. In practice, this means that it increases until we can't make the problem any bigger or become bound by some other factor, such as memory or other I/O. Still, the key is that the concurrency is scalable: It is not hardwired to some predetermined fixed number K of cores, or, worse still, bounded by a constant-factor maximum speedup over the original code as in Amdahl's approach.
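To see the difference concretely, here is a minimal sketch that evaluates Gustafson's formula next to the standard Amdahl formula, 1/(s + p/N). The workload split (s = 0.25, p = 0.75) is an assumption chosen for illustration, not taken from the figures:

#include <cstdio>

int main() {
    // Assumed split for illustration: 25 percent inherently sequential,
    // 75 percent O(N)-parallelizable (s + p = 1, no overhead).
    const double s = 0.25, p = 0.75;

    for (int n = 1; n <= 1024; n *= 4) {
        double amdahl    = 1.0 / (s + p / n);  // fixed problem size
        double gustafson = s + n * p;          // fixed run time
        std::printf("N = %4d: Amdahl speedup = %5.2f, "
                    "Gustafson total work = %7.2f\n",
                    n, amdahl, gustafson);
    }
}

With these numbers, Amdahl's speedup creeps toward its hard ceiling of 1/s = 4 no matter how many cores we add, while Gustafson's total work keeps growing linearly with N for as long as we can keep feeding it a bigger problem.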
2. Invent New O(N) Work
Besides solving bigger versions of the same problem, we also have the option of adding new O(N) features. This is especially effective, and it's market-changing when the new feature tackles a problem that's harder than anything we could practically solve before.
Remember Myst? Released in 1993, that first-person immersive adventure was the bestselling computer game of all time for most of the rest of the decade because of its breakthrough graphics. Myst let you walk through high-resolution pre-rendered scenes that were like static postcards, preselected views frozen in space and time. Those scenes took hours and hours to render, of course, so all the rendering work had to be done offline and the game was shipped with only the final images, as many as could fit on a CD-ROM. It was simply impossible to think about actually rendering on the user's computer at all, never mind at several frames per second in real time so that the player could actually move around the world and look wherever he wanted. Sure, Doom accomplished that later the same year, but not nearly at Myst's high graphic quality.
But real-time 3D was only a matter of time. Myst's own sequels first let the player turn and look around a 360-degree panoramic rendering from a fixed viewpoint, and then completely freed us to walk and look anywhere in real-time 3D.
Just as Myst itself was once unthinkable but was made possible by the widespread availability of PCs that had good graphics and CD drives, and real-time 3D sequels were made possible by advances in both CPUs and parallel graphics hardware, so new features that are unthinkable today will be possible on 16-core and 64-core machines. In each case, it's a matter of finding what previously unthinkable work could now be enabled by new machines, and enabling it: at first by gurus building their own libraries and frameworks, then by more and more developers as the tools become more mainstream.
But don't show me ray-traced bouncing balls or Mandelbrot graphics or the other usual embarrassingly parallel but niche (or downright useless) clichés. What we're looking for are real ideas of real software we could imagine real kids and grandmothers using that could become possible on manycore machines. Here's a quick potential example: Researchers know how to do speech recognition with near-human-quality accuracy in the lab, which is astonishingly good and would enable breakthrough user interfaces if it could be done that reliably in real time. The only trouble is that the software takes a week to run...on a single core. Can it be parallelized, and if so how many cores would we need to get a useful answer in a useful time? Bunches of smart people (and the smart money behind them) are investing hard work not only to find out the answer to that question, but also to find more questions like it.
3. Reduce s Using O(K) Concurrency
Just because something can't be made O(N) parallel doesn't mean it doesn't contain any potential concurrency at all. For the past 30 years, processors have been successfully speeding up purely sequential programs by injecting O(K) concurrency in the form of caching, prefetching, instruction reordering, and pipelining.
Granted, the maximum potential speedup of sequential code is bounded by some fixed constant K. For example, adding a fast Level 1 (L1) cache can improve memory access from slower L2 or L3 cache times (or, worse still, glacial DRAM or geological disk) up to the speed of L1, but no further. Adding a five-stage instruction pipeline can improve performance by a factor of five, assuming all the pipeline stages have equal weight and minimal overhead, but no further. But O(K) improvements are nothing to sneeze at. Today, nobody would buy a desktop PC that didn't have plenty of cache and a pipelined processor.
We can use the same tricks. We can compute separable pieces of work concurrently by running some of the work on a thread pool. For example, given the following sequential function that computes total sales in two independent regions of similar size:
Quantity ComputeSales() {
    return region1.Sales()   // these calculations
         + region2.Sales();  // are independent
}
you can improve this from O(1) to O(2) by moving the computation of region1.Sales() to a background thread or thread pool, so that region1.Sales() and region2.Sales() can execute concurrently. Just be sure that the calculation you're shipping to another thread is big enough to justify today's overhead of executing it somewhere else, and that the data the two operations use really are independent.
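Here's one way that might look in practice, a minimal sketch using C++11's std::async as a stand-in for a real thread pool (Quantity, region1, and region2 are carried over from the example above):

#include <future>

Quantity ComputeSales() {
    // Ship region1's calculation to another thread...
    auto f = std::async(std::launch::async,
                        [] { return region1.Sales(); });
    // ...while region2's calculation runs concurrently on this thread.
    Quantity r2 = region2.Sales();
    return f.get() + r2;  // join, then combine the results
}

Note that std::async's per-call cost is exactly the overhead the caveat above warns about: The two Sales() calculations have to be big enough to pay for it.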
Generalizing that technique, we can pipeline to overlap separable subparts of work being done to each piece of data in a collection [1]. Figure 3 illustrates the effect of additionally applying O(K) techniques to the "sequential" portions of the program.
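To make that pipelining idea concrete, here is a minimal two-stage sketch. The stage names (Decode, Render), the Frame type, and the hand-rolled queue are hypothetical stand-ins; the point is only that stage 2 works on item i while stage 1 is already working on item i+1:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical two-stage split of the per-item work.
struct Frame { int data; };
Frame Decode(int item)       { return Frame{item * 2}; }      // stage 1 stand-in
void  Render(const Frame& f) { std::printf("%d\n", f.data); } // stage 2 stand-in

void PipelineAll(const std::vector<int>& items) {
    std::queue<Frame> q;         // hands stage-1 results to stage 2
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread stage2([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !q.empty() || done; });
            if (q.empty()) break;        // producer finished, queue drained
            Frame f = q.front();
            q.pop();
            lock.unlock();
            Render(f);                   // overlaps with the next Decode
        }
    });

    for (int item : items) {
        Frame f = Decode(item);          // stage 1 runs on this thread
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(f);
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    stage2.join();
}

int main() {
    PipelineAll({1, 2, 3, 4, 5});
}

Because there are only two stages, the best case is an O(2) speedup no matter how many items flow through, which is precisely the bounded O(K) flavor of concurrency this section describes.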