Your free lunch will soon be over. What can you do about it? What are you doing about it? The major processor manufacturers and architectures, from Intel and AMD to Sparc and PowerPC, have run out of room with most of their traditional approaches to boosting CPU performance. Instead of driving clock speeds and straight-line instruction throughput ever higher, they are turning en masse to hyperthreading and multicore architectures. Both of these features are available on chips today; in particular, multicore is available on current PowerPC and Sparc IV processors, and is coming in 2005 from Intel and AMD. Indeed, the big theme of the 2004 In-Stat/MDR Fall Processor Forum was multicore devices, with many companies showing new or updated multicore processors. Looking back, it's not much of a stretch to call 2004 the year of multicore.
And that puts us at a fundamental turning point in software development, at least for the next few years and for applications targeting general-purpose desktop computers and low-end servers (which happen to account for the vast bulk of the dollar value of software sold today). In this article, I describe the changing face of hardware, why it suddenly does matter to software, and how it specifically matters to you and changes the way you'll likely be writing software in the future.
Arguably, the free lunch has already been over for a year or two, only we're just now noticing.
The Free Performance Lunch
There's an interesting phenomenon known as "Andy giveth, and Bill taketh away." No matter how fast processors get, software consistently finds new ways to eat up the extra speed. Make a CPU 10 times as fast, and software usually finds 10 times as much to do (or in some cases, will feel at liberty to do it 10 times less efficiently). Most classes of applications have enjoyed free and regular performance gains for several decades, even without releasing new versions or doing anything special, because the CPU manufacturers (primarily) and memory and disk manufacturers (secondarily) have reliably enabled ever-newer and ever-faster mainstream systems. Clock speed isn't the only measure of performance, or even necessarily a good one, but it's an instructive one: We're used to seeing 500MHz CPUs give way to 1GHz CPUs, which give way to 2GHz CPUs, and so on. Today, we're in the 3GHz range on mainstream computers.
The key question is: When will it end? After all, Moore's Law predicts exponential growth, and clearly exponential growth can't continue forever before we reach hard physical limits; light isn't getting any faster. The growth must eventually slow down and even end. (Caveat: Yes, Moore's Law applies principally to transistor densities, but the same kind of exponential growth has occurred in related areas such as clock speeds. There's even faster growth in other spaces, most notably the data storage explosion, but that important trend belongs in a different article.)
If you're a software developer, chances are you have already been riding the "free lunch" wave of desktop computer performance. Is your application's performance borderline for some local operations? "Not to worry," the conventional (if suspicious) wisdom goes, "tomorrow's processors will have even more throughput, and anyway, today's applications are increasingly throttled by factors other than CPU throughput and memory speed (for instance, they're often I/O-bound, network-bound, or database-bound)." Right?
Right enough, in the past. But dead wrong for the foreseeable future.
The good news is that processors are going to continue to become more powerful. The bad news is that, at least in the short term, the growth will come mostly in directions that do not take most current applications along for their customary free ride.
Over the past 30 years, CPU designers have achieved performance gains in three main areas, the first two of which focus on straight-line execution flow:
- Clock speed.
- Execution optimization.
- Cache.
Increasing clock speed is about getting more cycles. Running the CPU faster more or less directly means doing the same work faster.
Optimizing execution flow is about doing more work per cycle. Today's CPUs sport some more powerful instructions, and they perform optimizations that range from the pedestrian to the exotic, including pipelining, branch prediction, executing multiple instructions in the same clock cycle(s), and even reordering the instruction stream for out-of-order execution. These techniques are all designed to make the instructions flow better and/or execute faster, and to squeeze the most work out of each clock cycle by reducing latency and maximizing the work accomplished per clock cycle.
Note that some of what I just called "optimizations" are actually far more than optimizations, in that they can change the meanings of programs and cause visible effects that can break reasonable programmer expectations. This is significant. CPU designers are generally sane and well-adjusted folks who normally wouldn't hurt a fly and wouldn't think of hurting your code...normally. But in recent years, they have been willing to pursue aggressive optimizations just to wring yet more speed out of each cycle, even knowing full well that these aggressive rearrangements could endanger the semantics of your code. Is this Mr. Hyde making an appearance? Not at all. That willingness is simply a clear indicator of the extreme pressure the chip designers face to deliver ever-faster CPUs; they're under so much pressure that they'll risk changing the meaning of your program, and possibly break it, to make it run faster. Two noteworthy examples in this respect are write reordering and read reordering: Allowing a processor to reorder write operations has consequences that are so surprising, and break so many programmer expectations, that the feature generally has to be turned off because it's too difficult for programmers to reason correctly about the meaning of their programs in the presence of arbitrary write reordering. Reordering read operations can also yield surprising visible effects, but that is more commonly left enabled anyway because it isn't quite as hard on programmers (and the demands for performance cause designers of operating systems and operating environments to compromise and choose models that place a greater burden on programmers because that is viewed as a lesser evil than giving up the optimization opportunities).
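To make the reordering hazard concrete, here is a minimal sketch, assuming a C++11 or later compiler with std::atomic and std::thread. The names payload, ready, and publish_and_read are hypothetical and chosen only for illustration. One thread writes some data and then sets a flag; the other waits for the flag and then reads the data. The release store and acquire load rule out the reorderings that would break this handoff.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

// The writer stores the payload and then sets the flag; the reader
// spins until it sees the flag and then reads the payload.
int publish_and_read() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);

    int seen = -1;
    std::thread reader([&seen] {
        // Acquire: the read of payload below cannot move ahead of this load.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    payload = 42;
    // Release: the write to payload above cannot move past this store.
    ready.store(true, std::memory_order_release);

    reader.join();
    assert(seen == 42);  // guaranteed by the release/acquire pairing
    return seen;
}
```

If ready were a plain bool instead, the program would contain a data race, and the compiler or the processor would be allowed to make the flag visible before the payload, so the reader could observe a stale value.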
Finally, increasing the size of on-chip cache is about staying away from RAM. Main memory continues to be so much slower than the CPU that it makes sense to put the data closer to the processor, and you can't get much closer than being right on the die. On-die cache sizes have soared, and today most major chip vendors will sell you CPUs that have 2MB of on-board L2 cache. (Of these three major historical approaches to boosting CPU performance, increasing cache is the only one that will continue in the near term.)
Okay. So what does this mean?
A fundamentally important thing to recognize about this list is that all of these areas are concurrency agnostic. Speedups in any of these areas directly lead to speedups in sequential (nonparallel, single-threaded, single-process) applications, as well as applications that do make use of concurrency. That's important because the vast majority of today's applications are single-threaded, and for good reasons.
Of course, compilers have had to keep up; sometimes, you need to recompile your application, and target a specific minimum level of CPU, to benefit from new instructions (MMX, SSE, and the like) and some new CPU features and characteristics. But, by and large, even old applications have always run significantly faster, even without being recompiled to take advantage of all the new instructions and features offered by the latest CPUs.
That world was a nice place to be. Unfortunately, it has already disappeared.
Why You Don't Have 10GHz Today
You can get similar graphs for other chips, but I'm going to use Intel data here. Figure 1 graphs the history of Intel chip introductions by clock speed and number of transistors. The number of transistors continues to climb, at least for now. Clock speed, however, is a different story.
Figure 1: Intel CPU introductions (sources: Intel, Wikipedia).
Around the beginning of 2003, you'll note a disturbing sharp turn in the previous trend toward ever-faster CPU clock speeds. I've added lines to show the limit trends in maximum clock speed; instead of continuing on the previous path, as indicated by the thin dotted line, there is a sharp flattening. It has become harder and harder to exploit higher clock speeds due to several physical issues, notably heat (too much of it and too hard to dissipate), power consumption (too high), and current leakage problems.
In short, CPU performance growth as we have known it hit a wall two years ago. Most people have only recently started to notice.
Quick: What's the clock speed on the CPU(s) in your current workstation? Are you running at 10GHz? On Intel chips, we reached 2GHz a long time ago (August 2001), and according to CPU trends before 2003, we now should have the first 10GHz Pentium-family chips. A quick look around shows that, well, actually, we don't. What's more, such chips are not even on the horizon; we have no good idea at all about when we might see them appear.
Well, then, what about 4GHz? We're at 3.4GHz already; surely 4GHz can't be far away? Alas, even 4GHz seems to be remote indeed. In mid-2004, as you probably know, Intel first delayed its planned introduction of a 4GHz chip until 2005, and then in fall 2004, it officially abandoned its 4GHz plans entirely. As of this writing, Intel is planning to ramp up a little further to 3.73GHz early this year (already included in Figure 1 as the upper-right-most dot), but the clock race really is over, at least for now; Intel's and most processor vendors' futures lie elsewhere as chip companies aggressively pursue the same new multicore directions.
We'll probably see 4GHz CPUs in our mainstream desktop machines someday, but it won't be in 2005. Sure, Intel has samples of their chips running at even higher speeds in the lab, but only by heroic efforts, such as attaching hideously impractical quantities of cooling equipment. You won't have that kind of cooling hardware in your office any day soon, let alone on your lap while computing on the plane.
TANSTAAFL: Moore's Law and The Next Generation(s)
"TANSTAAFL = There ain't no such thing as a free lunch."
R.A. Heinlein, The Moon Is a Harsh Mistress
Does this mean Moore's Law is over? Interestingly, the answer in general seems to be "no." Of course, like all exponential progressions, Moore's Law must end someday, but it does not seem to be in danger for a few more years. Despite the wall that chip engineers have hit in juicing up raw clock cycles, transistor counts continue to explode, and it seems CPUs will continue to follow Moore's Law-like throughput gains for some years to come.
The key difference, which is the heart of this article, is that the performance gains are going to be accomplished in fundamentally different ways for at least the next couple of processor generations. And most current applications will no longer benefit from the free ride without significant redesign.
For the near-term future, meaning for the next few years, the performance gains in new chips will be fueled by three main approaches, only one of which is the same as in the past. The near-term future performance growth drivers are:
- Hyperthreading.
- Multicore.
- Cache.
Hyperthreading is about running two or more threads in parallel inside a single CPU. Hyperthreaded CPUs are already available today, and they do allow some instructions to run in parallel. A limiting factor, however, is that although a hyperthreaded CPU has some extra hardware (including extra registers), it still has just one cache, one integer math unit, one FPU, and in general, just one each of most basic CPU features. Hyperthreading is sometimes cited as offering a 5 to 15 percent performance boost for reasonably well-written multithreaded applications, or even as much as 40 percent under ideal conditions for carefully written multithreaded applications. That's good, but it's hardly double, and it doesn't help single-threaded applications.
Multicore is about running two or more actual CPUs on one chip. Some chips, including Sparc and PowerPC, have multicore versions available already. The initial Intel and AMD designs, both due this year, vary in their level of integration but are functionally similar. AMD's seems to have some initial performance design advantages, such as better integration of support functions on the same die; whereas Intel's initial entry basically just glues together two Xeons on a single die. The performance gains should initially be about the same as having a dual-CPU system (only the system will be cheaper because the motherboard doesn't have to have two sockets and associated "glue" chippery), which means something less than double the speed even in the ideal case. Just like today, it will boost reasonably well-written multithreaded applications, not single-threaded ones.
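As a rough illustration of why extra cores help only code that is written to use them, consider the following sketch, which assumes a C++11 or later compiler with std::thread. The function names sum_sequential and sum_parallel are hypothetical. The first function runs on one core no matter how many are available; the second splits the same work across a caller-chosen number of threads that the operating system is free to schedule on separate cores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Single-threaded: uses one core regardless of how many the chip has.
long long sum_sequential(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Multithreaded: each worker sums one contiguous chunk of the input.
long long sum_parallel(const std::vector<int>& v, std::size_t workers) {
    assert(workers > 0);
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (v.size() + workers - 1) / workers;

    for (std::size_t i = 0; i < workers; ++i) {
        const std::size_t begin = std::min(i * chunk, v.size());
        const std::size_t end = std::min(begin + chunk, v.size());
        threads.emplace_back([&v, &partial, i, begin, end] {
            partial[i] = std::accumulate(v.begin() + begin, v.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) {
        t.join();  // wait for every chunk before combining
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Both functions return the same result. Whether the parallel version is actually faster depends on the input size and the hardware, since thread startup and the final combining step are overhead that the sequential version does not pay.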
Finally, on-die cache sizes can be expected to continue to grow, at least in the near term. Of these three areas, only this one will broadly benefit most existing applications. The continuing growth in on-die cache sizes is an incredibly important and highly applicable benefit for many applications, simply because space is speed. Accessing main memory is expensive, and you really don't want to touch RAM if you can help it. On today's systems, a cache miss that goes out to main memory typically costs about 10 to 50 times as much as getting the information from the cache; this, incidentally, continues to surprise people because we all think of memory as fast, and it is fast compared to disks and networks, but not compared to on-board cache, which runs at faster speeds. If an application's working set fits into cache, we're golden; and if it doesn't, we're not. That is why increased cache sizes will save some existing applications and breathe life into them for a few more years without requiring significant redesign: As existing applications manipulate more and more data, and as they are incrementally updated to include more code for new features, performance-sensitive operations need to continue to fit into cache. As the Depression-era old-timers will be quick to remind you, "Cache is king."
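One way to see cache effects for yourself is to compare two loops that do identical arithmetic but touch memory in different orders. The sketch below is an illustration under assumptions, not a benchmark: the function names are hypothetical, and the matrix is assumed to be stored row by row in one contiguous std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Walks memory in storage order, so consecutive accesses tend to
// land in cache lines that were already fetched.
long long sum_row_major(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c) {
            total += m[r * n + c];
        }
    }
    return total;
}

// Same arithmetic, but each access jumps n elements ahead, so for
// large n most accesses land in a different cache line.
long long sum_column_major(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t r = 0; r < n; ++r) {
            total += m[r * n + c];
        }
    }
    return total;
}
```

The two functions always return the same sum. Once the matrix is too large to fit in cache, the column-order walk is often noticeably slower on typical hardware, but the size of the gap depends on the machine, so time it on yours rather than trusting a rule of thumb.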
(Aside: Here's an anecdote to demonstrate "space is speed" that recently hit my compiler team. The compiler uses the same source base for 32-bit and 64-bit compilers; the code is just compiled as either a 32-bit process or a 64-bit one. The 64-bit compiler gained a great deal of baseline performance by running on a 64-bit CPU, principally because the 64-bit CPU had many more registers to work with and had other code performance features. All well and good. But what about data? Going to 64 bits didn't change the size of most of the data in memory, except that (of course) pointers in particular were now twice the size they were before. As it happens, our compiler uses pointers much more heavily in its internal data structures than most other kinds of applications ever would. Because pointers were now 8 bytes instead of 4 bytes, a pure data size increase, we saw a significant increase in the 64-bit compiler's working set. That bigger working set caused a performance penalty that almost exactly offset the code execution performance increase we'd gained from going to the faster processor with more registers. As of this writing, the 64-bit compiler runs at the same speed as the 32-bit compiler, even though the source base is the same for both and the 64-bit processor offers better raw processing throughput. Space is speed.)
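The data-size effect in this anecdote can be sketched as follows. This is not the compiler's actual data structure: TreeNode, Node32, Node64, and working_set_bytes are hypothetical names, and fixed-width integers stand in for 4-byte and 8-byte pointers so that both layouts can be compared in a single build.

```cpp
#include <cstddef>
#include <cstdint>

// A pointer-heavy node, parameterized on the width of its links.
template <typename Link>
struct TreeNode {
    Link left;
    Link right;
    Link parent;
    std::int32_t kind;  // non-pointer payload, same size in both layouts
};

using Node32 = TreeNode<std::uint32_t>;  // links sized like 32-bit pointers
using Node64 = TreeNode<std::uint64_t>;  // links sized like 64-bit pointers

// Bytes occupied by node_count nodes of the given layout.
template <typename Node>
constexpr std::size_t working_set_bytes(std::size_t node_count) {
    return node_count * sizeof(Node);
}

static_assert(sizeof(Node64) > sizeof(Node32),
              "wider links must enlarge every node");
```

The exact sizes depend on the platform's alignment rules, but on common 64-bit systems the wide node is often about twice the size of the narrow one, so the same number of nodes covers far more cache lines.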
But cache is it. Hyperthreading and multicore CPUs will have nearly no impact on most current applications.
So what does this change in hardware mean for the way we write software? By now, you've probably noticed the basic answer, so let's consider it and its consequences.