First, let's look at the definitions our three pundits proposed, beginning with stage 1 or 0, "unawareness": A developer at this stage "Completely trusts that the operating system handles everything automatically when more cores or processors are added," according to Zeichick. Similarly, "Since no mainstream language makes threads a first-class concern, threads are side effects in library and infrastructure calls," O'Brien writes. And Dossot, with typical humor, describes a "rookie who writes a web application that can only accommodate one concurrent user and complains why is the server not working better." As I've discussed concurrency with assorted developers over the past year, a recurring comment has been to dismiss the concept as already solved by the operating system, or a simple paradigm shift that mainframers made long ago. If Herb Sutter is correct that the next great programming wave will be concurrency, these authors have astutely identified a widespread ignorance of this tectonic movement.
At the next stage, though usage is casual, there is still "no testing of applications against platforms with different cores/processors to identify runtime issues," in Zeichick's model, an excellent reminder that a key tenet of performance optimization is to only fix what needs fixing. At the casual experimentation stage, O'Brien writes, "most programmers are aware of threads and their ability to run a lengthy calculation or IO operation while maintaining a responsive UI." They still run into trouble, Dossot observes, with "double-checked locking, interruption swallowing and the like."
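The double-checked locking trap Dossot mentions is worth seeing concretely. A minimal sketch in Java (the class name is illustrative): the idiom was broken under the pre-Java-5 memory model, and even today it is only safe when the field is declared volatile.

```java
// Double-checked locking: safe in Java 5+ ONLY because the field is volatile.
// Without volatile, another thread could observe a partially constructed object.
public class LazySingleton {
    private static volatile LazySingleton instance; // volatile is essential

    private LazySingleton() {}

    public static LazySingleton getInstance() {
        LazySingleton result = instance;       // first, unsynchronized check
        if (result == null) {
            synchronized (LazySingleton.class) {
                result = instance;             // second check, under the lock
                if (result == null) {
                    instance = result = new LazySingleton();
                }
            }
        }
        return result;
    }
}
```

Dropping the volatile keyword makes the code look identical and pass casual testing, which is exactly why this stage is so treacherous.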
At stage three, "rigid" or "hotspot" use of threading, "locking is the hammer with which they pound the nail of concurrency," O'Brien says. Zeichick categorizes this stage as one where rudimentary profiling begins to measure results, and simple tools such as OpenMP come into play for loop optimization.
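OpenMP's loop optimization lives in the C/C++ world, but the same hotspot-style pattern can be sketched in Java by hand: split a loop's iteration space across a thread pool and combine the partial results. This is an illustrative analogue (class and method names are my own), roughly what an OpenMP "parallel for" with a sum reduction does automatically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Divide the array into nThreads chunks, sum each on its own thread,
    // then combine the partial sums -- the "reduction" step.
    static long parallelSum(long[] data, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            int chunk = (data.length + nThreads - 1) / nThreads;
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int lo = t * chunk;
                final int hi = Math.min(lo + chunk, data.length);
                parts.add(pool.submit(() -> {
                    long s = 0;                 // each task sums its own chunk
                    for (int i = lo; i < hi; i++) s += data[i];
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The boilerplate that OpenMP hides behind a single pragma is on full display here, which is part of why stage-three developers reach for such tools in the first place.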
Stage four, nearing the threading zenith, involves the use of tools to test threaded code and some evaluation of runtime issues on different numbers of cores, but "No formal validation of threaded code for correctness," according to Zeichick. O'Brien discusses the use of lock-free algorithms and data structures, and a developer mindshift towards more of a hardware focus.
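The lock-free techniques O'Brien associates with this stage rest on the compare-and-set (CAS) primitive. A minimal sketch using Java's AtomicLong (the class name is illustrative): instead of blocking on a lock, a failed CAS simply retries.

```java
import java.util.concurrent.atomic.AtomicLong;

// A lock-free counter: the optimistic read-modify-CAS loop that
// underlies most lock-free algorithms and data structures.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            // CAS succeeds only if no other thread changed the value
            // in the meantime; on failure, retry rather than block.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public long get() {
        return value.get();
    }
}
```

No thread ever holds a lock, so no thread can be stalled behind one; the cost is reasoning about retry loops, a mindshift toward how the hardware actually behaves.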
At the final stage, many of the formalities described in the Capability Maturity Model are extrapolated to threading, including design concerns, code validation and efforts to eliminate non-compliant libraries. Here, Dossot has the most entertaining last word, defining "nirvana" as "Being Brian Goetz. At this level, the developer has an intimate knowledge of the memory model, the concurrency mechanisms and all the pesky internal details."
Interestingly, Goetz, author of the best-selling Java Concurrency in Practice (Addison-Wesley Professional, 2006), comments on this model as "a one-dimensional projection (the part where x=y) of a two-dimensional scale, which measures both a person's actual skill level, and their self-perceived skill level. If people can stay on the diagonal, we're OK, but if not, other people's feet get shot." His concern is that too few developers are cognizant of their own limitations to realize when they should step back and gain some expertise or training before proceeding. And he also cautions against premature optimization, a potentially damaging side effect of the concurrency controversy.
Model Misfits
Not everyone agrees, however, that threading maturity makes sense, and some even question the focus on developer skill sets rather than on evolving tools and platforms.
Tim Sweeney, CEO and chief architect of Epic Games and an eager proponent of optimization for Intel dual core technology (with experience converting his popular rendering engine to run on multiple cores), gives me a different scale: "I'm a minimalist, so I'd say, stage one is no threading. Two is explicit threading (your items two through four). Three is implicit parallelism. You just write the implicitly parallel code (using pure functional programming, transactional memory, message passing, etc.), and its runtime behavior is more or less independent of the way that the compiler and runtime library choose to thread it."
"Lock-free data structures aren't a special case," Sweeney argues, referring to O'Brien's description of stage-four maturity. "That's just a toolbox of hacks for reducing the use of a toolbox of other hacks (locks). All of those models impose the same approximate 2X-4X productivity burden, so I don't see a benefit to distinguishing them."
Overall, Sweeney has little use for current tools, including OpenMP, and languages. He's one of many who disdain Java concurrency and its brethren as short-term solutions at best: "Companies in the industry are currently at stage one or two. That is fine in the era of one to four CPU cores. Once we're up to tens of cores, that model becomes untenable in the same way that assembly-language programming became untenable in the 1980s, and will force a move to implicitly parallel languages."
With a soupçon of dogma, Rob McCammon, director of advanced technology planning at Wind River Systems, suggests that a generalized "Multiprocessing Maturity Model" for embedded systems is in order. "Most multi-core device implementations will utilize a shared memory/multi-threading model, a message passing model, or both to manage concurrency," he writes in his blog. As a result, "device software development teams need to have a multi-core readiness plan, which results in predictable increases to their level of Multiprocessing Maturity." He is still developing the details of what such a plan should include, however.
In a March 2007 post entitled "Do We Need a Threading Maturity Model?", Artima Developer's Frank Sommers cites IBM Distinguished Engineer Greg Pfister's work in Serial Program, Parallel Subsystem parallelism: "SPPS parallelism allows a developer to feed a serial program (a Web controller, a database query, or a data mining algorithm implementation) to a parallel subsystem (such as a Web application server, a database server, or a massively parallel supercomputer), and that parallel subsystem will ensure the maximum concurrency for the serial program." He concludes that "In the SPPS world, developers don't need much of a threading maturity model. At best, awareness of concurrency suffices, as does trust in the underlying parallel subsystem."
Passing to the Pipeline
And creating such subsystems appears to be a burgeoning industry. Take RapidMind, a framework for executing data-parallel computations in C++ on multi-core processors. It claims to go where no compiler can, parallelizing serial algorithms across any number of cores. RTT AG's RealTrace automotive visualization software, built with RapidMind, runs on multiple graphics processing units (GPUs) and was demonstrated to much acclaim at SIGGRAPH 2006.
Another approach comes from Rogue Wave Software, which has marketed its "Software Pipelines" concept for service oriented architectures in the past. According to Patrick Leonard, the company's vice president of product development, Software Pipelines "can be used to abstract the threading model out of the application code. As software developers, you would not mix your UI code with your business logic, and for good reasons. A similar principle should apply for programming for concurrency: the threading model should not be driven from within the application logic."
"Software Pipelines is a general term, not owned or trademarked by Rogue Wave or anyone as far as I'm aware," Leonard continues. "It also does not require the use of any of our technology. Software Pipelines borrows conceptually from hardware pipelines and also from fluid dynamics, which has interesting parallels to software systems."
Concurrency Junkies
While tool prototypes may be popping up on the concurrency landscape, there's one wrinkle in the theory that developers shouldn't delve into threading themselves: They're darn interested in doing so. Look no further than the stellar sales of Goetz's book on Java concurrency. Java developers have always had threading capabilities at the ready, bolstered since Java 5 by the java.util.concurrent package of utility classes. Now they realize both the need for concurrency-enabled performance enhancements and the dastardly side effects of threading done wrong.
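Those utility classes are a large part of the book's appeal. A minimal sketch (the class and method names are my own) of the java.util.concurrent style: an ExecutorService manages the thread pool, and Futures collect results, with no hand-rolled Thread or synchronized code in sight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class WordLengths {
    // Submit one task per word to a small thread pool and total the
    // lengths; ExecutorService and Future handle the thread plumbing.
    static int totalLength(String[] words) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String w : words) {
                results.add(pool.submit(() -> w.length()));
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```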
In a similar vein, .NET developers are eagerly awaiting Joe Duffy's guide to concurrency for Windows, due out this year. Duffy works at Microsoft on concurrency programming models in the Common Language Runtime (CLR). Duffy and Goetz agree on many topics, but key among them is that Zeichick's stage-five concept, testing performance on multiple cores and platforms, is a necessary precursor to concurrency optimization and threading techniques.
Profiling Tools
The most obvious measure of multi-core performance? All your CPUs should be busy, says Duffy. Low processor utilization may indicate tangled threads caused by overzealous synchronization. Windows Task Manager can give you this information, as well as PerfMon.exe. Similarly, Task Manager can tell you if your program is continually overflowing physical memory; VSPerfCmd is another way to troubleshoot. Duffy also recommends that you know if any particular processes are monopolizing execution time. Microsoft Visual Studio's profiler can query each CPU's performance counters in search of instructions retired and cache misses.
In the embedded world, Wind River ProfileScope offers hierarchical profiling that should mitigate some of the threading confusion. Again, developers should use this type of tool at runtime to find the functions that are most cycle-intensive, preferably before the application goes live. Each thread or function is shown with visual cues to indicate event timing. The company claims that these zoom and time comparison features are key to the product's usability.
Multi-core Merriment
But Epic's Sweeney may be right about one thing: Today's threading knowledge will be a puny defense against tomorrow's chip complexity. The nine-core Cell BE processor, for example, contains a Power Processing Unit and eight Synergistic Processing Units with 256 KB of on-chip memory and 128-bit registers. A single Cell BE processor whips through massive matrix multiplications at over 200 Gflops, hundreds of times faster than an old-fashioned CPU. Intel's Larrabee project is a many-core x86 effort; there is speculation that 24- and 32-core versions will be available in 2009 and a 48-core chip in 2010. The company claims Larrabee "will include enhancements to accelerate applications such as scientific computing, recognition, mining, synthesis, visualization, financial analytics and health applications."
That gives developers two years to get very, very good at multithreading.