Uh oh--looks like the industry is about to rediscover parallel computing. I first got into it in the mid-1980s, when the Connection Machine (in the U.S.) and the transputer (in the UK) seemed to signal that we would soon all be writing code to run on hundreds or thousands of processors at once.
The excitement died pretty quickly, as did most of the companies trying to build a business around parallel computing: Of the more than two dozen firms we surveyed in 1991, only a handful were still in business five years later, and all of those had either moved into some other area or set parallel computing aside.
Parallelism had a brief renaissance in the mid-'90s, when Linux clusters first appeared. They have steadily been gaining ground as both servers and supercomputers, but until now, most programmers have been able to ignore them. Judging from the recent flurry of discussion in the blogosphere, that may be about to change, thanks primarily to the appearance of multicore CPUs (which put several independent processors on a single chip). As Tim Bray recently pointed out, Sun, Intel, AMD, and IBM are all pursuing multicore architectures as a way of putting more CPU cycles in programmers' hands. Within two or three years, most of us could well be faced with hardware that only reaches its potential if we use multithreading and other parallel tricks.
Bray then went on to discuss some of the obstacles that ubiquitous parallelism faces. Number five on his list (after legacy apps, observability, Java mutexes, and LAMP, and just ahead of the non-problem "how many is enough?") is "testing and debugging". As far as I'm concerned, this is the only problem worth mentioning. Most scientific programmers turn their backs on massively parallel machines because they simply can't get the damn things to work. Task farms? No problem. Simple regular grids? Piece of cake. But anything else--any program that actually has lots of different things going on at once--requires heroic effort to debug and tune.
The underlying problem is that thinking concurrently is a jillion times harder than thinking about one thing at a time. Adding non-determinism (which is a feature of almost all parallel programs) makes it a jillion times harder again, and today's tools simply aren't powerful enough to help us. (Yes, I've used TotalView. Yes, it's better than the competition. No, it doesn't help enough--sorry.)
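To make that concrete, here's a toy example in C with pthreads (my own illustration, not drawn from any real application): two threads increment a shared counter without a lock, and the final total comes out differently almost every run.

#include <pthread.h>
#include <stdio.h>

#define INCREMENTS 1000000

static long counter = 0;                /* shared, and deliberately unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < INCREMENTS; i++)
        counter++;                      /* unsynchronized read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* "Should" print 2000000; in practice the answer varies from run to run. */
    printf("counter = %ld\n", counter);
    return 0;
}

Compile it with -pthread and run it a few times. The point isn't the bug itself, which is trivial to spot here, but that no single run tells you what the next run will do--and every serious parallel bug I've chased was some dressed-up version of this.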
Enter the notion of omniscient, or replay, debuggers. These capture information about programs as they execute, so that developers can step backward in time and see what happened just before a breakpoint or a crash. Bil Lewis described one in the June 2005 issue of Dr. Dobb's Journal, but the idea has been around a long time--a colleague of mine, Irving Reid, remembers using one to debug microcontroller code in the 1980s.
Replay debuggers are wonderful even when you're dealing with sequential code, but they're practically indispensable when you're faced with interrupts, context switches, and messages. The problem is, logging enough information to be useful slows programs down considerably. This distorts the sequence of events, which can in turn move or hide the bug you're chasing.
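To see where the overhead comes from, here's a rough sketch (again in C; the record layout and event names are my own invention, not any real debugger's format) of the sort of trace record a replay debugger has to capture at every interesting event. A clock read, a lock, and a buffer write per mutex acquisition or message doesn't sound like much, but it's enough to reshuffle the timing of the program you're observing--and notice that the trace buffer needs its own lock, which serializes the very threads whose interleaving you wanted to study.

#include <pthread.h>
#include <time.h>

typedef struct {
    struct timespec when;    /* when the event happened */
    pthread_t       thread;  /* which thread did it */
    const char     *event;   /* e.g. "lock", "unlock", "send", "recv" */
    const void     *object;  /* the mutex, socket, etc. involved */
} trace_record;

#define TRACE_CAPACITY 65536
static trace_record    trace_buf[TRACE_CAPACITY];
static long            trace_len = 0;
static pthread_mutex_t trace_lock = PTHREAD_MUTEX_INITIALIZER;

/* Call this at every event you might later want to replay. */
void trace_event(const char *event, const void *object)
{
    pthread_mutex_lock(&trace_lock);    /* this lock itself perturbs timing */
    if (trace_len < TRACE_CAPACITY) {
        trace_record *r = &trace_buf[trace_len++];
        clock_gettime(CLOCK_MONOTONIC, &r->when);
        r->thread = pthread_self();
        r->event  = event;
        r->object = object;
    }
    pthread_mutex_unlock(&trace_lock);
}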
In that light, I'd like to know whether anyone at Sun or elsewhere is planning to use one of the cores in a multicore chip to log the activities of the others, or (more radically) to architect one of the cores purely for logging. This wouldn't just be for developers' sakes: As a sometime system administrator, I find the thought of having applications log what they're doing, 24/7, so that I can send a crash trace to vendors when their software fails on me, very attractive.
Devoting that much silicon to logging and debugging might seem extravagant, but there's no point producing hardware whose full power can't ever be realized. I'm still fascinated by the potential of parallel computing; maybe this time, we'll finally see that potential become a reality.
Gregory Wilson is a DDJ contributing editor. He can be contacted at [email protected].