"We work in a mature field now." Yeah, right. Working software pros know better. Can you imagine concepts like mindshare and the "hype trajectory" transplanted to, say, aeronautics? Daring engineers design concrete aircraft! "They're maintenance-free! Built to last, with a mold-on upgrade path (MOUP)." Venture capitalists sign on. Catapulted prototypes (briefly) leave the ground. "Early Q3 2002, transatlantic airliners! Early adoption equals market advantage." A hype-storm brews, acronyms proliferate, pundits pontificate, startups start up. Fleets of concrete aircraft taxi ponderously about, almost leaving the ground. Passengers desert in droves to the freeways. A suspiciously similar set of pundits begins to cast aspersions. Promoters blame halfhearted adoption:
"Remember, we said you'd need 20-mile-long runways." Big-business-conspiracy theorists mutter about Boeing and Airbus crippling the concept. Concrete airplanes fade as the new baking-soda-and-vinegar jet engine captures the attention of the trade media
Beyond Immoderate Praise
OK, so maybe I'm "piling" it on (sorry). The tricky part is that there's
often a nugget of value buried under the mountain of hype. Certainly, peer-to-peer
technologies have gotten their share of immoderate praise, and suffered the
inevitable backlash. In his April 25, 2001 article for Slashdot, "Does
Peer-To-Peer Suck?", Jon Katz quotes "respected Stanford Law Net
guru Lawrence Lessig: 'Peer-to-peer,' [Lessig] exults, 'is the next great thing
for the Internet.'" Then Katz observes sourly, "If we've learned anything
in the past decade or so, it's to run for your life whenever you hear anybody
say that. One thing you can take to the bank (if it's still letting you in the
door): Peer-to-peer is not the next great thing, on the Net or off." But
now that some of the transform-the-planet fervor has died down, useful tools
are emerging, and good work has already been accomplished.
What was all the hype about in the first place? What is this P2P stuff, anyway? Ah, that's a slippery question in itself. Do computing clusters count? Napster? ICQ? SETI@Home? Heck, what about Windows file shares? Well, that way lies flame war, friends, so let's not go there. I think we can agree on a spectrum, from definitely not peer-to-peer (your browser retrieves this article from Software Development's Web server) to certainly is (a decentralized file-sharing system like Gnutella). Columnist and Accelerator Group partner Clay Shirky drew a useful line around the middle way back in November 2000 when he published "What is P2P And What Isn't" on O'Reilly's Openp2p.com.
"P2P," he wrote, "is a class of applications that takes advantage of resourcesstorage, cycles, content, human presenceavailable at the edges of the Internet." And "If you're looking for a litmus test for P2P, this is it: First, does it treat variable connectivity and temporary network addresses as the norm, and second, does it give the nodes at the edges of the network significant autonomy?" Finally, and famously, "PCs are the dark matter of the Internet, and their underused resources are fueling P2P."
From MP3s to Mapping the Genome
Unless you've been living with a volleyball on some island for the past year,
you're more than familiar with Napster, the absurdly popular music-file sharing
service that ran afoul of multiple copyright lawsuits. I work at the University
of Wisconsin, a major research institution. We've got your high-speed distributed
computing, your distance-education streaming video, your professors swapping
data on global climate, all of them bandwidth hogs. None of those even came close
to top honors, however. Statistics showed that, at its height, the single greatest
consumer of network bandwidth at UW was Napster traffic to the dorms, by a factor
of two or better. Napster's popularity helped jump-start the peer-to-peer movement; to
many, the two are synonymous. In fact, back in 2000, Lincoln Stein, a researcher
involved in the Human Genome Project, got written up in Wired magazine's
online journal for examining Napster as a mechanism for publishing gene-sequence
information to researchers worldwide. But when I contacted him, Stein, who is
an associate professor at the Cold Spring Harbor Laboratory in New York, told
me that so far, peer-to-peer hadn't measured up.
"Napster wasn't scalable because it relies on a central directory. Also, it uses hard-coded attribute fields, such as artist, that apply only to song files. To distribute genome sequence information, I needed a flexible way of describing and searching for attributes." Stein also investigated using the decentralized Gnutella protocol: "Gnutella provides much better handling of attribute fields. However, its concept of a 'network horizon' means that the world is inevitably fragmented into many small subnets that aren't connected. Genome researchers need access to all the data, not to the subset that happens to be connected at the time." And Freenet, according to Stein, wasn't just skimpy with attributes; due to privacy concerns, it also made it impossible to discern a datum's provenancea critical item for a researcher.
"I haven't given up on P2P," Stein claims, "but I'll need more robust protocols that are available as open-source implementations before I can use it for serious work."
Anyone Out There?
But there's a lot more to P2P than file sharing. Chances are, right now your
PC and Internet connection are running at some fraction of their capacity: capacity
you paid for whether you're using it or not. There are several big computing
projects out there that would be happy to utilize that capacity. This is the
key idea behind distributed computing: donating or renting currently unused
computing resources to a large project. One of Gnutella's lead developers, Gene
Kan, put it this way in an August 2, 2001 article for O'Reilly: "The price
of performance is decreasing constantly while the performance itself is increasing
ridiculously. That means I'm pretty happy to share my 'Pentium 8 50gHz' because
I only need all that horsepower while Windows boots. After that, the CPU is
hardly utilized because I can't hit 50 billion keys in a second. Between keys,
my computer could be cracking RC5 or musing on colon cancer."
The obvious example of such a system is SETI@Home: download a client, hook up to a server, and whenever your machine is idle, it begins to analyze radio-telescope data for patterns indicating signals from sentient life. The SETI@Home Web site shows that 3.5 million users have registered, donating more than 869,000 years of compute time (working out to something like 30 teraflops, or 30 trillion floating-point operations per second). Making it simple for volunteers to participate is no trivial proposition, when you think about it. Since many clients connect via ISPs, their IP numbers change all the time (that "dark matter" problem again). So the system designers worked out a protocol whereby the newly connected client uploads its current IP to a known server address; after that, communication proceeds between peers.
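Boiled down to a toy Java sketch, that registration dance looks something like this. It isn't the actual SETI@Home protocol, and every name here is invented, but it captures the idea: a client whose address keeps changing reports its current IP to a well-known registry, and anyone who wants to reach that client asks the registry first.

    import java.util.*;
    import java.util.concurrent.*;

    // Toy sketch of the registration idea; not the real SETI@Home protocol.
    public class RegistrySketch {
        // The well-known server: maps a stable client ID to its latest reported address.
        static class Registry {
            private final Map<String, String> current = new ConcurrentHashMap<>();
            void register(String clientId, String address) { current.put(clientId, address); }
            Optional<String> lookup(String clientId) { return Optional.ofNullable(current.get(clientId)); }
        }

        public static void main(String[] args) {
            Registry registry = new Registry();

            // A dial-up client gets a fresh address from its ISP each session
            // and re-registers before asking for (or returning) work.
            registry.register("client-42", "203.0.113.17");
            registry.register("client-42", "198.51.100.9");   // next session, new IP

            // Any peer that wants to reach client-42 consults the registry first.
            registry.lookup("client-42").ifPresent(addr ->
                    System.out.println("client-42 is currently at " + addr));
        }
    }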
CRYSTAL Becomes CONDOR
Harnessing networked computers together into a big parallel processor is nothing
new, of course. Look at where I work, for example: Way back in 1985, when I
walked into the lab for my first systems-admin job at UW, I did a double-take:
Why on earth were we operating a laundromat? It was, in fact, a big rack of
20-some DEC VAX 11/780s, all networked together into a parallel computer called
CRYSTAL. The network connection was a highly customized token ring, the operating
systems were built practically from scratch, and of course the application software
was purpose-built to run on CRYSTAL. But it definitely hummed, and proved the
point that shared-memory, close-coupled parallelism wasn't the only way to build
a supercomputer.
CRYSTAL made a darned fine space heater for the lab, too, but it was already obsolete. Three years later, noticing how many faculty members had their own high-powered workstations, UW researchers started building CONDOR. Where CRYSTAL needed a dedicated group of identical machines, CONDOR was software, intended to exploit idle cycles from a pool of workstations. To join the CONDOR pool on the local area network, researchers could simply run a daemon, allowing them to submit jobs to the grid and to add their computer to its resources.
Of course, now that people's personal workstations were involved, the CONDOR team had to develop techniques for keeping participants happy, or they'd drop right out of the flock. (Yes, CONDOR has its own little ornithological jargon: flocks, gliding in to computations, you name it.) So they built an entire language for users to specify things like the maximum permissible load on the machine, who was allowed to submit jobs to it and when, and so on.
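CONDOR's real policy language is considerably richer than anything I can show here, but a hypothetical Java stand-in (the names and thresholds below are mine) gives the flavor of the rules owners write: accept a job only if the machine's load is low, the submitter is on an approved list, and it's outside the owner's working hours.

    import java.time.LocalTime;
    import java.util.Set;

    // Hypothetical stand-in for an owner's acceptance policy; not CONDOR's actual syntax.
    public class AcceptPolicySketch {
        static boolean shouldAccept(double loadAverage, String submitter, LocalTime now) {
            Set<String> allowedSubmitters = Set.of("physics-group", "genome-group");
            boolean offHours = now.isBefore(LocalTime.of(8, 0)) || now.isAfter(LocalTime.of(18, 0));
            return loadAverage < 0.3 && allowedSubmitters.contains(submitter) && offHours;
        }

        public static void main(String[] args) {
            System.out.println(shouldAccept(0.1, "physics-group", LocalTime.of(23, 30))); // true
            System.out.println(shouldAccept(0.9, "physics-group", LocalTime.of(23, 30))); // false: machine is busy
        }
    }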
Technical challenges arose, too: What happens to a half-completed computation when the computer's owner turns back to it and starts to type? The team had to add checkpointing logic, which let CONDOR save its computational state periodically so that, if interrupted, it could pick up from the most recent checkpoint. Their efforts were fruitful: CONDOR is still in use today, and is still being improved (in fact, the last time I looked, there were three job openings on the Web site!). In September 2001, a new release made CONDOR pools available as resources for something even bigger. CONDOR 6.3.0 included support for the emerging standard in grid computing: the Globus Toolkit.
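Before we dig into the Toolkit, that checkpointing idea deserves a quick illustration. The Java sketch below is mine, not CONDOR's actual mechanism: serialize the computation's state every so often, and on startup, resume from the last saved checkpoint if one exists.

    import java.io.*;

    // Minimal checkpoint-and-resume sketch; not CONDOR's implementation.
    public class CheckpointSketch {
        static class State implements Serializable {
            long nextItem;          // where to resume
            double partialResult;   // work accumulated so far
        }

        static void save(State s, File f) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
                out.writeObject(s);
            }
        }

        static State load(File f) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
                return (State) in.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            File checkpoint = new File("job.ckpt");
            State s = checkpoint.exists() ? load(checkpoint) : new State();

            for (long i = s.nextItem; i < 1_000_000; i++) {
                s.partialResult += Math.sqrt(i);           // stand-in for the real work
                s.nextItem = i + 1;
                if (i % 100_000 == 0) save(s, checkpoint); // periodic checkpoint
            }
            checkpoint.delete();                           // finished; nothing to resume
            System.out.println("done: " + s.partialResult);
        }
    }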
Technological Toolbox
The Toolkit is an open-source set of software tools, developed primarily at
the Argonne National Laboratory, the University of Southern California's Information
Sciences Institute and the University of Chicago's Distributed Systems Laboratory.
(OK, here's another plug for UW: The CONDOR team is listed as a "major
contributor.") The Globus Toolkit tackles problems like resource discovery
and directory services, resource allocation, and single-sign-on authentication,
to enable users to create Net-wide infrastructures of jaw-dropping computing
power. And for anyone who still thinks that "open-source project"
means "impractical, wild-eyed dream," the list of Globus supporters
is almost a Who's Who of Big Fast Computers: Compaq, Cray, Entropia, Fujitsu,
Hitachi, IBM, Microsoft, NEC, SGI, Sun Microsystems and Veridian have all publicly
committed to adopt the Toolkit for their platforms; Platform Computing plans
to build a commercial implementation of it. Other institutions building grids
with the Toolkit include the U.S. National Partnership for Advanced Computational
Infrastructure and the U.S. National Computational Science Alliance, the European
Datagrid Project, and NASA's Information Power Grid.
The Nitty-Griddy
What's a "grid"? Glad you asked. In "The Anatomy of the Grid:
Enabling Scalable Virtual Organizations" (International Journal of Supercomputer
Applications, 15(3), 2001), Ian Foster, Carl Kesselman and Steven Tuecke
point out that "The real and specific problem underlying the Grid concept
is coordinated resource sharing and problem solving in dynamic, multi-institutional
virtual organizations ... This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs ... The following are examples of VOs: the application service providers, storage service providers, cycle providers and consultants engaged by a car manufacturer to perform scenario evaluation during planning for a new factory; members of an industrial consortium bidding on a new aircraft ... and members of a large, international, multiyear high-energy physics collaboration."
Foster, Kesselman and Tuecke go on to note that "Current distributed computing technologies do not address [these] concerns and requirements." Technologies like CORBA and J2EE share resources in an organization; commercial solutions for distributed computing require "highly centralized access to those resources." So to qualify as a grid, a distributed-computing setup must provide decentralized access to powerful computing resources, allowing virtual organizations to come and go as needed. In addition, issues like security, access control and who pays for what must be built in from the start, not slathered on as an afterthought.
If your mind is beginning to boggle at the grandiose scale of these plans, I'm right with you, but the companies involved don't seem daunted. IBM, for instance, is jumping in with both big blue feet, implementing the Globus Toolkit in their eServer Linux systems. They're working on the Distributed Terascale Facility, a National Science Foundation project to build a grid with well over 10 teraflops peak capacity, by mid-2003. According to the NSF, the primary vendors are IBM (servers), Intel (processors) and Qwest (40-gigabit/second network); clusters of high-speed Itanium-processor machines are being set up at four sites. The National Center for Supercomputing Applications in Illinois will provide the biggest number-cruncher, with a new 6-teraflop cluster added onto existing resources for a total of 8 teraflops available, plus 240 terabytes of secondary storage. The San Diego Supercomputer Center will handle data and "knowledge management" with a 4-teraflop cluster and another 225 terabytes of storage. At Chicago's Argonne National Laboratory, a 1-teraflop cluster will be available for visualization and data rendering; Caltech will chime in with scientific data, to the tune of 0.4 teraflops and 86 terabytes of storage.
In other words: we're talking serious gaming platform here. SimGalaxy, anyone?
Sharing Standards
There's more to P2P than teraflops and terabytes, though. At the other end of
the spectrum, peer-to-peer is not about what you've got (that is, brute computing
power); it's more about who you're working with. The file-sharing frameworks
pointed the way, but as more people discover the power of end-to-end application
connectivity, they're putting P2P to use in all kinds of applications: sharing
LANDSAT images, collaborative computer-aided design work, shared diaries, auctions
and much more.
Until fairly recently, developing such software was a challenge, not least because the most common P2P architecture was the "silo": a monolithic piece of software that handled everything from getting through firewalls, to discovering peers on the Net, to the nitty-gritty of passing messages back and forth. It's crazy, but your SETI@Home client, your ICQ chat client, and your Gnutella file-sharing client are all doing pretty much the same thing, with independently developed protocols. Fortunately, however, standards are beginning to emerge. For one thing, the burgeoning interest in Web services is creating a whole culture of programmers who grok Simple Object Access Protocol (SOAP) and its friends. While not designed expressly for P2P, the Web services protocols certainly get the job done. So much has been written about SOAP and its companion technologies that it's hardly worth going into here. Suffice it to say that the widespread familiarity with SOAP helps solve the chicken-and-egg problem common to P2P adoption. And there's nothing technically wrong with using SOAP for P2P; in Shirky's opinion, "the Web services stack is a better attempt at encoding and serialization than anything the P2P folks could come up with on their own; SOAP looks like the P2P implementation language to me."
The Next Big Thing?
But Shirky, and many others, are also keeping a close eye on Project JXTA (pronounced
juxta, as in juxtapose: it's not an acronym). JXTA is an open-source
initiative to develop a complete peer-to-peer infrastructure. Initially bootstrapped
by Sun Microsystems and now fueled in large part by its developer community,
Project JXTA is intended to promote collaboration among everything from servers
to cell phones. Central to JXTA are three concepts: groups, pipes and monitoring.
Groups pertain to how peers come online, receive unique IDs, get around firewalls,
and, most importantly, discover others in their group. Pipes are the basic communication
facility in JXTA, and come in various flavors: unidirectional, bidirectional,
propagate-to-group, "reliable" (like TCP) and "unreliable"
(like UDP). And with hooks for monitoring services built-in, IT staff can track
traffic for peer nodes, and will someday be able to manage them, shutting
a node off if it swamps the local net, for instance.
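To make those pipe flavors concrete, here's a hypothetical Java interface. It is not the actual JXTA API, just a way of showing that a single pipe abstraction can offer several delivery options.

    import java.util.function.Consumer;

    // Hypothetical sketch of a pipe abstraction; not JXTA's real classes.
    public interface PipeSketch {
        enum Flavor {
            UNIDIRECTIONAL,      // one sender, one receiver
            BIDIRECTIONAL,       // both ends may send
            PROPAGATE_TO_GROUP,  // one message fans out to every peer in the group
            RELIABLE,            // delivery and ordering guaranteed, TCP-like
            UNRELIABLE           // best effort, UDP-like
        }

        Flavor flavor();

        // Send a message into the pipe; whether it is guaranteed to arrive
        // depends on the flavor chosen when the pipe was created.
        void send(byte[] message);

        // Register a callback for messages arriving from the other end (or,
        // for a propagate pipe, from any peer in the group).
        void onReceive(Consumer<byte[]> handler);
    }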
Juan Carlos Soto, whose extra-wide business card bears the title "Group Marketing Manager for Project JXTA and Community Manager for JXTA.org," thinks the project may be positioned to catch the Next Big Thing. "In the 1980s," says Soto, "a big turning point was the adoption of TCP/IP, not because it was a superior network protocol, but because it had been broadly adopted. In the 1990s, the innovation all blew up around HTML. We think that there's a similar phenomenon underway with P2P. The key idea is that the devices on the edges are not just consumers, they're pretty powerful, and able to be providers, too."
Mind you, that's devices, not computers: "People are talking to us about putting it in light switches," says Soto. He claims that's an important distinction from the PC-server-HTTP model of SOAP, UDDI and their ilk: "Most of the Web services protocols seemed like overkill for having your PDA interact with your cell phone," and he goes on to point out that for other kinds of devices, HTTP connectivity can't be assumed in the first place. But Soto says Project JXTA certainly hasn't written off communication with the Web services world, either. "It's still being looked at. There's a project on JXTA.org, Network Services, to find out where it makes sense to either have seamless links into existing Web services or adopt their protocols."
Not Just Java
JXTA is intended to be language-, platform- and transport-agnostic. The reference
implementation is in Java; you might assume that since JXTA starts with a J,
and Sun started JXTA, it's Yet Another Java Extension. Not so, says Soto: Sun's
resources are going into both C and Java 2 Micro Edition versions of the protocols,
and the community is busily coming up with others: Objective-C, Perl and Python
have all been demonstrated. And while most applications built with JXTA currently
use TCP/IP as the low-level transport layer, that's not a requirement.
Soto is quick to point out that Sun hopes to profit substantially from JXTA's success, though not from the protocols themselves. "JXTA is available for anybody to use under an Apache-style OS license. Sun is a player just like anybody else; our view was that a lot of our product line would benefit from having P2P resources available. We hope that JXTA becomes the protocols that trigger the next wave of innovation. One company couldn't do this alone, so the best way was through an open-source effort." He cites the project discussion lists as evidence of JXTA's vitality: "If somebody posts a question, more likely than not, it's answered by a community member, not somebody paid by Sun."
The worlds of grid computing and protocols like JXTA aren't mutually exclusive, of course; in fact, some companies have built systems that use the JXTA protocols for peer discovery and initial communication, then use heavy-duty APIs like the Globus Toolkit to get the computing done.
So real companies are indeed building real applications today with P2P. There's Groove Networks, Consilient and Ikimbo, building their business-to-business collaboration platforms; OpenCola, with collaborative search-and-discovery tools; and Entropia, makers of distributed-computing software. As of this writing, even the beleaguered Napster hasn't tossed in the towel, still hoping to launch a new service in 2002. When the hype-meisters have long since moved on to the next concrete-airplane fad, I'm confident that peer-to-peer technologies will still be delivering real value to their users, and real advantages to software developers.