Jack is a DDJ contributing editor. He can be contacted at [email protected].
Priority Inversion: How We Found It, How We Fixed It
The Mars Pathfinder Mission was the first mission of NASA's Mars Surveyor program -- a decade-long program of robotic exploration that focused on the search for evidence of past life on Mars, understanding the Martian climate and its lessons for the past and future of Earth's climate, and understanding the geology and resources that could be used to support future human missions to Mars. As such, the mission (managed by the Jet Propulsion Laboratory; http://www.jpl.nasa.gov/) was one of the most ambitious and closely watched space missions in history.
Pathfinder essentially consisted of a stationary lander (the "Lander") and a surface rover (called "Sojourner"), which together had the primary objective of demonstrating the feasibility of low-cost landings on and exploration of the Martian surface. The Pathfinder system itself was built around a variety of off-the-shelf hardware and software components. At the heart of the system was a single CPU -- the RAD6000, a radiation-hardened version of IBM's RS/6000 processor -- running the Wind River Systems VxWorks real-time operating system.
Pathfinder was launched on December 4, 1996, and had a seven-month cruise to Mars. It landed on Mars on July 4, 1997, where Sojourner almost immediately began conducting experiments and collecting data; see Figure 1. In the first month of surface operations the mission returned to NASA about 1.2 gigabits of data, including 9669 Lander and 384 Rover images and about 4 million temperature, pressure, and wind measurements.
Suddenly, on September 27, 1997, communication with Sojourner was lost as the system began experiencing total resets, along with the resulting loss of data. The mainstream press promptly -- and generally inaccurately -- pounced on the problem, referring to "software glitches" that were due to the computer "trying to do too many things at once."
The problem, as JPL engineers like Glenn Reeves (the "Flight Software Cognizant Engineer for the Attitude and Information Management Subsystem, Mars Pathfinder Mission") quickly discovered, involved priority inversion -- a phenomenon familiar to real-time operating-systems engineers for more than 20 years. Contrary to most published reports, the problem was hardly extraterrestrial: A low-priority task (such as one for meteorological data gathering) occasionally grabbed a semaphore needed by a top-priority task (a bus management task, for example), then got preempted by medium-priority tasks (for instance, a communications task). While the medium-priority tasks ran, the low-priority task could not release the semaphore, so the high-priority task was not able to complete its work by the specified time. When the blocked top-priority task didn't do its work in time, a watchdog thread reset the system. This reset reinitialized all of the hardware and software. It also terminated the execution of the current ground-commanded activities. Data that had already been collected wasn't lost. However, the remainder of the activities for that day could not be accomplished until the next day.
Not only were JPL engineers aware of priority inversion from the outset, but they had carefully planned for instances of it. Consequently, they were able to quickly reproduce the problem on a duplicate system in the lab, then transfer a fix -- changing the system default priority for select() semaphore creation -- to the spacecraft about 100 million miles away. (Talk about remote debugging!) This was doable because they had originally created two writable images of the control software, and either one could be deleted and reloaded from Earth. The team chose to never delete one of the two original copies while patching from the ground, for fear of losing something clean to boot with.
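For readers who want to see the mechanism in miniature, the scenario and the fix can be sketched as a toy, tick-by-tick scheduler simulation. This is not flight code and not the VxWorks API: the task names, tick counts, and the five-tick deadline are all invented for illustration, and the sketch assumes the fix amounted to turning on priority inheritance for the shared semaphore (the article says only that the priority default for select() semaphore creation was changed).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Task:
    name: str
    priority: int                 # larger number = more urgent
    release: int                  # tick at which the task becomes ready
    ops: List[str]                # "lock", "unlock", or "work" (one tick of CPU)
    pc: int = 0                   # index of the next op to execute
    boost: int = 0                # priority inherited from a blocked waiter
    finished_at: Optional[int] = None


def simulate(inherit: bool) -> dict:
    """Run three tasks that contend for one mutex; return finish ticks by name."""
    tasks = [
        # Hypothetical stand-ins for the weather, communications, and bus tasks.
        Task("low", 1, 0, ["lock", "work", "work", "work", "unlock"]),
        Task("medium", 2, 1, ["work"] * 10),
        Task("high", 3, 2, ["lock", "work", "unlock"]),
    ]
    owner = None  # task currently holding the mutex
    tick = 0
    while any(t.finished_at is None for t in tasks):
        live = [t for t in tasks if t.release <= tick and t.finished_at is None]
        # Tasks whose next step is to take a mutex that someone else holds.
        waiting = [t for t in live
                   if t.ops[t.pc] == "lock" and owner is not None and owner is not t]
        if inherit and owner is not None:
            # Priority inheritance: the holder runs at its waiters' priority.
            owner.boost = max((w.priority for w in waiting), default=0)
        runnable = [t for t in live if t not in waiting]
        if not runnable:
            tick += 1
            continue
        task = max(runnable, key=lambda t: max(t.priority, t.boost))
        op = task.ops[task.pc]
        task.pc += 1
        if op == "lock":
            owner = task
        elif op == "unlock":
            owner, task.boost = None, 0
        else:  # "work" consumes one tick of CPU time
            tick += 1
        if task.pc == len(task.ops):
            task.finished_at = tick
    return {t.name: t.finished_at for t in tasks}


HIGH_RELEASE, DEADLINE = 2, 5  # assumed: high task must finish within 5 ticks

for inherit in (False, True):
    finish = simulate(inherit)
    late = finish["high"] - HIGH_RELEASE > DEADLINE
    print(f"inherit={inherit}: {finish} watchdog_reset={late}")
```

Without inheritance, the high-priority task sits behind all ten ticks of medium-priority work and misses its made-up deadline, which is the condition under which a watchdog would reset the system. With inheritance, the low-priority task is boosted while it holds the semaphore, releases it promptly, and the high-priority task finishes three ticks after it was released.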
In the end, all went well. The spacecraft completed its mission. Reeves, who is now the Chief Engineer, Mission Data System Project, recently took time to chat with DDJ contributing editor Jack Woehr about Pathfinder and other issues involved in writing software for extraterrestrial exploration.
DDJ: You're programming computers that operate on the surface of another planet. That's weird!
GR: It's weird for me, let me tell you. You're in this microcosm of people where your whole life is concentrated on making sure the next day's worth of activities is going to go just right. You're walking home at 3:00am still wired from staying up all night looking at information coming back. Your wife and the kids are asleep, there's no way you can sleep, and you've got to be back at 8:00am anyway. So you turn on CNN and there's your smiling face in the control room and you're looking at the same images you looked at 12 hours ago. It's a strange feeling.
It was a small project by a small team, yet it had a phenomenal impact, one that was overwhelming to many of us. On the technical side, it seemed straightforward at the time.
DDJ: I guess when you're doing the code, it's like any other real-time project.
GR: When you have a ballistic entry, there's not much you can do about it. There's a little unknown, where you're relying on information gathered 22 years ago, but in general, we tried to simplify that job as much as we could.
We spun the spacecraft, we had inertia on our side...the control part was fairly straightforward. The hard part was thinking what things could go wrong, what do we do if something breaks...There are very few things that are redundant. How are we to go about testing it? How do we make sure that we can build an environment that matches what we want the spacecraft to fly through? Things like that turn out to be much harder than the spacecraft development itself.
DDJ: That's always a problem when one designs a large system, how one tests it in anything resembling the deployment environment.
GR: Space is pretty forgiving, few forces, everything action and reaction. Coming down through the atmosphere, that's a little different ballgame, but we can still do decent simulations. We gave up on trying to model bouncing and orientation of the vehicle and the simulation of the accelerometer values during the movements. Those were too hard to do. We tested the thing as hard as we could in a brute force sort of way.
DDJ: You dropped it from a tall building?
GR: Some drop testing. We did a lot of testing of the airbags. We proved to ourselves they could withstand the forces of the bounces, made sure they wouldn't bottom out, would retain the right pressure, that the airbag material itself wouldn't rip prematurely or catastrophically. We did that for two or three years, it was either the most expensive portion of our test program or pretty close.
The actual orientation, opening the petals and turning the vehicle over, was very much brute force. Put the thing on the ground, read the accelerometers, see that it opened the right petals in the right order.
We did testing of the airbags as they are retracted. They get pulled in by cables within the bag. We did a lot of testing with rock placement, "Given the fact that bag's going to wrap around this rock, are the motors going to stall and leave the bag extended and flapping in the breeze?"
In the end, the darn thing bounced 16 times, landed right on its base, and all the airbags retracted right in, just like it was supposed to. The perfect scenario happened in the actual mission.
DDJ: It seems almost petty to ask about the computer programming problem in comparison with the flow of the entire enterprise.
GR: In this mission, most of the navigation was done from the ground. The attitude of the spacecraft was controlled onboard.
JPL used to build their own computers; in fact, at a point in time, we built our own CPUs. Not any longer, that's an art form that's better left to industry. Single-board computers are an off-the-shelf item.
So for the second time in history, JPL went out to buy a flight computer on contract. We put in a provision that specified an operating system on top of that. I wrote a lot of the flight computer specifications.
I was very familiar with [Mentor Graphics] VRTX and [Wind River Systems] VxWorks, and had done some [Integrated Systems] pSOS work before. So I specified a fairly generic real-time operating system. What was proposed by IBM, the winner of the flight computer contract, was OSOpen, done by IBM's Raleigh, North Carolina division. It was in early beta.
We looked at it, and they had some more work to do on it. Then suddenly around 1993-1994, IBM Federal Systems got sold to Loral. What had been one company became two divisions of two large companies that were mutually antagonistic. It became obvious that we weren't going to get a real-time operating system from these guys in Raleigh within the time frame we needed it.
That's when we decided to go to Wind River and see if they would come in and port VxWorks for us to this RAD6000 computer.
DDJ: The original IBM RISC mask, rad-hard?
GR: Right. In a nutshell, we sold Wind River on the neat publicity they would get porting VxWorks onto this processor. They bit, and did it fairly inexpensively and they did it quickly. We had a working version of VxWorks within four months.
Not only were they receptive to the idea, but they put people on it right away. They were very generous with time, there was no nickel-and-diming on this contract. When there were problems...I had a team of seven people doing the flight software for this (the rover was entirely separate)...the technical interchange was engineer-to-engineer, there was no contract management in the middle. It was a nice interface. They did a good job.
DDJ: And when you had to hot-patch in flight?
GR: That's standard procedure. You always build in the ability to change it.
DDJ: Just in case.
GR: Just in case, but JPL and a bunch of decent-sized companies have had the problem where you can't get all the software done in time for launch. You always make sure you build the capability to change things.
DDJ: When you send something 300,000,000 miles there's always the chance something you hadn't anticipated will happen.
GR: Well, as we said, space is forgiving, but on the surface of Mars and other places, you're right. But what we'd really like to do is build spacecraft that are much smarter about how to take care of things they don't precisely expect yet still achieve the things they are intended to do without interaction from the ground.
DDJ: Real robots! Were you an Asimov fan when you were a kid?
GR: Still am! Both a kid and an Asimov fan.
DDJ: What did your title on the Pathfinder mission -- "Flight Software Cognizant Engineer" -- signify?
GR: I was ultimately responsible for development and operation of flight software on the spacecraft itself. My butt on the line.
DDJ: Software project management, the final frontier! How do you guys actually get the product out?
GR: Here's my opinion on why Pathfinder worked as well as it did: We were focused. We had a specific goal, to get this spacecraft launched, get it to Mars, get it through the atmosphere, deliver the rover onto the surface. We managed to focus everyone on the team on that objective.
Number two, JPL was in this Total Quality Management phase, so they talked a lot about empowerment, authority, and responsibility. Some of us, despite being more than a little skeptical about TQM itself, took that and ran with it. If we were empowered, that meant we really had the ability to make the decisions. The whole project worked that way, and the management of the project structured it. They really did trust the people working on it. You really did take responsibility for the people you put on the project.
We were the small project at JPL. Still ongoing was the back end of all the Cassini (see Figure 2) development, a 3000-person development. Pathfinder was 300 persons or fewer. So, the team was very tightly knit.
All these things contributed to the success from the management point of view. I'd love to say that JPL took that lesson and applied it to all the subsequent projects, however, I would say the exact opposite was true. A lot of people look at Pathfinder as successful, yet, "We don't want to do it that way again."
Recognize, however, that one of the reasons that's true is that, in the past, we at JPL had these huge projects and were able to build up a lot of infrastructure institution-wide within a project. But since we have a bunch of much tinier projects now, we have to build an infrastructure across the board, so there's a lot more interdependency between projects than there has been in the past.
So the Pathfinder approach of putting yourselves out, a sort of skunkworks type of environment, just doesn't work anymore at JPL.
DDJ: As space travel becomes routine, you don't need geniuses to do it.
GR: When that occurs!
DDJ: But that's your job, year after year, trying to make it routine.
GR: That's correct. And once we make it routine, we'd like to make sure that industry is doing the routine part and we go off to the stuff that's still on the frontier.
DDJ: Glenn, what are you doing now?
GR: I'm working on one of those infrastructure things. JPL is developing some avionics hardware that's radiation hardened to the point it can survive at Europa, which has a very severe radiation environment. We are pushing the technology in this arena, looking at the part level at one megarad type of problems.
This is a lot of contractual work going outside of JPL. We're trying to fly as much commercial heritage stuff as we possibly can. The flight computer we've currently selected is PowerPC 750 based at 200 MHz.
DDJ: You're putting a Macintosh into space!
GR: Isn't that convenient? The processor speed of the thing we flew on Pathfinder was about 22 MHz. We've gone up an order of magnitude. That's a phenomenal increase. JPL has, historically, flown processor technologies that are 10 to 12 years behind...
DDJ:...As you wait for the rad-hard version of the part.
GR: Yup. What I'm specifically working on is a project that is sort of mislabelled as "Mission Data System." It's really an institution-wide effort to do a couple things. JPL has had several autonomy-oriented things occurring. We're trying to come up with an architecture that builds a foundation where we can have much more autonomous spacecraft than we've had in the past.
There are a lot of things involved with that, not just software, but also process, normal, up-to-date software engineering practices. I think JPL is moving from a hardware-centric organization to a more software-centric organization. The computer systems that we're building now look a lot like little local area networks. Computer systems both fast and less fast connected by things like FireWire and I2C.
DDJ: I2C is still going strong!
GR: It's not fast, but it's low power. That's often a real determinant. For high volume, high speed, we have 1394 FireWire -- 100 MBit/sec.
DDJ: I guess the NASA environment is pretty "toy-rich" for the programmer?
GR: We're trying to make it toy-rich in the sense that we're pushing hard for the commercial, off-the-shelf standardized things.
On Pathfinder we said, "We're going to fly the VME bus, because look at all the cards we can buy instead of doing in-house design and building every piece of test equipment."
We're still following that path.
DDJ: We started discussing the challenge of your current assignment in the context of trying to preserve the best of the Pathfinder skunkworks methodology in the current horizontal arrangement of smaller teams and standardized parts. How are you addressing that challenge?
GR: JPL is in the throes of trying to address team organization and interdependency between teams effectively. It's a big problem for an organization to go from a very command-and-control-oriented structure to a very interdependency-oriented one.
DDJ: Is it left to JPL internally, or are you one eddy in the NASA river?
GR: At this point it's pretty much JPL internal. Our interdependency issues are certainly known to NASA.
DDJ: You personally are being slurped into management. Do you get to write code anymore these days?
GR: Not these days...I have a position called Mission Data System Chief Engineer. It sounds like a technical decision making position, but it's primarily a programmatic activity. I'm resisting as best I can!
DDJ: All the techies reading DDJ are cheering, "Don't give in to the Dark Side!"
GR: Someone has to put the plan together. That deficiency has dragged me to the dark side.
DDJ: Yes, really, the management is the final frontier. Development environments are shrink-wrap now. Figuring out how to divide work suitably between people, how to interleave so the product arrives on time...
GR: Those are second-order effects. The first-order effects are methodology. UML or not UML? Languages, classic things, the ability to do things on the type of computers we're now buying...Some of the things we had to worry about in the past we don't have to worry about any more. A 200-MHz PowerPC is about 200 times faster than the processors that run in Cassini. A lot of efficiency issues are starting to disappear.
DDJ: So you just write "good old code" like anybody writes?
GR: Well, wouldn't that be nice? Wouldn't it be nice if we could get close to the maturing edge of some software engineering? At JPL, because we've been relegated to targets about 10 to 15 years back, we ended up having to use whatever's available, but we're no longer in that ballgame.
DDJ: What about Linux?
GR: We haven't evaluated it from a flight perspective. It has a fairly decent presence here on desktop computers. Every engineer has his opinion on the One True Way. Languages turn out to be a religious issue, too.
DDJ: What religions are sweeping JPL?
GR: Most of the work JPL has done has been primarily in procedural languages. We did a fair amount of Ada on Cassini, but from a true object-oriented perspective, we haven't been there yet, not on the flight side. Pathfinder and others are all C-based.
But moving towards an object-oriented direction takes you to an object-oriented language, C++ or Java...We have a whole bunch of artificial intelligence folks for whom LISP is a language of choice. Factor in a multitasking, preemptive real-time environment, even a multilanguage environment, and you have a whole bunch of religious conflicts.
DDJ: You're going to need memory to run that stuff.
GR: Look at the memory densities we're starting to see. On Pathfinder we had 128 megabytes of RAM. We moved into the realm where my desktop machine has more than the spacecraft. We're continuing in that direction. Let's put it this way, the software guys will be able to use all of that space. "If it's there, we can fill it up."
DDJ: When are we going to Europa?
GR: The hardware and the software come together in the two-year time frame for the two missions that will be supported. The first one is a mission called "Space Technology 4," or ST4. It's going off to a comet [to] take some samples [and] possibly eventually return. I'm not sure about the return part now, they've done some descoping lately.
The second mission is to Europa. They're launching possibly in 2003. There are other ongoing missions that will also probably use this technology, but those are the only two signed up now.
DDJ: What keeps people going to Congress for funding to shoot tin cans at Europa? Why does this happen?
GR: My personal opinion is that you have to recognize this is a very small planet. At some point in the future, man is going to move off into space, there's no doubt about it. People recognize there has to be some level of exploration that's ongoing to support that goal.
Perhaps the question is, "How can we spend billions in space while there are starving children in the world?" I think it's a balance, where you're balancing the short-term goals with the long-term goals.
I can't speak for the manned exploration side of NASA, but we on the robotic exploration side have witnessed great changes, going from billion-dollar spacecraft to $200 million spacecraft. NASA applies fairly small resources to greater and greater discovery and exploration missions.
DDJ: I grew up with Robert Heinlein's books about mining the asteroids. Is that in the foreseeable future?
GR: That's interesting, because this week or next week there's a guy coming from a company called SpaceDev Inc., to talk to us about commercial deep-space exploration. How there will be a shift from government-sponsored space exploration, whether Russian, Japanese, German, French, or American. I'm not sure he's going to address mining, but he's going to address that there are institutions both research and commercial that are willing to pay for images and spectroscopy in the hopes of discovering minerals, resources...
DDJ: The classic SciFi goals of space exploration.
GR: I can tell you that it's coming. Any time soon? Maybe in our lifetime, late...My kids will see it.
DDJ: Any advice for software engineers interested in space?
GR: My brother-in-law says, "Ya write the software but ya never actually know about the thing it's doing." I think that's wrong. I think you can still be a computer scientist and do phenomenally interesting applications of that art. Pathfinder is an example of how exciting that can really get.
DDJ
Copyright © 1999, Dr. Dobb's Journal