Priority Inversion: How We Found It, How We Fixed It
By Glenn E. Reeves
Glenn is Chief Engineer, Mission Data System Project, at JPL. He can be contacted at [email protected].
The Mars Pathfinder spacecraft was controlled by a single RS6000-based single-board computer residing on a VME bus. The VME chassis also contained interface cards for the radio and camera, and an interface to a 1553 bus. The 1553 bus, in turn, connected to the "cruise stage" and the "lander" parts of the spacecraft. The hardware on the cruise stage controlled the thrusters, valves, sun sensor, and star scanner. The hardware on the lander provided an interface to the accelerometers, radar altimeter, and an instrument for meteorological science (ASI/MET). The hardware used to interface to the 1553 bus (at both ends) was inherited from the Cassini spacecraft.
To support the Mars Pathfinder Mission, Wind River Systems ported its standard VxWorks for the 680x0 to the RS6000. The RS6000 is the same single-chip CPU that can be found in some (now older) IBM AIX workstations. The Mars Pathfinder flight software also had several debug features that were used in the lab, but not on the actual flight spacecraft because they produced more information than we could send back to Earth. These features were not enabled, but remained in the software by design.
One of these tools was a trace/log facility that was originally developed to find a bug in an early version of the Wind River Systems VxWorks port. David Cummings, one of the JPL software engineers, built the trace/log facility. Lisa Stanley (at Wind River Systems) took this facility and instrumented the pipe services, msgQ services, interrupt handling, select services, and the tExec task. The facility initialized at startup and continued to collect data (in ring buffers) until told to stop. The facility produced a voluminous dump of information when asked.
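To make the ring-buffer idea concrete, here is a minimal sketch in C of a trace log that keeps only the most recent events, can be told to stop, and dumps what it retained in order. It is not the JPL or Wind River Systems code: the names (trace_log, trace_record, and so on), the entry layout, and the buffer size are all assumptions made for illustration. A facility that records from interrupt handlers would also need to protect the buffer against concurrent updates, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_CAPACITY 8u   /* hypothetical size, kept small for illustration */

typedef struct {
    uint32_t timestamp;  /* e.g. tick count when the event happened */
    uint16_t event_id;   /* which instrumented service produced the event */
    uint16_t arg;        /* one word of event-specific detail */
} trace_entry;

typedef struct {
    trace_entry entries[TRACE_CAPACITY];
    size_t next;     /* slot the next record will be written to */
    size_t count;    /* number of valid entries, at most TRACE_CAPACITY */
    int running;     /* nonzero while collection is enabled */
} trace_log;

/* Start with an empty buffer and collection enabled, as at system startup. */
void trace_init(trace_log *log)
{
    assert(log != NULL);
    log->next = 0;
    log->count = 0;
    log->running = 1;
}

/* Record one event, overwriting the oldest entry once the buffer is full. */
void trace_record(trace_log *log, uint32_t timestamp,
                  uint16_t event_id, uint16_t arg)
{
    assert(log != NULL);
    if (!log->running)
        return;
    log->entries[log->next] = (trace_entry){ timestamp, event_id, arg };
    log->next = (log->next + 1) % TRACE_CAPACITY;
    if (log->count < TRACE_CAPACITY)
        log->count++;
}

/* Freeze the buffer so the events leading up to a failure are preserved. */
void trace_stop(trace_log *log)
{
    assert(log != NULL);
    log->running = 0;
}

/* Copy retained entries, oldest first, into out; returns how many were copied. */
size_t trace_dump(const trace_log *log, trace_entry *out, size_t out_len)
{
    assert(log != NULL && out != NULL);
    size_t n = log->count < out_len ? log->count : out_len;
    size_t start = (log->next + TRACE_CAPACITY - log->count) % TRACE_CAPACITY;
    for (size_t i = 0; i < n; i++)
        out[i] = log->entries[(start + i) % TRACE_CAPACITY];
    return n;
}
```

In this sketch, a task that detects a failure would call trace_stop() before anything else, so that later activity cannot overwrite the events that led up to the error, and then call trace_dump() to read them out.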
When the "repeated reset problem" (for details, see RISKS Digest, Volume 19 Issue 54, January 10, 1998; http://catless.ncl.ac.uk/Risks/19.54.html) occurred on Mars, we ran the same set of spacecraft activities over and over again in the lab. Since the flight software had the trace/log facility enabled, and the failing task was already coded to stop the trace/log collection and dump the data for this error (even though we knew we could not get the dump in flight), we went into the lab to test whether we would have to change the software.
In less than 18 hours, we were able to repeat the problem, isolate it to an interaction of the pipe() and select() mechanisms, diagnose it as a priority inversion problem, and identify the most likely fix. In fact, the fix seemed straightforward: We had to change the creation flags for the semaphore used within the select() facility so as to enable priority inheritance. This change was possible because Wind River Systems supplied global variables for parameters such as the "options" parameter for the semMCreate() call used by the select service. This feature was not documented, so those who do not have the VxWorks source code, or have not studied it, might be unaware of it. Still, the fix was not that straightforward for several reasons:
1. The code for this was in selectLib and was common to all device creations. Once this global variable is changed, every select semaphore created after that point is created with the new options. There was no easy way in our initialization logic to modify only the semaphore associated with the problem.
2. If we made this change and applied it on a global basis, we didn't know how it would affect the behavior of the rest of the system.
3. Because Wind River Systems deliberately left the priority inheritance option out of the default selectLib service for optimum performance, we didn't know if performance would degrade if we turned priority inheritance on.
4. Finally, we didn't know if there was some intrinsic behavior of the select mechanism itself that would change if priority inheritance was enabled.
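The first concern in the list above can be illustrated with a small model in C. It is only a sketch of the general pattern of a library-wide default that is read at creation time. The option bits, the variable model_select_sem_options, and the functions are hypothetical stand-ins and are not the VxWorks names or values.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in option bits for this model only; not the VxWorks values. */
enum {
    MODEL_Q_PRIORITY     = 0x1,  /* waiters queue in priority order */
    MODEL_INVERSION_SAFE = 0x2   /* holder inherits a blocked waiter's priority */
};

typedef struct {
    int options;   /* fixed at creation time, never re-read afterward */
} model_sem;

/* Hypothetical library-wide default, consulted each time a semaphore is made. */
int model_select_sem_options = MODEL_Q_PRIORITY;

/* Models the library creating a select semaphore on behalf of a device. */
model_sem model_select_sem_create(void)
{
    model_sem s = { model_select_sem_options };
    return s;
}

/* Reports whether a semaphore was created with inversion protection. */
int model_sem_is_inversion_safe(const model_sem *s)
{
    assert(s != NULL);
    return (s->options & MODEL_INVERSION_SAFE) != 0;
}
```

In this model, a semaphore created before the default is changed keeps its old options, and every semaphore created afterward picks up the new ones. The caller has no way to single out one semaphore, which is why the change had to be evaluated for the whole system.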
In the end, we modified the global variable to include priority inheritance. This corrected the problem. We asked the Wind River Systems engineers to analyze the potential impacts for (3) and (4) above. They concluded that the performance impact would be minimal and that the behavior of select() would not change so long as there was always only one task waiting for any particular file descriptor (this was true in our system). I believe that the debate at Wind River Systems still continues over whether the priority inheritance option should be on by default. As for (1) and (2) above, the change did alter the characteristics of all of the select semaphores. We concluded, both by analysis and test, that there was no adverse behavior. We tested the system extensively before we changed the software on the spacecraft.
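For readers who have not run into priority inversion, the following C sketch simulates the general mechanism with a simple tick-based scheduler. It is a toy model, not the flight software: the three tasks, their priorities, and their run times are invented for illustration. A low-priority task takes a mutex, a high-priority task then blocks on it, and a long-running medium-priority task keeps the low-priority task from running unless the mutex owner inherits the waiter's priority.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_TASKS 3
#define SIM_NONE  (-1)

typedef struct {
    int priority;      /* larger value = more urgent (convention of this model) */
    int release_tick;  /* tick at which the task becomes ready to run */
    int work_left;     /* ticks of CPU time still required */
    bool needs_mutex;  /* true if the task does all its work holding the mutex */
} sim_task;

/*
 * Runs a fixed three-task scenario and returns the time at which the
 * high-priority task completes. Index 0 = low, 1 = medium, 2 = high.
 */
int sim_high_task_completion(bool priority_inheritance)
{
    sim_task tasks[SIM_TASKS] = {
        { 1, 0,  3, true  },  /* low: takes the mutex first */
        { 2, 1, 10, false },  /* medium: long-running, never touches the mutex */
        { 3, 1,  2, true  },  /* high: needs the mutex held by low */
    };
    int owner = SIM_NONE;

    assert(SIM_TASKS == 3);

    for (int tick = 0; tick < 1000; tick++) {
        int best = SIM_NONE;
        int best_priority = 0;

        for (int i = 0; i < SIM_TASKS; i++) {
            const sim_task *t = &tasks[i];
            if (t->release_tick > tick || t->work_left == 0)
                continue;                   /* not ready yet, or already done */
            if (t->needs_mutex && owner != SIM_NONE && owner != i)
                continue;                   /* blocked on the mutex */

            int effective = t->priority;
            if (priority_inheritance && owner == i) {
                /* Owner inherits the priority of its most urgent blocked waiter. */
                for (int j = 0; j < SIM_TASKS; j++) {
                    const sim_task *w = &tasks[j];
                    if (j != i && w->needs_mutex && w->work_left > 0 &&
                        w->release_tick <= tick && w->priority > effective)
                        effective = w->priority;
                }
            }
            if (best == SIM_NONE || effective > best_priority) {
                best = i;
                best_priority = effective;
            }
        }

        if (best == SIM_NONE)
            continue;                       /* idle tick */

        if (tasks[best].needs_mutex)
            owner = best;                   /* take (or keep) the mutex */
        if (--tasks[best].work_left == 0) {
            if (owner == best)
                owner = SIM_NONE;           /* release on completion */
            if (best == 2)
                return tick + 1;            /* time at which high finished */
        }
    }
    return SIM_NONE;                        /* did not finish within the bound */
}
```

With these made-up numbers, sim_high_task_completion(false) returns 15, because the high-priority task has to wait for all of the medium task's work, while sim_high_task_completion(true) returns 5. On a real system with a watchdog or a hard deadline, a delay of the first kind is what turns into a missed deadline and a reset.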
We weren't able to catch the problem before launch because it would only manifest itself when ASI/MET data was being collected and intermediate tasks were heavily loaded. Our prelaunch testing was limited to the "best case" high data rates and science activities. The fact that data rates from the surface were higher than anticipated, and the amount of science activity proportionally greater, served to aggravate the problem. We neither expected nor tested the "better than we could have ever imagined" case.
We did see the problem before landing, but could not get it to repeat when we tried to track it down. It was neither forgotten nor deemed unimportant. Yes, we were concentrating heavily on the entry and landing software. Yes, we considered this problem lower priority. Yes, we would have liked to have everything perfect before landing. However, I didn't see any problem, other than that we ran out of time to get the lower priority issues resolved.
We did have one other thing on our side -- we knew how robust our system was because that is the way we designed it. We knew that if this problem occurred, we would reset. We built in mechanisms to recover the current activity so that there would be no interruptions in the science data (although this wasn't used until later in the landed mission). We built in the ability (and tested it) to go through multiple resets while we were going through the Martian atmosphere. We designed the software to recover from radiation-induced errors in the memory or the processor. The spacecraft would have even done a 60-day mission on its own, including deploying the rover, if the radio receiver had broken when we landed. There were a large number of safeguards in the system to ensure robust, continued operation in the event of a failure of this type. These safeguards allowed us to designate problems of this nature as lower priority. We had our priorities right.
Did we (the JPL team) make an error in assuming how the select/pipe mechanism would work? Probably. But there was no conscious decision to leave priority inheritance disabled. We just missed it. There were several other places in the flight software where similar protection was required for critical data structures, and the semaphores there did have priority inversion protection. A good lesson when you fly commercial off-the-shelf stuff -- make sure you know how it works.
DDJ
Copyright © 1999, Dr. Dobb's Journal