Safety in Software - now more important than ever.

I am going to discuss safety aspects of critical software systems using the Ariane 5 disaster as an example. "Safety" means that we are not talking about problems caused by attacks. In the case of Ariane 5 we are also not discussing insecurities due to excess or ambient authority which voided chances for damage reduction. We are really talking about unexpected errors in software that lead to safety problems. And - unlike many other safety or security problems I have seen - this one has a very special quality: we cannot just shut down the system when we detect a problem. Shutting down is OK for e.g. Automated Teller Machines (ATMs) and many other systems. But there is a class of systems, including devices in motion (airplanes, rockets, cars) and critical infrastructure (heating, medical equipment etc.), for which a shutdown is not an option and which will become more and more important in the future. Those devices cannot lose their liveness in the presence of safety or security problems. I have structured the paper as follows:

  1. Description of the Crash Scenario, External variables, threat model, chance and error

  2. Black Swan Problem

  3. Overflow problems (silent, exception, healing)

  4. Solutions: Documentation vs. Testing vs. Runtime Recovery, Review Problems

  5. Empirical software-technology: Multi-Version-Programming (Prof. Tichy, KIT)

  6. What to do in case of a RAM failure? (NMI caused by RAM failure etc., silent, detected)

The crash scenario

The Ariane rockets use a combination of laser/gyro sensors with an Inertial Reference System (IRS) to feed calibration data into the On Board Computer (OBC). This system is only needed before lift-off, but to avoid lengthy re-calibration times in case of a stopped count-down it was allowed to run 50 seconds into the flight phase. Ironically it was this subsystem that caused the crash of Ariane 5, at a time when the subsystem wasn't even necessary anymore. The culprit was a conversion error which caused an uncaught exception. The error showed up because Ariane 5 was able to create a higher horizontal bias than its predecessor, from which the IRS and laser/gyro units were re-used. The result of this conversion exception was that CPU 1 shut down and the hot-standby unit CPU 2 took over. Because it ran the same code with the same failure built into it, it crashed immediately and came to a halt. Now both subsystems were down and the control logic put a debug pattern onto the system bus, which the OBC mistook for navigation data. The resulting change of course destroyed the rocket.

The first question that comes up is why the engineers decided to put that debug pattern on the bus, where it confused the OBC. It looks like the hardware engineers thought that the IRS would only be needed before take-off, so that a failure could never happen with the rocket already in flight. They could instead have used an explicit protocol to notify the OBC about the fatal error and leave it a chance to take some emergency measures. The question highlights the thinking that prevails in embedded control: it is the thinking of hardware engineers, not so much of software specialists.

The next question is why the software engineers did not catch the exception from the conversion. The answer is rather simple: for the Ariane 4 such a value was physically impossible, as the rocket could not fly the way Ariane 5 could. Still, it leaves the question of a black swan: an out-of-range horizontal bias value could also be caused by some error in the gyro/laser unit or during transmission. Transmission can and should be protected, but what about internal, spurious errors? We will discuss this in the section on execution time safety later.

The third question is whether we are facing a system integration problem or a system internal problem. What is the difference? A system internal problem is e.g. a timer that is started and, when it expires, brings the system to a halt because the software forgot to reset the timer. This error is independent of external input or output. A system integration error occurs when one component or subsystem creates some output which kills a receiving component even though the output is correct in the eyes of the sending system. The debug messages on the system bus, which the OBC mistook for navigation data, are an example.

And the conversion code - is it an internal or a system integration problem? At first glance it looks like an internal problem: the possible exception did not get caught. But what finally caused the exception? It was an input value bigger than expected by the receiver. So it is a case of an integration problem and - as Bertrand Meyer has put it - a case of wrong re-use, because the components were re-used from an earlier version of the rocket.

Why is the difference between internal and integration errors important? Because it puts the error into the responsibility of different groups. System integration problems are only detected when BOTH systems, sender and receiver, are combined and looked at together. Typically those systems are developed by different teams in different companies, and that makes system integration problems so hard to avoid. The error won't show up in individual tests unless the test makes an assumption about the other system. And that assumption has to be correct.

Before I take a look at methods to avoid those problems I will briefly discuss what we have to assume in system building, in other words: what kind of "black swans" do we need to imagine?

The Black Swan

In philosophy the black swan problem is the problem of something totally unexpected, and assumed to be non-existent, suddenly showing up. (Wikipedia also has some paragraphs on the newer black swan theory by Taleb, but this is not what I mean here.) One can say that the horizontal bias of Ariane 5 was a black swan for the IRS subsystem: the value seen had been physically impossible before and the code was unprepared to handle the situation. (We will discuss defensive measures shortly.)

This raises the question of what kind of spurious and crazy errors we have to assume in software. This is a rather hard question because performance sets limits to what we can check. Let's start with input variables. Can we simply assume that they will never be bigger than what has been specified for the sending system? That raises a surprising question: were the Ariane 4 designers who built the old IRS correct in not expecting conversion problems simply because they were "physically" impossible? What does "physically" mean here? Does it mean that with Ariane 4 it was causally impossible that the horizontal bias would ever grow beyond a signed 16 bit integer when converted? Or does it mean that during correct operation the Ariane 4 rocket would not be able to exceed the value? Or does it mean that it is causally impossible for a higher value to ever be received?

Clearly, the last statement is wrong. Even with Ariane 4 there was always the possibility for the IRS to receive a larger value: transmission problems, spurious bugs in the gyro/laser unit and sheer chance could have caused such a "black swan" value. And the effect would have been just as disastrous.

If you have trouble seeing the point, Judea Pearl gives the example of the firing squad: even if the captain does not give the order to shoot, one of the soldiers could at any time decide to shoot, and the victim gets killed.

There are two very nice slide sets by Judea Pearl which give a gentle introduction to causality and its problems. The first one, "The Art and Science of Cause and Effect", explains e.g. the adjustment problem. This is the problem of choosing the right set of variables for causal analysis. The second one, "Reasoning with Cause and Effect", also discusses the differences between logical and causal explanations (the firing squad problem).

Overflow problems

The surprising result of our black swan discussion is that physical impossibility is no reason to conclude that a certain computer variable could not get larger than expected. And that means we will have to discuss more defensive measures. The horizontal bias value could have been derived from a different input value. Integer operations can cause silent overflows at any time. In this light, leaving any variable used in a conversion unprotected is dangerous.
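To make the difference between a silent overflow and an overflow that raises an exception concrete, here is a minimal sketch in Python. Python integers do not overflow by themselves, so the 16 bit behaviour is simulated. The function names are my own, hypothetical choices and have nothing to do with the original flight code.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16_silent(value: float) -> int:
    """Convert like an unchecked cast: keep only the low 16 bits (wraps silently)."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw

def to_int16_checked(value: float) -> int:
    """Convert with an explicit range check: out-of-range input raises an error."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit into a signed 16 bit integer")
    return result

if __name__ == "__main__":
    print(to_int16_silent(40000.0))      # wraps around to a negative number
    try:
        to_int16_checked(40000.0)
    except OverflowError as error:
        print("detected:", error)
```

The silent variant hands a plausible looking but wrong number to the next stage. The checked variant at least makes the problem visible, but somebody still has to decide what to do with the exception.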

But now let's discuss solutions to the black swan that really destroyed Ariane 5, the fact that Ariane 5 took a different flight trajectory, and what could have been done to detect this early.

Solutions: Documentation vs. Testing vs. Runtime Recovery, Review Problems

In his Critique of "Put it in the contract: The lessons of Ariane" Ken Garlington divided the problem space into three parts: documentation/design time, testing time and runtime (execution time). (Bertrand Meyer had claimed that Design by Contract would have prevented the crash already in the documentation/design phase.) But the chances of finding the mismatch between the overall Ariane 5 design and the constraints hidden deeply in the IRS subsystem were slim. Of course the constraint had been documented - under heaps of other documentation. And what if the constraint had been stated directly in the source code, as Design by Contract mandates? First, this requires that somebody puts it into the source code in the first place and considers it an essential restriction - something it became only in hindsight. Second, it means that somebody needs to discover the mismatch. And don't forget: it is a system integration problem, not an internal system problem. Could this happen during regular source code or documentation reviews? Not very likely. Reviews are useful to discover core architectural decisions, not small technical constraints. Reviews intended e.g. to discover buffer overflows in the BIND DNS software were largely ineffective.
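As an illustration of what "stating the constraint in the source code" could look like, here is a small sketch of a precondition attached to a conversion routine. The decorator `requires`, the exception `ContractViolation` and the function `convert_horizontal_bias` are hypothetical names of mine. This is not Eiffel and not the original Ada code, just the idea in Python.

```python
class ContractViolation(Exception):
    """Raised when a caller breaks a stated precondition."""

def requires(predicate, description):
    """Attach a precondition to a one-argument function and check it on every call."""
    def decorate(func):
        def wrapper(value):
            if not predicate(value):
                raise ContractViolation(f"{func.__name__}: {description}, got {value}")
            return func(value)
        wrapper.contract = description   # keeps the constraint visible to reviewers and tools
        return wrapper
    return decorate

@requires(lambda bias: -32768 <= bias <= 32767,
          "horizontal bias must fit into a signed 16 bit integer")
def convert_horizontal_bias(bias: float) -> int:
    return int(bias)

if __name__ == "__main__":
    print(convert_horizontal_bias.contract)
    print(convert_horizontal_bias(1200.5))
```

The sketch also shows the limits discussed above: the contract only helps if somebody writes it down, and it only fires when somebody actually calls the function with Ariane 5 values.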

An alternative solution suggests going through a detailed process of validating reusable components. In "Ariane 5 - Luftfahrt, Berühmt - berüchtigte Software - Fehler" Christian Weyand suggests the use of a process model to capture and validate the requirements of the new rocket. This includes a risk assessment stage. But even such a process oriented strategy does not guarantee that the mismatch will be detected.

The only method to really uncover the mismatch is to enforce the tagging of all input and output values with their respective value ranges and to run a model checker which does an automatic comparison. Here the problem is in defining the correct values for the sensor unit: the output values of the gyro/laser unit depend on the position of the rocket. This would require a model of the flight trajectory to be part of the system model.
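A very reduced sketch of such a range comparison could look like the following. It is far from a real model checker, and all names and numbers are invented for illustration; in particular the sensor ranges are NOT real Ariane values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, other: "Range") -> bool:
        """True if every value of `other` also lies inside this range."""
        return self.low <= other.low and other.high <= self.high

# Hypothetical interface descriptions: signal name -> declared value range.
ARIANE4_SENSOR_OUTPUT = {"horizontal_bias": Range(-30000, 30000)}   # invented numbers
ARIANE5_SENSOR_OUTPUT = {"horizontal_bias": Range(-60000, 60000)}   # invented numbers
IRS_INPUT = {"horizontal_bias": Range(-32768, 32767)}               # signed 16 bit

def find_mismatches(outputs: dict, inputs: dict) -> list:
    """Return the names of all signals whose output range exceeds the input range."""
    return [name for name, produced in outputs.items()
            if name in inputs and not inputs[name].contains(produced)]

if __name__ == "__main__":
    print(find_mismatches(ARIANE4_SENSOR_OUTPUT, IRS_INPUT))
    print(find_mismatches(ARIANE5_SENSOR_OUTPUT, IRS_INPUT))
```

The comparison itself is trivial. The hard part is the one mentioned above: somebody has to derive the declared output range of the sensor unit from a model of the new flight trajectory.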

What about testing? Wouldn't testing uncover the mismatch at some time? The diagram below shows several test approaches. Just using test data from previous Ariane 4 runs as input to the IRS would have shown a correctly functioning system. This was actually done during tests. Using the real gyro/laser unit as input still requires unit movements as expected for the Ariane 5 to show the error (and we have already stated that nobody really considered the difference in flight behavior to be of essential importance). Creating a simulation unit for the gyro/laser unit suffers from the same problem: the error would only have been uncovered if the simulation had behaved differently from the Ariane 4 rocket. This leaves us with using input from a fuzzer. Real test data get augmented with random values for one reason only: to detect a "black swan".

So the result with respect to testing is: yes, testing can uncover the bug but only when done in a way that encourages the discovery of a black swan, e.g. by using a fuzzer to generate random test input. Testing needs to include a phase that does not test against requirements but against the unexpected.
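The following sketch shows the idea of augmenting recorded test data with random outliers. The function `convert_unprotected` is a stand-in I made up for the re-used conversion code, and the "recorded" values are invented; mutation rate and scale are arbitrary assumptions.

```python
import random

def convert_unprotected(bias: float) -> int:
    """Stand-in for the re-used conversion: raises on values outside 16 bits."""
    result = int(bias)
    if not -32768 <= result <= 32767:
        raise OverflowError(f"{bias} does not fit into 16 bits")
    return result

def fuzz(function, recorded_inputs, rounds=1000, seed=1):
    """Feed recorded data plus random outliers; return the inputs that crashed."""
    rng = random.Random(seed)            # fixed seed keeps a failing run reproducible
    failures = []
    for _ in range(rounds):
        value = rng.choice(recorded_inputs)
        if rng.random() < 0.1:           # roughly every tenth value becomes a "black swan"
            value *= rng.uniform(-1000.0, 1000.0)
        try:
            function(value)
        except Exception:
            failures.append(value)
    return failures

if __name__ == "__main__":
    recorded = [100.0, 2500.0, -1800.0]  # invented "previous flight" data
    crashes = fuzz(convert_unprotected, recorded)
    print(len(crashes), "crashing inputs found")
```

With the recorded data alone the conversion never fails. Only the randomly scaled values leave the range the old rocket could produce and expose the unprotected conversion.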

Finally, what can we do at runtime when we hit a black swan? Or when we expect we might hit one. And not to forget: our main task at runtime is to stay alive. Stop and complain is not an option.

This brings us to runtime recovery. What could the code do to prevent crashing? Catching exceptions is only one half of the truth. To catch an exception you need to expect one first. And you should know what to do in case the exception is caught. Already the first point is hard: there are checked (declared) and unchecked (runtime) exceptions, and the discussion around when to use which has never really stopped. Do exceptions pose a performance problem? Only if they are thrown (which unwinds the stack) and caught. Otherwise they just tend to clog your code.

And what should the code do when the conversion exception is thrown? What exactly does the exception tell about the problem the code is facing? One way to prevent the problem would have been to make the code more generic by applying a scaling operation before the conversion instead of assuming certain flight boundaries.
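A sketch of such a scaling step might look like this. The parameter `full_scale` is my assumption: it stands for the largest magnitude the sender could ever deliver, and the receiver has to know it too in order to interpret the result.

```python
INT16_MAX = 32767

def scale_to_int16(value: float, full_scale: float) -> int:
    """Map the interval [-full_scale, full_scale] onto the signed 16 bit range."""
    if full_scale <= 0:
        raise ValueError("full_scale must be positive")
    # Saturate at the interval borders instead of overflowing.
    clamped = max(-full_scale, min(full_scale, value))
    return round(clamped / full_scale * INT16_MAX)

if __name__ == "__main__":
    print(scale_to_int16(250000.0, full_scale=100000.0))   # saturates at the upper limit
```

The price is resolution: the larger `full_scale` gets, the coarser each of the 16 bit steps becomes. And a saturated value is still a wrong value, it just no longer crashes the conversion.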

But what if there is really an impossible input value? The code could skip it or average it against previous values, and in both cases signal to the receiver that the value is now unclear and potentially wrong.
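The "average and signal" option could be sketched like this. `Reading` and `PlausibilityFilter` are hypothetical names, and falling back to 0.0 when there is no history yet is an assumption made only to keep the example short.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Reading:
    value: float
    trusted: bool   # False tells the receiver that the value was substituted

class PlausibilityFilter:
    def __init__(self, low: float, high: float, history: int = 4):
        self.low, self.high = low, high
        self.recent = deque(maxlen=history)   # last few plausible values

    def accept(self, raw: float) -> Reading:
        if self.low <= raw <= self.high:
            self.recent.append(raw)
            return Reading(raw, True)
        # Implausible input: substitute the mean of recent good values
        # (0.0 if there are none, an arbitrary assumption) and flag it.
        substitute = sum(self.recent) / len(self.recent) if self.recent else 0.0
        return Reading(substitute, False)

if __name__ == "__main__":
    sensor_filter = PlausibilityFilter(-32768, 32767)
    for raw in (100.0, 300.0, 1e6):
        print(sensor_filter.accept(raw))
```

The important part is the `trusted` flag: the module stays alive and keeps delivering values, but it no longer pretends that everything is fine.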

We see that the runtime options are rather limited. The Erlang approach of configuring hierarchical process supervisors which simply restart a failed module does not help much here (except in the case of a spurious bad input variable).

Empirical Software Technology: Multi-version programming

When logic fails, perhaps an empirical view on the problem might help. In "Die Rolle der Empirie in der Softwaretechnik" (in German) Prof. Tichy of KIT investigated the quality of so-called multi-version programming: take three different hardware platforms, programming languages and programming teams and hope that the errors that show up are independent (always leaving at least two working solutions). Unfortunately Tichy could show with empirical tests that this hope is largely unfounded: most software has problems at exactly the same locations. And we don't really have to talk about the price of multi-version programming. It seems to pay off in hardware, where the problems seem to be really independent (e.g. radiation striking one component).
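A toy example shows why correlated errors defeat the voting scheme. The three "versions" below are invented: I assume that the specification asks for saturation at the 16 bit limits and that two of the three teams made the same mistake and let the value wrap around.

```python
from collections import Counter

def wrap16(value: float) -> int:
    """Faulty behaviour: silently keep only the low 16 bits."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw

def saturate16(value: float) -> int:
    """Assumed correct behaviour: clamp to the signed 16 bit range."""
    return max(-32768, min(32767, int(value)))

# Three hypothetical "independent" versions; A and B share the same blind spot.
def version_a(value): return wrap16(value)
def version_b(value): return wrap16(value)
def version_c(value): return saturate16(value)

def vote(results):
    """Return the value produced by a strict majority of versions, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None

def run_voted(value: float):
    return vote([version(value) for version in (version_a, version_b, version_c)])

if __name__ == "__main__":
    print(run_voted(1200.0))    # all versions agree
    print(run_voted(40000.0))   # the two faulty versions outvote the correct one
```

For normal input the voter is useless because all versions agree anyway. For the black swan input it picks the wrong answer with full confidence, because the majority shares the same error.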

What to do in case of a RAM failure?

I recently noticed in a discussion with my students that most of them were not aware that RAM ICs at one time had an additional bit, the parity bit. The CPU would calculate odd or even parity bits on every access and therefore immediately detect the change of one bit e.g. due to radiation.
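For those who have never seen it, the mechanism fits into a few lines. This is a sketch of even parity over one byte, written in software for illustration; real memory does this in hardware, and the function names are mine.

```python
def parity_bit(byte: int) -> int:
    """Even parity: the extra bit makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2

def store(byte: int) -> tuple:
    """Return the 9 bit 'memory word' as (data byte, parity bit)."""
    return (byte & 0xFF, parity_bit(byte))

def check(word: tuple) -> bool:
    """True if the stored parity bit still matches the data byte."""
    byte, parity = word
    return parity_bit(byte) == parity

if __name__ == "__main__":
    word = store(0b10110010)
    one_flip = (word[0] ^ 0b00000100, word[1])    # a single bit changed
    two_flips = (word[0] ^ 0b00000110, word[1])   # two bits changed
    print(check(word), check(one_flip), check(two_flips))
```

A single flipped bit is detected immediately. Two flipped bits cancel each other out and go unnoticed, and parity alone can never tell which bit was wrong, so it detects but does not correct.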

This is no longer the case in most computers because the industry decided that a cost reduction of 1/9 in RAM ICs is worth countless spurious errors or wrong data in private machines (servers are usually equipped with ECC memory, but what good does that do if the data are manipulated on small client machines and written back?).

So nowadays RAM failures go unnoticed in the majority of cases and lead to wrong data on disks (where they are not safe either, see the findings of CERN on RAID controllers).

But an even more interesting question is what software could do if the failure were still detected. Systems used to take a very drastic approach to RAM failures: kernel panic. In my very first hours as a Unix kernel engineer I once asked why the system took such drastic measures and was told to come up with something better: what code can still be trusted after the detection of a bad RAM cell?

We need to distinguish different cases here, especially with respect to the location of the RAM failure. If it is in the operating system kernel we need to figure out which functions are still reliable. This requires us to have a mapping between RAM locations and OS functions and their dependencies. And while the mapping is probably less of a problem, the dependencies between functions might be a killer. To be on the safe side we would have to start a copy of the damaged function in a different memory area and re-link it with the rest of the system (the damage handling function of course would have to be duplicated in advance...). A major problem that could come up here is finding out which state the interrupted function was in: do we have to undo some actions, and do we know which ones? (This is like killing a thread that might have locked some resources and started copying.) While I don't see any fundamental problems with the mappings and the restart of functions, missing transactional guarantees could be a problem.

What if the RAM failure lies within application code or data? Surprisingly, code IS a problem here for the same reasons as above: we do not know what the code was doing at the moment of the RAM failure. Just taking the application data which have not yet been written to disk and making them persistent could lead to inconsistent data.

The result is that we need to have all functions transactional to be able to copy the function and restart it. And that affects the overall software architecture in a big way.
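What "transactional" means for a single function can be sketched as follows. The function works on a private copy of the state and the result is only published when it has run to completion. `HardwareFault` is a made-up exception standing in for a detected RAM failure, and real kernel code would of course look very different.

```python
import copy

class HardwareFault(Exception):
    """Hypothetical signal for a RAM failure detected while a function was running."""

def run_transactional(state: dict, operation, retries: int = 1) -> dict:
    """Run operation on a copy of state; return the new state only on success."""
    for _ in range(retries + 1):
        working_copy = copy.deepcopy(state)
        try:
            operation(working_copy)
        except HardwareFault:
            continue             # drop the half-modified copy, restart from the clean state
        return working_copy      # commit: the caller replaces the old state with this one
    return state                 # every attempt failed, the committed state is unchanged

if __name__ == "__main__":
    attempts = {"count": 0}

    def transfer(accounts):
        accounts["a"] -= 10
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise HardwareFault("simulated failure in the middle of the update")
        accounts["b"] += 10

    print(run_transactional({"a": 100, "b": 0}, transfer))
```

The interrupted first attempt leaves no trace because it only touched the copy. The cost is obvious as well: every function has to be written against copied state, which is exactly the architectural impact mentioned above.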