Professionalism/Ariane 5 Flight 501
The Event
On June 4, 1996, the European Space Agency (ESA) launched Ariane 5 Flight 501 from Kourou, French Guiana[1]. The rocket was designed to carry commercial payloads into orbit; on this flight it carried four Cluster satellites[2]. The satellites would have been placed into highly elliptical orbits to conduct research on Earth’s magnetosphere[3], and a successful launch would have made Europe prominent in the commercial space business[4]. The European Space Agency had spent 10 years and $7 billion to produce the rocket[4].
That day, all the effort that the Agency had put into building the rocket went to waste, along with $370 million[3], when the launcher veered off its flight path, disintegrated, and exploded only about 40 seconds after flight sequence initiation at an altitude of about 3,700 meters, scattering fiery rubble across French Guiana[4][5]. Fortunately, no lives were lost. Engineers from the Ariane 5 project teams of CNES and industry immediately began investigating the failure, which was traced to a small computer program that tried to stuff a 64-bit number into a 16-bit space[4][5].
About 37 seconds after the main engine ignition sequence started, the Inertial Reference System, which is used to calculate a rocket's velocity, position, and orientation for guidance, failed. As a result, incorrect control signals were sent to the engines and swiveled the rocket off course[2]. The explosion of the rocket increased the aluminum oxide content in the ground and water, damaging the environment of the French Guianese swamps[1].
Technical Failure
An independent Inquiry Board was set up days after the incident. One of its conclusions was that, because of specification and design errors in the Inertial Reference System (SRI), the launcher lost guidance and attitude information 37 seconds after the main engine ignition sequence started. The Board found that design faults in the SRI software caused the Flight 501 failure: in particular, the alignment function, a pre-launch function, was kept running after lift-off, which was incompatible with flight[1]. The computer within the back-up SRI, which was on stand-by for guidance and attitude control, became inoperative first. Approximately 0.05 seconds later, the active SRI, which was identical to the back-up system in hardware and software, failed for the same reason. Since the back-up inertial system was already inoperative, correct guidance and attitude information could no longer be obtained. Instead, the failed active SRI transmitted diagnostic information to the launcher's main computer, which interpreted it as flight data and used it for flight control calculations. Based on those calculations, the main computer commanded the booster nozzles and the main engine nozzle to make a large correction for an attitude deviation that had not occurred. The resulting rapid change of attitude exposed the launcher to aerodynamic forces that caused it to disintegrate 39 seconds after the main engine ignition sequence started. Destruction was automatically initiated upon disintegration, as designed[5].
The programme's tests and reviews, which had otherwise proved effective, did not reveal these anomalies, and the interface between ground mode and flight mode had been inadequately identified[1]. A European Space Agency paper, archived in the NASA Astrophysics Data System, explained that the Inertial Measurement Unit program on Flight 501 failed because of a code fragment that the flight phase did not use. It concluded that it is very important to exhaustively identify dead code and unused but active code, and to demonstrate that unused executable code can never result in a run-time error[6].
Though extensive reviews and tests occurred during the Ariane 5 Development Programme, they did not include adequate analysis and testing of the SRI or of the complete flight control system, which could have detected the potential failure[5]. The Agency erroneously assumed that because the SRI worked for Ariane 4, it would work for Ariane 5, which had a different technical specification[7].
The Status Quo Bias
With more than 100 successful launches, the Ariane 4 was one of the most successful rockets in ESA history[8]. It was in service for more than 20 years and came to be called the “workhorse” of the ESA’s fleet of launch vehicles. The Ariane 5 project team was tasked with designing a successor to the Ariane 4. The new rocket had to outperform its predecessor in payload capacity without compromising reliability.
With the Ariane 4’s success in mind, engineers working on the Ariane 5 began borrowing major components from the Ariane 4 program, including the Ariane 4’s software package[5]. Much of the Ariane 4’s software was designed as a “black box,” meaning it could be reused in different launch vehicles without major modifications. Ariane 5 engineers recycled everything from guidance control systems to flight path optimization software, because the Ariane 4 software package had a 100% success rate[8].
The Ariane 4's components essentially became the status quo. Given the options of designing a new system and using an Ariane 4 component, engineers took the easier, cheaper path. To redesign a system is to assume responsibility for that system’s success. Engineers did not want to change what they thought already worked, and ended up borrowing more than they should have.
The software which led to Flight 501's failure was borrowed from the Ariane 4[5]. It was not as adaptable as Ariane 5 engineers thought, and failed to function as it was supposed to. Engineers put too much faith in the SRI. By trying to avoid additional risk, the Ariane 5 design team inadvertently made a decision which resulted in catastrophic failure of the rocket.
Diffusion of Responsibility
The Ariane 5 Flight 501 explosion illustrates diffusion of responsibility because no one person or group was responsible for the disaster. A report issued by an independent inquiry board set up by the French and European Space Agencies stated that five separate groups working on the spacecraft could have prevented the explosion[9]. These groups were:
- Programmers: Better programming practice would have prevented the software error that resulted from converting a 64-bit floating-point value to a 16-bit signed integer.
- Designers: A better design that did not allow software exceptions to halt hardware units that were functioning correctly would have prevented the SRI from shutting down.
- Requirement Engineers: Better requirement analysis and traceability would have prevented the rogue piece of alignment code from earlier models of the Ariane from running during flight and causing the spacecraft failure.
- Test Engineers: A test to verify that the SRI would behave correctly when subjected to the flight sequence and trajectory of the Ariane 5 would have exposed the failure.
- Project Managers: Improved project management processes that facilitate closer engineering cooperation with clear authority and responsibility as well as consistent code and documentation would have increased the chances of exposing the failure[9].
While many groups could have been blamed, the European investigators chose not to single out any particular contractor or department. They said that "a decision was taken. It was not fully analyzed or fully understood. The possible implications of allowing it to continue to function during flight were not realized"[4].
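The first two items in the list above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the flight software; the names `to_int16_checked`, `to_int16_saturating`, `read_sensor_value`, and `ConversionOverflow` are hypothetical, and the sketch assumes finite inputs. It shows a range-checked conversion from a floating-point value to a 16-bit signed integer, and a caller that marks an out-of-range reading as invalid instead of letting the exception halt the unit.

```python
from typing import Tuple

INT16_MIN, INT16_MAX = -32768, 32767


class ConversionOverflow(Exception):
    """Raised when a value does not fit in a 16-bit signed integer."""


def to_int16_checked(value: float) -> int:
    """Convert a finite float to a 16-bit signed integer, raising if out of range."""
    truncated = int(value)  # int() truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise ConversionOverflow(f"{value!r} does not fit in 16 bits")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Convert a finite float to a 16-bit signed integer, clamping at the limits."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def read_sensor_value(raw: float) -> Tuple[int, bool]:
    """Return (value, valid).

    Hypothetical caller: an overflow yields a clamped value flagged as
    invalid, so the exception never propagates and stops the whole unit.
    """
    try:
        return to_int16_checked(raw), True
    except ConversionOverflow:
        return to_int16_saturating(raw), False
```

With these definitions, `read_sensor_value(-5.5)` returns `(-5, True)` and `read_sensor_value(40000.0)` returns `(32767, False)`. Whether a clamped value is acceptable depends on how the consumer uses it; the point of the sketch is only that the out-of-range case is handled deliberately.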
Normalization of Deviance
The error might have been caught in time if engineers had used more proactive testing procedures. However, engineers took shortcuts during design testing in response to a series of across-the-board budget cuts[5]. The ESA specifies very rigorous software testing procedures designed to catch the sort of problem that brought down Flight 501, but they were not followed. Much as in the Challenger disaster, risky behavior slowly became the norm.
When engineers were testing the Ariane 5 software package, cost was a major factor in their decisions. Engineers had two options:
- Run a full simulated launch with all software tested simultaneously[9].
- Run several independent tests which test only software subsystems[9].
The team chose option two because it was significantly cheaper than option one. Because the software was tested independently, the fault which led to the failure of Flight 501 was not detected.
Cost-cutting prompted the use of risky testing procedures in many aspects of the Ariane 5’s software design. Analysis by the Inquiry Board after the incident revealed that major components of the software package contained bugs that had been overlooked[5]. Because software can be a single point of failure for which redundancy is difficult to implement, exercising due diligence through testing is especially important. Ariane 5 software engineers failed to exercise this due diligence.
Future
The European Inquiry Board investigated the disaster and made recommendations on what could have been done to prevent it. Of the 14 recommendations that the Inquiry Board produced, recommendations 12 to 14 in particular address flaws or failures of the process. These three recommendations show where the process broke down and how to prevent similar failures in the future.
- Recommendation 12: “Give the justification documents the same attention as code. Improve the technique for keeping code and its justifications consistent[10].”
- Recommendation 13: “Set up a team that will prepare the procedure for qualifying software, propose stringent rules for confirming such qualification, and ascertain that specification, verification and testing of software are of a consistently high quality in the Ariane 5 programme. Including external RAMS experts is to be considered[10].”
- Recommendation 14: “A more transparent organisation of the cooperation among the partners in the Ariane 5 programme must be considered. Close engineering cooperation, with clear cut authority and responsibility, is needed to achieve system coherence, with simple and clear interfaces between partners[10].”
Looking forward, these three recommendations will help promote a more cohesive, clear, and professional working environment to mitigate future confusion.
Generalizations
Lessons learned from the Ariane 5 Flight 501 disaster can be generalized to other cases. First, even the smallest details matter and can have enormous consequences. A conversion error involving a 64-bit floating-point number caused the explosion of a massive, multimillion-dollar spacecraft. In many systems, small details are just as important as large ones.
Second, besides testing for what a system should do, one should also test for what a system should not do. If engineers had tested how the software handled a failed floating-point conversion, the error that caused the explosion could have been caught. Testing for situations that should not occur can improve reliability[11].
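As a rough illustration of this second lesson, the sketch below pairs an ordinary positive test with boundary tests and a negative test. The function `encode_int16` is a hypothetical stand-in for any routine that packs a measurement into 16 bits; it is not taken from the Ariane software, and the test values are arbitrary.

```python
import unittest

INT16_MIN, INT16_MAX = -32768, 32767


def encode_int16(value: float) -> int:
    """Hypothetical function under test: reject values that do not fit in 16 bits."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} is outside the 16-bit signed range")
    return result


class EncodeInt16Tests(unittest.TestCase):
    def test_expected_behaviour(self):
        # Positive test: a value inside the range converts normally.
        self.assertEqual(encode_int16(1200.7), 1200)

    def test_boundaries(self):
        # The largest and smallest representable values must still be accepted.
        self.assertEqual(encode_int16(32767.0), INT16_MAX)
        self.assertEqual(encode_int16(-32768.0), INT16_MIN)

    def test_out_of_range_is_rejected(self):
        # Negative test: inputs that "should not occur" must fail in a known way.
        for bad in (32768.0, -32769.0, 1e9):
            with self.subTest(bad=bad):
                with self.assertRaises(OverflowError):
                    encode_int16(bad)
```

Saved to a file, the tests can be run with `python -m unittest`. The negative test is the one that asks what happens when an assumption inherited from an earlier design no longer holds.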
Third, authority and responsibilities should be obvious. No one was held responsible for the error that resulted in Flight 501's explosion. It is harder to find a problem and to avoid future problems when no one is held accountable for a failure. When responsibilities among teammates are clear-cut, problems are easier to find, and communication and cooperation improve.
Fourth, do not design a system in which the failure of a single component can cause the entire system to fail, that is, a system with a single point of failure. On Flight 501, a software error caused not only the failure of the SRI, a software system, but also the explosion of the entire spacecraft. A system in which one failure does not break the entire system is more reliable and safer.
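A toy model can show why duplicating a component does not by itself remove a single point of failure. In the sketch below, which is an assumption-laden illustration and not the actual SRI architecture, each "unit" is just a conversion function, and `first_working_output` is a hypothetical failover routine. Two identical units fail on the same input, while a differently designed back-up still produces an output.

```python
from typing import Callable, List, Optional

INT16_MIN, INT16_MAX = -32768, 32767


def convert_strict(value: float) -> int:
    """Conversion routine that raises on overflow, taking its unit out of service."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError("value does not fit in 16 bits")
    return result


def convert_clamped(value: float) -> int:
    """A differently designed routine that clamps instead of raising."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def first_working_output(
    units: List[Callable[[float], int]], value: float
) -> Optional[int]:
    """Try each redundant unit in order; return the first result, or None if all fail."""
    for unit in units:
        try:
            return unit(value)
        except OverflowError:
            continue  # this unit is lost; fail over to the next one
    return None


identical_units = [convert_strict, convert_strict]  # same design, same fault
diverse_units = [convert_strict, convert_clamped]   # different designs

print(first_working_output(identical_units, 100.0))    # 100
print(first_working_output(identical_units, 50000.0))  # None: both units fail alike
print(first_working_output(diverse_units, 50000.0))    # 32767
```

The clamped value is not necessarily correct; the sketch only shows that redundancy protects against independent faults, whereas a shared design fault defeats every copy at once.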
References
- de Dalmau, J. & Gigou, J. (1997). Ariane-5: Learning from Flight 501 and Preparing for 502. European Space Agency. http://www.esa.int/esapub/bulletin/bullet89/dalma89.htm
- Cluster (spacecraft). (2015). In Wikipedia. https://en.wikipedia.org/wiki/Cluster_(spacecraft)
- Gleick, J. (1996). Little Bug, Big Bang. The New York Times. http://www.nytimes.com/1996/12/01/magazine/little-bug-big-bang.html
- Lions, J.L. (1996). ARIANE 5 Failure-Full Report. European Space Agency. https://www.ima.umn.edu/~arnold/disasters/ariane5rep.html
- Lacan, P., Monfort, J. N., Ribal, L. V. Q., Deutsch, A., & Gonthier, G. (1998). ARIANE 5 - The Software Reliability Verification Process. European Space Agency. http://adsabs.harvard.edu/full/1998ESASP.422..201L
- Georgiadou, E. & George, C. (2006). Information Systems Failures: Can we make professionals more responsible?. Software Quality Management XIV. http://www.eis.mdx.ac.uk/staffpages/cgeorge/sqm_2006_paper.pdf
- Ariane 4. http://www.cnes.fr/web/CNES-en/1378-ariane-4-a-challenge-for-europes-space-industry.php
- Nuseibeh, B. (1997). Ariane 5: Who Dunnit? IEEE Xplore. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=589224
- The Inquiry Board's Recommendations. http://www.esa.int/esapub/bulletin/bullet89/recom89.htm
- Sommerville, I. (2014). Ariane Launch failure. YouTube. https://www.youtube.com/watch?v=W3YJeoYgozw