Embedded Control Systems Design/Failure modes and prevention

A failure mode is the manner in which a system fails, or the manner in which a failure is observed. It is therefore not the same as the cause of the failure; it describes the way a failure occurs. There are three kinds of failure modes: conceptual, technological and organizational. This text deals with technological failure modes only, and concentrates on embedded control systems. This chapter is very relevant for the embedded systems designer because such systems often work without human supervision and in places where human correction of a failure is expensive to carry out. Therefore, designers should pay extra attention to what could go wrong in their system (i.e., identify its failure modes) and assess the risk of each failure (i.e., analyse the consequences of each failure mode). It is far better to avoid failures than to repair them, and simple designs are easier in this respect than complex ones; however, making simple designs is still a form of engineering art, and not yet a structured engineering discipline. Also keep in mind that many failures will not be detected by testing.

Introduction

Technological failure modes in embedded systems can be divided into two main groups: hardware failure modes and software failure modes. The toughest failures to prevent, however, are those caused by subtle interactions between hardware and software.

Some examples of software failure modes are:

Buffer overflow: a program writes more data into a buffer than the buffer can hold, for example because the computer memory is smaller than the programmer expected, so during operation of the embedded system one of the programs accesses the wrong parts of the computer's memory.
Dangling pointers: a pointer keeps referring to a memory location that is no longer valid. This error is common in non-safe programming languages, in which the human programmer is responsible for making sure that every pointer points to the right memory location at all times.
Resource leaks [1]: programming errors lead to the loss of computer control over some of the hardware resources; memory leaks are the simplest form of resource leak.
Race conditions: the specific relative timing of events in different components of the system leads to unexpected behaviour. Such race conditions are often hard to detect by testing alone.
Semantic design errors: for example, the meaning of an arrow between two subsystems in a visual software environment should be the same as its interpretation by the hardware; a mismatch leads to failures.

Some examples of hardware failure modes:

Electrical failure: short-circuiting, too high voltage/current
Mechanical failure: jamming of a valve
Temperature effects: deformation of components
Material failure: corrosion

It is important to note again that these examples are only consequences and not causes!

Some examples of software failure causes are:

Deadlock: two or more processes are each waiting for the other to finish, so none of the processes ever finish.
Resource starvation: a process doesn't get the resources it needs, so it can never finish.
Insufficient memory
Noise
Shared interfaces with other systems
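
The deadlock cause in the list above can be made concrete with a small sketch. It is only an illustration, not a method taken from this chapter: it assumes two threads that each need the same two locks, and it uses one common remedy, namely always acquiring the locks in a single global order. The names `transfer` and `run_demo` are hypothetical.

```python
import threading

def transfer(src_lock, dst_lock, counter, repeats):
    """Do some work that needs both locks.

    The locks are always taken in one global order (sorted by id), no
    matter in which order the caller names them. Without this ordering,
    two threads could each hold one lock and wait forever for the other.
    """
    first, second = sorted((src_lock, dst_lock), key=id)
    for _ in range(repeats):
        with first:
            with second:
                counter["done"] += 1  # protected by both locks

def run_demo(repeats=1000):
    lock_a, lock_b = threading.Lock(), threading.Lock()
    counter = {"done": 0}
    # The two threads name the locks in opposite order on purpose.
    t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, counter, repeats))
    t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, counter, repeats))
    t1.start()
    t2.start()
    t1.join(timeout=5)
    t2.join(timeout=5)
    still_blocked = t1.is_alive() or t2.is_alive()
    return counter["done"], still_blocked
```

Calling `run_demo()` returns the number of completed work items and a flag that would be true if a thread were still blocked after the timeout. A timeout on `join` is itself a simple way to observe a deadlock instead of hanging forever.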

Some examples of hardware failure causes:

Hostile environments: any factor which prevents a system from functioning correctly.
Badly calibrated sensors
Choosing the wrong dimensions
Manufacturing/assembly process deficiencies

Some failures are not caused by hardware or software, but arise at the system level.

An example of a system failure cause:

Operational failure: human operators make mistakes too. The Three Mile Island accident is a fine example of an operational failure.

Overview

Due to the increasing capabilities and functionality of embedded systems, it is difficult to prevent, or sometimes even to detect, failure modes. One way to ensure reliability is extensive testing using techniques such as probabilistic reliability modeling. One of the problems with these techniques is that they are only used in the late stages of development. It is better to design quality and reliability in from the early stages of development.

To detect failures during the design process, it is important to perform different tests on the system (especially on the software) from the beginning of the design. But tests are often expensive, and they must also provide the correct information: the usefulness of test results depends on the quality of the test. So it is not always easy to come up with an appropriate test.

Dynamic analysis in the software world is the testing and evaluation of software by executing programs on a processor. An example of a dynamic analysis on hardware could be vibration and stress analysis.

Engineers have also developed static analysis for software, which is test-free: no specific tests need to be developed, and the software can be checked for flaws without executing the program.

There are a number of ways to reduce the chance that failures occur, but some failures need to be treated more urgently than others. First, one should look at the frequency with which a system fails; this is called the failure rate of the system. Ideally systems never fail, but if a failure is very rare it is often not necessary to take steps.

Another aspect of a failure mode is its severity. An electrical appliance that short-circuits can be life-threatening, whereas the jamming of a valve in a vending machine is far less dangerous.

Despite all the effort an engineer can put into designing a system that doesn't fail, failures will always occur. For example, an average cell phone these days contains as many as 2 million lines of software code, and it is very likely that a fault has been introduced in at least one of those lines. Systems are becoming even more complex: that same cell phone is expected to have as many as 10 million lines of code in 10 years. Therefore, a design should be robust.[2] When the system detects that something has gone wrong, it can signal this and go into a safe mode until the user takes appropriate action. Take again the jamming of a valve in the vending machine: the machine can light all its LEDs to signal that something is wrong and cease providing soda until it is repaired.
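
The safe-mode behaviour of the vending machine can be sketched as a very small state machine. This is a minimal illustration under stated assumptions, not an actual controller: the class `VendingMachine`, its method names, and the idea that some sensor check calls `report_valve_jam` are all hypothetical.

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"

class VendingMachine:
    """Toy model: on a detected fault, signal it and refuse to dispense."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.leds_on = False

    def report_valve_jam(self):
        # Assumed to be called by a (hypothetical) valve sensor check.
        self.mode = Mode.SAFE
        self.leds_on = True  # light all LEDs to signal the fault

    def dispense(self):
        # In safe mode the machine stops providing soda.
        return self.mode is Mode.NORMAL

    def repair(self):
        # Only an explicit repair action returns the machine to service.
        self.mode = Mode.NORMAL
        self.leds_on = False
```

The design choice worth noting is that the machine never leaves safe mode on its own: only the explicit `repair` call restores normal operation, which mirrors "until the user takes appropriate action".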

Failures are also to be expected when separate systems have to work together, for instance the different robots in RoboCup. Another example of such a complex system is the group of robots of professor James McLurkin of MIT, which have to perform the Star Wars theme tune together although every robot can only play some of the notes. They therefore have to cooperate in order to play the entire theme correctly.[3]

The cost of designing embedded control systems tends to increase exponentially with increasing reliability: removing X% of the faults in a system will not necessarily improve the reliability by X% (a study at IBM found that removing 60% of the errors improved reliability by only 3%). In some cases, it may therefore be more cost-effective not to invest in more trustworthy systems but to pay the failure costs. However, one has to keep in mind that this strategy can give the company a bad name for selling systems that cannot be trusted. There is always a trade-off between different design criteria, depending on the system. Of course, critical systems should always be designed to be as reliable as possible.

All of this stresses how important it is to rule out failures during the design process. Fortunately, engineers have developed some procedures to do this systematically. All the following procedures can be used in what is called safety engineering. The study of failures is an important aspect of designing embedded control systems, as it can save time, money and even lives, and helps with future modifications of a system.

There are many examples of disasters in real-time systems that were due to incomplete failure testing.

Failure prevention

Safety factors

Safety factors are used to ensure that a design will work, and to protect it against failures. However, large safety factors don't always give rise to a reliable system. Often they lead to overdesigned systems, which are more expensive and can take a longer time to manufacture/assemble.

Failure mode and effects analysis

In order to reduce (or, better, prevent) the chance of failure of a system, engineers have developed a technique called “Failure Mode and Effects Analysis” (FMEA). This is a tool for identifying potential or actual failure modes in a system and for choosing the proper corrective action during design. FMEA provides an analytical approach for determining which risks are of greatest concern, and therefore where action is needed to prevent a problem before it arises. Developing the resulting specifications helps ensure that the system will meet the defined requirements.

It is also possible to identify critical or important design/process characteristics that require special controls to prevent or detect failure modes. A crucial step is anticipating what might go wrong with a product. While anticipating every failure mode is not possible, a development team should formulate a list of potential failure modes that is as complete as possible. FMEA starts at the beginning of a design, and is maintained and adapted throughout the entire design process. This way it is possible to design out failures, and the FMEA also contains important information for use in future system improvements.

Using FMEA when designing

The process for conducting an FMEA is straightforward. It is carried out in 3 main steps, in each of which appropriate actions need to be undertaken. But before starting an FMEA, it is important to do some pre-work to make sure that robustness and past history are included in the analysis. It is important to consider both intentional and unintentional uses! Unintentional uses are a form of hostile environment.

Step 1: Determine the severity number

Determine all failure modes based on the functional requirements, and their effects as perceived by the user. Each effect is given a severity number (SEV) from 1 (no danger) to 10 (important). If the severity of an effect is 9 or 10, actions are considered to change the design by eliminating the failure mode, if possible, or by protecting the user from the effect.

Step 2: Determine the probability number

In this step it is necessary to look at the cause of a failure and how often it occurs. A failure mode is given a probability number (OCCUR), again from 1 to 10. Actions need to be determined if the occurrence is high (meaning >4 for non-safety failure modes, and >1 when the severity number from step 1 is 9 or 10).
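
The threshold rule for the occurrence number can be written down as a short function. This is a minimal sketch of the rule exactly as stated above; the function name `needs_occurrence_action` is hypothetical, and treating "safety failure mode" as "SEV of 9 or 10" is an assumption read from the text.

```python
def needs_occurrence_action(sev, occur):
    """Return True if the occurrence number calls for action.

    Rule from the text: act when OCCUR > 4 for non-safety failure modes,
    and when OCCUR > 1 if the severity number is 9 or 10.
    """
    if not (1 <= sev <= 10 and 1 <= occur <= 10):
        raise ValueError("SEV and OCCUR must both lie between 1 and 10")
    threshold = 1 if sev >= 9 else 4
    return occur > threshold

print(needs_occurrence_action(sev=5, occur=4))   # False: not above 4
print(needs_occurrence_action(sev=5, occur=5))   # True
print(needs_occurrence_action(sev=10, occur=2))  # True: safety-related, above 1
```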

TIP: The probability number can also be used after implementation in a table of failures, to give the operator a first guess at the origin of frequently occurring problems.

Step 3: Determine the detection number

When appropriate actions have been determined, it is necessary to test their effectiveness. A design verification is also needed, and the proper inspection methods need to be chosen. Each combination from the previous 2 steps receives a detection number (DETEC). This number represents the ability of the planned tests and inspections to remove defects or detect failure modes.

After these 3 basic steps, Risk Priority Numbers (RPN) are calculated.

Risk Priority Numbers

RPNs do not play an important part in the choice of an action against failure modes. They serve rather as threshold values in the evaluation of these actions.

$RPN = SEV \times OCCUR \times DETEC$

The failure modes that have the highest RPN should be given the highest priority for corrective action.
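
As a worked illustration of the formula, the sketch below computes the RPN for a few failure modes and ranks them. All failure modes and their numbers are invented for the example, and the names `FailureMode` and `rank_by_rpn` are hypothetical. The sketch also assumes the common convention that a higher DETEC value means the failure is harder to detect, so that a higher RPN always means a higher risk.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    sev: int    # severity number, 1..10
    occur: int  # probability (occurrence) number, 1..10
    detec: int  # detection number, 1..10 (assumed: higher = harder to detect)

    @property
    def rpn(self):
        # RPN = SEV * OCCUR * DETEC
        return self.sev * self.occur * self.detec

def rank_by_rpn(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

# Invented example data.
modes = [
    FailureMode("valve jams", sev=4, occur=6, detec=3),
    FailureMode("short circuit", sev=10, occur=2, detec=5),
    FailureMode("sensor drift", sev=6, occur=5, detec=8),
]

for m in rank_by_rpn(modes):
    print(m.name, m.rpn)
# sensor drift 240
# short circuit 100
# valve jams 72
```

Note that the short circuit, although it has the highest severity, does not come out on top. This is one reason why the severity and occurrence thresholds of steps 1 and 2 are checked separately from the RPN.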

After these values have been allocated, recommended actions with targets, responsibilities and dates of implementation are noted. Once the actions have been implemented in the design/process, the new RPN should be checked to confirm the improvements. Whenever a design or a process changes, the FMEA should be updated.

A few logical but important thoughts come to mind:

Try to eliminate the failure mode (some failures are more preventable than others)
Minimize the severity of the failure
Reduce the occurrence of the failure mode
Improve the detection (!)

Anticipatory Failure Determination

Like FMEA, anticipatory failure determination (AFD) has the objective of identifying and preventing possible failures.[4] The approach of AFD, however, is the inverse of that of FMEA. Rather than searching for causes of failure modes, AFD asks developers to view the failure of interest as an intended consequence and to look for ways to make this failure happen reliably. In this way, AFD is a complementary method to FMEA: using FMEA, the failures with the highest Risk Priority Number can be found, and these can afterward be redesigned using the AFD method.

AFD is better suited to complex failure analysis than FMEA. FMEA relies on the identification of failures and their causes based on application experience or the personal experience of others. The problem with this approach, however, is “the denial phenomenon”. If one tries to consider what can go wrong with a functioning system, there is a tendency to resist thinking about unpleasant possibilities unless they have actually been experienced before. Sometimes people even deny a problem because they do not want to admit having made mistakes in the design. By reversing the problem, AFD overcomes this “denial phenomenon” and opens up creative insights into the analysis of failures.

AFD-process

Step 1: Formulation or inversion of the problem

Instead of thinking about possible causes of a failure, an engineer should think about how to make that failure happen, under the same environmental conditions in which the failure occurred. First, these conditions need to be identified. After that, one should think about the scenario that gives rise to the failure and try to localize it.

Step 2: Search for solutions or methods to produce the failure

The thought process is now shifted to finding the mechanism or means to produce the examined failure. Function analysis can be useful to identify a series of functions or actions involved in the failure scenario.

Step 3: Verify that resources are available to cause the failures

There are seven potential categories of resources: substances, field effects, space available, time, object structure, system functions and other data on the system. For each of the potential solutions to cause a failure, it is necessary to check if the required resources are available to support this solution.
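
The resource check of step 3 amounts to asking, for each candidate way of producing the failure, whether everything it needs is present in the system. A minimal sketch using sets follows; the scenarios, the resource names and the function `feasible_scenarios` are all invented for illustration.

```python
# Hypothetical ways to produce a "valve stuck" failure,
# each with the resources it would need.
SCENARIOS = {
    "valve corrodes shut": {"water", "time"},
    "valve freezes": {"water", "sub-zero temperature"},
}

def feasible_scenarios(scenarios, available):
    """Return the scenarios whose required resources are all available."""
    return sorted(
        name for name, needed in scenarios.items() if needed <= available
    )

# Resources assumed to be present in the system under study.
available = {"water", "time", "vibration"}
print(feasible_scenarios(SCENARIOS, available))
# ['valve corrodes shut']
```

Only the scenarios that survive this check are realistic failure mechanisms in the given system, so those are the ones worth designing against.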

Fault tree analysis

Fault tree analysis (FTA) is a third form of failure analysis, in which an undesired state of a system is analyzed using Boolean logic to combine a series of lower-level events. In the design process FTA takes place after implementation, i.e. at the testing level, but if all fault trees are collected and maintained well they can also be a useful aid at the operating level if failures nevertheless occur. This analysis method is mainly used in the field of safety engineering.

FTA involves 5 steps:

Step 1: Define the undesired event to study

The undesired event can be very hard to define, although some events are easy and obvious to observe. An engineer with wide knowledge of the design of the system, or a system analyst with an engineering background, is the best person to help define and number the undesired events. Each undesired event is then used to make one fault tree: one event per FTA.

Step 2: Obtain an understanding of the system

Once the undesired event is selected, all causes with a probability of affecting the undesired event are studied and analyzed. Getting exact numbers for the probabilities leading to the event is usually impossible, because doing so may be very costly and time-consuming.
System designers have full knowledge of the system, and this knowledge is very important for not missing any cause affecting the undesired event. For the selected event, all causes are then numbered and sequenced in order of occurrence.

Step 3: Construct the fault tree

After selecting the undesired event and analyzing the system, so that all the causes and, if possible, their probabilities are known, the fault tree can be constructed. A fault tree is based on AND and OR gates, which define its major characteristics.
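
A fault tree built from AND and OR gates can be represented and evaluated in a few lines. The sketch below is only an illustration: the tree, the event names and the probabilities are invented, and the helpers `AND`, `OR`, `occurs` and `probability` are hypothetical. The probability calculation additionally assumes that all basic events are independent and that each appears only once in the tree.

```python
def AND(*children):
    return ("AND", children)

def OR(*children):
    return ("OR", children)

def occurs(node, happened):
    """True if the event at `node` occurs, given the set of basic events
    that have happened. Leaves are basic-event names (strings)."""
    if isinstance(node, str):
        return node in happened
    gate, children = node
    results = [occurs(child, happened) for child in children]
    return all(results) if gate == "AND" else any(results)

def probability(node, p):
    """Probability of the event at `node`, given basic-event probabilities.
    Assumes independent basic events, each used once in the tree."""
    if isinstance(node, str):
        return p[node]
    gate, children = node
    probs = [probability(child, p) for child in children]
    if gate == "AND":
        result = 1.0
        for q in probs:
            result *= q            # all children must occur
        return result
    none_occur = 1.0
    for q in probs:
        none_occur *= (1.0 - q)    # OR: complement of "no child occurs"
    return 1.0 - none_occur

# Invented example: no soda is dispensed if the power fails, OR if the
# valve jams AND the jam goes unnoticed because the jam sensor failed.
TOP = OR("power failure", AND("valve jammed", "jam sensor failed"))
P = {"power failure": 0.01, "valve jammed": 0.1, "jam sensor failed": 0.05}

print(occurs(TOP, {"valve jammed"}))                       # False
print(occurs(TOP, {"valve jammed", "jam sensor failed"}))  # True
print(probability(TOP, P))
```

Because each gate only combines the results of its children, subtrees for separate subsystems can be built and checked on their own and then joined under a higher-level gate.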

Step 4: Evaluate the fault tree

Analyze the tree for any possible improvement; in other words, study the risk management and find ways to improve the system. In short, in this step all possible hazards that affect the system directly or indirectly are identified.

Step 5: Control the hazards identified

This step is very specific and differs greatly from one system to another, but the main point is always that, after the hazards have been identified, all possible methods are pursued to decrease their probability of occurrence.

Because assembling an FTA can be a costly and cumbersome experience, a good approach is to consider subsystems first and integrate them afterwards.