Chapter 2- Reliability and Fault Tolerance.ppt

Real-Time Systems and Programming Languages
Alan Burns and Andy Wellings
Chapter 2: Reliability and Fault Tolerance

Aims
- To understand the factors which affect the reliability of a system and introduce how software design faults can be tolerated
- To introduce safety and dependability
- Reliability, failure and faults
- Failure modes
- Fault prevention and fault tolerance
- N-Version programming
- Dynamic redundancy

Scope
Four sources of faults which can result in system failure:
- Inadequate specification (not covered)
- Design errors in software (covered now)
- Processor failure (not covered)
- Interference on the communication subsystem (not covered)

Safety and Reliability
- Safety: freedom from those conditions that can cause death, injury, occupational illness, damage to (or loss of) equipment (or property), or environmental harm
  - By this definition, most systems which have an element of risk associated with their use are unsafe
- Reliability: a measure of the success with which a system conforms to some authoritative specification of its behaviour
- Safety is the probability that conditions that can lead to mishaps do not occur, whether or not the intended function is performed

Safety
- E.g., measures which increase the likelihood of a weapon firing when required may well increase the possibility of its accidental detonation
- In many ways, the only safe airplane is one that never takes off; however, it is not very reliable
- As with reliability, to ensure the safety requirements of an embedded system, system safety analysis must be performed throughout all stages of its life cycle development

Aspects of Dependability
[Figure: attributes of dependability; only the first survives extraction: available (readiness for usage)]

Dependability Terminology
[Figure: dependability terminology; contents not recoverable]

Reliability, Failure and Faults
- The reliability of a system is a measure of the success with which it conforms to an authoritative specification of its behaviour
- When the behaviour of a system deviates from that which is specified for it, this is called a failure
- Failures result from unexpected problems internal to the system that eventually manifest themselves in the system's external behaviour
- These problems are called errors, and their mechanical or algorithmic causes are termed faults
- Systems are composed of components which are themselves systems: hence failure -> fault -> error -> failure -> fault

Fault Types
- A transient fault starts at a particular time, remains in the system for some period and then disappears
  - E.g. hardware components which have an adverse reaction to radioactivity
  - Many faults in communication systems are transient
- Permanent faults remain in the system until they are repaired; e.g., a broken wire or a software design error
- Intermittent faults are transient faults that occur from time to time
  - E.g. a hardware component that is heat sensitive: it works for a time, stops working, cools down and then starts to work again

Software Faults
- Called bugs
  - Bohrbugs: reproducible, identifiable
  - Heisenbugs: only active under rare conditions, e.g. race conditions
- Software doesn't deteriorate with age: it is either correct or incorrect, but
  - Faults can remain dormant for long periods
  - Usually related to resource usage, e.g. memory leaks

Failure Modes
[Figure: failure mode classification tree. Labels, in extraction order:]
- Failure mode
- Value domain
- Timing domain
- Arbitrary (fail uncontrolled)
- Constraint error
- Value error
- Early
- Omission
- Late
- Fail silent
- Fail stop
- Fail controlled

Approaches to Achieving Reliable Systems
- Fault prevention attempts to eliminate any possibility of faults creeping into a system before it goes operational
- Fault tolerance enables a system to continue functioning even in the presence of faults
- Both approaches attempt to produce systems which have well-defined failure modes

Fault Prevention
- Two stages: fault avoidance and fault removal
- Fault avoidance attempts to limit the introduction of faults during system construction by:
  - use of the most reliable components within the given cost and performance constraints
  - use of thoroughly refined techniques for interconnection of components and assembly of subsystems
  - packaging the hardware to screen out expected forms of interference
  - rigorous, if not formal, specification of requirements
  - use of proven design methodologies
  - use of languages with facilities for data abstraction and modularity
  - use of software engineering environments to help manipulate software components and thereby manage complexity

Fault Removal
- Design errors (hardware and software) will exist
- Fault removal: procedures for finding and removing the causes of errors; e.g. design reviews, program verification, code inspections and system testing
- System testing can never be exhaustive and remove all potential faults
  - A test can only be used to show the presence of faults, not their absence
  - It is sometimes impossible to test under realistic conditions
  - Most tests are done with the system in simulation mode, and it is difficult to guarantee that the simulation is accurate
  - Requirements errors during the system's development may not manifest themselves until the system goes operational

Failure of Fault Prevention Approach
- In spite of all the testing and verification techniques, hardware components will fail; the fault prevention approach will therefore be unsuccessful when
  - either the frequency or the duration of repair times is unacceptable, or
  - the system is inaccessible for maintenance and repair activities
- An extreme example of the latter is the crewless spacecraft Voyager (currently 10 billion miles from the sun!)
- The alternative is fault tolerance

Levels of Fault Tolerance
- Full fault tolerance: the system continues to operate in the presence of faults, albeit for a limited period, with no significant loss of functionality or performance
- Graceful degradation (fail soft): the system continues to operate in the presence of errors, accepting a partial degradation of functionality or performance during recovery or repair
- Fail safe: the system maintains its integrity while accepting a temporary halt in its operation
- The level required will depend on the application
- Most safety-critical systems require full fault tolerance; in practice, however, many settle for graceful degradation

Graceful Degradation in an ATC System
[Figure: levels of degraded service; only the top level survives extraction: full functionality within required response times]

Redundancy
- All fault-tolerant techniques rely on extra elements introduced into the system to detect and recover from faults
- Components are redundant as they are not required in a perfect system
- Often called protective redundancy
- Aim: minimise redundancy while maximising reliability, subject to the cost and size constraints of the system
- Warning: the added components inevitably increase the complexity of the overall system
  - This itself can lead to less reliable systems
  - E.g., first launch of the space shuttle
- It is advisable to separate out the fault-tolerant components from the rest of the system

Hardware Fault Tolerance
- Two types: static (or masking) and dynamic redundancy
- Static: redundant components are used inside a system to hide the effects of faults; e.g. Triple Modular Redundancy
  - TMR: 3 identical subcomponents and majority voting circuits; the outputs are compared and if one differs from the other two, that output is masked out
  - Assumes the fault is not common (such as a design error) but is either transient or due to component deterioration
  - To mask faults from more than one component requires NMR
- Dynamic: redundancy supplied inside a component which indicates that the output is in error; provides an error detection facility; recovery must be provided by another component
  - E.g. communications checksums and memory parity bits

Software Fault Tolerance
- Used for detecting design errors
- Static: N-Version programming
- Dynamic: detection and recovery
  - Recovery blocks: backward error recovery
  - Exceptions: forward error recovery

N-Version Programming
- Design diversity
- The independent generation of N (N >= 2) functionally equivalent programs from the same initial specification
- No interactions between groups
- The programs execute concurrently with the same inputs and their results are compared by a driver process
- The results (votes) should be identical; if they differ, the consensus result, assuming there is one, is taken to be correct

N-Version Programming
[Figure: a driver process exchanging vote and status messages with each version]

Vote Comparison
- To what extent can votes be compared?
- Text or integer arithmetic will produce identical results
- Real numbers may produce different values
- Need inexact voting techniques

Consistent Comparison Problem
[Figure: versions V1, V2 and V3 each compare a reading (T1, T2, T3) against a threshold Tth and a reading (P1, P2, P3) against a threshold Pth; the yes/no outcomes differ between versions]
- Each version will produce a different but correct result
- Even if inexact comparison techniques are used, the problem occurs

N-version programming depends on
- Initial specification
  - The majority of software faults stem from inadequate specification
  - This will manifest itself in all N versions of the implementation
- Independence of effort
  - Experiments produce conflicting results
  - Complex parts of a specification lead to a lack of understanding of the requirements
  - If these also refer to rarely occurring input data, common design errors may not be caught during system testing
- Adequate budget
  - The predominant cost is software. A 3-version system will triple the budget requirement and cause problems of maintenance
  - Would a more reliable system be produced if the resources potentially available for constructing N versions were instead used to produce a single version?
- Military versus civil avionics industry

Software Dynamic Redundancy
Alternative to static redundancy: four phases
- Error detection: no fault tolerance scheme can be utilised until the associated error is detected
- Damage confinement and assessment: to what extent has the system been corrupted? The delay between a fault occurring and the detection of the error means erroneous information could have spread throughout the system
- Error recovery: techniques should aim to transform the corrupted system into a state from which it can continue its normal operation (perhaps with degraded functionality)
- Fault treatment and continued service: an error is a symptom of a fault; although the damage is repaired, the fault may still exist

Error Detection
- Environmental detection
  - hardware, e.g. illegal instruction
  - O.S/RTS, e.g. null pointer
- Application detection
  - Replication checks
  - Timing checks (e.g., watch dog)
  - Reversal checks
  - Coding checks (redundant data, e.g. checksums)
  - Reasonableness checks (e.g. assertion)
  - Structural checks (e.g. redundant pointers in linked list)
  - Dynamic reasonableness check

Damage Confinement and Assessment
- Damage assessment is closely related to the damage confinement techniques used
- Damage confinement is concerned with structuring the system so as to minimise the damage caused by a faulty component (also known as firewalling)
- Modular decomposition provides static damage confinement; it allows data to flow through well-defined pathways (assuming a strongly typed language)
- Atomic actions provide dynamic damage confinement; they are used to move the system from one consistent state to another

Error Recovery
- Probably the most important phase of any fault-tolerance technique
- Two approaches: forward and backward
- Forward error recovery continues from an erroneous state by making selective corrections to the system state
- This includes making safe the controlled environment, which may be hazardous or damaged because of the failure
- It is system specific and depends on accurate predictions of the location and cause of errors (i.e., damage assessment)
- Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming Codes

Backward Error Recovery (BER)
- BER relies on restoring the system to a previous safe state and executing an alternative section of the program
- This has the same functionality but uses a different algorithm (c.f. N-Version Programming) and therefore no fault
- The point to which a process is restored is called a recovery point, and the act of establishing it is termed checkpointing (saving appropriate system state)
- Advantage: the erroneous state is cleared and it does not rely on finding the location or cause of the fault
- BER can, therefore, be used to recover from unanticipated faults including design errors
- Disadvantage: it cannot undo errors in the environment!

The Domino Effect
- With concurrent processes that interact with each other, BER is more complex. Consider:
[Figure: processes P1 and P2 plotted against execution time, with recovery points R11, R12, R13 and R21, R22, inter-process communications IPC1 to IPC4, and the error time T_error]
- If the error is detected in P1: rollback to R13
- If the error is detected in P2: ?

Fault Treatment and Continued Service
- Error recovery returned the system to an error-free state; however, the error may recur; the final phase of F.T. is to eradicate the fault from the system
- The automatic treatment of faults is difficult and system specific
- Some systems assume all faults are transient; others that error recovery techniques can cope with recurring faults
- Fault treatment can be divided into 2 stages: fault location and system repair
- Error detection techniques can help to trace the fault to a component. For hardware, the component can be replaced
- A software fault can be removed in a new version of the code
- In non-stop applications it will be necessary to modify the program while it is executing!

The Recovery Block approach to FT
- Language support for BER
- At the entrance to a block is an automatic recovery point and at the exit an acceptance test
- The acceptance test is used to test that the system is in an acceptable state after the block's execution (primary module)
- If the acceptance test fails, the program is restored to the recovery point at the beginning of the block and an alternative module is executed
- If the alternative module also fails the acceptance test, the program is restored to the recovery point and yet another module is executed, and so on
- If all modules fail then the block fails and recovery must take place at a higher level

Recovery Block Syntax
- Recovery blocks can be nested
- If all alternatives in a nested recovery block fail the acceptance test, the outer level recovery point will be restored and an alternative module to that block executed

    ensure <acceptance test>
    by <primary module>
    else by <alternative module>
    else by <alternative module>
    ...
    else by <alternative module>
    else error

Example: Solution to Differential Equation
- Explicit Kutta Method: fast but inaccurate when equations are stiff
- Implicit Kutta Method: more expensive but can deal with stiff equations
- The above will cope with all equations
- It will also potentially tolerate design errors in the Explicit Kutta Method if the acceptance test is flexible enough

    ensure Rounding_err_has_acceptable_tolerance
    by Explicit Kutta Method
    else by Implicit Kutta Method
    else error

The Acceptance Test
- The acceptance test provides the error detection mechanism; it is crucial to the efficacy of the RB scheme
- There is a trade-off between providing comprehensive acceptance tests and keeping overhead to a minimum, so that fault-free execution is not affected
- Note that the term used is acceptance, not correctness; this allows a component to provide a degraded service
- All the previously discussed error detection techniques can be used to form the acceptance tests
- Care must be taken, as a faulty acceptance test may lead to residual errors going undetected
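
Recovery Block Sketch
The recovery block scheme described above can be approximated in an ordinary language. The following is a minimal Python sketch, not the slides' own notation and not a definitive implementation. It assumes the block's state is a dictionary that modules update in place, and that a module crash counts as a failed attempt. The names recovery_block, RecoveryBlockError, faulty_sort, simple_sort and acceptable are all hypothetical, introduced only for illustration.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every module fails the acceptance test."""


def recovery_block(state, acceptance_test, modules):
    """Try each module in turn until the acceptance test passes.

    state: dict that the modules update in place (an assumption of this sketch)
    acceptance_test: callable(state) -> bool
    modules: the primary module first, then the alternatives
    """
    checkpoint = copy.deepcopy(state)  # recovery point at block entry
    for module in modules:
        try:
            module(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # a crash is treated like a failed acceptance test
        # Backward error recovery: restore the saved recovery point.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockError("all modules failed the acceptance test")


def faulty_sort(state):
    # Hypothetical primary module with a design fault: it drops duplicates.
    state["data"] = sorted(set(state["data"]))


def simple_sort(state):
    # Hypothetical alternative module using a different approach.
    state["data"] = sorted(state["data"])


def acceptable(state):
    # Acceptance, not correctness: checks ordering and length only.
    data = state["data"]
    ordered = all(a <= b for a, b in zip(data, data[1:]))
    return ordered and len(data) == state["expected_len"]


state = {"data": [3, 1, 3, 2], "expected_len": 4}
recovery_block(state, acceptable, [faulty_sort, simple_sort])
print(state["data"])  # [1, 2, 3, 3]
```

Here the primary module loses an element, the acceptance test rejects its result, the state is rolled back to the recovery point, and the alternative module succeeds. If every module fails, the exception plays the role of "else error" and passes recovery to a higher level. As the slides warn, this kind of rollback restores program state only; it cannot undo effects on the environment.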
