1、Best Practices Entry: Best Practice Info:a71 Committee Approval Date: 2000-04-20a71 Center Point of Contact: JSCa71 Submitted by: Wil HarkinsSubject: Fault-Detection, Fault-Isolation and Recovery (FDIR) Techniques Practice: Apply techniques such as Built in Test (BIT), strategic placing of sensors,
2、centralized architecture, and fault isolation and recovery to optimize system availability.Programs that Certify Usage: This practice has been used on International Space Station Program.Center to Contact for Information: JSCImplementation Method: This Lesson Learned is based on Maintainability Tech
3、nique number DFE-7 from NASA Technical Memorandum 4628, Recommended Techniques for Effective Maintainability.Benefit:The main goal of fault detection and isolation is to effectively detect faults and accurately isolate them to a failed component in the shortest time possible. This capability leads t
4、o reduction in diagnostic time or downtime in general and, therefore, increased system availability. A good inherent diagnostic of a system also enhances the crewmembers confidence in operating the system, the main driver of mission success. Effective FDIR can keep a difficult to maintain system up
5、and running where normal methods would lead to system downtime. FDIR is especially beneficial to an on-orbit system where maintenance may be impossible.Provided by IHSNot for ResaleNo reproduction or networking permitted without license from IHS-,-,-Implementation Method:The growth of electronic tec
6、hnology challenges the use of electronic systems in several respects. One of these is the complexity of testing the systems to determine functional status and to permit efficient fault detection and fault isolation. The term “diagnostic capabilities“ refers to the abilities of a system to detect a f
7、ailure and to isolate it to a failed maintainable unit. In the past, diagnostics were considered only as a design afterthought and, as a result, many programs are faced with higher mean time to repair (MTTR) and higher work-hour and false alarm rates. This reduces system availability and operational
8、 readiness while increasing life cycle costs. Diagnostics are a significant key to achieving system performance and cost effectiveness goals.In such a critical system as the International Space Station, on which human life is dependent, a system recovery concept is also an important aspect that need
9、s to be considered early in the systems design phase. This technique consists of sections on fault-detection, fault-isolation, and recovery techniques. Since they are all related under the integrated diagnostics concept, techniques of one section may be referenced in other sections.Fault-Detection T
10、echniquesA system fault can be detected manually or automatically, depending on operating modes and how quickly the system needs to be restored. For a system that requires human interfaces, system failures can be detected quickly by human visual and/or auditory senses. If, for example, a light is sw
11、itched on and there is no illumination, one can visually detect that there is a problem with either the light switch, light bulb, power source, or circuitry. The obvious advantages of manual fault detection are that it incurs no costs associated with complex system designs.Another common methodology
12、, built-in testing (BIT), is employed to detect and isolate faults without using external test equipment. BIT ranges in complexity from a lamp that lights when equipment fails, to a resident computer that generates test signals and evaluates system responses. BIT can be continuously operated, interl
13、eaved with other operations, or initiated on command. During power-on self-testing, for example, the system runs a self-diagnostic test after the power is applied and includes hardware sensors and software error correcting codes. Its particular mechanization and utilization in a system are, of cours
14、e, determined by the designer.BIT often means additional hardware above that required for the primary function. Reliability and cost are affected and tradeoffs leading to a balanced solution must be made. BIT protective circuitry, moreover, should be designed to be fail-safe. This means that failure
15、 in the BIT circuitry should not affect system performance. Whenever feasible, the BIT input and output should be sufficiently isolated from the normal channels so that any failure in the BIT will not cause impairment of the function being tested. Also, it should be recognized that BIT can fail, and
16、 additional measures should be taken to avoid utilization of possibly erroneous BIT output in recovery measures.Provided by IHSNot for ResaleNo reproduction or networking permitted without license from IHS-,-,-In addition to BIT circuitry which actuates visual status indicators, BIT features may als
17、o include test points and self-test meters. The goal of BIT design is to decrease MTTR by steering a technician to the faulty component as quickly as possible. BIT designers attempt to attain this goal through various means, including the use of innovative circuitry and rearrangement of circuits to
18、perform dual functions with a single circuit, if possible; e.g., driving a visual indicator and tying into various AND gates with a single driver.The BIT designer also standardizes BIT circuitry as much as possible, thus driving down the cost of implementing BIT.Other important general consideration
19、s in designing hardware BIT are:a. The reliability of the BIT hardware should exceed that of the hardware being tested. If this is not the case, the probability of failure of the BIT may be almost as great as the probability of failure of the unit being tested.b. The BIT should be kept simple but ef
20、fective in meeting operational needs.c. The type of circuitry used for BIT should be, if feasible, of the same type used in the normal system to minimize the number of different types of components used in any particular system.d. As a part of the BIT design process, the overall system architecture
21、must also be considered for the most effective implementation. Generally, there are two common approaches: centralization and decentralization.Centralization is regarded as a highly integrated approach in which a centralized unit acts as a “watch dog“ in detecting and reporting system out-of-toleran
22、ce conditions. The centralized unit determines if a failure actually occurred based on the data and information queried from the lower level, and annunciates or reports faults (see figure 1).Provided by IHSNot for ResaleNo reproduction or networking permitted without license from IHS-,-,-refer to D
23、descriptionD Figure 1. Centralized Architecture The type of information acquired by the central unit is an example of passive BIT. Passive BIT monitors system performance on line without the use of a test pattern generator; therefore, it may not be able to completely monitor the system.Active BIT, a
24、 more comprehensive type of testing, can also be used. In active BIT, a test pattern is written to a unit and compared to an expected pattern. The system operation must be interrupted for this type of test if the module is operating. Not all modules are operated continuously, however, and a computer
25、-controlled BIT system can take advantage of times when a module is not needed to run a test sequence. This is referred to as interleaving BIT, which can be a powerful means for maintaining confidence in a system without disrupting its mission to run tests.On the other hand, decentralized architectu
26、re places detection capabilities at the maintainable unit level. Each maintainable unit has the means to detect all the identified failures within the unit. Once a failure occurs, the unit reports to a higher level for record and fault annunciation (see figure 2). Both passive and active BIT can be
27、used in this case.Provided by IHSNot for ResaleNo reproduction or networking permitted without license from IHS-,-,-refer to D descriptionD Figure 2. Decentralization With decentralized BIT, each module will have its own circuitry to monitor such out-of-tolerance conditions as voltage, current, or p
28、arity word at a particular node or memory location. Failure condition of a module can be notified to a central display unit via a data distribution network such as a 1553 bus network. One advantage of decentralized architecture is that a subsystem or a module can be taken off line for a comprehensiv
29、e test using active BIT (generated test pattern). In general, the decentralized control must have the following characteristics:a. Self-test capability.b. Isolation from other system data.c. Some type of synchronization with system functions.A voting scheme is another effective technique to detect f
30、ailure. With this technique, on-line data are processed by three or more redundant computers. A failure is declared when one units output is different from the other two or three units. Decentralized architecture is commonly used in this application. The inertial measurement system on the Orbiter ex
31、emplifies this type of architecture, in that three redundant inertial measurement units process real-time data and compare the outputs for majority agreement. This method requires additional resources but is highly effective in fault detection.Fault-Isolation TechniqueOnce a failure is detected, the
32、 next step is to locate the cause of that failure. The complexity of a system, quick-turn around time demand, repair location, and human skill level are some important factors which must be considered in planning the fault isolation strategy of a system.Provided by IHSNot for ResaleNo reproduction o
33、r networking permitted without license from IHS-,-,-Faults can be isolated manually by visually inspecting for burned-out components or by using an external test set for system diagnostics. For a more complex system and time-critical mission, the faults can also be isolated automatically. BIT, as me
34、ntioned in the fault-detection section, contains an inherent fault-isolation mechanism, since signals generated by a failed module can be identified by the control unit, thereby allowing a test pattern to be injected to that particular unit to confirm its failure.Sometimes BIT can isolate failures d
35、own to a certain system level or to a region of a system that has components connected in a series. In this situation, it is rather difficult to determine exactly which component has actually failed. In contrast, voting scheme will accurately pinpoint the failed unit, because the redundant connectio
36、ns of the system dictate the ease of fault isolation. However, in situations where the remaining two units indicate a fault has occurred, a failed unit will not be easily identified because there is no “majority“ reference data for comparison.A more recent approach to isolating a failure in the inte
37、grated circuit (IC) industry is called the boundary scan. The IC is divided into regions which are accessible via scan operations. A boundary scan path consists of a series of boundary scan cells (BSCs), one BSC per IC function pin. The BSCs are interconnected with the host ICs test data input pin a
38、nd test data output pin, for serial access.During normal IC operation, input and output signals pass through each BSC without interference. When the boundary test mode is selected, however, the test stimulus is applied through a series of BSCs, and test results are captured at the end of the scan pa
39、th. This technique overcomes the test access problems that can cause difficulty in fault isolation. The unit level tests can also be combined for a system-level verification.Recovery TechniqueIn order for a system to recover manually or automatically from a failure, modes of operation which depend o
40、n types of failures must be defined and planned ahead. During the design phase, the systems critical functions, levels of redundancy, and functional paths are usually identified so its recovery functions can be realized. In general, there are three categories of recovery: (a) 100 percent functional
41、recovery using redundant system components, (b) functional recovery using an alternative path, and (c) degraded functional recovery.For category (a), the system is designed so that when a component fails, its failure is reported and the components redundant or backup unit can be turned on manually o
42、r automatically. In such systems as satellites, interplanetary probes, and the Space Station, which require autonomous operations, automatic recovery is likely to be provided.Resources are almost always limited in any situation; therefore, instead of having a redundant string or unit, an alternative
43、 path, category (b), may be taken to recover the lost function. The alternative Provided by IHSNot for ResaleNo reproduction or networking permitted without license from IHS-,-,-path usually does not achieve the full capabilities of the original function because of limiting space and design constrai
44、nts, including costs.For reasons stated above, many redundant critical functions of the Space Station have been designed using the method of functional recovery. If, for example, the cooling loop of the thermal control system failed and was unable to cool the electronics equipment mounted on the col
45、d plates, cool air from the environmental control and life support system may be redirected in order to keep the equipment from overheating. The equipment may also be required to operate at a minimal level to lessen the heat generated.As a worst case, redundant strings are out or not available; in w
46、hich case, operating a system at some minimal capacity must be considered to protect crews or vehicles. In these cases, the critical functions of a system must be looked at to decide which of its components or units may be turned off without losing the ability to control the spacecraft until repairs
47、 are made.If, for example, some of the solar array panels were damaged, insufficient electrical power would be generated to support all the Space Station functions. In this case, power allotted to less critical functions would have to be curtailed or eliminated and even critical systems may have to
48、be operated at some compromised level.SummaryFDIR is becoming an increasingly important factor in designing todays complicated systems and in todays competitive edges for operating an efficient plant or space system with minimal downtime. In any business, downtime or delays may cost millions of doll
49、ars a year in addition to operating costs, simply because FDIR was a design afterthought.By implementing FDIRs design features, one can be assured that the final product will be a safe, efficient, and maintainable system.References:1. Air Force Design Handbook 1-9, “Maintainability (for Ground Electronic Systems),“ Second Edition, Revision 7, February 25, 1988, United States Air Force Aeronautical