Lessons Learned Entry: 2041

Lesson Info:
• Lesson Number: 2041
• Lesson Date: 2008-12-16
• Submitting Organization: JPL
• Submitted by: David Oberhettinger
• POC Name: Todd Bayer; David E. Herman
• POC Email: Todd.J.Bayer@jpl.nasa.gov; David.E.Herman@jpl.nasa.gov
• POC Phone: 818-354-5810 (Bayer); 818-393-5872 (Herman)

Subject: MRO Spaceflight Computer Side Swap Anomalies (Export Version)

Abstract:
A few months into its mission, MRO began experiencing unexpected side swaps to the redundant flight computer that placed the spacecraft into safe mode. The problem was traced to subtle inconsistencies between the MRO design implementation of an ASIC device and a known limitation of that device. Users of the RAD750 spaceflight computer should assure that the “PPCI Erratum 24” ASIC defect cannot cause excessive accumulation of uncorrectable SDRAM memory errors, and that the system architecture has robust error recovery capabilities.

Description of Driving Event:
Mars Reconnaissance Orbiter (MRO) was launched in August 2005 with a mission to study the Martian climate, identify water-related landforms and aqueous deposits, characterize potential landing sites for Mars landers, and provide UHF relay for science data produced by these future missions. The MRO spacecraft is furnished with two redundant onboard computers (i.e., two Command & Data Handling Subsystems, or C&DHs), referred to as Side A and Side B, that share continuously updated state and sensor data. One computer remains active, while the second serves as a “cold backup” that can boot in tens of seconds.

In March 2007, 4 months after beginning the science phase of its mission, telemetry alerted the operations team at the NASA/Caltech Jet Propulsion Laboratory to two successive timeouts of the spacecraft's heartbeat watchdog timer (Reference (1)). The first timeout prompted onboard fault protection (FP) software to order a warm reset of Side A. The second timeout triggered an autonomous switch, or “side swap,” to the Side B computer. After Side B booted, FP autonomously configured the vehicle into safe mode. This prompted an intensive investigation that failed to determine the root cause or to rule out a permanent failure of Side A.

Eleven months later, MRO performed another unrequested warm reset followed by an unrequested side swap, this time back to Side A of the C&DH (Reference (2)). Since Side A was now functioning properly, it was clear to JPL investigators that the fault on Side A that caused the first swap had been cleared by the power cycling of Side A, allowing them to rule out a permanent hardware failure. This prompted JPL to re-open the investigation. In the course of this, they revisited information on a defect (“PPCI Erratum 24”) in the Power Peripheral Component Interconnect (PPCI) bridge Application-Specific Integrated Circuit (ASIC) in the RAD750 Spaceflight Computer (SFC) that was first reported in 2006 by the RAD750 vendor (Reference (2)). Under very specific conditions, this ASIC defect can cause the memory controller (Figure 1) to halt operations, resulting nominally in a warm reset of the computer that clears the condition.

Figure 1. Block diagram of the RAD750 SFC with the memory controller highlighted

This reported defect had not raised much JPL concern in 2006 because of the event's rarity and the belief that it would result merely in a warm reboot of the computer. However, the MRO project did not fully understand the low-level details of RAD750 operation and its interaction with the MRO system design configuration. The remainder of this paragraph, which describes the failure mechanism experienced by the MRO project specific to its design implementation of the RAD750 Spaceflight Computer, has been redacted for International Traffic in Arms Regulations (ITAR) compliance. “U.S. Persons” may obtain a copy of the complete lesson learned by contacting the JPL Office of the Chief Engineer (David Oberhettinger at davido@nasa.gov).

Unintended C&DH side swaps and spacecraft placement into safe mode may interrupt telemetry downlink and, under some circumstances, threaten the mission. In October 2008, the MRO project implemented a vendor-recommended workaround, which involves commanding a change to a parameter and a setting within the PPCI bridge ASIC, that should prevent further MRO side swap incidents.

References:
1. “MRO Side Swap to Side B,” JPL Incident Surprise Anomaly (ISA) No. Z90507, March 14, 2007.
2. PPCI Bridge ASIC Master Errata List, BAE Document # A13917 Revision (-) Version 1.3, Errata List for PPCI ASIC P/N 244A907 (Bridge Chip).
3. “Excess Latency in SPS Safe Mode Predicts Delivery,” JPL Incident Surprise Anomaly (ISA) No. Z90508, March 14, 2007.
4. “Final Report on the Mars Reconnaissance Orbiter C&DH Side Swap #1 and #2 Anomalies,” JPL Document No. D-37650 (MRO Report No. MRO-36-747), October 16, 2008.

Lesson(s) Learned:
The “Erratum 24” defect in the PPCI bridge ASIC represents a subtle failure mechanism for spacecraft employing a RAD750 SFC architecture that can be overcome by an operational workaround, but is best prevented through flight system design measures.

Recommendation(s):
The full text of these recommendations has also been redacted for ITAR compliance. “U.S. Persons” may obtain a copy of the complete lesson learned by contacting the JPL Office of the Chief Engineer (David Oberhettinger at davido@nasa.gov).

For all missions employing a RAD750 SFC architecture:
1. The U.S. version of this recommendation calls for analyzing the proposed C&DH design to assure that it is not vulnerable to the Erratum 24 defect. (The Erratum 24 defect is a known issue, but the subtleties may not be apparent from the vendor-published data.)
2. The U.S. version of this recommendation refers to the need for robust error checking.
3. The U.S. version of this recommendation suggests robust design measures for data salvage.
4. The U.S. version of this recommendation advocates a “clear-everything” capability for power-on resets (PORs).

Evidence of Recurrence Control Effectiveness:
JPL has referenced this lesson learned as additional rationale and guidance supporting Paragraph 9.4.2 (“Flight System Flight Operations Design: Prime/Redundant Hardware Usage - Swapping to Redundant Hardware”) in the JPL standard “Design, Verification/Validation and Operations Principles for Flight Systems (Design Principles),” JPL Document D-17868, Rev. 3, December 11, 2006.

Documents Related to Lesson: N/A

Mission Directorate(s):
• Science
• Space Operations

Additional Key Phrase(s):
• Engineering design and project processes and standards
• Software Engineering
• Spacecraft and Spacecraft Instruments
• Computers
• Flight Equipment
• Flight Operations
• Ground Equipment
• Hardware
• Payloads
• Software
• Spacecraft

Additional Info:
• Project: Mars Reconnaissance Orbiter

Approval Info:
• Approval Date: 2009-05-06
• Approval Name: mbell
• Approval Organization: HQ

Provided by IHS. Not for resale. No reproduction or networking permitted without license from IHS.
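
Illustrative sketch (not part of the original lesson):
The escalation sequence reported under Description of Driving Event (a first heartbeat watchdog timeout leads to a warm reset of the active side; a second successive timeout leads to a side swap, after which fault protection places the vehicle in safe mode) can be modeled roughly as below. This is a minimal sketch under stated assumptions, not MRO flight software: the class name, method names, and the rule that a healthy heartbeat clears the timeout count are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Side(Enum):
    A = "A"
    B = "B"


@dataclass
class FaultProtection:
    """Toy model of the two-step watchdog escalation (hypothetical names)."""

    active: Side = Side.A
    safe_mode: bool = False
    successive_timeouts: int = 0
    log: list = field(default_factory=list)

    def heartbeat_ok(self) -> None:
        # Assumed policy: a healthy heartbeat clears the escalation count.
        self.successive_timeouts = 0

    def watchdog_timeout(self) -> str:
        """Handle one heartbeat watchdog timeout and return the action taken."""
        self.successive_timeouts += 1
        if self.successive_timeouts == 1:
            # First timeout: warm reset of the currently active side.
            action = f"warm_reset:{self.active.value}"
        else:
            # Second successive timeout: swap to the redundant side,
            # then configure the vehicle into safe mode.
            self.active = Side.B if self.active is Side.A else Side.A
            self.safe_mode = True
            self.successive_timeouts = 0
            action = f"side_swap:{self.active.value}"
        self.log.append(action)
        return action


if __name__ == "__main__":
    fp = FaultProtection()
    print(fp.watchdog_timeout())  # first timeout on Side A
    print(fp.watchdog_timeout())  # second successive timeout
    print(fp.active, fp.safe_mode)
```

In this model, two successive timeouts on Side A end with Side B active and safe mode set, and a later pair of timeouts on Side B swaps back to Side A, which mirrors the order of events in the two reported anomalies. The model says nothing about the redacted failure mechanism itself.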