REG NASA-LLIS-2041-2008 Lessons Learned MRO Spaceflight Computer Side Swap Anomalies [Export Version]

Provided by IHS. Not for resale. No reproduction or networking permitted without license from IHS.

Lessons Learned Entry: 2041

Lesson Info:
- Lesson Number: 2041
- Lesson Date: 2008-12-16
- Submitting Organization: JPL
- Submitted by: David Oberhettinger
- POC Name: Todd Bayer; David E. Herman
- POC Email: Todd.J.Bayer@jpl.nasa.gov; David.E.Herman@jpl.nasa.gov
- POC Phone: 818-354-5810 (Bayer); 818-393-5872 (Herman)

Subject: MRO Spaceflight Computer Side Swap Anomalies [Export Version]

Abstract: A few months into its mission, MRO began experiencing unexpected side swaps to the redundant flight computer that placed the spacecraft into safe mode. The problem was traced to subtle inconsistencies between the MRO design implementation of an ASIC device and a known limitation of that device. Users of the RAD750 spaceflight computer should ensure that the "PPCI Erratum 24" ASIC defect cannot cause excessive accumulation of uncorrectable SDRAM memory errors, and that the system architecture has robust error recovery capabilities.

Description of Driving Event:

Mars Reconnaissance Orbiter (MRO) was launched in August 2005 with a mission to study the Martian climate, identify water-related landforms and aqueous deposits, characterize potential landing sites for Mars landers, and provide UHF relay for science data produced by these future missions. The MRO spacecraft is furnished with two redundant onboard computers (i.e., two Command & Data Handling Subsystems, or C&DHs), referred to as Side A and Side B, that share continuously updated state and sensor data. One computer remains active, while the second serves as a "cold backup" that can boot in tens of seconds.

In March 2007, four months after beginning the science phase of its mission, telemetry alerted the operations team at the NASA/Caltech Jet Propulsion Laboratory to two successive timeouts of the spacecraft's heartbeat watchdog timer (Reference (1)). The first timeout prompted onboard fault protection (FP) software to order a warm reset of Side A. The second timeout triggered an autonomous switch, or "side swap," to the Side B computer. After Side B booted, FP autonomously configured the vehicle into safe mode. This prompted an intensive investigation that failed to determine the root cause or rule out a permanent failure of Side A.

Eleven months later, MRO performed another unrequested warm reset followed by an unrequested side swap, this time back to Side A of the C&DH (Reference (2)). Since Side A was now functioning properly, it was clear to JPL investigators that the fault on Side A that caused the first swap had been cleared by power cycling Side A, allowing them to rule out a permanent hardware failure.

This prompted JPL to re-open the investigation. In the course of this, they revisited information on a defect ("PPCI Erratum 24") in the Power Peripheral Component Interconnect (PPCI) bridge Application-Specific Integrated Circuit (ASIC) in the RAD750 Spaceflight Computer (SFC) that was first reported in 2006 by the RAD750 vendor (Reference (2)). Under very specific conditions, this ASIC defect can cause the memory controller (Figure 1) to halt operations, resulting nominally in a warm reset of the computer that clears the condition.

[Figure 1. Block diagram of the RAD750 SFC, consisting of three blocks, with the memory controller highlighted]

This reported defect had not raised much JPL concern in 2006 because of the event's rarity and the belief that it would result merely in a warm reboot of the computer. However, the MRO project did not fully understand the low-level details of RAD750 operation and its interaction with the MRO system design configuration. Specifically, [redacted]. The remainder of this paragraph, which describes the failure mechanism experienced by the MRO project specific to its design implementation of the RAD750 Spaceflight Computer, has been redacted for International Traffic in Arms Regulations (ITAR) compliance. "U.S. Persons" may obtain a copy of the complete lesson learned by contacting the JPL Office of the Chief Engineer (David Oberhettinger at davido@nasa.gov).

Unintended C&DH side swaps and spacecraft placement into safe mode may interrupt telemetry downlink and, under some circumstances, threaten the mission. In October 2008, the MRO project implemented a vendor-recommended workaround, which involves commanding a change to a parameter and a setting within the PPCI bridge ASIC, that should prevent further MRO side swap incidents.

References:
1. "MRO Side Swap to Side B," JPL Incident Surprise Anomaly (ISA) No. Z90507, March 14, 2007.
2. PPCI Bridge ASIC Master Errata List, BAE Document # A13917 Revision (-) Version 1.3, Errata List for PPCI ASIC P/N 244A907 (Bridge Chip).
3. "Excess Latency in SPS Safe Mode Predicts Delivery," JPL Incident Surprise Anomaly (ISA) No. Z90508, March 14, 2007.
4. "Final Report on the Mars Reconnaissance Orbiter C&DH Side Swap #1 and #2 Anomalies," JPL Document No. D-37650 (MRO Report No. MRO-36-747), October 16, 2008.

Lesson(s) Learned: The "Erratum 24" defect in the PPCI bridge ASIC represents a subtle failure mechanism for spacecraft employing a RAD750 SFC architecture that can be overcome by an operational workaround, but is best prevented through flight system design measures.

Recommendation(s): The full text of these recommendations has also been redacted for ITAR compliance. "U.S. Persons" may obtain a copy of the complete lesson learned by contacting the JPL Office of the Chief Engineer (David Oberhettinger at davido@nasa.gov).

For all missions employing a RAD750 SFC architecture:
1. The U.S. version of this recommendation calls for analyzing the proposed C&DH design to ensure that it is not vulnerable to the Erratum 24 defect. (The Erratum 24 defect is a known issue, but the subtleties may not be apparent from the vendor-published data.)
2. The U.S. version of this recommendation refers to the need for robust error checking.
3. The U.S. version of this recommendation suggests robust design measures for data salvage.
4. The U.S. version of this recommendation advocates a "clear-everything" capability for power-on resets (PORs).

Evidence of Recurrence Control Effectiveness: JPL has referenced this lesson learned as additional rationale and guidance supporting Paragraph 9.4.2 ("Flight System Flight Operations Design: Prime/Redundant Hardware Usage - Swapping to Redundant Hardware") in the JPL standard "Design, Verification/Validation and Operations Principles for Flight Systems (Design Principles)," JPL Document D-17868, Rev. 3, December 11, 2006.

Documents Related to Lesson: N/A

Mission Directorate(s):
- Science
- Space Operations

Additional Key Phrase(s):
- Engineering design and project processes and standards
- Software Engineering
- Spacecraft and Spacecraft Instruments
- Computers
- Flight Equipment
- Flight Operations
- Ground Equipment
- Hardware
- Payloads
- Software
- Spacecraft

Additional Info:
- Project: Mars Reconnaissance Orbiter

Approval Info:
- Approval Date: 2009-05-06
- Approval Name: mbell
- Approval Organization: HQ
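
The fault protection escalation described under Description of Driving Event (a first heartbeat watchdog timeout leads to a warm reset of the active side, a second successive timeout leads to a side swap, after which the spacecraft is placed in safe mode) can be sketched as a small state machine. This is an illustrative model only, not MRO flight software. The class, method, and action names are hypothetical, and resetting the timeout count on a healthy heartbeat is an assumption made to model "successive" timeouts; the lesson does not describe the actual FP logic or its timing.

```python
from dataclasses import dataclass, field


@dataclass
class FaultProtection:
    """Toy model of the watchdog escalation described in the lesson.

    All names are hypothetical; this is not the MRO FP implementation.
    """
    active_side: str = "A"
    safe_mode: bool = False
    successive_timeouts: int = 0
    log: list = field(default_factory=list)

    def on_heartbeat_ok(self) -> None:
        # Assumption: a healthy heartbeat clears the escalation counter.
        self.successive_timeouts = 0

    def on_watchdog_timeout(self) -> str:
        """Handle one heartbeat watchdog timeout; return the action taken."""
        self.successive_timeouts += 1
        if self.successive_timeouts == 1:
            # First timeout: warm reset of the currently active side.
            action = f"warm_reset_{self.active_side}"
        else:
            # Second successive timeout: swap to the redundant side,
            # then configure the vehicle into safe mode.
            self.active_side = "B" if self.active_side == "A" else "A"
            self.successive_timeouts = 0
            self.safe_mode = True
            action = f"side_swap_to_{self.active_side}"
        self.log.append(action)
        return action


# March 2007 sequence: two successive timeouts while Side A is active.
fp = FaultProtection()
first = fp.on_watchdog_timeout()   # warm reset of Side A
second = fp.on_watchdog_timeout()  # side swap to Side B, safe mode entered
print(first, second, fp.active_side, fp.safe_mode)
```

In this sketch, the same two-step escalation applied while Side B is active produces a warm reset of Side B followed by a swap back to Side A, mirroring the second anomaly eleven months later.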
