AMERICAN NATIONAL STANDARD FOR TELECOMMUNICATIONS

ATIS-0100030.2012

MEAN TIME BETWEEN OUTAGES
A GENERALIZED METRIC FOR ASSESSING PRODUCTION FAILURE RATES IN TELECOMMUNICATIONS NETWORK ELEMENTS

As a leading technology and solutions development organization, ATIS brings together the top global ICT companies to advance the industry's most-pressing business priorities. Through ATIS committees and forums, nearly 200 companies address cloud services, device solutions, M2M communications, cyber security, ehealth, network evolution, quality of service, billing support, operations, and more. These priorities follow a fast-track development lifecycle, from design and innovation through solutions that include standards, specifications, requirements, business use cases, software toolkits, and interoperability testing.

ATIS is accredited by the American National Standards Institute (ANSI). ATIS is the North American Organizational Partner for the 3rd Generation Partnership Project (3GPP), a founding Partner of oneM2M, a member and major U.S. contributor to the International Telecommunication Union (ITU) Radio and Telecommunications sectors, and a member of the Inter-American Telecommunication Commission (CITEL). For more information, visit .

AMERICAN NATIONAL STANDARD

Approval of an American National Standard requires review by ANSI that the requirements for due process, consensus, and other criteria for approval have been met by the standards developer.

Consensus is established when, in the judgment of the ANSI Board of Standards Review, substantial agreement has been reached by directly and materially affected interests. Substantial agreement means much more than a simple majority, but not necessarily unanimity. Consensus requires that all views and objections be considered, and that a concerted effort be made towards their resolution.

The use of American National Standards is completely voluntary; their existence does not in any respect preclude anyone, whether he has approved the standards or not, from manufacturing, marketing, purchasing, or using products, processes, or procedures not conforming to the standards.

The American National Standards Institute does not develop standards and will in no circumstances give an interpretation of any American National Standard. Moreover, no person shall have the right or authority to issue an interpretation of an American National Standard in the name of the American National Standards Institute. Requests for interpretations should be addressed to the secretariat or sponsor whose name appears on the title page of this standard.

CAUTION NOTICE: This American National Standard may be revised or withdrawn at any time. The procedures of the American National Standards Institute require that action be taken periodically to reaffirm, revise, or withdraw this standard. Purchasers of American National Standards may receive current information on all standards by calling or writing the American National Standards Institute.

Notice of Disclaimer

... hence the immediate focus on router reliability. The field reliability of modern provider edge routers, which have a large variety of interface cards, cannot be accurately characterized by a single downtime or reliability metric because it requires averaging the contributions of
the various router components, which may hide the poor reliability of some of them.

MTBO Positioned as a Practical Application of Traditional MTBF

Failures in router elements go beyond the traditional view, in which only an element failure leading to total replacement is counted for MTBF.
In reality, there are other, transient modes of failure, such as software glitches, that also lead to downtime for customer facing line cards. MTBO counts all failures that cause customer downtime, not only those that require physical replacement. Consequently, the initial focus of MTBO development in the first publication, ATIS-0100030.2010, was exclusively from a router perspective: customer facing line card downtime is the driver for MTBO estimation.

Typically, a service provider Operations organization applies the MTBO metric across different routers and switches, including redundant configurations. In a redundant configuration, a line-card failure is counted against MTBO only if there is no failover to the redundant card. As one would expect, the MTBO goal (target) for a redundant configuration is much higher than that for a configuration without redundancy. For example, if the MTBO goal for a single point
of failure (SPOF) configuration is 100,000 hours and the coverage factor (the fraction of active-card failures that fail over successfully) is 0.99, then the MTBO goal for a configuration with redundancy is 100,000/(1 - 0.99) = 10,000,000 hours, because only the 1% of failures without a successful failover count against MTBO.

In fact, the MTBO concept can be applied to all forms of telecommunication network elements, including elements with redundancy. Given the rapid pace of change in current telecommunications technologies spanning wireline and wireless networks, it makes sense to manage equipment supplier product reliability in a way that includes transient failures in addition to the traditional MTBF physical replacement scenario. The following examples illustrate the applicability (and advantage) of the MTBO concept:

1. Any Software Controlled Device: For example, a transmission power amplifier in the UMTS nodeB or the LTE eNodeB will be turned off when its temperature exceeds some threshold, and
then turned back on when its temperature returns to a normal level. Such temporary outages with automatic recovery can be counted against MTBO. In particular, a defective amplifier may have a smaller MTBO than its predicted hardware MTBF. In such cases a proactive replacement
in the maintenance window can be scheduled to prevent a production failure.

2. Set of Software Controlled Devices: UMTS elements such as nodeB and RNC, LTE elements such as eNodeB, etc., consist of a set of devices (subsystems), some of which are redundant. Failures of these devices (subsystems) may cause partial or complete service interruption for customers. The MTBO metric can be calculated per device, as in the previous example, or per element, where device outages can be weighted according to their customer impact. Equal weighting is applied if we do not differentiate
between partial and complete outages.

3. Ethernet Quality Assessment: The MTBO metric can be used to quantify the quality of the Ethernet connection between two routers, based on the number of timeouts of a specific protocol such as Bidirectional Forwarding Detection (BFD). The intent is to measure the frequency of short-duration outages of the Ethernet connection.

This revision provides a generalized definition of the MTBO metric (Section 6). It retains the application of the metric to Customer Facing Line Cards (Section 7). It then describes how to apply the metric to the additional situations described above (Section 8).

6 Generalized MTBO Definition

In modern telecommunication networks, many electronic and media (e.g., Ethernet) components have short outages with automatic recovery. These outages may remain undetected because of their short
duration (on the order of one second), but they have a severe impact on services such as VoIP/LTE, Telepresence, and any video application. Outage detection and the corresponding outage time measurement can be done using fault detection alarms, the BFD protocol, an SLA protocol, etc. These measurements are used for
the MTBO calculation in a given time interval T as follows:

7 MTBO & Network Production Failure Rate: Use of Customer Facing Line Cards

A typical IP router consists of line cards (LC) carrying traffic and shared equipment (e.g., router processor, switching elements, and cooling devices). Each of these
components has a predicted Mean Time Between Failures (P-MTBF) offered by the equipment supplier using traditional Telcordia methodology [Telcordia SR-332], [TL9000-Hbk]. The P-MTBF metric is traditionally used to compare the predicted and production (or measured) failure rates of hardware elements in routers comprising service provider edge networks. However, hardware failures leading to element replacement are not the only reasons for service disruption. Line cards and router processors operate under the control of complex software systems (including packet transmission protocols) that may also experience failures leading to temporary downtime for these elements.

Consider a set of identical Customer Facing Line Cards. For these cards, all outages caused by hardware and software, including failures of the entire router, are counted. This count is used to calculate the production line card failure rate in a given time period (e.g., monthly), and the corresponding mean time between failures is referred to as MTBO.

ATIS Technical Report ATIS-0100025 [ATIS-0100025] described the use of a normalizing factor in determining the service availability impacts of a variety of network element hardware and software outages. The use of Customer Facing Line Cards as the normalizing element was described in ATIS-0100025, and the MTBO was briefly presented as a component of estimating service availability impacts arising from various outages. The rationale for utilizing Customer Facing Line Cards is repeated in this clause.

The Customer Facing Line Card is a component common to all elements that can cause service outages. These Line Cards are on the “drop side” of an Edge (Access) Router, where facilities from a customer's CPE terminate on individual ports on the Line Cards. A failure in any element in
the Access Network may result in downtime for individual ports on the Line Cards or for the entire Line Card on the Access Router. Such failures prevent delivery of customer transactions to the backbone.

Figure 1: Access Network Elements
KEY: AR: Access (Edge) Router; BR: Backbone Router; LC: Customer Facing Line Card

There are several element types in a typical IP Access Network topology (Figure 1) whose failure can cause downtime for Line Cards [5], directly or indirectly.

[5] Only transport layer failures that directly impact Customer Facing Line Cards are considered for this document, as shown in Figure 1 (access and backbone routers, their components, and facilities linking them). Failures of non-transport layer elements (e.g., service/application layer elements) are not considered.

Elements whose failures directly impact downtime are:

- Line Card on the customer facing side: Any failure in the electronic or optical components of the Line Card that causes traffic interruption will result in Line Card downtime.
- Access (Edge) Routers that form an edge on an Internet Service Provider (ISP) backbone network: Line Card downtime can be caused by a failure in a router component or by a total router failure.

Network elements whose failures may indirectly impact Line Card downtime are:

- Facilities and supporting elements such as cross-connects, which link Access Routers to Backbone Routers: To increase the availability of the Access Network, an ISP usually provides redundancy by connecting each Access Router to two Backbone Routers at the same access node using two independent sets of uplinks [6] (Figure 1 depicts a typical access node with several Access Routers and two Backbone Routers). This permits customer traffic to enter the backbone in the following failure scenarios:
    o A failed uplink.
    o A failed card supporting an uplink.
    o A failed Backbone Router at the access node.
  If all facilities linking an access router to a backbone router fail, then all Line Cards at the access router will experience downtime.
  [6] An uplink is a facility (e.g., DS3, OC-3, OC-48) connecting any access router to a backbone router.
- Backbone Routers linked to Access Routers: As shown in Figure 1, if both Backbone Routers at an access node fail (a rare event), then all Line Cards on the Access Routers at this node lose connection to the backbone.
- Facilities linking Backbone Routers at an access node to backbone routers at other backbone nodes: Such facility failures decrease the available bandwidth from Access Routers to the backbone. Note that if all Backbone Router uplinks at an access node fail (a rare event), then all Line Cards on the Access Routers
at this node lose connection to the backbone.

Impacts on Line Cards from such failures are extremely rare because service providers typically build redundancy into the backbone (that is, into all elements that may indirectly cause Line Card downtime). Full redundancy in the form of facilities dual-homing the Access Routers
to pairs of Backbone Routers is intended to serve this purpose.

MTBO and Production Failure Rates can then be estimated from the count of line card failures caused by failures of customer facing line cards, of both uplinks, and of the entire router, for each combination of Router Class and Line Card
Type. A Router Class is a set of identical access routers from a single vendor. Using such sets enables the metrics to be estimated separately for different router vendors. For example, if a network has routers from two vendors and each vendor produces two distinct types of routers, then the full set
of access routers in the network can be grouped into four Router Classes, one for each combination of vendor and router type. The metrics can then be estimated for any Line Card type within any given Router Class.

The generalized MTBO metric defined above is adapted for each combination of Router Class and Line Card Type as follows. Consider a set of access routers of the same class with $J$ types of access line cards, monitored for failures during a time interval of length $T$. For each customer-impacting failure $i$, $i = 1, 2, \ldots, L$, the number $n_{ij}$ of type $j$ cards affected is recorded. Two points to note
here:

- In case of redundancy, the failure of the active (primary) line card is not counted if the failover to the backup card was hitless. Otherwise, only failures of active cards are counted.
- If the entire router is down (due to router failure or due to the downstream failures described in the previous clause), then all Line Cards on the router are considered to be down.

Let $N_j$ be the total number of Line Cards of type $j$ in the given Router Class. Then the MTBO metric for Line Cards of type $j$ in that Router Class is:

$$M_j = \frac{N_j T}{\sum_{i=1}^{L} n_{ij}}$$

Consequently, the Production Failure Rate metric associated with
Line Cards of type $j$ can be written as:

$$\mathrm{PFR}_j = \frac{1}{M_j}$$

8 MTBO Application for Other Cases

MTBO application is illustrated with the following examples.

8.1 Set of Uniform Devices

Consider a set of uniform devices/boards such as power amplifiers in the UMTS nodeB. Let $N$ be the total number of amplifiers, and let $n$ be the number of outages of these amplifiers during a time interval of $T$ hours. We count outages with automatic recovery as well as outages where the failed amplifier was replaced to restore service. Then:

$$\mathrm{MTBO} = \frac{N T}{n}$$
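
The arithmetic behind the uniform-device MTBO above, and behind the per-line-card-type MTBO and Production Failure Rate of Section 7, can be illustrated with a minimal Python sketch. The function names and all sample numbers (device counts, interval length, outage counts) are hypothetical and chosen only for illustration; they are not taken from this standard.

```python
from typing import Sequence


def mtbo_uniform(num_devices: int, interval_hours: float, num_outages: int) -> float:
    """MTBO = N * T / n for N uniform devices observed for T hours with n outages."""
    if num_outages <= 0:
        raise ValueError("MTBO is undefined when no outages were observed")
    return num_devices * interval_hours / num_outages


def mtbo_line_card_type(num_cards: int, interval_hours: float,
                        cards_affected: Sequence[int]) -> float:
    """M_j = N_j * T / sum_i n_ij for one line card type j in one Router Class.

    cards_affected[i] is n_ij, the number of type-j cards hit by failure i.
    """
    total_affected = sum(cards_affected)
    if total_affected <= 0:
        raise ValueError("MTBO is undefined when no cards were affected")
    return num_cards * interval_hours / total_affected


def production_failure_rate(mtbo_hours: float) -> float:
    """PFR_j = 1 / M_j, in failures per card-hour."""
    return 1.0 / mtbo_hours


# Hypothetical example: 1000 amplifiers, a 720-hour month, 6 outages.
print(mtbo_uniform(1000, 720, 6))  # 120000.0 hours

# Hypothetical example: 200 type-j cards, three failures affecting 1, 4, and 1 cards.
m_j = mtbo_line_card_type(200, 720, [1, 4, 1])
print(m_j)  # 24000.0 hours
print(production_failure_rate(m_j))
```

Both functions reject an interval with no recorded outages, since the ratio is undefined in that case; how such an interval should be reported is left to the user of the metric.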