AMERICAN NATIONAL STANDARD FOR TELECOMMUNICATIONS

ATIS-0100030.2012(R2017)

MEAN TIME BETWEEN OUTAGES
A GENERALIZED METRIC FOR ASSESSING PRODUCTION FAILURE RATES IN TELECOMMUNICATIONS NETWORK ELEMENTS

As a leading technology and solutions development organization, ATIS brings together the top global ICT companies to advance the industry's most-pressing business priorities. Through ATIS committees and forums, nearly 200 companies address cloud services, device solutions, M2M communications, cyber security, ehealth, network evolution, quality of service, billing support, operations, and more. These priorities follow a fast-track development lifecycle, from design and innovation through solutions that include standards, specifications, requirements, business use cases, software toolkits, and interoperability testing.

ATIS is accredited by the American National Standards Institute (ANSI). ATIS is the North American Organizational Partner for the 3rd Generation Partnership Project (3GPP), a founding Partner of oneM2M, a member and major U.S. contributor to the International Telecommunication Union (ITU) Radio and Telecommunications sectors, and a member of the Inter-American Telecommunication Commission (CITEL). For more information, visit .

AMERICAN NATIONAL STANDARD

Approval of an American National Standard requires review by ANSI that the requirements for due process, consensus, and other criteria for approval have been met by the standards developer. Consensus is established when,
in the judgment of the ANSI Board of Standards Review, substantial agreement has been reached by directly and materially affected interests. Substantial agreement means much more than a simple majority, but not necessarily unanimity. Consensus requires that all views and objections be considered, and that a concerted effort be made towards their resolution.

The use of American National Standards is completely voluntary; their existence does not in any respect preclude anyone, whether he has approved the standards or not, from manufacturing, marketing, purchasing, or using products, processes,
or procedures not conforming to the standards.

The American National Standards Institute does not develop standards and will in no circumstances give an interpretation of any American National Standard. Moreover, no person shall have the right or authority to issue an interpretation of an American National Standard in the name of the American National Standards Institute. Requests for interpretations should be addressed to the secretariat or sponsor whose name appears on the title page of this standard.

CAUTION NOTICE: This American National Standard may be revised or withdrawn at any time. The procedures of the American National Standards Institute require that action be taken periodically to reaffirm, revise, or withdraw this standard. Purchasers of American National Standards may receive current information on all standards by calling or writing the American National Standards Institute.

Notice of Disclaimer

hence the immediate focus on router reliability. The field reliability of modern provider edge routers, which have a large variety of interface cards, cannot be accurately characterized by a single downtime or reliability metric because it requires averaging the contributions
of the various router components, which may hide the poor reliability of some components.

MTBO Positioned as a Practical Application of Traditional MTBF

Failures in router elements go beyond the traditional view of failures, whereby only element failure leading to total replacement is considered for MTBF.
In reality, there are other transient modes of failure, such as software glitches, that also lead to downtime for customer facing line cards. MTBO counts all failures that cause customer downtime, not only those that require physical replacement.

Consequently, the initial focus of MTBO development in the first publication of ATIS-0100030.2010 is exclusively from a router perspective: customer facing line card downtime is the driver for MTBO estimation. Typically, a service provider Operations organization applies the MTBO metric across different routers and switches, including redundant configurations. In a redundant configuration, a line-card failure is counted against MTBO only if there is no failover to the redundant card. As one could expect, the MTBO goal (target) for the redundant configuration is much higher than that for a configuration without redundancy. For example, if the MTBO goal for single point of
failure (SPOF) configuration is 100,000 hours, and the coverage factor is 0.99 (that is, 99% of active-card failures fail over successfully and are not counted), then for a configuration with redundancy the MTBO goal will be 100,000/(1 - 0.99) = 10,000,000 hours.

In fact, the MTBO concept can be applied to any and all forms of telecommunication network elements, including elements with redundancy. Given the rapid pace of change in current telecommunications technologies spanning wireline and wireless networks, it makes sense to manage equipment supplier product reliability in a way that includes transient failures in addition to the more traditional MTBF physical replacement scenario. The following examples illustrate the applicability (and advantage) of the MTBO concept:

1. Any Software Controlled Device: For example, a transmission power amplifier in the UMTS nodeB or the LTE eNodeB will be turned off when its temperature exceeds some threshold, and then turned on when its temperature decreases to a normal level. The number of such temporary outages with automatic recovery can be counted against MTBO. In particular, a defective amplifier may have a smaller MTBO in comparison with its predicted hardware MTBF. In such cases a proactive replacement in the maintenance window can be issued that will prevent the production failure.

2. Set of Software Control Devices: The UMTS elements such as nodeB and RNC, the LTE elements such as eNodeB, etc., consist of a set of devices (subsystems), some of which are redundant. Failures of these devices (subsystems) may cause partial or complete service interruption for customers. The MTBO metric can be calculated per device, as in the previous example, or per element, where device outages can be generally weighted according to their customer impact. Equal weighting is applied if we do not differentiate between partial and complete outages.

3. Ethernet Quality Assessment: The MTBO metric can be applied to the quantification of quality of the Ethernet connection between two routers based on the number of timeouts of a specific protocol like Bidirectional Forwarding Detection (BFD). The intent is to test for frequency of short duration outages of the Ethernet connection.

This revision provides a generalized definition of the MTBO metric (Section 6). It retains the application of the metric to the Customer Facing Line Cards (Section 7). It then provides descriptions of applying the metric to the additional situations described above (Section 8).

6 Generalized MTBO Definition

In modern telecommunication networks, many electronic and media (e.g., Ethernet) components have short-time outages with automatic recovery. These outages may remain undetected because of their short duration (of the order of one second), but they have severe impact on services like VoIP/LTE, Telepresence, and any video application. The outage detection and respective outage time measurement can be done using fault detection alarms, BFD protocol, SLA protocol, etc. These measurements are used for the MTBO calculation in a given time interval T as follows:

7 MTBO & Network Production Failure Rate: Use of Customer Facing Line Cards

A typical IP router consists of line cards (LC) carrying traffic and shared equipment (e.g., router processor, switching elements, and cooling devices). Each of these components
has a predicted Mean Time Between Failures (P-MTBF) offered by the equipment supplier using traditional Telcordia methodology (Telcordia SR-332, TL9000-Hbk). The P-MTBF metric is traditionally used to compare the predicted and production (or measured) failure rates of hardware elements in routers comprising service provider edge networks. However, hardware failures leading to element replacement are not the only reasons for service disruption. Line cards and router processors operate under control of complex software systems (including packet transmission protocols) that may also experience failures leading to temporary downtime for these elements.

Consider a set of identical Customer Facing Line Cards. For these cards, all outages caused by hardware and software, including entire router failures, are counted. This count is used to calculate the production line card failure rate in a given time period (e.g., monthly), and the respective mean time between failures is referred to as MTBO.

ATIS Technical Report ATIS-0100025 described the use of a normalizing factor in determining the service availability impacts due to a variety of network element hardware and software outages. The use of Customer Facing Line Cards as the normalizing element was described in ATIS-0100025, and the MTBO was briefly presented as a component of estimating service availability impacts arising from various outages. The rationale for utilizing Customer Facing Line Cards is repeated in this clause.

The Customer Facing Line Card is a common component to all elements that can cause service outages. These Line Cards are on the "drop side" of an Edge (Access) Router, where facilities from a customer's CPE terminate on individual ports on the Line Cards. A failure in any element in the Access
Network may result in downtime for individual ports on the Line Cards or for the entire Line Card on the Access Router. Such failures prevent delivery of customer transactions to the backbone.

[Figure 1: Access Network Elements. Key: AR = Access (Edge) Router; BR = Backbone Router; LC = Customer Facing Line Card]

There are several element types in a typical IP Access Network topology (Figure 1) whose failure can cause downtime for Line Cards [5], directly or indirectly.

[5] Only transport layer failures that directly impact Customer Facing Line Cards are considered for this document, as shown in Figure 1 (access and backbone routers, their components, and facilities linking them). Failures of non-transport layer elements (e.g., service/application layer elements) are not considered.

Elements whose failures directly impact downtime are:

• Line Card on the customer facing side: Any failure in the electronic or optical components of the Line Card that causes traffic interruption will result in Line Card downtime.
• Access (Edge) Routers that form an edge on an Internet Service Provider (ISP) backbone network: Line Card downtime can be caused by a failure in a router component or by a total router failure.

Network elements whose failures may indirectly impact Line Card downtime are:

• Facilities and supporting elements such as cross-connects, which link Access Routers to Backbone Routers: To increase the availability of the Access Network, an ISP usually provides redundancy by connecting each Access Router to two Backbone Routers at the same access node using two independent sets of uplinks [6] (Figure 1 depicts a typical access node with several Access Routers and two Backbone Routers). This permits customer traffic to enter the backbone in the following failure scenarios:
  o A failed uplink.
  o A failed card supporting an uplink.
  o A failed Backbone Router at the access node.
  If all facilities linking an access router to a backbone router fail, then all Line Cards at the access router will experience downtime.

  [6] An uplink is a facility (e.g., DS3, OC-3, OC-48) connecting any access router to a backbone router.

• Backbone Routers linked to Access Routers: As shown in Figure 1, if both Backbone Routers at an access node fail (a rare event), then all Line Cards on the Access Routers at this node lose connection to the backbone.
• Facilities linking Backbone Routers at an access node to backbone routers at other backbone nodes: Such facility failures decrease the available bandwidth from Access Routers to the backbone. Note that if all Backbone Router uplinks at an access node fail (a rare event), then all Line Cards on the Access Routers at this node
lose connection to the backbone.

Impacts on Line Cards from such failures are extremely rare, as service providers typically have redundancy in the backbone (all elements that may indirectly cause Line Card downtime). Full redundancy, in terms of facilities dual-homing the Access Routers to pairs of Backbone Routers, is intended to serve this purpose.

The estimation of MTBO and Production Failure Rates can then be done based on the line card failure count caused by failures of customer facing line cards, both uplinks, and the entire router, for each (Router Class, Line Card Type) combination. A Router Class is a set of identical access routers from a single vendor. The use of such sets can enable metric estimation for different router vendors. For example, if a network has routers from two separate vendors and each vendor produces two unique types of routers, then the total number of access routers in the network can be grouped into four Router Classes, one for each (vendor, router type) combination. The metrics can then be estimated for any Line Card type within any given Router Class.

The generalized MTBO metric defined above is adapted for each (Router Class, Line Card Type) combination as follows. Consider a set of access routers of the same class with J types of access line cards, which are monitored for failures during a time interval of length $T$. For each customer impacting failure $i$, $i = 1, 2, \ldots, L$, the number $n_{ij}$ of type $j$ cards affected is recorded. Two points to note here:

• In case
of redundancy, the failure of the active (primary) line card is not counted if the failover to the backup card was hitless. Otherwise, only failures of active cards are counted.
• In case the entire router is down (due to router failure or due to downstream failures described in the previous clause), then all Line Cards on the router are considered to be down.

Let $N_j$ be the total number of Line Cards of type $j$ in the given Router Class. Then, the MTBO metric for Line Cards of type $j$ in this Router Class is denoted as:

$$M_j = \frac{N_j \, T}{\sum_{i=1}^{L} n_{ij}}$$

Consequently, the Production Failure Rate metric associated with Line Card of type
$j$ can be written as:

$$\mathrm{PFR}_j = \frac{1}{M_j}$$

8 MTBO Application for Other Cases

MTBO application is illustrated with the following examples.

8.1 Set of Uniform Devices

Consider a set of uniform devices/boards such as power amplifiers in the UMTS nodeB. Let $N$ be the total number of amplifiers, and suppose there were $n$ outages of these amplifiers during a time interval with a duration of $T$ hours. We count outages with automatic recovery as well as those outages where the failed amplifier was replaced to restore the service. Then:

$$\mathrm{MTBO} = \frac{N \, T}{n}$$

One can expec