AMERICAN NATIONAL STANDARD FOR TELECOMMUNICATIONS

ATIS-0100020.2008(R2013)

Availability Metric for IP-Based Networks and Services

As a leading technology and solutions development organization, ATIS brings together the top global ICT companies to advance the industry's most-pressing business priorities. Through ATIS committees and forums, nearly 200 companies address cloud services, device solutions, emergency services, M2M communications, cyber security, ehealth, network evolution, quality of service, billing support, operations, and more. These priorities follow a fast-track development lifecycle from design and innovation through solutions that include standards, specifications, requirements, business use cases, software toolkits, and interoperability testing.

ATIS is accredited by the American National Standards Institute (ANSI). ATIS is the North American Organizational Partner for
the 3rd Generation Partnership Project (3GPP), a founding Partner of oneM2M, a member and major U.S. contributor to the International Telecommunication Union (ITU) Radio and Telecommunications sectors, and a member of the Inter-American Telecommunication Commission (CITEL). For more information, visit.

AMERICAN NATIONAL STANDARD

Approval of an American National Standard requires review by ANSI that the requirements for due process, consensus, and other criteria for approval have been met by the standards developer.

Consensus is established when, in the judgment of the ANSI Board of Standards Review, substantial agreement has been reached by directly and materially affected interests. Substantial agreement means much more than a simple majority, but not necessarily unanimity. Consensus requires that all views and objections be considered, and that a concerted effort be made towards their
resolution.

The use of American National Standards is completely voluntary; their existence does not in any respect preclude anyone, whether he has approved the standards or not, from manufacturing, marketing, purchasing, or using products, processes, or procedures not conforming to the standards.

The American National Standards Institute does not develop standards and will in no circumstances give an interpretation of any American National Standard. Moreover, no person shall have the right or authority to issue an interpretation of an American National Standard in the name of the American National Standards Institute. Requests for interpretations should be addressed to the secretariat or sponsor whose name appears on the title page of this standard.

CAUTION NOTICE: This American National Standard may be revised or withdrawn at any time. The procedures of the American National Standards
Institute require that action be taken periodically to reaffirm, revise, or withdraw this standard. Purchasers of American National Standards may receive current information on all standards by calling or writing the American National Standards Institute.

Notice of Disclaimer

... it does not include service unavailability periods outside of such conditions (e.g., service degradation as a result of packet loss or jitter).

It is recognized that communications services can span several network domains, possibly of differing technologies (e.g., 3GPP Access-IP Backbone-PSTN Termination). Once agreements have been reached on service availability metrics and estimation methodologies, discussions on extending availability on an end-to-end basis over all domains and technologies can commence.

Finally, it is recognized that evolving technologies such as MPLS are further transforming IP-based networks.
This document serves as the first in a series of documents that will also examine availability estimation within the context of MPLS-based networks and services.

ATIS-0100020.2008

2 NORMATIVE REFERENCES

The following standards contain provisions which, through reference in this text, constitute provisions of this American National Standard. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this American National Standard are encouraged to investigate the possibility of applying the most recent editions of
the standards indicated below.

[Y.1540] ITU-T Recommendation Y.1540, IP packet transfer and availability performance parameters.¹
[ATIS-Avail-TR] “End to End Service Availability”, ATIS Draft Technical Report, ATIS Contribution PRQC-2007-017R4, July 2007.²
[ATIS-Router-Avail-TR] ATIS-0100016, End to End Service Availability.²
[Y.2171] ITU-T Recommendation Y.2171, Admission Control Priority Levels in Next Generation Networks.¹
[Y.2172] ITU-T Recommendation Y.2172, Service Restoration Priority Levels in Next Generation Networks.¹

3 DEFINITIONS

3.1 Service Affecting Element Outage Duration: Total amount of time that a network element is unavailable to transport IP packets. Note that if network restoration successfully re-routes traffic around the failed element, then the latter is no longer considered to be part of the transporting path; hence, outage duration is the time taken to re-route traffic - see
7.1 for further discussion.

3.2 Element Outage: Failure of a network element or elements that causes a disruption in the transport of an IP service, thus impacting service availability. Such failures can generally be considered as hardware failures (e.g., power supplies, line cards, routers, links, etc.) or software failures (e.g., software protocols) - see 7.2 for further discussion.

4 ACRONYMS

... further analysis would be required to trace the cause back to a software failure. Other forms of software failures could be even more complex. For example, control plane software errors may result in “message storms” causing a significant strain on network resources. Software failures and their impacts on network operations are for further study. Detailed development of NMS capabilities for tracking downtimes arising from hardware and software failures is also for further study.

7.3 Discussion on Service Priority

When there is a network failure, services having a higher admission priority [Y.2171] will be selectively admitted into the network over other services, depending on the availability of adequate resources. Similarly, service flows that have been established will get
re-routed in the case of failed elements depending on the service restoration priority [Y.2172]. Hence, for example, critical services such as the Emergency Telecommunications Service (ETS) are very likely to suffer minimal impact on service availability even under conditions of a regional or national failure.

While the availability metric does not include priority as an explicit parameter, the effect of priority on the availability metric is implicit in the definitions of fraction of service lost and outage duration. A higher admission priority results in a lower fraction of service lost
while a higher restoration priority leads to a decrease in outage duration. Hence, the proposed metric captures the effect of service priority on service availability.

Annex A
(informative)

A HIGH LEVEL DISCUSSION ON ESTIMATING AVAILABILITY IMPACT ON IP-BASED SERVICES FROM NETWORK ELEMENT OUTAGES

This annex provides a high level discussion on how the two factors, downtime and fraction of service lost, can be estimated. The discussions offered are generic in nature. Detailed discussions require specific operational and network management structures and are for further study.

A.1 Element Outage Downtime

Element downtime could potentially be estimated from historical data. This method looks at all possible element failures in an IP network and tracks failure durations D_i for given elements over their life cycle. It also tracks network re-convergence times (time taken to re-route packet flows over alternate paths) for given element failures. This assumes that there is sufficient spare bandwidth to route all new packet flows after network re-convergence (see Figure 1 for redundancies in the backbone).

For example, a complete router failure in the backbone typically results in an Open Shortest Path First (OSPF) protocol re-convergence time of the order of a few minutes. Thus, incoming packet flows may be unable to get proper routing over a period typically less than three minutes - hence, they would be lost. However, subsequent flows would be
successfully routed around the failed element. The result is that downtime in this case equals the re-convergence time, even though the actual element downtime may be longer.

Examining and tracking network re-convergence behaviors over different types of failed elements makes it possible to provide reasonable estimates for downtimes related to these failed elements. This method requires network management tools capable of tracking element status and the state of the network (e.g., fully re-converged), as well as trouble ticket systems that accurately track customer service status. Some examples of
element failures are as follows:

  - Complete Router Failure: Typical re-convergence time is tens of seconds.
  - Partial Router Failure (e.g., Failed Router Card): Typical re-convergence time is a few seconds.
  - Complete Link Failure: Typical re-convergence time is a few seconds.

To complete the discussion, it is important to consider the case where spare bandwidth is insufficient for re-routing all new incoming flows around the failed element. In such a case, a percentage of new incoming packet flows will be lost until the failed element is repaired. The extent of the loss will have to be monitored by operations
systems as well as customer trouble tickets.

One example of this possibility is a complete failure of an access router that is linked to a network gateway through which incoming and outgoing traffic flows originate and terminate. If the gateway is dual-homed to another access router, and there is sufficient link bandwidth to accommodate peak traffic flows across both access routers, then a complete access router failure results in a downtime equal to the Border Gateway Protocol (BGP) re-convergence time. If not, then some percentage of incoming/outgoing traffic flows will be lost and the extent will have to be tracked. Repair-time distributions for such cases can be derived by careful tracking.

A.2 Fraction of Service Lost

The fraction of service lost, f_i, depends on the service under consideration. Specific services such as VoIP and VPN services require careful tracking. Availability over all services can also be of interest to the service provider. Regardless of the specific service (or total traffic over all services) under consideration, the following data is essential for estimating f_i:

  - Network and Router Topology: Connectivity between routers in the IP network.
  - Network Traffic Matrix: Point-to-point traffic patterns between pairs of routers depending on time-of-day and day-of-week.
  - Routing Model: Traffic routing patterns based on OSPF, Label Distribution Protocol (LDP), and BGP.

Typically, this level of data can be derived for total network traffic. Service-specific data can be derived if service-specific details are available.

Sampling methods can be used to estimate the fraction of service lost for specific services such as VPN services. A service provider may have several VPN services for specific “Enterprise Customers”. For each VPN service, the customer-facing ports on access routers are known. To estimate the fraction of service lost for any given VPN, some assumptions need to be made:

  - A VPN service in an IP network is modeled as a set of available paths or “connections” between pairs of customer-facing ports on access routers that link the customer to
    the IP network. In other words, if VPN service traffic can ingress the network on port a on access router A and egress at port z on access router Z, then (A,a)-(Z,z) is assumed to represent an available “connection” for the VPN service.
  - A full mesh of bi-directional “connections” is assumed to be available between all relevant pairs of ingress/egress ports related to the specified VPN service.
  - The traffic distribution on these “connections” is assumed to be symmetric and equal.³ In other words, VPN service traffic on any given “connection” is likely to be the same as that on any other “connection” regardless of time-of-day and day-of-week.

Detection of VPN service packet streams (incoming and outgoing) at any given port could be done via sampling probes. In the event of a network element outage, the sampling probes would not detect the appropriate VPN streams. The lack of appropriate data for a subset of VPN customer-facing ports is therefore an indication that the “connections” comprising these ports have temporarily failed. If j is the number of lost “connections” and J is the total number of “connections” associated with the VPN service, then the fraction of VPN service lost is the ratio j/J. The assumption here is that the sampling techniques are sufficiently optimized.⁴

In the case of MPLS networks, logical Label Switched Paths (LSPs) or Traffic Engineering (TE) tunnels are set up between the end points, and their state (“up” or “down”) can be tracked with Network Management Systems. Such LSPs/tunnels can be weighted according to their bandwidth and the fraction of service lost can then be determined.

³ It is understood that port sizes may not be the same. Appropriate weighting and/or normalization factors will need to be implemented in such cases.
⁴ Discussion on sampling techniques is beyond the scope of this document.

For services such as VoIP transactions, it may be possible to track the number of calls lost due to the element outage and compare that with the time-of-day, day-of-week traffic forecasts to determine the fraction of service lost.
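
The three estimates of fraction of service lost described in A.2 can be illustrated with a minimal Python sketch. It is not part of the standard. The function names, the representation of a port as an (access router, port) pair, and the rule that a “connection” counts as lost when either of its end ports shows no sampled VPN traffic are all assumptions made for illustration. The sketch also assumes equal traffic on every “connection”, as stated above, and caps the VoIP estimate at 1.

```python
from itertools import combinations


def vpn_connections(ports):
    """Full mesh of bi-directional "connections" between customer-facing ports.

    Each port is a hypothetical (access_router, port) pair, e.g. ("A", "a").
    """
    return list(combinations(sorted(set(ports)), 2))


def vpn_fraction_lost(ports, silent_ports):
    """Ratio j/J, assuming a connection is lost if either end port is silent.

    silent_ports: ports where the sampling probes detect no VPN streams.
    """
    connections = vpn_connections(ports)
    if not connections:
        raise ValueError("at least two ports are needed to form a connection")
    silent = set(silent_ports)
    lost = [c for c in connections if c[0] in silent or c[1] in silent]
    return len(lost) / len(connections)


def lsp_fraction_lost(tunnels):
    """Bandwidth-weighted share of LSPs/TE tunnels whose state is "down".

    tunnels: iterable of (bandwidth, state) pairs, state being "up" or "down".
    """
    tunnels = list(tunnels)
    total = sum(bandwidth for bandwidth, _ in tunnels)
    if total <= 0:
        raise ValueError("total tunnel bandwidth must be positive")
    down = sum(bandwidth for bandwidth, state in tunnels if state == "down")
    return down / total


def voip_fraction_lost(calls_lost, calls_forecast):
    """Calls lost during the outage relative to the forecast call volume."""
    if calls_forecast <= 0:
        raise ValueError("forecast call volume must be positive")
    return min(calls_lost / calls_forecast, 1.0)


if __name__ == "__main__":
    ports = [("A", "a"), ("B", "b"), ("C", "c"), ("Z", "z")]
    # Four ports give J = 6 connections; port (Z, z) silent loses j = 3 of them.
    print(vpn_fraction_lost(ports, [("Z", "z")]))
    print(lsp_fraction_lost([(10, "up"), (30, "down"), (60, "up")]))
    print(voip_fraction_lost(25, 500))
```

Where port sizes differ, the equal-traffic assumption no longer holds and each “connection” would need a weight, in the same way the LSP/tunnel estimate weights by bandwidth.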