TECHNICAL REPORT T1.TR.80-2003

Technical Report on Descriptors for User-Perceived Impairments in Speech Over Voice-Over-Internet-Protocol (VoIP) Networks

Prepared by T1A1.3 Working Group on Performance of Networks and Services

Problem Solvers to the Telecommunications Industry

A Word from ATIS and
Committee T1

Established in February 1984, Committee T1 develops technical standards, reports and requirements regarding interoperability of telecommunications networks at interfaces with end-user systems, carriers, information and enhanced-service providers, and customer premises equipment (CPE). Committee T1 is sponsored by ATIS and is accredited by ANSI.

Published by
Alliance for Telecommunications Industry Solutions
1200 G Street, NW, Suite 500
Washington, DC 20005

Committee T1 is sponsored by the Alliance for Telecommunications Industry Solutions (ATIS) and accredited by the
American National Standards Institute (ANSI).

Copyright 2003 by Alliance for Telecommunications Industry Solutions
All rights reserved.

No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission of the publisher. For information contact ATIS at 202.628.6380. ATIS is online at . Printed in the United States of America.

Alliance for Telecommunications Industry Solutions
Approved April 2003

Abstract

This Technical Report (TR) documents the descriptors listeners use in characterizing how speech sounds when it has been modified by passing through an Internet Protocol (IP) network in which data packets have been lost or their timing has been disturbed. This TR demonstrates how the
effects of the loss and jitter impairments sound through the use of embedded audio files. A standard set of speech recordings was modified by passing it through G.711 and G.729 codecs produced by two different manufacturers, and was further modified by the NIST Net network emulator. The emulator allows packet loss and packet delay variation to be manipulated parametrically. The modified speech recordings were subjectively evaluated by consumers (i.e., non-expert judges). The consumers described the sound of the speech samples from the various combinations of codec and network performance, and a set of verbal descriptors was derived. A second set of consumers rated the speech recordings with respect to the verbal descriptors. A suite of 10 ratings of each speech sample is presented along with its recording.

Foreword

This Technical Report (TR) is entitled
Descriptors for User-Perceived Impairments in Speech Over Voice-Over-Internet-Protocol Networks. Detailed information on the derivation of the descriptors used in this TR can be found in contribution T1A1.1/2001-044, Verbal Descriptors for VoIP Speech Sounds.

Future control of this document will reside with Accredited Standards Committee Telecommunications, T1. This control of additions to the specification, such as protocol evolution, new applications, and operational requirements, will permit compatibility among U.S. networks. Such additions will be incorporated in an orderly manner with due
consideration to the ITU-T layered model principles, conventions, and functional boundaries.

Suggestions for improvement of this technical report will be welcome. These should be sent to the Alliance for Telecommunications Industry Solutions, T1 Secretariat, 1200 G Street, NW, Suite 500, Washington,
DC 20005.

Working Group T1A1.3 on Performance of Networks and Services (of Technical Subcommittee T1A1 on Performance and Signal Processing) developed this TR. Over the course of its development, the following individuals participated in the Working Group's discussions and made significant contributions to the technical report: Greg Cermak wrote and edited this TR, with comments and encouragement provided by Garry Couch, John Grigg, Neal Seitz, Stephen Voran, and Randy Wohlert.

Table of Contents

0 INTRODUCTION
1 SCOPE, PURPOSE, AND APPLICATION
1.1 SCOPE
1.2 PURPOSE
1.3 APPLICATION
2 NORMATIVE REFERENCES
3 DEFINITIONS
4 ABBREVIATIONS, ACRONYMS, AND SYMBOLS
5 SUBJECTIVE DESCRIPTORS FOR VOIP IMPAIRMENTS
5.1 BRIEF DESCRIPTION OF SPEECH SAMPLES
5.2 BRIEF DESCRIPTION OF IMPAIRMENT DESCRIPTOR DATA
5.3 IMPAIRMENT SAMPLES AND CORRESPONDING DESCRIPTORS
A INFORMATIVE REFERENCES
B DESCRIPTION OF SPEECH SAMPLES AND VERBAL DESCRIPTOR DATA
B.1 PRODUCTION OF SPEECH SAMPLES
B.2 COLLECTION OF VERBAL DESCRIPTOR DATA
B.2.1 INTERVIEWS

Table of Figures

FIGURE 1 - IMPAIRMENT DESCRIPTORS FOR G.729 CODEC OF MANUFACTURER A AT 0%, 1%, AND 5% PACKET LOSS AND NO JITTER
FIGURE 2 - IMPAIRMENT DESCRIPTORS FOR G.711 CODEC OF MANUFACTURER B AT 0%, 1%, AND 5% PACKET LOSS AND 0 OR 50 MS JITTER
FIGURE 3 - IMPAIRMENT DESCRIPTORS FOR G.729 CODEC OF MANUFACTURER B AT 0 MS, 50 MS, AND 100 MS JITTER AND 0% OR 1% PACKET LOSS
FIGURE 4 - IMPAIRMENT DESCRIPTORS FOR
G.711 CODEC OF MANUFACTURER A AT 0 MS, 50 MS, AND 100 MS JITTER AND 0% OR 1% PACKET LOSS
FIGURE 5 - IMPAIRMENT DESCRIPTORS FOR G.729 CODEC OF MANUFACTURER A AND G.711 CODEC OF MANUFACTURER B AT 0% PACKET LOSS AND 0 MS JITTER
FIGURE 6 - IMPAIRMENT DESCRIPTORS FOR G.729 CODEC OF MANUFACTURER A AND G.711 CODEC OF MANUFACTURER B AT 1% PACKET LOSS AND 50 MS JITTER
FIGURE 7 - IMPAIRMENT DESCRIPTORS FOR G.729 CODEC OF MANUFACTURER A AND G.711 CODEC OF MANUFACTURER B AT 5% PACKET LOSS AND 100 MS JITTER
FIGURE 8 - STAGES IN PRODUCTION OF SPEECH SAMPLES
FIGURE B.1 - RATING SCALE USED IN APPLYING SUBJECTIVE DESCRIPTORS TO SPEECH SAMPLES

0 Introduction

Increasing volumes of voice traffic are being carried over packet networks using the Internet Protocol (VoIP). The quality of speech over IP networks can be measured objectively, as in ITU-T Recommendation P.862, or subjectively, as in ITU-T Recommendation P.800. What the objective and subjective measurements have in common is that they measure the overall quality of a speech sample transmitted over a network, typically using a single number. Transmitted speech samples with qualitatively different-sounding impairments could therefore receive the same overall quality score under either the objective or the subjective metric.

The work reported in the current Technical Report (TR) is aimed at beginning the development of more detailed descriptive terms for impairments of speech samples carried over IP networks. A library of speech samples was created by operating on an original unimpaired set of recorded samples using an IP network emulator at various settings. Each combination of settings produced a set of output speech samples. Human observers, in a series of one-on-one sessions, described how the various speech samples sounded, qualitatively. A candidate set of descriptors was derived from this collection of qualitative terms. Another set of observers rated the speech samples with respect to the candidate descriptors; each speech sample was thus associated with a battery of quantitative descriptors. The original and impaired speech samples and the corresponding batteries of descriptors are presented in this TR.

The particular impairments presented here are only a sample of all possible impairments. The impairments represent variations in amount of packet loss, different amounts of variability in the delay between packets (“jitter”), and the interaction of these impairments with two codecs (G.711 and G.729A) from each of two different manufacturers. It would be helpful if this TR were supplemented and updated using data derived from a wider array of impairments. Annex B describes the speech samples and the descriptors in greater detail.

1 Scope, Purpose, and Application

1.1 Scope

The results presented here should apply to any IP network whose performance can be characterized by packet loss and packet jitter. It is not known how well the present results apply when packet loss is “bursty” rather than random. The present results do not apply to the performance impairments echo and delay. The present results may also not apply to significantly different codecs than
those employed here, and especially to different approaches for concealing or correcting packet loss. It is hoped that further work with bursty packet loss, other codecs, and wireless networks will supplement the work presented here.

1.2 Purpose

The purpose of this technical report is to advance a vocabulary for describing the sounds of qualitatively different impairments to speech carried over IP networks.

1.3 Application

The vocabulary presented here can be used to describe impairments to transmitted speech caused by IP networks and network components such as gateways and
codecs. The vocabulary is in the words of consumers, so its first use might be in aiding consumers to describe the sounds of certain VoIP impairments. This non-technical vocabulary might also be used in conversations among network operators, network users, and equipment manufacturers. When more fully developed, the vocabulary might be used in specifying requirements and service level agreements. For relationships between the current set of descriptors and traditional measures such as mean opinion score, see contribution T1A1.1/2001-044, Verbal Descriptors for VoIP Speech Sounds.

2 Normative References

Cermak, G. W. Verbal descriptors for VoIP speech sounds. Contribution T1A1.1/2001-044. Standards Committee T1, Telecommunications. 2001.¹

3 Definitions

3.1 Jitter: In this technical report, “jitter” is controlled by the NIST Net emulator as the standard deviation of the distribution of packet delay
times, in milliseconds. Note that the emulator produces delays; when producing delays, sampling from a distribution may be convenient. Current proposals for monitoring delays in a network find a different definition convenient (see, for example, contribution T1A1/2001-011, Proposal for Delay Variation Parameters in Rec. Y.1540).

3.2 Packet Loss: The packet loss process in the NIST Net emulator, as used in this study, selects the packets to be lost at random, with each packet having the same probability of being lost.

3.3 Packet Reordering: Packets of speech signal are produced in
a specific order. If the simulated jitter is large enough, the order of the packets at the receiving end of the network will differ from the original order of the packets. This is because the variable delays are independently selected on a packet-by-packet basis, and packets are emitted from the simulator using a schedule derived from each individual packet's arrival time plus the simulated delay.

4 Abbreviations, Acronyms, and Symbols

ANSI    American National Standards Institute
ATIS    Alliance for Telecommunications Industry Solutions
IEEE    Institute of Electrical and Electronics Engineers
IP      Internet Protocol
ITU-T   International Telecommunication Union Telecommunication Standardization Sector
NIST    National Institute of Standards and Technology
PCM     Pulse code modulation digital telephony signal
VoIP    Voice over Internet Protocol network
WAV     Standard file format for sound files

¹ This document is available at .

5 Subjective Descriptors for VoIP Impairments

5.1 Brief description of speech samples

Eight recorded samples of the “Harvard sentences” (IEEE 1969) formed the basis on which the set of impaired samples was constructed. Four of the samples are in female voices; four are in male voices. The speech samples were designed to be phonetically balanced. Each sample is in the form of a 128 kbyte digital file eight seconds long. The original analog sounds had been digitized at 8 kHz at 16 bits per sample.

The set of original recordings was operated on by passing
them through an IP network that included codecs and a network emulator. Four codecs were used: two manufacturers' versions of the G.711 and G.729A codecs. The network emulator was set at one of three levels of packet loss and one of three levels of jitter. (Those levels were 0.0%, 1.0%, or 5.0% packet loss, and 0, 50, or 100 ms jitter on the NIST Net emulator.) Of the 36 possible combinations of emulator settings and codecs, 24 were actually used for the present report. Further detail concerning the recorded speech samples is available in Annex B and in T1A1.7/99-011 and T1A1.1/2001-044.

It is
important to realize that the impairments which appear in individual speech samples were statistically generated for each sample. The eight samples that received the “same” configuration of treatment variables (in codec and emulator) will therefore all sound somewhat different. In the present report, examples of impairments are chosen to be distinctive. For a statistical analysis of typical impairments, see T1A1.1/2001-044.

5.2 Brief description of impairment descriptor data

The examples of impairments presented in 5.3 are represented by combinations of 10 descriptors plotted on graphs. The descriptors are non-technical terms that consumers used to describe impairments in the set of speech samples under consideration here. The graphed quantities are average numerical ratings by a set of consumers; an elemental rating was a consumer's judgment of the degree to which a single descriptor applied
to the set of eight speech samples under a given impairment condition. Further detail concerning the descriptor data is available in Annex B and in T1A1.1/2001-044.

The labels on the graphs in 5.3 are shortened versions of the descriptors that had been used by the set of consumers in their judgments. Graph labels and the corresponding descriptors are given below:

Distortion: Distortion comes and goes.
Dropped syllables: Syllables are dropped/clipped off.
Lost words: Interrupted/lost words (individual words, not all of them).
Fading: Sound level fades/muffled.
Whooshy: Whooshy/breathy.
Garbled: Garbled.
Slurred: Slurred/fuzzy.
Warble: Warbly/quavery/quivery.
Choppy: Choppy/jerky.
Static: Staticky/crackly.

5.3 Impairment samples and corresponding descriptors

Packet loss is varied while holding jitter constant, Example 1: G.729 codec from Manufacturer A. Packet loss is set at values of 0%,
1%, and 5% on the NIST Net emulator; jitter is set at 0 ms. The two impaired waveforms correspond to female and male speakers.

Packet loss = 0%, jitter = 0 ms: A72900f3.wav, A72900m3.wav
Packet loss = 1%, jitter = 0 ms: A72901f3.wav, A72901m3.wav
Packet loss = 5%, jitter = 0 ms: A72905f3.wav, A72905m3.wav

[Graph: amount of impairment (scale 0 to 5) by type of impairment (distortion, dropped syllables, lost words, fading, whooshy, garbled, slurred, warble, choppy, static) for conditions A72900, A72901, and A72905.]

Figure 1 - Impairment descriptors for G.729 codec of Manufacturer A at 0%, 1%, and 5% packet loss an