ITU-T P.800.2 (2016): Mean opinion score interpretation and reporting (Study Group 12)

International Telecommunication Union

ITU-T P.800.2
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU
(07/2016)

SERIES P: TERMINALS AND SUBJECTIVE AND OBJECTIVE ASSESSMENT METHODS
Methods for objective and subjective assessment of speech and video quality

Mean opinion score interpretation and reporting

Recommendation ITU-T P.800.2

ITU-T P-SERIES RECOMMENDATIONS
TERMINALS AND SUBJECTIVE AND OBJECTIVE ASSESSMENT METHODS

  Vocabulary and effects of transmission parameters on customer opinion of transmission quality: Series P.10
  Voice terminal characteristics: Series P.30, P.300
  Reference systems: Series P.40
  Objective measuring apparatus: Series P.50, P.500
  Objective electro-acoustical measurements: Series P.60
  Measurements related to speech loudness: Series P.70
  Methods for objective and subjective assessment of speech quality: Series P.80
  Methods for objective and subjective assessment of speech and video quality: Series P.800
  Audiovisual quality in multimedia services: Series P.900
  Transmission performance and QoS aspects of IP end-points: Series P.1000
  Communications involving vehicles: Series P.1100
  Models and tools for quality assessment of streamed media: Series P.1200
  Telemeeting assessment: Series P.1300
  Statistical analysis, evaluation and reporting guidelines of quality measurements: Series P.1400
  Methods for objective and subjective assessment of quality of services other than speech and video: Series P.1500

For further details, please refer to the list of ITU-T Recommendations.

Rec. ITU-T P.800.2 (07/2016)

Recommendation ITU-T P.800.2
Mean opinion score interpretation and reporting

Summary
Recommendation ITU-T P.800.2 introduces some of the more common types of mean opinion score (MOS) and describes the minimum information that should accompany MOS values to enable them to be correctly interpreted.

History
  Edition  Recommendation  Approval    Study Group  Unique ID*
  1.0      ITU-T P.800.2   2013-05-14  12           11.1002/1000/11934
  2.0      ITU-T P.800.2   2016-07-29  12           11.1002/1000/12973

Keywords
Absolute category rating, ACR, mean opinion score, MOS, objective model,
reporting, subjective experiment.

* To access the Recommendation, type the URL http://handle.itu.int/ in the address field of your web browser, followed by the Recommendation's unique ID. For example, http://handle.itu.int/11.1002/1000/11830-en.

FOREWORD

The International Telecommunication Union (ITU) is the United Nations specialized agency in the field of telecommunications, information and communication technologies (ICTs). The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.

The World Telecommunication Standardization Assembly (WTSA), which meets every four years, establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on these topics.

The approval of ITU-T Recommendations is covered by the procedure laid down in WTSA Resolution 1.

In some areas of information technology which fall within ITU-T's purview, the necessary standards are prepared on a collaborative basis with ISO and IEC.

NOTE: In this Recommendation, the expression “Administration” is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.

Compliance with this Recommendation is voluntary. However, the Recommendation may contain certain mandatory provisions (to ensure, e.g., interoperability or applicability) and compliance with the Recommendation is achieved when all of these mandatory provisions are met. The words “shall” or some other obligatory language such as “must” and the negative equivalents are used to express requirements. The use of such words does not suggest
that compliance with the Recommendation is required of any party.

INTELLECTUAL PROPERTY RIGHTS

ITU draws attention to the possibility that the practice or implementation of this Recommendation may involve the use of a claimed Intellectual Property Right. ITU takes no position concerning the evidence, validity or applicability of claimed Intellectual Property Rights, whether asserted by ITU members or others outside of the Recommendation development process.

As of the date of approval of this Recommendation, ITU had not received notice of intellectual property, protected by patents, which may be required to implement this Recommendation. However, implementers are cautioned that this may not represent the latest information and are therefore strongly urged to consult the TSB patent database at http://www.itu.int/ITU-T/ipr/.

© ITU 2016

All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without the prior written permission of ITU.

Table of Contents
1 Scope
2 References
3 Definitions
  3.1 Terms defined elsewhere
  3.2 Terms defined in this Recommendation
4 Abbreviations and acronyms
5 Conventions
6 Introductory information
7 Subjective MOS values
8 Interpreting MOS values
9 Video considerations
10 Statistical analysis of MOS
11 Objective MOS values
12 Reporting subjective MOS values
13 Reporting objective MOS values
14 Notation
Bibliography

Recommendation ITU-T P.800.2
Mean opinion score interpretation and reporting

1 Scope

This Recommendation introduces some of the more common types of mean opinion score (MOS) and describes the minimum information that should accompany MOS values to enable them to be correctly interpreted. It should be noted that this text does not aim to provide a definitive guide to subjective or objective testing. The bibliography at the end of this Recommendation provides information on more detailed material.

2 References

The following ITU-T Recommendations and other references contain provisions which, through reference in this text, constitute provisions of this Recommendation. At the time of publication, the editions indicated were valid. All Recommendations and other references are subject to revision; users of this Recommendation are therefore encouraged to investigate the possibility of applying the most recent edition of the Recommendations and other references listed below. A list of the currently valid ITU-T Recommendations is regularly published. The reference to a document within this Recommendation does not give it, as a stand-alone document, the status of a Recommendation.

[ITU-T P.800.1] Recommendation ITU-T P.800.1 (2006), Mean Opinion Score (MOS) terminology.

3 Definitions

3.1 Terms defined elsewhere

None.

3.2 Terms defined in this Recommendation

This Recommendation defines the following terms:

3.2.1 condition: One of a set of use cases being evaluated in a subjective experiment; often referred to as a hypothetical reference circuit (HRC) in video experiments.

3.2.2 sub-condition: A subset of a condition defined by a specific characteristic of the use case, e.g., speech material from a particular talker.

3.2.3 subject: A participant in a subjective experiment.

3.2.4 vote: A subject's response to a question in a rating scale for an individual test sample or interaction.

4 Abbreviations and acronyms

This Recommendation uses the following abbreviations and acronyms:

ACR     Absolute Category Rating
DCR     Degradation Category Rating
DMOS    Degradation Mean Opinion Score
HRC     Hypothetical Reference Circuit
MOS     Mean Opinion Score
MUSHRA  Multi-stimulus test with Hidden Reference and Anchor
QCIF    Quarter Common Intermediate Format
SSCQE   Single Stimulus Continuous Quality Evaluation
VGA     Video Graphics Array

5 Conventions

None.

6 Introductory information

Audio and video quality are inherently subjective quantities. This means that the baseline for audio and video quality is the opinion of the user. However, one person's opinion of what is good may be quite different to another person's opinion; neither person is correct, neither person is incorrect.

Before a new audio or video transmission technology is deployed, it is good practice to assess the transmission quality using one or more subjective experiments. The purpose of a subjective experiment is to collect the opinions of multiple people (“subjects”) about the performance of the system for a number of well-defined use cases (“conditions”)¹. The mean opinion score (MOS) for a given condition is simply the average of the opinions (“votes”) collected for that use case.

Objective quality measurement algorithms aim to predict the MOS value that a given input signal would produce in a subjective experiment. Hence, when interpreting an objectively derived MOS value, it is essential to understand the basic design of the experiment being predicted.

There are several different types of MOS value and many different test methodologies for producing them. The purpose of this Recommendation is to give the reader an appreciation of the main points to consider when interpreting MOS values and the minimum information that should accompany MOS values when they are reported.

7 Subjective MOS values

Types of MOS

There is a common misconception that MOS values only pertain to voice
services, but the process of asking subjects to provide their assessment of quality can be just as easily applied to video and general audio services as it can to voice services. It is also possible to ask subjects to rate the overall audiovisual quality of a service. The ITU has produced various standards describing different aspects of subjective testing for video and general audio applications in addition to voice applications, and these are listed in the bibliography.

Subjective experiments may be broadly divided into two types: passive and interactive. In a passive subjective experiment, subjects are presented with pre-recorded test samples representing the conditions of interest. The subjects are asked to passively listen to and/or watch the test material and provide their opinion using the rating scale provided. In an interactive experiment, two or more subjects actively engage in conversation using equipment designed to emulate the use cases of interest. The subjects are often given tasks in order to stimulate conversation and interaction.

Most experiments tend to be passive in nature. However, there are some aspects of user experience, for example, the effects of delay and echo, that only become apparent in conversational scenarios.

¹ In video experiments, conditions are often referred to as hypothetical reference circuits (HRCs).

Test methodology and rating scale

In a subjective experiment, subjects are asked to provide their opinions using a “rating scale”. The purpose of the scale is to translate a subject's quality assessment into a numerical value that can be averaged across subjects and other experimental factors. There are several rating scales in common use, and the relative benefits of different scales are outside the scope of this Recommendation. The most commonly used scale is the 5-point absolute category rating (ACR) scale:

  Excellent  5
  Good       4
  Fair       3
  Poor       2
  Bad        1

The ACR scale is a discrete scale, meaning that the subject's response is limited to one of the five values listed above. However, the averaging process used to combine results from different subjects means that MOS values are not confined to integer values. Some rating scales have more than five discrete labels, while others allow the subject to provide intermediate responses at points between the labels.

The “absolute” part of ACR relates to the
fact that subjects are asked to independently rate each sample. Some rating scales, such as the degradation category rating (DCR) scale, ask for a subject's opinion about the difference between a sample processed through the condition of interest and an unprocessed version of the same sample. The MOS value produced in such an experiment is often called a degradation MOS or DMOS.

In most experimental designs, subjects are asked to rate the quality of short audio or video samples. The duration of such samples is usually in the range of 6 to 10 seconds, as this provides enough time for the subject to form an opinion without introducing any bias towards the end of the sample. It is difficult for a single sample of this duration to represent a whole condition, and hence subjects are typically asked to rate multiple test samples derived from the same use case. For example, in a voice experiment, each network condition under test might be represented with speech samples from three male and three female talkers. This means that MOS values can be produced for the entire condition, by averaging across both subjects and talkers, or for a sub-condition, such as a particular talker or gender of talker.

Some test methods, such as single stimulus continuous quality evaluation (SSCQE), use much longer test samples, and require the subject to continuously update their opinion of quality as the test sample is being played. This results in a time sequence of quality ratings from each subject, rather
than a single opinion value.

Some test methodologies require the subject to answer multiple questions. Not only does this yield more information about the conditions under test, it can be a necessary part of the test design. For example, the ITU-T P.835 test method requires the subject to provide separate opinions about the speech quality and the noise quality of a sample before providing an overall quality score. This process has been found to yield more stable results with noise suppression systems than the single question ACR test method.

It should be noted that some questions may not relate directly to quality, but may address a different aspect of communications; for example, [b-ITU-T P.800] defines a listening effort scale for voice experiments. Similarly, some conversational experiments ask the subject about their experience when talking, rather than when listening.

8 Interpreting MOS values

The following discussion initially focuses on voice MOS values; however, many of the points made in the subsections apply equally to video, audio and audio-video MOS values. The main differences for video are described in the following clause.

The idea that a particular voice codec has a particular MOS score is another common misconception. One source of this misconception is the widespread use of objective quality assessment models, which produce very repeatable results. Such models are designed to predict or estimate the output of subjective experiments; however, for any given codec at a given bit rate, the MOS value obtained in a subjective experiment can vary substantially from experiment to experiment. There are a number of reasons for this. Firstly, the exact MOS val
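
As an illustration only, and not part of the Recommendation, the averaging described in clauses 6 and 7 can be sketched in a few lines of Python. The record layout, the names VOTES and mos_by, and the vote values themselves are assumptions made up for this sketch; a real experiment would supply its own data and would normally report further statistics alongside the mean.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical 5-point ACR votes (1 = Bad ... 5 = Excellent).
# The values are invented for illustration, not taken from any experiment.
# Each record is (subject, condition, talker, vote).
VOTES = [
    ("s1", "cond_A", "male_1", 4),
    ("s2", "cond_A", "male_1", 5),
    ("s1", "cond_A", "female_1", 3),
    ("s2", "cond_A", "female_1", 4),
    ("s1", "cond_B", "male_1", 2),
    ("s2", "cond_B", "male_1", 3),
    ("s1", "cond_B", "female_1", 2),
    ("s2", "cond_B", "female_1", 2),
]


def mos_by(votes, key):
    """Average the votes within each group selected by `key`."""
    groups = defaultdict(list)
    for record in votes:
        vote = record[3]
        # The ACR scale is discrete: only the five labelled values are valid.
        if vote not in (1, 2, 3, 4, 5):
            raise ValueError(f"not a 5-point ACR vote: {vote!r}")
        groups[key(record)].append(vote)
    return {group: fmean(values) for group, values in groups.items()}


# MOS per condition: average across both subjects and talkers.
condition_mos = mos_by(VOTES, key=lambda r: r[1])

# MOS per sub-condition: average across subjects for each talker.
sub_condition_mos = mos_by(VOTES, key=lambda r: (r[1], r[2]))

print(condition_mos)
print(sub_condition_mos)
```

Although every vote is an integer, the resulting means generally are not, which is why MOS values are not confined to the five labelled points of the scale. Grouping by a different key (for example, gender of talker) would give the MOS for a different sub-condition.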
