International Telecommunication Union

ITU-T P.1401
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU
(07/2012)

SERIES P: TERMINALS AND SUBJECTIVE AND OBJECTIVE ASSESSMENT METHODS
Statistical analysis, evaluation and reporting guidelines of quality measurements

Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models

Recommendation ITU-T P.1401

ITU-T P-SERIES RECOMMENDATIONS
TERMINALS AND SUBJECTIVE AND OBJECTIVE ASSESSMENT METHODS

Vocabulary and effects of transmission parameters on customer opinion of transmission quality    Series P.10
Voice terminal characteristics    Series P.30, P.300
Reference systems    Series P.40
Objective measuring apparatus    Series P.50, P.500
Objective electro-acoustical measurements    Series P.60
Measurements related to speech loudness    Series P.70
Methods for objective and subjective assessment of speech quality    Series P.80, P.800
Audiovisual quality in multimedia services    Series P.900
Transmission performance and QoS aspects of IP end-points    Series P.1000
Communications involving vehicles    Series P.1100
Models and tools for quality assessment of streamed media    Series P.1200
Telemeeting assessment    Series P.1300
Statistical analysis, evaluation and reporting guidelines of quality measurements    Series P.1400

For further details, please refer to the list of ITU-T Recommendations.

Rec. ITU-T P.1401 (07/2012)

Recommendation ITU-T P.1401

Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models

Summary

A stable and self-sustained statistical evaluation procedure is required in the development of objective quality algorithms. This is required regardless of whether the algorithms will be used for estimating subscriber perception of voice, video, audio or multimedia quality. Recommendation ITU-T P.1401 presents a framework for the statistical evaluation of objective quality algorithms regardless of the assessed media type.

History

Edition    Recommendation    Approval      Study Group
1.0        ITU-T P.1401      2012-07-14    12

FOREWORD

The International Telecommunication Union (ITU) is the United Nations specialized agency in the field of telecommunications, information and communication technologies (ICTs). The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.

The World Telecommunication Standardization Assembly (WTSA), which meets every four years, establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on these topics.

The approval of ITU-T Recommendations is covered by the procedure laid down in WTSA Resolution 1.

In some areas of information technology which fall within ITU-T's purview, the necessary standards are prepared on a collaborative
basis with ISO and IEC.

NOTE

In this Recommendation, the expression "Administration" is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.

Compliance with this Recommendation is voluntary. However, the Recommendation may contain certain mandatory provisions (to ensure, e.g., interoperability or applicability) and compliance with the Recommendation is achieved when all of these mandatory provisions are met. The words "shall" or some other obligatory language such as "must" and the negative equivalents are used to express requirements. The use of such words does not suggest that compliance with the Recommendation is required of any party.

INTELLECTUAL PROPERTY RIGHTS

ITU draws attention to the possibility that the practice or implementation of this Recommendation may involve the use of a claimed Intellectual Property Right. ITU takes no position concerning the evidence, validity or applicability of claimed Intellectual Property Rights, whether asserted by ITU members or others outside of the Recommendation development process.

As of the date of approval of this Recommendation, ITU had not received notice of intellectual property, protected by patents, which may be required to implement this Recommendation. However, implementers are cautioned that this may not represent the latest information and are therefore strongly urged to consult the TSB patent database at http://www.itu.int/ITU-T/ipr/.

© ITU 2013

All rights reserved.
No part of this publication may be reproduced, by any means whatsoever, without the prior written permission of ITU.

Table of Contents

1 Scope
2 References
3 Definitions
4 Abbreviations and acronyms
5 Conventions
6 Subjective test and objective algorithms
    6.1 Aspects related to subjective testing
    6.2 Aspects related to objective algorithms
7 Evaluation framework
    7.1 Data preparation
    7.2 Analysis types
    7.3 Prediction on a numerical quality scale
    7.4 Uncertainty of subjective results
    7.5 Statistical evaluation metrics
    7.6 Statistical significance evaluation
    7.7 Statistical evaluation in the context of subjective uncertainty: epsilon insensitive rmse and its statistical significance
    7.8 Statistical evaluation of the overall performance
8 Guidance on algorithm selection
    8.1 Per experiment performance
    8.2 Overall figure of merit
    8.3 Worst performance cases
    8.4 Averaging statistical metrics across experiments
9 Special cases
    9.1 Evaluation of algorithms with more than one output
    9.2 Evaluation of algorithms against pre-defined minimum performance requirements
10 Demonstration cases
Appendix I  Algorithm mapping to the subjective scale
Appendix II  The impact of the third order versus first order mapping
    II.1 Application of third order and first order mappings
    II.2 Gain of third order mapping
Appendix III  Confidence intervals calculation
    III.1 The standard deviation for file-based analysis
    III.2 The standard deviation for condition-based analysis
    III.3 Exceptional cases
Appendix IV  Normality test
Appendix V  Statistical significance of the rmse_tot* across all experiments
Bibliography

Recommendation ITU-T P.1401

Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models

1 Scope

This Recommendation defines methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. This Recommendation can be used to assess any objective model that predicts a subjective judgement of a subjective test procedure. Guidance is provided on the design and cleansing of subjective test data, as well as the statistical metrics for
model selection and characterization. Frameworks, metrics and example procedures are described. Specific procedures, minimum performance requirements or objectives to be used when selecting a model are not provided, as these depend on the scope of the model being assessed and are not part of this Recommendation.

In this Recommendation, the term "sample" refers to any type of media, and the terms "model" and "algorithm" are interchangeable.

2 References

The following ITU-T Recommendations and other references contain provisions which, through reference in this text, constitute provisions of this Recommendation. At the time of publication, the editions indicated were valid. All Recommendations and other references are subject to revision; users of this Recommendation are therefore encouraged to investigate the possibility of applying the most recent edition of the Recommendations and other references listed below. A list of the currently valid ITU-T Recommendations is regularly published. The reference to a document within this Recommendation does not give it, as a stand-alone document, the status of a Recommendation.

[ITU-T G.107]  Recommendation ITU-T G.107 (2011), The E-model: a computational model for use in transmission planning.
[ITU-T G.1070]  Recommendation ITU-T G.1070 (2012), Opinion model for video-telephony applications.
[ITU-T J.247]  Recommendation ITU-T J.247 (2009), Objective perceptual multimedia video quality measurement in the presence of a full reference.
[ITU-T J.341]  Recommendation ITU-T J.341 (2011), Objective perceptual multimedia video quality measurement of HDTV for digital cable television in the presence of a full reference.
[ITU-T P.563]  Recommendation ITU-T P.563 (2004), Single-ended method for objective speech quality assessment in narrow-band telephony applications.
[ITU-T P.564]  Recommendation ITU-T P.564 (2007), Conformance testing for voice over IP transmission quality assessment models.
[ITU-T P.800]  Recommendation ITU-T P.800 (1996), Methods for subjective determination of transmission quality.
[ITU-T P.862]  Recommendation ITU-T P.862 (2001), Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.
[ITU-T P.862.1]  Recommendation ITU-T P.862.1 (2003), Mapping function for transforming P.862 raw result scores to MOS-LQO.
[ITU-T P.863]  Recommendation ITU-T P.863 (2011), Perceptual objective listening quality assessment.
[ITU-T P.910]  Recommendation ITU-T P.910 (2008), Subjective video quality assessment methods for multimedia applications.
[ITU-T P.911]  Recommendation ITU-T P.911 (1998), Subjective audiovisual quality assessment methods for multimedia applications.
[ITU-T P-Sup.23]  Recommendation ITU-T P-series Supplement 23 (1998), ITU-T coded-speech database.
[ITU-R BS.1116]  Recommendation ITU-R BS.1116 (1997), Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems.
[ITU-R BT.500]  Recommendation ITU-R BT.500 (2012), Methodology for the subjective assessment of the quality of television pictures.

3 Definitions

None.

4 Abbreviations and acronyms

This Recommendation uses the following abbreviations and acronyms:

CS      Circuit Switch
IMS     Internet protocol Multimedia Subsystem
IPTV    Internet Protocol TV
MOS     Mean Opinion Score
OR      Outlier ratio
rmse    root mean square error
VoIP    Voice over Internet Protocol

5 Conventions

None.

6 Subjective test and objective algorithms

Objective algorithms estimate subscriber perception, and an algorithm's performance is evaluated against subjective test results. For this evaluation to be valid, the subjective tests must be well defined and accurate to avoid misinterpretations of algorithm accuracy.

6.1 Aspects related to subjective testing

For a stable evaluation to take place, a number of subjective tests are required. A group of
subjective tests used in the evaluation of objective algorithms is called a subjective data pool. The subjective data pool must contain subjective tests that adhere to well-established testing procedures, such as those listed below:

Listening speech quality      See [ITU-T P.800]
Video for multimedia          See [ITU-T P.910]
Audiovisual for multimedia    See [ITU-T P.911]
Quality of TV pictures        See [ITU-R BT.500]
Audio and music               See [ITU-R BS.1116]

The rapid development of networks and their myriad services requires that objective quality metrics take into account new technologies (such as codecs, bandwidth), new networks (such as long term evolution (LTE), IP multimedia subsystems (IMS)) and/or new applications (such as mobile/IPTV, video streaming). Objective metrics must therefore cope with new types of media degradations which impact the subscriber's perception of the quality.

First, it is necessary to design subjective tests that can accurately capture the impact of these degradations on a subscriber's perception. These subjective tests require performing comprehensive experiments that are consistent in their results. In the last several years, these types of tests have been developed and used for objective quality evaluation metrics needed to deal with new test conditions such as (re-)buffering in multimedia streaming, super-wideband voice and the evaluation of the combined audio-visual impact that is modelled by multimedia quality evaluation algorithms.

Subjective testing is extensively covered in [b-ITU-T Handbook]. For the purpose of this Recommendation, only the main aspects of subjective testing are discussed. The following aspects are required for accurate evaluation of an objective quality algorithm:

i) It is recommended that voters be naive subjects representing normal subscribers whose perception is estimated by the objective quality models. However, for specific applications such as new codec developments or voice enhancement device evaluations, experienced voters are more suitable.

ii) The number of voters per sample should meet the subjective testing requirements as described in the appropriate Recommendations, such as [ITU-T P.800], [ITU-T P.910], [ITU-T P.911], [ITU-R BT.500] or [ITU-R BS.1116]. Depending on the goal of the prediction (per sample prediction or per condition prediction), a minimum of 24 voters either per sample or per condition is recommended.

iii) The experiments performed, either in the same or different labs, could contain an anchor pool of samples that best represent the particular application under evaluation. This would ensure the experiments' alignment with respect to quality range and distortion types in the experiment, and would maintain consistency/repeatability across experiments and/or labs. However, it should be noted that even when anchor samples are used, a bias between different experiments is common. This is because it is not always possible to include all distortion types in the anchor conditions.

6.2 Aspects related to objective algorithms

There are two main categories of objective algorithms. The first is based on network and device parameters (describing the network using abstract parameters). See [ITU-T P.564] (used as a voice quality evaluation tool), [ITU-T G.107] (used as a voice planning tool) and [ITU-T G.1070] (used as a video telephony planning tool). The second uses the characteristics of the real media signal (e.g., voice, video, audio) to describe the network performance, and the network is considered to be a black box. Such models are often based on perceptual models. The analysis is not restricted to
the media signal itself but can also take into account associated information, for example from the transport layer (often called hybrid models). Both model categories estimate the media quality as subjectively perceived by test users. Regardless of their type, the evaluation procedure stays generally the same. However, the selection process of an algorithm depends on whether the standardization process is defined as a competition between several algorithms or a collaborat