INTERNATIONAL TELECOMMUNICATION UNION

ITU-T P.85 (06/94)
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU

TELEPHONE TRANSMISSION QUALITY
SUBJECTIVE OPINION TESTS

A METHOD FOR SUBJECTIVE PERFORMANCE ASSESSMENT OF THE QUALITY OF SPEECH VOICE OUTPUT DEVICES

ITU-T Recommendation P.85
(Previously “CCITT Recommendation”)

FOREWORD

The ITU-T (Telecommunication Standardization Sector) is a permanent organ of the International Telecommunication Union (ITU). The ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.

The World Telecommunication Standardization Conference (WTSC), which meets every four years, establishes the topics for study by the ITU-T Study Groups which, in their turn, produce Recommendations on these topics.

The approval of Recommendations by the Members of the ITU-T is covered by the procedure laid down in WTSC Resolution No. 1 (Helsinki, March 1-12, 1993).

ITU-T Recommendation P.85 was prepared by ITU-T Study Group 12 (1993-1996) and was approved under the WTSC Resolution No. 1 procedure on the 21st of June 1994.

NOTE

In this Recommendation, the expression “Administration” is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.

© ITU 1994

All rights reserved. No part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from the ITU.

CONTENTS

1 Scope
2 Assessment method
2.1 General
2.2 Main features of the recommended method
3 Test preparation
3.1 Speech material
3.2 Source conditions
3.3 Stimulus preparation
4 Design of experiment
4.1 Subject task
4.2 Rating scales
4.3 Experimental design
4.4 Listening test procedure
5 Statistical analysis and reporting the results
6 Other methods
Annex A - Messages
Annex B - Response sheets
Annex C - Evaluation of synthetic speech: instructions for listeners
References
Bibliography

SUMMARY

Various services providing vocal answers related to telephone directory inquiries, weather forecast, mail order, etc., are now available to PSTN users using vocal servers. As the speech messages are produced by machines, they may suffer from some impairment. In this Recommendation a method is defined for subjective performance assessment of the quality of speech of voice output devices. This method allows several systems to be compared with one another. It will be useful for system designers and service providers for checking the quality of their products.

This method is of the listening test type. Messages are presented aurally to subjects. The subjects express their opinion on one or more rating scales after having answered specific questions on the information contained in the messages. The results are measures of the perceived quality in several aspects, which makes it possible to compare the effectiveness of different speech synthesis systems.

Recommendation P.85

A METHOD FOR SUBJECTIVE PERFORMANCE ASSESSMENT OF THE QUALITY OF SPEECH VOICE OUTPUT DEVICES

(Geneva, 1994)

1 Scope

Voice servers are now available for Public Switched Telephone Network subscribers. These devices make use either of stored announcements or of synthetic speech. Synthetic speech may be produced from stored segments such as words, syllables or diphones; it may also be produced by synthesis by rule, e.g. formant synthesis. In all cases of signal processing
such as digital compression of the signal, together with sound processing such as concatenation of segments and variation of pitch, intensity and segment duration, a noticeable impairment of speech quality may occur.

This Recommendation, based on Recommendation P.80 and specific experiments [1], [2], [3], defines a testing method for evaluating the subjective quality of synthetic speech. Some adaptation of the method may be needed, depending on the type of system which is being evaluated. The method takes into account both the performance and the attitudes of the users. The attitudes are assessed by the use of multiple scales. The Recommendation covers both overall system performance and the application to specific tasks. Two examples of application are provided in Annex A.

This Recommendation is intended to describe a method for obtaining overall evaluations from users about the acoustic output of speech production devices. Procedures for evaluating specific components of text-to-speech systems (e.g. text transcription into phonetic units, etc.) are currently under study.

2 Assessment method

2.1 General

The recommended methods for assessing telephony speech quality described in Recommendation P.80 and in 2.5 (Opinion tests) of the 2nd edition of the Handbook on Telephonometry [4] can be applied to the assessment of synthetic speech. The use of multiple opinion scales improves the description of listeners' perception. Since synthetic speech may need some effort to be understood, the test is designed so that the subjects must pay attention to the information contained in messages before expressing their opinions.

2.2 Main features of the recommended method

During a test a number of different voice sources will be presented aurally, so that the subjects' opinions related to a given
source may be obtained in relation to other sources. The sources will be synthesis systems as well as reference conditions (these may include natural speech corrupted with some calibrated degradation and known synthesis systems).

Subjects are asked to express their opinion using one or more 5-point opinion scales, as in the Absolute Category Rating (ACR) or Degradation Category Rating (DCR) method of Recommendation P.80. In addition to the overall quality scale, other scales measuring listening effort, voice pleasantness, etc., can be used.

The messages transmitted by the systems should be related to practical applications. In practice different applications will require different test sessions.

Each message is presented twice. During the first presentation subjects answer specific questions on the information contained in a message; then subjects judge the speech quality by expressing their opinion on one or more rating scales during the second presentation.

3 Test preparation

3.1 Speech material

The messages should be long enough so that the subjects have time to reproduce the essential content on the first response sheet and also to give their opinion using the rating scales on the second sheet. A duration of 10 to 30 seconds per message is recommended.

Each message should consist of a fixed part which is specific to the task and a variable part which differs between pairs of presentations. The messages should be designed so that the predictability of the variable part does not differ significantly from one message to another. In Annex A some examples of such messages are given. Other samples with different degrees of difficulty (load on short-term memory) may be used.

3.2 Source conditions

If possible, at least five different sources are
recommended, depending on the systems to be tested, the applications involved and the experimental design. Among these sources it is recommended to use at least one natural voice (male or female according to the test systems). The natural voice(s), degraded with a multiplicative noise conforming to Recommendation P.81 (see B.2.3/P.80, “Reference conditions”), should be used as reference. However, research in progress suggests that other degradations may be more suitable for the evaluation of synthetic voices, i.e. T-Reference System [6] or Time and Frequency Warping (TFW) [7].

3.3 Stimulus preparation

This subclause is the same as B.1/P.80 (Source recordings), except that a microphone with a flat frequency response should be used for the recording of the natural voice.

4 Design of experiment

4.1 Subject task

Subjects are given response sheets together with the test instructions. They are requested
to use two sheets per message: one sheet is used for reproducing information contained in the message; the other is used for obtaining the subjects' responses on a number of opinion scales.

4.2 Rating scales

The recommended rating scales are:
- overall impression (type I and type Q questionnaires)
- listening effort (type I questionnaires)
- comprehension problems (type I questionnaires)
- articulation (type I questionnaires)
- pronunciation (type Q questionnaires)
- speaking rate (type Q questionnaires)
- voice pleasantness (type Q questionnaires)
- acceptance (type I and type Q questionnaires)

The wording of the questions and the scaling grades are presented in Annex B.

4.3 Experimental design

4.3.1 The four factors are: source condition, message, order of presentation, group of subjects.

4.3.2 Graeco-Latin squares (GLs) should be used if the number of source conditions is sufficient, i.e. seven or more. Within a session, the messages are related to one application only. Similar but different messages should be used for the necessary replications.

4.3.3 When a message has been listened to twice, it shall not be used again.

4.3.4 If all the scales are used, a session will be divided into two blocks, each block corresponding to a type I or type Q questionnaire (see Annex B). If GLs are used, each of the two blocks of a session shall be organized according to two different GLs.

4.3.5 A visit may consist of one or several sessions. Before the main sessions, a training session should be arranged. In the training session, at least six messages should be presented over sources that are sufficiently different to cover the quality range encountered in the test.

4.3.6 If GLs are used, the number of subjects should be at least 4 × GL-dimension (i.e. at least four subjects in each group).

4.3.7 Typical time between two presentations in a pair may be eight seconds, and 20 seconds between pairs, but will depend on the actual message duration.

4.3.8 A visit may last 40 to 60 minutes, including instructions, preliminaries and pauses.

4.3.9 If natural voices are used, one of them should be included in the training session.

4.4 Listening test procedure

4.4.1 Listening environment - Same as B.4.1/P.80.

4.4.2 Listening system - Same as B.4.2/P.80. All sources should be band-pass filtered in the same way (according to the application, e.g. 300-3400 Hz).

4.4.3 Listening level - The target is to present the messages at the preferred level for synthetic speech. If this is not known, the preferred level for coded speech (79 dB SPL, -15 dBPa, see 2.5.8.1 of the new version of the Handbook on Telephonometry) should be used. If possible, one or more test blocks should be presented to the same subjects at two additional levels, one above and one below the preferred level.

4.4.4 Listeners - Same as B.4.4/P.80.

4.4.5 Instructions to subjects - Annex C gives an example of instructions to subjects; instructions must be given in their written form. They may also be presented verbally, preferably using a tape.

5 Statistical analysis and reporting the results

It is recommended to summarize the opinion scores of the subjects in the form of histograms and/or cumulative distributions for each rating scale. It is recommended to compare different sources by plotting the cumulative distributions for each source (one diagram per scale)
(see Figure 1). For the overall quality scale and the listening effort scale it is also possible to calculate the mean opinion scores (MOS) for each source condition and each type of message. An analysis of variance and HSD (Honestly Significant Difference) multiple comparison tests should be made for each rating scale for which MOS values have been calculated.

There is no recommended method for analysing the answers on the information content of the messages. However, it may be possible to draw some conclusions if performance (e.g. percentages of correct answers) is noticeably worse for a particular source than for the others.

The results on the acceptance question should be given as percentage values.

The results of the training sessions are not to be used.

FIGURE 1/P.85
MOS cumulative distributions (horizontal axis: MOS)

6 Other methods

Sentence-level tests for the assessment of text-to-speech (TTS) systems are especially useful to quantify the overall intelligibility of a synthesiser. Such a test has been designed in the frame of a multi-lingual European project on synthesiser and recogniser assessment (Esprit “SAM” Project No. 2589): the SUS (“Semantically Unpredictable Sentences”) test, which has been developed principally for performance evaluation of TTS systems under development [5].

Annex A

Messages

(This annex forms an integral part of this Recommendation)

This annex gives examples of messages. These examples are based on the experiment described in [3]. Two applications were involved in this experiment: mail order shopping (M) and railway traffic information (R). Three messages are given for each application.

M1: Miss Robert, the running shoes colour: white, size: 11, reference: 501-97-52, price: 319 francs, will be delivered to you in 1 week.

M2: Mr. Johnson, the multistandard TV set with remote control, 36 cm screen, reference: 811-61-32, price: 2 492 francs, will be delivered to you in 3 weeks.

M3: Mr. Moore, the electric drill D162, power: 550 watts, 2 speeds, reference: 481-20-30, price: 499 francs, will be delivered to you in 2 weeks.

R1: The train number 9783 from Glasgow will arrive at 9:24, platform number 3, track G.

R2: The train number 7826 to Ipswich will leave at 13:20, platform number 9, track A.

R3: The train number 4320 from Birmingham will arrive at S:44, platform 2, track C.

Annex B

Response sheets

(This annex forms an integral part of this Recommendation)

The following figures give examples of response sheets. Figures B.1 and B.2 are related to the same applications as in Annex A. See Figures B.3 and B.4.

Name: ______
Name of item (1-3 words): ______
Reference number: ______
Price: ______ francs
Availability: ______ weeks

FIGURE B.1/P.85
The five tasks related to a mail order shopping application

Train number: ______
To or from: ______
Time: ______
Platform: ______
Track: ______

FIGURE B.2/P.85
The five tasks related to a railway traffic information application

Listening effort

How would you describe the effort you were required to make in order to understand the message?

Complete relaxation possible: no effort required
Attention necess
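
The summaries recommended in clause 5 (histograms, cumulative distributions, MOS and acceptance percentages) can be sketched in a few lines of Python. This is an illustration only and not part of the Recommendation: the function names, the source labels and the sample votes are hypothetical, and the cumulative distribution is assumed here to be the percentage of votes at or below each grade of a 5-point scale.

```python
from collections import Counter

GRADES = (1, 2, 3, 4, 5)  # assumed 5-point opinion scale


def histogram(scores):
    """Number of votes given to each grade of the scale."""
    if any(s not in GRADES for s in scores):
        raise ValueError("scores must lie on the 5-point scale")
    counts = Counter(scores)
    return {g: counts.get(g, 0) for g in GRADES}


def cumulative_distribution(scores):
    """Percentage of votes at or below each grade (assumed convention)."""
    hist = histogram(scores)
    total = len(scores)
    running = 0
    result = {}
    for g in GRADES:
        running += hist[g]
        result[g] = 100.0 * running / total
    return result


def mean_opinion_score(scores):
    """Arithmetic mean of the votes for one source condition."""
    return sum(scores) / len(scores)


def acceptance_percentage(answers):
    """Share of 'yes' answers (True) to the acceptance question."""
    return 100.0 * sum(answers) / len(answers)


# Hypothetical votes on one rating scale, grouped by source condition.
votes = {
    "natural_reference": [5, 4, 5, 4, 4, 5, 3, 4],
    "synthesiser_A": [3, 2, 4, 3, 3, 2, 3, 4],
}

for source, scores in votes.items():
    print(source)
    print("  histogram :", histogram(scores))
    print("  cumulative:", cumulative_distribution(scores))
    print("  MOS       :", round(mean_opinion_score(scores), 2))
```

The analysis of variance and the HSD multiple comparison tests are outside the scope of this sketch; they would normally be run with a statistics package on the same per-source score lists. Votes from training sessions should be excluded before calling these functions.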
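
The Graeco-Latin square design mentioned in 4.3.2 can be illustrated with a short Python sketch. It uses one well-known construction that works for any odd number of source conditions n (for example 7); this is an assumption made for illustration, not a construction prescribed by the Recommendation. The function names, and the choice of rows for groups of subjects and columns for order of presentation, are likewise hypothetical.

```python
def graeco_latin_square(n):
    """Return an n x n grid of (source, message) index pairs.

    Cell (i, j) holds ((i + j) mod n, (2i + j) mod n). For odd n both
    components form Latin squares and every pair occurs exactly once.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("this simple construction needs an odd n >= 3")
    return [[((i + j) % n, (2 * i + j) % n) for j in range(n)]
            for i in range(n)]


def is_graeco_latin(square):
    """Check the Latin property of both components and their orthogonality."""
    n = len(square)
    full = set(range(n))
    for k in (0, 1):
        for i in range(n):
            row = {square[i][j][k] for j in range(n)}
            col = {square[j][i][k] for j in range(n)}
            if row != full or col != full:
                return False
    pairs = {cell for row in square for cell in row}
    return len(pairs) == n * n


square = graeco_latin_square(7)
assert is_graeco_latin(square)

# Rows: groups of subjects. Columns: order of presentation.
# Each cell names the source condition (S) and the message (M) to play.
for group, row in enumerate(square, start=1):
    cells = " ".join(f"S{s + 1}/M{m + 1}" for s, m in row)
    print(f"group {group}: {cells}")
```

With such a layout every group hears each source condition and each message exactly once, and each source is paired with each message exactly once across the groups, which balances the four factors listed in 4.3.1. A session with two blocks (4.3.4) would need a second, different square.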