INTERNATIONAL TELECOMMUNICATION UNION

ITU-T P.851
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU
(11/2003)

SERIES P: TELEPHONE TRANSMISSION QUALITY, TELEPHONE INSTALLATIONS, LOCAL LINE NETWORKS
Methods for objective and subjective assessment of quality

Subjective quality evaluation of telephone services based on spoken dialogue systems

ITU-T Recommendation P.851

ITU-T P-SERIES RECOMMENDATIONS
TELEPHONE TRANSMISSION QUALITY, TELEPHONE INSTALLATIONS, LOCAL LINE NETWORKS

  Vocabulary and effects of transmission parameters on customer opinion of transmission quality: Series P.10
  Subscribers' lines and sets: Series P.30, P.300
  Transmission standards: Series P.40
  Objective measuring apparatus: Series P.50, P.500
  Objective electro-acoustical measurements: Series P.60
  Measurements related to speech loudness: Series P.70
  Methods for objective and subjective assessment of quality: Series P.80, P.800
  Audiovisual quality in multimedia services: Series P.900

For further details, please refer to the list of ITU-T Recommendations.

ITU-T Rec. P.851 (11/2003)

ITU-T Recommendation P.851
Subjective quality evaluation of telephone services based on spoken dialogue systems

Summary
This Recommendation describes methods and procedures for conducting subjective evaluation experiments for telephone services which are based on spoken dialogue systems. The respective systems enable a natural interaction via spoken language and possess speech recognition and interpretation, dialogue management, and speech output capabilities. The set-up and running of appropriate interaction experiments is described, and questionnaires for quantifying the relevant quality dimensions perceived by the user are given.

Source
ITU-T Recommendation P.851 was approved on 13 November 2003 by ITU-T Study Group 12 (2001-2004) under the ITU-T Recommendation A.8 procedure.

Keywords
Dialogue management, interaction parameter, speech generation, speech recognition, speech understanding, spoken dialogue system, subjective evaluation.

FOREWORD
The International Telecommunication Union (ITU) is the United
Nations specialized agency in the field of telecommunications. The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.

The World Telecommunication Standardization Assembly (WTSA), which meets every four years, establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on these topics.

The approval of ITU-T Recommendations is covered by the procedure laid down in WTSA Resolution 1.

In some areas of information technology which fall within ITU-T's purview, the necessary standards are prepared on a collaborative basis with ISO and IEC.

NOTE: In this Recommendation, the expression “Administration” is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.

Compliance with this Recommendation is voluntary. However, the Recommendation may contain certain mandatory provisions (to ensure, e.g., interoperability or applicability) and compliance with the Recommendation is achieved when all of these mandatory provisions are met. The words “shall” or some other obligatory language such as “must” and the negative equivalents are used to express requirements. The use of such words does not suggest that compliance with the Recommendation is required of any party.

INTELLECTUAL PROPERTY RIGHTS
ITU draws attention to the possibility that the practice or implementation of this Recommendation may involve the use of a claimed Intellectual Property Right. ITU takes no position concerning the evidence, validity or applicability of claimed Intellectual Property Rights, whether asserted by ITU members
or others outside of the Recommendation development process.

As of the date of approval of this Recommendation, ITU had not received notice of intellectual property, protected by patents, which may be required to implement this Recommendation. However, implementors are cautioned that this may not represent the latest information and are therefore strongly urged to consult the TSB patent database.

© ITU 2004
All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without the prior written permission of ITU.

CONTENTS

1 Scope
2 References
3 Abbreviations
4 Introduction
  4.1 Tasks and components of a spoken dialogue system
  4.2 Telephone interaction with a spoken dialogue system
  4.3 Quality aspects and influencing factors
  4.4 Subjective evaluation methods
5 Spoken dialogue system characterization
  5.1 Agent factors
  5.2 Task factors
  5.3 User factors
  5.4 Environmental factors
  5.5 Contextual factors
6 Experimental set-up
  6.1 System set-up and Wizard-of-Oz simulation
  6.2 Test scenarios
  6.3 Test subjects
7 Questionnaires
  7.1 Questions related to the user's background
  7.2 Questions related to the individual interaction
  7.3 Questions related to the user's overall impression of the system
8 Usability evaluation
9 Analysis and interpretation of collected information
Appendix I Scenario examples
BIBLIOGRAPHY
ITU-T Recommendation P.851

Subjective quality evaluation of telephone services based on spoken dialogue systems

1 Scope

This Recommendation describes subjective evaluation methods providing information about the quality of telephone services based on spoken dialogue systems, as experienced by the users of such services. Spoken dialogue systems addressed by the Recommendation enable a spoken language interaction with a human user via the telephone network on a turn-by-turn basis, and have speech recognition, speech understanding, dialogue management, response generation, and speech output capabilities. They may provide access to information stored in a database, or allow different types of transactions to be performed.

The evaluation methods described here address different aspects of quality from a user's point of view, taking the spoken dialogue system as a black box. Important quality aspects are the usability of the service, the communication efficiency, task and service efficiency, user satisfaction, perceived speech input and output quality, the system's cooperativity, the symmetry of the interaction, and the perceived smoothness of the interaction. The methods are based on laboratory experiments in which subjects interact with the spoken dialogue system in order to perform a pre-defined, realistic task. The subjects' opinion on perceptive quality dimensions can be rated in a guided or unguided way, on questionnaires that are given to them after the experiment, or with the
help of other usability evaluation methods. This Recommendation describes the set-up and running of interaction experiments, relevant quality dimensions perceived by the user, and methodologies that will provide information about these quality dimensions. Further guidance on subjective evaluation methods in general and on the assessment of speech output devices is available in ITU-T Recs P.800 and P.85, and in the Handbook on Telephonometry.

2 References

The following ITU-T Recommendations and other references contain provisions which, through reference in this text, constitute provisions of this Recommendation. At the time of publication, the editions indicated were valid. All Recommendations and other references are subject to revision; users of this Recommendation are therefore encouraged to investigate the possibility of applying the most recent edition of the Recommendations and other references listed below. A list of the currently valid ITU-T Recommendations is regularly published. The reference to a document within this Recommendation does not give it, as a stand-alone document, the status of a Recommendation.

- ITU-T Recommendation E.800 (1994), Terms and definitions related to quality of service and network performance including dependability.
- ITU-T Recommendation G.107 (2003), The E-Model, a computational model for use in transmission planning.
- ITU-T Recommendation G.1000 (2001), Communications Quality of Service: A framework and definitions.
- ITU-T Recommendation P.85 (1994), A method for subjective performance assessment of the quality of speech voice output devices.
- ITU-T Recommendation P.800 (1996), Methods for subjective determination of transmission quality.
- ITU-T Handbook on Telephonometry (1992).

3 Abbreviations

This Recommendation uses the following abbreviations:

ACR       Absolute Category Rating
ANOVA     Analysis of Variance
ASR       Automatic Speech Recognition
CCR       Comparison Category Rating
DARPA     Defense Advanced Research Projects Agency
DCR       Degradation Category Rating
DTMF      Dual Tone Multiple Frequency
HMM       Hidden Markov Model
HSD       Honestly Significant Difference
MOS       Mean Opinion Score
MLP       Multi-Layer Perceptron
PARADISE  PARAdigm for DIalogue System Evaluation
QoS       Quality of Service
SDS       Spoken Dialogue System
WoZ       Wizard-of-Oz

4 Introduction

Spoken dialogue systems (SDSs), i.e., computer systems with which human users interact via spoken language on a turn-by-turn basis, may be part of modern telephone networks. They enable access to databases and transactions via the telephone, e.g., for obtaining train or airline timetable information, stock exchange rates, tourist information, or to perform bank account operations,
or make hotel reservations, etc. In contrast to simple DTMF systems, spoken dialogue systems possess automatic speech recognition and speech understanding (i.e., syntactic/semantic/pragmatic and thus interpretive) capabilities, and a dialogue management module that ensures that the spoken interaction between the user and the system runs smoothly and naturally. As a result, the interaction becomes more human-like, and the service provided by such systems may attract a wider range of potential users.

Frequently, DTMF-based and spoken-dialogue-based systems are implemented in an integrated way, and some of the respective quality aspects will be identical for both types of systems. Sometimes, spoken-dialogue-based systems make use of structures and interface protocols used in web application environments, and are built in a similar way to web interfaces; thus, web interfaces may form a reference for obtaining the same functionality.

4.1 Tasks and components of a spoken dialogue system

From a technical point of view, the components of a spoken dialogue system operated over the telephone network are best depicted as a sequential structure. An example of such a structure is depicted in Figure 1. It consists of six major components which are accessed by the user via a phone server interface. The speech signal from the user is first processed by the speech recognizer. During the recognition process, it is transformed into a word string or a word hypothesis graph which is then semantically analysed. The output is a semantic frame representing what has been “understood” from the user's utterance. It is the task of the dialogue manager to interpret the semantic frame in the context of the dialogue and the task, and to keep track of the dialogue history. When all
relevant information has been collected from the user, a query to the underlying application (in this example a database) can be launched. The information originating from the application program, as well as other communicative goals of the dialogue manager, has to be transformed into a response for the user. This is the task of the response generation module. It generates a response in text form, which is then transformed by the speech synthesizer into a speech signal which is transmitted to the human user. Sometimes, response generation and speech synthesis are implemented as a single module (without an intermediate textual representation), and pre-recorded messages are used instead of synthesized speech.

[Figure 1: block diagram, not reproduced. Recoverable component labels: phone server, speech recognizer, semantic analyser, dialogue manager, database access, response generator, speech synthesizer. Recoverable resource labels: acoustic models, lexicon, language models, grammar, dialogue history, database, rules, unit dictionary. Recoverable signal labels: speech signal, word string, semantic frame, database info, text.]

Figure 1/P.851: Sequential structure of a telephone-based spoken dialogue system [27], [33]

This basic structure may be implemented in different ways. Examples can be found in [4]. One popular way is the so-called “hub architecture” [37], [45], which is used in the DARPA Communicator project. Other structures rely on asynchronously operating modules for interpretation, behaviour (reasoning and acting), and generation; see [1].

4.2 Telephone interaction with a spoken dialogue system

The interaction with the spoken dialogue system takes place via some type of telecommunication network. This network will introduce a number of transmission impairments which will impact the quality of the transmitted speech and, as a consequence, also the performance of a speech recognizer and of subsequent speech and natural language technology components in the spoken dialogue system. On its way back to the human user, the transmission channel will degrade the speech signal generated by the dialogue system. Because telecommunication networks will be confronted with human-to-human communication
as well as human-machine-interaction scenarios, it is important to consider the requirements of both the human user and the speech technology device. The requirements will obviously differ, because the perceptive features influencing the user's judgement on quality are not identical to the characteristics of a speech technology device, e.g., of an automatic speech recognizer (ASR).

The human user carries out the interaction via some type of user interface, e.g., a telephone handset, a hands-free terminal, or a headset. The acoustic characteristics of the mentioned interfaces are very diverse, and so is their sensitivity to room acoustic phenomena occurring in the talking and listening environment of the user. For example, ambient noise may significantly impact the intelligibility of speech signals transmitted through a hands-free terminal, and it also influences the talking behaviour of the user. As a result, the whole interaction scenario, including the spoken dialogue system, the transmission channel, and the user interface, has to be taken into account for the overall quality of the interaction.

4.3 Quality aspects and influencing factors

Humans are the users of SDS-based services which are offered over the phone. Thus, human factors have to be taken into account when the functions of a system/service and the degree of their fulfilment are determined. The quality of the service results from the