International Telecommunication Union

ITU-T Series P
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU
Supplement 24 (10/2005)

SERIES P: TELEPHONE TRANSMISSION QUALITY, TELEPHONE INSTALLATIONS, LOCAL LINE NETWORKS

Parameters describing the interaction with spoken dialogue systems

ITU-T P-series Recommendations, Supplement 24

ITU-T P-SERIES RECOMMENDATIONS
TELEPHONE TRANSMISSION QUALITY, TELEPHONE INSTALLATIONS, LOCAL LINE NETWORKS

Series P.10           Vocabulary and effects of transmission parameters on customer opinion of transmission quality
Series P.30, P.300    Subscribers' lines and sets
Series P.40           Transmission standards
Series P.50, P.500    Objective measuring apparatus
Series P.60           Objective electro-acoustical measurements
Series P.70           Measurements related to speech loudness
Series P.80, P.800    Methods for objective and subjective assessment of quality
Series P.900          Audiovisual quality in multimedia services
Series P.1000         Transmission performance and QoS aspects of IP end-points

For further details, please refer to the list of ITU-T Recommendations.

Supplement 24 to ITU-T P-series Recommendations

Parameters describing the interaction with spoken dialogue systems

Summary

This Supplement provides definitions for a set of parameters which can be extracted from services which rely on spoken dialogue systems. The parameters can be extracted from logged (test) user interactions with the service under consideration. They quantify the flow of the interaction, the behaviour of the user and the system, and the performance of the speech technology devices involved in the interaction. They provide useful information for system development, optimization and maintenance, and are complementary to subjective quality judgments collected according to ITU-T Rec. P.851.

Source

Supplement 24 to ITU-T P-series Recommendations was agreed on 21 October 2005 by ITU-T Study Group 12 (2005-2008).

Keywords

Assessment, automatic speech recognition, automatic speech understanding, dialogue management, interaction parameter, speech generation, speech technology, spoken dialogue
system.

FOREWORD

The International Telecommunication Union (ITU) is the United Nations specialized agency in the field of telecommunications. The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.

The World Telecommunication Standardization Assembly (WTSA), which meets every four years, establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on these topics.

The approval of ITU-T Recommendations is covered by the procedure laid down in WTSA Resolution 1.

In some areas of information technology which fall within ITU-T's purview, the necessary standards are prepared on a collaborative basis with ISO and IEC.

NOTE - In this publication, the expression "Administration" is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.

Compliance with this publication is voluntary. However, the publication may contain certain mandatory provisions (to ensure e.g. interoperability or applicability) and compliance with the publication is achieved when all of these mandatory provisions are met. The words "shall" or some other obligatory language such as "must" and the negative equivalents are used to express requirements. The use of such words does not suggest that compliance with the publication is required of any party.

INTELLECTUAL PROPERTY RIGHTS

ITU draws attention to the possibility that the practice or implementation of this publication may involve the use of a claimed Intellectual Property Right. ITU takes no position concerning the evidence, validity or applicability of claimed Intellectual Property Rights, whether asserted by ITU members or others outside of the publication development process.

As of the date of approval of this publication, ITU had not received notice of intellectual property, protected by patents, which may be required to implement this publication. However, implementors are cautioned that this may not represent the latest information and are therefore strongly urged to consult the TSB patent database.

© ITU 2005

All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without the prior written permission of ITU.

CONTENTS

1 Scope
2 References
3 Definitions
4 Abbreviations
5 Introduction
6 Characteristics of interaction parameters
7 Review of interaction parameters
8 Interpretation of interaction parameter values
BIBLIOGRAPHY

Supplement 24 to ITU-T P-series Recommendations

Parameters describing the interaction with spoken dialogue systems

1 Scope

This Supplement describes parameters providing information on the interaction with services which are based on spoken dialogue systems, as seen by the system developer and service operator. Spoken dialogue systems addressed by this Supplement enable a spoken language interaction with a human user via the telephone network on a turn-by-turn basis, and have automatic speech recognition, speech understanding, dialogue management, response generation, and speech output capabilities. They may provide access to information stored in a database, or allow different types of transactions to be performed.

The parameters defined here quantify the flow of the interaction, the behaviour of the user and the system, and the performance of the speech technology devices involved in the interaction. For extracting all parameters, the spoken dialogue system has to be accessible as a glass box; still, some parameters may also be extracted in a black-box approach, i.e., without access to the individual system components. The extraction can partially be performed automatically, and partially relies on a human expert transcribing and annotating interaction log files.

The parameters address system performance from a system developer's point of view; thus, they provide complementary information to subjective evaluation experiments with spoken dialogue systems for which recommendations are given in ITU-T Rec. P.851. Further guidance on subjective evaluation methods in general, and on the assessment of speech output devices, is available in ITU-T Recs P.800 and P.85, and in the Handbook on Telephonometry.

The parameters listed in this Supplement do not specifically refer to possible degradations introduced by the transmission channel. These effects are an item for further study by ITU-T SG 12.

2 References

- ITU-T Recommendation P.85 (1994), A method for subjective performance assessment of the quality of speech voice output devices.
- ITU-T Recommendation P.800 (1996), Methods for subjective determination of transmission quality.
- ITU-T Recommendation P.851 (2003), Subjective quality evaluation of telephone services based on spoken dialogue systems.
- ITU-T Handbook on Telephonometry (1992).

3 Definitions

For definitions
not listed here, please refer to ITU-T Rec. P.10.

3.1 barge-in: The ability of a human to speak over a system prompt or system output [10].

3.2 dialogue: A conversation or an exchange of information. As an evaluation unit: One of several possible paths through the dialogue structure.

3.3 efficiency: Measures of the accuracy and completeness of system tasks relative to the resources (e.g., time, human effort) used to achieve the specific system tasks.

3.4 exchange: A pair of contiguous and related turns, one spoken by each party in the dialogue [8].

3.5 functionality: Capability of the system to provide functions which meet stated and implied needs when the system is used under specific conditions.

3.6 meta-communication: The communication about communication, e.g., for resolving misunderstandings ("Did I understand you right?") or for reaching agreement on the use of the language.

3.7 performance: The ability of a unit to provide the function it has been designed for.

3.8 speech technology: The discipline concerned with the research and development of spoken language input and output systems, using contributions from the neighbouring disciplines of acoustics, electrical engineering, statistics, phonetics, natural language processing, and involving system requirements specification, design, implementation and evaluation, corpus and linguistic resource processing, and consumer oriented product evaluation [10].

3.9 spoken dialogue system: A computer system with which human users interact via spoken language on a turn-by-turn basis.

3.10 task: All the activities which a user must develop in order to attain a fixed objective in some domain.

3.11 task-oriented dialogue: A dialogue concerning a specific subject, aiming at an explicit goal (such as
resolving a problem or obtaining specific information) [8].

3.12 transaction: The part of a dialogue devoted to a single high-level task (e.g., making a travel booking or checking a bank account balance). A transaction may be coextensive with a dialogue, or a dialogue may consist of more than one transaction [8].

3.13 turn: Utterance. A stretch of speech, spoken by one party in a dialogue, from when this party starts speaking until another party definitely takes over [1].

3.14 utterance: See turn.

4 Abbreviations

ASR     Automatic Speech Recognition
AVM     Attribute-Value Matrix
AVP     Attribute-Value Pair
DARPA   Defense Advanced Research Projects Agency
DP      Dynamic Programming
DTMF    Dual Tone Multiple Frequency
IVR     Interactive Voice Response
MOS     Mean Opinion Score
SDS     Spoken Dialogue System
WoZ     Wizard-of-Oz

5 Introduction

Spoken dialogue systems (SDSs), i.e., computer systems with which human users interact via spoken language on a turn-by-turn basis, may be part of modern telephone networks. They enable access to databases and transactions via the telephone, e.g., for obtaining train or airline timetable information, stock exchange rates, tourist information, or to perform bank account operations
or make hotel reservations. In contrast to simple interactive voice response (IVR) systems with DTMF input, SDSs offer the full range of speech interaction capabilities, including the recognition of user speech, the assignment of meaning to the recognized words, the decision on how to continue the dialogue, the formulation of a linguistic response, and the generation of spoken output to the user. In this way, a more-or-less "natural" spoken interaction between user and system is enabled.

In order to evaluate the quality of services which rely on SDSs from a user's perspective, ITU-T SG 12 established ITU-T Rec. P.851 in 2003. This Recommendation describes methods for conducting subjective evaluation experiments in order to determine quality from a user's point of view, taking the SDS as a black box. With the help of experiments carried out according to ITU-T Rec. P.851, valuable information on quality, as it is seen by the user, may be obtained. However, it may be difficult to determine how the individual system components contribute to the overall quality experienced by the user, e.g., to determine which component needs improvement in case of interaction problems. Thus, the evaluation should be complemented by information which addresses system performance from the point of view of the system designer and the service operator.

System-related information may be described in terms of so-called interaction parameters. Such parameters help to quantify the flow
of the interaction, the behaviour of the user and the system, and the performance of the speech technology devices involved in the interaction. They address system performance from the point of view of the system developer and the service operator, and thus provide complementary information to subjective evaluation data. For extracting some of the parameters, the spoken dialogue system has to be accessible as a glass box; other parameters may also be extracted in a black-box approach, i.e., without access to the individual system components.

This Supplement provides a collection of interaction parameters which have been used for evaluating SDSs in the past 15 years. The listed parameters are related to the overall communication of information between user and system, the meta-communication in case of misunderstandings, the cooperativity of the system, the task which can be carried out with the help of the system, and the system's speech input capabilities. No parametric description is yet available for speech output quality (e.g., with respect to synthesized speech quality). The collection is based on theoretical work which is described in [17].

Not all of the interaction parameters are directly related to the perceived quality of SDS-based services. In fact, correlations between individual parameters and users' quality judgments are generally quite moderate. Still, it is advantageous to have a large set of parameters describing the interaction between user and system, as such a set captures most of the information which is potentially relevant for perceived quality from a system designer's perspective. Such parameters provide useful information for system development, optimization, and maintenance.

Once the parameters have been defined and applied in evaluation experiments at different test sites, they may facilitate an estimation of their impact on perceived quality for a wide range of systems and services. In this way, it may become possible to develop algorithms for predicting quality on the basis of interaction parameters. Work in this direction is still under way within ITU-T SG 12 and elsewhere.

6 Characteristics of interaction parameters

Interaction parameters can be extracted when real or test users interact with the service. The extraction can be performed partly instrumentally and partly with the help of log files which have to be transcribed and annotated by a human expert. Simple parameters, like the duration of the interaction or of single utterances, can usually be measured fully instrumentally, with appropriate algorithms. On the other hand, human transcription and annotation are necessary when not only the surface form (speech signals) is addressed, but also the contents and meaning of system or user utterances (e.g., to determine the accuracy of a word or concept).

SDSs are of such high complexity that a description of system behaviour and a comparison between systems or system versions need to be based on a multitude of different parameters [24]. As a consequence, both (instrumental and expert-based) ways of collecting interaction parameters should be followed in order to get as much information as possible. Based on the collected info