ETSI TR 122 977 V13.0.0 (2016-01)

TECHNICAL REPORT

Digital cellular telecommunications system (Phase 2+);
Universal Mobile Telecommunications System (UMTS);
LTE;
Feasibility study for speech-enabled services
(3GPP TR 22.977 version 13.0.0 Release 13)

Reference
RTR/TSGS-0122977vd00

Keywords
GSM, LTE, UMTS

ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE

Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16

Siret No. 348 623 562 00017 - NAF 742 C
Non-profit association registered at the Sub-Prefecture of Grasse (06) No. 7803/88

Important notice

The present document can be downloaded from: http://www.etsi.org/standards-search

The present document may be made available in electronic versions and/or in print. The content of any electronic and/or print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any existing or perceived difference in contents between such versions and/or in print, the only prevailing document is the print of the Portable Document Format (PDF) version kept on a specific network drive within ETSI Secretariat.

Users of the present document should be aware that the document may be subject to revision or change of status. Information on the current status of this and other ETSI documents is available at http://portal.etsi.org/tb/status/status.asp

If you find errors in the present document, please send your comment to one of the following services: https://portal.etsi.org/People/CommiteeSupportStaff.aspx

Copyright Notification

No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm except as authorized by written permission of ETSI. The content of the PDF version shall not be modified without the written authorization of ETSI. The copyright and the foregoing restriction extend to reproduction in all media.

© European Telecommunications Standards Institute 2016. All rights reserved.

DECT™, PLUGTESTS™, UMTS™ and the ETSI logo are Trade Marks of ETSI registered for the benefit of its Members. 3GPP™ and LTE are Trade Marks of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners. GSM and the GSM logo are Trade Marks registered and owned by the GSM Association.
Intellectual Property Rights

IPRs essential or potentially essential to the present document may have been declared to ETSI. The information pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found in ETSI SR 000 314: “Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards”, which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web server (https://ipr.etsi.org/).

Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.

Foreword

This Technical Report (TR) has been produced by ETSI 3rd Generation Partnership Project (3GPP).

The present document may refer to technical specifications or reports using their 3GPP identities, UMTS identities or GSM identities. These should be interpreted as being references to the corresponding ETSI deliverables.

The cross reference between GSM, UMTS, 3GPP and ETSI identities can be found under http://webapp.etsi.org/key/queryform.asp.

Modal verbs terminology

In the present document “shall”, “shall not”, “should”, “should not”, “may”, “need not”, “will”, “will not”, “can” and “cannot” are to be interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of provisions).

“must” and “must not” are NOT allowed in ETSI deliverables except when used in direct citation.
Contents

Intellectual Property Rights
Foreword
Modal verbs terminology
Foreword
1 Scope
2 References
2.1 Informative references
2.2 Normative references
3 Definitions and abbreviations
3.1 Definitions
3.2 Abbreviations
4 Speech-Enabled Services
4.1 Application Scenarios
5 Multimodal Services
5.1 Application Scenarios
6 Speech recognition technology
6.1 DSR standards
7 Multimodal and Multi-device Technology
7.1 Execution Model
7.2 Deployment configurations
7.3 Authoring
8 Requirements to introduce Speech-enabled services
8.1 Initiation
8.1.1 Service initiation
8.1.2 Multimodal or multi-device access configuration
8.2 Information during the interaction session
8.3 Control
8.4 User perspective (user interface)
8.5 Service provisioning
8.6 Security
8.7 Privacy
8.8 Charging
9 Impact on the 3GPP system
9.1 Speech Recognition within 3GPP system
9.2 Multimodal and Multi-device Services within 3GPP system
Annex A: Change history
History

Foreword
This Technical Report has been produced by the 3rd Generation Partnership Project (3GPP).

The contents of the present document are subject to continuing work within the TSG and may change following formal TSG approval. Should the TSG modify the contents of the present document, it will be re-released by the TSG with an identifying change of release date and an increase in version number as follows:

Version x.y.z

where:

x the first digit:
1 presented to TSG for information;
2 presented to TSG for approval;
3 or greater indicates TSG approved document under change control.

y the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections, updates, etc.

z the third digit is incremented when editorial only changes have been incorporated in the document.

1 Scope

Speech Enabled Services

The advancement in Automatic Speech Recognition (ASR) technology, coupled with the rapid growth of the wireless telephony market, has created a compelling need for speech-enabled services. Voice-activated dialling has become a de facto standard in many of the mobile phones on the market today. Speech recognition technology has also been applied more recently to voice messaging and personal access services. The Voice Extensible Markup Language (VoiceXML) has been designed to bring the full power of web development and content delivery to voice response applications [11]. Voice portals that provide voice access to conventional graphically oriented services over the Internet are now becoming popular.

Forecasts show that speech-driven services will play an important role in the 3G market. Users of mobile terminals want the ability to access information while on the move, and the small portable mobile devices that will be used to access this information need improved user interfaces using speech input.

Multimodal and Multi-device Services

Speech-enabled services may use speech alone for input and output interaction, or may also use multiple input and output modalities, leading to multimodal services.

Online access to information is fast becoming a must-have. Along with this trend come new usage models for information access, particularly in mobile environments. Information appliances such as navigation systems are already standard in high-end cars and will soon penetrate lower-end vehicles. Data access using mobile phones, though currently limited and estimated to take three years to become widespread, has significant momentum that makes it certain to become widespread.

In this new computing paradigm, a person will expect to have access to information and interactions in a seamless manner in many environments, be it in the office, at home or in the car, often on several different devices. These new access methods have compelling advantages, such as mobile accessibility, low cost, ease of use and mass-market penetration. They also have their limitations: in particular, it is hard to enter and access data using small devices; speech recognition can introduce mistakes that may be repeated and therefore block the transaction; one interaction mode does not suit all circumstances; and so on. For example, a recent field study of task performance using wireless phones, covering tasks such as reading world headlines and checking local weather, concluded that these services are currently often poorly designed, have insufficient task analysis, and abuse existing non-mobile design guidelines. The full report from the field study can be downloaded at [6]. The basic conclusion of this study is that wireless access usability fails miserably: accomplishing even the simplest of tasks takes much too long to provide any user satisfaction. It is thus essential for the widespread acceptance of this computing paradigm to provide an efficient and usable interface on the different device platforms that people are expected to use to access and interact with information.

We can expect, and already observe, a trend towards a new frontier of interactive services: multimodal and multi-device services. These services exploit the fact that different interaction modes are good at different things; for example, talking is easier than typing, but reading is faster than listening. Multimodal interfaces combine the use of multiple interaction modes, such as voice, keypad and display, to improve the user interface to services. Different standards bodies are addressing aspects of this space, driven by several industry proposals, for example W3C (e.g. the MMI activity) [11], OMA/WAP Forum, ETSI [1] and IETF [14]. In particular, the W3C MMI activity [13] aims at defining a programming model for multimodal and multi-device applications. Additional details and motivations are discussed in
[2], [7], [8].

Overview

A brief overview of speech-enabled services is presented in clause 4. The different ways of enabling speech recognition for speech-enabled services are described in clause 6. Clauses 5 and 7 discuss multimodal services and the options to enable multimodal and multi-device services. The scope of the report, references, definitions and abbreviations are detailed in the first few clauses.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document. References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific. For a specific reference, subsequent revisions do not apply. For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.

2.1 Informative references

[1] D. Pearce, “Enabling new speech driven services for mobile devices: An overview of the ETSI standards activities for distributed speech recognition”, Proc. of AVIOS'00, 2000.

[2] ETSI ES 201 108: “Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms; DSR front-end”.

[3] ETSI ES 202 050: “Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms”.

[4] Y. Muthuswamy, P. Walther, “Applications and Requirements”, ETSI Aurora DSR Applications, ICSLP 2002, Denver, CO, Sept. 2002.

[16] D. Pearce, “Developing the ETSI Aurora Advanced Distributed Speech Recognition Front-end”, ASRU 2001, Madonna di Campiglio, Dec. 2001.

2.2 Normative references

[17] 3GPP TS 21.905: “Vocabulary for 3GPP Specifications”.

[18] 3GPP TS 22.243: “Speech recognition framework for automated voice services; Stage 1”.

[19] 3GPP TS 21.133: “3G security; Security threats and requirements”.

[20] 3GPP TS 22.228: “Service requirements for the Internet Protocol (IP) multimedia core network subsystem; Stage 1”.

3 Definitions and abbreviations

3.1 Definitions

Automated Voice Services: Voice applications that provide a voice interface in which a voice dialog manager drives the conversation with the user in order to complete a transaction and possibly execute requested actions. They rely on speech recognition engines to map user voice input into textual or semantic inputs to the dialog manager, and on mechanisms to generate voice or recorded audio prompts (text-to-speech synthesis, audio playback, etc.). They can also rely on additional speech processing (e.g. speaker verification). Typically, telephony-based automated voice services also provide call
processing and DTMF recognition capabilities. Examples of automated voice services are traditional IVR (Interactive Voice Response) systems and VoiceXML browsers.

Conventional Codec: The module in the UE that encodes the speech input waveform, similar to the encoder in a vocoder, e.g. EFR or AMR.

Channel: denotes a particular user agent (browser), device, or a particular modality.

Downlink exchanges: Exchanges from servers and networks to the terminal.

DSR Optimised Codec: The module in the UE that takes speech input, extracts acoustic features and encodes them with a scheme optimised for speech recognition. This module is similar to the conventional codec (e.g. AMR). On the server side, the uplink encoded stream can be consumed directly by speech engines without having to be converted to a waveform.

Haptic interface: An interface that allows a user to interact by receiving feedback achieved by applying a degree of opposing force to the user along the x, y and z axes (e.g. pressure).

Mono-modal application: An application designed for access through only one channel or channel type (e.g. WAP, Web or Voice exclusively).

Multi-channel application: An application designed for ubiquitous access through different channels, one channel at a time. No particular attention is paid to synchronization or coordination across different channels.

Multi-device applications: denote applications that support the capability to interact with a particular application over a number of physical devices, with the browsers being synchronised with the MT accessing 3G services. These browsers may support the same (e.g. GUI) or different modalities.

Multimodal application: denotes an application that supports more than one interaction mode b