ETSI TR 122 977 V14.0.0 (2017-03)

Digital cellular telecommunications system (Phase 2+) (GSM);
Universal Mobile Telecommunications System (UMTS);
LTE;
Feasibility study for speech-enabled services
(3GPP TR 22.977 version 14.0.0 Release 14)

TECHNICAL REPORT

Reference
RTR/TSGS-0122977ve00

Keywords
GSM, LTE, UMTS

ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE
Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
Siret N° 348 623 562 00017 - NAF 742 C
Non-profit association registered at the Sous-Préfecture de Grasse (06) N° 7803/88

Important notice

The present document can be downloaded from: http://www.etsi.org/standards-search

The present document may be made available in electronic versions and/or in print. The content of any electronic and/or print versions of the present document shall not be modified without the prior written authorization of ETSI. In case of any existing or perceived difference in contents between such versions and/or in print, the only prevailing document is the print of the Portable Document Format (PDF) version kept on a specific network drive within ETSI Secretariat.

Users of the present document should be aware that the document may be subject to revision or change of status. Information on the current status of this and other ETSI documents is available at https://portal.etsi.org/TB/ETSIDeliverableStatus.aspx

If you find errors in the present document, please send your comment to one of the following services: https://portal.etsi.org/People/CommiteeSupportStaff.aspx

Copyright Notification

No part may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm except as authorized by written permission of ETSI. The content of the PDF version shall not be modified without the written authorization of ETSI. The copyright and the foregoing restriction extend to reproduction in all media.

© European Telecommunications Standards Institute 2017. All rights reserved.

DECT™, PLUGTESTS™, UMTS™ and the ETSI logo are Trade Marks of ETSI registered for the benefit of its Members. 3GPP™ and LTE are Trade Marks of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners. GSM and the GSM logo are Trade Marks registered and owned by the GSM Association.

Intellectual Property Rights

IPRs essential or potentially essential to the present document may have been declared to ETSI. The information pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found in ETSI SR 000 314: “Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards”, which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web server (https://ipr.etsi.org/).

Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.

Foreword

The present document may refer to technical specifications or reports using their 3GPP identities, UMTS identities or GSM identities. These should be interpreted as being references to the corresponding ETSI deliverables.

The cross reference between GSM, UMTS, 3GPP and ETSI identities can be found under http://webapp.etsi.org/key/queryform.asp.

Modal verbs terminology

In the present document “should”, “should not”, “may”, “need not”, “will”, “will not”, “can” and “cannot” are to be interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of provisions).

“must” and “must not” are NOT allowed in ETSI deliverables except when used in direct citation.

Contents

Intellectual Property Rights
Foreword
Modal verbs terminology
Foreword
1 Scope
2 References
2.1 Informative references
2.1 Normative references
3 Definitions and abbreviations
3.1 Definitions
3.1 Abbreviations
4 Speech-Enabled Services
4.1 Application Scenarios
5 Multimodal Services
5.1 Application Scenarios
6 Speech recognition technology
6.1 DSR standards
7 Multimodal and Multi-device Technology
7.1 Execution Model
7.2 Deployment configurations
7.3 Authoring
8 Requirements to introduce Speech-enabled services
8.1 Initiation
8.1.1 Service initiation
8.1.2 Multimodal or multi-device access configuration
8.2 Information during the interaction session
8.3 Control
8.4 User perspective (user interface)
8.5 Service provisioning
8.6 Security
8.7 Privacy
8.8 Charging
9 Impact on the 3GPP system
9.1 Speech Recognition within 3GPP system
9.2 Multimodal and Multi-device Services within 3GPP system
Annex A: Change history
History

Foreword

This Technical Report (TR) has been produced by ETSI 3rd Generation Partnership Project (3GPP).

The contents of the present document are subject to continuing work within the TSG and may change following formal TSG approval. Should the TSG modify the contents of the present document, it will be re-released by the TSG with an identifying change of release date and an increase in version number as follows:

Version x.y.z

where:

x the first digit:
  1 presented to TSG for information;
  2 presented to TSG for approval;
  3 or greater indicates TSG approved document under change control.
y the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections, updates, etc.
z the third digit is incremented when editorial only changes have been incorporated in the document.

1 Scope

Speech Enabled Services

Advances in Automatic Speech Recognition (ASR) technology, coupled with the rapid growth of the wireless telephony market, have created a compelling need for speech-enabled services. Voice-activated dialling has become a de facto standard in many of the mobile phones on the market today. Speech recognition technology has also been applied more recently to voice messaging and personal access services. A Voice Extensible Markup Language (Voice XML) has been designed to bring the full power of web development and content delivery to voice response applications [11]. Voice portals that provide voice access to conventional graphically oriented services over the Internet are now becoming
popular. Forecasts show that speech-driven services will play an important role in the 3G market. Users of mobile terminals want the ability to access information while on the move, and the small portable mobile devices that will be used to access this information need improved user interfaces using speech input.

Multimodal and Multi-device Services

Speech-enabled services may utilise speech alone for input and output interaction, or may also utilise multiple input and output modalities, leading to multimodal services. Online access to information is fast becoming a must-have. Along with this trend come new usage models for information access, particularly in mobile environments. Information appliances in cars, such as navigation systems, are already standard in high-end cars and will soon penetrate lower-end vehicles. Data access using mobile phones, though currently limited and estimated to take three years to become widespread, has enough momentum to make widespread adoption certain. In this new computing paradigm a person will expect to have access to information and interactions in a seamless manner in many environments, be it in the office, at home, in the car,
often on several different devices. These new access methods have compelling advantages, such as mobile accessibility, low cost, ease of use, and mass market penetration. They also have their limitations: in particular, it is hard to enter and access data using small devices; speech recognition can introduce mistakes that sometimes repeat and therefore block the transaction; one interaction mode does not suit all circumstances; and so on. For example, a recent study of task performance using wireless phones, covering tasks such as reading world headlines and checking local weather, concluded that these services are currently often poorly designed, have insufficient task analysis, and abuse existing non-mobile design guidelines. The full report from the field study can be downloaded at [6]. The basic conclusion of this study is that wireless access usability fails miserably; accomplishing even the simplest of tasks takes much too long to provide any user satisfaction. It is thus essential for the widespread acceptance of this computing paradigm to provide an efficient and usable interface on the different device platforms that people are expected to use to access and interact with information.

We can expect, and already observe, a trend towards a new frontier of interactive services: multimodal and multi-device services. These services exploit the fact that different interaction modes are good at different things: for example, talking is easier than typing, but reading is faster
than listening. Multimodal interfaces combine the use of multiple interaction modes, such as voice, keypad and display, to improve the user interface to services. Different standards bodies are addressing aspects of this space, driven by several industry proposals: W3C (e.g. MMI activity) [11], OMA/WAP Forum, ETSI [1], IETF [14], ... In particular, the W3C MMI [13] aims at defining a programming model for multimodal and multi-device applications. Additional details and motivations are discussed in [2], [7], [8].

Overview

A brief overview of the speech-enabled services is presented in Chapter 4. The different ways of enabling speech recognition for the speech-enabled services are described in Chapter 5. Section 6 discusses multimodal services and options to enable multimodal and multi-device services. The scope of the report, references, definitions and abbreviations are detailed in the first few chapters.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.
For a specific reference, subsequent revisions do not apply.
For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version
of that document in the same Release as the present document.

2.1 Informative references

[1] D. Pearce, “Enabling new speech driven services for mobile devices: An overview of the ETSI standards activities for distributed speech recognition”, Proc. of AVIOS'00, 2000.
[2] ETSI ES 201 108: “Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms; DSR front end”.
[3] ETSI ES 202 050: “Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms”.
[4] Y. Muthuswamy, P. Walther, “Applications and Requirements”, ETSI Aurora DSR Applications, ICSLP 2002, Denver, CO, Sept 2002.
[16] D. Pearce, “Developing the ETSI Aurora Advanced Distributed Speech Recognition Front-end”, ASRU 2001, Madonna di Campiglio, Dec 2001.

2.1 Normative references

[17] 3GPP TS 21.905: “Vocabulary for 3GPP Specifications”.
[18] 3GPP TS 22.243: “Speech recognition framework for automated voice services; Stage 1”.
[19] 3GPP TS 21.133: “3G security; Security threats and requirements”.
[20] 3GPP TS 22.228: “Service requirements for the Internet Protocol (IP) multimedia core network subsystem; Stage 1”.

3 Definitions and abbreviations

3.1 Definitions

Automated Voice Services: Voice applications that provide a voice interface driven by a voice dialog manager to drive the conversation with the user in order to complete a transaction and possibly execute requested actions. They rely on speech recognition engines to map user voice input into textual or semantic inputs to the dialog manager, and on mechanisms to generate voice or recorded audio prompts (text-to-speech synthesis, audio playback, ...). They may also rely on additional speech processing (e.g. speaker verification). Typically, telephony-based automated voice services also provide call processing and DTMF recognition capabilities. Examples of traditional automated voice services are traditional IVR (Interactive Voice Response) systems and VoiceXML browsers.

Conventional Codec: The module in the UE that encodes the speech input waveform, similar to the encoder in a vocoder, e.g. EFR, AMR.

Channel: A particular user agent (browser), device, or a particular modality.

Downlink exchanges:
Exchanges from servers and networks to the terminal.

DSR Optimised Codec: The module in the UE which takes speech input, extracts acoustic features and encodes them with a scheme optimised for speech recognition. This module is similar to the conventional codec (e.g. AMR). On the server side, the uplink encoded stream can be directly consumed by speech engines without having to be converted to a waveform.

Haptic interface: An interface that allows a user to interact by receiving feedback achieved by applying a degree of opposing force to the user along the x, y, and z axes (e.g. pressure).

Mono-modal application: An application designed for access through only one channel or channel type (e.g. WAP, Web or Voice exclusively).

Multi-channel application: An application designed for ubiquitous access through different channels, one channel at a time. No particular attention is paid to synchronization or coordination across different channels.

Multi-device applications: Applications that support interaction over a number of physical devices, with browsers being synchronised with the MT accessing 3G services. These browsers may support the same (e.g. GUI) or different modalities.

Multimodal application: An application that supports more than one interaction mode by relying on a combination of multiple input (e.g. key, stylus, voice, ...) to access and manipulate information