International Telecommunication Union

ITU-T F.745
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU
(07/2016)

SERIES F: NON-TELEPHONE TELECOMMUNICATION SERVICES
Multimedia services

Functional requirements for network-based speech-to-speech translation services

Recommendation ITU-T F.745

ITU-T F-SERIES RECOMMENDATIONS
NON-TELEPHONE TELECOMMUNICATION SERVICES

TELEGRAPH SERVICE
  Operating methods for the international public telegram service   F.1–F.19
  The gentex network                                                F.20–F.29
  Message switching                                                 F.30–F.39
  The international telemessage service                             F.40–F.58
  The international telex service                                   F.59–F.89
  Statistics and publications on international telegraph services   F.90–F.99
  Scheduled and leased communication services                       F.100–F.104
  Phototelegraph service                                            F.105–F.109
MOBILE SERVICE
  Mobile services and multidestination satellite services           F.110–F.159
TELEMATIC SERVICES
  Public facsimile service                                          F.160–F.199
  Teletex service                                                   F.200–F.299
  Videotex service                                                  F.300–F.349
  General provisions for telematic services                         F.350–F.399
MESSAGE HANDLING SERVICES                                           F.400–F.499
DIRECTORY SERVICES                                                  F.500–F.549
DOCUMENT COMMUNICATION
  Document communication                                            F.550–F.579
  Programming communication interfaces                              F.580–F.599
DATA TRANSMISSION SERVICES                                          F.600–F.699
MULTIMEDIA SERVICES                                                 F.700–F.799
ISDN SERVICES                                                       F.800–F.849
UNIVERSAL PERSONAL TELECOMMUNICATION                                F.850–F.899
HUMAN FACTORS                                                       F.900–F.999

For further details, please refer to the list of ITU-T Recommendations.

Rec. ITU-T F.745 (07/2016)

Recommendation ITU-T F.745
Functional requirements for network-based speech-to-speech translation services

Summary
Recommendation ITU-T F.745 specifies a high level functional model, a service description and requirements for speech-to-speech translation (S2ST) accomplished by connecting distributed S2ST modules all over the world through a network. To extend this network-based S2ST to other modalities, such as sign language, the modality conversion markup language (MCML) needs to have an expandable structure. The scope of this Recommendation is limited to the application protocol and the services using the network-based
S2ST. This revision includes additional information to clarify that Recommendation ITU-T F.745 could be applicable to both face-to-face communication and remote communication.

History
Edition  Recommendation  Approval    Study Group  Unique ID*
1.0      ITU-T F.745     2010-10-14  16           11.1002/1000/10982
2.0      ITU-T F.745     2016-07-14  16           11.1002/1000/12897

Keywords
Automatic speech recognition (ASR), machine translation (MT), modality conversion markup language (MCML), speech-to-speech translation (S2ST), text-to-speech synthesis (TTS).

* To access the Recommendation, type the URL http://handle.itu.int/ in the address field of your web browser, followed by the Recommendation's unique ID. For example, http://handle.itu.int/11.1002/1000/11830-en.

FOREWORD
The International Telecommunication Union (ITU) is the United Nations specialized agency in the field of telecommunications, information and communication technologies (ICTs). The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.
The World Telecommunication Standardization Assembly (WTSA), which meets every four years, establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on these topics.
The approval of ITU-T Recommendations is covered by the procedure laid down in
WTSA Resolution 1.
In some areas of information technology which fall within ITU-T's purview, the necessary standards are prepared on a collaborative basis with ISO and IEC.

NOTE
In this Recommendation, the expression “Administration” is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.
Compliance with this Recommendation is voluntary. However, the Recommendation may contain certain mandatory provisions (to ensure, e.g., interoperability or applicability) and compliance with the Recommendation is achieved when all of these mandatory provisions are met. The words “shall” or some other obligatory language such as “must” and the negative equivalents are used to express requirements. The use of such words does not suggest that compliance with the Recommendation is required of any party.

INTELLECTUAL PROPERTY RIGHTS
ITU draws attention
to the possibility that the practice or implementation of this Recommendation may involve the use of a claimed Intellectual Property Right. ITU takes no position concerning the evidence, validity or applicability of claimed Intellectual Property Rights, whether asserted by ITU members or others outside of the Recommendation development process.
As of the date of approval of this Recommendation, ITU had not received notice of intellectual property, protected by patents, which may be required to implement this Recommendation. However, implementers are cautioned that this may not represent the latest information and are therefore strongly urged to consult the TSB patent database at http://www.itu.int/ITU-T/ipr/.

© ITU 2016
All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without the prior written permission of ITU.

Table of Contents
1 Scope
2 References
3 Definitions
  3.1 Terms defined elsewhere
  3.2 Terms defined in this Recommendation
4 Abbreviations and acronyms
5 Conventions
6 High-level functional model and generic service description
  6.1 System overview
  6.2 Functional model of modality conversion (MC) through communication between modality conversion protocol (MCP) clients and servers
  6.3 Service description
7 Requirements
  7.1 User input requirements
  7.2 Network requirements
  7.3 User device requirements
  7.4 Modality conversion (MC) client requirements
  7.5 Modality conversion (MC) server requirements
  7.6 Quality requirements
  7.7 Security and privacy requirements
  7.8 Codec requirements
Appendix I – Service description in applications
  I.1 Shared speech-to-speech translation (S2ST) client of two-party communication
  I.2 Personal speech-to-speech translation (S2ST) client communication
  I.3 Cross-modality communication
Bibliography

Introduction
The fact that the world has many different languages is one of the barriers to mutual understanding. The more directly people who speak different languages can communicate without language boundaries, the more mutual understanding can be accelerated and the closer human relationships can be constructed all over the world. To achieve such communication between humans, speech-to-speech translation (S2ST) technologies can be used. S2ST is a technology
that recognizes the speech in one language, translates the recognized speech into another language, and then synthesizes the translation into speech. The leveraging of S2ST technologies in a pragmatic manner, which has long been one of mankind's dreams, may have a significant impact on tourism, social services, safety, and security by removing language barriers, and may ultimately influence language education.
To construct S2ST systems, automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS) must be built for source and target languages by collecting speech and language data, such as audio data, its manual transcriptions, pronunciation lexica for each word, parallel corpora for translation and so on. It is very difficult for individual organizations to build S2ST systems covering all topics and languages. However, by interconnecting ASR, MT and TTS modules developed by separate organizations and distributed globally through a network, one can create S2ST systems that break the world's language barriers.
This Recommendation defines the service description and the requirements for network-based S2ST technologies consisting of various distributed
modules connected together in a network.

Recommendation ITU-T F.745
Functional requirements for network-based speech-to-speech translation services

1 Scope
This Recommendation specifies the service description and the requirements for speech-to-speech translation (S2ST) accomplished by connecting distributed S2ST modules all over the world through a network. This service provides S2ST that recognizes the speech in one language, translates the recognized speech into another language, and then synthesizes the translation into speech. People who speak different languages can communicate using this service.
The applications and services using network-based S2ST technologies are characterized by the following components:
- S2ST client: user client for speech/text input and output.
- S2ST servers:
  - speech recognition: speech is recognized and transcribed;
  - machine translation: text in source language is translated into text in target language;
  - speech synthesis: speech signal is created from text.
- Communication protocol: communication protocol to connect user clients and the above S2ST servers.
In order to extend the network-based S2ST to other modalities (e.g., sign language), a communication protocol is incorporated for modality conversion (MC), which converts single/multiple modality information to different single/multiple modality information. The communication protocol for MC needs to have an expandable structure.
- Modality conversion markup language (MCML): XML schema that serves as a data description for data exchanged among modality conversion modules.

2 References
The following ITU-T Recommendations and other references contain provisions which, through reference in this text, constitute provisions of this Recommendation. At the time of publication, the editions indicated were valid. All Recommendations and other references are subject to revision; users of this Recommendation are therefore encouraged to investigate the possibility of applying the most recent edition of the Recommendations and other references listed below. A list of the
currently valid ITU-T Recommendations is regularly published. The reference to a document within this Recommendation does not give it, as a stand-alone document, the status of a Recommendation.
[ITU-T H.625]     Recommendation ITU-T H.625 (2010), Architecture for network-based speech-to-speech translation services.
[IETF RFC 2279]   IETF RFC 2279 (1998), UTF-8, a transformation format of ISO 10646.
[IETF RFC 2396]   IETF RFC 2396 (1998), Uniform Resource Identifiers (URI): Generic Syntax.
[IETF RFC 2616]   IETF RFC 2616 (1999), Hypertext Transfer Protocol – HTTP/1.1.
[IETF RFC 2818]   IETF RFC 2818 (2000), HTTP Over TLS.
[IETF RFC 3550]   IETF RFC 3550 STD 0064 (2003), RTP: A Transport Protocol for Real-Time Applications.
[W3C XML 1.0]     W3C XML 1.0 (2008), Extensible Markup Language (XML) 1.0 (Fifth Edition).
[W3C XML Schema]  W3C XML Schema (2004), XML Schema Part 2: Datatypes Second Edition, W3C Recommendation 28 October 2004.

3 Definitions
3.1 Terms defined elsewhere
This Recommendation uses the following terms defined elsewhere:
3.1.1 adaptive differential pulse code modulation (ADPCM) [b-ITU-T G.701]: ADPCM algorithms are compression algorithms that achieve bit rate reduction through
the use of adaptive prediction and adaptive quantization.
3.1.2 multipurpose Internet mail extensions (MIME) [b-ITU-T J.200]: An application layer protocol. It features a content architecture to facilitate multimedia data such as text other than US-ASCII code, sound, image, etc. to be handled in Internet mails.
3.1.3 pulse code modulation (PCM) [b-ITU-T J.177]: A commonly-employed algorithm to digitize an analog signal (such as a human voice) into a digital bit stream using simple analog-to-digital conversion techniques.

3.2 Terms defined in this Recommendation
This Recommendation defines the following terms:
3.2.1 automatic speech recognition (ASR): A system that can recognize continuous speech, often having phoneme-sized references, using lexical, syntactic, semantic, and pragmatic knowledge, and reacts appropriately (therefore having interpreted the message and found the corresponding action to be taken). [b-ITU-T P.10]
3.2.2 machine translation (MT): Text in a source language is converted by computers into text in a target language which has the same meaning as the original text in the source language.
3.2.3 modality conversion (MC): The conversion of data to different formats/languages using ASR, MT and TTS systems.
3.2.4 modality conversion markup language (MCML): An XML schema that serves as a data description for data exchanged among modality conversion modules.
3.2.5 modality conversion protocol (MCP): The communication protocol which transfers data between MC clients and
servers using HTTP(S)/RTP [IETF RFC 2616], [IETF RFC 2818], [IETF RFC 3550]. This protocol transfers the MCML, which comprises the multimodal information (MI) data input into MC clients by users and the MC results obtained by MC servers.
3.2.6 multimodal information (MI): The information input into MC clients by users via multimodal sensors.
3.2.7 N-best: The most likely “N” hypotheses obtained from modality conversion engines.
3.2.8 speech-to-speech translation (S2ST): Speech in a source language is translated into speech in a target language.
3.2.9 text-to-speech (TTS) synthesis: A process that generates a speech signal from text codes. It is usually composed of two parts: a language-dependent text processing part (the high level processing part), which generates from the character string (by reading rules, vocabulary and semantic analysis) a set of phonetic, prosodic, etc., parameters that are used by an acoustical signal generating part, the synthesiser itself, which produces the audible speech. [b-ITU-T P.10]

4 Abbreviations and acronyms
This Recommendation uses the following abbreviations and acronyms:
ADPCM   Adaptive Differential Pulse Code Modulation
ASR     Automatic Speech Recognition
HTTP    HyperText Transfer Protocol
HTTPS   HyperText Transfer Protocol Secure
ID      Identifier
MC      Modality Conversion
MCML    Modality Conversion Markup Language
MCP     Modality Conversion Protocol
MI      Multimodal Information
MIME    Multipurpose Internet Mail Extensions
MT      Machine Translation
PCM     Pulse Code Modulation
RTP     Real-time Transport Protocol
S2ST    Speech-To-Speech Translation
TTS     Text-To-Speech
UCS     Universal Character Set
UTF-8   UCS Transformation Format-8
XML     Extensible Markup Language

5 Conventions
In this Recommendation:
The expression “is required to” indicates a requirement which must be strictly followed and from which no deviation is permitted if conformance to this Recommendation is to be claimed.
The expression “is recommended to”