ITU-T F.745 (2010): Functional requirements for network-based speech-to-speech translation services (Study Group 16)


International Telecommunication Union
ITU-T F.745
TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU
(10/2010)

SERIES F: NON-TELEPHONE TELECOMMUNICATION SERVICES
Audiovisual services

Functional requirements for network-based speech-to-speech translation services

Recommendation ITU-T F.745

Copyright International Telecommunication Union/ITU Telecommunication Sector. Provided by IHS under license with ITU-T. Not for Resale. No reproduction or networking permitted without license from IHS.

ITU-T F-SERIES RECOMMENDATIONS
NON-TELEPHONE TELECOMMUNICATION SERVICES

TELEGRAPH SERVICE
Operating methods for the international public telegram service  F.1-F.19
The gentex network  F.20-F.29
Message switching  F.30-F.39
The international telemessage service  F.40-F.58
The international telex service  F.59-F.89
Statistics and publications on international telegraph services  F.90-F.99
Scheduled and leased communication services  F.100-F.104
Phototelegraph service  F.105-F.109
MOBILE SERVICE
Mobile services and multidestination satellite services  F.110-F.159
TELEMATIC SERVICES
Public facsimile service  F.160-F.199
Teletex service  F.200-F.299
Videotex service  F.300-F.349
General provisions for telematic services  F.350-F.399
MESSAGE HANDLING SERVICES  F.400-F.499
DIRECTORY SERVICES  F.500-F.549
DOCUMENT COMMUNICATION
Document communication  F.550-F.579
Programming communication interfaces  F.580-F.599
DATA TRANSMISSION SERVICES  F.600-F.699
AUDIOVISUAL SERVICES  F.700-F.799
ISDN SERVICES  F.800-F.849
UNIVERSAL PERSONAL TELECOMMUNICATION  F.850-F.899
HUMAN FACTORS  F.900-F.999

For further details, please refer to the list of ITU-T Recommendations.

Rec. ITU-T F.745 (10/2010)

Recommendation ITU-T F.745
Functional requirements for network-based speech-to-speech translation services

Summary
Recommendation ITU-T F.745 specifies a high-level functional model, a service description and requirements for speech-to-speech translation (S2ST) accomplished by connecting distributed S2ST modules all over the world through a network. To extend this network-based S2ST to other modalities, such as sign language, the modality conversion markup language (MCML) needs to have an expandable structure. The scope of this Recommendation is limited to the application protocol and the services using the network-based S2ST.

History
Edition  Recommendation  Approval    Study Group
1.0      ITU-T F.745     2010-10-14  16

Keywords
Automatic speech recognition (ASR), machine translation (MT), modality conversion markup language (MCML), speech-to-speech translation (S2ST), text-to-speech synthesis (TTS).

FOREWORD
The International Telecommunication Union (ITU) is the United Nations specialized agency in the field of telecommunications, information and communication technologies (ICTs). The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis.

The World Telecommunication Standardization Assembly (WTSA), which meets every four years, establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on these topics.

The approval of ITU-T Recommendations is covered by the procedure laid down in WTSA Resolution 1.

In some areas of information technology which fall within ITU-T's purview, the necessary standards are prepared on a collaborative basis with ISO and IEC.

NOTE
In this Recommendation, the expression "Administration" is used for conciseness to indicate both a telecommunication administration and a recognized operating agency.

Compliance with this Recommendation is voluntary. However, the Recommendation may contain certain mandatory provisions (to ensure, e.g., interoperability or applicability) and compliance with the Recommendation is achieved when all of these mandatory provisions are met. The words "shall" or some other obligatory language such as "must" and the negative equivalents are used to express requirements. The use of such words does not suggest that compliance with the Recommendation is required of any party.

INTELLECTUAL PROPERTY RIGHTS
ITU draws attention to the possibility that the practice or implementation of this Recommendation may involve the use of a claimed Intellectual Property Right. ITU takes no position concerning the evidence, validity or applicability of claimed Intellectual Property Rights, whether asserted by ITU members or others outside of the Recommendation development process.

As of the date of approval of this Recommendation, ITU had not received notice of intellectual property, protected by patents, which may be required to implement this Recommendation. However, implementers are cautioned that this may not represent the latest information and are therefore strongly urged to consult the TSB patent database at http://www.itu.int/ITU-T/ipr/.

© ITU 2011
All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without the prior written permission of ITU.

Table of Contents
1 Scope
2 References
3 Definitions
3.1 Terms defined elsewhere
3.2 Terms defined in this Recommendation
4 Abbreviations and acronyms
5 Conventions
6 High-level functional model and generic service description
6.1 System overview
6.2 Functional model of modality conversion (MC) through communication between modality conversion protocol (MCP) clients and servers
6.3 Service description
7 Requirements
7.1 User input requirements
7.2 Network requirements
7.3 User device requirements
7.4 Modality conversion (MC) client requirements
7.5 Modality conversion (MC) server requirements
7.6 Quality requirements
7.7 Security and privacy requirements
7.8 Codec requirements
Appendix I Service description in applications
I.1 Shared speech-to-speech translation (S2ST) client of two-party communication
I.2 Personal speech-to-speech translation (S2ST) client communication
I.3 Cross-modality communication
Bibliography

Introduction
The fact that the world has many different languages is one of the barriers to mutual understanding. The more directly people who speak different languages can communicate across language boundaries, the faster mutual understanding can grow and the closer human relationships can become all over the world. To achieve such communication between humans, speech-to-speech translation (S2ST) technologies can be used. S2ST is a technology that recognizes speech in one language, translates the recognized speech into another language and then synthesizes the translation into speech. Putting S2ST technologies to practical use, which has long been one of mankind's dreams, may have a significant impact on tourism, social services, safety and security by removing language barriers, and may ultimately influence language education.

To construct S2ST systems, automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS) must be built for the source and target languages by collecting speech and language data, such as audio recordings and their manual transcriptions, pronunciation lexica for each word and parallel corpora for translation. It is very difficult for individual organizations to build S2ST systems covering all topics and languages. However, by interconnecting ASR, MT and TTS modules developed by separate organizations and distributed globally through a network, one can create S2ST systems that break the world's language barriers.

This Recommendation defines the service description and the requirements for network-based S2ST technologies consisting of various distributed modules connected together in a network.

Recommendation ITU-T F.745
Functional requirements for network-based speech-to-speech translation services

1 Scope
This Recommendation specifies the service description and the requirements for speech-to-speech translation (S2ST) accomplished by connecting distributed S2ST modules all over the world through a network. This service provides S2ST that recognizes speech in one language, translates the recognized speech into another language, and then synthesizes the translation into speech. People who speak different languages can communicate using this service.

The applications and services using network-based S2ST technologies are characterized by the following components:
- S2ST client: user client for speech/text input and output.
- S2ST servers:
  - speech recognition: speech is recognized and transcribed;
  - machine translation: text in the source language is translated into text in the target language;
  - speech synthesis: a speech signal is created from text.
- Communication protocol: communication protocol to connect user clients and the above S2ST servers.

In order to extend the network-based S2ST to other modalities (e.g., sign language), a communication protocol is incorporated for modality conversion (MC), which converts single/multiple modality information to different single/multiple modality information. The communication protocol for MC needs to have an expandable structure.
- Modality conversion markup language (MCML): an XML schema that serves as a data description for data exchanged among modality conversion modules.

2 References
The following ITU-T Recommendations and other references contain provisions which, through reference in this text, constitute provisions of this Recommendation. At the time of publication, the editions indicated were valid. All Recommendations and other references are subject to revision; users of this Recommendation are therefore encouraged to investigate the possibility of applying the most recent edition of the Recommendations and other references listed below. A list of the currently valid ITU-T Recommendations is regularly published. The reference to a document within this Recommendation does not give it, as a stand-alone document, the status of a Recommendation.

[ITU-T H.625] Recommendation ITU-T H.625 (2010), Architecture for network-based speech-to-speech translation services.

[IETF RFC 2279] IETF RFC 2279 (1998), UTF-8, a transformation format of ISO 10646.

[IETF RFC 2396] IETF RFC 2396 (1998), Uniform Resource Identifiers (URI): Generic Syntax.

[IETF RFC 2616] IETF RFC 2616 (1999), Hypertext Transfer Protocol -- HTTP/1.1.

[IETF RFC 2818] IETF RFC 2818 (2000), HTTP Over TLS.

[IETF RFC 3550] IETF RFC 3550, STD 0064 (2003), RTP: A Transport Protocol for Real-Time Applications.

[W3C XML 1.0] W3C XML 1.0 (2008), Extensible Markup Language (XML) 1.0 (Fifth Edition).

[W3C XML Schema] W3C XML Schema (2004), XML Schema Part 2: Datatypes Second Edition, W3C Recommendation 28 October 2004.

3 Definitions

3.1 Terms defined elsewhere
This Recommendation uses the following terms defined elsewhere:

3.1.1 adaptive differential pulse code modulation (ADPCM) [b-ITU-T G.701]: ADPCM algorithms are compression algorithms that achieve bit rate reduction through the use of adaptive prediction and adaptive quantization.

3.1.2 multipurpose Internet mail extensions (MIME) [b-ITU-T J.200]: An application layer protocol. It features a content architecture that allows multimedia data, such as text other than US-ASCII, sound and images, to be handled in Internet mail.

3.1.3 pulse code modulation (PCM) [b-ITU-T J.177]: A commonly employed algorithm to digitize an analog signal (such as a human voice) into a digital bit stream using simple analog-to-digital conversion techniques.

3.2 Terms defined in this Recommendation
This Recommendation defines the following terms:

3.2.1 automatic speech recognition (ASR): A system that can recognize continuous speech, often having phoneme-sized references, using lexical, syntactic, semantic and pragmatic knowledge, and reacts appropriately (therefore having interpreted the message and found the corresponding action to be taken) [b-ITU-T P.10].

3.2.2 machine translation (MT): Text in a source language is converted by computers into text in a target language which has the same meaning as the original text in the source language.

3.2.3 modality conversion (MC): The conversion of data to different formats/languages using ASR, MT and TTS systems.

3.2.4 modality conversion markup language (MCML): An XML schema that serves as a data description for data exchanged among modality conversion modules.

3.2.5 modality conversion protocol (MCP): The communication protocol which transfers data between MC clients and servers using HTTP(S)/RTP [IETF RFC 2616], [IETF RFC 2818], [IETF RFC 3550]. This protocol transfers MCML comprising the multimodal information (MI) data input into MC clients by users and the MC results obtained by MC servers.

3.2.6 multimodal information (MI): The information input into MC clients by users via multimodal sensors.

3.2.7 N-best: The most likely "N" hypotheses obtained from modality conversion engines.

3.2.8 speech-to-speech translation (S2ST): Speech in a source language is translated into speech in a target language.

3.2.9 text-to-speech (TTS) synthesis: A process that generates a speech signal from text c
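Definitions 3.2.4, 3.2.5 and 3.2.7 describe MCML, MCP and N-best output only in the abstract. The sketch below uses Python's standard `xml.etree.ElementTree` to do two things: build an MCML-style XML request that asks an MT server for its N-best translations, and read a ranked N-best list back from a response. It rests on stated assumptions. Every element and attribute name (`MCML`, `Request`, `Input`, `Output`, `Response`, `Result`, `Hypothesis`, `nbest`, `score`) and the version value are hypothetical placeholders, not the normative MCML schema. The HTTP(S) transport step of MCP is also omitted.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# NOTE: all element/attribute names below are hypothetical placeholders,
# not the normative MCML schema.


def build_mt_request(text: str, source_lang: str, target_lang: str, n_best: int) -> bytes:
    """Build an MCML-style request asking an MT server for its n_best most likely translations."""
    root = ET.Element("MCML", version="0.1")  # placeholder version string
    request = ET.SubElement(root, "Request", service="MT")
    source = ET.SubElement(request, "Input", modality="text", lang=source_lang)
    source.text = text
    ET.SubElement(request, "Output", modality="text", lang=target_lang, nbest=str(n_best))
    # Serialize as UTF-8 with an XML declaration.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def parse_n_best(response_xml: bytes) -> List[Tuple[str, float]]:
    """Extract (hypothesis, score) pairs from an MCML-style response, most likely first."""
    root = ET.fromstring(response_xml)
    pairs = [(h.text or "", float(h.get("score", "0"))) for h in root.iter("Hypothesis")]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# In a deployment these bytes would be the body of an HTTP(S) request to an MC server.
request_bytes = build_mt_request("Wo ist der Bahnhof?", "de", "en", n_best=2)
print(request_bytes.decode("utf-8"))

# A hand-written response of the kind an MT server might return (illustrative only).
sample_response = b"""<?xml version='1.0' encoding='utf-8'?>
<MCML version="0.1">
  <Response service="MT">
    <Result modality="text" lang="en">
      <Hypothesis score="0.62">Where is the station?</Hypothesis>
      <Hypothesis score="0.81">Where is the train station?</Hypothesis>
    </Result>
  </Response>
</MCML>"""

for hypothesis, score in parse_n_best(sample_response):
    print(f"{score:.2f}  {hypothesis}")
```

Sorting by score keeps the client independent of the order in which a server happens to list its hypotheses. Whether a real MCML response is already ranked is not stated in this section.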

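The service flow in the Scope and in definition 3.2.8 is recognize, translate, then synthesize. It can be sketched as a client that chains three modules and forwards the 1-best hypothesis of each stage to the next. This is an illustration under assumptions. The `S2STPipeline` class, the `one_best` helper, the module signatures and the stub engines are all hypothetical stand-ins for remote ASR, MT and TTS servers. A real network-based client would exchange MCML with those servers over MCP rather than call local functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An N-best list of (hypothesis, score) pairs; a higher score is assumed to mean "more likely".
NBest = List[Tuple[str, float]]


def one_best(n_best: NBest) -> str:
    """Return the top-scoring hypothesis of an N-best list."""
    if not n_best:
        raise ValueError("module returned an empty N-best list")
    return max(n_best, key=lambda pair: pair[1])[0]


@dataclass
class S2STPipeline:
    """Hypothetical client chaining ASR -> MT -> TTS modules, forwarding each 1-best result."""
    asr: Callable[[bytes, str], NBest]    # (audio, source_lang) -> N-best transcriptions
    mt: Callable[[str, str, str], NBest]  # (text, source_lang, target_lang) -> N-best translations
    tts: Callable[[str, str], bytes]      # (text, target_lang) -> synthesized audio

    def translate(self, audio: bytes, source_lang: str, target_lang: str) -> Tuple[str, str, bytes]:
        transcript = one_best(self.asr(audio, source_lang))
        translation = one_best(self.mt(transcript, source_lang, target_lang))
        speech = self.tts(translation, target_lang)
        return transcript, translation, speech


# Stub modules standing in for remote ASR, MT and TTS servers (illustrative only).
def stub_asr(audio: bytes, lang: str) -> NBest:
    return [("where is the station", 0.7), ("where is this station", 0.2)]


def stub_mt(text: str, source_lang: str, target_lang: str) -> NBest:
    table = {("where is the station", "de"): [("Wo ist der Bahnhof?", 0.9)]}
    return table.get((text, target_lang), [])


def stub_tts(text: str, lang: str) -> bytes:
    return text.encode("utf-8")  # placeholder: a real TTS server would return audio samples


pipeline = S2STPipeline(asr=stub_asr, mt=stub_mt, tts=stub_tts)
transcript, translation, speech = pipeline.translate(b"\x00\x01", "en", "de")
print(transcript, "->", translation)  # where is the station -> Wo ist der Bahnhof?
```

Forwarding only the 1-best hypothesis is the simplest design. Passing the whole N-best list downstream would let the next module rescore alternatives, at the cost of larger messages between servers.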
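Definition 3.1.3 describes PCM only in general terms. The following sketch shows the idea for one common variant. It assumes 16-bit signed linear PCM, little-endian byte order and an 8 kHz sampling rate; these are illustrative choices, not requirements stated in this section. A sine tone stands in for an analog voice signal. The tone is sampled, each sample is quantized to an integer code, and the codes are packed into a byte stream. ADPCM (3.1.1) would further compress such a stream through adaptive prediction and adaptive quantization, which this sketch does not attempt.

```python
import math
import struct

SAMPLE_RATE_HZ = 8000            # assumed sampling rate
BITS = 16                        # assumed linear PCM sample width
MAX_CODE = 2 ** (BITS - 1) - 1   # 32767, the largest positive 16-bit code


def quantize(x: float) -> int:
    """Map an amplitude in [-1.0, 1.0] to a signed 16-bit code, clipping out-of-range input."""
    x = max(-1.0, min(1.0, x))
    return int(round(x * MAX_CODE))


def pcm_encode(samples) -> bytes:
    """Quantize samples and pack them as little-endian signed 16-bit integers."""
    codes = [quantize(s) for s in samples]
    return struct.pack(f"<{len(codes)}h", *codes)


def pcm_decode(data: bytes) -> list:
    """Unpack little-endian 16-bit codes back into amplitudes in [-1.0, 1.0]."""
    count = len(data) // 2
    return [code / MAX_CODE for code in struct.unpack(f"<{count}h", data)]


# Sample 10 ms of a 440 Hz tone standing in for an analog voice signal.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ // 100)]
stream = pcm_encode(tone)
print(len(stream))  # 80 samples * 2 bytes = 160
```

Decoding recovers each amplitude to within half a quantization step (0.5 / 32767). That bounded error is the trade-off PCM makes for a simple, fixed bit rate.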