TECHNICAL REPORT ISO/TR 19358
First edition 2002-10-01
Reference number ISO/TR 19358:2002(E)
© ISO 2002

Ergonomics: Construction and application of tests for speech technology

Ergonomie: Élaboration et mise en œuvre des tests des systèmes de technologie de la parole

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

© ISO 2002. All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
E-mail copyright@iso.ch
Web www.iso.ch
Printed in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Terms and definitions
3 Description of speech technologies
3.1 Introduction
3.2 Available technologies
4 Description of relevant variables related to speech technology
4.1 Introduction
4.2 Speech type
4.3 Speaker (specification of speaker-dependent aspects)
4.4 Task (application-specific description of relevant recognition parameters)
4.5 Training (task-related training aspects)
4.6 Environment (specification of the speech quality in a specific environment, for both input and output)
4.7 Input (specification of the transmission of the speech signal from the microphone to a recognizer input)
4.8 Specification of speech technology modules
5 Assessment methods
5.1 General
5.2 Field vs. laboratory evaluation
5.3 System transparency
5.4 Subjective vs. objective methods
5.5 Speech recognition systems
5.6 Speech synthesis systems
5.7 Speaker identification and verification
5.8 Corpora
5.9 Related sources of information
Annex A (informative) Example of assessment
Annex B (informative) Performance measures
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 3.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

In exceptional circumstances, when a technical committee has collected data of a different kind from that which is normally published as an International Standard (“state of the art”, for example), it may decide by a simple majority vote of its participating members to publish a Technical Report. A Technical Report is entirely informative in nature and does not have to be reviewed until the data it provides are considered to be no longer valid or useful.

Attention is drawn to the possibility that some of the elements of this Technical Report may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO/TR 19358 was prepared by Technical Committee ISO/TC 159, Ergonomics, Subcommittee SC 5, Ergonomics of the physical environment.

Introduction

This Technical Report advises on methods for determining the performance of speech-technology systems (automatic speech recognizers, text-to-speech systems and other devices that make use of the speech signal) and on selecting
appropriate test procedures. Human-to-human speech communication is not included in this Technical Report but is covered by ISO 9921.

1 Scope

This Technical Report deals with the testing and assessment of speech-related products and services, and is intended for use by specialists active in the field of speech technology, as well as purchasers and users of such systems. Advanced users are referred to the detailed evaluation chapters of the EAGLES Handbook of Standards and Resources for Spoken Language Systems (Gibbon et al. 1997) and the EAGLES Handbook of Multimodal and Spoken Dialogue Systems. EAGLES was a research project partly sponsored by the European Community.

2 Terms and definitions

For the purposes of this Technical Report, the following terms and definitions apply.

2.1 Automatic Speech Recognition (ASR)
ability of a system to accept human speech as a means of input

2.2 dialogue
interactive exchange of information between the speech system and the human speaker

2.3 dialogue management
control of the dialogue between the speech system and
the human

2.4 Natural Language Processing (NLP)
automatic processing of text originating from humans

2.5 objective assessment
assessment without direct involvement of human subjects during measurement, typically using prerecorded speech

2.6 performance measures
means used to assess the system performance, typically by diagnostic or relative performance methods

2.7 speaker-dependent system
speech-recognition system that has to be trained with the speech of the specific user

2.8 speaker identification
identification of a particular speaker from a closed set of possible speakers

2.9 speaker-independent system
system not trained for a specific user but applicable for any user of a selected group (native speakers, adults, etc.)

2.10 speaker recognition
general term for technology which identifies or verifies the identity of a speaker

2.11 speaker verification
verification of the identity of a person by assessment of specific aspects of his/her speech

2.12 speaking style
speech may be isolated or continuous, read or spontaneous, or dictated

2.13 speech communication
conveying or exchanging information using speech, speaking, and hearing modalities

NOTE Speech communication may involve brief texts, sentences, groups of words, isolated words, hums and parts of words.

2.14 speech recognizer
process in a machine capable of converting spoken language to recognized words

NOTE This is the process by which a computer transforms an acoustic
speech signal into text.

2.15 speech synthesis
generation of speech from data

2.16 speech understanding
technology that extracts the semantic contents of speech

2.17 subjective assessment
assessment with the direct involvement of human subjects during measurement

2.18 text-to-speech synthesis
generation of audible speech from a text

2.19 vocabulary
set of words used in a particular context

2.20 vocabulary size
number of words in a vocabulary of the speech recognizer

3 Description of speech technologies

3.1 Introduction

Speech technology includes the automatic recognition of speech and of the speaker, speech synthesis, etc. Natural Language Processing (NLP) includes the understanding of text items and the management of a dialogue between a human speaker and a machine. Modern technologies are mostly based on algorithms that make use of
digital-signal processing embedded in a digital-signal processor or a (personal) computer system. The algorithms produce near real-time responses. The performance depends on the application. For example, a speech-recognition system designed for use with a small vocabulary and trained with speech from a single user (e.g., control of a personal hand-held telephone) will generally perform (for this particular user) much better than a system designed for a domain with a large vocabulary and generally for a large group of unknown users (e.g., information services through a public telephone network).

For speech products and services, four main categories can be identified:

a) Command and Control. The interface between a user and a system is accomplished by automatic speech recognition (ASR). ASR is normally used in a multimodal design, in which the control of a system by speech is one of the possible modalities (a keyboard, mouse or touch screen, for example, may be an alternative modality). Control by an ASR system may be essential in “hands busy” situations.

b) Services and Telephone Applications. Services such as an information kiosk normally require a combination of speech recognition, understanding, speech synthesis and dialogue management in order to control the unsupervised dialogue between user and system. Present state-of-the-art systems cover relatively simple dialogue structures such as travel-information systems (day, time and “from-to”), and call centres (selection of the
required information).

c) Document Generation. Dictation systems trained for many languages are presently on the market. These systems can be linked to standard word-processing systems. Simple applications include data entry for a specific user domain (e.g. medical reports); more complex systems allow dictation of full documents and control of the text-processing system. These more complex systems are often trained for a large vocabulary and speaker-dependent use. However, for acceptable performance, the system has to be familiarized with the user and the domain of use. This is often accomplished in two steps: by an (adaptive) acoustical training session in which the user has to read a predefined text, and by presentation of a number of documents written for the user, which are used to extend the vocabulary and to modify the language model.

d) Document Retrieval. Retrieval of complete documents (from a spoken-document archive) and retrieval of specific passages from a document or of utterances from a specific speaker are of interest for archive documentation and management and for the compilation of overviews. Various technologies, such as ASR, word spotting and speaker recognition, are used for labelling the speech utterances. Specific search algorithms are used to retrieve the required information.

3.2 Available technologies

3.2.1 Speech recognition

Automatic speech-recognition systems are capable of producing a transcription (text string) from a speech signal. For this purpose, trained systems are used. Modern systems, for use with a large vocabulary, extract specific spectral parameters that identify subunits (phonemes) from the speech signal. Words are described in terms of strings of these phonemes. The recognition architecture may require various levels related to models of the phonemes (phone models), words (vocabulary) and the statistical description of word combinations (language model). Phone models are normally trained for a large number of speakers, resulting in a statistically based representation. The statistical approach is normally based on a Hidden Markov Model (HMM) or a Neural Network (NN). The vocabulary and the language model are obtained from digitally available texts that are representative of the application domain.

3.2.2 Speaker identification and verification

Automatic speaker identification is the capability to identify a speaker from a group of known speakers. It answers the question “To whom does this speech sample belong?” This technology involves two steps: modelling the speech of the speaker population (training) and comparing the unknown speech to all of the
speaker models (testing).

Speaker verification is a method of confirming that a speaker is the person that he or she claims to be. The heart of the speaker-verification system is an algorithm that compares an utterance from the speaker with a model built from training utterances gathered from the authorized user during an enrolment phase. If the speech matches the model within some required tolerance threshold, the speaker is accepted as having the claimed identity. In order to protect against an intruder attempting to fool the system by making a recording of the voice of the authorized user, the verification system will usually prompt the speaker to say particular phrases, such as sequences of numbers, which are selected to be different each time the user tries to gain entry. The speaker-verification system is combined with a recognition system to ensure that the proper phrase was spoken.

3.2.3 Speech synthesis

For speech synthesis two methods are used. The first, generally known as “canned speech”, is generated on the basis of prestored messages. Coding techniques that compress the messages are normally used to save storage space. With this type of synthesis, high-quality speech can be obtained, especially for quick-response applications that make use of a number of standard responses.

The second method, “text-to-speech synthesis”, allows the generation of any message from a written text. This generally involves a first stage of linguistic processing, in which the text input is converted into an internal representation of phonemes and prosodic markers, and a second stage of sound generation on the basis of this internal representation. The sound generation can be made either entirely by rule, typically using complex models of the speech-production mechanism (formant synthesis, intonation), or by concatenating short prestored units (concatenative synthesis). The speech quality obtained with concatenative synthesis is generally considered higher.

3.2.4 Speech understanding

Speech-understanding systems can be