ETSI ES 201 108 V1.1.3 (2003): Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms

Resource description

ETSI ES 201 108 V1.1.3 (2003-09)
ETSI Standard

Speech Processing, Transmission and Quality Aspects (STQ);
Distributed speech recognition;
Front-end feature extraction algorithm;
Compression algorithms

Reference
RES/STQ-00044

Keywords
speech, performance, transmission

ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE
Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
Siret N° 348 623 562 00017 - NAF 742 C
Association à but non lucratif enregistrée à la Sous-Préfecture de Grasse (06) N° 7803/88

Important notice

Individual copies of the present document can be downloaded from: http://www.etsi.org

The present document may be made available in more than one electronic version or in print. In any case of existing or perceived difference in contents between such versions, the reference version is the Portable Document Format (PDF). In case of dispute, the reference shall be the printing on ETSI printers of the PDF version kept on a specific network drive within ETSI Secretariat.

Users of the present document should be aware that the document may be subject to revision or change of status. Information on the current status of this and other ETSI documents is available at http://portal.etsi.org/tb/status/status.asp

If you find errors in the present document, send your comment to: editor@etsi.org

Copyright Notification

No part may be reproduced except as authorized by written permission. The copyright and the foregoing restriction extend to reproduction in all media.

© European Telecommunications Standards Institute 2003. All rights reserved.

DECT™, PLUGTESTS™ and UMTS™ are Trade Marks of ETSI registered for the benefit of its Members. TIPHON™ and the TIPHON logo are Trade Marks currently being registered by ETSI for the benefit of its Members. 3GPP™ is a Trade Mark of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners.

Contents

Intellectual Property Rights
Foreword
Introduction
1 Scope
2 References
3 Definitions, symbols and abbreviations

3.1 Definitions
3.2 Symbols
3.3 Abbreviations
4 Front-end feature extraction algorithm
4.1 Introduction
4.2 Front-end algorithm description
4.2.1 Front-end block diagram
4.2.2 Analog-to-digital conversion
4.2.3 Offset compensation
4.2.4 Framing
4.2.5 Energy measure
4.2.6 Pre-emphasis
4.2.7 Windowing
4.2.8 FFT
4.2.9 Mel filtering
4.2.10 Non-linear transformation
4.2.11 Cepstral coefficients
4.2.12 Front-end output
5 Feature Compression Algorithm
5.1 Introduction
5.2 Compression algorithm description
5.2.1 Input
5.2.2 Vector quantization
6 Framing, Bit-Stream Formatting, and Error Protection
6.1 Introduction
6.2 Algorithm description
6.2.1 Multiframe Format
6.2.2 Synchronization Sequence
6.2.3 Header Field
6.2.4 Frame Packet Stream
7 Bit-Stream Decoding and Error Mitigation
7.1 Introduction
7.2 Algorithm description
7.2.1 Synchronization Sequence Detection
7.2.2 Header Decoding
7.2.3 Feature Decompression
7.2.4 Error Mitigation
7.2.4.1 Detection of frames received with errors
7.2.4.2 Substitution of parameter values for frames received with errors
Annex A (informative): Bibliography

History

Intellectual Property Rights

IPRs essential or potentially essential to the present document may have been declared to ETSI. The information pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found in ETSI SR 000 314: "Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards", which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web server (http://webapp.etsi.org/IPR/home.asp).

Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.

Foreword

This ETSI Standard (ES) has been produced by ETSI Technical Committee Speech Processing, Transmission and Quality Aspects (STQ).

Introduction

The performance of speech recognition systems receiving speech that has been transmitted over mobile channels can be significantly degraded when compared to using an unmodified signal. The degradations result from both the low bit rate speech coding and channel transmission errors. A Distributed Speech Recognition (DSR) system overcomes these problems by eliminating the speech channel and instead using an error protected data channel to send a parameterized representation of the speech, which is suitable for recognition. The processing is distributed between the terminal and the network. The terminal performs the feature parameter extraction, or the front-end of the speech recognition system. These features are transmitted over a data channel to a remote "back-end" recognizer. The end result is that the transmission channel does not affect the recognition system performance and channel invariability is achieved.

The present document presents the standard for a front-end to ensure compatibility between the terminal and the remote recognizer. The particular front-end used is called the Mel-Cepstrum, which has been used extensively in speech recognition systems.

1 Scope

The present document specifies algorithms for front-end feature extraction and their transmission which form part of a system for distributed speech recognition. The specification covers the following components:

- the algorithm for front-end feature extraction to create Mel-Cepstrum parameters;
- the algorithm to compress these features to provide a lower data transmission rate;
- the formatting of these features with error protection into a bitstream for transmission;
- the decoding of the bitstream to generate the front-end features at a receiver together with the associated algorithms for channel error mitigation.

The present document does not cover the "back-end" speech recognition algorithms that make use of the received DSR front-end features.

The algorithms are defined in a mathematical form or as flow diagrams. Software implementing these algorithms written in the "C" programming language is contained in the .ZIP file which accompanies the present document. Conformance tests are not specified as part of the standard. The recognition performance of proprietary implementations of the standard can be compared with those obtained using the reference "C" code on appropriate speech databases.

It is anticipated that the DSR bitstream will be used as a payload in other higher level protocols when deployed in specific systems supporting DSR applications. In particular, for packet data transmission, it is anticipated that the IETF AVT RTP DSR payload definition (see bibliography) will be used to transport DSR features using the frame pair format described in clause 6.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

References are either specific (identified by date of publication and/or edition number or version number) or non-specific.

For a specific reference, subsequent revisions do not apply.

For a non-specific reference, the latest version applies.

Referenced documents which are not found to be publicly available in the expected location might be found at http://docbox.etsi.org/Reference.

[1] ETSI EN 300 903: "Digital cellular telecommunications system (Phase 2+); Transmission planning aspects of the speech service in the GSM Public Land Mobile Network (PLMN) system (GSM 03.50)".

3 Definitions, symbols and abbreviations

3.1 Definitions

For the purposes of the present document, the following terms and definitions apply:

analog-to-digital conversion: electronic process in which a continuously variable (analog) signal is changed, without altering its essential content, into a multi-level (digital) signal

DC offset: direct current (DC) component of the waveform signal

discrete cosine transform: process of transforming the log filterbank amplitudes into cepstral coefficients

fast Fourier transform: fast algorithm for performing the discrete Fourier transform to compute the spectrum representation of a time-domain signal

feature compression: process of reducing the amount of data to represent the speech features calculated in feature extraction

feature extraction: process of calculating a compact parametric representation of speech signal features which are relevant for speech recognition
NOTE: The feature extraction process is carried out by the front-end algorithm.

feature vector: set of feature parameters (coefficients) calculated by the front-end algorithm over a segment of speech waveform

framing: process of splitting the continuous stream of signal samples into segments of constant length to facilitate blockwise processing of the signal

frame pair packet: combined data from two quantized feature vectors together with 4 bits of CRC

front-end: part of a speech recognition system which performs the process of feature extraction

magnitude spectrum: absolute-valued Fourier transform representation of the input signal

multiframe: grouping of multiple frame vectors into a larger data structure

mel-frequency warping: process of non-linearly modifying the scale of the Fourier transform representation of the spectrum

mel-frequency cepstral coefficients: cepstral coefficients calculated from the mel-frequency warped Fourier transform representation of the log magnitude spectrum

notch filtering: filtering process in which the otherwise flat frequency response of the filter has a sharp notch at a pre-defined frequency. In the present document, the notch is placed at the zero frequency, to remove the DC component of the signal

offset compensation: process of removing DC offset from a signal

pre-emphasis: filtering process in which the frequency response of the filter has emphasis at a given frequency range. In the present document, the high-frequency range of the signal spectrum is pre-emphasized

sampling rate: number of samples of an analog signal that are taken per second to represent it digitally

windowing: process of multiplying a waveform signal segment by a time window of given shape, to emphasize pre-defined characteristics of the signal

zero padding: method of appending zero-valued samples to the end of a segment of speech samples for performing an FFT operation
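Several of the definitions above (framing, pre-emphasis, windowing, zero padding) describe simple sample-level operations, and a small sketch may make them more concrete. The following Python fragment is only an illustration under stated assumptions: the function names, the 200-sample frame length, the 80-sample frame shift, the 256-point FFT block, the 0.97 pre-emphasis coefficient and the Hamming window shape are all illustrative choices that do not come from the definitions themselves, and applying pre-emphasis per frame with the first sample passed through unchanged is a simplification made for this example.

```python
import math


def split_into_frames(samples, frame_length, frame_shift):
    """Framing: cut the sample stream into overlapping segments of constant length."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


def pre_emphasize(frame, coeff=0.97):
    """Pre-emphasis: y[n] = x[n] - coeff * x[n-1], which boosts high frequencies.

    coeff=0.97 is an assumed value; the first sample is passed through unchanged.
    """
    return [frame[0]] + [frame[n] - coeff * frame[n - 1] for n in range(1, len(frame))]


def hamming_window(frame):
    """Windowing: multiply the segment by a Hamming window (an assumed window shape)."""
    last = len(frame) - 1
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / last))
            for n, x in enumerate(frame)]


def zero_pad(frame, fft_length):
    """Zero padding: append zero-valued samples up to the FFT block length."""
    return list(frame) + [0.0] * (fft_length - len(frame))


if __name__ == "__main__":
    # One-eighth of a second of a 440 Hz tone sampled at 8 kHz (illustrative input).
    signal = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(1000)]
    frames = split_into_frames(signal, frame_length=200, frame_shift=80)
    block = zero_pad(hamming_window(pre_emphasize(frames[0])), fft_length=256)
    print(len(frames), len(block))
```

Each helper takes and returns a plain list so that the steps can be chained in any order; the order used in the usage lines is one plausible arrangement, not a statement of the order the front-end prescribes.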

3.2 Symbols

For the purposes of the present document, the following symbols apply:

For feature extraction (clause 4):

bin_k              absolute value of complex-valued FFT output vector k
C_i                ith cepstral coefficient
cbin_i             centre frequency of the ith Mel channel in terms of FFT bin indices
fbank_k            output of Mel filter for filter bank k
fc_i               centre frequency of the ith Mel channel
f_i                log filter bank output for the ith Mel channel
f_s                input signal sampling rate
f_s1, f_s2, f_s3   symbols for specific input signal sampling rates (8 kHz, 11 kHz, 16 kHz)
f_start            starting frequency of Mel filter bank
FFTL               length of FFT block
ln()               natural logarithm operation
log10()            10-base logarithm operation
M                  frame shift interval
Mel                Mel scaling operator
Mel^-1             inverse Mel scaling operator
N                  frame length
round              operator for rounding towards nearest integer
s_in               input speech signal
s_of               offset-free input speech signal
s_pe               speech signal after pre-emphasis operation
s_w                windowed speech signal

For compression (clause 5):

idx^{i,i+1}(m)     codebook index
m                  frame number
N^{i,i+1}          size of the compression codebook
Q^{i,i+1}          compression codebook
q_j^{i,i+1}        jth codevector in the codebook Q^{i,i+1}
W^{i,i+1}          weight matrix
y(m)               feature vector with 14 components

For error mitigation:

badframeindex      indicator if received VQ index is suspected to be received with transmission error
T_i                threshold on cepstral coefficient

3.3 Abbreviations

For the purposes of the present document, the following abbreviations apply:

ADC    analog-to-digital conversion
CRC    cyclic redundancy code
DSR    distributed speech recognition
logE   logarithmic frame energy
LSB    least significant bit
MSB    most significant bit
VQ     vector quantizer

4 Front-end feature extraction algorithm

4.1 Introduction

This clause describes the distributed speech recognition front-end algorithm based on the mel-cepstral feature extraction technique. The specification covers the computation of feature vectors from speech waveforms sampled at different rates (8 kHz, 11 kHz, and 16 kHz).

The feature vectors consist of 13 static cepstral coefficients and a log-energy coefficient.

The feature extraction algorithm defined in this clause forms a generic part of the specification, while clauses 5 and 6 define the feature compression and bit-stream formatting algorithms which may be used in specific applications.

The characteristics of the input audio parts of a DSR terminal will have an effect on the resulting recognition performance at the remote server. Developers of DSR speech recognition servers can assume that the DSR terminals will operate within the ranges of characteristics as specified in EN 300 903 [1]. DSR terminal developers should be aware that reduced recognition performance may be obtained if they operate outside the recommended tolerances.

4.2 Front-end algorithm description

4.2.1 Front-end block diagram

The following block diagram shows the different blocks of the front-end algorithm.

The details of the analog-to-digital conversion (ADC) are not subject to the present document, but the block has been included to account for the different sampling rates. The blocks Feature Compression and Bit Stream Formatting are covered in clauses 5 and 6 of the present document.

[Figure 4.1 blocks, as labelled in the diagram: Input speech, ADC, Offcom, Framing, PE, W, FFT, MF, LOG, DCT, logE, Feature Compression, Bit Stream Formatting, Framing, To transmission channel]

Abbreviations:
ADC      analog-to-digital conversion
Offcom   offset compensation
PE       pre-emphasis
logE     energy measure computation
W        windowing
FFT      fast Fourier transform (only magnitude components)
MF       mel-filtering
LOG      nonlinear transformation
DCT      discrete cosine transform
MFCC     mel-frequency cepstral coefficient

Figure 4.1: Block diagram of the front-end algorithm

4.2.2 Analog-to-digital conversion

The specifics of the analog-to-digital conversion are not part of the present document. Different word-lengths can be used depending on the application.

The output sampling rates of the ADC block are f_s1 = 8 kHz, f_s2 = 11 kHz, and f_s3 = 16 kHz.

4.2.3 Offset compensation

Prior to the framing, a notch filtering operation is applied to the digital samples of the input speech signal s_in to remov
