ETSI ES 202 212-2005 Speech Processing Transmission and Quality Aspects (STQ) Distributed speech recognition Extended advanced front-end feature extraction algorithm Compression al_1.pdf

ETSI ES 202 212 V1.1.2 (2005-11)
ETSI Standard

Speech Processing, Transmission and Quality Aspects (STQ);
Distributed speech recognition;
Extended advanced front-end feature extraction algorithm;
Compression algorithms;
Back-end speech reconstruction algorithm

Reference: RES/STQ-00084a
Keywords: performance, speech, transmission

ETSI
650 Route des Lucioles
F-06921 Sophia Antipolis Cedex - FRANCE
Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16
Siret N° 348 623 562 00017 - NAF 742 C
Association à but non lucratif enregistrée à la Sous-Préfecture de Grasse (06) N° 7803/88

Important notice

Individual copies of the present document can be downloaded from: http://www.etsi.org

The present document may be made available in more than one electronic version or in print. In any case of existing or perceived difference in contents between such versions, the reference version is the Portable Document Format (PDF). In case of dispute, the reference shall be the printing on ETSI printers of the PDF version kept on a specific network drive within ETSI Secretariat.

Users of the present document should be aware that the document may be subject to revision or change of status. Information on the current status of this and other ETSI documents is available at http://portal.etsi.org/tb/status/status.asp

If you find errors in the present document, please send your comment to one of the following services: http://portal.etsi.org/chaircor/ETSI_support.asp

Copyright Notification

No part may be reproduced except as authorized by written permission. The copyright and the foregoing restriction extend to reproduction in all media.

© European Telecommunications Standards Institute 2005. All rights reserved.

DECT™, PLUGTESTS™ and UMTS™ are Trade Marks of ETSI registered
for the benefit of its Members. TIPHON™ and the TIPHON logo are Trade Marks currently being registered by ETSI for the benefit of its Members. 3GPP™ is a Trade Mark of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners.

Contents

Intellectual Property Rights
Foreword
Introduction
1 Scope
2 References
3 Definitions, symbols and abbreviations
3.1 Definitions
3.2 Symbols
3.3 Abbreviations
4 System overview
5 Feature extraction description
5.1 Noise reduction
5.1.1 Two stage mel-warped Wiener filter approach
5.1.2 Buffering
5.1.3 Spectrum estimation
5.1.4 Power spectral density mean
5.1.5 Wiener filter design
5.1.6 VAD for noise estimation (VADNest)
5.1.7 Mel filter-bank
5.1.8 Gain factorization
5.1.9 Mel IDCT
5.1.10 Apply filter
5.1.11 Offset compensation
5.2 Waveform Processing
5.3 Cepstrum Calculation
5.3.1 Log energy calculation
5.3.2 Pre-emphasis (PE)
5.3.3 Windowing (W)
5.3.4 Fourier transform (FFT) and power spectrum estimation
5.3.5 Mel Filtering (MEL-FB)
5.3.6 Non-linear transformation (Log)
5.3.7 Cepstral coefficients (DCT)
5.3.8 Cepstrum calculation output
5.4 Blind equalization
5.5 Extension to 11 kHz and 16 kHz sampling frequencies
5.5.1 FFT-based spectrum estimation
5.5.2 Mel Filter-Bank
5.5.3 High-frequency band coding and decoding
5.5.4 VAD for noise estimation and spectral subtraction in high-frequency bands
5.5.5 Merging spectral subtraction bands with decoded bands
5.5.6 Log energy calculation for 16 kHz
5.6 Pitch and class estimation
5.6.1 Spectrum and energy computation
5.6.2 Voice Activity Detection for Voicing Classification (VADVC)
5.6.3 Low-band noise detection
5.6.4 Pre-Processing for pitch and class estimation
5.6.5 Pitch estimation
5.6.5.1 Dirichlet interpolation
5.6.5.2 Non-speech and low-energy frames
5.6.5.3 Search ranges specification and processing
5.6.5.4 Spectral peaks determination
5.6.5.5 F0 Candidates generation
5.6.5.6 Computing correlation scores
5.6.5.7 Pitch estimate selection
5.6.5.8 History information update
5.6.5.9 Output pitch value
5.6.6 Classification
6 Feature compression
6.1 Introduction
6.2 Compression algorithm description
6.2.1 Input
6.2.2 Vector quantization
6.2.3 Pitch and class quantization
6.2.3.1 Class quantization
6.2.3.2 Pitch quantization
7 Framing, bit-stream formatting and error protection
7.1 Introduction
7.2 Algorithm description
7.2.1 Multiframe format
7.2.2 Synchronization sequence
7.2.3 Header field
7.2.4 Frame packet stream
8 Bit-stream decoding and error mitigation
8.1 Introduction
8.2 Algorithm description
8.2.1 Synchronization sequence detection
8.2.2 Header decoding
8.2.3 Feature decompression
8.2.4 Error mitigation
8.2.4.1 Detection of frames received with errors
8.2.4.2 Substitution of parameter values for frames received with errors
8.2.4.3 Modification of parameter values for frames received with errors
9 Server feature processing
9.1 lnE and c(0) combination
9.2 Derivatives calculation
9.3 Feature vector selection
10 Server side speech reconstruction
10.1 Introduction
10.2 Algorithm description
10.2.1 Speech reconstruction block diagram
10.2.2 Pitch Tracking and Smoothing
10.2.2.1 First stage - gross pitch error correction
10.2.2.2 Second stage - voiced/unvoiced decision and other corrections
10.2.2.3 Third stage - smoothing
10.2.2.4 Voicing class correction
10.2.3 Harmonic Structure Initialization
10.2.4 Unvoiced phase synthesis
10.2.5 Cepstra de-equalization
10.2.6 Transformation of features extracted at 16 kHz
10.2.7 Harmonic magnitudes reconstruction
10.2.7.1 High order cepstra recovery
10.2.7.2 Solving front-end equation
10.2.7.3 Cepstra to magnitudes transformation
10.2.7.4 Combined magnitudes estimate calculation
10.2.7.4.1 Combined magnitude estimate for unvoiced harmonics
10.2.7.4.2 Combined magnitude estimate for voiced harmonics
10.2.8 All-pole spectral envelope modelling
10.2.9 Postfiltering
10.2.10 Voiced phase synthesis
10.2.11 Line spectrum to time-domain transformation
10.2.11.1 Mixed-voiced frames processing
10.2.11.2 Filtering very high-frequency harmonics
10.2.11.3 Energy normalization
10.2.11.4 STFT spectrum synthesis
10.2.11.5 Inverse FFT
10.2.12 Overlap-Add
Annex A (informative): Voice Activity Detection (VAD)
A.1 Introduction
A.2 Stage 1 - Detection
A.3 Stage 2 - VAD Logic
Annex B (informative): Bibliography
History

Intellectual Property Rights

IPRs essential or potentially essential to the present document may have been declared to ETSI. The information pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found in ETSI SR 000 314: “Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards”, which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web server (http://webapp.etsi.org/IPR/home.asp).

Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.

Foreword

This ETSI Standard (ES) has been produced by ETSI Technical Committee Speech Processing, Transmission and Quality Aspects (STQ).

Introduction

The performance of speech recognition systems receiving speech that has been transmitted over mobile channels can be significantly degraded when compared
to using an unmodified signal. The degradations are as a result of both the low bit rate speech coding and channel transmission errors. A Distributed Speech Recognition (DSR) system overcomes these problems by eliminating the speech channel and instead using an error protected data channel to send a parameterized representation of the speech, which is suitable for recognition. The processing is distributed between the terminal and the network. The terminal performs the feature parameter extraction, or the front-end of the speech recognition system. These features are transmitted over a data channel to a remote “back-end” recognizer. The end result is that the degradation in performance due to transcoding on the voice channel is removed and channel invariability is achieved.

The present document presents a standard for a front-end to ensure compatibility between the terminal and the remote recognizer. The first ETSI standard DSR front-end ES 201 108 [1] was published in February 2000 and is based on the Mel-Cepstrum representation that has been used extensively in speech recognition systems. This second standard is for an Advanced DSR front-end that provides substantially improved recognition performance in background noise. Evaluation of the performance during the selection of the present document showed an average of 53 % reduction in speech recognition error rates in noise compared to ES 201 108 [1].

For some applications, it may be necessary to reconstruct the speech waveform
at the back-end. Examples include:

- Interactive Voice Response (IVR) services based on the DSR of “sensitive” information, such as banking and brokerage transactions. DSR features may be stored for future human verification purposes or to satisfy procedural requirements.
- Human verification of utterances in a speech database collected from a deployed DSR system. This database can then be used to retrain and tune models in order to improve system performance.
- Applications where machine and human recognition are mixed (e.g. human assisted dictation).

In order to enable the reconstruction of speech
waveform at the back-end, additional parameters such as fundamental frequency (F0) and voicing class need to be extracted at the front-end, compressed, and transmitted. The availability of tonal parameters (F0 and voicing class) is also useful in enhancing the recognition accuracy of tonal languages, e.g. Mandarin, Cantonese, and Thai.

The present document specifies a proposed standard for an Extended Advanced Front-End (XAFE) that extends the noise-robust advanced front-end with additional parameters, viz., fundamental frequency F0 and voicing class. It also specifies the back-end speech reconstruction algorithm using the transmitted parameters.

1 Scope

The present document specifies algorithms for extended advanced front-end feature extraction, their transmission, back-end pitch tracking and smoothing, and back-end speech reconstruction which form part of a system for distributed speech recognition. The specification covers the following components:

a) the algorithm for advanced front-end feature extraction to create Mel-Cepstrum parameters;
b) the algorithm for extraction of additional parameters, viz., fundamental frequency F0 and voicing class;
c) the algorithm to compress these features to provide a lower data transmission rate;
d) the formatting of these features with error protection into a bitstream for transmission;
e) the decoding of the bitstream to generate the advanced front-end features at a receiver together with the associated algorithms for channel error mitigation;
f) the algorithm for pitch tracking and smoothing at the back-end to minimize pitch errors;
g) the algorithm for speech reconstruction at the back-end to synthesize intelligible speech.

NOTE: The components a), c), d) and e) are already covered by the ES 202 050 [2]. Besides these (four) components, the present document covers the components b), f) and g) to provide back-end speech reconstruction and enhanced tonal language recognition capabilities. If these capabilities are not of interest, the reader is better served by (un-extended) ES 202 050
[2].

The present document does not cover the “back-end” speech recognition algorithms that make use of the received DSR advanced front-end features.

The algorithms are defined in a mathematical form, pseudo-code, or as flow diagrams. Software implementing these algorithms written in the C programming language is contained in the ZIP file es_202212v010101p0.zip which accompanies the present document. Conformance tests are not specified as part of the standard. The recognition performance of proprietary implementations of the standard can be compared with those obtained using the reference C code
on appropriate speech databases.

It is anticipated that the DSR bitstream will be used as a payload in other higher level protocols when deployed in specific systems supporting DSR applications. In particular, for packet data transmission, it is anticipated that the IETF AVT RTP DSR payload definition (see bibliography) will be used to transport DSR features using the frame pair format described in clause 7.

The extended advanced DSR standard is designed for use with discontinuous transmission and to support the transmission of Voice Activity information. Annex A describes a VAD algorithm that is recommended for use in conjunction with the Advanced DSR standard; however, it is not part of the present document, and manufacturers may choose to use an alternative VAD algorithm.

The Extended Advanced Front-End (XAFE) incorporates tonal information, viz., fundamental frequency F0 and voicing class, as additional parameters. This information can be used for enhancing the recognition accuracy of tonal languages, e.g. Mandarin, Cantonese, and Thai.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document. References are either specific (identified by date of publication and/or edition number or version number) or non-specific. For a specific reference, subsequent revisions do not apply. For a non-specific reference, the latest version applies.

Referenced documents which are not found to be publicly available in the expected location might be found at http://docbox.etsi.org/Reference.

[1] ETSI ES 201 108: “Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms”.
[2] ETSI ES 202 050: “Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms”.
[3] ETSI EN 300 903: “Digital cellular telecommunications system (Phase 2+) (GSM); Transmission planning aspects of the speech service in the GSM Public Land Mobile Network (PLMN) system (GSM 03.50)”.

3 Definitions, symbols and abbreviations

3.1 Definitions

For the purposes of the present document, the following terms and definitions apply:

analog-to-digital conversion:
