INTERNATIONAL STANDARD
ISO/IEC 15938-4
First edition 2002-06-15
Reference number ISO/IEC 15938-4:2002(E)
© ISO/IEC 2002

Information technology - Multimedia content description interface - Part 4: Audio
Technologies de l'information - Interface de description du contenu multimédia - Partie 4: Audio

Adopted by INCITS (InterNational Committee for Information Technology Standards) as an American National Standard.
Date of ANSI Approval: 12/3/2002
Published by American National Standards Institute, 25 West 43rd Street, New York, New York 10036
Copyright 2002 by Information Technology Industry Council (ITI). All rights reserved.

These materials are subject to copyright claims of International Standardization Organization (ISO), International Electrotechnical Commission (IEC), American National Standards Institute (ANSI), and Information Technology Industry Council (ITI). Not for resale. No part of this publication may be reproduced in any form, including an electronic retrieval system, without the prior written permission of ITI. All requests pertaining to this standard should be submitted to ITI, 1250 Eye Street NW, Washington, DC 20005.
Printed in the United States of America

PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

© ISO/IEC 2002
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.ch
Web www.iso.ch
Printed in Switzerland

Contents

Foreword
Introduction
1 Scope
1.1 Definition of Scope
1.2 Fields of application
2 Terms and definitions
3 Symbols and abbreviated terms
4 Conventions
4.1 Description Definition Language
4.2 Audio representation
5 Audio Framework
5.1 Introduction
5.2 Scalable Series
5.2.1 Introduction
5.2.2 ScalableSeriesType
5.2.3 SeriesOfScalarType
5.2.4 SeriesOfScalarBinaryType
5.2.5 SeriesOfVectorType
5.2.6 SeriesOfVectorBinaryType
5.3 Low level Audio Descriptors
5.3.1 Introduction
5.3.2 AudioLLDScalarType
5.3.3 AudioLLDVectorType
5.3.4 AudioWaveformType
5.3.5 AudioPowerType
5.3.6 Audio Spectrum Descriptors
5.3.7 AudioSpectrumEnvelopeType
5.3.8 AudioSpectrumCentroidType
5.3.9 AudioSpectrumSpreadType
5.3.10 AudioSpectrumFlatnessType
5.3.11 AudioSpectrumBasisType
5.3.12 AudioSpectrumProjectionType
5.3.13 AudioHarmonicityType
5.3.14 Timbre Descriptors
5.3.15 LogAttackTimeType
5.3.16 HarmonicSpectralCentroidType
5.3.17 HarmonicSpectralDeviationType
5.3.18 HarmonicSpectralSpreadType
5.3.19 HarmonicSpectralVariationType
5.3.20 SpectralCentroidType
5.3.21 TemporalCentroidType
5.4 Silence
5.4.1 Introduction
5.4.2 SilenceHeaderType
5.4.3 SilenceType
5.4.4 Usage, examples and extraction (informative)
6 High Level Tools
6.1 Introduction
6.2 Audio Signature
6.2.1 Introduction
6.2.2 AudioSignatureType
6.2.3 Instantiation requirements
6.2.4 Usage and examples (informative)
6.3 Timbre
6.3.1 Introduction
6.3.2 InstrumentTimbreType
6.3.3 HarmonicInstrumentTimbreType
6.3.4 PercussiveInstrumentTimbreType
6.3.5 Usage, extraction and examples (informative)
6.4 General Sound Recognition and Indexing
6.4.1 Introduction
6.4.2 SoundModelType
6.4.3 SoundClassificationModelType
6.4.4 SoundModelStatePathType
6.4.5 SoundModelStateHistogramType
6.4.6 General Sound Classification and Indexing Applications (informative)
6.5 Spoken Content
6.5.1 Introduction
6.5.2 SpokenContentHeaderType
6.5.3 SpeakerInfoType
6.5.4 SpokenContentIndexEntryType
6.5.5 ConfusionCountType
6.5.6 WordType, PhoneType, WordLexiconIndexType and PhoneLexiconIndexType
6.5.7 LexiconType
6.5.8 WordLexiconType
6.5.9 phoneticAlphabetType
6.5.10 PhoneLexiconType
6.5.11 SpokenContentLatticeType
6.5.12 SpokenContentLinkType
6.5.13 Usage, extraction and examples (informative)
6.6 Melody
6.6.1 Introduction
6.6.2 MelodyType
6.6.3 Meter
6.6.4 scaleType
6.6.5 MelodyKey
6.6.6 MelodyContourType
6.6.7 contourType
6.6.8 beatType
6.6.9 MelodySequence
6.6.10 Usage of MelodyContour (informative)
6.6.11 Usage of MelodySequence (informative)
6.6.12 Examples (informative)
Annex A (informative) Usage, extraction and examples of Scalable Series
Annex B (informative) Patent statements

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 3.

The main task of the joint technical committee is to prepare International Standards. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote.

ISO/IEC 15938-4 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.

ISO/IEC 15938 consists of the following parts, under the general title Information technology - Multimedia content description interface:
Part 1: Systems
Part 2: Description definition language
Part 3: Visual
Part 4: Audio
Part 5: Multimedia description schemes
Part 6: Reference software
Part 7: Conformance testing
Part 8: Extraction and use of MPEG-7 descriptions

Annexes A and B of this part of ISO/IEC 15938 are for information only.
Introduction

This standard, also known as “Multimedia Content Description Interface,” provides a standardized set of technologies for describing multimedia content. The standard addresses a broad spectrum of multimedia applications and requirements by providing a metadata system for describing the features of multimedia content.

The following are specified in this standard:
- Description Schemes (DS) describe entities or relationships pertaining to multimedia content. Description Schemes specify the structure and semantics of their components, which may be Description Schemes, Descriptors, or datatypes.
- Descriptors (D) describe features, attributes, or groups of attributes of multimedia content.
- Datatypes are the basic reusable datatypes employed by Description Schemes and Descriptors.
- Description Definition Language (DDL) defines Description Schemes, Descriptors, and Datatypes by specifying their syntax, and allows their extension.
- Systems tools support delivery of descriptions, multiplexing of descriptions with multimedia content, synchronization, file format, and so forth.

This standard is subdivided into eight parts:
Part 1 Systems: specifies the tools for preparing descriptions for efficient transport and storage, compressing descriptions, and allowing synchronization between content and descriptions.
Part 2 Description definition language: specifies the language for defining the standard set of description tools (DSs, Ds, and datatypes) and for defining new description tools.
Part 3 Visual: specifies the description tools pertaining to visual content.
Part 4 Audio: specifies the description tools pertaining to audio content.
Part 5 Multimedia description schemes: specifies the generic description tools pertaining to multimedia including audio and visual content.
Part 6 Reference software: provides a software implementation of the standard.
Part 7 Conformance testing: specifies the guidelines and procedures for testing conformance of implementations of the standard.
Part 8 Extraction and use of MPEG-7 descriptions: provides guidelines and examples of the extraction and use of descriptions.

1 Scope

1.1 Definition of Scope

This International Standard defines a Multimedia Content Description Interface, specifying a series of interfaces from system to application level to allow disparate systems to interchange information about multimedia content. It describes the architecture for systems, a language for extensions and specific applications, description tools in the audio and visual domains, as well as tools that are not specific to audio-visual domains. As a whole, this International Standard, encompassing all of the aforementioned components, is known as “MPEG-7.” MPEG-7 is divided into eight parts (as defined in the Foreword).

This part of the MPEG-7 Standard
(Part 4: Audio) specifies description tools that pertain to multimedia in the audio domain. See below for further details of application.

This part of the MPEG-7 Standard is intended to be implemented in conjunction with other parts of the standard. In particular, MPEG-7 Part 4: Audio assumes knowledge of Part 2: Description Definition Language (DDL) in its normative syntactic definitions of Descriptors and Description Schemes. This part of the standard also has dependencies upon clauses in Part 5: Multimedia Description Schemes, namely many of the fundamental Description Schemes that extend the basic type capabilities of the DDL.

MPEG-7 is an extensible standard. The method to extend the standard beyond the Description Schemes provided in the standard is to define new ones in the DDL, and to make those DSs available with the instantiated descriptions. Further details are available in Part 2. To avoid duplicate functionality with other parts of the standard, the DDL is the only extension facility provided.

1.2 Fields of application

MPEG-7 Part 4: Audio is applicable to all forms of audio content. The encoding format or medium of the audio is not limited in any way, and may include audio held in an analogue medium such as magnetic tape or optical film. The content of the audio is not limited to music, speech, sound effects, soundtracks, or any mixtures thereof.

The tools listed in this part of the International Standard are applicable both to audio in isolation and to audio associated with video. The specific tools provided within the Audio portion of the standard are designed to work in conjunction with the Multimedia Description Schemes that apply to both audio and video. Because of the “toolbox” nature of the standard, the most appropriate tools from the different parts of the standard may be mixed, within the constraints of the DDL.

The MPEG-7 Audio tools are applicable to two general areas: low-level audio description, in the case of the Audio Framework (clause 5), and application-driven description, in the case of the High Level Tools (clause 6).

The Audio Framework tools are applicable to general audio, without regard to the specific content carried by the encoded signal. The Scalable Series provides general capabilities for multi-level sampled data. The Audio Description Framework defines specific descriptors for use with the Scalable Series or with Audio Segments, which have properties inherited from the general Segment described in the Multimedia Description Schemes part of the standard. The Silence Descriptor works with the Segment descriptor, and is applicable across all possible audio signals.

The high level description tools are applicable to specific types of content within audio. The specific domains are documented in the introduction to each subclause. The audio domains encompassed by the various MPEG-7 Audio tools are speech, sound effects, musical instruments, melodies within music, and general audio recognition. These specialised tools may be employed in conjunction with the other tools within the standard.

2 Terms and definitions

For the purposes of this part of ISO/IEC 15938, the following terms and definitions apply.

2.1 Frame

A Frame
is defined as a short period of time of the signal on which the instantaneous analysis is performed. For a signal $s(t)$ (in continuous time $t$) and a Hamming analysis window $h(t)$ of temporal length $L$, the $f$-th signal frame is defined as

$$x(f,t) = s(t)\,h(t - fS)$$

where $S$ is the hop size.

2.2 Hop size

The hop size defines the temporal distance between two successive analyses.

2.3 Running window analysis

A running window analysis is an analysis obtained by multiplying the signal by a window function which is shifted in time by integer multiples of a parameter called the hop size. For a window function $h(t)$ and a hop size $S$, the $f$-th shifting of the window is equal to $h(t - fS)$.

2.4 Instantaneous values

The instantaneous value of a (Timbre) descriptor based on peak estimation is defined to be the result of analysis on a frame level. The global value of a (Timbre) descriptor based on peak estimation is defined to be the average over all frames of the segment of the instantaneous value.

3 Symbols and abbreviated terms

ASR Automatic Speech Recognition
CPU Centr