Reference number: ISO/IEC 15938-4:2002/Amd.1:2004(E)
© ISO/IEC 2004

Information technology - Multimedia content description interface - Part 4: Audio
AMENDMENT 1: Audio extensions

Amendment 1:2005 to National Standard of Canada CAN/CSA-ISO/IEC 15938-4:04
Amendment 1:2004 to International Standard ISO/IEC 15938-4:2002 has been adopted without modification (IDT) as Amendment 1:2005 to CAN/CSA-ISO/IEC 15938-4:04. This Amendment was reviewed by the CSA Technical Committee on Information Technology (TCIT) under the jurisdiction of the Strategic Steering Committee on Information Technology and deemed acceptable for use in Canada. October 2005

© ISO/IEC 2004 - All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office, Case postale 56, CH-1211 Geneva 20
Tel. +41 22 749 01 11, Fax +41 22 749 09 47
E-mail: copyright@iso.org, Web: www.iso.org

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized
system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of the joint technical committee is to prepare International Standards. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

Amendment 1 to ISO/IEC 15938-4:2002 was prepared by Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.

Information technology - Multimedia content description interface - Part 4: Audio
AMENDMENT 1: Audio extensions

Add at the end of subclause 4.2:

4.3 Handling of multi-channel signals

Introduction: The framework for handling multi-channel signals is given by the AudioD and AudioDS types defined in ISO/IEC 15938-5/Amd.1 (MDS). The new
additional attribute channels gives the channel numbers that are described by the assigned Descriptor or Description Scheme. However, to prevent misunderstandings, a more detailed description and handling policy is given in this part of ISO/IEC 15938. In particular, a recommendation is given for handling typical surround formats when only tag names such as L, C, R, LS, RS and LFE are known.

By using the channels attribute defined in ISO/IEC 15938-5/Amd.1 (MDS), it is possible to specify which channels should be used by the extraction method, e.g. for computing the mean. The Descriptors and Description Schemes then contain information about these channels only. This is useful for separating a multi-channel input signal into subgroups that are closely related, e.g. the Left (L), Center (C) and Right (R) signals of a typical surround format.

The highest possible channel number is given by the file format of the audio media file itself. All numbers in the channels attribute that are higher than the number of channels given by the media file format should be ignored. Where the numbering of the audio channels is not given explicitly in the file format (as for 5.1 surround signals), the following numbering convention is recommended. When mapping typical surround file formats consisting of tags such as (L, R, C, LS, RS, LFE), the scheme shown in Figure AMD1-1 should be followed in order to reduce the ambiguity between scheme and channel number.

To define the channel number, counting should start at an optional center channel and proceed from left to right, top to bottom, and then from front to back (see for example Figure AMD1-2). An optional rear center gets the last channel number among the standard audio channels. The assigned number can be higher if specialized channels are present, such as an LFE channel for low-frequency effect signals. Two examples are given in Tables AMD1-1 and AMD1-2. Furthermore, it is recommended that a textual description of the scheme used be included inside the AudioSegmentD framework (defined in ISO/IEC 15938-5). An instantiation example is given in ISO/IEC 15938-5/Amd.1 (MDS), subclause 4.2.4.

Figure AMD1-1 - Scheme and channel number for typical surround file formats

Figure AMD1-2 - Scheme and channel number for a 3D speaker arrangement (example)

Examples for mapping:

Table AMD1-1 - Simple stereo

Tag name            Channel number
Left                1
Right               2

Table AMD1-2 - Surround 5.1

Tag name            Channel number
Center              1
Left                2
Right               3
Left Surround (LS)  4
Right Surround (RS) 5
LFE                 6

Replace subclause 6.5.8 by:

6.5.8 WordLexiconType

6.5.8.1 Syntax

6.5.8.2 Semantics

WordLexiconType: A list of words (a lexicon). Each entry represents one orthographic representation (spelling) or one non-orthographic representation of a word or linguistic unit. The lexicon is not a phonetic (pronunciation) dictionary.

phoneticAlphabet: The name of the encoding scheme of the phone lexicon. Only needed if a phonetic representation is used. See 6.5.9, phoneticAlphabetType.

Token: An entry in the lexicon.

linguisticUnit: Indicates the
type of the linguistic unit that is put into the entry of the word lexicon. The linguistic units are defined as follows:

word: a unit delimited by whitespace. This is the default value. (Example: psychoacoustics.)
syllable: a minimal pronounceable unit. (Example: psy.)
morpheme: a minimal meaning-bearing unit. (Example: psycho.)
stem: the uninflected base of a word form; can be polymorphemic. (Example: psychoacoustic.)
affix: a unit that needs to be added to a stem to form a word.
component: a constituent part of a compound word; important for compounding languages. (Example from German: Forschungs, corresponding to English "research-".)
nonspeech: noises, both human-produced and background, that are non-linguistic in nature. (Examples: throat clearing, coughing.)
phrase: a sequence of words. (Example: "God bless America".)
other: a linguistic unit that does not map onto any of the above.

Other values that are datatype-valid with respect to mpeg7:termReferenceType are reserved.

representation: Form of representation for a lexicon entry. The kinds of representation are defined as follows:

orthographic: representation of an entry by spelling.
nonorthographic: representation of an entry by an identifier that is not synonymous with the spelling of a word. A non-orthographic representation may, for example, encode the phoneme string corresponding to the pronunciation of the entry.

6.5.8.3 Usage, Extraction and Examples (informative)

6.5.8.3.1 Purpose

The word lexicon makes it possible to store the words contained in the lattice. It is common in both speech recognition and spoken document retrieval to include entries in the word lexicon that are "words" only in a wider sense of the term (e.g. acronyms or abbreviations) or not really words at all (e.g. phrases, syllables, morphemes or the individual components of compound words). The attribute linguisticUnit makes it possible to distinguish between these different types of units.

Differentiating these units is useful, for instance, when the retrieval algorithm of an application needs to treat different units in different ways. For example, stemming, a pre-processing step applied to words, should not be applied to syllables or morphemes. Similarly, different types of units might receive different weightings in the calculation of the retrieval metric. In some applications it is also necessary to know whether an entry is given in its human-readable form, for example to decide whether it can be displayed to the user or whether algorithms intended for the orthographic form only (e.g. stemming) may be applied.

6.5.8.3.2 Extraction

The generation of syllable, morpheme, compound and phrase transcriptions of spoken input is performed in the following ways:

a) The output of the word or phoneme recognizer is mapped to other linguistic units. For example, the recognized words can be transformed into syllables using a syllable generation tool.

b) The ASR system produces the desired linguistic unit directly during the recognition process. In this case, the linguistic units are part of the recognition vocabulary of the speech recognition engine. For example, the dictionary used for the speech recognition system could be composed exclusively of syllables.

6.5.8.3.3 Example

The following example shows a lexicon containing six entries. The first two entries represent syllables and the next two entries represent words. The fifth entry also represents a word, but not in its written form. The last entry represents a phrase. [XML instantiation not reproduced; the six entries are the syllables "6: n" and "6:_s_", the words "water" and "draw", the phone string "Q e: l e: f a n t", and the phrase "as a rule".]

Replace subclause 6.5.12 by:

6.5.12 SpokenContentLinkType

6.5.12.1 Syntax

6.5.12.2 Semantics

SpokenContentLinkType: The structure of a word or phone link in the lattice.

probability:
The probability of this link. In a crude sense, this indicates which links are more likely than others, with larger numbers indicating higher likelihood.

nodeOffset: The node to which this link leads, specified as a relative offset and defaulting to 1. A node offset leading out of the current block implicitly refers to the next block. A node offset cannot span a whole block, i.e. a link from a node in block 3 must lead to a node in block 3 or block 4.

acousticScore: The score assigned by the acoustic models of the speech recognition engine only. It is given on a logarithmic scale (base e) and indicates the quality of the match between the acoustic models and the corresponding signal segment. A higher value indicates a better match.

Add a new subclause 6.7:

6.7 Audio Signal Quality

6.7.1 Introduction

If an AudioSegment DS contains a piece of music, several features describing the signal's quality can be computed. The AudioSignalQualityType contains these quality attributes and uses the ErrorEventType to handle typical errors that occur in audio data and in the transfer from analog audio to the digital domain. However, note that this DS is not applicable to describing the subjective sound quality of audio signals resulting from sophisticated digital signal processing, including the use of noise shaping or other techniques based on perceptual/psychoacoustic considerations.

For example, when searching for an audio file on the Internet, quality information could be used to determine which of several search results should be downloaded. Another application area is an archiving system, where it would be possible to browse through the archive using
quality information, and the information could also be used to decide whether a file is of sufficient quality to be used, e.g. for broadcasting.

6.7.2 Conventions

The description of the Descriptors refers to the input signal x. If x is a multi-channel signal, then the signal for a certain channel is designated xn for the n-th channel. The functions max(), min() and mean() are used as defined in ISO/IEC 15938-4 (Audio part). The function abs() calculates the absolute value.

6.7.3 Audio Signal Quality Description Scheme

The AudioSignalQualityType is a set of AudioQuality Descriptors and additional tools for handling and describing audio signal quality information. In particular, the handling of single error events in audio streams is considered.

6.7.3.1 Syntax

6.7.3.2 Semantics

AudioSignalQualityType: Describes the quality of an AudioSegment. It consists of several quality elements.

Operator: The person responsible for the audio quality information. Operator is of type PersonType.

UsedTool: The system that was used by the Operator to create the quality information. UsedTool is of type CreationToolType.

BackgroundNoiseLevel (BNL): Describes the noise level in an AudioSegment. BackgroundNoiseLevelType is defined in 6.7.4.

RelativeDelay: Describes the relative delay between two or more channels of an AudioSegment. RelativeDelayType is defined in 6.7.6.

Balance: Describes the relative level between two or more channels of an AudioSegment. BalanceType is defined in 6.7.7.
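As an informative illustration of the channel-numbering convention recommended in 4.3 (optional center first, then left to right and front to back, with LFE last), the tag-to-channel mappings of Tables AMD1-1 and AMD1-2 can be sketched as follows. The dictionary and function names are hypothetical and not part of this part of ISO/IEC 15938.

```python
# Informative sketch: tag-name to channel-number mappings following the
# recommendation in 4.3 (center first, left to right, front to back,
# LFE last). Names are hypothetical.
SURROUND_5_1 = {"C": 1, "L": 2, "R": 3, "LS": 4, "RS": 5, "LFE": 6}
STEREO = {"Left": 1, "Right": 2}

def channels_for(tags, mapping):
    """Return channel numbers for the given tag names; tags that the
    mapping does not define are skipped."""
    return [mapping[t] for t in tags if t in mapping]
```

For instance, the (L, C, R) subgroup of a 5.1 signal maps to channels 2, 1 and 3 under this convention.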
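The rule in 4.3 that channel numbers in the channels attribute exceeding the channel count of the media file should be ignored can be sketched as follows for a mean computed over a channel subgroup. The function mean_over_channels and the list-of-lists signal layout are illustrative assumptions, not part of the normative text.

```python
def mean_over_channels(x, channels):
    """Mean over the selected 1-based channel numbers of a multi-channel
    signal x, given as a list of per-channel sample lists. Channel
    numbers higher than the number of channels in x are ignored, as
    4.3 recommends. Hypothetical helper for illustration only."""
    selected = [x[n - 1] for n in channels if n <= len(x)]
    samples = [s for ch in selected for s in ch]
    return sum(samples) / len(samples)
```

For a 3-channel signal, requesting channels (1, 2, 9) averages channels 1 and 2 only, since channel 9 does not exist in the file.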
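The nodeOffset semantics of 6.5.12.2 (an offset leading out of the current block implicitly refers to the next block, and an offset may not span a whole block) can be sketched as follows, assuming for illustration that every block carries the same number of nodes. The function resolve_node and the fixed nodes_per_block parameter are assumptions, not part of the normative text.

```python
def resolve_node(block, node, node_offset, nodes_per_block):
    """Resolve a link target from a 0-based (block, node) position and a
    relative nodeOffset. Illustrative only: real lattices need not have
    fixed-size blocks."""
    target = node + node_offset
    if target < nodes_per_block:
        # Target is still inside the current block.
        return (block, target)
    if target < 2 * nodes_per_block:
        # Leads out of the current block: implicitly the next block.
        return (block + 1, target - nodes_per_block)
    # Skipping past the next block would span a whole block.
    raise ValueError("nodeOffset may not span a whole block")
```

Under this sketch, a link from node 8 of block 3 with offset 5 (with 10 nodes per block) resolves to node 3 of block 4, matching the rule that a link from block 3 may lead only to block 3 or block 4.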
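The advice in 6.5.8.3.1 that stemming should be applied to word entries but not to syllables or morphemes can be sketched as follows. The function normalize, the (token, linguisticUnit) entry layout and the toy stemmer are illustrative assumptions only.

```python
# Informative sketch: apply a stemmer only to lexicon entries whose
# linguisticUnit is "word"; other units pass through unchanged.
def normalize(entries, stem):
    """entries: list of (token, linguisticUnit) pairs (assumed layout);
    stem: any callable mapping a word to its stem."""
    return [(stem(token) if unit == "word" else token, unit)
            for token, unit in entries]
```

For example, with a trivial suffix-stripping stemmer, the word entry "drawing" is stemmed while the syllable entry "psy" is left untouched.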