Recommendation ITU-R BS.2076-1 (06/2017)

Audio Definition Model

BS Series: Broadcasting service (sound)

Foreword

The role of the Radiocommunication Sector is to ensure the rational, equitable, efficient and economical use of the radio-frequency spectrum by all radiocommunication services, including satellite services, and carry out studies without limit of frequency range on the basis of which Recommendations are adopted. The regulatory and policy functions of the Radiocommunication Sector are performed by World and Regional Radiocommunication Conferences and Radiocommunication Assemblies supported by Study Groups.

Policy on Intellectual Property Right (IPR)

ITU-R policy on IPR is described in the Common Patent Policy for ITU-T/ITU-R/ISO/IEC referenced in Annex 1 of Resolution ITU-R 1. Forms to be used for the submission of patent statements and licensing declarations by patent holders are available from http://www.itu.int/ITU-R/go/patents/en, where the Guidelines for Implementation of the Common Patent Policy for ITU-T/ITU-R/ISO/IEC and the ITU-R patent information database can also be found.

Series of ITU-R Recommendations (also available online at http://www.itu.int/publ/R-REC/en)

Series  Title
BO      Satellite delivery
BR      Recording for production, archival and play-out; film for television
BS      Broadcasting service (sound)
BT      Broadcasting service (television)
F       Fixed service
M       Mobile, radiodetermination, amateur and related satellite services
P       Radiowave propagation
RA      Radio astronomy
RS      Remote sensing systems
S       Fixed-satellite service
SA      Space applications and meteorology
SF      Frequency sharing and coordination between fixed-satellite and fixed service systems
SM      Spectrum management
SNG     Satellite news gathering
TF      Time signals and frequency standards emissions
V       Vocabulary and related subjects

Note: This ITU-R Recommendation was approved in English under the procedure detailed in Resolution ITU-R 1.

Electronic Publication, Geneva, 2017. © ITU 2017. All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without written permission of ITU.

RECOMMENDATION ITU-R BS.2076-1

Audio Definition Model

(2015-2017)

Scope

This Recommendation describes the structure of a metadata model that allows the format and content of audio files to be reliably described. This model, called the Audio Definition Model (ADM), specifies how XML metadata can be generated to provide the definitions of tracks in an audio file.

Keywords

ADM, Audio Definition Model, BWF, Metadata, Wave-file, WAVE, object-based, channel-based, scene-based, renderer, XML, XSD, format, immersive

The ITU Radiocommunication Assembly,

considering
a) that Recommendation ITU-R BS.2051, Advanced sound system for programme production, highlights the need for a file format that is capable of dealing with the requirements for future audio systems;
b) that Recommendation ITU-R BS.1909, Performance requirements for an advanced multichannel stereophonic sound system for use with or without accompanying picture, outlines the requirements for an advanced multichannel stereophonic sound system;
c) that it is desirable that there is a single open standard for a metadata model for defining audio content that file and streaming formats could either adopt or become compatible with by means of suitable interfacing,

recommends, for the following use cases:
- applications requiring a generic metadata model for, and a formalized description of, custom/proprietary audio formats and content (including codecs);
- generating and parsing audio metadata
  with general-purpose tools, such as text editors;
- an organization's internal production developments, where multi-purpose metadata needs to be added;
- where a human-readable and hand-editable file for describing audio configurations (such as describing a mixing studio channel configuration) in a consistent and translatable format is needed,

to use the Audio Definition Model (ADM) described in Annex 1 for metadata to describe audio formats used in programme production and international exchange.

Annex 1

Audio Definition Model

1 Introduction

Audio for broadcasting and cinema is evolving towards an immersive and interactive experience, which requires the use of more flexible audio formats. A fixed channel-based approach is not sufficient to encompass these developments, and so combinations of channel-, object- and scene-based formats are being developed. Report ITU-R BS.2266 [1] and Recommendations ITU-R BS.1909 [2] and ITU-R BS.2051 [3] highlight these developments and the need for the production chain to accommodate them. The central requirement for allowing all the different types of audio to be distributed, whether by file or by streaming, is that, whatever file/stream format is used, metadata should co-exist to fully describe the audio. Each individual track within a file or stream should be able to be correctly rendered, processed or distributed according to the accompanying metadata. To ensure compatibility across all systems, the Audio Definition Model is an open standard that will make this possible.

2 Background

The purpose of this model is to formalise the description of audio. It is not a format for carrying audio. This distinction will help in the understanding of the model.

2.1 Cooking analogy

To help explain what the ADM actually does, it may be useful
to consider a cooking analogy. The recipe for a cake will contain a list of ingredients, instructions on how to combine those ingredients, and how to bake the cake. The ADM is like a set of rules for writing the list of ingredients; it gives a clear description of each item, for example: 2 eggs, 400 g flour, 200 g butter, 200 g sugar. The ADM provides the instructions for combining ingredients but does not tell you how to do the mixing or baking; in the audio world, that is what the renderer does.

The ADM is in general compatible with wave-file based formats such as the BW64 format specified in Recommendation ITU-R BS.2088 [7], the BWF as defined by the ITU in [4], and other wave-based formats that support the usage of the needed additional chunks. When used in the context of a Recommendation ITU-R BS.2088 file, the <chna> chunk of the BS.2088 file is like the bar code on the packet of each of the ingredients; this code allows us to look up the model's description of each item. The bag containing the actual ingredients is like the data chunk of the BS.2088 file that contains the audio samples. From a Recommendation ITU-R BS.2088 file point of view, we would look at our bar codes on each ingredient in our bag, and use that to look up the description of each item in the bag. Each description follows the structure of the model. There might be ingredients, such as breadcrumbs, which could be divided into their own components (flour, yeast, etc.); this is like having an audio object containing multiple channels (e.g. stereo containing left and right).

2.2 Brief overview

This model will initially use XML as its specification language, though it could be mapped to other languages such as JSON (JavaScript Object Notation) if required. When it is used with Recommendation ITU-R
BS.2088 files, the XML can be embedded in the <axml> chunk of the file. The model is divided into two sections: the content part and the format part. The content part describes what is contained in the audio, so it will describe things like the language of any dialogue, the loudness and so on. The format part describes the technical nature of the audio so that it can be decoded or rendered correctly. Some of the format elements may be defined before we have any audio signals, whereas the content parts can usually only be completed after the signals have been generated.

While this model is based around a wave-file based format, it is a more general model. However, examples are given using Recommendation ITU-R BS.2088, according to the definition in [7], as this explains more clearly how the model works. It is also expected that the model's parameters will be added to in subsequent versions of this specification to reflect the progress in audio technology.

3 Description of the model

The overall diagram of the model is given in Fig. 1. This shows how the elements relate to each other and illustrates the split between the content and format parts. It also shows the <chna> chunk of a BS.2088 file and how it connects the tracks in the file to the model.

Where a BS.2088 file contains a number of audio tracks, it is necessary to know what each track is. The <chna> chunk contains a list of numbers corresponding to each track in the file. Hence, for a 6-track file, the list is at least 6 long. For each track there is an audioTrackFormatID number and an audioTrackUID number (notice the additional U, which stands for unique). The reason the list could be longer than the number of tracks is that a single track may have different definitions at different times, so it will require multiple audioTrackUIDs and references.

The audioTrackFormatID is used to look up the definition of the format of that particular track. The audioTrackFormatIDs are not unique; for example, if a file contains 5 stereo pairs, there will be 5 identical audioTrackFormatIDs to describe the left channel, and 5 to describe the right channel. Thus, only two different audioTrackFormatIDs will need to be defined. However, audioTrackUIDs are unique (hence the U), and they are there to uniquely identify the track. This use of IDs means that the tracks can be ordered in any way in the file; their IDs reveal what those tracks are.

FIGURE 1
Overall UML Model

3.1 Format

The audioTrackFormatID answers the question "What is the format of this track?" The audioTrackFormat will also contain an audioStreamFormatID, which allows identification of the combination of the audioTrackFormat and audioStreamFormat. An audioStreamFormat describes a decodable signal. The audioStreamFormat is made up of one or more audioTrackFormats. Hence, the combination of audioStreamFormat and audioTrackFormat reveals whether the signal has to be decoded or not.

The next stage is to find out what type of audio the stream is; for example, it may be a conventional channel (e.g. front left), an audio object (e.g. something named "guitar" positioned at the front), an HOA (Higher Order Ambisonics) component (e.g. X) or a group of channels. Inside audioStreamFormat there will be a reference to either an audioChannelFormat or an audioPackFormat that will describe the audio stream. There will only be one of these references.

If audioStreamFormat contains an audioChannelFormat reference (i.e. audioChannelFormatIDRef), then the stream is described by one of several different types of audioChannelFormat. An audioChannelFormat is a description of a single waveform of audio. In audioChannelFormat there is a typeDefinition attribute, which is used to define what the type of channel is. The typeDefinition attribute can be set to DirectSpeakers, HOA, Matrix, Objects or Binaural. For each of those types, there is a different set of sub-elements to specify the static parameters associated with that type of audioChannelFormat. For example, the DirectSpeakers type of channel has the sub-element speakerLabel for allocating a loudspeaker to the channel.

To allow audioChannelFormat to describe dynamic channels (i.e. channels that change
in some way over time), it uses audioBlockFormat to divide the channel along the time axis. The audioBlockFormat element will contain a start time (relative to the start time of the parent audioObject) and a duration. Within audioBlockFormat there are time-dependent parameters that describe the channel, which depend upon the audioChannelFormat type. For example, the Objects type of channel has the sub-elements azimuth, elevation and distance to describe the location of the sound. The number and duration of audioBlockFormats is not limited; there could be an audioBlockFormat for every sample if something moves rapidly, though that might be a bit excessive! At least one audioBlockFormat is required, and so static channels will have one audioBlockFormat containing the channel's parameters.

If audioStreamFormat refers to an audioPackFormat, it describes a group of channels. An audioPackFormat element groups together one or more audioChannelFormats that belong together (e.g. a stereo pair). This is important when rendering the audio, as channels within the group may need to interact with each other. The reference to an audioPackFormat containing multiple audioChannelFormats from an audioStreamFormat usually occurs when the audioStreamFormat contains non-PCM audio which carries several channels encoded together. AudioPackFormat would usually not be referred to from audioStreamFormat for most channel- and scene-based formats with PCM audio. Where this reference does exist, the function of audioPackFormat is to combine audioChannelFormats that belong together for rendering purposes. For example, stereo, 5.1 and 1st order Ambisonics would all be examples of an audioPackFormat. Note that audioPackFormat just describes the format of the audio; for example, a file containing 5 stereo pairs will contain only one audioPackFormat to describe stereo. It is possible to nest audioPackFormats; a 2nd order HOA could contain a 1st order HOA audioPackFormat alongside audioChannelFormats for the R, S, T, U and V components.

3.2 Content

The audioObject element links the format definitions to the actual audio tracks in the file. In the stereo example, the audioObject will contain references to the two audioTrackUIDs of the tracks carrying the audio; therefore, those two tracks will contain stereo audio. It will also contain a reference to audioPackFormat, which defines the format of those two tracks as a stereo pair. As there are 5 stereo pairs in this example, 5 audioObject elements will be needed. Each one will contain the same reference to a stereo audioPackFormat, but will contain different references to audioTrackUIDs, as each stereo pair is carrying different audio. The order of audioTrackUIDRefs is not important in an audioObject, as the format definition through audioTrackFormat, audioStreamFormat, audioChannelFormat and audioPackFormat determines which track is which.

The audioObject element also contains start and duration attributes. The start time is the time when the signal for the object starts in a file or recording. Thus, if start is "00:00:10.00000", the signal for the object will start 10 seconds into the track in the audio file. As audioPackFormat can be nested, it follows that audioObjects can be nested. For example, if a non-PCM stream carried in two tracks contains both a 5.1 and a 2.0 version of the audio, the audioObject will contain not only references to the two audioTrackUIDs carrying the stream, but also references to two audioObjects, one for the 5.1 and one for the 2.0.

AudioObject is referred to by audioContent, which gives a description of the content of the audio; it has parameters such as language (if there is dialogue) and the loudness parameters. Some of the values for these parameters can only be calculated after the audio has been generated, and this is why they are not in the format part.
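As an illustration of how the format and content elements described in this section fit together, the following is a minimal sketch of ADM XML for a single stereo pair. The element and attribute names follow the model's conventions; the specific IDs, names and times are illustrative values chosen for this sketch, not definitions drawn from this Recommendation:

```xml
<audioFormatExtended>
  <!-- Format part: these definitions are shared by every stereo pair in the file -->
  <audioChannelFormat audioChannelFormatID="AC_00010001" audioChannelFormatName="FrontLeft"
                      typeLabel="0001" typeDefinition="DirectSpeakers">
    <audioBlockFormat audioBlockFormatID="AB_00010001_00000001">
      <speakerLabel>M+30</speakerLabel>
    </audioBlockFormat>
  </audioChannelFormat>
  <audioChannelFormat audioChannelFormatID="AC_00010002" audioChannelFormatName="FrontRight"
                      typeLabel="0001" typeDefinition="DirectSpeakers">
    <audioBlockFormat audioBlockFormatID="AB_00010002_00000001">
      <speakerLabel>M-30</speakerLabel>
    </audioBlockFormat>
  </audioChannelFormat>

  <!-- One stream/track format per channel; the audio is PCM, so audioStreamFormat
       references an audioChannelFormat rather than an audioPackFormat -->
  <audioStreamFormat audioStreamFormatID="AS_00010001" audioStreamFormatName="PCM_FrontLeft"
                     formatLabel="0001" formatDefinition="PCM">
    <audioChannelFormatIDRef>AC_00010001</audioChannelFormatIDRef>
    <audioTrackFormatIDRef>AT_00010001_01</audioTrackFormatIDRef>
  </audioStreamFormat>
  <audioTrackFormat audioTrackFormatID="AT_00010001_01" audioTrackFormatName="PCM_FrontLeft"
                    formatLabel="0001" formatDefinition="PCM">
    <audioStreamFormatIDRef>AS_00010001</audioStreamFormatIDRef>
  </audioTrackFormat>
  <!-- AS_00010002 / AT_00010002_01 would be defined analogously for FrontRight -->

  <!-- The pack groups the two channels into a stereo pair -->
  <audioPackFormat audioPackFormatID="AP_00010002" audioPackFormatName="Stereo"
                   typeLabel="0001" typeDefinition="DirectSpeakers">
    <audioChannelFormatIDRef>AC_00010001</audioChannelFormatIDRef>
    <audioChannelFormatIDRef>AC_00010002</audioChannelFormatIDRef>
  </audioPackFormat>

  <!-- Content part: one audioObject per stereo pair actually present in the file,
       each pointing at the same pack but at different audioTrackUIDs -->
  <audioObject audioObjectID="AO_1001" audioObjectName="StereoPair1"
               start="00:00:00.00000" duration="00:02:00.00000">
    <audioPackFormatIDRef>AP_00010002</audioPackFormatIDRef>
    <audioTrackUIDRef>ATU_00000001</audioTrackUIDRef>
    <audioTrackUIDRef>ATU_00000002</audioTrackUIDRef>
  </audioObject>

  <audioContent audioContentID="ACO_1001" audioContentName="Main">
    <audioObjectIDRef>AO_1001</audioObjectIDRef>
  </audioContent>
</audioFormatExtended>
```

A file with 5 stereo pairs would repeat only the audioObject (with fresh audioTrackUIDRefs); the channel, stream, track and pack definitions above are written once and shared.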

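The time-varying behaviour of an Objects-type channel can likewise be sketched as a channel divided into audioBlockFormats, each block carrying its start time (the rtime attribute, relative to the parent) and duration. The object name, times and positions below are invented for illustration:

```xml
<audioChannelFormat audioChannelFormatID="AC_00031001" audioChannelFormatName="Guitar"
                    typeLabel="0003" typeDefinition="Objects">
  <!-- First 5 seconds: the object sits front-left -->
  <audioBlockFormat audioBlockFormatID="AB_00031001_00000001"
                    rtime="00:00:00.00000" duration="00:00:05.00000">
    <position coordinate="azimuth">-22.5</position>
    <position coordinate="elevation">0.0</position>
    <position coordinate="distance">1.0</position>
  </audioBlockFormat>
  <!-- Next 5 seconds: the object has moved front-right -->
  <audioBlockFormat audioBlockFormatID="AB_00031001_00000002"
                    rtime="00:00:05.00000" duration="00:00:05.00000">
    <position coordinate="azimuth">22.5</position>
    <position coordinate="elevation">0.0</position>
    <position coordinate="distance">1.0</position>
  </audioBlockFormat>
</audioChannelFormat>
```

A static channel would instead carry a single audioBlockFormat holding its fixed parameters.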