BSI Standards Publication

BS ISO/IEC 14496-11:2015

Information technology - Coding of audio-visual objects - Part 11: Scene description and application engine

BRITISH STANDARD

National foreword

This British Standard is the UK implementation of ISO/IEC 14496-11:2015.

The UK participation in its preparation was entrusted to Technical Committee IST/37, Coding of picture, audio, multimedia and hypermedia information.

A list of organizations represented on this committee can be obtained on request to its secretary.

This publication does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.

© The British Standards Institution 2015. Published by BSI Standards Limited 2015

ISBN 978 0 580 82200 1

ICS 35.040

Compliance with a British Standard cannot confer immunity from legal obligations.

This British Standard was published under the authority of the Standards Policy and Strategy Committee on 30 November 2015.

Amendments/corrigenda issued since publication

Date    Text affected

INTERNATIONAL STANDARD ISO/IEC 14496-11

Second edition 2015-11-01

Information technology - Coding of audio-visual objects - Part 11: Scene description and application engine

Technologies de l'information - Codage des objets audiovisuels - Partie 11: Description de scène et moteur d'application

Reference number ISO/IEC 14496-11:2015(E)

© ISO/IEC 2015

Contents

Foreword
0 Introduction
0.1 Scene Description
0.2 Extensible MPEG-4 Textual Format
0.3 MPEG-J
1 Scope
2 Normative references
3 Additional reference
4 Terms and definitions
5 Abbreviations and Symbols
6 Conventions
7 MPEG-4 Systems Node Semantics
7.1 Scene Description
7.2 Node Semantics
7.3 Informative: Differences Between MPEG-4 Scripts and ECMA Scripts
7.4 Informative: FlexTime behavior
7.5 Informative: Implementation of MaterialKey node
7.6 Informative: Example implementation of spatial audio processing (perceptual approach)
7.7 Informative: MPEG-4 Audio TTS application with Facial Animation
7.8 Informative: 3D Mesh Coding in BIFS scenes
7.9 Profiles
7.10 Metric information for resident fonts
7.11 Font metrics for SANS SERIF font (Albany)
7.12 Font metrics for SERIF font (Thorndale)
7.13 Font metrics for TYPEWRITER font (Cumberland)
8 BIFS
8.1 Introduction
8.2 Decoding tables, data structures and associated functions
8.3 Quantization
8.4 Compensation process
8.5 BIFS Configuration
8.6 BIFS Command Syntax
8.7 BIFS Scene
8.8 BIFS-Anim
8.9 Interpolator compression
8.10 Definition of bodySceneGraph nodes
8.11 Adaptive Arithmetic Decoder for BIFS-Anim
8.12 Informative: Adaptive Arithmetic Encoder for BIFS-Anim
8.13 View Dependent Object Scalability
9 The Extensible MPEG-4 Textual Format
9.1 Introduction
9.2 XMT-A Format
9.3 XMT-Ω Format
9.4 XMT-C Modules
9.5 XMT Schemas
9.6 Informative: XMT/X3D Compatibility
9.7 Informative: The usage of XMT-A BitWrapper element in authoring side
10 MPEG-J
10.1 Architecture
10.2 MPEG-J Session
10.3 Delivery of MPEG-J Data
10.4 MPEG-J API List
10.5 Informative: Starting the Java Virtual Machine
10.6 Informative: Examples of MPEG-J API usage
Annex A (normative) Curve-based animators
Annex B (normative) Procedural textures algorithms
Annex C (informative) Text Processing in BIFS
Annex D (informative) Patent statements
Bibliography

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of the joint technical committee is to prepare International Standards. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.

ISO/IEC 14496-11 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information Technology, Subcommittee SC 29, Coding of Audio, Picture, Multimedia and Hypermedia Information.

This second edition cancels and replaces the first edition, which has been technically revised.

ISO/IEC 14496 consists of the following parts, under the general title Information technology - Coding of audio-visual objects:

- Part 1: Systems
- Part 2: Visual
- Part 3: Audio
- Part 4: Conformance testing
- Part 5: Reference software
- Part 6: Delivery Multimedia Integration Framework (DMIF)
- Part 7: Optimized reference software for coding of audio-visual objects [Technical Report]
- Part 8: Carriage of ISO/IEC 14496 contents over IP networks
- Part 9: Reference hardware description [Technical Report]
- Part 10: Advanced Video Coding
- Part 11: Scene description and application engine
- Part 12: ISO base media file format
- Part 13: Intellectual Property Management and Protection (IPMP) extensions
- Part 14: MP4 file format
- Part 15: Advanced Video Coding (AVC) file format
- Part 16: Animation Framework eXtension (AFX)
- Part 17: Streaming text format
- Part 18: Font compression and streaming
- Part 19: Synthesized texture stream
- Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF)
- Part 21: MPEG-J GFX

Introduction

1.1 Scene Description

1.1.1 Overview

ISO/IEC 14496 addresses the coding of audio-visual objects of various types: natural video and audio objects as well as textures, text, 2- and 3-dimensional graphics, and also synthetic music and sound effects. To reconstruct a multimedia scene at the terminal, it is therefore not sufficient to transmit the raw audio-visual data to a receiving terminal. Additional information is needed to combine these audio-visual data at the terminal and to construct and present a meaningful multimedia scene to the end-user. This information, called the scene description, determines the placement of audio-visual objects in space and time and is transmitted together with the coded objects, as illustrated in Figure 1. Note that the scene description only describes the structure of the scene. The action of assembling these objects in the same representation space is called composition. The action of transforming these audio-visual objects from a common representation space to a specific presentation device (i.e. speakers and a viewing window) is called rendering.

[Figure 1: An example of an object-based multimedia scene]

Independent coding of different objects may achieve higher compression, and also makes it possible to manipulate content at the terminal. The behaviors of objects and their responses to user inputs can thus also be represented in the scene description. The scene description framework used in this part of ISO/IEC 14496 is based largely on ISO/IEC 14772-1:1998 (Virtual Reality Modeling Language, VRML).

1.1.2 Composition and Rendering

ISO/IEC 14496-11 defines the syntax and semantics of bitstreams that describe the spatio-temporal relationships of audio-visual objects. For visual data, particular composition algorithms are not mandated, since they are implementation-dependent; for audio data, subclause 7.1.1.2.13 and the semantics of the AudioBIFS nodes normatively define the composition process. The manner in which the composed scene is presented to the user is not specified for audio or visual data. The scene description representation is termed “BInary Format for Scenes” (BIFS).

1.1.3 Scene Description

In order to facilitate the development of authoring, editing and interaction tools, scene descriptions are coded independently from the audio-visual media that form part of the scene. This permits the scene to be modified without decoding or otherwise processing the audio-visual media. The following clauses detail the scene description capabilities that are
provided by ISO/IEC 14496-11.

1.1.3.1 Grouping of audio-visual objects

A scene description follows a hierarchical structure that can be represented as a graph. Nodes of the graph form audio-visual objects, as illustrated in Figure 2. The structure is not necessarily static; nodes may be added, deleted or modified.

[Figure 2: Logical structure of example scene]

1.1.3.2 Spatio-Temporal positioning of objects

Audio-visual objects have both a spatial and a temporal extent. Complex audio-visual objects are constructed by combining appropriate scene description nodes to build up the scene graph. Audio-visual objects may be located in 2D or 3D space. Each audio-visual object has a local co-ordinate system, in which the object has a pre-defined (but possibly varying) spatio-temporal location and scale (size and orientation). Audio-visual objects are positioned in a scene by specifying a co-ordinate transformation from the object's local co-ordinate system into another co-ordinate system defined by a parent node in the scene graph.

1.1.3.3 Attributes of audio-visual objects

Scene description nodes expose a set of parameters through which aspects of their appearance and behavior can be controlled.

EXAMPLE The volume of a sound, the color of a synthetic visual object, or the source of a streaming video.

1.1.3.4 Behavior of audio-visual objects

ISO/IEC 14496-11 provides tools for enabling dynamic scene behavior and user interaction with the presented content. User interaction can be separated into two major categories: client-side and server-side. Client-side interaction is an integral part of the scene description described herein; server-side interaction is not dealt with. Client-side interaction involves content manipulation that is
handled locally at the end-user's terminal. It consists of modifying the attributes of scene objects according to specified user actions.

EXAMPLE A user can click on a scene to start an animation or video sequence.

The facilities for describing such interactive behavior are part of the scene description, thus ensuring the same behavior in all terminals conforming to ISO/IEC 14496-11.

1.2 Extensible MPEG-4 Textual Format

1.2.1 Overview

The Extensible MPEG-4 Textual format (XMT) is a framework (illustrated in Figure 3) for representing MPEG-4 scene description using a textual syntax. XMT allows content authors to exchange their content with other authors, tools or service providers, and facilitates interoperability with both the Extensible 3D (X3D) being developed by the Web3D Consortium and the Synchronized Multimedia Integration Language (SMIL) from the W3C.

[Figure 3: Overview of the XMT Framework]

1.2.2 Interoperability of XMT

The XMT format can be interchanged between SMIL players, VRML players, and MPEG-4 players. The format can be parsed and played directly by a W3C SMIL player, preprocessed to Web3D X3D and played back by a VRML player, or compiled to an MPEG-4 representation such as MP4, which can then be played by an MPEG-4 player. See Figure 3 for a graphical description of XMT interoperability.

1.2.3 Two-tier Architecture: XMT-A and XMT-Ω Formats

The XMT framework consists of two levels of textual syntax and semantics: the XMT-A format and the XMT-Ω format, abbreviated as A and Ω respectively; the short and full names are used interchangeably where there is no ambiguity. XMT-A is an XML-based version of MPEG-4 content, which contains a subset of X3D. Also