Reference number ISO 12620:2009(E)
© ISO 2009

INTERNATIONAL STANDARD ISO 12620
Second edition 2009-12-15

Terminology and other language and content resources - Specification of data categories and management of a Data Category Registry for language resources

Terminologie et autres ressources langagières et ressources de contenu - Spécification de catégories de données et gestion d'un registre de catégories de données pour les ressources langagières

Copyright International Organization for Standardization. Provided by IHS under license with ISO. Not for resale. No reproduction or networking permitted without license from IHS.

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

COPYRIGHT PROTECTED DOCUMENT

© ISO 2009

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
3.1 Data elements and data categories
3.2 Data Category Registry
3.3 Data category specification components
3.4 DCR management
3.5 Roles
3.6 Data exchange
4 Role of data categories in language resource management
4.1 Overview
4.2 Variety of Data Category Selections (DCSs)
5 Requirements for the implementation of a DCR for language resources
6 Registration Authority for the ISO/TC 37 DCR
7 Representation of data categories used in language resources
7.1 Introduction
7.2 Global Information class
7.3 Data Category classes
7.4 Administration Information Section
7.5 Documenting data categories
7.6 Conceptual Domain classes
7.7 Linguistic Section classes
7.8 Referencing a data category
7.9 Data Category Interchange Format
8 Management procedures for the ISO/TC 37 DCR
8.1 General organization
8.2 Roles and responsibilities
8.3 Thematic domain groups
8.4 Working procedure
8.5 Data Category Registry Board (DCRB)
Annex A (normative) Compact DC Reference RELAX NG Schema
Annex B (informative) Example of a DCIF Representation
Annex C (normative) Compact DCIF RELAX NG Schema
Annex D (informative) Alphabetical listing of definitions
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for
voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all
such patent rights.

ISO 12620 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 3, Systems to manage terminology, knowledge and content.

This second edition cancels and replaces the first edition (ISO 12620:1999), which has been technically revised.

Introduction

Data associated with language resources are
identified, collected, managed, and stored in a wide variety of environments. Data items appearing in individual language resources are themselves referred to in this International Standard as data categories, a designation commonly used in the environment of ISO Technical Committee ISO/TC 37. Data
categories as cited in ISO/TC 37 standards correspond to data element concepts in the ISO/IEC 11179 series of standards, but differ slightly in terms of values defined. Differences in approach among different kinds of language resources and individual system objectives inevitably lead to variations
in data category definitions and data category names. The use of uniform data category names and definitions employed in resources within the same thematic domain (for example, among terminological resources, lexicographic resources, annotated text corpora, etc.), at least at the interchange level,
contributes to system coherence and enhances the re-usability of data. Procedures for defining data categories in a given thematic domain also need to be uniform in order to ensure interoperability among data categories, which becomes problematic if they are defined in individual data category registries.

The creation of a single global Data Category Registry (DCR) for all types of language resources treated within the ISO/TC 37 environment provides a unified view of the various applications of such a reference resource. This universal registry is designed to facilitate a wide range of Data Category Selections (DCS) needed in conjunction with all current or future standardization projects. ISO/TC 37 or any of its sub-committees can resolve at any time to designate specific thematic domains to deal with the management of those DCSs.

The following thematic domains, among others, have been
recognized as definable subsets of the DCR for language resources:

- "Terminology": ISO 16642:2003 explicitly refers to a set of reference data categories for terminology representation. Some of the data categories include general-purpose data management categories (for example, /source/ 1), /responsibility/, /date/, etc.) as well as linguistically oriented ones (for example, /partOfSpeech/). Many of these data categories are relevant to a variety of different language resources, not just to terminology management, and form the core of the DCR as described in this International Standard;

- "Semantic Content Representation" and "Lexical Resources": the DCR serves as a reference for the descriptors that are used throughout ISO/TC 37-related language resources, for instance, in terminology management systems, at various levels of linguistic annotation (for example, morphosyntactic, syntactic, and discourse levels), for lexical representation (natural language processing (NLP) lexicons, including machine translation (MT) dictionaries, etc.), or for specific applications such as metadata for language resources, query languages or multilingual data representation (for example, translation memories, localization files, etc.);

- "Language Codes": ISO 639-1 and ISO 639-2 contain codes for approximately 650 languages. ISO 639-3 extends this number by an order of magnitude, with a clearer separation between the description of the language and its coding proper [1][2][3]. Including the reference set of language identifiers in the DCR in response to the evolution of the ISO 639 family of standards provides an essential element of any linguistic annotation or representation scheme;

- "Lexicography": the deployment of the DCR will include data categories for the description of lexicographic data as cited in ISO 1951:2007 [4] in order to ensure that the formats used for describing lexicographical (SC 2), terminological (SC 3) and NLP-oriented (SC 4) data are comparable and compatible.

1) Names that function as class names are capitalized. When a name functions as an attribute, it is represented in bold face with the "+" convention: i.e. +administration record as opposed to Administration Record. This function is context-dependent. Terminal values used with the data model appear in normal face bracketed by hyphens: -standardized name-. Data category names are represented using the convention /source/. Data categories are themselves defined in the DCR, not in this International Standard.

The DCR will eventually contain all ISO/TC 37 data categories, with their complete history, data category descriptions, and attendant metadata. It is not, however, the intent of this International Standard to define an ontology of language resources within ISO/TC 37. Nevertheless, the definition of the DCR has avoided any choices that would hamper further work in this direction.

This document is intended to provide a background in the context of ISO/TC 37 on the various issues that have to be considered in order to implement a global DCR that can be used for the full range
of language resources. More precisely, this document addresses the following issues:

- the role of data categories for use with language resources;
- requirements that have been identified with respect to information content and overall management;
- a description of the organization of the DCR;
- an interchange format for data categories, the DCIF (Data Category Interchange Format);
- management procedures for the DCR.

Specific user-oriented instructions and procedures pertaining to the implementation and use of the DCR are available on-line at http://www.isocat.org.

Terminology and other language and content resources - Specification of data categories
and management of a Data Category Registry for language resources

1 Scope

This International Standard provides guidelines concerning constraints related to the implementation of a Data Category Registry (DCR) applicable to all types of language resources, for example, terminological, lexicographical, corpus-based, machine translation, etc. It specifies mechanisms for creating, selecting and maintaining data categories, as well as an interchange format for representing them.

2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 8601:2004, Data elements and interchange formats - Information interchange - Representation of dates and times

ISO/IEC 11179-1:2004, Information technology - Metadata registries (MDR) - Part 1: Framework

ISO/IEC 11179-3, Information technology - Metadata registries (MDR) - Part 3: Registry metamodel and basic attributes

3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO/IEC 11179-1:2004 and the following
apply. Terms and definitions have evolved in the terminology community, represented here by citations from ISO 1087-2, independently of the terminology of the metadata community, which results in slightly different and at times overlapping concepts in the two communities of practice.

3.1 Data elements and data categories

3.1.1
data element
<language resources> unit of data that, in a certain context, is considered indivisible
[ISO 1087-2:2000, 6.11]
NOTE In terminology work, an individual field, for example, /term/, in a single terminological entry has been viewed as a data element and an instantiation of a data category (3.1.3).

3.1.2
data element
DE
<metadata standards> unit of data for which the definition, identification, representation and value domain are specified by means of a set of attributes
[ISO/IEC 11179-1:2004, 3.3.8]

3.1.3
data category
DC
result of the specification of a given data field
[ISO 1087-2:2000, 6.14]
EXAMPLE /partOfSpeech/, /grammaticalGender/, /grammaticalNumber/; the values associated with these items (for example, /noun/, /verb/, /feminine/, /plural/, etc.) are also data categories according to this International Standard, but values of this type are not viewed as data element concepts (3.1.4) in the ISO/IEC 11179 family of standards.
NOTE 1 A data category is an elementary descriptor in a linguistic structure or an annotation scheme (3.1.15).
NOTE 2 A data category corresponds closely, but not identically, to a data element concept in ISO/IEC 11179.
NOTE 3 In running text, such as in this International Standard, data cat