BS ISO 24611-2013 Language resource management Morpho-syntactic annotation framework (MAF).pdf

Resource description

raising standards worldwide
NO COPYING WITHOUT BSI PERMISSION EXCEPT AS PERMITTED BY COPYRIGHT LAW

BSI Standards Publication
BS ISO 24611:2012
Language resource management Morpho-syntactic annotation framework (MAF)

BRITISH STANDARD

National foreword

This British Standard is the UK implementation of ISO 24611:2012. The UK participation in its preparation was entrusted to Technical Committee TS/1, Terminology. A list of organizations represented on this committee can be obtained on request to its secretary.

This publication does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.

© The British Standards Institution 2013. Published by BSI Standards Limited 2013.
ISBN 978 0 580 54234 3
ICS 01.020

Compliance with a British Standard cannot confer immunity from legal obligations.

This British Standard was published under the authority of the Standards Policy and Strategy Committee on 31 March 2013.

Amendments issued since publication
Date    Text affected

INTERNATIONAL STANDARD ISO 24611
First edition 2012-11-01
Reference number ISO 24611:2012(E)
© ISO 2012

Language resource management Morpho-syntactic annotation framework (MAF)
Gestion des ressources langagières Cadre d'annotation morphosyntaxique (MAF)

COPYRIGHT PROTECTED DOCUMENT

© ISO 2012. All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or
by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56
CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 The MAF meta-model
4.1 Overview
4.2 MAF Meta-model
5 Segmenting with tokens
5.1 General
5.2 Formal description:
5.3 Embedding notation
5.4 Alternate representation for TEI based documents
5.5 Stand-off notation
5.6 Informative attributes
5.7 Completing the inline token notation
5.7.1 Joining tokens in embedded mode
5.7.2 Overlapping tokens
6 Word-forms as linguistic units
6.1 Formal description:
6.2 Token attachment
6.2.1 One token; one word-form
6.2.2 Several contiguous tokens; one word-form
6.2.3 Several discontinuous tokens; one word-form
6.2.4 Zero token; one word-form
6.2.5 One token; several word-forms
6.3 Referring to lexical entries
6.4 Compound word-forms
6.5 Identification of word-forms within a TEI-compliant document
7 Morpho-syntactic content
7.1 General
7.2 Using feature structures
7.3 Compact morpho-syntactic tags
7.4 FSR libraries
7.5 Designing tagsets
7.6 Formal description:
8 Handling ambiguities
8.1 Word-form content ambiguities
8.2 Lexical Ambiguities
8.3 Structural ambiguities
8.3.1 Structural ambiguities with word-forms
8.3.2 Structural ambiguities with tokens
8.4 Simplified structuring variants
8.4.1 Non-ambiguous linear representation
8.4.2 Mixed linear and lattice representation
8.5 Expanding the simplified variants
8.5.1 Separating tokens and word-forms
8.5.2 Wrapping into local lattices
8.5.3 Merging local lattices
8.5.4 Removing
8.6 Formal description: and
Annex A (informative) Encoded example using the MAF serialization
Annex B (normative) MAF specification
B.1 Elements
B.1.1
B.1.2
B.1.3
B.1.4
B.1.5
B.1.6
B.1.7
B.1.8
B.2 Model classes
B.3 Attribute classes
B.3.1 att.token.information
B.3.2 att.token.join
B.3.3 att.token.span
B.3.4 att.wordForm.content
B.3.5 att.wordForm.tokens
B.4 Macros
B.4.1 data.certainty
B.4.2 data.code
B.4.3 data.count
B.4.4 data.duration.w3c
B.4.5 data.enumerated
B.4.6 data.key
B.4.7 data.language
B.4.8 data.name
B.4.9 data.numeric
B.4.10 data.pointer
B.4.11 data.probability
B.4.12 data.temporal.w3c
B.4.13 data.truthValue
B.4.14 data.word
B.4.15 data.xTruthValue
Annex C (normative) Morpho-syntactic data categories
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is
to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24611 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.

Introduction

ISO/TC 37/SC 4 focuses on the definition of models and formats for the representation of annotated language resources. To this end, it has generalised the modelling strategy initiated by its sister
committee, SC 3, for the representation of terminological data (Romary, 2001), through which linguistic data models are seen as the combination of a generic data pattern (a meta-model), which is further refined through a selection of data categories that provide the descriptors for this specific annotation level. Such models are defined independently of any specific formats, and ensure that an implementer has the necessary conceptual instrument with which to design and compare formats with regard to their degrees of interoperability.

One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable semantics for the various descriptors used, either in the form of formal features and feature values, or directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across various annotation schemas and encoding applications, such a semantics should be implemented as a centralised registry of concepts: we will henceforth refer to these as data categories. As such, data categories should bear the following constraints. From a technical point of view, they must provide unique, stable references (implemented as persistent identifiers, in the sense of ISO 24619) such that the designer of a specific encoding schema can refer to them in his or her specification. By doing so, two annotations will be deemed to be equivalent when they are in fact defined in relation to the same data categories (as feature and
feature value). From a descriptive point of view, each unique semantic reference should be associated with precise documentation combining a full text elicitation of the meaning of the descriptor with the expression of specific constraints that bear upon the category.

In recent years, ISO has developed a general framework for representing and maintaining such a registry of data categories, encompassing all domains of language resources. This initiative, described in ISO 12620, has led to the implementation of an online environment providing access to all data categories that have been standardized in the context of the various language resource-related activities within ISO, or specifically as part of the maintenance of the data category registry. It also provides access to the various data categories that individual language technology practitioners have defined in the course of their own work and decided to share with the community.

The ISO data category registry, as available through the ISOCat (www.isocat.org) implementation, is intended as a flat marketplace of semantic objects, providing only a limited set of ontological constraints. The objective there is to facilitate the
maintenance of a comprehensive descriptive environment where new categories are easily inserted and reused without the need for any strong consistency check with the registry at large. Indeed, the following basic constraints are part of the data category model, as defined in ISO 12620:

- simple generic-specific relations, when these are useful for the proper identification of interoperability descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/ makes it possible to compare morpho-syntactic annotations based on different descriptive levels of granularity;
- the description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable, the possible values of so-called complex data categories. For instance, it can be used to record that the possible values of /grammaticalGender/ (limited to a small group of languages; Romary, 2011) could be a subset of /masculine/, /feminine/ and /neutral/;
- language-specific constraints, either in the form of specific application notes or as explicit restrictions bearing upon the conceptual domains of complex data categories. For instance, it is possible to express explicitly that /grammaticalGender/ in French can only take the two values /masculine/ and /feminine/.

This International Standard provides a comprehensive framework for the representation of morpho-syntactic (also referred to as part-of-speech) annotations. Such an annotation level corresponds to a first lexical abstraction level over language data (textual or spoken) and, depending on the language to be annotated, together with the characteristics of the annotation tool or annotation scheme that is being used, can vary enormously in structure
and complexity.

In order to deal with such complex issues as ambiguity and determinism in morpho-syntactic annotation, this International Standard introduces a meta-model that draws a clear distinction between the two levels of tokens (representing the surface segmentation of the source) and word-forms (identifying lexical abstractions associated with groups of tokens). These two levels share the following specificities: on the one hand, they can be represented as simple sequences and as local graphs such as multiple segmentations and ambiguous compounds; on the other hand, any n-to-n combination can stand between word-forms and tokens.

As linguistic segments (sometimes called markables in the literature; see, for instance, Carletta et al., 1997), tokens may be embedded in the source document as inline mark-up, or they may point remotely to it by means of so-called stand-off annotations. As linguistic abstractions, word-forms can be qualified by various linguistic features characterising the morpho-syntactic properties that are instantiated in the realisation of the lexical entry within the annotated text. Such properties may range from the simple indication of a lemma up to an explicit reference to a lexical entry in a dictionary.

In most existing applications of morpho-syntactic annotation, linguistic properties are expressed by means of so-called tags; these codes refer to basic feature structures (see early examples in Monachini and Calzolari, 1994). Such codes may also provide morphological information, including the part of speech (e.g. noun, adjective or verb) and features such as number, gender, person, mood and verbal tense. In keeping with the general modelling strategy of ISO/TC 37, this International Standard (MAF) provides means of relating morpho-syntactic tags expressed as feature structures (compliant with ISO 24610) to the data categories available in ISOCat. A normative annex of this International Standard elicits a core set of data categories that can be used as reference for most current morpho-syntactic annotation tasks in a multilingual context.
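
The distinction drawn above between tokens and word-forms can be illustrated with a minimal Python sketch. It is not the MAF serialization defined in this International Standard: the class names, the whitespace tokenizer, the tag codes and the feature names (such as `person`, `polarity` and `verbForm`) are assumptions made for illustration only. The sketch shows one token attached to several word-forms, several contiguous tokens attached to one word-form, and compact tags expanded into small feature structures.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Surface segment of the source, addressed by character offsets (stand-off style)."""
    id: str
    start: int
    end: int


@dataclass(frozen=True)
class WordForm:
    """Lexical abstraction attached to zero, one or several tokens."""
    lemma: str
    token_ids: tuple  # ids of the tokens this word-form is attached to
    tag: str          # compact morpho-syntactic tag


# Hypothetical tag library: each compact tag abbreviates a small feature structure.
TAG_LIBRARY = {
    "PRO1S": {"partOfSpeech": "pronoun", "person": "first", "number": "singular"},
    "AUX": {"partOfSpeech": "auxiliary"},
    "NEG": {"partOfSpeech": "particle", "polarity": "negative"},
    "VINF": {"partOfSpeech": "verb", "verbForm": "infinitive"},
    "NP": {"partOfSpeech": "properNoun"},
}


def tokenize(source):
    """Split on whitespace and record offsets; real segmentation is language-specific."""
    return [Token(f"t{i}", m.start(), m.end())
            for i, m in enumerate(re.finditer(r"\S+", source), start=1)]


def surface(word_form, tokens, source):
    """Return the source text covered by the tokens of a word-form."""
    by_id = {t.id: t for t in tokens}
    return " ".join(source[by_id[i].start:by_id[i].end] for i in word_form.token_ids)


def features(word_form):
    """Expand the compact tag into a feature structure and add the lemma."""
    return dict(TAG_LIBRARY[word_form.tag], lemma=word_form.lemma)


source = "I won't visit New York"
tokens = tokenize(source)
word_forms = [
    WordForm("I", ("t1",), "PRO1S"),
    WordForm("will", ("t2",), "AUX"),         # one token, several word-forms
    WordForm("not", ("t2",), "NEG"),
    WordForm("visit", ("t3",), "VINF"),
    WordForm("New York", ("t4", "t5"), "NP"),  # several tokens, one word-form
]

for wf in word_forms:
    print(wf.lemma, "<-", surface(wf, tokens, source), features(wf))
```

Here the token `won't` carries two word-forms (`will` and `not`), while the tokens `New` and `York` share a single word-form. In an actual annotation, the feature names and values would be tied to registered data categories rather than to ad hoc strings.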

However, when implementers of this International Standard find these categories inappropriate in either coverage, scope or semantics, they are encouraged to use ISOCat to define their own categories in compliance with ISO/TC 37 principles.

Associated with the meta-model, MAF also provides a default XML syntax that may be used to serialise MAF-compliant annotation models. Since many existing projects are based on the Text Encoding Initiative (TEI) guidelines (www.tei-c.org), particularly in digital humanities, where a proper encoding of textual sources is essential, this International Standard will also provide clues about how to articulate the MAF model with TEI-compliant encodings. Indeed, the TEI guidelines already offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora and their annotations (Romary and Witt, 2012).

Finally, it should be noted here that this International Standard forms the conceptual basis for the development of the ISO 24614 series on word segmentation, whereby all general principles and rules defined in ISO 24614-1, as well as the constraints expressed in additional parts for specific languages, are to be understood according to the token/word-form dichotomy.

1 Scope

This International Standard provides a framework for the representation of
annotations of word-forms in texts; such annotations concern tokens, their relationship with lexical units, and their morpho-syntactic properties. It describes a metamodel for morpho-syntactic annotation that refers to the data categories contained in the ISOCat data category registry (DCR, as defined in ISO 12620). It also describes an XML serialization for morpho-syntactic annotations, with equivalences to the guidelines of the TEI (Text Encoding Initiative).

2 Normative references

The following ref
