Supporting Annotation Layers for Natural Language Processing

Archana Ganapathi, Preslav Nakov, Ariel Schwartz, and Marti Hearst
Computer Science Division and SIMS
University of California, Berkeley

Motivation

- Most natural language processing (NLP) algorithms make use of the results of previous processing steps, e.g.:
  - Tokenizer
  - Part-of-speech tagger
  - Phrase boundary recognizer
  - Syntactic parser
  - Semantic tagger
- No standard way to represent, store and retrieve text annotations efficiently.
- MEDLINE has close to 13 million abstracts. Full text is starting to become available as well.

Text Annotation Framework

- Annotations are stored independently of text in an RDBMS
- Declarative query language for annotation retrieval
- Indexing structure designed for efficient query processing
- Object-oriented API for annotations: insertion, deletion and modification

Key Contributions

- Support for hierarchical and overlapping layers of annotation
- Querying multiple levels of annotations simultaneously
- First to evaluate different physical database designs
- Focused on scaling annotation-based queries to very large corpora with many layers of annotations
- We propose a query language and demonstrate its power and the efficiency of the indexing architecture on a wide variety of query types that have been published in the NLP literature.

Outline

- Related Work
- Layered Query Language
- Database Design
- API
- Evaluation
- Conclusions

Related Work

- Annotation graphs (AG): directed acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children. (Bird and Liberman, 2001)
- Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined for each pair. (Cassidy [...]
- [...] supports set operations. (Nenadic et al., 2002)

Layers of Annotations

- Full parse, sentence and section layers are not shown.

Layers of Annotation (cont.)

- Each annotation represents an interval spanning a sequence of characters
  - absolute start and end positions
- Each layer corresponds to a conceptually different kind of annotation
  - e.g., word, gene/protein, shallow parse
  - can have several layers with the same semantics
- Layers can be
  - sequential
  - overlapping, e.g., two multiple-word concepts sharing a word
  - hierarchical
    - spanning, when the intervals are nested as in a parse tree, or
    - ontologically, when the token itself is derived from a hierarchical ontology

Layer Type Properties

- One-to-one correspondence between the Word and the Part-of-speech (POS) layers.
- The Word, POS and Shallow parse layers are sequential
- The Full parse layer is spanning hierarchical
- The Gene/protein layer assigns IDs from the LocusLink database of gene names
  - many-to-one in the case of multiple species
- The Ontology layer assigns terms from the hierarchical medical ontology MeSH (Medical Subject Headings)
  - Overlapping (share the word
    cell) and hierarchical: both spanning, since blood cell (with MeSH ID D001773) spans cell (which is also in MeSH), and ontologically, since blood cell is a kind of cell and cell death (D016923) is a type of Biological Phenomena.

Layered Query Language

- Requirements for the query language on layers of annotations:
  - Intuitive
  - Compact
  - Declarative
  - Expressive power for real-world queries
  - Support for hierarchical and overlapping annotations
  - Compatible with SQL
- LQL (Layered Query Language)
  - XML-like
  - Can be translated to SQL to run against an RDBMS
  - Tested on real-world bioscience NLP applications

LQL by Example

A01 A07 limb:vein shoulder: artery

LQL Syntax

- “” defines an arbitrary range over text.
- A range is typically restricted to a specific layer type using [...].
- All layers have a lex (the text spanned by the range) and a tag_type attribute.
- Predicates on attribute values are enclosed in square brackets, i.e. “ | = | ”.
- The language supports the boolean operators conjunction ( [...] can be used to descend an ontological hierarchy.

Additional LQL Features

- For spanning hierarchical layers we can have hierarchical queries with several nested references to the same layer. The following query finds a PP of the form preposition+NP and prints that NP:
    print $
- The keyword noorder allows an arbitrary order for the tokens within a range, e.g.:
    print sentence
- The language allows for a combination of ordered and unordered constraints. For example,
    ( ) print sentence
- LQL currently does not support a range overlap operator.

LQL and SQL

- LQL can be automatically translated into SQL (although this is not yet implemented), as:
  - a user-defined function, or
  - a macro
- The result of an LQL query is a relation, which allows the use of standard SQL syntax such as GROUP BY, COUNT, DISTINCT, ORDER BY, UNION etc.
- An added advantage of LQL over SQL is that the LQL queries do not need to be modified if the underlying logical design is changed.
- LQL is still a work in progress; we plan to assess it via usability studies with computational linguistics researchers, modifying it as necessary. However, we feel it is more intuitive and easier to use for text processing than the existing languages.

LQL Versus SQL

Database Design

- We evaluated 5 different logical and physical database designs.
- The basic model is similar to the one of TIPSTER (Grishman, 1996). Each annotation is stored as a record in a relation.
- Architecture 1 contains the following columns:
  - docid: document ID;
  - section: title, abstract or body text;
  - layer_id: a unique identifier of the annotation layer;
  - start_char_pos: starting character position, relative to particular section and docid;
  - end_char_pos: end character position, relative to particular section and docid;
  - tag_type: a layer-specific token unique identifier.
- There is a separate table mapping token IDs to entities (the string in case of a word, the MeSH label(s) in case of a MeSH term etc.)

Database Design (cont.)

- Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering for each layer.
  - Simplifies some SQL queries as there is no need for “NOT EXISTS” self joins, which are required under Architecture 1 in cases where tokens from the same layer must follow each
    other immediately.
- Architecture 3 adds sentence_id, which is the number of the current sentence, and redefines sequence_pos as relative to both layer_id and sentence_id.
  - Simplifies most queries since they are often limited to the same sentence.

Database Design (cont.)

- Architecture 4 merges the word and POS layers, and adds word_id assuming a one-to-one correspondence between them.
  - Reduces the number of stored annotations and the number of joins in queries with both word and POS constraints.
- Architecture 5 replaces sequence_pos with first_word_pos and last_word_pos, which correspond to the sequence_pos of the first/last word covered by the annotation.
  - Requires all annotation boundaries to coincide with word boundaries.
  - Copes naturally with adjacency constraints between different layers.
  - Allows for a simpler indexing structure.

An Example Relation

Example: “Kinase inhibits RAG-1.”

PMID  SECTION   LAYER ID     START CHAR POS  END CHAR POS  TAG TYPE  SEQUENCE POS  SENTENCE  WORD ID  FIRST WORD POS  LAST WORD POS
3345  b (body)  0 (word)     34              39            59571     1             2         59571    1               1
3345  b         0            41              48            55608     2             2         55608    2               2
3345  b         0            50              54            89985     3             2         89985    3               3
3345  b         1 (POS)      34              39            27 (NN)   1             2         59571    1               1
3345  b         1            41              48            53 (VB)   2             2         55608    2               2
3345  b         1            50              54            27        3             2         89985    3               3
3345  b         5 (gene)     34              39            39 (prt)  1             2         -        1               1
3345  b         5            50              54            39        2             2         -        3               3
3345  b         6 (mesh)     34              39            10770     1             2         -        1               1
3345  b         6            50              54            16654     2             2         -        3               3
3345  b         3 (s.parse)  34              39            31 (NP)   1             2         -        1               1
3345  b         3            41              48            59 (VP)   2             2         -        2               2
3345  b         3            50              54            31        3             2         -        3               3

Column groups: PMID through TAG TYPE form the basic architecture; SEQUENCE POS is added in Architecture 2, SENTENCE in Architecture 3, WORD ID in Architecture 4, and FIRST WORD POS and LAST WORD POS in Architecture 5.

Indexing Structure

- Two types of composite indexes: forward and inverted.
- An index lookup can be
  performed on any column combination that corresponds to an index prefix.
- The forward indexes support lookup based on position in a given document.
- The inverted indexes support lookup based on annotation values (i.e., tag type and word id).
- Most query plans involve both forward and inverted indexes
- Join statistics would have been useful
- Detailed statistics are essential. Standard statistics in DB2 are insufficient.
- Records are clustered on their primary key

Indexing Structure (cont.)

API

- Java-based API allows for simple insertion, deletion and modification of annotations.
- Need to specify document ID, section, layer ID, and positional information.
- Supports editing a collection of annotations and storing them back to the database.
- We plan to develop a user interface for viewing, editing and querying annotations.
- Not a trivial task, since there are many HCI issues on how to display annotations effectively.

Experimental Setup

- Annotated 13,504 MEDLINE abstracts
- Stanford Lexicalized Parser (Klein and Manning, 2003) for sentence splitting, word tokenization, POS tagging and parsing.
- We wrote a shallow parser and tools for gene and MeSH term recognition.
- This resulted in 10,910,243 records stored in an IBM DB2 Universal Database Server.
- Defined 4 workloads based on variants of queries (a-d).

Results

Results (cont.)

- Different architectures are optimized for different types of queries.
- Architecture 5 performs well (if not best) on all query types, while the other architectures perform poorly on at least one query type.
- Storage requirement of Architecture 5 is comparable to that of Architecture 1
- Architecture 5
  results in much simpler queries
- We recommend Architecture 5 in most cases, or Architecture 1 if an atomic annotation layer cannot be defined.

Scalability Analysis

- Combined workload of 3 query types
- Varying buffer pool sizes
- Suggests that the query execution time grows as a sub-linear function of memory size.
- We believe a similar ratio will be observed when increasing the database size and keeping the memory size fixed
- Parallel query execution can be enabled after partitioning the annotations on document_id

Conclusions

- Provided a mechanism to effectively store and query layers of textual annotations.
- Evaluated various structures for data storage and arrived at an efficient and simple one.
- Used variations of queries drawn from published research, to ensure real-world applicability.
- Presented a concise language (LQL) to express queries that span multiple levels of the annotation structure, which captures the user's intent better as the syntax is more intuitive and closely resembles the annotation structure.

Future Work

- Conduct a usability study to assess the query language.
- Automate the LQL to SQL translation process.
- Test the scalability of this approach on larger document collections.

References

- Steven Bird and Mark Liberman. 2001. A formal framework for linguistic annotation. Speech Communication, 33(1-2):23-60.
- Steve Cassidy and Jonathan Harrington. 2001. Speech annotation and corpus tools. Speech Communication, 33(1-2):61-77.
- David McKelvie, Amy Isard, Andreas Mengel, Morten B. Moller, Michael Grosse and Marion Klein. 2001. Speech annotation and corpus tools. Speech Communication, 33(1-2):97-112.
- Goran Nenadic, Hideki Mima, Irena Spasic, Sophia Ananiadou and Jun-ichi Tsujii. 2002. Terminology-Driven Literature Mining and Knowledge Acquisition in Biomedicine. International Journal of Medical Informatics, 67:33-48.
- Ralph Grishman. 1996. Building an Architecture: a CAWG Saga. Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann.
- Steve Cassidy. 1999. Compiling Multi-tiered Speech Databases into the Relational Model: Experiments with the Emu System. 6th European Conference on Speech Communication and Technology (Eurospeech 99), 2127-2130, Budapest, Hungary.
- Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki Maeda. 2002. Models and Tools for Collaborative Annotation. Third International Conference on Language Resources and Evaluation, 2066-2073.

Thank You

- Questions and constructive comments are welcomed
- http://biotext.berkeley.edu
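
Supplementary sketch: cross-layer adjacency under Architecture 5

The Database Design and Indexing Structure slides describe Architecture 5 and its forward and inverted composite indexes only in prose. The following is a minimal sketch of how such a relation might look, not the authors' implementation. It assumes SQLite in place of DB2, a simplified column set, and index column orders chosen for illustration; the names `annotation`, `fwd_idx`, `inv_idx`, `build_db` and `gene_followed_by_verb` are hypothetical. The rows are taken from the "Kinase inhibits RAG-1." example relation.

```python
import sqlite3

# Layer IDs and the VB tag code as they appear in the example relation.
WORD, POS, GENE = 0, 1, 5
TAG_VB = 53

ROWS = [
    # docid, section, layer_id, start_char_pos, end_char_pos,
    # tag_type, sentence_id, first_word_pos, last_word_pos
    (3345, "b", WORD, 34, 39, 59571, 2, 1, 1),
    (3345, "b", WORD, 41, 48, 55608, 2, 2, 2),
    (3345, "b", WORD, 50, 54, 89985, 2, 3, 3),
    (3345, "b", POS, 34, 39, 27, 2, 1, 1),
    (3345, "b", POS, 41, 48, TAG_VB, 2, 2, 2),
    (3345, "b", POS, 50, 54, 27, 2, 3, 3),
    (3345, "b", GENE, 34, 39, 39, 2, 1, 1),
    (3345, "b", GENE, 50, 54, 39, 2, 3, 3),
]


def build_db():
    """Create an in-memory, Architecture 5-style relation (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               docid INTEGER, section TEXT, layer_id INTEGER,
               start_char_pos INTEGER, end_char_pos INTEGER,
               tag_type INTEGER, sentence_id INTEGER,
               first_word_pos INTEGER, last_word_pos INTEGER)"""
    )
    # Forward index: lookup by position within a document (assumed column order).
    conn.execute(
        "CREATE INDEX fwd_idx ON annotation "
        "(docid, section, sentence_id, first_word_pos, layer_id)"
    )
    # Inverted index: lookup by annotation value (assumed column order).
    conn.execute(
        "CREATE INDEX inv_idx ON annotation "
        "(layer_id, tag_type, docid, section, sentence_id, first_word_pos)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", ROWS)
    return conn


def gene_followed_by_verb(conn):
    """Return character spans of gene annotations immediately followed by a VB token."""
    sql = """
        SELECT g.start_char_pos, g.end_char_pos, v.start_char_pos, v.end_char_pos
        FROM annotation AS g
        JOIN annotation AS v
          ON v.docid = g.docid
         AND v.section = g.section
         AND v.sentence_id = g.sentence_id
         AND v.first_word_pos = g.last_word_pos + 1  -- adjacency across layers
        WHERE g.layer_id = ? AND v.layer_id = ? AND v.tag_type = ?
        ORDER BY g.start_char_pos
    """
    return conn.execute(sql, (GENE, POS, TAG_VB)).fetchall()


if __name__ == "__main__":
    print(gene_followed_by_verb(build_db()))
```

On the example rows this returns the single pair of spans (34, 39) and (41, 48), i.e. the gene annotation on "Kinase" followed by the verb "inhibits". Because both layers share word positions, the adjacency constraint is a plain equality join, which is the property the slides attribute to Architecture 5.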
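
Supplementary sketch: immediate succession under Architectures 1 and 2

The Database Design slides state that Architecture 1 needs “NOT EXISTS” self joins when tokens from the same layer must follow each other immediately, and that the sequence_pos column of Architecture 2 removes that need. The sketch below illustrates the difference under stated assumptions: SQLite stands in for DB2, one table carries the sequence_pos column for both queries (the Architecture 1 query simply ignores it), and the names `annotation`, `ARCH1_SQL`, `ARCH2_SQL`, `load` and `adjacent_pairs` are hypothetical. The queries are one plausible formulation, not the ones used in the evaluation.

```python
import sqlite3

WORD, POS = 0, 1  # layer IDs as in the example relation

ROWS = [
    # docid, section, layer_id, start_char_pos, end_char_pos, tag_type, sequence_pos
    (3345, "b", WORD, 34, 39, 59571, 1),
    (3345, "b", WORD, 41, 48, 55608, 2),
    (3345, "b", WORD, 50, 54, 89985, 3),
    (3345, "b", POS, 34, 39, 27, 1),
    (3345, "b", POS, 41, 48, 53, 2),
    (3345, "b", POS, 50, 54, 27, 3),
]

# Architecture 1 style: b follows a immediately if b starts after a ends and
# no third annotation c of the same layer starts in between.
ARCH1_SQL = """
    SELECT a.tag_type, b.tag_type
    FROM annotation AS a
    JOIN annotation AS b
      ON b.docid = a.docid AND b.section = a.section
     AND b.layer_id = a.layer_id
     AND b.start_char_pos > a.end_char_pos
    WHERE a.layer_id = ?
      AND NOT EXISTS (
          SELECT 1 FROM annotation AS c
          WHERE c.docid = a.docid AND c.section = a.section
            AND c.layer_id = a.layer_id
            AND c.start_char_pos > a.end_char_pos
            AND c.start_char_pos < b.start_char_pos)
    ORDER BY a.start_char_pos
"""

# Architecture 2 style: the per-layer ordering makes adjacency an equality join.
ARCH2_SQL = """
    SELECT a.tag_type, b.tag_type
    FROM annotation AS a
    JOIN annotation AS b
      ON b.docid = a.docid AND b.section = a.section
     AND b.layer_id = a.layer_id
     AND b.sequence_pos = a.sequence_pos + 1
    WHERE a.layer_id = ?
    ORDER BY a.start_char_pos
"""


def load():
    """Load the example rows into an in-memory table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               docid INTEGER, section TEXT, layer_id INTEGER,
               start_char_pos INTEGER, end_char_pos INTEGER,
               tag_type INTEGER, sequence_pos INTEGER)"""
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)
    return conn


def adjacent_pairs(conn, sql, layer_id=WORD):
    """Return (tag_type, tag_type) pairs of immediately adjacent annotations in one layer."""
    return conn.execute(sql, (layer_id,)).fetchall()


if __name__ == "__main__":
    conn = load()
    print(adjacent_pairs(conn, ARCH1_SQL))
    print(adjacent_pairs(conn, ARCH2_SQL))
```

Both queries return the same two word pairs on the example data, but the second one avoids the correlated subquery, which is the simplification the slides attribute to sequence_pos.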