HIKM2006, AMTEx

Automatic Document Indexing in Large Medical Collections
Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis (Technical University of Crete, Chania, Greece)
Evangelos E. Milios (Dalhousie University, Halifax, Canada)

Overview
- The need for automatic assignment of index terms in large medical collections
- MMTx (by the US NLM)
- The AMTEx approach to medical document indexing
- AMTEx resources: MeSH & C/NC value
- Experiments & evaluation
- Discussion and future research

Motivation and Objectives
- MeSH is a taxonomy of medical terms
  - Subset of the UMLS Metathesaurus
- MEDLINE is indexed by MeSH terms (assigned by experts)
- Other medical texts need to be associated with MEDLINE, e.g. consumer medical literature
- Need for automatic assignment of MeSH terms to any medical text

MMTx (MetaMap Transfer)
Maps arbitrary text to UMLS Metathesaurus concepts:
- Parsing to extract noun phrases (syntactic analysis, linguistic filter)
- Variant Generation (uses the SPECIALIST Lexicon)
- Candidate Retrieval (mapping process to Metathesaurus concepts)
- Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)

MMTx Example
- Parsing
  - Shallow syntactic analysis of the input text
  - Linguistic filtering: isolates noun phrases
- Variant Generation
  - e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa
- Candidate Retrieval
  - Candidate Metathesaurus concepts for the variant “osa”:
    osa    osa antigen
    osa    osa gene product
    osa    osa protein
    osa    obstructive sleep apnea
- Candidate Evaluation
    Obstructive Sleep apnea    1000
    Sleep Apnea                901
    Apnea                      827
    Sleeping                   793
    Sleepy                     755

MMTx limitations
- MMTx focuses on UMLS rather than MeSH
  - But MEDLINE indexing is based on MeSH
- Exhaustive variant generation: the initial phrase is iteratively expanded into all possible UMLS variants
  - term overgeneration
  - term concept diffusion
  - unrelated terms added to the final candidate list

The AMTEx method
- New method for automatic indexing of medical documents
- Main idea: initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
- Extracts general single and multi-word terms
- Extracted terms are validated against MeSH

AMTEx Outline
- INPUT: Document Collection
- C/NC value Multi-word Term Extraction & Term Ranking
- MeSH Term Validation
- Single-word Term Extraction: non-MeSH multi-word terms are broken down & validated against MeSH
- Variant Generation
- Term Expansion (MeSH)
- MeSH Thesaurus Resource
- OUTPUT: MeSH Term Lists

MeSH: Medical Subject Headings
The NLM medical & biological terms thesaurus:
- Organized in IS-A hierarchies
  - more than 15 taxonomies & more than 22,000 terms
  - a term may appear in multiple taxonomies
- No PART-OF relationships
- Terms organized into synonym sets called entry terms, including stemmed term forms

Fragment of the MeSH IS-A Hierarchy
[Figure: tree fragment with the nodes Root, Nervous system diseases, Neurologic manifestations, pain, headache, neuralgia, Cranial nerve diseases, Facial neuralgia]

The C/NC value method
- Hybrid (linguistic / statistical) term extraction method
- Domain independent
- Specifically designed for the identification of multi-word and nested terms:
  - compound & multi-word terms are very common in the biomedical domain
  - multi-word terms are often used in indexing

C-value
C-value: a phrase may be a term if it often appears alone or within other candidate terms

\[
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested} \\
\log_2 |a| \cdot \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
\]

- a: candidate term
- |a|: length of a in words
- f(): frequency
- T_a: set of candidate terms containing a
- P(T_a): number of such terms

NC-value
NC-value: a phrase is more likely a term if it often appears in a specific word context

\[
\text{NC-value}(a) = 0.8 \cdot \text{C-value}(a) + 0.2 \sum_{w \in C_a} f_a(w) \cdot \frac{t(w)}{n}
\]

- C_a: set of context words of a
- w: context word
- t(w): number of terms w appears with
- n: number of all terms
- f_a(w): frequency of w as context word of a

AMTEx step 1: C/NC value Multi-word Term Extraction & Ranking
- Part-of-Speech Tagging
- Linguistic filtering:
  - N+ N
  - (A|N)+ N
  - ( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N
- Candidate term ranking based on C/NC-value
- Keep terms with NC-value above threshold T1

AMTEx step 2: MeSH Term Validation
- Candidate terms are validated against the MeSH Thesaurus (simple string matching)
- Only candidate terms matching MeSH are kept
- Multi-word candidates not matching MeSH may still contain (shorter) MeSH terms

AMTEx step 3: Single-word Term Extraction
For multi-word terms not matching MeSH:
- Multi-word terms are split into single-word terms
- Single-word terms are matched against MeSH
- Matched MeSH terms are added to the term list
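Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it ranks by C-value only (the NC-value context weighting is left out), it assumes the standard C-value definition with the nested-term correction, and the function names, the sample frequencies, the threshold value and the tiny `mesh` set are all hypothetical.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Score candidates given as word tuples mapped to corpus frequencies.

    Assumed definition: log2|a| * f(a) when a is not nested, otherwise
    log2|a| * (f(a) - mean frequency of the longer candidates containing a).
    """
    scores = {}
    for a, f_a in freq.items():
        nests = [b for b in freq if len(b) > len(a) and contains(b, a)]
        if nests:
            f_a = f_a - sum(freq[b] for b in nests) / len(nests)
        scores[a] = math.log2(len(a)) * f_a
    return scores


def validate(ranked_terms, mesh):
    """Steps 2 and 3: keep exact MeSH matches, split the rest into single words."""
    kept = []
    for term in ranked_terms:
        phrase = " ".join(term)
        matches = [phrase] if phrase in mesh else [w for w in term if w in mesh]
        for m in matches:
            if m not in kept:  # preserve ranking order, avoid duplicates
                kept.append(m)
    return kept


# Hypothetical candidate frequencies and a toy stand-in for the MeSH thesaurus.
freq = {
    ("obstructive", "sleep", "apnea"): 4,
    ("sleep", "apnea"): 6,
    ("knee", "pain"): 3,
}
mesh = {"sleep apnea", "knee", "pain"}
T1 = 1.5  # hypothetical threshold

scores = c_value(freq)
ranked = sorted((t for t in scores if scores[t] > T1), key=scores.get, reverse=True)
print(validate(ranked, mesh))
```

In this toy run “sleep apnea” is penalised because most of its occurrences fall inside “obstructive sleep apnea”, and “knee pain”, which has no exact match in the toy thesaurus, still contributes its single-word components.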
AMTEx step 4: Term Variant Generation
Variants are added to the list of terms:
- Inflectional variants of the extracted terms identified during term extraction (C/NC-value)
- Stemmed term forms available in MeSH

AMTEx step 5: Term Expansion
- Each term in the list is expanded with neighbouring terms in the MeSH hierarchy
- The expansion may include terms more than one level higher or lower than the original term, depending on the similarity threshold T
- Semantic similarity metric by Li et al.
  Y. Li, Z. A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowledge and Data Engineering, 15(4):871-882, July/Aug. 2003.

Example
Input: Full text article

MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov't”

MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”

AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”

Evaluation
- Precision and Recall measures
- Dataset: 61 full MEDLINE documents (not abstracts), from the PMC database of NCBI PubMed
- MEDLINE documents are paired with their respective MeSH index terms, manually assigned by experts
- Ground truth: the set of MeSH document index terms
- Benchmark method: MMTx, compared against our AMTEx

Multi-Word Terms only
T: term expansion threshold; a lower T means further expansion

Contribution of Single-Word Terms

Conclusions: AMTEx
- Designed for indexing and retrieval of MEDLINE documents
- Focuses on multi-word term extraction using valid linguistic & statistical criteria
- Based on MeSH, similarly to human indexing
- Selectively expands into term variants and synonyms
- Outperforms the current benchmark MMTx method in both precision & recall

Future Work
- Better ranking of terms, using semantic similarity
- Learning of thresholds T1, T
- Word sense disambiguation to detect the correct sense for expansion rather than the most common sense
- Handling shorter documents
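
The role of the expansion threshold T in step 5 can be illustrated with a small Python sketch. It uses the similarity measure of Li et al. (2003) in its path-length and depth form, exp(-alpha * l) * tanh(beta * h), with alpha = 0.2 and beta = 0.6 as reported in that paper. Everything else is an assumption made for illustration: the parent links in `PARENT` are a guessed single-parent tree over the node labels of the hierarchy fragment (real MeSH terms may sit in several taxonomies), the root is given depth 0, a term is kept when its similarity is at least T, and the function names are hypothetical.

```python
import math

# Hypothetical single-parent tree over the node labels of the MeSH fragment.
PARENT = {
    "nervous system diseases": "root",
    "neurologic manifestations": "nervous system diseases",
    "cranial nerve diseases": "nervous system diseases",
    "pain": "neurologic manifestations",
    "headache": "pain",
    "neuralgia": "pain",
    "facial neuralgia": "cranial nerve diseases",
}


def ancestors(term):
    """Path from `term` up to the root, starting with `term` itself."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def li_similarity(t1, t2, alpha=0.2, beta=0.6):
    """exp(-alpha * l) * tanh(beta * h) for path length l and subsumer depth h."""
    p1, p2 = ancestors(t1), ancestors(t2)
    common = next(a for a in p1 if a in p2)   # lowest common subsumer
    l = p1.index(common) + p2.index(common)   # shortest path length in a tree
    h = len(ancestors(common)) - 1            # depth of the subsumer, root = 0
    return math.exp(-alpha * l) * math.tanh(beta * h)


def expand(term, threshold):
    """Neighbouring terms whose similarity to `term` is at least `threshold`."""
    vocabulary = set(PARENT) | set(PARENT.values())
    return sorted(
        t for t in vocabulary
        if t != term and li_similarity(term, t) >= threshold
    )


print(expand("pain", 0.7))
print(expand("pain", 0.6))
```

With these assumed links, lowering T from 0.7 to 0.6 extends the expansion of “pain” from its two children to its parent as well, which matches the note that a lower T means further expansion.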