Digital Libraries and the Semantic Web
A conceptual framework and an agenda for research and practice
Keynote presentation at ICSD 2009
Dagobert Soergel
Department of Library and Information Studies
Graduate School of Education
University at Buffalo

Acknowledgments
Many of the ideas in this presentation originated from a review of the papers submitted to the International Conference on the Semantic Web and Digital Libraries 2009 (ICSD 2009). So acknowledgments are due to all the paper authors.

Soergel, ICSD 2009 Keynote

DLs versus SW
Digital Libraries
- Manage (often large) collections of documents and data sets and provide access to these resources and, ideally, tools to process them.
- Retrieval is often based on words in text.
Semantic Web
- Uses inference over a large distributed storehouse of propositional data, including ontologies, to
  - answer a question,
  - derive a problem solution,
  - devise a plan of action.

DL ↔ SW
DL → SW: How can digital libraries support Semantic Web functionality?
- Generate propositional knowledge, including ontologies, from document corpora through information extraction or statistical methods
SW → DL: How can Semantic Web technology improve digital libraries?
- Use semantics to improve retrieval and presentation
Towards unified systems
- Harmonize standards from DLs (and libraries generally) and SW, profiting from the thinking of both communities

Overview
- Information extraction (and its use for ontology creation)
- Semantically enriched documents
- Integrated store of documents, propositions, data sets
- Navigation in concept structures and document spaces
- Support for learning, sense-making, tasks
- Schema and ontology creation and mapping

Information extraction
Text: High blood pressure is a serious disease often caused by being overweight. In kids 4-12 it can be treated highly effectively with Nystatin.
Formal representation:
Causation (HighBloodPressure, Obesity)
Treatment (HighBloodPressure, Human, Age, 4-12y, Nystatin, Effectiveness, 4)

Answering questions
Question: How can high blood pressure be prevented?
Answer: Lose weight?

Information extraction
Text: Kids begin grazing independently from their mothers at three months.
Formal representation:
Separation (Mother, Child, Goat, Age, 3m)
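The step from extracted propositions to an answer can be illustrated with a small sketch. It stores the three formal representations from the slides as plain records and derives a prevention hint from the Causation proposition. The class name, the function names, and the rule "avoiding a known cause is a candidate way to prevent a condition" are assumptions made for illustration; the slides do not specify an inference mechanism. The sketch also assumes the argument order Causation(effect, cause), as the slide's example suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """One extracted proposition: a relation name plus positional arguments."""
    relation: str
    args: tuple


# The propositions from the slides, transcribed as written there.
STORE = [
    Proposition("Causation", ("HighBloodPressure", "Obesity")),
    Proposition("Treatment", ("HighBloodPressure", "Human", "Age", "4-12y",
                              "Nystatin", "Effectiveness", 4)),
    Proposition("Separation", ("Mother", "Child", "Goat", "Age", "3m")),
]


def causes_of(effect, store):
    """Return all causes recorded for `effect`.

    Assumes the argument order Causation(effect, cause).
    """
    return [p.args[1] for p in store
            if p.relation == "Causation" and p.args[0] == effect]


def prevention_candidates(condition, store):
    """Hypothetical rule: avoiding a known cause may prevent the condition."""
    return [f"Avoid {cause}" for cause in causes_of(condition, store)]


print(prevention_candidates("HighBloodPressure", STORE))
```

With the slide's data this prints `['Avoid Obesity']`, which corresponds to the tentative answer "Lose weight?". The question mark on the slide is a reminder that such an inferred answer is a candidate, not an established fact.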
Automatic information extraction
- Find suitable documents or images
- Highly structured documents (such as dictionaries) and documents containing structured lists (such as a classification of life events) work well
- Recognize entities (concepts, named entities)
- Find the unique identifier for each (from some standard scheme)
- Noun phrase and verb phrase identification
- Word sense disambiguation, co-reference resolution
- Determine relationships, express propositions in formal representation
- Much of this requires syntactic and semantic parsing
- Also recognition of relationships from typographical arrangement
- Recognition of propositions not expressed in a single sentence
- Deal with negation and other qualifications
- Certainty (as expressed in one source)

Automatic information extraction
- Add to proposition store
  - If proposition already known, just add reference to source
  - If proposition new, add proposition with its source
- Identify relationships between propositions (such as contradictions)
- Certainty (from information across sources, considering evidential strength of each source)
- Can label proposition as to general origin (language of source document, cultural origin of source document, scholarly / scientific school of source document)
- Knowledge in proposition store assists in IE from new documents

Computer-supported IE
- Automatic information extraction is hard; need to supplement with human IE
- IE as part of document authoring or during publishing
- Collaborative IE (crowdsourcing)
- Build systems that support the human task
- Make human IE and semantic enrichment by authors feasible
- Person edits results of automatic IE
- Person enters free-form proposition, system converts to formal representation, person checks
- Reconciliation of differences in results
- Computer-supported IE system should learn from changes made by human editor

Corpus-based information extraction
- Find associations in a corpus
- Data mining over text corpora or numeric databases
- Finding connections between non-overlapping literatures,
  pioneered by Don Swanson

Multilingual information extraction
- Requires IE tools in multiple languages
- Creates proposition store from many sources
- Interesting experiment
  - Document exists in two languages
  - Apply IE to both versions and compare results

IE for Ontology creation
- Some extracted propositions can be used as elements of an ontology
- Discussed later

Semantic enrichment

A semantically enriched document
Reis et al. (2008) Impact of Environment and Social Gradient on Leptospira infection in Urban Slums (doi:10.1371/journal.pntd.0000228).
Infectious disease studied: Leptospirosis
Pathogen (causative agent of disease): Leptospira spirochete
Vector of disease pathogen: Rat (Rattus norvegicus)
Pathogen host subjected to study: Human (Homo sapiens)
Number of subject individuals in study: 3,171
. . .
Purpose of study: Quantify risk factors for leptospirosis
. . .
Principal finding 1: Prevalence of Leptospira antibodies . . .
Principal finding 2: Disease risk . . . open sewers . . .
(http://dx.doi.org/10.1371/journal.pntd.0000228.x002)

A semantically enriched document

Semantically enriched documents
- Semantic enrichment supports semantic retrieval
- Broad area of its own
- Many different forms
  - Explicit document structure
  - Concept and named entity tagging and identification
  - Assigning additional concepts or named entities
  - Assigning extracted propositions
- Closely linked with information extraction
- IE produces elements of semantic enrichment

Semantic enrichment through document structure
- On a broad level, a document's semantics can be made explicit simply by the internal document structure
- Requires a document template or frame for the type of document
- Document Structure Ontology with templates / frames for many types of documents, including learning objects
- Standards for digital objects
- Includes document formats such as MPEG or SCORM

Template for a research report
1 Background (could also be called Problem)
1.1 General problem area (often including a review of the literature)
1.2 Specific problem. Purpose of the study, question to be answered
2 Methods
2.1 Discussion of the methods used in the study
2.2 Description of the actual conduct of the study
3 Results
4 Conclusions
4.1 Summary of methods and results
4.2 Relationship to existing body of knowledge
4.3 Implications for decision making and/or further research

Computer-supported IE
- Automatic information extraction is hard; need to supplement with human IE
- IE as part of document authoring or during publishing
- Collaborative IE (crowdsourcing)
- Build systems that support the human task
- Make human IE and semantic enrichment by authors feasible
- Person edits results of automatic IE
- Person enters free-form proposition, system converts to formal representation, person checks
- Reconciliation of differences in results
- Computer-supported IE system should learn from changes made by human editor

Concept and named entity tagging and identification
- Includes abstract concepts and named entities such as persons,
  organizations, places, dates, events, etc.
- Identified with reference to some standard scheme, such as a Knowledge Organization System (KOS, includes ontologies, thesauri, etc.) or NE registry
- Add identifier as part of the tag
- Can tag within text or list separately as metadata (with pointer to the precise piece of the text)

Additional concepts or named entities
- Concepts or named entities that are not designated by a word or phrase in the text but implied by the document as a whole or a passage in it
- Assigned through
  - Statistical automatic classifier
  - Rule-based inference
  - Human editor (with ontology-based assistance)
- Each concept or NE should be linked to the smallest text passage that implies it (may be the whole document)

Assigning extracted propositions
- Allows for more precise retrieval
- Example: Precise retrieval of documents on causation is notoriously difficult
  - Does A cause B?
  - What are the effects of A?
  - What causes B?
- If propositions of the form "A causes B" are assigned to the document in semantic enrichment, such searches are possible
- Propositions can be transferred to a larger repository (see IE) or be available only through the enriched Web document; they can still be found and used by Semantic Web agents

Making semantic enrichment available
- Documents are enriched from many sources
- The same document may receive multiple enrichments
- Digital libraries and publishers should ensure that a user looking at any copy of a document sees all the semantic enrichments for this document

Dual representation of document content
- Representation to use the same content for two purposes
  - for people (teach people)
  - for computer processing (teach computer systems)
- How precise is the correspondence?
- How complete is each representation?
- How easy is it to get from one to the other?
  - Information extraction
  - Text and image generation
  - Text generation in multiple languages: one approach to translation

Integrated Digital Libraries: Documents + Data + Tools
- Elements
  - Semantically enriched documents
  - Proposition store (including propositions in any Web document)
  - Data sets
  - Tools for data analysis and reasoning
- All linked together, for example
  - Drill down from a formally stated proposition to text and to supporting data
  - Link from text to formal propositions and related texts
  - Link from data set to suitable data analysis tools
- Created and maintained collaboratively
- Example: Neurocommons http://sciencecommons.org/projects/data

Navigation in concept structures and document/data spaces

Concept structures
- Internally, concept structures are often represented in RDF or OWL
- Externally, for the user, they need to be shown in a meaningful representation that reflects concept relationships so the user can understand them and navigate them
- Can be trees shown in outline form with cross-references or
  concept maps
- Challenge of producing these automatically

Concept structure with data

Document/data spaces
- Documents and document passages are related in many ways that can be used for navigation and presentation
- Challenge 1. Identify passages in multiple documents and arrange them according to relationships that allow the user to see the whole picture and navigate passages in a meaningful sequence
- Challenge 2. Arrange passages to fit into the structure of an argument

Multi-level topical structure

Information arranged by role in argument

Topical relevance typology
Function-based
Reasoning-based
Semantic-based (Green & Bean, 1995): Taxonomy, Partonomy, Frame-based, etc.
Argument structure: Grounds, Warrants, Claim
Generic inference; Comparison-based; Induction / rule-based; Causal-based; Transitivity-based
Rhetorical structure: Matching topic, Evidence (Indirect), Context, Comparison, Evaluation, Method / Solution, Purpose / Goal

Matching topic (Direct)
. Manifestation
. Image content
. Image theme
Evidence (Indirect)
Context
. Scope
. Framework
. Environmental setting
. Social background
. Time & sequence
. Assumption / expectation
. Biographic information
Condition
. Helping or hindering factor
. Unconditional
. Exceptional condition
Purpose / Motivation
Cause / Effect
. Cause
. Effect / Outcome
. Explanation (causal)
. Prediction
Comparison
. By similarity (analogy) / By difference (contrast)
. By factor that is different
Method / Solution
. Method / Approach
. Instrument
. Technique / Style
Evaluation
. Significance
. Limitation
. Criterion / Standard
. Comparative evaluation
RST+ Functional Role

Functional role: Comparison
Comparison
. By similarity vs. By difference (Contrast)
. . By similarity
. . . Analogy & metaphor
. . By difference (Contrast)
. By factor that is different
. . Different external factor
. . . Different time
. . . Different place
. . Different participant
. . . Different actor
. . . Different subject acted upon
. . Different act or experience
. . . Different act
. . . Different experience

Support for learning, sense-making, tasks

Support for learning
- Structuring learning objects from small reusable
  elements
- Indexing learning objects so they can be
  - matched with individual learners
  - arranged in a meaningful didactic sequence
- Automatic composition of learning objects customized for individual learner
- Support learner control where appropriate

Support for learning
- Requires specialized document structure ontology
- Requires ontologies for
  - Learner characteristics
  - Learning objectives
  - Learning object characteristics that can be used for matching
  - Types of relationships between learning objects. Examples: Prerequisite, elaboration

Support for learning
- Requires domain ontologies adapted for learning and instruction
- Show meaningful structures for assimilation by the learner
- Support arrangement and sequencing of material to be learned
- Tools for ontology construction by the learner, for example, concept maps
- Active learning, building own structures, constructivist approach

Sense-making
- Sense-making is the process of creating an understanding of a problem or task so that further actions may be taken in an informed manner.
- Sense-making is a prerequisite for many other tasks such as decision making and problem solving.
- Sense-making involves making clear the interrelated concepts and their relationships in a problem or task space.

Sense-making scenario 1
Intelligence task T1: al-Bashir
The US wants to take action towards a resolution of the Darfur conflict. Al-Bashir, the Sudanese president, is one of the key players in the area who is believed to have significant responsibility for continuous conflicts in the region. The administration needs to know as much as possible about al-Bashir in order to better negotiate with the involved parties and strategize its efforts. Your task is to produce a report that identifies information to assess the influence of al-Bashir and makes recommendations for policy decisions and diplomatic actions. Requested information includes: key figures, organizations, and countries who have been associated with al-Bashir; his rise to power; and groups who have resisted him and the level of success in their resistance.
Could draw a concept map based on multiple sources (map is for illustration)
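A concept map drawn from multiple sources can be thought of as a labeled graph in which each link remembers the sources that support it. The sketch below is a minimal illustration under that assumption. The function names, source names, entities, and relation labels are invented placeholders, not data from the task description or the presentation.

```python
from collections import defaultdict


def build_concept_map(extractions):
    """Merge (subject, relation, object) triples from several sources.

    Each distinct triple is stored once, together with the set of
    sources in which it was found.
    """
    concept_map = defaultdict(set)
    for source, triples in extractions.items():
        for triple in triples:
            concept_map[triple].add(source)
    return dict(concept_map)


def links_for(concept_map, concept):
    """List the links touching `concept`, with the number of supporting sources."""
    links = []
    for (subj, rel, obj), sources in concept_map.items():
        if concept in (subj, obj):
            links.append((subj, rel, obj, len(sources)))
    return sorted(links)


# Placeholder input: triples as they might come out of information extraction.
EXTRACTIONS = {
    "source-1": [("Person P", "leads", "Organization A"),
                 ("Group G", "opposes", "Person P")],
    "source-2": [("Person P", "leads", "Organization A"),
                 ("Person P", "allied with", "Country C")],
}

concept_map = build_concept_map(EXTRACTIONS)
for link in links_for(concept_map, "Person P"):
    print(link)
```

Keeping the set of sources on each link lets an analyst see at a glance which relationships are corroborated by more than one source and which rest on a single report.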