Slide 1: Automated Text Summarization Tutorial (COLING/ACL '98)
  Eduard Hovy and Daniel Marcu
  Information Sciences Institute, University of Southern California
  4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
  {hovy,marcu}@isi.edu
  http://www.isi.edu/natural-language/people/hovy.html, marcu.html

Slide 2: An exciting challenge...
  ...put a book on the scanner, turn the dial to 2 pages, and read the result.
  ...download 1000 documents from the web, send them to the summarizer, and select the best ones by reading the summaries of the clusters.
  ...forward the Japanese email to the summarizer, select 1 par, and skim the translated summary.

Slide 3: Headline news: informing

Slide 4: TV-GUIDES: decision making

Slide 5: Abstracts of papers: time saving

Slide 6: Graphical maps: orienting

Slide 7: Textual directions: planning

Slide 8: Cliff notes: laziness support

Slide 9: Real systems: money making

Slide 10: Questions
  What kinds of summaries do people want? What are summarizing, abstracting, gisting, ...?
  How sophisticated must summarization systems be? Are statistical techniques sufficient? Or do we need symbolic techniques and deep understanding as well?
  What milestones would mark quantum leaps in summarization theory and practice?
  How do we measure summarization quality?

Slide 11: Table of contents
  1. Motivation.
  2. Genres and types of summaries.
  3. Approaches and paradigms.
  4. Summarization methods (exercise).
  5. Evaluating summaries.
  6. The future.

Slide 12: Genres of Summary?
  Indicative vs. informative: used for quick categorization vs. content processing.
  Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
  Generic vs. query-oriented: provides author's view vs. reflects user's interest.
  Background vs. just-the-news: assumes reader's prior knowledge is poor vs. up-to-date.
  Single-document vs. multi-document source: based on one text vs. fuses together many texts.

Slide 13: Examples of Genres
  Exercise: summarize the following texts for the following readers:

Slide 14: 90 Soldiers Arrested After Coup Attempt In Tribal Homeland

MMABATHO, South Africa (AP) About 90 soldiers have been arrested and face possible death sentences stemming from a coup attempt in Bophuthatswana, leaders of the tribal homeland said Friday.

Rebel soldiers staged the takeover bid Wednesday, detaining homeland President Lucas Mangope and several top Cabinet officials for 15 hours before South African soldiers and police rushed to the homeland, rescuing the leaders and restoring them to power.

At least three soldiers and two civilians died in the uprising.

Bophuthatswana's Minister of Justice G. Godfrey Mothibe told a news conference that those arrested have been charged with high treason and if convicted could be sentenced to death. He said the accused were to appear in court Monday.

All those arrested in the coup attempt have been described as young troops, the most senior being a warrant officer.

During the coup rebel soldiers installed as head of state Rocky Malebane-Metsing, leader of the opposition Progressive People's Party.

Malebane-Metsing escaped capture and his whereabouts remained unknown, officials said. Several unsubstantiated reports said he fled to nearby Botswana.

Warrant Officer M.T.F. Phiri, described by Mangope as one of the coup leaders, was arrested Friday in Mmabatho, capital of the nominally independent homeland, officials said.

Bophuthatswana, which has a population
of 1.7 million spread over seven separate land blocks, is one of 10 tribal homelands in South Africa. About half of South Africa's 26 million blacks live in the homelands, none of which are recognized internationally.

Hennie Riekert, the homeland's defense minister, said South African troops were to remain in Bophuthatswana but will not become a permanent presence.

Bophuthatswana's Foreign Minister Solomon Rathebe defended South Africa's intervention.

"The fact that ... the South African government (was invited) to assist in this drama is not anything new nor peculiar to Bophuthatswana," Rathebe said. "But why South Africa, one might ask? Because she is the only country with whom Bophuthatswana enjoys diplomatic relations and has formal agreements."

Mangope described the mutual defense treaty between the homeland and South Africa as similar to the NATO agreement, referring to the Atlantic military alliance. He did not elaborate.

Asked about the causes of the coup, Mangope said, "We granted people freedom perhaps ... to the extent of planning a thing like this."

The uprising began around 2 a.m. Wednesday when rebel soldiers took Mangope and his top ministers from their homes to the national sports stadium.

On Wednesday evening, South African soldiers and police stormed the stadium, rescuing Mangope and his Cabinet.

South African President P.W. Botha and three of his Cabinet ministers flew to Mmabatho late Wednesday and met with Mangope, the homeland's only president since it was declared independent
in 1977.

The South African government has said, without producing evidence, that the outlawed African National Congress may be linked to the coup.

The ANC, based in Lusaka, Zambia, dismissed the claims and said South Africa's actions showed that it maintains tight control over the homeland governments. The group seeks to topple the Pretoria government.

The African National Congress and other anti-government organizations consider the homelands part of an apartheid system designed to fragment the black majority and deny them political rights in South Africa.

Slide 15: If You Give a Mouse a Cookie (Laura Joffe Numeroff, 1985)

If you give a mouse a cookie, he's going to ask for a glass of milk. When you give him the milk, he'll probably ask you for a straw. When he's finished, he'll ask for a napkin. Then he'll want to look in the mirror to make sure he doesn't have a milk mustache. When he looks into the mirror, he might notice his hair needs a trim. So he'll probably ask for a pair of nail scissors. When he's finished giving himself a trim, he'll want a broom to sweep up. He'll start sweeping. He might get carried away and sweep every room in the house. He may even end up washing the floors as well. When he's done, he'll probably want to take a nap. You'll have to fix up a little box for him with a blanket and a pillow. He'll crawl in, make himself comfortable, and fluff the pillow a few times. He'll probably ask you to read him a story. When you read to him from one of your picture books, he'll ask to see the pictures. When he looks at the pictures, he'll get so excited that he'll want to draw one of his own. He'll ask for paper and crayons. He'll draw a picture. When the picture is finished, he'll want to sign his name, with a pen. Then he'll want to hang his picture on your refrigerator. Which means he'll
need Scotch tape. He'll hang up his drawing and stand back to look at it. Looking at the refrigerator will remind him that he's thirsty. So he'll ask for a glass of milk. And chances are that if he asks for a glass of milk, he's going to want a cookie to go with it.

Slide 16: Aspects that Describe Summaries
  Input (Sparck Jones 97)
    subject type: domain
    genre: newspaper articles, editorials, letters, reports...
    form: regular text structure; free-form
    source size: single doc; multiple docs (few; many)
  Purpose
    situation: embedded in larger system (MT, IR) or not?
    audience: focused or general
    usage: IR, sorting, skimming...
  Output
    completeness: include all aspects, or focus on some?
    format: paragraph, table, etc.
    style: informative, indicative, aggregative, critical...

Slide 17: Table of contents
  1. Motivation.
  2. Genres and types of summaries.
  3. Approaches and paradigms.
  4. Summarization methods (exercise).
  5. Evaluating summaries.
  6. The future.

Slide 18: Making Sense of it All...
  To understand summarization, it helps to consider several perspectives simultaneously:
  1. Approaches: basic starting point, angle of attack, core focus question(s): psycholinguistics, text linguistics, computation...
  2. Paradigms: theoretical stance; methodological preferences: rules, statistics, NLP, Info Retrieval, AI...
  3. Methods: the nuts and bolts: modules, algorithms, processing: word frequency, sentence position, concept generalization...

Slide 19: Psycholinguistic Approach: 2 Studies
  Coarse-grained summarization protocols from professional summarizers (Kintsch and van Dijk, 78):
    Delete material that is trivial or redundant.
    Use superordinate concepts and actions.
    Select or invent topic sentence.
  552 finely-grained summarization strategies from professional summarizers (Endres-Niggemeyer, 98):
    Self control: make yourself feel comfortable.
    Processing: produce a unit as soon as you have enough data.
    Info organization: use "Discussion" section to check results.
    Content selection: the table of contents is relevant.

Slide 20: Computational Approach: Basics
  Top-Down: "I know what I want! Don't confuse me with drivel!"
    User needs: only certain
      types of info
    System needs: particular criteria of interest, used to focus search
  Bottom-Up: "I'm dead curious: what's in the text?"
    User needs: anything that's important
    System needs: generic importance metrics, used to rate content

Slide 21: Query-Driven vs. Text-Driven Focus
  Top-down: Query-driven focus
    Criteria of interest encoded as search specs.
    System uses specs to filter or analyze text portions.
    Examples: templates with slots with semantic characteristics; termlists of important terms.
  Bottom-up: Text-driven focus
    Generic importance metrics encoded as strategies.
    System applies strategies over a representation of the whole text.
    Examples: degree of connectedness in semantic graphs; frequency of occurrence of tokens.

Slide 22: Bottom-Up, using Info. Retrieval
  IR task: Given a query, find the relevant document(s) from a large set of documents.
  Summ-IR task: Given a query, find the relevant passage(s) from a set of passages (i.e., from one or more documents).
  Questions:
    1. IR techniques work on large volumes of data; can they scale down accurately enough?
    2. IR works on words; do abstracts require abstract representations?

Slide 23: Top-Down, using Info. Extraction
  IE task: Given a template and a text, find all the information relevant to each slot of the template and fill it in.
  Summ-IE task: Given a query, select the best template, fill it in, and generate the contents.
  Questions:
    1. IE works only for very particular templates; can it scale up?
    2. What about information that doesn't fit into any template: is this
       a generic limitation of IE?
  [Figure: schematic of a text and a template of slot/value pairs]

Slide 24: Paradigms: NLP/IE vs. IR/statistics

Slide 25: Toward the Final Answer...
  Problem: What if neither IR-like nor IE-like methods work?
  Solution: semantic analysis of the text (NLP), using adequate knowledge bases that support inference (AI).
    Mrs. Coolidge: "What did the preacher preach about?"
    Coolidge: "Sin."
    Mrs. Coolidge: "What did he say?"
    Coolidge: "He's against it."
  Sometimes counting and templates are insufficient, and then you need to do inference to understand.
  [Figure labels: Word counting; Inference]

Slide 26: The Optimal Solution...
  Combine strengths of both paradigms...
  ...use IE/NLP when you have suitable template(s),
  ...use IR when you don't...
  ...but how exactly to do it?

Slide 27: A Summarization Machine
  [Figure labels: DOC; QUERY; MULTIDOCS; EXTRACTS; ABSTRACTS; ?; Extract, Abstract; Indicative, Informative; Generic, Query-oriented; Background, Just the news; 10%, 50%, 100%; Headline, Very Brief, Brief, Long; CASE FRAMES, TEMPLATES, CORE CONCEPTS, CORE EVENTS, RELATIONSHIPS, CLAUSE FRAGMENTS, INDEX TERMS]

Slide 28: The Modules of the Summarization Machine
  [Figure labels: EXTRACTION; INTERPRETATION; GENERATION; FILTERING; DOC EXTRACTS; MULTIDOC EXTRACTS; EXTRACTS; ABSTRACTS; ?; CASE FRAMES, TEMPLATES, CORE CONCEPTS, CORE EVENTS, RELATIONSHIPS, CLAUSE FRAGMENTS, INDEX TERMS]

Slide 29: Table of contents
  1. Motivation.
  2. Genres and types of summaries.
  3. Approaches and paradigms.
  4. Summarization methods (& exercise).
       Topic Extraction.
       Interpretation.
       Generation.
  5. Evaluating summaries.
  6. The future.

Slide 30: Overview of Extraction Methods
  Position in the text
    lead method; optimal position policy
    title/heading method
  Cue phrases in sentences
  Word frequencies throughout the text
  Cohesion: links among words
    word co-occurrence
    coreference
    lexical chains
  Discourse structure of the text
  Information Extraction: parsing and analysis

Slide 31: Note
  The recall and precision figures reported here reflect the ability of various methods to match human performance on the task of identifying the sentences/clauses that are important in texts.
  They rely on evaluations using six corpora: (Edmundson, 68; Kupiec et al., 95; Teufel and Moens, 97; Marcu, 97; Jing et al., 98; SUMMAC, 98).

Slide 32: Position-Based Method (1)
  Claim: Important sentences occur at the beginning (and/or end) of texts.
  Lead method: just take first sentence(s)!
  Experiments:
    In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale, 58).
    Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).

Slide 33: Position-Based Method (2)
  (Edmundson, 68) 52% recall & precision in combination with title (25% lead baseline)
  (Kupiec et al., 95) 33% recall & precision (24% lead baseline)
  (Teufel and Moens, 97) 32% recall & precision (28% lead baseline)

  (Edmundson, 68) the best individual method
  (Kupiec et al., 95) the best individual method
  (Teufel and Moens, 97) increased performance by 10% when combined with the cue-based
    method
  [Labels: Individual contribution; Cumulative contribution]

Slide 34: Optimum Position Policy (OPP)
  Claim: Important sentences are located at positions that are genre-dependent; these positions can be determined automatically through training (Lin and Hovy, 97).
  Corpus: 13000 newspaper articles (ZIFF corpus).
  Step 1: For each article, determine overlap between sentences and the index terms for the article.
  Step 2: Determine a partial ordering over the locations where sentences containing important words occur: Optimal Position Policy (OPP).

Slide 35: OPP (cont.)
  OPP for ZIFF corpus: (T) > (P2,S1) > (P3,S1) > (P2,S2) > {(P4,S1), (P5,S1), (P3,S2)} > ...
    (T=title; P=paragraph; S=sentence)
  OPP for Wall Street Journal: (T) > (P1,S1) > ...
  Results: testing corpus of 2900 articles: Recall=35%, Precision=38%.
  Results: 10%-extracts cover 91% of the salient words.

Slide 36: Title-Based Method (1)
  Claim: Words in titles and headings are positively relevant to summarization.
  Shown to be statistically valid at 99% level of significance (Edmundson, 68).
  Empirically shown to be useful in summarization systems.

Slide 37: Title-Based Method (2)
  (Edmundson, 68) 40% recall & precision (25% lead baseline)
  (Teufel and Moens, 97) 21.7% recall & precision (28% lead baseline)

  (Edmundson, 68) increased performance by 8% when combined with the title- and cue-based methods.
  (Teufel and Moens, 97) increased performance by 3% when combined with cue-, location-, position-, and word-frequency-based methods.
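
The title-based claim above (sentences that share words with the title are more likely to be important) can be sketched in a few lines of Python. This is only a minimal illustration, not the scoring used by Edmundson or by Teufel and Moens: the tiny stopword list, the function names (`content_words`, `title_scores`, `extract`), the tie-breaking rule, and the shortened example sentences are all assumptions made for the sketch.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "for",
             "after", "is", "are", "was", "were"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def title_scores(title, sentences):
    """Score each sentence by the number of title content words it contains."""
    title_words = content_words(title)
    return [len(title_words & content_words(s)) for s in sentences]


def extract(title, sentences, k=1):
    """Return the k highest-scoring sentences, kept in document order.

    Ties are broken in favour of earlier sentences (an assumed, lead-style fallback).
    """
    scores = title_scores(title, sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in sorted(ranked[:k])]


# Illustrative input: the headline from the exercise text and three shortened sentences.
title = "90 Soldiers Arrested After Coup Attempt In Tribal Homeland"
sentences = [
    "About 90 soldiers have been arrested after a coup attempt in the tribal homeland.",
    "At least three soldiers and two civilians died in the uprising.",
    "He said the accused were to appear in court Monday.",
]

print(title_scores(title, sentences))
print(extract(title, sentences, k=1))
```

Counting shared word types keeps the sketch simple; the slides report that the title feature is most useful in combination with other features (position, cue phrases, word frequency), so in practice a score like this would be one term in a weighted sum rather than a ranking on its own.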