1、1,Automatic Text Classification,Yutaka Sasaki NaCTeM School of Computer Science,2007 Yutaka Sasaki, University of Manchester,2,Introduction,2007 Yutaka Sasaki, University of Manchester,3,Introduction,Text Classification is the task: to classify documents into predefined classes Text Classification i
2、s also called Text Categorization Document Classification Document Categorization Two approaches manual classification and automatic classification,2007 Yutaka Sasaki, University of Manchester,4,Relevant technologies,Text Clustering Create clusters of documents without any external information Infor
3、mation Retrieval (IR) Retrieve a set of documents relevant to a query Information Filtering Filter out irrelevant documents through interactions Information Extraction (IE) Extract fragments of information, e.g., person names, dates, and places, in documentsText Classification No query, interactions
4、, external information Decide topics of documents,2007 Yutaka Sasaki, University of Manchester,5,Examples of relevant technologies,2007 Yutaka Sasaki, University of Manchester,web documents,6,Example of clustering,web documents,2007 Yutaka Sasaki, University of Manchester,7,Examples of information r
5、etrieval,x,web documents,2007 Yutaka Sasaki, University of Manchester,8,Examples of information filtering,web documents,2007 Yutaka Sasaki, University of Manchester,9,Examples of information extraction,web documents about accidents,Date: 04/12/03 Place: London Type: traffic Casualty: 5,Key informati
6、on on accidents,2007 Yutaka Sasaki, University of Manchester,10,Examples of text classification,web documents,2007 Yutaka Sasaki, University of Manchester,sports,economics,11,Text Classification Applications,E-mail spam filtering Categorize newspaper articles and newswires into topics Organize Web p
7、ages into hierarchical categories Sort journals and abstracts by subject categories (e.g., MEDLINE, etc.) Assigning international clinical codes to patient clinical records,2007 Yutaka Sasaki, University of Manchester,12,Simple text classification example,You want to classify documents into 4 classe
8、s:economics, sports, science, life. There are two approaches that you can take: rule-based approach write a set of rules that classify documents machine learning-based approach using a set of sample documents that are classified into the classes (training data), automatically create classifiers base
9、d on the training data,2007 Yutaka Sasaki, University of Manchester,13,Comparison of Two Approaches (1),Rule-based classificationPros: very accurate when rules are written by experts classification criteria can be easily controlled when the number of rules are small.Cons: sometimes, rules conflicts
10、each other maintenance of rules becomes more difficult as the number of rules increases the rules have to be reconstructed when a target domain changes low coverage because of a wide variety of expressions,2007 Yutaka Sasaki, University of Manchester,14,Comparison of Two Approaches (2),Machine Learn
11、ing-based approachPros: domain independent high predictive performanceCons: not accountable for classification results training data required,2007 Yutaka Sasaki, University of Manchester,15,Formal Definition,Given: A set of documents D = d1, d2, dm A fixed set of topics T = t1, t2, tn Determine: The
12、 topic of d: t(d) T, where t(x) is a classification function whose domain is D and whose range is T.,2007 Yutaka Sasaki, University of Manchester,16,Rule-based approach,Example: Classify documents into sports “ball” must be a word that is frequently used in sports Rule 1: “ball” d t(d) = sports But
13、there are other meanings of “ball” Def.2-1 : a large formal gathering for social dancing (WEBSTER) Rule 2: “ball” d & “dance” d t(d) = sports Def.2-2 : a very pleasant experience : a good time (WEBSTER) Rule 3: “ball” d & “dance” d & “game” d &“play” d t(d) = sportsNatural language has a rich variet
14、y of expressions:e.g., “Many people have a ball when they play a bingo game.”,2007 Yutaka Sasaki, University of Manchester,17,Machine Learning Approach,Prepare a set of training data Attach topic information to the documents in a target domain.Create a classifier (model) Apply a Machine Learning too
15、l to the data Support Vector Machine (SVM), Maximum Entropy Models (MEM) Classify new documents by the classifier,sports,science,life,classifier,sports,science,life,classifier,life,sports,Training data,2007 Yutaka Sasaki, University of Manchester,18,Closer look at Machine Learning-based approach,f1,
16、f2,f3,f4,game,play,ball,dance,Classifier c(|),c(sports|x),document d,c(science|x),c(economics|x),c(y|x),x=(f1, f2, f3, f4),features,feature extraction,feature vector (input vector),c(life|x),Select the best classification result,2007 Yutaka Sasaki, University of Manchester,19,Rule-based vs. Machine
17、Learning-based Creecy at al., 1992,Data: US Census Bureau Decennial Census 1990 22 million natural language responses 232 industry categories and 504 occupation categories It costs about $15 million if fully done by hand Define classification rules manually: Expert System AIOCS Development time: 192
18、 person-months (2 people, 8 years) Accuracy = 57%(industry), 37%(occupation) Learn classification function Machine Learning-based System PACE Development time: 4 person-months Accuracy = 63%(industry), 57%(occupation),2007 Yutaka Sasaki, University of Manchester,20,Evaluation,2007 Yutaka Sasaki, Uni
19、versity of Manchester,21,Common Evaluation Metrics,Accuracy Precision Recall F-measure harmonic mean of recall and precision micro-average F1 global calculation of F1 regardless of topics macro-average F1: average on F1 scores of all the topics,2007 Yutaka Sasaki, University of Manchester,22,Accurac
20、y,The rate of correctly predicted topics,systems prediction,correct answer,true positive,false positive (Type I error,false alarm),false negative (Type II error,missed alarm),(TP),(FP),(FN),Accuracy =,TP + TN TP + FP + FN + TN,true negative,(TN),2007 Yutaka Sasaki, University of Manchester,23,Accura
21、cy,Example: classify docs into spam or not spam,Accuracy = = = 0.4,TP+TN TP+FP+FN+TN,d1,d2,d3,Y,Y,N,systems prediction correct answer,N,Y,Y,TP FP FN TN1 1111,d4,N,1+1 1+2+1+1,N,d5,N,Y,2007 Yutaka Sasaki, University of Manchester,24,Issue in Accuracy,When a certain topic (e.g., not-spam) is a majorit
22、y, the accuracy easily reaches a high percentage.,Accuracy = = = 0.99,TP+TN TP+FP+FN+TN,d1,N,N,systems prediction correct answer,Y,Y,TP FP FN TN11990,d11- d1000,990 1000,NN,d10, ,N N,2007 Yutaka Sasaki, University of Manchester,25,Precision,The rate of correctly predicted topics,systems prediction,c
23、orrect answer,true positive,false positive (Type I error,false alarm),false negative (Type II error,missed alarm),(TP),(FP),(FN),Precision =,TP TP + FP,true negative,(TN),2007 Yutaka Sasaki, University of Manchester,26,Precision,Example: classify docs into spam or not spam,Precision = = = 0.333,TP T
24、P+FP,d1,d2,d3,Y,Y,N,systems prediction correct answer,N,Y,Y,TP FP FN TN1 1111,d4,N,1 1+2,N,d5,N,Y,2007 Yutaka Sasaki, University of Manchester,27,Issue in Precision,When a system outputs only confident topics, the precision easily reaches a high percentage.,Accuracy = = = 1,TP TP+FP,d1,N,N,systems p
25、rediction correct answer,Y,N,TP FP FN TN111,1 1,Y,d999,(Y or N), ,Y,d1000,2007 Yutaka Sasaki, University of Manchester,28,Recall,The rate of correctly predicted topics,systems prediction,correct answer,true positive,false positive (Type I error,false alarm),false negative (Type II error,missed alarm
26、),(TP),(FP),(FN),Precision =,TP TP + FN,true negative,(TN),2007 Yutaka Sasaki, University of Manchester,29,Recall,Example: classify docs into spam or not spam,Precision = = = 0.5,TP TP+FN,d1,d2,d3,Y,Y,N,systems prediction correct answer,N,Y,Y,TP FP FN TN1 1111,d4,N,1 1+1,N,d5,N,Y,2007 Yutaka Sasaki,
27、 University of Manchester,30,Issue in Recall,When a system outputs loosely, the recall easily reaches a high percentage.,Accuracy = = = 1,TP TP+FN,d1,Y,Y,systems prediction correct answer,Y,N,TP FP FN TN111,n n,Y,d999,(Y or N), ,Y,d1000,2007 Yutaka Sasaki, University of Manchester,31,F-measure,Harmo
28、nic mean of recall and precisionSince there is a trade-off between recall and precision, F-measure is widely used to evaluate text classification system. Micro-average F1: Global calculation of F1 regardless of topics Macro-evarage F1: Average on F1 scores of all topics,2 Precision RecallPrecision +
29、 Recall,2007 Yutaka Sasaki, University of Manchester,32,F-measure,Example: classify docs into spam or not spam,F = = = 0.4,2RecallPrecision Recall+Precision,d1,d2,d3,Y,Y,N,systems prediction correct answer,N,Y,Y,TP FP FN TN1 1111,d4,N,21/31/2 1/3 + 1/2,N,d5,N,Y,2007 Yutaka Sasaki, University of Manc
30、hester,33,Summary: Evaluation Metrics,Accuracy PrecisionRecallF-measureMicro F1: Global average of F1 regardless of topics Macro F1: Average on F1 scores of all topics Cost-Sensitive Accuracy Measure (*) Multi-Topic Accuracy (*),TP (# systems correct predictions) TP+FP (# systems outputs),TP (# syst
31、ems correct predictions) TP+FN (# correct answers),2 * Recall * Precision Recall + Precision,2007 Yutaka Sasaki, University of Manchester,34,Feature Extraction: from Text to Data,2007 Yutaka Sasaki, University of Manchester,35,Basic Approach (1),Bag-of-Word approach a document is regarded as a set o
32、f words regardless of the word order and grammar.,The brown fox jumps over the lazy dog.,brown,fox,jumps,over,the,lazy,dog,The,2007 Yutaka Sasaki, University of Manchester,36,Basic Approach (2),Bi-grams, tri-grams, n-grams Extract all of two, three, or n words in a row in the text,The brown fox jump
33、s over the lazy dog.,Bi-grams:the brown,brown fox,fox jumps,jumps over,the lazy,lazy dog,Tri-grams:the brown fox,brown fox jumps,fox jumps over,jumps over the,the lazy dog,2007 Yutaka Sasaki, University of Manchester,37,Basic Approach (3),Normalization Convert words into a normalized forms down-case
34、, e.g, The the, NF-kappa B nf-kappa b lemmatization: to basic forms, e.g., jumps jump stemming: mechanically remove/change suffixes e.g., yi, s , “the brown fox jump over the lazi dog.” the Porters Stemmer is widely used. Stop-word removal ignore predefined common words, e.g., the, a, to, with, that
35、 the SMART Stop List is widely used,2007 Yutaka Sasaki, University of Manchester,38,From Symbols to Numeric,Term occurrence: occur (1) or not-occur (0) Term Frequency tfi = the number of times where word/n-gram wi appears in a document. Inverse document frequency the inverted rate of documents that
36、contain word/n-gram wi against a whole set of documentsidfi = | D | / | d | wi d D |. tf-idftf-idfi = tfi idfi frequent words that appear only in a small number of documents achieve high value.,2007 Yutaka Sasaki, University of Manchester,39,Create Feature Vectors,a an brown, dog fox jump lazi over
37、the,( 0, 0,0, 1, ,0,0, 1,0,0, 1,0,0,1, 0,0,1,0,0,1, 2, 0, ),1. enumerate all word/n-grams in a whole set of documents 2. remove duplications and sort the words/n-grams 3. convert each word into its value, e.g., tf, idf, or tf-idf. 4. create a vector whose i-th value is the value of i-th term,The bro
38、wn fox jumps over the lazy dog.,Generally, feature vectors are very sparse, i.e., most of the values are 0.,feature vector with tf weights:,2007 Yutaka Sasaki, University of Manchester,40,Multi-Topic Text Classification,2007 Yutaka Sasaki, University of Manchester,41,Multi-topic Text Classification,
39、ship The Panama Canal Commission, a U.S. government agency, said in its daily operations report that there was a backlog of 39 ships waiting to enter the canal early today.,crude Diamond Shamrock Corp said that effective today it had cut its contract prices for crude oil by 1.50 dlrs a barrel.,crude
40、:ship The port of Philadelphia was closed when a Cypriot oil tanker, Seapride II, ran aground after hitting a 200-foot tower supporting power lines across the river, a Coast Guard spokesman said.,(Excerpt from Ruters-21578),One single document belongs to multiple topics An interesting and important
41、research theme that is not nicely solved yet.,Topic A&B is not always a mixture of A and B,2007 Yutaka Sasaki, University of Manchester,42,A View on Multi-topic Text Classification,Open Topic Assumption (OTA) (conventional view) A document has multiple topics The topics other than the given topics a
42、re neutral. Closed Topic Assumption (CTA) A document has multiple topics The other topics are considered to be explicitly excluded. E.g., if there exist three topics A,B,C and a text d is given the topic A, then this assignment is regarded that d belongs to A but does not belong to B and C.,B,A,C,A,
43、A but neither B nor C,CTA,OTA,A,2007 Yutaka Sasaki, University of Manchester,43,2007 Yutaka Sasaki, University of Manchester,44,Case Studies,2007 Yutaka Sasaki, University of Manchester,45,Experiments,Objective compare the performance of approaches based on Closed Topic Assumption and Open Topic Ass
44、umption. Data 1 (Clinical records) Training: about 986 documents Test: 984 documents Data 2 (Reuters newswires) Training: 9,603 documents Test: 3,299 documents Machine Learning methods SVM: Support Vector Machines MEM: Maximum Entropy Models Approaches BC: Binary Class Classification MC: Multi Class
45、 Classification,2007 Yutaka Sasaki, University of Manchester,46,Classification of Clinical Records,Medical NLP Challenge (Computational Medicine Centre) Classify anonymized real clinical records into International Clinical Codes (ICD-9-CM) 44 research institutes participated Sample Record: # Clinica
46、l History This is a patient with meningomyelocele and neurogenic bladder. # Impression Normal renal ultrasound in a patient with neurogenic bladder.Correct codes (possibly multiple codes): 596.54 (Neurogenic bladder NOS) 741.90 (Without mention of hydrocephalus),2007 Yutaka Sasaki, University of Man
47、chester,47,Document,Predicted codes (multi-topics),Correct codes,Top 5 Candidates,Significance of each feature,2007 Yutaka Sasaki, University of Manchester,48,Evaluation Metrics,AC: multi-labelling accuracy Cost-Sensitive Accuracy Measure (for clinical data) PrecisionRecallF1Micro F1: Global calculation of F1 regardless of topics Macro F1: Average on F1 scores of all topics,# system correct labeling # system output,# system correct labeling # correct labeling,