Text Mining Techniques for Patent Analysis.ppt


Text Mining Techniques for Patent Analysis
Yuen-Hsien Tseng, National Taiwan Normal University, samtseng@ntnu.edu.tw

Yuen-Hsien Tseng, Yeong-Ming Wang, Yu-I Lin, Chi-Jen Lin and Dai-Wei Juang, “Patent Surrogate Extraction and Evaluation in the Context of Patent Mapping”, accepted for publication in Journal of Information Science, 2007 (SSCI, SCI)
Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin, “Text Mining Techniques for Patent Analysis”, to appear in Information Processing and Management, 2007 (SSCI, SCI, EI)

Outline
Introduction
A General Methodology
Technique Details
Technique Evaluation
Application Example
Discussions
Conclusions

Introduction: Why Patent Analysis?
Patent documents contain 90% of research results, valuable to the following communities:
  Industry
  Business
  Law
  Policy-making
If carefully analyzed, they can:
  reduce R&D time and cost by 60% and 40%, respectively
  show technological details and relations
  reveal business trends
  inspire novel industrial solutions
  help make investment policy

Introduction: Gov. Efforts
Patent analysis (PA) has received much attention since 2001
  Korea: to develop 120 patent maps in 5 years
  Japan: patent mapping competition in 2004
  Taiwan: more and more patent maps (PM) were created
    Example: “carbon nanotube” (CNT)
      5 experts dedicated more than 1 month
Asian countries such as China, Japan, Korea, Singapore, and Taiwan have invested various resources in patent analysis
PA requires a lot of human effort
  Assisting tools are in great need

Typical Patent Analysis Scenario
1. Task identification: define the scope, concepts, and purposes for the analysis task.
2. Searching: iteratively search, filter, and download related patents.
3. Segmentation: segment, clean, and normalize structured and unstructured parts.
4. Abstracting: analyze the patent content to summarize their claims, topics, functions, or technologies.
5. Clustering: group or classify analyzed patents based on some extracted attributes.
6. Visualization: create technology-effect matrices or topic maps.
7. Interpretation: predict technology or business trends and relations.

Technology-Effect Matrix
To make decisions about future technology development
  seeking chances in those sparse cells
To inspire novel solutions
  by understanding how patents are related, so as to learn how novel solutions were invented in the past and can be invented in the future
To predict business trends
  by showing the trend distribution of major competitors in this map

Part of the T-E matrix (from STIC) for “Carbon Nanotube”

Topic Map of Carbon Nanotube

Text Mining - Definition
Knowledge discovery is often regarded as a process to find implicit, previously unknown, and potentially useful patterns
  Data mining: from structured databases
  Text mining: from a large text repository
In practice, text mining (TM) involves a series of user interactions with the text mining tools to explore the repository to find such patterns.
After being supplemented with additional information and interpreted by experienced experts, these patterns can become important intelligence for decision-making.

Text Mining Process for Patent Analysis: A General Methodology
Document preprocessing
  Collection Creation
  Document Parsing and Segmentation
  Text Summarization
  Document Surrogate Selection
Indexing
  Keyword/Phrase extraction
  Morphological analysis
  Stop word filtering
  Term association and clustering
Topic Clustering
  Term selection
  Document clustering/categorization
  Cluster title generation
  Category mapping
Topic Mapping
  Trend map - Aggregation map
  Query map - Zooming map

Example: A US Patent Doc.
See Example or this URL: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=5,695,734.PN.&OS=PN/5,695,734&RS=PN/5,695,734

Download and Parsing into DBMS

NSC Patents
612 US patents whose assignee contains “National Science Council”, downloaded on 2005/06/15

Document Parsing and Segmentation
Data conversion
  Parsing unstructured texts and citations into structured fields in DBMS
Document segmentation
  Partition the full patent texts into 6 segments
    Abstract, application, task, summary, feature, claim
  Only 9 empty segments in 6*92=552 CNT patent segments = 1.63%
  Only 79 empty segments in 6*612=3672 NSC patent segments = 2.15%

NPR Parsing for Most-Frequently Cited Journals and Citation Age Distribution
Data are for 612 NSC patents

Automatic Summarization
Segment the doc. into paragraphs and sentences
Assess sentences, considering their
  Positions
  Clue words
  Title words
  Keywords
Select sentences
  Sort by the weights and select the top-k sentences.
Assemble the selected sentences
  Concatenate the sentences in their original order

Example: Auto-summarization, MS Word (blue) vs. Ours (red)

Evaluation of Each Segment
abs: the Abstract section of each patent
app: FIELD OF THE INVENTION
task: BACKGROUND OF THE INVENTION
sum: SUMMARY OF THE INVENTION
fea: DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
cla: Claims section of each patent
seg_ext: summaries from each of the sets: abs, app, task, sum, and fea
full: full texts from each of the sets: abs, app, task, sum, and fea

Evaluation Goal
Analyze a human-crafted patent map to see which segments have more important terms
Purposes (so as to):
  allow analysts to spot the relevant segments more quickly for classifying patents in the map
  provide insights to possibly improve automated clustering and/or categorization in creating the map

Evaluation Method
In the manual creation of a technology-effect matrix, it is helpful to be able to quickly spot the keywords that can be used for classifying the patents in the map.
Once the keywords or category features are found, patents can usually be classified without reading all the texts.
Thus, a segment or summary that retains as many important category features (ICFs) as possible is preferable.
Our evaluation design is therefore to reveal which segments contain the most such features compared to the others.

Patent Maps for Evaluation
All patent maps are from STPI

Empty segments in the six patent maps

Feature Selection
Well studied in machine learning
Best feature selection algorithms
  Chi-square, information gain, ...
But to select only a few features, correlation coefficient is better than chi-square
  co = 1 if FN = FP = 0 and TP ≠ 0 and TN ≠ 0

Best and worst terms by Chi-square and correlation coefficient
Data are from a small real-world collection of 116 documents with only two exclusive categories, construction vs. non-construction in civil engineering tasks

Some feature terms and their distribution in each set for the category FED in CNT
Note: The correlation coefficients in each segment correlate to the set counts of the ordered features: the larger the set count, the larger the correlation coefficient in each segment.

Occurrence distribution of 30 top-ranked terms in each set for some categories in CNT
M_Best_Term_Coverage(Segment, Category)=

Occurrence distribution of manually ranked terms in each set for some categories in CNT
R_Best_Term_Coverage(Segment, Category)=

Occurrence distribution of terms in each segment averaged over all categories in CNT
M_Best_Term_Coverage(Segment)=
R_Best_Term_Coverage(Segment)=

Maximum correlation coefficients in each set averaged over all categories in CNT
*: denotes those calculated from human-judged relevant terms

Term-covering rates for M best terms for the effect taxonomy in CNT

Term-covering rates for M best terms for the technology taxonomy in CNT

Term-covering rates for M best terms
QDF: Quantum Dot Fluorescein Detection
QDL: Quantum Dot LED

Term-covering rates for M best terms
QDO: Quantum-Dot Optical Sensor
NTD: Nano Titanium Dioxide
MCM: Molecular Motors

Findings
Most ICFs ranked by correlation coefficient occur in the “segment extracts”, the Abstract section, and the SUMMARY OF THE INVENTION section.
Most ICFs selected by humans occur in the Abstract section or the Claims section.
The “segment extracts” lead to more top-ranked ICFs than the “full texts”, regardless of whether the category features are selected manually or automatically.
The ICFs selected automatically have a higher capability of discriminating a document's categories than those selected manually, according to the correlation coefficient.

Implications
Text summarization techniques help in patent analysis and organization, either automatically or manually.
To determine a patent's category quickly based on only a few terms, one should first read the Abstract section and the SUMMARY OF THE INVENTION section
Or alternatively, one should first read the “segment extracts” prepared by a computer

Text Mining Process for Patent Analysis
Document preprocessing
  Collection Creation
  Document Parsing and Segmentation
  Text Summarization
  Document Surrogate Selection
Indexing
  Keyword/Phrase extraction
  Morphological analysis
  Stop word filtering
  Term association and clustering
Topic Clustering
  Term selection
  Document clustering/categorization
  Cluster title generation
  Category mapping
Topic Mapping
  Trend map - Aggregation map
  Query map - Zooming map

Ideal Indexing for Topic Identification
No processing may result in low recall;
More processing may have false drops.

Example: Extracted Keywords and Their Associated Terms
Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin, “Text Mining Techniques for Patent Analysis”, to appear in Information Processing and Management, 2007 (SSCI and SCI)
Yuen-Hsien Tseng, “Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text”, Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI and SCI)

Clustering Methods
Clustering is a powerful technique to detect topics and their relations in a collection.
Clustering techniques:
  HAC: Hierarchical Agglomerative Clustering
  K-means
  MDS: Multi-Dimensional Scaling
  SOM: Self-Organizing Map
Many open source packages are available
  Need to define the similarity to use them
Similarities
  Co-words: common words used between items
  Co-citations: common citations between items

Document Clustering
Effectiveness of clustering relies on
  how terms are selected
    Affects effectiveness most
    Automatic, manual, or hybrid
    Users have more confidence in the clustering results if terms are selected by themselves, but this is costly
    Manual verification of selected terms is recommended whenever it is possible
    Recent trend:
      Text clustering with extended user feedback, SIGIR 2006
      Near-duplicate detection by instance-level constrained clustering, SIGIR06
  how they are weighted
    Boolean or TFxIDF
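
As a rough illustration of the two weighting options just named (Boolean versus TFxIDF), here is a minimal Python sketch. The function names, the toy token lists, and the particular IDF formula log(N/df) are assumptions made for this example; the slides do not say which TFxIDF variant was used.

```python
import math
from collections import Counter

def boolean_weights(tokens):
    """Boolean weighting: 1.0 for every term present in the document."""
    return {term: 1.0 for term in set(tokens)}

def tfidf_weights(docs):
    """TFxIDF weighting with idf = log(N / df); one common variant among many."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs for term in set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Toy token lists standing in for indexed patent texts (hypothetical data).
docs = [
    ["carbon", "nanotube", "emitter", "nanotube"],
    ["carbon", "fiber", "composite"],
    ["quantum", "dot", "emitter"],
]
weights = tfidf_weights(docs)
```

Under this variant a term that occurs in every document gets weight 0, which is one reason smoothed IDF formulas are sometimes preferred.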

Effectiveness also relies on how similarities are measured
  Cosine, Dice, Jaccard, etc.
Direct HAC document clustering may be prohibitive due to its complexity

Term Clustering
Single terms are often ambiguous; a group of near-synonym terms can be more specific in topic
Goal: reduce the number of terms for ease of topic detection, concept identification, generation of classification hierarchy, or trend analysis
Term clustering followed by document categorization
  Allows large collections to be clustered
Methods:
  Keywords: maximally repeated words or phrases, extracted by a patented algorithm (Tseng, 2002)
  Related terms: keywords which often co-occur with other keywords, extracted by association mining (Tseng, 2002)
  Simset: a set of keywords having common related terms, extracted by term clustering

Multi-Stage Clustering
Single-stage clustering tends to yield a skewed distribution
Ideally, in multi-stage clustering, terms or documents can be clustered into concepts, which in turn can be clustered into topics or domains.
In practice, we need to browse the whole topic tree to find desired concepts or topics.
(Diagram: Terms or docs. -> Concepts -> Topics)

Cluster Descriptors Generation
One important step to help analysts interpret the clustering results is to generate a summary title or cluster descriptors for each cluster.
CC (Correlation Coefficient) is used
  But CC0.5 or CCxTFC yield better results
See Yuen-Hsien Tseng, Chi-Jen Lin, Hsiu-Han Chen and Yu-I Lin, “Toward Generic Title Generation for Clustered Documents,” Proceedings of Asia Information Retrieval Symposium, Oct. 16-18, Singapore, pp. 145-157, 2006. (Lecture Notes in Computer Science, Vol. 4182, SCI)

Mapping Cluster Descriptors to Categories
More generic title words cannot be generated automatically
  “Furniture” is a generic term for beds, chairs, tables, etc.
  But if there is no “furniture” in the documents, there is no way to yield “furniture” as a title word, unless additional knowledge resources, such as thesauri, are used
See also Tseng et al., AIRS 2006

Search WordNet for Cluster Class
Using an external resource to get cluster categories
For each of 352 (0.005) or 328 (0.001) simsets generated from 2714 terms
  Submit the simset heads to WordNet to get their hypernyms (upper-level hypernyms as categories)
  Accumulate occurrence of each of these categories
  Rank these categories by occurrence
  Select the top-ranked categories as candidates for topic analysis
These top-ranked categories still need manual filtering
Current results are not satisfying
  Need to try to search scientific literature databases which support topic-based search capability and which have the needed categories

Mapping Cluster Titles to Categories
Search Stanford's InfoMap
  http://infomap.stanford.edu/cgi-bin/semlab/infomap/classes/print_class.pl?args=$term1+$term2
Search WordNet directly
  Results similar to InfoMap
  Higher recall, lower precision than InfoMap
  Yields meaningful results only when terms are of high quality
Search google directory: http:/
  Often yields: your search did not match any documents.
  Or wrong category:
    Ex1: submit: “CMOS dynamic logics”, get: Computers > Programming > Languages > Directories
    Ex2: submit: “laser, wavelength, beam, optic, light”, get: Business > Electronics and Electrical > Optoelectronics and Fiber, Health > Occupational Health and Safety > Lasers
Searching WordNet yields better results, but they are still unacceptable
  D:\demo\File>perl -s wntool.pl
  =0.1816 : device%1
  =0.1433 : actinic_radiation%1 actinic_ray%1
  =0.1211 : signal%1 signaling%1 sign%3
  =0.0980 : orientation%2
  =0.0924 : vitality%1 verve%1

NSC Patents
612 US patents whose assignees are NSC
NSC sponsors most academic research
  Owns the patents resulting from the research
Documents in the collection are
  knowledge-diversified (cover many fields)
  long (2000 words on average)
  full of advanced technical details
Hard for any single analyst to analyze them
Motivates the need to generate generic titles

Text Mining from NSC Patents
Download NSC patents
  from USPTO with assignee=National Science Council
Automatic key-phrase extraction
  Terms that occur more than once can be extracted
Automatic segmentation and summarization
  20072 keywords from full texts vs 19343 keywords from 5 segment summarization
  The 5 segment abstracts contain more category-specific terms than full texts (Tseng, 2005)
Automatic index compilation
  Occurring frequency of each term in each document was recorded
  Recorded more than 500,000 terms (words, phrases, digits) among 612 documents in 72 seconds
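
The index compilation step above records how often each term occurs in each document. A minimal Python sketch of that idea follows; the tokenizer regex, the function name build_index, and the sample texts are assumptions for illustration only, and the key-phrase extraction of multi-word terms used in the original work is not reproduced here.

```python
import re
from collections import Counter, defaultdict

def build_index(documents):
    """Map each term to {doc_id: frequency} for the given documents."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Simplistic tokenizer (an assumption): lowercase runs of letters or digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for term, freq in counts.items():
            index[term][doc_id] = freq
    return dict(index)

# Hypothetical mini-collection keyed by made-up document ids.
documents = {
    "doc1": "Carbon nanotube field emitter with nanotube array",
    "doc2": "Method for growing a carbon nanotube",
}
index = build_index(documents)
```

Looking up index["nanotube"] then gives the per-document frequencies for that term, which is the kind of record a later weighting or clustering step could consume.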
