Text and Web Search

Text Databases and IR
- Text databases (document databases): large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- Information retrieval (IR): a field developed in parallel with database systems.
  - Information is organized into (a large number of) documents.
  - Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.

Information Retrieval
- Typical IR systems: online library catalogs, online document management systems.
- Information retrieval vs. database systems:
  - some DB problems are not present in IR, e.g., updates, transaction management, complex objects;
  - some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search using keywords and relevance.

Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.

Information Retrieval Techniques
- Index term (attribute) selection: stop list, word stemming.
- Index term weighting methods.
- Term-document frequency matrices.
- Information retrieval models: Boolean model, vector model, probabilistic model.

Problem - Motivation
- Given a database of documents, find the documents containing "data" and "retrieval".
- Applications: the Web, law and patent offices, digital libraries, information filtering.
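A minimal sketch of precision and recall as defined above, using made-up document-id sets (the ids and the query are illustrative, not from the slides):

    # Precision and recall for one query, computed over sets of document ids.
    retrieved = {1, 2, 3, 4, 5}            # documents returned by the system (hypothetical)
    relevant  = {2, 3, 5, 8, 9, 10}        # documents actually relevant to the query (hypothetical)

    hits = retrieved & relevant            # relevant documents that were also retrieved
    precision = len(hits) / len(retrieved) # 3 / 5 = 0.60
    recall    = len(hits) / len(relevant)  # 3 / 6 = 0.50
    print(f"precision={precision:.2f}, recall={recall:.2f}")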

Problem - Motivation
- Types of queries:
  - Boolean (data AND retrieval AND NOT ...);
  - additional features (data ADJACENT retrieval);
  - keyword queries (data, retrieval).
- How to search a large collection of documents?

Full-text scanning
- For a single term, the naive method slides the pattern over the text one position at a time: O(N*M) comparisons for a text of length N and a pattern of length M.
  [Figure: the pattern CAB compared against the text ABRACADABRA at successive positions.]
- Knuth, Morris and Pratt ('77): build a small finite-state automaton from the pattern; visit every text letter only once, by carefully shifting more than one step at a time.
- Boyer and Moore ('77): preprocess the pattern; compare it from right to left and skip ahead.
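A sketch of the Knuth-Morris-Pratt idea mentioned above: precompute, for every pattern prefix, how far the pattern can shift on a mismatch so that no text character is re-examined. The pattern/text pair reuses the slide's CAB / ABRACADABRA example; the helper function itself is an illustration, not from the slides.

    def kmp_search(text, pattern):
        """Return all positions where pattern occurs in text, scanning the text once."""
        # failure[i] = length of the longest proper prefix of pattern[:i+1] that is also a suffix
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k

        matches, k = [], 0
        for i, ch in enumerate(text):          # each text letter is visited once
            while k > 0 and ch != pattern[k]:
                k = failure[k - 1]             # shift the pattern by more than one step
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                matches.append(i - k + 1)
                k = failure[k - 1]
        return matches

    print(kmp_search("ABRACADABRA", "CAB"))    # []  -- CAB does not occur
    print(kmp_search("ABRACADABRA", "ABRA"))   # [0, 7]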

Text - Detailed Outline
- text problem
- full-text scanning
- inversion (inverted files)
- signature files
- clustering
- information filtering and LSI

Text - Inverted Files
- Q: space overhead?
- A: mainly the postings lists.
- How to organize the dictionary? B-trees, hashing, tries, PATRICIA trees, ...
- Stemming (Y/N)? Keep only the root of each word, e.g., inverted, inversion -> invert.
- Insertions?
- Postings lists follow a Zipf distribution, e.g., the rank-frequency plot of the Bible:
  [Figure: log(freq) vs. log(rank); approximately freq ~ 1 / (rank * ln(1.78 * V)), where V is the vocabulary size.]
- Postings lists: Cutting+Pedersen (keep the first 4 entries in the B-tree leaves); how to allocate space: Faloutsos+ '92, geometric progression; compression (Elias codes): Zobel+, down to 2% overhead!
- Conclusions: inverted files need space overhead (2%-300%), but they are the fastest option.
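A minimal inverted-file sketch under the ideas above: a dictionary (here a plain Python dict; the slides discuss B-trees, hashing and tries) mapping each term to its postings list of document ids. The three toy documents are made up for illustration.

    from collections import defaultdict

    docs = {                                   # hypothetical toy collection
        1: "data retrieval and data mining",
        2: "query processing in database systems",
        3: "information retrieval systems",
    }

    index = defaultdict(list)                  # dictionary: term -> postings list (sorted doc ids)
    for doc_id, text in sorted(docs.items()):
        for term in set(text.split()):         # one posting per (term, document) pair
            index[term].append(doc_id)

    def and_query(*terms):
        """Documents containing all query terms: intersect the postings lists."""
        postings = [set(index.get(t, [])) for t in terms]
        return sorted(set.intersection(*postings)) if postings else []

    print(index["retrieval"])                  # [1, 3]
    print(and_query("data", "retrieval"))      # [1]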

Vector Space Model and Clustering
- Keyword (free-text) queries (vs. Boolean):
  - each document -> a vector (HOW?);
  - each query -> a vector;
  - search for similar vectors.
- Main idea: each document is a vector of size d, where d is the number of different terms in the database (the vocabulary size).
  [Figure: a document containing "...data..." is indexed into a vector with one dimension per term, from "aaron" to "zoo".]

Document Vectors
- Documents are represented as "bags of words", and as vectors when used computationally.
- A vector is like an array of floating-point values; it has direction and magnitude.
- Each vector holds a place for every term in the collection; therefore, most vectors are sparse.

Document Vectors - one location for each word
  [Table: rows = documents A-I, columns = term frequencies for nova, galaxy, heat, h'wood, film, role, diet, fur; a blank cell means 0 occurrences.]
- "Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A.
- "Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.

We Can Plot the Vectors
  [Figure: documents plotted along the axes "star" and "diet"; a document about astronomy, one about movie stars, and one about mammal behavior land in different regions.]

Vector Space Model and Clustering
- Then, group nearby vectors together.
  - Q1: cluster search?
  - Q2: cluster generation?
- Two significant contributions: ranked output and relevance feedback.
- Cluster search: visit the (k) closest superclusters; continue recursively.
  [Figure: clusters of CS technical reports and MD technical reports.]
- Ranked output: easy!
- Relevance feedback (a brilliant idea) [Rocchio '73]. How? A: by adding the good vectors and subtracting the bad ones (see the sketch below).

Cluster Generation
- Problem: given N points in V dimensions, group them (typically k-means or AGNES is used).
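A sketch of Rocchio-style relevance feedback as described above ("add the good vectors, subtract the bad ones"); the alpha/beta/gamma weights and the toy vectors are illustrative assumptions, not values from the slides.

    import numpy as np

    def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query vector toward relevant documents and away from non-relevant ones."""
        q = alpha * np.asarray(query, dtype=float)
        if len(relevant):
            q += beta * np.mean(relevant, axis=0)       # add the "good" vectors
        if len(non_relevant):
            q -= gamma * np.mean(non_relevant, axis=0)  # subtract the "bad" vectors
        return q

    query        = [0.4, 0.8]                           # hypothetical 2-term vectors
    relevant     = np.array([[0.2, 0.7], [0.3, 0.9]])
    non_relevant = np.array([[0.9, 0.1]])
    print(rocchio(query, relevant, non_relevant))       # query shifted toward the relevant docs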

Assigning Weights to Terms
- Options: binary weights, raw term frequency, tf x idf.
- Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights
- Only the presence (1) or absence (0) of a term is recorded in the vector.

Raw Term Weights
- The frequency of occurrence of the term in each document is recorded in the vector.

Assigning Weights
- The tf x idf measure combines term frequency (tf) and inverse document frequency (idf) - a way to deal with the problems of the Zipf distribution.
- Goal: assign a tf x idf weight to each term in each document.

tf x idf
- A common form: w(i,j) = tf(i,j) * log( N / df(j) ), where tf(i,j) is the frequency of term j in document i, N is the number of documents in the collection, and df(j) is the number of documents containing term j.

Inverse Document Frequency
- IDF provides high values for rare words and low values for common words.
- For a collection of 10,000 documents (base-10 logs): a term appearing in all 10,000 documents gets idf = log(10000/10000) = 0, while a term appearing in only 1 document gets idf = log(10000/1) = 4.

Similarity Measures for Document Vectors
- Simple matching (coordination-level match), Dice's coefficient, Jaccard's coefficient, cosine coefficient, overlap coefficient.

tf x idf Normalization
- Normalize the term weights (so longer documents are not unfairly given more weight); "normalize" usually means forcing all values into a certain range, usually between 0 and 1, inclusive.
- Vector space similarity: use the weights to compare the documents.
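A sketch of tf x idf weighting over the same hypothetical three-document collection used in the inverted-file sketch, using w = tf * log10(N / df) as one common form of the weight (the documents and the choice of log base are assumptions):

    import math

    docs = {
        1: "data retrieval and data mining",
        2: "query processing in database systems",
        3: "information retrieval systems",
    }
    N = len(docs)

    # document frequency: in how many documents does each term appear?
    df = {}
    for text in docs.values():
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1

    # tf x idf weight for each (document, term) pair
    weights = {
        doc_id: {t: text.split().count(t) * math.log10(N / df[t]) for t in set(text.split())}
        for doc_id, text in docs.items()
    }
    print(round(weights[1]["data"], 3))       # 2 * log10(3/1) ~= 0.954 (rare and frequent in doc 1)
    print(round(weights[1]["retrieval"], 3))  # 1 * log10(3/2) ~= 0.176 (appears in 2 of 3 docs)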

Computing Similarity Scores
  [Figure: two document vectors and a query vector plotted in a two-term space, with their similarity scores.]

Vector Space with Term Weights and Cosine Matching
- D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit), Q = (q_1, w_q1; q_2, w_q2; ...; q_t, w_qt).
- Cosine similarity: sim(Q, D_i) = sum_j( w_qj * w_dij ) / ( sqrt(sum_j w_qj^2) * sqrt(sum_j w_dij^2) ).
- Example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
  [Figure: Q, D1 and D2 plotted against the Term A / Term B axes; D2 is closer in angle to Q than D1.]
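Working through the cosine-matching example above with the slide's vectors Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7) (a small sketch; the helper function is not from the slides):

    import math

    def cosine(a, b):
        """Cosine of the angle between two term-weight vectors."""
        dot  = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
    print(round(cosine(Q, D1), 3))   # ~0.733
    print(round(cosine(Q, D2), 3))   # ~0.983  -> D2 matches the query better than D1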

Text - Detailed Outline
- Text databases: problem
- full-text scanning
- inversion
- signature files (a.k.a. Bloom filters)
- vector model and clustering
- information filtering and LSI

Information Filtering + LSI
- [Foltz+ '92] Goal: users specify their interests (= keywords); the system alerts them about suitable news documents.
- Major contribution: LSI = Latent Semantic Indexing; latent (hidden) concepts.
- Main idea: map each document into some concepts, and map each term into some concepts.
- Concept: a set of terms, with weights, e.g., "data" (0.8), "system" (0.5), "retrieval" (0.6) -> the DBMS concept.
- Pictorially: the term-document matrix (BEFORE) is replaced by a concept-document matrix and a concept-term matrix.
  [Figures: term-document matrix; concept-document matrix; concept-term matrix.]
- Q: How to search, e.g., for "system"? A: find the corresponding concept(s), and then the corresponding documents.
- Thus LSI works like an (automatically constructed) thesaurus: we may retrieve documents that DON'T contain the term "system", but contain almost everything else ("data", "retrieval").

SVD
- LSI: find the concepts.

SVD - Definition
- A[n x m] = U[n x r] L[r x r] (V[m x r])^T
- A: n x m matrix (e.g., n documents, m terms);
- U: n x r matrix (n documents, r concepts);
- L: r x r diagonal matrix (strength of each concept), where r is the rank of the matrix;
- V: m x r matrix (m terms, r concepts).

SVD - Example
- A = U L V^T, for a small term-document matrix over the terms data, inf., retrieval, brain, lung and two groups of documents (CS and MD).
  [Figure: the example decomposition, built up over several slides. U is the document-to-concept similarity matrix, with a CS-concept column and an MD-concept column; the diagonal of L gives the strength of the CS-concept and the MD-concept; V^T is the term-to-concept similarity matrix.]

SVD for LSI
- Documents, terms and concepts:
  - U: document-to-concept similarity matrix;
  - V: term-to-concept similarity matrix;
  - L: its diagonal elements give the strength of each concept.
- Need to keep all the eigenvectors? NO - just keep the first k (concepts).
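A small LSI sketch under the ideas above: build a term-document matrix for a CS-like and an MD-like group of documents, take the SVD with numpy, and keep only the first k = 2 concepts. The counts are made up; only the terms and the two groups come from the slides.

    import numpy as np

    # rows = documents (3 CS-like, 2 MD-like); columns = terms: data, inf., retrieval, brain, lung
    A = np.array([[1, 1, 1, 0, 0],      # hypothetical counts
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 2, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                # keep only the first k concepts
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    print(np.round(s_k, 2))              # strength of each concept (diagonal of L)
    print(np.round(U_k, 2))              # document-to-concept similarity
    print(np.round(Vt_k.T, 2))           # term-to-concept similarity (V)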

Web Search
- What about web search?
  - First you need to get all the documents of the web: crawlers.
  - Then you have to index them (inverted files, etc.).
  - Find the web pages that are relevant to the query.
  - Report the pages, with their links, in sorted order.
- Main difference from classic IR: web pages have links; it may be possible to exploit the link structure for ranking the relevant documents.

Kleinberg's Algorithm (HITS)
- Main idea: in many cases, when you search the web for some terms, the most relevant pages may not contain those terms (or contain them only a few times). Example: Harvard -> www.harvard.edu; search engines -> yahoo, google, altavista.
- Authorities and hubs.
- Problem definition: given the web and a query, find the most authoritative web pages for this query.
- Step 0: find all pages containing the query terms (the root set).
- Step 1: expand by one move forward and backward (the base set).
  [Figure: the root set expanded with pages that point to it and pages it points to.]
- On the resulting graph:
  - give a high authority score to nodes that many important nodes point to;
  - give a high hub score to nodes that point to good authorities.
  [Figure: hubs pointing to authorities.]
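A sketch of Steps 0-1 above (root set -> base set) on a toy link graph; the graph, the page names, and the choice of root set are assumptions for illustration.

    # toy web graph: page -> pages it links to (hypothetical)
    links = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["d"],
        "d": ["a"],
        "e": ["c"],
    }
    root = {"b", "c"}                      # Step 0: pages containing the query terms

    # Step 1: expand by one move forward (out-links) and backward (in-links)
    base = set(root)
    for p in root:
        base.update(links.get(p, []))                              # forward
        base.update(q for q, outs in links.items() if p in outs)   # backward
    print(sorted(base))                    # ['a', 'b', 'c', 'd', 'e']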

Kleinberg's Algorithm
- Observations: this is a recursive definition! Each node (say, the i-th node) has both an authoritativeness score a_i and a hubness score h_i.
- Let E be the set of edges and A the adjacency matrix: entry (i,j) is 1 if the edge from i to j exists. Let h and a be n x 1 vectors holding the hubness and authoritativeness scores. Then:
  - a_i = h_k + h_l + h_m, i.e., a_i = sum of h_j over all j such that the edge (j,i) exists; in matrix form, a = A^T h.
    [Figure: nodes k, l, m pointing to node i.]
  - symmetrically, for the hubness: h_i = a_n + a_p + a_q, i.e., h_i = sum of a_j over all j such that the edge (i,j) exists; in matrix form, h = A a.
    [Figure: node i pointing to nodes n, p, q.]
- In conclusion, we want vectors h and a such that h = A a and a = A^T h.
- Start with a and h set to all 1s, then apply the following trick:
  h = A a = A (A^T h) = (A A^T) h = (A A^T)^2 h = ... = (A A^T)^k h, and similarly a = (A^T A)^k a.
- In short, the solutions of h = A a and a = A^T h are the left and right singular vectors of the adjacency matrix A (equivalently, eigenvectors of A A^T and A^T A). Starting from a random a and iterating, we will eventually converge.
- Q: to which of all the eigenvectors, and why? A: to the ones of the strongest eigenvalue, because of the property (A^T A)^k v -> (constant) * v1, where v1 is the eigenvector of the strongest eigenvalue.
- So we can find the a and h vectors, and the pages with the highest a values are reported!
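A sketch of the HITS iteration derived above (a = A^T h, h = A a, with normalization at each step so the scores stay bounded); the toy adjacency matrix is an assumption, not from the slides.

    import numpy as np

    def hits(A, iterations=50):
        """Power iteration for hub (h) and authority (a) scores: a = A^T h, h = A a."""
        n = A.shape[0]
        a = np.ones(n)
        h = np.ones(n)
        for _ in range(iterations):
            a = A.T @ h
            h = A @ a
            a /= np.linalg.norm(a)       # normalize so the vectors do not blow up
            h /= np.linalg.norm(h)
        return a, h

    # toy graph: A[i, j] = 1 means page i links to page j (hypothetical)
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
    a, h = hits(A)
    print(np.round(a, 3))   # node 2 gets the highest authority score (most hubs point to it)
    print(np.round(h, 3))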

Kleinberg's Algorithm - Results
- E.g., for the query "java", the top authority scores were 0.328, 0.251 and 0.190 (the last being "the java developer" page). [The page URLs did not survive extraction.]

Kleinberg's Algorithm - Discussion
- The authority score can be used to find pages similar to a page p.
- Closely related to citation analysis and to social networks / small-world phenomena.

Google / PageRank Algorithm
- Closely related idea: the Web is a directed graph of connected nodes; imagine a particle randomly moving along the edges (*) and compute its steady-state probabilities. That gives the PageRank of each page (the importance of this page).
- (*) with occasional random jumps.

PageRank Definition
- Assume a page A and pages T1, T2, ..., Tm that point to A. Let d be a damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:
  PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tm)/C(Tm) )

Google / PageRank Algorithm
- Compute the PR of each page - an identical problem: given a Markov chain, compute the steady-state probabilities p1 ... p5.
  [Figure: a 5-node example graph, nodes 1-5.]

Computing PageRank
- Iterative procedure. Equivalently: navigate the web by randomly following out-links, or with some probability d jump to a random page. Let A be the adjacency matrix (n x n) and c_i the out-degree of page i; then
  Prob(i -> j) = d * (1/n) + (1 - d) * (1/c_i) * A_ij,
  and the modified transition matrix has (i,j) entry equal to Prob(i -> j).

Google / PageRank Algorithm
- Let A be the transition matrix (= the adjacency matrix, row-normalized: the sum of each row is 1).
  [Figure: the 5-node example and its row-normalized transition matrix.]
- At the steady state, A^T p = p (equivalently p^T A = p^T).
- Thus p is the eigenvector corresponding to the highest eigenvalue (= 1, since the matrix is row-normalized).
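A sketch of the iterative PageRank computation described above (power iteration with damping factor d, scaled so the probabilities sum to 1); the 5-page link structure is a stand-in, since the slide's example graph did not survive extraction.

    import numpy as np

    def pagerank(adj, d=0.85, iterations=100):
        """Power iteration: p = (1-d)/n + d * A_norm^T p, with A_norm the row-normalized adjacency."""
        n = adj.shape[0]
        A_norm = adj / adj.sum(axis=1, keepdims=True)   # row-normalize (assumes every page has an out-link)
        p = np.full(n, 1.0 / n)
        for _ in range(iterations):
            p = (1 - d) / n + d * (A_norm.T @ p)
        return p

    # hypothetical 5-page graph: adj[i, j] = 1 means page i links to page j
    adj = np.array([[0, 1, 1, 0, 0],
                    [0, 0, 1, 0, 0],
                    [1, 0, 0, 1, 0],
                    [0, 0, 0, 0, 1],
                    [0, 0, 1, 0, 0]], dtype=float)
    print(np.round(pagerank(adj), 3))   # page 2, with the most in-links, ranks highest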

Kleinberg / Google - Conclusions
- SVD helps in graph analysis:
  - hub/authority scores: the strongest left and right singular vectors of the adjacency matrix;
  - random walk on a graph: the steady-state probabilities are given by the strongest eigenvector of the transition matrix.

References
- Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl. World Wide Web Conf.
