Format: PPT · Pages: 88 · Size: 586 KB
A Whirlwind Tour of Search Engine Design Issues.ppt

A Whirlwind Tour of Search Engine Design Issues
CMSC 498W, April 4, 2006
Douglas W. Oard

Information Access Problems
- Retrieval: the collection is stable; the information need is different each time.
- Filtering: the information need is stable; the collection is different each time.

Design Strategies
- Foster human-machine synergy: exploit complementary strengths; accommodate shared weaknesses.
- Divide-and-conquer: divide the task into stages with well-defined interfaces, and continue dividing until the problems are easily solved.
- Co-design related components: an iterative process of joint optimization.

Human-Machine Synergy
- Machines are good at doing simple things accurately and quickly, and at scaling to larger collections in sublinear time.
- People are better at accurately recognizing what they are looking for, and at evaluating intangibles such as "quality".
- Both are pretty bad at mapping consistently between words and concepts.

Supporting the Search Process
[Diagram, two slides: the search process, including source selection and a "choose" step, with loops for query reformulation and relevance feedback and for source reselection; the remaining stage labels did not survive extraction.]

Taylor's Model of Question Formation
- Q1: visceral need
- Q2: conscious need
- Q3: formalized need
- Q4: compromised need (the query)
The model distinguishes end-user search from intermediated search.

Search Goal
- Choose the same documents a human would: without human intervention (less work), faster than a human could (less time), and as accurately as possible (less accuracy).
- Humans start with an information need; machines start with a query.
- Humans match documents to information needs; machines match document and query representations.

Search Component Model
[Diagram: an information need passes through query formulation to a query, and a representation function (query processing) yields the query representation; a document's representation function (document processing) yields the document representation; a comparison function produces a retrieval status value, while human judgment of the document determines utility.]

Relevance
- Relevance relates a topic and a document

- Duplicates are equally relevant, by definition; constant over time and across users.
- Pertinence relates a task and a document: accounts for quality, complexity, language, …
- Utility relates a user and a document: accounts for prior knowledge, …
- We seek utility, but relevance is what we get!

Problems With Word Matching
- Word matching suffers from two problems. Synonymy: paper vs. article. Homonymy: bank (river) vs. bank (financial).
- Disambiguation in IR seeks to resolve homonymy: index word senses rather than words.
- Synonymy is usually addressed by thesaurus-based query expansion or latent semantic indexing.

Stemming
- Stem: in IR, a word equivalence class that preserves the main concept, often obtained by affix-stripping (Porter, 1980): destroy, destroyed, destruction → destr.
- Inexpensive to compute, and usually helps IR performance.
- Can make mistakes (over-/understemming): centennial, century, center → cent; acquire, acquiring, acquired → acquir, but acquisition → acquis.

N-gram Indexing
- Consider a Chinese document c1 c2 c3 … cn.
- Don't segment it (you could be wrong!).
- Instead, treat every character bigram as a term: c1c2, c2c3, c3c4, …, cn-1cn.
- Break up queries the same way.

"Bag of Terms" Representation
- Bag = a "set" that can contain duplicates: "The quick brown fox jumped over the lazy dog's back" → back, brown, dog, fox, jump, lazy, over, quick, the, the.
- Vector = the same values recorded in any consistent order: back, brown, dog, fox, jump, lazy, over, quick, the → 1 1 1 1 1 1 1 1 2.

Bag of Terms Example
- Document 1: "The quick brown fox jumped over the lazy dog's back."
- Document 2: "Now is the time for all good men to come to the aid of their party."
- [Table: term-by-document incidence over the terms the, quick, brown, fox, over, lazy, dog, back, now, is, time, for, all, good, men, to, come, jump, aid, of, their, party, with a stopword-list column.]

Boolean IR
- Strong points: accurate, if you know the right strategies; efficient for the computer.
- Weaknesses: often results in too many documents, or none; users must learn Boolean logic; sometimes finds relationships that don't exist; words can have many meanings; choosing the right words is sometimes hard.

Proximity Operators
- More precise versions of AND: "NEAR n" allows at most n-1 intervening terms; "WITH" requires terms to be adjacent and in order.
- Easy to implement, but less efficient: store a list of positions for each word in each document (stopwords become very important!).
- Perform normal Boolean computations, treating WITH and NEAR like AND with an extra constraint.

Proximity Operator Example
- time AND come → Doc 2
- time (NEAR 2) come → empty
- quick (NEAR 2) fox → Doc 1
- quick WITH fox → empty
- [Table: term positions within each document, e.g. 1 (9) for one occurrence at position 9.]

Advantages of Ranked Retrieval
- Closer to the way people think: some documents are better than others.
- Enriches browsing behavior: decide how far down the list to go as you read it.
- Allows more flexible queries: long and short queries can produce useful results.

Ranked Retrieval Challenges
- "Best first" is easy to say but hard to do! The best we can hope for is to approximate it.
- Will the user understand the process? It is hard to use a tool that you don't understand.
- Efficiency becomes a concern, though only for long queries.

Similarity-Based Queries
- Treat the query as if it were a document: create a query bag-of-words.
- Find the similarity of each document, using the coordination measure, for example.
- Rank order the documents by similarity, most similar to the query first.
- Surprisingly, this works pretty well, especially for very short queries.

Counting Terms
- Terms tell us about documents: if "rabbit" appears a lot, the document may be about rabbits.
- Documents tell us about terms: "the" is in every document, so it is not discriminating.
- Documents are most likely described well by rare terms that occur in them frequently: higher "term frequency" is stronger evidence, and low "collection frequency" makes it stronger still.

The Document Length Effect
- Humans look for documents with useful parts, but probabilities are computed for the whole.
- Document lengths vary in many collections, so probability calculations could be inconsistent.
- Two strategies: adjust probability estimates for document length, or divide the documents into equal "passages".

Incorporating Term Frequency
- High term frequency is evidence of meaning, and high IDF is evidence of term importance.
- Recompute the bag-of-words: compute TF * IDF for every element.

TF*IDF Example
- A four-document collection with terms including nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval.
- Unweighted query "contaminated retrieval": result 2, 3, 1, 4.
- Weighted query "contaminated(3) retrieval(1)": result 1, 3, 2, 4.
- IDF-weighted query "contaminated retrieval": result 2, 3, 1, 4.
- [Table: the term-by-document TF and TF*IDF weights did not survive extraction.]

Document Length Normalization
- Long documents have an unfair advantage: they use a lot of terms, so they get more matches than short documents; and they use the same words repeatedly, so they have much higher term frequencies.
- Normalization seeks to remove these effects: related somehow to maximum term frequency, but also sensitive to the number of terms.

"Okapi" Term Weights
[Formula: the Okapi weight combines a TF component with an IDF component; the equation itself did not survive extraction.]

Passage Retrieval

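Before turning to passage retrieval: the TF*IDF weighting just described can be sketched in code. The document set below is hypothetical (the slide's own example table did not survive extraction), but the IDF form log10(N/df) matches the IDF values the slide shows (e.g. log10(4/2) = 0.301 for a term in two of four documents).

```python
from collections import Counter
from math import log10

def tf_idf_vectors(docs):
    """TF*IDF bag-of-terms vectors for a small collection.

    TF is the raw term frequency in a document; IDF is log10(N / df),
    the form consistent with the slides' example values.
    """
    n = len(docs)
    bags = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()                      # document frequency of each term
    for bag in bags:
        df.update(bag.keys())
    return [{term: tf * log10(n / df[term]) for term, tf in bag.items()}
            for bag in bags]

# Hypothetical four-document collection using the slide's vocabulary:
docs = [
    "contaminated fallout contaminated siberia",
    "information retrieval information",
    "nuclear fallout siberia contaminated",
    "interesting complicated information retrieval",
]
vectors = tf_idf_vectors(docs)
# a term with tf = 2 in one of two containing documents (df = 2)
# gets weight 2 * log10(4/2) = 0.602
```

Note how a term appearing in every document gets IDF = log10(1) = 0, which is exactly the "not discriminating" behavior described for "the".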
- Another approach to the long-document problem: break the document up into coherent units.
- Recognizing topic boundaries is hard, but overlapping 300-word passages work fine.
- Document rank is best passage rank, and passage information can help guide browsing.

Summary
- Goal: find the documents most similar to the query.
- Compute normalized document term weights: some combination of TF, DF, and length.
- Optionally, get query term weights from the user: an estimate of term importance.
- Compute the inner product of the query and document vectors: multiply corresponding elements and then add.

Another Perspective
- We ask "is this document relevant?" Vector space: we answer "somewhat". Probabilistic: we answer "probably".

Probability Ranking Principle
- Assume binary relevance and document independence: each document is either relevant or it is not, and the relevance of one document reveals nothing about another.
- Assume the searcher works down a ranked list, seeking some number of relevant documents.
- Theorem (provable from the assumptions): documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q).

Some Questions for Indexing
- How long will it take to find a document? Is there any work we can do in advance? If so, how long will that take?
- How big a computer will I need? How much disk space? How much RAM?
- What if more documents arrive? How much of the advance work must be repeated? Will searching become slower? How much more disk space will be needed?

The Indexing Process
[Table: a term-by-document incidence matrix for eight documents over terms such as quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party, alongside the inverted file it produces: each term maps to a postings list of document numbers, e.g. 2, 4, 6, 8.]

The Finished Product
[Table: the same term list with only the inverted file and its postings lists; the incidence matrix is discarded.]

How Big Is the Postings File?

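Before sizing the postings file: the indexing process above can be sketched as a small inverted-file builder. This is a minimal sketch (the slides' example uses eight documents; here, just the two from the bag-of-terms example, with no stopword removal or stemming).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of document numbers.

    Document numbers start at 1, as in the slides' postings examples.
    """
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# The two documents from the bag-of-terms example:
docs = [
    "the quick brown fox jumped over the lazy dogs back",
    "now is the time for all good men to come to the aid of their party",
]
index = build_inverted_index(docs)
# index["the"] is [1, 2]; index["fox"] is [1]
```

Boolean AND then reduces to intersecting two sorted postings lists, which is why the postings file, not the term dictionary, dominates query-time work.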
- Very compact for Boolean retrieval: about 10% of the size of the documents, if an aggressive stopword list is used!
- Not much larger for ranked retrieval: perhaps 20%.
- Enormous for proximity operators: sometimes larger than the documents!

Building an Inverted Index
- The simplest solution is a single sorted array: fast lookup using binary search, but sorting large files on disk is very slow, and adding one document means starting over.
- Tree structures allow easy insertion, but the worst-case lookup time is linear.
- Balanced trees provide the best of both, fast lookup and easy insertion, but they require 45% more disk space.

How Big Is the Inverted Index?
- Typically smaller than the postings file: it depends on the number of terms, not documents.
- Eventually, most terms will already be indexed, but the postings file will continue to grow.
- Postings dominate asymptotic space complexity: linear in the number of documents.

Index Compression
- CPUs are much faster than disks: a disk can transfer 1,000 bytes in 20 ms, and the CPU can do 10 million instructions in that time.
- Compressing the postings file is a big win: trade decompression time for fewer disk reads.
- Key idea: reduce redundancy. Trick 1: store relative offsets (some will be the same). Trick 2: use an optimal coding scheme.

Summary
- Slow indexing yields fast query processing. Key fact: most terms don't appear in most documents.
- We use extra disk space to save query time: index space is in addition to document space, and time and space complexity must be balanced.
- Disk block reads are the critical resource: this makes index compression a big win.

Problems with "Free Text" Search
- Homonymy: terms may have many unrelated meanings; polysemy (related meanings) is less of a problem.
- Synonymy: many ways of saying (nearly) the same thing.
- Anaphora: alternate ways of referring to the same thing.

Two Ways of Searching
[Diagram fragment: "Write the document using terms to convey meaning" (the author's side); the rest of the figure did not survive extraction.]

Applications
- When implied concepts must be captured: political action, volunteerism, …
- When terminology selection is impractical: searching foreign-language materials.
- When no words are present: photos without captions, videos without transcripts, …
- When user needs are easily anticipated: weather reports, yellow pages, …

Supervised Learning
- Example: kNN classifier. [Diagram did not survive extraction.]

Challenges
- Changing concept inventories: literary warrant and user needs are hard to predict.
- Accurate concept indexing is expensive: machines are inaccurate, and humans are inconsistent.
- Users and indexers may think differently: diverse user populations add to the complexity.
- Using thesauri effectively requires training: meta-knowledge and thesaurus-specific expertise.

Relevance Feedback
[Diagram: a feedback loop, in which the system makes document vectors and a profile vector, computes similarity, the user selects and examines documents, the user assigns ratings, and the system updates the user model.]

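Stepping back to the Index Compression slide: trick 1, storing relative offsets, can be sketched as follows. This is a minimal illustration (the slides do not give code); the gains come when a variable-length, frequency-optimal code (trick 2) is then applied to the small, repetitive gaps.

```python
def to_gaps(postings):
    """Trick 1: store relative offsets (gaps) between document numbers.

    Small gaps dominate and many repeat, so an optimal variable-length
    code over gaps (trick 2) compresses the postings file well.
    """
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Invert the transform: a running sum restores document numbers."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

# The postings list [1, 3, 5, 7] becomes gaps [1, 2, 2, 2]
```

Decompression is a single running sum, which is why trading decompression time for fewer disk block reads is such a good bargain.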
[Diagram data labels: new documents, document vectors, rank order, document ratings, initial profile terms, and updated profile vector(s).]

Rocchio Formula
- Original profile: (0, 4, 0, 8, 0, 0)
- Positive feedback: (1, 2, 4, 0, 0, 1)
- Negative feedback: (2, 0, 1, 1, 0, 4)
- New profile: (-1, 6, 3, 7, 0, -3)
- Weighted contributions shown on the slide: (+) (2, 4, 8, 0, 0, 2) and (-) (8, 0, 4, 4, 0, 16)

Implicit Feedback
- Observe user behavior to infer a set of ratings: examine (reading time, scrolling behavior, …), retain (bookmark, save, save & annotate, print, …), refer to (reply, forward, include link, cut & paste, …).
- Some measurements are directly useful: e.g., use reading time to predict reading time.
- Others require some inference: should you treat cut & paste as an endorsement?

Some Observable Behaviors
[Table: behavior categories by minimum scope.] Some examples: read/ignored, saved/deleted, replied to (Stevens, 1993); reading time (Morita, Konstan et al., 1997); hypertext link (Brin & Page, 1998).

Estimating Authority from Links
[Diagram: a hub page pointing to authority pages.]

Collecting Click Streams
- Browsing histories are easily captured: make all links initially point to a central site, encode the desired URL as a parameter, build a time-annotated transition graph for each user (cookies identify users when they use the same machine), then redirect the browser to the desired page.
- Reading time is correlated with interest: it can be used to build individual profiles, and is used to target advertising by …

[Figure: reading time in seconds (scale 0 to 180) by rating (no, low, moderate, or high interest) for full-text articles (telecommunications); the plotted values include 50, 32, 58, and 43 seconds.]

Critical Issues
- Protecting privacy: what absolute assurances can we provide? How can we make the remaining risks understood?
- Scalable rating servers: is a fully distributed architecture practical?
- Non-cooperative users: how can the effect of spamming be limited?

Gaining Access to Observations
- Observe public behavior: hypertext linking, publication, citing, …
- Policy protection: EU privacy laws; US privacy policies plus FTC enforcement.
- Architectural assurance of privacy: a distributed architecture; model and mitigate privacy risks.

Adversarial IR
- Search is user-controlled suppression: everything is known to the search system, and the goal is to avoid showing things the user doesn't want.
- Other stakeholders have different goals: authors risk little by wasting your time, and marketers hope for serendipitous interest.
- Metadata from trusted sources is more reliable.

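The Rocchio update behind the profile example above can be sketched in a few lines. The parameter names alpha, beta, and gamma are assumptions (standard Rocchio weights, not named on the slides); with all three set to 1, the sketch reproduces the slide's new profile from its original, positive, and negative vectors.

```python
def rocchio(profile, positive, negative, alpha=1.0, beta=1.0, gamma=1.0):
    """New profile = alpha*profile + beta*positive - gamma*negative.

    `positive` and `negative` stand for the (summed) vectors of
    positively and negatively rated documents; alpha, beta, and gamma
    are tunable weights (assumed names, not from the slides).
    """
    return [alpha * p + beta * q - gamma * r
            for p, q, r in zip(profile, positive, negative)]

# The slide's example vectors, with alpha = beta = gamma = 1:
# (0,4,0,8,0,0) + (1,2,4,0,0,1) - (2,0,1,1,0,4) = (-1,6,3,7,0,-3)
new_profile = rocchio([0, 4, 0, 8, 0, 0],
                      [1, 2, 4, 0, 0, 1],
                      [2, 0, 1, 1, 0, 4])
```

Negative components in the new profile (here -1 and -3) actively push the profile away from terms the user rated down, which is the point of the gamma term.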