A Whirlwind Tour of Search Engine Design Issues
CMSC 498W, April 4, 2006
Douglas W. Oard

Information Access Problems
- Retrieval: stable collection, information need different each time
- Filtering: stable information need, collection different each time

Design Strategies
- Foster human-machine synergy
  - Exploit complementary strengths
  - Accommodate shared weaknesses
- Divide-and-conquer
  - Divide the task into stages with well-defined interfaces
  - Continue dividing until problems are easily solved
- Co-design related components
  - Iterative process of joint optimization

Human-Machine Synergy
- Machines are good at:
  - Doing simple things accurately and quickly
  - Scaling to larger collections in sublinear time
- People are better at:
  - Accurately recognizing what they are looking for
  - Evaluating intangibles such as "quality"
- Both are pretty bad at:
  - Mapping consistently between words and concepts

Supporting the Search Process
- [Figure: a search loop of source selection, query formulation, search, selection, query reformulation and relevance feedback, and source reselection]

Taylor's Model of Question Formation
- Q1: Visceral need
- Q2: Conscious need
- Q3: Formalized need
- Q4: Compromised need (the query)
- End-user search and intermediated search enter this progression at different points

Search Goal
- Choose the same documents a human would
  - Without human intervention (less work)
  - Faster than a human could (less time)
  - As accurately as possible (less accuracy)
- Humans start with an information need; machines start with a query
- Humans match documents to information needs; machines match document and query representations

Search Component Model
- [Figure: an information need is turned into a query by query formulation, then into a query representation by query processing; a document is turned into a document representation by document processing; a comparison function produces a retrieval status value, and human judgment relates relevance to utility]

Relevance
- Relevance relates a topic and a document
  - Duplicates are equally relevant, by definition
  - Constant over time and across users
- Pertinence relates a task and a document
  - Accounts for quality, complexity, language, ...
- Utility relates a user and a document
  - Accounts for prior knowledge
- We seek utility, but relevance is what we get!
Problems With Word Matching
- Word matching suffers from two problems
  - Synonymy: paper vs. article
  - Homonymy: bank (river) vs. bank (financial)
- Disambiguation in IR: seek to resolve homonymy
  - Index word senses rather than words
- Synonymy usually addressed by
  - Thesaurus-based query expansion
  - Latent semantic indexing

Stemming
- Stem: in IR, a word equivalence class that preserves the main concept
- Often obtained by affix-stripping (Porter, 1980)
  - destroy, destroyed, destruction: destr
- Inexpensive to compute
- Usually helps IR performance
- Can make mistakes! (over-/understemming)
  - centennial, century, center: cent
  - acquire, acquiring, acquired: acquir; acquisition: acquis

N-gram Indexing
- Consider a Chinese document c1 c2 c3 ... cn
- Don't segment (you could be wrong!)
- Instead, treat every character bigram as a term
  - c1c2, c2c3, c3c4, ..., cn-1cn
- Break up queries the same way
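The character-bigram idea can be sketched in a few lines of Python (a hypothetical helper, not from the slides):

```python
# Character-bigram term extraction for unsegmented text (e.g., Chinese).
# Every overlapping pair of characters becomes an indexing term, so no
# word segmenter is needed; queries are broken up the same way.
def char_bigrams(text):
    """Return every overlapping character bigram as an indexing term."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("abcd"))  # ['ab', 'bc', 'cd']
```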
"Bag of Terms" Representation
- Bag = a "set" that can contain duplicates
  - "The quick brown fox jumped over the lazy dog's back"
  - back, brown, dog, fox, jump, lazy, over, quick, the, the
- Vector = values recorded in any consistent order
  - back, brown, dog, fox, jump, lazy, over, quick, the
  - 1 1 1 1 1 1 1 1 2

Bag of Terms Example
- Document 1: "The quick brown fox jumped over the lazy dog's back."
- Document 2: "Now is the time for all good men to come to the aid of their party."
- After stemming and stopword removal (stopword list: the, is, for, to, of):

    Term    Doc 1  Doc 2
    quick     1      0
    brown     1      0
    fox       1      0
    over      1      0
    lazy      1      0
    dog       1      0
    back      1      0
    jump      1      0
    now       0      1
    time      0      1
    all       0      1
    good      0      1
    men       0      1
    come      0      1
    aid       0      1
    their     0      1
    party     0      1

Boolean IR
- Strong points
  - Accurate, if you know the right strategies
  - Efficient for the computer
- Weaknesses
  - Often results in too many documents, or none
  - Users must learn Boolean logic
  - Sometimes finds relationships that don't exist
  - Words can have many meanings
  - Choosing the right words is sometimes hard
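The bag-of-terms representation above can be sketched in Python; the stopword list below is the one implied by the example (the, is, for, to, of), and no stemming is applied (unlike the slide, which maps jumped to jump):

```python
from collections import Counter

STOPWORDS = {"the", "is", "for", "to", "of"}  # stopword list from the example

def bag_of_terms(text):
    """Lowercase, split, drop stopwords, and count duplicates (a bag)."""
    terms = [t.strip(".,").lower() for t in text.split()]
    return Counter(t for t in terms if t and t not in STOPWORDS)

d1 = bag_of_terms("The quick brown fox jumped over the lazy dog's back.")
d2 = bag_of_terms("Now is the time for all good men to come to the aid of their party.")
print(d1["quick"], d2["good"])  # 1 1
```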
Proximity Operators
- More precise versions of AND
  - "NEAR n" allows at most n-1 intervening terms
  - "WITH" requires terms to be adjacent and in order
- Easy to implement, but less efficient
  - Store a list of positions for each word in each doc
  - Stopwords become very important!
- Perform normal Boolean computations
  - Treat WITH and NEAR like AND with an extra constraint

Proximity Operator Example
- time AND come → Doc 2
- time (NEAR 2) come → empty
- quick (NEAR 2) fox → Doc 1
- quick WITH fox → empty
- Positional postings (count, then position in parentheses):

    Term    Doc 1    Doc 2
    quick   1 (2)    0
    brown   1 (3)    0
    fox     1 (4)    0
    jump    1 (5)    0
    over    1 (6)    0
    lazy    1 (8)    0
    dog     1 (9)    0
    back    1 (10)   0
    now     0        1 (1)
    time    0        1 (4)
    all     0        1 (6)
    good    0        1 (7)
    men     0        1 (8)
    come    0        1 (10)
    aid     0        1 (13)
    their   0        1 (15)
    party   0        1 (16)

Advantages of Ranked Retrieval
- Closer to the way people think
  - Some documents are better than others
- Enriches browsing behavior
  - Decide how far down the list to go as you read it
- Allows more flexible queries
  - Long and short queries can produce useful results
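A minimal sketch of the NEAR and WITH operators over positional postings, using the example's positions for quick, fox, time, and come (the function names and data layout are hypothetical):

```python
# Positional postings: term -> {doc_id: [positions]}.
index = {
    "quick": {1: [2]},
    "fox":   {1: [4]},
    "time":  {2: [4]},
    "come":  {2: [10]},
}

def near(index, a, b, n):
    """Docs where a and b occur with at most n-1 intervening terms (either order)."""
    hits = set()
    for doc in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(abs(pa - pb) <= n for pa in index[a][doc] for pb in index[b][doc]):
            hits.add(doc)
    return hits

def with_op(index, a, b):
    """Docs where a is immediately followed by b (adjacent, in order)."""
    hits = set()
    for doc in index.get(a, {}).keys() & index.get(b, {}).keys():
        if any(pb - pa == 1 for pa in index[a][doc] for pb in index[b][doc]):
            hits.add(doc)
    return hits

print(near(index, "quick", "fox", 2))   # {1}
print(with_op(index, "quick", "fox"))   # set()
```

This reproduces the slide's results: quick (NEAR 2) fox matches Doc 1, quick WITH fox matches nothing, and time (NEAR 2) come is empty because the positions are six apart.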
Ranked Retrieval Challenges
- "Best first" is easy to say but hard to do!
  - The best we can hope for is to approximate it
- Will the user understand the process?
  - It is hard to use a tool that you don't understand
- Efficiency becomes a concern
  - Only a problem for long queries, though

Similarity-Based Queries
- Treat the query as if it were a document
  - Create a query bag-of-words
- Find the similarity of each document
  - Using the coordination measure, for example
- Rank order the documents by similarity
  - Most similar to the query first
- Surprisingly, this works pretty well!
  - Especially for very short queries

Counting Terms
- Terms tell us about documents
  - If "rabbit" appears a lot, it may be about rabbits
- Documents tell us about terms
  - "the" is in every document, so it is not discriminating
- Documents are most likely described well by rare terms that occur in them frequently
  - Higher "term frequency" is stronger evidence
  - Low "collection frequency" makes it stronger still
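The coordination measure mentioned above (similarity = number of query terms a document shares) can be sketched as follows; the two toy documents are hypothetical:

```python
# Coordination measure: score each document by how many query terms it contains,
# then rank by that score. Documents are bags (term -> count); query is a set.
def coordination(query_terms, doc_bag):
    return sum(1 for t in query_terms if t in doc_bag)

docs = {
    1: {"quick": 1, "brown": 1, "fox": 1, "dog": 1},
    2: {"time": 1, "good": 1, "men": 1, "party": 1},
}
query = {"quick", "dog", "party"}
ranked = sorted(docs, key=lambda d: coordination(query, docs[d]), reverse=True)
print(ranked)  # [1, 2] -- doc 1 shares two query terms, doc 2 only one
```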
The Document Length Effect
- Humans look for documents with useful parts
  - But probabilities are computed for the whole
- Document lengths vary in many collections
  - So probability calculations could be inconsistent
- Two strategies
  - Adjust probability estimates for document length
  - Divide the documents into equal "passages"

Incorporating Term Frequency
- High term frequency is evidence of meaning
  - And high IDF is evidence of term importance
- Recompute the bag-of-words
  - Compute TF * IDF for every element

TF*IDF Example
- [Table: raw term counts and TF*IDF weights for eight terms (nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval) across four documents, with IDF = log10(N/DF)]
- Unweighted query: contaminated retrieval → Result: 2, 3, 1, 4
- Weighted query: contaminated(3) retrieval(1) → Result: 1, 3, 2, 4
- IDF-weighted query: contaminated retrieval → Result: 2, 3, 1, 4
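A minimal TF*IDF sketch; the four-document collection below is hypothetical (the slide's exact counts did not survive extraction), but the IDF follows the slide's log10(N/DF) form:

```python
import math

# TF*IDF weighting over a tiny collection of four documents.
docs = {
    1: {"contaminated": 4, "retrieval": 1},
    2: {"contaminated": 1, "retrieval": 3},
    3: {"retrieval": 2, "nuclear": 5},
    4: {"information": 2},
}
N = len(docs)

def idf(term):
    """Inverse document frequency: log10 of (collection size / document frequency)."""
    df = sum(1 for bag in docs.values() if term in bag)
    return math.log10(N / df)

def tfidf(doc_id, term):
    """Term frequency in the document, weighted by the term's IDF."""
    return docs[doc_id].get(term, 0) * idf(term)

print(round(tfidf(3, "nuclear"), 3))  # nuclear: TF=5, DF=1, IDF=log10(4)
```

Rare terms ("nuclear" appears in one document) get a large IDF, while a term in every document would get IDF = log10(1) = 0 and contribute nothing.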
Document Length Normalization
- Long documents have an unfair advantage
  - They use a lot of terms
    - So they get more matches than short documents
  - And they use the same words repeatedly
    - So they have much higher term frequencies
- Normalization seeks to remove these effects
  - Related somehow to maximum term frequency
  - But also sensitive to the number of terms

"Okapi" Term Weights
- [Formula figure: the Okapi weight is the product of a TF component and an IDF component]

Passage Retrieval
- Another approach to the long-document problem
  - Break it up into coherent units
- Recognizing topic boundaries is hard
  - But overlapping 300-word passages work fine
- Document rank is best passage rank
  - And passage information can help guide browsing

Summary
- Goal: find documents most similar to the query
- Compute normalized document term weights
  - Some combination of TF, DF, and length
- Optionally, get query term weights from the user
  - Estimate of term importance
- Compute inner product of query and document vectors
  - Multiply corresponding elements and then add

Another Perspective
- We ask "is this document relevant?"
  - Vector space: we answer "somewhat"
  - Probabilistic: we answer "probably"
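The Okapi weight figure above names only a TF component and an IDF component. A sketch of the standard Okapi BM25 weight, with the usual default constants k1 and b (these constants are not from the slide):

```python
import math

# Okapi BM25-style weight: a saturating, length-normalized TF component
# times an IDF component. k1 controls TF saturation; b controls how much
# the document's length (relative to the average) discounts TF.
def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    tf_part = tf / (tf + k1 * ((1 - b) + b * doc_len / avg_len))
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

# A long document needs more occurrences to earn the same TF credit:
print(bm25_weight(3, 10, 1000, 100, 100) > bm25_weight(3, 10, 1000, 400, 100))  # True
```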
Probability Ranking Principle
- Assume binary relevance and document independence
  - Each document is either relevant or it is not
  - Relevance of one doc reveals nothing about another
- Assume the searcher works down a ranked list
  - Seeking some number of relevant documents
- Theorem (provable from these assumptions): documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q)

Some Questions for Indexing
- How long will it take to find a document?
  - Is there any work we can do in advance?
  - If so, how long will that take?
- How big a computer will I need?
  - How much disk space? How much RAM?
- What if more documents arrive?
  - How much of the advance work must be repeated?
  - Will searching become slower?
  - How much more disk space will be needed?
The Indexing Process
- [Figure: a term-document incidence matrix for eight documents is inverted, term by term, into an inverted file: an index of terms (A, AI, AL, B, BA, BR, ..., TH, TI) whose entries point to postings lists of document numbers]

The Finished Product
- [Figure: the finished inverted file alone: each term's index entry points directly to its postings list; the incidence matrix is no longer needed]
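The indexing process above, reduced to an in-memory sketch (term → sorted postings list):

```python
from collections import defaultdict

# Building an inverted file: map each term to the sorted list of document
# numbers it occurs in. A minimal in-memory sketch of the structure on
# the slide; real systems build this on disk and compress the postings.
def build_inverted_index(docs):
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "quick brown fox", 2: "lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
print(index["quick"], index["dog"])  # [1, 3] [2, 3]
```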
How Big Is the Postings File?
- Very compact for Boolean retrieval
  - About 10% of the size of the documents
  - If an aggressive stopword list is used!
- Not much larger for ranked retrieval
  - Perhaps 20%
- Enormous for proximity operators
  - Sometimes larger than the documents!

Building an Inverted Index
- Simplest solution is a single sorted array
  - Fast lookup using binary search
  - But sorting large files on disk is very slow
  - And adding one document means starting over
- Tree structures allow easy insertion
  - But the worst-case lookup time is linear
- Balanced trees provide the best of both
  - Fast lookup and easy insertion
  - But they require about 45% more disk space

How Big Is the Inverted Index?
- Typically smaller than the postings file
  - Depends on the number of terms, not documents
- Eventually, most terms will already be indexed
  - But the postings file will continue to grow
- Postings dominate asymptotic space complexity
  - Linear in the number of documents
Index Compression
- CPUs are much faster than disks
  - A disk can transfer 1,000 bytes in 20 ms
  - The CPU can do 10 million instructions in that time
- Compressing the postings file is a big win
  - Trade decompression time for fewer disk reads
- Key idea: reduce redundancy
  - Trick 1: store relative offsets (some will be the same)
  - Trick 2: use an optimal coding scheme

Summary
- Slow indexing yields fast query processing
  - Key fact: most terms don't appear in most documents
- We use extra disk space to save query time
  - Index space is in addition to document space
  - Time and space complexity must be balanced
- Disk block reads are the critical resource
  - This makes index compression a big win

Problems with "Free Text" Search
- Homonymy
  - Terms may have many unrelated meanings
  - Polysemy (related meanings) is less of a problem
- Synonymy
  - Many ways of saying (nearly) the same thing
- Anaphora
  - Alternate ways of referring to the same thing
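The two compression tricks above can be sketched: store gaps between document numbers instead of absolute IDs, then spend fewer bytes on small values. Variable-byte coding is one common scheme (the slide says only "an optimal coding scheme"):

```python
# Trick 1: relative offsets. Frequent terms have small, repetitive gaps.
def gaps(postings):
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

# Trick 2: variable-byte coding. Each number is split into 7-bit chunks;
# the high bit marks the final byte of a number.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80            # mark the last (lowest-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

postings = [3, 7, 54, 60, 61]
print(gaps(postings))                     # [3, 4, 47, 6, 1]
print(len(vbyte_encode(gaps(postings))))  # 5 -- one byte per small gap
```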
Two Ways of Searching
- [Figure: the author writes the document using terms to convey meaning; the searcher can reach it through those free-text terms or through assigned concepts]

Applications
- When implied concepts must be captured
  - Political action, volunteerism, ...
- When terminology selection is impractical
  - Searching foreign-language materials
- When no words are present
  - Photos without captions, videos without transcripts, ...
- When user needs are easily anticipated
  - Weather reports, yellow pages, ...

Supervised Learning
- [Figure: a classifier is learned from labeled training examples]

Example: kNN Classifier
- [Figure: a new document takes the majority label of its k nearest training documents]

Challenges
- Changing concept inventories
  - Literary warrant and user needs are hard to predict
- Accurate concept indexing is expensive
  - Machines are inaccurate; humans are inconsistent
- Users and indexers may think differently
  - Diverse user populations add to the complexity
- Using thesauri effectively requires training
  - Meta-knowledge and thesaurus-specific expertise
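A minimal kNN classifier sketch, since the slide's figure did not survive extraction; the training vectors and labels below are hypothetical:

```python
from collections import Counter
import math

# kNN text classification: label a new document vector by majority vote
# among the k most similar training vectors (cosine similarity here).
def cosine(u, v):
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in terms)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, vec, k=3):
    nearest = sorted(train, key=lambda ex: cosine(ex[0], vec), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [({"nuclear": 2, "fallout": 1}, "physics"),
         ({"retrieval": 3, "index": 1}, "ir"),
         ({"query": 2, "index": 2}, "ir")]
print(knn_classify(train, {"index": 1, "query": 1}, k=3))  # ir
```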
Relevance Feedback
- [Figure: a feedback loop — make document vectors for new documents, compute similarity with a profile vector built from initial profile terms, let the user select, examine, and assign ratings, update the user model, and remake the profile vector from the rated vectors]

Rocchio Formula
- New profile = original profile + positive feedback vectors - negative feedback vectors
- Example:

    Original profile:   ( 0, 4, 0, 8, 0,  0)
    Positive feedback:  ( 1, 2, 4, 0, 0,  1)
    Negative feedback:  ( 2, 0, 1, 1, 0,  4)
    New profile:        (-1, 6, 3, 7, 0, -3)

- The feedback vectors may also be scaled, e.g. 2 x positive = (2, 4, 8, 0, 0, 2) and 4 x negative = (8, 0, 4, 4, 0, 16)

Implicit Feedback
- Observe user behavior to infer a set of ratings
  - Examine (reading time, scrolling behavior, ...)
  - Retain (bookmark, save, save & annotate, print, ...)
  - Refer to (reply, forward, include link, cut & paste, ...)
- Some measurements are directly useful
  - e.g., use reading time to predict reading time
- Others require some inference
  - Should you treat cut & paste as an endorsement?

Some Observable Behaviors
- [Table: observable behaviors organized by behavior category and minimum scope]
- Examples: read/ignored, saved/deleted, replied to (Stevens, 1993); reading time (Morita; Konstan et al., 1997); hypertext link (Brin & Page, 1998)
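The Rocchio update can be sketched directly, reproducing the unweighted example above:

```python
# Rocchio relevance feedback: new profile = old profile
# + beta * (sum of positively rated vectors)
# - gamma * (sum of negatively rated vectors).
def rocchio(profile, positives, negatives, beta=1, gamma=1):
    new = list(profile)
    for vec in positives:
        new = [p + beta * v for p, v in zip(new, vec)]
    for vec in negatives:
        new = [p - gamma * v for p, v in zip(new, vec)]
    return new

original = [0, 4, 0, 8, 0, 0]
positive = [1, 2, 4, 0, 0, 1]
negative = [2, 0, 1, 1, 0, 4]
print(rocchio(original, [positive], [negative]))  # [-1, 6, 3, 7, 0, -3]
```

Note that terms can go negative (here the first and last components), which effectively suppresses vocabulary the user rejected.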
Estimating Authority from Links
- [Figure: hub pages point to authority pages; incoming links confer authority]

Collecting Click Streams
- Browsing histories are easily captured
  - Make all links initially point to a central site
  - Encode the desired URL as a parameter
  - Build a time-annotated transition graph for each user
  - Cookies identify users (when they use the same machine)
  - Redirect the browser to the desired page
- Reading time is correlated with interest
  - Can be used to build individual profiles
  - Used to target advertising by ...
- [Figure: mean reading time in seconds, by rating from no interest to high interest, for full-text telecommunications articles]

Critical Issues
- Protecting privacy
  - What absolute assurances can we provide?
  - How can we make the remaining risks understood?
- Scalable rating servers
  - Is a fully distributed architecture practical?
- Non-cooperative users
  - How can the effect of spamming be limited?

Gaining Access to Observations
- Observe public behavior
  - Hypertext linking, publication, citing, ...
- Policy protection
  - EU: privacy laws
  - US: privacy policies + FTC enforcement
- Architectural assurance of privacy
  - Distributed architecture
  - Model and mitigate privacy risks

Adversarial IR
- Search is user-controlled suppression
  - Everything is known to the search system
  - Goal: avoid showing things the user doesn't want
- Other stakeholders have different goals
  - Authors risk little by wasting your time
  - Marketers hope for serendipitous interest
- Metadata from trusted sources is more reliable