The Search Engine Architecture
CSCI 572: Information Retrieval and Search Engines, Summer 2010

Outline
- Introduction
- Google
- The PageRank algorithm
- The Google architecture
  - Architectural components
  - Architectural interconnections
  - Architectural data structures
- Evaluation of Google
- Summary

Problems with search engines circa the last decade
- Human maintenance
  - Subjective
  - Example: ranking hits based on $
- Automated search engines
  - Quality of results
  - Neglect to take the user's context into account
- Searching process
  - High-quality results aren't always at the top of the list

The Typical Search Engine Process
- In what stages is the most time spent?

How to scale to modern times?
- Currently
  - Efficient index
  - Petabyte-scale storage space
  - Efficient crawling
  - Cost-effectiveness of hardware
- Future
  - Qualitative context
  - Maintaining localization data
  - Perhaps send indexing to clients: client computers help gather Google's index in a distributed, decentralized fashion?

Google
- The whole idea is to keep up with the growth of the web
- Design goals:
  - Remove junk results
  - Scalable document indices
  - Use of link structure to improve quality filtering
  - Use as an academic digital library
  - Provide search engine datasets
  - Search engine infrastructure and evolution

Google
- Archival of information
  - Use of compression
  - Efficient data structures
  - Proprietary file system
- Leverage of usage data
- PageRank algorithm
  - Sort of a "lineage" of a source of information
  - Citation graph

PageRank Algorithm
- A numerical method to calculate a page's importance; this approach might well be followed by people doing research
- PageRank of a page A, with damping factor d:

  PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  - where PR(x) = the PageRank of page x
  - where C(x) = the number of outgoing links from page x
  - where T1...Tn is the set of pages with incoming links to page A
- It's actually a bit more complicated than it first looks; for instance, what's PR(T1), and PR(T2), and so on?

PageRank Algorithm
- An excellent explanation: http:/
- Since the PR(A) equation is a probability distribution over all web pages linking to web page A, and because of the (1 - d) term and the d * (PR...) term, the PageRanks of all the web pages on the web will sum to 1

PageRank: Example
- So, where do you start?
- It turns out that you can effectively "guess" what the PageRanks for the web pages are initially
  - In our example, guess 0 for all of the pages
- Then you run the PR function to calculate PR for all the web pages iteratively
- You do this until the PageRanks for each web page stop changing from one iteration to the next: they "settle down"

PageRank: Example
- Below is the iterative calculation that we would run:

  PR(a) = 1 - $damp + $damp * PR(c);
  PR(b) = 1 - $damp + $damp * (PR(a)/2);
  PR(c) = 1 - $damp + $damp * (PR(a)/2 + PR(b) + PR(d));
  PR(d) = 1 - $damp;
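The `$damp` notation above is Perl-flavored pseudocode. The same calculation can be sketched in Python (my sketch, not the slide's code); the slide never states the damping factor, but the conventional value $damp = 0.85 with pages updated in place, in the order a, b, c, d, reproduces the iteration trace shown on the following slides:

```python
# A minimal sketch of the iterative PageRank calculation above, assuming
# the conventional damping factor $damp = 0.85 (an assumption: the slide
# does not state it, but it matches the printed iteration values).
def pagerank_example(damp=0.85, iters=50):
    pr = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}  # initial guess: all zeros
    for _ in range(iters):
        # Update in place, in the order the four assignments are written,
        # so each line sees the freshest values of the others.
        pr["a"] = 1 - damp + damp * pr["c"]
        pr["b"] = 1 - damp + damp * (pr["a"] / 2)
        pr["c"] = 1 - damp + damp * (pr["a"] / 2 + pr["b"] + pr["d"])
        pr["d"] = 1 - damp
    return pr

ranks = pagerank_example()
print({k: round(v, 5) for k, v in ranks.items()})
# converges to a: 1.49011, b: 0.78330, c: 1.57660, d: 0.15000
print("Average pagerank =", round(sum(ranks.values()) / len(ranks), 4))
```

Fifty iterations is more than enough: as the trace below shows, the values have effectively settled after about forty.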

PageRank Algorithm: First 18 iterations

  a: 0.00000  b: 0.00000  c: 0.00000  d: 0.00000
  a: 0.15000  b: 0.21375  c: 0.39544  d: 0.15000
  a: 0.48612  b: 0.35660  c: 0.78721  d: 0.15000
  a: 0.81913  b: 0.49813  c: 1.04904  d: 0.15000
  a: 1.04169  b: 0.59272  c: 1.22403  d: 0.15000
  a: 1.19042  b: 0.65593  c: 1.34097  d: 0.15000
  a: 1.28982  b: 0.69818  c: 1.41912  d: 0.15000
  a: 1.35626  b: 0.72641  c: 1.47136  d: 0.15000
  a: 1.40065  b: 0.74528  c: 1.50626  d: 0.15000
  a: 1.43032  b: 0.75789  c: 1.52959  d: 0.15000
  a: 1.45015  b: 0.76632  c: 1.54518  d: 0.15000
  a: 1.46341  b: 0.77195  c: 1.55560  d: 0.15000
  a: 1.47226  b: 0.77571  c: 1.56257  d: 0.15000
  a: 1.47818  b: 0.77823  c: 1.56722  d: 0.15000
  a: 1.48214  b: 0.77991  c: 1.57033  d: 0.15000
  a: 1.48478  b: 0.78103  c: 1.57241  d: 0.15000
  a: 1.48655  b: 0.78178  c: 1.57380  d: 0.15000
  a: 1.48773  b: 0.78228  c: 1.57473  d: 0.15000

PageRank: next 13 iterations

  a: 1.48852  b: 0.78262  c: 1.57535  d: 0.15000
  a: 1.48904  b: 0.78284  c: 1.57576  d: 0.15000
  a: 1.48940  b: 0.78299  c: 1.57604  d: 0.15000
  a: 1.48963  b: 0.78309  c: 1.57622  d: 0.15000
  a: 1.48979  b: 0.78316  c: 1.57635  d: 0.15000
  a: 1.48990  b: 0.78321  c: 1.57643  d: 0.15000
  a: 1.48997  b: 0.78324  c: 1.57649  d: 0.15000
  a: 1.49001  b: 0.78326  c: 1.57652  d: 0.15000
  a: 1.49004  b: 0.78327  c: 1.57655  d: 0.15000
  a: 1.49007  b: 0.78328  c: 1.57656  d: 0.15000
  a: 1.49008  b: 0.78328  c: 1.57657  d: 0.15000
  a: 1.49009  b: 0.78329  c: 1.57658  d: 0.15000
  a: 1.49009  b: 0.78329  c: 1.57659  d: 0.15000

PageRank: Last 9 iterations

  a: 1.49010  b: 0.78329  c: 1.57659  d: 0.15000
  a: 1.49010  b: 0.78329  c: 1.57659  d: 0.15000
  a: 1.49010  b: 0.78329  c: 1.57659  d: 0.15000
  a: 1.49010  b: 0.78329  c: 1.57659  d: 0.15000
  a: 1.49011  b: 0.78329  c: 1.57660  d: 0.15000
  a: 1.49011  b: 0.78330  c: 1.57660  d: 0.15000
  a: 1.49011  b: 0.78330  c: 1.57660  d: 0.15000
  a: 1.49011  b: 0.78330  c: 1.57660  d: 0.15000
  a: 1.49011  b: 0.78330  c: 1.57660  d: 0.15000

  Average pagerank = 1.0000

Google Architecture
- Key components
- Interconnections
- Data structures
- A reference architecture for search engines?

Google Data Components
- BigFiles
- Repository
  - Use zlib to compress
- Lexicon
  - Word base
- Hit Lists
  - Word-document ID map
- Document Indexing
  - Forward Index
  - Inverted Index

Google File System (GFS)
- BigFiles: a.k.a. Google's proprietary filesystem
- 64-bit addressable
- Compression
- Conventional operating systems don't suffice
  - No explanation of why?
- GFS: http:/

Key Data Components
- Repository
  - Stores the full text of web pages
  - Uses zlib to compress
  - zlib is less space-efficient than bzip: a tradeoff of time complexity versus space efficiency
  - bzip is more space-efficient, but slower
- Why is it important to compress the pages?
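The zlib-versus-bzip tradeoff above can be seen directly from Python's standard `zlib` and `bz2` modules; a small sketch (the sample page bytes are made up, and exact sizes and timings will vary with the input):

```python
import bz2
import zlib

# Illustrate the repository tradeoff: zlib is fast but compresses less;
# bzip-style compression is typically smaller but slower. A repeated HTML
# snippet stands in for a crawled page (hypothetical data).
page = b"<html><body>search engine architecture</body></html>" * 1000

z = zlib.compress(page)   # the scheme the repository slide names
b = bz2.compress(page)    # the more space-efficient alternative

print(len(page), len(z), len(b))       # original vs. compressed sizes
assert zlib.decompress(z) == page      # both schemes are lossless
assert bz2.decompress(b) == page
```

Compressing the repository matters because the stored pages dominate total disk usage; even a modest ratio multiplies how many pages fit in the same storage budget.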

Google Lexicon
- The lexicon contains 14 million words
- Implemented as a hash table of pointers to words
- A full explanation is beyond the scope of this discussion
- Why is it important to have a lexicon?
  - Tokenization
  - Analysis
  - Language identification
  - SPAM

Mapping queries to hits
- HitLists: a wordID → (docID, position, font, capitalization) mapping
- Takes up most of the space in the forward and inverted indices
- Types: Fancy, Plain, Anchor
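A toy model of the lexicon and hit lists just described can be sketched as follows. This is only an illustration of the wordID → (docID, position, ...) mapping: the real hit lists were compact, hand-packed bit encodings, and every name here (`add_hit`, `Hit`, and so on) is hypothetical, not Google's.

```python
from collections import namedtuple

# One hit: (docID, word position in the document, font size class,
# whether the occurrence was capitalized) -- plain tuples stand in for
# Google's bit-packed encoding.
Hit = namedtuple("Hit", "doc_id position font_size capitalized")

lexicon = {}    # word -> wordID (Google's was a hash table over 14M words)
hit_lists = {}  # wordID -> list of Hit records

def add_hit(word, doc_id, position, font_size=0, capitalized=False):
    # Assign the next free wordID on first sight of a word.
    word_id = lexicon.setdefault(word, len(lexicon))
    hit_lists.setdefault(word_id, []).append(
        Hit(doc_id, position, font_size, capitalized))
    return word_id

wid = add_hit("search", doc_id=7, position=0, capitalized=True)
add_hit("engine", doc_id=7, position=1)
add_hit("search", doc_id=9, position=4)

print(hit_lists[wid])  # both occurrences of "search", across two documents
```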

Document Indexing
- Forward Index
  - docIDs → wordIDs
  - Partially sorted
  - Duplicated docIDs
  - Makes it easier for final indexing and coding
- Inverted Index
  - wordIDs → docIDs
  - 2 sets of inverted barrels

Crawling and Indexing
- Crawling
  - Distributed, parallel
  - Social issues
    - Bringing down web servers: politeness
    - Copyright issues
    - Text versus code
- Indexing
  - Developed their own web page parser
  - Barrels
    - Distribution of compressed documents
  - Sorting

Google's Query Evaluation
1: Parse the query
2: Convert words into wordIDs, using the lexicon
3: Select the barrels that contain documents which match the wordIDs
4: Search through the documents in the selected barrels until one is discovered that matches all the search terms
5: Compute that document's rank (using PageRank as one of the components)
6: Repeat step 4 until no documents are found and we've gone through all the barrels
7: Sort the set of returned documents by document rank and return the top k documents
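The seven steps above can be sketched as a toy evaluator. Barrels are modeled as a list of inverted-index dicts (wordID → set of docIDs) and document rank as a precomputed PageRank-style score; all names and data here (`LEXICON`, `BARRELS`, `evaluate_query`, the scores) are illustrative, not Google's actual structures.

```python
# Toy query evaluation over a two-barrel inverted index (all data made up).
LEXICON = {"search": 0, "engine": 1, "architecture": 2}

BARRELS = [  # each barrel: wordID -> docIDs containing that word
    {0: {1, 2}, 1: {2, 3}},
    {2: {2, 4}},
]

PAGERANK = {1: 0.9, 2: 1.6, 3: 0.4, 4: 1.1}  # hypothetical document ranks

def evaluate_query(query, k=10):
    word_ids = [LEXICON[w] for w in query.lower().split()]    # steps 1-2
    matches = set(PAGERANK)
    for wid in word_ids:                                      # steps 3-4, 6
        docs = set()
        for barrel in BARRELS:            # gather postings from every barrel
            docs |= barrel.get(wid, set())
        matches &= docs                   # keep docs matching ALL terms
    ranked = sorted(matches, key=PAGERANK.get, reverse=True)  # steps 5, 7
    return ranked[:k]

print(evaluate_query("search engine"))  # -> [2]
```

Note one simplification: this intersects complete posting sets per term, whereas the slide's step 4 scans the barrels document by document and stops early once enough matches are found.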

Google Evaluation
- Performed by generating numerical results
- Query satisfaction
  - "Bill Clinton" example
- Storage requirements
  - 55 GB total
- System performance
  - 9 days to download 26 million pages
  - 63 hours to get the final 11 million (at the time)
- Search performance
  - Between 1 and 10 seconds for most queries (at the time)

Wrapup
- Loads of future work. Even at that time, there were issues of:
  - Information extraction from semi-structured sources (such as web pages)
    - Still an active area of research
  - Search engines as a digital library
    - What services, APIs, and toolkits should a search engine provide?
    - What storage methods are the most efficient?
    - From 2005 to 2010 to ?
  - Enhancing metadata
    - Automatic markup and generation
    - What are the appropriate fields?
  - Automatic concept extraction
    - Present the searcher with a context
  - Searching languages: beyond context-free queries
  - Other types of search: facet, GIS, etc.

The Future?
- User poses a keyword query
- A "Google-like" result page comes back
- Along with each link returned, there will be a "Concept Map" outlining, using extraction methods, what the "real" content of the document is
  - This basically allows you to "visually" see what the page rank is
  - Discover information visually
- Existing evidence that this works well
  - http:/
  - Carrot2/3 clustering

Concept Map
