1、M.P. Johnson, DBMS, Stern/NYU, Sp2004,1,C20.0046: Database Management Systems Lecture #28,Matthew P. Johnson Stern School of Business, NYU Spring, 2004,M.P. Johnson, DBMS, Stern/NYU, Sp2004,2,Agenda,Previously: Failure/recovery Next: Finish Data mining WebsearchProj5 due today no extensions! Final E
2、xam Thursday, 5/6,10-11:50am 1-minute responses,M.P. Johnson, DBMS, Stern/NYU, Sp2004,3,Method: Genetic Algorithms,for optimization problems each possible solution x as some fitness f(x) space of possible solutions a surface goal: find an x at the top of a hill on the surface example: TSP with limit
3、ations must go here before some point cannot go from here to there 25 cities 24! Possible ways billions of years, under reasonable assumptions GAs are one method for hill climbing Avoid local maxima,M.P. Johnson, DBMS, Stern/NYU, Sp2004,4,GA motivation,Analogy from genetics: Create from random a pop
4、. of poss. solns Most probably very bad, but Compute fitness f(x) for each member x Create next generation by picking relatively good pairs Merge half of info from each to create new one repeat Examples: Glower, Dawkins, etc.,M.P. Johnson, DBMS, Stern/NYU, Sp2004,5,Moodys example (D&S),Financial ser
5、vices needed helpdesk dispatching system send right person for each IT problem experienced technicians to solve hard problems minimize loss of value due to computer downtime minimize individual wait time tell users when ere issues would be fixed schedules did not have to be perfect but good job-shop
6、 scheduling,M.P. Johnson, DBMS, Stern/NYU, Sp2004,6,Moodys example,SOGA: Scheduling Optimizing GAFitness function: fitness inverse to total amount of downtime, max indy. waittime on live set of current jobs! includes time of request, type of user, type of prob., etc. updates schedules every 10-15 mi
7、nutes, based on new inputs not-yet-started jobs can be reassigned,M.P. Johnson, DBMS, Stern/NYU, Sp2004,7,Neural Networks,Also do hill climbing, but completely different idea based on connections between neurons in brain but used for supervised learningsimple NN: input layer with 3 nodes hidden laye
8、r with 2 nodes output layer with 1 node each node points to each node in next level each node as some activation level and a something like a critical mass Draw picture What kind of graph is this?,M.P. Johnson, DBMS, Stern/NYU, Sp2004,8,Neural Networks,values passed into input nodes represent the pr
9、oblem instance given weighted sum of inputs to neuron, sends out pulse only if the sum is greater than its threshold values output by hidden nodes sent to output node if the weighted sum going into output node is high enough, it outputs 1, otherwise 0,M.P. Johnson, DBMS, Stern/NYU, Sp2004,9,NN appli
10、cations,plausible application: we have data about potential customers party registration married or not gender income levelor: have credit applicant information employment income home ownership bankruptcy should we give a credit card to him?,M.P. Johnson, DBMS, Stern/NYU, Sp2004,10,How NNs work,hope
11、: plug in customer out comes whether we should market toward himHow does it get the right answer?Initially, all weights are random! But: we assume we have data for lots of people which we know to be either interested in our products or not we have data for both kinds so, when we plug in one of these
12、 customers, we know what the right answer is supposed to be,M.P. Johnson, DBMS, Stern/NYU, Sp2004,11,How NNs work,can used the Backpropagation algorithm: for each known problem instance, plug in and look at answer if answer is wrong, cane edges weights in one way; o.w., change them the opposite way
13、(details omitted) repeatthe more iterations we do, the more the NN learns our known data with enough confidence, can apply NN to unknown customer data to learn whether to market toward them,M.P. Johnson, DBMS, Stern/NYU, Sp2004,12,LBS example,Investments goal: maximize return on investments buy/sell
14、 right securities at right timelots of time-series data for different props for different stocks return market signals pick right ones reactsoln: create NN for each stock retrain weekly,M.P. Johnson, DBMS, Stern/NYU, Sp2004,13,Decision Trees,One technology: decision trees Yet another use of trees! E
15、ach node one attribute Its children possible values of that attribute E.g.: given votes on issues, dist. Dems v. GOP Each path from root to leaf is one rule Like 20Qs/BSTs/B-trees whats important diff? Learn rules from the data Given value of this and this and that, this value will be something Cust
16、omer details potential good customer, financial details good credit risk, etc.,M.P. Johnson, DBMS, Stern/NYU, Sp2004,14,Decision Trees,Details: for binary property, two out edges, but may be more for continuous property (income), divide values into discrete ranges a property may appear more tan once
17、Example: top node: history of bankruptcy? if yes, REJECT if no, ten: employed? If no, (maybe look for high monthly housing payment) If yes, Algorithms: ID3, C4.5, CART,M.P. Johnson, DBMS, Stern/NYU, Sp2004,15,DB Functionality,Association study frequencies of items occurring together buys(x,cereal) b
18、uys(x,milk) Mail order tulip bulbs Conservative party voters intuitionTechniques: Decision trees GAs - Glower,M.P. Johnson, DBMS, Stern/NYU, Sp2004,16,Mining v. warehousing,Warehousing: let user search, group by interesting properties Give me the sales of A4s by year and dealer, for these colors Use
19、r tries to learn from results which properties are important/interesting Mining: tell the user what the interesting properties are Whats driving sales?,M.P. Johnson, DBMS, Stern/NYU, Sp2004,17,Social/political concerns,Privacy TIA Sensitive data Allow mining but not queries Opt-in/opt-out “Dont be e
20、vil.”,M.P. Johnson, DBMS, Stern/NYU, Sp2004,18,For more info,See Dahr & Stein: Seven Methods for Transforming Corporate Data into Business Intelligence (1997) Drawn on above A few years old, but very accessibleData mining courses offered here,M.P. Johnson, DBMS, Stern/NYU, Sp2004,19,Next topic: Webs
21、earch,Websearch is not running queries on the web Info/connections downloaded from web Crawl paths from tree rooted at homepage Special database is formed On request, we run query on DB Draw picture Issues: DNS bottleneck search strategy refresh strategy robot exclusion protocol bad HTML, non-respon
22、sive servers, etc.,M.P. Johnson, DBMS, Stern/NYU, Sp2004,20,Crawling issues,Content-seen test compute fingerprint/hash of page content also: URL-seen test DNS bottleneck to view page by text link, must get address BP claim: 87% crawling time DNS look-up Referring to webpages use DocIDs, not addresse
23、s more popular pages get shorter DocIDs (why?) many commodity machines frequent crashes Logging, checkpointing,M.P. Johnson, DBMS, Stern/NYU, Sp2004,21,Crawling,Comparatively straightforward although Google uses PageRank for crawling, tooFirst, must get pages Spiders Prof. Davis (NYU/CS): http:/www.
24、cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.javahttp:/pages.stern.nyu.edu/mjohnson/dbms/eg/WebCrawler.javaRun program: sales% java WebCrawler http:/pages.stern.nyu.edu/mjohnson/dbms 200,M.P. Johnson, DBMS, Stern/NYU, Sp2004,22,Inverted indices,Basic idea of finding pages: Create inverted index
25、 mapping words to pagesFirst, think of each webpage as a tuple One column for every possible word True means the word appears on the page Index on all columns Now can search: youre fired select * from T where youre=T and fired=T,M.P. Johnson, DBMS, Stern/NYU, Sp2004,23,Inverted indices,Can simplify
26、somewhat: For each field index, delete False entries True entries for each index become a bucketCreate “inverted index”: One entry for each search word Search word entry points to corresponding bucket Bucket points to pages with its word Amazon,M.P. Johnson, DBMS, Stern/NYU, Sp2004,24,Inverted Indic
27、es,Whats stored? For each word W, for each doc D relevance of D to W #/% occurs. of W in D meta-data/context: bold, font size, title, etc.In addition to page importance, keep in mind: this info is used to determine relevance of particular words appearing on the page,M.P. Johnson, DBMS, Stern/NYU, Sp
28、2004,25,Google-like infrastructure,Very large distributed system 100MB, GB files Google File System Block size = 64MB (not kb)! hundreds or thousands of low-quality Linux servers system failures are rule, not exception Divide index up by words into many barrels lexicon maps word ids to words barrel
29、also, do RAID two-D matrix of servers Draw picture,M.P. Johnson, DBMS, Stern/NYU, Sp2004,26,Google-like infrastructure,Respond to query Q(w1wn): for each wi, send to barrel column for wi pick random server in that column for each wi in parallel, step through until find doc containing all words, add
30、to set of results index ordered on word;docID Linear time return sorted results,M.P. Johnson, DBMS, Stern/NYU, Sp2004,27,Sorting Results,How to respond to Q(w1,w2,wn)? Search index for pages with w1,w2,wn Return in sorted order (how?)Soln 1: current order Return 100,000 (mostly) useless resultsSoln
31、2: ways from Information Retrieval Theory library science + CS,M.P. Johnson, DBMS, Stern/NYU, Sp2004,28,Information Retrieval Theory,Standard ways from IR theory: Clustering algorithms simplest: for each word W in a doc D, compute # occurs of W in D / total # word occurs in D each document becomes a
32、 point in a space one dimension for every possible word value in that dim is ratio from above (maybe with logs) Choose pages with high values for sought wordsVariants work well for small sets of docs But the web has billions Problem: very short pages containing query words are privileged query “bill
33、 clinton” “Bill Clinton Sucks” page (BP),M.P. Johnson, DBMS, Stern/NYU, Sp2004,29,Sorting Results,Soln 3: sort by “quality” What do you mean by quality? hire readers to rate your webpage (early Yahoo) problem: more webpages than Yahoo employeesSoln 4: # citations (links) peer review idea: the rest o
34、f the web has already voted on the quality of your webpage 1 link to your page = 1 vote similar to counting academic citations,M.P. Johnson, DBMS, Stern/NYU, Sp2004,30,Sorting Results,Soln 5: Googles PageRank count citations, but not equally weighted sum motiv: already decided some pages are better
35、their votes count for more two cases at ends of continuum: many pages link to you G links to youMore precisely, for page P, each Li links to P; each page Li links to NLi pages: PR(P) = SUM(PR(Li)/NLi) Motiv: each page votes with its quality; its quality is divided among the pages it votes for,M.P. J
36、ohnson, DBMS, Stern/NYU, Sp2004,31,Understanding PageRank,Analogy 1: Friendster someone lets you in someone else let that you in, etc.Analogy 2: PKE certificates my cert authenticated by your cert your cert endorsed by someone elses Both cases ere: eventually reach a foundationAnalogy 3: job/school
37、recommendations three people recommend you why should we believe them? three other people rec-ed them, etc. eventually, we trust,M.P. Johnson, DBMS, Stern/NYU, Sp2004,32,Understanding PageRank,Analogy 3: Random Surfer Model Idealized web surfer: start at some page at each page, pick a random link Tu
38、rns out: after long time surfing, Pr(were at some page P right now) = PR(P) PRs are normalized,M.P. Johnson, DBMS, Stern/NYU, Sp2004,33,Computing PageRank,For each page P, we want: PR(P) = SUM(PR(Li)/NLi) But its circular how to compute?Meth 1: for n pages, weve got n linear eqs and n vars can solve
39、 for all PR(P)s, but too hard see your mathematical programming course Meth 2: iteratively start with PR(P) set to E for each P iterate until no more significant change PB report O(50) iterations for O(300M) links, O(30M) pages #iters req. rows only with log of web size,M.P. Johnson, DBMS, Stern/NYU
40、, Sp2004,34,Problems with PageRank,Example (Ullman): A points to Y, M; Y points to self, A; M points nowhereStart A,Y,M at 1 (1,1,1) (0,0,0) The rank dissipates stern% java PageRankSoln: add self link to any dead-end,M.P. Johnson, DBMS, Stern/NYU, Sp2004,35,Problems with PageRank,Example (Ullman): A
41、 points to Y, M; Y points to self, A; M points to selfStart A,Y,M at 1 (1,1,1) (0,0,3) M becomes a rank sink RSM interp: eventually we end up at M and get stuck stern% java PageRank2Soln: add “inherent quality” E to each page,M.P. Johnson, DBMS, Stern/NYU, Sp2004,36,Modified PageRank,Apart from inhe
42、rited quality, each page also as inherent quality E: PR(P) = E + SUM(PR(Li)/OLi) More precisely, have weighted sum of the two terms: PR(P) = .15*E + .85*SUM(PR(Li)/OLi) stern% java PageRank2Leads to new random surfer model,M.P. Johnson, DBMS, Stern/NYU, Sp2004,37,Random Surfer Model,Motiv: if we end
43、 up at M, we dont really stay there forever We type in a new URLIdealized web surfer: start at some page at each page, pick a random link but occasionally, we get bored and jump to a random pageTurns out: after long time surfing, Pr(were at some page P right now) = PR(P),M.P. Johnson, DBMS, Stern/NY
44、U, Sp2004,38,Understanding PageRank,One more interp: hydraulic model picture graph of web imagine each link as a tube bet. two nodes, imagine quality as fluid each node is a reservoir of E amount of fluid Now let flow Steady state is: each node P w/PR(P) amount of fluid PR(P) eventually settles in n
45、ode P,M.P. Johnson, DBMS, Stern/NYU, Sp2004,39,Understanding PageRank,Sornette: Why Stock Markets Crash Si(t+1) = sign(ei + SUM(Sj(t) trader buys/sells based on is inclination and what is associates are saying directions. of magnet det-ed by old direction and dirs. of neighbors activation of neuron
46、det-ed by its props and activation of neighbors connected by synapses PR of P based on its inherent value and PR of in-links,M.P. Johnson, DBMS, Stern/NYU, Sp2004,40,Non-uniform Es,So far, assumed E was const for all pages Can let E=E(P) vary by page How do we choose E(P)? Choice 1: set high for pag
47、es we already trust Choice 2: set high for pages I like PB paper gave high E to John McCarthys homepage pages he links to get high PR, etc. Result: Google personalized for him Q: How would get your prefs?,M.P. Johnson, DBMS, Stern/NYU, Sp2004,41,Tricking search engines,“Search Engine Optimization”O
48、ld search engines tat dont sort results or sort tem badly: include on your page (maybe hidden) lots of words you think people will query onThis doesnt work for Google because the pages doing this probably arent linked to but,M.P. Johnson, DBMS, Stern/NYU, Sp2004,42,Tricking search engines,I can try
49、to make my page look popular to Google create a page wit 1000 links to my page does this work?Create 1000 other pages linking to itPR2: put limit on weight one domain can give to itself Trick2: buy another domain and put the 1000 pages there PR3: put limit on weight from any single domain,M.P. Johns
50、on, DBMS, Stern/NYU, Sp2004,43,Googles good ideas,Google had two good ideas: Googles big idea: PageRank Googles little idea: use anchor textMotiv: pages may not give best descrips. of themselves most search engines dont contain “search engine“ PB claim: only 1 of top 4 search engines could find themselves on query “search engine“ Anchor text also describes page: many pages link to Google many of them likely say “search engine“ in/near the link Treat anchor text words as part of page This is why search for “US West” found Quest!,