Technologies for Mining Frequent Patterns in Large Databases
Jiawei Han
Intelligent Database Systems Research Lab., Simon Fraser University, Canada
http://www.cs.sfu.ca/han

Tutorial Outline
- What is frequent pattern mining?
- Frequent pattern mining algorithms: Apriori and its variations
- A multi-dimensional view of frequent pattern mining
- Constraint-based frequent pattern mining
- Recent progress on efficient mining methods
  - Mining frequent patterns without candidate generation
  - CLOSET: efficient mining of frequent closed itemsets
  - FreeSpan: towards efficient sequential pattern mining

Part I: What Is Frequent Pattern Mining?
- What is a frequent pattern?
- Why frequent pattern mining?
- Challenges in frequent pattern mining

What Is Frequent Pattern Mining?
- What is a frequent pattern? A pattern (a set of items, a sequence, etc.) that occurs frequently in a database [AIS93]
- A frequent pattern is an important form of regularity:
  - What products are often purchased together? Beer and diapers!
  - What are the consequences of a hurricane?
  - What is the next target after buying a PC?

Application Examples
- Market basket analysis
  - *Maintenance agreement: what should the store do to boost maintenance agreement sales?
  - *Home electronics: what other products should the store stock up on if it has a sale on home electronics?
- Attached mailing in direct marketing
- Detecting "ping-pong"ing of patients: transaction = patient; item = doctor/clinic visited by the patient; support of a rule = number of common patients
Frequent Pattern Mining: A Cornerstone of Data Mining
- Association analysis: basket data analysis, cross-marketing, catalog design, loss-leader analysis, text database analysis
- Correlation or causality analysis
- Clustering
- Classification: association-based classification analysis
- Sequential pattern analysis: Web log sequences, DNA analysis, etc.
- Partial periodicity, cyclic/temporal associations

Association Rule Mining
- Given: a database of customer transactions, where each transaction is a list of items (purchased by a customer in one visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items may appear in the consequent or antecedent of a rule
  - It is possible to specify constraints on rules (e.g., find only rules involving home laundry appliances)

Basic Concepts
- Rule form: "A => B [support s, confidence c]"
- Support: usefulness of a discovered rule
- Confidence: certainty of the detected association
- Rules that satisfy both min_sup and min_conf are called strong
- Examples:
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
  - age(x, "30-34") ^ income(x, "42K-48K") => buys(x, "high resolution TV") [2%, 60%]
  - major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Rule Measures: Support and Confidence
- Find all the rules X & Y => Z with minimum support and confidence:
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction containing {X, Y} also contains Z
- With minimum support 50% and minimum confidence 50%, we have A => C (50%, 66.6%) and C => A (50%, 100%)
[Figure: Venn diagram of customers buying beer, buying diapers, and buying both]
Part II: Frequent Pattern Mining Methods: Apriori and Its Variations
- The Apriori algorithm
- Improvements of Apriori
- Incremental, parallel, and distributed methods
- Different measures in association mining

An Influential Mining Methodology: The Apriori Algorithm
- The Apriori method: proposed by Agrawal & Srikant 1994; a similar level-wise algorithm was proposed by Mannila et al. 1994
- Major idea: every subset of a frequent itemset must be frequent
  - E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any subset is infrequent, its supersets cannot be frequent!
- A powerful, scalable candidate-set pruning technique: it reduces the number of candidate k-itemsets dramatically (for k >= 2)

Mining Association Rules: Example
- Min. support 50%, min. confidence 50%
- For rule A => C: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent

Procedure of Mining Association Rules
1. Find the frequent itemsets, i.e., the sets of items that have minimum support (Apriori):
   - A subset of a frequent itemset must also be a frequent itemset; e.g., if {A, B} is a frequent itemset, both {A} and {B} must be frequent
   - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
2. Use the frequent itemsets to generate association rules.

The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence it is removed
  (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k)

Apriori Pseudocode
    L1 = {frequent items};
    for (k = 1; Lk != empty; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return the union of all Lk;

The Apriori Algorithm: Example
[Figure: trace of Apriori on an example database D: scan D to count C1 and obtain L1; join to form C2, scan D to obtain L2; form C3, scan D to obtain L3]
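The pseudocode above maps almost line for line onto a small program. Here is a minimal, self-contained Python sketch (the function and variable names are mine, not the tutorial's): candidate generation by self-join plus Apriori pruning, then one scan over the transactions per level to count supports.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining, as in the pseudocode above.
    transactions: list of sets of items; min_support: minimum fraction.
    Returns a dict mapping each frequent itemset (frozenset) to its support."""
    n = len(transactions)
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates,
        # then prune any candidate that has an infrequent k-subset.
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in prev for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Scan the database once to count all surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy run on hypothetical data, min support 50%.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for itemset, sup in sorted(apriori(db, 0.5).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```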
How to Generate Candidates?
- Suppose the items in Lk-1 are listed in a fixed order
- Step 1: self-join Lk-1
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - The subset function finds all the candidates contained in a transaction

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
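A compact Python sketch of the join-and-prune step (my own naming), reproducing the example above:

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Self-join frequent (k-1)-itemsets, then prune any candidate that
    has an infrequent (k-1)-subset (the Apriori pruning step)."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    return {
        c for c in candidates
        if all(frozenset(s) in prev for s in combinations(c, k - 1))
    }

L3 = [frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = generate_candidates(L3, 4)
print([''.join(sorted(c)) for c in C4])  # ['abcd']; acde pruned, ade not in L3
```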
Example: Counting Supports of Candidates
[Figure: hash-tree lookup for transaction {1 2 3 5 6}, recursively splitting into 1 + {2 3 5 6}, 1 2 + {3 5 6}, 1 3 + {5 6}, etc.]

Generating Strong Association Rules
- Confidence(A => B) = Prob(B | A) = support(A ∪ B) / support(A)
- Example: L3 = {2, 3, 5}
  - 2 3 => 5, confidence = 2/2 = 100%
  - 2 5 => 3, confidence = 2/3 = 67%
  - 3 5 => 2, confidence = 2/2 = 100%
  - 2 => 3 5, confidence = 2/3 = 67%
  - 3 => 2 5, confidence = 2/3 = 67%
  - 5 => 3 2, confidence = 2/3 = 67%
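A minimal sketch (my own code) that enumerates the rules derivable from one frequent itemset, given precomputed supports. The support counts below are hypothetical, chosen to be consistent with the example above.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Emit all rules antecedent => consequent from one frequent itemset.
    `support` maps frozensets to their support counts."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            antecedent = frozenset(antecedent)
            conf = support[items] / support[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(items - antecedent), conf))
    return rules

# Hypothetical absolute support counts matching the slide's confidences.
support = {
    frozenset({2, 3, 5}): 2,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
}
for lhs, rhs, conf in rules_from_itemset({2, 3, 5}, support, 0.5):
    print(f"{lhs} => {rhs}  confidence {conf:.0%}")
```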
Efficient Implementation of Apriori in SQL
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98.
- Implementations based on pure SQL-92: it is impossible to get good performance out of pure SQL-based approaches alone
- Making use of object-relational extensions like UDFs, BLOBs, table functions, etc. gets orders-of-magnitude improvement

Improvements of Apriori
- General ideas:
  - Scan the transaction database in as few passes as possible
  - Reduce the number of candidates
  - Facilitate support counting of candidates

DIC: Reduce the Number of Scans
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.
- Basic idea:
  - Count the itemsets at the boundary in the itemset lattice
  - Push the boundary dynamically
  - Use a trie structure to keep track of counters, and reorder items to reduce counting costs

Example of DIC
[Figure: itemset lattice over {A, B, C, D} with its boundary, and a timeline of transactions comparing when Apriori vs. DIC begin counting 2-itemsets and 3-itemsets]
- Once all (k-1)-subsets of a k-itemset are frequent, counting of the k-itemset can begin
- No node above an infrequent itemset in the lattice should be counted

DIC: Pros and Cons
- Number of scans: can be reduced in some cases, but how about non-homogeneous data and high-support situations?
- Item reordering: "Item reordering did not work as well as we had hoped"
- Performance: 30% gain at the low-support end, 30% loss at the high-support end

DHP: Reduce the Number of Candidates
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.
- Major features:
  - Efficient generation of candidate itemsets
  - Effective reduction of the transaction database size

DHP: Efficient Generation of Candidates
- In the k-th pass, count the supports of the k-candidates and of the entries in a hash table
- A (k+1)-itemset in Lk * Lk qualifies as a (k+1)-candidate only if it passes the hash filtering, i.e., it hashes into an entry whose count is no less than the support threshold
- Example:
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, ...
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the count of the hash bucket {ab, ad, ae} is below the support threshold
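A rough Python sketch of DHP-style hash filtering during the first pass. Only the filtering idea is shown; the bucket count and hash function are illustrative assumptions, not the paper's, and a full implementation would intersect the survivors with L1 * L1.

```python
from itertools import combinations

# While scanning transactions (e.g., during the 1-itemset pass), hash every
# 2-itemset of each transaction into a small table of bucket counters; a
# 2-itemset can become a candidate only if its bucket count reaches min_count.
NUM_BUCKETS = 8  # illustrative; real tables are much larger

def bucket(pair):
    # Deterministic within one run; a real system would use a fixed hash.
    return hash(frozenset(pair)) % NUM_BUCKETS

transactions = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
min_count = 2

buckets = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket(pair)] += 1

# Hash filtering: keep a 2-itemset as a candidate only if its bucket passed.
all_pairs = {p for t in transactions for p in combinations(sorted(t), 2)}
candidates = {p for p in all_pairs if buckets[bucket(p)] >= min_count}
print(sorted(candidates))
```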
DHP: Effective Reduction of Database Size
- An item in transaction t can be trimmed if it does not appear in at least k of the candidate k-itemsets contained in t
- Examples:
  - Transaction acd can be discarded if only ac is frequent
  - Transaction bce must be kept if bc, be, and ce are frequent

Partition: Scan the Database Only Twice
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.
- Mine all frequent itemsets while scanning the transaction database only twice

Scan One in Partition
- Divide the database into n partitions: a globally frequent itemset must be frequent in at least one partition
- Process one partition in main memory at a time; for each partition:
  - Generate local frequent itemsets using the Apriori algorithm
  - Also form a tidlist for each itemset to facilitate counting in the merge phase
  - tidlist: the transaction IDs of all transactions within the partition that contain the itemset

Scan Two in Partition
- Merge the local frequent itemsets to generate the set of all potentially frequent itemsets
- Count their actual supports; support can be computed from the tidlists

Partition: Pros and Cons
- Achieves both CPU and I/O improvements over Apriori
- The number of distinct local frequent itemsets may be very large
- The tidlists to be maintained can be huge
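Support counting from tidlists, as used in Partition's merge phase, is just set intersection. A minimal sketch (my own naming):

```python
# The support of an itemset is the size of the intersection of its items'
# tidlists (sets of transaction IDs).
from functools import reduce

def tidlists(transactions):
    """Map each item to the set of transaction IDs that contain it."""
    index = {}
    for tid, t in enumerate(transactions):
        for item in t:
            index.setdefault(item, set()).add(tid)
    return index

def support_from_tidlists(itemset, index):
    return len(reduce(set.intersection, (index[i] for i in itemset)))

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
index = tidlists(db)
print(support_from_tidlists({"b", "e"}, index))  # 3 transactions contain {b, e}
```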
Sampling for Mining Frequent Itemsets
- H. Toivonen. Sampling large databases for association rules. In VLDB'96.
- Select a sample of the original database and mine the frequent itemsets within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the border of the closure of the frequent itemsets needs to be checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent itemsets

Challenges for the Sampling Method
- How to sample a large database?
- When the support threshold is low, sampling may not generate good enough results
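A bare-bones illustration of the sample-then-verify idea, not Toivonen's negative-border machinery: mine the sample at a slightly lowered threshold to reduce false negatives, then verify the candidates with one scan of the full database (missed itemsets would still require the second scan described above). All names and the toy miner are mine.

```python
import random
from itertools import combinations

def simple_miner(transactions, min_support):
    """Toy stand-in for Apriori: counts only 1- and 2-itemsets, for brevity.
    (A real implementation would run the full level-wise search.)"""
    counts = {}
    for t in transactions:
        for r in (1, 2):
            for c in combinations(sorted(t), r):
                counts[c] = counts.get(c, 0) + 1
    n = len(transactions)
    return {c for c, k in counts.items() if k / n >= min_support}

def sample_then_verify(db, min_support, sample_frac=0.1, slack=0.9):
    # Mine a small random sample at a lowered threshold (slack < 1).
    sample = random.sample(db, max(1, int(len(db) * sample_frac)))
    candidates = simple_miner(sample, min_support * slack)
    # One scan of the full database verifies the candidates' true supports.
    verified = set()
    for c in candidates:
        count = sum(1 for t in db if set(c) <= t)
        if count / len(db) >= min_support:
            verified.add(c)
    return verified

db = [{"a", "c"}, {"a", "b", "c"}, {"b", "e"}, {"a", "c", "e"}] * 25  # toy DB
print(sorted(sample_then_verify(db, min_support=0.5)))
```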
Incremental Association Mining
- Given: a transaction database and the set of frequent itemsets already mined from it
- Given: a set of update transactions (insertions and deletions) for the transaction database
- Problem: how to update the frequent itemsets for the updated transaction database?
[Figure: transaction database and its frequent itemsets, plus update transactions; what are the updated frequent itemsets?]

FUP: Incremental Update of Discovered Rules
- D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE'96.
- A fast updating algorithm, FUP (Cheung et al. '96)
- View the database as the original DB plus an incremental db. A k-itemset (for any k) is:
  - frequent in DB ∪ db if it is frequent in both DB and db
  - infrequent in DB ∪ db if it is infrequent in both DB and db
  - For itemsets frequent only in DB, merge in the corresponding counts from db
  - For itemsets frequent only in db, scan DB to update their counts
- Similar methods can be adopted for data removal and update, and for distributed/parallel mining
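The case analysis above can be put in code form. A simplified sketch (insertions only; the names are mine):

```python
def fup_status(count_DB, size_DB, count_db, size_db, min_support):
    """Classify one itemset after appending increment db to DB (insertions
    only). Returns 'frequent', 'infrequent', or 'scan_DB' when the itemset
    was frequent only in db and its count in DB is unknown (count_DB=None)."""
    freq_db = count_db / size_db >= min_support
    if count_DB is not None:
        # The itemset was frequent in DB, so its DB count was kept:
        # just merge the counts.
        total = (count_DB + count_db) / (size_DB + size_db)
        return "frequent" if total >= min_support else "infrequent"
    # The itemset was infrequent in DB, so its exact DB count was not kept.
    if not freq_db:
        return "infrequent"   # infrequent in both parts: safe to drop
    return "scan_DB"          # frequent only in db: must scan DB to decide

print(fup_status(60, 100, 2, 10, 0.5))    # 'frequent' (62/110 >= 0.5)
print(fup_status(None, 100, 6, 10, 0.5))  # 'scan_DB'
```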
Parallel and Distributed Association Mining
- D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS'96.
- M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems. In VLDB'99.
- E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In SIGMOD'97.
- M. Zaki, S. Parthasarathy, and M. Ogihara. Parallel algorithms for discovery of association rules. In Data Mining and Knowledge Discovery, 1(4), 1997.

Interestingness Measures
- Objective measures: two popular measurements are support and confidence
- Subjective measures (Silberschatz & Tuzhilin): a rule is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it)

Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98): among 5000 students,
  - 3000 play basketball
  - 3750 eat cereal
  - 2000 both play basketball and eat cereal
- The rule "play basketball => eat cereal [40%, 66.7%]" is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- The rule "play basketball => not eat cereal [20%, 33.3%]" is far more accurate, although it has lower support and confidence

Criticism of Support and Confidence (Cont.)
- Example 2: X and Y are positively correlated, while X and Z are negatively correlated
- Yet the support and confidence of X => Z dominate those of X => Y

Other Interestingness Measures: Interest
- Interest (lift) takes both P(A) and P(B) into consideration: Interest(A, B) = P(A ∪ B) / (P(A) P(B))
- P(A ∪ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
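Interest makes the cereal example quantitative. A quick Python computation on the numbers from the slide above:

```python
# Lift on the basketball/cereal example:
# 5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both.
n, n_b, n_c, n_bc = 5000, 3000, 3750, 2000

p_b, p_c, p_bc = n_b / n, n_c / n, n_bc / n
lift = p_bc / (p_b * p_c)
print(lift)  # 0.888... < 1: basketball and cereal are negatively correlated,
             # even though the rule's confidence (66.7%) looks high.
```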
Other Interestingness Measures: Conviction
- Conviction comes from implication: A => B is logically equivalent to ¬(A ∧ ¬B)
- Conviction(A => B) = P(A) P(¬B) / P(A ∧ ¬B)
- It factors in both P(A) and P(B), and has value 1 when the relevant items are completely unrelated (confidence does not)
- Rules which hold 100% of the time have the highest possible value, ∞ (interest does not)
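Continuing the same toy numbers, a quick conviction computation (my own arithmetic, not from the slides):

```python
# Conviction on the same basketball/cereal numbers.
n, n_b, n_c, n_bc = 5000, 3000, 3750, 2000

p_b = n_b / n
p_not_c = (n - n_c) / n
p_b_and_not_c = (n_b - n_bc) / n   # play basketball, don't eat cereal

conviction = p_b * p_not_c / p_b_and_not_c
print(conviction)  # 0.75 < 1: the rule does worse than independence,
                   # agreeing with the lift computation above.
```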
Collective Strength
- Collective strength is a number between 0 and ∞, with 1 as the break-even point, where v(I) is the violation ratio of itemset I
- An itemset I is said to be in violation of a transaction if some of its items are present in the transaction and others are not; v(I) is the fraction of transactions which contain a proper non-null subset of I
- Collective strength can be recast in terms of v(I) and its expectation (see the reconstruction below)

Collective Strength (2)
- Let I be a set of items {i1, i2, ..., ik}, and let pr denote the frequency of item ir in the database
- From the pr we can compute: the probability that itemset I occurs in a transaction; the probability that none of the items in I occurs in a transaction; and the expected fraction of transactions that contain at least one item of I with at least one item absent
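The equations on these two slides were lost in extraction. The following LaTeX block is a reconstruction of the standard definitions the surrounding text describes (from Aggarwal & Yu), not a verbatim copy of the slides:

```latex
% Reconstruction of the collective strength formulas (Aggarwal & Yu);
% the extracted slides describe these quantities but the equations were lost.
\begin{align*}
C(I) &= \frac{1 - v(I)}{1 - \mathbb{E}[v(I)]} \cdot \frac{\mathbb{E}[v(I)]}{v(I)}
  && \text{collective strength of itemset } I \\
\Pr[\text{all of } I] &= \prod_{r=1}^{k} p_r
  && I = \{i_1,\dots,i_k\} \text{ occurs in a transaction} \\
\Pr[\text{none of } I] &= \prod_{r=1}^{k} (1 - p_r) \\
\mathbb{E}[v(I)] &= 1 - \prod_{r=1}^{k} p_r - \prod_{r=1}^{k} (1 - p_r)
  && \text{expected violation ratio under independence}
\end{align*}
```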
Collective Strength (3)
- Example: collective strength of I = {X, Y}
[Worked numeric example omitted in the extracted slides]

Summary
- Frequent pattern mining is an important data mining task
- Apriori is an influential frequent pattern mining methodology
- A family of Apriori-like mining methods has been developed since 1994
- Interestingness measures are important for discovering interesting rules

Technologies for Mining Frequent Patterns in Large Databases