1、 1Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/1Advanced databases Inferring new knowledge from data(bases): Knowledge Discovery in DatabasesBettina BerendtKatholieke Universiteit Leuven, Department of Computer Sciencehttp:/www.cs.kuleuven.be/
2、berendt/teaching/2007w/adb/ Last update: 15 November 20072Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/2AgendaMotivation: Application examplesThe process of knowledge discoveryOrigins and contextMajor issues in knowledge discoveryA short overv
3、iew of key techniques3Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/3What is the impact of genetically modified organisms?4Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/4Is our school syst
4、em good for immigrants and/or children from poor backgrounds?5Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/5What are the effects of teaching in English at universities?6Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be
5、/berendt/teaching/2007w/adb/6What makes people happy?7Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/7What do men and women like?8Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/8Is this a ma
6、n or a woman?clicked on9Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/9Primary Tasks of Data MiningClassification Deviation andchange detection SummarizationClusteringDependency ModelingRegressionfinding the descriptionof several predefined cla
7、sses and classify a data item into one of them.maps a data item to a real-valued prediction variable.identifying a finite set of categories or clusters to describe the data.finding a compact description for a subset of datafinding a model which describes significant dependencies between variables.di
8、scovering the most significant changes in the data 10Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/10AgendaMotivation: Application examplesThe process of knowledge discoveryOrigins and contextMajor issues in knowledge discoveryA short overview
9、of key techniques11Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/11Data mining“ and knowledge discovery“n (informal definition):data mining is about discovering knowledge in (huge amounts of) datan Therefore, it is clearer to speak about “knowl
10、edge discovery in data(bases)”12Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/12Recall: Data, information, and knowledgeData represents a fact or statement of event without relation to other things.n Ex: It is raining.Information embodies the u
11、nderstanding of a relationship of some sort, possibly cause and effect.n Ex: The temperature dropped 15 degrees and then it started raining.Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.n Ex: If the
12、 humidity is very high and the temperature drops substantially the atmospheres is often unlikely to be able to hold the moisture so it rains.(This is from knowledge-management theory. If you want to know about wisdom, check the Web page:G. Bellinger, D. Castro, & A. Mills: Data, Information, Knowled
13、ge, and Wisdom. http:/www.systems-thinking.org/dikw/dikw.htm )13Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/13Why Data Mining? The Explosive Growth of Data: from terabytes to petabytesn Data collection and data availabilityl Automated data co
14、llection tools, database systems, Web, computerized societyn Major sources of abundant datal Business: Web, e-commerce, transactions, stocks, l Science: Remote sensing, bioinformatics, scientific simulation, l Society and everyone: news, digital cameras, We are drowning in data, but starving for kno
15、wledge! “Necessity is the mother of invention” Data mining Automated analysis of massive data sets14Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/14Background: Evolution of Database Technology1960s:n Data collection, database creation, IMS and
16、network DBMS1970s: n Relational data model, relational DBMS implementation1980s: n RDBMS, advanced data models (extended-relational, OO, deductive, etc.) n Application-oriented DBMS (spatial, scientific, engineering, etc.)1990s: n Data mining, data warehousing, multimedia databases, and Web database
17、s2000sn Stream data management and miningn Data mining and its applicationsn Web technology (XML, data integration) and global information systems 15Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/15The KDD processThe non-trivial process of ident
18、ifying valid, novel, potentially useful, and ultimately understandable patterns in data - Fayyad, Platetsky-Shapiro, Smyth (1996) non-trivial processMultiple processvalid Justified patterns/modelsnovel Previously unknownuseful Can be used understandable by human and machine16Berendt: Advanced databa
19、ses, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/16The process part of knowledge discoveryCRISP-DM CRoss Industry Standard Process for Data Mining a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.17Berendt
20、: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/17Knowledge discovery, machine learning, data miningn Knowledge discovery= the whole process n Machine learningthe application of induction algorithms and other algorithms that can be said to learn.“= mode
21、ling“ phasen Data miningl sometimes = KD, sometimes = ML18Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/18The KDD ProcessData organized by function Create/selecttarget databaseSelect samplingtechnique and sample dataSupply missing valuesNormali
22、zevaluesSelect DM task (s)Transform todifferentrepresentationEliminatenoisy dataTransformvaluesSelect DM method (s)Create derivedattributesExtract knowledgeFind importantattributes &value rangesTest knowledge Refine knowledgeQuery & report generationAggregation & sequencesAdvanced methodsData wareho
23、using123 4519Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/19AgendaMotivation: Application examplesThe process of knowledge discoveryOrigins and contextMajor issues in knowledge discoveryA short overview of key techniques20Berendt: Advanced dat
24、abases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/20Main Contributing Areas of KDDDatabasesStore, access, search, update data (deduction)StatisticsInfer info from data (deduction & induction, mainly numeric data) Machine LearningComputer algorithms that improve automat
25、ically through experience (mainly induction, symbolic data)KDDdata warehouses:integrated dataOLAP: On-Line Analytical Processing21Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/21Data Mining: Classification SchemesGeneral functionalityn Descript
26、ive data mining n Predictive data miningDifferent views lead to different classificationsn Data view: Kinds of data to be minedn Knowledge view: Kinds of knowledge to be discoveredn Method view: Kinds of techniques utilizedn Application view: Kinds of applications adapted22Berendt: Advanced database
27、s, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/22Data Mining: Confluence of Multiple Disciplines Data MiningDatabase Technology StatisticsMachineLearningPatternRecognition Algorithm OtherDisciplinesVisualization23Berendt: Advanced databases, winter term 2007/08, http:/ww
28、w.cs.kuleuven.be/berendt/teaching/2007w/adb/23Why Not Traditional Data Analysis?Tremendous amount of datan Algorithms must be highly scalable to handle such as tera-bytes of dataHigh-dimensionality of data n Micro-array may have tens of thousands of dimensionsHigh complexity of datan Data streams an
29、d sensor datan Time-series data, temporal data, sequence data n Structure data, graphs, social networks and multi-linked datan Heterogeneous databases and legacy databasesn Spatial, spatiotemporal, multimedia, text and Web datan Software programs, scientific simulationsNew and sophisticated applicat
30、ions24Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/24AgendaMotivation: Application examplesThe process of knowledge discoveryOrigins and contextMajor issues in knowledge discoveryA short overview of key techniques25Berendt: Advanced databases,
31、 winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/25Data Mining: On What Kinds of Data?Database-oriented data sets and applicationsn Relational database, data warehouse, transactional databaseAdvanced data sets and advanced applications n Data streams and sensor datan Time-se
32、ries data, temporal data, sequence data (incl. bio-sequences) n Structure data, graphs, social networks and multi-linked datan Object-relational databasesn Heterogeneous databases and legacy databasesn Spatial data and spatiotemporal datan Multimedia databasen Text databasesn The World-Wide Web26Ber
33、endt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/26Data Mining FunctionalitiesMultidimensional concept description: Characterization and discriminationn Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regionsFrequent patte
34、rns, association, correlation vs. causalityn Diaper Beer 0.5%, 75% (Correlation or causality?)Classification and prediction n Construct models (functions) that describe and distinguish classes or concepts for future predictionl E.g., classify countries based on (climate), or classify cars based on (
35、gas mileage)n Predict some unknown or missing numerical values 27Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/27Data Mining Functionalities (2)Cluster analysisn Class label is unknown: Group data to form new classes, e.g., cluster houses to fi
36、nd distribution patternsn Maximizing intra-class similarity & minimizing interclass similarityOutlier analysisn Outlier: Data object that does not comply with the general behavior of the datan Noise or exception? Useful in fraud detection, rare events analysisTrend and evolution analysisn Trend and
37、deviation: e.g., regression analysisn Sequential pattern mining: e.g., digital camera large SD memoryn Periodicity analysisn Similarity-based analysisOther pattern-directed or statistical analyses28Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/
38、28Are All the “Discovered” Patterns Interesting?Data mining may generate thousands of patterns: Not all of them are interestingn Suggested approach: Human-centered, query-based, focused miningInterestingness measuresn A pattern is interesting if it is easily understood by humans, valid on new or tes
39、t data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measuresn Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.n Subjective: based on users belief
40、in the data, e.g., unexpectedness, novelty, actionability, etc.29Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/29Find All and Only Interesting Patterns?Find all the interesting patterns: Completenessn Can a data mining system find all the inter
41、esting patterns? Do we need to find all of the interesting patterns?n Heuristic vs. exhaustive searchn Association vs. classification vs. clusteringSearch for only interesting patterns: An optimization problemn Can a data mining system find only the interesting patterns?n Approachesl First general a
42、ll the patterns and then filter out the uninteresting onesl Generate only the interesting patternsmining query optimization30Berendt: Advanced databases, winter term 2007/08, http:/www.cs.kuleuven.be/berendt/teaching/2007w/adb/30Other Pattern Mining IssuesPrecise patterns vs. approximate patternsn A
43、ssociation and correlation mining: possible find sets of precise patternsl But approximate patterns can be more compact and sufficientl How to find high quality approximate patterns?n Gene sequence mining: approximate patterns are inherentl How to derive efficient approximate pattern mining algorithms?Constrained vs. non-constrained patternsn Why constraint-based mining?n What are the possible kinds of constraints? How to push constraints into the mining process?