Historical Perspective
- The relational model revolutionized transaction processing systems.
- DBMSs gave access to the data stored; OLTP systems are good at putting data into databases.
- The data explosion:
  - increase in the use of electronic data-gathering devices, e.g. point-of-sale terminals, remote sensing devices, etc.
  - data storage became easier and cheaper with increasing computing power
- Problems:
  - DBMSs gave access to the data stored but no analysis of the data
  - analysis is required to unearth the hidden relationships within the data, i.e. for decision support
  - the size of databases has increased (e.g. VLDBs); automated techniques for analysis are needed as databases have grown beyond manual extraction
- Obstacles:
  - the typical scientific user knew nothing of commercial business applications
  - the business database programmers knew nothing of massively parallel principles
  - the solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers

What is data mining?
- "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus)
- Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
- The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
- It is possible to strike gold in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one had noticed them before.
- Mining analogy: large volumes of data are sifted in an attempt to find something worthwhile; in a mining operation, large amounts of low-grade material are sifted through in order to find something of value.
- Books:
  - Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, ISBN 1-55860-489-8.
  - Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999, ISBN 1-55860-552-5.

Data Mining vs. DBMS
- DBMS: queries based on the data held, e.g.
  - last month's sales for each product
  - sales grouped by customer age
  - list of customers who lapsed their policy
- Data mining: infer knowledge from the data held to answer queries, e.g.
  - what characteristics do customers share who lapsed their policies, and how do they differ from those who renewed their policies?
  - why is the Cleveland division so profitable?
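As a rough illustration of the difference, the sketch below (pandas, with invented column names and values) answers a DBMS-style query by retrieving facts that are explicitly stored, then takes a first step towards the mining-style question by comparing attribute summaries of lapsed and renewed customers:

```python
import pandas as pd

# Hypothetical policy table; the column names and values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "age":         [23, 67, 45, 31, 58, 29],
    "income":      [18000, 42000, 55000, 21000, 61000, 19500],
    "status":      ["lapsed", "renewed", "renewed", "lapsed", "renewed", "lapsed"],
})

# DBMS-style query: retrieve stored facts,
# e.g. the list of customers who lapsed their policy.
lapsed = customers[customers["status"] == "lapsed"]["customer_id"].tolist()
print("Lapsed customers:", lapsed)

# Data-mining-style question: what characterises lapsed customers?
# A crude first step is to compare attribute summaries between the two groups.
print(customers.groupby("status")[["age", "income"]].mean())
```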
Characteristics of a data mining system
- Large quantities of data: the volume of data is so great that it has to be analyzed by automated techniques, e.g. POS data, satellite information, credit card transactions, etc.
- Noisy, incomplete data: imprecise data is characteristic of all data collection; databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct, e.g. some attributes rely on subjective or measurement judgements.
- Complex data structure: conventional statistical analysis is not possible.
- Heterogeneous data stored in legacy systems.

Who needs data mining?
- "Who(ever) has information fastest and uses it wins" (Don McKeough, former president of Coca-Cola)

Data Mining Applications
- Medicine: drug side effects, hospital cost analysis, genetic sequence analysis, prediction, etc.
- Finance: stock market prediction, credit assessment, fraud detection, etc.
- Marketing/sales: product analysis, buying patterns, sales prediction, target mailing, identifying unusual behaviour, etc.
- Knowledge acquisition:
  - expert systems are models of real-world processes
  - much of the information is available straight from the process, e.g. in production systems data is collected for monitoring the system
  - knowledge can be extracted using data mining tools
  - experts can verify the knowledge
- Engineering: automotive diagnostic expert systems, fault detection, etc.

Data Mining Goals
Classification
- The DM system learns from examples or from the data how to partition or classify the data, i.e. it formulates classification rules.
- Example: a customer database in a bank. Question: is a new customer applying for a loan a good investment or not?
- Typical rule formulated:
  if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes
  then INVESTMENT_TYPE = good
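A minimal sketch of how such a rule can be learned from examples, assuming scikit-learn is available; the 0/1 encoding of STATUS and HOUSE_OWNER and all the data values are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy bank-customer data (invented): columns are [married, income, house_owner].
X = [
    [1, 32000, 1],
    [1,  9000, 1],
    [0, 41000, 0],
    [1, 27000, 1],
    [0, 12000, 0],
    [1,  8000, 0],
]
y = ["good", "bad", "bad", "good", "bad", "bad"]

# The tree learns a partition of the data, i.e. it formulates
# classification rules of the same shape as the rule above.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(clf, feature_names=["married", "income", "house_owner"]))

# Classify a new loan applicant.
print(clf.predict([[1, 30000, 1]]))  # likely 'good' on this toy data
```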
Association
- Rules that associate one attribute of a relation to another.
- Set-oriented approaches are the most efficient means of discovering such rules.
- Example: supermarket database. 72% of all the records that contain items A and B also contain item C; the specific percentage of occurrences (72%) is the confidence factor of the rule.
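The confidence factor is simply a ratio of record counts. A small sketch, with invented baskets (so the figure here comes out at 75% rather than 72%):

```python
# Confidence of the rule {A, B} -> {C} over a toy set of market baskets.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]

with_ab  = [b for b in baskets if {"A", "B"} <= b]   # baskets containing A and B
with_abc = [b for b in with_ab if "C" in b]          # ... that also contain C

support    = len(with_abc) / len(baskets)   # fraction of all baskets
confidence = len(with_abc) / len(with_ab)   # fraction of the {A, B} baskets
print(f"support={support:.0%} confidence={confidence:.0%}")  # confidence=75% here
```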
Sequence/Temporal
- Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time.
- The difference between sequence rules and other rules is the temporal factor.
- Example: a retailer's database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven.
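A sketch of the temporal idea, using invented purchase histories: only items that occur before the target purchase in each customer's ordered history are counted, which is what distinguishes this from an ordinary association:

```python
from collections import Counter

# Time-ordered purchase histories per customer (invented data).
histories = {
    "c1": ["toaster", "cookware", "microwave"],
    "c2": ["kettle", "cookware", "microwave"],
    "c3": ["cookware", "lamp"],
    "c4": ["cookware", "microwave"],
}

# Count items that appear before a microwave purchase in the same history.
before_microwave = Counter()
for items in histories.values():
    if "microwave" in items:
        before_microwave.update(items[: items.index("microwave")])

print(before_microwave.most_common())  # cookware precedes a microwave most often
```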
Data Mining and Machine Learning
- Data Mining (DM), or Knowledge Discovery in Databases (KDD), is about finding understandable knowledge.
- Machine Learning (ML) is concerned with improving the performance of an agent; training a neural network to balance a pole is part of ML, but not of KDD.
- Efficiency of the algorithm and scalability are more important in DM/KDD.
- DM is concerned with very large, real-world databases; ML typically looks at smaller data sets.
- ML has laboratory-type examples for the training set; DM deals with real-world data.
- Real-world data tend to have problems such as: missing values, dynamic data, noise.

Statistical Data Analysis
- Ill-suited for nominal and structured data types.
- Completely data driven; incorporation of domain knowledge is not possible.
- Interpretation of results is difficult and daunting.
- Requires expert user guidance.

Stages of the Data Mining Process
- Data pre-processing: heterogeneity resolution, data cleansing, data warehousing.
- Applying data mining tools: extraction of patterns from the pre-processed data.
- Interpretation and evaluation: the user bias can direct DM tools to areas of interest, e.g.
  - attributes of interest in databases
  - goal of discovery
  - domain knowledge
  - prior knowledge or belief about the domain

Techniques
- Machine learning methods.
- Statistics: can be used in several data mining stages (a small sketch follows this list):
  - data cleansing, i.e. the removal of erroneous or irrelevant data
  - EDA (exploratory data analysis), e.g. frequency counts, histograms, etc.
  - data selection: sampling facilities reduce the scale of computation
  - attribute re-definition
  - data analysis: measures of association and relationships between attributes, interestingness of rules, classification, etc.
- Visualization: enhances EDA and makes patterns more visible.
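A small EDA sketch of the frequency-count, histogram and sampling facilities mentioned above (pandas, with invented transaction data):

```python
import pandas as pd

# Invented transaction attributes for illustration.
sales = pd.DataFrame({
    "region": ["N", "S", "N", "E", "S", "N", "W", "E"],
    "amount": [120, 75, 210, 40, 95, 300, 60, 180],
})

print(sales["region"].value_counts())                  # frequency counts
print(pd.cut(sales["amount"], bins=3).value_counts())  # crude histogram by binning
print(sales.sample(n=4, random_state=0))               # sampling to reduce scale
```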
Clustering (Cluster Analysis)
- Clustering and segmentation basically partition the database so that each partition or group is similar according to some criterion or metric.
- Clustering according to similarity is a concept which appears in many disciplines, e.g. in chemistry the clustering of molecules.
- Data mining applications make use of clustering according to similarity, e.g. to segment a client/customer base.
- It provides sub-groups of a population for further analysis or action, which is very important when dealing with very large databases.
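A minimal customer-segmentation sketch, assuming scikit-learn's k-means is an acceptable stand-in for the clustering algorithm; the attributes and values are invented:

```python
from sklearn.cluster import KMeans

# Invented customer attributes: [age, annual spend].
customers = [
    [22,  300], [25,  450], [31,  500],   # younger, low-spend
    [45, 2200], [48, 2500], [52, 2100],   # middle-aged, high-spend
    [70,  900], [66,  800],               # older, mid-spend
]

# Partition the customer base into three similarity-based segments.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
for row, label in zip(customers, km.labels_):
    print(row, "-> segment", label)
```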
Knowledge Representation Methods
Neural networks
- A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze.
- It provides projections given new situations of interest and answers "what if" questions.
- Problems include:
  - the resulting network is viewed as a black box
  - no explanation of the results is given, i.e. it is difficult for the user to interpret the results
  - difficult to incorporate user intervention
  - slow to train due to their iterative nature
Decision trees
- Used to represent knowledge.
- Built using a training set of data and can then be used to classify new objects.
- Problems are:
  - opaque structure, which is difficult to understand
  - missing data can cause performance problems
  - they become cumbersome for large data sets
Rules
- Probably the most common form of representation.
- Tend to be simple and intuitive, unstructured and less rigid.
- Problems are:
  - difficult to maintain
  - inadequate to represent many types of knowledge
- Example format: if X then Y
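One simple way to hold an "if X then Y" rule as data, purely for illustration (the attribute names reuse the earlier loan example and are hypothetical):

```python
# An "if X then Y" rule kept as data: easy to inspect, but, as noted above,
# flat rule sets like this become hard to maintain as they grow.
rule = {
    "if":   lambda r: r["status"] == "married" and r["income"] > 10000 and r["house_owner"],
    "then": ("investment_type", "good"),
}

record = {"status": "married", "income": 30000, "house_owner": True}
if rule["if"](record):
    attr, value = rule["then"]
    print(f"{attr} = {value}")  # investment_type = good
```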
Related Technologies: Data Warehousing
- Definition: a data warehouse can be defined as any centralized data repository which can be queried for business benefit.
- Warehousing makes it possible to:
  - extract archived operational data
  - overcome inconsistencies between different legacy data formats
  - integrate data throughout an enterprise, regardless of location, format, or communication requirements
  - incorporate additional or expert information
- Characteristics of a data warehouse:
  - subject-oriented: data are organized by subject instead of application, e.g. an insurance company would organize its data by customer, premium, and claim, instead of by different products (auto, life, etc.); the warehouse contains only the information necessary for decision support processing
  - integrated: encoding of data is often inconsistent, e.g. gender might be coded as "m"/"f" or 0/1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention (a small recoding sketch follows this list)
  - time-variant: the data warehouse is a place for storing data that are five to ten years old, or older; these data are used for comparisons, trends, and forecasting, and are not updated
  - non-volatile: data are not updated or changed in any way once they enter the data warehouse; they are only loaded and accessed
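A sketch of that recoding step with pandas; the choice of "M"/"F" as the warehouse convention, and the assumption that 0 means male and 1 means female in the second source, are illustrative only:

```python
import pandas as pd

# Two operational sources encode gender differently, as in the example above.
source_a = pd.DataFrame({"customer_id": [1, 2], "gender": ["m", "f"]})
source_b = pd.DataFrame({"customer_id": [3, 4], "gender": [0, 1]})

# Map both encodings onto one warehouse convention before loading.
recode = {"m": "M", "f": "F", 0: "M", 1: "F"}   # 0 -> male is an assumed convention
warehouse = pd.concat([source_a, source_b], ignore_index=True)
warehouse["gender"] = warehouse["gender"].map(recode)
print(warehouse)
```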
Data Warehousing Processes
- Insulate data, i.e. the current operational information: this preserves the security and integrity of mission-critical OLTP applications while giving access to the broadest possible base of data.
- Retrieve data from a variety of heterogeneous operational databases: the data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition).
- Metadata: information describing the model and the definition of the source data elements.
- Data cleansing: removal of certain aspects of operational data, such as low-level transaction information, which slow down query times (see the sketch after this list).
- Transfer: the processed data are transferred to the data warehouse, a large database on a high-performance box.
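A sketch of such a cleansing/transformation step, with pandas and invented till data: low-level line items are collapsed into the daily summaries that the warehouse will actually be queried on:

```python
import pandas as pd

# Raw operational transactions (invented): one row per till receipt line.
ops = pd.DataFrame({
    "date":    ["2024-01-01"] * 3 + ["2024-01-02"] * 2,
    "product": ["tea", "tea", "milk", "tea", "milk"],
    "amount":  [2.0, 2.0, 1.5, 2.0, 3.0],
})

# Drop the low-level line items and keep only a daily summary per product.
daily = ops.groupby(["date", "product"], as_index=False)["amount"].sum()
print(daily)  # this summary table is what gets transferred to the warehouse
```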
Criteria for a Data Warehouse
- Load performance: requires incremental loading of new data on a periodic basis; must not artificially constrain the volume of data.
- Load processing: data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update.
- Data quality management: ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
- Query performance: must not be slowed or inhibited by the performance of the data warehouse RDBMS.
- Terabyte scalability: data warehouse sizes are growing at astonishing rates, so the RDBMS must not have any architectural limitations; it must support modular and parallel management.
- Mass user scalability: access to warehouse data must not be limited to an elite few; the warehouse has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
- Networked data warehouse: data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation.
- Warehouse administration: the large scale and time-cyclic nature of the data warehouse demand administrative ease and flexibility.
- Integrated dimensional analysis: dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools.
- Advanced query functionality: end users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data.

Data Warehousing vs. OLTP
- OLTP systems are designed to maximize transaction capacity, but they:
  - cannot be repositories of facts and historical data for business analysis
  - cannot quickly answer ad hoc queries; rapid retrieval is almost impossible
  - hold data that is inconsistent and changing; duplicate entries exist and entries can be missing
- OLTP offers large amounts of raw data which is not easily understood.
- A typical OLTP query is a simple aggregation, e.g. what is the current account balance for this customer?
- Data warehouses are interested in query processing as opposed to transaction processing.
- A typical business analysis query is, e.g., which product line sells best in middle America, and how does this correlate to demographic data?
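The contrast in query style can be sketched with pandas on invented tables (the account, sales and demographic columns are hypothetical):

```python
import pandas as pd

accounts = pd.DataFrame({"customer_id": [1, 2, 3], "balance": [250.0, 1200.0, 80.0]})
sales = pd.DataFrame({
    "region":       ["midwest", "midwest", "coastal", "coastal"],
    "product_line": ["garden", "kitchen", "garden", "kitchen"],
    "revenue":      [900, 400, 300, 1100],
})
demographics = pd.DataFrame({
    "region":        ["midwest", "coastal"],
    "median_income": [52000, 68000],
})

# Typical OLTP query: a single-row lookup against current data.
print(accounts.loc[accounts["customer_id"] == 2, "balance"])

# Typical warehouse query: aggregate history and relate it to other data,
# e.g. best-selling product line per region joined to demographic data.
by_region = sales.groupby(["region", "product_line"], as_index=False)["revenue"].sum()
print(by_region.merge(demographics, on="region"))
```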
OLAP (On-Line Analytical Processing)
- The problem is how to process larger and larger databases.
- OLAP involves many data items (many thousands or even millions) which are involved in complex relationships.
- Fast response is crucial in OLAP.
- Difference between OLAP and OLTP:
  - OLTP servers handle mission-critical production data accessed through simple queries
  - OLAP servers handle management-critical data accessed through an iterative analytical investigation
- OLAP operations:
  - Consolidation: involves the aggregation of data, i.e. simple roll-ups or complex expressions involving inter-related data, e.g. sales offices can be rolled up to districts and districts rolled up to regions.
  - Drill-down: goes in the reverse direction, i.e. automatically displays the detail data which comprises the consolidated data.
  - "Slicing and dicing": the ability to look at the database from different viewpoints, e.g. one slice of the sales database might show all sales of product type within regions, while another slice might show all sales by sales channel within each product type; often performed along a time axis in order to analyze trends and find patterns.
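The three operations can be sketched with pandas on an invented office/district/region hierarchy:

```python
import pandas as pd

# Invented sales-office facts with the hierarchy office -> district -> region.
sales = pd.DataFrame({
    "region":   ["north", "north", "north", "south", "south"],
    "district": ["d1",    "d1",    "d2",    "d3",    "d3"],
    "office":   ["o1",    "o2",    "o3",    "o4",    "o5"],
    "channel":  ["web",   "store", "web",   "store", "web"],
    "sales":    [100,     150,     80,      200,     120],
})

# Consolidation (roll-up): offices -> districts -> regions.
print(sales.groupby(["region", "district"])["sales"].sum())
print(sales.groupby("region")["sales"].sum())

# Drill-down is the reverse: show the office-level detail behind one region.
print(sales[sales["region"] == "north"])

# Slicing/dicing: look at the same data from another viewpoint,
# e.g. sales by channel within each region.
print(sales.pivot_table(values="sales", index="region", columns="channel", aggfunc="sum"))
```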