1、A Quick Introduction to Approximate Query Processing Part II,CS286, Spring2007 Minos Garofalakis,Decision Support Systems,Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration. OLAP: Complex SQL queries and v
2、iews. Queries based on spreadsheet-style operations and “multidimensional” view of data. Interactive and “online” queries. Data Mining: Exploratory search for interesting trends and anomalies. (Another lecture!),Approximate Query Processing using Data Synopses,How to construct effective data synopse
3、s ?,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,GB/TB,Relations as Frequency Distributions,8 10 10,30 20 50,25 8 15,salary,age,name,age,salary,sales,sales,One-dimensional distribution,Age (attribute domain values),tuple counts,Three-dimensional distribution,tuple count
4、s,Outline,Intro & Approximate Query Answering Overview Synopses, System architectures, Commercial offerings One-Dimensional Synopses Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning Samples: Basics, Sampling from DBs, Reservoir Sampling Wavelets: 1-D Haar-wavelet h
5、istogram construction & maintenance Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions,One-Dimensional Haar Wavelets,Wavelets: mathematical tool for hierarchical decomposition of functions/signals Haar wavelets: simplest wavelet b
6、asis, easy to understand and implement Recursive pairwise averaging and differencing at different resolutions,Resolution Averages Detail Coefficients,2, 2, 0, 2, 3, 5, 4, 4,2, 1, 4, 4,0, -1, -1, 0,-,3,2,1,0,Haar Wavelet Coefficients,Coefficient “Supports”,Hierarchical decomposition structure (a.k.a.
7、 “error tree”),Original data,Wavelet-based Histograms MVW98,Problem: range-query selectivity estimation Key idea: use a compact subset of Haar/linear wavelet coefficients for approximating the data distribution Steps compute (cumulative) data distribution C compute Haar (or linear) wavelet transform
8、 of C coefficient thresholding : only b|C| coefficients can be kept take largest coefficients in absolute normalized value Haar basis: divide coefficients at resolution j by Optimal in terms of the overall Mean Squared (L2) Error Greedy heuristic methods Retain coefficients leading to large error re
9、duction Throw away coefficients that give small increase in error,Using Wavelet-based Histograms,Selectivity estimation: sel(a= X= b) = Cb - Ca-1 C is the (approximate) “reconstructed” cumulative distribution Time: O(minb, logN), where b = size of wavelet synopsis (no. of coefficients), N= size of d
10、omain,At most logN+1 coefficients are needed to reconstruct any C value,Haar Wavelet Coefficients,Reconstruct data values d(i) d(i) = (+/-1) * (coefficient on path) Range sum calculation d(l:h) d(l:h) = simple linear combination of coefficients on paths to l, h Only O(logN) terms,Original data,3 = 2
11、.75 - (-1.25) + 0 + (-1),6 = 4*2.75 + 4*(-1.25),Dynamic Maintenance of Wavelet-based Histograms MVW00,Build Haar-wavelet synopses on the original data distributionKey issues with dynamic wavelet maintenance Change in single distribution value can affect the values of many coefficients (path to the r
12、oot of the decomposition tree),d+,As distribution changes, “most significant” (e.g., largest) coefficients can also change! Important coefficients can become unimportant, and vice-versa,Effect of Distribution Updates,Key observation: for each coefficient c in the Haar decomposition tree c = ( AVG(le
13、ftChildSubtree(c) - AVG(rightChildSubtree(c) ) / 2,Only coefficients on path(d) are affected and each can be updated in constant time,Maintenance Architecture,“Shake up” when log reaches max size: for each insertion at d for each coefficient c on path(d) and in H : update c for each coefficient c on
14、 path(d) and not in H or H: insert c into H with probability proportional to 1/2h, where h is the “height” of c (Probabilistic Counting FM85) Adjust H and H (move largest coefficients to H),m+m top coefficients,m,m,INSERTIONS/ DELETIONS,Problems with Conventional Wavelet Synopses,An example data vec
15、tor and wavelet synopsis (|D|=16, B=8 largest coefficients retained),Original Data Values 127 71 87 31 59 3 43 99 100 42 0 58 30 88 72 130,Wavelet Answers 65 65 65 65 65 65 65 65 100 42 0 58 30 88 72 130,Large variation in answer quality Within the same data set, when synopsis is large, when data va
16、lues are about the same, when actual answers are about the same Heavily-biased approximate answers! Root causes Thresholding for aggregate L2 error metric Independent, greedy thresholding ( large regions without any coefficient!) Heavy bias from dropping coefficients without compensating for loss,Ap
17、proach: Optimize for Maximum-Error Metrics,Key metric for effective approximate answers: Relative error with sanity boundSanity bound “s” to avoid domination by small data valuesTo provide tight error guarantees for all reconstructed data valuesMinimize maximum relative error in the data reconstruct
18、ionAnother option: Minimize maximum absolute errorAlgorithms can be extended to general “distributive” metrics (e.g., average relative error),Minimize,Our Approach: Deterministic Wavelet Thresholding for Maximum Error,Key Idea: Dynamic-Programming formulation that conditions the optimal solution on
19、the error that “enters” the subtree (through the selection of ancestor nodes),Our DP table:Mj, b, S = optimal maximum relative (or, absolute) error in T(j) with space budget of b coefficients (chosen in T(j), assuming subset S of js proper ancestors have already been selected for the synopsis Clearl
20、y, |S| minB-b, logN+1 Want to compute M0, B, ,Basic Observation: Depth of the error tree is only logN+1 we can explore and tabulate all S-subsets for a given node at a space/time cost of only O(N) !,Base Case for DP Recurrence: Leaf (Data) Nodes,Base case in the bottom-up DP computation: Leaf (i.e.,
21、 data) node Assume for simplicity that data values are numbered N, , 2N-1,Again, time/space complexity per leaf node is only O(N),Mj, b, S is not defined for b0 Never allocate space to leaves For b=0,for each coefficient subset with |S| minB, logN+1 Similarly for absolute error,DP Recurrence: Intern
22、al (Coefficient) Nodes,Two basic cases when examining node/coefficient j for inclusion in the synopsis: (1) Drop j; (2) Keep j,In this case, the minimum possible maximum relative error in T(j) is,+,-,root=0,S = subset of selected j-ancestors,Case (1): Drop Coefficient j,Optimally distribute space b
23、between js two child subtrees Note that the RHS of the recurrence is well-defined Ancestors of j are obviously ancestors of 2j and 2j+1,DP Recurrence: Internal (Coefficient) Nodes (cont.),In this case, the minimum possible maximum relative error in T(j) is,Case (2): Keep Coefficient j,Take 1 unit of
24、 space for coefficient j, and optimally distribute remaining space Selected subsets in RHS change, since we choose to retain j Again, the recurrence RHS is well-defined,+,-,root=0,S = subset of selected j-ancestors,Finally, define Overall complexity: time, space,Outline,Intro & Approximate Query Ans
25、wering Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Multi-dimensional Histograms Join sampling Multi-dimensional Haar Wavelets Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Conclusions,Relations as Frequency Distributions,8 10 10,30 20
26、50,25 8 15,salary,age,name,age,salary,sales,sales,One-dimensional distribution,Age (attribute domain values),tuple counts,Three-dimensional distribution,tuple counts,Multi-dimensional Data Synopses,Problem: Approximate the joint data distribution of multiple attributes,Motivation Selectivity estimat
27、ion for queries with multiple predicates Approximating OLAP data cubes and general relations,Conventional approach: Attribute-Value Independence (AVI) assumption sel(p(A1) & p(A2) & . . .) = sel(p(A1) * sel(p(A2) * . . . Simple - one-dimensional marginals suffice BUT: almost always inaccurate, gross
28、 errors in practice (e.g., Chr84, FK97, Poo97,Multi-dimensional Histograms,Use small number of multi-dimensional buckets to directly approximate the joint data distribution Uniform spread & frequency approximation within buckets n(i) = no. of distinct values along Ai, F = total bucket frequency appr
29、oximate data points on a n(1)*n(2)*. . . uniform grid, each with frequency F / (n(1)*n(2)*. . .),20,40,35,90,120,Actual Distribution (ONE BUCKET),Multi-dimensional Histogram Construction,Construction problem is much harder even for two dimensions MPS99 Multi-dimensional equi-depth histograms MD88 Fi
30、x an ordering of the dimensions A1, A2, . . ., Ak, let kth root of desired no. of buckets, initialize B = data distribution For i=1, . . ., k: Split each bucket in B in equi-depth partitions along Ai; return resulting buckets to B Problems: limited set of bucketizations; fixed and fixed dimension or
31、dering can result in poor partitionings,MHIST-p histograms PI97 At each step Choose the bucket b in B containing the attribute Ai whose marginal is the most in need of partitioning Split b along Ai into p (e.g., p=2) buckets,Equi-depth vs. MHIST Histograms,A1,A2,Equi-depth (a1=2,a2=3) MD88,A1,A2,MHI
32、ST-2 (MaxDiff) PI97,MHIST: choose bucket/dimension to split based on its criticality ; allows for much larger class of bucketizations (hierarchical space partitioning) Experimental results verify superiority over AVI and equi-depth,450 280 340,460360250,Other Multi-dimensional Histogram Techniques -
33、 GENHIST GKT00,Key idea: allow for overlapping histogram buckets Allows for a much larger no. of distinct frequency regions for a given space budget (= #buckets),Greedy construction algorithm: Consider increasingly-coarser grids At each step select the cell(s) c of highest density and move enough ra
34、ndomly-selected points from c into a bucket to make c and its neighbors “close-to-uniform” Truly multi-dimensional “split decisions” based on tuple density - unlike MHIST,d,a,b,c,Other Multi-dimensional Histogram Techniques - STHoles BCG01,Multi-dimensional, workload-based histograms Allow bucket ne
35、sting - “bucket tree” Intercept query result stream and count |q b| for each bucket b ( 10% overhead in MS SQL Server 2000) Drill “holes” in b for regions of different tuple density and “pull” them out as children of b (first-class buckets) Consolidate/merge buckets of similar densities (keep #bucke
36、ts constant),200,100,300,150,0 1 2 3 4 5 6 7 8 9,3 1 0 3 7 3 7 1 4 2 4 0 1 2 1 2 7 0 8 5 1 9 1 0 7 1 3 8 2 0,Sampling for Multi-D Synopses,Taking a sample of the rows of a table captures the attribute correlations in those rows Answers are unbiased & confidence intervals apply Thus guaranteed accura
37、cy for count, sum, and average queries on single tables, as long as the query is not too selectiveProblem with joins AGP99,CMN99: Join of two uniform samples is not a uniform sample of the join Join of two samples typically has very few tuples,Foreign Key Join 40% Samples in Red Size of Actual Join
38、= 30,Join Synopses for Foreign-Key Joins AGP99,Based on sampling from materialized foreign key joins Typically 10% added space required Yet, can be used to get a uniform sample of ANY foreign key join Plus, fast to incrementally maintainSignificant improvement over using just table samples E.g., for TPC-H query Q5 (4 way join) 1%-6% relative error vs. 25%-75% relative error, for synopsis size = 1.5%, selectivity ranging from 2% to 10%10% vs. 100% (no answer!) error, for size = 0.5%, select. = 3%,