Approximate Query Processing: Taming the TeraBytes! A Tutorial
Minos Garofalakis and Phillip B. Gibbons
Information Sciences Research Center, Bell Laboratories
http:/www.bell- pbgibbons/

Outline
- Intro & Approximate Query Answering Overview: Synopses, System architecture, Commercial offerings
- One-Dimensional Synopses: Histograms, Samples, Wavelets
- Multi-Dimensional Synopses and Joins: Multi-D Histograms, Join synopses, Wavelets
- Set-Valued Queries: Using Histograms, Samples, Wavelets
- Discussion & Comparisons
- Advanced Techniques & Future Directions: Dependency-based, Workload-tuned, Streaming data
- Conclusions
3、usions,Introduction & Motivation,Exact answers NOT always required DSS applications usually exploratory: early feedback to help identify “interesting” regions Aggregate queries: precision to “last decimal” not needed e.g., “What percentage of the US sales are in NJ?” (display as bar graph) Preview a
4、nswers while waiting. Trial queries Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,Fast Approximate Answers,Primarily for Aggregate queries Goal is to quic
5、kly report the leading digits of answers In seconds instead of minutes or hours Most useful if can provide error guaranteesE.g., Average salary$59,000 +/- $500 (with 95% confidence) in 10 secondsvs. $59,152.25 in 10 minutesAchieved by answering the query based on samples or other synopses of the dat
- Speed-up obtained because synopses are orders of magnitude smaller than the original data

Approximate Query Answering
- Basic Approach 1: Online Query Processing, e.g., the Control Project [HHW97, HH99, HAR00]
  - Sampling at query time
  - Answers continually improve, under user control

Approximate Query Answering
- Basic Approach 2: Precomputed Synopses
  - Construct & store synopses prior to query time
  - At query time, use the synopses to answer the query
  - Like estimation in query optimizers, but reported to the user, so higher accuracy is needed, and for more general queries
  - Need to keep the synopses up-to-date
- Most work in the area is based on the precomputed approach,
  e.g., Sample Views [OR92, Olk93], the Aqua Project [GMP97a, AGP99], etc.

The Aqua Architecture
Picture without Aqua:
- User poses a query Q
- Data warehouse executes Q and returns the result
- Warehouse is periodically updated with new data
[Figure: Browser/Excel -> SQL Query Q over the Network -> Data Warehouse (e.g., Oracle) -> Result as HTML/XML; Warehouse Data Updates flow into the warehouse]

The Aqua Architecture [GMP97a, AGP99]
Picture with Aqua:
- Aqua is middleware, between the user and the warehouse
- Aqua synopses are stored in the warehouse
- Aqua intercepts the user query and rewrites it to be a query Q' on the synopses; the data warehouse returns an approximate answer (with error bounds)
[Figure: as above, with AQUA Synopses stored inside the warehouse and an AQUA Tracker monitoring the Warehouse Data Updates]
Online vs. Precomputed
Online:
+ Continuous refinement of answers (online aggregation)
+ User control: what to refine, when to stop
+ Seeing the query is very helpful for fast approximate results
+ No maintenance overheads
+ See the [HH01] Online Query Processing tutorial for details
Precomputed:
+ Seeing the entire data is very helpful (provably & in practice), though synopses must be constructed for a family of queries
+ Often faster: better access patterns, and small synopses can reside in memory or cache
+ Middleware: can be used with any DBMS; no special "index striding" support needed
+ Also effective for remote or streaming data

Commercial DBMS
- Oracle, IBM Informix: sampling operator (online)
- IBM DB2: "IBM Almaden is working on a prototype version of DB2 that supports sampling. The user specifies a priori the amount of sampling to be done."
- Microsoft SQL Server: "New auto statistics extract statistics [e.g., histograms] using fast sampling, enabling the Query Optimizer to use the latest information." The index tuning wizard uses sampling to build statistics.
- See [CN97, CMN98, CN98]
- In summary, not much announced yet

Outline
- Intro & Approximate Query Answering Overview
- One-Dimensional Synopses
  - Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning
  - Samples: Basics, Sampling from DBs, Reservoir Sampling
  - Wavelets: 1-D Haar-wavelet histogram construction & maintenance
- Multi-Dimensional Synopses and Joins
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions
- Conclusions

Histograms
- Partition the domain of the attribute value(s) into a set of buckets
- Issues: how to partition, what to store for each bucket, how to estimate an answer using the histogram
- Long history of use for selectivity estimation within a query optimizer [Koo80, PSC84, etc.]
- [PIH96, Poo97] introduced a taxonomy, algorithms, etc.
1-D Histograms: Equi-Depth
- Goal: an equal number of rows per bucket (B buckets in all)
- Can construct by first sorting, then taking B-1 equally-spaced splits (see the sketch below)
- Faster construction: sample & take equally-spaced splits in the sample; buckets are then only nearly equal
- Can also use one-pass quantile algorithms (e.g., [GK01])
[Figure: equi-depth bucket counts over domain values 1..20]
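The construction above fits in a few lines. Below is a minimal Python sketch (an illustration, not code from the tutorial): it takes B-1 equally-spaced splits from the sorted data, or from a uniform random sample when one is requested.

```python
import random

def equi_depth_boundaries(values, B, sample_size=None):
    """Split points (B-1 of them) for an equi-depth histogram.

    With sample_size set, the splits are taken from a uniform random
    sample instead of the fully sorted data: faster, but the buckets
    are then only approximately equal in size.
    """
    data = values if sample_size is None else random.sample(values, sample_size)
    data = sorted(data)
    n = len(data)
    # B-1 equally-spaced positions in the sorted (possibly sampled) data
    return [data[(i * n) // B] for i in range(1, B)]
```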
1-D Histograms: Equi-Depth (Maintenance)
- Can maintain using one-pass algorithms (insertions only), or
- Use a backing sample [GMP97b]: maintain a larger sample on disk in support of histogram maintenance
  - Keep histogram bucket counts up-to-date by incrementing on row insertion and decrementing on row deletion
  - Merge adjacent buckets with small counts
  - Split any bucket with a large count, using the sample to select a split value, i.e., take the median of the sample points in the bucket's range
  - Keeps counts within a factor of 2; for more nearly equal buckets, recompute the histogram from the sample
[Figure: equi-depth bucket counts over domain values 1..20, after a split]
1-D Histograms: Compressed [PIH96]
- Create singleton buckets for the largest values, equi-depth over the rest
- Improvement over equi-depth, since we get exact information on the largest values; e.g., join estimation in DB2 compares the largest values in the two relations
- Construction: sorting + O(B log B) + one pass; can use a sample
- Maintenance: split & merge approach as with equi-depth, but must also decide when to create and remove singleton buckets [GMP97b]
[Figure: compressed histogram over domain values 1..20, with singleton buckets for the most frequent values]

1-D Histograms: V-Optimal
- [IP95] defined the V-optimal histogram and gave a dynamic-programming construction:
  - AVGF[i:j] = average frequency of values i..j
  - SSE[i:j] = sum_{k=i..j} F_k^2 - (j-i+1) * AVGF[i:j]^2
  - For i = 1..N, compute the prefix sums P_i = sum_{k=1..i} F_k and Q_i = sum_{k=1..i} F_k^2; then any SSE[i:j] can be computed in constant time
  - Let SSEP(i,k) = minimum SSE for F_1..F_i using k buckets; then
    SSEP(i,k) = min_{j=1..i-1} ( SSEP(j,k-1) + SSE[j+1:i] ),
    i.e., it suffices to consider all possible left boundaries for the kth bucket (see the sketch below)
- [IP95] also gave faster approximation algorithms
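The dynamic program translates directly into code. This sketch (illustrative, O(N^2 * B) time) precomputes the prefix sums P and Q so that each SSE[i:j] costs O(1), then fills the SSEP table:

```python
def v_optimal_sse(F, B):
    """Minimum total SSE of a B-bucket histogram over frequencies F,
    via the [IP95] dynamic program."""
    N = len(F)
    # P[i] = F_1 + ... + F_i,  Q[i] = F_1^2 + ... + F_i^2
    P = [0.0] * (N + 1)
    Q = [0.0] * (N + 1)
    for i, f in enumerate(F, 1):
        P[i] = P[i - 1] + f
        Q[i] = Q[i - 1] + f * f

    def sse(i, j):  # SSE[i:j] = sum F_k^2 - (j-i+1) * AVGF[i:j]^2
        s = P[j] - P[i - 1]
        return (Q[j] - Q[i - 1]) - s * s / (j - i + 1)

    INF = float("inf")
    # SSEP[i][k] = min SSE for F_1..F_i using k buckets
    SSEP = [[INF] * (B + 1) for _ in range(N + 1)]
    SSEP[0][0] = 0.0
    for i in range(1, N + 1):
        for k in range(1, min(B, i) + 1):
            # try all possible left boundaries j+1 for the kth bucket
            SSEP[i][k] = min(SSEP[j][k - 1] + sse(j + 1, i)
                             for j in range(k - 1, i))
    return SSEP[N][B]
```

Tracking the minimizing j values alongside SSEP recovers the actual bucket boundaries.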
Answering Queries: Equi-Depth
- E.g., select count(*) from R where 4 <= R.A <= 15
- Approximate answer: F * |R|/B, where F = the number of buckets, including fractions of buckets, that overlap the range (see the sketch below)
- Error guarantee: 2 * |R|/B
[Figure: B = 6 equi-depth buckets over domain values 1..20; the range 4 <= R.A <= 15 covers several buckets fully, and a bucket it only half-covers contributes 0.5 * |R|/6 to the estimate]
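A hypothetical sketch of the estimation rule, assuming values are spread evenly across each bucket's extent:

```python
def estimate_range_count(edges, total_rows, lo, hi):
    """count(*) estimate for lo <= A <= hi from an equi-depth histogram
    with bucket edges [e_0, e_1, ..., e_B]; each bucket holds roughly
    total_rows / B rows."""
    B = len(edges) - 1
    F = 0.0
    for i in range(B):
        left, right = edges[i], edges[i + 1]
        width = right - left
        overlap = max(0.0, min(hi, right) - max(lo, left))
        if width > 0:
            F += overlap / width  # fraction of bucket i inside the range
    return F * total_rows / B
```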
Answering Queries: Histograms
- Answering queries from 1-D histograms in general: (implicitly) map the histogram back to an approximate relation, and apply the query to that approximate relation
- Continuous value mapping [SAC79]: the bucket count is spread evenly among the values in the bucket
- Uniform spread mapping [PIH96]: the bucket count is divided among equally-spaced values within the bucket
Self-Tuning 1-D Histograms [AC99]
1. Tune bucket frequencies:
- Compare the actual selectivity of each query to the histogram estimate
- Use the difference to adjust the bucket frequencies: divide d * Error proportionately among the buckets overlapping the query range, where d is a dampening factor (see the sketch below)
- E.g., for a query range with Actual = 60 and Estimate = 40, Error = +20; with d = 0.5, d * Error = +10, divided proportionately as +4, +3, +3
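A minimal sketch of the frequency-tuning step; the proportional-division rule below is one reasonable reading of [AC99], and all names are illustrative:

```python
def tune_frequencies(freqs, overlap, actual, estimate, d=0.5):
    """One feedback step: spread d * (actual - estimate) over the
    buckets overlapping the query range, in proportion to each
    bucket's contribution to the estimate.

    freqs   : current bucket frequencies (adjusted in place)
    overlap : per-bucket fraction covered by the query range
    d       : dampening factor, to avoid over-reacting to one query
    """
    error = d * (actual - estimate)
    weights = [f * o for f, o in zip(freqs, overlap)]
    total = sum(weights)
    if total > 0:
        for i, w in enumerate(weights):
            freqs[i] += error * w / total
```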
Self-Tuning 1-D Histograms [AC99] (cont.)
2. Restructure:
- Merge buckets of near-equal frequencies
- Split buckets with large frequencies
- Also extends to multi-D

Sampling: Basics
- Idea: a small random sample S of the data often represents all the data well
- For a fast approximate answer, apply the query to S and "scale" the result
- E.g., R.a takes values 0 and 1, and S is a 20% sample:
    select count(*) from R where R.a = 0
  becomes
    select 5 * count(*) from S where S.a = 0
[Figure: 30 values of R.a, with the rows in S highlighted; Est. count = 5*2 = 10, Exact count = 10]
- Unbiased: for expressions involving count, sum, and avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!
- Leverage the extensive literature on confidence intervals for sampling: the actual answer lies within an interval [a, b] with a given probability, e.g., $54,000 +/- $600 with probability 90%

Sampling: Confidence Intervals
- With a predicate, S above is the subset of the sample that satisfies the predicate
- The quality of the estimate depends only on the variance in R and on |S| after the predicate, so a 10K-row sample may suffice for a 10-billion-row relation!
- Advantage of larger samples: can handle more selective predicates
- Confidence intervals for Average: select avg(R.A) from R, where R.A can be replaced by any arithmetic expression on the attributes of R; sigma(R) = standard deviation of the values of R.A, sigma(S) = standard deviation of the values of S.A (see the sketch below)
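As an illustration (not from the tutorial), a large-sample confidence interval for the average can be computed directly from the sample, using the Central Limit Theorem approximation avg +/- z * sigma(S) / sqrt(|S|):

```python
import math

def avg_with_ci(sample, z=1.96):
    """Estimate avg(R.A) from a uniform sample S, with a CLT-based
    confidence interval; z = 1.96 gives roughly 95% confidence."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)  # (estimate, half-width)
```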
Sampling from Databases
- Sampling disk-resident data is slow
  - Row-level sampling has high I/O cost: an entire disk block must be read to get one row
  - Block-level sampling: rows within a block may be highly correlated
  - Random access pattern, possibly via an index
  - Acceptance/rejection sampling is needed to account for the variable number of rows in a page, children in an index node, etc.
- Alternatives
  - Random physical clustering: destroys the "natural" clustering
  - Precomputed samples: must be incrementally maintained (at a specified size); fast to use: packed in disk blocks, can be scanned sequentially, can be stored as a relation to leverage full DBMS query support, can be kept in main memory
One-Pass Uniform Sampling
- Best choice for incremental maintenance: low overheads, no random data access
- Reservoir Sampling [Vit85]: maintains a sample S of a fixed size M (see the sketch below)
  - Add each new item to S with probability M/N, where N is the current number of data items
  - If an item is added, evict a random item from S
  - Instead of flipping a coin for each item, determine the number of items to skip before the next one to be added to S
- To handle deletions, permit |S| to drop to L < M, e.g., L = M/2 [GMP97b]
  - Remove the deleted item from S if it is in S, else ignore the deletion
  - If |S| drops to M/2, compute a new S with another pass (this happens only after roughly half the items have been deleted, so the cost is fully amortized)
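The insertion path of reservoir sampling is a few lines of code. The sketch below is the per-item coin-flip version; [Vit85]'s skip-based variants compute how many items to skip instead, avoiding one random draw per item:

```python
import random

def reservoir_sample(stream, M):
    """Maintain a uniform sample S of fixed size M over a stream [Vit85].

    After N items, every item is in S with probability M/N: item N is
    inserted with probability M/N and, if inserted, evicts a uniformly
    random current member of S.
    """
    S = []
    for N, item in enumerate(stream, start=1):
        if N <= M:
            S.append(item)                 # fill the reservoir first
        elif random.random() < M / N:      # insert with probability M/N
            S[random.randrange(M)] = item  # evict a random victim
    return S
```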
Biased Sampling
- Often advantageous to sample different data at different rates (stratified sampling)
  - E.g., outliers can be sampled at a higher rate to ensure they are accounted for; better accuracy for small groups in group-by queries
- Each tuple j in the relation is selected for the sample S with some probability Pj (which can depend on the values in tuple j)
- If selected, it is added to S along with its scale factor sf = 1/Pj
- Answering queries from S (see the sketch below): e.g.,
    select sum(R.a) from R where R.b < 5
  becomes
    select sum(S.a * S.sf) from S where S.b < 5
- The answer is unbiased; a good choice of the Pj's results in tighter confidence intervals

  R.a:    10    10    10    50    50
  Pj:     1/3   1/3   1/3   1/2   1/2
  S.sf:   -     3     -     -     2

  Sum(R.a) = 130;  Sum(S.a * S.sf) = 10*3 + 50*2 = 130
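Scaling each sampled tuple by sf = 1/Pj is a Horvitz-Thompson-style estimator; a small hypothetical sketch mirroring the example above:

```python
import random

def biased_sample(rows, prob):
    """Keep row j with probability prob(j); store it with scale
    factor 1/prob(j)."""
    return [(r, 1.0 / prob(r)) for r in rows if random.random() < prob(r)]

def estimate_sum(sample, value=lambda r: r, pred=lambda r: True):
    """Unbiased estimate of sum(value(r)) over rows satisfying pred:
    each sampled row is weighted by its scale factor."""
    return sum(value(r) * sf for r, sf in sample if pred(r))

# With rows [10, 10, 10, 50, 50], Pj = 1/3 for the 10s and 1/2 for
# the 50s: each 10 contributes (1/3)*(10*3) = 10 in expectation and
# each 50 contributes (1/2)*(50*2) = 50, so E[estimate] = 130.
rows = [10, 10, 10, 50, 50]
sample = biased_sample(rows, lambda r: 1/3 if r == 10 else 1/2)
print(estimate_sum(sample))
```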
One-Dimensional Haar Wavelets
- Wavelets: a mathematical tool for hierarchical decomposition of functions/signals
- Haar wavelets: the simplest wavelet basis, easy to understand and implement
- Recursive pairwise averaging and differencing at different resolutions (see the sketch below):

  Resolution   Averages                   Detail Coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]   (original data)
  2            [2, 1, 4, 4]               [0, -1, -1, 0]
  1            [1.5, 4]                   [0.5, 0]
  0            [2.75]                     [-1.25]

Haar Wavelet Coefficients
- The coefficients form a hierarchical decomposition structure (a.k.a. the "error tree") over the original data
[Figure: error tree showing each coefficient's "support", the range of data values it affects]
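The averaging/differencing recursion is easy to implement; a minimal sketch (illustrative, assuming a power-of-two length):

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients from coarsest to
    finest] via recursive pairwise averaging and differencing.
    For [2, 2, 0, 2, 3, 5, 4, 4] this yields
    [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]."""
    coeffs, a = [], list(data)
    while len(a) > 1:
        avgs = [(a[2*i] + a[2*i+1]) / 2 for i in range(len(a) // 2)]
        dets = [(a[2*i] - a[2*i+1]) / 2 for i in range(len(a) // 2)]
        coeffs = dets + coeffs  # coarser levels go in front
        a = avgs
    return a + coeffs
```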
Wavelet-based Histograms [MVW98]
- Problem: range-query selectivity estimation
- Key idea: use a compact subset of Haar (or linear) wavelet coefficients to approximate the data distribution
- Steps:
  - Compute the cumulative data distribution C
  - Compute the Haar (or linear) wavelet transform of C
  - Coefficient thresholding: only b << |C| coefficients can be kept; take the largest coefficients in absolute normalized value (see the sketch below)
    - Haar basis: divide coefficients at resolution j by sqrt(2^j)
    - Optimal in terms of the overall mean-squared (L2) error
  - Greedy heuristic methods: retain coefficients leading to large error reduction; throw away coefficients that give only a small increase in error
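A sketch of the thresholding step under the conventions of the code above (index 0 holds the overall average, the detail at index i sits at resolution level floor(log2(i)), and the level numbering is an assumption of this sketch):

```python
import math

def keep_top_b(coeffs, b):
    """Keep the b coefficients largest in absolute normalized value
    (L2-optimal for the Haar basis); returns {index: coefficient}."""
    def normalized(i):
        level = int(math.log2(i)) if i > 0 else 0
        return abs(coeffs[i]) / math.sqrt(2 ** level)
    top = sorted(range(len(coeffs)), key=normalized, reverse=True)[:b]
    return {i: coeffs[i] for i in top}
```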
Using Wavelet-based Histograms
- Selectivity estimation: sel(a <= X <= b) = C'(b) - C'(a-1), where C' is the (approximate) "reconstructed" cumulative distribution
- Time: O(min{b, log N}), where b = the size of the wavelet synopsis (number of coefficients) and N = the size of the domain
  - At most log N + 1 coefficients are needed to reconstruct any value of C (see the sketch below)
- Empirical results over synthetic data: improvements over random sampling and histograms (MaxDiff)
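Reconstructing a single value touches only the coefficients on one root-to-leaf path, which is where the "at most log N + 1" bound comes from. A hypothetical sketch, using the sparse {index: coefficient} synopsis from the thresholding step and the error-tree convention that the children of detail i are 2i and 2i+1:

```python
def reconstruct(synopsis, N, x):
    """Approximate value at position x (0-based) of a length-N array
    from a sparse Haar synopsis; absent coefficients count as zero.
    Touches at most log2(N) + 1 coefficients."""
    value = synopsis.get(0, 0.0)   # overall average
    i, lo, hi = 1, 0, N            # root detail and its support
    while i < N:
        c = synopsis.get(i, 0.0)
        mid = (lo + hi) // 2
        if x < mid:
            value += c             # x lies in the left half: add
            hi, i = mid, 2 * i
        else:
            value -= c             # right half: subtract
            lo, i = mid, 2 * i + 1
    return value

# Selectivity over a synopsis built on the cumulative distribution C:
# sel(a <= X <= b) ~= reconstruct(syn, N, b) - reconstruct(syn, N, a - 1)
```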
Dynamic Maintenance of Wavelet-based Histograms [MVW00]
- Build the Haar-wavelet synopsis on the original data distribution rather than the cumulative distribution: similar accuracy, and maintenance becomes simpler
- Key issues with dynamic wavelet maintenance:
  - A change in a single distribution value can affect the values of many coefficients (all those on its path to the root of the decomposition tree)
  - As the distribution changes, the "most significant" (e.g., largest) coefficients can also change: important coefficients can become unimportant, and vice-versa

Effect of Distribution Updates
- Key observation: for each coefficient c in the Haar decomposition tree,
    c = ( AVG(leftChildSubtree(c)) - AVG(rightChildSubtree(c)) ) / 2
- So for an update at value d, only the coefficients on path(d) are affected, and each can be updated in constant time (see the sketch below)
[Figure: an increment d+ to a single distribution value, with the affected path(d) highlighted in the error tree]
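The constant-time per-coefficient update follows from the AVG formula: adding delta at value d raises the average of whichever child subtree contains d by delta divided by half the coefficient's support. A hypothetical sketch over a dense coefficient array:

```python
def update_on_insert(coeffs, N, d, delta=1.0):
    """Propagate an update of +delta at distribution value d to the
    log2(N) + 1 Haar coefficients on path(d), each in O(1)."""
    coeffs[0] += delta / N        # overall average
    i, lo, hi = 1, 0, N
    while i < N:
        mid = (lo + hi) // 2
        support = hi - lo
        if d < mid:               # d in left subtree: left AVG rises
            coeffs[i] += delta / support
            hi, i = mid, 2 * i
        else:                     # d in right subtree: right AVG rises
            coeffs[i] -= delta / support
            lo, i = mid, 2 * i + 1
```

Each detail changes by (delta / (support/2)) / 2 = delta / support, hence the adjustment above.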
Maintenance Architecture
[Figure: insertions/deletions are appended to an activity log; the synopsis keeps the m + m' top coefficients, m of them in H and m' in an auxiliary set H']
- "Shake up" when the log reaches its maximum size: for each insertion at d,
  - For each coefficient c on path(d) that is in H or H': update c
  - For each coefficient c on path(d) that is in neither H nor H': insert c into H' with probability proportional to 1/2^h, where h is the "height" of c (probabilistic counting [FM85])
- Adjust H and H' (move the largest coefficients to H)
Outline
- Intro & Approximate Query Answering Overview
- One-Dimensional Synopses
- Multi-Dimensional Synopses and Joins
  - Multi-dimensional Histograms
  - Join sampling
  - Multi-dimensional Haar Wavelets
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions
- Conclusions

Multi-dimensional Data Synopses
- Problem: approximate the joint data distribution of multiple attributes
- Motivation:
  - Selectivity estimation for queries with multiple predicates
  - Approximating OLAP data cubes and general relations
- Conventional approach: the Attribute-Value Independence (AVI) assumption
    sel(p(A1) & p(A2) & ...) = sel(p(A1)) * sel(p(A2)) * ...
  - Simple: one-dimensional marginals suffice
  - BUT almost always inaccurate, with gross errors in practice (e.g., [Chr84, FK97, Poo97])

Multi-dimensional Histograms
- Use a small number of multi-dimensional buckets to directly approximate the joint data distribution
- Uniform spread & frequency approximation within buckets (see the sketch below):
  - n(i) = number of distinct values along attribute Ai; F = total bucket frequency
  - Approximate the data points by an n(1) x n(2) x ... uniform grid, each point with frequency F / (n(1) * n(2) * ...)
[Figure: actual distribution captured by ONE bucket, with sample frequencies 20, 40, 35, 90, 120]
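A hypothetical sketch of the within-bucket approximation: place the bucket's total frequency uniformly on the grid of distinct values along each dimension.

```python
from itertools import product

def bucket_points(distinct_values, F):
    """Uniform spread & frequency within one multi-d bucket.

    distinct_values : per-attribute lists of distinct values (n(i) each)
    F               : total bucket frequency
    Returns the n(1)*n(2)*... grid points, each carrying an equal share
    of F.
    """
    cells = 1
    for vals in distinct_values:
        cells *= len(vals)
    share = F / cells
    return [(p, share) for p in product(*distinct_values)]
```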