A Quick Introduction to Approximate Query Processing Part-.ppt

资源描述

1、A Quick Introduction to Approximate Query Processing Part-IV,CS286, Spring2007 Minos Garofalakis,Logistics,Draft CS286 web site is finally up! http:/db.cs.berkeley.edu/cs286sp07/Project list and guidelines being worked on Please email me & Raghu to discuss your own project ideas,Approximate Query Pr

2、ocessing using Data Synopses,How to construct effective data synopses ?,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,GB/TB,Relations as Frequency Distributions,8 10 10,30 20 50,25 8 15,salary,age,name,age,salary,sales,sales,One-dimensional distribution,Age (attribute do

3、main values),tuple counts,Three-dimensional distribution,tuple counts,Outline,Intro & Approximate Query Answering Overview Synopses, System architectures, Commercial offerings One-Dimensional Synopses Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning Samples: Basics

4、, Sampling from DBs, Reservoir Sampling Wavelets: 1-D Haar-wavelet histogram construction & maintenance Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions,Outline,Intro Using Histograms, Samples, WaveletsDiscussion & Comparisons A

5、dvanced Techniques & Future Directions Dependency-based, Streaming data,Two-dimensional Haar Wavelets - Non-standard decomposition,Wavelet Transform Array:,(a+b-c-d)/4,(a+b+c+d)/4,(a-b-c+d)/4,(a-b+c-d)/4,Multi-dimensional Haar Wavelets,Haar decomposition in d dimensions = d-dimensional array of wave

6、let coefficients Coefficient support region = d-dimensional rectangle of cells in the original data array Sign of coefficients contribution can vary along the quadrants of its support,Support regions & signs for the 16 nonstandard 2-dimensional Haar coefficients of a 4X4 data array A,Range-sum Estim

7、ation Using Wavelet Synopses,Coefficient thresholding As in 1-d case, normalizing by appropriate constants and retaining the largest coefficients minimizes the overall L2 error Range-sums: selectivity estimation or OLAP-cube aggregates VW99 (“measure attribute” as count) Only coefficients with suppo

8、rt regions intersecting the query hyper-rectangle can contribute Many contributions can cancel each other CGR00, VW99,Query Range,Contribution to range sum = 0Only nodes on the path to range endpoints can have nonzero contributions (Extends naturally to multi-dimensional range sums),Decomposition Tr

9、ee (1-d),Approximate Query Processing Using Wavelets CGR00,Wavelet Synopses,Approximate Relations,Query Results in Wavelet Domain,Final Approximate Results,Render,Render,Querying in Wavelet Domain,Querying in Relation Domain,Reduce relations into compact wavelet-coefficient synopses,Entire query pro

10、cessing in the compressed (wavelet) domain,Wavelet Query Processing,set of coefficients,set of coefficients,set of coefficients,Each operator (e.g., select, project, join, aggregates, etc.) input: set of wavelet coefficients output: set of wavelet coefficients Finally, rendering step input: set of w

11、avelet coefficients output: (multi)set of tuples,render,Selection - Relational Domain,In relational domain, interested in only those cells inside query range In wavelet domain, interested in only the coefficients that contribute to those cells,Dim. D2,6,3,7,3,3,2,2,4,1,1,8,6,3,Query Range,Dim. D1,Jo

12、int Data Distribution Array,Relation,Selection - Wavelet Domain,-,-,+,+,+,-,-,+,+,-,D2,D1,Query Range,Equi-join - Relational Domain,Relational domain: Join count= 7*3 = (A1-A3)*(B2+B3) Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3 Consider all pairs of coefficients: (1) check joinability (overlap in

13、 join dimension(s), (2) compute output coefficients,3,Coefficients A1 (+) and A3 (-) contribute to this cell,Coefficients B2 (+), and B3 (+) contribute to this cell,Join along D1,Joint Data Distributionof Relation 1,Joint Data Distr.of Relation 2,Dim. D3,Join Dim.D1,Relation 1,Relation 2,Equi-join -

14、 Wavelet Domain,-,+,D3,D1,-,-,+,+,D2,D1,D1,v1,v2,Wavelet Query Processing,set of coefficients,set of coefficients,set of coefficients,Each operator (e.g., select, project, join, aggregates, etc.) input: set of wavelet coefficients output: set of wavelet coefficients Finally, rendering step input: se

15、t of wavelet coefficients output: (multi)set of tuples,render,Outline,Intro & Approximate Query Answering Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Conclusions,Discussion & Comparisons (1

16、),Histograms & Wavelets: Limited by “curse of dimensionality” Rely on data space partitioning in “regions” Ineffective above 5-6 dimensions Value/frequency uniformity assumptions within buckets break down in medium-to-high dimensionalities! Sampling: No such limitations, BUT. Ineffective for ad-hoc

17、relational joins over arbitrary schemas Uniformity property is lost Quality guarantees degrade Effectiveness for set-valued approximate queries is unclear Only (very) small subsets of the answer set are returned (especially, when joins are present),Discussion & Comparisons (2),Histograms & Wavelets:

18、 Compress data by accurately capturing rectangular “regions” in the data space Advantage over sampling for typical, “range-based” relational DB queries BUT, unclear how to effectively handle unordered/non-numeric data sets (no such issues with sampling.) Sampling: Provides strong probabilistic quali

19、ty guarantees (unbiased answers) for individual aggregate queries Histograms & Wavelets: Can guarantee a bound on the overall error (e.g., L2) for the approximation, BUT answers to individual queries can be heavily biased!,No clear winner exists! (Hybrids?),Outline,Intro & Approximate Query Answerin

20、g Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Dependency-based Synopses Streaming Data XML Synopses Conclusions,Dependency-based Histogram Synopses DGR01,Extremes in terms of the underlying

21、 correlations!Dependency-Based Histograms: explore space between extremes by explicitly identifying data correlations/independences Build a statistical interaction model on data attributes Based on the model, build a collection of low-dimensional histograms Use this histogram collection to provide a

22、pproximate answersGeneral methodology, also applicable to other synopsis techniques (e.g., wavelets),Dependency-based Histograms,Identify (and exploit) attribute correlation and independence Partial Independence : p(salary, height, weight) = p(salary) * p(height, weight) Conditional Independence :p(

23、salary, age | YPE) = p(salary| YPE) * p(age | YPE) Use forward selection to build a decomposable statistical model BFH75, Lau96 on the attributes A,D are conditionally independent given B,C p(AD|BC) = p(A|BC) * p(D|BC) Joint distribution p(ABCD) = p(ABC) * p(BCD) / p(BC) Build histograms on model cl

24、iquesSignificant accuracy improvements (factor of 5) over pure MHIST New histogram construction & usage algorithms, etc.,Workload-tuned Biased Sampling -Congressional Samples AGP00,Decision support queries routinely segment data into groups & then aggregate the information within each group Each tab

25、le has a set of “grouping columns”: queries can group by any subset of these columnsGoal: Maximize the accuracy for all groups (large or small) in each Group-by query E.g., census DB with state (s), gender(g), and income (i) Q: Avg(i) group-by s : seek good accuracy for all 50 states Q: Avg(i) group

26、-by s,g : seek good accuracy for all 100 groupsTechnique: Congressional Samples House: Uniform sample: good for when no group-by Senate: Same size sample per group when use all grouping columns: good for queries with all columns Congress: Combines House & Senate, but considers all subsets of groupin

27、g columns, and then scales down,Workload-tuned Biased Sampling - ICICLES GLR00,Biased sampling scheme that dynamically adapts to query workload Exploit data locality - more focus (i.e., #sample points) in frequently-queried regions Let Q = q1, q2, . . . be a query workload, R(qi) = subset of R used

28、in answering query qi L(R, Q) = Extension of R wrt Q = R R(qi) (multiset of tuples)Icicle: Uniform random sample of L(R,Q) Incrementally maintained and adapt (“self-tune”) to workload through Reservoir Sampling technique Vit85 Unbiased Icicle estimators: New formulas to account for duplicates and bi

29、as in sample selection Provably better (smaller variance) than uniform for “focused” queries (that follow the workload model),Workload-tuned Biased Sampling - Lifted Workloads CDN01,Formulate sample selection as an optimization problem Minimize query-answering error for a given workload model Techni

30、que for “lifting a fixed workload W” to produce a probability distribution over all possible queries Similar to kernel density estimation (queries in W = “sample points”),“Fundamental regions” induced by W,q1,q2,R,q,W = q1, q2 ,prob(q|W) = parametric function of qs overlap with queries in W,Workload

31、-tuned Biased Sampling - Lifted Workloads,Problem: Find sample of size k that minimizes expected error for a given “lifted” workloadSolution: Stratified sampling Coc77 Collection of uniform samples (of total size k) over disjoint subsets (“strata”) of the population Much better estimates when varian

32、ce within strata is small Coc77Stratification: Selecting appropriate partitioning of R Using “fundamental regions” as strata is optimal for COUNT For SUM, partition “fundamental regions” further to reduce variance of the aggregated attribute (Neymann technique Coc77)Allocation: Dividing k among stra

33、ta Closed form solutions (valid under certain simplifying assumptions),Data Streams,Data is continually arriving. Collect & maintain synopses on the data. Goal: Highly-accurate approximate answers State-of-the-art: Good techniques for narrow classes of queries E.g., Any one-pass algorithm for collec

34、ting & maintaining a synopsis can be used effectively for data streamsAlternative scenario: A collection of data sets. Compute a compact sketch of each data set & then answer queries (approximately) comparing the data sets E.g., detecting near-duplicates in a collection of web pages: Altavista E.g.,

35、 estimating join sizes among a collection of tables AGM99,Looking Forward.,Optimizing queries for approximation e.g., minimize length of confidence interval at the plan rootExploiting mining-based techniques (e.g., decision trees) for data reduction and approximate query processing see, e.g., BGR01,

36、 GTK01, JMN99Dynamic maintenance of complex (e.g., dependency-based DGR01 or mining-based BGR01) synopsesSynopses and approximate query processing for richer data models and data streams e.g., XPath/XQuery over XML databases,XML Data (Text),RichardFeynmanThe character of Physical LawR.K.NarayanWaiti

37、ng for the Mahatma1981,XML Data (Tree),booklist,book,book,a,t,p,a,t,“Richard”,“The character of physical Law”,g,“Science”,g,“”,“”,“”,f,“Hardcover”,f,l,“Feynman”,f,l,“”,“”,XML Basics,Elements Encode “concepts” in the XML database Nesting denotes association/inclusion Attributes Record information spe

38、cific to an element (e.g., the genre of a book) References Links between elements in different parts of the document,XML vs. Relational Data,row,row,row,name,name,name,phone,phone,phone,“John”,3634,“Sue”,“Dick”,6343,6363,Relation,XML,XML vs. Relational Data,A relation instance is basically a tree wi

39、th: Unbounded fanout at level 1 (i.e., any # of rows) Fixed fanout at level 2 (i.e., fixed # fields)XML data is essentially an arbitrary tree Unbounded fanout at all nodes/levels Any number of levels Variable # of children at different nodes, variable path lengths,XPath Expressions,Examples: /bookli

40、st/book /booklist/book/author /booklist/book/author/lastnameGiven an XML document, the value of a path expression p is a set of elements (= XML subtrees),Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E1

41、4,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v1

42、1 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath exp

43、ressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,XPath Syntax,Path wildcards/ = descendant at any level (or self) * = any (single) tag Example: /booklist/lastname Query attributes and attribute c

44、ontent Use “” Examples: /booklist/bookformat=“Paperback”, /booklist/book/genre Branching predicates: Apred Predicate on As subtree using logical connectives (and, or, etc.), path expressions, built-in functions (e.g., contains(), etc. Example: /authorcontains(./lastname, “Fey”),Synopses for XML,Summ

45、arize labeled tree/graph structure for approximate path navigation queries Selectivity estimation: How many elements satisfy p? Approximate answers: Return an approximate XML document as output of an XQuery fragment Key idea: Build a concise Graph Synopsis that captures the path/branching distributi

46、on in limited space Use appropriate uniformity/independence assumptions to approximate path structure Refine synopsis in parts of the XML document where assumptions fail XSketches SIGMOD02,VLDB02, TreeSketches SIGMOD04,Conclusions,Commercial data warehouses: approaching several 100s TB and continuou

47、sly growing Demand for high-speed, interactive analysis (click-stream processing, IP traffic analysis) also increasingApproximate Query Processing “Tame” these TeraBytes and satisfy the need for interactive processing and exploration Great promise Commercial acceptance still lagging, but will most probably grow in coming years Still lots of interesting research to be done!,

展开阅读全文