A Quick Introduction to Approximate Query Processing, Part II
CS286, Spring 2007, Minos Garofalakis

Decision Support Systems
- Data Warehousing: consolidate data from many sources in one large repository; loading, periodic synchronization of replicas, semantic integration.
- OLAP: complex SQL queries and views; queries based on spreadsheet-style operations and a "multidimensional" view of the data; interactive and "online" queries.
- Data Mining: exploratory search for interesting trends and anomalies. (Another lecture!)

Approximate Query Processing using Data Synopses
[Figure: a SQL query posed against GB/TB of data passes through the decision support system (DSS) to produce an exact answer, at the cost of long response times.]
- Key question: how do we construct effective data synopses?

Relations as Frequency Distributions
[Figure: a relation with attributes (name, age, salary, sales) viewed as a frequency distribution; a one-dimensional distribution maps age (attribute domain values) to tuple counts, and a three-dimensional distribution maps (age, salary, sales) to tuple counts.]

Outline
- Intro & Approximate Query Answering Overview: synopses, system architectures, commercial offerings
- One-Dimensional Synopses
  - Histograms: equi-depth, compressed, V-optimal, incremental maintenance, self-tuning
  - Samples: basics, sampling from DBs, reservoir sampling
  - Wavelets: 1-D Haar-wavelet histogram construction & maintenance
- Multi-Dimensional Synopses and Joins
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions

One-Dimensional Haar Wavelets
- Wavelets: a mathematical tool for the hierarchical decomposition of functions/signals.
- Haar wavelets: the simplest wavelet basis, easy to understand and implement.
- Recursive pairwise averaging and differencing at different resolutions:

  Resolution | Averages                 | Detail coefficients
  3          | [2, 2, 0, 2, 3, 5, 4, 4] | -
  2          | [2, 1, 4, 4]             | [0, -1, -1, 0]
  1          | [1.5, 4]                 | [0.5, 0]
  0          | [2.75]                   | [-1.25]

Haar Wavelet Coefficients
[Figure: coefficient "supports" and the hierarchical decomposition structure (a.k.a. the "error tree") over the original data.]
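The recursion is easy to run by hand or in code. Below is a minimal sketch in plain Python (not from the slides; the function name haar_decompose is mine) that reproduces the full decomposition table for the example vector:

def haar_decompose(data):
    """Haar wavelet transform by recursive pairwise averaging and
    differencing; assumes len(data) is a power of two."""
    averages = list(data)
    details = []                       # assembled coarsest-to-finest
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # details of this resolution go in front of the finer ones
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    # coefficient vector: overall average, then details coarsest-to-finest
    return averages + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]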

Wavelet-based Histograms [MVW98]
- Problem: range-query selectivity estimation.
- Key idea: use a compact subset of Haar (or linear) wavelet coefficients to approximate the data distribution.
- Steps:
  - Compute the (cumulative) data distribution C.
  - Compute the Haar (or linear) wavelet transform of C.
  - Coefficient thresholding: only b << |C| coefficients can be kept.
    - Take the largest coefficients in absolute normalized value (Haar basis: divide coefficients at resolution j by sqrt(2^j)); this is optimal in terms of the overall mean-squared (L2) error.
    - Greedy heuristic methods: retain coefficients that lead to a large error reduction; throw away coefficients that give a small increase in error.
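A sketch of the thresholding step under that normalization. It reuses haar_decompose from the sketch above; the name threshold and the tie-breaking by sort order are mine, not from [MVW98].

import math

def threshold(coeffs, b):
    """Zero out all but the b coefficients that are largest in
    absolute *normalized* value (a coefficient at resolution j is
    weighted by 1 / sqrt(2^j)), the L2-optimal retention rule."""
    def normalized(i):
        j = 0 if i == 0 else int(math.log2(i))   # resolution of c_i
        return abs(coeffs[i]) / math.sqrt(2 ** j)
    ranked = sorted(range(len(coeffs)), key=normalized, reverse=True)
    keep = set(ranked[:b])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(threshold(coeffs, 3))   # keeps c0, c1, and one of the two +/-1 details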

Using Wavelet-based Histograms
- Selectivity estimation: sel(a <= X <= b) = C'(b) - C'(a-1), where C' is the (approximate) "reconstructed" cumulative distribution.
- Time: O(min{b, log N}), where b = size of the wavelet synopsis (number of coefficients) and N = size of the domain; at most log N + 1 coefficients are needed to reconstruct any value of C.

Haar Wavelet Coefficients
- Reconstructing a data value d(i): d(i) = the sum of (+/-1) * (coefficient on the path from the root to leaf i). Example on the data above: 3 = 2.75 - (-1.25) + 0 + (-1).
- Range-sum calculation d(l:h): a simple linear combination of the coefficients on the paths to l and h, with only O(log N) terms. Example: 6 = 4*2.75 + 4*(-1.25).
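Both operations are path walks in the error tree. A small self-contained sketch (plain Python; the array layout [c0, c1, ...] with internal node j having children 2j and 2j+1, and leaves numbered N, ..., 2N-1, follows the indexing used above; the function names are mine):

def reconstruct(coeffs, i):
    """Rebuild value i (0-based) from the O(log N) coefficients on its
    root-to-leaf path: add on a left branch, subtract on a right one."""
    n = len(coeffs)
    value = coeffs[0]                  # overall average, always added
    node = n + i                       # leaves are numbered N, ..., 2N-1
    while node > 1:
        parent = node // 2
        sign = 1 if node == 2 * parent else -1
        value += sign * coeffs[parent]
        node = parent
    return value

def range_selectivity(coeffs, a, b):
    """sel(a <= X <= b) = C'(b) - C'(a-1), with coeffs encoding the
    (approximate) cumulative distribution C'."""
    low = reconstruct(coeffs, a - 1) if a > 0 else 0.0
    return reconstruct(coeffs, b) - low

coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(reconstruct(coeffs, 4))   # -> 3.0, i.e., 2.75 + 1.25 + 0 - 1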

Dynamic Maintenance of Wavelet-based Histograms [MVW00]
- Build Haar-wavelet synopses on the original data distribution.
- Key issues with dynamic wavelet maintenance:
  - A change in a single distribution value d can affect the values of many coefficients (the whole path from d to the root of the decomposition tree).
  - As the distribution changes, the "most significant" (e.g., largest) coefficients can also change: important coefficients can become unimportant, and vice versa.

Effect of Distribution Updates
- Key observation: for each coefficient c in the Haar decomposition tree,
  c = ( AVG(leftChildSubtree(c)) - AVG(rightChildSubtree(c)) ) / 2.
- Only the coefficients on path(d) are affected, and each can be updated in constant time.
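A sketch of that constant-time-per-coefficient update, in the same error-tree layout as the earlier sketches; the derivation is just the AVG identity above (an update of delta at one value moves the affected side's average by delta / side_size). The function name is mine.

def update(coeffs, i, delta):
    """Apply d(i) += delta by adjusting only the O(log N) coefficients
    on path(i), each in O(1)."""
    n = len(coeffs)
    coeffs[0] += delta / n             # overall average
    node, side = n + i, 1              # side = #values under one child
    while node > 1:
        parent = node // 2
        sign = 1 if node == 2 * parent else -1
        # c = (AVG(left) - AVG(right)) / 2; one side's average moves
        # by delta / side, so c moves by +/- delta / (2 * side)
        coeffs[parent] += sign * delta / (2 * side)
        node, side = parent, side * 2

coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
update(coeffs, 0, 8)     # data becomes [10, 2, 0, 2, 3, 5, 4, 4]
print(coeffs)            # -> [3.75, -0.25, 2.5, 0.0, 4.0, -1.0, -1.0, 0.0]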

Maintenance Architecture
- Keep the m + m' top coefficients in two sets, H (the m largest) and an auxiliary set H' (the next m'), fed by a log of insertions/deletions; "shake up" when the log reaches its maximum size:
  - For each insertion at d:
    - For each coefficient c on path(d) that is in H or H': update c.
    - For each coefficient c on path(d) that is in neither H nor H': insert c into H' with probability proportional to 1/2^h, where h is the "height" of c (probabilistic counting, [FM85]). (A sketch follows below.)
  - Adjust H and H' (move the largest coefficients to H).
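A tiny sketch of the probabilistic insertion rule, with invented names (candidates stands in for H', base_prob for the proportionality constant); nothing here is from [MVW00] beyond the 1/2^h sampling idea.

import random

def maybe_track(candidates, coeff_id, height, base_prob=1.0):
    """Start tracking an unseen path coefficient with probability
    proportional to 1 / 2^height; tall (coarse) coefficients lie on
    many update paths, so they get correspondingly many chances."""
    if coeff_id not in candidates and random.random() < base_prob / 2 ** height:
        candidates[coeff_id] = 0.0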

Problems with Conventional Wavelet Synopses
- An example data vector and wavelet synopsis (|D| = 16, B = 8 largest coefficients retained):
  Original data values: 127 71 87 31 59  3 43 99 100 42 0 58 30 88 72 130
  Wavelet answers:       65 65 65 65 65 65 65 65 100 42 0 58 30 88 72 130
- Large variation in answer quality: within the same data set, when the synopsis is large, when the data values are about the same, when the actual answers are about the same. Heavily biased approximate answers!
- Root causes:
  - Thresholding for an aggregate L2 error metric.
  - Independent, greedy thresholding (can leave large regions without any coefficient!).
  - Heavy bias from dropping coefficients without compensating for the loss.

Approach: Optimize for Maximum-Error Metrics
- Key metric for effective approximate answers: relative error with a sanity bound,
  max_i |d'(i) - d(i)| / max{ |d(i)|, s },
  where d'(i) is the reconstructed value and the sanity bound s avoids domination by small data values.
- To provide tight error guarantees for all reconstructed data values, minimize the maximum relative error in the data reconstruction.
- Another option: minimize the maximum absolute error, max_i |d'(i) - d(i)|.
- The algorithms can be extended to general "distributive" error metrics (e.g., average relative error).
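The metric itself is one line of code. A sketch (names mine), evaluated on the |D| = 16 example above:

def max_relative_error(data, approx, s):
    """max_i |d'(i) - d(i)| / max(|d(i)|, s), the sanity-bounded
    maximum relative error."""
    return max(abs(d - a) / max(abs(d), s) for d, a in zip(data, approx))

data   = [127, 71, 87, 31, 59, 3, 43, 99, 100, 42, 0, 58, 30, 88, 72, 130]
approx = [65] * 8 + [100, 42, 0, 58, 30, 88, 72, 130]
print(max_relative_error(data, approx, s=1.0))   # ~20.7, dominated by d(5) = 3

Note that without the bound, the d(i) = 0 entry would divide by zero; s also keeps very small values from dominating the metric.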

Our Approach: Deterministic Wavelet Thresholding for Maximum Error
- Key idea: a dynamic-programming formulation that conditions the optimal solution on the error that "enters" the subtree (through the selection of ancestor nodes).
- Our DP table: M[j, b, S] = the optimal maximum relative (or absolute) error in T(j) with a space budget of b coefficients (chosen in T(j)), assuming the subset S of j's proper ancestors has already been selected for the synopsis.
- Clearly, |S| <= min{ B - b, log N + 1 }; we want to compute M[0, B, {}].
- Basic observation: the depth of the error tree is only log N + 1, so we can explore and tabulate all S-subsets for a given node at a space/time cost of only O(N)!

Base Case for the DP Recurrence: Leaf (Data) Nodes
- Base case in the bottom-up DP computation: a leaf (i.e., data) node. Assume for simplicity that the data values are numbered N, ..., 2N - 1.
- M[j, b, S] is not defined for b > 0: we never allocate space to leaves.
- For b = 0, M[j, 0, S] is the (relative) error of reconstructing d(j) from the selected ancestor coefficients S alone, computed for each coefficient subset S with |S| <= min{ B, log N + 1 }; similarly for absolute error.
- Again, the time/space complexity per leaf node is only O(N).

DP Recurrence: Internal (Coefficient) Nodes
- Two basic cases when examining node/coefficient j for inclusion in the synopsis: (1) drop j; (2) keep j. (In the error tree, root = 0 and S = the subset of selected j-ancestors; a coefficient enters with a + sign on its left branch and a - sign on its right branch.)
- Case (1), drop coefficient j: the minimum possible maximum relative error in T(j) is
  M[j, b, S] = min over 0 <= b' <= b of max{ M[2j, b', S], M[2j+1, b - b', S] },
  i.e., optimally distribute the space b between j's two child subtrees. Note that the RHS of the recurrence is well-defined: ancestors of j are obviously ancestors of 2j and 2j+1.
- Case (2), keep coefficient j: the minimum possible maximum relative error in T(j) is
  M[j, b, S] = min over 0 <= b' <= b - 1 of max{ M[2j, b', S ∪ {j}], M[2j+1, b - b' - 1, S ∪ {j}] },
  i.e., take one unit of space for coefficient j and optimally distribute the remaining space. The selected subsets on the RHS change, since we choose to retain j; again, the recurrence RHS is well-defined.
- Finally, define M[j, b, S] as the minimum of the two cases, and read the answer off M[0, B, {}]. Overall complexity: low-polynomial time and space. (A sketch of the recurrences follows below.)
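A compact recursive rendering of these recurrences (plain Python, names mine). It returns only the optimal error value, memoizes (j, b, S) states rather than building the slides' bottom-up tables, and handles the root average c0 as a special node above coefficient 1; this is a sketch for intuition, not the paper's algorithm as implemented.

from functools import lru_cache

def optimal_max_rel_error(data, coeffs, B, s):
    """Minimum achievable maximum relative error (sanity bound s) for
    a synopsis of B coefficients, via the drop/keep DP above."""
    N = len(data)
    INF = float("inf")

    def leaf_error(i, selected):
        # Base case: reconstruct d(i) from the selected ancestors only.
        value = coeffs[0] if 0 in selected else 0.0
        node = N + i                   # leaves numbered N, ..., 2N-1
        while node > 1:
            parent = node // 2
            if parent in selected:
                value += coeffs[parent] if node == 2 * parent else -coeffs[parent]
            node = parent
        return abs(data[i] - value) / max(abs(data[i]), s)

    @lru_cache(maxsize=None)
    def M(j, b, S):
        if j >= N:                     # leaf: never allocate space here
            return leaf_error(j - N, frozenset(S)) if b == 0 else INF
        best = INF
        for b1 in range(b + 1):        # case (1): drop j
            best = min(best, max(M(2 * j, b1, S), M(2 * j + 1, b - b1, S)))
        if b >= 1:                     # case (2): keep j
            S2 = tuple(sorted(set(S) | {j}))
            for b1 in range(b):
                best = min(best, max(M(2 * j, b1, S2),
                                     M(2 * j + 1, b - 1 - b1, S2)))
        return best

    drop_c0 = M(1, B, ())
    keep_c0 = M(1, B - 1, (0,)) if B >= 1 else INF
    return min(drop_c0, keep_c0)

data = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(optimal_max_rel_error(data, coeffs, B=4, s=1.0))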

Outline
- Intro & Approximate Query Answering Overview
- One-Dimensional Synopses
- Multi-Dimensional Synopses and Joins: multi-dimensional histograms, join sampling, multi-dimensional Haar wavelets
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions
- Conclusions

Relations as Frequency Distributions
[Figure, repeated from earlier: a relation (name, age, salary, sales) viewed as a one-dimensional distribution over age and a three-dimensional distribution over (age, salary, sales), with tuple counts.]

Multi-dimensional Data Synopses
- Problem: approximate the joint data distribution of multiple attributes.
- Motivation:
  - Selectivity estimation for queries with multiple predicates.
  - Approximating OLAP data cubes and general relations.
- Conventional approach: the Attribute-Value Independence (AVI) assumption,
  sel(p(A1) & p(A2) & ...) = sel(p(A1)) * sel(p(A2)) * ...
  - Simple: one-dimensional marginals suffice.
  - BUT: almost always inaccurate, with gross errors in practice (e.g., [Chr84, FK97, Poo97]). (See the sketch below.)
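A tiny synthetic illustration of how badly AVI can miss under correlation (all of the data here is made up for the example):

# Two perfectly correlated attributes: A1 = A2 on every row.
rows = [(i % 10, i % 10) for i in range(1000)]

sel_a1 = sum(1 for a1, _ in rows if a1 == 3) / len(rows)          # 0.1
sel_a2 = sum(1 for _, a2 in rows if a2 == 3) / len(rows)          # 0.1
actual = sum(1 for a1, a2 in rows if a1 == 3 and a2 == 3) / len(rows)

print(sel_a1 * sel_a2)   # AVI estimate: 0.01
print(actual)            # true selectivity: 0.1, ten times higher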

Multi-dimensional Histograms
- Use a small number of multi-dimensional buckets to directly approximate the joint data distribution.
- Uniform spread & frequency approximation within buckets: let n(i) = the number of distinct values along attribute Ai and F = the total bucket frequency; approximate the data points by an n(1) * n(2) * ... uniform grid, each grid point carrying frequency F / (n(1) * n(2) * ...).
[Figure: an actual distribution approximated by ONE BUCKET; example frequencies 20, 40, 35, 90, 120.]
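A sketch of estimating a range predicate's count from a single bucket under uniform spread and uniform frequency (the function name and the example bucket are mine):

from itertools import product

def bucket_estimate(distinct_values, F, query_box):
    """Spread the bucket's F tuples uniformly over the grid of its
    distinct values and sum the grid points inside the query box."""
    grid = list(product(*distinct_values))
    per_point = F / len(grid)
    inside = sum(1 for point in grid
                 if all(lo <= x <= hi
                        for x, (lo, hi) in zip(point, query_box)))
    return inside * per_point

# A 2-D bucket: n(1) = 3 distinct A1 values, n(2) = 2 distinct A2 values,
# total frequency F = 120, so each of the 6 grid points carries 20.
print(bucket_estimate([[10, 20, 30], [5, 15]], 120, [(15, 30), (0, 10)]))  # -> 40.0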

Multi-dimensional Histogram Construction
- The construction problem is much harder even for two dimensions [MPS99].
- Multi-dimensional equi-depth histograms [MD88] (sketched in code below):
  - Fix an ordering of the dimensions A1, A2, ..., Ak; let a = the k-th root of the desired number of buckets; initialize B = { the data distribution }.
  - For i = 1, ..., k: split each bucket in B into a equi-depth partitions along Ai; return the resulting buckets to B.
  - Problems: a limited set of bucketizations; the fixed a and the fixed dimension ordering can result in poor partitionings.
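A minimal rendering of the [MD88] loop over points rather than a true distribution (the function name, point-list representation, and ceiling-split policy are mine):

def equidepth_buckets(points, k, a):
    """Split every bucket into a equi-depth partitions along each of
    the k dimensions in turn, yielding ~a^k buckets."""
    buckets = [points]
    for dim in range(k):
        next_buckets = []
        for b in buckets:
            b = sorted(b, key=lambda p: p[dim])
            step = max(1, -(-len(b) // a))            # ceil(len(b) / a)
            next_buckets += [b[j:j + step] for j in range(0, len(b), step)]
        buckets = next_buckets
    return buckets

pts = [(x, y) for x in range(6) for y in range(6)]    # 36 points, 2-D
print([len(b) for b in equidepth_buckets(pts, k=2, a=3)])  # 9 buckets of 4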

MHIST-p histograms [PI97]
- At each step: choose the bucket b in B containing the attribute Ai whose marginal is the most in need of partitioning; split b along Ai into p (e.g., p = 2) buckets.

Equi-depth vs. MHIST Histograms
[Figure: the same two-dimensional (A1, A2) distribution partitioned by an equi-depth histogram (a1 = 2, a2 = 3) [MD88] and by MHIST-2 (MaxDiff) [PI97]; example bucket counts 450, 280, 340 and 460, 360, 250.]
- MHIST: choosing the bucket/dimension to split based on its "criticality" allows a much larger class of bucketizations (hierarchical space partitioning).
- Experimental results verify superiority over AVI and equi-depth histograms.

Other Multi-dimensional Histogram Techniques: GENHIST [GKT00]
- Key idea: allow overlapping histogram buckets; this admits a much larger number of distinct frequency regions for a given space budget (= #buckets).
- Greedy construction algorithm:
  - Consider increasingly coarser grids.
  - At each step, select the cell(s) c of highest density and move enough randomly-selected points from c into a bucket to make c and its neighbors "close to uniform".
- Truly multi-dimensional "split decisions" based on tuple density, unlike MHIST.
[Figure: overlapping buckets a, b, c, d.]

Other Multi-dimensional Histogram Techniques: STHoles [BCG01]
- Multi-dimensional, workload-based histograms; allow bucket nesting (a "bucket tree").
- Intercept the query result stream and count |q ∩ b| for each bucket b (about 10% overhead in MS SQL Server 2000).
- Drill "holes" in b for regions of different tuple density and "pull" them out as children of b (first-class buckets).
- Consolidate/merge buckets of similar densities (keeping #buckets constant).
[Figure: a nested bucket tree with example counts 200, 100, 300, 150 over a small example relation.]

Sampling for Multi-D Synopses
- Taking a sample of the rows of a table captures the attribute correlations in those rows; answers are unbiased and confidence intervals apply.
- Thus, guaranteed accuracy for count, sum, and average queries on single tables, as long as the query is not too selective.
- Problem with joins [AGP99, CMN99] (see the sketch below):
  - The join of two uniform samples is not a uniform sample of the join.
  - The join of two samples typically has very few tuples.
[Figure: a foreign-key join with 40% samples shown in red; size of the actual join = 30.]
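A quick simulation of the problem (all names and numbers invented): two tables joined one-to-one on a key, each sampled at 1%; the join of the samples keeps only about p^2 of the join.

import random

random.seed(7)
N = 10_000                                  # |R| = |S| = |R JOIN S| = N
p = 0.01                                    # 1% Bernoulli sample of each table

r_sample = {k for k in range(N) if random.random() < p}
s_sample = {k for k in range(N) if random.random() < p}

print(len(r_sample), len(s_sample))         # each about 100 tuples
print(len(r_sample & s_sample))             # join of samples: about p*p*N = 1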

Join Synopses for Foreign-Key Joins [AGP99]
- Based on sampling from materialized foreign-key joins; typically about 10% added space is required.
- Yet, a join synopsis can be used to get a uniform sample of ANY foreign-key join, and it is fast to maintain incrementally.
- Significant improvement over using just table samples, e.g., for TPC-H query Q5 (a 4-way join):
  - 1%-6% relative error vs. 25%-75% relative error, for synopsis size = 1.5% and selectivity ranging from 2% to 10%.
  - 10% vs. 100% error (no answer at all!), for size = 0.5% and selectivity = 3%.
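The core idea in miniature (a naive sketch, names mine; [AGP99] samples the fact table and follows foreign keys rather than materializing the whole join, but the resulting synopsis is the same in spirit):

import random

def join_synopsis(fact, dim, rate):
    """Sample the materialized foreign-key join, so every sampled row
    already carries its matching dimension attributes; the result is a
    uniform sample of the join, usable without re-joining at query time."""
    dim_by_key = {d["key"]: d for d in dim}
    joined = [{**f, **dim_by_key[f["fk"]]} for f in fact]
    return [row for row in joined if random.random() < rate]

fact = [{"fk": i % 3, "sales": i} for i in range(1000)]
dim  = [{"key": k, "region": "r%d" % k} for k in range(3)]
print(len(join_synopsis(fact, dim, rate=0.05)))   # about 50 sampled join rows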
