ImageVerifierCode 换一换
格式:PPT , 页数:42 ,大小:458.50KB ,
资源ID:373177      下载积分:2000 积分
快捷下载
登录下载
邮箱/手机:
温馨提示:
如需开发票,请勿充值!快捷下载时,用户名和密码都是您填写的邮箱或者手机号,方便查询和重复下载(系统自动生成)。
如填写123,账号就是123,密码也是123。
特别说明:
请自助下载,系统不会自动发送文件的哦; 如果您已付费,想二次下载,请登录后访问:我的下载记录
支付方式: 支付宝扫码支付 微信扫码支付   
注意:如需开发票,请勿充值!
验证码:   换一换

加入VIP,免费下载
 

温馨提示:由于个人手机设置不同,如果发现不能下载,请复制以下地址【http://www.mydoc123.com/d-373177.html】到电脑端继续下载(重复下载不扣费)。

已注册用户请登录:
账号:
密码:
验证码:   换一换
  忘记密码?
三方登录: 微信登录  

下载须知

1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。
2: 试题试卷类文档,如果标题没有明确说明有答案则都视为没有答案,请知晓。
3: 文件的所有权益归上传用户所有。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 本站仅提供交流平台,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

版权提示 | 免责声明

本文(A Quick Introduction to Approximate Query Processing Part-.ppt)为本站会员(livefirmly316)主动上传,麦多课文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知麦多课文库(发送邮件至master@mydoc123.com或直接QQ联系客服),我们立即给予删除!

A Quick Introduction to Approximate Query Processing Part-.ppt

1、A Quick Introduction to Approximate Query Processing Part-IV,CS286, Spring2007 Minos Garofalakis,Logistics,Draft CS286 web site is finally up! http:/db.cs.berkeley.edu/cs286sp07/Project list and guidelines being worked on Please email me & Raghu to discuss your own project ideas,Approximate Query Pr

2、ocessing using Data Synopses,How to construct effective data synopses ?,SQL Query,Exact Answer,Decision Support Systems (DSS),Long Response Times!,GB/TB,Relations as Frequency Distributions,8 10 10,30 20 50,25 8 15,salary,age,name,age,salary,sales,sales,One-dimensional distribution,Age (attribute do

3、main values),tuple counts,Three-dimensional distribution,tuple counts,Outline,Intro & Approximate Query Answering Overview Synopses, System architectures, Commercial offerings One-Dimensional Synopses Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning Samples: Basics

4、, Sampling from DBs, Reservoir Sampling Wavelets: 1-D Haar-wavelet histogram construction & maintenance Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions,Outline,Intro Using Histograms, Samples, WaveletsDiscussion & Comparisons A

5、dvanced Techniques & Future Directions Dependency-based, Streaming data,Two-dimensional Haar Wavelets - Non-standard decomposition,Wavelet Transform Array:,(a+b-c-d)/4,(a+b+c+d)/4,(a-b-c+d)/4,(a-b+c-d)/4,Multi-dimensional Haar Wavelets,Haar decomposition in d dimensions = d-dimensional array of wave

6、let coefficients Coefficient support region = d-dimensional rectangle of cells in the original data array Sign of coefficients contribution can vary along the quadrants of its support,Support regions & signs for the 16 nonstandard 2-dimensional Haar coefficients of a 4X4 data array A,Range-sum Estim

7、ation Using Wavelet Synopses,Coefficient thresholding As in 1-d case, normalizing by appropriate constants and retaining the largest coefficients minimizes the overall L2 error Range-sums: selectivity estimation or OLAP-cube aggregates VW99 (“measure attribute” as count) Only coefficients with suppo

8、rt regions intersecting the query hyper-rectangle can contribute Many contributions can cancel each other CGR00, VW99,Query Range,Contribution to range sum = 0Only nodes on the path to range endpoints can have nonzero contributions (Extends naturally to multi-dimensional range sums),Decomposition Tr

9、ee (1-d),Approximate Query Processing Using Wavelets CGR00,Wavelet Synopses,Approximate Relations,Query Results in Wavelet Domain,Final Approximate Results,Render,Render,Querying in Wavelet Domain,Querying in Relation Domain,Reduce relations into compact wavelet-coefficient synopses,Entire query pro

10、cessing in the compressed (wavelet) domain,Wavelet Query Processing,set of coefficients,set of coefficients,set of coefficients,Each operator (e.g., select, project, join, aggregates, etc.) input: set of wavelet coefficients output: set of wavelet coefficients Finally, rendering step input: set of w

11、avelet coefficients output: (multi)set of tuples,render,Selection - Relational Domain,In relational domain, interested in only those cells inside query range In wavelet domain, interested in only the coefficients that contribute to those cells,Dim. D2,6,3,7,3,3,2,2,4,1,1,8,6,3,Query Range,Dim. D1,Jo

12、int Data Distribution Array,Relation,Selection - Wavelet Domain,-,-,+,+,+,-,-,+,+,-,D2,D1,Query Range,Equi-join - Relational Domain,Relational domain: Join count= 7*3 = (A1-A3)*(B2+B3) Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3 Consider all pairs of coefficients: (1) check joinability (overlap in

13、 join dimension(s), (2) compute output coefficients,3,Coefficients A1 (+) and A3 (-) contribute to this cell,Coefficients B2 (+), and B3 (+) contribute to this cell,Join along D1,Joint Data Distributionof Relation 1,Joint Data Distr.of Relation 2,Dim. D3,Join Dim.D1,Relation 1,Relation 2,Equi-join -

14、 Wavelet Domain,-,+,D3,D1,-,-,+,+,D2,D1,D1,v1,v2,Wavelet Query Processing,set of coefficients,set of coefficients,set of coefficients,Each operator (e.g., select, project, join, aggregates, etc.) input: set of wavelet coefficients output: set of wavelet coefficients Finally, rendering step input: se

15、t of wavelet coefficients output: (multi)set of tuples,render,Outline,Intro & Approximate Query Answering Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Conclusions,Discussion & Comparisons (1

16、),Histograms & Wavelets: Limited by “curse of dimensionality” Rely on data space partitioning in “regions” Ineffective above 5-6 dimensions Value/frequency uniformity assumptions within buckets break down in medium-to-high dimensionalities! Sampling: No such limitations, BUT. Ineffective for ad-hoc

17、relational joins over arbitrary schemas Uniformity property is lost Quality guarantees degrade Effectiveness for set-valued approximate queries is unclear Only (very) small subsets of the answer set are returned (especially, when joins are present),Discussion & Comparisons (2),Histograms & Wavelets:

18、 Compress data by accurately capturing rectangular “regions” in the data space Advantage over sampling for typical, “range-based” relational DB queries BUT, unclear how to effectively handle unordered/non-numeric data sets (no such issues with sampling.) Sampling: Provides strong probabilistic quali

19、ty guarantees (unbiased answers) for individual aggregate queries Histograms & Wavelets: Can guarantee a bound on the overall error (e.g., L2) for the approximation, BUT answers to individual queries can be heavily biased!,No clear winner exists! (Hybrids?),Outline,Intro & Approximate Query Answerin

20、g Overview One-Dimensional Synopses Multi-Dimensional Synopses and Joins Set-Valued Queries Discussion & Comparisons Advanced Techniques & Future Directions Dependency-based Synopses Streaming Data XML Synopses Conclusions,Dependency-based Histogram Synopses DGR01,Extremes in terms of the underlying

21、 correlations!Dependency-Based Histograms: explore space between extremes by explicitly identifying data correlations/independences Build a statistical interaction model on data attributes Based on the model, build a collection of low-dimensional histograms Use this histogram collection to provide a

22、pproximate answersGeneral methodology, also applicable to other synopsis techniques (e.g., wavelets),Dependency-based Histograms,Identify (and exploit) attribute correlation and independence Partial Independence : p(salary, height, weight) = p(salary) * p(height, weight) Conditional Independence :p(

23、salary, age | YPE) = p(salary| YPE) * p(age | YPE) Use forward selection to build a decomposable statistical model BFH75, Lau96 on the attributes A,D are conditionally independent given B,C p(AD|BC) = p(A|BC) * p(D|BC) Joint distribution p(ABCD) = p(ABC) * p(BCD) / p(BC) Build histograms on model cl

24、iquesSignificant accuracy improvements (factor of 5) over pure MHIST New histogram construction & usage algorithms, etc.,Workload-tuned Biased Sampling -Congressional Samples AGP00,Decision support queries routinely segment data into groups & then aggregate the information within each group Each tab

25、le has a set of “grouping columns”: queries can group by any subset of these columnsGoal: Maximize the accuracy for all groups (large or small) in each Group-by query E.g., census DB with state (s), gender(g), and income (i) Q: Avg(i) group-by s : seek good accuracy for all 50 states Q: Avg(i) group

26、-by s,g : seek good accuracy for all 100 groupsTechnique: Congressional Samples House: Uniform sample: good for when no group-by Senate: Same size sample per group when use all grouping columns: good for queries with all columns Congress: Combines House & Senate, but considers all subsets of groupin

27、g columns, and then scales down,Workload-tuned Biased Sampling - ICICLES GLR00,Biased sampling scheme that dynamically adapts to query workload Exploit data locality - more focus (i.e., #sample points) in frequently-queried regions Let Q = q1, q2, . . . be a query workload, R(qi) = subset of R used

28、in answering query qi L(R, Q) = Extension of R wrt Q = R R(qi) (multiset of tuples)Icicle: Uniform random sample of L(R,Q) Incrementally maintained and adapt (“self-tune”) to workload through Reservoir Sampling technique Vit85 Unbiased Icicle estimators: New formulas to account for duplicates and bi

29、as in sample selection Provably better (smaller variance) than uniform for “focused” queries (that follow the workload model),Workload-tuned Biased Sampling - Lifted Workloads CDN01,Formulate sample selection as an optimization problem Minimize query-answering error for a given workload model Techni

30、que for “lifting a fixed workload W” to produce a probability distribution over all possible queries Similar to kernel density estimation (queries in W = “sample points”),“Fundamental regions” induced by W,q1,q2,R,q,W = q1, q2 ,prob(q|W) = parametric function of qs overlap with queries in W,Workload

31、-tuned Biased Sampling - Lifted Workloads,Problem: Find sample of size k that minimizes expected error for a given “lifted” workloadSolution: Stratified sampling Coc77 Collection of uniform samples (of total size k) over disjoint subsets (“strata”) of the population Much better estimates when varian

32、ce within strata is small Coc77Stratification: Selecting appropriate partitioning of R Using “fundamental regions” as strata is optimal for COUNT For SUM, partition “fundamental regions” further to reduce variance of the aggregated attribute (Neymann technique Coc77)Allocation: Dividing k among stra

33、ta Closed form solutions (valid under certain simplifying assumptions),Data Streams,Data is continually arriving. Collect & maintain synopses on the data. Goal: Highly-accurate approximate answers State-of-the-art: Good techniques for narrow classes of queries E.g., Any one-pass algorithm for collec

34、ting & maintaining a synopsis can be used effectively for data streamsAlternative scenario: A collection of data sets. Compute a compact sketch of each data set & then answer queries (approximately) comparing the data sets E.g., detecting near-duplicates in a collection of web pages: Altavista E.g.,

35、 estimating join sizes among a collection of tables AGM99,Looking Forward.,Optimizing queries for approximation e.g., minimize length of confidence interval at the plan rootExploiting mining-based techniques (e.g., decision trees) for data reduction and approximate query processing see, e.g., BGR01,

36、 GTK01, JMN99Dynamic maintenance of complex (e.g., dependency-based DGR01 or mining-based BGR01) synopsesSynopses and approximate query processing for richer data models and data streams e.g., XPath/XQuery over XML databases,XML Data (Text),RichardFeynmanThe character of Physical LawR.K.NarayanWaiti

37、ng for the Mahatma1981,XML Data (Tree),booklist,book,book,a,t,p,a,t,“Richard”,“The character of physical Law”,g,“Science”,g,“”,“”,“”,f,“Hardcover”,f,l,“Feynman”,f,l,“”,“”,XML Basics,Elements Encode “concepts” in the XML database Nesting denotes association/inclusion Attributes Record information spe

38、cific to an element (e.g., the genre of a book) References Links between elements in different parts of the document,XML vs. Relational Data,row,row,row,name,name,name,phone,phone,phone,“John”,3634,“Sue”,“Dick”,6343,6363,Relation,XML,XML vs. Relational Data,A relation instance is basically a tree wi

39、th: Unbounded fanout at level 1 (i.e., any # of rows) Fixed fanout at level 2 (i.e., fixed # fields)XML data is essentially an arbitrary tree Unbounded fanout at all nodes/levels Any number of levels Variable # of children at different nodes, variable path lengths,XPath Expressions,Examples: /bookli

40、st/book /booklist/book/author /booklist/book/author/lastnameGiven an XML document, the value of a path expression p is a set of elements (= XML subtrees),Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E1

41、4,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v1

42、1 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath expressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,Path Expressions,XPath exp

43、ressions Simple: /A/P/T Branching: /AB/P/T Values: /A/P/T=v11 Result is a set,/,PB3,P6,B9,T13,A1,T11,P7,T12,A2,B5,T10,E14,N4,N8,V4,V10,V11,V12,V13,V8,V14,XPath Syntax,Path wildcards/ = descendant at any level (or self) * = any (single) tag Example: /booklist/lastname Query attributes and attribute c

44、ontent Use “” Examples: /booklist/bookformat=“Paperback”, /booklist/book/genre Branching predicates: Apred Predicate on As subtree using logical connectives (and, or, etc.), path expressions, built-in functions (e.g., contains(), etc. Example: /authorcontains(./lastname, “Fey”),Synopses for XML,Summ

45、arize labeled tree/graph structure for approximate path navigation queries Selectivity estimation: How many elements satisfy p? Approximate answers: Return an approximate XML document as output of an XQuery fragment Key idea: Build a concise Graph Synopsis that captures the path/branching distributi

46、on in limited space Use appropriate uniformity/independence assumptions to approximate path structure Refine synopsis in parts of the XML document where assumptions fail XSketches SIGMOD02,VLDB02, TreeSketches SIGMOD04,Conclusions,Commercial data warehouses: approaching several 100s TB and continuou

47、sly growing Demand for high-speed, interactive analysis (click-stream processing, IP traffic analysis) also increasingApproximate Query Processing “Tame” these TeraBytes and satisfy the need for interactive processing and exploration Great promise Commercial acceptance still lagging, but will most probably grow in coming years Still lots of interesting research to be done!,

copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1