The Energy Data Collection Project.ppt

1. The Energy Data Collection Project

2. The Vision: Ask the Government...
- How have property values in the area changed over the past decade?
- How many people had breast cancer in the area over the past 30 years?
- Is there an orchestra? An art gallery? How far are the nightclubs?
- We're thinking of moving to Denver. What are the schools like there?

3. The Vision: Ask the Government...
- Are alternative energy sources any cheaper to use?
- Which state has the highest oil production?
- How long has the nuclear plant been in service?
- We're thinking of moving to Cambridge. How much does gas cost there?

4. The problem and the solution
- Problem: FedStats has thousands of databases in over seventy Government agencies. Data is duplicated and near-duplicated; even Government officials and specialists cannot find it.
- Solution: Create a system to provide easy, standardized access. This needs a multi-database access engine, a powerful user interface, and a terminology standardization mechanism.

5. The purpose of DGRC
To Make Digital Government Happen:
- Advance information systems research
- Bring the benefits of cutting-edge IS research to government systems
- Help educate government and the community
- Learn needs from government partners to drive next-stage system development
- Build pilot systems as part of new infrastructure

6. Research challenges
- Scale to incorporate many databases: build data models automatically
- Process large and disparate data efficiently: develop fast processing techniques; create aggregation and substitution operators
- Integrate data models across sources and agencies: take a large ontology and link the models into it automatically
- Incorporate additional information that is available from text: use language processing tools to extract it
- Display complex information from distributed sources: develop and evaluate new presentation techniques

7. System Architecture
[Diagram labels: Sources (Text, Tables, Data); Construction phase: deploy DBs, extend ontology]

8. Columbia's Team Approach
- User Interface. Year One: Hatzivassiloglou, Sandhaus. Year Two: Feiner, Temiyabutr
- Database Aggregation. Year One: Gravano, Singla. Year Two: Ross, Zaman
- Automatic Inter-Agency Ontologies. Years One and Two: Klavans, Whitman

9. System interface: Year One Progress
Vasileios Hatzivassiloglou, Jay Sandhaus
Components:
1. Query formation
2. Ontology/glossary browsing for concept navigation
3. Answer display, interaction history
GUI incorporates key technologies for facilitating user access to diverse databases:
- Context-sensitive menu-based input mechanism
- Visualization and navigation of results and the ontology
- Lightweight client runs on multiple platforms without downloads
- Java/Swing implementation allows client-side processing

10. Information Aggregation: Yr. 1 Progress
Luis Gravano, Anurag Singla
- Problem: Data is not in exactly the form the user needs (monthly, not annually; actual values, not averaged)
- Solution: Attempt to provide a unified view of data of various granularities: time period, geographical region, product
- Example over BLS data:
  View: monthly data available for all geographical regions
  Query: monthly prices for LA in 1979
  Answer: yearly price for LA in 1979

11. Aggregation challenges
- Different coverage along these dimensions across data sets
- Users see a simple, unified view of the data; if a query cannot be answered, we answer the closest query that we have data for
- Answers are always exact
- Key challenges:
  - defining query proximity (default vs. user-specific)
  - communicating query relaxation to users
  - defining and navigating the space of answerable queries efficiently

12. Extracting and Structuring Information from Definitions: Yr 1

Judith Klavans, Brian Whitman
Problems:
- Proliferation of terms in domain
- Agencies define terms differently
- Many refer to the same or related entity
- Lengthy and dense term definitions often contain important information which is buried

13. Glossary analysis framework
- Extract ontological information by applying language-sensitive analysis tools
- Structure and deliver to ISI for access and display
- Based on past projects: analysis of definitions in machine-readable dictionaries
- Original domain-specific glossaries
- Gather glossaries, thesauri, definitions from govt agencies
- Create framework into which text will be analyzed

14. DGRC-EDC Plans for Year Two
- User Interface
  - Incorporate new presentation approaches
  - Link ontology access mechanisms to query input
  - Incorporate other DG research (Marchionini)
- Database
  - Integrate existing aggregation prototype
  - Main memory for fast performance
- Lexical Knowledge Bases
  - Incorporate into SENSUS
  - Add web crawler to extend coverage
  - Develop mechanisms to merge definitions

15. End of Part I: DGRC EDC
- Reviewed goals of DGRC Energy Data Collection Project
- Showed first year progress
- Gave early second year results
- Presented Columbia's team approach
- Set out future goals
But what is next?

16. Next Steps for DGRC Growth
Ambitious two-pronged plan:
- Additional Funding for DGRC TRADE (NSF)
- Independent Foundation Funding (leverage NSF investment)

17. One Facet: From DGRC-EDC to DGRC-TRADE
- Builds on past successes
- Brings in a new domain: trade data
- Adds three new enhancements:
  - User Needs and Evaluation: Electronic Data Service at Columbia; users and experts to test usefulness and usability
  - Database: incorporate cross data set aggregation
  - Ontology: add multilingual capability

18. Data Integration
[Diagram labels: Heterogeneous Data Sources (Labor, EPA, EIA, Census); User Interface; Information Access; Definition Ontology; query]

19. Data Integration
[Diagram labels: Heterogeneous Data Sources (Labor, EPA, EIA, Census, Trade); User Interface; Information Access; Definition Ontology; Main Memory Query Processing; Multilingual Access; User Evaluation; Task-based Evaluation; query]

20. Columbia's Electronic Data Service
- Established to serve social science researchers
- Operational unit of the Libraries
- Excellent relationship with faculty, staff and students
- Capable of supporting many levels of development and testing
- Evaluation effort led by Walter Bourne

21. Partners: DGRC Trade
- Evaluation experts from the US and Canada
  - Cognitive evaluation
  - User needs evaluation
  - User interface evaluation
- Social scientists
  - ISERP and CIESIN at Columbia
  - Public Health Policy research

22. Facet Two: Building the DGRC
- Seek substantial Foundation support
- Pursue a large vision
- Involvement of high-level Columbia and ISI administration
- Gather an advisory board to develop a sustainable plan

23. What do we need from the NSF?
1. Information: ways to interact with portals, e.g. private companies delivering (free) government data
2. Contacts: leverage peer-review process of NSF to establish key contacts

24. To Sum
- DGRC Energy Data Collection (EDC)
  - Progress from Year One
  - Plans and early results from Year Two
- Larger Plans for Growing DGRC
  - Trade Proposal (NSF)
  - Plans for other funding

25. Today's Plan: Focus on DGRC-EDC
- Major research challenges:
  - Building and structuring the ontology
  - Automated data aggregation
  - Presentation of complex information
- Major practical challenges:
  - Getting more data into the system
  - Understanding users' needs

26. Thank you! Any questions?

27. Information Integration: Heterogeneity in Aggregation
Luis Gravano, Assistant Professor, Columbia U. (joint work with Anurag Singla and Vasilis Vassalos)

28. Information Integration
- Goal: To Provide Single-Stop Access to Multiple Distributed Autonomous Data Sets
- Data Sets/Sources: Tables with statistical data, potentially produced by different organizations

29. My Research Background
- Databases
- Distributed search and retrieval over text sources

30. Metasearchers: Single-Stop Access to Heterogeneous Text Sources
[Diagram labels: User; Query; Meta Searcher; Source 1, Source 2, ..., Source n; Unified Results]

31. Main Metasearcher Tasks
- Selects good text sources for query (source discovery)
- Evaluates query at these sources (query translation)
- Combines query results from sources (result merging)

32. Some of my Previous Work on Metasearchers
- GlOSS: a scalable source discovery system that selects relevant text sources
- STARTS: a protocol that facilitates metasearching (Participants included Infoseek, Microsoft, Hewlett-Packard, Fulcrum, Verity, and Netscape.)

33. Challenges for Information Integration
- “Semantic” Heterogeneity of Data Sets
- “Syntactic” Heterogeneity of Data Sets
- Varying Granularity of Data Sets
- Varying Data Coverage
- Number of Available Data Sets

34. Challenges for Information Integration
[Same five challenges as slide 33, annotated with the labels “ISI's SIMS” and “Future Work”]

35. Challenges for Information Integration
[Same five challenges as slide 33, annotated with the label “Last Year Focus”]

36. Mediators: Single-Stop Access to Heterogeneous Statistical Sources
[Diagram labels: User; Query; Mediator; Main-Memory DBMS; Traditional DBMS; ...; Unified Results]

37. Varying Data Coverage and Granularity
Example: Average Price of Gasoline from BLS
- Time period
- Geographical region
- Products

38. Varying Data Coverage (I)
Region: US Average
- Product: Leaded Regular Gasoline; Time Period: Oct 1973 to Mar 1991; Source: BLS Series APU000074712
- Product: Leaded Premium Gasoline; Time Period: Oct 1973 to Dec 1983; Source: BLS Series APU000074713

39. Varying Data Coverage (II)
Product: Leaded Regular Gasoline
- Region: San Diego, CA; Time Period: Jan 1978 to Dec 1986; Source: BLS Series APUA42474712
- Region: Boston, Massachusetts; Time Period: Jan 1978 to Jan 1989; Source: BLS Series APUA10374712

40. Varying Data Coverage (III)
- Geographical coverage varies for different data fields (even for same gasoline type)
- Not all data fields available for all gasoline types (e.g., Consumer Price Index available for Unleaded Regular but not for Leaded Premium)

41. Varying Data Granularity
Granularity “hierarchies” for:
- Time period
- Geographical region
- Products

Granularity Hierarchy for Time Period
[Diagram nodes: Year, Quarter, Month, Week, Day]

Granularity Hierarchy for Geographical Region
[Diagram nodes: World, Country, Region (spanning cities or states), State, City]

Granularity Hierarchy for Products
[Diagram nodes: Gasoline; Leaded Gasoline; Unleaded Gasoline; Leaded Gasoline (Regular); Leaded Gasoline (Midgrade); Leaded Gasoline (Premium)]

45. Some BLS Data Sets for our Demo (Gasoline Unleaded Regular, Average Price)
- US; Monthly; 10/1973 to 3/1991; Source: APU000074712
- San Diego; Monthly; 1/1978 to 12/1986; Source: APUA42474712
- Los Angeles; Monthly; 1/1986 to 4/1991; Source: APUA42174712
- Los Angeles; Yearly; 1978 to 1985; Source: APUA42174712 (aggregated)

46-49. What Do We Show Users as Data Sets Available for Querying?
Possibility 1: All the details!
- US; Monthly; 10/1973 to 3/1991
- San Diego; Monthly; 1/1978 to 12/1986
- Los Angeles; Monthly; 1/1986 to 4/1991
- Los Angeles; Yearly; 1978 to 1985
Advantages: Users can exploit all data sets...
Disadvantages: ...if they don't get overwhelmed first.

50-52. What Do We Show Users as Data Sets Available for Querying?
Possibility 2: “Least common denominator” of data sets, e.g., “only yearly data available”
Advantages: Users get a unified view of the data.
Disadvantages: Almost nothing is left!

53. What Do We Show Users as Data Sets Available for Querying?
Possibility 3 (our approach): Define a reasonably expressive, unified view

54. Our Approach
- Users have a simple, unified view of the data.
- If a query cannot be answered, we answer the closest query that we have data for.
- Answers are always exact.

55. Example over BLS Sources
- View: monthly data available for all geographical regions
- Query: monthly prices for LA in 1979
- Answer: yearly price for LA in 1979

56. What Do We Show Users as Data Sets Available for Querying?
Possibility 3 (our approach): Define a reasonably expressive, unified view
Advantages: Users get a unified view of the data; most data sets exploited.
Disadvantages: Sometimes user queries cannot be answered.

57. Key Challenges
- Defining query proximity (“default” vs. user-specific)
- Communicating “query relaxation” to users
- Defining and navigating the space of “answerable queries” efficiently

59. Proof-of-Concept Demo
- Four BLS sources
- Simple integrated view
- Results for “closest” query when original answer cannot be computed
http://db-pc01.cs.columbia.edu/digigov/Main.html

60. Some Open Issues
- Definition of “right view”
- Interaction with user interface
- Addition of aggregation into ISI's SIMS system

61. Aggregation in Main Memory
Kenneth A. Ross, Kazi A. Zaman
Columbia University,

New York

62. Research Experience
- Complex query processing
- Data Warehousing
- Main memory databases

64. Outline
- Introduction to Datacubes
- Frameworks for querying cubes
- The Main Memory based framework
- Experimental Results
- Conclusions and Plan

65. The CUBE BY Operator
Input relation:
  State  Year  Grade    Sales
  CA     1997  Regular  90
  NY     1997  Premium  70
  CA     1998  Premium  65
  NY     1998  Premium  95
Result of CUBE BY (sum Sales), showing the first input record and some of the additional records:
  State  Year  Grade    Sales
  CA     1997  Regular  90
  CA     1997  ALL      90
  ALL    1997  Regular  90
  CA     ALL   Regular  90
  ALL    1997  ALL      160
  ALL    ALL   Regular  90
  CA     ALL   ALL      155
  ALL    ALL   ALL      320
  ...
Large increase in total size, especially with many dimensions.

66. Lattice Representation
[Lattice nodes: (State, Year, Grade); (State, Year); (State, Grade); (Year, Grade); (State); (Year); (Grade)]

67. Modeling Queries
Slice queries ask for a single aggregate record:
  SELECT State, year, sum(sales)
  FROM "BLS-12345"
  GROUP BY State, year
  HAVING State = 'NY' AND year = '1998'

68. Existing Frameworks
[Lattice nodes: (State, Year, Grade); (State, Year); (State, Grade); (Year, Grade); (State); (Year); (Grade)]
- Choose subset of cube to materialize based on workload.
- Materialize on disk.
- Appropriate record recovered or computed for incoming slice query.
- Drawbacks: Ignores clustering of relation on disk. Smallest unit of materialization is too big.

69. Our approach
[Lattice nodes: (State, Year, Grade); (State, Year); (State, Grade); (Year, Grade); (State); (Year); (Grade)]
- The full cube is often larger than available memory, but...
- The finest granularity aggregate may fit.
- Any record can be computed without having to go to disk.
- How should the finest granularity be organized?

70. Framework
[Diagram labels: Query q; Level-1 Store; Level-2 Store; selected coarse records in hash table; finest granularity cuboid; records in linked lists; slot directory]

71. The Level-1 Store
- Records are pairs stored in a hash table.
- Records can contain ALLs.
- Given query Q, form composite key and check level-1 store (constant time).
- If not found, use level-2 store,
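
The two-level lookup described for the level-1 and level-2 stores can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the class name `TwoLevelStore`, the method names, and the choice of which coarse records to materialize are all hypothetical, and the level-2 store is modeled as a plain dictionary scan rather than the linked lists and slot directory named on the Framework slide. The data is the four-record relation from the CUBE BY slide.

```python
ALL = "ALL"  # wildcard marker, as in the CUBE BY output records

# Finest-granularity cuboid: (State, Year, Grade) -> Sales (from the CUBE BY slide).
BASE = {
    ("CA", 1997, "Regular"): 90,
    ("NY", 1997, "Premium"): 70,
    ("CA", 1998, "Premium"): 65,
    ("NY", 1998, "Premium"): 95,
}


class TwoLevelStore:
    """Hypothetical sketch: a hash table of selected coarse records (level 1)
    in front of the finest-granularity cuboid (level 2)."""

    def __init__(self, base, materialized=()):
        self.level2 = dict(base)
        # Precompute only the chosen coarse records; keys may contain ALL.
        self.level1 = {key: self._scan(key) for key in materialized}

    def _scan(self, key):
        # Aggregate every finest-granularity record that matches the key,
        # treating ALL as "any value" in that dimension.
        return sum(
            sales
            for record, sales in self.level2.items()
            if all(k == ALL or k == r for k, r in zip(key, record))
        )

    def slice_query(self, state=ALL, year=ALL, grade=ALL):
        key = (state, year, grade)      # composite key for the slice query
        if key in self.level1:          # constant-time hash lookup
            return self.level1[key], "level-1"
        return self._scan(key), "level-2"  # fall back to the finest cuboid


store = TwoLevelStore(BASE, materialized=[(ALL, ALL, ALL), ("CA", ALL, ALL)])
print(store.slice_query())                        # served from level 1
print(store.slice_query(state="NY", year=1998))   # computed from level 2
```

In this sketch every slice query is answered from memory: either directly from the hash table or by aggregating the finest-granularity records, which matches the claim that any record can be computed without going to disk. How the real level-2 store avoids a full scan is not shown here.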

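The "answer the closest query" idea from the information-integration part of the deck (slides 54 and 55) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `answer_closest`, the rule that relaxation only coarsens the time granularity, and the simplification of each data set's coverage to whole years. The slides themselves list the definition of query proximity as an open challenge, so this is one possible policy, not the project's.

```python
# Time granularities from finest to coarsest (only the two used in the demo data).
TIME_LEVELS = ["monthly", "yearly"]

# (region, granularity, first_year, last_year), simplified to whole years
# from the demo data sets listed on slide 45.
DATASETS = [
    ("US", "monthly", 1973, 1991),
    ("San Diego", "monthly", 1978, 1986),
    ("Los Angeles", "monthly", 1986, 1991),
    ("Los Angeles", "yearly", 1978, 1985),
]


def answer_closest(region, granularity, year):
    """Return a description of the closest answerable query, or None.

    Hypothetical relaxation policy: keep region and year fixed and try the
    requested granularity first, then each coarser one.
    """
    start = TIME_LEVELS.index(granularity)
    for level in TIME_LEVELS[start:]:
        for ds_region, ds_level, first, last in DATASETS:
            if ds_region == region and ds_level == level and first <= year <= last:
                return {
                    "region": region,
                    "granularity": level,
                    "year": year,
                    # Tell the user when the query had to be relaxed.
                    "relaxed": level != granularity,
                }
    return None


# The example from the slides: monthly prices for LA in 1979.
print(answer_closest("Los Angeles", "monthly", 1979))
```

Returning the `relaxed` flag alongside the answer is one simple way to address the "communicating query relaxation to users" challenge: the interface can state that yearly data was substituted for the monthly data requested.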