The Energy Data Collection Project


Slide 1: The Energy Data Collection Project

Slide 2: The Vision: Ask the Government.
- How have property values in the area changed over the past decade?
- How many people had breast cancer in the area over the past 30 years?
- Is there an orchestra? An art gallery? How far are the nightclubs?
- We're thinking of moving to Denver. What are the schools like there?

Slide 3: The Vision: Ask the Government.
- Are alternative energy sources any cheaper to use?
- Which state has the highest oil production?
- How long has the nuclear plant been in service?
- We're thinking of moving to Cambridge. How much does gas cost there?

Slide 4: The problem and the solution
Problem: FedStats has thousands of databases in over seventy Government agencies:
- data is duplicated and near-duplicated
- even Government officials and specialists cannot find it
Solution: Create a system to provide easy standardized access:
- need multi-database access engine
- need powerful user interface
- need terminology standardization mechanism

Slide 5: The purpose of DGRC
To Make Digital Government Happen
- Advance information systems research
- Bring the benefits of cutting edge IS research to government systems
- Help educate government and the community
- Learn needs from government partners to drive next stage system development
- Build pilot systems as part of new infrastructure

Slide 6: Research challenges
- Scale to incorporate many databases: build data models automatically
- Process large and disparate data efficiently: develop fast processing techniques; create aggregation and substitution operators
- Integrate data models across sources and agencies: take a large ontology and link the models into it automatically
- Incorporate additional information that is available from text: use language processing tools to extract it
- Display complex information from distributed sources: develop and evaluate new presentation techniques

Slide 7: System Architecture
[Diagram. Labels: Sources; Construction phase: Deploy DBs, Extend ontol.; Text; Tables; Data]

Slide 8: Columbia's Team Approach
- User Interface. Year One: Hatzivassiloglou, Sandhaus. Year Two: Feiner, Temiyabutr
- Database Aggregation. Year One: Gravano, Singla. Year Two: Ross, Zaman
- Automatic Inter-Agency Ontologies. Years One and Two: Klavans, Whitman

Slide 9: System interface Year One Progress
Components:
1. Query formation
2. Ontology/glossary browsing for concept navigation
3. Answer display, interaction history
GUI incorporates key technologies for facilitating user access to diverse databases:
- Context-sensitive menu-based input mechanism
- Visualization and navigation of results and the ontology
- Lightweight client runs on multiple platforms without downloads
- Java/Swing implementation allows client-side processing
Vasileios Hatzivassiloglou, Jay Sandhaus

Slide 10: Information Aggregation Yr. 1 Progress
Problem: Data is not in exactly the form the user needs (monthly, not annually; actual values, not averaged)
Solution: Attempt to provide unified view of data of various granularities:
- time period
- geographical region
- product
Example over BLS data:
- View: monthly data available for all geographical regions
- Query: monthly prices for LA in 1979
- Answer: yearly price for LA in 1979
Luis Gravano, Anurag Singla

Slide 11: Aggregation challenges
- Different coverage along these dimensions across data sets
- Users see a simple, unified view of the data; if a query cannot be answered, we answer the closest query that we have data for
- Answers are always exact
Key challenges:
- defining query proximity (default vs. user-specific)
- communicating query relaxation to users
- defining and navigating the space of answerable queries efficiently

Slide 12: Extracting and Structuring Information from Definitions Yr 1
Problems:
- Proliferation of terms in domain
- Agencies define terms differently
- Many refer to the same or related entity
- Lengthy and dense term definitions often contain important information which is buried
Judith Klavans, Brian Whitman

Slide 13: Glossary analysis framework
- Extract ontological information applying language sensitive analysis tools
- Structure and deliver to ISI for access and display
- Based on past projects: analysis of definitions in machine-readable dictionaries; original domain specific glossaries
- Gather glossaries, thesauri, definitions from govt agencies
- Create framework into which text will be analyzed

Slide 14: DGRC-EDC Plans for Year Two
User Interface
- Incorporate new presentation approaches
- Link ontology access mechanisms to query input
- Incorporate other DG research (Marchionini)
Database
- Integrate existing aggregation prototype
- Main memory for fast performance
Lexical Knowledge Bases
- Incorporate into SENSUS
- Add web crawler to extend coverage
- Develop mechanisms to merge definitions

Slide 15: End of Part I: DGRC EDC
- Reviewed goals of DGRC Energy Data Collection Project
- Showed first year progress
- Gave early second year results
- Presented Columbia's team approach
- Set out future goals
But what is next?

Slide 16: Next Steps for DGRC Growth
Ambitious two-pronged plan
- Additional Funding For DGRC TRADE (NSF)
- Independent Foundation Funding (leverage NSF Investment)

Slide 17: One Facet: From DGRC-EDC to DGRC-TRADE
- Builds on past successes
- Brings in a new domain: trade data
- Adds three new enhancements:
  - User Needs and Evaluation: Electronic Data Service at Columbia; Users and Experts to test usefulness and usability
  - Database: incorporate cross data set aggregation
  - Ontology: add multilingual capability

Slide 18: Data Integration
[Diagram. Labels: Heterogeneous Data Sources (Labor, EPA, EIA, Census); User Interface; Information Access; Definition Ontology; query]

Slide 19: Data Integration
[Diagram. Labels: Heterogeneous Data Sources (Labor, EPA, EIA, Census, Trade); User Interface; Information Access; Definition Ontology; Main Memory Query Processing; Multilingual Access; User Evaluation; Task-based Evaluation; query]

Slide 20: Columbia's Electronic Data Service
- Established to serve social science researchers
- Operational unit of the Libraries
- Excellent relationship with faculty, staff and students
- Capable of supporting many levels of development and testing
- Evaluation effort led by Walter Bourne

Slide 21: Partners DGRC Trade
- Evaluation experts from the US and Canada
  - Cognitive evaluation
  - User needs evaluation
  - User interface evaluation
- Social scientists
- ISERP and CIESIN at Columbia
- Public Health Policy research

Slide 22: Facet Two: Building the DGRC
- Seek substantial Foundation support
- Pursue a large vision
- Involvement of high level Columbia and ISI administration
- Gather an advisory board to develop a sustainable plan

Slide 23: What do we need from the NSF?
1. Information
   - Ways to interact with portals
   - E.g. Private companies delivering (free) government data
2. Contacts
   - Leverage peer-review process of NSF to establish key contacts

Slide 24: To Sum
DGRC Energy Data Collection (EDC)
- Progress from Year One
- Plans and early results from Year Two
Larger Plans for Growing DGRC
- Trade Proposal NSF
- Plans for other funding

Slide 25: Today's Plan: Focus on DGRC-EDC
Major research challenges:
- Building and structuring the ontology
- Automated data aggregation
- Presentation of complex information
Major practical challenges:
- Getting more data into the system
- Understanding users' needs

Slide 26: Thank you! Any questions?

Slide 27: Information Integration: Heterogeneity in Aggregation
Luis Gravano, Assistant Professor, Columbia U. (joint work with Anurag Singla and Vasilis Vassalos)

Slide 28: Information Integration
Goal: To Provide Single-Stop Access to Multiple Distributed Autonomous Data Sets
Data Sets/Sources: Tables with statistical data, potentially produced by different organizations

Slide 29: My Research Background
- Databases
- Distributed search and retrieval over text sources

Slide 30: Metasearchers: Single-Stop Access to Heterogeneous Text Sources
[Diagram. Labels: User; Query; Meta Searcher; Source 1, Source 2, ..., Source n; Unified Results]

Slide 31: Main Metasearcher Tasks
- Selects good text sources for query (source discovery)
- Evaluates query at these sources (query translation)
- Combines query results from sources (result merging)

Slide 32: Some of my Previous Work on Metasearchers
- GlOSS: a scalable source discovery system that selects relevant text sources
- STARTS: a protocol that facilitates metasearching (Participants included Infoseek, Microsoft, Hewlett-Packard, Fulcrum, Verity, and Netscape.)

Slide 33: Challenges for Information Integration
- “Semantic” Heterogeneity of Data Sets
- “Syntactic” Heterogeneity of Data Sets
- Varying Granularity of Data Sets
- Varying Data Coverage
- Number of Available Data Sets

Slide 34: Challenges for Information Integration
Same five challenges as slide 33, annotated with the labels: ISI's SIMS; Future Work

Slide 35: Challenges for Information Integration
Same five challenges as slide 33, annotated with the label: Last Year Focus

Slide 36: Mediators:
Single-Stop Access to Heterogeneous Statistical Sources
[Diagram. Labels: User; Query; Mediator; Main-Memory DBMS; Traditional DBMS; ...; Unified Results]

Slide 37: Varying Data Coverage and Granularity
Average Price of Gasoline from BLS
- Time period
- Geographical region
- Products

Slide 38: Varying Data Coverage (I)
Region: US Average
- Product: Leaded Regular Gasoline; Time Period: Oct 1973 to Mar 1991; Source: BLS Series APU000074712
- Product: Leaded Premium Gasoline; Time Period: Oct 1973 to Dec 1983; Source: BLS Series APU000074713

Slide 39: Varying Data Coverage (II)
Product: Leaded Regular Gasoline
- Region: San Diego, CA; Time Period: Jan 1978 to Dec 1986; Source: BLS Series APUA42474712
- Region: Boston, Massachusetts; Time Period: Jan 1978 to Jan 1989; Source: BLS Series APUA10374712

Slide 40: Varying Data Coverage (III)
- Geographical coverage varies for different data fields (even for same gasoline type)
- Not all data fields available for all gasoline types (e.g., Consumer Price Index available for Unleaded Regular but not for Leaded Premium)

Slide 41: Varying Data Granularity
Granularity “hierarchies” for:
- Time period
- Geographical region
- Products

Slide 42: Granularity Hierarchy for Time Period
[Diagram. Nodes: Year; Day; Week; Quarter; Month]

Slide 43: Granularity Hierarchy for Geographical Region
[Diagram. Nodes: Region (spanning cities or states); City; World; Country; State]

Slide 44: Granularity Hierarchy for Products
[Diagram. Nodes: Leaded Gasoline; Gasoline; Unleaded Gasoline; Leaded Gasoline (Regular); Leaded Gasoline (Midgrade); Leaded Gasoline (Premium)]

Slide 45: Some BLS Data Sets for our Demo (Gasoline Unleaded Regular, Average Price)
Region        Granularity   Time Period          Source
US            Monthly       10/1973 to 3/1991    APU000074712
San Diego     Monthly       1/1978 to 12/1986    APUA42474712
Los Angeles   Monthly       1/1986 to 4/1991     APUA42174712
Los Angeles   Yearly        1978 to 1985         APUA42174712 (aggregated)

Slides 46-47: What Do We Show Users as Data Sets Available for Querying?
Possibility 1: All the details!
- US; Monthly; 10/1973 to 3/1991
- San Diego; Monthly; 1/1978 to 12/1986
- Los Angeles; Monthly; 1/1986 to 4/1991
- Los Angeles; Yearly; 1978 to 1985

Slides 48-49: What Do We Show Users as Data Sets Available for Querying?
Possibility 1: All the details!
- Advantages: Users can exploit all data sets
- Disadvantages: ... if they don't get overwhelmed first.

Slides 50-52: What Do We Show Users as Data Sets Available for Querying?
Possibility 2: “Least common denominator” of data sets (e.g., “only yearly data available”)
- Advantages: Users get a unified view of the data.
- Disadvantages: Almost nothing is left!

Slide 53: What Do We Show Users as Data Sets Available for Querying?
Possibility 3 (our approach): Define a reasonably expressive, unified view

Slide 54: Our Approach
- Users have a simple, unified view of the data.
- If a query cannot be answered, we answer the closest query that we have data for.
- Answers are always exact.

Slide 55: Example over BLS Sources
- View: monthly data available for all geographical regions
- Query: monthly prices for LA in 1979
- Answer: yearly price for LA in 1979

Slide 56: What Do We Show Users as Data Sets Available for Querying?
Possibility 3 (our approach): Define a reasonably expressive, unified view
- Advantages: Users get a unified view of the data; most data sets exploited.
- Disadvantages: Sometimes user queries cannot be answered.

Slide 57: Key Challenges
- Defining query proximity (“default” vs. user-specific)
- Communicating “query relaxation” to users
- Defining and navigating the space of “answerable queries” efficiently

Slide 59: Proof-of-Concept Demo
- Four BLS sources
- Simple integrated view
- Results for “closest” query when original answer cannot be computed
http://db-pc01.cs.columbia.edu/digigov/Main.html

Slide 60: Some Open Issues
- Definition of “right view”
- Interaction with user interface
- Addition of aggregation into ISI's SIMS system

Slide 61: Aggregation in Main Memory
Kenneth A. Ross, Kazi A. Zaman
Columbia University,
New York

Slide 62: Research Experience
- Complex query processing
- Data Warehousing
- Main memory databases

Slide 63: (no text content)

Slide 64: Outline
- Introduction to Datacubes
- Frameworks for querying cubes
- The Main Memory based framework
- Experimental Results
- Conclusions and Plan

Slide 65: The CUBE BY Operator
Input relation:
State   Year   Grade     Sales
CA      1997   Regular   90
NY      1997   Premium   70
CA      1998   Premium   65
NY      1998   Premium   95

Result of CUBE BY (sum Sales), showing some of the additional records:
State   Year   Grade     Sales
CA      1997   Regular   90
CA      1997   ALL       90
ALL     1997   Regular   90
CA      ALL    Regular   90
ALL     1997   ALL       160
ALL     ALL    Regular   90
CA      ALL    ALL       155
ALL     ALL    ALL       320
...

Large increase in total size, especially with many dimensions.

Slide 66: Lattice Representation
[Diagram. Nodes: (State, Year, Grade); (State, Year); (State, Grade); (Year, Grade); (State); (Year); (Grade)]

Slide 67: Modeling Queries
Slice queries ask for a single aggregate record:
    SELECT State, year, sum(sales)
    FROM BLS-12345
    GROUP BY State, year
    HAVING State = 'NY' AND year = '1998'

Slide 68: Existing Frameworks
[Diagram. Lattice nodes: (State, Year, Grade); (State, Year); (State, Grade); (Year, Grade); (State); (Year); (Grade)]
- Choose subset of cube to materialize based on workload.
- Materialize on disk
- Appropriate record recovered or computed for incoming slice query
Drawbacks:
- Ignores clustering of relation on disk.
- Smallest unit of materialization is too big.

Slide 69: Our approach
[Diagram. Lattice nodes: (State, Year, Grade); (State, Year); (State, Grade); (Year, Grade); (State); (Year); (Grade)]
- The full cube is often larger than available memory, but ...
- The finest granularity aggregate may fit.
- Any record can be computed without having to go to disk.
- How should the finest granularity be organized?

Slide 70: Framework
[Diagram. Labels: Query q; Level-1 Store; Selected coarse records in hash table; Level-2 Store; Finest granularity cuboid; Slot directory; records in linked lists]

Slide 71: The Level-1 Store
- Records are pairs stored in a hash table.
- Records can contain ALLs
- Given query Q, form composite key and check level-1 store (constant time).
- If not found, use level-2 store
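
The lookup order described on slides 69 to 71 can be illustrated with a small Python sketch. It is not the authors' implementation: the names `cube_by`, `TwoLevelStore`, and `slice_query` are hypothetical, the level-2 store is a plain dictionary scanned linearly rather than the slot directory with linked lists shown on slide 70, and the choice of which coarse records go into level 1 is arbitrary. The data is the four-row input relation from slide 65.

```python
from itertools import product

ALL = "ALL"

# Finest-granularity records from slide 65: (State, Year, Grade) -> Sales.
BASE = {
    ("CA", 1997, "Regular"): 90,
    ("NY", 1997, "Premium"): 70,
    ("CA", 1998, "Premium"): 65,
    ("NY", 1998, "Premium"): 95,
}


def cube_by(base):
    """Compute the full cube: every dimension is either kept or set to ALL."""
    cube = {}
    for rec_key, sales in base.items():
        # 2**3 combinations of "keep the value" / "replace it with ALL".
        for mask in product((False, True), repeat=len(rec_key)):
            key = tuple(ALL if hide else value for hide, value in zip(mask, rec_key))
            cube[key] = cube.get(key, 0) + sales
    return cube


class TwoLevelStore:
    """Hypothetical two-level store: coarse records in a hash table (level 1),
    finest-granularity records kept in memory as the fallback (level 2)."""

    def __init__(self, base, level1_keys):
        full_cube = cube_by(base)
        # Level 1: selected coarse records, looked up by composite key.
        self.level1 = {key: full_cube[key] for key in level1_keys}
        # Level 2: the finest-granularity cuboid (simplified to a dict here).
        self.level2 = dict(base)

    def slice_query(self, key):
        # Constant-time check of the level-1 hash table first.
        if key in self.level1:
            return self.level1[key]
        # Otherwise aggregate matching finest-granularity records in memory.
        matching = [
            sales
            for rec_key, sales in self.level2.items()
            if all(q == ALL or q == r for q, r in zip(key, rec_key))
        ]
        return sum(matching) if matching else None


store = TwoLevelStore(BASE, level1_keys=[(ALL, ALL, ALL), ("CA", ALL, ALL)])
print(len(cube_by(BASE)))                    # 21 records from 4 input rows
print(store.slice_query((ALL, ALL, ALL)))    # 320, served from level 1
print(store.slice_query((ALL, 1997, ALL)))   # 160, computed from level 2
print(store.slice_query(("NY", 1998, ALL)))  # 95, the slice query of slide 67
```

Even on this tiny relation the full cube has 21 records, which is the size increase slide 65 points out; keeping only a few coarse records plus the finest cuboid lets every slice query be answered without materializing the rest.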

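
The "answer the closest query" idea from slides 54 to 57 can be sketched in Python as follows. This is only one possible reading: the deck lists the definition of query proximity as an open challenge, so the rule used here (keep region and year fixed, relax only the time granularity from Monthly to Yearly) is an assumption, as are the names `DataSet`, `CATALOG`, and `answer_closest`. The catalog entries come from slide 45; treating the yearly range "1978 to 1985" as January 1978 through December 1985 is also an assumption.

```python
from dataclasses import dataclass

# Time granularities ordered from finest to coarsest (subset of slide 42).
TIME_LEVELS = ["Monthly", "Yearly"]


@dataclass(frozen=True)
class DataSet:
    region: str
    granularity: str
    start: tuple  # (year, month) of the first observation
    end: tuple    # (year, month) of the last observation
    source: str


# Data sets listed on slide 45.
CATALOG = [
    DataSet("US", "Monthly", (1973, 10), (1991, 3), "APU000074712"),
    DataSet("San Diego", "Monthly", (1978, 1), (1986, 12), "APUA42474712"),
    DataSet("Los Angeles", "Monthly", (1986, 1), (1991, 4), "APUA42174712"),
    DataSet("Los Angeles", "Yearly", (1978, 1), (1985, 12), "APUA42174712 (aggregated)"),
]


def covers_year(data_set, year):
    """True if the data set spans the whole calendar year."""
    return data_set.start <= (year, 1) and (year, 12) <= data_set.end


def answer_closest(region, granularity, year):
    """Return (granularity_used, data_set) for the closest answerable query.

    Tries the requested granularity first, then each coarser one.
    Returns None if no data set can answer even a relaxed query.
    """
    first = TIME_LEVELS.index(granularity)
    for level in TIME_LEVELS[first:]:
        for data_set in CATALOG:
            if (data_set.region == region
                    and data_set.granularity == level
                    and covers_year(data_set, year)):
                return level, data_set
    return None


# Slide 55: monthly prices for LA in 1979 are unavailable, so the
# query is relaxed and answered from the yearly data set instead.
level, data_set = answer_closest("Los Angeles", "Monthly", 1979)
print(level, data_set.source)  # Yearly APUA42174712 (aggregated)
```

Returning the granularity actually used alongside the data set gives a user interface what it needs to tell the user that the query was relaxed, which is the second challenge on slide 57.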