Introduction to Stanford DB Group Research.ppt


1. Introduction to Stanford DB Group Research
Li Ruixuan
http:/

2. Contents
- Introduction
- Past projects
- Current projects
- Events
- References
- Links

3. The Stanford Database Group
- “Mainstream” faculty
  - Hector Garcia-Molina
  - Jennifer Widom
  - Jeff Ullman
  - Gio Wiederhold
- “Adjunct” faculty
  - Chris Manning (natural language processing)
  - Rajeev Motwani (theory)
  - Terry Winograd (human-computer interaction)
- A.k.a. Stanford InfoLab

4. Database Group (contd)
- Approximately 25 Ph.D. students
- Varying numbers of M.S. and undergraduate students
- Handful of visitors
- One senior research associate
- One systems administrator, one programmer
- Excellent administrative staff
- Resident photographer

5. Research Areas (very coarse)
- Digital libraries
- Peer-to-peer systems
- Data streams
- Replication, caching, archiving, broadcast
- The Web
- Ontologies, semantic Web
- Data mining
- Miscellaneous

6. Past Projects
- LIC: Large-Scale Interoperation and Composition (1999) - mediator (SKC, OntoWeb, CHAIMS, SimQL, image DB)
- SKC: Scalable Knowledge Composition (2000) - semantic heterogeneity
- TID: Trusted Image Distribution (2001) - Image Filtering for Secure Distribution of Medical Information
- Image Database: Content-based Image Retrieval (2003)
- SimQL: Simulation Access Language (2001) - Software modules in manufacturing, acquisition, and planning systems

7. Past Projects (contd)
- TSIMMIS: Wrapping and mediation for heterogeneous information sources (1998)
- Lore: A Database Management System for XML (2000)
- WHIPS: WareHouse Information Prototype at Stanford (1998) - Data warehouse creation and maintenance
- MIDAS: Mining Data at Stanford (1999)
- WSQ: Web-Supported Queries (2000) - Integrating database queries and Web searches

8. Current Projects
- WebBase: Crawling, storage, indexing, and querying of large collections of Web pages (Garcia-Molina)
- STREAM: A Database Management System for Data Streams (Widom)
- Peers: Building primitives for peer-to-peer systems (Garcia-Molina)
- Digital Libraries: Interoperating on-line services for end-user support (TID, WebBase, OntoAgents) (Garcia-Molina)
- TRAPP: Approximate data caching: trading precision for performance (Widom)
- CHAIMS: Compiling High-level Access Interfaces for Multi-site Software (1999) (Wiederhold)
- OntoAgents: Ontology based Infrastructure for Agents (2002) (Wiederhold)

9. WebBase: Objectives
- Provide a storage infrastructure for Web-like content
- Store a sizeable portion of the Web
- Enable researchers to easily build indexes of page features across large sets of pages
- Distribute WebBase content via multicast channels
- Support structure and content-based querying over the stored collection

10. WebBase: Architecture
(figure)

11. WebBase: Current Status
- Efficient “smart” crawler
  - Parallelism
  - Freshness & Relevance
- Efficient and scalable indexing
  - Distributed Web-scale content indexes
  - Indexes over graph structure
- Unicast dissemination
  - Within Stanford
  - External clients: Columbia, U.Wash, U.C.Berkeley

12. WebBase: In Progress
- WebBase Infrastructure
  - Multicast dissemination
  - Complex queries
- Other work
  - PageRank extensions
  - Clustering and similarity search
  - Structured data extraction
  - Hidden Web crawling

13. Data Streams: Motivation
- Traditional DBMS - data stored in finite, persistent data sets
- New applications - data as multiple, continuous, rapid, time-varying data streams
  - Network monitoring and traffic engineering
  - Security applications
  - Telecom call records
  - Financial applications
  - Web logs and click-streams
  - Sensor networks
  - Manufacturing processes

14. STREAM: Architecture
(figure)

15. STREAM: Challenges
- Multiple, continuous, rapid, time-varying streams of data
- Queries may be continuous (not just one-time)
  - Evaluated continuously as stream data arrives
  - Answer updated over time
- Queries may be complex
  - Beyond element-at-a-time processing
  - Beyond stream-at-a-time processing

16. DBMS versus DSMS
- DBMS
  - Persistent relations
  - One-time queries
  - Random access
  - Access plan determined by query processor and physical DB design
  - “Unbounded” disk store
- DSMS
  - Transient streams (and persistent relations)
  - Continuous queries
  - Sequential access
  - Unpredictable data arrival and characteristics
  - Bounded main memory

17. STREAM: Current Status
- Data streams and stored relations
- Declarative language for registering continuous queries
- Flexible query plans
- Designed to cope with high data rates and query workloads
  - Graceful approximation when needed
  - Careful resource allocation and usage
- Relational, centralized (for now)

18. STREAM: Ongoing Work
- Algebra for streams
- Semantics for continuous queries
- Synopses and algorithmic issues
- Memory management issues
- Exploiting constraints on streams
- Approximation in query processing
- Distributed stream processing
- System development

19. STREAM: Related Work
- Amazon/Cougar (Cornell) - sensors
- Aurora (Brown/MIT) - sensor monitoring, dataflow
- Hancock (AT&T) - telecom streams
- Niagara (OGI/Wisconsin) - Internet XML databases
- OpenCQ (Georgia) - triggers, incremental view maintenance
- Stream (Stanford) - general-purpose DSMS
- Tapestry (Xerox) - pub/sub content-based filtering
- Telegraph (Berkeley) - adaptive engine for sensors
- Tribeca (Bellcore) - network monitoring

20. Peer-To-Peer Systems
- Multiple sites (at edge)
- Distributed resources
- Sites are autonomous (different owners)
- Sites are both clients and servers
- Sites have equal functionality

21. P2P Benefits
- Pooling available (inexpensive) resources
- High availability and fault-tolerance
- Self-organization

22. P2P Challenges
- Search
  - Query Expressiveness
  - Comprehensiveness
  - Topology
  - Data Placement
  - Message Routing
- Resource Management
  - Fairness
  - Load balancing
- Security & Privacy
  - Anonymity
  - Reputation
  - Accountability
  - Information Preservation
  - Information Quality
  - Trust
  - Denial of service attacks

23. Peers: Stanford Research
- New Architectures
- Performance Modeling and Optimization
- Security and Trust
- Distributed Resource Management
- Applications

24. Digital Library Project: Overview
(figure)

25. DigLib Projects: DLI1, DLI2
- Resource Discovery
- Retrieving Information
- Interpreting Information
- Managing Information
- Sharing Information

26. DigLib: Resource Discovery
- Geographic Views (tools that help you locate different types of information more systematically across a large and diverse set of information sources)

27. DigLib: Retrieving Information
- Information Tiling
- PalmPilot Infrastructure (PDA)
- Power Browsing (PDA)
- Query Translator
- SDLIP (Simple Digital Library Interoperability Protocol)
- Value Filtering
- WebBase

28. DigLib: Interpreting Information
- Murals (tools to help a user interpret and organize search results)
- Web Clustering

29. DigLib: Managing Information
- Archival Repositories
- Archiving Movie
- InterBib (a tool for maintaining bibliographic information)
- Medical Transport Info
- PhotoBrowser

30. DigLib: Sharing Information
- Diet ORB (PDA, based on MICO)
- Digital Wallets
- Mobile Info Delivery
- Mobile Security
- Multicasting

31. DLI1 Projects (95-99)
- AHA
- ComMentor
- DLITE
- Google
- GLOSS
- FAB
- Grassroots
- Metadata Architecture
- RManage/FIRM
- SenseMaker
- SCAM
- Shopping Models, U-PAI
- SONIA
- STARTS
- WebWriter

32. TRAPP: Overview
- TRAPP: Tradeoff in Replication Precision and Performance
- A.k.a. Approximate Data Caching
- Project goal: investigate techniques that permit controlled and explicit relaxation of data precision in exchange for improved performance

33. TRAPP: Motivation
- Transactional consistency is too expensive
- Even nontransactional propagation of every update is still too expensive in many cases
- Solution: Approximate Caching
  - Exploit the fact that many applications do not require exact consistency
  - Avoid propagating insignificant updates
  - Trade cache precision for network load

34. Example: TRAPP Over Numeric Data
- Caches store intervals that bound the exact source values
- Sources refresh when a value leaves its interval
- Query answers are intervals
- Precision constraints specify the maximum width

35. Eg (contd): Querying in TRAPP
- For one-time aggregation queries:
  - Answers are computed by combining approximate cached data and exact source data
- At query time:
  - Find a low-cost subset of sources to probe so that the final answer has adequate precision
  - Algorithm determined by aggregation function
  - Some easy, some hard

36. TRAPP: Approximate Caching
Two common scenarios:
- Minimize bandwidth usage, precision fixed
  - TRAPP: caches store bounds as approximations
  - Queries select a combination of cached & source data
  - Adaptive bound adjustment for good precision level
- Bandwidth fixed, maximize precision
  - Best-Effort Synchronization: caches store stale copies
  - Refreshing based on priority scheduling
  - Global priority order via threshold
  - Adaptive threshold setting for flow control

37. TRAPP: Status
- Past work: focused on an approximate data caching architecture that permits fine-grained control of the precision-performance tradeoff for numerical data in data caching environments.
- Current work: applying the above techniques and others to more complex data such as Web pages.

38. CHAIMS: Overview
- CHAIMS: Compiling High-level Access Interfaces for Multi-site Software
- Objective: Investigate revolutionary approaches to large-scale software composition.
- Approach: Develop and validate a composition-only language, a protocol for large, distributed, heterogeneous and autonomous megamodules, and a supporting system.
- Planned contributions:
  - Asynchrony by splitting up the CALL-statement.
  - Hardware and software platform independence.
  - Potential for multi-site dataflow optimization.
  - Performance optimization by invocation scheduling.

39. CHAIMS: Overview
- Megaprogram for composition, written by domain programmer
- CHAIMS system automates generation of client for distributed system
- Megamodules, provided by various megamodule providers

40. CHAIMS: Architecture
(figure)

41. OntoAgents: Objective
- OntoAgents goal: establish an agent infrastructure on the WWW or WWW-like networks
- Such an agent infrastructure requires an information food chain: every part of the food chain provides information, which enables the existence of the next part.

42. OntoAgents: Architecture
- Ontology Construction Tool
- Ontology Articulation Toolkit
- Annotated Webpages
- Webpage Annotation Tool
- Ontologies
- Agents
- Metadata Repository
- Inference Engine
- Community Portal
- End User

43. Events: DB Seminars
(figure)

44. Events: Meetings
- Stanford Computer Science Forum - Annual Affiliates Meeting, Stanford, May 2003.
- SWiM (the Stream Winter Meeting): About 35 researchers in the data streams area came together at Stanford for SWiM, Jan. 2003.
- Stream Team: A few data streams research groups held some informal get-togethers, 2002.
- Conference talks: ACM SIGMOD/PODS, VLDB, ICDT, ICDE, ICDCS, CIDR

45. References: WebBase
- Junghoo Cho, Hector Garcia-Molina. “Parallel Crawlers.” In Proceedings of the Eleventh World Wide Web Conference, May 2002.
- Taher Haveliwala, Aristides Gionis, et al. “Evaluating Strategies for Similarity Search on the Web.” Proceedings of the Eleventh International World Wide Web Conference, May 2002.
- Taher Haveliwala. “Topic-Sensitive PageRank.” Proceedings of the Eleventh International World Wide Web Conference, May 2002.

46. References: STREAM
- R. Motwani, J. Widom, et al. “Query Processing, Resource Management, and Approximation in a Data Stream Management System.” In Proc. of the 2003 Conference on Innovative Data Systems Research (CIDR), January 2003.
- A. Arasu, B. Babcock, et al. “STREAM: The Stanford Stream Data Manager.” In Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003), June 2003.
- B. Babcock, S. Babu, et al. “Models and Issues in Data Stream Systems.” Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.

47. References: Peers
- Neil Daswani, Hector Garcia-Molina, and Beverly Yang. “Open Problems in Data-Sharing Peer-to-Peer Systems.” In ICDT, 2003.
- Hector Garcia-Molina. “Peer-To-Peer Data Management.” Keynote, ICDE, 2002.
- Hrishikesh Deshpande, Mayank Bawa, and Hector Garcia-Molina. “Streaming Live Media over a Peer-to-Peer Network.”

48. References: TRAPP
- C. Olston and J. Widom. “Best-Effort Cache Synchronization with Source Cooperation.” ACM SIGMOD 2002 International Conference on Management of Data, Madison, Wisconsin, June 2002, pp. 73-84.
- C. Olston, B. T. Loo, and J. Widom. “Adaptive Precision Setting for Cached Approximate Values.” ACM SIGMOD 2001 International Conference on Management of Data, Santa Barbara, California, May 2001, pp. 355-366.

49. Useful Links
- Database Group: http://www-db.stanford.edu/
- STREAM: http://www-db.stanford.edu/stream/
- Peers: http://www-db.stanford.edu/peers/
- DigLib: http://www-diglib.stanford.edu/
- TRAPP: http://www-db.stanford.edu/trapp/
- WebBase: http://www-diglib.stanford.edu/testbed/doc2/WebBase/
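
Worked sketch for slides 34 and 35 (TRAPP over numeric data). The following is a minimal illustration of interval caching, not the TRAPP implementation: the `Source` class, the fixed interval width, the `bounded_sum` helper, and the "probe the widest interval first" rule are all assumptions made for this example, and the rule further assumes every probe costs the same. It shows a source that refreshes the cache only when its value leaves the cached interval, and a SUM query that returns an interval and probes sources only until the precision constraint is met.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Holds the exact value and the interval it last pushed to the cache."""
    value: float
    width: float          # hypothetical fixed interval width for this source
    low: float = 0.0
    high: float = 0.0

    def __post_init__(self):
        self._recenter()

    def _recenter(self):
        # Center a fresh interval of the configured width on the exact value.
        self.low = self.value - self.width / 2
        self.high = self.value + self.width / 2

    def update(self, new_value, cache, key):
        """Apply an update; refresh the cache only if the value leaves the interval."""
        self.value = new_value
        if not (self.low <= new_value <= self.high):
            self._recenter()
            cache[key] = (self.low, self.high)
            return True   # a refresh message was sent to the cache
        return False      # insignificant update, nothing propagated


def bounded_sum(cache, sources, max_width):
    """Answer SUM as an interval no wider than max_width, probing sources if needed."""
    bounds = dict(cache)
    probed = []
    # Assumption: equal probe cost, so probe the widest cached intervals first.
    widest_first = sorted(bounds, key=lambda k: bounds[k][1] - bounds[k][0], reverse=True)
    for key in widest_first:
        total_width = sum(hi - lo for lo, hi in bounds.values())
        if total_width <= max_width:
            break
        exact = sources[key].value        # probe: fetch the exact source value
        bounds[key] = (exact, exact)
        probed.append(key)
    low = sum(lo for lo, _ in bounds.values())
    high = sum(hi for _, hi in bounds.values())
    return (low, high), probed


if __name__ == "__main__":
    sources = {"a": Source(10.0, 4.0), "b": Source(20.0, 2.0), "c": Source(30.0, 1.0)}
    cache = {k: (s.low, s.high) for k, s in sources.items()}
    sources["a"].update(11.0, cache, "a")   # stays inside [8, 12]: no refresh
    sources["b"].update(25.0, cache, "b")   # leaves [19, 21]: cache refreshed
    print(bounded_sum(cache, sources, max_width=3.0))
```

In this run only source `a` is probed: its interval is the widest, and replacing it with the exact value brings the total width down to the requested bound. Other aggregation functions would need a different probing rule, which is the point of the "some easy, some hard" remark on slide 35.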

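Worked sketch for slides 15 and 16 (continuous queries in a DSMS). This is a small illustration of the contrast drawn there, not STREAM's query language or engine: the function name `continuous_window_avg` and the choice of a sliding-window average are assumptions for the example. The query reads the stream sequentially, keeps only a bounded window in memory, and emits an updated answer each time an element arrives.

```python
from collections import deque


def continuous_window_avg(stream, window_size):
    """Yield the average of the last `window_size` elements after each arrival."""
    window = deque()
    running_sum = 0.0
    for value in stream:                      # sequential access: each element is seen once
        window.append(value)
        running_sum += value
        if len(window) > window_size:
            running_sum -= window.popleft()   # evict the oldest element to bound memory
        yield running_sum / len(window)       # the answer is updated over time


if __name__ == "__main__":
    for answer in continuous_window_avg([1, 2, 3, 4], window_size=2):
        print(answer)
```

A one-time query in a conventional DBMS would instead scan a stored relation and return a single result. Here the result is itself a stream of answers, and memory use depends on the window size rather than on how much data has arrived.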