Building PetaByte Servers


1. Building PetaByte Servers
Jim Gray, Microsoft Research
GrayM
http:/www.Research.M
Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12 (today, we are here), Peta 10^15, Exa 10^18

2. Outline
The challenge: Building GIANT data stores
  for example, the EOS/DIS 15 PB system
Conclusion 1: Think about MOX and SCANS
Conclusion 2: Think about Clusters

3. The Challenge - EOS/DIS
Antarctica is melting - 77% of fresh water liberated
  sea level rises 70 meters
  Chico & Memphis are beach-front property
  New York, Washington, SF, LA, London, Paris
Let's study it! Mission to Planet Earth
EOS: Earth Observing System (17B$ => 10B$)
  50 instruments on 10 satellites 1997-2001
  Landsat (added later)
EOS DIS: Data Information System:
  3-5 MB/s raw, 30-50 MB/s processed.
  4 TB/day, 15 PB by year 2007

4. The Process Flow
Data arrives and is pre-processed.
  instrument data is calibrated, gridded, averaged
  Geophysical data is derived
Users ask for stored data OR to analyze and combine data.
Can make the pull-push split dynamically
[Diagram labels: Pull Processing, Push Processing, Other Data]

5. Designing EOS/DIS
Expect that millions will use the system (online)
Three user categories:
  NASA 500 - funded by NASA to do science
  Global Change 10 k - other dirt bags
  Internet 20 m - everyone else
    Grain speculators
    Environmental Impact Reports
New applications => discovery & access must be automatic
Allow anyone to set up a peer-node (DAAC & SCF)
Design for Ad Hoc queries, Not Standard Data Products
If push is 90%, then 10% of data is read (on average).
  => A failure: no one uses the data; in DSS, push is 1% or less.
  => computation demand is enormous (pull:push is 100:1)

6. The architecture
2+N data center design
Scaleable OR-DBMS
Emphasize Pull vs Push processing
Storage hierarchy
Data Pump
Just in time acquisition

7. Obvious Point: EOS/DIS will be a cluster of SMPs
It needs 16 PB storage
  = 1 M disks in current technology
  = 500K tapes in current technology
It needs 100 TeraOps of processing
  = 100K processors (current technology)
  and 100 Terabytes of DRAM
1997 requirements are 1000x smaller
  smaller data rate
  almost no re-processing work

8. 2+N data center design
duplex the archive (for fault tolerance)
let anyone build an extract (the +N)
Partition data by time and by space (store 2 or 4 ways).
Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
Clients and Partitions interact via standard protocols: OLE-DB, DCOM/CORBA, HTTP

9. Hardware Architecture
2 Huge Data Centers
Each has 50 to 1,000 nodes in a cluster
Each node has about 25 to 250 TB of storage

  Component        Range                 Cost
  SMP              .5 Bips to 50 Bips    20K$
  DRAM             50 GB to 1 TB         50K$
  100 disks        2.3 TB to 230 TB      200K$
  10 tape robots   25 TB to 250 TB       200K$
  2 Interconnects  1 GBps to 100 GBps    20K$

Node costs 500K$
Data Center costs 25M$ (capital cost)

10. Scaleable OR-DBMS
Adopt cluster approach (Tandem, Teradata, VMScluster, ...)
System must scale to many processors, disks, links
OR DBMS based on standard object model
  CORBA or DCOM (not vendor specific)
Grow by adding components
System must be self-managing

11. Storage Hierarchy
Cache hot 10% (1.5 PB) on disk.
Keep cold 90% on near-line tape.
Remember recent results on speculation
(more on this later: MOX/GOX/SCANS)

12. Data Pump
Some queries require reading ALL the data (for reprocessing)
Each Data Center scans the data every 2 weeks.
  Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
Compute on demand small jobs
  less than 1,000 tape mounts
  less than 100 M disk accesses
  less than 100 TeraOps.
  (less than 30 minute response time)
For BIG JOBS scan entire 15PB database
Queries (and extracts) "snoop" this data pump.

13. Just-in-time acquisition
Hardware prices decline 20%-40%/year
So buy at last moment
Buy best product that day: commodity
Depreciate over 3 years so that facility is fresh.
  (after 3 years, cost is 23% of original).
[Chart: EOS DIS Disk Storage Size and Cost, 1994 to 2008. Axes: Storage Cost M$, Data Need TB. Assumes 40% price decline/year. Annotations: 30%; 60% decline peaks at 10M$]

14. Problems
HSM
Design and Meta-data
Ingest
Data discovery, search, and analysis
reorg-reprocess
disaster recovery
cost

15. What this system teaches us
Traditional storage metrics
  KOX: KB objects accessed per second
  $/GB: Storage cost
New metrics:
  MOX: megabyte objects accessed per second
  SCANS: Time to scan the archive

16. Thesis: Performance = Storage Accesses, not Instructions Executed
In the "old days" we counted instructions and IOs
Now we count memory references
Processors wait most of the time

17. The Pico Processor
1 M SPECmarks
10^6 clocks/fault to bulk ram
Event-horizon on chip.
VM reincarnated
Multi-program cache
Terror Bytes!

18. Storage Latency: How Far Away is the Data?

  Level                Relative latency   Analogy       Time
  Registers            1                  My Head       1 min
  On Chip Cache        2                  This Room
  On Board Cache       10                 This Campus   10 min
  Memory               100                Sacramento    1.5 hr
  Disk                 10^6               Pluto         2 Years
  Tape/Optical Robot   10^9               Andromeda     2,000 Years
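The analogy on the storage-latency slide can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the slide's relative latencies are clock ticks and that the human scale stretches one tick to one minute. The helper name `human_scale` is hypothetical and does not come from the deck.

```python
# Relative latencies as listed on the slide.
# Treating them as clock ticks is an assumption of this sketch.
LATENCY_CLOCKS = {
    "Registers": 1,
    "On Chip Cache": 2,
    "On Board Cache": 10,
    "Memory": 100,
    "Disk": 10**6,
    "Tape/Optical Robot": 10**9,
}

MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365


def human_scale(clocks):
    """Render a latency as if one clock tick lasted one minute."""
    minutes = clocks
    if minutes < MINUTES_PER_HOUR:
        return f"{minutes:g} min"
    if minutes < MINUTES_PER_YEAR:
        return f"{minutes / MINUTES_PER_HOUR:.1f} hr"
    return f"{minutes / MINUTES_PER_YEAR:,.1f} years"


for level, clocks in LATENCY_CLOCKS.items():
    print(f"{level:20s} {clocks:>13,d} clocks -> {human_scale(clocks)}")
```

Under that assumption memory comes out at about 1.7 hours, disk at about 1.9 years and the tape robot at about 1,900 years, which is close to the slide's rounded 1.5 hr, 2 years and 2,000 years.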

19. DataFlow Programming: Prefetch & Postwrite Hide Latency
Can't wait for the data to arrive (2,000 years!)
Need a memory that gets the data in advance (100 MB/s)
Solution:
  Pipeline data to/from the processor
  Pipe data from source (tape, disc, ram, ...) to cpu cache

20. MetaMessage: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate THEN nothing really changes.
Things getting MUCH BETTER:
  communication speed & cost 1,000x
  processor speed & cost 100x
  storage size & cost 100x
Things staying about the same
  speed of light (more or less constant)
  people (10x more expensive)
  storage speed (only 10x better)

21. Trends: Storage Got Cheaper
$/byte got 10^4 better
$/access got 10^3 better
capacity grew 10^3
Latency improved 10x
Bandwidth improved 10x
[Chart: Unit Storage Size. Storage Capacity vs Year, 1960 to 2000. Series: Disk (kB), RAM (b), Tape (kB)]

22. Trends: Access Times Improved Little

[Chart: Processor Speedups. Instructions/second (Processors) and Bits/second (WANs) vs Year, 1960 to 2000]
[Chart: Access Times Improved Little. Access time for Tape, Disk, RAM vs Year, 1960 to 2000]

23. Trends: Storage Bandwidth Improved Little
[Chart: Transfer Rates Improved Little. Tape, Disk, RAM vs Year, 1960 to 2000]
[Chart: Processor Speedups. Processors and WANs vs Year, 1960 to 2000]

24. Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Chart: Size vs Speed. Typical System (bytes) vs Access Time (seconds), 10^-9 to 10^3. Levels: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape]
[Chart: Price vs Speed. $/MB vs Access Time (seconds), 10^-9 to 10^3. Levels: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape]

25. Trends: Application Storage Demand Grew
The Old World:
  Millions of objects
  100-byte objects
The New World:
  Billions of objects
  Big objects (1MB)

26. Trends: New Applications
The paperless office
Library of congress online (on your campus)
All information comes electronically
  entertainment
  publishing
  business
Information Network, Knowledge Navigator, Information at Your Fingertips
Multimedia: Text, voice, image, video, ...

27. What's a Terabyte
1 Terabyte
  1,000,000,000 business letters   (150 miles of bookshelf)
  100,000,000 book pages           (15 miles of bookshelf)
  50,000,000 FAX images            (7 miles of bookshelf)
  10,000,000 TV pictures (mpeg)    (10 days of video)
  4,000 LandSat images
Library of Congress (in ASCII) is 25 TB
1980:
  200 M$ of disc: 10,000 discs
  5 M$ of tape silo: 10,000 tapes
1997:
  200 K$ of magnetic disc: 120 discs
  300 K$ of optical disc robot: 250 platters
  50 K$ of tape silo: 50 tapes
Terror Byte! .1% of a PetaByte!

28. The Cost of Storage & Access
File Cabinet:
  cabinet (4 drawer)        250$
  paper (24,000 sheets)     250$
  space (2x3 at 10$/ft2)    180$
  total                     700$
  3 cents/sheet
Disk:
  disk (9 GB =)             2,000$
  ASCII: 5 m pages, 0.2 cents/sheet (50x cheaper)
  Image: 200 k pages, 1 cent/sheet (similar to paper)

29. Standard Storage Metrics
Capacity:
  RAM: MB and $/MB: today at 10MB & 100$/MB
  Disk: GB and $/GB: today at 5GB and 500$/GB
  Tape: TB and $/TB: today at .1TB and 100k$/TB (nearline)
Access time (latency)
  RAM: 100 ns
  Disk: 10 ms
  Tape: 30 second pick, 30 second position
Transfer rate
  RAM: 1 GB/s
  Disk: 5 MB/s - - - Arrays can go to 1GB/s
  Tape: 3 MB/s - - - not clear that striping works

30. New Storage Metrics: KOXs, MOXs, GOXs, SCANs?
KOX: How many kilobyte objects served per second
  the file server, transaction processing metric
MOX: How many megabyte objects served per second
  the Mosaic metric
GOX: How many gigabyte objects served per hour
  the video & EOSDIS metric
SCANS: How many scans of all the data per day
  the data mining and utility metric

31. How To Get Lots of MOX, GOX, SCANS
parallelism: use many little devices in parallel
Beware of the media myth
Beware of the access time myth
At 10 MB/s: 1.2 days to scan
1,000 x parallel: 1.5 minute SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.

32. Tape & Optical: Beware of the Media Myth
Optical is cheap: 200 $/platter, 2 GB/platter
  => 100$/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape, 20 GB/tape
  => 1.5 $/GB (100x cheaper than disc).

33. Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot (10 k$ ... 3 m$)
  10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB
  (1x to 10x cheaper than disc)
Optical needs a robot (100 k$)
  100 platters = 200GB (TODAY) => 400 $/GB
  (more expensive than mag disc)
Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!

34. The Access Time Myth
The Myth: seek or pick time dominates
The reality:
  (1) Queuing dominates
  (2) Transfer dominates BLOBs
  (3) Disk seeks often short
Implication: many cheap servers better than one fast expensive server
  shorter queues
  parallel transfer
  lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays

35. The Disk Farm On a Card
The 100GB disc card
An array of discs
Can be used as
  100 discs
  1 striped disc
  10 Fault Tolerant discs
  ....etc
LOTS of accesses/second, bandwidth
[Figure label: 14"]
Life is cheap, it's the accessories that cost ya.
Processors are cheap, it's the peripherals that cost ya (a 10k$ disc card).

36. My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos
Scan in 24 hours.
many independent tape robots (like a disc farm)
10K$ robot: 10 tapes, 500 GB, 6 MB/s, 20$/GB, 30 MOX, 15 GOX
100 robots: 1M$, 50TB, 50$/GB, 3K MOX, 1.5K GOX, 1 Scans
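The scan-time claims on the parallelism and tape-farm slides reduce to capacity divided by aggregate bandwidth. The sketch below is a rough check rather than anything taken from the deck. The function name `scan_seconds` is hypothetical, and the 1 TB archive size is an assumption, chosen because it is the size for which 10 MB/s gives the quoted 1.2 days.

```python
MB, GB, TB = 10**6, 10**9, 10**12
DAY = 86_400  # seconds


def scan_seconds(capacity_bytes, bandwidth_bytes_per_s, devices=1):
    """Time to read everything once, with the data spread evenly
    over `devices` units that all stream in parallel."""
    return capacity_bytes / (bandwidth_bytes_per_s * devices)


# Assumed 1 TB archive read by one 10 MB/s stream, then by 1,000 of them.
one_stream = scan_seconds(1 * TB, 10 * MB)
parallel = scan_seconds(1 * TB, 10 * MB, devices=1000)

# One 10K$ tape robot from the tape-farm slide: 500 GB at 6 MB/s.
robot = scan_seconds(500 * GB, 6 * MB)

# A farm of 100 such robots holding 50 TB scans in the same time,
# because capacity and bandwidth grow together.
farm = scan_seconds(50 * TB, 6 * MB, devices=100)

print(f"one stream : {one_stream / DAY:.1f} days")
print(f"1,000 x    : {parallel:.0f} seconds")
print(f"one robot  : {robot / 3600:.1f} hours, {DAY / robot:.2f} scans/day")
print(f"100 robots : {farm / 3600:.1f} hours")
print(f"robot cost : {10_000 / 500:.0f} $/GB")
```

A single stream takes about 1.2 days and 1,000 parallel streams take about 100 seconds. One robot needs roughly 23 hours, which is where "scan in 24 hours" and about one SCAN per day come from. Adding robots leaves the scan time unchanged while capacity grows, which is the argument for farms over one large silo.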

37. The Metrics: Disk and Tape Farms Win
[Chart: GB/K$, KOX, MOX, GOX, SCANS/Day for 1000x Disc Farm; STC Tape Robot (6,000 tapes, 8 readers); 100x DLT Tape Farm]
Data Motel: Data checks in, but it never checks out

38. Cost Per Access (3-year)
[Chart: KOX/$, MOX/$, GOX/$, SCANS/k$ for 1000x Disc Farm; STC Tape Robot (6,000 tapes, 16 readers); 100x DLT Tape Farm]

39. Summary (of new ideas)
Storage accesses are the bottleneck
Accesses are getting larger (MOX, GOX, SCANS)
Capacity and cost are improving
BUT
Latencies and bandwidth are not improving much
SO
Use parallel access (disk and tape farms)

40. MetaMessage: Technology Ratios Are Important
If everything gets faster & cheaper at the same rate nothing really changes.
Some things getting MUCH BETTER:
  communication speed & cost 1,000x
  processor speed & cost 100x
  storage size & cost 100x
Some things staying about the same
  speed of light (more or less constant)
  people (10x worse)
  storage speed (only 10x better)

41. Ratios Changed
10x better access time
10x more bandwidth
10,000x lower media price
DRAM/DISK 100:1 to 10:1 to 50:1

42. The Five Minute Rule
Trade DRAM for Disk Accesses
Cost of an access (DriveCost / Access_per_second)
Cost of a DRAM page ($/MB / pages_per_MB)
Break even has two terms:
  Technology term and an Economic term
Grew page size to compensate for changing ratios.
Now at 10 minute for random, 2 minute sequential

43. Shows Best Page Index Page Size 16KB

44. The Ideal Interconnect
High bandwidth
Low latency
  No software stack
  Zero Copy
  User mode access to device
  Low HBA latency
Error Free (required if no software stack)
Flow Controlled
WE NEED A NEW PROTOCOL
  best of SCSI and Comm
  allow push & pull
  industry is doing it: SAN + VIA
[Table: SCSI vs Comm ratings for the properties above; the row alignment is not recoverable. Marks as extracted: + - -+ -+ - - + + -+ -]

45. Outline
The challenge: Building GIANT data stores
  for example, the EOS/DIS 15 PB system
Conclusion 1: Think about MOX and SCANS
Conclusion 2: Think about Clusters
  SMP report
  Cluster report

46. Scaleable Computers: BOTH SMP and Cluster
[Diagram labels: SMP Super Server, Departmental Server, Personal System, Cluster of PCs]
Grow Up with SMP
  4xP6 is now standard
Grow Out with Cluster
  Cluster has inexpensive parts

47. TPC-C Current Results
Best Performance is 30,390 tpmC at $305/tpmC (Oracle/DEC)
Best Price/Perf. is 7,693 tpmC at $43.5/tpmC (MS SQL/Dell)
Graphs show
  UNIX high price
  UNIX scaleup diseconomy

48. Compare SMP Performance

49. Where the money goes

50. TPC C improved fast
40% hardware, 100% software, 100% PC Technology
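Returning to the Five Minute Rule slide: its two costs combine into a break-even interval. A page re-read every T seconds uses 1/T accesses per second of a drive, costing (DriveCost / accesses per second) / T, while keeping it in DRAM costs ($/MB) / pages_per_MB. Setting the two equal gives T as a technology term times an economic term. The sketch below is illustrative only. The 2,000$ drive price is the 9 GB disk from the cost-of-storage slide, but the 64 accesses per second, the 15 $/MB DRAM price and the function names are assumptions, so the result is not expected to match the deck's "10 minute" figure.

```python
def pages_per_mb(page_kb):
    """Number of pages of `page_kb` kilobytes in one megabyte of DRAM."""
    return 1024 // page_kb


def break_even_seconds(pages_per_mb_dram, accesses_per_s_per_disk,
                       drive_cost, dram_cost_per_mb):
    """Reference interval at which caching a page in DRAM costs the
    same as re-reading it from disk each time it is needed."""
    technology_term = pages_per_mb_dram / accesses_per_s_per_disk
    economic_term = drive_cost / dram_cost_per_mb
    return technology_term * economic_term


# Assumed inputs (hypothetical, for illustration only).
ACCESSES_PER_S = 64      # random accesses per second per drive
DRIVE_COST = 2000        # dollars, the 9 GB disk on the cost slide
DRAM_COST_PER_MB = 15    # dollars per MB

for page_kb in (8, 16):
    t = break_even_seconds(pages_per_mb(page_kb), ACCESSES_PER_S,
                           DRIVE_COST, DRAM_COST_PER_MB)
    print(f"{page_kb:2d} KB pages: break even at {t:.0f} s ({t / 60:.1f} min)")
```

With these inputs 8 KB pages break even at roughly four and a half minutes. Doubling the page size halves the number of pages per MB and so halves the interval, which illustrates the slide's remark that page size grew to compensate for changing ratios.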
