1、The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer: A PRAM-On-Chip Proof-of-Concept,Uzi Vishkin (Lack of) ease-of-programming failed all parallel computers to date Vendors are yet to offer easy-to-program (& scalable) many-cores PRAM “sets the programmer free”. Unique also in other
2、ways 4. Contender for era of general-purpose many-cores: PRAM-On-Chip “XMT” (add-on ?) + 2010s GPU replace old Pentium+GPU XMT home page: www.umiacs.umd.edu/users/vishkin/XMT,Commodity computer systems,19462003 General-purpose computing: Serial. 5KHz4GHz. 2004 Clock frequency growth: flat. If you wa
3、nt your program to run significantly faster youre going to have to parallelize it Parallelism: only game in town #Transistors/chip 19802011: 29K30B! General-purpose computing goes parallel. #”cores”: dy-2003 But, what about the programmer? Systems communities claim to be objectively guided by “the q
4、uantitative approach”. Yet, they “forget” to quantify or benchmark the human factor. E.g., to whom can an approach be taught: graduate or middle-school students? Development time?40 years of parallel computing Never a successful general-purpose parallel computer: Easy to program & good speedups Lett
5、er grade from NSF Blue-Ribbon Panel on Cyberinfrastructure: F. To many users programming existing parallel computers is “as intimidating and time consuming as programming in assembly language”. What would be todays grade? If higher, is this a game changer?Are theorists the canaries in the coal mine?
6、 Low numbers a worrying sign?,Intel Platform 2015, March05,2 Paradigm Shifts,Serial to parallel widely agreed. Within parallel Existing “decomposition-first” paradigm. Too painful to program. Hence: Express only “what can be done in parallel” (PRAM: Parallel Random-Access Model). Build machine aroun
7、d this. Serial doctrine Natural (parallel) algorithmtime = work “work” = total #ops time work Late 1970s- : THEORY: figure out how to think algorithmically in parallel 1997- : PRAM-On-ChipUMD: derive specs for architecture; design and build 2 premises: (i) parallel algorithmic thinking. (ii) Specs f
8、irst; contrast with: J. Hennessy: “Many of the early ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use”. Implies: parallel HW followed ”build-first figure-out-how-to-program-later”.,What could I do in parallel at each step assuming un
9、limited hardware ,# ops,. .,. .,. .,#ops,time,time,Pre many-core parallelism: 3 main trusts,Improving single-task completion time for general-purpose parallelism was not the primary target of parallel machines: 1. Application-specific: e.g., computer graphics. Limiting origin. GPUs: great performanc
10、e if you figure out how. Example: limited interaction between threads; what to do with textbook parallel graph algorithms? 2. Parallel machines for high-throughput (of serial programs); hence, cache coherence, SMP, DSM. Only choice for “HPC”Language standards, but many issues, e.g., F grade. Heard f
11、rom HW designers (that dominate vendors): YOU figure out how to program (their machines) for locality. Nothing fundamentally new in HW since 1990s. Serious lack of parallel machine diversity. What can a non-HW designers do?! HW for CS is like nature for Physics (Are vendors Gods of CS?) Theory Alway
12、s had its eyes on the ball. Started with a clean slate targeting single task completion time for general-purpose parallel computing. 3. PRAM and its extensive algorithmic theory. As simple as it gets. Ahead of its time: avant-garde. 1990s Common wisdom (LOGP): never implementable. Well: we built it
13、Showed 100x speedups for 1000 processors. Also: taught to grad students, seniors, freshmen, HS (&MS). humans to humans. Validated understanding & performance with programming assignments. Problems on par with serial courses. Students see immediate speedups.,Welcome to the 2009 Impasse,All vendors co
14、mmitted to multi-cores. Yet, their architecture and how to program them for single program completion time not clear The software spiral (HW improvements SW imp HW imp) growth engine for IT (A. Grove, Intel); Alas, now broken! SW vendors avoid investment in long-term SW development since may bet on
15、the wrong horse. Impasse bad for business. For current students: Does CS&E degree mean: being trained for a 50yr career dominated by parallelism by programming yesterdays serial computers? How can the same impasse & need to serve current students be mitigated for education? Answer “What can be done
16、next in parallel” common cognition for all approachesCan teach PRAM common denominator. The education enterprise has an actionable agenda for a time critical need. Comments: 1. Is this a tie-breaker among approaches? 2. A machine is not easy-to-program if it is not easy-to-teach education for parall
17、elism has become a key benchmark. Namely, for parallelism, education is CS research.,Need,A general-purpose parallel computer framework “successor to the Pentium for the multi-core era” that: is easy to program; gives good performance with any amount (grain or regularity) of parallelism provided by
18、the algorithm; namely, up- and down-scalability including backwards compatibility on serial code; supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming; and fits current chip technology and scales with it. (in particular: strong speed-ups for single-task complet
19、ion time)Key point: PRAM-On-ChipUMD is addressing (i)-(iv).,The PRAM Rollercoaster ride,Late 1970s Theory work began UP Won the battle of ideas on parallel algorithmic thinking. No silver or bronze! Model of choice in all theory/algorithms communities. 1988-90: Big chapters in standard algorithms te
20、xtbooks. DOWN FCRC93: “PRAM is not feasible”. 93+ despair no good alternative! Where vendors expect good enough alternatives to come from in 2009?. Device changed it all with #transistors on-chip: UP Highlights: eXplicit-multi-threaded (XMT) FPGA-prototype computer (not simulator): SPAA07,CF08; 90nm
21、 ASIC tape-outs: int. network, HotI07, XMT. 1000 processors can fit on a single chip by mid-2010s. But, how come? Crash “course” on parallel computing How much processors-to-memories bandwidth? Enough: Ideal Programming Model (PRAM) Limited: Programming difficulties,Hardware prototypes of PRAM-On-Ch
22、ip,The design scales to 1000+ cores on-chip,64-core, 75MHz FPGA prototype SPAA07, Computing Frontiers08 Original explicit multi-threaded (XMT) architecture SPAA98 (Cray started to use “XMT” 7 years later),Interconnection Network for 128-core. 9mmX5mm, IBM90nm process. 400 MHz prototype HotInterconne
23、cts07,Same design as 64-core FPGA. 10mmX10mm, IBM90nm process. 150 MHz prototype,What else changed since the 1990s? “Multi-Core Interconnects: Scale-Up or Melt-Down“ Panel discussion, Hot Interconnects, 2007, Stanford University,Panel abstract: As we anticipate 32, 64, 100+ processors on a single ch
24、ip, the problem of interconnecting the cores looms as a potential showstopper to scaling. Are we heading for the cliff here, or will our panelists bring visions of interconnect architectures, especially those that work on-chip but not between chips, that will enable the scaling to continue? Panelist
25、s from Google, Yahoo, others. Summary Noted several issues with power consumption of multi-core architectures coming from industry: high power consumption of wide communication buses needed to implement cache coherence; basic nm complexity of cache coherence traffic (given n cores and m invalidation
26、s) and its implied huge toll on inter-core bandwidth; and high power consumption needed for a tightly synchronous implementation in silicon used in these designs. Panels conclusion: the industry must first converge to an easy-to-program highly-scalable multi-core architecture. These issues should be
27、 addressed in the context of such an architecture.,How does it work, and what should people know to participate,“Work-depth” Alg Methodology (SV82) State all ops you can do in parallel. Repeat. Minimize: Total #operations, #rounds. Note: 1 The rest is skill. 2. Sets the algorithm Program single-prog
28、ram multiple-data (SPMD). Short (not OS) threads. Independence of order semantics (IOS). XMTC: C plus 3 commands: Spawn+Join, Prefix-Sum (PS) Unique First parallelism. Then decomposition Legend: Level of abstraction Means Means: Programming methodology Algorithms effective programs. Extend the SV82
29、Work-Depth framework from PRAM-like to XMTC Alternative Established APIs (VHDL/Verilog,OpenGL,MATLAB) “win-win proposition” Performance-Tuned Program minimize length of sequence of round-trips to memory + QRQW + Depth; take advantage of architecture enhancements (e.g., prefetch). Means: Compiler: id
30、eally: given XMTC program, compiler provides decomposition: tune-up manually “teach the compiler”,Architecture HW hooks. E.g., HW-supported run-time load-balancing of concurrent threads over processors. Low thread creation overhead. (Extend stored-program + program counter; cited by 15 Intel patents
31、; Prefix-sum to registers & to memory. ) All Computer Scientists will need to know 1 levels of abstraction (LoA) CS programmers model: WD+P. CS expert : WD+P+PTP. Systems: +A.,Merging: Example for Algorithm & Program,Input: Two arrays A1. . n, B1. . n; elements from a totally ordered domain S. Each
32、array is monotonically non-decreasing. Merging: map each of these elements into a monotonically non-decreasing array C12n Serial Merging algorithm SERIAL RANK(A1 . . ;B1. .) Starting from A(1) and B(1), in each round: compare an element from A with an element of B determine the rank of the smaller a
33、mong them Complexity: O(n) time (and O(n) work.)PRAM Challenge: O(n) work, least time Also (new): fewest spawn-joins,Merging algorithm (contd),“Surplus-log” parallel algorithm for Merging/Ranking for 1 i n pardo Compute RANK(i,B) using standard binary search Compute RANK(i,A) using binary search Com
34、plexity: W=(O(n log n), T=O(log n) The partitioning paradigm n: input size for a problem. Design a 2-stage parallel algorithm: Partition the input into a large number, say p, of independent small jobs AND size of the largest small job is roughly n/p. Actual work - do the small jobs concurrently, usi
35、ng a separate (possibly serial) algorithm for each.,Linear work parallel merging: using a single spawn,Stage 1 of algorithm: Partitioning for 1 i n/p pardo p = n/log and p | n b(i):=RANK(p(i-1) + 1),B) using binary search a(i):=RANK(p(i-1) + 1),A) using binary search Stage 2 of algorithm: Actual wor
36、k Observe Overall ranking task broken into 2p independent “slices”. Example of a slice Start at A(p(i-1) +1) and B(b(i). Using serial ranking advance till: Termination condition Either some A(pi+1) or some B(jp+1) loses Parallel program 2p concurrent threads using a single spawn-join for the whole a
37、lgorithm Example Thread of 20: Binary search B. Rank as 11 (index of 15 in B) + 9 (index of 20 in A). Then: compare 21 to 22 and rank 21; compare 23 to 22 to rank 22; compare 23 to 24 to rank 23; compare 24 to 25, but terminate since the Thread of 24 will rank 24.,Linear work parallel merging (contd
38、),Observation 2p slices. None larger than 2n/p. (not too bad since average is 2n/2p=n/p) Complexity Partitioning takes W=O(p log n), and T=O(log n) time, or O(n) work and O(log n) time, for p = n/log n. Actual work employs 2p serial algorithms, each takes O(n/p) time. Total W=O(n), and T=O(n/p), for
39、 p = n/log n.IMPORTANT: Correctness & complexity of parallel programSame as for algorithm.This is a big deal. Other parallel programming approaches do not have a simple concurrency model, and need to reason w.r.t. the program.,Workflow from parallel algorithms to programming versus trial-and-error,O
40、ption 1,PAT,Rethink algorithm: Take better advantage of cache,Hardware,PAT,Tune,Hardware,Option 2,Parallel algorithmic thinking (say PRAM),Compiler,Is Option 1 good enough for the parallel programmers model? Options 1B and 2 start with a PRAM algorithm, but not option 1A. Options 1A and 2 represent
41、workflow, but not option 1B.,Not possible in the 1990s. Possible now. Why settle for less?,Insufficient inter-thread bandwidth?,Domain decomposition, or task decomposition,Program,Program,Memory architecture alternatives,Good news Fits on chip. Allows high-bandwidth interconnection network. Left sid
42、e: Makes sense for high-throughput. Uses cache coherence. Significant toll on bandwidth, power, and chip resources (each processor needs more silicon) Right side: Amortizes memory resources more processors. Unless in local registers (also “read-only caches” & “prefetch buffers”) need round-trip acro
43、ss ICN. Sequence of round-trips to memory a bigger issue. Remaining constraint Off-chip bandwidth remains limited. Still less than for high-throughput, since one program.,Synchrony versus ?,Synchrony Pros,PRAM algorithmic theory Reasoning about correctness. Correlates with determinism Predictable ti
44、ming of computation Synchrony Cons Memory: 1 to 400 clocks separate average-case from worst-case Tighter synchrony more power.Asynchrony? Asynchrony too difficult for algorithms and architecture. What can we do?, Reduced synchrony,Balance (synchronous) PRAM algorithms and a no-busy-wait finite-state
45、-machine (NBW FSM) “ideal”. Role for HW enhanced prefix-sum (F&A) Start with Arbitrary CRCW PRAM algorithms. Set concurrent threads that are as long-as-possible. Each advances at its own speed without letting another thread cause it to busy-wait. When all fails, join (every threads expires) Discipli
46、ned non-determinism.,How I updated my basic intuition I first expected parallel systems to operate through punctual orchestration. “Swiss train system paradigm”: set clock when train leaves the station. Example: Cray 1970s vector machines. Cray was brilliant, but technology stood in the way. Expensi
47、ve cooling was a warning sign. I went through 180-degree conceptual transformation between my SPAA97 and SPAA98 papers. I now expect progress to occur as distributedly as possible.,Accommodating a fixed number of processors (thread controls units, TCUs),Issues,Optimizing two parameters (work and tim
48、e) Can have more than one “best” parallel algorithm: W1T2T3. “building blocks”. Constants matter Cases: (i) Saturated TCUs, and (ii) unsaturated TCUs Ideally: programmer should not rewrite for more TCUs Responsibilities of programmer, compiler, hardware:,Programmer,Algorithm and its XMTC code All bu
49、ilding blocks Compiler Assemble building blocks into ready to run code, given #TCUs, input size, and HW parameters Clustering threads Hardware Initially, allocate virtual threads to TCUs, and then dynamically as TCUs become available No-busy-wait finite-state-machine (NBW FSM) ideal.,Realistic Role
50、for theory,Wing: Computational thinking is often at multiple levels of abstraction (LoA) One of these levels got to be the theory of parallel algorithms (reflecting parallel algorithmic thinking.) The “how does it work” slide demonstrates our PRAM-like approach. If you restrict attention to one laye
51、r of this multi-layer onion, it becomes similar to other approaches, but misses the point. (The XMTC language layer is one of the least innovative in XMT, but gets much attention.) Need the vertical integration to make sense and understand in what ways it is different (unique). Roles for theory and
52、beyond: The algorithms (WD) LoA is traditional PRAM theory. Performance tuning : Experimental Algorithms. Work needed: develope understanding of what gives best performance in implementation. Model, assess and compare HW, and HW options. Model reasoning about multiple levels of abstraction. SW architectures: even for serial.,