Advanced Computer Architecture.ppt


Advanced Computer Architecture
Chapter 4: Advanced Pipelining
Ioannis Papaefstathiou, CS 590.25, Easter 2003 (thanks to Hennessy & Patterson)

Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting More Parallelism
4.7 Studies of ILP

4.1 Instruction Level Parallelism

ILP is the principle that there are many instructions in code that don't depend on each other, which means it is possible to execute those instructions in parallel. This is easier said than done. The issues include building compilers to analyze the code, and building hardware to be even smarter than that code. This section looks at some of the problems to be solved.

Terminology (Pipeline Scheduling and Loop Unrolling)

Basic block: the set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.

Loop-level parallelism: the parallelism that exists within a loop. Such parallelism can cross loop iterations.

Loop unrolling: either the compiler or the hardware exploits the parallelism inherent in the loop; a C-level sketch of the idea follows below.
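To make the unrolling idea concrete before the DLX version, here is a minimal C sketch of unrolling the upcoming x[i] = x[i] + s loop by four. It is an illustration only, not from the original slides: the function name and signature are invented, and it assumes the trip count n is a multiple of 4, just as the DLX example assumes the iteration count reached through R1 is a multiple of 4.

    /* Illustrative sketch only: unroll x[i] = x[i] + s by a factor of 4.
       Assumes n is a multiple of 4, mirroring the slides' assumption. */
    void add_scalar_unrolled(double *x, int n, double s) {
        for (int i = 0; i < n; i += 4) {
            /* Four independent statements now share one copy of the loop
               overhead (index update, test, branch), and a scheduler is
               free to interleave them because they touch distinct elements. */
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }

The rest of this section performs the same transformation in DLX assembly, where the payoff is measured in removed pipeline stalls.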

Simple Loop and its Assembler Equivalent

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

    Loop:  LD    F0,0(R1)      ;F0 = vector element
           ADDD  F4,F0,F2      ;add scalar from F2
           SD    0(R1),F4      ;store result
           SUBI  R1,R1,8       ;decrement pointer 8 bytes (DW)
           BNEZ  R1,Loop       ;branch if R1 != zero
           NOP                 ;delayed branch slot

This is a clean and simple example!

FP Loop Hazards

Given the loop above and the FP latencies below, where are the stalls?

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

FP Loop Showing Stalls

    1  Loop:  LD    F0,0(R1)      ;F0 = vector element
    2         stall
    3         ADDD  F4,F0,F2      ;add scalar in F2
    4         stall
    5         stall
    6         SD    0(R1),F4      ;store result
    7         SUBI  R1,R1,8       ;decrement pointer 8 bytes (DW)
    8         stall
    9         BNEZ  R1,Loop       ;branch if R1 != zero
    10        stall               ;delayed branch slot

10 clocks per iteration. Can the code be rewritten to minimize the stalls?

Scheduled FP Loop Minimizing Stalls

    1  Loop:  LD    F0,0(R1)
    2         SUBI  R1,R1,8
    3         ADDD  F4,F0,F2
    4         stall
    5         BNEZ  R1,Loop       ;delayed branch
    6         SD    8(R1),F4      ;offset altered from 0 to 8 when moved past SUBI

The BNEZ and the SD are swapped by changing the address of the SD. The remaining stall is there because the SD cannot proceed any sooner: it must be separated by two cycles from the ADDD that produces F4 (FP ALU op to store latency). Now 6 clocks per iteration; next, unroll the loop 4 times to make it faster.

Unroll Loop Four Times (straightforward way)

    1  Loop:  LD    F0,0(R1)
    2         stall
    3         ADDD  F4,F0,F2
    4         stall
    5         stall
    6         SD    0(R1),F4
    7         LD    F6,-8(R1)
    8         stall
    9         ADDD  F8,F6,F2
    10        stall
    11        stall
    12        SD    -8(R1),F8
    13        LD    F10,-16(R1)
    14        stall
    15        ADDD  F12,F10,F2
    16        stall
    17        stall
    18        SD    -16(R1),F12
    19        LD    F14,-24(R1)
    20        stall
    21        ADDD  F16,F14,F2
    22        stall
    23        stall
    24        SD    -24(R1),F16
    25        SUBI  R1,R1,#32
    26        BNEZ  R1,LOOP
    27        stall
    28        NOP

This takes 15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 per iteration (assuming the iteration count is a multiple of 4). Rewrite the loop to minimize stalls.

Unrolled Loop That Minimizes Stalls

    1  Loop:  LD    F0,0(R1)
    2         LD    F6,-8(R1)
    3         LD    F10,-16(R1)
    4         LD    F14,-24(R1)
    5         ADDD  F4,F0,F2
    6         ADDD  F8,F6,F2
    7         ADDD  F12,F10,F2
    8         ADDD  F16,F14,F2
    9         SD    0(R1),F4
    10        SD    -8(R1),F8
    11        SD    -16(R1),F12
    12        SUBI  R1,R1,#32
    13        BNEZ  R1,LOOP
    14        SD    8(R1),F16     ;8 - 32 = -24

No stalls! 14 clock cycles, or 3.5 per iteration.

What assumptions were made when the code was moved?
- Is it OK to move the store past the SUBI even though the SUBI changes a register the store uses?
- Is it OK to move the loads before the stores and still get the right data?
- When is it safe for the compiler to make such changes?

Summary of Loop Unrolling Example

- Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code.
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
- Eliminate the extra tests and branches and adjust the loop maintenance code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.

Compiler Perspectives on Code Movement

The compiler is concerned with dependencies in the program, not with whether a given hazard arises on a particular pipeline. It tries to schedule code to avoid hazards. It looks for data dependencies (a RAW hazard for the HW):
- Instruction i produces a result used by instruction j, or
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
If instructions are dependent, they can't execute in parallel. Dependence is easy to determine for registers (fixed names), but hard for memory: does 100(R4) = 20(R6)? From different loop iterations, does 20(R6) = 20(R6)?

Data Dependencies

Where are the data dependencies?

    1  Loop:  LD    F0,0(R1)
    2         ADDD  F4,F0,F2
    3         SUBI  R1,R1,8
    4         BNEZ  R1,Loop       ;delayed branch
    5         SD    8(R1),F4      ;altered when moved past SUBI

Name Dependencies

Another kind of dependence is a name dependence: two instructions use the same name (register or memory location) but don't exchange data.
- Anti-dependence (a WAR hazard for the HW): instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first.
- Output dependence (a WAW hazard for the HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
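As a quick illustration (not from the original slides), the same two name dependences can be seen in plain C; the function and variable names here are invented.

    /* Illustrative sketch of name dependences; names are invented. */
    void name_dependences(int d, int e, int g, int h, int i, int j) {
        int a, b = 1, c = 2, f;
        a = b + c;   /* reads b                                             */
        b = d + e;   /* anti-dependence (WAR) on b: this write must stay    */
                     /* after the read above, even though no data flows     */
        f = g + h;   /* writes f                                            */
        f = i + j;   /* output dependence (WAW) on f: keep the writes in    */
                     /* order so f ends up with the value of the last one   */
        (void)a; (void)f;
    }

Renaming (using a fresh variable or register for the later write) removes both dependences, which is exactly what the next two slides do with F0 and F4.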

Where are the name dependencies?

    1  Loop:  LD    F0,0(R1)
    2         ADDD  F4,F0,F2
    3         SD    0(R1),F4
    4         LD    F0,-8(R1)
    5         ADDD  F4,F0,F2
    6         SD    -8(R1),F4
    7         LD    F0,-16(R1)
    8         ADDD  F4,F0,F2
    9         SD    -16(R1),F4
    10        LD    F0,-24(R1)
    11        ADDD  F4,F0,F2
    12        SD    -24(R1),F4
    13        SUBI  R1,R1,#32
    14        BNEZ  R1,LOOP
    15        NOP

No data is passed in F0 between the copies, but F0 can't be reused in instruction 4. How can we remove these dependencies?

Register renaming removes them:

    1  Loop:  LD    F0,0(R1)
    2         ADDD  F4,F0,F2
    3         SD    0(R1),F4
    4         LD    F6,-8(R1)
    5         ADDD  F8,F6,F2
    6         SD    -8(R1),F8
    7         LD    F10,-16(R1)
    8         ADDD  F12,F10,F2
    9         SD    -16(R1),F12
    10        LD    F14,-24(R1)
    11        ADDD  F16,F14,F2
    12        SD    -24(R1),F16
    13        SUBI  R1,R1,#32
    14        BNEZ  R1,LOOP
    15        NOP

This is called "register renaming". Now there are data dependencies only: F0 appears only in instructions 1 and 2.
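The same renaming idea can be shown at the C level. This sketch is illustrative only (invented names, not from the slides): the first version reuses one temporary and so carries WAR/WAW name dependences between copies of the body; the second gives each copy its own temporary, leaving only true data dependences.

    /* Illustrative C analogue of register renaming (invented names). */

    /* Reuses one temporary t: each copy of the body must wait for the
       previous store of t before t may be overwritten (WAR/WAW). */
    void unrolled_reused(double *x, double s) {
        double t;
        t = x[0] + s;  x[0] = t;
        t = x[1] + s;  x[1] = t;
        t = x[2] + s;  x[2] = t;
        t = x[3] + s;  x[3] = t;
    }

    /* Renamed: each copy has its own temporary, so only true (RAW)
       dependences remain and the copies can be freely interleaved. */
    void unrolled_renamed(double *x, double s) {
        double t0 = x[0] + s, t1 = x[1] + s, t2 = x[2] + s, t3 = x[3] + s;
        x[0] = t0;  x[1] = t1;  x[2] = t2;  x[3] = t3;
    }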

Name Dependencies Are Hard for Memory Accesses

Again, name dependencies are hard to detect for memory accesses: does 100(R4) = 20(R6)? From different loop iterations, does 20(R6) = 20(R6)? Our example required the compiler to know that if R1 doesn't change, then 0(R1), -8(R1), -16(R1) and -24(R1) are all different addresses. There were then no dependencies between some loads and stores, so they could be moved around each other.
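A minimal sketch of the kind of reasoning involved, in C; this is illustrative only (the helper and its encoding of addresses are invented, not from the slides). It captures just the easy case the slides rely on: two accesses through the same unchanged base register with different constant offsets cannot overlap, while anything else must conservatively be assumed to alias.

    /* Illustrative sketch: conservative compile-time disambiguation of
       base-register + constant-offset addresses (invented helper). */
    #include <stdbool.h>

    struct mem_ref {
        int base_reg;   /* which register supplies the base address */
        int offset;     /* constant byte offset                     */
        int width;      /* access size in bytes (8 for a DW)        */
    };

    /* Returns true unless the two references provably do not overlap. */
    bool may_alias(struct mem_ref a, struct mem_ref b) {
        if (a.base_reg != b.base_reg)
            return true;                 /* different bases: unknown, assume alias */
        /* Same (unchanged) base: compare the constant byte ranges. */
        if (a.offset + a.width <= b.offset) return false;
        if (b.offset + b.width <= a.offset) return false;
        return true;
    }

With this test, 0(R1), -8(R1), -16(R1) and -24(R1) are pairwise non-overlapping double-word accesses, which is what licenses moving the loads above the stores in the unrolled loop; 100(R4) versus 20(R6) stays "may alias" because the base registers differ.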

Control Dependencies

The final kind of dependence is called a control dependence. Example:

    if (p1) { S1; }
    if (p2) { S2; }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Two (obvious) constraints on control dependences:
- An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
- An instruction that is not control dependent on a branch cannot be moved to after the branch, so that its execution is controlled by the branch.

Control dependencies can be relaxed to get more parallelism: we get the same effect if we preserve the order of exceptions (an address in a register is checked by a branch before use) and the data flow (the value in a register depends on the branch).

Where are the control dependencies?

    1  Loop:  LD    F0,0(R1)
    2         ADDD  F4,F0,F2
    3         SD    0(R1),F4
    4         SUBI  R1,R1,8
    5         BEQZ  R1,exit
    6         LD    F0,0(R1)
    7         ADDD  F4,F0,F2
    8         SD    0(R1),F4
    9         SUBI  R1,R1,8
    10        BEQZ  R1,exit
    11        LD    F0,0(R1)
    12        ADDD  F4,F0,F2
    13        SD    0(R1),F4
    14        SUBI  R1,R1,8
    15        BEQZ  R1,exit
    ...

When Safe to Unroll Loop? (Loop Level Parallelism)

Example: where are the data dependencies? (A, B, C are distinct and non-overlapping.)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

When Safe to Unroll Loop? (continued)

Example: where are the data dependencies? (A, B, C, D are distinct and non-overlapping.)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

Now Safe to Unroll Loop? (p. 240)

OLD:

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

NEW:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

The loop caused a dependence on B; the transformation has eliminated the loop-carried dependence, and there are no circular dependencies.

4.2 Dynamic Scheduling

Dynamic scheduling is when the hardware rearranges the order of instruction execution to reduce stalls.
Advantages:
- Dependencies that are unknown at compile time can be handled by the hardware.
- Code compiled for one type of pipeline can be run efficiently on another.
Disadvantage:
- The hardware is much more complex.

HW Schemes: Instruction Parallelism

Why do this in hardware at run time?
- It works when real dependences cannot be known at compile time.
- The compiler is simpler.
- Code compiled for one machine runs well on another.

Key idea: allow instructions behind a stall to proceed, i.e. let instructions execute in parallel. There are multiple execution units, so use them:

    DIVD  F0,F2,F4
    ADDD  F10,F0,F8
    SUBD  F12,F8,F14

Here the ADDD must wait for the DIVD to produce F0, but the SUBD depends on neither, so it can go ahead. This enables out-of-order execution, which implies out-of-order completion.

Out-of-order execution divides the ID stage in two:
1. Issue: decode the instruction and check for structural hazards.
2. Read operands: wait until there are no data hazards, then read the operands.

A scoreboard allows an instruction to execute whenever conditions 1 and 2 hold, without waiting for prior instructions. A scoreboard is a "data structure" that provides the information necessary for all pieces of the processor to work together. We will use in-order issue, out-of-order execution, and out-of-order commit (also called completion). The scheme was first used in the CDC 6600; our example is modified here for DLX. The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units; DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
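To make "the scoreboard is a data structure" concrete, here is a minimal C sketch of the bookkeeping a classic CDC-6600-style scoreboard keeps, sized for the DLX functional units mentioned above. The field and type names are invented for illustration; the three tables (instruction status, functional unit status, register result status) follow the textbook organization, but this is a sketch, not a full implementation.

    /* Illustrative sketch of scoreboard bookkeeping (invented names). */
    #include <stdbool.h>

    enum stage { ISSUED, READ_OPERANDS, EXECUTED, WROTE_RESULT };

    /* Per-functional-unit status: the heart of the scoreboard. */
    struct fu_status {
        bool busy;        /* is the unit in use?                          */
        int  op;          /* opcode being performed                       */
        int  Fi;          /* destination register                         */
        int  Fj, Fk;      /* source registers                             */
        int  Qj, Qk;      /* FUs producing Fj/Fk (-1 if value is ready)   */
        bool Rj, Rk;      /* source operands ready and not yet read?      */
    };

    #define NUM_FUS   5   /* DLX example: 2 FP mul, 1 FP add, 1 FP div, 1 integer */
    #define NUM_REGS  32

    struct scoreboard {
        enum stage inst_status[64];      /* instruction status (slot count arbitrary)   */
        struct fu_status fu[NUM_FUS];    /* functional unit status                      */
        int result_fu[NUM_REGS];         /* register result status: which FU (if any)   */
                                         /* will write each register; -1 if none        */
    };

The register-result table is what lets Issue detect WAW hazards, and the Qj/Qk/Rj/Rk fields are what Read Operands consults to resolve RAW hazards; the sketch after the Issue and Read Operands slides below uses exactly these fields.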

Scoreboard Implications

- Out-of-order completion raises the possibility of WAR and WAW hazards.
- Solutions for WAR: queue both the operation and copies of its operands, or read registers only during the Read Operands stage.
- For WAW, the hazard must be detected: issue stalls until the other instruction completes.
- Multiple instructions must be in the execution phase, which requires multiple execution units or pipelined execution units.
- The scoreboard keeps track of dependencies and the state of operations.
- The scoreboard replaces ID, EX, WB with four stages.

Four Stages of Scoreboard Control

1. Issue: decode the instruction and check for structural hazards (ID1).
If a functional unit for the instruction is free and no other active instruction has the same destination register (no WAW hazard), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

2. Read operands: wait until there are no data hazards, then read the operands (ID2).
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order. A sketch of these two checks follows below.
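Tying the two stages back to the scoreboard structure sketched earlier, here is an illustrative C fragment of the Issue and Read Operands checks. It reuses the invented fields from that sketch and is only a schematic of the decision logic, not of the full CDC 6600/DLX control.

    /* Illustrative sketch: the hazard checks behind Issue and Read Operands.
       Reuses the invented struct scoreboard / struct fu_status from above. */

    /* Issue (ID1): need a free functional unit (no structural hazard) and no
       active instruction writing the same destination register (no WAW). */
    bool can_issue(const struct scoreboard *sb, int fu, int dest_reg) {
        return !sb->fu[fu].busy && sb->result_fu[dest_reg] == -1;
    }

    /* Read operands (ID2): both source operands must be ready (no RAW);
       Rj/Rk are set once no earlier active instruction still has to write them. */
    bool can_read_operands(const struct scoreboard *sb, int fu) {
        return sb->fu[fu].Rj && sb->fu[fu].Rk;
    }

Because can_read_operands is evaluated per functional unit rather than in program order, a later instruction whose operands are ready can start executing before an earlier one that is still waiting, which is the out-of-order execution described above.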
