A Practical Approach toExploiting Coarse-GrainedPipeline .ppt

上传人:花仙子 文档编号:373173 上传时间:2018-10-05 格式:PPT 页数:50 大小:1.85MB
下载 相关 举报
A Practical Approach toExploiting Coarse-GrainedPipeline .ppt_第1页
第1页 / 共50页
A Practical Approach toExploiting Coarse-GrainedPipeline .ppt_第2页
第2页 / 共50页
A Practical Approach toExploiting Coarse-GrainedPipeline .ppt_第3页
第3页 / 共50页
A Practical Approach toExploiting Coarse-GrainedPipeline .ppt_第4页
第4页 / 共50页
A Practical Approach toExploiting Coarse-GrainedPipeline .ppt_第5页
第5页 / 共50页
亲,该文档总共50页,到这儿已超出免费预览范围,如果喜欢就下载吧!
资源描述

1、A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs William Thies, Vikram Chandrasekhar, Saman Amarasinghe,Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of TechnologyMICRO 40 December 4, 2007,Legacy Code,310 billion lines of legacy c

2、ode in industry today 60-80% of typical IT budget spent re-engineering legacy code (Source: Gartner Group) Now code must be migrated to multicore machines Current best practice: manual translation,Parallelization: Man vs. Compiler,Parallelization: Man vs. Compiler,Parallelization: Man vs. Compiler,P

3、arallelization: Man vs. Compiler,Parallelization: Man vs. Compiler,Parallelization: Man vs. Compiler,Parallelization: Man vs. Compiler,Parallelization: Man vs. Compiler,Can we improve compilers by making them more human?,Humanizing Compilers,Current: An Omnipotent Being,New: An Expert Programmer,Ric

4、hard Stallman,First step: change our expectations of correctness,Zeus,Humanizing Compilers,First step: change our expectations of correctness Second step: use compilers differently Option A: Treat them like a programmer Transformations distrusted, subject to test Compiler must examine failures and f

5、ix themOption B: Treat them like a tool Make suggestions to programmer Assist programmers in understanding high-level structure How does this change the problem? Can utilize unsound but useful information In this talk: utilize dynamic analysis,Dynamic Analysis for Extracting Coarse-Grained Paralleli

6、sm from C,Dynamic Analysis for Extracting Coarse-Grained Parallelism from C,Focus on stream programs Audio, video, DSP, networking, and cryptographic processing kernels Regular communication patterns Static analysis complex or intractable Potential aliasing (pointer arithmetic, function pointers, et

7、c.) Heap manipulation (e.g., Huffman tree) Circular buffers (modulo ops) Correlated input parameters Dynamic analysis promising Observe flow of data Very few variations at runtime,Adder,LPF1,LPF2,LPF3,HPF1,HPF2,HPF3,Speaker,AtoD,FMDemod,Scatter,Gather,Dynamic Analysis for Extracting Coarse-Grained P

8、arallelism from C,Focus on stream programs Audio, video, DSP, networking, and cryptographic processing kernels Regular communication patterns Static analysis complex or intractable Potential aliasing (pointer arithmetic, function pointers, etc.) Heap manipulation (e.g., Huffman tree) Circular buffer

9、s (modulo ops) Correlated input parameters Opportunity for dynamic analysis If flow of data is very stable, can infer it with a small sample,Adder,LPF1,LPF2,LPF3,HPF1,HPF2,HPF3,Speaker,AtoD,FMDemod,Scatter,Gather,Overview of Our Approach,Original Program,Annotated Program,Mark Potential Actor Bounda

10、ries,Run Dynamic Analysis,No,Hand Parallelized Program,Auto Parallelized Program,Satisfied with Parallelism?,Yes,Communicate data by hand,Communicate based on trace,test and refine using multiple inputs,MPEG-2 Decoder,Stability of MPEG-2,MPEG-2 Decoder,Stability of MPEG-2,Top 10 YouTube Videos,Stabi

11、lity of MPEG-2 (Within an Execution),Frame,Stability of MPEG-2 (Across Executions),Minimum number of training iterations (frames) needed on each video in order to correctly decode the other videos.,Stability of MPEG-2 (Across Executions),Minimum number of training iterations (frames) needed on each

12、video in order to correctly decode the other videos.,5 frames of training on one video is sufficient to correctly parallelize any other video,Stability of MP3 (Across Executions),Minimum number of training iterations (frames) needed on each track in order to correctly decode the other tracks.,Stabil

13、ity of MP3 (Across Executions),Minimum number of training iterations (frames) needed on each track in order to correctly decode the other tracks.,Stability of MP3 (Across Executions),Layer 1 frames,Minimum number of training iterations (frames) needed on each track in order to correctly decode the o

14、ther tracks.,Stability of MP3 (Across Executions),CRC Error,Minimum number of training iterations (frames) needed on each track in order to correctly decode the other tracks.,Stability of MP3 (Across Executions),Minimum number of training iterations (frames) needed on each track in order to correctl

15、y decode the other tracks.,Outline,Analysis Tool Case Studies,Outline,Analysis Tool Case Studies,Annotating Pipeline Parallelism,Programmer indicates potential actor boundaries in a long-running loop,Annotating Pipeline Parallelism,Programmer indicates potential actor boundaries in a long-running lo

16、op,Annotating Pipeline Parallelism,Programmer indicates potential actor boundaries in a long-running loop,Annotating Pipeline Parallelism,Programmer indicates potential actor boundaries in a long-running loopServes as a fundamental API for pipeline parallelism Comparable to OpenMP for data paralleli

17、sm Comparable to Threads for task parallelism,Legacy C Code,MP3 Decoding,while (!end_bs(. ,Dynamic Analysis,Legacy C Code,Record Who Produces / Consumes each Location,MP3 Decoding,Huffman () ,Dequantize() ,Antialias() ,Hybrid() ,Polyphase() ,out_fifo() ,while (!end_bs(. ,Build Block Diagram,Mem,Huff

18、man(),Antialias(),Polyphase(),out_fifo(),Dynamic Analysis,Implemented Using Valgrind,Dequantize(),Hybrid(),Dequantize(),Dequantize(),Hybrid(),Huffman(),Antialias(),Polyphase(),out_fifo(),Exploiting the Parallelism,Stateless stage (data parallel),Antialias(),Polyphase(),Stateful stage (sequential),Hy

19、brid(),Huffman(),Antialias(),Polyphase(),out_fifo(),Exploiting the Parallelism,Reorder(),Dequantize(),Antialias(),for (i=0; iN; i+) PIPELINE(); Dequantize(); PIPELINE(); . ,Polyphase(),Stateless stage (data parallel),Stateful stage (sequential),Hybrid(),Huffman(),Antialias(),Polyphase(),out_fifo(),E

20、xploiting the Parallelism,Reorder(),Dequantize(),Antialias(),for (i=0; iN; i+) PIPELINE(N); Dequantize(); PIPELINE(); . ,Polyphase(),Stateless stage (data parallel),Stateful stage (sequential),DequantizeN(),Dequantize1(),Dequantize(),Hybrid(),Huffman(),Antialias(),Polyphase(),out_fifo(),Exploiting t

21、he Parallelism,Antialias(),for (i=0; iN; i+) PIPELINE(N); Dequantize(); PIPELINE(); . ,Polyphase(),Stateless stage (data parallel),Stateful stage (sequential),Parallel Runtime Environment,Pipeline parallelism requires buffering between stages Two ways to implement buffering: 1. Modify original progr

22、am to add buffers 2. Wrap original code in virtual execution environment We fork each actor into an independent process, and communicate the recorded variables via pipesRobust in the presence of aliasing Suitable to shared or distributed memory Efficient (7% communication overhead on MP3),Mem,Dequan

23、tize(),Programmer assistance needed for: - mallocd data- nested loops- reduction vars,pipe,Outline,Analysis Tool Case Studies,Extracted Stream Graphs,Ground Moving Target Indicator (GMTI),Extracted with tool:,From GMTI specification:,Audio and Video Codecs,MP3 Decoder,MPEG-2 Decoder,SPEC Benchmarks,

24、197.parser,256.bzip2 (compression),256.bzip2 (decompression),456.hmmer,Interactive Parallelization Process,Analysis tool exposed serializing dependences As annotated back-edges in stream graph (main.c:9 fft.c:5) How to deal with serializing dependences? 1. Rewrite code to eliminate dependence, or 2.

25、 Instruct the tool to ignore the dependence Lesson learned: Many memory dependences can be safely ignored! Allow malloc (or free) to be called in any order (GMTI, hmmer) Allow rand() to be called in any order (hmmer) Ignore dependences on uninitialized memory (parser) Ignore ordering of demand-drive

26、n buffer expansion (hmmer),Results,On two AMD 270 dual-core processors,Results,Profiled for 10 iterations of training data Ran for complete length of testing data Only observed unsoundness: MP3,How to Improve Soundness?,Revert to sequential version upon seeing new code (fixes MP3) Hardware support M

27、ondriaan memory protection (Witchel et. al) Versioned memory (used by Bridges et al.) Would provide safe communication, but unsafe parallelism Rigorous testing with maximal code coverage Programmer review,Related Work,Revisiting the Sequential Programming Model for Multi-Core (Bridges et al., yester

28、day) Same pipeline-parallel decompositions of parser, bzip2 Like commutative annotation, we tell tool to ignore dependences But since we target distributed memory, annotation represents privatization rather than reordering Dynamic analysis for understanding, parallelization Rul et. al (2006) program

29、mer manages communication Redux (2003) fine-grained dependence visualization Karkowski and Corporaal (1997) focus on data parallelism Inspector/executor for DOACROSS parallelism Rauchwerger (1998) survey,Conclusions,Dynamic analysis can be useful for parallelization Our tool is simple, transparent, and one of the first to extract coarse-grained pipeline parallelism from C programs Primary application: program understanding Secondary application: automatic parallelization Future work in improving soundness, automation,

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 教学课件 > 大学教育

copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1