Multicore Programming A (多核编程A.ppt): lecture slides

Slide 1: Parallel Programming with MPI

Slide 2: Agenda
- Part I: Seeking Parallelism/Concurrency
- Part II: Parallel Algorithm Design
- Part III: Message-Passing Programming

Slide 3: Part I: Seeking Parallelism/Concurrency

Slide 4: Outline
- 1 Introduction
- 2 Seeking Parallelism

Slide 5: 1 Introduction (1/6)
- "Well done is quickly done." (Caesar Augustus)
- Fast, fast, fast is not "fast" enough. How do we get higher performance? Parallel computing.

Slide 6: 1 Introduction (2/6)
- What is parallel computing? It is the use of a parallel computer to reduce the time needed to solve a single computational problem.
- It is now considered a standard way for computational scientists and engineers to solve problems in areas as diverse as galactic evolution, climate modeling, aircraft design, molecular dynamics, and economic analysis.

Slide 7: Parallel Computing
- A task is broken down into subtasks, performed by separate workers or processes.
- Processes interact by exchanging information.
- What do we basically need? The ability to start the tasks, and a way for them to communicate.

Slide 8: 1 Introduction (3/6)
- What is a parallel computer? A multiprocessor computer system supporting parallel programming.
- Multicomputer: a parallel computer constructed out of multiple computers and an interconnection network. The processors on different computers interact by passing messages to each other.
- Centralized multiprocessor (SMP, symmetric multiprocessor): a more highly integrated system in which all CPUs share access to a single global memory. The shared memory supports communication and synchronization among processors.

Slide 9: 1 Introduction (4/6)
- Multi-core platform: two, four, or more cores are integrated into one processor. Each core has its own registers and Level 1 cache; all cores share the Level 2 cache, which supports communication and synchronization among cores. All cores share access to a global memory.

Slide 10: 1 Introduction (5/6)
- What is parallel programming? Programming in a language that allows you to explicitly indicate how different portions of the computation may be executed in parallel or concurrently by different processors/cores.
- Do I really need parallel programming? YES: although a lot of research has been invested and many experimental parallelizing compilers have been developed, there is still no commercial system thus far. The alternative is to write your own parallel programs.

Slide 11: 1 Introduction (6/6)
- Why should I program using MPI and OpenMP?
- MPI (Message Passing Interface) is a standard specification for message-passing libraries. It is available on virtually every parallel computer system, and it is free. If you develop programs using MPI, you will be able to reuse them when you get access to a newer, faster parallel computer.
- On a multi-core platform or SMP, the cores/CPUs share a memory space. While MPI is a perfectly satisfactory way for cores/processors to communicate with each other, OpenMP is a better way for the cores/processors within a single processor/SMP to interact. A hybrid MPI/OpenMP program can achieve even higher performance.

Slide 12: 2 Seeking Parallelism (1/7)
- In order to take advantage of multiple cores/processors, programmers must be able to identify operations that may be performed in parallel.
- Several ways: data dependence graphs, data parallelism, functional parallelism, pipelining.

Slide 13: 2 Seeking Parallelism (2/7)
- Data dependence graph: a directed graph.
- Each vertex represents a task to be completed.
- An edge from vertex u to vertex v means that task u must be completed before task v begins: task v is dependent on task u.
- If there is no path from u to v, the tasks are independent and may be performed in parallel.

Slide 14: 2 Seeking Parallelism (3/7)
- Data dependence graphs (example figure).

Slide 15: 2 Seeking Parallelism (4/7)
- Data parallelism: independent tasks applying the same operation to different elements of a data set (example figure).

Slide 16: 2 Seeking Parallelism (5/7)
- Functional parallelism: independent tasks applying different operations to different data elements of a data set (see the sketch after this slide).
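These two forms of parallelism are easy to see in code. The following minimal C sketch is not from the slides (the arrays, the problem size, and the OpenMP pragma are illustrative assumptions): the first loop is data parallel, because every iteration applies the same operation to a different element, while the two independent reductions afterwards illustrate functional parallelism, since they are different operations and neither depends on the other.

```c
#include <stdio.h>

#define N 1000000   /* illustrative problem size */

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }

    /* Data parallelism: each iteration applies the same operation
       (an element-wise add) to a different element, so the iterations
       are independent and can be divided among cores, e.g. with OpenMP. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Functional parallelism: a sum and a sum of squares are different
       operations that do not depend on each other, so they could be
       performed by two different workers at the same time. */
    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < N; i++) sum    += c[i];
    for (int i = 0; i < N; i++) sum_sq += c[i] * c[i];

    printf("sum = %f, sum of squares = %f\n", sum, sum_sq);
    return 0;
}
```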

Slide 17: 2 Seeking Parallelism (6/7)
- Pipelining: a data dependence graph forming a simple path/chain admits no parallelism if only a single problem instance must be processed.
- If multiple problem instances are to be processed, and the computation can be divided into several stages with the same time consumption, then pipelining can provide parallelism. E.g., an assembly line.

Slide 18: 2 Seeking Parallelism (7/7)
- Pipelining (example figure).

Slide 19: For example
- Landscape maintenance, preparing for dinner, data clustering.

Slide 20: Homework
- Given a task that can be divided into m subtasks, each requiring one unit of time, how much time is needed for an m-stage pipeline to process n tasks?
- Consider the data dependence graph in the figure below: identify all sources of data parallelism; identify all sources of functional parallelism.

Slide 21: Part II: Parallel Algorithm Design

Slide 22: Outline
- 1. Introduction
- 2. The Task/Channel Model
- 3. Foster's Design Methodology

Slide 23: 1. Introduction
- Foster, Ian. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Reading, MA: Addison-Wesley, 1995.
- Describes the task/channel model and a few simple problems.

Slide 24: 2. The Task/Channel Model
- The model represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels.
- Task: a program, its local memory, and a collection of I/O ports. Local memory: instructions and private data.

Slide 25: 2. The Task/Channel Model
- Channel: via a channel, a task can send local data to other tasks through its output ports and receive data values from other tasks through its input ports.
- A channel is a message queue: it connects one task's output port with another task's input port. Data values appear at the input port in the same order in which they were placed into the output port at the other end of the channel.
- Receiving data can block: receiving is synchronous. Sending data never blocks: sending is asynchronous.
- Access to local memory is faster than access to nonlocal data. (A minimal MPI sketch of this send/receive pattern follows.)
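The channel semantics above map directly onto MPI point-to-point calls. The sketch below is not from the slides (the ranks, tag, and payload are illustrative assumptions): rank 0 plays the role of a task writing to its output port with MPI_Send, which in standard mode may return as soon as the message is buffered, while rank 1 reads from its input port with MPI_Recv, which blocks until the data has arrived. Messages between the same pair of processes on the same communicator are delivered in the order they were sent, matching the message-queue behaviour of a channel.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* local data placed on the "output port" */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until the value sent by rank 0 is available. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. mpiexec -n 2 ./a.out, so that both ends of the channel exist.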

Slide 26: 3. Foster's Design Methodology
- A four-step process: partitioning, communication, agglomeration, mapping.

Slide 27: 3. Foster's Design Methodology
- Partitioning is the process of dividing the computation and the data into pieces. More, smaller pieces are better.
- How: a data-centric approach or a function-centric approach.
- Domain decomposition: first divide the data into pieces, then determine how to associate computations with the data. Focus on the largest and/or most frequently accessed data structure in the program (a 1-D block-decomposition sketch in MPI appears after slide 29 below).
- Functional decomposition.

Slide 28: 3. Foster's Design Methodology: Domain Decomposition
- 1-D, 2-D, and 3-D decompositions of the data into primitive tasks, with the finer (3-D) decomposition marked as better (example figure).

Slide 29: 3. Foster's Design Methodology: Functional Decomposition
- Yields collections of tasks that achieve parallelism through pipelining. E.g., a system supporting interactive image-guided surgery.
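As a concrete illustration of 1-D domain decomposition, the sketch below (not from the slides; the global size N and the work_on helper mentioned in the comment are assumptions) uses a common block-distribution formula so that each MPI process determines which contiguous piece of the data it owns; the computation associated with that piece would then run locally.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000   /* global problem size (illustrative) */

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1-D block decomposition: process 'rank' owns indices [lo, hi). */
    int lo = rank * N / size;
    int hi = (rank + 1) * N / size;

    printf("process %d of %d owns elements %d..%d\n", rank, size, lo, hi - 1);

    /* The computation associated with this block would go here,
       e.g. for (int i = lo; i < hi; i++) work_on(i); */

    MPI_Finalize();
    return 0;
}
```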

Slide 30: 3. Foster's Design Methodology
- Quality of the partitioning (evaluation checklist):
- There are at least an order of magnitude more primitive tasks than processors in the target parallel computer. Otherwise, later design options may be too constrained.
- Redundant computations and redundant data structure storage are minimized. Otherwise, the design may not work well when the size of the problem increases.
- Primitive tasks are roughly the same size. Otherwise, it may be hard to balance the work among the processors/cores.
- The number of tasks is an increasing function of the problem size. Otherwise, it may be impossible to use more processors/cores to solve larger problems.

Slide 31: 3. Foster's Design Methodology
- Communication: after identifying the primitive tasks, the type of communication between those primitive tasks should be determined.
- Two kinds of communication: local and global.

Slide 32: 3. Foster's Design Methodology
- Local: a task needs values from a small number of other tasks in order to perform a computation; a channel is created from each task supplying the data to the task consuming the data.
- Global: a significant number of the primitive tasks must contribute data in order to perform a computation. E.g., computing the sum of the values held by the primitive processes (see the reduction sketch below).
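The global sum just mentioned is exactly what MPI's collective reduction provides. A minimal sketch, with the contributed value chosen arbitrarily for illustration: every process supplies one integer, and the sum is delivered to rank 0.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank + 1;   /* the value held by this primitive task */
    int total = 0;

    /* Global communication: every process contributes 'local';
       the sum arrives at the root (rank 0). */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", total);

    MPI_Finalize();
    return 0;
}
```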

Slide 33: 3. Foster's Design Methodology
- Communication: evaluate the communication structure of the designed parallel algorithm:
- The communication operations are balanced among the tasks.
- Each task communicates with only a small number of neighbors.
- Tasks can perform their communications in parallel/concurrently.
- Tasks can perform their computations in parallel/concurrently.

Slide 34: 3. Foster's Design Methodology
- Agglomeration: why do we need it? If the number of tasks exceeds the number of processors/cores by several orders of magnitude, simply creating these tasks would be a source of significant overhead. So we combine primitive tasks into larger tasks and map them onto physical processors/cores to reduce the amount of parallel overhead.
- What is agglomeration? The process of grouping tasks into larger tasks in order to improve performance or simplify programming.
- When developing MPI programs, ONE task per core/processor is usually best.

Slide 35: 3. Foster's Design Methodology
- Agglomeration goal 1: lower communication overhead. Eliminate communication among tasks by increasing the locality of the parallelism, and combine groups of sending and receiving tasks.

Slide 36: 3. Foster's Design Methodology
- Agglomeration goal 2: maintain the scalability of the parallel design. Make sure we have not combined so many tasks that the program cannot be ported, at some point in the future, to a computer with more processors/cores.
- E.g., a 3-D matrix operation on a matrix of size 8 x 128 x 256: agglomerating away the second and third dimensions would leave at most 8 tasks, which would not scale.

Slide 37: 3. Foster's Design Methodology
- Agglomeration goal 3: reduce software engineering costs. Make greater use of the existing sequential code, reducing development time and expense.

Slide 38: 3. Foster's Design Methodology
- Agglomeration evaluation checklist:
- The agglomeration has increased the locality of the parallel algorithm.
- Replicated computations take less time than the communications they replace.
- The amount of replicated data is small enough to allow the algorithm to scale.
- Agglomerated tasks have similar computational and communication costs.
- The number of tasks is an increasing function of the problem size.
- The number of tasks is as small as possible, yet at least as great as the number of cores/processors in the target computer.
- The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable.

Slide 39: 3. Foster's Design Methodology
- Mapping: increasing processor utilization; minimizing inter-processor communication.

Slide 40: Part III: Message-Passing Programming

Slide 41: Preface

Slides 42-43: (content not preserved in this transcription)

Slide 44: Hello World!
- Hello world from process 0 of 4
- Hello world from process 1 of 4
- Hello world from process 2 of 4
- Hello world from process 3 of 4
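The program behind this slide is not preserved in the transcription; a standard MPI "hello world" in C that produces output of this form looks like the following.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello world from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down MPI */
    return 0;
}
```

With a typical MPI installation this would be compiled with mpicc and launched with something like mpiexec -n 4 ./hello. The number of processes is fixed for the whole run, and the four output lines may appear in any order, since the processes run concurrently.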

Slide 45: Outline
- Introduction
- The Message-Passing Model
- The Message-Passing Interface (MPI)
- Communication Modes
- Circuit satisfiability
- Point-to-Point Communication
- Collective Communication
- Benchmarking parallel performance

Slide 46: Introduction
- MPI: Message Passing Interface.
- It is a library, not a parallel language: C & MPI, Fortran & MPI.
- It is a standard, not a particular implementation. Implementations include MPICH, Intel MPI, MS MPI, and LAM/MPI.
- It follows the message-passing model.

Slide 47: Introduction
- The history of MPI: draft in 1992; MPI-1 in 1994; MPI-2 in 1997.
- http://www.mpi-forum.org

Slide 48: Introduction
- MPICH: http://www-unix.mcs.anl.gov/mpi/mpich1/download.html; http://www-unix.mcs.anl.gov/mpi/mpich2/index.htm#download
- Main features: open source; kept synchronized with the MPI standard; supports MPMD (Multiple Program Multiple Data) and heterogeneous clusters; supports C/C++, Fortran 77, and Fortran 90; supports Unix and Windows NT platforms; supports multi-core, SMP, cluster, and large-scale parallel computer systems.

Slide 49: Introduction
- Intel MPI: conforms to the MPI-2 standard. Latest version: 3.1.
- DAPL (Direct Access Programming Library).

Slide 50: Introduction: Intel MPI
- The Intel MPI Library supports multiple hardware fabrics (figure).

Slide 51: Introduction: Intel MPI
- Features: a multi-fabric message-passing library.
- Implements the Message Passing Interface v2 (MPI-2) specification.
- Provides a standard library across Intel platforms that: focuses on making applications perform best on IA-based clusters; enables adoption of MPI-2 functions as customer needs dictate; delivers best-in-class performance for enterprise, divisional, departmental, and workgroup high-performance computing.

Slide 52: Introduction: Intel MPI
- Why the Intel MPI Library? High-performance MPI-2 implementation; Linux and Windows CCS support; interconnect independence; smart fabric selection; easy installation; free runtime environment; close integration with Intel and 3rd-party development tools; Internet-based licensing and technical support.

Slide 53: Introduction: Intel MPI
- Standards based: built on Argonne National Laboratory's MPICH-2 implementation.
- Integration: can be easily integrated with Platform LSF 6.1 and higher, Altair PBS Pro* 7.1 and higher, OpenPBS* 2.3, Torque* 1.2.0 and higher, Parallelnavi* NQS* for Linux V2.0L10 and higher, Parallelnavi for Linux Advanced Edition V1.0L10A and higher, and NetBatch* 6.x and higher.

Slide 54: Introduction: Intel MPI
- System requirements (host and target systems hardware): IA-32, Intel 64, or IA-64 architecture using Intel Pentium 4, Intel Xeon, or Intel Itanium processor family and compatible platforms; 1 GB of RAM (4 GB recommended); minimum 100 MB of free hard disk space (10 GB recommended).

Slide 55: Introduction: Intel MPI
- Operating system requirements: Microsoft Windows* Compute Cluster Server 2003 (Intel 64 architecture only); Red Hat Enterprise Linux* 3.0, 4.0, or 5.0; SUSE* Linux Enterprise Server 9 or 10; SUSE Linux 9.0 through 10.0 (all except Intel 64 architecture, which starts at 9.1); HaanSoft Linux 2006 Server*; Miracle Linux* 4.0; Red Flag* DC Server 5.0; Asianux* Linux 2.0; Fedora Core 4, 5, or 6 (IA-32 and Intel 64 architectures only); TurboLinux* 10 (IA-32 and Intel 64 architectures); Mandriva/Mandrake* 10.1 (IA-32 architecture only); SGI* ProPack 4.0 (IA-64 architecture only) or 5.0 (IA-64 and Intel 64 architectures).

Slide 56: The Message-Passing Model (figure)

Slide 57: The Message-Passing Model
- A task in the task/channel model becomes a process in the message-passing model.
- The number of processes is specified by the user, is specified when the program begins, and is constant throughout the execution of the program.
- Each process has a unique ID number.

Slide 58: The Message-Passing Model
- Goals of the message-passing model: processes communicate with each other; processes synchronize with each other.

Slide 59: The Message-Passing Interface (MPI)
- Advantages: runs well on a wide variety of MPMD architectures; easy to debug; thread safe.

Slide 60: What is in MPI
- Point-to-point message passing; collective communication; support for process groups; support for communication contexts; support for application topologies; environmental inquiry routines; profiling interface.

Slide 61: Introduction to Groups and Communicators
- Process model and groups; communication scope; communicators.

Slide 62: Process model and groups
- The fundamental computational unit is the process. Each process has an independent thread of control and a separate address space.
- MPI processes execute in MIMD style, but there is no mechanism for loading code onto processors or assigning processes to processors, and no mechanism for creating or destroying processes.
- MPI supports dynamic process groups: process groups can be created and destroyed; membership is static; groups may overlap.
- There is no explicit support for multithreading, but MPI is designed to be thread safe. (A small communicator/group sketch follows.)
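The slides stop at the concepts, but a short sketch may help show how groups and communicators appear in code. The even/odd split below is an illustrative assumption: MPI_Comm_split derives new communicators, each with its own process group and its own ranks, from MPI_COMM_WORLD, giving each subgroup a separate communication scope.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int world_rank, world_size, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split the world group into two disjoint subgroups: even-ranked
       and odd-ranked processes each get their own communicator,
       and hence their own communication scope and ranks. */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);

    printf("world rank %d of %d has rank %d in subcommunicator %d\n",
           world_rank, world_size, sub_rank, color);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
```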
