Introductions to Parallel Programming Using OpenMP.ppt

Uploaded by: boatfragile160 | Document ID: 372941 | Uploaded: 2018-10-04 | Format: PPT | Pages: 62 | Size: 381.50 KB

Introductions to Parallel Programming Using OpenMP

Zhenying Liu, Dr. Barbara Chapman
High Performance Computing and Tools group
Computer Science Department, University of Houston
April 7, 2005

Content

- Overview of OpenMP
- Acknowledgement
- OpenMP constructs (5 categories)
- OpenMP exercises
- References

Overview of OpenMP

- OpenMP is a set of extensions to Fortran/C/C++.
- OpenMP contains compiler directives, library routines and environment variables.
- Available on most single address space machines:
  - shared memory systems, including cc-NUMA
  - Chip MultiThreading: Chip MultiProcessing (Sun UltraSPARC IV), Simultaneous Multithreading (Intel Xeon)
  - not on distributed memory systems, classic MPPs, or PC clusters (yet!)

Shared Memory Architecture

- All processors have access to one global memory.
- All processors share the same address space.
- The system runs a single copy of the OS.
- Processors communicate by reading/writing to the global memory.
- Examples: multiprocessor PCs (Intel P4), Sun Fire 15K, NEC SX-7, Fujitsu PrimePower, IBM p690, SGI Origin 3000.

Shared Memory Systems (cont)

- OpenMP
- Pthreads

Distributed Memory Systems

- MPI
- HPF

Clusters of SMPs

- MPI
- hybrid MPI + OpenMP

OpenMP Usage

- Applications:
  - Applications with intense computational needs
  - From video games to big science & engineering
- Programmer accessibility:
  - From very early programmers in school to scientists to parallel computing experts
- Available to millions of programmers:
  - In every major (Fortran & C/C++) compiler

OpenMP Syntax

- Most of the constructs in OpenMP are compiler directives or pragmas.
- For C and C++, the pragmas take the form:

    #pragma omp construct [clause [clause]...]

- For Fortran, the directives take one of the forms:

    C$OMP construct [clause [clause]...]
    !$OMP construct [clause [clause]...]
    *$OMP construct [clause [clause]...]

- Since the constructs are directives, an OpenMP program can be compiled by compilers that don't support OpenMP.

OpenMP: Programming Model

- Fork-Join Parallelism:
  - Master thread spawns a team of threads as needed.
  - Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program.
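The fork-join model, and the point that directives can be ignored by a compiler without OpenMP support, can be sketched in C. This is a minimal illustration and not part of the original slides: the helper names `team_size` and `available_threads` are hypothetical, and the `_OPENMP` macro is the standard way to detect that OpenMP was enabled at compile time.

```c
#ifdef _OPENMP
#include <omp.h>   /* runtime library header, only needed with OpenMP enabled */
#endif

/* Hypothetical helper: upper bound on the team size of a parallel region.
   Falls back to 1 when the file is compiled without OpenMP. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Hypothetical helper: fork a team, let every thread contribute 1,
   join, and return the total (the number of threads that ran the block). */
int team_size(void)
{
    int members = 0;

    /* Fork: the master thread spawns a team here.
       Without OpenMP the pragma is ignored and the block runs once. */
    #pragma omp parallel reduction(+:members)
    {
        members += 1;
    }
    /* Join: only the master thread continues past the region. */

    return members;
}
```

With GCC, adding `-fopenmp` enables the pragma; without it the same source still builds and `team_size()` simply returns 1, which is the incremental path from sequential to parallel code that the slide describes.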

OpenMP: How is OpenMP Typically Used?

- OpenMP is usually used to parallelize loops:
  - Find your most time consuming loops.
  - Split them up between threads.

Sequential program:

    int main()
    {
        double Res[1000];
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }

Parallel program (split up this loop between multiple threads):

    int main()
    {
        double Res[1000];
    #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }

OpenMP: How do Threads Interact?

- OpenMP is a shared memory model.
  - Threads communicate by sharing variables.
- Unintended sharing of data can lead to race conditions:
  - race condition: when the program's outcome changes as the threads are scheduled differently.
- To control race conditions:
  - Use synchronization to protect data conflicts.
- Synchronization is expensive, so:
  - Change how data is stored to minimize the need for synchronization.

OpenMP vs. POSIX Threads

- POSIX threads is the other widely used shared-memory programming API.
- Fairly widely available, usually quite simple to implement on top of OS kernel threads.
- Lower level of abstraction than OpenMP:
  - library routines only, no directives
  - more flexible, but harder to implement and maintain
  - OpenMP can be implemented on top of POSIX threads
- Not much difference in availability:
  - not that many OpenMP C++ implementations
  - no standard Fortran interface for POSIX threads

Acknowledgement

- Slides provided by:
  - Tim Mattson and Rudolf Eigenmann, SC '99
  - Mark Bull from EPCC
- OpenMP program examples:
  - Lawrence Livermore National Lab
  - NAS FT parallelization from PGI tutorial
- Dr. Garbey provided us serial codes of Navier-Stokes

OpenMP Constructs

- OpenMP's constructs fall into 5 categories:
  - Parallel Regions
  - Worksharing
  - Data Environment
  - Synchronization
  - Runtime functions/environment variables
- OpenMP is basically the same between Fortran and C/C++.

OpenMP: Parallel Regions

- You create threads in OpenMP with the “omp parallel” pragma.
- For example, to create a 4-thread parallel region:

    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }

- Each thread redundantly executes the code within the structured block.
- Each thread calls pooh(ID,A) for ID = 0 to 3.

OpenMP: Work-Sharing Constructs

- The “for” work-sharing construct splits up loop iterations among the threads in a team.

    #pragma omp parallel
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }

- By default, there is a barrier at the end of the “omp for”. Use the “nowait” clause to turn off the barrier.

Work-Sharing Constructs: A Motivating Example

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a work-sharing for construct:

    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
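The two parallel versions of the motivating example (manual block decomposition inside a parallel region, and the work-sharing for construct) can be wrapped into functions to check that they update every element exactly once. This is a sketch under assumptions: the function names `add_manual` and `add_worksharing` are made up for illustration, and the `_OPENMP` guard is added so the file also builds and runs serially.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Manual decomposition: each thread computes its own [istart, iend) block. */
void add_manual(double *a, const double *b, int n)
{
    #pragma omp parallel
    {
        int id = 0, nthrds = 1;   /* serial defaults when OpenMP is off */
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
#endif
        /* Consecutive blocks tile [0, n): thread id ends where id+1 starts. */
        int istart = (int)((long)id * n / nthrds);
        int iend = (int)((long)(id + 1) * n / nthrds);
        for (int i = istart; i < iend; i++)
            a[i] = a[i] + b[i];
    }
}

/* Work-sharing: the for construct performs the same split automatically. */
void add_worksharing(double *a, const double *b, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```

Calling either function on the same inputs gives the same result as the sequential loop; the work-sharing form just moves the index arithmetic into the compiler and runtime.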

OpenMP For Construct: The Schedule Clause

The schedule clause affects how loop iterations are mapped onto threads:

- schedule(static [,chunk]): Deal out blocks of iterations of size “chunk” to each thread.
- schedule(dynamic [,chunk]): Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
- schedule(guided [,chunk]): Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.
- schedule(runtime): Schedule and chunk size taken from the OMP_SCHEDULE environment variable.

OpenMP: Work-Sharing Constructs

- The sections work-sharing construct gives a different structured block to each thread.

    #pragma omp parallel
    #pragma omp sections
    {
        x_calculation();
    #pragma omp section
        y_calculation();
    #pragma omp section
        z_calculation();
    }

- By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.

Data Environment: Changing Storage Attributes

- One can selectively change the storage attributes of constructs using the following clauses*:
  - SHARED
  - PRIVATE
  - FIRSTPRIVATE
  - THREADPRIVATE
- The value of a private inside a parallel loop can be transmitted to a global value outside the loop with:
  - LASTPRIVATE
- The default status can be modified with:
  - DEFAULT (PRIVATE | SHARED | NONE)

* All data clauses apply to parallel regions and worksharing constructs except “shared”, which only applies to parallel regions.

Data Environment: Default Storage Attributes

- Shared memory programming model:
  - Most variables are shared by default.
- Global variables are SHARED among threads:
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: file scope variables, static
- But not everything is shared:
  - Stack variables in sub-programs called from parallel regions are PRIVATE.
  - Automatic variables within a statement block are PRIVATE.

Private Clause

- private(var) creates a local copy of var for each thread.
  - The value is uninitialized.
  - The private copy is not storage-associated with the original.

    void wrong()
    {
        int IS = 0;
    #pragma omp parallel for private(IS)
        for (int J = 1; J < 1000; J++)
            IS = IS + J;      /* the private IS was never initialized */
        printf("%i", IS);     /* the original IS does not hold the loop result */
    }

OpenMP: Reduction

- Another clause that affects the way variables are shared: reduction (op : list)
- The variables in “list” must be shared in the enclosing parallel region.
- Inside a parallel or a worksharing construct:
  - A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”).
  - Pairwise “op” is updated on the local value.
  - Local copies are reduced into a single global copy at the end of the construct.

OpenMP: A Reduction Example

    #include <omp.h>
    #define NUM_THREADS 2

    int main()
    {
        int i;
        double ZZ, func(), sum = 0.0;
        omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(ZZ)
        for (i = 0; i < 1000; i++) {
            ZZ = func(i);
            sum = sum + ZZ;
        }
    }

OpenMP: Synchronization

OpenMP has the following constructs to support synchronization:

- barrier
- critical section
- atomic
- flush
- ordered
- single
- master

Critical and Atomic

- Only one thread at a time can enter a critical section.

    C$OMP PARALLEL DO PRIVATE(B)
    C$OMP& SHARED(RES)
          DO 100 I=1,NITERS
            B = DOIT(I)
    C$OMP CRITICAL
            CALL CONSUME (B, RES)
    C$OMP END CRITICAL
    100   CONTINUE

- Atomic is a special case of a critical section that can be used for certain simple statements:

    C$OMP PARALLEL PRIVATE(B)
          B = DOIT(I)
    C$OMP ATOMIC
          X = X + B
    C$OMP END PARALLEL

Master directive

- The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).

    #pragma omp parallel private (tmp)
    {
        do_many_things();
    #pragma omp master
        { exchange_boundaries(); }
    #pragma omp barrier
        do_many_other_things();
    }

Single directive

- The single construct denotes a block of code that is executed by only one thread.
- A barrier and a flush are implied at the end of the single block.

    #pragma omp parallel private (tmp)
    {
        do_many_things();
    #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }

OpenMP: Library routines

- Lock routines:
  - omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
- Runtime environment routines:
  - Modify/check the number of threads: omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  - Turn on/off nesting and dynamic mode: omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
  - Are we in a parallel region? omp_in_parallel()
  - How many processors in the system? omp_get_num_procs()
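The lock routines listed under “OpenMP: Library routines” are typically used to guard updates to shared data. The following is a small sketch, not taken from the slides: the function `count_odd_with_lock` is hypothetical, `omp_destroy_lock()` is a real OpenMP routine that the slide does not list, and the serial stand-ins in the `#else` branch are an assumption added only so the file compiles without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumed serial stand-ins so the example builds without OpenMP. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Hypothetical example: count the odd integers in [0, n).
   The shared counter is only touched while the lock is held. */
int count_odd_with_lock(int n)
{
    int count = 0;
    omp_lock_t lock;

    omp_init_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (i % 2 != 0) {
            omp_set_lock(&lock);     /* wait until the lock is free */
            count++;                 /* protected update of shared data */
            omp_unset_lock(&lock);   /* let another thread proceed */
        }
    }

    omp_destroy_lock(&lock);
    return count;
}
```

For a single counter like this, a reduction(+:count) clause or an atomic update would usually be cheaper; the lock form is shown only because it exercises the routines named on the slide.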

OpenMP: Environment Variables

- OMP_NUM_THREADS
  - bash: export OMP_NUM_THREADS=2
  - csh: setenv OMP_NUM_THREADS 4

1. Hello World!

    #include <omp.h>
    #include <stdio.h>

    int main()
    {
        int nthreads, tid;

        /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
        {
            /* Obtain thread number */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }   /* All threads join master thread and disband */
    }

Example Code - Pthread Creation and Termination

    #include <pthread.h>
    #include <stdio.h>
    #define NUM_THREADS 5

    void *PrintHello(void *threadid)
    {
        printf("\n%d: Hello World!\n", threadid);
        pthread_exit(NULL);
    }

    int main(int argc, char *argv[])
    {
        pthread_t threads[NUM_THREADS];
        int rc, t;
        for (t = 0; t < NUM_THREADS; t++) {
            printf("Creating thread %d\n", t);
            rc = pthread_create(
            /* the rest of this call and of main() is cut off in the source */

2. Parallel Loop Reduction

          PROGRAM REDUCTION
          INTEGER I, N
          REAL A(100), B(100), SUM
    ! Some initializations
          N = 100
          DO I = 1, N
            A(I) = I * 1.0
            B(I) = A(I)
          ENDDO
          SUM = 0.0
    !$OMP PARALLEL DO REDUCTION(+:SUM)
          DO I = 1, N
            SUM = SUM + (A(I) * B(I))
          ENDDO
          PRINT *, ' Sum = ', SUM
          END
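For readers working in C, the Fortran PROGRAM REDUCTION exercise can be sketched with the same reduction clause. This is an illustrative translation, not part of the original deck; the function names `dot_reduction` and `reduction_example` are hypothetical.

```c
/* Dot product of a and b; each thread accumulates a private partial sum
   that OpenMP combines into sum at the end of the loop. */
double dot_reduction(const double *a, const double *b, int n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}

/* Same data as the Fortran program: A(I) = I, B(I) = A(I), N = 100. */
double reduction_example(void)
{
    enum { N = 100 };
    double a[N], b[N];

    for (int i = 0; i < N; i++) {
        a[i] = (i + 1) * 1.0;
        b[i] = a[i];
    }
    /* Sum of the squares 1^2 + ... + 100^2 = 338350. */
    return dot_reduction(a, b, N);
}
```

Because every term is a small integer stored exactly in a double, the result does not depend on the order in which the partial sums are combined, so it is the same with one thread or many.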

3. Matrix-vector multiply using a parallel loop and critical directive

    /* Spawn a parallel region explicitly scoping all variables */
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
    {
    #pragma omp for schedule (static, chunk)
        for (i = 0; i < NRA; i++) {
            printf("thread=%d did row=%d\n", tid, i);
            for (j = 0; j < NCB; j++)
                for (k = 0; k < NCA; k++)
                    c[i][j] += a[i][k] * b[k][j];
        }
    }

Steps of Parallelization using OpenMP: An Example from a PGI Tutorial

- Compile a code with the option to enable a profiler.
- Run the code and check if the results are correct.
- Find out the most time-consuming part of the code via the profiler information.
- Parallelize the time-consuming part.
- Repeat above steps until you get reasonable speedup.

How to Use a Profiler

- PGI compiler:
  - pgf90 -fast -Minfo -Mprof=func fftpde.F -o fftpde (function level)
  - -Mprof=lines (line level)
  - -mp for compiling OpenMP codes
  - pgprof pgprof.out (show the profiler result)
- Pathscale compiler:
  - pathf90 -Ofast -pg Fftpde.F -o Fftpde
  - pathprof Fftpde | more

The most time-consuming loop in Fftpde.F:

          do k=1,n3
            do j=1,n2
              do i=1,n1
                z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
              end do
              call fft(z,inverse,w,n1,m1)
              do i=1,n1
                x1real(i,j,k)=real(z(i))
                x1imag(i,j,k)=aimag(z(i))
              end do
            end do
          end do

The OpenMP version of this loop in Fftpde_1.F:

    !$OMP PARALLEL PRIVATE(Z)
    !$OMP DO
          do k=1,n3
            do j=1,n2
              do i=1,n1
                z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
              end do
              call fft(z,inverse,w,n1,m1)
              do i=1,n1
                x1real(i,j,k)=real(z(i))
                x1imag(i,j,k)=aimag(z(i))
              end do
            end do
          end do
    !$OMP END PARALLEL

NEXT: compare the 1 and 2 processor profiles after adding OpenMP to this loop.

Parallelizing the Remainder of Fftpde.F

1) The DO 130 loop near line 64 (fftpde_2.F)
2) The DO 190 loop near line 115 (fftpde_3.F)
3) The DO 220 loop near line 139 (fftpde_4.F)
4) The DO 250 loop near line 155 (fftpde_5.F)

1. Parallelize the DO 130 loop in Fftpde_2.F

    !$OMP PARALLEL PRIVATE(KK,KL,T1,T2,IK)
    !$OMP DO
          DO 130 K = 1, N3
            KK = K - 1
            KL = KK
            T1 = S
            T2 = AN
    C
    C     Find starting seed T1 for this KK using the binary rule
    C     for exponentiation.
    C
            DO 110 I = 1, 100
              IK = KK / 2
              IF (2 * IK .NE. KK) T2 = RANDLC (T1, T2)
              IF (IK .EQ. 0) GOTO 120
              T2 = RANDLC (T2, T2)
              KK = IK
    110     CONTINUE
    C
    C     Compute 2 * NQ pseudorandom numbers.
    C
    120     continue
            CALL VRANLC (N1*N2, T2, aa, x1real(1,1,k))
            CALL VRANLC (N1*N2, T2, aa, x1imag(1,1,k))
    130   CONTINUE
    !$OMP END PARALLEL

2. Parallelize the DO 190 loop in Fftpde_3.F

    !$OMP PARALLEL PRIVATE(K1,J1,JK,I1)
    !$OMP DO
          DO 190 K = 1, N3
            K1 = K - 1
            IF (K .GT. N32) K1 = K1 - N3
    C
            DO 180 J = 1, N2
              J1 = J - 1
              IF (J .GT. N22) J1 = J1 - N2
              JK = J1 ** 2 + K1 ** 2
    C
              DO 170 I = 1, N1
                I1 = I - 1
                IF (I .GT. N12) I1 = I1 - N1
                X3(I,J,K) = EXP (AP * (I1 ** 2 + JK))
    170       CONTINUE
    C
    180     CONTINUE
    190   CONTINUE
    !$OMP END PARALLEL

3. Parallelize the DO 220 loop in Fftpde_4.F

    !$OMP PARALLEL PRIVATE(T1)
    !$OMP DO
          DO 220 K = 1, N3
            DO 210 J = 1, N2
              DO 200 I = 1, N1
                T1 = X3(I,J,K) ** KT
                X2real(I,J,K) = T1 * X1real(I,J,K)
                X2imag(I,J,K) = T1 * X1imag(I,J,K)
    200       CONTINUE
    210     CONTINUE
    220   CONTINUE
    !$OMP END PARALLEL

4. Parallelize the DO 250 loop in Fftpde_5.F

    !$OMP PARALLEL
    !$OMP DO
          DO 250 K = 1, N3
            DO 240 J = 1, N2
              DO 230 I = 1, N1
                X2real(I,J,K) = RN * X2real(I,J,K)
                X2imag(I,J,K) = RN * X2imag(I,J,K)
    230       CONTINUE
    240     CONTINUE
    250   CONTINUE
    !$OMP END PARALLEL

Conclusion

- OpenMP is successful in small-to-medium SMP systems.
- Multiple cores/CPUs dominate the future computer architectures; OpenMP would be the major parallel programming language in these architectures.
- Simple: everybody can learn it in 2 weeks.
- Not so simple: Don't stop learning! Keep learning it for better performance.

Some Buggy Codes

    #pragma omp parallel for shared(a,b,c,chunk) private(i,tid) schedule(static,chunk)
    {
        tid = omp_get_thread_num();
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
        }
    }   /* end of parallel for construct */

References

- OpenMP Official Website: www.openmp.org
- OpenMP 2.5 Specifications
- An OpenMP book: Rohit Chandra, “Parallel Programming in OpenMP”. Morgan Kaufmann Publishers.
- Compunity: the community of OpenMP researchers and developers in academia and industry. http://www.compunity.org/
- Conference papers: WOMPAT, EWOMP, WOMPEI, IWOMP. http://www.nic.uoregon.edu/iwomp2005/index.html#program
