Autovectorization in GCC.ppt

上传人:postpastor181 文档编号:378768 上传时间:2018-10-09 格式:PPT 页数:32 大小:499KB
下载 相关 举报
Autovectorization in GCC.ppt_第1页
第1页 / 共32页
Autovectorization in GCC.ppt_第2页
第2页 / 共32页
Autovectorization in GCC.ppt_第3页
第3页 / 共32页
Autovectorization in GCC.ppt_第4页
第4页 / 共32页
Autovectorization in GCC.ppt_第5页
第5页 / 共32页
亲,该文档总共32页,到这儿已超出免费预览范围,如果喜欢就下载吧!
资源描述

1、Autovectorization in GCC,Dorit Naishlos ,2,Vectorization in GCC - Talk Layout,Background: GCC HRL and GCC,Vectorization Background The GCC Vectorizer Developing a vectorizer in GCC Status & Results Future Work Working with an Open Source Community Concluding Remarks,3,GCC GNU Compiler Collection,Ope

2、n Source Download from gcc.gnu.orgMulti-platform 2.1 million lines of code, 15 years of development How does it work cvs mailing list: gcc-patchesgcc.gnu.org steering committee, maintainers Whos involved Volunteers Linux distributors Apple, IBM HRL (Haifa Research Lab),4,GCC Passes,machine descripti

3、on,C front-end,Java front-end,C+ front-end,parse trees,int i, a16, b16 for (i=0; i 16; i+)ai = ai + bi;,int i; int T.1, T.2, T.3;i = 0; L1:if (i 16) break;T.1 = ai ;T.2 = bi ;T.3 = T.1 + T.2;ai = T.3;i = i + 1;goto L1; L2:,int i_0, i_1, i_2; int T.1_3, T.2_4, T.3_5;i_0 = 0; L1: i_1 = PHIif (i_1 16)

4、break;T.1_3 = ai_1 ;T.2_4 = bi_1 ;T.3_5 = T.1_3 + T.2_4;ai_1 = T.3_5;i_2 = i_1 + 1;goto L1; L2:,GIMPLE:,SSA,5,GCC Passes,GCC 4.0,6,GCC Passes,The Haifa GCC team: Leehod Baruch Revital Eres Olga Golovanevsky Mustafa Hagog Razya Ladelsky Victor Leikehman Dorit Naishlos Mircea Namolaru Ira Rosen Ayal Z

5、aks,machine description,Fortran 95 front-end,IPO CP Aliasing Data layout,Vectorization,Loop unrolling,Scheduler,Modulo Scheduling,Power4,7,Vectorization in GCC - Talk Layout,Background: GCC HRL and GCC,Vectorization Background The GCC Vectorizer Developing a vectorizer in GCC Status & Results Future

6、 Work Working with an Open Source Community Concluding Remarks,8,Programming for Vector Machines,Proliferation of SIMD (Single Instruction Multiple Data) model MMX/SSE, Altivec Communications, Video, Gaming Fortran90 a0:N = b0:N + c0:N; Intrinsicsvector float vb = vec_load (0, ptr_b);vector float vc

7、 = vec_load (0, ptr_c);vector float va = vec_add (vb, vc); vec_store (va, 0, ptr_a); Autovectorization: Automatically transform serial code to vector code by the compiler.,9,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,OP(a) OP(b) OP(c) OP(d),Data in Memory:,VOP( a, b, c, d ),VR1,What is vectorization,Vector Reg

8、isters,Data elements packed into vectors Vector length Vectorization Factor (VF),VF = 4,Vector operation,vectorization,10,Vectorization,original serial loop: for(i=0; iN; i+) ai = ai + bi; loop in vector notation: for (i=0; iN; i+=VF) ai:i+VF = ai:i+VF + bi:i+VF; ,loop in vector notation: for (i=0;

9、i(N-N%VF); i+=VF) ai:i+VF = ai:i+VF + bi:i+VF; for ( ; i N; i+) ai = ai + bi; ,vectorization,Loop based vectorization No dependences between iterations,vectorized loop,epilog loop,11,Loop Dependence Tests,for (i=0; iN; i+) for (j=0; jN; j+) Ai+1j = Aij + X,for (i=0; iN; i+) Di = Ai + Y Ai+1 = Bi + X

10、 ,for (i=0; iN; i+) Bi = Ai + Y Ai+1 = Bi + X ,12,Loop Dependence Tests,for (i=0; iN; i+) for (j=0; jN; j+) Ai+1j = Aij + X,for (i=0; iN; i+) Ai+1 = Bi + X,for (i=0; iN; i+) Di = Ai + Y,for (i=0; iN; i+) Ai+1 = Bi + X Di = Ai + Y ,for (i=0; iN; i+) Di = Ai + Y Ai+1 = Bi + X ,for (i=0; iN; i+) Bi = A

11、i + Y Ai+1 = Bi + X ,13,Classic loop vectorizer,dependence graph,int exist_dep(ref1, ref2, Loop) Separable Subscript testsZeroIndexVar SingleIndexVar MultipleIndexVar (GCD, Banerjee.) Coupled Subscript tests (Gamma, Delta, Omega),find SCCs reduce graph topological sort for all nodes:Cyclic: keep seq

12、uential loop for this nest. non Cyclic:,for ifor jfor kA5 i+1 j = AN i k,for ifor jfor kA5 i+1 i = AN i k,replace node with vector code,loop transform to break cycles,14,Vectorizer Skeleton,get candidate loops nesting, entry/exit, countable,scalar dependences,vectorizable operations data-types, VF,

13、target support,vectorize loop,known loop bound,1D aligned arrays,Basic vectorizer 01.01.2004,idiom recognition,invariants,saturation,conditional code,for (i=0; iN; i+)ai = bi + ci; ,li r9,4 li r2,0 mtctr r9 L2:lvx v0,r2,r30lvx v1,r2,r29vaddfp v0,v0,v1stvx v0,r2,r0addi r2,r2,16bdnz L2,arrays and poin

14、ters,unaligned accesses,force alignment,mainline,15,Vectorization on SSA-ed GIMPLE trees,int T.1, T.2, T.3;loop:if ( i 16 ) break; S1: T.1 = ai ; S2: T.2 = bi ; S3: T.3 = T.1 + T.2; S4: ai = T.3; S5: i = i + 1; goto loop;,loop: if (i 16) break;T.11 = ai ;T.12 = ai+1;T.13 = ai+2; T.14 = ai+3; T.21 =

15、bi ;T.22 = bi+1;T.23 = bi+2; T.24 = bi+3;T.31 = T.11 + T.21;T.32 = T.12 + T.22;T.33 = T.13 + T.23;T.34 = T.14 + T.24; ai = T.31;ai+1 = T.32;ai+2 = T.33;ai+3 = T.34;i = i + 4;goto loop;,VF = 4,“unroll by VF and replace”,int i; int aN, bN; for (i=0; i 16; i+)ai = ai + bi ;,v4si VT.1, VT.2, VT.3; v4si

16、*VPa = (v4si *)a, *VPb = (v4si *)b; int indx;loop:if ( indx 4 ) break;VT.1 = VPaindx ;VT.2 = VPbindx ;VT.3 = VT.1 + VT.2;VPaindx = VT.3;indx = indx + 1; goto loop;,16,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,OP(c) OP(d) OP(e) OP(f),Data in Memory:,VOP( c, d, e, f ),VR3,Alignment,Vector Registers,c,d,e,f,misa

17、lign = -2,Alignment support in a multi-platform compiler General (new trees: realign_load) Efficient (new target hooks: mask_for_load) Hide low-level details,(VR1,VR2) vload (mem) mask (0,0,1,1,1,1,0,0) VR3 pack (VR1,VR2),mask VOP(VR3),17,Handling Alignment,Alignment analysis Transformations to forc

18、e alignment loop versioning loop peeling Efficient misalignment support,for (i=0; iN; i+)x = qi; /misalign(q) = unknownpi = x; /misalign(p) = unknown ,Loop versioning: if (q is aligned) for (i=0; iN; i+)x = qi; /misalign(q) = 0pi = x; /misalign(p) = -2 else for (i=0; iN; i+)x = qi; /misalign(q) = un

19、knownpi = x; /misalign(p) = -2 ,Loop peeling (for access pi): for (i = 0; i 2; i+)x = qi;pi = x; for (i = 2; i N; i+)x = qi; /misalign(q) = unknownpi = x; /misalign(p) = 0 ,Peeling for pi and versioning: for (i = 0; i 2; i+)x = qi;pi = x; if (q is aligned) for (i = 3; iN; i+)x = qi; /misalign(q) = 0

20、pi = x; /misalign(p) = 0 else for (i = 3; iN; i+)x = qi; /misalign(q) = unknownpi = x; /misalign(p) = 0 ,-2,vector *vp_q = q; vector *vp_p = p; int indx = 0, vector vx; LOOP:vx1 = vp_q1indx;vx2 = vp_q2indx;mask = target_hook_mask_for_load (q);vx = realign_load(vx1, vx2, mask)vp_pindx = vx;indx+;,vx

21、= vp_qindx;,vector *vp_q1 = q, *vp_q2 = ,int indx = 0, vector vx, vx1, vx2;,Aart J.C. Bik, Milind Girkar, Paul M. Grey, Ximmin Tian. Automatic intra-register vectorization for the intel architecture. International Journal of Parallel Programming, April 2002.,18,Vectorization in GCC - Talk Layout,Bac

22、kground: GCC HRL and GCC,Vectorization Background The GCC Vectorizer Developing a vectorizer in GCC Status & Results Future Work Working with an Open Source Community Concluding Remarks,19,Vectorizer Status,In the main GCC development trunk Will be part of the GCC 4.0 release New development branch

23、(autovect-branch) Vectorizer Developers: Dorit Naishlos Olga Golovanevsky Ira Rosen Leehod Baruch Keith Besaw (IBM US) Devang Patel (Apple),20,Preliminary Results,Pixel Blending Application - small dataset: 16x improvement - tiled large dataset: 7x improvement - large dataset with display: 3x improv

24、ement for (i = 0; i 8 + (input2i * (-1)8 ); SPEC gzip 9% improvement for (n = 0; n = WSIZE ? m-WSIZE : 0); Kernels:,lvx v0,r3,r2 vsubuhs v0,v0,v1 stvx v0,r3,r2 addi r2,r2,16 bdnz L2,21,Performance improvement (aligned accesses),22,Performance improvement (unaligned accesses),23,Future Work,Reduction

25、 Multiple data types Non-consecutive data-accesses,24,1. Reduction,Cross iteration dependence Prolog and epilog Partial sums,s = 0; for (i=0; iN; i+) s += ai * bi; ,loop: s_1 = phi (0, s_2) i_1 = phi (0, i_1) xa_1 = ai_1 xb_1 = bi_1 tmp_1 = xa * xb s_2 = s_1 + tmp_1 i_2 = i_1 + 1 if (i_2 N) goto loo

26、p,25,2. Mixed data types,short bN; int aN; for (i=0; iN; i+) ai = (int) bi; Unpack,26,3. Non-consecutive access patterns,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,OP(a) OP(f) OP(k) OP(p),Data in Memory:,VOP( a, f, k, p ),VR5,a,f,k,p,a,f,k,p,a,f,k,p,Ai, i=0,5,10,15,; access_fn(i) = (0,+,5),(VR1,VR4) vload (mem

27、) mask (1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1) VR5 pack (VR1,VR4),mask VOP(VR5),27,Developing a generic vectorizer in a multi-platform compiler,Internal Representation machine independent high level Low-level, architecture-dependent details vectorize only if supported (efficiently) may affect benefit of v

28、ectorization may affect vectorization scheme cant be expressed using existing tree-codes,28,Vectorization in GCC - Talk Layout,Background: GCC HRL and GCC,Vectorization Background The GCC Vectorizer Developing a vectorizer in GCC Status & Results Future Work Working with an Open Source Community Con

29、cluding Remarks,29,Working with an Open Source Community - Difficulties,Its a shock “project management” ? No control Whats going on, whos doing what NoiseCulture shock Language Working conventions How to get the best for your purposes Multiplatform Politics,30,Working with an Open Source Community

30、- Advantages,World Wide Collaboration Help, Development Testing World Wide Exposure The Community,31,Concluding Remarks,GCC HRL and GCC Evolving - new SSA frameworkGCC vectorizer Developing a generic vectorizer in a multi-platform compiler Open,GCC 4.0 http:/gcc.gnu.org/projects/tree-ssa/vectorization.html,32,The End,

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 教学课件 > 大学教育

copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1