Automatic Tuning for Parallel FFTs
Daisuke Takahashi
University of Tsukuba, Japan
2008/6/24, Second French-Japanese PAAP Workshop

Outline
Background
Objectives
Approach
Block Six-Step/Nine-Step FFT Algorithm
Automatic Tuning for Parallel FFTs
Performance Results
Conclusion

Background
The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering.
Parallel FFT algorithms on distributed-memory parallel computers have been well studied.
Many numerical libraries with automatic performance tuning have been developed, e.g., ATLAS, FFTW, and I-LIB.

Background (cont'd)
One goal for large FFTs is to minimize the number of cache misses.
Many FFT algorithms work well when the data set fits into the cache.
When the problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.
We modified the conventional six-step FFT algorithm to reuse data in the cache memory. We call the result a “block six-step FFT”.

Related Work
FFTW, Frigo and Johnson (MIT)
Recursive calls are used to access main memory hierarchically.
This technique is very effective when the total amount of data is not much greater than the cache size.
For the 1-D parallel MPI FFT, the six-step FFT is used.
http://www.fftw.org
SPIRAL, Pueschel et al. (CMU)
The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.
http:/

FFTE: A High-Performance FFT Library
FFTE is a Fortran subroutine library for computing the fast Fourier transform (FFT) in one or more dimensions.
It includes complex, mixed-radix and parallel transforms.
It targets shared- and distributed-memory parallel computers (OpenMP, MPI and OpenMP + MPI).
It also supports Intel's SSE2/SSE3 instructions.
HPC Challenge Benchmark
FFTE's 1-D parallel FFT routine has been incorporated into the HPC Challenge (HPCC) benchmark.
http://www.ffte.jp

Objectives
To improve performance, we need to select the optimal parameters according to the computational environment and the problem size.
We implement an automatic tuning facility for the parallel 1-D FFT routine in the FFTE library.

Discrete Fourier Transform (DFT)
DFT is given by [...]

2-D Formulation
If [...] has factors [...] and [...], then [...]
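
The 2-D formulation can be illustrated with a small, unoptimized sketch. It assumes the usual forward DFT, y(k) = sum over j of x(j) * w^(j*k) with w = exp(-2*pi*i/n), and the index maps j = j1 + j2*n1 and k = k2 + k1*n2 for n = n1*n2. This notation, and the names dft and six_step_fft, are illustrative assumptions and are not taken from FFTE. The short transforms use a direct O(n^2) DFT where a real library would use optimized FFT kernels.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT: y[k] = sum_j x[j] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def six_step_fft(x, n1, n2):
    """Length-(n1*n2) DFT via the 2-D decomposition (assumed index convention).

    Input index:  j = j1 + j2*n1
    Output index: k = k2 + k1*n2
    """
    n = n1 * n2
    assert len(x) == n, "len(x) must equal n1 * n2"

    # Steps 1-2: gather the elements with stride n1 (a transpose), then run
    # n1 individual n2-point transforms over j2. Result is rows[j1][k2].
    rows = [dft([x[j1 + j2 * n1] for j2 in range(n2)]) for j1 in range(n1)]

    # Step 3: multiply by the twiddle factors w_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Steps 4-5: transpose, then run n2 individual n1-point transforms
    # over j1. Result is cols[k2][k1].
    cols = [dft([rows[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]

    # Step 6: final transpose back to natural order, k = k2 + k1*n2.
    y = [0j] * n
    for k2 in range(n2):
        for k1 in range(n1):
            y[k2 + k1 * n2] = cols[k2][k1]
    return y
```

Calling six_step_fft(x, 3, 4) on a 12-element list should agree with dft(x) up to rounding error. The three gather/scatter passes correspond to the three transposes of the six-step algorithm; they are the memory traffic that a blocked variant tries to keep inside the cache.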

Six-Step FFT Algorithm
Diagram labels: individual [...]-point FFTs; Transpose; Transpose; Transpose

Block Six-Step FFT Algorithm
Diagram labels: individual [...]-point FFTs; Partial Transpose; Partial Transpose; Transpose

3-D Formulation
For very large FFTs, we should switch to a 3-D formulation.
If [...] has factors [...], [...] and [...], then [...]

Parallel Block Nine-Step FFT
Diagram labels: Partial Transpose; Partial Transpose; Partial Transpose; All-to-all comm.

Automatic Tuning for Parallel FFTs
If the condition [...] is satisfied, then we can choose [...], [...] and [...] arbitrarily, where [...].
In the original FFTE library, we chose [...].
The blocking parameter [...] can also be varied.
For a given [...], the best block size is determined by the L2 cache size.
In the original FFTE, [...] for the Xeon processor.
We implemented the automatic tuning facility for varying [...], [...], and [...].

Performance Results
To evaluate parallel 1-D FFTs, we compared:
FFTE (ver. 4.0)
FFTE (ver. 4.0) with automatic tuning
FFTW (ver. 3.2alpha3); “mpi-bench” with the “PATIENT” planner was used.
Target parallel machine: a 16-node dual-core Xeon PC cluster (Woodcrest 2.4 GHz, 2 GB SDRAM/node, Linux 2.6.18), interconnected through a Gigabit Ethernet switch.
Open MPI 1.2.5 was used as the communication library.
The compilers used were Intel C compiler 10.1 and Intel Fortran compiler 10.1.

Results of Automatic Tuning on dual-core Xeon 2.4GHz PC cluster

Discussion
For N = 2^28 and P = 32, FFTE with automatic tuning runs about 1.25 times faster than FFTW.
Since FFTW uses the six-step FFT, each column FFT does not fit into the L1 data cache.
Moreover, FFTE exploits the SSE3 instructions.
These are two reasons why FFTE is more advantageous than FFTW.
We can clearly see that all-to-all communication overhead contributes significantly to the execution time.

Conclusions
We proposed an automatic tuning method for parallel 1-D FFTs on distributed-memory parallel computers.
A blocking algorithm for parallel 1-D FFTs utilizes cache memory effectively.
We found that the default parameters of FFTE are not always optimal according to the results of the automatic tuning.
The performance of FFTE with automatic tuning is better than that of FFTW.
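
Automatic tuning of this kind can be pictured as a search over the factorization of the problem size and the block size. The sketch below is one possible shape for such a search, not FFTE's actual tuning code. It assumes an exhaustive sweep, power-of-two sizes only, and a caller-supplied kernel taking (n1, n2, n3, nb). The names power_of_two_splits, time_once, autotune, kernel and measure are all hypothetical. It also ignores any constraint that the number of processes may place on the factors.

```python
import time
from itertools import product


def power_of_two_splits(n):
    """All (n1, n2, n3) with n1 * n2 * n3 == n, each factor a power of two."""
    p = n.bit_length() - 1
    assert n == 1 << p, "this sketch handles power-of-two sizes only"
    return [
        (1 << a, 1 << b, 1 << (p - a - b))
        for a in range(p + 1)
        for b in range(p + 1 - a)
    ]


def time_once(kernel, n1, n2, n3, nb):
    """Wall-clock time of one kernel run with the given parameters."""
    start = time.perf_counter()
    kernel(n1, n2, n3, nb)
    return time.perf_counter() - start


def autotune(n, block_sizes, kernel, measure=time_once, repeats=3):
    """Try every candidate and return (best_time, (n1, n2, n3, nb))."""
    best = None
    for (n1, n2, n3), nb in product(power_of_two_splits(n), block_sizes):
        # Keep the fastest of several runs to reduce timing noise.
        t = min(measure(kernel, n1, n2, n3, nb) for _ in range(repeats))
        if best is None or t < best[0]:
            best = (t, (n1, n2, n3, nb))
    return best
```

A call such as autotune(2**12, [4, 8, 16], kernel=my_fft) would time every candidate, where my_fft stands for whatever routine is being tuned. The measure argument exists so that the timing function can be replaced, for example by a cost model or a deterministic stub when testing the search itself.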