D3 Center, The University of Osaka » How to use Optimization / SIMD (SQUID General Purpose CPU nodes)

With the Intel compiler available on SQUID, you can make your program more efficient in two ways: "optimization" and "SIMD". In most cases you can see a benefit just by specifying compiler options, so please try the "Recommended compiler options" on this page first. You can further improve efficiency by adding your own directive lines to the program.

Optimization
Optimization improves the efficiency of processing with the aim of reducing execution time and code size. As a drawback, the calculation results may change (for example, because floating-point operations can be reordered).
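As an illustration of why results can change, here is a minimal C sketch (the function names sum_left and sum_right are hypothetical and not part of any SQUID or Intel API). It adds the same three single-precision values in two different orders. An optimizer that is allowed to reorder floating-point operations may effectively turn one form into the other.

```c
/* Sum three values left to right: (a + b) + c.
   volatile forces the intermediate to be rounded to float. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;
    return t + c;
}

/* Sum the same three values right to left: a + (b + c). */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;
    return a + t;
}
```

Calling both functions with (1e8f, -1e8f, 1.0f) gives 1.0 for sum_left and 0.0 for sum_right, because -1e8f + 1.0f rounds back to -1e8f in single precision. This is why it is worth checking results after raising the optimization level.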

SIMD (Vectorization)
The compiler applies vector instructions to operations on regularly ordered array data (called vector data) that are repeated in a loop.
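A minimal C sketch of the kind of loop this describes is shown below (vec_add is a hypothetical name used only for illustration). Every iteration touches consecutive array elements and does not depend on any other iteration, so a compiler can usually process several elements with one vector instruction.

```c
#include <stddef.h>

/* Element-wise addition over contiguous arrays.
   restrict states that the three arrays do not overlap,
   which removes one common obstacle to vectorization. */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Whether a given loop was actually vectorized can be confirmed from the optimization report described under "Basic use".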

Basic use

$ module load BaseCPU
$ ifort -O0 -qopt-report-phase=all -qopt-report=2 source_file

# -O0: Optimization level
# -qopt-report-phase=all -qopt-report=2: Compiler messages (optimization report)

Optimization Level (-O option)

-O0	Disables all optimizations.
-O1	Enables optimizations for speed and disables some optimizations that increase code size and affect speed.
-O2	Enables optimizations for speed. This is the generally recommended optimization level. Vectorization is enabled at O2 and higher levels.
-O3	Performs O2 optimizations and enables more aggressive loop transformations such as Fusion, Block-Unroll-and-Jam, and collapsing IF statements. The optimizations may slow down code in some cases compared to O2 optimizations.

Optimization report output (-qopt-report)

The report is written to a file named (source file name).optrpt.

Optimizer phases (-qopt-report-phase)

-qopt-report-phase=all	All optimizer phases. This is the default if you do not specify list.
-qopt-report-phase=vec	The phase for vectorization
-qopt-report-phase=loop	The phase for loop nest optimization
-qopt-report-phase=par	The phase for auto-parallelization

Level of detail in the report (-qopt-report)

-qopt-report=0	Does not output a report.
-qopt-report=1	Reports the loops that were vectorized.
-qopt-report=2	Level 1 content, plus the loops that were not vectorized and a brief reason for each.
-qopt-report=3	Level 2 content, plus summary information for loops that were not vectorized. (Default)
-qopt-report=4	Level 3 content, plus detailed information about vectorized and non-vectorized loops.
-qopt-report=5	Level 4 content, plus detailed information on dependencies found or assumed.

For more information about the optimization report output, please see the following page.
-qopt-report manual

Recommended compiler options (all are written based on the Fortran language compiler)

For those who are running the program for the first time

$ ifort -O2 source_file

For those who want to speed up a program that terminates normally

$ ifort -O3 source_file

For those who want to speed up a program on the general-purpose CPU node group

After confirming that the calculation results remain the same, changing -O2 to -O3 in the commands below may give a further speed-up.

Use AVX-512 instructions with restricted use of zmm registers

$ ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=low source_file

Use AVX-512 instructions and use zmm registers without restriction

$ ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=high source_file

Use AVX2 instructions instead of AVX-512 instructions, tuned for Ice Lake processors

$ ifort -O2 -xCORE-AVX2 -mtune=icelake-client source_file

For those who want to debug

$ ifort -g -traceback source_file # (Outputs debugging and traceback information)

$ ifort -check uninit -check bounds source_file # (Checks at run time for use of uninitialized variables and out-of-bounds array references; this slows down execution)

About directives

List of directives

!DEC$ IVDEP  (Fortran)
#pragma ivdep  (C)

Hints to the compiler that there are no dependencies between loop iterations, so assumed dependencies are ignored. If the compiler judges that vectorization would bring no benefit, the loop is still not vectorized.

!DEC$ VECTOR ALWAYS  (Fortran)
#pragma vector always  (C)

Tells the compiler to ignore its efficiency heuristics and vectorize the loop whenever it is possible to do so.

!DEC$ VECTOR NONTEMPORAL  (Fortran)
#pragma vector nontemporal  (C)

Hints to the compiler to use streaming (non-temporal) stores.

!DEC$ VECTOR [UN]ALIGNED  (Fortran)
#pragma vector [un]aligned  (C)

States that the data accessed in the loop is [not] aligned.
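The following C sketch shows one way the aligned form might be used. It assumes that 64-byte alignment is what is wanted (64 bytes is the width of one zmm register) and that the data is allocated with the C11 function aligned_alloc. The helper names alloc_aligned64, is_aligned64 and scale are hypothetical. The pragma is Intel-specific, and other compilers normally ignore it, possibly with a warning.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n floats starting on a 64-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the memory with free(). */
float *alloc_aligned64(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, rounded);
}

/* Returns 1 if p lies on a 64-byte boundary, otherwise 0. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}

/* Multiply every element by s. The pragma promises the compiler that
   x is aligned; passing an unaligned pointer would break that promise. */
void scale(float *x, size_t n, float s)
{
#pragma vector aligned
    for (size_t i = 0; i < n; i++) {
        x[i] *= s;
    }
}
```

Only state that data is aligned when the program guarantees it, for example by allocating every array passed to scale with alloc_aligned64.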

!DEC$ NOVECTOR  (Fortran)
#pragma novector  (C)

Disables vectorization of the target loop.

!DEC$ DISTRIBUTE POINT  (Fortran)
#pragma distribute_point  (C)

Hints that the loop should be split (distributed) at this position.

!DEC$ LOOP COUNT ()  (Fortran)
#pragma loop_count ()  (C)

Gives a hint about the expected number of iterations.

!DEC$ SIMD  (Fortran)
#pragma simd  (C)

Forces vectorization of the loop.

Example

Even if the programmer knows that a loop can be vectorized, the compiler may not be able to determine this.

For example,

      subroutine add(A, N, X)
      integer N, X
      real A(N)
      DO I = X + 1, N
        A(I) = A(I) + A(I - X)
      ENDDO
      end

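Because the value of X is unknown at compile time, the compiler assumes a "possible dependency" between iterations, which prevents the DO loop from being vectorized. If you know that vectorization is safe (for example, because X is always at least as large as the vector length), you can vectorize the loop by inserting a SIMD directive line as a hint to the compiler, as shown below.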

      subroutine add(A, N, X)
      integer N, X
      real A(N)
!DIR$ SIMD
      DO I = X + 1, N
        A(I) = A(I) + A(I - X)
      ENDDO
      end
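To make the "possible dependency" concrete, the C sketch below runs the same update in two ways. The names add_scalar and add_all_at_once are hypothetical. The second function is only a simplified model of vector execution, in which every right-hand side is read before anything is written, as if the vector were as long as the whole array.

```c
#include <stdlib.h>
#include <string.h>

/* Scalar order, equivalent to the DO loop above with 0-based indices:
   a[i] = a[i] + a[i - x] for i = x .. n-1. */
void add_scalar(float *a, int n, int x)
{
    for (int i = x; i < n; i++) {
        a[i] = a[i] + a[i - x];
    }
}

/* Simplified model of vectorized execution: all inputs are taken from a
   copy of the original array. Returns 0 on success, -1 if allocation fails. */
int add_all_at_once(float *a, int n, int x)
{
    if (n <= 0) {
        return 0;
    }
    float *old = malloc((size_t)n * sizeof *old);
    if (old == NULL) {
        return -1;
    }
    memcpy(old, a, (size_t)n * sizeof *old);
    for (int i = x; i < n; i++) {
        a[i] = old[i] + old[i - x];
    }
    free(old);
    return 0;
}
```

Starting from {1, 1, 1, 1}, x = 1 gives {1, 2, 3, 4} in scalar order but {1, 2, 2, 2} in the all-at-once model, so the loop cannot be vectorized safely. With x = 2 both give {1, 1, 2, 2}. The compiler cannot tell which case applies when X is unknown, which is why the directive should be added only when the programmer is sure that X is large enough.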

By specifying -qopt-report-phase=vec and -qopt-report=5 at compile time, you can output the full vectorization report, including the data dependencies that inhibit vectorization. We recommend inserting directive lines based on this report.

Factors that inhibit vectorization

Factors that inhibit vectorization include the following.

Loop unrolling

Loop unrolling is one of the speedup techniques. Unrolling the body of a loop increases the amount of code that can be executed in parallel, which increases speed. However, the code size grows as the loop is unrolled.

      do j = 1, n
        do k = 1, n
          do i = 1, n
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
          enddo
        enddo
      enddo

↓

! k loop unrolled four times (assumes n is a multiple of 4)
      do j = 1, n
        do k = 1, n, 4
          do i = 1, n
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
     &                      + b(i,k+1) * c(k+1,j)
     &                      + b(i,k+2) * c(k+2,j)
     &                      + b(i,k+3) * c(k+3,j)
          enddo
        enddo
      enddo

In the unrolled loop, the load and store instructions for array a are reduced to one quarter, so memory accesses decrease and a speedup can be expected.
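The same transformation can be sketched in C as follows. The function names matmul_add and matmul_add_unroll4 are hypothetical, the matrices are assumed to be square and stored row-major in flat arrays, and the sketch assumes n is a multiple of 4, so the remainder loop is omitted.

```c
#include <assert.h>

/* Reference version: a(i,j) += b(i,k) * c(k,j) for every i, j, k. */
void matmul_add(double *a, const double *b, const double *c, int n)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                a[i * n + j] += b[i * n + k] * c[k * n + j];
}

/* Same computation with the k loop unrolled four times:
   a(i,j) is loaded and stored once for every four values of k. */
void matmul_add_unroll4(double *a, const double *b, const double *c, int n)
{
    assert(n % 4 == 0);   /* no remainder loop in this sketch */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k += 4)
            for (int i = 0; i < n; i++)
                a[i * n + j] = a[i * n + j]
                             + b[i * n + k]     * c[k * n + j]
                             + b[i * n + k + 1] * c[(k + 1) * n + j]
                             + b[i * n + k + 2] * c[(k + 2) * n + j]
                             + b[i * n + k + 3] * c[(k + 3) * n + j];
}
```

Both functions add the terms in the same order, so they are expected to produce the same values. In practice the compiler often performs this kind of unrolling itself at higher optimization levels, so hand unrolling is worth trying only after checking the optimization report.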

Load/Store instructions

To perform calculations, a program reads data from memory into storage locations called registers.
Load instruction: reads data from memory into a register.
Store instruction: writes data from a register to memory.
In the above example, because the number of iterations of the k loop is reduced, array a is accessed fewer times, which reduces the number of load and store instructions.
