With the Intel compiler available on SQUID, you can make your program more efficient in two ways: "optimization" and "SIMD" (vectorization). In most cases you can get a benefit just by specifying compiler options, so please try the "Recommended compiler options" on this page first. You can improve efficiency further by adding directive lines to your program.
Optimization
Optimization improves processing efficiency with the aim of reducing execution time and code size. As a side effect, the calculation results may change.
SIMD (Vectorization)
The compiler applies vector instructions to operations on regularly ordered array data (called vector data) that are repeated in a loop.
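As an illustration of the kind of loop this applies to, here is a minimal sketch in C. The function name `saxpy` and its parameters are hypothetical and chosen only for this example. The `restrict` qualifiers state that `x` and `y` do not overlap, which is an assumption the caller must honor.

```c
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] over contiguous arrays.
 * Every iteration performs the same operation on consecutive elements
 * and no iteration depends on another, so the loop is a typical
 * candidate for vector instructions. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the compiler actually vectorizes such a loop depends on the options used; the optimization report described on this page is the way to confirm it.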
Basic use
$ module load BaseCPU/2021
$ ifort -O0 -qopt-report-phase=all -qopt-report=2 source_file
# -O0: optimization level
# -qopt-report-phase, -qopt-report: compiler messages (optimization report)
Optimization Level (-O option)
-O0
Disables all optimizations.
-O1
Enables optimizations for speed and disables some optimizations that increase code size and affect speed.
-O2
Enables optimizations for speed. This is the generally recommended optimization level.
Vectorization is enabled at O2 and higher levels.
-O3
Performs O2 optimizations and enables more aggressive loop transformations such as Fusion, Block-Unroll-and-Jam, and collapsing IF statements. The optimizations may slow down code in some cases compared to O2 optimizations.
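To make the term "Fusion" concrete, the following sketch in C shows by hand what loop fusion means. The function names are hypothetical, and the code assumes that `x` and `y` do not overlap. It only illustrates the transformation; whether the compiler applies it to a given program is decided by the compiler.

```c
#include <stddef.h>

/* Two separate loops over the same range: x is traversed twice. */
void scale_then_shift(size_t n, double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = 2.0 * x[i];
    }
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] + 1.0;
    }
}

/* Fused form: one loop, so x[i] is reused right after it is computed.
 * Equivalent to the version above as long as x and y do not overlap. */
void scale_then_shift_fused(size_t n, double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = 2.0 * x[i];
        y[i] = x[i] + 1.0;
    }
}
```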
Optimization report output (-qopt-report)
The compiler writes a report describing the optimizations and vectorization it performed and the factors that inhibited them. The report is written to a file named after the source file with the extension .optrpt.
Optimizer phases (-qopt-report-phase)
Specifies which phases are included in the report. If you do not specify anything, all phases are reported.
-qopt-report-phase=all
All optimizer phases. This is the default if you do not specify a list.
-qopt-report-phase=vec
The phase for vectorization
-qopt-report-phase=loop
The phase for loop nest optimization
-qopt-report-phase=par
The phase for auto-parallelization
There are several other phases.
You can specify multiple phases by separating them with commas (example: -qopt-report-phase=par,vec).
Level of detail in the report (-qopt-report)
Specifies how detailed the report is. The upper limit of the level that can be specified with -qopt-report depends on the phase. For example, the range is 0 to 5 for the vec phase and 0 to 2 for the loop phase. If you specify a level above the upper limit, the upper limit for each phase is used automatically. The levels below describe the case of -qopt-report-phase=vec.
-qopt-report=0
Does not output a report.
-qopt-report=1
Reports the loops that were vectorized.
-qopt-report=2
Outputs the contents of level 1 plus the loops that were not vectorized and a brief reason for each. (Default)
-qopt-report=3
Level 2 content plus summary information for loops that were not vectorized.
-qopt-report=4
Level 3 content plus detailed information about vectorized and non-vectorized loops.
-qopt-report=5
Level 4 content plus detailed information on dependencies found or assumed.
For more information about the optimization report output, please see the following page.
-qopt-report manual
Recommended compiler options (all examples use the Fortran compiler)
For those who are running the program for the first time
- Performs optimization at the default level.
$ ifort -O2 source_file
For those who want to speed up a program that already runs to completion correctly
- Applies more aggressive optimization, which usually speeds up the program. Because the calculation results may change, check that they match the results obtained with -O2. In some cases the program may become slower.
$ ifort -O3 source_file
For those who want to speed up programs on the general-purpose CPU nodes
After confirming that the calculation result remains the same, changing -O2 to -O3 may result in further speed-up.
Uses AVX-512 instructions with restricted use of zmm registers
$ ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=low source_file
Uses AVX-512 instructions and uses zmm registers without restriction
$ ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=high source_file
Uses AVX2 instructions instead of AVX-512 instructions, tuned for Ice Lake processors
$ ifort -O2 -xCORE-AVX2 -mtune=icelake-client source_file
For those who want to debug
$ ifort -g -traceback source_file  # Outputs debugging and traceback information
$ ifort -check uninit -check bounds source_file  # Checks at run time for uninitialized variables and out-of-bounds array references; slows down execution
About directives
You can embed in a program both "hints" that help the compiler optimize or vectorize and "instructions" that force optimization or vectorization. Because forcing directives take effect even on loops whose dependencies would normally prevent optimization or vectorization, the results may change. Add them at your own responsibility. Note also that the syntax differs slightly between Fortran and C.
List of directives
!DEC$ IVDEP (Fortran)
#pragma ivdep (C)
Hints to the compiler that there are no dependencies. If the compiler judges that vectorization would bring no benefit, the loop is still not vectorized.
!DEC$ vector always (Fortran)
#pragma vector always (C)
Ignores the efficiency heuristic.
!DEC$ VECTOR NONTEMPORAL (Fortran)
#pragma vector nontemporal (C)
Hints to the compiler to use streaming stores.
!DEC$ VECTOR [UN]ALIGNED (Fortran)
#pragma vector [un]aligned (C)
States that the data is [not] aligned.
!DEC$ NOVECTOR (Fortran)
#pragma novector (C)
Disables vectorization of the target loop.
!DEC$ DISTRIBUTE POINT (Fortran)
#pragma distribute point (C)
Hints that the loop should be split at this position.
!DEC$ LOOP COUNT () (Fortran)
#pragma loop count () (C)
Gives a hint of the number of iterations to be expected.
!DEC$ SIMD (Fortran)
#pragma simd (C)
Forces vectorization.
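As a sketch of how one of these directives is placed in C code, the example below puts `#pragma ivdep` directly before a loop that updates an array through an index list. The function name `scatter_add` and its parameters are hypothetical. The hint is only valid under the assumption that `idx` contains no repeated values; the `__INTEL_COMPILER` guard is also an assumption, used here so that other compilers simply skip the Intel-specific pragma.

```c
/* Hypothetical example: a[idx[i]] += b[i].
 * The compiler cannot know whether two iterations write the same
 * element of a, so it assumes a possible dependency. The programmer
 * asserts with ivdep that idx holds no duplicates. */
void scatter_add(int n, float *a, const float *b, const int *idx)
{
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (int i = 0; i < n; ++i) {
        a[idx[i]] += b[i];
    }
}
```

If `idx` could contain duplicates, the hint would be wrong and the results could change, which is why such directives are added at the user's responsibility.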
Example
Even when the programmer knows that a loop can be vectorized, the compiler may not be able to determine this.
For example,
      subroutine add(A, N, X)
      integer N, X
      real A(N)
      DO I = X + 1, N
        A(I) = A(I) + A(I - X)
      ENDDO
      end
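The compiler assumes a possible dependency because the value of X is unknown at compile time, and this prevents the DO loop from being vectorized. If you know that no harmful dependency exists (for example, X is always at least as large as the vector length), you can make the compiler vectorize the loop by inserting a SIMD directive line, as shown below. If X can be smaller than that, forcing vectorization changes the results.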
      subroutine add(A, N, X)
      integer N, X
      real A(N)
!DIR$ SIMD
      DO I = X + 1, N
        A(I) = A(I) + A(I - X)
      ENDDO
      end
By specifying -qopt-report-phase=vec and -qopt-report=5 at compile time, you can output the full vectorization report, including the data that inhibit vectorization. We recommend inserting directive lines based on this report.
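The dependency in the example above can be observed directly with a small scalar sketch in C. The function name `add_shifted` is hypothetical, and the loop is a 0-based counterpart of the Fortran subroutine, not a vectorized implementation.

```c
/* 0-based counterpart of the Fortran loop: a[i] reads a[i - x].
 * When x is small, a[i - x] was written by an earlier iteration,
 * so the iterations cannot simply be executed side by side. */
void add_shifted(float *a, int n, int x)
{
    for (int i = x; i < n; ++i) {
        a[i] = a[i] + a[i - x];
    }
}
```

Starting from four elements all equal to 1, a shift of 1 yields 1, 2, 3, 4 because each iteration consumes the result of the previous one, while a shift of 2 yields 1, 1, 2, 2 because iterations only read elements that the loop has not modified. The compiler has to assume the first case unless it is told otherwise.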
Typical factors that inhibit vectorization:
- Data dependencies
- Function and subroutine calls
- Access to structures
- Conditional branches
- Loops whose end condition is unknown
- Pointer access
do j = 1, n
  do k = 1, n
    do i = 1, n
      a(i,j) = a(i,j) + b(i,k) * c(k,j)
    enddo
  enddo
enddo
Unrolling the k loop by a factor of four gives:
do j = 1, n
  do k = 1, n, 4
    do i = 1, n
      a(i,j) = a(i,j) + b(i,k)   * c(k,j)   &
                      + b(i,k+1) * c(k+1,j) &
                      + b(i,k+2) * c(k+2,j) &
                      + b(i,k+3) * c(k+3,j)
    enddo
  enddo
enddo
Loads and stores of array a are reduced to one quarter, which reduces memory accesses, so a speedup can be expected. The unrolled form as written assumes that n is a multiple of 4.
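The same transformation can be sketched in C and checked against the original loop nest. The function names and the `IDX` macro are hypothetical; the macro emulates Fortran's column-major layout with 0-based indices. The remainder loop is an addition that is not part of the example above and handles sizes that are not a multiple of four.

```c
#include <stddef.h>

/* Column-major index, matching the layout of a(i,j) in Fortran. */
#define IDX(i, j, n) ((i) + (size_t)(j) * (n))

/* Reference: a = a + b * c with the j, k, i loop order. */
void matmul_ref(size_t n, double *a, const double *b, const double *c)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t k = 0; k < n; ++k)
            for (size_t i = 0; i < n; ++i)
                a[IDX(i, j, n)] += b[IDX(i, k, n)] * c[IDX(k, j, n)];
}

/* Same computation with the k loop unrolled by four. */
void matmul_unroll4(size_t n, double *a, const double *b, const double *c)
{
    size_t k4 = n - n % 4; /* largest multiple of 4 not exceeding n */
    for (size_t j = 0; j < n; ++j) {
        for (size_t k = 0; k < k4; k += 4)
            for (size_t i = 0; i < n; ++i)
                /* a(i,j) is loaded and stored once per four k values */
                a[IDX(i, j, n)] += b[IDX(i, k, n)]     * c[IDX(k, j, n)]
                                 + b[IDX(i, k + 1, n)] * c[IDX(k + 1, j, n)]
                                 + b[IDX(i, k + 2, n)] * c[IDX(k + 2, j, n)]
                                 + b[IDX(i, k + 3, n)] * c[IDX(k + 3, j, n)];
        /* Remainder loop for the last n % 4 values of k */
        for (size_t k = k4; k < n; ++k)
            for (size_t i = 0; i < n; ++i)
                a[IDX(i, j, n)] += b[IDX(i, k, n)] * c[IDX(k, j, n)];
    }
}
```

Unrolling changes the order in which the terms are summed, so with general floating-point data the two versions can differ in the last bits. This is one concrete way in which the calculation results may change.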