With the Intel compiler available on OCTOPUS, you can make your programs faster using two methods: "optimization" and "SIMD" (vectorization). In most cases you can get some benefit just by specifying compile options, so please try the "Recommended compiler options" below first. You can further improve performance by adding compiler directives directly within your source code.
Optimization
This function improves the efficiency of processing with the aim of reducing execution time and code size. As a side effect, the calculation results may change.
SIMD (Vectorization)
The compiler applies vector instructions to loops that repeat the same operation on regularly ordered array data (called vector data).
Basic use
$ module load BaseCPU
$ ifx -O2 -qopt-report-phase=all -qopt-report=2 source_file
# -O2: Optimization Level
# -qopt-report-phase=all -qopt-report=2: Compiler Message
Optimization Level (-O option)
-O0
Disables all optimizations and vectorization.
-O1
Disables optimizations from -O2 that increase code size. Disables vectorization.
-O2
Enables optimizations such as vectorization, inlining, loop unrolling, constant propagation, and dead code elimination.
-O3
In addition to -O2, enables optimizations such as loop fusion, unroll-and-jam, and other loop transformations.
-Ofast
In addition to -O3, enables interprocedural optimization and optimizations that affect calculation results.
* Note that enabling optimizations may change computation results, so always verify the computation results.
* There is no guarantee that higher optimization levels will result in shorter execution times, so measure the actual execution time.
* Higher optimization levels tend to increase compilation time.
Optimization report output (-qopt-report)
The compiler outputs a report describing the optimizations and vectorization it performed and the factors that inhibited them. The report is written to a file named [source file name].optrpt.
You can also output the optimization report to standard output by specifying the -qopt-report-stdout option.
Optimization report output phase specification (-qopt-report-phase)
Specifies the phases to be included in the report. If you do not specify anything, all phases are reported.
-qopt-report-phase=all
All optimizer phases. This is the default if you do not specify a list.
-qopt-report-phase=cg
The phase for code generation
-qopt-report-phase=ipo
The phase for Interprocedural Optimization
-qopt-report-phase=loop
The phase for loop nest optimization
-qopt-report-phase=openmp
The phase for OpenMP
-qopt-report-phase=pgo
The phase for Profile Guided optimization
-qopt-report-phase=vec
The phase for vectorization
There are several other phases.
You can specify multiple phases by separating them with commas. (Example: -qopt-report-phase=openmp,vec )
Optimization report output level specification (-qopt-report)
Specifies the level of detail of the report. The upper limit of the level that can be specified by -qopt-report depends on the phase. For example, it is 0–5 for the vec phase and 0–2 for the loop phase. If you specify a level that exceeds the upper limit, it is automatically set to the upper limit of each phase.
The following is an example of -qopt-report-phase=vec.
-qopt-report=0
Does not output a report.
-qopt-report=1
Reports the loops that were vectorized.
-qopt-report=2
Level 1 content + the loops that were not vectorized, with a brief reason for each.
-qopt-report=3
Level 2 content + summary information for loops that were not vectorized. (Default)
-qopt-report=4
Level 3 content + detailed information about vectorized and unvectorized loops.
-qopt-report=5
Level 4 content + detailed information on dependencies found or assumed.
For more information about the optimization report output, please see the following page.
qopt-report, Qopt-report
Recommended compiler options (all are written based on the Fortran language compiler)
For those who are running the program for the first time
- Optimization is performed at the specified level (-O2).
$ ifx -O2 source_file
For those who want to speed up a program that terminates normally
- This may speed up the program. Because the calculation result may change, check that the result matches the one obtained with the -O2 option. In some cases, the program may become slower.
$ ifx -O3 source_file
For those who want to perform speed-up for general-purpose CPU nodes
After confirming that the calculation result remains the same, changing -O2 to -O3 may result in further speed-up.
Use AVX-512 instructions and use zmm registers minimally
$ ifx -O2 -xCORE-AVX512 -qopt-zmm-usage=low source_file
Use AVX-512 instructions and use zmm registers without restrictions
$ ifx -O2 -xCORE-AVX512 -qopt-zmm-usage=high source_file
Use AVX-512 instructions and optimize for Granite Rapids processors
$ ifx -O2 -xGRANITERAPIDS source_file
For those who want to debug
$ ifx -g -traceback source_file # Outputs debugging and traceback information
$ ifx -check uninit -check bounds source_file # Checks for uninitialized variables and out-of-bounds array references at runtime (slows down execution)
About using compiler directives
You can embed optimization/vectorization "hints" to the compiler, as well as "instructions" that force optimization/vectorization, in a program. Because these directives also take effect on loops with dependencies that could not otherwise be optimized or vectorized, the result may change. Add them at your own responsibility. Note that the notation differs slightly between Fortran and C.
List of directives
!DIR$ IVDEP (Fortran)
#pragma ivdep (C)
Hints to the compiler that there are no dependencies. If the compiler judges that vectorization would not be effective, the loop is not vectorized.
!DIR$ VECTOR ALWAYS (Fortran)
#pragma vector always (C)
Ignores the compiler's performance predictions and forces vectorization whenever possible.
!DIR$ VECTOR NONTEMPORAL (Fortran)
#pragma vector nontemporal (C)
Hints to the compiler to use streaming stores.
!DIR$ VECTOR [UN]ALIGNED (Fortran)
#pragma vector [un]aligned (C)
States that all data in the target loop is [not] aligned.
!DIR$ NOVECTOR (Fortran)
#pragma novector (C)
Disables vectorization of the target loop.
!DIR$ DISTRIBUTE POINT (Fortran)
#pragma distribute_point (C)
Hints that the loop should be split at this position.
!DIR$ LOOP COUNT (Number of loops) (Fortran)
#pragma loop_count (Number of loops) (C)
Gives a hint about the expected number of loop iterations.
Example
Even if the programmer knows that it can be vectorized, the compiler may not be able to determine it.
For example,
    subroutine add(A, N, X)
      integer N, X
      real A(N)
      do I = X + 1, N
        A(I) = A(I) + A(I - X)
      end do
    end
The compiler assumes a "possible dependency" because the value of X is unknown, which prevents the DO loop from being vectorized. Inserting an IVDEP directive line gives the compiler a hint that allows vectorization, as shown below.
    subroutine add(A, N, X)
      integer N, X
      real A(N)
    !DIR$ IVDEP
      do I = X + 1, N
        A(I) = A(I) + A(I - X)
      end do
    end
By specifying -qopt-report-phase=vec and -qopt-report=5 at compile time, you can output the full vectorization report, including the factors that inhibit vectorization. We recommend inserting directive lines based on this report.
Typical factors that inhibit vectorization:
- Dependency relation of data
- Call of function and subroutine
- Access to structure
- Conditional branch
- Unknown end condition of loop
- Pointer access
Original loop:

    do j = 1, n
      do k = 1, n
        do i = 1, n
          a(i,j) = a(i,j) + b(i,k) * c(k,j)
        end do
      end do
    end do

The same loop with the k loop unrolled by four (this form assumes that n is a multiple of 4):

    do j = 1, n
      do k = 1, n, 4
        do i = 1, n
          a(i,j) = a(i,j) + b(i,k  ) * c(k  ,j) &
                          + b(i,k+1) * c(k+1,j) &
                          + b(i,k+2) * c(k+2,j) &
                          + b(i,k+3) * c(k+3,j)
        end do
      end do
    end do
Load and store instructions for array a are reduced by a factor of four, and memory accesses are reduced, so speedup is expected.
Load instruction: Refers to reading data from memory into registers
Store instruction: Refers to writing data in registers into memory
In the above example, because the number of iterations of the k loop is reduced, array a is accessed fewer times, which reduces the number of load and store instructions.

