D3 Center, The University of Osaka: How to use Optimization / SIMD (OCTOPUS CPU nodes)

With the Intel compiler available on OCTOPUS, you can make your programs faster using two methods: "optimization" and "SIMD" (vectorization). In most cases compile options alone give some benefit, so please try the "Recommended compiler options" on this page first. You can improve performance further by adding compiler directives directly to your source code.

Optimization
The compiler transforms the program to make processing more efficient, with the aim of reducing execution time and code size. As a side effect, the calculation results may change.

SIMD (Vectorization)
The compiler applies vector instructions to operations that are repeated in a loop over regularly ordered array data (called vector data).
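
As a minimal sketch of the kind of loop this description refers to, the example below applies the same operation to consecutive array elements, with no dependency between iterations. The routine name scale_add and the array size are illustrative assumptions, not taken from the OCTOPUS documentation. Whether the compiler actually vectorizes the loop should be confirmed in the optimization report.

```fortran
! Hypothetical example routine: y = a*x + y, element by element.
subroutine scale_add(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i

  ! Each iteration reads and writes only element i, with unit stride,
  ! so several elements can be processed by one vector instruction.
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
end subroutine scale_add

program simd_candidate
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n)
  integer :: i

  do i = 1, n
    x(i) = real(i)
  end do
  y = 1.0

  call scale_add(n, 2.0, x, y)
  print *, 'y(1) =', y(1), '  y(n) =', y(n)
end program simd_candidate
```

A loop in which iteration i needs the result of iteration i-1 does not fit this pattern and is generally left scalar.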

Basic use

$ module load BaseCPU
$ ifx -O2 -qopt-report-phase=all -qopt-report=2 source_file

# -O2: Optimization level
# -qopt-report-phase=all -qopt-report=2: Compiler messages (optimization report)

Optimization Level (-O option)

-O0	Disables all optimizations and vectorization.
-O1	Disables optimizations from -O2 that increase code size. Disables vectorization.
-O2	Enables optimizations such as vectorization, inlining, loop unrolling, constant propagation, and dead code elimination.
-O3	In addition to -O2, enables optimizations such as loop fusion, unroll-and-jam, and other loop transformations.
-Ofast	In addition to -O3, enables interprocedural optimization and optimizations that affect calculation results.

* Note that enabling optimizations may change computation results, so always verify the computation results.
* There is no guarantee that higher optimization levels will result in shorter execution times, so measure the actual execution time.
* Higher optimization levels tend to increase compilation time.
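
The first note can be illustrated with a small sketch. Floating-point addition is not associative, so an optimization that reorders a sum (for example a vectorized reduction) can change the result. The module and function names below are hypothetical, and the "four partial sums" routine only imitates by hand what a vectorizing compiler might do. To see the effect of ordering in isolation, compile without aggressive floating-point optimization, since the compiler may otherwise reorder the first loop as well.

```fortran
module sum_orders
  implicit none
contains

  ! Adds the elements strictly left to right, as the source is written.
  function sum_sequential(x) result(s)
    real, intent(in) :: x(:)
    real :: s
    integer :: i
    s = 0.0
    do i = 1, size(x)
      s = s + x(i)
    end do
  end function sum_sequential

  ! Keeps four interleaved partial sums and combines them at the end,
  ! similar in spirit to a vectorized reduction.
  ! Assumes size(x) is a multiple of 4.
  function sum_four_lanes(x) result(s)
    real, intent(in) :: x(:)
    real :: s, lane(4)
    integer :: i
    lane = 0.0
    do i = 1, size(x), 4
      lane = lane + x(i:i+3)
    end do
    s = ((lane(1) + lane(2)) + lane(3)) + lane(4)
  end function sum_four_lanes

end module sum_orders

program summation_order
  use sum_orders
  implicit none
  integer, parameter :: n = 4000
  real :: x(n)

  x = 1.0
  x(1) = 1.0e8     ! one large value followed by many small ones

  print *, 'left to right    :', sum_sequential(x)
  print *, 'four partial sums:', sum_four_lanes(x)
end program summation_order
```

In single precision, adding 1.0 to 1.0e8 is rounded away, so the left-to-right sum loses the small terms, while three of the four partial sums accumulate them exactly. Both answers come from the same source data, which is why results should be compared after changing optimization options.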

Optimization report output (-qopt-report)

The report is written to the file [source file name].optrpt.

Optimization report output phase specification (-qopt-report-phase)

-qopt-report-phase=all	All optimizer phases. This is the default if you do not specify a phase.
-qopt-report-phase=cg	The phase for code generation
-qopt-report-phase=ipo	The phase for Interprocedural Optimization
-qopt-report-phase=loop	The phase for loop nest optimization
-qopt-report-phase=openmp	The phase for OpenMP
-qopt-report-phase=pgo	The phase for Profile Guided optimization
-qopt-report-phase=vec	The phase for vectorization

Optimization report output level specification (-qopt-report)

-qopt-report=0	Does not output a report.
-qopt-report=1	Outputs the loops that were vectorized.
-qopt-report=2	Level 1 content plus the loops that were not vectorized, each with a brief reason.
-qopt-report=3	Level 2 content plus summary information for loops that were not vectorized. (Default)
-qopt-report=4	Level 3 content plus detailed information about vectorized and non-vectorized loops.
-qopt-report=5	Level 4 content plus detailed information on dependencies found or assumed.

For more information about the optimization report output, please see the following page.
qopt-report, Qopt-report

Recommended compiler options (all are written based on the Fortran language compiler)

For those who are running the program for the first time

$ ifx -O2 source_file

For those who want to speed up a program that terminates normally

$ ifx -O3 source_file

For those who want to perform speed-up for general-purpose CPU nodes

After confirming that the calculation result remains the same, changing -O2 to -O3 may result in further speed-up.

Use AVX-512 instructions and use zmm registers minimally

$ ifx -O2 -xCORE-AVX512 -qopt-zmm-usage=low source_file

Use AVX-512 instructions and use zmm registers without restrictions

$ ifx -O2 -xCORE-AVX512 -qopt-zmm-usage=high source_file

Use AVX-512 instructions and optimize for Granite Rapids processors

$ ifx -O2 -xGRANITERAPIDS source_file

For those who want to debug

$ ifx -g -traceback source_file # Outputs debugging and traceback information.

$ ifx -check uninit -check bounds source_file # Checks at run time for use of uninitialized variables and out-of-bounds array references; this slows execution.

About using compiler directives

List of directives

!DIR$ IVDEP (Fortran)
#pragma ivdep (C)

Hints to the compiler that the loop has no dependency that prevents vectorization. If the compiler judges that vectorization would bring no benefit, the loop is still not vectorized.

!DIR$ VECTOR ALWAYS (Fortran)
#pragma vector always (C)

Ignores the compiler's performance estimate and forces vectorization whenever it is possible.

!DIR$ VECTOR NONTEMPORAL (Fortran)
#pragma vector nontemporal (C)

Hints to the compiler to use streaming (nontemporal) stores.

!DIR$ VECTOR [UN]ALIGNED (Fortran)
#pragma vector [un]aligned (C)

Asserts that all data accessed in the target loop is aligned (ALIGNED) or not aligned (UNALIGNED).

!DIR$ NOVECTOR (Fortran)
#pragma novector (C)

Disables vectorization of the target loop.

!DIR$ DISTRIBUTE POINT (Fortran)
#pragma distribute point (C)

Hints that the loop should be split (distributed) at this position.

!DIR$ LOOP COUNT (number of iterations) (Fortran)
#pragma loop count (number of iterations) (C)

Gives a hint of the expected number of loop iterations.
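
The following sketch shows where four of the directives listed above are placed in Fortran source. The module, the routine names, and the loop count value of 1000 are hypothetical choices made for illustration. The directives are hints, so whether each one changes the generated code depends on the compiler and its version, and is best checked in the optimization report. Compilers that do not recognize !DIR$ lines treat them as ordinary comments.

```fortran
module directive_demo
  implicit none
contains

  ! VECTOR ALWAYS: the stride-2 access may look unprofitable to the
  ! compiler's cost model; the directive asks it to vectorize anyway.
  subroutine scale_even(a, n, s)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in) :: s
    integer :: i
    !DIR$ VECTOR ALWAYS
    do i = 2, n, 2
      a(i) = a(i) * s
    end do
  end subroutine scale_even

  ! NOVECTOR: keep this loop scalar, e.g. when it is known to be short.
  subroutine add_short(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in) :: b(n)
    integer :: i
    !DIR$ NOVECTOR
    do i = 1, n
      a(i) = a(i) + b(i)
    end do
  end subroutine add_short

  ! LOOP COUNT: n is unknown at compile time; hint that about 1000
  ! iterations are typical (1000 is an arbitrary example value).
  subroutine axpy(y, x, n, s)
    integer, intent(in) :: n
    real, intent(inout) :: y(n)
    real, intent(in) :: x(n), s
    integer :: i
    !DIR$ LOOP COUNT (1000)
    do i = 1, n
      y(i) = y(i) + s * x(i)
    end do
  end subroutine axpy

  ! DISTRIBUTE POINT: suggest splitting the loop between the two
  ! independent statements.
  subroutine two_updates(a, b, c, d, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n), c(n)
    real, intent(in) :: b(n), d(n)
    integer :: i
    do i = 1, n
      a(i) = a(i) + b(i)
      !DIR$ DISTRIBUTE POINT
      c(i) = c(i) * d(i)
    end do
  end subroutine two_updates

end module directive_demo

program use_directives
  use directive_demo
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n), d(n)

  a = 1.0; b = 2.0; c = 3.0; d = 4.0
  call scale_even(a, n, 10.0)      ! even elements are multiplied by 10.0
  call add_short(a, b, 4)          ! first four elements gain 2.0
  call axpy(c, b, n, 0.5)          ! c gains 0.5 * b
  call two_updates(a, b, c, d, n)  ! a gains b, c is multiplied by d
  print *, a(1), a(2), c(1)
end program use_directives
```

None of these directives changes what the loops compute; they only influence how the compiler translates them, so the printed values should not depend on whether the directives are honored.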

Example

Even when the programmer knows that a loop can be vectorized, the compiler may be unable to determine this on its own.

For example,

subroutine add(A, N, X)
  integer N, X
  real A(N)
  do I = X + 1, N
    A(I) = A(I) + A(I - X)
  end do
end

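Because the value of X is unknown at compile time, the compiler assumes a "possible dependency" between iterations, which prevents the DO loop from being vectorized. Inserting an IVDEP directive line tells the compiler to ignore this assumed dependency, which makes vectorization possible, as shown below. The hint is safe only when the programmer knows the dependency cannot affect the result, for example when X is at least as large as the number of elements processed by one vector instruction.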

subroutine add(A, N, X)
  integer N, X
  real A(N)
!DIR$ IVDEP
  do I = X + 1, N
    A(I) = A(I) + A(I - X)
  end do
end
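
One way to gain confidence that an IVDEP hint is safe for a given input is to compare the result against a version of the same loop that is kept scalar. The sketch below does this. The routine names add_ivdep and add_scalar and the values n = 64 and x = 16 are assumptions chosen for illustration; 16 is used because sixteen single-precision values fill one 512-bit register, so the distance x is not smaller than the vector length.

```fortran
! Same loop as the example above, with vectorization permitted by IVDEP.
subroutine add_ivdep(a, n, x)
  implicit none
  integer, intent(in) :: n, x
  real, intent(inout) :: a(n)
  integer :: i
  !DIR$ IVDEP
  do i = x + 1, n
    a(i) = a(i) + a(i - x)
  end do
end subroutine add_ivdep

! Reference version: NOVECTOR keeps the loop in its original scalar order.
subroutine add_scalar(a, n, x)
  implicit none
  integer, intent(in) :: n, x
  real, intent(inout) :: a(n)
  integer :: i
  !DIR$ NOVECTOR
  do i = x + 1, n
    a(i) = a(i) + a(i - x)
  end do
end subroutine add_scalar

program check_ivdep
  implicit none
  integer, parameter :: n = 64
  integer, parameter :: x = 16   ! assumed >= elements per vector instruction
  real :: v(n), r(n)

  v = 1.0
  r = 1.0
  call add_ivdep(v, n, x)
  call add_scalar(r, n, x)

  if (all(v == r)) then
    print *, 'IVDEP result matches the scalar reference'
  else
    print *, 'MISMATCH: IVDEP is not safe for this value of x'
  end if
end program check_ivdep
```

A small value such as x = 1 makes every iteration depend on the one before it; with IVDEP in place the compiler is no longer obliged to respect that, so a check like this is worth running with the values of x that occur in practice.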

Specifying -qopt-report-phase=vec and -qopt-report=5 at compile time outputs the full vectorization report, including the reasons why loops were not vectorized. We recommend inserting directive lines based on this report.


Factors that inhibit vectorization

Factors that inhibit vectorization include the following


Loop unrolling

One of the speedup techniques. Unrolling a loop increases the number of operations that can be executed in parallel, which can increase speed. However, the amount of code grows as the loop is unrolled.

do j = 1, n
  do k = 1, n
    do i = 1, n
      a(i,j) = a(i,j) + b(i,k) * c(k,j)
    end do
  end do
end do

↓

! k loop unrolled by 4 (assumes n is a multiple of 4)
do j = 1, n
  do k = 1, n, 4
    do i = 1, n
      a(i,j) = a(i,j) + b(i,k  ) * c(k  ,j) &
                      + b(i,k+1) * c(k+1,j) &
                      + b(i,k+2) * c(k+2,j) &
                      + b(i,k+3) * c(k+3,j)
    end do
  end do
end do

Load and store instructions for array a are reduced by a factor of four, and memory accesses are reduced, so speedup is expected.
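
To check that the unrolled form computes the same thing as the original, the two loop nests can be wrapped in routines and compared. This is a sketch under stated assumptions: the module and routine names are hypothetical, n is assumed to be a multiple of 4, and the input values are small integers so that every product and sum is exactly representable in single precision.

```fortran
module unroll_demo
  implicit none
contains

  ! Original triple loop: a = a + b*c
  subroutine mm_plain(a, b, c, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n,n)
    real, intent(in) :: b(n,n), c(n,n)
    integer :: i, j, k
    do j = 1, n
      do k = 1, n
        do i = 1, n
          a(i,j) = a(i,j) + b(i,k) * c(k,j)
        end do
      end do
    end do
  end subroutine mm_plain

  ! k loop unrolled by 4. Assumes n is a multiple of 4.
  subroutine mm_unrolled(a, b, c, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n,n)
    real, intent(in) :: b(n,n), c(n,n)
    integer :: i, j, k
    do j = 1, n
      do k = 1, n, 4
        do i = 1, n
          a(i,j) = a(i,j) + b(i,k  ) * c(k  ,j) &
                          + b(i,k+1) * c(k+1,j) &
                          + b(i,k+2) * c(k+2,j) &
                          + b(i,k+3) * c(k+3,j)
        end do
      end do
    end do
  end subroutine mm_unrolled

end module unroll_demo

program compare_unrolling
  use unroll_demo
  implicit none
  integer, parameter :: n = 8
  real :: b(n,n), c(n,n), a1(n,n), a2(n,n)
  integer :: i, j

  ! Small integer values keep the arithmetic exact in both versions.
  do j = 1, n
    do i = 1, n
      b(i,j) = real(i + j)
      c(i,j) = real(i - j)
    end do
  end do
  a1 = 0.0
  a2 = 0.0

  call mm_plain(a1, b, c, n)
  call mm_unrolled(a2, b, c, n)
  print *, 'max abs difference =', maxval(abs(a1 - a2))
end program compare_unrolling
```

For a general n, a remainder loop over the last mod(n, 4) values of k would be needed. At -O2 and above the compiler may perform this kind of unrolling itself, so manual unrolling is worth measuring before adopting.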


Load/Store Instructions

To perform calculations, the program reads data from memory into a storage location inside the CPU called a register.

Load instruction: Refers to reading data from memory into registers
Store instruction: Refers to writing data in registers into memory

In the above example, the number of iterations of the k loop is reduced, so array a is accessed fewer times, which reduces the number of load and store instructions.
