With the Intel compiler available on SQUID, you can make your program more efficient in two ways: "optimization" and "SIMD". You can get some benefit just by specifying compile options, so please try the "Recommended compiler options" below first. You can further improve efficiency by adding your own instruction lines (directives) to the program.

Optimization
The compiler improves processing efficiency with the aim of reducing execution time and code size. As a drawback, the calculation results may change.
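One common reason results can change is that floating-point addition is not associative, so any reordering of a sum can alter the value. The following is a minimal sketch of that effect written by hand; it is not taken from the SQUID documentation, and the module and function names (`sum_order_demo`, `sum_left`, `sum_right`) are hypothetical. It only imitates the kind of regrouping an optimizer may apply.

```fortran
module sum_order_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision kind
contains

  ! Adds three values grouped as (x + y) + z.
  pure function sum_left(x, y, z) result(s)
    real(dp), intent(in) :: x, y, z
    real(dp) :: s
    s = (x + y) + z
  end function sum_left

  ! Adds the same three values grouped as x + (y + z).
  pure function sum_right(x, y, z) result(s)
    real(dp), intent(in) :: x, y, z
    real(dp) :: s
    s = x + (y + z)
  end function sum_right

end module sum_order_demo
```

Under strict IEEE double-precision evaluation, calling both functions with x = 1.0d20, y = -1.0d20, z = 1.0d0 gives 1.0 for `sum_left` and 0.0 for `sum_right`, because 1.0 is lost when it is added to -1.0d20 first. This is why it is worth comparing results between optimization levels.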

SIMD (Vectorization)
The compiler applies vector instructions to operations that are repeated in a loop over regularly ordered array data (called vector data).

Basic use

    $ module load BaseCPU/2021
    $ ifort -O0 -qopt-report-phase=all -qopt-report=2 source_file

    # -O0: optimization level
    # -qopt-report-phase=all -qopt-report=2: compiler messages (optimization report)


    Optimization Level (-O option)

      -O0 Disables all optimizations.
      -O1 Enables optimizations for speed and disables some optimizations that increase code size and affect speed.
      -O2 Enables optimizations for speed. This is the generally recommended optimization level.
      Vectorization is enabled at O2 and higher levels.
      -O3 Performs O2 optimizations and enables more aggressive loop transformations such as Fusion, Block-Unroll-and-Jam, and collapsing IF statements. The optimizations may slow down code in some cases compared to O2 optimizations.


    Optimization report output (-qopt-report)

      The compiler outputs a report describing the optimizations and vectorization it performed and the factors that inhibited them. The report is written to a file named after the source file, with the extension .optrpt.


    Optimizer phases (-qopt-report-phase)

      Specify the contents to be output in the report. If you do not specify anything, all reports are output.

      -qopt-report-phase=all All optimizer phases. This is the default if you do not specify list.
      -qopt-report-phase=vec The phase for vectorization
      -qopt-report-phase=loop The phase for loop nest optimization
      -qopt-report-phase=par The phase for auto-parallelization
        There are several other phases.
        You can specify multiple phases by separating them with commas. (Example: -qopt-report-phase=par,vec)


    Level of detail in the report (-qopt-report)

      Specify the level of detail to output in the report. The upper limit of the level that can be specified with -qopt-report depends on the phase: for example, 0 to 5 for the vec phase and 0 to 2 for the loop phase. If you specify a level above the upper limit, it is automatically set to the upper limit for that phase. The levels below are for -qopt-report-phase=vec.

      -qopt-report=0 Does not output a report.
      -qopt-report=1 Outputs the loops that were vectorized.
      -qopt-report=2 Level 1 content + the loops that were not vectorized and a brief reason for each.
      -qopt-report=3 Level 2 content + summary information for loops that were not vectorized. (Default)
      -qopt-report=4 Level 3 content + output detailed information about vectorized and unvectorized loops.
      -qopt-report=5 Level 4 content + detailed information on dependencies found or assumed.

      For more information about the optimization report output, please see the following page.

Recommended compiler options (all are written based on the Fortran language compiler)


    For those who are running the program for the first time

      Optimization is performed at the specified level (here -O2, the generally recommended level).

      $ ifort -O2 source_file


    For those who want to speed up a program that terminates normally

      -O3 may speed up the program further. Because the calculation result may change, please check that the result is the same as with the -O2 option. In some cases, the program may become slower.

      $ ifort -O3 source_file


    For those who want to speed up programs on the general-purpose CPU node group

    After confirming that the calculation result remains the same, changing -O2 to -O3 may result in further speed-up.

      Use AVX-512 instructions with restricted use of zmm registers

      $ ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=low source_file


      Use AVX-512 instructions and use zmm registers without restriction

      $ ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=high source_file


      Use AVX2 instructions instead of AVX-512 instructions, tuned for Ice Lake processors

      $ ifort -O2 -xCORE-AVX2 -mtune=icelake-client source_file


    For those who want to debug

      $ ifort -g -traceback source_file #(Debugging and traceback information will be output)

      $ ifort -check uninit -check bounds source_file # (Checks for uninitialized variables and out-of-bounds array references at runtime; this slows down execution.)


About directives

    You can embed optimization/vectorization "hints" to the compiler, as well as "instructions" that force optimization/vectorization, in a program. Because such instruction lines are applied even to dependencies that could not otherwise be optimized/vectorized, the result may change. Add them at your own responsibility. The syntax differs slightly between Fortran and C.

    List of directives

    !DEC$ IVDEP  (Fortran)
    #pragma ivdep (C)

    Hints to the compiler that the loop has no dependency. If the compiler judges that vectorization would bring no benefit, the loop is not vectorized.


    !DEC$ vector always  (Fortran)
    #pragma vector always (C)

    Tells the compiler to ignore its efficiency heuristic when deciding whether to vectorize.


    #pragma vector nontemporal (C)

    Hints the compiler to use streaming stores.


    !DEC$ VECTOR [UN]ALIGNED  (Fortran)
    #pragma vector [un]aligned (C)

    States that the data is [not] aligned.


    !DEC$ NOVECTOR  (Fortran)
    #pragma novector (C)

    Disables vectorization of the target loop.


    #pragma distribute_point (C)

    Hints that the loop should be split at this position.


    !DEC$ LOOP COUNT ()  (Fortran)
    #pragma loop_count () (C)

    Hints at the expected number of loop iterations.


    !DEC$ SIMD  (Fortran)
    #pragma simd (C)

    Forces vectorization.
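    As an illustration of where such a directive is placed, here is a minimal free-form Fortran sketch using IVDEP. The module and subroutine names (`ivdep_demo`, `scatter_add`) and the indirect-index loop are hypothetical choices for this example, not code from the SQUID documentation. The directive goes on the line immediately before the loop it applies to.

```fortran
module ivdep_demo
  implicit none
contains

  ! Adds b(i) to a(idx(i)).
  ! The compiler cannot see the contents of idx, so it has to assume that
  ! two iterations might touch the same element of a.
  ! IVDEP asserts that no such dependency exists. This is only correct
  ! when idx contains no repeated values; that is the caller's responsibility.
  subroutine scatter_add(a, b, idx)
    real,    intent(inout) :: a(:)
    real,    intent(in)    :: b(:)
    integer, intent(in)    :: idx(:)
    integer :: i

    !DEC$ IVDEP
    do i = 1, size(b)
      a(idx(i)) = a(idx(i)) + b(i)
    end do
  end subroutine scatter_add

end module ivdep_demo
```

    If idx could contain duplicates, the assertion would be false and the directive should not be used, since the result may then differ from the scalar loop.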



    Even if the programmer knows that a loop can be vectorized, the compiler may not be able to determine this.
    For example, when a loop refers to a variable X whose value is unknown at compile time, the compiler assumes a "possible dependency", which prevents the DO statement from being vectorized. Inserting a SIMD directive line immediately before the loop gives the compiler the information it needs and makes vectorization possible.
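    The original listing for this example is not reproduced here, so the following is only a sketch of the kind of loop being described. It assumes that the unknown variable X is an offset in an array index; the names `simd_demo` and `shift_add` are hypothetical.

```fortran
module simd_demo
  implicit none
contains

  ! Computes a(i) = a(i + x) + b(i) for i = 1..n.
  ! x is only known at run time, so the compiler must assume that a(i + x)
  ! may refer to an element written by an earlier iteration.
  ! The caller guarantees x >= 0 and size(a) >= n + x; under that
  ! assumption every iteration reads an element that has not yet been
  ! overwritten, and forcing vectorization with SIMD is safe.
  subroutine shift_add(a, b, n, x)
    integer, intent(in)    :: n, x
    real,    intent(inout) :: a(:)
    real,    intent(in)    :: b(:)
    integer :: i

    !DEC$ SIMD
    do i = 1, n
      a(i) = a(i + x) + b(i)
    end do
  end subroutine shift_add

end module simd_demo
```

    With a negative x the loop would read values produced by earlier iterations, and forced vectorization could then change the result. This is the reason the directive must be added at the user's responsibility.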

    By specifying -qopt-report-phase=vec and -qopt-report=5 at compile time, you can output the full vectorization report, including the data that inhibit vectorization. We recommend inserting instruction lines based on this report.

    Factors that inhibit vectorization include the following:

      - Dependency relation of data
      - Call of function and subroutine
      - Access to structure
      - Conditional branch
      - Unknown end condition of loop
      - Pointer access
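    To make the first factor concrete, the sketch below contrasts a loop with independent iterations and a loop with a data dependency between iterations. It is an illustration only; the module and subroutine names are hypothetical, and what a given compiler actually reports for each loop should be checked in its optimization report.

```fortran
module dependency_demo
  implicit none
contains

  ! Independent iterations: a(i) depends only on b(i) and c(i),
  ! so the iterations can be processed several at a time.
  subroutine add_arrays(a, b, c)
    real, intent(out) :: a(:)
    real, intent(in)  :: b(:), c(:)
    integer :: i
    do i = 1, size(a)
      a(i) = b(i) + c(i)
    end do
  end subroutine add_arrays

  ! Loop-carried dependency: a(i) needs a(i - 1), which is produced by
  ! the previous iteration (a running sum), so the iterations cannot
  ! simply be executed side by side.
  subroutine running_sum(a, b)
    real, intent(out) :: a(:)
    real, intent(in)  :: b(:)
    integer :: i
    if (size(a) == 0) return
    a(1) = b(1)
    do i = 2, size(a)
      a(i) = a(i - 1) + b(i)
    end do
  end subroutine running_sum

end module dependency_demo
```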


    Loop unrolling

    Loop unrolling is one of the speedup techniques. Unrolling the instructions in a loop increases the amount of code that can be executed in parallel, which can increase speed. However, the amount of code grows as the loop is expanded.

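    ! Original loop (c: second input matrix, name assumed)
    do j = 1, n
      do k = 1, n
        do i = 1, n
          a(i,j) = a(i,j) + b(i,k) * c(k,j)
        end do
      end do
    end do

    ! k loop unrolled by 4 (assumes n is a multiple of 4)
    do j = 1, n
      do k = 1, n, 4
        do i = 1, n
          a(i,j) = a(i,j) + b(i,k)   * c(k,j)   &
                          + b(i,k+1) * c(k+1,j) &
                          + b(i,k+2) * c(k+2,j) &
                          + b(i,k+3) * c(k+3,j)
        end do
      end do
    end do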

    Load and store instructions for array a are reduced by a factor of four, and memory accesses are reduced, so speedup is expected.
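    A hand-unrolled loop also has to cope with sizes that are not a multiple of the unroll factor. The sketch below is one possible way to write this, with a cleanup loop for the leftover iterations. The names (`unroll_demo`, `matmul_add`, `matmul_add_unrolled`, `kmain`) and the second matrix `c` are assumptions made for this illustration.

```fortran
module unroll_demo
  implicit none
contains

  ! Reference version: a = a + b*c with the plain triple loop.
  subroutine matmul_add(a, b, c, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n, n)
    real,    intent(in)    :: b(n, n), c(n, n)
    integer :: i, j, k
    do j = 1, n
      do k = 1, n
        do i = 1, n
          a(i, j) = a(i, j) + b(i, k) * c(k, j)
        end do
      end do
    end do
  end subroutine matmul_add

  ! Same computation with the k loop unrolled by 4.
  ! a(i, j) is loaded and stored once per four values of k.
  subroutine matmul_add_unrolled(a, b, c, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n, n)
    real,    intent(in)    :: b(n, n), c(n, n)
    integer :: i, j, k, kmain
    kmain = n - mod(n, 4)            ! largest multiple of 4 not exceeding n
    do j = 1, n
      do k = 1, kmain, 4             ! unrolled part
        do i = 1, n
          a(i, j) = a(i, j) + b(i, k)     * c(k, j)     &
                            + b(i, k + 1) * c(k + 1, j) &
                            + b(i, k + 2) * c(k + 2, j) &
                            + b(i, k + 3) * c(k + 3, j)
        end do
      end do
      do k = kmain + 1, n            ! cleanup for the remaining mod(n, 4) steps
        do i = 1, n
          a(i, j) = a(i, j) + b(i, k) * c(k, j)
        end do
      end do
    end do
  end subroutine matmul_add_unrolled

end module unroll_demo
```

    Because the unrolled version adds the products in a different grouping, its results can differ from the reference version in the last digits for general floating-point data. In practice the compiler often performs this kind of unrolling itself at higher optimization levels, so a manual version is worth keeping only if measurement shows a benefit.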

    Load/Store instructions

    The program reads data from memory into a storage location called a register to perform calculations.

      Load instruction: reads data from memory into registers
      Store instruction: writes data in registers to memory

    In the above example, because the number of iterations of the k loop is reduced, array a is accessed fewer times, which reduces the number of load and store instructions.