Introduction

In late 2003, Intel introduced version 8.0 of its compiler collection. The new compilers are designed to improve the performance of applications running on servers, desktop PCs and mobile systems (laptops, mobile phones and pocket computers) based on Intel processors. It is pleasant to note that this product was created with the active participation of the Nizhny Novgorod Intel Software Development Center and of Intel specialists from Sarov.

The new series includes Intel C++ and Fortran compilers for Windows and Linux, as well as Intel C++ compilers for Windows CE .NET. The compilers target systems based on the following Intel processors: Intel Itanium 2, Intel Xeon, Intel Pentium 4, processors with Intel Personal Internet Client Architecture for mobile phones and Pocket PCs, and the Intel Pentium M processor (a component of Intel Centrino mobile technology).

The Intel Visual Fortran Compiler for Windows provides next-generation compilation technologies for high-performance computing. It combines the language features of Compaq Visual Fortran (CVF) with the performance improvements made possible by Intel's compilation and code-generation technologies, simplifying the task of porting source code developed with CVF to the Intel Visual Fortran environment. This compiler implements CVF language features for the first time both for 32-bit Intel systems and for systems based on Intel Itanium family processors running Windows. In addition, it makes CVF language features available on Linux systems based on 32-bit Intel processors and Intel Itanium family processors. In 2004 an expanded version of this compiler is planned for release - the Intel Visual Fortran Compiler Professional Edition for Windows, which will include the IMSL Fortran 5.0 library developed by Visual Numerics, Inc.


"The new compilers also support future Intel processors, codenamed Prescott, which include new graphics and video performance commands and other performance enhancements. They also support new technology Mobile MMX(tm), which similarly improves the performance of graphics, audio and video applications for mobile phones and pocket PCs, noted Alexey Odinokov, co-director of the Intel Software Development Center in Nizhny Novgorod. - These compilers provide application developers with a unified package tools to build new applications for wireless networks based on Intel architecture. The new Intel compilers also support Intel's Hyper-Threading technology and the OpenMP 2.0 industry specification, which defines the use of high-level directives to control instruction flow in applications."

The compilers also include new tools: Intel Code Coverage and Intel Test Prioritization. Together, these tools can speed up application development and improve application quality by improving the software testing process.

The Code Coverage tool reports, for a given test run, which parts of the application logic were exercised and where the covered areas are located in the source code. If changes are made to the application, or if the existing tests do not cover the part of the application that interests the developer, the Test Prioritization tool makes it possible to check the operation of the selected section of program code.

New Intel compilers are available in different configurations, costing from $399 to $1,499. They can be purchased today from Intel or from resellers around the world, a list of which is located on the website http://www.intel.com/software/products/reseller.htm#Russia.

Prescott processor support

Support for the Intel Pentium 4 (Prescott) processor in the eighth version of the compiler is as follows:

1. Support for SSE3 instructions (also known as PNI, Prescott New Instructions). Three levels of support can be distinguished here:

a. Assembly inserts (inline assembly). For example, the compiler recognizes the SSE3 instruction in an insert such as _asm { addsubpd xmm0, xmm1 }. In this way, users interested in low-level optimization get direct access to assembly instructions.

b. In the C/C++ compiler, the new instructions are available at a higher level than assembly inserts, namely through built-in (intrinsic) functions:

Built-in function      Generated instruction
_mm_addsub_ps          addsubps
_mm_hadd_ps            haddps
_mm_hsub_ps            hsubps
_mm_moveldup_ps        movsldup
_mm_movehdup_ps        movshdup
_mm_addsub_pd          addsubpd
_mm_hadd_pd            haddpd
_mm_hsub_pd            hsubpd
_mm_loaddup_pd         movddup xmm, m64
_mm_movedup_pd         movddup reg, reg
_mm_lddqu_si128        lddqu

The table shows the built-in functions and the corresponding assembly instructions from the SSE3 set. The same kind of support exists for the MMX, SSE and SSE2 instruction sets. This lets the programmer perform low-level optimization without resorting to assembly language: the compiler itself takes care of mapping the built-in functions to the corresponding processor instructions and of using registers optimally, so the programmer can concentrate on an algorithm that makes efficient use of the new instruction sets.
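
As an illustration (a minimal sketch, not taken from the article), one of the intrinsics listed above might be used as follows; the code assumes a compiler and processor with SSE3 support (for example, a build with -QxP):

    #include <stdio.h>
    #include <pmmintrin.h>   /* SSE3 intrinsics */

    int main(void)
    {
        __m128d a = _mm_set_pd(2.0, 1.0);   /* a = {1.0, 2.0} (elements 0 and 1) */
        __m128d b = _mm_set_pd(0.5, 0.5);   /* b = {0.5, 0.5} */

        /* _mm_addsub_pd maps to the addsubpd instruction:
           element 0 is subtracted, element 1 is added */
        __m128d r = _mm_addsub_pd(a, b);

        double out[2];
        _mm_storeu_pd(out, r);
        printf("%f %f\n", out[0], out[1]);  /* expected: 0.500000 2.500000 */
        return 0;
    }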

c. Automatic generation of the new instructions by the compiler. The previous two methods require the programmer to use the new instructions explicitly, but the compiler is also capable (given the appropriate options - see the Performance section below) of automatically generating SSE3 instructions for C/C++ and Fortran code. One example is the optimized unaligned load instruction (lddqu), whose use can yield a performance gain of up to 40% (for example, in video and audio encoding tasks). Other SSE3 instructions give a significant speedup in 3D graphics or in computations with complex numbers. For example, the graph in section 1.1 below shows that for the 168.wupwise application from the SPEC CPU2000 FP suite, the speedup obtained from automatic SSE3 instruction generation was about 25%; the performance of this application depends heavily on the speed of complex arithmetic.

2. Use of the microarchitectural advantages of the Prescott processor. When generating code, the compiler takes into account the microarchitectural changes in the new processor. For example, certain operations (such as integer shifts, integer multiplication, or conversion between floating-point formats in SSE2) have become faster on the new processor than on its predecessors (an integer shift now takes one processor cycle versus four on the previous version of the Intel Pentium 4), and more intensive use of such instructions can speed up applications significantly.
Another example of a microarchitectural change is the improved store forwarding mechanism (fast loading of data recently stored to memory): the data is forwarded not from the cache but from an intermediate store buffer, which allows very fast access to it. This feature makes it possible, for example, to apply more aggressive automatic vectorization of program code.
The compiler also takes into account the increased sizes of the first- and second-level caches.

3. Improved support for Hyper-Threading technology. This point is closely related to the previous one - microarchitectural changes and their use by the compiler. For example, the runtime library that implements support for the OpenMP industry specification has been optimized for the new processor.

Performance

Using the compilers is a simple and effective way to take advantage of Intel processor architectures. Below, two (admittedly rough) ways of using the compilers are distinguished: a) recompiling programs, possibly with changes to the compiler settings; b) recompiling with changes both to the compiler settings and to the source code, guided by the compiler's optimization diagnostics and possibly using other software (for example, profilers).


1.1 Optimizing programs using recompilation and changing compiler settings


Often the first step in migrating to a new optimizing compiler is to use it with its default settings; the next logical step is to use options for more aggressive optimization. Figures 1, 2, 3 and 4 show the effect of switching to Intel compiler version 8.0 compared with other industry-leading products (-O2 denotes the default compiler settings, base the settings for maximum performance). The comparison is made on 32- and 64-bit Intel architectures, with applications from SPEC CPU2000 used as the test set.


Figure 1




Figure 2




Figure 3




Figure 4


Some of the options supported by the Intel compiler are listed below. (The options are given for the Windows OS family; for the Linux OS family there are options with the same effect, though the names may differ: for example, -Od or -QxK for Windows correspond to -O0 or -xK for Linux respectively. More detailed information can be found in the compiler manual.)


Controlling optimization levels: the options -Od (no optimization; used for debugging), -O1 (maximum speed while minimizing code size), -O2 (optimization for execution speed; applied by default), -O3 (the most aggressive speed optimizations; in some cases these can have the opposite effect, i.e. a slowdown; note that on IA-64 the use of -O3 leads to acceleration in most cases, while the positive effect on IA-32 is less pronounced). Examples of optimizations enabled by -O3: loop interchange, loop fusion, loop distribution (the inverse of loop fusion), software prefetching of data. A slowdown under -O3 can occur when the compiler has chosen aggressive optimizations heuristically without sufficient information about the program (for example, generating prefetch instructions for data used in a loop on the assumption that the loop executes many times, when in fact it has only a few iterations). Interprocedural optimization, profile-guided optimization, and the various programmer "hints" described in section 1.2 can help in this situation.
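
As a rough illustration (a hypothetical fragment, not from the article), the following loop nest is the kind of code that -O3 may restructure: the inner loop walks down the columns of row-major arrays, so interchanging the two loops improves cache locality:

    #include <stdio.h>
    #define N 512

    static double a[N][N], b[N][N];

    int main(void)
    {
        int i, j;

        /* column-wise traversal of row-major arrays: a candidate for loop interchange */
        for (j = 0; j < N; ++j)
            for (i = 0; i < N; ++i)
                a[i][j] = 2.0 * b[i][j] + 1.0;

        printf("%f\n", a[N - 1][N - 1]);
        return 0;
    }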

Interprocedural optimization: -Qip (within one file) and -Qipo (across several or all project files). This includes optimizations such as inline substitution of frequently used code (reducing the cost of function/procedure calls). It also supplies information to other optimization stages - for example, the upper bound of a loop (say, when it is a compile-time constant defined in one file but used in many), or information about data alignment in memory (many MMX/SSE/SSE2/SSE3 instructions work faster if their operands are aligned to an 8- or 16-byte boundary). The analysis of memory-allocation routines (implemented or called in one of the project files) is propagated to the functions and procedures where that memory is used; this can allow the compiler to drop the conservative assumption that the data is not properly aligned (the assumption must be conservative in the absence of additional information). Another example is disambiguation, the analysis of data aliasing: in the absence of additional information, and when it cannot prove that memory regions do not overlap, the compiler conservatively assumes that they may. Such a decision can hurt optimizations such as automatic vectorization on IA-32 or software pipelining (SWP) on IA-64; interprocedural optimization can help establish whether memory regions actually overlap.

Profile-guided optimization: this involves three stages. 1) Generation of instrumented code with the -Qprof_gen option. 2) The resulting code is run on representative data, during which information is collected about various characteristics of execution (for example, branch probabilities or the typical number of loop iterations). 3) Recompilation with the -Qprof_use option, which makes the compiler use the information collected in the previous step. The compiler can thus rely not only on static estimates of important program characteristics but also on data obtained during actual execution. This helps in the subsequent selection of optimizations (for example, a more efficient layout of the different branches of the program in memory, based on how often each branch was executed, or optimization of a loop based on its typical number of iterations). Profile-guided optimization is especially useful when a small but representative data set can be chosen for step 2 that illustrates the most typical future uses of the program. In some subject areas selecting such a representative set is entirely possible; profile-guided optimization is used, for example, by DBMS developers.

The optimizations listed above are of the generic type, i.e. the generated code will work on all processors of the family (in the case of the 32-bit architecture, on all of the following processors: Intel Pentium III, Pentium 4, including the Prescott core, and Intel Pentium M). There are also optimizations for specific processors.

Processor-specific optimizations: -QxK (Pentium III; use of SSE instructions and microarchitecture features), -QxW and -QxN (Pentium 4; SSE and SSE2 instructions, microarchitecture features), -QxB (Pentium M; SSE and SSE2 instructions, microarchitecture features), -QxP (Prescott; SSE, SSE2 and SSE3 instructions, microarchitecture features). Code generated with such options may not work on other members of the processor line (for example, -QxW code may execute an invalid instruction on a system based on an Intel Pentium III processor), or may not run with maximum efficiency (for example, -QxB code on a Pentium 4, because of differences in microarchitecture). With these options it is also possible to use runtime libraries optimized for a specific processor and its instruction set. To check that the code is actually running on the target processor, a dispatch mechanism (cpu dispatch) is provided: the processor is checked during program execution. This mechanism may or may not be activated, depending on the situation. Dispatch is always used when one of the -Qax{K,W,N,P} options is given: two versions of the code are generated, one optimized for the specific processor and one generic, and the choice between them is made at run time. Thus, at the cost of a larger code size, the program runs on all processors of the line and runs optimally on the target processor. Another approach is to optimize the code for an earlier member of the line and use that code on it and on its successors. For example, -QxN code can run on a Pentium 4 with either a Northwood or a Prescott core, with no increase in code size; this gives good, though not optimal, performance on a Prescott system (since SSE3 is not used and the microarchitectural differences are not taken into account) together with optimal performance on Northwood. Similar options exist for IA-64 processors; at the moment there are two of them: -G1 (Itanium) and -G2 (Itanium 2; the default).

The graph below (Figure 5) shows the speedup (relative to 1, i.e. no speedup) from using some of the optimizations listed above (namely -O3 -Qipo -Qprof_use -Qx{N,P}) on the Prescott processor, compared with the default settings (-O2). Using -QxP helps in some cases to obtain a speedup over -QxN. The greatest speedup is achieved in the 168.wupwise application already mentioned in the previous section (thanks to intensive optimization of complex arithmetic with SSE3 instructions).


Figure 5


Figure 6 below shows the ratio of the speed of code built with the best settings to that of completely unoptimized code (-Od) on Pentium 4 and Itanium 2 processors. It can be seen that Itanium 2 depends far more strongly on the quality of optimization. This is especially pronounced for floating-point (FP) computations, where the ratio is approximately 36x. Floating-point computation is a strong point of the IA-64 architecture, but it requires a careful approach to choosing the most effective compiler settings; the resulting performance gain repays the effort spent finding them.


Figure 6. Speedup with Best SPEC CPU2000 Optimization Options


Intel compilers support the OpenMP industry specification for creating multi-threaded applications. Both explicit (the -Qopenmp option) and automatic (-Qparallel) parallelization modes are supported. In explicit mode the programmer is responsible for the correct and efficient use of the OpenMP facilities. With automatic parallelization the compiler carries the additional burden of analyzing the program code, which is why automatic parallelization currently works effectively only on fairly simple code.
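
A minimal sketch of the explicit mode (assuming a build with -Qopenmp on Windows or -openmp on Linux; the loop and data are purely illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        int i;

        /* the programmer explicitly marks the loop as parallel and declares the reduction */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; ++i) {
            a[i] = 0.5 * i;
            sum += a[i];
        }

        printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }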

The graph in Figure 7 shows the speedup from explicit parallelization on a pre-production system based on an Intel Pentium 4 (Prescott) processor with Hyper-Threading technology: 2.8 GHz, 2 GB RAM, 8 KB L1 cache, 512 KB L2 cache. The test suite used is SPEC OMPM2001; this suite targets small and medium SMP systems, with memory consumption of up to two gigabytes. The applications were compiled with Intel 8.0 C/C++ and Fortran using two sets of options, -Qopenmp -Qipo -O3 -QxN and -Qopenmp -Qipo -O3 -QxP, and each build was run with Hyper-Threading technology enabled and disabled. The speedup values in the graph are normalized to the performance of the single-threaded version with Hyper-Threading disabled.


Figure 7: SPEC OMPM2001 Applications on Prescott Processor


It can be seen that in 9 of the 11 cases, explicit parallelization with OpenMP gives a performance increase when Hyper-Threading technology is enabled. One application (312.swim) slows down; this is a known effect, as the application depends heavily on memory bandwidth. As in the SPEC CPU2000 case, the wupwise application benefits greatly from the Prescott-specific optimizations (-QxP).


1.2 Optimizing programs by making changes to the source text and using compiler diagnostics


In the previous sections we looked at the influence of the compiler (and its settings) on code execution speed. Intel compilers, however, provide broader opportunities for optimization than merely changing settings. In particular, they allow the programmer to place "hints" in the program code that help generate more efficient code. Below are some examples for C/C++ (Fortran has similar facilities that differ only in syntax).

#pragma ivdep (ivdep stands for "ignore vector dependencies") is placed before a loop to tell the compiler that there are no data dependences within it. This hint works when the compiler's analysis conservatively assumes that such dependences may exist (if the compiler can prove that a dependence does exist, the hint has no effect), while the author of the code knows that they cannot arise. With this hint the compiler can generate more efficient code: automatic vectorization for IA-32 (the use of vector instructions from the MMX/SSE/SSE2/SSE3 sets for loops in C/C++ and Fortran; this technique is described in more detail, for example, in the Intel Technology Journal), or software pipelining (SWP) for IA-64.
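
A small sketch of how the hint is placed (the function and the no-overlap guarantee are illustrative assumptions, not an example from the article):

    /* The caller guarantees that dst and src never overlap. The compiler cannot
       prove this from the code alone, so without the pragma it would conservatively
       assume a possible dependence and might refuse to vectorize the loop. */
    void scale(double *dst, const double *src, int n)
    {
        int i;
        #pragma ivdep
        for (i = 0; i < n; ++i)
            dst[i] = 2.0 * src[i];
    }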

#pragma vector always is used to override the compiler's decision that vectorizing a loop (automatic vectorization on IA-32, SWP on IA-64) is not worthwhile, a decision based on analysis of the quantitative and qualitative characteristics of the work done in each iteration.
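
A sketch of where such an override might be applied (illustrative; whether vectorization actually pays off here depends on the processor):

    /* The non-unit stride in src makes the compiler's profitability estimate
       pessimistic; #pragma vector always asks it to vectorize anyway. */
    void gather_even(float *dst, const float *src, int n)
    {
        int i;
        #pragma vector always
        for (i = 0; i < n; ++i)
            dst[i] = src[2 * i];
    }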

#pragma novector has the opposite effect of #pragma vector always.

#pragma vector aligned tells the compiler that the data used in the loop is aligned to a 16-byte boundary. This allows more efficient and/or more compact code to be generated (thanks to the absence of runtime alignment checks).

#pragma vector unaligned has the opposite effect of #pragma vector aligned. A performance gain can hardly be expected in this case, but the code may be more compact.

#pragma distribute point is used inside a loop so that the compiler can split the loop (loop distribution) at that point into several smaller ones. Such a hint can be used, for example, when the compiler fails to vectorize the original loop automatically (say, because of a data dependence that cannot be ignored even with #pragma ivdep), whereas each (or some) of the newly formed loops can be vectorized effectively.
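
A sketch of the idea (a hypothetical loop; the pragma spelling follows the article, and some compiler versions write it as distribute_point):

    /* The first statement carries a genuine loop-carried dependence and cannot be
       vectorized; the second is independent. Splitting the loop at the marked point
       lets the compiler vectorize the copy part separately. */
    void running_sum_and_copy(double *acc, double *out, const double *c, int n)
    {
        int i;
        for (i = 1; i < n; ++i) {
            acc[i] = acc[i - 1] + c[i];   /* recurrence: stays scalar */
            #pragma distribute point
            out[i] = 2.0 * c[i];          /* independent: vectorizable after distribution */
        }
    }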

#pragma loop count (N) tells the compiler that the most likely number of iterations of the loop is N. This information helps it choose the most effective optimizations for the loop (for example, whether to unroll it, whether SWP or automatic vectorization is worthwhile, whether software prefetch instructions are needed, and so on).

The "hint" _assume_aligned(p, base) is used to tell the compiler that the memory region associated with pointer p is aligned to a boundary of base = 2^n bytes.

This is not a complete list of various compiler "hints" that can significantly affect the efficiency of the generated code. You may wonder how to determine that the compiler needs a hint.

First, you can use the compiler's diagnostic reports. For example, the -Qvec_reportN option (where N ranges from 0 to 3 and sets the level of detail) produces an automatic vectorization report. The programmer then sees which loops were vectorized and which were not; in the latter case the compiler states in the report the reasons why vectorization failed. Suppose the cause was a conservatively assumed data dependence: if the programmer is sure that the dependence cannot arise, #pragma ivdep can be used. The compiler provides similar capabilities on IA-64 (analogous to -Qvec_reportN on IA-32) for monitoring the presence and effectiveness of SWP. In general, Intel compilers offer extensive diagnostics of the optimizations they perform.

Second, other software products (such as the Intel VTune profiler) can be used to find performance bottlenecks in the code. The results of the analysis can help the programmer make necessary changes.

You can also use the assembly code listing generated by the compiler for analysis.


Figure 8


Figure 8 above shows the step-by-step process of optimizing an application with the Intel Fortran compiler (and other Intel software products) for the IA-64 architecture. The example considered is the 48-hour non-adiabatic regional forecast scheme of the Russian Hydrometeorological Centre (described, for example, in the article referenced; the article mentions a calculation time of about 25 minutes, but significant changes have occurred since it was written). The performance of the code on a Cray Y-MP system is taken as the reference point. The unmodified code with the default compiler options (-O2) showed a 20% performance increase on a four-processor system based on 900 MHz Intel Itanium 2 processors. Applying more aggressive optimization (-O3) gave a speedup of about 2.5x without any code changes, mainly thanks to SWP and data prefetching. Analysis using the compiler's diagnostics and the Intel VTune profiler revealed several bottlenecks. For example, the compiler had not software-pipelined several performance-critical loops, reporting that it suspected a data dependence; minor code changes (the ivdep directive) made efficient pipelining possible. With the VTune profiler it was also discovered (and the compiler report confirmed) that the compiler had not interchanged nested loops (loop interchange) for more efficient use of the cache, again because of conservative assumptions about data dependences. After changes to the source code, a 4x speedup over the initial version was achieved. Explicit parallelization with OpenMP directives, followed by a move to a system with more processors running at a higher frequency, reduced the calculation time to under 8 minutes, a more than 16x speedup over the initial version.

Intel Visual Fortran

Intel Visual Fortran 8.0 uses the CVF front-end (the part of the compiler responsible for converting the program text into the compiler's internal representation, which is largely independent of both the source language and the target machine) together with the Intel compiler components responsible for optimization and code generation.


Figure 9




Figure 10


Figures 9 and 10 show graphs comparing the performance of Intel Visual Fortran 8.0 with the previous version, Intel Fortran 7.1, and with other industry-popular compilers for this language running under Windows and Linux. The comparison uses tests whose source code, conforming to the F77 and F90 standards, is available at http://www.polyhedron.com/. More detailed compiler comparisons are available on the same site (Win32 Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks and Linux Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks): more compilers are shown, and the geometric mean is given together with the individual results of each test.


Intel C++ Compiler

Main features:

  • Vectorization for SSE, SSE2, SSE3, SSE4

The compiler supports the OpenMP 3.0 standard for writing parallel programs. It also contains a modification of OpenMP called Cluster OpenMP, with which applications written for OpenMP can be run on clusters over MPI.

Intel C++ Compiler uses the frontend (the part of the compiler that parses the compiled program) from Edison Design Group. The same frontend is used by the SGI MIPSpro, Comeau C++, and Portland Group compilers.

This compiler is widely used for compiling SPEC CPU benchmarks.

There are 4 series of products from Intel containing the compiler:

  • Intel C++ Compiler Professional Edition
  • Intel Cluster Toolkit (Compiler Edition)

Disadvantages of the Linux version of the compiler include partial incompatibility with the GNU extensions to the C language (supported by the GCC compiler), which may cause problems when compiling some programs.

Experimental versions

The following experimental versions of the compiler were published:

  • Intel STM Compiler Prototype Edition, dated September 17, 2007, with support for Software Transactional Memory (STM). Released for Linux and Windows, IA-32 only (x86 processors);
  • Intel Concurrent Collections for C/C++ 0.3, from September 2008. Contains mechanisms that make it easier to write parallel C++ programs.

Basic flags

Windows       Linux, Mac OS X   Description
/Od           -O0               Disable optimizations
/O1           -O1               Optimize to minimize executable file size
/O2           -O2               Optimize for speed; a standard set of optimizations is enabled
/O3           -O3               Enable all optimizations from O2, plus intensive loop optimizations
/Qip          -ip               Enable file-by-file interprocedural optimization
/Qipo         -ipo              Enable global (cross-file) interprocedural optimization
/QxO          -xO               Allow the use of the SSE3, SSE2 and SSE extensions on processors from any manufacturer
/fast         -fast             "Fast mode": equivalent to "/O3 /Qipo /QxHost /no-prec-div" on Windows and "-O3 -ipo -static -xHOST -no-prec-div" on Linux. Note that the -xHOST flag means optimization for the processor on which the compiler is running
/Qprof-gen    -prof_gen         Create an instrumented version of the program that will collect a performance profile
/Qprof-use    -prof_use         Use the profile information collected from runs of the program built with prof_gen


Intel C++ and Fortran compilers and MKL library

Along with the standard GNU compilers for Linux, the Intel C++ and Fortran compilers are installed on the clusters of the NIVC computing complex. Currently (beginning of 2006), compiler version 9.1 is installed on all clusters. This page describes the most important options and settings of these compilers, as well as their main differences from the GNU compilers. It is aimed mainly at users of the MSU Research Computing Center clusters, but may also be useful to other Russian-speaking users. Issues related to compilation for the IA-64 platform are not addressed here.

The Intel Math Kernel Library (MKL) version 8.0.2 is also installed on all clusters, in the /usr/mkl directory. Note that the lib directory contains the subdirectories 32, 64 and em64t. On the Ant cluster you need to use the libraries from the em64t subdirectory, and on the other clusters those from the 32 subdirectory. All necessary documentation and examples can be found in the /usr/mkl/doc directory.

Why were new compilers needed?

The need for new compilers arose mainly a) to support programming in Fortran 90, and b) to provide more powerful optimization of Fortran programs than the g77 compiler, which translates to C and then compiles with gcc.

PGI (Portland Group) compilers also meet these requirements, but the developer company refused to supply them to Russia.

How to use?

The Intel compilers are invoked with the commands icc (C or C++), icpc (C++) and ifort (Fortran 77/90). The mpicc, mpiCC and mpif77 commands for compiling and linking MPI programs are also configured to use the Intel compilers.

It is also possible to use GNU compilers using the mpigcc, mpig++ and mpig77 commands (Fortran 90 is not supported).

Input files

By default, files with the .cpp and .cxx extensions are treated as C++ source code and files with the .c extension as C source code; the icpc compiler, however, compiles .c files as C++ as well.

Files with the .f, .ftn and .for extensions are recognized as Fortran source code in fixed form, and .fpp and .F files are additionally passed through the Fortran preprocessor. Files with the .f90 extension are treated as Fortran 90/95 source code in free form. You can explicitly specify fixed or free form for Fortran programs with the -FI and -FR options respectively.

Files with the .s extension are recognized as IA-32 assembly language code.

Intel Compiler Features

Here we list the characteristics of the Intel compilers as stated by the developer in the user manual, with some comments of our own.

  • Significant optimization
    Apparently this means optimization of the code at a higher level, i.e. first of all various loop transformations, which almost all compilers do with greater or lesser success
  • Floating-point optimization
    Apparently this means, first of all, maximum use of the instructions implemented in hardware
  • Interprocedural optimizations
    i.e. global optimization of the entire program, as opposed to ordinary optimization, which affects only the code of individual functions
  • Profile-based optimization
    i.e. the ability to run a program in a test mode, collect data on the time it takes to pass through particular code fragments inside frequently used functions, and then use this data for optimization
  • Support for the SSE instruction set of Pentium III processors
    Note: for computational tasks the SSE2 instructions are of greater interest, i.e. vector operations on 64-bit real numbers, but they are supported only by Pentium 4 processors, which we do not yet have at our disposal
  • Automatic vectorization
    i.e., again, the use of the SSE and SSE2 instructions, inserted automatically by the compiler
  • OpenMP support for programming on SMP systems
    Note: on the cluster it is recommended to use the MPI interface first of all; widespread use of OpenMP on the cluster is not expected and such experiments have not yet been carried out, but it probably makes sense to use libraries (BLAS, etc.) that are parallelized for shared memory
  • Data prefetching
    i.e., apparently, the use of instructions that preload data from memory into the cache shortly before it is needed
  • "Dispatching" of code for different processors
    i.e. the ability to generate code for different processors in a single executable file, which makes it possible to exploit the latest processors for the highest performance on them while keeping binary compatibility with earlier processors; on our cluster this is not yet relevant, because only Pentium III processors are used and programs compiled on the cluster are not meant to be transferred and run on other machines

Basic compiler options

The most interesting options are, of course, those for code optimization. Most of the options are the same for the C++ and Fortran compilers. More detailed descriptions of the options can be found in the English-language user manuals.

Optimization levels

Option        Description
-O0           Disables optimization
-O1 or -O2    Basic optimization for speed. Inline expansion of library functions is disabled. For the C++ compiler these options give the same optimization; for the Fortran compiler -O2 is preferable, since it also includes loop unrolling.
-O3           More powerful optimizations, including loop transformations, data prefetching and use of OpenMP. For some programs an improvement over -O2 is not guaranteed. Makes sense to use together with the vectorization options -xK and -xW.
-unroll[n]    Enables loop unrolling up to n times.

Optimizations for a specific processor

Option        Description
-tpp6         Optimization for Pentium Pro, Pentium II and Pentium III processors
-tpp7         Optimization for Pentium 4 processors (this option is enabled by default for the IA-32 compiler)
-xM           Code generation using the MMX extensions specific to Pentium MMX, Pentium II and later processors
-xK           Code generation using the SSE extensions specific to Pentium III processors
-xW           Code generation using the SSE2 extensions specific to Pentium 4 processors

Interprocedural optimization

-ip           Interprocedural optimization within one file. The -ip_no_inlining option disables inline function expansion.
-ipo          Interprocedural optimization across different files

Profile-guided optimization

-prof_gen     Generates instrumented code that will be used for profiling, i.e. collecting data on how often particular places in the program are executed
-prof_use     Optimization based on the data obtained during the profiling stage. Makes sense to use together with the interprocedural optimization option -ipo.

Parallelization for SMP systems

-openmp       Enables support for the OpenMP 2.0 standard
-parallel     Enables automatic loop parallelization

Performance

According to the results of the SPEC CPU2000 tests published on the ixbt.com server, Intel compilers version 6.0 were almost universally better than gcc versions 2.95.3, 2.96 and 3.1, and than PGI version 4.0.2. Those tests were run in 2002 on a computer with a Pentium 4/1.7 GHz processor under RedHat Linux 7.3.

According to tests conducted by Polyhedron, the Intel Fortran compiler version 7.0 was almost universally superior to the other Fortran 77 compilers for Linux (Absoft, GNU, Lahey, NAG, NAS, PGI); only in some tests was it slightly inferior to the Absoft, NAG and Lahey compilers. Those tests were run on a computer with a Pentium 4/1.8 GHz processor under Mandrake Linux 8.1.

Intel compilers version 9.1 also outperform gcc compilers, and show performance comparable to Absoft, PathScale and PGI.

We will be grateful to those users and readers who send us data on the impact of the choice of compiler (GCC or Intel) and optimization options on the speed of work on their real-life problems.

Libraries

The C language compiler uses the runtime library developed within the GNU project (libc.a).

The following libraries are supplied with the Intel C++ compiler:

  • libcprts.a - the C++ runtime library, developed by Dinkumware.
  • libcxa.a - an additional C++ runtime library developed by Intel.
  • libimf.a - the Intel mathematical functions library, which includes optimized and high-precision implementations of trigonometric, hyperbolic, exponential, special, complex and other functions (see the list of functions for details).
  • libirc.a - runtime support for profiling (PGO) and for code dispatch depending on the processor (see above).
  • libguide.a - the OpenMP implementation.

This list contains the static libraries, but for most of them there are also dynamic versions, i.e. ones linked at program startup (.so).

The following libraries are supplied with the Fortran compiler: libCEPCF90.a, libIEPCF90.a, libintrins.a and libF90.a; the mathematical functions library libimf.a is also used.

Building the executable file

Libraries can be linked statically (at build time) or dynamically (at program startup). The dynamic approach reduces the size of the executable file and allows the same copy of a library to be shared in memory, but it requires the full set of dynamic libraries used to be installed on every node where the programs will be run.

Thus, if you have installed the Intel compiler on your Linux machine and want to run the compiled executables on other machines, you need either to use a static build (which is easier) or to copy the Intel dynamic libraries to those machines (usually from a directory like /opt/intel/compiler70/ia32/lib) into one of the directories listed in the /etc/ld.so.conf file, and also to make sure that the same set of GNU/Linux dynamic libraries is installed on those machines.

By default, all Intel libraries (except libcxa.so) are linked statically, and all Linux system libraries and GNU libraries are linked dynamically. With the -static option you can force the linker to link all libraries statically (which will increase the size of the executable file), and with the -i_dynamic option you can link all Intel libraries dynamically.

When linking additional libraries with an option of the form -l<library>, you may also need the -L<directory> option to specify the path where the libraries are located.

With the -Bstatic and -Bdynamic options you can explicitly specify static or dynamic linking for each of the libraries given on the command line.

With the -c option, linking of the executable file is disabled and only compilation is performed (an object module is generated).

Sharing modules in Fortran and C

To use modules written in Fortran and C together, you need to agree on the naming of procedures in the object modules, on the passing of parameters, and on access to global variables, if there are any.

By default, the Intel Fortran compiler converts procedure names to lowercase and appends an underscore to the name. The C compiler never changes function names. Thus, if we want to call a function or procedure FNNAME implemented in C from a Fortran module, then in the C module it should be named fnname_.

The Fortran compiler supports the option -nus [filename], which allows you to disable the addition of underscores to internal procedure names. If a file name is specified, this is done only for procedure names listed in the specified file.

By default, Fortran passes parameters by reference, while C always passes them by value. Thus, when calling a Fortran procedure from a C module, we must pass pointers to the variables containing the values of the actual parameters. When writing a C function that will be called from a Fortran module, we must declare its formal parameters as pointers to the corresponding types.
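
A minimal sketch of these conventions (the Fortran routine axpy below is an assumed example compiled separately with ifort; from C it is visible as axpy_, and every argument is passed by address):

    /* Assumed Fortran side, in a separate file built with ifort:
     *
     *       subroutine axpy(n, a, x, y)
     *       integer n
     *       real*8 a, x(n), y(n)
     *       integer i
     *       do i = 1, n
     *          y(i) = y(i) + a * x(i)
     *       end do
     *       end
     */
    #include <stdio.h>

    void axpy_(int *n, double *a, double *x, double *y);  /* lowercase name + trailing underscore */

    int main(void)
    {
        int n = 3;
        double a = 2.0;
        double x[3] = {1.0, 2.0, 3.0};
        double y[3] = {10.0, 10.0, 10.0};

        axpy_(&n, &a, x, y);   /* all parameters are passed by reference */

        printf("%f %f %f\n", y[0], y[1], y[2]);   /* expected: 12 14 16 */
        return 0;
    }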

In C modules, it is possible to use COMMON blocks defined inside Fortran modules (for more information, see Intel Fortran Compiler User's Guide, chapter Mixing C and Fortran).

Sharing Intel and GCC compilers

C object modules produced by the Intel C++ compiler are compatible with modules produced by the GCC compiler and the GNU C library. Thus, these modules can be used together in a single program compiled using the icc or gcc commands, but it is recommended to use icc to correctly include Intel libraries.

The Intel compiler supports a number of non-standard C language extensions used by the GNU project and supported by the GCC compiler (but not all of them, see here for more details).

The user manual does not say anything about the compatibility of object modules in the C++ and Fortran languages; apparently, it is not supported.

Standards support

Intel C++ Compiler 7.0 for Linux supports the ANSI/ISO C standard (ISO/IEC 9899/1990). Strict compatibility with the ANSI C standard (-ansi) or with an extended ANSI C dialect (-Xa) can be requested. When using the -c99 option ...

  • Compiler manuals in HTML format (available "online" on our server, but require Java support)
    • Intel C++ Compiler User's Guide.
    • Intel Fortran Compiler User's Guide.
  • Compiler manuals in English in PDF format (require Acrobat Reader; download the PDF files to your computer)
    • Intel C++ Compiler User's Guide (1.3 MB, 395 pages).
    • Intel Fortran Compiler User's Guide (1.1 MB, 285 pages).
    • Intel Fortran Programmer's Reference (7 MB, 566 pages).
    • Intel Fortran Libraries Reference Manual (9.5 MB, 881 pages).
  • Intel Application Debugger Guide.
  • Comparison of compilers on the SPEC CPU2000 tests (article on ixbt.com in Russian).
  • The Polyhedron website presents comparison results for various compilers.
    In the previous issue of the magazine we discussed products from the Intel VTune Performance Analyzer family - performance analysis tools that are deservedly popular among application developers and that detect the sections of application code which consume too much CPU time, giving developers the opportunity to identify and eliminate the potential bottlenecks associated with such code and thereby speed up application development. Note, however, that the performance of applications largely depends on how efficient the compilers used to build them are and which hardware features they exploit when generating machine code.

    The latest versions of the Intel C++ and Intel Fortran compilers for the Windows and Linux operating systems allow you to gain up to 40% in application performance on systems based on Intel Itanium 2, Intel Xeon and Intel Pentium 4 processors, compared with existing compilers from other manufacturers, by exploiting features of these processors such as Hyper-Threading technology.

    Distinctive code-optimization features of this family of compilers include the use of the stack for floating-point operations, interprocedural optimization (IPO), profile-guided optimization (PGO), data prefetching into the cache (which avoids the latency associated with memory access), support for the characteristic features of Intel processors (for example, the Intel Streaming SIMD Extensions 2 introduced with the Intel Pentium 4), automatic parallelization of code, the creation of applications that run on several different types of processors while being optimized for one of them, branch prediction facilities, and extended support for working with execution threads.

    Note that Intel compilers are used by such well-known companies as Alias/Wavefront, Oracle, Fujitsu Siemens, ABAQUS, Silicon Graphics and IBM. According to independent testing conducted by a number of companies, the performance of Intel compilers is significantly higher than that of compilers from other manufacturers (see, for example, http://intel.com/software/products/compilers/techtopics/compiler_gnu_perf.pdf).

    Below we look at some features of the latest versions of the Intel compilers for desktop and server operating systems.

    Compilers for the Microsoft Windows platform

    Intel C++ Compiler 7.1 for Windows

    Intel C++ Compiler 7.1, released earlier this year, achieves a high degree of code optimization for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Xeon processors, as well as for the Intel Pentium M processor, part of Intel Centrino technology and intended for use in mobile devices.

    The compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools: it can be integrated into the corresponding development environments.

    This compiler supports ANSI and ISO C/C++ standards.

    Intel Fortran Compiler 7.1 for Windows

    Intel Fortran Compiler 7.1 for Windows, also released earlier this year, creates optimized code for Intel Itanium, Intel Itanium 2, Intel Pentium 4, Intel Xeon and Intel Pentium M processors.

    This compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools, that is, it can be integrated into the corresponding development environments. In addition, it allows 64-bit applications for operating systems running on Itanium/Itanium 2 processors to be developed from Microsoft Visual Studio on a 32-bit Pentium processor using the 64-bit Intel Fortran Compiler. When debugging code, this compiler allows the use of a debugger for the Microsoft .NET platform.

    If you have Compaq Visual Fortran 6.6 installed, you can use it instead of the original Intel Fortran Compiler 7.1, since these compilers are compatible at the source code level.

    Intel Fortran Compiler 7.1 for Windows is fully compatible with the ISO Fortran 95 standard and supports the creation and debugging of applications containing code in two languages: C and Fortran.

    Compilers for the Linux platform

    Intel C++ Compiler 7.1 for Linux

    Another compiler released at the beginning of the year, Intel C++ Compiler 7.1 for Linux, achieves a high degree of code optimization for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the GNU C compiler at the level of source code and object modules, which allows applications created with GNU C to be migrated to it without additional costs. The Intel C++ Compiler also supports the C++ ABI (the standard binary interface for compiled C++ code), which means full compatibility with the gcc 3.2 compiler at the binary level. Finally, with Intel C++ Compiler 7.1 for Linux you can even recompile the Linux kernel after making a few minor changes to its source code.

    Intel Fortran Compiler 7.1 for Linux

    Intel Fortran Compiler 7.1 for Linux creates optimized code for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the Compaq Visual Fortran 6.6 compiler at the source code level, so applications created with Compaq Visual Fortran can be recompiled with it, thereby increasing their performance.

    In addition, this compiler is compatible with developer utilities such as the emacs editor, the gdb debugger and the make build utility.

    Like the Windows version, Intel Fortran Compiler 7.1 for Linux fully complies with the ISO Fortran 95 standard and supports the creation and debugging of applications containing code in two languages: C and Fortran.

    It should be especially emphasized that a significant contribution to the creation of these Intel compilers was made by specialists of the Intel Russian Software Development Center in Nizhny Novgorod. More information about Intel compilers can be found on the Intel website at www.intel.com/software/products/.

    The second part of this article will be devoted to Intel compilers that create applications for mobile devices.