Introduction

In late 2003, Intel introduced version 8.0 of its compiler collection. The new compilers are designed to improve the performance of applications running on servers, desktop PCs and mobile systems (laptops, mobile phones and pocket computers) based on Intel processors. It is pleasant to note that this product was created with the active participation of the Nizhny Novgorod Intel Software Development Center and of Intel specialists from Sarov.

The new series includes Intel C++ and Fortran compilers for Windows and Linux, as well as Intel C++ compilers for Windows CE .NET. The compilers target systems based on the following Intel processors: Intel Itanium 2, Intel Xeon, Intel Pentium 4, processors with Intel Personal Internet Client Architecture for mobile phones and Pocket PCs, and the Intel Pentium M processor (a component of Intel Centrino mobile technology).

The Intel Visual Fortran Compiler for Windows provides next-generation compilation technologies for high-performance computing. It combines the language features of Compaq Visual Fortran (CVF) with the performance improvements made possible by Intel's compilation and code-generation technologies, simplifying the task of porting source code developed with CVF to the Intel Visual Fortran environment. This compiler implements CVF language features for the first time both for 32-bit Intel systems and for systems based on Intel Itanium family processors running Windows. In addition, it makes CVF language features available on Linux systems based on 32-bit Intel processors and Intel Itanium family processors. In 2004 an expanded version of this compiler is planned for release - the Intel Visual Fortran Compiler Professional Edition for Windows, which will include the IMSL Fortran 5.0 library developed by Visual Numerics, Inc.


"The new compilers also support future Intel processors, codenamed Prescott, which include new graphics and video performance commands and other performance enhancements. They also support new technology Mobile MMX(tm), which similarly improves the performance of graphics, audio and video applications for mobile phones and pocket PCs, noted Alexey Odinokov, co-director of the Intel Software Development Center in Nizhny Novgorod. - These compilers provide application developers with a unified package tools to build new applications for wireless networks based on Intel architecture. The new Intel compilers also support Intel's Hyper-Threading technology and the OpenMP 2.0 industry specification, which defines the use of high-level directives to control instruction flow in applications."

The compilers also include new tools: Intel Code Coverage and Intel Test Prioritization. Together, these tools can speed up application development and improve application quality by improving the software testing process.

The Code Coverage tool reports, for a given test run, which parts of the application logic were exercised and where the covered areas are located in the source code. If changes are made to the application, or if the existing tests do not cover the part of the application that interests the developer, the Test Prioritization tool makes it possible to check the operation of the selected section of program code.

New Intel compilers are available in different configurations, costing from $399 to $1,499. They can be purchased today from Intel or from resellers around the world, a list of which is located on the website http://www.intel.com/software/products/reseller.htm#Russia.

Prescott processor support

Support for the Intel Pentium 4 (Prescott) processor in the eighth version of the compiler is as follows:

1. Support for SSE3 instructions (also known as PNI, Prescott New Instructions). Three levels of support can be distinguished here:

a. Assembly inserts (inline assembly). For example, the compiler recognizes the SSE3 instruction in an insert such as _asm { addsubpd xmm0, xmm1 }. In this way, users interested in low-level optimization get direct access to assembly instructions.

b. In the C/C++ compiler, the new instructions are available at a higher level than assembly inserts, namely through built-in (intrinsic) functions:

Built-in function      Generated instruction
_mm_addsub_ps          addsubps
_mm_hadd_ps            haddps
_mm_hsub_ps            hsubps
_mm_moveldup_ps        movsldup
_mm_movehdup_ps        movshdup
_mm_addsub_pd          addsubpd
_mm_hadd_pd            haddpd
_mm_hsub_pd            hsubpd
_mm_loaddup_pd         movddup xmm, m64
_mm_movedup_pd         movddup reg, reg
_mm_lddqu_si128        lddqu

The table shows the built-in functions and the corresponding assembly instructions from the SSE3 set. The same kind of support exists for the MMX, SSE and SSE2 instruction sets. This lets the programmer perform low-level optimization without resorting to assembly language: the compiler itself takes care of mapping the built-in functions to the corresponding processor instructions and of using registers optimally, so the programmer can concentrate on an algorithm that makes efficient use of the new instruction sets.
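
As an illustration (a minimal sketch, not taken from the article), one of the intrinsics listed above might be used as follows; the code assumes a compiler and processor with SSE3 support (for example, a build with -QxP):

    #include <stdio.h>
    #include <pmmintrin.h>   /* SSE3 intrinsics */

    int main(void)
    {
        __m128d a = _mm_set_pd(2.0, 1.0);   /* a = {1.0, 2.0} (elements 0 and 1) */
        __m128d b = _mm_set_pd(0.5, 0.5);   /* b = {0.5, 0.5} */

        /* _mm_addsub_pd maps to the addsubpd instruction:
           element 0 is subtracted, element 1 is added */
        __m128d r = _mm_addsub_pd(a, b);

        double out[2];
        _mm_storeu_pd(out, r);
        printf("%f %f\n", out[0], out[1]);  /* expected: 0.500000 2.500000 */
        return 0;
    }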

c. Automatic generation of the new instructions by the compiler. The previous two methods require the programmer to use the new instructions explicitly, but the compiler is also capable (given the appropriate options - see the Performance section below) of automatically generating SSE3 instructions for C/C++ and Fortran code. One example is the optimized unaligned load instruction (lddqu), whose use can yield a performance gain of up to 40% (for example, in video and audio encoding tasks). Other SSE3 instructions give a significant speedup in 3D graphics or in computations with complex numbers. For example, the graph in section 1.1 below shows that for the 168.wupwise application from the SPEC CPU2000 FP suite, the speedup obtained from automatic SSE3 instruction generation was about 25%; the performance of this application depends heavily on the speed of complex arithmetic.

2. Use of the microarchitectural advantages of the Prescott processor. When generating code, the compiler takes into account the microarchitectural changes in the new processor. For example, certain operations (such as integer shifts, integer multiplication, or conversion between floating-point formats in SSE2) have become faster on the new processor than on its predecessors (an integer shift now takes one processor cycle versus four on the previous version of the Intel Pentium 4), and more intensive use of such instructions can speed up applications significantly.
Another example of a microarchitectural change is the improved store forwarding mechanism (fast loading of data recently stored to memory): the data is forwarded not from the cache but from an intermediate store buffer, which allows very fast access to it. This feature makes it possible, for example, to apply more aggressive automatic vectorization of program code.
The compiler also takes into account the increased sizes of the first- and second-level caches.

3. Improved support for Hyper-Threading technology. This point is closely related to the previous one - microarchitectural changes and their use by the compiler. For example, the runtime library that implements support for the OpenMP industry specification has been optimized for the new processor.

Performance

Using the compilers is a simple and effective way to take advantage of Intel processor architectures. Below, two (admittedly rough) ways of using the compilers are distinguished: a) recompiling programs, possibly with changes to the compiler settings; b) recompiling with changes both to the compiler settings and to the source code, guided by the compiler's optimization diagnostics and possibly using other software (for example, profilers).


1.1 Optimizing programs using recompilation and changing compiler settings


Often the first step in migrating to a new optimizing compiler is to use it with its default settings; the next logical step is to use options for more aggressive optimization. Figures 1, 2, 3 and 4 show the effect of switching to Intel compiler version 8.0 compared with other industry-leading products (-O2 denotes the default compiler settings, base the settings for maximum performance). The comparison is made on 32- and 64-bit Intel architectures, with applications from SPEC CPU2000 used as the test set.


Figure 1




Figure 2




Figure 3




Figure 4


Some of the options supported by the Intel compiler are listed below. (The options are given for the Windows OS family; for the Linux OS family there are options with the same effect, though the names may differ: for example, -Od or -QxK for Windows correspond to -O0 or -xK for Linux respectively. More detailed information can be found in the compiler manual.)


Controlling optimization levels: the options -Od (no optimization; used for debugging), -O1 (maximum speed while minimizing code size), -O2 (optimization for execution speed; applied by default), -O3 (the most aggressive speed optimizations; in some cases these can have the opposite effect, i.e. a slowdown; note that on IA-64 the use of -O3 leads to acceleration in most cases, while the positive effect on IA-32 is less pronounced). Examples of optimizations enabled by -O3: loop interchange, loop fusion, loop distribution (the inverse of loop fusion), software prefetching of data. A slowdown under -O3 can occur when the compiler has chosen aggressive optimizations heuristically without sufficient information about the program (for example, generating prefetch instructions for data used in a loop on the assumption that the loop executes many times, when in fact it has only a few iterations). Interprocedural optimization, profile-guided optimization, and the various programmer "hints" described in section 1.2 can help in this situation.
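
As a rough illustration (a hypothetical fragment, not from the article), the following loop nest is the kind of code that -O3 may restructure: the inner loop walks down the columns of row-major arrays, so interchanging the two loops improves cache locality:

    #include <stdio.h>
    #define N 512

    static double a[N][N], b[N][N];

    int main(void)
    {
        int i, j;

        /* column-wise traversal of row-major arrays: a candidate for loop interchange */
        for (j = 0; j < N; ++j)
            for (i = 0; i < N; ++i)
                a[i][j] = 2.0 * b[i][j] + 1.0;

        printf("%f\n", a[N - 1][N - 1]);
        return 0;
    }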

Interprocedural optimization: -Qip (within one file) and -Qipo (across several or all project files). This includes optimizations such as inline substitution of frequently used code (reducing the cost of function/procedure calls). It also supplies information to other optimization stages - for example, the upper bound of a loop (say, when it is a compile-time constant defined in one file but used in many), or information about data alignment in memory (many MMX/SSE/SSE2/SSE3 instructions work faster if their operands are aligned to an 8- or 16-byte boundary). The analysis of memory-allocation routines (implemented or called in one of the project files) is propagated to the functions and procedures where that memory is used; this can allow the compiler to drop the conservative assumption that the data is not properly aligned (the assumption must be conservative in the absence of additional information). Another example is disambiguation, the analysis of data aliasing: in the absence of additional information, and when it cannot prove that memory regions do not overlap, the compiler conservatively assumes that they may. Such a decision can hurt optimizations such as automatic vectorization on IA-32 or software pipelining (SWP) on IA-64; interprocedural optimization can help establish whether memory regions actually overlap.

Profile-guided optimization: this involves three stages. 1) Generation of instrumented code with the -Qprof_gen option. 2) The resulting code is run on representative data, during which information is collected about various characteristics of execution (for example, branch probabilities or the typical number of loop iterations). 3) Recompilation with the -Qprof_use option, which makes the compiler use the information collected in the previous step. The compiler can thus rely not only on static estimates of important program characteristics but also on data obtained during actual execution. This helps in the subsequent selection of optimizations (for example, a more efficient layout of the different branches of the program in memory, based on how often each branch was executed, or optimization of a loop based on its typical number of iterations). Profile-guided optimization is especially useful when a small but representative data set can be chosen for step 2 that illustrates the most typical future uses of the program. In some subject areas selecting such a representative set is entirely possible; profile-guided optimization is used, for example, by DBMS developers.

The optimizations listed above are of the generic type, i.e. the generated code will work on all processors of the family (in the case of the 32-bit architecture, on all of the following processors: Intel Pentium III, Pentium 4, including the Prescott core, and Intel Pentium M). There are also optimizations for specific processors.

Processor-specific optimizations: -QxK (Pentium III; use of SSE instructions and microarchitecture features), -QxW and -QxN (Pentium 4; SSE and SSE2 instructions, microarchitecture features), -QxB (Pentium M; SSE and SSE2 instructions, microarchitecture features), -QxP (Prescott; SSE, SSE2 and SSE3 instructions, microarchitecture features). Code generated with such options may not work on other members of the processor line (for example, -QxW code may execute an invalid instruction on a system based on an Intel Pentium III processor), or may not run with maximum efficiency (for example, -QxB code on a Pentium 4, because of differences in microarchitecture). With these options it is also possible to use runtime libraries optimized for a specific processor and its instruction set. To check that the code is actually running on the target processor, a dispatch mechanism (cpu dispatch) is provided: the processor is checked during program execution. This mechanism may or may not be activated, depending on the situation. Dispatch is always used when one of the -Qax{K,W,N,P} options is given: two versions of the code are generated, one optimized for the specific processor and one generic, and the choice between them is made at run time. Thus, at the cost of a larger code size, the program runs on all processors of the line and runs optimally on the target processor. Another approach is to optimize the code for an earlier member of the line and use that code on it and on its successors. For example, -QxN code can run on a Pentium 4 with either a Northwood or a Prescott core, with no increase in code size; this gives good, though not optimal, performance on a Prescott system (since SSE3 is not used and the microarchitectural differences are not taken into account) together with optimal performance on Northwood. Similar options exist for IA-64 processors; at the moment there are two of them: -G1 (Itanium) and -G2 (Itanium 2; the default).

The graph below (Figure 5) shows the speedup (relative to 1, i.e. no speedup) from using some of the optimizations listed above (namely -O3 -Qipo -Qprof_use -Qx{N,P}) on the Prescott processor, compared with the default settings (-O2). Using -QxP helps in some cases to obtain a speedup over -QxN. The greatest speedup is achieved in the 168.wupwise application already mentioned in the previous section (thanks to intensive optimization of complex arithmetic with SSE3 instructions).


Figure 5


Figure 6 below shows the ratio of the speed of code built with the best settings to that of completely unoptimized code (-Od) on Pentium 4 and Itanium 2 processors. It can be seen that Itanium 2 depends far more strongly on the quality of optimization. This is especially pronounced for floating-point (FP) computations, where the ratio is approximately 36x. Floating-point computation is a strong point of the IA-64 architecture, but it requires a careful approach to choosing the most effective compiler settings; the resulting performance gain repays the effort spent finding them.


Figure 6. Speedup with Best SPEC CPU2000 Optimization Options


Intel compilers support the OpenMP industry specification for creating multi-threaded applications. Both explicit (the -Qopenmp option) and automatic (-Qparallel) parallelization modes are supported. In explicit mode the programmer is responsible for the correct and efficient use of the OpenMP facilities. With automatic parallelization the compiler carries the additional burden of analyzing the program code, which is why automatic parallelization currently works effectively only on fairly simple code.
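
A minimal sketch of the explicit mode (assuming a build with -Qopenmp on Windows or -openmp on Linux; the loop and data are purely illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        double sum = 0.0;
        int i;

        /* the programmer explicitly marks the loop as parallel and declares the reduction */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; ++i) {
            a[i] = 0.5 * i;
            sum += a[i];
        }

        printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }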

The graph in Figure 7 shows the speedup from explicit parallelization on a pre-production system based on an Intel Pentium 4 (Prescott) processor with Hyper-Threading technology: 2.8 GHz, 2 GB RAM, 8 KB L1 cache, 512 KB L2 cache. The test suite used is SPEC OMPM2001; this suite targets small and medium SMP systems, with memory consumption of up to two gigabytes. The applications were compiled with Intel 8.0 C/C++ and Fortran using two sets of options, -Qopenmp -Qipo -O3 -QxN and -Qopenmp -Qipo -O3 -QxP, and each build was run with Hyper-Threading technology enabled and disabled. The speedup values in the graph are normalized to the performance of the single-threaded version with Hyper-Threading disabled.


Figure 7: SPEC OMPM2001 Applications on Prescott Processor


It can be seen that in 9 of the 11 cases, explicit parallelization with OpenMP gives a performance increase when Hyper-Threading technology is enabled. One application (312.swim) slows down; this is a known effect, as the application depends heavily on memory bandwidth. As in the SPEC CPU2000 case, the wupwise application benefits greatly from the Prescott-specific optimizations (-QxP).


1.2 Optimizing programs by making changes to the source text and using compiler diagnostics


In the previous sections we looked at the influence of the compiler (and its settings) on code execution speed. Intel compilers, however, provide broader opportunities for optimization than merely changing settings. In particular, they allow the programmer to place "hints" in the program code that help generate more efficient code. Below are some examples for C/C++ (Fortran has similar facilities that differ only in syntax).

#pragma ivdep (ivdep stands for "ignore vector dependencies") is placed before a loop to tell the compiler that there are no data dependences within it. This hint works when the compiler's analysis conservatively assumes that such dependences may exist (if the compiler can prove that a dependence does exist, the hint has no effect), while the author of the code knows that they cannot arise. With this hint the compiler can generate more efficient code: automatic vectorization for IA-32 (the use of vector instructions from the MMX/SSE/SSE2/SSE3 sets for loops in C/C++ and Fortran; this technique is described in more detail, for example, in the Intel Technology Journal), or software pipelining (SWP) for IA-64.
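
A small sketch of how the hint is placed (the function and the no-overlap guarantee are illustrative assumptions, not an example from the article):

    /* The caller guarantees that dst and src never overlap. The compiler cannot
       prove this from the code alone, so without the pragma it would conservatively
       assume a possible dependence and might refuse to vectorize the loop. */
    void scale(double *dst, const double *src, int n)
    {
        int i;
        #pragma ivdep
        for (i = 0; i < n; ++i)
            dst[i] = 2.0 * src[i];
    }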

#pragma vector always is used to override the compiler's decision that vectorizing a loop (automatic vectorization on IA-32, SWP on IA-64) is not worthwhile, a decision based on analysis of the quantitative and qualitative characteristics of the work done in each iteration.
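
A sketch of where such an override might be applied (illustrative; whether vectorization actually pays off here depends on the processor):

    /* The non-unit stride in src makes the compiler's profitability estimate
       pessimistic; #pragma vector always asks it to vectorize anyway. */
    void gather_even(float *dst, const float *src, int n)
    {
        int i;
        #pragma vector always
        for (i = 0; i < n; ++i)
            dst[i] = src[2 * i];
    }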

#pragma novector has the opposite effect of #pragma vector always.

#pragma vector aligned tells the compiler that the data used in the loop is aligned to a 16-byte boundary. This allows more efficient and/or more compact code to be generated (thanks to the absence of runtime alignment checks).

#pragma vector unaligned has the opposite effect of #pragma vector aligned. A performance gain can hardly be expected in this case, but the code may be more compact.

#pragma distribute point is used inside a loop so that the compiler can split the loop (loop distribution) at that point into several smaller ones. Such a hint can be used, for example, when the compiler fails to vectorize the original loop automatically (say, because of a data dependence that cannot be ignored even with #pragma ivdep), whereas each (or some) of the newly formed loops can be vectorized effectively.
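
A sketch of the idea (a hypothetical loop; the pragma spelling follows the article, and some compiler versions write it as distribute_point):

    /* The first statement carries a genuine loop-carried dependence and cannot be
       vectorized; the second is independent. Splitting the loop at the marked point
       lets the compiler vectorize the copy part separately. */
    void running_sum_and_copy(double *acc, double *out, const double *c, int n)
    {
        int i;
        for (i = 1; i < n; ++i) {
            acc[i] = acc[i - 1] + c[i];   /* recurrence: stays scalar */
            #pragma distribute point
            out[i] = 2.0 * c[i];          /* independent: vectorizable after distribution */
        }
    }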

#pragma loop count (N) tells the compiler that the most likely number of iterations of the loop is N. This information helps it choose the most effective optimizations for the loop (for example, whether to unroll it, whether SWP or automatic vectorization is worthwhile, whether software prefetch instructions are needed, and so on).

The "hint" _assume_aligned(p, base) is used to tell the compiler that the memory region associated with pointer p is aligned to a boundary of base = 2^n bytes.

This is not a complete list of various compiler "hints" that can significantly affect the efficiency of the generated code. You may wonder how to determine that the compiler needs a hint.

First, you can use the compiler's diagnostic reports. For example, the -Qvec_reportN option (where N ranges from 0 to 3 and sets the level of detail) produces an automatic vectorization report. The programmer then sees which loops were vectorized and which were not; in the latter case the compiler states in the report the reasons why vectorization failed. Suppose the cause was a conservatively assumed data dependence: if the programmer is sure that the dependence cannot arise, #pragma ivdep can be used. The compiler provides similar capabilities on IA-64 (analogous to -Qvec_reportN on IA-32) for monitoring the presence and effectiveness of SWP. In general, Intel compilers offer extensive diagnostics of the optimizations they perform.

Second, other software products (such as the Intel VTune profiler) can be used to find performance bottlenecks in the code. The results of the analysis can help the programmer make necessary changes.

You can also use the assembly code listing generated by the compiler for analysis.


Figure 8


Figure 8 above shows the step-by-step process of optimizing an application with the Intel Fortran compiler (and other Intel software products) for the IA-64 architecture. The example considered is the 48-hour non-adiabatic regional forecast scheme of the Russian Hydrometeorological Centre (described, for example, in the article referenced; the article mentions a calculation time of about 25 minutes, but significant changes have occurred since it was written). The performance of the code on a Cray Y-MP system is taken as the reference point. The unmodified code with the default compiler options (-O2) showed a 20% performance increase on a four-processor system based on 900 MHz Intel Itanium 2 processors. Applying more aggressive optimization (-O3) gave a speedup of about 2.5x without any code changes, mainly thanks to SWP and data prefetching. Analysis using the compiler's diagnostics and the Intel VTune profiler revealed several bottlenecks. For example, the compiler had not software-pipelined several performance-critical loops, reporting that it suspected a data dependence; minor code changes (the ivdep directive) made efficient pipelining possible. With the VTune profiler it was also discovered (and the compiler report confirmed) that the compiler had not interchanged nested loops (loop interchange) for more efficient use of the cache, again because of conservative assumptions about data dependences. After changes to the source code, a 4x speedup over the initial version was achieved. Explicit parallelization with OpenMP directives, followed by a move to a system with more processors running at a higher frequency, reduced the calculation time to under 8 minutes, a more than 16x speedup over the initial version.

Intel Visual Fortran

Intel Visual Fortran 8.0 uses the CVF front-end (the part of the compiler responsible for converting the program text into the compiler's internal representation, which is largely independent of both the source language and the target machine) together with the Intel compiler components responsible for optimization and code generation.


Figure 9




Figure 10


Figures 9 and 10 show graphs comparing the performance of Intel Visual Fortran 8.0 with the previous version, Intel Fortran 7.1, and with other industry-popular compilers for this language running under Windows and Linux. The comparison uses tests whose source code, conforming to the F77 and F90 standards, is available at http://www.polyhedron.com/. More detailed compiler comparisons are available on the same site (Win32 Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks and Linux Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks): more compilers are shown, and the geometric mean is given together with the individual results of each test.


Intel C++ Compiler

Main features:

  • Vectorization for SSE, SSE2, SSE3, SSE4

The compiler supports the OpenMP 3.0 standard for writing parallel programs. It also contains a modification of OpenMP called Cluster OpenMP, with which applications written for OpenMP can be run on clusters over MPI.

Intel C++ Compiler uses the frontend (the part of the compiler that parses the compiled program) from Edison Design Group. The same frontend is used by the SGI MIPSpro, Comeau C++, and Portland Group compilers.

This compiler is widely used for compiling SPEC CPU benchmarks.

There are 4 series of products from Intel containing the compiler:

  • Intel C++ Compiler Professional Edition
  • Intel Cluster Toolkit (Compiler Edition)

Disadvantages of the Linux version of the compiler include partial incompatibility with the GNU extensions to the C language (supported by the GCC compiler), which may cause problems when compiling some programs.

Experimental versions

The following experimental versions of the compiler were published:

  • Intel STM Compiler Prototype Edition, dated September 17, 2007, with support for Software Transactional Memory (STM). Released for Linux and Windows, IA-32 only (x86 processors);
  • Intel Concurrent Collections for C/C++ 0.3, from September 2008. Contains mechanisms that make it easier to write parallel C++ programs.

Basic flags

Windows       Linux, Mac OS X   Description
/Od           -O0               Disable optimizations
/O1           -O1               Optimize to minimize executable file size
/O2           -O2               Optimize for speed; a standard set of optimizations is enabled
/O3           -O3               Enable all optimizations from O2, plus intensive loop optimizations
/Qip          -ip               Enable file-by-file interprocedural optimization
/Qipo         -ipo              Enable global (cross-file) interprocedural optimization
/QxO          -xO               Allow the use of the SSE3, SSE2 and SSE extensions on processors from any manufacturer
/fast         -fast             "Fast mode": equivalent to "/O3 /Qipo /QxHost /no-prec-div" on Windows and "-O3 -ipo -static -xHOST -no-prec-div" on Linux. Note that the -xHOST flag means optimization for the processor on which the compiler is running
/Qprof-gen    -prof_gen         Create an instrumented version of the program that will collect a performance profile
/Qprof-use    -prof_use         Use the profile information collected from runs of the program built with prof_gen


Intel C++ and Fortran compilers and MKL library

Along with the standard GNU compilers for Linux, the Intel C++ and Fortran compilers are installed on the clusters of the NIVC computing complex. Currently (beginning of 2006), compiler version 9.1 is installed on all clusters. This page describes the most important options and settings of these compilers, as well as their main differences from the GNU compilers. It is aimed mainly at users of the MSU Research Computing Center clusters, but may also be useful to other Russian-speaking users. Issues related to compilation for the IA-64 platform are not addressed here.

The Intel Math Kernel Library (MKL) version 8.0.2 is also installed on all clusters, in the /usr/mkl directory. Note that the lib directory contains the subdirectories 32, 64 and em64t. On the Ant cluster you need to use the libraries from the em64t subdirectory, and on the other clusters those from the 32 subdirectory. All necessary documentation and examples can be found in the /usr/mkl/doc directory.

Why were new compilers needed?

The need for new compilers arose mainly a) to support programming in Fortran 90, and b) to provide more powerful optimization of Fortran programs than the g77 compiler, which translates to C and then compiles with gcc.

PGI (Portland Group) compilers also meet these requirements, but the developer company refused to supply them to Russia.

How to use?

The Intel compilers are invoked with the commands icc (C or C++), icpc (C++) and ifort (Fortran 77/90). The mpicc, mpiCC and mpif77 commands for compiling and linking MPI programs are also configured to use the Intel compilers.

It is also possible to use GNU compilers using the mpigcc, mpig++ and mpig77 commands (Fortran 90 is not supported).

Input files

By default, files with the .cpp and .cxx extensions are treated as C++ source code and files with the .c extension as C source code; the icpc compiler, however, compiles .c files as C++ as well.

Files with the .f, .ftn and .for extensions are recognized as Fortran source code in fixed form, and .fpp and .F files are additionally passed through the Fortran preprocessor. Files with the .f90 extension are treated as Fortran 90/95 source code in free form. You can explicitly specify fixed or free form for Fortran programs with the -FI and -FR options respectively.

Files with the .s extension are recognized as IA-32 assembly language code.

Intel Compiler Features

Here we list the characteristics of the Intel compilers as stated by the developer in the user manual, with some comments of our own.

  • Significant optimization
    Apparently this means optimization of the code at a higher level, i.e. first of all various loop transformations, which almost all compilers do with greater or lesser success
  • Floating-point optimization
    Apparently this means, first of all, maximum use of the instructions implemented in hardware
  • Interprocedural optimizations
    i.e. global optimization of the entire program, as opposed to ordinary optimization, which affects only the code of individual functions
  • Profile-based optimization
    i.e. the ability to run a program in a test mode, collect data on the time it takes to pass through particular code fragments inside frequently used functions, and then use this data for optimization
  • Support for the SSE instruction set of Pentium III processors
    Note: for computational tasks the SSE2 instructions are of greater interest, i.e. vector operations on 64-bit real numbers, but they are supported only by Pentium 4 processors, which we do not yet have at our disposal
  • Automatic vectorization
    i.e., again, the use of the SSE and SSE2 instructions, inserted automatically by the compiler
  • OpenMP support for programming on SMP systems
    Note: on the cluster it is recommended to use the MPI interface first of all; widespread use of OpenMP on the cluster is not expected and such experiments have not yet been carried out, but it probably makes sense to use libraries (BLAS, etc.) that are parallelized for shared memory
  • Data prefetching
    i.e., apparently, the use of instructions that preload data from memory into the cache shortly before it is needed
  • "Dispatching" of code for different processors
    i.e. the ability to generate code for different processors in a single executable file, which makes it possible to exploit the latest processors for the highest performance on them while keeping binary compatibility with earlier processors; on our cluster this is not yet relevant, because only Pentium III processors are used and programs compiled on the cluster are not meant to be transferred and run on other machines

Basic compiler options

The most interesting options are, of course, those for code optimization. Most of the options are the same for the C++ and Fortran compilers. More detailed descriptions of the options can be found in the English-language user manuals.

Optimization levels

Option        Description
-O0           Disables optimization
-O1 or -O2    Basic optimization for speed. Inline expansion of library functions is disabled. For the C++ compiler these options give the same optimization; for the Fortran compiler -O2 is preferable, since it also includes loop unrolling.
-O3           More powerful optimizations, including loop transformations, data prefetching and use of OpenMP. For some programs an improvement over -O2 is not guaranteed. Makes sense to use together with the vectorization options -xK and -xW.
-unroll[n]    Enables loop unrolling up to n times.

Optimizations for a specific processor

Option        Description
-tpp6         Optimization for Pentium Pro, Pentium II and Pentium III processors
-tpp7         Optimization for Pentium 4 processors (this option is enabled by default for the IA-32 compiler)
-xM           Code generation using the MMX extensions specific to Pentium MMX, Pentium II and later processors
-xK           Code generation using the SSE extensions specific to Pentium III processors
-xW           Code generation using the SSE2 extensions specific to Pentium 4 processors

Interprocedural optimization

-ip           Interprocedural optimization within one file. The -ip_no_inlining option disables inline function expansion.
-ipo          Interprocedural optimization across different files

Profile-guided optimization

-prof_gen     Generates instrumented code that will be used for profiling, i.e. collecting data on how often particular places in the program are executed
-prof_use     Optimization based on the data obtained during the profiling stage. Makes sense to use together with the interprocedural optimization option -ipo.

Parallelization for SMP systems

-openmp       Enables support for the OpenMP 2.0 standard
-parallel     Enables automatic loop parallelization

Performance

According to the results of the SPEC CPU2000 tests published on the ixbt.com server, Intel compilers version 6.0 were almost universally better than gcc versions 2.95.3, 2.96 and 3.1, and than PGI version 4.0.2. Those tests were run in 2002 on a computer with a Pentium 4/1.7 GHz processor under RedHat Linux 7.3.

According to tests conducted by Polyhedron, the Intel Fortran compiler version 7.0 was almost universally superior to the other Fortran 77 compilers for Linux (Absoft, GNU, Lahey, NAG, NAS, PGI); only in some tests was it slightly inferior to the Absoft, NAG and Lahey compilers. Those tests were run on a computer with a Pentium 4/1.8 GHz processor under Mandrake Linux 8.1.

Intel compilers version 9.1 also outperform gcc compilers, and show performance comparable to Absoft, PathScale and PGI.

We will be grateful to those users and readers who send us data on the impact of the choice of compiler (GCC or Intel) and optimization options on the speed of work on their real-life problems.

Libraries

The C language compiler uses the runtime library developed within the GNU project (libc.a).

The following libraries are supplied with the Intel C++ compiler:

  • libcprts.a - the C++ runtime library, developed by Dinkumware.
  • libcxa.a - an additional C++ runtime library developed by Intel.
  • libimf.a - the Intel mathematical functions library, which includes optimized and high-precision implementations of trigonometric, hyperbolic, exponential, special, complex and other functions (see the list of functions for details).
  • libirc.a - runtime support for profiling (PGO) and for code dispatch depending on the processor (see above).
  • libguide.a - the OpenMP implementation.

This list contains the static libraries, but for most of them there are also dynamic versions, i.e. ones linked at program startup (.so).

The following libraries are supplied with the Fortran compiler: libCEPCF90.a, libIEPCF90.a, libintrins.a and libF90.a; the mathematical functions library libimf.a is also used.

Building the executable file

Libraries can be linked statically (at build time) or dynamically (at program startup). The dynamic approach reduces the size of the executable file and allows the same copy of a library to be shared in memory, but it requires the full set of dynamic libraries used to be installed on every node where the programs will be run.

Thus, if you have installed the Intel compiler on your Linux machine and want to run the compiled executables on other machines, you need either to use a static build (which is easier) or to copy the Intel dynamic libraries to those machines (usually from a directory like /opt/intel/compiler70/ia32/lib) into one of the directories listed in the /etc/ld.so.conf file, and also to make sure that the same set of GNU/Linux dynamic libraries is installed on those machines.

By default, all Intel libraries (except libcxa.so) are linked statically, and all Linux system libraries and GNU libraries are linked dynamically. With the -static option you can force the linker to link all libraries statically (which will increase the size of the executable file), and with the -i_dynamic option you can link all Intel libraries dynamically.

When linking additional libraries with an option of the form -l<library>, you may also need the -L<directory> option to specify the path where the libraries are located.

With the -Bstatic and -Bdynamic options you can explicitly specify static or dynamic linking for each of the libraries given on the command line.

With the -c option, linking of the executable file is disabled and only compilation is performed (an object module is generated).

Sharing modules in Fortran and C

To use modules written in Fortran and C together, you need to agree on the naming of procedures in the object modules, on the passing of parameters, and on access to global variables, if there are any.

By default, the Intel Fortran compiler converts procedure names to lowercase and appends an underscore to the name. The C compiler never changes function names. Thus, if we want to call a function or procedure FNNAME implemented in C from a Fortran module, then in the C module it should be named fnname_.

The Fortran compiler supports the option -nus [filename], which allows you to disable the addition of underscores to internal procedure names. If a file name is specified, this is done only for procedure names listed in the specified file.

By default, Fortran passes parameters by reference, while C always passes them by value. Thus, when calling a Fortran procedure from a C module, we must pass pointers to the variables containing the values of the actual parameters. When writing a C function that will be called from a Fortran module, we must declare its formal parameters as pointers to the corresponding types.
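
A minimal sketch of these conventions (the Fortran routine axpy below is an assumed example compiled separately with ifort; from C it is visible as axpy_, and every argument is passed by address):

    /* Assumed Fortran side, in a separate file built with ifort:
     *
     *       subroutine axpy(n, a, x, y)
     *       integer n
     *       real*8 a, x(n), y(n)
     *       integer i
     *       do i = 1, n
     *          y(i) = y(i) + a * x(i)
     *       end do
     *       end
     */
    #include <stdio.h>

    void axpy_(int *n, double *a, double *x, double *y);  /* lowercase name + trailing underscore */

    int main(void)
    {
        int n = 3;
        double a = 2.0;
        double x[3] = {1.0, 2.0, 3.0};
        double y[3] = {10.0, 10.0, 10.0};

        axpy_(&n, &a, x, y);   /* all parameters are passed by reference */

        printf("%f %f %f\n", y[0], y[1], y[2]);   /* expected: 12 14 16 */
        return 0;
    }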

In C modules, it is possible to use COMMON blocks defined inside Fortran modules (for more information, see Intel Fortran Compiler User's Guide, chapter Mixing C and Fortran).

Sharing Intel and GCC compilers

C object modules produced by the Intel C++ compiler are compatible with modules produced by the GCC compiler and the GNU C library. Thus, these modules can be used together in a single program compiled using the icc or gcc commands, but it is recommended to use icc to correctly include Intel libraries.

The Intel compiler supports a number of non-standard C language extensions used by the GNU project and supported by the GCC compiler (but not all of them, see here for more details).

The user manual does not say anything about the compatibility of object modules in the C++ and Fortran languages; apparently, it is not supported.

Standards support

Intel C++ Compiler 7.0 for Linux supports the ANSI/ISO C standard (ISO/IEC 9899/1990). Strict compatibility with the ANSI C standard (-ansi) or with an extended ANSI C dialect (-Xa) can be requested. When using the -c99 option ...

  • Compiler manuals in HTML format (available "online" on our server, but require Java support)
    • Intel C++ Compiler User's Guide.
    • Intel Fortran Compiler User's Guide.
  • Compiler manuals in English in PDF format (require Acrobat Reader; download the PDF files to your computer)
    • Intel C++ Compiler User's Guide (1.3 MB, 395 pages).
    • Intel Fortran Compiler User's Guide (1.1 MB, 285 pages).
    • Intel Fortran Programmer's Reference (7 MB, 566 pages).
    • Intel Fortran Libraries Reference Manual (9.5 MB, 881 pages).
  • Intel Application Debugger Guide.
  • Comparison of compilers on the SPEC CPU2000 tests (article on ixbt.com in Russian).
  • The Polyhedron website presents comparison results for various compilers.
    In the previous issue of the magazine we discussed products from the Intel VTune Performance Analyzer family - performance analysis tools that are deservedly popular among application developers and that detect the sections of application code which consume too much CPU time, giving developers the opportunity to identify and eliminate the potential bottlenecks associated with such code and thereby speed up application development. Note, however, that the performance of applications largely depends on how efficient the compilers used to build them are and which hardware features they exploit when generating machine code.

    The latest versions of the Intel C++ and Intel Fortran compilers for the Windows and Linux operating systems allow you to gain up to 40% in application performance on systems based on Intel Itanium 2, Intel Xeon and Intel Pentium 4 processors, compared with existing compilers from other manufacturers, by exploiting features of these processors such as Hyper-Threading technology.

    Distinctive code-optimization features of this family of compilers include the use of the stack for floating-point operations, interprocedural optimization (IPO), profile-guided optimization (PGO), data prefetching into the cache (which avoids the latency associated with memory access), support for the characteristic features of Intel processors (for example, the Intel Streaming SIMD Extensions 2 introduced with the Intel Pentium 4), automatic parallelization of code, the creation of applications that run on several different types of processors while being optimized for one of them, branch prediction facilities, and extended support for working with execution threads.

    Note that Intel compilers are used by such well-known companies as Alias/Wavefront, Oracle, Fujitsu Siemens, ABAQUS, Silicon Graphics and IBM. According to independent testing conducted by a number of companies, the performance of Intel compilers is significantly higher than that of compilers from other manufacturers (see, for example, http://intel.com/software/products/compilers/techtopics/compiler_gnu_perf.pdf).

    Below we look at some features of the latest versions of the Intel compilers for desktop and server operating systems.

    Compilers for the Microsoft Windows platform

    Intel C++ Compiler 7.1 for Windows

    Intel C++ Compiler 7.1, released earlier this year, achieves a high degree of code optimization for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Xeon processors, as well as for the Intel Pentium M processor, part of Intel Centrino technology and intended for use in mobile devices.

    The compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools: it can be integrated into the corresponding development environments.

    This compiler supports ANSI and ISO C/C++ standards.

    Intel Fortran Compiler 7.1 for Windows

    Intel Fortran Compiler 7.1 for Windows, also released earlier this year, creates optimized code for Intel Itanium, Intel Itanium 2, Intel Pentium 4, Intel Xeon and Intel Pentium M processors.

    This compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools, that is, it can be integrated into the corresponding development environments. In addition, it allows 64-bit applications for operating systems running on Itanium/Itanium 2 processors to be developed from Microsoft Visual Studio on a 32-bit Pentium processor using the 64-bit Intel Fortran Compiler. When debugging code, this compiler allows the use of a debugger for the Microsoft .NET platform.

    If you have Compaq Visual Fortran 6.6 installed, you can use it instead of the original Intel Fortran Compiler 7.1, since these compilers are compatible at the source code level.

    Intel Fortran Compiler 7.1 for Windows is fully compatible with the ISO Fortran 95 standard and supports the creation and debugging of applications containing code in two languages: C and Fortran.

    Compilers for the Linux platform

    Intel C++ Compiler 7.1 for Linux

    Another compiler released at the beginning of the year, Intel C++ Compiler 7.1 for Linux, achieves a high degree of code optimization for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the GNU C compiler at the level of source code and object modules, which allows applications created with GNU C to be migrated to it without additional costs. The Intel C++ Compiler also supports the C++ ABI (the standard binary interface for compiled C++ code), which means full compatibility with the gcc 3.2 compiler at the binary level. Finally, with Intel C++ Compiler 7.1 for Linux you can even recompile the Linux kernel after making a few minor changes to its source code.

    Intel Fortran Compiler 7.1 for Linux

    Intel Fortran Compiler 7.1 for Linux creates optimized code for Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the Compaq Visual Fortran 6.6 compiler at the source code level, so applications created with Compaq Visual Fortran can be recompiled with it, thereby increasing their performance.

    In addition, this compiler is compatible with developer utilities such as the emacs editor, the gdb debugger and the make build utility.

    Like the Windows version, Intel Fortran Compiler 7.1 for Linux fully complies with the ISO Fortran 95 standard and supports the creation and debugging of applications containing code in two languages: C and Fortran.

    It should be especially emphasized that a significant contribution to the creation of these Intel compilers was made by specialists of the Intel Russian Software Development Center in Nizhny Novgorod. More information about Intel compilers can be found on the Intel website at www.intel.com/software/products/.

    The second part of this article will be devoted to Intel compilers that create applications for mobile devices.