Although VTune has many built-in profiles, it does not yet have a dedicated profile for measuring FLOPS. But nothing stops us from creating our own custom profile in 30 seconds. I will skip the basics of working with the VTune interface (you can learn them from the Getting Started Tutorial that ships with it) and go straight to creating a profile and collecting data.

  1. Create a new project and specify our matrix application as the target application.
  2. Select the Lightweight Hotspots profile (which uses Hardware Event-based Sampling technology) and copy it to create a custom profile. We will call it My FLOPS Analysis.
  3. Edit the profile and add new processor event counters (Events) for the Sandy Bridge processor. Let's look at them in a little more detail. Their names encode the execution unit (x87, SSE, AVX) and the type of data the operation was performed on. Every processor cycle, each counter accumulates the number of computational operations issued for execution. Just in case, we added counters for all possible FP operations:
  • FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE – vectors (PACKED) of double-precision (DOUBLE) data
  • FP_COMP_OPS_EXE.SSE_PACKED_SINGLE – vectors of single-precision data
  • FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE – scalar DP data
  • FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE – scalar SP data
  • SIMD_FP_256.PACKED_DOUBLE – AVX vectors of DP data
  • SIMD_FP_256.PACKED_SINGLE – AVX vectors of SP data
  • FP_COMP_OPS_EXE.x87 – x87 scalar data
All we have to do is run the analysis and wait for the results. In the results obtained, switch to the Hardware Events viewpoint and copy the number of events collected for the function multiply3: 34,648,000,000.

Next, we simply calculate FLOPS using the formulas below. Our data was collected across all processors, so there is no need to multiply by their number. An AVX double-precision operation works on four 64-bit DP operands in a 256-bit register at once, so we multiply by a factor of 4; for single-precision data the factor is 8. In the last formula there is no factor, since x87 coprocessor operations are performed only on scalar values. If a program executes several different types of FP operations, their counts, each multiplied by its factor, are summed to obtain the resulting FLOPS.

FLOPS = 4 * SIMD_FP_256.PACKED_DOUBLE / Elapsed Time
FLOPS = 8 * SIMD_FP_256.PACKED_SINGLE / Elapsed Time
FLOPS = (FP_COMP_OPS_EXE.x87) / Elapsed Time

In our program, only AVX instructions were executed, so the results contain a value for only one counter, SIMD_FP_256.PACKED_DOUBLE.
Switching to Source View confirms that the events were collected for our loop in the function multiply3. The calculation then gives:

FLOPS = 4 * 34.6 Gops / 7 s ≈ 19.7 GFlops
This value agrees well with the estimate calculated in the previous section. So, with sufficient accuracy, we can say that the results of the estimation method and the measurement method coincide. There are, however, cases where they may not match. If readers are interested, I can investigate those cases and describe more complex and more accurate methods. In return, I would really like to hear about your cases where you need to measure FLOPS in programs.
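The calculation in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flops_from_counters` is hypothetical, the AVX and x87 weights come from the formulas above, and the SSE weights (2, 4, 1, 1) are an assumption based on the 128-bit SSE register width, since the article gives no formulas for them. The 7 s elapsed time is the rounded value used above, so the result matches the quoted figure only to within rounding.

```python
# Floating point operations counted per event.
# AVX and x87 weights follow the formulas in the text; the SSE weights are
# an assumption derived from the 128-bit SSE register width.
FLOPS_PER_EVENT = {
    "SIMD_FP_256.PACKED_DOUBLE": 4,
    "SIMD_FP_256.PACKED_SINGLE": 8,
    "FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE": 2,
    "FP_COMP_OPS_EXE.SSE_PACKED_SINGLE": 4,
    "FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE": 1,
    "FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE": 1,
    "FP_COMP_OPS_EXE.x87": 1,
}


def flops_from_counters(event_counts, elapsed_seconds):
    """Sum the weighted event counts and divide by the elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_operations = sum(
        FLOPS_PER_EVENT[name] * count for name, count in event_counts.items()
    )
    return total_operations / elapsed_seconds


# Event count reported for multiply3; only one counter was non-zero.
measured = {"SIMD_FP_256.PACKED_DOUBLE": 34_648_000_000}
gflops = flops_from_counters(measured, elapsed_seconds=7.0) / 1e9
print(f"{gflops:.2f} GFLOPS")
```

For a program that mixes instruction types, the same dictionary would simply hold several non-zero counters.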

Conclusion
FLOPS is a unit of performance for computing systems that characterizes the maximum floating-point computing power of the system itself. FLOPS can be stated as theoretical, for systems that do not yet exist, or measured with benchmarks. Developers of high-performance programs, in particular solvers for systems of linear differential equations, evaluate their implementations by, among other things, the FLOPS value of the program, calculated from the theoretically or empirically known number of FP operations the algorithm requires and the measured execution time of a test. When the complexity of the algorithm makes it impossible to estimate the number of FP operations, they can be measured with the performance counters built into Intel microprocessors.
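The estimation approach, a known operation count divided by a measured run time, can be sketched as follows. This is a minimal illustration, not the article's benchmark: the helper names are hypothetical, and the operation count of a dense n×n matrix multiplication is taken as 2·n³ (one multiply and one add per inner-loop step). Pure Python is dominated by interpreter overhead, so the number it reports is far below what the hardware can do.

```python
import time


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_columns = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, column)) for column in b_columns]
        for row in a
    ]


def estimate_flops(n):
    """Time an n x n multiplication and divide 2*n^3 by the elapsed time."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    operations = 2 * n ** 3  # assumed count: n^3 multiplies plus n^3 adds
    return operations / elapsed


if __name__ == "__main__":
    print(f"{estimate_flops(100) / 1e6:.1f} MFLOPS (interpreted Python)")
```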

Almost everyone who knows something about SoCs gets into fierce arguments about whose smartphone, processor, or GPU is cooler. GPU power is in fact measured in FLOPS, a unit that shows how many floating point operations per second a GPU (and not only a GPU) can perform. If you are interested, read on!

Let's start with the most popular GPU, the Mali-400. It became well known for its performance and power consumption. Both powerful and easy on the battery, it has been used in many chips, from the NovaThor U8500 to the Exynos 4412. There are many variants of this GPU, differing in the number of cores. Below are a few smartphones that use it, with their GFLOPS figures.

Samsung Galaxy Ace 2 - Mali-400MP - 275MHz - 2.48Gflops
Samsung Galaxy S3 - Mali-400MP4 - 533MHz - 19.2Gflops

Quite a big difference, isn't it?
I will also dispel the myth that the higher the frequency, the more powerful the chip.

Take the Mali-450MP4 at 700MHz, which is found in the MT6592 and which, according to several Trashbox users, should beat even the not yet released Adreno 420. Its result is 41.8Gflops. That is a fairly big step forward compared to the Mali-400MP4, but the Adreno 330 at 450MHz reaches as much as 129.6Gflops, which is incredibly high. Moreover, its frequency is 250MHz lower than that of the Mali-450MP4. For comparison, the top-end PowerVR G6430 at 450MHz, which is found in the iPhone 5S and iPad Air, reaches 115.2Gflops. The most powerful Mali, the Mali-T628MP6 at 533MHz, which is in the Octa versions of the Samsung Galaxy Note 3, achieves 102.4Gflops.

Also, don't forget Tegra 4 and Tegra 4i. The GeForce ULP x72, which is installed in Tegra 4, scores 96.8Gflops, and its LTE brother with the GeForce ULP x60 achieves 79.2Gflops.

But this is where it gets most interesting, because the Adreno 330 also has a 550MHz version (which in the near future can be obtained using custom kernels), and this overclocked version reaches as much as 158.4Gflops! This is a record.
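The figures quoted so far make it easy to see why frequency alone says little. Dividing GFLOPS by the clock gives the number of floating point operations per cycle, which depends on how wide the GPU is. The sketch below uses only numbers from the text; the function names are hypothetical, and the linear scaling with clock speed is an assumption, although it does reproduce the 550MHz Adreno 330 figure.

```python
def flops_per_cycle(gflops, mhz):
    """Operations per clock cycle implied by a GFLOPS rating and a clock."""
    return gflops * 1000.0 / mhz


def scaled_gflops(gflops, mhz, new_mhz):
    """Rating at another clock, assuming throughput scales linearly."""
    return flops_per_cycle(gflops, mhz) * new_mhz / 1000.0


# (GFLOPS, MHz) pairs quoted in the text.
gpus = {
    "Mali-400MP4": (19.2, 533),
    "Mali-450MP4": (41.8, 700),
    "Adreno 330": (129.6, 450),
}

for name, (gflops, mhz) in gpus.items():
    print(f"{name}: {flops_per_cycle(gflops, mhz):.0f} operations per cycle")

overclocked = scaled_gflops(129.6, 450, 550)
print(f"Adreno 330 at 550MHz: {overclocked:.1f} Gflops")
```

The Mali-450MP4 has the highest clock of the three but the Adreno 330 does several times more work per cycle, which is where its lead comes from.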

Let's look at older GPUs, such as Adreno 320, Adreno 225, GeForce ULP x12 and PowerVR SGX544MP3 and SGX554MP4, and don't forget about the simple SGX544MP, which is found in the super popular MT6589 chip.

Let's also look at the Adreno 200, Adreno 203, Adreno 205, Adreno 220 and Adreno 305. The first four score as follows. The Adreno 200 reaches 3.92Gflops at a frequency of 245MHz, and the Adreno 203 reaches 7.84Gflops at the same 245MHz. As we can see, that is double the result at the same frequency.
The Adreno 205 is a continuation of the 203. It reaches 8.5Gflops, which is not very much, but the next GPU, the Adreno 220, breaks the stereotype about GPUs that are not top-end: an incredible 18Gflops, about the level of the Mali-400MP4 at 533MHz, which is found in the top-end Samsung Galaxy S3. Now let's look at the Adreno 305, which is a simplified version of the Adreno 320. This GPU is found in chips such as the Snapdragon S4 Plus and Snapdragon 400. It reaches 21.6Gflops at a frequency of 450MHz.

The Adreno 320 comes in two variants: the one found in the S4 Pro and the one found in the Snapdragon 600. They differ in the number of blocks: the S4 Pro version has 64 of them, and the 600 version has 96. The Adreno 320 in the S4 Pro reaches 57Gflops, and its S600 version reaches as much as 97.2Gflops at a frequency of 450MHz. This is even more than the GeForce ULP x72, so the 1.9GHz Snapdragon 600 has a more powerful GPU than the Tegra 4. A shocking result.

Let's look at the Adreno 225. At a frequency of 400MHz, it reaches 25.6Gflops. For comparison, the GeForce ULP x12, which is installed in Tegra 3, reaches 12.5Gflops at a frequency of 520MHz. So the Adreno 225 is more powerful than the GeForce ULP x12... Hmm... To be honest, the GeForce ULP x12 is even 5.5Gflops behind the Adreno 220...

Now let's move on to the PowerVR SGX544MP3, which is found in the Exynos 5410 or, more simply, in the Samsung Galaxy S4. Its performance is 51.1Gflops. Not the most powerful. The higher-end SGX554MP4, which served as the gaming basis for the iPad 4, produces 76.8Gflops. Much bigger.

But when I found out the performance of the SGX544MP, which is found in the MT6589 and MT6589T, I... never mind. The MT6589 has a version clocked at 286MHz, which produces only 9.2Gflops. That is very little, but still more than its younger brother, the MT6589M, which frankly deserves a failing grade: its accelerator runs at only 156MHz. To be honest, I don't want to talk about this processor, but I have to. It produces only 4.9Gflops, which is slightly better than the Adreno 200. The turbocharged MT6589T has an accelerator clocked at 357MHz, which gives it 11.4Gflops.

And now about consoles. The PSP, a favorite of many, produces only 2.6Gflops. Do you remember the incredible graphics of PSP games? And how smoothly they ran on it? The Adreno 330 is about 50 times more powerful than the PSP, yet the 50-fold increase is not felt. The PS Vita is a serious step forward in hardware. It has a PowerVR SGX543MP4+, which gives it an impressive 51.2Gflops.

And now about PlayStation and Xbox. The PS3 delivers 228.8Gflops, and I believe the next generation of mobile GPUs will be more powerful than the beloved console. The PS4, however, reaches 1840Gflops, and that level is still a very long way off. By the way, the super powerful Nvidia GeForce GTX Titan video card achieves 4500Gflops, and the new GTX 780Ti reaches approximately 4800Gflops. Catching up with the PC is like flying to the moon :D

Oh, I forgot about the Vivante GC6400 video accelerator, which operates at 800MHz. It is the only competitor to the hellish Adreno 330: its performance is 128!!! Gflops, only 1.6Gflops less than the Adreno 330. But we know that developers are not very keen on optimizing games for this rare accelerator. I, for one, don't know a single device that uses it. If you know of one, please write in the comments.

Ever since the very first computer (or something resembling one) appeared, there has been a race for power and performance. Nothing has changed in our day: every owner of a personal computer whose work loads its computing power dreams of even faster hardware.

All existing computers fall into several categories, ranging from microchips to supercomputers, which consume tens of kilowatts of electricity and lead in computing capability. In this article you will learn how to measure the performance of a personal computer.

From the earliest days, the performance of a computer has been measured by the number of floating point operations it performs in one second. In practice, this turned out to be a very meaningful figure. The unit, one operation per second, is called FLOPS. Computers are very fast devices, however, so the prefixes kilo, mega, giga, tera, peta, exa and so on are placed before FLOPS. Each prefix in this list is 1000 times larger than the previous one. Since FLOPS already means operations per second, results are reported simply as, for example, GFLOPS.
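The prefix rule above, where each step is a factor of 1000, can be sketched as a small formatting helper. The function name `format_flops` is hypothetical and the one-decimal output format is just a choice made for this illustration.

```python
# Prefixes in increasing order; each is 1000 times the previous one.
PREFIXES = ["", "K", "M", "G", "T", "P", "E"]


def format_flops(value):
    """Render a raw operations-per-second value with a metric prefix."""
    if value < 0:
        raise ValueError("value must be non-negative")
    index = 0
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {PREFIXES[index]}FLOPS"


print(format_flops(19.7e9))
print(format_flops(1840e9))
```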

Measuring personal computer performance

There are many tools for measuring the FLOPS of a personal computer or laptop. However, they are all based on the same operating principle.

The available options include running a performance analysis from the command line, building it with Fortran or C++ compilers, and so on. But we will take an easier route and use a precompiled exe build of Linpack, the most popular program for measuring the performance of Windows computers.

Below are two versions of the Linpack program that will help you determine how many floating point operations your computer performs per second.

How to check?

First, unpack the archive and run the program (the LinX.exe file). The interface is very simple and easy to figure out. Start by going to the settings and giving the program the highest priority. Then close any resource-intensive programs. In the LinX interface, you can choose how many runs or how many minutes the test should last and how much data it should use. When all the settings are in place, click Test. When it finishes, you will most likely see the result in GFlops (gigaflops, billions of floating point operations per second).
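The operating principle behind Linpack-style tools can be sketched as follows: solve a dense system of linear equations, time it, and divide a conventional operation count by the elapsed time. This is not LinX or the real Linpack code, only an illustration in pure Python with hypothetical function names. It assumes the customary count of 2/3·n³ + 2·n² operations for solving an n×n system, and because it runs in the interpreter its result will be tiny compared to what LinX reports.

```python
import random
import time


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    The inputs are modified in place.
    """
    n = len(a)
    for k in range(n):
        # Pick the row with the largest entry in column k as the pivot row.
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        partial = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - partial) / a[i][i]
    return x


def linpack_style_gflops(n, seed=0):
    """Time one solve of a random n x n system and report GFLOPS."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    solve(a, b)
    elapsed = time.perf_counter() - start
    operations = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2  # assumed count
    return operations / elapsed / 1e9


if __name__ == "__main__":
    print(f"{linpack_style_gflops(150):.4f} GFLOPS (interpreted Python)")
```

The problem size `n` plays the same role as the amount of data chosen in the LinX settings: larger systems run longer and give a steadier figure.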

To give an idea of how much that is: 1 Flops = 1 floating point operation per second; 1 GFlops = 1,000,000,000 floating point operations per second.

An ARM processor is a mobile processor for smartphones and tablets.

This table shows all currently known ARM processors. The table of ARM processors will be supplemented and upgraded as new models appear. This table uses a conditional system for evaluating CPU and GPU performance. ARM processor performance data was taken from a variety of sources, mainly based on the results of tests such as: PassMark, Antutu, GFXBench.

We do not claim absolute accuracy. It is impossible to rank and evaluate the performance of ARM processors with complete precision, for the simple reason that each of them is ahead of other ARM processors in some respects and behind in others. The table of ARM processors lets you see, evaluate and, most importantly, compare different SoC (System-on-Chip) solutions. With our table, you can compare mobile processors and find out exactly where the ARM heart of your future (or present) smartphone or tablet stands.

Here we have compared ARM processors. We looked at and compared the performance of CPU and GPU in different SoCs (System-on-Chip). But the reader may have several questions: Where are ARM processors used? What is an ARM processor? How does ARM architecture differ from x86 processors? Let's try to understand all this without going too deep into details.

First, let's define the terminology. ARM is the name of the architecture and at the same time the name of the company leading its development. The abbreviation ARM stands for Advanced RISC Machine (originally Acorn RISC Machine). The ARM architecture covers a family of both 32-bit and 64-bit microprocessor cores developed and licensed by ARM Limited. I would like to note right away that ARM Limited is engaged exclusively in the development of cores and tools for them (debugging tools, compilers, etc.), not in the production of the processors themselves. ARM Limited sells licenses for the production of ARM processors to third parties. Here is a partial list of companies licensed to produce ARM processors today: AMD, Atmel, Altera, Cirrus Logic, Intel, Marvell, NXP, Samsung, LG, MediaTek, Qualcomm, Sony Ericsson, Texas Instruments, nVidia, Freescale... and many others.

Some companies that have received a license to produce ARM processors create their own versions of cores based on ARM architecture. Examples include: DEC StrongARM, Freescale i.MX, Intel XScale, NVIDIA Tegra, ST-Ericsson Nomadik, Qualcomm Snapdragon, Texas Instruments OMAP, Samsung Hummingbird, LG H13, Apple A4/A5/A6 and HiSilicon K3.

Today virtually any electronic device runs on an ARM-based processor: PDAs, cell phones and smartphones, digital players, portable game consoles, calculators, external hard disks and routers. They all contain an ARM core, so we can say that ARM processors are mobile processors for smartphones and tablets.

An ARM processor is an SoC, or "system on a chip". In addition to the CPU itself, an SoC can contain the other parts of a full-fledged computer in a single chip. These include a memory controller, an I/O port controller, a graphics core, and a geopositioning system (GPS). It may also contain a 3G module and much more.

If we consider a single family of ARM processors, say Cortex-A9 (or any other), it cannot be said that all processors of the same family have the same performance or that all of them are equipped with a GPS module. All these parameters depend strongly on the chip manufacturer and on what it decided to implement in its product and how.

What is the difference between ARM and x86 processors? The RISC (Reduced Instruction Set Computer) architecture itself implies a reduced set of instructions, which in turn leads to very moderate energy consumption. After all, any ARM chip contains far fewer transistors than its counterpart from the x86 line. Don't forget that in an SoC all the peripherals are located inside a single chip, which makes the ARM processor even more energy efficient. The ARM architecture was originally designed for integer operations only, unlike x86, which can perform floating point calculations using an FPU. It is impossible to compare these two architectures unambiguously. In some respects ARM has the advantage, and in others it is the other way around. If you try to answer in one phrase what the difference between ARM and x86 processors is, the answer is this: an ARM processor knows far fewer instructions than an x86 processor, and the ones it does know are much shorter. This has both pros and cons. Be that as it may, lately everything suggests that ARM processors are slowly but surely catching up with conventional x86 processors, and in some ways even surpassing them. Many openly declare that ARM processors will soon replace the x86 platform in the home PC segment. As we already know, in 2013 several world-famous companies completely abandoned further production of netbooks in favor of tablet PCs. What will actually happen, time will tell.

We will monitor the ARM processors already available on the market.