Performance averages

To illustrate the relative performance of compilers, we have calculated some averages, setting an index of 100 to the Microsoft compiler on each of the platforms. 7-Zip and x264 have of course been excluded from the averages.
In the interests of precision, we have also drawn up two other averages allocating the index of 100 to the scores for the SSE2 version of the Microsoft compiler and the base version of the Intel compiler.
[Interactive chart: performance indexes centred on cl base, cl SSE2 or icc base]
Of course, performance isn’t identical for these three processors but by choosing an equivalent index for each, we can see more clearly how performance is affected by the compilers or options used.
If you have read this report from the top, the results given here won't surprise you. The Intel compiler does best. While gcc does okay, in practice its tuning options are often counterproductive and cancel out the gains they are supposed to bring. Visual Studio is significantly slower in its standard version because it still generates x87 code by default for floating point operations. Moving over to SSE2 for the maths operations improves performance, particularly on AMD processors, where x87 has been somewhat sidelined in favour of the newer instruction sets, such as SSE2, designed to replace it.
When we compare this mode to the default Intel mode (that also compiles for SSE2), the Intel compiler takes the lead and the FX-8150 actually benefits most with a performance gain of 24% against just 20% for the Core i7.
Centring our performance index on this mode brings several points of interest to the fore. The first is the very progressive way in which each mode benefits Intel processors. These gradual gains are a little too perfect to result from the use of an instruction set alone. Simply looking at the difference with and without a dispatcher on the AVX version for the Core i7 2600K demonstrates this quite clearly. The compatible modes aren't optimised with the same vigour by Intel as the other optimisation modes, and Intel doesn't hide this.
These other modes are carry-all optimisation modes in which many of the optimisations aren't linked to the level of processor support at all. The fact that these other optimisations, which as we have seen sometimes benefit AMD processors, are not made available in the modes that are supposed to be widely compatible (without a dispatcher) is particularly problematic. While these modes are indeed “fairer”, they are in practice slower for all solutions, including those from Intel.
So, what are these other optimisations? By analysing the assembly code generated by the Intel compiler, a certain number of points are exposed.
Firstly, some optimisations don't concern the developer's code at all. Depending on which version of the memory allocation and copy functions is used, the results can be affected a great deal. This extra code generated by the compiler for standard C/C++ functionality matters, and while the dispatcher does cover some of these functions, others aren't affected. This is what lies behind the range of results we recorded. In the lbm test, the assembly code generated in the critical section of the program is identical in the SSE 4.1 and 4.2 versions. Worse, this code isn't dispatched (it is in AVX mode, with a very modest gain), and yet there is a significant difference between a Core i7 and an FX in SSE 4.1, because some blocks of extra code are dispatched.
Next, we certainly must not overestimate the quantity of SSE/AVX code generated. While the compiler is often able to use it, we did notice in many tests that the code produced in critical sections (the most resource-hungry parts of the code) doesn’t always use AVX code. SSE2 instructions are often preferred and quite rightly so.
Finally, we have the opaque optimisations. For example, we noticed that the Qax modes have an influence on the unrolling of loops. This optimisation consists of replacing a loop (a piece of code that is asked to repeat many times) with several copies of its body in sequence. This increases the length of the generated code, but in practice, avoiding conditional jumps (very costly on x86 architecture) can bring significant performance gains. We noted, in the case of the lbm test, that the Intel compiler doesn't unroll the code in QaxSSE2 mode and that it unrolls it to a greater or lesser extent in the other modes, even though the unrolling option is activated by default in the compiler.