The impact of compilers on x86/x64 CPU architectures - BeHardware

Written by Guillaume Louel

Published on February 28, 2012




Do compilers bias processor performance one way or another? This is a legitimate question and something we sometimes raise ourselves when we do our processor testing. The fact that Intel and AMD both make their own compilers is clearly a clue to the answer and certain past practices have contributed to the debate.

More recently, the US Federal Trade Commission looked into, among other things, the workings of Intel compilers, highlighting the fact that they don't offer optimisations for a given level of support (inclusion of SSE 4 for example), but rather for precise processor models (second gen Core i7 and so on). The inquiry resulted in an agreement that, in the end, didn't change much with respect to compilers.

One of the few consequences of the agreement between Intel and the FTC was the inclusion of this note in the documentation.

The question of optimisations still holds however and we wanted to take a closer look at the subject to see whether there is an impact or not and, if so, how extensive it is. In the interests of thoroughness, we'll briefly go back over the different types of optimisation that are offered in compilers and the impact these optimisations have.

For the purposes of the test we used applications for which the source code was available to us. This is important as it allows us both to put into context applications for which the source code isn't available and to gauge the complexity of the problem: while you can always recompile open source software, this isn't the case for commercial applications. It's impossible to recompile Adobe Photoshop for a precise processor model and the user simply has to put up with the choices made by developers in terms of the compiler and the compilation options. An important point to keep in mind!

C, C++, role of the compiler
For this article, we focused particularly on C/C++ compilers. Developed in the early 1970s and the early 1980s respectively, C and C++ are extremely popular programming languages that were originally designed for Unix. They're now widely used for developing applications in Windows. C is the older of the two and is what's known as a procedural language (function based), while C++ introduced, among other things, the notion of object orientation (or multi-paradigm for purists).

While C is still very popular in Unix and its derivatives (Linux or MacOS via Objective-C), in Windows C++ is generally what is used, Microsoft having favoured C++ in its APIs since the 90s (like MFC for example, one of the libraries that allows you to design graphical interfaces for Windows applications). These languages have a common root (C++ was originally developed as an extension of C, though the languages have since diverged) and common compilers are therefore often used for both languages.

Interpreted, compiled, on the fly
In contrast to simplified languages like BASIC, where an interpreter reads the program line by line to interpret and translate it into code that can be read by the processor, C and C++ are classed as compiled languages. Developers write software in these languages then compile it. The goal of compilation is to translate the C/C++ program, which is relatively legible, into code that can be executed by the processor (machine language). This means that C/C++ are (relatively) portable languages and a program can be compiled for different types of architecture (ARM, PowerPC or x86 for example). Within a given architecture, compilation can be optimised for certain variants, such as 32-bit and 64-bit versions of an application, or for specific features of that architecture (a version of an application requiring SSE2 for example). Unless you have access to a program's source code and a compiler, you have to accept the optimisations and choices made by developers, which is one of the reasons for this report.

Some C code, 'main' being the function where the program starts.

Even within a given standard such as x86, not all chips are the same. While same gen AMD and Intel processors can process the same machine instructions, AMD and Intel engineers may prioritise certain instructions over others. Division may for example be faster on one processor than another. Compilation, like translation from one language to another, isn't an exact science and the same program can be compiled in different ways. While the end programs will carry out the same work (as long as there aren't any compilation errors), the machine language code will be different. The fact that x86 has a large instruction set (a CISC ISA, as opposed to a RISC ISA with a deliberately reduced instruction set) gives a fairly wide margin, with x86 now including, along with the various extensions (MMX, SSE, etc.), more than a thousand instructions. The way a compiler works would therefore theoretically seem to have a very significant impact on final performance and it's hardly surprising to see that AMD and Intel actively participate not only in the development of existing compilers (whether those included in Microsoft's Visual Studio or the open source GCC) but also provide their own.

Note that the advent of languages such as Java/JVM and C#/.NET has been changing the game over the last few years. Developed by Sun (Java) and Microsoft (C#/.NET), these mainly enterprise application languages (they get more limited traction in consumer applications, though the AMD and Intel control panels are written in C#/.NET) are compiled for what is called a virtual machine. What we have here is a mix of compiled and interpreted language concepts: with Java or C#, the program is compiled into an intermediate language, that of the virtual machine (JVM/.NET). On launch, the virtual machine carries out a final compilation of the program, adapted to the machine it's running on (JIT or Just In Time compilation). Virtual machines thus easily support several ISAs and/or several operating systems, which is one of the advantages of Java for example.

When it comes to optimisations that may favour one processor brand over another, the situation is different as, here, whoever controls the VM controls the optimisations that are or are not included. Note that while C/C++ are generally compiled, it's possible (with a few restrictions) to use C++ with .NET. C/C++ can also be compiled on the fly on an open source platform such as Clang/LLVM. The only drawback of this solution is that it's currently very badly supported in Windows and only supports certain "Microsoft" extensions to C/C++.

In effect, while C and C++ are standardised languages, their implementation does vary from one compiler to another. Firstly because there are several versions of the standards (ANSI, ISO, C99 and C11 versions for C alone) and then because they're also open to numerous extensions. Worse, some compilers only partially implement the established standards - this is particularly true of Microsoft tools. As you can see, compiling the same program in three different environments is a challenge!

Optimisations and performance
As we said previously, there's more than one way of translating a C program into machine language that can be executed by a processor. Compilers tend to differentiate themselves by offering options and optimisations designed to improve performance. Before going further and clarifying what we mean by optimisations, it's important to put how much of an impact compilers have into context.

Although compilers are increasingly intelligent and can sometimes offer impressive optimisations and gains in performance, they don't perform miracles and can't automatically transform a slow and poorly designed algorithm (the program's logic) into something that is very fast. Whether through parallelism (using all the cores) or by optimising processing on modern processors, compilers attempt to do their best with the code written by developers. To make things more complicated, the gap between the developer's domain and that of the compiler is getting smaller all the time. While managing multiple cores in parallel is supposed to be part of the developer's domain, some compilers do attempt (often badly) to have an impact in this area. On the other hand, processor architecture, supposedly entirely hidden by C/C++ and purely the domain of the compiler, is in fact taken into account by developers via various mechanisms designed to inform the compiler of their intentions. Almost an art, getting the most out of modern processors is a very difficult balancing act, both for the developer and the compiler. Having said this, let's now get back to the different optimisations that we can expect to find in the various compilers!
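To make the idea of "informing the compiler" concrete, here is one possible example, chosen for illustration rather than taken from the software tested: the C99 `restrict` qualifier. The function name `scale_into` is hypothetical. The developer promises that the two arrays never overlap, which leaves the compiler free to reorder or group memory accesses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] = src[i] * factor.
 * 'restrict' is a promise from the developer that dst and src never
 * overlap. The compiler does not verify it; breaking the promise is
 * undefined behaviour. */
void scale_into(float *restrict dst, const float *restrict src,
                float factor, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * factor;
    }
}
```

Whether a given compiler actually exploits the hint depends on the compiler and its options. Compilers without C99 support may spell the keyword differently (Microsoft's compiler uses `__restrict`, for instance).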

Reducing the size
In past times, the quality of a compiler was judged on how big the executable file (the compiled machine code) that it produced was. At a time when storage was costly, economising a few kilobytes was significant.

Above and beyond the simple gain in space however, reducing the size of this file was considered to represent a performance optimisation. Given that processors are classified according to the number of instructions they can process per second, reducing the number of instructions in a program and reducing its size does therefore seem worthwhile!

While there is something in this, the issue of the size of the executable file is now more complicated. Not all instructions take the same time to be processed. With Intel's Sandy Bridge architecture for example, an addition (add) has a latency of one cycle (one tick of the processor clock, so a 3 GHz processor runs through three billion cycles per second), an integer multiplication (imul) three and a division (div) 26! Moreover this can change from one processor to another: on a Pentium 4 F a multiplication has a latency of 10 cycles.
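As a small illustration of why instruction choice matters, the two hypothetical functions below compute the same result for unsigned values. One is written as a division, the other as a shift, which is typically a much cheaper instruction. This is only a sketch: an optimising compiler will usually make this substitution itself, so the source form rarely needs to change.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: an unsigned division by a constant power of two. */
uint32_t divide_by_eight(uint32_t x)
{
    return x / 8u;
}

/* Equivalent form: shifting right by 3 bits also divides by 8
 * (valid for unsigned values only). */
uint32_t shift_by_three(uint32_t x)
{
    return x >> 3;
}

/* Sanity check that both forms agree over a small range. */
void check_equivalence(void)
{
    for (uint32_t x = 0; x < 1000u; ++x) {
        assert(divide_by_eight(x) == shift_by_three(x));
    }
}
```

Both functions return identical values, but the machine code a compiler emits for them at a given optimisation level, and its cost on a given processor, can differ.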

Things get more complicated when you take into account the fact that processors have multiple units in each of their cores (known as superscalar architectures and providing the possibility of executing several instructions per cycle, an innovation introduced with the Pentium; we often speak about this in our articles on processor architecture such as Sandy Bridge). A Sandy Bridge processor can thus carry out three additions at the same time in a single cycle, as can Phenom IIs. With Bulldozer, this drops to two per core (if you're interested you can find a full list of latencies and instruction speeds on multiple x86 architectures in this PDF from Torbjörn Granlund).

Optimising for size does nevertheless remain an option included in compilers and later in this report we will see a case where this was beneficial. In most other cases, this option slowed performance down.

Optimising for particular CPU architectures

Generating optimised code for CPU architectures
As you'll have guessed, as there are big differences in how processors process instructions, compilers can be developed to optimise the code they generate so as to take the particularities of each architecture into account.

With the arrival of the Pentium and superscalar processors, the order in which the compiler placed instructions became extremely important. Placed correctly, two additions could be processed simultaneously on this architecture, doubling the performance that you would otherwise get, though this had to be obtained manually as compilers at the time weren't as developed and couldn't generate the optimised code automatically. This led Intel to develop a new type of architecture with the Pentium Pro, introducing what is known as Out of Order or OoO processing. OoO allows the processor to change the order of instructions so as to best utilise superscalar units. While at that time OoO architectures served above all to maximise the use of all units, ordering engines have continued to evolve along with developments in modern processors: masking memory access latency as far as possible (by dealing with read operations as soon as possible so that they're ready when the processor needs them) has now become the new preoccupation as the speed of memory accesses hasn't kept pace with the increase in arithmetical performance of processors.

Since the arrival of the Pentium Pro in 1995, the trend has been constant: to integrate as much innovation as possible at hardware level (superscalar, OoO, caches, MMU, prefetchers and so on) to get as much efficiency as possible from ever more complex architectures. You might therefore think that the role of the compiler had become less important as processors have become increasingly able to deal with some of the heavy parts of the code that they're required to process. In practice however, there are still cases where choices made by the compiler are important.

The choice of instructions for example is still crucial. To take an example that fast forwards us into the 21st century, AVX instructions are mostly available in two variants: 128-bit and 256-bit (the number of bits indicates the size of the operands, the data that instructions work on) and you can generally replace a 256-bit instruction with two 128-bit instructions.

As discussed in the report on its architecture, a Bulldozer module combines two cores and these cores share a certain number of resources. Amongst these is the floating point unit, which takes charge of the execution of SSE (128-bit) and AVX (128 or 256-bit) instructions. It has the particularity of being split into two parts that can function independently in 128-bit mode. If it has to function in 256-bit mode however, the two blocks must be synchronised and work together, which can imply a cost in performance. Mixing 128 and 256-bit instructions can therefore reduce efficiency. To take this particularity into account, the GCC compiler will attempt to favour the use of AVX128 instructions if it's asked to optimise for the Bulldozer architecture (architecture bdver1).

Rather than optimizing for a particular architecture, the Intel compiler
allows you to optimise for a given, Intel brand(!) processor model.

Things get more complex with certain C/C++ language features. While standard mathematical operations can be translated into machine language fairly easily, with other tasks, the language offers functions that simplify the programmer's work. For example this is the case with the manipulation of strings (letters or figures) or memory blocks (in practice, the manipulation of strings is based on memory manipulation). The C/C++ language provides functions that are translated, by the compiler, into relatively long (and therefore optimisable!) pieces of machine language. Data access latency, cache size, the internal functioning of prefetchers and the MMU may all be taken into account for compilation by conscientious developers, as may the instructions available on the processor. As we'll see later, the Intel compiler includes specific implementations of these functions for each of its processors.
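To give an idea of what these functions look like from the developer's side, here is a minimal sketch comparing a hand-written copy loop with the standard `memcpy` function. The two wrapper names are hypothetical. Both produce the same result; the difference is that the implementation behind `memcpy` is supplied by the compiler or its runtime library and can be tuned for a given processor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written copy: one byte at a time, exactly as written. */
void copy_bytewise(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

/* Library copy: the implementation is chosen by the toolchain and may
 * use wider loads/stores or processor-specific code paths. */
void copy_with_library(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}
```

How much faster the library version is, if at all, depends on the runtime library in use, which is why the quality of these implementations matters when comparing compilers.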

Architecture optimisation is therefore still very much an issue and the incursion of hardware into the domain of the compiler is balanced by the fact that optimisations have become increasingly complex. And then, there's also vectorisation...

SSE, AVX: the problem of vectorisation
Earlier, we mentioned the fact that AVX allows you to work on 128 or 256-bit operands (data). This was in fact an oversimplification. Most of the time, programmers work with numbers encoded in 32 bits (between about -2.1 billion and +2.1 billion for integers) or 64 bits (about +/- 9.2 quintillion for integers, or what is known as double precision for floating point numbers). These are types of data used by programming languages like C and C++. Putting a few very rare cases to one side, storing and working on 128 or 256-bit numbers isn't generally all that worthwhile, and moreover isn't really what AVX is used for.

Advanced Vector Extensions (AVX) instructions are in effect vector instructions, which means they can work on an array of data. A 256-bit AVX instruction can thus work on eight pieces of 32-bit data at the same time, which significantly speeds up program execution, as long as you need to process eight identical operations in parallel!

The ideal vectorisation situation: working on four pieces of data at the same time
allows you to quadruple processing speed. Extract from an Intel PDF

Of course this is where things get more complicated as C and C++ aren't really adapted to arrays. Programs use variables, which store information according to type (integers, decimal (real) numbers and so on) and while the notion of array does exist (several pieces of data of the same type), the language doesn't include operations that can be applied directly to them (such as adding the contents of array A to array B). The developer must therefore create algorithms that carry out these operations, some of which are now accelerated by processors using AVX. In practice, the fact that C and C++ don't have structures that are natively adapted to the way processors now function internally has become a real problem for which several solutions have been implemented.
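The point about arrays is easy to see in code. C has no operator for "add array A to array B", so the developer writes a loop over the elements. A minimal sketch (the function name `add_arrays` is ours):

```c
#include <assert.h>
#include <stddef.h>

/* Element-by-element addition: out[i] = a[i] + b[i].
 * The language only expresses one addition at a time; it is up to the
 * compiler (or the developer) to turn this into vector instructions. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Written this way, nothing in the source says that eight of these additions could be carried out by a single 256-bit instruction. That information has to be recovered by one of the techniques described in the rest of this page.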
Replacing mathematical operations
In some cases, using a vector instruction with a single piece of data can be faster than using its standard equivalent. For historical reasons this is particularly so with floating point calculations. These operations, known as x87, were handled by separate arithmetic coprocessors in the 80s. Even though the coprocessor has since been built into the processor (starting with the 486 DX and the Pentium), x87 operations are still as costly as they ever were because they are organised around a register stack. SSE, SSE2 and AVX now offer instructions that can replace x87, getting away from the stack concept and making execution a lot faster. Note that although all compilers offer this type of optimisation, Visual Studio still compiles in x87 by default.
Automatic vectorisation
If you truly want to benefit from the parallel processing power of vector instructions, you might think it a good idea to ask compilers to detect cases where developers have introduced arrays in their code. The compiler would then interpret the C/C++ code (generally loops that repeat an instruction) to generate machine language code automatically using vector instructions.

In theory this is an excellent idea. In practice, the existing code isn't necessarily written to be vectorised. Outside of simple cases, the developer often has to rewrite code to remove dependencies. While in the era of multimedia you might think that all processing is parallel, in practice the level of parallelism (or granularity) isn't necessarily limited to one instruction. When there are several and there are dependencies between results, vectorisation rapidly becomes impossible (there are also other issues such as jumps in code which don't really suit SIMD). Code can often be rewritten, but the compiler can't interpret the "idea" behind a complex algorithm on its own and rewrite it for the developer. The rewritten code would probably be less legible or logical for the developer, though more logical for the compiler.
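The two hypothetical loops below sketch the difference. In the first, every iteration is independent of the others, which is the simple case an automatic vectoriser can handle. In the second, each iteration needs the result of the previous one, so the iterations cannot simply be executed side by side.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: element i depends only on itself.
 * This is the kind of loop that lends itself to vectorisation. */
void scale_in_place(float *data, size_t n, float factor)
{
    assert(data != NULL);
    for (size_t i = 0; i < n; ++i) {
        data[i] = data[i] * factor;
    }
}

/* Loop-carried dependency: element i needs the updated element i-1,
 * so iteration i cannot start before iteration i-1 has finished. */
void running_sum(float *data, size_t n)
{
    assert(data != NULL);
    for (size_t i = 1; i < n; ++i) {
        data[i] = data[i] + data[i - 1];
    }
}
```

A running sum can be reformulated so that it parallelises, but that means changing the algorithm, which is precisely the kind of rewrite a compiler won't do on the developer's behalf.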

In practice, as we'll see, automatic vectorisation is far from being on a par with other techniques.
The assembly code
This is the simplest solution technically speaking. Rather than asking the impossible of the compiler, the developer can decide to write assembly code themselves (a "legible" version of machine language) for some parts of their program. While you can get excellent gains from this (as we'll see with x264), in practice very few developers choose this route as it is quite simply very complex.
Intrinsic functions
This is a slightly more flexible option. Rather than writing pieces of assembly language, intrinsic functions provide shortcuts in C language to AVX instructions. They are more or less complex to use and differ from one compiler to another, which can limit the portability of the code. This solution isn't really used in the software that we have tested (which is open source and portable).
Optimised libraries
Developers can also use external libraries that have been optimised for modern processors. They're often libraries that implement algorithms that can be reused by developers. Intel supplies such a library with its compiler, called Performance Primitives, which offers diverse and varied implementations (going from simple things like the manipulation of matrices to complex blocks of code such as decoding and encoding of JPEG images or H.264 video!). Use of these primitives is obviously limited to the use of the Intel compiler.
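To show what the intrinsic functions mentioned above look like, here is a minimal sketch of an array addition that processes four floats per instruction. It uses the 128-bit SSE intrinsics rather than AVX, on the assumption that SSE is available without special compiler options on most x86 targets; the AVX equivalents follow the same pattern with `_mm256_` names and usually need an extra compiler switch. The function name and the `HAVE_SSE` macro are ours, and the code falls back to a plain loop where SSE isn't available.

```c
#include <assert.h>
#include <stddef.h>

/* Detect SSE support at compile time (GCC/Clang and Microsoft macros). */
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* out[i] = a[i] + b[i], four elements at a time where possible. */
void add_arrays_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && out != NULL);
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 additions at once */
    }
#endif
    /* Scalar loop for the remaining elements (or for everything
     * when SSE isn't available). */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The cost in legibility and portability is visible even in such a small example: the vector width is hard-coded, the leftover elements need separate handling, and the header and detection macros depend on the compiler.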

In the end, using new instructions is often still a problem for C/C++ developers. As languages are relatively poorly adapted to the way our processors now function, solutions for getting the most out of them have either become very complex (writing your own code in assembly language) or impose the use of proprietary extensions, eroding the standard nature of the language and, like it or not, tying the developer to whoever supplies the development tools. This can be a problem when the compiler supplier is also a processor provider.

Generic, targeted, with dispatcher
Although we've spoken about how different processor models are catered for by the various different optimisations offered by compilers, we have dodged the most important question, which is how to write a program which runs as well as possible on all existing processors.

Unfortunately, compilers don't really supply a solution and developers have several choices:
Prioritise compatibility
This is the simplest option and often the one chosen by developers: create a single executable file in which all of the code runs on all modern x86 processors (generally speaking from the Pentium Pro onwards). The compiler won't then generate any code using SSE or AVX instruction sets. This is the default mode for compilers.
Multiple builds
As compilers can generate specific versions of a program for given processor models, developers can simply create several versions (or builds), each of which is optimised for a different type of processor.

Thus you can find standard versions for certain projects (very often open source) and SSE2 and other versions (these SSE2 builds using SSE2 arithmetic operations in place of x87).
Manual dispatcher
Developers who use assembly code can choose to make multiple builds but they can also choose to go down the dispatcher route. The concept is relatively simple and involves checking, when the program starts up, the capabilities of the processor it is running on. Developers can go quite a long way with this and provide sections of assembly code for numerous processor models (SSE, SSE2, AVX accelerations and so on) including those which aren't x86 models (NEON code, the SIMD instructions of ARM processors). This requires quite a bit of work from developers. x264 has such a dispatcher.
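The principle of a manual dispatcher can be sketched in a few lines of C. This isn't how x264 implements it; all the names below are hypothetical, and the "fast" version is a plain C stand-in for what would normally be hand-written assembly or intrinsics. The processor check assumes a recent GCC or Clang on x86, where the `__builtin_cpu_supports` built-in is available (it doesn't exist in the GCC 4.6.1 used for this article); elsewhere the sketch simply falls back to the generic version.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*add_fn)(const float *, const float *, float *, size_t);

/* Baseline version: runs on any processor. */
static void add_generic(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Stand-in for an AVX-optimised version (plain C here, unrolled by 4). */
static void add_fast(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Hypothetical capability check: 1 if the processor reports AVX. */
static int cpu_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0; /* unknown platform: stay on the generic path */
#endif
}

static add_fn selected_add = NULL;

/* Called once at start-up: pick the implementation for this processor. */
void init_dispatch(void)
{
    selected_add = cpu_has_avx() ? add_fast : add_generic;
}

/* Public entry point: always goes through the selected implementation. */
void add_arrays_dispatched(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    if (selected_add == NULL) {
        init_dispatch();
    }
    selected_add(a, b, out, n);
}
```

Note that the check is made on the capability reported by the processor (AVX support), not on its brand or model, so any processor that implements the instruction set takes the fast path.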
Automatic dispatcher
The last option consists in asking the compiler to generate a version of the program that includes a dispatcher. In addition to a basic version that works anywhere, optimisations can be added by the compiler to target a given processor. You can then obtain wide compatibility as well as improved performance on a given processor. In theory this seems a perfect solution...

Compilers: Microsoft, Intel, GCC

We used three different compilation environments to complete this report: Visual Studio 2010 SP1, Intel C++ Compiler XE 12.0u5, and TDM-GCC (MinGW/GCC 4.6.1).
Visual Studio 2010 SP1
It's no surprise to see that the Microsoft development environment is the most commonly used environment in Windows. In terms of optimisations, note that while it does provide for the generation of SSE, SSE2 and AVX code (only by the addition of a switch in the command line for AVX as the option isn't available in the Visual Studio 2010 interface), it doesn't offer automatic vectorisation. Nor does it provide an automatic dispatcher.

The Visual Studio compiler (we'll call it 'cl' from now on, the name of its executable) is by far the most pernickety in terms of what it can compile. The (very) partial implementation of C/C++ standards poses a certain number of problems of interoperability with other compilers. While there are many editions that you have to pay for, Microsoft has also been offering a free version of Visual C++, known as Express, over the last few years. It's available for download on Microsoft's site.
Intel C++ Compiler XE 12.0u5
Intel also has its own Windows compiler. It's partly based on components created by Edison Design Group and then extensively customised by Intel. From a practical point of view, it has the particularity of integrating easily with Visual Studio and allowing easy project conversion. You can moreover move from one compiler to another at any moment, which is a very good argument to convince developers to try it.

ICC is extremely rich in terms of optimisations, offering among other things automatic vectorisation. It can also generate targeted builds for different levels of processor functionalities (the QxSSE2, 3, 4.1, 4.2, AVX options...) as well as creating a dispatcher version, though only for a given level. The QaxAVX option will for example get the AVX version of its code to run on compatible AVX processors and a basic version (SSE2) on all other processors.

You can see why Intel has done this: a program compiled with the QaxAVX option would run on a Sandy Bridge processor with code optimally generated for an AVX processor (including code generated for strings and memory) and in SSE2 mode on a Core i7 "Nehalem". If we were being provocative, we might say that this helps create generational gaps in certain benchmarks that the manufacturer shows in some of its presentations.

Using the Intel compiler in Visual Studio is only a click away

The other issue with these options is that, in contrast to what you might think given what they're called, they get the Intel compiler to check for the processor brand. Moreover this was one of the issues that came up in the FTC inquiry into Intel practices. One of the (known) consequences of the AMD/Intel/FTC agreement is that the Intel documentation is now packed with warnings to the effect that "non Intel" processors may receive different treatment than Intel processors, though without any further details. We're going to try and ascertain in practice whether Intel has changed its practices or not. ICC also has the reputation of being the highest performance compiler and this is something we're going to verify!

Note that ICC includes a third optimisation mode (arch:SSE2 for example) allowing it to create a build for a given level of functionality and not only a given Intel processor. The documentation for the ICC version that we used only indicated SSE4.1 support as a max for this option, with SSE4.2 and AVX not showing. However arch:SSE4.2 and arch:AVX parameters can indeed be used. Is this an ideal solution? Not necessarily in practice, as we'll soon see.

Intel charges for ICC and it is available in numerous editions. A trial version is also available.
TDM-GCC (MinGW/GCC 4.6.1)
From the open source sector, GCC is a compiler that historically tended towards universality. It can therefore be used on all architectures (though this doesn't mean that the same code is compilable everywhere and subtle differences, particularly to do with memory management, often pose a problem when you try to generate cross-compatible code, for example for both ARM and x86) and is available for almost all operating systems.

GCC requires a development environment for it to run in Windows. Two main ones exist: Cygwin and MinGW. Cygwin offers a full POSIX development environment which allows you to compile a Unix program and get it to run in Windows. This is a worthwhile implementation but does come with a performance cost. These days open source applications in Windows mainly use MinGW, a minimalist environment for Windows which serves as a bridge between GCC and the OS, notably by giving access to certain Microsoft system DLLs such as the notorious msvcrt.dll, the MS Visual C++ Runtime, which caused so many problems in Windows 95 and 98.

Among other things, this DLL offers implementations for standard C/C++ functionalities (manipulation of strings and memory space). For compilation of programs in C only, these Microsoft routines will be the ones used by the program. As the DLL is very old (1998), its implementations are outdated with respect to modern processors, which seriously impacts C programs that are compiled with it. There's no such problem with C++ as GCC has a standard library for these functions. We'll have to keep this issue in mind when we evaluate performance. Note lastly that while GCC was originally designed with universality in mind, earning it a longstanding reputation for slower performance, developers have been betting more heavily on performance for some time. Many optimisations have therefore been introduced, from automatic vectorisation to the generation of SSE2 maths (AVX is partially supported) as well as profiles for a large number of x86 architectures. For these tests we used the TDM-GCC version, which is more up to date than the original. It includes GCC in version 4.6.1.

And AMD?
Unlike Intel, AMD doesn't develop its own compiler. This doesn't however mean that it isn't working on the subject. First of all, AMD (like Intel) participates in the development of GCC. Microsoft also works actively with AMD and Intel to obtain coherent (and unbiased, according to them) support in their compilers, as well as for languages (.NET) where optimisations also exist for both processor brands. Finally AMD sponsors and distributes an Open64 fork. Open64 (partly) came out of a research project on compilers - financed by Intel and targeting its Itanium architecture. Itaniums have the particularity of using a VLIW-style instruction set: each instruction word in fact contains three instructions, which are processed by a group of three scalar units. With Itanium it's therefore over to the compiler to choose which instructions to mix to obtain maximum performance, which is a particularly difficult task.

While Intel is no longer working on this, AMD is still offering an alternative version of Open64. Unfortunately it's only available in Linux, which makes it of limited interest for our article.

Test configurations, software, SPEC

Test configurations
So as to evaluate the impact of compilers, we first chose three processors built with different architectures:
  • Intel Core i7 2600k (Sandy Bridge, supports all SSEs and AVXs, except SSE 4a)
  • AMD Phenom II 975 (Deneb, SSE support up to 4a)
  • AMD FX 8150 (Bulldozer, supports all SSEs and AVXs, including SSE 4a)

To recap, SSE 4a is an AMD extension (a few details are available here and here on an AMD blog) that has never been adopted by Intel. In practice it's very little used, except in one case that we'll see later! Note however that the Phenom II doesn't support either SSE 4.1 or SSE 4.2. While there is a very slight overlap between SSE 4.1 and SSE 4a for some instructions, the great majority aren't supported. This poses a certain number of problems with the Intel compiler, as we'll see.

With respect to the platform, we used:
  • An Asus P8Z68-V Pro (Intel) motherboard
  • An Asus M5A97-Evo (AMD) motherboard
  • 2x2 GB of DDR3 1600 MHz
  • Radeon HD 5450
  • SSD Corsair F120
  • Windows 7 SP1

While the configurations were identical for all three platforms, the tests on the FX processor were carried out with the patches available on the Microsoft site that we discussed previously here. We concentrated exclusively on the performance of compilers available in Windows as this is the platform we use regularly in our relative performance tests of processors. Linux, BSD and other operating systems are based on significantly different environments and although some compilers can be used across different operating systems, the issues on one system can differ from those on another (Windows doesn't really have an equivalent to the standard library). The results that we present here therefore only apply to this operating system.

Note that the relative performance of the three processors isn't really of interest to us in this article (you can consult the FX 8150 test to see this in detail). What interests us is how compilers respond to these platforms, whether in comparison to each other or in terms of the optimisations chosen.

To measure the impact of compilers, we looked for C/C++ software that could be compiled in Windows with GCC, CL and ICC, which is relatively difficult. Software developed mainly for Windows very often uses Microsoft compiler "extensions" (or should we say eccentricities). Conversely, the limited compatibility of this compiler with standards sometimes prevents it from compiling programs that are supposed to be universal.

In addition to a certain number of open source applications that corresponded to our narrow description, we also turned to SPEC. SPEC (Standard Performance Evaluation Corporation) is an organisation that tries to carry out universal benchmarks to obtain standardised performance measurements on multiple systems. These are however more than simple 3D Mark type benchmarks as SPEC benchmarks are provided for multiple operating systems and multiple compilers. Thus on each platform it's possible to compile and execute these tests to obtain scores.

These scores, known as SPECint and SPECfp (integers and floating point numbers), are based on a very precise performance index calculated from a long series of tests. For a score to be validated, SPEC imposes strict rules on compilation options, with two standards. The basic ("base") standard is the traditional one: it restricts the use of optimisations judged to be too aggressive and makes a comparison between different architectures possible. A "peak" standard also exists, where anything goes.

The SPECint and SPECfp scores are very often used in Intel (and sometimes AMD) presentations to compare one architecture with another. SPEC takes on significant importance with compilers: compiler developers often use the benchmarks included in SPEC to optimise their compilers, but the impact also depends on how optimisations are developed and presented. The basic version authorises optimisations for a given processor model for example.

To carry out our tests, we used a certain number of individual SPEC tests and the scores we give for these show execution times. In no sense have we published SPECint and SPECfp scores in this article, something it's important to clarify. Visual Studio quite simply cannot compile many of the tests included in SPEC, but in spite of everything the remaining tests give us plenty of room to play with.

Let's now (finally!) move on to the tests!

Page 9
SPEC performance: bzip2, mcf

The benchmarks which follow are all from the SPEC cpu2006 suite, version 1.2 (September 2011). We used the compilation scripts included in SPEC for Visual Studio (cl) and Intel C++ Compiler (icc). While SPEC officially supports GCC (gcc) in Linux, it isn't supported in Windows. A port was available in Cygwin but has since been removed. We therefore modified cpu2006 to get it to run in mingw.

As we said previously, mingw can be seriously handicapped on string and memory operations, which are relatively slow as they come from an old Microsoft DLL. For each benchmark that follows we note the language as well as the type of load (integers or floating point). Finally, we also say whether or not the benchmark is multithreaded.

For each test we measured performance using builds produced with the compilation options authorised by the basic SPEC standard. We did however (for Intel) remove the option that automatically creates code optimised according to the processor platform. In our graphs the basic standard corresponds to the default configuration. Then, with Visual Studio, we forced the generation of SSE2 and AVX code (the only two available options). For Intel, we tested the Qax options, which are recommended by Intel and which generate optimised code for a given processor level as well as baseline code that is executed on all the other processors that don't correspond to this level (SSE2, 3, 4.1, 4.2 and AVX are tested). These builds bear the letter D (for dispatcher) in our graphs. We have also added builds with the arch option for the different modes supported by Intel.
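The dispatcher principle can be sketched in a few lines of C: the build carries two versions of a routine, and a selection function picks one at run time. This is only an illustration of the idea, not Intel's implementation. The `isa_level` enum, the linear ordering of levels and the function names are assumptions made for the example, and the "tuned" routine is a plain C stand-in for code that a real build would vectorise.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction-set levels, ordered from least to most capable.
 * A linear order is a simplification (SSE 4a, for one, does not fit in it). */
typedef enum {
    LEVEL_BASE = 0, LEVEL_SSE2, LEVEL_SSE3, LEVEL_SSE41, LEVEL_SSE42, LEVEL_AVX
} isa_level;

typedef long (*sum_fn)(const int *data, size_t n);

/* Baseline path: plain C, meant to run on every processor. */
long sum_baseline(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += data[i];
    return s;
}

/* Stand-in for the optimised path. It uses two accumulators so that it
 * differs from the baseline while still returning the same result. */
long sum_tuned(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += data[i];
        b += data[i + 1];
    }
    if (i < n)
        a += data[i];
    return a + b;
}

/* Dispatcher: use the optimised path only if the CPU reaches the level the
 * build was tuned for, otherwise fall back to the baseline path. */
sum_fn select_sum(isa_level cpu_level, isa_level tuned_for)
{
    return cpu_level >= tuned_for ? sum_tuned : sum_baseline;
}
```

A build without a dispatcher corresponds to shipping only one of the two routines: either the baseline for everyone, or the optimised one with no fallback.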

Finally for GCC, we used the profiles designed for the processors, namely barcelona (Phenom II architecture), bdver1 (Bulldozer v1, FX architecture), corei7 (previous Intel architecture) and corei7-avx (Sandy Bridge architecture). As they don't include a dispatcher, the AVX builds weren't tested on the Phenom II as they quite simply don't run.
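One way to check which instruction set a given build was actually compiled for is to look at the macros the compiler predefines. The sketch below relies on `__SSE2__`, `__SSE3__`, `__SSE4_1__`, `__SSE4_2__`, `__SSE4A__` and `__AVX__` as predefined by GCC, and on `_M_IX86_FP`, `_M_X64` and `__AVX__` for the Microsoft compiler. The `TARGET_*` names and helper functions are ours, and the ordering of levels is a simplification.

```c
#include <assert.h>

/* Levels ordered from least to most capable (illustrative names). */
enum {
    TARGET_BASE = 0, TARGET_SSE2, TARGET_SSE3,
    TARGET_SSE41, TARGET_SSE42, TARGET_AVX, TARGET_COUNT
};

/* Highest level the compiler was told to target, from predefined macros. */
int build_target_level(void)
{
#if defined(__AVX__)
    return TARGET_AVX;
#elif defined(__SSE4_2__)
    return TARGET_SSE42;
#elif defined(__SSE4_1__)
    return TARGET_SSE41;
#elif defined(__SSE3__)
    return TARGET_SSE3;
#elif defined(__SSE2__) || defined(_M_X64) || \
      (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return TARGET_SSE2;
#else
    return TARGET_BASE; /* x87 floating point, or a non-x86 target */
#endif
}

/* Human-readable name for a level returned by build_target_level(). */
const char *target_name(int level)
{
    static const char *const names[TARGET_COUNT] = {
        "baseline", "SSE2", "SSE3", "SSE4.1", "SSE4.2", "AVX"
    };
    assert(level >= 0 && level < TARGET_COUNT);
    return names[level];
}

/* 1 if the build was compiled with SSE 4a enabled, 0 otherwise. */
int build_uses_sse4a(void)
{
#if defined(__SSE4A__)
    return 1;
#else
    return 0;
#endif
}
```

Printing `target_name(build_target_level())` at start-up is a cheap way to confirm that a profile such as barcelona or corei7-avx had the expected effect on a build.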

Note finally that while some benchmark names refer to known programs, they're generally versions of these programs modified by SPEC, notably to improve compatibility, remove any bias for one architecture or another or improve the quality of readings.


Language: C
Type of load: Integers
Multithreaded: Yes
bzip2 is a very popular compression utility in Linux. The benchmark has been modified to carry out compression and decompression tasks in memory only and thus limit the impact of the drive. Data with different levels of compressibility (JPEG image files, source code, binary program) was compressed and decompressed with three different block sizes. The results obtained are given in seconds.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

First benchmark and first surprise: the Microsoft compiler gives better performance here on both Core i7 and FX! Ironically, the Intel compiler is fastest on the Phenom II. The differences in performance aren't negligible: on the Core i7 the Microsoft compiler gives 12% more than an AVX build with a dispatcher, which is supposed to be as optimised as it's possible to get. When it comes to optimisations, as we'll see, results aren't necessarily what we might expect. Intriguingly, note that SSE 4.1/4.2 and AVX versions are the slowest on the Core i7 and the FX, but not on the Phenom II. In any case, the SSE3 mode without a dispatcher is the fastest from ICC.

The way the dispatcher versions perform on the Phenom II is particularly intriguing because if we accept that the dispatcher deals with all AMD processors in the same way, the same code should be running both on the Phenom II and the FX. In practice however, things aren't as simple.

Note finally that GCC is the only compiler to give a gain in performance via its dedicated processor profiles, with all profiles giving equivalent performance on each architecture. Note that for this first test GCC comes in the middle of the field on the Intel processor and at the back on the AMD machines. There's a 10% difference in CL/GCC performance between the two architectures, ICC simply being more efficient on the Phenom II...


Language: C
Type of load: Integers
Multithreaded: No
This benchmark is from MCF, a program for generating public transport timetables. It uses a network simplex algorithm to solve what's known as the minimum cost flow problem. A PDF describing the problem and this type of solution is available here, the algorithm used in the benchmark being a bit more complex and optimised.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

While the SSE2 and AVX modes don't bring any gain on this benchmark with the Microsoft compiler, we can see the first gains linked to Intel optimisations.

If we look at the dispatcher builds first we can see that the SSE3 version brings a 30% gain in performance, with the three modes that follow giving a slight further gain. Is this a problem? These gains aren't reproduced exactly when we use the "arch" builds that are supposed to optimise in the same way on an Intel processor. Here SSE3 mode is a good deal faster, but this isn't the case for the other modes! In defence of Intel, these builds without a dispatcher behave in the same way on the AMD processors, with the Phenom II once again doing amazingly well with the arch:SSE3 mode.

The dispatcher versions on AMD FX and Phenom II all perform identically in any case, which is what we expected.

Page 10
SPEC performance: gobmk, hmmer, sjeng


Language: C
Type of load: Integers
Multithreaded: Yes
gobmk is an artificial intelligence algorithm designed for the game Go. It's an AI taken from the open source game GNU Go.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

In this test, the gains given by the various optimisations are relatively minor, however it will be noted that the MS and GCC compilers are faster than the Intel one on the Core i7... though the Intel compiler is fastest on the Phenom II! Decidedly, the Intel compiler does very well on this processor! Note, for the versions without a dispatcher, that the SSE 4.1, 4.2 and AVX builds are slightly faster than the others on the Core i7 and that the opposite is true on the FX, though the differences are extremely small. Note that there is another minor trend to verify, minimal as it is here: the SSE2/AVX modes in Visual Studio seem slightly slower on the Intel processor than the standard x87 version, while they give a slight gain on the FX and Phenom II...


Language: C
Type of load: Integers
Multithreaded: No
hmmer is an algorithm that carries out searches in a gene database and is used in particular to analyse protein sequences.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

Big surprise! While there are gains between the different dispatcher versions of the Intel compiler on the Core i7s there are also significant gains on the AMD processors!

So what's happening? To get an idea, we have to take a look at how the AVX dispatcher version performs on the Phenom II. It is indeed three times faster than the basic version, yet not only is the Phenom II incapable of running any AVX code, the dispatcher's detection of Intel processors is also supposed to prevent it from running the optimised code. If we look a bit closer, we can see that the gains aren't exactly identical. While the different modes can be grouped two by two on the Core i7 in terms of their performance (AVX and SSE 4.2, SSE 4.1 and SSE3, SSE2 and basic), things are slightly different on the AMD processors, with an SSE3 build as slow as an SSE2 version. This difference disappears when the dispatcher is removed!

While this is in itself good news for the AMD processors, it also highlights the complete lack of transparency of the Qax option of the Intel compiler. There are multiple optimisations and, in contrast to what you might think, it isn't only AVX or SSE code that is generated but multiple optimisation layers that can also involve string and memory functions. The Intel compiler seems to allow other processors, though in a slightly different way, to benefit from its optimisations. This generosity is appreciated! Note finally that gcc is significantly faster than the Microsoft compiler in this test.


Language: C
Type of load: Integers
Multithreaded: No
sjeng is an artificial intelligence for chess, taken from version 11.2 of the software of the same name.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

While there is a very slight gain here with the SSE 4.1, 4.2 and AVX versions on Core i7, there's a slight dip in these modes on the AMD processors with the dispatcher. Algorithms shared by the standard version and dispatched by the builds don't benefit the different architectures in the same way. Once the dispatcher has been removed, AVX does give a small gain on the FX.

Page 11
SPEC performance: h264ref, astar, milc


Language: C
Type of load: Integers
Multithreaded: Yes
h264ref is a reference implementation of the H.264/AVC video compression standard and the benchmark consists of the encoding of two videos via the baseline and main profiles (we refer you to our article on the subject).

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

The Microsoft compiler is a bit faster than Intel's on the Core i7, while once again the code produced by the Intel compiler is significantly faster on the Phenom II, but also on the FX. The versions with a dispatcher are the most efficient of the Intel compiler versions. While useful on the Phenom II, the GCC optimisations are counterproductive on the FX. corei7-avx tuning gives a small gain in performance on the Core i7.


Language: C++
Type of load: Integers
Multithreaded: No
astar is an implementation of the pathfinding A* algorithm that is much used in real time strategy games.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

Although this is our first C++ benchmark, supposed to be more favourable to GCC, this compiler doesn't do well here. Its optimisations are rather counterproductive. The Microsoft compiler is more efficient, except on the Core i7 where optimisations that aren't shared become a factor as of the SSE3 version.

Note once again that, without a dispatcher, only the SSE3 version benefits from the optimisations of the Intel compiler, whereas the other versions don't. Of course there's no link between the instruction set supported here and these optimisations.


Language: C
Type of load: Floating point
Multithreaded: Yes
Milc is a physical simulation of quantum chromodynamics (QCD).

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

As in the previous benchmark, there's a marked gain on the Core i7 as of the SSE3 mode with the Intel compiler with the dispatcher. Performance here is more than doubled. Note that although there's no change in performance on the FX, there is a 9% gain on the Phenom II.

When we look at the builds without the dispatcher, once again only the SSE3 version benefits from the optimisations, though not all of them! Thus even on the Core i7, the SSE3 version is significantly slower. The AMD processors do benefit from these small gains but the lack of transparency in terms of how these options function continues to have an impact.

Page 12
SPEC performance: namd, lbm, sphinx3


Language: C++
Type of load: Floating point
Multithreaded: Yes
namd is a scientific benchmark for the dynamic simulation of molecules. A portable benchmark was extracted from the original application that is available here.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

The AVX dispatcher mode of the Intel compiler gives a very small gain that you also get on the FX. Interestingly, the AVX mode of the Microsoft compiler gives big gains in performance, particularly on the AMD platforms. The AMD optimisations in GCC are more effective here than the Core i7 optimisations... on the Core i7!


Language: C
Type of load: Floating point
Multithreaded: Yes
Lbm is a scientific program that simulates the behavior of fluids. Images and video are available on this site in German.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

The Intel compiler is significantly more efficient than the others here, with what can be staggering gains on the FX. Note however a small issue for the AVX and SSE 4.1 modes with a dispatcher, which are significantly slower on AMD processors. While this isn't the first time that we've seen differences with these modes, particularly on the FX, the gap has never been as wide as it is here.


Language: C
Type of load: Floating point
Multithreaded: Yes
Let's finish our tour of SPEC with sphinx3, a voice recognition algorithm available on this site.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

Firstly let's look more closely at the behavior of the Microsoft compiler. Although the SSE2 mode significantly hurts performance on all three platforms, the AVX mode behaves quite differently on the Core i7, where it's almost as slow as in SSE2 mode, and on the FX, where it suddenly becomes faster, which is rare.

In terms of the Intel compiler, while the dispatched AVX mode gives a gain in performance on the Core i7, performance dips in the non-dispatched mode across all the platforms.

Page 13
Performance: C-Ray, TSCP, 7-Zip

We also used several pieces of open source software to try and measure the relative performance of the compilers.


Language: C
Type of load: Floating point
Multithreaded: Yes
C-Ray is an implementation of a raytracer in C. Particularly concise, this implementation pushes the floating point capabilities of modern processors very hard and has been transformed into a benchmark to this effect (you can find the source code here).

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

The results obtained highlight first of all the difference that using SSE2 or AVX maths operations can make. Going from x87 to SSE2 with the Microsoft compiler more than doubles performance on all three platforms (the Phenom II being by far the one that benefits the most!). The AVX mode of the Microsoft compiler is even faster than the Intel compiler on the Core i7. Ironically once again, the Intel compiler is the most efficient on the AMD platforms. The different optimisations of the Intel compiler don't change much with the dispatcher, even if, when it's off, the FX does dip slightly in SSE 4.2 and AVX modes. Note finally that the Core i7 optimisations are very counterproductive in GCC, which isn't the case with the AMD optimisations.


Language: C
Type of load: Integers
Multithreaded: No
TSCP is an implementation of a chess AI from Tom Kerrigan (who we thank for allowing us to use his bench).

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

For once, the different optimisations of the Intel compiler have no impact without a dispatcher. With it, the AVX mode has a small advantage. GCC does especially well with the best results on all three platforms, sometimes in a tie with the AMD compiler which also does very well.


Language: C
Type of load: Integers
Multithreaded: Yes
Here we use the bench built into 7-Zip 9.20, which measures the performance of LZMA2 compression. As with bzip2 in SPEC, it is carried out exclusively in memory.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

Note that, in contrast to all our other graphs, the 7-Zip one gives a performance index and not a time. The longer the bar, therefore, the better the performance. The 7-Zip code can't be compiled directly with GCC. There is a unix/posix port (p7zip) but comparing execution times of different source codes wouldn't make any sense here.

Note above all here the fact that the Microsoft compiler is faster on the Core i7 and FX. Intel only leads on the Phenom II, a processor it obviously really appreciates! Finally, in passing, the version optimised for code size (-O1) was very slightly faster in this test in Visual Studio. Moreover, this is the choice made by the developer for the versions he distributes.

Page 14
Impact of assembly language: x264


Language: C
Type of load: Integers
Multithreaded: Yes
The x264 video compression software that we use regularly in our articles is quite particular. It includes quite a large number of assembly optimisations. Moreover, the authors don't just settle for x86 as, for example, there are optimisations using the ARM SIMD NEON instruction set. It's also possible to compile the software with all the assembly optimisations off, which means we can check the impact of these optimisations and see if the compiler optimisations can do anything to compensate.

We used mingw/gcc only for this test (the other compilers aren't directly supported, though there are patches and forks), in a slightly different version to the one used up until now. Here version 4.6.2 of GCC is used. i686 is the basic profile here and noasm indicates that we turned the assembly optimisations off during compilation.

[ Core i7 2600k ]  [ FX-8150 ]  [ Phenom II X4 975 ]

The first point to note is that a build compiled via gcc with the SSE4a optimisations (which is what happens with AMD profiles) is detected as such on launch by x264, which then stops its execution. Only the FX, which has both AVX and SSE4a, can launch all the compiled versions.
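A start-up check of this kind can be sketched as a simple bitmask comparison between what a build requires and what the processor offers. This is an illustration of the principle only, not x264's actual code. The `FEAT_*`, `CPU_*` and `BUILD_*` names are ours, the CPU sets follow the list given in the test configurations, and the build requirements are a simplified assumption that only models the instruction sets discussed in this article.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FEAT_SSE2  = 1 << 0,
    FEAT_SSE3  = 1 << 1,
    FEAT_SSE41 = 1 << 2,
    FEAT_SSE42 = 1 << 3,
    FEAT_AVX   = 1 << 4,
    FEAT_SSE4A = 1 << 5,
    FEAT_ALL   = (1 << 6) - 1,

    /* Processors, as described in the test configurations. */
    CPU_CORE_I7_2600K = FEAT_SSE2 | FEAT_SSE3 | FEAT_SSE41 | FEAT_SSE42 | FEAT_AVX,
    CPU_PHENOM_II     = FEAT_SSE2 | FEAT_SSE3 | FEAT_SSE4A,
    CPU_FX_8150       = FEAT_ALL,

    /* Simplified requirements of two GCC profiles (assumption). */
    BUILD_BARCELONA  = FEAT_SSE2 | FEAT_SSE3 | FEAT_SSE4A,
    BUILD_COREI7_AVX = FEAT_SSE2 | FEAT_SSE3 | FEAT_SSE41 | FEAT_SSE42 | FEAT_AVX
};

/* Features the build needs but the CPU lacks (0 means it can run). */
uint32_t missing_features(uint32_t required, uint32_t available)
{
    assert((required & ~(uint32_t)FEAT_ALL) == 0);
    assert((available & ~(uint32_t)FEAT_ALL) == 0);
    return required & ~available;
}

/* 1 if every required feature is available, 0 otherwise. */
int can_run(uint32_t required, uint32_t available)
{
    return missing_features(required, available) == 0;
}
```

With these sets, `can_run(BUILD_BARCELONA, CPU_CORE_I7_2600K)` is false because SSE 4a is missing, while the FX passes the check for both profiles.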

Next, if there were still any doubt, code written in assembly language is the best route to getting maximum performance. Assembly versions however manage to benefit very slightly from corei7/corei7-avx profiles for the Core i7-2600k, and bdver1 and (above all) barcelona for the AMD processors. More surprisingly these optimisations are generally unfavourable on code without assembly language. Note finally that forcing the use of AVX 128 bit instructions (which is indicated by p128) does have an impact, which is however very slight here.

Of course these results aren't directly comparable to the others that we have presented here but they do allow us to illustrate that while the compiler still has an important role, a developer can always significantly optimise C/C++ code.

Page 15
Performance averages

To illustrate the relative performance of compilers, we have calculated some averages, setting an index of 100 to the Microsoft compiler on each of the platforms. 7-Zip and x264 have of course been excluded from the averages.
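For readers who want to reproduce this kind of index from execution times, here is a minimal sketch. It assumes that the index of a build is 100 times the reference time divided by the build's time (so a higher index means a faster build) and that the average is an arithmetic mean of the per-benchmark indices. The article does not state which mean was used, so treat that choice, like the function names, as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Index of one build on one benchmark. 100 means "as fast as the
 * reference"; above 100 means a shorter execution time than the reference. */
double perf_index(double reference_seconds, double build_seconds)
{
    assert(reference_seconds > 0.0 && build_seconds > 0.0);
    return 100.0 * reference_seconds / build_seconds;
}

/* Arithmetic mean of the per-benchmark indices (assumed averaging method).
 * ref[i] and build[i] are the times of the reference and tested builds
 * on benchmark i. */
double mean_index(const double *ref, const double *build, size_t n)
{
    assert(ref != NULL && build != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += perf_index(ref[i], build[i]);
    return sum / (double)n;
}
```

Changing the reference build (cl base, cl SSE2 or icc base) only changes which times are passed as `ref`; the tested builds keep the same times.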

In the interests of precision, we have also drawn up two other averages allocating the index of 100 to the scores for the SSE2 version of the Microsoft compiler and the base version of the Intel compiler.

[ cl base ]  [ cl SSE2 ]  [ icc base ]

Of course, performance isn't identical for these three processors but by choosing an equivalent index for each, we can see more clearly how performance is affected by the compilers or options used.

If you have read this report from the top, the results given here won't surprise you. The Intel compiler does best. While gcc does okay, in practice the tuning options are often counterproductive and undermine the gains they bring. Visual Studio is significantly slower in its standard version because it still generates x87 code by default for floating point operations. Moving over to SSE2 for the maths operations improves performance, particularly on AMD processors where x87 code has been somewhat sidelined by new instruction sets such as SSE2 that have been designed to replace them.

When we compare this mode to the default Intel mode (that also compiles for SSE2), the Intel compiler takes the lead and the FX-8150 actually benefits most with a performance gain of 24% against just 20% for the Core i7.

Centring our performance index on this mode brings several points of interest to the fore. Firstly, the very progressive way in which each Qax mode benefits Intel processors. These gradual gains are a little too perfect to result just from the use of an instruction set. Simply looking at the difference with and without a dispatcher on the AVX version for the Core i7 2600k demonstrates this quite clearly. The compatible modes aren't optimised with the same vigour by Intel as the other optimisation modes and Intel doesn't hide this.

The Qax modes are carry-all optimisation modes where many of the optimisations aren't linked to the level of processor support at all. The fact that these other optimisations, which sometimes benefit AMD processors as we have seen, are not made available in the modes that are supposed to be widely compatible (without a dispatcher) is particularly problematic. While these modes are indeed "fairer", they are in practice slower for all solutions, including those from Intel.

So, what are these other optimisations? By analysing the assembly code generated by the Intel compiler, a certain number of points are exposed.

Firstly, some optimisations do not concern the developer's code. Depending on which version of the memory allocation and copy functions is used, the results can be very much affected. All this extra code generated by the compiler for standard C/C++ functionalities is important and while the dispatcher does affect some of these functionalities, others aren't affected. Moreover this is what lies behind the range of results recorded. For the lbm test, the assembly code generated in the critical section of the program is identical in the SSE 4.1 and 4.2 versions. Worse, the code generated isn't dispatched (it is in AVX mode, with a very modest gain) but there is a significant difference between a Core i7 and an FX in SSE 4.1 because some blocks of extra code are dispatched.

Next, we certainly must not overestimate the quantity of SSE/AVX code generated. While the compiler is often able to use it, we did notice in many tests that the code produced in critical sections (the most resource-hungry parts of the code) doesn't always use AVX code. SSE2 instructions are often preferred and quite rightly so.

Finally, we have the opaque optimisations. For example we noticed that the Qax modes have an influence on the unrolling of loops. This optimisation consists of replacing a loop (a piece of code that we ask to be repeated many times) with its body written out several times in a row. This increases the length of the code generated but in practice, avoiding conditional jumps (very costly on x86 architecture) can bring significant performance gains. We noted, in the case of the lbm test, that the Intel compiler doesn't unroll the code in QaxSSE2 mode and that it unrolls it to a greater or lesser extent in the other modes, although the unrolling option is activated by default in the compiler.
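To make the idea concrete, here is a hand-written sketch of what unrolling by a factor of four looks like in C. Compilers normally do this themselves on the generated code; the function names and the factor of four are chosen for the example and say nothing about what the Intel compiler actually emits for lbm.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one addition and one loop test per element. */
double sum_rolled(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Same computation unrolled by four: the loop test and jump now happen
 * once per four elements. The additions are kept in the same order, so
 * the result is identical to sum_rolled. */
double sum_unrolled4(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += x[i];
        s += x[i + 1];
        s += x[i + 2];
        s += x[i + 3];
    }
    /* Remainder loop for the last 0 to 3 elements. */
    for (; i < n; ++i)
        s += x[i];
    return s;
}
```

The unrolled version is longer, which is the code size cost mentioned above, and it needs a remainder loop when the element count is not a multiple of four.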

Page 16

As we come to the end of this article, we hope first of all that we have been able to demystify somewhat the issue of compilers and the role they play in processor performance. In the course of the article we have asked a certain number of questions and we now owe it to ourselves to attempt to answer them.

Firstly, yes, the choice of the compiler does have an impact on performance. The reputation of the Intel compiler as the fastest, all optimisations put aside, holds true. Although there were some counterexamples, it is generally faster than the Microsoft solution with its default options. Even when we force the use of SSE2 maths operations, the Microsoft compiler is sometimes 15 to 20% down on its rival.

If Intel invests in the development of a compiler, it is of course so that it can use it as one more asset in the battle for performance. By choosing to allow automatic optimisation for each of its processor families, while naming these options after the instruction set supported (QaxAVX for AVX, for example), Intel tends to reserve the latest optimisations for its latest processor models, which can inflate somewhat the impression we might have of the gains afforded by the instruction set. Looking at the performance of the Phenom II with these dispatched AVX versions shows this quite clearly.

Example of dispatch function, extracted from the assembly code of the dispatched AVX build produced with the Intel compiler in the 470.lbm test. Based on the value of the __intel_cpu_indicator variable, a fast (.R) or slow (.A) path will be executed.

When the dispatcher isn't used, the problem gets more complex as some of the optimisation layers, not really linked to the instructions supported by the processor, disappear. Worse still, some modes are strangely more efficient than others, as with SSE3 mode which is the fastest on average!

The Intel compiler therefore leaves the developer with some complex choices if they're prepared to delve into the manual to try and understand how the optimisation options work. On the one hand, then, we find ourselves confronted with a universal mode, which is equitable in terms of its optimisations but which doesn't offer optimum performance, and this for reasons that don't have much to do with the generation of SSE or AVX code. On the other we have the dispatched, carry-all modes which, in comparison, give up to an additional 15% in performance on Intel processors. These modes tend, to a greater or lesser degree depending on the model, to diminish, or not to increase, the performance of AMD processors.

The question of the exact implementation of the Intel dispatcher in its compiler thus remains an open one. By choosing to execute a different version of some parts of programs according to the processor brand and not according to its technical capabilities, Intel can't really be said to be providing a level playing field. The company's usual response on this point is unequivocal: of course they're marketing software that tends to favour their own products! Why wouldn't they? It's up to the competition to do the same. The intervention of the Federal Trade Commission hasn't changed the situation except in as much as Intel now clearly states that its compiler is optimised differently for different brands of processor (without really explaining how). However the fact that many of these optimisations aren't really linked to the processor model but to other layers of optimisation that are added on top, such as loop unrolling, makes the situation even less transparent for developers. Note in addition that detection of the processor brand is applied on a case by case basis across Intel's products: other secondary products sold by Intel, such as the MKL mathematical library or the TBB thread management library, are indeed dispatched but without reference to the brand of the processor.
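The difference between the two policies can be shown with a deliberately simplified sketch. It is not a reconstruction of Intel's dispatcher: the `cpu_info` struct, the function names and the choice of SSE3 as the tested capability are assumptions for the example. The vendor strings "GenuineIntel" and "AuthenticAMD" are the ones these processors report through CPUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Minimal description of a processor (illustrative). */
typedef struct {
    const char *vendor; /* CPUID vendor string */
    bool has_sse3;      /* capability needed by the fast path */
} cpu_info;

/* Capability-based policy: any CPU that supports the instructions
 * gets the fast path. */
bool fast_path_by_capability(const cpu_info *cpu)
{
    assert(cpu != NULL);
    return cpu->has_sse3;
}

/* Brand-based policy: the fast path is also conditioned on the vendor
 * string, so a capable CPU from another vendor gets the slow path. */
bool fast_path_by_brand(const cpu_info *cpu)
{
    assert(cpu != NULL && cpu->vendor != NULL);
    return cpu->has_sse3 && strcmp(cpu->vendor, "GenuineIntel") == 0;
}
```

For a processor described as `{ "AuthenticAMD", true }`, the first policy selects the fast path and the second does not, although the hardware capability is the same in both cases.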

The same code dispatched, seen on the execution of the program via OllyDbg

The plot thickens when Intel then goes prospecting for software developers to offer them use of the Intel compiler, suggesting they use these Qax modes which, apart from being recommended by Intel, are detected as giving higher performance by those developers who take the time to check respective performance. In terms of transparency, one might find fault with Intel, particularly when developers of in vogue games or benchmarks are targeted as these are used in the specialised press to measure processor performance.

Can we blame the developers who succumb? This is the crux of the problem! If you offer developers the choice between using an equitable compiler against another that gives a 35% performance gain to AMD processor users and 54% to users of Intel processors, should we blame them for choosing the one that will improve the user experience in both cases (though in a partisan way)?

And what about AMD processor users? Do they want an equitable and slow version or something with improved performance even though it favours competitor processor users? In practice, outside of the open source world, the user's opinion isn't of any significance. Users of commercial applications have to accept the choices, good or bad, made by developers.

Equally, the performance readings that we take in our processor tests are also affected by these choices. Although we avoid overly partisan benchmarks in our protocols, when a very popular application favours one processor model over another, measuring these performance differences simply reflects how users experience this software.

Getting an unskewed view of the situation is difficult and there aren't many solutions to hand. The development of compilers that, like LLVM, produce managed code may be a solution, as may the widespread use of .NET, keeping in mind nevertheless that whoever controls the virtual machine in the end controls the key to performance on each of the architectures. Relocating a problem doesn't necessarily resolve it!

Copyright © 1997-2015 BeHardware. All rights reserved.