The impact of compilers on x86/x64 CPU architectures
by Guillaume Louel
Published on September 27, 2012

Generating optimised code for CPU architectures
As you’ll have guessed, given the big differences in how processors handle instructions, compilers can be designed to optimise the code they generate so as to take the particularities of each architecture into account.

With the arrival of the Pentium and superscalar processors, the order in which the compiler placed instructions became extremely important. Placed correctly, two additions could be processed simultaneously on this architecture, doubling performance; compilers of the time weren’t sophisticated enough to generate such code automatically, so the ordering had to be done by hand. This led Intel to develop a new type of architecture with the Pentium Pro, introducing what is known as Out of Order (OoO) processing. OoO allows the processor itself to reorder instructions so as to make the best use of its superscalar units.

While OoO architectures initially served above all to keep every execution unit busy, ordering engines have continued to evolve along with modern processors. Because the speed of memory accesses hasn’t kept pace with the growth in arithmetic performance, the new preoccupation is masking memory latency as far as possible, by issuing read operations as early as possible so that the data is ready when the processor needs it.
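To make the scheduling idea concrete, here is a minimal sketch in C (the function names are ours, and the pairing behaviour described in the comments is illustrative rather than taken from the article): summing an array through a single accumulator chains every addition to the previous one, while splitting the accumulator exposes two independent additions per iteration that an in-order superscalar processor like the Pentium could execute side by side.

/* One accumulator: each add depends on the result of the previous
   one, so the two pipes can never work on the sum at the same time. */
int sum_chained(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];               /* serial dependency chain          */
    return s;
}

/* Two accumulators: the two adds in each iteration are independent
   and can, in principle, be paired on a two-pipe superscalar core. */
int sum_paired(const int *v, int n)
{
    int s0 = 0, s1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += v[i];              /* independent of the next add...   */
        s1 += v[i + 1];          /* ...so both can issue together    */
    }
    if (n & 1)
        s0 += v[n - 1];          /* handle an odd element count      */
    return s0 + s1;
}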

Since the arrival of the Pentium Pro in 1995, the trend has been constant: integrate as much innovation as possible at the hardware level (superscalar execution, OoO, caches, the MMU, prefetchers and so on) to extract as much efficiency as possible from ever more complex architectures. You might therefore think that the role of the compiler has become less important, as processors are increasingly able to take on some of the heavy lifting in the code they’re given. In practice, however, there are still cases where the choices made by the compiler are important.

The choice of instructions for example is still crucial. To take an example that fast forwards us into the 21st century, AVX instructions are mostly available in two variants: 128-bit and 256-bit (the number of bits indicates the size of the operands, the data that instructions work on) and you can generally replace a 256-bit instruction with two 128-bit instructions.
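The equivalence is easy to show with compiler intrinsics. The sketch below is ours (the function names are illustrative): both routines add eight floats, the first with a single 256-bit AVX instruction, the second with two 128-bit instructions covering the same data.

#include <immintrin.h>

/* One 256-bit AVX addition: eight floats processed at once. */
void add8_avx256(const float *a, const float *b, float *r)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(r, _mm256_add_ps(va, vb));
}

/* The same work done as two 128-bit additions of four floats each. */
void add8_avx128(const float *a, const float *b, float *r)
{
    __m128 lo = _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));
    __m128 hi = _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
    _mm_storeu_ps(r,     lo);
    _mm_storeu_ps(r + 4, hi);
}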

As discussed in the report on its architecture, a Bulldozer module combines two cores that share a certain number of resources. Among these is the floating point unit, which takes charge of the execution of SSE (128-bit) and AVX (128 or 256-bit) instructions. It has the particularity of being split into two parts that can function independently in 128-bit mode. If it has to function in 256-bit mode, however, the two blocks must be synchronised and work together, which can carry a performance cost. Mixing 128-bit and 256-bit instructions can therefore reduce efficiency. To take this particularity into account, the GCC compiler will attempt to favour the use of AVX128 instructions when asked to optimise for the Bulldozer architecture (the bdver1 target).
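In practice this surfaces as a compiler option. As a hedged example (the saxpy routine below is ours, not from the article), a loop like this compiled with gcc -O3 -march=bdver1 is tuned for Bulldozer and, as described above, tends to be vectorised with 128-bit AVX instructions; GCC also exposes the choice explicitly through the -mprefer-avx128 switch.

/* A simple loop the compiler can vectorise automatically.
   gcc -O3 -march=bdver1 tunes code generation for Bulldozer,
   favouring 128-bit AVX; -mprefer-avx128 forces that preference. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];   /* candidate for AVX128 vectorisation */
}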


Rather than optimising for a particular architecture, the Intel compiler allows you to optimise for a given, Intel-brand(!) processor model.

Things get more complex with certain specific C/C++ language features. While standard mathematical operations can be translated into machine language fairly easily, for other tasks the language offers functions that simplify the programmer’s work. This is the case, for example, with the manipulation of strings (of letters or figures) or memory blocks (in practice, string manipulation is built on memory manipulation). The C/C++ language provides functions that the compiler translates into relatively long (and therefore optimisable!) pieces of machine language. Data access latency, cache size, and the internal workings of the prefetchers and the MMU may all be taken into account at compilation by conscientious developers, as may the instructions available on the processor. As we’ll see later, the Intel compiler includes specific implementations of these functions for each of its processors.
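As a minimal sketch of the kind of function involved (the struct and function names here are hypothetical), consider memcpy and strlen: the programmer writes a single call, and the compiler is free either to hand it to a library routine tuned for the target processor or to expand the operation inline when, say, the size is known at compile time.

#include <string.h>

struct packet { char header[16]; char payload[240]; };

/* Fixed, compile-time-known size: an optimising compiler will often
   expand this memcpy inline, picking a copy sequence suited to the
   target (plain moves, SSE loads/stores, rep movsb, ...). */
void copy_packet(struct packet *dst, const struct packet *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Unknown length: typically becomes a call to a library strlen
   implementation tuned for the processor's caches and instructions. */
size_t name_length(const char *name)
{
    return strlen(name);
}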

Architecture optimisation therefore remains very much a live issue: hardware has taken over some of the compiler’s territory, but the optimisations that remain have become increasingly complex. And then, there’s also vectorisation…




