SSE, AVX: the problem of vectorisation

Earlier, we mentioned that AVX allows you to work on 128 or 256-bit operands (data). This was in fact an oversimplification. Most of the time, programmers work with numbers encoded in 32 bits (signed integers between about -2.1 and +2.1 billion, or single-precision floating point) or 64 bits (signed integers up to about ±9.2 quintillion, or double-precision floating point). These are the types of data used by programming languages like C and C++. Putting a few very rare cases to one side, storing and working on 128 or 256-bit numbers isn't generally all that worthwhile, and in any case isn't really what AVX is for.
Advanced Vector eXtensions instructions are, as the name indicates, vector instructions, which means they work on an array of data. A 256-bit AVX instruction can thus work on eight pieces of 32-bit data at the same time, which significantly speeds up program execution, as long as you actually have eight identical operations to process in parallel!
The ideal vectorisation situation: working on four pieces of data at the same time allows you to quadruple processing speed. Extract from an Intel PDF.
Of course, this is where things get more complicated, as C and C++ aren't really suited to arrays. Programs use variables, which store information according to its type (integers, decimal (real) numbers and so on), and while the notion of an array does exist (several pieces of data of the same type), the language doesn't include operations that apply directly to arrays (such as adding the contents of array A to array B). The developer must therefore write algorithms that carry out these operations, some of which processors can now accelerate with AVX. In practice, the fact that C and C++ have no structures natively adapted to the way processors now function internally has become a real problem, for which several solutions have been implemented.
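To make this concrete, here is a minimal sketch (the function and array names are our own, purely for illustration) of the kind of loop a C developer has to write by hand, since the language offers no direct "add array A to array B" operation:

    /* add.c -- element-by-element addition, written as a plain C loop */
    void add_arrays(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* one scalar addition per iteration */
    }

Each iteration is independent of the others, which is exactly the property that makes such a loop a candidate for vectorisation: eight of these 32-bit additions fit in a single 256-bit AVX instruction.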
Replacing mathematical operations
In some cases, using a vector instruction on a single piece of data can be faster than using its standard equivalent. For historical reasons this is particularly true of floating point calculations. These operations, known as x87, were handled by arithmetic coprocessors in the 1980s. Even though the arithmetic coprocessor has since been built into the processor itself (starting with the 486 DX and the Pentium), the x87 instructions have kept their old design: they operate on a stack of registers, which remains as costly as it ever was. SSE, SSE2 and AVX now offer instructions that can replace x87, getting away from the stack model and making execution a lot faster. Using a vector instruction for a single piece of data can therefore pay off. Note that although all compilers offer this type of optimisation, Visual Studio still compiles to x87 by default.
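By way of illustration, these are the switches the main compilers use for this (taken from the GCC and Visual C++ documentation; exact names and defaults vary with compiler version and target):

    # GCC: use SSE/SSE2 rather than x87 for scalar floating point maths
    gcc -msse2 -mfpmath=sse program.c

    # Visual C++ (32-bit): allow SSE2 code generation instead of x87
    cl /arch:SSE2 program.c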
Automatic vectorisation

To truly benefit from the parallel processing power of vector instructions, you might think it a good idea to ask the compiler to detect the cases where developers work on arrays in their code. The compiler then interprets the C/C++ code (generally loops that repeat an instruction) and automatically generates machine language that uses vector instructions.
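As a sketch of what this means in practice: with GCC, the auto-vectoriser is enabled at optimisation level -O3 and can, on its own, turn a loop like the add_arrays example above into AVX code, provided the target allows it:

    # -O3 enables GCC's auto-vectoriser (-ftree-vectorize);
    # -mavx allows it to emit 256-bit AVX instructions
    gcc -O3 -mavx -c add.c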
In theory this is an excellent idea. In practice, existing code isn't necessarily written in a way that can be vectorised. Outside of simple cases, the developer often has to rewrite code to remove dependencies. While in the multimedia era you might think that all processing is parallel, in practice the level of parallelism (the granularity) isn't necessarily a single instruction. When there are several instructions and there are dependencies between their results, vectorisation rapidly becomes impossible (there are other obstacles too, such as jumps in the code, which don't really suit SIMD). The code can often be rewritten, but the compiler can't interpret the "idea" behind a complex algorithm on its own and rewrite it for the developer. And the rewritten code would probably be less legible or less logical for the developer, even if more logical for the compiler.
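As an illustration of such a dependency (a classic running-sum pattern, chosen by us for the example), consider a loop where each iteration reads the result of the previous one; the compiler cannot simply process eight iterations at once:

    /* Each iteration depends on the previous one: a[i] needs the new
       value of a[i-1], so the additions can no longer run in parallel. */
    void prefix_sum(float *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }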
In practice, as we'll see, automatic vectorisation is far from being on a par with the other techniques.
The assembly code
This is technically the simplest solution. Rather than asking the impossible of the compiler, the developer can decide to write assembly code (a "legible" version of machine language) themselves for certain parts of their program. While the gains can be excellent (as we'll see with x264), in practice very few developers choose this route, as it is quite simply very complex.
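For a flavour of what this involves, here is a minimal sketch (our own illustration, not code taken from x264) of the earlier addition written with GCC's extended inline assembly; every load, add and store has to be spelled out by hand:

    /* Adds eight floats from a and b in a single AVX instruction.
       Requires a CPU with AVX and compilation with gcc -mavx. */
    static void add8_asm(const float *a, const float *b, float *c)
    {
        __asm__ volatile (
            "vmovups (%0), %%ymm0          \n\t"  /* load 8 floats from a */
            "vaddps  (%1), %%ymm0, %%ymm0  \n\t"  /* add 8 floats from b  */
            "vmovups %%ymm0, (%2)          \n\t"  /* store 8 results in c */
            :
            : "r"(a), "r"(b), "r"(c)
            : "ymm0", "memory");
    }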
Intrinsic functions

This is a slightly more flexible option. Rather than writing pieces of assembly language, intrinsic functions provide shortcuts in C to the AVX instructions. They are more or less complex to use and differ from one compiler to another, which can limit the portability of the code. This solution isn't really used in the software we have tested (which is open source and portable).
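For comparison, here is the same eight-float addition written with the AVX intrinsics declared in immintrin.h (the function name is ours; the intrinsics are those documented by Intel):

    #include <immintrin.h>

    /* Adds eight floats from a and b; each intrinsic maps to one AVX
       instruction, but the compiler handles register allocation. */
    static void add8_intrin(const float *a, const float *b, float *c)
    {
        __m256 va = _mm256_loadu_ps(a);               /* load 8 floats from a */
        __m256 vb = _mm256_loadu_ps(b);               /* load 8 floats from b */
        _mm256_storeu_ps(c, _mm256_add_ps(va, vb));   /* add and store        */
    }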
Optimised libraries

Developers can also use external libraries that have been optimised for modern processors. These are often libraries implementing algorithms that developers can reuse. Intel supplies such a library with its compiler, the Integrated Performance Primitives (IPP), which offers diverse and varied implementations, going from simple things like the manipulation of matrices to complex blocks of code such as the decoding and encoding of JPEG images or H.264 video! Use of these primitives is obviously limited to the use of the Intel compiler.
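As an illustration of the level these primitives work at (the call below is based on the IPP signal processing API as we understand it; check Intel's documentation for the exact headers and prototypes of your IPP version), the same element-by-element addition becomes a single library call:

    #include <ipps.h>

    /* ippsAdd_32f adds two arrays of 32-bit floats element by element;
       the library picks the best SSE/AVX code path for the CPU. */
    void add_with_ipp(const float *a, const float *b, float *c, int n)
    {
        ippsAdd_32f(a, b, c, n);
    }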
In the end, using the new instructions often remains a problem for C/C++ developers. As these languages are relatively poorly adapted to the way our processors now function, the solutions for getting the most out of them are either very complex (writing your own assembly code) or impose the use of proprietary extensions, eroding the standard nature of the language and, like it or not, tying the developer to whoever supplies the development tools. This can be a problem when the compiler supplier is also a processor provider.