Improvements to the Sandy Bridge core
Sandy Bridge is mostly based on the Nehalem architecture, but brings its share of improvements, some of which are taken from the Pentium 4's Netburst micro-architecture. Netburst was not based on an existing architecture, which allowed a multitude of concepts previously unheard of in the x86 universe to surface. HyperThreading, which reappeared with Nehalem, is one of them, and the Sandy Bridge micro-architecture updates several other Netburst ‘innovations’.
An "L0" cache
In our study of the Nehalem architecture, we saw that this processor has a mechanism optimised for loops: a buffer holding micro-operations (that is, already decoded instructions), which saves resources by not decoding the same code several times when a loop is correctly predicted. We also mentioned the resemblance to the Pentium 4 trace cache principle.
Sandy Bridge goes a bit further by introducing a micro-operations cache (uop cache) holding roughly 1.5K micro-operations, which receives its data from the instruction decode units. Once a new branch is decoded, the branch unit checks whether the corresponding micro-operations are already in the uop cache. If they are, most of the front-end work (from instruction fetch through to decoding) becomes unnecessary. This results in reduced use of the front-end units and an overall improvement in performance per watt. Note that the Sandy Bridge uop cache isn’t really comparable to the Pentium 4 trace cache: Sandy Bridge still has its L1 instruction cache and both caches work together, whereas on the Pentium 4 the trace cache replaces the L1I entirely and therefore represents a more complex implementation.
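To make the lookup described above more concrete, here is a minimal Python sketch of a front end that consults a small cache of decoded micro-operations before falling back to the decoders. Everything in it is an assumption made for illustration: the `UopCacheSketch` class, the address-keyed storage, the FIFO eviction and the `decode` stand-in do not describe Intel's actual organisation.

```python
from collections import OrderedDict


def decode(address):
    """Hypothetical stand-in for the fetch + decode stages."""
    return [f"uop@{address:#x}"]


class UopCacheSketch:
    """Toy model: decoded micro-ops are kept by address and reused on a hit."""

    def __init__(self, capacity):
        self.capacity = capacity          # number of cached entries (assumed)
        self.entries = OrderedDict()      # address -> list of micro-ops
        self.hits = 0
        self.decodes = 0

    def fetch(self, address):
        if address in self.entries:
            # Hit: the fetch and decode stages are bypassed entirely.
            self.hits += 1
            return self.entries[address]
        # Miss: go through the normal front end, then fill the cache.
        self.decodes += 1
        uops = decode(address)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[address] = uops
        return uops


cache = UopCacheSketch(capacity=4)
loop_body = [0x100, 0x104, 0x108]
for _ in range(10):                # a loop executed ten times
    for addr in loop_body:
        cache.fetch(addr)
```

In this toy run the three instructions of the loop are decoded once and the remaining 27 fetches are served from the cache, which is the kind of saving in front-end work that the uop cache is aiming for.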
Register set-up inspired by Pentium 4
For Sandy Bridge, the Intel engineers chose to use a Physical Register File (PRF), as on the Pentium 4. To understand exactly what this consists of and the reasons for this choice, we need to go back over a few of the concepts inherent in the x86 processors' register set-up.
x86 processors are characterised by their small number of general-purpose registers: 8 in 32-bit mode and 16 in 64-bit mode (for comparison, an IA64 processor such as the Itanium has 128). These are the eax, ebx … familiar to anyone who has seen x86 assembly code, and together they constitute the register file.
Yet modern CPUs use out-of-order (OoO) execution engines, that is to say engines which can process instructions in a different order from that of the assembly code written by the programmer or generated by a compiler. To make the OoO engine's work easier, CPUs have many more than 8 internal registers and resort to register renaming to maintain coherence between the CPU’s physical registers (internal and not visible to programmers) and the architectural registers (those the programmer can see: eax, ebx …) which they stand in for.
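Register renaming can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `RenamerSketch` class, its free list and the register counts are hypothetical, and physical registers are never released here, which a real processor obviously has to do.

```python
class RenamerSketch:
    """Toy register renamer: architectural names map to physical registers."""

    def __init__(self, arch_regs, num_physical):
        # Each architectural register starts mapped to its own physical register.
        self.rat = {name: i for i, name in enumerate(arch_regs)}
        # The remaining physical registers are free for renaming.
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        # Sources read the current mappings...
        src_phys = [self.rat[s] for s in sources]
        # ...and the destination gets a fresh physical register.
        new_phys = self.free.pop(0)
        self.rat[dest] = new_phys
        return new_phys, src_phys


r = RenamerSketch(["eax", "ebx", "ecx"], num_physical=8)
first = r.rename("eax", ["ebx"])     # eax <- f(ebx)
second = r.rename("ecx", ["eax"])    # ecx <- g(eax), depends on the first
third = r.rename("eax", ["ebx"])     # eax <- h(ebx), independent of the others
```

The first and third instructions both write eax, but after renaming they target different physical registers. The third can therefore execute without waiting for the second to read the old value of eax, which is exactly the freedom an out-of-order engine needs.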
In practice, the processor uses a Reorder Buffer (ROB), the role of which is to put instructions back in the order in which they appear in the programme after they have been executed, possibly in a different order. In P6-derived architectures (Core 2, Nehalem and Westmere), the ROB contains the result of each micro-operation in flight, accompanied by a register index that allows the correspondence between physical and architectural registers to be re-established. These results are then copied to a Retirement Register File (RRF), which holds the state of all the architectural registers after processing.
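The ROB + RRF scheme just described can be illustrated with a small Python model. It is only a sketch: the `DataRobSketch` class, its dictionary entries and the `copies` counter are assumptions introduced here, not a description of the real hardware structures.

```python
from collections import deque


class DataRobSketch:
    """Toy P6-style scheme: results live in the ROB, then get copied to the RRF."""

    def __init__(self, arch_regs):
        self.rrf = {name: 0 for name in arch_regs}   # committed architectural state
        self.rob = deque()                            # entries in programme order
        self.copies = 0                               # values copied at retirement

    def allocate(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.rob.append(entry)
        return entry

    def complete(self, entry, value):
        # Execution may finish in any order; the result is stored in the ROB.
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        # Retirement is strictly in order: stop at the first unfinished entry.
        while self.rob and self.rob[0]["done"]:
            entry = self.rob.popleft()
            self.rrf[entry["dest"]] = entry["value"]  # copy ROB -> RRF
            self.copies += 1


m = DataRobSketch(["eax", "ebx"])
a = m.allocate("eax")
b = m.allocate("ebx")
m.complete(b, 7)      # the younger instruction finishes first
m.retire()            # nothing retires: the older one is still pending
after_first = dict(m.rrf)
m.complete(a, 3)
m.retire()            # both retire, in programme order
```

Note that every retired result is moved once more, from the ROB to the RRF. With wide operands this copy is far from free.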
In its time, the Netburst architecture modified this scheme by using a file containing the CPU’s internal physical registers (Physical Register File, or PRF). The ROB then no longer contains any data from the micro-operations in flight, only pointers to the PRF. The obvious advantage is that each ROB entry takes up less space, so the ROB can hold more entries for the same capacity (the Sandy Bridge ROB has 168 entries, against 128 for Nehalem / Westmere). There is no longer an RRF, and coherence with the architectural registers is maintained via the Register Alias Table (RAT), the entries of which also point to the PRF. Thanks to these references to the PRF, the data copying stages of the ROB + RRF system are no longer required. This is a significant advantage for Sandy Bridge, as there is a lot of such data to move.
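The pointer-based scheme can be sketched in Python as well. Again this is a toy model under assumptions: the `PrfSketch` class, the `retire_rat` table and the free list are hypothetical names, and a single alias table is used for simplicity where a real design has to track speculative and committed mappings.

```python
from collections import deque


class PrfSketch:
    """Toy PRF scheme: the ROB holds pointers, values stay in one register file."""

    def __init__(self, arch_regs, num_physical):
        self.prf = [0] * num_physical                        # the only value storage
        self.retire_rat = {n: i for i, n in enumerate(arch_regs)}
        self.free = deque(range(len(arch_regs), num_physical))
        self.rob = deque()                                   # entries in programme order

    def allocate(self, dest):
        # The ROB entry records which physical register will hold the result.
        entry = {"dest": dest, "phys": self.free.popleft(), "done": False}
        self.rob.append(entry)
        return entry

    def complete(self, entry, value):
        self.prf[entry["phys"]] = value      # written once, never moved again
        entry["done"] = True

    def retire(self):
        while self.rob and self.rob[0]["done"]:
            entry = self.rob.popleft()
            old = self.retire_rat[entry["dest"]]
            self.retire_rat[entry["dest"]] = entry["phys"]   # update a pointer only
            self.free.append(old)                            # recycle the old register

    def read(self, name):
        return self.prf[self.retire_rat[name]]


p = PrfSketch(["eax", "ebx"], num_physical=6)
a = p.allocate("eax")
b = p.allocate("ebx")
p.complete(b, 7)      # the younger instruction finishes first
p.complete(a, 3)
p.retire()
```

Here retirement only changes small indices in the alias table. The results themselves stay where they were first written in the PRF, however wide they are.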
The use of a PRF on the Pentium 4 was motivated by Netburst's optimisation for instruction sets such as SSE and SSE2, which handle data in 128-bit chunks. Sandy Bridge ushers in the new AVX instruction set, whose operands can be up to 256 bits wide, which justifies Intel’s choice to equip its new architecture with a PRF.