Sandy Bridge: the cores

With AMD running two years late on integrating the CPU and GPU, it is in fact Intel that looks set to be first to market with such a product, in the form of Sandy Bridge. To recap, the current Core i3/i5s only place the CPU and the northbridge in the same package, to reduce platform cost. With Sandy Bridge, integration is complete and the two components become one, which allows numerous optimisations in terms of both performance and energy consumption.

Sandy Bridge also marks the arrival of a new generation of CPU cores. Alongside a new instruction set, AVX, they benefit from small improvements at many levels, which Intel detailed at this IDF. Most of the architectural modifications are linked to AVX and were made to make it efficient. Energy consumption is also central to all these changes, with Intel aiming to increase performance per watt with each improvement made.

To recap, AVX is a 256-bit vector instruction set which, with an ideal implementation, doubles the processing throughput of 128-bit SSE. In this first implementation, 256-bit floating point operations execute at full speed, while AVX operations on integers are split into two 128-bit operations. To feed these 256-bit floating point units without pushing the cost and power envelope too far, Intel has reorganised its execution units so that they can reuse parts (mainly the datapaths) of the units dedicated to integer operations.

That isn’t all, as the current architecture can only load and store 128 bits of data per cycle. Intel has removed certain store and cache bandwidth constraints on Sandy Bridge, enabling it to load 256 bits and store 128 bits per cycle. Each cycle, this architecture can therefore execute a 256-bit floating point multiply, a 256-bit floating point add and a 256-bit load, which doubles the throughput of the current architecture.
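
The doubling described above is simple arithmetic, and a short sketch can make it concrete. The function names below are illustrative only, and the figures are theoretical per-cycle peaks derived from the vector widths in the text, assuming one vector multiply and one vector add issue in the same cycle. They are not measurements.

```python
# Toy arithmetic only: theoretical per-cycle peaks implied by the vector widths.
# Function names are hypothetical and chosen for this illustration.

def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits, element_bits):
    """Assumes one vector multiply plus one vector add issue every cycle."""
    return 2 * lanes(vector_bits, element_bits)


if __name__ == "__main__":
    for name, bits in (("128-bit SSE", 128), ("256-bit AVX", 256)):
        sp = peak_flops_per_cycle(bits, 32)  # single precision, 32-bit elements
        dp = peak_flops_per_cycle(bits, 64)  # double precision, 64-bit elements
        print(f"{name}: {sp} single-precision, {dp} double-precision FLOPs/cycle")
```

Under these assumptions the single-precision peak goes from 8 to 16 floating point operations per cycle, and the double-precision peak from 4 to 8.
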

To optimise the Sandy Bridge out-of-order hardware, Intel has opted for a Physical Register File (PRF), just as it did with the Pentium 4 and as AMD is doing with Bulldozer. In contrast to the Retirement Register File used in current CPU cores, the PRF keeps every value in one place and tracks it through register renaming, so data no longer has to be moved from one structure to another. This is beneficial in terms of power consumption and has given Intel more room for manoeuvre in enlarging the instruction window within which the order of execution can be optimised: it grows from 128 to 168 uops.
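
To illustrate the idea of renaming onto a physical register file, here is a minimal sketch. It is a toy model, not Intel's design: the class, its methods and the `data_moves` counter are all hypothetical, and real hardware adds speculation recovery and much more. The point it shows is that a result is written once, and retiring the instruction only recycles an index.

```python
class PhysicalRegisterFile:
    """Toy PRF: values sit in one array; architectural names are just pointers."""

    def __init__(self, num_physical, arch_regs):
        # num_physical must be larger than the number of architectural registers.
        self.values = [0] * num_physical
        self.free = list(range(num_physical))
        # Rename table: architectural register name -> physical register index.
        self.rename = {name: self.free.pop(0) for name in arch_regs}
        self.data_moves = 0  # how many times a value is actually written

    def write(self, arch_reg, value):
        """Place a new result in a fresh physical register and remap the name."""
        new = self.free.pop(0)
        self.values[new] = value
        self.data_moves += 1  # the single write that produces the result
        old = self.rename[arch_reg]
        self.rename[arch_reg] = new
        return old  # previous mapping, to be recycled at retirement

    def retire(self, old_phys):
        """Retirement only returns an index to the free list; no data is copied."""
        self.free.append(old_phys)

    def read(self, arch_reg):
        return self.values[self.rename[arch_reg]]


if __name__ == "__main__":
    prf = PhysicalRegisterFile(8, ["rax", "rbx"])
    old = prf.write("rax", 42)
    prf.retire(old)
    print(prf.read("rax"), prf.data_moves)
```

In a retirement register file scheme, the retire step would instead copy the value into a separate architectural register array, which is the data movement the text says the PRF avoids.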

Still with the aim of keeping this execution core fed, Intel has revisited the front end, first of all adding a cache for decoded instructions. This also has the advantage of reducing power consumption, as the decoding logic can rest more often. The hit rate is given as 80% for most applications, which means that most of the time Sandy Bridge will be able to process more uops than Nehalem/Westmere while also reducing energy consumption. Lastly, Intel says that the branch prediction unit has been entirely reworked, with a longer history and improved identification of paths.

Sandy Bridge: the cache

A new cache structure has been implemented, both with the graphics part in mind and to make the design more modular. Intel is no longer talking about an L3 cache but rather a Last Level Cache (LLC), because from the graphics point of view the cache isn’t at the same level as it is for the CPU cores. This name removes any ambiguity there may have been in this respect.

This LLC is split into segments of 2 MB per CPU core. A ring bus interconnects all the LLC segments, the graphics controller and the System Agent, through which access to the memory controller is handled. Intel says that this ring bus has the advantage of not increasing die size, as it is routed above the LLC. Note that it is in fact made up of four independent rings: a 256-bit data ring, a request ring, an acknowledge ring and a snoop ring.

In terms of implementation, you can see in the image that instead of building a bi-directional ring bus, Intel runs the ring twice through each cache segment and therefore each core. There are also two access points for the graphics, but the System Agent makes do with one. Transmission latency is therefore lowered, as the shortest path on the ring is always picked, without the added weight of a bi-directional bus. The graphics core has been placed at the opposite end from the memory controller because, on a cache miss, it is the component least sensitive to main memory latency.

Latency on LLC accesses varies according to where the data is (which LLC segment), as each stop on the ring bus costs one cycle. In Sandy Bridge the LLC runs at the core clock, in contrast to the current architecture.
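
As a rough illustration of how latency depends on the segment being accessed, the sketch below counts ring stops between two points. The stop order is an assumption for a hypothetical quad-core part, based on the layout described above (System Agent at one end, graphics at the other, LLC segments in between), and all names are illustrative labels. Only the ring travel is modelled, not the base cache access time.

```python
# Hypothetical stop order for a quad-core part, following the layout in the text.
STOPS = ["system_agent", "llc0", "llc1", "llc2", "llc3", "graphics"]
CYCLES_PER_STOP = 1  # from the text: each stop on the ring costs one cycle


def ring_cycles(src, dst):
    """Cycles spent travelling between two stops along the shortest path.

    Assumes the ring is folded so that it passes each LLC segment in both
    directions, which makes the shortest path the distance between positions.
    """
    return abs(STOPS.index(src) - STOPS.index(dst)) * CYCLES_PER_STOP


if __name__ == "__main__":
    # Extra ring cycles seen by the core sitting next to segment llc0.
    for segment in ("llc0", "llc1", "llc2", "llc3"):
        print(segment, ring_cycles("llc0", segment))
```

In this toy model a core reaches its own segment with no ring travel, while the farthest segment on a quad-core adds three cycles.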