The front-end unit
The front-end unit handles the supply of instructions to the rest of the processing pipeline. It plays an essential role in terms of performance as processing capacity can only be fully exploited if there’s a high and constant flow of instructions. The front-end of the basic Bulldozer module now has to be able to supply instructions to two cores so you can see what a key role the unit has in AMD’s new architecture.
Branching, or the jumps in the code, is the main source of breaks in the instruction flow, which is why modern architectures use branch prediction. Several complementary mechanisms are used to reach maximum efficiency. Bulldozer is subject to the same restrictions as any other architecture in terms of branching and uses most of the mechanisms to be found in Nehalem. These include a loop detector, management of direct and indirect branches, and a hybrid prediction mechanism which handles branches according to whether they’re global or local. There’s also a mechanism for the storage of return addresses. This is different to the BTBs (Branch Target Buffers), which store target addresses.
AMD also mentions a trace-cache (a cache containing micro-instructions that have already been decoded), which reduces penalties in the case of mispredicts. Note that such a cache is used in the loop detector in Nehalem.
The Bulldozer module has a single 64 KB L1 instruction cache, shared by both cores. This is a two-way set-associative structure, with one way for each core.
The Bulldozer decoding unit is bigger than the one used on K10, with a view to satisfying the needs of both cores. A Bulldozer module can decode up to 4 instructions per cycle, which is one more than K10. Introduced by Intel in the Core 2, branch fusion is used by AMD for the first time in Bulldozer. To recap, branch fusion consists in decoding a pair of instructions as a single instruction. For Bulldozer, the pairs concerned are a comparison or arithmetic test followed by a jump instruction. When such a pair turns up, the module can therefore decode up to 5 instructions per cycle.
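The effect of branch fusion on decode throughput can be sketched in a few lines of Python. This is a simplified model, not AMD's decoder: the mnemonic lists, the function name `decode_cycle` and the limit of one fused pair per cycle (inferred from the 4 and 5 figures above) are all assumptions.

```python
FUSIBLE_FIRST = {"cmp", "test"}   # assumed set of fusible compare/test mnemonics


def is_conditional_jump(mnemonic):
    # Simplification: any "j..." mnemonic except the unconditional jmp.
    return mnemonic.startswith("j") and mnemonic != "jmp"


def decode_cycle(instrs, width=4, max_fused=1):
    """Return how many instructions from the front of `instrs` fit in one
    decode cycle of `width` slots, where a fused pair uses a single slot."""
    slots = fused = taken = 0
    while taken < len(instrs) and slots < width:
        nxt = instrs[taken + 1] if taken + 1 < len(instrs) else None
        if (fused < max_fused and instrs[taken] in FUSIBLE_FIRST
                and nxt is not None and is_conditional_jump(nxt)):
            taken += 2            # compare + jump decoded together
            fused += 1
        else:
            taken += 1
        slots += 1
    return taken
```

With this model, `decode_cycle(["add", "mov", "cmp", "jne", "sub", "xor"])` consumes five instructions in one cycle, while a stream without any compare-and-jump pair is capped at four.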
OoO engine and processing units
During our study of the Sandy Bridge architecture, we talked about the change brought by the use of a physical register file (PRF). To recap, the physical register file is a table of working registers used by the out-of-order (OoO) execution engine, to which the re-order buffer (ROB) entries point. This pointer system means you can have a larger ROB than in a design where the ROB itself holds the data of the micro-operations. Bulldozer also uses a physical register file and, as with Sandy Bridge, the motivation for this choice lies in the size of the AVX instruction set operands.
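The pointer system described above can be illustrated with a small Python sketch. It is a deliberately simplified model (no source operands, no recovery after a mispredict), and all the names (`RobEntry`, `PrfCore`, the register count of 8) are hypothetical rather than taken from AMD or Intel documentation.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    dest: str            # architectural register being written
    phys_reg: int        # index into the PRF: a pointer, not the value
    done: bool = False


class PrfCore:
    def __init__(self, n_phys=8):
        self.prf = [0] * n_phys           # the only place result values live
        self.free = list(range(n_phys))   # physical registers not in use
        self.committed = {}               # architectural name -> physical index
        self.rob = []                     # in-flight operations, oldest first

    def allocate(self, dest):
        entry = RobEntry(dest, self.free.pop(0))
        self.rob.append(entry)
        return entry

    def writeback(self, entry, value):
        self.prf[entry.phys_reg] = value  # results may arrive out of order
        entry.done = True

    def retire(self):
        """Retire the oldest entry if it has completed; no data is copied."""
        if not self.rob or not self.rob[0].done:
            return False
        entry = self.rob.pop(0)
        old = self.committed.get(entry.dest)
        if old is not None:
            self.free.append(old)         # previous mapping can be recycled
        self.committed[entry.dest] = entry.phys_reg
        return True

    def read(self, arch_reg):
        return self.prf[self.committed[arch_reg]]
```

Each ROB entry here carries only a small index, whatever the width of the operand, which is the reason the approach scales well to 256-bit AVX registers.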
Each of the two x86 execution units in a Bulldozer module is made up of two ALUs (arithmetic logic units) and two AGUs (address generation units). Where the K10 architecture has three ALUs per core for a maximum of 3 instructions executed per cycle, the Bulldozer module offers a maximum throughput of 2 x 2 integer instructions per cycle. The theoretical raw performance of a Bulldozer module is therefore equal to (2 x 2) / (2 x 3) = 67% of that of a K10 dual core.
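For those who want to check the arithmetic, the comparison boils down to counting ALUs on each side: two cores with two ALUs each for the Bulldozer module, against two cores with three ALUs each for a K10 dual core. The helper name below is ours and the figure is a theoretical peak only.

```python
def peak_integer_ops(cores, alus_per_core):
    """Theoretical peak of integer instructions executed per cycle."""
    return cores * alus_per_core


bulldozer_module = peak_integer_ops(cores=2, alus_per_core=2)  # 4 per cycle
k10_dual_core = peak_integer_ops(cores=2, alus_per_core=3)     # 6 per cycle

ratio = bulldozer_module / k10_dual_core
print(f"{ratio:.0%}")  # prints 67%
```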
This is, however, the most unfavourable theoretical case for Bulldozer in comparison to K10, and AMD says that single-threaded IPC should improve in practice. You also have to keep in mind that what makes the module interesting isn’t pure performance but rather the ratio of performance to power consumed and, here, a Bulldozer module should prove a lot more efficient than two K10 cores.
The floating point unit is one of the resources shared by the two cores of a Bulldozer module. It consists of two 128-bit FMAC (fused multiply accumulate) type processing pipelines, which means that each unit can carry out a multiplication and an addition as a single operation, the building block of the dot product (often found in geometry engines and graphics processing). Apart from the gain in performance, the calculation also retains a high level of precision: the intermediate product isn’t rounded before the addition, so the result is rounded only once. These two units can be combined into one 256-bit unit for the processing of AVX instructions. Note that the Bulldozer FPU seems to be able to run in an “energy economy” mode by not operating on all the bits of the operands.
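The precision benefit of a fused multiply-add is easy to demonstrate in software. The sketch below emulates the single rounding with Python's exact `Fraction` arithmetic rather than calling a hardware FMA instruction; the function names and the input values are ours, picked so that the difference is visible.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate an FMA: compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Two operations: the product is rounded before the addition."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30   # exact product a*b is 1 - 2**-60
c = -1.0

print(separate_multiply_add(a, b, c))  # 0.0: the 2**-60 term is lost
print(fused_multiply_add(a, b, c))     # -2**-60, about -8.67e-19
```

In the separate version the product is rounded to exactly 1.0 before the addition, so the small term disappears entirely. The fused version keeps it because rounding happens only on the final result.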