AMD Bulldozer architecture - BeHardware
Written by Franck Delattre
Published on May 13, 2011
AMD is getting ready to relaunch itself on the x86 market. Although the K10 architecture still gives good performance, it’s beginning to show its limitations in the face of the competition, particularly Sandy Bridge, which is proving so effective on high-end desktop platforms.
For its new family of processors, AMD has invested in the development of two new micro-architectures: Bobcat and Bulldozer. While Bobcat targets the market currently dominated by Atom, namely ultra-low energy consumption platforms, Bulldozer has been designed for server and high end desktop platforms.
Bulldozer represents the first truly new AMD architecture since 2003. The main innovation of the architecture lies in its CMT (Cluster Multi-Threading) technology which constitutes a new approach in the compromise between energy consumption and performance in multi-threaded processing and this is what we’ll be looking at in detail in this report. AMD has also looked at certain areas that are particularly critical in terms of good performance in a modern microprocessor and for which K10 architecture doesn’t cover all the bases. Bulldozer now has branch prediction worthy of the name and in this sense moves in the same direction as Intel on its latest architectures. AMD is setting out its stall to offer a serious alternative to the best Intel processors, highlighting its use of original technologies.
An architecture which puts its money on shared resources and … high frequency design!
After the failure of Netburst, which was designed to increase performance through high clocks, it was thought that manufacturers would shy away for some time from architectures built around high frequencies. AMD’s K8 and K10 are both architectures with a high IPC (instructions per cycle), and delivered a clear improvement in raw performance over what Netburst could offer at the time. With Bulldozer, AMD has adopted a new strategy. As we’ll see further on in this article, with an equal number of cores, Bulldozer’s raw processing power is down on that of the K10 architecture. Many of Bulldozer’s distinguishing features are those of an architecture designed to run at high clocks: a processing pipeline cut into numerous stages, improved branch prediction (critical for the efficiency of such an architecture), a cache hierarchy combining small, low-latency L1s with large higher-level caches and, of course, careful thermal management.
Apart from this, the main innovation with Bulldozer consists in a very original design known as CMT (Cluster Multi-Threading) that mixes dedicated resources and resources shared between cores. Sharing hardware resources means economising on transistors and therefore the surface area of the chip and the power dissipated, at the same time as aiming to maintain performance levels close to those supplied by entirely dedicated resources. AMD hopes to combine efficient handling of multi-threading with high performance per watt.
CMT technology (Cluster Multi-threading)
Bulldozer partly overturns the definition of what a core is in terms of how cores are implemented in current x86 architecture. The new AMD architecture is based on what AMD is calling a “Bulldozer module” which combines two integer cores. At the heart of the module, the two cores share a certain number of components:
- the front-end, which groups the fetch and instruction decoding units as well as the L1 instruction cache that feeds them;
- the floating point unit;
- the L2 cache.
While sharing an L2 cache between the cores is nothing new, sharing the other units is. Up until now each core had its own front-end. AMD seems to have chosen which units to share well: the units that make up the front-end are complex and costly in terms of transistors and power. Sharing them allows a reduction in these two elements. The FPU is also frequently used at a rate of under 50%, making it pertinent to share it between two cores. In the end, a module is a lot smaller and consumes less energy than two full cores, yet it maintains a similar level of performance. AMD is claiming 80% of the performance of two full cores for 50% of the silicon area.
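To put AMD’s claim in perspective, here is a minimal sketch of the performance-per-area ratio it implies. The two input figures are AMD’s own claim as quoted above; the variable names are ours and purely illustrative.

```python
# AMD's claim: one module delivers 80% of the performance of two full
# cores while occupying 50% of their silicon area.
module_relative_performance = 0.80
module_relative_area = 0.50

# Performance per unit of silicon area, relative to two full cores (= 1.0).
performance_per_area = module_relative_performance / module_relative_area

print(f"Performance per area vs. two full cores: {performance_per_area:.1f}x")
```

In other words, if the claim holds, a given die area filled with modules would deliver about 1.6 times the multi-threaded throughput of the same area filled with full cores.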
A “Bulldozer” processor will therefore be made up of several of these modules, a memory controller, a bus controller and, on some models, an L3 cache. Of course AMD’s marketing language will not be based on modules but rather the number of cores. Thus the 8-core version of Bulldozer will be made up of 4 modules and Windows will see this as 8 logical processors.
Looking a little closer, you can see the Bulldozer module as a 4-way “super core”, partially cut into two so as to be able to process two threads in parallel. AMD’s choices are quite distinct from those made by Intel, who have retained 4-way hyper-threaded “super cores” (SMT). It’s difficult to say which method is the best, and the answer no doubt depends on the type of application you’re using. SMT has the advantage of being very flexible (a single thread can garner 100% of the core performance) and enables optimal use of the out-of-order (OoO) engine. CMT divides resources more sharply between the threads but gains nothing in flexibility. AMD says that a single thread running on a module enjoys all the shared resources… which is true, but half the dedicated resources remain unused.
Front-end, OoO, processing units
The front-end unit
The front-end unit handles the supply of instructions to the rest of the processing pipeline. It plays an essential role in terms of performance as processing capacity can only be fully exploited if there’s a high and constant flow of instructions. The front-end of the basic Bulldozer module now has to be able to supply instructions to two cores so you can see what a key role the unit has in AMD’s new architecture.
Branching, or jumps in the code, is the main source of breaks in the instruction flow, which is why modern architectures use branch prediction. Several complementary mechanisms are used to reach maximum efficiency. Bulldozer is subject to the same restrictions as any other architecture in terms of branching and uses most of the mechanisms to be found in Nehalem! This involves a loop detector, management of direct and indirect branches, as well as a hybrid prediction mechanism which manages branches according to whether they’re global or local. There’s also a mechanism for the storage of return addresses (this is different to BTBs (Branch Target Buffers), which store target addresses).
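AMD has not detailed the internals of Bulldozer’s predictors, so the following is only a textbook illustration of the basic building block found in most branch predictors: a 2-bit saturating counter. The function name and the test pattern are our own assumptions, not AMD’s design.

```python
def simulate_two_bit_predictor(outcomes, initial_state=2):
    """Count correct predictions made by a single 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# Hypothetical loop branch: taken 9 times, then not taken on loop exit,
# with the whole loop executed 10 times.
loop_branch = ([True] * 9 + [False]) * 10
hits = simulate_two_bit_predictor(loop_branch)
print(f"{hits} correct predictions out of {len(loop_branch)}")
```

With this pattern the counter only mispredicts the loop exit, which is exactly the case a dedicated loop detector is meant to catch.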
AMD also mentions a trace-cache (a cache containing micro-instructions that have already been decoded), which reduces penalties in the case of mispredicts. Note that such a cache is used in the loop detector in Nehalem.
The Bulldozer module has a single 64 KB L1 instruction cache. This is a two-way associative structure, with one way for each core.
The Bulldozer decoding unit is bigger than the one used on K10, with a view to satisfying the needs of both cores. A Bulldozer module can therefore decode up to 4 instructions per cycle, which is one more than K10. Introduced in the Core 2 by Intel, branch fusion is being used for the first time by AMD in Bulldozer. To recap, branch fusion consists of decoding a pair of instructions as a single instruction. For Bulldozer the pairings consist of a comparison or arithmetic test followed by a jump instruction. Thus, when such a pairing occurs, the module can decode up to 5 instructions per cycle.
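The effect of branch fusion on decode bandwidth can be sketched as follows. This is a simplified model: the set of mnemonics treated as fusible and the way a conditional jump is recognised are assumptions made for the example, not a list published by AMD.

```python
# Assumption: mnemonics standing in for "a comparison or arithmetic test".
FUSIBLE_FIRST = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    """Crude check for the example: any j* mnemonic other than jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def count_decode_slots(instructions):
    """Return the number of decode slots used once fusible pairs are merged."""
    slots = 0
    i = 0
    while i < len(instructions):
        has_next = i + 1 < len(instructions)
        if (instructions[i] in FUSIBLE_FIRST and has_next
                and is_conditional_jump(instructions[i + 1])):
            i += 2  # the compare and the jump share a single slot
        else:
            i += 1
        slots += 1
    return slots


sequence = ["mov", "add", "cmp", "jne", "mov"]
print(len(sequence), "instructions use", count_decode_slots(sequence), "decode slots")
```

Here five x86 instructions fit into four decode slots, which is how a 4-wide decoder reaches 5 instructions in a cycle.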
OoO engine and processing units
During our study of Sandy Bridge architecture, we talked about the change that using a physical register file (PRF) made. To recap, the physical register file is a table of working registers used by the out-of-order (OoO) execution engine, to which the re-order buffer (ROB) entries point. This pointer system means you can have a larger ROB in comparison to a system where the ROB itself contains the data for the micro-operations. Bulldozer also uses a physical register file and, as with Sandy Bridge, the motivation for this choice lies in the size of the AVX instruction set operands.
Each of the two x86 execution units in a Bulldozer module is made up of two ALUs (arithmetic logic units) as well as two AGUs (address generation units). Where each K10 core has three ALUs for a maximum of 3 instructions executed per cycle, the Bulldozer module offers a maximum rate of 2 x 2 full instructions per cycle. The theoretical raw performance of a Bulldozer module is therefore equal to (2 x 2) / (2 x 3) = 67% of that of a K10 dual core.
That said, this is the most unfavourable theoretical case for Bulldozer in comparison to K10, and AMD says that the single-threaded IPC should be improved in practice. You also have to keep in mind that what makes the module interesting isn’t pure performance but rather the ratio of performance to power consumed and, here, a Bulldozer module should prove itself a lot more efficient than two K10 cores.
The floating point unit is one of the resources shared by the two cores of a Bulldozer module. It consists of two 128-bit FMAC (fused multiply accumulate) type processing pipelines, which means that the units can carry out a multiplication and an addition (a x b + c) as a single operation, the basic step of a dot product (often found in geometry engines and graphics processing). Apart from the gain in performance, the calculation also retains a high level of precision: there’s no rounding between the two operations (multiply and add), so the result is only rounded once. These two units can be combined into one 256-bit unit for the processing of AVX instructions. Note that the Bulldozer FPU seems to be able to run in “energy economy” mode by not operating on all the bits of operands.
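The peak integer throughput comparison can be written out explicitly. The ALU counts come from the text; the variable names are ours, and the calculation deliberately ignores everything except the number of ALUs.

```python
# One Bulldozer module: 2 integer cores with 2 ALUs each.
bulldozer_cores_per_module = 2
bulldozer_alus_per_core = 2

# The comparison point: a K10 dual core, with 3 ALUs per core.
k10_cores = 2
k10_alus_per_core = 3

module_peak = bulldozer_cores_per_module * bulldozer_alus_per_core  # 4 per cycle
k10_peak = k10_cores * k10_alus_per_core                            # 6 per cycle

ratio = module_peak / k10_peak
print(f"Module peak vs. K10 dual core: {ratio:.0%}")  # prints 67%
```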
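The precision benefit of a fused multiply-add can be demonstrated in software. The sketch below does not use the hardware FMA instruction: it emulates a single final rounding with exact rational arithmetic, and the input values are chosen by us specifically to expose the difference.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate an FMA: compute a*b + c exactly, then round once to a float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Multiply, round, then add and round again (two roundings)."""
    return a * b + c


# Values chosen so that the exact product is 1 - 2**-60, which a separate
# multiplication rounds to exactly 1.0 in double precision.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("separate:", separate_multiply_add(a, b, c))  # the small term is lost
print("fused:   ", fused_multiply_add(a, b, c))     # the small term survives
```

With separate operations the intermediate rounding wipes out the result entirely, whereas the fused version returns the exact value of -2**-60.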
The memory sub-system
The caches
As we’ve already seen, the L1 instruction cache is shared by the two cores on the Bulldozer module. This cache is 64 KB and has a 2-way associative design (one way per core). Each of the two cores in the Bulldozer module has its own L1 data cache, 16 KB in size and with a 4-way associative design.
AMD has worked hard on the performance of these L1D caches, an essential condition for optimal performance in a high-frequency architecture (remember that the L1D on the Pentium 4 Northwood had a record latency of 2 cycles). AMD has used the ‘replay’ mechanism already used by Intel on the Pentium 4. ‘Replay, logic-track, re-execute’ is a prediction technique that speculates on which way of the cache holds the required data. In the case of misprediction, which is to say if the wrong piece of data is pre-extracted, only the instructions involved are executed again. Bulldozer’s L1D should thus have a 4-cycle latency.
Still looking at the module as a whole, a 16-way associative 2 MB L2 cache is shared between the two cores. Some implementations of Bulldozer will use an L3 cache shared between the modules that make up the processor. This L3 could be up to 8 MB in size, with a 64-way associative design.
The relationship between Bulldozer’s cache levels has moved on from previous architectures. Traditionally, AMD implements exclusive relationships between cache levels, meaning that two successive cache levels do not contain the same data (the L2 for example contains the data that has been evicted from the L1). To understand what has changed with Bulldozer, you have to look at how the data is updated in the caches.
The L1D in Bulldozer is write-through (WT), which is to say that when the data is modified locally the new value is updated in the L1D and in the L2. The immediate consequence of this is the inclusive relationship of the two caches: any data written by the L1D is also written in the L2. You may wonder why AMD has gone for the write-through policy, which doesn’t perform as well as the write-back (WB) method in which data is only written to the L2 when the line is evicted from the L1, allowing some of the writes to be deferred.
The reason for making the Bulldozer caches WT is to reduce the latency in the case of a cache miss. If there’s a cache miss, an L1 line is evicted and a WB relationship would trigger writing of the line to the L2. In WT mode however, the write has already taken place at the time the data was written to the L1, so no operation is then required. WT also guarantees that the data between the cache levels is identical, which simplifies coherence.
The WT mode does however multiply writes to the L2 cache, which takes up a lot of its bandwidth. To alleviate this problem, AMD has included a small cache, the write coalescing cache (WCC), designed to receive L1D writes in Bulldozer. The WCC stores the successive writes and once it’s full, sends the data to the L2 in one write.
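The trade-off between the three schemes can be illustrated with a toy model that simply counts writes reaching the L2. Everything here is an assumption made for the example: the 64-byte line size, the tiny 4-line capacities, the LRU replacement and the drain policy (one L2 write per buffered line). AMD has not published the WCC’s actual behaviour in this detail.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumption)


def line_of(address):
    return address // LINE_SIZE


def l2_writes_write_through(stores):
    """Plain write-through: every store is forwarded to the L2."""
    return len(stores)


def l2_writes_write_back(stores, l1_lines=4):
    """Write-back: the L2 is written only when a dirty line leaves the L1."""
    l1 = OrderedDict()  # dirty lines, least recently used first
    writes = 0
    for address in stores:
        line = line_of(address)
        if line in l1:
            l1.move_to_end(line)
        else:
            if len(l1) == l1_lines:
                l1.popitem(last=False)  # evict the LRU line, which is dirty
                writes += 1
            l1[line] = True
    return writes + len(l1)  # flush the lines still dirty at the end


def l2_writes_coalesced(stores, buffer_lines=4):
    """Write-through behind a small buffer that merges stores to the same line."""
    pending = set()
    writes = 0
    for address in stores:
        line = line_of(address)
        if line not in pending and len(pending) == buffer_lines:
            writes += len(pending)  # buffer full: drain it to the L2
            pending.clear()
        pending.add(line)
    return writes + len(pending)  # drain what is left at the end


# Hypothetical workload: 64 consecutive 8-byte stores, i.e. 8 cache lines.
stores = [8 * i for i in range(64)]
print("write-through:", l2_writes_write_through(stores))
print("write-back:   ", l2_writes_write_back(stores))
print("WT + buffer:  ", l2_writes_coalesced(stores))
```

On this sequential pattern the coalescing buffer brings write-through traffic down to the same level as write-back, which is the point of the WCC.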
AMD describes the L3 cache as a “non-inclusive victim cache”: a victim cache because the data in the L3 has been evicted from the L2. When data is read in the L3 it is sent back to the L1D of the core concerned. At this stage, it’s important to note that this data in the L1D is not necessarily also written to the L2, so the relationship between the two caches isn’t 100% inclusive. So what? Well, when the relationship between caches is not fully inclusive, maintaining coherence requires cache snooping, which is something we spoke about in our study of Nehalem. Substantial snoop traffic is extremely undesirable: it carries power and performance costs and is one of the major faults of the K10 cache sub-system, especially when there are a lot of cores. AMD hasn’t revealed whether Bulldozer has a workaround for this problem or not.
The integrated memory controller
The integrated memory controller in the desktop variants of Bulldozer supports DDR3-1866 (933 MHz, or PC3-15000) on two 64-bit channels, while the Phenom IIs were officially limited to DDR3-1333. The server models can support four DDR3-1600 channels. Note that the controller has a data prefetcher, which doesn’t send the data to the processor caches but has its own storage buffer.
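These figures translate into theoretical peak bandwidth as follows. The helper function is ours, and the server channels are assumed to be 64 bits wide like the desktop ones; real-world throughput will be lower than these peaks.

```python
def peak_bandwidth_gb_s(mega_transfers_per_second, channels, channel_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channel_width_bits / 8
    return mega_transfers_per_second * bytes_per_transfer * channels / 1000


configurations = {
    "Phenom II, 2 x DDR3-1333": peak_bandwidth_gb_s(1333, channels=2),
    "Bulldozer desktop, 2 x DDR3-1866": peak_bandwidth_gb_s(1866, channels=2),
    "Bulldozer server, 4 x DDR3-1600": peak_bandwidth_gb_s(1600, channels=4),
}

for name, bandwidth in configurations.items():
    print(f"{name}: {bandwidth:.1f} GB/s")
```

On paper, the move from DDR3-1333 to DDR3-1866 raises desktop peak bandwidth from about 21.3 GB/s to about 29.9 GB/s.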
Instructions, power management
Instruction set
Bulldozer supports all current x86 instruction sets, including AVX (Advanced Vector Extensions), which was introduced by Intel with Sandy Bridge. To recap, AVX is a SIMD instruction set that operates on floating point numbers with operands of up to 256 bits. One of the advantages of the new instruction set lies in the existence of instructions designed to facilitate the arrangement of data in the 256-bit registers, thus making developers’ work easier.
Some instructions specific to Bulldozer are introduced, grouped under the following names: XOP, FMA4 and CVT16. These instruction sets actually correspond to SSE5 (announced by AMD in 2007 but never implemented) adapted to the AVX format. XOP operates mainly on integer operands, FMA4 on 128-bit floating point numbers, and CVT16 groups instructions that convert high-precision floating point values to medium- and low-precision formats.
Note that AMD has been forced to use a different coding for its specific instructions to that used by Intel, so as to avoid any interference with any future extension of AVX. It can only be hoped for AMD that the compilers adopt this specificity, otherwise nobody will use these instructions.
Power management
As we’ve seen, the Bulldozer architecture is based on a module that has been designed to supply 80% of the performance of two cores for half the energy consumption. This qualifies the Bulldozer architecture as economical in power terms. It also has its share of mechanisms designed to reduce the power consumption of the processor in use.
In addition to PowerNow! (the equivalent of Intel’s SpeedStep), which operates on the processor as a whole, Bulldozer has finer-grained power gating. Apart from the fact that certain units can be put on standby, there are some other interesting features, notably a “low consumption” (and therefore low precision) FPU compute mode. The caches can also be put on standby and, for the biggest (L3 in particular), standby can be applied to individual sectors only. This is a mechanism Intel has had since the Pentium M!
Note that some implementations of Bulldozer include a Turbo mode, with power consumed constantly monitored by the processor.
The first Bulldozer models in a few figures
We’ll first see Bulldozer on the FX desktop platform, codename Zambezi. Let’s take a look at the main specifications currently known, or estimated (clocks). Note that they might vary on launch and are only given here as an indication:
- Available with 4 (FX-4110), 6 (FX-6110) and 8 (FX-8110 and FX-8130P) cores, or 2, 3 and 4 modules respectively;
- Socket AM3+;
- 32 nm SOI manufacturing (Silicon On Insulator) with HKMG (High-K Metal Gate technology);
- 213 million transistors per module;
- HyperTransport 3.1 (3.20 GHz, 25.6 GB/s, 16-bit links up and down);
- L3 up to 8 MB;
- Base clock: 3.2 GHz (and over?);
- Turbo Core: up to 1 GHz, and up to 500 MHz with both cores running;
- TDP max of 95 and 125 Watts depending on the model;
- Module voltage between 0.8 and 1.3 volts;
- Advanced power management (clock & power gating, L3 sectoring).
The question of backwards compatibility with Socket AM3 has been much discussed and AMD’s official position is that AM3+ CPU support is only guaranteed on AM3+ motherboards. This hasn’t however stopped manufacturers such as ASUS and MSI from announcing lists of AM3 cards compatible with AM3+ processors, which will delight anyone who owns them. AMD is however saying that not all Bulldozer features, notably in the domain of power management, will be supported on these motherboards, but hasn’t given full details. AM3+ motherboards based on the AMD 800 or 900 chipset are already available from most manufacturers.
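The HyperTransport figure quoted in the specifications above can be cross-checked with a short calculation. It assumes the usual HyperTransport arrangement of two transfers per clock cycle (double data rate) and one 16-bit link in each direction; the function name is ours.

```python
def hypertransport_bandwidth_gb_s(clock_ghz, link_width_bits):
    """Return (per-direction, aggregate) peak bandwidth in GB/s."""
    giga_transfers_per_second = clock_ghz * 2  # double data rate
    per_direction = giga_transfers_per_second * link_width_bits / 8
    return per_direction, per_direction * 2  # one link up, one link down


per_direction, aggregate = hypertransport_bandwidth_gb_s(3.2, 16)
print(f"{per_direction:.1f} GB/s per direction, {aggregate:.1f} GB/s aggregate")
```

The 25.6 GB/s in the list is therefore the aggregate of both directions, with 12.8 GB/s available each way.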
Two versions of Opteron are also likely to be released this summer. Interlagos is the codename of the high-end version, designed for dual- and quad-socket machines. This is a dual-chip processor (two dies on the same package) on socket G34 (1974 pins) with between 8 and 16 cores, supporting up to four channels of DDR3-1600. Valencia is the single-die version for socket C32 (1207 pins), designed for single- and dual-socket machines. Here there will be 6 to 8 cores and two DDR3 channels.
We’ll have to wait for 2012 and the Komodo APU to see Bulldozer architecture on laptops.
Conclusion
AMD is taking a big risk with its new range of products, particularly with its Bulldozer architecture which will have to take on Sandy Bridge now, and Sandy Bridge-E in the near future, and then Ivy Bridge further down the line.
The AMD architecture nevertheless holds some strong cards and looks capable of making quite an impact. The flexibility of the module structure means we should see versions with more than 8 cores in the not too distant future and if the clocks are high enough, performance levels will be good, especially in multithreaded processing.
We will nevertheless need to be patient. While the first samples of Zambezi are starting to filter out, its launch is unlikely to come in July. We’ll have to wait a little longer before we can judge the new architecture in practice and see if AMD’s gamble has been worthwhile. Watch this space!