AMD Bulldozer architecture
by Franck Delattre
Published on July 13, 2011

The caches
As we’ve already seen, the L1 instruction cache is shared by the two cores of the Bulldozer module. This cache is 64 KB in size and 2-way set associative (one way per core). Each of the two cores in the module has its own L1 data cache, 16 KB in size and 4-way set associative.

AMD has worked hard on the performance of these L1D caches, an essential condition for the optimal performance of a high-frequency architecture (remember that the L1D on the Pentium 4 Northwood had a record latency of 2 cycles). AMD has adopted the ‘replay’ mechanism already used by Intel on the Pentium 4. ‘Replay, logic-track, re-execute’ is a prediction technique that speculates on which way of the cache holds the required data. In the case of a misprediction, which is to say if the wrong piece of data is fetched, only the instructions involved are executed again. Bulldozer’s L1D should thus have a latency of 4 cycles.

Still looking at the module as a whole, a 16-way associative 2 MB L2 cache is shared between the two cores. Some implementations of Bulldozer will use an L3 cache shared between the modules that make up the processor. This L3 could be up to 8 MB in size, with a 64-way associative design.
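
To put these figures in perspective, the number of sets in each cache follows from capacity / (associativity x line size). The sketch below applies that formula to the sizes quoted above. The 64-byte line size is our assumption (the article does not state it), and the `num_sets` helper is purely illustrative.

```python
# Sets in a set-associative cache = capacity / (ways * line size).
# ASSUMPTION: 64-byte cache lines; the article does not give the line size.
LINE_BYTES = 64

KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, line_bytes=LINE_BYTES):
    """Return the number of sets for the given cache geometry."""
    return capacity_bytes // (ways * line_bytes)


# (capacity, associativity) as quoted in the text; the L3 entry uses the
# maximum size mentioned ("up to 8 MB").
caches = {
    "L1I": (64 * KB, 2),
    "L1D": (16 * KB, 4),
    "L2": (2 * MB, 16),
    "L3": (8 * MB, 64),
}

sets = {name: num_sets(size, ways) for name, (size, ways) in caches.items()}

for name, count in sets.items():
    print(f"{name}: {count} sets")
```

Under that assumption the small L1D has only 64 sets, which keeps the lookup short, something that matters for a cache that has to answer in a handful of cycles.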

The relationship between Bulldozer’s cache levels has moved on from previous architectures. Traditionally, AMD implements exclusive relationships between cache levels, meaning that two successive cache levels do not contain the same data (the L2 for example contains the data that has been evicted from the L1). To understand what has changed with Bulldozer, you have to look at how the data is updated in the caches.

The L1D in Bulldozer is write-through (WT), which is to say that when data is modified locally, the new value is written to both the L1D and the L2. The immediate consequence is an inclusive relationship between the two caches: any data written to the L1D is also written to the L2. You may wonder why AMD has gone for the write-through policy, which generally doesn’t perform as well as the write-back (WB) method, in which data is only written to the L2 when the line is evicted from the L1, allowing some of the writes to be deferred.

The reason for making the Bulldozer L1D write-through is to reduce latency in the case of a cache miss. On a miss, an L1 line has to be evicted to make room, and with a WB policy a modified line would first have to be written to the L2. In WT mode, however, that write already took place when the data was written to the L1, so no further operation is required at eviction. WT also guarantees that the data in the two cache levels is identical, which simplifies coherence.
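
The trade-off described above can be illustrated with a toy model. This is a minimal sketch, not a description of AMD's hardware: the `TinyL1` class, its direct-mapped layout, the four-line capacity and the write-allocate behaviour are all our own assumptions. It only counts when writes reach the L2: at store time (write-through) or at eviction time (write-back).

```python
class TinyL1:
    """Hypothetical direct-mapped L1 that counts the writes it sends to the L2."""

    def __init__(self, num_lines, write_through):
        self.num_lines = num_lines
        self.write_through = write_through
        self.tags = [None] * num_lines    # line address held in each slot
        self.dirty = [False] * num_lines  # only used by the write-back policy
        self.l2_writes_on_store = 0
        self.l2_writes_on_evict = 0

    def _fill(self, line_addr):
        slot = line_addr % self.num_lines
        if self.tags[slot] != line_addr:
            # Miss: whatever line occupies the slot is evicted.
            if self.tags[slot] is not None and self.dirty[slot]:
                self.l2_writes_on_evict += 1  # write-back pays on the miss path
            self.tags[slot] = line_addr
            self.dirty[slot] = False
        return slot

    def load(self, line_addr):
        self._fill(line_addr)

    def store(self, line_addr):
        slot = self._fill(line_addr)  # ASSUMPTION: write-allocate on a store miss
        if self.write_through:
            self.l2_writes_on_store += 1  # write-through pays on every store
        else:
            self.dirty[slot] = True


def run(trace, write_through):
    l1 = TinyL1(num_lines=4, write_through=write_through)
    for op, line_addr in trace:
        getattr(l1, op)(line_addr)
    return l1.l2_writes_on_store, l1.l2_writes_on_evict


# Three stores to line 0, then a load of line 4, which maps to the same slot.
trace = [("store", 0), ("store", 0), ("store", 0), ("load", 4)]
wt_counts = run(trace, write_through=True)
wb_counts = run(trace, write_through=False)

print("write-through (at store, at eviction):", wt_counts)
print("write-back    (at store, at eviction):", wb_counts)
```

In this trace the write-through cache sends three writes to the L2 but has nothing left to do when the miss occurs, while the write-back cache sends a single write, at exactly the moment the core is waiting on the miss.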

The WT mode does, however, multiply writes to the L2 cache, which takes up a lot of its bandwidth. To alleviate this problem, AMD has included a small cache, the write coalescing cache (WCC), designed to receive the L1D writes. The WCC accumulates successive writes and, once it is full, sends the data to the L2 in one write.
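
The idea of coalescing can be sketched as follows. AMD has not detailed the WCC's size or flush policy, so everything here is an assumption made for illustration: the `ToyWriteCoalescer` class, the 64-byte line size, the four-line capacity and the flush-when-full rule.

```python
LINE_BYTES = 64  # ASSUMPTION: line size, not given in the article


class ToyWriteCoalescer:
    """Hypothetical buffer that merges stores to the same line before the L2."""

    def __init__(self, capacity_lines):
        self.capacity_lines = capacity_lines
        self.pending = {}  # line address -> set of byte offsets written
        self.l2_writes = 0

    def store(self, byte_addr):
        line, offset = divmod(byte_addr, LINE_BYTES)
        # ASSUMPTION: when a new line arrives and the buffer is full,
        # the whole buffer is flushed to make room.
        if line not in self.pending and len(self.pending) == self.capacity_lines:
            self.flush()
        self.pending.setdefault(line, set()).add(offset)

    def flush(self):
        # Each merged line costs a single write to the L2.
        self.l2_writes += len(self.pending)
        self.pending.clear()


# Sixteen 4-byte stores walking through one 64-byte line.
stores = list(range(0, 64, 4))

wcc = ToyWriteCoalescer(capacity_lines=4)
for addr in stores:
    wcc.store(addr)
wcc.flush()

uncoalesced_writes = len(stores)  # plain write-through: one L2 write per store
print("L2 writes without coalescing:", uncoalesced_writes)
print("L2 writes with coalescing:   ", wcc.l2_writes)
```

With this access pattern the sixteen stores collapse into a single L2 write, which is the kind of bandwidth saving the WCC is meant to provide.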

AMD describes the L3 cache as a “non-inclusive victim cache”: a victim cache because the data in the L3 has been evicted from the L2. When data is read from the L3, it is sent back to the L1D of the core concerned. At this stage, it’s important to note that this data in the L1D is not necessarily also written to the L2, so the relationship between the two caches isn’t 100% inclusive. So what? Well, when the relationship between caches is not fully inclusive, maintaining coherence requires cache snooping, which is something we spoke about in our study of Nehalem. Substantial snoop traffic is highly undesirable because it costs both power and performance, and it is one of the major faults of the K10 cache sub-system, especially when there are a lot of cores. AMD hasn’t revealed whether or not Bulldozer has a workaround for this problem.

The integrated memory controller
The integrated memory controller in the desktop variants of Bulldozer supports DDR3-1866 (933 MHz, or PC3-14900) on two 64-bit channels, while the Phenom IIs were officially limited to DDR3-1333. The server models can support four DDR3-1600 channels. Note that the controller has a data prefetcher, which doesn’t send the data to the processor caches but has its own storage buffer.
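
To give an idea of what these configurations mean in terms of theoretical peak bandwidth, the sketch below multiplies the transfer rate by the channel width and the number of channels. The `peak_gb_per_s` helper is ours, it uses the nominal rates from the DDR3 names (1866, 1333 and 1600 MT/s), and it assumes the server channels are 64 bits wide like the desktop ones. Real-world throughput is lower.

```python
def peak_gb_per_s(mega_transfers_per_s, channels, bus_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000


# Nominal transfer rates taken from the DDR3 speed grade names.
bulldozer_desktop = peak_gb_per_s(1866, channels=2)  # DDR3-1866, dual channel
phenom_ii = peak_gb_per_s(1333, channels=2)          # DDR3-1333, dual channel
bulldozer_server = peak_gb_per_s(1600, channels=4)   # DDR3-1600, quad channel

print(f"Bulldozer desktop: {bulldozer_desktop:.1f} GB/s")
print(f"Phenom II:         {phenom_ii:.1f} GB/s")
print(f"Bulldozer server:  {bulldozer_server:.1f} GB/s")
```

On paper, the move from DDR3-1333 to DDR3-1866 gives the desktop parts roughly 40% more peak memory bandwidth.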

Copyright © 1997- Hardware.fr SARL. All rights reserved.