Report: Intel Nehalem architecture
by Franck Delattre
Published on October 28, 2008

The new cache hierarchy

In a monolithic multi-core design, a shared cache keeps the data handled by each core coherent. The Core 2 accordingly integrates a large L2 cache shared between its two cores. Quad-core Core 2 processors, however, combine two such dual-core dies and rely on the processor bus to maintain coherence between them, which is not optimal for performance.

Source: Chip Architect

It is therefore almost natural to find a large cache shared between the four cores of the Nehalem. However, things are much more complicated with four cores than with two. A cache cannot serve intensive requests from four cores without significant latency, unless its technical characteristics are improved, and that implies a level of complexity beyond what a consumer processor allows. The economical solution is to reduce the number of requests that reach the shared cache. To do this, Intel inserted a small 256 KB cache between the L1 of each core and the shared cache. These four 256 KB caches do not take up much space on the chip, and their small size guarantees speed. On the other hand, such a size does not yield record hit rates, but that is not the goal. If each L2 offers a hit rate of only 50% (which is pessimistic), every other request never reaches the shared cache, and everything happens as if the requests came from only two of the four cores. And with only two cores, we already know that this works fine.
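The arithmetic behind this argument can be sketched in a few lines of Python. This is only an illustration: the function name `shared_cache_load` and the idea of counting traffic in units of "one core's L1 misses" are our own assumptions, and the Core 2 is modeled simply as two cores with no private filter cache in front of the shared cache.

```python
def shared_cache_load(cores: int, l2_hit_rate: float) -> float:
    """Traffic reaching the shared cache, in units of one core's L1 misses.

    Hypothetical model: every core sends the same amount of L1-miss traffic,
    and a private L2 absorbs the fraction `l2_hit_rate` of it.
    """
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("l2_hit_rate must be between 0 and 1")
    return cores * (1.0 - l2_hit_rate)


# Four cores, each behind a private L2 with a pessimistic 50% hit rate.
four_cores_filtered = shared_cache_load(cores=4, l2_hit_rate=0.5)

# Two cores talking directly to the shared cache (no private filter).
two_cores_unfiltered = shared_cache_load(cores=2, l2_hit_rate=0.0)

print(four_cores_filtered, two_cores_unfiltered)
```

Under these assumptions both configurations put the same load on the shared cache, which is the point of the paragraph above.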

L1

The first-level (L1) caches of each Nehalem core have the same sizes as on the Core 2: 32 KB for data and 32 KB for instructions. Doing away with the micro-TLB, which we mentioned earlier, unfortunately translates into a slight increase in L1 access time. The latency of the data L1 thus rises to 4 cycles (versus 3 on the Core 2). For the instruction L1, Intel chose to favor latency at the expense of associativity. Managing cache ways takes time, and the more ways there are, the longer it takes. By reducing the associativity of the L1 instruction cache from 8 to 4 ways, Intel keeps a latency of 3 cycles as on the Core 2, despite the absence of a micro-TLB. Why this choice? Because an instruction cache is more sensitive to latency than a data cache. Data cache latency can be at least partially compensated for by the out-of-order (OOO) engine, which reorders instructions to mask latency (the Nehalem’s has been considerably improved), whereas every access to the instruction cache suffers directly from higher latency, in particular accesses made by the branch prediction mechanisms. The instruction L1 thus has less to lose from reduced associativity than from increased latency. In the end, it is a compromise.
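To make the associativity trade-off concrete, here is a small sketch of how a set-associative cache of fixed size is organized. The 64-byte line size is an assumption on our part (the article does not state it), and `cache_geometry` is a hypothetical helper, not anything from Intel’s documentation.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int = 64) -> dict:
    """Return the set count and per-lookup tag comparisons of a cache.

    `line_bytes=64` is an assumed line size, not a figure from the article.
    """
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line size")
    return {
        "sets": size_bytes // (ways * line_bytes),
        # Each lookup selects one set, then compares the tag of every way in it.
        "tags_compared_per_lookup": ways,
    }


L1_SIZE = 32 * 1024  # 32 KB, as stated in the article

print(cache_geometry(L1_SIZE, ways=8))  # 8-way instruction L1 (Core 2)
print(cache_geometry(L1_SIZE, ways=4))  # 4-way instruction L1 (Nehalem)
```

With the size held constant, halving the number of ways doubles the number of sets and halves the tags compared on each lookup, which is where the time saving comes from.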

Finally, the Nehalem’s L1 can handle more cache misses in parallel than the Core 2’s. This follows from the extra bandwidth offered by the integrated memory controller: a cache miss means a memory access, and the average time between two memory requests decreases. The increase is particularly useful for SMT, since two threads generate more cache misses than a single one.

Inclusive cache

The Nehalem’s cache hierarchy inevitably brings the Phenom’s to mind. However, the resemblance stops at the number of levels, because the caches do not work the same way in the two architectures. To begin with, the Nehalem’s shared L3 cache is inclusive of all the other cache levels, meaning that it holds a copy of the contents of L1 and L2. This distinguishes it from AMD’s choice on the Phenom, whose L3 has a pseudo-exclusive relationship with the other cache levels (a given piece of data is not found in two cache levels at the same time; “pseudo” means that there are a few exceptions).
An inclusive cache relationship generally translates into higher performance, but at the expense of the total amount of useful cache (because some data is duplicated in two successive levels). In a multi-core architecture, the inclusive relationship amplifies this defect: of the 8 MB of L3, more than 1 MB is occupied by copies of the L1 and L2 caches (4 cores × 320 KB = 1.25 MB). However, it also has the advantage of disturbing the private L1 and L2 caches less. Why? Because on an L3 cache miss, we are sure that the data is not in the private caches of any core (otherwise it would be in L3, given the inclusive relationship), which avoids any verification and allows a memory read request to be issued immediately. Things become more complicated on an L3 hit, because a check is then required to see whether the data is already present in one of the private caches, which means checking the caches of every core. This step, necessary for cache coherence, is called “cache snooping” and can be a significant source of latency. To overcome this problem, each line of the Nehalem’s L3 cache carries flags that indicate which core(s) hold the data in their private caches. While the time saved is appreciable, storing these flags adds a little weight to the L3 structure.
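The role of these per-line flags can be illustrated with a toy model. This is a minimal sketch under our own assumptions, not Intel’s implementation: the class `InclusiveL3` and its methods are hypothetical names, and evictions, writes and coherence states are left out entirely.

```python
class InclusiveL3:
    """Toy inclusive L3 that keeps one presence flag per core for each line."""

    def __init__(self, cores: int = 4):
        self.cores = cores
        self.presence = {}  # line address -> list of per-core flags

    def fill(self, address: int, core: int) -> None:
        """Record that `core` brought `address` into its private caches."""
        flags = self.presence.setdefault(address, [False] * self.cores)
        flags[core] = True

    def cores_to_snoop(self, address: int, requester: int):
        """Return None on an L3 miss, else the list of cores worth snooping."""
        flags = self.presence.get(address)
        if flags is None:
            # Inclusive L3: a miss here means no private cache holds the line,
            # so the request can go straight to memory without any snoop.
            return None
        # L3 hit: only the cores flagged for this line need to be checked.
        return [c for c, held in enumerate(flags) if held and c != requester]


l3 = InclusiveL3()
l3.fill(0x1000, core=2)

print(l3.cores_to_snoop(0x2000, requester=0))  # miss: None, go to memory
print(l3.cores_to_snoop(0x1000, requester=0))  # hit: snoop core 2 only
```

Without the flags, the hit case would have to check the private caches of all three other cores; with them, the check is limited to the cores that may actually hold the line.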

The first latency tests showed an average of 40 processor cycles for the L3 cache of current Nehalem models (4 cycles for L1 and roughly 10 cycles for L2). Such a value is partly explained by the fact that the L3 cache, like the rest of the “uncore” part of the Nehalem, runs at a different frequency (and voltage) from the cores. Thus, on the 2.93 GHz model, L3 runs at 2.66 GHz. This slightly distorts latency measurements, which are expressed in processor cycles and therefore relative to 2.93 GHz. Separate frequencies and voltages add flexibility to the processor’s design and notably avoid having to align the overall processor frequency with slower elements. In addition, this allows better control of the overall thermal dissipation of the socket, which, as we will see later, gives the Nehalem another special characteristic.
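To see how much the choice of clock distorts the figure, the measured latency can be converted into time and into cycles of the 2.66 GHz clock. This is a pure unit conversion using the numbers above; the helper names are ours, and it does not claim that the whole L3 access takes place in the uncore clock domain.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a cycle count at `clock_ghz` into nanoseconds."""
    return cycles / clock_ghz


def rescale_cycles(cycles: float, from_ghz: float, to_ghz: float) -> float:
    """Express the same duration as cycles of a different clock."""
    return cycles * to_ghz / from_ghz


CORE_GHZ = 2.93    # core clock of the model cited in the article
UNCORE_GHZ = 2.66  # L3 / uncore clock cited in the article

l3_ns = cycles_to_ns(40, CORE_GHZ)
l3_uncore_cycles = rescale_cycles(40, CORE_GHZ, UNCORE_GHZ)

print(round(l3_ns, 1), "ns")
print(round(l3_uncore_cycles, 1), "cycles at 2.66 GHz")
```

The same 40 core cycles correspond to roughly 13.7 ns, or about 36 cycles when counted at the L3’s own frequency.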

In terms of flexibility, the size of the Nehalem’s L3 can easily be adapted to the needs of each processor version and to each evolution in manufacturing. The transition to the 32 nm process will probably be accompanied by a 12 MB L3 cache, as was the case for the Core 2.
