The Core's caches

The Core architecture places new constraints on the cache subsystem. On the one hand, a high IPC requires a cache subsystem with a high hit rate in order to efficiently mask memory latencies; on the other, it requires high transfer rates to meet the growing data demand that goes along with that IPC.
The table below summarizes the main cache characteristics of the new architecture, including the access latencies and transfer rates obtained with the SSE2 (128-bit) memory bandwidth test of RightMark Memory Analyzer (RMMA):
The Core L1 caches share the same size and associativity as the Mobile architecture's. However, available bandwidth is doubled, as shown by the RMMA read transfer rate test: the 128-bit SSE2 movapd load instruction sustains one 128-bit read per cycle, i.e. 16 bytes/cycle.
An L2 cache access requires one additional cycle. Its transfer rate is 8 bytes per cycle.
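These per-cycle figures translate directly into peak bandwidth once a clock frequency is chosen. The sketch below assumes a hypothetical 2.4 GHz Core 2 Duo for illustration; the frequency is our assumption, not a figure from the article.

```python
# Converting measured per-cycle transfer rates (16 bytes/cycle for L1,
# 8 bytes/cycle for L2) into peak bandwidth for an ASSUMED 2.4 GHz clock.

def peak_bandwidth_gb_s(bytes_per_cycle: int, freq_ghz: float) -> float:
    """Peak rate in GB/s: bytes/cycle x billions of cycles/s."""
    return bytes_per_cycle * freq_ghz  # GHz * bytes/cycle == GB/s

l1 = peak_bandwidth_gb_s(16, 2.4)  # one 128-bit movapd load per cycle
l2 = peak_bandwidth_gb_s(8, 2.4)   # half the L1 rate

print(f"L1 peak read bandwidth: {l1:.1f} GB/s")  # 38.4 GB/s
print(f"L2 peak read bandwidth: {l2:.1f} GB/s")  # 19.2 GB/s
```

In practice, measured RMMA figures sit somewhat below these theoretical peaks.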
Unlike the Pentium D and Athlon 64 X2, Core uses the Advanced Smart Cache technique inaugurated with Yonah, which consists of sharing the L2 cache between the two execution cores. Compared to an L2 cache dedicated to each core, the main advantage of this approach is that data can be shared between the two cores without going through the memory bus. This reduces memory accesses (and the latencies that go with them) and optimizes L2 usage (redundant copies disappear).
The shared cache can also be allocated dynamically between the two cores, up to the point where it is entirely available to just one of them. This technique, although specifically developed for a dual core implementation, is paradoxically most beneficial compared to separate caches when only one of the two cores is in use, that is, for all single threaded applications.
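The single-thread advantage can be made concrete with a toy comparison, assuming a hypothetical 4 MB total L2 (a capacity found on some Core 2 Duo parts; the figure is our assumption):

```python
# Toy model: with split caches a lone thread only ever sees its own
# half, while Advanced Smart Cache can hand the whole array to one core.

TOTAL_L2_MB = 4  # assumed total L2 capacity, for illustration only

def usable_l2_mb(shared: bool) -> float:
    """L2 capacity reachable by a single active thread."""
    if shared:
        return TOTAL_L2_MB      # dynamically allocated, up to 100%
    return TOTAL_L2_MB / 2      # static per-core half, rest sits idle

print(usable_l2_mb(shared=True))   # 4 -> the full cache
print(usable_l2_mb(shared=False))  # 2.0 -> half is wasted
```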
An intelligent memory access
In addition to the cache improvements, Intel has developed new techniques to improve memory accesses. They are grouped under the slightly pompous name Smart Memory Access.
The idea consists of working on two criteria, whose objective is to, once again, mask memory access latencies:
ensuring that a piece of data can be used as soon as possible (the temporal constraint);
ensuring that a piece of data is as close as possible (in the memory hierarchy) to the processing units (the "where" constraint).
The temporal constraint refers to how the processor schedules memory reads and writes. When a memory read sits in the out-of-order engine, it cannot complete before all in-flight write instructions ahead of it are done: otherwise it would risk reading data that has not yet been updated in the memory hierarchy. This constraint imposes waiting, and therefore a slowdown.
Core introduces a speculative mechanism that predicts whether a read instruction is likely to depend on writes currently in flight, in other words whether it can proceed without waiting. This predictor's role is to remove ambiguities, hence its name: Memory Disambiguation. Beyond reducing waits, the method's objective is to reduce dependencies between instructions and thus increase the efficiency of the out-of-order engine.
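The gain can be sketched with a simplified scheduling model. This is an illustration of the principle, not Intel's actual predictor: a load that is correctly predicted not to alias any pending store executes immediately instead of waiting for every older store to complete.

```python
# Illustrative model of memory disambiguation (NOT Intel's real design):
# conservatively, a load waits for every older in-flight store, since
# any of them might write the address it reads. With a correct no-alias
# prediction, the load can issue right away.

def load_ready_cycle(load_addr, load_issue, pending_stores, predict_no_alias):
    """pending_stores: list of (store_addr, completes_at_cycle) tuples."""
    if predict_no_alias and all(a != load_addr for a, _ in pending_stores):
        return load_issue  # speculation correct: no waiting
    # conservative path (or misprediction): wait for all older stores
    return max([load_issue] + [c for _, c in pending_stores])

stores = [(0x100, 12), (0x200, 15)]  # two stores still in flight
print(load_ready_cycle(0x300, 5, stores, predict_no_alias=False))  # 15
print(load_ready_cycle(0x300, 5, stores, predict_no_alias=True))   # 5
```

Ten cycles saved in this toy case; a real misprediction would instead force the load (and dependent instructions) to re-execute, which is why the predictor must be conservative when in doubt.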
Addressing the "where" constraint, in other words striving to bring data closer to the processing units, is the job of the cache subsystem. To help it in this task, Core uses hardware prefetch. This technique consists in using the memory bus when it is idle to preload code and data from memory into the cache subsystem.
Hardware prefetch is not a new technique; it first appeared with the Pentium III Tualatin. However, it was mainly Netburst that fundamentally improved it: the large gap between processor frequency and bus frequency makes Netburst particularly sensitive to the harmful effects of a cache miss, which increases the value of an efficient prefetch. Once again, Core inherits its prefetch technique from Netburst and slightly improves it.
Several types of prefetchers are included in Core:
the instruction prefetcher preloads instructions into the L1 instruction cache based on branch prediction results. Each of the two cores has one.
the IP prefetcher examines the history of reads in order to detect an overall pattern and loads "predictable" data into the L1 cache. Each core also has one.
the DCU prefetcher detects multiple reads from a single cache line within a given period of time and decides to load the following line into the L1 cache. One per core as well.
the DPL prefetcher works like the DCU prefetcher, except that it detects requests on two successive cache lines (N and N+1) and then triggers the loading of line N+2 from main memory into the L2 cache. The L2 cache has two of them, shared dynamically between the two cores.
In total, that makes 8 prefetchers for a Core 2 Duo.
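The stream detection behind the DPL prefetcher can be sketched as follows. This is a deliberately simplified model of the idea described above (detect lines N and N+1, fetch N+2), not the actual hardware logic:

```python
# Simplified DPL-style stream detection: when two consecutive cache
# lines N and N+1 have both been requested, prefetch line N+1's
# successor (N+2) ahead of the program. Line size of 64 bytes assumed.

LINE = 64  # Core 2 cache line size in bytes

def prefetch_targets(access_addrs):
    """Return the cache lines a DPL-like prefetcher would fetch ahead."""
    seen = set()
    targets = []
    for addr in access_addrs:
        n = addr // LINE       # cache line of this access
        seen.add(n)
        if n - 1 in seen:      # lines N and N+1 both touched?
            targets.append(n + 1)  # then prefetch line N+2
    return targets

# sequential reads touch lines 0, 1, 2 -> prefetches of lines 2 and 3
print(prefetch_targets([0, 64, 128]))  # [2, 3]
# scattered reads trigger nothing
print(prefetch_targets([0, 256]))      # []
```

A plain sequential scan is thus served from cache after the first couple of lines, while random access patterns leave the prefetcher idle.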
The small suns represent the 8 prefetchers of the Core 2 Duo.
Hardware prefetch mechanisms are generally very efficient and, in practice, increase the hit rate of the cache subsystem. Unfortunately, prefetch sometimes produces the opposite result: if its predictions are frequently wrong, it tends to pollute the cache with useless data and reduce the hit rate. For this reason, it is possible to deactivate most of the hardware prefetch mechanisms. Intel recommends deactivating the DCU prefetcher on processors intended for servers (the Woodcrest), as it is liable to reduce performance in some applications.