AMD K10 architecture - BeHardware
Written by Franck Delattre
Published on September 13, 2007
URL: http://www.behardware.com/art/lire/682/
Page 1
Staying in the race
This week AMD announced the first model based on its new microarchitecture: "Barcelona", the server version of the much-awaited K10. We will need to wait a few more months for the consumer desktop versions, the Phenom X2 and X4, which won’t be available before December. Here is a first look at the architecture that succeeds the very popular Athlon 64.
The arrival of Intel’s Core architecture in June 2006 made waves in the processor world, and the new processor quickly stole the show from the Athlon 64 in terms of performance, thermal dissipation... and price. Intel struck hard, forcing AMD to drastically revise its pricing structure (almost a 50% reduction for the Athlon 64 X2) in order to stay in the race, if not technologically then at least commercially.
A mere few months after the introduction of the Core 2 Duo, Intel did it again by introducing the first quad core processor. This was another tough blow for AMD, which, in addition to being late on the 65 nm process, didn’t offer a single commercially viable answer to the Core 2 Quad.
AMD bought some time by getting all it could out of the K8. A 65 nm version of the processor appeared in December 2006. The Brisbane core was intended to make up for AMD’s lag in thermal dissipation, but AMD only released it with 512 KB of L2 cache, which kept it out of the high end market. Then last August the ultimate version of the Athlon 64 X2 was released, the 6400+ clocked at 3.2 GHz and based on the 90 nm Windsor core. In March, AMD’s Executive Vice-president, Mario Rivas, confessed that he would have liked to have seen the release of a “two times two core” version of the Athlon 64, in order to serve the most eager customers and be present on the quad core processor market. Instead, AMD’s commercial strategy consisted of arriving on the market with a new architecture designed from the outset for four cores.
The K10 has gone from simply taking the baton from the K8 to being AMD’s savior. The new architecture is supposed to close AMD’s gap in a number of areas (four cores, 128 bit units, advanced energy management...), and we can’t help wondering if this isn’t too much weight for the shoulders of a brand new architecture. It’s a sizeable challenge, and as the delays get longer, the pressure and uncertainties grow. Several AMD directors have left the company and rumors are circulating of a buyout by a large micro-electronics group.
Presentation of the K10
The characteristics of the K10 have been public for several months now.
- Four core architecture ;
- A cache hierarchy not seen before: 128 KB of L1 cache and 512 KB of L2 per core, unified L3 cache of 2 MB ;
- 128 bit SSE units ;
- DDR2 memory controller integrated into the processor ;
- Advanced mechanism for energy management (Independent Dynamic Core Technology) ;
- 3 HyperTransport 1.0 lanes (Barcelona), one HyperTransport 3.0 lane (Phenom) ;
- 463 million transistors engraved in 65 nm SOI technology.
Page 2
Four cores, finally
The K10 marks AMD’s entrance into the quad core processor market, a domain that has been the reserve of Intel for the last 10 months. In order to distinguish itself from its rival, AMD highlights the “native” character of its four core architecture. This means that a certain number of characteristics apply to all four cores as a whole, rather than to two pairs of cores as is the case for the Core 2 Quad.
The concept starts with the exchanges between the cores, which benefit from a formidable vector of communication in the form of a shared L3 cache. Next comes an energy management strategy that controls all four cores together, for maximum control. We can thus speak of an architecture specifically designed for a quad core configuration, even if dual and single core versions will be derived from it.
The integration of four cores is in itself a source of acceleration; however, we know that this acceleration isn’t proportional to the number of cores, as Amdahl’s law shows. This simple law measures the performance gain of a partially parallelized program on a multi-processor system, and shows that although performance increases with the number of cores, the curve also flattens. The reason is that a program can never be entirely parallel: a part of it will continue to run on a single thread. This part is in no way accelerated by adding cores, so its share of the total run time grows while that of the parallelized code shrinks.
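As a small illustration of Amdahl’s law, here is a sketch in Python. The 80% parallel fraction is purely an assumption chosen for the example, not a figure measured on any real program, and the function names are ours.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a program whose
    parallelizable share of single-core run time is parallel_fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def serial_share(parallel_fraction: float, cores: int) -> float:
    """Share of the total run time spent in the serial part
    once the parallel part is spread over `cores` cores."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    p = 0.80  # assumed parallel fraction, for illustration only
    for n in (1, 2, 4, 8):
        print(f"{n} core(s): speedup x{amdahl_speedup(p, n):.2f}, "
              f"serial part = {serial_share(p, n):.0%} of run time")
```

With this assumed fraction, four cores give a speedup of 2.5 rather than 4, and the serial part grows from 20% to half of the run time, which is why faster individual cores still matter.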
To keep performance scaling as steadily as possible, speeding up individual tasks is therefore indispensable. This may seem paradoxical: the more cores a processor has, the faster each of them must be for the overall level of performance to increase significantly. Each of the K10’s cores therefore benefits from all of AMD’s efforts to increase the IPC (instructions per cycle) compared to the K8.
Page 3
Boosted IPC
In our report on Intel’s Core architecture, we quantified the theoretical power of an architecture by the IPC (instructions per cycle) that it is able to provide on the main instruction sets (integers, FPU, SSE).
The core of the K10 is directly descended from the K8. Equipped with 3 ALUs (arithmetic and logic units) devoted to integer calculation, the K8 offers x86 calculation capacities equal to those of the Core 2 Duo. For SSE integer instructions, the K8’s two 64 bit calculation units allow the processing of eight 16 bit integers per cycle, while the Core 2 Duo can process up to 24 thanks to its three 128 bit SSE units. The same goes for SSE floating point instructions, where the Core 2 Duo’s two floating point units, paired with 128 bit SSE units, process twice as much floating point data per clock cycle as the K8.
The K10 has the same integer calculation capacity as its predecessor. For SSE integers, it offers a peak rate of three integer operations per cycle (two arithmetic operations by the two SSE units and a move by the “FP Move” unit). For floating point calculations, on the other hand, the theoretical IPC is boosted to the same level as the Core 2, thanks to the adoption of two SSE units capable of processing 128 bits per cycle.
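The peak figures quoted above follow from simple arithmetic: the number of units times the number of elements that fit in each unit. The sketch below reproduces them. The helper name is ours, and the use of 32 bit single precision elements for the floating point comparison is an assumption made for the example.

```python
def elements_per_cycle(units: int, unit_width_bits: int, element_bits: int) -> int:
    """Peak number of data elements processed per cycle by a group of
    identical SIMD units (theoretical figure, ignoring all stalls)."""
    return units * (unit_width_bits // element_bits)


if __name__ == "__main__":
    # SSE integer, 16 bit elements (unit counts and widths from the text)
    print("K8     :", elements_per_cycle(2, 64, 16), "x 16 bit integers")
    print("Core 2 :", elements_per_cycle(3, 128, 16), "x 16 bit integers")
    # SSE floating point, assuming 32 bit single precision elements
    print("K8     :", elements_per_cycle(2, 64, 32), "x 32 bit floats")
    print("Core 2 :", elements_per_cycle(2, 128, 32), "x 32 bit floats")
    print("K10    :", elements_per_cycle(2, 128, 32), "x 32 bit floats")
```

These are theoretical peaks only; they say nothing about how often real code can keep every unit busy.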
In order to keep the two 128 bit SSE units fed, the K10 doubles the instruction fetch rate (from 16 to 32 bytes of instructions per cycle) as well as the bandwidth of the L1 data cache (from 2 x 64 bits to 2 x 128 bits per cycle).
New predictor units
You may recall that branches and memory accesses constitute the two main sources of IPC loss (please refer to our report on the Core 2 Duo for more details). It is therefore good to see that AMD has equipped the K10 with specific optimizations.
A branch in a stream of instructions translates into a jump towards a new address. This jump disrupts the pipeline, which can no longer receive new instructions until the destination address is known. Classic branch prediction mechanisms attempt to guess whether a branch will be taken or not. To do this, the processor integrates several predictor units which differ in the way they work. The most efficient approach relies on a history of the branches previously taken, stored in a dedicated buffer.
The K8’s predictor units were designed to predict direct branches, in other words those whose destination address is explicitly specified in the code. The task of the predictor unit then consists of determining whether the branch will be taken or not. However, these units are not very efficient for indirect branches, whose destination address may change at run time. This type of branch is very common in object-oriented languages, which often use function pointers.
The K10 has a predictor unit devoted to indirect branches, capable of storing several preferred destination addresses for each branch, thus improving prediction efficiency. This is not a new mechanism, as Intel’s processors have used it since the Pentium 4 Prescott. The K8, however, was designed well before this.
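To see why keeping several destination addresses per branch helps, here is a toy comparison in Python. It is only a sketch: both classes, the use of the previous target as a selector, and the repeating three-target trace are assumptions made for illustration, and none of this describes the K10’s actual hardware.

```python
from collections import defaultdict


class LastTargetPredictor:
    """Remembers a single destination per branch: the last one seen."""

    def __init__(self):
        self.table = {}

    def predict(self, branch_pc):
        return self.table.get(branch_pc)

    def update(self, branch_pc, target):
        self.table[branch_pc] = target


class MultiTargetPredictor:
    """Remembers several destinations per branch, selected by the
    previous destination of that branch (a crude stand-in for history)."""

    def __init__(self):
        self.table = defaultdict(dict)   # branch -> {previous target: next target}
        self.last_target = {}

    def predict(self, branch_pc):
        history = self.last_target.get(branch_pc)
        return self.table[branch_pc].get(history)

    def update(self, branch_pc, target):
        history = self.last_target.get(branch_pc)
        self.table[branch_pc][history] = target
        self.last_target[branch_pc] = target


def accuracy(predictor, trace):
    """Fraction of correctly predicted destinations over a trace
    of (branch address, actual destination) pairs."""
    hits = 0
    for branch_pc, target in trace:
        if predictor.predict(branch_pc) == target:
            hits += 1
        predictor.update(branch_pc, target)
    return hits / len(trace)


if __name__ == "__main__":
    # One indirect branch cycling through three destinations, as a virtual
    # call looping over three object types might do (hypothetical trace).
    trace = [(0x400, t) for t in (0xA00, 0xB00, 0xC00)] * 100
    print("single target  :", accuracy(LastTargetPredictor(), trace))
    print("several targets:", accuracy(MultiTargetPredictor(), trace))
```

On this trace the single-target predictor is always one step behind, while the multi-target one only misses during warm-up.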
Page 4
The K10’s caches
The latencies induced by memory access represent one of the principal sources of slowdown for a processing pipeline. A processor’s main weapon to mask these latencies is its cache sub-system. Hence the special importance of the cache hierarchy in an architecture’s specifications, and in this regard the Athlon 64 has proven its efficiency.
It seems natural then that the K10 inherits much of this design, and indeed all of it for the L1 and L2 cache levels, which have the most influence on performance. So here we only repeat these well known characteristics: each of the four cores integrates a large L1 cache split into 64 KB for instructions and 64 KB for data, backed by an exclusive unified L2 cache of 512 KB.
The performance of this tried and true sub-system mostly relies on the choice of a very large (the biggest of all architectures) and fast L1 cache. Each 64 KB L1 is two-way set-associative, which means that it is (schematically) organized as two blocks of 32 KB. Blocks of this size favor locality, since each can hold a large quantity of contiguous data or instructions. This comes at the expense of coverage, however: for a given cache address, the L1 can only hold lines from two different areas of memory at any given time. The L2 cache compensates for this relative weakness by offering the opposite, a high associativity for its size (512 KB, 16 ways). The combination of these two levels ensures a high level of performance in all conditions.
This effective combination is one of the reasons AMD kept a dedicated L2 cache for each of the K10’s four cores, rather than a shared L2 cache. It is therefore quite natural that a third level of cache was introduced on the K10. The L3, which is shared between all four cores, isn’t first and foremost there to increase the performance of each core individually, but rather to ensure the performance of the four cores working in unison. The L3 is a large part of the "native" four core character of the K10, providing an "on-die" path of communication between the cores.
The sharing of a cache between four cores is a first, and the implementation of such a system is complex because conflicts between threads could eliminate any benefit of the cache and even cause a slowdown. In order to reduce these conflicts, the K10’s L3 is 32-way associative, the largest value yet seen on an x86 processor cache.
Note that, as on the K8, the cache hierarchy of AMD K10 processors is distinguished by an exclusive relationship between successive levels. You may recall that an exclusive level 2 cache, for example, receives the data evicted from the L1 cache, but doesn’t keep a copy of the data or instructions that travel from memory to the L1. Thus, data and instructions are present in one of the two levels of cache, but never in both at the same time. Compared to inclusive caching, in which the L2 contains a copy of the L1, the exclusive relationship offers slightly lower performance (placing a line into cache requires an additional step in order to save the evicted line, an operation that is unnecessary in an inclusive relationship), with the advantage of a larger quantity of useful cache (equal to the sum of the sizes) and flexibility in implementation (no size constraints).
Although significantly inspired by the K8, the K10’s caches are adapted to the processor’s new capacities, particularly those of the LSU (Load-Store Unit). The K10’s LSU is capable of executing two 128 bit read/write operations per cycle, where the K8 was limited to two 64 bit operations. In order not to slow down the boosted LSU, the K10’s L1 caches were modified to provide double the bandwidth of the K8’s caches. The same doubling from 64 to 128 bits was applied to the bus linking the L2 caches to the memory controller, and this should remedy the relative weakness of the K8’s L2 cache bandwidth, which was well behind the high performance of Intel processors in this domain.
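The exclusive relationship can be sketched with a toy two-level hierarchy in Python. This is a simplified model under stated assumptions: both levels are fully associative with least-recently-used replacement, sizes are counted in lines, and the class and method names are ours rather than anything from AMD.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair: a line lives in at most one level."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1 = OrderedDict()   # oldest entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line: int) -> str:
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        if line in self.l2:
            del self.l2[line]          # promoted to L1, so it leaves L2
            result = "L2 hit"
        else:
            result = "memory"          # filled straight into L1, not into L2
        self._fill_l1(line)
        return result

    def _fill_l1(self, line: int) -> None:
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            self._fill_l2(victim)      # extra step: save the evicted line
        self.l1[line] = True

    def _fill_l2(self, line: int) -> None:
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)
        self.l2[line] = True


if __name__ == "__main__":
    caches = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
    for line in range(6):
        caches.access(line)
    print("L1:", list(caches.l1), "L2:", list(caches.l2))
    print("re-reading line 0:", caches.access(0))
```

With 2 lines of L1 and 4 lines of L2, all six distinct lines remain cached, which illustrates the useful capacity being the sum of the two sizes.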
Page 5
Caches, continued
Latencies, associativity and dissipation
The measured latencies of the K10’s caches show 3 cycles for the L1, 15 for the L2 and between 30 and 45 for the L3. These values merit some explanation.
L1 latency has not changed compared to the K8, and this is for the best because it contributed to the latter’s good performance. The L2 figure is good news following the "surprise" of the 65 nm version of the Athlon 64 (Brisbane core), whose latency underwent an increase that was truly detrimental to performance. Luckily, the K10 has been spared, and its L2 latency is closer to the value that we observed on the 90 nm Athlon 64 (Windsor core).
Memory latency has also increased slightly with the presence of the L3 cache, because a memory access only starts once every cache level has failed to satisfy the request. To be sure, it takes a little more time to access memory, but we shouldn’t conclude from this that the addition of a cache level slows memory access overall. It’s even the opposite, because the presence of the L3 significantly reduces the number of memory accesses.
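A rough average-latency calculation shows how both statements can hold at once. In the sketch below, only the cache latencies (3, 15 and a value of 40 within the measured 30 to 45 range) come from the measurements quoted above; the hit rates and the memory latencies of 200 and 210 cycles are assumptions invented for the example.

```python
def average_latency(levels, memory_latency):
    """Average access latency in cycles.

    levels: list of (latency_cycles, hit_rate) from L1 outward, where
    hit_rate is the chance of a hit among accesses reaching that level.
    An access is charged the latency of the level that finally serves it.
    """
    total = 0.0
    reach = 1.0   # probability that an access gets as far as this level
    for latency, hit_rate in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_latency


if __name__ == "__main__":
    l1 = (3, 0.95)    # hit rates are assumed, not measured
    l2 = (15, 0.80)
    l3 = (40, 0.50)
    without_l3 = average_latency([l1, l2], memory_latency=200)
    with_l3 = average_latency([l1, l2, l3], memory_latency=210)
    print(f"without L3: {without_l3:.2f} cycles on average")
    print(f"with L3   : {with_l3:.2f} cycles on average")
```

Under these assumptions, each trip to memory costs more with the L3 in place, yet the average access is cheaper because half of the would-be memory accesses stop at the L3.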
Finally, note that the K10’s L3 latency is expressed in processor cycles, although this cache belongs to the memory controller’s "power plane". This means that it doesn’t run at the processor’s frequency but at that of the memory controller (between 200 and 400 MHz less than the core). In the end, we might think that the L3’s higher latency reflects an intention to reduce the cache’s thermal dissipation. To properly understand the relationship between latency and thermal dissipation, we need to take a closer look at the mechanisms which govern a cache.
When data (or an instruction) is read from memory and written into cache, it is placed at an address where the controller will know how to find it. This address is derived from the original address of the data in memory. The cache being smaller than memory, cache addresses span a much smaller range than memory addresses, and the cache address is often generated from part of the memory address in question. Here’s an example. If the memory address is 0xF01C0123, that of the cache is 0xC0123. Therefore, the data located at 0xF01C0123 and 0xF2C0123 share the same cache address, and if they are both put into cache there is a conflict. To remedy this, the cache can have not one slot for the address 0xC0123, but two, four or eight. We then speak of a 2, 4 or 8 way associative cache. The higher the associativity, the lower the risk of conflict. Everything happens as if the cache were organized in as many blocks as there are ways of associativity, and the smaller the blocks, the less contiguous data or instructions each can contain. The challenge is then to find the ideal compromise between cache size and associativity; in general, however, the larger and more associative a cache is, the better the performance.
There is one problem with higher associativity. A request to the cache for a given address causes the reading of all the data sharing that address, in other words as many entries as there are ways. Of course, only one of these is being requested, and the others are activated for nothing. Unfortunately, each read activates a part of the cache and releases heat. Therefore, the more associative a cache is, the more significant the dissipation for each request. So how do we keep high associativity while at the same time limiting the amount of heat released? The solution consists of not activating all the ways simultaneously on a read, but rather just a few at a time (or even one), until the required data is found. This “serialization” of reads results in a cache that is energy efficient and releases little heat; however, latency is multiplied. For this reason, this technique is mostly used in mobile processors.
To be honest, we don’t know if the K10’s L3 cache uses a true “serial” associativity such as we described. That said, we are somewhat reassured in our hypothesis by the rather high latencies that were measured, by the size of the cache (Intel’s are twice as large but show much lower latencies), and by the relative slowness of the Brisbane’s L2, which can probably be explained by the use of the same technique.
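The address example above can be reproduced in a few lines of Python. This is a deliberately simplified sketch: it keeps the low 20 bits of the address as the cache address, as in the example, whereas a real cache also sets aside line-offset bits and stores tags. The class name and the oldest-first replacement policy are assumptions made for illustration.

```python
INDEX_MASK = 0xFFFFF   # keep the low 20 bits, as in the example above


def cache_index(address: int) -> int:
    """Cache address derived from a memory address (simplified)."""
    return address & INDEX_MASK


class SetAssociativeCache:
    """Toy cache with `ways` slots per cache address."""

    def __init__(self, ways: int):
        self.ways = ways
        self.sets = {}   # cache address -> memory addresses held, oldest first

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the address and return False."""
        slots = self.sets.setdefault(cache_index(address), [])
        if address in slots:
            slots.remove(address)
            slots.append(address)      # mark as most recently used
            return True
        if len(slots) >= self.ways:
            slots.pop(0)               # conflict: evict the oldest entry
        slots.append(address)
        return False


if __name__ == "__main__":
    a, b = 0xF01C0123, 0xF2C0123
    print(hex(cache_index(a)), hex(cache_index(b)))   # same cache address
    for ways in (1, 2):
        cache = SetAssociativeCache(ways)
        hits = sum(cache.access(addr) for addr in (a, b) * 4)
        print(f"{ways} way(s): {hits} hits out of 8 accesses")
```

With a single slot, the two addresses keep evicting each other and every access misses; with two ways, both stay in cache after the first two accesses.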
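The trade-off between parallel and serialized way lookup can be put into numbers with a small Python sketch. It assumes, for illustration only, that heat is proportional to the number of ways activated and latency to the number of lookup steps; the function name and the choice of groups of 8 are ours, and the 32 ways are those of the K10’s L3.

```python
def lookup_cost(ways: int, hit_way: int, ways_per_step: int):
    """Return (steps, ways_activated) needed to find data stored in
    way number hit_way (0-based) when ways_per_step ways are read at a time."""
    steps = hit_way // ways_per_step + 1
    activated = min(steps * ways_per_step, ways)
    return steps, activated


def average_cost(ways: int, ways_per_step: int):
    """Average (steps, ways_activated), assuming the data is equally
    likely to sit in any way."""
    costs = [lookup_cost(ways, w, ways_per_step) for w in range(ways)]
    avg_steps = sum(steps for steps, _ in costs) / ways
    avg_activated = sum(activated for _, activated in costs) / ways
    return avg_steps, avg_activated


if __name__ == "__main__":
    for group in (32, 8, 1):   # fully parallel, groups of 8, one way at a time
        steps, activated = average_cost(32, group)
        print(f"{group:2d} ways per step: {steps:.1f} steps, "
              f"{activated:.1f} ways activated on average")
```

Reading all 32 ways at once takes a single step but activates every way; reading fewer ways per step activates fewer of them on average, at the cost of more steps.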
Page 6
Memory controller
Second generation memory controller
The integration of a memory controller in the processor was one of the major innovations of the K8. Performance was improved, although at the cost of a definite loss of flexibility. We now know that Intel is planning on integrating a memory controller on some models of its future architecture, proof of the pertinence of the choice AMD made several years ago.
DDR2 support and the greater need for memory bandwidth in dual core versions of the Athlon 64 made changes to the integrated controller necessary in 2006. The new K10 architecture imposes new constraints on the integrated memory controller, which now undergoes deeper changes.
With the K10, four cores now share a single DDR2 / DDR3 memory controller. The main improvements are therefore aimed at its efficiency: larger buffers (in order to increase the possibility of parallel transfers), optimized management of memory pages (to reduce the resulting conflicts and delays), and the addition of a prefetcher (see the section on prefetchers). The management of memory channels has gained in flexibility, and the two 64 bit channels can operate together in order to provide maximum bandwidth, or separately for simultaneous read/write operations.
There is also added flexibility in terms of the power supply and the operating speed of the memory controller. Here, "split plane" technology separates the memory controller’s power supply from that of the rest of the processor. In addition to finer management of thermal dissipation, this technology also makes it possible to temporarily increase the voltage supplied to the memory controller, and thus to increase its frequency (and speed) when needed.
Only "AM2+" (or Socket F+ for Barcelona) motherboards, however, support "split-plane". On the AM2 platform, the controller operates in "uni-plane" mode, or in other words, it shares the CPU’s voltage. To keep thermal dissipation from going too high, the operating frequency of the memory controller and of the L3 is reduced by 200 MHz compared to "split-plane" mode. So a word of warning here: while AM2 platform compatibility is a reality, it isn’t without a few small snags.
Page 7
Various optimizations
Prefetchers
The K10’s cache sub-system also gains from better hardware prefetchers than those of the K8. You may recall that a prefetcher works on the principle that a read miss is likely to occur, which would mean going all the way to main memory for the instructions or data; it therefore tries to fetch them in advance.
The K10 has two hardware prefetchers which feed its L1 caches, while those of the K8 fed the L2. The K10 additionally benefits from a new prefetcher in the memory controller, which has a dedicated storage buffer.
Various optimizations
The K10 strives to correct several of its predecessor’s defects. The K8 “suffered” from rather slow integer division in comparison to the rival Core architecture (particularly compared to the 45 nm version, which, you may remember, benefited from an optimization specific to this type of operation). The K10 remedies this without entirely attaining the performance of the 45 nm version of the Core 2.
More interesting is the adoption of “out-of-order” handling of memory read instructions. Already present on the Core 2, where Intel calls it “Memory Disambiguation”, this speculative mechanism aims to predict whether a read is likely to depend on a write still in progress. If not, the read is processed without delay.
Also, note that there is better management of the stack (its management instructions are now handled by a dedicated unit), as well as several updates to the support of extended instructions, in particular certain SSE3 instructions absent from the K8 and a new series of SIMD instructions grouped under the name SSE4A (no relation to Intel’s SSE4.1 and SSE4.2, which would have been too easy).
Improved virtualization support
The K10 offers a series of optimizations which aim to accelerate the processing of virtual machines, for example improved memory management and a reduction of the transition time between the hypervisor and virtual machines.
Page 8
Energy management
In depth energy management
If there is one thing that Intel learned from the "Prescott" experience (see our report on Core architecture to refresh your memory), it’s that thermal dissipation has become a limiting factor of performance in modern architectures. The impact of this new parameter is such that design is no longer guided by raw potential performance but rather by the ratio between performance and the number of watts consumed.
Along with its architecture, AMD is introducing a new power consumption index called ACP (Average CPU Power), which is presented as more representative of real dissipation than the TDP, which instead indicates the maximum power value. Of course, ACP values are lower than TDP values. AMD is staying prudent in this area and plans on giving both figures.
AMD insists on one essential point regarding the power consumption of its processors: they integrate a memory controller, so it is normal that they generate more heat than products with a separate controller. AMD claims that this integration reduces the overall thermal envelope of the processor + northbridge combo compared to a solution with a separate controller. This is true; however, it overlooks the fact that the heat is concentrated on a smaller surface and is therefore more difficult to dissipate.
The K10 partly makes up for this defect by separating the memory controller’s power supply and clock from those of the processor cores, thus allowing them to be adjusted separately depending on the respective activity of the two sub-systems. Here, the K10 offers two mechanisms which manage multipliers (FID) and voltage (VID): one for the CPU and another for the memory controller. As for the cores themselves, there is an energy management system called "Independent Dynamic Core Technology", which is based on dynamic and independent modulation of the frequency of each core. Its aim is to control the overall thermal envelope of the processor.
Page 9
Product line, conclusion
The product line
The Opteron Quad Core, code name “Barcelona”, is the server version of the K10, and the first processor available using the new architecture.
- Four cores, 512 KB of L2 cache per core, shared L3 cache of 2 MB ;
- Engraved in 65 nm, SOI ;
- Frequencies between 1.7 and 2 GHz with a TDP between 68 and 95W ;
- Socket F/F+ (LGA 1207) ;
- Three HyperTransport 1.0 lanes ;
- DDR2-667 registered support.
Five models are planned for dual processor platforms (Opteron 23xx in 95W and 23xx HE in 68W) for prices between $206 and $370 depending on the version; four Opteron 83xx models will also be available to be installed in configurations of up to eight processors (and therefore 32 cores !), with prices ranging from $690 to $1000. AMD plans on increasing frequencies and announces versions attaining 2.5 GHz for the end of the year.
Desktop versions of the K10 will be called “Phenom” and will only arrive in December. The lineup will include:
Phenom X4, code name "Agena" :
- Four cores, 512 KB of L2 cache per core, shared L3 cache of 2 MB ;
- Engraved in 65 nm, SOI ;
- Socket AM2+ (PGA 940) ;
- One HyperTransport 3.0 lane ;
- DDR2-1066 support.
The Phenom FX shares these characteristics, but benefits from a clock frequency which will most likely be slightly higher. It will also exist in Socket F+ versions.
The Phenom X2 "Kuma" is the dual core version of the X4 :
- Two cores, 512 KB of L2 cache per core, shared L3 cache of 2 MB ;
- Engraved in 65 nm, SOI ;
- Socket AM2+ (PGA 940).
AMD has planned two entry level models.
- The Athlon X2 "Rana", a Phenom X2 stripped of its L3 cache, in socket AM2+ ;
- The Sempron "Spica", which has a single core and no L3, also in socket AM2+.
And compared to the Core 2 ?
A closer look at the K10 reveals that AMD has opted for a judicious compromise between innovation and proven techniques in the design of its new architecture. The K10 inherits some of the qualities of the K8 while at the same time bringing what is necessary to stay at the edge of innovation. The K10 is most likely a balanced and high performance architecture. However, it remains to be seen whether these qualities will be enough to make it competitive with today’s Core 2, and whether AMD will be able to quickly increase the K10’s frequency.
The first tests of the K10 architecture on “desktop” applications give us a first indication: an average performance gain of 14% (with peaks of 21%) compared to the K8 at equal frequencies. This is slightly disappointing knowing that the Core architecture is around 25% ahead of the K8. These tests were, however, carried out on the Opteron platform, and the gains could be higher on the desktop. For example, we don’t yet know to what extent the K10’s performance relies on memory speed compared to the K8. If it is more dependent on this factor, it is necessarily at a disadvantage in tests run with DDR2-667.
Finally, even if the K10 does indeed have potential, that alone will no longer be enough. The new challenge in improving performance involves mastering thermal dissipation, and in this domain Intel is undeniably ahead in terms of production techniques. However, nothing is set in stone and the battle has not yet even begun.
Copyright © 1997-2009 BeHardware. All rights reserved.