Boosted IPC
In our report on Intel’s Core architecture, we quantified the theoretical power of an architecture by the peak IPC (instructions per cycle) it can deliver on the main instruction sets (integer, FPU, SSE).
The core of the K10 is directly descended from the K8. Equipped with 3 ALUs (arithmetic and logic units) devoted to integer calculation, the K8 offers x86 calculation capacities equal to those of the Core 2 Duo. For SSE integer instructions, the K8’s two 64 bit calculation units can process eight 16 bit integers per cycle (2 x 64 / 16), while the Core 2 Duo can process up to 24 (3 x 128 / 16) thanks to its three 128 bit SSE units. The same holds for SSE floating point instructions, where the Core 2 Duo’s two floating point units, paired with 128 bit SSE units, process twice as much floating point data per clock cycle as the K8.
The K10 has the same integer calculation capacity as its predecessor. For SSE integers, it reaches a peak of three operations per cycle (two arithmetic operations on the two SSE units and a move on the “FP Move” unit). For floating point calculations, on the other hand, the theoretical IPC is boosted to the same level as the Core 2, thanks to the adoption of two SSE units capable of processing 128 bits per cycle.
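The per-cycle figures above follow from simple arithmetic: number of units multiplied by the number of elements that fit in one unit. The short sketch below reproduces them. The helper name `peak_elements_per_cycle` is ours, the 32 bit single precision element size is an assumption chosen for illustration, and the K8 floating point configuration (two 64 bit units) is inferred from the "twice as much" comparison rather than stated outright.

```python
def peak_elements_per_cycle(units: int, unit_bits: int, element_bits: int) -> int:
    """Upper bound on packed elements processed per clock cycle.

    Hypothetical helper: units * (elements that fit in one unit).
    It ignores decode, scheduling and memory limits.
    """
    return units * (unit_bits // element_bits)


# SSE integer, 16 bit elements (unit counts and widths as quoted in the text)
k8_int16 = peak_elements_per_cycle(units=2, unit_bits=64, element_bits=16)
core2_int16 = peak_elements_per_cycle(units=3, unit_bits=128, element_bits=16)

# SSE floating point, assuming 32 bit single precision elements
k8_fp32 = peak_elements_per_cycle(units=2, unit_bits=64, element_bits=32)
core2_fp32 = peak_elements_per_cycle(units=2, unit_bits=128, element_bits=32)
k10_fp32 = peak_elements_per_cycle(units=2, unit_bits=128, element_bits=32)

print("16 bit integers per cycle:", k8_int16, "(K8) vs", core2_int16, "(Core 2 Duo)")
print("32 bit floats per cycle:", k8_fp32, "(K8),", core2_fp32, "(Core 2 Duo),", k10_fp32, "(K10)")
```

These are theoretical peaks only; real code rarely sustains them.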

To keep the two 128 bit SSE units fed, the K10 doubles the instruction fetch rate (from 16 to 32 bytes of instructions per cycle) as well as the bandwidth of the L1 data cache (from 2 x 64 bits to 2 x 128 bits per cycle).
New predictor units
You may recall that branches and memory accesses are the two main causes of reduced IPC (please refer to our report on the Core 2 Duo for more details). It is therefore good to see that AMD has equipped the K10 with specific optimizations in these areas.
A branch in an instruction stream translates into a jump to a new address. This jump disrupts the pipeline, which can no longer receive new instructions until the destination address is known. Classic branch prediction mechanisms attempt to guess whether a branch will be taken or not. To do this, the processor integrates several predictor units that differ in the way they work. The most efficient approach relies on a history of the branches previously taken, stored in a dedicated buffer.
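To make the idea of a history buffer concrete, here is a minimal sketch of a textbook scheme: a table of 2 bit saturating counters indexed by the branch address. It is not a description of AMD's or Intel's actual predictors. The class name, table size and branch address are all hypothetical.

```python
class TwoBitPredictor:
    """Textbook taken/not-taken predictor (illustrative, not a real CPU design)."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        # Counter values 0..3: 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        return pc % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def run(predictor: TwoBitPredictor, trace) -> float:
    """Return the fraction of correctly predicted branches in the trace."""
    hits = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits / len(trace)


# A loop branch at a hypothetical address: taken 9 times, then not taken
# once when the loop exits, and the whole loop is run 100 times.
trace = ([(0x400, True)] * 9 + [(0x400, False)]) * 100
accuracy = run(TwoBitPredictor(), trace)
print(f"prediction accuracy: {accuracy:.1%}")
```

Once warmed up, this predictor only misses the loop exit, so on this trace its accuracy approaches 90%.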

The K8’s predictor units were designed to predict direct branches, in other words those whose destination address is explicitly specified in the code. The task of the predictor unit therefore consists of determining whether the branch will be taken or not. However, these units are not very efficient for indirect branches, those whose destination address can change during execution. This type of branch is very common in object-oriented languages, which often use function pointers.
The K10 has a predictor unit devoted to indirect branches, capable of storing several preferred destination addresses for each branch, which improves prediction efficiency. This is not a new mechanism: Intel’s processors have used it since the Pentium 4 Prescott. The K8, however, was designed well before then.
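The benefit of keeping several destinations per branch can be illustrated with a small simulation. The sketch below compares a predictor that remembers only the last target of each branch with one that stores several targets and picks among them using the branch's previous target as a simple form of history. Both classes are simplified illustrations of the general idea, not the K10's actual mechanism. The addresses and the selection rule are assumptions.

```python
class LastTargetPredictor:
    """One stored target per branch: predicts whatever the branch did last time."""

    def __init__(self):
        self.table = {}  # branch address -> last target

    def predict(self, pc):
        return self.table.get(pc)

    def update(self, pc, target):
        self.table[pc] = target


class MultiTargetPredictor:
    """Several targets per branch, selected by that branch's previous target."""

    def __init__(self):
        self.table = {}  # (branch address, previous target) -> next target
        self.last = {}   # branch address -> previous target

    def predict(self, pc):
        return self.table.get((pc, self.last.get(pc)))

    def update(self, pc, target):
        self.table[(pc, self.last.get(pc))] = target
        self.last[pc] = target


def accuracy(predictor, trace) -> float:
    """Return the fraction of indirect branches whose target was predicted."""
    hits = 0
    for pc, target in trace:
        if predictor.predict(pc) == target:
            hits += 1
        predictor.update(pc, target)
    return hits / len(trace)


# One indirect call site (for example a virtual method call) that cycles
# through three destinations. All addresses are hypothetical.
CALL_SITE = 0x1000
targets = [0xA000, 0xB000, 0xC000]
trace = [(CALL_SITE, targets[i % 3]) for i in range(300)]

single = accuracy(LastTargetPredictor(), trace)
multi = accuracy(MultiTargetPredictor(), trace)
print(f"last target only: {single:.1%}, several targets: {multi:.1%}")
```

On this deliberately unfavorable trace the single-target predictor is always one step behind, while the multi-target one is wrong only during warm-up. Real workloads fall somewhere in between.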