Calculation units
Here is a quick comparison of the current architectures:
…and of the theoretical instruction bandwidth that results from these architectures:
Core uses three integer calculation units. This is one more than Mobile and the same as the K8, for a capacity of three x86 instructions per cycle. Netburst keeps its supremacy in integer processing thanks to its double speed units, which can process up to 4 full instructions per cycle. (It isn't 5, as the presence of an additional single speed ALU might suggest, because that ALU shares its port with one of the two double speed ALUs.) Unfortunately, this processing capacity cannot be exploited in practice because the Netburst decoding units cannot sustain such a rate, which restricts the IPC to 3.
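The Netburst case comes down to a simple rule: sustained throughput is capped by the narrowest stage of the pipeline. Here is a minimal Python sketch of that rule. The function name and parameter names are illustrative assumptions; only the figures 4 (integer units) and 3 (front end) come from the text.

```python
def sustained_ipc(front_end_width, execution_width):
    """Upper bound on instructions per cycle: the narrowest stage wins."""
    return min(front_end_width, execution_width)

# Netburst figures quoted in the text: the integer units could handle
# 4 instructions per cycle, but the front end only feeds 3.
netburst_ipc = sustained_ipc(front_end_width=3, execution_width=4)
print(netburst_ipc)  # 3
```

This is only a bound: real code rarely reaches it, because of dependencies, branches and memory accesses.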
We felt that it was interesting to observe Core's behaviour on common x86 instructions such as arithmetic operations, shifts and rotations. We used a tool integrated into Everest, which provides the latency and transfer rates of several instructions chosen from x86/x87, MMX, and SSE 1, 2 and 3. This tool is included in the evaluation version: just right click in the Everest status bar, select « CPU Debug », then « Instructions latency dump » in the menu.
The latency of an instruction is the number of processor cycles that it spends in the processing pipeline. In practice the out-of-order (OOO) engine tries to reorder the instruction flow in order to mask latencies; however, dependencies between instructions generate waiting, which grows with the latencies of the instructions involved. The transfer rate of an instruction corresponds to the minimum time, in processor cycles, that separates the start of two similar instructions. For example, an integer division has a transfer rate of 40 cycles on the K8. This means that the processor can only start one integer division every 40 cycles.
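The difference between latency and transfer rate is easiest to see by counting cycles for a run of identical instructions. The following Python sketch is a simplified model, not a description of any real scheduler. The function names are hypothetical, the "latency 3, transfer rate 1" instruction is invented for illustration, and the 40 cycle latency used for the K8 division is an assumption (the text only gives its transfer rate).

```python
import math

def chain_cycles(n, latency):
    """n instructions, each depending on the previous result:
    nothing can overlap, so the latencies add up."""
    return n * latency

def independent_cycles(n, latency, transfer_rate):
    """n independent instructions: a new one may start every
    `transfer_rate` cycles, and the last one still needs `latency`
    cycles to complete."""
    if n == 0:
        return 0
    return math.ceil((n - 1) * transfer_rate) + latency

# Hypothetical pipelined instruction: latency 3, transfer rate 1.
print(chain_cycles(10, 3))               # 30
print(independent_cycles(10, 3, 1))      # 12

# K8 integer division: transfer rate 40 (from the text), latency 40 (assumed).
print(independent_cycles(10, 40, 40))    # 400
```

With a pipelined unit, independent instructions hide most of the latency. With a transfer rate as long as the latency, as assumed here for division, even independent instructions run back to back.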
For some instructions, like addition, Core has a transfer rate equal to the maximum theoretical IPC (0.33 cycles per instruction, or 3 instructions per cycle). Multiplication has a slightly lower latency than on Yonah and is at the same level as on the K8. Integer division improves a little less, but it is much faster than on the K8 and Netburst. As for register manipulations, Core is slower than the K8, even if shifting (shl) has been improved compared to Yonah.
The thing to remember from this table is that the effort on Core's units went into instructions for which the K8 was well ahead of Mobile and Netburst (integer addition and multiplication, for example), and that less attention was given to instructions at which the K8 doesn't excel (integer division, for example).
Theoretical SSE Performances
One of the most noticeable improvements to Core's calculation units is the presence of three SSE units dedicated to integer and floating point SIMD operations. Combined with the appropriate arithmetic units, each is capable of processing a 128 bit packed operation in a single cycle (acting simultaneously on four 32 bit values or two 64 bit values), instead of 2 cycles for Netburst, Mobile and K8. This applies to common arithmetic operations such as multiplication and addition.
Each of the three ALUs is associated with one SSE unit. Together they can process up to 3 full 128 bit SSE operations per cycle (that is 12 operations on 32 bit integers or 24 on 16 bit integers). The Mobile and K8 only have 2 SSE units, each able to process 64 bits per clock cycle. Their capacity for integer SSE is therefore 2 x 64 bits, which is 4 operations on 32 bit integers (or 8 on 16 bit integers).
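These per-cycle figures all follow from the same arithmetic: number of units, times the bits each unit handles per cycle, divided by the element size. A short Python sketch reproduces them. The function name is an illustrative assumption; the unit counts and widths are those given in the text.

```python
def packed_ops_per_cycle(units, bits_per_unit_per_cycle, element_bits):
    """Peak number of element operations per cycle across all SIMD units."""
    return units * (bits_per_unit_per_cycle // element_bits)

# Core: 3 units, 128 bits each per cycle
print(packed_ops_per_cycle(3, 128, 32))  # 12
print(packed_ops_per_cycle(3, 128, 16))  # 24

# Mobile and K8: 2 units, 64 bits each per cycle
print(packed_ops_per_cycle(2, 64, 32))   # 4
print(packed_ops_per_cycle(2, 64, 16))   # 8
```

These are theoretical peaks that assume every unit receives an independent packed operation on every cycle.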
Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. Theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128 bit floating point instructions per cycle (that is 8 operations on 32 bit single precision floating point values, or 4 operations on 64 bit double precision values). Core is, in theory, twice as fast for this type of instruction as Mobile, Netburst and K8. Let's see how it behaves with several SSE2 instructions.
Packed mov is particularly fast on Core, which here reaches a transfer rate of three 128 bit operations per cycle. The transfer rates for isolated arithmetic operations are explained by the fact that each of these operations is handled by only one FP unit, which when used alone has a maximum transfer rate of 128 bits per cycle. The combined mul + add operation exploits the two units jointly and is executed with a transfer rate of one cycle for the two operations, in other words two 128 bit operations per cycle.
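The behaviour of isolated versus combined operations can be sketched with a very rough two-unit model in Python. It assumes all operations are independent, ignores latency, and gives each FP unit one 128 bit operation per cycle as stated in the text. The function names are hypothetical.

```python
def fp_cycles(n_add, n_mul):
    """Cycles needed to issue independent packed operations when one unit
    takes the additions and the other the multiplications, each at one
    128 bit operation per cycle."""
    return max(n_add, n_mul)

def fp_ops_per_cycle(n_add, n_mul):
    """Average number of 128 bit operations issued per cycle."""
    return (n_add + n_mul) / fp_cycles(n_add, n_mul)

print(fp_ops_per_cycle(100, 0))    # 1.0  (additions alone: one unit busy)
print(fp_ops_per_cycle(0, 100))    # 1.0  (multiplications alone)
print(fp_ops_per_cycle(100, 100))  # 2.0  (mul + add: both units busy)
```

In this model, two operations per cycle are only reached when the instruction mix keeps both units fed, which is what the mul + add measurement shows.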
Intel talks a lot about this new calculation capacity that comes with Core and calls it Digital Media Boost. Core also introduces a new set of SSE instructions. Initially expected to be released with Tejas, SSE4 consists of 16 new SIMD instructions. Most of them operate on integer data. They are essentially intended to accelerate video compression and decompression algorithms. For example, palignr concatenates two registers and extracts a byte-shifted window from the pair. This operation is often used in motion prediction algorithms for MPEG decoding.
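To make the palignr example concrete, here is a small pure Python emulation of its 128 bit form. It is an illustration, not processor code. The function reuses the instruction name for convenience, and registers are represented as 16 byte strings with the least significant byte at index 0 (an assumption of this sketch).

```python
def palignr(dst: bytes, src: bytes, imm: int) -> bytes:
    """Emulate 128 bit PALIGNR: concatenate dst (high half) with src
    (low half), shift the 32 byte value right by `imm` bytes and keep
    the low 16 bytes."""
    if len(dst) != 16 or len(src) != 16:
        raise ValueError("both operands must be 16 bytes long")
    if not 0 <= imm <= 255:
        raise ValueError("imm must fit in 8 bits")
    combined = src + dst                     # index 0 = least significant byte
    window = combined[imm:imm + 16]
    return window + bytes(16 - len(window))  # zero fill past the top byte

src = bytes(range(16))        # bytes 0..15
dst = bytes(range(16, 32))    # bytes 16..31
print(list(palignr(dst, src, 4)))  # bytes 4..19: a window straddling both registers
```

Extracting such a window in one instruction avoids separate shifts and merges when the data to process starts at an unaligned byte offset.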
The capacities of Core's execution units are very impressive. Intel has given it a potential two to three times higher than that of its previous products and of the competition. Having a high IPC on paper is one thing, but exploiting it in practice is another. As we saw above, x86 code tends to reduce the IPC because of branching and memory accesses. Intel has logically brought several improvements to reduce the harmful effects of these two types of dependencies.