Intel has unveiled additional details about the future of its processor line, starting with Penryn. Some details had already been officially published (see this news or this one). Penryn is mainly presented as a "die shrink" of current processors: a 45 nm fabrication process with high-k technology, and a transistor count that rises from 290 to 410 million because the L2 cache grows from 4 to 6 MB.
The Penryn core with 6 MB of L2 cache
First things first: Intel confirmed that it is on schedule to enter production during the second half of the year, although this does not mean that the entire product range will be available this year. Unsurprisingly, the announced frequencies are above 3 GHz. As a reminder, desktops will require new platforms based on the P35, to be released in June, or the X38, in September.
As for power consumption, Intel chose not to reduce the TDP in order to deliver higher performance: the dual-core Wolfdale will have a TDP of 65 watts and the quad-core Yorkfield a TDP of 95 to 130 watts (instead of 105 to 130 currently). FSB speeds won't increase for desktops: 1333 for dual core, 1066 for quad core, and 1600 for servers.
What about architectural improvements?
1/ Faster divisions
Intel did not think it necessary to change the already very efficient execution units of the Core architecture, except for the unit in charge of division. Division is one of the slowest arithmetic operations run by the processor. It is interesting to note that Intel and AMD use radically different techniques here. Intel's processors use, like we do by hand, the Euclidean method: a divisor and a dividend yield a quotient and a remainder. The processor cuts the division into pieces, which means that only a fixed number of bits is processed in each cycle. The operation is relatively slow (the number of cycles depends on the size of the dividend), but it is exact.
AMD's processors use an approximation method based on tables and multiplications. The operation is processed much faster, but the micro-coded tables consume resources. AMD's processors only use this technique for floating-point divisions, so integer divisions first have to go through a conversion phase, which considerably reduces their efficiency. The operation is quick (at least for floating-point calculations) but approximate. AMD seems, however, to have given up this method with the K10 in favor of the Euclidean one.
Was Intel afraid that the latest addition to AMD's family might be more efficient at divisions? In any case, where Core processes two bits per cycle (Radix-4), Penryn will process four (Radix-16). Other, more complex operations that involve divisions will also benefit from this technique. This is the case for square root calculations, which have been particularly optimized. This type of operation is used intensively by 3D geometry engines.
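To give an idea of what "tables and multiplications" can look like, here is a minimal sketch of one well-known approach: a small lookup table provides a rough reciprocal of the divisor, and a few Newton-Raphson steps (multiplications only) refine it. The source does not say which algorithm AMD actually uses, so the choice of Newton-Raphson, the 16-entry table (`TABLE_BITS`) and the function name `approx_divide` are all assumptions made for illustration.

```python
import math

TABLE_BITS = 4  # hypothetical table size (16 entries), chosen for illustration

# Seed table: rough reciprocals for mantissas in [1, 2), indexed by the top
# TABLE_BITS fraction bits. Each entry uses the midpoint of its interval.
RECIP_TABLE = [
    1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS)) for i in range(1 << TABLE_BITS)
]


def approx_divide(a, b, iterations=3):
    """Approximate a / b for b > 0 using a table seed and Newton-Raphson steps."""
    if b <= 0:
        raise ValueError("this sketch only handles positive divisors")
    mantissa, exponent = math.frexp(b)  # b == mantissa * 2**exponent, mantissa in [0.5, 1)
    mantissa *= 2.0                     # normalise the mantissa to [1, 2)
    exponent -= 1
    index = int((mantissa - 1.0) * (1 << TABLE_BITS))
    x = RECIP_TABLE[index]              # rough estimate of 1 / mantissa
    for _ in range(iterations):
        # Newton-Raphson step for the reciprocal: only multiplies and a subtract.
        x = x * (2.0 - mantissa * x)
    # a / b == a * (1 / mantissa) * 2**-exponent
    return a * math.ldexp(x, -exponent)
```

With zero iterations the result is only as good as the table seed; each refinement step roughly squares the relative error, which is why the result stays an approximation until a final rounding step (not modelled here) is applied.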
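The effect of moving from two to four bits per step can be sketched with a simple digit-by-digit integer divider. This is only a software model under simplifying assumptions: real dividers select each quotient digit with dedicated hardware rather than repeated subtraction, and the function name `digit_recurrence_divide` is ours, not Intel's.

```python
def digit_recurrence_divide(dividend, divisor, bits_per_step):
    """Unsigned division that retires `bits_per_step` quotient bits per step.

    Returns (quotient, remainder, steps). bits_per_step=2 models Radix-4 and
    bits_per_step=4 models Radix-16.
    """
    if divisor <= 0 or dividend < 0 or bits_per_step <= 0:
        raise ValueError("expects dividend >= 0, divisor > 0, bits_per_step > 0")
    digit_mask = (1 << bits_per_step) - 1
    # The number of steps depends on the size of the dividend (ceiling division).
    steps = -(-dividend.bit_length() // bits_per_step)
    quotient = 0
    remainder = 0
    for step in range(steps - 1, -1, -1):
        shift = step * bits_per_step
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & digit_mask)
        # Select the quotient digit (0 .. radix - 1) by repeated subtraction.
        digit = 0
        while remainder >= divisor:
            remainder -= divisor
            digit += 1
        quotient = (quotient << bits_per_step) | digit
    return quotient, remainder, steps
```

For a 20-bit dividend such as 1000003, the model needs 10 steps at two bits per step and 5 steps at four bits per step, while returning the same exact quotient and remainder in both cases.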
2/ Super Shuffle and SSE4
Two aspects of SSE have been improved. The first is the acceleration of shuffle instructions, which rearrange data held in SSE registers and are heavily used for video encoding and decoding.
There is also a new set of instructions: SSE4. More information on this set is available on this page. Fifteen instructions or so will be available. Many of them will bring general improvements, and others target more specific domains such as CRC calculation. Of course, programs will have to be written or compiled to take these instructions into account, which is why they won't improve the performance of current programs.
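For readers unfamiliar with shuffles, the sketch below models what a 16-byte shuffle does, in plain Python rather than with real SSE registers. It follows the semantics of the PSHUFB byte shuffle (each control byte picks a source byte, or forces zero when its top bit is set); the function name `shuffle_bytes` and the pixel-swizzle example are our own illustration, not something taken from Intel's presentation.

```python
def shuffle_bytes(source, control):
    """Model of a 16-byte shuffle: result[i] = source[control[i] & 0x0F].

    A control byte with its top bit (0x80) set produces a zero byte instead,
    mirroring the behaviour of the PSHUFB instruction.
    """
    if len(source) != 16 or len(control) != 16:
        raise ValueError("both operands must be exactly 16 bytes long")
    return bytes(0 if c & 0x80 else source[c & 0x0F] for c in control)


# Control pattern that turns four packed BGRA pixels into RGBA by swapping
# bytes 0 and 2 of every 4-byte pixel.
BGRA_TO_RGBA = bytes([2, 1, 0, 3, 6, 5, 4, 7, 10, 9, 8, 11, 14, 13, 12, 15])
```

In hardware the whole rearrangement is a single instruction on a 128-bit register, which is why pixel-format conversions of this kind in video codecs are sensitive to shuffle throughput.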
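To show the kind of work a dedicated CRC instruction would replace, here is a minimal bit-by-bit software CRC. The article does not specify a polynomial; this sketch assumes CRC-32C (Castagnoli), which is the polynomial used by the CRC32 instruction in Intel's SSE4 documentation. The function name `crc32c` is ours.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data):
    """Bitwise CRC-32C of a bytes object (slow reference implementation)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```

The inner loop runs eight times per input byte, whereas a hardware instruction folds a whole byte or word at once; this is the reason such code has to be rewritten or recompiled before it can benefit.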
3/ Energy: IDA (Intel Dynamic Acceleration) and Deep Power Down
The IDA dynamic acceleration technique isn't really specific to Penryn. It has already been implemented in some upcoming Core 2 Duo Merom models for the Santa Rosa mobile platform (scheduled for release in May). As a result, it is presented as relating solely to energy consumption, even though we believe its possibilities go far beyond that.
The aim of IDA is to boost the performance of one of the two cores when the other is inactive. For example, with a dual-core processor clocked at 2.2 GHz, when one core is inactive the other is accelerated to 2.4 GHz. The overall thermal envelope remains lower than it would be with both cores running at 2.2 GHz, while the thread handled by the active core gets better performance. Thanks to this trick, single-threaded applications are accelerated while thermal dissipation stays reasonable. It is unfortunate, though, that this technique is only implemented on mobile platforms, although on an overclocked machine it would of course have to be possible to disable it!
Finally, we noted that the mobile version of Penryn can enter an additional state named "Deep Power Down". It is designed to extend laptop battery life a little further. In this mode, and this is new, the L1 and L2 caches are simply switched off.
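The reasoning behind IDA can be sketched with a toy model. The two clock values come from the example above; everything else is an assumption made for illustration: the function names, the idea that an idle core contributes nothing, and the crude proxy that power is proportional to the sum of the core clocks at fixed voltage.

```python
BASE_GHZ = 2.2   # nominal clock, taken from the article's example
BOOST_GHZ = 2.4  # boosted clock, taken from the article's example


def core_clocks(core_busy):
    """Return the clock of each core in GHz for a dual-core chip.

    `core_busy` is a list of two booleans. An idle core is modelled as 0.0 GHz
    (assumption: it sits in a sleep state and draws negligible power).
    """
    if len(core_busy) != 2:
        raise ValueError("this sketch models a dual-core processor only")
    if sum(core_busy) == 1:
        # Exactly one busy core: it may use part of the idle core's budget.
        return [BOOST_GHZ if busy else 0.0 for busy in core_busy]
    return [BASE_GHZ if busy else 0.0 for busy in core_busy]


def relative_power(clocks):
    """Very crude power proxy: proportional to the sum of the core clocks."""
    return sum(clocks)
```

In this model a single boosted core scores 2.4 against 4.4 for two cores at the base clock, which is the sense in which the boosted configuration stays inside the thermal envelope. A real implementation would also have to account for voltage, which this sketch ignores.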
What about performance?
For this part, we will have to make do with the figures given by Intel.
In games, the micro-architectural improvements, the larger L2 cache and the higher frequencies of Penryn lead, according to Intel, to a 20% performance improvement between Penryn at 3.2 GHz and Conroe at 3.0 GHz. The optimizations made to shuffle instructions, together with SSE4, would bring a 40% performance improvement in video encoding compared with SSE3. Wait and see!
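Since the two processors in Intel's game figure do not run at the same clock, it is worth separating the part of the 20% that comes from frequency alone. The short calculation below does this under the assumption, which is ours and is optimistic for games, that performance scales linearly with clock speed; the function name is hypothetical.

```python
def clock_normalised_gain(total_gain, new_ghz, old_ghz):
    """Gain remaining once the clock ratio is divided out.

    Assumes performance scales linearly with frequency, which is a
    simplification: real workloads usually scale a little less than that.
    """
    return (1.0 + total_gain) / (new_ghz / old_ghz) - 1.0


gain = clock_normalised_gain(0.20, 3.2, 3.0)
print(f"Gain at equal clock: {gain:.1%}")
```

Under that assumption, about 12.5% of the improvement would be attributable to the architecture and the larger cache, with the rest coming from the higher clock.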