Intel Core 2 Duo - Test - BeHardware

Written by Franck Delattre and Marc Prieur

Published on June 22, 2006


Page 1


Netburst is dead, long live Core! Intel announced as much a little over a year ago. The Netburst architecture, introduced with the Pentium 4 in November 2000, is now replaced by a new architecture called Core, available for desktop, mobile and server platforms.

Intel will release new Xeons in the coming days and the Core 2 Duo LGA 775 processors at the end of July. This is a great opportunity to study the Core architecture, and also the performance of the Core 2 Duo product line in practice.

The Core legacy
To understand the technical aspects of the Core architecture, it’s important to look at the past. Let’s go back a few years, to the end of 2000. At that time, the entire line of Intel processors (desktop, server and mobile) relied on the P6 architecture, introduced in 1995 with the Pentium Pro. Despite the improvements that came with each new version, it eventually started to run out of steam. This was especially true against AMD and the Athlon, which won the highly symbolic, marketing driven race to the gigahertz. Intel urgently needed a new architecture to replace the P6.

The introduction of a new architecture isn’t an easy task. From its release it must perform at least as well as the most advanced products based on the previous architecture and also (and mainly) have the potential to evolve over the following five or six years. This is the average time required to make R&D investments profitable, and it has been Intel’s way of proceeding since the company was founded, even if competition has tended to accelerate product renewal. The objective is to avoid repeating the Pentium III EB 1.13 GHz mishap, where the P6 architecture was pushed so far beyond its limits that the processor had to be recalled and withdrawn from the market.

This capacity to evolve was probably the main concern when the Netburst architecture was defined. Netburst was conceived to deliver growing performance throughout its lifespan. Let’s see how it was done.

Page 2
IPC and frequency

CPU performance can be evaluated by the number of instructions processed per second, in other words the IPS. It is equal to:

IPS = I/C x C/S = IPC x F

I is the number of instructions, C the number of processor cycles and S the time in seconds. I/C is the IPC, the average number of instructions processed per cycle. C/S is the number of cycles per second, in other words the clock frequency, called F.



This simple formula shows that the IPC and frequency are the two main performance factors. They are intimately connected to processor architecture and especially to the depth of the processing pipeline.
Let’s consider for example a processor where the fastest instruction is processed in 10 ns. If it uses a processing pipeline made of 10 stages, one stage is processed in 1 ns (10 ns / 10 stages), which is the minimum cycle time. The maximum reachable frequency is the inverse of this cycle time, or 1 GHz. If the pipeline has 20 stages, the cycle time is 0.5 ns (10 ns / 20 stages) and the maximum frequency 2 GHz. The maximum operating frequency increases with the depth of the pipeline.
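The relation between pipeline depth and maximum frequency can be written as a few lines of Python. This is only a sketch of the idealised model used above (work split perfectly evenly between stages, no overhead per stage); the function name is ours.

```python
def max_frequency_ghz(instruction_time_ns: float, stages: int) -> float:
    """Ideal maximum clock frequency when the work is split evenly across stages."""
    cycle_time_ns = instruction_time_ns / stages  # time allotted to one stage
    return 1.0 / cycle_time_ns                    # the inverse of nanoseconds is GHz


# The 10 ns instruction of the example, with the two pipeline depths discussed.
for stages in (10, 20):
    print(stages, "stages ->", max_frequency_ghz(10.0, stages), "GHz")
```

Real designs lose part of this gain to latch overhead between stages, which the model deliberately ignores.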

IPC is a characteristic intrinsic to a processor’s architecture. It depends, among other things, on the capacity of the calculation units. For example, if the processor has a single processing unit for additions, it can deliver at most one addition per cycle. If it has two, it may be able to process two additions per cycle. We say "may" because this optimum scenario implies that the processing pipeline sustains a constant, maximum throughput. In practice, the instruction flow contains events that make the pipeline wait, which interrupt this throughput and reduce the IPC. Two types of events in particular reduce pipeline performance: branching and memory access.

Let’s take the case of a processor that has two integer calculation units and a maximum IPC of 2 on these instructions. We add a cache subsystem with a success rate of 98% and a central memory with an access time of 70 ns.

Approximately 20% of the instructions in x86 code access memory. Of these, 98% find their data in the cache subsystem and 2% have to go to central memory. We suppose that for the remaining 80% of the code, and for the 98% of accesses that hit the cache, the processor sustains its maximum IPC of 2, which represents 0.5 cycles per instruction. The average number of cycles per instruction (CPI) is:

CPI = 20% x (98% x 0.5 + 2% x M) + 80% x 0.5

M represents the access time to central memory in cycles.
  • with a 10 stage pipeline at 1 GHz, a memory access requires 70 cycles. The CPI is 0.778, which corresponds to an average IPC of 1.28, or 64% of the maximum theoretical IPC.
  • with a 20 stage pipeline, the only difference is the memory access time in cycles. At 2 GHz, 70 ns corresponds to 140 cycles. In this case CPI = 1.06. The average IPC is 0.95, or 47% of the theoretical IPC.
    Branching has a slightly lower impact, but it also depends on the depth of the pipeline. When a branch is mispredicted, the content of the pipeline is incorrect because it holds instructions from the wrong branch. The penalty in cycles is equivalent to the depth of the pipeline. If we assume 10% branch instructions and a branch prediction success rate of 96%, the result is:

    CPI = 10% x (96% x 0.5 + 4% x P) + 90% x 0.5

    P is the pipeline depth.
  • with a 10 stage pipeline, the result is CPI = 0.538. The IPC is 1.85 (92.5% of the theoretical IPC).
  • with a 20 stage pipeline, the result is CPI = 0.578. The IPC is 1.74 (87% of the theoretical IPC).

    With the penalties due to branching and to memory accesses taken together, the IPC falls to 1.19 for the 10 stage pipeline and 0.82 for the 20 stage pipeline. What interests us is not the IPC itself, but the result of its multiplication by the frequency, which gives the number of instructions processed each second.

    We see that the maximum frequency allowed by a 20 stage pipeline compensates for the reduction in IPC: 0.82 x 2 GHz gives about 1.64 billion instructions per second, against 1.19 x 1 GHz, or about 1.19 billion, for the 10 stage version. In the end, the 20 stage pipeline is at least as fast as the 10 stage version. This is why Intel opted for long pipelines and made them its new philosophy: Netburst was born.
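The two CPI formulas above are easy to replay in Python. The sketch below uses the article's own figures as default parameters; the function names are ours, and the way the two penalties are combined (multiplying the two efficiency ratios) is an assumption on our part: the text does not spell it out, but this reproduces the 1.19 and 0.82 figures to within rounding.

```python
def cpi_memory(miss_penalty_cycles, mem_fraction=0.20, hit_rate=0.98, base_cpi=0.5):
    """CPI when only memory accesses stall the pipeline."""
    hit = hit_rate * base_cpi
    miss = (1 - hit_rate) * miss_penalty_cycles
    return mem_fraction * (hit + miss) + (1 - mem_fraction) * base_cpi


def cpi_branch(pipeline_depth, branch_fraction=0.10, predict_rate=0.96, base_cpi=0.5):
    """CPI when only mispredicted branches stall the pipeline."""
    good = predict_rate * base_cpi
    bad = (1 - predict_rate) * pipeline_depth  # penalty = depth of the pipeline
    return branch_fraction * (good + bad) + (1 - branch_fraction) * base_cpi


def combined_ipc(stages, freq_ghz, mem_latency_ns=70.0, peak_ipc=2.0):
    """Assumed combination: peak IPC scaled by both efficiency ratios."""
    base_cpi = 1.0 / peak_ipc
    mem_cycles = mem_latency_ns * freq_ghz  # ns x GHz = cycles
    eff_mem = base_cpi / cpi_memory(mem_cycles, base_cpi=base_cpi)
    eff_branch = base_cpi / cpi_branch(stages, base_cpi=base_cpi)
    return peak_ipc * eff_mem * eff_branch


for stages, freq in ((10, 1.0), (20, 2.0)):
    ipc = combined_ipc(stages, freq)
    print(f"{stages} stages at {freq} GHz: IPC {ipc:.2f}, "
          f"{ipc * freq:.2f} billion instructions per second")
```

Changing the cache success rate or the memory latency in the call shows how quickly the deeper pipeline loses its advantage when misses become more expensive.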

    Page 3
    The plan and the problems of Netburst

    The Netburst plan
  • 20 stages (Willamette and Northwood cores), for a maximum frequency of 3.4 GHz.
  • 31 stages (Prescott and Cedar Mill cores), for a maximum frequency expected of 5 GHz
  • 45 stages (Tejas core), to reach over 7 GHz.

    Of course there are limits to increasing the number of pipeline stages. Beyond 55 stages, the IPC reduction due to the above mentioned factors is no longer compensated by the increase in clock frequency, and the number of instructions per second, and consequently performance, begins to decrease.

    (source Intel)

    Unfortunately, the first Pentium 4 Willamettes weren’t very efficient, except perhaps the 2 GHz version. The theoretical model shows that performance is only there if the clock frequency is high enough to compensate for the IPC reduction. The Willamette between 1.3 and 1.5 GHz only partially fulfilled this condition, while the Northwood spectacularly rectified the situation, thanks both to much higher frequencies and to a much bigger and faster cache than the Willamette’s. The result was a higher cache success rate and a reduction in the penalties due to memory accesses. Northwood versions from 2.8 GHz up really proved the worth of Netburst. The 3.2 and 3.4 GHz versions remain models of performance to this day and are very much sought after on the second hand market.

    In June 2004, Intel moved to the second phase of the Netburst plan and introduced the Prescott. Even though it included more cache memory than the Northwood, it surprised us in tests on two points: performance was in some cases inferior to the Northwood’s, and the new processor, despite its 90 nm fabrication process, tended to reach very high temperatures. The performance drop compared to the Northwood is explained by the increase in pipeline depth to 31 stages. The excessive heat, however, was a very bad surprise, and the Prescott never completely rid itself of the problem despite noticeable improvements across steppings. In the end, thermal dissipation halted the Prescott’s progression. It was stuck in frequency, which led to doubts about the entire Netburst architecture.

    The problems of Netburst
    The Northwood already suffered from significant thermal dissipation, even if the problem was not as severe as with the Prescott. While thermal dissipation remained acceptable for a desktop or server platform, it was a real problem for mobile platforms, because of both heat and battery life. Even though the Pentium 4 exists in a Mobile version, the Netburst architecture was never really suited to low power use. A new architecture was required for this domain.

    In parallel with Netburst, a mobile architecture based on the P6 was developed, whose first representative was the Pentium M Banias, released in March 2003. It was a success, combining performance with energy saving features, but Mobile also made life hard for Netburst: Intel had to produce two different architectures to cover all computer platforms. This of course meant higher production costs than a single multi-use architecture, and it was a first setback for Netburst.

    High frequencies are one reason for the high thermal dissipation, but not the only one. At equivalent frequencies, the Prescott dissipates more energy than the Northwood despite a finer fabrication process. The difference comes from pipeline depth: more stages increase power dissipation because of the way the work has to be cut up into stages.

    To understand this, you have to know that some critical steps in instruction processing need to be completed in one clock cycle. If they are not, pipeline operation slows down considerably. This is the case for branch prediction and for the dependency handling of the out-of-order engine. These key stages aren’t good candidates for being cut up and have to finish their work in one clock cycle.

    The longer the pipeline, however, the shorter the clock cycle. To compensate, the algorithms used by these stages have to be parallelised so that they finish their work in the time allowed. This parallelisation makes the stage considerably more complex and increases, among other things, the number of transistors it requires. And if changing the algorithm alone is not enough to finish the operation in one cycle, faster, bigger and more power hungry transistors have to be used. This of course increases thermal dissipation, all the more so as the targeted clock cycle gets shorter and the pipeline deeper.

    The following example illustrates this constraint particularly well. The Northwood has “double speed” integer calculation units (ALUs), each of which can in practice complete two operations per clock cycle. The Prescott’s longer pipeline didn’t allow such ALUs to be integrated. To keep the same instruction throughput, each double speed ALU was turned into two single speed ALUs. This of course doubled the number of transistors used by the units in question.

    The Prescott turned each double speed ALU of the Northwood into two single speed ALUs.

    We can ask ourselves where Netburst would be today without the heat dissipation issues, if a cryogenic cooling system replaced Intel’s standard CPU fan. The Prescott would run at 4.8 GHz and the Cedar Mill would be at over 5 GHz. The Tejas would be about to be released with the SSE4 instruction set (initially called TNI for "Tejas New Instructions") and a 45 stage pipeline.

    The objective of this projection isn’t to show how idyllic the Netburst architecture is, but rather to make it clear that the abandonment of Netburst wasn’t motivated by performance issues as such. In the end, thermal dissipation simply didn’t make it possible to reach the frequencies required for the targeted performance.

    Page 4
    After Netburst: Intel Core

    After Netburst
    When Intel stopped Netburst, it found itself in a situation similar to that of 2001, when it was defining the successor to the P6 architecture. However, the experience of Netburst changed the requirements compared to those of 2001, and the new specifications form the base of Core.

  • Netburst showed that it was becoming harder and harder to design a micro-architecture capable of evolving over the long term (more than 5 years). Forecasts came with too much uncertainty and too many unknown parameters. Succeeding Netburst therefore no longer means investing money and hopes in an entirely new architecture: the new policy consists of the step by step evolution of an existing architecture that already provides good performance.

  • Intel also had to rid itself of the bad image of very power hungry processors. Now is the time for energy saving products with little heat and noise.

  • The objective was also not to have to maintain a parallel architecture for mobile platforms.

    With these new specifications, all eyes turned to the Mobile architecture. It already existed, had evolved in parallel with Netburst, and had grafted onto the P6 the innovations introduced by the desktop Netburst (quad-pumped bus, SSE2). Its short pipeline allowed low power consumption. Almost all the elements were there to make Mobile the ideal successor to Netburst. It enjoyed a very good reputation among users, who only regretted that it was confined to mobile platforms. They were so eager to see it elsewhere that there were more and more attempts to adapt it to desktop platforms, despite Intel’s desire to protect Netburst from too fast a fall while it prepared its next step.

    With the new specifications, Mobile benefits from several improvements that increase its performance and make it capable of ensuring Intel’s presence on all three PC platforms. The Core architecture was born!
    Back to a unified architecture
    While choosing Mobile as the base of the new Core answers the requirement for an energy saving architecture, it still has to be adapted to the needs of non mobile platforms. This is an original way of proceeding, because until now desktop processors were adapted into mobile versions and not the other way around.

    Coming back to a unified architecture for all three platforms of course represents production savings for Intel but, according to the manufacturer, it will also make developers’ work easier, since they will no longer have to optimise their programs for several micro-architectures with different requirements… at least as long as they stay within Intel’s product line!
    A common architecture also means generic optimisations that are no longer specific to one processor or another. For example, the fact that 64 bit extensions were not available everywhere, and in particular not in Intel’s Mobile architecture until now, certainly curbed the use of this new mode. Core offers developers the following as standard:
  • SSE, SSE2, SSE3 and new Supplemental SSE3 instruction sets.
  • EM64T.
  • Virtualization technology.

    It would have been very interesting to see dual core on this list as well, but unfortunately Intel is also planning to release the Core architecture in single core products. Too bad!

    Core architecture in a Conroe
    Priority to IPC
    Efficient as it is, Mobile isn’t far enough ahead of the latest Netburst based processors, and especially of the Athlon 64. Core has the ambition of taking back the performance lead on the desktop platform, and several modifications had to be made to Mobile for this purpose.

    Core has a 14 stage processing pipeline (Mobile has 12). Such a depth restricts the maximum operating frequency, so efforts focused not on pipeline depth but on width, in order to reach a high IPC.

    Core inherits the dynamic out-of-order (OOO) execution engine of Mobile and improves it by extending its processing capacity. Each execution core of a Core processor can fetch, decode and execute up to 4 instructions per cycle, where Mobile could only process 3. Core thus introduces the 4-wide dynamic execution engine.

    Increasing the instruction throughput is an acceleration factor in itself, but it also gives the OOO engine a wider instruction window, which makes its management of dependencies easier and therefore improves its efficiency. Remember that the same objective, making better use of the OOO engine, was behind the integration of Hyper-Threading in Netburst.

    A wider execution engine requires calculation units capable of sustaining a higher instruction throughput than Mobile’s. Core’s calculation units have been reworked with this in mind.

    Page 5
    Calculation units

    Here is a quick comparison of the current architectures:

    …and of the theoretical instruction bandwidth that result from these architectures:

    Core uses three integer calculation units. This is one more than Mobile and the same as the K8, for a capacity of three x86 instructions per cycle. Netburst keeps its supremacy in integer processing with its double speed units, which can process up to 4 integer instructions per cycle. (It isn’t 5, as the presence of an additional single speed ALU might suggest, because that ALU shares its port with one of the two double speed ALUs.) Unfortunately, this processing capacity can’t be exploited in practice because Netburst’s decoding units aren’t able to sustain such a throughput, which restricts the IPC to 3.

    We felt it would be interesting to observe Core’s behaviour on common x86 instructions such as arithmetic operations, shifts and rotations. We used a tool integrated into Everest which provides the latency and throughput of a selection of x86/x87, MMX, SSE, SSE2 and SSE3 instructions. This tool is included in the evaluation version: just right click in Everest’s status bar, select "CPU Debug" and then "Instructions latency dump" in the menu.

    The latency of an instruction is the number of processor cycles it spends in the processing pipeline. In practice the OOO engine reorders the instruction flow in order to mask latencies, but dependencies between instructions generate waits, which are all the longer as the latencies of these instructions are high. The transfer rate, or throughput, of an instruction corresponds to the minimum time, in processor cycles, that separates the start of two similar instructions. For example, an integer division has a throughput of 40 cycles on the K8, which means that the processor can only start one integer division every 40 cycles.
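The difference between latency and throughput shows up clearly when comparing a chain of dependent operations with a series of independent ones. The sketch below is a deliberately idealised model (one execution unit, no other stalls); the function names and the 3 cycle / 1 cycle figures are hypothetical values chosen for illustration, not measurements.

```python
def cycles_dependent(n: int, latency: float) -> float:
    """Each operation needs the previous result, so the latencies add up."""
    return n * latency


def cycles_independent(n: int, latency: float, issue_interval: float) -> float:
    """Operations overlap: a new one can start every issue_interval cycles."""
    if n <= 0:
        return 0
    return latency + (n - 1) * issue_interval


# Hypothetical instruction: 3 cycles of latency, one issued per cycle.
print("100 dependent operations:  ", cycles_dependent(100, 3), "cycles")
print("100 independent operations:", cycles_independent(100, 3, 1), "cycles")
```

This is why the OOO engine's ability to find independent instructions matters so much: the same 100 operations cost roughly three times less when they can overlap.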

    For some instructions, like addition, Core reaches a throughput equal to the maximum theoretical IPC (0.33 cycles per instruction, or 3 instructions per cycle). Multiplication has a slightly lower latency than on the Yonah and is at the same level as on the K8. Integer division gains a little less, but it is still much faster than on the K8 and Netburst. As for register manipulations, Core is slower than the K8, even if shifting (shl) has been improved compared to the Yonah.

    What we should remember from this table is that the effort on Core’s units went into instructions on which the K8 was well ahead of Mobile and Netburst (integer addition and multiplication, for example), and that less attention was given to instructions on which the K8 doesn’t excel (integer division, for example).
    Theoretical SSE Performances
    One of the most noticeable improvements to Core’s calculation units is the presence of three SSE units dedicated to integer and floating point SIMD operations. Combined with the appropriate arithmetic units, each is capable of processing a 128 bit packed operation (acting simultaneously on four 32 bit or two 64 bit values) in a single cycle, instead of 2 cycles for Netburst, Mobile and the K8. This applies to common arithmetic operations such as multiplication and addition.

    Each of the three ALUs is associated with one SSE unit. Together they can process up to 3 full 128 bit SSE operations per cycle (that is 12 operations on 32 bit integers, or 24 on 16 bit integers). Mobile and the K8 only have 2 SSE units, each able to process 64 bits per clock cycle. Their capacity on SSE integers is therefore 2 x 64 bits per cycle, which is 4 operations on 32 bit integers (or 8 on 16 bit integers).

    Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. The theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128 bit floating point instructions per cycle (that is 8 operations on 32 bit single precision floating point values, or 4 operations on 64 bit double precision values). For this type of instruction Core is, in theory, twice as fast as Mobile, Netburst and the K8. Let’s see how it behaves with several SSE2 instructions.
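The per-cycle figures quoted above all come from the same small calculation: number of units, multiplied by the number of elements each unit handles per cycle. The helper below (its name is ours) simply replays that arithmetic with the unit counts and widths given in the text.

```python
def simd_ops_per_cycle(units: int, bits_per_unit_per_cycle: int, element_bits: int) -> int:
    """Peak number of element operations per cycle across all SIMD units."""
    return units * (bits_per_unit_per_cycle // element_bits)


# Integer SSE: Core (3 units x 128 bits) against Mobile and K8 (2 units x 64 bits).
print("Core, 32 bit integers:     ", simd_ops_per_cycle(3, 128, 32))
print("Core, 16 bit integers:     ", simd_ops_per_cycle(3, 128, 16))
print("Mobile/K8, 32 bit integers:", simd_ops_per_cycle(2, 64, 32))
print("Mobile/K8, 16 bit integers:", simd_ops_per_cycle(2, 64, 16))

# Floating point SSE on Core: 2 units x 128 bits.
print("Core, single precision:    ", simd_ops_per_cycle(2, 128, 32))
print("Core, double precision:    ", simd_ops_per_cycle(2, 128, 64))
```

These are theoretical peaks; they say nothing about whether the decoders and caches can feed the units at that rate.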

    Packed mov is particularly fast on Core, which here reaches a throughput of three 128 bit operations per cycle. The throughput of isolated arithmetic operations is explained by the fact that each is handled by a single FP unit, which on its own has a maximum throughput of 128 bits per cycle. The combined mul + add operation uses the two units jointly and executes at a rate of one cycle for the two operations, in other words two 128 bit operations per cycle.

    Intel talks a lot about this new calculation capacity and calls it Digital Media Boost. Core also introduces a new set of SSE instructions. Initially expected with the Tejas and sometimes referred to as SSE4, this set, officially named Supplemental SSE3 (SSSE3), consists of 16 new SIMD instructions. Most of them operate on integer data, and they are essentially intended to accelerate video compression and decompression algorithms. For example, palignr concatenates two registers and extracts a 128 bit window from the result at a given byte offset. This operation is often used in motion prediction algorithms for MPEG decoding.
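To make the behaviour of palignr more concrete, here is a small Python model of its 128 bit form. It is only an illustration: registers are represented as 16 byte strings with byte 0 as the least significant byte, and the function name and this representation are our own choices, not Intel's notation.

```python
def palignr(dst: bytes, src: bytes, imm: int) -> bytes:
    """Model of 128 bit PALIGNR: concatenate dst:src, shift right by imm bytes,
    and keep the low 16 bytes. Byte 0 of each value is the least significant."""
    if len(dst) != 16 or len(src) != 16:
        raise ValueError("both operands must be 16 bytes long")
    if not 0 <= imm <= 255:
        raise ValueError("imm must fit in 8 bits")
    composite = src + dst                      # src is the low half, dst the high half
    window = composite[imm:imm + 16]           # shifting right drops the low imm bytes
    return window + bytes(16 - len(window))    # zeros are shifted in from the top


low = bytes(range(0, 16))     # bytes 0..15
high = bytes(range(16, 32))   # bytes 16..31
print(list(palignr(high, low, 4)))  # a window straddling the two registers
```

In a video decoder this makes it possible to read 16 pixels starting at an unaligned position from two aligned loads, which is exactly the access pattern of motion compensation.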

    The capacities of Core’s execution units are very impressive. On paper, Intel has given them a potential two to three times that of its previous products and of the competition’s. Having a high IPC on paper is one thing, but exploiting it in practice is another. As we saw above, x86 code tends to reduce the IPC because of branching and memory accesses. Intel has logically made several improvements to reduce the harmful effects of these two factors.

    Page 6
    Caches, memory and prefetch

    The Core’s caches
    The Core architecture places new demands on the cache subsystem. A high IPC requires a cache subsystem with a high success rate in order to mask memory latencies efficiently. It also requires a high transfer rate to keep up with the data demand that grows with the IPC.

    The table below gathers the main cache characteristics of the new architecture, including the access latencies and transfer rates obtained with the SSE2 (128 bit) memory bandwidth test of RightMark Memory Analyzer (RMMA):

    Core’s L1 caches share the same size and associativity as Mobile’s. However, the available bandwidth is doubled, as shown by the RMMA read transfer rate test. We find the same result by looking at the throughput of the 128 bit SSE2 read instruction movapd: one 128 bit read per cycle, or 16 bytes per cycle.

    L2 cache access requires an additional cycle. Its transfer rate is 8 bytes per cycle.

    Unlike the Pentium D and Athlon 64 X2, Core uses the Advanced Smart Cache technique inaugurated with the Yonah, which consists of sharing the L2 cache between the two execution cores. Compared to an L2 cache dedicated to each core, the main advantage of this method is that data can be shared between the two cores without using the memory bus. It reduces memory accesses (and the latencies that go with them) and optimises L2 occupancy (redundant copies disappear).

    The shared cache can also be dynamically allocated between the two cores, to the point of being entirely available to just one. This technique, developed specifically for a dual core implementation, is paradoxically more efficient than separate caches when only one of the two cores is in use, which is the case for all single threaded applications.
    An intelligent memory access
    In addition to the cache improvements, Intel has developed new techniques to improve memory accesses. They are grouped under the slightly pompous name of Smart Memory Access.

    The idea consists of working on two criteria, whose objective is to, once again, mask memory access latencies:
  • ensuring that a piece of data can be used as soon as possible (the temporal constraint).
  • ensuring that a piece of data is as close as possible (in the memory hierarchy) to the processing unit (the "where" constraint).

    The temporal constraint concerns the way the processor schedules memory read and write operations. When a memory read enters the out-of-order engine, it can’t be completed before all the writes that precede it have been completed. Otherwise, it would risk reading data that hasn’t yet been updated in the memory hierarchy. This constraint imposes waits and slows processing down.

    Core introduces a speculative mechanism that predicts whether a read is likely to depend on writes that are still being processed, in other words whether or not it can be executed without waiting. Its role is to remove ambiguities, hence its name: Memory Disambiguation. Besides reducing waits, the method reduces dependencies between instructions and increases the efficiency of the out-of-order engine.
    Hardware prefetch
    Addressing the "where" constraint, which means bringing data as close as possible to the processing units, is the job of the cache subsystem. To help it in this task, Core uses hardware prefetch. This technique consists of using the memory bus when it’s idle to preload code and data from memory into the cache subsystem.

    Hardware prefetch isn’t a new technique. It first appeared with the Pentium III Tualatin, but it was above all Netburst that fundamentally improved it. The large gap between processor and bus frequencies makes Netburst particularly sensitive to the harmful effects of a cache miss, which increases the value of an efficient prefetch. For once, Core inherits a technique from Netburst, the prefetch, and slightly improves it.

    Several types of prefetchers are included in Core:

  • the instruction prefetcher preloads instructions into the L1 instruction cache based on branch prediction results. Each of the two cores has one.
  • the IP prefetcher examines the read history to identify an overall pattern and loads "foreseeable" data into the L1 cache. Each core also has one.
  • the DCU prefetcher detects multiple reads from a single cache line within a given period of time and decides to load the following line into the L1 cache. One per core as well.
  • the DPL prefetcher works in a similar way to the DCU. The difference is that it detects requests on two successive cache lines (N and N+1) and then triggers the loading of line N+2 from central memory into the L2 cache. The L2 cache has two of them, shared dynamically between the two cores.

    The total is 8 prefetchers for a Core 2 Duo: three per core, plus the two shared DPL prefetchers.

    The small suns represent the 8 prefetchers of the Core 2 Duo.

    Hardware prefetch mechanisms are generally very efficient and in practice increase the success rate of the cache subsystem. Unfortunately, prefetch sometimes has the opposite effect. If prediction errors are frequent, they pollute the cache with useless data and reduce its success rate. For this reason, most of the hardware prefetch mechanisms can be deactivated. Intel recommends deactivating the DCU prefetcher on processors intended for servers (the Woodcrest), as it is liable to reduce performance in some applications.
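The principle of a next-line prefetcher such as the DCU can be illustrated with a toy simulation. Everything specific in this sketch is an assumption made for the example: the 64 byte line size, the trigger threshold of two reads on the same line, and a cache with unlimited capacity (so cache pollution, mentioned above, is not modelled). It is not a description of Intel's actual implementation.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for the example)


def count_misses(addresses, prefetch: bool, threshold: int = 2) -> int:
    """Count cache misses for a sequence of read addresses in a toy L1 cache."""
    cached = set()       # line numbers currently in the toy cache (never evicted)
    reads_on_line = {}   # number of reads seen on each line
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in cached:
            misses += 1
            cached.add(line)
        reads_on_line[line] = reads_on_line.get(line, 0) + 1
        # Several reads on the same line: preload the following line.
        if prefetch and reads_on_line[line] == threshold:
            cached.add(line + 1)
    return misses


# Sequential reads of 8 bytes across 8 consecutive cache lines.
sequential = range(0, LINE_SIZE * 8, 8)
print("misses without prefetch:", count_misses(sequential, prefetch=False))
print("misses with prefetch:   ", count_misses(sequential, prefetch=True))
```

On a sequential scan, only the very first line misses once the prefetcher is active. With a random access pattern the preloaded lines would be useless, which is the situation where deactivating the prefetcher can pay off.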

    Page 7
    Branching and fusion

    After memory access, branching is the second biggest factor slowing a processor down, in this case when a prediction is wrong.

    In an instruction flow, branching consists of a jump to a new address in the code. Two types of branching exist:
  • direct branching, for which the jump address is explicitly given in the code in the form of an operand. The destination address is resolved at compile time. Most of the time, a direct branch is a loop jump.
  • indirect branching, which jumps to an address that changes dynamically during the execution of the program, so that multiple destinations are possible. Indirect branches are found in "switch / case" type tests and are often used in object oriented languages in the form of function pointers.

    Whether direct or indirect, branches are an obstacle to the optimum operation of the processing pipeline. When a jump instruction enters the pipeline, in theory no new instruction can follow it until the destination address has been calculated, i.e. until the jump instruction reaches the last processing stages. The pipeline then contains bubbles, which seriously reduce its efficiency. The objective of the branch predictor is to guess the destination of the jump so that the following instructions can be loaded without waiting.

    There are several types of predictor. The simplest and oldest is the static predictor, which relies on the assumption that a branch will always be taken or, on the contrary, never be taken. In a loop, the static mechanism thus correctly predicts all the jumps except the last! Of course, the success rate depends on the number of iterations.
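The dependence of the static predictor's success rate on the number of iterations is simple to quantify. The sketch below assumes a loop whose closing branch is evaluated once per iteration and predicted "always taken"; the function name is ours.

```python
def static_loop_accuracy(iterations: int) -> float:
    """Success rate of an 'always taken' prediction on a loop's closing branch."""
    if iterations < 1:
        raise ValueError("the loop must run at least once")
    # The branch is taken on every iteration except the last, where the loop exits.
    return (iterations - 1) / iterations


for n in (2, 10, 100, 1000):
    print(f"{n:>5} iterations: {static_loop_accuracy(n):.1%} correct")
```

Long loops are therefore almost free for the static predictor, while very short loops are mispredicted a significant part of the time.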

    The static predictor reaches its limits in "if… then… else" situations, where it has a 50% chance of being wrong. In this case the processor resorts to dynamic prediction, which consists of storing a history of branch results in a table (the BHT: branch history table). When a branch is encountered, the BHT stores the result of the jump and, if the branch is taken, the destination address is stored in a dedicated buffer, the BTB (branch target buffer). (If the branch isn’t taken the address isn’t stored, because the destination is simply the instruction following the branch.) Processors use two types of dynamic predictor, which are distinguished by the length of the branch history they store, in order to refine the granularity of the prediction mechanism.
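A classic textbook way of implementing a branch history table is a table of 2 bit saturating counters indexed by the branch address. The sketch below uses that scheme purely as an illustration of dynamic prediction: the article does not say what the BHT of Core or Mobile actually stores, and the table size, the initial counter value and the class and function names are all assumptions.

```python
class TwoBitPredictor:
    """Toy branch history table made of 2 bit saturating counters."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        # Counter values 0 and 1 predict "not taken", 2 and 3 predict "taken".
        self.counters = [1] * entries

    def predict(self, branch_addr: int) -> bool:
        return self.counters[branch_addr % self.entries] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        i = branch_addr % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(outcomes, branch_addr: int = 0x400) -> float:
    """Fraction of correct predictions for one branch and a list of outcomes."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) == taken:
            correct += 1
        predictor.update(branch_addr, taken)
    return correct / len(outcomes)


# A loop of 10 iterations (9 jumps taken, then the exit), executed 10 times.
loop_pattern = ([True] * 9 + [False]) * 10
print(f"success rate on the loop branch: {accuracy(loop_pattern):.0%}")
```

The two bits give the counter some inertia: a single loop exit does not flip the prediction, so the first jump of the next run of the loop is still predicted correctly.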

    The combined action of the dynamic and static predictors gives, depending on the size of the storage buffers, a success rate of 95 to 97% for direct branches. Efficiency falls to 75% correct predictions for indirect branches which, because of their multiple possible destinations, aren’t suited to the binary "taken / not taken" information stored in the BHT. Mobile inaugurated a prediction mechanism for indirect branches. The predictor stores in the BTB the different addresses where the branch ends up, along with the context that led to each destination (meaning the conditions that accompanied the jump). The predictor’s decision is no longer restricted to a single address for a given branch, but covers a series of "preferred" destinations of the indirect branch. This method gives good results but is very resource hungry, as the BTB has to hold several addresses per branch.

    Mobile also introduced an innovative technique called the "loop detector". This detector scrutinizes branches, looking for the typical behaviour of a loop: all branches taken except for one (or the opposite, depending on the exit condition). If such a loop is detected, a series of counters is assigned to the branch concerned, ensuring a success rate of 100%.
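
    The idea of a counter attached to a loop branch can be sketched like this. The class is hypothetical and assumes a loop whose number of iterations does not change from one run to the next; the real detector is certainly more elaborate.

```python
class LoopDetector:
    """Toy loop predictor: learns the trip count, then predicts the exit."""

    def __init__(self):
        self.trip_count = None  # taken branches seen before the last exit
        self.current = 0        # taken branches seen in the current run

    def predict(self):
        if self.trip_count is None:
            return True  # not trained yet: fall back to "taken"
        return self.current < self.trip_count

    def update(self, taken):
        if taken:
            self.current += 1
        else:
            self.trip_count = self.current  # remember when the loop exited
            self.current = 0


def count_hits(detector, outcomes):
    """Replay a list of outcomes and count the correct predictions."""
    hits = 0
    for taken in outcomes:
        hits += detector.predict() == taken
        detector.update(taken)
    return hits
```

    On a loop branch taken four times and then not taken, the first run misses the exit (4 hits out of 5), while every following run is predicted without error.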

    Of course, Core benefits from all of these improvements in addition to several others, on which we have not been able to obtain more information.
    Fusion mechanisms
    Core includes a number of techniques that aim to reduce the number of micro-operations generated for a given number of instructions. Processing the same task with fewer micro-operations means processing it faster (increasing the IPC) while consuming less power (increasing the performance per watt consumed).

    Initially introduced with Mobile, micro-fusion is one of these techniques. Let's see what it does with one example, the x86 instruction: add eax,[mem32].
    This instruction actually performs two distinct operations, a memory read and an addition. It will be decoded into two micro-operations:
    load reg1,[mem32]
    add reg2,reg1
    This breakdown also follows the logic of the processor's organisation: reading and addition are handled by two different units. In a standard procedure, the two micro-operations would be processed in the pipeline and the OOO engine would take care of dependencies.
    Micro-fusion consists, in this case, of a "super" micro-operation that replaces the two previous ones:
    add reg1,[mem32]
    A single micro-operation then goes through the pipeline. During execution, logic dedicated to the management of this micro-operation addresses the two units concerned in parallel. The benefit of this method is that it requires fewer resources (a single internal register is now necessary in this example).

    Core adds macro-fusion to this technique. Where micro-fusion merges two micro-operations into a single one, macro-fusion decodes two x86 instructions into a single micro-operation. It intervenes before the decoding phase, looking in the instruction queue for pairs that can be merged. For example, the instruction sequence:
    cmp eax,[mem32]
    jne target
    is detected as such and is decoded into the following single micro-operation:
    cmpjne eax,[mem32],target
    This micro-operation benefits from special treatment because it is handled by an improved ALU capable of processing it in a single cycle (if the data [mem32] is in the L1 cache).
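
    To make the effect of both fusion mechanisms more concrete, here is a toy micro-operation counter. Everything in it is an assumption made for illustration: the instruction format, the names, and the rule that only a cmp followed by je or jne is merged. The real conditions under which Core fuses instructions are more restrictive and are not detailed here.

```python
CONDITIONAL_JUMPS = {"je", "jne"}  # hypothetical subset, for illustration only


def count_micro_ops(instructions, micro_fusion=False, macro_fusion=False):
    """Count micro-operations for a list of (mnemonic, uses_memory) tuples."""
    total = 0
    i = 0
    while i < len(instructions):
        mnemonic, uses_memory = instructions[i]
        next_is_jump = (
            i + 1 < len(instructions)
            and instructions[i + 1][0] in CONDITIONAL_JUMPS
        )
        if macro_fusion and mnemonic == "cmp" and next_is_jump:
            total += 1  # cmp + conditional jump decoded as one micro-operation
            i += 2
            continue
        if uses_memory and not micro_fusion:
            total += 2  # separate load and compute micro-operations
        else:
            total += 1  # register-only, or load + compute micro-fused
        i += 1
    return total


# add eax,[mem32] / cmp eax,[mem32] / jne target
PROGRAM = [("add", True), ("cmp", True), ("jne", False)]

print(count_micro_ops(PROGRAM))                                       # 5
print(count_micro_ops(PROGRAM, micro_fusion=True))                    # 3
print(count_micro_ops(PROGRAM, micro_fusion=True, macro_fusion=True)) # 2
```

    On this three-instruction sequence, the model goes from 5 micro-operations without fusion to 2 with both mechanisms enabled.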

    An improved Core calculation unit is in charge of micro-operations coming from macro-fusion.

    It is rather difficult to quantify the performance gain brought by these fusion mechanisms. However, with the Yonah we measured that 10% of instructions are micro-fused, which reduces the number of micro-operations to process by as much. Our rough estimate is that the simultaneous use of macro-fusion extends this proportion to more than 15%.

    Page 8
    Intel Core product line & platform

    Intel Core product line
    The Intel Core architecture is intended for desktop, server and mobile product lines. The Xeon will come first, with the new processors of the 51xx line available in the days to come:
  • 5160 (3.00 GHz, FSB1333, 4 MB L2) : $851
  • 5150 (2.66 GHz, FSB1333, 4 MB L2) : $690
  • 5140 (2.33 GHz, FSB1333, 4 MB L2) : $455
  • 5130 (2.00 GHz, FSB1333, 4 MB L2) : $316
  • 5120 (1.86 GHz, FSB1066, 4 MB L2) : $256
  • 5110 (1.60 GHz, FSB1066, 4 MB L2) : $209
    The TDP of processors up to 2.66 GHz is 65 Watts, compared to 80 Watts for the 3 GHz model. Desktop processors, the Core 2 Duo, will be officially released at the end of July in the following versions:
  • X6800 (2.93 GHz, FSB1066, 4 MB L2) : $999
  • E6700 (2.66 GHz, FSB1066, 4 MB L2) : $530
  • E6600 (2.40 GHz, FSB1066, 4 MB L2) : $316
  • E6400 (2.13 GHz, FSB1066, 2 MB L2) : $224
  • E6300 (1.86 GHz, FSB1066, 2 MB L2) : $183
  • E4200 (1.60 GHz, FSB800, 2 MB L2) : N/A

    The X6800 will be part of the "Extreme Edition" line, which is why it's so expensive. We also noted that its Xeon counterpart, clocked at 3 GHz with a higher FSB, is much cheaper, which isn't very consistent.

    Laptops will see the coming of the Core 2 Duo this summer:
  • T7600 (2.33 GHz, FSB667, 4 MB L2) : $637
  • T7400 (2.16 GHz, FSB667, 4 MB L2) : $423
  • T7200 (2.00 GHz, FSB667, 4 MB L2) : $294
  • T5600 (1.83 GHz, FSB667, 2 MB L2) : $241
  • T5500 (1.66 GHz, FSB667, 2 MB L2) : $209
    These figures come from the official price list and correspond to Intel's prices per 1,000 units.
    Intel Core platforms
    Whatever the platform for which they are intended, processors based on Core architecture use an existing socket. For servers, it is the Socket LGA771 introduced recently with the "Dempsey" Xeon 50xx (derived from the Presler), whereas for desktops and mobiles it will be the Socket LGA775 and the Socket mPGA479M.

    A word of warning: this doesn't mean that processors will be compatible with existing platforms. For laptops, all current motherboards supporting the Core Duo are compatible with the Core 2 Duo, but desktop motherboards have to conform to the 11th version of Intel's VRM (Voltage Regulation Module).

    So, although the i975X officially supports the Core 2 Duo (we might believe that this is also the case for all other FSB1066 chipsets), i975X motherboards sold since September 2005 aren't compatible. However, other revisions such as Intel's rev. 304 D975X Bad Axe, or new products like the Asus P5W DH Deluxe, are compatible.

    A relatively simple way to ensure Core 2 Duo compatibility is to go directly for the P965 Express. Announced in early June, this chipset only equips very recent motherboards and is automatically compatible with the Core 2 Duo. Compared to the 975X, it works with a more functional ICH8. The MCH can only support one PCI Express x16 link, whereas the 975X can support one x16 link or two x8 links and Crossfire. SLI will be accessible via the new NVIDIA nForce 5 line for Intel, which will be released this summer. Here again, all nForce 5 motherboards will be automatically compatible with the Core 2 Duo, but before then there will also be a modified nForce 4 that supports the Core 2 Duo.

    Page 9
    CPU, motherboard, power consumption and o/c

    For this test we have received 3 desktop Core 2 Duos:
  • X6800 (2.93 GHz, FSB1066, 4 MB L2) : $999
  • E6600 (2.40 GHz, FSB1066, 4 MB L2): $316
  • E6400 (2.13 GHz, FSB1066, 2 MB L2): $224
    E6400, E6600 and X6800

    The three processors are stepping 4, revision B0. Processors in stores will be stepping 6.
    The motherboard : ASUSTeK P5W DH Deluxe
    Tests were made with ASUSTeK's i975X + ICH7R motherboard that supports the Core 2 Duo, the P5W DH Deluxe. It offers the usual functionalities implemented via this chipset, as well as ASUSTeK's additional functions.

    First of all, for storage, one of the 4 SATA ports supported by the ICH7R is connected to a Silicon Image 4723 chip that splits this port into two. So, it's possible to connect a single disc on the first port, which will be used normally, or two discs to use them in RAID 1, RAID 0 or JBOD. To choose the two latter modes you have to change the jumper position. RAID 1 is initially configured which is rather antiquated. ASUS also integrated a PCI Express JMicron JMB363 controller. The latter supports two Serial ATA ports (including one external), which can be configured in RAID 0 or 1, and one additional UDMA 100/66/33 port, which won't be too much for some users as the ICH7R only supports one.

    Network management is entrusted to two Marvell 88E8053 chips. Supporting Gigabit Ethernet, they are connected to the rest of the system via PCI Express. There is also WiFi, as the card has a WiFi 802.11a/b/g Realtek RTL8187L chip that uses the USB bus. HD audio is entrusted to a Realtek ALC882M chip and the card complies with Dolby Master Studio specifications. FireWire is supported via a Texas Instruments controller. We noted the presence of an infrared remote control that switches the computer on and off, places it in sleep or silent mode, or even controls the sound level or the video displayed.
    Power consumption
    First off, we took a look at the power consumption of these processors. The Core 2 Duo is derived from Mobile architecture, so it has the required base for low consumption. In terms of fabrication process and low dissipation transistors, the Core 2 Duo benefits from the latest fabrication process, which reduces dissipation. SpeedStep is of course implemented and has, according to Intel, been improved to reduce transition time.

    A new power management method exists in the Core 2 Duo that allows the processor to accurately manage consumption even under load. This is called Ultra Fine Grained Power Control. It consists of very precisely dividing the chip into areas that can be put to sleep. Unused units remain asleep even if the others run at full speed. This often happens, because it's rare that all processor units are in use at the same time. This ultra precise management allows better control of power consumption and thermal dissipation.

    The last innovation of the Core architecture that aims to reduce processor power consumption is the ability of the data and address buses to adapt themselves to the width of the data. So if only 64 bits have to be processed, only half of the 128 bit bus concerned is activated.

    What difference does this make in practice? Here is the total power consumption of the configuration under load with Prime 95. As the software isn't multithreaded, it is launched as many times as there are cores. For the Athlon 64 X2 and FX, measurements were taken on the M2N32-SLI Deluxe with Socket AM2:

    Results were very good, since the Core 2 Duo E6600 and E6400 are less hungry than the Athlon 64 X2 3800+. Strangely enough, the power consumption of our E6400 was equivalent to that of the E6600, both running at an identical voltage of 1.3V. The Core 2 Duo X6800 doesn't have a high power consumption either, because it's barely above the Pentium 4 631 at 3 GHz, whose performances are of course at a very different level.

    We are far from the power consumption of the FX-62 and especially the Pentium D 950 (here in stepping B1). Based on a 90 nm fabrication process, the Celeron has a higher power consumption than the 65 nm Pentium 4 631.
    What about overclocking? Of course our processors are "only" stepping 4, but we wanted to know what was possible. For each of the following results only air cooling was used, in fact the standard CPU cooler sold by Intel with the Pentium 4 & D. Room temperature was 31° for these tests and we increased voltage by +0.1V. Only overclocks that remained stable with 2 instances of Prime95 for 15 minutes were included in our results.

    The E6400 goes up to 3.2 GHz. However, at this frequency the FSB is 400 MHz for the ICH7, while we had to increase voltage to 1.65V.

    Our E6600 didn't give us the same good result, because stability was impossible at 3.2 GHz despite a voltage of 1.4V.

    Finally, the X6800 was the most overclockable, with stability reached at 3.4 GHz and 1.4V. Once again, we have to specify that these overclocking results are only valid for stepping 4 Core 2 Duos. Stepping 5 apparently easily reaches over 3.4 or even 3.6 GHz with air cooling. Starting with a processor clocked too low requires a high FSB not necessarily supported by all motherboards. For example, for 3.6 GHz, an E6400 will require a FSB of 450 MHz. As for the coefficient, it can be lowered in steps of 1 down to 6 whatever the CPU, thanks to EIST. We couldn't go above the stock coefficient, however, even with the X6800.
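
    The relationship between coefficient, FSB and frequency used in this example is a simple product, as the short sketch below shows. The function names are ours, and the E6400 coefficient of 8 is an assumption deduced from its 2.13 GHz stock frequency on a 266 MHz bus.

```python
def core_clock_mhz(coefficient, fsb_mhz):
    """Core frequency is the coefficient (multiplier) times the FSB clock."""
    return coefficient * fsb_mhz


def required_fsb_mhz(target_mhz, coefficient):
    """FSB clock needed to reach a target frequency with a fixed coefficient."""
    return target_mhz / coefficient


E6400_COEFFICIENT = 8  # assumption: 2133 MHz / 266 MHz, rounded

print(required_fsb_mhz(3600, E6400_COEFFICIENT))  # 450.0
print(core_clock_mhz(E6400_COEFFICIENT, 400))     # 3200
```

    The second line corresponds to our E6400 result: 3.2 GHz with the FSB at 400 MHz.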

    Overclocking Step 5
    Just after finishing all tests, we received a stepping 5 Core 2 Duo E6600 and tested it to see what the Overclocking potential of this processor was. The conditions were the same:

    This time, 3.4 GHz was stable from 1.35V, compared to 1.4V for the stepping 4 X6800. We even reached 3.6 GHz at 1.4V. To go beyond this, voltage needed to be increased even more and the processor starts to dissipate quite a bit of energy. Water cooling solutions will be greatly appreciated.

    Aiming for a frequency between 3.4 and 3.6 GHz seems to be perfectly reasonable for a Core 2 Duo stepping 5.
    Performances at 3.6 GHz
    What are the performances of a Core 2 Duo overclocked at 3.6 GHz in 9x400 with DDR2-800 and timings of 4-4-4-12 ? This is what we wanted to know. Here are the figures compared to a X6800 in DDR2-800:

    With a 22.7% frequency increase, we could logically expect similar performance gains, sometimes even higher because of the impact of a higher FSB in some of the tests.
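
    The 22.7% figure can be checked with a few lines. The sketch assumes that the X6800 runs at 11 x 266.67 MHz at stock settings (about 2933 MHz), which is our reading of its 2.93 GHz rating; the function name is hypothetical.

```python
from fractions import Fraction


def frequency_gain_percent(stock_mhz, overclocked_mhz):
    """Relative frequency increase, in percent."""
    return float((Fraction(overclocked_mhz) / Fraction(stock_mhz) - 1) * 100)


X6800_STOCK_MHZ = Fraction(11 * 800, 3)  # assumption: 11 x 266.67 MHz
OVERCLOCKED_MHZ = 9 * 400                # 9x400, as in this test

print(round(frequency_gain_percent(X6800_STOCK_MHZ, OVERCLOCKED_MHZ), 1))  # 22.7
```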

    Page 10
    L2 influence, DDR2, FSB

    Influence of cache L2
    First off in this area, we wanted to know what would be the performance gain of the 4MB of unified cache L2 compared to 2 MB. To do so, we compared a Conroe (4 MB) and an Allendale (2 MB) both clocked at 2.13 GHz:

    As usual, gains were variable depending on the application. We reached 7.2% with WinRAR, 6.2% in Pacific Fighters and 4.8% in Far Cry, which is appreciable. There are, however, some applications, in which gains are only 1%, or even less, for example 0.2% with 3ds max.

    These gains are relatively comparable to those obtained with the increase from 512 KB to 1 MB of cache per core on the Athlon 64 X2.
    Influence of DDR2 frequency and timings
    What is the influence of DDR2 on the Core 2 Duo? Because Intel has once again improved the hardware prefetch to restrict penalties due to memory access, we can think that the impact will be reduced. This is something that we wanted to verify.

    To do so, we measured performances in four areas. First, we focused on a read bandwidth test and a latency test with ScienceMark. These results are expressed in MB/s and in number of cycles, respectively. Then, two application tests complete the above results with WinRAR and Far Cry, which are especially dependent on memory subsystem speed.

    With a FSB of 1066, the theoretical maximum bus bandwidth is 8,533 MB/s. Even if these values aren't reached here in practice, it's certain that this is restrictive for memories such as DDR2-1066, which offers up to 8.5 GB/s per channel, or about 17 GB/s in dual channel. This doesn't stop DDR2-1066 from bringing a bandwidth gain. The gap is 12% compared to the poorest setting, DDR2-533 in 4-4-4-12.
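
    The theoretical bus bandwidth quoted above can be recomputed as follows. This is a sketch under the assumption of a 64 bit wide data bus, both for the FSB and for one DDR2 channel; the function and constant names are ours.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak bandwidth: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * bus_width_bits / 8


FSB1066_MT_S = 4 * 800 / 3     # 266.67 MHz clock, four transfers per cycle
DDR2_1066_MT_S = 2 * 1600 / 3  # 533.33 MHz clock, two transfers per cycle

print(round(peak_bandwidth_mb_s(FSB1066_MT_S)))    # 8533
print(round(peak_bandwidth_mb_s(DDR2_1066_MT_S)))  # 8533, for one channel
```

    A single channel of DDR2-1066 is therefore already enough to saturate a FSB1066 in theory.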

    For latency, there was a rather significant gap between the DDR2-1067 and other types of memory. One possibility is that this is due to the asynchronous FSB frequency and memory bus in DDR2-667 and DDR2-800, whereas in DDR2-1067, the memory bus works exactly at two times the FSB frequency.

    We move now to more "practical" results. We begin with compression time in WinRAR. As you can see, DDR2-1067 in CL5 is 15% faster than DDR2-533 in CL4. DDR2-667 CL4, DDR2-800 CL5 and DDR2-533 CL3 are quite close.

    With Far Cry, the gap is smaller because the DDR2-1067's gain is only 8.9%. Here again the performances of the DDR2-667 CL4, DDR2-800 CL5 and DDR2-533 CL3 trio are very close.

    What about the behaviour of the Core 2 Duo compared to DDR2 and AMD? For this test we looked at the impact of timings and frequency on the two platforms. We indicated the percentages of performances reached compared to the best adjustment:

    The results speak for themselves. With AM2, DDR2-533 CL4 is only at 83 and 87% of the performance level of DDR2-800 CL4, whereas with the Core 2 Duo results increase to 91% and 84%. The impact of a slower memory is half as significant for the Core 2 Duo as for the Athlon 64 X2.
    Influence of FSB
    We also wanted to know what was the influence of FSB on the performance of the Core 2 Duo. To do so, we made a series of tests always at 2.4 GHz but with the following two configurations: 9x266 and 7x342 MHz.

    The first thing to notice is memory bandwidth that strongly increases and benefits more from the DDR frequency increase. FSB restricted the latter in the previous test.

    However, in FSB1370, DDR2-1026 CL5 is finally at a comparable level to DDR2-684 in CL3, because of the combined action of frequency and a little help coming from synchronicity.

    In FSB1600, there are not many choices possible: DDR2-600, 800, 1000, 1066, 1200, etc. ...but over DDR2-800, strangely the motherboard no longer boots. The fastest adjustment possible was DDR2-800 in 4-4-4 and performances obtained are better than DDR2-855 4-4-4 in FSB1370 because of synchronicity.

    In the end, FSB doesn't have a big influence on performances and there is not much to gain by reducing the coefficient for its benefit. Rather than aiming for the highest possible FSB and DDR speed, for optimum performances it is better to look for the best settings combining a high FSB, aggressive timings and DDR running at 1x or 2x the FSB speed. Either way, except for a few settings, performances were relatively close.
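
    To check whether a given setting is synchronous, the memory clock can be compared to the FSB clock. The sketch below is only an illustration: the function names and the 1% tolerance (needed because ratings such as DDR2-684 are rounded) are our own choices.

```python
def memory_to_fsb_ratio(ddr_rating, fsb_rating):
    """Ratio between the real memory clock and the real FSB clock.

    DDR ratings count two transfers per cycle, FSB ratings count four.
    """
    return (ddr_rating / 2) / (fsb_rating / 4)


def is_synchronous(ddr_rating, fsb_rating, tolerance=0.01):
    """True when the memory clock is a whole multiple (1x, 2x...) of the FSB clock."""
    ratio = memory_to_fsb_ratio(ddr_rating, fsb_rating)
    nearest = round(ratio)
    return nearest >= 1 and abs(ratio - nearest) <= tolerance


for ddr, fsb in ((800, 1600), (684, 1370), (855, 1370), (1066, 1066), (800, 1066)):
    print(f"DDR2-{ddr} on FSB{fsb}:", is_synchronous(ddr, fsb))
```

    With these settings, DDR2-800 on FSB1600, DDR2-684 on FSB1370 and DDR2-1066 on FSB1066 come out as synchronous, while DDR2-855 on FSB1370 and DDR2-800 on FSB1066 do not.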

    Page 11
    Windows x64 & EM64T, the test

    Windows x64 & EM64T
    Introduced by AMD in 2003, the AMD64 ISA took a long time to work its way onto the desktop computer market. It's a 64 bit extension of the x86 instruction set. So, general registers, the small memory areas which temporarily store memory addresses and whole numbers, are increased from 32 to 64 bits.

    In early 2005 Intel released a comparable and compatible function, EM64T, but it was only available for Netburst and not Mobile. With Core, EM64T is extended to all platforms.

    Processing 64 bit data isn't an innovation by itself. Since its introduction, x87, which is in charge of floating point calculations, has gone up to 80 bits internally. Also, some MMX/SSE/SSE2 instructions give us the possibility of working with 64 bit whole numbers. However, the use of this type of data is now generalised to all data stored in the GPRs, and this brings two advantages:
  • An acceleration of calculations with whole numbers. Indeed, for applications requiring calculations with very large whole numbers (the limit is 4.29e9 in 32 bits, and reaches 1.84e19 in 64 bits), encoding the whole number in 64 bits makes it possible for the processor to manipulate it more easily and faster, without having to double the number of registers and clock cycles required for calculations. This should only concern very specific applications such as data encryption or scientific calculations.

  • Storing addresses in 64 bits makes it possible to exceed the 4 GB limitation due to 32 bit binary encoding and raises it to 256 terabytes, because of a "limitation" at 48 bits for virtual memory addressing. We note however that Intel had already exceeded this 4 GB limitation with the Xeon to reach 64 GB, even if this mode has limitations. Here again, this won't really be useful for most users.

    In fact, the main benefit of EM64T, like AMD64, is the number of registers. Indeed, in x86, processors have eight 80 bit x87 registers, eight general 32 bit registers and eight 128 bit SSE registers. In 64 bit mode, the number of general and SSE registers is doubled to sixteen each. Increasing the number of available registers makes it possible to restrict the number of instructions intended to free registers and copy them to memory, and in consequence to increase performances.
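
    The limits quoted above all derive from powers of two, as this short sketch shows. The function names are ours, and the 36 bit figure used for the Xeon's 64 GB mode is an assumption on our part.

```python
GIB = 2**30  # bytes in one gigabyte (binary)
TIB = 2**40  # bytes in one terabyte (binary)


def max_unsigned(bits):
    """Largest whole number that fits in the given number of bits."""
    return 2**bits - 1


def addressable_bytes(bits):
    """Number of bytes that can be addressed with the given address width."""
    return 2**bits


print(f"{max_unsigned(32):.2e}")     # 4.29e+09
print(f"{max_unsigned(64):.2e}")     # 1.84e+19
print(addressable_bytes(32) // GIB)  # 4   (GB, 32 bit addresses)
print(addressable_bytes(48) // TIB)  # 256 (TB, 48 bit virtual addresses)
print(addressable_bytes(36) // GIB)  # 64  (GB, assuming 36 bit addresses)
```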

    Finally, the release of EM64T / AMD64 creates a break with the sacrosanct x86 compatibility. Many executables are still compiled with the x86 instruction set as it was with the 386. There have been improvements since, but they aren't necessarily used by developers during compilation. From now on, these improvements will be automatically included.

    What are the performance gains in practice? To find out, we installed Windows XP x64 on a Core 2 Duo E6600, a Pentium D 950 and an Athlon FX-60, and tested three applications in 32 and 64 bits: Mathematica 5.2 (scientific calculations), CineBench 9.5 (3d rendering) and Far Cry (game).

    With Mathematica, performances were very variable since the gain was 2.7% for Core, 8.6% for Netburst and the speed of the K8 was reduced by 2.9%. Cinebench provided better results with AMD with a performance gain of 11.5% as compared to 8.6% with the Pentium D and 4.6% for the Core 2 Duo. It was finally with the Pentium D that Far Cry benefited the most from the 64 bits with 6.5% improvements, as compared to 3.2% for the Athlon 64 FX and 0.3% (!) for the Core.

    Performances were very variable from one processor to another and we even observed a performance drop in one case. Overall (and ironically), the Pentium 4 provided the most homogeneous performance gains. This could be explained by the presence of the Trace Cache that stores instructions as they are decoded.

    On the contrary, with more classic architectures such as Core or K8, instructions are stored before decoding in the L1I, and AMD64 and EM64T have a negative impact on decoding performances, because these instructions are coded on more bytes than standard x86 instructions. This increases the decoding load. It also seems that with the Core 2 Duo many cases of fusion aren't activated in 64 bits.

    Of course, this load on the decoders is generally compensated by the reduction in the total number of instructions in 64 bits. However, in the end, performance gains can change from one architecture to another. In fact, Core seems to benefit less than other architectures from 64 bits. This isn't really that dramatic given its 32 bit performances and the overall low performance gains obtained with 64 bits regardless of the architecture.
    Test configurations
    After specific tests, we put the Core 2 Duo through our usual test suite. For all DDR2 configurations we used DDR2-667 4-4-4-12, as well as DDR2-800 4-4-4-12 for the highest end AM2 and Core 2 Duo solutions.

    We used the following configurations:

    Common :
    - ATI Radeon X850 XT PE
    - 2 x Raptor 74 GB
    - Windows XP SP2 French

    Intel Socket 775 Core 2 Duo :
    - ASUSTeK P5W DH (i975X) motherboard
    - 2 x 512 MB DDR2-667 4-4-4
    - 2 x 512 MB DDR2-800 4-4-4

    Intel Socket 775 :
    - ASUSTeK P5WD2-E motherboard (i975X)
    - 2 x 512 MB DDR2-667 4-4-4

    Intel Socket mPGA479 :
    - Gigabyte GA-I8I945GTMF-YRH motherboard
    - 2 x 512 MB DDR2-667 4-4-4

    AMD Socket AM2 :
    - ASUS M2N32-SLI Deluxe motherboard
    - 2 x 512 MB DDR2-667 4-4-4
    - 2 x 512 MB DDR2-800 4-4-4

    AMD Socket 939 :
    - ASUS A8N SLI Premium motherboard
    - 2 x 512 MB DDR-400 2-2-2

    Page 12
    3ds Max & Maya

    3d Studio Max 7
    For 3D-studio max we used a rendering via the 3Ds internal engine (scanline). Developed by Studio PC this scene mainly uses radiosity. The result is more realistic in terms of lighting and is also slower. 80% of this scene is based on this type of effect.

    The Core 2 Duo already shows its power because, from the E6400 up, it is at the same level as the fastest Pentium and Athlon 64 FX. In the end, the most efficient Core 2 Duo configuration is 38% faster than AMD and Intel's existing solutions!

    Compared to the Core Duo, the Core 2 Duo brings a performance gain of 8-10% at equivalent frequencies.
    Maya 6
    We used a scene developed by Yann Dupont of 3DVF (whom we thank for its use) rendered via Mental Ray.

    As Hyperthreading has a negative impact on performances with Maya, for Pentiums the 960 is in the lead. The Core 2 Duo is much faster and is in front starting with the E6300 at only 1.86 GHz. The resistance is much stronger than under 3ds for AMD because you need a E6700 to reach the performances of a FX-62…but the Core 2 Duo is announced as being half the price of this FX!

    The X6800's advantage is 71% compared to the Pentium EE 965 and 15.4% compared to the FX-62.

    Page 13
    Mathematica & WinRAR

    Mathematica 5.2
    The following tests are scientific calculation programs, starting with Mathematica 5 from Wolfram Research and the test suite developed by Stefan Steinhaus.

    The Core 2 Duo clearly provides "new generation" performances. A simple E6400 is enough to reach the performance level of a FX-62 and even the E6300 is much faster than the PEE 965. So, the X6800 is in the end 36% faster than its AMD counterpart and 74% faster than its Netburst-based ancestor. Compared to the Core Duo, gains are approximately 14% at equivalent frequencies.
    WinRAR 3.51
    A total of 588 MB of data, made up of 493 Word and Excel files (69 MB), 22 Eudora e-mail boxes (251 MB) and one audio file in wav format (268 MB), was compressed using the most advanced option in WinRAR 3.51.

    We are starting to repeat ourselves, but once more the Core 2 Duo provides excellent performances. The E6400 has comparable results to the FX-62 / PEE 965. The Allendale is 18% faster than the Yonah at equivalent frequencies and in the end the X6800 is 38% faster than Intel's previous high end product and 33.6% faster than AMD's.

    Page 14
    TMPGEnc & Vdub / DiVX 6

    TMPGEnc 3.3 Xpress
    With TMPGEnc, we encoded a 10 minute 16 second DV file to MPEG-2 format in 720x576 with an average bitrate of 4500 Kbits/s and in two passes. The video preview display is activated during this test and the DV file is decoded via a Mainconcept codec, which is faster than the decoder in TMPGEnc.

    For the first time, the advantage of the Core 2 Duo over the processor based on Netburst architecture is much less obvious. The Pentium EE 965 reaches performances between the E6700 and X6800. This is honourable and the advantage of the X6800 over its predecessor is "only" 6.2%. Compared to AMD, the gap is still as big as before (29.9% compared to the FX-62), the latter providing performances between the E6400 and E6600. Core architecture is 25% more efficient than Mobile, proof that the improvements brought to SSE are bearing fruit.
    VirtualDub 6.11 + DiVX 6
    We compressed the same video as in TMPGEnc in Fast recompress mode, via the DiVX 6.1 codec, in one pass with an average bitrate of 1500 Kbits/s, b-frames and the best quality encoding setting. The video preview display is activated during this test.

    Here again, the gain compared to Mobile is 25% at 2.1x GHz and even 30% at 1.8 GHz. Performances reached their summit with a X6800 being 50% faster than a Pentium EE 965 and 37% than a FX-62. The EE 965 is only slightly above a simple Core 2 Duo E6300 whereas the E6400 is at an equivalent level to the FX-62.

    Page 15
    Far Cry & Pacific Fighters

    Far Cry
    Here are results with Far Cry. The scene used was outdoors in map training.

    The E6300 provides better results than a Pentium EE 965, whereas the FX-62 is only barely faster than the E6400. The X6800 is much faster than its opponents: it is respectively 67% faster than Intel's previous high end solution and 31.7% faster than AMD's high end solution.
    Pacific Fighters

    With Pacific Fighters the E6300 is once again faster than the Pentium EE 965 whereas the FX-62 is caught between the E6300 and E6400. The X6800 is again much faster than its competitor: 74.6% higher than a EE 965 and 54.8% higher than a FX-62 !

    Page 16

    With Core, Intel released an architecture that is the opposite of what Netburst was at its release. Where Netburst called several principles into question and was really innovative (sometimes rightfully so and other times not), Core is a sort of melting pot that uses the best existing technologies and improves upon them. So while Netburst didn't really bring improvements when it was released, Core is fully operational from its release.

    Because of the results obtained in practice, we would be inclined to say that Intel is right for the short and middle term. Indeed, the Core 2 Duo is an exceptional processor! For example, a simple E6400 at $224 has a level of performances comparable to an Athlon 64 FX-62 at $1031, with a lower power consumption than an Athlon 64 X2 3800+ and with a comfortable Overclocking margin. From the E6600, AMD can no longer compete in terms of performances except in Maya where the FX-62 provides better results.

    What can AMD do to stop this Core 2 Duo advance? As the next architecture from the father of the Athlon won't be released before the beginning of next year, there is only one solution: prices! There will be a first wave of price cuts on July 24. It won't be enough, because it is the 4200+ that will have to compete with the E6400 ($240 vs $224) and the 4600+ with the E6600 ($301 vs $316). However, we can't imagine AMD reducing its prices by much, all the more so as they are restricted to 90nm.

    Intel Kentsfield
    Now we still have to find out what the margin of evolution for Core is. For frequency, the Core 2 Duo stepping 5 makes it easily possible to reach over 3.4 GHz or even 3.6 GHz with air cooling. But because of the short pipeline it will quickly reach a limit. The other solution, because dissipation is reduced for these CPUs, is to increase the number of cores. The Kentsfield, which is based on two Conroe dies integrated in the same package, will be released in early 2007. Test samples are already available and fully functional. But what can we think of a Quad Core when we are still waiting for software that really benefits from dual core? It is certainly one of the reasons why Intel is waiting for 2007, but it isn't certain that the situation will be more favourable by then.

    Anyhow, Intel isn't counting on Core for the long term because it has already announced a new architecture for 2008, Nehalem, and another for 2010, Gesher. In the meantime, it is Core that will be available everywhere at the end of July and, looking at performances, it would be foolish not to take advantage of it!

    Copyright © 1997-2015 BeHardware. All rights reserved.