Intel Core 2 Duo - Test
by Franck Delattre and Marc Prieur
Published on July 4, 2006

Branching
After memory accesses, branching is the second most important factor in slowing a processor down, since every mispredicted branch stalls the pipeline.

In an instruction flow, a branch is a jump to a new address in the code. Two types of branches exist (both are illustrated in the sketch just after this list):
  • direct branches, for which the jump address is explicitly given in the code as an operand. The destination address is resolved at compile time. Most direct branches are loop jumps.
  • indirect branches, which jump to an address that changes dynamically during program execution, so there are multiple possible destinations. They are found in "switch / case" constructs and are widely used in object-oriented languages in the form of function pointers.
Whether direct or indirect, branches are an obstacle to the optimal operation of a pipelined processor. When a jump instruction enters the pipeline, in theory no new instruction can follow it until the destination address has been computed, i.e. until the jump instruction reaches the last pipeline stages. The pipeline fills with bubbles, which severely reduces its efficiency. The goal of the branch predictor is to guess the destination address of the jump so that the corresponding instructions can be loaded without waiting.
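To make the two categories concrete, here is a minimal C sketch (our own illustration, not from Intel's documentation): the loop's backward jump is a direct branch with a compile-time target, while the calls through the function-pointer table are indirect branches whose destination is only known at run time.

#include <stdio.h>

/* Direct branch: the loop's backward jump targets a fixed address,
   resolved when the program is compiled. */
static int sum_to(int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)   /* compiles to a compare + conditional jump */
        s += i;
    return s;
}

/* Indirect branch: the call target is read from memory at run time,
   as with function pointers or the jump table of a large switch/case. */
typedef int (*op_fn)(int, int);
static int add_op(int a, int b) { return a + b; }
static int sub_op(int a, int b) { return a - b; }

int main(void)
{
    op_fn ops[2] = { add_op, sub_op };
    printf("%d\n", sum_to(10));    /* 45 */
    printf("%d\n", ops[0](7, 3));  /* indirect call -> 10 */
    printf("%d\n", ops[1](7, 3));  /* indirect call -> 4 */
    return 0;
}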

Several kinds of predictors exist. The simplest and oldest is the static predictor, which relies on a fixed rule: the branch is assumed to always be taken (or, conversely, never taken). In a loop, the static mechanism thus correctly predicts every jump except the last one! Its success rate obviously depends on the number of iterations.

The static predictor reaches its limits in "if...then...else" situations, where it has a 50% chance of being wrong. In such cases the processor resorts to dynamic prediction, which consists of storing a history of branch outcomes in a table (the BHT: branch history table). When a branch is encountered, the BHT stores the outcome of the jump, and if the branch is taken, the destination address is stored in a dedicated buffer, the BTB (branch target buffer). (If the branch isn't taken, the address isn't stored, because the destination is simply the instruction following the branch.) A processor contains two types of dynamic predictors, distinguished by the span of branch history they store, in order to increase the granularity of the prediction mechanism.
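As an illustration of dynamic prediction, here is a minimal sketch of a BHT built from 2-bit saturating counters, the classic textbook scheme; the table size and indexing are our own assumptions, not Core's actual parameters.

#include <stdio.h>

#define BHT_SIZE 1024  /* illustrative size, not Core's actual table */

/* One 2-bit saturating counter per entry:
   0-1 predict "not taken", 2-3 predict "taken". */
static unsigned char bht[BHT_SIZE];

static int predict(unsigned addr)
{
    return bht[addr % BHT_SIZE] >= 2;
}

static void update(unsigned addr, int taken)
{
    unsigned char *c = &bht[addr % BHT_SIZE];
    if (taken && *c < 3)  (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void)
{
    /* Simulate the branch of a 10-iteration loop at address 0x40:
       taken 9 times, then not taken on exit. */
    int hits = 0;
    for (int i = 0; i < 10; i++) {
        int taken = (i < 9);
        hits += (predict(0x40) == taken);
        update(0x40, taken);
    }
    printf("correct predictions: %d/10\n", hits);  /* 7/10: warm-up and exit misses */
    return 0;
}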

The combined action of the dynamic and static predictors achieves, depending on the size of the storage buffer, a success rate of 95 to 97% on direct branches. Efficiency falls to 75% on indirect branches, which, because of their multiple possible destinations, are ill-suited to the BHT's binary "taken / not taken" information. Mobile inaugurated a prediction mechanism dedicated to indirect branches: the predictor stores in the BTB the different addresses at which the branch ends up, together with the context that led to each destination (meaning the conditions surrounding the jump). The predictor's decision is no longer restricted to a single address per branch, but instead selects among a series of "preferred" destinations of the indirect branch. This method gives good results but is demanding in terms of resources, as the BTB must hold several addresses per branch.
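The following sketch shows the principle of such a context-aware indirect predictor: the predicted target is selected by the recent branch history rather than by the branch address alone. The structure names, the 4-bit history, and the one-target-per-context layout are purely illustrative assumptions.

#include <stdio.h>

#define CTX_BITS 4
#define CTX_SIZE (1 << CTX_BITS)

/* One predicted target per recent-history pattern, instead of a single
   target for the whole branch. All sizes are illustrative. */
typedef struct {
    unsigned history;           /* last CTX_BITS taken/not-taken outcomes */
    void    *target[CTX_SIZE];  /* learned destination for each context */
} indirect_entry;

static void *predict_target(const indirect_entry *e)
{
    return e->target[e->history];  /* NULL means: no prediction yet */
}

static void learn_target(indirect_entry *e, void *actual)
{
    e->target[e->history] = actual;
}

static void note_outcome(indirect_entry *e, int taken)
{
    e->history = ((e->history << 1) | (taken & 1)) & (CTX_SIZE - 1);
}

int main(void)
{
    indirect_entry e = {0};
    int a, b;  /* stand-ins for two code destinations */

    note_outcome(&e, 1);  /* some preceding branch outcomes... */
    note_outcome(&e, 0);
    learn_target(&e, &a); /* in this context the branch went to 'a' */
    printf("context 1 predicts: %p\n", predict_target(&e));

    note_outcome(&e, 1);  /* the context changes... */
    learn_target(&e, &b); /* ...and maps to a different target */
    printf("context 2 predicts: %p\n", predict_target(&e));
    return 0;
}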

Mobile also introduced an innovative technique called the "loop detector". This detector scrutinizes branches, looking for the typical behavior of a loop: all jumps taken except the last one (or the opposite, depending on the exit condition). When such a loop is detected, a set of counters is assigned to the branch in question, ensuring a success rate of 100%.
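A loop detector can be sketched as follows: once a branch is seen taken N times and then not taken, an iteration counter reproduces that pattern exactly. The trip-count learning rule here is our simplification of the idea, not Intel's implementation.

#include <stdio.h>

/* Once a branch exhibits the pattern "taken N times, then not taken",
   prediction switches from history bits to an iteration counter. */
typedef struct {
    int trip_count;  /* learned N: taken iterations per loop execution */
    int counter;     /* position within the current execution */
} loop_entry;

static int loop_predict(const loop_entry *e)
{
    return e->counter < e->trip_count;  /* taken until the trip count */
}

static void loop_update(loop_entry *e, int taken)
{
    if (taken) {
        e->counter++;
    } else {
        e->trip_count = e->counter;  /* (re)learn N on loop exit */
        e->counter = 0;
    }
}

int main(void)
{
    loop_entry e = {0, 0};
    int i, hits = 0;

    /* A first execution of a 5-iteration loop trains the detector... */
    for (i = 0; i < 6; i++)
        loop_update(&e, i < 5);

    /* ...so the second execution is predicted perfectly. */
    for (i = 0; i < 6; i++) {
        int taken = (i < 5);
        hits += (loop_predict(&e) == taken);
        loop_update(&e, taken);
    }
    printf("second run: %d/6 correct\n", hits);  /* 6/6 */
    return 0;
}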

Of course, Core benefits from all of these improvements, in addition to several others on which we have not been able to obtain more information.

Fusion mechanisms
Core includes a number of techniques that aim to reduce the number of micro-operations generated for a given number of instructions. Processing the same task with fewer micro-operations means processing it faster (a higher IPC) while consuming less power (better performance per watt).

Initially introduced with Mobile, micro-fusion is one of these techniques. Let's see what it does with an example, the x86 instruction: add eax,[mem32].
This instruction actually performs two distinct operations, a memory read and an addition. It is decoded into two micro-operations:
load reg1,[mem32]
add reg2,reg1
This breakdown also follows the logic of the processor's organisation: the read and the addition are handled by two different units. In a standard procedure, the two micro-operations would move through the pipeline and the out-of-order engine would take care of the dependency between them.
Micro-fusion consists, in this case, of a "super" micro-operation that replaces the two previous ones:
add reg1,[mem32]
This single micro-operation goes through the pipeline; at execution, logic dedicated to its management addresses the two units concerned in parallel. The benefit of this method is that it requires fewer resources (a single internal register is now needed in this example).
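For reference, here is the kind of C source that typically produces the load-and-add instruction discussed above: with optimization enabled, compilers commonly fold the array read and the addition into a single add reg,[mem] instruction, the micro-fusion candidate (this is illustrative; actual code generation depends on the compiler and flags).

#include <stdio.h>

/* Summing an array: the loop body is commonly emitted as a single
   load-and-add x86 instruction, add reg,[mem]. */
static int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];  /* memory read + addition in one x86 instruction */
    return s;
}

int main(void)
{
    int a[4] = {1, 2, 3, 4};
    printf("%d\n", sum(a, 4));  /* 10 */
    return 0;
}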

Core adds macro-fusion to this technique. Where micro-fusion transforms two micro-operations into a single one, macro-fusion decodes two x86 instructions into a single micro-operation. It intervenes before the decoding phase, looking in the instruction queue for pairs that can be merged. For example, the instruction sequence:
cmp eax,[mem32]
jne target
is detected as such and decoded into the single following micro-operation:
cmpjne eax,[mem32],target
This micro-operation benefits from special treatment: it is handled by an improved ALU capable of processing it in a single cycle (provided the data at [mem32] is in the L1 cache).


An improved execution unit in Core is in charge of the micro-operations coming from macro-fusion.
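By way of illustration, a test-and-branch like the one below typically compiles to the cmp/jne pair described above, making it a macro-fusion candidate (again, actual code generation depends on the compiler).

#include <stdio.h>

/* A compare against memory followed by a conditional jump: this is the
   cmp + jne pair that the Core decoder can fuse into one micro-operation. */
static int find(const int *a, int n, int key)
{
    for (int i = 0; i < n; i++)
        if (a[i] == key)  /* cmp reg,[mem] + conditional jump */
            return i;
    return -1;
}

int main(void)
{
    int a[4] = {5, 8, 13, 21};
    printf("index of 13: %d\n", find(a, 4, 13));  /* 2 */
    return 0;
}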

It is rather difficult to quantify the performance gain brought by these fusion mechanisms. However, on Yonah we measured that 10% of instructions are micro-fused, which reduces the number of micro-operations to process by as much. Our rough estimate is that the simultaneous use of macro-fusion extends this proportion to more than 15%.
