A new processor bus
One weakness of the Core 2 is its use of an older processor bus design. While mobile and desktop platforms have no problem with it, the same is not true for servers, where the old FSB is a bottleneck in the interconnection between sockets. In this area, the Opteron and its HyperTransport bus have had no serious competition until now. Nehalem abandons the FSB for a more modern interconnect called QPI (QuickPath Interconnect). This new point-to-point, bidirectional bus shares numerous characteristics with HyperTransport, and the principle is fundamentally the same. Like its rival, QPI is very flexible in its implementation, and systems will be able to integrate as many QPI links as their bandwidth needs require.
The QPI bus is announced with transfer rates of 4.8 to 6.4 GT/s (giga-transfers per second) and a link width of up to 20 bits. This gives a maximum speed of 6.4 x 20 / 8 = 16 GB/s in each direction, or 32 GB/s for a bidirectional link. The first implementations of QPI on Nehalem provide a link of 25.6 GB/s, or double what a classic FSB at 1600 MHz offers.
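The arithmetic above can be checked with a short Python sketch. The function name is ours, and two of the inputs are assumptions rather than figures from this article: that only 16 of the 20 bits carry payload data (which would account for the 25.6 GB/s figure), and that the FSB is 64 bits wide.

```python
def link_bandwidth_gb_s(transfer_rate_gt_s: float, width_bits: int) -> float:
    """Peak one-way bandwidth in GB/s: transfers per second times bytes per transfer."""
    return transfer_rate_gt_s * width_bits / 8


# Figures quoted in the text: 6.4 GT/s over a 20-bit link.
one_way = link_bandwidth_gb_s(6.4, 20)
both_ways = 2 * one_way

# Assumption (not stated in the article): 16 of the 20 bits carry data.
payload_both_ways = 2 * link_bandwidth_gb_s(6.4, 16)

# Assumption: a classic FSB at 1600 MT/s with a 64-bit data bus.
fsb = link_bandwidth_gb_s(1.6, 64)

print(f"raw, one direction:    {one_way:.1f} GB/s")            # 16.0
print(f"raw, both directions:  {both_ways:.1f} GB/s")          # 32.0
print(f"payload, both ways:    {payload_both_ways:.1f} GB/s")  # 25.6
print(f"FSB 1600:              {fsb:.1f} GB/s")                # 12.8
```

Under these assumptions the payload figure comes out at exactly twice the FSB figure, which agrees with the comparison made in the text.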

The QPI links, in blue on the diagram, connect the processors to each other and each processor to an IOH (an input/output hub that serves, for example, as an interface with the PCI-Express bus). In this example, each processor can handle four QPI links. On a single-processor machine, only one QPI link between the processor and the IOH (in this case an X58) is of course needed.
Improvements to the core
Compared to the Core 2, many of Nehalem's improvements were motivated by SMT support and, more generally, by the new memory hierarchy (the three cache levels and the increase in available memory bandwidth). The same goes for the processing cores, all along the pipeline, whose stages were each improved to a greater or lesser degree over those of the Core 2.
Branch prediction
Let's start with branch prediction, one of the mechanisms that, as we saw in our look at the Core 2, has the most significant influence on the performance of the processing pipeline. The goal of branch prediction is to avoid breaks in the flow of code, as these stall the pipeline and thus lower throughput. Nehalem inherits the mechanisms already found on the Core 2: a loop detector and the handling of direct and indirect branches. In addition, the new architecture integrates a second BTB (Branch Target Buffer), an address buffer whose role is to store a history of the destination addresses that were actually taken. While the first BTB is devoted to “local” addresses, the second is meant for more distant addresses of the kind found in heavier applications (such as database management).
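To make the idea of two target buffers more concrete, here is a toy Python model. It is only a sketch: the article gives neither the sizes nor the replacement policy of Nehalem's BTBs, so the class name, the entry counts, the LRU eviction and the promotion from the second level to the first are all assumptions made for illustration.

```python
from collections import OrderedDict


class TwoLevelBTB:
    """Toy two-level branch target buffer (sizes and policy are illustrative)."""

    def __init__(self, l1_entries: int = 4, l2_entries: int = 16):
        self.l1 = OrderedDict()  # small buffer: most recently taken branches
        self.l2 = OrderedDict()  # larger buffer: branches pushed out of l1
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries

    def predict(self, branch_addr):
        """Return the predicted target, or None if neither level knows the branch."""
        if branch_addr in self.l1:
            self.l1.move_to_end(branch_addr)
            return self.l1[branch_addr]
        if branch_addr in self.l2:
            # Hit in the second level: promote the entry to the first level.
            target = self.l2.pop(branch_addr)
            self._insert_l1(branch_addr, target)
            return target
        return None

    def update(self, branch_addr, target):
        """Record the destination of a branch that was actually taken."""
        self.l2.pop(branch_addr, None)
        self._insert_l1(branch_addr, target)

    def _insert_l1(self, addr, target):
        self.l1[addr] = target
        self.l1.move_to_end(addr)
        if len(self.l1) > self.l1_entries:
            # The least recently used l1 entry falls back to the second level.
            old_addr, old_target = self.l1.popitem(last=False)
            self.l2[old_addr] = old_target
            if len(self.l2) > self.l2_entries:
                self.l2.popitem(last=False)


btb = TwoLevelBTB(l1_entries=2, l2_entries=4)
for addr, target in [(0x10, 0x100), (0x20, 0x200), (0x30, 0x300)]:
    btb.update(addr, target)

print(btb.predict(0x10) == 0x100)  # found in the second level
print(btb.predict(0x99))           # unknown branch: no prediction
```

With a code footprint larger than the first buffer, the second level still remembers targets that would otherwise have been lost, which is the benefit the text attributes to the extra BTB in heavier applications.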
In addition to this, Intel has added a new mechanism called the RSB (Return Stack Buffer), which relies on storing return addresses (rather than destination addresses, as the BTB does). Note that each thread has its own RSB, to avoid any conflict in the management of this buffer when SMT is active.
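The principle of a per-thread return stack can be sketched in a few lines of Python. This is a simplified model, not Intel's implementation: the class name, the depth of 16 entries and the overflow behaviour are assumptions.

```python
class ReturnStackBuffer:
    """Toy fixed-depth return stack; the depth of 16 is an illustrative assumption."""

    def __init__(self, depth: int = 16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr: int) -> None:
        """A CALL pushes the address of the instruction that follows it."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: the oldest entry is lost
        self.stack.append(return_addr)

    def on_return(self):
        """A RET pops the predicted return address (None if the stack is empty)."""
        return self.stack.pop() if self.stack else None


# One buffer per hardware thread, so interleaved calls from two
# SMT threads cannot disturb each other's predictions.
rsb = {0: ReturnStackBuffer(), 1: ReturnStackBuffer()}

rsb[0].on_call(0x1000)
rsb[1].on_call(0x2000)
rsb[0].on_call(0x1100)

print(hex(rsb[0].on_return()))  # 0x1100
print(hex(rsb[1].on_return()))  # 0x2000
print(hex(rsb[0].on_return()))  # 0x1000
```

Had both threads shared a single stack in this example, the first return of thread 0 would still have been correct, but thread 1's call would then sit between thread 0's two entries and the following predictions would be mixed up.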
Fusion
The instruction decoding stage was also revised. You may recall that this phase consists of transforming x86 instructions into elementary micro-operations that the rest of the processing pipeline can understand. Nehalem keeps the four decoders already found on the Core 2 but improves certain mechanisms introduced by its predecessor. Macro-fusion, one of the innovations of the Core 2 architecture, consists of detecting predefined pairs of x86 instructions such as “compare + jump” (CMP + JCC) and transforming them into a single micro-operation. The technique both increases decoding capacity and reduces the number of micro-operations generated, all the more so when these instruction pairs appear frequently. Nehalem adds new instruction pairs capable of “macro-fusing” and, above all, enables macro-fusion in 64-bit mode (which is unfortunately not the case with the Core 2, which therefore does not realize its full potential when running a 64-bit operating system).
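The effect of macro-fusion on the number of micro-operations can be illustrated with a toy decoder in Python. It is a deliberately simplified sketch: every instruction is assumed to decode into exactly one micro-operation, the set of fusible instructions is a hypothetical stand-in for the real rules, and the conditional-jump test is only a naive check on the mnemonic.

```python
# Hypothetical set of instructions that may fuse with a following
# conditional jump; the real fusion rules are more detailed.
FUSIBLE_FIRST = {"CMP", "TEST"}


def is_conditional_jump(mnemonic: str) -> bool:
    """Naive check: any J-prefixed mnemonic except the unconditional JMP."""
    return mnemonic.startswith("J") and mnemonic != "JMP"


def decode(instructions, macro_fusion=True):
    """Return the micro-ops produced, assuming one micro-op per instruction."""
    uops = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (macro_fusion and cur in FUSIBLE_FIRST
                and nxt is not None and is_conditional_jump(nxt)):
            uops.append(f"{cur}+{nxt}")  # the pair becomes a single micro-op
            i += 2
        else:
            uops.append(cur)
            i += 1
    return uops


code = ["MOV", "ADD", "CMP", "JNE", "TEST", "JZ", "MOV"]
print(len(decode(code, macro_fusion=False)))  # 7 micro-ops without fusion
print(len(decode(code, macro_fusion=True)))   # 5 micro-ops with fusion
print(decode(code))
```

In this model, calling `decode(code, macro_fusion=False)` plays the role of a processor on which fusion is unavailable, as is the case for the Core 2 in 64-bit mode according to the text.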