What’s new?
Nehalem is a “monolithic” quad-core architecture, meaning that it does not result from the fusion of two dual-core dies. Nehalem also introduces the notion of the “uncore” to Intel’s products, a term designating any part of the processor that isn’t directly part of the instruction processing engine. Unlike the Core 2, which was based on a single clock distribution (the entire processor runs from a single clock), Nehalem uses a more complex clock distribution. Each core can thus run at its own frequency, and the same goes for the “uncore” part of the processor as a whole.
Some new and some old
Nehalem’s cores benefit from SMT (Simultaneous Multi-Threading), a technology that appeared at Intel with the Pentium 4 under the name Hyper-Threading (the commercial name of SMT on Netburst) and that is also found on the first generations of Atom processors.
SMT is a technique that aims to make it easier for a single execution core to handle several threads. Without SMT, a core processes slices of the different threads it is in charge of one after another. The constant switching from one thread to the next gives the illusion that they are being executed simultaneously, but in reality time is lost in these transitions: each time, the core must save the context in which the outgoing thread was running (state of the registers and stack) and load the context of the incoming thread. The idea behind SMT is to give the core not one but two contexts at the same time, so that it can process two threads truly simultaneously. The core’s resources (caches and execution units) are shared between the two threads either statically (for example, a buffer is split into two identical halves) or dynamically (the threads compete for the resource according to their specific needs).
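The difference between static and dynamic sharing can be illustrated with a small sketch. It is only a toy model: the 32-entry buffer, the function names and the round-robin grant policy are assumptions made for the example, not a description of how Nehalem actually arbitrates its resources.

```python
def static_share(capacity, demands):
    """Static sharing: each thread owns a fixed, equal slice of the buffer,
    whatever the other thread needs."""
    slice_size = capacity // len(demands)
    return [min(d, slice_size) for d in demands]


def dynamic_share(capacity, demands):
    """Dynamic sharing: threads compete for entries. Entries are granted one
    at a time, round-robin (an assumed policy), until the buffer is full or
    every thread has what it asked for."""
    granted = [0] * len(demands)
    free = capacity
    while free > 0 and any(g < d for g, d in zip(granted, demands)):
        for t, d in enumerate(demands):
            if free > 0 and granted[t] < d:
                granted[t] += 1
                free -= 1
    return granted


# Hypothetical 32-entry buffer; thread 0 wants 30 entries, thread 1 wants 4.
print(static_share(32, [30, 4]))    # [16, 4]  -> 12 entries sit unused
print(dynamic_share(32, [30, 4]))   # [28, 4]  -> the buffer is fully used
print(dynamic_share(32, [30, 30]))  # [16, 16] -> same as static under equal load
```

In this model, static sharing guarantees each thread its half but wastes entries when the demands are unbalanced, while dynamic sharing uses the whole buffer at the cost of letting one thread crowd out the other.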
Besides the time saved on context switches, SMT’s performance gain comes from better use of the core’s execution units. The instruction streams of the two threads are independent of each other, which notably benefits the out-of-order (OOO) execution engine, one of whose limiting factors is the interdependency of instructions. The chances of filling the execution pipeline efficiently are thus greater, and in the end the efficiency of the execution core increases. In an in-order execution engine, which cannot re-order instructions (as is the case with the Atom), speed depends directly on how much successive instructions depend on each other. In that situation, SMT can come close to doubling performance. Last but not least, adding SMT to a core is cheap compared with the performance it brings as soon as the processor handles more than one thread.
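To make the in-order case concrete, here is a minimal sketch of a toy core. It is a deliberately simplified model resting on stated assumptions: a single issue slot per cycle, every instruction depending on the previous instruction of its own thread, and a fixed result latency. The function name and parameters are hypothetical and do not model the Atom or Nehalem pipelines.

```python
def cycles_to_finish(num_threads, instrs_per_thread, latency):
    """Toy in-order, single-issue core.

    Each instruction depends on the previous instruction of the same thread,
    whose result becomes available `latency` cycles after it issues.
    Returns the cycle at which the last result is available.
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads  # earliest cycle each thread may issue again
    cycle = 0
    last_done = 0
    while any(remaining):
        for t in range(num_threads):
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                last_done = cycle + latency
                break  # only one issue slot per cycle
        cycle += 1
    return last_done


# One thread: the issue slot is idle every other cycle while waiting on results.
print(cycles_to_finish(1, 100, latency=2))  # 200 cycles for 100 instructions
# Two contexts: the second thread fills the idle cycles of the first.
print(cycles_to_finish(2, 100, latency=2))  # 201 cycles for 200 instructions
```

With a latency of 2, the second context lets the core complete twice the work in almost the same time, which is the "practically double" case. With a latency of 1 the single thread already keeps the issue slot busy, and the same model shows no gain at all.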
As the operating system only handles threads, it sees the two contexts as two distinct logical processors, just as it would see two cores. Nehalem’s four cores thus appear in the Windows Task Manager as 8 logical processors.
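The count seen by the operating system is simply the number of cores multiplied by the number of contexts per core. The short sketch below shows the arithmetic; the helper `logical_processors` is a hypothetical name introduced for the example, while `os.cpu_count()` is the standard Python call that reports the number of logical CPUs on the machine running the script.

```python
import os


def logical_processors(cores, contexts_per_core):
    """Number of logical processors the OS sees (hypothetical helper)."""
    return cores * contexts_per_core


# Four Nehalem cores with two SMT contexts each.
print(logical_processors(4, 2))  # 8

# What the OS reports on the machine running this script
# (depends on the hardware; may be None if it cannot be determined).
print(os.cpu_count())
```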
All of this sounds quite advantageous, but SMT is not free of drawbacks.
In the first place, the whole point of SMT is to increase the efficiency of an execution core, so the possible gain is all the greater when the starting efficiency of the architecture is low. Netburst suffers from a very long pipeline (20, then around 30 stages) that is difficult to keep optimally filled. This is why the architecture benefits from SMT, with gains reaching 40% in certain applications on the Pentium 4 Prescott. The effect is even more favorable on the Atom, whose in-order execution engine is heavily penalized by dependencies between instructions.
Nehalem inherits its execution engine from the Core 2, and it is legitimate to wonder what SMT can add to execution cores that are already reputed to be efficient. According to Intel, SMT allows Nehalem’s 4-wide engine (that is, one capable of processing up to four instructions simultaneously) to make full use of its width. This can be seen as a somewhat “horizontal” optimization, compared with the “vertical” one that came from the length of the Netburst pipeline.
The other drawback of SMT lies in the threads competing for access to the caches, the L1 in particular. Fortunately, Nehalem’s L1 caches are large enough to accommodate two threads comfortably, and in any case are better equipped in this respect than those of the Pentium 4.
So in the end, will SMT be advantageous for Nehalem? Yes, because we will see in our study that many of the improvements made to Nehalem’s core were designed with SMT in mind, in order to maximize the performance gains on the new architecture. Of course, the gain will vary from one application to another. SMT works wonders in server and database environments, or at least that is what its use on the Netburst-based Xeon showed. The benefit of SMT will of course be smaller on desktop PCs, notably for office or gaming use (and even more so on mobile platforms). For a time it was even planned that non-server versions of Nehalem would not be equipped with SMT, but Intel has reversed this decision and Bloomfield (the high-end desktop version) will have it. In any case, SMT remains optional, so it will always be possible to decide whether or not to enable it.