GeForce 8 architecture
As we briefly explained above, this new GPU relies on a unified architecture, which means the same units process all types of elements, whether they are pixels or vertices. The objective is that no unit ever sits idle. You may have noticed that we only spoke of "pixel shaders" and their processing units. This is because these units in fact have almost everything required to process vertex shaders. Unification is therefore more a matter of extending the capabilities of current pixel shader engines than of merging pixel and vertex shaders. The shader cores of the ATI Radeon X1000 are, at the functional (but not management) level, already similar to those of a unified architecture. Changing to a unified architecture will be a natural evolution for ATI with the R600.
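To make the "no unit sits idle" argument concrete, here is a minimal sketch of a toy timing model. Everything in it is an assumption made for illustration: the function names, the unit counts (8 vertex and 24 pixel units versus a pool of 32 unified units), the workloads, and the idea that every unit completes one work item per cycle with perfect scheduling. It is not a model of any real GPU.

```python
from fractions import Fraction


def fixed_time(vertex_work, pixel_work, vertex_units, pixel_units):
    """Cycles for a fixed design: each unit type can only run its own kind of work,
    so the frame finishes when the slower of the two pools finishes."""
    return max(Fraction(vertex_work, vertex_units),
               Fraction(pixel_work, pixel_units))


def unified_time(vertex_work, pixel_work, total_units):
    """Cycles for a unified design: any unit can take any work item,
    so the total work is simply spread over all units (idealised)."""
    return Fraction(vertex_work + pixel_work, total_units)


# Hypothetical pixel-heavy workload: the vertex units of the fixed design
# finish early and then sit idle while the pixel units are still busy.
pixel_heavy_fixed = fixed_time(80, 2400, vertex_units=8, pixel_units=24)
pixel_heavy_unified = unified_time(80, 2400, total_units=32)

# Hypothetical vertex-heavy workload: this time the pixel units are the idle ones.
vertex_heavy_fixed = fixed_time(1600, 480, vertex_units=8, pixel_units=24)
vertex_heavy_unified = unified_time(1600, 480, total_units=32)

print(pixel_heavy_fixed, pixel_heavy_unified)
print(vertex_heavy_fixed, vertex_heavy_unified)
```

In this idealised model the unified pool can never be slower than the fixed split, and the gap grows as the workload drifts away from the ratio the fixed design was built for.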

For NVIDIA, the fixed architecture of the GeForce 7 isn't particularly well suited to this evolution. With the GeForce 8, they had to start from scratch. You may have heard this before, because with each new generation of GPU the "brand new architecture" is among the basic selling points. Generally, this isn't the case, but today it is. NVIDIA had to design a new architecture to replace an old one that had reached its limits.

NVIDIA chose an architecture similar to the Radeon X1000’s and decoupled the calculation and texturing units, which number 128 and 32, respectively, in the highest-end version. In light of the evolution of the last few years, the GeForce 8800 is very close to ATI's current GPUs. However, if we take a closer look, some major differences appear.
A scalar processor
The calculation units of previous GPUs worked on a certain number of pixels in parallel. This is true for both ATI and NVIDIA: 4 pixels for the GeForce 7 and 12 for the Radeon X1000. Each pixel is a vector of 4 components (RGBA or XYZW, since they aren't necessarily colors) and these 4 components are also processed in parallel. We will suppose here that the computed values are colors to make our explanation a little easier. With each cycle, an instruction is applied to 4 components of 4 pixels, or 16 elements in the case of the GeForce 7. It often happens that an instruction isn't applied to all components. To avoid wasting resources, the shader cores of these GPUs are capable of processing two instructions simultaneously. For example:
MUL R1.xy
ADD R1.z
These two instructions, a multiplication and an addition, can be processed simultaneously even though they are different. This capability is called “co-issue”. These units are described as MIMD (multiple instructions multiple data) and are 512 bits wide (16 elements x 32 bits).
The GeForce 8800’s units, however, are of the 512-bit SIMD (single instruction multiple data) type. Does that mean that they are less efficient? No, because instead of processing 4 components of 4 pixels per cycle, they process one component of 16 pixels. This means that each component of a pixel can receive a different instruction without wasting resources. The above example of 2 instructions shows the benefit of organising the units this way. With one unit of the GeForce 7 type, the two instructions are applied to 4 pixels during each cycle, or to 16 pixels in 4 cycles. With the GeForce 8, they are applied to 16 pixels in 3 cycles, the first instruction being broken down into MUL R1.x and MUL R1.y. That is 25% fewer cycles with equivalent processing resources, due solely to this reorganisation.
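The 4 cycles versus 3 cycles comparison can be reproduced with a small cycle-counting sketch. This is a simplified model under stated assumptions, not a description of the actual hardware schedulers: the function names are ours, a program is just a list of (opcode, write mask) pairs, the vec4 unit co-issues at most two instructions whose masks fit together in 4 lanes, and data dependencies between instructions are ignored.

```python
import math


def vec4_cycles(program, pixels, pixels_per_cycle=4, lanes=4, max_coissue=2):
    """Cycles on a GeForce 7 style unit: 4 pixels x 4 components per cycle,
    with greedy in-order co-issue of up to `max_coissue` instructions
    as long as their combined write masks fit in `lanes` components."""
    issue_slots = 0   # number of cycles needed per group of pixels
    used_lanes = 0    # components occupied in the current issue slot
    issued = 0        # instructions packed into the current issue slot
    for _opcode, mask in program:
        width = len(mask)
        if issued and issued < max_coissue and used_lanes + width <= lanes:
            used_lanes += width      # co-issue with the previous instruction
            issued += 1
        else:
            issue_slots += 1         # start a new issue slot
            used_lanes = width
            issued = 1
    return issue_slots * math.ceil(pixels / pixels_per_cycle)


def scalar_cycles(program, pixels, pixels_per_cycle=16):
    """Cycles on a GeForce 8 style unit: one component of 16 pixels per cycle,
    so every written component costs one cycle per batch of 16 pixels."""
    components = sum(len(mask) for _opcode, mask in program)
    return components * math.ceil(pixels / pixels_per_cycle)


# The example from the text: MUL R1.xy followed by ADD R1.z, on 16 pixels.
example = [("MUL", "xy"), ("ADD", "z")]
print(vec4_cycles(example, 16), scalar_cycles(example, 16))

# A full 4-component instruction leaves no lane unused on the vec4 unit.
full = [("MUL", "xyzw")]
print(vec4_cycles(full, 16), scalar_cycles(full, 16))
```

In this model the scalar organisation only pulls ahead when write masks leave vec4 lanes empty; with a full 4-component instruction the two designs need the same number of cycles.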
Shader core specifications
Now that we have finished describing the philosophy behind each architecture, we will compare their specifications:

As you can see, the GeForce 8800 GTX has enormous calculation power in addition to excellent efficiency thanks to scalar instruction processing. We have to keep in mind that these units also have to process vertex shaders, whereas the other 2 GPUs have dedicated units in charge of them.
You will also notice that the newcomers have much higher filtering power. We will come back to this point later on.
We ran a lot of tests to get a more precise idea of how the shader cores of the GeForce 8800 work, and we have to admit that they are formidably efficient despite the very young drivers. We failed, however, to see the second MUL in action. We believe that its use is subject to restrictions or that the compiler built into the drivers doesn’t exploit it yet. This could mean a future performance improvement.
It’s also interesting to note that each scalar processor has, in addition to the MAD and MUL units, one unit which handles interpolation and processes special functions (EXP, LOG, RCP, RSQ, SIN, COS), all executed in 4 cycles. We suppose that, for its implementation, NVIDIA included 4 of these units per shader core, each capable of interpolating over one quad (a square of 4 pixels, which simplifies calculations) or executing one special instruction in one cycle (hence 4 cycles to process a special instruction for the 16 elements the shader core works on).
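The arithmetic behind that supposition is short enough to write down. The sketch below only restates our hypothesis of 4 special function units per 16-element shader core; the function name and parameters are illustrative, and the real implementation may differ.

```python
import math


def special_function_cycles(elements=16, units_per_core=4):
    """Cycles to run one special instruction (EXP, LOG, RCP, RSQ, SIN, COS)
    over all elements of a shader core, assuming each special function unit
    handles one element per cycle (our hypothesis, not a confirmed figure)."""
    return math.ceil(elements / units_per_core)


cycles = special_function_cycles()
print(cycles)

# Under the same hypothesis, special functions run at 1/cycles
# of the rate of an instruction that covers all 16 elements in one cycle.
relative_rate = 1 / cycles
print(relative_rate)
```

With 16 elements and 4 units this gives the 4 cycles we measured, which is why the hypothesis looks plausible to us.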