The GeForce GTX 200
With the GT200, the GPU that powers the GeForce GTX 200, Nvidia naturally set out to offer a higher-performance GPU. So what could be done starting from the GeForce 8 architecture? Put two G80s on the same chip for a total of 256 scalar processors and 128 texturing units? Sounds simple, right?
It’s never that easy. Doubling the units rarely doubles performance. This is all the more true because drawbacks such as power consumption and the resulting heat can force clock frequencies down, and the performance gains with them.
So Nvidia first wanted to identify the limiting factor on the GeForce 8/9, and then tried to anticipate what it would be in the future. The conclusion was that more computing power and more registers were needed, and that the number of texturing units didn’t necessarily have to increase much.

The GT200’s partitions receive an additional multiprocessor compared to the GeForce 8 and 9, bringing the number per partition to 3.
For this reason, Nvidia added one multiprocessor per partition, each of which now contains 3. The number of registers in each multiprocessor was also doubled, to 16,384. More registers give the compiler more flexibility to produce an optimal instruction sequence and let the GPU mask the various latencies more efficiently, for example when accessing textures. You may recall that the GPU handles a very large number of threads (pixels, vertices, etc.) to mask latency and keep the execution units busy, and the data of these threads has to stay in the registers. Next, Nvidia increased the number of partitions from 8 to 10, for a total of 240 scalar processors. In terms of general-purpose registers across the entire GPU, we go from 131,072 to 491,520 32-bit registers (30 multiprocessors of 16,384 each)!
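The totals quoted here follow directly from the per-unit counts, so they are easy to recompute. The sketch below does that using only figures given in the text (partitions, multiprocessors per partition, eight scalar processors per multiprocessor, registers per multiprocessor); the function name and dictionary layout are illustrative choices, not anything from Nvidia.

```python
# Counts come from the article; the helper itself is only an illustration.
SCALAR_PROCESSORS_PER_MULTIPROCESSOR = 8


def totals(partitions, multiprocessors_per_partition, registers_per_multiprocessor):
    """Derive chip-wide totals from the per-partition and per-multiprocessor counts."""
    multiprocessors = partitions * multiprocessors_per_partition
    return {
        "multiprocessors": multiprocessors,
        "scalar_processors": multiprocessors * SCALAR_PROCESSORS_PER_MULTIPROCESSOR,
        "registers_32bit": multiprocessors * registers_per_multiprocessor,
    }


# GeForce 8/9 generation: 8 partitions of 2 multiprocessors, 8,192 registers each.
g80 = totals(8, 2, 8192)
# GT200: 10 partitions of 3 multiprocessors, 16,384 registers each.
gt200 = totals(10, 3, 16384)

print(g80)    # {'multiprocessors': 16, 'scalar_processors': 128, 'registers_32bit': 131072}
print(gt200)  # {'multiprocessors': 30, 'scalar_processors': 240, 'registers_32bit': 491520}

# Each register is 32 bits (4 bytes), so the register file size in KiB is:
print(gt200["registers_32bit"] * 4 // 1024)  # 1920
```

The scalar processor count grows by a factor of 1.875 while the register count grows by 3.75, which reflects the stated priority of adding registers faster than execution units.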

The GT200’s architecture.
A supplementary unit was added to the GT200’s multiprocessors: a 64-bit FMAD. This unit lets the GT200 support 64-bit floating point calculations. Since there is only one such unit per multiprocessor, its throughput is an eighth of that of a SIMT unit made up of eight 32-bit scalar processors. In addition, each 64-bit value occupies two 32-bit registers, which limits performance a bit more. This support is therefore not designed to be as efficient as possible; it is there first and foremost for developers who need it in CUDA.
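The two penalties described for 64-bit work, one unit against eight and two registers per value, can be put into numbers. This is a minimal sketch based only on the per-multiprocessor figures in the text; it compares FMAD issue rates per clock and ignores clock speeds, dual issue and any other real-world effect.

```python
from fractions import Fraction

# Per-multiprocessor figures taken from the article.
SCALAR_PROCESSORS_32BIT = 8   # 32-bit scalar processors in the SIMT unit
FMAD_UNITS_64BIT = 1          # single 64-bit FMAD unit
REGISTERS_32BIT = 16384       # 32-bit general-purpose registers
MULTIPROCESSORS = 30          # 10 partitions of 3 multiprocessors

# Relative FMAD throughput of 64-bit versus 32-bit, per clock.
fp64_relative_rate = Fraction(FMAD_UNITS_64BIT, SCALAR_PROCESSORS_32BIT)

# A 64-bit value occupies two 32-bit registers, so half as many values fit.
fp64_values_per_multiprocessor = REGISTERS_32BIT // 2

# Chip-wide FMAD operations issued per clock for each precision.
fp32_fmads_per_clock = MULTIPROCESSORS * SCALAR_PROCESSORS_32BIT
fp64_fmads_per_clock = MULTIPROCESSORS * FMAD_UNITS_64BIT

print(fp64_relative_rate)              # 1/8
print(fp64_values_per_multiprocessor)  # 8192
print(fp32_fmads_per_clock)            # 240
print(fp64_fmads_per_clock)            # 30
```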
The texturing units themselves are unchanged; their number simply grows with the move to 10 partitions. Nvidia does say, however, that it has improved the scheduler and a few other details in order to make better use of these units.
There are many small changes of this kind. Dual issue was improved, making it easier to use FMADs and FMULs in parallel. ROPs can now blend at full speed with 32-bit (4x 8 bits) formats, whereas this was previously done at half speed.
The geometry shader output buffer was enlarged and is now six times bigger. You may recall that this was one of the weak points of the GeForce 8 and 9, whose performance plummeted when a geometry shader was used to create a significant amount of geometry, for example for tessellation.
Just like the Radeon HD 2000 and 3000, the GT200 has a processor dedicated to managing PCI Express transfers. It can therefore send and receive data while it is working on 3D rendering or running a CUDA program.
Finally, the memory bus was widened to 512 bits with eight 64-bit controllers, which gives the GPU significant bandwidth without requiring very expensive memory.
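To see why a wider bus reduces the need for fast memory, it helps to write out the peak bandwidth formula: bus width in bytes multiplied by the number of transfers per second. The sketch below is only illustrative. The controller count and width come from the text, but the transfer rate of 2.0 billion transfers per second and the 256-bit comparison bus are hypothetical values chosen for the example, not the specifications of any actual card.

```python
def bus_width_bits(controllers, bits_per_controller):
    """Total memory bus width from the number and width of the controllers."""
    return controllers * bits_per_controller


def peak_bandwidth_gb_s(width_bits, transfers_per_second):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return width_bits / 8 * transfers_per_second / 1e9


# From the article: eight 64-bit controllers.
gt200_width = bus_width_bits(8, 64)

# ASSUMPTION: a hypothetical effective rate, used only to compare bus widths.
HYPOTHETICAL_TRANSFERS_PER_SECOND = 2.0e9

wide = peak_bandwidth_gb_s(gt200_width, HYPOTHETICAL_TRANSFERS_PER_SECOND)
# ASSUMPTION: a hypothetical 256-bit bus at the same rate, for comparison.
narrow = peak_bandwidth_gb_s(256, HYPOTHETICAL_TRANSFERS_PER_SECOND)

print(gt200_width)  # 512
print(wide)         # 128.0
print(narrow)       # 64.0
```

At an equal transfer rate, the 512-bit bus delivers twice the bandwidth of the 256-bit one; put the other way, the narrower bus would need memory running twice as fast to keep up.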
Where Nvidia has not moved, on the other hand, is in its refusal to support DirectX 10.1. As we have explained on several occasions, this choice is partly strategic: it avoids devaluing Nvidia's other GPUs relative to the competition. Another aspect is that while Nvidia supports some parts of DirectX 10.1, such as direct access to depth buffers when anti-aliasing is used, and helps developers work around DirectX 10 and 9 to access them, other points require more significant changes. For example, Nvidia does not have programmable grids for the position of samples in multisampling, which DirectX 10.1 requires and which would mean an in-depth rework of the anti-aliasing part of its GPUs.