Pixel Shader performance
We tested two relatively simple lighting shaders, which represent a good compromise between theoretical and real-world speeds:

The GeForce GTX 280 is 40 to 50% faster here.
Of course, we wanted to know more about performance in much more complex situations. Our tests focused mostly on dual issue, and they showed that it is indeed easier to exploit. For example, we measured a throughput of 1.5 FMUL per cycle, which means that the FMAD and FMUL units do work in parallel. On the other hand, the FMUL unit seems to be used only one cycle out of two, and we are not sure why; it may be a limitation related to register access. The maximum computing power we obtained, with a shader made up of 2,000 FMADs and 2,000 FMULs, was 664.4 Gflops, or about 71% of the announced maximum of 933.12 Gflops.
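The arithmetic behind these figures can be checked with a short sketch. It assumes the GeForce GTX 280 reference specification of 240 stream processors at a 1,296 MHz shader clock (the clock is not given in this section), and counts an FMAD as 2 flops and an FMUL as 1 flop. The function names are purely illustrative.

```python
# Assumed reference specification for the GeForce GTX 280
# (the shader clock is not stated in this section).
STREAM_PROCESSORS = 240
SHADER_CLOCK_GHZ = 1.296

FLOPS_PER_FMAD = 2  # one multiply plus one add
FLOPS_PER_FMUL = 1  # one multiply


def peak_gflops(processors, clock_ghz):
    """Theoretical peak if every processor issues one FMAD and one FMUL per cycle."""
    return processors * clock_ghz * (FLOPS_PER_FMAD + FLOPS_PER_FMUL)


def flops_per_processor_cycle(measured_gflops, processors, clock_ghz):
    """Average flops retired per processor per cycle for a measured throughput."""
    return measured_gflops / (processors * clock_ghz)


measured = 664.4  # Gflops, the figure measured in the text
peak = peak_gflops(STREAM_PROCESSORS, SHADER_CLOCK_GHZ)
per_cycle = flops_per_processor_cycle(measured, STREAM_PROCESSORS, SHADER_CLOCK_GHZ)

print(f"peak: {peak:.2f} Gflops")
print(f"efficiency: {measured / peak:.1%}")
print(f"flops per processor per cycle: {per_cycle:.2f}")
```

Under these assumptions the measured figure works out to a little more than the 2 flops per cycle that the FMAD unit alone would deliver, which is consistent with the FMUL unit contributing only part of the time.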
Vertex Shader performance
We tested performance in T&L, VS 1.1, VS 2.0 and VS 3.0 with RightMark:

A unified architecture lets recent GPUs allocate all of their resources to vertex shader processing, which can mean a significant gain. The gain could be even bigger, but it is limited by the GPU's triangle processing rate, which is 1 triangle per cycle on all the GeForces tested here. By contrast, it is 0.5 per cycle for the Radeon HD 3870, whereas it was 1 per cycle for the Radeon HD 2900 XT. The higher frequency of the GeForce 9800 GTX therefore works in its favour, making this GPU the most powerful we have seen in terms of (simple) geometry processing.
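To see why clock frequency decides the matter once the per-cycle rate is fixed, here is a minimal sketch of the triangle-rate ceiling. The per-cycle rates come from the text; the core clocks are assumed reference values that this section does not state, and the result is only an upper bound on setup rate, not a measured score.

```python
def peak_mtriangles_per_second(triangles_per_cycle, core_clock_mhz):
    """Upper bound on triangle throughput, in millions of triangles per second."""
    return triangles_per_cycle * core_clock_mhz


# name: (triangles per cycle from the text, ASSUMED reference core clock in MHz)
gpus = {
    "GeForce 9800 GTX": (1.0, 675),
    "GeForce GTX 280": (1.0, 602),
    "Radeon HD 3870": (0.5, 775),
}

ceilings = {
    name: peak_mtriangles_per_second(rate, clock)
    for name, (rate, clock) in gpus.items()
}

# Print the GPUs from the highest ceiling to the lowest.
for name, value in sorted(ceilings.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.1f} Mtriangles/s")
```

With these assumed clocks, the GeForce 9800 GTX comes out ahead of the GeForce GTX 280 simply because it runs at a higher frequency with the same per-cycle rate.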
Geometry Shader performance
Unlike Nvidia, AMD has integrated a generalized cache for memory reads and writes from the shader core. It can be used in the classic manner for Stream Output, which consists, as required by DirectX 10, of being able to write data coming out of the shader core without having to go through the ROPs. It also enables the virtualization of general registers, which can thus be unlimited.
Another use is to rely on video memory, via this cache, to temporarily store the potentially enormous mass of data created by geometry shaders during geometry amplification. Without it, the calculation units could stall because there is nowhere to put the results once the general registers are full. This could cause problems, and in theory a crash, because geometry data has to remain in the correct order. For example, imagine triangles 1 and 2 being broken down. Triangle 1 should be rendered before triangle 2. Because a GPU works in parallel, the two triangles could be processed at the same time by the geometry shader, which breaks each of them down into a series of smaller triangles. At the output, all the pixels stemming from triangle 1 should be rendered before the others. This is fine if there is enough memory to store everything: we only have to wait until everything is finished and make sure that rendering is done in the correct order. However, if the GPU runs out of memory while triangles 1 and 2 are still being broken down, it is stuck.
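The ordering constraint can be illustrated with a small sketch. This is a toy model written for this explanation, not a description of how either vendor's hardware is implemented: results arrive in whatever order the parallel units finish, are held in a buffer, and are released only in primitive order. The function name and data layout are hypothetical.

```python
def emit_in_order(completions):
    """Release geometry shader outputs in primitive order.

    `completions` is a list of (primitive_index, outputs) pairs in the order
    the parallel units happen to finish. Returns the outputs in primitive
    order, plus the peak number of outputs that had to be buffered.
    """
    pending = {}      # finished primitives still waiting for an earlier one
    next_index = 0    # next primitive allowed to leave the buffer
    ordered = []
    peak_buffered = 0

    for index, outputs in completions:
        pending[index] = outputs
        buffered = sum(len(items) for items in pending.values())
        peak_buffered = max(peak_buffered, buffered)
        # Drain every primitive whose predecessors have all been emitted.
        while next_index in pending:
            ordered.extend(pending.pop(next_index))
            next_index += 1

    return ordered, peak_buffered


# Triangle 2 (index 1) finishes before triangle 1 (index 0).
completions = [(1, ["2a", "2b"]), (0, ["1a", "1b", "1c"])]
ordered, peak_buffered = emit_in_order(completions)
print(ordered)
print(peak_buffered)
```

The peak buffer occupancy is the quantity that matters: if it exceeds the storage available, the later triangle's results have nowhere to wait, which is the stall described above.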
AMD thus avoids this problem. Nvidia of course has to avoid it too, but its approach is very different. It tackles the problem from the other end: instead of providing more memory to store data before putting things back in order, it reduces the number of elements processed in parallel so that there is always enough memory in the GPU. In other words, instead of using 128 or 240 processors to process a geometry shader, Nvidia reduces this number if it detects that there could be a problem. We do not know exactly at what point Nvidia has to reduce parallel processing, but this clearly seems to be a very big difference between Nvidia and AMD, with the advantage going to the latter. This holds even if developers are careful not to use geometry shaders in problematic cases.
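As a rough illustration of the trade-off, the sketch below caps the number of geometry shader invocations running at once so that their worst-case output always fits in a fixed on-chip buffer. The exact rule Nvidia applies is not known, so the formula, the buffer sizes and the function name are all assumptions made for this example.

```python
def max_parallel_invocations(processors, buffer_bytes, max_output_bytes):
    """Invocations that can run at once if each may emit `max_output_bytes`.

    Toy model: concurrency is limited by the number of processors and by how
    many worst-case outputs fit in the buffer. At least one invocation is
    assumed to run, otherwise no progress would ever be made.
    """
    if max_output_bytes <= 0:
        raise ValueError("max_output_bytes must be positive")
    fit_in_buffer = buffer_bytes // max_output_bytes
    return max(1, min(processors, fit_in_buffer))


# Hypothetical figures: 240 processors, a 16 KB buffer, 1 KB of output each.
small_buffer = max_parallel_invocations(240, 16 * 1024, 1024)
# The same shader with a buffer six times larger.
large_buffer = max_parallel_invocations(240, 6 * 16 * 1024, 1024)
# A shader with very little output is limited by the processors instead.
light_shader = max_parallel_invocations(240, 16 * 1024, 16)

print(small_buffer, large_buffer, light_shader)
```

In this model a shader that amplifies geometry heavily leaves most processors idle, while a larger buffer raises the cap proportionally until the processor count becomes the limit.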
To compensate for this, Nvidia has made the cache at the output of the geometry shaders six times larger. What does this mean in practice? We looked at performance in a tessellation demo based on geometry shaders, which AMD provided at the launch of the Radeon HD 2900 XT:

As you can see, even though the Radeon HDs keep their lead, the GeForce GTX 200 significantly improves performance compared to the GeForce 8 and 9. Nvidia says that it sized the cache increase according to what developers use now and will use in the medium term.