Performance in pixel shading
We extracted complex shaders from three applications: 3DMark05, Far Cry and Tomb Raider AOD. We then ran them full screen in an external application.

In 2 out of the 3 pixel shaders, NVIDIA dominates with the 7800 GTX thanks to its higher computing power. ATI takes the lead in the third, which relies more on dependent texture accesses (indirection) and benefits from the X1800 XT's higher memory bandwidth.
Compared to the X850 XT PE, the X1800 XL brings only small performance gains of 5 to 10%. This isn't so surprising, as it has less memory bandwidth and less computing power. The gains are therefore really due to the new architecture, the ring bus and Ultra Threading. Of course, the X1800 XT performs better, especially with the third shader, where the higher memory bandwidth and higher clock frequency combine to increase performance by 70% compared to the X1800 XL.
We then tested 2 lighting shaders:

These shaders measure pure computing power and are clearly to the advantage of NVIDIA, which also benefits from FP16 to increase performance. The difference between the X1800 XT and the X850 XT PE is mainly due to clock frequency.
Branching
One of the main innovations introduced with the GeForce 6800 was dynamic branching in shaders. It makes some shaders easier to write and makes others more efficient by skipping calculations on pixels that don't need them. For example, why apply a very costly filter to soften the border of a shadow if the pixel is in the middle of the shadow? Dynamic branching makes it possible to determine whether the pixel needs the filter or not. Splinter Cell Chaos Theory uses this technique, whereas The Chronicles of Riddick calculates everything for every pixel. Performance drops by 10 to 15% for the former and by more than 50% for the latter. Of course, the algorithms aren't identical, but it does give us an idea of what dynamic branching is capable of.
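The shadow example can be sketched on the CPU to show where the saving comes from. This is only an illustrative model in Python, not shader code and not the method used by either game: the function name `shade_pixel`, the 4 corner taps used as a cheap test, and the box filter are all assumptions made for the illustration.

```python
def shade_pixel(shadow_map, x, y, radius=2):
    """Return (shadow_value, taps_used) for one pixel.

    shadow_map is a 2D list of 0 (lit) / 1 (shadowed) values.
    Hypothetical scheme: 4 cheap corner taps decide whether the
    expensive full filter is needed at all.
    """
    height, width = len(shadow_map), len(shadow_map[0])

    def tap(dx, dy):
        # Clamp coordinates to the edges of the map.
        sx = min(max(x + dx, 0), width - 1)
        sy = min(max(y + dy, 0), height - 1)
        return shadow_map[sy][sx]

    corners = [tap(-radius, -radius), tap(radius, -radius),
               tap(-radius, radius), tap(radius, radius)]

    # Dynamic branch: if all corners agree, the pixel is assumed to be
    # fully lit or fully shadowed, and the costly filter is skipped.
    if all(c == corners[0] for c in corners):
        return float(corners[0]), len(corners)

    # Otherwise the pixel is near a shadow border: run the full box filter.
    taps = [tap(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]
    return sum(taps) / len(taps), len(corners) + len(taps)


if __name__ == "__main__":
    # Left half of an 8x8 map is in shadow, right half is lit.
    shadow = [[1 if x < 4 else 0 for x in range(8)] for _ in range(8)]
    print(shade_pixel(shadow, 1, 4))  # deep inside the shadow
    print(shade_pixel(shadow, 4, 4))  # on the shadow border
```

In this sketch a pixel deep inside the shadow costs 4 taps, while a border pixel costs 29. On a real GPU the saving only materializes if neighbouring pixels take the same branch.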

Of course, this only applies to very specific cases. In a GPU, pixels are processed in groups of 100 or even 1000. When a branch is encountered, all pixels in a group have to take the same branch; otherwise both branches have to be calculated for all pixels, with masks so that only the result of the required branch is written. On paper, ATI has a clear advantage with its processing unit devoted to branching and its very small threads. Let's see if this is the case in practice with a small test that we developed, which allows us to change the branching granularity (the number of consecutive pixels that take the same branch). We specify the branch to take per pixel column: one column out of two has to run a complex shader and the other can skip this part of the rendering. Medium-sized triangles in motion are displayed on screen. The size of the triangles, their position and the column width all influence branching efficiency. This is therefore closer to real situations than our previous test, which was made with two full-screen triangles.
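The effect of granularity can be illustrated with a rough model. The Python sketch below is not our test application: it assumes a full-screen surface 1600 pixels wide, batches aligned on a regular grid, and a branch that depends only on the pixel column, so only the batch width matters. The function names and the cost figures (`complex_cost`, `simple_cost`, `branch_overhead`) are hypothetical values chosen for illustration.

```python
def batch_mix(column_width, batch_width, screen_width=1600):
    """Return the fractions of batches that (skip, run complex, are mixed).

    Even-numbered columns run the complex shader, odd-numbered columns
    skip it. A batch is 'mixed' when it straddles both kinds of column.
    """
    skip = run_complex = mixed = 0
    for x0 in range(0, screen_width, batch_width):
        x1 = min(x0 + batch_width, screen_width)
        branches = {(x // column_width) % 2 for x in range(x0, x1)}
        if branches == {1}:
            skip += 1
        elif branches == {0}:
            run_complex += 1
        else:
            mixed += 1
    total = skip + run_complex + mixed
    return skip / total, run_complex / total, mixed / total


def estimated_speedup(column_width, batch_width,
                      complex_cost=10.0, simple_cost=1.0,
                      branch_overhead=0.0):
    """Speedup of the branching shader over computing everything.

    Only batches in which every pixel skips the complex part save work;
    mixed batches pay for both branches. Costs are arbitrary units.
    """
    skip_fraction, _, _ = batch_mix(column_width, batch_width)
    baseline = simple_cost + complex_cost
    branched = (simple_cost + branch_overhead
                + (1.0 - skip_fraction) * complex_cost)
    return baseline / branched


if __name__ == "__main__":
    # Batch width 4 corresponds to a 4x4 thread; wider values are assumptions.
    for width in (1, 2, 4, 16, 64, 800):
        print(width, batch_mix(width, 4), batch_mix(width, 64))
```

With a batch width of 4, `batch_mix(4, 4)` classifies half of the batches as skippable, whereas with 2-pixel columns every batch is mixed and nothing is saved. Setting a non-zero `branch_overhead` in `estimated_speedup` makes the narrow-column case slower than no branching at all. The model ignores triangle edges, which is one reason real results differ.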

With narrow columns, GPUs can't use branching to skip the complex part for half of the pixels, but they still have to process the branching instructions. This reduces performance instead of increasing it. You will notice that this reduction is only 2.5% for ATI, compared to 9 to 10% for NVIDIA. This is because ATI has a dedicated branching unit, which works in parallel with the other units.
ATI's small threads of 16 pixels (4x4) allow performance improvements as soon as the column width reaches 4 pixels, whereas you have to wait until 64 pixels for NVIDIA! ATI easily reaches a 60% performance gain, whereas NVIDIA remains at 20% except for the 800-pixel column (a screen divided in two, as we are working in 1600x1200).
Here the GeForce 7800 isn't more efficient than the GeForce 6800. It was in our previous tests, but those covered very specific cases in which the 7800 did indeed perform better. In practice, however, this isn't the case. Gains are slightly smaller with the 7800 because its architecture is more efficient and, consequently, the cost of the complex branch is lower.
Overall, ATI has better branching efficiency than NVIDIA, which should allow branching to be used in more situations. Developers will appreciate this.
Vertex Shader
We tested performance in T&L, VS 1.1, VS 2.0 and VS 2.X/3.0 in RightMark:

For simple rendering, with a single light source, NVIDIA dominates except in T&L. For more complex rendering, the X1800 XT comes out ahead in all tests. With static branching, NVIDIA has problems and the previous Radeon generations aren't much better; the X1800 XT doubles performance. With dynamic branching, unlike in pixel shaders, NVIDIA's GPU and the X1800 XT behave similarly and the performance gap is only due to the difference in clock frequency.
Unlike NVIDIA, ATI seems to have forgotten about Vertex Texturing support, which we thought was required for Vertex Shader 3.0 support. ATI says the opposite, which is quite odd. If we take a closer look, we see that ATI reports vertex texturing to DirectX, but doesn't allow it for any texture format. This looks suspicious and may be a clever way to get around the DirectX specification and announce Vertex Shader 3.0 support without Vertex Texturing. Either way, the situation is unclear. In practice, Vertex Texturing isn't really important except for 2 or 3 technology demos. It isn't very widespread because it is too restricted, at least in its current implementation.