ATI Radeon HD 2900 XT
by Damien Triolet
Published on May 29, 2007

Branching performance
One of the main innovations introduced with Pixel Shader 3.0 was dynamic branching. It makes some shaders easier to write and makes others more efficient by skipping calculations on parts of the image that don’t need them. For example, why apply a very costly filter that softens shadow edges to a pixel in the middle of the shadow? Dynamic branching can determine whether a given pixel needs the filter or not.


However, this only pays off in very specific cases. Branching has a reputation for being difficult to manage, particularly on CPUs, which have to predict the branch result to hide calculation latency. A GPU processes pixels in groups of tens, hundreds or even thousands, which hides this latency automatically, so that problem doesn’t really exist for GPUs. There is another one, however. For branching to be efficient on a GPU, all pixels of a working group have to take the same branch. Otherwise both branches have to be calculated for all pixels, with masks ensuring that only the result of the required branch is written for each pixel.
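
To make the cost of a divergent group concrete, here is a minimal sketch in Python of the behaviour described above. The function name `group_cost` and the cost figures are illustrative assumptions in arbitrary units, not measurements from any GPU.

```python
def group_cost(takes_heavy, heavy_cost, light_cost):
    """Cost (arbitrary units) of one pixel group executed in lockstep.

    takes_heavy: one boolean per pixel, True if that pixel needs the
    expensive branch. Hypothetical model, not a real GPU scheduler.
    """
    if all(takes_heavy):
        return heavy_cost   # coherent group: only the expensive branch runs
    if not any(takes_heavy):
        return light_cost   # coherent group: only the cheap branch runs
    # Divergent group: both branches run for every pixel, and a mask
    # selects which result is written for each one.
    return heavy_cost + light_cost


coherent_heavy = group_cost([True] * 4, 100, 5)
coherent_light = group_cost([False] * 4, 100, 5)
divergent = group_cost([True, False, False, False], 100, 5)
```

In this model a single pixel that needs the expensive branch is enough to make its whole group pay for both branches, which is why the size of the group matters so much.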

We developed a small test that lets us change branching granularity (the number of consecutive pixels that take the same branch). We create virtual screen columns inside the pixel shader applied to moving triangles, and we specify the branch to take per pixel column. Every other column has to run a complex shader while the remaining columns can skip this part of the rendering. Medium-sized triangles in motion are displayed on the screen, across these virtual areas that use different branches. The triangle size, the triangle position and the column size all influence branching efficiency. We think this test is quite close to real situations.


With narrow columns, GPUs can’t use branching to avoid the complex part for half of the pixels, but they still have to process the branching instructions. This reduces performance instead of increasing it - at least for the GeForce 7. Radeons and GeForce 8s have a special unit devoted to branching, which works in parallel with the pixel shading and texturing pipelines and masks the cost of branching instructions. The Radeon X1950, however, seems to be the only one to completely mask branching latency.

The most efficient GPU up until now at processing these operations is the Radeon X1800. It is closely followed by the Radeon X1900 and then the Radeon HD 2900. The size of pixel threads on the GeForce 8800 is 32 pixels, versus 48 for the Radeon X1950 and 64 for the Radeon HD 2900. This gives Nvidia’s chip an advantage. We specify pixel threads because, for vertex threads, granularity is 16 vertices on Nvidia’s chip. Note that the Radeon X19x0 shows less predictable results than other GPUs in this test (see the strange result for the 16-pixel column). We believe this is due to the complex way the architecture distributes pixels to the shader cores, which affects efficiency in certain conditions. Overall, therefore, the Radeon HD 2900 behaves like a GeForce 8800. Its efficiency, however, is a step below because it processes pixels in larger groups.
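
The link between column width and group size can be illustrated with a deliberately simplified model. The sketch below assumes a single row of pixels cut into aligned one-dimensional groups, whereas real GPUs group pixels in 2D tiles and the triangles in our test move, so it shows only the trend and does not reproduce our measurements. The function name `relative_heavy_work` and the row width of 1920 pixels are assumptions chosen for illustration.

```python
def relative_heavy_work(column_width, group_size, row_pixels=1920):
    """Fraction of pixels that end up running the complex branch.

    Even-numbered columns need the complex shader, odd-numbered columns
    can skip it. A group skips the complex part only if none of its
    pixels needs it. Ideal branching gives 0.5, no benefit gives 1.0.
    Simplified 1D model (an assumption, not how hardware tiles pixels).
    """
    heavy_pixels = 0
    for start in range(0, row_pixels, group_size):
        group = range(start, min(start + group_size, row_pixels))
        needs_heavy = any((x // column_width) % 2 == 0 for x in group)
        if needs_heavy:
            heavy_pixels += len(group)  # every pixel of the group pays
    return heavy_pixels / row_pixels


# Group sizes quoted in the text: 32, 48 and 64 pixels.
narrow = relative_heavy_work(column_width=4, group_size=32)
wide_32 = relative_heavy_work(column_width=64, group_size=32)
wide_48 = relative_heavy_work(column_width=64, group_size=48)
wide_64 = relative_heavy_work(column_width=64, group_size=64)
```

Under these assumptions, 4-pixel columns leave every group divergent, while 64-pixel columns let the 32-pixel and 64-pixel groups skip half of the work. The 48-pixel groups do not line up with 64-pixel columns and skip less, which hints at why the group size alone does not fully predict the result.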

Next, we carried out a second test related to dynamic branching. This time we rendered a fractal, first in the normal way and then with branching. The algorithm uses a large number of identical iterations, which are simply written one after another in the standard (or flat) shader. In the branching-based shader, we put a loop around two iterations, with a test that checks whether additional iterations are useful or not. If they aren’t, we exit the loop and leave out the unnecessary ones.
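
The difference between the two shader variants can be sketched in Python. This is an illustration only: it assumes a Mandelbrot-style iteration with the usual escape test, 32 iterations in total and one test every 2 iterations, and the function names are hypothetical. The actual shader used in our test is not reproduced here.

```python
def iterate(zx, zy, cx, cy):
    """One fractal iteration z = z*z + c that freezes z once it has escaped."""
    escaped = zx * zx + zy * zy > 4.0
    nx = zx * zx - zy * zy + cx
    ny = 2.0 * zx * zy + cy
    # A select rather than a jump: keep the old value if already escaped.
    return (zx, zy) if escaped else (nx, ny)


def flat_shader(cx, cy, total=32):
    """Flat version: every iteration is always executed."""
    zx = zy = 0.0
    for _ in range(total):
        zx, zy = iterate(zx, zy, cx, cy)
    return total, zx * zx + zy * zy > 4.0


def branching_shader(cx, cy, total=32, chunk=2):
    """Branching version: test after each chunk and exit the loop early."""
    zx = zy = 0.0
    executed = 0
    while executed < total:
        for _ in range(chunk):
            zx, zy = iterate(zx, zy, cx, cy)
        executed += chunk
        if zx * zx + zy * zy > 4.0:
            break  # further iterations would not change the result
    return executed, zx * zx + zy * zy > 4.0


# Each call returns (iterations executed, whether the point escaped).
far_flat = flat_shader(2.0, 2.0)
far_branch = branching_shader(2.0, 2.0)
origin_branch = branching_shader(0.0, 0.0)
```

Both versions give the same answer for a given point. The branching version saves iterations only for points that escape early, and it pays for one extra test per chunk, which is the overhead a GPU has to mask.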



Before comparing performance with and without branching, it’s interesting to take a look at the raw results. This shader is mainly made up of vec2 operations and therefore benefits from an architecture whose calculation units are more flexible. The GeForce 8’s scalar design achieves very high performance here. The Radeon HD 2900 displays clearly better results than the Radeon X1950, but this is not due to its "scalar architecture". Rather, it corresponds to the frequency increase and to the extra vector calculation units (from 48 to 64).

As for branching, the Radeon HD 2900 isn’t very efficient here, and it seems to have more trouble masking the cost of the numerous branching operations than its predecessor and the GeForce 8800.
