Product review: The Nvidia GeForce GTX 280 & 260
by Damien Triolet
Published on July 7, 2008

Branching performance
One of the main innovations that came with the evolution of GPU programmability was dynamic branching. It makes some shaders easier to write and makes others more efficient by skipping calculations on the parts of an image that don’t need them. For example, why apply a very costly filter, meant to soften the border of a shadow, to a pixel in the middle of that shadow? Dynamic branching lets the shader determine whether or not a pixel needs the filter.


However, the situation is not quite that rosy, as the gain only applies in very specific cases. Branching has a reputation for being difficult to manage, particularly on CPUs, which have to predict the result of a branch in order to mask calculation latency. A GPU processes pixels in groups of tens, hundreds or even thousands, which masks this latency automatically, so that problem doesn’t really exist for GPUs. There is another one, however. For branching to be efficient on a GPU, all of the pixels in a working group have to take the same branch. Otherwise, both branches have to be calculated for all of the pixels, with masks used so that only the result of the required branch is written for each pixel.
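The masking behaviour described above can be illustrated with a minimal Python sketch. It is a simplified cost model, not a description of any actual hardware: the function name `group_cost` and the cost values are assumptions chosen purely for illustration.

```python
def group_cost(takes_branch, cost_taken, cost_skipped):
    """Cost of one pixel group executed in lockstep.

    takes_branch: one bool per pixel of the group.
    cost_taken / cost_skipped: hypothetical costs of the two sides of the branch.
    """
    cost = 0
    if any(takes_branch):        # at least one pixel needs the expensive side
        cost += cost_taken
    if not all(takes_branch):    # at least one pixel needs the cheap side
        cost += cost_skipped
    return cost


coherent = [True] * 32                # every pixel takes the same branch
divergent = [True] * 31 + [False]     # a single pixel disagrees

print(group_cost(coherent, 100, 1))      # 100
print(group_cost(divergent, 100, 1))     # 101: the whole group pays for both sides
print(group_cost([False] * 32, 100, 1))  # 1: the expensive side is skipped entirely
```

In this model a group only saves work when none of its pixels needs the expensive side, which is why the size of the group matters so much.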

In the case of the GeForce 8, 9 and GTX 200, the GPU works on groups of 16 or 32 threads (vertices, pixels, etc.). Why these two possibilities? First of all, the calculation units are 8-way SIMT units, which require groups of at least 8 threads. Next, you may recall that the calculation units are double pumped and run at twice the scheduler’s frequency. Seen from the calculation units’ point of view, an instruction can therefore only be sent every other cycle. Working on groups of 16 threads gives the calculation units enough work that they don’t have to wait for the slower scheduler. Finally, working on groups of 32 threads allows dual issue: the scheduler alternately sends an instruction to the 8-way SIMT unit and then an instruction to the special units. Groups of 32 threads let it alternate between the two at full speed.
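The arithmetic behind these two group sizes can be written out as a short Python sketch. This is only a back-of-the-envelope model of the reasoning above; the function and parameter names are hypothetical and do not come from Nvidia.

```python
def min_group_size(lanes, unit_clocks_per_scheduler_clock, units_fed_in_turn):
    """Threads one instruction must cover to keep a SIMT unit busy.

    lanes: width of the SIMT unit (threads processed per unit clock).
    unit_clocks_per_scheduler_clock: 2 for a double-pumped unit.
    units_fed_in_turn: 1 if the scheduler feeds this unit on every one of
        its clocks, 2 if it alternates with another unit (dual issue).
    """
    return lanes * unit_clocks_per_scheduler_clock * units_fed_in_turn


# One instruction per scheduler clock, i.e. every 2 unit clocks: 8 x 2 threads.
print(min_group_size(8, 2, 1))  # 16

# Alternating with the special units, the SIMT unit receives an instruction
# only every 2 scheduler clocks, i.e. every 4 unit clocks: 8 x 4 threads.
print(min_group_size(8, 2, 2))  # 32
```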

Nvidia can configure its GPUs for groups of 16 or 32 threads. The first improves branching performance, while the second improves calculation power thanks to dual issue. Groups of 16 are used for vertex and geometry shaders, while groups of 32 are used for pixel shaders and CUDA.

We developed a small test that allows us to change the branching granularity (the number of consecutive pixels that take the same branch). Inside a pixel shader applied to moving triangles, we create virtual screen columns and specify which branch to take for each column of pixels. One column out of two has to execute a complex shader, while the other can skip this part of the rendering. Average-sized triangles in motion are displayed on screen, across these virtual areas that use different branches. The size of the triangles, their position and the column width all influence branching efficiency. We think this test is quite close to real situations.
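To get a feel for how column width and group size interact in a test of this kind, here is a rough one-dimensional model in Python. It is not the shader used for the measurements: it assumes that groups are made of consecutive pixels along a single row (real GPUs group pixels in 2D tiles and depend on triangle coverage), and the function names and cost values are made up for illustration.

```python
def takes_complex_branch(x, column_width):
    """Even-numbered virtual columns run the complex shader, odd ones skip it."""
    return (x // column_width) % 2 == 0


def relative_cost(row_width, column_width, group_size,
                  cost_complex=100, cost_simple=1):
    """Cost of shading one row, relative to running the complex part everywhere.

    Each group of consecutive pixels pays for every side of the branch
    that at least one of its pixels takes.
    """
    total = 0
    groups = 0
    for start in range(0, row_width, group_size):
        end = min(start + group_size, row_width)
        flags = [takes_complex_branch(x, column_width) for x in range(start, end)]
        if any(flags):
            total += cost_complex
        if not all(flags):
            total += cost_simple
        groups += 1
    return total / (groups * cost_complex)


print(relative_cost(1024, 8, 32))   # 1.01: every group diverges, branching only costs
print(relative_cost(1024, 32, 32))  # 0.505: half of the groups skip the complex part
print(relative_cost(1024, 32, 64))  # 1.01: 64-pixel groups still straddle two columns
print(relative_cost(1024, 64, 64))  # 0.505
```

In this toy model, 32-pixel groups start to benefit from branching with columns half as wide as those needed by 64-pixel groups, and narrow columns make branching a net loss for both.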


With narrow columns, GPUs can’t use branching to skip the complex part for half of the pixels, but they still have to process the branching instructions. This reduces performance instead of increasing it, at least for the GeForce 8, 9 and GTX 280. All of these GPUs have a special unit devoted to branching that works in parallel with the pixel shading and texturing pipelines and masks the cost of branching instructions. The Radeon HD 3870, however, seems to be the only one to mask branching latency completely.

Pixel groups contain 32 pixels on the GeForce 8800 versus 64 on the Radeon HD 3870, which enables the Nvidia chips to take the lead. We noted a surprising difference between the GeForce 9800 GTX and the GeForce GTX 280: with a column width of 8 pixels, the GTX 280 is much more efficient. It is probable that it breaks triangles down into pixels in a way that better groups neighbouring pixels together (which are thus more likely to take the same branch), which is beneficial in this case.





Copyright © 1997- Hardware.fr SARL. All rights reserved.