Nvidia CUDA: preview
by Damien Triolet
Published on March 21, 2007

The GeForce 8800 in detail
Although Nvidia, unlike AMD, has chosen not to offer a low-level language accompanied by detailed documentation of its hardware, CUDA still requires a thorough knowledge of the GPU. Consequently, Nvidia describes the GPU in less marketing-oriented language, which gives us an opportunity to learn a little more about it.

Roughly, the GeForce 8800 has been described as a GPU equipped with 128 scalar processors divided into 8 groups of 16, working at a very high frequency: 1,350 MHz. These groups process sets of 32 pixels or 16 vertices.

In fact, each of these 8 groups contains two sets of 8 scalar processors, which makes the GeForce 8800 GTX a chip equipped with 16 groups of 8 processors. Nvidia calls these groups “multiprocessors”. The fact that multiprocessors are made of 8 processors rather than 16 doesn't really have an impact on performance. It is more of an implementation detail intended to make it easier to run the calculation units at very high frequencies. The trade-off is that it requires more transistors.


The GeForce 8 consists of a group of multiprocessors, each of which is a SIMD unit made up of a certain number of processors.

In the CUDA documentation, the multiprocessors aren't described as running at 1,350 MHz but at 675 MHz with "double pumped" execution units, meaning units that run at twice the base frequency, like the ALUs of the Pentium 4. These multiprocessors process blocks of 64 to 512 elements called threads, which are split into sub-groups of 32 threads called “warps”. Two cycles are required to process a common floating point instruction on a warp: the 32 threads pass through the 8 processors in 4 steps of 0.5 cycle each, because of the "double pumped" mode. Issuing one instruction every 2 cycles is easier than issuing one every cycle. This explains the choice of two multiprocessors of 8 processors per group instead of the single multiprocessor of 16 that we could have assumed when reading the initial GeForce 8800 documentation.
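The two-cycle figure can be checked with a few lines of arithmetic. The sketch below is only an illustration of the reasoning in the paragraph above; the function and constant names are ours and do not come from the CUDA documentation.

```python
from fractions import Fraction

WARP_SIZE = 32          # threads per warp (from the text)
PROCESSORS_PER_MP = 8   # scalar processors per multiprocessor (from the text)
PUMP_FACTOR = 2         # "double pumped": units run at twice the 675 MHz clock


def warp_issue_cycles(warp_size=WARP_SIZE, units=PROCESSORS_PER_MP,
                      pump=PUMP_FACTOR):
    """Multiprocessor cycles needed to run one instruction over a whole warp."""
    passes = Fraction(warp_size, units)  # 32 / 8 = 4 passes through the units
    return passes / pump                 # each pass costs 1/pump of a cycle


print(warp_issue_cycles())          # 2
print(warp_issue_cycles(units=16))  # 1
```

The second call models the hypothetical 16-processor multiprocessor: it would have to issue a new instruction every cycle, which is the harder case the 8-processor design avoids.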

As we said in the article on the GeForce 8800, it also has calculation units dedicated to more complex instructions (exp, log, sin, cos, rcp, rsq). Each multiprocessor includes two of these units in addition to the 8 processors that handle common instructions. Special instructions are four times slower: 8 cycles are required to execute them for an entire warp. Note that, unlike with pixel and vertex shaders, the sin, cos and exp instructions are twice as slow as the other three and require 16 cycles to execute on the 32 threads of a warp. The reason is probably that for 3D rendering these instructions are executed with less precision but faster. Nvidia makes it clear that most instructions can be executed faster in a less precise mode (selected with a per-instruction tag or a compiler option).

Integer multiplications are also processed by these two units and require 8 cycles. A lower-precision version of this instruction (24-bit instead of 32-bit) can be executed by the 8 standard processors in 2 cycles from the multiprocessor's point of view.
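The cycle counts quoted for the different instruction classes all follow from the same rule: 32 threads shared among the available units, with each pass taking half a cycle. The sketch below tabulates them. The dictionary layout and the "slowdown" factor used to model the slower sin, cos and exp are our own modelling assumptions; the article only states the resulting cycle counts.

```python
WARP_SIZE = 32    # threads per warp
PUMP_FACTOR = 2   # units run at twice the multiprocessor clock

# name -> (units per multiprocessor, slowdown factor); the slowdown factor
# is a modelling assumption used to reproduce the 16-cycle figure.
INSTRUCTION_CLASSES = {
    "common floating point": (8, 1),
    "24-bit integer multiply": (8, 1),
    "rcp, rsq, log": (2, 1),
    "32-bit integer multiply": (2, 1),
    "sin, cos, exp": (2, 2),
}


def cycles_per_warp(units, slowdown=1):
    """Multiprocessor cycles to execute one instruction on a full warp."""
    return WARP_SIZE * slowdown // (units * PUMP_FACTOR)


for name, (units, slowdown) in INSTRUCTION_CLASSES.items():
    print(f"{name}: {cycles_per_warp(units, slowdown)} cycles")
```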

In short, the GeForce 8800 GTX can be seen as a big calculation unit divided into 16 multiprocessors, each processing warps of 32 threads via 8 general processors and two specialized ones. Together, these 16 multiprocessors clocked at 675 MHz can process one common instruction on 512 threads every two cycles, a rate of 256 operations per cycle (512 floating point calculations in the case of FMAD/FMAC, which combines a multiplication and an addition). 64 special operations per cycle can be added to this figure. A Core 2 Duo, seen from the point of view of its SSE units, can process 16 operations per cycle with its two cores (8 additions and 8 multiplications). It runs at much higher frequencies than 675 MHz, but it does not process FMAD/FMAC operations at full speed because it needs two units to process an operation of this type.
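These per-cycle rates can be turned into peak figures with simple arithmetic. The sketch below is a rough model built only from the numbers in the paragraph above; the helper name `gflops` is ours, and the CPU clock in the last lines is a purely illustrative value, not one taken from the article.

```python
MULTIPROCESSORS = 16
WARP_SIZE = 32
GPU_CLOCK_MHZ = 675
COMMON_CYCLES_PER_WARP = 2    # common instruction on one warp
SPECIAL_CYCLES_PER_WARP = 8   # special instruction on one warp

# Per-cycle rates for the whole chip
common_ops_per_cycle = MULTIPROCESSORS * WARP_SIZE // COMMON_CYCLES_PER_WARP
fmad_flops_per_cycle = 2 * common_ops_per_cycle  # FMAD = 1 mul + 1 add
special_ops_per_cycle = MULTIPROCESSORS * WARP_SIZE // SPECIAL_CYCLES_PER_WARP

# Core 2 Duo through its SSE units: 2 cores x (4 adds + 4 muls) per cycle
core2duo_ops_per_cycle = 2 * (4 + 4)


def gflops(ops_per_cycle, clock_mhz):
    """Peak rate in billions of operations per second."""
    return ops_per_cycle * clock_mhz / 1000


print(common_ops_per_cycle, fmad_flops_per_cycle, special_ops_per_cycle)
print(gflops(fmad_flops_per_cycle, GPU_CLOCK_MHZ))  # peak FMAD rate of the GPU

# Illustrative assumption only: a dual-core CPU clocked at 3000 MHz
ASSUMED_CPU_CLOCK_MHZ = 3000
print(gflops(core2duo_ops_per_cycle, ASSUMED_CPU_CLOCK_MHZ))
```

Per cycle, the GPU's FMAD rate is 32 times that of the dual-core CPU, but the CPU's much higher clock narrows the gap considerably once the rates are converted to operations per second.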

The following table shows the calculation power of the GeForce 8800 and of two Intel Core 2 Extreme processors in four different situations: floating point multiplication, floating point addition, half of each (the best case for the Core 2) and floating point multiply-add (the best case for GPUs because all units support this instruction).


The GeForce 8800 clearly has higher calculation power than the Core 2, including the quad-core. Nevertheless, the gap isn't always as big as we could have imagined, which means that a GeForce 8800 really has to be used efficiently to surpass a quad-core. We also have to keep in mind that the GeForce 8800 can execute more complex operations relatively fast and can also use its texture filtering units to accelerate some operations. If an algorithm is able to exploit these additional capabilities of the GPU, performance might explode compared to CPUs.

Copyright © 1997- Hardware.fr SARL. All rights reserved.