Report: Nvidia GeForce GTX 460
by Damien Triolet
Published on July 29, 2010

Fermi architecture for gamers
To recap, the GF100 is built around a large structure called the GPC (Graphics Processing Cluster). There are four of these GPCs, each comprising a rasterizing unit and 4 SMs (Streaming Multiprocessors), and they are fed by six 64-bit memory controllers forming a 384-bit bus. For the GF104, NVIDIA has gone for half a GF100 with a 256-bit interface: it is based on 2 GPCs of 4 SMs each. Up to this point, fairly standard.

[Diagram: the GF104 block diagram, compared with the GF100.]

Taking a closer look, the SMs on the GF104 seem bigger, and indeed they are. In the GF100, each SM has 32 “cores” and 4 texturing units. In more detail, there are 2 schedulers supplying 5 execution blocks:

- 16-way SIMD0 (the “cores”): 16 FMA FP32
- 16-way SIMD1 (the “cores”): 16 FMA FP32
- 4-way SFU unit: 4 FP32 special functions or 8 interpolations
- 16-way 32-bit Load/Store unit
- 4-way texturing unit

For the GF104, NVIDIA wanted to add execution units at lower cost and increase the ratio of texturing units to processing units. The SMs have therefore been enlarged to 48 “cores” and 8 texturing units, a ratio that directly targets efficiency in games. In more detail, the GF104’s SMs have 2 dual-issue schedulers which supply 6 execution blocks:

- 16-way SIMD0 (the “cores”): 16 FMA FP32
- 16-way SIMD1 (the “cores”): 16 FMA FP32
- 16-way SIMD2 (the “cores”): 16 FMA FP32
- 8-way SFU unit: 8 FP32 special functions or 16 interpolations
- 16-way 32-bit Load/Store unit
- 8-way texturing unit
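The two SM layouts above can be compared with a little arithmetic. The sketch below is merely illustrative (the dictionary and function names are my own, not NVIDIA terminology); the numbers are taken directly from the two lists:

```python
# Per-SM execution resources as listed in the text (GF100 vs GF104).
# Names here are illustrative only; figures come from the article.
GF100_SM = {"fma_lanes": 2 * 16, "sfu": 4, "load_store": 16, "tex": 4}
GF104_SM = {"fma_lanes": 3 * 16, "sfu": 8, "load_store": 16, "tex": 8}

def tex_to_core_ratio(sm):
    """Texturing units per 'core' -- the ratio NVIDIA raised on the GF104."""
    return sm["tex"] / sm["fma_lanes"]

# GF100: 4/32 = 1/8 of a texturing unit per core;
# GF104: 8/48 = 1/6, i.e. a third more texturing capacity per core.
print(tex_to_core_ratio(GF100_SM))
print(tex_to_core_ratio(GF104_SM))
```

In other words, per “core” the GF104 has a third more texturing capacity, which is exactly the gaming-oriented rebalancing described above.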


The half-GF100 is thus transformed into a much faster GPU, with only a 25% deficit in main processing units overall and an identical number of texturing units and special-function execution units. Moreover, the texturing units have been improved to filter FP16 textures (as well as FP11, FP10 and RGB9E5) at full speed. Double precision performance, however, is cut back considerably (the GF104 cannot compute double precision at half-speed), which is in any case also true of the consumer versions of the GF100.
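The chip-level counts behind that paragraph can be checked from the figures given so far (4 GPCs × 4 SMs for the GF100, 2 GPCs × 4 SMs for the GF104, with the per-SM unit counts from the two lists). A quick sketch, with hypothetical helper names:

```python
# Chip-level unit counts implied by the article's per-SM figures.
def chip(sms, cores_per_sm, tex_per_sm, sfu_per_sm):
    return {"cores": sms * cores_per_sm,
            "tex":   sms * tex_per_sm,
            "sfu":   sms * sfu_per_sm}

gf100 = chip(16, 32, 4, 4)   # full GF100: 4 GPCs x 4 SMs
gf104 = chip(8, 48, 8, 8)    # GF104: 2 GPCs x 4 SMs

# 512 vs 384 cores -> a 25% deficit; 64 vs 64 texturing units and
# 64 vs 64 SFUs -> identical, as stated in the text.
deficit = 1 - gf104["cores"] / gf100["cores"]
print(gf100, gf104, deficit)
```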

While the GF104 can issue 4 instructions per SM per cycle against only 2 for the GF100, these come from 2 dual-issue schedulers, not 4 schedulers. The difference is subtle but marks a paradigm shift at NVIDIA. From the G80 to the GF100, NVIDIA's GPUs have naturally achieved optimal yield thanks to scalar processing of the executed program, in contrast to Radeon GPUs, which use vector units and are naturally less efficient.

Although each SM on the GF100 issues 2 instructions per cycle, they are executed on two different groups of data, warps of 32 threads. There is therefore never a dependency problem, and yield remains optimal since 2 instructions can almost always be scheduled. This changes with the GF104, on which each of the two schedulers can issue two instructions per warp so as to feed the additional units. These two instructions cannot, of course, be interdependent.

NVIDIA prefers to speak of a superscalar rather than a 2D vector architecture, given that each scheduler can issue any combination of independent instructions. To keep yield high, the compiler in the driver has been tweaked to organise code to suit this particularity. The problem is thus similar to the one faced by AMD GPUs, although on quite a different scale: if all instructions are scalar and dependent, the yield and raw processing power of the Radeons falls to 20%, whereas it can fall no lower than 66% on the GF104. And in the best case, yield equals that of the GF100 with half the number of SMs!
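Those worst-case figures follow directly from the unit counts. The sketch below is a simplified back-of-the-envelope model, not a cycle-accurate one:

```python
# Worst case: a chain of fully dependent scalar instructions.
# GF104: each of the 2 schedulers can still issue 1 instruction per cycle,
# so 2 of the 3 16-wide SIMD blocks stay busy -> 32 of 48 lanes.
gf104_worst = (2 * 16) / 48      # ~66.7%

# Radeon (VLIW5): only 1 of the 5 slots of each vector unit does
# useful work -> 20% of raw processing power.
radeon_worst = 1 / 5             # 20%

print(gf104_worst, radeon_worst)
```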

For the rest, the GF104’s SMs retain the same number of registers, the same 64 KB of L1 cache and shared memory (split 16/48 KB or 48/16 KB) and the Polymorph Engine, which handles a share of geometric operations such as vertex fetch, culling and tessellation. The unified L2 cache linked to the memory controllers is still there, but now 512 KB instead of 768 KB.
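The smaller L2 follows directly from the narrower bus, assuming (as the controller and cache counts given here suggest) one L2 slice per 64-bit memory controller:

```python
# L2 cache per 64-bit memory controller, implied by the bus widths above.
# Assumption: the L2 is partitioned evenly across the controllers.
gf100_l2_per_mc = 768 // 6   # six controllers (384-bit bus) -> 128 KB each
gf104_l2_per_mc = 512 // 4   # four controllers (256-bit bus) -> 128 KB each
print(gf100_l2_per_mc, gf104_l2_per_mc)
```

Under that assumption the GF104 keeps the same 128 KB of L2 per controller as the GF100; only the number of partitions shrinks with the bus.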

The architecture as a whole is therefore optimised to give the best results in current games. There is, however, a major limitation when it comes to the fillrate…


Copyright © 1997- Hardware.fr SARL. All rights reserved.