Nvidia GeForce GF100: the geometry revolution?
by Damien Triolet
Published on January 26, 2010

The GF100
We won’t go back over the compute side of the architecture, which we have already covered in detail in our article on Fermi. Fermi is the name of the architecture; the GF100 is its implementation.

To recap, the G8x, G9x and GT2xx GPUs were built from TPCs (Texture Processing Clusters), each containing 2 or 3 SMs (Streaming Multiprocessors) and a group of 8 texturing units (with limitations on the G80). The GT200, for example, has 10 TPCs, each with 3 SMs that share 8 texturing units. Above these TPCs sits a single group of specialised units that handles task preparation, triangle setup, rasterisation and so on.


The GF100 is made up of 4 large blocks called GPCs (Graphics Processing Clusters). The specialised units now sit at the level of the GPCs and SMs rather than in a single group above them. This makes the GF100 the first GPU that can process more than one triangle per cycle! We’ll come back to this. There are 4 SMs in each GPC, for a total of 16. Another important change concerns the texturing units, which are no longer attached to the cluster as a whole but to each SM. This is why NVIDIA abandoned the term TPC in favour of the new one, GPC. In the GF100 there are four texturing units per SM. SMs therefore no longer share texturing units, which simplifies the design and improves efficiency.
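As a quick check on the figures above, the unit counts of the two layouts can be worked out in a few lines. This is only a sketch: the `GpuLayout` class and its field names are hypothetical and invented for illustration, and the only inputs are the cluster, SM and texturing unit counts quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    """Hypothetical description of a GPU layout (not an NVIDIA data structure)."""
    name: str
    clusters: int          # TPCs on the GT200, GPCs on the GF100
    sms_per_cluster: int
    tex_units: int         # texturing units per group
    tex_per_sm: bool       # True: one group per SM; False: one group per cluster

    @property
    def total_sms(self) -> int:
        return self.clusters * self.sms_per_cluster

    @property
    def total_tex_units(self) -> int:
        groups = self.total_sms if self.tex_per_sm else self.clusters
        return groups * self.tex_units


# Figures taken from the text: 10 TPCs of 3 SMs sharing 8 texturing units,
# versus 4 GPCs of 4 SMs with 4 texturing units per SM.
GT200 = GpuLayout("GT200", clusters=10, sms_per_cluster=3, tex_units=8, tex_per_sm=False)
GF100 = GpuLayout("GF100", clusters=4, sms_per_cluster=4, tex_units=4, tex_per_sm=True)

for gpu in (GT200, GF100):
    print(gpu.name, gpu.total_sms, "SMs,", gpu.total_tex_units, "texturing units")
```

With these inputs the GT200 comes out at 30 SMs and 80 texturing units, and the GF100 at 16 SMs and 64 texturing units.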

Opting for decoupled texturing units (AMD from the R520 to the RV670) or semi-decoupled ones (NVIDIA G80 and GT200) was a nice idea on paper. It made it easy to evolve an architecture towards a higher ratio of processing power to texturing power, isolated a fixed function from the programmable core and maximised utilisation by letting all units be used whenever the GPU needed them. In practice, however, the gain in utilisation was smaller than expected and didn’t make up for the efficiency lost to the more complex design. AMD therefore took a step backward with the Radeon HD 4000s and NVIDIA has now done the same, which shows that architectural developments can also be counter-productive.


Each SM has a dual scheduler that can, on each cycle, issue an instruction to 2 of these 5 execution blocks:

- 16-way SIMD 0 (the “cores”): 16 FMA FP32, 16 ADD INT32, 16 MUL INT32
- 16-way SIMD 1 (the “cores”): 16 FMA FP32, 16 ADD INT32
- four SFUs: 4 FP32 special functions or 16 interpolations
- 16-way 32-bit Load/Store unit
- texturing units

Latency and throughput differ from one instruction to the next, but the execution blocks are all decoupled. This means, for example, that a special function that takes several cycles won’t stop the scheduler from sending an instruction to another execution block, so at a given moment all of them may be running. Note that this leaves aside the FP64 FMA, which uses both SIMD 0 and SIMD 1 and isn’t used in graphics rendering.
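To make the decoupling concrete, here is a toy model of a scheduler that issues at most 2 instructions per cycle to the five execution blocks listed above. It is a sketch only: the `OCCUPANCY` values are hypothetical numbers chosen for illustration rather than GF100 latencies, the `simulate` function is our own invention, and instructions are assumed to be independent of one another (as if they came from different warps).

```python
BLOCKS = ["SIMD0", "SIMD1", "SFU", "LDST", "TEX"]

# Hypothetical number of cycles a block stays busy per instruction.
# These are illustrative values, NOT measured GF100 figures.
OCCUPANCY = {"SIMD0": 2, "SIMD1": 2, "SFU": 8, "LDST": 2, "TEX": 4}


def simulate(stream, cycles, issue_width=2):
    """Return, for each cycle, the sorted list of blocks that are busy.

    `stream` lists the target block of each pending instruction. On every
    cycle the scheduler issues up to `issue_width` instructions whose
    target block is free; a busy block never stalls the other blocks.
    """
    busy_until = {block: 0 for block in BLOCKS}
    pending = list(stream)
    timeline = []
    for cycle in range(cycles):
        issued = 0
        remaining = []
        for target in pending:
            if issued < issue_width and busy_until[target] <= cycle:
                busy_until[target] = cycle + OCCUPANCY[target]
                issued += 1
            else:
                remaining.append(target)
        pending = remaining
        timeline.append(sorted(b for b in BLOCKS if busy_until[b] > cycle))
    return timeline


timeline = simulate(["SFU", "TEX", "SIMD0", "SIMD1", "LDST"], cycles=4)
for cycle, busy in enumerate(timeline):
    print(cycle, busy)
```

In this run the long special function issued on cycle 0 is still in flight when the scheduler feeds the two SIMDs on cycle 1 and the Load/Store unit on cycle 2, so on cycle 2 all five blocks are busy even though no more than 2 instructions were issued on any cycle.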

Note that the notion of a “core” becomes even harder to pin down with the GPC. Which structure should the term apply to? The GPC? The SM? Each lane in a SIMD unit? NVIDIA of course goes for the last option and talks about 512 “CUDA Cores”. We think it makes more sense to call the SMs “cores”. By our terminology, then, the GF100 has 16 cores.
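The three candidate definitions give very different counts, as a short calculation shows. The constant and key names below are ours, chosen for illustration; the figures are those given in the text (4 GPCs, 4 SMs per GPC, two 16-way SIMDs per SM).

```python
GPCS = 4
SMS_PER_GPC = 4
SIMDS_PER_SM = 2      # SIMD 0 and SIMD 1
LANES_PER_SIMD = 16   # 16-way SIMD

total_sms = GPCS * SMS_PER_GPC

# Number of "cores" depending on which structure the word is applied to.
core_counts = {
    "gpc": GPCS,
    "sm": total_sms,                                    # our preferred definition
    "simd_lane": total_sms * SIMDS_PER_SM * LANES_PER_SIMD,  # NVIDIA's "CUDA Cores"
}

for definition, count in core_counts.items():
    print(f"{definition}: {count}")
```

The lane-based count is 32 times the SM-based one, which is simply the 2 x 16 lanes that each SM contains.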

Copyright © 1997- Hardware.fr SARL. All rights reserved.