GK104: Fermi on a diet
NVIDIA didn’t start from zero with the Kepler generation, because solid, durable foundations had already been laid with Fermi. Kepler is in fact a modest evolution of the Fermi architecture, designed to correct Fermi’s major weak point: its rather poor energy efficiency. This hasn’t been done for ecological reasons so much as to avoid slamming into a brick wall. Keeping the Fermi architecture as it was while moving to the 28 nanometre fabrication process would have meant losing some of the benefits of that process, as excessive energy consumption would have put a brake on increasing the complexity of the GPU.
While all the NVIDIA documentation compares the GK104 architecture with that of the GF100/110, this only confuses the picture. With the GF104/114, introduced with the GeForce GTX 460, NVIDIA offered a variant of its Fermi architecture optimised for gaming efficiency, whereas the big GPU was designed as a compromise that left more room for GPU Computing. It is of course this GF1x4 architecture that the GK104 should be compared to in order to properly understand what has changed. You can find our description of the differences between the GF1x4 and the GF100/110 here.
NVIDIA GPUs are built from fundamental blocks known as SMs, or Streaming Multiprocessors. Each SM contains a certain number of processing and texturing units, cache memory and control logic. Each group of 4 SMs forms a GPC (Graphics Processing Cluster) and has its own rasterizer, allowing each cluster to process small triangles efficiently. On the GK104, the SM has evolved to become what’s now called an SMX. Here’s a representation of the development from the GF1x4 SM on the left to the GK104 SMX on the right:
As you can see, the SMX dwarfs the SM! The number of main processing units has gone up from 48 to 192 and the number of texturing units from 8 to 16. Is this a radical change? Not really if we look a little closer.
With the SMX, NVIDIA has introduced a first energy optimisation: no more dual clocks for the processing units (a shader clock running at twice the core clock). Introduced with the G80 and the GeForce 8800 GTX, running certain units at double the core clock allowed NVIDIA to extract a lot more performance from relatively few processing units. Unfortunately, this approach comes at a cost, with higher energy demands for the units themselves as well as for the distribution of the clock signal.
With the move down to the 28 nanometre process, NVIDIA is less limited by the surface area taken up by the units than by the energy required to run them. It therefore no longer makes sense to persist with the higher energy solution, and on the GK104 NVIDIA has dropped the higher shader clock and doubled the number of processing units to compensate. This includes the special function units (SFUs) but not the units that handle double precision, the rate of which therefore drops to 1/24th of the single precision rate. This explains half of the evolution from the SM to the SMX!
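The trade-off described above comes down to simple arithmetic, sketched here in Python. The function name and the idea of counting "operations per core clock" are our own illustration, not NVIDIA terminology; the unit counts (48 and 192) and the doubled shader clock come from the text.

```python
# Toy arithmetic: FP32 operations issued per *core* clock.
# `lanes_per_core_clock` is a hypothetical helper used only for illustration.

def lanes_per_core_clock(units: int, clock_multiplier: int) -> int:
    """Operations per core clock for `units` running at `clock_multiplier` x core clock."""
    return units * clock_multiplier

# GF1x4 SM: 48 units on a shader clock that is twice the core clock.
fermi_sm = lanes_per_core_clock(units=48, clock_multiplier=2)

# Half of a GK104 SMX: twice the units, all running at the core clock.
kepler_half_smx = lanes_per_core_clock(units=96, clock_multiplier=1)

# A full SMX is two such halves.
kepler_smx = 2 * kepler_half_smx

print(fermi_sm, kepler_half_smx, kepler_smx)
```

In other words, half an SMX does the same work per core clock as a GF1x4 SM, only with more units running at a lower clock.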
For the other half, an SMX should in fact be seen as two SMs joined together so that they share the same L1 cache. This reduces the overall cost of the cache, which isn’t all that useful in games because the texturing units have their own dedicated caches. Remember that part of this L1 cache serves as shared memory, allowing the various threads processed in parallel to communicate during GPU Computing usage.
Fermi GPUs could share their 64 KB between an L1 part of either 16 KB or 48 KB and a shared memory part of either 48 KB or 16 KB.
The GK104 also introduces a 32 KB / 32 KB mode, which is a better match for the DirectX 11 specifications. The cache bandwidth has also been doubled.
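To make the three modes concrete, here is a minimal Python sketch that picks, for a kernel needing a given amount of shared memory, the split that leaves the most L1. The selection rule and the names `FERMI_SPLITS`, `KEPLER_SPLITS` and `best_split` are assumptions made for illustration; only the split sizes come from the text.

```python
# (L1 KB, shared memory KB) splits of the 64 KB block, as listed in the text.
FERMI_SPLITS = [(48, 16), (16, 48)]
KEPLER_SPLITS = FERMI_SPLITS + [(32, 32)]

def best_split(shared_needed_kb: int, splits: list) -> tuple:
    """Return the split with the most L1 that still offers enough shared memory.

    Hypothetical selection rule, for illustration only.
    """
    fitting = [s for s in splits if s[1] >= shared_needed_kb]
    if not fitting:
        raise ValueError(f"no split offers {shared_needed_kb} KB of shared memory")
    return max(fitting, key=lambda s: s[0])

# A workload that needs 32 KB of shared memory:
print(best_split(32, FERMI_SPLITS))   # only the 16 KB L1 / 48 KB shared mode fits
print(best_split(32, KEPLER_SPLITS))  # the 32 KB / 32 KB mode fits and keeps more L1
```

With the Fermi modes, a 32 KB request forces the 16 KB L1 configuration; the extra Kepler mode leaves twice as much L1 for the same request.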
Apart from the cache, the two halves of an SMX are independent of each other. Thus the first two schedulers can only access the first half of the execution units and the other two the second half. Just like with the GF1x4, what we have here is a superscalar architecture: for any given warp (group of 32 threads), it must be possible to issue at least 50% of the mathematical instructions as pairs if the processing units are to be fully used. Strictly speaking this isn’t therefore a scalar architecture, but the compiler’s work remains relatively simple.
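The 50% figure can be illustrated with a toy issue model for one half of an SMX: two schedulers feed three 32-way SIMD units, and each warp instruction keeps one SIMD unit busy for one cycle. This is our own simplification, reading the 50% as the share of issue cycles on which a scheduler manages to issue a second, independent instruction; the function name and the model are assumptions, not a description of the real hardware.

```python
def simd_utilisation(dual_issue_fraction: float,
                     schedulers: int = 2,
                     simd_units: int = 3) -> float:
    """Fraction of SIMD units kept busy in a toy model of half an SMX.

    Each scheduler issues one instruction per cycle, plus a second one on
    `dual_issue_fraction` of the cycles (assumed model, for illustration).
    """
    if not 0.0 <= dual_issue_fraction <= 1.0:
        raise ValueError("dual_issue_fraction must be between 0 and 1")
    issued_per_cycle = schedulers * (1.0 + dual_issue_fraction)
    return min(1.0, issued_per_cycle / simd_units)

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"dual issue on {p:.0%} of cycles -> {simd_utilisation(p):.0%} SIMD utilisation")
```

Without any pairing, two schedulers can feed only two of the three SIMD units; in this model, pairing on half of the cycles is what it takes to keep the third one busy.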
Each scheduler has its own registers (4096 x 32 bits) and its own group of four texturing units (each with its own small dedicated cache). It can issue two instructions per cycle but must share the following resources with a second scheduler:
- SIMD0 32-way unit (the “cores”): 32 FMA FP32 or 4 FMA FP64
- SIMD1 32-way unit (the “cores”): 32 FMA FP32
- SIMD2 32-way unit (the “cores”): 32 FMA FP32
- SFU 16-way unit: 16 FP32 special functions or 32 interpolations
- Load/Store 16-way 64-bit unit
Note that this last point isn’t very clear. NVIDIA says that the Load/Store capacity of an SMX is the same as that of a Fermi SM when it comes to 32-bit transactions but doubled for 64-bit ones. We therefore suppose that the diagram, which is a simplification of a very complex architecture, is partly wrong and that the two halves of an SMX in fact share these resources. 64-bit load/stores, however, cost no more than 32-bit ones, and NVIDIA notes that the former type of access is more often a limiting factor than the latter.
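Leaving the Load/Store question aside, the unit list above can be tallied into per-SMX peak rates. The short Python sketch below does this, assuming (as the text describes) that an SMX is two identical halves; the constant names are ours.

```python
from fractions import Fraction

HALVES_PER_SMX = 2

# Per-cycle rates for one half of an SMX, taken from the unit list above.
FP32_PER_HALF = 3 * 32    # SIMD0 + SIMD1 + SIMD2, 32 FMA FP32 each
FP64_PER_HALF = 4         # only SIMD0 handles FP64, at 4 FMA per cycle
SFU_PER_HALF = 16         # FP32 special functions
INTERP_PER_HALF = 32      # interpolations

fp32 = HALVES_PER_SMX * FP32_PER_HALF
fp64 = HALVES_PER_SMX * FP64_PER_HALF
sfu = HALVES_PER_SMX * SFU_PER_HALF
interp = HALVES_PER_SMX * INTERP_PER_HALF

fp64_ratio = Fraction(fp64, fp32)

print(f"FP32 FMA per cycle per SMX: {fp32}")
print(f"FP64 FMA per cycle per SMX: {fp64} ({fp64_ratio} of FP32)")
print(f"Special functions per cycle per SMX: {sfu}")
print(f"Interpolations per cycle per SMX: {interp}")
```

The tally reproduces both the 192 processing units per SMX and the 1/24 double precision rate mentioned earlier.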
We now come to the second development designed to reduce the architecture’s energy footprint. Fermi schedulers use scoreboarding to check constantly which registers are in use (and therefore possibly being written to) so as to determine which instruction can be issued on which group of data. Kepler still uses scoreboarding, which is important as some instructions have very high latency, but dispenses with it where it isn’t required.
The throughput and latency of mathematical instructions are deterministic, so the compiler can predict the exact behaviour of the mathematical instructions it issues and no longer needs to call on Fermi’s complex hardware scheduling to process sequences of instructions within a warp. This means that Kepler only resorts to such scheduling for instructions of indeterminate latency (texturing, load, store) and to determine which warp to start working on. This approach allows Kepler to reduce the energy consumed around the processing units.
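The principle can be sketched in a few lines of Python: when every latency is fixed and known in advance, the issue cycle of each instruction in a warp can be computed ahead of time rather than tracked at run time. This is a toy model only; the `Instr` type, the register names and the 4-cycle latency are hypothetical, and the way the compiler actually conveys this information to the hardware is not described here.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dest: str        # register written
    srcs: tuple      # registers read
    latency: int     # fixed latency in cycles (hypothetical value)

def static_issue_cycles(program: list) -> list:
    """Earliest in-order issue cycle of each instruction, computed ahead of time."""
    ready_at = {}    # register -> cycle at which its value becomes available
    next_free = 0    # earliest cycle at which the next instruction may issue
    cycles = []
    for ins in program:
        # Wait for the issue slot and for every source operand to be ready.
        start = max([next_free] + [ready_at.get(r, 0) for r in ins.srcs])
        cycles.append(start)
        ready_at[ins.dest] = start + ins.latency
        next_free = start + 1
    return cycles

program = [
    Instr("r2", ("r0", "r1"), latency=4),  # r2 = r0 * r1
    Instr("r3", ("r0", "r1"), latency=4),  # r3 = r0 + r1 (independent of r2)
    Instr("r4", ("r2", "r3"), latency=4),  # r4 = r2 + r3 (waits for both)
]
print(static_issue_cycles(program))
```

The same calculation cannot be done for a texture fetch or a memory load, whose latency is not known in advance, which is why scoreboarding remains for those.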
Note finally that as an SMX is basically two SMs, the pixel and triangle throughputs of an SMX are double those of an SM, namely one triangle (= one vertex fetch) every two cycles and four 32-bit pixels per cycle.
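Scaled up to a whole chip, these per-SMX figures give the following totals. The sketch below assumes 8 SMXs, the count for a full GK104, which is not stated in this section; the function name is ours.

```python
from fractions import Fraction

TRIANGLES_PER_SMX = Fraction(1, 2)   # one triangle every two cycles
PIXELS_PER_SMX = 4                   # four 32-bit pixels per cycle

def chip_rates(num_smx: int) -> tuple:
    """Peak triangles and pixels per cycle for `num_smx` SMXs (illustrative helper)."""
    return num_smx * TRIANGLES_PER_SMX, num_smx * PIXELS_PER_SMX

# Assumption: a full GK104 with 8 SMXs.
triangles, pixels = chip_rates(8)
print(f"{triangles} triangles/cycle, {pixels} pixels/cycle")
```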