At a technical session on the GK110 architecture, we were able to pick up some details to add to what we brought you yesterday. This new information is of course focused on the compute side of the GPU. First of all, Nvidia presented an architecture diagram that clearly shows the GK110 is made up of 15 SMXs, each with 192 processing units (CUDA cores), for a total of 2880, with a 384-bit memory bus.
Moreover, we learned that the L2 cache is now 256 KB per 64-bit memory controller, making a total of 1.5 MB across the six controllers, against 768 KB for the GF1x0 and 512 KB for the GK104. As with the GK104, each portion of L2 cache has twice the bandwidth of its Fermi generation equivalent.
The fundamental processing blocks, called SMXs in the Kepler generation, are similar on the GK110 to those on the GK104. The number of single precision processing units is the same, as is the number of special function, load/store and texturing units. The caches are also identical, whether registers, L1/shared memory or texture caches.
The only fundamental difference lies in the increase in double precision processing units, which are up from eight per SMX on the GK104 to 64 on the GK110. So while the GK104 is 24x slower in this mode than in single precision, the GK110 will only be 3x slower. Coupled with the increase in the number of SMXs, this gives us a GK110 that can process 15x more of these instructions per cycle! Compared to the GF1x0, whose execution units run at twice the base clock, this represents a direct gain of 87.5% at equal GPU clocks, as the quick calculation below shows.
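As a back-of-the-envelope check on those figures (using the unit counts above, plus the GF110's 16 SMs with half-rate double precision; the GF110 line is doubled to account for its hot clock):

\[
\begin{aligned}
\text{GK110: } & 15 \times 64 = 960 \text{ DP FMAs per cycle}\\
\text{GK104: } & 8 \times 8 = 64 \quad\Rightarrow\quad 960 / 64 = 15\times\\
\text{GF110: } & 16 \times 16 \times 2 = 512 \quad\Rightarrow\quad 960 / 512 = 1.875\text{, i.e. } +87.5\%
\end{aligned}
\]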
In the GK110, as in the GK104, each SMX is fed by four schedulers, each capable of issuing two instructions. Not all the execution units can be accessed by all the schedulers, however, as an SMX is in practice separated into two symmetrical halves, within each of which two schedulers share the various units. Each scheduler has its own set of registers: 16384 32-bit registers (in fact 512 general registers of 32x32 bits). Moreover, each scheduler has a dedicated block of four texturing units accompanied by a 12 KB cache.
In contrast to what we were expecting, the L1 cache / shared memory system is the same on the GK110 as on the GK104 and remains proportionally smaller than what the Fermi generation provided. Nvidia has however introduced three small developments that can bring important gains:
Firstly, each thread can now have up to 256 registers allocated to it, as against 64 previously. What's the point of this if there's no increase in the number of physical registers? It gives the developer and the compiler more flexibility to juggle the number of threads against the number of registers allocated to each so as to maximise performance. This is particularly important for double precision processing, which takes up twice as many registers and was thus previously limited to just 32 values per thread. Nvidia says that increasing this to 128 gives impressive gains in certain cases.
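To illustrate this juggling (a hypothetical sketch, not code published by Nvidia), CUDA exposes the trade-off through the __launch_bounds__ qualifier, or globally through the -maxrregcount compiler flag:

    // Hypothetical register-hungry double precision kernel. Promising the
    // compiler at most 128 threads per block, with four blocks resident per
    // SMX, tells it that it can spend up to 65536 / (128 * 4) = 128
    // registers per thread while still meeting that occupancy target.
    // Each double consumes two 32-bit registers, hence the relevance of
    // the higher per-thread cap.
    __global__ void __launch_bounds__(128, 4)
    heavy_dp_kernel(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * in[i] + 1.0;  // stand-in for the real workload
    }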
The second small development consists in authorising direct access to the caches dedicated to texturing. It was previously possible to access them manually through the texturing units, but this method wasn't practical. With the GK110, these 12 KB caches can be exploited directly by the SMXs, though only for read-only data accesses. They have the advantage of providing excellent access to the GPU's memory subsystem, suffering less in the case of cache misses and handling non-aligned accesses better. The compiler (via a directive) calls on them when useful.
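For illustration, here is a minimal sketch of how this surfaces in CUDA on the GK110 (compute capability 3.5): qualifying a pointer as const and __restrict__ lets the compiler route its loads through the read-only cache, and the __ldg() intrinsic requests this path explicitly.

    __global__ void scale(float *out, const float * __restrict__ in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // const + __restrict__ promise the data is never written by the
            // kernel, so the load may go through the 12 KB read-only cache;
            // __ldg() makes the request explicit.
            out[i] = 2.0f * __ldg(&in[i]);
        }
    }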
Finally, a new instruction makes its appearance: SHFL. It enables the exchange of 32 bits of data per thread within a warp (a block of 32 threads). Its function is similar to that of the shared memory and thus comes as a kind of compensation for the relatively small quantity of shared memory (in proportion to the number of processing units). For data exchanges it thus becomes possible to save time (a direct transfer in place of a write followed by a read) and to economise on shared memory.
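The canonical use case is a warp-level reduction; a minimal sketch follows (spelled with the __shfl_down_sync form that later CUDA versions require; the original Kepler-era intrinsic is __shfl_down, without the mask argument).

    __device__ float warp_sum(float val)
    {
        // Each step adds in the value held by the lane 'offset' positions
        // higher; after five steps lane 0 holds the sum of all 32 lanes,
        // with no shared memory traffic and no explicit synchronisation.
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }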
There are also several other minor developments, such as the addition of a few previously missing 64-bit atomic operations (min/max and logical operations) and a 66% reduction in ECC overhead.
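As one concrete example, 64-bit atomicMax is among the operations that gain native support (CUDA only accepts this variant from compute capability 3.5, i.e. the GK110; previously it had to be emulated with a compare-and-swap loop):

    __global__ void track_max(unsigned long long *global_max,
                              const unsigned long long *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicMax(global_max, data[i]);  // a single native instruction on GK110
    }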
We can conclude, then, by saying that with the Kepler generation Nvidia has indeed taken a different route than it did with Fermi. The big Fermi GPU, the GF100/110, had a different internal organisation from the other GPUs in the family, increasing the control logic to the detriment of processing unit density and energy efficiency.
With the GK110, Nvidia didn't want to make the same energy compromise, or rather, couldn't afford to: it is now trying to do as much as it can within a thermal envelope that can no longer be extended. This is why the internal organisation of the GK110 is the same as the GK104's, with the exception of the double precision capacity, which has been increased significantly.
Thus, Nvidia hasn't tried to make its architecture any more complex to boost GPU computing performance; it has simply tried to do as much as it can with the available resources, settling for minor developments that can nonetheless have a major impact. This is also why the command processor has been revised to allow maximum use of the GPU, via the Hyper-Q and Dynamic Parallelism technologies that we described briefly yesterday and will return to in more detail as soon as possible.