Tahiti: 2048 processing units, 32 ROPs and a 384-bit memory bus
As with all current GPUs, the execution units on Tahiti and its derivatives are organised into fundamental blocks that group together processing units, cache, texturing units, control logic and so on. AMD previously called these blocks SIMDs, which was confusing because SIMD is also the generic name for a vector processing unit. With GCN, AMD now refers to them as Compute Units (CUs). To be as explicit as possible, we will also use this term for the fundamental blocks of current Radeon GPUs and reserve the term SIMD for its original meaning: a vector processing unit. On the GeForces, remember, these blocks are referred to as Streaming Multiprocessors (SMs).
The first change on Tahiti (HD 7900) compared to Cayman (HD 6900) is that the number of CUs rises from twenty-four to thirty-two, with the same processing and texturing throughput per unit. This 33% increase, which takes the number of processing units from 1536 (384 vec4s) to 2048 and the number of texturing units from 96 to 128, will directly benefit performance. The CUs are also "scalar", which makes them more efficient (see next page). NVIDIA has used "scalar" units since the GeForce 8 series.
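The unit counts above follow from simple per-CU arithmetic. Here is a minimal sketch in Python, assuming 64 processing units and 4 texturing units per CU (values implied by the totals quoted above, 1536/24 and 96/24; the function and constant names are ours, not AMD's):

```python
# Per-CU resources, derived from the totals quoted in the text
# (1536 / 24 and 96 / 24). The names are illustrative only.
LANES_PER_CU = 64
TEX_UNITS_PER_CU = 4

def gpu_totals(num_cus):
    """Return (processing units, texturing units) for a given CU count."""
    return num_cus * LANES_PER_CU, num_cus * TEX_UNITS_PER_CU

cayman = gpu_totals(24)
tahiti = gpu_totals(32)
gain = tahiti[0] / cayman[0] - 1.0  # relative increase in processing units

print(cayman, tahiti, f"{gain:.0%}")
```

Since throughput per unit is unchanged, the same 33% applies to both peak processing and peak texturing throughput at equal clocks.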
The texturing units are unchanged and still filter HDR 64-bit (FP16) textures at half speed and HDR 128-bit (FP32) textures at quarter speed. Filtering quality has been tweaked to noticeably reduce flickering. AMD has also added hardware support for Partially Resident Textures (PRT), a technique comparable to the MegaTexture approach used in John Carmack’s id Tech 5. Engines that rely on this kind of technique can therefore be accelerated in hardware, but support will remain limited as Direct3D is not easily extensible (currently there's only a proprietary OpenGL implementation).
To feed these new CUs, AMD has gone from a 256-bit to a 384-bit memory bus, which gives 50% more bandwidth with identical memory. The number of ROPs is however decoupled from the memory controllers, something already seen on the Radeon HD 6790, and AMD has opted not to increase their number. There are therefore still 32, which means there’s no improvement in fillrate. It was already pretty high, so this isn’t much of an issue, especially as writing more than 32 pixels to memory also requires being able to generate more than 32! This was precisely the problem with the GeForce GTX 400s and 500s. The GeForce GTX 580, for example, can write 48 pixels to memory per cycle but can only generate 32, so the extra write capacity is only useful for accelerating multisample antialiasing.
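The 50% figure is simply the ratio of the bus widths when the memory data rate is held constant. Here is a rough sketch, in which the 5.5 Gbps per-pin data rate is an illustrative assumption rather than a figure from this article:

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps

DATA_RATE_GBPS = 5.5  # assumed per-pin data rate, identical for both buses

bw_256 = peak_bandwidth_gbs(256, DATA_RATE_GBPS)
bw_384 = peak_bandwidth_gbs(384, DATA_RATE_GBPS)
bw_gain = bw_384 / bw_256 - 1.0  # independent of the data rate chosen

print(bw_256, bw_384, f"{bw_gain:.0%}")
```

Whatever data rate is plugged in, the relative gain stays at 50% because it cancels out of the ratio.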
Can 32 ROPs properly use a 384-bit memory bus? Not always, but the ROPs are not the only consumers of memory bandwidth: texturing requires it too. In some cases, moreover, 32 ROPs are already limited by a 256-bit bus, for example when blending colours in HDR 64-bit and 128-bit formats. These modes will therefore make full use of the wider bus.
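To see why HDR blending can saturate the bus, compare the traffic the ROPs could generate with what the memory can deliver. The sketch below is a back-of-the-envelope model only. It assumes each blended pixel costs one read and one write of the full pixel, ignores colour compression and caches, assumes the ROPs blend one pixel per clock in every format (real hardware may be slower in some formats), and uses an illustrative 925 MHz core clock and 5.5 Gbps memory data rate that are not figures from this article.

```python
# Back-of-the-envelope model; the clock and data rate are illustrative assumptions.
CORE_CLOCK_GHZ = 0.925  # assumed core clock
DATA_RATE_GBPS = 5.5    # assumed per-pin memory data rate
NUM_ROPS = 32           # from the text

def rop_blend_demand_gbs(bytes_per_pixel, rops=NUM_ROPS, clock_ghz=CORE_CLOCK_GHZ):
    """Traffic in GB/s if every ROP blends one pixel per clock (one read + one write)."""
    return rops * clock_ghz * bytes_per_pixel * 2

def bus_supply_gbs(bus_width_bits, data_rate_gbps=DATA_RATE_GBPS):
    """Peak memory bandwidth in GB/s for a given bus width."""
    return bus_width_bits / 8 * data_rate_gbps

for name, bytes_per_pixel in [("HDR 64-bit (FP16)", 8), ("HDR 128-bit (FP32)", 16)]:
    demand = rop_blend_demand_gbs(bytes_per_pixel)
    for bus in (256, 384):
        supply = bus_supply_gbs(bus)
        print(f"{name}: demand {demand:.1f} GB/s vs {bus}-bit supply {supply:.1f} GB/s")
```

Under these assumptions the demand exceeds the supply on both buses, so in this simplified model any extra bandwidth translates directly into blending throughput.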
Like Cayman, Tahiti can process two triangles per cycle, with or without tessellation, against four for NVIDIA's GF100/110. The lack of progress here is however offset by several small optimisations that improve performance at high tessellation levels: bigger caches, lower penalties when video memory is used as a buffer, and the ability to reuse already-processed vertices (shared by neighbouring triangles) as often as possible. According to AMD, these optimisations can yield up to a 4x improvement over Cayman.
Implementing these additional units, along with all the architectural changes, means a huge increase in the number of transistors, up from 2.64 billion for Cayman to 4.31 billion for Tahiti. Thanks to the 28 nm process, Tahiti is nevertheless slightly smaller at 365 mm² compared to 389 mm² for Cayman. Note that AMD hasn’t yet given any details on which variant of the 28 nm fabrication process has been used.
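The effect of the new process is easiest to see as transistor density. Here is a quick calculation using only the figures quoted above (the helper name is ours):

```python
def density_mtr_per_mm2(transistors_billion, die_area_mm2):
    """Transistor density in millions of transistors per mm²."""
    return transistors_billion * 1000 / die_area_mm2

cayman_density = density_mtr_per_mm2(2.64, 389)
tahiti_density = density_mtr_per_mm2(4.31, 365)

transistor_gain = 4.31 / 2.64 - 1.0  # relative increase in transistor count
area_change = 365 / 389 - 1.0        # relative change in die area (negative = smaller)
density_ratio = tahiti_density / cayman_density

print(f"Cayman: {cayman_density:.2f} Mtr/mm², Tahiti: {tahiti_density:.2f} Mtr/mm²")
print(f"Transistors: {transistor_gain:+.0%}, area: {area_change:+.0%}, density: x{density_ratio:.2f}")
```

In other words, Tahiti packs roughly 63% more transistors into a die about 6% smaller, which works out to around 1.74 times the density of Cayman.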