RV870 or Cypress
As you can imagine, to fully support Direct3D 11, AMD had to develop a new GPU. The architectural changes required were minor, however: previous cards already supported Direct3D 10.1 and tessellation units have been integrated for several generations, so the Radeon HD 4000s weren’t far off. This is partly why AMD was quicker off the mark with its card than NVIDIA.
Cypress (the GPU's codename; it is also called the RV870) uses an architecture comparable to that of the Radeon HD 4800s and 4700s, but doubles the number of execution units and packs an amazing 2.15 billion transistors into a die area of 334 mm²! That is enough to take the card up to 2.72 teraflops.
The Radeon 5800 GPU is made up of large 2560-bit SIMD units, each including 16 vec5 shaders. Here AMD (much in the same way as NVIDIA does) talks about 80 cores per SIMD unit. We see this as a misleading attempt to play up GPU capacities in comparison to CPUs. That said, while AMD has retained the old structure when it comes to execution units, it has doubled the number of SIMDs from 10 to 20, going from 160 vec5 shaders to 320, or from 800 to 1600 stream processing units (cores).
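The unit counts above, and the 2.72 teraflops quoted earlier, can be cross-checked with a little arithmetic. The sketch below is ours rather than AMD's. It assumes 32-bit lanes and the usual convention of counting a multiply-add as 2 floating-point operations per unit per cycle; the clock it prints is only the value implied by those assumptions, not a figure given on this page.

```python
# Cross-check of the Cypress unit counts quoted in the text.
# Assumptions (not stated in the article): 32-bit lanes, and peak FLOPS
# counted as 2 operations (one multiply-add) per unit per cycle.

SIMDS = 20                     # from the text
VEC5_PER_SIMD = 16             # from the text
LANES_PER_VEC5 = 5             # "vec5"
LANE_BITS = 32                 # assumption: FP32 lanes
FLOPS_PER_LANE_PER_CYCLE = 2   # assumption: multiply + add
PEAK_FLOPS = 2.72e12           # "2.72 teraflops", from the text

simd_width_bits = VEC5_PER_SIMD * LANES_PER_VEC5 * LANE_BITS
vec5_total = SIMDS * VEC5_PER_SIMD
stream_processors = vec5_total * LANES_PER_VEC5
cores_per_simd = VEC5_PER_SIMD * LANES_PER_VEC5

# Clock frequency implied by the quoted peak under the assumptions above.
implied_clock_mhz = (
    PEAK_FLOPS / (stream_processors * FLOPS_PER_LANE_PER_CYCLE) / 1e6
)

print(simd_width_bits)      # 2560
print(vec5_total)           # 320
print(stream_processors)    # 1600
print(cores_per_simd)       # 80
print(implied_clock_mhz)    # 850.0
```

The 2560-bit SIMD width, the 80 "cores" per SIMD and the 1600 stream processing units all fall out of the same three numbers, which is why we consider the "core" count a matter of presentation rather than a different architecture.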
The internal structure of the vec5 shaders remains more or less the same, of a 4+1 type. The first 4 units are identical and can each execute one single precision FP32 instruction per cycle, one INT24 instruction per cycle, an FP64 addition in 2 cycles or an FP64 multiplication (or multiplication + addition) in 4 cycles. The 5th unit is distinct and can execute either one single precision FP32 instruction per cycle, one INT32 instruction per cycle or one special FP32 instruction per cycle.
A significant development is that Cypress supports the new IEEE 754-2008 standard required by Direct3D 11. This adds support for the FMA (Fused Multiply Add) instruction which, in contrast to the standard MAD (Multiply Add) instruction, retains the full intermediate result of the multiplication and therefore gives higher precision. Because of the way floating-point numbers work, this in turn allows certain other instructions to be processed very efficiently. AMD has taken advantage of this by making the scalar product more efficient and adding a native SAD (Sum of Absolute Differences) instruction that greatly accelerates certain algorithms.
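To make the difference between MAD and FMA concrete, here is a minimal sketch that emulates both on the CPU. It is an illustration only: it uses Python's double-precision floats rather than the GPU's FP32, the function names `mad` and `fma` are ours, and the fused version is emulated with exact rational arithmetic rather than a hardware instruction.

```python
from fractions import Fraction


def mad(a, b, c):
    """Multiply-add with two roundings: the product is rounded, then the sum."""
    return (a * b) + c


def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product, 1 - 2**-54, cannot be held
# in a double and is rounded to 1.0 by the separate multiplication.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(mad(a, b, c))   # 0.0
print(fma(a, b, c))   # -5.551115123125783e-17, i.e. -2**-54
```

With the separate multiply and add, the small term is lost when the product is rounded, so the result collapses to zero. With a single rounding at the end it survives.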
When you use vec5 units like these rather than the scalar units in NVIDIA GPUs, the compiler has to try to find 5 independent instructions to process in parallel. This is often impossible, which is what makes this type of architecture less efficient per unit. It does however allow AMD to build many more units into its GPUs.
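The packing problem can be illustrated with a toy scheduler. This is a simplified sketch of our own, not AMD's compiler: it treats the 5 slots as interchangeable (ignoring the 4+1 distinction), assumes every operation takes one cycle, and greedily fills each bundle with whatever operations have all their inputs ready. The names `pack_bundles` and `utilization` are hypothetical.

```python
def pack_bundles(deps, width=5):
    """Greedily pack operations into bundles of at most `width` slots.

    `deps` maps each operation name to the set of operations whose
    results it needs. An operation can only be issued in a bundle that
    comes after all of its dependencies.
    """
    done, pending, bundles = set(), dict(deps), []
    while pending:
        ready = [op for op, needs in pending.items() if needs <= done]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        for op in bundle:
            del pending[op]
    return bundles


def utilization(bundles, width=5):
    """Fraction of available slots that actually carry an operation."""
    issued = sum(len(bundle) for bundle in bundles)
    return issued / (len(bundles) * width)


# A 4-component dot product: 4 independent multiplies, then a tree of adds.
dot4 = {
    "m0": set(), "m1": set(), "m2": set(), "m3": set(),
    "a0": {"m0", "m1"}, "a1": {"m2", "m3"}, "a2": {"a0", "a1"},
}
# A fully serial chain: each operation needs the previous result.
chain = {"c0": set(), "c1": {"c0"}, "c2": {"c1"}, "c3": {"c2"}}

print(pack_bundles(dot4))                 # 3 bundles: 4 ops, 2 ops, 1 op
print(utilization(pack_bundles(dot4)))    # 7 of 15 slots used
print(utilization(pack_bundles(chain)))   # 0.2
```

In this model the dot product fills fewer than half of the available slots and a dependent chain fills only one in five, which is the per-unit inefficiency described above.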
Each SIMD has 16384 128-bit registers, as on previous generations, which lets it keep a large number of threads in flight and therefore hide latency better.
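A rough calculation shows what this register file means in practice. The sketch below is a simplification of ours: it uses the 64-thread blocks mentioned elsewhere in this article and considers registers as the only limit on the number of blocks in flight, whereas real hardware has other limits that we do not model here. The function names are hypothetical.

```python
REGISTERS_PER_SIMD = 16384   # from the text
REGISTER_BITS = 128          # from the text
THREADS_PER_BLOCK = 64       # block size per SIMD, from the article


def register_file_kib():
    """Size of one SIMD's register file in KiB."""
    return REGISTERS_PER_SIMD * REGISTER_BITS // 8 // 1024


def max_blocks_in_flight(registers_per_thread):
    """Upper bound on resident 64-thread blocks, from registers alone."""
    return REGISTERS_PER_SIMD // (THREADS_PER_BLOCK * registers_per_thread)


print(register_file_kib())         # 256
print(max_blocks_in_flight(8))     # 32
print(max_blocks_in_flight(32))    # 8
```

The fewer registers a shader needs per thread, the more blocks can be resident at once and the more work is available to cover a memory access.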
As well as its arithmetic logic units, each vec5 processor has a dedicated branch unit, so branching costs less performance, at least as long as there is no divergence: each SIMD works on blocks of 64 threads (as against 32 with NVIDIA). As soon as threads diverge, the two branches must be executed one after the other, with a mask ensuring that results are written only for the threads that take each branch.
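The masking mechanism can be modelled in a few lines. This is a conceptual sketch of ours, not a description of the actual hardware: `run_branch` is a hypothetical helper that processes one block of threads, counts how many passes over the block are needed, and writes a result only for the threads enabled by the mask in each pass.

```python
def run_branch(values, cond, then_fn, else_fn):
    """Model an if/else over one block of threads with an execution mask.

    Returns the per-thread results and the number of passes needed:
    1 if all threads agree on the branch, 2 if they diverge.
    """
    mask = [cond(v) for v in values]
    results = list(values)
    passes = 0
    if any(mask):
        # "then" pass: only threads whose mask bit is set write a result.
        passes += 1
        for i, v in enumerate(values):
            if mask[i]:
                results[i] = then_fn(v)
    if not all(mask):
        # "else" pass: only the remaining threads write a result.
        passes += 1
        for i, v in enumerate(values):
            if not mask[i]:
                results[i] = else_fn(v)
    return results, passes


block = list(range(64))  # one block of 64 threads

# Every thread takes the same branch: a single pass is enough.
_, uniform_passes = run_branch(
    block, lambda x: x >= 0, lambda x: x * 2, lambda x: x + 1)

# Even and odd threads disagree: both branches must be run in turn.
diverged, diverged_passes = run_branch(
    block, lambda x: x % 2 == 0, lambda x: x * 2, lambda x: x + 1)

print(uniform_passes)     # 1
print(diverged_passes)    # 2
print(diverged[:4])       # [0, 2, 4, 4]
```

Note that a single diverging thread out of 64 is enough to force the second pass for the whole block.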
Again to give full support for Direct3D 11, each SIMD now has 32 KB of shared memory rather than 16 KB (Radeon HD 4600 to 4800). A further 64 KB of global shared memory, common to all the SIMDs, is also available and is exposed via an extension in OpenCL.
There is no change when it comes to the texturing units: there are still 4 per SIMD. AMD says however that it has increased the bandwidth feeding the L1 caches, which should improve efficiency.
In the AMD documentation, the emphasis placed on the addition of a second rasterizer made us think that Cypress might be the first GPU to rasterize two triangles per cycle; up until now this has been the only stage not to have been parallelized. Doing so poses several challenges and we momentarily thought AMD might have come up with a solution. However, when questioned, the manufacturer told us that the second rasterizer was added simply to feed the 20 SIMDs and 32 ROPs, and that there is still only one setup engine and only one triangle processed per clock cycle. Instead of one complex rasterizer to break triangles down into pixels, we have two simpler ones, which boils down to the same thing.
Note that, as in all previous GPUs, tessellation is handled by fixed-function units (it could have been done via a program executed in the SIMDs). However, the fixed-function units for interpolation have now disappeared and interpolation is carried out by the shader processing units.
ROPs doubled and 256-bit bus
As on the RV740, AMD has doubled the number of ROPs per memory controller. There are still four 64-bit controllers for a total memory bus of 256 bits, but Cypress has 32 ROPs. To support the extra load, the L2 cache for each controller has been doubled to 128 KB.
While processing power has doubled, the memory bus remains the same and AMD won’t be able to count on faster GDDR5 memory for its new GPU. This means that the processing power to memory bandwidth ratio will be significantly higher and in some cases risks limiting Cypress.
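The effect on the compute-to-bandwidth ratio can be sketched numerically. The bus width, ROP count, L2 size and 2.72 teraflops peak come from this article; the per-pin data rate below is a purely illustrative placeholder of ours, not a specification of any card, and the "before" compute figure is simply half the quoted peak, following the statement that processing power has doubled.

```python
def bandwidth_gb_s(bus_bits, gbit_per_pin):
    """Memory bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_bits / 8 * gbit_per_pin


def flops_per_byte(peak_flops, bandwidth_gb):
    """Floating-point operations available per byte of memory traffic."""
    return peak_flops / (bandwidth_gb * 1e9)


CONTROLLERS = 4                           # from the text
BUS_BITS = CONTROLLERS * 64               # four 64-bit controllers
ROPS_PER_CONTROLLER = 32 // CONTROLLERS   # 32 ROPs in total
L2_TOTAL_KB = CONTROLLERS * 128           # 128 KB of L2 per controller

# Hypothetical placeholder, NOT a figure from the article.
ILLUSTRATIVE_GBIT_PER_PIN = 4.0

bw = bandwidth_gb_s(BUS_BITS, ILLUSTRATIVE_GBIT_PER_PIN)
before = flops_per_byte(2.72e12 / 2, bw)  # half the compute, same memory
after = flops_per_byte(2.72e12, bw)       # doubled compute, same memory

print(BUS_BITS, ROPS_PER_CONTROLLER, L2_TOTAL_KB)   # 256 8 512
print(after / before)                               # 2.0
```

Whatever memory speed is plugged in, doubling compute on an unchanged memory subsystem doubles the number of operations each byte fetched has to sustain, which is where the risk of a bandwidth limit comes from.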