Cayman
With Cayman, AMD took a risk by revising an aspect of its processing-unit architecture that hadn’t changed since the Radeon HD 2900 XT. Broadly speaking, AMD’s processing units up to now can be described as 5-wide vector units, or vec5, which means they can execute up to 5 instructions in parallel. With such an architecture, however, if the code being processed doesn’t allow that many instructions to be parallelised, the units aren’t fully exploited, in contrast to NVIDIA’s scalar architecture, which maintains high efficiency in most situations. Both approaches are equally valid.
Note however that a processing unit isn’t the same as a “core”, a marketing notion used by NVIDIA to invite comparison with CPUs and adopted by AMD, which counts 5 cores per vec5 processing unit. Overall, you can look at things in two ways: you get more out of an AMD vec5 unit than an NVIDIA scalar unit, or you get less out of an AMD core than an NVIDIA core.
Close up on a Cypress vec5 processing unit.
Close up on a Cayman vec4 processing unit.
With the GF104 used on the GeForce GTX 460 and its derivatives, NVIDIA moved towards vector-like operation to improve efficiency. AMD is trying to do the same with Cayman but from the other direction, dropping from vec5 to vec4. The Cayman processing units are therefore less powerful than those on previous AMD GPUs: statistically they are more efficient, but they don’t deliver higher performance per unit. This is an important nuance. As they are simpler, however, they take up less space and draw less energy, which means you can have more of them, all other things being equal.
To look a little more closely, previous Radeon GPUs were based on 4+1 processing units, in which one execution lane was able to handle complex instructions. It’s this “+1” that AMD has decided to get rid of. Complex instructions must now be processed on the other lanes as a succession of simpler operations. They take up 3 of the 4 execution lanes, which makes them much more costly in resources: only one simple instruction can be processed alongside a complex one, compared to 4 before.
Without this slightly unusual “+1”, which was difficult to keep fed in certain circumstances, the compiler’s task is greatly simplified. This even means that the vec4 units will outperform the vec5s in some cases, but overall AMD now needs more vec4 processing units to maintain the same level of performance.
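As a rough illustration of the trade-off described above, here is a minimal Python sketch that counts the cycles a single unit would need for a mix of simple and complex instructions. It assumes a perfect compiler, fully independent instructions, and that the Cypress “+1” lane can also take a simple instruction. The function names are hypothetical and this is a toy model, not a description of AMD’s actual scheduling.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def cycles_vec5(simple: int, complex_ops: int) -> int:
    """Cypress-style 4+1 unit: 5 slots per cycle, only one accepts a complex op."""
    return max(complex_ops, ceil_div(simple + complex_ops, 5))


def cycles_vec4(simple: int, complex_ops: int) -> int:
    """Cayman-style unit: 4 slots per cycle, a complex op occupies 3 of them."""
    return max(complex_ops, ceil_div(simple + 3 * complex_ops, 4))


if __name__ == "__main__":
    # (simple, complex) instruction mixes, chosen purely for illustration
    for simple, complex_ops in [(8, 0), (8, 2), (4, 4)]:
        print(simple, complex_ops,
              cycles_vec5(simple, complex_ops),
              cycles_vec4(simple, complex_ops))
```

In this model, 8 simple and 2 complex instructions take 2 cycles on the vec5 unit and 4 on the vec4 unit, while 8 simple instructions alone take 2 cycles on both. Real code rarely packs this well on vec5, which is where the vec4 units can make up ground.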
While Cypress, the GPU of the Radeon HD 5800s, had 20 blocks of 16 vec5 processing units, Cayman has 24 blocks of 16 vec4 processing units. Instead of 320 vec5 units we now have 384 vec4s, which gives a lower total of “cores” (only 1536 for Cayman against 1600 for Cypress). We mustn’t forget the texturing units, however, of which there are four per block: 96 on Cayman against 80 on Cypress, so Cayman has 20% more texturing power at equal clocks. Note that AMD says it has increased double precision performance, but this is simply a roundabout way of presenting the fact that the “+1” didn’t handle double precision. In double precision, a Cayman unit is identical to a Cypress unit.
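The unit counts above follow directly from the block layout. The short Python sketch below reproduces the arithmetic; the class and field names are ours, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    blocks: int
    units_per_block: int
    lanes_per_unit: int           # 5 for vec5, 4 for vec4
    tex_units_per_block: int = 4  # four texturing units per block

    @property
    def units(self) -> int:
        return self.blocks * self.units_per_block

    @property
    def cores(self) -> int:
        # "cores" in the marketing sense: one per execution lane
        return self.units * self.lanes_per_unit

    @property
    def tex_units(self) -> int:
        return self.blocks * self.tex_units_per_block


CYPRESS = Gpu("Cypress", blocks=20, units_per_block=16, lanes_per_unit=5)
CAYMAN = Gpu("Cayman", blocks=24, units_per_block=16, lanes_per_unit=4)

if __name__ == "__main__":
    for gpu in (CYPRESS, CAYMAN):
        print(gpu.name, gpu.units, gpu.cores, gpu.tex_units)
    gain = CAYMAN.tex_units / CYPRESS.tex_units - 1
    print(f"Texturing units: {gain:.0%} more on Cayman")
```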
AMD didn’t stop there and has introduced some other small improvements to its architecture. The first concerns geometry processing, which is parallelised so that it’s no longer limited to one triangle per cycle. Note however, it’s parallelised but not distributed between blocks of processing units as is the case with the GeForces. Here’s a simplified way of looking at things:
Cypress: 1 complex geometry processing unit -> 1 triangle of 32 pixels per cycle,
Cayman: 2 complex geometry processing units -> 2 triangles of 16 pixels per cycle,
GF100/GF110: 16 simple geometry processing units -> 4 triangles of 8 pixels per cycle.
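To see what these figures imply for different triangle sizes, here is a small Python sketch. It assumes, as a simplification of our own, that each triangle slot rasterises at most its stated number of pixels per cycle and that small triangles leave the remainder of the slot unused. The table and function names are hypothetical.

```python
# (triangles per cycle, maximum pixels per triangle per cycle), from the list above
RASTER = {
    "Cypress": (1, 32),
    "Cayman": (2, 16),
    "GF100/GF110": (4, 8),
}


def pixels_per_cycle(gpu: str, triangle_pixels: int) -> int:
    """Peak pixels rasterised per cycle for triangles of a given size (toy model)."""
    triangles, max_pixels = RASTER[gpu]
    return triangles * min(triangle_pixels, max_pixels)


if __name__ == "__main__":
    for size in (4, 8, 16, 32):
        row = {gpu: pixels_per_cycle(gpu, size) for gpu in RASTER}
        print(size, row)
```

Under this model all three GPUs peak at 32 pixels per cycle with large triangles, but with 8-pixel triangles Cypress falls to 8 pixels per cycle and Cayman to 16, while GF100/GF110 stays at 32.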
NVIDIA retains an advantage with small triangles and, above all, having a larger number of simple geometry processing units means the GPU doesn’t stall when tessellation generates a lot of data. To combat this problem, AMD enlarged the dedicated buffer in Barts, the GPU used for the Radeon HD 6800s, and Cayman goes further by allowing this data to be transferred temporarily to video memory so as to avoid stalling the GPU. This feature isn’t directly exposed, however, and we don’t know whether it kicks in automatically at certain loads or whether AMD enables it manually on a case by case basis.
AMD has also improved its ROPs to increase throughput with 16-bit integer and 32-bit floating point formats. Antialiasing efficiency has also been improved, as has writing to memory in “compute” mode. Here, AMD has taken a close look at what NVIDIA offers and enabled simultaneous execution of several different kernels, whereas previously the GPU had to allocate them successive time slices. The same goes for communication with the CPU, which can now run in both directions at once thanks to two DMA engines, as on the GF100/GF110. The memory controllers have been reworked to support high speed GDDR5 more easily.
Cayman: 2.64 billion transistors
Finally, AMD has taken an important step by including, for the first time in a GPU, a unit dedicated to monitoring energy consumption. Using hundreds of sensors distributed across all the GPU blocks, Cayman can monitor its own energy consumption and limit clocks to remain within the maximum that has been defined. AMD is talking about fine tuning down to the precision of one frame. NVIDIA has nothing to compare to this sort of precision and responsiveness on its GeForce GTX 500s. Here Cayman offers the sort of thing you get on recent CPUs, though without a Turbo feature that would allow it to increase clocks when energy consumption is below the limit. The technology used on Cayman, known as PowerTune, should be rolled out across all forthcoming AMD GPUs and will really come into its own on the mobile market!
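To make the idea concrete, here is a deliberately simple Python sketch of a clock limiter of the kind described: it lowers the clock when the estimated consumption exceeds the limit and restores it afterwards, but never goes above the nominal clock (no Turbo). The function name, step size, floor and all numbers are hypothetical. AMD’s actual PowerTune algorithm isn’t detailed here and is certainly more sophisticated.

```python
def next_clock(clock_mhz: int, est_power_w: float, limit_w: float,
               nominal_mhz: int, step_mhz: int = 10, floor_mhz: int = 500) -> int:
    """Return the clock for the next interval (hypothetical step controller)."""
    if est_power_w > limit_w:
        # Over the limit: step the clock down, but not below the floor.
        return max(floor_mhz, clock_mhz - step_mhz)
    # Within the limit: step back up, but never past the nominal clock (no Turbo).
    return min(nominal_mhz, clock_mhz + step_mhz)


if __name__ == "__main__":
    clock = 900  # illustrative nominal clock in MHz
    # Illustrative per-interval power estimates in watts, against a 250 W limit
    for power in [180, 240, 260, 270, 245, 200]:
        clock = next_clock(clock, power, limit_w=250, nominal_mhz=900)
        print(power, clock)
```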
All these changes have increased the number of transistors: Cayman now has 2.64 billion as opposed to 2.15 billion on Cypress. This represents an increase of 23%, partly mirrored in the size of the die, up from 334 to 389 mm². The increase in die size is just 16%, showing that transistor density has gone up.
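The percentages above, and the resulting density, can be checked with a few lines of Python. The figures are those quoted in the text; the variable and function names are ours.

```python
def pct_increase(new: float, old: float) -> float:
    """Percentage increase from old to new."""
    return (new / old - 1) * 100


def density_m_per_mm2(transistors_bn: float, die_mm2: float) -> float:
    """Transistor density in millions of transistors per mm²."""
    return transistors_bn * 1000 / die_mm2


CYPRESS = {"transistors_bn": 2.15, "die_mm2": 334}
CAYMAN = {"transistors_bn": 2.64, "die_mm2": 389}

if __name__ == "__main__":
    print(f"Transistors: +{pct_increase(2.64, 2.15):.0f}%")
    print(f"Die size:    +{pct_increase(389, 334):.0f}%")
    d_old = density_m_per_mm2(**CYPRESS)
    d_new = density_m_per_mm2(**CAYMAN)
    print(f"Density: {d_old:.1f} -> {d_new:.1f} M/mm² (+{pct_increase(d_new, d_old):.0f}%)")
```

Density comes out at roughly 6.4 million transistors per mm² for Cypress and 6.8 million for Cayman, about 5% more.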