GCN: goodbye VLIW
Since the Radeon 9700 Pro, AMD has used a VLIW architecture, developed gradually until the most recent implementations reached a very high level of flexibility. VLIW, or Very Long Instruction Word, consists of executing complex instructions that are in reality assembled from a series of simpler instructions. This is what we have referred to as the vector architecture of Radeons (vec4 or vec5), as opposed to the scalar-like architecture of GeForce cards: for each pixel, for example, up to 5 instructions could be executed side by side. This model came from the natural development of GPUs, whose basic task consists of processing colours (four components: red, green, blue and alpha) and coordinates (three or four components). Processing five instructions side by side meant that the Radeons could make the most of the natural parallelism between these instructions, while leaving a little room for the few scalar operations that also had to be processed.
Cypress, the GPU used on the Radeon HD 5800s, uses CUs that each contain a big SIMD processing unit able to execute up to five instructions at the same time on sixteen elements (pixels, vertices, threads and so on). With Cayman, the GPU used on the Radeon HD 6900s, AMD simplified this model somewhat, using a more efficient SIMD engine that executes four instructions at the same time, still across 16 elements. For Tahiti and the other GCN GPUs, this big unit has been split into four small SIMD units, each of which can execute one instruction on 16 elements.

In reality the big SIMD engine on Cayman and the four small SIMD units on Tahiti are probably identical, with only the way they are fed actually changing. With Radeons, all the elements to be processed are organised in groups of 64, known as wavefronts. These groups are bigger than the width of the SIMD units (16) in order to simplify the work of the schedulers and accommodate processing unit latency more easily.
With Cayman, one of these groups is processed in four cycles, with up to four instructions in parallel. With Tahiti, each of the four SIMD units processes a single instruction on its own wavefront in four cycles, so four wavefronts are handled every four cycles. Tahiti is therefore more flexible but has to juggle many more elements at the same time: at least 256. This should be compared with 128, not 64, for Cayman: with a latency of 8 cycles for the processing units, each Cayman CU must constantly interleave two wavefronts. With Tahiti and GCN, AMD has reduced the latency of the processing units to 4 cycles to avoid multiplying the number of elements required to use all the processing units. In the end the number of elements required is only doubled, from 128 to 256, which is reasonable.
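The 128 and 256 figures can be checked with a little arithmetic. The sketch below is only our reading of the reasoning above: the helper name min_elements_in_flight is hypothetical, and it assumes that each independently fed SIMD needs enough wavefronts to cover its latency, with one wavefront occupying a SIMD for 64 / 16 = 4 cycles.

```python
# Sketch only: minimum number of elements a CU must hold to keep its SIMDs busy,
# using the figures quoted in the text (64-element wavefronts, 16-wide SIMDs).
WAVEFRONT_SIZE = 64
SIMD_WIDTH = 16
CYCLES_PER_WAVEFRONT = WAVEFRONT_SIZE // SIMD_WIDTH  # 4 cycles per instruction


def min_elements_in_flight(latency_cycles, simd_units):
    """Hypothetical helper: wavefronts needed to cover the latency on each
    independently fed SIMD, multiplied by the wavefront size."""
    # Ceiling division: how many wavefronts must be interleaved per SIMD.
    wavefronts_per_simd = -(-latency_cycles // CYCLES_PER_WAVEFRONT)
    return wavefronts_per_simd * simd_units * WAVEFRONT_SIZE


# Cayman: one big SIMD engine per CU, 8-cycle latency.
cayman = min_elements_in_flight(latency_cycles=8, simd_units=1)
# Tahiti: four small SIMD units per CU, 4-cycle latency.
tahiti = min_elements_in_flight(latency_cycles=4, simd_units=4)

print("Cayman:", cayman, "elements")
print("Tahiti:", tahiti, "elements")
```

With an 8-cycle latency on Tahiti the same calculation would give 512 elements, which is why the shorter latency matters.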

Note this slight nuance: although Cayman can issue an instruction directly to all its processing units, this is no longer the case with Tahiti. The scheduler in each CU can issue an instruction to just one SIMD per cycle. On start-up, the second SIMD thus loses one cycle, the third two cycles and the fourth three cycles, which represents a loss of 192 flops. This is however negligible when the programs to be executed are long, and it is compensated for by the lower latency.
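The 192 flops figure can be reconstructed as follows. This is a sketch under one assumption that the text does not state: each of the 16 lanes of a SIMD is counted as 2 flops per cycle (one multiply-add). The names used are ours.

```python
# Sketch only: flops lost while the scheduler starts the four SIMDs one by one.
SIMD_UNITS = 4
LANES_PER_SIMD = 16
FLOPS_PER_LANE_PER_CYCLE = 2  # assumption: one multiply-add counted as 2 flops


def startup_loss_flops(simd_units=SIMD_UNITS):
    """SIMD number i (counting from 0) waits i cycles before its first instruction."""
    idle_cycles = sum(range(simd_units))  # 0 + 1 + 2 + 3 = 6 idle SIMD cycles
    return idle_cycles * LANES_PER_SIMD * FLOPS_PER_LANE_PER_CYCLE


print(startup_loss_flops())
```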
What’s the difference in practice? Here are a few examples, comparing the VLIW4 architecture used on the Radeon HD 6900s (latency of 8 cycles, vec4) to the GCN architecture used on the Radeon HD 7900s (latency of 4 cycles + 3 cycles on start-up, scalar), supposing that each CU is fed with 2 / 4 / 8 groups of 64 elements to be processed:
1 scalar instruction to be executed:
VLIW 4 CU: 16 / 24 / 40 cycles
GCN CU: 11 / 11 / 15 cycles
100 scalar instructions to be executed:
VLIW 4 CU: 408 / 808 / 1608 cycles
GCN CU: 207 / 207 / 407 cycles
1 vec3 instruction to be executed:
VLIW 4 CU: 16 / 24 / 40 cycles
GCN CU: 19 / 19 / 31 cycles
100 vec3 instructions to be executed:
VLIW 4 CU: 408 / 808 / 1608 cycles
GCN CU: 607 / 607 / 1207 cycles
1 vec4 instruction to be executed:
VLIW 4 CU: 16 / 24 / 40 cycles
GCN CU: 23 / 23 / 39 cycles
100 vec4 instructions to be executed:
VLIW 4 CU: 408 / 808 / 1608 cycles
GCN CU: 807 / 807 / 1607 cycles
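The single-instruction rows above can be reproduced with a very simple cycle count. The sketch below is our own simplified model, not AMD's: it assumes 4 cycles per wavefront per instruction, that a scalar, vec3 or vec4 operation fits in a single VLIW4 bundle but needs 1, 3 or 4 successive instructions on GCN, and that wavefronts are spread evenly over the four GCN SIMDs. The function names are hypothetical. The 100-instruction rows are left out, since they depend on how far successive instructions overlap, which is not modelled here.

```python
import math

# Sketch only: a toy cycle count for ONE instruction per wavefront.
CYCLES_PER_WAVEFRONT = 4   # 64 elements on a 16-wide SIMD
VLIW4_LATENCY = 8
GCN_LATENCY = 4
GCN_STARTUP = 3            # the fourth SIMD starts three cycles late
GCN_SIMDS = 4


def vliw4_cycles(wavefronts):
    """All wavefronts share one SIMD engine; one bundle covers up to vec4."""
    return CYCLES_PER_WAVEFRONT * wavefronts + VLIW4_LATENCY


def gcn_cycles(wavefronts, components):
    """Each SIMD handles its share of wavefronts, one component at a time."""
    wavefronts_per_simd = math.ceil(wavefronts / GCN_SIMDS)
    issue = CYCLES_PER_WAVEFRONT * wavefronts_per_simd * components
    return issue + GCN_LATENCY + GCN_STARTUP


for name, components in (("scalar", 1), ("vec3", 3), ("vec4", 4)):
    vliw = [vliw4_cycles(w) for w in (2, 4, 8)]
    gcn = [gcn_cycles(w, components) for w in (2, 4, 8)]
    print(name, "VLIW 4 CU:", vliw, "GCN CU:", gcn)
```

Note how the GCN count does not change between 2 and 4 wavefronts: with fewer than four wavefronts some SIMDs simply sit idle, which is the under-fed case.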
When the CUs in the GCN architecture are fed with at least 256 elements, they give higher performance than the VLIW 4 CUs: the difference is insignificant when four instructions can be processed in parallel, but performance can be close to 4x higher with scalar instructions executed in series! 3D rendering comes in on average somewhere between the vec3 and vec4 results. It has to be said that the AMD compiler is particularly good at extracting this parallelism, thanks to all the experience acquired over time. When under-fed, however, the GCN CUs can give lower performance.
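To put numbers on this, the ratios can be computed directly from the 100-instruction figures listed above for 4 groups of 64 elements (256 elements). The short sketch below only divides those published cycle counts; the variable names are ours.

```python
# Sketch only: GCN advantage at 256 elements, from the 100-instruction figures above.
VLIW4_CYCLES = 808  # identical for scalar, vec3 and vec4 with 4 wavefronts
GCN_CYCLES = {"scalar": 207, "vec3": 607, "vec4": 807}

# Fewer cycles means higher performance, so the speed-up is VLIW4 / GCN.
speedup = {kind: VLIW4_CYCLES / cycles for kind, cycles in GCN_CYCLES.items()}

for kind, ratio in speedup.items():
    print(f"{kind}: GCN is {ratio:.2f}x as fast as VLIW4")
```

The scalar case comes out at roughly 3.9x, the vec3 case at roughly 1.3x and the vec4 case at practically 1x.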

The new architecture comes into its own mainly on the compute side, where the code lends itself less to vectorisation than 3D rendering does. However, 3D rendering will gradually benefit from the GCN setup as well, as it is increasingly moving away from the easy vectorisation of colour and position processing. GCN will also free AMD from having to work so intensively on its compiler, allowing it to put these resources into other optimisations.
AMD has also added a real scalar processing unit to each CU which it will be able to use to deal with operations that don’t have to be executed for each element of a group via the SIMDs, which can for example serve to optimise branching in certain cases. This unit will not be used for graphics languages themselves but may be used by the compiler.
There’s still a unit for processing branching, extended for debugging messages.