
At the final AFDS keynote, Eric Demers, Chief Technology Officer for GPUs at AMD, went back over the future GPU architecture that was presented in great detail this week, highlighting the main areas more simply. AMD also confirmed that it hopes to launch this architecture at the end of the year, provided the 28nm fabrication process doesn’t hold things up.
Eric Demers reminded us that ATI/AMD GPUs went from a vec4+1 architecture to a vec5 with the Radeon HD 2900s, then to a vec4 architecture with the Radeon HD 6900s, which will also be used in the forthcoming mid-range GPUs and in Trinity. These architecture choices can be explained by the fact that graphics rendering, still the main usage for GPUs, involves numerous vec4 and scalar operations. The flexibility of the MIMD/VLIW-type processing units in the latest GPUs has allowed AMD to dispense with the dedicated scalar channel and let the compiler handle the mixing of all operations across the 5 or 4 available channels.
With its future architecture, AMD wanted to keep a similar set-up. While the VLIW model has been abandoned, the fundamental blocks of these GPUs will still have four channels, not to carry out vec4 operations but to keep a similar ratio, seen as the best suited to graphics. With "compute"-style tasks, which often make less use of vec5 or vec4 units, becoming more and more important, it was necessary to return to a scalar model from the programmer’s point of view.
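The utilisation argument above can be illustrated with a toy model. The sketch below is not AMD's compiler or hardware behaviour: the function names, the bundle contents and the 16-lane width are all hypothetical values chosen for illustration. It compares how many lanes do useful work when a compiler has to pack independent operations from one thread into a VLIW bundle, against a model where each lane simply runs a different work-item.

```python
def vliw_utilisation(ops_per_bundle, width):
    """Fraction of VLIW lanes doing useful work over a list of issued bundles.

    ops_per_bundle: number of independent operations the compiler managed
    to pack into each bundle (hypothetical input, 1..width each).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if any(n < 1 or n > width for n in ops_per_bundle):
        raise ValueError("each bundle must hold between 1 and width operations")
    if not ops_per_bundle:
        return 0.0
    return sum(ops_per_bundle) / (len(ops_per_bundle) * width)


def per_work_item_utilisation(work_items, lanes):
    """Fraction of lanes used when every lane runs a different work-item.

    Utilisation no longer depends on how many independent operations a
    single thread offers, only on how well the work-items fill the lanes.
    """
    if work_items <= 0 or lanes <= 0:
        raise ValueError("work_items and lanes must be positive")
    passes = (work_items + lanes - 1) // lanes  # ceiling division
    return work_items / (passes * lanes)


# Hypothetical graphics-style shader: mostly vec4 maths plus one scalar op.
graphics_bundles = [4, 4, 1, 4]
# Hypothetical compute-style kernel: a chain of dependent scalar operations.
compute_bundles = [1, 1, 1, 1]

print("graphics on vec4:", vliw_utilisation(graphics_bundles, 4))  # 0.8125
print("compute on vec4: ", vliw_utilisation(compute_bundles, 4))   # 0.25
# Assumed lane count of 16, purely for illustration.
print("compute, one work-item per lane:", per_work_item_utilisation(64, 16))  # 1.0
```

In this toy model the compute-style kernel leaves three quarters of a vec4 unit idle, whereas the per-work-item model keeps every lane busy as long as enough work-items are available.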




The new AMD architecture combines these two aspects by placing in each Compute Unit (CU) not one big MIMD unit but four small independent SIMD units. AMD adds a scalar unit alongside them, designed to stop vector processing power from being monopolised by simple operations. As with the fundamental blocks of current GPUs, each CU will have four texturing units. In terms of execution units, a CU is therefore very close to what AMD currently calls a SIMD. The radical change lies in how these execution units are used. The Cayman GPUs used in the Radeon HD 6900s can indeed be seen as an intermediate step towards this new architecture. This hybrid/prototype nature might go some way to explaining their debatable effectiveness.
Another important aspect of the new architecture is multitasking: these new GPUs will be capable of processing different commands simultaneously and of deciding what priority to give to each of them. All this will take place at GPU level and not at operating system level.
The third major development is the L2 cache, which can be used for both reads and writes. It also enables a coherent memory space shared between all the CUs and the CPU, whether within an APU or with a discrete graphics card.
This generalised L2 cache, the scalar functioning of the processing units, and support for the x86 virtual memory space and for C++ should greatly increase interest in GPU computing. Note however that on some of these points, AMD is simply making up ground lost to NVIDIA.



One important question we have about this new architecture is how energy efficient it will be. With the Radeon HD 6970s, we saw energy efficiency drop slightly. Increasing the utilisation of a Compute Unit can therefore be expected to increase its relative energy consumption. While the 28nm fabrication process does lower absolute energy consumption, this remains an important question.
We were able to ask Eric Demers about this and, according to him, it isn’t too much of a problem. In the current architecture, when some vec4 or vec5 unit lanes aren’t used, they still draw power. They don’t draw as much as when they’re in use, but they nevertheless waste a lot of energy. This won’t be the case in the future architecture. In other words, the Compute Units will probably get closer to their maximum energy consumption, but their energy efficiency is likely to improve.
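The reasoning here can be made concrete with a small arithmetic sketch. All figures below are hypothetical (AMD gave no numbers), and the function name and the linear power model are assumptions made purely for illustration: each lane draws `p_active` when doing useful work and `p_idle` when it sits unused.

```python
def cu_power_and_efficiency(utilisation, lanes, p_active, p_idle):
    """Return (power, useful work per unit of power) for a toy execution unit.

    utilisation: fraction of lanes doing useful work, between 0 and 1.
    p_active / p_idle: hypothetical power drawn by a busy / unused lane.
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    active = utilisation * lanes
    idle = lanes - active
    power = active * p_active + idle * p_idle
    efficiency = active / power if power > 0 else 0.0
    return power, efficiency


# Hypothetical case: unused lanes still draw half the power of busy ones.
low = cu_power_and_efficiency(0.5, lanes=4, p_active=1.0, p_idle=0.5)
high = cu_power_and_efficiency(1.0, lanes=4, p_active=1.0, p_idle=0.5)
print("half-used lanes:  power=%.2f efficiency=%.2f" % low)
print("fully used lanes: power=%.2f efficiency=%.2f" % high)

# Hypothetical case: unused lanes draw nothing at all.
gated = cu_power_and_efficiency(0.5, lanes=4, p_active=1.0, p_idle=0.0)
print("half-used, no idle draw: power=%.2f efficiency=%.2f" % gated)
```

Under these assumptions, raising utilisation from 50% to 100% pushes power from 3 to 4 units but lifts efficiency from about 0.67 to 1.0, which matches the shape of the argument: consumption moves closer to the maximum while the energy spent per useful operation falls.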
Finally, we asked AMD’s CTO whether he was planning, in the future, to include more CUs in GPUs than the TDP allows. The idea is that they wouldn’t all be used in 3D rendering (limited by PowerTune, for example) but could be in compute mode, which makes little use of certain power-hungry parts of the GPU (e.g. the texturing units). Eric Demers replied that AMD was thinking about this and that such an option could perhaps be explored in the future if justified in simulations, notably for a SKU designed for HPC.