GCN: caches and two ACEs for GPU computing
Although graphics remains at the heart of GCN development, GPU computing has taken on a lot more importance. To prevent their GPUs from being confined to a few very specific usage scenarios, AMD and NVIDIA keep working to make them easier to use for computing. With Fermi, NVIDIA introduced numerous such developments and with GCN, AMD has followed suit.
Tahiti thus ushers in a new read/write cache structure. The texture cache of previous generations has evolved into a 16 KB L1 cache that can be used by both the texturing units and the SIMDs. In addition, each scalar unit has 4 KB of L1 cache of its own, although this is actually implemented as a single 16 KB cache shared between four Compute Units, a compromise made to reduce implementation costs. Counting both the vector caches and the shared scalar caches, Tahiti therefore has a total of 40 L1 caches of 16 KB each.
They are connected to the L2 cache at 64 bytes per clock. The L2 is made up of 128 KB partitions, one integrated into each of the six memory controllers. It is now coherent and processes atomics much more efficiently than before.
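The cache figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the number of Compute Units is not stated in this section, so it is derived here from the 40-cache total, and all variable names are illustrative.

```python
# Back-of-the-envelope check of the cache figures quoted in the text.
# Assumption: the CU count is not given in this section; it is derived
# from the stated total of 40 L1 caches (one vector L1 per CU plus one
# scalar cache shared by every four CUs).

TOTAL_L1_CACHES = 40
CUS_PER_SCALAR_CACHE = 4
L1_SIZE_KB = 16
L2_PARTITION_KB = 128
MEMORY_CONTROLLERS = 6

# n vector caches + n/4 scalar caches = 40  =>  n = 40 * 4 / 5
compute_units = TOTAL_L1_CACHES * CUS_PER_SCALAR_CACHE // (CUS_PER_SCALAR_CACHE + 1)
scalar_caches = compute_units // CUS_PER_SCALAR_CACHE

# Each CU's share of a 16 KB scalar cache split four ways.
scalar_share_kb = L1_SIZE_KB // CUS_PER_SCALAR_CACHE

total_l1_kb = TOTAL_L1_CACHES * L1_SIZE_KB
total_l2_kb = L2_PARTITION_KB * MEMORY_CONTROLLERS

print(f"Compute Units:         {compute_units}")
print(f"Shared scalar caches:  {scalar_caches}")
print(f"Scalar cache per CU:   {scalar_share_kb} KB")
print(f"Total L1:              {total_l1_kb} KB")
print(f"Total L2:              {total_l2_kb} KB")
```

The derived per-CU scalar share matches the 4 KB quoted in the text, and the six 128 KB partitions add up to 768 KB of L2.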

The shared memory of each CU, the Local Data Share (LDS), has also grown, from 32 to 64 KB. To recap, the LDS is designed to share data within a block of elements being processed, and the Direct3D 11 spec requires a minimum of 32 KB. This memory has direct read access to the L1 of its Compute Unit, which means it can be loaded with data without going through the SIMDs. This benefits both performance and energy consumption.
There has however been no increase in the general registers for the SIMDs in each CU: 256 vector registers of 2048 bits (64 x 32 bits). The scalar unit also has 256 registers of 32 bits.
Still on the memory sub-system, AMD has also implemented ECC protection for the SRAM (L1, L2 and registers) and the video memory. The implementation is probably similar to that on NVIDIA GPUs, which is to say that it reserves part of the memory to store ECC data, thereby also reducing the available memory bandwidth.
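To put the register figures in more familiar units, here is a minimal sketch that converts them to bytes. It assumes the counts quoted above describe one vector register file and one scalar register file. The text does not say how many such files a CU holds, so no per-CU total is computed, and the constant names are illustrative.

```python
# Size of the register files described in the text.
# Assumption: the quoted counts describe a single vector register file
# and a single scalar register file; no per-CU total is attempted.

VECTOR_REGISTERS = 256
LANES_PER_REGISTER = 64   # "64 x 32 bits"
BITS_PER_LANE = 32
SCALAR_REGISTERS = 256

vector_register_bits = LANES_PER_REGISTER * BITS_PER_LANE
vector_file_kb = VECTOR_REGISTERS * vector_register_bits // 8 // 1024
scalar_file_bytes = SCALAR_REGISTERS * BITS_PER_LANE // 8

print(f"One vector register: {vector_register_bits} bits")
print(f"Vector register file: {vector_file_kb} KB")
print(f"Scalar register file: {scalar_file_bytes} bytes")
```

Under that assumption, one vector register file is the same size as the 64 KB Local Data Share, while the scalar file is only 1 KB.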

After the caches, AMD looked into another problem that affects GPU computing: multitasking and overhead. To this end, Tahiti has three command processors. The main one, not shown in this diagram, can process all tasks, both graphics and compute. Alongside it are two ACEs (Asynchronous Compute Engines), which are limited to compute tasks. With an advanced system for resource control, prioritisation and synchronisation, they can handle several contexts simultaneously. They can, for example, allow GPU computing and 3D to be used efficiently at the same time. In the future, it's also conceivable that AMD could offload DirectX 11 compute shader processing from the main command processor to the ACEs, but this isn't yet in place. Could this be a possible optimisation for 3DMark 11?
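The division of labour between the three command processors can be pictured with a toy model. This is only an illustrative sketch of the routing rule described above, not a representation of how the hardware or driver schedules work. The names `CommandProcessor`, `FRONT_ENDS`, `eligible`, `ACE0` and `ACE1` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandProcessor:
    """Toy stand-in for a command processor and the task kinds it accepts."""
    name: str
    accepts: frozenset


# One main command processor (graphics and compute) and two ACEs (compute only),
# as described in the text. The ACE names are made up for this example.
FRONT_ENDS = (
    CommandProcessor("main", frozenset({"graphics", "compute"})),
    CommandProcessor("ACE0", frozenset({"compute"})),
    CommandProcessor("ACE1", frozenset({"compute"})),
)


def eligible(task_kind):
    """Return the names of the command processors that may accept this task kind."""
    return [cp.name for cp in FRONT_ENDS if task_kind in cp.accepts]


print(eligible("graphics"))  # ['main']
print(eligible("compute"))   # ['main', 'ACE0', 'ACE1']
```

In this model a compute task has three possible entry points while graphics has one, which is why moving compute shader work to the ACEs would free the main command processor for graphics.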
To feed all these command processors, AMD has added a second DMA engine to handle communication to and from the CPU, as NVIDIA has already done.