At the AMD Fusion Developer Summit, which took place several months ago, one guest caused a sensation: Jem Davies, Vice-President at ARM in charge of technology for the Media Processing division. Contrary to what some may have thought, his attendance was not linked to any announcement that AMD would use the ARM architecture. Jem Davies carries the heavy responsibility of ensuring that ARM has the right technology further down the line. In this respect he shares AMD’s vision of heterogeneous computing, and that was what brought him to the summit.
AMD and its entire PC ecosystem need heterogeneous computing to keep up in the performance race without inflating either the size of its chips or their energy consumption. The famous Moore's law says that processor complexity, namely the number of transistors, doubles every two years for a more or less constant cost. This prediction sets the pace for a large part of the industry. Every two years, a new fabrication process enables transistor density to double, making it possible to manufacture a processor that is twice as complex but the same size (size being an important parameter when it comes to fabrication costs). AMD, Intel and NVIDIA take advantage of these developments to increase CPU and GPU performance.
Much as has happened in the traditional PC world, however, ARM is faced with energy constraints that call this model into question. Energy consumption is in effect the no. 1 priority for this cross-platform architecture, which is found in most smartphones and tablets. Yet while each new fabrication process (45nm -> 32nm -> 22nm…) allows a 50% reduction in the size of transistors, energy consumption doesn’t fall at the same rate.
Jem Davies showed us an approximate example of the differences between the 45nm and the 22nm process: over four to six years, depending on access to the technologies involved, an identical processor will, as expected, see its size reduced by a factor of four and its maximum clock increase by 60%, but its energy consumption will remain the same. Sure, a 60% gain is decent, but the unchanged energy consumption represents a lost opportunity: the space saved by the more advanced fabrication process cannot simply be used to develop a more complex and higher performance chip.
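The consequence of these figures can be worked through with a few lines of arithmetic. The sketch below uses only the three ratios quoted in the example (a quarter of the area, 60% more clock, unchanged power); the function names are ours, chosen for illustration.

```python
# Rough scaling arithmetic for the 45nm -> 22nm example quoted above.
AREA_SCALE = 1 / 4    # the same processor occupies a quarter of the area
CLOCK_SCALE = 1.6     # its maximum clock rises by 60%
POWER_SCALE = 1.0     # its power draw stays the same


def power_density_scale(power_scale, area_scale):
    """Factor by which power per unit of area changes after the shrink."""
    return power_scale / area_scale


def full_die_power_scale(power_scale, area_scale):
    """Power of a die of the ORIGINAL size filled with shrunk processors,
    relative to the single original processor."""
    copies = 1 / area_scale
    return copies * power_scale


density = power_density_scale(POWER_SCALE, AREA_SCALE)
full_die = full_die_power_scale(POWER_SCALE, AREA_SCALE)
print(f"power density: x{density:g}, full die power: x{full_die:g}")
```

In other words, filling the reclaimed area with active logic of the same kind would multiply power draw by four, which is exactly what a fixed thermal envelope forbids.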
AMD, Intel and NVIDIA CPUs and GPUs are faced with the same issue, but to a lesser extent depending on whether they target the desktop or mobile market, as their architectures still leave room for numerous energy optimisations. In the medium term, increasing the number of identical cores will nevertheless pose a problem, both in terms of yield and energy consumption.
According to Jem Davies, the solution lies in heterogeneous computing and dark silicon, or inactive silicon. The general idea, beyond that of fixed task-specific blocks dedicated to, say, video decoding, is to take advantage of progress in fabrication processes to implement various types of core, while bearing in mind that the thermal envelope won’t allow you to use them all at the same time. Each type of core must be adapted to certain task profiles, be used to process those tasks and remain inactive the rest of the time.
NVIDIA based the development of Kal-El, its upcoming Tegra 3 SoC, on the same principle. In addition to four main ARM Cortex A9 cores, Kal-El will have a fifth Cortex A9, built from transistors that reach lower clocks but are optimised to reduce current leakage and energy consumption. Depending on what tasks need processing, either this companion core or the principal cores will be active.
ARM naturally intends to go much further than this, which brings us to the innovations that have just been unveiled and which represent a first step towards this new heterogeneous architecture / dark silicon strategy. A new core has been introduced for this purpose: the Cortex A7. Contrary to what its name might suggest, the Cortex A7 did not precede the Cortex A8 (used in numerous SoCs, including the Apple A4) and A9 (Tegra 2 & 3, Apple A5 and so on). It is a new core, the poor relation of the Cortex A15, which will itself represent an important development for ARM in terms of performance.
The Cortex A15 and A7 are based on the same architecture (ARMv7-A) and execute all instructions in the same way. Both can be implemented alone or in up to quad core versions, with an L2 cache. Where they differ is in their microarchitecture: the Cortex A15 is optimized for performance, with out-of-order execution and a long pipeline, while the Cortex A7 is optimized for energy efficiency, with in-order execution and a short pipeline.
Although the Cortex A7 can be used for a low-end, low-energy SoC, as an interesting replacement for the Cortex A8 (which offers lower performance and higher energy consumption), it really comes into its own in what has been christened the big.LITTLE architecture. This consists in implementing Cortex A7s alongside Cortex A15s on the same SoC, with task execution being taken care of by one or the other, depending on priority.
In contrast to NVIDIA’s approach with Kal-El, coherence won’t be handled at the level of a shared L2 cache, as ARM says that the L2 cache offers very high energy saving potential and that it's more useful to adapt its structure to each type of core. Typically, each group of Cortex A7s or A15s will have its own L2 cache and coherence will be taken care of by a robust interconnect that is designed to support the most complex cases.
There are two possible options for using all these cores. The task migration model consists in using either the Cortex A15s or the Cortex A7s, but never both at the same time, moving from one cluster to the other in the same way as you would change clock speed (DVFS model, Dynamic Voltage and Frequency Scaling). When the system reaches the highest level of performance defined for the Cortex A7s, OS and application processing can be migrated across to the Cortex A15s.
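The task migration model can be pictured as a single DVFS-style ladder in which the upper rungs happen to sit on the other cluster. The sketch below is a minimal illustration under stated assumptions: the frequencies, the load thresholds and the names (`OperatingPoint`, `LADDER`, `next_index`, `run`) are hypothetical and do not describe ARM's actual switcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    cluster: str  # "A7" or "A15"; only one cluster is active at a time
    mhz: int


# Hypothetical ladder: A7 points first, then A15 points (values are made up).
LADDER = [
    OperatingPoint("A7", 500),
    OperatingPoint("A7", 800),
    OperatingPoint("A7", 1000),
    OperatingPoint("A15", 1000),
    OperatingPoint("A15", 1500),
    OperatingPoint("A15", 2000),
]


def next_index(index, load, up=0.85, down=0.30):
    """Step up when load is high, down when it is low (assumed thresholds)."""
    if load > up and index < len(LADDER) - 1:
        return index + 1
    if load < down and index > 0:
        return index - 1
    return index


def run(loads, start=0):
    """Return (operating point, migrated?) for each load sample in turn."""
    index = start
    trace = []
    for load in loads:
        new_index = next_index(index, load)
        # A cluster migration happens only when the step crosses the boundary.
        migrated = LADDER[new_index].cluster != LADDER[index].cluster
        index = new_index
        trace.append((LADDER[index], migrated))
    return trace


for point, migrated in run([0.9, 0.9, 0.9, 0.5, 0.1]):
    print(point.cluster, point.mhz, "migrated" if migrated else "")
```

From the governor's point of view, crossing from the top A7 point to the bottom A15 point is just another step on the ladder; the cluster switch is hidden behind it.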
The migration process takes under 20,000 cycles according to ARM, which represents 20 µs at 1 GHz. Note however that this doesn’t mean that the SoC is unavailable for those 20 µs: for most of this time, tasks continue to be executed on the original core cluster.
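As a sanity check on the cycle count quoted by ARM, converting cycles to wall-clock time is a single division by the clock frequency. The helper name `cycles_to_seconds` is ours.

```python
def cycles_to_seconds(cycles, clock_hz):
    """Duration of a given number of clock cycles at a given frequency."""
    return cycles / clock_hz


# 20,000 cycles at 1 GHz (10^9 cycles per second)
migration_s = cycles_to_seconds(20_000, 1_000_000_000)
print(f"{migration_s * 1e6:.0f} microseconds")
```

At 1 GHz one cycle lasts a nanosecond, so 20,000 cycles amount to 20 microseconds, three orders of magnitude below a millisecond-scale delay.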
In this mode, ARM advises implementing the same number of Cortex A15s as Cortex A7s, so as to simplify the software side. ARM supplies a software switcher to switch between the A7 and A15 cores and mask the small differences between the two types of core. It can be used as of today (with virtualisation) or, better still, be integrated into operating systems.
The second option is to use heterogeneous computing, which consists in using Cortex A7s and A15s at the same time. You can then direct each task to the most appropriate core: to the Cortex A7s if it's a simple or low priority task, or to the Cortex A15s if more power is needed. Each type of core can sit entirely idle if it is not required for the tasks at hand. ARM doesn’t however describe the mechanism that directs tasks towards one or the other cluster of cores, and we imagine that software and operating systems will have to be adapted.
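Since ARM does not describe the dispatch mechanism, the following is purely a hypothetical policy to make the idea concrete: tasks carry an estimated demand and a priority flag, and a simple threshold decides which cluster receives them. The `Task` fields, the 0.5 threshold and the function names are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    demand: float            # assumed estimate of compute demand, 0.0 to 1.0
    low_priority: bool = False


def pick_cluster(task, threshold=0.5):
    """Simple or low priority tasks go to the A7s, the rest to the A15s."""
    if task.low_priority or task.demand < threshold:
        return "A7"
    return "A15"


def dispatch(tasks):
    """Group task names by the cluster chosen for them."""
    queues = {"A7": [], "A15": []}
    for task in tasks:
        queues[pick_cluster(task)].append(task.name)
    return queues


def idle_clusters(queues):
    """Clusters with nothing to run can be left entirely idle."""
    return [cluster for cluster, queue in queues.items() if not queue]


queues = dispatch([
    Task("email sync", 0.1),
    Task("game physics", 0.9),
    Task("background backup", 0.8, low_priority=True),
])
print(queues, idle_clusters(queues))
```

A real scheduler would need to estimate demand at run time and handle migration costs, which is presumably where the operating system work lies.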
This isn’t all, because the interconnect layer has also been designed for a Mali-T604 GPU, which supports OpenCL 1.2! Of course, using four Cortex A7s, four Cortex A15s and a quad-core Mali-T604 GPU running OpenCL all at the same time would only be possible if the TDP weren’t a limiting factor. In all other cases the mechanism in charge of task management will have to take a number of parameters linked to energy consumption into account and decide between all the possible options: using four Cortex A15s, for example, or two Cortex A7s plus the Mali-T604 in OpenCL.
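The choice between such configurations can be framed as a small search under a power budget. The sketch below enumerates every combination of active units and keeps the highest-scoring one that fits. The per-unit power figures (in milliwatts) and throughput scores are invented for illustration and are not ARM data; `best_config` is a hypothetical name.

```python
from itertools import product

# Hypothetical per-unit figures; NOT ARM data.
POWER_MW = {"A7": 100, "A15": 600, "GPU": 1200}
SCORE = {"A7": 1, "A15": 3, "GPU": 7}


def best_config(budget_mw, gpu_usable=True):
    """Return ((a7, a15, gpu), score, power_mw) maximising score in budget.

    gpu_usable says whether the workload can run on the GPU via OpenCL.
    """
    gpu_options = (0, 1) if gpu_usable else (0,)
    candidates = []
    for a7, a15, gpu in product(range(5), range(5), gpu_options):
        power = (a7 * POWER_MW["A7"] + a15 * POWER_MW["A15"]
                 + gpu * POWER_MW["GPU"])
        if power > budget_mw:
            continue  # exceeds the thermal/power budget
        score = a7 * SCORE["A7"] + a15 * SCORE["A15"] + gpu * SCORE["GPU"]
        # Prefer higher score, then lower power on ties.
        candidates.append((score, -power, (a7, a15, gpu)))
    score, neg_power, config = max(candidates)
    return config, score, -neg_power


print(best_config(2400))
print(best_config(2400, gpu_usable=False))
print(best_config(10_000))
```

With these made-up numbers, only the unconstrained budget switches everything on; under a tighter budget the best mix depends on whether the workload can use the GPU at all.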
While both AMD and ARM require heterogeneous computing, they are also both faced with the same challenge: exploiting it efficiently, whether in terms of energy or performance. ARM however has a significant advantage as it is present in software ecosystems that are developing very fast, whereas AMD has to work with the weight of the x86 ecosystem…