AMD Radeon HD 7970 & CrossFireX review: 28nm and GCN - BeHardware
Written by Damien Triolet
Published on December 24, 2011
Just before Christmas, AMD decided to lift the veil on the Radeon HD 7970, which was subsequently put on sale in stores as of January 9th. On the menu: a new architecture, support for the most recent technologies and of course the promise of higher performance to finally take top spot from the NVIDIA GeForce GTX 580. We're going to see if this promise has been fulfilled in a close examination of how the Radeon HD 7970 handles, alone and in CrossFire X!
2012 technologies
AMD has got into the good habit of adopting the latest manufacturing technologies and standards for its GPUs, standards that are generally rolled out across the rest of the market later. Once again the company has followed this path for its Radeon HD 7000s... or at least for some of them. The Radeon HD 7900s, 7800s and 7700s are slated to replace the Radeon HD 6900s, 6800s and 6700s with the new Southern Islands GPUs: Tahiti, Pitcairn and Cape Verde. The entry-level cards will however largely, or even entirely, consist of a series of renamed current models. While it's probably pertinent for AMD to focus its resources on developing mid and high-end GPUs and APUs, this doesn't justify an across-the-board renaming calculated to fool certain consumers and gain a sales advantage. We say 'certain consumers' because the renamed cards may well only concern the OEM segment, fooling only the newbies who buy complete PCs. This is particularly problematic as the Southern Islands family ushers in a lot of innovations.
These new GPUs, including Tahiti which equips the Radeon HD 7900s, are made by TSMC on a 28 nanometre process. Transistor density has been doubled in comparison to the previous 40nm fabrication process, which makes it possible to add more processing units and new features to a chip of the same size. Note however that energy consumption unfortunately hasn't come down in proportion with the area used. Consumption is more than ever the main parameter to be considered when designing a chip.
Tahiti also introduces Direct3D 11.1 support, which will be rolled out across the board in 2012. Designed for Windows 8, but probably with Windows 7 support too, this new API is a minor development which has above all been designed to integrate certain requests from developers and to facilitate its usage across a wide range of GPUs, something that is important with the opening of Windows to the ARM universe. Compatibility with DirectX 11, 10.1, 10 and 9 is maintained. There's full Direct3D 11.1 support, which facilitates integration with the DXVA video API. The resources of the compute shaders (UAVs) can be used with all types of shaders (only the pixel shaders can share them in Direct3D 11), rasterisation is more flexible, logical operations can be applied to rendering buffers, shaders can be debugged and there's support for a standard version of stereoscopic 3D.
There's also full OpenCL 1.2 support, which includes integration with DirectX 11 and video streams and support for multitasking at GPU level. The PCI Express 3.0 standard is also supported and means the practical bandwidth between the GPU and CPU can be doubled, as long as you're on an X79 platform, the only platform to support it as things stand.
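To put a figure on the doubling mentioned above, here is a minimal sketch of the theoretical one-way bandwidth of an x16 link. The per-lane signalling rates and line encodings are those of the PCI Express 2.0 and 3.0 specifications rather than figures from this article, and the function name is purely illustrative.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency.
# These values come from the PCI Express specifications, not from the article.
PCIE = {
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}

def link_bandwidth_gb_s(gen, lanes=16):
    """Theoretical one-way bandwidth of a PCIe link, in GB/s."""
    rate_gt_s, efficiency = PCIE[gen]
    return rate_gt_s * efficiency * lanes / 8  # bits -> bytes

gen2 = link_bandwidth_gb_s("2.0")
gen3 = link_bandwidth_gb_s("3.0")
print(f"PCIe 2.0 x16: {gen2:.2f} GB/s")
print(f"PCIe 3.0 x16: {gen3:.2f} GB/s ({gen3 / gen2:.2f}x)")
```

These are peak figures; the bandwidth measured in practice will be lower on both generations.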
Finally Tahiti also opens the door onto a new architecture: GCN or Graphics Core Next.
Tahiti: 2048 processing units, 32 ROPs and a 384-bit memory bus
As with all current GPUs, execution units on Tahiti and its derivatives are organised in fundamental blocks which take in processing units, the cache, texturing units, control units and so on. Previously, AMD called these blocks SIMDs, which wasn't very clear as this is also the generic name given to vector processing units. With GCN AMD is now referring to them as Compute Units (CU). With the aim of being as explicit as possible, we will also use this term to refer to the fundamental blocks of current Radeon GPUs and will reserve the term SIMD for its original definition: a vector processing unit. On the GeForces, remember, these blocks are referred to as Shader Multiprocessors (SM).
The first development on Tahiti (HD 7900) in comparison to Cayman (HD 6900) is that the number of CUs is up from twenty four to thirty two, with the same processing and texturing throughput per unit. This gain of 33%, which takes the number of processing units from 1536 (384 vec4s) to 2048 and texturing units from 96 to 128, will directly benefit performance. The CUs are also "scalar", which makes them more efficient (see next page). "Scalar" units have been used by NVIDIA since the GeForce 8s.
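The unit counts above follow directly from the number of CUs. A minimal sketch, assuming 64 processing units and 4 texturing units per CU (the ratios implied by 1536/24 and 96/24); the helper name is hypothetical:

```python
LANES_PER_CU = 64  # processing units per Compute Unit (1536 / 24 = 2048 / 32)
TMUS_PER_CU = 4    # texturing units per Compute Unit (96 / 24 = 128 / 32)

def gpu_units(compute_units):
    """Return the processing and texturing unit counts for a given CU count."""
    return {
        "processing": compute_units * LANES_PER_CU,
        "texturing": compute_units * TMUS_PER_CU,
    }

cayman = gpu_units(24)
tahiti = gpu_units(32)
gain = tahiti["processing"] / cayman["processing"] - 1

print("Cayman:", cayman)
print("Tahiti:", tahiti)
print(f"Gain: {gain:.0%}")
```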
The texturing units are unchanged and still filter HDR 64-bit textures (FP16) at half speed and HDR 128-bit (FP32) textures at quarter rate. Filtering quality has been tweaked a bit to noticeably reduce flickering. AMD has also added hardware support for Partially Resident Textures (PRT), a sort of Mega Texturing used by John Carmack's id Tech 5. This PRT acceleration means that engines that use it can be accelerated but support will remain limited as Direct3D is not easily extensible (currently there's a proprietary OpenGL implementation).
To feed these new CUs, AMD has gone from a 256-bit to a 384-bit memory bus, which represents a gain in bandwidth of 50% for identical memory. The number of ROPs is however decoupled from the memory controllers, something already seen with the Radeon HD 6790, and AMD has opted not to increase them in number. There are therefore still 32 and this means that there's no improvement in fillrate. It was already pretty high before and this isn't therefore too much of an issue, especially as to write more than 32 pixels to memory, you also have to be able to generate more! Indeed this was the problem with the GeForce GTX 400s and 500s. The GeForce GTX 580 is, for example, able to write 48 pixels to memory per cycle but can only generate 32, which is only of any use in terms of accelerating multisample type antialiasing.
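As a rough illustration of what these filtering rates mean, the sketch below derives peak bilinear throughput per format. It assumes one filtered texel per texturing unit per clock at full speed for standard 32-bit textures, and uses Tahiti's 128 texturing units with the 925 MHz reference clock quoted in the overclocking section; the names are illustrative only.

```python
TMUS = 128            # texturing units on Tahiti
CORE_CLOCK_MHZ = 925  # reference Radeon HD 7970 GPU clock

# Fraction of full speed at which each texture format is filtered.
# Full speed for 32-bit INT8 textures is an assumption of this sketch.
FILTER_RATE = {
    "INT8 (32-bit)": 1.0,
    "FP16 (64-bit)": 0.5,
    "FP32 (128-bit)": 0.25,
}

def gtexels_per_second(fmt):
    """Peak bilinear filtering rate in Gtexels/s for the given format."""
    return TMUS * CORE_CLOCK_MHZ * FILTER_RATE[fmt] / 1000

for fmt in FILTER_RATE:
    print(f"{fmt}: {gtexels_per_second(fmt):.1f} Gtexels/s")
```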
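The 50% figure for the wider bus can be checked with a short calculation. The 5.5 Gbps per-pin data rate below is an assumption (the GDDR5 speed of the reference cards, which is not stated in this section); any identical value on both sides gives the same ratio.

```python
def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8

DATA_RATE_GBPS = 5.5  # assumed GDDR5 data rate, identical on both cards

cayman_bw = memory_bandwidth_gb_s(256, DATA_RATE_GBPS)
tahiti_bw = memory_bandwidth_gb_s(384, DATA_RATE_GBPS)

print(f"256-bit: {cayman_bw:.0f} GB/s")
print(f"384-bit: {tahiti_bw:.0f} GB/s (+{tahiti_bw / cayman_bw - 1:.0%})")
```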
Can 32 ROPs properly use a 384-bit memory bus? Not always, but as well as the ROPs, textures also require memory bandwidth. In some cases however, 32 ROPs are limited by a 256-bit bus, as when there's blending of colours in HDR 64 and 128 bits. These modes will therefore make full use of the extended bus.
Like Cayman, Tahiti can process two triangles per cycle, with or without tessellation, against four for the GF100/110 from NVIDIA. The fact that there has been no development here is however compensated by several little optimisations to improve performance when a high level of tessellation is used: bigger caches, fewer penalties when using the video memory as a buffer and ability to reuse vertices that have already been processed (neighbouring triangles) as often as possible. The gains resulting from these optimisations can give as much as a 4x improvement on Cayman according to AMD.
Implementing these additional units as well as all the architecture developments means a huge increase in the number of transistors, up from 2.64 billion for Cayman to 4.31 billion for Tahiti. Thanks to the 28 nm process, Tahiti is however slightly smaller at 352 mm² compared to 389 mm² for Cayman. Note that AMD hasn't yet given any detail on which variant of the 28nm fabrication process has been used.
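From these figures, the effective density gain can be worked out, as in this small sketch (the function name is illustrative):

```python
def density_mtr_per_mm2(transistors, area_mm2):
    """Transistor density in millions of transistors per square millimetre."""
    return transistors / area_mm2 / 1e6

cayman_density = density_mtr_per_mm2(2.64e9, 389)
tahiti_density = density_mtr_per_mm2(4.31e9, 352)
ratio = tahiti_density / cayman_density

print(f"Cayman: {cayman_density:.2f} M transistors/mm²")
print(f"Tahiti: {tahiti_density:.2f} M transistors/mm²")
print(f"Ratio:  {ratio:.2f}x")
```

The measured ratio comes out at around 1.8x, a little short of the ideal doubling that the move from 40 nm to 28 nm would allow.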
GCN: goodbye VLIW
Since the Radeon 9700 Pro, AMD has used a VLIW architecture which was gradually developed to attain a very high level of flexibility on the most recent implementations. VLIW, or Very Long Instruction Word, consists of executing complex instructions, which are in reality assembled from a series of simpler instructions. This is what we have referred to as vector architecture for Radeons (vec4 or vec5) as opposed to the scalar-like architecture of GeForce cards: for each pixel, for example, 5 instructions could be executed side by side. This model came from the natural development of GPUs whose basic task consists in processing colours (four components: red, green, blue and alpha) and coordinates (three or four components). Processing five instructions side by side meant that the Radeons could make the most of the natural parallelism between these instructions, at the same time as leaving a little space for the few scalar operations that also had to be processed.
Cypress, the GPU used on the Radeon HD 5800s, uses CUs that each contain a big SIMD processing unit that can execute five instructions per cycle at the same time on sixteen elements (pixels, vertices, threads and so on). With Cayman, the GPU used on the Radeon HD 6900s, AMD simplified this model somewhat, using a more efficient SIMD engine that executes four instructions at the same time, still across 16 elements. For Tahiti and the other GCN GPUs, this big unit has been split into four small SIMD units, each of which can execute one instruction on 16 elements.
In reality the big SIMD engine on Cayman and the four small SIMD units on Tahiti are probably identical, with just the way they're fed actually changing. With Radeons, all these elements to be processed are organised in groups of 64, known as wavefronts. These groups are bigger than the width of the SIMD units (16) in order to simplify the work of the schedulers and accommodate processing unit latency more easily.
With Cayman, one of these groups is processed in four cycles with up to four instructions in parallel. For Tahiti, a single instruction on four wavefronts is processed every four cycles. Tahiti is therefore more flexible in as much as it can juggle with many more elements at the same time: at least 256. This can be compared with 128, not 64, for Cayman: with a latency of 8 cycles for the processing units, each Cayman CU must constantly interleave two wavefronts. With Tahiti and GCN, AMD has reduced the latency of the processing units to 4 cycles to avoid multiplying the number of elements required to use all the processing units. In the end the number of elements required is only doubled, which is reasonable.
Note this slight nuance: although Cayman can execute an instruction directly on all processing units, this is no longer the case with Tahiti. The scheduler in each CU can issue the execution of an instruction to just one SIMD per cycle. On start-up, the second SIMD thus loses one cycle, the third two cycles and the fourth three cycles, which represents a loss of 192 flops. This is however negligible when the programs to be executed are long and is compensated by the lower latency.
What's the difference in practice? Here are a few examples, comparing the VLIW4 architecture used on the Radeon HD 6900s (latency of 8 cycles, vec4) to the GCN architecture used on the Radeon HD 7900s (latency of 4 cycles + 3 cycles on start-up, scalar), supposing that each CU is fed with 2 / 4 / 8 groups of 64 elements to be processed:
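The 128 and 256 figures can be reproduced with a simple model. This is a sketch under the assumptions stated in the text (wavefronts of 64 elements, 16-wide SIMDs, one wavefront occupying a SIMD for four cycles); the function name is hypothetical.

```python
WAVEFRONT = 64   # elements per wavefront
SIMD_WIDTH = 16  # elements processed per SIMD per cycle
CYCLES_PER_WAVEFRONT = WAVEFRONT // SIMD_WIDTH  # 4 cycles to issue one wavefront

def min_elements_in_flight(simds_per_cu, latency_cycles):
    """Minimum number of elements a CU needs to keep its SIMDs busy."""
    # Enough wavefronts per SIMD to cover the processing unit latency (ceiling division).
    wavefronts_per_simd = -(-latency_cycles // CYCLES_PER_WAVEFRONT)
    return simds_per_cu * wavefronts_per_simd * WAVEFRONT

cayman = min_elements_in_flight(simds_per_cu=1, latency_cycles=8)
tahiti = min_elements_in_flight(simds_per_cu=4, latency_cycles=4)

print(f"Cayman CU: {cayman} elements")
print(f"Tahiti CU: {tahiti} elements")
```

With an 8-cycle latency on Tahiti's four SIMDs, the same model would require 512 elements, which is what the move to a 4-cycle latency avoids.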
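The figure of 192 flops can be recovered as follows. This sketch assumes that each of the 16 lanes of a SIMD counts for two floating point operations per cycle (a multiply-add), which is an assumption of ours and not something stated explicitly above.

```python
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
FLOPS_PER_LANE_PER_CYCLE = 2  # assumed: one multiply-add counted as two operations

# The scheduler issues to one SIMD per cycle, so SIMD n starts n cycles late.
idle_cycles = sum(range(SIMDS_PER_CU))  # 0 + 1 + 2 + 3
lost_flops = idle_cycles * LANES_PER_SIMD * FLOPS_PER_LANE_PER_CYCLE

print(f"Idle SIMD cycles at start-up: {idle_cycles}")
print(f"Lost operations: {lost_flops}")
```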
1 scalar instruction to be executed:
VLIW 4 CU: 16 / 24 / 40 cycles
GCN CU: 11 / 11 / 15 cycles
100 scalar instructions to be executed:
VLIW 4 CU: 408 / 808 / 1608 cycles
GCN CU: 207 / 207 / 407 cycles
1 vec3 instruction to be executed:
VLIW 4 CU: 16 / 24 / 40 cycles
GCN CU: 19 / 19 / 31 cycles
100 vec3 instructions to be executed:
VLIW 4 CU: 408 / 808 / 1608 cycles
GCN CU: 607 / 607 / 1207 cycles
1 vec4 instruction to be executed:
VLIW 4 CU: 16 / 24 / 40 cycles
GCN CU: 23 / 23 / 39 cycles
100 vec4 instructions to be executed:
VLIW 4 CU: 408 / 808 / 1608 cycles
GCN CU: 807 / 807 / 1607 cycles
When the CUs in the GCN architecture are fed with at least 256 elements they then give higher performance than the VLIW 4 CUs, with an insignificant difference when four instructions can be processed in parallel but which can be close to 4x higher with scalar instructions executed in series! 3D rendering comes in on average somewhere between the results at vec3 and vec4. It has to be said that the AMD compiler performs particularly well to extract this parallelism thanks to all the experience acquired over the course of time. When under-fed however, the GCN CUs can give lower performance.
The new architecture comes into its own mainly on the compute side where the code lends itself less to vectorisation than 3D rendering. However, 3D rendering will gradually benefit from the GCN setup as it is increasingly developing away from easy vectorisation of colour and positioning processing. GCN will also free AMD up from working so intensively on its compiler and put these resources into other optimisations.
AMD has also added a real scalar processing unit to each CU which it will be able to use to deal with operations that don't have to be executed for each element of a group via the SIMDs, which can for example serve to optimise branching in certain cases. This unit will not be used for graphics languages themselves but may be used by the compiler.
There's still a unit for processing branching, extended for debugging messages.
GCN: caches and two ACEs for GPU computing
Although graphics remains at the heart of GCN development, GPU computing has also taken on a lot more importance. To prevent their GPUs from becoming confined to a few very specific usage scenarios, AMD and NVIDIA continue to make GPU usage easier. With Fermi, NVIDIA introduced numerous such developments and with GCN, AMD has followed suit.
Tahiti thus ushers in a new read/write cache structure. The texture cache of previous generations has developed towards an L1 cache of 16 KB which can be used both by the texturing units and the SIMDs. Moreover, each scalar unit has its own 4 KB L1 cache. This 4 KB cache is however implemented as a 16 KB cache shared between four Compute Units. This compromise has been made to reduce implementation costs. Tahiti therefore has a total of 40 L1 caches of 16 KB each.
They are connected with an access of 64 bytes per clock to the L2 cache that is made up of 128 KB partitions that are integrated into each of the six memory controllers. This L2 cache is now coherent and processes atomics much more efficiently than before.
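The cache totals described above can be tallied as in the sketch below, which assumes Tahiti's 32 Compute Units and takes the other figures from the text. Variable names are illustrative.

```python
COMPUTE_UNITS = 32
L1_KB = 16                # size of each L1 cache
CUS_PER_SCALAR_CACHE = 4  # one 16 KB scalar cache shared by four CUs
MEMORY_CONTROLLERS = 6
L2_PARTITION_KB = 128

vector_l1_caches = COMPUTE_UNITS                          # one per CU
scalar_l1_caches = COMPUTE_UNITS // CUS_PER_SCALAR_CACHE  # shared between CUs
total_l1_caches = vector_l1_caches + scalar_l1_caches
total_l1_kb = total_l1_caches * L1_KB
total_l2_kb = MEMORY_CONTROLLERS * L2_PARTITION_KB

print(f"L1: {total_l1_caches} caches, {total_l1_kb} KB in total")
print(f"L2: {total_l2_kb} KB in total")
```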
The shared memory of each CU, Local Data Share, is thus also up from 32 to 64 KB. To recap, the LDS is designed to share information within a block of elements to be processed and the Direct3D 11 spec requires a minimum of 32 KB. This memory has a direct read access to the L1 of its Compute Unit, which means it can be loaded with data without having to go through the SIMDs. This improves both performance and energy consumption.
There has however been no increase in general registers for the SIMDs in each CU: 256 vector registers of 2048 bits (64x 32 bits). The scalar unit also has 256 registers of 32 bits.
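To translate these register counts into capacities, here is a minimal sketch. It only computes the size of one register file as described; the text does not spell out whether the vector figure applies per SIMD or per CU, so no total is derived here.

```python
def register_file_bytes(registers, bits_per_register):
    """Capacity of a register file in bytes."""
    return registers * bits_per_register // 8

vector_bytes = register_file_bytes(256, 64 * 32)  # 256 registers of 2048 bits
scalar_bytes = register_file_bytes(256, 32)       # 256 registers of 32 bits

print(f"Vector registers: {vector_bytes // 1024} KB")
print(f"Scalar registers: {scalar_bytes // 1024} KB")
```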
Still looking at the memory sub-system, AMD has also implemented ECC protection for the SRAM (L1, L2 and registers) and the video memory. The implementation is probably similar to that on the NVIDIA GPU, which is to say that it consists in reserving part of the memory to store ECC data, also therefore reducing the available memory bandwidth.
After the cache, AMD looked into another problem that affects GPU Computing: multitasking and overhead. To this end Tahiti has three command processors. The main one, not represented on this diagram, can process all tasks, both graphics and compute. Beside it are two ACEs (Asynchronous Compute Engines) which are limited to compute tasks. With an evolved resource control, prioritisation and synchronisation system, they can simultaneously handle several contexts. They can, for example, provide for efficient use of GPU computing and 3D at the same time. In the future, it's also feasible that AMD could offload DirectX 11 compute shader processing from the main command processor to the ACEs, but this isn't yet in place. Could this be a possible optimisation for 3DMark 11?
To feed all these command processors, as is already the case with NVIDIA, AMD has added a second DMA engine to handle communication to and from the CPU.
Video Codec Engine and HDMI 1.4a 3 GHz
Two innovations in terms of connectivity have been introduced with the Southern Islands family. The first is support for HDMI 1.4a 3 GHz which, at long last, makes it possible to transport a 1080p 3D video stream at 60 Hz, or even 24 Hz per eye. Currently we're limited to 30 Hz. To be able to enjoy this, you also have to have a screen that can handle it, something which to our knowledge is not yet available. This optional HDMI standard also allows you to drive a 4K screen, such as that supported by DisplayPort 1.2. The second innovation comes with support for an independent audio stream for each video out, which means you can feed sound to several screens or get the audio to follow the image when it moves from one screen to another.
To improve video encoding on its GPUs, AMD has included a Video Codec Engine that enables encoding of an HD 1080p stream at more than 60 frames per second. The VCE is a fixed unit and improves the energy efficiency of the GPU.
It can be used in Full Mode or Hybrid Mode. In Hybrid Mode, the VCE only handles the Entropy Encode, with the other stages, which are simpler to parallelise, executed by the GPU processing units. On a high-end model, this provides for a significant increase in performance, but this won't especially be the case on a smaller GPU. AMD will not however be providing an encoder that can use the VCE straight away and we'll have to wait and see what developers of this type of tool come up with and if they're able to improve quality at all.
Finally, one more little innovation, the implementation of the QSAD instruction, a variant of SAD (Sum of Absolute Differences). QSAD combines SAD operations with alignment operations and is very useful for various image processing tasks (motion detection, 2D/3D conversion) for which it gives a jump in performance for lower energy consumption. It will be used in version 2.0 of Steady Video, the AMD image stabilisation algorithm.
PowerTune and ZeroCore Power
Of course Tahiti includes support for the PowerTune technology introduced with Cayman. To recap, this corresponds to some extent to Turbo on a CPU, allowing you to benefit from the available thermal envelope to a maximum. There is a small difference all the same. The GPU starts from the principle that unless its sensors are telling it something different, it can always function at its maximum clock. AMD therefore talks only about this core clock speed, though it isn't guaranteed to be the actual clock in all cases.
Without a technology such as PowerTune, its graphics cards would be likely to see their clocks fall significantly, by up to 30% according to AMD. Without a similar technology, NVIDIA finds itself in the position of having to cobble together an approximate software alternative for its GeForce GTX 500s that consists in monitoring the energy consumption of the card.
PowerTune is based on a multitude of load sensors in the various GPU blocks, with readings then compared in an energy consumption correspondence table. PowerTune doesn't measure to see if the GPU has passed its authorised limit but rather estimates as precisely as possible whether in the worst of cases it is going to do so. AMD defines what this 'worst case' is: a GPU with a lot of leakage and which runs at a high temperature.
A GPU with maximum energy consumption fixed at 250W won't be limited when it attains 250W but rather beforehand, when the worst sample would have reached 250W under the same conditions. This means that all graphics cards are limited in the same way and therefore will perform in the same way. Please note that the GPU clock is used as a parameter to estimate power consumption but GPU voltage isn't.
The Radeon HD 7970, like the Radeon HD 6970, is limited to 250W. However it has more sensors, which allows the GPU to estimate its energy consumption more precisely and thus avoid reducing its clock unless really necessary. You can still modify the limit by +/- 20% in the Overdrive control panel.
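As a quick illustration of the range this adjustment gives, here is a minimal sketch of the resulting limits. The function is hypothetical and simply applies a percentage to the 250W default; it is not a representation of how the Overdrive panel or the driver works internally.

```python
DEFAULT_LIMIT_W = 250
MAX_ADJUST_PCT = 20  # the control panel allows +/- 20%

def powertune_limit(adjust_pct, base_w=DEFAULT_LIMIT_W):
    """Return the power limit in watts after applying a percentage adjustment."""
    if abs(adjust_pct) > MAX_ADJUST_PCT:
        raise ValueError(f"adjustment must be within +/- {MAX_ADJUST_PCT}%")
    return base_w * (100 + adjust_pct) / 100

for pct in (-20, -10, 0, 20):
    print(f"{pct:+d}%: {powertune_limit(pct):.0f}W")
```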
With Tahiti and the Radeon HD 7970, AMD has introduced a new technology: ZeroCore Power. This puts the GPU into long idle when it no longer has to power the display, typically when there's a blank screen. The graphics card then consumes less than 3W according to AMD, a value we can confirm following our tests (we measured it at just 1.8W). The fan is then also turned off.
ZeroCore Power comes in useful when a system isn't being used constantly but needs to stay on. For example, an HTPC that is used for both gaming and file storage will benefit from the technology with a significant reduction in energy consumption. Another example could be a supercomputer equipped with GPUs. When the GPUs are not being used they can be switched to long idle.
ZeroCore Power will also be useful in multi-GPU systems as it goes much further than the Ultra Low Power State mode used on previous GPUs, which only reduced their consumption by a few watts when they weren't in use.
Specifications, reference Radeon HD 7970, overclocking
The processing and texturing power of the Radeon HD 7970 is up 40% on the Radeon HD 6970, while the memory bandwidth is up by 50%.
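The 40% figure can be verified from the unit counts and clocks. In this sketch the 925 MHz clock of the Radeon HD 7970 comes from this article, while the 880 MHz clock of the Radeon HD 6970 and the count of two floating point operations per unit per clock are assumptions added for the calculation.

```python
def tflops(units, clock_mhz, flops_per_clock=2):
    """Peak single-precision throughput in TFLOPS (assumes 2 flops per unit per clock)."""
    return units * clock_mhz * flops_per_clock / 1e6

def gtexels(tmus, clock_mhz):
    """Peak texturing rate in Gtexels/s (one texel per unit per clock)."""
    return tmus * clock_mhz / 1e3

hd7970 = tflops(2048, 925)
hd6970 = tflops(1536, 880)  # 880 MHz reference clock assumed for the HD 6970
compute_gain = hd7970 / hd6970 - 1
texturing_gain = gtexels(128, 925) / gtexels(96, 880) - 1

print(f"Radeon HD 7970: {hd7970:.2f} TFLOPS")
print(f"Radeon HD 6970: {hd6970:.2f} TFLOPS")
print(f"Compute gain: {compute_gain:.0%}, texturing gain: {texturing_gain:.0%}")
```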
The reference Radeon HD 7970
For this test, AMD supplied us with a reference Radeon HD 7970:
The Radeon HD 7970 is the same size as the Radeon HD 6970: double slot and 27.5 cm long. Its cooling system is similar but has been developed slightly. The blower has longer blades to increase the airspeed and the vapour chamber block / radiator extends 1.5 cm further towards the connectors. Its extremity is therefore very close to the hot air extraction grill, which has also been adapted to cover the full height of a PCI slot, and it's no longer necessary to send some of this hot air into the casing above the card. One of the two DVI ports therefore had to go, with the remaining one accompanied by an HDMI out and two mini-DisplayPort outs. An HDMI to DVI adaptor is supplied with the card.
The GPU is surrounded by a metallic structure designed to protect it and guarantee the rigidity of the packaging. Unfortunately this structure is slightly thicker than the die, which means that alternative cooling systems will have to be adapted to be compatible. Be very careful if you decide to replace the original cooler!
Probably for cost reasons, AMD has dropped the plate covering the back of the Radeon HD 6970 and 6950 2 GB. The design of the casing that covers the cooling system has however been modified to make it much more aesthetic than the one used for the Radeon HD 6900s, which had a rather austere look and a cheap feel. In general, we're not great fans of glossy designs but here it has been carefully made with quality materials.
The PCB is not that different to the one used for the Radeon HD 6970, but the power stage has been revisited to make it more powerful, with better quality components though still with 6 phases for the GPU. This increase in power isn't actually necessary and one of the six phases remains vacant. The design will however allow manufacturers who so desire to produce an overclocked model with more in reserve and it will also be possible to go from 8+6 pin power supply connectors to 2x 8 pins.
Two CrossFire X connectors are still available to enable tri and quad-GPU support and you still get the dual bios switch allowing you to return to the original bios.
Overclocking
Our card's GPU proved quite cooperative when it came to overclocking and we managed to take it from 925 to 1075 MHz. This sort of gain (16%) is unusual on high-end GPUs. We did of course check to make sure that PowerTune wasn't reducing the clocks by measuring performance in several games (in Furmark, where PowerTune does kick in, we were able to get up to "1125 MHz"). At 1075 MHz, we observed a gain of 10% in Battlefield 3 and 14% in Anno 2070.
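To relate the clock increase to the gains observed in games, here is a small sketch. The "scaling" ratio is simply our way of expressing how much of the clock gain shows up as extra performance; it is not a metric used in the tests themselves.

```python
BASE_MHZ = 925
OC_MHZ = 1075

clock_gain = OC_MHZ / BASE_MHZ - 1

# Performance gains observed at 1075 MHz, as reported above.
game_gains = {"Battlefield 3": 0.10, "Anno 2070": 0.14}

print(f"Clock gain: {clock_gain:.1%}")
for game, gain in game_gains.items():
    scaling = gain / clock_gain  # share of the clock gain converted into performance
    print(f"{game}: +{gain:.0%} ({scaling:.0%} of the clock gain)")
```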
PowerTune: impact on performance
Does the PowerTune technology limit Radeon HD 7970 performance in games? To get an answer we increased its limit by 20%, from 250W to 300W.
We also wanted to find out how close the Radeon HD 7970 was to its limit in practice. To find out, we checked performance by reducing the limit by 10% (225W) and then 20% (200W).
As you can see, increasing the limit by 20% had no impact on performance. We observed a small gain in Crysis 2, but it was very slight: 0.5%. When we reduced the limit by 10%, the difference was also negligible, with just a rather insignificant fall of 2.5% in Anno 2070, which is very resource hungry when all settings are pushed to a maximum.
There was however much more of an impact at -20%, with PowerTune kicking in much more often and taking more of a toll on performance than there was a reduction in the thermal envelope.
By default, the 250W limit imposed by PowerTune seems well adapted to the reference clocks. It could however limit performance if the card is significantly overclocked, in which case it's advisable to increase the thermal envelope.
Energy consumption and performance/watt
Energy consumption
We did of course use our test protocol that allows us to measure the energy consumption of the graphics card alone. We took these readings at idle on the Windows 7 desktop as well as with the screen in standby so as to check out the impact of ZeroCore Power:
At idle, on the Windows 7 desktop, the Radeon HD 7970 has somewhat lower energy consumption than the Radeon HD 6970. There is however an enormous difference between the two cards when the screen is blank.
We then measured energy consumption after several minutes in load in 3DMark 06 and in Furmark. Note that we use a version of Furmark that isn't detected by the stress test energy consumption limitation mechanism put into place by NVIDIA in the GeForce GTX 580 drivers. We also added the readings taken in Anno 2070, which is the game that seems to take the Radeon HD 7970 closest to the limit fixed by PowerTune:
In 3DMark and Furmark, the Radeon HD 7970 consumes 10 to 20W more than the Radeon HD 6970, which is probably due to the more accurate estimation made by PowerTune with the newer card. PowerTune kicks in for both cards, both in 3DMark 06 and Furmark. In Anno 2070 however energy consumption is similar for both Radeons, which gives the 7970 a much higher performance per Watt ratio. We have shown this graphically, with fps per 100W to make it more legible:
The Radeon HD 7970 thus shows itself to be much more energy efficient, 55% up on the Radeon HD 6970 and almost twice as efficient as the GeForce GTX 500s! Note however that each game is a particular case and that the Radeons are placed a little better than average in this game.
Noise and GPU temperature
Noise
To observe the noise levels produced by the various solutions, we put the cards in a Cooler Master RC-690 II Advanced casing and measured noise at idle and in load. We used an SSD and all the fans in the casing, including the CPU fan, were turned off for the reading. The sound level meter was placed 60 cm from the closed casing and ambient noise was measured at +/- 21 dBA.
Compared to the Radeon HD 6970, the Radeon HD 7970 makes less noise at idle but slightly more in load taking it above the GeForce GTX 580. As we will see, this probably comes from the calibration of the cooling system.
Temperature
Still in the same casing, we took a temperature reading of the GPU using internal sensors:
The Radeon HD 7970 cooling system is in effect calibrated to maintain the GPU at a lower temperature than the other reference cards: 80 °C. This may be a calibration error but it may also be that GPUs manufactured at 28 nanometres are less tolerant of high temperatures.
Infrared thermography
For this test, we used the test protocol a beta version of which was introduced for our GeForce GTX 580 report.
We corrected it slightly by replacing the original set of fans in the Cooler Master RC-690 II Advanced casing with Noctua models: an NF-P14FLX to suck air in and two NF-S12Bs for extraction. At idle they run at 600 RPM, while in load the 140mm ups its speed to 780 RPM and the 120mm to 990 RPM. This modification improves the cooling to noise ratio but above all removes the mechanical noise you get with the original fans. Although this noise didn't alter the sound pressure obtained during readings by much, it did make it difficult to register noise levels by ear, which is important when the graphics card fan is also producing a mechanical noise or when its speed varies.
We took advantage of this to introduce a noise reading for when the graphics card fan was running at maximum speed.
First of all, here's a summary of all the readings:
The internal temperatures are slightly lower with the Radeon HD 7970, which has pretty much eliminated the small amount of hot air that was previously sent into the casing over the top of the card. In load it is noisier because its blower runs faster. Although it runs at 47% of its maximum speed, this corresponds to 2500 RPM, which is exactly the speed of the Radeon HD 6970 fan at 40%.
Here's what our thermal imaging showed:
The Radeon HD 7970 is well cooled, with a power stage which doesn't heat up excessively. Note that the Radeon HD 6970 is masked by the plate that covers the back of the card.
The Radeon HD 7970 sends less hot air back into the casing than the Radeon HD 6970 and the GeForce GTX 580, which shows up in these images, especially in comparison with the GeForce GTX 580.
Theoretical performance: pixels
Texturing performance
We measured performance during access to textures of different formats in bilinear filtering: standard 32-bit (4x INT8), 64-bit "HDR" (4x FP16), 128-bit (4x FP32) and 32-bit RGB9E5, an HDR format introduced with DirectX 10 which makes it possible to store HDR textures in 32 bits with a few tradeoffs.
The GeForce GTX 500s can filter FP16/11/10 and RGB9E5 textures at full speed but the Radeon HD 6900s and the Radeon HD 7970 have such superior filtering power that even though they have to filter FP16 textures at half-speed, they aren't far behind the GeForces.
Note that we have to increase the energy consumption limit of the Radeon HD 6900s to a maximum here, otherwise the clocks are cut during this test. By default the Radeons therefore seem incapable of fully benefitting from their texturing power! The good news is that this is no longer the case for the Radeon HD 7970.
Fillrate
We measured the fillrate without and then with blending, with different data formats:
In terms of fillrate, the Radeons have an advantage over the GeForce GTX 580s, above all with FP10s, a format processed at full speed while with the GeForces it is processed at half-speed. Given the limitation of the GeForces in terms of datapaths between the SMs and ROPs, it's a shame that NVIDIA hasn't given its GPU the possibility of benefitting from FP10 and FP11 formats.
Like the GeForces, the Radeons can process FP32 single channel at full speed without blending, but retain this speed with blending. Thanks to its 384-bit memory bus and the additional bandwidth relative to the number of ROPs, the Radeon HD 7970 does significantly better with blending. The memory bus therefore does have a use!
Theoretical performance: geometry
Triangle throughput
Given the architectural differences between the various GPUs in terms of geometry processing, we obviously wanted to take a closer look at the subject. First of all we looked at triangle throughput in two different situations: when all triangles are drawn and when all the triangles are removed with back face culling (because they aren't facing the camera):
Although the Radeon HD 7900s and 6900s are indeed able to process 2 triangles per cycle, the GeForce GTX 580 retains the advantage with 4 triangles per cycle. When the triangles have to be rendered however, performance is reduced as NVIDIA has limited them to differentiate the Quadros and the GeForces.
Next we carried out a similar test but using tessellation:
While the Radeon HD 7970 doesn't really differentiate itself from the Radeon HD 6970 without tessellation, its performance is significantly better with it, though the GeForce GTX 580 still retains an advantage.
The architecture of the Radeons means that they can be overloaded by the quantity of data generated, which then drastically reduces their speed. Doubling the size of the buffer dedicated to the GPU tessellation unit in the Radeon HD 6800s meant they gave significantly higher performance than the Radeon HD 5800s. Parallelisation of geometric processing allowed the Radeon HD 6900s to close the gap a bit on the GeForces and the Radeon HD 7970 reduces the gap even more.
The GeForce GTX 580 benefits from an architecture that handles geometry in a distributed way, at the processing units level, which means it avoids the centralization of geometric amplification and the resultant overload that can ensue.
The test
For this article, we decided to review our test protocol so as to include some new games: Anno 2070, Batman Arkham City, Battlefield 3, F1 2011 and Total War Shogun 2. We have also added Project Cars, a game still under development and which hasn't yet been fully optimised. It won't therefore be used in our index, but it will be interesting to observe the performance of different cards with it.
We have decided no longer to use the level of MSAA (4x and 8x) as the main criterion for segmenting our results. Many games with deferred rendering offer other forms of antialiasing, the most common being FXAA, developed by NVIDIA. It therefore no longer makes sense to organise an index around a certain level of antialiasing, which in the past allowed us to judge a card according to its effectiveness with MSAA, which can vary according to implementation.
At 1920x1080, we carried out the tests with a very high level of quality on the one hand, which always included some antialiasing (either MSAA 4x, or FXAA/MLAA/AAA) and on the other hand at an extreme quality level. Tests were also carried out at 2560x1600 and with a surround resolution of 5760x1080 at very high quality.
We no longer show decimals in game performance results so as to make the graph more readable. We nevertheless note these values and use them when calculating the index. If you're observant you'll notice that the size of the bars also reflects this.
The Radeons were tested with the beta 8.921.2-111215a drivers and the GeForces with the 290.36 drivers.
Test configuration
Intel Core i7 980X (HT and Turbo off)
Asus Rampage III Extreme
6 GB DDR3 1333 Corsair
Windows 7 64-bit
GeForce 290.36 drivers
Catalyst beta 8.921.2.111215a
Benchmark: Anno 2070
Anno 2070 uses a development of the Anno 1404 engine which includes DirectX 11 support.
We used the very high quality mode on offer in the game and then, at 1920x1080, we pushed anisotropic filtering and post processing to a max, which made the game very resource hungry. We carried out a movement on a map and measured performance with Fraps.
The Radeon HD 7970 does particularly well in this first game with a gain of 47% over the Radeon HD 6970 at the very high quality level and 57% at maximum quality. It also has a very big advantage over the GeForce GTX 580.
The Radeon HD 7970s in CrossFire X are also very efficient with a much bigger gain than with the CrossFire X configuration with the Radeon HD 6970s.
The results are still excellent at high resolution.
Benchmark: Batman Arkham City
Batman Arkham City was developed with a recent version of Unreal Engine 3 which supports DirectX 11. Although this mode suffered a major bug in the original version of the game, a patch (1.1) has corrected this. We used the game benchmark.
All the options were pushed to a maximum, including tessellation, which was pushed to extreme in some of the scenes tested. Performance was measured with FXAA and with MSAA 8x. We also added performance obtained with GPU PhysX effects on.
Although there was a significant gain with the Radeon HD 7970 at FXAA, this wasn't the case at MSAA 8x, where it was limited to 16% over the Radeon HD 6970 and 10% over the GeForce GTX 580.
Note that CrossFire X doesn't work correctly in this game, even though it is a major title.
The GeForces perform best of course with GPU PhysX effects on, but their lead is reduced as these effects aren't used in the whole test scene. Where they are, the GeForces maintain 40 fps, whereas the Radeons dip under 20 fps.
At high resolution the Radeon HD 7970 gives a gain of as much as 61% over the Radeon HD 6970.
Benchmark: Battlefield 3
Battlefield 3 runs on Frostbite 2, probably the most advanced graphics engine currently on the market. A deferred rendering engine, it supports tessellation and calculates lighting via a compute shader.
We tested it at High and Ultra settings and recorded performance with Fraps, over a well defined route.
You get a 24% gain with the Radeon HD 7970 in High and 40% in Ultra mode. Note that in this mode, cards with just 1 GB of memory suffer from major jumpiness.
In surround, a Radeon HD 7970 CrossFire X system allows you to play Battlefield 3 very comfortably.
Although only in DirectX 9 mode, the rendering is pretty nice, based on version 3.5 of Unreal Engine.
All the graphics options were pushed to a max (high) and we measured performance with Fraps.
This time the Radeon HD 7970 doesn't give so much of a gain.
It does however manage 35% at 2560x1600, a gain which makes this resolution playable.
Benchmark: Civilization V
Pretty successful visually, Civilization V uses DirectX 11 to improve quality and optimise performance in the rendering of terrains thanks to tessellation, and implements a special compression of textures thanks to the compute shaders, which allows it to keep the scenes of all the leaders in memory. This second usage of DirectX 11 doesn't concern us here however, as we used the built-in benchmark on a game map. We zoomed in slightly so as to reduce the CPU limitation, which has a big impact in this game.
All settings were pushed to a max and we measured performance with shadows and reflections. The latest patch was installed.
The Radeon HD 7970 is 50% up on the Radeon HD 6970 here, which is enough to overtake the GeForce GTX 580 which had a big lead in this game.
Civilization V supports surround, but unfortunately the benchmark doesn't.
Benchmark: Crysis 2
Crysis 2 uses a development of the Crysis Warhead engine, optimised for efficiency, and adds DirectX 11 support via a patch, which can be quite demanding. This is the case, for example, with tessellation, implemented abusively in collaboration with NVIDIA with the aim of causing Radeon performance to plummet. We have already looked into this issue in a previous article.
Although we have presented the results obtained with tessellation for information, as it's interesting to see what the Radeon HD 7970 brings to the table here, we only used the results without tessellation in the calculation of the index.
We measured performance with Fraps on version 1.9 of the game.
The higher you push the graphics settings, the higher the gain given by the Radeon HD 7970. It increases to 56% when the abusive tessellation is on.
The two Radeon HD 7970s in CrossFire X allow you to play Crysis 2 very comfortably in surround, even at high graphics settings.
Benchmark: F1 2011
The latest Codemasters title, F1 2011 uses a slight development of the F1 2010 and DiRT 3 engine, which retains DirectX 11 support.
We pushed all the graphics options to a max and we used the game's own test tool on the Spa-Francorchamps circuit with a single F1 car.
At 1920x1080, there was a 30% gain. Note that the AMD solutions are limited by the CPU much sooner than the NVIDIA solutions.
At very high resolution, there's a gain of 45%.
Benchmark: Metro 2033
Still one of the most demanding titles, Metro 2033 forces all recent graphics cards to their knees. It supports GPU PhysX but only for the generation of particles during impacts, a rather discreet effect that we therefore didn't activate during the tests. In DirectX 11 mode, performance is identical to DirectX 10 mode but with two additional options: tessellation for characters and a very advanced, very demanding depth of field feature.
We tested it in DirectX 11 mode, at a very high quality level with tessellation on.
At 1920x1080, the gain over the Radeon HD 6970 varies between 25 and 30% but is very small in comparison to the GeForce GTX 580.
The Radeon HD 7970 enjoys a somewhat larger gain here, but for gaming in these conditions, you need to go for a CrossFire X system.
Benchmark: Project Cars
Project Cars is a car racing game currently under development. A participative build system gives you regular access to builds and you can interact with the developers at Slightly Mad Studios (who also initially developed Need For Speed Shift). Although we haven't included these results in the final index, it is nevertheless interesting to see how the different graphics cards handle themselves in a game under development.
Its deferred rendering engine supports DirectX 11 and this is the mode we tested, pushing all settings to a max.
The GeForces do much better than the Radeons here. Note that in CrossFire and SLI, some small bugs are still visible in the reflections.
At high resolution, the Radeons do better.
Benchmark: Total War Shogun 2
Total War Shogun 2 has a DirectX 11 patch, developed in collaboration with AMD. Among other things, it gives tessellation support and a higher quality depth of field effect.
We tested it in DirectX 11, at max quality.
The Radeon HD 7970 has an impressive advantage over the GeForce GTX 580 here: +53%. So it does pay to work with developers! Except at MSAA 8x, where the results aren't quite as good: +12% only.
At 2560x1600 the Radeon HD 7970's advantage over the GeForce GTX 580 is 62%.
Performance recap
Although individual game results are obviously worth looking at when you want to gauge performance in a specific game, we have also calculated a performance index based on all tests with the same weight for each game. The results for Batman Arkham City with GPU PhysX, Crysis 2 with tessellation and Project Cars haven't been taken into account for the index.
We set an index of 100 to the Radeon HD 6970 at 1920x1080:
The Radeon HD 7970 gives an average gain of 36 to 43% over the Radeon HD 6970, depending on the resolution. The gain over the GeForce GTX 580 is 22% at 1920x1080 and 31% at 2560x1600.
This isn't enough to overtake the GeForce GTX 590 and the Radeon HD 6990, nor the mid-range multi-GPU systems: the Radeon HD 6870s and GeForce GTX 560 Tis. These solutions only have a slight advantage over the 7970 however and as far as we're concerned don't therefore justify all the outlay for a multi-GPU system. Moreover, they suffer slowdowns in certain games as their 1 GB memory is insufficient.
The Radeon HD 7970s in CrossFire X are at the head of the field as might be expected, even though they're affected by a significant CPU limitation in F1 2011 and by a non-functioning CrossFire X profile for Batman Arkham City.
During all these tests we noted a tendency with the Radeons: they tend to suffer more than the GeForces when the deferred rendering engines use MSAA. It's a complex task to use this type of antialiasing in a deferred rendering engine. There's an example in our report: Understanding 3D rendering step by step with 3DMark 11. Although we can't see any technical reasons for this, we imagine that AMD has reduced its optimisation efforts, preferring to highlight the antialiasing carried out during post processing, such as FXAA or MLAA, which is simpler to support.
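For readers who want to reproduce this kind of index with their own numbers, here is a minimal sketch. It assumes the index is the arithmetic mean of per-game ratios against the reference card (which gives each game the same weight), and it uses made-up frame rates and hypothetical game and card names; the exact formula behind our graphs may differ in its details.

```python
from statistics import mean

# Hypothetical frame rates (fps) at a single resolution. These numbers are
# made up for illustration and are NOT results from this review.
FPS = {
    "Reference card": {"Game A": 40.0, "Game B": 60.0, "Unfinished game": 30.0},
    "New card": {"Game A": 60.0, "Game B": 75.0, "Unfinished game": 20.0},
}

# Tests left out of the index, in the same spirit as the exclusions above.
EXCLUDED = frozenset({"Unfinished game"})


def performance_index(fps, reference, excluded=frozenset()):
    """Equal-weight index: the reference card scores 100.

    ASSUMPTION: each game contributes the ratio card_fps / reference_fps,
    and the index is the arithmetic mean of those ratios times 100.
    Unrounded fps values are used, as in the article's methodology.
    """
    games = [game for game in fps[reference] if game not in excluded]
    index = {}
    for card, results in fps.items():
        ratios = [results[game] / fps[reference][game] for game in games]
        index[card] = 100 * mean(ratios)
    return index


def for_display(index):
    """Round to whole numbers for the graph, keeping decimals elsewhere."""
    return {card: round(value) for card, value in index.items()}


if __name__ == "__main__":
    idx = performance_index(FPS, "Reference card", EXCLUDED)
    for card, value in idx.items():
        print(f"{card}: {value:.1f}")
```

Normalising per game before averaging stops a title that runs at very high frame rates from dominating the result. To compare several resolutions against a single baseline, the same ratios can be taken against the reference card's 1920x1080 results.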
Conclusion
With a new architecture, the introduction of the 28 nm manufacturing process and over four billion transistors, the Tahiti GPU and the Radeon HD 7970 have been the cause of much excitement. Has AMD managed to do as well as it did with the launch of Cypress and the Radeon HD 5870? This is a difficult question and no doubt opinion will be divided as the performance gains are lower and new features less dramatic. With an average gain of 45 to 50% over the previous GPU, the introduction of a major new API, DirectX 11, and new features such as Eyefinity, the Radeon HD 5870 caused quite a splash when it was launched.
The Radeon HD 7970 offers an average performance gain of 35 to more than 40% over the Radeon HD 6970, the card that preceded it. It therefore represents a nice improvement, which can even be over 50% in some games such as Crysis 2 or Batman Arkham City. This makes it an ideal card for gaming at very high resolution or at 1080p with the most demanding games and extreme quality options. Moreover, AMD has included all the available new technologies: DirectX 11.1, OpenCL 1.2, PCI Express 3.0, HDMI 1.4a 3 GHz... Nothing major of course, but it's all there, including some probably costly but important developments in terms of GPU computing, whose impact will take time to measure.
The fact that a new architecture has just been introduced also means that it's reasonable to suppose that there's still room for improvement, whether from the compiler or directly from developers, whose code was optimised for the architecture of previous Radeons in ways that may now be counter-productive.
Thanks to the 28 nm process and PowerTune technology, which now works more subtly, AMD has managed to include all this while remaining within a 250W thermal envelope. This is very good news, especially as AMD has reduced energy consumption at idle, above all with a blank screen, to a point where the graphics card then consumes under 2W. For systems that are on all the time and require bursts of graphics power every now and again, such as an HTPC which combines the roles of file server and games platform, it would appear to be ideal.
Of course, it remains to be seen how NVIDIA will reply, what it has up its sleeve and of course when it manages to launch its offering. We aren't taking too many risks in supposing that the forthcoming NVIDIA high-end GPU, Kepler, will outdo Tahiti in performance terms. However we don't know when it will arrive. In two months? Three months? Six months? Will it once again break all energy consumption records? Or will it be capable of significantly improving its energy efficiency as NVIDIA has been hinting? Will it make up the lost ground in terms of the handling of video outputs? We don't know.
In the meantime, AMD can enjoy the impact of the Radeon HD 7970, which is more technically advanced than the GeForce GTX 580 in a number of ways, in addition to offering a performance hike of between 20 and 30%. The Radeon HD 7970's pricing reflects its domination: €500. This is expensive but it makes sense given that volumes are still limited.
Copyright © 1997-2013 BeHardware. All rights reserved.