Report: The Radeon HD 4870 & 4850 - BeHardware
>> Graphic cards

Written by Damien Triolet

Published on June 25, 2008

URL: http://www.behardware.com/art/lire/725/


Page 1

Introduction



AMD comes back in force with two new Radeons. Indeed, the 4800 series looks more than promising, and not only on paper. Several weeks ago, we were able to verify this by previewing the performance of the smaller model, the Radeon HD 4850. We now give you a complete test which includes the Radeon HD 4870 as well as a detailed analysis of its architecture.

Doing a lot with little
Unlike Nvidia, which has the wind in its sails, and despite a rather good Radeon HD 3800, AMD has had difficulty selling its Radeons, especially in the mid-range and high end. Nvidia largely dominates these segments, in sales as much as in image. As you may already know, it is very expensive to compete on this terrain.


Meanwhile, multi-GPU systems have matured and are now a solution that can reasonably be relied upon for the very high end. As we have said on several occasions, and as tests of the GeForce GX2 and Radeon X2 cards have shown, it is not the ideal solution, but in the absence of better options it makes it possible to fill a potential gap in this segment.

With its limited resources, AMD has had to concentrate on more efficient projects. They do not have the means to play at “who has the biggest GPU” and therefore it’s obvious that developing an enormous GPU as Nvidia has done with the GT200 and the GeForce GTX 200 is not on the program. So what’s left to AMD? Be content with the low end? Luckily, this isn't the case.

AMD decided to concentrate on the "performance" segment, a domain generally shared by big mid-range models and former high end cards. Between 150 and 250 €, it can be considered the "good deal" segment. AMD thus took the foundation of its HD 2000/3000 architecture and tried to get the most out of it within a "performance GPU" budget or, in other words, for an average-sized chip, very far from the 600 mm² of Nvidia’s GT200.


The RV770, developed with the above in mind, measures only 260 mm². In spite of this, it packs no fewer than 956 million transistors, compared to the RV670 of the Radeon HD 3800, which measures 190 mm² and has 666 million transistors. So with roughly 40% more, what was AMD able to do?
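The "40% more" figure can be checked against the numbers quoted above. The short Python sketch below is only an illustration: the variable and function names are ours, and the inputs are the die sizes and transistor counts given in this article.

```python
# Figures quoted in the article (RV670 = Radeon HD 3800, RV770 = Radeon HD 4800).
rv670 = {"area_mm2": 190, "transistors_m": 666}
rv770 = {"area_mm2": 260, "transistors_m": 956}


def percent_more(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1) * 100


area_gain = percent_more(rv770["area_mm2"], rv670["area_mm2"])
transistor_gain = percent_more(rv770["transistors_m"], rv670["transistors_m"])

# Transistor density, in millions of transistors per mm².
density_rv670 = rv670["transistors_m"] / rv670["area_mm2"]
density_rv770 = rv770["transistors_m"] / rv770["area_mm2"]

print(f"Die area:    +{area_gain:.1f}%")
print(f"Transistors: +{transistor_gain:.1f}%")
print(f"Density:     {density_rv670:.2f} -> {density_rv770:.2f} M transistors/mm²")
```

Both ratios land close to the 40% mentioned in the text: a little under 37% more die area and a little under 44% more transistors, which also implies a slightly higher transistor density on the RV770.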


Page 2
Architecture: SIMT, SIMD, MIMD, Radeon HD

SIMT vs. SIMD vs. MIMD
With the GeForce 8, Nvidia introduced an architecture that made a complete break from the past. Gone were the enormous MIMD vector units, from which it was sometimes difficult to extract maximum performance. Instead, the choice was made in favor of scalar units. While at the implementation level these are 256 bit wide (8 x 32 bits) SIMD units (like SSE), at the functional level each one applies one 32 bit operation to 8 threads/elements per cycle rather than eight 32 bit operations to 1 thread/element. For this reason, in practice and from the outside, these units behave as scalar units.

To highlight the difference with SIMD (Single Instruction Multiple Data), Nvidia speaks of SIMT (Single Instruction Multiple Threads). While the units are similar, SIMT naturally maximizes their use if the task is massively parallel, as happens to be the case in 3D rendering. The advantage of SIMT is that the programmer doesn’t have to do anything for this to be the case, while with SIMD the programmer and compiler have to strive to fill the vector unit, which isn’t always easy. This is because an identical instruction has to be executed several times on different data of the same thread, and these operations have to be independent so that they can be processed in parallel.

MIMD (Multiple Instructions Multiple Data) as it is used by AMD in the Radeon HD 2000/3000 is more flexible because the first constraint disappears; however, the second one is still of great importance.

Of course, SIMT isn't the ultimate solution because there are always compromises. It is more efficient but also requires more transistors, more die area and more power, because more complex control logic is needed. On the other hand, SIMD and MIMD make it possible to place more calculation units in the GPU, although to the detriment of efficiency. For the Radeons, it is in reality more of a mix between SIMT and MIMD.

The GeForce 8 was thus created with only 128 scalar units while the Radeon HD 3870 has 64 vec5 units or the equivalent of 320 scalar units. The higher efficiency of SIMT of course is not sufficient to compensate for this difference. On the other hand, Nvidia has managed to implement double pumped-type calculation units, or in other words, those that function at double speed compared to the scheduler.
Radeon HD 2000 and 3000 architecture
The heart of the Radeon HD 2900 and 3800 is based on four large blocks of calculation units that we will call multi-processors in order to draw the comparison with GeForce 8/9/GTX architecture. To be more precise, each multi-processor has its own scheduler, a large general register file and 16 vec5 processors or vec4+1. This is the equivalent of 80 scalar processors.

The first architectural difference is that AMD has one set of control logic (scheduler, etc.) for 80 calculation units, versus one for 8 calculation units (+ SFU) for Nvidia. AMD thus has a much lower control cost per calculation unit. On the other hand, AMD must work on bigger groups of elements (64 versus 16 or 32 for Nvidia) and must use 4+1 processors rather than scalar ones.

In this way, it’s a mix between a SIMT and MIMD. A multiprocessor is a 16-way SIMT because it executes the same instructions on 16 threads in parallel but it is also 5-way MIMD because it can execute up to 5 instructions at the same time on different data on these threads.

We specify vec4+1 because a Radeon processor is in reality composed of a MIMD unit based around a FMAD vec4 and a big scalar unit that can handle all instructions (except the scalar product). In other words, special instructions and integer instructions must go through this unit.

In parallel to these four big multiprocessors, the Radeon has four blocks of texturing units which are entirely decoupled. Each is capable of sampling and filtering four 4D texels in FP16 (HDR) via its four main texturing units and can access four supplementary 1D texels, although without filtering them. Given that they are decoupled, these texturing units are not connected to any particular multiprocessor and they can thus always be used even if a multiprocessor does not need them.

The only thing is that compared to a GeForce 9800 GTX’s 64 filtering units, the 16 of the Radeon HD 3870 are a bit light and the 16 supplementary 1D samplers do not change the situation.


Page 3
Architecture: Radeon HD 4800

Radeon HD 4800 architecture
With the Radeon HD 4000, and more particularly the RV770, AMD had to start from the basic architecture of previous generations; otherwise development would have taken too long and all the investment on the software side would have been lost. For this reason, the foundation is the same: there are 16-way multiprocessors in which each of the 16 processors is a MIMD vec5 calculation unit.


A processor and its 5 calculation units.

There was, however, one small improvement: AMD extended integer instruction support to all sub-units. While the fifth unit remains more evolved with its support of special functions, integer support is now integrated into all calculation units. AMD worked mostly on simplifying the hardware implementation of these multiprocessors and, while maintaining a similar throughput (which is even better in operations involving integers), managed to significantly reduce their size.

AMD then took a closer look at its texturing units, which were too big and not overly efficient despite being decoupled. They therefore decided to abandon this more refined solution and to attach one block of filtering units to each multiprocessor, for its exclusive use. Still thinking in terms of economy, AMD took a step backwards in two areas: the scalar samplers were removed from the design and FP16 filtering was reduced to half speed. The blocks of texturing units thus became 70% lighter. To partly compensate for the different losses, AMD reviewed the structure of its caches to get as much as possible out of what "remained".


2 multiprocessors and their texturing units.

AMD was thus able to make its base architecture much more economical. According to rumors we heard, AMD initially planned to implement 6 multiprocessors, or 96 vec5 processors and 24 texturing units. This is the equivalent of 480 scalar processors if we were to compare with the GeForce. However, the size of the chip is fixed by the pin-out (its connections), and AMD "unfortunately" did its job too well: the units took up less space than planned and part of the die was left unoccupied.

So how did they fill this space? By adding four more multiprocessors. AMD had to push a little, but they eventually fit. For this reason, we have the RV770 which has no less than 160 vec5 processors, or 800 scalar processors and 40 texturing units. This is a sure step forward compared to the previous generation.
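The unit counts quoted for the different configurations all follow from the same multiplication. The sketch below is a minimal illustration in Python; the function names are ours, and the figure of 4 texturing units per RV770 multiprocessor is simply derived from the 40 units for 10 multiprocessors (and 24 for 6) given above.

```python
PROCESSORS_PER_MULTIPROCESSOR = 16   # each multiprocessor is 16-way
LANES_PER_PROCESSOR = 5              # vec5 (vec4+1) processor
TEXTURE_UNITS_PER_MULTIPROCESSOR = 4 # RV770: 40 units / 10 multiprocessors


def vec5_processors(multiprocessors):
    """Number of vec5 processors for a given multiprocessor count."""
    return multiprocessors * PROCESSORS_PER_MULTIPROCESSOR


def scalar_equivalent(multiprocessors):
    """Equivalent number of scalar units, as used to compare with the GeForce."""
    return vec5_processors(multiprocessors) * LANES_PER_PROCESSOR


def rv770_texture_units(multiprocessors):
    """Texturing units when one block of 4 is attached to each multiprocessor."""
    return multiprocessors * TEXTURE_UNITS_PER_MULTIPROCESSOR


configs = {
    "Radeon HD 3870 (4 multiprocessors)": 4,
    "RV770 as rumored (6 multiprocessors)": 6,
    "RV770 as shipped (10 multiprocessors)": 10,
}

for name, count in configs.items():
    print(f"{name}: {vec5_processors(count)} vec5, "
          f"{scalar_equivalent(count)} scalar equivalents")

print("RV770 texturing units:", rv770_texture_units(10))
```

This reproduces the 320, 480 and 800 scalar equivalents mentioned in the text, as well as the 160 vec5 processors and 40 texturing units of the final RV770.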


A big defect of the previous generation was its ROPs. Their efficiency was questionable, especially with antialiasing. Although AMD has admitted there is a problem (but never openly revealed its details), it seems evident that a part of the ROPs does not function and that this weighs down performance with antialiasing. We believe this is due to the MSAA resolve (or downsampling) unit, whose job it is to bring the image back to display size, filter it and make aliasing disappear.


With the RV770, AMD introduces new ROPs and says that it has remedied the problem, though once again without giving us the details. In addition, it has doubled the speed of its ROPs with antialiasing, in FP16 and in Z-only. While their number remains at 16, they have significantly gained in capacity.

Finally, and this was a surprise to us, AMD has abandoned the ring bus that it was so proud of in previous generations. While the ring bus is indeed a very elegant solution, it requires many more transistors than a classic memory controller while bringing few gains in practice. AMD comes back to a simpler system whose efficiency has nevertheless been improved. The external memory bus has not changed and remains 4 x 64 bits, or 256 bits.


Page 4
AMD, the first to feature GDDR5

AMD, the first to feature GDDR5
AMD has a strategic advantage over Nvidia in terms of new types of memory support because Joe Macri, the president of the JEDEC committee (an organization in charge of DRAM memory development), is one of its employees. So it isn't that surprising to see AMD be the first on the market with a product that integrates this new technology.


Joe Macri is Director of Technology at AMD… and in parallel is the president of the JEDEC committee which is involved in defining DRAM memory standards.

So what does GDDR5 add? More or less the same thing as each new type of memory: lower power consumption and easier clock increases. Voltage is lower and the clock can very easily be reduced significantly, which saves a few watts. However, for high end cards, this isn’t the main interest. The increase in frequency is of direct benefit because it means higher bandwidth, which puts the GPU more at ease, especially when antialiasing is used.

To enable the increase in frequency, GDDR5 first of all keeps the 8 bit prefetch of GDDR4, unlike the 4 bit prefetch of GDDR3. This means that more memory banks function in parallel and consequently the bandwidth is larger at identical frequencies. The same process took place with the transition to DDR and then in turn to DDR2:

  • SDR: 1 memory bank
  • DDR: 2 memory banks
  • DDR2: 4 memory banks
  • GDDR3: 4 memory banks
  • GDDR4: 8 memory banks
  • GDDR5: 8 memory banks

DDR, DDR2, GDDR3, GDDR4 and GDDR5 all send 2 bits per cycle, on the rise and fall of the signal. The difference is that memory which uses more banks can increase the speed of data transmission without pushing the frequency of the memory banks too much. In the case of the 1100 MHz GDDR3 used with the GeForce GTX 280, memory banks run at 550 MHz, a very high speed that is more or less the limit. GDDR4 or GDDR5 at 1100 MHz is content with memory banks at 275 MHz, which poses less of a problem and leaves a considerable margin for evolution.

    So why haven’t we seen GDDR4 increase in frequency and become more common? Obviously, this wasn’t as simple as planned: while increasing the frequency of the memory banks wasn’t a problem, increasing that of the transmission of commands and addresses was. For this reason, the main innovation that comes with GDDR5 memory is that the frequency at which commands and addresses are sent is divided by two compared to the frequency of data transfer, all while modifying the communication protocol in order to leave enough margin for improvement.

    In this way, for "3600 MHz" or 3.6 Gbps GDDR5 memory, the frequencies are:

  • Memory banks: 450 MHz
  • Sending commands and addresses: 900 MHz
  • Sending data: 1800 MHz

    All of this is of course in DDR mode, so this is still DDR-type memory.
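The relationship between the advertised data rate, the prefetch and the internal frequencies can be summarized in a few lines. The following Python sketch is a simplified model built only from the figures in this article (prefetch of 4 or 8, 2 bits per cycle, command/address clock at half the data clock for GDDR5, 256 bit bus); the function names are ours and real devices involve more detail than this.

```python
def bank_frequency_mhz(data_rate_mbps, prefetch):
    """Internal memory bank frequency for a given per-pin data rate and prefetch."""
    return data_rate_mbps / prefetch


def gddr5_clocks(data_rate_mbps):
    """Frequencies for GDDR5 as described in the text (8 bit prefetch)."""
    data_clock = data_rate_mbps / 2          # DDR: 2 bits per cycle
    return {
        "banks_mhz": bank_frequency_mhz(data_rate_mbps, prefetch=8),
        "command_address_mhz": data_clock / 2,  # half the data clock
        "data_mhz": data_clock,
    }


def bandwidth_gb_per_s(data_rate_mbps, bus_width_bits):
    """Peak bandwidth in GB/s for a per-pin data rate and a bus width."""
    return data_rate_mbps * bus_width_bits / 8 / 1000


# "3600 MHz" (3.6 Gbps) GDDR5, as in the list above.
print(gddr5_clocks(3600))

# 1100 MHz memory sends 2 bits per cycle, i.e. 2200 Mbps per pin.
print("GDDR3 banks:", bank_frequency_mhz(2200, prefetch=4), "MHz")
print("GDDR4/5 banks:", bank_frequency_mhz(2200, prefetch=8), "MHz")

# Peak bandwidth of 3.6 Gbps memory on a 256 bit bus.
print("Bandwidth:", bandwidth_gb_per_s(3600, 256), "GB/s")
```

Under these assumptions the model gives back the 450/900/1800 MHz of the list above and the 550 MHz versus 275 MHz bank frequencies quoted for 1100 MHz memory. On a 256 bit bus, 3.6 Gbps memory works out to a peak of about 115 GB/s.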

    Another small interesting detail is that memory has an error detection device, which in the case of error, can even decide by itself to recalibrate frequencies which may no longer be perfectly synchronized. This should leave more of a margin for overclocking.


    Finally, GDDR5 will lower the cost of graphics cards in the long run by simplifying the PCB. Because it is more tolerant, there is no longer any need for complex traces. You can see the difference between GDDR3 memory traces on the left and those of GDDR5 on the right.


    Page 5
    Pixel, Vertex and Geometry Shader performances

    Pixel Shader performances
    We tested two relatively simple lighting shaders which represent a good compromise between theoretical and actual speeds:


    The GeForce GTX 280 is the fastest here but the Radeon HD 4870 posts an enormous gain compared to the previous generation.
    Vertex Shader performances
    We tested performances in T&L, VS 1.1, VS 2.0 and VS 3.0 in RightMark:


    Unified architecture enables recent GPUs to allocate all resources to the processing of vertex shaders, which can mean a significant gain. This gain could be even bigger, but it is limited by the GPU’s triangle processing speed, which is 1 triangle per cycle on all the GeForces tested here. It is 0.5 triangle per cycle for the Radeon HD 3870, whereas the Radeon HD 2900 XT managed 1 per cycle, the rate at which the Radeon HD 4870 now operates. The higher frequency of the GeForce 9800 GTX is therefore to its advantage, making this GPU the most powerful we have seen in terms of (simple) geometry processing.
    Geometry Shader performances
    Contrary to Nvidia, AMD has integrated a generalized cache for reading/writing in memory from the shader core. It can be used in a classic manner for Stream Output which consists, as required by DirectX 10, of being able to write data that comes out of the shader core without having to go through the ROPs. It also enables the virtualization of general registers which can thus be unlimited.

    Another use is to employ video memory, through this cache, to temporarily store a potentially enormous mass of data created by Geometry Shaders during the amplification of geometry. Without this, calculation units could be blocked due to a lack of memory to store the new data.

    Nvidia takes the problem from the other end and, instead of offering an extended register memory, reduces the number of elements processed in parallel to an amount that always leaves enough memory in the GPU to store new data. In other words, instead of using 128 or 240 processors to process a geometry shader, if Nvidia detects that there could be a problem, this number is reduced. We do not know exactly at what point Nvidia reduces parallel processing, but this clearly seems to be a very big difference between Nvidia and AMD, with an advantage for the latter. This is true even if developers are careful not to use it in problematic cases.

    To compensate for this, Nvidia has strongly increased (sixfold) the size of its cache at the output of geometry shaders. AMD has also increased the size of its cache because it is far more efficient to keep everything on the GPU instead of going through video memory. We observed performance in a tessellation demo based on geometry shaders provided by AMD at the launch of the Radeon HD 2900 XT:


    As you can see, even if the GeForce GTX 200 significantly improves performance compared to the GeForce 8 and 9, the Radeons are well in the lead. Nvidia says that it has increased the cache in relation to what developers use and will use in the mid-term, while AMD specifies having implemented a "fast path" for geometric amplification.

    In addition to this, of course the Radeon HD 4800 keeps its tessellation unit even if it is still not used by any game.


    Page 6
    Texturing and ROP performances



    Texture access performances
    Performance was measured in the access of textures of different formats with bilinear and trilinear filtering. We kept the results in classic 32 bits (4x INT8), 64 bit "HDR" (4x FP16) and 128 bits (4x FP32). For comparison, we added performance in 32 bit RGB9E5, a new HDR format introduced by DirectX 10, which enables storing HDR textures in 32 bits with a few compromises. These tests were carried out with a tool provided by our colleagues and friends at Beyond 3D.


    You will notice the obvious difference between the GeForce 8800 Ultra and GeForce 9800 GTX. The latter is capable of filtering 32 bit textures twice as fast thanks to the presence of more address units. The GeForce GTX 280 is well ahead of the GeForce 9800 GTX, even though their theoretical speeds are very close at 48.2 GTexels/s and 43.2 GTexels/s respectively. In other words, Nvidia has indeed improved the efficiency of its texturing units, which goes from 78% to 98%. Not bad.

    For the Radeon HD 4870, we noted several things. First of all, while its maximum speed is 30 GTexels/s, we only obtained 24. The reason is that the RV770 only has 32 interpolators, which cannot feed the 40 texturing units when they work from interpolated addresses.
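The gap between 30 and 24 GTexels/s follows directly from the unit counts. The Python sketch below is a simplified model, assuming that each texturing unit and each interpolator handles one texel address per clock; the clock is not taken from a specification sheet but derived from the 30 GTexels/s and 40 units quoted above, and the function names are ours.

```python
TEXTURE_UNITS = 40
INTERPOLATORS = 32
THEORETICAL_GTEXELS = 30.0  # maximum speed quoted in the text

# Clock implied by 30 GTexels/s spread over 40 texturing units.
clock_mhz = THEORETICAL_GTEXELS * 1000 / TEXTURE_UNITS


def gtexels_per_s(units, mhz):
    """Texel rate if every unit handles one texel per clock."""
    return units * mhz / 1000


# With interpolated addresses, the slower of the two stages sets the pace.
limiting_units = min(TEXTURE_UNITS, INTERPOLATORS)
interpolator_bound = gtexels_per_s(limiting_units, clock_mhz)

print(f"Implied clock:            {clock_mhz:.0f} MHz")
print(f"Texturing unit limit:     {gtexels_per_s(TEXTURE_UNITS, clock_mhz):.1f} GTexels/s")
print(f"Interpolator-bound limit: {interpolator_bound:.1f} GTexels/s")
```

In this model, 32 interpolators at the implied 750 MHz give exactly the 24 GTexels/s that was measured.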

    Next, while the filtering of RGB9E5 textures is carried out at full speed, the speed of filtering FP16 (HDR 64 bits) textures is cut in half. This time, however, we were able to measure the maximum speed. While it is higher than that of the Radeon HD 3870, we should keep in mind that this card only has 16 texturing units versus the 40 of the Radeon HD 4870.

    Given the number of texturing units and interpolators, we obtained 100% of theoretical speeds. This is surprising and also shows the effort made on the output of these units.
    ROP performances
    The GeForce GTX 280 has 32 ROPs versus the 24 of the GeForce 8800 Ultra and the 16 of the GeForce 9800 GTX and Radeon HD 3870 and 4870. As a reminder, ROPs are units devoted to the last step in the processing of pixels (mixing colors, anti aliasing, compression and writing data to memory). The size of the memory bus is partly related to the number of ROPs.

    You may remember that, not content with increasing the quantity, Nvidia also improved efficiency for Z-only passes on the GeForce 8. Without antialiasing, AMD is very far behind in terms of speed at this level:


    GeForces are very fast here, but with the Radeon HD 4870, AMD has doubled speeds as soon as anti-aliasing is activated. While GeForce performances plunge with antialiasing 8x, this isn’t the case for the Radeons.

    Next, again we use a tool provided by our colleagues at Beyond 3D in order to test the speed of ROPs when writing pixels in memory first in a classic manner and then with a mix of colors (blending), notably used for transparency effects.


    With the GeForce and Radeon HD 3870, results are logical and consistent with the number of ROPs. 64 bits is half as fast as 32 bits and 128 bits is in turn half this speed. As for 32 bit "FP10", it is handled in the same way as FP16 and, unfortunately, does not have a higher speed. Fortunately, this isn’t the case for the Radeon HD 4870, which can process it more quickly.

    On the other hand, while FP32 seems to be twice as fast, oddly FP16 is a bit slower. This is probably due to an error related to drivers.


    Once blending is used, we noticed a net gain for the GeForce GTX 280 which, contrary to the GeForce 9800 GTX, benefits from the implementation of this function at full speed.

    With the Radeon HD 4800, it seems that AMD has modified the blending capacities of its ROPs. Thus, while the Radeon HD 3800 is capable of eight FP16 and two FP32 blendings per cycle, the Radeon HD 4800 is only capable of four FP16 blendings but also four FP32 blendings per cycle. Therefore, AMD placed an FP32 blending unit (and also capable of FP16 blending) per block of 4 ROPs instead of two FP16 units used to (slowly) handle FP32.


    Page 7
    Branching performances

    Branching performances
    One of the main innovations introduced with the evolution of GPU programmability was dynamic branching. It makes some shaders easier to write and increases the efficiency of others by avoiding calculations on parts that don’t need them. For example, why apply a very costly filter that softens the border of a shadow to a pixel in the middle of the shadow? Dynamic branching can help to determine if the pixel needs it or not.


    However, the situation is not that rosy as this only applies to very specific cases. Branching has the reputation of being difficult to manage and this is particularly the case in CPUs that have to predict the branching result to mask calculation latency. In a GPU, pixels are processed by groups of 10s, 100s or even 1000s, and this allows the automatic masking of this latency. This problem, therefore, doesn’t really exist for GPUs. There is another one, however. For efficient branching with GPUs, all pixels of a working group have to take the same branch or else both branches have to be calculated for all pixels with masks in order to only write the result of the required branch for each pixel.

    In the case of the GeForce 8, 9 and GTX 200, the GPU works on groups of 16 or 32 threads (vertices, pixels, etc.). Why these two possibilities? First of all because 8-way SIMT units are used and groups of at least 8 threads are required. Next, you may recall that calculation units are double pumped and function at twice the scheduler’s frequency. Thus, only one command can be sent every other cycle when seen from the calculation units’ point of view. Working on groups of 16 threads gives calculation units enough work so that they do not have to wait for a slower scheduler. Finally, working on 32 threads allows dual issue. The scheduler will send an instruction to the 8-way SIMT unit and then it will send an instruction to special units. It can alternate between these two operations at full speed thanks to groups of 32 threads.
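The reasoning behind groups of 16 and 32 threads comes down to counting cycles. The sketch below is a deliberately simplified Python model of what the paragraph above describes (8 lanes, calculation units at twice the scheduler frequency); the function name is ours and it does not claim to reflect Nvidia's actual scheduling logic.

```python
SIMT_WIDTH = 8                       # lanes in one SIMT unit
ALU_CLOCKS_PER_SCHEDULER_CLOCK = 2   # calculation units are double pumped


def scheduler_cycles_per_instruction(group_size):
    """Scheduler cycles during which one instruction keeps the SIMT unit busy."""
    alu_cycles = group_size / SIMT_WIDTH
    return alu_cycles / ALU_CLOCKS_PER_SCHEDULER_CLOCK


for group_size in (8, 16, 32):
    cycles = scheduler_cycles_per_instruction(group_size)
    print(f"group of {group_size:2d} threads: busy for {cycles} scheduler cycle(s)")
```

With a group of 8, the unit is done after half a scheduler cycle and would have to wait. With 16, one instruction occupies it for exactly one scheduler cycle. With 32, it is busy for two scheduler cycles, which leaves the scheduler one cycle free to issue an instruction to the special units.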

    Nvidia can configure its GPUs for 16 or 32 threads. In the first case, branching performances are improved and in the second calculation power is improved thanks to dual issue. Groups of 16 are activated for vertex and geometry shaders while groups of 32 are activated for pixel shaders and CUDA.

    We developed a small test that allows us to change branching granularity (the number of consecutive pixels that take the same branch). We create virtual screen columns inside the pixel shader applied to moving triangles. We specify the branch to take per pixel column. One column out of two has to display a complex shader while the other can skip this part of rendering. Average sized triangles in motion are displayed on the monitor and across these virtual areas that use different branches. The triangle size, their position and the column size have an influence on branching efficiency. We think this test is quite close to real situations.
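To give an idea of how granularity interacts with the size of thread groups, here is a toy model in Python. It is not the test described above: it works on a single row of pixels, groups are aligned runs of consecutive pixels, and the costs assigned to the complex and skipped paths are arbitrary values chosen for illustration. Real GPUs group pixels in 2D tiles, and triangle coverage also matters.

```python
def relative_cost(column_width, group_size, width=1024,
                  complex_cost=10.0, skip_cost=1.0):
    """Cost of rendering one row with branching, relative to running the
    complex shader on every pixel (values below 1.0 mean branching pays off)."""
    total = 0.0
    groups = 0
    for start in range(0, width, group_size):
        pixels = range(start, min(start + group_size, width))
        # Even columns take the complex branch, odd columns skip it.
        takes_complex = {(x // column_width) % 2 == 0 for x in pixels}
        if takes_complex == {True}:
            total += complex_cost
        elif takes_complex == {False}:
            total += skip_cost
        else:
            # Divergent group: both branches are executed with masks.
            total += complex_cost + skip_cost
        groups += 1
    return total / (groups * complex_cost)


for column_width in (8, 32, 64):
    for group_size in (32, 64):
        cost = relative_cost(column_width, group_size)
        print(f"columns of {column_width:2d}, groups of {group_size}: {cost:.2f}")
```

In this model, columns narrower than the group make every group divergent, so branching costs more than not branching at all. As soon as columns are at least as wide as the group, half of the groups can skip the complex part. This is why smaller groups have an advantage at fine granularity.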


    With narrow columns, GPUs can’t use branching to avoid the complex part for half of the pixels, but they do have to process branching instructions. This reduces performance instead of increasing it, at least for the GeForce 8, 9 and GTX 280. All of these GPUs have a special unit devoted to branching, which functions in parallel with pixel shading and texturing pipelines, masking the cost of branching instructions. The Radeons, however, seem to be the only ones to completely mask branching latency.

    The size of groups of pixels on the GeForce 8800 is 32 versus 64 for the Radeon HD 3870 and 4870. This enables Nvidia chips to take the lead. We noted a surprising difference between the GeForce 9800 GTX and the GeForce GTX 280, which is much more efficient with columns of 8 pixels. It is probable that the breaking down of triangles into pixels is done in a way that better groups neighboring pixels (which are thus likely to take the same branch), and this is beneficial in this case.


    Page 8
    The Radeon HD 4800

    The Radeon HD 4800
    For this test, we received a Force3D Radeon HD 4870 and PowerColor 4850. However, these were only reference cards upon which these companies (hurriedly) placed their logos. The first should very soon be found for 250€ while the second is already available for 160€.

    The Radeon HD 4870 received a double slot cooling system and is equipped with 512 MB of Qimonda 1.8 GHz GDDR5 memory.




    The Radeon HD 4850 has a single slot cooling system. It is also equipped with 512 MB but this is GDDR3 at nearly 1 GHz.




    The Radeon HD 4870 needs two 6 pin power connectors while a single one is enough for the Radeon HD 4850:




    Page 9
    Nvidia’s reaction

    Nvidia’s reaction
    Nvidia didn't take long to react to these Radeon HD 4800s and officially unveiled a GeForce 9800 GTX v2. This is destined to replace the former model and it is equipped with a new GPU, the G92b. Similar to the G92, it has the advantage of being produced in 55 nanometers instead of 65. Thus, it is smaller, less expensive and consumes fewer watts. In addition, theoretically it may be a bit easier to raise its frequency.

    Consequently, Nvidia raised the frequencies from 675/1688 on the GeForce 9800 GTX to 738/1836 on the new version, representing a 9% increase. On the other hand, memory stays at 1100 MHz, and the design of the card remains perfectly identical:



    The transition to 55 nanometers reduces the size of the G92 from 324 mm² to 264 mm², or almost the same size as the RV770.
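The frequency increase and the die shrink can both be put into numbers. The Python sketch below uses only the figures quoted above; the "ideal" area assumes that everything on the chip scales with the square of the process dimension, which is a simplification on our part and not something Nvidia has stated.

```python
old_clocks = {"core_mhz": 675, "shader_mhz": 1688}   # GeForce 9800 GTX
new_clocks = {"core_mhz": 738, "shader_mhz": 1836}   # new version

core_gain = (new_clocks["core_mhz"] / old_clocks["core_mhz"] - 1) * 100
shader_gain = (new_clocks["shader_mhz"] / old_clocks["shader_mhz"] - 1) * 100

G92_AREA_65NM = 324   # mm²
G92B_AREA_55NM = 264  # mm²

# Hypothetical perfect shrink: area scales with the square of the node size.
ideal_area = G92_AREA_65NM * (55 / 65) ** 2
actual_ratio = G92B_AREA_55NM / G92_AREA_65NM

print(f"Core clock:   +{core_gain:.1f}%")
print(f"Shader clock: +{shader_gain:.1f}%")
print(f"Ideal 55 nm area:  {ideal_area:.0f} mm²")
print(f"Actual 55 nm area: {G92B_AREA_55NM} mm² ({actual_ratio:.0%} of the 65 nm die)")
```

Both clock domains gain roughly 9%, as stated. The actual die is larger than a perfect shrink would give, which is to be expected since not all parts of a chip scale equally with the process.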

    PhysX
    Nvidia wasn’t content with just pre-announcing this card to counter the Radeon. They also released the first version of their PhysX driver as fast as possible. You may recall that, following the buyout of Ageia, Nvidia converted the "hardware" module (intended for the PPU) of the PhysX API into a CUDA version in order to be able to accelerate it via GeForce 8, 9 and GTX GPUs. Note that it isn’t the whole PhysX API that Nvidia converted and accelerates via its GPUs but only this module.

    Currently, acceleration is functional in special PhysX maps of UT3 and in the CPU2 test of 3DMark Vantage. And according to Nvidia, this adds some weight to their GPUs. The catch is that Nvidia only intends to accelerate in this way the games that could benefit from the PPU, and these games are very few and limited. While it seems obvious that, given the enormous pool of already compatible GeForces, game developers are now going to take an interest, it remains to be seen when games using GPU acceleration of this PhysX API module will arrive.


    Page 10
    DirectX 10, GPGPU, HD Video

    DirectX 10.1
    With the Radeon HD 4870, AMD of course still supports DirectX 10.1. Nvidia, for its part, continues to ignore it and does not seem in a hurry to make the hardware modifications necessary for its support.

    However, there is one interesting detail that presents itself. One of the main interests of DirectX 10.1 is to improve MSAA readback or the possibility for the GPU to work on buffers that receive a multisampling-type anti-aliasing in order to be able to easily apply antialiasing with complex rendering techniques. DirectX 10.1 notably enables the GPU to have deep access to the buffer. The GeForce 8, 9 and GTX are also capable of this although Nvidia can't speak about it since DirectX abandoned the caps.

    In spite of this, Nvidia can implement it in its drivers and work with developers that plan on using it, so that it also functions in DirectX 10 on the GeForce. Note that this is also true for DirectX 9 because it is via this technology that Nvidia managed to implement MSAA support in S.T.A.L.K.E.R.
    GPGPU
    It’s no longer a secret that GPUs are capable of much more than just calculating three dimensional images. AMD added a few small improvements that make it possible to use the calculation power of the Radeon HD 4800 better and more easily. First of all, there have been performance gains in the (non-cached) capacity of writing or reading any place in memory.

    Above all, AMD has added a local shared cache specific to each GPU multiprocessor. This 16 KB cache enables different threads of the same group to communicate amongst themselves, allowing a significant optimization of certain algorithms. In addition, a generalized shared cache of 16 KB means the threads of different multiprocessors can also communicate amongst themselves.

    More attentive readers may have noticed that this local cache is similar to the 16 KB of shared memory that is also specific to each multiprocessor on the GeForce 8, 9 and GTX. So will AMD’s addition of this cache make CUDA support possible on the Radeon? We would hope for this to be the case because CUDA is far more efficient and better documented than other current AMD solutions.
    HD video
    The Radeon HD 4800s have a revised UVD for the playback of HD video. Version 2 handles the acceleration of MPEG2 HD (which wasn’t the case before) as well as Picture-in-Picture modes as defined by the Blu-ray format, via dual stream support.

    AMD also updated HDMI support, which now moves to version 1.3, thus enabling the playback of 7.1 audio streams and the encoding of Dolby TrueHD and DTS HD, whereas all other solutions are limited to HDMI 1.2 and therefore 5.1 sound.

    In response to Nvidia’s announcements concerning the acceleration of video encoding via CUDA, AMD highlights its AVT (Accelerated Video Transcoding). This is an interface found (or that soon will be) in drivers, which allows any application that so desires to use the GPU for encoding video in H.264 or MPEG2. Cyberlink’s Power Director 7 will be the first application to use AVT.

    You may recall that for Nvidia, a third party developer created an encoding program that must be purchased.


    Page 11
    Specifications, power consump., the test

    Specifications

    Note, once again, that the dual-GPU cards tested here are the equivalent of a single 512 MB card and not a 1 GB model like the GeForce GTX 280!

    AMD has put into place a redundant system that is rather efficient. It enables keeping all units active on all Radeon HD 4800s contrary to Nvidia which has to deactivate entire blocks on its GPUs so as not to have to discard GPUs that have defects.
    Power consumption and noise
    We evaluated the power consumption of the different cards. Measurements were taken at the wall socket. This is therefore the total power consumption of the power supply, in this case a Cooler Master Real Power M1000 (1000 watt).


    Unsurprisingly, the Radeon HD 4800s’ power consumption is higher than that of the Radeon HD 3870. While it is kept under control on the Radeon HD 4850, it explodes on the Radeon HD 4870, especially when the card is at rest. Given that AMD has told us it has introduced new functions designed to reduce power use, we can only hope these are not yet activated and that a future driver will remedy this.

    In terms of noise levels, the Radeon HD 4850 is similar to the Radeon HD 3850 and is therefore relatively silent. On the other hand, it tends to reach high temperatures even at rest: we noted 80°C at idle.

    The Radeon HD 4870 is better cooled and is also silent.

    As for the new GeForce 9800 GTX, it posts similar power consumption to the former version. The slight gain in frequency thus cancels out the gain related to the transition to the 55 nanometer process. In terms of noise, the GeForce 9800 GTX v2 is identical to the GeForce 9800 GTX; in other words, it is discreet at rest but noisier than a GeForce 8800 GTX or Ultra under load. The fan appeared to turn a little faster, or at least the sound of air flow was much more noticeable.
    The test
    In this test, we used ten games, four of which support DirectX 10. Tests were carried out only in 1920x1200, as a lower resolution generally isn’t suited to such a high end product. Anisotropic filtering, DirectX 10 and HDR were activated in all cases when available in the game. Finally, transparency/adaptive anti-aliasing was activated in multisampling mode.

    All currently available Windows Vista updates, in addition to SP1, were installed.
    Configuration
    Intel Core 2 Extreme QX9770
    Asus Striker II
    4 GB DDR3 1066
    Windows Vista SP1
    Forceware 177.34
    Catalyst 8.501.1 (= 8.6 hotfix)


    Page 12
    Enemy Territory : Quake Wars

    While Quake Wars is based on the Doom 3 engine, it has undergone some evolution such as megatexturing which facilitates the work of artists; however, there is the additional cost in terms of decoding and access to megatextures. In the end, Quake Wars is a little more resource heavy than Doom 3 or Quake 4.

    We recorded a demo of a sequence played against 4 bots. Given that artificial intelligence is not calculated in the timedemo, results are less affected by the CPU than in actual gameplay, or at least than in a game against bots.

    All parameters were set to maximum in the game, including 16x anisotropic filtering. Patch 1.4 was used.


    Without anti-aliasing in this first game the GeForce cards dominate.


    Otherwise, the Radeon HD 4850 shows a very significant gain compared to the Radeon HD 3870 once anti-aliasing is activated. It thus surpasses the GeForce 9800 GTX, but the v2 manages to slightly take the lead again. The Radeon HD 4870 comes close to the GeForce GTX 260 in 1920x1200.


    Page 13
    Half Life 2 Episode 2

    Still based on the Source Engine, Half Life 2 Episode 2 doesn’t really bring anything new on the technological level. It simply optimizes the engine and relies more heavily on its capabilities, making the game more resource heavy than the previous versions. We run a demo with all game options set to maximum, including 16x anisotropic filtering.


    Here, the Radeon HD 4870 surpasses the GeForce GTX 260.


    In Half Life 2 Episode 2, the largest gain is also with anti-aliasing activated with a 70% better score compared to the Radeon HD 3870! The Radeon HD 4850 thus places between the 9800 GTX v1 and v2 while the Radeon HD 4870 also surpasses the GeForce GTX 260.


    Page 14
    S.T.A.L.K.E.R.

    We carry out an identical movement and measure the framerate with Fraps. The test was done in high quality with complete dynamic lighting, maximum details (16x anisotropic filtering) and foliage shadows. S.T.A.L.K.E.R. uses an engine based on deferred rendering, which is fundamentally incompatible with MSAA and makes the use of anti-aliasing impossible, or at least this is what we thought! Despite everything, Nvidia ended up finding a solution. The 1.00006 patch was used.
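
For games without a built-in benchmark, a framerate measured over a fixed movement comes down to counting the frames rendered during the run. A minimal sketch, assuming the measuring tool logs one cumulative timestamp (in milliseconds) per frame; the exact log format and the sample values are assumptions made for the example.

```python
def average_fps(timestamps_ms: list[float]) -> float:
    """Average framerate over a run, from cumulative frame timestamps in ms."""
    if len(timestamps_ms) < 2:
        raise ValueError("need at least two timestamps")
    elapsed_ms = timestamps_ms[-1] - timestamps_ms[0]
    if elapsed_ms <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps delimit N - 1 frame intervals.
    frames_rendered = len(timestamps_ms) - 1
    return frames_rendered * 1000.0 / elapsed_ms


# Hypothetical log: five timestamps spaced 25 ms apart.
sample = [0.0, 25.0, 50.0, 75.0, 100.0]
print(average_fps(sample))
```

Dividing the frame count by the total elapsed time, rather than averaging instantaneous framerates, avoids giving too much weight to short bursts of very fast frames.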


    The Radeon HD 4850 is equivalent to the GeForce 9800 GTX, while the performance of the GeForce 9800 GTX v2 increases proportionally to its higher GPU frequencies, thus gaining 9%.


    The Radeons still do not enable the use of anti-aliasing.


    Page 15
    Rainbow Six : Vegas

    The first PC game based on the Unreal Engine 3.0, Rainbow Six : Vegas is still a very resource heavy game. We measure performances in the introductory scene. HDR mode is activated, as it is more or less obligatory: without it, banding is very noticeable. Shadows are set to “low” because a higher quality setting lowers performance too much in certain areas.


    Originally designed for the Xbox 360, this game seems to have a natural affinity for the Radeon HD which has a similar architecture to the game console’s graphic chip. It’s therefore the HD 3870 X2 which dominates. As for the Radeon HD 4870, it surpasses the GeForce GTX 280 while the Radeon HD 4850 manages to even beat the new GeForce GTX 260 by 10% without anti-aliasing.


    Once again, the Radeon HD 4850 shows a higher gain with the activation of anti-aliasing and easily surpasses the GeForce 9800 GTX.

    You may recall that this game does not support anti-aliasing natively, but Nvidia and AMD have implemented it in their drivers.


    Page 16
    Oblivion

    We saved a specific movement in order for it to be always identical and the test reproducible. Of course, HDR was activated and a high level of detail was selected.


    Without anti-aliasing, the gain added by the Radeon HD 4850 compared to the Radeon HD 3870 is not that great at only 10%.


    Gains are a bit higher with anti-aliasing although the Radeons already have amazingly good performances in this game with this mode. Thus, the GeForce GTX 260 is behind while the Radeon HD 4870 easily surpasses the GeForce GTX 280.


    Page 17
    RaceDriver GRID

    To test Codemasters’ latest opus, we carry out a well defined movement in high quality mode. The game is based on an evolution of Colin McRae DIRT’s engine and does away with some of the unnecessary complexity. Patch 1.1 was applied.


    In this game, there is a more than 50% gain compared to the previous generation. In this way, the Radeon HD 4850 breezes by the GeForce 9800 GTX and places just ahead of the GeForce GTX 260.


    Without anti-aliasing, the GeForce 9800 GTX v2 is equivalent to the Radeon HD 4850 and the GTX 260, but with this filter it falls behind. With or without anti-aliasing, the Radeon HD 4870 is on top.


    Page 18
    Bioshock

    The first game based on the Unreal Engine 3.0 to support DirectX 10, Bioshock has great graphics even in DirectX 9 mode while it is less resource heavy than Rainbow Six : Vegas. We carry out a well defined sequence of movement with all options pushed to a maximum and in DirectX 10.


    With a gain of a little more than 35%, the Radeon HD 4850 manages to surpass the GeForce GTX 260 or at least without anti-aliasing as AMD still does not allow its activation in DirectX 10 mode. Otherwise, the 9% gain shown by the GeForce 9800 GTX v2 is not enough to catch the new Radeon. Finally, the Radeon HD 4870 finishes just ahead of the GeForce GTX 280.





    Page 19
    Company of Heroes

    Given that Company of Heroes received a DirectX 10 patch that adds a real plus on the graphics level, we decided to add it to our test protocol. All graphic settings were pushed to a maximum.

    We run the integrated test on version 1.72.


    In this game in DirectX 10 mode, calculation power is very important. Therefore, the Radeon HD 4850 has an advantage and posts a gain of 40% without anti-aliasing.


    The gain is as high as 60% with anti-aliasing. This enables the card to come close to the GeForce 9800 GTX but is not enough to surpass it. The v2 therefore is in the lead.


    Page 20
    World in Conflict

    Very resource heavy and with nice graphics, it’s only natural World in Conflict joins our test suite. We carry out the internal test with the patch 1.0002. All game options are pushed to a maximum which includes the DirectX 10 mode and 16x anisotropic filtering.


    For some unknown reason, Radeon performances in 1680x1050 were limited.


    With an almost 70% gain with anti-aliasing, the Radeon HD 4850 largely surpasses the GeForce 9800 GTX, which like all Nvidia cards suffers much more from a lack of memory than the Radeons.


    Page 21
    Crysis

    An absolute must in terms of gaming, Crysis was tested with its 1.21 patch (optimized for multi-GPU systems). We run our own demo recorded in "Harbor", in "High" mode and DirectX 10.


    While the success of Crysis has been rather mixed, it is currently the most resource heavy game and requires the most graphic power. Here, the gain is smaller at "only" 20% without anti-aliasing. The Radeon HD 4850 is thus slightly behind the GeForce 9800 GTX without this filter.


    On the other hand, with anti-aliasing it largely takes the lead, as the 9800 GTX is overwhelmed by Crysis’ memory needs. Moreover, the new GeForce 9800 GTX does not radically change the situation with this filter. Otherwise, the Radeon HD 4870 does not manage to surpass the GeForce GTX 260.


    Page 22
    Recap of performances

    Recap
    Although individual game results are interesting, especially when involving multi GPU systems, we calculated a performance index based on all tests with the same weight for each game. A score of 100 was given to the GeForce 9800 GTX in 1920x1200.
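
An equal-weight index of this kind can be sketched as follows. The article does not say how the per-game scores are combined, so the arithmetic mean of per-game ratios used here is an assumption, as are the function name, the card and game labels, and the framerates, which are purely hypothetical.

```python
def performance_index(results: dict, reference: str, scale: float = 100.0) -> dict:
    """Equal-weight index: mean of per-game framerates relative to a reference card.

    results maps card name -> {game name -> average framerate}.
    Assumption: per-game ratios are combined with an arithmetic mean.
    """
    ref = results[reference]
    index = {}
    for card, fps_by_game in results.items():
        # Each game contributes one ratio, so every game has the same weight
        # regardless of its absolute framerate.
        ratios = [fps_by_game[game] / ref[game] for game in ref]
        index[card] = scale * sum(ratios) / len(ratios)
    return index


# Hypothetical framerates, for illustration only.
results = {
    "reference card": {"game A": 40.0, "game B": 50.0},
    "card X": {"game A": 50.0, "game B": 55.0},
}
for card, score in performance_index(results, "reference card").items():
    print(card, round(score, 1))
```

Normalizing each game before averaging prevents a title that runs at very high framerates from dominating the index.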


    Without antialiasing, the Radeon HD 4850 jumps in front of the GeForce 9800 GTX. However, the GeForce 9800 GTX v2 shows an average gain of 8.3% compared to the previous version which isn’t bad for a frequency increase of 9%. For this reason, it surpasses the Radeon HD 4850. Otherwise, its bigger sibling, the Radeon HD 4870, has no problem beating the GeForce 9800 GTX v2 as it is even ahead of the GeForce GTX 260.


    On the other hand, with antialiasing, the Radeon HD 4850 is ahead with an 18% lead on the first GeForce 9800 GTX and 11% on the new version. Moreover, it surpasses the Radeon HD 3870 X2 which often has problems with this filter. As for the Radeon HD 4870, it comes close to the GeForce GTX 280.

    Note that, for the anti-aliasing indexes only, the results obtained in Bioshock and S.T.A.L.K.E.R. were not taken into account, as the Radeons do not support this filter in these games. A graph that does take these games into account is available separately.
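
The comparison made above between the 8.3% average gain of the GeForce 9800 GTX v2 and its 9% frequency increase can be expressed as a scaling ratio. A minimal sketch; the function name is ours, and the only figures used are the two percentages quoted in the text.

```python
def scaling_efficiency(perf_gain_pct: float, clock_gain_pct: float) -> float:
    """Fraction of a frequency increase that shows up as measured performance."""
    if clock_gain_pct <= 0:
        raise ValueError("clock_gain_pct must be positive")
    return perf_gain_pct / clock_gain_pct


# 8.3% average gain for a 9% frequency increase, as quoted in the recap.
print(round(scaling_efficiency(8.3, 9.0), 2))  # 0.92
```

A ratio close to 1 indicates that the tests are mostly limited by the GPU itself rather than by the CPU or the rest of the platform.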


    Page 23
    Conclusion

    With the Radeon HD 4800, AMD reserved a nice surprise for us. We might as well be honest in saying that we did not expect such high performances. By concentrating on the optimization of its previous architecture as much in efficiency as in cost, AMD has managed to develop a formidable GPU.


    The RV770 upon which these Radeon HD 4800s are based is thus capable of rivalling the GeForce GTX 200s’ GT200, even though they are supposed to be in entirely different leagues, one being twice as big as the other. As much as the GT200 impressed us upon its release, we are forced to admit that AMD did much better in terms of efficiency. Of course, Nvidia has much more ambitious and pressing objectives involving GPGPU, which lead to different technological choices; however, in hindsight we still can’t help thinking that Nvidia could or should have done better.

    The Radeon HD 4850, initially launched at 160 €, can already be found at 140 €. At this price there is no competition from Nvidia, which was obviously as surprised as we were by the RV770. The GeForce 9800 GTX v2, which is an attempted response, will only arrive in mid-July and at a price expected to be higher, at around 180-190 €.


    The Radeon HD 4870, whose initial price is 250 €, benefits from higher bandwidth thanks to GDDR5 memory. It surpasses the GeForce GTX 260 and even regularly comes close to the GeForce GTX 280, which is twice the price. In order to remain competitive, Nvidia will have to quickly review its prices. This will not be easy given the higher production costs of the GT200 and the cards which integrate it.

    Of course, all is not lost for Nvidia. It still holds first place in terms of raw performance with the GeForce GTX 280 and has proved to be ahead of AMD with the acceleration of a physics API via its GPUs. Otherwise, the Radeon HD 4800s are not exempt from defects, for example the oddly high power consumption of the Radeon HD 4870 at rest. However, these small details do not change the fact that we have witnessed an incontestably perfect maneuver on AMD’s part.


    Copyright © 1997-2009 BeHardware. All rights reserved.