ATI Radeon X1800 XT & XL
by Damien Triolet and Marc Prieur
Published on October 5, 2005

Reviewed and corrected pixel shaders
ATI was undeniably behind NVIDIA in GPU pixel shading capabilities. This is only logical, as ATI’s previous shader core, even if it was slightly improved with the X800, is more than three years old.

The 9700/X800 shader core worked in a relatively fixed sequence, which restricted its capabilities. When a thread (a group of pixels to be processed) arrived in the shader core, it first went through the texturing block, where all texturing instructions were processed and their results stored in registers. It then went through the pixel shader pipeline, where all the mathematical instructions were executed with the texturing results already in memory. Thread management was pipelined: as soon as one thread left the texturing block, another entered it while the next one waited. This system is interesting in terms of performance and is partly responsible for the Radeon 9700/X800’s strong efficiency, because it means the limiting factor is either texture access or mathematical instructions. For NVIDIA, however, it’s the sum of the two. This is also why ATI is more efficient with anisotropic filtering: it’s easier to hide the time it requires behind mathematical instructions. NVIDIA has to optimise the order of instructions to do this, but it’s never as efficient.
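The "either one or the other" versus "sum of the two" distinction can be written as a tiny cost model. This is only a sketch: the function names and the cycle counts are invented for illustration, and real hardware involves many more factors.

```python
def decoupled_cost(texture_cycles, math_cycles):
    """9700/X800 style: texturing and math overlap, so the slower one sets the pace."""
    return max(texture_cycles, math_cycles)


def fused_cost(texture_cycles, math_cycles):
    """Fused pipeline as described for NVIDIA: the two costs add up."""
    return texture_cycles + math_cycles


# Hypothetical shader: 8 cycles of texturing and 10 cycles of math per thread.
print(decoupled_cost(8, 10))  # 10
print(fused_cost(8, 10))      # 18
```

In this model, making texturing more expensive (anisotropic filtering, for example) costs nothing in the decoupled case as long as it stays below the math time.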

Things become more complicated when a shader includes an indirection. An indirection is a texture access whose coordinate is calculated dynamically in the pixel shader. The coordinate is therefore not known in advance, and this case is now relatively common. Since texture accesses are performed first, the texture can’t be accessed directly: the thread has to be put on hold and sent back to the texturing block as soon as possible. As other threads are already waiting to enter the shader core, the number of threads in flight increases. For example, if there are 2 effective threads at a given time (we say effective because the actual number might be a multiple of this), one in the texturing block and one in the pixel shading block, this rises to 4 with one indirection, 6 with two and 8 with three. 8 was the maximum supported, which was a strong limitation.
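The progression given above (2, 4, 6, 8) follows a simple rule: two effective threads per pass through the texturing block. A minimal sketch, where the function name is hypothetical and raising an error past the limit is an assumption of this example (the text does not say how the hardware or driver reacts):

```python
def effective_threads(indirections, base_threads=2, max_threads=8):
    """Effective threads in flight on the 9700/X800 for a given number of indirections."""
    needed = base_threads * (indirections + 1)
    if needed > max_threads:
        # Assumption for this sketch: anything past the limit is simply rejected.
        raise ValueError(f"{indirections} indirections would need {needed} threads")
    return needed


for n in range(4):
    print(n, "indirection(s):", effective_threads(n), "effective threads")
```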

Previously, we often spoke of NVIDIA’s limited number of registers, as there were really only 2 FP32 registers per pixel in the GeForce FX and 4 in the GeForce 6 and 7. For ATI, however, the 12 registers required by Pixel Shader 2.0 were actually present and not “emulated” from a smaller number. A lesser known fact is that all 12 registers were only available with two effective threads. With 4 threads, they had to share the register space, leaving only 6 registers per pixel, and only 3 registers per pixel with 8 effective threads.
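The register figures quoted above are consistent with a fixed register space divided evenly between the effective threads. A small sketch of that arithmetic, assuming even sharing (the function name is hypothetical):

```python
FULL_REGISTERS = 12  # registers per pixel with the base number of threads
BASE_THREADS = 2     # effective threads for which all 12 are available


def registers_per_pixel(effective_threads):
    """Registers left per pixel when the same space is split between more threads."""
    return FULL_REGISTERS * BASE_THREADS // effective_threads


for threads in (2, 4, 8):
    print(threads, "threads ->", registers_per_pixel(threads), "registers per pixel")
```

The same rule would give 4 registers per pixel with 6 effective threads, a value the text does not state.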


NVIDIA’s solution: the long pipeline
To avoid limits on indirections and to gain flexibility for branching and so on, NVIDIA used a very long 256 cycle pipeline, much of which does nothing but wait for texturing unit results. With NVIDIA there aren’t two distinct parts; the entire process is fused. As soon as a texturing instruction arrives, it is processed directly and the result is available a little further down the pipeline, but within the same effective cycle. NVIDIA shouldn’t use several threads simultaneously if it wants to stay efficient, but it can access an unlimited number of textures dependent on the results of the pixel shader. The downside is that texturing latency can’t be hidden beyond a certain limit, and that it has to work with large threads of 1024 pixels (or even more with the GeForce 6).

This approach raises several problems with dynamic branching because, in the GPU, the instruction flow is managed per thread (per group of pixels) and not per pixel. In other words, every pixel in a thread has to follow the same path and have the same instructions applied. When the result of a branch isn’t identical for all pixels in a thread, both branches have to be processed for all of them. Dynamic branching can then no longer be used to increase performance (for example by skipping a large part of the shader), and it can even reduce it.


ATI’s solution: Ultra Threading
With the Radeon X1000, ATI had to make changes to offer pixel shaders without indirection limits and capable of handling branching. The solution chosen wasn’t to follow NVIDIA, but rather to take the 9700/X800’s concept further by increasing the number of threads. The maximum is now 512 in the Radeon X1800, which is much higher than before, even if we don’t know the exact number.


The thread size is very small at 16 pixels, very different from NVIDIA’s 1024. The Radeon X1000 supports 32 real registers per pixel, but this number drops depending on the number of active threads. We didn’t obtain the maximum number of threads for which all 32 registers are available, but we estimate it to be 64 threads, or 1024 pixels. This represents 32,768 128 bit general registers, compared to 24,576 for the GeForce 7800, which is also less flexible: a pixel there never has more than 4.
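The register totals above can be checked with a few multiplications. The sketch below uses only figures from this page; the last value is an extrapolation that assumes the register file is shared evenly when all 512 threads are active, which we have not been able to confirm.

```python
THREAD_SIZE = 16          # pixels per thread on the Radeon X1000
REGISTERS_PER_PIXEL = 32  # with few enough threads in flight
ESTIMATED_THREADS = 64    # our estimate for keeping all 32 registers
MAX_THREADS = 512         # maximum quoted for the Radeon X1800

pixels_in_flight = ESTIMATED_THREADS * THREAD_SIZE        # 1024 pixels
x1800_registers = pixels_in_flight * REGISTERS_PER_PIXEL  # 32,768 registers

GEFORCE_7800_REGISTERS = 24576
GEFORCE_REGISTERS_PER_PIXEL = 4
geforce_pixels = GEFORCE_7800_REGISTERS // GEFORCE_REGISTERS_PER_PIXEL  # 6,144 pixels

# Extrapolation (assumption): even sharing of the same register file at full load.
registers_at_full_load = x1800_registers // (MAX_THREADS * THREAD_SIZE)

print(pixels_in_flight, x1800_registers, geforce_pixels, registers_at_full_load)
```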

The Ultra Threading principle is quite simple, even if its implications are very complex. As soon as a thread arrives in one of the four shader cores (each of which has a block of 4 texturing units and 4 pixel shading units), processing starts and mathematical instructions are executed until an operation that causes latency comes up (such as a texture access). When this happens, the thread is sent to the texturing block, its results stay in the temporary registers, and a new thread takes its place. As soon as that thread reaches a texturing instruction, it in turn goes to the texturing block and another new thread comes in, and so on until the texturing result of the first thread is known. At that point the first thread goes back to the pixel shading block so that the following instructions can be applied, until another operation that causes latency comes up. The cycle continues until the shader has been completely processed. The thread then leaves the shader core and the process starts over again.
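To get a feel for why the number of threads matters, the scheduling loop described above can be simulated in a few lines. This is a toy model and not ATI's scheduler: the function name, the cycle counts and the latency are invented, a single pixel shading block is modelled, and the texturing block is assumed to serve any number of sleeping threads at once.

```python
from collections import deque


def alu_utilisation(num_threads, math_cycles=4, texture_latency=20, segments=3):
    """Share of cycles the pixel shading block spends doing math.

    Each thread runs `segments` bursts of math, with a texture fetch
    (a fixed wait of `texture_latency` cycles) between two bursts.
    """
    ready = deque((tid, segments) for tid in range(num_threads))
    sleeping = []  # (wake_cycle, thread_id, segments_left)
    cycle = 0
    busy = 0
    while ready or sleeping:
        # Wake every thread whose texture result has arrived.
        still_sleeping = []
        for wake, tid, left in sleeping:
            if wake <= cycle:
                ready.append((tid, left))
            else:
                still_sleeping.append((wake, tid, left))
        sleeping = still_sleeping

        if not ready:
            # Nothing to run: the block idles until the next texture result.
            cycle = min(wake for wake, _, _ in sleeping)
            continue

        # Run one burst of math for the next ready thread.
        tid, left = ready.popleft()
        cycle += math_cycles
        busy += math_cycles
        left -= 1
        if left > 0:
            # Texture access: park the thread and let another one in.
            sleeping.append((cycle + texture_latency, tid, left))
    return busy / cycle


for n in (1, 2, 4, 6):
    print(n, "thread(s):", round(alu_utilisation(n), 2))
```

With these made-up numbers, a single thread keeps the block busy for 12 cycles out of 52, while six threads are enough to keep it busy continuously.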

In other words, instead of hiding latency with a long pipeline as NVIDIA does, or with a fixed architecture as before, ATI uses a large number of threads, a significant share of which remain dormant while awaiting the results of the texturing units. This method combines the advantages of both architectures.

For dynamic branching, the fact that ATI uses very small threads means that it has to calculate both branches for every pixel far less often than NVIDIA. This could lead to a very significant advantage in the future. Also on this subject, ATI has, in addition to the pixel shading block and the texturing block, a third block working in parallel that handles branching instructions. Branching therefore has no real impact on performance, whereas it costs NVIDIA several cycles.
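The effect of thread size on branching can be illustrated with a simplified model. The sketch below treats the screen as a single row of pixels (real threads are 2D groups), uses invented branch costs, and applies the rule that a thread whose pixels disagree pays for both branches on every pixel. The function name is hypothetical.

```python
def branch_cost(takes_branch, thread_size, cost_if=10, cost_else=2):
    """Total cost of shading a row of pixels grouped into threads.

    `takes_branch` holds one boolean per pixel: True for the expensive
    branch, False for the cheap one.
    """
    total = 0
    for start in range(0, len(takes_branch), thread_size):
        group = takes_branch[start:start + thread_size]
        cost = 0
        if any(group):      # at least one pixel needs the expensive branch
            cost += cost_if
        if not all(group):  # at least one pixel needs the cheap branch
            cost += cost_else
        total += cost * len(group)
    return total


# Hypothetical row of 2048 pixels where only the first 100 need the expensive branch.
row = [True] * 100 + [False] * 1948

print(branch_cost(row, 1024))  # 14336
print(branch_cost(row, 16))    # 5024
print(branch_cost(row, 1))     # 4896, the per-pixel ideal
```

In this example the 16 pixel threads come within a few percent of the per-pixel ideal, while the 1024 pixel threads pay for both branches across half of the row.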

ATI hasn’t greatly improved its architecture at the level of the calculation units, which remain more or less identical to those of the Radeon 9700/X800. There is one large vec3 + 1 unit paired with a small vec3 + 1 unit, which handles simple operations such as modifiers. NVIDIA’s architecture includes two large and two small units, although it’s important to note that the large ones can’t process all instructions and that the instruction order has to match their capabilities for them to be used simultaneously. NVIDIA is also capable of processing operations in vec2 + vec2, even if in practice the compiler has some difficulty in this area. Finally, NVIDIA has native support for the NRM (normalisation) instruction in FP16, whereas ATI has no FP16 units and uses the decomposed version of the instruction, which requires several cycles.


Compared to the 9700/X800, ATI has nevertheless made several small improvements, including native support for the sincos instruction. Overall, however, NVIDIA keeps an advantage in calculation power. ATI counters by claiming that its architecture makes fuller use of the calculation units, which compensates.

In terms of pixel shaders, the X1000 architecture has an innovative function called “scatter”. It allows any value to be written directly to the graphics card’s memory. This is a huge evolution compared to the restricted memory access of other GPUs, and is made possible by the new, quite flexible memory architecture. Roughly speaking, this function provides an unlimited number of registers and opens up an enormous number of new possibilities for GPU use, such as using it as a general calculation unit (GPGPU). This function is nevertheless very advanced for its time and can’t be used with DirectX 9. ATI has decided, however, (a first in the GPU industry) to publish low level information on the X1000 GPU in 2006. GPGPU developers will then be able to access the chip without going through an API and use its full potential.
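For readers unfamiliar with the term, the difference between the usual read at a computed address (a gather, which is what a texture indirection is) and a write at a computed address (a scatter) can be shown with plain lists. This is a conceptual sketch only: the function names are hypothetical and it says nothing about how the X1000 exposes the feature.

```python
def gather(memory, addresses):
    """Read from computed addresses, as a dependent texture access does."""
    return [memory[a] for a in addresses]


def scatter(memory, addresses, values):
    """Write each value to a computed address, which a pixel shader normally cannot do."""
    for address, value in zip(addresses, values):
        memory[address] = value


memory = [0] * 8
scatter(memory, [5, 1, 6], [10, 20, 30])
print(memory)                     # [0, 20, 0, 0, 0, 10, 30, 0]
print(gather(memory, [6, 5, 1]))  # [30, 10, 20]
```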





Copyright © 1997- Hardware.fr SARL. All rights reserved.