NVIDIA GeForce 8800 GTX & 8800 GTS - BeHardware
Written by Damien Triolet
Published on February 12, 2007
URL: http://www.behardware.com/art/lire/644/
Page 1
The first DX10 GPU
NVIDIA is the first company to release a DirectX 10 compatible GPU on the market. This is an important strategic move as this will naturally become the reference for developments based on this API. What kind of improvements will this architecture bring? What about efficiency? And more importantly for us, is this new GPU, the GeForce 8800, really interesting for current games?
Each new version of DirectX, the programming interface used by most computer games, is an opportunity for GPU manufacturers to develop a new architecture that supports new capabilities while noticeably improving performance. The release of DirectX 10 (we already discussed this API in this article) is an important step, as it brings numerous modifications. It is incompatible with previous graphics cards and with Windows XP: DirectX 10 will only work with new generations of cards and Windows Vista.
The innovations mainly concern shaders (small programs that allow the rendering of complex 3D images) and developers, who will get a "cleaner" DirectX. As a reminder, 3D rendering roughly consists of calculating the position of vertices (the corners of the polygons that form objects), assembling these vertices into triangles, cutting the triangles into pixels, and finally applying textures and other effects to the pixels. DirectX 9 allows shaders to be executed on vertices and pixels, whereas DirectX 10 also allows it on triangles (and all other primitives). Why? At the beginning, it will probably be used mainly to optimise performance through more efficient rendering techniques, but it also allows a triangle to be divided into smaller triangles to increase the geometric detail of an object.
These small programs executed by the GPU are becoming more and more complex with the evolution of image quality. With DirectX 10, the shaders move from a 3.0 to a 4.0 version. The latter allows much longer programs with more flexibility.
To support these 4.0 shaders, NVIDIA developed a new architecture that differs from the GeForce 6 and 7. These previous GPUs had separate shader processing units for vertices and for pixels. The GeForce 8 ends this split and only has "unified units", capable of processing all types of data. The objective is to avoid units that sit idle. For example, with the GeForce 7, if geometry calculations are numerous, the eight units devoted to them are saturated whereas the twenty-four intended for pixels do nothing. With the GeForce 8, all calculation units can work on these calculations. It is in this way that GPU rendering can progress.
How many pipelines?
A stupid question for some, but a technical detail that a lot of people look for. Answering it is very difficult, while for NVIDIA's marketing department it's quite easy: 128 shader processing pipelines. However, it isn't possible to compare these pipelines directly to previous ones.
Roughly, these 128 pipelines, or shader or stream processors, correspond in terms of calculation capacity to 32 current pipelines or 8 vertex shaders and 24 pixel shaders of the GeForce 7800 and 7900.
Does it mean there is nothing more to hope from the GeForce 8800 than efficiency coming from the unification? Of course not, since NVIDIA succeeded in increasing the GPU frequency in this area to 1350 MHz! This is more than twice as much as the GeForce 7900 GTX. That's not all...
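The utilisation argument behind unified units can be illustrated with a toy model. This is only a sketch: the workload figures are hypothetical, every unit is assumed to complete one item of work per cycle, and vertex and pixel work are treated as interchangeable, which real hardware only approximates.

```python
# Toy utilisation model. All workload numbers are hypothetical.

def fixed_time(vertex_work, pixel_work, vertex_units=8, pixel_units=24):
    """Cycles needed when vertex and pixel work can only run on their own units."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def unified_time(vertex_work, pixel_work, units=32):
    """Cycles needed when any unit can take any kind of work."""
    return (vertex_work + pixel_work) / units


workloads = {
    "geometry-heavy": (2400, 800),  # (vertex work, pixel work)
    "balanced": (800, 2400),
}

for name, (vertex_work, pixel_work) in workloads.items():
    print(name, fixed_time(vertex_work, pixel_work), unified_time(vertex_work, pixel_work))

# 128 scalar processors handle as many components per clock as
# 32 units that each process a 4-component vector.
scalar_processors = 128
vec4_equivalent = scalar_processors // 4
```

In the geometry-heavy case the fixed split is three times slower in this model, because the 24 pixel units wait for the 8 vertex units. With the balanced workload both designs finish at the same time, so unification only pays off when the mix deviates from what the fixed split was designed for.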
Page 2
A brief history
Modern 3D rendering has been based on texturing from the beginning. Ten years ago, pixels displayed on a monitor were simply decorated with a texture. As the years went by, 3D rendering became more and more complex. Shaders allowed complex mathematical functions to be applied to pixels, and textures were no longer mere decoration but data stores with a variety of uses. Modern shaders often contain dozens of mathematical instructions per texture access. Based on this fact, we might think that managing texture access efficiently is no longer essential, and that it would be best to focus on raw calculation power.
Unfortunately, it's not that simple, for several reasons. The first is that a texture access is often associated with filtering of the texture. When the filtering is of high quality, it can take several cycles to execute even though it only requires one instruction. We will leave this point to one side for now, because the second reason is more fundamental to the architecture and much more important: the latency of a texture access.
Access to graphics memory can take more than 100 cycles. If we had to wait 100 cycles before moving to the next instruction, we would still be in the stone age of 3D. The GPU is capable of preloading data into a small cache to avoid (most of the time) paying the full cost of the memory access. When textures were simple decorations, this task was easy because the GPU knew in advance which area of memory had to be preloaded. Now, with the evolution of rendering techniques, textures may contain all kinds of data and their access pattern is no longer as predictable. The GPU has to be capable of handling these unpredictable texture accesses without paying the cost of memory latency. Several solutions to this problem exist.
NVIDIA's previous solution was to use a very long pipeline, in which more than 100-150 stages were solely devoted to masking latency. The GeForce 7 has 2 pixel shader calculation units per pipeline. The first is in charge of texture access; then comes a very long "tunnel" in which pixels sleep until they reach the next unit. The longer the tunnel, the more likely the texture data is ready on time, so that the pipeline doesn't stall waiting for it and the optimum throughput can be maintained. The downside is that these pipelines need to be filled with as many pixels as there are stages, and they are not very flexible.
This wasn't a problem until the arrival of branching in shaders. In a GPU, the instruction flow is managed in batches of elements (pixels in this case), and one element of a batch cannot receive a different instruction from another. If the result of a branch differs between pixels of the same batch, the GPU has to execute both branches for all pixels, with a mask so that the results of the wrong branch are not taken into account. Those instructions are still processed, however, and this is bad for performance. Because their very long pipeline forces them to work with very big batches of pixels (around 1000!), the GeForce 7900 and previous generations are hit hard by this problem.
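The masking behaviour described above can be sketched in a few lines. This is an illustrative model only, not how any particular GPU is implemented: `run_batch` and its arguments are hypothetical names, and cost is counted in whole-batch passes rather than real cycles.

```python
def run_batch(pixels, condition, then_branch, else_branch):
    """Run one batch in lockstep and return (results, passes over the batch).

    Every lane executes the same instruction. A per-lane mask decides
    which lanes keep the result of each branch.
    """
    mask = [condition(p) for p in pixels]
    results = list(pixels)
    passes = 0
    if any(mask):
        # All lanes execute the "then" branch; masked-off lanes discard it.
        then_values = [then_branch(p) for p in pixels]
        results = [t if m else r for t, m, r in zip(then_values, mask, results)]
        passes += 1
    if not all(mask):
        # All lanes execute the "else" branch; the other lanes discard it.
        else_values = [else_branch(p) for p in pixels]
        results = [r if m else e for e, m, r in zip(else_values, mask, results)]
        passes += 1
    return results, passes


def is_bright(p):
    return p > 2


def boost(p):
    return p * 10


def negate(p):
    return -p


divergent = run_batch([1, 2, 3, 4], is_bright, boost, negate)
uniform = run_batch([3, 4], is_bright, boost, negate)
```

A batch whose pixels disagree pays for both branches, while a batch whose pixels agree pays for one. The larger the batch, the more likely it is that at least one pixel disagrees, which is why batches of around 1000 pixels rarely benefit from branching.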
Another issue is that the width of the pipeline is fixed, just like its length. If the pipeline isn't wide enough for a pixel (i.e. if it doesn't have enough temporary registers), the pixel has to occupy several stages, and this reduces the computing rate accordingly. This is the register limitation of the GeForce FX, 6 and 7, even if NVIDIA widened the pipeline on the GeForce 7 so that most pixels have enough space in one stage.
 ATI's solution to all these problems was, with the Radeon X1000, to decouple the pixel shader processing pipelines from texturing units. The long pipeline is no longer necessary, because there are other techniques to mask texturing latency. It’s possible to work with very small groups of pixels and use a significant amount of these small groups, or threads, to obtain the same result without the above inconveniences. As soon as a small group has to access a texture, it moves out of the pixel shaders processing unit into a queue for the texturing unit. In parallel, another group is processed by the shader calculation unit. As soon as a group has received the result of the access to the texture required, it can go back to the calculation line until it needs another texture, and so on. To efficiently mask latency, it’s necessary to have the possibility of placing a significant amount of threads in the queue, or in stand by, and to have a cache memory to store them. The bigger this is, the longer latency can be tolerated. Here, however, the waiting line no longer has a fixed length. A new thread is injected into the shader core only if it’s necessary and, of course, only if there is room left in memory cache for storage.
If the latency to hide is low, threads quickly go from one unit to another and it isn't necessary to inject a high number of them into the shader processing core. Of course, the size of the cache memory that stores them is fixed. For example, if it's designed to contain 128 threads and only 32 are enough to mask latency, is the remaining 75% of the cache memory useless? To answer this question, we have to come back to the problem of the number of temporary registers. The number in use is variable, which means that the size of a thread, in terms of memory use, is also variable. You might have guessed that this type of architecture makes it possible to have a high number of temporary registers without reducing throughput when there isn't a large latency to mask. This flexibility allows the GPU/compiler to find the best compromise between the number of registers accessible at full speed and the latency that can be masked, whereas the long pipeline has a fixed model, which leads to much lower performance when the workload doesn't fit its structure.
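The trade-off between masked latency and registers per thread can be made concrete with a small model. It is a sketch under stated assumptions: the 200-cycle latency and the 4096-entry register store are hypothetical figures, and each thread is assumed to alternate between a fixed amount of arithmetic and one texture access.

```python
import math


def threads_to_hide(latency_cycles, alu_cycles_per_texture_access):
    """Threads in flight needed so the shader core never runs out of work.

    While one thread waits for its texture, the other threads must supply
    at least `latency_cycles` of arithmetic between them.
    """
    return math.ceil(latency_cycles / alu_cycles_per_texture_access) + 1


def registers_per_thread(register_store_entries, threads_in_flight):
    """Registers each thread can use when the store is shared evenly."""
    return register_store_entries // threads_in_flight


REGISTER_STORE = 4096   # hypothetical size of the thread storage
TEXTURE_LATENCY = 200   # hypothetical latency in cycles

for alu_cycles in (4, 16, 64):
    needed = threads_to_hide(TEXTURE_LATENCY, alu_cycles)
    print(alu_cycles, needed, registers_per_thread(REGISTER_STORE, needed))
```

Shaders that do a lot of arithmetic per texture access need few threads in flight, so each thread can use many registers at full speed. Texture-heavy shaders need many threads and leave fewer registers for each one.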
Page 3
GeForce 8 architecture
As we briefly explained above, this new GPU relies on a unified architecture, which consists of using the same units to process all types of elements, whether they are pixels or vertices. The objective is that none of the units sit idle. You may have noticed that we only spoke of "pixel shaders" and their processing units. This is because these units in fact have almost everything required to process vertex shaders. Unification consists more of extending the capabilities of current pixel shader engines than of merging pixel and vertex shaders. It's obvious that the shader cores of the ATI Radeon X1000 are, at the functional (but not management) level, similar to those of a unified architecture. Changing to a unified architecture will be a natural evolution for ATI with the R600.
For NVIDIA, the fixed architecture of the GeForce 7 isn't particularly suited to this evolution. With the GeForce 8, they had to start from scratch. You may have heard this before, because with each new generation of GPU the "brand new architecture" is amongst the basic selling points. Generally, this isn't the case, but today it is. NVIDIA had to start over and design a new architecture to replace an old one that had reached its limits.
NVIDIA chose an architecture similar to the Radeon X1000's and decoupled the calculation and texturing units, which number 128 and 32, respectively, in the highest-end version. Given how GPUs have evolved over the last few years, the GeForce 8800 ends up very close to ATI's current GPUs. However, if we take a closer look, some major differences appear.
A scalar processor
The calculation units of previous GPUs worked on a certain number of pixels in parallel. This is true for both ATI and NVIDIA: 4 pixels for the GeForce 7 and 12 for the Radeon X1000. Each pixel is a vector of 4 components (RGBA or XYZW, since they aren't necessarily colors) and these 4 components are also processed in parallel. We will suppose here that the computed values are colors to make our explanation a little easier. With each cycle an instruction is applied to 4 components of 4 pixels, or 16 elements in the case of the GeForce 7. It often happens that an instruction isn't applied to all components. To avoid wasting resources, the shader cores of these GPUs are capable of processing two instructions simultaneously. For example:
MUL R1.xy
ADD R1.z
These two instructions, a multiplication and an addition, can be processed simultaneously even though they are different. This possibility is called "co-issue". These units are named MIMD (multiple instructions multiple data) and are 512 bits wide (16 elements x 32 bits).
The GeForce 8800's units, however, are of the SIMD (single instruction multiple data) 512-bit type. Does that mean that they are less efficient? No, because instead of processing 4 components of 4 pixels per cycle, they process one component of 16 pixels. This means that each component can receive a different instruction without wasting resources. The two-instruction example above shows the benefit of this organisation. With one GeForce 7 type unit, the instructions are applied to 4 pixels per cycle, or to 16 pixels in 4 cycles. With the GeForce 8, they are applied to 16 pixels in 3 cycles, the first instruction being broken down into MUL R1.x and MUL R1.y. That is 25% fewer cycles with equivalent processing resources, purely thanks to this reorganisation.
Shader core specifications
Now that we have finished describing the philosophy behind each architecture, we will compare their specifications:
 As you can see, the GeForce 8800 GTX has an enormous calculation power in addition to excellent efficiency thanks to scalar instruction processing. We have to keep in mind that these units will also have to process vertex shaders. With the other 2 GPUs, there were special units in charge of them.
You will also notice that the newcomers have much higher filtering power. We will come back to this point later on.
We ran a lot of tests to get a more precise idea of how the shader cores of the GeForce 8800 work, and we have to admit that they are formidably efficient despite the very young drivers. We failed, however, to see the second MUL in action. We believe that its use is subject to restrictions or that the compiler integrated into the drivers doesn't exploit it yet. This could mean a future performance improvement.
It's also interesting to note that each scalar processor has, in addition to the MAD and MUL units, one unit which interpolates and processes special functions (EXP, LOG, RCP, RSQ, SIN, COS), all executed in 4 cycles. We suppose that, for its implementation, NVIDIA included 4 of these units per shader core, each capable of interpolating over one quad (a square of 4 pixels, which simplifies calculations) or executing one special instruction per cycle (hence 4 cycles to process the special instruction for the 16 elements the shader core works on).
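The cycle counts quoted on this page for the MUL R1.xy / ADD R1.z example, and for special functions, can be reproduced with a simple counting model. It is a sketch only: the co-issue rule used here (up to two instructions sharing at most four components) is a simplifying assumption, and the function names are hypothetical.

```python
import math

# Each instruction is (opcode, components written).
EXAMPLE = [("MUL", "xy"), ("ADD", "z")]


def vec4_cycles(instructions, pixels, pixels_per_cycle=4, max_coissue=2):
    """Cycles on a 4-pixel-wide vec4 unit that can co-issue instructions."""
    issue_slots = 0
    i = 0
    while i < len(instructions):
        used = len(instructions[i][1])
        count = 1
        # Pack following instructions while they fit in the 4 components.
        while (i + count < len(instructions)
               and count < max_coissue
               and used + len(instructions[i + count][1]) <= 4):
            used += len(instructions[i + count][1])
            count += 1
        issue_slots += 1
        i += count
    return issue_slots * math.ceil(pixels / pixels_per_cycle)


def scalar_cycles(instructions, pixels, pixels_per_cycle=16):
    """Cycles on a 16-pixel-wide scalar unit: one component per issue."""
    component_ops = sum(len(components) for _, components in instructions)
    return component_ops * math.ceil(pixels / pixels_per_cycle)


def special_function_cycles(elements=16, units=4):
    """Cycles for one special function over a batch, one element per unit per cycle."""
    return math.ceil(elements / units)


geforce7_style = vec4_cycles(EXAMPLE, pixels=16)
geforce8_style = scalar_cycles(EXAMPLE, pixels=16)
```

Both units touch 16 elements per cycle, yet the vec4 unit leaves the fourth component idle in every cycle of this example, while the scalar unit never issues work for a component the shader doesn't use.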
Page 4
ROPs, DirectX 10, CUDA, the G80
ROPs
The GeForce 8800 has 24 ROPs, instead of 16 for the previous high-end GPUs. The ROP units are in charge of the last operations applied to pixels (mixing colors, antialiasing, data compression and writing the result to memory). Of course, the 384-bit bus has something to do with the number of ROP units.
NVIDIA has included more ROPs, but has also improved their efficiency, especially for the path that only writes Z values in memory. The GeForce 7 were already quite efficient for this operation, but they are far behind the GeForce 8800:
DirectX 10
For more information about DirectX 10, you can read this article.
 Of course, the GeForce 8800 fully supports DirectX 10 and even more advanced formats: FP32 textures and antialiasing (128 bits).
At the moment we can't know how efficient the GeForce 8800 is when dealing with advanced DirectX 10 features such as geometry shaders. The DX10 driver is still at an early stage and is only available to selected developers. We can think of 4 points that could lead to performance issues with DX10: driver immaturity, geometry amplification with a geometry shader, integer operations in shaders and stream output. We will have to wait a couple of weeks or months to address these questions.
For now we can tell you that NVIDIA states that integer operations can be processed at full speed by the shader processors (1 per cycle per shader pipeline), so they should not be a performance issue. Regarding geometry amplification, it is not expected to be very efficient in a first DX10 implementation (which is true for ATI's R600 too). Of course, that doesn't mean it won't be usable. While we can't say that the G80 will be very good at it, we can't say it will be less efficient than its competitor either.
CUDA
We've mentioned on several occasions the ATI initiative which consists of making the GPU available as a general calculation unit (however, this is only efficient in specific domains). To do so, ATI exposes the shader core of the Radeon X1000 as a very big floating point calculation unit and discloses the exact specifications of this part of the GPU to help developers optimise their code for the hardware.
We were of course waiting for NVIDIA's response, and the result is slightly different. Named CUDA, it doesn't give access to the GPU via a machine language but only via code written in C. The CUDA driver takes responsibility for interpreting everything. This seems simple to use: compared to a standard C function call, the only difference is passing, as an argument, the number of elements the function will be applied to. At the time of this test, CUDA isn't available yet and we are eagerly waiting for its release.
The G80 supports scattering, as the Radeon X1K does, and can write anywhere in memory, an important feature for a general calculation unit. The G80 is also the first GPU to support data sharing between two elements (NVIDIA calls them threads). With traditional architectures, an element such as a pixel can't access information related to another one. With the G80, the cache system can be used to pass data from one element to another. This new flexibility will allow the GPU to process many more algorithms efficiently.
The result is the G80
DirectX 10, CUDA, unified architecture, massive filtering power… all of this results in the G80, a monster with 681 million transistors! This is far more than the 278 million of the G71. NVIDIA chose a 90 nm fabrication process, which makes the G80 the biggest consumer GPU ever sold.
This GPU is very expensive to produce (118 chips per wafer, probably with a very low expected yield of fully working chips). Production volumes are sure to be low, and power consumption will increase noticeably.
Page 5
Shader and texture performance
Pixel Shader performance
We tested two relatively simple lighting shaders that represent a good compromise between theoretical and practical views of calculation power:
The GeForce 8800 destroys the competition. Note that, unlike the GeForce 7, its performance does not drop in FP32, even if, according to NVIDIA, it could be faster with FP16 in specific cases. As explained before, its architecture gives it a huge quantity of registers as long as the latency to mask isn't too large. The lack of registers is the main cause of the performance gap between FP16 and FP32 on previous NVIDIA GPUs.
Vertex Shader performance
We tested T&L, VS 1.1, VS 2.0 and VS 3.0 performance with RightMark:
Thanks to the unified architecture, the GeForce 8800 can devote all its resources to processing vertex shaders, which opens the way to large performance improvements. The measured results could have been even higher, but in these tests the GPU is limited by something other than computing power. The results obtained were nevertheless high enough to crush the competition.
Texture access performance
We measured the performance of access to textures of various formats and sizes, with and without filtering. As the texture is displayed in full screen (1920 x 1440), access to big textures actually exceeds what the texture cache can handle efficiently.
 There is the GeForce 8800...and then the others
Page 6
Branching performance
One of the main innovations introduced with the GeForce 6800 was dynamic branching in pixel shaders. This facilitates shader writing and increases the efficiency of other shaders by avoiding calculations on pixels which don't need them. For example, why apply a very costly filter to soften the border of a shadow if the pixel is in the middle of the shadow? Dynamic branching helps to determine whether the pixel needs it or not. Splinter Cell Chaos Theory uses this technique, whereas The Chronicles of Riddick calculates everything for every pixel. Performance drops by 10 to 15% for the first and by more than 50% for the second. Of course, the algorithms aren't identical, but it does give us an idea of what dynamic branching is capable of.
This only applies to very specific cases. Branching has the reputation of being difficult to manage. This is particularly the case in CPUs, which have to predict the result of a branch to mask calculation latency. In a GPU, pixels are processed in groups of tens, hundreds or even thousands, which automatically masks this latency. There is another problem, however. For branching to pay off, all pixels of a group have to take the same branch; otherwise both branches have to be calculated for all pixels, with masks so that only the result of the required branch is written.
We developed a small test that allows us to change branching granularity (the number of consecutive pixels that take the same branch). We specify the branch to take per pixel column. One column out of 2 has to run a complex shader and the other can skip this part of the rendering. Average-sized triangles in motion are displayed on screen, across the areas that use different branches. The triangle size, their position and the column width all influence branching efficiency. We think this test is quite close to real situations.
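A much simplified model of this column test shows why granularity matters. The assumptions are loud ones: pixels are grouped into batches of consecutive pixels along a single row (real GPUs group pixels in 2D and the test draws moving triangles), and a batch can only skip the complex part if none of its pixels needs it. The function name is hypothetical.

```python
def complex_fraction(column_width, batch_size, row_length=1920):
    """Fraction of batches that must still run the complex shader part.

    Even-numbered columns need the complex part and odd-numbered ones
    can skip it. 0.5 is the ideal result, 1.0 means branching saves nothing.
    """
    batches = 0
    complex_batches = 0
    for start in range(0, row_length, batch_size):
        pixels = range(start, min(start + batch_size, row_length))
        needs_complex = any((x // column_width) % 2 == 0 for x in pixels)
        batches += 1
        complex_batches += needs_complex
    return complex_batches / batches


for batch_size in (32, 48, 1024):
    row = [round(complex_fraction(width, batch_size), 2) for width in (4, 16, 64, 256)]
    print(batch_size, row)
```

With columns narrower than a batch, every batch contains at least one pixel that needs the complex part, so nothing is saved and the branch instructions are pure overhead. Smaller batches reach the ideal 50% at narrower column widths.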
With narrow columns, GPUs can't use branching to avoid the complex part for half of the pixels, yet they still have to process the branching instructions. This reduces performance instead of increasing it, at least for the GeForce 7. ATI has a special unit for branching that works in parallel with the pixel shading and texturing pipelines to mask the cost of branching instructions. This is probably also the case for the GeForce 8800, whose performance improves even when that seems impossible. For now, NVIDIA hasn't been able to explain this behaviour, and it seems that the GeForce 8800 is capable of saving resources other than calculation power.
Although dynamic branching was one of the major advantages of the Radeon for over a year, it is now processed more efficiently by NVIDIA. The size of pixel threads on the GeForce 8800 is 32, instead of 48 for the Radeon X1950. Thanks to this, NVIDIA's new chip takes the lead. We specify threads of pixels because, for threads of vertices, the granularity is 16 vertices. Note that the Radeon X19x0 produces less predictable results than the Radeon X1800 in this test (as the strange result for the 16-pixel column shows). We suppose that this is due to the architecture's complex method of distributing pixels to shader cores, which reduces efficiency for groups of 48 pixels.
We conducted a second test related to dynamic branching. This time we rendered a fractal, first normally and then with branching. This algorithm uses a high number of identical iterations, which simply follow one another in the standard (or flat) shader. In the branching-based shader, we used a loop around every 2 iterations, with a test that checks whether additional iterations are useful or not. If they aren't, we exit the loop and skip the unnecessary iterations.
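The two shader variants can be sketched as follows. This is not the test shader itself, which is not given here: a Mandelbrot-style iteration is assumed as the fractal, the iteration limit of 32 is arbitrary, and the function names are hypothetical.

```python
def iterate(zr, zi, cr, ci):
    """One fractal iteration. Returns the new z and whether the old z was still inside."""
    inside = zr * zr + zi * zi <= 4.0
    if inside:
        return zr * zr - zi * zi + cr, 2.0 * zr * zi + ci, True
    return zr, zi, False  # escaped points stay frozen


def flat_shader(cr, ci, max_iter=32):
    """Flat variant: every iteration is always executed."""
    zr = zi = 0.0
    count = executed = 0
    for _ in range(max_iter):
        executed += 1
        zr, zi, inside = iterate(zr, zi, cr, ci)
        count += inside
    return count, executed


def branching_shader(cr, ci, max_iter=32, step=2):
    """Branching variant: test every `step` iterations and exit early."""
    zr = zi = 0.0
    count = executed = 0
    for _ in range(0, max_iter, step):
        if zr * zr + zi * zi > 4.0:
            break  # further iterations cannot change the result
        for _ in range(step):
            executed += 1
            zr, zi, inside = iterate(zr, zi, cr, ci)
            count += inside
    return count, executed


fast_escape_flat = flat_shader(2.0, 2.0)
fast_escape_branching = branching_shader(2.0, 2.0)
```

Both variants return the same iteration count for a given point, but the branching one executes far fewer iterations for points that escape quickly. On a GPU the saving only materialises when all pixels of a batch exit at the same time, which ties this test to thread size.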
 We have to point out the GeForce 7’s inefficiency in this test. The Radeon X1950 XTX provides better results and the GeForce 8800 is by far the best.
Page 7
Texture filtering
Filtering quality
Since the release of the GeForce 7, we have regularly criticised the filtering quality of NVIDIA's cards. Overly aggressive optimisations lead to shimmering in textures. The problem was that this was difficult to show in screenshots, since it was mainly observable in motion. The good news is that this problem is now a thing of the past. The GeForce 8 filters textures correctly.
This wasn't the only criticism we used to make regarding the GeForce 7, because these cards process an anisotropic filtering dependent on the angle of the surface that receives it. This optimisation simplifies calculations (and the units that are in charge of this operation!) and increases performances by applying a lower quality to some surfaces. ATI who is at the origin of this optimisation gave the possibility of its deactivation with the Radeon X1800, or at least replaced it with a less aggressive version.
NVIDIA does the same with the GeForce 8800, but pushes the concept further. They remove this optimisation in standard settings, and process a much more accurate calculation of the LOD than with ATI’s HQ mode. NVIDIA becomes the new reference in terms of image filtering quality!
 
Screenshots: Radeon X1950 XTX, Radeon X1950 XTX HQ, GeForce 7900 GTX, GeForce 8800 GTX
These screenshots do not represent filtering quality, only the mipmap level displayed. The later the color stripes appear, the sharper the textures. However, they cannot show whether textures shimmer because of poor filtering. This filtering problem, especially visible in flight simulators, can't be shown with this type of test or with screenshots. In practice, the GeForce 7 suffered from it, although it could be reduced by activating the "high quality" mode. This is no longer the case with the GeForce 8 (nor with ATI). In fact, filtering quality is slightly superior to what we observe with ATI.
We noted the presence of angle dependency filtering quality with Unreal Tournament 2003 and a reduced quality with the vertical surface in Far Cry. Obviously, NVIDIA doesn't aggressively optimise the drivers for UT2003, and it’s probably a vestige of old specific optimisations that get enabled on our system for an unknown reason.
Vertical angles don't seem to have the same level of filtering as the others.
Filtering performance
We measured performance with Serious Sam 2 in an outdoor environment.
The Radeons are strangely less efficient with 2x anisotropic filtering than with 4x. The reason might be that, as the level of anisotropy increases, the card has to process less trilinear filtering. We noticed that the Radeon X1950 is less efficient when it has to produce trilinear rather than anisotropic filtering. This might explain this odd result.
As you can see, the impact of activating anisotropic filtering is much lower for the GeForce 8. Trilinear is only 16.9% faster than aniso 16x. The performance gap increases to 57% with the GeForce 7 and 41% with the Radeon X1950 XTX. Having 64 texture filtering units is definitely useful.
Page 8
Antialiasing
While the Radeon started to support 6x multisample antialiasing a long time ago, the GeForce was limited to 4x. This limitation ends with the GeForce 8, which supports MSAA 8x and can also combine MSAA and HDR FP16 like the Radeon X1x00 (unlike the GeForce 7).
NVIDIA also implemented a new antialiasing mode called coverage sample antialiasing (CSAA). This mode improves the precision with which the various color samples are mixed to form the final image. For example, if 2 triangles cross the same pixel in MSAA 4x, they can be seen as covering 25 and 75% of the pixel, 50% each, or 75 and 25%. Colors are mixed in these proportions. Coverage sampling is based on a standard MSAA buffer and on a second buffer with a higher resolution (8x or 16x), which does not store color. It only holds a boolean value that indicates whether the triangle covers this area or not.
This technique has some limitations because, with the coverage sample buffer, it isn't possible to know precisely how numerous triangles interact, or how triangles that are contiguous or cut into each other interact. As the Z-buffer stays at the standard resolution, it isn't possible to know which triangle is on top. NVIDIA had to restrict the use of this coverage sample buffer to pixels covered by 2 clearly separated triangles. If this isn't the case, the CS buffer data is ignored. When it is taken into account, it provides in theory a result equivalent to MSAA at the coverage buffer's resolution, since the samples are mixed at that resolution. For example, a triangle that covers 30% of the pixel only gets 25% of the weight during MSAA 4x mixing, since that is the closest possible approximation in this mode. In MSAA 4x + CSAA 16x, it weighs 31%, which is a better approximation.
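The 25% and 31% figures follow from rounding the true coverage to the nearest weight a given number of samples can express. The sketch below assumes ideal, evenly weighted samples. Real sample positions are fixed points inside the pixel, so actual results depend on where the triangle edge falls.

```python
def quantised_weight(coverage, samples):
    """Closest blend weight that `samples` boolean coverage samples can express."""
    return round(coverage * samples) / samples


weight_msaa_4x = quantised_weight(0.30, 4)    # steps of 25%
weight_csaa_16x = quantised_weight(0.30, 16)  # steps of 6.25%
```

With 4 samples the 30% triangle is rounded down to 1 sample out of 4. With 16 coverage samples it gets 5 out of 16, or 31.25%.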
CSAA will only be of use in specific cases, but in return it requires far fewer resources than a higher MSAA mode, given that it consumes much less bandwidth and never requires additional calculations. In terms of memory, MSAA 4x occupies 256 bits per pixel. With CSAA 16x on top (16 boolean values), it "only" increases to 272 bits (probably a bit more than that in practice, but it stays cheap in terms of memory used).
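The per-pixel figures can be checked with a little arithmetic. The breakdown is an assumption that happens to match the 256-bit figure: each MSAA sample is taken to store 32 bits of color and 32 bits of depth/stencil, and each coverage sample a single bit.

```python
def msaa_bits_per_pixel(samples, color_bits=32, depth_bits=32):
    """Bits stored per pixel when every sample keeps its own color and depth."""
    return samples * (color_bits + depth_bits)


def csaa_bits_per_pixel(msaa_samples, coverage_samples):
    """MSAA storage plus one boolean per coverage sample."""
    return msaa_bits_per_pixel(msaa_samples) + coverage_samples


msaa_4x = msaa_bits_per_pixel(4)
msaa_4x_csaa_16x = csaa_bits_per_pixel(4, 16)
msaa_16x = msaa_bits_per_pixel(16)  # hypothetical mode, for comparison only
```

Under these assumptions a true 16-sample MSAA buffer would need four times the memory of MSAA 4x, whereas adding 16 coverage samples costs about 6% more.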
In practice, we measured the impact of these new options:
 The impact on performances is from 15 to 20%, which is reasonable. However, the difference between CSAA 8x and 16x is small, and this limits the interest of the 8x mode.
This second test is identical, except that we activated Transparency AntiAliasing (Adaptive AntiAliasing for ATI). Performance drops especially in MSAA 6x and 8x, which then have to calculate 6 and 8 color samples, respectively, for grass, grids and similar surfaces.
Antialiasing quality (without TAA / AAA)
GeForce 7: without AA, 4x AA, 8xS AA
Radeon X1000: without AA, 4x AA, 6x AA
GeForce 8: without AA, 4x AA, 4x AA + 8x CSAA, 4x AA + 16x CSAA, 8x AA (8xQ), 8x AA + 16x CSAA
Antialiasing quality (with TAA / AAA)
GeForce 7: without AA, 4x AA, 8xS AA
Radeon X1000: without AA, 4x AA, 6x AA
GeForce 8: without AA, 4x AA, 4x AA + 8x CSAA, 4x AA + 16x CSAA, 8x AA (8xQ), 8x AA + 16x CSAA
In our opinion, 4x MSAA with TAA or AAA should be used for a very good quality without an extreme performance cost.
Page 9
The cards, consumption
The cards
The first card based on the G80, the GeForce 8800 GTX, requires a 450 Watt power supply, while NVIDIA mentions figures of 750 to 1000 Watts for configurations with 2 cards in SLI! This GeForce 8800 GTX features a GPU clocked at 575 MHz (the shader processors are clocked at 1350 MHz) and 384-bit memory at 900 MHz, which provides comfortable bandwidth. You will have to pay a little more than 650€ to acquire this 27 cm / 10.6" long beast, which requires two power supply connectors. Noise is very low, just a bit higher than the GeForce 7900 GTX, which is our reference.
The second card, the GeForce 8800 GTS, uses a GPU with reduced capabilities. The number of shader processors decreases from 128 to 96 and the number of texturing units from 32 to 24. The GPU frequency decreases to 500 MHz (1200 MHz for the shader processors) and the memory to 800 MHz, this time on a 320-bit bus. Calculation resources are reduced by 33% and bandwidth by 25%. The price is also lower, at 450€ - 500€. The cooling system is similar to that of the higher-end model, simply shorter. This card is also very quiet.
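The reductions quoted for the GTS can be checked from the specifications of the two cards. The only figure not taken from the text is the assumption that the memory transfers data twice per clock (double data rate), and shader throughput is approximated as processors multiplied by shader clock.

```python
def bandwidth_gb_s(bus_bits, memory_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s, assuming double data rate memory."""
    return bus_bits / 8 * memory_mhz * transfers_per_clock / 1000


def shader_throughput(processors, shader_mhz):
    """Relative shader throughput: processors multiplied by clock."""
    return processors * shader_mhz


gtx_bandwidth = bandwidth_gb_s(384, 900)
gts_bandwidth = bandwidth_gb_s(320, 800)
bandwidth_reduction = 1 - gts_bandwidth / gtx_bandwidth

compute_reduction = 1 - shader_throughput(96, 1200) / shader_throughput(128, 1350)

print(gtx_bandwidth, gts_bandwidth)
print(round(bandwidth_reduction * 100, 1), round(compute_reduction * 100, 1))
```

Under these assumptions the GTX reaches 86.4 GB/s against 64 GB/s for the GTS, a reduction of about 26%, and shader throughput drops by a third. Both are in line with the rounded figures given above.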
 This GeForce 8800 GTS was provided by Galaxy. We thank the company for sending it so quickly. As NVIDIA only builds a few graphic cards, they are all identical and partners do not have many options to make their products stand out, at least in the beginning. Galaxy told us that the next GeForce 8800 GTX would be equipped with a water-cooling solution designed by Zalman.
 Finally, Asus sent us a second 8800 GTX and a GTS. Thanks to these cards we were able to test them in SLI.
We noted that each GeForce 8800 has a companion chip, the NVIO (on the left). It handles video input and output. We don't really know why NVIDIA chose this solution, which is more expensive than integrating these functions into the GPU. Maybe to make future changes easier without having to build a new GPU? Maybe to avoid having to buy Silicon Image chips?
There is an interesting detail regarding the PCB. Although the GeForce 8800 GTS is specified with 640 MB of memory (on a 320-bit bus), some boards can physically carry 768 MB, part of which is unused. Maybe the reason is that it is easier to build a single PCB with every memory location populated than a different version for each ROP partition that has been deactivated (since, of course, it isn't always the same one that is faulty or deactivated).
Consumption
The power consumption of these graphics cards was evaluated with measurements taken directly at the power outlet. This represents the computer's entire power consumption, with an Enermax 535W power supply. Figures were obtained on the Windows desktop and under load, first with a fillrate test that saturates the pixel shaders combined with Prime95, and then with 3DMark05 (GameTest 3), also combined with Prime95. Prime95 makes it possible to have constant CPU usage regardless of the graphics card's performance.
 We have to say that we worried too much about the power consumption of the GeForce 8800 GTX. In practice, it has in fact approximately the same power consumption as the ATI Radeon X1950 XTX. In 2D, however, the GeForce 8 required much more power than the other cards. Finally, NVIDIA’s new architecture more easily avoids feeding power to unused circuits. This is shown by our shader that only saturates the mathematical units of the chip. Under these circumstances, power consumption of the GeForce 7 increases compared to 3DMark, whereas it diminishes in the case of the GeForce 8.
Page 10
GeForce 8800 GTS 320 MB
Three months after the release of the first two GeForce 8800s, Nvidia is releasing a third variant, the GeForce 8800 GTS 320 MB. The only difference compared to the standard GeForce 8800 GTS is the video memory, which was decreased from 640 MB to 320 MB by replacing the ten 64MB chips with ten 32MB ones. The card frequencies and design remain unchanged. Nvidia announced a lower price of roughly 300-330€ ($299 to $329) instead of 400-450€. This only concerns “official” prices.
If we put official in quotation marks, it is because we have noted that GPU manufacturers usually announce very aggressive prices to polish their image. The pressure is then on the partners, who don't always know how to cope with these prices. The $299 MSRP is very difficult for some partners to maintain, considering that Nvidia sells each GeForce 8800 GTS 320 MB at $250. Obviously, it is quite easy to exceed the “official” price of $299 once packaging, transport and the margins of the various middlemen are added.
We used a GeForce 8800 GTS 320 MB designed by Leadtek for this test. It is based on the standard design but the manufacturer has replaced the upper part of the cooling system provided by Nvidia. This gives a rather original and successful design to the card.
Leadtek sells the card with an HDTV cable, a DVI/VGA adaptor, Serious Sam 2, Spellforce 2 and PowerDVD.
Page 11
Drivers, CPU limited, tests
Drivers and SLI
Even if the drivers of the GeForce 8800 (release 95) are still young, we have to say that they are already very efficient. Except for a few minor issues, we went through all tests without a problem. This is a very good sign and it would seem that NVIDIA didn't have to "force" things with the drivers to obtain good results.
As for SLI, while performance is up to standard, we came across a couple of problems. The drivers for the SLI tests arrived very late. We suppose that NVIDIA released them in a hurry and that there is still some work to do to ensure the usual quality.
The way antialiasing is handled in the driver control panel changes with the GeForce 8. There is now an option to enhance the application's setting. It consists of replacing the antialiasing mode requested by a game rather than forcing one from outside. This avoids bugs and makes it possible to use it with all games that support antialiasing. The trick is so simple that we don't understand why no one thought of it before!
CPU limited?
As with each new high end graphics card, we will often hear that the GeForce 8800 is "CPU limited" and that you need a very powerful CPU to make full use of it. This type of remark is regularly overstated. 3DMark has something to do with this, whether because the default resolution seems too hard for some to change or because the latest version includes processor power in the overall score.
We should add right away that the GeForce 8800 doesn't require more CPU resources than any less powerful GeForce. However, with graphics settings that are relatively low compared to the power of the card, there is a higher probability that the framerate will be limited by the processor and not by the graphics card. So is it a problem? Not really, because it means that you will be able to increase the resolution or the level of graphics detail handled by the graphics card without a noticeable reduction of the framerate. The only situation where being "CPU limited" is a real problem is when the CPU isn't powerful enough to deliver a smooth framerate, but that has nothing to do with the graphics card.
So, if you have a mid-range processor that is nevertheless powerful enough in games, you can buy a more powerful graphics card to really take advantage of your latest 24" monitor. More than the CPU / 3D combination, it's the 3D / monitor combination that needs to be balanced. With current games, you will have to pair at least an 8800 GTS with a 20" monitor and a GTX with a 24" monitor to really see their potential.
As a reminder, to find out whether your CPU or your graphics card limits performance in games, we recommend running a simple test. Reduce the resolution to 800*600 or even 640*480! If you do not notice an improvement, you are limited by the CPU, or perhaps by system memory if your hard drive becomes active precisely during the slowdowns. Otherwise, the graphics card is the source of the problem.
Tests
As usual, we activated anisotropic filtering for all tests. We believe that it's no longer necessary to deactivate it, especially with high end graphics cards. It is activated in the game when possible and in the drivers when it isn't. We also decided to activate Transparency Antialiasing, which allows better filtering of objects simulated with alpha tests, such as grids. In some games the impact on performance is significant, but the visual improvement is just as significant. These are high end graphics cards, so we weren't easy on them.
Test configuration:
eVGA nForce 680i
Intel D975XBX (Bad Axe)
Intel Core 2 Extreme X6800
2 x 1 GB
Western Digital Raptor 74 GB
Enermax 535W
Windows XP SP2
Catalyst 6.10
ForceWare 96.94
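The low-resolution check recommended earlier on this page can be written as a small decision rule. This is only a sketch: the function name and the 10% tolerance are hypothetical choices of ours, and the framerates have to be measured by hand on the same scene.

```python
def limiting_component(fps_native, fps_low_res, tolerance=0.10):
    """Classify the bottleneck from two runs of the same scene.

    fps_native:  framerate at the usual resolution
    fps_low_res: framerate at 800*600 or 640*480
    tolerance:   hypothetical threshold; a relative gain at or below it
                 is treated as "no improvement"
    """
    if fps_native <= 0:
        raise ValueError("fps_native must be positive")
    gain = (fps_low_res - fps_native) / fps_native
    if gain <= tolerance:
        # Lowering the resolution did not help: the graphics card was
        # not the limiting factor.
        return "CPU (or system memory)"
    return "graphics card"


# Hypothetical measurements, not taken from our tests:
print(limiting_component(60, 62))   # almost no gain at low resolution
print(limiting_component(40, 90))   # large gain at low resolution
```

The rule says nothing about whether the framerate is acceptable. It only tells you which component to look at first.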
Page 12
Quake 4, F.E.A.R.
Quake 4
Here, we recorded an action scene. Compared to Doom 3, there are fewer shadows but more characters and textures. This changes the rendering load.
Anisotropic filtering 8x was automatically activated by the game. Tests were made in Ultra quality mode.
 - Normal - Antialiasing 4x
F.E.A.R.
We use the integrated demo. Unfortunately, it only reports whole-number scores, so a normal variation of two tenths can produce a difference of one unit under the same conditions. For each card, we kept the best of three results.
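To see why a variation of two tenths can show up as a full unit, consider a card whose real average sits right next to a whole number. The sketch below assumes that the built-in benchmark truncates its result (rounding would simply move the boundary to x.5); the run values are hypothetical.

```python
import math


def reported_score(true_fps):
    """Score as shown by the benchmark.

    Assumption: the result is truncated to a whole number.
    """
    return math.floor(true_fps)


# Three hypothetical runs of the same card, at most 0.2 fps apart.
runs = [59.9, 60.1, 60.0]
scores = [reported_score(fps) for fps in runs]
best = max(scores)  # we keep the best of three results

print(scores, best)
```

Keeping the best of three runs does not remove the effect, but it makes it less likely that a card is penalized by a single unlucky run.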
All graphic options are pushed to the maximum except for soft shadows, which were deactivated.
16x anisotropic filtering was activated via the game.
 - Normal - Antialiasing 4x
 
Page 13
Half-Life 2 Lost Coast
For this test, we use an internal demo recorded with Lost Coast to test Valve's HDR, which uses a quite complex rendering format. It doesn't make the most of the additional capabilities of the GeForce 6, 7 and Radeon X1K, but it runs on all DirectX 9 cards with MSAA.
Anisotropic filtering x16 was activated via the game.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
   
Page 14
Far Cry
We use an internal demo, which is a mix of outdoor and indoor locations recorded in the "catacombs" map. We activate the "cold" or "hot" post process filter, which considerably increases image quality. With the post process and HDR, our honest impression is that rendering quality moves into another league and we rediscover the game. This post process filter leads to a performance reduction of roughly 20%.
Anisotropic filtering 8X was activated in the game.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
   The Radeon and GeForce use a similar basic FP16 rendering, but it isn’t perfectly identical since the image is less “burned” for ATI.
  HDR rendering for ATI GPU (left) and Nvidia (right).

Page 15
Serious Sam 2
Here we also recorded a demo and activated anisotropic filtering 16x in the game.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
   
There is a significant performance drop once AA 4x is activated. Our test scene includes a lot of objects that are filtered by Transparency Antialiasing, which requires a lot of resources in this case. Without this option, these objects wouldn't be antialiased and performance would be higher.
Page 16
Tomb Raider Legend
We load a saved game and always follow the same path.
Anisotropic 16x filtering is activated via drivers.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
   We changed our usual process and modified the scale for HDR graphs in Tomb Raider Legend as there was such a large performance gap once activated. It is important to know that it isn’t really HDR, but more of a totally different rendering with parallax mapping. More attention was also given to lighting. With Tomb Raider Legend, it’s all or nothing!

Page 17
Splinter Cell Chaos Theory
Once again, we use an internal demo. SM 3.0 is used with all cards that support it. Soft shadows and parallax mapping were activated.
Anisotropic filtering 16x is activated via the game.
 - Normal - Antialiasing 4x - HDR
   HDR is different for ATI and NVIDIA. With the GeForce, it’s FP16, while with the Radeon it’s FX16. It functions with the X800 but noticeably reduces quality.
  ATI's GPU HDR rendering (left) and Nvidia's (right).
Page 18
Age of Empires III
To test this game, we recorded a scene with units on the move.
Anisotropic filtering was activated via the game.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
Because of a bug, we couldn't measure the performance of the GeForce 8800 GTX in SLI in standard rendering. We suppose that it comes from an instability caused by a faulty card that we used for testing. This card belongs to the problematic series and we didn't have time to replace it. This won't, however, be the case for cards sold in shops.
  HDR rendering provides an equivalent result for ATI and NVIDIA, but it’s processed differently. It's FP16 for NVIDIA and FX10 for ATI. The FX10 mode, with only 2 bits dedicated to transparency, makes it possible to remain in the standard 32 bits, which is ideal for performances.
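The bit budget behind this remark is easy to verify: three 10-bit colour components plus 2 bits of transparency fill exactly 32 bits, whereas four FP16 components take 64 bits. The sketch below packs such a pixel into one 32-bit word. The bit layout (alpha in the top two bits) and the function names are assumptions made for illustration, not a description of how either GPU stores the data.

```python
BITS_FX10_PIXEL = 3 * 10 + 2   # 32 bits: R, G, B on 10 bits, alpha on 2
BITS_FP16_PIXEL = 4 * 16       # 64 bits: R, G, B, alpha on 16 bits each


def pack_rgb10_a2(r, g, b, a):
    """Pack 10-bit R, G, B and 2-bit alpha into one 32-bit integer.

    The layout (alpha in bits 30-31, red in bits 0-9) is an assumption.
    """
    for name, value, bits in (("r", r, 10), ("g", g, 10), ("b", b, 10), ("a", a, 2)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (a << 30) | (b << 20) | (g << 10) | r


def unpack_rgb10_a2(word):
    """Inverse of pack_rgb10_a2."""
    return (word & 0x3FF, (word >> 10) & 0x3FF, (word >> 20) & 0x3FF, (word >> 30) & 0x3)


pixel = pack_rgb10_a2(1023, 512, 0, 3)
print(hex(pixel), unpack_rgb10_a2(pixel))
```

With only four possible alpha values, such a format is of little use where fine transparency is needed, which is the price to pay for halving the size of each pixel compared to FP16.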
  ATI (left) and NVIDIA's rendering (right) are identical.
Even if it is based on FP16 for the GeForce, HDR can be coupled with antialiasing. How is this possible? It isn't multisampling but rather supersampling processed by the game engine. The image is calculated at 2.25 times the resolution, or 1.5 times larger in each dimension (2400 x 1800 instead of 1600 x 1200 for example). Although this makes FSAA available for the GeForce, we have to keep in mind that the antialiasing quality is much lower than that of 4x multisampling on the Radeon X1000. The method used for the GeForce is also very demanding and performance is reduced. With the GeForce 8, the game unfortunately still uses this same antialiasing method, despite the fact that the card supports FP16+FSAA.
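The supersampling figures above follow directly from the scale factor: 1.5 times in each dimension means 1.5 x 1.5 = 2.25 times as many pixels to compute. A minimal sketch, with function names of our own choosing:

```python
from fractions import Fraction


def supersampled_size(width, height, scale=Fraction(3, 2)):
    """Internal render size when each dimension is multiplied by `scale`."""
    ss_width = width * scale
    ss_height = height * scale
    if ss_width.denominator != 1 or ss_height.denominator != 1:
        raise ValueError("scaled size is not a whole number of pixels")
    return int(ss_width), int(ss_height)


def pixel_cost_ratio(scale=Fraction(3, 2)):
    """How many times more pixels are rendered than displayed."""
    return scale * scale


print(supersampled_size(1600, 1200))   # (2400, 1800)
print(float(pixel_cost_ratio()))       # 2.25
```

Every one of those extra pixels goes through the full shading pipeline, which is why this approach costs so much more than multisampling.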
  The quality of FSAA is better for ATI's GPUs (left) than NVIDIA's (right).
Page 19
Oblivion
We follow a specific, recorded path so that the movement is always identical and the test reproducible.
Anisotropic filtering 16x is activated via the driver.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
The Radeons seem to be comfortable with Oblivion. The X1800 XT ends up in front of the 7900 GTX. There is a bug on NVIDIA cards and some textures (probably the specular effect) tend to be pixellated.
Performance in Oblivion with AA 4x is seriously reduced simply because many objects are filtered by Transparency Antialiasing, which we activated and which is much heavier. Without this option, these objects wouldn't be antialiased.
 HDR is based on FP16. The GeForce 8 provides better results in HDR than in normal mode. We suppose that NVIDIA gave the priority in the drivers for the HDR mode of Oblivion or that it spread the load differently in a way that is better for the architecture of the GeForce 8.
Oblivion doesn't directly support HDR coupled with antialiasing and the developers haven't deemed it necessary to implement it themselves. ATI first did this with a special driver, the “chuck patch”, and has now integrated it into the official drivers. You just have to activate FSAA 4x via the driver and it works in the game. There is no magic in the famous chuck patch, because ATI and NVIDIA developers only have to detect which surface needs antialiasing and force the driver to apply it. NVIDIA has implemented the same possibility in its drivers, since the GeForce 8 supports FP16 HDR combined with antialiasing.
Page 20
Pacific Fighters, Colin McRae 05
Pacific Fighters
In the second OpenGL game of this test, we measured performance while playing back a recording of a combat scene.
Anisotropic filtering 16x was activated via the drivers. The rendering of flight simulators relies mainly on texturing. As the GeForce 7 has a lower filtering quality by default (this is unnoticeable in many cases, but not here), these cards have an obvious performance advantage. We activated the high quality mode for these cards to obtain a rendering closer to that of the Radeon and GeForce 8.
 - Normal - Antialiasing 4x

You may have noticed the performance problem of the GeForce 7 in SLI. This is no longer the case with the GeForce 8.
Colin McRae 05
We drive a specific reproducible segment of the game (straight ahead until the end of the track) in the Japan Rally.
Anisotropic filtering 16x was activated via the drivers.
 - Normal - Antialiasing 4x
Page 21
Need for Speed Carbon
We drive a specific reproducible segment of the game.
Anisotropic filtering was activated in the game.
 - Normal - Antialiasing 4x - HDR - HDR + Antialiasing 4x
   
Page 22
GeForce 8800 GTS: 320 vs 640 MB
What does the additional 320MB bring in practice?



With Quake 4, performance collapses on the GeForce 8800 GTS with only 320 MB of memory (we had to restrict the X-axis to keep the graph readable). As a reminder, we test Quake 4 in Ultra quality mode, which uses uncompressed textures requiring more memory. The 320 MB of the GeForce 8800 GTS isn't enough and the incessant transfers to video memory destroy performance. This drop, however, is much less pronounced on a GeForce 7 equipped with only 256MB, which leads us to believe that Nvidia hasn't yet really worked on the GeForce 8 drivers for cards with less dedicated memory. Performance in High quality mode is “normal”.
Tomb Raider Legend also requires a lot of memory. With this game, the framerate sometimes drops below 5 fps instead of approximately 30 with a card featuring at least 512 MB. It is important to point out that the reported figures are averages and that when the card is restricted by video memory, this generally leads to dramatic performance drops for limited periods of time. More often than not, however, performance is similar.
F.E.A.R. and Half-Life 2 Lost Coast are also sensitive to the quantity of memory once antialiasing is activated, as 4x antialiasing requires four times as much memory for some rendering buffers.
Finally, performance in Far Cry drops substantially as soon as the memory space is insufficient. This is the case in 1920x1440 with HDR and FSAA 4x.
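A rough estimate shows how quickly these buffers grow. The sketch below counts only one colour buffer, assuming that a 4x multisampled buffer stores four samples per pixel and ignoring the depth buffer, textures, compression and anything the driver keeps for itself. The function name is ours and the result is an order of magnitude, not a measurement.

```python
def color_buffer_mib(width, height, bytes_per_pixel, samples=1):
    """Size of one colour buffer in MiB.

    bytes_per_pixel: 4 for a standard 32-bit format, 8 for FP16 (64 bits)
    samples:         samples stored per pixel (4 for 4x multisampling)
    """
    return width * height * bytes_per_pixel * samples / 2**20


for label, bpp in (("32-bit", 4), ("FP16", 8)):
    plain = color_buffer_mib(1920, 1440, bpp)
    msaa4 = color_buffer_mib(1920, 1440, bpp, samples=4)
    print(f"{label}: {plain:.1f} MiB without AA, {msaa4:.1f} MiB with 4x AA")
```

Under these assumptions a single FP16 buffer with 4x antialiasing in 1920x1440 already takes more than a quarter of the 320 MB available, before any texture is loaded.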
Although 320MB is enough in most cases with current games, it restricts the GeForce 8800 GTS in specific situations such as very high resolutions with antialiasing.
Page 23
Performances in a nutshell
We calculated the average of all benchmark results, gave the same weight to all games, and attributed a “100” to the GeForce 7900 GTX in 1600x1200.
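One way to build such an index is sketched below. It is an assumption about the method, not a description of our exact spreadsheet: each game's results are first divided by the reference card's score in that game at the reference resolution, so that every game weighs the same, and the ratios are then averaged and multiplied by 100. The card names and framerates in the example are hypothetical.

```python
from statistics import mean


def performance_index(results, ref_card, ref_res):
    """Build an index where ref_card at ref_res scores 100.

    results[game][card][resolution] -> framerate.
    Every game is assumed to list the same cards and resolutions.
    """
    games = list(results)
    first = results[games[0]]
    index = {}
    for card in first:
        index[card] = {}
        for res in first[card]:
            # Normalizing per game gives each game the same weight,
            # whatever its absolute framerate.
            ratios = [results[g][card][res] / results[g][ref_card][ref_res]
                      for g in games]
            index[card][res] = 100 * mean(ratios)
    return index


# Hypothetical framerates, not taken from our graphs.
sample = {
    "game A": {"card_ref": {"1600x1200": 50.0}, "card_new": {"1600x1200": 80.0}},
    "game B": {"card_ref": {"1600x1200": 100.0}, "card_new": {"1600x1200": 180.0}},
}
idx = performance_index(sample, "card_ref", "1600x1200")
print(idx)
```

Without the per-game normalization, a game that runs at 200 fps would count four times as much as one that runs at 50 fps, which is what equal weighting is meant to avoid.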
 With standard rendering, the performances of the GeForce 8 surprised us even if they were announced to be limited by the CPU. On average, the GeForce 8800 GTX is 70% faster than the GeForce 7900 GTX! It’s also faster than any of the previous multi-card systems.
The GeForce 8800 GTS ends up just a little bit ahead of the GeForce 7950 GX2 and the 320MB is just behind.
The GeForce 8800 GTX in SLI can't really express its potential under these circumstances.
Once antialiasing is activated, the CrossFire system is just a little bit faster than the GeForce 8800 GTX. The 8800 GTX in SLI can finally give its maximum and has no competitor.
For single cards, the 8800 is in a comfortable position.
 In HDR, calculation power plays a bigger part and the GeForce 8800 increases its lead on the competition.
 Once HDR and antialiasing are simultaneously activated, the GeForce 8 are very comfortable and the GeForce 8800 GTX, despite the young drivers, is almost twice as fast as the Radeon X1950 XTX, which used to crush the competition in this test.
Indeed, the GeForce 7 cards are incapable of applying antialiasing in games that use a format based on FP16 (64 bits). The result is a “0” for them in this test.
As you will have noticed from the performance index, based on more than 700 tests and 12 games, the GeForce 8800s are very efficient. This is with current games and without any compromise in graphics quality. Clearly, the GeForce 8800 GTX doesn't have any direct competitor. The GeForce 8800 GTS is much slower than the GTX version, which its deactivated units and lower frequencies of course explain. However, it still provides interesting results. The 320MB version of the GTS is one step lower but it ends up in front of the Radeon X1950 XT. The gap is small at high resolution with antialiasing, but much greater once calculation power is the most important factor.
Page 24
Conclusion
When writing the conclusion for a new high end product, two critical questions come to mind. Does it have any downsides? What is it designed for?
What is the point of a high end graphic card like the GeForce 8800 GTX? It’s at the same time somewhat pointless and very interesting. It’s pointless, because you don't need to buy such a card to enjoy games, and it’s very interesting because once we activate all graphic options without compromises and with a very high resolution, it looks great and it's difficult to go back to normal rendering, especially if you have an LCD monitor.
With a GeForce 8800 GTX and a 24" 1920x1200 monitor, you will be able to enjoy all the graphic advancements currently available.
Is there anything bad we can say about the GeForce 8800 GTX? We spent a lot of time looking for criticisms, ran all kinds of tests and, despite the immaturity of the drivers, we couldn't find any. Practical and theoretical results are all excellent. Performance is excellent without "forcing", which shows how strong the new architecture is. This is the same feeling that we had at the time of the Radeon 9700 Pro.
Performance isn't the only excellent aspect. Obviously, NVIDIA was upset to see its products associated with lower quality for a year or so. The company decided to work on this aspect and has released a new card that becomes the reference in terms of filtering quality.
Of course, there is now the question of DirectX 10. How will the GeForce 8800 behave with DirectX 10 games? We won't answer this question today, because we don't know. Nevertheless, in our tests, we couldn't find anything that would show that it will have more trouble than any other GPU.
The last criticism, and the easiest, concerns the size of the card: it won't easily fit into small form factor computers or some towers (unless you push really hard). The power consumption also sets a new record (though not by much). But as we write these words, we can't help thinking that few people find fault with Ferraris for their fuel consumption.
The high end implementation of the GeForce 8 architecture is a success and we are eager to see the result of the mid-range and entry level versions. Will NVIDIA be able to release products as efficient as the GeForce 6600 GT and 7600 GT were, or will it be a problem, as it was with the mid-range of the Radeon X1000 architecture?
The GeForce 8800 GTX is an extraordinary graphics card, but it probably won't be widely available while demand should be very high because of its performance. There will probably be more GeForce 8800 GTS cards. Sold at a price close to that of the GeForce 7950 GX2, it generally provides higher performance with better rendering quality and supports DirectX 10. To finish, you should note that the price gap between these two graphics cards corresponds to their performance gap, which means that the GeForce 8800 GTX is much faster than the GTS version. We can guess that this larger-than-usual gap is due to the G80 yield not being as good as it was for previous GPUs because of its huge die size.
Update 12 February 2007 :
At the beginning of 2007, Nvidia is starting to release less expensive GeForce 8s. The 320 MB version of the GeForce 8800 GTS is the first. Priced from €300 to €350, this card offers good value and is clearly more interesting than the Radeon X1950 XTX or the GeForce 7900 GTX and 7950 GX2. One important point to keep in mind, however, is that the 320MB restricts performance at very high resolutions with antialiasing. This is true for current games and the tendency will continue in the future.
This GeForce 8800 GTS will be a good choice for those equipped with 20" LCD monitors running at 1680x1050 who need a graphics card quickly. If you aren't in a rush, with competing products from AMD due at the end of March, it would be wise to wait a little longer, see how the cards behave in a DirectX 10 environment and perhaps let the competition drive prices down.
Copyright © 1997-2013 BeHardware. All rights reserved.