DirectX 10 and GPUs - BeHardware

Written by Damien Triolet

Published on July 7, 2006

URL: http://www.behardware.com/art/lire/631/


Page 1

DirectX 10 and GPUs



We have heard a lot about DirectX 10, first because its name has changed many times: WGF, WGF 2.0 and DirectX Next, before coming back to the most logical choice, DirectX 10. The part that mainly interests us, however, is the graphics rendering component, Direct3D 10. DirectX 10 has also been the subject of many articles about support outside of Windows Vista and about a unified architecture that would supposedly be mandatory. These subjects were of less interest because Microsoft has always said that DirectX 10 would require Windows Vista and that it has no control over the hardware implementation of this API; that choice is left to GPU manufacturers. The growing number of documents distributed by Microsoft gives us the occasion to take a look at the specifications of this new API and to share a few thoughts about the first GPUs that will support it.

A lighter but richer API
Direct 3D 10 is an opportunity for Microsoft to start all over again, or almost. This means that the remnants of fixed functions inherited from previous DirectX versions will disappear. This simplification of the API, which means that D3D 10 will only support D3D 10 GPUs, should make it lighter and less resource hungry, especially when driving the GPU, which currently consumes a lot of processor time to send commands and data and to perform various verifications. Overall, Microsoft has worked a lot in that direction: reducing the processor time used by the API.


With supporting figures, Microsoft says that the new API will require fewer CPU cycles per command. It is therefore possible to increase the number of commands and the complexity of the scene.


Aside from the intrinsic evolutions of the API, two new capabilities are introduced.


A new type of shader, the geometry shader, will work on primitives (points, lines or triangles) and will be able to create new ones. Geometry shaders will work on data sent by the vertex shaders, and the first use we might think of for them would be to act as tessellation units to increase geometric detail (TruForm, displacement mapping, etc.). This is however only a small part of what they can do, and it won't be used at first because of insufficient performance, as Microsoft has said many times to keep developers from moving in the wrong direction.


Tessellation with geometry shaders will make it possible to do some displacement mapping. It will however be more of a technology demonstration than something widely used in practice.


Other uses include improved particle systems (with the obvious possibility of adding particles), fur-type effects, optimised rendering into a cubemap, and per-primitive data calculation that will help pixel shaders work more efficiently. The first uses of geometry shaders may therefore be more abstract than something as concrete as tessellation. Let's take the case of cubemaps, which are used for example to display reflections on a car. A cubemap is made of 6 faces, as the name suggests. Each of them represents the environment around the car and they are rendered one after another. Each time, the vertices have to be reprocessed and an image has to be recalculated from a different point of view. With geometry shaders, it will be possible to replicate each triangle up to six times (if necessary), to render it to each face where it is visible in the same pass, and so to calculate all the faces of a cubemap in a single pass. This will improve performance.
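As a rough illustration of the cubemap case, the Python sketch below compares the work of six separate passes with a single pass in which each triangle is replicated only to the faces it touches. It is a toy cost model, not Direct3D code: all function names are hypothetical, and the visibility test (the face each vertex direction points at, seen from the cube centre) is a simplification of a real per-face frustum test.

```python
def dominant_face(v):
    """Return the cube face a direction vector points at (largest absolute axis)."""
    x, y, z = v
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return "+X" if x >= 0 else "-X"
    if ay >= az:
        return "+Y" if y >= 0 else "-Y"
    return "+Z" if z >= 0 else "-Z"


def faces_touched(triangle):
    """Simplified visibility test: faces hit by at least one vertex."""
    return {dominant_face(v) for v in triangle}


def multipass_cost(triangles):
    """Six passes: every vertex is reprocessed once per cubemap face."""
    return {"passes": 6, "vertices_processed": 6 * 3 * len(triangles)}


def single_pass_cost(triangles):
    """One pass: vertices are processed once, then each triangle is
    replicated (as a geometry shader could do) per face it touches."""
    emitted = sum(len(faces_touched(t)) for t in triangles)
    return {
        "passes": 1,
        "vertices_processed": 3 * len(triangles),
        "triangles_emitted": emitted,
    }


# Two example triangles: one entirely in front of +X, one spanning three faces.
scene = [
    ((1.0, 0.0, 0.0), (1.0, 0.1, 0.0), (1.0, 0.0, 0.1)),
    ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
]
```

In this model the saving comes from processing each vertex once instead of six times; the replication cost depends on how many faces each triangle actually covers.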



Another innovation is the stream output, which is placed after the geometry shaders. It makes it possible to write to memory the data that results from vertex processing, after the vertex shaders and geometry shaders. With current GPUs, only pixels can be written to memory.


Page 2
Specifications

What do we know about the rest of the specifications? Microsoft hasn't gathered them all in a single marketing document; they are scattered across several pieces of documentation sent to developers. We have gathered in a table as many specifications as we could find and added the specifications of DirectX 9, the evolution in SM 3.0 and the two architectures that support it, in addition to Microsoft's announcements about DirectX 10.1.


Just as with each new version of DirectX, there is more of everything: instructions, registers, etc. The objective is to keep developers from being restricted by the limits of the Shader Model, here in version 4.0, without imposing uselessly costly complexity at the GPU level. As you have probably noticed, the basic specifications (just like the instruction set) are now the same for pixel shaders and vertex shaders (and geometry shaders of course). This is what Microsoft calls the unification of shaders (this isn't at the hardware level!).


The D3D 10 calculation unit as seen by Microsoft
More details…
The number of instructions increases from 512 to 65536 (128x more) and the number of executed instructions is now unlimited. As a reminder, the number of instructions executed can be higher than the number of instructions in the shader because of loops that repeat a series of instructions. The number of temporary registers jumps from 32 to 4096, but of course, as is already the case with 32, GPU manufacturers won't have to implement that many in hardware, at least not at optimum performance. The driver will have to be able to accept a shader that uses that many registers and modify it to take hardware restrictions into account.
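The difference between the instruction count of a shader and the number of instructions it executes can be sketched as follows. This is a hypothetical Python model, not a real shader format: a program is a list whose items are either an instruction name or a `("loop", count, body)` tuple, and the loop instruction itself is assumed to cost nothing.

```python
def instruction_slots(program):
    """Count the instructions stored in the shader (the 512 / 65536 limit)."""
    total = 0
    for item in program:
        if isinstance(item, tuple):
            _, _, body = item
            total += instruction_slots(body)
        else:
            total += 1
    return total


def executed_instructions(program):
    """Count the instructions actually run, repeating loop bodies."""
    total = 0
    for item in program:
        if isinstance(item, tuple):
            _, count, body = item
            total += count * executed_instructions(body)
        else:
            total += 1
    return total


# Hypothetical shader: two moves around a 100-iteration, 3-instruction loop.
shader = ["mov", ("loop", 100, ["mul", "add", "texld"]), "mov"]
```

Here the shader occupies 5 instruction slots but executes 302 instructions, which is why the two limits are specified separately.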

One of the most important evolutions of this DirectX version concerns constants and their updates. Everything has been reviewed to make their use more flexible while reducing the CPU cost of managing them, without affecting access performance. Constants and textures both represent memory accesses and could have been unified, but their access constraints and performance requirements are very different and still justify keeping them separate.

Each element, whether it is a pixel or a vertex, starts with a maximum of 16 input registers, compared to 10 for pixels with DirectX 9. In the case of a vertex, these are the basic data used for rendering, which come from the CPU and from the objects to render and are organised by the Input Assembler, an improved version of what is currently done with Geometry Instancing. For a pixel, they are mainly interpolated data, texture colours and addresses. Things get a little more complicated with geometry shaders since they work on primitives. They have to be able to accept as input the data of 3 vertices (a triangle) but also of the adjacent vertices: 6 x 16 registers of 4 x FP32 each, which is enormous. Geometry shaders can pass up to 32 registers to pixels, 16 more than without them. We don't know for sure, however, whether geometry shaders can modify all 32 registers or whether they have to pass through the 16 original values that come from the vertex shaders and can only add up to 16 others.
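To put a number on that input size, here is a small worked calculation in Python. The constant and function names are ours; the only inputs are the figures quoted above (16 registers per vertex, 4 FP32 components per register) plus the standard 4-byte size of an FP32 value.

```python
REGISTERS_PER_VERTEX = 16   # from the specification figures quoted above
COMPONENTS_PER_REGISTER = 4  # each register holds 4 x FP32
BYTES_PER_FP32 = 4


def geometry_shader_input(vertex_count):
    """Return (registers, fp32 values, bytes) for one input primitive."""
    registers = vertex_count * REGISTERS_PER_VERTEX
    values = registers * COMPONENTS_PER_REGISTER
    return registers, values, values * BYTES_PER_FP32


triangle = geometry_shader_input(3)                 # triangle alone
triangle_with_adjacency = geometry_shader_input(6)  # triangle + 3 adjacent vertices
```

A triangle with its adjacent vertices therefore represents up to 96 registers, or 384 FP32 values (1536 bytes), for a single geometry shader invocation.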

Texture access has evolved. Today, the number of texturing instructions is already unlimited (though bounded by the maximum number of instructions), but the number of supported textures and the number of texture access modes (= number of samplers) are not: both are set at 16. For example, texture 1 with trilinear filtering uses one of the 16 available access slots. With DirectX 10, textures and samplers are separated. The number of samplers is still 16 (in other words, one shader can use 16 filtering modes) but the number of textures increases from 16 to 128. This also applies to vertex shaders (currently 4 textures for GPUs that meet the Shader Model 3.0 specifications) and geometry shaders. These textures can be up to 8192x8192 pixels, compared to the 2048x2048 currently required, even if recent GPUs all support textures of 4096x4096 pixels (this size is sometimes a problem if the GPU can't find a large enough block of free memory to place the texture, which often happens in FP32). FP16 filtering finally becomes mandatory (the GeForce 6 and 7 support it but not the Radeon X1000), as do shadow map access and filtering (PCF, percentage closer filtering, currently only supported by NVIDIA). These requirements apply to all types of shaders (current vertex shaders have no filtering). That is not all, because a new type of access is added: load. Sampling a texture consists in taking the texel closest to a given coordinate (point sampling) or the group of texels close to this coordinate (bilinear or anisotropic filtering). Load consists in fetching one very specific texel. This makes it easier to use textures to store data other than images.
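The difference between sampling and load can be sketched in Python on a small single-channel texture stored as a list of rows. This is only an illustration: the function names are ours, and the conventions used (normalised coordinates in [0, 1], texel centres at half-integer positions, clamped addressing) are assumptions rather than quotes from the Direct3D 10 specification.

```python
import math


def load(texture, x, y):
    """Load: fetch exactly one texel by integer coordinates, no filtering."""
    return texture[y][x]


def sample_point(texture, u, v):
    """Point sampling: return the texel closest to normalised (u, v)."""
    height, width = len(texture), len(texture[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


def sample_bilinear(texture, u, v):
    """Bilinear filtering: blend the 4 texels surrounding (u, v)."""
    height, width = len(texture), len(texture[0])
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(ix, iy):
        # Clamp addressing at the texture borders.
        ix = max(0, min(ix, width - 1))
        iy = max(0, min(iy, height - 1))
        return texture[iy][ix]

    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


data = [[0.0, 10.0],
        [20.0, 30.0]]
```

With `load`, a shader that stores arbitrary data in a texture gets back exactly the value written at a given position, whereas `sample_bilinear` at the centre of `data` returns a blend of all four values.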

The calculation precision is still FP32 but has also been improved. GPU manufacturers can currently implement FP32 as they see fit: rounding accuracy, support for special numbers (NaN, +/-Inf, etc.). This can be a problem for developers, who see shaders behave differently from one GPU to another. For example, a current implementation can replace a NaN (for example 0/0 = NaN) with 0. This simplifies the design of the calculation units and helps basic 3D rendering, which deals more easily with a concrete 0 than with a NaN. It doesn't, however, correspond to usual floating point behaviour, and Microsoft has decided that, to ease future evolution, it would be best to make support for these special numbers mandatory. Several other similar points have been defined to get as close as possible to the IEEE 754 standard found in CPUs, without following its specifications completely (this would have been a needless additional cost). The relative error can be larger than in IEEE 754. Note that Microsoft has defined the behaviour of floating point numbers not only for the calculation units but also for the texture filtering and blending units.
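The NaN example can be made concrete with a short Python sketch. Python itself raises an exception on a float division by zero, so `ieee_divide` below reproduces the IEEE 754 results by hand; `legacy_divide` is a hypothetical model of an implementation that replaces NaN with 0, not the documented behaviour of any particular GPU.

```python
import math


def ieee_divide(a, b):
    """Division with IEEE 754 special values: 0/0 gives NaN, x/0 gives +/-Inf."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        # The sign of the infinity depends on the signs of both operands.
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b


def legacy_divide(a, b):
    """Hypothetical pre-D3D 10 behaviour: a NaN result is replaced by 0."""
    result = ieee_divide(a, b)
    return 0.0 if math.isnan(result) else result
```

The same shader code, `0.0 / 0.0`, thus yields NaN on one implementation and 0 on the other, which is the kind of divergence between GPUs that the stricter rules are meant to remove.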

Another major innovation of Direct 3D 10 is full support for 32-bit integers in addition to floating point numbers. Integer support is useful in many situations during development and goes hand in hand with support for bitwise operators. This is a valuable tool for developers, who now have a set of operations that is getting closer and closer to that of a CPU.
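One typical use of 32-bit integers and bitwise operators is packing several small values into a single word. The Python sketch below packs four 8-bit colour channels into one 32-bit integer and unpacks them again; the function names and the channel layout (red in the low byte) are our own choices for illustration.

```python
def pack_rgba8(r, g, b, a):
    """Pack four 8-bit channels into one 32-bit unsigned integer."""
    return (r & 0xFF) | ((g & 0xFF) << 8) | ((b & 0xFF) << 16) | ((a & 0xFF) << 24)


def unpack_rgba8(value):
    """Recover the four 8-bit channels with shifts and masks."""
    return (
        value & 0xFF,
        (value >> 8) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 24) & 0xFF,
    )


packed = pack_rgba8(255, 128, 0, 64)
```

Without integer and bitwise support, this kind of operation has to be emulated with floating point multiplications and truncations, which is both slower and more error-prone.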

The number of render targets increases to 8. This means that a DirectX 10 GPU will be able to write 8 values to memory in addition to Z data, instead of 4 today. FP32 blending becomes mandatory, whereas FP16 blending wasn't in DirectX 9, even if it was supported by most SM 3.0 cards (except for the 6200). Support for multisample antialiasing is still optional. When it is supported, however, manufacturers have to make it possible to read a multisampled render target like a standard texture. With current GPUs this is impossible and such a render target must be downsampled before being used again. This support is very complex to implement because of the data compression algorithms that underpin MSAA.
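The two ways of using a multisampled render target can be sketched as follows in Python. The data layout (a list of rows, each pixel holding a list of sample values) and the function names are hypothetical, and the simple average used for the downsample is an assumption, since the exact filter is left to the hardware.

```python
def resolve(msaa_target):
    """Downsample: average the samples of each pixel into a single value."""
    return [
        [sum(samples) / len(samples) for samples in row]
        for row in msaa_target
    ]


def read_sample(msaa_target, x, y, index):
    """Direct read: fetch one individual sample of one pixel."""
    return msaa_target[y][x][index]


# One row of two pixels with 4 samples each; the first pixel lies on an edge.
target = [[[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]]
```

After `resolve`, the edge pixel only keeps its averaged value (0.5 here), whereas `read_sample` still gives a shader access to the individual samples.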


Direct 3D 10 hasn't even been released and Microsoft is already talking about its successor! Direct 3D 10 hardware will work with Direct 3D 10.1 even if it won't be able to exploit all of its possibilities. This is also the case for a DX8 GPU with DX9. DX9 GPUs, however, won't work with DX10, and D3D 10 games will have to include a D3D 9 rendering path to support them. 4x MSAA will be required with Direct 3D 10.1 (all of ATI's and NVIDIA's GPUs support it, but this isn't the case for products from Intel, S3, etc.). With this new version of Direct 3D, Microsoft will finally be able to specify in detail how antialiasing works and to give more control to developers. This hasn't been done yet, probably because of a lack of time and to avoid delaying the release of DirectX 10. Some Direct 3D 10 GPUs may already support this more advanced management of antialiasing (by the way, the NVIDIA G80 will have a brand new antialiasing engine).

Support for FP32 texture filtering will be required, and Microsoft speaks of increased calculation precision but hasn't said whether this means an FP32 format even closer to IEEE 754 or another format. Blending will also evolve to become more flexible. We can roughly suppose that Direct 3D 10.1 will represent an evolution of the remaining fixed-function units of GPUs. Will this come before making them completely programmable too in a future version?


Page 3
GPUs and DirectX 10

GPUs
For ATI and NVIDIA, DirectX 10 is a major evolution and both are very enthusiastic about the new possibilities that will appear. We called both manufacturers and their points of view are identical except on one point. They are happy about the internal optimisations of the API that reduce CPU usage. For the same CPU load, scene complexity will be able to increase, in addition to the evolution of a graphics pipeline that has been unchanged for a while. Geometry shaders and the stream output open new possibilities. The point of divergence is the unification of the GPU architecture for calculation units. According to ATI it is the best solution, whereas for NVIDIA it isn't yet. Here is a basic representation of the architectures that we expect to see from ATI and NVIDIA:


Left: the supposed G8x implementation of Direct 3D 10. Right: the supposed R6xx implementation of Direct 3D 10.


Even if GPU manufacturers like to talk about completely new architectures developed from scratch, they are always evolutions. Developing a GPU is complex, and obtaining good rendering results requires a lot of small optimisations to polish the details and ensure that the flow of data and operations isn't interrupted. It requires considerable expertise, which ATI and NVIDIA have acquired over time and which many potential competitors such as XGI, S3 and Matrox have lacked, as they haven't obtained good enough results from their GPUs. For ATI and NVIDIA it is of course out of the question to abandon this expertise and move in a direction completely opposite to that of their previous architectures.

For ATI, moving to a unified architecture for execution units (= shader processing) is a logical evolution that continues its previous and current architectures. A unified architecture requires advanced management of the different threads (tasks). The Radeon X1000 already does this, even if it is restricted to pixel shaders. With a unified architecture it is possible to assign execution units to the tasks that need them and to avoid leaving units unused. ATI already has experience of a unified architecture because the Xbox 360 GPU has one.
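The argument about unused units can be illustrated with a deliberately simple Python model. It assumes that work divides perfectly across units and that a unified design has no scheduling overhead, neither of which is guaranteed in real hardware; the unit counts and workloads are hypothetical values chosen for the example.

```python
def dedicated_time(vertex_work, pixel_work, vertex_units, pixel_units):
    """Separate unit pools working in parallel: the slower pool sets the pace."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def unified_time(vertex_work, pixel_work, total_units):
    """One shared pool: any unit can take vertex or pixel work."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical GPU with 8 vertex units and 24 pixel units (32 in total).
pixel_heavy = (80, 720)      # (vertex work, pixel work) in arbitrary units
geometry_heavy = (400, 240)
```

In this model the unified pool finishes the pixel-heavy frame in 25 time units instead of 30, and the geometry-heavy frame in 20 instead of 50, because no unit sits idle while the other pool is saturated. The real gain depends on how much that flexibility costs in transistors and scheduling, which is exactly where the two manufacturers disagree.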

For NVIDIA, the situation is very different because the efficiency of its pixel calculations comes from a very long pipeline in which texture access latency is hidden. Moving directly from such an architecture to a unified one would represent a big risk that could seriously affect rendering. Also, NVIDIA's current architecture is rather economical in terms of chip size whereas ATI's is very costly. Not unifying the architecture with the G8x will make it possible for NVIDIA to ensure a high level of performance without increasing the number of transistors too much. We expect that the G8x and the G80 (previously NV50) might not be unified, or at least not completely.

Even if Direct 3D 10 defines a similar set of operations for all types of shaders, very slight differences will remain (only the geometry shader can create elements, for example) and each type of shader will have to process tasks with a different profile. Optimising the same execution unit for a vertex shader or for a pixel shader leads in different directions. These directions aren't that different, however, between geometry shaders and vertex shaders. This makes us think that NVIDIA could develop a partial unification of the architecture, with processing units shared between these two. Pixel shaders would keep dedicated units, improved for the specificities of DirectX 10. With this compromise, NVIDIA would integrate DirectX 10 support while still keeping a high yield per mm².

What would be the best solution?
It is of course a little early to give our opinion. NVIDIA will release the G80 before DirectX 10 is usable. The support probably won't be activated in the drivers right away. For ATI, the R6xx will probably come with DirectX 10 support because we estimate that it will be very easy to show the benefits of a unified architecture in various theoretical cases. ATI will use this possibility and is right to do so. The thing is that theoretical cases may not reflect practice, or at least not for a while.

With DirectX 9 and the first DirectX 10 games (which Microsoft has announced for the same date), the G80 will behave like a GeForce 7900 with more calculation units and the R6xx like a more efficient Radeon X1900 (ATI speaks of a 25% gain in some games, with the same number of calculation units, between a unified and a standard architecture). This doesn't teach us anything except that it is too early to know which one will be the fastest. Clock frequencies will once again be important, and NVIDIA won't give up on a dual-GPU card like the GeForce 7950 GX2.

ATI's and NVIDIA's architectures will be used across their entire GPU lines, and just because one architecture is more efficient for high-end products doesn't mean it will also be so for mid-range and entry-level products.

Speaking of yield per mm² for NVIDIA, we have to point out that this is in fact a very big simplification. Everyone accepts it because the gap between ATI's and NVIDIA's chip sizes is enormous. In practice, the size of the chip is only one of the elements that define the yield per mm² (of wafer). The frequency used is another factor. If a manufacturer has to use very high frequencies to remain competitive, this reduces the number of usable chips and therefore the yield per mm² of wafer. One architecture might also offer more possibilities than another for integrating redundancy. This reduces the number of unusable chips and increases the yield per wafer. A bigger chip might in the end have a higher yield than a smaller one.

As you have probably understood, we expect NVIDIA to talk about the size of its chips (and also about power consumption) and ATI to defend the benefits of a unified architecture. As usual, these will be marketing simplifications and it will be important to step back a little to form a more relevant opinion. Finally, developers will also have their say, because the technological choices they make will favour one architecture or the other.


Copyright © 1997-2010 BeHardware. All rights reserved.