Nvidia CUDA: preview
by Damien Triolet
Published on March 21, 2007

Precision
For general-purpose calculation, the precision and detailed behavior of the math units have to be documented and should conform to IEEE standards. The GeForce 8800, like other current GPUs, doesn't fully comply with them, because it doesn't support denormalised numbers and has lower precision for some operations. Nvidia provides detailed information about the behavior of the calculation units, which makes it possible to know where they stray from a CPU.


Units are restricted to single precision (FP32), but it’s likely that the next generation will support double precision (FP64) as CPUs do.
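To make the missing denormal support concrete, here is a minimal Python sketch of a unit that flushes denormalised FP32 values to zero. The flush-to-zero behavior and the helper names (to_fp32, flush_denormal) are assumptions made for illustration, not Nvidia's published specification.

```python
import math
import struct

# Smallest positive normalised FP32 value (2^-126).
FP32_MIN_NORMAL = 2.0 ** -126

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def flush_denormal(x):
    """Assumed model of a unit without denormal support: denormals become signed zero."""
    y = to_fp32(x)
    if y != 0.0 and abs(y) < FP32_MIN_NORMAL:
        return math.copysign(0.0, y)
    return y

tiny = 1e-40  # below FP32_MIN_NORMAL, so FP32 can only hold it as a denormal
print(to_fp32(tiny) > 0.0)   # a unit with denormal support keeps a nonzero value
print(flush_denormal(tiny))  # the flushing model loses it entirely
```

A calculation that relies on very small intermediate values can therefore give a different result on the GPU than on a CPU, even though both work in FP32.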


Local memory
The processors of the GeForce 8800 support gathering and scattering. This means they are capable of reading and writing anywhere in local memory (on the graphics card) or elsewhere (other parts of the system).
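As a rough illustration of the two access patterns, the following Python sketch treats each list position as one thread. The function names are hypothetical, and the loop runs sequentially, whereas real threads would run in parallel.

```python
def gather(src, indices):
    """Thread i reads from an arbitrary position indices[i]."""
    return [src[j] for j in indices]

def scatter(values, indices, size, fill=0):
    """Thread i writes values[i] to an arbitrary position indices[i]."""
    dst = [fill] * size
    for v, j in zip(values, indices):
        dst[j] = v
    return dst

data = [10, 20, 30, 40]
print(gather(data, [3, 0, 2]))           # [40, 10, 30]
print(scatter([1, 2, 3], [2, 0, 3], 4))  # [2, 0, 1, 3]
```

Gathering was already possible through texture reads in classic GPU programming; the ability to scatter, that is to write to a computed address, is what this sketch is meant to highlight.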


These memories, however, are not cached, and each read or write on the GeForce 8800 costs between 200 and 300 cycles of latency! This latency can be masked by numerous mathematical instructions, provided they do not depend on the data being read.
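To give an order of magnitude for "numerous", here is a small back-of-the-envelope sketch. It assumes that one instruction keeps a multiprocessor busy for 4 cycles; that figure and the function name are assumptions for illustration, not values stated here.

```python
import math

def instructions_to_hide(latency_cycles, cycles_per_instruction=4):
    """Independent instructions needed to cover one memory access.

    cycles_per_instruction=4 is an assumed figure, not taken from the article.
    """
    return math.ceil(latency_cycles / cycles_per_instruction)

for latency in (200, 300):
    print(latency, instructions_to_hide(latency))
```

Under this assumption, between 50 and 75 independent instructions are needed per access. They do not all have to come from the same thread, since other threads waiting on the same multiprocessor can supply them.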

Shared memory
Nevertheless, it is imperative to avoid these read and write cycles in local or global memory as much as possible. To do so, each multiprocessor has a small dedicated memory (16 KB) called shared memory. It lifts some of the limitations imposed by the parallel processing of threads by enabling communication and interaction between them without going through graphics card memory.


In addition to avoiding the enormous latency of local or global memory, shared memory in this example saves memory bandwidth by reducing accesses by 33%.

This shared memory is only available to the threads of the same block! In other words, more threads per block means less memory per thread, and fewer threads per block means that fewer threads will be able to communicate. Also, it is generally recommended to let each multiprocessor work on several blocks, so that it can process a second block while the first one is paused and avoid wasting resources. This reduces the relative size of shared memory even more: it comes to 8 KB per block in the standard, recommended situation where two blocks reside on each multiprocessor.
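The arithmetic behind this trade-off can be sketched in a few lines of Python. The 16 KB total and the two blocks per multiprocessor come from the text above; the thread counts and function names are hypothetical examples.

```python
SHARED_BYTES_PER_MULTIPROCESSOR = 16 * 1024  # 16 KB, as stated in the article

def shared_bytes_per_block(blocks_per_multiprocessor):
    """Shared memory left for each block when several blocks are resident."""
    return SHARED_BYTES_PER_MULTIPROCESSOR // blocks_per_multiprocessor

def shared_bytes_per_thread(blocks_per_multiprocessor, threads_per_block):
    """Average shared memory available to each thread of a block."""
    return shared_bytes_per_block(blocks_per_multiprocessor) // threads_per_block

print(shared_bytes_per_block(2))  # 8192 bytes, i.e. 8 KB per block
for threads in (64, 128, 256):    # hypothetical block sizes
    print(threads, shared_bytes_per_thread(2, threads))
```

With two blocks per multiprocessor and 256 threads per block, each thread is left with 32 bytes on average, which is only eight FP32 values.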

There are strict rules for the use of this shared memory. To illustrate this, here are a few more details (for the more courageous among you). It is divided into 16 memory banks. In each cycle, it is possible to access each of the 16 banks via 16 internal 32-bit buses (512 bits altogether). As an instruction accessing this memory is processed per warp, that is, per group of 32 threads, there are in fact 32 memory accesses to process in two cycles. The first 16 threads are processed during the first cycle and the next 16 during the second. Two simultaneous accesses to the same memory bank can't be processed in the same cycle, so each of the first 16 (or last 16) threads has to access a different bank, or else several cycles will be required. It’s interesting to note, however, that all threads can read the same address in a bank, in which case the value is broadcast to all of them. This shows how complex it is to use this shared memory if the objective is to maximize performance. It isn't a cache like the ones CPUs use; it is closer, for example, to the local memory of the SPEs of the Cell.
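These rules can be approximated with a small Python model that counts the cycles needed to serve one warp. It is a simplified sketch, not a description of the actual hardware: it assumes that consecutive 32-bit words fall in consecutive banks, and that threads reading the same word are served together. All names are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 16   # from the article
HALF_WARP = 16   # threads served per cycle when there is no conflict
WORD_BYTES = 4   # each bank is 32 bits wide

def half_warp_cycles(byte_addresses):
    """Cycles needed to serve one half-warp under the simplified model."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        # Assumption: consecutive 32-bit words map to consecutive banks.
        words_per_bank[word % NUM_BANKS].add(word)
    # Distinct words in one bank are serialized; identical words are shared.
    return max(len(words) for words in words_per_bank.values())

def warp_cycles(byte_addresses):
    """A 32-thread warp is served as two consecutive half-warps."""
    first = byte_addresses[:HALF_WARP]
    second = byte_addresses[HALF_WARP:]
    return half_warp_cycles(first) + half_warp_cycles(second)

def strided(stride_words, threads=32):
    """Byte addresses where thread t accesses word t * stride_words."""
    return [t * stride_words * WORD_BYTES for t in range(threads)]

for stride in (1, 2, 16):
    print("stride", stride, "->", warp_cycles(strided(stride)), "cycles")
print("same word ->", warp_cycles([0] * 32), "cycles")
```

In this model a stride of one word takes the ideal 2 cycles per warp, a stride of two doubles that to 4, and a stride of 16 sends every thread to the same bank for 32 cycles, while 32 reads of the same word again take only 2.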


Cache memory, registers and constants
The GPU has some cache memory for its texturing units. It can be used, when accesses are well aligned, to read (but not write) data efficiently. This cache memory is 8 KB per multiprocessor.

Each multiprocessor has a certain number (not made public) of general registers, which the threads being processed have to share. The more threads there are, the better the latency of some operations is hidden, but the fewer registers are available to each thread. This parameter has a strong influence on performance, and CUDA makes it possible to control it.

The GeForce 8800 has an additional 64 KB of memory to store constants. This memory is cached, with 8 KB of cache per multiprocessor.


With local and global memory, shared memory, the texturing units' cache, the constant cache and registers, developers get a lot of parameters to play with when optimizing performance.





Copyright © 1997- Hardware.fr SARL. All rights reserved.