Taking advantage of the GeForce 8
Using a GPU as a calculation unit may appear complex. It's not really about dividing the task into a handful of threads, as with a multi-core CPU; rather, it involves thousands of threads.
In other words, trying to use the GPU is pointless if the task isn't massively parallel, and for this reason it is better compared to a supercomputer than to a multi-core CPU. An application run on a supercomputer is necessarily divided into an enormous number of threads, and a GPU can thus be seen as an economical version devoid of its complex structure.
GPUs, especially Nvidia's, keep many of their secrets hidden, and few details are revealed. This could lead developers to feel that they are working blind when trying to write an efficient program for this type of architecture. Although more details would be useful in certain cases, we can't forget that a GPU is designed to maximize the throughput of its units and, consequently, if sufficiently fed, will handle everything efficiently by itself. This is not to say that more details wouldn't make it possible to do better, but rather that by knowing from the start what best feeds a GPU, it's possible to obtain satisfactory results. We therefore can't assume that a GeForce 8800 with 128 calculation units needs only 128 threads to be fully used. Many more are necessary to allow the GPU to maximize its throughput, as it does, for example, when working on thousands of pixels.
To use a GeForce 8 type GPU properly, the program and its data should be structured so as to give the GPU the highest possible number of threads while remaining within the hardware limits, which are:
- threads per SM: 768
- warps per SM: 24
- blocks per SM: 8
- threads per block: 512
- 32 bit registers per SM: 8192
- shared memory per SM: 16 KB
- cached constants per SM: 8 KB
- 1D textures cached per SM: 8 KB
The arrangement of threads into blocks, and of blocks into grids (65536x65536x65536 blocks maximum), is up to the developer. A GeForce 8 class GPU can therefore execute a program of up to 2 million instructions on close to 150 million billion (about 1.4 x 10^17) threads! These, of course, are only the theoretical maximums.
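As a quick check of the order of magnitude, the short Python sketch below multiplies out the grid and block limits exactly as quoted in this article. The constant names are ours, and the limits are taken at face value from the text rather than verified against Nvidia's documentation.

```python
# Limits as quoted in the article (assumed, not checked against vendor docs).
MAX_BLOCKS_PER_GRID_DIM = 65536   # blocks along each grid dimension
GRID_DIMS = 3                     # the article quotes a 3-dimensional grid
MAX_THREADS_PER_BLOCK = 512

# Total blocks in the largest grid, then total threads across all blocks.
max_blocks = MAX_BLOCKS_PER_GRID_DIM ** GRID_DIMS
max_threads = max_blocks * MAX_THREADS_PER_BLOCK

print(f"maximum blocks:  {max_blocks:.3e}")
print(f"maximum threads: {max_threads:.3e}")
```

With these figures the thread count comes out on the order of 10^17.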
Each multiprocessor can hold 768 threads; to fill it to the maximum you could, for example, use 2 blocks of 384 threads (or 2 x 12 warps). Each thread could then use 10 registers (8192 registers shared among 768 threads) and each block could use 8 KB of shared memory (half of the 16 KB available). If more registers are necessary, the number of threads per SM has to be reduced. This can reduce the multiprocessor's potential, since it will have fewer opportunities to maximize the throughput of its calculation units.
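The reasoning above can be sketched as a small calculation: each per-SM limit caps the number of blocks that can be resident at once, and the tightest cap wins. The Python below is a simplified model built only from the limits listed in this article. The names `SMLimits` and `blocks_per_sm` are hypothetical, and the warp size of 32 is inferred from 768 threads making 24 warps. Real hardware may allocate registers and shared memory in coarser units, so treat the result as an upper bound rather than a guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits as listed in the article (GeForce 8)."""
    max_threads: int = 768
    max_warps: int = 24
    max_blocks: int = 8
    max_threads_per_block: int = 512
    registers: int = 8192
    shared_mem_bytes: int = 16 * 1024
    warp_size: int = 32  # assumption: 768 threads / 24 warps


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  lim=SMLimits()):
    """Return how many blocks fit on one SM under this simplified model."""
    if not 1 <= threads_per_block <= lim.max_threads_per_block:
        return 0
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-threads_per_block // lim.warp_size)
    regs_per_block = regs_per_thread * threads_per_block
    caps = [
        lim.max_blocks,
        lim.max_threads // threads_per_block,
        lim.max_warps // warps_per_block,
    ]
    if regs_per_block > 0:
        caps.append(lim.registers // regs_per_block)
    if smem_per_block > 0:
        caps.append(lim.shared_mem_bytes // smem_per_block)
    return min(caps)


# The article's example, then the same kernel with one more register.
for regs in (10, 11):
    n = blocks_per_sm(384, regs, 8 * 1024)
    print(f"{regs} registers/thread: {n} block(s), {n * 384} threads per SM")
```

In this model, going from 10 to 11 registers per thread drops the SM from two resident blocks to one, which halves the number of threads available to hide latency.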
The executed program also has to provide a sufficient number of blocks, because a GeForce 8800 has 16 multiprocessors. In the previous example, which uses 2 blocks of 384 threads per multiprocessor, at least 32 of these blocks will be needed to feed all of the GPU's calculation units. This represents a little over 12,000 threads (32 x 384 = 12,288). To use several GPUs, this number has to be multiplied by the number of GPUs. The best approach, of course, is to plan for a lot more in order to take advantage of future GPUs, which will have more calculation units. Planning on a hundred, or even a thousand, blocks of threads is therefore not a luxury.
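The block count in this example is simple multiplication, sketched below in Python. The function name `min_blocks_to_fill` and its parameters are hypothetical; the figures (16 multiprocessors, 2 resident blocks of 384 threads each) are the ones used in the text.

```python
def min_blocks_to_fill(resident_blocks_per_sm, sm_count=16, gpu_count=1):
    """Smallest grid that gives every SM its full set of resident blocks."""
    return resident_blocks_per_sm * sm_count * gpu_count


THREADS_PER_BLOCK = 384  # from the example in the text

for gpus in (1, 2):
    blocks = min_blocks_to_fill(2, sm_count=16, gpu_count=gpus)
    threads = blocks * THREADS_PER_BLOCK
    print(f"{gpus} GPU(s): {blocks} blocks, {threads} threads")
```

This gives 32 blocks for one GeForce 8800 and 64 for two. It is only a floor: a grid several times larger leaves room for GPUs with more multiprocessors.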
In our opinion, the perceived complexity of using a GPU as a calculation unit comes first and foremost from the difficulty of seeing how a program that isn't easily parallelized would run on it. However, this is the wrong question: it would be a waste of time to try to run something of this kind on a GPU.