Since our first analysis of CUDA, several things have changed. Nvidia has launched a dedicated product line and the API has improved. We had the opportunity to talk with the main people behind this technology and to test what GPUs can do compared to CPUs in a practical application. This is the occasion for a follow-up to our first article on CUDA, which you can find here. Refer to it for the details that were explained thoroughly there and that we won’t go into again.
As a brief reminder, CUDA consists of a software layer intended for stream computing and an extension to the C programming language that lets the programmer mark certain functions to be processed by the GPU instead of the CPU. These functions are compiled by a CUDA-specific compiler so that they can be executed by the numerous calculation units of a GPU in the GeForce 8 class or above. The GPU is thus seen as a massively parallel co-processor that is well suited to highly parallel algorithms and very poorly suited to others.
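As a minimal sketch of what this C extension looks like, the example below marks a function with `__global__` so that it is compiled for the GPU, then launches it from ordinary host code. The names `scale` and `scale_on_gpu`, the block size of 256 and the omission of error checking are our own illustrative assumptions, not code from Nvidia or from our tests.

```cuda
#include <cuda_runtime.h>
#include <vector>

// __global__ marks a function that runs on the GPU and is launched from the CPU.
// Each GPU thread handles one array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard the last, partial block
        data[i] *= factor;
}

// Hypothetical host helper: copy to the GPU, run the kernel, copy back.
// Error checking is omitted for brevity.
std::vector<float> scale_on_gpu(std::vector<float> values, float factor)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return values;
    size_t bytes = n * sizeof(float);

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, values.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block (assumed value)
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    scale<<<blocks, threads>>>(d_data, factor, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(values.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return values;
}
```

The kernel body is written for a single element; the launch configuration between `<<<` and `>>>` tells the hardware how many instances of it to run in parallel.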
An enormous proportion of the GPU is devoted to execution, unlike the CPU.
Unlike a CPU, a GPU devotes a significant portion of its transistors to calculation units and very few to control logic. Another big difference, which we neglected in our previous article (and which the GPU vs. CPU tests here will show), is memory bandwidth. A modern GPU has roughly 100 GB/s available, versus roughly 10 GB/s for a CPU.
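If you want to check the GPU side of this figure on your own card, one possible approach is to time copies within graphics memory. The sketch below is a rough illustration: the helper name `device_copy_bandwidth_gbs`, the buffer size and the repeat count are our own assumptions, and error checking is left out. It measures achieved copy bandwidth, which will typically come in below the theoretical peak.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: estimate GPU memory bandwidth in GB/s by timing
// device-to-device copies. Error checking is omitted for brevity.
double device_copy_bandwidth_gbs(size_t bytes, int repeats)
{
    void *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);  // warm-up copy

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until all timed copies are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);

    // Each copy reads `bytes` and writes `bytes`, so it moves 2 * bytes.
    double seconds = ms / 1000.0;
    return (2.0 * bytes * repeats) / seconds / 1e9;
}
```

A call such as `device_copy_bandwidth_gbs(64 << 20, 10)` times ten copies of a 64 MB buffer; larger buffers and more repeats give steadier numbers.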
An assembly of processors
Another reminder concerns the way Nvidia describes what happens in the GPU. A GeForce 8 is a combination of independent multiprocessors, each equipped with 8 general-purpose processors (called SPs), which always carry out the same operation in the manner of a SIMD unit, and 2 specialized ones (called SFUs). A multiprocessor uses these two types of processors to execute instructions on groups of 32 elements. Each element is called a “thread” (not to be confused with a CPU thread!) and these groups of 32 are called “warps”.
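To make the thread and warp vocabulary concrete, here is a small sketch in which every thread of a block records which warp it belongs to. The kernel name `label_warps`, the helper `warp_of_thread` and the single-block launch are illustrative assumptions on our part; `warpSize` is a built-in CUDA variable, equal to 32 on these GPUs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes the index of the warp it belongs to within its block.
// Threads with consecutive indices are grouped into the same warp.
__global__ void label_warps(int *warp_id)
{
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;   // t % warpSize would give the position inside the warp
}

// Hypothetical host helper: launch one block and return each thread's warp index.
// Error checking is omitted for brevity.
std::vector<int> warp_of_thread(int threads_per_block)
{
    std::vector<int> warps(threads_per_block);
    size_t bytes = threads_per_block * sizeof(int);

    int *d_warp = nullptr;
    cudaMalloc(&d_warp, bytes);
    label_warps<<<1, threads_per_block>>>(d_warp);
    cudaMemcpy(warps.data(), d_warp, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_warp);
    return warps;
}
```

With a block of 64 threads, threads 0 to 31 report warp 0 and threads 32 to 63 report warp 1.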
Diagram of a multiprocessor; the G80 has 16.
The calculation units (SPs and SFUs) run at twice the frequency of the control logic, reaching 1.5 GHz on the GeForce 8800 Ultra. A simple operation takes a single cycle from the point of view of the calculation unit (0.5 cycles as seen from the rest of the multiprocessor), so two multiprocessor cycles, or four calculation-unit cycles, are needed to execute it on an entire warp.
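The two-cycle figure follows from simple arithmetic, assuming each of the 8 SPs processes one thread of the warp per calculation-unit cycle and that the calculation units are clocked at twice the rate of the rest of the multiprocessor:

```latex
\[
\frac{32\ \text{threads per warp}}{8\ \text{SPs}} = 4\ \text{calculation-unit cycles}
\qquad\Longrightarrow\qquad
\frac{4\ \text{calculation-unit cycles}}{2} = 2\ \text{multiprocessor cycles}
\]
```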
A program, called a “kernel”, is executed on a multiprocessor in blocks of warps; a block can contain up to 16 warps, the equivalent of 512 threads. Threads of the same block can communicate with each other via shared memory.
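As an illustration of threads cooperating through shared memory, the sketch below sums an array block by block: each block of 512 threads loads its values into shared memory, then combines them in a tree. This is one common pattern rather than the only way to do it; the names `block_sum`, `sum_on_gpu` and `BLOCK_THREADS` are our own, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_THREADS 512   // 16 warps of 32 threads, the block limit cited above

// Each block sums up to BLOCK_THREADS input values through shared memory
// and writes one partial sum per block.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK_THREADS];   // visible to all threads of the block

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    cache[t] = (i < n) ? in[i] : 0.0f;       // pad the last block with zeros
    __syncthreads();                         // wait until every thread has written

    // Tree reduction: halve the number of active threads at each step.
    // Assumes blockDim.x is a power of two, which holds for 512.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            cache[t] += cache[t + stride];
        __syncthreads();
    }

    if (t == 0)
        out[blockIdx.x] = cache[0];
}

// Hypothetical host helper: returns the sum of `values`.
float sum_on_gpu(const std::vector<float> &values)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK_THREADS - 1) / BLOCK_THREADS;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK_THREADS>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;      // finish the sum on the CPU
    return total;
}
```

Threads in different blocks cannot see each other's shared memory, which is why the per-block partial sums are combined on the CPU in this sketch.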