Nvidia CUDA: preview - BeHardware
>> Graphics cards

Written by Damien Triolet

Published on March 2, 2007

URL: http://www.behardware.com/art/lire/659/


Page 1

CPU/GPU, BrookGPU

CUDAGraphic processors or GPUs have evolved much in the past few years. Today, they are capable of calculating things other than pixels in video games, however, it's important to know how to use them efficiently for other tasks. If AMD has been the first to present a concrete solution to this problem, Nvidia is the first to make this solution available. We'll have a look at it in this article.

(we accept no liability for a severe headache following the reading of this article)


CPU and GPU : the differences
During the last couple of years, GPU calculation power has improved exponentially and much faster than that of the CPU. However, this doesn't mean that GPUs have evolved faster. These two components face different challenges and for this reason they have evolved in different directions.


For simplification, a CPU is expected to process a task as fast as possible whereas a GPU must be capable of processing a maximum of tasks, or to be more accurate, one task for a maximum of data in a minimum period of time. Of course, a GPU also has to be fast and a CPU must be able to process several tasks, but up to this date the development of their respective architectures has shown the above priority. This has meant multiplying processing units for GPUs, and for CPUs, making control units more complex and increasing embedded cache memory.


An enormous part of the GPU is dedicated to execution, unlike the CPU

The CPU is capable of quickly processing all sorts of tasks whereas the GPU is capable of processing very quickly a certain type of task. For the latter, this has to be in the form of a problem composed of independent elements, because of the massive parallelization of GPUs. This is, in fact, a similar problem to the one faced by CPUs, whose calculating power partly relies on vectorial units (SSE etc.). If, however, the Core 2 Duo can be seen as composed of 8 units, the GeForce 8800 has 128! This is an entirely different level and requires a different perspective when working on exploiting this computing power.

BrookGPU: the early beginnings
The idea of using a GPU as an additional calculation unit isn't new and started with the GeForce FX, the first GPUs that supported simple precision floating point calculations (FP32). A bit more than 3 years ago, when the first official publications on BrookGPU (a programming language intended to facilitate the access to GPU calculation power) were disclosed, we wrote (translated from our French website):

"...NVIDIA and ATI's engineers still have some work to do before it's really usable. Using a current GPU for general calculation is a bit like trying to use the power provided by a potato to light a lamp. Nevertheless, as the evolution of GPUs is quite spectacular, it's not too soon to start working on this technology. Perhaps for something actually usable for the release of the NV50 and R500? We can even imagine that ATI and NVIDIA will start selling their chips on different markets…"

These predictions were quite close to reality. ATI/AMD introduced CTM, a low level API that uses the calculation core of the Radeon X1000 (R5xx) and a line of products devoted to this type of usages and NVIDIA has introduced CUDA, a programming language close to C, which exploits the core of the GeForce 8800 (the G80, the NV50’s new code name).

Is using a GPU as a calculation unit still utopian? The answer is 'no' even if we are still at an early stage for this type of use. Improvements have been numerous on the hardware and software side. In the past, we wrote several times about CTM without being able to give too many details as it wasn't public. Nvidia, however, announced a beta version of CUDA a little while ago and this gives us the occasion to take a deeper look at this language.


Page 2
GeForce 8800

The GeForce 8800 in detail
Even, if Nvidia has chosen, unlike AMD, not to have a low level language accompanied by a detailed documentation of its hardware, a great knowledge of the GPU is required for CUDA. Consequently, the GPU is described by NVIDIA in less marketing oriented language and this gives us an opportunity to learn a little bit more about this GPU.

Roughly, the GeForce 8800 has been described as a GPU equipped with 128 scalar processors divided into 8 groups of 16 and which work at very high frequencies: 1,350MHz. These groups process sets of 32 pixels or 16 vertices.

Actually, each of these 8 groups contains two sets of 8 scalar processors, which make the GeForce 8800 GTX a chip equipped with 16 groups of 8 processors. Nvidia calls these groups, “multiprocessors”. The fact that multiprocessors are made of 8 processors and not 16 doesn't really have an impact on performances. This is more an implementation detail that is intended to facilitate the functioning of calculation units at very high frequencies. The counterpart is that it requires more transistors.


The GeForce 8 consists of a group of multiprocessors, which represents a SIMD unit consisting of a certain number of processors.

In CUDA documentation, the multiprocessors aren't described as running at 1,350 MHz but at 675 MHz with "double pumped" execution units, which means running at doubled frequencies like the ALU of the Pentium 4. These multiprocessors process blocks of 64 to 512 elements called threads and spread out into sub-groups of 32 threads called, “warps”. Two cycles are required (4 x 0.5 cycles because of the "double pumped" mode) to process a common floating point instruction on a warp. Outputting one instruction every 2 cycles is easier than every cycle. This explains the choice of using two multiprocessors based on 8 processors per group instead of the single one based on 16 as we could have assumed when reading the initial GeForce 8800 documentation.

As we said in the article on the GeForce 8800, it also has calculation units dedicated to more complex instructions (exp, log, sin, cos, rcp, rsq). Two of these units are included into each multiprocessor in addition to the 8 processors, which process common instructions. Special instructions are four times slower and 8 cycles are required to execute them for entire warps. You should note that, unlike with pixels and vertex shaders, sin, cos and exp instructions are 2 times slower than the other three instructions and require 16 cycles to be executed on the 32 threads of a warp. The reason for this is probably that in the case of 3D rendering, instructions are executed with less precision but faster. Nvidia makes it clear that most instructions can be executed faster in a less precise mode (with a tag per instructions or a compiler command).

Integer multiplications are also processed by these two units and require 8 cycles. A lower precision of this instruction (24 bit instead of 32 bit) can be executed with the 8 standard processors in 2 cycles from the multiprocessor point of view.

In short, the GeForce 8800 GTX can be seen as a big calculation unit divided into 16 multiprocessors processing warps of 32 threads via 8 general processors and two specialized ones. These 16 multiprocessors clocked at 675 MHz can process together one common instruction every two cycles and 512 threads or a rate of 256 operations per cycle (512 floating point calculations in the case of FMAD/FMAC, which represents a multiplication and an addition). 64 special operations also need to be added to this figure. A Core 2 Duo as seen from the SSE unit point of view, with its two cores, is able to process 16 operations per cycle (8 additions and 8 multiplications). It runs, however, at much higher frequencies than 675 MHz but does not process FMAD/FMAC operations at full speed, because it needs two units to process an operation of this type.

The following table represents the calculation power of the GeForce 8800 and of two Intel Core 2 Extreme processors in four different situations: floating point multiplication, floating point addition, half of each (the best case for the Core 2) and floating point addition-multiplication (the best case for GPUs because all units support this instruction).


The GeForce 8800 clearly has higher calculation power than the Core 2, including the quadcore. Nevertheless, the gap isn't always as 'big' as we could have imagined. This means that it is really important to efficiently use a GeForce 8800 to surpass a quadcore. We also have to keep in mind that the GeForce 8800 can process more as well as more complex operations are execvuted relatively fast and is also able to use texture filtering units to accelerate some operations. If an algorithm is able to exploit the additional capabilities of the GPU, performances might explode when compared to CPUs.


Page 3
Precision, memories

Precision
For general calculation, the precision and detailed behavior of math units have to be communicated and conform to IEEE standards. The GeForce 8800, like other current GPUs, isn't completely in accordance with IEEE standards, because it doesn't support denormalised numbers and has a lower precision for some operations. Nvidia provides detailed information about the behavior of calculation units and this makes possible to know when it strays from the CPU:


Units are restricted to simple precision (FP32) but it’s likely that the next generation will support double precision (FP64) as CPUs do.


Local memory
The processors of the GeForce 8800 support gathering and scattering. This means they are capable of reading and writing anywhere in local memory (on the graphic card) or elsewhere (other parts of the system).


These memories, however, are not cached, and the cost of the latency of reading/writing cycles for the GeForce 8800 oscillates between 200 and 300 cycles! This latency can be masked by numerous mathematical instructions if they do not depend on a read.

Shared memory
Nevertheless, it is imperative to avoid these read and write cycles in local or global memory as much as possible. To do so, each multiprocessor has a small dedicated memory (16KB) called shared memory. It breaks some of the limitations imposed by the parallel processing of threads by enabling communication and interaction between them without using graphic card memory.


In addition to avoiding the enormous latency of local or global memory, in this example shared memory allows to save memory bandwidth by reducing accesses by 33%.

This shared memory is only available for the elements of a same block! In other words, more threads per block means less memory per thread and less threads per blocks means that less threads will be able to communicate. Also, it is generally recommended to allow each multiprocessor to work on several blocks while the first one is paused in order to process a second block and avoid wasting resources. This reduces even more the relative size of shared memory. This will be 8 KB per block for the standard and prescribed situations where two blocks are in each multiprocessor.

There are strict rules for the utilization of this shared memory. To illustrate this fact, here are a few more details (for the more courageous of you). It is divided into 16 memory banks. Within each cycle, it is possible to access each of the 16 banks via 16 internal buses of 32 bits (or 512 bits altogether). As an access instruction to this memory is processed by warps or by groups of 32 threads, these are in fact 32 memory accesses in two cycles that have to be processed. The first 16 threads will be processed during the first cycle and the next 16 in the second cycle. Two simultaneous accesses to the same memory bank can't be processed in the same cycle. Each of the first 16 (or last) threads will have to access a different bank or else several cycles will be required. It’s interesting to note that all threads can access the same bank. This shows the complexity of the utilization of this shared memory if the objective is to maximize performances. This isn't cache memory like CPU use and it is closer, for example, to the local memory of the SPEs of the Cell.


Cache memory, registers and constants
The GPU has some cache memory for texturing units. They can be employed, when accesses are lined up, to efficiently read (but not write) data. The cache memory is 8 KB per multiprocessor.

Each multiprocessor has a certain amount (not made public) of general registers, which the threads in process have to share. The more threads there are, the better the latency of some of the operations is hidden and, however, the lesser registers are available. This is an important parameter if you want to have a strong influence on performances and CUDA makes it possible to control this.

The GeForce 8800 has an additional 64 KB memory to store constants. This memory is cached with 8 KB per multiprocessor.


With local and global memory, shared memory, cache memory of texturing units, cache memory of constants and registers, developer get lot of parameter to play with when working on optimizing performances.


Page 4
CUDA's API

CUDA's API
CUDA, or Compute Unified Device Architecture, is the architecture that allows the exploitation of GeForce 8 GPU calculation power by allowing it to process kernels (programs) on a certain amount of threads. If CUDA also partly includes the GPU, since it has more and more optimizations to facilitate non graphic calculations, in practice it mainly concerns software. CUDA is in consequence a driver, a runtime, libraries (an implementation of BLAS amongst other things), an API based on an extension of the C programming language and an accompanying compiler (redirecting the part not executed by the GPU to the system’s default compiler).


CUDA is a high level API, meaning that it globally disregards hardware even if taking into account the specifications is required to provide high performances. AMD, however, with CTM has a low level API. This roughly means that it is easier to program with CUDA whereas it is easier to fully optimize the code with CTM.

The CUDA driver acts as an intermediate element between the compiled code and GPU. The CUDA runtime is an intermediate between the developer and driver, facilitating programming by masking some of the details. With CUDA it’s either possible to use the API runtime or directly access the API driver. It is possible to see the API runtime as a high level language and the API driver as an intermediate between high and low level, allowing a manual and deeper optimization of the code. In the opposite direction, AMD gives the possibility of writing kernels in HLSL instead of machine language to facilitate programming. While the both stick to their initial choices, Nvidia and AMD try to go a little bith in the opposite way.

For this first look at CUDA, we focused on API runtime. The driver mode, however, isn't that different and it only has more options and less automation.

This particular API consists of a couple of extensions of C language, a component intended for the system that makes it possible to control the GPU(s), another that runs with the GPU, and a common component that includes the types of vectors and a group of functions of the standard C library, which can be executed on the system as well as the GPU.

Without going into all the details on the added extensions, we are going to give you the main ones that allow the understanding of the functioning of CUDA. The first point is a set of functions that will specify on which component they are intended to be executed; the CPU or GPU. A kernel or function requested by the CPU and executed by the GPU will be referenced by __GLOBAL__, a function used in a kernel will be referenced by __DEVICE__ and a standard function by __HOST__. It isn't obligatory to mention the latter since it represents standard behavior.

The second point is how a kernel is named. Here is the procedure for a classic function:
Function(parameter);
A kernel is named slightly differently:
Function<<< blocks, threads, memory >>>(parameters);
Blocks represent the number of blocks of threads to process. Threads represent the number of threads per block, and memory an optional memory space dynamically allocated in shared memory. Blocks * threads represent the total number of threads that will be processed by the kernel.

Next, a set of integrated variables makes it possible to identify the thread in the middle of this mix of blocks. A set of functions is dedicated to control the GPU, allocate memory areas, recover details on the GPU(s) present in the system, select the one on which it will be executed, etc.

Finally, a group of mathematical functions supported by the GPU and a function to synchronize threads within a block (__synchthreads() ). It breaks the execution of a kernel in a multiprocessor as long as all threads haven't reached this state in order to avoid the problems of reading after writing. (It is important to make sure that the right information has been written before it’s read).

These extensions control the GPU and anyone with a good knowledge of C will be able to manipulate them easily. To properly use the GPU it is imperative to spread the work load in grids of blocks, whose size has to be adapted on a case by case basis to maximize the utilization of calculation units.



API and 3D interoperability
CUDA has a certain number of functions in order to have an interoperability with 3D API via buffer objects in OpenGL and the vertex buffers in Direct3D. CUDA can be used to process data that will be directly exploited by 3D rendering. For example, it’s possible to process physics with CUDA and inject these results for rendering.


Page 5
In practice

In practice
We played with CUDA for a couple of weeks so that we can now have an idea about what it can do.

Let's start with what it can't do: simply use an SLI system to double calculation power. Each GPU is perceived as being independent and a kernel is executed on a single GPU. A different kernel needs to be launched on each GPU to benefit from multi-GPU systems and this complicates the proper exploitation of the whole calculation power. Also, kernel execution is synchronous. This means that once the CPU has requested the execution of the kernel by the GPU, the thread and core that will execute it will be blocked until the GPU has finished working. CPU power can be easily wasted in waiting instead of being used as a complement to the GPU. This is something that Nvidia will have to improve in the future or else the same number of CPU cores (which won't be used!) and GPUs will be required by the system.

We thought about comparing several algorithms on GPUs and CPUs in order to measure performance gaps, but we quickly changed our mind for several reasons. The main reason was that we can’t claim to be able to develop a function that will be as efficient on one side as it is on the other. In other words, if the GPU is faster, will it be because it is more efficient or because the same function was less optimized for the CPU and vice versa? Also, it is easy to find an example that will be much faster for a CPU and another one that will be faster for the GPU. So, unless we spent weeks to develop relatively objective tests (unfortunately, we do not have that time) it is very difficult to objectively compare GPU and CPU performances. Nevertheless, we decided to give you two graphs of performances. While they include the GPU and CPU they aren't intended for direct comparison, but more to show how performances may vary with the modification of a particular parameter.

It is important to keep in mind that this is a beta version of CUDA and performances will logically improve with the newest revisions.

The first parameter chosen to be subject to variations was the number of blocks. The total number of threads or elements to be processed is identical but they are regrouped in one big block or in several smaller blocks. In the case of the CPU, each block can be perceived as one thread and be executed with a different core. The kernel consists in executing a series of operations on data and write results in the table.


With a simple core CPU and whatever the organization, performances are identical. A quad core CPU would, however, make it possible to process this type of kernel four times faster with 4+ blocks. In the case of the GeForce 8800 GTX, at least 16 blocks are needed for the 16 multiprocessors to be exploited and 32 to be exploited efficiently. This requires more work for programming but the performance gain is consequent.

The second test consisted of increasing the complexity of the kernel (or the number of operations). The number of blocks was fixed at 32.


If calculation time increases linearly with the CPU, it isn't the case for the GPU below a certain complexity. This indicates that the management cost is quite high and needs to be absorbed by complex operations. It isn't enough to process a large amount of data, and the process has to be sufficiently complex too for it to be really worthwhile.


Page 6
Conclusion

Conclusion
This first look at CUDA was relatively enjoyable. We say this because we were surprised by how easy access was to the GeForce 8800 calculation units. Of course, this judgment has to be put into perspective because it is something to run a couple of functions on a GPU and it is something else to do it efficiently.

We decided to adapt (almost completely) a basic version of Pacman written in C for the GPU. A couple of hours were required to have a functional code (even if it had a rather average performance and was approximate). The objective was to observe how easily it was possible to start working with CUDA. This was a success and CUDA isn't only reserved to Nvidia's engineers and a couple of researchers unlike AMD's current solution. Even if we do not have access to CTM, available documentation made us quickly realize how difficult it would be to adapt the same code. This is a something we would have quickly given up on even if we would have had access to the CTM.


This said, fully exploiting a GPU such as the GeForce 8800 with CUDA isn't easy, and is far from it. Deeply optimizing the code is complex whatever the method. From a certain level of efficiency, it is likely that CTM takes the lead, but probably at the expense of a much longer programming time.

If, in the beginning, our preference went to AMD's approach, which gives the possibility to get to know the GPU in detail and exploit it more efficiently (in theory at least), as our analysis of CUDA and CTM went by, we changed our opinion. This was because of the complexity of the CTM. With identical programming time, CUDA will certainly provide better results; and another important reason is that we realized that the gap between the calculation power of GPUs and CPUs isn't as gigantic as we could have thought. Don't get us wrong, G80 is a very powerful piece of silicon, but CPUs have also progressed consequently with the successive releases of dual core and then quad core. G80's calculation power is much higher, this is a fact, but it isn't a factor of 20 or 100. Ten times the power is barely reached and this doesn't even take into account the lesser efficiency in practice. Of course, in certain situations, the GPU might increase the gap, but overall we believe it doesn't provide enough extra performance room and efficiency yet to justify its disadvantages.

That being said, this first implementation of Nvidia's new architecture appears to us to be a great start for the development of practical applications. GeForce 8800 with CUDA is currently the best GPGPU solution on the market. We think that if GPUs make their way as calculation units in different domains (besides a couple of exceptions), future GPUs will be used as final products. Development takes time and developers need to start working on it now. Unlike CTM, the strength of CUDA is not to be specific to a single GPU and to allow the development of applications that will be compatible with the GPUs of the next generations. Nvidia recommends that developers aim for the process of 100 or 1000 blocks instead of the 32 for the GeForce 8800 GTX in order to directly benefit from future generations. This gives us an idea of future evolutions…


Of course, nothing is fully black or white and CUDA also partly relies on GPU specific details. It actually reports these details and the GeForce 8800 represents the 1.0 computing resources. Future GPUs in version 1.1, for example, could be able to support double precision floating point calculations (FP64) or different memories and this could seriously affect the method of optimizing the code. Nevertheless, it will remain functional, via recompilation if necessary.

From our point of view, the success of the utilization of GPUs as calculation units will rely first of all on their ability to continue to evolve faster than CPUs and to create a market before the number one on the processor market, Intel, releases massive multicore pieces of silicon. After that, it will probably be too late.

For simple users, like us, all of this will not be of great importance at least in the short term. There will be plenty of time before the release of relevant applications. As for the possibility to use CUDA to process physics effects in games, we do not think this will happen anytime soon. Developers already have a lot of work to do to exploit multicore CPUs.


For more information:

CUDA : http://developer.nvidia.com/object/cuda.html
CTM : http://ati.amd.com/companyinfo/researcher/Documents.html


Copyright © 1997-2010 BeHardware. All rights reserved.