Nvidia CUDA : practical uses - BeHardware

Written by Damien Triolet

Published on August 9, 2007

URL: http://www.behardware.com/art/lire/678/

Page 1

Introduction, reminder

Since our first analysis of CUDA, various elements have evolved. Nvidia has launched a dedicated product line and the API has improved. We had the opportunity to talk with the main people involved with this technology and were able to test what GPUs are capable of compared to CPUs in a practical application. This is the occasion to follow up on our first article on CUDA. You can refer to it for the details, which were explained quite thoroughly there and which we won't go into again.

We will simply remind you that CUDA consists of a software layer intended for stream computing and an extension of the C programming language, which allows certain functions to be marked for processing by the GPU instead of the CPU. These functions are compiled by a CUDA-specific compiler so that they can be executed by the numerous calculation units of a GPU of the GeForce 8 class or above. The GPU is thus seen as a massively parallel co-processor that is well suited to highly parallel algorithms and very poorly suited to others.

An enormous proportion of the GPU is devoted to execution, contrary to the CPU

Unlike a CPU, a GPU devotes a significant portion of its transistors to calculation units and very few to control logic. Another big difference, which we neglected too much in our previous article (as the GPU vs. CPU tests here will show), is memory bandwidth. A modern GPU has roughly 100 GB/s available versus roughly 10 GB/s for a CPU.
An assembly of processors
Another reminder concerns the way Nvidia describes what happens in the GPU. A GeForce 8 is a combination of independent multi-processors, each equipped with 8 general-purpose processors (called SPs), which always carry out the same operation together in the manner of a SIMD unit, and 2 specialized ones (called SFUs). A multi-processor uses these two types of processors to execute instructions on groups of 32 elements. Each element is called a "thread" (not to be confused with a CPU thread!) and these groups of 32 are called "warps".

Schema of a multi-processor, the G80 has 16.

The calculation units (SP and SFU) run at twice the frequency of the control logic, reaching 1.5 GHz on the GeForce 8800 Ultra. A simple operation that takes a single cycle from the calculation units' point of view (0.5 cycles as seen from the rest of the multiprocessor) therefore needs two multiprocessor cycles to be executed on an entire warp, since the 32 threads pass through the 8 SPs in four fast cycles.
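The warp timing can be checked with a few lines of arithmetic. This is only a sketch in Python, using the figures quoted in this article (32 threads per warp, 8 SPs per multiprocessor, calculation units clocked at twice the rate of the rest of the multiprocessor); the function name is ours and purely illustrative.

```python
WARP_SIZE = 32               # threads per warp
SPS_PER_MULTIPROCESSOR = 8   # general-purpose processors per multiprocessor
CLOCK_RATIO = 2              # calculation-unit clock / control-logic clock


def cycles_per_warp_instruction(sp_cycles_per_operation=1):
    """Return (fast cycles, multiprocessor cycles) to run one instruction on a warp."""
    # The 8 SPs process the 32 threads of a warp in successive groups of 8.
    fast_cycles = (WARP_SIZE // SPS_PER_MULTIPROCESSOR) * sp_cycles_per_operation
    # The rest of the multiprocessor runs at half the SP frequency.
    multiprocessor_cycles = fast_cycles / CLOCK_RATIO
    return fast_cycles, multiprocessor_cycles


fast, slow = cycles_per_warp_instruction()
print(fast, slow)  # 4 2.0
```

Four cycles at the fast clock correspond to two cycles as seen from the rest of the multiprocessor.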

A program, called a "kernel", is executed on a multiprocessor in blocks of warps. A block can contain up to 16 warps, the equivalent of 512 threads. The threads of the same block can communicate with each other via shared memory.

Page 2
Taking advantage of the GeForce 8

Using a GPU as a calculation unit may appear complex. It's not really about dividing the task into a handful of threads, as with a multicore CPU; rather, it involves thousands of threads.

In other words, trying to use the GPU is pointless if the task isn't massively parallel, and for this reason it can be compared to a supercomputer rather than a multi-core CPU. An application destined for a supercomputer is necessarily divided into an enormous number of threads, and a GPU can thus be seen as an economical version devoid of the supercomputer's complex structure.

The GPU, especially Nvidia's, keeps a great many secrets and few details are revealed. This could lead developers to feel that they are working blind when trying to develop an efficient program for this type of architecture. Although more details would be useful in certain cases, we can't forget that a GPU is designed to maximize the throughput of its units and consequently, if sufficiently fed, will handle everything efficiently by itself. This is not to say that more details wouldn't make it possible to do better, but rather that by knowing from the start what best feeds a GPU, it's possible to obtain satisfactory results. It would therefore be wrong to think that a GeForce 8800 with 128 calculation units needs only 128 threads to be fully used. Many more are necessary to allow the GPU to maximize its throughput, as it does, for example, when working on thousands of pixels.

When we want to properly use a GeForce 8 type GPU, its program and data should be structured in a way to give the GPU the highest possible number of threads while remaining within hardware limits, which are:

  • threads per SM: 768
  • warps per SM: 24
  • blocks per SM: 8
  • threads per block: 512
  • 32 bit registers per SM: 8192
  • shared memory per SM: 16 KB
  • cached constants per SM: 8 KB
  • 1D textures cached per SM: 8 KB

The arrangement of threads in blocks and of blocks into grids (65536x65536x65536 blocks maximum) is up to the developer. A GeForce 8 class GPU can therefore execute a program of up to 2 million instructions on close to 150 million billion (about 1.5 x 10^17) threads! These of course are only the maximums.
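As a quick check of the order of magnitude, the maximum thread count follows directly from the grid and block limits quoted in this article. The sketch below is plain Python arithmetic on those quoted figures, not a query of any real hardware.

```python
MAX_GRID_DIMENSION = 65536    # blocks per grid dimension, as quoted in the article
MAX_THREADS_PER_BLOCK = 512   # hardware limit per block

max_blocks = MAX_GRID_DIMENSION ** 3
max_threads = max_blocks * MAX_THREADS_PER_BLOCK

print(max_blocks)   # 281474976710656
print(max_threads)  # 144115188075855872
```

The result is about 1.44 x 10^17 threads, a purely theoretical ceiling.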

Each multi-processor can hold 768 threads, so to fill it to the maximum you would, for example, use 2 blocks of 384 threads (or 2 x 12 warps). Each thread could then use 10 registers and each block 8 KB of shared memory. If more registers are necessary, the number of threads per SM has to be reduced. This could reduce the multiprocessor's potential, since it will have fewer opportunities to maximize the throughput of its calculation units.
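The trade-off between registers and resident threads can be sketched with the per-SM limits listed earlier. The helper below is hypothetical and deliberately simplified: it assumes registers and shared memory are shared out exactly, whereas real hardware may allocate them in coarser units.

```python
# Per-multiprocessor (SM) limits quoted in this article.
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_BLOCK = 512
REGISTERS_PER_SM = 8192
SHARED_MEMORY_PER_SM = 16 * 1024  # bytes


def resident_blocks(threads_per_block, registers_per_thread, shared_bytes_per_block):
    """Blocks that fit on one SM at once (hypothetical helper, simplified model)."""
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return 0
    limits = [
        MAX_BLOCKS_PER_SM,
        MAX_THREADS_PER_SM // threads_per_block,
        REGISTERS_PER_SM // (registers_per_thread * threads_per_block),
    ]
    if shared_bytes_per_block > 0:
        limits.append(SHARED_MEMORY_PER_SM // shared_bytes_per_block)
    # The scarcest resource decides how many blocks are resident.
    return min(limits)


for registers in (10, 16):
    blocks = resident_blocks(384, registers, 8 * 1024)
    print(registers, "registers per thread:", blocks, "blocks,", blocks * 384, "threads per SM")
```

With 10 registers per thread, two blocks of 384 threads fit and the SM holds its full 768 threads. At 16 registers per thread only one block fits, and the SM is left with 384 threads.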

The executed program also has to consist of a sufficient number of blocks, because a GeForce 8800 has 16 multiprocessors. In the previous example, which uses 2 blocks of 384 threads per multi-processor, at least 32 of these blocks are needed to feed all of the GPU's calculation units. This represents more than 12,000 threads (32 x 384 = 12,288). To use several GPUs, this number has to be multiplied by the number of GPUs. The best approach, of course, is to plan for a lot more in order to take advantage of future GPUs, which will have more calculation units. Planning for a hundred, or even a thousand, blocks of threads is therefore not a luxury.
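The minimum amount of work needed to occupy every multiprocessor is a simple product. The sketch below uses the 16 multiprocessors of a GeForce 8800 and the example of 2 blocks of 384 threads per multiprocessor; the function name is ours.

```python
MULTIPROCESSORS_PER_GPU = 16  # GeForce 8800 (G80)


def work_to_fill(blocks_per_multiprocessor, threads_per_block, gpu_count=1):
    """Return (blocks, threads) needed to keep every multiprocessor fully loaded."""
    blocks = MULTIPROCESSORS_PER_GPU * blocks_per_multiprocessor * gpu_count
    return blocks, blocks * threads_per_block


print(work_to_fill(2, 384))               # (32, 12288)
print(work_to_fill(2, 384, gpu_count=2))  # (64, 24576)
```

This is only the minimum for today's hardware. A program split into hundreds or thousands of blocks leaves room for GPUs with more multiprocessors.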

In our opinion, the perceived complexity of using a GPU as a calculation unit comes first and foremost from the fact that we have trouble seeing how a program that isn't easily parallelized will run on it. However, this is the wrong question. It would be a waste of time to try and run something of this kind on a GPU.

Page 3
CUDA evolves

It's now clear that CUDA wasn't simply a marketing ploy by Nvidia, or a way to see whether the market was ready for such a thing. Rather, it's a long term strategy based on the feeling that an accelerator market is starting to form and is expected to grow quickly in the coming years.

CUDA's team is therefore working hard to evolve the language, improve the compiler, make its use more flexible, etc. Since the 0.8 beta version released in February, versions 0.9 beta and finally 1.0 have made CUDA a viable way of using GPUs as coprocessors. More flexibility and robustness were necessary, although version 0.8 was already very promising. These regular updates have also built up the confidence from which CUDA is starting to benefit.

Two main evolutions stand out. The first is the asynchronous functioning of CUDA. As we explained in our previous article, version 0.8 suffered from a major limitation: once the CPU had sent work to the GPU, it was blocked until the GPU sent the results back. The CPU and GPU therefore couldn't work at the same time, which was a big brake on performance. Another problem arose when a system was equipped with several GPUs. A CPU core per GPU was needed, which isn't very efficient in practice.

Nvidia of course knew this, and the synchronous functioning of the first CUDA versions was probably chosen to facilitate the rapid release of a functional version without dwelling on the more delicate details. With CUDA 0.9 and 1.0, this problem disappeared and the CPU is free once it has sent the program to be executed to the GPU (except when access to textures is used). When several GPUs are used, it is however still necessary to create a CPU thread per GPU, because CUDA does not allow two GPUs to be driven from the same thread. This is not a big problem. Note that there is a function that can force synchronous functioning if necessary.

The second main innovation on the functional level is the appearance of atomic functions, which read data in memory, use it in an operation and write the result back without any other access to this memory location until the operation is fully completed. This avoids (or at least reduces) certain common problems, such as a thread reading a value without knowing whether it has already been modified.

Finally, with CUDA 1.0, Nvidia distributes documentation for PTX (Parallel Thread Execution), an intermediate assembly language between the high level code and the code that is sent to the GPU. PTX was already in use and developers could access it, but it wasn't documented yet. This is probably because the behavior of the different compilation levels was not yet clearly defined. PTX can be used to optimize certain algorithms or libraries, or quite simply to debug code.

Page 4
The Tesla line

After the GeForce line intended for the general public and gamers and the Quadro line designed for graphics professionals, the Tesla line targets the market for calculation power.

At first, Nvidia announced three products. The first, the Tesla C870, is a kind of GeForce 8800 GTX without video connections and therefore intended solely as an accelerator. This card moreover is equipped with 1.5 GB of video memory instead of 768 MB. Its price was set at $1299, which is reasonable because a Quadro FX 5600 also equipped with a memory of 1.5 GB costs $2999. The TDP is 170W.

The second element of the line is the Tesla D870, which takes on the concept of the Quadro Plex. There are two Tesla C870 cards in an external casing that is connected to the PC via a special PCI express card and specific cable. The TDP increases to 350W and the price jumps to $7500, which however is still a good deal compared to the Quadro Plex of equivalent Quadros offered at $17500. Two casings can fit into a bay and then together occupy 3U.

Finally, the third product in this line is a 1U rack, the Tesla S870, equipped with no less than four Tesla C870s, or four G80s and a total of 6 GB of video memory. The rack is connected to a main system in PCI Express and is already prepared for PCI Express 2.0 to boost transfers between the CPU(s) and GPUs. The TDP is 800W although Nvidia says that in practice power consumption is around 550W. This 1U rack is available for a price of $12000.

The three products are announced with availability for this month, and Nvidia also adds that a 1U rack equipped with 8 GPUs is in preparation.

As for the long term strategy of CUDA, Nvidia assures us that it will be offered on all product lines including Quadro and GeForce, and won't just be reserved for the Tesla. CUDA should moreover soon be an integral part of the general public drivers. However, in the future certain functions, or those of future GPUs, could be limited to the Tesla. This will notably be the case for 64 bit precision floating point calculation, which will be introduced with the G92 and reserved for Tesla (and for some high end Quadros).

Page 5
And AMD ?

What about AMD ?
If you have followed the progress of stream computing, you will know that AMD was the first to mention it, announcing low level (machine language) access to its GPUs when launching the Radeon X1800 in October 2005. The details of this access, called DPVM for Data Parallel Virtual Machine and later renamed CTM for Close To Metal, were only given almost a year later, in August 2006.

A few months afterwards, during the finalization of the buyout by AMD, ATI's teams presented a few more practical applications, which served well in fueling forum discussions about the merger. These presentations were complemented by the release of a version of the Folding@home project accelerated via Direct3D on the X1900. However, CTM wasn't available yet, and although ATI informed us then that a CTM version of Folding@home would soon arrive, we still haven't seen anything.

In mid-November 2006, AMD launched the first product specific to this market, the Stream Processor, which is a Radeon X1900 stripped of its video connections. Contrary to what we had been told (that CTM would concern all general public graphics cards), the CTM driver was only delivered to users of Stream Processor cards (if there are any) who were in direct contact with AMD developers, because it could not be found on the manufacturer's website. In the end this dampened our enthusiasm and even bothered us, because apart from the systematic hype accompanying the release of general public cards (or used to justify the buyout by AMD), we didn't see anything very concrete.
Something new with R6xx GPUs?
With the release of the Radeon HD 2000, AMD came back to this subject by presenting a series of evolutions. Low level CTM access was complemented by AMD Runtime, which in a way is the equivalent of CUDA's runtime and is therefore a higher level access. The difference is that AMD Runtime could use multi-core CPUs as well as one or several GPUs. Next, AMD's library of mathematical functions, ACML, optimized for its CPUs, gained GPU equivalents. And finally, AMD offered extensions of the C and C++ languages to drive everything... as Nvidia has done with CUDA.

AMD seems to have followed Nvidia by moving to a higher level mode of use. However, AMD isn't shy with its criticism of CUDA, which it describes as a useless solution: too complex for most developers and too far removed from the specifics of GPUs to allow the development of efficient libraries. This criticism of CUDA isn't totally without foundation, and we could suppose that it incited Nvidia to document PTX.

With CUDA, Nvidia chose to offer something usable quickly and to provide further optimizations later. AMD, meanwhile, first went with a very complex low level language before proposing more, or at least before offering marketing documents saying that the manufacturer would offer more, which we still haven't seen! After more than two years with nothing concrete, we will wait for something more tangible before going any further, and it is for this reason that we have presented these innovations briefly and somewhat skeptically.

We finish this chapter devoted to AMD by mentioning the Radeon HD 2000 architecture, which has some interesting advances for use as a calculation unit. First, the Thread Generator is capable of generating threads optimized either for rapid processing (low latency) or for maximum GPU throughput, which in theory makes certain uses more efficient, although AMD doesn't really give any details here.

Next, the memory architecture of the Radeon HD 2000 is much more advanced than that of the GeForce 8. On the one hand, there is generalized cached access to video memory for writing as well as reading (other GPUs have no access of this type). On the other hand, there is an independent engine to manage PCI Express transfers in parallel with the rest of the GPU. In a GeForce 8, the GPU is blocked during these transfers.

Chips from the R600 generation thus seem well armed for stream computing and could allow AMD to have an advantage over Nvidia. However, as AMD points out so well, the hardware is only half of the story... and for the other half we haven't seen much yet.

Page 6
Other competitors, practical uses

Other competitors
Nvidia and AMD aren't the only ones to try out the massively parallel processor market. IBM (as well as Sony, though most likely more to bring attention to the PS3 than anything else) offers a BladeCenter based on two Cell processors. As you may recall, the Cell is composed of a general-purpose core accompanied by a block of 8 cores devoted to parallel calculation. IBM already offers an entire development environment around this platform. Compared to a GeForce 8800, a Cell has fewer calculation units and less memory bandwidth, but offers a higher frequency and more local memory for its calculation units (256 KB versus 8-16 KB for a GeForce 8).

IBM commercializes the BladeCenter equipped with two Cell processors

Intel is also involved in this domain with the Larrabee project, a massively parallel chip destined to serve as a GPU as well as a coprocessor. In its presentations, which are supposed to be confidential, Intel mentions 16 to 24 cores with 512 bit SSE units (or sixteen 32 bit operations per cycle and per core!). Each of these cores has an L1 cache of 32 KB and an L2 of 256 KB, all accompanied by an overall cache of 4 MB. Larrabee is expected in 2009 or 2010, and its big advantage will be that it is based on the x86 architecture.

The calculation units of each architecture behave differently
What utility?
But what are these calculation units useful for? Not for gaming and not for accelerating the Internet. By this we mean that, at least for now, they are not for general public use but rather for professional use. A number of scientific applications need enormous calculation power that no general-purpose processor can provide. The solution has therefore been to build huge supercomputers based on hundreds or even thousands of processors.

Designing these machines is very long, complex and expensive. Where a massively parallel processor is effective, it makes it possible to offer much more calculation power at identical cost and bulk, or the same capabilities at lower cost.

In our opinion, there is no current plan to build an immense supercomputer based on GPUs. They will first have to prove themselves and mature, because there is no reason to take risks at this level. Moreover, it would be helpful if Nvidia gave more information on the reliability of its chips, error rates (the memory isn't ECC, for example), crash rates, etc. Asked about this at each announcement involving stream computing, Nvidia has never been able to answer. It would appear this has been neglected, perhaps deliberately, because marketing prefers not to make this type of data public. We should not assume that a GeForce never breaks down or never makes any errors.

Current use is more about putting into place systems that were thought to be impossible without such chips. We can't imagine a supercomputer in each hospital unit, for example. An accelerator such as the Tesla could allow a workstation to carry out tasks that were unimaginable before, or to carry out in real time operations that used to take a very long time.

In a press conference at the end of May, Nvidia invited several partners who demonstrated some practical uses of GPUs.

Acceleware is a company which develops platforms based on GPUs destined to accelerate a certain number of functions. The platform is used by its customers to accelerate their application. Acceleware showed us a demo of simulated impact of radiation emitted from a GSM on human tissues as well as citing other uses, for example, for rapid detection of breast cancer or simulations related to pacemakers.

Evolved Machine is a company which is trying to understand the functioning of neurons, in order to be able to reproduce their circuits and create systems capable, for example, of learning and recognizing objects or odors as humans do naturally. Simulating a single neuron represents the evaluation of 200 million differential equations per second. When we know that the basic structures built from neurons contain thousands of them, we can easily imagine the enormous amount of calculation to process. Evolved Machine reports a gain of 130x in processing speed by using GPUs and is working on the development of a rack of GPUs which will be capable of competing with the best supercomputers in the world at 1/100th of their cost.

Headwave develops solutions for the analysis of geophysical data. Petroleum companies are finding it more and more difficult to find oil and gas reservoirs. The deeper they are detected the harder is their analysis. An enormous quantity of data needs to be gathered and then processed. This processing is so heavy that the data collected accumulates at lightning speed and cannot be analyzed due to a lack of calculation power. The use of GPUs allows significantly accelerating this process, notably by making it possible to display results in real time. According to Headwave, the infrastructure to take advantage of GPUs is already in place and petroleum companies are ready and impatient to use this technology.

Page 7
VMD : the test 1/2

John Stone, Senior Research Programmer in the department of Theoretical and Computational Biophysics at the Beckman Institute for Advanced Science and Technology at the University of Illinois (UIUC) and main developer of VMD (whom we thank for his help), provided us with a practical and massively parallel application to test GPUs and CPUs.

VMD, which stands for Visual Molecular Dynamics, is a tool used to visualize, animate and analyze enormous organic molecules. It is, of course, capable of using a GPU for rendering these molecules, as well as the analysis and simulation.

To study organic molecules, most of the time they have to be placed in a realistic environment, in other words surrounded by ions and water. The placement of these ions is a relatively heavy task with these big molecules and the most demanding part of this operation is the calculation of the coulombic potential map surrounding the molecule. We tested this operation with GPUs and CPUs.
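To give an idea of why this calculation parallelizes so well, here is a minimal sketch of a direct Coulomb summation over a set of grid points. It is written in plain Python for readability and is not VMD's actual code: the function name, the data layout and the omission of physical constants are all our assumptions. It also assumes no grid point coincides exactly with an atom.

```python
import math


def coulomb_potential_map(atoms, grid_points):
    """Sum charge / distance over all atoms for each grid point (constants omitted).

    atoms: list of (x, y, z, charge); grid_points: list of (x, y, z).
    """
    potentials = []
    for gx, gy, gz in grid_points:
        total = 0.0
        for ax, ay, az, charge in atoms:
            distance = math.sqrt((gx - ax) ** 2 + (gy - ay) ** 2 + (gz - az) ** 2)
            total += charge / distance
        potentials.append(total)
    return potentials


# Two hypothetical atoms with opposite charges, and two grid points.
atoms = [(0.0, 0.0, 0.0, 1.0), (4.0, 0.0, 0.0, -1.0)]
grid_points = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
potential_map = coulomb_potential_map(atoms, grid_points)
print(potential_map)
```

Each grid point is computed independently of all the others, so one GPU thread per grid point is a natural mapping, and a realistic map contains far more points than the GPU has calculation units.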
Test protocol
After having carried out a few initial tests with the beta version of VMD provided by John, we used on the one side, the latest builds compiled for CUDA 1.0, and on the other, compiled with Intel tools and SSE optimized. Algorithms were optimized for each platform and differ slightly. The CPU version relies more on precalculation than the GPU.

VMD in its CUDA form was first developed for CUDA 0.8, which needed a CPU core per GPU driven. This limitation is still there because it is built into the structure of the code. Also, the application divides the task in a fixed way between the different CPUs/GPUs. If the application detects two GPUs (and enough CPU cores to manage them), it will divide the task into two equal parts, whatever the power of the GPUs. In other words, a faster GPU which finishes its work more quickly then has to wait for its slower counterpart. Using a GeForce 8800 Ultra and a GeForce 8400 GS then basically means using two GeForce 8400 GSs.
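The effect of this fixed division can be illustrated with a small model. The sketch below is ours, not VMD's scheduling code, and the throughput figures are arbitrary hypothetical units chosen only to show the behavior.

```python
def static_split_time(total_work, device_rates):
    """Elapsed time when the work is cut into equal parts, one per device."""
    share = total_work / len(device_rates)
    # Every device gets the same share, so the slowest one sets the total time.
    return max(share / rate for rate in device_rates)


FAST_GPU = 10.0  # hypothetical work units per second
SLOW_GPU = 1.0   # hypothetical work units per second

print(static_split_time(100.0, [FAST_GPU, SLOW_GPU]))  # 50.0
print(static_split_time(100.0, [SLOW_GPU, SLOW_GPU]))  # 50.0
```

In this model the fast GPU finishes its half in 5 seconds and then sits idle for the remaining 45, which is why the mixed pair does no better than two slow cards.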

Tests were carried out in Linux with Fedora Core 7. We tested the entire GeForce 8 line (plus the Superclocked eVGA model) with a Core 2 Extreme QX6850 equipped with 2 GB of DDR2 800 MHz memory, all of this on an eVGA motherboard based on an nForce 680i. For tests of the QX6850 and E6850, we used the same platform with a GeForce 8400 GS.

Three GeForce 8800s versus Intel's V8.

We carried out the same tests on Intel's V8 platform, composed of two Xeon X5365s (quad-core, 3 GHz, identical to the QX6850) and the 5000X chipset, which supports four FB-DIMM channels and a distinct FSB for each of its CPUs. With 4 GB of memory based on DDR2 at 667 MHz, we obtained 5.33 GB/s per channel, or a total bandwidth of 21.33 GB/s. Note that this figure is only true for reading, because the write bandwidth is half as much. Here too, a GeForce 8400 GS is used for display.
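These bandwidth figures follow from the memory speed and the channel width. The sketch below assumes a 64 bit (8 byte) data path per channel and takes DDR2-667 as 666.67 million transfers per second, which is our reading of the "667 MHz" rating.

```python
TRANSFERS_PER_SECOND = 666.67e6  # DDR2-667, assumed nominal transfer rate
BYTES_PER_TRANSFER = 8           # assumed 64 bit data path per channel
CHANNELS = 4                     # FB-DIMM channels on the 5000X chipset

per_channel_gbs = TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9
total_read_gbs = per_channel_gbs * CHANNELS
total_write_gbs = total_read_gbs / 2  # writes run at half the read bandwidth

print(round(per_channel_gbs, 2))  # 5.33
print(round(total_read_gbs, 2))   # 21.33
```

Even with all four channels populated, the eight CPU cores of this platform share roughly a fifth of the bandwidth available to a single high end GPU.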

We give the results obtained on the first part of the process because processing the entire molecule takes too much time.
First of all, we give the calculation time and the number of evaluations of atoms in billions per second. These are the same values but in different form in order that extreme performances are represented more clearly.

Where a GeForce 8800 Ultra needed almost four minutes to finish the task, a Core 2 Extreme QX6850 needed an hour, which is slower than a GeForce 8400 GS! The V8 does a little better but remains far from what high end GPUs can do in a massively parallel application such as this one.

As you may have noticed, performance scales almost perfectly when we use several GeForce 8s. On the other hand, this isn't the case for CPUs. We will explain why on the next page.

Page 8
VMD : the test 2/2

CPU Scaling
The program is meant to scale almost perfectly, but according to our results this isn't the case with CPUs, which caught our attention. We decided to test the QX6850 with 1, 2 and 4 cores in order to observe what happens:

Between 1 and 2 cores, performance scales almost proportionally, increasing by 93%. On the other hand, in going from 2 to 4 cores the gain is only 39%, despite the fact that all 4 cores are fully busy. So what is happening? At this point we should remember that each added GPU has its own memory and therefore its own bandwidth, while the added CPU cores have to share the same memory bandwidth.
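The two gains can be combined into an overall speedup and a parallel efficiency for each core count. The sketch below only reuses the 93% and 39% figures measured here; the helper name is ours.

```python
GAIN_1_TO_2_CORES = 0.93  # measured gain going from 1 to 2 cores
GAIN_2_TO_4_CORES = 0.39  # measured gain going from 2 to 4 cores


def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speedup actually obtained."""
    return speedup / cores


speedup_2 = 1.0 + GAIN_1_TO_2_CORES
speedup_4 = speedup_2 * (1.0 + GAIN_2_TO_4_CORES)

print(round(speedup_2, 2), round(parallel_efficiency(speedup_2, 2), 2))  # 1.93 0.96
print(round(speedup_4, 2), round(parallel_efficiency(speedup_4, 4), 2))  # 2.68 0.67
```

Four cores therefore deliver about 2.7 times the performance of one, or roughly two thirds of what perfect scaling would give.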

In order to verify that these limited gains are indeed due to insufficient memory bandwidth, we carried out an additional test, this time on the V8 platform, removing memory modules to test performance with 1, 2 and 4 FB-DIMM channels, or in other words with 5.33, 10.66 and 21.33 GB/s of memory bandwidth. We used 1 GB modules and the system's total memory consumption doesn't go over 900 MB during the test, so we are not limited by having only 1 GB installed.

Memory bandwidth does indeed have an influence on performance and is a limiting factor, especially with multicore CPUs whose cores share it.
We measured the power consumption of the systems at the power outlet, using an Enermax Galaxy 850 Watt power supply:

The 3 GeForce 8800s showed more than 700 Watts on the meter! Note that a CPU core is used per GPU and the difference between 2 and 3 cards, for example, doesnít consist only of consumption of a 8800 but rather an 8800 plus a functioning CPU core.

The difference between the QX6850 and E6850 is slight because the QX6850 spends a lot of time waiting to get data from memory.

When we compare power consumption to performance, we can see a significant advantage for GPUs. The higher the GPU power / CPU power ratio, the more performance per watt increases. This is true at least over the range of data we measured, because this efficiency will obviously stop increasing from the moment the CPU(s) are no longer able to feed the GPUs correctly.

Page 9

This second approach to CUDA has been very interesting. Nvidia has improved performances, flexibility, and robustness, and although the list of functions to add and small bugs to correct probably represents something to keep Nvidia engineers busy for some time, CUDA is really usable.

We were able to evaluate the performance of Nvidia GeForce 8 class GPUs in a practical application and we saw a huge performance advantage for these GPUs compared to CPUs with an algorithm which was initially destined for them. This is enough to open some new perspectives for this type of application. In fact, these results differ from our previous conclusion on CUDA, in which we said that the power of the GeForce 8800 wasn't yet sufficient to really justify new development compared to multi-CPU systems. So it appears we underestimated two important points.

The first one is that a high end GPU comes with a memory bandwidth of 100 GB/s whereas the four cores of a CPU have to share 10 GB/s. This is a significant difference which limits CPUs in certain cases. The second point is that a GPU is designed to maximize its throughput. It will therefore automatically use a very high number of threads to maximize its throughput whereas a CPU will more often end up waiting many cycles. These two reasons allow GPUs to have an enormous advantage over CPUs in certain applications such as the one we tested.

Using a GPU with CUDA may seem very difficult, even adventurous, at first, but it is much simpler than most people think. The reason is that a GPU isn't destined to replace a CPU, but rather to help it in certain specific tasks. This doesn't involve making a task parallel in order to use several cores (as is currently the case with CPUs) but implementing a task which is naturally massively parallel, and there are many of these. A race car isn't used to transport cattle and we don't drive a tractor in an F1 race. It is the same thing for GPUs. Therefore, it's all about making a massively parallel algorithm efficient on a given architecture.

It would be a mistake to limit our view of a GPU such as the GeForce 8800 to a cluster of 128 cores, which we should try to take advantage of by segmenting an algorithm. A GeForce 8800 isn't just 128 processors, but first of all more than 12,000 threads in flight (16 multiprocessors x 768 threads)! We therefore have to provide the GPU with an enormous number of threads, keep them within the hardware limits for maximum productivity, and let the GPU execute them efficiently.

Some very important points we have to worry about when coding for a CPU have to be abandoned to focus on others. This is a change of habits that is unfortunately not too often taught in universities. Nvidia is aware of this problem and knows it is a key element in the successful use of its chips as calculation units. David Kirk, Chief Scientist at Nvidia, has for this reason been given the responsibility of teaching a class about massively parallel programming at the University of Illinois at Urbana-Champaign and Nvidia has supported and sponsored other similar courses.

David Kirk's very interesting class is available online and is offered as a free to use learning kit. While the course uses the example of the GeForce 8800, the concepts that are presented aren't too specific to any given architecture, with the exception of course of optimizations. It can be found here:

ECE 498 AL : Programming Massively Parallel Processors

Moreover, if the subject interests you, we recommend reading the interview transcripts published by Beyond3D with David Kirk, Andy Keane and Ian Buck, respectively Chief Scientist, General Manager of the GPU Computing Group, and CUDA Software Manager.
