Nvidia CUDA: preview
by Damien Triolet
Published on March 21, 2007

In practice
We experimented with CUDA for a couple of weeks, long enough to form an idea of what it can do.

Let's start with what it can't do: simply use an SLI system to double the computing power. Each GPU is seen as an independent device, and a kernel runs on a single GPU. To benefit from a multi-GPU system, a separate kernel has to be launched on each GPU, which makes it harder to exploit all of the available computing power. Also, kernel execution is synchronous: once the CPU has requested the execution of a kernel, the CPU thread that made the request, and the core running it, stay blocked until the GPU has finished working. CPU power is therefore easily wasted on waiting instead of being used to complement the GPU. Nvidia will have to improve this in the future, or else a system will need as many CPU cores (which will sit idle!) as it has GPUs.
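To make the consequence concrete, here is a minimal sketch of the host-side structure this implies: the data is split into one chunk per GPU, and each chunk gets its own CPU thread, because each launch blocks its caller. It is written in plain Python with simulated devices and makes no real CUDA calls; `run_kernel_blocking` and `run_on_all_devices` are hypothetical names, and the kernel body (doubling each value) is a placeholder.

```python
import threading

def run_kernel_blocking(device_id, data, out, offset):
    """Stand-in for a synchronous kernel launch on one (simulated) device.

    Like the launches described above, it only returns once the work is done,
    so the calling CPU thread is blocked for the whole duration.
    """
    for i, x in enumerate(data):
        out[offset + i] = x * 2.0  # placeholder kernel body

def run_on_all_devices(data, n_devices):
    """Split the data into one chunk per device and use one host thread per device."""
    out = [0.0] * len(data)
    chunk = (len(data) + n_devices - 1) // n_devices  # ceiling division
    threads = []
    for dev in range(n_devices):
        start = dev * chunk
        part = data[start:start + chunk]
        t = threading.Thread(target=run_kernel_blocking,
                             args=(dev, part, out, start))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # every host thread waits for "its" device
    return out
```

The number of host threads equals the number of devices, which is exactly the cost pointed out above: each of those threads does nothing but wait.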

We thought about comparing several algorithms on GPUs and CPUs in order to measure the performance gap, but we quickly changed our minds for several reasons. The main one is that we can't claim to be able to write a function that is as well optimized for one side as for the other. In other words, if the GPU is faster, is it because it is more efficient, or because the same function was less well optimized for the CPU (and vice versa)? It is also easy to find one example that runs much faster on a CPU and another that runs faster on the GPU. So, unless we spent weeks developing reasonably fair tests (unfortunately, we do not have that time), it is very difficult to compare GPU and CPU performance objectively. Nevertheless, we decided to give you two performance graphs. While they include both the GPU and the CPU, they aren't intended for direct comparison, but rather to show how performance varies when a particular parameter is changed.

It is important to keep in mind that this is a beta version of CUDA, and performance should improve with later revisions.

The first parameter we varied was the number of blocks. The total number of threads, or elements to be processed, stays the same, but they are grouped either into one big block or into several smaller ones. On the CPU, each block can be treated as one thread and executed on a different core. The kernel executes a series of operations on the data and writes the results to an array.
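The organization of this first test can be sketched as follows. This is not the code used for the graphs: the kernel body (a chain of multiply-adds) and the names `kernel` and `run_blocks` are assumptions made for illustration. It only shows that the same elements can be grouped into one block or many without changing the result.

```python
def kernel(x, n_ops):
    """Hypothetical per-element kernel: a series of n_ops multiply-adds."""
    for _ in range(n_ops):
        x = x * 0.5 + 1.0
    return x

def run_blocks(data, n_blocks, n_ops):
    """Process every element once, with the elements grouped into n_blocks blocks."""
    n = len(data)
    if n % n_blocks != 0:
        raise ValueError("element count must be a multiple of the block count")
    block_size = n // n_blocks
    result = [0.0] * n
    for block in range(n_blocks):          # blocks are independent of each other
        for thread in range(block_size):   # one "thread" per element in the block
            i = block * block_size + thread
            result[i] = kernel(data[i], n_ops)
    return result
```

Here the blocks run one after another, which corresponds to the single-core case. Since no block depends on another, each one could equally be handed to a separate CPU core or GPU multiprocessor.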


With a single-core CPU, performance is identical whatever the organization. A quad-core CPU could, however, process this type of kernel up to four times faster with four or more blocks. On the GeForce 8800 GTX, at least 16 blocks are needed to put all 16 multiprocessors to work, and 32 to use them efficiently. This requires more programming work, but the performance gain is substantial.
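A very rough model helps explain the shape of these results. Assume, as a simplification, that the total work is fixed, that it is split evenly between blocks, and that each execution unit (a CPU core or a GPU multiprocessor) runs one block at a time. The function name `relative_time` is hypothetical and the model ignores everything else, including memory latency.

```python
import math

def relative_time(n_blocks, n_units):
    """Time for a fixed workload, relative to a single unit doing all of it.

    Each block holds 1/n_blocks of the work, and the blocks are executed in
    'waves' of at most n_units blocks running side by side.
    """
    waves = math.ceil(n_blocks / n_units)
    return waves / n_blocks

# One unit: always 1.0, whatever the number of blocks.
# Four units: 0.25 as soon as there are 4 blocks.
# Sixteen units: 1.0 with a single block, 0.0625 with 16 blocks.
```

Note that this model gives the same time for 16 and for 32 blocks on 16 units. The extra gain measured with 32 blocks therefore has to come from something the model leaves out, presumably the ability of a multiprocessor to hide latency when it has more than one block to work on.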

The second test consisted of increasing the complexity of the kernel (or the number of operations). The number of blocks was fixed at 32.


While calculation time increases linearly on the CPU, this isn't the case for the GPU below a certain complexity. This indicates that the fixed management cost is quite high and needs to be amortized over complex operations. Processing a large amount of data isn't enough: the processing itself also has to be sufficiently complex for the GPU to be really worthwhile.
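This behaviour is what a fixed cost plus a per-operation cost would produce. The sketch below assumes exactly that model; the function names are hypothetical and the costs are in arbitrary units chosen by the caller, not measurements from our tests.

```python
import math

def run_time(n_ops, fixed_cost, cost_per_op):
    """Total time under the assumed model: fixed overhead plus linear work."""
    return fixed_cost + n_ops * cost_per_op

def overhead_share(n_ops, fixed_cost, cost_per_op):
    """Fraction of the total time spent on the fixed overhead."""
    return fixed_cost / run_time(n_ops, fixed_cost, cost_per_op)

def min_ops_for_share(fixed_cost, cost_per_op, max_share):
    """Smallest operation count at which overhead is at most max_share of the time.

    Obtained by solving fixed / (fixed + n * c) <= max_share for n.
    """
    return math.ceil(fixed_cost * (1.0 - max_share) / (max_share * cost_per_op))
```

With a fixed cost of 100 units and 1 unit per operation, for instance, the overhead still accounts for half of the total time at 100 operations. Below that point the curve looks almost flat, and only well above it does time grow in proportion to complexity.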
