John Stone, Senior Research Programmer in the department of Theoretical and Computational Biophysics at the Beckman Institute for Advanced Science and Technology at the University of Illinois (UIUC) and main developer of VMD, provided us with a practical, massively parallel application for testing GPUs and CPUs. We thank him for his help.
VMD, which stands for Visual Molecular Dynamics, is a tool used to visualize, animate and analyze very large organic molecules. It can, of course, use a GPU to render these molecules, and also to accelerate their analysis and simulation.
Organic molecules usually have to be studied in a realistic environment, in other words surrounded by ions and water. Placing these ions is a relatively heavy task for such large molecules, and the most demanding part of the operation is calculating the coulombic potential map surrounding the molecule. This is the operation we tested on GPUs and CPUs.
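To give an idea of why this step is so heavy, here is a minimal sketch of a direct Coulomb summation over a regular grid. It is not VMD's code: the function name, the `(x, y, z, charge)` atom layout and the grid parameters are assumptions made for illustration, and the physical constants and units are left out.

```python
import math

def coulomb_potential_map(atoms, origin, spacing, dims):
    """Direct Coulomb summation on a regular 3D grid (illustrative only).

    atoms   : list of (x, y, z, charge) tuples (hypothetical layout)
    origin  : (x, y, z) coordinates of the first grid point
    spacing : distance between neighbouring grid points
    dims    : (nx, ny, nz) number of grid points along each axis
    The Coulomb constant and unit conversions are deliberately omitted.
    """
    nx, ny, nz = dims
    ox, oy, oz = origin
    grid = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for k in range(nz):
        z = oz + k * spacing
        for j in range(ny):
            y = oy + j * spacing
            for i in range(nx):
                x = ox + i * spacing
                total = 0.0
                # Every grid point sums the contribution of every atom.
                for ax, ay, az, q in atoms:
                    r = math.sqrt((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2)
                    if r > 0.0:  # skip a grid point that coincides with an atom
                        total += q / r
                grid[k][j][i] = total
    return grid
```

The cost is the number of grid points multiplied by the number of atoms, and each grid point can be computed independently of the others, which is what makes the problem a natural fit for a massively parallel processor.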
After a few initial tests with the beta version of VMD provided by John, we used the latest builds: on one side a version compiled for CUDA 1.0, and on the other a version compiled with Intel tools and optimized for SSE. The algorithms were optimized for each platform and differ slightly; the CPU version relies more on precalculation than the GPU version.
The CUDA version of VMD was first developed on CUDA 0.8, which needed one CPU core per GPU driven. This limitation remains because the code is structured around it. In addition, the application divides the work in a fixed way between the different CPUs/GPUs. If it detects two GPUs (and enough CPU cores to manage them), it splits the task in two, whatever the power of each GPU. In other words, a faster GPU that finishes its share early has to wait for its slower counterpart. Using a GeForce 8800 Ultra with a GeForce 8400 GS therefore basically amounts to using two GeForce 8400 GSs.
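The effect of this fixed split can be illustrated with a small model. This is only a sketch: the function names are ours, and the throughput figures in the usage example are arbitrary units chosen for illustration, not measurements.

```python
def static_split_time(total_work, throughputs):
    """Time to finish when every device receives an equal share of the work.

    The run ends only when the slowest device has finished its share.
    """
    share = total_work / len(throughputs)
    return max(share / t for t in throughputs)

def proportional_split_time(total_work, throughputs):
    """Ideal time if the work were divided in proportion to device speed."""
    return total_work / sum(throughputs)

# Hypothetical throughputs (arbitrary units): one fast GPU, one slow GPU.
fast, slow = 10.0, 1.0
mixed = static_split_time(100.0, [fast, slow])      # fast GPU waits for the slow one
two_slow = static_split_time(100.0, [slow, slow])   # same finishing time
ideal = proportional_split_time(100.0, [fast, slow])
```

With an equal split, `mixed` and `two_slow` are identical: the pair is only as fast as two copies of its slowest member, whereas a split proportional to speed (`ideal`) would let the faster card carry most of the load.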
Tests were carried out under Linux, with Fedora Core 7. We tested the entire GeForce 8 line (plus the eVGA Superclocked model) with a Core 2 Extreme QX6850 and 2 GB of DDR2 memory at 800 MHz, on an eVGA motherboard based on the nForce 680i. For the QX6850 and E6850 CPU tests, we used the same platform with a GeForce 8400 GS.
Three GeForce 8800s versus Intel’s V8.
We ran the same tests on Intel's V8 platform, made up of two Xeon X5365s (quad-core, 3 GHz, identical to the QX6850) and the 5000X chipset, which supports four FB-DIMM channels and a separate FSB for each CPU. With 4 GB of memory based on DDR2 at 667 MHz, we get 5.33 GB/s per channel, or a total bandwidth of 21.33 GB/s. Note that this figure only holds for reads, because write bandwidth is half as high. Here too, a GeForce 8400 GS is used for display.
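The bandwidth figures can be reproduced with a short calculation. The sketch below assumes a 64-bit (8-byte) data path per channel and takes "667 MHz" as the usual rounding of 666.67 million transfers per second; both are our assumptions about how the figures are derived, and the function name is illustrative.

```python
def channel_bandwidth_gb_s(mega_transfers_per_s, bus_width_bytes=8):
    """Peak bandwidth of one memory channel in GB/s (1 GB = 1000 MB).

    bus_width_bytes=8 assumes a 64-bit data path per channel.
    """
    return mega_transfers_per_s * bus_width_bytes / 1000.0

ddr2_667 = 2000.0 / 3.0                 # "667 MHz" is 666.67 MT/s before rounding
per_channel = channel_bandwidth_gb_s(ddr2_667)
total_read = 4 * per_channel            # four FB-DIMM channels
total_write = total_read / 2            # write bandwidth is half the read figure

print(round(per_channel, 2), round(total_read, 2), round(total_write, 2))
```

These are theoretical peaks; the bandwidth actually available to the CPUs also depends on the FSBs and on the access pattern.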
We only give the results for the first part of the process, because processing the entire molecule takes too much time.
First of all, we give the calculation time and the number of atom evaluations per second, in billions. These are the same results expressed in two different forms, so that performance at both extremes is easier to read.
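The relationship between the two forms is simple: the number of atom evaluations is fixed by the problem, so the rate is just that number divided by the calculation time. Here is a minimal sketch, assuming one evaluation per atom and per grid point; the function name and the sizes used in the example are hypothetical and are not those of our test molecule.

```python
def atom_evals_per_second_billions(num_atoms, num_grid_points, seconds):
    """Convert a run time into billions of atom evaluations per second.

    Assumes one evaluation per (atom, grid point) pair.
    """
    return num_atoms * num_grid_points / seconds / 1e9

# Hypothetical problem size: 100,000 atoms and 1,000,000 grid points.
rate_slow = atom_evals_per_second_billions(100_000, 1_000_000, 100.0)
rate_fast = atom_evals_per_second_billions(100_000, 1_000_000, 50.0)
```

Halving the time doubles the rate. On a time axis the fastest results are squeezed together near zero, while on a rate axis it is the slowest ones that are, which is why both views are useful.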
Where a GeForce 8800 Ultra needed almost four minutes to finish the task, a Core 2 Extreme QX6850 needed an hour, which makes it slower than a GeForce 8400 GS! The V8 does a little better but remains far from what high-end GPUs can do in a massively parallel application such as this one.
As you may have noticed, performance scales almost perfectly when we use several GeForce 8s. This isn't the case for CPUs, however. We will explain why on the following pages.