PCI Express 3.0: impact on performance - BeHardware
Written by Guillaume Louel
Published on January 20, 2012
With the successive arrivals of the Intel LGA 2011 platform and the AMD Radeon HD 7970, PCI Express 3.0 has become a reality. Have the expected increases in speed also been delivered? Is there an impact on performance in applications and games? What about CrossfireX? So many questions for us to answer!
As we explained in a previous article on the subject, the PCI Express bus is a point-to-point serial interface, designed to be relatively low cost in terms of implementation as well as being particularly modular.
Beyond the physical implementation, which allows for peripherals from x1 to x32, PCI Express also auto-negotiates two other parameters when a host and a peripheral link up. The first is the width of the bus, which can vary independently of the physical implementation (a x16 graphics card can run in x1 mode if necessary). The second is the data rate (2.5 GT/s, 5 GT/s and 8 GT/s for versions 1.0, 2.0 and 3.0 respectively). The speed can also be renegotiated in real time: the operating system can reduce the speed of the bus when it isn't being used in order to maximise energy saving.
In practical terms, PCI Express 3.0 doubles the bandwidth of PCI Express 2.0, giving a theoretical 16 GB/s in each direction for a peripheral in x16 mode instead of 8 GB/s. Note however that the transfer rate is 8 GT/s and not 10 GT/s. This is because the first versions of PCI Express use 8b/10b encoding, putting 10 bits on the wire for just 8 bits of data (so a 5 GT/s PCI Express 2.0 lane carries only 4 Gb/s of data), whereas PCI Express 3.0 introduces a more complex 128b/130b encoding scheme which makes almost all of an 8 GT/s link available for data.
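The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not a model of the full protocol: it accounts only for the line encoding and ignores packet headers and flow control, and the names `PCIE_GENERATIONS` and `theoretical_bandwidth_gbs` are our own.

```python
from fractions import Fraction

# Per-lane transfer rate (GT/s) and line-code efficiency for each generation.
PCIE_GENERATIONS = {
    "1.0": (Fraction(5, 2), Fraction(8, 10)),    # 2.5 GT/s, 8b/10b
    "2.0": (Fraction(5), Fraction(8, 10)),       # 5 GT/s, 8b/10b
    "3.0": (Fraction(8), Fraction(128, 130)),    # 8 GT/s, 128b/130b
}


def theoretical_bandwidth_gbs(generation: str, lanes: int) -> float:
    """Peak data bandwidth in GB/s for one direction, encoding overhead only."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer moves one bit per lane; divide by 8 to convert Gb/s to GB/s.
    return float(rate_gt * efficiency * lanes / 8)


for gen in ("1.0", "2.0", "3.0"):
    print(f"PCI Express {gen} x16: {theoretical_bandwidth_gbs(gen, 16):.2f} GB/s")
```

Strictly speaking, the 3.0 x16 figure comes out at about 15.75 GB/s rather than a round 16 GB/s, because two bits in every 130 are still spent on encoding.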
For the time being, only the LGA 2011 platform, based on the X79 chipset and Sandy Bridge-E processors, allows you to benefit from PCI Express 3.0. In April, the LGA 1155 Ivy Bridge processors will also support the new standard. Backwards compatibility is of course maintained: PCI Express 3.0 cards run in 2.0 and 1.0 ports and vice versa.
To carry out our test, we used an X79 motherboard from Asus, the Rampage IV Gene in the Micro ATX format. As of its latest BIOS, it allows you to choose between 1.0, 2.0 and 3.0 modes for your PCI Express ports.
The first two motherboard slots are cabled at x16 while the third, though x16 physically, is only cabled at x8. We used only the first two motherboard slots for our tests.
We also used two Radeon HD 7970 graphics cards as they introduced PCI Express 3.0 support. The test machine processor was a Core i7 3960X and we used 16 GB of DDR3 clocked at 1600 MHz. The tests were carried out in Windows 7 x64 SP1.
Bandwidth, theoretical measurements
Before measuring the practical impact, we ran some theoretical tests to see if PCI Express 3.0 fulfills its promise in terms of speeds. Here we used a performance test that is included in version 2.6 of AMD's APP SDK.
Paged pool memory, non-paged pool memory
This first test, you may remember, is fairly particular in the sense that it attempts to achieve the highest possible transfer speeds using what is known as non-paged pool memory. In effect, on the system side, the tool locks memory pages so that they can't be moved. In practice this means that we can be 100% certain, throughout the execution of the programme, that the memory pages will be physically situated in RAM and never in a swap file.
While this may seem unimportant on a test machine equipped with 16 GB of RAM, in practice it does matter. Of course, the data transferred will remain in physical memory either way, but the mere possibility that it might not adds overhead to memory copy operations. For non-paged pool memory here, AMD uses algorithms optimised to make the most of PCI Express (just like NVIDIA in CUDA).
For this test and the following tests, we measured six distinct cases:
- PCI Express 3.0 x16, x8 and x4
- PCI Express 2.0 x16 and x8
- PCI Express 1.0 x16
From a theoretical point of view, some modes have an equivalent bandwidth:
- PCI Express 3.0 x16
- PCI Express 3.0 x8 and PCI Express 2.0 x16
- PCI Express 3.0 x4, PCI Express 2.0 x8 and PCI Express 1.0 x16
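These equivalence groups can be reproduced by computing each mode's theoretical bandwidth and rounding to the nearest GB/s. The sketch below assumes encoding overhead only; the names `RATES`, `MODES` and `group_by_bandwidth` are ours.

```python
from collections import defaultdict

# Per-lane transfer rate (GT/s) and encoding efficiency per generation.
RATES = {"1.0": (2.5, 8 / 10), "2.0": (5.0, 8 / 10), "3.0": (8.0, 128 / 130)}

# The six modes measured in this article.
MODES = [("3.0", 16), ("3.0", 8), ("3.0", 4), ("2.0", 16), ("2.0", 8), ("1.0", 16)]


def bandwidth_gbs(generation, lanes):
    """Theoretical one-way bandwidth in GB/s (encoding overhead only)."""
    rate, efficiency = RATES[generation]
    return rate * efficiency * lanes / 8


def group_by_bandwidth(modes):
    """Group modes whose bandwidth rounds to the same whole number of GB/s."""
    groups = defaultdict(list)
    for generation, lanes in modes:
        key = round(bandwidth_gbs(generation, lanes))
        groups[key].append(f"{generation} x{lanes}")
    return dict(groups)


for gbs, names in sorted(group_by_bandwidth(MODES).items(), reverse=True):
    print(f"~{gbs} GB/s: {', '.join(names)}")
```

The rounding hides a small difference: because 128b/130b is not a perfect encoding, each 3.0 mode has about 1.5% less theoretical bandwidth than its 2.0 or 1.0 counterpart in the same group.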
For games and applications that allow it, we also give the results in each of these cases in CrossFire mode.
Theoretical bandwidth (non-paged memory)
We independently measured the transfer rate from the CPU to the GPU (typical case in games), as well as in the opposite direction (also used in OpenCL).
There are several important points to note. Firstly, going from the GPU to the CPU there's a nice 77% increase in bandwidth between PCI Express 3.0 x16 and 2.0 x16 (compared to 89% between 2.0 x16 and 1.0 x16), but the gains are much smaller the other way around: only 50%, and we remain under the 10 GB/s bar.
Another interesting point is the comparison between PCI Express 3.0 x8 and PCI Express 2.0 x16, two modes that theoretically have an identical bandwidth. While there's a 2.5% fall going from the CPU to the GPU, there's a 5% gain in the other direction.
Performance levels at PCI-E 3.0 x4, 2.0 x8 and 1.0 x16 are similar overall.
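Efficiency relative to the theoretical bandwidth can be bounded from the two CPU to GPU figures quoted above (under 10 GB/s in 3.0 x16, roughly 50% more than 2.0 x16). The sketch below is only a rough bound built on those rounded figures, not on the measured values themselves; the helper `efficiency` and the constant names are ours.

```python
def efficiency(measured_gbs, theoretical_gbs):
    """Fraction of the theoretical bandwidth actually achieved."""
    return measured_gbs / theoretical_gbs


# Theoretical one-way bandwidth in GB/s (encoding overhead only).
PCIE3_X16 = 8.0 * (128 / 130) * 16 / 8   # about 15.75 GB/s
PCIE2_X16 = 5.0 * (8 / 10) * 16 / 8      # 8 GB/s

# Upper bound taken from the text: CPU to GPU stays under 10 GB/s in 3.0 x16.
ceiling_3 = 10.0
# A gain of about 50% over 2.0 x16 puts the 2.0 x16 figure under 10 / 1.5 GB/s.
ceiling_2 = ceiling_3 / 1.5

print(f"3.0 x16 CPU to GPU efficiency: under {efficiency(ceiling_3, PCIE3_X16):.0%}")
print(f"2.0 x16 CPU to GPU efficiency: under {efficiency(ceiling_2, PCIE2_X16):.0%}")
```

In other words, on this platform PCI Express 3.0 x16 uses a noticeably smaller share of its theoretical bandwidth than 2.0 x16 does when going from the CPU to the GPU.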
Theoretical bandwidth (paged pool memory)
As developers can't always use non-paged pool memory, we carried out a second test via the Cloo library (in version 0.9.1). AMD's OpenCL driver is compatible with version 1.1 of the specification.
As with non-paged pool memory, the gains given by PCI-E 3.0 x16 over 2.0 x16 are asymmetrical: 38% from the CPU to the GPU and 51% from the GPU to the CPU. These scores are however relatively high, approaching (or exceeding when going from the GPU to the CPU) the level of performance of non-paged PCI Express 2.0 x16.
Comparing 3.0 x8 and 2.0 x16, we find an identical score for both modes going from the GPU to the CPU and a 4.2% gain in the other direction. Performance levels at PCI-E 3.0 x4, 2.0 x8 and 1.0 x16 are almost identical.
Let's now see what this translates to in the applications tests!
OpenCL, DxO Optics Pro, AES, Luxmark
We wanted to check the relative levels of performance of various OpenCL applications. When introducing the Radeon HD 7970s, AMD talked about a version of WinZip (16.5!) that supposedly used OpenCL to accelerate AES encryption. This version still isn't available.
We turned to three OpenCL tests. Firstly, an AES encryption test from AMD's SDK. Secondly, LuxMark, a benchmark from LuxRender that uses OpenCL.
Update (23/01): Thirdly, DxO Optics Pro in version 7.1, the RAW export functionality of which we tested, as it can also use OpenCL.
DxO Optics Pro 7
We have added the DxO Optics Pro 7.1 photo processing software to our protocol. We used the RAW to JPEG development feature on 50 files, a feature that has used OpenCL since version 7 of the software. We carried out two series of tests, first allowing one file to be converted at a time and then two simultaneously. Beyond this, processor performance tends to limit the performance of the graphics card. For info, we also added the conversion time obtained with the processor alone.
While there is a slight impact in different modes, in practice the differences remain slim. Note that, at equal bandwidth, the x16 modes are systematically faster than the others.
AMD supplies the source code for a piece of AES encryption software with its APP SDK. For this test we used a 190 MB photo, transferred to the GPU where it is transformed and then sent back to the CPU before being written to the drive. The test measures the transfer and processing times.
The differences are extremely small here though PCI Express 3.0 x16 mode does have a slight advantage (in the order of a hundredth of a second). Having more bandwidth doesn't make much of a difference in practice here.
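A back-of-the-envelope calculation shows why the advantage is of this order. The sketch below assumes that 1 MB is 10^6 bytes and that transfers run at the full theoretical bandwidth, which real transfers do not reach; the names `PHOTO_BYTES` and `transfer_ms` are ours.

```python
PHOTO_BYTES = 190 * 10**6  # the 190 MB photo, assuming 1 MB = 10^6 bytes

# Theoretical one-way bandwidth in GB/s (encoding overhead only).
PCIE3_X16 = 8.0 * (128 / 130) * 16 / 8   # about 15.75 GB/s
PCIE2_X16 = 5.0 * (8 / 10) * 16 / 8      # 8 GB/s


def transfer_ms(size_bytes, bandwidth_gbs):
    """Ideal time in milliseconds to move size_bytes at bandwidth_gbs GB/s."""
    return size_bytes / (bandwidth_gbs * 1e9) * 1000


t3 = transfer_ms(PHOTO_BYTES, PCIE3_X16)
t2 = transfer_ms(PHOTO_BYTES, PCIE2_X16)
print(f"3.0 x16: {t3:.1f} ms each way")
print(f"2.0 x16: {t2:.1f} ms each way")
print(f"Ideal saving over a round trip: {2 * (t2 - t1):.1f} ms" if False else
      f"Ideal saving over a round trip: {2 * (t2 - t3):.1f} ms")
```

Even in this ideal case, doubling the bandwidth saves only around a hundredth of a second in each direction, which is consistent with the small differences measured.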
We used the Windows 64-bit version of LuxMark in 100% OpenCL mode. The test supports CrossFire.
Here again, the gains are extremely slight, with 3.0 x16 just a hairsbreadth ahead of 2.0 x16. The additional bandwidth contributes very little. Note that PCI Express 3.0 x4 is the most efficient of the equivalent modes with a single card, but its performance drops off in CrossFire and it loses that lead. PCI Express 3.0 x8 is very slightly up on 2.0 x16 in both cases.
Let's now move on to games and see if we can obtain bigger performance differences!
Battlefield 3, Crysis 2, Civilization V
Games mainly transfer data from the CPU to the GPU. Let's see what impact PCI-E 3.0 has! We carried out our tests at 1920 x 1200.
Battlefield 3 uses one of the most advanced graphics engines currently in use, Frostbite 2. We tested it at Ultra and recorded performance with Fraps, over a well defined route.
The differences are extremely slim here, to the point where the only significant reduction in performance is in PCI-E 1.0 x16. Interestingly, PCI-E 3.0 x4 manages better at the same theoretical bandwidth. The small gain in latency afforded by the higher transfer rate is probably a factor here.
With CrossFire, at the same theoretical bandwidth, PCI-E 3.0 always comes out on top.
A development of the Crysis Warhead engine is used for Crysis 2. We used version 1.9 with the DirectX 11 patch and the additional textures. We tested it in Ultra mode with Fraps on a predefined route.
Once again with a single card at identical bandwidth PCI Express 3.0 has the advantage. PCI Express 1.0 is less affected than in Battlefield 3 and the differences are still tiny.
With CrossFire on, the gaps are a little bit bigger with the 3.0 solutions still having the advantage.
Civilization V uses quite a successful DirectX 11 engine. We used the built-in benchmark with all settings pushed to a maximum with shadows and reflections. We used MSAA 8x.
While the x16 version of PCI Express 3.0 dominates, for the first time in games there's a slight advantage for PCI Express 2.0 x16 over PCI-E 3.0 x8. This is true both with a single card and in CrossFire, and possibly results from a different usage of PCI-E in this relatively high load benchmark. Remember, PCI-E 2.0 x16 had a very slight advantage over PCI-E 3.0 x8 in our theoretical bandwidth readings.
When we look at the theoretical performance, we can see that some of the PCI Express 3.0 promise has been fulfilled. Firstly, the increase in bandwidth does have an impact, particularly with GPU to CPU transfers, though this isn't as significant in the opposite direction even if there are theoretical gains of (just) 50%. The reason for this more limited gain is difficult to pin down as we only have a single PCI Express 3.0 compatible platform for now, Intel's LGA 2011, and one PCI-E 3.0 graphics card, the Radeon HD 7970. Whether it comes from the interface, platform or card, it's difficult for us to say as yet.
In practice the OpenCL applications that we were able to test are for now far from being limited by PCI Express bandwidth. In already offering almost 7 GB/s, PCI Express 2.0 x16 covers most usages, even if certain very specific pro applications will certainly be able to take advantage of the theoretical gains.
On the gaming side, the increases in bandwidth aren't of much more use. While there's still a difference between PCI Express 2.0 x16 and x8 (a difference that is even more marked with PCI Express 1.0 x16 in CrossFire), in practice this boils down to no more than one percent between PCI Express 3.0 x16 and x8 modes with a single card and half a percent in CrossFire. This is positive for those who are hoping to use CrossFire on future Ivy Bridge platforms, whether with two cards at x8/x8 or three cards at x8/x4/x4, as PCI Express 3.0 x4 mode doesn't bring performance down too much.
Copyright © 1997-2015 BeHardware. All rights reserved.