The impact of PCI Express on GPUs - BeHardware
Written by Guillaume Louel
Published on June 10, 2011
With the different AMD and Intel platforms offering varying numbers of PCI Express lanes, the impact of this bus on graphics performance is often the subject of discussion. Are the PCI Express 2.0 x8 and x16 equivalent in terms of performance on our 3D cards? What’s the impact of a x4 interconnect width, and what are the differences between lanes that are directly linked to processors and those which transit via the southbridge? So many questions that we’re going to try and answer in this article!
PCI Express in brief
PCI Express is a bus for all seasons. With a relatively simple physical implementation, this point-to-point serial interface is organised in the form of lanes. Each lane is made up of two differential signal pairs, one transmitting data in each direction, making PCI Express a bi-directional serial bus. Bandwidth can be increased by combining several lanes so that they are used together (up to 32). In our PCs, the standard implementation is however 16 lanes, known as x16.
One of the particularities of PCI Express is that the number of lanes used is automatically negotiated between the host and the downstream device. An x16 peripheral, such as a graphics card, can still function at x1. The size of a slot on a motherboard isn’t what determines the running speed of the device connected to it. An x1 device in an x16 slot will run at x1, while an x16 device in a slot that is physically x16 but wired electrically at x4 will run at x4. Motherboard manuals indicate how slots are wired, and this increasingly depends on the processor and the chipset used.
With its simple and flexible physical implementation, PCI Express has become the backbone of the motherboard. The Intel DMI bus used to connect the processor to the motherboard southbridge is in reality a x4 PCI Express interconnect (2.0 on Sandy Bridge and the chipsets that accompany it). On laptops, ExpressCard exposes a PCI Express lane, and the recently announced Intel Thunderbolt is also an external extension directly based on PCI Express. A bus for all seasons.
We chose to look at performance in the different modes on a motherboard based on the Intel Z68 chipset. The board used for this test was an MSI Z68A-GD80 B3 (not to be confused with the G3 version presented at Computex). We'll come back to this in detail later, but it is designed for Intel Sandy Bridge processors, which have a 16-lane on-die PCI Express controller. Three x16 PCI Express slots (in blue in the photo below) are nevertheless included on the motherboard.
The first slot at the top is wired at x16. The second slot is only wired at x8; in practice, if a card is inserted into it, the first slot also drops to x8, as four Pericom chips act as a switch to divide the 16 lanes into two groups of 8. The last slot at the bottom is linked to the chipset with four lanes. It therefore shares the x4 link which connects the chipset to the processor with the rest of the peripherals (network, drives and so on) connected to the southbridge, introducing another potential bandwidth bottleneck. To carry out our tests we used two high-end GPUs from AMD and NVIDIA with relatively equivalent performance, the Radeon HD 6970 and the GeForce GTX 570. Thanks go to Nicolas et Fils for the loan of certain products.
Bandwidth, theoretical measurements
In gaming, PCI Express bus transactions mainly take place between the CPU and GPU. Some transfers, such as uploading textures or scene data, happen only occasionally, while others occur for every frame (data and commands sent by the driver). For a given graphics engine, the more frames per second a card can render, the greater the potential impact of PCI Express, if there is one.
Before measuring the impact in practice, we wanted to measure bandwidth more theoretically. For this we used the PCI Express bandwidth tests included in the NVIDIA CUDA (CUDA Toolkit 4.0) and AMD APP (version 2.4) development kits. As these tools are implemented differently, our scores for the GeForce and Radeon aren’t directly comparable. In practice it's the change from one mode to another that interests us here, rather than a comparison between the two brands.
We tested the following modes:
- x16 (graphics card in slot 1)
- x8 (graphics card in slot 2)
- x4 (graphics card in slot 1, with the extra contact pins masked to force x4)
- x4 connection via chipset (graphics card in slot 3)
Theoretical bandwidth (pinned memory)
To recap, the theoretical bandwidth of PCI Express 2.0 is 5 GT/s per lane, with 8b/10b encoding (8 data bits transferred using a 10-bit symbol). In practice this gives us 4 Gbit/s, or 500 MB/s per lane, which translates to a theoretical bandwidth of 8 GB/s for 16 lanes, without taking account of the overhead of the transaction protocol.
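These figures can be checked with a quick back-of-the-envelope calculation (a minimal sketch; the constants are the PCI Express 2.0 values quoted above):

```python
# PCI Express 2.0 theoretical bandwidth, per direction.
SIGNALLING_RATE_GT_S = 5.0   # 5 GT/s per lane
ENCODING = 8 / 10            # 8b/10b: 8 data bits per 10-bit symbol

def lane_bandwidth_mb_s():
    # 5 GT/s x 8/10 = 4 Gbit/s = 500 MB/s of payload per lane
    return SIGNALLING_RATE_GT_S * ENCODING * 1000 / 8

def link_bandwidth_gb_s(lanes):
    # Lanes combine linearly: x16 reaches 8 GB/s in theory
    return lane_bandwidth_mb_s() * lanes / 1000

for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: {link_bandwidth_gb_s(lanes):.1f} GB/s theoretical")
```

At the roughly 75% efficiency we measured at x16, the 8 GB/s theoretical figure corresponds to around 6 GB/s in practice once protocol overhead is taken into account.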
At x16, we got around 75% efficiency, which corresponds to accepted standards. With the NVIDIA card moreover, we pretty much got the same levels of efficiency in the lower modes. On the AMD side, the different buffer size adds a bigger overhead, which explains why x16 was more than two times faster than x8.
There are two remarks to be made however. The first is that GPU to CPU transfer is generally faster than transfer in the other direction. The second is that while there’s only a slight difference in GPU to CPU transfer between the x4 connection to the chipset and the x4 connection directly to the processor, the impact is higher in the other direction, from CPU to GPU. As this is the direction most used in gaming, it will be interesting to see how this translates in practice. Note that the cost is almost double with the AMD card in this direction: an 11% loss in performance going from the x4 CPU interconnect to the x4 southbridge interconnect, against only a 5.5% loss on the NVIDIA side.
Finally, to reach these speeds, the tools reserve pinned memory on the system side: the operating system locks the memory pages so that they can't be swapped out, guaranteeing that the data stays in physical RAM. While in theory performance should be identical as long as there’s sufficient memory, in practice knowing that the memory is resident in RAM allows the NVIDIA/AMD drivers to use more efficient copy mechanisms.
Theoretical bandwidth (pageable memory)
As developers can’t always use pinned memory, we carried out a second test reserving system memory in the standard way. This time we used OpenCL, via the Cloo library (version 0.9), to create a test that works on both cards. Note that while AMD's general consumer drivers are OpenCL 1.1 compatible, NVIDIA's are still limited to OpenCL 1.0. A 1.1 driver is available (and has been for a year…), but only for registered developers.
The bandwidth wasn’t as high here, particularly at x16, but the trends are the same. GPU to CPU transfer is still faster than the other way round, and with the x4 chipset interconnect there’s still a more marked dip in bandwidth for CPU to GPU transfer.
Note that we also tried to measure latency, something that proved impossible due to software issues. Estimating latency by, for example, measuring the time required for small data transfers means reading variations that go far beyond the latency of the bus itself. Whether because of the transaction layer (PCI Express works via a system of flow-control credits between host and device) or the driver itself, variability remains very wide in practice, with variance over several hundred runs of up to and over 100 microseconds. Something to bear in mind, then, is that in spite of its great flexibility and simplicity of implementation, the PCI Express bus remains difficult to measure in purely software terms: multiple software layers each add performance costs that are a good deal higher than those of the interface itself.
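To illustrate the problem, a naive timing harness of this kind (a hypothetical sketch, not the tool we used, and timing a plain in-memory copy rather than a bus transfer) already shows run-to-run spreads far wider than the latency of the bus itself:

```python
import statistics
import time

def time_copies(n_runs=500, size=4096):
    """Time n_runs small buffer copies, returning per-run durations in microseconds."""
    src = bytearray(size)
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        dst = bytes(src)  # a plain memory copy; no PCI Express transfer involved
        t1 = time.perf_counter()
        samples.append((t1 - t0) * 1e6)
    return samples

samples = time_copies()
print(f"mean:   {statistics.mean(samples):.2f} us")
print(f"stdev:  {statistics.stdev(samples):.2f} us")
print(f"spread: {min(samples):.2f} - {max(samples):.2f} us")
```

Scheduler preemption, timer resolution and the software layers sitting above the bus all contribute to the spread, which is why timings like these cannot isolate a latency measured in nanoseconds.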
Let’s now move on to the practical tests in games.
Crysis Warhead, FarCry 2, Metro 2033
We used the most recent drivers available on the date of our test, namely AMD Catalyst 11.5 and NVIDIA GeForce 270.61. We tested at three resolutions to measure the impact of each: 1280 x 1024, 1680 x 1050 and 1920 x 1200.
The test configuration was as follows:
- MSI Z68A-GD80 B3 motherboard
- 2x2 GB DDR3 1333 MHz
- Intel Core i7 2600K
- NVIDIA GeForce GTX 570 / AMD Radeon HD 6970
- Windows 7 Ultimate 64 bit
We used version 1.1 of Crysis Warhead in 64-bit.
The way the NVIDIA and AMD cards are managed by their respective drivers can be subject to significant differences. While the game is responsible for DirectX calls, the driver has the last word on how these calls and transfers are executed. So while there’s only a small difference between x16 and x8 with the AMD card (around 2%), this goes up with the NVIDIA card (around 4%). As we’ll see, this can vary significantly from one game to another.
While the impact at x8 is slight, it starts to make itself felt more strongly at x4, once again with disparities between AMD and NVIDIA. There’s an 8% cost with the AMD card and up to 18% with NVIDIA at 1280. This is no surprise as the variations are highest when most frames are displayed on screen.
Interestingly, using the x4 port on the southbridge shows the opposite trend. Where there’s a 2.5% performance difference with the NVIDIA card between the x4 CPU interconnect and the x4 southbridge interconnect, this rises to 5.5% for AMD. In the theoretical tests we noted a slightly higher performance cost on the southbridge for pinned CPU to GPU transfers on the AMD platform, and this was borne out in practice here.
We used version 1.03 of Far Cry 2.
Certain trends were inverted here, with the NVIDIA card proving more efficient at PCI Express x8. The loss in performance was nevertheless higher than in Crysis because of the higher framerate: 5.5% on the GeForce and a little over 7.5% on the Radeon. At PCI Express x4 we see the same trend in terms of efficiency, with a performance cost of up to 12% for NVIDIA and a little over 16.5% for AMD at 1280.
One thing remained unchanged however, the difference in efficiency between the PCI Express x4 CPU interconnect and the southbridge. There’s only a 4.5% difference with the NVIDIA card and more than 8% for AMD.
We used the latest version to date of Metro 2033 (patch 2 via Steam) in DirectX 11 (AAA).
With a lower frames per second count in what is a very resource-hungry graphics engine, the performance differences were smaller. Switching to x8 implies a performance cost of only 2.5%, with a slight trend in AMD’s favour, which is confirmed at x4, where the cost is “just” 7% against practically 10% for NVIDIA.
NVIDIA cards still do better with the x4 southbridge interconnect, with a performance loss of just 3% against a little over 5.5% for AMD.
We calculated the averages, giving an index of 100 to performance at x16.
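The normalisation itself is trivial: each mode's average framerate is expressed relative to x16, which is given an index of 100 (the figures below are hypothetical and only illustrate the method):

```python
def to_index(fps_by_mode, reference="x16"):
    """Normalise average fps so that the reference mode scores 100."""
    ref = fps_by_mode[reference]
    return {mode: round(100 * fps / ref, 1) for mode, fps in fps_by_mode.items()}

# Hypothetical averages (fps) for one card at one resolution
fps = {"x16": 80.0, "x8": 77.0, "x4": 70.0, "x4 chipset": 66.0}
print(to_index(fps))  # x8 at 77 fps is a 3.75% loss relative to x16
```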
While we noted a lower difference in performance for the NVIDIA card between the x4 CPU interconnect and the x4 chipset interconnect, in the end we measured an identical difference for both NVIDIA and AMD between the x4 chipset interconnect and x16. We measured this cost at up to 19% on both platforms at 1280 x 1024. The AMD card did slightly better than the NVIDIA with the x4 CPU interconnect.
What is more important in practice is that there was a similar loss of 3.6/3.7% on both cards at x8 at 1920 x 1200. Reducing the resolution, and thereby increasing the number of frames per second, has a higher impact with the NVIDIA card than the AMD one, with respective performance losses of 4.5% and 3.8% relative to x16.
PCI Express plays a role in performance in practice with high-end graphics cards. While we didn’t really expect to be able to measure much of a difference in performance between PCI Express x16 and x8, we actually found that there can be as much as a 7.8% difference in the most extreme case in games. This however has to be put in the context of a high framerate at which any performance impact is less likely to affect playability. At lower framerates, the impact was lower, with a cost of around 3% in Crysis and 1 to 2% in Metro 2033 at 1920 x 1200.
Note also that we wanted to test any impact of the Pericom switches, through which the PCI Express lanes pass from the second port cabled at x8, by comparing performance on the second port with that on the first and limiting the card to x8. In this case, there was no difference to report, the cards giving strictly identical performance.
The performance cost is higher at x4, with a loss varying between 5.4 and 26.5% with the x4 CPU interconnect and between 8.7 and 32.5% with the x4 chipset interconnect. Using a graphics card in such a slot should therefore be avoided, though many motherboards do offer a second or third PCI Express x16 slot wired this way.
These tests showed that the impact of PCI Express is not the same from one game to another. This makes sense as not all graphics engines are identical and use the PCI Express bus differently. More interestingly, in spite of the fact that the graphics cards have a similar level of performance overall, we can see that the way PCI Express is implemented differs from one manufacturer to another. Performance at x4 generally echoes what occurs at x8, except more so.
A cut-down Radeon HD 6970 at x4
The impact of PCI Express lanes coming from the chipset also varies between AMD and NVIDIA, but the results are consistent: whether in our theoretical memory bandwidth tests or our practical tests in games, there is more of a performance spread with the AMD solution. The way the AMD driver handles accesses to the graphics card seems to be the reason for this, which is disappointing given that AMD enables the use of CrossFire on such a port.
While we did note performance differences between x16 and x8 in single GPU mode, we can expect at least equivalent differences in CrossFire and SLI, depending on whether the x16/x16 or x8/x8 ports are used. Once again, the importance of the drivers and the way they handle the exchange of data between the system and the graphics card, or even between cards, makes it difficult to predict the practical impact with precision. We’ll come back to this in a separate article so as to measure the impact of these modes in CrossFire and SLI more precisely.
Copyright © 1997-2014 BeHardware. All rights reserved.