Bandwidth, theoretical measurements
In gaming, PCI Express bus transactions mainly take place between the CPU and GPU. Some, such as uploading textures or geometry for a scene, happen only occasionally, while others are tied to the processing of every frame (data and commands sent by the driver). For a given graphics engine, the more frames per second a card can process, the greater the potential impact of the PCI Express link, if there is one.
Before measuring the impact in practice, we wanted to measure bandwidth more theoretically. For this we used the PCI Express bandwidth tests included in the NVIDIA CUDA (CUDA Toolkit 4.0) and AMD APP (version 2.4) development kits. As the tools differ, our scores for the GeForce and the Radeon aren't directly comparable. In practice it's the change from one mode to another that interests us here, rather than a comparison between the two brands.
We tested the following modes:
- x16 (graphics card in slot 1)
- x8 (graphics card in slot 2)
- x4 (graphics card in slot 1, with the additional pins deactivated)
- x4 connection via chipset (graphics card in slot 3)
Theoretical bandwidth (pinned memory)
To recap, PCI Express 2.0 signals at 5 GT/s per lane with 8b/10b encoding (8 bits of data transferred in a 10-bit symbol). In practice this leaves 4 Gbit/s, or 500 MB/s, of usable bandwidth per lane, which translates to a theoretical 8 GB/s for 16 lanes, before taking account of the cost of the transaction protocol.
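The arithmetic above can be checked in a few lines of Python (an illustration of the calculation, not a measurement):

```python
# Theoretical PCI Express 2.0 bandwidth per link width, before
# transaction-protocol overhead.
GT_PER_S = 5.0      # signalling rate per lane: 5 GT/s
ENCODING = 8 / 10   # 8b/10b: 8 data bits carried in each 10-bit symbol

gbit_per_lane = GT_PER_S * ENCODING       # 4 Gbit/s usable per lane
mb_per_lane = gbit_per_lane * 1000 / 8    # 500 MB/s per lane

for lanes in (16, 8, 4, 1):
    print("x%-2d : %4.1f GB/s" % (lanes, lanes * mb_per_lane / 1000))
```

For 16 lanes this gives the 8 GB/s theoretical ceiling quoted above.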
At x16, we got around 75% efficiency, which is in line with accepted standards. With the NVIDIA card, moreover, we got pretty much the same level of efficiency in the lower modes. On the AMD side, the different buffer size adds a bigger overhead in the lower modes, which explains why x16 was more than twice as fast as x8.
There are two remarks to be made, however. The first is that GPU-to-CPU transfer is generally faster than transfer in the other direction. The second is that while there's only a slight difference in GPU-to-CPU transfer between the x4 connection through the chipset and the x4 connection directly to the processor, the impact is higher in the other direction, from CPU to GPU. As this is the direction most used in gaming, it will be interesting to see what this translates to in practice. Note that the cost is almost double on the AMD card in this direction, with an 11% loss in performance between the x4 CPU interconnect and the x4 southbridge interconnect, against only a 5.5% loss on the NVIDIA side.
Finally, to reach these speeds, the tools allocate pinned memory on the system side: the operating system locks the memory pages so that they aren't subject to swapping, which guarantees that the data is actually in RAM. While in theory performance should be identical with standard allocations if there's sufficient memory, in practice the guarantee that the memory is in RAM allows the NVIDIA and AMD drivers to use more efficient copy algorithms.
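Page-locking is an operating system mechanism, not something specific to the graphics drivers. As a minimal sketch of the idea, the POSIX mlock() call can be reached from Python via ctypes (this is an illustration on Linux/macOS with default lock limits, not the vendors' actual driver code):

```python
import ctypes
import ctypes.util
import mmap

# Allocate one anonymous page, then ask the OS to lock it in RAM, which is
# the same guarantee "pinned" host buffers get from CUDA/OpenCL drivers.
PAGE = mmap.PAGESIZE
buf = mmap.mmap(-1, PAGE)
view = ctypes.c_char.from_buffer(buf)       # to get the page's address
addr = ctypes.addressof(view)

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
res = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))
if res == 0:
    print("page locked: it cannot be swapped out until munlock()")
    libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))
else:
    print("mlock refused (errno %d): RLIMIT_MEMLOCK may be too low"
          % ctypes.get_errno())
```

Locking a single page normally succeeds unprivileged; locking the hundreds of megabytes a real bandwidth test uses runs into the RLIMIT_MEMLOCK quota, which is one reason drivers expose pinned allocation through their own APIs.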
Theoretical bandwidth (pageable memory)
As developers can't always use pinned memory, we carried out a second test allocating system memory in the standard (pageable) way. This time we used OpenCL, via the Cloo library (version 0.9), to create a test that works on both cards. For information, note that while the consumer AMD drivers are OpenCL 1.1 compatible, the consumer NVIDIA drivers are still limited to OpenCL 1.0. An OpenCL 1.1 driver is available (and has been for a year…), but only for registered developers.
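The timing side of such a bandwidth test is simple in principle: repeat a transfer of known size and divide bytes by elapsed time. A rough sketch of that structure (the function and buffer names are our own, and the pyopencl call shown in the comment is an assumption; the article's test used the C# Cloo library):

```python
import time

def measure_gbps(copy_once, nbytes, iters=20):
    # Average several iterations: one transfer is too short to time reliably.
    start = time.perf_counter()
    for _ in range(iters):
        copy_once()
    elapsed = (time.perf_counter() - start) / iters
    return nbytes / elapsed / 1e9

# Stand-in "transfer": a plain host-side copy, so the sketch runs without a GPU.
src = bytearray(16 * 1024 * 1024)
dst = bytearray(len(src))

def host_copy():
    dst[:] = src

print("%.2f GB/s (host copy stand-in)" % measure_gbps(host_copy, len(src)))

# With a real device buffer, copy_once would enqueue and wait on an
# OpenCL transfer, e.g. with pyopencl:
#   cl.enqueue_copy(queue, device_buf, host_array).wait()
```

Waiting on each transfer before stopping the clock matters: OpenCL and CUDA copies are queued asynchronously, so timing only the enqueue call would measure the driver, not the bus.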
The bandwidth wasn't as high here, particularly at x16, but the trends are the same: GPU-to-CPU transfer is still faster than the other way round, and the x4 chipset interconnect still shows a more significant dip in CPU-to-GPU bandwidth.
Note that we also tried to measure latency, but this proved impossible due to software issues. Estimating latency by, for example, timing small data transfers means reading variations that go far beyond the latency of the bus itself. Whether it comes from the transaction layer (PCI Express works via a system of credits between pairs of hosts and devices) or from the driver itself, in practice the variability remains very wide, with variations of up to and over 100 microseconds across several hundred runs. Something to be borne in mind, then, is that in spite of its great flexibility and simplicity of implementation, the PCI Express bus remains difficult to measure in purely software terms: multiple software layers each add performance costs which are a good deal higher than those of the interface itself.
Let’s now move on to the practical tests in games.