DDR3: Impact of channels & timings - BeHardware
Written by Guillaume Louel
Published on January 12, 2011
Since the advent of DDR3, the question of the impact of memory on overall machine performance seems to have been pushed into the background somewhat. While all discussion on DDR2 was focussed around latency, moving across to DDR3 moved the goalposts.
This is partly because of certain decisions taken by JEDEC for the official DDR3 specs. The accent was put on reducing energy consumption and increasing bandwidth.
In the meantime, memory controllers have adapted to these changes, most importantly with the integration of the memory controller into the processor (historically it was built into the northbridge). AMD adopted an on-chip memory controller in 2003 with its Athlon 64 (with DDR memory at the time). Intel followed later with the introduction of its Core i CPUs (sockets 1366 and 1156). Processor caches have also grown, and level 3 cache has been rolled out across the board to better hide latency. The impact reached as far as the pipelines, because anticipating memory operations as early as possible has become a sine qua non for architecture engineers.
With all this effort going into the mitigation of the impact of latency on memory, is it really now the case that the only question that matters in any discussion on memory is bandwidth? After all, the third memory channel on Core i7s is often described as giving little improvement in performance.
We're going to try and take a closer look at these issues to get a clearer picture of the current situation with respect to memory and the various platforms in use: AMD's socket AM3 and Intel's sockets 1155 (Sandy Bridge), 1156 (Lynnfield/Clarkdale) and 1366 (Nehalem).
The platforms, the test
The platforms
We looked at the four platforms currently in use, namely AMD's socket AM3/890 and Intel's sockets 1155/P67, 1156/P55 and 1366/X58. While they all have an on-chip memory controller, each has its particularities. For example, since the Phenom, AMD has made the memory access mode known as 'unganged' its default mode. The unganged mode allows memory controllers to process two 64-bit operations at the same time (including a read and a write) and, being the default mode now uniformly adopted, this was the mode we used.
Socket 1366 CPUs were the first to integrate a triple channel memory controller. It also allows an asymmetric mode using four (or five) memory modules. With four 2 GB modules, for example, two distinct physical memory spaces are defined: one of 6 GB accessible in triple channel and a second of 2 GB, corresponding to the stand-alone module, which is instead accessed in single channel. This isn't exclusive to Intel but Intel claims to have worked hard to maximise performance in this mode.
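The split described above can be illustrated with a small sketch. It is not Intel's actual mapping logic: the function name and the greedy "interleave whatever all populated channels still have in common" rule are assumptions made for illustration, and only the four-module case (6 GB triple channel plus 2 GB single channel) comes from the article.

```python
def split_memory_regions(gb_per_channel):
    """Split installed memory into interleaved regions (illustrative only).

    gb_per_channel: capacity in GB installed on each channel.
    Returns a list of (channels_interleaved, region_size_gb) tuples.
    """
    remaining = list(gb_per_channel)
    regions = []
    while any(r > 0 for r in remaining):
        # Channels that still have unmapped capacity.
        active = [r for r in remaining if r > 0]
        # The amount every active channel can contribute equally.
        chunk = min(active)
        regions.append((len(active), chunk * len(active)))
        remaining = [r - chunk if r > 0 else 0 for r in remaining]
    return regions


# Four 2 GB modules on three channels: one channel holds two modules.
print(split_memory_regions([4, 2, 2]))
```

With `[4, 2, 2]` the sketch yields one 6 GB region spread over three channels and one 2 GB region on a single channel, matching the example in the text.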
Although quite recent, the on-chip socket 1156 (Lynnfield) memory controller is the simplest. It's a dual channel controller with no particularities beyond this. Note however that Intel does implement segmentation across models. You can use DDR3-1600 with a Core i7 but not with a Core i5, as the Core i5 doesn't have the multipliers required to go beyond DDR3-1333 without changing the base clock (Bclk).
With Sandy Bridge and socket 1155, which was launched at the beginning of 2011, Intel moved the goalposts somewhat in terms of overclocking. As it isn't at all simple to raise the Bclk by any more than 7 MHz in practice, Intel decided to compensate by unlocking the memory multiplier. You can therefore use memory going up to 2400 MHz, if you're using a P67 motherboard.
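To make the relationship between Bclk and memory speed concrete, here is a minimal sketch. It assumes the memory data rate scales linearly with Bclk at a fixed memory multiplier; the nominal Bclk values used (133.33 MHz for LGA1156, 100 MHz for LGA1155) are not given in the article and are stated here as assumptions, as are the function names.

```python
def scaled_data_rate(nominal_rate, nominal_bclk, actual_bclk):
    """Memory data rate after a Bclk change, at a fixed memory multiplier."""
    return nominal_rate * actual_bclk / nominal_bclk


def bclk_for_target(target_rate, nominal_rate, nominal_bclk):
    """Bclk needed to reach target_rate when the multiplier tops out at nominal_rate."""
    return nominal_bclk * target_rate / nominal_rate


# Assumed nominal Bclk of 133.33 MHz: a CPU capped at the DDR3-1333
# multiplier would need roughly this Bclk to run memory at 1600.
print(round(bclk_for_target(1600, 1333.33, 133.33), 1))

# Assumed nominal Bclk of 100 MHz: a 7 MHz increase moves a
# DDR3-1333 setting only to about this data rate.
print(round(scaled_data_rate(1333, 100, 107), 1))
```

Under these assumptions, a 7 MHz Bclk headroom buys only a few percent of memory speed, which is why an unlocked memory multiplier matters on this platform.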
The test
To carry out the tests, we used the following platforms. For each platform, we used two processors with a different number of cores, whether physical (4 and 6 physical cores on AM3 and LGA1366) or logical (no Hyper-Threading on the Core i5 750 and Core i5 2500K).
- Asus Rampage II Gene (LGA1366)
- Intel Core i7 975X (4C/8T, 3.33 GHz) and Core i7 980X (6C/12T, 3.33 GHz)
- Gigabyte 890GPA-UD3H (AM3)
- AMD Phenom II X4 965 (4C/4T, 3.4 GHz) and Phenom II X6 1090T (6C/6T, 3.2 GHz)
- Asus P7P55D Deluxe (LGA1156)
- Intel Core i5 750 (4C/4T, 2.66 GHz) and Core i7 860 (4C/8T, 2.8 GHz)
- Asus P8P67 (LGA1155)
- Intel Core i5 2500K (4C/4T, 3.3 GHz) and Core i7 2600K (4C/8T, 3.4 GHz)
- GeForce GTX 480, Forceware 260.99 WHQL
- Samsung HD 501LJ 500 GB + Western Digital Raptor 300 GB
- Windows 7 64-bit
We used several memory kits (G.Skill and Corsair) and would like to thank Nicolas et Fils for providing us with some components. In terms of software, we used several theoretical tests in Aida64 and RightMark Memory Tester, as well as applied tests in 7-Zip (whose capacity to use a third memory channel we have mentioned before), Avidemux and x264 for video encoding and GTA IV as our gaming representative.
Impact of the number of channels
We started by looking at the impact of the number of memory channels on performance, a particularly worthwhile question when thinking about the Core i7s and their triple channel controller.
Latency
We started by checking the impact of the number of channels on latency and memory bandwidth. We used DDR3-1333 9-9-9-24 for all these tests. We used Aida64 (formerly Everest) to measure the latency.
Going from one to two channels has a negligible impact on the Intel platforms and zero impact on AMD thanks to unganged mode. Triple channel mode increases latency significantly, by almost 10 ns on each of our processors. When you use four modules (what we are calling "Quad") on LGA 1366 processors, the score is relatively low, a function of having two distinct memory spaces. Note that there is a slight optimisation on the Sandy Bridge platform, as latency in dual channel is slightly lower here than in single channel, in contrast to the previous platforms.
Bandwidth
Now let's move on to memory bandwidth in reads, again taken using Aida64:
There are several points to note, first of all the quite significant gap on the Intel LGA 1366 platform. In spite of an equal clock and more cores, the uncore part of the Core i7 980X runs more slowly on the 32nm Gulftowns: 2.0 GHz instead of 2.66 GHz. Socket 1155 manages to obtain 1 GB/s more bandwidth than socket 1156, which explains the jump in performance for Sandy Bridge in tests limited by memory. On the AMD side, bandwidth is markedly lower.
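For context when reading measured figures like these, the theoretical peak bandwidth of a DDR3 configuration can be estimated from the data rate, the 64-bit (8-byte) channel width and the number of channels. The sketch below is only that upper bound, not a model of what Aida64 measures; the function name is hypothetical.

```python
def peak_bandwidth_gbs(data_rate_mts, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    data_rate_mts: memory data rate in mega-transfers per second,
                   e.g. 1333 for DDR3-1333.
    channels:      number of populated memory channels.
    bus_bytes:     bytes moved per transfer per channel (64-bit bus).
    """
    return data_rate_mts * 1e6 * bus_bytes * channels / 1e9


# DDR3-1333, as used for the channel tests in this article.
for channels in (1, 2, 3):
    print(channels, "channel(s):", round(peak_bandwidth_gbs(1333, channels), 1), "GB/s")
```

For DDR3-1333 this gives roughly 10.7, 21.3 and 32.0 GB/s for one, two and three channels. Measured read bandwidth, particularly from a single thread, sits well below these ceilings.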
The Aida test only uses one thread for its memory reads. We therefore took the same bandwidth measurements in RightMark, which uses one software thread per hardware core/thread (up to 8):
The situation improves a bit for the AMD offer but it's still the least efficient in dual channel mode. The gap is a lot bigger this time between the 975X and the 980X. Interestingly, the hierarchy you get with RMMT with a single thread is the same as with Aida64. With the same number of channels, the Sandy Bridges manage to dethrone the 975X which was previously top dog in this test.
7-Zip
We used the LZMA2 compression mode in 7-Zip. It's multithreaded and very demanding. The dictionary was set at 32 MB and we used one software thread per hardware CPU core/thread. With 12 threads, it requires 3.53 GB. We used a single 4 GB module in single channel mode to avoid any swapping to disk.
The additional cores particularly boost Phenom performance, independently of the number of channels. On the Core i7 1366s the difference in single channel is negligible. Going from 8 to 12 threads doesn't change anything as the bandwidth is already saturated by the first 8 threads. Going from dual to triple channel brings a 7.3% gain in performance on our high-end quad core, while here the six core sees a gain of 12.3% in spite of the lower theoretical bandwidth. On LGA 1156, the gain with a second channel is quite limited with the i5-750 (3.7%), against 11.2% with the i7-860 which supports Hyper-Threading. The gain is similarly limited on LGA 1155, but note that, thanks to better memory bandwidth management, the relative performance of the new Sandy Bridge CPUs stands out.
Avidemux/x264
We used Avidemux to compress an MPEG-2 720p transport stream source file to H.264, using the x264 codec.
With a slim or nonexistent difference between single and dual channel, the results then plateau. The number of memory channels is not the limiting factor for the Avidemux/x264 pairing.
Grand Theft Auto IV
Here we're looking at frames per second on a demanding motorway scene, with a resolution of 1280x1024. Patch 126.96.36.199 was applied.
Going from dual to triple channel brings a very slight gain on the two processors that have this facility, but the biggest difference still comes when going from one to two channels.
Let's now move on to the impact of the memory clock and timings on performance. For each platform we checked performance with DDR3 clocked at 800, 1066, 1333 and 1600 MHz as well as the following timings where supported: 7-7-7-19, 8-8-8-20 and 9-9-9-24. We also added, for Sandy Bridge, tests at 1866 and 2133 MHz, with timings of 9-9-9-24 and 8-8-8-20 (the latter at 1866 MHz only).
Latency
We start with latency, measured with Aida64 as before.
It's interesting to note that the memory controller lag on the Gulftowns makes itself felt above all with DDR3-1333 and 1600 where the advantage in favour of the Core i7 975X is higher.
It is above all important to note that in contrast to DDR and DDR2, the clock is always more significant than the timings. This is true across all the platforms, with a single exception: between DDR3-1333 9-9-9-24 and DDR3-1066 7-7-7-19 and this only on Core i7 platforms on LGA1366. Note that we drop below 40ns on the Sandy Bridge platform at 2133 MHz.
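One way to see why the clock can outweigh the timings is to convert CAS latency from cycles to nanoseconds. The sketch below uses the standard relation for DDR memory (the I/O clock is half the data rate); the function name is an assumption. It covers only the CAS delay of the modules, which is one component of the total latency Aida64 reports.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """CAS latency in nanoseconds for DDR memory.

    The I/O clock in MHz is half the data rate, so one cycle lasts
    2000 / data_rate_mts nanoseconds.
    """
    return cas_cycles * 2000.0 / data_rate_mts


# Kits of the kind compared in this article: (data rate, CAS cycles).
kits = [(1066, 7), (1333, 7), (1333, 9), (1600, 9), (2133, 9)]
for rate, cl in kits:
    print(f"DDR3-{rate} CL{cl}: {cas_latency_ns(rate, cl):.2f} ns")
```

DDR3-1333 CL9 works out to about 13.5 ns and DDR3-1066 CL7 to about 13.1 ns, so the two are very close, which fits with this pair being the one place where the looser, faster kit does not win everywhere. Since the rest of the memory path also speeds up with the clock, the CAS delay alone does not decide the outcome.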
Bandwidth (mono-threaded)
Let's move on with the mono-threaded bandwidth readings taken with Aida64.
In terms of bandwidth, it makes sense that the clock should count more than the timings, but perhaps not as much as you might think. Here we limited all our processors to just one thread, which brings out the impact of timings on bandwidth. The socket 1155 processors are by far the most efficient, even at high clocks where the on-chip memory controller doesn't flag. The 980X is disappointing while the Phenom IIs confirm what we saw when we looked at the impact of memory channels.
Bandwidth (multithreaded)
Now for the theoretical multithreaded bandwidth readings with RMMT.
At equal clocks, the Core i7 975X dominates the rest of the panel. The Core i7 2600K gives a jump of 11% on the Core i7 860. Note that DDR3-1600 memory doesn't bring any advantage in terms of bandwidth to the Phenom II X4 965. Although the Phenom II X6 1090T and its new die obtain slightly more, the gain in performance is very slight indeed. A bandwidth limitation can already be felt on the AMD platforms between DDR3-1066 and DDR3-1333. Let's hope that Bulldozer will correct this issue.
7-Zip, Avidemux, GTA IV
We'll now finish with our applied tests, for which we look at the impact of clocks and timings on performance.
The 7-Zip compression test confirms the significance of bandwidth over timings, even if the timings do play a role. The Phenom II X4 965's poor showing with DDR3-1600 confirms what we saw in the theoretical tests. The impact of memory on performance is significant indeed: between the slowest and fastest of the memories, the time needed for compression is reduced by 25%. The gains given by the higher clock on the LGA 1155 processors are tiny, however.
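A reduction in compression time is not the same number as the corresponding gain in throughput, which matters when comparing this figure with the percentage gains quoted elsewhere in the article. A short sketch of the conversion (the function name is hypothetical):

```python
def speedup_from_time_reduction(reduction):
    """Throughput ratio implied by a fractional reduction in run time.

    reduction: fraction of time saved, e.g. 0.25 for "25% less time".
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# The 25% reduction in compression time between slowest and fastest memory.
ratio = speedup_from_time_reduction(0.25)
print(f"{(ratio - 1) * 100:.1f}% more work per unit of time")
```

A job that takes 25% less time is running about a third faster in terms of throughput.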
Not all compression workloads, however advanced they may be, are necessarily limited by memory. Avidemux and x264 demonstrate this here. Lower latency (from tighter timings) and a higher clock both reduce compression times, but only by a maximum of 5%. Once again the clock has more impact than timings. The LGA 1155s can't make efficient use of memory at 1866 or 2133 MHz in this test.
Grand Theft Auto IV
With a gap of up to 20% between the fastest and slowest, memory speed makes a real difference in GTA IV. Once again increasing the clock has more of an impact than timings. This is the test which demonstrates the best use of available additional bandwidth, whether triple channel with the Core i7 1366s or faster clocked memory with the LGA 1155s.
There are a few conclusions to be drawn from our tests. The first is that, though there may have been a debate on whether clocks or timings had the most impact with DDR2, now the issue is closed: clocks have the biggest impact in almost all cases. If you have the choice between DDR3-1333 CAS9 memory and DDR3-1066 CAS7, don't hesitate. Go for the DDR3-1333. The only exception to the rule is that DDR3-1600 CAS9 is slower in practice than DDR3-1333 CAS7 on AM3.
Does this mean that latency is no longer an issue? Not really. It continues to play a major role in many tests. However, increasing the clock has more of an impact on latency than a jump or two down in timings. When you add this to the advantage of additional bandwidth, the choice is easy.
Looking at performance in the theoretical tests, the efforts made by Intel on its LGA1156 and LGA1155 platforms stand out. The on-chip memory controllers here are particularly fast, and Intel's removal of the restriction on the memory multiplier for the Sandy Bridge processors is definitely worth taking advantage of. Although this gives the platform great scores in the theoretical tests, in practice the gain given by very high-end memory is relatively small, except in the very particular case of GTA IV. It's worth weighing up the extra cost of these solutions first. Don't forget they also generally need dedicated cooling.
In terms of Intel's high-end platform, LGA 1366, we can confirm several things. Firstly, using triple channel mode causes a significant increase in latency, which considerably limits the gains given by the third memory channel in many cases. Next, the slower 980X memory controller only manages to hold up against the quad core model thanks to its additional two cores. It'll be interesting to see, in the third quarter of 2011, what LGA2011, the successor to LGA1366, brings to the party. A fourth memory channel is on the agenda, which should mean that the memory density on server platforms can be easily increased. It will however be interesting to see what impact this has on performance.
Lastly, the AMD offer is showing its age. The on-chip memory controller on the Phenom IIs struggles to benefit fully from DDR3-1333 and 1600 memory, as bandwidth usage hits a ceiling far too quickly. Performance in the monothreaded test also shows the limits of an architecture whose successor, Bulldozer, is eagerly awaited and should become available in the second quarter.