We measured the performance of this platform with three different processor configurations:
- 2x Xeon E5-2687W
- 1x Xeon E5-2687W
- 1x Core i7 3960X
Otherwise our configuration was as follows:
- Asus Z9PE-D8 WS motherboard
- 8 x 4 GB DDR3 1600 9-9-9
- Corsair F120 SSD (system)
- OCZ Vertex 3 MaxIOPS SSD (benchmarks)
- Radeon HD 6670
- Corsair TX 850 power supply
- Windows 7 64-bit SP1
In the single-socket tests, the quantity of RAM is of course halved, though in practice this has no impact on our benchmarks. On the operating system side, it's important to note that Windows 7 supports platforms with up to two sockets; beyond that, the server OS (Windows Server 2008 R2) is required. In practice the Windows 7 kernel supports NUMA natively, and 2008 R2 adds nothing on this type of platform in terms of processor performance.
We also measured energy consumption at the wall in three scenarios: at idle, under load in Cinebench, and under load in Prime95.
At idle, the consumption of our Xeon E5 on its own is equivalent to that of the Core i7 3960X. Under load, note that its two additional cores push consumption at the wall up by a little more than 29 watts in Prime95.
On the 2S platform, while idle consumption remains contained, under load we reach around 500 watts for the complete platform! With the registered ECC memory, which is significantly more power-hungry, we measured up to 541 watts at the wall in Prime95!
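As a quick sanity check on these figures (the 29 W delta and core counts come from the measurements above; the even per-core split is our own simplifying assumption):

```python
# Back-of-the-envelope check on the Prime95 wall-power figures above.
# Splitting the delta evenly across cores is an illustrative assumption.
extra_cores = 2          # Xeon E5-2687W (8 cores) vs Core i7 3960X (6 cores)
delta_watts = 29         # measured increase at the wall in Prime95
per_core = delta_watts / extra_cores
print(f"~{per_core:.1f} W per additional core under Prime95 load")
```

That works out to roughly 14.5 W per core at full load, a plausible order of magnitude for this architecture.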
We spent some time on the theoretical performance of the memory controllers and took the opportunity to look at the limitations of certain benchmarks. First we measured the memory latency via AIDA64:
Note here a very small advantage for the Xeon E5 over the Core i7, the most important reading being of course the latency measured in 2S (two-socket) mode: using two processors simultaneously adds, in spite of NUMA, some twenty nanoseconds to the average latency. If you've read the beginning of this article, this higher latency won't surprise you! It could, however, be a factor affecting how well real-world performance scales.
Let's finish our memory measurements with RMMT, the multithreaded benchmark included in RightMark. To bypass the large L3 cache, the memory operations were carried out on 32 MB blocks per core. As we said before, this benchmark is limited to eight threads and forces affinity onto the cores in a non-optimal fashion. We therefore limited ourselves to four cores per die so as to be able to use both memory controllers.
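Selecting four cores per die so that both memory controllers are exercised can be sketched as follows. This is a minimal illustration assuming a hypothetical linear core numbering (cores 0–7 on the first die, 8–15 on the second); real topologies should be queried from the OS rather than assumed:

```python
def pick_cores(cores_per_die, dies, per_die):
    """Return the core IDs to pin, taking `per_die` cores from each die.

    Assumes cores are numbered contiguously per die (an illustrative
    assumption; actual enumeration depends on the platform and OS)."""
    pinned = []
    for die in range(dies):
        first = die * cores_per_die
        pinned.extend(range(first, first + per_die))
    return pinned

# Four cores on each of the two 8-core dies, so that both
# integrated memory controllers see traffic:
print(pick_cores(cores_per_die=8, dies=2, per_die=4))
# → [0, 1, 2, 3, 8, 9, 10, 11]
```

Spreading the threads across both dies like this is what lets the aggregate bandwidth of the two controllers show up in the results.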
As expected, performance rockets thanks to NUMA and we were just a hair's breadth from 90 GB/s of total read bandwidth! While memory bandwidth isn't always a limiting factor for general consumer applications, don't forget that here there are 32 threads to feed. This bandwidth probably won't go unused… Enough theoretical readings, however; let's move on to the practical ones (at last!).
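To put that figure in perspective (the 90 GB/s and 32 threads come from the measurements above; the even split across threads is our simplification):

```python
# Rough per-thread share of the measured aggregate bandwidth.
# An even split is an illustrative assumption, not a measurement.
total_bw_gbs = 90    # aggregate read bandwidth measured with RMMT
threads = 32         # 2 x 8 cores with Hyper-Threading
print(f"~{total_bw_gbs / threads:.1f} GB/s per thread")
# → ~2.8 GB/s per thread
```

Under 3 GB/s per thread is far from extravagant once all 32 threads are active, which is why this bandwidth is unlikely to sit idle.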