AMD FX-8150 and FX-6100, Bulldozer arrives on AM3+ - BeHardware

Written by Marc Prieur

Published on October 12, 2011


Page 1

A new architecture

At last! After being pushed back several times, the AMD Bulldozer architecture has arrived on the market. Its roll-out on AM3+, codename Zambezi, is under the AMD FX brand, a name which recalls the glorious period of the K8 during which AMD did some serious damage to Intel and its Pentium 4s. Will the AMD FX live up to its name?

A high frequency CMT architecture
We did devote an article to Bulldozer last spring but it may be of some use to run back over the main points.

An 8-core processor is in fact made up of 4 modules. Within a module, the two cores share a certain number of components:

- the front-end, which groups the fetch unit and instruction decoding, as well as the L1 instruction cache that feeds these units;
- the floating point unit;
- the L2 cache.

AMD is claiming 80% of the performance of two full cores, in exchange for major efficiency improvements in terms of silicon area and energy consumption. Many other changes have also been made, both to the processing units themselves and the memory sub-system, in particular so the architecture can clock higher.

Bulldozer also supports 100% of current x86 instruction sets and is compatible with the latest versions of SSE4 (4.1 and 4.2) and AES-NI instructions which enable encryption acceleration. AVX, introduced by Intel with Sandy Bridge, and its 256-bit operands are included.

Note also that some instructions specific to Bulldozer have been introduced, grouped under the following names: XOP, FMA4 and CVT16. These instruction sets actually correspond to SSE5 (announced by AMD in 2007 but never implemented) adapted to the AVX format. XOP operates mainly on integer operands, FMA4 on 128-bit floating point numbers and CVT16 groups high precision floating point conversion instructions to medium and low precision floating points. FMA4, which allows the processing of a multiplication and addition in a single instruction, should, among other things, enable gains when used by applications. A different version, FMA3, will however be used by Intel. AMD will follow suit here, with Piledriver, the development of Bulldozer, adopting FMA3 and calling the durability of FMA4 into doubt.

Page 2
Zambezi becomes AMD FX

Bulldozer's desktop architecture, Zambezi, is now known under the name AMD FX. AMD FX is in effect a chip etched at 32nm by GlobalFoundries with around 1.2 billion transistors on a surface area of 315 mm² (according to AMD's revised numbers).

For comparison, here are some numbers for the Thuban (Phenom II X6), Deneb (Phenom II X4), Sandy Bridge (Core i7 LGA 1155) and Lynnfield (Core i7 LGA 1156) chips:

Zambezi is a real monster in comparison to what we've seen before, but also in comparison to the competition. It has 20% more transistors than Sandy Bridge and this in spite of the fact that Sandy Bridge has an IGP that takes up a significant part of the chip. The size of Zambezi can be explained to a great extent by the fact that, combining the L2 and L3, the cache has been extended to 16 MB instead of 9 MB at best on the previous generation and 9 MB with Intel. The downside of this is of course that production costs are likely to be higher and that it will be harder to produce fully functional chips.
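The 16 MB and 9 MB totals can be checked with a quick sum. This is only a sketch: the per-module and per-core L2 sizes and the L3 sizes below are the commonly published figures for each chip, supplied here for illustration rather than taken from this page.

```python
# Combined L2 + L3 capacity in MB.
# Assumption: sizes are the commonly published figures for each chip.
chips = {
    "Zambezi (FX-8150)": {"l2_blocks": 4, "l2_mb_each": 2.0, "l3_mb": 8.0},       # one L2 per module
    "Thuban (Phenom II X6)": {"l2_blocks": 6, "l2_mb_each": 0.5, "l3_mb": 6.0},   # one L2 per core
    "Sandy Bridge (Core i7)": {"l2_blocks": 4, "l2_mb_each": 0.25, "l3_mb": 8.0}, # one L2 per core
}


def total_cache_mb(chip):
    """Sum of all L2 blocks plus the shared L3, in MB."""
    return chip["l2_blocks"] * chip["l2_mb_each"] + chip["l3_mb"]


totals = {name: total_cache_mb(chip) for name, chip in chips.items()}
for name, mb in totals.items():
    print(f"{name}: {mb:g} MB of L2 + L3")
```

With these inputs, Zambezi comes out at 16 MB and both Thuban and Sandy Bridge at 9 MB.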

AMD has launched four AMD FXs:

As you can see, there are two Turbo Core clocks. This is an innovation of Turbo Core 2.0: 'Max Turbo' can be achieved when half the modules are in use and 'Turbo' even when all modules and cores are in use, within TDP limitations. On the Phenom II X6, only one Turbo mode could be used when between 1 and 3 cores were in use.

AMD's pricing is on the aggressive side, with the FX-8120 released at the price of the Phenom II X6 1100T, the FX-6100 at the price of the 1055T and the FX-4100 at the price of the Phenom II X4 955. In comparison to the Intel offer, the FX-8150 is positioned between the 2600K and the 2500K, the FX-8120 a little under the 2500K, the FX-6100 a little under the i5-2300 and the FX-4100 on a par with the Core i3-2100.

Page 3
AM3+, almost obligatory?

The AMD FXs use an AM3+ type platform infrastructure. Compatibility with AM3 motherboards is rather uncertain: although ASUS and MSI announced lists of compatible boards a few months ago, things have changed somewhat since. AMD had announced at the time that there would be no official support and said that some features might be affected, though without stipulating which.

In the end, MSI declared that many of these models would not in fact be compatible, with the list of compatible boards dropping to just three:

- 890FXA-GD70 via bios A7640AMS.1B0
- 890FXA-GD65 via bios A7640AMS.I40
- 890GXM-G65 via bios A7642AMS.1C0

MSI explains that this turnaround is due to the unexpected requirement for power stage modifications. All well and good but we're not impressed with the lack of communication on the issue. As we don't have any of these cards at our disposal, we haven't yet been able to check compatibility and whether there is full support or not.

ASUS had also announced compatibility with AM3+ processors for a number of its AM3 motherboards and delivered bios updates as expected.

AMD is currently marketing three AM3+ chipsets which only differ from each other in terms of graphics card support:

- 990FX: 2x16 or 4x8
- 990X: 1x16 or 2x8
- 970: 1x16

They're basically renamed 8xx series chipsets but are only used on AM3+ motherboards in an attempt to clarify what is a rather complicated situation. Here AMD has the advantage over the Intel offer as LGA 1155 is basically equivalent to 990X and you have to move up to LGA 1366 (and 2011) for two x16 ports. The advantage in CrossFire X / SLI (between 2 and 5% when limited by graphics power) is certainly slight but it is there. For the southbridge, the SB950 has 6 Serial ATA 6 Gb/s ports.

Page 4
AMD FX-8150 and FX-6100 reviews

For this test, AMD lent us an AMD FX-8150 processor. Given the presence of Turbo mode, there's no way of simulating the other processors in the range reliably. We did however manage to obtain an FX-6100 from another source:

From an external point of view and processor branding excepted, nothing differentiates the AMD FXs from other AM3 processors. We did however manage to locate a difference. AMD hasn't yet said anything on this but while the northbridge is clocked at 2.0 GHz on the FX-6100 (the same as on the Phenom IIs), it's clocked at 2.2 GHz on the FX-8150.

The AMD FX-8150 was accompanied by an ASUS Crosshair Formula V motherboard, a very high-end board. As we didn't really see the point of using a motherboard that was almost as expensive as the processor, and after having checked that performance levels were the same, we used a much more affordable ASUS M5A99X EVO with bios 810 supplied by ASUS for the AMD FXs. This will also give us a better comparison in terms of total energy consumption of the system at idle.

Note moreover that with the original bios the system didn't boot with an AMD FX and we had to update the bios using an AM3 processor. It has to be hoped that motherboards sold in stores will already have a bios that allows you to boot with an AMD FX!

Page 5
New test protocol

We developed a new test protocol for this review. We'd been using the old one (the one used in our giant 185 CPU roundup) for two years and it was time to make some changes. To make sure we were prepared for the launch of a new architecture with such variable performance and to get as thorough a view as possible, we opted to increase the number of tests.

For our 3D tests, we moved up to the 2011 version of 3ds Max, and we used close-up scenes supplied by Evermotion, one prepared for the built-in rendering engine, Mental Ray, and the other for another very popular rendering engine, V-Ray 2.0.

For compilation we switched to the source code of the 3D Ogre engine which is compiled both via MinGW / GCC and Visual Studio 2010. We have added 7-Zip, more effective both in terms of compression and multithreaded use, to WinRAR for file compression.

For video encoding we have kept x264, in its latest 2085 build, in tandem with the StaxRip interface and carried out a two-pass encoding of an extract from Avatar. The same encoding was also carried out with the MainConcept H.264 codec using the Reference application.

We have also introduced a processing test for RAW photo file lots. After trying many different pieces of software, we went with Adobe Lightroom, the leader in this domain, and Bibble.

We finished up our tour of applications with quite an unusual choice, namely artificial intelligence chess algorithms. We used Fritz Chess Benchmarking, by Chess Base, as well as Houdini Pro 2.0 via the Arena 3 interface.

Next come the tests designed to evaluate processor gaming performance. Once again, we decided to increase the number of tests and ran seven games:

- Crysis 2
- Arma II: Operation Arrowhead
- F1 2011
- Rise Of Flight
- Total War: Shogun 2
- Starcraft II
- Anno 1404

These tests were still carried out at maximum quality detail settings (not including anti-aliasing), with the exception of Crysis 2 where we opted for Ultra instead of Extreme mode as there wasn't sufficient playability at Extreme mode with just one graphics card. We also abandoned the 800*600 resolution and have provided scores at 1920*1080, nevertheless looking for quite heavy scenes in which processor limitations come into play.

For Crysis 2, we used a solo game saved in a high load scene and measured the framerate during a continuous burst of firing, while in Arma II: OA with all graphics options at max, crossing a village in the first mission was enough to have many CPUs on their knees. In F1 2011, we measured the framerate at the start of the Monaco GP and in Rise Of Flight we launched a customised mission of 32 against 32, with the framerate measured with the back-facing view of our 31 acolytes.

In Total War Shogun 2 we used the huge battle of the 'DX9 CPU' test modified for DX11 and suitable graphics settings, while in Starcraft II a major attack during a replay was generously donated by some French forum users. For Anno 1404 we loaded up a scene of a city of 46,600 inhabitants that we viewed from a distance.

The competition
For this test we set the AMD FXs up against their Phenom II X4 and X6 predecessors as well as the LGA 1155 Sandy Bridge Core i7/i5/i3s and the LGA 1156 Lynnfield Core i7/i5s. We also included the LGA 775 platform as a reference, in the form of the Q6600, the Q9650 and the QX9770. Because of a lack of time we didn't do any testing on the LGA 1366 platform this time, but unfortunately it has to be said that the AMD FXs already have their hands full with the quad core Intel CPUs, as you'll see:

- Intel DP55KG (LGA1156)
- Intel DP67BG (LGA1155)
- ASUS M5A99X EVO (AM3+)
- 2x4 GB DDR3-1066 7-7-7 (Q6600)
- 2x4 GB DDR3-1333 7-7-7 (Q9650)
- 2x4 GB DDR3-1600 9-9-9
- GeForce GTX 580 + GeForce 280.26
- SSD Intel X25-M 160 GB + SSD Intel 320 120 GB
- Corsair AX650 Gold power supply

Page 6
Processing units

Each of the two x86 execution units in a Bulldozer module is made up of two ALUs (arithmetic logic unit) as well as two AGUs (address generation unit). In fact, AMD calls them AGLUs rather than AGUs as these units can carry out simple operations.

In contrast, K10 architecture had 3 ALUs and 3 AGUs, which means that with only 2 ALUs Bulldozer may, in a worst case scenario, only give a maximum of 67% of what a K10 gives on a single thread.

With respect to the FPU, a single FPU is shared between the two cores on each module. Able to process two threads, one for each of the parent cores, it can also combine its two 128-bit FMAC units to process 256-bit AVX operations.

What do we get in practice? Using AIDA, we took a reading of the latency and instruction throughput on K10, Bulldozer and Sandy Bridge and have produced a breakdown for you. Latency represents the number of cycles required to process a single instruction, while throughput indicates the interval between successive instructions when several are being processed one after the other.
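To make the distinction concrete, here is a minimal sketch of how the two figures combine. The cycle counts are hypothetical, chosen only for illustration, and are not readings from our AIDA runs; the model also ignores everything else that happens in a real pipeline.

```python
def cycles_dependent(n, latency):
    """n instructions where each needs the previous result: latencies add up."""
    return n * latency


def cycles_independent(n, latency, issue_interval):
    """n independent instructions: a new one starts every issue_interval cycles,
    and the last one still needs its full latency to complete."""
    return (n - 1) * issue_interval + latency


# Hypothetical figures, not measurements from this article.
LATENCY = 4         # cycles for one instruction to produce its result
ISSUE_INTERVAL = 1  # cycles between two successive instructions

dep = cycles_dependent(100, LATENCY)
indep = cycles_independent(100, LATENCY, ISSUE_INTERVAL)
print(f"dependent chain: {dep} cycles, independent stream: {indep} cycles")
```

A chain of dependent operations is governed by latency, while a stream of independent operations is governed by throughput, which is why an architecture can gain on one figure and lose on the other.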

In comparison with the K10 platform, you can see that Bulldozer does slightly better with the 32 or 64-bit integer divide (IDIV) instruction. At 32 bits however the integer divide is processed a good deal faster on Sandy Bridge, though slower at 64 bits. Bulldozer doesn't handle signed multiply (IMUL) as well however, either in terms of latency or throughput. While the latency on simpler operations such as move (MOV) or addition (ADD) is good, throughput is lower, a result of having two rather than three ALUs.

When it comes to floating point numbers (x87) and not including sine calculations, which were faster on K10, which was in turn faster than SNB, you'll note an overall improvement in throughput, which is positive if you're repeating the same type of instruction. Unfortunately however, this is far from being systematic and the latency takes a hit. Note the same goes for SSE2 instructions, which are also executed by the FPU.

As expected there's a significant improvement in performance with AES instructions. We obtained the following scores on the various different architectures (fixed frequency of 3.2 GHz) in the AIDA 64 AES test:

- 35,315 on Yorkfield (Core 2 Quad 45nm)
- 433,093 on Sandy Bridge (Core i7 32nm)
- 35,743 on Deneb (Phenom II X4)
- 52,833 on Thuban (Phenom II X6)
- 344,131 on Zambezi (FX-8)

It's harder to show what happens with AVX and SSE4, especially as it's difficult to find versions of the same programme that are correctly optimized for each instruction set level and each architecture. For example, the AVX version of y-cruncher is around 13% faster than the SSE3 version (the SSE4 versions not being faster than the SSE3 version) on Sandy Bridge. This same AVX code, which was optimised for the only AVX architecture available (Sandy Bridge) is 10% slower than the SSE3 branch on Zambezi.

AMD supplied us with a version of x264 that was supposedly optimised for FMA4 and XOP instructions. Unfortunately, in our tests we didn't see any improvement in performance in comparison to the standard version, whether on the first or second encoding pass. It has to be said that the XOP/FMA4 optimisations are still under development for x264 and haven't yet been included in the main development branch: something still to be explored with respect to Bulldozer architecture.

Page 7
Cache performance

With Bulldozer, AMD has changed the cache management extensively in comparison to K10. From an L1 cache of 128 KB per core, 64 KB for instructions and 64 KB for data, it has moved to a cache of 64 KB for instructions for each module, which is therefore shared between two cores, each of which has a data cache of 16 KB.

The L2 cache is now 2 MB per module, compared to 512 KB or 1 MB per core on the various K10 architectures. Whereas the K10s shared an L3 cache of up to 6 MB at best, here it's at a maximum of 8 MB. The cache hierarchy has also been changed, as we described in our theoretical article, with notably a partially inclusive relationship between the L1D and L2.

Where are we at in practice in terms of cache performance? We measured this with AIDA, at a clock of 3.2 GHz on K10, Bulldozer and Sandy Bridge:

Reads on the L1D cache are the same across all three architectures, with an increase in latency from K10 to Bulldozer. Because of the write-through policy on Bulldozer, with writes updated in the L1D and the slower L2 at the same time, write speeds drop off drastically.

Reads on the L2 cache are faster on Bulldozer than on K10 at the same clock, but writes are slower and latency a good deal higher. With the L3, there has been a significant increase in read performance while writes are stable and latency is up.

It's difficult to judge the efficiency of a cache system on the basis of these figures. It's a question of compromise: while the increase in latencies and the fall in write speeds aren't positive, the fact that the L2 cache is a good deal bigger and that reads are up overall is a good thing given that reads are far more common than writes.

Page 8
Memory performance

We've pointed the finger at the relative weakness of the AMD memory controller, notably in terms of write speeds, in various articles. We were therefore curious to see if it had been improved in practice and measured memory bandwidth on a single thread in AIDA64 and on multiple threads in RMMT on the various architectures at 3.2 GHz and with DDR3-1600 9-9-9:

AMD has done some good work on its new memory controller, with single thread speeds up 47% and multithreaded speeds up 40% in comparison to the K10 controller. There's a 38% gain for single threaded writes and multithreaded write speeds have been more than doubled! This excellent progress means Bulldozer is closer to Sandy Bridge in terms of multithreaded performance, though Sandy Bridge is still a good deal faster on a single thread.

Note also that DDR3-1866 is now officially supported in place of DDR3-1333 on the previous generation. As usual, it's possible to go further and while DDR3-1600 was unofficially supported on K10, on Bulldozer there's unofficial support for DDR3-2133 and DDR3-2400. We haven't however yet been able to get DDR3-2133 to run and performance levels with DDR3-1866 9-10-9-28 weren't really any better than with DDR3-1600 9-9-9:

There's a 1% drop in V-Ray, a gain of 1% in Visual Studio 2010 and performance levels in MainConcept H.264, Bibble and Fritz are stable. There is a 1.5% gain in Arma II: OA, 2% in Anno 1404 and 2% in 7-zip, but these gains remain marginal.

Page 9
CMT efficiency

Bulldozer changes somewhat the definition of what a core is, as implemented on current x86 architectures, with what AMD is calling two cores sharing within a module the resources that were previously dedicated to individual cores.

It's worth knowing that when AMD registered the patents relating to Bulldozer, the engineers chose to call what is now known as a module a core and what is now a core a cluster. Does this make AMD's final naming scheme totally misleading?

With the aim of getting a clearer picture of things we carried out the following tests on a Bulldozer processor clocked at 3.2 GHz:

- 4-module, 8-core test
- 2-module, 4-core test
- 4-module, 4-core test

The first is the basic configuration so simple to set up. For the second, we were able to deactivate certain modules on our motherboard. For the third it was more difficult as we had to define affinity on the right cores (CPU 0, 2, 4 and 6 in Windows), limiting the number of threads executed by the application to four if possible. This wasn't possible with the tests in MinGW or Visual Studio as many different processes are used during compilation.
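Pinning an application to one core per module comes down to building an affinity bitmask. The sketch below only computes the mask; it assumes, as in our setup, that Windows numbers the two cores of a module consecutively (CPU 0 and 1 in the first module, 2 and 3 in the second, and so on).

```python
def affinity_mask(cpus):
    """Build a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# Assumption: CPUs 2k and 2k+1 belong to module k.
one_core_per_module = [0, 2, 4, 6]  # 4 modules, 4 cores
two_full_modules = [0, 1, 2, 3]     # 2 modules, 4 cores

mask_4m = affinity_mask(one_core_per_module)
mask_2m = affinity_mask(two_full_modules)
print(f"4M/4C mask: {mask_4m:#04x}, 2M/4C mask: {mask_2m:#04x}")
```

The resulting hexadecimal value is the form expected by, for example, the `/affinity` switch of the Windows `start` command (`start /affinity 55 application.exe`, where `application.exe` is a placeholder), or it can be set by ticking the matching CPUs in Task Manager.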

Why carry out this test? It allows us to see if AMD's claim that two CMT cores are equivalent to 80% of two same architecture standard cores is valid and whether the performance improvement from four to eight cores justifies AMD's 8-core label.

In this first table, we set an index of 100 to the 4-module, 4-core version.

If you exclude the tests that don't fully load four cores (WinRAR, the 1st x264 pass and games), you get between 71 and 95% of the 4-module / 4-core performance with the 2-module / 4-core configuration. AMD's claim with respect to CMT efficiency therefore appears to be right. Where an application doesn't fully load all four cores however, the 4-module / 4-core mode is fastest, even if the gap is often reduced.

Can Bulldozer therefore be thought of as an 8-core architecture, or should we rather be talking about a 4-core, 8-thread processor as with Intel's processors with Hyperthreading? Here's a breakdown of the gains you get with a) Hyperthreading on Sandy Bridge, b) moving from four to six cores on K10 and c) CMT on Bulldozer in the most multithreaded applications:

On average, Hyperthreading gives a gain of 23.4%, moving from four to six cores on K10 a gain of 42.2% and CMT in Bulldozer 53.1%. We're therefore well beyond what you get with Hyperthreading and talking about Bulldozer as an 8-core architecture therefore does seem most accurate.

Of course the whiners will say that by using this terminology, AMD has gained a rather dubious marketing advantage over Intel: eight cores is bound to sound better than four for many people. You can however also see things the other way round and say that you need to load all eight threads to fully exploit the potential processing power that's on offer. This certainly isn't the message that the marketing team wanted to communicate and nor is it what the salesperson in your local computer supermarket is likely to tell you...

Page 10
CMT, Turbo Core 2.0 and Windows 8

The AMD FX includes a development of AMD's Turbo Core. With the Phenom II X6s, Turbo Core only kicked in if a maximum of half the cores were being used, giving, for example, on the 1100T:

1) a base clock of 3.3 GHz
2) up to 3.7 GHz with up to 3 cores in load

With the AMD FX-8150 the new version of Turbo Core allows you to define several clocks:

1) a base clock of 3.6 GHz
2) up to 3.9 GHz with all modules in load
3) up to 4.2 GHz with 2 modules in load

As with Intel's Turbo Boost, this clock increase must stay within TDP limits and you generally don't get the maximum clock running continuously on all modules.

On this example the frequency of the various modules varies between 3612 and 3905 MHz, just as the voltage varies between 1.2375V and 1.375V. It should be noted that in AMD OverDrive the clocks don't all seem to be updated at exactly the same time, which explains the fact that, say, two CPUs in the same module have different clocks or that a high VID corresponds to a low frequency.

You'll have noted that when it comes to Turbo we're talking modules rather than cores. Thus, if you have four threads shared across four modules, you'll get a frequency of 3.9 GHz at best (case no. 2), while you could have got up to 4.2 GHz with four threads shared across two modules.
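The module-based rule can be sketched as a small function using the FX-8150 clocks listed above. This is only a simplified model and not AMD's actual algorithm: it assumes that cores 2k and 2k+1 share module k, that three busy modules are treated like four, and it ignores the TDP limit that decides whether the ceiling is actually reached.

```python
BASE_GHZ = 3.6               # base clock
TURBO_ALL_MODULES_GHZ = 3.9  # ceiling with all modules in load
MAX_TURBO_GHZ = 4.2          # ceiling with at most half the modules in load
TOTAL_MODULES = 4


def modules_in_use(busy_cores):
    """Count distinct modules, assuming cores 2k and 2k+1 share module k."""
    return len({core // 2 for core in busy_cores})


def clock_ceiling_ghz(busy_cores, turbo=True):
    """Highest clock the simplified model allows for a given set of busy cores."""
    if not turbo:
        return BASE_GHZ
    if modules_in_use(busy_cores) <= TOTAL_MODULES // 2:
        return MAX_TURBO_GHZ
    return TURBO_ALL_MODULES_GHZ


spread = clock_ceiling_ghz([0, 2, 4, 6])   # four threads on four modules
grouped = clock_ceiling_ghz([0, 1, 2, 3])  # four threads on two modules
print(f"4T/4M: up to {spread} GHz, 4T/2M: up to {grouped} GHz")
print(f"theoretical uplift over base: {spread / BASE_GHZ - 1:.1%} and {grouped / BASE_GHZ - 1:.1%}")
```

In this model the theoretical uplift over the base clock is around 8% with all modules busy and around 17% with two modules busy; real gains are lower whenever the TDP limit prevents the ceiling from being held.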

This last option isn't however necessarily the most efficient as CMT only gives 80% of the performance you get on full cores. Looking at what happens on Fritz Chess Benchmark, we obtained the following scores:

For Windows 7 SP1, all AMD FX cores are the same and there's no preference for one core over another. The default four thread (4T) scores are therefore between those obtained when the four threads are forced onto four cores from two modules (4T/2M) and four cores, from four modules (4T/4M).

Activating Turbo therefore brings a variable gain according to module occupation, which is to be expected. In 4T mode there's a gain of 9.8%, against 14.6% if only two modules are used and 4.9% with four modules. Turbo doesn't however manage to make up for the reduced efficiency of CMT in comparison to cores which aren't shared and the 4T/4M setting is still the highest performance mode with four threads. With eight threads, you get a 3.3% gain with Turbo.

The positioning of threads naturally has an impact on power consumption, which as you can see here, varies a good deal according to the configuration used:

It's no surprise to see that the lowest consumption is with four threads shared across two modules. The energy cost of CMT is therefore 20% if you compare four threads across four modules with eight threads.

What about performance per watt efficiency? Here's a breakdown:

Without Turbo mode, the best performance per watt for four threads is when they're grouped on two modules: the drop in performance is more than made up for by the reduction in power consumption. As you can see, the increase in performance you get from Turbo is to the detriment of the processor's energy efficiency.

Note that Windows 8 is likely to change the situation in terms of how threads are shared across AMD FX processors. Instead of sharing threads between cores without distinction, Windows will favour those within the same module, so as to use the lowest possible number of modules.

This may at first seem at odds with the results we got with Fritz Chess Benchmark where we got the best results when there was no preference. This however doesn't take into account the fact that a large proportion of the demanding applications not fully exploiting an 8-core processor are games and that they struggle to fully occupy four cores as things stand.

On the previous page we saw that with just half a Zambezi, namely two modules and four cores, we managed to get 89 to 98% of what we got with four modules and four cores. It may therefore become worthwhile to maximise Turbo gains and group threads on a minimum number of modules, as AMD has done with Windows 8.

Nevertheless, performance gains are likely to be minimal at the end of the day, except in particular cases. We observed that performance varies by between -3% and +3% in comparison to the basic Windows 7 SP1 configuration when you force games onto two modules on an AMD FX-8150 with Turbo.

Page 11
3.2 GHz tests

At each new architecture launch we carry out tests at a clock equal to 3.2 GHz. Of course, such a test only offers a partial view of the performance of processors and architectures, as it doesn't take into account the higher clocks that some designs reach by trading away performance at equal clocks.

Performance levels are given in comparison to an index of 100 set to a Deneb (Phenom II X4 955) at 3.2 GHz:

As you can see, Zambezi performance is down when the architectures are compared at an equal clock. In 4-core (2 module) mode, it offers from 62.6% to 111.4% of the Deneb (Phenom II X4) performance. The WinRAR result is however an exception and on average we got 80.8%.

Zambezi does much better in 8-core (4 module) mode, when applications are able to take advantage of the available resources, though this isn't the case in games, for which performance is still down on Deneb performance. Moreover, in applications that exploit eight cores, Zambezi only does better than Thuban in 7-zip, the second x264 pass and Bibble.

Comparison with the Intel offer is even more disadvantageous to Zambezi, with the Yorkfield at 3.2 GHz (Intel Core 2 QX9770) doing a good deal better in games. This advantage is even bigger on Sandy Bridge. In multithreaded applications Zambezi regularly manages a level of performance somewhere between SNB with and without Hyperthreading, but still brings up the rear at times.

Page 12
Energy consumption and efficiency

In our previous articles on processors, we measured energy consumption in load in Prime95. This stress test has the merit of pushing the various architectures to the limit in a pretty equitable manner, but we weren't able to use it to compare energy consumption and performance as the Prime95 benchmark consumes less and isn't as balanced between processors.

We therefore decided to look for another application that would give us a level of performance and energy consumption representative of what we obtained on the other applications in our test protocol. In the end we opted for Fritz Chess Benchmark once again. This application has the additional advantage of allowing us to fix the number of threads to be used easily.

The energy consumption readings therefore shouldn't be taken as absolute maximum values but rather as typical for a heavy load - applications specialised in processor stress such as Prime95 can consume up to 20% more. All energy economy features, including those on motherboards such as the ASUS EPU, were turned on for this test, as long as they didn't have a negative impact on performance:

[ 220V socket ]  [ ATX12V ]

The AMD FXs are more economical than the Phenom II X6s and even X4s at idle, which is a very interesting development. The Intel 1155 and 1156 platforms are however still a good deal more efficient. In low loads (1 thread), energy consumption is around the same as for the Phenom II X4s while in full load (100%) the FX-8150 consumes a good deal more.

Taking the reading at the ATX12V allows us to isolate the processor energy consumption. Unfortunately however, the figures are not entirely comparable as in certain cases some of the CPU consumption comes from the standard ATX 24 pin connector. To get a totally accurate comparison however, we can compare processors using the same motherboard. The trends observed on the reading at the wall socket are indeed confirmed.

We then looked at the energy efficiency of the different processors. To get a representation of this you have to divide the performance levels obtained in Fritz Chess Benchmark by CPU energy consumption. The only problem is that it is impossible to get an exact reading of CPU consumption: the readings at the ATX12V aren't 100% comparable from one platform to another and the reading at the wall socket doesn't allow us to isolate CPU consumption entirely.

We therefore decided to use two methods to isolate processor consumption:

- Energy consumption at the ATX12V
- 90% of the difference in energy consumption between load and idle at the socket

We took this at 90% so as to exclude power supply losses. Note that while the first reading favours processors that draw a small proportion of power from the standard ATX socket, the second favours those with high energy consumption at idle. Unfortunately no method is perfect.
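The two estimates and the resulting efficiency figure can be written out as follows. This is a minimal sketch of the arithmetic only: the score and wattages are hypothetical placeholders, not readings from our graphs, and the function names are ours.

```python
PSU_FACTOR = 0.90  # share of the wall-socket delta attributed to the CPU


def cpu_watts_from_wall(load_w, idle_w, factor=PSU_FACTOR):
    """Method 2: 90% of the load-minus-idle difference at the wall socket."""
    return factor * (load_w - idle_w)


def efficiency(score, cpu_watts):
    """Benchmark score per watt of estimated CPU consumption."""
    return score / cpu_watts


# Hypothetical readings for one processor (not taken from the article).
score = 12000.0                       # Fritz Chess Benchmark, kilo nodes per second
wall_load, wall_idle = 200.0, 80.0    # watts at the wall socket
atx12v_load = 100.0                   # watts at the ATX12V connector (method 1)

wall_estimate = cpu_watts_from_wall(wall_load, wall_idle)
eff_atx12v = efficiency(score, atx12v_load)
eff_wall = efficiency(score, wall_estimate)
print(f"ATX12V method: {eff_atx12v:.1f} per watt, wall method: {eff_wall:.1f} per watt")
```

The bias of each method is visible in the formula: a higher idle reading shrinks the wall-socket estimate and so flatters the processor, while any power drawn through the 24-pin connector is simply missing from the ATX12V reading.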

[ 220V socket ]  [ ATX12V ]

If we only compare the AM3/AM3+ processors between themselves, the ATX12V graph shows that energy efficiency hasn't really improved since the Phenom IIs. With one thread, the results are comparable and with full occupation of the processor the Phenom II X6s still do better. As things stand, the Bulldozer architecture and CMT are not all that convincing here.

Comparison with the Intel offer shows Bulldozer up in a poor light whatever reading you use. Even in multithreaded performance, AMD is on a par with the previous LGA 1156 45nm generation and the LGA 1155 Sandy Bridge 32nms are clearly in another world.

Page 13

Designed to support high clocks, the AMD FXs have attracted plenty of attention when it comes to overclocking, with AMD proudly announcing recently that it had beaten the frequency record for x86 processors, clocking up to no less than 8429 MHz using liquid helium. They didn't stabilise the processor at this clock but rather this is the maximum clock obtained with extreme cooling on just one module. What about more standard overclocking? We tried to overclock our FX-8150 using a Noctua NH-U12P SE2, still on the M5A99X EVO.

By setting CPU Load Line Calibration to High and adding Offset 0.16v to the processor voltage we managed to stabilise the processor at a clock of 4.6 GHz in Prime95 by using an x23 multiplier; this is unlocked on the FX range, an advantage when compared with the fact that you have to pay extra for an Intel K model.

4.6 GHz is pretty good but why stop there when we're enjoying ourselves? Simply because energy consumption was already very high at this setting.

In comparison to the base configuration, in CPU load in Fritz Chess Benchmark the energy consumption at the wall socket increased from 206 to 313 Watts and from 109 to 206 Watts at the ATX12V. This gives 14602 kilo nodes per second, which is 23% better than the default setting, not that much for what is almost double the processor energy consumption.
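The trade-off can be put into numbers from the figures just quoted. The stock score below is derived from the 23% gain rather than measured directly, so treat it as an estimate.

```python
# Readings quoted above: (stock, overclocked to 4.6 GHz), in watts.
wall_w = (206.0, 313.0)
atx12v_w = (109.0, 206.0)

oc_score = 14602.0  # kilo nodes per second at 4.6 GHz
perf_gain = 1.23    # 23% better than the default setting

wall_ratio = wall_w[1] / wall_w[0]        # whole-system power increase
power_ratio = atx12v_w[1] / atx12v_w[0]   # processor power increase at the ATX12V
efficiency_ratio = perf_gain / power_ratio  # performance per watt relative to stock

# Derived estimate, not a reading: stock score implied by the 23% gain.
stock_score_estimate = oc_score / perf_gain

print(f"ATX12V power: x{power_ratio:.2f}, wall power: x{wall_ratio:.2f}")
print(f"performance per watt at the ATX12V: {efficiency_ratio:.0%} of stock")
print(f"implied stock score: about {stock_score_estimate:.0f} kilo nodes per second")
```

In other words, processor power rises by a factor of about 1.9 for a 1.23x gain in performance, leaving performance per watt at roughly two thirds of the stock figure.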

In Prime95, energy consumption at the ATX12V went up as far as 255 Watts and during overclocking tests where we pushed things even further we recorded hellish energy consumption levels of up to 300 Watts at the ATX12V. While the AMD FX may impress in terms of extreme overclocking, things seem more complicated for daily usage... It remains to be seen if the versions on sale in stores will offer better results. When it comes to undervolting, we didn't manage to do any better than a 0.05V reduction, which saves around 10 Watts.

Page 14
3D rendering: Mental Ray and V-Ray

3d Studio Max 2011 - Mental Ray

We now move on to the practical tests, firstly with a 3D rendering in 3d Studio Max 2011 using the Mental Ray rendering engine on an Evermotion scene. We carried the rendering out at 600*375 so as not to extend the length of the test too much.

The FX-8150 was a disappointment here, not managing any better than the Phenom II X6 1100T. The FX-6100, though sold at the same price as the 1055T, is a good way behind. The AMD offer is however still well positioned against the Intel offer at the same price, with the FX-8150 falling between the 2500K and the 2600K, which is no great feat when you consider the excessive number of transistors used.

3d Studio Max 2011 - V-Ray 2.0

Still in 3d Studio Max 2011, we changed the engine for the more popular third party engine, V-Ray 2. We used another version of the same scene prepared by Evermotion for this engine, still with a 600*375 rendering. Rendering times are a good deal faster but of course we're not carrying out a comparison of the engines themselves or the quality of the final files.

This time the FX-8150 is significantly up on the 1100T, but the FX-6100 remains down on the 1055T, which is nevertheless sold at the same price. This allows the FX-8150 to get closer to the 2600K, though there is still a comfortable gap between the two.

Page 15
Compilation: Visual Studio and MinGW/GCC

Visual Studio 2010 SP1

We compiled the source code of the 3D Ogre engine in Visual Studio 2010 SP1.

The FX-8150 is slightly faster than the 1100T, but the FX-6100 trails the 1055T. Once again the FX-8150 has closed the gap on the 2600K, though the 2600K still has an advantage of around 12%.

MinGW / GCC 4.5.2

The same source code was compiled in MinGW / GCC 4.5.2.

Unfortunately this time the FX-8150 was pretty much on a par with the X6 1100T and exactly midway between the 2600K and the 2500K. The FX-6100 was once again down on the 1055T.

Page 16
Compression: 7-zip and WinRAR

7-zip 9.2

7-zip has been added to our test protocol. In contrast to WinRAR, this application is highly multithreaded if its highest performance algorithm, LZMA2, is used. We measured the time required to compress a large volume of files.

This time there was a substantial gain in performance, with the FX-6100 managing to position itself at the same level as the 1100T and the FX-8150 closer still to the i7-2600K, though still with a deficit of 7%. With compression software generally putting higher demands on the memory subsystem, these gains can be linked to the increase in cache size and improvements to the memory controller.

WinRAR 4.01

The same files were compressed in WinRAR using the most demanding RAR algorithm ("Best").

Unfortunately WinRAR doesn't really exploit more than two cores, but again it does seem to benefit from the AMD FX caches and memory controller. There's a significant gain on the previous generation but, unfortunately for AMD, WinRAR is even more comfortable on the Intel processors, as its limited multithreading doesn't play to the strengths of the AMD FXs.

Page 17
Encoding: x264 and MainConcept H.264

StaxRip - x264 build 2085

For video encoding we retained the popular x264, here in build 2085. We used the StaxRip interface to transcode a 1080p file taken from the Avatar Blu-ray using two passes in fast mode with a bitrate of 10 Mbit/s. We've posted the times for both passes, the first being less multithreaded than the second and only really exploiting three or four cores.

In what is a very common task, the FX-8150 is only slightly up on the 1100T and therefore only just up on the 2500K. If we look at these results in detail we can see that the FXs struggle on the first pass but make up some of the lost ground on the second, on which the FX-8150 is on an equal footing with the 2600K!

Note that AMD recently supplied us with a version of x264 that was supposedly optimised for FMA4 and XOP instructions. Unfortunately, in our tests we didn't see any improvement in performance in comparison to the standard version, whether on the first or second encoding pass. It has to be said that the XOP/FMA4 optimisations are still under development for x264 and haven't yet been included in the main development branch: something that remains to be explored with respect to the Bulldozer architecture.

MainConcept Reference 2.2 H264 Pro

We then moved on to another H.264 codec from MainConcept. We used the MainConcept Reference H.264 interface to carry out the same type of transcoding as in x264. Note that the first pass is more multithreaded here and we have only given the overall score.

The FX-8150 was this time 9.4% faster than the 1100T, but the FX-6100 couldn't overtake the X6 1055T. Looking at the comparison with Intel, the FX-8150 comes between the 2600K and the 2500K.

Page 18
Photo processing: Lightroom and Bibble

Adobe Lightroom 3.4

We have now introduced batch photo processing to our protocol. We started by exporting a batch of 96 RAW photos from a 5D Mark II as JPEGs in Lightroom, applying various effects such as colour and lens correction or noise processing.

AMD's chips were a long way back in this field, whatever the application used (we tested several when we were drawing up the new protocol). In Lightroom, the AMD FXs were able to make up some ground, with the FX-8150 almost on a par with the 2500K, though the 2600K is out of reach.

Bibble 5.2.2

In Bibble we processed a batch of 48 RAW photos. Note that Bibble is slower than Lightroom but, as with the rendering engines, we didn't carry out this test to compare the applications with each other as this would mean comparing the quality of results: a slower export may also be of higher quality.

AMD's deficit was even more marked in Bibble, but here the AMD FXs bring a very significant gain: the FX-8150 was almost 22% faster than the 1100T. This means it was ahead of the 2500K, which isn't all that amazing in itself but looks better when you see that the 1100T was on a par with the i5-2300.

Page 19
Chess AIs: Houdini and Fritz

Houdini 2.0 Pro

We finished up our tour of applications with quite a particular choice, namely artificial intelligence algorithms designed for chess. We started with Houdini Pro 2, via the Arena 3 interface. Version 1.5 dominated the top of the chess engine rankings and Version 2 seems destined to do the same. We left the engine running until the 24th move at the beginning of a game and noted the speed in kilo nodes per second.

Houdini runs very well on the K10 architecture and the 1100T dominates things here. The FX-8150 was down on the 1100T but did nevertheless perform somewhere between the 2500K and the 2600K. The FX-6100 performed poorly however.

Fritz Chess Benchmark 4.3

We then moved on to Fritz Chess Benchmark from ChessBase. Once again, the scores are given in kilo nodes per second.

This time the FX-8150 was slightly in front of the 1100T. Once again it was between the 2500K and the 2600K.

Page 20
3D gaming: Crysis 2 and Arma II: OA

Crysis 2 v1.9

The 3D gaming part of this comparative begins with Crysis 2. We used the latest version 1.9 in DirectX 11 and measured the framerate obtained at 1920*1080 Ultra at a precise point in the game during a shoot-out.

Crysis 2 doesn't really use more than four cores, which means it doesn't fully exploit the potential of the AMD FXs. With such low performance when only four cores are fully used, the FX-8150 trailed the X6 1100T and the X4 980 here in spite of its clock. The comparison with the Intel offer is simply catastrophic, with even the i3-2100 managing to equal it.

Arma II: Operation Arrowhead v1.59

In Arma II: Operation Arrowhead we measured the framerate when crossing a village in the first solo mission, still at 1920*1080 and with all options pushed to a maximum, including visibility.

The same causes produce the same effects: as in Crysis 2, the FX-8150 was outdone by its predecessors and the Core i3-2100 was also faster.

Page 21
3D gaming: Rise of Flight and F1 2011

Rise Of Flight v1.021b

We used Rise Of Flight, a First World War fighter plane simulator, at 1920*1080 with high graphics settings. In this test we launched a customised mission with a 32 vs 32 dogfight, measuring the framerate in the rear-facing view of our 31 wingmen.

Once again there's no gain in performance with the FXs in comparison to the previous generation. In fact they've lost ground. The Core i3-2100 is a good deal faster here while the i5s and i7s are way out front.

F1 2011

We ran the brand new F1 2011 at 1920*1080 with settings pushed to a maximum. We measured the framerate at the start of the Monaco GP.

Without any significant gain on the Phenom IIs, the AMD FXs were even behind the Core i3-2100 here.

Page 22
3D gaming: Total War Shogun 2, Starcraft II and Anno 1404

Total War: Shogun 2

For Total War: Shogun 2 we used the huge battle of the 'DX9 CPU' test modified for DX11 at 1920*1080 and with high graphics settings.

Unfortunately we can't give you a score for the AMD FXs in this game. The game kept crashing on start-up with the FXs. We contacted AMD about the issue and they got the same result. They're working on a solution. This is the only application that posed a problem during this test and while we have included it for information, it obviously won't be included in the average. The other processors all struggle in this extreme test, the Intel CPUs still with a very clear lead.

Starcraft II v1.3.6

For Starcraft II we used a replay of a major attack generously donated by forum users (thanks!). This replay contained a very (very) full-on attack and we measured the framerate at a resolution of 1920*1080 with all graphics settings pushed to the maximum.

All the processors were brought to their knees in this test, which is in practice even more extreme than the one carried out in Shogun 2. Starcraft II doesn't really exploit more than two cores, which explains the slight difference between the i3s and the i5s. The AMD FXs were on a par with the Phenom IIs and were outdone by the i3-2100.

Anno 1404 v1.3

Lastly in Anno 1404 we loaded a saved game with a city of 46,600 inhabitants that we partly visualise from a distance. The resolution was 1920*1080 and all graphics settings were pushed to a maximum.

It's no surprise to see that there was no gain with the AMD FXs and that they were even slightly down on the Phenom IIs. The i3-2100 once again offered a higher framerate and the i5s/i7s are in another league altogether.

Page 23
Performance averages

Although individual app results are worth looking at, we have also calculated a performance index based on all tests with the same weight for each test. For the first time we've included two averages, one that's applied across all the tests with the exclusion of 3D games and the other specific to 3D games.
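The exact method behind the index isn't detailed here, so the following is only a sketch of one common way to build an equal-weight index: each test is normalised against a reference CPU (100 = reference), inverting tests where a lower value is better, and the normalised scores are then averaged. The test names, CPU names and values are hypothetical placeholders, not results from this review.

```python
from statistics import mean

# Hypothetical raw results, NOT measurements from this review.
# Times (seconds) are "lower is better"; fps and knps are "higher is better".
tests = {
    "render_time_s": {
        "higher_is_better": False,
        "scores": {"ref_cpu": 200.0, "cpu_a": 160.0},
    },
    "chess_knps": {
        "higher_is_better": True,
        "scores": {"ref_cpu": 10000.0, "cpu_a": 11000.0},
    },
}


def relative_score(test, cpu, reference):
    """Score of `cpu` on one test, with the reference CPU fixed at 100."""
    scores = test["scores"]
    if test["higher_is_better"]:
        return 100.0 * scores[cpu] / scores[reference]
    # For timings, a shorter time must give a higher relative score.
    return 100.0 * scores[reference] / scores[cpu]


def performance_index(tests, cpu, reference="ref_cpu"):
    """Arithmetic mean of the relative scores, every test weighted equally."""
    return mean(relative_score(t, cpu, reference) for t in tests.values())


for cpu in ("ref_cpu", "cpu_a"):
    print(f"{cpu}: {performance_index(tests, cpu):.1f}")
```

Keeping a separate index for 3D games, as done here, amounts to calling the same function on two different sets of tests.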

The FX-8150 is the fastest AMD processor on the applications average. The gain of 6.9% is however limited considering the resources employed, and the FX-8120, which is priced at the same level as the 1100T, will almost certainly fall behind it. This is already the case for the FX-6100, which is down on the 1055T in spite of being priced at the same level. The positioning in comparison to the Intel offer is however acceptable, with the FX-8150 coming between the 2500K and 2600K in terms of both performance and pricing.

Unfortunately in 3D games, the results aren't good. Rarely able to exploit more than four cores intensively, games aren't able to make the most of the AMD FX resources, with Turbo not making up for the drop in IPC. Performance levels are slightly down on the Phenom IIs and even the Core i3-2100 does better. Here the AMD FXs are on a par with the Q9650, which was released by Intel almost four years ago.

Of course, in our tests we looked for cases where gaming performance was limited by the CPU and not the GPU, in spite of using a resolution of 1920*1080 and high graphics settings. If we used scenes that put less of a demand on the CPU, and/or raised settings that only affect the GPU (AA or resolution) to the point where the framerates possible on the fastest CPUs could no longer be reached, we would see less of a difference between the CPU platforms.

In cases which are limited by GPU performance, it would even be possible to see the AMD FX platform take a slight lead, particularly in SLI / CrossFire X because of the 2x16 support against 2x8 on LGA 1155. This slight advantage wouldn't however make up for the enormous gulf between these solutions when performance is limited by the CPU.
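The reasoning in the last two paragraphs can be illustrated with a deliberately simplified bottleneck model, in which the observed framerate is the lower of what the CPU and the GPU can each sustain. This is only a sketch: real games don't behave this cleanly, and the function name and all the values below are hypothetical.

```python
def observed_fps(cpu_limit_fps, gpu_limit_fps):
    """Simplified model: the slower of the two components sets the framerate."""
    return min(cpu_limit_fps, gpu_limit_fps)


# Hypothetical framerates each CPU could sustain on its own.
cpu_limits = {"fast_cpu": 90.0, "slow_cpu": 60.0}

# Hypothetical GPU limits: a CPU-bound scene, then the same scene with
# GPU-only settings (AA, resolution) raised until the GPU is the bottleneck.
scenarios = {"CPU-bound scene": 120.0, "GPU-bound scene": 45.0}

gaps = {}
for label, gpu_limit in scenarios.items():
    fps = {cpu: observed_fps(limit, gpu_limit) for cpu, limit in cpu_limits.items()}
    gaps[label] = fps["fast_cpu"] / fps["slow_cpu"] - 1
    print(f"{label}: {fps}, gap between CPUs {gaps[label]:.0%}")
```

In the CPU-bound case the gap between the two processors is fully visible; once the GPU limit drops below both CPU limits, the gap disappears, which is why the tests here deliberately target CPU-limited scenes.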

Page 24

New processor architectures don't come round all that often and there was a great deal of excitement in anticipation of Bulldozer. Breaking with the traditional vision of cores, its CMT architecture was indeed promising on paper but its integration within the AMD FXs is unfortunately a disappointment.

We certainly noted good overall performance when all eight cores were fully exploited and AMD has made some important improvements to its memory controller at the same time as increasing cache size, three points that make us think that the Bulldozer architecture could be worth a look in its Opteron version.

The AMD FXs are however desktop processors and while performance with applications is good, the average gain on the Phenom II X6s is rather small. Worse still, when you compare equally priced options, the Phenom II X6s have an advantage, and the 32nm process only gives a slight gain in energy efficiency because of the huge number of transistors used in the AMD FXs. In comparison with the Intel offer, the AMD FXs are competitive in terms of the price/performance ratio if you exclude 3D gaming, especially as the motherboards are slightly cheaper, but the Core i5s and i7s are a long way ahead when it comes to energy efficiency.

We had also expected good things from the AMD FXs in terms of overclocking, especially with all the buzz created by AMD around the 8.4 GHz record. Unfortunately, we only managed to stabilise our test processor at 4.6 GHz by doubling what is already a high level of energy consumption. Nothing exceptional then. In fact, this is even disappointing given the architecture's "high frequency" orientation, especially as this comes with certain concessions in terms of IPC: if you can't clock higher, results are bound to lag behind.

3D gaming performance is the other major drawback as there's no improvement on the Phenom II X4s and X6s, even though this was their major fault. Even a simple Core i3 with two cores and Hyperthreading is slightly faster! Of course in many cases, the AMD FXs will have enough to obtain decent gameplay but Intel still has a comfortable advantage.

This shows up the weakness of the CMT architecture as implemented in Bulldozer: when only four cores are under load, the processor's resources are far from fully used. The AMD FXs can be considered to be 8-core processors, but these cores are rather weak individually and, even when combined, don't systematically position the CPU in front of other 4 or 6-core processors.

There aren't all that many possible solutions to this performance issue when cores aren't fully exploited. Either we'll have to wait for game engines to be redesigned to exploit all eight cores correctly, or the frequency has to be increased, or the processing capacity of a core within a module has to be increased. In the first case, AMD can't do much apart from motivating the troops, and it looks as if it will be some time before developers get up to speed. When it comes to clocks, the next stepping should allow AMD to go a bit higher. As for processing capacity per thread, we'll have to wait and see if this is something AMD goes for on forthcoming Bulldozer architectures, which are likely to follow quite rapidly if AMD's roadmap is anything to go by: Piledriver in 2012, Steamroller in 2013 and Excavator in 2014.

Another more technical option that might increase single thread performance would be to introduce a CMT architecture that could share the instructions of one thread between both cores of the same module in a sort of anti-hyperthreading. Like Intel, AMD has been working on this sort of project for years, but as things stand the procedure, which is very delicate indeed, is still closer to science fiction than reality.

In the medium term at least, we'll have to count on a mix of the first three possibilities, combined with software optimizations to use the additional instructions supported by the AMD FXs and fit in better with the architecture. After all the delays however, we were obviously expecting more than possible future performance improvements and only AMD enthusiasts will go for the current solution. At the end of the day then, the current situation isn't good for anyone as AMD's competitiveness on the x86 processor market is essential.

To conclude, if you're counting on opting for an AMD FX in spite of all this, note that two months after launch, availability of the 8-core AMD FXs still seems very limited. AMD is no doubt giving priority to the Opteron variants and, as should be clear by now, we aren't alone in advising you to skip this first Bulldozer release and keep an eye on the architecture to see how it matures.
