





Nvidia Fermi: the GPU Computing revolution
by Damien Triolet
Published on October 9, 2009
IEEE 754-2008
Fermi SIMDs move from MAD (multiply-add) instructions to FMA (fused multiply-add) instructions. The difference is one of precision. The MAD instruction performs a floating-point multiplication followed by a floating-point addition, which is to say that the result is rounded at each stage. The FMA instruction retains full precision in the intermediate stage and only rounds at the end.
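To make the difference concrete, here is a minimal sketch in Python. It is our own illustration, not NVIDIA code: it runs in double precision on the CPU, the function names `mad` and `fma` are ours, and the FMA is emulated with exact rational arithmetic on the assumption that real hardware rounds as described above (special values such as infinities and NaNs are ignored).

```python
from fractions import Fraction

def mad(a, b, c):
    """Multiply, round, then add, round: two roundings in total."""
    return a * b + c

def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float conversion is correctly rounded

# a * b is exactly 1 - 2**-60, which does not fit in a double and rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

mad_result = mad(a, b, c)  # 0.0: the small term is lost in the first rounding
fma_result = fma(a, b, c)  # -2**-60: the small term survives

print(mad_result, fma_result)
```

The two results differ even though both are "correct" for their instruction, which is why code validated with one can produce slightly different output with the other.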
Like Cypress, Fermi implements the new IEEE 754-2008 standard and therefore supports denormalized numbers and the four rounding modes. The big difference between Cypress and Fermi, however, is that Fermi cannot handle standard MADs in either single or double precision. With Fermi, NVIDIA will replace MAD instructions with FMA instructions. This doesn’t give an equivalent result, and although it is more precise, it might be a source of problems. It will however be possible to specify at compilation time that, rather than using the FMA instruction, MADs should be split into MULs and ADDs, which will give an almost identical result. To get a perfectly identical result, however, you have to forgo MADs on current architectures and FMAs on the new one.
Cypress is able to handle both MADs and FMAs, in single and in double precision, at the same speed, no doubt helped by the fact that it emulates doubles using its FP32 units and partial products. NVIDIA tells us that the decision to abandon MADs was influenced by the cost that keeping them would have implied for the architecture.
FMA instructions also make it possible to accelerate certain functions such as divisions and square roots. NVIDIA told us that it will supply a new library of maths functions that will be used automatically when compiling for Fermi and will use FMA instructions to accelerate these functions where possible. Cypress already uses double precision FMA instructions to speed up divisions (DIV) and square roots (SQRT).
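One reason FMA helps here is that it can compute the remainder of a division, a - b*q, without the product being rounded first, which is what software division and square root routines need for their correction steps. The sketch below illustrates the idea in Python with an emulated FMA. We do not know which algorithms NVIDIA's library or Cypress actually use, so the Newton-Raphson reciprocal shown here, the helper names `fma` and `reciprocal`, and the starting guess are all assumptions chosen for illustration.

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Remainder of 1/3 after rounding the quotient to a double.
a, b = 1.0, 3.0
q = a / b
mad_residual = a - b * q       # b*q rounds to 1.0, so the remainder vanishes
fma_residual = fma(-b, q, a)   # exact remainder, here 2**-54

def reciprocal(b, guess, steps=6):
    """Newton-Raphson refinement of 1/b (hypothetical helper).

    Each step computes the error e = 1 - b*y with one FMA,
    then the correction y + y*e with another.
    """
    y = guess
    for _ in range(steps):
        e = fma(-b, y, 1.0)
        y = fma(y, e, y)
    return y

y = reciprocal(3.0, 0.3)
print(mad_residual, fma_residual, y)
```

With separate multiply and add, the remainder comes out as zero and carries no information for a correction step; with FMA it is exact, so a crude estimate can be refined to full precision in a handful of iterations.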
Arithmetic throughput
We have charted the processing power of various architectures for some common instructions. For Fermi we have taken into account the separation of MADs into MULs and ADDs and assumed a conservative clock of 1600 MHz for the processing units. For information, we have also added the maximum processing power given by the SSE units of a Core i7 975.
In comparison to the previous generation, the gains brought by Fermi are enormous, especially in double precision. Although Cypress offers superior processing power in single precision, you have to remember that this peak is harder to attain in practice because of its architecture. Fermi has a 50% advantage over Cypress in double precision, except for additions. Given that Cypress processes just one MUL, MAD or FMA per cycle in these cases, there is of course no loss in efficiency linked to its vector architecture.
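The 50% figure can be checked with some simple arithmetic, sketched below in Python. Only the 1600 MHz Fermi clock comes from this page. The other inputs are assumptions taken from public specifications: 512 cores for Fermi with double precision at half rate, and 320 five-wide units at 850 MHz for Cypress, each issuing one double precision MUL, MAD or FMA per cycle. The helper `peak_gflops` is ours.

```python
def peak_gflops(ops_per_clock, flops_per_op, clock_mhz):
    """Theoretical peak in GFLOPS: operations per clock x flops each x clock."""
    return ops_per_clock * flops_per_op * clock_mhz / 1000.0

FERMI_CLOCK_MHZ = 1600        # estimate used in this article
FERMI_DP_PER_CLOCK = 256      # assumption: 512 cores, double precision at half rate
CYPRESS_CLOCK_MHZ = 850       # assumption: Radeon HD 5870 clock
CYPRESS_DP_PER_CLOCK = 320    # assumption: one DP MUL/MAD/FMA per unit per cycle

# An FMA counts as two floating-point operations, a MUL as one.
fermi_dp_fma = peak_gflops(FERMI_DP_PER_CLOCK, 2, FERMI_CLOCK_MHZ)
cypress_dp_fma = peak_gflops(CYPRESS_DP_PER_CLOCK, 2, CYPRESS_CLOCK_MHZ)
fermi_dp_mul = peak_gflops(FERMI_DP_PER_CLOCK, 1, FERMI_CLOCK_MHZ)
cypress_dp_mul = peak_gflops(CYPRESS_DP_PER_CLOCK, 1, CYPRESS_CLOCK_MHZ)

fma_ratio = fermi_dp_fma / cypress_dp_fma
mul_ratio = fermi_dp_mul / cypress_dp_mul

print(f"DP FMA: Fermi {fermi_dp_fma:.1f} vs Cypress {cypress_dp_fma:.1f} GFLOPS")
print(f"Fermi advantage: {100 * (fma_ratio - 1):.0f}% (FMA), {100 * (mul_ratio - 1):.0f}% (MUL)")
```

Under these assumptions the ratio is the same for MULs and FMAs, since both GPUs issue them at the same rate as each other; a different final Fermi clock would scale the advantage proportionally.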


Copyright © 1997 Hardware.fr SARL. All rights reserved.

