Me too, I want to be scalar…
Trends come and go, and this is one of them. Nvidia started talking about scalar processors, it worked very well, and so AMD follows suit and speaks of 320 scalar processors for its new GPU. With its 128, the GeForce 8800 looks like it is playing in the minor leagues. But are Nvidia and AMD talking about the same thing? Not really.

Let’s set the GPGPU view aside; a little abstraction lets us look at things differently and concentrate on the more practical aspects. Above all, a GPU processes vertices and pixels. The GeForce 8800 can work on 128 of these elements at the same time, breaking their instructions down into scalar operations that are executed one after another.
The Radeon HD 2000 processes 64 of these elements in parallel but works with 5D units, that is, units capable of processing not one instruction per element but up to five. You may remember that the Radeon X1000 could process four per element for pixel shaders and five for vertex shaders. The number five was chosen so that work done for earlier hardware would remain relevant. This is a 5-way MIMD vector unit, MIMD meaning (as opposed to SIMD) that several different instructions can be processed in parallel. This was already possible with 3+1 co-issue (and even 2+2 on the GeForce 6 and 7). Here AMD pushes the co-issue concept to its limit, since a 1+1+1+1+1 mode becomes possible, along with every other combination.

One of the 64 calculation units of the R600. It is composed of 5 math units, one of which can handle special tasks, plus an extra unit that processes branching operations.

64x5, however, is very different from 320. The 5 instructions issued together cannot depend on each other, whereas on the GeForce 8800 any instruction can follow any other. This means that while the GeForce 8 compiler breaks vector operations into scalar ones, the Radeon HD 2000 compiler does the opposite and tries to assemble simple operations in order to fill the 5D MIMD unit. These are therefore VLIW units (for "very long instruction word"), which means that the instruction sent to the GPU, which combines (or rather tries to combine) several simpler instructions, is a long and complex one of 512 bits (!). This architectural choice allows a higher density of calculation units, but it places a heavy workload on the compiler and reduces the chip’s efficiency. We should add that, luckily for you, Nvidia didn’t opt for this type of architecture, given its recent problems with drivers.
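To make the packing constraint concrete, here is a minimal Python sketch of a greedy bundler. It is not AMD's actual compiler: the function names, the dictionary-based dependency format and the greedy strategy are all assumptions chosen for illustration. It only shows why independent operations fill a 5-wide unit while a dependent chain leaves most slots empty.

```python
from typing import Dict, List, Set

SLOTS = 5  # width of one 5D unit, as described in the text


def pack_vliw(deps: Dict[str, Set[str]], slots: int = SLOTS) -> List[List[str]]:
    """Greedily pack scalar ops into bundles of at most `slots` ops.

    `deps` (hypothetical format) maps each op name to the set of ops whose
    results it needs. An op may only enter a bundle once all of its inputs
    were produced by an earlier bundle.
    """
    done: Set[str] = set()
    pending = list(deps)  # preserves program order
    bundles: List[List[str]] = []
    while pending:
        # Ops whose inputs are all available, limited to the unit width.
        bundle = [op for op in pending if deps[op] <= done][:slots]
        if not bundle:
            raise ValueError("cyclic dependency between operations")
        bundles.append(bundle)
        done.update(bundle)
        pending = [op for op in pending if op not in bundle]
    return bundles


def utilisation(bundles: List[List[str]], slots: int = SLOTS) -> float:
    """Fraction of issue slots actually filled."""
    return sum(len(b) for b in bundles) / (len(bundles) * slots)


# A 4-component multiply plus an unrelated scalar add: nothing depends on
# anything else, so all five ops fit in a single bundle.
independent = {"mul.x": set(), "mul.y": set(), "mul.z": set(),
               "mul.w": set(), "add": set()}

# Five ops where each one needs the previous result: one op per bundle.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}}

print(len(pack_vliw(independent)), utilisation(pack_vliw(independent)))
print(len(pack_vliw(chain)), utilisation(pack_vliw(chain)))
```

In this toy model the independent case needs one bundle and the dependent chain needs five, each one-fifth full. A scalar design would issue five operations in both cases without ever leaving a slot idle.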
As for special functions (sin, cos, exp, log, etc.) and operations on 32-bit integers, on the GeForce 8 they are handled by an extra unit whose throughput is a quarter of that of the other units. On AMD's side, one of the five components of each calculation unit is capable of processing these operations. In other words, a GeForce 8800 can process 128 simple operations + 32 special operations at the same time, while a Radeon HD 2900 can handle 320 simple operations, or 256 simple + 64 special ones.
Does the greater efficiency of Nvidia's architecture make up for this difference? No. But Nvidia uses a technique that Intel used for the ALUs of the Pentium 4: double-pumped units, in other words units that run at twice the clock frequency. Seen from the GeForce 8800’s shader core (675 MHz for the GeForce 8800 GTX, the 1350 MHz figure being a simplification), the units are capable of processing 256 simple operations + 64 special operations per cycle, a per-cycle throughput identical to that of the Radeon HD 2900.
So we have the Radeon HD 2900, with its higher throughput in simple operations (MAD/MUL/ADD) and its higher frequency (742 MHz), up against a more efficient architecture.
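The arithmetic behind this comparison can be laid out in a few lines of Python. This is only a sketch using the unit counts and clocks quoted in the text; the `Gpu` class and its field names are hypothetical, each MAD, MUL or ADD is counted as one operation, and the results are theoretical peaks that ignore how well the units are actually filled.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    simple_per_cycle: int   # peak simple ops (MAD/MUL/ADD) per cycle
    special_per_cycle: int  # special ops (sin, cos, ...) issued alongside
    clock_mhz: float        # clock at which the per-cycle figures apply

    def peak_simple_gops(self) -> float:
        """Peak simple operations per second, in billions."""
        return self.simple_per_cycle * self.clock_mhz / 1000.0


# GeForce 8800 GTX: 128 scalar units and 32 special-function slots, both
# doubled by double pumping when viewed from the 675 MHz shader core.
geforce = Gpu("GeForce 8800 GTX", 128 * 2, 32 * 2, 675.0)

# Radeon HD 2900: 64 units x 5 components, all used for simple ops.
radeon_simple = Gpu("Radeon HD 2900 (simple only)", 64 * 5, 0, 742.0)

# Same chip with one component per unit busy with a special operation.
radeon_mixed = Gpu("Radeon HD 2900 (with special ops)", 64 * 4, 64 * 1, 742.0)

for gpu in (geforce, radeon_simple, radeon_mixed):
    print(f"{gpu.name}: {gpu.simple_per_cycle} simple + "
          f"{gpu.special_per_cycle} special per cycle, "
          f"{gpu.peak_simple_gops():.1f} G simple ops/s peak")
```

On paper the Radeon's peak is higher in both configurations. The efficiency question raised in the text is whether its compiler can keep all five components of each unit busy.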
As for the hardware implementation, on Nvidia's side the scalar processors are arranged in groups of 8 and process blocks of 32 elements, 8 at a time, which helps hide the latency of the calculation units. On AMD's side, the 5D MIMD units are arranged in groups of 16 and process blocks of 64 elements, 16 at a time, for the same reason. In addition, these groups of units generally have two batches of elements to work on, so that they can switch from one to the other and mask processing latency as efficiently as possible.
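The batching scheme can be illustrated with a small Python model. It is a deliberately simplified sketch: the function names are hypothetical, and the strict round-robin alternation between batches after every instruction is an assumption, since real schedulers are more flexible. It shows how many passes separate two consecutive instructions applied to the same elements, which is the time available to hide unit latency.

```python
from typing import List, Tuple


def passes_per_batch(batch_size: int, units: int) -> int:
    """Passes a group of units needs to run one instruction over a batch."""
    return -(-batch_size // units)  # ceiling division


def timeline(batch_size: int, units: int, batches: List[str],
             instructions: int) -> List[Tuple[int, str, int]]:
    """Issue order as (instruction index, batch name, pass index).

    Assumption: the group runs one instruction over a whole batch, then
    switches to the next batch before moving on to the next instruction.
    """
    passes = passes_per_batch(batch_size, units)
    return [(instr, batch, p)
            for instr in range(instructions)
            for batch in batches
            for p in range(passes)]


def issue_gap(slots: List[Tuple[int, str, int]], batch: str) -> int:
    """Passes between instruction 0 and instruction 1 on the same elements."""
    return slots.index((1, batch, 0)) - slots.index((0, batch, 0))


# Nvidia: groups of 8 units working on 32-element blocks.
print(passes_per_batch(32, 8))
# AMD: groups of 16 units working on 64-element blocks.
print(passes_per_batch(64, 16))

# Gap between dependent instructions with one batch, then with two.
print(issue_gap(timeline(32, 8, ["A"], 2), "A"))
print(issue_gap(timeline(32, 8, ["A", "B"], 2), "A"))
```

Under these assumptions both designs take four passes per instruction and per batch, and keeping a second batch in flight doubles the gap between dependent instructions from four passes to eight.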