CUDA's API
CUDA, or Compute Unified Device Architecture, is the architecture that exposes the computing power of GeForce 8 GPUs by letting them run kernels (programs) across a given number of threads. Although CUDA partly covers the GPU itself, which includes more and more optimizations for non-graphics calculations, in practice it mainly concerns software. CUDA therefore consists of a driver, a runtime, libraries (an implementation of BLAS, amongst other things), an API based on an extension of the C programming language, and an accompanying compiler (which hands the part not executed by the GPU to the system’s default compiler).

CUDA is a high level API, meaning that it largely abstracts the hardware, even though the hardware specifications still have to be taken into account to obtain high performance. AMD's CTM, by contrast, is a low level API. Roughly speaking, it is easier to program with CUDA, whereas it is easier to fully optimize code with CTM.
The CUDA driver acts as an intermediary between the compiled code and the GPU. The CUDA runtime is an intermediary between the developer and the driver, and makes programming easier by hiding some of the details. With CUDA, it’s possible either to use the runtime API or to access the driver API directly. The runtime API can be seen as a high level language and the driver API as something between high and low level, allowing deeper, manual optimization of the code. Going in the opposite direction, AMD makes it possible to write kernels in HLSL instead of machine language to make programming easier. While both stick to their initial choices, Nvidia and AMD are each moving a little toward the other's approach.
For this first look at CUDA, we focused on the runtime API. The driver API, however, isn't very different; it simply has more options and less automation.
The runtime API consists of a few extensions to the C language, a host component that runs on the system and controls the GPU(s), a device component that runs on the GPU, and a common component that includes vector types and a subset of the standard C library, which can be executed on the system as well as on the GPU.
Without going into all the details of the added extensions, we are going to present the main ones, which are enough to understand how CUDA works. The first point is a set of function qualifiers that specify where a function is meant to be executed: on the CPU or on the GPU. A kernel, that is, a function called by the CPU and executed by the GPU, is marked with __global__; a function used inside a kernel is marked with __device__; and a standard function is marked with __host__. It isn't obligatory to mention the latter, since it represents the default behavior.
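To make the three qualifiers more concrete, here is a minimal sketch. The function names (square, squareAll, squareOnGpu) are our own inventions for illustration, and the limit of 256 threads in a single block is an arbitrary conservative choice, not a CUDA constant.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square(float x)
{
    return x * x;
}

// __global__: a kernel, called from the CPU and executed on the GPU.
__global__ void squareAll(float *data)
{
    int i = threadIdx.x;          // one thread per element, single block
    data[i] = square(data[i]);
}

// __host__: an ordinary CPU function. The qualifier is optional.
// Hypothetical helper: squares n values in place, returns false on a CUDA error.
__host__ bool squareOnGpu(float *values, int n)
{
    assert(n > 0 && n <= 256);    // arbitrary limit so that one block is enough

    float *dData = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dData, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dData, values, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        squareAll<<<1, n>>>(dData);   // 1 block of n threads
        ok = cudaMemcpy(values, dData, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dData);
    return ok;
}
```

Calling squareOnGpu on an array holding 1, 2 and 3 should leave 1, 4 and 9 in it, provided a CUDA-capable GPU is present.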
The second point is how a kernel is called. Here is the call for a classic function:
Function(parameter);
A kernel is called slightly differently:
Function<<< blocks, threads, memory >>>(parameters);
Here, blocks is the number of blocks of threads to process, threads is the number of threads per block, and memory is an optional amount of memory, in bytes, dynamically allocated in shared memory for each block. blocks * threads is the total number of threads that will be processed by the kernel.
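The three launch parameters can be sketched as follows. This is an illustration under our own assumptions: the kernel and helper names are hypothetical, and the shared array is used only as per-thread scratch space to show where the third parameter ends up. Each thread adds 1 to a counter with atomicAdd, so the result should equal blocks * threads.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Every thread adds 1 to a global counter, so the final value is the
// total number of threads launched.
__global__ void countThreads(int *counter)
{
    extern __shared__ int scratch[];   // size comes from the third <<< >>> argument
    scratch[threadIdx.x] = 1;          // each thread touches only its own slot
    atomicAdd(counter, scratch[threadIdx.x]);
}

// Hypothetical host helper: returns the number of threads that ran, or -1 on error.
int countLaunchedThreads(int blocks, int threads)
{
    assert(blocks > 0 && threads > 0);

    int *dCounter = nullptr;
    if (cudaMalloc((void **)&dCounter, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dCounter, 0, sizeof(int));

    // Third parameter: bytes of dynamic shared memory per block.
    size_t sharedBytes = threads * sizeof(int);
    countThreads<<<blocks, threads, sharedBytes>>>(dCounter);

    int result = -1;
    if (cudaMemcpy(&result, dCounter, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;
    cudaFree(dCounter);
    return result;
}
```

With 4 blocks of 64 threads, the helper should return 256.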
Next, a set of built-in variables (threadIdx, blockIdx, blockDim and gridDim) makes it possible to identify each thread within this mix of blocks. A set of functions is dedicated to controlling the GPU: allocating memory areas, retrieving details on the GPU(s) present in the system, selecting the one on which the code will be executed, etc.
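As a rough sketch of both ideas, the code below first lists the GPUs and selects one, then has each thread compute its own global index from the built-in variables. The helper names are hypothetical, and always picking device 0 is a simplification of ours.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread derives its global index from the built-in variables.
__global__ void writeGlobalIndex(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Lists the GPUs in the system and selects device 0.
// Returns the number of devices, or -1 if none is usable.
int listAndSelectDevice()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess)
            printf("GPU %d: %s, %d multiprocessors\n",
                   d, prop.name, prop.multiProcessorCount);
    }
    return cudaSetDevice(0) == cudaSuccess ? count : -1;
}

// Launches blocks * threads threads and checks that thread i wrote i.
bool indicesMatch(int blocks, int threads)
{
    assert(blocks > 0 && threads > 0);
    int n = blocks * threads;

    int *dOut = nullptr;
    if (cudaMalloc((void **)&dOut, n * sizeof(int)) != cudaSuccess) return false;
    writeGlobalIndex<<<blocks, threads>>>(dOut);

    std::vector<int> host(n, -1);
    cudaError_t err = cudaMemcpy(host.data(), dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != i) return false;
    return true;
}
```

The expression blockIdx.x * blockDim.x + threadIdx.x is the usual way of turning a (block, thread) pair into a position in a one-dimensional array.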
Finally, there is a group of mathematical functions supported by the GPU and a function to synchronize the threads within a block, __syncthreads(). It suspends the execution of a block on a multiprocessor until all of the block's threads have reached this point, in order to avoid read-after-write problems (it is important to make sure that the right information has been written before it’s read).
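A classic case where the barrier matters is a block that reverses an array through shared memory: each thread reads a slot that another thread wrote. The sketch below is our own example (hypothetical names, a single block, and an arbitrary limit of 256 elements); without the __syncthreads() call, a thread could read a slot before it has been written.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Reverses n values in place using one block of n threads.
__global__ void reverseInBlock(int *data, int n)
{
    extern __shared__ int tile[];
    int t = threadIdx.x;

    tile[t] = data[t];           // write phase: each thread fills its own slot
    __syncthreads();             // wait until the whole block has written
    data[t] = tile[n - 1 - t];   // read phase: slot written by another thread
}

// Hypothetical helper: reverses 0..n-1 on the GPU and checks the result.
bool reverseWorks(int n)
{
    assert(n > 0 && n <= 256);   // arbitrary limit so that one block is enough

    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dData = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void **)&dData, bytes) != cudaSuccess) return false;
    cudaMemcpy(dData, host.data(), bytes, cudaMemcpyHostToDevice);

    reverseInBlock<<<1, n, bytes>>>(dData, n);

    cudaError_t err = cudaMemcpy(host.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != n - 1 - i) return false;
    return true;
}
```

Note that the barrier only covers the threads of one block; it does not synchronize different blocks with each other.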
These extensions control the GPU, and anyone with a good knowledge of C will be able to use them easily. To use the GPU properly, it is imperative to spread the workload over grids of blocks, whose size has to be adapted on a case by case basis to maximize the utilization of the calculation units.
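One common way of spreading a workload, shown here only as a sketch, is to fix the number of threads per block, round the number of blocks up so that every element is covered, and have surplus threads do nothing. The names below are hypothetical and the best block size remains something to tune for each case.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements, rounded up.
int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                   // the last block may have surplus threads
        data[i] *= factor;
}

// Hypothetical helper: multiplies n values of 2.0 by 3.0 on the GPU
// and checks that every element became 6.0.
bool scaleMatches(int n, int threadsPerBlock)
{
    assert(n > 0);
    std::vector<float> host(n, 2.0f);

    float *dData = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dData, bytes) != cudaSuccess) return false;
    cudaMemcpy(dData, host.data(), bytes, cudaMemcpyHostToDevice);

    scale<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dData, n, 3.0f);

    cudaError_t err = cudaMemcpy(host.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != 6.0f) return false;
    return true;
}
```

For 1000 elements and 256 threads per block, blocksFor gives 4 blocks, or 1024 threads, of which the last 24 skip the work thanks to the bounds check.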

API and 3D interoperability
CUDA provides a number of functions for interoperability with 3D APIs, via buffer objects in OpenGL and vertex buffers in Direct3D. CUDA can therefore be used to process data that will be used directly by 3D rendering. For example, it’s possible to compute physics with CUDA and feed the results into rendering.