There are a handful of main stages in a typical CUDA program

Question

Name each stage, explain what you do as a programmer in each of these stages. Give pseudocode examples of lines of programs in each stage:

    • Stage Name:
      • Stage Explanation
      • Example code
    • Stage Name:
      • Stage Explanation
      • Example code
    • Stage Name:
      • Stage Explanation
      • Example code
    • Stage Name:
      • Stage Explanation
      • Example code

Summary

There are basically six stages in a typical CUDA program.

    • Memory Allocation for host (CPU) variables.

    • Memory Allocation for device (GPU) variables.

    • Copying host variables to device variables.

    • Calling the kernel.

    • Copying the device variables back to host variables.

    • Deallocating memory for device variables.

All these stages are explained in detail below.

Explanation

Stage-1:

(Memory Allocation for host variables)
Host variables are simply CPU variables; their memory lives in the host's main memory (RAM).
For ordinary (non-pointer) variables, declaring them is enough to allocate the memory. For pointers, the memory is allocated explicitly with the malloc() function.

Example:

int a[10], b[10], c[10];
In the above example, memory is allocated for three host arrays a, b, and c, each holding 10 integers.
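If the host arrays were instead declared as pointers, the allocation would use malloc(); a minimal sketch (requires #include <stdlib.h>; the free() calls belong at the end of the program):

int *a = (int *)malloc(10 * sizeof(int));   /* host heap allocation */
int *b = (int *)malloc(10 * sizeof(int));
int *c = (int *)malloc(10 * sizeof(int));
/* ... later, when done: free(a); free(b); free(c); */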

 

Stage-2: (Memory allocation for device variables)

Device variables are GPU variables; memory for them is allocated in the GPU's device (global) memory.
The declaration in C looks the same as for normal pointer variables.
However, to allocate memory on the GPU, a separate function named cudaMalloc() is used.

Example:

int *dev_a, *dev_b, *dev_c;
The device variables dev_a, dev_b, and dev_c are declared.
cudaMalloc((void**)&dev_a, 10*sizeof(int));
This allocates dev_a enough device memory for 10 integers, i.e., 10*sizeof(int) bytes (40 bytes on platforms where int is 4 bytes).
cudaMalloc((void**)&dev_b, 10*sizeof(int));
cudaMalloc((void**)&dev_c, 10*sizeof(int));
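cudaMalloc() returns a cudaError_t, so a defensive variant might check the result (the error-handling style here is a suggestion, not part of the original; requires #include <stdio.h>):

cudaError_t err = cudaMalloc((void**)&dev_a, 10*sizeof(int));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;   /* abort if device allocation failed */
}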

 

Stage-3: (Copying host variables to device variables)

Host variables live in the CPU's memory, but the GPU can only work on data that resides in its own device memory. Hence, a separate function named cudaMemcpy() is used to copy host variables to device variables.
The function's first argument is the destination, the second is the source, the third is the size in bytes, and the last is the kind (direction) of copy.
The kind of copy can be:
cudaMemcpyHostToDevice (to copy from host to device).
cudaMemcpyDeviceToHost (to copy from device to host).

Example:

cudaMemcpy(dev_a, a, 10*sizeof(int), cudaMemcpyHostToDevice);
copies array ‘a’ to device variable ‘dev_a’.
cudaMemcpy(dev_b, b, 10*sizeof(int), cudaMemcpyHostToDevice);
copies array ‘b’ to device variable ‘dev_b’.
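For the copies to be meaningful, the host arrays need values first; an initialization such as this (the values are arbitrary, purely for illustration) could precede the cudaMemcpy calls:

for (int i = 0; i < 10; i++) {
    a[i] = i;        /* e.g., 0, 1, ..., 9 */
    b[i] = 10 * i;   /* e.g., 0, 10, ..., 90 */
}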

 

Stage-4: (Calling the kernel)

We define a kernel function that runs on the GPU. All the threads execute in parallel, which reduces the wall-clock running time by spreading the work across many threads at once.
An example kernel function definition:
__global__ void add(int *a, int *b, int *c){
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    c[i] = a[i] + b[i];
}
In the above code, the qualifier determines where the function is called from and where it runs:
__global__ indicates the function is called from the CPU (host) and executes on the GPU (device).
__device__ indicates the function is both called and executed on the GPU.
__host__ indicates the function is both called and executed on the CPU (this is the default for plain C functions).
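As a small illustration of the other two qualifiers (addInts and square are hypothetical helpers, not from the original):

__device__ int addInts(int x, int y) { return x + y; }  /* callable only from GPU code */
__host__ int square(int x) { return x * x; }            /* plain CPU function; __host__ is the default */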

Example:

The example shows the kernel launch inside main.
add<<<2, 5>>>(dev_a, dev_b, dev_c);
In the above call, two parameters are passed inside the triple angle brackets; they specify the number of blocks and the number of threads per block, respectively.
Here, arrays ‘a’ and ‘b’ each hold 10 elements, and each element-wise sum is stored in array ‘c’.
So ten threads are needed: 2 blocks with 5 threads each gives 10 threads in total, one per element.
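If the element count were not an exact multiple of the block size, a common pattern is to round the block count up and guard the kernel body with a bounds check (the names N and threadsPerBlock below are illustrative, not from the original):

int N = 10;
int threadsPerBlock = 5;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   /* ceiling division */
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);
/* the kernel would then take N as a parameter and guard with: if (i < N) c[i] = a[i] + b[i]; */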

 

Stage-5: (Copying device variables back to host variables)

Finally, the result must end up back in host memory. Hence, the device result variable is copied back to the corresponding host variable using the same cudaMemcpy() function as before, just with the fourth argument changed to cudaMemcpyDeviceToHost, and with the source and destination swapped.

Example:

cudaMemcpy(c, dev_c, 10*sizeof(int), cudaMemcpyDeviceToHost);
copies the result in device variable ‘dev_c’ back to host array ‘c’.
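Since cudaMemcpy() blocks until the transfer finishes, the results can be checked on the host right after the copy; a minimal verification sketch (the printed format is illustrative; requires #include <stdio.h>):

for (int i = 0; i < 10; i++)
    printf("%d + %d = %d\n", a[i], b[i], c[i]);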

 

Stage-6: (Deallocating memory for device variables)

Finally, the memory allocated to the device variables has to be deallocated so that the GPU's device memory is freed. For this, the function cudaFree() is used; it is passed the pointer to deallocate.

Example:

cudaFree(dev_a);
deallocates ‘dev_a’.
cudaFree(dev_b);
deallocates ‘dev_b’.
cudaFree(dev_c);
deallocates ‘dev_c’.
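Putting all six stages together, one possible end-to-end sketch (assembled from the snippets above, with illustrative initial values; not a verbatim original listing):

#include <stdio.h>

__global__ void add(int *a, int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    c[i] = a[i] + b[i];
}

int main(void) {
    int a[10], b[10], c[10];                          /* Stage 1: host allocation */
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 10 * i; }  /* sample inputs */

    cudaMalloc((void**)&dev_a, 10 * sizeof(int));     /* Stage 2: device allocation */
    cudaMalloc((void**)&dev_b, 10 * sizeof(int));
    cudaMalloc((void**)&dev_c, 10 * sizeof(int));

    cudaMemcpy(dev_a, a, 10 * sizeof(int), cudaMemcpyHostToDevice);  /* Stage 3 */
    cudaMemcpy(dev_b, b, 10 * sizeof(int), cudaMemcpyHostToDevice);

    add<<<2, 5>>>(dev_a, dev_b, dev_c);               /* Stage 4: kernel launch */

    cudaMemcpy(c, dev_c, 10 * sizeof(int), cudaMemcpyDeviceToHost);  /* Stage 5 */

    for (int i = 0; i < 10; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a);                                  /* Stage 6: device deallocation */
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}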

 
