GPU_Assignment-3_Solution

NPTEL Online Certification Courses

Indian Institute of Technology Kharagpur

GPU Architectures
and Programming
Assignment- Week 3

TYPE OF QUESTION: Objective

Number of questions: 10
Total marks: 10 × 1 = 10

QUESTION 1:
How are CUDA threads invoked to execute a kernel from the host?
Options:
A) Using a loop structure
B) With the <<<...>>> execution configuration syntax
C) By specifying thread IDs in the main function
D) Automatically by the GPU scheduler
Answer:
B) With the <<<...>>> execution configuration syntax
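A minimal sketch of such a launch (the kernel name and launch parameters below are illustrative, not taken from the assignment):

```cuda
// __global__ marks the kernel as launchable from the host.
__global__ void myKernel(int *data) {
    data[threadIdx.x] = threadIdx.x;
}

int main() {
    int *d_data;
    cudaMalloc((void **)&d_data, 64 * sizeof(int));
    // The host invokes the kernel with the <<<blocks, threadsPerBlock>>> syntax:
    myKernel<<<1, 64>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```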
QUESTION 2:
What is the purpose of the threadIdx built-in variable in a CUDA kernel?
Options:
A) Provides a random number
B) Identifies the current CUDA block
C) Gives the total number of threads
D) Provides a unique identifier for each thread
Answer:
D) Provides a unique identifier for each thread
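In practice, threadIdx is combined with blockIdx and blockDim to derive a globally unique index per thread, as in this sketch (the kernel and array names are hypothetical):

```cuda
__global__ void scaleArray(float *a, int n) {
    // threadIdx.x is unique only within a block; adding the block
    // offset yields a unique index across the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2.0f;
}
```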
QUESTION 3:
Any function that is launched from the host and executed on the GPU as a kernel should be qualified with which
keyword?
Options:
A) __device__
B) __host__
C) __kernel__
D) __global__
Answer:
D) __global__
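The qualifiers differ in where the function runs and from where it may be called; a sketch contrasting __global__ with __device__ (names are illustrative):

```cuda
// GPU-only helper: callable from device code, not launchable from the host.
__device__ float square(float x) { return x * x; }

// Kernel: launched from the host, executed on the GPU.
__global__ void squareAll(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);
}
```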
QUESTION 4:
What does the <<<1, N>>> syntax signify in the kernel invocation VecAdd<<<1, N>>>(A, B, C)?
Options:
A) 1 block of threads, N threads per block
B) N blocks of threads, 1 thread per block
C) N blocks with variable thread count
D) 1 thread per block, 1 block in total
Answer:
A) 1 block of threads, N threads per block
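A sketch of such a kernel: with a single block, threadIdx.x alone identifies each of the N threads, so no block offset is needed.

```cuda
__global__ void VecAdd(float *A, float *B, float *C) {
    int i = threadIdx.x;  // one block, so threadIdx.x is already the global index
    C[i] = A[i] + B[i];
}

// Launch: 1 block of N threads, each adding one pair of elements.
// VecAdd<<<1, N>>>(A, B, C);
```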

QUESTION 5:
Given a GPU with 10 streaming multiprocessors, each supporting a maximum of 1024 threads per SM, and a
CUDA kernel is launched with a block size of 128 threads, calculate the maximum number of active blocks
on the GPU.
Options:
A. 80
B. 100
C. 200
D. 1280
Answer:
A. 80
Detailed Solution:
Maximum active blocks per SM = Threads per SM / Threads per block = 1024 / 128 = 8
Maximum active blocks on the GPU = Blocks per SM × Number of SMs = 8 × 10 = 80

QUESTION 6:
Calculate the execution time (in seconds) for a CUDA kernel that processes 8192 elements with a block size
of 128 threads and an average execution time of 2 milliseconds per block, considering that only one SM is
available on the target GPU for executing the blocks.
Options:
A. 0.512 seconds
B. 0.256 seconds
C. 1.024 seconds
D. 0.128 seconds
Answer:
D. 0.128 seconds
Detailed Solution:
Number of blocks = 8192 / 128 = 64. With a single SM executing the blocks one after another,
execution time = 64 × 2 ms = 128 ms = 0.128 seconds.
QUESTION 7:
Given a CUDA kernel with a grid size of 2 blocks and 256 threads per block, calculate the total number of
threads launched by the kernel.
Options:
A. 256
B. 512
C. 1024
D. 4096
Answer:
B. 512
Detailed Solution:
Total threads launched = Number of blocks × Threads per block = 2 × 256 = 512
QUESTION 8:
What is the CUDA function call required to copy an array h_A from the CPU memory to the GPU
memory, where it is known as d_A?
Options:
A. cudaMemcpy(h_A, d_A, size, cudaMemcpyHostToDevice);
B. cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
C. cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
D. cudaMemcpy(d_A, h_A, size, cudaMemcpyDeviceToHost);
Answer:
B. cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
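In context: the destination pointer comes first, the source second, followed by the byte count and the direction flag. A sketch (N and the surrounding setup are illustrative):

```cuda
int N = 1024;
size_t size = N * sizeof(float);
float *h_A = (float *)malloc(size);  // host array
float *d_A;
cudaMalloc((void **)&d_A, size);     // device array
// Destination first, source second, then size in bytes and direction:
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
```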

QUESTION 9:
Which of the following options is true regarding the matrix multiplication kernel in the code shown
below:

// d_M and d_N are the input matrices, N is the number of rows and
// columns, and d_P is the product matrix
__global__ void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int N) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i < N) && (j < N)) {
        float Pvalue = 0.0f;
        for (int k = 0; k < N; ++k) {
            Pvalue += d_M[i * N + k] * d_N[k * N + j];
        }
        d_P[i * N + j] = Pvalue;
    }
}
Options:
A. The kernel iterates over each element of the output matrix (d_P) parallelly and calculates its
value using a nested loop that iterates over the corresponding row of the first matrix (d_M)
and the corresponding column of the second matrix (d_N) sequentially.
B. The kernel iterates over each element of the output matrix (d_P) sequentially and calculates
its value using a nested loop that iterates over the corresponding row of the first matrix
(d_M) and the corresponding column of the second matrix (d_N) parallelly.
C. The computation of individual elements in the product matrix d_P can be carried out
parallelly using threads along a different dimension than the ones used for the parallel
computation of the entire product matrix.
D. The computation of individual elements in the product matrix d_P can be carried out
parallelly using threads along one of the same dimensions as the ones used for the parallel
computation of the entire product matrix.
Answer:
A. The kernel iterates over each element of the output matrix (d_P) parallelly and calculates its
value using a nested loop that iterates over the corresponding row of the first matrix (d_M) and the
corresponding column of the second matrix (d_N) sequentially.
Detailed Solution: Each thread computes one element d_P[i][j], so the output elements are produced
in parallel; within each thread, the loop over k walks row i of d_M and column j of d_N sequentially.
QUESTION 10:
Which of the following statements regarding CUDA memory allocation is false?
Options:
A. It is possible to allocate memory in a CUDA device kernel for an integer array.
B. It is possible to allocate memory in a CUDA device kernel by passing the pointer to an
integer array.
C. An array created inside a CUDA device kernel cannot be directly dereferenced in the host
side.
D. An array created inside a CUDA device kernel can be copied to another CUDA device
kernel by calling the function cudaMemcpy using the flag cudaMemcpyDeviceToDevice.
Answer: B. It is possible to allocate memory in a CUDA device kernel by passing the pointer to an
integer array.
Detailed Solution: Memory is not allocated by passing a typed pointer to an integer array; the
pointer has to be cast to a void pointer before passing (for example, cudaMalloc expects a void **
argument, and in-kernel malloc returns a void *).
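For reference, device-side allocation (as in option A) uses the in-kernel malloc/free pair, which works with untyped pointers; a sketch (the kernel name is illustrative):

```cuda
__global__ void deviceAllocKernel(int n) {
    // In-kernel malloc returns a void *, which is cast to the element
    // type; the resulting array lives in device global memory and
    // cannot be dereferenced directly on the host.
    int *arr = (int *)malloc(n * sizeof(int));
    if (arr != NULL) {
        for (int k = 0; k < n; ++k)
            arr[k] = k;
        free(arr);
    }
}
```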
