问题:能否在主机端创建CPU多线程,在每个线程里调用设备端核函数的caller函数,进而实现进一步的并行运行。
例如有5张图片,对于每张图片都有N个GPU线程对其进行像素操作,但是此时是逐一对这5张图片处理的,想在主机端创建5个CPU线程,每个线程里进行 传输到设备端–>设备端GPU多线程处理–>结果返回主机端 这一系列操作,实现五张图片同时处理
此方法能否实现: 不能
只存在一个流时(默认的流),所有调用核函数的指令将被存在一个队列中,依次执行。因此直接使用CPU多线程调用kernel函数不能达到并行的目的,此时即便能运行也与串行运行的效果相同,只有通过使用多流才能进一步加速。
Question:
I have a CUDA kernel that do my hard work, but I also have some hard work that need to be done in the CPU (calculations with two positions of the same array) that I could not write in CUDA (because CUDA threads are not synchronous, I need to perform a hard work on a position X of an array and after do z[x] = y[x] - y[x - 1], where y is the array resultant of a CUDA kernel where each thread works on one position of this array and z is another array storing the result). So I’m doing this in the CPU.
I have several CPU threads to do the CPU side work, but each one is calling a CUDA kernel passing some data. My question is: what happens on the GPU side when multiple CPU threads are making GPU calls? Would be better if I do the CUDA kernel call once and then create multiple CPU threads to do the CPU side work?
回答:
Kernel calls are queued and executed one by one in single stream.
However you can specify stream during kernel execution - then CUDA operations in different streams may run concurrently and operations from different streams may be interleaved. Default stream is 0.
Things are similar when different processes use the same card.
Also remember that kernels are executed asynchronously from CPU stuff.