CUDA Programming Notes: Can CPU Multithreading Be Used to Call CUDA Kernels?

License: CC 4.0 BY-SA

Tags: #multithreading #concurrent-programming #cuda #cpu

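In CUDA, launching kernels from multiple CPU threads does not by itself give real parallelism, because the launches are placed in a single stream and executed one after another. To get concurrency, use CUDA streams: give each thread its own stream so that the GPU can run work from different streams concurrently. This suits workloads made of several independent tasks, such as batch-processing multiple images.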

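Question: can I create several CPU threads on the host, have each thread call the caller function that launches a device kernel, and thereby get an additional level of parallelism?
For example, there are 5 images, and each image is processed by N GPU threads doing per-pixel operations, but the 5 images are currently handled one at a time. The idea is to create 5 CPU threads on the host, each running the full sequence of transfer to the device -> multithreaded GPU processing on the device -> copy the result back to the host, so that all five images are processed at the same time.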

Does this approach work as-is? No.

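When only one stream exists (the default stream), every kernel launch is placed in a single queue and executed in order. Launching kernels from multiple CPU threads therefore does not achieve parallelism on its own: even if the program runs, the effect is the same as serial execution. Only multiple streams can provide a further speedup. This describes the legacy default stream, which nvcc uses unless `--default-stream per-thread` is specified.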
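As a rough sketch of the multi-stream approach for the five-image scenario, each host thread below creates its own stream and issues its upload, kernel, and download into that stream. The kernel `invertPixels`, the helper `processImage`, the image size, and the fill value are all hypothetical and exist only for illustration; error checking is omitted for brevity. Whether the work actually overlaps depends on the GPU and on how many resources each kernel occupies.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-pixel operation: invert an 8-bit grayscale image.
__global__ void invertPixels(unsigned char* pixels, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pixels[i] = 255 - pixels[i];
}

// Work done by one host thread: upload, process, download, all in its own stream.
void processImage(unsigned char* hostPixels, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    unsigned char* devPixels = nullptr;
    cudaMalloc((void**)&devPixels, n);

    cudaMemcpyAsync(devPixels, hostPixels, n, cudaMemcpyHostToDevice, stream);
    int block = 256;
    int grid = (n + block - 1) / block;
    invertPixels<<<grid, block, 0, stream>>>(devPixels, n);  // 4th argument selects the stream
    cudaMemcpyAsync(hostPixels, devPixels, n, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait only for this thread's stream
    cudaFree(devPixels);
    cudaStreamDestroy(stream);
}

// Runs the five-image demo and reports whether every pixel was inverted.
bool runThreadedDemo() {
    const int numImages = 5;
    const int n = 1 << 20;  // assumed image size in bytes

    // Pinned host memory lets cudaMemcpyAsync overlap with other work.
    std::vector<unsigned char*> images(numImages, nullptr);
    for (auto& img : images) {
        cudaMallocHost((void**)&img, n);
        for (int i = 0; i < n; ++i) img[i] = 10;
    }

    std::vector<std::thread> workers;
    for (auto* img : images) workers.emplace_back(processImage, img, n);
    for (auto& w : workers) w.join();

    bool ok = true;
    for (auto* img : images) {
        for (int i = 0; i < n; ++i) {
            if (img[i] != 245) ok = false;  // 255 - 10
        }
        cudaFreeHost(img);
    }
    return ok;
}

int main() {
    bool ok = runThreadedDemo();
    printf("%s\n", ok ? "all images processed" : "mismatch");
    return ok ? 0 : 1;
}
```

Something like `nvcc demo.cu -o demo` should build this; on Linux the host compiler may also need `-Xcompiler -pthread`. The host threads are not strictly required for stream concurrency, since a single thread can issue work into several streams, but they match the structure described in the question.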

Reference:
https://stackoverflow.com/questions/13061619/what-happen-if-a-cuda-kernel-is-called-from-multiple-pthreads-simultaneously

Question:
I have a CUDA kernel that do my hard work, but I also have some hard work that need to be done in the CPU (calculations with two positions of the same array) that I could not write in CUDA (because CUDA threads are not synchronous, I need to perform a hard work on a position X of an array and after do z[x] = y[x] - y[x - 1], where y is the array resultant of a CUDA kernel where each thread works on one position of this array and z is another array storing the result). So I’m doing this in the CPU.

I have several CPU threads to do the CPU side work, but each one is calling a CUDA kernel passing some data. My question is: what happens on the GPU side when multiple CPU threads are making GPU calls? Would be better if I do the CUDA kernel call once and then create multiple CPU threads to do the CPU side work?

Answer:
Kernel calls are queued and executed one by one in single stream.

However you can specify stream during kernel execution - then CUDA operations in different streams may run concurrently and operations from different streams may be interleaved. Default stream is 0.

See: http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
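To make the point about specifying a stream concrete, here is a minimal sketch in which a single host thread launches the same kernel into several streams through the fourth launch-configuration argument. The kernel `addOne`, the helper `runStreamsDemo`, and the sizes are hypothetical names chosen for this example; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: add 1.0f to every element.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Launches one kernel per stream and checks the first element of each buffer.
bool runStreamsDemo() {
    const int numStreams = 4;
    const int n = 1 << 18;
    const size_t bytes = n * sizeof(float);

    cudaStream_t streams[numStreams];
    float* dev[numStreams];
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void**)&dev[s], bytes);
        cudaMemsetAsync(dev[s], 0, bytes, streams[s]);  // all-zero bytes == 0.0f
    }

    int block = 256;
    int grid = (n + block - 1) / block;
    for (int s = 0; s < numStreams; ++s) {
        // <<<grid, block, sharedMemBytes, stream>>>: omitting the last two
        // arguments would send the launch to the default stream (stream 0).
        addOne<<<grid, block, 0, streams[s]>>>(dev[s], n);
    }

    cudaDeviceSynchronize();  // wait for every stream

    bool ok = true;
    for (int s = 0; s < numStreams; ++s) {
        float first = 0.0f;
        cudaMemcpy(&first, dev[s], sizeof(float), cudaMemcpyDeviceToHost);
        if (first != 1.0f) ok = false;
        cudaFree(dev[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}

int main() {
    bool ok = runStreamsDemo();
    printf("%s\n", ok ? "streams ok" : "mismatch");
    return ok ? 0 : 1;
}
```

Operations inside one stream still run in issue order; only operations in different streams are allowed to overlap, and the hardware decides whether they actually do.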

Things are similar when different processes use the same card.
Also remember that kernels are executed asynchronously from CPU stuff.
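The last remark, that kernels run asynchronously with respect to the CPU, can be illustrated with a small sketch: the host issues a copy, a kernel, and a copy back into a stream, does unrelated CPU work while the GPU is busy, and synchronizes only before reading the result. The kernel `scale`, the helper `runAsyncDemo`, and the dummy CPU loop are assumptions made for this example; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Overlaps GPU work with independent CPU work, then checks both results.
bool runAsyncDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* host = nullptr;
    cudaMallocHost((void**)&host, bytes);  // pinned memory for async copies
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev = nullptr;
    cudaMalloc((void**)&dev, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int block = 256;
    int grid = (n + block - 1) / block;
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<grid, block, 0, stream>>>(dev, n, 2.0f);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // The calls above return without waiting for the GPU, so the CPU can do
    // other work here, as long as it does not touch `host` yet.
    long long cpuSum = 0;
    for (int i = 0; i < 1000; ++i) cpuSum += i;

    cudaStreamSynchronize(stream);  // results in `host` are valid only after this

    bool ok = (cpuSum == 499500) && (host[0] == 2.0f) && (host[n - 1] == 2.0f);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}

int main() {
    bool ok = runAsyncDemo();
    printf("%s\n", ok ? "async ok" : "mismatch");
    return ok ? 0 : 1;
}
```

Reading `host` before the `cudaStreamSynchronize` call would be a race, which is the practical consequence of the asynchronous behaviour mentioned in the answer.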