Fundamentals of GPU programming (2)


Parallelism and GPU Architecture

The CPU is optimized to process a single sequence of instructions as fast as possible, but serial performance has run into several walls: the memory wall, the power wall, and the limits of instruction-level parallelism.

Two ways to speed up
Given a process that requires time T, we can use P processors to reduce the processing time to, ideally, T/P.

  • Task parallelism: break the problem up into T ≥ P tasks and hand them off to the processors.
  • Data parallelism: break the input/output data into D ≥ P subsets and launch one thread for each piece of data.

Task Parallelism

Assign the first P tasks to the processors --> when any processor finishes its task T_n, it moves on to the next unassigned task T_{P+1} --> repeat until all tasks are completed

This has generally been the primary model for cluster computing and supercomputing.
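The refill loop above (hand out the first P tasks, then give each processor a new task as it finishes) is exactly what a worker pool does. A minimal sketch in Python; the task list and `do_task` function are hypothetical stand-ins for real work:

```python
from concurrent.futures import ThreadPoolExecutor

def do_task(n):
    # Stand-in for one unit of work; here it just squares its input.
    return n * n

P = 4                    # number of processors (worker threads)
tasks = list(range(12))  # T tasks, with T >= P

# The pool gives the first P tasks to the workers; whenever a worker
# finishes task T_n it receives the next unassigned task, until all
# T tasks are completed.
with ThreadPoolExecutor(max_workers=P) as pool:
    results = list(pool.map(do_task, tasks))

print(results)  # [0, 1, 4, 9, ..., 121], in task order
```

`pool.map` preserves task order in the results even though tasks may finish out of order.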

Data Parallelism

Launch the first P threads on different processors --> once any thread T_n completes, launch another thread --> repeat until all threads have completed
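The same scheduling idea applied to data: split the input into D ≥ P chunks and run one worker per chunk, all applying the same operation. A sketch under illustrative assumptions (the data, chunk count, and per-element operation are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Every worker applies the same operation to its own slice of the data.
    return [x + 1 for x in chunk]

P = 4                    # number of processors
data = list(range(100))  # the full input
D = 10                   # number of data subsets, D >= P

# Contiguous split of the input into D chunks.
size = (len(data) + D - 1) // D
chunks = [data[i * size:(i + 1) * size] for i in range(D)]

# P workers process the D chunks; as each chunk finishes, the next is started.
with ThreadPoolExecutor(max_workers=P) as pool:
    partials = pool.map(process_chunk, chunks)

result = [y for part in partials for y in part]
print(result[:5])  # [1, 2, 3, 4, 5]
```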

SIMD -- Single Instruction, Multiple Data

  • All cores execute the same instruction, but each operates on different data.
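A CPU-side analogy for this execution model: in NumPy's elementwise operations, one instruction (a vectorized add) is applied across all elements, with each lane seeing different operands. This is only an illustration of the SIMD idea, not GPU code:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# One "instruction" (elementwise add) applied to every element at once;
# each lane combines a different pair of operands.
c = a + b
print(c)  # [11 22 33 44]
```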

Guiding principles

The CPU is always faster for a serial process or for small data. On the GPU, every instruction counts: because cores execute in lockstep, if statements or loops with a variable number of iterations can cause stalls while some cores sit idle waiting for others.
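Why a variable-iteration loop stalls a SIMD group: lanes run in lockstep, so the whole group pays for the slowest lane. A toy cost model, assuming a hypothetical group of 8 lanes that each need a different iteration count:

```python
# Iteration counts needed by each of 8 SIMD lanes (hypothetical workload);
# one lane needs far more iterations than the rest.
iters = [1, 2, 1, 3, 1, 2, 16, 1]

# Independent processors: total work is just the sum of the iterations.
independent_work = sum(iters)            # 27 lane-cycles

# SIMD lockstep: every lane ticks until the slowest lane finishes, so the
# group occupies all 8 lanes for max(iters) cycles.
lockstep_work = max(iters) * len(iters)  # 16 * 8 = 128 lane-cycles

print(independent_work, lockstep_work)   # 27 128
```

Most of the lockstep cost is idle lanes waiting on the one slow lane, which is why divergent control flow is expensive on GPUs.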
