Lecture 17-Introduction to GPU

The document discusses the evolution and functionality of multi-core CPUs and many-core GPUs, highlighting their architectures and performance characteristics. It covers the history of GPUs, their role in rendering graphics, and their transition into parallel computing engines for various applications. Additionally, it introduces concepts like latency, throughput, and the graphics rendering pipeline, emphasizing the capabilities of GPUs in data parallelism and general-purpose computing.

Applied High-Performance Computing and Parallel Programming

Presenter: Liangqiong Qu

Assistant Professor

The University of Hong Kong


Outline

▪ Multi-core CPUs and many-core GPUs
▪ GPU history
▪ Rendering a picture with GPU
▪ How GPU evolved into highly parallel compute engines for a broad class of applications
Review of Last Lecture: Multi-core CPU Processors

Significant advances in CPU

[Figure: Von Neumann architecture, single-core processor]

• Multi-core architecture with dual, quad, six or n processing cores
• Processing cores are all on one chip
• Multi-core CPU chip architecture
• Hierarchy of caches (on/off chip)
• Clock rate for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
• Pushing the clock rate higher, toward 5 GHz, unfortunately reached a limit due to power limitations / heat
Review of Last Lecture: Many-core GPUs
▪ Use of very many simple cores
• High throughput computing-oriented architecture
• Use massive parallelism by executing a lot of concurrent threads slowly
• Handle an ever increasing amount of multiple instruction threads
• CPUs instead typically execute a single long thread as fast as possible
▪ Many-core GPUs are used in large clusters and within massively parallel supercomputers today

Graphics Processing Unit (GPU) is great for data parallelism and task parallelism
Compared to multi-core CPUs, GPUs consist of a many-core architecture with hundreds to even thousands of very simple cores executing threads rather slowly
Review of Last Lecture: Multi-core CPU and Many-core GPU
• Both the CPU and GPU are silicon-based microprocessors and have similar internal components, including cores, memory, and a control unit.
• The core functionality of the CPU is to fetch instructions from RAM and then decode and execute them, i.e., running processes serially.
• The CPU is a powerful general-purpose processor that handles all types of computing tasks required for the operating system and applications to run.
• A GPU contains thousands of smaller, more specialized cores, each less powerful than a CPU core.
• A GPU breaks tasks into smaller chunks, runs them in parallel, and processes high volumes of the same instructions rapidly.
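
The last bullet can be illustrated with a minimal Python sketch: one task is split into chunks, the same instructions run on every chunk, and the partial results are reassembled in order. The names `square_chunk` and `parallel_square` are hypothetical, and a CPU thread pool only stands in for GPU cores here; it shows the decomposition, not GPU-level speed.

```python
from concurrent.futures import ThreadPoolExecutor


def square_chunk(chunk):
    """The same instructions applied to every element of one chunk."""
    return [x * x for x in chunk]


def parallel_square(data, num_workers=4):
    """Split data into chunks, process the chunks concurrently, reassemble in order."""
    # Ceiling division so that at most num_workers chunks are created.
    chunk_size = max(1, (len(data) + num_workers - 1) // num_workers)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        results = list(pool.map(square_chunk, chunks))
    return [y for chunk in results for y in chunk]


print(parallel_square(list(range(10))))
```

Because no chunk depends on another, the chunks could in principle run on as many cores as are available, which is the property a GPU exploits.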
Review of Previous Lecture: Latency vs Throughput
• Latency: the delay between when a request is issued and when it completes, for example between a user action on a network or web application and its arrival at the destination. It is often measured in milliseconds.
• Throughput: the amount of work completed, such as the volume of data that passes through a network, in a given period.
• Example: assume only one car can be in a lane of the highway at once.
• When a car on the highway reaches Stanford, the next car leaves San Francisco.

What’s the latency and throughput here?

Picture from: https://gfxcourses.stanford.edu/cs149/fall23/lecture/multicore2-ispc/


Latency vs Throughput
• Assume only one car can be in a lane of the highway at once.
• When a car on the highway reaches Stanford, the next car leaves San Francisco.

Latency of driving from San Francisco to Stanford: 0.5 hr.
Throughput: 2 cars per hour.
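
The two numbers above follow from the one-car-at-a-time rule. Here is a minimal Python sketch of the arithmetic, assuming a 50 km route driven at 100 km/h (these figures are not given in the slides and are chosen only so that one trip takes 0.5 hr). The `lanes` parameter is also an assumption, added to show one way of raising throughput without changing latency.

```python
def latency_hours(distance_km, speed_kmh):
    """Time for one car to complete the trip."""
    return distance_km / speed_kmh


def throughput_cars_per_hour(distance_km, speed_kmh, lanes=1):
    """Cars arriving per hour when each lane holds only one car at a time."""
    return lanes / latency_hours(distance_km, speed_kmh)


# Assumed figures (not from the slides): 50 km route at 100 km/h.
DISTANCE_KM, SPEED_KMH = 50.0, 100.0

print(latency_hours(DISTANCE_KM, SPEED_KMH))                # 0.5
print(throughput_cars_per_hour(DISTANCE_KM, SPEED_KMH))     # 2.0
print(throughput_cars_per_hour(DISTANCE_KM, SPEED_KMH, 4))  # 8.0
```

With four lanes each car still needs 0.5 hr, but four times as many cars arrive per hour. Adding many slow lanes rather than one faster lane is the throughput-oriented choice that GPUs make.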
Improving Throughput
Review of Last Lecture: CPU vs GPU

CPU: lower latency
• CPU handles all the main functions of a computer.
• CPU has a few cores, each with higher clock speeds and larger caches.

GPU: high throughput
• GPU is a specialized component that is best at running many smaller tasks at once.
• GPU has a massive number of smaller and more specialized cores (less powerful than CPU cores).
GPU History

What were GPUs originally designed to do?

3D rendering

Simple definition of the rendering task: computing how each triangle in a 3D mesh contributes to the appearance of each pixel in the image.
GPU History
What GPUs are still designed to do

Unreal Engine 5.4 demo


Real-time rendering on a high-end GPU
Tips: How to Explain a System
Step 1: describe the things (key entities) that are manipulated
- The nouns

Step 2: describe operations the system performs on the entities
- The verbs
Real-time Graphics Primitives (Entities)
Represent surface as a 3D triangle mesh

Vertices have the following attributes: (1) position in 3D space V(x, y, z); (2) color, expressed in RGB or RGBA components; (3) vertex normal; (4) texture coordinates.

A primitive (such as a triangle, point, line or quad) is formed by one or more vertices. Primitives are the inputs to the graphics rendering pipeline.

▪ All modern displays are raster-based. A raster is a 2D rectangular grid of pixels.

▪ A pixel, short for "picture element," is the smallest unit of a digital image or display. It is a tiny square or dot that represents a single point of color. A pixel is 2-dimensional, with an (x, y) position and an RGB color value (no alpha value for pixels).
• The vertices, whose coordinates are usually floating-point values, are not necessarily aligned with the pixel grid of the display, and neither are the primitives built from them.
• Each primitive is raster-scanned (by the rasterizer) to obtain a set of fragments enclosed within the primitive.
• Fragments are produced via interpolation of the vertices. Hence, a fragment has all of the vertex attributes, such as color, fragment normal and texture coordinates.
Tips: How to Explain a System
Step 1: describe the things (key entities) that are manipulated
- The nouns

Step 2: describe operations the system performs on the entities
- The verbs
Rendering a Picture
▪ Input: a list of vertices in 3D space (and their connectivity into primitives)

Example: every three vertices define a triangle

Step 1: given a scene camera position, compute where the vertices lie on screen
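
Step 1 can be sketched as a pinhole projection. This is a minimal illustration, not the lecture's method: it assumes the vertices are already expressed in camera space, with the camera at the origin looking down the negative z-axis. `project_vertex` and its `focal` parameter are hypothetical names.

```python
def project_vertex(v, width, height, focal=1.0):
    """Project camera-space point v = (x, y, z) to screen coordinates.

    Returns (screen_x, screen_y, depth). The screen coordinates are floats
    and are generally not aligned with the pixel grid.
    """
    x, y, z = v
    if z >= 0:
        raise ValueError("vertex must be in front of the camera (z < 0)")
    # Perspective divide: points farther away move toward the image center.
    ndc_x = focal * x / -z
    ndc_y = focal * y / -z
    # Map the range [-1, 1] to pixel units; screen y grows downward.
    screen_x = (ndc_x + 1.0) * 0.5 * width
    screen_y = (1.0 - ndc_y) * 0.5 * height
    return (screen_x, screen_y, -z)  # keep the depth for later stages


print(project_vertex((0.0, 0.0, -1.0), 640, 480))  # (320.0, 240.0, 1.0)
print(project_vertex((1.0, 1.0, -2.0), 640, 480))  # (480.0, 120.0, 2.0)
```

Every vertex is projected independently of the others, so this stage can process all vertices in parallel.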
Step 2: group vertices into primitives
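
Step 2 can be sketched with the triangle-list convention from the input slide, where every three vertices define one triangle. `assemble_triangles` is a hypothetical helper name, and real pipelines also support other connectivity schemes such as indexed meshes and strips.

```python
def assemble_triangles(vertices):
    """Group a flat vertex list into triangles: vertices 0-2, 3-5, and so on."""
    if len(vertices) % 3 != 0:
        raise ValueError("vertex count must be a multiple of 3")
    return [tuple(vertices[i:i + 3]) for i in range(0, len(vertices), 3)]


vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
            (1, 1, 0), (2, 1, 0), (1, 2, 0)]
for triangle in assemble_triangles(vertices):
    print(triangle)
```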
Step 3: perform rasterization on the primitive, to obtain a set of fragments enclosed within the primitive that are aligned to the pixel grid of the display

A fragment is 3-dimensional, with an (x, y, z) position. The (x, y) coordinates are aligned with the 2D pixel grid. The z-value (not grid-aligned) denotes its depth.
The 3D fragments are interpolated from the vertices.
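
Step 3 can be sketched with edge functions and barycentric weights. This is a minimal illustration under stated assumptions: the vertices are already in screen space, coverage is tested at pixel centers, and depth is interpolated linearly in screen space (hardware rasterizers use additional rules, such as perspective-correct interpolation). `edge` and `rasterize` are hypothetical names.

```python
def edge(a, b, p):
    """Signed area term: positive on one side of edge a -> b, negative on the other."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, width, height):
    """Return fragments (x, y, z) for pixels whose centers lie inside the triangle.

    Each vertex is (x, y, z) in screen space. The fragment (x, y) is
    grid-aligned; z is interpolated and generally not an integer.
    """
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # a degenerate triangle covers no pixels
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            # Barycentric weights; dividing by area handles both windings.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                z = w0 * v0[2] + w1 * v1[2] + w2 * v2[2]
                fragments.append((x, y, z))
    return fragments


frags = rasterize((0, 0, 0.0), (4, 0, 4.0), (0, 4, 0.0), 4, 4)
print(len(frags))  # 10
print(frags[0])    # (0, 0, 0.5)
```

The same weights can interpolate any other vertex attribute (color, normal, texture coordinates), which is how a fragment comes to carry all of the vertex attributes.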
Step 4: compute the color of the primitive at each fragment (based on scene lighting and primitive material properties)
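
Step 4 can be sketched with Lambertian (diffuse) shading. The slides do not name a lighting model, so this model is an assumption chosen for simplicity. `shade_fragment`, `albedo` (the material's base color) and `light_color` are hypothetical names.

```python
import math


def normalize(v):
    """Scale a non-zero 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def shade_fragment(normal, light_dir, albedo, light_color=(1.0, 1.0, 1.0)):
    """Diffuse color of one fragment.

    normal:    surface normal at the fragment (interpolated from the vertices)
    light_dir: direction from the surface toward the light
    albedo:    RGB material color with components in [0, 1]
    """
    n = normalize(normal)
    l = normalize(light_dir)
    # Surfaces facing away from the light receive no diffuse light.
    n_dot_l = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(a * c * n_dot_l for a, c in zip(albedo, light_color))


print(shade_fragment((0, 0, 1), (0, 0, 1), (1.0, 0.5, 0.0)))   # (1.0, 0.5, 0.0)
print(shade_fragment((0, 0, 1), (0, 0, -1), (1.0, 0.5, 0.0)))  # (0.0, 0.0, 0.0)
```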
Step 5: for each pixel, put the color of the fragment closest to the camera in the output image
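
Step 5 is commonly implemented with a depth buffer (z-buffer). The sketch below assumes that a smaller z means closer to the camera and that each fragment arrives as `(x, y, z, color)`. `resolve_depth` and this layout are hypothetical.

```python
def resolve_depth(fragments, width, height, background=(0, 0, 0)):
    """Keep, for every pixel, the color of the fragment nearest to the camera."""
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[background] * width for _ in range(height)]
    for x, y, z, color in fragments:
        if z < depth[y][x]:  # a closer fragment replaces what was stored
            depth[y][x] = z
            image[y][x] = color
    return image


fragments = [
    (0, 0, 5.0, "red"),
    (0, 0, 2.0, "blue"),   # closest fragment at pixel (0, 0)
    (1, 0, 3.0, "green"),
    (0, 0, 4.0, "yellow"),
]
image = resolve_depth(fragments, width=2, height=2)
print(image[0][0], image[0][1])  # blue green
```

The result does not depend on the order in which fragments arrive, so fragments from different triangles can be generated in any order.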
Real-time Graphics Pipeline

Abstracts the process of rendering a picture as a sequence of operations on vertices, primitives, fragments, and pixels.

The purpose of the Graphics Rendering Pipeline is to produce, from the input vertices, the color value of every pixel displayed on the screen.
Early Graphics Programming (OpenGL API)

Graphics programming APIs provided programmers with mechanisms to set the parameters of scene lights and materials
Graphics Shading Languages
▪ Allow application to extend the functionality of the graphics pipeline by specifying materials and lights programmatically!
• Support diversity in materials
• Support diversity in lighting conditions

• Programmer provides mini-programs (“shaders”) that define pipeline logic for certain stages
• Pipeline maps shader function onto all elements of input stream
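
The last two bullets can be sketched in Python. This illustrates the programming model only and is not a real shading language: the programmer supplies a small function, and the pipeline applies it to every element of the input stream. `run_fragment_stage`, `darken` and the `(x, y, (r, g, b))` fragment layout are hypothetical.

```python
def run_fragment_stage(shader, fragments):
    """Pipeline side: apply the user-supplied shader to each fragment independently."""
    return [shader(fragment) for fragment in fragments]


def darken(fragment):
    """Programmer side: a mini-program that halves the brightness of one fragment."""
    x, y, (r, g, b) = fragment
    return (x, y, (r * 0.5, g * 0.5, b * 0.5))


stream = [(0, 0, (1.0, 0.5, 0.0)), (1, 0, (0.2, 0.4, 0.8))]
print(run_fragment_stage(darken, stream))
```

The shader sees one element at a time and shares no state between invocations, so the pipeline is free to run all invocations in parallel.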
Observation Circa 2001-2003

These GPUs are very fast processors for performing the same computation (shader programs) on large collections of data (streams of vertices, fragments, and pixels)

Wait a minute! That sounds a lot like data-parallelism!
Hack! Early GPU-based Scientific Computation
Use a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
Two research groups independently discovered GPU-based approaches for solving general linear algebra problems that ran faster than on CPUs

Sparse Matrix Solvers [Bolz 03]

Sparse Matrix Solvers on the GPU


Brook Stream Programming Language
▪ Stanford graphics lab research project (2004)
▪ Abstracts GPU hardware as a data-parallel processor.
▪ The Brook compiler converted a generic stream program into OpenGL commands such as drawTriangles() and a set of shader programs.
GPU Computing Mode
General-Purpose Computing on Graphics Processing Units (GPGPU)

CUDA Programming

Next Lectures
Thank you very much for choosing this course!

Give us your feedback!

https://forms.gle/zDdrPGCkN7ef3UG5A
