
Using GPUs to Accelerate

Computational Performance
Dr Eric McCreath
Research School of Computer Science
The Australian National University
Overview
GPU Architecture
SIMT
Kernels
Memory
Intermediate representations and runtimes
"Hello World" - OpenCL
"Hello World" - Cuda
Lab Activity

2
Progress?
What has changed in the last 20 years in computing?
(Photos: me in ~1998, and me more recently.)

3
GeForce

4
Super Computer Performance
Rapid growth of supercomputer performance, based on data from the top500.org site. The
logarithmic y-axis shows performance in GFLOPS.

By AI.Graphic - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=33540287

5
GPU vs CPU
Just looking at the specs of a basic desktop computer we can
see the great potential in GPU computing.

                     Intel Core i7-6700K       GeForce GTX 1080
  Cores              4 CPU cores (8 threads)   2560 CUDA cores
  Peak compute       114 GFLOPS                8228 GFLOPS
  Memory bandwidth   34 GB/s                   320 GB/s (256-bit bus)
  RAM                16 GB DDR4                8 GB GDDR5X

The two are connected over PCIe at roughly 15 GB/s.

6
Inside a CPU
The Core i7-6700K quad-core processor

From https://www.techpowerup.com/215333/intel-skylake-die-layout-detailed

7
Inside the GPU
If we take a closer look inside a GPU we see some similarity with
the CPU, although with more repetition, which comes with the many more
cores.
GTX1070 - GP104 - Pascal

From https://www.flickr.com/photos/130561288@N04/36230799276, by Fritzchens Fritz, Public Domain

8
Key Parts Within a GPU
Nvidia GPU chips are partitioned into Graphics Processor
Clusters (GPCs). On the GP104 there are 4 GPCs.
Each GPC is again partitioned into Streaming Multiprocessors
(SMs). On the GP104 there are 5 SMs per GPC.
Each SM has "CUDA" cores, which are basically ALUs that can
execute SIMD instructions. On the GP104 there are 128 CUDA
cores per SM.
On the GP104 each SM has 24 KiB of unified L1/texture cache
and 96 KiB of "shared memory".
The GP104 chip has 2048 KiB of L2 cache.
I think we need a diagram!!
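
A minimal CUDA sketch (not from the slides) that queries these numbers for
whatever card is installed, using the cudaDeviceProp fields of the runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

// Print a few of the architectural parameters discussed above for device 0.
int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Name: %s\n", prop.name);
    printf("SMs (multiprocessors): %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu KiB\n", prop.sharedMemPerBlock / 1024);
    printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("L2 cache: %d KiB\n", prop.l2CacheSize / 1024);
    printf("Global memory: %zu MiB\n", prop.totalGlobalMem / (1024 * 1024));
    return 0;
}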

9
Key Parts Within A GPU
(Diagram: the chip is divided into GPCs, each holding a number of SMs. Each SM
has 128 CUDA cores, 64K 32-bit registers, 24 KiB of L1 cache, and 96 KiB of
shared memory. The SMs share a 2 MB L2 cache, which sits in front of the 8 GB
of device DRAM.)
10
AMD
If we had a look at an AMD GPU we would see something similar.
The Radeon R9 290 series block diagram shows asynchronous compute
engines and a global data share feeding 4 shader engines, each made
up of compute units, backed by a 1 MB L2 cache and the memory
controllers.

Each compute unit has:
64 stream processors
4 x 64 KB vector registers
64 KB local shared data
16 KB L1 cache
texture and scheduler components

11
Some Terminology
CUDA (Compute Unified Device Architecture) is the parallel
programming platform and programming model developed by Nvidia for
their GPU devices. It comes with its own terminology.
The streaming multiprocessor (SM) is a key computational grouping
within a GPU, although "streaming multiprocessor" is Nvidia's
terminology; AMD would call them "compute units".
Also, "CUDA cores" would be called "shader units" or "stream
processors" by AMD.

12
Kernels
Kernels are the small pieces of code that execute in a thread (or
work-item) on the GPU. They are written in C. For a single kernel
one would normally launch many threads, with each thread given the
task of working on a different data item (data parallelism).
In CUDA, kernels have the "__global__" qualifier before them, they
don't return anything (type void), and their parameters can be
basic types, structs, or pointers. Below is a simple kernel that adds
one to each element of an array.
__global__ void addone(int n, int *data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) data[idx] = data[idx] + 1;
}

To launch this kernel with 10 blocks and 256 threads per block you
would:
addone<<<10,256>>>(n, data); // "n" is the number of items in the array "data"
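
A minimal host-side sketch of how this kernel might be driven (the array size
and the printed check are assumptions for illustration; the kernel is repeated
so the sketch is self-contained):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void addone(int n, int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] + 1;
}

int main(void) {
    const int n = 2560;                    // 10 blocks * 256 threads
    int host[2560];
    for (int i = 0; i < n; i++) host[i] = i;

    int *data;
    cudaMalloc(&data, n * sizeof(int));                               // device memory
    cudaMemcpy(data, host, n * sizeof(int), cudaMemcpyHostToDevice);  // copy input to GPU

    addone<<<10, 256>>>(n, data);                                     // launch the kernel
    cudaDeviceSynchronize();                                          // wait for completion

    cudaMemcpy(host, data, n * sizeof(int), cudaMemcpyDeviceToHost);  // copy result back
    cudaFree(data);

    printf("host[0] = %d, host[n-1] = %d\n", host[0], host[n - 1]);   // expect 1 and 2560
    return 0;
}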

13
SIMT
Single Instruction Multiple Data (SIMD), described by Flynn in 1966,
typically has a single instruction operate on a vector of data
items. This saves on duplicating the instruction execution
hardware, and the memory accesses have good spatial locality. GPUs
have an extension of this called Single Instruction Multiple Thread
(SIMT), which provides more context for each of these 'threads'.
(Diagram: in both SIMD and SIMT a single program counter and instruction
stream drives several processing units working on the data; in SIMT each
processing unit additionally has its own set of registers.)
Threads have their own registers, can access different
addresses, and can follow divergent paths in the code.
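
As a small illustration of divergence (a made-up kernel, not from the slides),
threads within the same warp can take different branches; the hardware then
runs the two paths one after the other with the inactive threads masked off:

__global__ void divergent(int n, int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (idx % 2 == 0) {
        out[idx] = idx * 2;      // even-numbered threads take this path
    } else {
        out[idx] = idx + 100;    // odd-numbered threads take this one
    }
}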

14
Memory
Memory bandwidth and latency can often significantly impact
performance, so one of the first performance questions when porting
a program to the GPU is: which memory to use, and how best to use
it? Memory is described by its scope from the thread's perspective.
The key memory types to consider are:
registers - fast and local to threads.
shared memory - fast memory that is shared within the block
(local memory in OpenCL).
global memory - this is the main memory of the GPU; it is accessible
to all threads in all blocks and persists over the execution of the
program.
constant memory - can't change over kernel execution; great if
threads all want to access the same constant information.
A short kernel sketch using these memory types follows.
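
A minimal sketch (an invented block-sum kernel, not from the slides) of how
constant, shared, register and global memory appear in CUDA code; the host
would set the constant with cudaMemcpyToSymbol(scale, &value, sizeof(float))
before launching:

__constant__ float scale;           // constant memory: same value read by every thread

// Scales each element and sums one block's worth of data.
// Assumes blockDim.x is 256 (matching the buffer) and a power of two.
__global__ void scaledBlockSum(int n, const float *in, float *blockSums) {
    __shared__ float buf[256];      // shared memory: visible to all threads in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] * scale : 0.0f;   // v lives in a register
    buf[threadIdx.x] = v;
    __syncthreads();

    // Tree reduction within the block using the shared buffer.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = buf[0];   // write result to global memory
}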

15
"Hello World" - OpenCL
In this implementation of "Hello World" we are getting the
GPU to do the work of generating the string in parallel: a single
thread does the work of outputting a single character of the string
we output (a sketch of such a kernel appears below).

(Diagram: the CPU with its host memory and the GPU with its device memory,
each ending up holding "hello world", with the three steps of the transfer
numbered 1-3.)
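
A sketch of what the device-side OpenCL kernel for this could look like (the
kernel and argument names are assumptions, not the exact lab code); the host
still has to create the context, queue and buffer, build the program, enqueue
one work-item per character, and read the buffer back:

// hello.cl - each work-item copies one character of the message.
__constant char msg[] = "hello world";

__kernel void hello(__global char *out) {
    int i = get_global_id(0);
    out[i] = msg[i];
}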

16
Overview Of Lab Activity
Basically, in this first lab you will have a go at compiling and running
the code, and then make a small modification to the "hello world"
programs. This involves adding your name to the "hello" message and
also making one thread copy over 2 characters, rather than just
the one.

(Diagram: GPU device memory holding "Hello Eric".)

17
References
Flynn's taxonomy, https://en.wikipedia.org/wiki/Flynn's_taxonomy
Using CUDA Warp-Level Primitives, Lin and Grover,
https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
CUDA C Programming Guide,
https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
Benchmarking the cost of thread divergence in CUDA, Bialas
and Strzelecki, https://arxiv.org/pdf/1504.01650.pdf

18
