Lecture 6

This document discusses memory coalescing and performance considerations for global memory accesses in CUDA. It explains that global memory is accessed in chunks of 32, 64, or 128 bytes and how the device coalesces memory accesses from threads within a warp to minimize transactions. It also discusses how to avoid bank conflicts for efficient shared memory access.

Uploaded by

raghunaath


Ceng 545

Performance Considerations
Memory Coalescing
• High Priority: Ensure global memory accesses
are coalesced whenever possible.
• Off-chip memory is accessed in chunks
– Even if you read only a single word
– If you don't use the whole chunk, bandwidth is wasted
• Chunks are aligned to multiples of 32/64/128 bytes
– Unaligned accesses will cost more
• Global memory loads and stores by threads of
a half warp (for devices of compute capability
1.x) or of a warp (for devices of compute
capability 2.x) are coalesced by the device into
as few as one transaction when certain access
requirements are met.
• To understand these access requirements,
global memory should be viewed in terms of
aligned segments of 16 and 32 words.
Coalescing algorithm
• Find the memory segment that contains the address requested by
the lowest numbered active thread:
– 32B segment for 8-bit data
– 64B segment for 16-bit data
– 128B segment for 32, 64 and 128-bit data.
• Find all other active threads whose requested address lies in the
same segment
• Reduce the transaction size, if possible:
– If size == 128B and only the lower or upper half is used, reduce
transaction to 64B
– If size == 64B and only the lower or upper half is used, reduce
transaction to 32B
• Carry out the transaction, mark threads as inactive
• Repeat until all threads in the half-warp are serviced
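The steps above can be sketched as a small host-side model. This is only an illustration, not the hardware's actual implementation: the function names `coalesceHalfWarp` and `sequentialAddrs` are hypothetical, and the model assumes every access is naturally aligned to its own word size so that no word straddles a segment boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte addresses requested by `count` consecutive threads,
// where thread k reads the word at startByte + k * wordBytes.
std::vector<std::uint64_t> sequentialAddrs(std::uint64_t startByte, int count, int wordBytes)
{
    std::vector<std::uint64_t> addrs;
    for (int k = 0; k < count; ++k)
        addrs.push_back(startByte + static_cast<std::uint64_t>(k) * wordBytes);
    return addrs;
}

// Hypothetical model of the listed steps: returns the size in bytes of each
// transaction issued for one half-warp. addrs[i] is the byte address requested
// by thread i; wordBytes is the access size (1, 2, 4, 8 or 16 bytes).
std::vector<int> coalesceHalfWarp(const std::vector<std::uint64_t>& addrs, int wordBytes)
{
    assert(wordBytes == 1 || wordBytes == 2 || wordBytes == 4 ||
           wordBytes == 8 || wordBytes == 16);
    std::vector<bool> active(addrs.size(), true);
    std::vector<int> transactions;

    for (std::size_t lead = 0; lead < addrs.size(); ++lead) {
        if (!active[lead]) continue;  // this thread was already serviced

        // Segment containing the address of the lowest-numbered active thread.
        std::uint64_t size = (wordBytes == 1) ? 32 : (wordBytes == 2) ? 64 : 128;
        std::uint64_t base = addrs[lead] / size * size;

        // Find all active threads in the same segment, track the byte range
        // they use, and mark them as serviced by this transaction.
        std::uint64_t lo = addrs[lead];
        std::uint64_t hi = addrs[lead] + wordBytes - 1;
        for (std::size_t i = lead; i < addrs.size(); ++i) {
            if (active[i] && addrs[i] >= base && addrs[i] < base + size) {
                lo = std::min(lo, addrs[i]);
                hi = std::max(hi, addrs[i] + wordBytes - 1);
                active[i] = false;
            }
        }

        // Reduce 128B -> 64B -> 32B while only one half of the segment is used.
        while (size > 32) {
            std::uint64_t half = size / 2;
            if (hi < base + half) {
                size = half;                 // only the lower half is used
            } else if (lo >= base + half) {
                base += half;                // only the upper half is used
                size = half;
            } else {
                break;                       // both halves are used
            }
        }
        transactions.push_back(static_cast<int>(size));
    }
    return transactions;
}
```

In this model, `coalesceHalfWarp(sequentialAddrs(0, 16, 4), 4)` describes 16 threads reading consecutive floats from an aligned address and yields a single 64-byte transaction, while starting at byte 80 instead splits the request into a 64-byte and a 32-byte transaction.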
A Simple Access Pattern
The first
A Sequential but Misaligned Access
Pattern

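If the addresses fall within a 128-byte segment, then a single 128-byte transaction is performed.
If the addresses are split across two 128-byte segments, two transactions are performed; in the case illustrated, one 64-byte transaction and one 32-byte transaction result.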
• Memory allocated through the runtime API,
such as via cudaMalloc(), is guaranteed to be
aligned to at least 256 bytes. Therefore,
choosing sensible thread block sizes, such as
multiples of 16, facilitates memory accesses
by half warps that are aligned to segments.
• __align__(8) and __align__(16) can be used
when defining structures to ensure alignment
to segments.
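As a rough host-side illustration of the alignment points above, the sketch below uses standard C++ `alignas` as a stand-in for the CUDA `__align__` specifier. The struct names and the `isAligned` helper are hypothetical, and the sizes assume a 4-byte `float`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Three floats with default alignment: 12 bytes, aligned to 4 bytes, so
// consecutive array elements do not start on 16-byte boundaries.
struct Vec3 { float x, y, z; };

// The same fields forced to 16-byte alignment; the compiler pads the struct
// to 16 bytes so that every array element starts on a 16-byte boundary.
// In CUDA device code the equivalent would be written with __align__(16).
struct alignas(16) Vec3Aligned { float x, y, z; };

// Hypothetical helper: true if pointer p is a multiple of `alignment`,
// which must be a power of two.
bool isAligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A pointer that is a multiple of 256, as guaranteed for `cudaMalloc()` results, passes `isAligned(p, 128)` and therefore starts on a segment boundary for every segment size discussed here.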
Strided Accesses
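// Each thread copies one float; consecutive threads touch elements `stride` apart.
__global__ void strideCopy(float *odata, const float *idata, int stride)
{
    // Global thread index scaled by the stride (stride == 1 is fully coalesced).
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}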
• As the stride increases, the effective
bandwidth decreases until the point where 16
transactions are issued for the 16 threads in a
half warp
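The effect of the stride can be estimated with a small host-side model. This is a sketch under stated assumptions, not a measurement: `segmentsTouched` is a hypothetical name, the array is assumed to start on a 128-byte boundary, and each element is assumed to be a 4-byte float. Since each transaction for 32-bit data serves one 128-byte segment, the number of segments touched equals the number of transactions for the half-warp.

```cpp
#include <cassert>
#include <set>

// Hypothetical model: number of distinct 128-byte segments touched by the
// 16 threads of a half-warp when thread k accesses the float at index k * stride.
int segmentsTouched(int stride)
{
    assert(stride >= 1);
    const int kHalfWarp = 16;        // threads per half-warp
    const int kWordBytes = 4;        // assumed size of a float
    const int kSegmentBytes = 128;   // segment size for 32-bit data

    std::set<long long> segments;
    for (int k = 0; k < kHalfWarp; ++k) {
        long long byteAddr = static_cast<long long>(k) * stride * kWordBytes;
        segments.insert(byteAddr / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

In this model, strides of 1 and 2 stay within one segment, a stride of 8 touches 4 segments, and any stride of 32 or more puts every thread in its own segment, which gives the 16 transactions mentioned above.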
Memory Coalescing
• A structure of arrays (SoA) is often better than an array
of structures (AoS)
– Very clear win on regular, stride 1 access
patterns
– Unpredictable or irregular access patterns are
case-by-case
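To make the layout difference concrete, the sketch below compares how far apart the reads of a single field land in memory for the two layouts. The `ParticleAoS` struct and the helper names are hypothetical, and the numbers assume a 4-byte `float`.

```cpp
#include <cassert>
#include <cstddef>

// Array-of-structures layout: the four fields of one element are adjacent,
// so the x fields of consecutive elements are sizeof(ParticleAoS) bytes apart.
struct ParticleAoS { float x, y, z, w; };

// Byte offset of field x of element i in an array of ParticleAoS.
std::size_t aosOffsetX(std::size_t i)
{
    return i * sizeof(ParticleAoS) + offsetof(ParticleAoS, x);
}

// Byte offset of x[i] in a structure-of-arrays layout, where all x values
// are stored contiguously in their own float array.
std::size_t soaOffsetX(std::size_t i)
{
    return i * sizeof(float);
}

// Bytes spanned when threads 0 .. threads-1 each read field x of their element.
std::size_t spanBytes(std::size_t (*offsetOfX)(std::size_t), std::size_t threads)
{
    assert(threads > 0);
    return offsetOfX(threads - 1) + sizeof(float) - offsetOfX(0);
}
```

For a half-warp of 16 threads, the SoA reads span 64 bytes and fit in one segment, while the AoS reads span 244 bytes of which only 64 are used, which is the stride-4 case of the strided access pattern.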
Shared Memory and Memory Banks
• Because it is on-chip, shared memory is much faster
than local and global memory.
• In fact, uncached shared memory latency is
roughly 100x lower than global memory
latency
– provided there are no bank conflicts between the
threads
• To achieve high memory bandwidth for
concurrent accesses, shared memory is
divided into equally sized memory modules
(banks) that can be accessed simultaneously
• Medium Priority: Accesses to shared memory
should be designed to avoid serializing
requests due to bank conflicts.
• Shared memory banks are organized such that
successive 32-bit words are assigned to
successive banks and each bank has a
bandwidth of 32 bits per clock cycle.
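The word-to-bank mapping described above can be modeled on the host to predict how badly a strided shared memory access serializes. This is a simplified sketch: `conflictDegree` is a hypothetical name, the number of banks is passed in as a parameter because the slides do not state it (16 banks for compute capability 1.x and 32 for 2.x are assumed in the examples), and threads reading the same word are treated as a broadcast rather than a conflict.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

// Hypothetical model: thread k accesses the 32-bit word at index k * stride.
// Successive words map to successive banks, so bank = word % numBanks.
// Returns the largest number of distinct words that fall in one bank, i.e.
// how many serialized passes the request needs (1 means conflict-free).
int conflictDegree(int stride, int numThreads, int numBanks)
{
    assert(stride >= 0 && numThreads > 0 && numBanks > 0);
    std::map<int, std::set<int>> wordsPerBank;
    for (int k = 0; k < numThreads; ++k) {
        int word = k * stride;
        wordsPerBank[word % numBanks].insert(word);
    }
    int degree = 0;
    for (const auto& entry : wordsPerBank)
        degree = std::max(degree, static_cast<int>(entry.second.size()));
    return degree;
}
```

With 16 threads and 16 banks, a stride of 1 or 3 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 16 sends every thread to the same bank for a 16-way conflict. Odd strides are conflict-free in this model because they share no common factor with a power-of-two bank count.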
