0% found this document useful (0 votes)
47 views28 pages

Lecture 6

This document discusses memory coalescing and performance considerations for global memory accesses in CUDA. It explains that global memory is accessed in chunks of 32, 64, or 128 bytes and how the device coalesces memory accesses from threads within a warp to minimize transactions. It also discusses how to avoid bank conflicts for efficient shared memory access.

Uploaded by

raghunaath
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views28 pages

Lecture 6

This document discusses memory coalescing and performance considerations for global memory accesses in CUDA. It explains that global memory is accessed in chunks of 32, 64, or 128 bytes and how the device coalesces memory accesses from threads within a warp to minimize transactions. It also discusses how to avoid bank conflicts for efficient shared memory access.

Uploaded by

raghunaath
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 28

Ceng 545

Performance Considerations
Memory Coalescing
• High Priority: Ensure global memory accesses
are coalesced whenever possible.
• Off-chip memory is accessed in chunks
– Even if you read only a single word
– If you dont use whole chunk, bandwidth is wasted
• Chunks are aligned to multiples of 32/64/128
Bytes
– Unaligned accesses will cost more
• Global memory loads and stores by threads of
a half warp (for devices of compute capability
1.x) or of a warp (for devices of compute
capability 2.x) are coalesced by the device into
as few as one transaction when certain access
requirements are met.
• To understand these access requirements,
global memory should be viewed in terms of
aligned segments of 16 and 32 words.
Coalescing algorithm
• Find the memory segment that contains the address requested by
the lowest numbered active thread:
– 32B segment for 8-bit data
– 64B segment for 16-bit data
– 128B segment for 32, 64 and 128-bit data.
• Find all other active threads whose requested address lies in the
same segment
• Reduce the transaction size, if possible:
– If size == 128B and only the lower or upper half is used, reduce
transaction to 64B
– If size == 64B and only the lower or upper half is used, reduce
transaction to 32B
• Carry out the transaction, mark threads as inactive
• Repeat until all threads in the half-warp are serviced
A Simple Access Pattern
The first
A Sequential but Misaligned Access
Pattern

If the addresses fall within a 128-byte segment, then a single 128-byte


transaction is performed
one 64-byte transaction and one 32-byte transaction result.
• Memory allocated through the runtime API,
such as via cudaMalloc(), is guaranteed to be
aligned to at least 256 bytes. Therefore,
choosing sensible thread block sizes, such as
multiples of 16, facilitates memory accesses
by half warps that are aligned to segments.
• __align__(8) and __align__(16) can be used
when defining structures to ensure alignment
to segments.
Strided Accesses
__global__ void strideCopy(float *odata, float* idata, int stride)
{
int xid = (blockIdx.x*blockDim.x + threadIdx.x)*stride;
odata[xid] = idata[xid];
}
• As the stride increases, the effective
bandwidth decreases until the point where 16
transactions are issued for the 16 threads in a
half warp
Memory Coalescing
• Structure of array is often better than array
of structures
– Very clear win on regular, stride 1 access
patterns
– Unpredictable or irregular access patterns are
case-by-case
Shared Memory and Memory Banks
• it is on-chip, shared memory is much faster
than local and global memory.
• In fact, uncached shared memory latency is
roughly 100x lower than global memory
latency
– provided there are no bank conflicts between the
threads
• To achieve high memory bandwidth for
concurrent accesses, shared memory is
divided into equally sized memory modules
(banks) that can be accessed simultaneously
• Medium Priority: Accesses to shared memory
should be designed to avoid serializing
requests due to bank conflicts.
• Shared memory banks are organized such that
successive 32-bit words are assigned to
successive banks and each bank has a
bandwidth of 32 bits per clock cycle.

You might also like