With .v4
you are loading 128 bits from 4 consecutive banks (the access also has to be aligned to 128 bits, so banks 0+1+2+3 are possible, 1+2+3+4 are not).
So you are saying all 32 threads load a float4
from the same 4 banks. Let’s assume banks 0+1+2+3.
Then the load takes 32 cycles. Each of the 4 banks delivers 32 values, and each of the 32 threads/lanes receives 4 32-bit values.
The exact ordering in which the load is done is not published and is normally not observable (for correctly synchronized programs), i.e. it would not make a difference.
I would expect something like the following:
In cycles 0..3 threads 0..3 are loading from banks 0..3.
In cycle 0, thread 0 loads 0, thread 1 loads 1, thread 2 loads 2, thread 3 loads 3.
In cycle 1, thread 0 loads 1, thread 1 loads 2, thread 2 loads 3, thread 3 loads 0.
In cycle 2, thread 0 loads 2, thread 1 loads 3, thread 2 loads 0, thread 3 loads 1.
In cycle 3, thread 0 loads 3, thread 1 loads 0, thread 2 loads 1, thread 3 loads 2.
Each thread reorders the received data into the correct 128-bit value. (Actually, 128-bit values are stored internally in four 32-bit general-purpose registers.)
Only one transaction can be active at a time, so no other read or write accesses can happen during those 32 cycles. But shared memory accesses (the MIO pipeline) are asynchronous, so you can enqueue other accesses; you just would not receive their results in the meantime.
The .v4
instruction, i.e. 128-bit accesses, is already optimized as written before. It takes 32 cycles in the maximum bank-conflict case, not 128 cycles (which would happen if, in each cycle, every thread tried to read the same word of its 128-bit value, serializing all 32 threads on one bank for each of the four words).
Also remember that if the address is the same between threads (not just the same bank), it does not count as a bank conflict but is executed as a broadcast to those threads.