With .v4
you are loading 128 bits from 4 consecutive banks (the access also has to be aligned to 128 bits, so banks 0+1+2+3 are possible, 1+2+3+4 are not).
So you are saying all 32 threads load a float4
from the same 4 banks. Let’s assume banks 0+1+2+3.
Then the load takes 32 cycles. Each of the 4 banks delivers 32 values, and each of the 32 threads/lanes receives 4 32-bit values.
The exact ordering in which the load is done is not published and is normally not observable (for correctly synchronized programs), i.e. it would not make a difference.
I would expect something like the following:
In cycles 0..3 threads 0..3 are loading from banks 0..3.
In cycle 0, thread 0 loads 0, thread 1 loads 1, thread 2 loads 2, thread 3 loads 3.
In cycle 1, thread 0 loads 1, thread 1 loads 2, thread 2 loads 3, thread 3 loads 0.
In cycle 2, thread 0 loads 2, thread 1 loads 3, thread 2 loads 0, thread 3 loads 1.
In cycle 3, thread 0 loads 3, thread 1 loads 0, thread 2 loads 1, thread 3 loads 2.
Each thread reorders the received data into the correct 128-bit value. (Actually, 128-bit values are stored internally in four 32-bit general-purpose registers.)
Only one transaction can be active at a time, so no other read or write accesses can happen during those 32 cycles. But shared memory accesses (the MIO pipeline) are asynchronous, so you can enqueue other accesses; you just would not receive their results in the meantime.
The .v4
instruction, i.e. 128-bit accesses, is already optimized as written before. It takes 32 cycles in the maximum bank-conflict case, not 128 cycles (which would happen if, in each cycle, every thread tried to read the same word of its 128-bit value, serializing all 32 threads on one bank for each of the four words).
Also remember that if the address is the same between threads (not just the same bank), it does not count as a bank conflict but is executed as a broadcast to those threads.