Shared memory bank conflict reordering

If 32 threads request 32 ints, starting from the same bank, causing a 32-way bank conflict over 32 elements,
and the requests are made via float4 / .v4 loading,
will the requests be staggered so that after the first bank load, the memory is simultaneously loading from banks 0 and 1, then 0, 1, and 2, and then 0, 1, 2, and 3?

Another way of saying it:
once the first cycle completes, bank 1 would be open for reading, so will it then reorder to read from banks 0 and 1 while the remaining 31-way conflict is still serializing execution, or will it be a full 32-way conflict over all 32 elements?

And can a new shared memory request be initiated before a .v4 load finishes?

With .v4 you are loading 128 bits, i.e. from 4 banks (the access also has to be aligned to 128 bits, so banks 0+1+2+3 are possible, 1+2+3+4 are not).
So you are saying all 32 threads load a float4 from the same 4 banks. Let's assume banks 0+1+2+3.

Then the load takes 32 cycles. Each of the 4 banks delivers 32 values, and each of the 32 threads/lanes receives four 32-bit values.
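For concreteness, a minimal kernel producing this access pattern could look like the sketch below (the 128-byte row stride, the names, and the single-warp launch are my assumptions, chosen just to force every lane onto banks 0..3):

```cuda
// Hypothetical kernel reproducing the discussed pattern: every lane loads a
// float4 whose address maps to banks 0..3, giving a 32-way conflict per bank.
// Assumes a launch with a single warp (blockDim.x == 32).
__global__ void conflicting_v4_load(const float4* __restrict__ in, float4* out)
{
    // 32 rows of 8 float4 = 128 bytes per row, so every row starts at bank 0.
    __shared__ float4 smem[32 * 8];

    int lane = threadIdx.x;                  // 0..31

    // Fill shared memory (not the interesting part).
    for (int i = lane; i < 32 * 8; i += 32)
        smem[i] = in[i];
    __syncthreads();

    // The load in question: lane i reads the first float4 of row i.
    // All 32 addresses fall on banks 0..3 -> one ~32-cycle transaction.
    out[lane] = smem[lane * 8];
}
```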

The exact ordering in which the load is done is not published and is normally not observable (for correctly synced programs), i.e. it would not make a difference.

I would expect something like the following:

In cycles 0..3, threads 0..3 load from banks 0..3:
In cycle 0, thread 0 loads 0, thread 1 loads 1, thread 2 loads 2, thread 3 loads 3.
In cycle 1, thread 0 loads 1, thread 1 loads 2, thread 2 loads 3, thread 3 loads 0.
In cycle 2, thread 0 loads 2, thread 1 loads 3, thread 2 loads 0, thread 3 loads 1.
In cycle 3, thread 0 loads 3, thread 1 loads 0, thread 2 loads 1, thread 3 loads 2.
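If that guess is right, the bank each thread reads within its 4-cycle window is just a rotation (again, this is my speculation, not documented behavior):

```cuda
// Speculative schedule matching the table above: within its window, in cycle c
// thread t fetches the 32-bit word sitting in bank (t + c) % 4 of its own
// 128-bit value, and reassembles the four words afterwards.
__device__ int bank_read_in_cycle(int thread, int cycle) { return (thread + cycle) % 4; }
```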

Each thread reorders the received data into the correct 128-bit value. (Internally, a 128-bit value is stored in 4 32-bit general-purpose registers.)

Only one transaction can be active at a time, so no other read or write accesses can happen during those 32 cycles. But shared memory accesses (MIO pipeline) are asynchronous, so you can enqueue other accesses; you just would not receive their results in the meantime.
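As a sketch of the programmer-visible consequence (names and layout are mine; whether the second access actually gets enqueued behind the first is up to the compiler and hardware):

```cuda
// Hypothetical single-warp kernel with two independent shared-memory loads.
// The second load can be enqueued in the MIO pipeline while the first, heavily
// conflicted transaction is still in flight, but its result only becomes
// available once its own transaction has been serviced.
__global__ void enqueue_behind_conflict(const float4* in, float* out)
{
    __shared__ float4 smem_a[32 * 8];   // 128-byte rows -> conflicted v4 access
    __shared__ float  smem_b[32];       // one word per lane -> conflict-free

    int lane = threadIdx.x;
    for (int i = lane; i < 32 * 8; i += 32) smem_a[i] = in[i];
    smem_b[lane] = in[lane].x;
    __syncthreads();

    float4 a = smem_a[lane * 8];        // ~32-cycle, 32-way conflicted load
    float  b = smem_b[lane];            // independent load, issued right after

    out[lane] = a.x + a.y + a.z + a.w + b;
}
```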

The .v4 instruction, i.e. 128-bit accesses, is already optimized as written above: it takes 32 cycles for a maximal bank conflict, not 128 cycles (4 words × 32-way conflict, which is what you would get if all threads read the same 32-bit part of the 128-bit value at the same time).

Also remember that if the address is the same between threads (not just the same bank), it does not count as a bank conflict, but is executed as a broadcast to those threads.
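A small sketch contrasting the two cases (illustrative names, single-warp launch assumed):

```cuda
// Broadcast vs. bank conflict: the same address across lanes is broadcast,
// while the same bank at different addresses is serialized.
__global__ void broadcast_vs_conflict(const int* in, int* out)
{
    __shared__ int smem[32 * 32];

    int lane = threadIdx.x;
    for (int i = lane; i < 32 * 32; i += 32) smem[i] = in[i];
    __syncthreads();

    int a = smem[0];          // all lanes read the same address -> broadcast, no conflict
    int b = smem[lane * 32];  // all lanes hit bank 0 at different addresses -> 32-way conflict

    out[lane] = a + b;
}
```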