SIMD
SIMD Processing
■ Single instruction operates on multiple data elements
❑ In time or in space
■ Multiple processing elements
■ Time-space duality
❑ Array processor: Instruction operates on multiple data
elements at the same time
❑ Vector processor: Instruction operates on multiple data
elements in consecutive time steps
Array vs. Vector Processors
[Figure: array processor applies the same instruction across space, one processing element per data element in a single time step; vector processor applies it across time, one data element per cycle]
SIMD Array Processing vs. VLIW
■ VLIW (Very Long Instruction Word)
■ Array processor
Vector Processors
■ A vector is a one-dimensional array of numbers
■ Many scientific/commercial programs use vectors
for (i = 0; i <= 49; i++)
    C[i] = (A[i] + B[i]) / 2
Vector Processors (II)
■ A vector instruction performs an operation on each element
in consecutive cycles
❑ Vector functional units are pipelined
❑ Each pipeline stage operates on a different data element
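Because each pipeline stage works on a different element, one vector operation of length VLEN on a unit with startup latency L finishes in L + VLEN - 1 cycles. A minimal sketch of this timing model (the function name is ours, not from the slides):

```c
/* Cycles for one vector operation on a pipelined functional unit:
   the first element takes the full pipeline latency, and each of the
   remaining VLEN-1 elements completes one cycle later. */
int vector_op_cycles(int latency, int vlen) {
    return latency + vlen - 1;
}
```

For example, a 50-element add on a 4-stage adder takes 4 + 50 - 1 = 53 cycles.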
Vector Processor Advantages
+ No dependencies within a vector
❑ Pipelining, parallelization work well
❑ Can have very deep pipelines, no dependencies!
Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
Vector Processor Limitations
-- Memory (bandwidth) can easily become a bottleneck,
especially if
1. compute/memory operation balance is not maintained
2. data is not mapped appropriately to memory banks
Vector Registers
■ Each vector data register holds N M-bit values
■ Vector control registers: VLEN, VSTR, VMASK
■ Vector Mask Register (VMASK)
❑ Indicates which elements of vector to operate on
Vector Functional Units
■ Use deep pipeline (=> fast clock) to execute element operations
■ Simplifies control of deep pipeline because elements in vector are independent
V3 <- V1 * V2
[Figure: pipelined functional unit reading source registers V1 and V2 and writing V3]
Vector Machine Organization (CRAY-1)
■ CRAY-1
■ Russell, “The CRAY-1 computer system,” CACM 1978.
Memory Banking
■ Example: 16 banks; can start one bank access per cycle
■ Bank latency: 11 cycles
■ Can sustain 16 parallel accesses if they go to different banks
[Figure: CPU drives a shared address bus to multiple memory banks, each with its own MAR and MDR, returning results over a shared data bus]
Slide credit: Derek Chiou
Vector Memory System
[Figure: address generator adds base and stride values from the vector registers to address 16 memory banks, 0 through F]
■ Scalar code (304 dynamic instructions)
MOVI R0 = 50            1
MOVA R1 = A             1
MOVA R2 = B             1
MOVA R3 = C             1
X: LD R4 = MEM[R1++]    11   ;autoincrement addressing
LD R5 = MEM[R2++]       11
ADD R6 = R4 + R5        4
SHFR R7 = R6 >> 1       1
ST MEM[R3++] = R7       11
DECBNZ --R0, X          2    ;decrement and branch if NZ
Scalar Code Execution Time
■ Scalar execution time on an in-order processor with 1 bank
❑ First two loads in the loop cannot be pipelined: 2*11 cycles
❑ 4 + 50*40 = 2004 cycles
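The 2004-cycle figure follows from the per-instruction latencies annotated in the scalar code: 4 one-cycle setup instructions, then 50 iterations of a 40-cycle loop body. As a check (our arithmetic, mirroring the slide):

```c
/* Scalar execution time with 1 memory bank: the two loads cannot
   overlap, so each iteration pays every latency in full. */
int scalar_cycles(void) {
    int setup = 1 + 1 + 1 + 1;             /* MOVI + 3x MOVA            */
    int body  = 11 + 11 + 4 + 1 + 11 + 2;  /* LD LD ADD SHFR ST DECBNZ  */
    return setup + 50 * body;              /* 4 + 50*40                 */
}
```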
■ Why 16 banks?
❑ 11 cycle memory access latency
❑ Having 16 (>11) banks ensures there are enough banks to
overlap enough memory operations to cover memory latency
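With word interleaving and one new unit-stride access issued per cycle, bank (i mod B) is revisited B cycles after its last access, so it must have finished its L-cycle access by then. The condition is simply B >= L, which we can state as a tiny predicate (helper name is ours):

```c
/* Unit-stride accesses revisit a bank every `banks` cycles; the bank
   is free again in time only if its access latency has elapsed. */
int banks_hide_latency(int banks, int latency) {
    return banks >= latency;
}
```

With an 11-cycle latency, 16 banks suffice; 8 would force stalls.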
Vectorizable Loops
■ A loop is vectorizable if each iteration is independent of any
other
■ For i = 0 to 49
❑ C[i] = (A[i] + B[i]) / 2
■ Vectorized loop (7 dynamic instructions):
MOVI VLEN = 50          1
MOVI VSTR = 1           1
VLD V0 = A              11 + VLEN - 1
VLD V1 = B              11 + VLEN - 1
VADD V2 = V0 + V1       4 + VLEN - 1
VSHFR V3 = V2 >> 1      1 + VLEN - 1
VST C = V3              11 + VLEN - 1
Vector Code Performance
■ No chaining
❑ i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
■ One memory port (one address generator)
■ 16 memory banks (word-interleaved)
■ 285 cycles
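One plausible accounting for the 285 cycles, using the latencies listed with the vectorized code: every dependent (unchained) operation waits for its producer to finish, and the single address generator serializes the two VLDs. This is our breakdown; treat it as a sketch:

```c
/* No chaining: each vector instruction waits for the previous
   producer to complete fully; one address generator means the
   two VLDs also serialize with each other. */
int vector_cycles_no_chaining(void) {
    int vlen  = 50;
    int setup = 1 + 1;          /* MOVI VLEN, MOVI VSTR */
    int vld   = 11 + vlen - 1;  /* 60 cycles per VLD    */
    int vadd  =  4 + vlen - 1;  /* 53                   */
    int vshfr =  1 + vlen - 1;  /* 50                   */
    int vst   = 11 + vlen - 1;  /* 60                   */
    return setup + vld + vld + vadd + vshfr + vst;
}
```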
Vector Chaining
■ Vector chaining: data forwarding from one vector functional unit to another
LV V1
MULV V3, V1, V2
ADDV V5, V3, V4
[Figure: the load unit chains into the multiplier and the multiplier into the adder; V1 streams from memory into MULV while MULV's result V3 streams into ADDV]
■ Strict assumption: each memory bank has a single port (memory bandwidth bottleneck)
■ These two VLDs cannot be pipelined. WHY?
Vector Code Performance – Multiple Memory Ports
■ Chaining and 2 load ports, 1 store port in each bank
■ 79 cycles
Questions (I)
■ What if # data elements > # elements in a vector register?
❑ Need to break loops so that each iteration operates on #
elements in a vector register
■ E.g., 527 data elements, 64-element VREGs
■ 8 iterations where VLEN = 64
■ 1 iteration where VLEN = 15 (need to change value of VLEN)
❑ Called vector stripmining
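Strip-mining in scalar form: split the loop into chunks no larger than the vector register length and set VLEN per chunk. The identifiers below are ours, not the slides':

```c
#define MVL 64  /* maximum vector length: elements per vector register */

/* Returns the number of strips needed for n elements; the vector
   body would run once per strip with VLEN set to that strip's size. */
int stripmine_strips(int n) {
    int strips = 0;
    for (int i = 0; i < n; i += MVL) {
        int vlen = (n - i < MVL) ? (n - i) : MVL;  /* per-strip VLEN */
        (void)vlen;  /* vector instructions would execute here */
        strips++;
    }
    return strips;
}
```

For 527 elements this yields 8 full strips of 64 plus one strip of 15, matching the example above.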
Gather/Scatter Operations
■ Gather/scatter operations often implemented in hardware
to handle sparse matrices
■ Vector loads and stores use an index vector which is added
to the base register to generate the addresses
Index Vector    Data Vector    Equivalent
1               3.14           0.0
3               6.5            3.14
7               71.2           0.0
8               2.71           6.5
                               0.0
                               0.0
                               0.0
                               71.2
                               2.71
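In scalar form, gather and scatter are just indexed loads and stores; the hardware performs the same per-element address arithmetic (base plus index). A sketch with our own function names:

```c
/* Gather: pack elements base[idx[i]] into a dense vector register. */
void gather(double *dst, const double *base, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = base[idx[i]];       /* dst[i] = MEM[base + idx[i]] */
}

/* Scatter: spread a dense vector back out to base[idx[i]]. */
void scatter(double *base, const double *src, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        base[idx[i]] = src[i];       /* MEM[base + idx[i]] = src[i] */
}
```

Scattering the data vector {3.14, 6.5, 71.2, 2.71} through index vector {1, 3, 7, 8} into a zeroed 9-element array reproduces the equivalent dense vector in the table.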
Conditional Operations in a Loop
■ What if some operations should not be executed on a vector
(based on a dynamically-determined condition)?
loop: if (a[i] != 0) then b[i] = a[i] * b[i]
      goto loop

1. Compare A, B to get VMASK
2. Masked store of A into C
3. Complement VMASK
4. Masked store of B into C

A    B    VMASK
1    2    0
2    2    1
3    2    1
4    10   0
-5   -4   0
0    -3   1
6    5    1
-7   -8   1
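Per element, the loop above becomes: set VMASK with a compare, then perform the operation only where the mask bit is 1. A scalar sketch (function and variable names are ours):

```c
/* VMASK-style execution of: if (a[i] != 0) b[i] = a[i] * b[i]; */
void masked_multiply(double *b, const double *a, int n) {
    int vmask[256];                    /* one mask bit per element */
    for (int i = 0; i < n; i++)
        vmask[i] = (a[i] != 0.0);      /* 1. compare to set VMASK  */
    for (int i = 0; i < n; i++)
        if (vmask[i])                  /* 2. masked operation      */
            b[i] = a[i] * b[i];
}
```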
Masked Vector Instructions
■ Simple implementation: execute all N operations, turn off result writeback according to mask
■ Density-time implementation: scan mask vector and only execute elements with non-zero masks
[Figure: pipeline diagrams contrasting the two implementations; the density-time version skips elements whose mask bit is 0]
■ Storage of a matrix
❑ Row major: Consecutive elements in a row are laid out
consecutively in memory
❑ Column major: Consecutive elements in a column are laid out
consecutively in memory
❑ You need to change the stride when accessing a row versus
column
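With row-major storage, element (r, c) of a matrix with `cols` columns sits at offset r*cols + c, so walking a row is unit-stride while walking a column has stride `cols` (and vice versa for column-major). A small check (the helper name is ours):

```c
/* Row-major offset of element (r, c) in a matrix with `cols` columns. */
int rowmajor_offset(int r, int c, int cols) {
    return r * cols + c;
}
```

So a vector load would use VSTR = 1 to fetch a row and VSTR = cols to fetch a column.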
Array vs. Vector Processors, Revisited
■ Array vs. vector processor distinction is a “purist’s”
distinction
Remember: Array vs. Vector Processors
[Figure: array processor applies the same instruction across space, one processing element per data element in a single time step; vector processor applies it across time, one data element per cycle]
Vector Instruction Execution
ADDV C, A, B

[Figure: four-lane execution of ADDV; lane 0 processes elements 0, 4, 8, ..., lane 1 elements 1, 5, 9, ..., lane 2 elements 2, 6, 10, ..., lane 3 elements 3, 7, 11, ...; each lane reads its A[i], B[i] operands from the vector registers, backed by the memory subsystem]

[Figure: instruction issue over time; the loads, adds, and stores of successive iterations overlap within one vector instruction]

■ Vectorization is a compile-time reordering of operation sequencing
❑ Requires extensive loop dependence analysis
Slide credit: Krste Asanovic
Vector/SIMD Processing Summary
■ Vector/SIMD machines good at exploiting regular data-level
parallelism
❑ Same operation performed on many data elements
❑ Improve performance, simplify design (no intra-vector
dependencies)
SIMD Operations in Modern ISAs
Intel Pentium MMX Operations
■ Idea: One instruction operates on multiple data elements
simultaneously
❑ À la array processing (yet much more limited)
❑ Designed with multimedia (graphics) operations in mind
■ No VLEN register
■ Opcode determines data type:
❑ 8 8-bit bytes
❑ 4 16-bit words
❑ 2 32-bit doublewords
❑ 1 64-bit quadword
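With no VLEN register, the lane count is fixed by the opcode. The effect of a packed 8-bit add (MMX's PADDB, 8 bytes per 64-bit register) can be sketched in portable C as "SIMD within a register"; this is our illustration, not Intel's implementation:

```c
#include <stdint.h>

/* Add eight packed 8-bit lanes of two 64-bit words; each lane wraps
   around independently, with no carry leaking into its neighbor. */
uint64_t paddb(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8080808080808080ULL;  /* each lane's MSB */
    /* add the low 7 bits of every lane, then patch the MSBs via XOR */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}
```

For example, a lane holding 0xFF plus a lane holding 0x01 wraps to 0x00 without disturbing adjacent lanes.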
MMX Example: Image Overlaying (I)
MMX Example: Image Overlaying (II)