Compiler techniques for exposing ILP
Instruction Level Parallelism
Potential overlap among instructions
Few possibilities in a basic block
  Blocks are small (6-7 instructions)
  Instructions are dependent
Goal: Exploit ILP across multiple basic blocks
  Iterations of a loop
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Basic Scheduling
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Sequential MIPS assembly code:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Pipelined execution:
                              Cycle
Loop: LD    F0, 0(R1)         1
      stall                   2
      ADDD  F4, F0, F2        3
      stall                   4
      stall                   5
      SD    0(R1), F4         6
      SUBI  R1, R1, #8        7
      stall                   8
      BNEZ  R1, Loop          9
      stall                   10
Scheduled pipelined execution:
                              Cycle
Loop: LD    F0, 0(R1)         1
      SUBI  R1, R1, #8        2
      ADDD  F4, F0, F2        3
      stall                   4
      BNEZ  R1, Loop          5
      SD    8(R1), F4         6
Loop Unrolling
Pros:
  Larger basic block
  More scope for scheduling and eliminating dependencies
Cons:
  Increases code size
Comment: Often a precursor step for other optimizations
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
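For readers who prefer source level, the unrolled assembly above corresponds roughly to the C sketch below. The function names add_scalar and add_scalar_unrolled4 are illustrative and do not come from the slides. The sketch keeps an exit test after every copy of the body, like the intermediate BEQZ branches, so it stays valid for any n.

```c
#include <stddef.h>

/* Original loop: one element per iteration, indices n down to 1.
   The array must have at least n + 1 elements (index 0 is unused). */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled four times. Each copy is followed by its own exit test,
   mirroring the BEQZ ... Exit branches in the assembly. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    size_t i = n;
    while (i > 0) {
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1;
    }
}
```

Both functions should leave the array in the same state; the unrolled one simply exposes a larger loop body to the scheduler.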
Loop Transformations
Instruction independence is the key requirement for the transformations
Example
  Determine that it is legal to move SD after SUBI and BNEZ
  Determine that unrolling is useful (iterations are independent)
  Use different registers to avoid unnecessary constraints
  Eliminate extra tests and branches
  Determine that LD and SD can be interchanged
  Schedule the code, preserving the semantics of the code
1. Eliminating Name Dependences
Before renaming (F0 and F4 are reused by every copy):
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F0, -8(R1)
      ADDD  F4, F0, F2
      SD    -8(R1), F4
      LD    F0, -16(R1)
      ADDD  F4, F0, F2
      SD    -16(R1), F4
      LD    F0, -24(R1)
      ADDD  F4, F0, F2
      SD    -24(R1), F4
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

After register renaming:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop
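The effect of renaming can be imitated in C, with local variables standing in for registers. This is only a sketch: the names body_shared_temps and body_renamed_temps are made up for illustration, p is assumed to point at the element addressed by 0(R1), and a real compiler picks the actual registers itself.

```c
/* Before renaming: every copy reuses t0 and t4 (like F0 and F4), so each
   copy has to wait until the previous one is done with them. */
void body_shared_temps(double *p, double s) {
    double t0, t4;
    t0 = p[0];  t4 = t0 + s; p[0]  = t4;
    t0 = p[-1]; t4 = t0 + s; p[-1] = t4;
    t0 = p[-2]; t4 = t0 + s; p[-2] = t4;
    t0 = p[-3]; t4 = t0 + s; p[-3] = t4;
}

/* After renaming: each copy has its own pair of temporaries (like F0/F4,
   F6/F8, F10/F12, F14/F16), so only true data dependences remain. */
void body_renamed_temps(double *p, double s) {
    double t0  = p[0];  double t4  = t0  + s; p[0]  = t4;
    double t6  = p[-1]; double t8  = t6  + s; p[-1] = t8;
    double t10 = p[-2]; double t12 = t10 + s; p[-2] = t12;
    double t14 = p[-3]; double t16 = t14 + s; p[-3] = t16;
}
```

The two bodies compute the same values; the second one just removes the reuse of names that would otherwise constrain reordering.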
2. Eliminating Control Dependences
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
Intermediate BEQZ are never taken (the trip count, 1000, is a multiple of 4): eliminate!
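A C-level sketch of the same step follows. The function name is illustrative. The cleanup loop at the top is an addition for this sketch and is not on the slides: it peels off n % 4 iterations so that the main loop can safely drop the intermediate tests even when n is not a multiple of 4.

```c
#include <stddef.h>

void add_scalar_unrolled4_notests(double *x, size_t n, double s) {
    size_t i = n;

    /* Cleanup loop (assumption of this sketch): run until the remaining
       count is a multiple of 4. Does nothing for n = 1000. */
    while (i % 4 != 0) {
        x[i] = x[i] + s;
        i = i - 1;
    }

    /* Main loop: a single exit test per four elements. The index is
       still decremented after every copy, as in the assembly. */
    while (i > 0) {
        x[i] = x[i] + s; i = i - 1;
        x[i] = x[i] + s; i = i - 1;
        x[i] = x[i] + s; i = i - 1;
        x[i] = x[i] + s; i = i - 1;
    }
}
```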
3. Eliminating Data Dependences
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Data dependences through R1 (SUBI, LD, SD)
  Force sequential execution of iterations
Compiler removes this dependency by:
  Computing intermediate R1 values
  Eliminating intermediate SUBI
  Changing final SUBI
Data flow analysis
  Can do on registers
  Cannot do easily on memory locations: is 100(R1) = 20(R2)?
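In C, folding the intermediate decrements into constant offsets might look like the sketch below. The function name is illustrative, and the code assumes n is a multiple of 4 (true for the 1000-iteration loop); it is not safe for other values of n.

```c
#include <stddef.h>

/* Assumes n is a multiple of 4. The offsets -1, -2, -3 play the role of
   -8(R1), -16(R1), -24(R1); the single "i = i - 4" is the final SUBI
   changed from #8 to #32. */
void add_scalar_unrolled4_offsets(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

With the index updated only once, the four copies no longer depend on each other through the induction variable.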
4. Alleviating Data Dependencies
Unrolled loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

Scheduled unrolled loop:
Loop: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SUBI  R1, R1, #32
      SD    16(R1), F12
      BNEZ  R1, Loop
      SD    8(R1), F16
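The ordering used by the scheduled loop (all loads, then all adds, then all stores) can be written out in C as a sketch. The function name is illustrative and n is assumed to be a multiple of 4. A C compiler is free to reorder these statements, so the code documents the schedule rather than enforcing it.

```c
#include <stddef.h>

/* Assumes n is a multiple of 4. */
void add_scalar_unrolled4_scheduled(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 4) {
        /* Four independent loads first (LD F0, F6, F10, F14). */
        double t0  = x[i];
        double t6  = x[i - 1];
        double t10 = x[i - 2];
        double t14 = x[i - 3];
        /* Then four independent adds (ADDD F4, F8, F12, F16). */
        double t4  = t0  + s;
        double t8  = t6  + s;
        double t12 = t10 + s;
        double t16 = t14 + s;
        /* Stores last, when the sums are ready. */
        x[i]     = t4;
        x[i - 1] = t8;
        x[i - 2] = t12;
        x[i - 3] = t16;
    }
}
```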
Some General Comments
Dependences are a property of programs
Actual hazards are a property of the pipeline
Techniques to avoid dependence limitations
  Maintain dependences but avoid hazards
    Code scheduling
      hardware
      software
  Eliminate dependences by code transformations
    Complex
    Compiler-based
Loop-level Parallelism
Primary focus of dependence analysis
Determine all dependences and find cycles

for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}

The third loop, rewritten so that no dependence is carried across iterations:

x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];
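One way to convince yourself that the rewritten loop is equivalent is to run both versions on the same data. The sketch below wraps the two loops in functions; the names loop_original and loop_transformed and the constant N are illustrative. Arrays are assumed to have at least N + 2 elements so that index N + 1 is valid.

```c
enum { N = 100 };

/* Original form: x[i] reads y[i], which the previous iteration wrote. */
void loop_original(double *x, double *y, const double *w, const double *z) {
    for (int i = 1; i <= N; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the value of y[i+1] is produced and consumed inside
   the same iteration, so no dependence crosses iterations. */
void loop_transformed(double *x, double *y, const double *w, const double *z) {
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```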
Dependence Analysis Algorithms
Assume array indexes are affine (ai + b)
GCD test:
  For two affine array indexes ai+b and ci+d: if a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
  Example: x[8*i] = x[4*i + 2] + 3
    (2-0)/GCD(8,4) = 2/4 is not an integer, so no dependence is possible
Exact dependence determination is NP-complete in general
a, b, c, and d may not be known at compile time
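The GCD test is small enough to write down directly. The sketch below is one possible form; the function names are illustrative, and the handling of the case where both coefficients are zero is an assumption added here. A true result only means a dependence cannot be ruled out, since the test ignores loop bounds.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |m| and |n| (Euclid's algorithm). */
static long gcd(long m, long n) {
    m = labs(m);
    n = labs(n);
    while (n != 0) {
        long t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* For a write to x[a*i + b] and a read of x[c*i + d]: returns false when
   the GCD test proves there is no dependence, true when one may exist. */
bool gcd_test_may_depend(long a, long b, long c, long d) {
    long g = gcd(a, c);
    if (g == 0)            /* both indexes are constants (assumed case) */
        return b == d;
    return (d - b) % g == 0;
}
```

For the slide's example, gcd_test_may_depend(8, 0, 4, 2) checks whether 4 divides 2 and therefore reports that no dependence is possible.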
Software Pipelining
[Figure: overlapped execution of Iteration 0 through Iteration 3, marking the start-up code, the software-pipelined iteration, and the finish-up code]
Example
Iteration i:    LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
Iteration i+1:  LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
Iteration i+2:  LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4

Original loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Software-pipelined loop:
Loop: SD    16(R1), F4
      ADDD  F4, F0, F2
      LD    F0, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
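A C sketch of the software-pipelined loop follows, including start-up and finish-up code. The slides show only the steady-state loop, so the start-up and finish-up parts here are a reconstruction, and the function name add_scalar_swp is illustrative. The variables f0 and f4 stand in for registers F0 and F4.

```c
#include <stddef.h>

void add_scalar_swp(double *x, size_t n, double s) {
    if (n < 2) {                    /* too short to pipeline */
        if (n == 1)
            x[1] = x[1] + s;
        return;
    }

    /* Start-up: fill the pipeline. */
    double f0 = x[n];               /* LD   for element n     */
    double f4 = f0 + s;             /* ADDD for element n     */
    f0 = x[n - 1];                  /* LD   for element n - 1 */

    /* Steady state: each pass mixes three original iterations. */
    for (size_t i = n - 2; i > 0; i = i - 1) {
        x[i + 2] = f4;              /* SD:   finishes element i + 2 */
        f4 = f0 + s;                /* ADDD: works on element i + 1 */
        f0 = x[i];                  /* LD:   starts element i       */
    }

    /* Finish-up: drain the pipeline. */
    x[2] = f4;                      /* SD   for element 2 */
    f4 = f0 + s;                    /* ADDD for element 1 */
    x[1] = f4;                      /* SD   for element 1 */
}
```

Inside the steady-state loop, no statement uses a value produced by the statement just before it, which is the point of the transformation.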
Trace (global-code) Scheduling
Find ILP across conditional branches
Two-step process
  Trace selection
    Find a trace (sequence of basic blocks)
    Use loop unrolling to generate long traces
    Use static branch prediction for other conditional branches
  Trace compaction
    Squeeze the trace into a small number of wide instructions
    Preserve data and control dependences
Trace Selection
Flow graph:
  A[I] = A[I] + B[I]
  A[I] = 0?
    true:      B[I] = ...
    false (F): X
  C[I] = ...

Code:
      LW    R4, 0(R1)
      LW    R5, 0(R2)
      ADD   R4, R4, R5
      SW    0(R1), R4
      BNEZ  R4, Else
      ....
      SW    0(R2), . . .
      J     Join
Else: ....
      X
Join: ....
      SW    0(R3), . . .
Summary of Compiler Techniques
Try to avoid dependence stalls
Loop unrolling
  Reduce loop overhead
Software pipelining
  Reduce single body dependence stalls
Trace scheduling
  Reduce impact of other branches
Compilers use a mix of all three
All techniques depend on prediction accuracy
Food for thought: Analyze this
Analyze this for different values of X and Y
  To evaluate different branch prediction schemes
  For compiler scheduling purposes

      add   r1, r0, 1000    # all numbers in decimal
      add   r2, r0, a       # Base address of array a
loop: andi  r10, r1, X
      beqz  r10, even
      lw    r11, 0(r2)
      addi  r11, r11, 1
      sw    0(r2), r11
even: addi  r2, r2, 4
      subi  r1, r1, Y
      bnez  r1, loop
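To experiment with different X and Y, a small C model that counts branch outcomes may help. This is a sketch with several assumptions that are not in the slides: registers are modeled as 32-bit values that wrap around, the array accesses are left out because they do not affect the branches, and max_iters is an added safety cap for values of Y for which r1 does not reach zero in a reasonable number of steps. All names are illustrative.

```c
#include <stdint.h>

typedef struct {
    unsigned long iterations;
    unsigned long beqz_taken;      /* increment of a[] skipped */
    unsigned long beqz_not_taken;  /* lw / addi / sw executed  */
    unsigned long bnez_taken;      /* loop continues           */
    unsigned long bnez_not_taken;  /* loop exits               */
} branch_stats;

/* Returns 1 if the loop exited through bnez, 0 if max_iters was reached. */
int simulate_loop(uint32_t x_mask, uint32_t y_step,
                  unsigned long max_iters, branch_stats *st) {
    branch_stats zero = {0, 0, 0, 0, 0};
    uint32_t r1 = 1000;                    /* add r1, r0, 1000 */
    *st = zero;
    while (st->iterations < max_iters) {
        st->iterations++;
        if ((r1 & x_mask) == 0)            /* andi r10, r1, X ; beqz r10 */
            st->beqz_taken++;
        else
            st->beqz_not_taken++;
        r1 -= y_step;                      /* subi r1, r1, Y */
        if (r1 != 0) {                     /* bnez r1, loop  */
            st->bnez_taken++;
        } else {
            st->bnez_not_taken++;
            return 1;
        }
    }
    return 0;
}
```

Calling simulate_loop(1, 1, 100000, &st) models X = 1, Y = 1; comparing the taken and not-taken counts for several (X, Y) pairs gives a first feel for how predictable each branch is.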