Compiler Scheduling for MIPS ILP

This document discusses compiler support for exploiting instruction-level parallelism (ILP). It works through examples of scheduling code for the MIPS pipeline to reduce stalls. Unrolling loops improves the code further by amortizing loop overhead and exposing parallelism between instructions from different iterations. The compiler must schedule instructions around their dependences to keep the pipeline busy.

Uploaded by

Atharva

Computer Organization and Architecture (AT70.01)
Comp. Sc. and Inf. Mgmt., Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide Sources: Based on CA:aQA by Hennessy/Patterson, supplemented from various freely downloadable sources

Advanced Topic: Compiler Support for ILP
CA:aQA Sec. 4.1

Scheduling Code for the MIPS Pipeline
• Example:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• Notes:
  • the loop is parallel: the body of each iteration is independent of that of other iterations
  • conceptually: if we had 1000 CPUs, we could distribute one iteration to each CPU and compute in parallel (i.e., simultaneously)
• On the simple in-order pipeline assumed here, only the compiler can exploit such instruction-level parallelism (ILP), not the hardware. Why?
  • because only the compiler has a global view of the code
  • the hardware sees each instruction only after it is fetched from memory, not all together; in particular, it never sees the whole loop
  • the compiler must schedule the code intelligently to expose and exploit ILP…

Scheduling Code for the MIPS Pipeline
• Assume FP operation latencies as below
  • latency is the number of intervening cycles required between the producing and the consuming instruction to avoid a stall
• Assume an integer ALU operation latency of 0 and an integer load latency of 1

FP latency table:
    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0

Unscheduled Code
Original C loop statement:
    for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
Unscheduled code for the MIPS pipeline:
    Loop: L.D    F0,0(R1)      ;F0 = array element
          ADD.D  F4,F0,F2      ;add scalar in F2
          S.D    F4,0(R1)      ;store result
          DADDUI R1,R1,#-8     ;decrement pointer, 8 bytes per DW
          BNE    R1,R2,Loop    ;branch if R1 != R2
Execution cycles for the unscheduled code:
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          stall                2
          ADD.D  F4,F0,F2      3
          stall                4
          stall                5
          S.D    F4,0(R1)      6
          DADDUI R1,R1,#-8     7
          stall                8    ;why one stall? Think of when the optimized MIPS pipeline resolves branch outcomes…
          BNE    R1,R2,Loop    9
          stall                10   ;delayed branch stall
10 clock cycles per iteration
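
The cycle count above can be reproduced mechanically from the latency table. The following C sketch is only an illustration under stated assumptions: a single-issue, in-order pipeline; only true (read-after-write) dependences within one pass of the loop body are modelled; an integer ALU result feeding a branch needs one intervening cycle (read off the stall in cycle 8); and a branch that ends the body leaves one unfilled delay slot. The names Instr, latency, loop_cycles and unscheduled_loop_cycles, and the register numbering macros, are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that matter for the latency table. */
enum Kind { FP_ALU, LOAD_D, STORE_D, INT_ALU, BRANCH };

/* Illustrative register numbering: F0..F31 -> 0..31, R0..R31 -> 32..63. */
#define F(n) (n)
#define R(n) (32 + (n))
#define NUM_REGS 64

struct Instr {
    enum Kind kind;
    int dest;    /* register written, or -1 if none */
    int src[2];  /* registers read, -1 if unused */
};

/* Intervening cycles needed between a producer and its consumer. */
static int latency(enum Kind producer, enum Kind consumer)
{
    if (producer == FP_ALU && consumer == FP_ALU)  return 3;
    if (producer == FP_ALU && consumer == STORE_D) return 2;
    if (producer == LOAD_D && consumer == FP_ALU)  return 1;
    /* Assumption taken from the stall pattern: a branch needs its
       integer operand one cycle early. */
    if (producer == INT_ALU && consumer == BRANCH) return 1;
    return 0;
}

/* Cycles for one pass through a loop body. A branch as the last
   instruction (unfilled delay slot) costs one extra cycle. */
int loop_cycles(const struct Instr *code, size_t n)
{
    int issued_at[NUM_REGS] = {0};           /* 0 = not produced in this pass */
    enum Kind made_by[NUM_REGS] = {FP_ALU};  /* read only if issued_at[reg] > 0 */
    int cycle = 0;

    assert(code != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;            /* at most one issue per cycle */
        for (int s = 0; s < 2; s++) {
            int reg = code[i].src[s];
            if (reg >= 0 && issued_at[reg] > 0) {
                int ready = issued_at[reg]
                          + latency(made_by[reg], code[i].kind) + 1;
                if (ready > earliest)
                    earliest = ready;
            }
        }
        cycle = earliest;
        if (code[i].dest >= 0) {
            issued_at[code[i].dest] = cycle;
            made_by[code[i].dest] = code[i].kind;
        }
    }
    if (n > 0 && code[n - 1].kind == BRANCH)
        cycle++;                             /* empty branch delay slot */
    return cycle;
}

/* The unscheduled loop body from the slide. */
int unscheduled_loop_cycles(void)
{
    const struct Instr code[] = {
        { LOAD_D,  F(0), { R(1), -1   } },   /* L.D    F0,0(R1)   */
        { FP_ALU,  F(4), { F(0), F(2) } },   /* ADD.D  F4,F0,F2   */
        { STORE_D, -1,   { F(4), R(1) } },   /* S.D    F4,0(R1)   */
        { INT_ALU, R(1), { R(1), -1   } },   /* DADDUI R1,R1,#-8  */
        { BRANCH,  -1,   { R(1), R(2) } },   /* BNE    R1,R2,Loop */
    };
    return loop_cycles(code, sizeof code / sizeof code[0]);
}
```

Under these assumptions unscheduled_loop_cycles() returns 10, matching the table. Feeding loop_cycles a reordered instruction list is a quick way to compare candidate schedules, although the model places stalls at the earliest legal point, which need not match where a slide draws them.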

Scheduled Code
Scheduled code for the MIPS pipeline:
    Loop: L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          ADD.D  F4,F0,F2
          BNE    R1,R2,Loop    ;delayed branch
          S.D    F4,8(R1)      ;offset altered; S.D interchanged with DADDUI
Execution cycles for the scheduled code:
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          DADDUI R1,R1,#-8     2
          ADD.D  F4,F0,F2      3
          stall                4
          BNE    R1,R2,Loop    5
          S.D    F4,8(R1)      6
6 clock cycles per iteration is optimal because of the dependences: one intervening cycle must separate L.D from ADD.D, and two must separate ADD.D from S.D. Only 3 of the 6 cycles (L.D, ADD.D and S.D) actually operate on the array; the other three (DADDUI, BNE and the stall) are loop overhead…
• The compiler has to be "smart" to perform this scheduling
  • e.g., interchanging the DADDUI and S.D instructions requires understanding the dependence between them and accordingly changing the S.D store address from 0(R1) to 8(R1)!
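
The address adjustment follows from requiring that the store still hits the same memory location after the base register has changed: base + old_offset must equal (base + delta) + new_offset. A minimal C sketch of that arithmetic follows; the function names are hypothetical and used only for illustration.

```c
#include <assert.h>

/* Offset a load/store must use after it is moved BELOW an instruction
   that adds `delta` to its base register, so that it still addresses
   the same location: base + old_offset == (base + delta) + new_offset. */
int offset_after_move(int old_offset, int delta)
{
    int new_offset = old_offset - delta;
    assert(delta + new_offset == old_offset);  /* same effective address */
    return new_offset;
}

/* Inverse case: moving the memory access ABOVE the pointer update. */
int offset_before_move(int old_offset, int delta)
{
    return old_offset + delta;
}
```

With the DADDUI decrement of -8, offset_after_move(0, -8) gives 8, which is the 8(R1) used by the scheduled S.D.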

Unrolling Loops
• The 3 clock cycles of loop overhead per iteration in the scheduled code of the previous example can be reduced…
  • …by amortizing the loop overhead over multiple loop iterations
• For this we need to unroll the loop and block multiple iterations into one
• Loop unrolling also allows improved scheduling by exposing more ILP between instructions from different iterations
• Example…

Unrolling Loops: High-level
Original loop:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
C equivalent of unrolling to block four iterations into one:
    for (i=250; i>0; i=i-1)
    {
        x[4*i]   = x[4*i]   + s;
        x[4*i-1] = x[4*i-1] + s;
        x[4*i-2] = x[4*i-2] + s;
        x[4*i-3] = x[4*i-3] + s;
    }
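
The slide's example works because 1000 is a multiple of 4. As a hedged sketch of the general case, the C code below unrolls by four and handles a trip count that is not a multiple of four with a short cleanup loop. The function names and the cleanup strategy are assumptions added for illustration; the slides show only the fixed 1000-element case.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: adds s to x[1..n] (x[0] is unused, as on the slide). */
void add_scalar(double *x, int n, double s)
{
    for (int i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Same computation with the body unrolled four times. */
void add_scalar_unrolled4(double *x, int n, double s)
{
    assert(x != NULL && n >= 0);
    int rem = n % 4;

    /* Cleanup loop: the top n % 4 elements, one per iteration. */
    for (int i = n; i > n - rem; i = i - 1)
        x[i] = x[i] + s;

    /* Main loop: elements x[1..n-rem], four per iteration. */
    for (int i = (n - rem) / 4; i > 0; i = i - 1) {
        x[4*i]     = x[4*i]     + s;
        x[4*i - 1] = x[4*i - 1] + s;
        x[4*i - 2] = x[4*i - 2] + s;
        x[4*i - 3] = x[4*i - 3] + s;
    }
}
```

For n = 1000 the cleanup loop does not run and the main loop executes 250 times, exactly as in the slide's version.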

Unrolled Loop: not Scheduled
Unrolled but unscheduled code for the MIPS pipeline:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)      ;drop DADDUI & BNE
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)     ;drop DADDUI & BNE
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)   ;drop DADDUI & BNE
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
• Notes:
  • the loop body has been replicated four times
  • different registers are used in each copy, to facilitate future scheduling
  • three branches and three decrements of R1 have been eliminated

Executing the Unrolled Unscheduled Loop
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          stall                2
          ADD.D  F4,F0,F2      3
          stall                4
          stall                5
          S.D    F4,0(R1)      6
          L.D    F6,-8(R1)     7
          stall                8
          ADD.D  F8,F6,F2      9
          stall                10
          stall                11
          S.D    F8,-8(R1)     12
          L.D    F10,-16(R1)   13
          stall                14
          ADD.D  F12,F10,F2    15
          stall                16
          stall                17
          S.D    F12,-16(R1)   18
          L.D    F14,-24(R1)   19
          stall                20
          ADD.D  F16,F14,F2    21
          stall                22
          stall                23
          S.D    F16,-24(R1)   24
          DADDUI R1,R1,#-32    25
          stall                26
          BNE    R1,R2,Loop    27
          stall                28
One iteration of the unrolled loop runs in 28 clock cycles. Therefore, 7 clock cycles per iteration of the original loop: slower than the scheduled original loop!

Scheduling the Unrolled Loop
Unrolled and scheduled code for the MIPS pipeline:
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)    ;16-32 = -16
          BNE    R1,R2,Loop
          S.D    F16,8(R1)     ;8-32 = -24
The last two stores come after the DADDUI, so their offsets are adjusted by +32 to address the same elements as before.

Executing the Unrolled and Scheduled Loop
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          L.D    F6,-8(R1)     2
          L.D    F10,-16(R1)   3
          L.D    F14,-24(R1)   4
          ADD.D  F4,F0,F2      5
          ADD.D  F8,F6,F2      6
          ADD.D  F12,F10,F2    7
          ADD.D  F16,F14,F2    8
          S.D    F4,0(R1)      9
          S.D    F8,-8(R1)     10
          DADDUI R1,R1,#-32    11
          S.D    F12,16(R1)    12
          BNE    R1,R2,Loop    13
          S.D    F16,8(R1)     14
No stalls! One iteration of the unrolled loop runs in 14 clock cycles. Therefore, 3.5 clock cycles per iteration of the original loop vs. 6 cycles for the scheduled but not unrolled loop.
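
The per-iteration figures quoted across these slides are the cycles for one pass through the loop body divided by the number of original iterations that pass covers. A small C sketch of that arithmetic follows; the function names are hypothetical, and the inputs are the cycle counts from the execution tables.

```c
#include <assert.h>

/* Cycles per iteration of the ORIGINAL loop, given the cycles for one
   pass through the (possibly unrolled) loop body. */
double cycles_per_element(int cycles_per_pass, int unroll_factor)
{
    assert(cycles_per_pass > 0 && unroll_factor > 0);
    return (double)cycles_per_pass / unroll_factor;
}

/* How many times faster the new version is than the old one. */
double speedup(double old_cycles, double new_cycles)
{
    assert(new_cycles > 0.0);
    return old_cycles / new_cycles;
}
```

Plugging in the four variants gives 10/1 = 10 (unscheduled), 6/1 = 6 (scheduled), 28/4 = 7 (unrolled only) and 14/4 = 3.5 (unrolled and scheduled). The last is roughly 1.7 times faster than the scheduled loop and roughly 2.9 times faster than the unscheduled one.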

Notes
• Scheduling code (if possible) to avoid stalls is always a win, and optimizing compilers typically generate scheduled assembly
• Unrolling loops can be advantageous but there are potential problems
  • growth in code size
  • register pressure: aggressive unrolling and scheduling requires allocation of multiple registers

Enhancing Loop-Level Parallelism
• Consider the previous running example:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
  • there is no loop-carried dependence, i.e., no case where data used in a later iteration depends on data produced in an earlier one
  • in other words, all iterations could (conceptually) be executed in parallel
• Contrast with the following loop:
    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
  • what are the dependences?

A Loop with Dependences
• For the loop:
    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
  • what are the dependences?
• There are two different dependences:
  • loop-carried:
    • S1 computes A[i+1] using the value of A[i] computed in the previous iteration
    • S2 computes B[i+1] using the value of B[i] computed in the previous iteration
  • not loop-carried:
    • S2 uses the value A[i+1] computed by S1 in the same iteration
• The loop-carried dependences in this case force successive iterations of the loop to execute in series. Why?
  • S1 of iteration i depends on S1 of iteration i-1, which in turn depends on …, etc.
[Figure: dependence chain A[i-1] → A[i] → A[i+1]]

Another Loop with Dependences
• Generally, loop-carried dependences hinder ILP
  • if there are no loop-carried dependences, all iterations could be executed in parallel
  • even if there are loop-carried dependences, it may be possible to parallelize the loop; an analysis of the dependences is required…
• For the loop:
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
  • what are the dependences?
• There is one loop-carried dependence:
  • S1 uses the value of B[i] computed in the previous iteration by S2
  • but this does not force iterations to execute in series. Why…?
  • …because S1 of iteration i depends on S2 of iteration i-1…, and the chain of dependences stops here!
[Figure: dependence from B[i] to A[i]]

Parallelizing Loops with Short Chains of Dependences
• Parallelize the loop:
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
• Parallelized code:
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
  • the dependence between the two statements in the loop is no longer loop-carried, and iterations of the loop may be executed in parallel
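
One way to gain confidence in a transformation like this is to run both versions on the same data and compare the results. The C sketch below does that under stated assumptions: the function names and the LOOP_N constant are hypothetical, arrays are indexed from 1 as on the slide, and each array must have room for index LOOP_N + 1.

```c
#include <assert.h>
#include <stddef.h>

#define LOOP_N 100  /* iteration count; arrays need indices 1..LOOP_N+1 */

/* Original loop: S1 in iteration i reads B[i] written by S2 in
   iteration i-1 (a loop-carried dependence). */
void loop_original(double *A, double *B, const double *C, const double *D)
{
    assert(A != NULL && B != NULL && C != NULL && D != NULL);
    for (int i = 1; i <= LOOP_N; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the first S1 and the last S2 are peeled off, so
   each iteration now produces B[i+1] and consumes it immediately. */
void loop_transformed(double *A, double *B, const double *C, const double *D)
{
    assert(A != NULL && B != NULL && C != NULL && D != NULL);
    A[1] = A[1] + B[1];
    for (int i = 1; i <= LOOP_N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[LOOP_N + 1] = C[LOOP_N] + D[LOOP_N];
}
```

Both functions perform the same additions on the same operand pairs, only grouped differently across iterations, so their outputs should match element for element when started from identical arrays.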

Another Example
• Analyze the loop:
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
        C[i+1] = E[i] + D[i];    /* S3 */
    }
