Unit V
This strategy often produces redundant loads and stores. For example, the sequence
of three-address statements
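a = b + c
d = a + e
would, in a statement-by-statement translation, typically be turned into target code along the following lines (the exact mnemonics depend on the target machine; LD, ADD, and ST are used here only for illustration):
LD R0, b
ADD R0, c
ST a, R0
LD R0, a
ADD R0, e
ST d, R0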
Here, the fourth statement is redundant since it loads a value that has just been
stored, and so is the third if a is not subsequently used.
The quality of the generated code is usually judged by its speed and size.
The cost of the generated code should therefore be taken into account while translating.
For example, if the target machine has an "increment" instruction (INC), then the
three-address statement a = a + 1 may be implemented more efficiently by the single
instruction INC a, rather than by a more obvious sequence that loads a into a
register, adds one to the register, and then stores the result back into a.
5. Register allocation
Instructions involving register operands are shorter and faster than those
involving operands in memory.
The use of registers is subdivided into two subproblems:
o Register allocation - selecting the set of variables that will reside in registers at each point in the program
o Register assignment - picking the specific register in which each such variable will reside
Finding an optimal assignment of registers to variables is difficult; the problem is further complicated by hardware constraints such as the register pairs described below.
Example 1: Certain machines require register pairs (an even and the next odd-numbered register) for some operands and results. For example, integer multiplication and integer division involve register pairs.
The multiplication instruction is of the form
M x, y
where x, the multiplicand, is the even register of an even/odd register pair, and the multiplier y is the odd register. The product occupies the entire even/odd register pair.
The division instruction is of the form
D x, y
where the dividend occupies an even/odd register pair whose even register is x, and the divisor is y. After division,
o the even register holds the remainder
o the odd register holds the quotient.
Example 2: Consider the two three-address code sequences below, in which the only difference between (a) and (b) is the operator in the second statement.
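A representative pair of such sequences is:
(a)
t = a + b
t = t * c
t = t / d
(b)
t = a + b
t = t + c
t = t / d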
The shortest assembly-code sequences for (a) and (b) are given below.
Ri stands for register i. SRDA stands for Shift-Right-Double-Arithmetic: SRDA R0,32 shifts the dividend into R1 and clears R0 so that all its bits equal the sign bit. L, ST, and A stand for load, store, and add, respectively.
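One plausible pair of shortest sequences for (a) and (b), following the classic register-pair example, is:
(a)
L    R1, a
A    R1, b
M    R0, c
D    R0, d
ST   R1, t
(b)
L    R0, a
A    R0, b
A    R0, c
SRDA R0, 32
D    R0, d
ST   R1, t
Sequence (b) needs the extra SRDA because the sum is computed in R0 and must be moved into the even/odd register pair before the division, whereas in (a) the multiplication already leaves its result in the pair R0/R1.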
6. Evaluation order
The order in which the computations are performed can affect the efficiency of the
target code.
Some computation orders require fewer registers to hold intermediate results
than others.
Example
Assume that t, u, and v are temporaries, local to the block, while a, b, c, and d are variables that are live on exit from the block.
Assume that there are as many registers as needed and that, when a register's value is no longer needed, the register can be reused.
The table below shows the assembly code generated for each three-address instruction, together with the register and address descriptors before and after its translation.
Three-Address Code | Assembly Code | Register Descriptor | Address Descriptor
(initial setup) | (none) | R1 = empty, R2 = empty, R3 = empty | a in a, b in b, c in c, d in d; t, u, v hold no value
t = a - b | LD R1, a  LD R2, b  SUB R2, R1, R2 | R1 = a, R2 = t, R3 = empty | a in a, b in b, c in c, d in d, t in R2
u = a - c | LD R3, c  SUB R1, R1, R3 | R1 = u, R2 = t, R3 = c | a in a, b in b, c in c and R3, d in d, t in R2, u in R1
v = t + u | ADD R3, R2, R1 | R1 = u, R2 = t, R3 = v | a in a, b in b, c in c, d in d, t in R2, u in R1, v in R3
a = d | LD R2, d | R1 = u, R2 = a and d, R3 = v | a in R2, b in b, c in c, d in d and R2, u in R1, v in R3
d = v + u | ADD R1, R3, R1 | R1 = d, R2 = a, R3 = v | a in R2, b in b, c in c, d in R1, v in R3
Exit | ST a, R2  ST d, R1 | R1 = d, R2 = a, R3 = v | a in a and R2, b in b, c in c, d in d and R1, v in R3
For the first three-address instruction, t = a - b, we need to issue three instructions, since nothing is in a register initially. Thus, a and b are loaded into registers R1 and R2, and the value t is produced in register R2. We can use R2 for t because the value of b previously in R2 is not needed within the block.
The second instruction, u = a - c, does not require a load of a, since it is already in register R1. We can reuse R1 for the result, u, since the value of a, previously in that register, is no longer needed within the block, and its value is in its own memory location. The address descriptor for a is changed to indicate that a is no longer in R1, but is in the memory location called a.
The third instruction, v = t + u, requires only the addition. We can use R3 for the result, v, since the value of c in that register is no longer needed within the block, and c has its value in its own memory location.
The copy instruction, a = d, requires a load of d, since d is not in a register. Register R2's descriptor now holds both a and d; the addition of a to the register descriptor is the result of processing the copy statement.
The fifth instruction, d = v + u, uses two values that are already in registers. Since u is a temporary whose value is no longer needed, we reuse its register R1 for the new value of d. Variable d is now only in R1, and is not in its own memory location. At the end of the basic block, machine code is therefore needed to store the live-on-exit variables a and d into their memory locations.
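The bookkeeping described above can be captured in a small data structure. The following sketch is only a simplified illustration (registers are chosen by the caller, each register holds at most one variable name, and spilling of live values is not handled); all type and function names are invented for this example.

#include <stdio.h>
#include <string.h>

#define NREGS 4

/* Register descriptor: the name of the variable each register currently holds, or NULL. */
static const char *regdesc[NREGS];

/* Address descriptor for one variable: is its value in memory, and/or in a register? */
struct addr { const char *name; int in_memory; int reg; };   /* reg == -1: not in a register */

static struct addr vars[16];
static int nvars;

static struct addr *lookup(const char *name) {
    for (int i = 0; i < nvars; i++)
        if (strcmp(vars[i].name, name) == 0) return &vars[i];
    vars[nvars] = (struct addr){ name, 1, -1 };    /* variables start out in memory only */
    return &vars[nvars++];
}

/* Make sure variable v is in register r, emitting a load if necessary.
   In this toy, the caller must only evict values that are dead or still in memory. */
static void ensure(const char *v, int r) {
    struct addr *d = lookup(v);
    if (d->reg == r) return;                       /* already there: no load needed */
    printf("LD R%d, %s\n", r, v);
    if (regdesc[r]) lookup(regdesc[r])->reg = -1;  /* previous occupant is no longer in r */
    regdesc[r] = v;
    d->reg = r;
}

/* Emit code for x = y op z, with registers ry, rz, rx chosen by the caller. */
static void gen(const char *x, const char *op,
                const char *y, int ry, const char *z, int rz, int rx) {
    ensure(y, ry);
    ensure(z, rz);
    printf("%s R%d, R%d, R%d\n", op, rx, ry, rz);
    if (regdesc[rx]) lookup(regdesc[rx])->reg = -1;   /* rx now holds x and nothing else */
    regdesc[rx] = x;
    struct addr *d = lookup(x);
    d->reg = rx;
    d->in_memory = 0;                                 /* the new value exists only in rx */
}

int main(void) {
    /* First three instructions of the block, registers chosen as in the table:
       t = a - b; u = a - c; v = t + u */
    gen("t", "SUB", "a", 1, "b", 2, 2);
    gen("u", "SUB", "a", 1, "c", 3, 1);
    gen("v", "ADD", "t", 2, "u", 1, 3);
    return 0;
}

Running this sketch prints exactly the six instructions shown in the first three rows of the table above.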
CODE OPTIMIZATION
Principal Sources of Optimization
Peephole Optimization
Machine-Independent Optimization
Basic Blocks
Types of Optimizations
Machine-independent optimizations:
Machine-independent optimizations are program transformations that improve the target code without taking into consideration any properties of the target machine.
Machine-dependent optimizations:
Machine-dependent optimizations are based on register allocation and utilization of special machine-instruction sequences.
• We will consider only Machine-Independent Optimizations—i.e., they don’t
take into consideration any properties of the target machine.
• The techniques used are a combination of Control-Flow and Data-Flow
analysis.
• - Control-Flow Analysis: Identifies loops in the flow graph of a program
since such loops are usually good candidates for improvement.
• - Data-Flow Analysis: Collects information about the way variables are
used in a program.
Properties of Optimizing Compilers:
The transformations must produce correct target code.
Dead code should be completely removed.
While applying the optimizing transformations, the semantics of the source program should not be changed.
Basic Blocks
A basic block is a sequence of consecutive statements in which flow of
control enters at the beginning and leaves at the end without any halt or
possibility of branching except at the end.
The following sequence of three-address statements forms a basic block:
t1 : = a * a
t2 : = a * b
t3 : = 2 * t2
t4 : = t1 + t3
t5 : = b * b
t6 : = t4 + t5
Flow Graphs
Flow graph is a directed graph containing the flow-of-control information
for the set of basic blocks making up a program.
The nodes of the flow graph are basic blocks. It has a distinguished initial
node.
Intermediate code for the marked fragment of the program in Figure 5.1 is shown
in Figure 5.2. In this example we assume that integers occupy four bytes. The
assignment x = a[i] is translated into the two three address statements t6=4*i and
x=a[t6] as shown in steps (14) and (15) of Figure 5.2. Similarly, a[j] = x becomes
t10=4*j and a[t10]=x in steps (20) and (21).
Fig 5.3: Flow graph for the program in Figure 5.2
Figure 5.3 is the flow graph for the program in Figure 5.2. Block B1 is the entry
node. All conditional and unconditional jumps to statements in Figure 5.2 have
been replaced in Figure 5.3 by jumps to the block of which the statements are
leaders. In Figure 5.3, there are three loops. Blocks B2 and B3 are loops by
themselves. Blocks B2, B3, B4, and B5 together form a loop, with B2 the only
entry point.
5.1.3 Semantics-Preserving Transformations
There are a number of ways in which a compiler can improve a program without
changing the function it computes. Common subexpression elimination, copy
propagation, dead-code elimination, and constant folding are common examples of
such function-preserving (or semantics preserving) transformations.
Fig 5.6: Copies introduced during common subexpression elimination
In order to eliminate the common subexpression from the statement c = d+e in
Figure 5.6(a), we must use a new variable t to hold the value of d + e. The value of
variable t, instead of that of the expression d + e, is assigned to c in Figure 5.6(b).
Since control may reach c = d+e either after the assignment to a or after the
assignment to b, it would be incorrect to replace c = d+e by either c = a or by c =
b.
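Schematically, the transformation described above looks like this (the block labels B1, B2, B3 are illustrative):
(a)  B1: a = d + e        B2: b = d + e
                 B3: c = d + e
(b)  B1: t = d + e        B2: t = d + e
         a = t                b = t
                 B3: c = t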
The idea behind the copy-propagation transformation is to use v for u, wherever
possible after the copy statement u = v. For example, the assignment x = t3 in
block B5 of Figure 5.5 is a copy. Copy propagation applied to B5 yields the code in
Figure 5.7. This change may not appear to be an improvement, but it gives us the opportunity to eliminate the assignment to x.
After reduction in strength is applied to the inner loops around B2 and B3, the
only use of i and j is to determine the outcome of the test in block B4. We know
that the values of i and t2 satisfy the relationship t2 = 4 * i, while those of j and
t4 satisfy the relationship t4 = 4* j. Thus, the test t2 >= t4 can substitute for i >=
j. Once this replacement is made, i in block B2 and j in block B3 become dead
variables, and the assignments to them in these blocks become dead code that can
be eliminated. The resulting flow graph is shown in Figure 5.9.
Fig 5.9: Flow graph after induction-variable elimination
Note:
1. Code motion, induction-variable elimination and strength reduction are loop optimization techniques.
2. Common subexpression elimination, copy propagation, dead code elimination and constant folding are function-preserving transformations.
PEEPHOLE OPTIMIZATION
While most production compilers produce good code through careful instruction selection and register allocation, a few use an alternative strategy: they generate naive code and then improve the quality of the target code by applying "optimizing" transformations to the target program.
The term "optimizing" is somewhat misleading here, but many simple transformations can significantly improve the running time or space requirement of the target program.
A simple but effective technique for locally improving the target code is
peephole optimization, which is done by examining a sliding window of target
instructions (called the peephole) and replacing instruction sequences within the
peephole by a shorter or faster sequence, whenever possible.
Peephole optimization can also be applied directly after intermediate code generation
to improve the intermediate representation.
The peephole is a small, sliding window on a program. The code in the peephole need
not be contiguous, although some implementations do require this.
Characteristics of peephole optimization:
• Repeated passes over the target code are necessary to get the maximum benefit.
Examples of program transformations that are characteristic of peephole optimizations:
• Redundant-instruction elimination
• Flow-of-control optimizations
• Algebraic simplifications
• Use of machine idioms
Eliminating Redundant Loads and Stores
If we see the instruction sequence
LD a, R0
ST R0, a
in a target program, we can delete the store instruction, because whenever it is executed, the first instruction will ensure that the value of a has already been loaded into register R0.
Note that if the store instruction had a label, we could not be sure that the first instruction was always executed immediately before the second, and so we could not remove the store instruction.
Put another way, the two instructions have to be in the same basic block for this
transformation to be safe.
Eliminating Unreachable Code
Another opportunity for peephole optimization is the removal of unreachable
instructions. An unlabeled instruction immediately following an unconditional jump
may be removed. This operation can be repeated to eliminate a sequence of instructions.
For example, for debugging purposes, a large program may have within it certain code
fragments that are executed only if a variable debug is equal to 1.
In C, the source code might look like:
#define debug 0
...
if ( debug ) {
    print debugging information
}
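Following the usual treatment of this example (the intermediate code and label names below are illustrative), the test is first translated into intermediate code of the form:
if debug == 1 goto L1
goto L2
L1: print debugging information
L2:
Eliminating the jump over the jump gives:
if debug != 1 goto L2
print debugging information
L2:
Because debug is the constant 0, the condition 0 != 1 is always true, so the conditional jump can be replaced by goto L2; the statements that print debugging information then become unreachable and can be eliminated one at a time.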
Flow-of-Control Optimizations
Peephole optimization can also simplify jumps to jumps. For example, if L1 is the target of exactly one jump, the sequence
goto L1
...
L1: if a < b goto L2
L3:
can be replaced by
if a < b goto L2
goto L3
...
L3:
While the number of instructions in the two sequences is the same, we sometimes skip the unconditional jump in the second sequence, but never in the first. Thus, the second sequence is superior to the first in execution time.
Algebraic Simplification and Reduction in Strength
Algebraic identities can be used by a peephole optimizer to eliminate three-address statements such as
x = x + 0
or
x = x * 1
whenever they appear in the peephole.
Similarly, reduction-in-strength transformations can be applied in the
peephole to replace expensive operations by equivalent cheaper ones on the target
machine. Certain machine instructions are considerably cheaper than others and can
often be used as special cases of more expensive operators.
For example, x^2 is invariably cheaper to implement as x * x than as a call to an exponentiation routine.
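As a concrete sketch, the toy pass below walks a list of three-address instructions, deletes additions of 0 and multiplications by 1, and rewrites multiplication by a power of two as a shift. The instruction representation and all names are invented for this illustration.

#include <stdio.h>
#include <string.h>

/* A toy three-address instruction: result = lhs op constant. */
struct instr {
    const char *op;        /* "+", "*", or "<<" */
    char result[8], lhs[8];
    int rhs_is_const;
    int rhs_const;
    int deleted;
};

/* Returns k if n == 2^k with k > 0, otherwise -1. */
static int log2_exact(int n) {
    int k = 0;
    while (n > 1 && (n & 1) == 0) { n >>= 1; k++; }
    return (n == 1 && k > 0) ? k : -1;
}

/* One peephole pass: algebraic simplification and reduction in strength. */
static void peephole(struct instr *code, int n) {
    for (int i = 0; i < n; i++) {
        struct instr *p = &code[i];
        if (!p->rhs_is_const) continue;
        if (strcmp(p->op, "+") == 0 && p->rhs_const == 0 && strcmp(p->result, p->lhs) == 0)
            p->deleted = 1;                       /* x = x + 0  -> remove */
        else if (strcmp(p->op, "*") == 0 && p->rhs_const == 1 && strcmp(p->result, p->lhs) == 0)
            p->deleted = 1;                       /* x = x * 1  -> remove */
        else if (strcmp(p->op, "*") == 0) {
            int k = log2_exact(p->rhs_const);
            if (k > 0) { p->op = "<<"; p->rhs_const = k; }   /* x = y * 2^k -> x = y << k */
        }
    }
}

int main(void) {
    struct instr code[] = {
        { "+", "x", "x", 1, 0, 0 },               /* x = x + 0   (removed)        */
        { "*", "t", "a", 1, 4, 0 },               /* t = a * 4   (becomes a << 2) */
        { "*", "y", "y", 1, 1, 0 },               /* y = y * 1   (removed)        */
    };
    int n = sizeof code / sizeof code[0];
    peephole(code, n);
    for (int i = 0; i < n; i++)
        if (!code[i].deleted)
            printf("%s = %s %s %d\n", code[i].result, code[i].lhs, code[i].op, code[i].rhs_const);
    return 0;
}

On the sample input, x = x + 0 and y = y * 1 are deleted, and t = a * 4 becomes t = a << 2.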
• The target machine may have hardware instructions to implement certain specific
operations efficiently.
• Detecting situations that permit the use of these instructions can reduce
execution time significantly.
For example, some machines have auto-increment and auto-decrement addressing
modes. These add or subtract one from an operand before or after using its value.
The use of the modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like
x = x + 1.
x = x + 1 → x++
x = x - 1 → x- -
Machine-independent code optimization can be achieved using the following methods: common subexpression elimination, constant folding and constant propagation, dead code elimination, copy propagation, and loop optimizations.
1. Common Subexpression Elimination:
If an expression has already been computed and its operands have not changed since, recomputing it is redundant; the earlier result can be reused and the repeated computation removed.
//Code segment
t1=x*z;
t2=a+b;
t3=p%t2;
t4=x*z; //Redundant expression
t5=a-z;
// after Optimization
t1=x*z;
t2=a+b;
t3=p%t2;
t5=a-z;
2. Constant Folding:
Constant folding is a technique where an expression whose value can be computed at compile time is replaced by its value. Generally, such expressions would be evaluated at runtime; if we replace them with their values, they need not be evaluated at runtime, saving time.
//Code segment
int x= 5+7+c;
//Folding applied
int x=12+c;
Folding can be applied to booleans and integers as well as to floating-point numbers, but one should be careful with floating-point numbers. Constant folding is often interleaved with constant propagation.
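As a sketch of how a compiler might perform this, the example below folds constant subexpressions in a tiny expression tree. The types and helper functions are invented for this illustration; a real compiler folds on its own intermediate representation and must respect the language's overflow and floating-point rules.

#include <stdio.h>
#include <stdlib.h>

/* A tiny expression tree: an integer constant, a named variable, or a binary operation. */
enum kind { CONST, VAR, BINOP };
struct expr {
    enum kind kind;
    int value;              /* CONST */
    const char *name;       /* VAR */
    char op;                /* BINOP: '+', '-', '*' */
    struct expr *left, *right;
};

static struct expr *mk_const(int v) {
    struct expr *e = calloc(1, sizeof *e);
    e->kind = CONST; e->value = v; return e;
}
static struct expr *mk_var(const char *n) {
    struct expr *e = calloc(1, sizeof *e);
    e->kind = VAR; e->name = n; return e;
}
static struct expr *mk_bin(char op, struct expr *l, struct expr *r) {
    struct expr *e = calloc(1, sizeof *e);
    e->kind = BINOP; e->op = op; e->left = l; e->right = r; return e;
}

/* Fold constants bottom-up: a subtree whose operands are both constants
   is replaced by a single constant node. */
static struct expr *fold(struct expr *e) {
    if (e->kind != BINOP) return e;
    e->left = fold(e->left);
    e->right = fold(e->right);
    if (e->left->kind == CONST && e->right->kind == CONST) {
        int l = e->left->value, r = e->right->value;
        int v = (e->op == '+') ? l + r : (e->op == '-') ? l - r : l * r;
        return mk_const(v);
    }
    return e;
}

static void print(struct expr *e) {
    if (e->kind == CONST) printf("%d", e->value);
    else if (e->kind == VAR) printf("%s", e->name);
    else { printf("("); print(e->left); printf(" %c ", e->op); print(e->right); printf(")"); }
}

int main(void) {
    /* x = 5 + 7 + c, parsed as (5 + 7) + c */
    struct expr *e = mk_bin('+', mk_bin('+', mk_const(5), mk_const(7)), mk_var("c"));
    print(fold(e));          /* prints (12 + c) */
    printf("\n");
    return 0;
}

For the expression 5 + 7 + c, the fold pass replaces the subtree 5 + 7 by the constant 12, leaving 12 + c.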
Constant Propagation:
If a variable is assigned a constant value and used in further computations, constant propagation suggests using that constant value directly in the later computations. Consider the example below.
// Code segment
int a = 5;
int c = a * 2;
int z = a;
// After constant propagation
int a = 5;
int c = 5 * 2;
int z = 5;
// After constant folding
int a = 5;
int c = 10;
int z = 5;
3. Dead Code Elimination:
Dead code is code that is never executed or whose result is never used. Removing it reduces the size of the program without changing its behavior. In the example below, the value assigned to z in the first statement is overwritten before it is ever used, so that assignment is dead code.
//Code
int z = a[10];
z = a + y;
printf("%d,%d", z, y);
//After Optimization
z = a + y;
printf("%d,%d", z, y);
In general, assigning a value to a variable and then changing that value just before it is used makes the earlier assignment dead code. Such dead code needs to be deleted in order to achieve optimization.
4. Copy Propagation:
Copy propagation suggests using one variable in place of another wherever assignments of the form x = y occur. These assignments are copy statements. We can use y at all the places where x is required, instead of assigning y to x first. In short, eliminating such copies from the code is copy propagation.
//Code segment
----;
a=b;
z=a+x;
x=z-b;
----;
//After Optimization
----;
z=b+x;
x=z-b;
----;
Another kind of optimization, loop optimization, deals with reducing the time a program spends inside loops.
Loop Optimizations:
A program spends most of its time inside loops, so loops largely determine the time complexity of the program. In order to get optimal and efficient code, loop optimization is required. To apply loop optimization, we first need to detect loops using control-flow analysis with the help of a program flow graph; a cycle in the flow graph indicates the presence of a loop. Note that the code coming from the intermediate code generation phase is in three-address format, in which loops are difficult to identify directly, which is why a program flow graph is required.
The program flow graph consists of basic blocks (the code divided into blocks), with edges showing the flow of execution between them. A cycle in such a graph, for example from block 2 back to block 3, shows the presence of a loop.
Once the loops are detected following Loop Optimization techniques can be applied :
1. Frequency Reduction
2. Algebraic expression simplification
3. Strength Reduction
4. Redundancy Elimination
1. Frequency Reduction:
Frequency reduction is based on the idea that as little code as possible should be executed inside a loop on each iteration. It can be achieved by the following methods:
a. Code Motion:
Many times, statements that remain unchanged on every iteration are included inside a loop. Such statements are loop invariants; they only make the program spend more time inside the loop. Code motion simply moves loop-invariant code outside the loop, reducing the time spent inside it. To understand this, consider the example below.
//Code segment
p=100;
for(i=0;i<p;i++)
{
a=b+40;
if(p/a==0)
printf("%d",p);
}
//After code motion
p=100;
a=b+40;
for(i=0;i<p;i++)
{
if(p/a==0)
printf("%d",p);
}
In the example, before optimizing, the loop-invariant code was evaluated on every iteration of the loop. Once code motion is applied, the frequency of evaluating the loop-invariant code decreases; this is why code motion is also called frequency reduction. The following is also an example of code motion.
//Code segment
----;
while((x+y)>n)
{
----;
}
----;
//After code motion
----;
int t=x+y;
while(t>n)
{
----;
}
----;
b. Loop Unrolling:
If a loop performs the same operation on every iteration, we can perform that operation more than once within a single iteration (or unroll the loop completely). This is called loop unrolling. An unrolled loop performs the evaluation more than once per iteration, reducing the loop overhead.
//Code segment
int i=0;
while(i<5)
{
x[i]=0;
i++;
}
//After loop unrolling: initializes 5 elements to 0
int i=0;
x[i]=0;
i++;
x[i]=0;
i++;
x[i]=0;
i++;
x[i]=0;
i++;
x[i]=0;
i++;
As in the above example, an unrolled loop can be more efficient than the original loop.
c. Loop Jamming:
Combining loops that carry out the same kind of work over the same range is called loop jamming.
//Code segment
for(i=0;i<5;i++)
for(j=0;j<5;j++)
x[i][j]=0;
for(i=0;i<5;i++)
x[i][i]=1;
//After loop jamming
for(i=0;i<5;i++)
{
for(j=0;j<5;j++)
x[i][j]=0;
x[i][i]=1;
}
Thus, instead of executing two separate loops over the same range, the same work can be done by executing the loop only once.
2. Algebraic Expression Simplification:
Statements such as
A=A+0;
x=x*1;
do not result in any useful computation. Such code may seem harmless, but when it appears inside a loop it is evaluated on every iteration, so it is best to eliminate it.
3. Strength Reduction :
It suggests replacing a costly operation like multiplication with a cheaper one.
Example:
a*4
after reduction
a<<2
It is an important optimization for programs where array accesses occur within loops, and it should be used with integer operands only.
4. Redundancy Elimination:
It may happen that a specific expression is repeated in the code many times. Such an expression is redundant, because we can evaluate it once and substitute the evaluated value at its later occurrences. This substitution is redundancy elimination. A simple example is given below.
//Code:
----;
int x=a+b;
----;
int z=a+b+40;
----;
return (a+b)*5;
//After optimization
----;
int x=a+b;
----;
int z=x+40;
----;
return x*5;
Redundancy elimination avoids evaluating the same expressions multiple times resulting in faster
execution.
Chapter 13
Bootstrapping a compiler
13.1 Introduction
When writing a compiler, one will usually prefer to write it in a high-level language.
A possible choice is to use a language that is already available on the machine
where the compiler should eventually run. It is, however, quite common to be in
the following situation:
You have a completely new processor for which no compilers exist yet. Nevertheless, you want to have a compiler that not only targets this processor, but also
runs on it. In other words, you want to write a compiler for a language A, targeting
language B (the machine language) and written in language B.
The most obvious approach is to write the compiler in language B. But if B
is machine language, it is a horrible job to write any non-trivial compiler in this
language. Instead, it is customary to use a process called “bootstrapping”, referring
to the seemingly impossible task of pulling oneself up by the bootstraps.
The idea of bootstrapping is simple: You write your compiler in language A
(but still let it target B) and then let it compile itself. The result is a compiler from
A to B written in B.
It may sound a bit paradoxical to let the compiler compile itself: In order to
use the compiler to compile a program, we must already have compiled it, and to
do this we must use the compiler. In a way, it is a bit like the chicken-and-egg
paradox. We shall shortly see how this apparent paradox is resolved, but first we
will introduce some useful notation.
13.2 Notation
We will use a notation designed by H. Bratman [11]. The notation is hence called
“Bratman diagrams” or, because of their shape, “T-diagrams”.
[T-diagram: a compiler from language A to language B, written in language C]
In order to use this compiler, it must “stand” on a solid foundation, i.e., something
capable of executing programs written in the language C. This “something” can be
a machine that executes C as machine-code or an interpreter for C running on some
other machine or interpreter. Any number of interpreters can be put on top of each
other, but at the bottom of it all, we need a “real” machine.
An interpreter written in the language D and interpreting the language C, is
represented by the diagram
[Diagram: an interpreter for language C, written in language D]
A machine that directly executes a language is drawn with a pointed bottom. The pointed bottom indicates that a machine need not stand on anything; it is itself the foundation that other things must stand on.
When we want to represent an unspecified program (which can be a compiler,
an interpreter or something else entirely) written in language D, we write it as
[Diagram: an unspecified program written in language D; to run it, it is placed on top of a machine that executes D]
Note that the languages must match: The program must be written in the language
that the machine executes.
We can insert an interpreter into this picture:
[Diagram: a program written in language C, running on an interpreter for C written in D, which in turn runs on a machine that executes D]
Note that, also here, the languages must match: The interpreter can only interpret
programs written in the language that it interprets.
We can run a compiler and use this to compile a program:
[Diagram: a source program in language A is compiled into a target program in language B by an A-to-B compiler written in C, running on a machine that executes C]
The input to the compiler (i.e., the source program) is shown at the left of the
compiler, and the resulting output (i.e., the target program) is shown on the right.
Note that the languages match at every connection and that the source and target
program are not “standing” on anything, as they are not executed in this diagram.
We can insert an interpreter in the above diagram:
[Diagram: the same compilation as above, but the A-to-B compiler written in C now runs on an interpreter for C written in D, which runs on a machine that executes D]
It often happens that a compiler does exist for the desired source language, it
just does not run on the desired machine. Let us, for example, assume we want a
compiler for ML to x86 machine code and want this to run on an x86. We have
access to an ML-compiler that generates ARM machine code and runs on an ARM
machine, which we also have access to. One way of obtaining the desired compiler
would be to do binary translation, i.e., to write a compiler from ARM machine
code to x86 machine code. This will allow the translated compiler to run on an
x86, but it will still generate ARM code. We can use the ARM-to-x86 compiler to
translate this into x86 code afterwards, but this introduces several problems:
• We still need to make the ARM-to-x86 compiler run on the x86 machine.
Instead, we can write the ML-to-x86 compiler in ML and compile it with the existing ML-to-ARM compiler on the ARM machine:
[Diagram: the ML-to-x86 compiler written in ML is compiled by the ML-to-ARM compiler (written in ARM code, running on the ARM machine), producing an ML-to-x86 compiler written in ARM code]
Now, we can run the ML-to-x86 compiler on the ARM and let it compile itself¹:
[Diagram: the ML-to-x86 compiler written in ML compiles itself on the ARM, using the ML-to-x86 compiler in ARM code, producing an ML-to-x86 compiler in x86 code]
We have now obtained the desired compiler. Note that the compiler can now be
used to compile itself directly on the x86 platform. This can be useful if the compiler is later extended or, simply, as a partial test of correctness: If the compiler,
when compiling itself, yields a different object code than the one obtained with the
above process, it must contain an error. The converse is not true: Even if the same
target is obtained, there may still be errors in the compiler.
¹ We regard a compiled version of a program as the same program as its source-code version.
It is possible to combine the two above diagrams to a single diagram that covers
both executions:
[Diagram: the two compilations combined: the ML-to-ARM compiler (on the ARM machine) compiles the ML-to-x86 compiler written in ML into ARM code, and that ARM-code compiler (also on the ARM machine) then compiles the same ML source into x86 code]
In this diagram, the ML-to-x86 compiler written in ARM has two roles: It is the
output of the first compilation and the compiler that runs the second compilation.
Such combinations can, however, be a bit confusing: The compiler that is the input to the second compilation step looks like it is also the output of the leftmost
compiler. In this case, the confusion is avoided because the leftmost compiler is
not running and because the languages do not match. Still, diagrams that combine
several executions should be used with care.
The above bootstrapping process relies on an existing compiler for the desired language, albeit running on a different machine. It is, hence, often called "half bootstrapping". When no existing compiler is available, e.g., when a new language has been designed, we need to use a more complicated process called "full bootstrapping".
A common method is to write a QAD ("quick and dirty") compiler using an existing language. This compiler need not generate code for the desired target machine (as long as the generated code can be made to run on some existing platform), nor does it have to generate good code. The important thing is that it allows programs in the new language to be executed. Additionally, the "real" compiler is written in the new language and will be bootstrapped using the QAD compiler.
As an example, let us assume we design a new language “M+”. We, initially,
write a compiler from M+ to ML in ML. The first step is to compile this, so it can
run on some machine:
[Diagram: the QAD M+-to-ML compiler written in ML is compiled by the ML-to-ARM compiler on the ARM machine, producing an M+-to-ML compiler in ARM code]
The QAD compiler can now be used to compile the “real” compiler:
[Diagram: the "real" M+-to-ARM compiler, written in M+, is compiled by the QAD M+-to-ML compiler (in ARM code, on the ARM machine), producing an M+-to-ARM compiler written in ML]
The result is an ML program, which we must in turn compile with the ML compiler:
[Diagram: the M+-to-ARM compiler written in ML is compiled by the ML-to-ARM compiler on the ARM machine, producing an M+-to-ARM compiler in ARM code]
The result of this is a compiler with the desired functionality, but it will probably
run slowly. The reason is that it has been compiled by using the QAD compiler (in
combination with the ML compiler). A better result can be obtained by letting the
generated compiler compile itself:
[Diagram: the "real" M+-to-ARM compiler, written in M+, compiles itself using the M+-to-ARM compiler in ARM code produced above (on the ARM machine), yielding an M+-to-ARM compiler in ARM code]
This yields a compiler with the same functionality as the above, i.e., it will generate
the same code, but, since the “real” compiler has been used to compile it, it will run
faster.
The need for this extra step might be a bit clearer if we had let the "real" compiler generate x86 code instead, as it would then be obvious that the last step is
required to get the compiler to run on the same machine that it targets. We chose
the target language to make a point: Bootstrapping might not be complete even if a
compiler with the right functionality has been obtained.
Using an interpreter
Instead of writing a QAD compiler, we can write a QAD interpreter. In our example, we could write an M+ interpreter in ML. We would first need to compile
this:
[Diagram: the M+ interpreter written in ML is compiled by the ML-to-ARM compiler on the ARM machine, producing an M+ interpreter in ARM code]
We can then use the interpreter to run the "real" compiler and let it compile itself:
[Diagram: the M+-to-ARM compiler, written in M+, compiles itself while running on the M+ interpreter (in ARM code) on the ARM machine, producing an M+-to-ARM compiler in ARM code]
Since the “real” compiler has been used to do the compilation, nothing will be
gained by using the generated compiler to compile itself, though this step can still
be used as a test and for extensions.
Though using an interpreter requires fewer steps, this should not really be a
consideration, as the computer(s) will do all the work in these steps. What is important is the amount of code that needs to be written by hand. For some languages,
a QAD compiler will be easier to write than an interpreter, and for other languages
an interpreter is easier. The relative ease/difficulty may also depend on the language
used to implement the QAD interpreter/compiler.