Compiler Design Note1
What is a compiler?
A program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language).
Why do we design compilers?
Why do we study compiler construction techniques?
Compilers provide an essential interface between applications and architectures
Compilers embody a wide range of theoretical techniques
Since different platforms (hardware architectures together with operating systems such as Windows, macOS, or Unix) require different machine code, most programs must be compiled separately for each platform.
Page 1 of 115
An interpreter does not translate the whole source program into object code.
Interpretation is important when:
Programmer is working in interactive mode and needs to view and update
variables
Running speed is not important
Commands have simple formats, and thus can be quickly analyzed and
executed
Modification or addition to user programs is required as execution proceeds
Interpreter:
o An interpreter takes one statement, translates it, executes it, and then takes the next statement.
o An interpreter stops translating when it encounters the first error.
Compiler:
A compiler, in contrast, translates the entire program in one go; the program is then executed.
It generates the error report after translating the entire program.
It takes a large amount of time analyzing and processing the high-level language code.
However, the overall execution time is faster.
Interpreter:
You can run bytecode on any computer that has a Java Interpreter installed
Assemblers:
Translator for the assembly language.
Linker
Loader
Loads the executable code, which is the output of the linker, into main memory.
Pre-processors
Such a pre-processor:
The translation process
A compiler consists internally of a number of steps, or phases, that perform distinct logical operations.
The phases of a compiler are shown in the next slide, together with three auxiliary
components that interact with some or all of the phases:
Analysis consists of three phases:
1) Linear/Lexical analysis
2) Hierarchical/Syntax analysis
3) Semantic analysis
Blanks, new lines, tabulation marks will be removed during lexical analysis.
Example:
a[index] = 4 + 2;
a       identifier
[       left bracket
index   identifier
]       right bracket
=       assignment operator
4       number
+       plus operator
2       number
;       semicolon
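The scanning of a statement like the one above can be sketched in Python with a regex-based tokenizer. This is a hypothetical illustration of what a scanner does, not the scanner generator discussed later; the token names are assumptions.

```python
import re

# Hypothetical token specification for the tiny example a[index] = 4 + 2;
TOKEN_SPEC = [
    ("NUMBER",   r"\d+"),
    ("IDENT",    r"[A-Za-z_]\w*"),
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("ASSIGN",   r"="),
    ("PLUS",     r"\+"),
    ("SEMI",     r";"),
    ("SKIP",     r"[ \t\n]+"),  # blanks, tabs, newlines are discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Return (token-name, lexeme) pairs, dropping whitespace."""
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Running `tokenize("a[index] = 4 + 2;")` produces exactly the token sequence listed above, with blanks removed during scanning.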
A scanner may perform other operations along with the recognition of tokens.
It may enter identifiers into the symbol table.
The results of syntax analysis are usually represented by a parse tree or a syntax tree.
In a syntax tree, each interior node represents an operation and the children of the node represent the arguments of the operation.
Sometimes syntax trees are called abstract syntax trees, since they represent a further
abstraction from parse trees. Example is shown in the following figure.
Syntax Analysis Tools
Semantic analysis
o Type checking.
Synthesis of the target program
Code generator
The machine code generator receives the (optimized) intermediate code, and then it
produces either:
o Assembly code for a specific machine and assembler.
Code generator
The code generator takes the IR code and generates code for the target machine.
*R1: indirect register addressing (the last instruction stores the value 6 to the address contained in R1).
Grouping of phases
The discussion of phases deals with the logical organization of a compiler.
Compiler passes:
A pass consists of reading an input file and writing an output file.
For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis,
and intermediate code generation might be grouped together into one pass.
Single pass:
o Is a compiler that passes through the source code of each compilation unit only
once
o They are unable to generate efficient programs, due to the limited scope available.
Multi pass:
o Is a type of compiler that processes the source code or abstract syntax tree of a
program several times
Syntax Tree
o Nodes have fields containing information collected by the parser and semantic
analyzer
Symbol Table
o Tokens are entered by the scanner and parser
o Code generation and optimization phases use the information in the symbol table
Performance Issues
o Insertion, deletion, and search operations need to be efficient because they are
frequent
Literal Table
Scanner generators
Parser Generators
o These tools produce a parser (syntax analyzer) when given a Context-Free Grammar (CFG) that describes the syntax of the source language.
o It produces a collection of routines that walk the parse tree and execute some
tasks.
Automatic code generators
o Take a collection of rules that define the translation of the IC to target code and
produce a code generator.
This completes our brief description of the phases of a compiler (Chapter 1).
For anything unclear, or for any comments, questions, or doubts, please do not hesitate to let me know.
Review Exercise
1) What is a compiler?
4) Consider the line of C++ code: float [index] = a-c. write its:
C. Code generator
b) What other programs are used in this process, and what makes them different from compilers?
Chapter 2
Lexical analysis
Introduction
The role of lexical analyzer is:
o Produce as output a sequence of tokens, one for each lexeme in the source program.
Patterns are rules describing the set of lexemes belonging to a token.
Example: The following table shows some tokens and their lexemes in Pascal (a high
level, case insensitive programming language)
o an alphabet Σ of legal characters;
o the metacharacter ε; or
o the metacharacter ø.
In the first case, L(a)={a}; in the second case, L(ε)= {ε}; and in the third case, L(ø)= { }.
{ } contains no string at all, while {ε} contains the single string consisting of no characters.
Alternation: an expression of the form r|s, where r and s are regular expressions.
o In this case, L(r|s) = L(r) ∪ L(s); for single characters, L(a|b) = {a, b}.
Concatenation: an expression of the form rs, where r and s are regular expressions.
Union of L and M
o L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation of L and M
o LM = {xy | x ∈ L and y ∈ M}
Exponentiation of L
o L^0 = {ε}; L^i = L^(i-1)L
Kleene closure of L
o L* = ∪ (i = 0, …, ∞) L^i
Positive closure of L
o L+ = ∪ (i = 1, …, ∞) L^i
Note: The following short hands are often used:
r+ = rr*
r* = r+ | ε
r? = r | ε
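These shorthand equivalences can be checked mechanically, for instance with Python's re module, using an empty alternative to stand in for ε (a sketch for illustration only):

```python
import re

def matches(pattern, s):
    """True if the whole string s matches the regular expression pattern."""
    return re.fullmatch(pattern, s) is not None

# r+ = rr* and r? = r|ε, checked on sample strings with r = a.
for s in ["", "a", "aaa"]:
    assert matches("a+", s) == matches("aa*", s)   # r+ is equivalent to rr*
    assert matches("a?", s) == matches("a|", s)    # r? is r with an empty alternative
```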
REs: Examples
a) L(01) = ?
b) L(01|0) = ?
c) L(0(1|0)) = ?
L(0*) = ?
L((0|10)*(ε|1)) = ?
L(01) = {01}.
L((0|10)*(ε|1)) = all strings of 0's and 1's without two consecutive 1's.
1- a | b = ?
2- (a|b)a = ?
3- (ab) | ε = ?
4- ((a|b)a)* = ?
Reverse
2 – An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of all strings over this alphabet that contain exactly one b.
1- a | b = {a,b}
2- (a|b)a = {aa,ba}
3- (ab) | ε ={ab, ε}
Exercises
1- a(a|b)*a
2- ((ε|a)b*)*
3- (a|b)*a(a|b)(a|b)
4- a*ba*ba*ba*
o If α is a regular expression, so is α*
Reserved (key) words: they are represented by their fixed sequences of characters.
If we want to collect all the reserved words into one definition, we could write it as follows:
Special symbols: including arithmetic operators, assignment and equality such as =, :=, +, -, *
Identifiers: defined to be a sequence of letters and digits beginning with a letter:
letter = A|B|…|Z|a|b|…|z
digit = 0|1|…|9
or
letter= [a-zA-Z]
digit = [0-9]
identifiers = letter(letter|digit)*
o decimal numbers, or
nat = [0-9]+
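Translated into Python regex notation (a direct, hedged transliteration of the definitions above), these patterns behave as expected:

```python
import re

# letter(letter|digit)* for identifiers, [0-9]+ for natural numbers
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
nat = re.compile(r"[0-9]+")

assert identifier.fullmatch("x25")          # begins with a letter: accepted
assert not identifier.fullmatch("9lives")   # begins with a digit: rejected
assert nat.fullmatch("007")                 # one or more digits
```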
relop = < | <= | = | <> | > | >=
name = n;
color = c;
System.out.println("Woof");
Automata
Abstract machines
Characteristics
Input: input values (from an input alphabet ∑) are applied to the machine
States: at any instant, the automaton can be in one of several states
State relation: the next state of the automaton at any instant is determined by the present state and the present input
Types of automata
o Have a finite number of states, and a finite amount of memory (i.e., the current
state).
Finite Automata
Lex – turns its input specification into a lexical analyzer.
Finite automata are recognizers; they simply say "yes" or "no" about each possible input
string.
a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges.
b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.
Overview
o Token → Pattern
o NFA → DFA
Non-Deterministic Finite Automata (NFA)
Definition:
The set of strings of characters c1c2…cn, with each ci from Σ ∪ {ε}, such that there exist states s1 in T(s0, c1), s2 in T(s1, c2), …, sn in T(sn-1, cn), with sn an element of F.
o The same symbol can label edges from one state to several different states.
Transition Graph
The transition graph for an NFA recognizing the language of regular expression
(a|b)*abb
Transition Table
The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb
for the example NFA
An NFA accepts input string x if and only if there is some path in the transition graph
from the start state to one of the accepting states
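As a sketch, an NFA can be simulated by tracking the set of states reachable after each input symbol. The machine below is a hand-coded encoding of the (a|b)*abb example, with state 3 accepting; the state numbering is an assumption for illustration.

```python
# Hand-coded NFA for (a|b)*abb: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},   # on a, stay in 0 or guess the final "abb" has begun
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},      # state 3 is the accepting state
}

def nfa_accepts(s, start=0, accepting={3}):
    """Simulate the NFA by keeping the set of currently reachable states."""
    current = {start}
    for c in s:
        current = set().union(*(NFA.get((q, c), set()) for q in current))
    return bool(current & accepting)
```

The string is accepted exactly when some path through the transition graph ends in an accepting state, as described above.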
Another NFA
aa*|bb*
o For each state S and input symbol a there is at most one edge labeled a leaving S
Each entry in the transition table is a single state
DFSA: Example
DFA example
INPUT:
o A DFA D with start state S0, accepting states F, and transition function move.
METHOD
o The function move(s, c) gives the state to which there is an edge from state s on input c.
o The function nextChar() returns the next character of the input string x.
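The driver described above (move as a table lookup, nextChar consuming the input) can be sketched in Python; the transition table here encodes the DFA for (a|b)*abb and is an assumed example machine:

```python
# move(s, c) as a dictionary: (state, symbol) -> next state.
# States: 0 = no progress, 1 = seen a, 2 = seen ab, 3 = seen abb (accepting).
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

def dfa_accepts(x, start=0, accepting={3}):
    s = start
    for c in x:            # c = nextChar()
        s = MOVE[(s, c)]   # s = move(s, c)
    return s in accepting  # accept iff we end in a state of F
```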
Example:
DFA: Exercise
digit = [0-9]
nat = digit+
signedNat = (+|-)? nat
number = signedNat ("." nat)? (E signedNat)?
Process:
o Build a DFA
How?
Implement it
Step 1: Come up with a Regular Expression
(a|b)*ab
Two algorithms:
2- Translate NFA into DFA (Subset construction)
Rules:
Case 1: Alternation: regular expression (s|r), assume that NFAs equivalent to r and s
have been constructed.
letter(letter|digit)*
(a|b)*abb
From an NFA to a DFA (subset construction algorithm)
Rules:
ε-closure
1- S ∈ ε-closure(S) (S itself)
2- if t ∈ ε-closure(S) and there is an edge labeled ε from t to v, then v ∈ ε-closure(S)
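These two rules can be implemented as a worklist fixpoint; in the sketch below, eps is a hypothetical map from a state to its ε-successors:

```python
def eps_closure(states, eps):
    """Rule 1: every state is in its own closure (the initial set).
    Rule 2: follow ε-edges until no new state can be added."""
    closure = set(states)
    stack = list(states)
    while stack:
        t = stack.pop()
        for v in eps.get(t, ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return closure
```

For example, with ε-edges 0→1, 0→2, and 2→3, the ε-closure of {0} is {0, 1, 2, 3}.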
Algorithm
Begin
Mark X
For each input symbol a do
Begin
Let T be the set of states to which there is a transition on a from a state si in X.
Y = ε-closure(T)
If Y is not already in D, add Y as an "unmarked" state of D; add a transition from X to Y labeled a if not already present
End
End
Example: Convert the following NFA into the corresponding DFA. letter (letter|digit)*
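Putting ε-closure and the marking algorithm together, the subset construction can be sketched as follows. The dictionary encodings of the NFA and its ε-edges are assumptions for illustration; DFA states are frozensets of NFA states.

```python
def eps_closure(states, eps):
    """ε-closure of a set of states (worklist fixpoint over ε-edges)."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for v in eps.get(t, ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return closure

def subset_construction(nfa, eps, start, alphabet):
    """nfa: (state, symbol) -> set of states; eps: state -> set of ε-successors."""
    start_set = frozenset(eps_closure({start}, eps))
    dstates, worklist, dtran = {start_set}, [start_set], {}
    while worklist:
        X = worklist.pop()                      # mark an unmarked DFA state X
        for a in alphabet:
            # T: states reachable from X on input a
            T = set().union(*(nfa.get((s, a), set()) for s in X))
            Y = frozenset(eps_closure(T, eps))  # Y = ε-closure(T)
            if Y and Y not in dstates:          # new unmarked state of D
                dstates.add(Y)
                worklist.append(Y)
            if Y:
                dtran[(X, a)] = Y               # transition X --a--> Y
    return start_set, dstates, dtran

# Example: the NFA for (a|b)*abb used earlier (no ε-edges in this encoding).
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
START, STATES, TRAN = subset_construction(NFA, {}, 0, "ab")
```

The resulting DFA states are {0}, {0,1}, {0,2}, and {0,3}; any subset containing NFA state 3 is accepting.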
Other Algorithms
The Lexical- Analyzer Generator: Lex
The first phase of a compiler reads the input source and converts strings in the source to tokens.
Lex: generates a scanner (lexical analyzer or lexer) given a specification of the tokens using
REs.
o The input notation for the Lex tool is referred to as the Lex language.
The Lex compiler transforms the input patterns into a transition diagram and generates
code, in a file called lex.yy.c, that simulates this transition diagram.
By using regular expressions, we can specify patterns to lex that allow it to scan and match
strings in the input.
Typically an action returns a token, representing the matched string, for subsequent use by
the parser.
It uses patterns that match strings in the input and converts the strings to tokens.
Scanner, Parser, Lex and Yacc
We will see more about Lex and its construction in the lab sessions.
Summary of Chapter 2
Tokens: The lexical analyzer scans the source program and produces as output a
sequence of tokens, which are normally passed, one at a time to the parser. Some tokens
may consist only of a token name while others may also have an associated lexical value
that gives information about the particular instance of the token that has been found on
the input.
Lexemes: Each time the lexical analyzer returns a token to the parser, it has an associated
lexeme - the sequence of input characters that the token represents.
Buffering: Because it is often necessary to scan ahead on the input in order to see where
the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input.
Using a pair of buffers cyclically and ending each buffer's contents with a sentinel that
warns of its end are two techniques that accelerate the process of scanning the input.
Patterns. Each token has a pattern that describes which sequences of characters can form
the lexemes corresponding to that token. The set of words or strings of characters that
match a given pattern is called a language.
Regular Expressions: These expressions are commonly used to describe patterns.
Regular expressions are built from single characters, using union, concatenation, and the
Kleene closure, or any-number-of, operator.
Deterministic Finite Automata: A DFA is a special kind of finite automaton that has
exactly one transition out of each state for each input symbol. Also, transitions on empty
input are disallowed. The DFA is easily simulated and makes a good implementation of a
lexical analyzer, similar to a transition diagram.
Nondeterministic Finite Automata: Automata that are not DFA's are called
nondeterministic. NFA's are often easier to design than DFA's. Another possible
architecture for a lexical analyzer is to tabulate all the states that NFA's for each of the
possible patterns can be in, as we scan the input characters.
Lex: There is a family of software systems, including Lex and Flex, that are lexical-
analyzer generators. The user specifies the patterns for tokens using an extended regular-
expression notation. Lex converts these expressions into a lexical analyzer that is
essentially a deterministic finite automaton that recognizes any of the patterns.
Minimization of Finite Automata: For every DFA there is a minimum-state DFA
accepting the same language. Moreover, the minimum-state DFA for a given language is
unique except for the names given to the various states.
Review Exercise
1) Divide the following C++ program:
float limitedSquare(x) float x {
/* returns x-squared, but never more than 100 */
return (x<=-10.0||x>=10.0)?100:x*x;
into appropriate lexemes. Which lexemes should get associated lexical values? What should
those values be?
2) Write regular definitions for the following languages:
a) All strings of lowercase letters that contain the five vowels in order.
b) All strings of lowercase letters in which the letters are in ascending lexicographic order.
c) Comments, consisting of a string surrounded by /* and */, without an intervening */,
unless it is inside double-quotes (").
d) All strings of digits with no repeated digits. Hint: Try this problem first with a few digits,
such as {0, 1, 2}. !!
e) All strings of digits with at most one repeated digit. !!
f) All strings of a's and b's with an even number of a's and an odd number of b's.
g) The set of Chess moves, in the informal notation, such as p-k4 or kbp x qn.!!
h) All strings of a's and b's that do not contain the substring abb.
i) All strings of a's and b's that do not contain the subsequence abb.
3) Construct the minimum-state DFA's for the following regular expressions:
a) (a|b)*a(a|b)
b) (a|b)*a(a|b) (a|b)
c) (a|b)*a(a|b) (a|b)(a|b)
Chapter – 3
Syntax analysis
Introduction
Syntax: the way in which tokens are put together to form expressions, statements, or
blocks of statements.
Syntax analysis: the task concerned with fitting a sequence of tokens into a specified
syntax.
Parsing: To break a sentence down into its component parts with an explanation of the
form, function, and syntactical relationship of each part.
Parser
The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a CFG or not.
A CFG:
Top-down parser
o The parse tree is created top to bottom, starting from the root to leaves.
Bottom-up parser
o The parse tree is created bottom to top, starting from the leaves to root.
Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
Efficient top-down and bottom-up parsers can be implemented by making use of context-free grammars.
G = (T, N, P, S) where
Derivation
A derivation is a sequence of replacements of structure names by choices on the right
hand sides of grammar rules.
Example: E → E + E | E – E | E * E | E / E | -E
E→(E)
E → id
o we can replace E by E + E
o we have to have a production rule E → E+E in our grammar.
If we always choose the left-most non-terminal in each derivation step, this derivation is
called left-most derivation.
Example: E=>-E=>-(E)=>-(E+E)=>-(id+E)=>-(id+id)
If we always choose the right-most non-terminal in each derivation step, this derivation is
called right-most derivation.
Example: E=>-E=>-(E)=>-(E+E)=>-(E+id)=>-(id+id)
We will see that the top-down parser tries to find the left-most derivation of the given source program.
We will see that the bottom-up parser tries to find the right-most derivation of the given source program in reverse order.
Parse tree
A parse tree is a graphical representation of a derivation.
It filters out the order in which productions are applied to replace non-terminals.
o the children of each internal node represent the replacement of the associated non-
terminal in one step of the derivation.
Parse tree and Derivation
Ambiguity: example
Ambiguity: example…
To add precedence
o Force the parser to recognize high precedence sub expressions first
To add association
o Elimination of ambiguity
Left Recursion
Top-down parsing methods cannot handle left-recursive grammar. So a transformation that
eliminates left-recursion is needed.
To eliminate immediate left recursion, a single production A → Aα | β could be replaced by the non-left-recursive productions
A → βA′
A′ → αA′ | ε
Generally, we can eliminate immediate left recursion from them by the following technique.
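As an illustration of this technique, a small Python sketch can rewrite A → Aα | β into A → βA′, A′ → αA′ | ε. Productions are encoded as lists of symbols, and the function name and encoding are assumptions for illustration.

```python
def eliminate_left_recursion(nt, productions):
    """Eliminate immediate left recursion for one non-terminal nt.
    productions: list of right-hand sides, each a list of symbols."""
    recursive = [p[1:] for p in productions if p and p[0] == nt]   # the alphas
    others = [p for p in productions if not p or p[0] != nt]       # the betas
    if not recursive:
        return {nt: productions}       # nothing to do
    new_nt = nt + "'"                  # the fresh non-terminal A'
    return {
        nt: [beta + [new_nt] for beta in others],                  # A  -> beta A'
        new_nt: [alpha + [new_nt] for alpha in recursive] + [["ε"]],  # A' -> alpha A' | ε
    }
```

For example, E → E + T | T becomes E → T E′ and E′ → + T E′ | ε, matching the transformation above.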
Left factoring
When a non-terminal has two or more productions whose right-hand sides start with the same
grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing
A predictive parser (a top-down parser without backtracking) insists that the grammar must
be left-factored.
In general: A → αβ1 | αβ2, where α is non-empty and β1 and β2 differ in their first symbols.
When processing α we do not know whether to expand A to αβ1 or to αβ2, but if we re-write
the grammar as follows:
A → αA′
A′ → β1 | β2
S → iEtS | iEtSeS | a
E → b
S → iEtSS′ | a
S′ → eS | ε
E → b
Syntax analysis
Every language has rules that prescribe the syntactic structure of well-formed programs.
The syntax can be described using Context Free Grammars (CFG) notation.
o it is possible to have a tool which produces automatically a parser using the grammar
o a properly designed grammar helps in modifying the parser easily when the language
changes
Top-down parsing
Recursive Descent Parsing (RDP)
This method of top-down parsing can be considered as an attempt to find the left most
derivation for an input string. It may involve backtracking.
o Two pointers, one for the tree and one for the input, will be used to indicate where the
parsing process is.
o Initially, they will be on S and the first input symbol, respectively.
o Then we use the first S-production to expand the tree. The tree pointer will be
positioned on the left most symbol of the newly created sub-tree.
As the symbol pointed by the tree pointer matches that of the symbol pointed by the input
pointer, both pointers are moved to the right.
Whenever the tree pointer points on a non-terminal, we expand it using the first
production of the non-terminal.
Whenever the pointers point on different terminals, the production that was used is not
correct, thus another production should be used. We have to go back to the step just
before we replaced the non-terminal and use another production.
If we reach the end of the input and the tree pointer passes the last symbol of the tree, we
have finished parsing.
Example: G: S → cAd
A → ab | a
Draw the parse tree for the input string cad using the above method.
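The backtracking procedure described above can be sketched for this grammar: each non-terminal becomes a function that returns the input position after a successful match, or None on failure, and A tries its productions in order.

```python
# Recursive-descent parsing with backtracking for G: S -> cAd, A -> ab | a.

def parse_A(s, i):
    # Try the first production A -> ab; on failure, backtrack and try A -> a.
    if s[i:i+2] == "ab":
        return i + 2
    if s[i:i+1] == "a":
        return i + 1
    return None

def parse_S(s, i):
    # S -> cAd: match c, then A, then d.
    if i < len(s) and s[i] == "c":
        j = parse_A(s, i + 1)
        if j is not None and j < len(s) and s[j] == "d":
            return j + 1
    return None

def accepts(s):
    return parse_S(s, 0) == len(s)
```

On input cad, A first tries ab, fails, backtracks, and succeeds with a, so parsing finishes with the whole input consumed, exactly as the pointer-based description above.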
Exercise:
S → A
A → A + A | B++
B → y
S→E
E → id
|(E.E)
|(L)
|()
L→LE
|E
This method uses a parsing table that determines the next production to be applied. The
input buffer contains the string to be parsed followed by $ (the right end marker)
Initially, the stack contains the start symbol of the grammar followed by $.
The parsing table is a two dimensional array M[A, a] where A is a non-terminal of the
grammar and a is a terminal or $.
Predictive Parsing…
2. x = a ≠ $ : the parser pops x off the stack and advances the input pointer to the next
symbol
If M[X, a] = {X → uvw}, X on top of the stack will be replaced by uvw (u at the top of the stack).
E → TR
R → -TR
R → ε
T → 0 | 1 | … | 9
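For this grammar, the stack-and-table mechanism described above can be sketched directly; the table M here is hand-built for illustration rather than derived from FIRST/FOLLOW sets:

```python
# Table-driven predictive parser for E -> TR, R -> -TR | ε, T -> 0|...|9.
DIGITS = "0123456789"
M = {("E", d): ["T", "R"] for d in DIGITS}   # M[E, digit] = TR
M.update({("T", d): [d] for d in DIGITS})    # M[T, digit] = digit
M[("R", "-")] = ["-", "T", "R"]              # M[R, -] = -TR
M[("R", "$")] = []                           # M[R, $] = ε

def ll1_parse(tokens):
    stack = ["$", "E"]                # start symbol above the end marker
    tokens = list(tokens) + ["$"]
    i = 0
    while stack:
        x, a = stack.pop(), tokens[i]
        if x == a:                    # terminal (or $) on top: match, advance
            i += 1
        elif (x, a) in M:             # non-terminal: expand using M[x, a]
            stack.extend(reversed(M[(x, a)]))
        else:
            return False              # no table entry: syntax error
    return i == len(tokens)
```

For input 9-5-2, the parser repeatedly expands E, T, and R according to the table until both stack and input are exhausted.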
FIRST and FOLLOW
The construction of both top-down and bottom-up parsers are aided by two functions, FIRST
and FOLLOW, associated with a grammar G.
During top-down parsing, FIRST and FOLLOW allow us to choose which production to
apply, based on the next input symbol. During panic-mode error recovery, sets of tokens
produced by FOLLOW can be used as synchronizing tokens.
We need to build a FIRST set and a FOLLOW set for each symbol in the grammar. The
elements of FIRST and FOLLOW are terminal symbols.
o FIRST(α) is the set of terminal symbols that can begin any string derived from α.
FIRST
b) For each production X → y1y2…yk, place a in FIRST(X) if for some i, a ∈ FIRST(yi)
and ε ∈ FIRST(yj) for all 1 ≤ j < i. If ε ∈ FIRST(yj) for j = 1, …, k, then ε ∈ FIRST(X).
a- Add all non-ε symbols of FIRST(X1) to FIRST(y)
b- Add all non-ε symbols of FIRST(Xi), for i ≠ 1, to FIRST(y) if for all j < i, ε ∈ FIRST(Xj)
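The FIRST rules above can be implemented as a fixpoint iteration. The sketch below encodes a grammar as a dict from non-terminal to right-hand sides (each a list of symbols, with "ε" for the empty string); the encoding is an assumption for illustration, and FOLLOW can be computed analogously.

```python
def first_sets(grammar):
    """Compute FIRST for every non-terminal, iterating until no set grows."""
    first = {A: set() for A in grammar}

    def first_of(sym):
        # FIRST of a terminal is the terminal itself.
        return first[sym] if sym in grammar else {sym}

    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                if rhs == ["ε"]:
                    add = {"ε"}
                else:
                    add = set()
                    for y in rhs:          # rule b: scan while ε is derivable
                        add |= first_of(y) - {"ε"}
                        if "ε" not in first_of(y):
                            break
                    else:                  # every yi derives ε
                        add.add("ε")
                if not add <= first[A]:
                    first[A] |= add
                    changed = True
    return first
```

For a sample grammar A → BCD, B → bB | ε, C → Cg | g | Ch | i, D → AB | ε, this yields FIRST(B) = {b, ε}, FIRST(C) = {g, i}, FIRST(A) = {b, g, i}, and FIRST(D) = {b, g, i, ε}.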
FOLLOW
FOLLOW(A) = set of terminals that can appear immediately to the right of A in some
sentential form.
o Place $ in FOLLOW(A), where A is the start symbol.
o If there is a production B → αAβ, then everything in FIRST(β), except ε, should be added
to FOLLOW(A).
o If there is a production B → αA, or B → αAβ where ε ∈ FIRST(β), then all elements of
FOLLOW(B) should be added to FOLLOW(A).
Exercises:
A → BCD
B → bB | ε
C → Cg | g | Ch | i
D → AB | ε
Fill in the table below with the FIRST and FOLLOW sets for the non-terminals in this grammar:
Construction of predictive parsing table
o Input: Grammar G
Exercise:
1) Consider the following grammars G, Construct the predictive parsing table and parse the
input symbols:
S → [ SX ] | a
X → ε | +SY | Yb
Y → ε | -SXc
A – Find FIRST and FOLLOW sets for the non-terminals in this grammar.
LL (1) Grammars…
Exercises: 1) Consider the following grammar G:
A′ → A
A → xA | yA | y
Solution:
S → WAB | ABCS
A → B | WB
B → ε | yB
C → z
W → x
S → ScB | B
B → e | efg | efCg
C → SdC | S
Top-down parsers:
Start constructing the parse tree at the top (root) of the tree and move down towards the leaves.
Bottom-up parsers:
A bottom-up parser, or a shift-reduce parser, begins at the leaves and works up to the top of
the tree. The reduction steps trace a rightmost derivation in reverse.
We want to parse the input string abbcde. This parser is known as an LR Parser because it
scans the input from Left to right, and it constructs a rightmost derivation in reverse order.
S → aABe
A → Abc | b
B → d
At each step, we have to find α such that α is a substring of the sentence and replace α by A,
where A → α.
o By shifting zero or more input into the stack until the right side of the handle is on top
of the stack.
o This is repeated until the start symbol is in the stack and the input is empty, or until
error is detected.
o Reduce: the parser knows the right end of the handle is at the top of the stack. It
should then decide what non-terminal should replace that substring
o Accept: the parser announces successful completion of parsing
Grammars for which we can construct an LR (k) parsing table are called LR (k) grammars.
o Shift/reduce conflict: when we have a situation where the parser knows the entire
stack content and the next k symbols but cannot decide whether it should shift or
reduce (ambiguity).
o Reduce/reduce conflict: when the parser cannot decide which of the several
productions it should use for a reduction.
E → T
T → id
LR parser
o Si is a new symbol called state that summarizes the information contained in
the stack
o Xi is a grammar symbol
the parsing table which has two parts: ACTION and GOTO.
then consulting the entry ACTION[Sm , ai] in the parsing action table
The ACTION function takes as arguments a state i and a terminal a (or $, the input
endmarker).
o Shift j, where j is a state. The action taken by the parser shifts input a on the top of the
stack, but uses state j to represent a.
o Reduce A → β: the action of the parser reduces β on the top of the stack to the head A.
LR parser configuration
This configuration represents the right-sentential form
Behavior of LR parser
o the parsing table which has two parts: ACTION and GOTO.
o then consulting the entry ACTION[Sm , ai] in the parsing action table
1. If Action[Sm, ai] = shift S, the parser program shifts both the current input symbol ai and
state S on the top of the stack, entering the configuration
(S0 X1 S1 X2 S2 … Xm Sm ai S, ai+1 … an $)
2. Action[Sm, ai] = reduce A → β: the parser pops the first 2r symbols off the stack, where r =
|β| (at this point, Sm-r will be the state on top of the stack), entering the configuration,
o Then A and S are pushed on top of the stack, where S = goto[Sm-r, A]. The input buffer is
not modified.
3. Action[Sm, ai] = accept: parsing is successfully completed.
4. Action[Sm, ai] = error: parsing has discovered an error and calls an error recovery routine.
LR-parsing algorithm
let S be the state on top of the stack;
if ( ACTION[S, a] = shift t ) {
The following grammar can be parsed with the action and goto table shown below.
Example: The following example shows how a shift/reduce parser parses an input string w = id
* id + id using the parsing table shown above.
This method is the simplest of the three methods used to construct an LR parsing table. It is
called SLR (simple LR) because it is the easiest to implement. However, it is also the
weakest in terms of the number of grammars for which it succeeds. A parsing table
constructed by this method is called SLR table. A grammar for which an SLR table can be
constructed is said to be an SLR grammar.
LR (0) item
An LR (0) item (item for short) is a production of a grammar G with a dot at some
position of the right side.
A → .XYZ
A → X.YZ
A → XY.Z
A → XYZ.
A → .   (for the production A → ε)
An item indicates the part of a production that we have already seen and the part we hope
to see. The central idea in the SLR method is to construct, from the grammar, a
deterministic finite automaton to recognize viable prefixes.
A viable prefix is a prefix of a right sentential form that can appear on the stack of a
shift/reduce parser.
o If you have a viable prefix in the stack it is possible to have inputs that will reduce
to the start symbol.
o If you don't have a viable prefix on top of the stack, you can never reach the start
symbol; therefore you have to call the error recovery procedure.
If I is a set of items of G, then Closure(I) is the set of items constructed by two rules:
o This rule is applied until no more new items can be added to Closure(I).
Example G1′:
E′ → E
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
I = {[E′ → .E]}
The second useful function is Goto(I, X), where I is a set of items and X is a grammar
symbol. Goto(I, X) is defined as the closure of the set of all items [A → αX.β] such that
[A → α.Xβ] is in I.
Example:
Below is an algorithm to construct C, the canonical collection of sets of LR(0) items
for the augmented grammar G′.
Begin
Repeat
For each set of items I in C and each grammar symbol X such that Goto(I, X) is not empty and
not in C do
End
Example: Construction of the sets of items for the augmented grammar G1′ above.
I6 = Goto(I1, +) = {[E → E + .T], [T → .T * F], [T → .F],
o [F → .(E)], [F → .id]}
o [F → .id]}
o Goto(I4, T) = {[E → T.], [T → T. * F]} = I2;
o Goto(I4, F) = {[T → F.]} = I3;
o Goto(I4, () = I4;
o Goto(I4, id) = I5;
LR (0) automation
1. Construct C = {I0, I1, …, IN}, the collection of the sets of LR(0) items for G′.
o If no conflicting action is created by rules 1 and 2, the grammar is SLR(1); otherwise it is not.
4. All entries of the parsing table not defined by rules 2 and 3 are made "error".
5. The initial state is the one constructed from the set of items containing [S′ → .S].
Example: Construct the SLR parsing table for the grammar G1′.
E′ → E
1 E → E + T
2 E → T
3 T → T * F
4 T → F
5 F → (E)
6 F → id
Legend: Si means shift to state i; Rj means reduce by production j.
Exercise: Construct the SLR parsing table for the following grammar: /* Grammar G2′ */
S′ → S
S → L = R
S → R
L → *R
L → id
R → L
Answer
C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9}
o [L → .id], [R → .L]}
o [R → .L]}
o [L → .id]}
o goto(I4, *) = I4
o goto(I4, id) = I5
o goto(I6, L) = I8
o goto(I6, *) = I4
o goto(I6, id) = I5
Follow(S) = {$}; Follow(R) = {$, =}; Follow(L) = {$, =}. We have a shift/reduce conflict,
since = is in Follow(R), R → L. is in I2, and Goto(I2, =) = I6. Every SLR(1)
grammar is unambiguous, but there are many unambiguous grammars that are not
SLR(1).
G2′ is not an ambiguous grammar. However, it is not SLR. This is because the SLR
parser is not powerful enough to remember enough left context to decide whether to shift
or reduce when it sees an =.
Exercise
(1) S → A
(2) S → B
(3) A → a A b
(4) A → 0
(5) B → a B b b
(6) B → 1
– calls the lexical analyzer to collect tokens from the input stream. Tokens are organized using
grammar rules. When a rule is recognized, its action is executed.
Note:
o lex tokenizes the input and yacc parses the tokens, taking the right actions, in context.
Yacc…
1) Generate a parser from Yacc by running Yacc over the grammar file.
o Write the grammar in a .y file (also specify the actions here that are to be
taken in C).
o Write a lexical analyzer to process input and pass tokens to the parser. This
can be done using Lex.
3) Compile code produced by Yacc as well as any other relevant source files.
4) Link the object files to appropriate libraries for the executable parser.
Review Exercise
S -> AaAb | BbBa
A -> ε
B -> ε
S -> SA | A
A -> a
CHAPTER 4
Syntax-Directed Translation
Introduction
Grammar symbols are associated with attributes to associate information with the
programming language constructs that they represent. Values of these attributes are
evaluated by the semantic rules associated with the production rules.
Syntax-Directed Definitions
Translation Schemes
Syntax-Directed Definitions:
o We associate a production rule with a set of semantic actions, and we do not say
when they will be evaluated.
Translation Schemes:
o indicate the order of evaluation of semantic actions associated with a production rule.
Syntax-Directed Definitions
o Each grammar symbol is associated with a set of attributes. This set of attributes for
a grammar symbol is partitioned into two subsets called synthesized and inherited
attributes of that grammar symbol. Each production rule is associated with a set of
semantic rules.
o The value of a synthesized attribute is computed from the values of attributes at the
children of that node in the parse tree.
o The value of an inherited attribute is computed from the values of attributes at the
siblings and parent of that node in the parse tree.
A depth-first traversal algorithm traverses the parse tree thereby executing semantic rules to
assign attribute values. After the traversal is completed the attributes contain the translated
form of the input.
OR
b is an inherited attribute of one of the grammar symbols in α (on the right side of the
production A → α), and c1, c2, …, cn are attributes of the grammar symbols of the production.
For example, in A → C the rule C.c = A.b defines an inherited attribute C.c, computed from
the attribute b of the parent A.
Annotated Parse Tree
A parse tree showing the values of attributes at each node is called an annotated parse tree.
The process of computing the attribute values at the nodes is called annotating (or
decorating) the parse tree. Of course, the order of these computations depends on the
dependency graph induced by the semantic rules.
procedure visit(n: node);
begin
    for each child m of n, from left to right, do
        visit(m);
    evaluate the semantic rules at node n
end
Example 4.1: Synthesized Attributed grammar that calculate the value of expression
L → E n    print(E.val)
It specifies a simple calculator that reads an input line containing an arithmetic expression
involving digits, parentheses, and the operators + and *, followed by a newline character n,
and prints the value of the expression.
Example 4.2: Synthesized Attributed grammar that calculate the value of expression
L → E n    print(E.val)
Symbols E, T, and F are associated with a synthesized attribute val. The token digit has a
synthesized attribute lexval (it is assumed that it is evaluated by the lexical analyzer).
a) Given the expression 5+3*4 followed by new line, the program prints 17.
c) Draw the annotated parse tree for input: 5*3+4n
a) (3+4) * (5+6)n
b) 7*5*9*(4+5)n
c) (9+8*(7+6)+5)*4n
Dependency Graph
L → id addtype(id.entry,L.inh)
SDD based on a grammar suitable for top-down parsing
T → F T'        T'.inh = F.val ;  T.val = T'.syn
T' → * F T1'    T1'.inh = T'.inh × F.val ;  T'.syn = T1'.syn
T' → ε          T'.syn = T'.inh
The SDD above computes terms like 3 * 5 and 3 * 5 * 7. Each of the non-terminals T and
F has a synthesized attribute val; The terminal digit has a synthesized attribute lexval.
Exercises
Production Semantic Rules
L1.l = L2.l + 1
L1.l = 1
B→0 B.v = 0
B→1 B.v = 1
Draw the decorated parse tree and draw the dependency graph for input:
a) 1011.01
b) 11.1
c) 1001.001
Evaluation Order
A topological sort of a directed acyclic graph (DAG) is any ordering m1, m2, …, mn of
the nodes of the graph such that, if mi → mj is an edge, then mi appears before mj in the ordering.
Any topological sort of a dependency graph gives a valid evaluation order of the semantic
rules.
S-Attributed Definitions
Syntax-directed definitions are used to specify syntax-directed translations that guarantee
an evaluation order. We would like to evaluate the semantic rules during parsing (i.e. in a
single pass, we will parse and we will also evaluate semantic rules during the parsing).
We will look at two sub-classes of the syntax-directed definitions:
These classes of SDD can be implemented efficiently in connection with top-down and
bottom-up parsing.
S-Attributed Definitions
A bottom-up parser evaluates an S-attributed definition in the order of a depth-first
(postorder) traversal. A parallel stack is maintained to store the values of the attributes,
as in the following example. Yacc/Bison only support S-attributed definitions.
(The body of the program rule below is reconstructed from the standard Yacc calculator example.)
%{
#include <stdio.h>
%}
%token INTEGER
%%
program:
        program expr '\n'   { printf("%d\n", $2); }
        | /* empty */
        ;
expr:
        INTEGER             { $$ = $1; }
        | expr '-' expr     { $$ = $1 - $3; }
        ;
%%
o When an entry of the parser stack holds a grammar symbol X (terminal or non-
terminal), the corresponding entry in the parallel stack will hold the synthesized
attribute(s) of the symbol X.
L → E n    print(val[top-1])
T→F
F → digit
At each shift of digit, we also push digit.lexval into val-stack. At all other shifts, we do
not put anything into val-stack because other terminals do not have attributes (but we
increment the stack pointer for val-stack).
Canonical LR(0) Collection for The Grammar
CHAPTER 5
Type checking
Introduction
The compiler must check that the source program follows both the syntactic and semantic
conventions of the source language.
Semantic Checks
This checking is called static checking (to distinguish it from dynamic checking, performed
during execution of the target program). Static checking ensures that certain kinds of
errors will be detected and reported.
Static checking finds semantic errors
o inappropriate instruction
Most such checks can be done using one or two traversals of (part of) the parse tree
Memory layout
In C, char, short, int, long, float, double usually have different sizes
Need to allocate different amounts of memory for different types
Choice of instructions
o Do operators match their operands? Do types of variables match the values assigned
to them? Do function parameters match the function declarations? Have called
function and variable names been declared?
Not all languages can be completely type checked, but all compiled languages must be at
least partially type checked.
Type checking can be done bottom-up using the parse tree. For convenience, we may create
one or more pseudo-types for error-handling purposes.
Static checking
o Type checks
o Flow-of-control checks
o Uniqueness checks…
Type checks: an operator applied to incompatible operands is reported. For example:
int a, c[10], d;
d = c + d;   /* error: + applied to an array and an integer */
o Statements that cause flow of control to leave a construct must have some place to which
to transfer the flow of control.
o Example: a break statement in C causes control to leave the smallest enclosing while, for,
or switch statement.
for (i = 0; i < attempts; i++) {
    cin >> password;
    if (verify(password))
        break;  // OK
    cout << "incorrect\n";
}
Uniqueness check:
One-Pass versus Multi-Pass Static Checking
One-pass compiler: static checking in C, Pascal, Fortran, and many other languages is
performed in one pass while intermediate code is generated
Multi-pass compiler: static checking in Ada, Java, and C# is performed in a separate phase,
sometimes by traversing a syntax tree multiple times; a separate type-checking pass sits
between parsing and intermediate code generation.
In this chapter, we focus on type checking. A type checker verifies that the type of a
construct matches the type expected by its context.
For example:
o The type checker should verify that the type of the value assigned to a variable is
compatible with the type of the variable.
o Dereferencing is applied only to a pointer.
Type systems
A type system is a collection of rules for assigning type expressions to the parts of a
program. A type checker implements a type system. A sound type system eliminates run-
time type checking for type errors.
o In practice, some type checking is done at run time (so most programming
languages are not strongly typed).
o Example: given int x[100]; … x[i], most compilers cannot guarantee that i will be
between 0 and 99
Type expressions
o a basic type or
o A special basic type, type_error, will signal an error during type checking.
o Arrays: if I is an index set and T is a type expression, then array(I, T) is a TE.
o Pointers: if T is a TE then pointer (T) is a TE. Denotes the type “pointer to an object of
type T.”
int *a;
For example:
o The mod function has domain type int x int (a pair of integers) and range type int;
thus mod has the type expression int x int → int
The TE corresponding to the Pascal declaration:
For example, the type expression corresponding to the above function declaration can
be represented with the tree shown below:
o A basic type
void : no type
o A type name
o Synthesizes the type of each expression from the types of its sub expressions.
o arrays,
o pointers,
o statements, and
o Functions.
P → D ; E | D ; S
D → D ; D | id : T
E → true | false | literal | num | id | E mod E | E [ E ] | E ^
    | E = E | E + E
S → id := E | if E then S | while E do S | S ; S
key : integer;
The language has three basic types: boolean, char and integer. Type_error is used to signal
errors, and void is used to check statements. All arrays start at 1. For example, array [1..256]
of char has the type expression array(1..256, char), consisting of the constructor array
applied to the subrange 1..256 and the type char.
The prefix operator ^ in declarations builds a pointer type, so ^ integer leads to the TE
pointer(integer), consisting of the constructor pointer applied to the type integer.
P → D ; E | D ; S
D → D ; D
o add the type expression in the symbol table entry corresponding to the variable
identifier.
integer
else type_error}
else type_error}
else type_error}
else E.type = type_error }
else type_error}
Translation scheme for type checking of Statements:
else type_error }
S1.type
else type_error}
S1.type
else type_error}
then void
else type_error}
Exercises:
1) For the translation scheme of a simple type checker presented above, draw the
decorated parse tree for:
a) A: array [1..10] of ^ array [1..5] of char
b) A: array [1..10] of ^ array [1..5] of char; B: char
Solutions:
1) a)
1) b)
2) For the translation scheme of a simple type checker presented above, draw the
decorated parse tree for:
b: integer;
c: char;
b=a[1];
if b = a[10] then
b = b + 1;
c = a[5];
CHAPTER 6
Intermediate Representations
In a compiler, the front end translates source program into an intermediate representation,
and the back end generates the target code from this intermediate representation. The use
of a machine independent intermediate code (IC) is:
Type checking may be done in a separate pass (multi-pass), or IC generation and type
checking can be done at the same time (one-pass).
Decisions in IR design affect the speed and efficiency of the compiler. Some important
IR properties:
o Ease of generation
o Ease of manipulation
o Procedure size
o Level of abstraction
Intermediate language can be many different languages, and the designer of the compiler
decides this intermediate language.
o Postfix notation can be used as an intermediate language.
o Three-address code can also be used: three-address statements are close to
machine instructions, but they are not actual machine instructions.
o Structural
Graphically oriented
Heavily used in source-to-source translators
Tend to be large
Examples: Trees, DAG
o Linear
o Hybrid
Intermediate languages
Syntax tree
While parsing the input, a syntax tree can be constructed. A syntax tree (abstract syntax
tree) is a condensed form of the parse tree, useful for representing language constructs. For example,
for the string a+b, the parse tree in (a) below can be represented by the syntax tree shown
in (b); the keywords (syntactic sugar) that existed in the parse tree will no longer exist in
the syntax tree.
E → ( E1 )    E.nptr := E1.nptr
Abstract Syntax Trees versus DAGs
Postfix notation
Example:
Three-Address Code
A three-address code statement has the form x := y op z, where x, y and z are names,
constants or compiler-generated temporaries, and op is any operator. We may also use the
notation op y,z,x (a better notation, because it looks like a machine-code instruction).
We use the term “three-address code” because each statement usually contains three
addresses (two for operands, one for the result).
In three-address code:
o Only one operator at the right side of the assignment is possible, i.e. x + y * z is not
possible
o It has been given the name three-address code because such an instruction usually
contains three addresses (the two operands and the result)
t1 = y * z
t2 = x + t1
Three-Address Statements
Binary Operator:
op y,z,result or result := y op z
Where op is a binary arithmetic or logical operator. This binary operator is applied to y and z,
and the result of the operation is stored in result.
mul a,b,c
addr a,b,c
addi a,b,c
Unary Operator:
Where op is a unary arithmetic or logical operator. This unary operator is applied to y, and
the result of the operation is stored in result.
not a,,c
inttoreal a,,c
movi a,,c
movr a,,c
Unconditional Jumps:
We will jump to the three-address code with the label L, and the execution continues from
that statement.
Conditional Jumps:
We will jump to the three-address code with the label L if the result of y relop z is true, and
the execution continues from that statement. If the result is false, the execution continues
from the statement following this conditional jump statement.
Procedure Parameters:
Procedure Calls: call p,n, which invokes the procedure p with n parameters. Each actual
parameter x is first passed with a param x statement:

p(x1,...,xn)        param x1,,
                    param x2,,
                    ...
                    param xn,,
                    call p,n,
f(x+1,y) add x,1,t1
param t1,,
param y,,
call f,2,
Indexed and Pointer Assignments:
movecont y,,x or x := *y (the contents of the location pointed to by y are copied into x)
o Assignment statements: x := y op z, x := op y
o Copy statements: x := y
return y
o the semantic actions have side effects that write the three-address code statements
in a file.
When the three-address code is generated, it is often necessary to use temporary variables
and temporary names. The following functions are used to generate 3-address code:
newtemp() - each time this function is called, it gives distinct names that can be used for
temporary variables.
newlabel() - each time this function is called, it gives distinct names that can be used for
label names.
gen() to generate a single three address statement given the necessary information.
gen will produce a three-address code after concatenating all the parameters.
gen(id1.lexeme, ':=', id2.lexeme, '+', id3.lexeme) will produce the three-
address code: x := y + z
Note: variables and attribute values are evaluated by gen before being concatenated with
the other parameters.
Use attributes:
E.code: holds the three-address code statements that evaluate E (this is the
'translation' attribute).
|(E)
| id
| num
Implementation of Three-Address Statements
The description of three-address instructions specifies the components of each type of
instruction. However, it does not specify the representation of these instructions in a data
structure. In a compiler, these statements can be implemented as objects or as records with
fields for the operator and the operands.
o Quadruples
o Triples and
o Indirect triples
Quadruples: A quadruple (or just "quad") has four fields, which we call op, arg1, arg2, and result.
Triples: A triple has only three fields, which we call op, arg1, and arg2.
Indirect Triples: consists of a listing of pointers to triples, rather than a listing of triples
themselves.
The benefit of Quadruples over Triples can be seen in an optimizing compiler, where
instructions are often moved around. With quadruples, if we move an instruction that
computes a temporary t, then the instructions that use t require no change.
With triples, the result of an operation is referred to by its position, so moving an instruction
may require changing all references to that result. This problem does not occur with indirect
triples.
Implementation of Three-Address Statements: Triples
Major tradeoff between quads and triples is compactness versus ease of manipulation
Exercises
b) Quadruples.
c) Triples.
E → E1 + E2    E.place := newtemp();
E → E1 * E2    E.place := newtemp();
E → - E1       E.place := newtemp();
               E.code := E1.code
E → id         E.place := id.lexeme
S.after = newlabel();
gen('goto' S.begin) ||
gen(S.after ':')
S → if E then S1 else S2    S.else = newlabel();
S.after = newlabel();
S.code = E.code ||
S1.code ||
gen('goto' S.after) ||
gen(S.after ':')
E → E1 < E2    E.place = newtemp();
S.after = newlabel();
gen(S.after ':')
S.after = newlabel();
S.code = E.code ||
gen(S.after ':')
Exercises:
1) Draw the decorated parse tree and generate three-address code by using the translation
schemes given:
a) A := B + C
b) A := C * ( B + D)
c) while a < b do a := (a + b) * c
d) while a < b do a := a + b
e) a := b * -c + b * -c
Solutions for:
Three address code of A := B + C
Code Generation
Introduction
Position of a Code Generator
The final phase in our compiler model is the code generator. It takes as input the intermediate
representation (IR) produced by the front end of the compiler, along with relevant symbol
table information, and produces as output a semantically equivalent target program.
o Preserving the semantic meaning of the source program and being of high quality
o instruction selection,
o instruction ordering
The most important criterion for a code generator is that it should produce correct code.
o Instruction Selection
o Register Allocation
o the intermediate representation of the source program produced by the frontend along
with
o information in the symbol table that is used to determine the run-time address of the
data objects denoted by the names in the IR.
Assumptions
o Front end has scanned, parsed and translated into relatively lower level IR
The most common target-machine architectures are RISC, CISC, and stack based.
o In a stack-based machine, operations are done by pushing operands onto a stack and
then performing the operations on the operands at the top of the stack.
In this chapter
Instruction Selection
The code generator must map the IR program into a code sequence that can be executed by
the target machine. The complexity of the mapping is determined by factors such as:
If the IR is high level, the code generator may use code templates to translate each IR
statement into a sequence of machine instructions.
If the IR reflects some of the low-level details of the underlying machine, then the code
generator can use this information to produce more efficient code sequences. The nature of
the instruction set of the target machine has a strong effect on the difficulty of instruction
selection. For example,
o The uniformity and completeness of the instruction set are important factors.
A straightforward translation is not always the best one; it may lead to unacceptably
inefficient target code.
Register Allocation
Efficient and careful management of registers results in a faster program. A key problem
in code generation is deciding what values to hold in what registers.
Example:
The order in which computations are performed can affect the efficiency of the target code.
Some computation orders require fewer registers to hold intermediate results than others.
However, selecting the best evaluation order is a computationally difficult problem. When
instructions are independent, their evaluation order can be changed.
o Load operations
o Store operations
o Unconditional jumps
o Conditional jumps
Load operations
The instruction LD dst, addr loads the value in location addr into location dst. This
instruction denotes the assignment dst = addr. The most common form of this instruction is
LD r, x which loads the value in location x into register r. An instruction of the form LD r1,
r2 is a register-to-register copy in which the contents of register r2 are copied into register r1.
Store operations
The instruction ST x, r stores the value in register r into the location x. This instruction
denotes the assignment x = r.
Computation operations
Has the form OP dst, src1, src2, where OP is an operator like ADD or SUB, and dst, src1,
src2 are locations, not necessarily distinct.
The effect of this machine instruction is to apply the operation represented by OP to the
values in locations src1 and src2, and place the result of this operation in location dst.
For example, SUB r1, r2, r3 computes r1 = r2 – r3; any value formerly stored in r1 is lost, but
if r1 is r2 or r3, the old value is read first. Unary operators, which take only one operand, do
not have a src2.
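Putting the load, store and computation forms together, the three-address statement x = y - z might be translated as follows (the register choices are illustrative):

```
LD  R1, y        // R1 = y
LD  R2, z        // R2 = z
SUB R1, R1, R2   // R1 = R1 - R2
ST  x, R1        // x = R1
```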
Unconditional Jumps
The instruction BR L causes control to branch to the machine instruction with label L. (BR
stands for branch)
Conditional Jumps
Has the form Bcond r, L, where: r is a register, L is a label, and cond is any of the common
tests on values in the register r.
For example: BLTZ r, L causes a jump to label L if the value in register r is less than zero,
and allows control to pass to the next machine instruction if not.
Indirect addressing mode: LD R1, *100(R2) loads into R1 the value in the memory location
stored in the memory location obtained by adding 100 to the contents of register R2.
Immediate constant addressing mode: the constant is prefixed by #. The instruction
LD R1, #100 loads the integer 100 into register R1, and ADD R1, R1, #100 adds the
integer 100 to register R1:
R1 = R1 + 100
Example:
o The cost of an instruction = one + the costs associated with the addressing modes
of the operands.
Examples