m433 - Compiler Theory - Dr. Abdelbaqi
Introduction to Compilers
Why Learn About Compilers?
Few people will ever be required to write a compiler for a general-purpose language like C or Java. So why do most computer science institutions offer compiler courses and often make them mandatory? Some typical reasons are:
(a) It is considered a topic that you should know in order to be “well-
cultured” in computer science.
(b) A good craftsman should know his tools, and compilers are
important tools for programmers and computer scientists.
(c) The techniques used for constructing a compiler are useful for other
purposes as well.
(d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language.
Compilers
• Compilation is the translation of a program written in a source
language into an equivalent program written in a target language.
• A compiler is a program that can read a program in one language,
the source language, and translate it into an equivalent program in
another language, the target language.
• An important role of the compiler is to report any errors in the
source program that it detects during the translation process.
• If the target program is an executable machine-language program, it
can then be called by the user to process inputs and produce outputs.
[Diagram: the compiler reads the source program and produces the target program plus error messages; the target program then processes input to produce output.]
Interpreters
• An interpreter is another common kind of language processor.
Instead of producing a target program as a translation, an interpreter
appears to directly execute the operations specified in the source
program on inputs supplied by the user.
• The machine-language target program produced by a compiler is
usually much faster than an interpreter at mapping inputs to outputs.
• An interpreter, however, can usually give better error diagnostics
than a compiler, because it executes the source program statement by
statement.
[Diagram: the interpreter executes the source program directly on the user's input, producing output and error messages.]
A language-processing system
Preprocessor: collects the modules of a source program, which may be divided into and stored in separate files.
The Structure of a Compiler (1)
• Any compiler must perform two major tasks
– Analysis of the source program (Front end)
– Synthesis of a machine-language program (Back end)
The Structure of a Compiler (2)
• The front end translates a source program into a machine-independent intermediate code; the back end then uses this intermediate code to generate the target code.
• Analysis part (front end)
– It breaks up the source program into constituent pieces and imposes a grammatical structure on them.
– It detects whether the source program is syntactically ill-formed or semantically unsound.
– It collects information about the source program and stores it in a symbol table, which is passed along with the intermediate representation to the synthesis part.
• Synthesis part (back end)
– It takes the tree structure and translates the operations therein into the target program.
The Structure of a Compiler (3)
[Diagram: the phases of a compiler. The source program (a character stream) passes through the Lexical Analyzer (Scanner) -> tokens -> Syntax Analyzer (Parser) -> syntax tree -> Semantic Analyzer -> syntax tree -> Intermediate Code Generator -> intermediate representation -> Optimizer -> intermediate representation -> Code Generator -> target machine code. Symbol and attribute tables are used by all phases of the compiler.]
1- Lexical Analysis (1)
Scanner
➢ The scanner begins the analysis of the source program by reading the input, character by character, and grouping characters into individual words and symbols (tokens).
Formalisms used to specify and implement scanners:
RE (Regular Expression)
NFA (Non-deterministic Finite Automaton)
DFA (Deterministic Finite Automaton)
[Diagram: the compiler pipeline with the Scanner (Lexical Analyzer) stage highlighted, consuming the character stream and emitting tokens.]
1- Lexical Analysis (2)
▪ Lexical analysis attempts to isolate the “words” in an input string.
▪ A word, known as a lexeme or a lexical item, is a string of input characters that is passed on to the next phase of compilation.
▪ When the lexical analyzer encounters whitespace, an operator symbol, or a special symbol, it decides that a word is complete.
▪ For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value> that it passes on to the subsequent phase, syntax analysis.
The scanner does the following:
• It puts the program into a compact and uniform format (tokens).
• It eliminates unneeded information (such as comments).
• It sometimes enters initial information into symbol tables (for example, to register the presence of a particular label or identifier).
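As a minimal sketch of this grouping step, the following C program reads an input string character by character and emits <token-name, lexeme> pairs. The token classes and the input string are illustrative assumptions, not the course's reference scanner:

#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *p = "count = count + 12;";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip whitespace */
        if (isalpha((unsigned char)*p)) {                    /* identifier */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("<id, \"%.*s\">\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {             /* number */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("<num, %.*s>\n", (int)(p - start), start);
        } else {                                             /* operator or punctuation */
            printf("<'%c'>\n", *p++);
        }
    }
    return 0;
}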
Examples of tokens (1) and (2)
[Two slides of token-example tables; the table under “Examples of tokens” later in this chapter shows the same kind of data.]
2- Syntax Analysis (structure) (1)
Parser
➢ The parser reads tokens and groups them into units as specified by the productions of the CFG being used.
Formalisms used to specify parsers:
CFG (Context-Free Grammar)
BNF (Backus-Naur Form)
[Diagram: the compiler pipeline with the Syntax Analyzer (Parser) stage highlighted, consuming tokens and emitting a syntax tree.]
2- Syntax Analysis (2)
Syntax Analysis (parsing)
• The parser uses the first components of the tokens produced by the
scanner to create a syntax tree that depicts the grammatical structure
of the token stream. In a syntax tree, each interior node represents
an operation and the children of the node represent their arguments.
• Construction of a syntax tree is a basic activity in compiler writing.
14
3- Semantic Analysis (meaning) (1)
Semantic Analyzer
➢ Semantic analysis is the discovery of meaning in a program.
➢ It performs two functions:
◼ Check the static semantics of each construct.
◼ Do the actual translation.
➢ It is the heart of a compiler.
Techniques used: Syntax-Directed Translation, Semantic Processing Techniques, IR (Intermediate Representation).
[Diagram: the compiler pipeline with the Semantic Analyzer stage highlighted, consuming the syntax tree.]
4- Intermediate Code Generator (1)
Intermediate Code Generator
– The intermediate code should be machine independent.
– It should be easy to generate.
– It should be easily translatable into the target program.
[Diagram: the compiler pipeline with the Intermediate Code Generator stage highlighted.]
4- Intermediate Code Generator (3)
• The output of the intermediate code generator is three-address code (TAC): a sequence of assembly-like instructions with three operands per instruction.
• TAC is a linearized representation of a syntax tree in which explicit names correspond to the interior nodes of the graph.
[Diagram: the Intermediate Code Generator produces non-optimized intermediate code.]
4- Intermediate Code Generator (4)
• Each statement has the general form z = x op y, where x, y, and z are variables, constants, or temporary variables generated by the compiler, and op represents any operator.
• Each three-address assignment instruction has at most one operator on the right side. Some three-address instructions have fewer than three operands.
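For example (a worked sketch using compiler-generated temporaries t1 and t2), the assignment a = b + c * d could be translated into:
t1 = c * d
t2 = b + t1
a = t2
Each instruction has at most one operator on its right side, and the explicit names t1 and t2 correspond to the interior * and + nodes of the syntax tree.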
5- Code Optimization (1)
Optimizer
➢ The intermediate code generated by the semantic routines is analyzed and transformed into functionally equivalent but improved intermediate code.
[Diagram: the compiler pipeline with the Code Optimizer stage highlighted.]
5- Code Optimization (3)
Common optimizations include:
1. Removing redundant identifiers (constant propagation)
Suppose a variable x always has the same value c at some statement in the program. Then x can be replaced by the value c in that statement. For example, suppose that at each execution of the assignment b = a * a - 7, variable a has the value 4. Replacing both occurrences of a by 4 gives the expression 4 * 4 - 7, whose value can be evaluated at compile time.
2. Removing unreachable sections of code
For example, in the following program segment, the statement stmt2 can never be executed. It is unreachable and can be eliminated from the object program:
    stmt1
    go to label1
    stmt2
label1: stmt3
5- Code Optimization (4)
3. Loop invariant
A computation is loop invariant if it depends only on variables that do not change their value during the execution of the loop. Once moved out of the loop, such a computation is executed only once instead of on each iteration. For example:
for (i = 1; i <= 100000; i++) {
    x = sqrt(y);            /* square root function */
    printf("%f\n", x + i);
}
The assignment to x need not be inside the loop, since y does not change as the loop repeats (it is a loop invariant). In the optimization phase, the compiler would move the assignment to x out of the loop in the object program:
x = sqrt(y);                /* loop invariant */
for (i = 1; i <= 100000; i++)
    printf("%f\n", x + i);
This eliminates 99,999 unnecessary calls to the sqrt function at run time.
6- Code Generation (1)
• The code generator converts the intermediate representation of the source code into a form that can be readily executed by the machine.
• The code generator takes as input an intermediate representation of the source program and maps it into the target language.
• If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions that perform the same task.
[Diagram: the compiler pipeline with the Code Generator stage highlighted, producing target machine code.]
6- Code Generation (2)
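A reconstruction of the instruction listing being described, following the classic textbook code for the statement id1 = id2 + id3 * 60.0 (the exact mnemonics are an assumption):
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1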
The first operand of each instruction specifies a destination. The F in each instruction tells us that it deals with floating-point numbers. The code loads the contents of address id3 into register R2, then multiplies it by the floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1, and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address of id1.
[Diagram: the Code Generator turns optimized intermediate code into target code.]
Symbol-Table Management
• An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name.
• These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used), and, in the case of method names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned.
• The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name.
• The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
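A minimal sketch of such a record in C, assuming a flat table and linear search (field names and sizes are illustrative; a real compiler would use a hash table and scope-aware lookup):

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];     /* the lexeme */
    char type[16];     /* e.g. "int", "float" */
    int  scope;        /* nesting depth where declared */
    int  offset;       /* storage location assigned to the name */
};

static struct symbol table[256];
static int nsyms;

static struct symbol *lookup(const char *name) {
    for (int i = nsyms - 1; i >= 0; i--)        /* most recent entry first */
        if (strcmp(table[i].name, name) == 0) return &table[i];
    return NULL;
}

static struct symbol *insert(const char *name, const char *type, int scope) {
    struct symbol *s = &table[nsyms++];
    strcpy(s->name, name); strcpy(s->type, type);
    s->scope = scope; s->offset = 4 * (nsyms - 1);
    return s;
}

int main(void) {
    insert("position", "float", 0);
    insert("rate", "float", 0);
    struct symbol *s = lookup("rate");
    if (s) printf("%s: %s, scope %d, offset %d\n", s->name, s->type, s->scope, s->offset);
    return 0;
}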
Chapter 3
Lexical Analysis
Lexical Analysis
• The main task of the lexical analyzer is to read the input characters
of the source program, group them into lexemes, and produce as
output a sequence of tokens for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• The lexical analyzer interacts with the symbol table as well. When
the lexical analyzer discovers a lexeme constituting an identifier, it
needs to enter that lexeme into the symbol table.
• The lexical analyzer removes comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
Lexical Analysis
• Lexical analyzers are divided into a cascade of two processes:
– a) Scanning consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and
compaction of consecutive whitespace characters into one.
– b) Lexical analysis proper is the more complex portion, which
produces tokens from the output of the scanner.
• Some languages have only a few kinds of tokens, of fairly simple form. Other languages are more complex. C, for example, has almost 100 kinds of tokens, including 37 keywords (double, if, return, struct, etc.); identifiers (my_variable, printf, etc.); integer, floating-point (6.02e2), and character (’x’, ’\’’) constants; string literals ("hello", "say \"hi\"\n"); 54 “punctuators” (+, ], ->, *=, :, ||, etc.); and two different forms of comments.
Attributes of Tokens
Example: the lexical analyzer converts the statement y := 31 + 28*x into the token stream
<id, “y”> <assign> <num, 31> <‘+’> <num, 28> <‘*’> <id, “x”>
The parser consumes one token at a time (the lookahead token) together with its attribute (tokenval).
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units. It is a pair consisting of a
token name and an optional attribute value. The token name is an
abstract symbol representing a kind of lexical unit, e.g., a particular
keyword, or a sequence of input characters denoting an identifier.
– For example: id and num
• A lexeme is a specific character string that makes up a token; it is identified by the lexical analyzer as an instance of that token.
– For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other
tokens, the pattern is a more complex structure that is matched by
many strings.
– For example: “letter followed by letters and digits” and “non-empty sequence of digits”.
Examples of tokens
Token        Informal description                     Sample lexemes
if           characters i, f                          if
else         characters e, l, s, e                    else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14159, 0, 6.02e23
literal      anything but “, surrounded by “          “core dumped”

• Ex 1: printf("Total = %d\n", score);
Solution: 7 tokens
<id, "printf"> <(> <literal, "Total = %d\n"> <,> <id, "score"> <)> <;>
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. Thus, in many cases the lexical analyzer returns to the parser not only a token name but also an attribute value that describes the lexeme represented by the token; the token name affects parsing decisions, while the attribute value affects the translation of tokens after the parse.
• The most important example is the token id, where we need to associate with the token a great deal of information. Normally, information about an identifier, e.g., its lexeme, its type, and the location at which it is first found, is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Example of Attributes for Tokens
• The token names and associated attribute values for the Fortran
statement: E = M * C ** 2
– <id, pointer to symbol-table entry for E>
– <assign op>
– <id, pointer to symbol-table entry for M>
– <mult op>
– <id, pointer to symbol-table entry for C>
– <exp op>
– <number, integer value 2>
• Note that in certain pairs, especially operators, punctuation, and
keywords, there is no need for an attribute value. In this example,
the token number has been given an integer-valued attribute.
Reading Ahead
• A lexical analyzer may need to read ahead some characters before it
can decide on the token to be returned to the parser.
• For example, a lexical analyzer for Java must read ahead after it sees
the character >. If the next character is =, then > is part of the
character sequence >=, the lexeme for the token for the “greater than
or equal to" operator. Otherwise > itself forms the “greater than"
operator, and the lexical analyzer has read one character too many.
• A general approach to reading ahead on the input is to maintain an input buffer from which the lexical analyzer can read and push back characters.
• The lexical analyzer reads ahead only when it must. An operator like
* can be identified without reading ahead. In such cases, the input
buffer is set to a blank, which will be skipped when the lexical
analyzer is called to find the next token.
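A minimal sketch of such an input buffer in C, with a one-character pushback used to decide between > and >= (names and buffer contents are illustrative assumptions):

#include <stdio.h>

static const char *buf = "a >= b";
static int pos;

static int next_char(void)  { return buf[pos] ? buf[pos++] : EOF; }
static void push_back(void) { if (pos > 0) pos--; }   /* retract one character */

int main(void) {
    int c;
    while ((c = next_char()) != EOF) {
        if (c == '>') {
            int d = next_char();                      /* read ahead one character */
            if (d == '=') printf("<relop, GE>\n");
            else { if (d != EOF) push_back(); printf("<relop, GT>\n"); }
        }
    }
    return 0;
}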
Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero or more
symbols from the end of s. For example, ban, banana, and ɛ are
prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more
symbols from the beginning of s. For example, nana, banana, and ɛ
are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix
from s. For instance, banana, nan, and ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ɛ and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
Specification of Tokens
• Regular expressions are an important notation for specifying
lexeme patterns. While they cannot express all possible patterns, they
are very effective in specifying those types of patterns that we
actually need for tokens.
Notational Shorthand
• The following shorthands are often used:
r+ = rr*
r? = r | ɛ
[a-z] = a | b | c | … | z
• Examples:
• letter → [A-Za-z]
• digit → [0-9]
• num → digit+ (. digit+)? ( E (+|-)? digit+ )?
• [abcd] means (a | b | c | d)
• [b-g] means [bcdefg]
• [b-gM-Qkr] means [bcdefgMNOPQkr]
• M? means (M | ɛ), i.e., zero or one M.
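To experiment with the num pattern, the POSIX regex facility can be used (assuming a POSIX system; the pattern string below is a direct transcription of num into extended regular-expression syntax):

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* num -> digit+ (. digit+)? ( E (+|-)? digit+ )? */
    const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *tests[] = { "0", "3.14159", "6.02E23", "6.02e23", "1.", "E5" };
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 1;
    for (int i = 0; i < 6; i++)     /* "6.02e23" fails: the pattern uses E only */
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "num" : "not num");
    regfree(&re);
    return 0;
}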
Transition Diagrams
• As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called “transition diagrams” (TDs), which are similar to DFAs.
• Differences between a TD and a DFA:
1. A DFA accepts or rejects a string. A TD reads characters until finding a token, returns that token, and prepares the input buffer for the next call.
2. In a TD, there are no out-transitions from accepting states.
3. A transition labeled other (or not labeled) should be taken on any character except those labeling the other transitions out of the given state.
4. States can be marked with a *: this indicates states on which an input retraction must take place.
Transition diagram Examples 1
• relop → < | <= | <> | > | >= | =
[Transition diagram, rendered as text:]
State 0 (start): on < go to state 1; on = go to state 5; on > go to state 6.
State 1: on = go to state 2, return (relop, LE); on > go to state 3, return (relop, NE); on other go to state 4*, retract, return (relop, LT).
State 5: return (relop, EQ).
State 6: on = go to state 7, return (relop, GE); on other go to state 8*, retract, return (relop, GT).
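A minimal C sketch of this transition diagram (the enum names mirror the return values above; retraction in states 4* and 8* is modeled by simply not consuming the lookahead character):

#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE, NONE };

static const char *src;                       /* input cursor */

static enum relop relop_token(void) {
    switch (*src) {
    case '<':                                 /* state 1 */
        src++;
        if (*src == '=') { src++; return LE; }   /* state 2 */
        if (*src == '>') { src++; return NE; }   /* state 3 */
        return LT;                               /* state 4*: retract */
    case '=':
        src++; return EQ;                        /* state 5 */
    case '>':                                 /* state 6 */
        src++;
        if (*src == '=') { src++; return GE; }   /* state 7 */
        return GT;                               /* state 8*: retract */
    default:
        return NONE;
    }
}

int main(void) {
    src = "<>";
    printf("%d\n", relop_token());            /* prints 3, i.e. NE */
    return 0;
}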
Combined Finite Automata
[Diagram: the individual token automata combined into a single finite automaton.]
keywords Finite Automata
This machine accepts the keywords: if, int, inline, for, float.
[Diagram: the keyword-recognizing automaton.]
Chapter 4
Syntax Analysis
Position of a Parser in the Compiler Model
• Syntax analysis or parsing is the second phase of a compiler.
• Parsing is the process of determining how a string of terminals can
be generated by a grammar.
• A syntax analyzer or parser takes the input from a lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.
• The parser thus accomplishes two tasks: parsing the code while looking for errors, and generating a parse tree as the output of the phase.
• Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies.
Lexical Versus Syntax Analysis
Why use regular expressions to define the lexical syntax of a language?
The lexical rules of a language are often quite simple, and to
describe them we do not need a notation as powerful as
grammars.
Regular expressions generally provide a more concise and
easier-to-understand notation for tokens than grammars.
Context free grammars
A context-free grammar consists of terminals, nonterminals, a start symbol, and productions. Example productions:
expression -> expression + term
expression -> expression – term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
Derivations & Parse trees
Productions are treated as rewriting rules to generate a string
E -> E + E | E * E | -E | (E) | id
Ambiguity
For some strings there exists more than one parse tree,
or more than one leftmost derivation,
or more than one rightmost derivation.
Example: id+id*id
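As a worked illustration with this grammar, id+id*id has two distinct leftmost derivations (one per parse tree):
E => E + E => id + E => id + E * E => id + id * E => id + id * id
E => E * E => E + E * E => id + E * E => id + id * E => id + id * id
The first derivation groups id*id under the +; the second groups id+id under the *.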
Left recursion
A grammar is left recursive if it has a non-terminal A such that there is a derivation A =>+ Aα.
A simple rule for direct left recursion elimination:
For a rule like:
A -> A α | β
we may replace it with
A -> β A’
A’ -> α A’ | ɛ
Left recursion elimination (general algorithm)
Arrange the nonterminals in some order A1, A2, …, An. For each i from 1 to n: for each j from 1 to i-1, replace every production Ai -> Aj γ by Ai -> δ1 γ | … | δk γ, where Aj -> δ1 | … | δk are the current Aj-productions; then eliminate the immediate left recursion among the Ai-productions. The example below follows exactly these steps.
Example Left Recursion Elimination
A → B C | a
B → C A | A b          Choose ordering: A, B, C
C → A B | C C | a

i = 1: nothing to do
i = 2, j = 1: B → C A | A b
becomes B → C A | B C b | a b
(imm) B → C A BR | a b BR
      BR → C b BR | ɛ
i = 3, j = 1: C → A B | C C | a
becomes C → B C B | a B | C C | a
i = 3, j = 2: C → B C B | a B | C C | a
becomes C → C A BR C B | a b BR C B | a B | C C | a
(imm) C → a b BR C B CR | a B CR | a CR
      CR → A BR C B CR | C CR | ɛ
Left factoring
When a nonterminal has two or more productions whose right-hand sides start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing.
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive or top-down parsing.
It is a way of delaying the decision until more information is available.
Consider the following grammar:
Stmt -> if expr then stmt else stmt
      | if expr then stmt
On seeing the input if, it is not clear to the parser which production to use.
We can easily perform left factoring: if we have A -> αβ1 | αβ2, then we replace it with
A -> αA’
A’ -> β1 | β2
Left factoring (cont.)
Algorithm
For each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ɛ, then replace all of the A-productions A -> αβ1 | αβ2 | … | αβn | γ by
A -> αA’ | γ
A’ -> β1 | β2 | … | βn
Example:
S -> i E t S | i E t S e S | a
E -> b
Left-factored, this grammar becomes:
S -> i E t S S’ | a
S’ -> e S | ɛ
E -> b
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax analyzers have the following drawbacks:
• they cannot determine if a token is valid,
• they cannot determine if a token is declared before it is being used,
• they cannot determine if a token is initialized before it is being used,
• they cannot determine if an operation performed on a token type is valid or not.
Parsing Techniques
Top-down parsers (LL(1), recursive descent)
Start at the root of the parse tree from the start symbol and grow toward the leaves (similar to a derivation).
Pick a production and try to match the input.
A bad “pick” may require backtracking.
Some grammars are backtrack-free (predictive parsing).
Parsing Techniques
Bottom-up parsers (LR(1), operator precedence)
Start at the leaves and grow toward the root.
The process can be viewed as reducing the input string to the start symbol.
At each reduction step a particular substring matching the right side of a production is replaced by the symbol on the left side of the production.
Bottom-up parsers handle a large class of grammars.
Recursive descent parsing: a common form of top-down parsing. It is called recursive as it uses recursive procedures to process the input. Recursive descent parsing may suffer from backtracking.
Backtracking: if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production. This technique may process the input string more than once to determine the right production.
Top Down Parsing
A top-down parser tries to create a parse tree from the root towards the leaves, scanning the input from left to right.
It can also be viewed as finding a leftmost derivation for an input string.
Example: id+id*id
At each step of a top-down parse, the key problem is that of determining the production to be applied for a nonterminal, say A. Once an A-production is chosen, the rest of the parsing process consists of “matching” the terminal symbols in the production body with the input string.
Grammar:
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
[Figure: successive steps of a leftmost derivation of id+id*id, drawn as a growing parse tree.]
Recursive descent parsing
Consists of a set of procedures, one for each nonterminal.
Execution begins with the procedure for the start symbol.
A typical procedure for a non-terminal A:
void A() {
    choose an A-production, A -> X1 X2 … Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else
            /* an error has occurred */
    }
}
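As a concrete illustration, here is a minimal, backtrack-free instance of this scheme in C for the expression grammar above (E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id). Writing the token id as the single letter i is an assumption made to keep the lexer trivial; this is a sketch, not the course's reference implementation:

#include <stdio.h>
#include <stdlib.h>

static const char *input;                 /* remaining input */

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char c) { if (*input == c) input++; else error(); }

static void E(void); static void Eprime(void);
static void T(void); static void Tprime(void);
static void F(void);

static void E(void)      { T(); Eprime(); }                                     /* E  -> T E'         */
static void Eprime(void) { if (*input == '+') { match('+'); T(); Eprime(); } }  /* E' -> +TE' | eps   */
static void T(void)      { F(); Tprime(); }                                     /* T  -> F T'         */
static void Tprime(void) { if (*input == '*') { match('*'); F(); Tprime(); } }  /* T' -> *FT' | eps   */
static void F(void) {                                                           /* F  -> (E) | id     */
    if (*input == '(')      { match('('); E(); match(')'); }
    else if (*input == 'i') { match('i'); }                                     /* 'i' stands for id  */
    else error();
}

int main(void) {
    input = "i+i*i";                      /* i.e. id+id*id */
    E();
    puts(*input == '\0' ? "accepted" : "syntax error");
    return 0;
}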
Recursive descent parsing (backtracking)
General recursive descent may require backtracking.
The previous code needs to be modified to allow backtracking: we need to try all alternatives; if one fails, the input pointer must be reset and another alternative tried.
Recursive descent parsers can't be used for left-recursive grammars.
Backtracking Example
Grammar:  S -> cAd
          A -> ab | a
Input: cad
Step 1: From the start symbol, expand S -> cAd and match the first input symbol c.
Step 2: Expand A using the first alternative, A -> ab. We have a match for the second input symbol a, so we advance the input pointer to d, the third input symbol, and compare d against the next leaf b.
Step 3 (backtracking): Since b does not match d, we report failure and go back to A to see whether there is another alternative for A that has not been tried and that might produce a match. In going back to A, we must reset the input pointer to a. The alternative A -> a then matches a, and the remaining leaf d matches the last input symbol.
Predictive Parsing
• Recursive descent is a top-down parsing technique that constructs the
parse tree from the top and the input is read from left to right.
• It uses procedures for every terminal and non-terminal entity.
• This parsing technique recursively parses the input to make a parse
tree, which may or may not require back-tracking. But the grammar
associated with it (if not left factored) cannot avoid back-tracking.
• A predictive parsing is a form of recursive-descent parsing that
does not require any back-tracking and has the capability to predict
which production is to be used to replace the input string.
• To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input symbol. To make the parser back-tracking free, the predictive parser puts some constraints on the grammar and accepts only a class of grammars known as LL(k) grammars.
First Set
First(α) is the set of terminals that begin strings derived from α; that is, a is in First(α) iff α =>* aβ for some β. If α =>* ɛ, then ɛ is also in First(α).
In predictive parsing, when we have A -> α | β, if First(α) and First(β) are disjoint sets then we can select the appropriate A-production by looking at the next input symbol.
To compute First(X) for all grammar symbols X, apply the following rules until no more terminals or ɛ can be added to any First set:
1. If X is a terminal, then First(X) = {X}.
2. If X is a nonterminal and X -> Y1 Y2 … Yk is a production for some k >= 1, then place a in First(X) if for some i, a is in First(Yi) and ɛ is in all of First(Y1), …, First(Yi-1), that is, Y1 … Yi-1 =>* ɛ. If ɛ is in First(Yj) for all j = 1, …, k, then add ɛ to First(X). If Y1 does not derive ɛ, then we add nothing more to First(X) beyond First(Y1); but if Y1 =>* ɛ, then we also add First(Y2), and so on.
3. If X -> ɛ is a production, then add ɛ to First(X).
First and Follow Examples
G1: S → a A B b
    A → c | ɛ
    B → d | ɛ
        First       Follow
S       {a}         {$}
A       {c, ɛ}      {d, b}
B       {d, ɛ}      {b}

G2: S → a B D h
    B → c C
    C → b c | ɛ
    D → E F
    E → g | ɛ
    F → f | ɛ
        First        Follow
S       {a}          {$}
B       {c}          {g, f, h}
C       {b, ɛ}       {g, f, h}
D       {g, f, ɛ}    {h}
E       {g, ɛ}       {f, h}
F       {f, ɛ}       {h}
Construction of predictive parsing table
The next algorithm collects the information from the FIRST and FOLLOW sets into a predictive parsing table M[A, a]. It is based on the following idea:
• The production A → α is chosen if the next input symbol a is in FIRST(α). The only complication occurs when α = ɛ or, more generally, α =>* ɛ. In this case, we choose A → α if the current input symbol is in FOLLOW(A), or if the $ on the input has been reached and $ is in FOLLOW(A).
Algorithm 4.31: Construction of a predictive parsing table.
INPUT: Grammar G.
OUTPUT: Parsing table M.
For each production A → α in the grammar do the following:
1. For each terminal a in First(α), add A → α to M[A, a].
2. If ɛ is in First(α), then for each symbol b in Follow(A), add A → α to M[A, b].
If, after performing the above, there is no production in M[A, a], then set M[A, a] to error (which we normally represent by an empty entry in the table).
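As a worked illustration of these two rules, consider the productions E’ -> +TE’ | Ɛ from the grammar on the next slide. For E’ -> +TE’, First(+TE’) = {+}, so rule 1 adds E’ -> +TE’ to M[E’, +]. For E’ -> Ɛ, ɛ is in First(Ɛ) and Follow(E’) = {), $}, so rule 2 adds E’ -> Ɛ to both M[E’, )] and M[E’, $].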
Parsing table Example
Grammar:
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id

        First       Follow
E       {(, id}     {), $}
E’      {+, ɛ}      {), $}
T       {(, id}     {+, ), $}
T’      {*, ɛ}      {+, ), $}
F       {(, id}     {+, *, ), $}

Parsing table M:
Non-       Input Symbol
terminal   id          +             *              (           )          $
E          E -> TE’                                 E -> TE’
E’                     E’ -> +TE’                               E’ -> Ɛ    E’ -> Ɛ
T          T -> FT’                                 T -> FT’
T’                     T’ -> Ɛ       T’ -> *FT’                 T’ -> Ɛ    T’ -> Ɛ
F          F -> id                                  F -> (E)

Second example (the dangling-else grammar S -> i E t S SR | a, SR -> e S | Ɛ, E -> b):
       a        b       e                     i                t    $
S      S → a                                  S → i E t S SR
SR                      SR → e S, SR → Ɛ                            SR → Ɛ
E               E → b
M[SR, e] contains two productions: the grammar is ambiguous, hence not LL(1).
Non-recursive predictive parsing
A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that
S =>* wα (leftmost)
[Diagram: the model of a table-driven predictive parser - an input buffer (a + b $), a stack (X Y Z $), the predictive parsing program, the parsing table M, and the output.]
Predictive parsing algorithm
Set the input pointer to the first symbol of w$ and push $ and then the start symbol S onto the stack. Repeatedly let X be the top stack symbol and a the current input symbol: if X = a, pop X and advance the input; if X is a terminal different from a, or M[X, a] is an error entry, report an error; otherwise, if M[X, a] = X -> Y1 Y2 … Yk, output the production, pop X, and push Yk, …, Y2, Y1 onto the stack (with Y1 on top). The parse succeeds when the stack and the input are both reduced to $.
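A minimal C sketch of this driver for the same expression grammar (a sketch under stated assumptions: E’ and T’ are encoded as the single characters q and r, id as i, and the table M below is the standard LL(1) table for this grammar):

#include <stdio.h>
#include <string.h>

static const char *terms = "i+*()$";   /* terminal alphabet; 'i' = id */

/* M[row][col]: row = nonterminal (E, E', T, T', F), col = terminal.
   Each entry is the right-hand side to push, "" for Ɛ, NULL for error. */
static const char *M[5][6] = {
    /*           i       +       *       (       )     $   */
    /* E  */ { "Tq",    NULL,   NULL,   "Tq",   NULL, NULL },
    /* E' */ { NULL,    "+Tq",  NULL,   NULL,   "",   ""   },
    /* T  */ { "Fr",    NULL,   NULL,   "Fr",   NULL, NULL },
    /* T' */ { NULL,    "",     "*Fr",  NULL,   "",   ""   },
    /* F  */ { "i",     NULL,   NULL,   "(E)",  NULL, NULL },
};

static int row(char X) {               /* nonterminal -> table row, or -1 */
    switch (X) {
    case 'E': return 0; case 'q': return 1;
    case 'T': return 2; case 'r': return 3;
    case 'F': return 4; default:  return -1;
    }
}

int main(void) {
    const char *input = "i+i*i$";
    char stack[100] = "$E";            /* $ at the bottom, start symbol on top */
    int top = 1;
    while (stack[top] != '$') {
        char X = stack[top], a = *input;
        const char *p = strchr(terms, a);
        if (!p) { printf("error\n"); return 1; }
        if (row(X) < 0) {              /* X is a terminal: must match the input */
            if (X == a) { top--; input++; }
            else { printf("error\n"); return 1; }
        } else {                       /* consult M and expand */
            const char *rhs = M[row(X)][p - terms];
            if (!rhs) { printf("error\n"); return 1; }
            printf("%c -> %s\n", X, *rhs ? rhs : "eps");   /* output production */
            top--;
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[++top] = rhs[i]; /* push the right-hand side reversed */
        }
    }
    printf(*input == '$' ? "accepted\n" : "error\n");
    return 0;
}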
LL(1) Example 1: parse the string “id+id*id”
[Table: moves made by a predictive parser on the input id+id*id.]
LL(1) Example 2: parse the string “abba”
        First       Follow
S       {a}         {$}
B       {b, ɛ}      {a}
[Table: grammar, parsing table, and parser moves.]
LL(1) Example 3: parse the string “abbb”
Grammar:
S → a ABC
A → a | bb
B → a | Ɛ
C → b | Ɛ

        First      Follow
S       {a}        {$}
A       {a, b}     {a, b, $}
B       {a, ɛ}     {b, $}
C       {b, ɛ}     {$}

Parsing table:
     a            b          $
S    S → a ABC
A    A → a        A → bb
B    B → a        B → Ɛ      B → Ɛ
C                 C → b      C → Ɛ

matched   stack     input     action
          S$        abbb $
          aABC$     abbb $    S → a ABC
a         ABC$      bbb $     match a
a         bbBC$     bbb $     A → bb
abb       BC$       b$        match bb
abb       C$        b$        B → Ɛ
abb       b$        b$        C → b
abbb      $         $         match b; accept
LL(1) Example 4: parse the string “int id,id;”
S → TL;
T → int | float
L → L , id | id
(L is left-recursive, so left recursion must be eliminated before computing First and Follow and building the table.)
Error recovery in predictive parsing
Panic mode:
Place all symbols in Follow(A) into the synchronization set for nonterminal A: skip tokens until an element of Follow(A) is seen, then pop A from the stack.
Add to the synchronization set of a lower-level construct the symbols that begin higher-level constructs.
Add the symbols in First(A) to the synchronization set of nonterminal A.
If a nonterminal can generate the empty string, then the production deriving ɛ can be used as a default.
If a terminal on top of the stack cannot be matched, pop the terminal and issue a message saying that the terminal was inserted.
Example
Non-        Input Symbol
terminal    id          +    *    (           )        $
E           E -> TE’              E -> TE’    synch    synch
(synch entries mark synchronizing tokens taken from Follow(E).)
Exercises
Exercise 4.4.1: For each of the following grammars, devise predictive parsers and show the parsing tables. You may left-factor and/or eliminate left recursion from your grammars first.
a) S → 0 S 1 | 01
b) S → + S S | * S S | a
c) S → S (S) S | ɛ
d) S → S + S | S S | (S) | S * | a
e) S → (L) | a and L → L , S | S
f) S → a S b S | b S a S | ɛ
Solution of 4.4.1 (d)
[Worked solution spanning three slides: grammar transformation, First/Follow sets, and the parsing table for grammar (d).]
Exercises
Exercise 4.4.3 : Compute FIRST and FOLLOW for the grammars of
Exercise 4.2.2.
Bottom-up parsing starts from the leaf nodes of a tree and works in an upward direction until it reaches the root node. Here, we start from a sentence and then apply production rules in reverse in order to reach the start symbol.
[Diagram: the family of bottom-up parsers.]
Example: id*id
Grammar:
E -> E + T | T
T -> T * F | F
F -> (E) | id
A bottom-up parse of id*id performs the reductions
id * id,  F * id,  T * id,  T * F,  T,  E
[Figure: the corresponding sequence of partial parse trees.]
Shift-reduce parser
The general idea is to shift some symbols of the input onto the stack until a reduction can be applied.
At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of the production.
The key decisions during bottom-up parsing are when to reduce and what production to apply.
A reduction is the reverse of a step in a derivation.
The goal of a bottom-up parser is to construct a rightmost derivation in reverse:
E => T => T*F => T*id => F*id => id*id
Handle pruning
A handle is a substring that matches the body of a production and whose reduction represents one step along the reverse of a rightmost derivation.
[Figure: handles during a parse of id1 * id2.]
Example:
Grammar:  S → a A B e
          A → A b c | b
          B → d
Reduction sequence for the input abbcde (the substring reduced at each step is the handle):
a b b c d e  →  a A b c d e  →  a A d e  →  a A B e  →  S
Shift reduce parsing
Consists of:
A stack, used to hold grammar symbols, and
an input buffer, which holds the rest of the string to be parsed.
The handle always appears on top of the stack.
Initial configuration:        Acceptance configuration:
Stack    Input                Stack    Input
$        w$                   $S       $
Basic operations:
1. Shift: shift the next input symbol onto the top of the stack.
2. Reduce: replace the handle on the top of the stack by the corresponding nonterminal.
3. Accept: announce successful completion of parsing.
4. Error: discover a syntax error and call an error recovery routine.
Shift reduce parsing Example 1
Input: id+id*id (grammar: E -> E + T | T, T -> T * F | F, F -> (E) | id)
Stack          Input         Action
$              id+id*id$     shift
$ id           +id*id$       reduce by F -> id
$ F            +id*id$       reduce by T -> F
$ T            +id*id$       reduce by E -> T
$ E            +id*id$       shift
$ E +          id*id$        shift
$ E + id       *id$          reduce by F -> id
$ E + F        *id$          reduce by T -> F
$ E + T        *id$          shift
$ E + T *      id$           shift
$ E + T * id   $             reduce by F -> id
$ E + T * F    $             reduce by T -> T * F
$ E + T        $             reduce by E -> E + T
$ E            $             accept
Shift reduce parsing Example 2
Input: id+id*id
[Table: the parser's stack/input/action moves.]
Conflicts during shift reduce parsing
There are grammars for which shift-reduce parsing cannot be used.
For such grammars, every shift-reduce parser can reach a configuration in which it cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make (a reduce/reduce conflict).
These grammars are not in the LR(k) class of grammars.
shift/reduce conflict
Example: an ambiguous grammar can never be LR. With the dangling-else grammar:
Stack                        Input
… if expr then stmt          else …$
(the parser cannot tell whether to reduce if expr then stmt or to shift else).
Reduce-Reduce Conflicts
Grammar:
C → A B
A → a
B → a
Stack    Input    Action
$        aa$      shift
$ a      a$       reduce A → a or B → a ?
Resolve in favor of reducing A → a, otherwise we’re stuck!
LR Parsing
The most prevalent type of bottom-up parser today is based on LR(k).
The k stands for the number of input symbols of lookahead that are used in making parsing decisions.
The cases k = 0 and k = 1 are of practical interest, and we shall only consider LR parsers with k <= 1 here.
When (k) is omitted, k is assumed to be 1.
Why LR parsers? They are table driven, much like the nonrecursive LL parsers.
The Goto function
[Definition and illustration of the GOTO function on sets of items.]
Canonical LR(0) items
void items(G’) {
    C = CLOSURE({[S’ -> • S]});
    repeat
        for (each set of items I in C)
            for (each grammar symbol X)
                if (GOTO(I, X) is not empty and not in C)
                    add GOTO(I, X) to C;
    until no new sets of items are added to C on a round;
}
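The CLOSURE procedure used above, sketched in the same pseudocode style (the standard formulation, given here as an assumption since the slide defining it is not part of this section):

SetOfItems CLOSURE(I) {
    repeat
        for (each item [A -> α • B β] in I)
            for (each production B -> γ of the grammar)
                if ([B -> • γ] is not in I)
                    add [B -> • γ] to I;
    until no more items are added to I on a round;
    return I;
}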
Canonical LR(0) items - Example
Augmented grammar:
1. C’ → C
2. C → A B
3. A → a
4. B → a

I0 = closure({[C’ → •C]}):    C’ → •C,  C → •A B,  A → •a
I1 = goto(I0, C):             C’ → C•   (final)
I2 = goto(I0, A):             C → A•B,  B → •a
I3 = goto(I0, a):             A → a•
I4 = goto(I2, B):             C → A B•
I5 = goto(I2, a):             B → a•
[Figure: the LR(0) automaton - state 0 with transitions on C to 1, on A to 2, and on a to 3; state 2 with transitions on B to 4 and on a to 5.]
LR-Parsing model
• LR parser consists of an input, an output, a stack, a driver program,
and a parsing table that has two parts (ACTION and GOTO).
• The driver program is the same for all LR parsers; only the
parsing table changes from one parser to another.
• Where a shift-reduce parser would shift a symbol, an LR parser
shifts a state.
LR(0) parsing
A shift item is an item with • immediately before a terminal a. It says that a must be shifted onto the stack if it appears as the next input symbol.
A reduce item is an item of the form A → α•. It indicates that, when this state is reached, the production A → α should be reduced.
Reducing by the item S’ → S• accepts the input string.
LR(0) parsing requires that each of these steps be uniquely determined by the LR(0) machine and the input. Therefore, if a state has a reduce item, it must not have any other reduce items or shift items.
With this restriction, the current state determines whether to shift or reduce, and which production to reduce by, without looking at the next input. If it shifts, it reads the next input symbol to determine which state to shift into.
LR(0) parsing Example 1
Grammar:
1- S → (S)
2- S → a

        Action                          Goto
State   (       )       a       $       S
0       s2              s5              1
1                               accept
2       s2              s5              3
3               s4
4       r1      r1      r1      r1
5       r2      r2      r2      r2
LR parsing algorithm
INPUT: an input string w and an LR-parsing table with ACTION and GOTO functions.
OUTPUT: if w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error indication.
let a be the first symbol of w$;
while (1) { /* repeat forever */
    let s be the state on top of the stack;
    if (ACTION[s, a] = shift t) {
        push t onto the stack;
        let a be the next input symbol;
    } else if (ACTION[s, a] = reduce A -> β) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t, A] onto the stack;
        output the production A -> β;
    } else if (ACTION[s, a] = accept) break; /* parsing is done */
    else call error-recovery routine;
}
LR(0) parsing Example: input ((a))
Stack    Symbols   Input      Action
0        $         ((a))$     shift 2
02       $(        (a))$      shift 2
022      $((       a))$       shift 5
0225     $((a      ))$        reduce by S → a (pop to 022, goto 3)
0223     $((S      ))$        shift 4
02234    $((S)     )$         reduce by S → (S) (pop to 02, goto 3)
023      $(S       )$         shift 4
0234     $(S)      $          reduce by S → (S) (pop to 0, goto 1)
01       $S        $          accept

Let's look at the reductions by S → (S) in more detail. When the first such reduction occurs, the stack is 02234; three states are popped off (because the length of “(S)” is 3), leaving a stack of 02. There is a transition from the top state, 2, on S to state 3, so we push a 3, leaving 023 on the stack. The second time we reduce by S → (S), the stack is 0234. When three states are popped, this leaves a stack with just 0 on it. There is a transition from state 0 to state 1 on S, so the new stack is 01.
Not LR(0) Parsing Example 1
0) E’ → E
1) E → E + T
2) E → T
3) T → T * F
4) T → F
5) F → (E)
6) F → id

STATE   ACTION                                          GOTO
        id     +      *        (      )      $          E   T   F
0       s5                     s4                       1   2   3
1              s6                            acc
2       r2     r2     r2/s7    r2     r2     r2
3       r4     r4     r4       r4     r4     r4
4       s5                     s4                       8   2   3
5       r6     r6     r6       r6     r6     r6
6       s5                     s4                           9   3
7       s5                     s4                               10
8              s6                     s11
9       r1     r1     r1/s7    r1     r1     r1
10      r3     r3     r3       r3     r3     r3
11      r5     r5     r5       r5     r5     r5
Not LR(0) Parsing Example 2
Example of a CFG that is not LR(0):
0. S’ → S
1. S → A a
2. S → B b
3. S → a c
4. A → a
5. B → a
Not LR(0) Parsing (shift/reduce or reduce/reduce conflicts)
        Action                                            Goto
State   a        b        c           $                   S   A   B
0       s6                                                1   2   4
1                                     accept
2       s3
3       r1       r1       r1          r1
4                s5
5       r2       r2       r2          r2
6       r4/r5    r4/r5    s7/r4/r5    r4/r5
7       r3       r3       r3          r3
Not LR(0) Parsing
• The machine is not LR(0) because of shift/reduce and reduce/reduce conflicts in state 6 (there is a shift item and two reduce items in the state, so the parser doesn't know whether to shift or reduce, and even if it decided to reduce, it wouldn't know which production to reduce by). Hence, this grammar is not LR(0).
• However, if we allowed the parser to base its choice on the next input symbol, the correct choice could be made reliably. If you examine the grammar carefully, you can see that A → a should only be reduced when the next input is a, B → a should only be reduced when the next input is b, and, if the next input is c, the parser should shift.
• How could we determine this algorithmically? The next three parsing algorithms all do it in different ways. The simplest method is SLR(1) parsing, which uses FOLLOW sets to compute lookaheads for actions.
SLR Grammars Concept
SLR (Simple LR) is a simple extension of LR(0) parsing.
SLR eliminates some conflicts by entering a reduction A → α in the parsing table only on the symbols in FOLLOW(A).
Grammar:
1. S → E
2. E → id + E
3. E → id

State I0: S → •E, E → •id + E, E → •id
goto(I0, id) = I2: E → id•+ E, E → id•
State I2 contains both a shift item (shift on +) and a reduce item. Since FOLLOW(E) = {$}, SLR reduces by E → id only on $ and shifts on +:

        id     +      $       E
0       s2                    1
1                     acc
2              s3     r3
3       s2                    4
4                     r2
Constructing SLR parsing table
Grammar:
1. C’ → C
2. C → A B
3. A → a
4. B → a

States:
I0: C’ → •C, C → •A B, A → •a
I1: C’ → C•
I2: C → A•B, B → •a
I3: A → a•
I4: C → A B•
I5: B → a•

FOLLOW(C) = {$}, FOLLOW(A) = {a}, FOLLOW(B) = {$}

        action          goto
state   a       $       C   A   B
0       s3              1   2
1               acc
2       s5                      4
3       r3
4               r2
5               r4
SLR Parsing Table Example 2
Grammar:
0. S’ → S
1. S → A a
2. S → B b
3. S → a c
4. A → a
5. B → a
FOLLOW(S) = {$}
FOLLOW(A) = {a}
FOLLOW(B) = {b}
(With these FOLLOW sets, state 6 of the LR(0) machine above becomes conflict-free: r4 only on a, r5 only on b, and shift on c.)
Example 3
(1) E -> E + T    (2) E -> T
(3) T -> T * F    (4) T -> F
(5) F -> (E)      (6) F -> id
FOLLOW(E) = {$, +, )}
FOLLOW(T) = {$, +, ), *}
FOLLOW(F) = {$, +, ), *}

LR(0) parsing table:
STATE   id     +      *        (      )      $
0       s5                     s4
1              s6                            acc
2       r2     r2     r2/s7    r2     r2     r2
3       r4     r4     r4       r4     r4     r4
4       s5                     s4
5       r6     r6     r6       r6     r6     r6
6       s5                     s4
7       s5                     s4
8              s6                     s11
9       r1     r1     r1/s7    r1     r1     r1
10      r3     r3     r3       r3     r3     r3
11      r5     r5     r5       r5     r5     r5

SLR parsing table:
STATE   id     +      *       (      )      $
0       s5                    s4
1              s6                           acc
2              r2     s7             r2     r2
3              r4     r4             r4     r4
4       s5                    s4
5              r6     r6             r6     r6
6       s5                    s4
7       s5                    s4
8              s6                    s11
9              r1     s7             r1     r1
10             r3     r3             r3     r3
11             r5     r5             r5     r5
Example: Parse id*id+id
(1) E -> E + T    (2) E -> T
(3) T -> T * F    (4) T -> F
(5) F -> (E)      (6) F -> id
(ACTION as in the SLR table of Example 3; the GOTO entries used below are GOTO[0,E]=1, GOTO[0,T]=2, GOTO[0,F]=3, GOTO[6,T]=9, GOTO[6,F]=3, GOTO[7,F]=10.)

Line   Stack     Symbols   Input        Action
(1)    0                   id*id+id$    shift 5
(2)    05        id        *id+id$      reduce F -> id
(3)    03        F         *id+id$      reduce T -> F
(4)    02        T         *id+id$      shift 7
(5)    027       T*        id+id$       shift 5
(6)    0275      T*id      +id$         reduce F -> id
(7)    027 10    T*F       +id$         reduce T -> T*F
(8)    02        T         +id$         reduce E -> T
(9)    01        E         +id$         shift 6
(10)   016       E+        id$          shift 5
(11)   0165      E+id      $            reduce F -> id
(12)   0163      E+F       $            reduce T -> F
(13)   0169      E+T       $            reduce E -> E+T
(14)   01        E         $            accept
Exercises SLR
[Exercise set on SLR parsing.]
Canonical-LR Parsing
The canonical-LR method, or just “LR”, uses a large set of items, called the LR(1) items. The 1 refers to the length of the lookahead of the item.
LR(1) items = LR(0) items + a lookahead symbol.
The general form becomes [A → α • β, a], where a is a terminal or the right end marker $.
The lookahead has no effect in an item of the form [A → α • β, a] where β is not ɛ, but an item of the form [A → α •, a] calls for a reduction by A → α only if the next input symbol is a.
Thus, we are required to reduce by A → α only on those input symbols a for which [A → α •, a] is an LR(1) item in the state on top of the stack.
The set of such a's will always be a subset of FOLLOW(A).
How to add the lookahead to a production?
• CASE 1: A → α • B C, a
• Suppose this is the 0th item. Since • precedes B, we have to add B's productions as well, e.g. B → • D [1st item].
• The lookahead of the new item is computed from the 0th item: we take FIRST of whatever follows B there. Here, in the 0th item, C comes after B; assuming FIRST(C) = d, the 1st item becomes B → • D, d.
• CASE 2: A → α • B, a
• There is nothing after B, so the lookahead of the 0th item becomes the lookahead of the 1st item, i.e., B → • D, a.
• CASE 3: A → a | b with lookahead $ gives the two items
➢ A → • a, $ [0th item]
➢ A → • b, $ [1st item]
Constructing LR(1) Sets of Items
• The method for building the collection of sets of valid LR(1) items is essentially the same as the one for building the canonical collection of sets of LR(0) items.
• We need only modify the two procedures CLOSURE and GOTO.
Example: S’ → S,  S → C C,  C → c C | d
[Figure: the sets of LR(1) items for this grammar.]
LR(1) GOTO graph
1) S → C C    2) C → c C    3) C → d
[Figure: the LR(1) GOTO graph for this grammar.]
Constructing LR(1) Parsing Tables
• An LR parser using the canonical LR(1) parsing table is called a
canonical-LR(1) parser.
LR(1) Parsing Tables of the example
• Every SLR grammar is an LR grammar, but for an SLR grammar the
LR parser may have more states than the SLR parser for the same
grammar. The grammar of the previous example is SLR and has an
SLR parser with seven states, compared with the ten of the LR.
Constructing LALR Sets of Items
• LALR parsers are the same as LR parsers, with one difference: if two states of the LR parser differ only in their lookaheads, we combine those states in the LALR parser.
• We may merge the sets of LR(1) items having the same core, i.e., the same set of first components, into one set of items.
• In general, a core is a set of LR(0) items for the grammar at hand.
• For example, I4 and I7, with common core {C → d•}, are replaced by their union: I47: C → d•, c/d/$.
• I8 and I9, with common core {C → cC•}, are replaced by their union: I89: C → cC•, c/d/$.
• I3 and I6, with common core {C → c•C, C → •cC, C → •d}, are replaced by their union:
I36: C → c•C, c/d/$
     C → •cC, c/d/$
     C → •d, c/d/$
LALR GOTO and Parsing Tables
• Consider GOTO(I36, C). In the original set of LR(1) items, GOTO(I3, C) = I8, and I8 is now part of I89, so we make GOTO(I36, C) be I89. We would have arrived at the same conclusion had we considered I6, the other part of I36: GOTO(I6, C) = I9, and I9 is now part of I89.
• Consider GOTO(I2, c). In the original sets of LR(1) items, GOTO(I2, c) = I6, and I6 is now part of I36, so GOTO(I2, c) becomes I36. Thus, the entry for state 2 and input c is made s36, meaning shift and push state 36 onto the stack.
Exercises
• Consider the grammar G1:
1. S → A d
2. S → B e
3. A → a A b
4. A → c
5. B → a B b
6. B → c
• Consider the grammar G2:
1. S → A d
2. S → A e
3. A → a A b
4. A → c
• Consider the grammar G3:
1. S → A B
2. A → a A b
3. A → c
4. B → d
5. B → e
• For each grammar: Is this grammar LR(k) for some fixed k?
• What about LL(k) for some fixed k?