M433: Compiler Theory (Dr. Abdelbaqi)

The document introduces the concept of compilers, explaining their importance in computer science education and their role in translating source code into target code while detecting errors. It outlines the structure of a compiler, detailing the front-end analysis and back-end synthesis processes, along with the phases of lexical, syntax, and semantic analysis. Additionally, it discusses intermediate code generation, code optimization, and code generation, emphasizing the significance of the symbol table in managing variable attributes.


Chapter 1

Introduction to Compiler

Why Learn About Compilers?
Few people will ever be required to write a compiler for a general-purpose
language like C or Java. So why do most computer science institutions
offer compiler courses and often make them mandatory?
Some typical reasons are:
(a) It is considered a topic that you should know in order to be “well-
cultured” in computer science.
(b) A good craftsman should know his tools, and compilers are
important tools for programmers and computer scientists.
(c) The techniques used for constructing a compiler are useful for other
purposes as well.
(d) There is a good chance that a programmer or computer scientist will
need to write a compiler or interpreter for a domain-specific language.

Compilers
• Compilation is the translation of a program written in a source
language into an equivalent program written in a target language.
• A compiler is a program that can read a program in one language,
the source language, and translate it into an equivalent program in
another language, the target language.
• An important role of the compiler is to report any errors in the
source program that it detects during the translation process.
• If the target program is an executable machine-language program, it
can then be called by the user to process inputs and produce outputs.
[Figure: Source Program → Compiler → Target Program; the compiler also reports error messages. The target program then maps the user's input to output.]
Interpreters
• An interpreter is another common kind of language processor.
Instead of producing a target program as a translation, an interpreter
appears to directly execute the operations specified in the source
program on inputs supplied by the user
• The machine-language target program produced by a compiler is
usually much faster than an interpreter at mapping inputs to outputs.
• An interpreter, however, can usually give better error diagnostics
than a compiler, because it executes the source program statement by
statement.

[Figure: Source Program + Input → Interpreter → Output; the interpreter also reports error messages.]
A language-processing system
Source Program
↓
Preprocessor: collects the source program modules, which may be divided among separate files.
↓
Modified Source Program
↓
Compiler: may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug.
↓
Target Assembly Program
↓
Assembler: produces relocatable machine code as its output.
↓
Relocatable Object Code
↓
Linker/Loader (together with library files and other relocatable object files): the linker resolves external memory addresses, where the code in one file may refer to a location in another file; the loader then puts all of the executable object files into memory for execution.
↓
Target Machine Code
The Structure of a Compiler (1)
• Any compiler must perform two major tasks
– Analysis of the source program (Front end)
– Synthesis of a machine-language program (Back end)

The Structure of a Compiler (2)
• Front end translates a source program into an independent
intermediate code, then the back end uses this intermediate
code to generate the target code.
• Analysis part (Front end)
– It breaks up the source program into constituent pieces and
imposes a grammatical structure on them.
– It detects whether the source program is syntactically ill-formed
or semantically unsound.
– It collects information about the source program and stores it in a
symbol table, which is passed along with the intermediate
representation to the synthesis part.
• Synthesis Part (Back end)
– It takes the tree structure and translates the operations therein into
the target program.
The Structure of a Compiler (3)
[Figure: the phases of a compiler. Source program (character stream) → Lexical Analyzer (Scanner) → tokens → Syntax Analyzer (Parser) → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation → Optimizer → intermediate representation → Code Generator → target machine code. The symbol and attribute tables are used by all phases of the compiler.]
1- Lexical Analysis (1)
[Figure: the scanner highlighted within the compiler pipeline.]
Scanner
➢ The scanner begins the analysis of the source program by reading the input, character by character, and grouping characters into individual words and symbols (tokens).
• RE (Regular Expression)
• NFA (Non-deterministic Finite Automaton)
• DFA (Deterministic Finite Automaton)
1- Lexical Analysis (2)
▪ Lexical analysis attempts to isolate the “words” in an input string.
▪ A word, known as a lexeme or a lexical item, is a string of input
characters, which is passed on to the next phase of compilation.
▪ When the lexical analyzer encounters whitespace, an operator
symbol, or a special symbol, it decides that a word is completed.
▪ For each lexeme, the lexical analyzer produces as output a token of
the form <token-name, attribute-value> that it passes on to the
subsequent phase, syntax analysis.
The scanner does the following:
• It puts the program into a compact and uniform format (tokens).
• It eliminates unneeded information (such as comments).
• It sometimes enters initial information into symbol tables (for
example, to register the presence of a particular label or identifier).
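To make these duties concrete, here is a minimal scanner sketch in C (an illustration of the idea, not code from the course): it groups characters into identifier and number tokens, emits single-character operator tokens, and discards whitespace, exactly as described above.

#include <stdio.h>
#include <ctype.h>

/* A minimal scanner sketch (illustrative only): reads characters one at a
   time and groups them into identifier, integer, and single-character
   operator tokens, skipping whitespace. */
int main(void) {
    const char *src = "z = a + b * 314;", *p = src;
    char lexeme[64];
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* drop whitespace */
        int n = 0;
        if (isalpha((unsigned char)*p)) {                     /* identifier */
            while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<id, \"%s\">\n", lexeme);
        } else if (isdigit((unsigned char)*p)) {              /* number */
            while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<num, %s>\n", lexeme);
        } else {                                              /* operator/punctuation */
            printf("<'%c'>\n", *p++);
        }
    }
    return 0;
}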
Examples of tokens 1
[Figure: a table of example tokens, not recoverable from the text.]
Examples of tokens 2
[Figure: a table of example tokens, not recoverable from the text.]
2- Syntax Analysis (structure) (1)
[Figure: the parser highlighted within the compiler pipeline.]
Parser
➢ The parser reads tokens and groups them into units as specified by the productions of the CFG being used.
• CFG (Context-Free Grammar)
• BNF (Backus-Naur Form)
2- Syntax Analysis (2)
Syntax Analysis (parsing)
• The parser uses the first components of the tokens produced by the
scanner to create a syntax tree that depicts the grammatical structure
of the token stream. In a syntax tree, each interior node represents
an operation and the children of the node represent its arguments.
• Construction of a syntax tree is a basic activity in compiler writing.

3- Semantic Analysis (meaning) (1)
[Figure: the semantic analyzer highlighted within the compiler pipeline.]
Semantic Analyzer
➢ Semantic analysis is the discovery of meaning in a program.
➢ Performs two functions:
◼ Check the static semantics of each construct
◼ Do the actual translation
➢ The heart of a compiler
• Syntax Directed Translation
• Semantic Processing Techniques
• IR (Intermediate Representation)


3- Semantic Analysis (2)
• The semantic analyzer uses the syntax tree and the information in
the symbol table to check the source program for semantic
consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table,
for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For
example, many programming language definitions require an array
index to be an integer; the compiler must report an error if a
floating-point number is used to index an array.
3- Semantic Analysis (3)
• The language specification may permit some type conversions called
coercions. For example, an arithmetic operator may be applied to
either a pair of integers or to a pair of floating-point numbers. If the
operator is applied to a floating-point number and an integer, the
compiler may convert the integer into a floating-point number.
• Example 1: In the code z = a + b * 3.14 ; ‘a’, ‘b’, ‘c’ are defined as
integer variable while ‘z’ is defined as float variable. So expression
“z = a + b * 3.14” show error (type mismatch error) and can solve
Syntax
automatically or show error message. tree

Semantic Process
[Semantic analyzer]

Syntax Tree

17
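A sketch of how a type checker might handle this example (my own illustration; the node layout and names are assumptions, not the course's code): when a binary operator mixes int and float operands, it inserts an int-to-float coercion instead of rejecting the program.

#include <stdio.h>

/* Illustrative type-checking/coercion sketch for z = a + b * 3.14 */
typedef enum { T_INT, T_FLOAT } Type;

/* Result type of a binary arithmetic operator: if the operand types
   differ, the int operand is coerced (inttofloat) and the result is float. */
Type check_binop(Type lhs, Type rhs) {
    if (lhs == rhs) return lhs;
    printf("coercion: inttofloat inserted\n");
    return T_FLOAT;
}

int main(void) {
    Type a = T_INT, b = T_INT, lit = T_FLOAT;     /* the literal 3.14 */
    Type mul = check_binop(b, lit);               /* b * 3.14  -> float */
    Type add = check_binop(a, mul);               /* a + (...) -> float */
    printf("rhs type: %s\n", add == T_FLOAT ? "float" : "int");
    return 0;                                     /* z is float, so the assignment is OK */
}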
4- Intermediate Code Generator (1)
[Figure: the intermediate code generator highlighted within the compiler pipeline.]
Intermediate Code Generator
– machine independent
– it should be easy to generate
– it should be easily translatable into the target program

4- Intermediate Code Generator (2)
• After syntax and semantic analysis of the source program, many
compilers generate a machine-like intermediate representation, which
we can think of as a program for an abstract machine.
• If we generated machine code directly from source code, then for n
target machines we would need n optimizers and n code generators; if
we instead have a machine-independent intermediate code, we need
only one optimizer.
• In intermediate-code generation, there exist two forms:
• Trees, including parse trees and syntax trees.
• Linear representations, especially “three-address code”
• Some compilers combine parsing and intermediate-code generation
into one component.

4- Intermediate Code Generator (3)
• The output of the intermediate code generator is three-address
code (TAC): a sequence of assembly-like instructions with at most
three operands per instruction.
• TAC is a linearized representation of a syntax tree in which explicit
names correspond to the interior nodes of the graph.

[Figure: syntax tree → Intermediate Code Generator → non-optimized intermediate code.]
4- Intermediate Code Generator (4)
• Each statement has the general form of: z = x op y where x, y and
z are variables, constants or temporary variables generated by the
compiler. ‘op’ represents any operator.
• Each three-address assignment instruction has at most one operator
on the right side. Some three-address instructions have fewer than
three operands.

• Example: the code z = x * y + x


1. temp3 := x
2. temp4 := y
3. temp1 := temp3 * temp4
4. temp2 := x
5. z := temp1 + temp2

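A natural in-memory encoding of such instructions is a quadruple: an operator, up to two operands, and a result. The sketch below is my own encoding, not the course's; it emits a compact two-instruction version of z = x * y + x (the slide's five-instruction version additionally copies each operand into its own temporary first).

#include <stdio.h>

/* One three-address instruction: result := arg1 op arg2 (a sketch). */
typedef struct { const char *result, *arg1; char op; const char *arg2; } Quad;

int main(void) {
    /* z = x * y + x, as in the example above */
    Quad code[] = {
        { "temp1", "x",     '*', "y" },
        { "z",     "temp1", '+', "x" },
    };
    for (int i = 0; i < 2; i++)
        printf("%d. %s := %s %c %s\n",
               i + 1, code[i].result, code[i].arg1, code[i].op, code[i].arg2);
    return 0;
}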
5- Code Optimization (1)
[Figure: the optimizer highlighted within the compiler pipeline.]
Optimizer
➢ The intermediate code generated by the semantic routines is analyzed and transformed into functionally equivalent but improved IR code.
➢ This phase can be very complex and slow.
• Register and Temporary Management
• Peephole Optimization
5- Code Optimization (2)
• The code-optimization phase attempts to improve the intermediate
code (its time and space requirements) so that better target code will
result. This phase is optional.
• It involves examining the sequence of atoms put out by the parser to
find redundant or unnecessary instructions or inefficient code.
• Since it is invoked before the code generator, this phase is often
called machine-independent optimization.

[Figure: non-optimized intermediate code → Code Optimizer → optimized intermediate code.]
5- Code Optimization (3)
Common optimizations include:
1. Removing redundant identifiers
Suppose a variable x always has the same value, c, at some statement in the
program. Then x can be replaced by the value c in that statement. For
example, if at each execution of the assignment b = a * a - 7 the variable a
has the value 4, replacing both occurrences of a by 4 gives the expression
4 * 4 - 7, whose value can be evaluated at compile time.
2. Removing unreachable sections of code
For example, in the following program segment, the statement stmt2
can never be executed. It is unreachable and can be eliminated from
the object program:
stmt1
goto label1
stmt2
label1: stmt3
5- Code Optimization (4)
3. Loop invariant
A computation is loop invariant if it only depends on variables that do
not change their value during the execution of the loop. Such a
computation is executed only once instead of in each iteration when it
has been moved out of a loop. For example,
for (i=1; i<=100000; i++) {
x = sqrt (y); // square root function
printf(x+i) ; }
The assignment to x need not be inside the loop since y doesn’t change
as the loop repeats (it is a loop invariant). In the optimization phase,
the compiler would move the assignment to x out of the loop in the
object program:
x = sqrt (y); // loop invariant
for (i=1; i<=100000; i++) printf(x+i) ;
This eliminates 99,999 unnecessary calls to the sqrt function at run time.
6- Code Generation (1)
[Figure: the code generator highlighted within the compiler pipeline.]
• The code generator converts the intermediate representation of the source code into a form that can be readily executed by the machine.
• It takes as input an intermediate representation of the source program and maps it into the target language.
• If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions that perform the same task.
6- Code Generation (2)
The first operand of each instruction specifies a destination. The F in
each instruction tells us that it deals with floating-point numbers. The
code loads the contents of address id3 into register R2, then multiplies
it with floating-point constant 60.0. The # signifies that 60.0 is to be
treated as an immediate constant. The third instruction moves id2 into
register R1 and the fourth adds to it the value previously computed in
register R2. Finally, the value in register R1 is stored into the address
of id1.
[Figure: optimized intermediate code → Code Generator → target machine code. The code described above, for the statement id1 = id2 + id3 * 60.0, is:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1 ]
Symbol-Table Management
• An essential function of a compiler is to record the variable names
used in the source program and collect information about various
attributes of each name.
• These attributes may provide information about the storage allocated
for a name, its type, its scope (where in the program its value may
be used), and in the case of method names, such things as the
number and types of its arguments, the method of passing each
argument (for example, by value or by reference), and the type
returned.
• The symbol table is a data structure containing a record for each
variable name, with fields for the attributes of the name.
• The data structure should be designed to allow the compiler to find
the record for each name quickly and to store or retrieve data from
that record quickly.
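A sketch of such a record and a table with insert-on-lookup behavior (the field names and the linear search are my assumptions; a real compiler would typically use a hash table to get the fast lookup the text calls for):

#include <stdio.h>
#include <string.h>

/* One symbol-table record: a name plus a few of the attributes
   mentioned above (type, scope). A sketch, not a full design. */
typedef struct {
    char name[32];
    char type[16];     /* e.g. "int", "float" */
    int  scope_level;  /* where in the program the name may be used */
} Symbol;

static Symbol table[256];
static int nsyms = 0;

/* Insert the name if absent, then return its record. */
Symbol *lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0) return &table[i];
    strcpy(table[nsyms].name, name);
    return &table[nsyms++];
}

int main(void) {
    Symbol *s = lookup("score");
    strcpy(s->type, "int");
    s->scope_level = 1;
    printf("%s: %s, scope %d\n", s->name, s->type, s->scope_level);
    return 0;
}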
Chapter 3

Lexical Analysis

Lexical Analysis
• The main task of the lexical analyzer is to read the input characters
of the source program, group them into lexemes, and produce as
output a sequence of tokens for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• The lexical analyzer interacts with the symbol table as well. When
the lexical analyzer discovers a lexeme constituting an identifier, it
needs to enter that lexeme into the symbol table.
• The lexical analyzer removes
comments and whitespace
(blank, newline, tab, and
perhaps other characters that
are used to separate tokens in
the input).
Lexical Analysis
• Lexical analyzers are divided into a cascade of two processes:
– a) Scanning consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and
compaction of consecutive whitespace characters into one.
– b) Lexical analysis proper is the more complex portion, which
produces tokens from the output of the scanner.
• Some languages have only a few kinds of token, of fairly simple
form. Other languages are more complex. C, for example, has
almost 100 kinds of tokens, including 37 keywords (double, if,
return, struct, etc.); identifiers (my_variable, printf, etc.); integer,
floating-point (6.02e2), and character (’x’, ’\’’) constants; string
literals ("hello", "say \"hi\"\n"); 54 “punctuators” (+, ], ->, *=, :, ||,
etc.), and two different forms of comments.
Attributes of Tokens
y := 31 + 28*x   →   lexical analyzer   →
<id, "y"> <assign> <num, 31> <'+'> <num, 28> <'*'> <id, "x">
[Figure: the parser requests the next token (lookahead) from the lexical analyzer and receives the token name together with its attribute value (tokenval).]
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units. It is a pair consisting of a
token name and an optional attribute value. The token name is an
abstract symbol representing a kind of lexical unit, e.g., a particular
keyword, or a sequence of input characters denoting an identifier.
– For example: id and num
• A lexeme is the specific character string that makes up an instance of a
token; it is identified by the lexical analyzer as an instance of that token.
– For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other
tokens, the pattern is a more complex structure that is matched by
many strings.
– For example: "letter followed by letters and digits" and "non-empty
sequence of digits"
Examples of tokens
Token Informal description Sample lexemes

if Characters i, f if
else Characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=
id Letter followed by letter and digits pi, score, D2
number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but “ surrounded by “ “core dumped”
• Ex 1: printf("Total = %d\n", score);
Solution: 7 tokens
<id, "printf"> <(> <literal, "Total = %d\n"> <,> <id, "score"> <)> <;>

• Ex. 2: sum = sum + unit ∗ /∗ accumulate sum ∗/ 2 ;


Solution:
<id, "sum"> <=> <id, "sum"> <+> <id, "unit"> <∗> <num, 2> <;>
Examples of tokens
int fun() {
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:
<int> <id, "fun"> <(> <)> <{> <int> <id, "a"> <id, "b"> <;> <id, "a"> <=>
<num, 10> <;> <return> <num, 0> <;> <}>

Exercise 1: Count number of tokens:


void main()
{
int a = 10, b = 20;
printf("sum is:%d", a+b);
}
Answer: Total number of token: 24.
Tokens Classes
• In many programming languages, the following classes cover most
or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same
as the keyword itself.
2. Tokens for the operators, either individually or in classes.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and
literal strings.
5. Tokens for each punctuation symbol, such as left and right
parentheses, comma, and semicolon.

Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer
must provide the subsequent compiler phases additional information
about the particular lexeme that matched. Thus, in many cases the
lexical analyzer returns to the parser not only a token name, but an
attribute value that describes the lexeme represented by the token;
the token name affects parsing decisions, while the attribute value
affects the translation of tokens after the parse.
• The most important example is the token id, where we need to
associate with the token a great deal of information. Normally,
information about an identifier, e.g., its lexeme, its type, and the
location at which it is first found is kept in the symbol table. Thus,
the appropriate attribute value for an identifier is a pointer to the
symbol-table entry for that identifier.
Example of Attributes for Tokens
• The token names and associated attribute values for the Fortran
statement: E = M * C ** 2
– <id, pointer to symbol-table entry for E>
– <assign op>
– <id, pointer to symbol-table entry for M>
– <mult op>
– <id, pointer to symbol-table entry for C>
– <exp op>
– <number, integer value 2>
• Note that in certain pairs, especially operators, punctuation, and
keywords, there is no need for an attribute value. In this example,
the token number has been given an integer-valued attribute.
Reading Ahead
• A lexical analyzer may need to read ahead some characters before it
can decide on the token to be returned to the parser.
• For example, a lexical analyzer for Java must read ahead after it sees
the character >. If the next character is =, then > is part of the
character sequence >=, the lexeme for the token for the “greater than
or equal to" operator. Otherwise > itself forms the “greater than"
operator, and the lexical analyzer has read one character too many.
• A general approach to reading ahead on the input is to maintain an
input buffer from which the lexical analyzer can read and push back
characters.
• The lexical analyzer reads ahead only when it must. An operator like
* can be identified without reading ahead. In such cases, the input
buffer is set to a blank, which will be skipped when the lexical
analyzer is called to find the next token.
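A sketch of the read/push-back interface described above (mine, assuming a simple one-character pushback; production scanners often use two-buffer schemes instead):

#include <stdio.h>

/* A one-character pushback buffer: the scanner reads ahead with nextc()
   and, having read one character too many, returns it with pushback(). */
static const char *input = "a >= b";
static int pos = 0;
static int pushed = -1;                 /* -1 means the pushback buffer is empty */

int nextc(void) {
    if (pushed >= 0) { int c = pushed; pushed = -1; return c; }
    return input[pos] ? input[pos++] : EOF;
}
void pushback(int c) { pushed = c; }

int main(void) {
    int c = nextc();
    while (c != EOF) {
        if (c == '>') {                 /* must read ahead to decide > vs >= */
            int d = nextc();
            if (d == '=') printf("GE\n");
            else { printf("GT\n"); pushback(d); }   /* read one char too many */
        }
        c = nextc();
    }
    return 0;
}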
Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero or more
symbols from the end of s. For example, ban, banana, and ɛ are
prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more
symbols from the beginning of s. For example, nana, banana, and ɛ
are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix
from s. For instance, banana, nan, and ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively, of s that are not ɛ and
not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s. For example, baan is a
subsequence of banana.
Specification of Tokens
• Regular expressions are an important notation for specifying
lexeme patterns. While they cannot express all possible patterns, they
are very effective in specifying those types of patterns that we
actually need for tokens.

[Table: algebraic laws for regular expressions, e.g. r|s = s|r, r|(s|t) = (r|s)|t, r(st) = (rs)t, r(s|t) = rs|rt, ɛr = rɛ = r, r** = r*.]

Regular definitions
• For notational convenience, we may wish to give names to certain
regular expressions and use those names in subsequent
expressions, as if the names were themselves symbols.
• If Σ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where
1. Each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
2. Each di is a new symbol, not in Σ and not the same as any other of
the d's.
• Regular definitions cannot be recursive:
digits → digit digits | digit   wrong!
Regular definitions Examples
[Figure: example regular definitions, e.g. letter → A | B | … | Z | a | … | z, digit → 0 | 1 | … | 9, id → letter (letter | digit)*.]
Notational Shorthand
• The following shorthands are often used:
r+ = rr*
r? = r | ɛ
[a-z] = a | b | c | … | z
• Examples:
• letter → [A-Za-z]
• digit → [0-9]
• num → digit+ (. digit+)? ( E (+|-)? digit+ )?
• [abcd] means (a | b | c | d)
• [b-g] means [bcdefg]
• [b-gM-Qkr] means [bcdefgMNOPQkr]
• M? means (M | ɛ), i.e., zero or one occurrence of M.
Transition Diagrams
• As an intermediate step in the construction of a lexical analyzer, we
first convert patterns into stylized flowcharts, called "transition
diagrams" (TDs), which are similar to DFAs.
• Differences between a TD and a DFA:
1. A DFA accepts or rejects a string. A TD reads characters until finding a
token, returns the token, and prepares the input buffer for the next call.
2. In a TD, there is no out-transition from accepting states.
3. A transition labeled other (or not labeled) should be taken on any
character except those labeling transitions out of a given state.
4. States can be marked with a *: this indicates states on which an input
retraction must take place.

Transition diagram Examples 1
• relop → < | <= | <> | > | >= | =
[Transition diagram for relop:]
start state 0: on < go to state 1; on = go to state 5; on > go to state 6
state 1: on = go to state 2, return(relop, LE); on > go to state 3, return(relop, NE);
         on other go to state 4*, return(relop, LT)
state 5: return(relop, EQ)
state 6: on = go to state 7, return(relop, GE); on other go to state 8*, return(relop, GT)

• Identifier: id → letter ( letter | digit )*
[Transition diagram for id:]
start state 9: on letter go to state 10
state 10: loop on letter or digit; on other go to state 11*, return(gettoken(), install_id())
Transition diagram Examples 2
• A transition diagram for unsigned numbers
[Figure: the transition diagram for unsigned numbers.]
• A transition diagram for whitespace
[Figure: the transition diagram for whitespace.]
One or more "whitespace" characters, represented by delim. These characters would be blank, tab, newline, and perhaps other characters that are not considered by the language design to be part of any token.
• delim → blank | tab | newline
• ws → (delim)+
Recognizing keywords
• Keywords: same pattern as identifiers but do not correspond to the
token ”identifier”.
Two solutions are possible:
• 1. Install the reserved words in the symbol table initially to know
whether the lexeme is an identifier or a keyword.
• We enter the strings if, then and else into the symbol table before
any characters in the input are seen.
• When a string is recognized by the TD:
• The symbol table is examined
• If the lexeme is found there marked as a keyword then the string
is a keyword else the string is an identifier
• 2. Create separate transition diagrams for each keyword
• A transition diagram for the keyword then
[Figure: start → t → h → e → n → other (nonletter/digit), with a retraction on the final state.]
Combined Finite Automata
[Figure: a single finite automaton combining the machines for several token classes.]
Keywords Finite Automata
This machine accepts the keywords: if, int, inline, for, float
[Figure: the corresponding finite automaton.]
Chapter 4

Syntax Analysis

Position of a Parser in the Compiler Model
• Syntax analysis or parsing is the second phase of a compiler.
• Parsing is the process of determining how a string of terminals can
be generated by a grammar.
• A syntax analyzer or parser takes the input from a lexical analyzer in
the form of token streams. The parser analyzes the source code
(token stream) against the production rules to detect any errors in
the code. The output of this phase is a parse tree.
• The parser accomplishes two tasks: parsing the code while looking
for errors, and generating a parse tree as the output of the phase.
• Parsers are expected to parse the whole code even if some errors
exist in the program. Parsers use error-recovery strategies.
Lexical Versus Syntax Analysis
Why use regular expressions to define the lexical syntax of a language?
 The lexical rules of a language are often quite simple, and to
describe them we do not need a notation as powerful as
grammars.
 Regular expressions generally provide a more concise and
easier-to-understand notation for tokens than grammars.

 Regular expressions are most useful for describing the


structure of constructs such as identifiers, constants,
keywords, and white space.
 Grammars are most useful for describing nested structures
such as balanced parentheses, matching begin-end's,
corresponding if-then-else's, and so on. These nested
structures cannot be described by regular expressions.
Common programming errors
Lexical errors:
include misspellings of identifiers, keywords, or operators e.g., the use
of an identifier elipseSize instead of ellipseSize.
Syntactic errors
include misplaced semicolons or extra or missing braces, that is, "{" or
"}". As another example, in C or Java, the appearance of a case
statement without an enclosing switch is a syntactic error.
Semantic errors
include type mismatches between operators and operands, e.g., the
return of a value in a method with result type void.
Logical errors
can be anything from incorrect reasoning on the part of the
programmer to the use in a C program of the assignment operator =
instead of the comparison operator ==.
Error-recovery strategies
➢ Panic mode recovery
• Discard input symbols one at a time until one of a designated
set of synchronization tokens is found
➢ Phrase level recovery
• Replacing a prefix of the remaining input by some string that
allows the parser to continue
➢ Error productions
• Augment the grammar with productions that generate the
erroneous constructs
➢ Global correction
• Choosing a minimal sequence of changes to obtain a globally
least-cost correction
Context-free grammars
➢ Terminals
➢ Nonterminals
➢ Start symbol
➢ Productions:
expression -> expression + term
expression -> expression – term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> (expression)
factor -> id
Derivations & Parse trees
➢ Productions are treated as rewriting rules to generate a string
➢ Rightmost and leftmost derivations
➢ E -> E + E | E * E | -E | (E) | id
➢ Derivations for –(id+id):
E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
Ambiguity
➢ For some strings there exists more than one parse tree
➢ Or more than one leftmost derivation
➢ Or more than one rightmost derivation
➢ Example: id+id*id
[Figure: two distinct parse trees for id+id*id.]
Ambiguity
Example: if E1 then if E2 then S1 else S2 has two parse trees.
[Figure: the two parse trees for the dangling-else sentence.]
In all programming languages with conditional statements of this form,
the first parse tree is preferred. The general rule is: "Match each else
with the closest unmatched then."
Elimination of ambiguity
➢ Idea:
• A statement appearing between a then and an else must be
matched
[Figure: the unambiguous grammar for if-then-else statements, distinguishing matched and unmatched statements.]
Left recursion
➢ A grammar is left recursive if it has a non-terminal A
such that there is a derivation A =>+ Aα
➢ A simple rule for direct left recursion elimination:
• For a rule like:
A -> A α | β
• We may replace it with
A -> β A'
A' -> α A' | ɛ
[Figure: a left-recursive grammar and the equivalent non-left-recursive grammar.]
Left recursion elimination
[Figure: the general algorithm: order the nonterminals A1 … An; for each i, substitute the Aj-productions (j < i) into productions of the form Ai -> Aj γ, then eliminate the immediate left recursion among the Ai-productions.]
Example Left Recursion Elimination
A → B C | a
B → C A | A b          Choose arrangement: A, B, C
C → A B | C C | a

i = 1: nothing to do
i = 2, j = 1: B → C A | A b
  => B → C A | B C b | a b
(imm) B → C A BR | a b BR
      BR → C b BR | ɛ
i = 3, j = 1: C → A B | C C | a
  => C → B C B | a B | C C | a
i = 3, j = 2: C → B C B | a B | C C | a
  => C → C A BR C B | a b BR C B | a B | C C | a
(imm) C → a b BR C B CR | a B CR | a CR
      CR → A BR C B CR | C CR | ɛ
Left factoring
➢ When a nonterminal has two or more productions whose right-hand
sides start with the same grammar symbols, the grammar is not LL(1)
and cannot be used for predictive parsing.
➢ Left factoring is a grammar transformation that is useful for
producing a grammar suitable for predictive or top-down parsing.
➢ It is a way of delaying the decision until more information is available.
➢ Consider the following grammar:
Stmt -> if expr then stmt else stmt
     | if expr then stmt
➢ On seeing input if, it is not clear to the parser which production
to use.
➢ We can easily perform left factoring:
if we have A -> αβ1 | αβ2, then we replace it with
A -> αA'
A' -> β1 | β2
Left factoring (cont.)
➢ Algorithm
• For each non-terminal A, find the longest prefix α common
to two or more of its alternatives. If α ≠ ɛ, then replace all
of the A-productions A -> αβ1 | αβ2 | … | αβn | γ by
A -> αA' | γ
A' -> β1 | β2 | … | βn

• Example:
S -> i E t S | i E t S e S | a
E -> b
• Left-factored, this grammar becomes:
S -> i E t S S' | a
S' -> e S | ɛ
E -> b
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from
lexical analyzers. Lexical analyzers are responsible for the validity of the
tokens supplied to the syntax analyzer. Syntax analyzers have the
following limitations:
• they cannot determine if a token is valid,
• they cannot determine if a token is declared before it is being used,
• they cannot determine if a token is initialized before it is being used,
• they cannot determine if an operation performed on a token type is
valid or not.

These tasks are accomplished by the semantic analyzer, which we shall


study in Semantic Analysis.

Parsing Techniques
Top-down parsers (LL(1), recursive descent)
➢ Start at the root of the parse tree from the start symbol and
grow toward the leaves (similar to a derivation)
➢ Pick a production and try to match the input
➢ A bad "pick" may require backtracking
➢ Some grammars are backtrack-free (predictive parsing)
Parsing Techniques
Bottom-up parsers (LR(1), operator precedence)
➢ Start at the leaves and grow toward the root
➢ The process can be seen as reducing the input string to the start symbol
➢ At each reduction step a particular substring matching the
right side of a production is replaced by the symbol on the left
side of the production
➢ Bottom-up parsers handle a large class of grammars
Recursive descent parsing : It is a common form of top-down
parsing. It is called recursive, as it uses recursive procedures to process
the input. Recursive descent parsing suffers from backtracking
Backtracking : It means, if one derivation of a production fails, the
syntax analyzer restarts the process using different rules of same
production. This technique may process the input string more than once
to determine the right production.
Top Down Parsing
➢ A top-down parser tries to create a parse tree from the root
toward the leaves, scanning the input from left to right
➢ It can also be viewed as finding a leftmost derivation for an
input string
➢ Example: id+id*id
At each step of a top-down parse, the key problem is that of
determining the production to be applied for a nonterminal, say A. Once
an A-production is chosen, the rest of the parsing process consists of
"matching" the terminal symbols in the production body with the input
string.
E -> TE'
E' -> +TE' | Ɛ
T -> FT'
T' -> *FT' | Ɛ
F -> (E) | id
[Figure: the successive parse trees of the top-down parse of id+id*id, expanding the leftmost (lm) nonterminal at each step.]
Recursive descent parsing
➢ Consists of a set of procedures, one for each nonterminal
➢ Execution begins with the procedure for the start symbol
➢ A typical procedure for a non-terminal:
void A() {
choose an A-production, A->X1X2..Xk
for (i=1 to k) {
if (Xi is a nonterminal) call procedure Xi();
else if (Xi equals the current input symbol a)
advance the input to the next symbol;
else /* an error has occurred */
}
}

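For the non-left-recursive expression grammar shown above (E -> TE', E' -> +TE' | Ɛ, T -> FT', T' -> *FT' | Ɛ, F -> (E) | id), the procedures can be written directly. A runnable sketch of mine, with single-character tokens and 'i' standing for id:

#include <stdio.h>
#include <stdlib.h>

/* Recursive-descent parser for:
   E -> T E'    E' -> + T E' | eps
   T -> F T'    T' -> * F T' | eps
   F -> ( E ) | id            ('i' stands for id) */
static const char *in;

static void error(void) { printf("syntax error at '%c'\n", *in); exit(1); }
static void match(char t) { if (*in == t) in++; else error(); }

static void E(void);
static void F(void)  { if (*in == '(') { match('('); E(); match(')'); }
                       else match('i'); }
static void Tp(void) { if (*in == '*') { match('*'); F(); Tp(); } }  /* else eps */
static void T(void)  { F(); Tp(); }
static void Ep(void) { if (*in == '+') { match('+'); T(); Ep(); } }  /* else eps */
static void E(void)  { T(); Ep(); }

int main(void) {
    in = "i+i*i";                       /* i.e., id+id*id */
    E();
    if (*in == '\0') printf("accepted\n"); else error();
    return 0;
}

Because this grammar is left-factored and free of left recursion, each procedure decides what to do from the current input symbol alone; no backtracking is ever needed.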
Recursive descent parsing (backtracking)
➢ General recursive descent may require backtracking
➢ The previous code needs to be modified to allow backtracking
➢ So we need to try all alternatives: if one fails, the input pointer
needs to be reset and another alternative should be tried
➢ Recursive descent parsers cannot be used for left-recursive grammars
Backtracking Example
Grammar: S -> cAd, A -> ab | a.   Input: cad
Step 1: Build the tree from the start symbol S, with children c, A, d. The
first input symbol c matches the first leaf.
Step 2: Expand A using the first alternative, A -> ab. Now we have a
match for the second input symbol a, so we advance the input pointer
to d, the third input symbol, and compare d against the next leaf b.
Step 3 (backtracking): Since b does not match d, we report failure and go
back to A to see whether there is another alternative for A that has not
been tried and might produce a match. In going back to A, we must
reset the input pointer to a. Expanding A by the second alternative,
A -> a, the remaining leaves a and d match the input, and the parse
succeeds.
Predictive Parsing
• Recursive descent is a top-down parsing technique that constructs the
parse tree from the top and the input is read from left to right.
• It uses procedures for every terminal and non-terminal entity.
• This parsing technique recursively parses the input to make a parse
tree, which may or may not require back-tracking. But the grammar
associated with it (if not left factored) cannot avoid back-tracking.
• A predictive parsing is a form of recursive-descent parsing that
does not require any back-tracking and has the capability to predict
which production is to be used to replace the input string.
• To accomplish its tasks, the predictive parser uses a look-ahead
pointer, which points to the next input symbols. To make the parser
back-tracking free, the predictive parser puts some constraints on the
grammar and accepts only a class of grammar known as LL(k)
grammar.
First Set
➢ First(α) is the set of terminals that begin strings derived from α,
i.e., a ∈ First(α) iff α =>* aβ for some β.
➢ In predictive parsing, when we have A -> α | β, if First(α) and
First(β) are disjoint sets then we can select the appropriate A-
production by looking at the next input
➢ To compute First(X) for all grammar symbols X, apply the following
rules until no more terminals or ɛ can be added to any First set:
1. If X is a terminal then First(X) = {X}.
2. If X is a nonterminal and X -> Y1Y2…Yk is a production for some
k >= 1, then place a in First(X) if for some i, a is in First(Yi) and
ɛ is in all of First(Y1), …, First(Yi-1), that is, Y1…Yi-1 =>* ɛ. If ɛ is in
First(Yj) for all j = 1, …, k, then add ɛ to First(X). If Y1 does not
derive ɛ, then we add nothing more to First(X), but if Y1 =>* ɛ,
then we add First(Y2), and so on.
3. If X -> ɛ is a production then add ɛ to First(X)
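These three rules form a fixed-point computation: keep applying them until no set grows. A self-contained C sketch of mine, hard-coding grammar G1 from the examples below (S -> aABb, A -> c | ɛ, B -> d | ɛ; the character 'e' marks ɛ in the sets):

#include <stdio.h>

#define NPROD 5
/* G1: S -> aABb, A -> c, A -> eps, B -> d, B -> eps ("" body = eps) */
static const char  lhs[NPROD]  = { 'S', 'A', 'A', 'B', 'B' };
static const char *rhs[NPROD]  = { "aABb", "c", "", "d", "" };

static int first[128][128];            /* first[X][a] = 1 iff a is in FIRST(X) */

static int add(int X, int a) {         /* returns 1 if the set grew */
    if (first[X][a]) return 0;
    first[X][a] = 1;
    return 1;
}

int main(void) {
    for (int c = 'a'; c <= 'd'; c++) first[c][c] = 1;   /* rule 1: terminals */
    int changed = 1;
    while (changed) {                                   /* iterate to a fixed point */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            const char *b = rhs[p];
            if (*b == '\0') { changed |= add(lhs[p], 'e'); continue; }  /* rule 3 */
            int all_eps = 1;                            /* rule 2: scan Y1 Y2 ... */
            for (int i = 0; b[i] && all_eps; i++) {
                for (int a = 0; a < 128; a++)
                    if (a != 'e' && first[(int)b[i]][a])
                        changed |= add(lhs[p], a);
                all_eps = first[(int)b[i]]['e'];        /* continue only if Yi nullable */
            }
            if (all_eps) changed |= add(lhs[p], 'e');
        }
    }
    for (const char *X = "SAB"; *X; X++) {              /* FIRST(S)={a}, FIRST(A)={c,e}, FIRST(B)={d,e} */
        printf("FIRST(%c) = {", *X);
        for (int a = 0; a < 128; a++) if (first[(int)*X][a]) printf(" %c", a);
        printf(" }\n");
    }
    return 0;
}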
Follow Set
➢ Follow(A), for any nonterminal A, is the set of terminals a that
can appear immediately after A in some sentential form:
if we have S =>* αAaβ for some α and β, then a is in Follow(A)
➢ To compute Follow(A) for all nonterminals A, apply the
following rules until nothing can be added to any Follow set:
1. Place $ in Follow(S), where S is the start symbol
2. If there is a production A -> αBβ, then everything in
First(β) except ɛ is in Follow(B).
3. If there is a production A -> αB, or a production A -> αBβ
where First(β) contains ɛ, then everything in Follow(A) is
in Follow(B)
First and Follow Examples
G1: S → a A B b           First      Follow
    A → c | ɛ          S  {a}        {$}
    B → d | ɛ          A  {c, ɛ}     {d, b}
                       B  {d, ɛ}     {b}

G2: S → a B D h           First      Follow
    B → c C            S  {a}        {$}
    C → b c | ɛ        B  {c}        {g, f, h}
    D → E F            C  {b, ɛ}     {g, f, h}
    E → g | ɛ          D  {g, f, ɛ}  {h}
    F → f | ɛ          E  {g, ɛ}     {f, h}
                       F  {f, ɛ}     {h}

G3: S → X b               First      Follow
    X → a X d | ɛ      S  {a, b}     {$}
                       X  {a, ɛ}     {d, b}
LL(1) Grammars
➢ Grammars for which we can create predictive parsers are called LL(1)
➢ The first L means scanning the input from left to right
➢ The second L means a leftmost derivation
➢ And 1 stands for using one input symbol of lookahead
➢ No left-recursive or ambiguous grammar can be LL(1)
➢ A grammar G is LL(1) if, whenever A -> α | β are two distinct
productions of G, the following conditions hold:
1. For no terminal a do α and β both derive strings beginning with a
2. At most one of α or β can derive the empty string
3. If α =>* ɛ, then β does not derive any string beginning with a
terminal in Follow(A).
The first two conditions are equivalent to FIRST(α) ∩ FIRST(β) = ∅.
The third condition is equivalent to stating that if ɛ is in FIRST(β),
then FIRST(α) ∩ FOLLOW(A) = ∅, and likewise if ɛ is in FIRST(α).
Non-LL(1) Examples

Grammar          Not LL(1) because:
S → S a | a      Left recursive
S → a S | a      FIRST(a S) ∩ FIRST(a) ≠ ∅
S → a R | ɛ
R → S | ɛ        For R: S =>* ɛ and ɛ =>* ɛ
S → a R a
R → S | ɛ        For R: FIRST(S) ∩ FOLLOW(R) ≠ ∅
Construction of predictive parsing table
The next algorithm collects the information from the FIRST and FOLLOW sets
into a predictive parsing table M[A, a]. It is based on the following idea:
• the production A → α is chosen if the next input symbol a is in FIRST(α).
The only complication occurs when α = ɛ or, more generally, α =>* ɛ. In this
case, we choose A → α if the current input symbol is in Follow(A), or if $ on
the input has been reached and $ is in Follow(A).
➢ Algorithm 4.31: Construction of a predictive parsing table.
➢ INPUT: Grammar G.
➢ OUTPUT: Parsing table M
➢ For each production A → α in the grammar, do the following:
1. For each terminal a in First(α), add A → α to M[A, a]
2. If ɛ is in First(α), then for each symbol b in Follow(A), add
A → α to M[A, b].
➢ If after performing the above there is no production in M[A, a],
then set M[A, a] to error (which we normally represent by an
empty entry in the table).
Parsing table Example
E -> TE'
E' -> +TE' | Ɛ
T -> FT'
T' -> *FT' | Ɛ
F -> (E) | id

        First      Follow
E       {(, id}    {), $}
E'      {+, ɛ}     {), $}
T       {(, id}    {+, ), $}
T'      {*, ɛ}     {+, ), $}
F       {(, id}    {+, *, ), $}

Non-          Input Symbol
terminal   id          +            *            (          )         $
E          E -> TE'                              E -> TE'
E'                     E' -> +TE'                           E' -> Ɛ   E' -> Ɛ
T          T -> FT'                              T -> FT'
T'                     T' -> Ɛ      T' -> *FT'              T' -> Ɛ   T' -> Ɛ
F          F -> id                               F -> (E)


LL(1) Grammars are Unambiguous
Algorithm 4.31 can be applied to any grammar G to produce a parsing
table M. For every LL(1) grammar, each parsing-table entry uniquely
identifies a production or an error. For some grammars, M may have
some entries that are multiply defined. For example, if G is left-
recursive or ambiguous, then M will have at least one multiply defined
entry.

Ambiguous grammar:
S → i E t S SR | a
SR → e S | ɛ
E → b

Production        FIRST(α)   FOLLOW(A)
S → i E t S SR    i          e, $
S → a             a          e, $
SR → e S          e          e, $
SR → ɛ            ɛ          e, $
E → b             b          t

        a       b       e                   i                t   $
S       S → a                               S → i E t S SR
SR                      SR → ɛ                                   SR → ɛ
                        SR → e S
E               E → b

Error: duplicate table entry at M[SR, e].
Non-recursive predictive parsing
A nonrecursive predictive parser can be built by maintaining a stack
explicitly, rather than implicitly via recursive calls. The parser mimics a
leftmost derivation. If w is the input that has been matched so far, then
the stack holds a sequence of grammar symbols α such that
S =>* wα

[Figure: model of a table-driven predictive parser: an input buffer a + b $, a stack X Y Z $, the predictive parsing program, the parsing table M, and the output.]
Predictive parsing algorithm

[Figure: the table-driven predictive parsing algorithm.]
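The driver itself is short: look at the top of the stack; if it is a terminal, match it against the input; if it is a nonterminal X, consult M[X, a] and replace X by the chosen production body. A runnable C sketch of mine, hard-coding the parsing table of LL(1) Example 3 below (S -> aABC, A -> a | bb, B -> a | ɛ, C -> b | ɛ):

#include <stdio.h>
#include <string.h>

/* M[X][a] returns the production body to push ("" means eps, NULL = error). */
static const char *M(char X, char a) {
    switch (X) {
    case 'S': return a == 'a' ? "aABC" : NULL;
    case 'A': return a == 'a' ? "a" : a == 'b' ? "bb" : NULL;
    case 'B': return a == 'a' ? "a" : (a == 'b' || a == '$') ? "" : NULL;
    case 'C': return a == 'b' ? "b" : a == '$' ? "" : NULL;
    }
    return NULL;
}

int main(void) {
    const char *w = "abbb$";
    char stack[64] = "$S";                            /* $ at bottom, start symbol on top */
    int top = 1, i = 0;
    while (stack[top] != '$') {
        char X = stack[top];
        if (X == w[i]) { top--; i++; }                /* match a terminal */
        else if (X >= 'A' && X <= 'Z') {              /* expand a nonterminal */
            const char *body = M(X, w[i]);
            if (!body) { printf("error\n"); return 1; }
            printf("%c -> %s\n", X, *body ? body : "eps");
            top--;                                    /* pop X ...              */
            for (int k = (int)strlen(body) - 1; k >= 0; k--)
                stack[++top] = body[k];               /* ... push body reversed */
        } else { printf("error\n"); return 1; }
    }
    puts(w[i] == '$' ? "accepted" : "error");
    return 0;
}

On input abbb$ it prints the productions S -> aABC, A -> bb, B -> eps, C -> b and accepts, matching the trace in Example 3.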
LL(1) Example 1: parse the string “id+id*id”

[Figure: the move-by-move trace, using the parsing table built above.]
LL(1) Example 2: parse the string “abba”
        First      Follow
S       {a}        {$}
B       {b, ɛ}     {a}
[Figure: the parsing table and trace for input abba.]
LL(1) Example 3: parse the string "abbb"
S → a A B C            First      Follow
A → a | bb          S  {a}        {$}
B → a | ɛ           A  {a, b}     {a, b, $}
C → b | ɛ           B  {a, ɛ}     {b, $}
                    C  {b, ɛ}     {$}

Parsing table:
        a           b        $
S       S → a ABC
A       A → a       A → bb
B       B → a       B → Ɛ    B → Ɛ
C                   C → b    C → Ɛ

matched   stack     input    action
          S$        abbb$
          aABC$     abbb$    S → a ABC
a         ABC$      bbb$     match a
a         bbBC$     bbb$     A → bb
abb       BC$       b$       match bb
abb       C$        b$       B → Ɛ
abb       b$        b$       C → b
abbb      $         $        match b, accept
LL(1) Example 4: parse the string: “int id,id;”
S → TL;
T → int | float
L → L , id | id
[Figure: the First/Follow sets, parsing table, and trace for input int id,id;.]
Error recovery in predictive parsing
➢ Panic mode
• Place all symbols in Follow(A) into the synchronization set for
nonterminal A: skip tokens until an element of Follow(A) is
seen, and pop A from the stack.
• Add to the synchronization set of a lower-level construct the
symbols that begin higher-level constructs
• Add the symbols in First(A) to the synchronization set of
nonterminal A
• If a nonterminal can generate the empty string, then the
production deriving ɛ can be used as a default
• If a terminal on top of the stack cannot be matched, pop the
terminal and issue a message saying that the terminal was
inserted
Example (parsing table with synch entries):
Non-          Input Symbol
terminal   id          +            *            (          )         $
E          E -> TE'                              E -> TE'   synch     synch
E'                     E' -> +TE'                           E' -> Ɛ   E' -> Ɛ
T          T -> FT'    synch                     T -> FT'   synch     synch
T'                     T' -> Ɛ      T' -> *FT'              T' -> Ɛ   T' -> Ɛ
F          F -> id     synch        synch        F -> (E)   synch     synch
Exercises
Exercise 4.4.1 : For each of the following grammars, devise predictive
parsers and show the parsing tables. You may left-factor and/or eliminate
left-recursion from your grammars first.
a) S → 0 S 1 | 01
b) S → + S S | * S S | a
c) S → S (S) S | ɛ
d) S → S + S | S S | (S) | S * | a
e) S → ( L ) | a and L → L , S | S
f) S → a S b S | b S a S | ɛ
Solution of 4.4.1 (d)
[Figure: the worked solution, not recoverable from the text.]
Exercises
Exercise 4.4.3 : Compute FIRST and FOLLOW for the grammars of
Exercise 4.2.2.

Bottom-up parsing starts from the leaf nodes of a tree and works in an
upward direction till it reaches the root node. Here, we start from a
sentence and then apply the production rules in reverse in
order to reach the start symbol.
[Figure: taxonomy of the available bottom-up parsers.]
Example: id*id
Grammar:
E -> E + T | T
T -> T * F | F
F -> (E) | id
Sequence of sentential forms during the bottom-up parse:
id * id  =>  F * id  =>  T * id  =>  T * F  =>  T  =>  E
[Figure: the corresponding sequence of partial parse trees.]
Shift-reduce parser
➢ The general idea is to shift some symbols of input onto the
stack until a reduction can be applied
➢ At each reduction step, a specific substring matching the
body of a production is replaced by the nonterminal at the
head of the production
➢ The key decisions during bottom-up parsing are about
when to reduce and about what production to apply
➢ A reduction is the reverse of a step in a derivation
➢ The goal of a bottom-up parser is to construct a derivation
in reverse:
E => T => T*F => T*id => F*id => id*id
Handle pruning
➢ A handle is a substring that matches the body of a production
and whose reduction represents one step along the reverse of a
rightmost derivation.
[Figure: the handles during a parse of id1 * id2.]
Example: input abbcde
Grammar:             Reduction sequence (the handle is reduced at each step):
S → aABe             abbcde
A → Abc | b          aAbcde
B → d                aAde
                     aABe
                     S
Shift reduce parsing
➢ Consists of:
• a stack used to hold grammar symbols, and
• an input buffer holding the rest of the string to be parsed
➢ The handle always appears on top of the stack
➢ Initial configuration:         Acceptance configuration:
   Stack   Input                 Stack   Input
   $       w$                    $S      $
➢ Basic operations:
1. Shift: shift the next input symbol onto the top of the stack.
2. Reduce: replace the handle on the top of the stack by the
corresponding non-terminal.
3. Accept: announce successful completion of parsing.
4. Error: discover a syntax error and call an error recovery
routine
Shift reduce parsing Example 1
Grammar:
E → E + E
E → E * E
E → (E)
E → id

Input: id+id*id

Stack      Input        Action
$          id+id*id$    shift
$id        +id*id$      reduce E → id
$E         +id*id$      shift
$E+        id*id$       shift
$E+id      *id$         reduce E → id
$E+E       *id$         shift (or reduce? how to resolve conflicts?)
$E+E*      id$          shift
$E+E*id    $            reduce E → id
$E+E*E     $            reduce E → E * E
$E+E       $            reduce E → E + E
$E         $            accept (found handles to reduce)
Shift reduce parsing Example 2
Input: id+id*id
[Figure: the trace, not recoverable from the text.]
Conflicts during shift reduce parsing
➢ There are grammars for which shift-reduce parsing cannot
be used.
➢ For such a grammar, every shift-reduce parser can reach a
configuration in which it cannot decide whether to shift or to
reduce (a shift/reduce conflict), or cannot decide which of several
reductions to make (a reduce/reduce conflict).
➢ These grammars are not in the LR(k) class of grammars.
shift/reduce conflict
➢ Example: an ambiguous grammar can never be LR

Stack                     Input
… if expr then stmt       else …$

We cannot tell whether if expr then stmt is the handle, no matter
what appears below it on the stack. Here there is a shift/reduce
conflict. Depending on what follows the else on the input, it might be
correct to reduce if expr then stmt to stmt, or it might be correct to
shift else and then to look for another stmt to complete the alternative
if expr then stmt else stmt.
Reduce-Reduce Conflicts
Grammar:
C → A B
A → a
B → a

Stack   Input   Action
$       aa$     shift
$a      a$      reduce A → a or B → a ?

Resolve in favor of reducing A → a,
otherwise we're stuck!
LR Parsing
➢ The most prevalent type of bottom-up parser today is based on LR(k).
➢ The k is the number of input symbols of lookahead that are used in
making parsing decisions.
➢ The cases k = 0 and k = 1 are of practical interest, and we shall only
consider LR parsers with k <= 1 here.
➢ When (k) is omitted, k is assumed to be 1.
➢ Why LR parsers?
• Table driven, much like the nonrecursive LL parsers
• Can be constructed to recognize virtually all programming-language
constructs
• The most general non-backtracking shift-reduce parsing method
• Can detect a syntactic error as soon as it is possible to do so
States of an LR parser
➢ How does a shift-reduce parser know when to shift and when to
reduce? An LR parser makes shift-reduce decisions by
maintaining states to keep track of where we are in a parse.
➢ An LR(0) item of G is a production of G with a dot at some
position of the body. For A -> XYZ we have the following items:
A -> •XYZ
A -> X•YZ
A -> XY•Z
A -> XYZ•
➢ The production A -> ɛ generates only one item, A -> •.
➢ Item A -> •XYZ indicates that we hope to see a string derivable
from XYZ next on the input.
➢ Item A -> X•YZ indicates that we have just seen on the input a
string derivable from X and that we hope next to see a string
derivable from YZ.
➢ Item A -> XYZ• indicates that we have seen the body XYZ and
that it may be time to reduce XYZ to A.
Canonical LR(0) item sets
➢ One collection of sets of LR(0) items, called the canonical LR(0)
collection, provides the basis for constructing a deterministic
finite automaton that is used to make parsing decisions.
➢ Such an automaton is called an LR(0) automaton.
➢ In particular, each state of the LR(0) automaton represents a set
of items in the canonical LR(0) collection.
➢ To construct the canonical LR(0) collection for a grammar, we
define an augmented grammar and two functions, CLOSURE
and GOTO.
➢ Augmented grammar:
• G with the addition of a production: S' -> S
• The purpose of this new starting production is to indicate to
the parser when it should stop parsing and announce
acceptance of the input.
Closure of item sets
➢ Closure of item sets:
• If I is a set of items, closure(I) is the set of items constructed
from I by the following rules:
• Add every item in I to closure(I)
• If A -> α•Bβ is in closure(I) and B -> γ is a production, then
add the item B -> •γ to closure(I), if it is not already there.
➢ Example:
Grammar:
E' → E
E → E + T | T
T → T * F | F
F → (E) | id

closure({[E' → •E]}) =      closure({[E → E+•T]}) =
{ E' → •E                   { E → E+•T
  E → •E+T                    T → •T*F
  E → •T                      T → •F
  T → •T*F                    F → •(E)
  T → •F                      F → •id }
  F → •(E)
  F → •id }
The Goto function
➢ If I is an item set and X is a grammar symbol, then:
Goto(I, X) = closure of the set of all items [A -> αX•β] such that
[A -> α•Xβ] is in I.

Ex1: Suppose I = { [E' → •E], [E → •E+T], [E → •T],
[T → •T*F], [T → •F], [F → •(E)], [F → •id] }.
Then goto(I, E) = closure({[E' → E•], [E → E•+T]})
= { [E' → E•], [E → E•+T] }
The Goto function

[Figure: the GOTO function illustrated on the LR(0) automaton.]
Canonical LR(0) items

Void items(G’) {
C= CLOSURE({[S’-> • S]});
repeat
for (each set of items I in C)
for (each grammar symbol X)
if (GOTO(I,X) is not empty and not in C)
add GOTO(I,X) to C;
until no new set of items are added to C on a round;
}

112
Canonical LR(0) items - Example
Augmented grammar:
1. C' → C
2. C → A B
3. A → a
4. B → a

I0 = closure({[C' → •C]}):   C' → •C, C → •A B, A → •a
I1 = goto(I0, C):            C' → C•   (final)
I2 = goto(I0, A):            C → A•B, B → •a
I3 = goto(I0, a):            A → a•
I4 = goto(I2, B):            C → A B•
I5 = goto(I2, a):            B → a•

[Figure: the LR(0) automaton: start state 0, with transitions 0 →C→ 1, 0 →A→ 2, 0 →a→ 3, 2 →B→ 4, 2 →a→ 5.]
LR-Parsing model
• LR parser consists of an input, an output, a stack, a driver program,
and a parsing table that has two parts (ACTION and GOTO).
• The driver program is the same for all LR parsers; only the
parsing table changes from one parser to another.
• Where a shift-reduce parser would shift a symbol, an LR parser
shifts a state.

LR(0) parsing
➢ A shift item is an item with •a for a terminal a. It says that a must be
shifted onto the stack if it appears as the next input symbol.
➢ A reduce item is an item of the form A → α•. It indicates that,
when this state is reached, the production A → α
should be reduced.
➢ Reducing by the item S' → S• accepts the input string.
➢ LR(0) parsing requires that each of these steps be uniquely
determined by the LR(0) machine and the input. Therefore, if a
state has a reduce item, it must not have any other reduce items
or shift items.
➢ With this restriction, the current state determines whether to
shift or reduce, and which production to reduce by, without
looking at the next input. If it shifts, it reads the next input
to see which state to shift to.
LR(0) parsing Example 1
Grammar:
1. S → (S)
2. S → a

[Figure: the LR(0) automaton for this grammar.]

        Action                     Goto
State   (     )     a     $        S
0       s2          s5             1
1                         accept
2       s2          s5             3
3             s4
4       r1    r1    r1    r1
5       r2    r2    r2    r2
LR parsing algorithm
INPUT: input string w and an LR-parsing table with ACTION and GOTO functions
OUTPUT: If w is in L(G), the reduction steps of a bottom-up parse for w;
otherwise, an error indication.
let a be the first symbol of w$;
while(1) { /*repeat forever */
let s be the state on top of the stack;
if (ACTION[s,a] = shift t) {
push t onto the stack;
let a be the next input symbol;
} else if (ACTION[s,a] = reduce A->β) {
pop |β| symbols of the stack;
let state t now be on top of the stack;
push GOTO[t,A] onto the stack;
output the production A->β;
} else if (ACTION[s,a]=accept) break; /* parsing is done */
else call error-recovery routine;
}
LR(0) parsing Example: input ((a))
Stack    Symbols   Input     Action
0        $         ((a))$    Shift to 2
02       $(        (a))$     Shift to 2
022      $((       a))$      Shift to 5
0225     $((a      ))$       Reduce by S → a
0223     $((S      ))$       Shift to 4
02234    $((S)     )$        Reduce by S → (S)
023      $(S       )$        Shift to 4
0234     $(S)      $         Reduce by S → (S)
01       $S        $         accept

Let's look at the reductions by S → (S) in more detail. When the first such reduction
occurs, the stack is 02234; three states are popped off (because the length of "(S)"
is 3), leaving a stack of 02. There is a transition from the top state, 2, on S to state 3,
so we push a 3, leaving 023 on the stack. The second time we reduce by S → (S), the
stack is 0234. When three states are popped, this leaves a stack with just 0 on it.
There is a transition from state 0 to state 1 on S, so the new stack is 01.
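The table and driver above fit in a few dozen lines of C. This sketch (my own encoding of that ACTION/GOTO table, not code from the course) reproduces the trace for ((a))$:

#include <stdio.h>

/* LR driver specialized to the table above for 1) S -> (S)  2) S -> a */
enum { SHIFT, REDUCE, ACCEPT, ERROR };
typedef struct { int kind, arg; } Act;

static Act action(int s, char a) {
    Act err = {ERROR, 0};
    switch (s) {
    case 0: case 2:                                   /* s2 on '(', s5 on 'a' */
        if (a == '(') return (Act){SHIFT, 2};
        if (a == 'a') return (Act){SHIFT, 5};
        return err;
    case 1: return a == '$' ? (Act){ACCEPT, 0} : err;
    case 3: return a == ')' ? (Act){SHIFT, 4} : err;
    case 4: return (Act){REDUCE, 1};   /* LR(0): reduce on every lookahead */
    case 5: return (Act){REDUCE, 2};
    }
    return err;
}

int main(void) {
    const char *w = "((a))$";
    int stack[64], top = 0, i = 0;
    int bodylen[] = { 0, 3, 1 };       /* |( S )| = 3, |a| = 1 */
    stack[0] = 0;
    for (;;) {
        Act act = action(stack[top], w[i]);
        if (act.kind == SHIFT) { stack[++top] = act.arg; i++; }
        else if (act.kind == REDUCE) {
            top -= bodylen[act.arg];           /* pop |body| states          */
            int t = stack[top];                /* exposed state              */
            stack[++top] = (t == 0) ? 1 : 3;   /* GOTO[t, S]: 0 -> 1, 2 -> 3 */
            printf("reduce by production %d\n", act.arg);
        }
        else if (act.kind == ACCEPT) { puts("accept"); return 0; }
        else { puts("error"); return 1; }
    }
}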
Not LR(0) Parsing Example 1
0) E' → E
1) E → E + T
2) E → T
3) T → T * F
4) T → F
5) F → (E)
6) F → id

STATE   ACTION                                      GOTO
        id    +     *       (     )     $           E   T   F
0       s5                  s4                      1   2   3
1             s6                        acc
2       r2    r2    r2/s7   r2    r2    r2
3       r4    r4    r4      r4    r4    r4
4       s5                  s4                      8   2   3
5       r6    r6    r6      r6    r6    r6
6       s5                  s4                          9   3
7       s5                  s4                              10
8             s6                  s11
9       r1    r1    r1/s7   r1    r1    r1
10      r3    r3    r3      r3    r3    r3
11      r5    r5    r5      r5    r5    r5
Not LR(0) Parsing Example 2
Grammar:
Example of a CFG that is not LR(0)
0. S’ → S
1. S → A a
2. S → Bb
3. S → a c
4. A → a
5. B → a

[Figure: the LR(0) automaton; state 6, reached on a, contains the items S → a•c, A → a•, and B → a•.]
Not LR(0) Parsing (shift/reduce or reduce/reduce conflicts)

        Action                                       Goto
State   a        b        c           $              S   A   B
0       s6                                           1   2   4
1                                     accept
2       s3
3       r1       r1       r1          r1
4                s5
5       r2       r2       r2          r2
6       r4/r5    r4/r5    s7/r4/r5    r4/r5
7       r3       r3       r3          r3
Not LR(0) Parsing
• The machine is not LR(0) because of shift/reduce and
reduce/reduce conflicts in state 6 (there is a shift item and two
reduce items in the state, so the parser doesn't know whether to shift
or reduce, and if it decided to reduce, anyway, it wouldn't know
which production to reduce). Hence, this grammar is not LR(0).
• However, if we allowed the parser to base its choice on the next input
symbol, the correct choice could be made reliably. If you examine the
grammar carefully, you can see that A → a should only be reduced
when the next input is a, B → a should only be reduced when the next
input is b, and, if the next input is c, the parser should shift.
• How could we determine this algorithmically?
• The next three parsing algorithms all do it in different ways.
• The simplest method is SLR(1) parsing, which uses FOLLOW sets to
compute lookaheads for actions.
SLR Grammars Concept
➢ SLR (Simple LR) is a simple extension of LR(0) parsing
➢ SLR eliminates some conflicts by populating the parsing table with
reductions A → α only on symbols in FOLLOW(A)

Grammar:
1. S → E
2. E → id + E
3. E → id

State I0: S → •E, E → •id + E, E → •id
goto(I0, id) = State I2: E → id•+ E (shift on +), E → id• (reduce)
In state I2 the LR(0) conflict disappears: FOLLOW(E) = {$}, thus reduce on $ only.

        id    +     $      E
0       s2                 1
1                   acc
2             s3    r3
3       s2                 4
4                   r2
Constructing SLR parsing table

State I0:    State I1:   State I2:   State I3:   State I4:   State I5:
C' → •C      C' → C•     C → A•B     A → a•      C → A B•    B → a•
C → •A B                 B → •a
A → •a

Grammar:               FOLLOW(C) = {$}
1. C' → C              FOLLOW(A) = {a}
2. C → A B             FOLLOW(B) = {$}
3. A → a
4. B → a

        action         goto
state   a     $        C   A   B
0       s3             1   2
1             acc
2       s5                     4
3       r3
4             r2
5             r4

[Figure: the LR(0) automaton, as on the earlier slide.]
SLR Parsing Table Example 2
Grammar:
0. S’ → S
1. S → A a
2. S → Bb
3. S → a c
4. A → a
5. B → a

FOLLOW(S) = {$}
FOLLOW(A) = {a}
FOLLOW(B) = {b}

[Figure: the SLR parsing table; the FOLLOW sets resolve the conflicts of state 6: reduce A → a only on a, reduce B → a only on b, and shift on c.]
Example 3
(1) E -> E + T    (2) E -> T
(3) T -> T * F    (4) T -> F
(5) F -> (E)      (6) F -> id

FOLLOW(E) = { $, +, ) }
FOLLOW(T) = { $, +, ), * }
FOLLOW(F) = { $, +, ), * }

LR(0) Parsing table:
STATE   id    +     *       (     )     $
0       s5                  s4
1             s6                        acc
2       r2    r2    r2/s7   r2    r2    r2
3       r4    r4    r4      r4    r4    r4
4       s5                  s4
5       r6    r6    r6      r6    r6    r6
6       s5                  s4
7       s5                  s4
8             s6                  s11
9       r1    r1    r1/s7   r1    r1    r1
10      r3    r3    r3      r3    r3    r3
11      r5    r5    r5      r5    r5    r5

SLR Parsing table:
STATE   id    +     *     (     )     $
0       s5                s4
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4
5             r6    r6          r6    r6
6       s5                s4
7       s5                s4
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5
Example Parse id*id+id
(1) E -> E + T    (2) E -> T    (3) T -> T * F
(4) T -> F        (5) F -> (E)  (6) F -> id

STATE   ACTION                                GOTO
        id    +     *     (     )     $       E   T   F
0       s5                s4                  1   2   3
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4                  8   2   3
5             r6    r6          r6    r6
6       s5                s4                      9   3
7       s5                s4                          10
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5

Line   Stack    Symbols   Input        Action
(1)    0                  id*id+id$    Shift to 5
(2)    05       id        *id+id$      Reduce F -> id
(3)    03       F         *id+id$      Reduce T -> F
(4)    02       T         *id+id$      Shift to 7
(5)    027      T*        id+id$       Shift to 5
(6)    0275     T*id      +id$         Reduce F -> id
(7)    02710    T*F       +id$         Reduce T -> T*F
(8)    02       T         +id$         Reduce E -> T
(9)    01       E         +id$         Shift to 6
(10)   016      E+        id$          Shift to 5
(11)   0165     E+id      $            Reduce by F -> id
(12)   0163     E+F       $            Reduce by T -> F
(13)   0169     E+T       $            Reduce E -> E+T
(14)   01       E         $            accept
SLR, Ambiguity, and Conflicts
• Every SLR(1) grammar is unambiguous
• But not every unambiguous grammar is SLR
• The grammar is not SLR(1) if shift/reduce and reduce/reduce
conflicts exist.

Exercises SLR

[Figure: exercises, not recoverable from the text.]
Canonical-LR Parsing
➢ The "canonical LR" method, or just "LR", uses a large set of items, called
the LR(1) items. The 1 refers to the length of the lookahead of the
item.
➢ LR(1) items = LR(0) items + a lookahead symbol.
➢ The general syntax becomes [A → α•β, a], where a is a terminal or the
right end marker $.
➢ The lookahead has no effect in an item of the form [A → α•β, a] where β
is not ɛ, but an item of the form [A → α•, a] calls for a reduction by
A → α only if the next input symbol is a.
➢ Thus, we are required to reduce by A → α only on those input
symbols a for which [A → α•, a] is an LR(1) item in the state on top of
the stack.
➢ The set of such a's will always be a subset of FOLLOW(A).
How to add lookahead with the production?
• CASE 1: A → α•BC, a
• Suppose this is the 0th production. Since • precedes B, we must
write B's productions as well, e.g. B → •D [1st production].
• The lookahead of this production is found by looking at the 0th
production: whatever follows B there, we take FIRST(of that value)
as the lookahead of the 1st production. Here, in the 0th production,
C follows B. Assuming FIRST(C) = d, the 1st production becomes
B → •D, d.
• CASE 2: A → α•B, a
• There is nothing after B, so the lookahead of the 0th production is
also the lookahead of the 1st production, i.e. B → •D, a.
• CASE 3: A → a | b
➢ A → a, $ [0th production]
➢ A → b, $ [1st production]
Constructing LR(1) Sets of Items
• The method for building the collection of sets of valid LR(1) items is
essentially the same as the one for building the canonical collection
of sets of LR(0) items.
• We need only modify the two procedures CLOSURE and GOTO.
Example: S' → S, S → C C, C → c C | d

[Figure: the LR(1) sets of items for this grammar.]
LR(1) GOTO graph
1) S → C C 2) C → c C 3) C → d

[Figure: the LR(1) GOTO graph.]
Constructing LR(1) Parsing Tables
• An LR parser using the canonical LR(1) parsing table is called a
canonical-LR(1) parser.

[Figure: the algorithm for constructing the canonical LR(1) parsing table.]
LR(1) Parsing Tables of the example
• Every SLR grammar is an LR grammar, but for an SLR grammar the
LR parser may have more states than the SLR parser for the same
grammar. The grammar of the previous example is SLR and has an
SLR parser with seven states, compared with the ten of the LR.

[Figure: the LR(1) GOTO graph and the corresponding LR parsing table.]
LALR Parsing
• LALR (lookahead-LR) technique is often used in practice, because
the tables obtained by it are considerably smaller than the LR tables.
• Most common syntactic constructs of programming languages can be
expressed by an LALR grammar. The same is almost true for SLR
grammars, but there are a few constructs that cannot be handled by
SLR techniques.
• SLR and LALR tables for a grammar always have the same number
of states: several hundred states for a language like C.
• A canonical LR table would usually have several thousand states for
the same-size language.
• Thus, it is much easier and more economical to construct SLR and
LALR tables than the canonical LR tables.

Constructing LALR Sets of Items
• An LALR parser is the same as an LR parser with one difference: if two
states of the LR parser differ only in their lookaheads, then we combine
those states in the LALR parser.
• We may merge the sets of LR(1) items having the same core (the same
set of first components) into one set of items.
• In general, a core is a set of LR(0) items for the grammar at hand.
• For example, I4 and I7, with common core {C → d•}, are replaced by
their union: I47: C → d•, c/d/$.
• I8 and I9, with common core {C → cC•}, are replaced by their union:
I89: C → cC•, c/d/$
• I3 and I6, with common core {C → c•C, C → •cC, C → •d}, are
replaced by their union:
I36: C → c•C, c/d/$
     C → •cC, c/d/$
     C → •d, c/d/$
LALR GOTO and Parsing Tables
• Consider GOTO(I36; C), in the original set of LR(1) items, GOTO(I3;
C) = I8, and I8 is now part of I89, so we make GOTO(I36; C) be I89.
We have arrived at the same conclusion if we considered I6, the other
part of I36. That is, GOTO(I6; C) = I9, and I9 is now part of I89.
• Consider GOTO(I2; c), in the original sets of LR(1) items, GOTO(I2;
c) = I6 and I6 is now part of I36, GOTO(I2; c) becomes I36. Thus, the
entry for state 2 and input c is made s36, meaning shift and push
state 36 onto the stack.

[Figure: the LALR GOTO graph with merged states I36, I47, I89.]


Example: input cdd
1) S → C C   2) C → c C   3) C → d

LR(1):
Stack   Symbols   Input   Action
0       $         cdd$    Shift to 3
03      $c        dd$     Shift to 4
034     $cd       d$      Reduce by C → d
038     $cC       d$      Reduce by C → cC
02      $C        d$      Shift to 7
027     $Cd       $       Reduce by C → d
025     $CC       $       Reduce by S → CC
01      $S        $       accept

LALR:
Stack     Symbols   Input   Action
0         $         cdd$    Shift to 36
036       $c        dd$     Shift to 47
03647     $cd       d$      Reduce by C → d
03689     $cC       d$      Reduce by C → cC
02        $C        d$      Shift to 47
0247      $Cd       $       Reduce by C → d
025       $CC       $       Reduce by S → CC
01        $S        $       accept
Invalid Input: ccd
When presented with invalid input, the LALR parser may proceed to do
some reductions after the LR parser has declared an error. However, the
LALR parser will never shift another symbol after the LR parser declares
an error.
1) S → C C   2) C → c C   3) C → d

LR(1):
Stack   Symbols   Input   Action
0       $         ccd$    Shift to 3
03      $c        cd$     Shift to 3
033     $cc       d$      Shift to 4
0334    $ccd      $       error

LALR:
Stack      Symbols   Input   Action
0          $         ccd$    Shift to 36
036        $c        cd$     Shift to 36
03636      $cc       d$      Shift to 47
0363647    $ccd      $       Reduce by C → d
0363689    $ccC      $       Reduce by C → cC
03689      $cC       $       Reduce by C → cC
02         $C        $       error
Exercises LR & LALR

[Figure: exercises, not recoverable from the text.]
Exercises
• Consider the grammar G1:      • Consider the grammar G2:
1. S → A d                      1. S → A d
2. S → B e                      2. S → A e
3. A → a A b                    3. A → a A b
4. A → c                        4. A → c
5. B → a B b
6. B → c

• Consider the grammar G3:
1. S → A B
2. A → a A b
3. A → c
4. B → d
5. B → e

• Is each grammar LR(k) for some fixed k?
• What about LL(k) for some fixed k?
