COMP 412
FALL 2010
Introduction to Parsing
Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 412 at Rice University have explicit permission to make
copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit
educational purposes, provided this copyright notice is preserved.
The Front End
[Figure: source code → Scanner → tokens → Parser → IR; both the Scanner and the Parser report Errors]
Parser
• Checks the stream of words and their parts of speech
(produced by the scanner) for grammatical correctness
• Determines if the input is syntactically well formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code
Think of this chapter as the mathematics of diagramming
sentences
Comp 412, Fall 2010 2
The Study of Parsing
The process of discovering a derivation for some sentence
• Need a mathematical model of syntax — a grammar G
• Need an algorithm for testing membership in L(G)
• Need to keep in mind that our goal is building parsers,
not studying the mathematics of arbitrary languages
Roadmap for our study of parsing
1 Context-free grammars and derivations (today)
2 Top-down parsing
— Generated LL(1) parsers & hand-coded recursive descent
parsers
3 Bottom-up parsing (Lab 2)
— Generated LR(1) parsers
We will define “context free” today. I am just deferring the definition for a couple of slides.
Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar
SheepNoise → SheepNoise baa
           | baa
This CFG defines the set of noises sheep normally make
It is written in a variant of Backus–Naur form
Formally, a grammar is a four tuple, G = (S,N,T,P)
• S is the start symbol (it stands for the set of strings in L(G))
• N is a set of nonterminal symbols (syntactic variables)
• T is a set of terminal symbols (words)
• P is a set of productions or rewrite rules (P : N → (N ∪ T)+ )
Example due to Dr. Scott K. Warren
From Lecture 1
Deriving Syntax
We can use the SheepNoise grammar to create sentences
— use the productions as rewriting rules
And so on ...
While this example is cute, it quickly runs out of intellectual
steam ...
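The rewriting process can be sketched in a few lines of Python. This is only an illustration: the function name derive and the rule labels 1 and 2 are choices made for the sketch, not part of the slides.

```python
# SheepNoise grammar from the slide (rule labels are this sketch's own):
#   rule 1: SheepNoise -> SheepNoise baa
#   rule 2: SheepNoise -> baa

def derive(extra_baas):
    """Return the sentential forms of a derivation yielding extra_baas + 1 words."""
    form = ["SheepNoise"]
    steps = [" ".join(form)]
    # Apply rule 1 once per extra "baa"; each step adds one word to the form.
    for _ in range(extra_baas):
        form = ["SheepNoise", "baa"] + form[1:]
        steps.append(" ".join(form))
    # Apply rule 2 to rewrite the one remaining nonterminal.
    form = ["baa"] + form[1:]
    steps.append(" ".join(form))
    return steps

for step in derive(2):
    print(step)
```

Calling derive(2) prints four sentential forms, ending with the sentence baa baa baa.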
Why Not Use Regular Languages & DFAs?
Not all languages are regular (RL’s ⊂ CFL’s ⊂ CSL’s)
You cannot construct DFA’s to recognize these languages
• L = { pᵏ qᵏ } (parenthesis languages)
• L = { w c wʳ | w ∈ Σ* }
Neither of these is a regular language (so neither can be written as an RE)
To recognize these features requires an arbitrary amount of
context (left or right …)
But, this issue is somewhat subtle. You can construct DFA’s for
• Strings with alternating 0’s and 1’s
( ε | 1 ) ( 01 )* ( ε | 0 )
• Strings with an even number of 0’s and 1’s
RE’s can count bounded sets and bounded differences
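A rough sketch of the contrast, in Python. It assumes "an even number of 0’s and 1’s" means that both counts are even. The names step, accepts, and balanced are invented for the illustration.

```python
# DFA with four states: (parity of 0s seen, parity of 1s seen).
START = (0, 0)
ACCEPT = {(0, 0)}

def step(state, ch):
    """One DFA transition: flip the parity that matches the input character."""
    zeros, ones = state
    if ch == "0":
        return (1 - zeros, ones)
    if ch == "1":
        return (zeros, 1 - ones)
    raise ValueError(f"not in the alphabet: {ch!r}")

def accepts(text):
    """Run the DFA; a bounded amount of state (two parities) is enough."""
    state = START
    for ch in text:
        state = step(state, ch)
    return state in ACCEPT

def balanced(text):
    """Recognize p^k q^k. The counter k is unbounded, which no DFA can hold."""
    k = 0
    while k < len(text) and text[k] == "p":
        k += 1
    return text[k:] == "q" * k

print(accepts("0011"), accepts("010"))
print(balanced("ppqq"), balanced("ppq"))
```

The first recognizer needs only four states no matter how long the input is. The second has to remember k exactly, and k can grow without limit.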
Limits of Regular Languages
Advantages of Regular Expressions
• Simple & powerful notation for specifying patterns
• Automatic construction of fast recognizers
• Many kinds of syntax can be specified with REs
Example — a regular expression for arithmetic expressions
Term → [a-zA-Z] ([a-zA-Z] | [0-9])*
Op → + | - | * | /
Expr → ( Term Op )* Term
([a-zA-Z] ([a-zA-Z] | [0-9])* (+ | - | * | /))* [a-zA-Z] ([a-zA-Z] | [0-9])*
Of course, this would generate a DFA …
If REs are so useful … Why not use them for everything?
Cannot add parentheses, brackets, begin-end pairs, …
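As an illustration, the RE above can be tried out with Python's re module. This is a sketch: the name is_expr is invented here, and the pattern follows the slide literally, so a Term must start with a letter and no whitespace is allowed.

```python
import re

TERM = r"[a-zA-Z][a-zA-Z0-9]*"   # Term from the slide
OP = r"[-+*/]"                    # Op from the slide
EXPR = re.compile(rf"(?:{TERM}{OP})*{TERM}")   # ( Term Op )* Term

def is_expr(text):
    """True if the whole string matches the RE for arithmetic expressions."""
    return EXPR.fullmatch(text) is not None

print(is_expr("x-a2*y"))      # a flat chain of terms and operators
print(is_expr("(x-a2)*y"))    # parentheses are outside what this RE describes
```

The RE accepts flat operator chains but has no way to require that each opening parenthesis is eventually closed.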
Context-free Grammars
What makes a grammar “context free”?
The SheepNoise grammar has a specific form:
SheepNoise → SheepNoise baa
           | baa
Productions have a single nonterminal on the left hand side,
which makes it impossible to encode left or right context.
The grammar is context free.
A context-sensitive grammar can have ≥ 1 nonterminal on the lhs.
Notice that L(SheepNoise) is actually a regular language: baa+ (one or more occurrences of the word baa)
Classic definition: any language that can be recognized by a push-down automaton is a context-free language.
A More Useful Grammar Than Sheep Noise
To explore the uses of CFGs, we need a more complex grammar
0  Expr → Expr Op Expr
1       | number
2       | id
3  Op   → +
4       | -
5       | *
6       | /
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    <id,x> Op Expr
 4    <id,x> - Expr
 0    <id,x> - Expr Op Expr
 1    <id,x> - <num,2> Op Expr
 5    <id,x> - <num,2> * Expr
 2    <id,x> - <num,2> * <id,y>
• Such a sequence of rewrites is called a derivation
• Process of discovering a derivation is called parsing
We denote this derivation: Expr ⇒* id – num * id
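The sequence of rewrites can be replayed mechanically. The sketch below is one possible encoding, not the course's code: it stores the productions by rule number and always rewrites the leftmost nonterminal. The names RULES, leftmost_step, and replay are invented, and tokens are shown by category (id, number) rather than as pairs such as <id,x>.

```python
# Productions of the expression grammar, indexed by rule number.
RULES = {
    0: ("Expr", ["Expr", "Op", "Expr"]),
    1: ("Expr", ["number"]),
    2: ("Expr", ["id"]),
    3: ("Op", ["+"]),
    4: ("Op", ["-"]),
    5: ("Op", ["*"]),
    6: ("Op", ["/"]),
}
NONTERMINALS = {"Expr", "Op"}

def leftmost_step(form, rule):
    """Rewrite the leftmost nonterminal in form using the numbered rule."""
    lhs, rhs = RULES[rule]
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            if symbol != lhs:
                raise ValueError(f"rule {rule} does not rewrite {symbol}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to rewrite")

def replay(rules, start="Expr"):
    """Return every sentential form produced by applying rules in order."""
    form = [start]
    forms = [form]
    for rule in rules:
        form = leftmost_step(form, rule)
        forms.append(form)
    return forms

# The rule sequence from the derivation table.
for form in replay([0, 2, 4, 0, 1, 5, 2]):
    print(" ".join(form))
```

The last line printed is id - number * id, which is the sentence the table derives.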
Derivations
The point of parsing is to construct a derivation
• At each step, we choose a nonterminal to replace
• Different choices can lead to different derivations
Two derivations are of interest
• Leftmost derivation — replace leftmost NT at each step
• Rightmost derivation — replace rightmost NT at each step
These are the two systematic derivations
(We don’t care about randomly-ordered derivations!)
The example on the preceding slide was a leftmost
derivation
• Of course, there is also a rightmost derivation
• Interestingly, it turns out to be different
Derivations
The point of parsing is to construct a derivation
A derivation consists of a series of rewrite steps
S ⇒ γ₀ ⇒ γ₁ ⇒ γ₂ ⇒ … ⇒ γₙ₋₁ ⇒ γₙ ⇒ sentence
• Each γᵢ is a sentential form
— If γ contains only terminal symbols, γ is a sentence in L(G)
— If γ contains 1 or more non-terminals, γ is a sentential form
• To get γᵢ from γᵢ₋₁, expand some NT A ∈ γᵢ₋₁ by using A → β
— Replace the occurrence of A ∈ γᵢ₋₁ with β to get γᵢ
— In a leftmost derivation, it would be the first NT A ∈ γᵢ₋₁
A left-sentential form occurs in a leftmost derivation
A right-sentential form occurs in a rightmost derivation
The Two Derivations for x – 2 * y
Leftmost derivation
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    <id,x> Op Expr
 4    <id,x> - Expr
 0    <id,x> - Expr Op Expr
 1    <id,x> - <num,2> Op Expr
 5    <id,x> - <num,2> * Expr
 2    <id,x> - <num,2> * <id,y>
Rightmost derivation
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    Expr Op <id,y>
 5    Expr * <id,y>
 0    Expr Op Expr * <id,y>
 1    Expr Op <num,2> * <id,y>
 4    Expr - <num,2> * <id,y>
 2    <id,x> - <num,2> * <id,y>
In both cases, Expr ⇒* id – num * id
• The two derivations produce different parse trees
• The parse trees imply different evaluation orders!
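To make the difference in evaluation order concrete, the two tree shapes can be written as nested tuples and evaluated. The bindings x = 10 and y = 3 are made up for this illustration, and the evaluator handles only the two operators that appear in the example.

```python
# A tree is either a leaf (a name or a number) or a tuple (left, op, right).
ENV = {"x": 10, "y": 3}   # hypothetical values, chosen only for the example

leftmost_tree = ("x", "-", (2, "*", "y"))     # shape from the leftmost derivation
rightmost_tree = (("x", "-", 2), "*", "y")    # shape from the rightmost derivation

def evaluate(tree):
    """Evaluate a tree bottom-up; only '-' and '*' are supported."""
    if isinstance(tree, tuple):
        left, op, right = tree
        a, b = evaluate(left), evaluate(right)
        return a - b if op == "-" else a * b
    return ENV.get(tree, tree)   # names look up a value; numbers stand for themselves

print(evaluate(leftmost_tree))    # 10 - (2 * 3) = 4
print(evaluate(rightmost_tree))   # (10 - 2) * 3 = 24
```

The token sequence is the same in both cases, but the two trees produce different results.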
Derivations and Parse Trees
Leftmost derivation
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    <id,x> Op Expr
 4    <id,x> - Expr
 0    <id,x> - Expr Op Expr
 1    <id,x> - <num,2> Op Expr
 5    <id,x> - <num,2> * Expr
 2    <id,x> - <num,2> * <id,y>
Parse tree
G
  E
    E: x
    Op: –
    E
      E: 2
      Op: *
      E: y
This evaluates as x – ( 2 * y )
Derivations and Parse Trees
Rightmost derivation
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    Expr Op <id,y>
 5    Expr * <id,y>
 0    Expr Op Expr * <id,y>
 1    Expr Op <num,2> * <id,y>
 4    Expr - <num,2> * <id,y>
 2    <id,x> - <num,2> * <id,y>
Parse tree
G
  E
    E
      E: x
      Op: –
      E: 2
    Op: *
    E: y
This evaluates as ( x – 2 ) * y
This ambiguity is NOT good
Derivations and Precedence
These two derivations point out a problem with the grammar:
It has no notion of precedence, or implied order of evaluation
To add precedence
• Create a nonterminal for each level of precedence
• Isolate the corresponding part of the grammar
• Force the parser to recognize high precedence
subexpressions first
For algebraic expressions
• Parentheses first (level 1)
• Multiplication and division, next (level 2)
• Subtraction and addition, last (level 3)
Derivations and Precedence
Adding the standard algebraic precedence produces:
0  Goal   → Expr
1  Expr   → Expr + Term       (level 3)
2         | Expr - Term
3         | Term
4  Term   → Term * Factor     (level 2)
5         | Term / Factor
6         | Factor
7  Factor → ( Expr )          (level 1)
8         | number
9         | id
One form of the “classic expression grammar”
This grammar is slightly larger
• Takes more rewriting to reach some of the terminal symbols
• Encodes expected precedence
• Produces same parse tree under leftmost & rightmost derivations
• Correctness trumps the speed of the parser
Cannot handle precedence in an RE for expressions
Introduced parentheses, too (beyond power of an RE)
Let’s see how it parses x - 2 * y
Derivations and Precedence
The rightmost derivation
Rule  Sentential Form
 —    Goal
 0    Expr
 2    Expr - Term
 4    Expr - Term * Factor
 9    Expr - Term * <id,y>
 6    Expr - Factor * <id,y>
 8    Expr - <num,2> * <id,y>
 3    Term - <num,2> * <id,y>
 6    Factor - <num,2> * <id,y>
 9    <id,x> - <num,2> * <id,y>
Its parse tree
G
  E
    E
      T
        F: <id,x>
    –
    T
      T
        F: <num,2>
      *
      F: <id,y>
It derives x – ( 2 * y ), along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same expression,
because the grammar directly and explicitly encodes the desired
precedence.
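One way to see the three levels at work is a small hand-written parser with one routine per nonterminal. This is only a sketch under stated assumptions: the left-recursive rules are handled with loops, which is a substitution made here and not a technique from these slides; tokens are plain strings; and the tree is a nested tuple. The name parse is invented.

```python
def parse(tokens):
    """Build a tree for a token list using the grammar's three precedence levels."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                  # Factor -> ( Expr ) | number | id
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        tok = take()
        if tok == "(":
            tree = expr()
            if peek() != ")":
                raise SyntaxError("expected )")
            take()
            return tree
        if tok in ("+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

    def term():                    # Term -> Term * Factor | Term / Factor | Factor
        tree = factor()
        while peek() in ("*", "/"):
            op = take()
            tree = (tree, op, factor())
        return tree

    def expr():                    # Expr -> Expr + Term | Expr - Term | Term
        tree = term()
        while peek() in ("+", "-"):
            op = take()
            tree = (tree, op, term())
        return tree

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree

print(parse(["x", "-", "2", "*", "y"]))
print(parse(["(", "x", "-", "2", ")", "*", "y"]))
```

Because term is called from inside expr, every multiplication is grouped before the surrounding subtraction sees it, which mirrors the way the grammar puts Term below Expr.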
Ambiguous Grammars
Let’s leap back to our original expression grammar.
It had other problems.
0  Expr → Expr Op Expr
1       | number
2       | id
3  Op   → +
4       | -
5       | *
6       | /
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    <id,x> Op Expr
 4    <id,x> - Expr
 0    <id,x> - Expr Op Expr
 1    <id,x> - <num,2> Op Expr
 5    <id,x> - <num,2> * Expr
 2    <id,x> - <num,2> * <id,y>
(Different choice than the first time)
• This grammar allows multiple leftmost derivations for x - 2 * y
• Hard to automate derivation if > 1 choice
• The grammar is ambiguous
Two Leftmost Derivations for x – 2 * y
The Difference:
Different productions chosen on the second step
Original choice
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 2    <id,x> Op Expr
 4    <id,x> - Expr
 0    <id,x> - Expr Op Expr
 1    <id,x> - <num,2> Op Expr
 5    <id,x> - <num,2> * Expr
 2    <id,x> - <num,2> * <id,y>
New choice
Rule  Sentential Form
 —    Expr
 0    Expr Op Expr
 0    Expr Op Expr Op Expr
 2    <id,x> Op Expr Op Expr
 4    <id,x> - Expr Op Expr
 1    <id,x> - <num,2> Op Expr
 5    <id,x> - <num,2> * Expr
 2    <id,x> - <num,2> * <id,y>
Both derivations succeed in producing x - 2 * y
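Each leftmost derivation corresponds to exactly one parse tree, so counting parse trees is a way to check for ambiguity on a given input. The sketch below does that for the grammar Expr → Expr Op Expr | number | id by trying every operator as the top-level Op. The function name count_parse_trees is invented, and tokens are given by category.

```python
from functools import lru_cache

OPS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Count parse trees for tokens under Expr -> Expr Op Expr | number | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        """Number of trees that derive tokens[lo:hi] from Expr."""
        if hi - lo == 1:
            return 1 if tokens[lo] in ("number", "id") else 0
        total = 0
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in OPS:   # tokens[mid] acts as the top-level Op
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["id", "-", "number", "*", "id"]))
```

For id - number * id the count is 2, matching the two leftmost derivations shown. Any count above 1 for some input shows that the grammar is ambiguous.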
Different choices in same situation, again
Remember nondeterminism?
Ambiguous Grammars
Definitions
• If a grammar has more than one leftmost derivation for
a single sentential form, the grammar is ambiguous
• If a grammar has more than one rightmost derivation
for a single sentential form, the grammar is ambiguous
• The leftmost and rightmost derivations for a sentential
form may differ, even in an unambiguous grammar
— However, they must have the same parse tree!
Classic example — the if-then-else problem
Stmt → if Expr then Stmt
     | if Expr then Stmt else Stmt
     | … other stmts …
This ambiguity is inherent in the grammar
Ambiguity
This sentential form has two derivations
if Expr1 then if Expr2 then Stmt1 else Stmt2
[Figure: two parse trees]
• Production 2, then production 1: if E1 then ( if E2 then S1 ) else S2
• Production 1, then production 2: if E1 then ( if E2 then S1 else S2 )
Part of the problem is that the structure built by the parser will determine the interpretation of the code, and these two forms have different meanings!
Ambiguity
Removing the ambiguity
• Must rewrite the grammar to avoid generating the problem
• Match each else to innermost unmatched if (common sense rule)
0  Stmt     → if Expr then Stmt
1           | if Expr then WithElse else Stmt
2           | Other Statements
3  WithElse → if Expr then WithElse else WithElse
4           | Other Statements
The grammar forces the structure to match the desired meaning.
Intuition: once into WithElse, we cannot generate an unmatched else
… a final if without an else can only come through rule 2 …
With this grammar, example has only one rightmost derivation
Ambiguity
if Expr1 then if Expr2 then Stmt1 else Stmt2
Rule  Sentential Form
 —    Stmt
 0    if Expr then Stmt
 1    if Expr then if Expr then WithElse else Stmt
 2    if Expr then if Expr then WithElse else S2
 4    if Expr then if Expr then S1 else S2
 ?    if Expr then if E2 then S1 else S2
 ?    if E1 then if E2 then S1 else S2
(? = other productions to derive Exprs)
This grammar has only one rightmost derivation for the
example
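The effect of the rewritten grammar, each else matched to the innermost unmatched if, can be imitated by a short recursive routine. This is a hand-written sketch, not a parser generated from the grammar. It assumes every Expr and every other statement is a single token, and the name parse_stmt is invented.

```python
def parse_stmt(tokens, pos=0):
    """Return (tree, next position); each else binds to the nearest open if."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1            # some other statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected: if Expr then")
    cond = tokens[pos + 1]
    then_part, pos = parse_stmt(tokens, pos + 3)
    # The inner call returns first, so an inner if claims a following else
    # before control gets back to the enclosing if.
    if pos < len(tokens) and tokens[pos] == "else":
        else_part, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_part, else_part), pos
    return ("if", cond, then_part), pos

tree, _ = parse_stmt("if E1 then if E2 then S1 else S2".split())
print(tree)
```

For the example sentence the result attaches S2 to the inner if, the same structure that the derivation above produces.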
Deeper Ambiguity
Ambiguity usually refers to confusion in the CFG
Overloading can create deeper ambiguity
a = f(17)
In many Algol-like languages, f could be either a function
or a subscripted variable
Disambiguating this one requires context
• Need values of declarations
• Really an issue of type, not context-free syntax
• Requires an extra-grammatical solution (not in CFG)
• Must handle these with a different mechanism
— Step outside grammar rather than use a more complex
grammar
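A minimal sketch of such an extra-grammatical step, in Python. The symbol table is modeled as a dictionary from names to kinds. Both the table layout and the name classify_reference are assumptions made for the illustration, not the mechanism used in any particular compiler.

```python
def classify_reference(name, declarations):
    """Decide what name(...) means by consulting declarations, not the grammar."""
    kind = declarations.get(name)
    if kind == "function":
        return "call"
    if kind == "array":
        return "subscript"
    raise LookupError(f"{name} is not declared as a function or an array")

# The same text, a = f(17), read under two hypothetical sets of declarations.
print(classify_reference("f", {"f": "function"}))
print(classify_reference("f", {"f": "array"}))
```

The parser can build the same syntactic shape for f(17) in both cases and leave this lookup to a later pass that has the declarations available.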
Ambiguity - the Final Word
Ambiguity arises from two distinct sources
• Confusion in the context-free syntax (if-then-else)
• Confusion that requires context to resolve (overloading)
Resolving ambiguity
• To remove context-free ambiguity, rewrite the grammar
• To handle context-sensitive ambiguity takes cooperation
— Knowledge of declarations, types, …
— Accept a superset of L(G) & check it by other means†
— This is a language design problem
Sometimes, the compiler writer accepts an ambiguous
grammar
— Parsing techniques that “do the right thing”
— i.e., always select the same derivation
† See Chapter 4