
Bottom Up Parsing Techniques Explained

Chapter 5 discusses bottom-up parsing algorithms, particularly focusing on shift-reduce parsing, which involves shifting input symbols onto a stack and reducing them to nonterminals based on grammar rules. It highlights potential conflicts such as shift/reduce and reduce/reduce conflicts that can arise when the parser is uncertain about which operation to perform. The chapter also introduces LR parsing with tables, which helps manage these operations systematically.


Chapter 5

Bottom Up Parsing

The implementation of parsing algorithms for LL(1) grammars, as shown in
Chapter 4, is relatively straightforward. However, there are many situations in
which it is not easy, or even possible, to use an LL(1) grammar. In these cases,
the designer may have to use a bottom up algorithm.
Parsing algorithms which proceed from the bottom of the derivation tree
and apply grammar rules (in reverse) are called bottom up parsing algorithms.
These algorithms will begin with an empty stack. One or more input symbols
are moved onto the stack, where they are then replaced by nonterminals according to
the grammar rules. When all the input symbols have been read, the algorithm
terminates with the starting nonterminal alone on the stack, if the input string
is acceptable. The student may think of a bottom up parse as being similar
to a derivation in reverse. Each time a grammar rule is applied to a sentential
form, the rewriting rule is applied backwards. Consequently, derivation trees
are constructed, or traversed, from bottom to top.

5.1 Shift Reduce Parsing


Bottom up parsing involves two fundamental operations. The process of mov-
ing an input symbol to the stack is called a shift operation, and the process
of replacing symbols on the top of the stack with a nonterminal is called a re-
duce operation (it is a derivation step in reverse). Most bottom up parsers are
called shift reduce parsers because they use these two operations. The following
grammar will be used to show how a shift reduce parser works:
G22:

1. S → S a B
2. S → c
3. B → a b
A derivation tree for the string caabaab is shown in Figure 5.1. The shift reduce
parser will proceed as follows: each step will be either a shift (shift an input


S
├── S
│   ├── S
│   │   └── c
│   ├── a
│   └── B
│       ├── a
│       └── b
├── a
└── B
    ├── a
    └── b
Figure 5.1: Derivation tree for the string caabaab using grammar G22

symbol to the stack) or reduce (reduce symbols on the stack to a nonterminal),
in which case we indicate which rule of the grammar is being applied. The
sequence of stack frames and input is shown in Figure 5.2, in which the stack
frames are pictured horizontally to show, more clearly, the shifting of input
characters onto the stack and the sentential forms corresponding to this parse.
The algorithm accepts the input if the stack can be reduced to the starting
nonterminal when all of the input string has been read.
Note in Figure 5.2 that whenever a reduce operation is performed, the sym-
bols being reduced are always on top of the stack. The string of symbols being
reduced is called a handle , and it is imperative in bottom up parsing that the
algorithm be able to find a handle whenever possible. The bottom up parse
shown in Figure 5.2 corresponds to the derivation shown below:
S ⇒ _SaB_ ⇒ Sa_ab_ ⇒ _SaB_aab ⇒ Sa_ab_aab ⇒ _c_aabaab
Note that this is a right-most derivation; shift reduce parsing will always
correspond to a right-most derivation. In this derivation the handle in each
sentential form is enclosed in underscores. Read this derivation from right to left and
compare it with Figure 5.2.
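The shift and reduce operations can be sketched in code. The following is a hypothetical Java recognizer hard-coded for G22 (the class and method names are ours, not from the text); after each shift it greedily reduces whenever the top of the stack matches the right side of a rule. This greedy policy happens to be safe for G22; the conflict examples below show why it fails for other grammars.

```java
// A hypothetical sketch (names are ours, not from the text): a shift-reduce
// recognizer hard-coded for grammar G22:
//   1. S -> S a B     2. S -> c     3. B -> a b
// After every shift it greedily reduces any handle found on top of the stack.
public class G22Parser {

    static boolean parse(String input) {
        StringBuilder stack = new StringBuilder(); // bottom of stack at index 0
        for (char symbol : input.toCharArray()) {
            stack.append(symbol);                  // shift
            reduce(stack);                         // reduce while a handle is on top
        }
        // accept iff everything reduced to the starting nonterminal S
        return stack.toString().equals("S");
    }

    // Repeatedly replace a handle on top of the stack with its nonterminal.
    static void reduce(StringBuilder stack) {
        boolean reduced = true;
        while (reduced) {
            reduced = false;
            String top = stack.toString();
            if (top.endsWith("c"))        { replaceTop(stack, 1, 'S'); reduced = true; } // rule 2
            else if (top.endsWith("ab"))  { replaceTop(stack, 2, 'B'); reduced = true; } // rule 3
            else if (top.endsWith("SaB")) { replaceTop(stack, 3, 'S'); reduced = true; } // rule 1
        }
    }

    static void replaceTop(StringBuilder stack, int popCount, char nonterminal) {
        stack.setLength(stack.length() - popCount); // pop the handle
        stack.append(nonterminal);                  // push the nonterminal
    }

    public static void main(String[] args) {
        System.out.println(parse("caabaab")); // true
        System.out.println(parse("caab"));    // true
        System.out.println(parse("cab"));     // false
    }
}
```

Tracing parse("caabaab") reproduces the stack frames of Figure 5.2 exactly, because G22 never needs lookahead to decide between shifting and reducing.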
If the parser for a particular grammar can be implemented with a shift reduce
algorithm, we say the grammar is LR (the L indicates we are reading input
from the left, and the R indicates we are finding a right-most derivation). The
shift reduce parsing algorithm always performs a reduce operation when the top
of the stack corresponds to the right side of a rule. However, if the grammar
is not LR, there may be instances where this is not the correct operation, or
there may be instances where it is not clear which reduce operation should be
performed. For example, consider grammar G23:
G23:
1. S → SaB
2. S → a
3. B → ab

When parsing the input string aaab, we reach a point where it appears that
we have a handle on top of the stack (the terminal a), but reducing that handle,
as shown in Figure 5.3, does not lead to a correct bottom up parse. This is called
a shift/reduce conflict because the parser does not know whether to shift an
input symbol or reduce the handle on the stack. This means that the grammar
is not LR, and we must either rewrite the grammar or use a different parsing

∇                    caabaab ↵
   shift
∇ c                  aabaab ↵
   reduce using rule 2
∇ S                  aabaab ↵
   shift
∇ Sa                 abaab ↵
   shift
∇ Saa                baab ↵
   shift
∇ Saab               aab ↵
   reduce using rule 3
∇ SaB                aab ↵
   reduce using rule 1
∇ S                  aab ↵
   shift
∇ Sa                 ab ↵
   shift
∇ Saa                b ↵
   shift
∇ Saab               ↵
   reduce using rule 3
∇ SaB                ↵
   reduce using rule 1
∇ S                  ↵
   Accept

Figure 5.2: Sequence of stack frames parsing caabaab using grammar G22

∇                    aaab ↵
   shift
∇ a                  aab ↵
   reduce using rule 2
∇ S                  aab ↵
   shift
∇ Sa                 ab ↵
   shift/reduce conflict
   reduce using rule 2 (incorrect)
∇ SS                 ab ↵
   shift
∇ SSa                b ↵
   shift
∇ SSab               ↵
   reduce using rule 3
∇ SSB                ↵
   Syntax error (incorrect)

Figure 5.3: An example of a shift/reduce conflict leading to an incorrect parse using grammar G23

algorithm.
Another problem in shift reduce parsing occurs when it is clear that a reduce
operation should be performed, but there is more than one grammar rule whose
right hand side matches the top of the stack, and it is not clear which rule
should be used. This is called a reduce/reduce conflict. Grammar G24 is an
example of a grammar with a reduce/reduce conflict.
G24:
1. S → SA
2. S → a
3. A → a

Figure 5.4 shows an attempt to parse the input string aa with the shift reduce
algorithm, using grammar G24. Note that we encounter a reduce/reduce conflict
when the handle a is on the stack because we don’t know whether to reduce
using rule 2 or rule 3. If we reduce using rule 2, we will get a correct parse, but
if we reduce using rule 3 we will get an incorrect parse.
It is often possible to resolve these conflicts simply by making an assumption.

∇ aa ↵
shift
∇ a a ↵
reduce/reduce conflict (rules 2 and 3)
reduce using rule 3 (incorrect)
∇ A a ↵
shift
∇ Aa ↵
reduce/reduce conflict (rules 2 and 3)
reduce using rule 2 (rule 3 will also yield a syntax error)
∇ AS ↵
Syntax error

Figure 5.4: A reduce/reduce conflict using grammar G24

For example, all shift/reduce conflicts could be resolved by shifting rather than
reducing. If this assumption always yields a correct parse, there is no need to
rewrite the grammar.
In examples like the two just presented, it is possible that the conflict can be
resolved by looking ahead at additional input characters. An LR algorithm that
looks ahead k input symbols is called LR(k). When implementing programming
languages bottom up, we generally try to define the language with an LR(1)
grammar, in which case the algorithm will not need to look ahead beyond the
current input symbol. An ambiguous grammar is not LR(k) for any value of k;
i.e., an ambiguous grammar will always produce conflicts when parsing bottom
up with the shift reduce algorithm. For example, the following grammar for if
statements is ambiguous:
1. Stmt → if (BoolExpr) Stmt else Stmt
2. Stmt → if (BoolExpr) Stmt

The BoolExpr in parentheses represents a true or false condition. Figure 5.5
shows two different derivation trees for the statement if (BoolExpr)
if (BoolExpr) Stmt else Stmt. The tree on the right is the interpretation
preferred by most programming languages (each else is matched with the closest
preceding unmatched if). The parser will encounter a shift/reduce conflict when
reading the else. The reason for the conflict is that the parser will be configured
as shown in Figure 5.6.
In this case, the parser will not know whether to treat if (BoolExpr) Stmt
as a handle and reduce it to Stmt according to rule 2, or to shift the else, which
should be followed by a Stmt, thus reducing according to rule 1. However, if
the parser can somehow be told to resolve this conflict in favor of the shift, then

Tree on the left (the else is attached to the outer if):

Stmt
├── if
├── (
├── BoolExpr
├── )
├── Stmt
│   └── if ( BoolExpr ) Stmt
├── else
└── Stmt

Tree on the right (the else is attached to the nearer if):

Stmt
├── if
├── (
├── BoolExpr
├── )
└── Stmt
    └── if ( BoolExpr ) Stmt else Stmt

Figure 5.5: Two derivation trees for if (BoolExpr) if (BoolExpr) Stmt else Stmt

Stack Input

∇ . . . if ( BoolExpr ) Stmt else . . . ↵

Figure 5.6: Parser configuration before reading the else part of an if statement

it will always find the correct interpretation. Alternatively, the ambiguity may
be removed by rewriting the grammar, as shown in section 3.1.

Sample Problem 5.1.1

Show the sequence of stack and input configurations as the string
caab is parsed with a shift reduce parser, using grammar G22.

Solution:

∇ caab ↵
shift
∇ c aab ↵
reduce using rule 2
∇ S aab ↵
shift
∇ Sa ab ↵
shift
∇ Saa b ↵
shift
∇ Saab ↵
reduce using rule 3
∇ SaB ↵
reduce using rule 1
∇ S ↵
Accept

5.1.1 Exercises
1. For each of the following stack configurations, identify the handle using
the grammar shown below:

1. S → SAb
2. S → acb
3. A → bBc
4. A → bc
5. B → ba
6. B → Ac

(a) ∇ SSAb

(b) ∇ SSbbc

(c) ∇ SbBc

(d) ∇ Sbbc

2. Using the grammar of Problem 1, show the sequence of stack and input
configurations as each of the following strings is parsed with shift reduce
parsing:

(a) acb
(b) acbbcb
(c) acbbbacb
(d) acbbbcccb
(e) acbbcbbcb

3. For each of the following input strings, indicate whether a shift/reduce
parser will encounter a shift/reduce conflict, a reduce/reduce conflict, or
no conflict when parsing, using the grammar below:

1. S → S ab
2. S → b A
3. A → b b
4. A → b A
5. A → b bc
6. A → c

(a) b c
(b) b b c a b
(c) b a c b

4. Assume that a shift/reduce parser always chooses the lower numbered rule
(i.e., the one listed first in the grammar) whenever a reduce/reduce con-
flict occurs during parsing, and it chooses a shift whenever a shift/reduce
conflict occurs. Show a derivation tree corresponding to the parse for the
sentential form if (BoolExpr) if (BoolExpr) Stmt else Stmt, using
the following ambiguous grammar. Since the grammar is not complete,
you may have nonterminal symbols at the leaves of the derivation tree.
1. Stmt → if (BoolExpr) Stmt else Stmt
2. Stmt → if (BoolExpr) Stmt

5.2 LR Parsing With Tables

One way to implement shift reduce parsing is with tables that determine whether
to shift or reduce, and which grammar rule to use when reducing. This technique
makes use of two tables to control the parser. The first table, called the action
table, determines whether a shift or reduce is to be invoked. If it specifies a
reduce, it also indicates which grammar rule is to be reduced. The second table,
called a goto table, indicates which stack symbol is to be pushed on the stack
after a reduction. A shift action is implemented by a push operation followed
by an advance input operation. A reduce action must always specify the grammar
rule to be reduced. The reduce action is implemented by a Replace operation
in which stack symbols on the right side of the specified grammar rule are
replaced by a stack symbol from the goto table (the input pointer is retained).
The symbol pushed is not necessarily the nonterminal being reduced, as shown
below. In practice, there will be one or more stack symbols corresponding to
each nonterminal.
The columns of the goto table are labeled by nonterminals, and the rows
are labeled by stack symbols. A cell of the goto table is selected by choosing
the column of the nonterminal being reduced and the row of the stack symbol
just beneath the handle.
For example, suppose we have the following stack and input configuration:

Stack                Input
∇ S                  ab ↵

in which the bottom of the stack is to the left. The action shift will result in
the following configuration:

Stack                Input
∇ Sa                 b ↵

The a has been shifted from the input to the stack. Suppose, then, that in
the grammar, rule 7 is:

7. B → Sa

Select the row of the goto table labeled ∇ and the column labeled B. If the
entry in this cell is push X, then the action reduce 7 would result in the following
configuration:

Stack                Input
∇ X                  b ↵

Figure 5.7 shows the LR parsing tables for grammar G5 for arithmetic
expressions involving only addition and multiplication (see section 3.1). As in
previous pushdown machines, the stack symbols label the rows, and the input
symbols label the columns of the action table. The columns of the goto table
are labeled by the nonterminal being reduced. The stack is initialized with a
∇ symbol to mark the bottom of the stack, and blank cells in the action table
indicate syntax errors in the input string. Figure 5.8 shows the sequence of
configurations which would result when these tables are used to parse the input
string (var+var)*var.

Action Table

            +          *          (          )          var        ↵
∇                                 shift (               shift var
Expr1       shift +                                                Accept
Term1       reduce 1   shift *               reduce 1              reduce 1
Factor3     reduce 3   reduce 3              reduce 3              reduce 3
(                                 shift (               shift var
Expr5       shift +                          shift )
)           reduce 5   reduce 5              reduce 5              reduce 5
+                                 shift (               shift var
Term2       reduce 2   shift *               reduce 2              reduce 2
*                                 shift (               shift var
Factor4     reduce 4   reduce 4              reduce 4              reduce 4
var         reduce 6   reduce 6              reduce 6              reduce 6

Goto Table

            Expr          Term          Factor
∇           push Expr1    push Term2    push Factor4
Expr1
Term1
Factor3
(           push Expr5    push Term2    push Factor4
Expr5
)
+                         push Term1    push Factor4
Term2
*                                       push Factor3
Factor4
var

The initial stack contains only the bottom marker ∇.
Figure 5.7: Action and Goto tables to parse simple arithmetic expressions

Stack                Input              Action      Goto

∇                    (var+var)*var ↵
                                        shift (
∇ (                  var+var)*var ↵
                                        shift var
∇ (var               +var)*var ↵
                                        reduce 6    push Factor4
∇ (Factor4           +var)*var ↵
                                        reduce 4    push Term2
∇ (Term2             +var)*var ↵
                                        reduce 2    push Expr5
∇ (Expr5             +var)*var ↵
                                        shift +
∇ (Expr5+            var)*var ↵
                                        shift var
∇ (Expr5+var         )*var ↵
                                        reduce 6    push Factor4
∇ (Expr5+Factor4     )*var ↵
                                        reduce 4    push Term1
∇ (Expr5+Term1       )*var ↵
                                        reduce 1    push Expr5
∇ (Expr5             )*var ↵
                                        shift )
∇ (Expr5)            *var ↵
                                        reduce 5    push Factor4
∇ Factor4            *var ↵
                                        reduce 4    push Term2
∇ Term2              *var ↵
                                        shift *
∇ Term2*             var ↵
                                        shift var
∇ Term2*var          ↵
                                        reduce 6    push Factor3
∇ Term2*Factor3      ↵
                                        reduce 3    push Term2
∇ Term2              ↵
                                        reduce 2    push Expr1
∇ Expr1              ↵
                                        Accept

Figure 5.8: Sequence of configurations when parsing (var+var)*var



G5:
1. Expr → Expr + Term
2. Expr → Term
3. Term → Term * Factor
4. Term → Factor
5. Factor → ( Expr )
6. Factor → var

The operation of the LR parser can be described as follows:

1. Find the action corresponding to the current input and the top stack symbol.
2. If that action is a shift action:
a. Push the input symbol onto the stack.
b. Advance the input pointer.
3. If that action is a reduce action:
a. Find the grammar rule specified by the reduce action.
b. The symbols on the right side of the rule should also be on the top of the
stack -- pop them all off the stack.
c. Use the nonterminal on the left side of the grammar rule to indicate a
column of the goto table, and use the top stack symbol to indicate a row
of the goto table. Push the indicated stack symbol onto the stack.
d. Retain the input pointer.
4. If that action is blank, a syntax error has been detected.
5. If that action is Accept, terminate.
6. Repeat from step 1.
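To make the six steps concrete, here is a hypothetical Java transcription of this loop together with the tables of Figure 5.7 (the class and helper names are our own, not from the text). The end-of-input marker ↵ is written "$" here, and in the input string the single character 'v' abbreviates the token var.

```java
import java.util.*;

// A hypothetical sketch (names are ours, not from the text): the action and
// goto tables of Figure 5.7 for grammar G5, driven by the six-step loop above.
public class G5Parser {
    // Rules 1..6 of G5: left-hand side and length of the right-hand side.
    static final String[] LHS = {"", "Expr", "Expr", "Term", "Term", "Factor", "Factor"};
    static final int[]    LEN = { 0,  3,      1,      3,      1,      3,        1};

    static final Map<String, Map<String, String>> ACTION = new HashMap<>();
    static final Map<String, Map<String, String>> GOTO   = new HashMap<>();

    static void act(String row, String col, String a) {
        ACTION.computeIfAbsent(row, k -> new HashMap<>()).put(col, a);
    }
    static void reduceRow(String row, int rule) {     // reduce on +, ), and end marker
        for (String col : new String[]{"+", "*", ")", "$"}) act(row, col, "reduce " + rule);
    }
    static void go(String row, String col, String push) {
        GOTO.computeIfAbsent(row, k -> new HashMap<>()).put(col, push);
    }

    static { // transcription of Figure 5.7 ("bottom" plays the role of the ∇ symbol)
        for (String row : new String[]{"bottom", "(", "+", "*"}) {
            act(row, "(", "shift"); act(row, "var", "shift");
        }
        act("Expr1", "+", "shift");  act("Expr1", "$", "accept");
        act("Expr5", "+", "shift");  act("Expr5", ")", "shift");
        reduceRow("Term1", 1);       act("Term1", "*", "shift"); // shift overrides * cell
        reduceRow("Term2", 2);       act("Term2", "*", "shift");
        reduceRow("Factor3", 3);
        reduceRow("Factor4", 4);
        reduceRow(")", 5);
        reduceRow("var", 6);

        go("bottom", "Expr", "Expr1"); go("bottom", "Term", "Term2"); go("bottom", "Factor", "Factor4");
        go("(", "Expr", "Expr5");      go("(", "Term", "Term2");      go("(", "Factor", "Factor4");
        go("+", "Term", "Term1");      go("+", "Factor", "Factor4");
        go("*", "Factor", "Factor3");
    }

    static boolean parse(String expr) {
        List<String> input = new ArrayList<>();
        for (char c : expr.toCharArray()) input.add(c == 'v' ? "var" : String.valueOf(c));
        input.add("$");                               // end marker
        Deque<String> stack = new ArrayDeque<>();
        stack.push("bottom");
        int i = 0;
        while (true) {
            String a = ACTION.getOrDefault(stack.peek(), Map.of()).get(input.get(i));
            if (a == null) return false;              // blank cell: syntax error
            if (a.equals("accept")) return true;
            if (a.equals("shift")) { stack.push(input.get(i++)); }
            else {                                    // "reduce r"
                int r = Integer.parseInt(a.substring(7));
                for (int k = 0; k < LEN[r]; k++) stack.pop();         // pop the right side
                String push = GOTO.getOrDefault(stack.peek(), Map.of()).get(LHS[r]);
                if (push == null) return false;
                stack.push(push);                     // push the goto entry
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("(v+v)*v")); // true
        System.out.println(parse("v+v*v"));   // true
        System.out.println(parse("(v*v"));    // false
    }
}
```

Running parse("(v+v)*v") performs exactly the sequence of shifts, reductions, and goto pushes shown in Figure 5.8.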

Sample Problem 5.2.1

Show the sequence of stack, input, action, and goto configurations
for the input var*var using the parsing tables of Figure 5.7.

Solution:

Stack                Input        Action      Goto

∇                    var*var ↵
                                  shift var
∇ var                *var ↵
                                  reduce 6    push Factor4
∇ Factor4            *var ↵
                                  reduce 4    push Term2
∇ Term2              *var ↵
                                  shift *
∇ Term2*             var ↵
                                  shift var
∇ Term2*var          ↵
                                  reduce 6    push Factor3
∇ Term2*Factor3      ↵
                                  reduce 3    push Term2
∇ Term2              ↵
                                  reduce 2    push Expr1
∇ Expr1              ↵
                                  Accept

There are three principal techniques for constructing the LR parsing tables.
In order from simplest to most complex or general, they are called: Simple LR
(SLR), Look Ahead LR (LALR), and Canonical LR (LR). SLR is the easiest
technique to implement, but works for a small class of grammars. LALR is
more difficult and works on a slightly larger class of grammars. LR is the most
general, but still does not work for all unambiguous context free grammars. In
all cases, they find a rightmost derivation when scanning from the left (hence
LR). These techniques are beyond the scope of this text, but are described in
Parsons [17] and Aho et al. [1].

5.2.1 Exercises
1. Show the sequence of stack and input configurations and the reduce and
goto operations for each of the following expressions, using the action and
goto tables of Figure 5.7.

(a) var
(b) (var)
(c) var + var * var
(d) (var*var) + var
(e) (var * var

5.3 SableCC
For many grammars, the LR parsing tables can be generated automatically from
the grammar. There are several software systems designed to generate a parser
automatically from specifications (as mentioned in section 2.4). In this chapter
we will be using software developed at McGill University, called SableCC.

5.3.1 Overview of SableCC


SableCC is described well in the thesis of its creator, Etienne Gagnon [10] (see
[Link]). The user of SableCC prepares a grammar file, as described
in section 2.4, as well as two Java classes: Translation and Compiler. These are
stored in the same directory as the parser, lexer, node, and analysis directories.
Using the grammar file as input, SableCC generates Java code whose purpose
is to compile source code as specified in the grammar file. SableCC
generates a lexer and a parser which will produce an abstract syntax tree as
output. If the user wishes to implement actions with the parser, the actions
are specified in the Translation class. An overview of this software system is
presented in Figure 5.9.

5.3.2 Structure of the SableCC Source Files


The input to SableCC is called a grammar file. This file contains the specifica-
tions for lexical tokens, as well as syntactic structures (statements, expressions,
...) of the language for which we wish to construct a compiler. Neither actions
nor attributes are included in the grammar file. There are six sections in the
grammar file:

1. Package
2. Helpers
3. States
4. Tokens
5. Ignored Tokens
6. Productions

The first four sections were described in section 2.4. The Ignored Tokens
section gives you an opportunity to specify tokens that should be ignored by the
parser (typically white space and comments). The Productions section contains
the grammar rules for the language being defined. This is where syntactic
structures such as statements, expressions, etc. are defined. Each definition
consists of the name of the syntactic type being defined (i.e. a nonterminal), an
equal sign, an EBNF definition, and a semicolon to terminate the production.
As mentioned in section 2.4, all names in this grammar file must be lower case.
An example of a production defining a while statement is shown below (l_par
and r_par are left parenthesis and right parenthesis tokens, respectively):

stmt = while l_par bool_expr r_par stmt ;

[Figure 5.9 is a diagram, summarized here: the grammar file is the input to
sablecc, which generates the parser, lexer, node, and analysis packages; the
user's Translation and Compiler source files are then compiled with javac,
together with the generated code, to produce the corresponding class files.]

Figure 5.9: Generation and compilation of a compiler using SableCC



Note that the semicolon at the end is not the token for a semicolon, but a
terminator for the stmt rule. Productions may use EBNF-like constructs. If x
is any grammar symbol, then:

x? // An optional x (0 or 1 occurrences of x)
x* // 0 or more occurrences of x
x+ // 1 or more occurrences of x

Alternative definitions, using |, are also permitted. However, alternatives
must be labeled with names enclosed in braces. The following defines an
argument list as 1 or more identifiers, separated with commas:

arg_list = {single} identifier
         | {multiple} identifier (comma identifier)+
         ;

The names single and multiple enable the user to refer to one of these
alternatives when applying actions in the Translation class. Labels must also be
used when two identical names appear in a grammar rule. Each item label must
be enclosed in brackets, and followed by a colon:

for_stmt = for l_par [init]: assign_expr semi bool_expr
           semi [incr]: assign_expr r_par stmt ;

Since there are two occurrences of assign_expr in the above definition of a
for statement, they must be labeled. The first is labeled init, and the second
is labeled incr.

5.3.3 An Example Using SableCC

The purpose of this example is to translate infix expressions involving addition,
subtraction, multiplication, and division into postfix expressions, in which the
operations are placed after both operands. Note that parentheses are never
needed in postfix expressions, as shown in the following examples:

Infix Postfix
2 + 3 * 4 2 3 4 * +
2 * 3 + 4 2 3 * 4 +
( 2 + 3 ) * 4 2 3 + 4 *
2 + 3 * ( 8 - 4 ) - 2 2 3 8 4 - * + 2 -
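Before turning to SableCC, the table above can be made concrete with a small hand-written sketch of the same translation (this code is ours, not part of the SableCC example, and it parses top-down rather than bottom-up): a recursive descent of the expr/term/factor grammar that emits each operator after both of its operands.

```java
// A hypothetical sketch (ours, not part of the SableCC example): a tiny
// hand-written infix-to-postfix translator, shown only to make the target
// output concrete. SableCC generates a bottom-up parser instead; this
// recursive descent version merely reproduces the table of examples above.
public class PostfixSketch {
    private final String in;
    private int pos = 0;
    private final StringBuilder out = new StringBuilder();

    PostfixSketch(String infix) { this.in = infix.replace(" ", ""); }

    static String translate(String infix) {
        PostfixSketch p = new PostfixSketch(infix);
        p.expr();
        return p.out.toString().trim();
    }

    // expr -> term (('+'|'-') term)*  : emit the operator after both operands
    private void expr() {
        term();
        while (pos < in.length() && (in.charAt(pos) == '+' || in.charAt(pos) == '-')) {
            char op = in.charAt(pos++);
            term();
            out.append(op).append(' ');
        }
    }

    // term -> factor (('*'|'/') factor)*
    private void term() {
        factor();
        while (pos < in.length() && (in.charAt(pos) == '*' || in.charAt(pos) == '/')) {
            char op = in.charAt(pos++);
            factor();
            out.append(op).append(' ');
        }
    }

    // factor -> number | '(' expr ')'
    private void factor() {
        if (pos < in.length() && in.charAt(pos) == '(') {
            pos++;            // consume '('
            expr();
            pos++;            // consume ')'
        } else {
            int start = pos;  // scan a whole number: one or more digits
            while (pos < in.length() && Character.isDigit(in.charAt(pos))) pos++;
            out.append(in, start, pos).append(' ');
        }
    }

    public static void main(String[] args) {
        System.out.println(translate("2 + 3 * 4"));     // 2 3 4 * +
        System.out.println(translate("( 2 + 3 ) * 4")); // 2 3 + 4 *
    }
}
```

Note how the parenthesized example needs no parentheses in its postfix form, exactly as the table shows.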

There are four sections in the grammar file for this program. The first section
specifies that the package name is 'postfix'. All Java software for this program
will be part of this package. The second section defines the tokens to be used.
No Helpers are needed, since the numbers are simple whole numbers, specified
as one or more digits. The third section specifies that blank (white space)
tokens are to be ignored; this includes tab characters and newline characters.
Thus the user may input infix expressions in free format. The fourth section,
called Productions, defines the syntax of infix expressions. It is similar to the
grammar given in section 3.1, but includes subtraction and division operations.
Note that each alternative definition for a syntactic type must have a label in
braces. The grammar file is shown below:

Package postfix;

Tokens
number = ['0'..'9']+;
plus = '+';
minus = '-';
mult = '*';
div = '/';
l_par = '(';
r_par = ')';
blank = (' ' | 10 | 13 | 9)+ ;
semi = ';' ;

Ignored Tokens
blank;

Productions
expr =
{term} term |
{plus} expr plus term |
{minus} expr minus term
;
term =
{factor} factor |
{mult} term mult factor |
{div} term div factor
;
factor =
{number} number |
{paren} l_par expr r_par
;

Now we wish to include actions which will put out postfix expressions.
SableCC will produce parser software which will create an abstract syntax tree
for a particular infix expression, using the given grammar. SableCC will also
produce a class called DepthFirstAdapter, which has methods capable of visiting
every node in the syntax tree. In order to implement actions, all we need to do
is extend DepthFirstAdapter (the extended class is usually called Translation),
and override methods corresponding to rules (or tokens) in our grammar. For
example, since our grammar contains an alternative, Mult, in the definition of
Term, the DepthFirstAdapter class contains a method named outAMultTerm.
It will have one parameter which is the node in the syntax tree corresponding
to the Term. Its signature is
public void outAMultTerm (AMultTerm node)
This method will be invoked when this node in the syntax tree, and all its
descendants, have been visited in a depth-first traversal. In other words, a Term,
consisting of a Term, a mult (i.e. a '*'), and a Factor, has been successfully
scanned. To include an action for this rule, all we need to do is override the
outAMultTerm method in our extended class (Translation). In our case we want
to print out a ’+’ after scanning a ’+’ and both of its operands. This is done by
overriding the outAPlusExpr method. When do we print out a number? This
is done when a number is seen in the {number} alternative of the definition of
factor. Therefore, override the method outANumberFactor. In this method all
we need to do is print the parameter node (all nodes have toString() methods,
and therefore can be printed). The Translation class is shown below:

package postfix;
import postfix.analysis.*; // needed for DepthFirstAdapter
import postfix.node.*;     // needed for syntax tree nodes.

class Translation extends DepthFirstAdapter
{
   public void outAPlusExpr(APlusExpr node)
   {// out of alternative {plus} in expr, we print the plus.
      System.out.print (" + ");
   }

   public void outAMinusExpr(AMinusExpr node)
   {// out of alternative {minus} in expr, we print the minus.
      System.out.print (" - ");
   }

   public void outAMultTerm(AMultTerm node)
   {// out of alternative {mult} in term, we print the mult.
      System.out.print (" * ");
   }

   public void outADivTerm(ADivTerm node)
   {// out of alternative {div} in term, we print the div.
      System.out.print (" / ");
   }

   public void outANumberFactor (ANumberFactor node)
   // out of alternative {number} in factor, we print the number.
   { System.out.print (node + " "); }
}

There are other methods in the DepthFirstAdapter class which may also be
overridden in the Translation class, but which were not needed for this example.
They include the following:

• There is an 'in' method for each alternative, which is invoked when a node
is about to be visited. In our example, this would include the method
public void inAMultTerm (AMultTerm node)

• There is a 'case' method for each alternative. This is the method that visits
all the descendants of a node, and it is not normally necessary to override
this method. An example would be public void caseAMultTerm
(AMultTerm node)

• There is also a 'case' method for each token; the token name is prefixed
with a 'T' as shown below:

public void caseTNumber (TNumber token)
{ // action for number tokens }

An important problem to be addressed is how to invoke an action in the
middle of a rule (an embedded action). Consider the while statement definition:

while_stmt = {while} while l_par bool_expr r_par stmt ;

Suppose we wish to put out an LBL atom after the while keyword token is
seen. There are two ways to do this. The first way is to rewrite the grammar,
and include a new nonterminal for this purpose (here we call it while_token):

while_stmt = {while} while_token l_par
             bool_expr r_par stmt ;
while_token = while ;

Now the method to be overridden could be:

public void outAWhileToken (AWhileToken node)
{ System.out.print ("LBL") ; } // put out a LBL atom.

The other way to solve this problem would be to leave the grammar as is
and override the case method for this alternative. The case methods have not
been explained in full detail, but all the user needs to do is to copy the case
method from DepthFirstAdapter, and add the action at the appropriate place.
In this example it would be:

public void caseAWhileStmt (AWhileStmt node)
{ inAWhileStmt(node);
  if(node.getWhile() != null)
  { node.getWhile().apply(this); }
  ///////////// insert action here //////////////////
  System.out.print ("LBL"); // embedded action
  ///////////////////////////////////////////////////
  if(node.getLPar() != null)
  { node.getLPar().apply(this); }
  if(node.getBoolExpr() != null)
  { node.getBoolExpr().apply(this); }
  if(node.getRPar() != null)
  { node.getRPar().apply(this); }
  if (node.getStmt() != null)
  { node.getStmt().apply (this) ; }
  outAWhileStmt (node);
}

The student may have noticed that SableCC tends to alter names that were
included in the grammar. This is done to prevent ambiguities. For example,
l_par becomes LPar, and bool_expr becomes BoolExpr.
In addition to a Translation class, we also need a Compiler class. This is the
class which contains the main method, which invokes the parser. The Compiler
class is shown below:

package postfix;
import postfix.parser.*;
import postfix.lexer.*;
import postfix.node.*;
import java.io.*;

public class Compiler
{
   public static void main(String[] arguments)
   { try
      { System.out.println("Type one expression");

        // Create a Parser instance.
        Parser p = new Parser
           ( new Lexer
              ( new PushbackReader
                 ( new InputStreamReader(System.in), 1024)));

        // Parse the input.
        Start tree = p.parse();

        // Apply the translation.
        tree.apply(new Translation());

        System.out.println();
      }
      catch(Exception e)
      { System.out.println(e.getMessage()); }
   }
}

This completes our example on translating infix expressions to postfix. The
source code is available at [Link]. In section 2.3 we discussed the use of
hash tables in lexical analysis. Here again we make use of hash tables, this
time using the Java class HashMap (from java.util). This is a general
storage-lookup table for any kind of objects. Use the put method to store an
object, with a key:

void put (Object key, Object value);

and use the get method to retrieve a value from the table:

Object get (Object key)
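As a standalone illustration of this put/get pattern (the keys and values below are invented for the example; in the Translation class of the sample problem that follows, the keys are syntax-tree nodes and the values are memory locations):

```java
import java.util.HashMap;
import java.util.Map;

// A standalone sketch of the put/get pattern described above, using the
// generic form of HashMap. The keys and values are made up for illustration.
public class HashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> location = new HashMap<>();
        location.put("34", 1);                   // store: key "34" maps to location T1
        location.put("23", 2);
        System.out.println(location.get("34"));  // 1
        System.out.println(location.get("99"));  // null: no such key
    }
}
```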

Sample Problem 5.3.1

Use SableCC to translate infix expressions involving addition,
subtraction, multiplication, and division of whole numbers into atoms.
Assume that each number is stored in a temporary memory location
when it is encountered. For example, the following infix expression:

34 + 23 * 8 - 4

should produce the list of atoms:

MUL T2 T3 T4
ADD T1 T4 T5
SUB T5 T6 T7

Here it is assumed that 34 is stored in T1, 23 is stored in T2, 8
is stored in T3, and 4 is stored in T6.

Solution:

Since we are again dealing with infix expressions, the grammar
given in this section may be reused. Simply change the package name
to exprs. The Compiler class may also be reused as is. All we need
to do is rewrite the Translation class.
To solve this problem we will need to allocate memory locations
for sub-expressions and remember where they are. For this purpose
we use a java Map. A Map stores key-value pairs, where the key
may be any object, and the value may be any object. Once a value
has been stored (with a put method), it can be retrieved with its key
(using the get method). In our Map, the key will always be a Node,
and the value will always be an Integer. The Translation class is
shown below:

package exprs;
import exprs.analysis.*;
import exprs.node.*;
import java.util.*;    // for Map, HashMap
import java.io.*;

class Translation extends DepthFirstAdapter
{
// Use a Map to store the memory locations for exprs
// Any node may be a key, its memory location will be the
// value, in a (key,value) pair.

Map <Node, Integer> hash = new HashMap <Node, Integer>();

public void caseTNumber(TNumber node)
// Allocate memory loc for this node, and put it into
// the map.
{ hash.put (node, alloc()); }

public void outATermExpr (ATermExpr node)
{ // Attribute of the expr same as the term
  hash.put (node, hash.get(node.getTerm()));
}

public void outAPlusExpr(APlusExpr node)
{ // out of alternative {plus} in Expr, we generate an
  // ADD atom.
  int i = alloc();
  hash.put (node, i);
  atom ("ADD", (Integer)hash.get(node.getExpr()),
        (Integer)hash.get(node.getTerm()), i);
}

public void outAMinusExpr(AMinusExpr node)
{ // out of alternative {minus} in Expr,
  // generate a SUB atom.
  int i = alloc();
  hash.put (node, i);
  atom ("SUB", (Integer)hash.get(node.getExpr()),
        (Integer)hash.get(node.getTerm()), i);
}

public void outAFactorTerm (AFactorTerm node)
{ // Attribute of the term same as the factor
  hash.put (node, hash.get(node.getFactor()));
}

public void outAMultTerm(AMultTerm node)
{ // out of alternative {mult} in Term, generate a MUL
  // atom.
  int i = alloc();
  hash.put (node, i);
  atom ("MUL", (Integer)hash.get(node.getTerm()),
        (Integer)hash.get(node.getFactor()), i);
}

public void outADivTerm(ADivTerm node)
{ // out of alternative {div} in Term,
  // generate a DIV atom.
  int i = alloc();
  hash.put (node, i);
  atom ("DIV", (Integer)hash.get(node.getTerm()),
        (Integer)hash.get(node.getFactor()), i);
}

public void outANumberFactor (ANumberFactor node)
{ hash.put (node, hash.get (node.getNumber())); }

public void outAParenFactor (AParenFactor node)
{ hash.put (node, hash.get (node.getExpr())); }

void atom (String atomClass, Integer left, Integer right,
           Integer result)
{ System.out.println (atomClass + " T" + left + " T" +
      right + " T" + result);
}

static int avail = 0;

int alloc()
{ return ++avail; }
}
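The effect of this class can be seen without running SableCC. The sketch below reproduces the same technique in plain Java: a HashMap assigns each leaf a temporary location, and each operator node allocates a new temporary and emits an atom. MiniNode and AtomDemo are hypothetical stand-ins for the generated node classes, not part of any SableCC output.

```java
import java.util.*;

// Hypothetical stand-in for SableCC's generated node classes.
class MiniNode {
    String op;            // "+" or "*" for an operator node, null for a number leaf
    MiniNode left, right;
    MiniNode(String op, MiniNode l, MiniNode r) { this.op = op; left = l; right = r; }
    static MiniNode num() { return new MiniNode(null, null, null); }
}

public class AtomDemo {
    static Map<MiniNode, Integer> hash = new HashMap<>();
    static List<String> atoms = new ArrayList<>();
    static int avail = 0;
    static int alloc() { return ++avail; }

    // Depth-first walk: leaves get a fresh location (like caseTNumber),
    // operator nodes allocate a result location and emit an atom (like outAPlusExpr).
    static int walk(MiniNode n) {
        if (n.op == null) { hash.put(n, alloc()); return hash.get(n); }
        int l = walk(n.left), r = walk(n.right);
        int i = alloc();
        hash.put(n, i);
        atoms.add((n.op.equals("+") ? "ADD" : "MUL") + " T" + l + " T" + r + " T" + i);
        return i;
    }

    public static List<String> translate(MiniNode root) {
        hash.clear(); atoms.clear(); avail = 0;
        walk(root);
        return atoms;
    }

    public static void main(String[] args) {
        // tree for (5+3)*2: 5 -> T1, 3 -> T2, ADD T1 T2 T3, 2 -> T4, MUL T3 T4 T5
        MiniNode tree = new MiniNode("*",
            new MiniNode("+", MiniNode.num(), MiniNode.num()), MiniNode.num());
        for (String a : translate(tree)) System.out.println(a);
    }
}
```

Running this for the tree of (5+3)*2 prints ADD T1 T2 T3 followed by MUL T3 T4 T5, which is the same allocation order the depth-first adapter produces.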

5.3.4 Exercises
1. Which of the following input strings would cause this SableCC program
to produce a syntax error message?

Tokens
a = ’a’;
b = ’b’;
c = ’c’;
newline = [10 + 13];
Productions
line = s newline ;
s = {a1} a s b
| {a2} b w c
;
w = {a1} b w b
| {a2} a c
;

(a) bacc (b) ab (c) abbacbcb (d) bbacbc (e) bbacbb


2. Using the SableCC program from problem 1, show the output produced
by each of the input strings given in Problem 1, using the Translation
class shown below.

package ex5_3;
import ex5_3.analysis.*;
import ex5_3.node.*;
import java.util.*;
import java.io.*;

class Translation extends DepthFirstAdapter
{

public void outAA1S (AA1S node)
{ System.out.println ("rule 1"); }

public void outAA2S (AA2S node)
{ System.out.println ("rule 2"); }

public void outAA1W (AA1W node)
{ System.out.println ("rule 3"); }

public void outAA2W (AA2W node)
{ System.out.println ("rule 4"); }
}

3. A Sexpr is an atom or a pair of Sexprs enclosed in parentheses and separated
with a period. For example, if A, B, C, ...Z and NIL are all atoms,
then the following are examples of Sexprs:
A (A.B) ((A.B).(B.C)) (A.(B.(C.NIL)))
A List is a special kind of Sexpr. A List is the atom NIL or a List is a
dotted pair of Sexprs in which the first part is an atom or a List and the
second part is a List. The following are examples of lists:
NIL (A.NIL) ((A.NIL).NIL) ((A.NIL).(B.NIL)) (A.(B.(C.NIL)))
(a) Show a SableCC grammar that defines a Sexpr.
(b) Show a SableCC grammar that defines a List.
(c) Add a Translation class to your answer to part (b) so that it will print
out the total number of atoms in a List. For example:
((A.NIL).(B.(C.NIL))) 5 atoms
4. Use SableCC to implement a syntax checker for a typical database com-
mand language. Your syntax checker should handle at least the following
kinds of commands:

RETRIEVE employee_file
PRINT

DISPLAY FOR salary >= 1000000


PRINT FOR "SMITH" = lastname

5. The following SableCC grammar and Translation class are designed to


implement a simple desk calculator with the standard four arithmetic
functions (it uses floating-point arithmetic only). When compiled and
run, the program will evaluate a list of arithmetic expressions, one per
line, and print the results. For example:

2+3.2e-2
2+3*5/2
(2+3)*5/2
16/(2*3 - 6*1.0)
2.032
9.5
12.5
infinity

Unfortunately, the grammar and Java code shown below are incorrect.
There are four mistakes, some of which are syntactic errors in the gram-
mar; some of which are syntactic Java errors; some of which cause run-time
errors; and some of which don’t produce any error messages, but do pro-
duce incorrect output. Find and correct all four mistakes. If possible, use
a computer to help debug these programs.
The grammar, [Link] is shown below:

Package exprs;

Helpers
digits = [’0’..’9’]+ ;
exp = [’e’ + ’E’] [’+’ + ’-’]? digits ;
Tokens
number = digits ’.’? digits? exp? ;
plus = ’+’;
minus = ’-’;
mult = ’*’;
div = ’/’;
l_par = ’(’;
r_par = ’)’;
newline = [10 + 13] ;
blank = (’ ’ | ’t’)+;
semi = ’;’ ;

Ignored Tokens

blank;

Productions
exprs = expr newline
| exprs embed
;
embed = expr newline;
expr =
{term} term |
{plus} expr plus term |
{minus} expr minus term
;
term =
{factor} factor |
{mult} term mult factor |
{div} term div factor |
;
factor =
{number} number |
{paren} l_par expr r_par
;

The Translation class is shown below:

package exprs;
import exprs.analysis.*;
import exprs.node.*;
import java.util.*;

class Translation extends DepthFirstAdapter
{
Map <Node, Integer> hash =
    new HashMap <Node, Integer> (); // store expr values

public void outAE1Exprs (AE1Exprs node)
{ System.out.println (" " + getVal (node.getExpr())); }

public void outAEmbed (AEmbed node)
{ System.out.println (" " + getVal (node.getExpr())); }

public void caseTNumber(TNumber node)
{ hash.put (node, new Double (node.getText())) ; }

public void outAPlusExpr(APlusExpr node)
{ // out of alternative {plus} in Expr, we add the
  // expr and the term
  hash.put (node, new Double (getPrim (node.getExpr())
      + getPrim(node.getTerm())));
}

public void outAMinusExpr(AMinusExpr node)
{ // out of alternative {minus} in Expr, subtract the term
  // from the expr
  hash.put (node, new Double (getPrim(node.getExpr())
      - getPrim(node.getTerm())));
}

public void outAFactorTerm (AFactorTerm node)
{ // Value of the term same as the factor
  hash.put (node, getVal(node.getFactor())) ;
}

public void outAMultTerm(AMultTerm node)
{ // out of alternative {mult} in Term, multiply the term
  // by the factor
  hash.put (node, new Double (getPrim(node.getTerm())
      * getPrim(node.getFactor())));
}

public void outADivTerm(ADivTerm node)
{ // out of alternative {div} in Term, divide the term by
  // the factor
  hash.put (node, new Double (getPrim(node.getTerm())
      / getPrim(node.getFactor())));
}

public void outANumberFactor (ANumberFactor node)
{ hash.put (node, getVal (node.getNumber())); }

public void outAParenFactor (AParenFactor node)
{ hash.put (node, new Double (0.0)); }

double getPrim (Node node)
{ return ((Double) hash.get (node)).doubleValue(); }

Double getVal (Node node)
{ return hash.get (node) ; }
}

6. Show the SableCC grammar which will check for proper syntax of regular
expressions over the alphabet {0,1}. Observe the precedence rules for the
three operations. Some examples are shown:

Valid Not Valid

(0+1)*.1.1 *0
0.1.0* (0+1)+1)
((0)) 0+

5.4 Arrays
Although arrays are not included in our definition of Decaf, they are of such
great importance to programming languages and computing in general, that we
would be remiss not to mention them at all in a compiler text. We will give a
brief description of how multi-dimensional array references can be implemented
and converted to atoms, but for a more complete and efficient implementation
the student is referred to Parsons [17] or Aho et al. [1].
The main problem that we need to solve when referencing an array element
is that we need to compute an offset from the first element of the array. Though
the programmer may be thinking of multi-dimensional arrays (actually arrays
of arrays) as existing in two, three, or more dimensions, they must be physically
mapped to the computer’s memory, which has one dimension. For example, an
array declared as int n[][][] = new int [2][3][4]; might be envisioned by
the programmer as a structure having three rows and four columns in each of
two planes as shown in Figure 5.10 (a). In reality, this array is mapped into
a sequence of twenty-four (2*3*4) contiguous memory locations as shown in
Figure 5.10 (b). The problem which the compiler must solve is to convert an
array reference such as n[1][1][0] to an offset from the beginning of the storage
area allocated for n. For this example, the offset would be sixteen memory cells
(assuming that each element of the array occupies one memory cell).
To see how this is done, we will begin with a simple one-dimensional array
and then proceed to two and three dimensions. For a vector, or one-dimensional
array, the offset is simply the subscripting value, since subscripts begin at 0 in
Java. For example, if v were declared to contain twenty elements, char v[] =
new char[20];, then the offset for the fifth element, v[4], would be 4, and in
general the offset for a reference v[i] would be i. The simplicity of this formula
results from the fact that array indexing begins with 0 rather than 1. A vector
maps directly to the computer’s memory.
Now we introduce arrays of arrays, which, for the purposes of this discussion,
we call multi-dimensional arrays; suppose m is declared as a matrix, or two-
dimensional array, char m[][] = new char [10][15];. We are thinking of
this as an array of 10 rows, with 15 elements in each row. A reference to an
element of this array will compute an offset of fifteen elements for each row after

[Figure: part (a) depicts the 2x3x4 array as two planes, each with three rows
and four columns; part (b) depicts the same array as twenty-four consecutive
memory cells, with the cells for n[0][0][0], n[0][1][0], n[0][2][0],
n[0][3][0], and n[1][2][3] marked.]

Figure 5.10: A three-dimensional array n[2][3][4] (a) mapped into a one-
dimensional memory (b).

the first. Also, we must add to this offset the position of the selected element
within its row. For example, a reference to m[4][7] would require an offset of 4*15 + 7 =
67. The reference m[r][c] would require an offset of r*15 + c. In general, for
a matrix declared as char m[][] = new char [ROWS][ COLS], the formula
for the offset of m[r][c] is r*COLS + c.
For a three-dimensional array, char a[][][] = new char [5][6][7];, we must sum
an offset for each plane (6*7 elements), an offset for each row (7 elements), and
an offset for the elements in the selected row. For example, the offset for the ref-
erence a[2][3][4] is found by the formula 2*6*7 + 3*7 + 4. The reference a[p][r][c]
would result in an offset computed by the formula p*6*7 + r*7 + c. In general,
for a three-dimensional array, new char [PLANES][ROWS][COLS], the reference
a[p][r][c] would require an offset computed by the formula p*ROWS*COLS +
r*COLS + c.
We now generalize what we have done to an array that has any number of
dimensions. Each subscript is multiplied by the total number of elements in all
higher dimensions. If an n-dimensional array is declared as char a[][]...[]
= new char[D1][D2][D3]...[Dn], then a reference to a[S1][S2][S3]...[Sn]
will require an offset computed by the following formula:
S1*D2*D3*D4*...*Dn + S2*D3*D4*...*Dn + S3*D4*...*Dn + ... + S(n-1)*Dn + Sn.
In this formula, Di represents the number of elements in the ith dimension
and Si represents the ith subscript in a reference to the array. Note that in some
languages, such as Java and C, not all of the subscripts are required. For example,
the array of three dimensions a[2][3][4] may be referenced with two, one, or
even zero subscripts. a[1] refers to the address of the first element in the second

plane; i.e. all missing subscripts are assumed to be zero.
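The rule "each subscript times the product of all higher dimensions" can be checked with a few lines of Java. This is a straightforward sketch of the formula above, assuming one memory cell per element; the class and method names are mine, not from the text.

```java
public class ArrayOffset {
    // Offset of a[s[0]][s[1]]...[s[n-1]] for an array declared with
    // dimensions d[0], d[1], ..., d[n-1], one memory cell per element.
    public static int offset(int[] d, int[] s) {
        int off = 0;
        for (int i = 0; i < s.length; i++) {
            int prod = 1;                          // product of all higher dimensions
            for (int j = i + 1; j < d.length; j++) prod *= d[j];
            off += s[i] * prod;
        }
        return off;
    }

    public static void main(String[] args) {
        // the example from the text: n[1][1][0] in new int[2][3][4]
        System.out.println(offset(new int[]{2, 3, 4}, new int[]{1, 1, 0})); // 16
    }
}
```

For the earlier examples it reproduces the values worked out by hand: m[4][7] in a 10-by-15 matrix gives 4*15 + 7 = 67, and a[2][3][4] in new char[5][6][7] gives 2*6*7 + 3*7 + 4 = 109.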


Notice that some parts of the formula shown above can be computed at
compile time. For example, for arrays which are dimensioned with constants, the
product of dimensions D2 *D3 *D4 can be computed at compile time. However,
since subscripts can be arbitrary expressions, the complete offset may have to
be computed at run time.
The atoms which result from an array reference must compute the offset as
described above. Specifically, for each dimension, i, we will need a MUL atom
to multiply Si by the product of dimensions from Di+1 through Dn , and we will
need an ADD atom to add the term for this dimension to the sum of the previous
terms. Before showing a translation grammar for this purpose, however, we will
first show a grammar without action symbols or attributes, which defines array
references. Grammar G22 is an extension to the grammar for simple arithmetic
expressions, G5, given in section 3.1. Here we have changed rule 7 and added
rules 8,9.
G22
1. Expr → Expr + Term
2. Expr → Term
3. Term → Term * Factor
4. Term → Factor
5. Factor → ( Expr )
6. Factor → const
7. Factor → var Subs
8. Subs → [ Expr ] Subs
9. Subs → ε

This extension merely states that a variable may be followed by a list of sub-
scripting expressions, each in square brackets (the nonterminal Subs represents
a list of subscripts).
Grammar G23 shows rules 7-9 of grammar G22, with attributes and action
symbols. Our goal is to come up with a correct offset for a subscripted variable
in grammar rule 8, and provide its address for the attribute of the Subs defined
in that rule.
Grammar G23:

7. Factor_e → var_v {MOV}_0,,sum Subs_v,sum,i
      e ← v[sum]
      i ← 1
      sum ← Alloc
8. Subs_v,sum,i1 → [ Expr_e ] {MUL}_e,=D,T {ADD}_sum,T,sum Subs_v,sum,i2
      D ← prod(v,i1)
      i2 ← i1 + 1
      T ← Alloc
9. Subs_v,sum,i → {check}_i,v

The nonterminal Subs has three attributes: v (inherited) represents a reference
to the symbol table for the array being referenced, sum (synthesized)
represents the location storing the sum of the terms which compute the offset,
and i (inherited) is the dimension being processed. In the attribute computation
rules for grammar rule 8, there is a call to a method prod(v,i). This method
computes the product of the dimensions of the array v, above dimension i. As
noted above, this product can be computed at compile time. Its value is then
stored as a constant, D, and referred to in the grammar as =D.
The first attribute rule for grammar rule 7 specifies e ← v[sum]. This means
that the value of sum is used as an offset to the address of the variable v, which
then becomes the attribute of the Factor defined in rule 7.
The compiler should ensure that the number of subscripts in the array ref-
erence is equal to the number of subscripts in the array declaration. If they are
not equal, an error message should be put out. This is done by a procedure
named check(i,v) which is specified by the action symbol {check}i,v in rule 9.
This action symbol represents a procedure call, not an atom. The purpose of
the procedure is to compare the number of dimensions of the variable, v, as
stored in the symbol table, with the value of i, the number of subscripts plus
one. The check(i,v) method simply puts out an error message if the number of
subscripts does not equal the number of dimensions, and the parse continues.
To see how this translation grammar works, we take an example of a three-
dimensional array declared as int a[][][] = new int[3][5][7]. An attributed deriva-
tion tree for the reference a[p][r][c] is shown in Figure 5.11 (for simplicity we
show only the part of the tree involving the subscripted variable, not an entire
expression). To build this derivation tree, we first build the tree without
attributes and then fill in attribute values where possible. Note that the first and
third attributes of Subs are inherited and derive values from higher nodes or
nodes on the same level in the tree. The final result is the offset stored in the
attribute sum, which is added to the attribute of the variable being subscripted
to obtain the offset address. This is then the attribute of the Factor which is
passed up the tree.

Sample Problem 5.4.1

Assume the array m has been declared to have two planes, four
rows, and five columns: m = new char[2] [4] [5];. Show the at-
tributed derivation tree generated by grammar G23 for the array ref-
erence m[x][y][z]. Use Factor as the starting nonterminal, and
show the subscripting expressions as Expr, as done in Figure 5.11.
Also show the sequence of atoms which would be put out as a result
of this array reference.

Factor_a[T1]

  var_a   {MOV}_0,,T1   Subs_a,T1,1

    [ Expr_p ]   {MUL}_p,=35,T2   {ADD}_T1,T2,T1   Subs_a,T1,2

      [ Expr_r ]   {MUL}_r,=7,T3   {ADD}_T1,T3,T1   Subs_a,T1,3

        [ Expr_c ]   {MUL}_c,=1,T4   {ADD}_T1,T4,T1   Subs_a,T1,4

          {check}_4,a

Figure 5.11: A derivation tree for the array reference a[p][r][c], which is declared
as int a[3][5][7], using grammar G23.

Solution:

Factor_m[T1]

  var_m   {MOV}_0,,T1   Subs_m,T1,1

    [ Expr_x ]   {MUL}_x,=20,T2   {ADD}_T1,T2,T1   Subs_m,T1,2

      [ Expr_y ]   {MUL}_y,=5,T3   {ADD}_T1,T3,T1   Subs_m,T1,3

        [ Expr_z ]   {MUL}_z,=1,T4   {ADD}_T1,T4,T1   Subs_m,T1,4

          {check}_4,m

The atoms put out are:
{MOV}_0,,T1  {MUL}_x,=20,T2  {ADD}_T1,T2,T1  {MUL}_y,=5,T3  {ADD}_T1,T3,T1
{MUL}_z,=1,T4  {ADD}_T1,T4,T1  {check}_4,m

5.4.1 Exercises
1. Assume the following array declarations:

int v[] = new int [13];



int m[][] = new int [12][17];


int a3[][][] = new int [15][7][5];
int z[][][][] = new int [4][7][2][3];

Show the attributed derivation tree resulting from grammar G23 for each
of the following array references. Use Factor as the starting nonterminal,
and show each subscript expression as Expr, as done in Figure 5.11. Also
show the sequence of atoms that would be put out.

(a) v[7]
(b) m[q][2]
(c) a3[11][b][4]
(d) z[2][c][d][2]
(e) m[1][1]

2. The discussion in this section assumed that each array element occupied
one addressable memory cell. If each array element occupies SIZE memory
cells, what changes would have to be made to the general formula given
in this section for the offset? How would this affect grammar G23?

3. You are given two vectors: the first, d, contains the dimensions of a
declared array, and the second, s, contains the subscripting values in a
reference to that array.
(a) Write a Java method:
int offSet (int d[], int s[]);
that computes the offset for an array reference a[s0][s1]...[s(max-1)] where
the array has been declared as char a[d0][d1]...[d(max-1)].
(b) Improve your Java method, if possible, to minimize the number of
run-time multiplications.

5.5 Case Study: Syntax Analysis for Decaf


In this section we continue the development of a compiler for Decaf, a small
subset of the Java programming language. We do this by implementing the
syntax analysis phase of the compiler using SableCC as described in Section
5.3, above. The parser generated by SableCC will obtain input tokens from the
standard input stream. The parser will then check the tokens for correct syntax.
In addition, we provide a Translation class which enables our parser to put
out atoms corresponding to the run-time operations to be performed. This
aspect of compilation is often called semantic analysis. For more complex lan-
guages, semantic analysis would also involve type checking, type conversions,
identifier scopes, array references, and symbol table management. Since these

will not be necessary for the Decaf compiler, syntax analysis and semantic anal-
ysis have been combined into one program.
The complete SableCC grammar file and Translation source code are shown
in Appendix B and are explained here. The input to SableCC is the file
decaf.grammar, which generates classes for the parser, nodes, lexer, and analysis.
In the Tokens section, we define the two types of comments; comment1 is a
single-line comment, beginning with // and ending with a newline character.
comment2 is a multi-line comment, beginning with /* and ending with */. Nei-
ther of these tokens requires the use of states, which is why there is no States
section in our grammar. Next each keyword is defined as a separate token taking
care to include these before the definition of identifiers. These are followed by
special characters ’+’, ’-’, ;, .... Note that relational operators are defined collec-
tively as a compare token. Finally we define identifiers and numeric constants
as tokens. The Ignored Tokens are space and the two comment tokens.
The Productions section is really the Decaf grammar with some modifica-
tions to allow for bottom-up parsing. The major departure from what has been
given previously and in Appendix A, is the definition of the if statement. We
need to be sure to handle the dangling else appropriately; this is the ambiguity
problem discussed in section 3.1 caused by the fact that an if statement has
an optional else part. This problem was relatively easy to solve when parsing
top-down, because the ambiguity was always resolved in the correct way simply
by checking for an else token in the input stream. When parsing bottom-up,
however, we get a shift-reduce conflict from this construct. If we rewrite the
grammar to eliminate the ambiguity, as in section 3.1 (Grammar G7), we still
get a shift-reduce conflict. Unfortunately, in SableCC there is no way to resolve
this conflict always in favor of a shift (this is possible with yacc). Therefore, we
will need to rewrite the grammar once again; we use a grammar adapted from
Appel [3]. In this grammar a no short if statement is one which does not contain
an if statement without a matching else. The EBNF capabilities of SableCC
are used, for example, in the definition of compound stmt, which consists of a
pair of braces enclosing 0 or more statements. The complete grammar is shown
in appendix B. An array of Doubles named ’memory’ is used to store the values
of numeric constants.
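The idea behind the dangling-else rewrite can be sketched in skeletal SableCC-style productions. The token and nonterminal names below are illustrative, not the ones used in the actual grammar file in Appendix B: a statement with a matching else may contain only statements whose elses are all matched, so every else is forced to bind to the nearest unmatched if.

```
stmt           = {matched}   matched_stmt
               | {unmatched} unmatched_stmt ;
matched_stmt   = {if_else} if l_par bool_expr r_par matched_stmt else matched_stmt
               | {other}   other_stmt ;
unmatched_stmt = {if}      if l_par bool_expr r_par stmt
               | {if_else} if l_par bool_expr r_par matched_stmt else unmatched_stmt ;
```

With this split there is no parser state in which both shifting an else and reducing an if statement are legal, so the shift-reduce conflict disappears.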
The Translation class, also shown in appendix B, is written to produce atoms
for the arithmetic operations and control structures. The structure of an atom is
shown in Figure 5.12. The Translation class uses a few Java maps: the first map,
implemented as a HashMap and called ’hash’, stores the temporary memory
location associated with each sub-expression (i.e. with each node in the syntax
tree). It also stores label numbers for the implementation of control structures.
Hence, the keys for this map are nodes, and the values are the integer run-time
memory locations, or label numbers, associated with them. The second map,
called ’nums’, stores the values of numeric constants, hence if a number occurs
several times in a Decaf program, it need be stored only once in this map. The
third map is called ’identifiers’. This is our Decaf symbol table. Each identifier is
stored once, when it is declared. The Translation class checks that an identifier is
not declared more than once (local scope is not permitted), and it checks that an

op Operation of Atom
left Left operand location
right Right operand location
result Result operand location
cmp Comparison code for TST atoms
dest Destination, for JMP, LBL, and TST atoms

Figure 5.12: Record structure of the file of atoms

identifier has been declared before it is used. For both numbers and identifiers,
the value part of each entry stores the run-time memory location associated with
it. The implementation of control structures for if, while, and for statements
follows that which was presented in section 4.9. A boolean expression always
results in a TST atom which branches if the comparison operation result is
false. Whenever a new temporary location is needed, the method alloc provides
the next available location (a better compiler would re-use previously allocated
locations when possible). Whenever a new label number is needed, it is provided
by the lalloc method. Note that when an integer value is stored in a map, it
must be an object, not a primitive. Therefore, we use the wrapper class for
integers provided by Java, Integer. The complete Translation class is shown in
appendix B and is available at [Link]
For more documentation on SableCC, visit [Link]

5.5.1 Exercises
1. Extend the Decaf language to include a do statement defined as:
DoStmt → do Stmt while ( BoolExpr ) ;
Modify the files decaf.grammar and Translation.java, shown in Appendix
B so that the compiler puts out the correct atom sequence implementing
this control structure, in which the test for termination is made after the
body of the loop is executed. The nonterminals Stmt and BoolExpr are
already defined. For purposes of this assignment you may alter the atom
method so that it prints out its arguments to stdout rather than building
a file of atoms.

2. Extend the Decaf language to include a switch statement defined as:


SwitchStmt → switch ( Expr ) CaseList
CaseList → case number ’:’ Stmt CaseList
CaseList → case number ’:’ Stmt
Modify the files decaf.grammar and Translation.java, shown in Appendix
B, so that the compiler puts out the correct atom sequence implement-

ing this control structure. The nonterminals Expr and Stmt are already
defined, as are the tokens number and end. The token switch needs to
be defined. Also define a break statement which will be used to transfer
control out of the switch statement. For purposes of this assignment, you
may alter the atom() function so that it prints out its arguments to std-
out rather than building a file of atoms, and remove the call to the code
generator.

3. Extend the Decaf language to include initializations in declarations, such


as:
int x=3, y, z=0;
Modify the files decaf.grammar and Translation.java, shown in Appendix
B, so that the compiler puts out the correct atom sequence implementing
this feature. You will need to put out a MOV atom to assign the value of
the constant to the variable.

5.6 Chapter Summary


This chapter describes some bottom up parsing algorithms. These algorithms
recognize a sequence of grammar rules in a derivation, corresponding to an
upward direction in the derivation tree. In general, these algorithms begin with
an empty stack, read input symbols, and apply grammar rules, until left with
the starting nonterminal alone on the stack when all input symbols have been
read.
The most general class of bottom up parsing algorithms is called shift reduce
parsing. These parsers have two basic operations: (1) a shift operation pushes
the current input symbol onto the stack, and (2) a reduce operation replaces
zero or more top-most stack symbols with a single stack symbol. A reduction
can be done only if a handle can be identified on the stack. A handle is a string
of symbols occurring on the right side of a grammar rule, and matching the
symbols on top of the stack, as shown below:
∇ ... HANDLE          (rule: Nt → HANDLE)
The reduce operation applies the rewriting rule in reverse, by replacing the
handle on the stack with the nonterminal defined in the corresponding rule, as
shown below:
∇ ... Nt
When writing the grammar for a shift reduce parser, one must take care
to avoid shift/reduce conflicts (in which it is possible to do a reduce operation
when a shift is needed for a correct parse) and reduce/reduce conflicts (in which
more than one grammar rule matches a handle).
A special case of shift reduce parsing, called LR parsing, is implemented with
a pair of tables: an action table and a goto table. The action table specifies
whether a shift or reduce operation is to be applied. The goto table specifies
the stack symbol to be pushed when the operation is a reduce.
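The action/goto mechanism can be made concrete with a toy example. The sketch below hand-encodes an SLR table for the tiny grammar E → E + n | n; the states and table entries were worked out by hand for this illustration and have nothing to do with the tables SableCC generates.

```java
import java.util.*;

public class TinyLR {
    // Grammar: (1) E -> E + n   (2) E -> n
    // States: 0 start; 1 after n; 2 after E; 3 after E +; 4 after E + n
    public static boolean parse(String input) {
        List<Character> toks = new ArrayList<>();
        for (char c : input.toCharArray()) toks.add(c);
        toks.add('$');                                   // end-of-input marker
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int pos = 0;
        while (true) {
            int s = states.peek();
            char t = toks.get(pos);
            if (s == 0 && t == 'n')      { states.push(1); pos++; }   // shift
            else if (s == 2 && t == '+') { states.push(3); pos++; }   // shift
            else if (s == 3 && t == 'n') { states.push(4); pos++; }   // shift
            else if (s == 1 && (t == '+' || t == '$')) {  // reduce by E -> n
                states.pop();                             // pop 1 symbol
                states.push(2);                           // goto on E from state 0
            }
            else if (s == 4 && (t == '+' || t == '$')) {  // reduce by E -> E + n
                states.pop(); states.pop(); states.pop(); // pop 3 symbols
                states.push(2);                           // goto on E from state 0
            }
            else if (s == 2 && t == '$') return true;     // accept
            else return false;                            // error entry in the table
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("n+n+n"));   // accepted
        System.out.println(parse("n+"));      // rejected
    }
}
```

The if/else chain plays the role of the action table (shift, reduce, accept, error), and the push of state 2 after each reduction plays the role of the goto table.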

We studied a parser generator, SableCC, which generates an LR parser from


a specification grammar. It is also possible to include actions in the grammar
which are to be applied as the input is parsed. These actions are implemented
in a Translation class designed to be used with SableCC.
Finally we looked at an implementation of Decaf, our case study language
which is a subset of Java, using SableCC. This compiler works with the lexical
phase discussed in section 2.4 and is shown in Appendix B.
