Reinhard Wilhelm · Helmut Seidl · Sebastian Hack

Compiler Design
Syntactic and Semantic Analysis

Springer-Verlag Berlin Heidelberg, 2013

Helmut Seidl, Fakultät für Informatik, Technische Universität München, Garching, Germany
Reinhard Wilhelm and Sebastian Hack, FB Informatik, Universität des Saarlandes, Saarbrücken, Germany
Preface

Compilers for high-level programming languages are large and complex software
systems. They have, however, several distinct properties by which they differ
favorably from most other software systems. Their semantics is (almost) well defined.
Ideally, completely formal or at least precise descriptions exist both of the source
and the target languages. Often, additional descriptions are provided of the inter-
faces to the operating system, to programming environments, to other compilers,
and to program libraries.
The task of compilation can be naturally decomposed into subtasks. This de-
composition results in a modular structure which, by the way, is also reflected in
the structure of most books about compilers.
As early as the 1950s, it was observed that implementing application systems
directly as machine code is both difficult and error-prone, and results in programs
which become outdated as quickly as the computers for which they have been de-
veloped. High-level machine-independent programming languages, on the other
hand, immediately made it mandatory to provide compilers, which are able to auto-
matically translate high-level programs into low-level machine code.
Accordingly, the various subtasks of compilation have been subject to intensive
research. For the subtasks of lexical and syntactic analysis of programs, concepts
like regular expressions, finite automata, context-free grammars and pushdown au-
tomata have been borrowed from the theory of automata and formal languages and
adapted to the particular needs of compilers. These developments have been ex-
tremely successful. In the case of syntactic analysis, they led to fully automatic
techniques for generating the required components solely from corresponding spec-
ifications, i.e., context-free grammars. Analogous automatic generation techniques
would be desirable for further components of compilers as well, and have, to a
certain extent, also been developed.
The current book does not attempt to be a cookbook for compiler writers. Ac-
cordingly, there are no recipes such as “in order to construct a compiler from source
language X into target language Y , take . . . ”. Instead, our presentation elabo-
rates on the fundamental aspects such as the technical concepts, the specification
formalisms for compiler components, and methods for systematically deriving
implementations. Ideally, this approach may result in fully automatic generator tools.
The book is written for students of computer science. Some knowledge of an
object-oriented programming language such as JAVA and of the very basic principles
of a functional language such as OCAML or SML is required. Knowledge about
formal languages or automata is useful, but not mandatory, as the corresponding
background is provided within the presentation.
Acknowledgement
Besides the helpers from former editions, we would like to thank Michael Jacobs
and Jörg Herter for carefully proofreading the chapter on syntactic analysis. When
revising the description of the recursive descent parser, we were also supported by
Christoph Mallon and his invaluable experience with practical parser implementations.
We hope that our readers will enjoy the present volume and that our book may
encourage them to realize compilers of their own for their favorite programming
languages.
Contents

2 Lexical Analysis
  2.1 The Task of Lexical Analysis
  2.2 Regular Expressions and Finite Automata
      2.2.1 Words and Languages
  2.3 A Language for Specifying Lexical Analyzers
      2.3.1 Character Classes
      2.3.2 Nonrecursive Parentheses
  2.4 Scanner Generation
      2.4.1 Character Classes
      2.4.2 An Implementation of the until-Construct
      2.4.3 Sequences of Regular Expressions
      2.4.4 The Implementation of a Scanner
  2.5 The Screener
      2.5.1 Scanner States
      2.5.2 Recognizing Reserved Words
  2.6 Exercises
  2.7 Bibliographic Notes
3 Syntactic Analysis
  3.1 The Task of Syntactic Analysis
  3.2 Foundations
      3.2.1 Context-Free Grammars
      3.2.2 Productivity and Reachability of Nonterminals
      3.2.3 Pushdown Automata
References
Index
1 The Structure of Compilers
Our series of books treats the compilation of high-level programming languages into the
machine languages of virtual or real computers. Such compilers are large, complex
software systems. Realizing large and complex software systems is a difficult task.
What is so special about compilers such that they can even be implemented as a
project accompanying a compiler course? One reason is that the big task can be
naturally decomposed into subtasks which have clearly defined functionalities and
clean interfaces between them. Another reason is automation: several components
of compilers need not be programmed by hand, but can be directly generated from
specifications by means of standard tools.
The general architecture of a compiler, to be described in the following, is a
conceptual structure of the process. It identifies the subtasks of compiling a source
language into a target language and defines interfaces between the components re-
alizing the subtasks. The concrete architecture of the compiler is then derived from
this conceptual structure. Several components might be combined if the realized
subtasks allow this. On the other hand, a component may also be split into several
components if the realized subtask is very complex.
A first attempt to structure a compiler decomposes the compiler into three com-
ponents executing three consecutive phases:
1. The analysis phase, realized by the front-end. It determines the syntactic struc-
ture of the source program and checks whether the static semantic constraints
are satisfied. The latter contain the type constraints in languages with static type
systems.
2. The optimization and transformation phase, performed by the middle-end. The
syntactically analyzed and semantically checked program is transformed by
semantics-preserving transformations. These transformations mostly aim at im-
proving the efficiency of the program by reducing the execution time, the mem-
ory consumption, or the consumed energy. These transformations are inde-
pendent of the target architecture and mostly also independent of the source
language.
3. The code generation and the machine-dependent optimization phase, performed
by the back-end. The program is now translated into an equivalent program
of the target machine language.
Fig. 1.1 Structure of a compiler together with the program representations during the analysis
phase: lexical analysis by the scanner yields a sequence of symbols; screening by the screener
yields a decorated sequence of symbols; syntactic analysis by the parser yields a syntax tree;
semantic analysis completes the analysis phase, which is followed by optimization and code
generation (the synthesis phase)
We now walk through the sequence of subtasks step by step, characterize their
job, and describe the change in program representation. As a running example we
consider the following program fragment:
int a, b;
a = 42;
b = a * a - 7;
1.2 Lexical Analysis

The component performing lexical analysis of source programs is often called the
scanner. This component reads the source program represented as a sequence of
characters, mostly from a file. It decomposes this sequence of characters into a
sequence of lexical units of the programming language. These lexical units are called
symbols. Typical lexical units are keywords such as if, else, while or switch and
special characters and character combinations such as =, ==, !=, <=, >=, <, >,
(, ), [, ], {, } or comma and semicolon. These need to be recognized and converted
into corresponding internal representations. The same holds for reserved identifiers
such as names of basic types int, float, double, char, bool or string, etc. Fur-
ther symbols are identifiers and constants. Examples for identifiers are value42,
abc, Myclass, x, while the character sequences 42, 3.14159 and "HalloWorld!"
represent constants. It is important to note that there are, in principle, arbitrarily
many such symbols. They can, however, be categorized into finitely many symbol
classes. A symbol class consists of symbols that are equivalent as far as the syn-
tactic structure of programs is concerned. The set of identifiers is an example of
such a class. Within this class, there may be subclasses such as type constructors
in OCAML or variables in PROLOG, which are written in capital letters. In the class
of constants, int-constants can be distinguished from floating-point constants and
string-constants.
The symbols we have considered so far bear semantic interpretations which must
be taken into account in code generation. There are, however, also symbols with-
out semantics. Two symbols, for example, need a separator between them if their
concatenation would also form a symbol. Such a separator can be a blank, a new-
line, an indentation or a sequence of such characters. Such white space can also be
inserted into a program to visualize the structure of the program to human eyes.
Other symbols without meaning for the compiler, but helpful for the human
reader, are comments. Comments also can be used by software development tools.
Other symbols are compiler directives (pragmas), which may tell the compiler to
include particular libraries or influence the memory management for the program
to be compiled. The sequence of symbols for the example program may look as
follows:
Int("int") Sep(" ") Id("a") Com(",") Sep(" ") Id("b") Sem(";") Sep("\n")
Id("a") Bec("=") Intconst("42") Sem(";") Sep("\n")
Id("b") Bec("=") Id("a") Mop("*") Id("a") Aop("-") Intconst("7") Sem(";") Sep("\n")
To increase readability, the sequence was broken into lines according to the original
program structure. Each symbol is represented by its symbol class and the substring
representing it in the program. More information may be added such as, e.g., the
position of the string in the input.
1.3 The Screener

The screener will produce the following sequence of annotated symbols for our
example program:

Int() Id(1) Com() Id(2) Sem()
Id(1) Bec() Intconst(42) Sem()
Id(2) Bec() Id(1) Mop(Mul) Id(1) Aop(Sub) Intconst(7) Sem()
All separators are removed from the symbol sequence. Semantic values were
computed for some of the substrings. The identifiers a and b were coded by the
numbers 1 and 2, respectively. The sequences of digits for the int-constants were replaced
by their binary values. The internal representations of the symbols Mop and Aop
are elements of an appropriate enumeration type.
Scanner and screener are usually combined into one module, which is also called
scanner. Conceptually, however, they should be kept separate. The task that the
scanner, in the restricted meaning of the word, performs can be realized by a finite
automaton. The screener, however, may additionally require arbitrary pieces of
code.
1.4 Syntactic Analysis

The lexical and the syntactic analyses together recognize the syntactic structure of
the source program. Lexical analysis realizes the part of this task that can be imple-
mented by means of finite automata. Syntactic analysis recognizes the hierarchical
structure of the program, a task that finite automata cannot perform in general. The
syntactical structure of the program consists of sequential and hierarchical com-
position of language constructs. The hierarchical composition corresponds to the
nesting of language constructs. Programs in an object-oriented programming lan-
guage like JAVA, for example, consist of class declarations, which may be combined
into packages. The declaration of a class may contain declarations of attributes,
constructors, and methods. A method consists of a method head and a method
body. The latter contains the implementation of the method. Some language con-
structs may be nested arbitrarily deep. This is, e.g., the case for blocks or arithmetic
expressions, where an unbounded number of operator applications can be used to
construct expressions of arbitrary sizes and depths. Finite automata are incapable
of recognizing such nesting of constructs. Therefore, more powerful specification
mechanisms and recognizers are required.
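For illustration, arbitrarily deep nesting of arithmetic expressions can be described by a context-free grammar such as the following (a standard expression grammar, written by us for illustration; its nonterminals E, T, and F reappear in the syntax tree of Fig. 1.2):

E → E Aop T | T
T → T Mop F | F
F → ( E ) | Id | Intconst

No finite automaton can check that the parentheses produced by the rule for F are balanced, whereas a pushdown automaton can.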
The component used to recognize the syntactic structure is called the parser. This
component should not only recognize the syntactic structure of correct programs.
It should also be able to properly deal with syntactically incorrect programs. After
all, most programs submitted to a compiler contain mistakes. Typical syntax errors
are spelling errors in keywords, or missing parentheses or separators. The parser
should detect these kinds of errors, diagnose them, and maybe even try to correct
them.
The syntactic structure of programs can be described by context-free grammars.
From the theory of formal languages and automata it is known that pushdown
automata can recognize exactly the languages described by context-free grammars.
1.5 Semantic Analysis

The task of semantic analysis is to determine properties and check conditions that
are relevant for the well-formedness of programs according to the rules of the pro-
gramming language, but that go beyond what can be described by context-free
grammars. These conditions can be completely checked on the basis of the program
text and are therefore called static semantic properties. This phase is, therefore,
called semantic analysis. The dynamic semantics, in contrast, describes the be-
havior of programs when they are executed. The attributes static and dynamic are
associated with the compile time and the run time of programs, respectively. We list
some static semantic properties of programs:
• type correctness in strongly typed programming languages like C, PASCAL,
  JAVA, or OCAML. A prerequisite for type correctness is that all identifiers are
  declared, either explicitly or implicitly, and, possibly, the absence of multiple
  declarations of the same identifier;
• the existence of a consistent type association with all expressions in languages
  with type polymorphism.
Example 1.5.1 For the program of Fig. 1.2, semantic analysis will collect the dec-
larations of the decl-subtree in a map; in our example, this map associates each of
the identifiers a and b with the type int. Using this map, semantic analysis
can check in the stat-subtrees whether variables and expressions are used in a type-
correct way. For the first assignment, a = 42;, it will check whether the left side of
the assignment is a variable identifier, and whether the type of the left side is
compatible with the type of the right side. In the second assignment, b = a * a - 7;,
the type of the right side is less obvious. It can be determined from the types of the
variable a and the constant 7. Recall that the arithmetic operators are overloaded
in most programming languages. This means that they stand for the designated op-
erations of several types, for instance, on int- as well as on float-operands, possibly
even of different precision. The type checker is then expected to resolve
overloading. In our example, the type checker determines that the multiplication is an
int-multiplication and the subtraction an int-subtraction, both returning values of
type int. The result type of the right side of the assignment, therefore, is int. ⊓⊔
Fig. 1.2 The syntax tree of the example program: a decl-subtree for the declaration int a, b;
and stat-subtrees for the two assignments, built from the nonterminals statlist, stat, decl, idlist,
type, A, E, T, and F over the symbol sequence Int Id Com Id Sem Id Bec Intconst Sem Id Bec
Id Mop Id Aop Intconst Sem

1.6 Machine-Independent Optimization
Static analyses of the source program may detect potential run-time errors or possi-
bilities for program transformation that may increase the efficiency of the program
while preserving the semantics of the program. A data-flow analysis or abstract in-
terpretation can detect, among others, the following properties of a source program:
• There exists a program path on which a variable would be used without being
  initialized.
• There exist program parts that cannot be reached or functions that are never
  called. These superfluous parts need not be compiled.
• A program variable x at a statement in an imperative program always has the
  same value, c. In this case, variable x can be replaced by the value c in this
  statement. This analysis would recognize that at each execution of the second
  assignment, b = a * a - 7;, variable a has the value 42. Replacing both
  occurrences of a by 42 leads to the expression 42 * 42 - 7, whose value can be
  evaluated at compile time. This analysis and transformation is called constant
  propagation with constant folding.
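On the running example this means (the folded value is plain arithmetic: 42 · 42 − 7 = 1757):

a = 42;
b = a * a - 7;    // constant propagation: b = 42 * 42 - 7;
                  // constant folding:     b = 1757;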
A major emphasis of this phase is on evaluating subexpressions whose value can
be determined at compile time. Besides this, the following optimizations can be
performed by the compiler:
• Loop-invariant computations can be moved out of loops. A computation is
  loop-invariant if it only depends on variables that do not change their values during
  the execution of the loop. Such a computation is executed only once, instead of
  in each iteration, when it has been moved out of the loop; a small example
  follows this list.
• A similar transformation can be applied in the compilation of functional
  programs to achieve the fully lazy property. Expressions that only contain variables
  bound outside of the function can be moved out of the body of the function and
  passed to the function in calls through an additional parameter.
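For illustration of the first transformation (hypothetical code, not taken from the book):

// before: x * y does not change inside the loop
for (i = 0; i < n; i++)
    a[i] = x * y + i;

// after: the invariant computation is executed only once
t = x * y;
for (i = 0; i < n; i++)
    a[i] = t + i;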
These kinds of optimizations are performed by many compilers. They make up the
middle-end of the compiler. The volume Compiler Design: Analysis and Transfor-
mation is dedicated to this area.
1.7 Generation of the Target Program

The code generator takes the intermediate representation of the program and gener-
ates the target program. A systematic way to translate several types of programming
languages to adequate virtual machines is presented in the volume, Compiler De-
sign: Virtual Machines. Code generation, as described there, proceeds recursively
over the structure of the program. It could, therefore, start directly after syntactic
and semantic analysis and work on the decorated syntax tree. The efficiency of the
resulting code, though, can be considerably improved if the properties of the tar-
get hardware are exploited. The access to values, for example, is more efficient if
the values are stored in the registers of the machine. Every processor, on the other
hand, has only a limited number of such registers. One task of the code generator
is to make good use of this restricted resource. The task to assign registers to vari-
ables and intermediate values is called register allocation. Another problem to be
solved by the code generator is instruction selection, i.e., the selection of instruction
sequences for expressions and (sequences of) assignments of the source program.
Most processor architectures provide multiple possibilities for the translation of a
given statement of the source program. From these, a sequence must be selected
which is preferable w.r.t. execution time, storage consumption, parallelizability, or
just program length.
Example 1.7.1 Let us assume that a virtual or concrete target machine has registers
r1, r2, …, rN for a (mostly small) N and that it provides, among others, the
instructions

Instruction        Meaning
load ri, q         ri ← M[q]
store q, ri        M[q] ← ri
loadi ri, q        ri ← q
subi ri, rj, q     ri ← rj − q
mul ri, rj, rk     ri ← rj · rk

where q stands for an arbitrary int-constant, and M[…] for a memory access. Let
us further assume that variables a and b have been assigned the global addresses 1
and 2. A translation of the example program, which stores the values for a and b in
these memory cells, is:

loadi r1, 42
store 1, r1
mul r2, r1, r1
subi r3, r2, 7
store 2, r3
Registers r1, r2, and r3 serve for storing intermediate values during the evaluation
of right sides. Registers r1 and r3 hold the values of variables a and b, respectively.
Closer inspection reveals that the compiler may save registers. For instance, register r1
can be reused for register r2 since the value in r1 is no longer needed after the
multiplication. Even the result of the instruction subi may be stored in the same
register. We thus obtain the improved instruction sequence:

loadi r1, 42
store 1, r1
mul r1, r1, r1
subi r1, r1, 7
store 2, r1    ⊓⊔
Even for the simple processor architecture of Example 1.7.1, the code generator
must take into account that the number of intermediate results to be stored in regis-
ters does not exceed the number of currently available registers. These and similar
constraints are to be found in realistic target architectures. Furthermore, they typi-
cally offer dedicated instructions for frequent special cases. This makes the task of
generating efficient code both intricate and challenging. The necessary techniques
are presented in the volume Compiler Design: Code Generation and Machine-
Level Optimization.
1.8 Specification and Generation of Compiler Components

All the tasks to be solved during the syntactic analysis can be elegantly specified
by different types of grammars. Symbols, the lexical units of the languages, can be
described by regular expressions. A nondeterministic finite automaton recognizing
the language described by a regular expression R can be automatically derived from
R. This nondeterministic finite automaton can then be automatically
converted into a deterministic finite automaton.
A similar correspondence is known between context-free grammars as used
for specifying the hierarchical structure of programs, and pushdown automata.
A nondeterministic pushdown automaton recognizing the language of a context-
free grammar can be automatically constructed from the context-free grammar.
For practical applications such as compilation, deterministic pushdown
automata are clearly preferred. Unlike in the case of finite automata, however, nonde-
terministic pushdown automata are more powerful than deterministic pushdown
automata. Most designers of programming languages have succeeded in staying
within the class of deterministically analyzable context-free languages, so that
syntactic analysis of their languages is relatively simple and efficient. The example
of C++, however, shows that a badly designed syntax requires nondeterministic
parsers and thus considerably more effort, both in building a parser and in actually
parsing programs in the language.
The compiler components for lexical and syntactic analysis need not be pro-
grammed by hand, but can be automatically generated from appropriate specifica-
tions. These two examples suggest searching for further compiler subtasks that can
be solved by automatically generated components. As another example of this
approach, we meet attribute grammars in this volume. Attribute grammars are an
extension of context-free grammars in which computations on syntax trees can be
specified. These computations typically check the conformance of the program to
semantic conditions such as typing rules. Table 1.1 lists compiler subtasks treated
in this volume that can be formally specified in such a way that implementations of
the corresponding components can be automatically generated. The specification
and the implementation mechanisms are listed with the subtask.
Program invariants as they are needed for the semantics-preserving application
of optimizing program transformations can be computed using generic approaches
based on the theory of abstract interpretation. This is the subject of the volume
Compiler Design: Analysis and Transformation.
There are also methods to automatically produce components of the compiler
back-end. For instance, problems of register allocation and instruction schedul-
ing can be conveniently formulated as instances of Partitioned Boolean Quadratic
Programming (PBQP) or Integer Linear Programming (ILP). Various subtasks of
code generation are treated in the volume Compiler Design: Code Generation and
Machine-Level Optimization.
2 Lexical Analysis
We start this chapter by describing the task of lexical analysis. Then we present
regular expressions as specifications for this task. Regular expressions can be auto-
matically converted into nondeterministic finite automata, which implement lexical
analysis. Nondeterministic finite automata can be made deterministic, which is
preferred for implementing lexical analyzers, often called scanners. Another trans-
formation on the resulting deterministic finite automata attempts to reduce the sizes
of the automata. These three steps together make up an automatic process gen-
erating lexical analyzers from specifications. Another module working in close
cooperation with such a finite automaton is the screener. It filters out keywords,
comments, etc., and may do some bookkeeping or conversion.
2.1 The Task of Lexical Analysis

Let us assume that the source program is stored in a file. It consists of a sequence
of characters. Lexical analysis, i.e., the scanner, reads this sequence from left to
right and decomposes it into a sequence of lexical units, called symbols. Scanner,
screener, and parser may work in an integrated way. In this case, the parser calls
the combination scanner-screener to obtain the next symbol. The scanner starts the
analysis with the character that follows the end of the last found symbol. It searches
for the longest prefix of the remaining input that is a symbol of the language. It
passes a representation of this symbol on to the screener, which checks whether this
symbol is relevant for the parser. If not, it is ignored, and the screener reactivates the
scanner. Otherwise, it passes a possibly transformed representation of the symbol
on to the parser.
The scanner must, in general, be able to recognize infinitely many or at least very
many different symbols. The set of symbols is, therefore, divided into finitely many
classes. One symbol class consists of symbols that have a similar syntactic role. We
distinguish:
• The alphabet is the set of characters that may occur in program texts. We use
  the letter Σ to denote alphabets.
• A symbol is a word over the alphabet Σ. Examples are xyz12, 125, class, "abc".
• A symbol class is a set of symbols. Examples are the set of identifiers, the set of
  int-constants, and the set of character strings. We denote these by Id, Intconst,
  and String, respectively.
• The representation of a symbol comprises all of the mentioned information about
  a symbol that may be relevant for later phases of compilation. The scanner may
  represent the word xyz12 as the pair (Id, "xyz12"), consisting of the name of the
  class and the found symbol, and pass this representation on to the screener. The
  screener may replace "xyz12" by the internal representation of an identifier, for
  example, a unique number, and then pass this on to the parser.
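In a JAVA-like language, such a representation could be modeled as follows (a sketch; the class and field names are our own and are not prescribed by the book):

class Symbol {
    SymbolClass symClass;   // the symbol class, e.g., Id, Intconst, String
    String      text;       // the matched substring, e.g., "xyz12"
    int         position;   // position of the string in the input
}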
2.2 Regular Expressions and Finite Automata

2.2.1 Words and Languages

Several words can be concatenated to a new word. The concatenation of the words
x and y puts the sequence of characters of y after the sequence of characters of x,
i.e.,
x · y = x1 … xm y1 … yn

if x = x1 … xm and y = y1 … yn with xi, yj ∈ Σ.

Concatenation of x and y produces a word of length m + n if x and y have
lengths m and n, respectively. Concatenation is a binary operation on the set Σ* of all
words over Σ. In contrast to addition of numbers, concatenation of words is not
commutative. This means that the word x · y is, in general, different from the word y · x.
Like addition of numbers, concatenation of words is associative, i.e.,

x · (y · z) = (x · y) · z    for all x, y, z ∈ Σ*

The empty word ε is the neutral element with respect to concatenation of words, i.e.,

x · ε = ε · x = x    for all x ∈ Σ*
In the following, we will write xy for x · y.
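For example, for x = ab and y = ba we obtain x · y = abba, but y · x = baab, so the two differ.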
These operations on words are lifted to sets of words, i.e., languages. For languages
L1, L2 ⊆ Σ*, union and concatenation are defined by:

L1 ∪ L2 = {w ∈ Σ* | w ∈ L1 or w ∈ L2}
L1 · L2 = {xy | x ∈ L1, y ∈ L2}

The complement L̄ of a language L consists of all words in Σ* that are not contained
in L:

L̄ = Σ* \ L

For L ⊆ Σ* we denote by Lⁿ the n-times concatenation of L, by L* the union of
arbitrary concatenations, and by L⁺ the union of nonempty concatenations of L,
i.e.,

Lⁿ = {w1 … wn | w1, …, wn ∈ L}
L* = {w1 … wn | ∃ n ≥ 0 : w1, …, wn ∈ L} = ⋃n≥0 Lⁿ
L⁺ = {w1 … wn | ∃ n > 0 : w1, …, wn ∈ L} = ⋃n≥1 Lⁿ
The set of regular languages over an alphabet Σ is defined inductively:

• The empty set ∅ and the set {ε}, consisting only of the empty word, are regular
  over Σ.
• The sets {a} for all a ∈ Σ are regular over Σ.
• If R1 and R2 are regular languages over Σ, so are R1 ∪ R2 and R1 · R2.
• If R is regular over Σ, then so is R*.

According to this definition, each regular language can be specified by a regular
expression. Regular expressions over Σ and the regular languages described by
them are also defined inductively:

• ∅ is a regular expression over Σ, which specifies the regular language ∅.
• ε is a regular expression over Σ, and it specifies the regular language {ε}.
Example 2.2.1 [The table listing sample regular expressions, the languages they
specify, and some or even all of their elements is not preserved in this copy.] ⊓⊔

Regular expressions that contain the empty set as a symbol can be simplified by
repeated application of the following equalities:

r | ∅ = ∅ | r = r
r · ∅ = ∅ · r = ∅
∅* = ε
The equality symbol = between two regular expressions means that both specify
the same language. We can prove:

Lemma 2.2.1 For every regular expression r over alphabet Σ, a regular expression
r′ can be constructed which specifies the same language and additionally has the
following properties:
1. If r is a specification of the empty language, then r′ is the regular expression ∅;
2. If r is a specification of a nonempty language, then r′ does not contain the
   symbol ∅. ⊓⊔
Our applications only have regular expressions that specify nonempty languages. A
symbol to describe the empty set, therefore, need not be included in the
specification language of regular expressions. The empty word, on the other hand, cannot be
omitted so easily. For instance, we may want to specify that the sign of a number
constant is optional, i.e., may be present or absent. Often, however, specification
languages used by scanners do not provide a dedicated metacharacter for the empty
word: the ?-operator suffices in all practical situations. In order to remove explicit
occurrences of ε in a regular expression by means of ?, the following equalities can
be applied:

r | ε = ε | r = r?
r · ε = ε · r = r
ε* = ε? = ε
We obtain:
Lemma 2.2.2 For every regular expression r over alphabet Σ, a regular expression
r′ (possibly containing ?) can be constructed which specifies the same language as
r and additionally has the following properties:
1. If r is a specification of the language {ε}, then r′ is the regular expression ε;
2. If r is a specification of a language different from {ε}, then r′ does not contain
   ε. ⊓⊔
Finite Automata
We have seen that regular expressions are used for the specification of symbol
classes. The implementation of recognizers uses finite automata (FA). Finite au-
tomata are acceptors for regular languages. They maintain one state variable that
can only take on finitely many values, the states of the FA. According to Fig. 2.1,
an FA has an input tape and an input head, which reads the input on the tape from
left to right. The behavior of the FA is specified by means of a transition relation Δ.
Formally, a nondeterministic finite automaton (with ε-transitions) (or FA for
short) is represented as a tuple M = (Q, Σ, Δ, q₀, F), where

• Q is a finite set of states,
• Σ is a finite alphabet, the input alphabet,
• q₀ ∈ Q is the initial state,
• Δ ⊆ Q × (Σ ∪ {ε}) × Q is the transition relation, and
• F ⊆ Q is the set of final states.

An FA is deterministic (a DFA) if it has no ε-transitions and if, for each state q and
character a, there is at most one transition (q, a, p) ∈ Δ; for a DFA, the transition
relation can thus be seen as a partial function δ : Q × Σ → Q.

Fig. 2.1 [Schematic representation of a finite automaton: an input tape, an input head
moving from left to right, and a control holding the current state]
A DFA used as a scanner decomposes the input word into a sequence of sub-
words corresponding to symbols of the language. Each symbol drives the
DFA from its initial state into one of its final states.
The DFA starts in its initial state. Its input head is positioned at the beginning of
the input tape.
It then makes a number of steps. Depending on the actual state and the next input
symbol, the DFA changes its state and moves its input head to the next character.
The DFA accepts the input word when the input is exhausted, and the actual state is
a final state.
Example 2.2.2 Table 2.1 shows the transition relation of an FA M in the form of
a two-dimensional matrix TM. The states of the FA are denoted by the numbers
0, …, 7. The alphabet is the set {0, …, 9, ., E, +, −}. Each row of the table
describes the transitions for one of the states of the FA. The columns correspond to
the characters in Σ ∪ {ε}. The entry TM[q, x] contains the set of states p such that
(q, x, p) ∈ Δ. The state 0 is the initial state, and {1, 4, 7} is the set of final states. This
FA recognizes unsigned int- and float-constants. The accepting (final) state 1 can
be reached through computations on int-constants, while the accepting states 4 and
7 can be reached through computations on float-constants. ⊓⊔
Table 2.1 The transition relation of an FA to recognize unsigned int- and float-constants. The
first column represents the identical columns for the digits i = 0, …, 9; the fourth column
represents the ones for + and −

TM   i       .      E      +, −    ε
0    {1, 2}  {3}    ∅      ∅       ∅
1    {1}     ∅      ∅      ∅       {4}
2    {2}     {4}    ∅      ∅       ∅
3    {4}     ∅      ∅      ∅       ∅
4    {4}     ∅      {5}    ∅       {7}
5    ∅       ∅      ∅      {6}     {6}
6    {7}     ∅      ∅      ∅       ∅
7    {7}     ∅      ∅      ∅       ∅
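For example, on the input 3.14 the FA can move from state 0 to 2 (reading the digit 3), from 2 to 4 (reading .), and stay in 4 while reading the digits 1 and 4; since 4 is a final state, 3.14 is recognized as a float-constant.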
Fig. 2.2 The transition diagram for the FA of Example 2.2.2. The character digit stands for the
set {0, 1, …, 9}; an edge labeled with digit stands for edges labeled with 0, 1, …, 9 with the same
source and target vertices
An edge from p to q that is labeled with x corresponds to a transition (p, x, q). The start vertex
of the transition diagram, corresponding to the initial state, is marked by an arrow
pointing to it. The end vertices, corresponding to final states, are represented by
doubly encircled vertices. A w-path in this graph for a word w ∈ Σ* is a path from
a vertex q to a vertex p such that w is the concatenation of the edge labels. The
language accepted by M consists of all words w ∈ Σ* for which there exists a
w-path in the transition diagram from q₀ to a vertex q ∈ F.
Example 2.2.3 Figure 2.2 shows the transition diagram corresponding to the FA
of Example 2.2.2. ⊓⊔
Fig. 2.3 The rules for decomposing edges labeled with regular expressions: (A) an edge
(q, r1 | r2, p) is replaced by the two edges (q, r1, p) and (q, r2, p); (K) an edge (q, r1 r2, p)
is replaced by (q, r1, q1) and (q1, r2, p) with a new state q1; (S) an edge (q, r*, p) is replaced
by the edge (q1, r, q2) with new states q1, q2, together with the ε-edges (q, ε, q1), (q2, ε, q1),
(q2, ε, p), and (q, ε, p)
Acceptors
The next theorem guarantees that every regular expression can be compiled into an
FA that accepts the language specified by the expression.

Theorem 2.2.1 For each regular expression r, an FA can be constructed that
accepts the language specified by r. ⊓⊔

The construction starts with a single edge, labeled r, leading from the initial state
q0 to the final state qf:

q0 --r--> qf

This edge is then decomposed according to the rules of Fig. 2.3. In the
implementation below, the initial state is 0 and the final state is 1:

trans ← ∅;
count ← 1;
generate(0, r, 1);
return (count, trans);
The set trans globally collects the transitions of the generated FA, and the global
counter count keeps track of the largest natural number used as a state. A call of
the procedure generate for (p, r′, q) inserts all transitions of an FA for the regular
expression r′ with initial state p and final state q into the set trans. New states are
created by incrementing the counter count. The procedure is defined recursively
over the structure of the regular expression r′:
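The listing itself is not preserved here; the following is a sketch in the book's JAVA-like pseudocode with pattern matching, where the constructor names Or, Concat, Star, Sym, and Eps for the datatype Exp are our own assumptions:

void generate (int p, Exp r, int q) {
    switch (r) {
    case Or (r1, r2):         // rule (A): both alternatives connect p with q
        generate(p, r1, q);
        generate(p, r2, q);
        break;
    case Concat (r1, r2):     // rule (K): one new intermediate state
        int q1 ← ++count;
        generate(p, r1, q1);
        generate(q1, r2, q);
        break;
    case Star (r1):           // rule (S): two new states, four ε-transitions
        int q1 ← ++count;  int q2 ← ++count;
        trans ← trans ∪ {(p, ε, q1), (q2, ε, q1), (q2, ε, q), (p, ε, q)};
        generate(q1, r1, q2);
        break;
    case Sym (a):             // a single character a ∈ Σ
        trans ← trans ∪ {(p, a, q)};
        break;
    case Eps:                 // the empty word
        trans ← trans ∪ {(p, ε, q)};
        break;
    }
}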
Exp denotes a datatype for regular expressions over the alphabet Σ. We have
used a JAVA-like programming language as implementation language. The switch-
statement was extended by pattern matching to elegantly deal with structured data
such as regular expressions. This means that patterns are not only used to select
between alternatives but also to identify substructures.
A procedure call generate(0, r, 1) terminates after n rule applications, where n
is the number of occurrences of operators and symbols in the regular expression r.
If l is the value of the counter after the call, the generated FA has {0, …, l} as its set
of states, where 0 is the initial state and 1 the only final state. The transitions are
collected in the set trans. The FA Mr can thus be computed in linear time.
Example 2.2.4 The regular expression a(a | 0)* over the alphabet {a, 0} describes
the set of all words over {a, 0} that begin with an a. Figure 2.4 shows the construction
of the transition diagram of an FA that accepts this language. ⊓⊔
Fig. 2.4 Construction of a transition diagram for the regular expression a (a | 0)*: rule (K) splits
the initial edge into an a-edge from 0 to 2 and an (a | 0)*-edge from 2 to 1; rule (S) replaces the
latter by ε-edges and an (a | 0)-edge between the new states 3 and 4; rule (A) finally splits this
edge into an a-edge and a 0-edge from 3 to 4
Theorem 2.2.2 For each FA, a DFA can be constructed that accepts the same
language. ⊓⊔
Proof The proof provides the second step of the generation procedure for
scanners. It uses the subset construction. Let M = (Q, Σ, Δ, q₀, F) be an FA. The
goal of the subset construction is to construct a DFA P(M) = (P(Q), Σ, P(Δ),
P(q₀), P(F)) that recognizes the same language as M. For a word w ∈ Σ*, let
states(w) ⊆ Q be the set of all states q ∈ Q for which there exists a w-path
leading from the initial state q₀ to q. The DFA P(M) is given by:

P(Q) = {states(w) | w ∈ Σ*}
P(q₀) = states(ε)
P(F) = {states(w) | w ∈ L(M)}
P(Δ)(S, a) = states(wa)    for S ∈ P(Q) and a ∈ Σ if S = states(w)

We convince ourselves that this definition of the transition function P(Δ) is
reasonable. For this, we show that for words w, w′ ∈ Σ* with states(w) = states(w′) it
holds that states(wa) = states(w′a) for all a ∈ Σ. It follows that M and P(M)
accept the same language.

We need a systematic way to construct the states and the transitions of P(M).
The set of final states of P(M) can be constructed once the set of states of P(M)
is known, because it holds that:

P(F) = {A ∈ P(Q) | A ∩ F ≠ ∅}
For a set A ⊆ Q of states, consider its ε-closure, i.e., the set of all states that can
be reached from states in A by ε-paths in the transition diagram of M. This closure
can be computed by the following function:
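The original listing is lost; a sketch consistent with the description below, in the book's pseudocode style, might read (the helper list_of and the cons operator :: are assumptions):

set⟨state⟩ closure (set⟨state⟩ A) {
    set⟨state⟩ result ← ∅;
    list⟨state⟩ W ← list_of(A);       // worklist: states still to be processed
    state q, q′;
    while (W ≠ []) {
        q ← hd(W);  W ← tl(W);
        if (q ∉ result) {
            result ← result ∪ {q};
            forall ((q, ε, q′) ∈ Δ)    // follow all ε-transitions leaving q
                W ← q′ :: W;
        }
    }
    return result;
}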
The states belonging to the closure of the set A are collected in the set
result. The list W contains all elements whose ε-transitions have not yet
been processed. As long as W is not empty, the first state q is taken from W.
For this, the functions hd and tl are used, which return the first element and the tail of
a list, respectively. If q is already contained in result, nothing needs to be done.
Otherwise, q is inserted into the set result. Then all transitions (q, ε, q′) in Δ are
considered, and the successor states q′ are added to W. By applying the closure
operator, the initial state P(q₀) of the subset automaton can be computed:

P(q₀) = closure({q₀})
When constructing the set of all states P(Q) together with the transition function
P(Δ) of P(M), bookkeeping is required of the set Q′ ⊆ P(Q) of already
generated states and of the set Δ′ ⊆ P(Δ) of already created transitions.
Initially, Q′ = {P(q₀)} and Δ′ = ∅. For a state S ∈ Q′ and each a ∈ Σ,
its successor state S′ under a is added to Q′, and the transition (S, a, S′) is added to Δ′.
The successor state S′ for S under a character a ∈ Σ is obtained by collecting the
successor states under a of all states q ∈ S and adding all ε-successor states:

S′ = nextState(S, a) = closure({p ∈ Q | ∃ q ∈ S : (q, a, p) ∈ Δ})

Insertions into the sets Q′ and Δ′ are performed until all successor states of the
states in Q′ under transitions for characters from Σ are already contained in the set
Q′. Technically, this means that the set states of all states and the set trans of all
transitions of the subset automaton can be computed iteratively by the following
loop:
list⟨set⟨state⟩⟩ W;
set⟨state⟩ S0 ← closure({q0});
states ← {S0};  W ← [S0];
trans ← ∅;
set⟨state⟩ S, S′;
while (W ≠ []) {
    S ← hd(W);  W ← tl(W);
    forall (x ∈ Σ) {
        S′ ← nextState(S, x);
        trans ← trans ∪ {(S, x, S′)};
        if (S′ ∉ states) {
            states ← states ∪ {S′};
            W ← W ∪ {S′};
        }
    }
}    ⊓⊔
Example 2.2.5 The subset construction, applied to the FA of Example 2.2.4, can be
executed by the steps described in Fig. 2.5. The states of the DFA to be constructed
are denoted by primed natural numbers 0′, 1′, …. The initial state 0′ is the set {0}.
The states in Q′ whose successor states have already been computed are underlined.
The state 3′ is the empty set of states, i.e., the error state. It can never be left. It is the
successor state of a state S under a character a if there is no transition of the FA
under a for any FA state in S. ⊓⊔
Minimization
The DFA generated from a regular expression in the given two steps is not
necessarily the smallest possible for the specified language. There may be states that have
the same acceptance behavior. Let M = (Q, Σ, δ, q₀, F) be a DFA. We say states
p and q of M have the same acceptance behavior, written p ∼M q, if, for every word
w ∈ Σ*, M accepts w when started in p exactly if it accepts w when started in q.
Fig. 2.5 The steps of the subset construction for the FA of Example 2.2.4. The DFA states are
0′ = {0}, 1′ = {1, 2, 3}, 2′ = {1, 3, 4}, and the error state 3′ = ∅; state 0′ goes under a to 1′
and under 0 to 3′, while 1′ and 2′ go under both a and 0 to 2′
From a given DFA, a minimal DFA can be constructed which accepts the same language,
and this minimal DFA is unique up to isomorphism. This is the claim of the
following theorem.
Theorem 2.2.3 For each DFA M, a minimal DFA M′ can be constructed that
accepts the same language as M. This minimal DFA is unique up to renaming of
states. ⊓⊔
The equivalence class of a state q is [q]M = {p ∈ Q | q ∼M p}, and the set of
states of M′ is

Q′ = {[q]M | q ∈ Q}

Correspondingly, the initial state and the set of final states of M′ are defined by

q₀′ = [q₀]M        F′ = {[q]M | q ∈ F}

and the transition function δ′ by δ′([q]M, a) = [δ(q, a)]M. It can be verified that
this new transition function δ′ is well-defined, i.e., that for
[q1]M = [q2]M it holds that [δ(q1, a)]M = [δ(q2, a)]M for all a ∈ Σ. Furthermore,

δ*(q, w) ∈ F if and only if (δ′)*([q]M, w) ∈ F′

holds for all q ∈ Q and w ∈ Σ*. This implies that L(M) = L(M′). We
now claim that the DFA M′ is minimal. For a proof of this claim, consider
another DFA M″ = (Q″, Σ, δ″, q₀″, F″) with L(M″) = L(M′) whose states are
all reachable from q₀″. Assume for a contradiction that there is a state q ∈ Q″
and words u1, u2 ∈ Σ* such that (δ″)*(q₀″, u1) = (δ″)*(q₀″, u2) = q, but
(δ′)*([q₀]M, u1) ≠ (δ′)*([q₀]M, u2). For i = 1, 2, let pi ∈ Q denote a
state with (δ′)*([q₀]M, ui) = [pi]M. Since [p1]M ≠ [p2]M holds, the states p1
and p2 cannot be equivalent. On the other hand, since u1 and u2 lead in M″ to the
same state q, we have for all words w ∈ Σ* that u1 w ∈ L(M) if and only if
u2 w ∈ L(M); hence p1 and p2 do have the same acceptance behavior, which is a
contradiction. Thus M″ has at least as many states as M′.

The minimization algorithm maintains a partition Π of the set Q of states. A
partition Π is called stable under the transition function δ if for each set q′ ∈ Π and
each a ∈ Σ there is a set p′ ∈ Π with

{δ(q, a) | q ∈ q′} ⊆ p′
In a stable partition, all transitions from one set of the partition lead into exactly one
set of the partition.
The partition Π maintains those sets of states which are assumed to have
the same acceptance behavior. If it turns out that a set q′ ∈ Π contains
states with different acceptance behavior, then the set q′ is split up. Different
acceptance behavior of two states q1 and q2 is recognized when the successor states
δ(q1, a) and δ(q2, a) for some a ∈ Σ lie in different sets of Π. Then the partition
is apparently not stable. Such a split of a set in a partition is called a refinement of Π.
The successive refinement of the partition Π terminates when there is no need for
further splitting of any set in the obtained partition. Then the partition is stable under
the transition function δ.
In detail, the construction of the minimal DFA proceeds as follows. The partition
Π is initialized with Π = {F, Q \ F}. Assume that the actual partition Π of
the set Q of states of M is not yet stable under δ. Then there exists a set q′ ∈ Π
and some a ∈ Σ such that the set {δ(q, a) | q ∈ q′} is not completely contained in
any of the sets p′ ∈ Π. Such a set q′ is then split to obtain a new partition Π′
of q′ that consists of all nonempty elements of the set

{{q ∈ q′ | δ(q, a) ∈ p′} | p′ ∈ Π}

The partition Π′ of q′ consists of all nonempty subsets of states from q′ that lead
under a into the same set p′ ∈ Π. The set q′ in Π is replaced by the partition
Π′ of q′, i.e., the partition Π is refined to the partition (Π \ {q′}) ∪ Π′.

When a sequence of such refinement steps arrives at a stable partition Π, the set of
states of M′ has been computed:

Π = {[q]M | q ∈ Q}
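As pseudocode, the refinement loop reads (a sketch following the description above):

Π ← {F, Q \ F};
while (there exist q′ ∈ Π and a ∈ Σ such that
       {δ(q, a) | q ∈ q′} is contained in no p′ ∈ Π) {
    Π′ ← the set of all nonempty sets {q ∈ q′ | δ(q, a) ∈ p′} for p′ ∈ Π;
    Π ← (Π \ {q′}) ∪ Π′;     // refine: replace q′ by its split
}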
Fig. 2.6 [The transition diagram of the minimal DFA resulting in Example 2.2.6: from the
initial state {0′}, the character a leads to the final state {1′, 2′}, which has a loop under a and 0;
all missing transitions lead to the error state {3′}]
Example 2.2.6 We illustrate the presented method by minimizing the DFA of
Example 2.2.5. At the beginning, the partition Π is given by

{ {0′, 3′}, {1′, 2′} }

This partition is not stable. The first set {0′, 3′} must be split into the partition
Π′ = {{0′}, {3′}}. The corresponding refinement of Π produces the partition

{ {0′}, {3′}, {1′, 2′} }

This partition is stable under δ. It therefore delivers the states of the minimal DFA.
The transition diagram of the resulting DFA is shown in Fig. 2.6. ⊓⊔
2.3 A Language for Specifying Lexical Analyzers

Example 2.3.1 The following regular expression describes the language of
unsigned int-constants of Examples 2.2.2 and 2.2.3:

(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*    ⊓⊔
2.3.1 Character Classes

In the specification of a lexical analyzer, one should be able to group sets of
characters into classes if these characters can be exchanged for each other without
changing the symbol class of symbols in which they appear. This is particularly
helpful in the case of large alphabets, for instance the alphabet of all Unicode-
characters. Examples of frequently occurring character classes are:
alpha = a-zA-Z
digit = 0-9
These definitions of character classes define classes by using intervals in the
underlying character code, e.g., the ASCII code. Note that we need another metacharacter,
'-', for the specification of intervals. Using this feature, we can concisely specify the
symbol class of identifiers:

Id = alpha (alpha | digit)*

The specification of character classes uses three metacharacters, namely '=', '-',
and the blank. For the usage of identifiers for character classes, though, the
description formalism must provide another mechanism to distinguish them from ordinary
character strings. In our examples, we use a dedicated font. In practical systems, the
defined names of character classes might be enclosed in dedicated brackets such as
{…}.
Example 2.3.2 Using the character class digit = 0-9, the regular expressions for
unsigned int- and float-constants simplify to:

digit digit*
digit digit* E (+ | -)? digit digit*  |  digit* (.digit | digit.) digit* (E (+ | -)? digit digit*)?
⊓⊔
2.3.2 Nonrecursive Parentheses

Programming languages have lexical units that are characterized by enclosing
parentheses. Examples are string constants and comments. Parentheses limiting
comments can be composed of several characters: (* and *), or /* and */, or // and
\n (newline). More or less arbitrary texts can be enclosed between the opening and the
closing parentheses. This is not easily described. A comfortable abbreviation for
this is:

r1 until r2
2.4 Scanner Generation
Section 2.2 presented methods for deriving FAs from regular expressions, for com-
piling FAs into DFAs and finally for minimizing DFAs. In what follows we present
the extensions of these methods which are necessary for the implementation of scan-
ners.
2.4.1 Character Classes

Character classes were introduced to simplify regular expressions. They may also
lead to smaller automata. The character-class definitions

alpha = a-z
digit = 0-9

can be used to replace the 26 transitions between two states under letters by one
transition under alpha. This may simplify the DFA for the expression

Id = alpha (alpha | digit)*

A scanner generator may, however, also admit overlapping character classes, such as

alpha = a-z
alphanum = a-z0-9

used to define the symbol class Id = alpha alphanum*. The generator would then divide
these character classes into disjoint ones:

digit′ = alphanum \ alpha
alpha′ = alpha ∩ alphanum = alpha
2.4.2 An Implementation of the until-Construct

Let us assume that the scanner should recognize symbols whose symbol class is
specified by the expression r = r1 until r2. After recognizing a word of the language
of r1, the scanner needs to find a word of the language of r2 and then halt. This task is a
generalization of the pattern-matching problem on strings. There exist algorithms
that solve this problem for regular patterns in time linear in the length of the
input. These are, for example, used in the UNIX program EGREP. They construct an
FA for this task. Likewise, we now present a single construction of a DFA for the
expression r.

Let L1, L2 be the languages described by the expressions r1 and r2. The
language L defined by the expression r1 until r2 is

L = L1 · R̄ · L2

where R̄ denotes the complement of the language R = Σ* L2 Σ*.
The construction starts with automata for the languages L1 and L2, decomposes the
regular expression describing the language, and applies standard constructions for
automata. It has the following seven steps; Fig. 2.7 shows all seven steps
for an example.
1. The first step constructs FAs M1 and M2 for the regular expressions r1 and r2,
   with L(M1) = L1 and L(M2) = L2. A copy of the FA M2 is needed for step 2,
   and one more in step 6.
2. An FA M3 for Σ* L2 Σ* is constructed using the first copy of M2: ε-transitions
   lead into its initial state and out of its final states, and two surrounding states
   carry loops under all characters of Σ.
3. The FA M3 is converted into a DFA M4 by means of the subset construction.
4. A DFA M5 for the complement R̄ of R = Σ* L2 Σ* is obtained from M4 by
   exchanging final and non-final states.

Fig. 2.7 The steps of the construction for an example with L1 = {z} and L2 = {xy}: an FA
for {xy}; an FA for Σ*{xy}Σ*; the DFA for Σ*{xy}Σ*; the minimal DFA, after removal of
the error state, for the complement; an FA for {z} · R̄ · {xy}; and the corresponding DFA
5. The DFA M5 is transformed into a minimal DFA M6. All final states of M4 are
   equivalent and dead in M5, since it is not possible to reach a final state of M5
   from any final state of M4.
6. Using the FAs M1 and M2 for L1 and L2, together with M6, an FA M7 for the
   language L1 · R̄ · L2 is constructed: the final states of M1 are connected by
   ε-transitions to the initial state of M6, and from each final state of M6, including
   the initial state of M6, there is an ε-transition to the initial state of M2. From
   there, paths under all words w ∈ L2 lead into the final state of M2, which is the
   only final state of M7.
7. The FA M7 is converted into a DFA M8 and possibly minimized.
2.4.3 Sequences of Regular Expressions

Let a sequence

r0, …, rn−1

of regular expressions be given for the symbol classes to be recognized by the
scanner. A scanner recognizing the symbols of these classes can be generated in the
following steps:
1. In a first step, FAs Mi = (Qi, Σ, Δi, q0,i, Fi) are generated for the regular
   expressions ri, where the sets Qi should be pairwise disjoint.
2. The FAs Mi are combined into a single FA M = (Q, Σ, Δ, q0, F) by adding a
   new initial state q0 together with ε-transitions to the initial states q0,i of the FAs
   Mi. The FA M for the sequence accepts the union of the languages accepted
   by the FAs Mi. The final state reached by a successful run of the automaton
   indicates to which class the found symbol belongs.
3. The subset construction is applied to the FA M, resulting in a DFA P(M). A
   word w is associated with the ith symbol class if it belongs to the language of
   ri, but to no language of a regular expression rj with j < i: expressions
   with smaller indices are preferred over expressions with larger indices.
   To which symbol class a word w belongs can be read off from the DFA P(M):
   the word w belongs to the ith symbol class if and only if it drives the DFA
   from its initial state into a final state S with S ∩ Fi ≠ ∅ and S ∩ Fj = ∅ for
   all j < i.
The character-class definitions

digit = 0-9
hex = A-F

and the regular expressions

digit digit*
h (digit | hex) (digit | hex)*

for the symbol classes Intconst and Hexconst are processed in the following steps:
• FAs are generated for these regular expressions: [diagram: an FA with states
  i0, …, i4 for digit digit*, and an FA with states h0, …, h5 for
  h (digit | hex) (digit | hex)*, connected by ε-transitions and carrying digit- and
  hex-loops]
The final state i4 stands for symbols of the class Intconst, while the final state
h5 stands for symbols of the class Hexconst.
• The two FAs are combined with a new initial state q0 that has ε-transitions to
  i0 and h0.
• The subset construction yields a DFA with initial state 0: under digit it reaches a
  final state for Intconst carrying a digit-loop, and under h a state from which
  digit | hex leads to a final state for Hexconst carrying a digit | hex loop.
2.4.4 The Implementation of a Scanner

We have seen that the core of a scanner is a deterministic finite automaton. The
transition function of this automaton can be represented by a two-dimensional array
delta. This array is indexed by the actual state and the character class of the next
input character. The selected array component contains the new state into which the
DFA should go when reading this character in the actual state. While the access to
delta[q, a] is usually fast, the size of the array delta can be quite large. We observe,
however, that a DFA often contains many transitions into the error state error. This
state can therefore be chosen as the default value for the entries in delta.
Representing only the transitions into non-error states may then lead to a sparsely populated
array, which can be compressed using well-known methods. These save much space
at the cost of slightly increased access times. The now empty entries represent
transitions into the error state. Since they are still important for the error detection of the
scanner, the corresponding information must be preserved.
Let us consider one such compression method. Instead of using the original
array delta to represent the transition function, an array RowPtr is introduced, which
is indexed by states and whose components are addresses of the original rows of
delta; see Fig. 2.8.
Fig. 2.8 [Representation of the transition function: the entry RowPtr[q] for state q points to the
row of delta containing the entries delta[q, a]]
We have not gained anything so far; we have even lost a bit of access efficiency.
The rows of delta to which entries in RowPtr point are often almost empty. The
rows are therefore overlaid into a single one-dimensional array Delta in such a way
that the non-empty entries of delta do not collide. To find the starting position for the
next row to be inserted into Delta, the first-fit strategy can be used: the row is
shifted over the array Delta, starting at its beginning, until no non-empty entries of
this row collide with non-empty entries already allocated in Delta.
The index in Delta at which the qth row of delta is allocated is stored in
RowPtr[q] (see Fig. 2.9). One problem is that the represented DFA has now lost its
ability to identify errors, that is, undefined transitions. Even if δ(q, a) is undefined
(representing a transition into the error state), the component Delta[RowPtr[q] + a]
may contain a non-empty entry stemming from a shifted row of a state p ≠ q.
Therefore, another one-dimensional array Valid is added, which has the same length
as Delta. The array Valid records to which states the entries in
Delta belong. This means that Valid[RowPtr[q] + a] = q if and only if δ(q, a) is
defined. The transition function of the DFA can then be implemented by a function
next() as follows:
State next (State q, CharClass a) {
    if (Valid[RowPtr[q] + a] ≠ q) return error;
    return Delta[RowPtr[q] + a];
}
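The first-fit allocation of the rows can be sketched as follows (the function name insertRow, the row representation, and the marker empty are our own; the book does not prescribe this code):

int insertRow (State q, Row row) {       // returns the value for RowPtr[q]
    int s ← 0;
    while (∃ a : row[a] ≠ empty ∧ Delta[s + a] ≠ empty)
        s ← s + 1;                        // first fit: shift the row until nothing collides
    forall (a with row[a] ≠ empty) {
        Delta[s + a] ← row[a];
        Valid[s + a] ← q;                 // remember the owner of this entry
    }
    return s;
}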
2.5 The Screener

Scanners can be used in many applications, even beyond the pure splitting of a
stream of characters according to a specification by means of a sequence of reg-
ular expressions. Scanners often provide the possibility of further processing the
recognized elements.
To specify this extended functionality, each symbol class is associated with a
corresponding semantic action. A screener can therefore be specified as a sequence
of pairs of the form
r0    { action0 }
…
rn−1  { actionn−1 }
where ri is a (possibly extended) regular expression over character classes speci-
fying the ith symbol class, and actioni denotes the semantic action to be executed
when a symbol of this class is found. If the screener is to be implemented in a
particular programming language, the semantic actions are typically specified as code
in this language. Different languages offer different ways to return a
representation of the found symbol. An implementation in C would, for instance, return an
int-value to identify the symbol class, while all other relevant values would have to be
passed in suitable global variables. Somewhat more comfort is offered by
an implementation of the screener in a modern object-oriented language such as
JAVA. There, a class Token can be introduced whose subclasses Ci correspond to
the symbol classes. The last statement of actioni should then be a return-statement
returning an object of class Ci whose attributes store all properties of the
identified symbol. In a functional language such as OCAML, a data type token can
be supplied whose constructors Ci correspond to the different symbol classes. In
this case, the semantic action actioni should be an expression of type token whose
value Ci(…) represents the identified symbol of class Ci.
Semantic actions often need access to the text of the current symbol. Some generated scanners offer access to it by means of a global variable yytext. Further global variables contain information such as the position of the current symbol in the input. These are important for the generation of meaningful error messages. Some symbols should be ignored by the screener. Semantic actions therefore should also be able to return no result and instead ask the scanner for another symbol from the input stream. A comment may, for example, be skipped, or a compiler directive be processed, without returning a symbol. In a corresponding action in a generator for C or JAVA, the return-statement would simply be omitted.
    Token yylex() {
        while (true)
            switch (scan()) {
                case 0:      action0; break;
                ...
                case n − 1:  actionn−1; break;
                default:     return error();
            }
    }
The function error() is meant to handle the case that an error occurs while the scanner attempts to identify the next symbol. If an action actioni does not contain a return-statement, execution resumes at the beginning of the switch-statement and reads the next symbol from the remaining input. If an action actioni terminates by executing a return-statement, the switch-statement together with the while-loop is terminated, and the corresponding value is returned as the result of the current call of the function yylex().
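As an illustration, here is what concrete semantic actions might look like in a C-flavored screener of the kind described above, where the symbol class is returned as an int and further values travel through globals. The symbol classes, token codes, and helper declarations are hypothetical, not taken from the book:

    #include <stdlib.h>

    enum { TOK_ID = 1, TOK_NUM, TOK_ERROR };

    extern char *yytext;             /* text of the current symbol            */
    extern int   scan(void);         /* index of the recognized symbol class  */

    long token_value;                /* passed around in a global, C-style    */

    int yylex(void) {
        for (;;)
            switch (scan()) {
            case 0:                  /* identifier                            */
                return TOK_ID;
            case 1:                  /* number: convert the matched text      */
                token_value = strtol(yytext, NULL, 10);
                return TOK_NUM;
            case 2:                  /* comment: no return, fetch next symbol */
                break;
            default:
                return TOK_ERROR;
            }
    }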
A scanner specification with scanner states associates each state with its own list of symbol classes:

    A0:     class_list0
      ...
    Ar−1:   class_listr−1

where class_listj is the sequence of regular expressions and semantic actions for state Aj. For the states normal and comment of Example 2.5.1 we get:

    normal:
        /*      { yystate = comment; }
        ...     // further symbol classes
    comment:
        */      { yystate = normal; }
        .       { }

The character . stands for an arbitrary input symbol. Since none of the actions for start, content, or end of comment has a return-statement, no symbol is returned for the whole comment. □
Scanner states determine which subsequence of the symbol classes is currently recognized. In order to support scanner states, the generation process for the function yylex() can still be applied, namely to the concatenation of the sequences class_listj. The only function that needs to be modified is the function scan(). To identify the next symbol, this function no longer uses a single deterministic finite automaton, but one automaton Mj for each subsequence class_listj. Depending on the current scanner state Aj, the corresponding DFA Mj is first selected and then applied to identify the next symbol.
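A minimal C sketch of this modification follows; the helper run_dfa() and the table layout are assumptions of ours, standing in for the longest-match recognition described earlier:

    typedef struct DFA DFA;          /* compressed tables of one automaton M_j    */

    extern DFA *dfas;                /* one DFA per scanner state (assumption)    */
    extern int  yystate;             /* current scanner state, set by the actions */
    extern int  run_dfa(DFA *m);     /* assumed helper: longest match with M_j,
                                        returns the index of the matched class    */

    /* The only change to scan(): select the automaton that belongs to the
     * current scanner state before recognizing the next symbol.            */
    int scan(void) {
        return run_dfa(&dfas[yystate]);
    }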
The duties may be distributed between scanner and screener in many ways. Ac-
cordingly, there are also various choices for the functionality of the screener. The
advantages and disadvantages are not easily determined. One example for two al-
ternatives is the recognition of keywords. According to the distribution of duties
given in the last chapter, the screener is in charge of recognizing reserved symbols
(keywords). One possibility to do this is to form an extra symbol class for each
reserved word. Figure 2.10 shows a finite-state automaton that recognizes several
reserved words in its final states. Reserved keywords in C, JAVA, and OCAML,
on the other hand, have the same form as identifiers. An alternative to recognizing
them in the final states of a DFA therefore is to delegate the recognition of keywords
to the screener while processing found identifiers.
The function scan() then signals that an identifier has been found. The semantic
action associated with the symbol class identifier additionally checks whether, and
[Fig. 2.10: Finite-state automaton for the recognition of identifiers and keywords class, new, if, else, in, int, float, string]
if so, which keyword was found. This distribution of work between scanner and
screener keeps the size of the DFA small. A prerequisite, however, is that keywords
can be quickly recognized.
Internally, identifiers are often represented by unique int-values, where the
screener uses a hash table to compute this internal code. A hash table supports the
efficient comparison of a newly found identifier with identifiers that have already
been entered. If keywords have been entered into the table before lexical analysis
starts, the screener can then identify their occurrences with approximately the
same effort that is necessary for processing other identifiers.
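A sketch of this technique in C follows. The hash-table interface intern() and the code ranges are our own assumptions; the point is that one lookup both recognizes keywords and interns ordinary identifiers:

    /* Assumed interface: intern(name, code) returns the code already stored
     * for name, or associates name with code and returns it if name is new.
     * (A fresh code is consumed even when the name was known; good enough
     * for a sketch.)                                                        */
    extern int intern(const char *name, int code);
    extern char *yytext;                 /* text of the current symbol       */

    enum { TOK_CLASS = 1, TOK_IF, TOK_ELSE, FIRST_ID_CODE = 100 };
    static int next_id_code = FIRST_ID_CODE;

    void enter_keywords(void) {          /* called before lexical analysis   */
        intern("class", TOK_CLASS);
        intern("if",    TOK_IF);
        intern("else",  TOK_ELSE);
    }

    /* Semantic action for the symbol class identifier: a single lookup tells
     * whether the matched text is a keyword or an ordinary identifier.      */
    int identifier_action(void) {
        return intern(yytext, next_id_code++);
    }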
2.6 Exercises
1. Kleene star
   Let Σ be an alphabet and L, M ⊆ Σ∗. Show:
   (a) L ⊆ L∗.
   (b) ε ∈ L∗.
   (c) u, v ∈ L∗ implies uv ∈ L∗.
   (d) L∗ is the smallest set with properties (a) – (c), that is, if a set M satisfies L ⊆ M, ε ∈ M, and (u, v ∈ M ⇒ uv ∈ M), then L∗ ⊆ M.
   (e) L ⊆ M implies L∗ ⊆ M∗.
   (f) (L∗)∗ = L∗.
2. Symbol classes
   FORTRAN provides the implicit declaration of identifiers according to their leading character. Identifiers beginning with one of the letters i, j, k, l, m, n are taken as int-variables or int-function results. All other identifiers denote float-variables.
   Give a definition of the symbol classes FloatId and IntId.
3. Extended regular expressions
   Extend the construction of finite automata for regular expressions from Fig. 2.3 in such a way that it processes regular expressions r+ and r? directly. r+ stands for rr∗ and r? for (r | ε).
4. Extended regular expressions (cont.)
   Extend the construction of finite automata for regular expressions by a treatment of counting iteration, that is, by regular expressions of the form:
   r{u,o}   at least u and at most o consecutive instances of r
   r{u}     at least u consecutive instances of r
   r{o}     at most o consecutive instances of r
5. Deterministic finite automata
Convert the FA of Fig. 2.10 into a DFA.
6. Character classes and symbol classes
   Consider the following definitions of character classes:

   bu  = a−z
   zi  = 0−9
   bzi = 0 | 1
   ozi = 0−7
   hzi = 0−9 | A−F

   and of symbol classes:

   b bzi+
   o ozi+
   h hzi+
   zi+
   bu (bu | zi)∗
(a) Give the partitioning of the character classes that a scanner generator
would compute.
(b) Describe the generated finite automaton using these character classes.
(c) Convert this finite automaton into a deterministic one.
7. Reserved identifiers
Construct a DFA for the FA of Fig. 2.10.
8. Table compression
Compress the table of the deterministic finite automaton using the method of
Sect. 2.2.
2.7 Bibliographic Notes
The conceptual separation of scanner and screener was proposed by F.L. DeRemer [15]. Many so-called compiler generators support the generation of scanners from regular expressions. Johnson et al. [29] describe such a system. The corresponding routine under UNIX, LEX, was realized by M. Lesk [42]. FLEX was implemented by Vern Paxson. The approach described in this chapter follows the scanner generator JFLEX for JAVA.
Compression methods for sparsely populated matrices as they are generated in scanner and parser generators are described and analyzed in [61] and [11].
3 Syntactic Analysis
The parser realizes the syntactic analysis of programs. Its input is a sequence of
symbols as produced by the combination of scanner and screener. The parser is
meant to identify the syntactic structure in this sequence of symbols, that is how
the syntactic units are composed from other units. Syntactic units in imperative
languages are, for example, variables, expressions, declarations, statements, and
sequences of statements. Functional languages have variables, expressions, pat-
terns, definitions, and declarations, while logic languages such as PROLOG have
variables, terms, goals, and clauses.
The parser represents the syntactic structure of the input program in a data struc-
ture that allows the subsequent phases of the compiler to access the individual
program components. One possible representation is the syntax tree or parse tree.
The syntax tree may later be decorated with more information about the program.
Transformations of the program may rely on this data structure, and even code for
a target machine can be generated from it.
For some languages, the compilation task is so simple that programs can be
translated in one pass over the program text. In this case, the parser can avoid the
construction of the intermediate representation. The parser acts as the main function
calling routines for semantic analysis and for code generation.
Many programs that are presented to a compiler contain errors. Many of the
errors are violations of the rules for forming valid programs. Often such syntax
errors are simply scrambled letters, nonmatching brackets, or missing semicolons.
The compiler is expected to adequately react to errors. It should at least attempt
to locate errors precisely. However, often only the localization of error symptoms
is possible, not the localization of the errors themselves. The error symptom is the
position where no continuation of the syntactic analysis is possible. The compiler
is also often expected not to give up after the first error found, but to continue to
analyze the rest of the program in order to detect more errors in the same run.
The first reaction is absolutely required. Later stages of the compiler assume that
they receive syntactically correct programs in the form of syntax trees. And given
that there are errors in the program, the programmer had better be informed about them.
There are, however, two significant problems: First, further syntax errors can re-
main undetected in the vicinity of a detected error. Second, since the parser only
gets suspicious when it gets stuck, the parser will, in general, only detect error
symptoms, not errors themselves.
Example 3.1.1 Consider the erroneous assignment

    a = a ∗ (b + c ∗ d ;
                        ↑
    error symptom: ')' is missing

Several errors could lead to the same error symptom: either there is an extra open parenthesis, or a closing parenthesis is missing after c or after d. Each of the three corrections leads to a program with a different meaning. □
For errors such as extra or missing brackets ({, }, begin, end, if, etc.), the position of the error and the position of the error symptom can be far apart. Practical parsing methods, such as LL(k)- and LR(k)-parsing, have the viable-prefix property:

Whenever the prefix u of a word has been analyzed without announcing an error, then there exists a word w such that uw is a word of the language.

Parsers possessing this property report errors and error symptoms at the earliest possible time. We have explained above that, in general, the parser only discovers error symptoms, not errors themselves. Still, we will speak of errors in the following. In this sense, the discussed parsers perform the first two listed actions: they report and try to diagnose errors.
Example 3.1.1 shows that the second action is not as easily realized. The parser
can only attempt a diagnosis of the error symptom. It should at least provide the
following information:
• the position of the error in the program,
• a description of the parser configuration, i.e., the current state, the expected symbol, and the found symbol.
For the third listed action, the correction of an error, the parser would need to guess
the intention of the programmer. This is, in general, difficult. Slightly more realistic
is to search for an error correction that is globally optimal. The parser is given the
capability to insert or delete symbols in the input word. The globally optimal error
correction for an erroneous input word w is a word w′ that is obtained from w
by a minimal number of such insertions and deletions. Such methods have been
proposed in the literature, but have not been used in practice due to the tremendous
effort that is required.
Instead, most parsers perform only local corrections to have the parser move
from the error configuration to a new configuration in which it can at least read the
next input symbol. This prevents the parser from going into an endless loop while
trying to repair an error.
3.2 Foundations
In the same way as lexical analysis is specified by regular expressions and imple-
mented by finite automata, so is syntax analysis specified by context-free grammars
(CFG) and implemented by pushdown automata (PDA). Regular expressions alone
are not sufficient to describe the syntax of programming languages since they can-
not express embedded recursion as occurs in the nesting of expressions, statements,
and blocks.
In Sects. 3.2.1 and 3.2.3, we introduce the necessary notions about context-
free grammars and pushdown automata. Readers familiar with these notions can
skip them and go directly to Sect. 3.2.4. In Sect. 3.2.4, a pushdown automaton
is introduced for a context-free grammar that accepts the language defined by that
grammar.
The terminal symbols of the grammar are the symbols delivered by the screener. Examples of such symbols are reserved keywords of the language, or symbol classes such as identifiers, which comprise a set of symbols.
The nonterminals of the grammar denote sets of words that can be produced
from them by means of the production rules of the grammar. In the example gram-
mar 3.2.1, nonterminals are enclosed in angle brackets. A production rule (in short:
production) (A, α) in the relation P describes a possible replacement: an occurrence of the left side A in a word β = γ1Aγ2 can be replaced by the right side α ∈ (VT ∪ VN)∗. In the view of a top-down parser, a new word β′ = γ1αγ2 is produced or derived from the word β.
A bottom-up parser, on the other hand, interprets the production (A, α) as a replacement of the right side α by the left side A. Applying the production to a word β′ = γ1αγ2 reduces it to the word β = γ1Aγ2.
We introduce some useful conventions concerning a CFG G = (VN, VT, P, S). Capital Latin letters from the beginning of the alphabet, e.g., A, B, C, are used to denote nonterminals from VN; capital Latin letters from the end of the alphabet, e.g., X, Y, Z, denote terminals or nonterminals. Small Latin letters from the beginning of the alphabet, e.g., a, b, c, …, stand for terminals from VT; small Latin letters from the end of the alphabet, like u, v, w, x, y, z, stand for terminal words, that is, elements from VT∗; small Greek letters such as α, β, γ, φ, ψ stand for words from (VT ∪ VN)∗.
The relation P is seen as a set of production rules. Each element (A, α) of this relation is, more intuitively, written as A → α. All productions A → α1, A → α2, …, A → αn for a nonterminal A are combined to

    A → α1 | α2 | … | αn

The α1, α2, …, αn are called the alternatives of A.
Example 3.2.2 The two grammars G0 and G1 describe the same language:

G0 = ({E, T, F}, {+, ∗, (, ), Id}, P0, E) where P0 is given by:

    E → E + T | T
    T → T ∗ F | F
    F → (E) | Id

G1 = ({E}, {+, ∗, (, ), Id}, P1, E) where P1 is given by:

    E → E + E | E ∗ E | (E) | Id   □
The sequence φ0, φ1, …, φn is called a derivation of ψ from φ according to G. The existence of a derivation of length n is written as φ ⇒ⁿ_G ψ. The relation ⇒∗_G denotes the reflexive and transitive closure of ⇒_G.
Example 3.2.3 The grammars of Example 3.2.2 have, among others, the derivations

    E ⇒ E + T ⇒ T + T ⇒ T ∗ F + T ⇒ T ∗ Id + T ⇒ F ∗ Id + T ⇒ F ∗ Id + F ⇒ Id ∗ Id + F ⇒ Id ∗ Id + Id   according to G0,

    E ⇒ E + E ⇒ E ∗ E + E ⇒ Id ∗ E + E ⇒ Id ∗ E + Id ⇒ Id ∗ Id + Id   according to G1.

We conclude from these derivations that E ⇒∗_G1 Id ∗ Id + Id holds as well as E ⇒∗_G0 Id ∗ Id + Id. □
A word x ∈ L(G) is called a word of G. A word α ∈ (VT ∪ VN)∗ with S ⇒∗_G α is called a sentential form of G.
Example 3.2.4 Let us consider again the grammars of Example 3.2.3. The word Id ∗ Id + Id is a word of both G0 and G1, since E ⇒∗_G0 Id ∗ Id + Id as well as E ⇒∗_G1 Id ∗ Id + Id hold. □
We omit the index G in ⇒_G when the grammar to which the derivation refers is clear from the context.
The syntactic structure of a program, as it results from syntactic analysis, is the
syntax tree or parse tree (we will use these two notions synonymously). The parse
tree provides a canonical representation of derivations. Within a compiler, the parse
tree serves as the interface to the subsequent compiler phases. Most approaches
to the evaluation of semantic attributes, as they are described in Chap. 4, about
semantic analysis, work on this tree structure.
[Fig. 3.1: Two syntax trees according to grammar G1 of Example 3.2.2 for the word Id ∗ Id + Id]

Let G = (VN, VT, P, S) be a CFG. A parse tree or syntax tree t for G is an ordered tree whose inner nodes and leaf nodes are labeled with symbols from VN and elements from VT ∪ {ε}, respectively.
Moreover, the label B of each inner node n of t, together with the sequence of labels X1, …, Xk of the children of n in t, has the following properties:
1. B → X1 … Xk is a production from P.
2. If X1 … Xk = ε, then node n has exactly one child and this child is labeled with ε.
3. If X1 … Xk ≠ ε, then Xi ∈ VN ∪ VT for each i.
If the root of t is labeled with nonterminal symbol A, and if the concatenation of
the leaf labels yields the terminal word w we call t a parse tree for nonterminal A
and word w according to grammar G. If the root is labeled with S, the start symbol
of the grammar, we just call t a parse tree for w.
Example 3.2.5 Figure 3.1 shows two syntax trees according to grammar G1 of Example 3.2.2 for the word Id ∗ Id + Id. □
The definition implies that each word x ∈ L(G) has at least one derivation from S. To each derivation for a word x corresponds a parse tree for x. Thus, each word x ∈ L(G) has at least one parse tree. On the other hand, to each parse tree for a word x corresponds at least one derivation for x. Any such derivation can easily be read off the parse tree.
Example 3.2.7 The word Id + Id has exactly one parse tree according to grammar G1, shown in Fig. 3.2. Two different derivations result, depending on the order in which the two occurrences of E are replaced:

    E ⇒ E + E ⇒ Id + E ⇒ Id + Id
    E ⇒ E + E ⇒ E + Id ⇒ Id + Id   □

[Fig. 3.2: The uniquely determined parse tree for the word Id + Id]
In Example 3.2.7 we saw that – even with unambiguous words – several derivations
may correspond to one parse tree. This results from the different possibilities to
choose a nonterminal in a sentential form for the next application of a production.
There are two different canonical replacement strategies: one is to replace in each
step the leftmost nonterminal, while the other one replaces in each step the right-
most nonterminal. The corresponding uniquely determined derivations are called
leftmost and rightmost derivations, respectively.
Formally, a derivation φ1 ⇒ … ⇒ φn of φ = φn from S = φ1 is a leftmost derivation of φ, denoted as S ⇒∗lm φ, if in the derivation step from φi to φi+1 the leftmost nonterminal of φi is replaced, i.e., φi = uAγ and φi+1 = uαγ for a word u ∈ VT∗ and a production A → α ∈ P. Similarly, a derivation φ1 ⇒ … ⇒ φn is a rightmost derivation of φ, denoted by S ⇒∗rm φ, if the rightmost nonterminal in φi is replaced, i.e., φi = γAu and φi+1 = γαu with u ∈ VT∗ and A → α ∈ P. A sentential form that occurs in a leftmost derivation (rightmost derivation) is called a left sentential form (right sentential form).
To each parse tree for S there exists exactly one leftmost derivation and exactly
one rightmost derivation. Thus, there is exactly one leftmost and one rightmost
derivation for each unambiguous word in a language.
Example 3.2.8 The word Id ∗ Id + Id has, according to grammar G1, the leftmost derivations

    E ⇒lm E + E ⇒lm E ∗ E + E ⇒lm Id ∗ E + E ⇒lm Id ∗ Id + E ⇒lm Id ∗ Id + Id   and
    E ⇒lm E ∗ E ⇒lm Id ∗ E ⇒lm Id ∗ E + E ⇒lm Id ∗ Id + E ⇒lm Id ∗ Id + Id,

and the rightmost derivations

    E ⇒rm E + E ⇒rm E + Id ⇒rm E ∗ E + Id ⇒rm E ∗ Id + Id ⇒rm Id ∗ Id + Id   and
    E ⇒rm E ∗ E ⇒rm E ∗ E + E ⇒rm E ∗ E + Id ⇒rm E ∗ Id + Id ⇒rm Id ∗ Id + Id.

The word Id + Id has the leftmost derivation

    E ⇒lm E + E ⇒lm Id + E ⇒lm Id + Id

and the rightmost derivation

    E ⇒rm E + E ⇒rm E + Id ⇒rm Id + Id.   □
Lemma 3.2.1
1. If S ⇒∗lm uAφ holds, then there exists a ψ with ψ ⇒∗ u, such that for all v with φ ⇒∗ v it holds that S ⇒∗rm ψAv.
2. If S ⇒∗rm ψAv holds, then there exists a φ with φ ⇒∗ v, such that for all u with ψ ⇒∗ u it holds that S ⇒∗lm uAφ. □

Figure 3.3 clarifies the relation between φ and v on one side and ψ and u on the other side.
Context-free grammars that describe programming languages should be unam-
biguous. If this is the case, then there exist exactly one parse tree, one leftmost and
one rightmost derivation for each syntactically correct program.
[Fig. 3.3: The relation between ψ, u on one side and φ, v on the other]
to compute the subsets of nonterminals that have these properties. All nonterminals
not having these properties, together with all productions using such nonterminals
can be removed. The resulting grammars are then called reduced.
The first required property of useful nonterminals is productivity. A nonterminal
X of a CFG G = (VN, VT, P, S) is called productive if there exists a derivation X ⇒∗_G w for a word w ∈ VT∗, or equivalently, if there exists a parse tree whose root is labeled with X.
Example 3.2.9 Consider the grammar G = ({S′, S, X, Y, Z}, {a, b}, P, S′), where P consists of the productions:

    S′ → S
    S  → aXZ | Y
    X  → bS | aYbY
    Y  → ba | aZ
    Z  → aZX
The call init(p) of the routine init() for a production p, whose code we have not given, iterates over the sequence of symbols on the right side of p. At each occurrence of a nonterminal X, the counter count[p] is incremented, and p is added to the list occ[X]. If at the end count[p] = 0 still holds, then init(p) enters the left side of production p into the list W. This concludes the initialization.
The main iteration processes the entries of W one by one. When a nonterminal X is newly discovered as productive, the algorithm iterates through the list occ[X] of those productions in which X occurs. The counter count[r] is decremented for each production r in this list; when count[r] reaches 0, the left side of r is known to be productive as well. The described method is realized by the following algorithm:
    ...
    while (W ≠ []) {
        X ← hd(W); W ← tl(W);
        if (X ∉ productive) {
            productive ← productive ∪ {X};
            forall ((r : A → α) ∈ occ[X]) {
                count[r]−−;
                if (count[r] = 0)  W ← A :: W;
            }  // end of forall
        }  // end of if
    }  // end of while
Let us derive the run time of this algorithm. The initialization phase essentially
runs once over the grammar and does a constant amount of work for each symbol.
The main iteration through the worklist enters the left side of each production at
most once into the list W and so removes it also at most once from the list. At the
removal of a nonterminal X from W more than a constant amount of work has to
be done only when X has not yet been marked as productive. The effort for such
an X is proportional to the length of the list occŒX. The sum of these lengths is
bounded by the overall size of the grammar G. This means that the total effort is
linear in the size of the grammar.
To show the correctness of the procedure, we ascertain that it possesses the following properties:
• If X is entered into the set productive in the jth iteration of the while-loop, there exists a parse tree for X of height at most j − 1.
• For each parse tree, its root label is eventually entered into W.
The efficient algorithm just presented has relevance beyond its application in com-
piler construction. It can be used with small modifications to compute least solu-
tions of Boolean systems of equations, that is, of systems of equations in which the
right sides are disjunctions of arbitrary conjunctions of unknowns. In our exam-
ple, the conjunctions stem from the right sides while a disjunction represents the
existence of different alternatives for a nonterminal.
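As a concrete illustration, here is a self-contained C program (our own encoding, not the book's interface) that applies this counting scheme to the grammar of Example 3.2.9. For brevity it rescans all productions whenever a nonterminal becomes productive, instead of maintaining the occ[X] lists, and therefore does not attain the linear time bound:

    #include <stdio.h>

    /* Grammar of Example 3.2.9, encoded by hand.
     * Nonterminals: 0=S', 1=S, 2=X, 3=Y, 4=Z                         */
    enum { NNT = 5, NPROD = 8 };

    /* For each production: its left side and the nonterminals on its
     * right side (terminals are irrelevant for productivity).        */
    int lhs[NPROD]    = { 0, 1, 1, 2, 2, 3, 3, 4 };
    int rhs[NPROD][3] = { {1,-1}, {2,4,-1}, {3,-1}, {1,-1},
                          {3,3,-1}, {-1}, {4,-1}, {4,2,-1} };

    int count[NPROD];        /* occurrences not yet known to be productive */
    int productive[NNT];
    int worklist[NPROD], top = 0;

    int main(void) {
        /* init: productions without nonterminal occurrences start the list */
        for (int p = 0; p < NPROD; p++) {
            for (int i = 0; rhs[p][i] >= 0; i++) count[p]++;
            if (count[p] == 0) worklist[top++] = lhs[p];
        }
        while (top > 0) {
            int X = worklist[--top];
            if (productive[X]) continue;
            productive[X] = 1;
            /* every occurrence of X brings its production closer to done */
            for (int p = 0; p < NPROD; p++)
                for (int i = 0; rhs[p][i] >= 0; i++)
                    if (rhs[p][i] == X && --count[p] == 0)
                        worklist[top++] = lhs[p];
        }
        for (int X = 0; X < NNT; X++)
            printf("nonterminal %d: %s\n", X,
                   productive[X] ? "productive" : "not productive");
        return 0;
    }

Running it reports S′, S, X, and Y as productive; Z stays nonproductive because every production for Z again requires Z.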
The second property of a useful nonterminal is its reachability. We call a nonterminal X reachable in a CFG G = (VN, VT, P, S) if there exists a derivation S ⇒∗_G αXβ.
    S → Y            X → c
    Y → YZ | Ya | b  V → Vd | d
    U → V            Z → ZX
To reduce a grammar G, first all nonproductive nonterminals are removed from the
grammar together with all productions in which they occur. Only in a second step
are the unreachable nonterminals eliminated, again together with the productions in
which they occur. This second step is, therefore, based on the assumption that all
remaining nonterminals are productive.
Example 3.2.11 Let us consider again the grammar of Example 3.2.9 with the productions

    S′ → S
    S  → aXZ | Y
    X  → bS | aYbY
    Y  → ba | aZ
    Z  → aZX
We assume in the following that grammars are always reduced in this way.
[Figure: schematic of a pushdown automaton, consisting of an input tape, a control, and a pushdown]
state. The transition relation Δ describes the possible computation steps of the PDA. It lists finitely many transitions. Executing the transition (γ, x, γ′) replaces the upper section γ ∈ Q⁺ of the pushdown by the new sequence γ′ ∈ Q∗ of states and reads x ∈ VT ∪ {ε} in the input. The replaced section of the pushdown has at least length 1. A transition that does not inspect the next input symbol is called an ε-transition.
Similarly to the case of finite automata, we introduce the notion of a configuration for PDAs. A configuration encompasses all components that may influence the future behavior of the automaton. With our kind of PDA these are the contents of the pushdown and the remaining input. Formally, a configuration of the PDA P is a pair (γ, w) ∈ Q⁺ × VT∗. In the linear representation, the topmost position of the pushdown is always at the right end of γ, while the next input symbol is situated at the left end of w. A computation step of P is represented by the binary relation ⊢_P between configurations. This relation is defined by:
This means that a word w is accepted by a PDA if there exists at least one computation that goes from the initial configuration (q0, w) to a final configuration. Such computations are called accepting. Several accepting computations may exist for the same word, as well as several computations that can only read a prefix of a word w, or that read w but do not reach a final configuration.
In practice, accepting computations should not be found by trial and error. Therefore, deterministic PDAs are of particular importance. A PDA P is called deterministic if the transition relation Δ has the following property:

(D) If (γ1, x, γ2) and (γ1′, x′, γ2′) are two different transitions in Δ and γ1′ is a suffix of γ1, then x and x′ are both in VT and are different from each other, that is, x ≠ ε ≠ x′ and x ≠ x′.

If the transition relation has property (D), there exists at most one transition out of each configuration.
In this section, we present a method that constructs for each CFG a PDA that accepts the language defined by the grammar. This automaton is nondeterministic and therefore not in itself useful for a practical application. However, we can derive from it the LL-parsers of Sect. 3.3, as well as the LR-parsers of Sect. 3.4, by appropriate design decisions.
The notion of context-free item is crucial for the construction. Let G = (VN, VT, P, S) be a CFG. A context-free item of G is a triple (A, α, β) with A → αβ ∈ P. This triple is, more intuitively, written as [A → α.β]. The item [A → α.β] describes the situation that, in an attempt to derive a word w from A, a prefix of w has already been derived from α. α is therefore called the history of the item.
An item [A → α.β] with β = ε is called complete. The set of all context-free items of G is denoted by It_G. If ρ is the sequence of items

    ρ = [A1 → α1.β1] [A2 → α2.β2] … [An → αn.βn]

then hist(ρ) denotes the concatenation of the histories of the items of ρ, i.e.,

    hist(ρ) = α1 α2 … αn.
As our next case, we assume that the last transition was a shifting transition. Before this transition, a configuration (ρ [X → β.aγ], av) has been reached from the initial configuration ([S′ → .S], uav). This configuration again satisfies the invariant (I) by the induction hypothesis, that is, hist(ρ)β ⇒∗ u holds. The successor configuration (ρ [X → βa.γ], v) also satisfies the invariant (I) because

    hist(ρ [X → βa.γ]) = hist(ρ)βa ⇒∗ ua

For the final case, let us assume that the last transition was a reducing transition. Before this transition, a configuration (ρ [X → β.Yγ] [Y → α.], v) has been reached from the initial configuration ([S′ → .S], uv). This configuration satisfies the invariant (I) according to the induction hypothesis, that is, hist(ρ)βα ⇒∗_G u holds. The current state is the complete item [Y → α.]. It is the result of a computation that started with the item [Y → .α], when [X → β.Yγ] was the current state and the alternative Y → α for Y was selected. This alternative has been successfully processed.
Therefore w ∈ L(G). For the other direction, we assume w ∈ L(G). We then have S ⇒∗_G w. To prove

    ([S′ → .S], w) ⊢∗_PG ([S′ → S.], ε)

we show a more general statement, namely that for each derivation A ⇒_G α ⇒∗_G w with A ∈ VN,

    (ρ [A → .α], wv) ⊢∗_PG (ρ [A → α.], v)

for arbitrary ρ ∈ It_G∗ and arbitrary v ∈ VT∗. This general claim can be proved by induction over the length of the derivation A ⇒_G α ⇒∗_G w. □
Example 3.2.12 Consider again the grammar G0 of Example 3.2.2, extended by the start production S → E:

    S → E
    E → E + T | T
    T → T ∗ F | F
    F → (E) | Id

The transition relation of P_G0 is presented in Table 3.1. Table 3.2 shows an accepting computation of P_G0 for the word Id + Id ∗ Id. □
Table 3.1 Tabular representation of the transition relation of Example 3.2.12. The middle column shows the consumed input

    Top of the pushdown        Input  New top of the pushdown
    [S → .E]                   ε      [S → .E] [E → .E+T]
    [S → .E]                   ε      [S → .E] [E → .T]
    [E → .E+T]                 ε      [E → .E+T] [E → .E+T]
    [E → .E+T]                 ε      [E → .E+T] [E → .T]
    [F → (.E)]                 ε      [F → (.E)] [E → .E+T]
    [F → (.E)]                 ε      [F → (.E)] [E → .T]
    [E → .T]                   ε      [E → .T] [T → .T∗F]
    [E → .T]                   ε      [E → .T] [T → .F]
    [T → .T∗F]                 ε      [T → .T∗F] [T → .T∗F]
    [T → .T∗F]                 ε      [T → .T∗F] [T → .F]
    [E → E+.T]                 ε      [E → E+.T] [T → .T∗F]
    [E → E+.T]                 ε      [E → E+.T] [T → .F]
    [T → .F]                   ε      [T → .F] [F → .(E)]
    [T → .F]                   ε      [T → .F] [F → .Id]
    [T → T∗.F]                 ε      [T → T∗.F] [F → .(E)]
    [T → T∗.F]                 ε      [T → T∗.F] [F → .Id]
    [F → .(E)]                 (      [F → (.E)]
    [F → .Id]                  Id     [F → Id.]
    [F → (E.)]                 )      [F → (E).]
    [E → E.+T]                 +      [E → E+.T]
    [T → T.∗F]                 ∗      [T → T∗.F]
    [T → .F] [F → Id.]         ε      [T → F.]
    [T → T∗.F] [F → Id.]       ε      [T → T∗F.]
    [T → .F] [F → (E).]        ε      [T → F.]
    [T → T∗.F] [F → (E).]      ε      [T → T∗F.]
    [T → .T∗F] [T → F.]        ε      [T → T.∗F]
    [E → .T] [T → F.]          ε      [E → T.]
    [E → E+.T] [T → F.]        ε      [E → E+T.]
    [E → E+.T] [T → T∗F.]      ε      [E → E+T.]
    [T → .T∗F] [T → T∗F.]      ε      [T → T.∗F]
    [E → .T] [T → T∗F.]        ε      [E → T.]
    [F → (.E)] [E → T.]        ε      [F → (E.)]
    [F → (.E)] [E → E+T.]      ε      [F → (E.)]
    [E → .E+T] [E → T.]        ε      [E → E.+T]
    [E → .E+T] [E → E+T.]      ε      [E → E.+T]
    [S → .E] [E → T.]          ε      [S → E.]
    [S → .E] [E → E+T.]        ε      [S → E.]
Deterministic Parsers
By Theorem 3.2.1, the IPDA P_G for a CFG G accepts the grammar's language L(G). The nondeterministic way of working of the IPDA, though, is not suited for practical application in a compiler. The source of nondeterminism lies in the transitions of type (E): the IPDA must choose between several alternatives for a nonterminal at expanding transitions. With an unambiguous grammar, at most one alternative is the correct choice to derive a prefix of the remaining input, while the other alternatives lead sooner or later to dead ends. The IPDA, however, can only guess the right alternative.
In Sects. 3.3 and 3.4, we describe two different ways to replace guessing. The LL-parsers of Sect. 3.3 deterministically choose one alternative for the current nonterminal using a bounded look-ahead into the remaining input. For grammars of the class LL(k), a corresponding parser can deterministically select one (E)-transition based on the already consumed input, the nonterminal to be expanded, and the next k input symbols. LL-parsers are left-parsers.
LR-parsers work differently. They delay the decision, which LL-parsers take at expansion, until reduction. During the analysis, they try to pursue in parallel all possibilities that may lead to a reverse rightmost derivation for the input word. A decision has to be taken only when one of these possibilities signals a reduction. This decision concerns whether to continue shifting or to reduce, and in the latter case, by which production to reduce. The basis for this decision is again the current pushdown and a bounded look-ahead into the remaining input. LR-parsers signal reductions, and are therefore right-parsers. There does not exist an LR-parser for every CFG, but only for grammars of the class LR(k), where k again is the number of necessary look-ahead symbols.
    u ⊙k v = (uv)|k

Correspondingly, the function followk maps each nonterminal X to the set of terminal words of length at most k that can directly follow X in a sentential form:

    followk(X) = { w | S ⇒∗ βXγ and w ∈ firstk(γ#) }
The set firstk(X) consists of the k-prefixes of the leaf words of all trees for X; followk(X) consists of the k-prefixes of the second part of the leaf words of all upper tree fragments for X (see Fig. 3.5). The following lemma describes some properties of k-concatenation and of the function firstk.
[Fig. 3.5: firstk(X) and followk(X), illustrated on a tree for X and on an upper tree fragment for X]
The proofs for (b), (c), (d), and (e) are trivial. (a) is obtained by case distinctions over the lengths of words x ∈ L1, y ∈ L2, z ∈ L3. The proof for (f) uses (e) and the observation that X1 … Xn ⇒∗ u holds if and only if u = u1 … un for suitable words ui with Xi ⇒∗ ui.
Because of property (f), the computation of the set firstk(α) can be reduced to the computation of the sets firstk(X) for single symbols X ∈ VT ∪ VN. Since firstk(a) = {a} holds for a ∈ VT, it suffices to determine the sets firstk(X) for nonterminals X. A word w is in firstk(X) if and only if w is contained in the set firstk(α) for one of the productions X → α ∈ P.
Due to property (f) of Lemma 3.2.2, the firstk-sets satisfy the following system of equations (fi):

    firstk(X) = ⋃ { firstk(X1) ⊙k … ⊙k firstk(Xn) | X → X1 … Xn ∈ P },  X ∈ VN    (fi)
    0: S  → E        3: E′ → +E       6: T′ → ∗T
    1: E  → TE′      4: T  → FT′      7: F  → (E)
    2: E′ → ε        5: T′ → ε        8: F  → Id
□
The right sides of the system of equations for the firstk-sets can be represented as expressions consisting of unknowns firstk(Y), Y ∈ VN, and the set constants {x}, x ∈ VT ∪ {ε}, built using the operators ⊙k and ∪. Immediately the following questions arise:
• Does this system of equations always have solutions?
• If yes, which is the solution corresponding to the firstk-sets?
• How does one compute this solution?
To answer these questions, we first consider general systems of equations like (fi) and look for an algorithmic approach to solve such systems. Let x1, …, xn be a set of unknowns and

    x1 = f1(x1, …, xn)
    x2 = f2(x1, …, xn)
        ⋮
    xn = fn(x1, …, xn)

a system of equations to be solved over a domain D. Each fi on the right side denotes a function fi : Dⁿ → D. A solution I of this system of equations associates a value I(xi) with each unknown xi such that all equations are satisfied, that is
Therefore I = I^(j) is a solution. Without further assumptions, it is unclear whether a j with I^(j+1) = I^(j) is ever reached. In the special cases considered in this volume, we can guarantee that this procedure converges, and not just to an arbitrary solution, but to the desired one. This is based on specific properties of the domains D that occur in our applications.
• There exists a partial order on the domain D, represented by the symbol ⊑. In the case of the firstk-sets, the set D consists of all subsets of the finite base set of terminal words of length at most k. The partial order on this domain is the subset relation.
• D contains a uniquely determined least element, with which the iteration can start. This element is denoted as ⊥ (bottom). In the case of the firstk-sets, this least element is the empty set.
• For each subset Y ⊆ D, there exists a least upper bound ⊔Y with respect to the relation ⊑. In the case of the firstk-sets, the least upper bound of a set of sets is their union. Partial orders with this property are called complete lattices.
• Furthermore, all functions fi are monotonic, that is, they respect the order ⊑ of their arguments. In the case of the firstk-sets this holds because the right sides of the equations are built from the operators union and k-concatenation, which are both monotonic, and because the composition of monotonic functions is again monotonic.
If the algorithm is started with ⊥, then I^(0) ⊑ I^(1) holds. Hereby, a variable binding is less than or equal to another variable binding if this holds for the value of each variable. The monotonicity of the functions fi implies by induction that the algorithm produces an ascending sequence

    I^(0) ⊑ I^(1) ⊑ I^(2) ⊑ …

of variable bindings. If the domain D is finite, there exists a number j such that I^(j) = I^(j+1) holds. This means that the algorithm finds a solution. In fact, this solution is the least solution of the system. Such a least solution even exists if the complete lattice is not finite and the simple iteration does not terminate. This follows from the fixed-point theorem of Knaster-Tarski, which we treat in detail in the third volume, Compiler Design: Analysis and Transformation.
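Before turning to the example, here is a small self-contained C sketch (our own illustration) of this naive fixed-point iteration for the first1-sets of the grammar G2 of Example 3.2.13. Sets are represented as bit masks and 1-concatenation as the function cat1:

    #include <stdio.h>
    #include <string.h>

    /* Terminal symbols and ε are bits of a mask.                      */
    enum { PLUS = 1, TIMES = 2, LPAR = 4, RPAR = 8, ID = 16, EPS = 32 };
    enum { S, E, E1, T, T1, F, NNT };           /* E1 = E', T1 = T'    */

    /* 1-concatenation on first1-sets: if L1 contains ε, the first
     * symbol may also come from L2; otherwise L2 is never reached.    */
    unsigned cat1(unsigned l1, unsigned l2) {
        return (l1 & EPS) ? ((l1 & ~EPS) | l2) : l1;
    }

    int main(void) {
        unsigned fi[NNT] = { 0 }, old[NNT];     /* bottom: empty sets  */
        do {                                    /* iterate to fixpoint */
            memcpy(old, fi, sizeof fi);
            fi[S]  = fi[E];                     /* S  -> E             */
            fi[E]  = cat1(fi[T], fi[E1]);       /* E  -> T E'          */
            fi[E1] = EPS | cat1(PLUS, fi[E]);   /* E' -> ε | + E       */
            fi[T]  = cat1(fi[F], fi[T1]);       /* T  -> F T'          */
            fi[T1] = EPS | cat1(TIMES, fi[T]);  /* T' -> ε | ∗ T       */
            fi[F]  = LPAR | ID;                 /* F  -> ( E ) | Id    */
        } while (memcmp(old, fi, sizeof fi) != 0);
        printf("first1(E) contains Id: %s, (: %s\n",
               fi[E] & ID ? "yes" : "no", fi[E] & LPAR ? "yes" : "no");
        return 0;
    }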
Example 3.2.14 Let us apply this algorithm to determine a solution of the system
of equations of Example 3.2.13. Initially, all nonterminals are associated with the
empty set. The following table shows the words added to the first1 -sets in the ith
iteration.
           1         2        3        4
    S                                  Id, (
    E                         Id, (
    E′     +, ε
    T                Id, (
    T′     ∗, ε
    F      Id, (
□
It suffices to show that all right sides are monotonic and that the domain is finite to
guarantee the applicability of the iterative algorithm for a given system of equations
over a complete lattice. The following theorem makes sure that the least solution
of the system of equations (fi) indeed characterizes the firstk-sets.
Proof For i ≥ 0, let I^(i) be the variable binding after the ith iteration of the algorithm for finding solutions of (fi). By induction over i, we show that I^(i)(X) ⊆ firstk(X) holds for all i ≥ 0 and all X ∈ VN. Therefore, it also holds that I(X) = ⋃_{i≥0} I^(i)(X) ⊆ firstk(X) for all X ∈ VN. For the other direction, it suffices to show that for each derivation X ⇒∗lm w there exists an i ≥ 0 with w|k ∈ I^(i)(X). This claim is again shown by induction, this time over the length n ≥ 1 of the leftmost derivation. For n = 1, the grammar has a production X → w. We then have

    I^(1)(X) ⊇ firstk(w) = { w|k }

and the claim follows with i = 1. For n > 1, there exists a production X → u0 X1 u1 … Xm um with u0, …, um ∈ VT∗ and X1, …, Xm ∈ VN, and leftmost derivations Xj ⇒∗lm wj, j = 1, …, m, all of length less than n, with w = u0 w1 u1 … wm um. According to the induction hypothesis, for each j ∈ {1, …, m} there exists an ij such that wj|k ∈ I^(ij)(Xj) holds. Let i′ be the maximum of these ij. For i = i′ + 1 it holds that

    I^(i)(X) ⊇ {u0} ⊙k I^(i′)(X1) ⊙k {u1} ⊙k … ⊙k I^(i′)(Xm) ⊙k {um}
             ⊇ {u0} ⊙k {w1|k} ⊙k {u1} ⊙k … ⊙k {wm|k} ⊙k {um}
             ⊇ {w|k}
    followk(S′) = {#}
    followk(X) = ⋃ { firstk(β) ⊙k followk(Y) | Y → αXβ ∈ P },  X ∈ VN \ {S′}    (fo)
Example 3.2.15 Let us again consider the CFG G2 of Example 3.2.13. To calculate the follow1-sets for the grammar G2 we use the system of equations: □
The system of equations (fo) must again be solved over a subset lattice. The right sides of the equations are built from constant sets and unknowns by monotonic operators. Therefore, (fo) has a least solution, which can be computed by global iteration. The next theorem ascertains that this algorithm indeed computes the right sets.
The proof is similar to the proof of Theorem 3.2.2 and is left to the reader (Exercise 6). □
           1            2        3
    S      #
    E      #, )
    E′     #, )
    T      +, #, )
    T′     +, #, )
    F      ∗, +, #, )
□
The iterative method for the computation of least solutions of the systems of equations for the first1- and follow1-sets is not very efficient. But even with more efficient methods, the computation of firstk- and followk-sets requires considerable effort when k gets larger. Therefore, practical parsers mostly use a look-ahead of length k = 1. In this case, the computation of the first- and follow-sets can be performed particularly efficiently. The following lemma is the foundation of our further treatment.
According to our assumption, the considered grammars are always reduced. They therefore contain neither nonproductive nor unreachable nonterminals. Accordingly, for all X ∈ VN the sets first1(X) as well as the sets follow1(X) are nonempty. Together with Lemma 3.2.3, this observation allows us to simplify the equations for first1 and follow1 in such a way that 1-concatenation can (essentially) be replaced with union. In order to eliminate the case distinction of whether ε is contained in a first1-set or not, we proceed in two steps. In the first step, the set of nonterminals X satisfying ε ∈ first1(X) is determined. In the second step, the ε-free first1-set is determined for each nonterminal X instead of the first1-set. For a symbol X ∈ VN ∪ VT, the ε-free first1-set is defined by

    eff(X) = first1(X) \ {ε}
To implement the first step, it helps to exploit that for each nonterminal X

    ε ∈ first1(X) if and only if X ⇒∗ ε

In a derivation of the word ε, no production can be used that contains a terminal symbol a ∈ VT. Let Gε be the grammar that is obtained from G by eliminating all these productions. Then X ⇒∗_G ε holds if and only if X is productive with respect to the grammar Gε. For the latter problem, the efficient solver for productivity of Sect. 3.2.2 can be applied.
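The filtering step can be sketched in a few lines of C; the accessor functions are assumptions of ours, standing in for whatever grammar representation is used:

    extern int rhs_len(int p);            /* assumed accessors for production p */
    extern int rhs_sym(int p, int i);
    extern int is_terminal(int sym);

    /* A production survives in G_eps iff its right side contains no terminal;
     * X =>* ε in G iff X is productive with respect to the filtered grammar. */
    int keep_in_G_eps(int p) {
        for (int i = 0; i < rhs_len(p); i++)
            if (is_terminal(rhs_sym(p, i))) return 0;
        return 1;
    }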
Example 3.2.17 Consider the grammar G2 of Example 3.2.13. The set of productions in which no terminal symbol occurs is

    0: S  → E
    1: E  → TE′     4: T  → FT′
    2: E′ → ε       5: T′ → ε

With respect to this set of productions, only the nonterminals E′ and T′ are productive. These two nonterminals are thus the only ε-productive nonterminals of grammar G2. □
Let us now turn to the second step, the computation of the ε-free first1-sets. Consider a production of the form X → X1 … Xm. Its contribution to eff(X) can be written as

    ⋃ { eff(Xj) | X1 … Xj−1 ⇒∗_G ε }
Example 3.2.18 Consider again the CFG G2 of Example 3.2.13. The following system of equations serves to compute the ε-free first1-sets:

    eff(S)  = eff(E)           eff(T)  = eff(F)
    eff(E)  = eff(T)           eff(T′) = ∅ ∪ {∗}
    eff(E′) = ∅ ∪ {+}          eff(F)  = {Id} ∪ {(}

All occurrences of the ⊙1-operator have disappeared. Instead, only constant sets, unions, and variables eff(X) appear on the right sides. The least solution is

    eff(S) = eff(E) = eff(T) = eff(F) = {Id, (},   eff(E′) = {+},   eff(T′) = {∗}
Nonterminals that occur to the right of terminals do not contribute to the ε-free first1-sets. Therefore, it is important for the correctness of the construction that all nonterminals of the grammar are productive.
The "-free first1 -sets eff.X/ can also be used to simplify the system of equations
for the computation of the follow1 -sets. Consider a production of the form Y !
˛XX1 : : : Xm . The contribution of the occurrence of X in the right side of Y to the
set follow1 .X/ is
[
feff.Xj / j X1 : : : Xj 1 H) "g [ ffollow1 .Y / j X1 : : : Xm H) "g
G G
If all nonterminals are not only productive, but also reachable, the equation system
for the computation of the follow1 -sets can be simplified to
follow1 .S 0 / Df#g
[
follow1 .X/ D feff.Y / j A ! ˛XˇY 2 P; ˇ H) "g
G
[
[ ffollow1 .A/ j A ! ˛Xˇ; ˇ H) "g; X 2 VN nfS 0 g
G
Example 3.2.19 The simplified system of equations for the computation of the follow1-sets of the CFG G2 of Example 3.2.13 is given by:

    follow1(S)  = {#}
    follow1(E)  = {)} ∪ follow1(S) ∪ follow1(E′)
    follow1(E′) = follow1(E)
    follow1(T)  = {+} ∪ follow1(E) ∪ follow1(T′)
    follow1(T′) = follow1(T)
    follow1(F)  = {∗} ∪ follow1(T)

Again we observe that all occurrences of the operator ⊙1 have disappeared. Only constant sets and variables follow1(X) occur on the right sides of the equations, together with the union operator. □
The next section presents a method that efficiently solves systems of equations similar to the simplified systems of equations for the sets eff(X) and follow1(X). We first describe the general method and then apply it to the computation of the first1- and follow1-sets.
Consider a system of equations

    xi = ei,   i = 1, …, n

over an arbitrary complete lattice D, such that the right sides of the equations are expressions ei built from constants in D and variables xj by means of applications of ⊔, the least upper bound operator of the complete lattice D. The problem of computing the least solution of such a system of equations is called a pure union problem.
The computation of the set of reachable nonterminals of a CFG is a pure union problem over the Boolean lattice B = {false, true}. Also the problems of computing the ε-free first1-sets and the follow1-sets for a reduced CFG are pure union problems. In these cases, the complete lattices are 2^VT and 2^(VT ∪ {#}), respectively, both ordered by the subset relation.
[Fig. 3.6: The variable-dependence graph for the system of equations of Example 3.2.20]

Example 3.2.20 Consider the system of equations:

    x0 = {a}
    x1 = {b} ∪ x0 ∪ x3
    x2 = {c} ∪ x1
    x3 = {c} ∪ x2 ∪ x3
□
In order to come up with an efficient algorithm for pure union problems, we consider the variable-dependence graph of the system of equations. The nodes of this graph are the variables xi of the system. An edge (xi, xj) exists if and only if the variable xi occurs in the right side of the equation for the variable xj. Figure 3.6 shows the variable-dependence graph for the system of equations of Example 3.2.20.
Let I be the least solution of the system of equations. We observe that I(xi) ⊑ I(xj) must always hold if there exists a path from xi to xj in the variable-dependence graph. In consequence, the values of all variables within the same strongly connected component of the variable-dependence graph are equal.
We initialize each variable xi with the least upper bound of all constants that occur in the right side of the equation for xi. We call this value I0(xi). Then it holds for all j that

    I(xj) = ⊔ { I0(xi) | xj is reachable from xi in the variable-dependence graph }   □
This observation suggests the following method for computing the least solution I of the system of equations. First, the strongly connected components of the variable-dependence graph are computed. This needs a linear number of steps. Then an iteration over the list of strongly connected components is performed. One starts with a strongly connected component Q that has no edges entering from other strongly connected components. The values of all variables xj ∈ Q are:

    I(xj) = ⊔ { I0(xi) | xi ∈ Q }
    t ← ⊥;
    forall (xi ∈ Q)
        t ← t ⊔ I0(xi);
    forall (xi ∈ Q)
        I(xi) ← t;
The run time of both loops is proportional to the number of elements in the strongly connected component Q. The values of the variables in Q are then propagated along the outgoing edges. Let EQ be the set of edges (xi, xj) of the variable-dependence graph with xi ∈ Q and xj ∉ Q, that is, the edges leaving Q. For these edges we set:

    forall ((xi, xj) ∈ EQ)
        I0(xj) ← I0(xj) ⊔ I(xi);

The number of steps for the propagation is proportional to the number of edges in EQ. Then the strongly connected component Q, together with the set EQ of outgoing edges, is removed from the graph, and the algorithm continues with the next strongly connected component without ingoing edges. This is repeated until no strongly connected component is left. Altogether, the algorithm performs a linear number of operations ⊔ of the complete lattice D.
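To illustrate the propagation phase, here is a small C sketch that hardcodes the strongly connected components of Example 3.2.20 in topological order; computing the components themselves (e.g., with Tarjan's algorithm) is omitted. Sets are bit masks and ⊔ is bitwise or; all encodings are our own:

    #include <stdio.h>

    enum { A = 1, B = 2, C = 4 };             /* the constants {a},{b},{c} */

    int main(void) {
        unsigned I0[4] = { A, B, C, C };      /* constants per equation    */
        unsigned I[4];

        /* SCCs of the dependence graph, in topological order:
         * Q0 = {x0},  Q1 = {x1, x2, x3}  (x1 -> x2 -> x3 -> x1).   */
        int sccs[2][4]  = { {0, -1}, {1, 2, 3, -1} };
        /* edges leaving each SCC: (x0, x1) for Q0; none for Q1     */
        int out_from[2] = { 0, -1 }, out_to[2] = { 1, -1 };

        for (int c = 0; c < 2; c++) {
            unsigned t = 0;                   /* t = bottom               */
            for (int i = 0; sccs[c][i] >= 0; i++) t |= I0[sccs[c][i]];
            for (int i = 0; sccs[c][i] >= 0; i++) I[sccs[c][i]] = t;
            if (out_from[c] >= 0)             /* propagate along EQ       */
                I0[out_to[c]] |= I[out_from[c]];
        }
        for (int i = 0; i < 4; i++) printf("I(x%d) = 0x%x\n", i, I[i]);
        return 0;   /* expect I(x0)={a} and I(x1)=I(x2)=I(x3)={a,b,c} */
    }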
For Q0, the value I0(x0) = {a} is obtained. After removal of Q0 and the edge (x0, x1), the new assignment is:

    I0(x1) = {a, b}
    I0(x2) = {c}
    I0(x3) = {c}

The values of all variables in the strongly connected component Q1 are then determined as I0(x1) ∪ I0(x2) ∪ I0(x3) = {a, b, c}. □
3.3 Top-Down Syntax Analysis

3.3.1 Introduction
The way different parsers work can be understood best by observing how they con-
struct the parse tree to an input word. Top-down parsers start the construction of the
parse tree at the root. In the initial situation, the constructed fragment of the parse
tree consists of the root, which is labeled by the start symbol of the CFG; nothing
of the input word w is consumed. In this situation, one alternative for the start sym-
bol is selected for expansion. The symbols of the right side of this alternative are
attached under the root extending the upper fragment of the parse tree. The next
nonterminal to be considered is the one on the leftmost position. The selection of
one alternative for this nonterminal and the attachment of the right side below the
node labeled with the left side is repeated until the parse tree is complete. Among
the symbols of the right side of a production that are attached to the growing tree
fragment, there can also be terminal symbols. If there is no nonterminal to the left
of such a terminal symbol, the top-down parser compares it with the next symbol in
the input. If they agree, the parser consumes the corresponding symbol in the input.
Otherwise, the parser reports a syntax error. Thus, a top-down analysis performs
the following two types of actions:
Selection of an alternative for the actual leftmost nonterminal and attachment of
the right side of the production to the actual tree fragment.
Comparison of terminal symbols to the left of the leftmost nonterminal with the
remaining input.
Figures 3.7, 3.8, 3.9, and 3.10 show some parse-tree fragments for the arithmetic expression Id + Id ∗ Id according to grammar G2. The selection of alternatives for the nonterminals to be expanded was cleverly done in such a way as to lead to a successful termination of the analysis.
    S → E       E′ → +E | ε      T′ → ∗T | ε
    E → TE′     T  → FT′         F  → (E) | Id

[Fig. 3.7: The first parse-tree fragments of a top-down analysis of the word Id + Id ∗ Id according to grammar G2. They are constructed without reading any symbol from the input]
[Fig. 3.8: The parse-tree fragments after reading the symbol Id and before the terminal symbol + is attached to the fragment]
[Fig. 3.9: The first and the last parse tree after reading the symbol + and before the second symbol Id appears in the parse tree]
[Fig. 3.10: The parse tree after the reduction for the second occurrence of T′ and the parse tree after reading the symbol ∗, together with the remaining input]
The IPDA P_G for a CFG G works in principle like a top-down parser; its (E)-transitions predict which alternative to select for the current nonterminal in order to derive the input word. The trouble is that the IPDA P_G takes this decision nondeterministically. The nondeterminism stems from the (E)-transitions. If [X → β.Yγ] is the current state and Y has the alternatives Y → α1 | … | αn, there are n transitions:

Because of invariant (I) of Sect. 3.2.4, it holds that hist(ρ)β ⇒∗ u.
Let ρ = [X1 → β1.X2γ1] … [Xn → βn.Xn+1γn] be a sequence of items. We call the sequence

    fut(ρ) = γn … γ1

the future of ρ. Let δ = fut(ρ). So far, the leftmost derivation S′ ⇒∗lm uYδ has been found. If this derivation can be extended to a derivation of the terminal word uv, that is, S′ ⇒∗lm uYδ ⇒∗lm uv, then for an LL(k)-grammar the alternative to be selected for Y depends only on u, Y, and v|k.
Let k ≥ 1 be a natural number. The reduced CFG G is an LL(k)-grammar if for every two leftmost derivations

    S ⇒∗lm uYα ⇒lm uβα ⇒∗lm ux   and   S ⇒∗lm uYα ⇒lm uγα ⇒∗lm uy

it follows from x|k = y|k that β = γ. If, for instance, x|1 = y|1 = if, then β = γ = if (Id) ⟨stat⟩ else ⟨stat⟩. □
Example 3.3.2 We now add the following production to the grammar G1 of Example 3.3.1:

The three leftmost derivations

    ⟨stat⟩ ⇒∗lm w ⟨stat⟩ α ⇒lm w Id ′=′ Id; α ⇒∗lm wx      (with β = Id ′=′ Id;)
    ⟨stat⟩ ⇒∗lm w ⟨stat⟩ α ⇒lm w Id : ⟨stat⟩ α ⇒∗lm wy     (with γ = Id : ⟨stat⟩)
    ⟨stat⟩ ⇒∗lm w ⟨stat⟩ α ⇒lm w Id(Id); α ⇒∗lm wz        (with δ = Id(Id);)

show that the relevant prefixes of length 2 are pairwise different. And these are indeed the only critical cases. □
and therefore β ≠ γ. □
Example 3.3.4 Let G4 = ({S, A, B}, {0, 1, a, b}, P4, S), where the set P4 of productions is given by

    S → A | B
    A → aAb | 0
    B → aBbb | 1

Then

    L(G4) = { aⁿ0bⁿ | n ≥ 0 } ∪ { aⁿ1b²ⁿ | n ≥ 0 }

and G4 is not an LL(k)-grammar for any k ≥ 1. To see this, we consider the two leftmost derivations

    S ⇒lm A ⇒∗lm aᵏ0bᵏ
    S ⇒lm B ⇒∗lm aᵏ1b²ᵏ
where in the case jxj < k, we have y D z D ". But then ˇ ¤ implies that G
cannot be an LL.k/-grammar – a contradiction to our assumption.
To prove the other direction, “ ( ”, we assume, G is not an LL.k/-grammar.
Then there exist two leftmost derivations
S H) uA˛ H) uˇ˛ H) ux
lm lm lm
S H) uA˛ H) u˛ H) uy
lm lm lm
3.3 Top-Down Syntax Analysis 83
with xjk D yjk , where A ! ˇ, A ! are different productions. Then the word
xjk D yjk is contained in firstk .ˇ˛/ \ firstk .˛/ – a contradiction to the claim of
the theorem. ut
Theorem 3.3.1 states that in an LL(k)-grammar, two different productions applied to the same left-sentential form always lead to two different k-prefixes of the remaining input. Theorem 3.3.1 allows us to derive useful criteria for membership in certain subclasses of LL(k)-grammars. The first concerns the case k = 1.
The condition that first1(βα) ∩ first1(γα) be empty for all left-sentential forms uAα and any two different alternatives A → β and A → γ can be simplified to first1(β) ∩ first1(γ) = ∅ if neither β nor γ produces the empty word ε. This is the case if no nonterminal of G is ε-productive. In practice, however, it would be too restrictive to rule out ε-productions. Consider the case that the empty word is produced by at least one of the two right sides β or γ. If ε is produced both by β and by γ, G cannot be an LL(1)-grammar. Let us therefore assume that β ⇒∗ ε, but that ε cannot be derived from γ. Then the following holds for all left-sentential forms uAα, u′Aα′:
We check:

Case 1: The derivation starts with S ⇒ aAaa. Then first2(baa) ∩ first2(aa) = ∅.
Case 2: The derivation starts with S ⇒ bAba. Then first2(bba) ∩ first2(ba) = ∅.

Hence G is an LL(2)-grammar according to Theorem 3.3.1. However, the grammar G is not a strong LL(2)-grammar, because

In the example, follow2(A) is too undifferentiated, because it collects terminal following words that may occur in different sentential forms. □
Deterministic parsers that construct the parse tree for the input top-down cannot deal with left-recursive nonterminals. A nonterminal A of a CFG G is called left-recursive if there exists a derivation A ⇒⁺ Aβ.
    VN′ = VN ∪ { ⟨A, B⟩ | A, B ∈ VN },
    E → E + T | T
    T → T ∗ F | F
    F → (E) | Id
While the grammar G0 has only three nonterminals and six productions, the grammar G1 needs nine nonterminals and 15 productions.
The parse tree for Id + Id according to grammar G0 is shown in Fig. 3.11a, the one according to grammar G1 in Fig. 3.11b. The latter has quite a different structure. Intuitively, the transformed grammar produces for a nonterminal the first possible terminal symbol directly and afterwards collects, in a backward fashion, the remainders of the right sides that follow the nonterminal symbol on the left. The nonterminal ⟨A, B⟩ thus represents the task of performing this collection backward from B to A. □
We convince ourselves that the grammar G 0 constructed from grammar G has the
following properties:
• Grammar G′ has no left-recursive nonterminals.
• There exists a leftmost derivation A ⇒_G′ … ⇒∗_G′ aβ in which after the first step only nonterminals of the form ⟨X, Y⟩ are replaced.
[Fig. 3.11: Parse trees for Id + Id according to grammar G0 of Example 3.3.6 (a) and according to the grammar after removal of left recursion (b)]
The last property implies, in particular, that grammars G and G 0 are equivalent, i.e.,
that L.G/ D L.G 0 / holds.
In some cases, the grammar obtained by removing left recursion is an LL(k)-grammar. This is the case for grammar G0 of Example 3.3.6. We have already seen that the transformation to remove left recursion also has disadvantages. Let n be the number of nonterminals. The number of nonterminals as well as the number of productions can increase by a factor of n + 1. In large grammars, it therefore
may not be advisable to perform this transformation manually. A parser generator,
however, can do the transformation automatically and also can generate a program
that automatically converts parse trees of the transformed grammar back into parse
trees of the original grammar (see Exercise 7 of the next chapter). The user then
would not even notice the grammar transformation.
Example 3.3.6 illustrates how much the parse tree of a word according to the
transformed grammar can be different from the one according to the original gram-
mar. The operator sits somewhat isolated between its remotely located operands.
An alternative to the elimination of left recursion is offered by grammars with regular right sides, which we treat in Sect. 3.3.5.
Figure 3.12 shows the structure of a parser for strong LL(k)-grammars. The prefix w of the input has already been read, and the remaining input starts with a prefix u of length k. The pushdown contains a sequence of items of the CFG; the topmost item represents the actual state of the parser.
Fig. 3.12 Schematic structure of a parser for strong LL(k)-grammars: input tape, pushdown, parser table M, control, and output tape
The control consults a parser table
M : VN × (VT^k ∪ VT^{<k}#) → (VT ∪ VN)* ∪ {error}
which associates each nonterminal with the alternative that should be applied based on the given look-ahead, and signals an error if no alternative is available for the given combination of actual state and look-ahead. Let [X → β.Yγ] be the topmost item on the pushdown and u the prefix of length k of the remaining input. If M[Y, u] = (Y → α), then [Y → .α] becomes the new topmost pushdown symbol, and the production Y → α is written to the output tape.
The table entries in M for a nonterminal Y are determined in the following way. Let Y → α₁ | ... | α_r be the alternatives for Y. For a strong LL(k)-grammar, the sets first_k(α_i) ⊙_k follow_k(Y) are disjoint. For each u ∈ first_k(α₁) ⊙_k follow_k(Y) ∪ ... ∪ first_k(α_r) ⊙_k follow_k(Y), we therefore set
M[Y, u] ← (Y → α_i)    if u ∈ first_k(α_i) ⊙_k follow_k(Y)
Otherwise, M[Y, u] is set to error. The entry M[Y, u] = error means that the actual nonterminal and the prefix of the remaining input do not go together, that is, that a syntax error has been found. An error-diagnosis and error-handling routine is then started, which attempts to continue the analysis. Such approaches are described in Sect. 3.3.8.
For k = 1, the construction of the parser table is particularly simple. Because of Corollary 3.3.2.1, it does not require k-concatenation. Instead, it suffices to test u for membership in one of the sets first₁(α_i) and, possibly, in follow₁(Y).
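To make this construction concrete, the following Java sketch computes the entries M[Y, u] for k = 1. The interfaces Grammar and Production and the use of the empty string "" for ε are illustrative assumptions, not part of the book's code:

import java.util.*;

// A minimal sketch of LL(1)-parser-table construction.
interface Production { List<String> rightSide(); }
interface Grammar {
    Set<String> nonterminals();
    List<Production> alternatives(String y);
    Set<String> first1(List<String> alpha);   // may contain "" for epsilon
    Set<String> follow1(String y);            // contains "#" where appropriate
}

class LL1Table {
    // M is represented as: nonterminal -> look-ahead symbol -> production;
    // a missing entry corresponds to M[Y, u] = error
    static Map<String, Map<String, Production>> build(Grammar g) {
        Map<String, Map<String, Production>> m = new HashMap<>();
        for (String y : g.nonterminals()) {
            Map<String, Production> row = new HashMap<>();
            for (Production p : g.alternatives(y)) {
                Set<String> la = new HashSet<>(g.first1(p.rightSide()));
                // if the right side can produce epsilon, follow1(Y) decides
                if (la.remove("")) la.addAll(g.follow1(y));
                for (String u : la)
                    if (row.put(u, p) != null)   // two alternatives for (Y, u)
                        throw new IllegalArgumentException("G is not strong LL(1)");
            }
            m.put(y, row);
        }
        return m;
    }
}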
Example 3.3.7 Table 3.3 shows the LL(1)-parser table for the grammar of Example 3.2.13. Table 3.4 describes the run of the associated parser on input Id * Id#. □
Example 3.3.8
A right-regular CFG Ge for arithmetic expressions is given as follows. The terminal symbols +, −, *, and / as well as the two bracket symbols have been set into boxes in order to distinguish them from the metacharacters used to denote regular expressions. The mapping p is given by:
S → E
E → T ( ( + | − ) T )*
T → F ( ( * | / ) F )*
F → ( E ) | Id
□
A → ⟨r⟩                                  if A ∈ VN, p(A) = r
⟨X⟩ → X                                  if X ∈ VN ∪ VT
⟨ε⟩ → ε
⟨r*⟩ → ε | ⟨r⟩ ⟨r*⟩                        if r* ∈ R
⟨(r₁ | ... | r_n)⟩ → ⟨r₁⟩ | ... | ⟨r_n⟩      if (r₁ | ... | r_n) ∈ R
⟨(r₁ ... r_n)⟩ → ⟨r₁⟩ ... ⟨r_n⟩              if (r₁ ... r_n) ∈ R
The language L(G) of the right-regular CFG G is then defined as the language L(⟨G⟩) of the ordinary CFG ⟨G⟩.
Example 3.3.9 The grammar Ge of Example 3.3.8 is transformed into the CFG ⟨Ge⟩ with the productions:
S → E
E → ⟨T ( ( + | − ) T )*⟩
T → ⟨F ( ( * | / ) F )*⟩
F → ⟨( E ) | Id⟩
⟨T ( ( + | − ) T )*⟩ → T ⟨( ( + | − ) T )*⟩
⟨( ( + | − ) T )*⟩ → ε | ⟨( + | − ) T⟩ ⟨( ( + | − ) T )*⟩
⟨( + | − ) T⟩ → ⟨ + | − ⟩ T
⟨ + | − ⟩ → + | −
⟨F ( ( * | / ) F )*⟩ → F ⟨( ( * | / ) F )*⟩
⟨( ( * | / ) F )*⟩ → ε | ⟨( * | / ) F⟩ ⟨( ( * | / ) F )*⟩
⟨( * | / ) F⟩ → ⟨ * | / ⟩ F
⟨ * | / ⟩ → * | /
⟨( E ) | Id⟩ → ( E ) | Id
Example 3.3.10
The eff- as well as the first₁-sets for the nonterminals of the grammar Ge of Example 3.3.8 are: □
Generators provide only limited support for the realization of decent error
handling routines which, however, are mandatory for any reasonably practical
parser. Error handling, on the other hand, can be elegantly integrated into a
recursive-descent parser.
Assume we are given a right-regular CFG G = (VN, VT, p, S) which is RLL(1). For each nonterminal A, a procedure with name A is introduced. Calls to the function generate then translate the regular expressions in right sides into corresponding imperative program fragments. The function generate is a metaprogram that, for a given regular expression r, generates code which is able to analyze the input derived from ⟨r⟩: a subexpression r* becomes a while-loop; a concatenation becomes sequential composition; and a choice between alternatives results in a switch-statement. Terminals are turned into tests and nonterminals into recursive calls of the corresponding procedures. The first₁-sets of regular subexpressions are used to decide whether a loop should be exited and which alternative to choose. Such a parser is called a recursive-descent parser, since it replaces the explicit manipulation of a pushdown store by means of the run-time stack and recursion.
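The translation performed by generate can be made concrete by a small metaprogram. The following Java sketch uses an assumed algebraic representation of regular expressions and emits C-like code as strings; all type and method names are illustrative, not the book's generator:

import java.util.List;

// Hypothetical ADT for regular expressions over grammar symbols.
sealed interface Rx permits Sym, Seq, Alt, Star {}
record Sym(String x, boolean isTerminal) implements Rx {}  // terminal or nonterminal
record Seq(List<Rx> parts) implements Rx {}                // concatenation
record Alt(List<Rx> cases) implements Rx {}                // alternatives
record Star(Rx body) implements Rx {}                      // r*

class Generate {
    // emits the imperative fragment that analyzes input derived from <r>
    static String gen(Rx r) {
        return switch (r) {
            case Sym s -> s.isTerminal() ? "consume();\n"   // test and read a terminal
                                         : s.x() + "();\n"; // recursive call for a nonterminal
            case Seq q ->                                    // sequential composition
                String.join("", q.parts().stream().map(Generate::gen).toList());
            case Alt a -> {                                  // switch over first1-sets
                StringBuilder b = new StringBuilder("switch (next) {\n");
                for (Rx c : a.cases())
                    b.append("case ").append(first1(c)).append(": ")
                     .append(gen(c)).append("break;\n");
                yield b.append("}\n").toString();
            }
            case Star s ->                                   // r* becomes a while-loop
                "while (next in " + first1(s.body()) + ") {\n" + gen(s.body()) + "}\n";
        };
    }
    // the first1-sets of subexpressions are assumed to be precomputed
    static String first1(Rx r) { return "first1(r)"; }
}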
The parser relies on a procedure expect in order to test whether the next input
symbol is contained in the first1 -set of the next regular expression. In the following,
we assume that the next input symbol is stored in the global variable next.
void parse() {
    next ← scan();           // (1)
    expect(first₁(S));        // (2)
    S();
    expect({#});
}
For a nonterminal symbol A ∈ VN, the procedure A() is defined by:
void A() {
    generate r
}
given that p(A) = r is the right side for A in p. For the regular subexpressions of right sides, generate is defined by structural recursion, as illustrated by the sketch above and the following example.
Example 3.3.11 For the extended expression grammar Ge of Example 3.3.8, the
following parser is obtained (the procedures expect and parse have been omitted).
void T() {
    F();
    expect({ε, *, /});
    while (next ∈ {*, /}) {
        switch (next) {
            case * : consume(); break;
            case / : consume(); break;
        }
        expect({Id, (});
        F();
    }
}
void F() {
    switch (next) {
        case Id : consume(); break;
        case (  : consume();
                  expect({Id, (});
                  E();
                  expect({)});
                  consume();
                  break;
    }
}
□
Example 3.3.12 Consider again the grammar Ge from Example 3.3.8, and for that grammar the parsing function for the nonterminal E. Let us assume that every node of the syntax tree is going to be represented by an object of the class Tree. For the nonterminal E, we define:
Tree E() {
    Tree l ← T();
    expect({ε, +, −});
    while (next ∈ {+, −}) {
        switch (next) {
            case + : consume();
                     op ← PLUS;
                     break;
            case − : consume();
                     op ← MINUS;
                     break;
        }
        expect({Id, (});
        Tree r ← T();
        l ← new Tree(op, l, r);
    }
    return l;
}
The function E builds up tree representations for sequences of additions and sub-
tractions. The labels of the case distinction, which did not play any particular role
in the last example, are now used for selecting the right operator for the tree node.
In principle, any shape of the abstract syntax tree can be chosen when processing the sequence corresponding to a regular right side. Alternatively, all tree nodes of the sequence could be collected into a plain list (cf. Exercise 15), or a sequence could be returned that is nested to the right, in case the processed operators associate to the right. □
RLL(1)-parsers have the property of the extendible prefix: every prefix of the input accepted by an RLL(1)-parser can be extended in at least one way to a sentence of the language. Although parsers generally detect only symptoms of errors and not their causes, this property suggests not to perform corrections within the part of the input that has already been parsed. Instead, the parser may modify or ignore some input symbols until a configuration is reached from which the remaining input can be parsed. The method that we propose now tries, by skipping a prefix of the remaining input, to reach a pushdown contents with which the analysis can be continued.
An obvious idea to this end is to search for a closing bracket or a separator for the
current nonterminal and to ignore all input symbols inbetween. If no such symbol is
found but instead a meaningful end symbol for another nonterminal, then the entries
in the pushdown are popped until the corresponding nonterminal occurs on top of
the pushdown. This means in C or similar languages, for instance:
during the analysis of an assignment to search for a semicolon;
during the analysis of a declaration for commas or semicolons;
during the analysis of conditionals for the keyword else;
and when analyzing a block starting with an opening brace { for a closing brace }.
Such a panic mode, however, has several disadvantages.
Even if the expected symbol occurs in the program, a longer sequence of words
may be skipped until the symbol is found. If the symbol does not occur or does not
belong to the current incarnation of the current nonterminal, the parser generally is
doomed.
Our error handling therefore refines the basic idea. During the syntactic analysis,
the recursive-descent parser maintains a set T of anchor terminals. This anchor set consists of terminals at which the parsing process may be resumed; they occur in the right context of one of the subexpressions or productions that are currently processed. The anchor set is not static, but is dynamically adapted to the
current parser configuration. Consider, for instance, the parser for an if -statement
in C:
void ⟨if_stat⟩() {
    consume();
    expect({(});              consume();
    expect(first₁(E));        E();
    expect({)});              consume();
    expect(first₁(⟨stat⟩));   ⟨stat⟩();
    expect({else, ε});        switch (next) { … }
}
The new version of expect receives the actual anchor set in the extra parameter anc.
First, expect skips input symbols until the symbol is either expected or an anchor
terminal. The skipped input can neither be consumed (since it is not expected),
nor used for recovery. If input has been skipped or the current input symbol is an
anchor terminal which is not expected (the intersection of E and anc need not be
empty!), then an error message is issued and the parser switches to the error mode.
The error mode is necessary in order to suppress the consumption of unexpected
input symbols.
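A possible implementation of this refined expect can be sketched as follows in Java; the global scanner interface and the handling of ε in expected sets are simplifying assumptions:

import java.util.Set;

// Sketch of the refined expect procedure; next, inErrorMode, and scan()
// correspond to the globals of the book's pseudocode.
class Expect {
    static String next;
    static boolean inErrorMode = false;

    static void expect(Set<String> expected, Set<String> anc) {
        boolean skipped = false;
        // skip symbols that are neither expected nor anchor terminals;
        // such symbols can neither be consumed nor used for recovery
        while (!expected.contains(next) && !anc.contains(next)) {
            next = scan();
            skipped = true;
        }
        // error if input was skipped, or if we stopped at an anchor
        // terminal that is not itself expected
        if (skipped || !expected.contains(next)) {
            if (!inErrorMode) System.err.println("syntax error at symbol " + next);
            inErrorMode = true;
        }
    }

    static String scan() { return "#"; }  // stands in for the real scanner
}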
Example 3.3.13
Consider the example of a syntactically incorrect C program, where the programmer has omitted the right operand of the comparison together with the closing bracket of the if:
if ( Id < while ...
During the syntactic analysis of the if condition, the first symbols of the nonterminal S which analyzes statements are assumed to be contained in the anchor set. This means that the symbol while is well suited for recovery and allows the parser to continue with the analysis of the if statement. The call to expect by which the analysis of the right side of the expression is started expects a terminal symbol from the set first₁(F) (see Example 3.3.11), but finds while. Since while is not expected, but is contained in the anchor set, the parser switches into the error mode without skipping any input. After the return from the analysis of E, the procedure ⟨if_stat⟩ expects a closing bracket ) which, however, is missing, since the next input symbol is still while. Having switched into the error mode allows the parser to suppress the follow-up error message by this second call to expect. The error mode stays active until expect identifies an expected symbol without having to skip input. □
The parser may leave the error mode when it can consume an expected symbol. This is implemented by means of the procedure consume:
void consume(terminal a) {
    if (next = a) {
        in_error_mode ← false;
        next ← scan();
    }
}
The input is only advanced if the next input symbol is the expected terminal, which is then consumed; otherwise, consume does nothing. In case of an error where the next input symbol is not expected but is contained in the anchor set, this behavior enables the parser to continue until it reaches a position where the symbol is indeed expected and can be consumed.
In the following, the generator schemes of the parsers are extended by a computation of the anchor sets. Of particular interest is the code generated for a sequence (r₁ ... r_k) of regular expressions. The parser for r_i receives the union of the first symbols of r_{i+1}, ..., r_k. This enables the parser to recover at every symbol to the right of r_i. At alternatives, the anchor set is not modified: each anchor symbol of the whole alternative is also an anchor symbol of each individual case. At a regular expression r*, recovery may proceed to the right of r* as well as before r* itself.
Example 3.3.14 The parser for an if statement in C including error handling now looks as follows:

void ⟨if_stat⟩(set⟨terminal⟩ anc) {
    consume(if);
    expect({(}, anc ∪ first₁(E) ∪ {)} ∪ first₁(⟨stat⟩) ∪ {else});
    consume(();
    expect(first₁(E), anc ∪ {)} ∪ first₁(⟨stat⟩) ∪ {else});
    E(anc ∪ {)} ∪ first₁(⟨stat⟩) ∪ {else});
    expect({)}, anc ∪ first₁(⟨stat⟩) ∪ {else});
    consume());
    expect(first₁(⟨stat⟩), anc ∪ {else});
    ⟨stat⟩(anc ∪ {else});
    expect({else, ε}, anc);
    switch (next) {
        case else : consume(else);
                    expect(first₁(⟨stat⟩), anc);
                    ⟨stat⟩(anc);
                    break;
        default : ;
    }
} □
3.4 Bottom-up Syntax Analysis
3.4.1 Introduction
Bottom-up parsers read their input, like top-down parsers, from left to right. They are pushdown automata that can essentially perform two kinds of operations:
Read the next input symbol (shift), and
Reduce the right side of a production X → α at the top of the pushdown to the left side X of the production (reduce).
Because of these operations they are called shift-reduce parsers. Shift-reduce
parsers are right parsers; they output the application of a production when they do
a reduction. Since shift-reduce parsers always reduce at the top of the pushdown,
the result of the successful analysis of an input word is a rightmost derivation in
reverse order.
A shift-reduce parser must never miss a required reduction, that is, cover it in the
pushdown by newly read input symbols. A reduction is required, if no rightmost
derivation from the start symbol is possible without it. A right side covered by an
input symbol will never reappear at the top of the pushdown and can therefore never
be reduced. A right side at the top of the pushdown that must be reduced to obtain
a derivation is called a handle.
Not all occurrences of right sides that appear at the top of the pushdown are
handles. Some reductions when performed at the top of the pushdown lead into dead
ends, that is, they cannot be continued to a reverse rightmost derivation although the
input is correct.
Example 3.4.1 Let G0 be again the grammar for arithmetic expressions with the
productions:
S → E
E → E + T | T
T → T * F | F
F → (E) | Id
Table 3.5 shows a successful bottom-up analysis of the word Id * Id of G0. The third column lists actions that were also possible, but would lead into dead ends. In the third step, the parser would miss a required reduction. In the other two steps, the alternative reductions would lead into dead ends, that is, not to right sentential forms. □
Bottom-up parsers construct the parse tree bottom up. They start with the leaf word
of the parse tree, the input word, and construct for ever larger parts of the read input
subtrees of the parse tree: upon reduction by a production X ! ˛, the subtrees
for the right side ˛ are attached below a newly created node for X. The analysis is
successful if a parse tree has been constructed for the whole input word whose root
is labeled with the start symbol of the grammar.
Figure 3.13 shows some snapshots during the construction of the parse tree ac-
cording to the derivation shown in Table 3.5. The tree on the left contains all nodes
Table 3.5 A successful analysis of the word Id * Id together with potential dead ends

Pushdown    Input      Erroneous alternative actions
            Id * Id
Id          * Id
F           * Id       Reading of * misses a required reduction
T           * Id       Reduction of T to E leads into a dead end
T *         Id
T * Id
T * F                  Reduction of F to T leads into a dead end
T
E
S
Fig. 3.13 Construction of the parse tree after reading the first symbol, Id, together with the remaining input, before the reduction of the handle T * F, and the complete parse tree
that can be created when the input Id has been read. The sequence of three trees in the middle represents the state before the handle T * F is reduced, while the tree on the right shows the complete parse tree.
3.4.2 LR(k)-Parsers
This section presents the most powerful deterministic method that works bottom-up, namely LR(k)-analysis. The letter L says that the parsers of this class read their input from left to right; the R characterizes them as Right parsers, while k is the length of the considered look-ahead.
We start again with the IPDA P_G for a CFG G and transform it into a shift-reduce parser. Let us recall what we did in the case of top-down analysis. Sets of look-ahead words are computed from the grammar, which are used to select the right alternative for a nonterminal at expansion transitions of P_G. So, the LL(k)-parser decides about the alternative for a nonterminal at the earliest possible time, when the nonterminal has to be expanded. LR(k)-parsers follow a different strategy: they pursue all possibilities to expand and to read in parallel.
A decision has to be taken when one of the possibilities to continue asks for a reduction. What is there to decide? There can be several productions by which to reduce, and a shift can be possible in addition to a reduction. The parser uses the next k symbols to take its decision.
In this section, first an LR(0)-parser is developed, which does not yet take any look-ahead into account. Section 3.4.3 presents the canonical LR(k)-parser as well as less powerful variants of LR(k), which are often powerful enough for practice. Finally, Sect. 3.4.4 describes an error recovery method for LR(k)-parsers. Note that all CFGs are assumed to be reduced of nonproductive and unreachable nonterminals and extended by a new start symbol.
Example 3.4.2 Let G0 again be the grammar for arithmetic expressions with the
productions
S → E
E → E + T | T
T → T * F | F
F → (E) | Id
Figure 3.14 shows the characteristic finite automaton for grammar G0. □
Fig. 3.14 The characteristic finite automaton char(G0) for the grammar G0
The following theorem clarifies the exact relation between the characteristic finite automaton and the IPDA:
Theorem 3.4.1 Let G be a CFG and γ ∈ (VT ∪ VN)*. The following three statements are equivalent:
1. There exists a computation ([S′ → .S], γ) ⊢*char(G) ([A → α.β], ε) of the characteristic finite automaton char(G).
2. There exists a computation (ρ [A → α.β], w) ⊢*PG ([S′ → S.], ε) of the IPDA PG such that γ = hist(ρ) α holds.
3. There exists a rightmost derivation S′ ⇒*rm γ′Aw ⇒rm γ′αβw with γ = γ′α. □
The equivalence of statements (1) and (2) means that the words that lead to an item of the characteristic finite automaton char(G) are exactly the histories of pushdown contents of the IPDA PG whose topmost symbol is this item and from which PG can reach one of its final states assuming appropriate input w. The equivalence of statements (2) and (3) means that an accepting computation of the IPDA for an input word w that starts with a pushdown contents ρ corresponds to a rightmost derivation that leads to a sentential form γw where γ is the history of the pushdown contents ρ.
Before proving Theorem 3.4.1, we introduce some terminology. Consider a rightmost derivation
S′ ⇒*rm γAv ⇒rm γαv
Each prefix of γα is called a reliable prefix, and the item [A → α₁.α₂] with α = α₁α₂ is valid for γα₁. □
Example 3.4.4 We give two reliable prefixes of G0 and some items that are valid for them: □
If, during the construction of a rightmost derivation for a word, the prefix u of the word is reduced to a reliable prefix γ, then each item [X → α.β] that is valid for γ describes one possible interpretation of the analysis situation. Thus, there is a rightmost derivation in which γ is a prefix of a right sentential form and X → αβ is one of the possibly just processed productions. All such productions are candidates for later reductions.
Consider the rightmost derivation
S′ ⇒*rm γAw ⇒rm γαβw
We now consider this rightmost derivation in the direction of reduction, that is, in the direction in which it is constructed by a bottom-up parser. First, x is reduced to γ in a number of steps, then u to α, then v to β. The item [A → α.β], valid for the reliable prefix γα, describes the analysis situation in which the reduction of u to α has already been done, while the reduction of v to β has not yet started. A possible long-range goal in this situation is the application of the production A → αβ.
We come back to the question of which language is accepted by the characteristic finite automaton of PG. Theorem 3.4.1 says that, by reading a reliable prefix, char(G) will enter a state that is a valid item for this prefix. Final states, i.e., complete items, are only valid for reliable prefixes at whose ends a reduction is possible.
Proof of Theorem 3.4.1 We give a circular proof (1) ⇒ (3) ⇒ (2) ⇒ (1). Let us first assume that ([S′ → .S], γ) ⊢*char(G) ([A → α.β], ε). By induction over the number n of ε-transitions we construct a rightmost derivation S′ ⇒*rm γ′Aw ⇒rm γ′αβw. If n = 0, then γ = ε and [A → α.β] = [S′ → .S] also hold. Since S′ ⇒*rm S′ holds, the claim holds in this case. If n > 0, we consider the last ε-transition. The computation of the characteristic finite automaton can be decomposed into
([S′ → .S], γ) ⊢*char(G) ([X → α′.Aβ′], α) ⊢char(G) ([A → .αβ], α) ⊢*char(G) ([A → α.β], ε)
By the induction hypothesis, we obtain a rightmost derivation
S′ ⇒*rm γ′Avw′ ⇒*rm γ′αβw
Let us now assume that we have a rightmost derivation S′ ⇒*rm γ′Aw ⇒rm γ′αβw. This derivation can be decomposed into
S′ ⇒*rm α₁X₁β₁ ⇒*rm α₁X₁v₁ ⇒*rm ... ⇒*rm (α₁ ... α_n)X_n(v_n ... v₁) ⇒rm (α₁ ... α_n)αβ(v_n ... v₁)
for X_n = A. We can prove by induction over n that (ρ, vw) ⊢*PG ([S′ → S.], ε) holds for
ρ = [S′ → α₁.X₁β₁] ... [X_{n−1} → α_n.X_nβ_n]
w = v v_n ... v₁
as long as β ⇒*rm v, α₁ = β₁ = ε and X₁ = S. This proves the direction (3) ⇒ (2).
For the last implication, we consider a pushdown store ρ = ρ′ [A → α.β] with (ρ, w) ⊢*PG ([S′ → S.], ε). We first convince ourselves, by induction over the number of transitions in such a computation, that ρ′ must be of the form
ρ′ = [S′ → α₁.X₁β₁] ... [X_{n−1} → α_n.X_nβ_n]
Example 3.4.5 The canonical LR(0)-automaton for the CFG G0 of Example 3.2.2 is obtained by applying the subset construction to the characteristic finite automaton char(G0) of Fig. 3.14. Its transition diagram is shown in Fig. 3.15; its states are the following sets of items:
Fig. 3.15 The transition diagram of the LR(0)-automaton for the grammar G0, obtained from the characteristic finite automaton char(G0) in Fig. 3.14. The error state S12 = ∅ and all transitions into it are omitted
S0  = { [S → .E], [E → .E + T], [E → .T], [T → .T * F], [T → .F], [F → .(E)], [F → .Id] }
S1  = { [S → E.], [E → E. + T] }
S2  = { [E → T.], [T → T. * F] }
S3  = { [T → F.] }
S4  = { [F → (.E)], [E → .E + T], [E → .T], [T → .T * F], [T → .F], [F → .(E)], [F → .Id] }
S5  = { [F → Id.] }
S6  = { [E → E + .T], [T → .T * F], [T → .F], [F → .(E)], [F → .Id] }
S7  = { [T → T * .F], [F → .(E)], [F → .Id] }
S8  = { [F → (E.)], [E → E. + T] }
S9  = { [E → E + T.], [T → T. * F] }
S10 = { [T → T * F.] }
S11 = { [F → (E).] }
S12 = ∅
□
A reduction of a right sentential form can only happen at the right end of this sentential form. An item valid for a reliable prefix describes one possible interpretation of the actual analysis situation.
F, (F, ((F, (((F, ...
T * F, T * (F, T * ((F, ...
E + F, E + (F, E + ((F, ...
The state S6 of the canonical LR(0)-automaton for G0 contains all valid items for the reliable prefix E+, namely the items
[E → E + .T], [T → .T * F], [T → .F], [F → .Id], [F → .(E)]
This follows from the rightmost derivations
S ⇒rm E ⇒rm E + T ⇒rm E + F ⇒rm E + Id
in which, for instance, the items [E → E + .T], [T → .F], and [F → .Id] are valid for E+. □
The canonical LR(0)-automaton LR0(G) for a CFG G is a DFA that accepts the set of reliable prefixes of complete items. In this way, it identifies positions for reduction, and therefore lends itself to the construction of a right parser. Instead of items (as the IPDA does), this parser stores on its pushdown states of the canonical LR(0)-automaton, that is, sets of items. The underlying PDA P0 is defined as the tuple P0 = (QG ∪ {f}, VT, Δ0, q_{G,0}, {f}). The set of states is the set QG of states of the canonical LR(0)-automaton LR0(G), extended by a new state f, the final state. The initial state of P0 is identical to the initial state q_{G,0} of LR0(G). The transition relation Δ0 consists of the following kinds of transitions:
Read: (q, a, q δG(q, a)) ∈ Δ0 if δG(q, a) ≠ ∅. This transition reads the next input symbol a and pushes the successor state of q under a onto the pushdown. It can only be taken if at least one item of the form [X → α.aβ] is contained in q.
Reduce: (q q₁ ... q_n, ε, q δG(q, X)) ∈ Δ0 if [X → α.] ∈ q_n holds with |α| = n. The complete item [X → α.] in the topmost pushdown entry signals a potential reduction. As many entries are removed from the top of the pushdown as the length of the right side indicates. After that, the X-successor of the new topmost pushdown entry is pushed onto the pushdown. Figure 3.16 shows the part of the transition diagram of an LR(0)-automaton LR0(G) that demonstrates this situation. The α-path in the transition diagram corresponds to |α| entries on top of the pushdown. These entries are removed at reduction. The new actual state, previously below these removed entries, has a transition under X, which is now taken.
Fig. 3.16 Part of the transition diagram of an LR(0)-automaton: an α-path from an item [X → .α] to the complete item [X → α.], and the X-transition taken after the reduction
where V is the set of symbols, V = VT ∪ VN. The set QG of states and the transition relation δG are computed by first constructing the initial state q_{G,0} = δ_{G,ε}({[S′ → .S]}) and then adding successor states and transitions until all successor states are already contained in the set of constructed states. For an implementation, we specialize the function nextState() of the subset construction; as in the subset construction, the set states of states and the set trans of transitions can then be computed iteratively, as sketched below.
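The following Java sketch shows one way to specialize nextState() and to accumulate states and trans iteratively; the Item representation and the Grammar interface are illustrative assumptions:

import java.util.*;

// Sketch of the iterative construction of the canonical LR(0)-automaton.
class LR0 {
    record Item(String lhs, List<String> rhs, int dot) {}

    interface Grammar {
        boolean isNonterminal(String x);
        List<List<String>> alternatives(String b);
    }

    // closure of a set of items under epsilon-transitions
    static Set<Item> closure(Set<Item> q, Grammar g) {
        Deque<Item> todo = new ArrayDeque<>(q);
        Set<Item> result = new HashSet<>(q);
        while (!todo.isEmpty()) {
            Item i = todo.pop();
            if (i.dot() < i.rhs().size() && g.isNonterminal(i.rhs().get(i.dot())))
                for (List<String> gamma : g.alternatives(i.rhs().get(i.dot()))) {
                    Item j = new Item(i.rhs().get(i.dot()), gamma, 0);
                    if (result.add(j)) todo.push(j);
                }
        }
        return result;
    }

    // successor state of q under X: advance the dot over X, then close
    static Set<Item> nextState(Set<Item> q, String x, Grammar g) {
        Set<Item> core = new HashSet<>();
        for (Item i : q)
            if (i.dot() < i.rhs().size() && i.rhs().get(i.dot()).equals(x))
                core.add(new Item(i.lhs(), i.rhs(), i.dot() + 1));
        return closure(core, g);
    }

    // states and trans are extended until no new states appear
    static void build(Set<Item> q0, Set<String> symbols, Grammar g,
                      Set<Set<Item>> states,
                      Map<Set<Item>, Map<String, Set<Item>>> trans) {
        Deque<Set<Item>> todo = new ArrayDeque<>();
        states.add(q0); todo.push(q0);
        while (!todo.isEmpty()) {
            Set<Item> q = todo.pop();
            for (String x : symbols) {
                Set<Item> p = nextState(q, x, g);
                if (p.isEmpty()) continue;              // the error state is omitted
                trans.computeIfAbsent(q, k -> new HashMap<>()).put(x, p);
                if (states.add(p)) todo.push(p);        // new state discovered
            }
        }
    }
}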
for n ≥ 0. The handles are always underlined. Two different possibilities to reduce exist only in the case of the right sentential forms aⁿaAbbⁿ and aⁿaBbbb²ⁿ. The sentential form aⁿaAbbⁿ can be reduced to aⁿAbⁿ as well as to aⁿaSbbⁿ. The first choice belongs to the rightmost derivation
S ⇒*rm aⁿAbⁿ ⇒rm aⁿaAbbⁿ
while the second one does not occur in any rightmost derivation. The prefix aⁿ of aⁿAbⁿ uniquely determines whether A is the handle, namely in the case n = 0, or whether aAb is the handle, namely in the case n > 0. The right sentential forms aⁿBb²ⁿ are handled analogously. □
S ! aAc A ! Abb j b
S ! aAc A ! bbA j b
S ! aAc A ! bAb j b
The following theorem clarifies the relation between the definition of an LR(0)-grammar and the properties of the canonical LR(0)-automaton.
Case 1: The state p has a reduce-reduce-conflict, i.e., p contains two distinct items [X → β.] and [Y → δ.]. Associated with state p is a nonempty set of words that are reliable prefixes for each of the two items. Let γ = γ′β be one such reliable prefix. Since both items are valid for γ, there are rightmost derivations
S′ ⇒*rm γ′Xw ⇒rm γ′βw    and
S′ ⇒*rm νYy ⇒rm νδy    with νδ = γ′β = γ
Case 1: β ≠ ε. By Lemma 3.4.4, p = {[X → β.]}, i.e., [X → β.] is the only valid item for αβ. But then α = ν, X = Y, and x = y must hold.
Case 2: β = ε. Assume that the second rightmost derivation violates the LR(0)-condition. Then a further item [Y → δ.Y′δ′] must be contained in p such that α = α′δ. The last application of a production in the second rightmost derivation is the last application of a production in a terminal rightmost derivation for Y′. By Lemma 3.4.4, the second derivation therefore has the form
S′ ⇒*rm α′δY′w ⇒*rm α′δXvw ⇒rm α′δvw
Let us summarize. We have seen how to construct the LR(0)-automaton LR0(G) from a given CFG G. This can be done either directly or through the characteristic finite automaton char(G). From the DFA LR0(G), a PDA P0 can be constructed. This PDA P0 is deterministic if LR0(G) does not contain LR(0)-inadequate states. Theorem 3.4.2 states that this is the case if and only if the grammar G is an LR(0)-grammar. Thereby, we have obtained a method to generate parsers for LR(0)-grammars.
In real life, though, LR(0)-grammars are rare. Often a look-ahead of length k > 0 is required in order to select from the different choices in a parsing situation. In an LR(0)-parser, the actual state determines the next action, independently of the next input symbols. LR(k)-parsers for k > 0 also have states consisting of sets of items. A different kind of item is used, though, namely LR(k)-items. LR(k)-items are context-free items extended by look-ahead words. An LR(k)-item is of the form i = [A → α.β, x] for a production A → αβ of G and a word x ∈ (VT^k ∪ VT^{<k}#). The context-free item [A → α.β] is called the core, the word x the look-ahead of the LR(k)-item i. The set of LR(k)-items of grammar G is written as I_{G,k}. The LR(k)-item [A → α.β, x] is valid for a reliable prefix γα if there exists a rightmost derivation
S′# ⇒*rm γAw# ⇒rm γαβw#    with x ∈ first_k(w#)
Observation (2) follows since the subword E cannot occur in a right sentential form. □
Theorem 3.4.3 Let G be a CFG. For a reliable prefix γ, let It(γ) be the set of LR(k)-items of G that are valid for γ.
The grammar G is an LR(k)-grammar if and only if for all reliable prefixes γ and all LR(k)-items [A → α., x] ∈ It(γ) the following holds:
1. If there is another LR(k)-item [X → δ., y] ∈ It(γ), then x ≠ y.
2. If there is another LR(k)-item [X → δ.aβ, y] ∈ It(γ), then x ∉ first_k(aβ) ⊙_k {y}. □
A second table, the goto-table, contains the representation of the transition function of the canonical LR(k)-automaton LR_k(G). It is consulted after a shift-action or a reduce-action to determine the new state on top of the pushdown. Upon a shift, it computes the transition out of the actual state under the read symbol. Upon a reduction by X → α, it gives the transition under X out of the state underneath those pushdown symbols that belong to α.
The LR(k)-parser for a grammar G requires a driver that interprets the action- and goto-tables. Again, we consider the case k = 1. This is, in principle, sufficient because for each language that has an LR(k)-grammar, and therefore also an LR(k)-parser, one can construct an LR(1)-grammar and consequently also an LR(1)-parser. Let us assume that the set of states of the LR(1)-parser is Q. One such driver program is sketched below:
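The following is a minimal sketch of such a driver in Java; the table representation, the Production record, and all names are illustrative assumptions, and a missing action-entry plays the role of the error-label err:

import java.util.*;

// Sketch of a table-driven LR(1) driver. action.get(q).get(a) is either
// the string "shift", the string "accept", or a Production to reduce by;
// absent entries mean error.
class LR1Driver {
    record Production(String lhs, List<String> rhs) {}

    static List<Production> run(Map<Integer, Map<String, Object>> action,
                                Map<Integer, Map<String, Integer>> gotoTab,
                                int q0, Iterator<String> input) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(q0);
        List<Production> output = new ArrayList<>();   // the right parse, in reverse
        String a = input.next();                       // look-ahead of length 1
        while (true) {
            Object act = action.getOrDefault(stack.peek(), Map.of()).get(a);
            if (act == null)
                throw new RuntimeException("syntax error");  // corresponds to label err
            if (act.equals("accept"))
                return output;
            if (act.equals("shift")) {                 // read a, push its successor state
                stack.push(gotoTab.get(stack.peek()).get(a));
                a = input.next();
            } else {                                   // reduce by production p
                Production p = (Production) act;
                for (int i = 0; i < p.rhs().size(); i++)
                    stack.pop();                       // plays the role of tl(|rhs|, s)
                stack.push(gotoTab.get(stack.peek()).get(p.lhs()));
                output.add(p);
            }
        }
    }
}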
The function list⟨state⟩ tl(int n, list⟨state⟩ s) returns its second argument, the list s, with the topmost n elements removed. As with the driver program for LL(1)-parsers, in the case of an error the driver jumps to a label err at which the code for error handling is to be found.
We present three approaches to construct an LR(1)-parser for a CFG G. The most general method is the canonical LR(1)-method. For each LR(1)-grammar G there exists a canonical LR(1)-parser. The number of states of this parser can be large. Therefore, other methods were proposed that have state sets of the size of the LR(0)-automaton. Of these we consider the SLR(1)- and the LALR(1)-methods. The given driver program for LR(1)-parsers works for all three parsing methods; the driver interprets the action- and goto-tables, but their contents are computed in different ways. In consequence, the actions for some combinations of state and look-ahead may differ.
Construction of an LR(1)-Parser
The LR(1)-parser is based on the canonical LR(1)-automaton LR1(G). Its states therefore are sets of LR(1)-items. We construct the canonical LR(1)-automaton much in the same way as we constructed the canonical LR(0)-automaton. The only difference is that LR(1)-items are used instead of LR(0)-items. This means that the look-ahead symbols need to be computed when the closure of a set q of LR(1)-items under ε-transitions is formed. This closure is the least solution of the equation
closure(q) = q ∪ { [A → .γ, x] | [X → α′.Aβ, y] ∈ closure(q), A → γ ∈ P, β ∈ V*, x ∈ first₁(β) ⊙₁ {y} }
where V is the set of all symbols, V = VT ∪ VN. The initial state q0 of LR1(G) is given by
q0 = closure({[S′ → .S, #]})
The set of states and the transition relation of the canonical LR(1)-automaton are computed in analogy to the canonical LR(0)-automaton. The generator starts with the initial state and an empty set of transitions and adds successor states until all successor states are already contained in the set of computed states. The transition function of the canonical LR(1)-automaton gives the goto-table of the LR(1)-parser.
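In code, the closure with look-ahead computation might look as follows; the Item record and the Grammar helper functions, in particular first1Concat for first₁(β) ⊙₁ {y}, are illustrative assumptions:

import java.util.*;

// Sketch of the closure of a set of LR(1)-items under epsilon-transitions.
class LR1Closure {
    record Item(String lhs, List<String> rhs, int dot, String la) {}

    interface Grammar {
        boolean isNonterminal(String x);
        List<List<String>> alternatives(String b);
        Set<String> first1Concat(List<String> beta, String la);  // first1(beta) ⊙1 {la}
    }

    static Set<Item> closure(Set<Item> q, Grammar g) {
        Deque<Item> todo = new ArrayDeque<>(q);
        Set<Item> result = new HashSet<>(q);
        while (!todo.isEmpty()) {
            Item i = todo.pop();
            if (i.dot() >= i.rhs().size()) continue;
            String b = i.rhs().get(i.dot());
            if (!g.isNonterminal(b)) continue;
            // the look-aheads of the new items are first1 of what can follow b
            List<String> beta = i.rhs().subList(i.dot() + 1, i.rhs().size());
            for (String x : g.first1Concat(beta, i.la()))
                for (List<String> gamma : g.alternatives(b)) {
                    Item j = new Item(b, gamma, 0, x);
                    if (result.add(j)) todo.push(j);
                }
        }
        return result;
    }
}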
S2′ = nextState(S0′, T)
    = { [E → T., {#, +}],
        [T → T. * F, {#, +, *}] }
Table 3.6 Some rows of the action-table of the canonical LR(1)-parser for G0. Here s stands for shift, r(i) for reduce by production i, and acc for accept. All empty entries represent error. Numbering of the productions used:
1: S → E
2: E → E + T
3: E → T
4: T → T * F
5: T → F
6: F → (E)
7: F → Id
After the extension by look-ahead symbols, the states S1, S2, and S9, which were LR(0)-inadequate, no longer have conflicts. In state S1′ the next input symbol + indicates to shift, and the next input symbol # indicates to reduce. In state S2′ the look-ahead symbol * indicates to shift, and # and + indicate to reduce; similarly in state S9′. Table 3.6 shows the rows of the action-table of the canonical LR(1)-parser for the grammar G0 which belong to the states S0′, S1′, S2′, S6′, and S9′. □
In SLR(1)-parsers, the look-ahead sets for items are independent of the states in which they occur. The look-ahead depends only on the left side of the production in the item:
λS(q, [X → α.β]) = { a ∈ VT ∪ {#} | S′# ⇒*rm γXaw } = follow₁(X)
The set follow₁(X) collects all symbols that can follow the nonterminal X in a sentential form of the grammar. Only the follow₁-sets are used to resolve conflicts in the construction of an SLR(1)-parser. In many cases this is not sufficient. More conflicts can be resolved if the state is taken into consideration in which the complete item [X → α.] occurs. The most precise look-ahead set that considers the state is defined by:
λL(q, [X → α.β]) = { a ∈ VT ∪ {#} | S′# ⇒*rm γXaw ∧ δG(q0, γα) = q }
rm
Here, q0 is the initial state, and G is the transition function of the canonical
LR.0/-automaton LR0 .G/. In L .q; ŒX ! ˛:/ only terminal symbols are con-
tained that can follow X in a right sentential-form ˇXaw such that ˇ˛ drives the
canonical LR.0/-automaton into the state q. We call state q of the canonical LR.0/-
automaton LALR.1/-inadequate if it contains conflicts with respect to the function
L . The grammar G is an LALR.1/-grammar if the canonical LR.0/-automaton
has no LALR.1/-inadequate states.
There always exists an LALR(1)-parser for an LALR(1)-grammar. The definition of the function λL, however, is not constructive, since the occurring sets of reliable prefixes are in general infinite. Instead, λL can be characterized as the least solution of the following system of equations:
λL(q0, [S′ → .S]) = {#}
λL(q, [A → αX.β]) = ∪ { λL(p, [A → α.Xβ]) | δG(p, X) = q }
λL(q, [A → .γ]) = ∪ { first₁(β) ⊙₁ λL(q, [X → α.Aβ]) | [X → α.Aβ] ∈ q, A → γ ∈ P }
This system of equations describes how sets of look-ahead symbols for items in states are generated. The first equation says that only # can follow the start symbol S′. The second type of equation describes that the look-ahead symbols of an item [A → αX.β] in a state q result from the look-ahead symbols of the item [A → α.Xβ] in states p from which q can be reached by reading X. The third class of equations formalizes that the follow symbols of an item [A → .γ] in a state q result from the follow symbols of occurrences of A in items of q, that is, from the sets first₁(β) ⊙₁ λL(q, [X → α.Aβ]) for items [X → α.Aβ] in q.
The system of equations for the sets λL(q, [A → α.β]) over the finite subset lattice 2^{VT ∪ {#}} can be solved by the iterative method for computing least solutions. By taking into account which nonterminals may produce ε, the occurrences of 1-concatenation in the system of equations can be replaced with unions. In this way, we obtain a reformulation as a pure union problem, which can be solved by the efficient method of Sect. 3.2.7.
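The resulting pure union problem admits a particularly simple worklist solution, sketched here in Java; variables stand for the pairs (q, item), and every constraint has the shape val(w) ⊇ val(v):

import java.util.*;

// A sketch of the iterative least-solution computation for a pure union
// problem: each variable v has initial constants init(v), and every edge
// (v, w) imposes val(w) ⊇ val(v). This is the shape into which the
// lambda_L-equations collapse once ⊙1 is replaced by unions.
class UnionSolver {
    static Map<String, Set<String>> solve(Map<String, Set<String>> init,
                                          Map<String, List<String>> edges) {
        Map<String, Set<String>> val = new HashMap<>();
        init.forEach((v, s) -> val.put(v, new HashSet<>(s)));
        Deque<String> worklist = new ArrayDeque<>(init.keySet());
        while (!worklist.isEmpty()) {
            String v = worklist.pop();
            for (String w : edges.getOrDefault(v, List.of())) {
                Set<String> target = val.computeIfAbsent(w, k -> new HashSet<>());
                if (target.addAll(val.get(v))) worklist.push(w);  // value grew
            }
        }
        return val;
    }
}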
Example 3.4.15 The following grammar, taken from [2], describes a simplified version of the C assignment statement:
S′ → S
S → L = R | R
L → * R | Id
R → L
State S2 is the only LR(0)-inadequate state. We have follow₁(R) = {#, =}. This look-ahead set for the item [R → L.] is not sufficient to resolve the shift-reduce-conflict in S2, since the next input symbol = is in the look-ahead set. Therefore, the grammar is not an SLR(1)-grammar.
The grammar is, however, an LALR(1)-grammar. The transition diagram of its LALR(1)-parser is shown in Fig. 3.17. To increase readability, the look-ahead sets λL(q, [A → α.β]) are directly associated with the item [A → α.β] of state q. In state S2, the item [R → L.] receives the look-ahead set {#}. The conflict is resolved since this set does not contain the next input symbol =. □
LR-parsers, like LL-parsers, have the viable-prefix property. This means that each prefix of the input that can be analyzed by an LR-parser without finding an error can be completed to a correct input word, i.e., a word of the language. The earliest situation in which an error can be detected is when an LR-parser reaches a state q in which, for the actual input symbol a, the action-table provides only the entry error. We call such a configuration an error configuration and q the error state of this configuration. There are at least two approaches to error handling in LR-parsers:
Forward error recovery: Modifications are made in the remaining input, not in the
pushdown.
Backward error recovery: Modifications are also made in the pushdown.
Let us assume that the actual state is q and the next symbol in the input is a. Potential corrections are the following actions: a generalized shift(βa) for an item [A → α.βaγ] in q, a reduce for incomplete items in q, and skip.
The correction shift(βa) assumes that the subword corresponding to β is missing. It therefore pushes the states that the parser would run through when reading the word β starting in state q. After that, the symbol a is read and the associated shift-transition of the parser is performed.
Fig. 3.17 Transition diagram of the LALR(1)-parser for the grammar of Example 3.4.15
The correction reduce(A → α.β) also assumes that the subword that belongs to β is missing. Therefore, |α| states are removed from the pushdown. Let p be the state that newly appears on top of the pushdown. Then the successor state of p under the left side A according to the goto-table is pushed onto the pushdown.
The correction skip continues with the next symbol a′ in the input.
A straightforward method for error recovery based on these actions may proceed as follows. Let us assume that the action-table contains only an error-entry for the actual input symbol a. If the actual state q contains an item [A → α.βaγ], the parser may restart by reading a; as a correction, therefore, shift(βa) is performed. If the symbol a does not occur in any right side of an item in q, but occurs as a look-ahead of an incomplete item [A → α.β] in q, then reduce(A → α.β) may be performed as correction. If several corrections are possible in q, a plausible correction is chosen. It may be plausible, for example, to choose the operation shift(βa) or reduce(A → α.β) for which the missing subword β is shortest. If neither a shift- nor a reduce-correction is possible, the correction skip is applied.
Example 3.4.16 Consider again the grammar with the productions
E → E + T    T → T * F    F → (E)
E → T        T → F        F → Id
for which the canonical LR(0)-automaton has been constructed in Example 3.4.5. As input we choose
( Id + )
After reading the prefix ( Id +, the pushdown of an SLR(1)-parser contains the sequence of states S0 S4 S8 S6, corresponding to the reliable prefix ( E +. The actual state S6 consists of the items
S6 = { [E → E + .T], [T → .T * F], [T → .F], [F → .Id], [F → .(E)] }
While reading ) in state S6 leads to error, there are incomplete items in S6 with look-ahead ). One of these items may be used for reduction; let us choose, e.g., the item [E → E + .T]. The corresponding reduction produces the new pushdown contents S0 S4 S8, since S8 is the successor state of S4 under the left side E. In state S8 a shift-transition is possible that reads ). The corresponding action pushes the new state S11 on top of the pushdown. Now, a sequence of reductions will reach the final state f. □
The presented error recovery method is a pure forward recovery. It is similar to the one offered, e.g., by the parser generator CUP for JAVA.
A further class of recovery methods attempts to repair the error location by a single symbol, using operations for the insertion, for the deletion, and for the replacement of that single symbol.
Let (φq, aᵢ ... a_n) be an error configuration. The goal of error correction by one of the three operations can be described as follows:
Deletion: Find a pushdown contents φ′p with
(φq, a_{i+1} ... a_n) ⊢* (φ′p, a_{i+1} ... a_n)    and    action[p, a_{i+1}] = shift
The required pushdown contents φ′p is determined by the property that reductions are possible under the new next input symbol that were impossible in the error configuration. An important property of all three operations is that they guarantee termination of the error recovery process: each of the three operations advances the input pointer by at least one symbol.
Error recovery methods with backtracking additionally allow us to undo the last applied production X → αY and to consider Y aᵢ ... a_n as input, when all other correction attempts have failed.
An immediate realization of the method searches through the different possible error corrections dynamically, that is, at parsing time, until a suitable correction is found. Checking one of the possibilities may involve several reductions, followed by a test whether a symbol can be read. Upon failure of the test, the error configuration must be restored and the next possibility tested. Finding a correct modification of the program by one symbol can therefore be expensive. Therefore, we are interested in precomputations that can be performed at the time when the parser is generated. The result of the precomputations should allow us to recognize
many dead ends in the error recovery quickly. Let (φq, aᵢ ... a_n) be an error configuration. Let us consider the insertion of a symbol a ∈ VT. The error recovery consists of the following sequence of steps (see Fig. 3.18a):
(1) a sequence of reductions under look-ahead symbol a, followed by
(2) the reading of a, followed by
(3) a sequence of reductions under look-ahead symbol aᵢ.
A precomputation makes it possible to exclude many symbols a from consideration for which there are no such sequences for subtasks (1) or (3). Therefore, for each state q and each a ∈ VT, the set Succ(q, a) of potential reduction successors of q under a is computed. The set Succ(q, a) contains the state q together with all states into which the parser may get out of q by reductions under look-ahead a. The set Succ(q, a) is the smallest set Q′ with the following properties:
Fig. 3.18 Closing the gap for error correction, (a) by insertion, (b) by replacement, or (c) by deletion of a symbol
q ∈ Q′.
Let q′ ∈ Q′, and let q′ contain a complete item for a production A → X₁ ... X_k. Then goto[p, A] ∈ Q′ for each state p with goto[... goto[p, X₁] ..., X_k] = q′.
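A possible fixpoint computation of Succ(q, a) is sketched below in Java; the map reducible is assumed to list, per state, the complete items whose reduction is admissible under the given look-ahead a (all names are illustrative):

import java.util.*;

// Sketch of the fixpoint computation of the reduction successors Succ(q, a).
class ReductionSuccessors {
    record CompleteItem(String lhs, List<String> rhs) {}

    static Set<Integer> succ(int q,
                             Map<Integer, Map<String, Integer>> gotoTab,
                             Map<Integer, List<CompleteItem>> reducible) {
        // predecessor relation of the goto-graph, per symbol
        Map<Integer, Map<String, Set<Integer>>> pred = new HashMap<>();
        gotoTab.forEach((p, row) -> row.forEach((x, s) ->
            pred.computeIfAbsent(s, k -> new HashMap<>())
                .computeIfAbsent(x, k -> new HashSet<>()).add(p)));

        Set<Integer> result = new HashSet<>(Set.of(q));
        Deque<Integer> todo = new ArrayDeque<>(result);
        while (!todo.isEmpty()) {
            int p = todo.pop();
            for (CompleteItem it : reducible.getOrDefault(p, List.of())) {
                // walk the right side X1 ... Xk backwards, starting in p
                Set<Integer> back = Set.of(p);
                for (int i = it.rhs().size() - 1; i >= 0; i--) {
                    Set<Integer> next = new HashSet<>();
                    for (int s : back)
                        next.addAll(pred.getOrDefault(s, Map.of())
                                        .getOrDefault(it.rhs().get(i), Set.of()));
                    back = next;
                }
                // from every start state of the walk, take the A-transition
                for (int s : back) {
                    Integer t = gotoTab.getOrDefault(s, Map.of()).get(it.lhs());
                    if (t != null && result.add(t)) todo.push(t);
                }
            }
        }
        return result;
    }
}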
Using the set Succ(q, a), we define the set Sh(q, a) of all states that can be reached from reduction successors of q under a by a shift transition for a:
Sh(q, a) = { goto[q′, a] | q′ ∈ Succ(q, a), goto[q′, a] ≠ error }
Using the sets Sh(q′, a′) for all states q′ and terminal symbols a′, we define the set of all states that are reached from the states in Sh(q, a) by reductions with look-ahead aᵢ, followed by reading aᵢ, and from it the set of promising insertion candidates:
Sh(q, a, aᵢ) = ∪ { Sh(q′, aᵢ) | q′ ∈ Sh(q, a) }
Bridge(q, aᵢ) = { a ∈ VT | Sh(q, a, aᵢ) ≠ ∅ }
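Given precomputed Succ-sets, the bridge computation itself is straightforward; the following Java sketch assumes the Succ table and goto-table representations of the previous sketch:

import java.util.*;

// Sketch of the computation of Sh and Bridge from precomputed Succ-sets.
class BridgeSets {
    // Sh(q, a): states reachable from reduction successors of q under a
    // by a shift transition for a
    static Set<Integer> sh(int q, String a,
                           Map<Integer, Map<String, Set<Integer>>> succ,
                           Map<Integer, Map<String, Integer>> gotoTab) {
        Set<Integer> result = new HashSet<>();
        for (int p : succ.getOrDefault(q, Map.of()).getOrDefault(a, Set.of())) {
            Integer s = gotoTab.getOrDefault(p, Map.of()).get(a);
            if (s != null) result.add(s);
        }
        return result;
    }

    // Bridge(q, ai): symbols a whose insertion lets the parser both read a
    // and subsequently read ai, i.e., Sh(q, a, ai) is nonempty
    static Set<String> bridge(int q, String ai, Set<String> terminals,
                              Map<Integer, Map<String, Set<Integer>>> succ,
                              Map<Integer, Map<String, Integer>> gotoTab) {
        Set<String> result = new HashSet<>();
        for (String a : terminals)
            // Sh(q, a, ai) is the union of Sh(q', ai) over q' in Sh(q, a)
            for (int p : sh(q, a, succ, gotoTab))
                if (!sh(p, ai, succ, gotoTab).isEmpty()) { result.add(a); break; }
        return result;
    }
}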
Example 3.4.17 We consider the grammar of Example 3.4.15 with the LALR(1)-parser of Fig. 3.17. For this parser, the sets Succ(q, a) of reduction successors of q under a and the resulting bridges can be computed as just described. □
In the procedure test, parsing is continued after the attempt at an error correction. If the remaining input can be successfully analyzed, the correction attempt is considered successful. If the attempt to process the remaining input fails, the parser assumes the existence of another error and starts a correction attempt for this error.
A more ambitious implementation may be more reluctant to conjecture a second error after a failed correction attempt. Instead, it may return to the error location and attempt another correction. Only if all correction attempts fail is a further error assumed. The parser then selects a best attempt and restarts error correction in the configuration reached. One measure for the success of an attempt at error correction could be the length of the input consumed after the correction attempt.
This word will often be much shorter than the subword a_{i+1} ... a_k. In the input for the calls of procedure test, the subword a_{i+1} ... a_n can be replaced by A a_{k+1} ... a_n, where the parser treats a nonterminal A in the input always like a shift of the symbol A.
3.5 Exercises
1. Reduced Grammar
Check the productivity and the reachability of the nonterminals of the CFG
G = ({S, A, B, C, D, E}, {a, b, c}, P, S) with the productions P:
S → aAa | bS
A → BB | C
B → bC
C → B | c
D → aAE
E → Db
2. Items
Give a definition of the future of a sequence of items, fut(ρ), such that you can prove the following invariant (I′):
(I′) For all words uv ∈ L(G) there exists a sequence ρ ∈ It*_G such that (q0, uv) ⊢*_{P_G} (ρ, v) implies fut(ρ) ⇒* v.
3. "-Productions
Assume that G D .VN ; VT ; P; S/ is a reduced CFG, and that the start symbol
does not occur in the right side of any production. G is called "-free if A !
" 2 P implies that A is the start symbol S. Show that for each grammar G, an
"-free grammar can be constructed that describes the same language.
4. Item-pushdown Automaton
(b) How many accepting sequences of configurations exist for the word
babaab?
6. follow-Sets
Prove Theorem 3.2.3.
7. Strong LL(k)-Grammars
Develop a construction that, for an arbitrary LL(k)-grammar, constructs a strong LL(k)-grammar that specifies the same language.
[Hint: Use pairs ⟨A, first_k(β)⟩ for nonterminals A and words β with S′# ⇒*_lm wAβ as nonterminals.]
How are the parse trees of the original grammar related to the parse trees of the transformed grammar?
8. k-Concatenation
Prove that the operation ⊙_k is associative.
9. first₁- and follow₁-Sets
Assume that the following grammar is given:
G = ({S′, S, B, E, J, L}, { ; , := , ( , ) , , , a }, P, S′) with the productions P:
S′ → S
S → L B
B → ; S ; L | := L
E → a | L
J → , E J | )
L → ( E J
Compute the first₁- and follow₁-sets using the iterative procedure.
10. "-free first1 -sets
Consider the grammar:
0 8 9 1
< S ! aAaB j bAbB =
G D @fS; A; Bg; fa; bg; A ! a j ab ; SA
: ;
B ! aB j a
(a) Set up the systems of equations for the computation of the ε-free first₁-sets and the follow₁-sets.
(b) Determine the variable-dependency graphs of the two systems of equations and their strongly connected components.
(c) Solve the two systems of equations.
11. LL(1)-Grammars
Test the LL(1)-property for
(a) the grammar of Exercise 5;
(b) the grammar of Exercise 9;
(c) the grammar of Exercise 10;
(d) the grammar
G = ({E, E′, D, D′, F}, {a, (, ), +, *}, P, E) with the productions P:
E → D E′
E′ → + D E′ | ε
D → F D′
D′ → * F D′ | ε
F → (E) | a
12. LL(1)-Parsers
(a) Construct the LL(1)-parser table for the grammar G of Exercise 11(d).
(b) Give a run of the corresponding parser on the input (a + a) * a + a.
13. LL(1)-Parsers (Cont.)
Construct the LL(1)-table for the grammar with the following set of productions:
E → − E | (E) | V E′
E′ → − E | ε
V → Id V′
V′ → (E) | ε
Sketch a run of the parser on input Id − (Id) − Id.
14. Extension of Right-regular Grammars
Extend right-regular grammars by additionally allowing the operators ? and (·)⁺ to occur in right sides of productions.
(a) Provide the productions of the transformed grammar for the nonterminals ⟨r?⟩ and ⟨r⁺⟩.
(b) Extend the generator scheme generate of the recursive-descent parser to expressions r? and r⁺.
15. Syntax Trees for RLL(1)-Grammars
Instrument the parsing procedures of the recursive-descent parser for an RLL(1)-grammar G in such a way that they return syntax trees.
(a) Instrument the procedure of a nonterminal A in such a way that syntax trees of the transformed CFG ⟨G⟩ are produced.
(b) Instrument the procedure of a nonterminal A in such a way that the subtrees of all symbols of the current word from the language of the regular expression p(A) are collected in a vector.
Adapt the generator schemes for the regular subexpressions of right sides accordingly.
16. Operator Precedences
Consider a CFG with a nonterminal symbol A and the productions:
A → lop A | A rop | A bop A | ( A ) | var | const
for various unary prefix operators lop, unary postfix operators rop, and binary infix operators bop, where the sets of postfix and infix operators are disjoint. Assume further that every unary operator has a precedence and every infix operator has both a left precedence and a right precedence by which the strength of the association is determined. If negation has precedence 1 and the operator + has left and right precedences 2 and 3, respectively, the expression
− 1 + 2 + 3
should be parsed as
((− 1) + 2) + 3
(a) S → L           (b) S → L           (c) S → L           (d) S → L
    L → L ; A | A       L → A ; L | A       L → L ; L | A       L → a T
    A → a               A → a               A → a               T → ε | ; L
18. SLR(1)-Grammars
Show that the following grammar is an SLR(1)-grammar, and construct an action-table for it:
S → E
E → T | E + T
T → P | T * P
P → F | F ↑ P
F → Id | (E)
19. SLR(1)-Grammars (Cont.)
Show that the following grammar is an LL(1)-grammar, but not an SLR(1)-grammar:
S → AaAb | BbBa
A → ε
B → ε
20. LALR(1)-Grammars
Show that the following grammar is an LALR(1)-grammar, but not an SLR(1)-grammar:
S → Aa | bAc | dc | bda
A → d
S → A
A → bB | a
B → cC | cCe
C → dA

S → A
A → BCA | a
B → ε
C → ε
Detailed presentations of the theory of formal languages and automata are given in
the books by Hopcroft and Ullman [25] and Harrison [22]. Exclusively dedicated
to syntactic analysis are the books [46], [59] and [60]. Möncke and Wilhelm de-
veloped grammar-flow analysis as a generic method to solve some of the problems
treated in this chapter, namely the determination of the productivity and the reach-
ability of nonterminals, the computation of firstk and followk sets, and later the
computation of global attribute dependences. It was first described in [49] and is
further developed in [48] and [50]. Knuth presents in [39] a related approach. How-
ever, he used totally ordered sets. A similar approach is presented by Courcelle in
[9].
LL(k)-grammars were introduced by Lewis and Stearns [26], [27]. Heckmann [23] develops an efficient RLL(1)-parser generator, which computes first₁- and follow₁-sets noniteratively as the solution of a pure-union problem. The presented error-recovery method for RLL(1)-parsers is a refinement of the method realized by Ammann in the Zürich Pascal-P4 compiler [4], [69]. Related techniques are also presented in [43]. The transformation for the removal of left recursion essentially follows the method in [5].
As early as the 1950s, parsing techniques were sought for expressions built up by means of prefix, postfix, and infix operators of different precedences. In [16], the shunting-yard algorithm was suggested by Dijkstra. This algorithm memorizes operators on one stack and maintains the operands on another stack. A shift-reduce parser for the same problem goes back to Floyd [18]. An alternative approach is presented by Pratt [57]. The Pratt parser extends the recursive-descent parser and includes the treatment of different operators and operator precedences directly in the recursive parse functions. A formal elaboration of this idea, together with an implementation in LISP, is provided by Van De Vanter's Master's thesis [64].
LR(k)-grammars were introduced by Knuth [36]. That the effort of LR(k)-parsing does not necessarily grow exponentially in k was shown in 2010 by Norbert Blum [6]. The subclasses of SLR(k)- and LALR(k)-grammars, as supported by most LR-parser generators, were proposed by DeRemer [13], [14]. Besides parsing techniques that are based on these subclasses of LR(1), there has been extensive discussion on how general LR(1)-parsers can be implemented efficiently. Useful optimizations for that purpose were already proposed in the 1970s by Pager [53, 54]. An interesting newer approach is by Kannapinn in his PhD dissertation [33], where extensions for grammars with regular right sides are also discussed. The generation of verified LR(1)-parsers is described in [32]. The elaborated error-recovery method for LR(1)-parsers follows [55].
Tomita presents generalized LR-parsing as a syntax-analysis method which is able to recognize the languages of all context-free grammars [62, 63]. Whenever a conflict is encountered, all possibilities are tracked in parallel by using several stacks. Still, it attempts to analyze as much as possible deterministically and to work with only one stack in order to increase efficiency. Nonetheless, its worst-case complexity is O(n³). This method, although originally developed for natural-language parsing, is also interesting for the analysis of languages like C++, which do not have deterministic context-free grammars.
4 Semantic Analysis
Some Notions
We use the following notions to describe the task of semantic analysis.
An identifier is a symbol (in the sense of lexical analysis) which can be used
in a program to name a program element. Program elements in imperative lan-
guages that may be named are modules, functions or procedures, statement labels,
constants, variables and parameters, and their types. In object-oriented languages
such as JAVA, classes and interfaces, together with their attributes and their meth-
ods, can also be named. In functional languages such as O CAML, variables and
functions differ semantically slightly from the corresponding concepts in impera-
tive languages, but can be named by identifiers as well. An important class of data
structures can be built using constructors whose identifiers are introduced together
with the type declaration. In logic languages such as P ROLOG, identifiers may refer
to predicates, atoms, data constructors, and variables.
Some identifiers are introduced in explicit declarations. The occurrence of an
identifier in its declaration is the defining occurrence of the identifier; all other oc-
currences are applied occurrences. In imperative programming languages such as
C and in object-oriented languages such as JAVA, all identifiers need to be explicitly
introduced. Essentially, this also holds for functional languages such as OCAML. In PROLOG, however, neither constructors and atoms, nor local variables in clauses
are explicitly introduced. Instead, they are introduced by their syntactically first oc-
currence in the program or the clause. In order to distinguish between variables and
atoms, their respective identifiers are taken from distinct name spaces. Variables
start with a capital letter or an underscore, while constructors and atoms are identi-
fied by leading lower-case letters. Thus, the term f(X, a) represents an application of the binary constructor f/2 to the variable X and the atom a.
Each programming language has scoping constructs, which introduce bound-
aries within which identifiers can be used. Imperative languages offer packages,
modules, function and procedure declarations as well as blocks that may summa-
rize several statements (and declarations). Object-oriented languages such as JAVA
additionally provide classes and interfaces, which may be organized in hierarchies. Functional languages such as OCAML also offer modules to collect sets of declarations. Explicit let- and let-rec-constructs allow us to restrict declarations of variables and functions to particular parts of the program. Beyond clauses, modern dialects of PROLOG also provide module systems.
Types are simple forms of specifications. If the programming element is a mod-
ule, the type specifies which operations, data structures, and further programming
elements are exported. For a function or method, it specifies the types of the argu-
ments as well as the type of the result. If the programming element is a program
variable of an imperative or object-oriented language, the type restricts which val-
ues may be stored in the variable. In purely functional languages, values cannot
be explicitly assigned to variables, i.e., stored in the storage area corresponding to
this variable. A variable here does not identify a storage area, but a value itself.
The type of the variable therefore must also match the types of all possibly denoted
values. If the programming element is a value then the type also determines how
much space must be allocated by the run-time system in order to store its internal
representation. A value of type int of the programming language JAVA, for exam-
ple, currently requires 32 bits or 4 bytes, while a value of type double requires 64
bits or 8 bytes. The type also determines which internal representation is used and
which operations may be applied to the value, as well as their semantics. An int-value
in JAVA, for example, must be represented in two's complement and can be combined
with other values of type int by means of arithmetic operations to compute
new values of type int. Overflow is explicitly allowed. C, on the other hand, does
not specify the size and internal representation of its base types with the same precision.
Signed int values, for example, may be represented in ones' as well as in two's
complement, depending on what the target architecture provides. Accordingly, the
effect of an overflow is left unspecified. Therefore, the program
fragment whose output depends on a signed overflow need not necessarily output
Hello: instead, the compiler is allowed to optimize it away into an empty statement.
For unsigned int, however, C guarantees wrap-around at overflow.
Example 4.1.1 The parse tree of the statement list

int a, b;
a ← 42;
b ← a * a + 7;

from the introduction is shown in Fig. 4.1. We assumed that the context-free
grammar differentiates between the precedence levels for assignment, comparison,
addition, and multiplication operators. Notable are the long chains of chain
productions, that is, replacements of one nonterminal by another one, which have
been introduced to bridge the precedence differences. A more abstract representation
is obtained if, in a first step, the applications of chain productions are removed
from the syntax tree, and superfluous terminal symbols are then omitted. The result
of these two simplifications is shown in Fig. 4.2. ⊓⊔
Fig. 4.1 The parse tree for the statement list int a, b; a ← 42; b ← a * a + 7;

Fig. 4.2 The parse tree of Fig. 4.1 after the removal of chain productions and of superfluous terminal symbols

The first step of the transformation into an abstract syntax that we executed by
hand in Example 4.1.1 need not be performed for each parse tree separately. Instead,
the grammar G can be systematically rewritten such that all chain productions are
eliminated. We have:

Theorem 4.1.1 For every CFG G, a CFG G′ without chain productions can be
constructed that has the following properties:

1. L(G) = L(G′);
2. If G is an LL(k)-grammar, then so is G′;
3. If G is an LR(k)-grammar, then so is G′.

Proof Assume that the CFG G is given by G = (V_N, V_T, P, S). The CFG G′
is then obtained as the tuple G′ = (V_N, V_T, P′, S) with the same sets V_N and V_T
and the same start symbol S, where the set P′ of productions of G′ consists of all
productions A → β for which there is a derivation

A₀ ⇒_G A₁ ⇒_G … ⇒_G Aₙ ⇒_G β    with A = A₀ and β ∉ V_N
Consider, for example, the grammar for arithmetic expressions with the productions

E → E + T | T
T → T * F | F
F → (E) | Id

The chain relation R between nonterminals and its reflexive transitive closure R* are
given by:

R    E  T  F        R*   E  T  F
E    0  1  0        E    1  1  1
T    0  0  1        T    0  1  1
F    0  0  0        F    0  0  1
By eliminating the chain productions, we obtain a grammar with the following
productions:

E → E + T | T * F | (E) | Id
T → T * F | (E) | Id
F → (E) | Id

Note that the resulting grammar is no longer an SLR(1)-grammar, but it is still an
LR(1)-grammar. ⊓⊔
We conclude that chain rules are, at least as far as expressiveness is concerned,
superfluous. The duplication of right sides that their elimination introduces is, however,
not appreciated when specifying grammars. For the further processing of parse
trees, on the other hand, chain rules should better be avoided. A meaningful compromise,
therefore, could be to allow chain rules in grammars, but to delegate their
elimination to the parser generator.
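The construction in the proof of Theorem 4.1.1 is easy to implement. The following
OCaml sketch is one possible realization, under the assumption that a grammar is
represented as a list of productions, each a pair of a nonterminal and a right-hand
side; all names are illustrative:

(* A minimal sketch of chain-production elimination (Theorem 4.1.1).
   A production (a, [N b]) is a chain production a -> b. *)
type symbol = N of string | T of string

let eliminate_chains (prods : (string * symbol list) list) =
  (* direct chain successors of the nonterminal a *)
  let step a =
    List.filter_map
      (function (a', [N b]) when a' = a -> Some b | _ -> None)
      prods
  in
  (* nonterminals reachable from a via chain productions, including a *)
  let rec closure visited = function
    | [] -> visited
    | b :: rest when List.mem b visited -> closure visited rest
    | b :: rest -> closure (b :: visited) (step b @ rest)
  in
  let nonterminals = List.sort_uniq compare (List.map fst prods) in
  List.concat_map
    (fun a ->
       let reachable = closure [] [a] in
       List.filter_map
         (fun (b, rhs) ->
            match rhs with
            | [N _] -> None                      (* drop chain productions *)
            | _ when List.mem b reachable -> Some (a, rhs)
            | _ -> None)
         prods)
    nonterminals

Applied to the expression grammar above, this yields exactly the productions listed
there.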
In the following, we will sometimes use the concrete syntax, sometimes the ab-
stract syntax, whichever is more intuitive.
Programming languages often allow us to use the same identifier for several program
elements and thus to have several declarations of it. A general strategy is therefore
needed that determines, for an applied occurrence of an identifier, the defining
occurrence it refers to. This strategy is specified by means of the rules for validity
and visibility.
The scope (range of validity) of a defining occurrence of an identifier x is
that part of a program in which an applied occurrence of x can refer to this defining
occurrence. In programming languages with nested blocks, the validity of a defining
occurrence of an identifier stretches over the block containing the declaration. Such
languages often require that there is only one declaration of an identifier in a block.
All applied occurrences of the identifier refer to this single defining occurrence.
Blocks, however, may be nested, and nested blocks may contain new declarations
of identifiers that have already been declared in an outer block. This is the case for
C and C++. While a declaration of the outer block is also valid in the nested block,
it may no longer be visible there: it may be hidden by a declaration of the same
identifier in the inner block. The range of visibility of a defining occurrence of an
identifier is the program text in which the identifier is both valid and visible, i.e.,
not hidden.
JAVA, for example, does not allow us to hide local identifiers, since this is a source
of unpleasant programming errors. A nested loop such as

for (int i = 0; i < n; ++i)
    for (int i = 0; i < m; ++i) { ... }

is not possible in JAVA, since the inner declaration of i would hide the outer
one. With class fields, JAVA is not as restrictive:
class Test {
    int x;
    void foo(int x) {
        x ← 5;
    }
}
Method foo does not change the value of the field x of its receiver object, but instead
modifies the value of its parameter x.
Some languages, for example JAVA and C++, permit the declaration of a variable
to be placed anywhere in a block before its first applied occurrence:

{
    int y;
    ...
    int x ← 2;
    ...
    y ← x + 1;
}

The variable x is valid in the whole block, but visible only after its declaration. The
first property prevents further declarations of x within the same block.
The identification of identifiers determines, for each applied occurrence of an
identifier, the defining occurrence that belongs to this applied occurrence according
to the rules of validity and visibility of the language. The validity and visibility
rules of a programming language are strongly related to the kinds of nesting of
scopes that the language allows.
COBOL has no nesting of scopes at all; all identifiers are valid and visible
everywhere. FORTRAN 77 allows only one nesting level, that is, no nested scopes.
Procedures and functions are all defined in the main program. Identifiers that are
defined in a block are only visible within that block. An identifier declared in the
main program is visible starting with its declaration, but is hidden within procedure
declarations that contain a new declaration of the identifier.
Modern imperative and object-oriented languages such as PASCAL, ADA, C,
C++, C#, and JAVA, as well as functional programming languages, allow arbitrarily
deep nesting of blocks. The ranges of validity and visibility of the defining occurrences
of identifiers are fixed by additional rules. In a let-construct

let x = e₁ in e₀

in OCAML, the identifier x is only valid in the body e₀ of the let-construct. Applied
occurrences of x in the expression e₁ refer to a defining occurrence of x in enclosing
blocks. The scope of the identifiers x₁, …, xₙ of a let rec-construct

let rec x₁ = e₁ and … and xₙ = eₙ in e₀

comprises, in contrast, the right sides e₁, …, eₙ as well as the body e₀.
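In OCaml itself this scoping can be observed directly; in the following small
illustrative program, the x on the right side of the inner let refers to the outer
binding:

let x = 1                  (* outer defining occurrence of x *)

let y =
  let x = x + 1 in         (* this applied x refers to the outer x = 1 *)
  x * 10                   (* here the inner x = 2 is visible *)

(* y evaluates to 20 *)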
In PROLOG, predicates, atoms, and constructors have global visibility: they are
valid in the whole PROLOG program and in associated queries. Identifiers of clause
variables are valid only in the clause in which they occur.
Explicit declarations exist only for predicates: These are defined by the list of their
alternatives.
An important concept for making identifiers visible in a given context is qualification.
Consider the expression x.a in the programming language C. Which declaration
the component a refers to depends on the variable x: the variable x, more precisely
the type of x, serves as a qualification of the component a. Qualification is also used
in programming languages with a module concept, such as MODULA and OCAML,
to make (public) identifiers of modules visible outside of the defining module. Let A
be an OCAML module. Then A.f refers to a function f declared in A. Similarly, the
use-directive in ADA lists identifiers of surrounding scopes, thereby making their
declarations visible. The visibility of these identifiers stretches from the end of the
use-directive to the end of the enclosing program unit.
Similar concepts for qualification exist in object-oriented programming languages
such as JAVA. Consider a name x in JAVA that is declared public in a class C.
Within a class A that is different from C and is neither an inner class nor a subclass
of C, the identifier x of class C is still valid.

First, consider the case that x is declared static. Then x exists only once for
the whole class C. If class C belongs to a different package than class A, then the
identification of x requires not only the class C, but in addition the name of this
package. A call of the static method newInstance(), for instance, of the class
DocumentBuilderFactory of the package javax.xml.parsers has the form:

javax.xml.parsers.DocumentBuilderFactory.newInstance()

Such lengthy qualifications take too much writing effort. JAVA therefore offers an
import directive. The directive

import javax.xml.parsers.

at the beginning of a file, however, makes not only the class DocumentBuilderFactory
visible, but also all public static attributes and methods of the class
DocumentBuilderFactory.
Similar directives that make valid, but not directly visible, identifiers visible in the
current context exist in many programming languages. In OCAML, an open A in
a module B makes all those variables and types of module A visible in B that are
public.
Things are different if a field or a method x in a JAVA program is not static. The
class to which an occurrence of the identifier x belongs is determined by considering
the static type of the expression whose run-time value refers to the object from
which x is selected.

The static type of the attribute o is A. At run time, the attribute o has as value an
object of the subclass B of A. Of the objects to which o may evaluate, however,
only the visible attributes, methods, and inner classes of the superclass A are
visible. ⊓⊔
Conclusion

Not everywhere in the scope of a defining occurrence of x does an applied occurrence
of x refer to this defining occurrence. If the defining occurrence is global
to the current block, it may be hidden by a local (re)declaration of x; it is then not
directly visible. Within its scope, however, there are several possibilities to make a
defining occurrence of an identifier x that is not directly visible, visible. For that,
many programming languages provide explicit qualification at a particular use, or
general directives that allow the explicit qualifications to be omitted within a given
context.
We now sketch how compilers check the context conditions. We consider the
simple case of a programming language with nested scopes, but without overloading.

The task is decomposed into two subtasks. The first subtask consists in checking
whether all identifiers are declared, and in relating their applied occurrences to their
defining occurrences. We call this task declaration analysis. This analysis is
determined by the rules for validity and visibility of the programming language. The
second subtask, type checking, examines whether the types of program objects
conform to the rules of the type system. It may also infer types for objects for which
no types were given.
Identification of Identifiers
In our simple case, the rules for validity and visibility determine that in a correct
program, exactly one defining occurrence of an identifier belongs to each applied
occurrence of the identifier. The identification of identifiers consists in linking each
applied occurrence to a defining occurrence, or in detecting that no such link is
possible or that more than one exists. The result of this identification is later used
for type checking and possibly for code generation. It must therefore be passed
on to subsequent compiler phases. There exist several possibilities for the repre-
sentation of the link between applied and defining occurrence. Traditionally, the
compiler constructs a symbol table, in which the declarative information for each
defining occurrence of an identifier is stored. This symbol table is frequently orga-
nized similarly to the block structure of the program. This helps to quickly reach
the corresponding defining occurrence starting from an applied occurrence.
The symbol table is not the result of the identification of identifiers, but it supports
this identification. The result of the identification is, for every node representing an
applied occurrence of an identifier x, a link to the node of the defining occurrence
of x to which it refers.
Which operations must be supported by the symbol table? For each declaration
of an identifier, the identifier must be entered into the symbol table together with
a reference to the declaration’s node in the syntax tree. Another operation must
register the opening, yet another the closing of a block. The latter operation can
delete the entries for the declarations of the closed block from the symbol table.
In this way, the symbol table at each point in time contains exactly the entries for
the declarations of all blocks that have been opened but not yet closed. When the
declaration analyzer arrives at an applied occurrence of an identifier, it searches the
symbol table, according to the rules for validity and visibility, for the entry of the
corresponding defining occurrence. When it has found this entry, it copies the
reference to the declaration node into the node of the applied occurrence.
Thus, the following operations on the symbol table are required:

- enter_block(): registers the opening of a new block;
- exit_block(): closes the current block and removes its entries from the symbol table;
- enter_id(x, d): enters the identifier x into the current block, together with a reference d to its declaration node;
- search_block_id(x): searches for an entry for x in the current block only, in order to detect repeated declarations within the same block;
- search_id(x): searches, according to the rules of validity and visibility, for the entry of the defining occurrence of x that is visible at the current point.
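These operations can be realized, for instance, by organizing the symbol table as a
stack of blocks. The following OCaml sketch is one possible realization under this
assumption; identifiers are represented as strings, and 'decl stands for the type of
references to declaration nodes:

(* the symbol table as a stack of blocks; each block maps identifiers
   to references to their declaration nodes *)
type 'decl table = (string * 'decl) list list ref

let create () : 'decl table = ref [ [] ]

let enter_block (t : 'decl table) = t := [] :: !t

let exit_block (t : 'decl table) =
  match !t with
  | _ :: rest -> t := rest
  | [] -> failwith "no open block"

(* enter a declaration into the innermost (current) block *)
let enter_id (t : 'decl table) x d =
  match !t with
  | b :: rest -> t := ((x, d) :: b) :: rest
  | [] -> failwith "no open block"

(* search the current block only: detects repeated declarations *)
let search_block_id (t : 'decl table) x =
  match !t with
  | b :: _ -> List.assoc_opt x b
  | [] -> None

(* search all open blocks from innermost to outermost: visibility rules *)
let search_id (t : 'decl table) x =
  List.find_map (fun b -> List.assoc_opt x b) !t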
Example 4.1.4 We want to apply a symbol table for annotating the parse tree for
a simple fragment of a C-like imperative language without functions or procedures.
In order to keep the example grammar small, only a minimalistic set of types and
productions for the nonterminals has been included.
In order to assign declarations to uses of identifiers, we consider a simple internal
representation that closely resembles the parse tree. Every node of the parse tree
is represented by an object whose class name equals the corresponding terminal
or nonterminal symbol. Every such object has an array succs of references to the
successor nodes. Internal nodes corresponding to nonterminals additionally have
an attribute rhs containing the right side of the corresponding production.
Assume that the tokens of the class var serve as identifiers of variables and are
equipped with an attribute id containing their concrete name. Additionally, each
such token obtains an attribute ref that is meant to receive a reference to the declara-
tion to which the identifier refers. When computing these references, the algorithm
should take into account that new declarations of an identifier within the same block
are ruled out, while they are admitted within subblocks. The attributes ref can be
computed by means of a depth-first left-right traversal over the parse tree. Thereby,
a method process() is called for every visited node, which may behave differently
depending on the class of the node.
For the class ⟨decl⟩, we define:

void process() {
    ⟨decl⟩ ref;
    switch (rhs) {
    case '⟨type⟩ var;' :  ref ← table.search_block_id(succs[1].id);
                          if (ref ≠ null) error();
                          else table.enter_id(succs[1].id, this);
                          return;
    }
}
For the class ⟨block⟩, we define:

void process() {
    switch (rhs) {
    case '⟨decl⟩ ⟨block⟩' :  succs[0].process();
                             succs[1].process();
                             return;
    case '⟨stat⟩ ⟨block⟩' :  succs[0].process();
                             succs[1].process();
                             return;
    case '' :                return;
    }
}
For the class ⟨stat⟩, we define:

void process() {
    ⟨decl⟩ ref;
    switch (rhs) {
    case 'var = E;' :     ref ← table.search_id(succs[0].id);
                          if (ref = null) error();
                          else succs[0].ref ← ref;
                          return;
    case '{ ⟨block⟩ }' :  table.enter_block();
                          succs[1].process();
                          table.exit_block();
                          return;
    }
}
For the class E of expressions, we define:

void process() {
    ⟨decl⟩ ref;
    switch (rhs) {
    case 'const' :  return;
    case 'var' :    ref ← table.search_id(succs[0].id);
                    if (ref = null) error();
                    else succs[0].ref ← ref;
                    return;
    }
}
During the traversal of the parse tree, the visibility rules for identifiers must be taken
into account. If, for example, a new declaration of a variable x is forbidden within
the same block, then enter_id must only be executed if x has not yet been declared
within the current block. ⊓⊔
For programming languages more complicated than the one of Example 4.1.4,
a single depth-first left-to-right traversal of the parse tree is often not sufficient. In
JAVA, for example, members such as methods or attributes may already be used
before they syntactically occur within the declaration of the class. A first pass over
the class must therefore collect all declarations, so that a second pass can map the
uses of identifiers to their declarations. A similar procedure is required for functional
languages in definitions of mutually recursive functions.

A specific characteristic of PASCAL-like programming languages, as well as of C,
is to stick with the depth-first left-to-right traversal, but to insert a forward declaration
before the first use of a function or procedure. Such a forward declaration consists
of the name together with the return type and the list of parameters.
The method process() for declarations is now extended to deal with the cases of
forward declarations and of declarations of procedures:

...
case 'void var ();' :             ref ← table.search_block_id(succs[1].id);
                                  if (ref = null) table.enter_id(succs[1].id, this);
                                  return;
case 'void var () { ⟨block⟩ };' : ref ← table.search_block_id(succs[1].id);
                                  if (ref = null) table.enter_id(succs[1].id, this);
                                  else {
                                      if (ref.impl ≠ null) error();
                                      else ref.impl ← this;
                                  }
                                  succs[5].process();
                                  return;
...

For the use of a procedure, the method process() for statements receives the
additional case:

...
case 'var ();' :                  ref ← table.search_id(succs[0].id);
                                  if (ref = null) error();
                                  else succs[0].ref ← ref;
                                  return;
...
⊓⊔
Example 4.1.6 For the program in Fig. 4.3 and the program point marked by ∗,
the symbol table of Fig. 4.4 is obtained. The light entries represent references
to declarations. The modified implementation of the symbol table, which maintains
a separate declaration stack for each identifier, is displayed in Fig. 4.5. Additionally,
a stack of blocks is shown whose entries list the identifiers that have been declared
in each entered block. ⊓⊔
Fig. 4.3 An example program with nested blocks and the procedures p, q, and r; the program point under consideration is marked by ∗

Fig. 4.4 Symbol table for the program of Fig. 4.3 at the program point marked by ∗

Fig. 4.5 Modified symbol table for the program of Fig. 4.3
The package x declares in its public part two new identifiers, namely the type
identifier boolean and the function identifier f. These two identifiers are made
potentially visible after the semicolon of the directive use x; (see after (D2)).
Function identifiers in ADA can be overloaded. The two declarations of f, at (D1)
and (D2), have different parameter profiles, in this case different result types. Both
are therefore (potentially) visible at program point (A1).

The declaration f: integer in the program unit A (see (D3)) hides the outer
declaration (D2) of f, since variable identifiers in ADA cannot be overloaded. For
this reason, the declaration (D1) is not visible either. Declaration (D4) of f in program
unit B hides declaration (D3) and, since this one hides declaration (D2), transitively
also (D2). Declaration (D1), potentially made visible through the use-directive, is not
hidden, but still potentially visible. In the context put(f) (see (A3)), f can only
refer to declaration (D4), since the first declaration of put uses a type, boolean,
that is different from the result type of f in (D1). ⊓⊔
4.2 Type Inference

Imperative languages typically require us to supply types for identifiers. These are
used to derive the types of expressions. In modern functional programming languages,
however, not only the types of expressions, but also the types of identifiers are
inferred automatically. New identifiers are therefore (mostly) introduced in programs
without associating types with them.
The idea of automatically inferring types goes back to J.R. Hindley and R. Milner. We
follow them and characterize the set of potential types of an expression by introducing
axioms and inference rules, which relate the type of an expression to the
types of its subexpressions. For simplicity, we only consider a functional core language
derived from OCAML. A similar functional core language is also considered
in the volume Compiler Design: Virtual Machines. A program in this programming
language is an expression without free variables, where expressions e are built
according to the following grammar:

e ::= b | x | (e₁, …, eₘ) | [] | (e₁ :: e₂)
    | (□₁ e) | (e₁ □₂ e₂)
    | (if e₀ then e₁ else e₂) | (e₁ e₂) | (fun x → e)
    | (let x₁ = e₁ in e₀)
    | (let rec x₁ = e₁ and … and xₘ = eₘ in e₀)
    | (match e₀ with (x₁, …, xₘ) → e₁)
    | (match e₀ with [] → e₁ | x :: y → e₂)

Here, b are basic values, x are variables, and □ᵢ (i = 1, 2) are i-place operators on
basic values. For simplicity, we consider as structured data types only tuples and
lists. Pattern matching can be used to decompose structured data. As patterns for
the decomposition, we admit only patterns with exactly one constructor. We use the
usual precedence rules and associativities to save parentheses.
Example 4.2.2 A first example program in the functional core language constructs
a list

[a₁; …; aₙ]

where [a₁; …; aₙ] abbreviates a₁ :: … :: aₙ :: [].
We use a syntax for types that is also similar to that of OCAML; in particular, the
unary type constructor list for lists is written to the right of its argument. Types t
are built according to the following grammar:

t ::= int | bool | (t₁ * … * tₘ) | t list | t₁ → t₂

The only basic types we consider are the type int of integral numbers and the type
bool of boolean values. Expressions may contain free variables. The type of an
expression then depends on the types of the free variables that occur in it. The
assumptions about the types of free variables are collected in a type environment. A
type environment Γ is a function from a finite set of variables into the set of types.
A type judgment is of the form

Γ ⊢ e : t

and expresses that, under the assumptions of the type environment Γ, the expression
e has the type t.
A type system consists of a set of axioms and of rules by which valid type judgments
can be inferred. Axioms are judgments that are valid without further assumptions.
Rules permit us to derive new valid type judgments from valid preconditions. We
now list the axioms and rules for our functional core language. As axioms we need:

CONST:  Γ ⊢ b : t_b
NIL:    Γ ⊢ [] : t list        (t an arbitrary type)
VAR:    Γ ⊢ x : Γ(x)           (x in the domain of Γ)

Each family of axioms is given a name for later reference. Furthermore, we assume
that each basic value b has a uniquely determined basic type t_b, syntactically
associated with it.

Rules are also given names. The preconditions of a rule are written above the
line; the conclusion is written below it.
OP:      Γ ⊢ e₁ : int        Γ ⊢ e₂ : int
         ─────────────────────────────────
         Γ ⊢ e₁ + e₂ : int

COMP:    Γ ⊢ e₁ : t          Γ ⊢ e₂ : t
         ─────────────────────────────────
         Γ ⊢ (e₁ = e₂) : bool

IF:      Γ ⊢ e₀ : bool       Γ ⊢ e₁ : t        Γ ⊢ e₂ : t
         ──────────────────────────────────────────────────
         Γ ⊢ (if e₀ then e₁ else e₂) : t

TUPEL:   Γ ⊢ e₁ : t₁   …   Γ ⊢ eₘ : tₘ
         ─────────────────────────────────
         Γ ⊢ (e₁, …, eₘ) : (t₁ * … * tₘ)

CONS:    Γ ⊢ e₁ : t          Γ ⊢ e₂ : t list
         ─────────────────────────────────────
         Γ ⊢ (e₁ :: e₂) : t list

MATCH1:  Γ ⊢ e₀ : (t₁ * … * tₘ)        Γ ⊕ {x₁ ↦ t₁, …, xₘ ↦ tₘ} ⊢ e₁ : t
         ──────────────────────────────────────────────────────────────────
         Γ ⊢ (match e₀ with (x₁, …, xₘ) → e₁) : t

MATCH2:  Γ ⊢ e₀ : t₁ list     Γ ⊢ e₁ : t     Γ ⊕ {x ↦ t₁, y ↦ t₁ list} ⊢ e₂ : t
         ────────────────────────────────────────────────────────────────────────
         Γ ⊢ (match e₀ with [] → e₁ | x :: y → e₂) : t

APP:     Γ ⊢ e₁ : t₁ → t₂    Γ ⊢ e₂ : t₁
         ─────────────────────────────────
         Γ ⊢ (e₁ e₂) : t₂

FUN:     Γ ⊕ {x ↦ t₁} ⊢ e : t₂
         ─────────────────────────────
         Γ ⊢ (fun x → e) : t₁ → t₂

LET:     Γ ⊢ e₁ : t₁         Γ ⊕ {x₁ ↦ t₁} ⊢ e₀ : t
         ─────────────────────────────────────────────
         Γ ⊢ (let x₁ = e₁ in e₀) : t

LETREC:  Γ′ ⊢ e₁ : t₁   …   Γ′ ⊢ eₘ : tₘ        Γ′ ⊢ e₀ : t
         ─────────────────────────────────────────────────────
         Γ ⊢ (let rec x₁ = e₁ and … and xₘ = eₘ in e₀) : t

         where Γ′ = Γ ⊕ {x₁ ↦ t₁, …, xₘ ↦ tₘ}
The rule OP has been displayed for the integer operator +. Analogous rules are
provided for the other unary and binary operators. In the case of boolean operators,
the arguments and the result are of type bool. For comparison operators, the rule
COMP is shown for the comparison operator =. Analogous rules are provided in
OCAML for the other comparison operators. Note that, according to the semantics
of OCAML, comparisons are allowed between arbitrary values as long as they have
the same type.
Example 4.2.3 For the body of the function fac of Example 4.2.1 and the type
environment

Γ = {fac ↦ int → int, x ↦ int}

the following judgments are derived, proceeding from the subexpressions to the
whole conditional:

Γ ⊢ x : int    Γ ⊢ 0 : int    and therefore    Γ ⊢ x ≤ 0 : bool
Γ ⊢ x : int    Γ ⊢ 1 : int    and therefore    Γ ⊢ x − 1 : int
Γ ⊢ fac : int → int           and therefore    Γ ⊢ fac (x − 1) : int
Γ ⊢ x : int                   and therefore    Γ ⊢ x * fac (x − 1) : int

and finally

Γ ⊢ if x ≤ 0 then 1 else x * fac (x − 1) : int

Under the assumption that fac has the type int → int and x has the type int, it can
thus be inferred that the body of the function fac has the type int. ⊓⊔
The rules are designed in such a way that the type of an expression is preserved
during the evaluation of the expression. This property is called subject reduction. If
the types of all variables were guessed correctly at their definitions, the rules could be
used to check whether the guesses are consistent. Note that an expression may
have several types. The expression

id ≡ fun x → x

for example, describes the identity function. In each type environment Γ and for
each type t,

Γ ⊢ id : t → t

can be derived. ⊓⊔
The type of an expression can be inferred by introducing a type variable for each
variable and each occurrence of a subexpression of the program, and by collecting
the equations between these type variables that must hold due to the axioms and
rules of the type system when applied to the subexpressions of the program.
Consider, for example, the expression

fun x → x + 1

As type variable for the variable x we choose α, while α₁ and α₂ denote the types of
the expressions fun x → x + 1 and x + 1, respectively. The type rules for functions
and operator applications then generate the following equations:

FUN:  α₁ = α → α₂
OP:   α₂ = int
      α = int
      int = int

It follows that

α = int,    α₁ = int → int,    α₂ = int

must hold. ⊓⊔
Let α[e] denote the type variable for the expression e. Each rule application
generates the following equations:

CONST:  e ≡ b:                         α[e] = t_b
NIL:    e ≡ []:                        α[e] = α list        (α new)
OP:     e ≡ e₁ + e₂:                   α[e] = int
                                       α[e₁] = int
                                       α[e₂] = int
COMP:   e ≡ (e₁ = e₂):                 α[e₁] = α[e₂]
                                       α[e] = bool
TUPEL:  e ≡ (e₁, …, eₘ):               α[e] = (α[e₁] * … * α[eₘ])
CONS:   e ≡ (e₁ :: e₂):                α[e₂] = α[e₁] list
                                       α[e] = α[e₁] list
IF:     e ≡ if e₀ then e₁ else e₂:     α[e₀] = bool
                                       α[e] = α[e₁]
                                       α[e] = α[e₂]
Example 4.2.6 For the expression id ≡ fun x → x of Example 4.2.5, the following
equation is obtained:

α[id] = α[x] → α[x]

Different solutions of this equation are obtained if different types t are chosen for
α[x]. ⊓⊔
What is the relation between the system of equations for an expression e and the
type judgments derivable for this expression? Let us assume that all variables
occurring in e are unique, and let V be the set of variables occurring in e. In the following,
we only consider uniform derivations; these are derivations of type judgments that
all use the same type environment. Each derivation of a type judgment Γ ⊢ e : t for
a type environment Γ for the free variables of e can be converted into a uniform
derivation of a type judgment Γ′ ⊢ e : t for a Γ′ that agrees with Γ on the free
variables of e. It follows:

Theorem 4.2.1 Let E be the system of equations for the expression e.

1. Let σ be a solution of E. Then there is a uniform derivation of the type judgment

   Γ ⊢ e : t

   for

   Γ = {x ↦ σ(α[x]) | x ∈ V}    and    t = σ(α[e])

2. Let A be a uniform derivation of a type judgment Γ ⊢ e : t in which, for
   each subexpression e′ of e, a type judgment Γ ⊢ e′ : t_{e′} is derived. Then the
   substitution σ defined by

   σ(α[e′]) = Γ(x)      if e′ ≡ x ∈ V
   σ(α[e′]) = t_{e′}    otherwise, for every subexpression e′ of e

   is a solution of E. ⊓⊔
Theorem 4.2.1 states that all valid type judgments can be read off from the solutions
of the system of equations for an expression. The systems of equations occurring
here are systems of equalities between type terms. The process of solving such
systems of equalities between terms is called unification.
Example 4.2.7
1. Consider the equation

Y = X → X

where → is a binary type constructor in infix notation. The solutions of this
equation are the substitutions

{X ↦ t, Y ↦ (t → t)}

for arbitrary terms t. One such possible term t is the variable X itself.
2. The equation

X → int = bool → Z

has exactly one solution, namely the substitution

{X ↦ bool, Z ↦ int}

3. The equation

bool = X → Y

has no solution. ⊓⊔
The function occurs implements the required occur check; we define it for one
generic constructor f of arity k ≥ 1 instead of for all possible constructors that may
occur in a program text. The functions unify and unifyList are mutually recursive.
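A minimal OCaml sketch of these functions, under the assumption of a simple
representation of type terms with variables and constructor applications (the
concrete type typ and the association-list representation of substitutions are
illustrative choices):

(* type terms: variables and constructor applications; e.g.,
   Con ("->", [t1; t2]) represents the function type t1 -> t2 *)
type typ = Var of string | Con of string * typ list

(* apply a substitution, given as an association list, to a term *)
let rec apply subst = function
  | Var a ->
      (match List.assoc_opt a subst with
       | Some t -> apply subst t        (* follow bindings transitively *)
       | None -> Var a)
  | Con (f, ts) -> Con (f, List.map (apply subst) ts)

(* occur check: does the variable a occur in the term? *)
let rec occurs a = function
  | Var b -> a = b
  | Con (_, ts) -> List.exists (occurs a) ts

(* unify two terms under an already accumulated substitution *)
let rec unify t1 t2 subst =
  match apply subst t1, apply subst t2 with
  | Var a, Var b when a = b -> subst
  | Var a, t | t, Var a ->
      if occurs a t then failwith "occur check failed"
      else (a, t) :: subst
  | Con (f, ts1), Con (g, ts2)
    when f = g && List.length ts1 = List.length ts2 ->
      unifyList ts1 ts2 subst
  | _ -> failwith "constructor clash"

and unifyList ts1 ts2 subst =
  List.fold_left2 (fun s u1 u2 -> unify u1 u2 s) subst ts1 ts2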
The type-inference method described so far is not syntax-directed. This is a
disadvantage: if the system of equations for a program has no solution, no information
is available about where the type error originates. A precise localization of the cause
of an error, however, is of utmost importance for the programmer. Therefore, we
modify the described method so that it closely follows the syntax of the program.
This syntax-directed algorithm W is given as a functional program, which uses case
distinction over the different possible forms of the program expression by pattern
matching. To distinguish the syntax of the expression e from the syntax of the
algorithm, we use capital letters for the keywords in e and put the operators in quotes.

A call of the function W is evaluated recursively over the structure of an expression
e. A type environment Γ and a substitution θ of type variables are passed as
additional accumulating parameters. The call returns as its result a type term t for e
together with the substitution accumulated during the evaluation. In the description
that now follows, the calls of the function unify are always emphasized. To increase
readability, we assume that the calls of unify always return substitutions; if a
unification should fail, an error message is generated, and type inference either
terminates or continues with a meaningful error recovery.
| (FUN x → e)
    → let α = new()
      in let (t, θ) = W e (Γ ⊕ {x ↦ α}, θ)
      in (α → t, θ)
| (LET x₁ = e₁ IN e₀)
    → let (t₁, θ) = W e₁ (Γ, θ)
      in let Γ = Γ ⊕ {x₁ ↦ t₁}
      in let (t₀, θ) = W e₀ (Γ, θ)
      in (t₀, θ)
| (LETREC x₁ = e₁ AND … AND xₘ = eₘ IN e₀)
    → let α₁ = new()
      …
      in let αₘ = new()
      in let Γ = Γ ⊕ {x₁ ↦ α₁, …, xₘ ↦ αₘ}
      in let (t₁, θ) = W e₁ (Γ, θ)
      in let θ = unify (α₁, t₁) θ
      …
      in let (tₘ, θ) = W eₘ (Γ, θ)
      in let θ = unify (αₘ, tₘ) θ
      in let (t₀, θ) = W e₀ (Γ, θ)
      in (t₀, θ)
The last three cases treat functions and the definitions of new variables. New type
variables are created for the unknown type of the formal parameter as well as for the
unknown types of the simultaneously recursively defined variables. Their bindings
are determined during the processing of the expression e. No new type variable
needs to be created for a variable x introduced by a let-expression; the type of the
variable x derives directly from the type of the defining expression for x.

The function W is called for an expression e with a type environment Γ₀ that
associates a new type variable αₓ with each free variable x occurring in e, and with
the empty substitution ∅. The call fails if and only if there is no type environment
for which the expression e has a type. If, on the other hand, the call delivers a pair
(t, θ) as its return value, then for each derivable type judgment Γ₀′ ⊢ e : t′ there is a
substitution σ such that

t′ = σ(θ t)    and    Γ₀′(x) = σ(θ(Γ₀ x))    for all variables x
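For concreteness, the following OCaml sketch implements the cases of W for
variables, application, abstraction, and a (still monomorphic) let; it reuses the typ,
apply, and unify definitions of the unification sketch above, and the representation
of Γ as an association list as well as the name new_var are illustrative:

type expr =
  | X of string                   (* variable *)
  | App of expr * expr            (* application *)
  | Fun of string * expr          (* fun x -> e *)
  | Let of string * expr * expr   (* let x = e1 in e0 *)

let counter = ref 0
let new_var () = incr counter; Var ("a" ^ string_of_int !counter)

(* W returns a type term for e and the accumulated substitution *)
let rec w gamma subst = function
  | X x -> (List.assoc x gamma, subst)
  | Fun (x, e) ->
      let a = new_var () in
      let t, subst = w ((x, a) :: gamma) subst e in
      (Con ("->", [a; t]), subst)
  | App (e1, e2) ->
      let t1, subst = w gamma subst e1 in
      let t2, subst = w gamma subst e2 in
      let a = new_var () in
      let subst = unify t1 (Con ("->", [t2; a])) subst in
      (a, subst)
  | Let (x, e1, e0) ->
      let t1, subst = w gamma subst e1 in
      w ((x, t1) :: gamma) subst e0

(* the identity  fun x -> x  receives a type of the form a -> a *)
let () =
  let t, s = w [] [] (Fun ("x", X "x")) in
  assert (match apply s t with Con ("->", [u; v]) -> u = v | _ -> false)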
4.2.2 Polymorphism
For the function single of the program in Example 4.2.8 (single defined as
fun y → [y]), the equation

α[single] = (β → β list)

is derived. Because of the function application (single 1), the type variable β is
instantiated with the basic type int, and for (single 1) the type int list is obtained.
The type equation for the outermost function application therefore requires the
instantiation of β with int list. Since the unification of int with int list fails, a type
error is reported. ⊓⊔

A possible solution to this problem consists in copying each let-definition for each
use of the defined variable. In the example we obtain the expression

(fun y → [y]) ((fun y → [y]) 1)

The two occurrences of the subexpression (fun y → [y]) are now treated
independently of each other and receive the types β → β list and β′ → β′ list for
distinct type variables β, β′. The expanded program now has a type: one type variable
can be instantiated with int and the other with int list.
A solution by copying, as just sketched, is not recommended, because it imposes
extra restrictions for the resulting program to have the same semantics as the
original program. In addition, the program expanded in this way may become very
large. Also, type inference is no longer modular: for a function of another compilation
unit that is used several times, the implementation must be known in order to
copy it. A better idea therefore consists in copying not code, but types. For this
purpose, we extend types to type schemes. A type scheme is obtained from a type t by
generalizing some of the type variables that occur in t. Generalized type variables
may be instantiated differently at different uses of the type. In the type scheme

∀ α₁, …, αₘ. t

the variables α₁, …, αₘ are generalized in t. All other type variables that occur
in t must be instantiated identically at all uses of the type scheme. The quantifier
∀ occurs only at the outermost level: the expression t may not contain any further
∀. Type schemes are introduced for let-defined variables. At their occurrences, the
generalized type variables of the scheme can be instantiated independently with
different types. For simplicity, we regard ordinary type expressions as type schemes in
which an empty list of variables has been generalized. As new rules we obtain:

INST:   Γ(x) = ∀ α₁, …, αₖ. t
        ──────────────────────────────    (t₁, …, tₖ arbitrary)
        Γ ⊢ x : t[t₁/α₁, …, tₖ/αₖ]

LET:    Γ ⊢ e₁ : t₁        Γ ⊕ {x₁ ↦ close t₁ Γ} ⊢ e₀ : t₀
        ────────────────────────────────────────────────────
        Γ ⊢ (let x₁ = e₁ in e₀) : t₀
The operation close takes a type term t and a type environment Γ, and generalizes
in t all type variables that do not occur in Γ. The types of the variables introduced
in a recursive definition can also be generalized, but only for the occurrences of these
variables in the main expression:
LETREC:  Γ′ ⊢ e₁ : t₁   …   Γ′ ⊢ eₘ : tₘ        Γ″ ⊢ e₀ : t
         ──────────────────────────────────────────────────────
         Γ ⊢ (let rec x₁ = e₁ and … and xₘ = eₘ in e₀) : t

where

Γ′ = Γ ⊕ {x₁ ↦ t₁, …, xₘ ↦ tₘ}
Γ″ = Γ ⊕ {x₁ ↦ close t₁ Γ, …, xₘ ↦ close tₘ Γ}

Thus, all those type variables in the type terms t₁, …, tₘ are generalized that do not
occur in the types of the other variables of the type environment that is visible in the
main expression. Note that the types of the recursive occurrences of the variables
xᵢ in the right sides e₁, …, eₘ must not be instantiated: type systems permitting
such polymorphic recursion are in general undecidable.
We now modify algorithm W such that it maintains type schemes in its type
environments. For the case of variables, we need the following auxiliary function,
which instantiates the type term of a type scheme with fresh type variables:

fun inst (∀ α₁, …, αₖ. t) =
    let β₁ = new()
    …
    in let βₖ = new()
    in t[β₁/α₁, …, βₖ/αₖ]

Then we modify algorithm W for variables and let-expressions as follows:

…
| x → (inst (θ(Γ x)), θ)
| (LET x₁ = e₁ IN e₀)
    → let (t₁, θ) = W e₁ (Γ, θ)
      in let s₁ = close (θ t₁) (θ ∘ Γ)
      in let Γ = Γ ⊕ {x₁ ↦ s₁}
      in let (t₀, θ) = W e₀ (Γ, θ)
      in (t₀, θ)
Correspondingly, we modify algorithm W for letrec-expressions. We must take
care that the inferred types for the newly introduced variables are generalized only
for their occurrences in the main expression:

| (LETREC x₁ = e₁ AND … AND xₘ = eₘ IN e₀)
    → let α₁ = new()
      …
      in let αₘ = new()
      in let Γ′ = Γ ⊕ {x₁ ↦ α₁, …, xₘ ↦ αₘ}
      in let (t₁, θ) = W e₁ (Γ′, θ)
      in let θ = unify (α₁, t₁) θ
      …
      in let (tₘ, θ) = W eₘ (Γ′, θ)
      in let θ = unify (αₘ, tₘ) θ
      in let s₁ = close (θ t₁) (θ ∘ Γ)
      …
      in let sₘ = close (θ tₘ) (θ ∘ Γ)
      in let Γ′ = Γ ⊕ {x₁ ↦ s₁, …, xₘ ↦ sₘ}
      in let (t₀, θ) = W e₀ (Γ′, θ)
      in (t₀, θ)
Example 4.2.9 Consider again the program of Example 4.2.8. Algorithm W derives
the type scheme ∀ β. β → β list for the function single. This type scheme is
instantiated for the two occurrences of single in the main expression with distinct
type variables β₁, β₂, which are then instantiated with the types int list and int,
respectively. Altogether, algorithm W derives the type int list list for the
let-expression. ⊓⊔
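In OCaml itself, this behavior can be reproduced directly (a small illustration of
Examples 4.2.8 and 4.2.9):

let single = fun y -> [y]       (* inferred scheme: 'a -> 'a list *)
let nested = single (single 1)  (* : int list list *)

(* without the generalization at the let, the two uses of single would
   force the unification of int with int list, i.e., a type error *)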
The extended algorithm W computes the most general type of an expression relative
to a type environment with type schemes for the global variables of the expression.
The instantiation of type schemes at all occurrences of variables allows us to
define polymorphic functions that can be applied to values of different types. Type
schemes also admit modular type inference, since for functions of other program parts,
only their type (scheme) must be known in order to derive the types of the expressions
in which they are used.

The possibility of instantiating the variables of type schemes differently at different
occurrences makes it possible to construct program expressions whose types are not
only of exponential, but even of doubly exponential size! Such examples, however,
are artificial and play no particular role in practical programming.
Variables whose values can be changed are sometimes useful even for essentially
functional programming. To study the problems for type inference resulting from
such modifiable variables, we extend our small programming language by references:

e ::= … | ref e | !e | (e₁ := e₂)
Example 4.2.10 A function new that returns a new value each time it is called
can elegantly be defined by means of references. Such a function is needed, for
example, to implement algorithm W. The empty tuple () is the only element of the
type unit. Assigning a value to a reference changes the contents of the reference
as a side effect; the assignment itself is an expression whose value is (). Since this
value is irrelevant, no dedicated variable is provided for it, but the anonymous
variable _ is used. ⊓⊔
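A minimal OCaml sketch of such a counter (the identifier new_ is chosen because
new is reserved in OCaml; the anonymous variable _ receives the irrelevant value ()
of the assignment, as described above):

let counter = ref 0

let new_ () =
  let _ = counter := !counter + 1 in   (* the assignment has the value () *)
  !counter

(* new_ () = 1, new_ () = 2, ... *)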
Type expressions are extended by the special type unit and by introducing ref as a
new unary type constructor:

t ::= … | unit | t ref

The rules for the new constructs are:

REF:     Γ ⊢ e : t
         ─────────────────────
         Γ ⊢ (ref e) : t ref

DEREF:   Γ ⊢ e : t ref
         ─────────────────
         Γ ⊢ (!e) : t

ASSIGN:  Γ ⊢ e₁ : t ref        Γ ⊢ e₂ : t
         ───────────────────────────────────
         Γ ⊢ (e₁ := e₂) : unit
These rules seem plausible. Interestingly, they are nonetheless incompatible with
polymorphism.

Example 4.2.11 Consider the program

let y = ref []
in let _ = y := 1 :: (!y)
in let _ = y := true :: (!y)
in 1

Type inference leads to no contradiction: for the variable y, it returns the type scheme
∀ α. α list ref. At run time, though, a list is constructed that contains the int-value
1 together with the boolean value true. The construction of lists with elements of
different types, however, should be prevented by the type system. ⊓⊔
The problem in Example 4.2.11 can be avoided if the types of modifiable values are
never generalized. This is ensured by the value restriction.

The set of value expressions contains all expressions without occurrences of
references and without function applications outside of a functional abstraction. In
particular, every function fun x → e is a value expression. The value restriction
demands that in an expression

let x = e₁ in e₀

the type of e₁ is generalized only if e₁ is a value expression.
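In OCaml, the effect of the value restriction can be observed directly. Since ref []
is not a value expression, its type is not generalized; current OCaml toplevels display
a so-called weak type variable instead:

let y = ref []          (* y : '_weak1 list ref — not generalized *)

let () = y := 1 :: !y   (* this use fixes the type: y : int list ref *)

(* y := true :: !y      would now be rejected by the type checker *)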
In OCAML, the function member has the type ∀ α. α → α list → bool. This follows
from OCAML's design choice of assuming equality to be defined for the values of all
types, even though for some types the equality test throws an exception. This is
different in the functional language SML: SML distinguishes equality types, which
provide equality, from arbitrary types. Function types, for example, do not provide
equality. In SML, the type variable α in the SML type of member may only be
instantiated with equality types. ⊓⊔
Name                 Operation
Equality types       (=) : α → α → bool
Comparison types     (≤) : α → α → bool
Printable types      to_string : α → string
Hashable types       hash : α → int

Here, (□) denotes the binary function corresponding to the binary infix operator □. ⊓⊔
∀ α₁ : S₁, …, αₘ : Sₘ. s

where S₁, …, Sₘ are finite sets of type classes, and s is a polymorphic type scheme
that thus may also contain generalized, but unconstrained, variables. A set S of type
classes that occurs as a constraint is also called a sort. When a sort S = {C}
consists of a single element only, we also omit the set brackets. For simplicity,
we assume that each type class is associated with a single operation only. In order to
declare a new type class C, the associated operation op_C must be specified together
with the type of the operation op_C:

class C where op_C : ∀ α : C. t

for some type expression t. The type scheme for op_C may contain exactly one
generic variable, which is qualified by C.
Declarations of classes are complemented by instance declarations. An instance
declaration for the class C specifies assumptions on the argument types of an
application of a type constructor b under which the resulting type is a member of the
class C, and it provides an implementation of the operator op_C:

inst b(S₁, …, Sₖ) : C
where op_C = e

An operator that has different implementations for different types is called
overloaded. The case where the type constructor b has no parameters corresponds
to base types.
Example 4.2.14 The class Eq, which collects all equality types, together with two
instance declarations for this class may look as follows:

class Eq where (=) : ∀ α : Eq. α → α → bool

The implementation of the equality for pairs is defined by means of the equalities for
the component types. Accordingly, the equality for lists refers to the equality for the
element type. The challenge for type inference is not only to check that the types
of expressions are compatible, but also to identify, for the different occurrences of an
operator, the corresponding correct implementation. ⊓⊔
A sort environment Σ maps type variables to sorts. A judgment

Σ ⊢ t : C

expresses that the type t belongs to the class C whenever each type variable α
occurring in t belongs to all classes of Σ(α).
For a given sort environment Σ, the set S[t] Σ of all classes to which t belongs
can be determined inductively over the structure of t. If t is a type variable α, then
S[t] Σ = Σ(α). If t is of the form b(t₁, …, tₖ) for some type constructor b of
arity k ≥ 0, then S[t] Σ is the set of all classes C for which an instance declaration
inst b(S₁, …, Sₖ) : C … has been provided with Sᵢ ⊆ S[tᵢ] Σ for all i.
If, on the other hand, for every class C and every type constructor there is at
most one instance declaration, then a sort constraint S required for the root of t can
be translated into sort constraints required for the subterms of t. This allows us to
determine minimal sort constraints for the variables occurring in t that must be
satisfied in order to make t a member of each class in S.
Example 4.2.15 Assume that the base types int and bool belong to the class Eq.
Then the types (bool, int list) and (bool, int list) list also belong to Eq.
The type bool → int does not belong to the class Eq as long as no instance
declaration for Eq and the type constructor → has been provided.
The type expression (α, int) list denotes types of the class Eq whenever α belongs
to the class Eq. ⊓⊔
In order to infer types for functional programs with class and instance declarations,
we may first ignore the constraints in type schemes and just derive Hindley–Milner
polymorphic types. In a second phase, we may then determine the sorts of each
type variable. The disadvantage of this procedure is that it only verifies the type
correctness of the program, while it remains unclear how the program is to be translated.

A better idea therefore consists in modifying polymorphic type inference by
means of algorithm W in such a way that, besides typing and sort information, it
also provides a translation of e into an expression e′ that makes the selection of
the right implementation of each operator explicit.
The translation provides, for every sort S, tables that contain, for every operator
op of a class in S, an implementation of op. The overloaded operator op_C
of the class C with the type scheme ∀ α : C. t is translated into a look-up α.op_C in a
table α that contains a corresponding component op_C. The goal of the translation
is to provide tables such that the right implementation of a given operator can be
looked up at every use of this operator. A variable f for which algorithm W
provides a type scheme ∀ α₁ : S₁, …, αₘ : Sₘ. s is therefore translated into a
function that receives m tables as extra actual parameters.

Thereby, θ⁻¹ Σ returns the minimal sort requirement Σ′ for the type variables
occurring in the image of θ that must be provided in order to satisfy the constraints given
by Σ, i.e., such that Σ′ ⊢ (θ α) : (Σ α) holds for all type variables α.
Example 4.2.16 Consider instance declarations that result in the following rules:

Eq list : Eq
Comp set : Eq

and assume that θ = {α ↦ β set}. Propagating the sort constraint Σ(α) = Eq for
the type variable α w.r.t. the type substitution θ to sort constraints for the type
variables occurring in the type expression θ α (here: just β) results in the sort
constraint

θ⁻¹ Σ = {β ↦ Comp}    ⊓⊔
For the implementation of the extended algorithm W, we also modify the auxiliary
functions close and inst.

A call sort_close (t, e) (Γ, Σ) for a type t and an expression e w.r.t. a type
environment Γ and a sort environment Σ makes all type variables of t generic that
occur neither in Γ nor in Σ, and makes all type variables of t constrained generic
that do not occur in Γ, but do occur in Σ. Besides the type scheme, the call additionally
returns, in a second component, the sort environment Σ from which all variables that
have been generalized in the type scheme are removed. As a third component, the
functional expression is returned that is obtained from e by abstracting the constrained
generic type variables of the type scheme as formal parameters:

sort_close (t, e) (Γ, Σ)
  = let α′₁, …, α′ₙ = free(t) \ (free(Γ) ∪ dom(Σ))
    in let s = ∀ α′₁, …, α′ₙ. t
    in let α₁, …, αₘ = (free(t) \ free(Γ)) ∩ dom(Σ)
    in let s = ∀ α₁ : Σ(α₁), …, αₘ : Σ(αₘ). s
    in let Σ = Σ \ {α₁, …, αₘ}
    in (s, Σ, fun α₁ → … fun αₘ → e)
The function sort_inst instantiates a type scheme with fresh type variables; besides
the instantiated type, it returns the sort constraints for the fresh variables, together
with the expression that applies x to the corresponding tables:

fun sort_inst (∀ α₁ : S₁, …, αₘ : Sₘ. s, x)
  = let t = inst s
    in let β₁ = new()
    …
    in let βₘ = new()
    in let t = t[β₁/α₁, …, βₘ/αₘ]
    in (t, {β₁ ↦ S₁, …, βₘ ↦ Sₘ}, x β₁ … βₘ)
Note that the transformation creates functional parameters only for type variables
that are constrained by sorts. The type variables that occur in output expressions
may later be further instantiated by the unification of type expressions. If a
variable α : S is substituted by a type expression t, then an S-table corresponding
to the type t is generated and substituted for the program variable α. The table is
generated by means of the transformation T:

T[β] S = β
T[b(t₁, …, tₘ)] S = forall C ∈ S
                     let op′_C = let d₁ = T[t_{i₁}] S_{C,i₁} in
                                 …
                                 let d_k = T[t_{iₖ}] S_{C,iₖ} in
                                 op_{C,b} d₁ … d_k
                     in {op_C = op′_C | C ∈ S}
…
| op_C → let β = new()
         in (t_C[β/α], Σ ⊕ {β ↦ C}, θ, β.op_C)
| x → let (t, Σ′, e′) = sort_inst (θ(Γ x), x)
      in (t, Σ ∪ Σ′, θ, e′)
| (LET x₁ = e₁ IN e₀)
    → let (t₁, Σ, θ, e₁′) = W e₁ (Γ, Σ, θ)
      in let e₁′ = T[e₁′] (θ, Σ)
      in let (s₁, Σ, e₁′) = sort_close (θ t₁, e₁′) (θ ∘ Γ, Σ)
      in let Γ = Γ ⊕ {x₁ ↦ s₁}
      in let (t₀, Σ, θ, e₀′) = W e₀ (Γ, Σ, θ)
      in let e′ = (LET x₁ = e₁′ IN e₀′)
      in (t₀, Σ, θ, e′)
where T[e] (θ, Σ) replaces each occurrence of a variable β in e with Σ(β) = S
by the table T[θ β] S. Type inference and transformation start with an empty sort
environment Σ₀ = ∅ and an empty type environment Γ₀ = ∅.
Consider an instance declaration

inst b(S₁, …, Sₖ) : C where op_C = e

and assume that type inference and transformation for the right side e return an
expression e′ together with a sort environment Σ such that for each constrained type
variable βᵢ,

Σ(βᵢ) ⊆ Sᵢ

holds. The implementation of the operator op_C for the type constructor b is then
given by:

op_{C,b} = fun β_{i₁} → … → fun β_{iₖ} → T[e′] (θ, Σ)
Example 4.2.17 Consider the implementations of equality for pairs and for lists.
According to their declarations, for every constrained type parameter an extra
argument is provided. After the transformation, the program variables β₁, β₂, and β
are fresh program variables that have been generated from type variables. Their
run-time values are tables that provide the actual implementations of the overloaded
operator (=). ⊓⊔
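The dictionary-passing translation described here can be mimicked in plain OCaml,
with records playing the role of the tables; the names eq, eq_int, eq_list, and
member below are illustrative:

(* an Eq "table" holds the implementation of equality for one type *)
type 'a eq = { eq : 'a -> 'a -> bool }

let eq_int : int eq = { eq = (fun x y -> x = y) }

(* the instance for lists: equality on lists from equality on elements *)
let eq_list (d : 'a eq) : 'a list eq =
  let rec eqs l1 l2 =
    match l1, l2 with
    | [], [] -> true
    | x :: xs, y :: ys -> d.eq x y && eqs xs ys
    | _ -> false
  in
  { eq = eqs }

(* an overloaded function receives the table as an extra parameter *)
let rec member d x = function
  | [] -> false
  | y :: ys -> d.eq x y || member d x ys

let _ = member (eq_list eq_int) [1; 2] [[1]; [1; 2]]   (* true *)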
In our implementation, the component name op occurs in all tables for sorts that
require an implementation of the operator op. In programming languages such
as OCAML, however, the components of different record types may not have the
same name. A practical solution therefore is to rename the components for op in
the different records once type inference and transformation have finished, and to
adapt the accesses to the components accordingly.
The invention of type classes is by no means the end of the story. The programming
language HASKELL has proven to be an ingenious test bed for various extensions
of the Hindley–Milner type system. HASKELL thus not only provides type classes,
but also type constructor classes. These conveniently allow us to deal with monads.
Monads have evolved into a central part of HASKELL, since they allow input/output
as well as various kinds of side effects to be realized in a purely functional way.
4.3 Attribute Grammars

Each attribute a is associated with a type τ_a, which determines the set of possible
values for the instances of this attribute. Consider a production p : X₀ → X₁ … Xₖ
with k ≥ 0 symbols occurring on its right side. To tell the different occurrences
of symbols in the production p apart, we number them from left to right. The left-side
nonterminal X₀ is denoted by p[0], and the ith symbol Xᵢ of the right side by
p[i] for i = 1, …, k. The attribute a of a symbol X has an attribute occurrence at
each occurrence of X in a production. The occurrence of the attribute a at the symbol
occurrence Xᵢ is denoted by p[i].a.
For every production, functional specifications are provided that describe how
attributes of occurring symbols are determined from the values of other attributes of
symbol occurrences of the same production. These specifications are called semantic
rules. In our examples, semantic rules are written in an OCAML-like programming
language. This has the extra advantage that the explicit specification of types can
be omitted.
A restricted instance of such a mechanism is already provided by standard LR-parser
generators such as YACC or BISON: here, each symbol of the grammar is equipped
with a single attribute, and for every production there is one semantic rule that
determines how the attribute of the nonterminal on the left side is computed from
the attributes of the symbol occurrences on the right side.
Example 4.3.1 Consider a CFG with the nonterminals E, T, F for arithmetic
expressions. The set of terminals consists of symbols for brackets and operators, and
of the symbols var and const, which represent int-variables and constants, respectively.
The nonterminals are equipped with an attribute tree that receives the internal
representation of the expression.

In order to determine the values of the attributes, we extend the productions of
the grammar by semantic rules as follows:

p₁ : E → E + T
     p₁[0].tree = Plus (p₁[1].tree, p₁[3].tree)
p₂ : E → T
     p₂[0].tree = p₂[1].tree
p₃ : T → T * F
     p₃[0].tree = Mult (p₃[1].tree, p₃[3].tree)
p₄ : T → F
     p₄[0].tree = p₄[1].tree
p₅ : F → const
     p₅[0].tree = Int (p₅[1].val)
p₆ : F → var
     p₆[0].tree = Var (p₆[1].id)
p₇ : F → ( E )
     p₇[0].tree = p₇[2].tree

In these rules, the constructors Plus, Mult, Int, and Var of a datatype of expression
trees have been applied. Furthermore, we assumed that the symbol const has an
attribute val containing the value of the constant, and that the symbol var has an
attribute id containing a unique identifier for the variable. ⊓⊔
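The attribute tree thus takes its values in a datatype of expression trees. In OCaml,
such a type might be declared as follows (an illustrative sketch):

type tree =
  | Plus of tree * tree    (* built by the semantic rule of p1 *)
  | Mult of tree * tree    (* built by the semantic rule of p3 *)
  | Int of int             (* from the attribute val of const  *)
  | Var of string          (* from the attribute id of var     *)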
Example 4.3.2 Consider again the grammar of Example 4.3.1. Following the
convention of indexing the different occurrences of the same symbol in a production,
the semantic rules are written as follows:

p₁ : E → E + T
     E[0].tree = Plus (E[1].tree, T.tree)
p₂ : E → T
     E.tree = T.tree
p₃ : T → T * F
     T[0].tree = Mult (T[1].tree, F.tree)
p₄ : T → F
     T.tree = F.tree
p₅ : F → const
     F.tree = Int (const.val)
p₆ : F → var
     F.tree = Var (var.id)
p₇ : F → ( E )
     F.tree = E.tree

The index is omitted if a symbol occurs only once in the production. If a symbol
occurs several times, the index 0 identifies an occurrence on the left side, while all
occurrences on the production's right side are indexed successively, starting
from 1. ⊓⊔
Fig. 4.6 An attributed node in the parse tree with its attributed successors. Instances of inherited
attributes are drawn as boxes to the left of syntactic symbols, instances of synthesized attributes
as boxes to the right of symbols. Red (darker) arrows show the information flow into the production
instance from the outside; yellow (lighter) arrows symbolize the functional dependences between
attribute instances that are given through the semantic rules associated with the production
For the inherited attribute instances at the root of the parse tree, no semantic rules
are provided by the grammar to compute their values. Here, the application must
provide meaningful values for their initialization.
The semantics of an attribute grammar determines, for each parse tree t of the
underlying CFG, which values the attributes at each node of t should have.

For each node n of t, let symb(n) denote the symbol of the grammar labeling
n. If symb(n) = X, then n is associated with the attributes in A(X). The attribute
a of the node n is addressed by n.a. Furthermore, we need an operator to navigate
from a node to its successors. Let n₁, …, nₖ be the sequence of successors of the
node n in the parse tree t. Then n[0] denotes the node n itself, and n[i] = nᵢ for
i = 1, …, k denotes the ith successor of n in the parse tree t.

If X₀ = symb(n), and if Xᵢ = symb(nᵢ) for i = 1, …, k are the labels of
the successors nᵢ of n, then X₀ → X₁ … Xₖ is the production p of the CFG that
has been applied at node n. From the semantic rules of this production p, semantic
definitions of the attributes at the nodes n, n₁, …, nₖ are generated by instantiating p
with the node n. The semantic rule

p[i].a = f (p[i₁].a₁, …, p[iᵣ].aᵣ)

of the production p thus yields the equation

n[i].a = f (n[i₁].a₁, …, n[iᵣ].aᵣ)

for the node n in the parse tree. Hereby, we assume that the semantic rules specify
total functions. For a parse tree t, let V(t) denote the set of all attribute instances of t.
The subset Vin(t) consisting of the inherited attribute instances at the root together
with the synthesized attribute instances at the leaves is called the set of input attribute
instances of t. Instantiating the semantic rules of the attribute grammar at all nodes
of t produces a system of equations in the unknowns n.a that has exactly one equation
for each attribute instance that is not an input attribute instance.
Let AES(t) be this attribute equation system. Now consider an assignment ρ of
values to the input attribute instances. If AES(t) is recursive (cyclic), it may have
several solutions or no solution (relative to ρ). If AES(t) is not recursive, then for
every assignment ρ of the input attribute instances, there is exactly one assignment
to the noninput attribute instances of the parse tree t such that all equations are
satisfied. Accordingly, the attribute grammar is called well-formed if the system of
equations AES(t) is not recursive for any parse tree t of the underlying CFG. In this
case, we define the semantics of the attribute grammar as the function that maps each
parse tree t and each assignment ρ of the input attribute instances to the assignment
of all attribute instances of t that agrees with ρ on the input attribute instances and
additionally satisfies all equations of the system AES(t).
In the following, we present some (fragments of) attribute grammars that solve
essential subtasks of semantic analysis. The first attribute grammar shows how the
types of expressions can be computed.
Example 4.3.3 (Type checking) The attribute grammar AG_types realizes type
inference for expressions containing assignments, nullary functions, the operators
+, −, *, /, as well as variables and constants of the types int or float, for a C-like
programming language with explicit type declarations for variables. The attribute
grammar has an attribute typ for the nonterminal symbols E, T, and F, and for the
terminal symbol const, which may take the values Int and Float. This grammar can
easily be extended to more general expressions with function application, component
selection in composite values, and pointers.

E → var '=' E
    E[1].env = E[0].env
    E[0].typ = E[0].env var.id
    E[0].ok  = let x = var.id
               in let τ = E[0].env x
               in (τ ≠ error) ∧ (E[1].typ ⊑ τ)
E → E aop T
    E[1].env = E[0].env
    T.env = E[0].env
    E[0].typ = E[1].typ ⊔ T.typ
    E[0].ok = (E[1].typ ⊑ float) ∧ (T.typ ⊑ float)

E → T
    T.env = E.env
    E.typ = T.typ
    E.ok = T.ok

T → T mop F
    T[1].env = T[0].env
    F.env = T[0].env
    T[0].typ = T[1].typ ⊔ F.typ
    T[0].ok = (T[1].typ ⊑ float) ∧ (F.typ ⊑ float)

T → F
    F.env = T.env
    T.typ = F.typ
    T.ok = F.ok

F → ( E )
    E.env = F.env
    F.typ = E.typ
    F.ok = E.ok

F → const
    F.typ = const.typ
    F.ok = true

F → var
    F.typ = F.env var.id
    F.ok = (F.env var.id ≠ error)

F → var ()
    F.typ = (F.env var.id) ()
    F.ok = (match F.env var.id
            with () → _ → true
            | _ → false)
The attribute env of the nonterminals E, T, and F is inherited, while all other attributes of the grammar AGtypes are synthesized. □
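To make the order ⊑ and the join ⊔ used in these rules concrete, here is a minimal OCaml sketch (not taken from the book): the two base types with Int ⊑ Float, plus an incomparable error element.

    type typ = Int | Float | Error

    (* leq a b implements a ⊑ b in the small type lattice Int < Float *)
    let leq a b =
      match a, b with
      | Error, _ | _, Error -> false
      | Int, _ -> true                 (* Int ⊑ Int and Int ⊑ Float *)
      | Float, Float -> true
      | Float, Int -> false

    (* join a b implements a ⊔ b, as in E[0].typ = E[1].typ ⊔ T.typ *)
    let join a b =
      match a, b with
      | Error, _ | _, Error -> Error
      | Float, _ | _, Float -> Float
      | Int, Int -> Int

With this reading, the condition (E[1].typ ⊑ float) of the rules corresponds to the test leq t1 Float.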
Attribute grammars refer to some underlying CFG. If, e.g., operator precedences have been coded into the grammar, a large number of chain rules may occur, forcing the values of attributes to be copied from the upper node to the single child (in the case of inherited attributes) or from the single child to the ancestor (in the case of synthesized attributes). This phenomenon can already be nicely observed for the attribute grammar AGtypes. Therefore, we introduce a convention for writing attribute grammars that reduces the overhead of specifying copying attribute values: if a production provides no semantic rule for an inherited attribute of a symbol on its right side, its value is implicitly copied from the inherited attribute of the same name of the left side; and if a production provides no semantic rule for a synthesized attribute of its left side, its value is implicitly copied from the (unique) synthesized attribute on the right with that particular name. The following examples use this convention, at least for chain productions of the form A → B.
Example 4.3.4 (Managing symbol tables) The attribute grammar AGscopes manages symbol tables for a fragment of a C-like imperative language with parameterless procedures. Nonterminals for declarations, statements, blocks, and expressions are associated with an inherited attribute env that contains the current symbol table.
The redeclaration of an identifier within the same block is forbidden, while it is allowed in a new block. To check this, a further inherited attribute same is used that collects the set of identifiers declared so far in the current block. The synthesized attribute ok signals whether all used identifiers are declared and used in a type-correct way.
This grammar contains only a minimal set of productions for the nonterminal symbol ⟨stat⟩. To obtain a more complete grammar, further productions for expressions like those in Example 4.3.3 are needed. If the programming language also contains type declarations, another attribute is required that manages the current type environment.
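Since the productions of AGscopes are not reproduced in this excerpt, the following OCaml sketch only illustrates the attribute domains just described, under invented names: env as a map from identifiers to declarations, same as the list of identifiers of the current block, and the operation ⊕ that appears further below.

    module Env = Map.Make (String)

    type decl = Var | Proc           (* a minimal declaration domain *)

    (* oplus env ds: env ⊕ ds adds the bindings ds, shadowing bindings
       of the same name stemming from outer blocks *)
    let oplus (env : decl Env.t) (ds : (string * decl) list) : decl Env.t =
      List.fold_left (fun e (x, d) -> Env.add x d e) env ds

    (* redeclaration within the same block is an error, shadowing is not *)
    let declare (x : string) (same : string list) =
      if List.mem x same then None else Some (x :: same)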
Since the given rules collect declarations from left to right, the use of a procedure before its declaration is excluded. This formalizes the intended scoping rule of the language, namely that the scope of a procedure declaration begins at the end of the declaration. Let us now change this scoping rule to allow the use of procedures from the beginning of the block in which they are declared. The modified attribute grammar reflecting the modified scoping rule is called AGscopes+. In the attribute grammar AGscopes+, the semantic rule for the attribute env is modified such that all procedures declared in a block are added to env already at the beginning of the block. The nonterminal ⟨block⟩ therefore receives an additional synthesized attribute procs, and the productions for the nonterminal ⟨block⟩ obtain the additional rules:

    ⟨block⟩ → ε
        ⟨block⟩.procs = ∅

The procedures collected in ⟨block⟩.procs are added to the environment ⟨block⟩.env in the productions that introduce new blocks. The attribute grammar AGscopes+ then has the following semantic rules:

    ⟨stat⟩ → { ⟨block⟩ }
        ⟨block⟩.env = ⟨stat⟩.env ⊕ ⟨block⟩.procs

The rest of the attribute grammar AGscopes+ agrees with the attribute grammar AGscopes. Note that the new semantic rules induce an interesting functional dependency: inherited attributes of a nonterminal on the right side of a production depend on synthesized attributes of the same nonterminal. □
Attribute grammars can be used to generate intermediate code or even code for machines like the virtual machine presented in the first volume, Wilhelm/Seidl: Compiler Design – Virtual Machines. The functions realizing code generation as described in that volume are defined recursively over the structure of programs. They use information about the program, such as the types of the identifiers visible in a program fragment, whose computation can be described by attribute grammars as well.
Example 4.3.5 We consider code generation for a virtual machine like the CMa in Wilhelm/Seidl: Compiler Design – Virtual Machines. The code generated for a boolean expression according to the attribute grammar AGbool should have the following properties:
- The generated code consists only of load instructions and conditional jumps. In particular, no boolean operations are generated.
- Subexpressions are evaluated from left to right.
- Of each subexpression, as well as of the whole expression, only the smallest subexpressions are evaluated that uniquely determine the value of the whole (sub)expression. So, each subexpression is left as soon as its value determines the value of its containing expression.
The following code is generated for the boolean expression (a ∧ b) ∨ ¬c with the boolean variables a, b, and c:

        load a
        jumpf l1      // jump-on-false
        load b
        jumpt l2      // jump-on-true
    l1: load c
        jumpt l3
    l2:               // continuation if the expression evaluates to true
    l3:               // continuation if the expression evaluates to false
The attribute grammar AGbool generates labels for the code of subexpressions, and it transports these labels to the atomic subexpressions from which the evaluation jumps to these labels. Each subexpression E and T receives in fsucc the label of its successor if the expression evaluates to false, and in tsucc the label of its successor if it evaluates to true. A synthesized attribute jcond describes the relation of the value of the whole (sub)expression to its rightmost identifier:
- If jcond has the value true for an expression, the value of the expression is the same as the value of its rightmost identifier. This identifier is the last one that is loaded during the evaluation.
- If jcond has the value false, the value of the expression is the negation of the value of its rightmost identifier.
Correspondingly, a load instruction for the last identifier is followed by a jumpt to the label in tsucc if jcond = true, and it is followed by a jumpf if jcond = false. This selection is performed by the function gencjump.
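The definition of gencjump is not reproduced in this excerpt; a minimal OCaml sketch that is consistent with the surrounding description might read as follows (the instruction type is invented here):

    type instr = Load of string | Jumpt of string | Jumpf of string

    (* gencjump b l: the conditional jump following the load of the last
       identifier; jumpt l if b = true, jumpf l if b = false *)
    let gencjump (b : bool) (l : string) : instr =
      if b then Jumpt l else Jumpf l

With this reading, the call gencjump (not jcond) l jumps to l exactly if the expression evaluates to false, which is how the function is used for conditional statements below.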
In the production for conditional statements, the inherited attributes tsucc and fsucc of the condition E correspond to the start addresses of the code for the then and the else parts of the conditional statement. The code for the condition ends in a conditional jump to the else part; it tests the condition E for the value false. Therefore, the function gencjump receives ¬jcond as its first parameter. We obtain:

    F → ( E )

    F → not F
        F[1].tsucc = F[0].fsucc
        F[1].fsucc = F[0].tsucc
        F[0].code  = F[1].code
        F[0].jcond = ¬ F[1].jcond

    F → var
        F.jcond = true
        F.code  = load var.id
Here, the infix operator ^ denotes the concatenation of code fragments. This attribute grammar is not in normal form: the semantic rule for the synthesized attribute code of the left side ⟨if_stat⟩ in the first production uses the inherited attributes tsucc and fsucc of the nonterminal E on the right side. The reason is that the two inherited attributes are computed using a function new() that generates a new label every time it is called. Since it implicitly changes a global state, calls to the function new() are, puristically viewed, not admitted in semantic rules of attribute grammars.
Here, at least two solutions are conceivable:
- The global state, that is, the counter of already allocated labels, is propagated through the parse tree in dedicated auxiliary attributes. Generating a new label then accesses these local attributes without referring to any global state. The disadvantage of this procedure is that the essential flow of computation within the attribute grammar is blurred by the auxiliary attributes.
- We do allow functions that access a global state, such as the auxiliary function new() in the example grammar. Then, however, we have to abandon normalization of some semantic rules, since function calls referring to the global state may not be duplicated. Furthermore, we must convince ourselves that distinct orders of attribute evaluation, while not always returning identical results, will at least always return acceptable results. □
4.4 The Generation of Attribute Evaluators

This section treats attribute evaluation, more precisely the evaluation of attribute instances in parse trees, and the generation of the corresponding evaluators. An attribute grammar defines for each parse tree t of the underlying CFG a system of equations AES(t), the attribute equation system. The unknowns of this system of equations are the attribute instances at the nodes of t. Let us assume that the attribute grammar is well-formed. In this case, the system of equations is not recursive and therefore can be solved by elimination methods. Each elimination step selects one attribute instance to be evaluated next, which must only depend on attribute instances whose values have already been determined. Such an attribute evaluator is purely dynamic if it does not exploit any information about the dependences in the attribute grammar. One such evaluator is described in the next section.
Demand-Driven Attribute Evaluation

A demand-driven evaluator computes the value of an attribute instance n.a only when this value is queried. Such a query first checks whether the instance has already received its value. If this is the case, the function returns the value that has already been computed. Otherwise the evaluation of n.a is triggered. This evaluation may in turn query the values of other attribute instances, whose evaluation is then triggered recursively. This strategy has the consequence that for each attribute instance in the parse tree, the right side of its semantic rule is evaluated at most once. The evaluation of attribute instances that are never demanded is avoided.
To realize this idea, all attribute instances that are not initialized are set to the value Undef before the first value query. Each attribute instance initialized with a non-Undef value d is set to the value Value d. For navigation in the parse tree we use the postfix operators [i] to go from a node n to its i-th successor; for i = 0 the navigation stays at n. Furthermore, we need an operator father that, when given a node n, returns the pair (n′, j) consisting of the father n′ of node n and the information in which direction, seen from n′, node n is to be found, i.e., which child of its father n′ the argument node is. To implement the function solve for the recursive evaluation, we need a function eval: if p is the production that was applied at node n and f is the right side of the semantic rule for the attribute occurrence p[i].a, then eval n (i, a) returns the value of f, where for each demanded attribute instance the function solve is called.
The function solve, in simultaneous recursion with eval, checks whether the attribute instance n.a in the parse tree already has a value. If this is the case, solve returns this value. If the attribute instance n.a does not yet have a value, i.e., is still labeled with Undef, the semantic rule for n.a is looked up.
If a is a synthesized attribute of the symbol at node n, a semantic rule for a is supplied by the production p applied at node n. The right side f of this rule is modified such that it does not directly attempt to access its argument attribute instances, but instead calls the function solve recursively for these instances at node n. If a value d for the attribute instance n.a is obtained, it is assigned to the attribute instance n.a and in addition returned as result.
If, on the other hand, a is an inherited attribute of the symbol at node n, the semantic rule for n.a is not supplied by the production at n, but by the production at the father of n. Let n′ be the father of n, and let n be the j′-th child of n′. If the production p′ is applied at node n′, the semantic rule for the attribute occurrence p′[j′].a is chosen. Its right side is again modified in the same way such that before any access to attribute values the function solve is called. The computed value is again stored in the attribute instance n.a and returned as result.
If the attribute grammar is well-formed, the demand-driven evaluator computes the correct value for each parse tree and each attribute instance. If the attribute grammar is not well-formed, the attribute equation systems of some parse trees may be recursive. If t is such a parse tree, there are a node n in t and an attribute a at n such that n.a depends, directly or indirectly, on itself, implying that the call solve n a may not terminate. To avoid nontermination, attribute instances are labeled with Called as soon as their evaluation has started, but not yet terminated. Furthermore, the function solve is modified to terminate and return some error value whenever it meets an attribute instance labeled with Called (see Exercise 8).
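The following OCaml sketch (with an invented representation of attribute instances) summarizes the resulting evaluator: each instance carries its state and its instantiated semantic rule, and the rule demands its arguments by calling solve itself.

    type state = Undef | Called | Value of int

    type instance = {
      mutable state : state;
      rule : unit -> int;   (* instantiated semantic rule; demands its
                               arguments by calling solve on them *)
    }

    let solve (inst : instance) : int =
      match inst.state with
      | Value d -> d              (* each right side is evaluated at most once *)
      | Called -> failwith "cycle: attribute grammar is not well-formed"
      | Undef ->
          inst.state <- Called;   (* evaluation has started *)
          let d = inst.rule () in
          inst.state <- Value d;
          d

    (* example: y demands x *)
    let x = { state = Undef; rule = (fun () -> 1) }
    let y = { state = Undef; rule = (fun () -> solve x + 41) }
    let () = assert (solve y = 42)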
Dynamic attribute evaluation does not exploit information about the attribute grammar to improve the efficiency of attribute evaluation. More efficient attribute-evaluation methods are possible if knowledge of the functional dependences within productions is taken into account. An attribute occurrence p[i].a in production p functionally depends on an occurrence p[j].b if p[j].b is an argument of the semantic rule for p[i].a. The production-local dependences determine the dependences in the system of equations AES(t). Based on the functional dependences, attributes can sometimes be evaluated according to statically determined visit sequences. The visit sequences guarantee that an attribute instance is only scheduled for evaluation when the argument instances for the corresponding semantic rule have already been evaluated. Consider again Fig. 4.6. Attribute evaluation requires a cooperation between the local computations at a node n and its successors n_1, ..., n_k, and those in the context of this production instance. A local computation of an instance of a synthesized attribute at a node n labeled with X_0 provides an attribute value to be used by local computations at the ancestor of n in the upper context. The computation of the value of an inherited attribute instance at the same node n takes place at the ancestor of n and may enable further evaluations according to the semantic rules of the production corresponding to n. A similar exchange of data takes place through the attribute instances at the nodes n_1, ..., n_k with the computations within the subtrees. To schedule this interaction of computations, global functional dependences between attributes are taken into account.
Fig. 4.7 The production-local dependence relation for the production ⟨block⟩ → ⟨stat⟩ ⟨block⟩ in AGscopes

Fig. 4.8 The production-local dependence relation for the production ⟨block⟩ → ⟨decl⟩ ⟨block⟩ in AGscopes

Fig. 4.9 The production-local dependence relations for the productions ⟨stat⟩ → { ⟨block⟩ } and ⟨block⟩ → ⟨decl⟩ ⟨block⟩ in AGscopes+. Here, the terminal leaves for the opening and closing brackets have been omitted
In attribute grammars in normal form, the arguments of semantic rules for defining occurrences are always applied attribute occurrences. Therefore, all paths in production-local dependence relations have length 1, and there are no cycles of the form (p[i].a, p[i].a). Adherence to normal form therefore simplifies some considerations.
The production-local dependences between attribute occurrences in productions induce dependences between the attribute instances in the parse trees of the grammar. Let t be a parse tree of the CFG underlying an attribute grammar. The individual dependence relation D(t) on the set I(t) of attribute instances of t is obtained by instantiating the production-local dependence relations of the productions applied in t: for each node n of t at which production p has been applied, the relation D(t) contains exactly the pairs (n[j].b, n[i].a) with (p[j].b, p[i].a) ∈ D(p).
Example 4.4.2 (Continuation of Example 4.3.4) The dependence relation for the parse tree of the statement { int x; x = 1; } according to the attribute grammar AGscopes is shown in Fig. 4.12. For simplicity, we assume that the nonterminal type directly derives the base type int, and that the nonterminal E for expressions directly derives the terminal const. □

Fig. 4.12 The individual dependence relation for the parse tree of { int x; x = 1; } according to attribute grammar AGscopes
A relation R on a set A is called cyclic if its transitive closure contains a pair (a, a); otherwise, R is called acyclic. An attribute grammar is called noncircular if all its individual dependence relations are acyclic. An individual dependence relation D(t) is acyclic if and only if the system of equations AES(t) introduced in Sect. 4.3.1 is not recursive. Attribute grammars satisfying the latter condition are called well-formed. Thus, an attribute grammar is well-formed if and only if it is noncircular.
Consider a parse tree t with root label X as in Fig. 4.13. The instances of the inherited attributes at the root are viewed as input to t, and the instances of the synthesized attributes at the root as output of t. The instance of d at the root (transitively) depends only on the instance of c at the root. If the value of the instance of c is known, an attribute evaluator can descend into t and return with the value for the instance of d, since there are no other dependences on instances external to t that do not pass through c. The instance of e at the root depends on the instances of a and b at the root. When both values are available, the evaluation of the instance of e can be triggered.

Fig. 4.13 Attribute dependences in a parse tree for X and the induced lower characteristic dependence relation

Fig. 4.14 Lower characteristic dependence relation for ⟨block⟩
The operation [[p]] takes the local dependence relation of production p and adds the instantiated dependence relations for the symbol occurrences of the right side. The transitive closure of this relation is computed and then projected onto the attributes of the left-side nonterminal of p. If production p is applied at the root of a parse tree t, and if the relations L_1, ..., L_k are the lower dependence relations for the subtrees under the root of t, the lower characteristic dependence relation for t is obtained by

    L_t(X) = [[p]](L_1, ..., L_k)

The sets L(X), X ∈ V, of all lower dependence relations for symbols X result as the least solution of the system of equations

    (L)   L(a) = {∅},  a ∈ V_T
          L(X) = { [[p]](L_1, ..., L_k) | p : X → X_1 ... X_k ∈ P, L_i ∈ L(X_i) },  X ∈ V_N

Here, V_T, V_N, and P are the sets of terminal and nonterminal symbols, and productions, respectively, of the underlying CFG. Each right side of these equations is monotonic in each unknown L(X_i) on which it depends. The set of all transitive binary relations over a finite set is finite; therefore the set of its subsets is also finite. Hence, the least solution of this system of equations, i.e., the set of all lower dependence relations for each X, can be determined iteratively. The sets L(X) of all lower dependence relations allow for an alternative characterization of noncircularity of attribute grammars. We have:
Lemma 4.4.1 For an attribute grammar, the following statements are equivalent:
1. For every parse tree t with root label X, the lower characteristic dependence relation L_t(X) is acyclic;
2. For each production p : X → X_1 ... X_k and all dependence relations L_i ∈ L(X_i), the relation obtained from the production-local dependence relation D(p) by adding the instantiated relations L_1, ..., L_k is acyclic. □

Since the sets L(X) are finite and can be effectively computed, the lemma provides us with a decidable characterization of well-formed attribute grammars.
In order to decide well-formedness, the sets L(X) of all lower dependence relations of the attribute grammar must be computed for all symbols X. These sets are finite, but their sizes may grow exponentially in the number of attributes. The check for noncircularity is thus practically feasible only if either the number of attributes is small or the symbols have only few lower dependence relations. In general, though, the exponential effort is inevitable, since the problem of checking an attribute grammar for noncircularity is EXPTIME-complete.
In many attribute grammars, a nonterminal X may have several lower characteristic dependence relations, but all of them are contained in one common transitive acyclic dependence relation.
Example 4.4.4 Consider the attribute grammar AGscopes+ from Example 4.3.4. For the nonterminal ⟨block⟩ there are the following lower characteristic dependence relations:

    (1) ∅
    (2) {(same, ok)}
    (3) {(env, ok)}
    (4) {(same, ok), (env, ok)}

The first three dependence relations are all contained in the fourth. □
To compute for each symbol X a transitive relation that contains all lower characteristic dependence relations for X, we set up the following system of equations over transitive relations:

    (R)   R(a) = ∅,  a ∈ V_T
          R(X) = ⊔ { [[p]](R(X_1), ..., R(X_k)) | p : X → X_1 ... X_k ∈ P },  X ∈ V_N

The partial order on transitive relations is the subset relation ⊆. Note that the least upper bound of a set S of transitive relations is not just their union. Instead we have

    ⊔ S = (∪ S)⁺

i.e., after taking the union of the relations, the transitive closure must be recomputed. For each production p, the operation [[p]] is monotonic in each of its arguments. Therefore, the system of equations possesses a least solution. Since there are only finitely many transitive relations over the set of attributes, this solution can again be determined by iteration. Let L(X), X ∈ V, and R(X), X ∈ V, be the least solutions of the systems of equations (L) and (R). By induction over the iterations of the fixed-point algorithm, it can be proved that for all X ∈ V,

    ∪ L(X) ⊆ R(X)

holds. We conclude that all lower characteristic dependence relations of the attribute grammar are acyclic if all relations R(X), X ∈ V, are acyclic.
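As an illustration, the least upper bound of transitive relations can be computed in a few lines. This naive OCaml sketch (not the book's implementation) represents a relation as a list of pairs:

    (* transitive closure: iterate composition until nothing new is added *)
    let rec closure (r : ('a * 'a) list) : ('a * 'a) list =
      let step =
        List.concat_map
          (fun (a, b) ->
            List.filter_map (fun (b', c) -> if b = b' then Some (a, c) else None) r)
          r
      in
      let r' = List.sort_uniq compare (r @ step) in
      if List.length r' = List.length r then r' else closure r'

    (* ⊔ S = (∪ S)⁺ for a list of relations S *)
    let lub (s : ('a * 'a) list list) = closure (List.concat s)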
Fig. 4.15 Attribute dependences in an upper tree fragment for X and the induced upper characteristic dependence relation
With statically computed visit sequences, no dynamic checks of which attribute instances have already been evaluated are required at evaluation time. The largest class of attribute grammars for which we describe the generation of attribute evaluators is the class of l-ordered or simple-multivisit attribute grammars. An attribute grammar is called l-ordered if there is a function T that maps each symbol X to a total order T(X) ⊆ A² on the set A of attributes of X that is compatible with all productions. This means that for each production p : X_0 → X_1 ... X_k of the underlying grammar,

    [[p]](T(X_1), ..., T(X_k)) ⊆ T(X_0)

holds. By comparing this inequality with the equation for the unknown X_0 in the system of equations (R) of the last section, we conclude that the total order T(X_0) contains the dependence relation R(X_0). Since T(X_0) is a total order and therefore acyclic, the attribute grammar is absolutely noncircular, and all lower characteristic dependence relations at X_0 are contained in T(X_0). Analogously, it can be shown that T(X_0) contains all upper characteristic dependence relations at X_0.
For the attribute grammar AGscopes+, for example, the following total orders are compatible with all productions:

    ⟨stat⟩    env → ok
    ⟨block⟩   procs → same → env → ok
    ⟨decl⟩    new
    E         env → ok
    var       id

□
The total order T(X) arranges the attributes of X into a sequence B_T(X), which can be factorized as

    B_T(X) = I_{X,1} S_{X,1} ... I_{X,r_X} S_{X,r_X}

where the I_{X,i} are sequences of inherited attributes from I(X) and the S_{X,i} are sequences of synthesized attributes from S(X) for all i = 1, ..., r_X, and furthermore I_{X,i} ≠ ε for i = 2, ..., r_X and S_{X,i} ≠ ε for i = 1, ..., r_X − 1.
Intuitively, this factorization of the sequence B_T(X) means that the synthesized attributes at each node of a parse tree labeled with X can be evaluated in at most r_X visits: at the first visit of the node, coming from the parent node, the values of the inherited attributes in I_{X,1} are available; at the return to the parent node, the values of the synthesized attributes in S_{X,1} have been evaluated. Correspondingly, at the i-th visit of the node, the values of the inherited attributes in I_{X,1}, ..., I_{X,i} are available, and the synthesized attributes in S_{X,i} are computed. A subsequence I_{X,i} S_{X,i} of B_T(X) is called a visit of X. To determine which evaluations may be performed during the i-th visit at a node n and at the successors of the node n, one considers the dependence relation D_T(p) for the production X_0 → X_1 ... X_k that is applied at n. Since the relation D_T(p) is acyclic, D_T(p) can be arranged into a linear order. In our case, we choose the order B_T(p), which can be factorized into visits. Altogether we obtain for the relation D_T(p) a visit sequence B_T(p) = B_{T,1}(p) ... B_{T,r_{X_0}}(p).
The i-th subsequence B_{T,i}(p) describes what happens during the i-th visit of a node n at which the production p : X_0 → X_1 ... X_k is applied. For each occurrence of inherited attributes of the X_j (j > 0) in the subsequence, the corresponding attribute instances are computed one after the other. After the computation of the inherited attribute instances listed for the i′-th visit of the j-th successor, this successor is recursively visited to determine the values of the synthesized attributes associated with the i′-th visit. When all those values of synthesized attributes of the successors are available that are directly or indirectly needed for the computation of the synthesized attributes of the i-th visit of the left side X_0, the values of these synthesized attributes are computed.
To describe the subsequence B_{T,i}(p) in an elegant way, we introduce the following abbreviations. Let w = a_1 ... a_l be a sequence of attributes of the nonterminal X_j. Then p[j].w = p[j].a_1 ... p[j].a_l denotes the associated sequence of attribute occurrences in p. The i′-th visit I_{X_j,i′} S_{X_j,i′} of the j-th symbol of the production p is denoted by the sequence p[j].(I_{X_j,i′} S_{X_j,i′}). The sequence B_{T,i}(p), interpreted as a sequence of attribute occurrences in p, has the form:

    B_{T,i}(p) = p[0].I_{X_0,i}
                 p[j_1].(I_{X_{j_1},i_1} S_{X_{j_1},i_1})
                 ...
                 p[j_r].(I_{X_{j_r},i_r} S_{X_{j_r},i_r})
                 p[0].S_{X_0,i}
The functions eval_{p,j,a} are used to generate a function solve_{p,i} from the i-th subsequence B_{T,i}(p) of production p.
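Schematically, the generated function executes the steps of B_{T,i}(p) in order. The following OCaml sketch (all names invented) makes this explicit; eval and visit are passed in as parameters so that the sketch stays self-contained:

    type step =
      | Inh of int * string    (* compute inherited attribute p[j].a, j >= 1 *)
      | Visit of int * int     (* perform the i'-th visit of the j-th child  *)
      | Syn of string          (* compute synthesized attribute p[0].a       *)

    (* eval n j a stores the value of p[j].a at node n; visit i' n j
       descends into the j-th child of n for its i'-th visit *)
    let solve_p_i ~eval ~visit steps n =
      List.iter
        (function
          | Inh (j, a) -> eval n j a
          | Visit (i', j) -> visit i' n j
          | Syn a -> eval n 0 a)
        steps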
For the production ⟨block⟩ → ⟨decl⟩ ⟨block⟩ of the attribute grammar AGscopes+, for example, we obtain the following total order:

    ⟨decl⟩.new →
    ⟨block⟩[1].procs →
    ⟨block⟩[0].procs →
    ⟨block⟩[0].same → ⟨block⟩[0].env →
    ⟨block⟩[1].same → ⟨block⟩[1].env → ⟨block⟩[1].ok →
    ⟨block⟩[0].ok

According to this total order, the evaluator for the attribute grammar AGscopes+ first descends into the subtree for the nonterminal ⟨decl⟩ to determine the value of the attribute new. Then the second subtree must be visited in order to determine the synthesized attribute procs of the nonterminal ⟨block⟩ on the right and then also on the left side of the production. For the second visit, the attributes procs, same, and env of the left side have already been computed. The evaluation then again descends into the subtree of the nonterminal ⟨block⟩ on the right side, during which the value of the synthesized attribute ok is determined. Then all values are available to determine the value of the synthesized attribute ok of the left side ⟨block⟩ of the production.
In the simpler attribute grammar AGscopes, the attribute procs is not necessary. There, a single visit suffices to evaluate all attributes, and a meaningful ordering is obtained along the same lines.
The evaluation orders in visit_i are chosen in such a way that the value of each attribute instance n[j′].b is computed before any attempt is made to read its value. The functions solve_{p,i} are simultaneously recursive with themselves and with the functions visit_i. For a node n, let get_prod n be the production that was applied at n, or Null if n is a leaf labeled with a terminal symbol or ε. If p_1, ..., p_m is the sequence of the productions of the grammar, the function visit_i dispatches on get_prod n and calls the function solve_{p_l,i} for the production p_l that was applied at n.
Candidate total orders are computed with the help of a system of equations (R′) over the transitive relations on attributes, ordered by the subset relation ⊆. Recall that the least upper bound of a set S of transitive relations is given by

    ⊔ S = (∪ S)⁺

The least solution of the system of equations (R′) exists, since the operators on the right sides of the equations are monotonic. The least solution can be determined by the iterative method that we used in Sect. 3.2.5 for the computation of the first_k sets. Termination is guaranteed since the number of possible transitive relations is finite.
Let R′(X), for X ∈ V, be the least solution of the system of equations. Each system T(X), X ∈ V, of compatible total orders is a solution of the system of equations (R′). Therefore, R′(X) ⊆ T(X) holds for all symbols X ∈ V. If there exists such a system T(X), X ∈ V, of compatible total orders, the relations R′(X) are all acyclic. The relations R′(X) are therefore a good starting point for constructing total orders T(X).
The construction attempts to obtain for each X a sequence with a minimal number of visits. For a symbol X with A(X) ≠ ∅, a sequence I_1 S_1 ... I_r S_r is computed, where I_i and S_i are sequences of inherited and synthesized attributes, respectively. All attributes listed so far are collected in a set D, which is initialized with the empty set. Let us assume that I_1, S_1, ..., I_{i−1}, S_{i−1} have already been computed and that D contains all attributes that occur in these sequences. Two steps are executed:
1. First, a maximal set of inherited attributes of X is determined whose elements are not in D and depend only on each other or on attributes in D. This set is topologically sorted, yielding the sequence I_i, and is then added to D.
2. Next, a maximal set of synthesized attributes is determined whose elements are not in D and depend only on each other or on attributes in D. This set is added to D, and a topologically sorted sequence of it is produced as S_i.
This procedure is iterated, producing further subsequences I_i S_i, until all attributes are listed, that is, until D equals the whole set A(X) of attributes of the symbol X.
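A compact OCaml sketch of this factorization loop might read as follows; all names are invented here, attribute sets are plain lists, and the topological sorting within each selected set is omitted:

    (* maximal subset m of pool with deps a ⊆ listed ∪ m for every a in m *)
    let maximal_closed deps pool listed =
      let rec shrink m =
        let m' =
          List.filter
            (fun a ->
              List.for_all (fun b -> List.mem b listed || List.mem b m) (deps a))
            m
        in
        if List.length m' = List.length m then m else shrink m'
      in
      shrink pool

    (* produce the factorization I1 S1 ... Ir Sr for one symbol X *)
    let factorize deps inh syn =
      let remove xs = List.filter (fun x -> not (List.mem x xs)) in
      let rec loop inh syn listed acc =
        if inh = [] && syn = [] then List.rev acc
        else
          let i_i = maximal_closed deps inh listed in
          let s_i = maximal_closed deps syn (listed @ i_i) in
          if i_i = [] && s_i = [] then failwith "not an ordered attribute grammar"
          else
            loop (remove i_i inh) (remove s_i syn)
              (listed @ i_i @ s_i) ((i_i, s_i) :: acc)
      in
      loop inh syn [] []

Concatenating the computed subsequences then yields the total orders described next.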
Let T′(X), X ∈ V, be the total orders on the attributes of the symbols X that are computed in this way. We call the attribute grammar ordered if the total orders T′(X), X ∈ V, are already compatible, that is, satisfy the system of equations (R′). In this method, the relations R′(X) are expanded one by one into total orders, without checking whether the added artificial dependences generate cycles in the productions. The price to be paid for the polynomial complexity of the construction therefore is a restriction of the expressivity of the accepted attribute grammars.
For the attribute grammars of Examples 4.3.4, 4.3.3, and 4.3.5 from Sect. 4.3.2, our method generates attribute evaluators that visit each node of a parse tree exactly once. For the attribute grammar AGscopes+, on the other hand, an evaluator is required that uses two visits. Several visits are also required when computing the assignment of types to identifier occurrences in JAVA: here, the body of a class must be traversed several times, because in JAVA methods may be called even though they are declared only later in the class body.
Parser-Directed Attribute Evaluation

In this section we consider classes of attribute grammars that are severely restricted in the kinds of attribute dependences they admit, but that are still useful in practice. The introductory Example 4.3.1 belongs to one of these classes. For attribute grammars in these classes, attribute evaluation can be performed in parallel with syntax analysis, directed by the parser. Attribute values are administered in a stack-like fashion, either on a dedicated attribute stack or together with the parser states on the parse stack. The construction of the parse tree, at least for the purpose of attribute evaluation, is unnecessary. Attribute grammars in these classes are therefore interesting for the implementation of highly efficient compilers for not-too-complicated languages. Since attribute evaluation is directed by the parser, the values of synthesized attributes at terminal symbols need to be provided by the scanner when the symbol is passed on to the parser.
L-Attributed Grammars

All parsers that we consider for directing attribute evaluation process their input from left to right. This suggests that attribute dependences going from right to left are not acceptable. The first class of attribute grammars that we introduce, the L-attributed grammars, excludes exactly such dependences. This class properly contains all grammars amenable to parser-directed attribute evaluation. It consists of those attribute grammars in normal form in which the attribute instances in each parse tree can be evaluated in one left-to-right traversal of the parse tree. Formally, we call an attribute grammar L-attributed (abbreviated L-AG) if, for each production p : X_0 → X_1 ... X_k of the underlying grammar, the occurrence p[j].b of an inherited attribute depends only on attribute occurrences p[i].a with i < j. Attribute evaluation in one left-to-right traversal can be performed using the algorithm of Sect. 4.4.3, which visits each node of the parse tree only once and visits the children of a node in a fixed left-to-right order. For a production p : X_0 → X_1 ... X_k, a function solve_p is generated. Here, I_X and S_X are the sets of inherited and synthesized attributes of symbol X, and the call eval_{p,j,a} n returns the value of the right side of the semantic rule for the attribute instance n[j].a. The visit of a node n is realized by a function visit, where the call get_prod n again returns the production that was applied at node n (or Null if n is a leaf); a sketch of this one-pass scheme is given below. The attribute grammars AGscopes, AGtypes, and AGbool of Examples 4.3.4, 4.3.3, and 4.3.5 are all L-attributed, where the last one is not in normal form.
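The following OCaml sketch of the one-pass scheme uses an invented node representation and rule interface; it is an illustration under these assumptions, not the book's definition:

    type node = {
      prod : int option;                    (* None for leaves *)
      children : node list;
      mutable attrs : (string * int) list;  (* stored attribute instances *)
    }

    (* eval n j a evaluates the right side of the semantic rule for p[j].a at
       node n; inh j and syn j list the attributes to compute at position j *)
    let rec visit ~inh ~syn ~eval (n : node) : unit =
      match n.prod with
      | None -> ()   (* leaf: synthesized attributes come from the scanner *)
      | Some _ ->
          List.iteri
            (fun idx c ->
              let j = idx + 1 in
              (* inherited attributes of the j-th child: only values to the
                 left of position j are read, which is L-attributedness *)
              List.iter (fun a -> c.attrs <- (a, eval n j a) :: c.attrs) (inh j);
              visit ~inh ~syn ~eval c)
            n.children;
          (* finally the synthesized attributes of the left side *)
          List.iter (fun a -> n.attrs <- (a, eval n 0 a) :: n.attrs) (syn 0)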
LL-Attributed Grammars

Let us consider the actions that are necessary for parser-directed attribute evaluation:
- When reading a terminal symbol a: receiving the synthesized attributes of a from the scanner;
- When expanding a nonterminal X: evaluating the inherited attributes of X;
- When reducing to X: evaluating the synthesized attributes of X.
An LL(k)-parser as described in Chapt. 3 can trigger these actions at the reading of a terminal symbol, at expansion, and at reduction, respectively. An attribute grammar in normal form is called LL-attributed
- if it is L-attributed, and
- if the underlying CFG is an LL(k)-grammar (for some k ≥ 1).
The property of an attribute grammar to be LL-attributed means that syntax analysis can be performed by an LL-parser and that, whenever the LL-parser expands a nonterminal, all arguments for its inherited attributes are available.
In Sect. 3.3 we described how to construct a parser for an LL(k)-grammar. This parser administers items on its pushdown, which describe productions together with the parts of their right sides that have already been processed. We now extend this PDA such that it maintains, for every item ρ for a production p : X_0 → X_1 ... X_k, a structure S(ρ) that may receive the values of all inherited attributes of the left side X_0 together with the values of all synthesized attributes of the symbols X_1, ..., X_k of the right side. If the dot of the item is in front of the i-th symbol, all values of the inherited attributes of X_0 as well as all values of the synthesized attributes of the symbols X_1, ..., X_{i−1} have already been computed.
Figure 4.16 visualizes the actions of LL parser-directed attribute evaluation. Assume that the item ρ corresponds to the production p : A → αXβ and that the dot is positioned behind the prefix α of length i − 1, which has already been processed. A shift-transition for a terminal X = a moves the dot of ρ across the symbol a. Additionally, the new attribute structure is obtained from S(ρ) by storing the values of the synthesized attributes of a as provided by the scanner.
Fig. 4.16 Actions of LL parser-directed attribute evaluation (expansion of a nonterminal B, and reduction according to B → γ), where I(A) and S(α) denote the sequences of the values of the inherited attributes of a symbol A and of the synthesized attributes of the symbols in α, respectively
LR-Attributed Grammars

We now present a method by which an LR-parser can direct the evaluation of attributes. An LR-parser maintains states on its pushdown. States consist of sets of items, possibly extended by lookahead sets. With each such state q we associate an attribute structure S(q). The attribute structure of the initial state is empty. For any other state q ∉ {q_0, f} with entry symbol X, the structure S(q) contains the values of the synthesized attributes of the symbol X. We extend the LR-parser with a (global) attribute structure I, which holds the value of each inherited attribute b, or ⊥ if the value of the attribute b is not available. Initially, the global attribute structure I contains the values of the inherited attributes of the start symbol.
The values of the synthesized attributes of a terminal symbol are made available by the scanner. Two problems need to be solved when the values of the attributes for the attribute structure S(q) of a state q are computed:
- The semantic rule by which the attribute values should be evaluated needs to be identified.
- The values of the attribute occurrences that are arguments of the semantic rule need to be accessed.
The values of the synthesized attributes of a nonterminal X_0 can be computed when the LR-parser makes a reduce-transition: the production p : X_0 → X_1 ... X_k by which the reduction to X_0 is performed is then known. To compute a synthesized attribute b of X_0, the semantic rule for the attribute occurrence p[0].b of this production is used. Before the reduction, a sequence q′ q_1 ... q_k of states is on top of the pushdown, where q_1, ..., q_k have the entry symbols X_1, ..., X_k of the right side of p. Let us assume that the values for the attribute structures S(q_1), ..., S(q_k) have already been computed. The semantic rule for a synthesized attribute of X_0 can then be applied by accessing the values for occurrences p[0].b of inherited attributes of the left side X_0 in I, and the values for occurrences p[j].b of synthesized attributes of X_j of the right side in S(q_j). At the reduce-transition, the values of the synthesized attributes of X_0 are thus computed for the state q = δ(q′, X_0) that is entered under X_0. Still unsolved is the question how the values of the inherited attributes of X_0 can be determined.
For the case that there are no inherited attributes, though, we have already obtained a method for attribute evaluation. An attribute grammar is called S-attributed if it has only synthesized attributes. Example 4.3.1 is such a grammar. Despite the restriction to synthesized attributes, one can, e.g., describe how trees for expressions are constructed. More generally, the computation of some semantic value can be specified by an S-attributed grammar. This mechanism is offered by parser generators such as YACC or BISON. Each S-attributed grammar is also L-attributed. If an LR-grammar is S-attributed, the attribute structures of the states can be maintained on the pushdown and thus allow us to determine the values of the synthesized attributes of the start symbol.
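To illustrate, here is a hedged OCaml sketch of one reduce step in this style (in the spirit of YACC semantic actions; the value type and all names are invented): the synthesized values of the right side are popped off the stack, and the value computed for the left side is pushed.

    type etree = Leaf of int | Node of char * etree * etree

    (* one reduce step for a production with k symbols on the right side:
       pop the k topmost values, apply the semantic action, push the result *)
    let reduce (action : etree list -> etree) (k : int) (stack : etree list) =
      let rec pop i acc st =
        if i = 0 then action acc :: st
        else
          match st with
          | v :: st' -> pop (i - 1) (v :: acc) st'
          | [] -> invalid_arg "reduce: stack underflow"
      in
      pop k [] stack

    (* semantic action for E -> E + T, building the expression tree;
       the middle value is the (dummy) value of the terminal '+' *)
    let on_add = function
      | [l; _; r] -> Node ('+', l, r)
      | _ -> invalid_arg "on_add"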
Attribute grammars with synthesized attributes alone are not expressive enough for more challenging compilation tasks. Even the inference of the types of expressions relative to a symbol table env in Example 4.3.3 requires an inherited attribute, which is passed down the parse tree. Our goal therefore is to extend the approach for S-attributed grammars to deal with inherited attributes as well. The LR-parser does not, in general, know the upper tree fragment in which the transport paths for inherited attribute values lie. If a grammar is left-recursive, the application of an arbitrary number of semantic rules may be required to compute the value of an inherited attribute. We observe, however, that the values of inherited attributes are often passed down unchanged through the parse tree. This is the case in the attribute grammar AGtypes of Example 4.3.3, which computes the type of an expression: the value of the attribute env is copied from the left side of each production to the attributes of the same name at the nonterminal occurrences on the right side. This is also the case in the production ⟨block⟩ → ⟨stat⟩ ⟨block⟩ of the attribute grammar AGscopes in Example 4.3.4, where the inherited attribute same of the left side is copied to the attribute of the same name of the nonterminal occurrence ⟨block⟩ on the right side, and the inherited attribute env of the left side is copied to the attributes of the same name at the nonterminal occurrences of the right side.
Formally, we call an occurrence p[j].b of an inherited attribute b at the j-th symbol of a production p : X_0 → X_1 ... X_k copying if there exists an i < j such that the following holds:
1. The semantic rule for p[j].b is p[j].b = p[i].b, or the right side of the semantic rule for p[j].b is semantically equal to the right side of the semantic rule for p[i].b; and
2. p[i].b is the last occurrence of the attribute b before p[j].b, that is, b ∉ A(X_{i′}) for all i < i′ < j.
Clearly, semantic equality of right sides is in general undecidable. In a practical implementation, though, it suffices to refer to syntactic equality instead. At least, this covers the important case where both p[j].b and p[i].b are copies of the same inherited attribute b of the left side.
In this sense, all occurrences of the inherited attribute env on the right sides of the attribute grammar AGtypes are copying. The same holds for the occurrences of the inherited attributes same and env of the attribute grammar AGscopes in the production ⟨block⟩ → ⟨stat⟩ ⟨block⟩.
Let us assume for a moment that the occurrences of inherited attributes in right sides were all copying. This would mean that the values of inherited attributes never change. Once the global attribute structure I contains the right value of an inherited attribute, it therefore would not need to be changed throughout the whole evaluation.
Sadly enough, certain occurrences of inherited attributes of L-attributed grammars are not copying. For a noncopying occurrence p[j].b of an inherited attribute b, the attribute evaluator needs to know the production p : X_0 → X_1 ... X_k and the position j in the right side of p in order to select the correct semantic rule for the attribute occurrence. We use a trick to accomplish this: a new nonterminal N_{p,j} is introduced with the single production N_{p,j} → ε. This nonterminal N_{p,j} is inserted before the symbol X_j in the right side of p. The nonterminal symbol N_{p,j} is associated with all inherited attributes b of X_j that are noncopying in p. Each attribute b of N_{p,j} is equipped with a semantic rule that computes the same value as the semantic rule for p[j].b. Note that the insertion of the auxiliary symbols N_{p,j_1}, ..., N_{p,j_r} affects the positions of the original symbol occurrences in the right side of production p.
Example 4.4.7 Consider the production ⟨block⟩ → ⟨decl⟩ ⟨block⟩ of the attribute grammar AGscopes of Example 4.3.4. The attribute occurrences ⟨block⟩[1].same and ⟨block⟩[1].env on the right side of the production are not copying. Therefore, a new nonterminal N is inserted before ⟨block⟩:

    ⟨block⟩ → ⟨decl⟩ N ⟨block⟩

The new nonterminal symbol N has the inherited attributes {same, env}. It does not need any synthesized attributes. In the transformed production, the attributes N.same and N.env are defined by the right sides that previously defined ⟨block⟩[1].same and ⟨block⟩[1].env, while ⟨block⟩[1].same and ⟨block⟩[1].env are now simply copied from the attributes of N of the same name. Since N has only inherited attributes, its production N → ε does not need any semantic rules. We observe that the inherited attributes same and env of the nonterminal ⟨block⟩ are both copying after the transformation. □
Insertion of the nonterminals N_{p,j} does not change the accepted language. It may, however, destroy the LR(k)-property; in Example 4.4.7 this is not the case. If the underlying context-free grammar is still an LR(k)-grammar after the transformation, we call the attribute grammar LR-attributed.
After the transformation, the inherited attributes at the new nonterminals N_{p,j} are the only noncopying occurrences of inherited attributes. At a reduce-transition for N_{p,j}, the LR-parser has identified the production p and the position j in the right side of p at which N_{p,j} has been inserted. At this reduction, the new value for the inherited attribute b can therefore be computed and stored in the global attribute structure I. The states q′ that the parser may reach by a transition under a nonterminal N_{p,j} are now associated with a dedicated attribute structure old(q′), which does not contain values of synthesized attributes. Instead, it stores the previous values of those inherited attributes in I that have been overwritten at the reduction. These previous values are required to reconstruct the original values of the inherited attributes, i.e., their values before the descent into the subtree for X.
Fig. 4.17 The reconstruction of the inherited attributes at a reduce-transition for a production X → γ with |γ| = 5 and δ(q′, X) = q. The attribute structures old(q_2) and old(q_4) contain the overwritten inherited attributes b and c of I
Let us consider in detail how the value of an inherited attribute b of the nonterminal N_{p,j} can be computed. Let p̄ : X → α N_{p,j} β be the production that results from the transformation applied to p, where α has length m. Before the reduce-transition for N_{p,j}, there is a sequence q′ q_1 ... q_m on top of the pushdown, where the states q_1, ..., q_m correspond to the occurrences of symbols in α. The evaluation of the semantic rule for the inherited attribute b of N_{p,j} therefore may access the values of the synthesized attributes of the symbols in α in the attribute structures of the states q_1, ..., q_m. The value of an inherited attribute a of the left side X, on the other hand, can be found in the global structure I, given that the attribute a has not been redefined by some N_{p,i} with i < j during the evaluation of the production p̄ so far. If, however, that has been the case, the original value of a has been recorded in the structure old(q_{i′}) of the state q_{i′} that corresponds to the first redefinition of a in the right side of p.
Let us consider in detail what happens at a reduce-transition for a transformed production p̄. Let N_{p,j_1}, ..., N_{p,j_r} be the sequence of new nonterminals that were inserted by the transformation into the right side of the production p, and let m be the length of the transformed right side. Before the reduce-transition, there is a sequence q′ q_1 ... q_m of states on top of the pushdown, where the states q_{j_1}, q_{j_2 + 1}, ..., q_{j_r + r − 1} correspond to the nonterminals N_{p,j_1}, ..., N_{p,j_r}. Using the attribute structures old(q_{j_1}), ..., old(q_{j_r + r − 1}), the values of the inherited attributes before the descent into the parse tree for X are reconstructed: if an attribute b occurs in no structure old(q_{j_i + i − 1}), then I already contains the correct value of b; otherwise the value of b is set to the value of b in the first structure old(q_{j_i + i − 1}) in which b occurs. This reconstruction of the global structure I for the inherited attributes is shown in Fig. 4.17. Once the former values in the structure I have been reconstructed, the semantic rules for the synthesized attributes of the left side X can be evaluated, where any required synthesized attribute of the i-th symbol occurrence of the right side of p̄ can be accessed in the attribute structure of q_i. In this way, the values for the attribute structure of δ(q′, X) can be determined.
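The reconstruction step can be summarized by the following OCaml sketch (attribute structures are represented as association lists; all names are invented): each inherited attribute is restored from the first old-structure in which it was saved, and keeps its current value in I otherwise.

    (* i_struct: the global structure I; olds: the structures old(q) of the
       inserted nonterminals, ordered from left to right in the right side *)
    let reconstruct (i_struct : (string * int) list)
                    (olds : (string * int) list list) : (string * int) list =
      let restore (b, v) =
        let rec first = function
          | [] -> (b, v)                  (* b was never overwritten *)
          | o :: rest ->
              (match List.assoc_opt b o with
               | Some v0 -> (b, v0)       (* first saved value wins *)
               | None -> first rest)
        in
        first olds
      in
      List.map restore i_struct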
The method we have presented thus enables LR-parsers not only to evaluate synthesized attributes by means of their pushdown, but also to maintain and update inherited attributes, given that the grammar is LR-attributed.
4.5 Exercises

1. Symbol Tables
What are the contents of the symbol table of the body of the procedure q after the declaration of the procedure r in Example 4.1.6?

2. Overloading
Consider the following operators:

    + : integer → integer
    + : real → integer
    + : integer × integer → integer
    + : real × real → real
    / : integer × integer → integer
    / : integer × integer → real
    / : real × real → real

Apply the algorithm of Sect. 4.1.3 in order to resolve the overloading in the assignment A ← 1/2 + 3/4 to the real variable A.

3. Type Inference
Apply the rules for type inference in order to infer the type of the following OCAML expression:

    L → A     L.z = A.z    A.c = 0
    A → sB    B.a = B.y    B.b = A.c    A.z = B.x
    A → tB    B.a = A.c    B.b = B.x    A.z = B.y
    B → u     B.x = B.a    B.y = B.b
    B → v     B.x = B.a    B.y = 0
Bibliographic Notes

The presentation of context conditions partly follows [68]. The data structures for symbol tables in Sect. 4.1.2 were independently invented by various compiler writers. An early source is [40].

References
1. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to Algorithms, 2nd edn. MIT Press and McGraw-Hill
2. Aho AV, Sethi R, Ullman JD (1986) Compilers: Principles, Techniques, and Tools. Addison Wesley
3. Alblas H (1991) Attribute evaluation methods. In: Alblas H, Melichar B (eds) Proc. International Summer School on Attribute Grammars, Applications and Systems, Springer, LNCS 545
4. Ammann U (1978) Error recovery in recursive descent parsers and run-time storage organization. Report No. 25, Institut für Informatik, ETH Zürich
5. Baars AI, Swierstra SD, Viera M (2010) Typed transformations of typed grammars: The left
corner transform. Electr Notes Theor Comput Sci 253(7):51–64
6. Blum N (2010) On LR(k)-parsers of polynomial size. In: Abramsky S, Gavoille C, Kirchner C,
Meyer auf der Heide F, Spirakis PG (eds) ICALP (2), Springer, Lecture Notes in Computer
Science, vol 6199, pp 163–174
7. Bransen J, Middelkoop A, Dijkstra A, Swierstra SD (2012) The Kennedy-Warren algorithm revisited: ordering attribute grammars. In: Russo CV, Zhou NF (eds) PADL, Springer, Lecture
Notes in Computer Science, vol 7149, pp 183–197
8. Courcelle B (1984) Attribute grammars: Definitions, analysis of dependencies. In: [45]
9. Courcelle B (1986) Equivalences and transformations of regular systems—applications to pro-
gram schemes and grammars. Theoretical Computer Science 42:1–122
10. Damas L, Milner R (1982) Principal type schemes for functional programs. In: 9th ACM
Symp. on Principles of Programming Languages, pp 207–212
11. Dencker P, Dürre K, Heuft J (1984) Optimization of parser tables for portable compilers. ACM
Transactions on Programming Languages and Systems 6(4):546–572
12. Deransart P, Jourdan M, Lorho B (1988) Attribute Grammars, Definitions, Systems and Bibli-
ography. Springer, LNCS 323
13. DeRemer F (1969) Practical translators for LR(k) languages. PhD thesis, Massachusetts Insti-
tute of Technology
14. DeRemer F (1971) Simple LR(k) grammars. Communications of the ACM 14:453–460
15. DeRemer F (1974) Lexical analysis. In: Bauer FL, Eickel J (eds) Compiler Construction, An Advanced Course, Springer, LNCS 21
16. Dijkstra EW (1961) Algol-60 translation. Tech. rep. MR 35, Stichting Mathematisch Centrum, Amsterdam, Rekenafdeling. Algol Bulletin, supplement nr. 10
17. Engelfriet J (1984) Attribute grammars: Attribute evaluation methods. In: [45]
18. Floyd RW (1963) Syntactic analysis and operator precedence. J ACM 10(3):316–333
19. Garrigue J (2004) Relaxing the value restriction. In: Kameyama Y, Stuckey PJ (eds) Proc. of
Functional and Logic Programming, 7th International Symposium, FLOPS 2004, Nara, Japan,
April 7–9, 2004, Springer, LNCS 2998, pp 196–213
20. Giegerich R, Wilhelm R (1978) Counter–one–pass features in one–pass compilation: a for-
malization using attribute grammars. Information Processing Letters 7(6):279–284
21. Hall CV, Hammond K, Jones SLP, Wadler P (1994) Type classes in HASKELL. In: Sannella D (ed) ESOP, Springer, LNCS 788, pp 241–256
22. Harrison MA (1983) Introduction to Formal Language Theory. Addison Wesley
23. Heckmann R (1986) An efficient ELL(1)-parser generator. Acta Informatica 23:127–148
24. Hindley JR (1969) The principal type scheme of an object in combinatory logic. Transactions
of the AMS 146:29–60
25. Hopcroft J, Ullman JD (1979) Introduction to Automata Theory, Languages and Computation.
Addison-Wesley
26. Lewis PM II, Stearns RE (1966) Syntax directed transduction. In: 7th Annual IEEE Symposium on Switching and Automata Theory, pp 21–35
27. Lewis PM II, Stearns RE (1968) Syntax directed transduction. Journal of the ACM 15:464–488
28. Jazayeri M, Ogden WF, Rounds WC (1975) The intrinsically exponential complexity of the
circularity problem for attribute grammars. Communications of the ACM 18(12):697–706
29. Johnson WL, Porter JH, Ackley SI, Ross DT (1968) Automatic generation of efficient lexical
analyzers using finite state techniques. Communications of the ACM 11(12):805–813
30. Jones MP (1995) A system of constructor classes: Overloading and implicit higher-order poly-
morphism. J Funct Program 5(1):1–35
31. Jones SP, Jones MP, Meijer E (1997) HASKELL type classes: an exploration of the design space. In: Proceedings of the 2nd HASKELL Workshop
32. Jourdan JH, Pottier F, Leroy X (2012) Validating LR(1) parsers. In: Seidl H (ed) ESOP, Springer,
Lecture Notes in Computer Science, vol 7211, pp 397–416
33. Kannapinn S (2001) Eine Rekonstruktion der LR-Theorie zur Elimination von Redundanz mit Anwendung auf den Bau von ELR-Parsern. PhD thesis, Fachbereich 13 – Informatik
34. Kastens U (1980) Ordered attribute grammars. Acta Informatica 13(3):229–256
35. Kennedy K, Warren SK (1976) Automatic generation of efficient evaluators for attribute gram-
mars. In: Proc. 3rd ACM Symp. on Principles of Programming Languages, pp 32–49
36. Knuth DE (1965) On the translation of languages from left to right. Information and Control
8:607–639
37. Knuth DE (1968) Semantics of context-free languages. Math Systems Theory 2:127–145
38. Knuth DE (1971) Semantics of context-free languages: correction. Math Systems Theory 5, pp 95–96
39. Knuth DE (1977) A generalization of Dijkstra's algorithm. Information Processing Letters 6(1):1–5
40. Krieg B (1971) Formal definition of the block concept and some implementation models. MS thesis, Cornell University
41. Kühnemann A, Vogler H (1997) Attributgrammatiken. Eine grundlegende Einführung.
Vieweg+Teubner
42. Lesk M (1975) Lex – a lexical analyzer generator. CSTR 39, Bell Laboratories, Murray Hill, N.J.
43. Lewi J, DeVlaminck K, Huens J, Steegmans E (1982) A Programming Methodology in Com-
piler Construction, part 2. North Holland
44. Lipps P, Olk M, Möncke U, Wilhelm R (1988) Attribute (re)evaluation in the optran system.
Acta Informatica 26:213–239
45. Lorho B (ed) (1984) Methods and Tools for Compiler Construction. Cambridge University
Press
46. Mayer O (1986) Syntaxanalyse, 3. Aufl. Bibliographisches Institut
47. Milner R (1978) A theory of type polymorphism in programming. Journal of Computer and
System Sciences 17:348–375
48. Möncke U (1985) Generierung von Systemen zur Transformation attributierter Operatorbäume; Komponenten des Systems und Mechanismen der Generierung. PhD thesis, Informatik
49. Möncke U, Wilhelm R (1982) Iterative algorithms on grammar graphs. In: Proc. 8th Confer-
ence on Graphtheoretic Concepts in Computer Science, Hanser, pp 177–194
50. Möncke U, Wilhelm R (1991) Grammar flow analysis. In: Alblas H, Melichar B (eds) Attribute Grammars, Applications and Systems, Springer, LNCS 545
51. Neven F, den Bussche JV (1998) Expressiveness of structured document query languages
based on attribute grammars. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems, June 1–3, 1998, Seattle, Washington,
ACM Press, pp 11–17
52. Nielson HR (1983) Computation sequences: A way to characterize classes of attribute gram-
mars. Acta Informatica 19:255–268
53. Pager D (1977) Eliminating unit productions from LR parsers. Acta Inf 9:31–59
54. Pager D (1977) The lane-tracing algorithm for constructing LR(k) parsers and ways of enhanc-
ing its efficiency. Inf Sci 12(1):19–42
55. Pennello TJ, DeRemer F (1978) A forward move for LR error recovery. In: Proc. 5th ACM
Symp. on Principles of Programming Languages, pp 241–254
56. Pennello TJ, DeRemer F, Myers R (1980) A simplified operator identification scheme for ADA.
ACM SIGPLAN Notices 15(7,8):82–87
57. Pratt VR (1973) Top down operator precedence. In: Proceedings of the 1st annual ACM
SIGACT-SIGPLAN symposium on Principles of programming languages, pp 41–51
58. Saraiva J, Swierstra SD (2003) Generating spreadsheet-like tools from strong attribute gram-
mars. In: Pfenning F, Smaragdakis Y (eds) GPCE, Springer, Lecture Notes in Computer
Science, vol 2830, pp 307–323
59. Sippu S, Soisalon-Soininen E (1990) Parsing Theory. Vol.1: Languages and Parsing. Springer
60. Sippu S, Soisalon-Soininen E (1990) Parsing Theory. Vol.2: LR(k) and LL(k) Parsing. Springer
61. Tarjan RE, Yao ACC (1979) Storing a sparse table. Communications of the ACM 22(11):606–611
62. Tomita M (1984) LR parsers for natural languages. In: 10th International Conference on Com-
putational Linguistics (COLING), pp 354–357
63. Tomita M (1985) An efficient context-free parsing algorithm for natural languages. In: Inter-
national Joint Conference on Artificial Intelligence (IJCAI), pp 756–764
64. Van De Vanter ML (1975) A formalization and correctness proof of the CGOL language
system (Master's thesis). Tech. Rep. MIT-LCS-TR-147, MIT Laboratory for Computer Science,
Cambridge, MA
65. Viera M, Swierstra SD, Swierstra W (2009) Attribute grammars fly first-class: How to do
aspect-oriented programming in Haskell. In: Hutton G, Tolmach AP (eds) ICFP, ACM,
pp 245–256
66. Wadler P, Blott S (1989) How to make ad-hoc polymorphism less ad-hoc. In: POPL, pp 60–76
67. Watt DA (1977) The parsing problem for affix grammars. Acta Informatica 8:1–20
68. Watt DA (1984) Contextual constraints. In: [45]
69. Wirth N (1978) Algorithms + Data Structures = Programs, Chapter 5. Prentice Hall
70. Wright AK (1995) Simple imperative polymorphism. Lisp and Symbolic Computation
8(4):343–355
Index
I
identifier, 3, 139
  applied occurrence of a, 140
  defining occurrence of a, 140
  hidden, 139
  identification, 145, 148
  visible, 139
indentation, 4
initial configuration, 58
initial state, 16, 57
input alphabet, 16, 57
instance declaration, 174
instruction selection, 8
interpretation
  abstract, 7
item
  complete, 59
  context-free, 59
  history, 59
  LR(k)-, 116
  valid, 105
item-pushdown automaton (IPDA), 59, 79

J
JAVA, 145, 155

K
keyword, 3, 38
Kleene star, 13

L
LALR(1), 121, 122
language, 49
  accepted, 17
  regular, 13
lattice
  complete, 68
lexical analysis, 3, 11
LL(k)-
  grammar, 79
  parser (strong), 87
LR(k), 112
LR(k)-item, 116

M
metacharacter, 14
middle-end, 1, 8
monad, 180

N
name space, 140
nonterminal, 47
  left recursive, 85
  productive, 53
  reachable, 55

O
optimization
  machine-independent, 7
overloading, 6, 152, 174
  resolution of, 155

P
panic mode, 97
parenthesis
  nonrecursive, 28
parse tree, 6, 43, 49
parser, 5, 43
  bottom-up, 44, 101
  deterministic, 64
  LALR(1)-, 121
  left-, 64
  LL-, 64
  LR-, 64
  LR(k)-, 102, 117, 118
  Pratt-, 136
  recursive-descent, 92
  right-, 64
  RLL(1)-, 92
  shift-reduce, 101
  SLR(1)-, 121
  top-down, 44
partial order, 68
partition, 26
  stable, 26
PASCAL, 145
polymorphism
  constrained, 173
pragma, 3, 4
prefix, 11, 13
  extendible, 97
  k-, 65
  reliable, 105
  viable, 45, 124
produces, 48
  directly, 48
production rule, 47
PROLOG, 145
pushdown automaton
  deterministic, 58
  item-, 59, 79, 103
  language of a, 58
  with output, 63
pushdown automaton (PDA), 57

Q
qualification, 146
R
reducing transition, 60
reduction
  required, 101
register allocation, 8
regular language, 13
rule, 159
  semantic, 181
run time, 6

S
scanner, 3
  generation, 29
  representation, 34
    compressed, 34
  states, 37
scope, 140
screener, 4, 36
semantic analysis, 6
semantics
  dynamic, 6
  static, 6
sentential form, 49
  left, 51
  right, 51
separator, 3
shifting transition, 60
SLR(1), 121
solution, 163
  most general, 163
sort, 174
sort environment, 175
source program, 3
start symbol, 47
start vertex, 18
state, 57
  actual, 58
  error, 23
  final, 57
  inadequate
    LALR(1)-, 122
    SLR(1)-, 122
  initial, 57
  LR(0)-inadequate, 110
step relation, 17
strategy
  first-fit, 35
string, 28
  pattern matching, 30
strong LL(k)-grammar, 84
subject reduction, 160
subset construction, 21
substitution
  idempotent, 163
subword, 13
suffix, 13
symbol, 3, 11, 48
  class, 3
  nonterminal, 47
  reserved, 4
  start, 47
  table, 148, 152
  terminal, 47
symbol class, 11, 12
syntactic analysis, 5
syntactic structure, 49
syntax
  abstract, 141
  concrete, 141
syntax analysis
  bottom-up, 101
  top-down, 77
syntax error, 43
  globally optimal correction, 45
  RLL(1)-, 97
syntax tree, 49

T
table
  action-, 117, 118
  goto-, 117
target program, 8
terminal, 47
  anchor, 97
transformation phase, 1
transition, 16, 58
  ε-, 58
  expanding, 60
  reducing, 60
  shifting, 60
transition diagram, 17
transition relation, 15, 57
tree
  ordered, 49
type, 140
  cast, 153
  class, 173
  consistency, 139
  consistent association, 6
  constructor, 180
  correctness, 6
  environment, 158
  judgment, 159
  scheme, 169
  variable, 160
type inference, 185
U
Unicode, 28
unification, 163
union problem
  pure, 74
unit
  lexical, 3

V
validity, 139, 144, 145
value restriction, 172
variable
  uninitialized, 139
variable-dependence graph, 75
visibility, 139, 144, 147

W
word, 12
  ambiguous, 50

Y
YACC, 181