Lexical Analysis
Marriette Katarahweire
February 12, 2020
CSC 3205: Compiler Design 1/33
Phases of a Compiler
Introduction
The analysis phase of a compiler breaks up a source program
into constituent pieces and produces an internal
representation for it, called intermediate code.
The synthesis phase translates the intermediate code into the
target program.
Analysis is organized around the "syntax" of the language to
be compiled.
The syntax of a programming language describes the proper
form of its programs.
The semantics of the language defines what its programs
mean; that is, what each program does when it executes.
For specifying syntax, we present a widely used notation:
context-free grammars, or BNF (Backus-Naur Form).
With the notations currently available, the semantics of a
language is much more difficult to describe than the syntax.
For specifying semantics, we shall therefore use informal
descriptions and suggestive examples.
Lexical Analysis - Basics
Lexical analysis or scanning is the process where the stream of
characters making up the source program is read from
left-to-right and grouped into tokens.
Tokens are sequences of characters with a collective meaning.
There are usually only a small number of tokens for a
programming language: constants (integer, double, char,
string, etc.), operators (arithmetic, relational, logical),
punctuation, and reserved words.
Lexical Analysis - Basics
Lexical means "pertaining to words".
Words in programming languages are objects such as variable
names, numbers, and keywords. These words are known as tokens.
A lexical analyzer (lexer) takes as input a string of individual
characters and divides this string into tokens.
The lexer also filters out whatever separates the tokens:
layout characters (spaces, newlines) and
comments.
Lexical Analysis - Basics
The lexical analyzer might recognize particular instances of tokens
such as:
3 or 255 for an integer constant token
"Fred" or "Wilma" for a string constant token
numTickets or queue for a variable token
Such specific instances are called lexemes. A lexeme is the actual
character sequence forming a token; the token is the general class
that a lexeme belongs to. Some tokens have exactly one lexeme
(e.g., the > character); for others, there are many lexemes (e.g.,
integer constants).
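The token/lexeme distinction can be made concrete with a small sketch. The token class names and patterns below (STRING, INTEGER, IDENTIFIER, GREATER) are assumptions chosen for illustration, not part of any particular language definition. The sketch pairs each lexeme with the token class it belongs to and drops whitespace.

```python
import re

# Hypothetical token classes and patterns, chosen only for illustration.
TOKEN_SPEC = [
    ("STRING",     r'"[^"\n]*"'),
    ("INTEGER",    r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("GREATER",    r">"),
    ("SKIP",       r"\s+"),          # whitespace separates tokens and is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs, skipping whitespace."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs

print(tokenize('numTickets > 255 "Fred"'))
```

Here GREATER has exactly one possible lexeme, while INTEGER and IDENTIFIER each have many.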
Lexical Analysis - Errors?
The scanner is tasked with determining that the input stream
can be divided into valid symbols in the source language, but
has no smarts about which token should come where.
Few errors can be detected at the lexical level alone because
the scanner has a very localized view of the source program
without any context.
The scanner can report about characters that are not valid
tokens (e.g., an illegal or unrecognized symbol) and a few
other malformed entities (illegal characters within a string
constant, unterminated comments, etc.)
It does not look for or detect garbled sequences, tokens out of
place, undeclared identifiers, misspelled keywords, or mismatched
types.
Lexical Analysis - Errors?
Example: does the following input generate any errors in the lexical
analysis phase?
int a double } switch b[2] =;
No. Every lexeme in the input is a valid token, and the scanner has
no idea how tokens are grouped. In the above
sequence, it returns b, [, 2, ] as four separate tokens, having no idea
they collectively form an array access.
Scanner Implementation
There are two primary methods for implementing a scanner.
The first is a program that is hard-coded to perform the
scanning tasks.
The second uses regular expressions and finite automata theory
to model the scanning process.
A "loop & switch" implementation consists of a main loop
that reads characters one by one from the input file and uses
a switch statement to process the character(s) just read. The
output is a list of tokens and lexemes from the source
program.
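A minimal sketch of the hard-coded "loop & switch" style follows. It uses an if/elif chain in place of a switch statement, and the keyword set, punctuation set, and token names (KEYWORD, IDENTIFIER, INTEGER, PUNCT) are assumptions made for illustration. The main loop inspects one character at a time and dispatches on its kind.

```python
KEYWORDS = {"int", "double", "switch"}    # assumed reserved words
PUNCTUATION = set("{}[]();=+-*/<>")        # assumed single-character tokens

def scan(source):
    """Hand-coded scanner: returns a list of (token, lexeme) pairs."""
    tokens = []
    i = 0
    while i < len(source):                 # the main loop
        ch = source[i]
        if ch.isspace():                   # layout characters are skipped
            i += 1
        elif ch.isalpha() or ch == "_":    # identifier or keyword
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            lexeme = source[start:i]
            kind = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
            tokens.append((kind, lexeme))
        elif ch.isdigit():                 # integer constant
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("INTEGER", source[start:i]))
        elif ch in PUNCTUATION:            # operators and punctuation
            tokens.append(("PUNCT", ch))
            i += 1
        else:                              # the only kind of error found here
            raise ValueError(f"illegal character {ch!r} at position {i}")
    return tokens

for token in scan("int a double } switch b[2] =;"):
    print(token)
```

The input is syntactically meaningless, yet the scanner accepts it: each lexeme is a valid token, and only a character outside the assumed sets (such as @) is rejected.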
Terminology
Token: A classification for a common set of strings
Examples: Identifier, Integer, Float, Assign, ....
Pattern: The rules that characterize the set of strings for a
token
Examples: [0-9]+
Lexeme: Actual sequence of characters that matches a
pattern and has a given token class.
Examples: Identifier: Name, Data, x; Integer: 34, 2, ....
Regular expression review
symbol: an abstract entity that we shall not define formally
(such as “point” in geometry). Letters, digits and
punctuation are examples of symbols.
alphabet: a finite set of symbols out of which we build larger
structures. An alphabet is typically denoted using the Greek
sigma Σ, e.g., Σ = {0, 1}.
string: a finite sequence of symbols from a particular
alphabet juxtaposed. For example: a, b, c are symbols and
abcb is a string.
empty string: denoted ε, the string consisting of zero
symbols.
Σ∗: the set of all possible strings that can be
generated from a given alphabet Σ. A formal language over Σ
is any subset of Σ∗.
Regular Expressions
Rules that define exactly the set of words that are valid tokens
in a formal language.
The rules are built up from three operators:
Concatenation: xy (x followed by y)
Union/alternation: x|y (x or y)
Closure/repetition: x∗ (x repeated 0 or more times)
Operations on Languages
The (Kleene) closure of a language L, denoted L∗, is the set
of strings you get by concatenating L zero or more times. If
Σ = {a, b}, then Σ∗ = {ε, a, b, aa, ab, ba, bb, ...}.
The positive closure, denoted L+, is the set of strings you get
by concatenating L one or more times. If
Σ = {a, b}, then Σ+ = {a, b, aa, ab, ba, bb, ...}.
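Both closures are infinite sets, but their first few members can be enumerated. The sketch below lists every string of Σ∗ (or Σ+) up to a chosen length; the function name closure_up_to and its parameters are hypothetical, introduced only for this illustration.

```python
from itertools import product

def closure_up_to(alphabet, max_len, positive=False):
    """Strings of Σ* (or Σ+ when positive=True) of length <= max_len, shortest first."""
    shortest = 1 if positive else 0        # Σ+ excludes the empty string ε
    result = []
    for n in range(shortest, max_len + 1):
        for combo in product(sorted(alphabet), repeat=n):
            result.append("".join(combo))  # "" stands for ε when n == 0
    return result

print(closure_up_to({"a", "b"}, 2))
print(closure_up_to({"a", "b"}, 2, positive=True))
```

The only difference between the two results is the empty string, which belongs to Σ∗ but not to Σ+.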
Regular Expressions
Formally, the set of regular expressions can be defined by the
following recursive rules:
Every symbol of Σ is a regular expression
ε is a regular expression
if r1 and r2 are regular expressions, so are
(r1), r1r2, r1|r2, r1∗
Nothing else is a regular expression.
Example
Let Σ = {a, b}
The regular expression a|b denotes the language {a, b}
(a|b) (a|b) denotes {aa, ab, ba, bb} , the language of all
strings of length two over the alphabet Σ . Another regular
expression for the same language is aa|ab|ba|bb
a∗ denotes the language consisting of all strings of zero or
more a's, that is, {ε, a, aa, aaa, ...}
(a|b)∗ denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a’s and b’s:
{ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for
the same language is (a∗b∗)∗
a|a∗ b denotes the language {a, b, ab, aab, aaab, ...}, that is,
the string a and all strings consisting of zero or more a’s and
ending in b.
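These claims can be checked mechanically for short strings. The sketch below uses Python's re module, whose syntax for |, * and parentheses coincides with the notation used here, to collect every string over {a, b} up to a fixed length that a pattern matches in full. The helper name language and the length bound of 4 are assumptions; agreement up to a bound illustrates the equivalences but does not prove them.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """Strings over `alphabet`, up to max_len long, that `pattern` matches entirely."""
    regex = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    }

assert language("a|b") == {"a", "b"}
assert language("(a|b)(a|b)") == language("aa|ab|ba|bb") == {"aa", "ab", "ba", "bb"}
assert language("a*") == {"", "a", "aa", "aaa", "aaaa"}
assert language("(a|b)*") == language("(a*b*)*")   # same strings up to length 4
assert language("a|a*b") == {"a", "b", "ab", "aab", "aaab"}
```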
Exercise
Give regular expressions for the following languages over the
alphabet {a, b}:
all strings beginning and ending in a
all strings with an odd number of a's
all strings without two consecutive a's
Regular Expressions
We can use regular expressions to define the tokens in a
programming language.
For example, an integer consists of one or more digits, so it
can be described by the regular expression below. The + is extended
regular expression syntax for 1 or more repetitions.
(0|1|2|3|4|5|6|7|8|9)+
Finite Automata Review
Once we have all our tokens defined using regular expressions, we
can create a finite automaton for recognizing them. A finite
automaton has:
A finite set of states, one of which is designated the initial
state or start state, and some (maybe none) of which are
designated as final states.
An alphabet Σ of possible input symbols
A finite set of transitions that specifies for each state and for
each symbol of the input alphabet, which state to go to next.
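The three components map directly onto a table-driven program. The sketch below encodes a two-state automaton for integers (one or more digits); the state names START and IN_INTEGER are hypothetical. As a simplification, the table is partial: a missing entry is treated as rejection rather than as a transition to an explicit error state.

```python
DIGITS = "0123456789"                       # the alphabet Σ for this automaton
START, IN_INTEGER = "start", "in_integer"   # hypothetical state names
FINAL_STATES = {IN_INTEGER}

# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {(START, d): IN_INTEGER for d in DIGITS}
TRANSITIONS.update({(IN_INTEGER, d): IN_INTEGER for d in DIGITS})

def accepts(text):
    """Run the automaton over `text` and report whether it ends in a final state."""
    state = START
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:                   # no transition defined: reject
            return False
    return state in FINAL_STATES

print(accepts("255"), accepts(""), accepts("25a"))
```

The empty string is rejected because the start state is not final, which matches the "one or more digits" requirement.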
Finite Automata
What is a regular expression for the FA above?
Define an FA that accepts the language of all strings that end in b
and do not contain the substring aa. What is a regular expression
for this language?
A regular expression and a simple finite automaton that recognizes
an integer.
(0|1|2|3|4|5|6|7|8|9)+
Sample FA for Pascal
An FA that recognizes a subset of tokens in the Pascal language
Sample FA for Pascal
The numbered/lettered states are final states.
The loops on states 1 and 2 continue to execute until a
character other than a letter or digit is read.
For example, when scanning temp := temp + 1; it would
report the first token at final state 1 after reading the : having
recognized the lexeme temp as an identifier token.
In an FA-driven scanner
The source program is read one character at a time beginning
with the start state. As we read each character, we move
from our current state to the next by following the appropriate
transition for that character. When we end up in a final state, we
perform an action associated with that final state.
For example, the action associated with state 1 is to first
check if the token is a reserved word by looking it up in the
reserved word list. If it is, the reserved word is passed to the
token stream being generated as output. If it is not a reserved
word, it is an identifier so a procedure is called to check if the
name is in the symbol table. If it is not there, it is inserted
into the table.
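The action described for state 1 can be sketched as a small function. The reserved word list below is an assumed subset of Pascal's, and the symbol table is modelled as a plain dictionary; both are illustrative choices rather than a description of any real compiler.

```python
RESERVED_WORDS = {"begin", "end", "if", "then", "while", "do"}   # assumed subset

def identifier_action(lexeme, symbol_table):
    """Action for the identifier final state: classify the lexeme and update the table."""
    if lexeme in RESERVED_WORDS:
        return ("RESERVED", lexeme)          # passed straight to the token stream
    if lexeme not in symbol_table:           # first occurrence: insert the name
        symbol_table[lexeme] = {"index": len(symbol_table)}
    return ("IDENTIFIER", lexeme)

table = {}
print(identifier_action("temp", table))
print(identifier_action("while", table))
print(identifier_action("temp", table))
print(table)
```

Reserved words never enter the symbol table, and a repeated identifier is inserted only once.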
In an FA-driven scanner...
Once a final state is reached and the associated action is
performed, we pick up where we left off at the first character
of the next token and begin again at the start state.
If we do not end in a final state or encounter an unexpected
symbol while in any state, we have an error condition.
For example, if we run "ASC@I" through the above FA, we
would error out of state 1.
Nondeterministic Finite Automata
Used to transform regular expressions into efficient programs.
A finite automaton is a machine that has a finite number of
states and a finite number of transitions between these states.
The automaton is nondeterministic when the choice of action is
not determined solely by looking at the current state and input.
Nondeterministic Finite Automata ...
A nondeterministic finite automaton consists of:
a set S of states.
One of these states, s0 ∈ S, is called the starting state of the
automaton
a subset F ⊆ S of the states are accepting states
a set T of transitions
Each transition t connects a pair of states s1 and s2 and is
labelled with a symbol, which is either a character c from the
alphabet Σ, or the symbol ε (an epsilon-transition).
A transition from state s to state t on the symbol c is written
as s^c t.
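An NFA of this form can be simulated by tracking the set of states it could be in after each input character. In the sketch below the transition set T is stored as a dictionary from (state, symbol) to a set of target states, with None standing for ε; this representation, the state numbers, and the example automaton (which accepts a|a∗b) are all assumptions made for illustration.

```python
EPSILON = None   # label used for epsilon-transitions in this sketch

def epsilon_closure(states, transitions):
    """All states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(text, start, accepting, transitions):
    """Track every state the NFA could be in; accept if any is accepting at the end."""
    current = epsilon_closure({start}, transitions)
    for c in text:
        moved = set()
        for s in current:
            moved |= transitions.get((s, c), set())
        current = epsilon_closure(moved, transitions)
    return bool(current & accepting)

# Example NFA for a|a*b: state 0 chooses between two branches via epsilon.
T = {
    (0, EPSILON): {1, 3},
    (1, "a"): {1},
    (1, "b"): {2},
    (3, "a"): {4},
}
F = {2, 4}

for word in ["a", "b", "aab", "", "aa", "ba"]:
    print(repr(word), nfa_accepts(word, 0, F, T))
```

The nondeterminism shows in state 0: without reading any input, the automaton may move to state 1 or state 3, so the simulation follows both.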
Converting a regular expression to an NFA
Construct an NFA compositionally from a regular expression,
i.e. construct the NFA for a composite regular expression from
the NFAs constructed from its subexpressions.
From each subexpression, construct an NFA fragment and then
combine these fragments into bigger fragments.
A fragment is not a complete NFA, so we complete the
construction by adding the necessary components to make a
complete NFA.
Converting a regular expression to an NFA
An NFA fragment consists of a number of states with
transitions between these and additionally two incomplete
transitions: One pointing into the fragment and one pointing
out of the fragment.
The incoming half-transition is not labelled by a symbol, but
the outgoing half-transition is labelled by either ε or an
alphabet symbol.
These half-transitions are the entry and exit to the fragment
and are used to connect it to other fragments or additional
“glue” states.
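The compositional idea can be sketched in code. The version below is a Thompson-style construction, which is a swapped-in variant: instead of half-transitions, every fragment gets its own explicit entry and exit state, and fragments are glued together with ε-transitions. The tuple encoding of regular expressions ("sym", "eps", "cat", "alt", "star") and all function names are assumptions made for this illustration.

```python
from itertools import count

EPSILON = None   # label used for epsilon-transitions in this sketch

def build(regex, transitions, fresh):
    """Add the fragment for `regex` to `transitions`; return its (entry, exit) states."""
    entry, exit_ = next(fresh), next(fresh)

    def edge(src, label, dst):
        transitions.setdefault((src, label), set()).add(dst)

    kind = regex[0]
    if kind == "sym":                        # a single alphabet symbol
        edge(entry, regex[1], exit_)
    elif kind == "eps":                      # the empty string
        edge(entry, EPSILON, exit_)
    elif kind == "cat":                      # r1 r2: chain the two fragments
        e1, x1 = build(regex[1], transitions, fresh)
        e2, x2 = build(regex[2], transitions, fresh)
        edge(entry, EPSILON, e1); edge(x1, EPSILON, e2); edge(x2, EPSILON, exit_)
    elif kind == "alt":                      # r1|r2: branch into either fragment
        e1, x1 = build(regex[1], transitions, fresh)
        e2, x2 = build(regex[2], transitions, fresh)
        edge(entry, EPSILON, e1); edge(entry, EPSILON, e2)
        edge(x1, EPSILON, exit_); edge(x2, EPSILON, exit_)
    elif kind == "star":                     # r*: skip the fragment or loop through it
        e1, x1 = build(regex[1], transitions, fresh)
        edge(entry, EPSILON, e1); edge(entry, EPSILON, exit_)
        edge(x1, EPSILON, e1); edge(x1, EPSILON, exit_)
    else:
        raise ValueError(f"unknown regex node {kind!r}")
    return entry, exit_

def closure(states, transitions):
    """States reachable from `states` through epsilon-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        for t in transitions.get((stack.pop(), EPSILON), set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def matches(regex, text):
    """Build the NFA for `regex`, then simulate it on `text`."""
    transitions = {}
    start, accept = build(regex, transitions, count())
    current = closure({start}, transitions)
    for c in text:
        moved = set()
        for s in current:
            moved |= transitions.get((s, c), set())
        current = closure(moved, transitions)
    return accept in current

# a|a*b written in the assumed tuple encoding
R = ("alt", ("sym", "a"), ("cat", ("star", ("sym", "a")), ("sym", "b")))
for word in ["a", "b", "aaab", "aa", ""]:
    print(repr(word), matches(R, word))
```

This variant creates more states than necessary, since every operator adds a fresh entry and exit; that keeps each case independent of the others at the cost of extra ε-transitions.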
Exercise
Give reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax
analysis) phases.
Assignment 1
Understanding of regular expressions, language definitions,
finite automata (nondeterministic and deterministic)
Able to convert an NFA to a DFA
Building an NFA from a regular expression
KMP algorithm for pattern matching in regular expressions
Lexical Analyzer Generator