07/23/2020
Lexical Analysis
Zakia Zinat Choudhury
Lecturer
Department of Computer Science & Engineering
University of Rajshahi
The Role of Lexical Analyzer
The lexical analyzer is the first phase of a compiler. It is also known as a scanner.
Its main job is to read the input characters of the source program, group
them into lexemes, and produce as output a sequence of tokens, one for each
lexeme in the source program.
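Figure: Interaction between the lexical analyzer and the syntax analyzer. The
lexical analyzer reads the source program, groups its characters into lexemes,
and passes tokens to the syntax analyzer, which in turn sends requests for
tokens back to the lexical analyzer.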
Zakia Zinat Choudhury, Lecturer, Dept. of CSE, RU
The Role of Lexical Analyzer
Lexical analyzers are sometimes divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of
the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner
produces the sequence of tokens as output.
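The scanning step described in (a) can be sketched as follows. This is a minimal illustration, assuming C-style /* ... */ comments; the function name scan is hypothetical, and the sketch does not handle comment markers that appear inside string literals.

```python
import re


def scan(source: str) -> str:
    """Delete C-style comments and compact whitespace runs into one space."""
    # Replace each comment with a single space so that the lexemes on
    # either side of it are not glued together.
    without_comments = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Compact every run of spaces, tabs and newlines into one space.
    return re.sub(r"\s+", " ", without_comments).strip()


print(scan("int  x; /* counter */\n\n x =   23;"))  # prints: int x; x = 23;
```

Lexical analysis proper, process (b), would then run on the cleaned-up text.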
Why are the lexical analysis and syntax analysis phases separated?
Simplicity of design is the most important consideration. The separation of
lexical and syntax analysis often allows us to simplify at least one of these
tasks.
Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing.
Compiler portability is enhanced. Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Functions of Lexical Analyzer
Converting the source code into a stream of tokens
Removing white space
Removing comments
Recognizing identifiers and keywords
Recognizing constants
Reporting an error when a lexeme does not match any pattern
Lexemes, Patterns and Tokens
Lexeme:
A lexeme is a sequence of characters in the source program that is matched by
the pattern for a token.
Pattern:
A pattern is a rule that describes the set of strings (lexemes) a token may take.
Token:
A token is a pair consisting of a token name and an optional attribute value.
The token name stands for the set of strings over the source alphabet that
match the token's pattern.
Typical tokens are:
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants
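The relationship between the three terms can be illustrated with a small sketch: each pattern is written as a regular expression, and a lexeme is classified by the first pattern that matches it in full. The keyword set and the patterns below are assumptions chosen for illustration, not the definition of any particular language.

```python
import re

# Hypothetical keyword set and patterns, for illustration only.
KEYWORDS = {"int", "if", "else", "while", "return"}
PATTERNS = [
    ("constant", r"\d+"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator", r"[+\-*/=<>]"),
    ("special symbol", r"[;,(){}]"),
]


def classify(lexeme: str) -> str:
    """Return the name of the token whose pattern matches the whole lexeme."""
    if lexeme in KEYWORDS:          # keywords look like identifiers, so check first
        return "keyword"
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return "no match"


for lexeme in ["int", "sum", "23", "+", ";"]:
    print(lexeme, "->", classify(lexeme))
```

Here sum and x1 are two different lexemes that match the same identifier pattern and therefore yield the same token name.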
Attributes for Tokens
When more than one lexeme can match a pattern,
the lexical analyzer must provide the subsequent
compiler phases with additional information about the
particular lexeme that matched.
For each lexeme, the lexical analyzer produces as
output a token of the form
(token-name, attribute-value)
❖ token-name is an abstract symbol that is used
during syntax analysis
❖ attribute-value points to an entry in the symbol
table for this token.
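A minimal sketch of the (token-name, attribute-value) pair, assuming the symbol table is a simple list and the attribute is a 1-based index into it. The names Token, symbol_table and make_id_token are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    name: str                  # abstract symbol used during syntax analysis
    attribute: Optional[int]   # symbol-table entry number, or None if unused


# Assumed representation: entry number i is stored at list index i - 1.
symbol_table: list[str] = []


def make_id_token(lexeme: str) -> Token:
    """Return an id token whose attribute points to the lexeme's table entry."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return Token("id", symbol_table.index(lexeme) + 1)


print(make_id_token("position"))   # Token(name='id', attribute=1)
print(make_id_token("rate"))       # Token(name='id', attribute=2)
print(make_id_token("position"))   # Token(name='id', attribute=1)
```

Both position and rate produce the token name id; only the attribute value tells later phases which lexeme was matched.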
Specification of Tokens
An alphabet is any finite set of symbols.
Typical examples of symbols are letters, digits,
and punctuation. The set {0,1} is the binary alphabet.
A string over an alphabet is a finite sequence of
symbols drawn from that alphabet.
For example, banana is a string of length six. The
empty string, denoted ε, is the string of length zero.
A language is any countable set of strings over some
fixed alphabet.
Abstract languages like Ø, the empty set, or {ε},
the set containing only the empty string, are languages
under this definition.
Terms for Parts of Strings
Prefix
Suffix
Substring
Proper prefixes, suffixes, substrings
Subsequence
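The terms listed above can be made concrete with a short sketch using the string banana. A prefix is obtained by removing zero or more symbols from the end of a string, a suffix by removing them from the beginning, and a substring by deleting a prefix and a suffix. The proper variants exclude the empty string and the string itself, and a subsequence is formed by deleting zero or more symbols that need not be consecutive. The function names are illustrative.

```python
def prefixes(s: str) -> set[str]:
    """Strings obtained by removing zero or more symbols from the end of s."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s: str) -> set[str]:
    """Strings obtained by removing zero or more symbols from the start of s."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s: str) -> set[str]:
    """Strings obtained by deleting a prefix and a suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts: set[str], s: str) -> set[str]:
    """Proper prefixes, suffixes or substrings: neither empty nor s itself."""
    return parts - {"", s}


def is_subsequence(t: str, s: str) -> bool:
    """True if t can be formed by deleting zero or more symbols of s."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the symbols of t must appear in s in the same order.
    return all(symbol in remaining for symbol in t)


print("ban" in prefixes("banana"))        # True
print("nana" in suffixes("banana"))       # True
print("nan" in substrings("banana"))      # True
print(is_subsequence("baan", "banana"))   # True
```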
Operations on Languages
Union of two languages L and M is written as
L ∪ M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
The Kleene closure of a language L is written as
L* = the set of strings formed by concatenating zero or more strings from L
The positive closure of a language L is written as
L+ = the set of strings formed by concatenating one or more strings from L
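These operations can be tried out on small finite languages represented as Python sets. Since L* and L+ are infinite whenever L contains a non-empty string, the closure function below is only a finite approximation that stops at an assumed bound max_power. All function names are illustrative.

```python
def union(L: set[str], M: set[str]) -> set[str]:
    """L U M: strings that are in L or in M."""
    return L | M


def concat(L: set[str], M: set[str]) -> set[str]:
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L: set[str], n: int) -> set[str]:
    """L^n: L concatenated with itself n times; L^0 is {empty string}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def closure(L: set[str], max_power: int, positive: bool = False) -> set[str]:
    """Union of L^i for i up to max_power, starting at 1 for L+ and 0 for L*."""
    result: set[str] = set()
    for i in range(1 if positive else 0, max_power + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
D = {"0", "1"}
print(sorted(union(L, D)))     # ['0', '1', 'a', 'b']
print(sorted(concat(L, D)))    # ['a0', 'a1', 'b0', 'b1']
print(sorted(closure(L, 2)))   # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The only difference between the two closures is that L* always contains the empty string, while L+ contains it only if L does.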
Regular Expressions
Regular expressions describe languages by defining a pattern for finite
strings of symbols. The language described may itself contain infinitely many strings.
The grammar defined by regular expressions is known as a regular grammar.
The language defined by a regular grammar is known as a regular language.
Regular Expressions’ Operations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) ∪ L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
Parentheses : (r) is a regular expression denoting L(r)
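The three operations can be tried with Python's re module, whose syntax includes union, concatenation and Kleene closure as written above (it also offers many extensions that are not used here). The sketch builds the pattern letter (letter | digit)* for identifiers; the names letter, digit and identifier are assumptions for illustration, and the character classes are shorthand for a union of single symbols.

```python
import re

letter = "[A-Za-z]"   # shorthand for the union a|b|...|z|A|...|Z
digit = "[0-9]"       # shorthand for the union 0|1|...|9

# Concatenation of letter with the Kleene closure of (letter | digit).
identifier = f"({letter})(({letter})|({digit}))*"


def matches(pattern: str, text: str) -> bool:
    """True if the whole text belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("(a)|(b)", "b"))      # union: True
print(matches("(a)(b)", "ab"))      # concatenation: True
print(matches("(a)*", ""))          # Kleene closure includes the empty string: True
print(matches(identifier, "x1"))    # True
print(matches(identifier, "1x"))    # False
```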
Example of Lexical Analyzer
Source program:
int position, rate, initial;
position = rate + initial * 60;
Symbol table (maintained by the Symbol Table Manager):
Serial no    Variable Name    Variable Type
1            position         int
2            rate             int
3            initial          int
Stream of tokens produced by the lexical analyzer:
<id,1> <=> <id,2> <+> <id,3> <*> <60>
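A minimal sketch of how such a token stream could be produced, assuming the symbol table has already been filled from the declaration and that only identifiers, integer constants and the operators = + - * / occur. The function tokenize and its regular expression are hypothetical. As in the stream shown above, the terminating semicolon is left out of the input.

```python
import re

# One alternative per token class; leading whitespace is skipped.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+\-*/]))"
)


def tokenize(statement: str, symbol_table: list[str]) -> str:
    """Return the token stream for statement as a space-separated string."""
    tokens = []
    statement = statement.rstrip()
    pos = 0
    while pos < len(statement):
        m = TOKEN_RE.match(statement, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}")
        if m.group("id"):
            lexeme = m.group("id")
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            # The attribute is the 1-based serial number in the symbol table.
            tokens.append(f"<id,{symbol_table.index(lexeme) + 1}>")
        elif m.group("num"):
            tokens.append(f"<{m.group('num')}>")
        else:
            tokens.append(f"<{m.group('op')}>")
        pos = m.end()
    return " ".join(tokens)


table = ["position", "rate", "initial"]
print(tokenize("position = rate + initial * 60", table))
# prints: <id,1> <=> <id,2> <+> <id,3> <*> <60>
```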
Example of Lexical Analyzer
1. int x1;
   x=23;
2. /*find the total value x and y*/
   int x, y, sum;
   sum = x + y ;
   printf("Total = %d\n", sum);
Lexical Error
A lexical error is a sequence of characters that does not match the
pattern of any token.
Lexical phase errors can be:
Spelling errors.
Identifiers or numeric constants that exceed the allowed length.
Appearance of illegal characters.
Removal of a character that should be present.
Replacement of a character with an incorrect character.
Transposition of two characters.
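Two of the error kinds listed above, illegal characters and over-long identifiers, are easy to detect mechanically. The sketch below is one possible approach under assumed rules: the length limit of 31 and the set of legal lexemes are hypothetical, and an illegal character is simply skipped so that scanning can continue.

```python
import re

MAX_IDENTIFIER_LENGTH = 31   # hypothetical limit, for illustration only
LEXEME_RE = re.compile(r"[A-Za-z_]\w*|\d+|[=+\-*/;,()]")


def find_lexical_errors(source: str) -> list[tuple[int, str]]:
    """Return (position, message) pairs for the lexical errors in source."""
    errors = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = LEXEME_RE.match(source, pos)
        if m is None:
            # No pattern matches here: report the character and skip it.
            errors.append((pos, f"illegal character {source[pos]!r}"))
            pos += 1
            continue
        lexeme = m.group()
        is_identifier = lexeme[0].isalpha() or lexeme[0] == "_"
        if is_identifier and len(lexeme) > MAX_IDENTIFIER_LENGTH:
            errors.append((pos, "identifier too long"))
        pos = m.end()
    return errors


print(find_lexical_errors("x = 2 @ y;"))   # [(6, "illegal character '@'")]
```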
Transition Diagrams
As an intermediate step in the construction of a
lexical analyzer, patterns are converted into
stylized flowcharts, called “transition diagrams”.
Transition diagrams have a collection of nodes
or circles, called states. Each state represents a
condition that could occur during the process of
scanning the input looking for a lexeme that
matches one of several patterns.
Edges are directed from one state of the
transition diagram to another.
Figure: Transition Diagram of Relational Operators
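A transition diagram translates almost directly into code: one branch per state, one test per outgoing edge. The sketch below assumes the six relational operators <, <=, <>, =, > and >=, which is a common textbook choice and not necessarily the exact set in the figure. The state numbers, the attribute names LT, LE, NE, EQ, GT and GE, and the function relop are illustrative.

```python
def relop(text: str, start: int = 0):
    """Simulate a transition diagram for relational operators.

    Returns (token_name, attribute, next_position), or None if no
    relational operator begins at position start.
    """
    state, pos = 0, start
    while True:
        c = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == 0:                  # start state
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ", pos + 1)
            elif c == ">":
                state = 6
            else:
                return None
        elif state == 1:                # "<" has been read
            if c == "=":
                return ("relop", "LE", pos + 1)
            if c == ">":
                return ("relop", "NE", pos + 1)
            return ("relop", "LT", pos)   # retract: c is not part of the lexeme
        elif state == 6:                # ">" has been read
            if c == "=":
                return ("relop", "GE", pos + 1)
            return ("relop", "GT", pos)   # retract: c is not part of the lexeme
        pos += 1


print(relop("<= b"))       # ('relop', 'LE', 2)
print(relop("a < b", 2))   # ('relop', 'LT', 3)
```

The two retracting returns correspond to accepting states where one character too many has been read, so the input position is not advanced past it.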
Finite Automata
A finite automaton (FA) is the simplest machine used to recognize patterns.
Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the labels
of their edges. A symbol can label several edges out of the same state, and
ε, the empty string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state and for each
symbol of the input alphabet, exactly one edge with that symbol leaving that
state.
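The difference between the two flavors shows up when they are simulated. The sketch below uses a standard example, the language (a|b)*abb over the alphabet {a, b}, with an NFA and an equivalent DFA written as transition tables. The tables and function names are assumptions for illustration, and this NFA has no ε-labelled edges, so no ε-closure step is needed.

```python
# NFA: a (state, symbol) pair maps to a SET of next states; pairs may be missing.
NFA = {
    (0, "a"): {0, 1},   # "a" labels two edges out of state 0
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

# DFA: every (state, symbol) pair maps to exactly ONE next state.
DFA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

START, ACCEPT = 0, 3


def nfa_accepts(text: str) -> bool:
    """Track the set of all states the NFA could be in after each symbol."""
    states = {START}
    for symbol in text:
        states = set().union(*(NFA.get((s, symbol), set()) for s in states))
    return ACCEPT in states


def dfa_accepts(text: str) -> bool:
    """Follow the single edge for each symbol (text must use only a and b)."""
    state = START
    for symbol in text:
        state = DFA[(state, symbol)]
    return state == ACCEPT


for word in ["abb", "aabb", "abba"]:
    print(word, nfa_accepts(word), dfa_accepts(word))
```

The DFA needs only one current state per input symbol, while the NFA simulation has to carry a set of states.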