Chapter2-Lexical Analysis
Compiler
• Compiler translates from one language to another
Strings/Files --(Lexing)--> Tokens --(Parsing)--> Abstract Syntax Trees
Interactions between the lexical analyzer
and the parser
Tokens, Patterns and Lexemes
• A pattern is a description of the form that the lexemes of a token may take
(the set of rules that defines a token).
• For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was
found in the source program.
Tokens, Patterns and Lexemes
cout << 3+2+3;
The following tokens are returned by the scanner to the parser, in this order:
Lexeme    Token
cout      <identifier, ‘cout’>
<<        <operator, ‘<<’>
3         <literal, ‘3’>
+         <operator, ‘+’>
2         <literal, ‘2’>
+         <operator, ‘+’>
3         <literal, ‘3’>
;         <punctuator, ‘;’>
Tokens
if (num1 == num2)
result = 1;
else
result = 0;
• Identifier:
- Identifiers are strings of letters, digits, and underscores, starting with a letter or an
underscore
num1, result, name20, _result, …..
• Integer:
- A non-empty string of digits
10, 89, 001, 00, …….
• Keyword:
- A fixed set of reserved words
if, else, for, while, ….
• Whitespace:
- A non-empty sequence of blanks, newlines, and tabs
Lexical Analysis
Strings/Files --(Lexing)--> Tokens <name, attribute> --(Parsing)--> Abstract Syntax Trees
• To the lexer, the input is one flat character string:
\tif (num1 == num2)\n\t\tresult = 1;\n\telse\n\t\tresult = 0;
R = ε
| ‘c’ where c ∈ Σ
| A+B where A, B are regular expressions over Σ
| AB where A, B are regular expressions over Σ
| A* where A is a regular expression over Σ
Lexical Analysis: Regular expressions
Σ = {0, 1}
$1^* = \bigcup_{i \ge 0} 1^i = \varepsilon + 1 + 11 + 111 + 1111 + \cdots$
• L(e) = M, where e is a regular expression and M is the set of strings it denotes
• L: regular_expression -> set of strings
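To make the mapping L(e) concrete, here is a minimal sketch that enumerates the strings of L(e) up to a length bound. The tuple encoding of expressions and the function name `language` are illustrative assumptions, not part of the slides.

```python
def language(e, max_len):
    """Return the strings of L(e) whose length is at most max_len.

    Expressions are nested tuples (a hypothetical encoding):
      ("eps",)  ("sym", c)  ("alt", A, B)  ("cat", A, B)  ("star", A)
    """
    kind = e[0]
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {e[1]} if max_len >= 1 else set()
    if kind == "alt":                      # L(A+B) = L(A) ∪ L(B)
        return language(e[1], max_len) | language(e[2], max_len)
    if kind == "cat":                      # L(AB) = {xy | x ∈ L(A), y ∈ L(B)}
        left = language(e[1], max_len)
        right = language(e[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                     # L(A*) = union of all powers of L(A)
        base = language(e[1], max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            # Append one more copy of A; keep only strings not seen before.
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


one_star = ("star", ("sym", "1"))
print(sorted(language(one_star, 4), key=len))  # ['', '1', '11', '111', '1111']
```

The empty string in the output plays the role of ε, so the printed list is the start of the series ε + 1 + 11 + 111 + 1111 + …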
- regular expression for the set of strings corresponding to all the single
digits
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
= [0-9]
letter_ = [a-zA-Z_]
identifier = letter_(letter_ + digit)*
Whitespace: a non-empty sequence of blanks, newlines, and tabs
whitespace = (‘ ’ + ‘\n’ + ‘\t’)⁺
Email-like addresses (⁺ means "at least one"):
letter⁺ ‘@’ letter⁺ ‘.’ letter⁺ ‘.’ letter⁺
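These definitions can be tried out with Python's `re` module. The sketch below is one possible transcription, not the slides' own code: Python writes union as `|` (the slides use `+`) and uses a postfix `+` for "at least one". The constant names, the helper `matches`, the choice `letter = [a-zA-Z]`, and the sample address are assumptions made for illustration.

```python
import re

DIGIT = r"[0-9]"
LETTER_ = r"[a-zA-Z_]"
IDENTIFIER = rf"{LETTER_}({LETTER_}|{DIGIT})*"   # letter_(letter_ + digit)*
INTEGER = rf"{DIGIT}+"                           # a non-empty string of digits
WHITESPACE = r"[ \n\t]+"                         # blanks, newlines, tabs
LETTER = r"[a-zA-Z]"                             # assumed: letter_ without "_"
EMAIL = rf"{LETTER}+@{LETTER}+\.{LETTER}+\.{LETTER}+"


def matches(pattern, text):
    """True if the WHOLE text belongs to the language of the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(IDENTIFIER, "name20"))          # True
print(matches(IDENTIFIER, "20name"))          # False: starts with a digit
print(matches(INTEGER, "001"))                # True
print(matches(EMAIL, "user@cs.example.edu"))  # True
```

`re.fullmatch` is used rather than `re.match` so that the test asks whether the string is in L(pattern), not whether some prefix is.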
Regular Expression
• At least one: AA* ≡ A⁺
• Option: A+ε ≡ A?
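R = R1 + R2 + R3 + …
• What if two prefixes of the input match?
If x1…xi ∈ L(R)
and x1…xj ∈ L(R), with i ≠ j
=> Choose the longest match (maximal munch).
• What if more than one rule matches?
‘if’ ∈ L(Keywords)
‘if’ ∈ L(Identifiers)
=> Choose the rule listed FIRST.
• What if no rule matches?
x1…xi ∉ L(R)
=> Make a regular expression for error strings and PUT IT LAST IN PRIORITY
(lowest priority).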
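The three resolution rules above (longest match, first-listed rule on ties, a lowest-priority error rule) can be sketched as a small scanner. The rule names, the operator and punctuator sets, and the single-character error rule are assumptions chosen to cover the `if`/`else` example; a real lexical specification would define its own.

```python
import re

# Rules in priority order: on a tie the earlier rule wins.
# The catch-all error rule is deliberately listed last.
RULES = [
    ("keyword", r"if|else|for|while"),
    ("identifier", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("integer", r"[0-9]+"),
    ("operator", r"==|=|<<|\+"),
    ("punctuator", r"[();]"),
    ("whitespace", r"[ \n\t]+"),
    ("error", r"."),
]
COMPILED = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in RULES]


def tokenize(text):
    """Split text into (token_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so among equally long matches the first-listed rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        lexeme = text[pos:pos + best_len]
        pos += best_len
        if best_name != "whitespace":
            tokens.append((best_name, lexeme))
    return tokens


print(tokenize("if (num1 == num2)\n\tresult = 1;"))
print(tokenize("iffy"))   # [('identifier', 'iffy')]: the longer match wins
print(tokenize("#"))      # [('error', '#')]: only the error rule matches
```

Because the error rule matches any single character, some rule always matches and the loop always advances by at least one character.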
• Regular expressions are a concise notation for string patterns
Finite Automata
• An accepting state
• A transition on input a
• A finite automaton that accepts only “a”: q0 --a--> q1, where q1 is the accepting state
• Input 001: q0 --0--> q0 --0--> q0 --1--> q1, all input consumed in q1: Accept
• Input 011: q0 --0--> q0 --1--> q1, no move for the remaining 1: Reject
Regular Expressions to non-deterministic
finite automata (NFA)
Lexical Specification --> Regular Expressions --> Non-deterministic Finite Automata (NFA) --> Deterministic Finite Automata (DFA) --> Table-driven Implementation of DFA
Regular Expressions to NFA
• For each kind of regular expression, define an equivalent NFA that accepts
exactly the same language as the language of a regular expression.
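• Notation: a box labelled M stands for the NFA for regular expression M
• For ε: the start state has an ε-move to the accepting state
• For input a: the start state has a transition on a to the accepting state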
Regular Expressions to NFA
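• Concatenation
• For RS: the accepting state of R has an ε-move to the start state of S
• Union
• For R + S: a new start state has ε-moves to the start states of R and S; the accepting states of R and S have ε-moves to a new accepting state
• Iteration
• For R*: a new start state has ε-moves to the start state of R and to a new accepting state; the accepting state of R has ε-moves back to the start state of R and to the new accepting state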
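The constructions for a single symbol, concatenation, union, and iteration can be sketched as methods that each return a (start, accept) pair of states. This is a minimal illustration under assumed names (`NFA`, `symbol`, `concat`, `union`, `star`, `accepts`); `None` is used here as the label of an ε-move.

```python
from itertools import count


class NFA:
    """edges[state] is a list of (label, target); label None is an ε-move."""

    def __init__(self):
        self.edges = {}
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def symbol(self, c):
        s, t = self.new_state(), self.new_state()
        self.add_edge(s, c, t)
        return s, t

    def concat(self, r, s):
        self.add_edge(r[1], None, s[0])       # accept of R --ε--> start of S
        return r[0], s[1]

    def union(self, r, s):
        start, accept = self.new_state(), self.new_state()
        for frag in (r, s):
            self.add_edge(start, None, frag[0])
            self.add_edge(frag[1], None, accept)
        return start, accept

    def star(self, r):
        start, accept = self.new_state(), self.new_state()
        self.add_edge(start, None, r[0])      # enter R
        self.add_edge(start, None, accept)    # or skip it (zero repetitions)
        self.add_edge(r[1], None, r[0])       # repeat R
        self.add_edge(r[1], None, accept)     # or leave
        return start, accept

    def closure(self, states):
        stack, seen = list(states), set(states)
        while stack:
            for label, dst in self.edges[stack.pop()]:
                if label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def accepts(self, frag, text):
        current = self.closure({frag[0]})
        for ch in text:
            moved = {dst for s in current
                     for label, dst in self.edges[s] if label == ch}
            current = self.closure(moved)
        return frag[1] in current


# Build the NFA for (0+1)(01)*
nfa = NFA()
first = nfa.union(nfa.symbol("0"), nfa.symbol("1"))
loop = nfa.star(nfa.concat(nfa.symbol("0"), nfa.symbol("1")))
whole = nfa.concat(first, loop)
print(len(nfa.edges), nfa.accepts(whole, "10101"), nfa.accepts(whole, "01"))
```

Each construction adds at most two states and a handful of ε-moves, which is why the resulting NFA stays small relative to the regular expression.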
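Regular Expressions to NFA
• Consider the regular expression (0+1)(01)* (state names A to L as in the complete NFA)
• For 0: B --0--> C
• For 1: D --1--> E
• For 0 + 1:
A --ε--> B --0--> C --ε--> F
A --ε--> D --1--> E --ε--> F
• For 01: H --0--> I --ε--> J --1--> K
• For (01)*:
G --ε--> H --0--> I --ε--> J --1--> K --ε--> L
G --ε--> L
K --ε--> H
• For (0+1)(01)*: the two parts joined by F --ε--> G; start state A, accepting state L
A --ε--> B --0--> C --ε--> F
A --ε--> D --1--> E --ε--> F
F --ε--> G --ε--> H --0--> I --ε--> J --1--> K --ε--> L
G --ε--> L
K --ε--> H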
NFA to DFA
• Simulate the NFA
• Each state of DFA
= a non-empty subset of states of the NFA
• Start state of DFA
= the set of NFA states reachable through ε-moves from the NFA start state
• Add a transition S --a--> S’ to DFA if
– S’ is the set of NFA states reachable from any
state in S after seeing the input a, considering ε-moves as well
• Final state of DFA
= any set that includes a final state of the NFA
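The rules above amount to the subset construction. The sketch below applies them to the NFA for (0+1)(01)* with states A to L; the dictionary encoding (the empty string labels an ε-move) and the function names are assumptions for illustration.

```python
from collections import deque

# NFA for (0+1)(01)*; the key (state, "") holds the ε-moves of that state.
NFA = {
    ("A", ""): {"B", "D"}, ("B", "0"): {"C"}, ("C", ""): {"F"},
    ("D", "1"): {"E"}, ("E", ""): {"F"}, ("F", ""): {"G"},
    ("G", ""): {"H", "L"}, ("H", "0"): {"I"}, ("I", ""): {"J"},
    ("J", "1"): {"K"}, ("K", ""): {"L", "H"},
}
START, FINAL, ALPHABET = "A", "L", "01"


def eps_closure(states):
    """All NFA states reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in NFA.get((stack.pop(), ""), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def subset_construction():
    start = eps_closure({START})
    dfa, seen, queue = {}, {start}, deque([start])
    while queue:
        S = queue.popleft()
        for a in ALPHABET:
            moved = set().union(*(NFA.get((s, a), set()) for s in S))
            if not moved:
                continue          # DFA states are non-empty subsets only
            T = eps_closure(moved)
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    finals = {S for S in seen if FINAL in S}
    return start, dfa, finals


start, dfa, finals = subset_construction()
for (S, a), T in dfa.items():
    print("".join(sorted(S)), f"--{a}-->", "".join(sorted(T)))
```

Only subsets actually reachable from the start state are generated, so this example yields five DFA states rather than all non-empty subsets of the twelve NFA states.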
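NFA to DFA
• NFA for (0+1)(01)* (start state A, accepting state L)
A --ε--> B --0--> C --ε--> F
A --ε--> D --1--> E --ε--> F
F --ε--> G --ε--> H --0--> I --ε--> J --1--> K --ε--> L
G --ε--> L
K --ε--> H
• Resulting DFA (start state ABD; accepting states CFGHL, EFGHL, KLH)
ABD --0--> CFGHL
ABD --1--> EFGHL
CFGHL --0--> IJ
EFGHL --0--> IJ
IJ --1--> KLH
KLH --0--> IJ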
Implementation of DFA
• A DFA can be implemented by a 2D table T
– One dimension is “states”
– Other dimension is “input symbol”
– For every transition Si --a--> Sk define T[i,a] = k

            Input symbols
             a     b
states  i    k
        j
        k
        l
Implementation of DFA
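• DFA for (0+1)(01)*
S0 --0--> S1, S0 --1--> S2
S1 --0--> S3, S2 --0--> S3
S3 --1--> S4, S4 --0--> S3
• Transition table (a blank entry means no transition)
       0     1
S0     S1    S2
S1     S3
S2     S3
S3           S4
S4     S3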
Implementation of DFA
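i = 0;
state = 0;                               /* start in S0 */
while (input[i]) {                       /* stop at the terminating '\0' */
    state = T[state][input[i++] - '0'];  /* column 0 for '0', column 1 for '1' */
}

       0     1
S0     S1    S2
S1     S3
S2     S3
S3           S4
S4     S3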
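As a runnable counterpart to the loop above, here is a minimal Python sketch of the table-driven DFA for (0+1)(01)*. Two things are assumptions added for the example: blank table entries are written as `None` and treated as rejection, and the accepting set {S1, S2, S4} is derived from the regular expression rather than read off the table.

```python
# Rows are states S0..S4; columns are the input symbols "0" and "1".
# None marks a blank table entry (no transition).
T = [
    [1, 2],        # S0
    [3, None],     # S1
    [3, None],     # S2
    [None, 4],     # S3
    [3, None],     # S4
]
ACCEPTING = {1, 2, 4}  # assumed: the states reached after a complete match


def run_dfa(text):
    """Return True if the DFA ends in an accepting state on `text`."""
    state = 0
    for ch in text:
        if ch not in ("0", "1"):
            return False              # symbol outside the alphabet
        state = T[state][int(ch)]     # one table lookup per input symbol
        if state is None:
            return False              # blank entry: the DFA is stuck
    return state in ACCEPTING


print(run_dfa("10101"))  # True
print(run_dfa("01"))     # False
```

The work per input character is a single array lookup, which is the speed advantage of the DFA; the cost is the size of the table.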
Implementation of DFA
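• DFA for (0+1)(01)*
S0 --0--> S1, S0 --1--> S2
S1 --0--> S3, S2 --0--> S3
S3 --1--> S4, S4 --0--> S3
• Full transition table
       0     1
S0     S1    S2
S1     S3
S2     S3
S3           S4
S4     S3
• Rows S1, S2 and S4 are identical, so a more compact layout stores that row once and lets all three states point to it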
Implementation of NFA
• Transition table for the NFA for (0+1)(01)* (start state A, accepting state L)
       0      1      ε
A                    {B, D}
B      {C}
C                    {F}
D             {E}
E                    {F}
F                    {G}
G                    {H, L}
H      {I}
I                    {J}
J             {K}
K                    {L, H}
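An NFA table like this one can also be executed directly by tracking the set of states the NFA might be in. The sketch below is one way to do that; the nested-dictionary layout and the function names are assumptions, and state L gets an empty row because it has no outgoing moves.

```python
# One row per NFA state; each row maps "0", "1" or "ε" to a set of states.
TABLE = {
    "A": {"ε": {"B", "D"}},
    "B": {"0": {"C"}},
    "C": {"ε": {"F"}},
    "D": {"1": {"E"}},
    "E": {"ε": {"F"}},
    "F": {"ε": {"G"}},
    "G": {"ε": {"H", "L"}},
    "H": {"0": {"I"}},
    "I": {"ε": {"J"}},
    "J": {"1": {"K"}},
    "K": {"ε": {"L", "H"}},
    "L": {},
}
START, FINAL = "A", "L"


def closure(states):
    """Extend a set of states with everything reachable by ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in TABLE[stack.pop()].get("ε", ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nfa_accepts(text):
    current = closure({START})
    for ch in text:
        moved = set()
        for s in current:                 # every current state is tried
            moved |= TABLE[s].get(ch, set())
        current = closure(moved)
    return FINAL in current


print(nfa_accepts("001"), nfa_accepts("011"))
```

Each input character now costs a pass over a set of states instead of a single lookup, but the table has one small row per NFA state.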
Summary
• Conversion of NFA to DFA is the key step
• DFAs are faster but less compact, so the tables can be very large
• NFAs are slower to simulate but more concise
• In practice, tools provide tradeoffs between speed and space
• Tools generally offer a series of options, in configuration files or on the
command line, that let you choose whether you want to be closer
to a full DFA or to a pure NFA
Assignment 1 (Lexical Analyzer)