Lexical Analyzer

Uploaded by

Pratik Jain
Topics to be covered
• Interaction of scanner & parser
• Token, Pattern & Lexemes
• Input buffering
• Specification of tokens
• Regular expression & Regular definition
• Transition diagram
• Hard coding & automatic generation lexical analyzers
• Finite automata
• Regular expression to NFA using Thompson's rule
• Conversion from NFA to DFA using subset construction method
• DFA optimization
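Interaction of scanner & parser
[Figure: Source Program → Lexical Analyzer → Parser. The lexical analyzer passes a Token to the parser, the parser sends "Get next token" back to the lexical analyzer, and both are connected to the Symbol Table.]
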
• Upon receiving a “Get next token” command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
• The lexical analyzer also strips out comments and white space (blanks, tabs,
and newline characters) from the source program.
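
The interaction described above can be sketched in Python. This is a minimal illustration, not the deck's own code: the class name `Scanner`, the method `get_next_token`, the stand-in `parse_all`, the C-style `/* ... */` comment syntax, and the very coarse lexeme rule (a run of word characters, or one symbol) are all assumptions made for the example.

```python
import re

# Assumptions for this sketch: comments look like /* ... */ and a lexeme is
# either a run of word characters or a single non-space symbol.
_SKIP = re.compile(r"(?:\s+|/\*.*?\*/)*", re.DOTALL)
_LEXEME = re.compile(r"\w+|\S")

class Scanner:
    def __init__(self, source):
        self.source = source
        self.pos = 0

    def get_next_token(self):
        """Called by the parser; returns the next lexeme, or None at end of input."""
        # Strip white space and comments before looking for a token.
        self.pos = _SKIP.match(self.source, self.pos).end()
        if self.pos >= len(self.source):
            return None
        m = _LEXEME.match(self.source, self.pos)
        self.pos = m.end()
        return m.group()

def parse_all(scanner):
    """Stand-in for a parser: it only pulls tokens on demand."""
    tokens = []
    while (tok := scanner.get_next_token()) is not None:
        tokens.append(tok)
    return tokens
```

The scanner does no work until the parser asks for a token, which mirrors the "Get next token" arrow in the figure.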
Why separate lexical analysis & parsing?
1. Simplicity in design.
2. Improves compiler efficiency.
3. Enhances compiler portability.
Token, Pattern & Lexemes
Token
A sequence of characters having a collective meaning is known as a token.
Categories of Tokens:
1. Identifier
2. Keyword
3. Operator
4. Special symbol
5. Constant
Pattern
The set of rules associated with a token is called a pattern.
Example: “non-empty sequence of digits”, “letter followed by letters and digits”
Lexemes
The sequence of characters in a source program that is matched by the pattern for a token is called a lexeme.
Example: Rate, DIET, count, Flag
Example: Token, Pattern & Lexemes
Example: total = sum + 45
Tokens:
total    Identifier1
=        Operator1
sum      Identifier2
+        Operator2
45       Constant1

Lexemes:
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
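
As a hedged illustration of the example above, the following Python sketch pairs each lexeme of `total = sum + 45` with its token category. The patterns in `TOKEN_SPEC` and the function name `tokenize` are assumptions chosen for this example; only three of the five categories are modeled, and characters that match no pattern are silently skipped.

```python
import re

# (token category, pattern) pairs; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("Constant",   r"\d+"),            # non-empty sequence of digits
    ("Identifier", r"[A-Za-z_]\w*"),   # letter followed by letters and digits
    ("Operator",   r"[=+\-*/]"),
    ("SKIP",       r"\s+"),            # white space is not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input text."""
    pairs = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            pairs.append((m.lastgroup, m.group()))
    return pairs
```

Calling `tokenize("total = sum + 45")` gives the identifier, operator and constant lexemes in the order listed above.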
Input buffering
There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer pairs
The lexical analyzer scans the input string from left to right, one character at a time.
The buffer is divided into two N-character halves, where N is the number of characters on
one disk block.
[Figure: buffer holding  E = M * C * * 2 eof  across the two halves, with the lexeme_beginning and forward pointers marked.]
The pointer lexeme_beginning marks the beginning of the current lexeme.
The pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
lexeme_beginning is then set to the character immediately after the lexeme just found.
If the forward pointer is at the end of the first buffer half, the second half is filled with N input
characters.
If the forward pointer is at the end of the second buffer half, the first half is filled with N input
characters.
Code to advance forward pointer
if forward at end of first half then begin
reload second half;
forward := forward + 1;
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half;
end else forward := forward + 1;
Sentinels
[Figure: buffer holding  E = M * eof  in the first half and  C * * 2 eof  in the second half, with an eof sentinel at the end of each half; the lexeme_beginning and forward pointers are marked.]
In buffer pairs we must check, each time we move the forward pointer, that we
have not moved off one of the buffer halves.
Thus, for each character read, we make two tests: one for the end of the buffer and one for the current character.
We can reduce the two tests to one if we extend each buffer half to hold a sentinel
character at the end, which combines the buffer-end test with the test for the current character.
The sentinel is a special character that cannot be part of the source program;
a natural choice is the character eof.
forward := forward + 1;
if the character at forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1;
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half;
    end
    else terminate lexical analysis;   /* eof inside a half marks the end of input */
end
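
The sentinel scheme above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the class name `TwoBufferReader`, the use of `"\0"` as the eof sentinel, and the tiny default half size are choices made for the example, and the lexeme_beginning pointer is left out to keep the sketch short.

```python
import io

EOF = "\0"  # sentinel; assumed never to occur in the source text

class TwoBufferReader:
    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [first half: n chars][EOF][second half: n chars][EOF]
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = -1

    def _load(self, start):
        chunk = self.stream.read(self.n)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # A short read leaves an EOF inside the half: the real end of input.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Advance forward by one; return the character, or None at end of input."""
        self.forward += 1
        ch = self.buf[self.forward]
        if ch != EOF:                          # the single test on the common path
            return ch
        if self.forward == self.n:             # sentinel ending the first half
            self._load(self.n + 1)
            self.forward += 1
        elif self.forward == 2 * self.n + 1:   # sentinel ending the second half
            self._load(0)
            self.forward = 0
        else:                                  # eof inside a half: end of input
            self.forward -= 1
            return None
        ch = self.buf[self.forward]
        if ch == EOF:                          # the reload found no more input
            self.forward -= 1
            return None
        return ch
```

Only when the character read is the sentinel does the code fall through to the buffer-end checks, so ordinary characters cost one comparison each.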
Specification of tokens
Strings and languages
Term: Definition
Prefix of S: A string obtained by removing zero or more trailing symbols of string S.
    e.g., ban is a prefix of banana.
Suffix of S: A string obtained by removing zero or more leading symbols of string S.
    e.g., nana is a suffix of banana.
Substring of S: A string obtained by removing a prefix and a suffix from S.
    e.g., nan is a substring of banana.
Proper prefix, suffix and substring of S: Any nonempty string x that is, respectively, a prefix, suffix or substring of S such that S ≠ x.
Subsequence of S: A string obtained by removing zero or more not necessarily contiguous symbols from S.
    e.g., baaa is a subsequence of banana.
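
The definitions above translate directly into a few lines of Python. The function names below are hypothetical helpers introduced for this sketch.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """Every string left after deleting a prefix and a suffix of s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(strings, s):
    """Keep only the nonempty strings that differ from s."""
    return [x for x in strings if x and x != s]

def is_subsequence(x, s):
    """True if x can be obtained by deleting zero or more symbols of s."""
    it = iter(s)
    return all(ch in it for ch in x)   # each 'in' consumes the iterator
```

For example, `is_subsequence("baaa", "banana")` is true, while `"baaa"` is not a substring of `"banana"` because its symbols are not contiguous there.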
Exercise
Write prefix, suffix, substring, proper prefix, proper suffix and subsequence of
following string:
String: Compiler
Operations on languages
Operation: Definition
Union of L and M, written L ∪ M:
    $L \cup M = \{\, s \mid s \text{ is in } L \text{ or } s \text{ is in } M \,\}$
Concatenation of L and M, written LM:
    $LM = \{\, st \mid s \text{ is in } L \text{ and } t \text{ is in } M \,\}$
Kleene closure of L, written L*:
    $L^{*} = \bigcup_{i=0}^{\infty} L^{i}$
    L* denotes “zero or more concatenations of” L.
Positive closure of L, written L+:
    $L^{+} = \bigcup_{i=1}^{\infty} L^{i}$
    L+ denotes “one or more concatenations of” L.
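
A small Python sketch of these operations on finite languages, represented as sets of strings. Because L* and L+ are infinite in general, the closures here are truncated at a bound `max_i`; that bound and all function names are assumptions of this example.

```python
def union(L, M):
    return L | M

def concat(L, M):
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, where L^0 = {""} and L^i = L^(i-1) L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_closure(L, max_i):
    """Finite approximation of L*: the union of L^0 .. L^max_i."""
    out = set()
    for i in range(max_i + 1):
        out |= power(L, i)
    return out

def positive_closure(L, max_i):
    """Finite approximation of L+: the union of L^1 .. L^max_i."""
    out = set()
    for i in range(1, max_i + 1):
        out |= power(L, i)
    return out
```

The only difference between the two closures is the starting index, which is why the empty string belongs to L* but belongs to L+ only when it is already in L.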
Regular Expression & Regular Definition
Regular expression
A regular expression is a sequence of characters that defines a pattern.
Notational shorthands
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instance: ?
4. Alphabet: Σ
Rules to define regular expression
Regular expression
L = Zero or more occurrences of a = a*
a* = { 𝜖, a, aa, aaa, aaaa, aaaaa, ….. } (infinite)
Regular expression
L = One or more occurrences of a = a+
a+ = { a, aa, aaa, aaaa, aaaaa, ….. } (infinite)
Precedence and associativity of operators
Operator         Precedence   Associativity
Kleene *         1            left
Concatenation    2            left
Union |          3            left
Regular expression examples
7. 0 or more occurrence of either a or b or both

8. 1 or more occurrence of either a or b or both

9. Binary no. ends with 0

10. Binary no. ends with 1

11. Binary no. starts and ends with 1

12. String starts and ends with same character


31. Language of all string containing both 11 and 00 as substring

32. String ending with 1 and not contain 00

33. Language of C identifier
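
The answers to these examples did not survive in this copy of the slides. As a hedged illustration only, here is one possible expression for several of them, written in Python `re` syntax and checked with `re.fullmatch`. The expressions, the dictionary `EXAMPLES` and the helper `matches` are assumptions of this sketch, not the deck's own solutions.

```python
import re

# One possible regular expression per description (assumed, not from the slides).
EXAMPLES = {
    "zero or more a or b":       r"(a|b)*",
    "one or more a or b":        r"(a|b)+",
    "binary number ending in 0": r"(0|1)*0",
    "binary number ending in 1": r"(0|1)*1",
    "ends with 1, no 00":        r"(1|01)+",
    "C identifier":              r"[A-Za-z_][A-Za-z0-9_]*",
}

def matches(name, text):
    """True if the whole text belongs to the language of the named expression."""
    return re.fullmatch(EXAMPLES[name], text) is not None
```

For "ends with 1, no 00", the idea is that every 0 must be followed immediately by a 1, so the string is a nonempty sequence of the blocks 1 and 01.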


Regular definition
Regular definition example
Example: Unsigned Pascal numbers
3
5280
39.37
6.336E4
1.894E-4
2.56E+7
Regular Definition
digit → 0 | 1 | ….. | 9
digits → digit digit*
optional_fraction → . digits | 𝜖
optional_exponent → ( E ( + | - | 𝜖 ) digits ) | 𝜖
num → digits optional_fraction optional_exponent
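
The regular definition above can be mirrored name by name in Python, building each pattern from the ones defined before it. This is a sketch: the variable names follow the definition, while the helper `is_unsigned_number` is an assumption added for the example.

```python
import re

# Each name is defined in terms of the previous ones, as in the regular definition.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"
optional_fraction = rf"(\.{digits})?"        # . digits | epsilon
optional_exponent = rf"(E[+-]?{digits})?"    # ( E ( + | - | epsilon ) digits ) | epsilon
num = re.compile(rf"{digits}{optional_fraction}{optional_exponent}")

def is_unsigned_number(text):
    """True if the whole text matches the definition of num."""
    return num.fullmatch(text) is not None
```

All six sample numbers listed above match `num`, while strings such as `1.` or `.5` do not, because `digits` requires at least one digit on each side of the point.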
Transition Diagram
A stylized flowchart is called a transition diagram.
[Figure: legend showing the symbols for a state, a transition, a start state, and a final state.]
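Transition Diagram : Relational operator
[Figure: transition diagram for relational operators. Paths from the start state:]
<  then =        state 2    return (relop,LE)
<  then >        state 3    return (relop,NE)
<  then other    state 4    return (relop,LT)
=                state 5    return (relop,EQ)
>  then =        state 7    return (relop,GE)
>  then other    state 8    return (relop,GT)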
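
A hard-coded version of this diagram might look like the Python sketch below. The function name `relop`, its `pos` parameter and its return shape are assumptions of this example. On an "other" edge the extra character is not consumed, which is why the LT and GT cases advance by only one position.

```python
def relop(text, pos=0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    ch = text[pos:pos + 1]          # "" at end of input behaves like "other"
    nxt = text[pos + 1:pos + 2]
    if ch == "<":
        if nxt == "=":
            return ("relop", "LE"), pos + 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2
        return ("relop", "LT"), pos + 1     # other: leave it for the next token
    if ch == "=":
        return ("relop", "EQ"), pos + 1
    if ch == ">":
        if nxt == "=":
            return ("relop", "GE"), pos + 2
        return ("relop", "GT"), pos + 1     # other: leave it for the next token
    return None
```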
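Transition diagram : Unsigned number
[Figure: transition diagram for unsigned numbers. Edge labels along the main path from start: digit (with a loop on digit), ".", digit (with a loop on digit), E, "+ or -", digit (with a loop on digit), other. Additional edges labeled E and digit are also shown.]
Example inputs: 3, 5280, 39.37, 1.894 E - 4, 2.56 E + 7, 45 E + 6, 96 E 2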
Hard coding & automatic generation lexical analyzers
Lexical analysis is about identifying patterns in the input.
In a hard-coded lexical analyzer, a transition diagram is constructed for each pattern
and then implemented directly in code.
Example: to represent an identifier in ‘C’, the first character must be a letter and the other
characters are either letters or digits.
To recognize this pattern, a hard-coded lexical analyzer works from a transition
diagram.
An automatically generated lexical analyzer takes a special notation as input.
For example, the lex compiler tool takes regular expressions as input and generates
a lexical analyzer that finds the lexemes matching those expressions.
[Figure: transition diagram for identifiers with states 1, 2, 3; edge labels Start, Letter, and "Letter or digit".]
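
The hard-coded approach for the identifier pattern can be sketched in Python as a direct walk over states 1, 2 and 3. The function name `match_identifier` is hypothetical, and the sketch assumes state 3 is reached from state 2 on any character that is not a letter or digit. Python's `str.isalpha` and `str.isalnum` also accept non-ASCII letters and digits, which is broader than C.

```python
def match_identifier(text, pos=0):
    """Walk the identifier diagram from pos; return the lexeme or None."""
    state = 1
    start = pos
    while True:
        ch = text[pos:pos + 1]          # "" at end of input
        if state == 1:
            if ch.isalpha():            # edge "Letter": state 1 -> state 2
                state, pos = 2, pos + 1
            else:
                return None             # no identifier starts here
        elif state == 2:
            if ch.isalnum():            # loop "Letter or digit" on state 2
                pos += 1
            else:
                return text[start:pos]  # state 3 (assumed): accept the lexeme
```

An automatically generated analyzer would instead be produced from a specification such as `letter (letter | digit)*`, with the state-walking code written by the tool.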
Finite Automata
Types of finite automata
Types of finite automata are:
DFA
Deterministic finite automata (DFA) have, for each state, exactly one edge leaving
for each symbol.
NFA
Nondeterministic finite automata (NFA).
[Figure: an example DFA and an example NFA, each drawn with states 1, 2, 3, 4 and edges labeled a and b.]
Regular expression to NFA using Thompson's rule
[Figure: Thompson's constructions for 𝜖, for a single symbol a, and for the concatenation of N(s) and N(t); example NFA for ab with states 1, 2, 3.]
[Figure: Thompson's constructions for union and closure, built from N(s) and N(t) with 𝜖-transitions; example NFA for a|b with states 1 to 6, and an example closure NFA.]
[Figure: example NFAs for a*b and b*ab.]
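
Since the diagrams are not reproduced here, the following Python sketch shows one way to code Thompson-style constructions. The class `NFA` and the helpers `symbol`, `union`, `concat`, `star` and `accepts` are names assumed for this example. One simplification is labeled in the code: concatenation links the two fragments with an 𝜖-edge rather than merging the accepting state of the first with the start state of the second, which accepts the same language.

```python
from itertools import count

class NFA:
    """Fragment with one start and one accepting state; label None means epsilon."""
    _ids = count()

    def __init__(self):
        self.start = next(NFA._ids)
        self.accept = next(NFA._ids)
        self.edges = {}                       # (state, label) -> set of states

    def add(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

def _merge(n, *parts):
    for p in parts:
        for key, dsts in p.edges.items():
            n.edges.setdefault(key, set()).update(dsts)

def symbol(a):
    n = NFA()
    n.add(n.start, a, n.accept)
    return n

def union(s, t):
    n = NFA()
    _merge(n, s, t)
    for part in (s, t):
        n.add(n.start, None, part.start)
        n.add(part.accept, None, n.accept)
    return n

def concat(s, t):
    n = NFA()
    _merge(n, s, t)
    n.start, n.accept = s.start, t.accept
    n.add(s.accept, None, t.start)            # simplification: epsilon link
    return n

def star(s):
    n = NFA()
    _merge(n, s)
    n.add(n.start, None, s.start)
    n.add(n.start, None, n.accept)            # zero occurrences
    n.add(s.accept, None, s.start)            # repeat
    n.add(s.accept, None, n.accept)
    return n

def closure(n, states):
    stack, seen = list(states), set(states)
    while stack:
        for nxt in n.edges.get((stack.pop(), None), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def accepts(n, text):
    current = closure(n, {n.start})
    for ch in text:
        moved = set()
        for s in current:
            moved |= n.edges.get((s, ch), set())
        current = closure(n, moved)
    return n.accept in current
```

For instance, `concat(star(symbol("a")), symbol("b"))` builds an NFA for a*b, and `accepts` simulates it by tracking the set of states reachable after each input symbol.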
Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10. (0+1)*010(0+1)*
11. (010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to DFA using subset construction method
Subset construction algorithm

OPERATION       DESCRIPTION
ϵ-Closure(s)    Set of NFA states reachable from NFA state s on ϵ-transitions alone
ϵ-Closure(T)    Set of NFA states reachable from some NFA state s in T on ϵ-transitions alone
move(T,a)       Set of NFA states to which there is a transition on input symbol a from some NFA state s in T
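
The two operations can be sketched in Python as below. The transition dictionary `NFA` is an assumption of this example: it encodes an 11-state NFA for (a|b)*abb with states 0 to 10, using `None` for 𝜖. The names `eps_closure` and `move` are likewise chosen for the sketch.

```python
EPS = None  # label used for epsilon transitions in this sketch

# Assumed NFA for (a|b)*abb: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9}, (9, "b"): {10},
}

def eps_closure(T):
    """States reachable from any state in T on epsilon transitions alone."""
    stack, seen = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def move(T, a):
    """States reachable from some state in T by one transition on symbol a."""
    out = set()
    for s in T:
        out |= NFA.get((s, a), set())
    return out
```

With this NFA, `eps_closure({0})` is `{0, 1, 2, 4, 7}` and `move` of that set on `a` is `{3, 8}`.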
Conversion from NFA to DFA
Example: (a|b)*abb
[Figure: NFA for (a|b)*abb with states 0 to 10. 𝜖-transitions: 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7. Transitions on a: 2→3, 7→8. Transitions on b: 4→5, 8→9, 9→10.]

𝜖-Closure(0) = {0, 1, 7, 2, 4}
= {0,1,2,4,7} ---- A
A = {0, 1, 2, 4, 7}
Move(A,a) = {3,8}
𝜖-Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Move(A,b) = {5}
𝜖-Closure(Move(A,b)) = {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C

B = {1, 2, 3, 4, 6, 7, 8}
Move(B,a) = {3,8}
𝜖-Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Move(B,b) = {5,9}
𝜖-Closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9}
= {1,2,4,5,6,7,9} ---- D

C = {1, 2, 4, 5, 6, 7}
Move(C,a) = {3,8}
𝜖-Closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Move(C,b) = {5}
𝜖-Closure(Move(C,b)) = {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C

D = {1, 2, 4, 5, 6, 7, 9}
Move(D,a) = {3,8}
𝜖-Closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Move(D,b) = {5,10}
𝜖-Closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10}
= {1,2,4,5,6,7,10} ---- E

E = {1, 2, 4, 5, 6, 7, 10}
Move(E,a) = {3,8}
𝜖-Closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Move(E,b) = {5}
𝜖-Closure(Move(E,b)) = {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C

Transition Table
States                    a    b
A = {0,1,2,4,7}           B    C
B = {1,2,3,4,6,7,8}       B    D
C = {1,2,4,5,6,7}         B    C
D = {1,2,4,5,6,7,9}       B    E
E = {1,2,4,5,6,7,10}      B    C
[Figure: DFA drawn from the transition table, with states A to E and edges labeled a and b.]
Note:
• Accepting state in NFA is 10
• 10 is an element of E
• So, E is the accepting state in the DFA
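
The whole worked example can be reproduced with a short Python sketch of the subset construction. Everything named here is an assumption of the example: the transition dictionary `NFA` (an 11-state NFA for (a|b)*abb with `None` for 𝜖), the helpers, and the policy of naming new DFA states A, B, C, ... in order of discovery.

```python
from collections import deque

EPS = None
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
NFA_START, NFA_ACCEPT, ALPHABET = 0, 10, "ab"

def eps_closure(T):
    stack, seen = list(T), set(T)
    while stack:
        for t in NFA.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def move(T, a):
    out = set()
    for s in T:
        out |= NFA.get((s, a), set())
    return out

def subset_construction():
    start = frozenset(eps_closure({NFA_START}))
    names = {start: "A"}                  # DFA state (set of NFA states) -> name
    table = {}                            # (name, symbol) -> name
    queue = deque([start])
    while queue:
        T = queue.popleft()
        for a in ALPHABET:
            U = frozenset(eps_closure(move(T, a)))
            if U not in names:            # a new DFA state was discovered
                names[U] = chr(ord("A") + len(names))
                queue.append(U)
            table[names[T], a] = names[U]
    accepting = {name for T, name in names.items() if NFA_ACCEPT in T}
    return names, table, accepting
```

Running `subset_construction()` yields five DFA states, and the only one containing NFA state 10 is E, in agreement with the note above.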
Exercise
Convert following regular expression to DFA using subset construction method:
1. (a+b)*a(a+b)
2. (a+b)*ab*a
DFA optimization
Transition Table
States   a   b
A        B   C
B        B   D
C        B   C
D        B   E
E        B   C

States A and C end up in the same group (AC). Now no more splitting is possible.
If we choose A as the representative for group (AC), then we obtain the reduced
transition table.

Optimized Transition Table
States   a   b
A        B   A
B        B   D
D        B   E
E        B   A
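
The grouping step can be sketched in Python as a partition refinement: start from accepting versus non-accepting states and keep splitting groups whose members disagree on which group each symbol leads to. The names `DFA`, `ACCEPTING`, `minimize` and `reduced_table` are assumptions of this example, and E is taken as the accepting state as in the subset construction result.

```python
DFA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}

def minimize(dfa, accepting):
    """Return the final list of groups (sets of equivalent states)."""
    partition = [g for g in (set(accepting), set(dfa) - set(accepting)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)
        refined = []
        for g in partition:
            buckets = {}
            for s in sorted(g):
                # States stay together only if every symbol leads to the same group.
                key = tuple(group_of(dfa[s][a]) for a in sorted(dfa[s]))
                buckets.setdefault(key, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):    # no group was split: done
            return refined
        partition = refined

def reduced_table(dfa, groups):
    """Replace every state by the representative (smallest name) of its group."""
    rep = {s: min(g) for g in groups for s in g}
    return {rep[s]: {a: rep[t] for a, t in dfa[s].items()} for s in dfa}
```

On this DFA the refinement stops with the groups (E), (AC), (B) and (D), and `reduced_table` reproduces the optimized table with A standing for the group (AC).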
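Conversion from regular expression to DFA
Rules to compute nullable, firstpos, lastpos
Node n: a leaf labeled 𝜖
    nullable(n): true
    firstpos(n): ∅
    lastpos(n): ∅
Node n: a leaf with position i
    nullable(n): false
    firstpos(n): {i}
    lastpos(n): {i}
Node n: an or-node with children c1 and c2
    nullable(n): nullable(c1) or nullable(c2)
    firstpos(n): firstpos(c1) ∪ firstpos(c2)
    lastpos(n): lastpos(c1) ∪ lastpos(c2)
Node n: a concatenation node with children c1 and c2
    nullable(n): nullable(c1) and nullable(c2)
    firstpos(n): if (nullable(c1)) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n): if (nullable(c2)) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Node n: a * node with child c1
    nullable(n): true
    firstpos(n): firstpos(c1)
    lastpos(n): lastpos(c1)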
Rules to compute followpos
1. If n is a concatenation node with left child c1 and right child c2, and i is a position
in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).
2. If n is a * node and i is a position in lastpos(n), then all positions in firstpos(n) are in
followpos(i).
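
These rules can be sketched in Python as a single recursive pass over a syntax tree. The tuple encoding of nodes (`"leaf"`, `"or"`, `"cat"`, `"star"`), the function name `analyze` and the helper `leaf` are assumptions of this example; 𝜖-leaves are left out for brevity. The tree built at the end is for (a|b)*abb# with leaf positions 1 to 6.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node and fill followpos in place."""
    kind = node[0]
    if kind == "leaf":
        _, _symbol, pos = node
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                          # followpos rule 1
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _n1, f1, l1 = analyze(node[1], followpos)
        for i in l1:                          # followpos rule 2
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind}")

def leaf(symbol, pos):
    return ("leaf", symbol, pos)

# Syntax tree for (a|b)*abb# with positions a=1, b=2, a=3, b=4, b=5, #=6.
tree = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", leaf("a", 1), leaf("b", 2))),
        leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))

followpos = {}
nullable, firstpos, lastpos = analyze(tree, followpos)
```

After the call, `followpos` holds one set per position, and `firstpos` of the root is the set of positions that can start a string.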
Conversion from regular expression to DFA
Example: (a|b)*abb#
Step 1: Construct Syntax Tree
[Figure: syntax tree for (a|b)*abb#, a chain of concatenation (.) nodes over the * node and the leaves. Leaf positions: a = 1, b = 2, a = 3, b = 4, b = 5, # = 6.]
Step 2: Nullable node
Here, * is the only nullable node.
Step 3: Calculate firstpos and lastpos
[Figure: syntax tree annotated with firstpos and lastpos at each node.]
firstpos
    or-node: firstpos(c1) ∪ firstpos(c2)
    * node: firstpos(c1)
    concatenation node: if (nullable(c1)) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
lastpos
    or-node: lastpos(c1) ∪ lastpos(c2)
    * node: lastpos(c1)
    concatenation node: if (nullable(c2)) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Step 4: Calculate followpos
The concatenation nodes give followpos(5) = {6}, followpos(4) = {5}, followpos(3) = {4}, and put 3 in followpos(1) and followpos(2).
The * node then adds 1 and 2 to followpos(1) and followpos(2).
Position   followpos
5          6
4          5
3          4
2          1,2,3
1          1,2,3

Initial state: A = firstpos(root) = {1,2,3}
State A
δ( (1,2,3),a) = followpos(1) U followpos(3)
=(1,2,3) U (4) = {1,2,3,4} ----- B
δ( (1,2,3),b) = followpos(2)
=(1,2,3) ----- A
State B
δ( (1,2,3,4),a) = followpos(1) U followpos(3)
=(1,2,3) U (4) = {1,2,3,4} ----- B
δ( (1,2,3,4),b) = followpos(2) U followpos(4)
=(1,2,3) U (5) = {1,2,3,5} ----- C
State C
δ( (1,2,3,5),a) = followpos(1) U followpos(3)
=(1,2,3) U (4) = {1,2,3,4} ----- B
δ( (1,2,3,5),b) = followpos(2) U followpos(5)
=(1,2,3) U (6) = {1,2,3,6} ----- D
State D
δ( (1,2,3,6),a) = followpos(1) U followpos(3)
=(1,2,3) U (4) = {1,2,3,4} ----- B
δ( (1,2,3,6),b) = followpos(2)
=(1,2,3) ----- A

States          a   b
A={1,2,3}       B   A
B={1,2,3,4}     B   C
C={1,2,3,5}     B   D
D={1,2,3,6}     B   A
[Figure: DFA with states A, B, C, D and edges labeled a and b, drawn from the table above.]
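
The state-by-state computation above can be sketched in Python, taking the followpos table as given. The names `FOLLOWPOS`, `SYMBOL_AT`, `ROOT_FIRSTPOS` and `build_dfa` are assumptions of this example, as is the convention that a state is accepting when it contains position 6, the position of the end marker #.

```python
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
ROOT_FIRSTPOS = frozenset({1, 2, 3})
END_POSITION = 6                      # position of the end marker #

def build_dfa(alphabet="ab"):
    names = {ROOT_FIRSTPOS: "A"}      # set of positions -> state name
    table = {}                        # (name, symbol) -> name
    pending = [ROOT_FIRSTPOS]
    while pending:
        S = pending.pop(0)
        for a in alphabet:
            # Union of followpos(p) over the positions p in S that carry symbol a.
            U = set()
            for p in S:
                if SYMBOL_AT[p] == a:
                    U |= FOLLOWPOS[p]
            U = frozenset(U)
            if not U:
                continue              # no transition on a from this state
            if U not in names:
                names[U] = chr(ord("A") + len(names))
                pending.append(U)
            table[names[S], a] = names[U]
    accepting = {n for S, n in names.items() if END_POSITION in S}
    return names, table, accepting
```

With these inputs the function returns the same four states and transitions as the table above, and D = {1,2,3,6} is the only state that contains position 6.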
Conversion from regular expression to DFA
Construct DFA for following regular expression:
1. (c | d)*c#
Thank You
