Lexical Analysis
Dr. Nguyen Hua Phung
HCMC University of Technology, Viet Nam
08, 2016
Dr. Nguyen Hua Phung Lexical Analysis 1 / 38
Outline
1 Introduction
2 Roles
3 Implementation
4 Use ANTLR to generate Lexer
5 Regex and Parser Libraries in Scala
Compilation Phases
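source program
  → lexical analyzer
  → syntax analyzer
  → semantic analyzer
  → intermediate code generator
  → code optimizer
  → code generator
  → target program
Front end: lexical analyzer through intermediate code generator
Back end: code optimizer and code generator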
Lexical Analysis
Like a word extractor
in ⇒ i n ⇒ in
Like a spell checker
I og to sochol (the misspelled words "og" and "sochol" are flagged)
Like a classifier
I        am    a        student
pronoun  verb  article  noun
Lexical Analysis Roles
Identify lexemes: substrings of the source program that belong to a grammar unit
Return tokens: the lexical category of each lexeme
Ignore whitespace such as blanks, newlines, and tabs
Record the position of each token for use in later phases
Example on Lexeme and Token
result = oldsum - value / 100;
Lexemes   Tokens
result    IDENT
=         ASSIGN_OP
oldsum    IDENT
-         SUBTRACT_OP
value     IDENT
/         DIV_OP
100       INT_LIT
;         SEMICOLON
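The lexeme/token pairing above can be sketched in Scala. This is only an illustration under stated assumptions: the object name `LexemeTokenDemo`, the exact patterns, and the rule order are hypothetical, and the subtraction token is spelled `SUBTRACT_OP`.

```scala
import scala.util.matching.Regex

object LexemeTokenDemo {
  // Token categories from the table; patterns and rule order are assumptions of this sketch.
  val rules: List[(String, Regex)] = List(
    "IDENT"       -> "[A-Za-z_][A-Za-z0-9_]*".r,
    "INT_LIT"     -> "[0-9]+".r,
    "ASSIGN_OP"   -> "=".r,
    "SUBTRACT_OP" -> "-".r,
    "DIV_OP"      -> "/".r,
    "SEMICOLON"   -> ";".r
  )
  val whitespace: Regex = "[ \\t\\r\\n]+".r

  // Returns (lexeme, token) pairs in source order; whitespace is skipped.
  def tokenize(input: String): List[(String, String)] = {
    val out = scala.collection.mutable.ListBuffer.empty[(String, String)]
    var pos = 0
    while (pos < input.length) {
      val rest = input.substring(pos)
      whitespace.findPrefixOf(rest) match {
        case Some(ws) =>
          pos += ws.length
        case None =>
          // First rule whose pattern matches at the current position.
          val hit = rules.flatMap { case (name, re) =>
            re.findPrefixOf(rest).map(lex => (lex, name))
          }.headOption
          hit match {
            case Some((lex, name)) =>
              out += ((lex, name))
              pos += lex.length
            case None =>
              throw new IllegalArgumentException(s"unexpected character at position $pos")
          }
      }
    }
    out.toList
  }

  def main(args: Array[String]): Unit =
    tokenize("result = oldsum - value / 100;").foreach { case (lex, tok) =>
      println(s"$lex\t$tok")
    }
}
```

Each rule is tried only at the current position, so the output preserves the order of lexemes in the source line.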
How to build a lexical analyzer?
How to build a lexical analyzer for English?
65000 words
Simply build a dictionary:
{(I,pronoun);(We,pronoun);(am,verb);...}
Extract, search, compare
But for a programming language?
How many words?
Identifiers: abc, cab, Abc, aBc, cAb, ...
Integers: 1, 10, 120, 20, 210, ...
...
Too many words to build a dictionary, so how?
Finite Automata
[Figure: a finite automaton. An input tape (... b b a a a a ...) is scanned by a read-only head connected to a finite control with states q0, q1, q2, q3, ..., qn]
State Diagram
[Figure: state diagram with start state q0; q0 and q1 each loop on a, and b moves from q0 to q1 and from q1 to q0]
Input: abaabb
Current state   Read   New state
q0              a      q0
q0              b      q1
q1              a      q1
q1              a      q1
q1              b      q0
q0              b      q1
Deterministic Finite Automata
Definition
A deterministic finite automaton (DFA) is a 5-tuple
M = (K, Σ, δ, s, F) where
K = a finite set of states
Σ = an alphabet
s ∈ K = the initial state
F ⊆ K = the set of final states
δ = a transition function from K × Σ to K
Example
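M = (K, Σ, δ, s, F)
where K = {q0, q1}, Σ = {a, b}, s = q0, F = {q1}
and δ is given by
q    σ    δ(q, σ)
q0   a    q0
q0   b    q1
q1   a    q1
q1   b    q0
[Figure: state diagram of M with start state q0; q0 and q1 each loop on a, and b moves between q0 and q1]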
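The 5-tuple of this example can be written down almost literally in Scala. The following is a minimal sketch; the names `DfaDemo`, `DFA`, `trace`, and `accepts` are hypothetical and chosen only for illustration.

```scala
object DfaDemo {
  // M = (K, Σ, δ, s, F), mirroring the definition field by field.
  final case class DFA(
      states: Set[String],
      alphabet: Set[Char],
      delta: Map[(String, Char), String],
      start: String,
      finals: Set[String]
  ) {
    // States visited while reading the input, beginning with the start state.
    // A symbol outside the alphabet makes the Map lookup throw NoSuchElementException.
    def trace(input: String): List[String] =
      input.toList.scanLeft(start)((q, c) => delta((q, c)))

    // The input is accepted when the last state reached is a final state.
    def accepts(input: String): Boolean = finals.contains(trace(input).last)
  }

  // The example automaton: K = {q0, q1}, Σ = {a, b}, s = q0, F = {q1}.
  val m: DFA = DFA(
    states = Set("q0", "q1"),
    alphabet = Set('a', 'b'),
    delta = Map(
      ("q0", 'a') -> "q0",
      ("q0", 'b') -> "q1",
      ("q1", 'a') -> "q1",
      ("q1", 'b') -> "q0"
    ),
    start = "q0",
    finals = Set("q1")
  )

  def main(args: Array[String]): Unit = {
    println(m.trace("abaabb").mkString(" -> "))
    println(m.accepts("abaabb"))
  }
}
```

With this δ, every b flips the state and every a keeps it, so the automaton ends in q1 exactly when the input contains an odd number of b's.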
Nondeterministic Finite Automata
Permit several possible "next states" for a given combination of current state and input symbol
Allow transitions on the empty string (ε-transitions) in the state diagram
Help simplify the description of automata
Every NFA is equivalent to a DFA
Example
Language L = ({ab} ∪ {aba})*
[Figure: state diagram of an NFA for L with states q0 (start), q1, and q2; edges labelled a and b]
Example
Language L = ({ab} ∪ {aba})*
[Figure: a second state diagram of an NFA for L with states q0 (start), q1, and q2; edges labelled a and b]
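An NFA can be simulated by tracking the set of states it could be in after each symbol. The sketch below does this for L = ({ab} ∪ {aba})*. The transition set is an assumption of this sketch (one common ε-NFA for this language), not necessarily the automaton drawn above: q0 goes to q1 on a, q1 goes to q2 on b, q2 goes to q0 on a, and q2 goes to q0 on ε, with q0 as the only final state. All names are hypothetical.

```scala
object NfaDemo {
  type State = String

  // Labelled moves; None stands for an ε-move. This transition set is assumed.
  val moves: Map[(State, Option[Char]), Set[State]] = Map(
    ("q0", Some('a')) -> Set("q1"),
    ("q1", Some('b')) -> Set("q2"),
    ("q2", Some('a')) -> Set("q0"),
    ("q2", None)      -> Set("q0")
  )
  val start: State = "q0"
  val finals: Set[State] = Set("q0")

  // All states reachable from the given set through ε-moves alone.
  def closure(states: Set[State]): Set[State] = {
    val next = states ++ states.flatMap(q => moves.getOrElse((q, None), Set.empty[State]))
    if (next == states) states else closure(next)
  }

  // One input symbol: follow every matching move, then take the ε-closure.
  def step(states: Set[State], c: Char): Set[State] =
    closure(states.flatMap(q => moves.getOrElse((q, Some(c)), Set.empty[State])))

  // Accept when at least one of the possible end states is final.
  def accepts(input: String): Boolean =
    input.foldLeft(closure(Set(start)))(step).exists(finals.contains)

  def main(args: Array[String]): Unit =
    List("", "ab", "aba", "ababa", "abaa").foreach(w => println(s"'$w' -> ${accepts(w)}"))
}
```

Tracking sets of states in this way is the same idea that underlies the conversion of an NFA into an equivalent DFA.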
Regular Expression (regex)
Describe regular sets of strings
Symbols other than ( ) | * stand for themselves
Use ε for the empty string
Concatenation α β = first part matches α, second part matches β
Union α | β = match α or β
Kleene star α* = 0 or more matches of α
Use ( ) for grouping
Example
(i|I)(f|F)
Keyword if of language Pascal
if
IF
If
iF
E(0|1|2|3|4|5|6|7|8|9)*
An E followed by a (possibly empty) sequence of digits
E123
E9
E
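Both expressions above can be checked directly with Scala's regex support. This is a small sketch; `RegexExamples` and `fullMatch` are hypothetical names, and whole-string matching is done through the underlying `java.util.regex.Pattern`.

```scala
import scala.util.matching.Regex

object RegexExamples {
  val pascalIf: Regex = "(i|I)(f|F)".r
  val eDigits: Regex  = "E(0|1|2|3|4|5|6|7|8|9)*".r

  // True only when the whole string, not just a part of it, matches the pattern.
  def fullMatch(re: Regex, s: String): Boolean =
    re.pattern.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    List("if", "IF", "If", "iF", "iff").foreach(s => println(s"$s -> ${fullMatch(pascalIf, s)}"))
    List("E123", "E9", "E", "e1").foreach(s => println(s"$s -> ${fullMatch(eDigits, s)}"))
  }
}
```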
Regular Expression and Finite Automata
[Figure: finite automata for the regular expressions ab, a|b, and a*]
Convenience Notation
α+ = one or more (i.e. αα*)
α? = 0 or 1 (i.e. (α|ε))
[xyz] = x|y|z
[x-y] = all characters from x to y, e.g. [0-9] = all ASCII digits
[^x-y] = all characters other than those in [x-y]
. matches any character
Example
Integer:
Hexadecimal number:
Fixed-point number:
Floating point number:
String:
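One possible set of answers, written with the convenience notation and checked in Scala. These patterns are assumptions of this sketch rather than the intended solutions: real languages differ on details such as signs, an optional exponent, and escape sequences inside strings. The names `PatternExercises` and `full` are hypothetical.

```scala
import scala.util.matching.Regex

object PatternExercises {
  // Candidate patterns only; each one is an assumption, not a language standard.
  val integer: Regex     = "[0-9]+".r
  val hexadecimal: Regex = "0[Xx][0-9A-Fa-f]+".r
  val fixedPoint: Regex  = "[0-9]+\\.[0-9]+".r
  // Mantissa with an optional fraction, followed by a mandatory exponent.
  val floating: Regex    = "[0-9]+(\\.[0-9]+)?[eE][+-]?[0-9]+".r
  // Double-quoted text on one line, with no escape sequences.
  val string: Regex      = "\"[^\"\\n]*\"".r

  // True only when the whole input matches the pattern.
  def full(re: Regex, s: String): Boolean = re.pattern.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    println(full(integer, "42"))
    println(full(hexadecimal, "0x1F"))
    println(full(fixedPoint, "3.14"))
    println(full(floating, "10.0e20"))
    println(full(string, "\"hello\""))
  }
}
```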
ANTLR [1]
ANother Tool for Language Recognition
Terence Parr, Professor of CS at the University of San Francisco
Powerful parser/lexer generator
Lexer
/**
 * Filename: Hello.g4
 */
lexer grammar Hello;
// match one or more digits
INT : [0-9]+ ;
// hexadecimal number
HEX : '0' [Xx] [0-9A-Fa-f]+ ;
// match lower-case identifiers
ID : [a-z]+ ;
// skip spaces, tabs, newlines
WS : [ \t\r\n]+ -> skip ;
Lexical Analyzer
[Figure: the lexical analyzer reads the input tape (1 0 . 0 e 2 0 ...) with look-ahead, tries the rules r0, r1, r2, r3, ..., rn, and returns the token with the longest prefix match]
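The longest-prefix-match rule can be sketched in Scala: every rule is tried at the front of the input, and the rule with the longest matching lexeme wins. The rule names and patterns below are assumptions made for illustration, as are `LongestMatchDemo` and `nextToken`; ties go to the rule listed first.

```scala
import scala.util.matching.Regex

object LongestMatchDemo {
  // Rules r0..rn as (token name, pattern); names and patterns are assumed.
  val rules: List[(String, Regex)] = List(
    "INT"   -> "[0-9]+".r,
    "FLOAT" -> "[0-9]+\\.[0-9]+([eE][+-]?[0-9]+)?".r,
    "ID"    -> "[a-z]+".r
  )

  // Returns (token name, lexeme) for the longest prefix match, or None if no rule matches.
  def nextToken(input: String): Option[(String, String)] = {
    val candidates = rules.flatMap { case (name, re) =>
      re.findPrefixOf(input).map(lexeme => (name, lexeme))
    }
    // Keep the longest lexeme; an earlier rule is only replaced by a strictly longer match.
    candidates.foldLeft(Option.empty[(String, String)]) {
      case (Some(best), cand) if best._2.length >= cand._2.length => Some(best)
      case (_, cand)                                              => Some(cand)
    }
  }

  def main(args: Array[String]): Unit = {
    println(nextToken("10.0e20 + x")) // INT matches "10", FLOAT matches "10.0e20"
    println(nextToken("10 + x"))
  }
}
```

On the tape contents shown in the figure, the INT rule alone would stop after `10`; comparing all candidates is what lets the longer FLOAT lexeme be returned.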
Scala Regex Library
Library: import scala.util.matching.Regex
Construction: new Regex(String)
  new Regex("[0-9]+")
  "[0-9]+".r
Methods:
  findFirstIn(String): Option[String]
  findFirstMatchIn(String): Option[Match]
  findPrefixOf(String): Option[String]
  findPrefixMatchOf(String): Option[Match]
  findAllIn(String): MatchIterator
...
Example
import scala.util.matching.Regex
val pat = new Regex("[0-9]+")
val pattern = "[a-z][a-z]*".r
val str = "123 abc 456"
[Link](str)
[Link](str)
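For reference, here is a self-contained sketch that applies several of the `Regex` methods to the same `pat`, `pattern`, and `str`. Which calls to show is a choice made for this illustration; the object name and the result value names are hypothetical.

```scala
import scala.util.matching.Regex

object RegexLibraryDemo {
  val pat: Regex     = new Regex("[0-9]+")
  val pattern: Regex = "[a-z][a-z]*".r
  val str            = "123 abc 456"

  // findFirstIn: first match anywhere in the string.
  val firstNumber: Option[String] = pat.findFirstIn(str)
  val firstWord: Option[String]   = pattern.findFirstIn(str)

  // findPrefixOf: the match must begin at the start of the string.
  val numberPrefix: Option[String] = pat.findPrefixOf(str)
  val wordPrefix: Option[String]   = pattern.findPrefixOf(str)

  // findAllIn: iterator over all non-overlapping matches.
  val allNumbers: List[String] = pat.findAllIn(str).toList

  // findFirstMatchIn: a Match object, which also carries position information.
  val firstWordStart: Option[Int] = pattern.findFirstMatchIn(str).map(_.start)

  def main(args: Array[String]): Unit = {
    println(firstNumber)    // Some(123)
    println(firstWord)      // Some(abc)
    println(numberPrefix)   // Some(123)
    println(wordPrefix)     // None
    println(allNumbers)     // List(123, 456)
    println(firstWordStart) // Some(4)
  }
}
```

For a lexer, `findPrefixOf` is usually the relevant call, because a token has to start exactly at the current input position.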
Scala Parser Library
Library: scala.util.parsing.combinator
Construction: new Parser[T]
  new Parser[Token]
  new Parser[Any]
Methods:
  ~    p1 ~ p2: must match p1 followed by p2
  |    p1 | p2: must match either p1 or p2, with preference given to p1
  ?    p1.?: may match p1 or not
  *    p1.*: matches any number of repetitions of p1
  ^^   p1 ^^ f: applies the function f to the result of p1
  ^^^  p1 ^^^ v: replaces a successful result with the specified value v
  ...  ...
Summary
A lexical analyzer is a pattern matcher that isolates the small-scale parts of a program
Regular expressions describe token patterns and can be recognized by finite automata
How to write a lexical analyzer (lexer) in Scala
References I
[1] ANTLR, http:[Link], 19 08 2016.