Wachemo University
Institute of Technology
Department of Computer Science
Course Title: Compiler Design (CoSc4103)
Chapter Two: Lexical Analysis and Lex
By: Tseganesh M.(MSc.)
Compiler Design (CoSc4103)
Chapter Two
Lexical Analysis and Lex
Outline
2.1. The role of the lexical analyzer
2.2. Token: Specification and Recognition of Tokens
2.3. Lexical Error Recovery
2.4. Finite Automata: NFA to DFA Conversion
2.5. A Typical Lexical Analyzer Generator
2.1. The role of the Lexical Analyzer
Lexical analysis is the first phase of a compiler.
A lexical analyzer is also called a "Scanner".
The input to a lexical analyzer is the pure high-level code from the preprocessor.
Main functions of Lexical analyzer
1st task: read the given source code from left to right, character by character, and produce a
sequence of tokens that are used for syntax analysis,
i.e., the output of lexical analysis is a stream of tokens, which is the input to the parser.
2nd task: remove comments and white space from the source code, in the form of blank,
tab, and newline characters.
Another task: generate an error message if it finds an invalid token in the source program.
It identifies valid lexemes from the program and returns tokens to the syntax analyzer,
one after the other, in response to each getNextToken command from the syntax
analyzer.
[Figure: the lexical analyzer reads characters from the source program (the entire program is read into memory, with the ability to put a character back), returns a token and token value to the parser on each getNextToken request, and the result flows on to semantic analysis; both the lexical analyzer and the parser consult the symbol table, e.g. for id entries.]
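To make this interface concrete, here is a minimal C sketch of the token stream handed to the parser; the type and function names (Token, getNextToken, TOK_*) are illustrative assumptions, not a fixed standard:

/* Hypothetical token representation: a token name plus an optional
   attribute value (e.g., a symbol-table index or a numeric constant). */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_EOF } TokenName;

typedef struct {
    TokenName name;      /* which token class the lexeme belongs to   */
    int       attribute; /* e.g., symbol-table index or literal value */
} Token;

/* Stub scanner, for illustration only: a real one reads characters
   from the source program and groups them into lexemes. */
Token getNextToken(void) {
    Token t = { TOK_EOF, 0 };
    return t;
}

/* The parser pulls one token at a time until end of input. */
void parse(void) {
    for (Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken()) {
        /* grammar-driven processing of token t goes here */
    }
}

int main(void) { parse(); return 0; }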
Lexical Analyzer cont’d……
The lexical analyzer works closely with the syntax analyzer.
But there are some reasons for separating lexical analysis from parsing:
Simplicity of design
Improving compiler efficiency
Enhancing compiler portability (e.g. Linux to Win)
When you work on lexical analysis, there are three important terms to know:
lexeme, pattern, and token.
Token, Pattern, Lexeme
Lexeme: a sequence of characters in the source program that matches the
pattern of a token.
Pattern: the set of rules that the scanner follows to identify a valid lexeme for a token.
A pattern describes what can be a token, and
these patterns can be defined by means of regular expressions.
Token: a set of strings defining an atomic element with a defined meaning.
It is a pre-defined sequence of characters that cannot be broken down further.
A token has a token name and an optional token/attribute value.
Lexical Analyzer cont’d……
Some examples of tokens, lexemes, and patterns:
Token        Lexeme   Pattern
Keyword      while    w-h-i-l-e
Relop        <        <, >, >=, <=, !=, ==
Integer      7        (0-9)+  (a sequence of digits with at least one digit)
String       "Hi"     characters enclosed by " "
Punctuation  ,        ; , . ! etc.
Identifier   number   a sequence of letters and digits beginning with a letter
But here are some questions raised by these tasks of the lexical analyzer:
How does the lexical analyzer read the input string and break it into lexemes?
How can it understand the patterns and check if the lexemes are valid?
What does the Lexical Analyzer send to the next phase?
2.2. Token: Specification and Recognition of Tokens
In a programming language, keywords, constants, identifiers, strings, numbers, whitespace,
operators, and punctuation symbols are considered tokens.
For example, in C or C++ language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Attributes of Token
In a program, sometimes more than one lexeme matches the pattern corresponding to one token,
so the lexical analyzer must provide additional information about the particular lexeme,
because the later phases need this additional information about the lexeme to perform
different operations.
Lexical analyzer collects information about tokens into their associated attributes and sends a
sequence of tokens with their information to the next phase.
i.e., the tokens are sent as a pair of <Token name, Attribute value> to the Syntax
analyzer
Tokens cont’d……
Example: the tokens and associated attribute values for the FORTRAN statement
E = M * C ** 2
are written below as a sequence of pairs:
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>

Token   Attribute
ID      index of symbol-table entry for E
=       (none)
ID      index of symbol-table entry for M
*       (none)
ID      index of symbol-table entry for C
**      (none)
NUM     integer value 2
A lexeme is like an instance of a token, and the attribute column is used to show which lexeme
of the token is used.
For every lexeme, the 1st and 2nd columns of the above table are sent to the Syntax Analyzer.
Tokens cont’d……
Specifications of Tokens
To answer the question “how the lexical analyzer can check the validity of lexemes with
tokens”, it is critical to know the following specifications of tokens:
1) Alphabet
2) Strings
3) Special symbols
4) Language
5) Regular expression
6) etc……
Let us understand how the language theory undertakes these terms:
1. Alphabets
Any finite set of symbols
{0,1} is a set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets,
{a-z, A-Z} is a set of English language alphabets.
2. Strings
Any finite sequence of alphabets (characters) is called a string.
A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
Tokens cont’d……
In language theory, the terms sentence and word are often used as synonyms for the term
"string."
Length of a string S is the total number of occurrences of symbols in it, denoted by |S|,
e.g., the length of the string compiler is 8 and is denoted by |compiler| = 8
A string having no alphabets, i.e. a string of zero length is known as an empty string and is
denoted by ε (epsilon).
3. Special symbols
A typical high-level language contains the following special symbols:-
Arithmetic symbols:   Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation:          Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment:           =
Special assignment:   +=, /=, *=, -=
Comparison:           ==, !=, <, <=, >, >=
Preprocessor:         #
Location specifier:   &
Logical:              &, &&, |, ||, !
Shift operators:      >>, >>>, <<, <<<
Tokens cont’d……
4. Language
A language is a set of strings over some fixed, finite alphabet.
Computer languages are considered sets, and mathematical set operations can be
performed on them.
Regular languages can be described by means of regular expressions.
5. Regular Expressions
Regular expressions are an important notation to specify lexeme patterns for a token.
Each pattern matches a set of strings, so regular expressions serve as names for a set of
strings.
Regular expressions are used to represent the language for the lexical analyzer.
The lexical analyzer needs to scan and identify only the finite set of valid strings/tokens/lexemes
that belong to the language in hand.
It searches for the pattern defined by the language rules.
A grammar defined by regular expressions is known as regular grammar
The language defined by regular grammar is known as regular language.
Tokens cont’d……
Programming language tokens can be described by regular languages.
There are a number of algebraic laws obeyed by regular expressions; they are built from the
operations on languages described below.
Operations on languages
There are several important operations that can be applied to languages.
Union of two languages L and M is written as;
L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as;
LM = {st | s is in L and t is in M}
Kleene closure of a language L is written as;
L* = Zero or more occurrence of language L
Example: the following shows these operations applied to two small languages:
Let L={0,1} and S={a,b,c}
Union : L U S={0,1,a,b,c}
Concatenation : L.S={0a,1a,0b,1b,0c,1c}
Kleene closure : L*={ ε,0,1,00….}
Positive closure : L+={0,1,00….}
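As an illustrative sketch in C, with the two example languages hard-coded, the union and concatenation above can be enumerated directly (Kleene closure is not enumerated because it is an infinite set):

#include <stdio.h>

int main(void) {
    const char *L[] = { "0", "1" };       /* L = {0, 1}    */
    const char *S[] = { "a", "b", "c" };  /* S = {a, b, c} */

    /* Union: every string that is in L or in S. */
    printf("L U S = { ");
    for (int i = 0; i < 2; i++) printf("%s ", L[i]);
    for (int j = 0; j < 3; j++) printf("%s ", S[j]);
    printf("}\n");

    /* Concatenation: every string st with s in L and t in S. */
    printf("L.S = { ");
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf("%s%s ", L[i], S[j]);
    printf("}\n");
    return 0;
}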
Tokens cont’d……
In lexical analysis, using regular expressions it is possible to represent:
i. valid tokens of a language,
ii. occurrences of symbols, and
iii. language tokens;
i. Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, … }; equivalently, x+ = x.x*
x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}.
[a-z] is all lower-case alphabets of English language.
[A-Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.
Tokens cont’d……
ii. Representation of occurrence of symbols using regular expressions
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, i.e., [0-9]
sign = [+ | -]
iii. Representation of language tokens using regular expressions
Decimal = (sign)? (digit)+
Identifier = (letter) (letter | digit)*
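As a sketch of what the Decimal pattern means operationally, the hypothetical C function below accepts exactly the strings matching (sign)?(digit)+:

#include <ctype.h>

/* Returns 1 if s matches (sign)?(digit)+ : an optional leading
   + or - followed by one or more digits, and nothing else. */
int is_decimal(const char *s) {
    if (*s == '+' || *s == '-') s++;            /* (sign)? : at most one sign */
    if (!isdigit((unsigned char)*s)) return 0;  /* (digit)+ needs at least one */
    while (isdigit((unsigned char)*s)) s++;     /* consume the remaining digits */
    return *s == '\0';                          /* nothing may follow */
}

For example, is_decimal("-42") returns 1, while is_decimal("4x2") and is_decimal("+") return 0.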
However, the only problem left with the lexical analyzer is how to verify the validity of a
regular expression used in specifying the patterns of keywords of a language.
A well-accepted solution to this problem is to use finite automata for verification.
To recognize and verify the tokens, the lexical analyzer builds Finite Automata for every pattern.
Transition diagrams can be built and converted into programs as an intermediate step.
Each state in the transition diagram represents a piece of code.
Every identified lexeme walks through the Automata.
The programs built from Automata can consist of switch statements to keep track of the state of the
lexeme. The lexeme is verified to be a valid token if it reaches the final state.
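For instance, a transition diagram for Identifier = (letter)(letter | digit)* has a start state and one accepting state; a hypothetical hand translation into a C switch statement might look like this:

#include <ctype.h>

/* State 0 = start, state 1 = accepting. The lexeme is a valid
   identifier iff the walk ends in the accepting state. */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        switch (state) {
        case 0:                                    /* first symbol must be a letter */
            if (isalpha((unsigned char)*s)) state = 1;
            else return 0;
            break;
        case 1:                                    /* letters or digits may follow */
            if (isalnum((unsigned char)*s)) state = 1;
            else return 0;
            break;
        }
    }
    return state == 1;   /* valid token iff we reached the final state */
}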
2.3. Lexical Error Recovery
Lexical errors:
are the type of error that can be detected during the lexical analysis phase
a lexical error is a sequence of characters that does not match the pattern of any token,
and so cannot be scanned into any valid token
they are thrown by the lexer when it is unable to continue, i.e., when there is no way to
recognize a lexeme as a valid token
Lexical errors are not very common, but they should be managed by the scanner.
Some common lexical errors in the lexical phase are:
spelling errors in identifiers, operators, keywords, etc.
appearance of some illegal character
exceeding the length limit of identifiers or numeric constants
removal of a character that should be present
replacement of a character with an incorrect character
transposition of two characters
Lexical Error cont’d……
Example: consider this C code:
void main() {
    int x = 10, y = 20;
    char *a;
    a = &x;
    x = 1xab;
}
In this code, 1xab is neither a number nor an identifier,
so this code will show a lexical error.
Lexical error recovery: there are some recovery mechanisms to remove lexical errors.
Some possible error-recovery actions, illustrated by repairs of the misspelled "cout"
(a sketch of action i follows this list), are:
i. deleting an unnecessary character, e.g. coutt -> cout
ii. inserting a missing character, e.g. cot -> cout
iii. replacing an incorrect character by a correct character, e.g. couf -> cout
iv. transposing two adjacent characters, e.g. ocut -> cout
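As a rough sketch of action i only (the other three actions have the same shape), the hypothetical C function below tests whether deleting one character of a misspelled lexeme yields a known keyword:

#include <string.h>

/* Returns 1 if deleting exactly one character of lexeme yields
   keyword (e.g. "coutt" -> "cout"); returns 0 otherwise. */
int fixable_by_deletion(const char *lexeme, const char *keyword) {
    size_t n = strlen(lexeme);
    if (n != strlen(keyword) + 1) return 0;
    for (size_t skip = 0; skip < n; skip++) {      /* try deleting each position */
        size_t j, k;
        for (j = 0, k = 0; j < n; j++) {
            if (j == skip) continue;               /* the character being deleted */
            if (lexeme[j] != keyword[k++]) break;  /* mismatch: try next position */
        }
        if (j == n) return 1;                      /* all remaining characters matched */
    }
    return 0;
}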
However, a few errors are beyond the power of the lexical analyzer to recognize, because a
lexical analyzer has a very localized view of the source program; some other phase of the
compiler handles such errors.
For instance, if the string fi is encountered for the first time in a C/C++ program in the context of:
fi (a == b) …
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier.
2.4. Finite Automata: NFA to DFA Conversion
A finite automaton is a state machine that takes a string of symbols as input and changes its
state accordingly.
A finite automaton is a recognizer for regular expressions.
When a regular expression string is fed into a finite automaton, it changes its state for each literal.
If the input string is successfully processed and the automaton reaches its final state, the string
is accepted,
i.e., the string that was fed is said to be a valid token of the language in hand.
Regular expressions = the specification
Finite automata = the implementation
A finite automaton consists of
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions of the form: state --input--> state
Automata: NFA to DFA cont’d……
Transition: s1 --a--> s2
This can be read as: in state s1, on input "a", go to state s2.
At the end of the input:
if in an accepting state => accept; otherwise => reject.
If no transition is possible => reject.
Finite automata state graphs can be built up using:
a state (drawn as a circle)
the start state (marked with an incoming arrow)
an accepting state (drawn as a double circle)
a transition (an arrow between states labeled with an input symbol, e.g. "a")
Simple Example: a finite automaton that accepts only "1"
[Diagram: the start state has a single transition on input 1 to an accepting state.]
Automata: NFA to DFA cont’d……
A finite automaton accepts a string if we can follow transitions labeled with the characters in the
string from the start to some accepting state
Another Example: a finite automaton accepting any number of 1's followed by a single 0
Alphabet: {0, 1}
[Diagram: the start state has a self-loop on input 1 and a transition on input 0 to an accepting state.]
Check that "1110" is accepted by this finite automaton.
Exercise: given the alphabet {0,1}, what language is recognized by this automaton machine?
[Exercise diagram omitted.]
Epsilon Moves
Another kind of transition: ε-moves
A --ε--> B: here the machine can move from state A to state B without
reading any input.
Automata: NFA to DFA cont’d……
Types of Finite Automata
i. Non-Deterministic Automata (NFA).
ii. Deterministic Automata (DFA)
i. Nondeterministic Finite Automata (NFA)
Can have multiple transitions for one input in a given state
Can have ε-moves
An NFA accepts if it can reach a final state
ii. Deterministic Finite Automata (DFA):
A DFA is a special case of an NFA in which:
it has at most one transition per input from any state, and
it has no ε-moves, i.e., no transitions on input ε.
A DFA is formally defined by the 5-tuple notation M = (Q, Σ, δ, q0, F), where
Q is a finite, non-empty set of states,
Σ is the input alphabet (the input set),
q0 is the initial state, with q0 ∈ Q,
F ⊆ Q is the set of final states, and
δ is the transition (mapping) function; using this function the next state can be determined.
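For example, the earlier automaton that accepts any number of 1's followed by a single 0 can be written as M = ({q0, q1}, {0, 1}, δ, q0, {q1}), with δ(q0, 1) = q0 and δ(q0, 0) = q1; to make δ total, the remaining transitions go to a non-accepting dead state.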
Automata: NFA to DFA cont’d……
Reading assignment
Execution of finite automata
Details of NFA vs. DFA
Converting a regular expression into a minimized DFA
Regular expressions to finite automata
NFA to DFA conversion
Implementation of DFA
You can refer to further resources for a detailed elaboration.
2.5. Lexical Analyzer Generator
Creating a lexical analyzer with Lex:
First, a lexical analyzer is prepared by creating a program lex.l in the Lex language.
Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out;
a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
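On a typical Unix system, assuming the classic lex (or flex) tool and a C compiler are installed, the same pipeline is driven by commands such as:

lex lex.l           (produces lex.yy.c)
cc lex.yy.c -ll     (compiles it, linking the Lex library; use -lfl with flex)
./a.out             (runs the generated lexical analyzer)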
Lexical Analyzer cont’d……
■ Lex Specification: a Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }

For example, the first two parts of a vowel/consonant counter look like:

%{
int vowels = 0, cons = 0;
%}
%%
[aeiouAEIOU] {vowels++;}
[a-zA-Z]     {cons++;}
%%

where,
■ Definitions include declarations of variables, constants, and regular definitions
■ Rules are statements of the form p1 {action1} p2 {action2} … pn {actionn}
■ where each pi is a regular expression and
■ action describes what action the lexical analyzer should take when pattern pi matches a
lexeme.
■ Actions are written in C code.
■ User subroutines are auxiliary procedures needed by the actions.
■ These can be compiled separately and loaded with the lexical analyzer.
Lexical Analyzer cont’d……
■ Consider the following Lex program, which counts vowels and consonants:

%{
int vowels = 0;
int cons = 0;
%}

%%
[aeiouAEIOU] {vowels++;}
[a-zA-Z]     {cons++;}
%%

int yywrap() {
    return 1;
}

int main() {
    printf(" Enter any string to count vowels and consonants; at end press ^d\n");
    yylex();
    printf("no: of vowels are: %d\n", vowels);
    printf("no of consonants: %d\n", cons);
    return 0;
}

Steps to execute this Lex program:
First write the source code in the EditPlusPortable editor or any editor with Lex tools, then:
Tools -> 'Lex File Compiler'
Tools -> 'Lex Build'
Tools -> 'Open CMD'
Then in the command prompt type 'name_of_file.exe' (e.g., '[Link]') and press Enter.
Then enter your whole input and press Enter.
Finally press Ctrl + Z and press Enter; then you see the output.
Lexical Analyzer cont’d……
■ The output for the above program will look like: [output screenshot omitted]
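For example, a hand-reconstructed run on the input "hello world" would look roughly like this (blanks and newlines fall through to Lex's default echo rule):

 Enter any string to count vowels and consonants; at end press ^d
hello world
no: of vowels are: 3
no of consonants: 7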
Next class
Chapter 3: Syntax Analysis
Outline
3.1. Role of a parser
3.2. Parsing
3.3. Types of parsing
3.4. Parser Generator: Yacc