1.4 Lexical analysis
Now, let us see “What is lexical analysis?”
Definition: The process of reading the source program and converting it into
tokens is called lexical analysis.
1.4.1 The role of lexical analyzer
Now, let us see “What is the role of lexical analyzer?” The lexical analyzer is the first
phase of the compiler. The various tasks that are performed by the lexical analyzer are:
* Read a sequence of characters from the source program and produce the tokens.
* The tokens thus generated are sent to the parser for syntax analysis. The parser is
also called syntax analyzer.
During this process, the lexical analyzer interacts with the symbol table to insert
identifiers and constants. Sometimes, information about identifiers is read from the symbol
table to assist in determining the proper token to send to the parser.
The interaction between the lexical analyzer and the parser is pictorially represented as
shown below:
[Figure: the source program is read by the scanner (lexical analyzer), which returns a token each time getNextToken() is called; the output of parsing goes on to semantic analysis.]
* The parser program calls the function getNextToken(), which is the function defined
in the lexical analyzer.
[Figure: calling sequence in which the parser program calls getNextToken() and the lexical analyzer program executes "return token;".]
* The function getNextToken() of the lexical analyzer returns the token back to the parser for
parsing.
* If the token obtained is an identifier, it is entered into the symbol table along with
various attribute values, and the lexical analyzer returns a token as a pair consisting of an integer code
denoted by ID and a pointer to the symbol table entry for that identifier.
The other actions that are performed by the lexical analyzer are:
* Removes comments from the program.
* Removes white spaces such as blanks, tabs and newline characters from the
source program, and then tokens are obtained.
* Keeps track of line numbers so as to associate line numbers with error
messages.
* If any errors are encountered, the lexical analyzer displays appropriate error
messages along with line numbers.
* Preprocessing may be done during the lexical analysis phase.
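The housekeeping actions listed above can be sketched in C as follows. This is only a minimal illustration, not the book's code: the function name skipBlanksAndComments and its interface are assumptions made for this example, and only C-style /* ... */ comments are handled.

```c
#include <stddef.h>

/* Hypothetical helper (not from the text): advance *p past white space and
   C-style comments, counting newlines in *line so that later error messages
   can carry a line number.
   Returns 0 on success, -1 if a comment is never closed. */
int skipBlanksAndComments(const char **p, int *line)
{
    const char *s = *p;

    for (;;) {
        if (*s == ' ' || *s == '\t') {
            s++;                              /* blanks and tabs */
        } else if (*s == '\n') {
            (*line)++;                        /* keep track of line numbers */
            s++;
        } else if (s[0] == '/' && s[1] == '*') {
            s += 2;                           /* inside a comment */
            while (*s != '\0' && !(s[0] == '*' && s[1] == '/')) {
                if (*s == '\n')
                    (*line)++;
                s++;
            }
            if (*s == '\0') {                 /* unterminated comment */
                *p = s;
                return -1;
            }
            s += 2;                           /* skip the closing star-slash */
        } else {
            break;                            /* first character of a lexeme */
        }
    }
    *p = s;
    return 0;
}
```

A caller would invoke this helper before recognizing each lexeme; when it returns -1, the value in *line is the line number to report with the error message.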
Example 26: Obtain a grammar to generate the following language:
L = {ww^R where w ∈ {a, b}*}
Solution: The language can be written as:
L = {ε, aa, bb, abba, baab, aaaa, bbbb, ...}
Observe that every string is a palindrome of even length. Starting from the palindrome
grammar S → ε | a | b | aSa | bSb, even length is achieved by
deleting the productions S → a | b. So, the final grammar is given by:
S → aSa | bSb | ε
Note: In the above grammar, if the production
S → ε
is replaced by
S → c
the resulting grammar will generate the language L = {wcw^R | w ∈ {a, b}*}.
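As a quick check of the grammar S → aSa | bSb | ε, the following C sketch tests membership in L = {ww^R} by applying the productions from the outside in. The function names deriveS and isEvenPalindrome are hypothetical, chosen only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Try to derive s[lo..hi) from S using S -> aSa | bSb | epsilon. */
bool deriveS(const char *s, size_t lo, size_t hi)
{
    if (lo == hi)
        return true;                  /* S -> epsilon */
    if (hi - lo < 2)
        return false;                 /* one symbol left: odd length */
    if ((s[lo] == 'a' || s[lo] == 'b') && s[lo] == s[hi - 1])
        return deriveS(s, lo + 1, hi - 1);   /* S -> aSa or S -> bSb */
    return false;
}

/* True when s belongs to L = { w w^R | w in {a, b}* }. */
bool isEvenPalindrome(const char *s)
{
    return deriveS(s, 0, strlen(s));
}
```

Each recursive call corresponds to one use of S → aSa or S → bSb, and the base case corresponds to S → ε, so the strings abba and baab are accepted while aba (odd length) is rejected.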
1.5 Input buffering
Now, let us see “Why input buffering is required?” Input buffering is essential for
the following reasons:
* Since the lexical analyzer is the first phase of the compiler, it is the only phase of the
compiler that reads the source program character by character, and it spends
considerable time in reading the source program. Thus, the speed of lexical analysis is
a concern.
* Lexical analyzers may have to look one or more characters beyond the next lexeme
before we have the right lexeme.
For this reason, we use the concept of input buffering, where a block of 1024 or 4096 or
more characters is read in one read operation and stored in an array to speed
up lexical analysis.
& Compiler Design - 1.25
Now, let us see “What is input buffering?”
Definition: The method of reading a block of characters (1K or 4K or more bytes) from
the disk in one read operation and storing it in memory (normally in the form of an array)
for further processing and faster accessing is called input buffering. The memory (an
array) where a block of characters read from the disk is stored is called a buffer.
[Figure: layout of the two buffers, Buffer 1 and Buffer 2.]
* The size of each buffer is N, where N is usually the size of the disk block. If the size of
the disk block is 4K, in one read operation 4096 characters can be read into the buffer
using one system command, rather than using one system call per character, which
consumes a lot of time.
* Irrespective of the number of characters stored in the buffer, the last character of each buffer
is eof.
* Note that eof retains its use as a marker for the end of the entire input. Any eof that
appears other than at the end of a buffer means that the input is at an end.
* The use of the two pointers lexemeBeginning and inputPointer and the method of
accessing the lexeme remain the same as in buffer pairs.
* The algorithm consisting of lookahead code with sentinels is shown below:
switch (*inputPointer++)
{
    case eof:
        if (inputPointer is at end of first buffer)
        {
            reload the second buffer;
            inputPointer = beginning of second buffer;
            break;
        }
        if (inputPointer is at the end of second buffer)
        {
            reload the first buffer;
            inputPointer = beginning of first buffer;
            break;
        }
        /* eof within a buffer indicates the end of the input */
        /* So, terminate lexical analysis */
        break;
    /* Cases for other characters */
}
Observe from the above algorithm that instead of having two tests as in the buffer pair
technique, there is only one test, i.e., testing for the eof marker.
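The sentinel scheme can be turned into runnable C roughly as follows. This is a sketch under stated assumptions rather than the book's implementation: the "disk" is an in-memory string, the sentinel eof is represented by the NUL character (so the source text must not contain NUL), N is kept tiny so that reloads actually happen, and the names initInput, reload and nextChar are hypothetical. The lexemeBeginning pointer is left out for brevity.

```c
#include <stddef.h>
#include <string.h>

#define N 8             /* tiny buffer size so that reloads are easy to observe */
#define EOF_MARK '\0'   /* sentinel; assumes the source text contains no NUL */

static const char *src;        /* hypothetical in-memory stand-in for the disk */
static size_t srcPos;
static char buf[2 * (N + 1)];  /* two halves, each followed by a sentinel slot */
static char *inputPointer;

/* Copy up to N characters into one half and put the sentinel after them. */
void reload(char *half)
{
    size_t left = strlen(src) - srcPos;
    size_t n = left < N ? left : N;

    memcpy(half, src + srcPos, n);
    srcPos += n;
    half[n] = EOF_MARK;
}

void initInput(const char *text)
{
    src = text;
    srcPos = 0;
    reload(buf);
    inputPointer = buf;
}

/* Return the next character, or EOF_MARK at the real end of the input. */
char nextChar(void)
{
    for (;;) {
        char c = *inputPointer++;

        if (c != EOF_MARK)
            return c;                        /* the only test on the common path */
        if (inputPointer == buf + N + 1) {
            reload(buf + N + 1);             /* end of first half: pointer already
                                                sits at the start of the second */
        } else if (inputPointer == buf + 2 * (N + 1)) {
            reload(buf);                     /* end of second half */
            inputPointer = buf;
        } else {
            inputPointer--;                  /* eof inside a half: stay on it */
            return EOF_MARK;
        }
    }
}
```

After initInput(text), repeated calls to nextChar() return the characters of text in order and then EOF_MARK. The extra position checks run only when a sentinel is seen, which is the saving over the plain buffer pair technique.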
1.6 Specifications of tokens
Problems
Prove that L = {0^n 1^n | n ≥ 1} is not regular.
L = {01, 0011, 000111, ...}
1. Assume L is regular (method of contradiction).
2. Let n be a constant.
3. Split w = xyz such that
   1. |y| ≥ 1
   2. |xy| ≤ n
   3. xy^k z ∈ L for k ≥ 0
Let n = 2 and w = 0011, so that
xy = 00
x = 0, y = 0, z = 11
Assume k = 2: xy^k z = 00011 ∉ L.
Hence L = {0^n 1^n | n ≥ 1} is not regular.

Prove that L = {a^p | p is a prime} is not regular.
Let L be a regular language and let n be an integer constant.
Select a string w from L such that w = xyz.
L = {aa, aaa, aaaaa, ...}
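The pumping argument for L = {0^n 1^n | n ≥ 1} can also be checked mechanically. The C sketch below is an illustration only: the function names inL, pump and everySplitFails are hypothetical. It pumps w = 0^n 1^n with k = 2 for every split allowed by the conditions |y| ≥ 1 and |xy| ≤ n, and confirms that the result always falls outside L.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when s = 0^n 1^n for some n >= 1. */
bool inL(const char *s)
{
    size_t len = strlen(s), i = 0;

    if (len == 0 || len % 2 != 0)
        return false;
    for (; i < len / 2; i++)
        if (s[i] != '0')
            return false;
    for (; i < len; i++)
        if (s[i] != '1')
            return false;
    return true;
}

/* Write x y^k z into out, where x = w[0..xlen) and y = w[xlen..xlen+ylen).
   out must have room for the result and the final '\0'. */
void pump(const char *w, size_t xlen, size_t ylen, int k, char *out)
{
    size_t wlen = strlen(w), zlen = wlen - xlen - ylen, pos = xlen;

    memcpy(out, w, xlen);
    for (int i = 0; i < k; i++) {
        memcpy(out + pos, w + xlen, ylen);
        pos += ylen;
    }
    memcpy(out + pos, w + xlen + ylen, zlen);
    out[pos + zlen] = '\0';
}

/* For w = 0^n 1^n, true when every split with |y| >= 1 and |xy| <= n
   gives x y^2 z outside L (n is limited to 30 to fit the local arrays). */
bool everySplitFails(int n)
{
    char w[64], out[128];

    if (n < 1 || n > 30)
        return false;
    memset(w, '0', (size_t)n);
    memset(w + n, '1', (size_t)n);
    w[2 * n] = '\0';
    for (size_t xlen = 0; xlen < (size_t)n; xlen++)
        for (size_t ylen = 1; xlen + ylen <= (size_t)n; ylen++) {
            pump(w, xlen, ylen, 2, out);
            if (inL(out))
                return false;
        }
    return true;
}
```

Because |xy| ≤ n forces y to consist only of 0's, pumping adds 0's without adding 1's, which is why no split survives. With w = 0011, x = 0 and y = 0, pump produces 00011, the string used in the proof above.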
In the above statement, the patterns, lexemes and respective tokens are shown below:
Pattern          Lexeme      Token (symbolic names defined using #define)
keyword          char        CHAR
identifier       str         <ID, 1>
left bracket     [           LEFT_BRACKET
right bracket    ]           RIGHT_BRACKET
operator         =           ASSIGN
string           "hel...     <LITERAL, ...>
symbol           ;           SEMI_COLON
1.4.3.1 Tokens
Now, let us see “What is a token?”
Definition: A token is a pair consisting of a token name and an optional attribute value.
The token names are basically integer codes represented using symbolic names such as
INT, FLOAT, SEMI_COLON etc., defined in the file token.h.
* The attribute values are optional and will not be present for keywords, operators
and symbols. The attribute values are present for all identifiers and constants.
* For every keyword there is a unique token name. For example, INT, FLOAT, CHAR, ...
* For every symbol there is a unique token name. For example, SEMI_COLON,
COLON, COMMA, LEFT_PARANTHESIS, RIGHT_PARANTHESIS, ASSIGN.
* For an identifier sum, the token is <ID, 1> where ID is the token name and 1 is the
position of the identifier in the symbol table.
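A token as a (name, attribute) pair might be represented in C as sketched below. Everything here is an assumption made for illustration: the integer codes, the NO_ATTRIBUTE marker, the fixed-size symbol table and the helper names install, makeKeyword and makeIdentifier do not come from the text.

```c
#include <string.h>

/* Hypothetical integer codes; a real token.h would define its own. */
#define ID            1
#define INT           2
#define SEMI_COLON    3
#define NO_ATTRIBUTE  (-1)   /* keywords, operators and symbols carry no attribute */
#define MAX_SYMBOLS   64

typedef struct {
    int name;        /* token name (integer code) */
    int attribute;   /* symbol table position, or NO_ATTRIBUTE */
} Token;

/* Toy symbol table; positions start at 1. It stores the pointers it is
   given, so the lexeme strings must outlive the table. */
static const char *symbolTable[MAX_SYMBOLS + 1];
static int symbolCount = 0;

/* Return the position of lexeme, entering it first if it is new.
   Returns 0 when the table is full. */
int install(const char *lexeme)
{
    for (int i = 1; i <= symbolCount; i++)
        if (strcmp(symbolTable[i], lexeme) == 0)
            return i;
    if (symbolCount == MAX_SYMBOLS)
        return 0;
    symbolTable[++symbolCount] = lexeme;
    return symbolCount;
}

Token makeKeyword(int name)
{
    Token t = { name, NO_ATTRIBUTE };
    return t;
}

Token makeIdentifier(const char *lexeme)
{
    Token t = { ID, install(lexeme) };
    return t;
}
```

With an empty table, makeIdentifier("sum") yields the pair <ID, 1>, matching the identifier example above, while makeKeyword(INT) yields a token with no attribute value.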
Whenever there is a request from the parser, the lexical analyzer sends the token. So,
tokens are the output of the lexical analyzer and the input to the syntax analyzer. The syntax
analyzer uses these tokens to check whether the program is syntactically correct or not by
deriving the tokens from the grammar. All the tokens are represented using symbolic
constants defined using the #define directive as shown below:
/* TOKENS with corresponding integer codes for keywords */
#define ...
#define ...
#define ...
#define ...
#define ...
#define ...
1.4.3.2 Lexeme
Now, let us see “What is a lexeme?”
Definition: A sequence of characters in the source program that matches a pattern,
such as identifiers, numbers, relational operators, arithmetic operators and symbols such as #,
[, ], (, ) and so on, is called a lexeme. In other words, a lexeme is a string of characters read
from the source file that corresponds to a token.
1.4.3.3 Patterns
Now, let us see “What is a pattern?”
Definition: The description of a lexeme is called a pattern. More formally, a pattern is
a rule describing the set of lexemes. The various patterns are shown below:
* Keyword: The pattern keyword is a sequence of characters that forms a reserved word
of a language. For example, int, if, else, while, do, switch etc. are all reserved words.
They are also called keywords.
* Identifier: The pattern identifier is described as a letter or an underscore
followed by any number of letters, digits or underscores. For example, sum, i, pos,
first, rate_of_interest that represent variables in a program or that represent names of
functions, structures etc. are all treated as identifiers.
* Relational Operator: The pattern relational operator is described as a set of
symbols that represent the various relational operators of a language. For example, <, <=, >, >=, == and
!= represent patterns identifying the relational operators.
* Symbols: The pattern symbols is described as a set of symbols such as #, $, (, ), {,
}, : and so on.
Example 1.2: Identify lexemes and tokens in the following statement:
printf("Simple Interest = %f\n", si);
Solution: The lexemes, patterns and tokens for the given printf statement are shown
below:
* printf is a lexeme matching the pattern identifier and returning the token
<ID, 1> where ID is the token name and 1 is the position of identifier printf in the symbol
table.
* The character ( is a lexeme matching the pattern symbol and returning the token
LEFT_PARANTHESIS.
* The sequence of characters "Simple Interest = %f\n" is a lexeme matching the
pattern string and returning the token <LITERAL, 2> where LITERAL is the token
name and 2 is the position of the literal in the symbol table.
* The character , is a lexeme matching the pattern symbol and returning the token
COMMA.
* si is a lexeme matching the pattern identifier and returning the token <ID, 3> where
ID is the token name and 3 is the position of identifier si in the symbol table.
* The character ) is a lexeme matching the pattern symbol and returning the token
RIGHT_PARANTHESIS.
* The character ; is a lexeme matching the pattern symbol and returning the token
SEMI_COLON.
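The lexeme-by-lexeme breakdown above can be reproduced with a small scanner. The C sketch below is a simplified illustration, not the book's lexical analyzer: the enum codes are arbitrary, getNextToken here takes explicit arguments (an assumed interface), it returns only the token name without entering anything into a symbol table, and string literals containing an escaped quote are not handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token names; the integer codes are arbitrary. */
enum { END, ID, LITERAL, LEFT_PARANTHESIS, RIGHT_PARANTHESIS,
       COMMA, SEMI_COLON, UNKNOWN };

/* Append c to lexeme if there is room for it and the final '\0'. */
static void put(char *lexeme, size_t cap, size_t *n, char c)
{
    if (*n + 1 < cap)
        lexeme[(*n)++] = c;
}

/* Copy the next lexeme from *p into lexeme (capacity cap >= 1),
   advance *p past it and return the token name. */
int getNextToken(const char **p, char *lexeme, size_t cap)
{
    const char *s = *p;
    size_t n = 0;
    int token = UNKNOWN;

    while (*s == ' ' || *s == '\t' || *s == '\n')
        s++;                                         /* skip white space */

    if (*s == '\0') {
        token = END;
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        /* pattern identifier: letter or underscore, then letters,
           digits or underscores */
        while (isalnum((unsigned char)*s) || *s == '_')
            put(lexeme, cap, &n, *s++);
        token = ID;
    } else if (*s == '"') {
        /* pattern string: everything up to the closing quote */
        put(lexeme, cap, &n, *s++);
        while (*s != '\0' && *s != '"')
            put(lexeme, cap, &n, *s++);
        if (*s == '"') {
            put(lexeme, cap, &n, *s++);
            token = LITERAL;
        }
    } else {
        /* pattern symbol: a single character */
        switch (*s) {
        case '(': token = LEFT_PARANTHESIS;  break;
        case ')': token = RIGHT_PARANTHESIS; break;
        case ',': token = COMMA;             break;
        case ';': token = SEMI_COLON;        break;
        default:  token = UNKNOWN;           break;
        }
        put(lexeme, cap, &n, *s++);
    }
    lexeme[n] = '\0';
    *p = s;
    return token;
}
```

Calling getNextToken repeatedly on the printf statement of Example 1.2 yields ID, LEFT_PARANTHESIS, LITERAL, COMMA, ID, RIGHT_PARANTHESIS and SEMI_COLON in that order, followed by END.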
Example: Obtain the grammar to generate the language
L = {0^m 1^m 2^n | m ≥ 1 and n ≥ 0}
Solution: Given the language, the productions can be generated as shown below:
L = {0^m 1^m 2^n | m ≥ 1 and n ≥ 0}, where 0^m 1^m is generated by A and 2^n is generated by B:
S → AB ... (1)
where
* The variable A should produce m number of 0's followed by m number of
1's with a minimum string 01 (since m ≥ 1). This is achieved using the
following production:
A → 01 | 0A1 [Similar to example 20, page 4.17]
* B should produce any number of 2's. Any number of 2's can be generated
using the production:
B → ε | 2B [Similar to example 1, page 4.7]
So, the final grammar to generate the given language is:
S → AB
A → 01 | 0A1
B → ε | 2B
The following grammar also generates the same language. The reader is required
to verify the answer.
S → A | S2
A → 01 | 0A1
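One way to test candidate strings against L = {0^m 1^m 2^n | m ≥ 1 and n ≥ 0} is a direct membership check, sketched below in C. The function name inLanguage is hypothetical; the check counts symbols rather than simulating a derivation, so it describes the language both grammars are meant to generate.

```c
#include <stdbool.h>
#include <stddef.h>

/* True when s = 0^m 1^m 2^n with m >= 1 and n >= 0. */
bool inLanguage(const char *s)
{
    size_t zeros = 0, ones = 0;

    while (*s == '0') { zeros++; s++; }   /* the 0's produced by A */
    while (*s == '1') { ones++;  s++; }   /* the matching 1's produced by A */
    while (*s == '2') s++;                /* any number of 2's (B, or S -> S2) */

    return *s == '\0' && zeros >= 1 && zeros == ones;
}
```

For instance, 01, 012 and 0011222 are accepted, while 001, 2 and the empty string are rejected, in agreement with A → 01 | 0A1 requiring at least one 0 and one 1.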