Introduction to Compilers
1.1 Compilation and Interpretation
1.1.1 Compiler
Compiler is a translator which is used to convert programs in high-level language to low-level language. It translates the entire program and also reports the errors in the source program encountered during the translation.
Figure 1.1 Compiler (source program → compiler → target program; error messages are reported)
1.1.2 Interpreter
Interpreter is a translator which is used to convert programs in high-level language to low-level language. Interpreter translates line by line and reports the error once it is encountered during the translation process.
It directly executes the operations specified in the source program when the input is given by the user.
It gives better error diagnostics than a compiler.
Figure 1.2 Interpreter (source program and input → interpreter → output)
Table 1.1 Differences between compiler and interpreter

S.No. | Compiler | Interpreter
1. | Performs the translation of a program as a whole. | Performs statement-by-statement translation.
2. | Execution is faster. | Execution is slower.
3. | Requires more memory, as linking is needed for the generated intermediate object code. | Memory usage is efficient, as no intermediate object code is generated.
4. | Debugging is hard, as the error messages are generated only after scanning the entire program. | It stops translation when the first error is met. Hence, debugging is easy.
5. | Programming languages like C and C++ use compilers. | Programming languages like Python, BASIC and Ruby use interpreters.
1.1.3 Assembler
Assembler is a translator which is used to translate the assembly language code into machine language code.
Figure 1.3 Assembler (assembly language code → assembler → machine language code)
1.2 Language Processing System

Skeletal source program
  ↓ Preprocessor
Source program
  ↓ Compiler
Target assembly program
  ↓ Assembler
Re-locatable machine program
  ↓ Loader/link-editor ← library re-locatable object files
Absolute machine program

Figure 1.4 Language processing system
Preprocessor
A source program may be divided into modules stored in separate files. The task of collecting the source program is entrusted to a separate program called the preprocessor. It may also expand macros into source language statements.
Compiler
Compiler is a program that takes the source program as input and produces an assembly language program as output.
Assembler
Assembler is a program that converts an assembly language program into a machine language program. It produces re-locatable machine code as its output.
Loader and link-editor
✓ The re-locatable machine code has to be linked together with other re-locatable object files and library files into the code that actually runs on the machine.
✓ The linker resolves external memory addresses, where the code in one file may refer to a location in another file.
✓ The loader puts together all the executable object files into memory for execution.
1.3 Structure of a Compiler
The structure of a compiler consists of two parts: the analysis part and the synthesis part.
Analysis part
✓ Analysis part breaks the source program into constituent pieces and imposes a grammatical structure on them, which it further uses to create an intermediate representation of the source program.
✓ It is also termed the front end of the compiler.
✓ Information about the source program is collected and stored in a data structure called the symbol table.
Figure 1.5 Analysis part (source program → analysis phase → intermediate representation)
Synthesis part
✓ Synthesis part takes the intermediate representation as input and transforms it to the target program.
✓ It is also termed the back end of the compiler.
Figure 1.6 Synthesis part (intermediate representation → synthesis phase → target program)
The compilation process can be divided into several phases, each of which converts one form of the source program to another.
The different phases of compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve the following tasks:
✓ Symbol table management.
✓ Error handling.
character stream
  ↓ Lexical Analyzer
token stream
  ↓ Syntax Analyzer
syntax tree
  ↓ Semantic Analyzer
syntax tree
  ↓ Intermediate Code Generator
intermediate representation
  ↓ Machine-Independent Code Optimizer
intermediate representation
  ↓ Code Generator
target-machine code
  ↓ Machine-Dependent Code Optimizer
target-machine code
(The symbol table is consulted and updated by all phases.)

Figure 1.7 Phases of a compiler
1.3.1 Lexical Analysis
✓ Lexical analysis is the first phase of compiler, which is also termed scanning.
✓ The source program is scanned to read the stream of characters, and those characters are grouped to form sequences called lexemes, for which tokens are produced as output.
✓ Token Token is a sequence of characters that represents a lexical unit matching with a pattern, such as keywords, operators, identifiers, etc.
✓ Lexeme Lexeme is an instance of a token, i.e., a group of characters forming a token.
✓ Pattern Pattern describes the rule that the lexemes of a token take. It is the structure that must be matched by strings.
✓ Once a token is generated, the corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token template: <token-name, attribute-value>
e.g., c = a + b * 5;
Table 1.2 Lexemes and tokens

Lexemes | Tokens
c | identifier
= | assignment symbol
a | identifier
+ | + (addition symbol)
b | identifier
* | * (multiplication symbol)
5 | 5 (number)

Hence, the statement is represented by the token stream <id,1> <=> <id,2> <+> <id,3> <*> <5>.
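The <token-name, attribute-value> pairs above can be modelled directly in code. The following C sketch (token names and layout are illustrative, not from the text) prints the token stream for c = a + b * 5:

#include <stdio.h>

/* Illustrative token names for the statement c = a + b * 5 */
enum token_name { ID, ASSIGN, PLUS, MUL, NUM };

/* A token pairs a token name with an attribute value: for an id, an
 * index into the symbol table; for a number, the value itself. */
struct token {
    enum token_name name;
    int attribute;
};

int main(void) {
    struct token stream[] = {
        {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0},
        {ID, 3}, {MUL, 0}, {NUM, 5}
    };
    for (int i = 0; i < 7; i++)
        printf("<%d,%d> ", stream[i].name, stream[i].attribute);
    return 0;
}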
1.3.2 Syntax Analysis
✓ Syntax analysis is the second phase of compiler, which is also called parsing.
✓ The parser converts the tokens produced by the lexical analyzer into a tree-like representation called a parse tree.
✓ A parse tree describes the syntactic structure of the input.
Figure 1.8 Parse tree for the assignment statement
✓ Syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes and the operands of an operator are the children of the node for that operator.
Input: Tokens
Output: Syntax tree
Figure 1.9 Syntax tree
1.3.3 Semantic Analysis
✓ Semantic analysis is the third phase of compiler.
✓ It checks for semantic consistency.
✓ Type information is gathered and stored in the symbol table or in the syntax tree.
✓ Performs type checking.
Figure 1.10 Semantic analysis
1.3.4 Intermediate Code Generation
✓ Intermediate code generation produces intermediate representations for the source program, which are of the following forms:
• Postfix notation
• Three-address code
• Syntax tree
The most commonly used form is the three-address code:
t1 = inttofloat(5)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Properties of intermediate code
✓ It should be easy to produce.
✓ It should be easy to translate into the target program.
1.3.5 Code Optimization
✓ Code optimization phase gets the intermediate code as input and produces optimized intermediate code as output.
✓ It results in faster running machine code.
✓ It can be done by reducing the number of lines of code for a program.
✓ This phase reduces the redundant code and attempts to improve the intermediate code so that faster-running machine code will result.
✓ During code optimization, the result of the program is not affected.
✓ To improve the code generation, the optimization involves
• Detection and removal of dead code (unreachable code).
• Calculation of constants in expressions and terms.
• Collapsing of repeated expressions into temporary variables.
• Loop unrolling.
• Moving code outside the loop.
• Removal of unwanted temporary variables.
1.3.6 Code Generation
✓ Code generation is the final phase of a compiler.
✓ It gets input from the code optimization phase and produces the target code or object code as result.
✓ Intermediate instructions are translated into a sequence of machine instructions that perform the same task.
✓ The code generation involves
• Allocation of registers and memory.
• Generation of correct references.
• Generation of correct data types.
• Generation of missing code.
LDF R2, id3
MULF R2, #5.0
LDF R1, id2
ADDF R1, R2
STF id1, R1
1.3.7 Symbol Table Management
✓ Symbol table is used to store all the information about identifiers used in the program.
✓ It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
✓ It allows finding the record for each identifier quickly and storing or retrieving data from that record.
✓ Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
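As a concrete illustration, a minimal symbol table can be a table of records with lookup and insert operations. This C sketch uses linear search and invented field names; real compilers use hashing and richer attribute fields:

#include <string.h>

#define MAXSYM 100

struct symrec {              /* one record per identifier */
    char name[32];           /* the lexeme */
    int  type;               /* attribute fields: type, scope, ... */
};

static struct symrec symtab[MAXSYM];
static int nsyms = 0;

/* Returns the index of the record for name, or -1 if absent. */
int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return i;
    return -1;
}

/* Inserts name if absent; returns the index of its record. */
int insert(const char *name, int type) {
    int i = lookup(name);
    if (i >= 0)
        return i;
    strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
    symtab[nsyms].type = type;
    return nsyms++;
}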
1.3.8 Error Handling
✓ Each phase can encounter errors. After detecting an error, a phase must handle the error so that compilation can proceed.
✓ In lexical analysis, errors occur in the separation of tokens.
✓ In syntax analysis, errors occur during the construction of the syntax tree.
✓ In semantic analysis, errors may occur in the following cases:
(i) When the compiler detects constructs that have the right syntactic structure but no meaning.
(ii) During type conversion.
✓ In code optimization, errors occur when the result is affected by the optimization. In code generation, it shows an error when code is missing, etc.
Figure 1.11 illustrates the translation of source code through each phase, considering the statement c = a + b * 5.
1.4 Errors Encountered in Different Phases
Each phase can encounter errors. After detecting an error, a phase must somehow deal with the error, so that compilation can proceed.
A program may have the following kinds of errors at various stages:
1.4.1 Lexical Errors
It includes incorrect or misspelled names of identifiers, i.e., identifiers typed incorrectly.
1.4.2 Syntactical Errors
It includes missing semicolons or unbalanced parentheses. Syntactic errors are handled by the syntax analyzer (parser).
When an error is detected, it must be handled by the parser to enable the parsing of the rest of the input. In general, errors may be expected at various stages of compilation, but most of the errors are syntactic errors and hence the parser should be able to detect and report those errors in the program.
The goals of the error handler in the parser are:
✓ Report the presence of errors clearly and accurately.
✓ Recover from each error quickly enough to detect subsequent errors.
✓ Add minimal overhead to the processing of correct programs.
There are four common error-recovery strategies that can be implemented in the parser to deal with errors in the code:
✓ Panic mode.
✓ Statement level.
✓ Error productions.
✓ Global correction.
1.4.3 Semantical Errors
These errors occur as a result of incompatible value assignments. The semantic errors that the semantic analyzer is expected to recognize are:
✓ Type mismatch.
✓ Undeclared variable.
✓ Reserved identifier misuse.
✓ Multiple declaration of a variable in a scope.
✓ Accessing an out-of-scope variable.
✓ Actual and formal parameter mismatch.
Logical errors
These errors occur due to non-reachable code, e.g., an infinite loop.
1.5 Grouping of Phases
The phases of a compiler can be grouped as:
Front end
Front end of a compiler consists of the phases
✓ Lexical analysis.
✓ Syntax analysis.
✓ Semantic analysis.
✓ Intermediate code generation.
Back end
Back end of a compiler contains
✓ Code optimization.
✓ Code generation.
1.5.1 Front End
✓ Front end comprises the phases which are dependent on the input (source language) and independent of the target machine (target language).
✓ It includes lexical and syntactic analysis, symbol table management, semantic analysis and the generation of intermediate code.
✓ Some code optimization can also be done by the front end.
✓ It also includes error handling at the phases concerned.
Figure 1.12 Front end (lexical analysis, syntax analysis, semantic analysis, intermediate code generation)
1.5.2 Back End
✓ Back end comprises those phases of the compiler that are dependent on the target machine and independent of the source language.
✓ This includes code optimization and code generation.
✓ In addition to this, it also encompasses error handling and symbol table management operations.
Figure 1.13 Back end (code optimization, code generation)
1.5.3 Passes
✓ The phases of a compiler can be implemented in a single pass by marking the primary actions, viz., reading of the input file and writing to the output file.
✓ Several phases of the compiler are grouped into one pass in such a way that the operations in each and every phase are incorporated during the pass.
✓ For example, lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.
1.5.4 Reducing the Number of Passes
✓ Minimizing the number of passes improves time efficiency, as reading from and writing to intermediate files can be reduced.
✓ When grouping phases into one pass, the entire program has to be kept in memory to ensure proper information flow to each phase, because one phase may need information in a different order than the order in which it is produced in a previous phase.
✓ The source program or target program differs from its internal representation, so the memory for the internal form may be larger than that of the input and output.
1.6 Compiler Construction Tools
Some commonly used compiler construction tools include:
1. Parser generators.
2. Scanner generators.
3. Syntax-directed translation engines.
4. Automatic code generators.
5. Data-flow analysis engines.
6. Compiler construction toolkits.
1.6.1 Parser Generators
Input Grammatical description of a programming language.
Output Syntax analyzers.
Parser generator takes the grammatical description of a programming language and produces a syntax analyzer.
1.6.2 Scanner Generators
Input Regular expression description of the tokens of a language.
Output Lexical analyzers.
Scanner generator generates lexical analyzers from a regular expression description of the tokens of a language.
1.6.3 Syntax-directed Translation Engines
Input Parse tree.
Output Intermediate code.
Syntax-directed translation engines produce collections of routines that walk a parse tree and generate intermediate code.
1.6.4 Automatic Code Generators
Input Intermediate language.
Output Machine language.
Code generator takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.
1.6.5 Data-flow Analysis Engines
Data-flow analysis engine gathers the information about how values are transmitted from one part of a program to each of the other parts. Data-flow analysis is a key part of code optimization.
1.6.6 Compiler Construction Toolkits
Compiler construction toolkits provide an integrated set of routines for the construction of the various phases of a compiler.
1.7 Programming Language Basics
1.7.1 Static/Dynamic Distinction
1.7.1.1 Policy
Policy defines what decisions are made by a compiler for a program.
Static policy If the language of a compiler enables it to make decisions on an issue at compile time, it is termed a static policy.
Dynamic policy When a compiler is able to make a decision only at run time, it is said to follow a dynamic policy.
1.7.1.2 Scope
Scope defines the region of a program that can access a particular declaration.
Static scope If the scope of a declaration can be determined by looking only at the program text, then it is static scope. It can also be termed lexical scope.
Dynamic scope At run time, the same use of a variable name may refer to different declarations of that variable.
e.g., In Java,
public static int a;
requires the compiler to create only one copy of the variable a for any number of objects created by its class, and the location of the variable can be found in advance by the compiler.
public int a;
This makes each object of its class have its own copy of the variable, and the locations cannot be found by the compiler before running the program.
1.7.1.3 Environment and states
✓ They are used to know whether the values of data elements get affected by the changes that occur when the program runs.
✓ They are used to describe the association of names with locations, and in turn of locations with their values at run time (the environment maps names to locations; the state maps locations to values).
✓ s1.name and s2.name are aliases (same l-value; they refer to the same location in memory).
✓ s1 and s2 are not aliases.
1.8 Lexical Analysis
Lexical analysis is the process of converting a sequence of characters from the source program into a sequence of tokens.
A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer or scanner.
Lexical analysis consists of two stages of processing, which are as follows:
✓ Scanning
✓ Tokenization
1.8.1 Token, Pattern and Lexeme
Token
Token is a valid sequence of characters which is given by a lexeme. In a programming language,
✓ Keywords,
✓ Constants,
✓ Identifiers,
✓ Numbers,
✓ Operators and
✓ Punctuation symbols
are possible tokens to be identified.
Pattern
Pattern describes a rule that must be matched by sequences of characters (lexemes) to form a token. It can be defined by regular expressions.
Lexeme
Lexeme is a sequence of characters that matches the pattern for a token, i.e., an instance of a token.
e.g., c = a + b * 5

Lexemes | Tokens
c | Identifier
= | Assignment symbol
a | Identifier
+ | + (Addition symbol)
b | Identifier
* | * (Multiplication symbol)
5 | 5 (Number)
The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of programming languages.
1.8.2 Role of Lexical Analyzer
Source program → Lexical analyzer → token → Parser → to semantic analysis. The parser requests the next token with getNextToken(), and both phases consult the symbol table.
Figure 1.15 Interaction between lexical analyzer and parser
Lexical analyzer performs the following tasks:
✓ Reads the source program, scans the input characters, groups them into lexemes and produces the tokens as output.
✓ Enters the identified tokens into the symbol table.
✓ Strips out white spaces and comments from the source program.
✓ Correlates error messages with the source program, i.e., displays an error message with its occurrence by specifying the line number.
✓ Expands macros if they are found in the source program.
Tasks of lexical analyzer can be divided into two processes:
Scanning Performs reading of input characters, removal of white spaces and comments.
Lexical analysis Produces tokens as the output.
1.8.2.1 Need of Lexical Analyzer
Simplicity of design of compiler The removal of white spaces and comments leaves the syntax analyzer with a simpler stream of syntactic constructs to process.
Compiler efficiency is improved Specialized buffering techniques for reading characters speed up the compiler process.
Compiler portability is enhanced.
1.8.2.2 Issues in Lexical Analysis
Lexical analysis is the process of producing tokens from the source program. It has the following issues:
✓ Lookahead
✓ Ambiguities
Lookahead
Lookahead is required to decide when one token will end and the next token will begin. Simple examples which have lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required.
A way is needed to resolve the following ambiguities:
✓ Is if two variables i and f, or the keyword if?
✓ Is == two equal signs = and =, or the operator ==?
✓ arr(5, 4) vs. fn(5, 4) // in Ada (as array reference syntax and function call syntax are similar).
Hence, the number of lookahead characters to be considered and a way to describe the lexemes of each token are also needed.
Regular expressions are one of the most popular ways of representing tokens.
Ambiguities
The lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. Lex can handle ambiguous specifications. When more than one expression can match the current input, lex chooses as follows:
✓ The longest match is preferred.
✓ Among rules which matched the same number of characters, the rule given first is preferred.
1.8.3 Lexical Errors
✓ A character sequence that cannot be scanned into any valid token is a lexical error.
✓ Lexical errors are uncommon, but they still must be handled by a scanner.
✓ Misspelling of identifiers and keywords is considered a lexical error.
✓ Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
1.8.4 Error Recovery Schemes
✓ Panic mode recovery.
✓ Local correction:
• Source text is changed around the error point in order to get a correct text.
• Analyzer will be restarted with the resultant new text as input.
✓ Global correction:
• It is an enhanced panic mode recovery.
• Preferred when local correction fails.
Panic mode recovery
In panic mode recovery, unmatched patterns are deleted from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.
e.g., For instance, the string fi is encountered for the first time in a C program in the context
fi (a == f(x)) ...
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.
Local correction
Local correction performs deletion/insertion and/or replacement of any number of symbols at the error detection point.
e.g., In Pascal, c[i]'='; the scanner deletes the first quote because it cannot legally follow the closing bracket, and the parser replaces the resulting '=' by an assignment statement. Most of the errors are corrected by local correction.
e.g., The effects of lexical error recovery might well create a later syntax error, handled by the parser. Consider
for $tnight ...
The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier.
In effect it results in
for tnight ...
which will cause a syntax error. Such false errors are unavoidable, though a syntactic error-repair may help.
Lexical error handling approaches
Lexical errors can be handled by the following actions:
✓ Deleting one character from the remaining input.
✓ Inserting a missing character into the remaining input.
✓ Replacing a character by another character.
✓ Transposing two adjacent characters.
1.9 Input Buffering
✓ To ensure that a right lexeme is found, one or more characters have to be looked up beyond the next lexeme.
✓ Hence a two-buffer scheme is introduced to handle large lookaheads safely.
✓ Techniques for speeding up the process of the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.
There are three general approaches for the implementation of a lexical analyzer:
(i) By using a lexical analyzer generator, such as the lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
(ii) By writing the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
(iii) By writing the lexical analyzer in assembly language and explicitly managing the reading of input.
1.9.1 Buffer Pairs
Because of the large amount of time consumed in moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
Figure 1.16 shows the buffer pairs which are used to hold the input data.
Figure 1.16 Buffer pairs (the two halves of the buffer, with the lexemeBegin and forward pointers)
✓ Two pointers, lexemeBegin and forward, are maintained.
✓ lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
✓ forward scans ahead until a match for a pattern is found.
✓ Once a lexeme is found, lexemeBegin is set to the character immediately after the lexeme which has just been found, and forward is set to the character at its right end.
✓ The current lexeme is the set of characters between the two pointers.
Disadvantages of this scheme
✓ This scheme works well most of the time, but the amount of lookahead is limited.
✓ This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.
e.g., DECLARE (ARG1, ARG2, . . . , ARGn) // in a PL/I program
✓ It cannot be determined whether DECLARE is a keyword or an array name until the character that follows the right parenthesis is seen.
1.9.2 Sentinels
✓ In the previous scheme, each time the forward pointer is moved, a check must be done to ensure that one half of the buffer has not moved off; if it has, then the other half must be reloaded.
✓ Therefore, the ends of the buffer halves require two tests for each advance of the forward pointer.
Test 1: For the end of the buffer.
Test 2: To determine what character is read.
✓ The usage of a sentinel reduces the two tests to one by extending each buffer half to hold a sentinel character at the end.
✓ The sentinel is a special character that cannot be part of the source program (the eof character is used as the sentinel).
Figure 1.17 Sentinels at the end of each buffer
Advantages:
✓ Most of the time, it performs only one test to see whether forward points to an eof.
✓ More tests are performed only when it reaches the end of a buffer half or the end of the file.
✓ Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.
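A sketch of the sentinel test in C follows. The buffer layout and helper names (reload_second_half and so on) are assumptions for illustration; each half is N characters wide and is followed by one sentinel slot holding eof:

#include <stdio.h>   /* EOF */

#define N 4096
static char buf[2 * N + 2];          /* two halves, each ending in a sentinel */
static char *forward = buf;

void reload_second_half(void);       /* assumed I/O helpers */
void reload_first_half(void);
void terminate_analysis(void);       /* assumed not to return */

char advance(void) {
    char c = *forward++;
    if (c != (char)EOF)
        return c;                    /* the common case: one test only */
    if (forward == buf + N + 1) {    /* sentinel at the end of the first half */
        reload_second_half();        /* forward is already at the second half */
    } else if (forward == buf + 2 * N + 2) {  /* end of the second half */
        reload_first_half();
        forward = buf;               /* wrap around to the first half */
    } else {
        terminate_analysis();        /* a real eof inside a buffer half */
    }
    return advance();                /* re-read after the reload */
}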
1.10 Specification of Tokens
✓ Regular expressions are a notation to represent lexeme patterns for a token.
✓ They are used to represent the language for the lexical analyzer.
✓ They assist in finding the type of token that accounts for a particular lexeme.
1.10.1 Strings and Languages
Alphabet is a finite, non-empty set of input symbols.
Σ = {0, 1} — binary alphabet
String represents a finite sequence of symbols drawn from the alphabet.
w = {0, 1, 00, 01, 10, 11, 001, 010, ...}
w indicates the set of possible strings for the given binary alphabet Σ.
Language (L) is the collection of strings which are accepted by a finite automaton.
L = {0^n 1 | n >= 0}
Length of a string is defined as the number of input symbols in a given string.
e.g., w = 0101
|w| = 4
Empty string denotes zero occurrences of input symbols. It is represented by ε.
Concatenation of two strings p and q is denoted by pq.
e.g., p = 010 and q = 001
pq = 010001
qp = 001010
pq ≠ qp
Empty string is the identity under concatenation. Let x be a string:
εx = xε = x
Prefix A prefix of any string s is obtained by removing zero or more symbols from the end of s.
e.g., s = balloon
Possible prefixes are: ball, balloon.
Suffix A suffix of any string s is obtained by removing zero or more symbols from the beginning of s.
e.g., s = balloon
Possible suffixes are: loon, balloon.
Proper prefix A proper prefix p of a string s can be given by s ≠ p and p ≠ ε.
Proper suffix A proper suffix x of a string s can be given by s ≠ x and x ≠ ε.
Substring Substring is a part of a string obtained by removing any prefix and any suffix from s.
1.10.2 Operations on Languages
Important operations on a language are:
✓ Union
✓ Concatenation and
✓ Closure
Union
Union of two languages L and M produces the set of strings which may be either in language L or in language M or in both. It can be denoted as
L ∪ M = {p | p is in L or p is in M}
Concatenation
Concatenation of two languages L and M produces a set of strings which are formed by merging the strings in L with the strings in M (strings in L must be followed by strings in M). It can be represented as
LM = {pq | p is in L and q is in M}
Closure
(i) Kleene closure (L*)
Kleene closure refers to zero or more occurrences of input symbols in a string, i.e., it includes the empty string ε (set of strings with 0 or more occurrences of input symbols).
L* = L⁰ ∪ L¹ ∪ L² ∪ ···
(ii) Positive closure (L+)
Positive closure indicates one or more occurrences of input symbols in a string, i.e., it excludes the empty string ε (set of strings with 1 or more occurrences of input symbols).
L+ = L¹ ∪ L² ∪ L³ ∪ ···
L³ — set of strings each with length 3.
e.g., Let Σ = {a, b}
L* = {ε, a, b, aa, ab, ba, bb, aab, aba, ...}
L+ = {a, b, aa, ab, ba, bb, aab, aaba, ...}
L³ = {aaa, aab, aba, abb, baa, bab, bba, bbb}
Precedence of operators
✓ The unary operator (*) has the highest precedence.
✓ The concatenation operator (·) is second highest and is left associative.
✓ The union operator ( | or ∪ ) has the least precedence and is left associative.
Based on the precedence, the regular expression is transformed to finite automata when implementing a lexical analyzer.
1.10.3 Regular Expressions
Regular expressions are a combination of input symbols and language operators such as union, concatenation and closure.
They can be used to describe the identifiers of a language. An identifier is a collection of letters, digits and underscores which must begin with a letter. Hence, the regular expression for an identifier can be given by
letter_ (letter_ | digit)*
Note:
The vertical bar ( | ) refers to 'or' (union operator).
The following describes the language for a given regular expression:
Table 1.3 Languages for regular expressions

Regular expression | Language
1. ε | {ε}
2. a | {a}
3. r | s | L(r) ∪ L(s)
4. rs | L(r) L(s)
5. r* | (L(r))*

Regular set Language defined by a regular expression.
Two regular expressions are equivalent if they represent the same regular set.
(p | q) = (q | p)
Table 1.4 Algebraic laws of regular expressions

Law | Description
r | s = s | r | | is commutative
r | (s | t) = (r | s) | t | | is associative
r(st) = (rs)t | Concatenation is associative
r(s | t) = rs | rt; (s | t)r = sr | tr | Concatenation is distributive over |
εr = rε = r | ε is the identity for concatenation
r* = (r | ε)* | ε is guaranteed in a closure
r** = r* | * is idempotent
1.10.4 Regular Definition
Regular definition d gives aliases to regular expressions r and uses them for convenience. Sequences of definitions are of the following form:
d1 → r1
d2 → r2
d3 → r3
...
dn → rn
in which the definitions d1, d2, ..., dn can be used in place of r1, r2, ..., rn respectively.
Regular definitions for an identifier and a number are given as follows:
letter_ → A | B | ··· | Z | a | b | ··· | z | _
digit → 0 | 1 | 2 | ··· | 9
id → letter_ (letter_ | digit)*
num → digit (digit)*
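These regular definitions can be tried out with the POSIX regex API, used here as a stand-in for the matcher inside a lexical analyzer (the character classes below restate letter_ and digit):

#include <regex.h>
#include <stdio.h>

/* Returns 1 when the whole lexeme matches the pattern. */
int matches(const char *pattern, const char *lexeme) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = regexec(&re, lexeme, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}

int main(void) {
    const char *id  = "^[A-Za-z_][A-Za-z0-9_]*$";  /* letter_ (letter_ | digit)* */
    const char *num = "^[0-9][0-9]*$";             /* digit (digit)* */
    printf("%d %d %d\n",
           matches(id, "rate"),     /* 1: valid identifier */
           matches(id, "5rate"),    /* 0: must begin with a letter */
           matches(num, "60"));     /* 1: valid number */
    return 0;
}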
1.11 Recognition of Tokens
Recognition of tokens explains how the patterns for all the tokens are taken, and how code is generated for examining the input string and finding the prefix (lexeme) that matches any one of the patterns.
Rules for a conditional statement can be given as follows:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id | number
Figure 1.17 Conditions for branching statements
The syntax in Figure 1.17 is very much similar to that of Pascal.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens for the lexical analyzer.
For easy recognition, keywords are considered as reserved words even though their lexemes match with the pattern for identifiers.
Lexical analyzer also performs stripping out of white space. In order to recognize whitespace, it is defined as
ws → (blank | tab | newline)+
Generally, when a token is found, it is returned to the parser. But when a lexical analyzer encounters ws, it restarts its process from the character that follows the whitespace.
Table 1.5 Token names with their attribute values

Lexemes | Token name | Attribute value
Any ws | – | –
if | if | –
then | then | –
else | else | –
Any id | id | Pointer to symbol table entry
Any number | number | Pointer to symbol table entry
< | relop | LT
<= | relop | LE
= | relop | EQ
<> | relop | NE
> | relop | GT
>= | relop | GE
1.11.1 Transition Diagrams
Transition diagrams are pictorial representations of transitions from one state to another on taking some input symbol.
The patterns are converted into transition diagrams while constructing the lexical analyzer.
A transition diagram comprises states and edges, where states represent the conditions that occur in the process of scanning the input and edges indicate the transitions.
Edges are labelled by input symbols.
The forward pointer is advanced if an edge is found from some state with the label of the input under consideration.
Conventions
✓ The start state is indicated by an arrow labelled with start.
✓ Final states indicate that a lexeme has been found. They are represented by a double circle.
✓ To indicate the retraction of the forward pointer, a '*' will be placed near the accepting state when the lexeme does not include the symbol that reaches the accepting state.
The transition diagrams for relational operators and unsigned integers (which are similar in most of the programming languages like Pascal and C) are given in Figures 1.18 and 1.19.
Figure 1.18 Transition diagram for relational operators (from the start state: < followed by = gives return(relop, LE); < followed by > gives return(relop, NE); < followed by any other character gives *return(relop, LT); = gives return(relop, EQ); > followed by = gives return(relop, GE); > followed by any other character gives *return(relop, GT))
Figure 1.19 Transition diagram for unsigned integers (a digit followed by zero or more digits)
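Hand-coding the relop diagram of Figure 1.18 gives the following C sketch; the states are implicit in the control flow, and the '*' retraction becomes consuming one character fewer (the names are illustrative):

enum relop_attr { LT, LE, EQ, NE, GT, GE, NONE };

/* Scans a relational operator at the start of s; stores its attribute
 * and returns the number of characters consumed (0 on failure). */
int scan_relop(const char *s, enum relop_attr *attr) {
    if (s[0] == '<') {
        if (s[1] == '=') { *attr = LE; return 2; }
        if (s[1] == '>') { *attr = NE; return 2; }
        *attr = LT; return 1;        /* other: retract one character */
    }
    if (s[0] == '=') { *attr = EQ; return 1; }
    if (s[0] == '>') {
        if (s[1] == '=') { *attr = GE; return 2; }
        *attr = GT; return 1;        /* other: retract one character */
    }
    *attr = NONE; return 0;          /* not a relational operator */
}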
1.11.2 Recognition of Reserved Words and Keywords
Keyword patterns match with that of identifiers, but they should be recognized differently. The transition diagram for identifiers, which also satisfies the pattern for keywords, is given in Figure 1.20.
Figure 1.20 Transition diagram for identifiers and keywords (start → letter → loop on letter | digit → other → *return(getToken(), installID()))
To handle reserved words differently from identifiers:
✓ Install the reserved words in the symbol table initially. A separate field of the symbol table entry indicates the token they represent.
• installID() When an identifier is found, this function places it in the symbol table if it is not already there and returns a pointer to the symbol table entry for the lexeme found.
• getToken() When a lexeme is found, this function examines the symbol table entry and returns the token name indicated in it. (A sketch of this lookup follows Figure 1.21.)
✓ Create separate transition diagrams for each keyword.
Figure 1.21 Transition diagram for the keyword then (start → t → h → e → n → non-letter/digit)
In Figure 1.21:
✓ A test for 'non-letter or digit' is done to check the end of the identifier.
✓ If it reaches the accepting state, it is recognized as a keyword; else, as an identifier.
✓ This is done since a lexeme such as thenextvalue has the proper prefix then.
✓ When a lexeme matches both patterns, priority is given to reserved words.
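A C sketch of the symbol-table approach, with keywords pre-installed so that installID() finds them before treating a lexeme as an identifier (the token codes and table layout are invented for illustration):

#include <string.h>

enum { IF = 256, THEN, ELSE, ID };       /* illustrative token codes */

struct entry { const char *lexeme; int token; };

static struct entry symtab[100] = {      /* reserved words installed first */
    {"if", IF}, {"then", THEN}, {"else", ELSE}
};
static int nsyms = 3;

/* Places the lexeme in the symbol table if absent; returns its index. */
int installID(const char *lexeme) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return i;                    /* keyword or known identifier */
    symtab[nsyms].lexeme = strdup(lexeme);   /* strdup is POSIX */
    symtab[nsyms].token = ID;            /* a new identifier */
    return nsyms++;
}

/* Returns the token name recorded in the entry: THEN for "then",
 * ID for "thenextvalue". */
int getToken(int i) { return symtab[i].token; }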
1.12 Lex
✓ Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions.
✓ The lex tool itself is a lex compiler.
1.12.1 Use of Lex
✓ lex.l is an input file written in a language which describes the generation of a lexical analyzer. The lex compiler transforms lex.l to a C program known as lex.yy.c.
✓ lex.yy.c is compiled by the C compiler to a file called a.out.
✓ The output of the C compiler is the working lexical analyzer which takes a stream of input characters and produces a stream of tokens.
✓ yylval is a global variable which is shared by the lexical analyzer and parser to return the name and an attribute value of a token.
✓ The attribute value can be a numeric code, a pointer to the symbol table or nothing.
✓ Another tool for lexical analyzer generation is flex.
lex.l (lex source program) → lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens

Figure 1.22 Creating a lexical analyzer
1.12.2 Structure of Lex Programs
A lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
Declarations This section includes declarations of variables, constants and regular definitions.
Translation rules It contains regular expressions and code segments.
Form: Pattern {action}
Pattern is a regular expression or regular definition.
Action refers to segments of code.
Auxiliary functions This section holds additional functions which are used in actions. These functions are compiled separately and loaded with the lexical analyzer.
The lexical analyzer produced by lex starts its process by reading one character at a time until a valid match for a pattern is found.
Once a match is found, the associated action takes place to produce a token. The token is then given to the parser for further processing.
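A minimal, illustrative lex specification in this layout (the token codes are invented; the actions are ordinary C):

%{
#include <stdlib.h>          /* declarations section: C code and macros */
#define ID  1
#define NUM 2
int yylval;                  /* attribute value passed to the parser */
%}
letter  [A-Za-z_]
digit   [0-9]
%%
[ \t\n]+                     ;  /* strip white space: no token returned */
{letter}({letter}|{digit})*  { return ID; }
{digit}+                     { yylval = atoi(yytext); return NUM; }
%%
int yywrap(void) { return 1; }   /* auxiliary functions section */

Running lex on this file yields lex.yy.c; compiling that with a C compiler, together with a driver that repeatedly calls yylex(), gives the working scanner of Figure 1.22.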
1.12.3 Conflict Resolution in Lex
Conflicts arise when several prefixes of the input match one or more patterns. This can be resolved by the following:
✓ Always prefer a longer prefix to a shorter prefix.
✓ If two or more patterns are matched for the longest prefix, then the first pattern listed in the lex program is preferred.
1.12.4 Lookahead Operator
✓ Lookahead operator is the additional operator that is read by lex in order to distinguish additional pattern context for a token.
✓ The lexical analyzer reads one character ahead of the valid lexeme and then retracts to produce the token.
✓ At times, it is needed to have certain characters at the end of the input to match with a pattern. In such cases, slash (/) is used to indicate the end of the part of the pattern that matches the lexeme.
e.g., In some languages keywords are not reserved. So the statements
IF (I, J) = 5 and IF(condition) THEN ...
result in a conflict about whether to produce IF as an array name or as a keyword. To resolve this, the lex rule for the keyword IF can be written as
IF / \( .* \) {letter}
1.13 Design of Lexical Analyzer
✓ A lexical analyzer can either be generated by NFA or by DFA.
✓ DFA is preferable in the implementation of lex.
1.13.1 Structure of Generated Analyzer
The architecture of the lexical analyzer generated by lex is given in Figure 1.23.
The lexical analyzer program includes:
✓ A program to simulate automata.
✓ Components created from the lex program by lex itself, which are listed as follows:
• A transition table for the automaton.
• Functions that are passed directly through lex to the output.
• Actions from the input program (fragments of code) which are invoked by the automaton simulator when needed.
lex program → (transition table + actions) → automaton simulator
Figure 1.23 Lex program used by finite automaton simulator
Steps to construct the automaton
Step 1: Convert each regular expression into an NFA, either by Thompson's construction or by the direct method.
Step 2: Combine all NFAs into one by introducing a new start state with ε-transitions to each of the start states of the NFAs Ni for pattern pi.
Step 2 is needed as the objective is to construct a single automaton to recognize lexemes that match any of the patterns.
Figure 1.24 Construction of NFA from lex program
Consider the patterns
a    {action A1 for pattern p1}
abb  {action A2 for pattern p2}
a*b+ {action A3 for pattern p3}
For the string abb, pattern p2 and pattern p3 both match. But pattern p2 will be taken into account, as it was listed first in the lex program.
For a string such as aabbb···, pattern p3 matches, as the longest prefix is preferred.
Figure 1.25 shows the NFAs for recognizing the above-mentioned three patterns.
The combined NFA for all three given patterns is shown in Figure 1.26.
Figure 1.25 NFAs for a, abb, a*b+
Figure 1.26 Combined NFA
1.13.2 Pattern Matching based on NFAs
The lexical analyzer reads input from the input buffer from the beginning of the lexeme pointed to by the pointer lexemeBegin. The forward pointer is used to move ahead over input symbols, and the simulator calculates the set of states it is in at each point. If the NFA simulation has no next state for some input symbol, then no longer prefix that reaches an accepting state can exist. In such cases, the decision will be made on the longest prefix seen so far, i.e., a lexeme matching some pattern. The process is repeated until one or more accepting states are reached. If there are several accepting states, then the pattern pi which appears earliest in the list of the lex program is chosen.
Figure 1.27 Processing input aaba
Explanation
The process starts with the ε-closure of initial state 0. After processing all the input symbols, no state is found, as there is no transition out of state 8 on input a. Hence, an accepting state is looked for by retracting to a previous state. From Figure 1.27, state 2, which is an accepting state, is reached after reading input symbol a, and therefore the pattern a has been matched. At state 8, the string aab has been matched with pattern a*b+. By the lex rule, the longest matching prefix should be considered. Hence, action A3 corresponding to pattern p3 will be executed for the string aab.
1.13.3 DFAs for Lexical Analyzers
DFAs are also used to represent the output of lex. A DFA is constructed from the NFA by converting all the patterns into an equivalent DFA using the subset construction algorithm. If there are one or more accepting NFA states, the first pattern whose accepting state is represented in each DFA state is determined and displayed as the output of that DFA state. The processing of a DFA is similar to that of an NFA. Simulation of the DFA is continued until no next state is found. Then, retraction takes place to find the accepting state of the DFA, and the action associated with the pattern for that state is executed.
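The retraction step can be seen in a small C sketch of longest-match simulation: the scanner remembers the last accepting position it passed and falls back to it when the automaton gets stuck. The table encodes the DFA for (a | b)*abb worked out in Section 1.15, with states A-E numbered 0-4:

#include <stdio.h>

static const int delta[5][2] = {   /* rows: states A-E; columns: a, b */
    {1, 2},  /* A */
    {1, 3},  /* B */
    {1, 2},  /* C */
    {1, 4},  /* D */
    {1, 2},  /* E (accepting) */
};

/* Length of the longest prefix of s accepted by the DFA (0 if none). */
int longest_match(const char *s) {
    int q = 0, last_accept = 0;
    for (int i = 0; s[i] == 'a' || s[i] == 'b'; i++) {
        q = delta[q][s[i] - 'a'];
        if (q == 4)
            last_accept = i + 1;   /* remember the right end seen so far */
    }
    return last_accept;            /* retract forward to this position */
}

int main(void) {
    printf("%d\n", longest_match("abbabbx"));   /* 6: lexeme "abbabb" */
    return 0;
}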
1.13.4 Implementing Lookahead Operator
✓ The lookahead operator r1/r2 is needed because the pattern r1 for a particular token may need to describe some trailing context r2 in order to correctly identify the actual lexeme.
✓ For the pattern r1/r2, '/' is treated as ε.
✓ If some prefix ab is recognized by the NFA as a match for this pattern, the lexeme is not ended just because the NFA reaches an accepting state.
The end of the lexeme occurs when the NFA enters a state p such that
1. p has an ε-transition on /,
2. there is a path from the start state to state p that spells out a,
3. there is a path from state p to the accepting state that spells out b,
4. a is as long as possible for any ab satisfying conditions 1-3.
Figure 1.28 NFA for the keyword IF with lookahead
Figure 1.28 shows the NFA for recognizing the keyword IF with lookahead. The transition from state 2 to state 3 represents the lookahead operator (ε-transition).
The accepting state is state 6, which indicates the presence of the keyword IF. Hence, the lexeme IF is found by looking backwards to state 2 whenever the accepting state (state 6) is reached.
1.14 Finite Automata
A recognizer for a language is a program that takes as input a string x and answers 'yes' if x is a sentence of the language and 'no' otherwise.
A regular expression is compiled into a recognizer by constructing a generalized transition diagram called a Finite Automaton (FA).
A finite automaton can be a Non-deterministic Finite Automaton (NFA) or a Deterministic Finite Automaton (DFA).
It is given by M = (Q, Σ, q0, F, δ)
where Q — set of states
Σ — set of input symbols
q0 — start state
F — set of final states
δ — transition function (mapping a state and an input symbol to states)
δ: Q × Σ → Q
✓ Non-deterministic Finite Automata (NFA)
• More than one transition may occur for an input symbol from a state.
• Transitions can occur even on the empty string (ε).
✓ Deterministic Finite Automata (DFA)
• For each state and for each input symbol, exactly one transition occurs from that state.
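As a concrete instance of the 5-tuple, here is a C encoding of a DFA for the language L = {0^n 1 | n >= 0} mentioned in Section 1.10 (the dead state 2 absorbs illegal continuations; the encoding is a sketch, not from the text):

#include <stdio.h>

/* Q = {0, 1, 2}, Σ = {'0', '1'}, q0 = 0, F = {1}.
 * State 2 is a dead state for strings outside the language. */
static const int delta[3][2] = {
    {0, 1},   /* q0: loop on '0'; go to the accepting state on '1' */
    {2, 2},   /* q1: anything after the single '1' is illegal */
    {2, 2},   /* dead state */
};

int accepts(const char *w) {
    int q = 0;
    for (; *w; w++) {
        if (*w != '0' && *w != '1') return 0;
        q = delta[q][*w - '0'];   /* the transition function δ */
    }
    return q == 1;                /* accept iff the run ends in F */
}

int main(void) {
    printf("%d %d %d\n", accepts("0001"), accepts("1"), accepts("0010"));
    /* prints 1 1 0 */
    return 0;
}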
A regular expression can be converted into a DFA by the following methods:
(i) Sub-set construction
✓ The given regular expression is converted into an NFA (Thompson's construction).
✓ The resultant NFA is converted into a DFA.
(ii) Direct method
✓ In the direct method, the given regular expression is converted directly into a DFA.
1.15 Regular Expressions to DFA
Regular expressions are used to represent the language (lexemes) of the finite automaton (lexical analyzer).
1.15.1 Rules for Conversion of Regular Expression to NFA
• Union
• Concatenation
• Closure
1.15.2 ε-closure
ε-closure is the set of states that are reachable from the state concerned on taking the empty string as input. It describes the path that consumes the empty string (ε) to reach some states of the NFA.
Example 1.3
ε-closure(q0) = {q0, q1, q2}
ε-closure(q1) = {q1, q2}
ε-closure(q2) = {q2}
Example 1.4
ε-closure(1) = {1, 2, 3, 4, 6}
ε-closure(2) = {2, 3, 6}
ε-closure(3) = {3, 6}
ε-closure(4) = {4}
ε-closure(5) = {5, 7}
ε-closure(6) = {6}
ε-closure(7) = {7}
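ε-closure can be computed with a simple worklist (depth-first) traversal of the ε-edges. The following C sketch hard-codes an ε-edge table chosen to be consistent with the closures of Example 1.4 (the full NFA diagram is not reproduced here):

#include <stdio.h>

#define NSTATES 8      /* states 1..7; index 0 unused */

/* eps[u][v] = 1 when there is an ε-transition u → v */
static const int eps[NSTATES][NSTATES] = {
    [1] = { [2] = 1, [4] = 1 },
    [2] = { [3] = 1 },
    [3] = { [6] = 1 },
    [5] = { [7] = 1 },
};

/* Marks in closure[] every state reachable from s via ε-edges. */
void eclosure(int s, int closure[NSTATES]) {
    int stack[NSTATES], top = 0;
    stack[top++] = s;
    closure[s] = 1;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < NSTATES; v++)
            if (eps[u][v] && !closure[v]) {
                closure[v] = 1;
                stack[top++] = v;
            }
    }
}

int main(void) {
    int closure[NSTATES] = {0};
    eclosure(1, closure);          /* expect {1, 2, 3, 4, 6} */
    for (int v = 1; v < NSTATES; v++)
        if (closure[v]) printf("%d ", v);
    printf("\n");
    return 0;
}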
1.15.3 Sub-set Construction
✓ The given regular expression is converted into an NFA.
✓ Then, the NFA is converted into a DFA.
Steps
1. Convert the given RE into an NFA using the above rules for operators (union, concatenation and closure) and precedence.
2. Find the ε-closure of all states.
3. Start with the ε-closure of the start state of the NFA.
4. Apply the input symbols and find the ε-closure (see the sketch after this list):
Dtran[state, input symbol] = ε-closure(move(state, input symbol))
where Dtran is the transition function of the DFA.
5. Analyze the output state to find whether it is a new state.
6. If a new state is found, repeat step 4 and step 5 until no more new states are found.
7. Construct the transition table for the Dtran function.
8. Draw the transition diagram with the start state as the ε-closure (start state of the NFA) and a final state as any state that contains the final state of the NFA drawn for the given RE.
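A C sketch of the step 4 computation, representing NFA state sets as bit masks. The ε- and move-edges below are reconstructed to be consistent with the closures computed in Example 1.5, so the printed sets correspond to states A, B and D there:

#include <stdio.h>
#include <stdint.h>

#define NSTATES 11   /* NFA states 0..10 for (a | b)*abb */

static const uint32_t eps[NSTATES] = {        /* ε-edges as bit sets */
    [0] = 1u << 1 | 1u << 7,
    [1] = 1u << 2 | 1u << 4,
    [3] = 1u << 6,
    [5] = 1u << 6,
    [6] = 1u << 1 | 1u << 7,
};
static const uint32_t mov[NSTATES][2] = {     /* moves on a (0) and b (1) */
    [2] = {1u << 3, 0}, [4] = {0, 1u << 5},
    [7] = {1u << 8, 0}, [8] = {0, 1u << 9}, [9] = {0, 1u << 10},
};

uint32_t eclosure(uint32_t set) {
    uint32_t closure = set, frontier = set;
    while (frontier) {
        uint32_t next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s))
                next |= eps[s];
        frontier = next & ~closure;           /* states not yet seen */
        closure |= next;
    }
    return closure;
}

/* Dtran[T, c] = ε-closure(move(T, c)) */
uint32_t dtran(uint32_t T, int c) {
    uint32_t m = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s))
            m |= mov[s][c];
    return eclosure(m);
}

int main(void) {
    uint32_t A = eclosure(1u << 0);           /* {0,1,2,4,7}     = 0x097 */
    uint32_t B = dtran(A, 0);                 /* {1,2,3,4,6,7,8} = 0x1de */
    uint32_t D = dtran(B, 1);                 /* {1,2,4,5,6,7,9} = 0x2f6 */
    printf("A=%#x B=%#x D=%#x\n", A, B, D);
    return 0;
}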
Example 1.5
RE = (a | b)*abb
Step 1: Construct the NFA.
Step 2: Start by finding the ε-closure of state 0:
ε-closure(0) = {0, 1, 2, 4, 7} = A
Step 3: Apply the input symbols a, b to A:
Dtran[A, a] = ε-closure(move(A, a)) = ε-closure(move({0, 1, 2, 4, 7}, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
Dtran[A, b] = ε-closure(move(A, b)) = ε-closure(move({0, 1, 2, 4, 7}, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
Step 4: Apply the input symbols to the new state B:
Dtran[B, a] = ε-closure(move(B, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
Dtran[B, b] = ε-closure(move(B, b)) = ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
Step 5: Apply the input symbols to the new state C:
Dtran[C, a] = ε-closure(move(C, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
Dtran[C, b] = ε-closure(move(C, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
Step 6: Apply the input symbols to the new state D:
Dtran[D, a] = ε-closure(move(D, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
Dtran[D, b] = ε-closure(move(D, b)) = ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
Step 7: Apply the input symbols to the new state E:
Dtran[E, a] = ε-closure(move(E, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
Dtran[E, b] = ε-closure(move(E, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
Step 8: Construct the transition table

State | a | b
A | B | C
B | B | D
C | B | C
D | B | E
*E | B | C

Note:
✓ The start state is ε-closure(0), i.e., A.
✓ The final state is the state that contains the final state of the drawn NFA, i.e., E (marked *).
Step 9: Construct the transition diagram.
1.15.4 Direct Method
✓ Direct method is used to convert a given regular expression directly into a DFA.
✓ It uses the augmented regular expression r#.
✓ Important states of the NFA correspond to the positions in the regular expression that hold symbols of the alphabet.
✓ The regular expression is represented as a syntax tree where interior nodes correspond to operators representing union, concatenation and closure operations.
✓ Leaf nodes correspond to the input symbols.
✓ Construct the DFA directly from a regular expression by computing the functions nullable(n), firstpos(n), lastpos(n) and followpos(i) from the syntax tree.
• nullable(n): Is true for a star node and for a node labeled ε. For other nodes it is false (see Table 1.6).
• firstpos(n): Set of positions at node n that correspond to the first symbol of the sub-expression rooted at n.
• lastpos(n): Set of positions at node n that correspond to the last symbol of the sub-expression rooted at n.
• followpos(i): Set of positions that follow the given position i by matching the first or last symbol of a string generated by the sub-expression of the given regular expression.
Table 1.6 Rules for computing nullable, firstpos and lastpos

Node n | nullable(n) | firstpos(n) | lastpos(n)
A leaf labeled ε | true | ∅ | ∅
A leaf with position i | false | {i} | {i}
An or node n = c1 | c2 | nullable(c1) or nullable(c2) | firstpos(c1) ∪ firstpos(c2) | lastpos(c1) ∪ lastpos(c2)
A cat node n = c1 c2 | nullable(c1) and nullable(c2) | if (nullable(c1)) firstpos(c1) ∪ firstpos(c2), else firstpos(c1) | if (nullable(c2)) lastpos(c1) ∪ lastpos(c2), else lastpos(c2)
A star node n = c1* | true | firstpos(c1) | lastpos(c1)
Computation of followpos
A position of the regular expression can follow another in the following ways:
✓ If n is a cat node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
• For a cat node, for each position i in the lastpos of its left child, the firstpos of its right child will be in followpos(i).
✓ If n is a star node and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
• For a star node, the firstpos of that node is in the followpos of all positions in the lastpos of that node.
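As a worked illustration of these two rules (the standard example for this construction), consider the augmented regular expression (a | b)*abb#, with leaf positions numbered 1 to 6 from left to right (a:1, b:2, a:3, b:4, b:5, #:6). The star rule gives followpos(1) = followpos(2) = {1, 2, 3}; the cat rules give followpos(3) = {4}, followpos(4) = {5} and followpos(5) = {6}; position 6 has no followers. The start state of the resulting DFA is firstpos(root) = {1, 2, 3}.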
Input A regular expression r.
Output A DFA D that recognizes L(r).
Method
1. Construct the syntax tree T for the augmented regular expression r#.
2. Compute nullable, firstpos, lastpos and followpos for T.
3. Construct Dstates, the set of states for DFA D, and Dtran, the transition function for the DFA.
3.1 Initially all states are unmarked. Mark a state when its out-transitions are considered.
3.2 The start state of D is firstpos(root of T).
3.3 Dtran[S, a] is the union of followpos(p) for all positions p in S that correspond to the considered input symbol a.
3.4 The accepting states are the sets containing the position of the end marker #.
Figure 1.29 Algorithm for conversion of regular expression into DFA