What Are Translators?
A program written in a high-level language is called source code. To convert the source
code into machine code, translators are needed.
A translator takes a program written in source language as input and converts it into a
program in target language as output.
• A translator translates the high-level language program (input) into an equivalent machine language program (output), and reports errors detected during translation.

The different types of translators are:

Compiler
Compiler is a translator which converts a program written in a high-level language into machine language as a whole.

Interpreter
Interpreter is a translator which translates and executes a high-level language program statement by statement, without producing a separate machine language program.

Assembler
Assembler is a translator which is used to translate assembly language code into
machine language code.
Phases of Compiler - Compiler Design
Analysis part
• Analysis part breaks the source program into constituent pieces and imposes a
grammatical structure on them; this structure is then used to create an intermediate
representation of the source program.
• Information about the source program is collected and stored in a data structure called
symbol table.
Synthesis part
• Synthesis part takes the intermediate representation as input and transforms it to the
target program.
The design of a compiler can be decomposed into several phases, each of which converts
one form of the source program into another.

The different phases of the compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation

Two activities interact with all of these phases:
• Symbol table management.
• Error handling.
Lexical Analysis
• Lexical analysis is the first phase of compiler, which is also termed scanning.
• The source program is scanned to read the stream of characters, and the characters are
grouped to form sequences called lexemes, for which tokens are produced as output.
• Token: Token is a sequence of characters that represents a lexical unit matching a
pattern, such as keywords, operators, identifiers etc.
• Once a token is generated the corresponding entry is made in the symbol table.
Output: Token
(eg.) c=a+b*5;
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
Syntax Analysis
• Syntax analysis is the second phase of compiler, which is also called parsing.
• Parser converts the tokens produced by lexical analyzer into a tree like representation
called parse tree.
• Syntax tree is a compressed representation of the parse tree in which the operators
appear as interior nodes and the operands of the operator are the children of the node for
that operator.
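For the running statement c = a + b * 5, the syntax tree therefore looks like this (operators as interior nodes, operands as leaves):

        =
       / \
      c   +
         / \
        a   *
           / \
          b   5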
Input: Tokens
Semantic Analysis
• Semantic analysis uses the syntax tree and the information in the symbol table to check
the source program for semantic consistency with the language definition. An important
part of semantic analysis is type checking.

Intermediate Code Generation
• After semantic analysis, the compiler generates an explicit low-level intermediate
representation of the source program, such as three-address code. For the statement
c = a + b * 5 (with c, a and b entered in the symbol table as id1, id2 and id3), the
three-address code is:

t1 = 5
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code Optimization
• Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.
• This phase reduces the redundant code and attempts to improve the intermediate code so
that faster-running machine code will result.
• During the code optimization, the result of the program is not affected.
Code Generation
• It gets input from code optimization phase and produces the target code or object code
as result.
Symbol Table Management
• Symbol table is used to store all the information about identifiers used in the program.
• It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
• It allows finding the record for each identifier quickly and to store or retrieve data from
that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
Error Handling
• Each phase can encounter errors. After detecting an error, a phase must handle the error
so that compilation can proceed.
• For example, the semantic analysis phase reports an error when the compiler detects
constructs that have the right syntactic structure but no meaning to the operation involved.

Figure illustrates the translation of source code through each phase, considering the
statement c = a + b * 5.
A program may have the following kinds of errors at various stages:
Lexical Errors
A lexical error occurs when a character sequence cannot be scanned into any valid token.

Syntactical Errors
Syntactical errors are violations of the grammatical structure of the language.
There are four common error-recovery strategies that can be implemented in the parser to
deal with errors in the code:
o Panic mode.
o Phrase level.
o Error productions.
o Global correction.
Semantical Errors
These errors are a result of incompatible value assignment. The semantic errors that the
semantic analyzer is expected to recognize are:
• Type mismatch.
• Undeclared variable.
• Reserved identifier misuse.
• Multiple declaration of variable in a scope.
• Accessing an out of scope variable.
• Actual and formal parameter mismatch.
Logical errors
These errors occur when the program is syntactically and semantically correct but produces the wrong result because of a mistake in the program's logic.

Input Buffering
• To ensure that the right lexeme is found, the lexical analyzer often has to look one or
more characters beyond the next lexeme before it can be sure of the right lexeme.
• Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark
the buffer end, have been adopted.
There are three general approaches for the implementation of a lexical analyzer:
(i) By using a lexical-analyzer generator, such as the lex compiler, to produce the lexical
analyzer from a regular-expression-based specification. In this case, the generator provides
routines for reading and buffering the input.
(ii) By writing the lexical analyzer in a conventional systems-programming language,
using the I/O facilities of that language to read the input.
(iii) By writing the lexical analyzer in assembly language and explicitly managing the
reading of input.
Buffer Pairs
Because a large amount of time can be consumed in moving characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an
input character.
Fig shows the buffer pairs which are used to hold the input data.
Scheme
• The scheme consists of two buffers, each of N characters, which are reloaded
alternately.
• N characters are read from the input file into a buffer half using one system read command.
Pointers
Two pointers to the input are maintained:
• lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
• forward scans ahead until a pattern match is found; once a lexeme is found, forward is
set to the character at its right end, and lexemeBegin is set to the character immediately
after the lexeme just found.
• This scheme works well most of the time, but the amount of lookahead is limited.
• This limited lookahead may make it impossible to recognize tokens in situations where
the distance that the forward pointer must travel is more than the length of the buffer.
Sentinels
• In the previous scheme, each time the forward pointer is moved, a check must be done
to ensure that it has not moved off one half of the buffer. If it has, then the other half
must be reloaded.
• Therefore the ends of the buffer halves require two tests for each advance of the forward
pointer.
• The usage of sentinel reduces the two tests to one by extending each buffer half to hold
a sentinel character at the end.
• The sentinel is a special character that cannot be part of the source program. (eof
character is used as sentinel).
Advantages
• Most of the time, it performs only one test: whether forward points to an
eof.
• Only when it reaches the end of the buffer half or eof, it performs more tests.
• Since N input characters are encountered between eofs, the average number of tests per
input character is very close to 1.
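The following is a minimal C sketch of the buffer-pair scheme with a sentinel. It assumes the null character '\0' as the sentinel (so it cannot occur in the source text) and a hypothetical input file source.txt; a real scanner would also maintain lexemeBegin and guard against overwriting an unfinished lexeme when reloading.

#include <stdio.h>

#define N 4096                /* size of each buffer half */
#define SENTINEL '\0'         /* assumed sentinel; cannot occur in source text */

static char buf[2 * N + 2];   /* two halves, each followed by a sentinel slot */
static char *forward;         /* scanning pointer */
static FILE *src;             /* source file handle */

/* reload one half with up to N characters and drop a sentinel after them */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;       /* marks end-of-half, or real eof if n < N */
}

/* advance forward one character; one sentinel test per character */
static int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != SENTINEL)
            return (unsigned char)c;
        if (forward == buf + N + 1) {            /* hit end of first half */
            reload(buf + N + 1);                 /* forward now begins second half */
        } else if (forward == buf + 2 * N + 2) { /* hit end of second half */
            reload(buf);
            forward = buf;                       /* wrap to first half */
        } else {
            return EOF;                          /* sentinel inside a half: real end of input */
        }
    }
}

int main(void) {
    int c;
    src = fopen("source.txt", "r");  /* hypothetical input file */
    if (!src) return 1;
    reload(buf);
    forward = buf;
    while ((c = next_char()) != EOF)
        putchar(c);                  /* a real scanner would group lexemes here */
    fclose(src);
    return 0;
}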
Lexical analysis is performed in two phases:
• Scanning
• Tokenization

Token
Token is a sequence of characters that represents a lexical unit matching a pattern, such as:
• keywords,
• constant,
• identifiers,
• numbers,
• operators and
• punctuation symbols
Pattern
Pattern is a description of the form that the lexemes of a token may take, i.e., the rule that the character sequence of a lexeme must match.
Lexeme
Lexeme is a sequence of characters that matches the pattern for a token i.e., instance of a
token.
(eg.) c=a+b*5;
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
The sequence of tokens produced by lexical analyzer helps the parser in analyzing the
syntax of programming languages.
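To illustrate how lexemes are grouped into tokens, here is a minimal hand-written C scanner for statements like c = a + b * 5; (a sketch only; a real lexical analyzer would also consult the symbol table and classify keywords):

#include <ctype.h>
#include <stdio.h>

static const char *p = "c = a + b * 5;";   /* input statement */

void scan(void) {
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        if (isalpha((unsigned char)*p)) {              /* identifier lexeme */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("<id, %.*s>\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {       /* number lexeme */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("<number, %.*s>\n", (int)(p - start), start);
        } else {                                       /* operator / punctuation */
            printf("<%c>\n", *p);
            p++;
        }
    }
}

int main(void) { scan(); return 0; }

Running this prints the token stream <id, c> <=> <id, a> <+> <id, b> <*> <number, 5> <;>, matching the table above.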
Role of Lexical Analyzer
• Reads the source program, scans the input characters, groups them into lexemes and
produces tokens as output.
• Correlates error messages with the source program i.e., displays error message with its
occurrence by specifying the line number.
Scanning: Performs reading of input characters, removal of white spaces and comments.

Simplicity of design of compiler: The removal of white spaces and comments enables the
syntax analyzer to deal directly with meaningful syntactic constructs.

Issues in lexical analysis:
• Lookahead
• Ambiguities
Lookahead
Lookahead is required to decide when one token ends and the next token begins. Simple
examples which have lookahead issues are i vs. if and = vs. ==. Therefore a way to
describe the lexemes of each token is required.
• arr(5, 4) vs. fn(5, 4) in Ada (as array reference syntax and function call syntax are
similar).
Hence, the number of lookahead to be considered and a way to describe the lexemes of
each token is also needed.
Regular expressions are one of the most popular ways of representing tokens.
Ambiguities
The lexical analysis programs written with lex accept ambiguous specifications and
choose the longest match possible at each input point. When more than one expression can
match the current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is
preferred.
Lexical Errors
• A character sequence that cannot be scanned into any valid token is a lexical error.
• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspelling of identifiers, keyword, or operators are considered as lexical errors.
Usually, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token.
The possible error-recovery actions of the lexical analyzer are:
• Panic mode recovery
o Unmatched patterns are deleted from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
• Local correction
o Source text is changed around the error point in order to get a correct text.
• Global correction
o The compiler makes a wider analysis of the input and performs the fewest possible
changes to obtain a valid text.
(eg.) For instance the string fi is encountered for the first time in a C program in the
context:
fi (a== f(x))
Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to
the parser.
Local correction
(eg.) In Pascal, for the text c[i] '=';, the scanner deletes the first quote because it cannot
legally follow the closing bracket, and the parser then replaces the resulting '=' by an
assignment.
Most of the errors are corrected by local correction.
(eg.) The effects of lexical error recovery might well create a later syntax error, handled
by the parser. Consider
· · · for $tnight · · ·
The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then
tnight is scanned as an identifier.
In effect, it results in
· · · fortnight · · ·
which will cause a syntax error. Such false errors are unavoidable, though a syntactic
error-repair may help.
Regular expression is used to represent the language (lexeme) of finite automata (lexical
analyzer).
Finite automata
A recognizer for a language is a program that takes as input a string x and answers yes if
x is a sentence of the language and no otherwise.
A regular expression is compiled into a recognizer by constructing a generalized
transition diagram called a Finite Automaton (FA).
A finite automaton is represented by a five-tuple M = (Q, Σ, δ, q0, F)
where,
Q - Finite set of states
Σ - Finite set of input symbols
δ - Transition function, δ : Q x Σ → Q
q0 - Start state
F - Set of final states

Finite automata are classified as follows:
o NFA (Non-deterministic Finite Automata): more than one transition may occur for an input symbol from a state.
o DFA (Deterministic Finite Automata): for each state and for each input symbol, exactly one transition occurs from that state.
A regular expression can be converted into a DFA in two ways:
• In the indirect method, the regular expression is first converted into an NFA (using Thompson's construction) and the NFA is then converted into a DFA (using subset construction).
• In the direct method, the given regular expression is converted directly into a DFA.

Thompson's construction builds the NFA from the following basic operators:
• Union: r = r1 + r2
• Concatenation: r = r1 r2
• Closure: r = r1*
Ɛ –closure
Ɛ - Closure is the set of states that are reachable from the state concerned on taking empty
string as input. It describes the path that consumes empty string (Ɛ) to reach some states
of NFA.
Example
If the NFA has the Ɛ-transitions q0 --Ɛ--> q1 and q1 --Ɛ--> q2, then
Ɛ-closure(q0) = {q0, q1, q2}, Ɛ-closure(q1) = {q1, q2} and Ɛ-closure(q2) = {q2}.
(Note that the Ɛ-closure of a state always contains the state itself.)
Sub-set Construction
Steps
1. Convert the regular expression into an NFA using the above rules for operators (union, concatenation and closure) and precedence.
2. Construct the start state of the DFA as the Ɛ-closure of the start state of the NFA.
3. Keep a list of DFA states that have not yet been processed.
4. For an unprocessed DFA state T and each input symbol a, compute move(T, a), the set of NFA states reachable from T on a.
5. Compute the Ɛ-closure of move(T, a); this set becomes a DFA state, and a transition on a is drawn from T to it.
6. If a new state is found, repeat step 4 and step 5 until no more new states are found.
7. Mark as final every DFA state that contains a final state of the NFA.
8. Draw the transition diagram with start state as the Ɛ-closure (start state of NFA) and
final state as any state that contains a final state of the NFA.
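A small C sketch of the two core operations, Ɛ-closure and move, on a hypothetical four-state NFA (state sets are encoded as bit masks; the transitions are invented for illustration):

#include <stdio.h>

/* hypothetical NFA: 0 --Ɛ--> 1, 0 --Ɛ--> 2, 1 --a--> 3, 2 --b--> 3 */
#define NSTATES 4

unsigned eps[NSTATES]  = { 0x6, 0x0, 0x0, 0x0 };  /* Ɛ-moves: 0 -> {1,2} */
unsigned on_a[NSTATES] = { 0x0, 0x8, 0x0, 0x0 };  /* a-moves: 1 -> {3}  */

/* Ɛ-closure: repeatedly add states reachable on Ɛ until a fixpoint */
unsigned eps_closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s))
                set |= eps[s];
    } while (set != prev);
    return set;
}

/* move: union of the transitions of every state in the set on one symbol */
unsigned move_on(unsigned set, unsigned trans[]) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s))
            out |= trans[s];
    return out;
}

int main(void) {
    unsigned start = eps_closure(1u << 0);              /* Ɛ-closure({0}) = {0,1,2} */
    unsigned next  = eps_closure(move_on(start, on_a)); /* DFA state on input a     */
    printf("start DFA state: %#x\n", start);            /* prints 0x7 */
    printf("on 'a': %#x\n", next);                      /* prints 0x8, i.e. {3} */
    return 0;
}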
Direct Method
• Direct method is used to convert given regular expression directly into DFA.
• Important states of NFA correspond to positions in regular expression that hold symbols
of the alphabet.
The regular expression is represented as a syntax tree where interior nodes correspond to
operators representing union, concatenation and closure operations, and leaves correspond
to positions (occurrences of alphabet symbols).

Four functions are computed on the nodes of the syntax tree:
o nullable (n): True if the sub-expression rooted at n can generate the empty string Ɛ
(for example, a *-node or a node labeled Ɛ).
o firstpos (n): Set of positions at node n that correspond to the first symbol of some
string generated by the sub-expression rooted at n.
o lastpos (n): Set of positions at node n that correspond to the last symbol of some
string generated by the sub-expression rooted at n.
o followpos (i): Set of positions that can follow position i in some string generated by
the given regular expression.
The first three functions are computed bottom-up over the syntax tree:

Node n                 nullable(n)                       firstpos(n)
Leaf labeled Ɛ         true                              { }
Leaf with position i   false                             { i }
or-node c1 | c2        nullable(c1) or nullable(c2)      firstpos(c1) U firstpos(c2)
cat-node c1 c2         nullable(c1) and nullable(c2)     if nullable(c1) then firstpos(c1) U firstpos(c2) else firstpos(c1)
star-node c1*          true                              firstpos(c1)

lastpos(n) is computed symmetrically; for the cat node: if nullable(c2) then lastpos(c1) U lastpos(c2) else lastpos(c2).
Computation of followpos
The position of regular expression can follow another in the following ways:
• If n is a cat node with left child c1 and right child c2, then for every position i in
lastpos(c1), all positions in firstpos(c2) are in followpos(i).
o That is, for a cat node, the firstpos of its right child is in the followpos of each
position in the lastpos of its left child.
• If n is a star node and i is a position in lastpos(n), then all positions in firstpos(n) are in
followpos(i).
o That is, for a star node, the firstpos of that node is in the followpos of all positions in
lastpos of that node.
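As a worked example, consider the augmented regular expression (a|b)*abb#, with the symbol occurrences numbered 1 to 6 from left to right (a:1, b:2, a:3, b:4, b:5, #:6). Applying the rules gives:

firstpos(root) = {1, 2, 3}
followpos(1) = {1, 2, 3}
followpos(2) = {1, 2, 3}
followpos(3) = {4}
followpos(4) = {5}
followpos(5) = {6}
followpos(6) = { }

These sets are exactly what the direct method uses to build the DFA states for the expression.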
LR parsers are used to parse a large class of context-free grammars. This technique is
called LR(k) parsing, where:
• L stands for left-to-right scanning of the input.
• R stands for constructing a rightmost derivation in reverse.
• k is the number of input symbols of lookahead that are used in making parsing
decisions.
There are three widely used algorithms available for constructing an LR parser:
• SLR(1) - Simple LR
• LR(1) - Canonical LR
• LALR(1) - Lookahead LR
Advantages of LR parsers
• An LR parser can detect a syntax error as soon as it is possible to do so on a
left-to-right scan of the input.

Drawbacks of LR parsers
• It is too much work to construct an LR parser by hand for a typical programming-language
grammar; a specialized tool (an LR parser generator) is needed.
Model of LR Parser
LR parser consists of an input, an output, a stack, a driver program and a parsing table
that has two functions
1. Action
2. Goto
The driver program is the same for all LR parsers; only the parsing table changes from one
parser to another.

The parsing program reads characters from an input buffer one at a time. Where a
shift-reduce parser would shift a symbol, an LR parser shifts a state; each state summarizes
the information contained in the stack below it.

The stack holds a sequence of states s0, s1, · · ·, sm, where sm is on the top.
Action This function takes as arguments a state i and a terminal a (or $, the input end
marker). The value of ACTION [i, a] can have one of four forms:
i) Shift j, where j is a state: the parser shifts input symbol a and state j onto the stack.
ii) Reduce A ---> β: the parser reduces the handle β on top of the stack to the head A.
iii) Accept: the parser accepts the input and finishes parsing.
iv) Error: the parser discovers an error in its input.
Goto This function takes a state and a grammar symbol as arguments and produces a state.
If GOTO[Ii, A] = Ij, then GOTO maps state i and non-terminal A to state j.
The parser decides its next move based on the current state sm and input symbol ai:
1. If ACTION[sm, ai] = shift s, the parser executes a shift move; it shifts the next state
s onto the stack.
2. If ACTION[sm, ai] = reduce A ---> β, the parser executes a reduce move:
a) The production A ---> β is announced as output.
b) r state symbols are popped off the stack (where r is the length of β), exposing state sm-r.
c) The state s = GOTO[sm-r, A] is pushed onto the stack.
3. If ACTION[sm, ai] = accept, parsing is completed successfully.
4. If ACTION[sm, ai] = error, the parser has discovered an error and calls an error
recovery routine.
LR Parsing Algorithm
The behaviour of the LR parser is summarized by the following driver loop, which is the
same for all LR parsers:

let a be the first symbol of w$;
while(1) { /* repeat forever */
    let s be the state on top of the stack;
    if (ACTION[s, a] = shift t) {
        push t onto the stack;
        let a be the next input symbol;
    } else if (ACTION[s, a] = reduce A ---> β) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t, A] onto the stack;
        output the production A ---> β;
    } else if (ACTION[s, a] = accept)
        break; /* parsing is done */
    else
        call error-recovery routine;
}
LR(0) Items
An LR(0) item of a grammar G is a production of G with a dot (•) at some position of the
body.
(eg.) The production A ---> XYZ yields the four items:
A ---> •XYZ
A ---> X•YZ
A ---> XY•Z
A ---> XYZ•
One collection of sets of LR(0) items, called the canonical LR(0) collection, provides a
finite automaton that is used to make parsing decisions. Such an automaton is called an
LR(0) automaton.
If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed
from I by the two rules:
• Initially, add every item in I to CLOSURE(I).
• If A ---> α•Bβ is in CLOSURE(I) and B ---> ɣ is a production, then add the item
B ---> •ɣ to CLOSURE(I), if it is not already there. Apply this rule until no more items
can be added to CLOSURE(I).
• The role of augmented production is to stop parsing and notify the acceptance of the
input i.e., acceptance occurs when and only when the parser performs reduction by S' --->
S.
The grammar is augmented with the new production:
S' ---> S
Conflicts
Conflicts are situations in which the parser has more than one option for a particular
step and cannot decide whether to shift or to reduce (a shift/reduce conflict), or which of
several reductions to make (a reduce/reduce conflict).
SLR(1) grammars
An SLR(1) parser is built from the LR(0) automaton; a grammar whose SLR parsing
table has no conflicting entries is said to be SLR(1).

LR(1) items
An LR(1) item has the form [A ---> α•β, a] and consists of:
o An LR(0) item (the core).
o A lookahead token a.
• The • represents how much of the right-hand side has been seen.
• This is the extension of LR(0) items, obtained by introducing one symbol of lookahead
on the input.

LALR(1) construction from LR(1)
• Find the items that have the same set of first components (core) and merge these sets into
one.
• Revise the parsing table of the LR(1) parser by replacing states and goto's with combined
states and combined goto's respectively.
Types of Parsing
Top-Down Parsing
Top-down parsing constructs parse tree for the input string, starting from root node and
creating the nodes of parse tree in pre-order.
General Strategies
• Top-down parsing involves constructing the parse tree starting from root node to leaf
node by consuming tokens generated by lexical analyzer.
o All possible combinations are attempted before the failure to parse is recognized.
• The parsing program consists of a set of procedures, one for each non-terminal.
• Start symbol is placed at the root node and on encountering each non-terminal, the
procedure concerned is called to expand the non-terminal with its corresponding
production.
• Successful completion occurs when the scan over the entire input string is done, i.e., all
terminals in the sentence are derived by the parse tree.
void A() {
    Choose an A-production, A ---> X1 X2 · · · Xk;
    for (i = 1 to k) {
        if (Xi is a non-terminal)
            call procedure Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else
            error;
    }
}
Limitation
• When a grammar with left recursive production is given, then the parser might get into
infinite loop.
(eg.) S ----> SAd
A ---> ab | d

To illustrate backtracking, consider instead the grammar:
S ----> cAd
A ----> ab | a
w = cad
Explanation
• The body of production begins with c, which matches with the first symbol of the input
string.
• Apply the first production of A, which results in the string cabd that does not match
with the given string cad.
• Backtrack to the previous step where the production of A gets expanded and try with
alternate production of it.
• This produces the string cad that matches with the given string.
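The following C sketch reproduces this backtracking behaviour for the grammar S --> cAd, A --> ab | a on w = cad (the function and variable names are illustrative):

#include <stdio.h>

static const char *input = "cad";
static int pos;                        /* current input position */

static int match(char c) {             /* consume c if it is the next symbol */
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int save = pos;                    /* remember position for backtracking */
    if (match('a') && match('b'))      /* try first alternative: A -> ab */
        return 1;
    pos = save;                        /* backtrack */
    return match('a');                 /* try second alternative: A -> a */
}

static int S(void) {                   /* S -> c A d */
    return match('c') && A() && match('d');
}

int main(void) {
    if (S() && input[pos] == '\0')
        printf("string \"%s\" accepted\n", input);
    else
        printf("string \"%s\" rejected\n", input);
    return 0;
}

Here A() first tries ab, fails on b, resets pos to the saved value, and then succeeds with the alternative a, exactly as in the walkthrough above.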
Limitation
• If the given grammar has a larger number of alternatives, then the cost of backtracking
will be high.
Recursive descent parser without backtracking works in a similar way as that of recursive
descent parser with backtracking with the difference that each non-terminal should be
expanded by its correct alternative in the first selection itself.
When the correct alternative is not chosen, the parser cannot backtrack and results in
syntactic error.
Advantage
• It is simple to construct and avoids the overhead of repeatedly re-scanning the input (no
backtracking).
Limitation
• When more than one alternative with common prefixes occur, then the selection of the
correct alternative is highly difficult.
Hence, this process requires a grammar with no common prefixes for alternatives.
• They can also be termed LL(1) parsers, as they are constructed for a class of grammars
called LL(1).
• The production to be applied for a non-terminal is decided based on the current input
symbol.
A grammar G is LL(1) if and only if, whenever A ---> α | β are two distinct productions
of G, the following conditions hold:
o For no terminal a do both α and β derive strings beginning with a (i.e., FIRST(α) and
FIRST(β) are disjoint sets).
o At most one of α and β can derive the empty string.
o If β *---> Ɛ then α does not derive any string beginning with a terminal in
FOLLOW(A). Likewise, if α *---> Ɛ then β does not derive any string beginning with a
terminal in FOLLOW(A).
In order to overcome the limitations of recursive descent parser, LL(1) parser is designed
by using stack data structure explicitly to hold grammar symbols.
In addition to this, a parsing table constructed from the FIRST and FOLLOW sets is used
to decide which production to apply, making the parser non-recursive and table-driven.
A grammar is left recursive if it has a production of the form A ----> A α, for some string
α.
Rule
A left recursive pair of productions A ---> Aα | β can be replaced by:
A ---> βA'
A' ---> αA' | Ɛ

Example
A ----> Aα1 | Aα2 | · · · | Aαn | β1 | β2 | · · · | βm
Solution:
A ----> β1A' | β2A' | · · · | βmA'
A' ----> α1A' | α2A' | · · · | αnA' | Ɛ

(eg.) For E ---> E + T | T, eliminating left recursion gives E ---> TE' and E' ---> +TE' | Ɛ.
Left factoring
When a production has more than one alternative with a common prefix, then it is
necessary to make the right choice of production. This can be done by rewriting the
production to defer the decision until enough of the input has been seen.

Rule
Productions of the form A ---> αβ1 | αβ2 can be rewritten as:
A ---> αA'
A' ---> β1 | β2

Example
A ---> αβ1 | αβ2 | · · · | αβm | ɣ
Solution
A ---> αA' | ɣ
A' ---> β1 | β2 | · · · | βm
Computation of FIRST
Rules
FIRST(X) is computed by applying the following rules until no more terminals or Ɛ can be
added to any FIRST set:
1. If X is a terminal, then FIRST(X) = {X}.
2. If X is a non-terminal and X ---> Ɛ is a production, then add Ɛ to FIRST(X).
3. If X is a non-terminal and X ---> Y1 Y2 · · · Yk is a production, then add every terminal
in FIRST(Y1) to FIRST(X); if Y1 derives Ɛ, also add the terminals in FIRST(Y2), and so
on; if all of Y1, · · ·, Yk derive Ɛ, add Ɛ to FIRST(X).

Computation of FOLLOW
Rules
• For FOLLOW(Start symbol), place $, where $ is the input end marker.
• If there is a production A ---> αBβ, then everything in FIRST(β) except Ɛ is placed in
FOLLOW(B).
• If there is a production A ---> αB, or a production A ---> αBβ where FIRST(β) contains
Ɛ, then everything in FOLLOW(A) is placed in FOLLOW(B).
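As a worked example, consider the standard expression grammar:

E ---> T E'
E' ---> + T E' | Ɛ
T ---> F T'
T' ---> * F T' | Ɛ
F ---> (E) | id

Applying the rules above gives:

FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, Ɛ }
FIRST(T') = { *, Ɛ }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }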
Construction of the parsing table
Input: Grammar G
Output: Parsing table M

For each production A ---> α of the grammar:
1. For each terminal a in FIRST(α), add A ---> α to M[A, a].
2. If Ɛ is in FIRST(α), then for each terminal b in FOLLOW(A), add A ---> α to M[A, b].
If Ɛ is in FIRST(α) and $ is in FOLLOW(A), add A ---> α to M[A, $] as well.

Note:
In general, every undefined (empty) parsing table entry indicates an error.
Parsing of input
• Input buffer - contains the input to be parsed with $ as an end marker for the string.
• Parsing table.
Process
• Initially the stack contains $ to indicate bottom of the stack and the start symbol of
grammar on top of $.
• The input string is placed in input buffer with $ at the end to indicate the end of the
string.
• The parsing algorithm refers to the grammar symbol on the top of the stack and the input
symbol pointed to by the pointer, and consults the entry M[A, a], where A is the symbol on
top of the stack and a is the symbol read by the pointer.
• Based on the table entry, if a production is found, then the body of the production is
pushed onto the stack in reverse order, with its leftmost symbol on the top of the stack.
• Process repeats until the entire string is processed.
• When the stack contains $ (bottom end marker) and the pointer reads $ (end of input
string), successful parsing occurs.
• If no entry is found, it reports error stating that the input string cannot be parsed by the
grammar.
Method
set ip to point to the first symbol of w;
set X to the top stack symbol;
while (X ≠ $) { /* stack is not empty */
    let a be the current input symbol pointed to by ip;
    if (X = a) pop the stack and advance ip;
    else if (X is a terminal) error();
    else if (M[X, a] is an error entry) error();
    else if (M[X, a] = X ---> Y1 Y2 · · · Yk) {
        output the production X ---> Y1 Y2 · · · Yk;
        pop the stack;
        push Yk, Yk-1, · · ·, Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
Components
• Predictive parsing algorithm - contains steps to parse the input string; controls the
parser's process.
• Parsing table - contains entries based on which parsing action has to be carried out.
Process
• The input string to be parsed is placed in the input buffer with $ as the end marker.
• If X is a non-terminal on the top of stack and the input symbol being read is a, the
parser chooses a production by consulting entry in the parsing table M[X, a].
• Replace the non-terminal in stack with the production found in M[X, a] in such a way
that the leftmost symbol of right side of production is on the top of stack i.e., the
production has to be pushed to stack in reverse order.
• If X is a terminal on the top of the stack and it matches the input symbol being read, pop
the symbol from the stack and advance the pointer reading the input buffer.
• Stop parsing when the stack is empty (holds $) and input buffer reads end marker ($).
o If a non-terminal is on the top of the stack, it is expanded using the parsing table until a
terminal appears on the top that can be matched against the input.
BOTTOM-UP PARSING
• Bottom-up parsers construct parse trees starting from the leaves and work up to the
root.
• Shift-reduce parsing tries to build a parse tree for an input string beginning at the leaves
(the bottom) and working up towards the root (the top).
• At each and every step of reduction, the right side of a production which matches with
the substring is replaced by the left side symbol of the production.
• If the substring is chosen correctly at each step, a rightmost derivation is traced out in
reverse.
Handles
A handle of a string is a substring that matches the right side of a production and whose
reduction to the non-terminal on the left side of the production represents one step along
the reverse of a rightmost derivation.
• The string w to the right of the handle contains only terminal symbols.
Consider the grammar:
S --> aABe
A --> Abc | b
B --> d
The sentence abbcde can be reduced to S as follows:
abbcde
aAbcde
aAde
aABe
S
Handle Pruning
• If A --> β is a production, then reducing β to A by the given production is called handle
pruning, i.e., removing the children of A from the parse tree.
(eg.) Consider the grammar:
E --> E + E | E * E | (E) | id
For the right-sentential form id + id * id, the handle is the leftmost id, which is reduced to
E by the production E --> id.
Shift-reduce Parsing
i) Shift-reduce parsing is a bottom-up parsing technique that reduces a string w to the start
symbol of the grammar.
ii) It scans and parses the input text in one forward pass without backtracking.
• Handle pruning must solve the following two problems to perform parsing:
o Locating the substring to be reduced in a right-sentential form (the handle).
o Determining what production to choose in case there is more than one production with
that substring on the right side.
• A stack is the data structure used in a shift-reduce parser; the stack holds grammar
symbols while an input buffer holds the rest of the string to be parsed.
• $ is used to mark the bottom of the stack and also the right end of the input.
• Initially the stack contains only $ and the whole string ω is in the input buffer, as follows:
STACK: $                INPUT: ω$
• The parser processes by shifting zero or more input symbols onto the stack until a
handle β is on top of the stack.
• The parser then reduces β to the left side of the appropriate production.
• The parser repeats this cycle until it has detected an error or until the stack contains the
start symbol and the input is empty.
The parser halts when the configuration becomes:
STACK: $S               INPUT: $
• When the input buffer reaches the end marker $ and the stack contains only $ and the
start symbol, the parser halts and announces successful completion of parsing.
A shift-reduce parser can make four possible actions viz: 1) shift 2) reduce 3) accept 4)
error.
• A shift action, shifts the next symbol onto the top of the stack.
• A reduce action replaces the handle on the top of the stack (the body of a production) by
the non-terminal on the left side (the head) of the production concerned.
To perform reduction, the parser must know the right end of the handle which is at the
top of the stack. Then the left end of the handle within the stack is located and the non-
terminal to replace the handle is decided.
• An error action, discovers that a syntax error has occurred and calls an error recovery
routine.
Note:
An important fact that justifies the use of a stack in shift-reduce parsing is that the handle
will always appear on top of the stack and never inside.
Consider the grammar:
E --> E + E | E * E | (E) | id
and the input string id1 + id2 * id3. Use the shift-reduce parser to check whether the input
string is accepted by the above grammar.
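One possible sequence of moves is shown below (since the grammar is ambiguous, this trace resolves each shift/reduce choice the way the classic textbook example does):

STACK            INPUT                ACTION
$                id1 + id2 * id3 $    shift
$ id1            + id2 * id3 $        reduce by E --> id
$ E              + id2 * id3 $        shift
$ E +            id2 * id3 $          shift
$ E + id2        * id3 $              reduce by E --> id
$ E + E          * id3 $              shift
$ E + E *        id3 $                shift
$ E + E * id3    $                    reduce by E --> id
$ E + E * E      $                    reduce by E --> E * E
$ E + E          $                    reduce by E --> E + E
$ E              $                    accept

Since the parser reaches the configuration ($E, $), the input string is accepted by the grammar.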
Viable prefixes
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce
parser are called viable prefixes.
• There are context-free grammars for which shift-reduce parsing cannot be used: every
shift-reduce parser for such a grammar can reach a configuration in which the parser, even
knowing the entire stack contents and the next input symbol, cannot decide whether to shift
or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make
(a reduce/reduce conflict).
(eg.) The dangling-else grammar exhibits a shift/reduce conflict:
stmt ---> if expr then stmt
| if expr then stmt else stmt
| other
With if expr then stmt on the stack and else as the next input symbol, the parser cannot
decide whether to reduce by stmt ---> if expr then stmt or to shift the else.
• Parse tree is a hierarchical structure which represents the derivation of the grammar to
yield input strings.
• Root node of parse tree has the start symbol of the given grammar from where the
derivation proceeds.
• Leaves of parse tree represent terminals.
• If A -> xyz is a production, then the parse tree will have A as interior node whose
children are x, y and z from its left to right.
Leaf nodes of parse tree are concatenated from left to right to form the input string
derived from a grammar which is called yield of parse tree.
Figure represents the parse tree for the string id+ id* id.
The string id + id * id, is the yield of parse tree depicted in Fig.
• Lex is a tool in lexical analysis phase to recognize tokens using regular expression.
Use of Lex
• lex.l is an input file written in a language which describes the generation of a lexical
analyzer. The lex compiler transforms lex.l into a C program known as lex.yy.c.
• The output of C compiler is the working lexical analyzer which takes stream of input
characters and produces a stream of tokens.
• yylval is a global variable which is shared by lexical analyzer and parser to return the
name and an attribute value of token.
• The attribute value can be numeric code, pointer to symbol table or nothing.
A lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions
Auxiliary functions This section holds additional functions which are used in actions.
These functions are compiled separately and loaded with lexical analyzer.
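A small illustrative lex specification in this format (the token names and actions are hypothetical; compiling it with lex/flex and a C compiler yields a working scanner):

%{
#include <stdio.h>
%}
%%
[ \t\n]+                { /* skip white space */ }
if|else|while           { printf("keyword: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("identifier: %s\n", yytext); }
[0-9]+                  { printf("number: %s\n", yytext); }
"=="                    { printf("relational operator\n"); }
"="                     { printf("assignment symbol\n"); }
.                       { printf("other: %s\n", yytext); }
%%
int yywrap() { return 1; }
int main() { yylex(); return 0; }

Note that the keyword rule is listed before the identifier rule: when both match a prefix of the same length, the rule given first wins, in line with the conflict-resolution rules described below.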
Lexical analyzer produced by lex starts its process by reading one character at a time until
a valid match for a pattern is found.
Once a match is found, the associated action takes place to produce token.
A conflict arises when several prefixes of the input match one or more patterns. This can
be resolved as follows:
• The longest matching prefix is preferred.
• If two or more patterns are matched for the longest prefix, then the first pattern listed in
the lex program is preferred.
Lookahead Operator
• The lookahead operator is an additional operator used in lex patterns to match the right
context of a token, i.e., to distinguish the lexeme from the characters that follow it.
• The lexical analyzer reads one or more characters ahead of the valid lexeme and then
retracts to produce the token.
• At times, it is needed to have certain characters at the end of input to match with a
pattern. In such cases, slash (/) is used to indicate end of part of pattern that matches the
lexeme.
(eg.) The keyword IF can be recognized with the pattern:
IF / \( .* \) {letter}
Here the lexeme is just IF; the parenthesized condition and the following letter are right
context, read ahead but not consumed.
The lexical analyzer generated by lex consists of:
• A program to simulate automata.
• Components created from the lex program by lex itself, which are listed as follows:
o A transition table for the automaton.
o Actions from the input program (fragments of code) which are invoked by the automaton
simulator when needed.
Step 1: Convert each regular expression into an NFA, either by Thompson's construction
or by the direct method.
Step 2: Combine all the NFAs into one by introducing a new start state with Ɛ-transitions
to each of the start states of the NFAs Ni for patterns pi.

(eg.) Consider the three patterns p1 = a, p2 = abb and p3 = a*b+.
For the string abb, pattern p2 and pattern p3 both match. But pattern p2 will be taken into
account as it was listed first in the lex program.
Fig. shows NFAs for recognizing the above-mentioned three patterns.
The combined NFA for all three given patterns is shown in Fig.
The lexical analyzer reads input from the input buffer, starting from the beginning of the
lexeme pointed to by the pointer lexemeBegin. The forward pointer moves ahead over the
input symbols, and the simulator calculates the set of NFA states it is in at each point. If
the NFA simulation has no next state for some input symbol, then no longer prefix reaching
an accepting state can exist. In such cases, the decision is made based on the longest
prefix seen so far that matches some pattern, i.e., the lexeme. The process is repeated until
one or more accepting states are reached. If there are several accepting states, then the
pattern pi which appears earliest in the list of the lex program is chosen.
(eg.)
w = aaba
Explanation
The process starts with the Ɛ-closure of the initial state 0. After processing all the input
symbols, no state is found, as there is no transition out of state 8 on input a. Hence, look
back for the last accepting state reached. From the figure, state 2, which is an accepting
state, is reached after reading input symbol a, and therefore the pattern a has been matched.
At state 8, the string aab has been matched with pattern a*b+. By the lex rule, the longest
matching prefix should be considered. Hence, the action A3 corresponding to pattern p3
will be executed for the string aab.
DFAs are also used to represent the output of lex. The DFA is constructed from the NFA
by converting all the patterns into an equivalent DFA using the subset construction
algorithm. If a DFA state contains one or more accepting NFA states, the first pattern
whose accepting state is represented in that DFA state is determined and displayed as the
output of the DFA state.
Process of DFA is similar to that of NFA. Simulation of DFA is continued until no next
state is found. Then retraction takes place to find the accepting state of DFA. Action
associated with the pattern for that state is executed.
Lookahead operator r1/r2 is needed because the pattern r1 for a particular token may need
to describe some trailing context r2 in order to correctly identify the actual lexeme.
The end of the lexeme occurs when the NFA enters a state p such that:
• p has an Ɛ-transition corresponding to the lookahead operator /, and
• there is a path from p to an accepting state of the NFA.
Figure shows the NFA for recognizing the keyword IF with lookahead. The transition from
state 2 to state 3 represents the lookahead operator (an Ɛ-transition).
Accepting state is state 6, which indicates the presence of keyword IF. Hence, the lexeme
IF is found by looking backwards to the state 2, whenever accepting state (state 6) is
reached.
Syntax directed definition specifies the values of attributes by associating semantic rules
with the grammar productions.
It is a context free grammar with attributes and rules together which are associated with
grammar symbols and productions respectively.
• The attributes are evaluated by visiting the nodes of the syntax tree and computing the
values of the attributes at each node.
Semantic actions
Semantic actions are fragments of code which are embedded within production bodies by
syntax directed translation.
(eg.) E ---> E1 + T { print '+' }
Types of translation
• L-attributed translation
• S-attributed translation
Types of attributes
• Inherited attributes
o It is defined by the semantic rule associated with the production at the parent of the node.
o Attribute values are computed from the node's parent, its siblings and the node itself.
• Synthesized attributes
o It is defined by the semantic rule associated with the production at the node.
o Terminals have synthesized attributes which are the lexical values (denoted by lexval)
generated by the lexical analyzer.
S-attributed Definitions
Syntax directed definition that involves only synthesized attributes is called S-attributed.
Attribute values for the non-terminal at the head is computed from the attribute values of
the symbols at the body of the production.
The attributes of a S-attributed SDD can be evaluated in bottom up order of nodes of the
parse tree. i.e., by performing post order traversal of the parse tree and evaluating the
attributes at a node when the traversal leaves that node for the last time.
Production        Semantic rules
L ---> E n        L.val = E.val
E ---> E1 + T     E.val = E1.val + T.val
E ---> T          E.val = T.val
T ---> T1 * F     T.val = T1.val * F.val
T ---> F          T.val = F.val
F ---> (E)        F.val = E.val
F ---> digit      F.val = digit.lexval
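The same S-attributed rules can be realized directly in code. Below is a C sketch in which each function returns the synthesized val attribute of its non-terminal; the left recursion in E and T is implemented as iteration, and the input string is a hypothetical example:

#include <stdio.h>

static const char *p = "3*5+4";        /* input expression (single-digit operands) */

static int E(void);

static int F(void) {                   /* F -> (E) | digit */
    if (*p == '(') {
        p++;                           /* consume '(' */
        int v = E();
        p++;                           /* consume ')' */
        return v;                      /* F.val = E.val */
    }
    return *p++ - '0';                 /* F.val = digit.lexval */
}

static int T(void) {                   /* T -> T * F | F */
    int v = F();                       /* T.val = F.val */
    while (*p == '*') {
        p++;
        v *= F();                      /* T.val = T1.val * F.val */
    }
    return v;
}

static int E(void) {                   /* E -> E + T | T */
    int v = T();                       /* E.val = T.val */
    while (*p == '+') {
        p++;
        v += T();                      /* E.val = E1.val + T.val */
    }
    return v;
}

int main(void) {
    printf("L.val = %d\n", E());       /* prints L.val = 19 for 3*5+4 */
    return 0;
}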
L-attributed Definitions
The syntax directed definition in which the edges of dependency graph for the attributes
in production body, can go from left to right and not from right to left is called L-
attributed definitions. Attributes of L-attributed definitions may either be synthesized or
inherited.
• Each attribute must be computed either from a synthesized attribute, or from an inherited
or synthesized attribute associated with the head of the production or with a symbol
located to the left of the attribute which is being computed.

In production 1, the inherited attribute T'.inh is computed from the value of F, which is to
its left. In production 2, the inherited attribute T1'.inh is computed from T'.inh associated
with its head and from the value of F, which appears to its left in the production; i.e., an
inherited attribute may only use information from above (the head) or from the left.
Compiler Construction tools - Compiler Design
1. Parser generators.
2. Scanner generators.
3. Syntax-directed translation engines.
4. Automatic code generators.
5. Data-flow analysis engines.
6. Compiler-construction toolkits.
Parser Generators
Parser generators automatically produce syntax analyzers from a grammatical description
of a programming language.

Scanner Generators
Scanner generators produce lexical analyzers from a regular-expression description of the
tokens of a language.

Syntax-directed Translation Engines
Syntax-directed translation engines produce collections of routines for walking a parse tree
and generating intermediate code.

Automatic Code Generators
Automatic code generators produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language of a
target machine.

Data-flow Analysis Engines
Data-flow analysis engines gather information about how values are transmitted from one
part of a program to each of the other parts. Data-flow analysis is a key part of code
optimization.

Compiler-construction Toolkits
Compiler-construction toolkits provide an integrated set of routines for constructing the
various phases of a compiler.
Regular expressions assist in finding the type of token that accounts for a particular lexeme.

Strings and Languages
An alphabet Σ is a finite set of input symbols, e.g., the binary alphabet Σ = {0, 1}; a string
w is a finite sequence of symbols drawn from Σ.
Language (L) is the collection of strings which are accepted by the finite automata.
Length of a string is defined as the number of input symbols in the given string; it is found
with the | | operator.
(eg.) If ω = 0100, then | ω | = 4.
Concatenation of strings is not commutative: if p = 010 and q = 001, then
pq = 010001
qp = 001010
i.e., pq ≠ qp.
Prefix: A prefix of any string s is obtained by removing zero or more symbols from the
end of s. (eg.) ban is a prefix of banana.
Suffix: A suffix of any string s is obtained by removing zero or more symbols from the
beginning of s. (eg.) nana is a suffix of banana.
Substring: A substring of s is obtained by removing any prefix and any suffix from s.
(eg.) nan is a substring of banana.
Operations on Languages
• Union
• Concatenation and
• Closure
Union
Union of two languages L and M produces the set of strings which may be either in
language L or in language M or in both. It can be denoted as,
L U M = { p | p is in L or p is in M }
Concatenation
Concatenation of two languages L and M produces a set of strings which are formed by
concatenating the strings in L with the strings in M (a string in L must be followed by a
string in M). It can be represented as,
LM = { pq | p is in L and q is in M }
Closure
Kleene closure (L*)
Kleene closure indicates zero or more occurrences of input symbols in a string, i.e., it
includes the empty string Ɛ (set of strings with 0 or more occurrences of input symbols).
Positive closure (L+)
Positive closure indicates one or more occurrences of input symbols in a string, i.e., it
excludes the empty string Ɛ (set of strings with 1 or more occurrences of input symbols).
Precedence of operators
The unary operator * (closure) has the highest precedence, concatenation has the second
highest, and | (union) has the lowest precedence; all are left associative.
Based on this precedence, the regular expression is transformed to finite automata when
implementing the lexical analyzer.
Regular Expressions
Regular expressions are a combination of input symbols and language operators such as
union, concatenation and closure.
It can be used to describe the identifier for a language. An identifier is a collection of
letters, digits and underscore which must begin with a letter. Hence, the regular
expression for an identifier can be given by,
letter_ (letter_ | digit)*
where letter_ stands for any letter or the underscore.
Two regular expressions are equivalent if they represent the same regular set.
(eg.) (p | q) = (q | p)
Law Description
r|s=s|r | is commutative
r | (s | t) = (r | s ) | t | is associative
r (st) = (rs)t Concatenation is associative
r(s|t) = rs | rt; (s|t)r = sr | tr Concatenation is distributive
Ɛr = rƐ = r Ɛ is identity for concatenation
r* = (r | Ɛ)* Ɛ is guaranteed in closure
r** = r* * is idempotent
Regular Definition
A regular definition gives aliases to regular expressions for convenience. Sequences of
definitions are of the following form:
d1 --> r1
d2 --> r2
d3 --> r3
· · ·
dn --> rn
in which the definitions d1, d2, · · ·, dn can be used in place of r1, r2, · · ·, rn respectively.
(eg.)
letter_ --> A | B | · · · | Z | a | b | · · · | z | _
digit --> 0 | 1 | · · · | 9
id --> letter_ (letter_ | digit)*
Types of grammar
• Type 0 grammar
• Type 1 grammar
• Type 2 grammar
• Type 3 grammar
Definition
G=(V,T,P,S)
where,
G - Grammar
V - Set of variables
T - Set of Terminals
P - Set of productions
S - Start symbol
The language generated by the grammar is defined as,
L(G) = { w | S ⇒* w, w is in T* }
where,
L - Language
G - Grammar
w - Input string of terminals
S - Start symbol
T - Set of terminals
Hence, CFL is a collection of input strings which are terminals, derived from the start
symbol of grammar on multiple steps.
Conventions
• The following symbols are terminals: lowercase letters early in the alphabet (a, b, c),
operators i.e., +, -, *, punctuation symbols, and digits.
Start symbol is the head of the production stated first in the grammar.
Production is of the form LHS ->RHS (or) head -> body, where head contains only one
non-terminal and body contains a collection of terminals and non-terminals.
Construction of a grammar from a finite automaton
Rules
• If state i has a transition to state j on input a, add the production Ai -> aAj.
• If state i is a final state, add the production Ai -> Ɛ.
In addition to construction of the parse tree, syntax analysis also checks and reports
syntax errors accurately.
(eg.)
c = a + b * 5
Parser is a program that obtains tokens from lexical analyzer and constructs the parse tree
which is passed to the next phase of compiler for further processing.
Types of Parser
• Top down parsers Top down parsers construct parse tree from root to leaves.
• Bottom up parsers Bottom up parsers construct parse tree from leaves to root.
Role of Parser
• On receiving a token, the parser verifies the string of token names that can be generated
by the grammar of source language.
• It calls the function getNextToken(), to notify the lexical analyzer to yield another
token.
• It scans the token one at a time from left to right to construct the parse tree.
• An error is detected as soon as a prefix of the input cannot be completed to form a string
in the language. This capability of detecting an error as early as possible is called the
viable-prefix property.
Error Recovery Strategies
Error recovery strategies are used by the parser to recover from errors once it is detected.
The simplest recovery strategy is to quit parsing with an error message for the first error
itself.
Panic Mode Recovery
Once an error is found, the parser intends to find a designated set of synchronizing tokens
(delimiters such as the semicolon, whose role in the source program is clear) by discarding
input symbols one at a time.
• When the parser finds an error in a statement, it ignores the rest of the statement by not
processing the input.
Advantages
• Simplicity.
Disadvantage
• Additional errors cannot be checked as some of the input symbols will be skipped.
Phrase Level Recovery
The parser performs local correction on the remaining input when an error is detected.
• When the parser finds an error, it tries to take corrective measures so that the rest of the
inputs of the statement allow the parser to parse ahead.
Advantage
• It can correct the error at the point of detection and continue parsing.
Disadvantage
• It is difficult to cope up with actual error if it has occurred before the point of detection.
Error Production
The grammar is augmented with error productions that generate the common erroneous
constructs; when an error production is used during parsing, the parser generates
appropriate error diagnostics about the erroneous construct recognized in the input.
Global Correction
There are algorithms which make changes to modify an incorrect string into a correct
string.
When a grammar G and an incorrect string p are given, these algorithms find a parse tree
for a string q related to p with the smallest number of transformations.
Advantage
• It has been used for phrase level recovery to find optimal replacement strings.
Disadvantage
• It is too costly in terms of time and space to be practical.
Front end
• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.
Back end
• Code optimization.
• Code generation.
Front End
• Front end comprises of phases which are dependent on the input (source language) and
independent on the target machine (target language).
• It includes lexical and syntactic analysis, symbol table management, semantic analysis
and the generation of intermediate code.
• Code optimization can also be done by the front end.
• It also includes error handling at the phases concerned.
Back End
• Back end comprises those phases of the compiler that are dependent on the target
machine and independent of the source language.
• This includes code optimization, code generation.
• In addition to this, it also encompasses error handling and symbol table management
operations.
Passes
• The phases of compiler can be implemented in a single pass by marking the primary
actions viz. reading of input file and writing to the output file.
• Several phases of compiler are grouped into one pass in such a way that the operations
in each and every phase are incorporated during the pass.
• (eg.) Lexical analysis, syntax analysis, semantic analysis and intermediate code
generation might be grouped into one pass. If so, the token stream after lexical analysis
may be translated directly into intermediate code.
• Minimizing the number of passes improves the time efficiency as reading from and
writing to intermediate files can be reduced.
• When grouping phases into one pass, the entire program has to be kept in memory to
ensure proper information flow to each phase because one phase may need information in
a different order than the information produced in previous phase.
The source program or target program differs from its internal representation. So, the
memory for internal form may be larger than that of input and output.
Pre-processor
A source program may be divided into modules stored in separate files. The task of
collecting the source program is entrusted to a separate program called the pre-processor.
It may also expand macros into source language statements.
Compiler
Compiler is a program that takes source program as input and produces assembly
language program as output.
Assembler
Assembler is a program that converts assembly language programs into machine language
programs. It produces relocatable machine code as its output.
Types
• Leftmost derivation.
• Rightmost derivation.
Leftmost Derivation
In leftmost derivation, at each and every step the leftmost non-terminal is expanded by
substituting its corresponding production to derive a string.
Example
Consider the grammar E ---> E + E | E * E | id and the string id + id * id.
E => E + E => id + E => id + E * E => id + id * E => id + id * id
Rightmost Derivation
In rightmost derivation, at each and every step the rightmost non-terminal is expanded by
substituting its corresponding production to derive a string.
Example
For the same grammar and string:
E => E + E => E + E * E => E + E * id => E + id * id => id + id * id