
What are Translators?

A program written in a high-level language is called source code. To convert the source code into machine code, translators are needed.

A translator takes a program written in a source language as input and converts it into a program in a target language as output.

It also detects and reports errors during translation.

The roles of a translator are:

• Translating the high-level language program input into an equivalent machine language program.

• Providing diagnostic messages wherever the programmer violates the specification of the high-level language.

Different types of translators

The different types of translators are as follows:

Compiler

A compiler is a translator used to convert programs written in a high-level language into a low-level language. It translates the entire program at once and reports the errors in the source program encountered during the translation.
Interpreter

An interpreter is a translator used to convert programs written in a high-level language into a low-level language. It translates line by line and reports an error as soon as it is encountered during the translation process.

It directly executes the operations specified in the source program on the input given by the user.

It gives better error diagnostics than a compiler.

Differences between compiler and interpreter

S.No  Compiler                                      Interpreter
1     Performs the translation of a program         Performs statement-by-statement
      as a whole.                                   translation.
2     Execution is faster.                          Execution is slower.
3     Requires more memory, as linking is           Memory usage is efficient, as no
      needed for the generated intermediate         intermediate object code is generated.
      object code.
4     Debugging is hard, as the error messages      It stops translation when the first
      are generated only after scanning the         error is met; hence, debugging is easy.
      entire program.
5     Programming languages like C and C++          Programming languages like Python,
      use compilers.                                BASIC, and Ruby use interpreters.

Assembler

An assembler is a translator used to translate assembly language code into machine language code.
 Phases of Compiler - Compiler Design


The structure of a compiler consists of two parts:

Analysis part

• The analysis part breaks the source program into constituent pieces and imposes a grammatical structure on them; this structure is then used to create an intermediate representation of the source program.

• It is also termed the front end of the compiler.

• Information about the source program is collected and stored in a data structure called the symbol table.

Synthesis part

• The synthesis part takes the intermediate representation as input and transforms it into the target program.

• It is also termed the back end of the compiler.

The design of a compiler can be decomposed into several phases, each of which converts one form of the source program into another.

The different phases of compiler are as follows:

1. Lexical analysis

2. Syntax analysis
3. Semantic analysis

4. Intermediate code generation

5. Code optimization

6. Code generation

All of the aforementioned phases involve the following tasks:

• Symbol table management.

• Error handling.

                      

Lexical Analysis

• Lexical analysis is the first phase of the compiler; it is also termed scanning.

• The source program is scanned to read the stream of characters, and the characters are grouped into sequences called lexemes, each of which produces a token as output.

• Token: A token is a sequence of characters that represents a lexical unit matching a pattern, such as a keyword, operator, identifier, etc.

• Lexeme: A lexeme is an instance of a token, i.e., a group of characters forming a token.

• Pattern: A pattern describes the rule that the lexemes of a token take. It is the structure that must be matched by strings.

• Once a token is generated, the corresponding entry is made in the symbol table.

Input: stream of characters

Output: Token

Token Template: <token-name, attribute-value>

(eg.) c=a+b*5;

                                                 Lexemes and tokens

Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)

 
 

Hence, c = a + b * 5 is represented as the token stream <id, 1> <=> <id, 2> <+> <id, 3> <*> <5>.

Syntax Analysis

• Syntax analysis is the second phase of the compiler; it is also called parsing.

• The parser converts the tokens produced by the lexical analyzer into a tree-like representation called a parse tree.

• A parse tree describes the syntactic structure of the input.

• A syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes and the operands of an operator are the children of the node for that operator.

Input: Tokens

Output: Syntax tree


              

Semantic Analysis

• Semantic analysis is the third phase of the compiler.

• It checks for semantic consistency.

• Type information is gathered and stored in the symbol table or in the syntax tree.

• Performs type checking.

               
Intermediate Code Generation

• Intermediate code generation produces intermediate representations for the source program, which are of the following forms:

     o Postfix notation

     o Three address code

     o Syntax tree

The most commonly used form is three address code. For the running example c = a + b * 5:

        t1 = inttofloat(5)

        t2 = id3 * t1

        t3 = id2 + t2

        id1 = t3

Properties of intermediate code

• It should be easy to produce.

• It should be easy to translate into the target program.

Code Optimization

• Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.

• It results in faster running machine code.

• It can be done by reducing the number of lines of code for a program.

• This phase reduces the redundant code and attempts to improve the intermediate code so
that faster-running machine code will result.

• During the code optimization, the result of the program is not affected.

• To improve the code generation, the optimization involves:

       o Detection and removal of dead code (unreachable code).

       o Calculation of constants in expressions and terms (constant folding).

       o Collapsing of repeated expressions into a temporary variable (common subexpression elimination).

       o Loop unrolling.

       o Moving code outside the loop (code motion).

       o Removal of unwanted temporary variables.

For the running example, the optimized code is:

                   t1 = id3 * 5.0

                   id1 = id2 + t1

Code Generation

• Code generation is the final phase of a compiler.

• It gets input from code optimization phase and produces the target code or object code
as result.

• Intermediate instructions are translated into a sequence of machine instructions that perform the same task.

• The code generation involves

     o Allocation of register and memory.

     o Generation of correct references.

     o Generation of correct data types.

     o Generation of missing code.

                LDF  R2, id3

                MULF R2, #5.0

                LDF  R1, id2

                ADDF R1, R2

                STF  id1, R1

Symbol Table Management

• Symbol table is used to store all the information about identifiers used in the program.

• It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.

• It allows finding the record for each identifier quickly and to store or retrieve data from
that record.

• Whenever an identifier is detected in any of the phases, it is stored in the symbol table.

Example

int a, b; float c; char z;

    

Symbol name   Type    Address
a             int     1000
b             int     1002
c             float   1004
z             char    1008

 
Example

extern double test(double x);

double sample(int count)
{
    double sum = 0.0;
    for (int i = 1; i <= count; i++)
        sum += test((double) i);
    return sum;
}

Symbol name   Type              Scope
test          function, double  extern
x             double            function parameter
sample        function, double  global
count         int               function parameter
sum           double            block local
i             int               for-loop statement

Error Handling

• Each phase can encounter errors. After detecting an error, a phase must handle the error
so that compilation can proceed.

• In lexical analysis, errors occur in separation of tokens.

• In syntax analysis, errors occur during construction of syntax tree.

• In semantic analysis, errors may occur in the following cases:

(i) When the compiler detects constructs that have the right syntactic structure but no meaning.

(ii) During type conversion.


• In code optimization, errors occur when the result is affected by the optimization. In code generation, errors occur when code is missing, etc.

The figure illustrates the translation of source code through each phase, considering the statement

    c = a + b * 5.

Error Encountered in Different Phases

Each phase can encounter errors. After detecting an error, a phase must somehow deal with the error, so that compilation can proceed.
A program may have the following kinds of errors at various stages:

Lexical Errors

These include an incorrect or misspelled name of some identifier, i.e., identifiers typed incorrectly.

Syntactical Errors

These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the syntax analyzer (parser).

When an error is detected, it must be handled by the parser to enable the parsing of the rest of the input. In general, errors may be expected at various stages of compilation, but most of the errors are syntactic errors, and hence the parser should be able to detect and report those errors in the program.

The goals of the error handler in the parser are:

• Report the presence of errors clearly and accurately.

• Recover from each error quickly enough to detect subsequent errors.

• Add minimal overhead to the processing of correct programs.

There are four common error-recovery strategies that can be implemented in the parser to
deal with errors in the code.

o Panic mode.
o Phrase level.
o Error productions.
o Global correction.

Semantical Errors
These errors are a result of incompatible value assignment. The semantic errors that the
semantic analyzer is expected to recognize are:

• Type mismatch.
• Undeclared variable.
• Reserved identifier misuse.
• Multiple declaration of variable in a scope.
• Accessing an out of scope variable.
• Actual and formal parameter mismatch.

Logical errors

These errors occur due to unreachable code or constructs such as an infinite loop.

Input Buffering – Compiler Design


• To ensure that a right lexeme is found, one or more characters have to be looked ahead beyond the next lexeme.

• Hence a two-buffer scheme is introduced to handle large lookaheads safely.

• Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.

There are three general approaches for the implementation of a lexical analyzer:

(i) By using a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.

(ii) By writing the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.

(iii) By writing the lexical analyzer in assembly language and explicitly managing the reading of input.

Buffer Pairs

Because a large amount of time is consumed in moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

Fig shows the buffer pairs which are used to hold the input data.

          

Scheme

• Consists of two buffers, each of N-character size, which are reloaded alternately.

• N is the number of characters in one disk block, e.g., 4096.

• N characters are read from the input file into a buffer half using one system read command.

• eof is inserted at the end if the number of characters read is less than N.

Pointers

Two pointers, lexemeBegin and forward, are maintained.

lexemeBegin points to the beginning of the current lexeme, which is yet to be found.

forward scans ahead until a match for a pattern is found.

• Once a lexeme is found, forward is set to the character at its right end, and after the lexeme is processed, lexemeBegin is set to the character immediately after the lexeme just found.

• The current lexeme is the set of characters between the two pointers.

Disadvantages of this scheme

• This scheme works well most of the time, but the amount of lookahead is limited.

• This limited lookahead may make it impossible to recognize tokens in situations where
the distance that the forward pointer must travel is more than the length of the buffer.

(eg.) DECLARE(ARG1, ARG2, ..., ARGn) in a PL/I program.

• The lexical analyzer cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.

Sentinels

• In the previous scheme, each time the forward pointer is moved, a check is done to ensure that it has not moved off one half of the buffer; if it has, the other half must be reloaded.

• Therefore the ends of the buffer halves require two tests for each advance of the forward
pointer.

Test 1: For end of buffer.

Test 2: To determine what character is read.

• The usage of sentinel reduces the two tests to one by extending each buffer half to hold
a sentinel character at the end.

• The sentinel is a special character that cannot be part of the source program. (eof
character is used as sentinel).

         

Advantages

• Most of the time, it performs only one test to see whether the forward pointer points to an eof.

• Only when it reaches the end of the buffer half or eof, it performs more tests.

• Since N input characters are encountered between eofs, the average number of tests per
input character is very close to 1.

Lexical Analysis – Compiler Design


Lexical analysis is the process of converting a sequence of characters from the source program into a sequence of tokens.

A program which performs lexical analysis is termed a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing which are as follows:

• Scanning

• Tokenization

Token, Pattern and Lexeme

Token

A token is a valid sequence of characters given by a lexeme. In a programming language,

• keywords,

• constants,

• identifiers,

• numbers,

• operators and

• punctuation symbols

are possible tokens to be identified.

Pattern

A pattern describes a rule that must be matched by a sequence of characters (lexemes) to form a token. It can be defined by regular expressions or grammar rules.

Lexeme

A lexeme is a sequence of characters that matches the pattern for a token, i.e., an instance of a token.
(eg.) c=a+b*5;

                                               Lexemes and tokens

Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)

The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of programming languages.

Role of Lexical Analyzer

The lexical analyzer performs the following tasks:

• Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.

• Enters the identified tokens into the symbol table.

• Strips out white space and comments from the source program.

• Correlates error messages with the source program, i.e., displays an error message with its occurrence by specifying the line number.

• Expands macros if they are found in the source program.

Tasks of lexical analyzer can be divided into two processes:

Scanning: Performs reading of input characters, removal of white spaces and comments.

Lexical Analysis: Produce tokens as the output.

Need of Lexical Analyzer

Simplicity of compiler design: The removal of white space and comments leaves the syntax analyzer with cleaner syntactic constructs to process.

Compiler efficiency is improved: Specialized buffering techniques for reading characters speed up the compilation process.

Compiler portability is enhanced.

Issues in Lexical Analysis


Lexical analysis is the process of producing tokens from the source program. It has the
following issues:

• Lookahead

• Ambiguities

Lookahead

Lookahead is required to decide when one token will end and the next token will begin. Simple examples with lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required.

A way is also needed to resolve ambiguities:

• Is if two variables i and f, or the keyword if?

• Is == two equal signs = =, or the single operator ==?

• arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).

Hence, the amount of lookahead to be considered, and a way to describe the lexemes of each token, are both needed.

Regular expressions are one of the most popular ways of representing tokens.

Ambiguities

The lexical analysis programs written with lex accept ambiguous specifications and
choose the longest match possible at each input point. Lex can handle ambiguous
specifications. When more than one expression can match the current input, lex chooses
as follows:

• The longest match is preferred.

• Among rules which matched the same number of characters, the rule given first is
preferred.

Lexical Errors

• A character sequence that cannot be scanned into any valid token is a lexical error.

• Lexical errors are uncommon, but they still must be handled by a scanner.
• Misspellings of identifiers, keywords, or operators are considered lexical errors.

Usually, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token.

Error Recovery Schemes

• Panic mode recovery

• Local correction

   o Source text is changed around the error point in order to get a correct text.

   o Analyzer will be restarted with the resultant new text as input.

• Global correction

   o It is an enhanced panic mode recovery.

   o Preferred when local correction fails.

Panic mode recovery

In panic mode recovery, unmatched patterns are deleted from the remaining input, until
the lexical analyzer can find a well-formed token at the beginning of what input is left.

(eg.) For instance, suppose the string fi is encountered for the first time in a C program in the context:

fi (a == f(x))

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.

Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.

Local correction

Local correction performs deletion/insertion and/or replacement of any number of symbols at the error-detection point.

(eg.) In Pascal, given c[i] '=';, the scanner deletes the first quote because it cannot legally follow the closing bracket, and the parser replaces the resulting '=' by an assignment symbol.

Most errors are corrected by local correction.

(eg.) The effects of lexical error recovery might well create a later syntax error, handled by the parser. Consider

· · · for $tnight · · ·

The $ terminates the scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier.

In effect, the result is

· · · fortnight · · ·

which will cause a syntax error. Such false errors are unavoidable, though a syntactic error-repair technique may help.

Lexical error handling approaches

Lexical errors can be handled by the following actions:

• Deleting one character from the remaining input.

• Inserting a missing character into the remaining input.

• Replacing a character by another character.

• Transposing two adjacent characters.

Convert Regular Expression to DFA - Compiler Design


A regular expression is used to represent the language (lexemes) of a finite automaton (lexical analyzer).

Finite automata

A recognizer for a language is a program that takes as input a string x and answers yes if x is a sentence of the language, and no otherwise.

A regular expression is compiled into a recognizer by constructing a generalized transition diagram called a Finite Automaton (FA).

A finite automaton can be a Non-deterministic Finite Automaton (NFA) or a Deterministic Finite Automaton (DFA).

It is given by M = (Q, Σ, q0, F, δ), where

Q - set of states

Σ - set of input symbols

q0 - start state

F - set of final states

δ - transition function (mapping a state and an input symbol to states)

δ : Q × Σ → Q

• Non-deterministic Finite Automata (NFA)

       o More than one transition occurs for any input symbol from a state.

       o Transition can occur even on empty string (Ɛ).

• Deterministic Finite Automata (DFA)

   o For each state and for each input symbol, exactly one transition occurs from that state.

A regular expression can be converted into a DFA by the following methods:

   (i) Thompson's construction followed by subset construction

         • The given regular expression is converted into an NFA.

         • The resultant NFA is converted into a DFA.

(ii) Direct method

         • In the direct method, the given regular expression is converted directly into a DFA.

Rules for Conversion of Regular Expression to NFA

• Union

                                     r = r1 + r2

• Concatenation

                                     r = r1 r2

• Closure

                                     r = r1*

Ɛ-closure

The Ɛ-closure of a state is the set of states that are reachable from the state concerned on taking the empty string as input. It describes the path that consumes the empty string (Ɛ) to reach some states of the NFA.

Example 1

Ɛ-closure(q0) = {q0, q1, q2}

Ɛ-closure(q1) = {q1, q2}

Ɛ-closure(q2) = {q0}

Example 2

Ɛ-closure(1) = {1, 2, 3, 4, 6}

Ɛ-closure(2) = {2, 3, 6}

Ɛ-closure(3) = {3, 6}

Ɛ-closure(4) = {4}

Ɛ-closure(5) = {5, 7}

Ɛ-closure(6) = {6}

Ɛ-closure(7) = {7}

Sub-set Construction

• Given regular expression is converted into NFA.

• Then, NFA is converted into DFA.

Steps

1. Convert the regular expression into an NFA using the above rules for the operators (union, concatenation and closure) and their precedence.

2. Find the Ɛ-closure of all states.

3. Start with the Ɛ-closure of the start state of the NFA.

4. Apply the input symbols and find the Ɛ-closure of the result:

Dtran[state, input symbol] = Ɛ-closure(move(state, input symbol))

where Dtran is the transition function of the DFA.

5. Analyze the output state to find whether it is a new state.

6. If a new state is found, repeat step 4 and step 5 until no more new states are found.

7. Construct the transition table for the Dtran function.

8. Draw the transition diagram with the start state as the Ɛ-closure of the NFA's start state; a final state is any DFA state that contains a final state of the NFA.

Direct Method

• The direct method is used to convert a given regular expression directly into a DFA.

• It uses the augmented regular expression r#.

• The important states of the NFA correspond to the positions in the regular expression that hold symbols of the alphabet.

• The regular expression is represented as a syntax tree where interior nodes correspond to operators representing union, concatenation and closure operations.

• Leaf nodes correspond to the input symbols.

• The DFA is constructed directly from the regular expression by computing the functions nullable(n), firstpos(n), lastpos(n) and followpos(i) from the syntax tree.

o nullable(n): true for a * node and a node labeled with Ɛ; for other nodes it is false.

o firstpos(n): the set of positions that correspond to the first symbol of a string generated by the sub-expression rooted at n.

o lastpos(n): the set of positions that correspond to the last symbol of a string generated by the sub-expression rooted at n.

o followpos(i): the set of positions that can follow position i in a string generated by the given regular expression.

Rules for computing nullable, firstpos and lastpos

Node n                   nullable(n)         firstpos(n)            lastpos(n)

A leaf labeled Ɛ         true                Ø                      Ø

A leaf with position i   false               {i}                    {i}

An or node n = c1 | c2   nullable(c1) or     firstpos(c1) ∪         lastpos(c1) ∪
                         nullable(c2)        firstpos(c2)           lastpos(c2)

A cat node n = c1 c2     nullable(c1) and    if nullable(c1):       if nullable(c2):
                         nullable(c2)        firstpos(c1) ∪         lastpos(c1) ∪
                                             firstpos(c2)           lastpos(c2)
                                             else: firstpos(c1)     else: lastpos(c2)

A star node n = c1*      true                firstpos(c1)           lastpos(c1)
 
Computation of followpos

One position of a regular expression can follow another in the following ways:

• If n is a cat node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).

o For a cat node, for each position i in the lastpos of its left child, the firstpos of its right child is in followpos(i).

• If n is a star node and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).

o For a star node, the firstpos of that node is in the followpos of all positions in the lastpos of that node.


LR Parsers - Compiler Design


LR parsers are used to parse a large class of context-free grammars. The technique is called LR(k) parsing:

• L is for left-to-right scanning of the input.

• R is for constructing a rightmost derivation in reverse.

• k is the number of input symbols of lookahead that are used in making parsing decisions.

There are three widely used algorithms available for constructing an LR parser:

• SLR(1) - Simple LR

    o Works on the smallest class of grammars.

    o Few states, hence a very small table.

    o Simple and fast construction.

• LR(1) - LR parser

    o Also called the Canonical LR parser.

    o Works on the complete set of LR(1) grammars.

    o Generates a large table and a large number of states.

    o Slow construction.

• LALR(1) - Lookahead LR parser

    o Works on an intermediate size of grammar.

    o The number of states is the same as in SLR(1).

Reasons for the attractiveness of LR parsers

• LR parsers can handle a large class of context-free grammars.

• The LR parsing method is the most general non-backtracking shift-reduce parsing method.

• An LR parser can detect syntax errors as soon as they occur.

• LR grammars can describe more languages than LL grammars.

Drawbacks of LR parsers

• It is too much work to construct an LR parser by hand; an automated parser generator is needed.

• If the grammar contains ambiguities or other problematic constructs, it is difficult to parse in a left-to-right scan of the input.

Model of LR Parser

LR parser consists of an input, an output, a stack, a driver program and a parsing table
that has two functions

1. Action

2. Goto

The driver program is the same for all LR parsers; only the parsing table changes from one parser to another.

The parsing program reads characters from an input buffer one at a time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each state summarizes the information contained in the stack below it.

The stack holds a sequence of states s0, s1, ..., sm, where sm is on top.

Action This function takes as arguments a state i and a terminal a (or $, the input end marker). The value of ACTION[i, a] can have one of four forms:

i) Shift j, where j is a state.

ii) Reduce by a grammar production A ---> β.

iii) Accept.

iv) Error.

Goto This function takes a state and a grammar symbol as arguments and produces a state.

If GOTO[Ii, A] = Ij, then GOTO also maps state i and nonterminal A to state j.

Behavior of the LR parser

1. If ACTION[sm, ai] = shift s, the parser executes a shift move; it shifts the next state s onto the stack, where

a) sm is the state on top of the stack.

b) ai is the current input symbol.

2. If ACTION[sm, ai] = reduce A ---> β, then the parser executes a reduce move, entering the configuration

                                   (s0 s1 ... s(m-r) s, a(i+1) ... an $)

a) where r is the length of β and s = GOTO[s(m-r), A].

b) First, r state symbols are popped off the stack, exposing state s(m-r).

c) Then s, the entry for GOTO[s(m-r), A], is pushed onto the stack.

3. If ACTION[sm, ai] = accept, parsing is completed.

4. If ACTION[sm, ai] = error, the parser has discovered an error and calls an error-recovery routine.

LR Parsing Algorithm

Algorithm LR parsing algorithm.

Input   An input string w, and an LR-parsing table with functions ACTION and GOTO for a grammar G.

Output If w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error indication.

Method Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer.

       let a be the first symbol of w$;
       while (1) { /* repeat forever */
           let s be the state on top of the stack;
           if (ACTION[s, a] = shift t) {
               push t onto the stack;
               let a be the next input symbol;
           } else if (ACTION[s, a] = reduce A ---> β) {
               pop |β| symbols off the stack;
               let state t now be on top of the stack;
               push GOTO[t, A] onto the stack;
               output the production A ---> β;
           } else if (ACTION[s, a] = accept) break; /* parsing is done */
           else call the error-recovery routine;
       }

LR(0) Items

An LR(0) item of a grammar G is a production of G with a dot at some position of the body.

(eg.)

                                                   A ---> •XYZ

                                                   A ---> X•YZ

                                                   A ---> XY•Z

                                                   A ---> XYZ•

One collection of sets of LR(0) items, called the canonical LR(0) collection, provides a finite automaton that is used to make parsing decisions. Such an automaton is called an LR(0) automaton.

LR(0) Parser / SLR(1) Parser

An LR(0) parser is a shift-reduce parser that uses zero tokens of lookahead to determine what action to take (hence the 0). This means that in any configuration of the parser, the parser must have an unambiguous action to choose: either it shifts a specific symbol or it applies a specific reduction. If there are ever two or more choices to make, the parser fails and the grammar is not LR(0).

An LR parser makes shift-reduce decisions by maintaining states to keep track of parsing. States represent sets of items.

Closure of item sets

If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by two rules:

• Initially, add every item in I to CLOSURE(I).

• If A ---> α•Bβ is in CLOSURE(I) and B ---> ɣ is a production, then add the item B ---> •ɣ to CLOSURE(I), if it is not already there. Apply this rule until no more items can be added to CLOSURE(I).

Construct the canonical LR(0) collection

• An augmented grammar is defined with two functions, CLOSURE and GOTO. If G is a grammar with start symbol S, then the augmented grammar G' is G with a new start symbol S' and production S' ---> S.

• The role of the augmented production is to stop parsing and signal acceptance of the input; i.e., acceptance occurs when and only when the parser performs a reduction by S' ---> S.

Limitations of the LR(0) parsing method

Consider the grammar for matched parentheses

1. S' ---> S

2. S ---> (S) S

3. S ---> Ɛ

The LR(0) DFA of grammar G is shown below.

In states 0, 2 and 4 the parser can shift ( and also reduce Ɛ to S.

Conflicts

Conflicts are situations which arise when there is more than one option to opt for at a particular step, shift or reduce.

• Two kinds of conflicts may arise: shift-reduce and reduce-reduce.

• In state 0 the parser encounters a conflict:

     It can shift state 2 onto the stack when the next token is (.

     It can also reduce by production 3: S ---> Ɛ.

     This is called a shift-reduce conflict.

     This conflict also appears in states 2 and 4.

Shift-reduce conflict: the parser can both shift and reduce.

Reduce-reduce conflict: two (or more) productions can be reduced.

SLR(1) grammars

• SLR(1) parsing increases the power of LR(0) significantly.

     The lookahead token is used to make parsing decisions.

     The reduce action is applied more selectively, according to the FOLLOW set.

• A grammar is SLR(1) if two conditions are met in every state:

     If A ---> α•xɣ and B ---> β• are in the state, then x ∉ FOLLOW(B).

     If A ---> α• and B ---> β• are in the state, then FOLLOW(A) ∩ FOLLOW(B) = Ø.

• Violation of the first condition results in a shift-reduce conflict:

     If A ---> α•xɣ and B ---> β• with x ∈ FOLLOW(B), the parser can both shift x and reduce B ---> β.

• Violation of the second condition results in a reduce-reduce conflict:

     If A ---> α• and B ---> β• with x ∈ FOLLOW(A) ∩ FOLLOW(B), the parser can reduce by both A ---> α and B ---> β.

• SLR(1) grammars are a superset of LR(0) grammars.

LR(1) Parser / Canonical LR (CLR)

• Even more powerful than SLR(1) is the LR(1) parsing method.

• LR(1) includes LR(0) items and a lookahead token in item sets.

• An LR(1) item consists of:

o a grammar production rule,

o a right-hand position represented by the dot, and

o a lookahead token:

o A ---> X1 ··· Xi • X(i+1) ··· Xn, l    where l is a lookahead token

• The • represents how much of the right-hand side has been seen:

o X1 ··· Xi appear on top of the stack.

o X(i+1) ··· Xn are expected to appear in the input buffer.

• The lookahead token l is expected after X1 ··· Xn appears on the stack.

• An LR(1) state is a set of LR(1) items.

Introduction to LALR Parser

• LALR stands for lookahead LR parser.

• It is an extension of LR(0) items, introducing one symbol of lookahead on the input.

• It supports a large class of grammars.

• The number of states in an LALR parser is smaller than that of an LR(1) parser. Hence, LALR is preferable, as it can be used with reduced memory.

• Most syntactic constructs of programming languages can be stated conveniently.

Steps to construct the LALR parsing table

• Generate the LR(1) items.

• Find the items that have the same set of first components (core) and merge these sets into one.

• Merge the gotos of the combined item sets.

• Revise the parsing table of the LR(1) parser by replacing states and gotos with the combined states and combined gotos respectively.

Types of Parsing


Top-Down Parsing

Top-down parsing constructs the parse tree for the input string, starting from the root node and creating the nodes of the parse tree in pre-order.

It is done by leftmost derivation for an input string.

General Strategies

• Top-down parsing involves constructing the parse tree starting from root node to leaf
node by consuming tokens generated by lexical analyzer.

• Top-down parsing is characterized by the following methods:

o Brute-force method, accompanied by a parsing algorithm: all possible combinations are attempted before failure to parse is recognized.

o Recursive descent, a parsing technique which involves backtracking and is troubled by left recursion.

o Top-down parsing with limited or partial backup.

Recursive Descent Parser

• Recursive descent parser is a top-down parser.

• It requires backtracking to find the correct production to be applied.

• The parsing program consists of a set of procedures, one for each non-terminal.

• Process begins with the procedure for start symbol.

• Start symbol is placed at the root node and on encountering each non-terminal, the
procedure concerned is called to expand the non-terminal with its corresponding
production.

• Procedure is called recursively until all non-terminals are expanded.

• Successful completion occurs when the scan over the entire input string is done, i.e., all terminals in the sentence are derived by the parse tree.

void A()
{
    choose an A-production, A ---> X1 X2 ... Xk;
    for (i = 1 to k) {
        if (Xi is a non-terminal)
            call procedure Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else
            error;
    }
}
Limitation

• When a grammar with a left-recursive production is given, the parser might get into an infinite loop.

Hence, left recursion must be eliminated.

(eg.) Let grammar G be:

S ---> SAd

A ---> ab | d

Recursive descent parser with backtracking

(eg.) Let grammar G be:

S ---> cAd

A ---> ab | a

w = cad

Explanation

• The root node contains the start symbol, which is S.

• The body of the production begins with c, which matches the first symbol of the input string.

• A is a non-terminal having two productions, A ---> ab | a.

• Applying the first production of A results in the string cabd, which does not match the given string cad.

• Backtrack to the previous step where the production of A was expanded and try its alternate production.

• This produces the string cad, which matches the given string.

Limitation

• If the given grammar has many alternatives, the cost of backtracking will be high.

Recursive descent parser without backtracking

 
Recursive descent parser without backtracking works in a similar way as that of recursive
descent parser with backtracking with the difference that each non-terminal should be
expanded by its correct alternative in the first selection itself.

When the correct alternative is not chosen, the parser cannot backtrack and results in
syntactic error.

Advantage

• Overhead associated with backtracking is eliminated.

Limitation

• When more than one alternative with common prefixes occur, then the selection of the
correct alternative is highly difficult.

Hence, this process requires a grammar with no common prefixes for alternatives.

Predictive Parser / LL(1) Parser

• Predictive parsers are top-down parsers.

• A predictive parser is a type of recursive descent parser with no backtracking.

• It can be implemented non-recursively by using a stack data structure.

• It can also be termed an LL(1) parser, as it is constructed for a class of grammars called LL(1).

• The production to be applied for a non-terminal is decided based on the current input symbol.

A grammar G is LL(1) if, for any two distinct productions A ---> α | β, the following conditions hold:

o For no terminal a do both α and β derive strings beginning with a.

o At most one of α and β can derive the empty string.

o If β *---> Ɛ, then α does not derive any string beginning with a terminal in FOLLOW(A).

o If α *---> Ɛ, then β does not derive any string beginning with a terminal in FOLLOW(A).

In order to overcome the limitations of the recursive descent parser, the LL(1) parser is designed using an explicit stack data structure to hold grammar symbols.

In addition to this,

• Left recursion is eliminated.

• Common prefixes are also eliminated (Left factoring).

Eliminating left recursion

A grammar is left recursive if it has a production of the form A ---> A α, for some string α.

To eliminate left recursion for the production A ---> A α | β:

Rule

A ---> β A'

A' ---> α A' | Ɛ

Example

A ---> A α1 | A α2 | ··· | A αn | β1 | β2 | ··· | βm

Solution:

A ---> β1 A' | β2 A' | ··· | βm A'

A' ---> α1 A' | α2 A' | ··· | αn A' | Ɛ

Left factoring

When a production has more than one alternative with common prefixes, it is necessary to make the right choice of production.

This can be done by rewriting the production until enough of the input has been seen to make the choice.

To perform left factoring for the production A ---> αβ1 | αβ2:

Rule

A ---> α A'

A' ---> β1 | β2

Example

A ---> αβ1 | αβ2 | ··· | αβm | ɣ

Solution

A ---> α A' | ɣ

A' ---> β1 | β2 | ··· | βm

Computation of FIRST

FIRST(α) is the set of terminals that begin strings derived from α.

Rules

To compute FIRST(X), where X is a grammar symbol:

• If X is a terminal, then FIRST(X) = {X}.

• If X ---> Ɛ is a production, then add Ɛ to FIRST(X).

• If X is a non-terminal and X ---> Y1 Y2 ··· Yk is a production, then add FIRST(Y1) to FIRST(X). If Y1 derives Ɛ, then also add FIRST(Y2) to FIRST(X), and so on.

Computation of FOLLOW

FOLLOW(A) is the set of terminals a that can appear immediately to the right of A in some sentential form.

If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A).

Rules

• Place $ in FOLLOW(start symbol), where $ is the input end marker.

• If there is a production A ---> α B β, then everything in FIRST(β) except Ɛ is in FOLLOW(B).

• If there is a production A ---> α B, or a production A ---> α B β where FIRST(β) contains Ɛ, then everything in FOLLOW(A) is in FOLLOW(B).

Construction of the parsing table

Algorithm Construction of the predictive parsing table.

Input   Grammar G

Output Parsing table M

Method For each production A ---> α, do the following:

1. For each terminal a in FIRST(α), add A ---> α to M[A, a].

2. If Ɛ is in FIRST(α), then for each terminal b in FOLLOW(A), add A ---> α to M[A, b].

3. If Ɛ is in FIRST(α) and $ is in FOLLOW(A), add A ---> α to M[A, $].

4. If no production is found for M[A, a], then set M[A, a] to error.

Parsing of input

Predictive parser contains the following components:

• Stack - holds sequence of grammar symbols with $ on the bottom of stack

• Input buffer - contains the input to be parsed with $ as an end marker for the string.

• Parsing table.

Process

• Initially the stack contains $ to indicate bottom of the stack and the start symbol of
grammar on top of $.

• The input string is placed in input buffer with $ at the end to indicate the end of the
string.

• The parsing algorithm looks at the grammar symbol on the top of the stack and the input symbol pointed to by the pointer, and consults the entry M[A, a], where A is on top of the stack and a is the symbol read by the pointer.

• Based on the table entry, if a production is found, then the body of the production is pushed onto the stack in reverse order, with the leftmost symbol on the top of the stack.

• The process repeats until the entire string is processed.

• When the stack contains $ (bottom end marker) and the pointer reads $ (end of input
string), successful parsing occurs.

• If no entry is found, it reports error stating that the input string cannot be parsed by the
grammar.

Algorithm Table-driven predictive parsing.

Input   A string w and parsing table M for a grammar G

Output If w is in L(G) then success; otherwise error

Method

let a be the first symbol of w;
let X be the top of stack symbol;
while (X ≠ $) {
    if (X = a) pop the stack and let a be the next symbol of w;
    else if (X is a terminal) error();
    else if (M[X, a] is an error entry) error();
    else if (M[X, a] = X ---> Y1 Y2 ... Yk) {
        output the production X ---> Y1 Y2 ... Yk;
        pop the stack;
        push Yk, Y(k-1), ..., Y1 onto the stack, with Y1 on top;
    }
    let X be the top of stack symbol;
}


Non-recursive Predictive Parser

Non-recursive predictive parser uses explicit stack data structure.

     This prevents implicit recursive calls.

     It can also be termed as table-driven predictive parser.

Components

• Input buffer - holds input string to be parsed.

• Stack - holds sequence of grammar symbols.

• Predictive parsing algorithm - contains steps to parse the input string; controls the
parser's process.

• Parsing table - contains entries based on which parsing action has to be carried out.

          

Process

• Initially, the stack contains $ at the bottom of the stack.

• The input string to be parsed is placed in the input buffer with $ as the end marker.

• If X is a non-terminal on the top of stack and the input symbol being read is a, the
parser chooses a production by consulting entry in the parsing table M[X, a].
• Replace the non-terminal in stack with the production found in M[X, a] in such a way
that the leftmost symbol of right side of production is on the top of stack i.e., the
production has to be pushed to stack in reverse order.

• Compare the top of stack symbol with input symbol.

• If it matches, pop the symbol from stack and advance the pointer reading the input
buffer.

• If no match is found repeat from step 2.

• Stop parsing when the stack is empty (holds $) and input buffer reads end marker ($).

Error Recovery in Predictive Parsing

• Recovery in a non-recursive predictive parser is easier than in a recursive descent parser.

• Panic mode recovery

   o If a terminal on stack, pop the terminal.

   o If a non-terminal on stack, shift the input until the terminal can expand.

• Phrase level recovery

   o Carefully filling in the blank entries about what to do.

BOTTOM-UP PARSING

• Bottom-up parsers construct parse trees starting from the leaves and work up to the
root.

• Bottom-up syntax analysis is also termed as shift-reduce parsing.

• The common method of shift-reduce parsing is called LR parsing.

• Operator precedence parsing is an easy-to-implement shift-reduce parser.

• Shift-reduce parsing tries to build a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).

• At each step of reduction, a substring matching the right side of a production is replaced by the left-side symbol of that production.

• If the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse.

          

Handles

A handle of a string is a substring that matches the right side of a production and whose
reduction to the non-terminal on the left side of the production represents one step along
the reverse of a rightmost derivation.

Precise definition of a handle

• A handle of a right-sentential form ɣ is a production A ---> β together with a position of ɣ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of ɣ.

• The string w to the right of the handle contains only terminal symbols.

(eg.) Consider the grammar

S ---> aABe

A ---> Abc | b

B ---> d

The sentence abbcde can be reduced to S by the following steps:

abbcde
aAbcde
aAde
aABe
S

These reductions trace out the following rightmost derivation in reverse:

S ---> aABe ---> aAde ---> aAbcde ---> abbcde

Handle Pruning

• If A ---> β is a production, then reducing β to A by this production is called handle pruning, i.e., removing the children of A from the parse tree.

• A rightmost derivation in reverse can be obtained by handle pruning.

• Start with a string of terminals ω that is to be parsed. If ω is a sentence of the grammar at hand, then ω = ɣn, where ɣn is the nth right-sentential form of some as yet unknown rightmost derivation.

Example of right-sentential forms and handles for the grammar

E ---> E + E
E ---> E * E
E ---> (E)
E ---> id

Right sentential form   Handle   Reducing production
id1 + id2 * id3         id1      E ---> id
E + id2 * id3           id2      E ---> id
E + E * id3             id3      E ---> id
E + E * E               E * E    E ---> E * E
E + E                   E + E    E ---> E + E
E

Shift-reduce Parsing

i) Shift-reduce parsing is a bottom-up parsing technique that reduces a string w to the start symbol of the grammar.

ii) It scans and parses the input text in one forward pass without backtracking.

Stack implementation of shift-reduce parsing

• To perform parsing by handle pruning, the following two problems must be solved:

   o Locating the substring to be reduced in a right-sentential form.

   o Determining what production to choose in case there is more than one production with that substring on the right side.

• A further question is the type of data structure to use in a shift-reduce parser; the answer is a stack.

Implementation of shift-reduce parser

Shift-reduce parser can be implemented by using the following components:

• Stack is used to hold grammar symbols.

• An input buffer is used to hold the string w to be parsed.

• $ is used to mark the bottom of the stack and also the right end of the input.

• Initially the stack is empty and the string ω is on the input, as follows:

Stack    Input
$        ω $

• The parser processes by shifting zero or more input symbols onto the stack until a
handle β is on top of the stack.

• The parser then reduces β to the left side of the appropriate production.

• The parser repeats this cycle until it has detected an error or until the stack contains the
start symbol and the input is empty.

Stack    Input
$ S      $

• When the input buffer reaches the end marker symbol $ and the stack contains the start
symbol, the parser halts and announces successful completion of parsing.

Actions in shift-reduce parser

A shift-reduce parser can make four possible actions viz: 1) shift 2) reduce 3) accept 4)
error.

• A shift action, shifts the next symbol onto the top of the stack.
• A reduce action replaces the handle on top of the stack (the right side of a production) by the non-terminal on the left side of the production concerned.

To perform a reduction, the parser must know that the right end of the handle is at the top of the stack. It then locates the left end of the handle within the stack and decides what non-terminal is to replace the handle.

• An accept action, initiates the parser to announce successful completion of parsing.

• An error action, discovers that a syntax error has occurred and calls an error recovery
routine.

Note:

An important fact that justifies the use of a stack in shift-reduce parsing is that the handle
will always appear on top of the stack and never inside.

(eg.) Consider the grammar

                                 E --> E+E

                                 E --> E*E

                                                 E --> (E)

                                 E --> id

and the input string id1+ id2 * id3. Use the shift-reduce parser to check whether the input
string is accepted by the above grammar.

Stack            Input                Action

$                id1 + id2 * id3 $    shift
$ id1            + id2 * id3 $        reduce by E --> id
$ E              + id2 * id3 $        shift
$ E +            id2 * id3 $          shift
$ E + id2        * id3 $              reduce by E --> id
$ E + E          * id3 $              shift
$ E + E *        id3 $                shift
$ E + E * id3    $                    reduce by E --> id
$ E + E * E      $                    reduce by E --> E * E
$ E + E          $                    reduce by E --> E + E
$ E              $                    accept
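
The same trace can be reproduced programmatically. Below is a minimal sketch in Python, assuming the input is already tokenized; the shift/reduce decisions are hard-coded to give * precedence over +, purely for illustration (a real parser would consult an LR parsing table), and the (E) production is omitted since the example does not use it.

def shift_reduce_parse(tokens):
    stack = ['$']                       # $ marks the bottom of the stack
    tokens = tokens + ['$']             # $ marks the right end of the input
    i = 0
    while True:
        lookahead = tokens[i]
        if stack[-1] == 'id':                       # reduce by E --> id
            stack[-1:] = ['E']
        elif stack[-3:] == ['E', '*', 'E']:         # reduce by E --> E * E
            stack[-3:] = ['E']
        elif stack[-3:] == ['E', '+', 'E'] and lookahead != '*':
            stack[-3:] = ['E']                      # reduce by E --> E + E
        elif stack == ['$', 'E'] and lookahead == '$':
            return True                             # accept
        elif lookahead != '$':
            stack.append(lookahead)                 # shift
            i += 1
        else:
            return False                            # error

print(shift_reduce_parse(['id', '+', 'id', '*', 'id']))   # True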

Viable prefixes
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce
parser are called viable prefixes.

Conflicts during shift-reduce parsing

• Shift-reduce parsing cannot be used for every context-free grammar.

• For such a grammar, every shift-reduce parser can reach a configuration in which it
cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot
decide which of several reductions to make (a reduce/reduce conflict), even knowing
the entire stack contents and the next input symbol.

(eg.)

• An ambiguous grammar can never be LR. Consider the dangling-else grammar,

stmt --> if expr then stmt

       | if expr then stmt else stmt

       | other

• In this grammar a shift/reduce conflict occurs for some input strings.

• So this grammar is not an LR(1) grammar.

What is Parse Tree? - Compiler Design


• A parse tree is a hierarchical structure which represents the derivation of an input string from the grammar.

• The root of the parse tree is labeled with the start symbol of the grammar, from which the derivation proceeds.

• Leaves of the parse tree are labeled with terminals.

• Each interior node is labeled with a non-terminal and represents the application of a production.

• If A -> xyz is a production, then the parse tree will have an interior node A whose children are x, y and z, from left to right.

                       

Construct parse tree for E --> E + E | E * E | id

              

Construct parse tree for S --> SS* | SS+ | a


             

Yield of Parse Tree

Leaf nodes of a parse tree, concatenated from left to right, form the input string derived from the grammar. This string is called the yield of the parse tree.

The figure represents the parse tree for the string id + id * id.

The string id + id * id is the yield of the parse tree depicted in the figure.

The figure represents the parse tree for the string aa+a*.

The string aa+a* is the yield of the parse tree depicted in the figure.
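
As a small illustration, the yield can be computed by a left-to-right traversal that concatenates the labels of the leaves. The Node class below is a hypothetical parse-tree representation, not part of any particular library.

# Hypothetical parse-tree node: a label plus an ordered list of children.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def tree_yield(node):
    # A leaf contributes its own label; an interior node contributes
    # the concatenation of its children's yields, left to right.
    if not node.children:
        return node.label
    return ''.join(tree_yield(c) for c in node.children)

# Parse tree for id+id*id using E --> E + E | E * E | id
tree = Node('E', [
    Node('E', [Node('id')]),
    Node('+'),
    Node('E', [Node('E', [Node('id')]),
               Node('*'),
               Node('E', [Node('id')])]),
])
print(tree_yield(tree))   # prints: id+id*id
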
What is LEX? Use of Lex.


• Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions.

• The lex tool itself is a compiler, known as the lex compiler.

Use of Lex

                        

• lex.l is an input file written in a language which describes the generation of a lexical
analyzer. The lex compiler transforms lex.l into a C program known as lex.yy.c.

• lex.yy.c is compiled by the C compiler to a file called a.out.

• The output of the C compiler is a working lexical analyzer which takes a stream of input
characters and produces a stream of tokens.

• yylval is a global variable which is shared by lexical analyzer and parser to return the
name and an attribute value of token.

• The attribute value can be numeric code, pointer to symbol table or nothing.

• Another tool for lexical analyzer generation is Flex.

Structure of Lex Programs

Lex program will be in following form

declarations
%%

translation rules

%%

auxiliary functions

Declarations This section includes declarations of variables, constants and regular definitions.

Translation rules It contains regular expressions and code segments.

Form : Pattern {Action}

Pattern is a regular expression or regular definition.

Action refers to segments of code.

Auxiliary functions This section holds additional functions which are used in actions.
These functions are compiled separately and loaded with the lexical analyzer.

The lexical analyzer produced by lex reads its input one character at a time until a valid match for a pattern is found.

Once a match is found, the associated action takes place to produce a token.

The token is then given to the parser for further processing.

Conflict Resolution in Lex

Conflicts arise when several prefixes of the input match one or more patterns. They are resolved by the following rules:

• Always prefer a longer prefix to a shorter prefix.

• If two or more patterns are matched for the longest prefix, then the first pattern listed in
the lex program is preferred.
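
These two rules can be mimicked outside lex. The sketch below picks the next token the way lex would, using Python's re module; the pattern list is an illustrative assumption.

import re

# Patterns in the order they would be listed in a lex program.
patterns = [
    ('IF',  re.compile(r'if')),
    ('ID',  re.compile(r'[a-z]+')),
    ('NUM', re.compile(r'[0-9]+')),
]

def next_token(text, pos):
    best = None            # (prefix length, -listing index, name, lexeme)
    for index, (name, pat) in enumerate(patterns):
        m = pat.match(text, pos)
        if m:
            cand = (len(m.group()), -index, name, m.group())
            if best is None or cand > best:
                best = cand        # longer prefix wins; ties go to the
    return best and (best[2], best[3])      # pattern listed earlier

print(next_token('ifx = 3', 0))   # ('ID', 'ifx'): longest match beats 'if'
print(next_token('if x', 0))      # ('IF', 'if'): tie broken by listing order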

Lookahead Operator

• The lookahead operator lets lex read additional input in order to distinguish the pattern for a token from its context.

• The lexical analyzer reads ahead of the valid lexeme and then retracts to produce the token.

• At times, certain characters must follow the lexeme for a pattern to apply. In such cases, a slash (/) is used to mark the end of the part of the pattern that matches the lexeme; the remainder is trailing context.

(eg.) In some languages keywords are not reserved. So the statements

IF (I, J) = 5 and IF (condition) THEN

result in a conflict over whether to treat IF as an array name or as a keyword. To resolve this,
the lex rule for the keyword IF can be written as,

IF / \( .* \) {letter}
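
The effect of / can be approximated in other regex dialects with a lookahead assertion, which tests the trailing context without consuming it. A hypothetical Python rendering of the rule above:

import re

# (?=...) is a lookahead: it requires the trailing context but does not
# consume it, much as lex's r1/r2 keeps r2 out of the lexeme.
if_keyword = re.compile(r'IF(?=\(.*\)[A-Za-z])')

m = if_keyword.match('IF(I.GT.0)THEN')
print(m.group() if m else None)          # 'IF': only the keyword is consumed

print(if_keyword.match('IF(I,J) = 5'))   # None: here IF begins an array name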

Design of Lexical Analyzer

• A lexical analyzer can be implemented either with an NFA or with a DFA.

• DFA is preferable in the implementation of lex.

Structure of Generated Analyzer

The architecture of a lexical analyzer generated by lex is given in the figure.

               

Lexical analyzer program includes:

 
• Program to simulate automata

• Components created from the lex program by lex itself, which are listed as follows:

   o A transition table for the automaton.

   o Functions that are passed directly through lex to the output.

   o Actions from the input program (fragments of code) which are invoked by the automaton simulator when needed.

 Steps to construct automaton

Step 1: Convert each regular expression into an NFA, either by Thompson's construction or by the direct method.

Step 2: Combine all NFAs into one by introducing a new start state with Ɛ-transitions to
each of the start states of the NFAs Ni for pattern pi.

Step 2 is needed as the objective is to construct a single automaton to recognize lexemes that match any of the patterns.

                

(eg.)    a     { action A1 for pattern p1 }

           abb   { action A2 for pattern p2 }

           a*b+  { action A3 for pattern p3 }

 
For the string abb, both pattern p2 and pattern p3 match. But pattern p2 will be taken into
account, as it was listed first in the lex program.

The string aabbb··· matches pattern p3, which has the longest matching prefix.

The figure shows NFAs for recognizing the above mentioned three patterns.

The combined NFA for all three given patterns is shown in the next figure.

                

                

Pattern Matching Based on NFAs

The lexical analyzer reads input from the input buffer, starting from the position marked by
the pointer lexemeBegin. The forward pointer moves ahead through the input symbols, and the
simulator calculates the set of states it is in at each point. When the NFA simulation has no
next state for some input symbol, no longer prefix can reach an accepting state. In such cases,
the decision is made on the longest prefix seen so far that matches some pattern, i.e., the
lexeme. If several accepting states are reached, then the pattern pi which appears earliest in
the list of the lex program is chosen.

e.g.
           W= aaba

          

Explanation

The process starts with the Ɛ-closure of the initial state 0. After processing all the input symbols, no
state is found, as there is no transition out of state 8 on input a. Hence, look for an accepting
state by retracting to a previous state. From the figure, state 2, which is an accepting state, is
reached after reading the first input symbol a, and therefore the pattern a has been matched. At
state 8, the string aab has been matched with pattern a*b+. By the lex rule, the longest matching
prefix should be considered. Hence, the action A3 corresponding to pattern p3 will be executed
for the string aab.
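
A sketch of this longest-match simulation over the combined NFA for the three patterns a, abb and a*b+; the transition-table encoding and the state numbering are illustrative assumptions.

# delta maps state -> symbol -> list of successor states; 'eps' marks
# epsilon-moves. accepting maps a state to the index of its pattern.
delta = {
    0: {'eps': [1, 3, 7]},                         # new start state
    1: {'a': [2]},                                 # p1 = a
    3: {'a': [4]}, 4: {'b': [5]}, 5: {'b': [6]},   # p2 = abb
    7: {'a': [7], 'b': [8]}, 8: {'b': [8]},        # p3 = a*b+
}
accepting = {2: 1, 6: 2, 8: 3}

def eps_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        for t in delta.get(stack.pop(), {}).get('eps', []):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def longest_match(text):
    states = eps_closure({0})
    best = None                          # (lexeme length, pattern index)
    for i, ch in enumerate(text + '#'):  # '#' forces a final check
        hits = [accepting[s] for s in states if s in accepting]
        if hits:
            best = (i, min(hits))        # earliest-listed pattern wins ties
        moved = set()
        for s in states:
            moved.update(delta.get(s, {}).get(ch, []))
        if not moved:
            break
        states = eps_closure(moved)
    return best

print(longest_match('aaba'))   # (3, 3): lexeme aab matches pattern p3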

DFAs for Lexical Analyzers

DFAs are also used to represent the output of lex. A DFA is constructed from the NFA by
converting all the patterns into an equivalent DFA using the subset construction algorithm. If
a DFA state contains one or more accepting NFA states, the first pattern (in the order listed)
whose accepting state is represented is attached to that DFA state as its output. Simulation of
the DFA proceeds as for the NFA: it continues until no next state is found, then retraction
takes place to find the last accepting DFA state, and the action associated with the pattern
for that state is executed.

Implementing Lookahead Operator

Lookahead operator r1/r2 is needed because the pattern r1 for a particular token may need
to describe some trailing context r2 in order to correctly identify the actual lexeme.

For the pattern r1/r2, ‘/’ is treated as Ɛ.


If some prefix ab is recognized by the NFA as a match for the regular expression, the lexeme does not necessarily end where the NFA first reaches an accepting state.

The end of the lexeme occurs when the NFA enters a state p such that:

1. p has an Ɛ-transition on /,

2. there is a path from the start state to state p that spells out a,

3. there is a path from state p to an accepting state that spells out b, and

4. a is as long as possible for any ab satisfying conditions 1-3.

      

The figure shows the NFA for recognizing the keyword IF with lookahead. The transition from
state 2 to state 3 represents the lookahead operator (Ɛ-transition).

The accepting state is state 6, which indicates the presence of the keyword IF. The lexeme
IF is found by looking backwards to state 2 whenever the accepting state (state 6) is
reached.

Syntax Directed Definition (SDD) and Types of Syntax Directed Definitions


Syntax directed definition specifies the values of attributes by associating semantic rules
with the grammar productions.
It is a context free grammar with attributes and rules together which are associated with
grammar symbols and productions respectively.

The process of syntax directed translation is two-fold:

• Construction of syntax tree and

• Computing values of attributes at each node by visiting the nodes of syntax tree.

Semantic actions

Semantic actions are fragments of code which are embedded within production bodies by
syntax directed translation.

They are usually enclosed within curly braces ({ }).

It can occur anywhere in a production but usually at the end of production.

(eg.)

                                            E---> E1 + T {print ‘+’}

Types of translation

• L-attributed translation

o It performs translation during parsing itself.

o No need of explicit tree construction.

o L represents 'left to right'.

• S-attributed translation

o It is performed in connection with bottom up parsing.

o 'S' represents synthesized.

Types of attributes
• Inherited attributes

   o It is defined by the semantic rule associated with the production at the parent of the node.

   o Attribute values are confined to the parent of the node, its siblings and the node itself.

   o The non-terminal concerned must be in the body of the production.

• Synthesized attributes

   o It is defined by the semantic rule associated with the production at the node itself.

   o Attribute values are computed from the children of the node and the node itself.

   o The non-terminal concerned must be in the head of the production.

   o Terminals have synthesized attributes which are the lexical values (denoted lexval)
generated by the lexical analyzer.

                      Syntax directed definition of simple desk calculator

Production        Semantic rules

L ---> E n        L.val = E.val
E ---> E1 + T     E.val = E1.val + T.val
E ---> T          E.val = T.val
T ---> T1 * F     T.val = T1.val x F.val
T ---> F          T.val = F.val
F ---> (E)        F.val = E.val
F ---> digit      F.val = digit.lexval

                        

                             Syntax-directed definition-inherited attributes

Production        Semantic rules

D ---> T L        L.inh = T.type
T ---> int        T.type = integer
T ---> float      T.type = float
L ---> L1 , id    L1.inh = L.inh
                  addType(id.entry, L.inh)
L ---> id         addType(id.entry, L.inh)

• Symbol T is associated with a synthesized attribute type.

• Symbol L is associated with an inherited attribute inh.

Types of Syntax Directed Definitions

S-attributed Definitions

A syntax directed definition that involves only synthesized attributes is called S-attributed.
The attribute value for the non-terminal at the head is computed from the attribute values of
the symbols in the body of the production.

The attributes of an S-attributed SDD can be evaluated in bottom-up order of the nodes of the
parse tree, i.e., by performing a postorder traversal of the parse tree and evaluating the
attributes at a node when the traversal leaves that node for the last time.

 
Production        Semantic rules

L ---> E n        L.val = E.val
E ---> E1 + T     E.val = E1.val + T.val
E ---> T          E.val = T.val
T ---> T1 * F     T.val = T1.val x F.val
T ---> F          T.val = F.val
F ---> (E)        F.val = E.val
F ---> digit      F.val = digit.lexval
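
The postorder evaluation just described can be sketched directly in Python. The tuple encoding of parse-tree nodes below is an illustrative assumption; the semantic rules are the ones from the table.

# Postorder evaluation of the desk-calculator SDD: a node's val is
# computed only after the vals of all its children are known.
def eval_node(node):
    op, kids = node[0], node[1:]
    if op == 'digit':                # F ---> digit : F.val = digit.lexval
        return kids[0]
    vals = [eval_node(k) for k in kids]
    if op == '+':                    # E ---> E1 + T : E.val = E1.val + T.val
        return vals[0] + vals[1]
    if op == '*':                    # T ---> T1 * F : T.val = T1.val x F.val
        return vals[0] * vals[1]
    return vals[0]                   # unit productions copy the value

# tuple-encoded parse of 3 * 5 + 4
tree = ('+', ('*', ('digit', 3), ('digit', 5)), ('digit', 4))
print(eval_node(tree))               # 19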

L-attributed Definitions

A syntax directed definition in which the edges of the dependency graph for the attributes
in a production body can go from left to right, but not from right to left, is called an
L-attributed definition. Attributes of L-attributed definitions may be either synthesized or
inherited.

If the attributes are inherited, they must be computed from:

• an inherited attribute associated with the production head;

• an inherited or synthesized attribute associated with a symbol located to the left of the attribute being computed; or

• an inherited or synthesized attribute associated with the symbol under consideration, in such a way that no cycles are formed in the dependency graph.

Production        Semantic rules

T ---> F T'       T'.inh = F.val
T' ---> * F T1'   T1'.inh = T'.inh x F.val

In production 1, the inherited attribute T'.inh is computed from the value of F, which is to its
left. In production 2, the inherited attribute T1'.inh is computed from T'.inh, which is associated
with the head, and from the value of F, which appears to its left in the production; i.e., an
inherited attribute must be computed using information from above or from the left.
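
In a top-down parse these rules become parameters passed down the recursion. Below is a minimal recursive-descent sketch for the two productions above, with F ---> digit added so it runs; the token list is an illustrative assumption.

tokens = [2, '*', 3, '*', 4]         # lexer output for 2 * 3 * 4
pos = 0

def F():                             # F ---> digit : F.val = digit.lexval
    global pos
    val = tokens[pos]
    pos += 1
    return val

def T_prime(inh):
    global pos
    if pos < len(tokens) and tokens[pos] == '*':
        pos += 1                     # T' ---> * F T1' : T1'.inh = T'.inh x F.val
        return T_prime(inh * F())
    return inh                       # T' ---> Ɛ : the product so far is the result

def T():
    return T_prime(F())              # T ---> F T' : T'.inh = F.val

print(T())                           # 24
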
Compiler Construction tools - Compiler Design


Some commonly used compiler-construction tools include:

1. Parser generators.
2. Scanner generators.
3. Syntax-directed translation engines.
4. Automatic code generators.
5. Data-flow analysis engines.
6. Compiler-construction toolkits.

Parser Generators

Input: Grammatical description of a programming language


Output: Syntax analyzers.

A parser generator takes the grammatical description of a programming language and produces a syntax analyzer.

Scanner Generators

Input: Regular expression description of the tokens of a language


Output: Lexical analyzers.
A scanner generator generates lexical analyzers from a regular expression description of the
tokens of a language.

Syntax-directed Translation Engines

Input: Parse tree.


Output: Intermediate code.
Syntax-directed translation engines produce collections of routines that walk a parse tree
and generate intermediate code.

Automatic Code Generators

Input: Intermediate language.


Output: Machine language.
A code generator takes a collection of rules that define the translation of each operation of
the intermediate language into the machine language for a target machine.

Data-flow Analysis Engines

A data-flow analysis engine gathers information about how values are transmitted from
one part of a program to each of the other parts. Data-flow analysis is a key part of code
optimization.

Compiler Construction Toolkits

Compiler-construction toolkits provide an integrated set of routines for constructing the
various phases of a compiler.

Regular Expression - Compiler Design


• Regular expressions are a notation to represent lexeme patterns for a token.

• They are used to represent the language for lexical analyzer.

• They assist in finding the type of token that accounts for a particular lexeme.

Strings and Languages

An alphabet is a finite, non-empty set of input symbols.

               Σ = {0, 1} - the binary alphabet

A string is a finite sequence of symbols drawn from an alphabet.

               w ∈ {Ɛ, 0, 1, 00, 01, 10, 11, 001, 010, ... }

The set above lists possible strings over the given binary alphabet Σ.

A language L is a set of strings over an alphabet, for example the set of strings accepted by a finite automaton.

                L = { 0^n 1 | n >= 0 }

The length of a string is the number of input symbols in the string. It is found by the | | operator.

             Let ω = 0101

             | ω | = 4

Empty string denotes zero occurrence of input symbol. It is represented by Ɛ.


Concatenation of two strings p and q is denoted by pq.

        Let       p = 010

        And      q = 001

                  pq = 010001

                  qp = 001010

                  i.e., pq ≠ qp

The empty string is the identity under concatenation.

     Let x be a string. Then

                    Ɛx = xƐ = x

Prefix A prefix of a string s is obtained by removing zero or more symbols from the
end of s.

          (eg.) s = balloon

Possible prefixes include: Ɛ, b, ball, balloon.

Suffix A suffix of a string s is obtained by removing zero or more symbols from the
beginning of s.

          (eg.) s = balloon

Possible suffixes include: Ɛ, loon, balloon.

Proper prefix: A proper prefix p of a string s is a prefix such that p ≠ s and p ≠ Ɛ.

Proper suffix: A proper suffix x of a string s is a suffix such that x ≠ s and x ≠ Ɛ.

Substring: Substring is part of a string obtained by removing any prefix and any suffix
from s.

Operations on Languages

Important operations on a language are:

• Union

• Concatenation and

• Closure

Union

The union of two languages L and M produces the set of strings which are in L, in M, or in both. It can be denoted as,

L ∪ M = {p | p is in L or p is in M}

Concatenation

Concatenation of two languages L and M produces the set of strings formed by appending a
string of M to a string of L. It can be represented as,

LM = {pq | p is in L and q is in M}

Closure

Kleene closure (L*)


Kleene closure refers to zero or more occurrences of input symbols in a string, i.e., it
includes the empty string Ɛ (the set of strings with 0 or more occurrences of input symbols).

                                       

Positive closure (L+)

Positive closure indicates one or more occurrences of input symbols in a string, i.e., it
excludes the empty string Ɛ (the set of strings with 1 or more occurrences of input symbols).

                                       

L3 - the set of strings, each of length 3.

(eg.) Let L = Σ = {a, b}

L* = {Ɛ, a, b, aa, ab, ba, bb, aab, aba, ... }

L+ = {a, b, aa, ab, ba, bb, aab, ... }

L3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
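
Since L* and L+ are infinite, they can only be enumerated up to a bounded length. A small Python sketch of these operations, using itertools; the function names are illustrative.

from itertools import product

L = {'a', 'b'}

def power(lang, k):
    # L^k: concatenations of exactly k strings drawn from the language
    return {''.join(p) for p in product(lang, repeat=k)}

def kleene_star(lang, max_k):
    # finite slice of L*: union of L^0, L^1, ..., L^max_k (L^0 is {Ɛ})
    result = set()
    for k in range(max_k + 1):
        result |= power(lang, k)
    return result

print(sorted(power(L, 3)))        # the 8 strings of L3
print(sorted(kleene_star(L, 2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
# L+ is the same union taken from k = 1 upward, so it omits ''.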

Precedence of operators

• The unary operator (*) has the highest precedence.

• The concatenation operator (·) has the second highest precedence and is left associative.

           letter_ (letter_ | digit)*

• The union operator ( | or ∪ ) has the least precedence and is left associative.

Based on the precedence, the regular expression is transformed to finite automata when
implementing lexical analyzer.
Regular Expressions

Regular expressions are a combination of input symbols and language operators such as
union, concatenation and closure.

It can be used to describe the identifiers of a language. An identifier is a collection of
letters, digits and underscores which must begin with a letter or an underscore. Hence, the
regular expression for an identifier can be given by,

letter_ (letter_ | digit)*

Note: Vertical bar ( I ) refers to 'or' (Union operator).

The following describes the language for given regular expression:

                                       Languages for regular expressions

                

S.No.   Regular expression   Language

1       r                    L(r)
2       a                    L(a) = {a}
3       r | s                L(r) ∪ L(s)
4       rs                   L(r)L(s)
5       r*                   (L(r))*

Regular set: the language defined by a regular expression.

Two regular expressions are equivalent if they denote the same regular set.

                                        (p | q) = (q | p)

                                    Algebraic laws of regular expressions

Law                                  Description

r | s = s | r                        | is commutative
r | (s | t) = (r | s) | t            | is associative
r(st) = (rs)t                        Concatenation is associative
r(s|t) = rs | rt; (s|t)r = sr | tr   Concatenation is distributive
Ɛr = rƐ = r                          Ɛ is the identity for concatenation
r* = (r | Ɛ)*                        Ɛ is guaranteed in a closure
r** = r*                             * is idempotent

Regular Definition

A regular definition gives names to regular expressions and uses those names for convenience.
A sequence of definitions has the following form

d1 --> r1

d2 --> r2

...

dn --> rn

in which the names d1, d2, ..., dn can be used in place of r1, r2, ..., rn respectively.

letter_ --> A | B | ··· | Z | a | b | ··· | z | _

digit --> 0 | 1 | 2 | ··· | 9

id --> letter_ (letter_ | digit)*
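
This definition corresponds directly to a conventional regular expression. A quick check in Python; the ASCII character classes below are the usual reading of letter_ and digit.

import re

# id ---> letter_ (letter_ | digit)* written as a conventional regex
ident = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

for s in ['count', '_tmp1', 'x9y', '9lives']:
    print(s, '->', 'id' if ident.fullmatch(s) else 'not an id')
# 9lives is rejected: an identifier may not begin with a digit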

What is Context Free Grammars? Compiler Design



Grammars are used to describe the syntax of a programming language. It specifies the
structure of expression and statements.

stmt -> if (expr) then stmt

where stmt denotes statements,

expr denotes expressions.

Types of grammar

• Type 0 grammar

• Type 1 grammar

• Type 2 grammar

• Type 3 grammar

Context Free Grammar

Context free grammar is also called Type 2 grammar.

Definition

A context free grammar G is defined by a four-tuple,

                         G = (V, T, P, S)

where,

G - Grammar
V - Set of variables

T - Set of Terminals

P - Set of productions

S - Start symbol

It produces a Context Free Language (CFL), which is defined as,

                  L(G) = { w in T* | S =>* w }

where,

L-Language

G- Grammar

w - Input string

S - Start symbol

T - Terminal

Hence, a CFL is the collection of strings of terminals that can be derived from the start
symbol of the grammar in one or more steps.

Conventions

Terminals are symbols from which strings are formed.

• Lowercase letters i.e., a, b, c.

• Operators, i.e., +, -, *.

• Punctuation symbols i.e., comma, parenthesis.

• Digits i.e. 0, 1, 2, · · · ,9.


• Boldface letters i.e., id, if.

Non-terminals are syntactic variables that denote a set of strings.

• Uppercase letters, i.e., A, B, C.

• Lowercase italic names, i.e., expr, stmt.

Start symbol is the head of the production stated first in the grammar.

A production is of the form LHS -> RHS (or) head -> body, where the head contains only one
non-terminal and the body contains a collection of terminals and non-terminals.

(eg.) Let G be,

        

Context Free Grammars vs Regular Expressions

Grammars are more powerful than regular expressions.

Every construct that can be described by a regular expression can be described by a grammar, but not vice-versa.

Every regular language is a context free language, but the reverse does not hold.

(eg.)

RE = (a | b)*abb (the set of strings ending with abb).

 Grammar

             

Rules

For each state i of the NFA, create a non-terminal Ai.

If state i has a transition to state j on input a, add the production Ai -> aAj.

If state i goes to state j on input Ɛ, add the production Ai -> Aj.

If i is an accepting state, add Ai -> Ɛ.

If i is a start state, make Ai be the start symbol of the grammar.
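
Applying these rules to the usual NFA for (a | b)*abb, with states 0 to 3 and state 3 accepting (the state numbering is illustrative), gives the grammar:

A0 --> a A0 | b A0 | a A1

A1 --> b A2

A2 --> b A3

A3 --> Ɛ

with A0 as the start symbol.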


What is Parser (Syntax analysis)? Error Handling and Recovery in Syntax Analyzer


Syntax analysis is the second phase of compiler.

Syntax analysis is also known as parsing.

Parsing is the process of determining whether a string of tokens can be generated by a grammar.

It is performed by syntax analyzer which can also be termed as parser.

In addition to construction of the parse tree, syntax analysis also checks and reports
syntax errors accurately.

(eg.)

                   C = a + b * 5

Syntax tree can be given as,

        

 
Parser is a program that obtains tokens from lexical analyzer and constructs the parse tree
which is passed to the next phase of compiler for further processing.

Parser implements context free grammar for performing error checks.

Types of Parser

• Top down parsers Top down parsers construct parse tree from root to leaves.

• Bottom up parsers Bottom up parsers construct parse tree from leaves to root.

Role of Parser

Figure depicts the role of parser with respect to other phases.

• Once a token is generated by the lexical analyzer, it is passed to the parser.

• On receiving a token, the parser verifies that the string of token names can be generated
by the grammar of the source language.

• It calls the function getNextToken(), to notify the lexical analyzer to yield another
token.

• It scans the token one at a time from left to right to construct the parse tree.

• It also checks the syntactic constructs of the grammar.

       

Need for Parser

• A parser is needed to detect syntactic errors efficiently.

• An error is detected as soon as a prefix of the input cannot be completed to form a string in
the language. This way of analyzing prefixes of the input is called the viable-prefix
property.
Error Recovery Strategies

Error recovery strategies are used by the parser to recover from errors once it is detected.
The simplest recovery strategy is to quit parsing with an error message for the first error
itself.

Panic Mode Recovery

Once an error is found, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens is found.

Synchronizing tokens are delimiters, such as a semicolon or }, whose role in the source program is clear.

• When parser finds an error in the statement, it ignores the rest of the statement by not
processing the input.

• This is the easiest way of error-recovery.

• It prevents the parser from developing infinite loops.

Advantages

• Simplicity.

• It never gets into an infinite loop.

Disadvantage

• Additional errors cannot be checked as some of the input symbols will be skipped.
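
A minimal sketch of panic-mode recovery in Python; the token list, the toy statement form id = id ; and the error reporting are all illustrative assumptions.

SYNC = {';'}                  # synchronizing tokens with unambiguous roles

def parse_statement(tokens, i):
    if tokens[i:i + 4] == ['id', '=', 'id', ';']:
        return i + 4          # next position after a well-formed statement
    raise SyntaxError('bad statement at token %d' % i)

def parse_program(tokens):
    i = 0
    while i < len(tokens):
        try:
            i = parse_statement(tokens, i)
        except SyntaxError as err:
            print('error:', err)
            while i < len(tokens) and tokens[i] not in SYNC:
                i += 1        # discard symbols up to the synchronizing token
            i += 1            # resume just past the delimiter

parse_program(['id', '=', 'id', ';', 'id', '+', ';', 'id', '=', 'id', ';'])
# reports one error, then recovers and parses the third statement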

Phrase Level Recovery

The parser performs local correction on the remaining input when an error is detected.

• When a parser finds an error, it tries to take corrective measures so that the rest of the
statement allows the parser to parse ahead.

• A wrong correction may lead to an infinite loop.

The local correction may be:

• Replacing a prefix by some string.

• Replacing a comma by a semicolon.

• Deleting an extraneous semicolon.

• Inserting a missing semicolon.

Advantage

• It can correct any input string.

Disadvantage

• It is difficult to cope with the actual error if it has occurred before the point of detection.

Error Production

Productions which generate erroneous constructs are added to the grammar, based on the common errors that occur.

These productions detect the anticipated errors during parsing.

Error diagnostics about the erroneous constructs are generated by the parser.

Global Correction

There are algorithms which make changes to modify an incorrect string into a correct
string.

These algorithms perform a minimal sequence of changes to obtain a globally least-cost correction.

Given a grammar G and an incorrect string p, these algorithms find a parse tree for a string q
related to p with the smallest number of transformations.

The transformations may be insertions, deletions and change of tokens.

Advantage

• It has been used for phrase level recovery to find optimal replacement strings.

Disadvantage

• This strategy is too costly to implement in terms of time and space.


Grouping of Phases - Compiler Design


The phases of a compiler can be grouped as:

Front end

Front end of a compiler consists of the phases

• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.

Back end

Back end of a compiler contains

• Code optimization.
• Code generation.

Front End

• The front end comprises the phases which depend on the source language and are independent of the target machine (target language).
• It includes lexical and syntactic analysis, symbol table management, semantic analysis
and the generation of intermediate code.
• Some code optimization can also be done by the front end.
• It also includes error handling for the phases concerned.
           

Back End

• The back end comprises those phases of the compiler that depend on the target
machine and are independent of the source language.
• This includes code optimization and code generation.
• In addition, it encompasses error handling and symbol table management operations.

           

Passes

• The phases of a compiler can be implemented in a single pass, where a pass consists of
reading an input file and writing an output file.
• Several phases of the compiler are grouped into one pass in such a way that the operations
of each phase are performed during that pass.
• (eg.) Lexical analysis, syntax analysis, semantic analysis and intermediate code
generation might be grouped into one pass. If so, the token stream after lexical analysis
may be translated directly into intermediate code.

Reducing the Number of Passes

• Minimizing the number of passes improves time efficiency, as reading from and
writing to intermediate files is reduced.
• When grouping phases into one pass, the entire program may have to be kept in memory to
ensure proper information flow to each phase, because one phase may need information in
a different order than the order in which the previous phase produces it.
• The internal representation of the source or target program differs from its external form,
so the memory needed for the internal form may be larger than that for the input and output.

Compiler Design - Language Processing System



                    

Pre-processor

A source program may be divided into modules stored in separate files. The task of
collecting the source program is entrusted to a separate program called the pre-processor. It
may also expand macros into source language statements.

Compiler

Compiler is a program that takes source program as input and produces assembly
language program as output.

Assembler

An assembler is a program that converts an assembly language program into a machine language
program. It produces re-locatable machine code as its output.

Loader and link-editor


• The re-locatable machine code has to be linked together with other re-locatable object
files and library files into the code that actually runs on the machine.
• The linker resolves external memory addresses, where the code in one file may refer to
a location in another file.
• The loader puts together the entire executable object files into memory for execution.

What is Derivations? - Compiler Design


Derivation is used to find whether a string belongs to the language of a given grammar.

Types

• Leftmost derivation.

• Rightmost derivation.

Leftmost Derivation

In leftmost derivation, at each and every step the leftmost non-terminal is expanded by
substituting its corresponding production to derive a string.

Example
            

Rightmost Derivation

In rightmost derivation, at each and every step the rightmost non-terminal is expanded by
substituting its corresponding production to derive a string.

Example

              
