
Compiler Design

Unit 1: INTRODUCTION TO COMPILERS: Structure of a Compiler – Lexical Analysis – Role of Lexical Analyzer – Input Buffering – Specification of Tokens – Recognition of Tokens – Lex – Finite Automata – Regular Expressions to Automata – Minimizing DFA.

TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer.
The important roles of a translator are:
1. Translating the HLL program input into an equivalent machine language program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
TYPES OF TRANSLATORS:

a. Compiler
b. Interpreter
c. Preprocessor
Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors to the programmer.

Executing a program written in an HLL basically involves two steps. The source program must first be compiled, i.e., translated into an object program. Then the resulting object program is loaded into memory and executed.

Language Processing System:


A Language Processing System is a comprehensive framework that facilitates the translation, interpretation,
and execution of programs written in high-level programming languages. It encompasses a variety of
components and tools that work together to process and convert human-readable source code into a form that
can be understood and executed by a computer. The main components of a Language Processing System
include the following:

1. Preprocessor
- Function: The preprocessor processes the source code before it is passed to the compiler. It handles tasks
such as macro expansion, file inclusion, and conditional compilation.
A preprocessor produces input to the compiler. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
- Example: In C/C++, the preprocessor expands macros (e.g., `#define`) and includes header files (e.g.,
`#include`).
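As a brief illustration (a minimal hedged sketch; the macro names below are invented for this example), the C preprocessor rewrites the following file before the compiler proper ever sees it:

#include <stdio.h>               /* file inclusion: the text of stdio.h replaces this line */
#define PI 3.14159               /* macro definition: a shorthand for a longer construct */
#define AREA(r) (PI * (r) * (r)) /* parameterized macro */

int main(void)
{
    /* after macro expansion, the argument below becomes (3.14159 * (2.0) * (2.0)) */
    printf("area = %f\n", AREA(2.0));
    return 0;
}

Running only the preprocessing stage (e.g., cc -E on most Unix compilers) shows the expanded text that is handed to the compiler.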
2. Compiler - Function: The compiler is the core component that translates the high-level source code into
machine code or an intermediate representation. It performs lexical analysis, syntax analysis, semantic
analysis, optimization, and code generation.
- Output: The compiler generates object code, which is a machine-level code that is not yet executable on its
own.
3. Assembler
Programmers found it difficult to write or read programs in machine language. They began to use mnemonic symbols for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler is called the source program; the output is a machine language translation (object program).
- Function: The assembler translates the assembly code generated by the compiler into machine code that the processor can execute, converting each assembly language instruction into its machine instruction.
- Output: The assembler produces a relocatable machine code (object) file.
4. Linker
- Function: The linker combines multiple object code files into a single executable program. It resolves
references to external symbols (such as functions or variables defined in other object files) and ensures that all
necessary code and data are properly linked together.
- Output: The linker produces an executable file, which can be loaded and executed by the operating system.
5. Loader
- Function: The loader is responsible for loading the executable file into memory and preparing it for
execution. It sets up the memory address space, initializes registers, and starts the execution of the program.
- Output: Once the loader has done its work, the CPU will be ready to execute the program.

Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate the program with each execution, thus wasting translation time. To overcome this problem of wasted translation time and memory, system programmers developed another component called the loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into object form that the loader could relocate directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation.

6. Interpreter (Optional)- Function: An interpreter directly executes the high-level source code or an
intermediate representation without generating machine code. Interpreters typically execute code line by line,
which can make them slower than compiled programs but more flexible for certain tasks.
- Example: Python and JavaScript are often executed using interpreters.
7. Debugger (Optional)- Function: A debugger is a tool that helps developers test and debug their programs.
It allows for step-by-step execution, setting breakpoints, and inspecting the values of variables to identify and
fix errors in the code.
8. Editor (Optional) - Function: The editor is an environment where developers write and edit their source
code. Integrated Development Environments (IDEs) often include editors with features such as syntax
highlighting, auto-completion, and error detection.
9. Profiler (Optional) - Function: A profiler analyses the runtime behavior of a program, helping developers
identify performance bottlenecks, such as functions that consume excessive CPU or memory resources.
10. Version Control System (Optional) - Function: Although not part of the traditional language processing
system, version control systems like Git are essential for managing changes to the source code over time,
enabling collaboration among multiple developers, and tracking the history of code modifications.

Structure of Compiler:
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes
the source program in one representation and produces output in another representation. The phases of a
compiler are shown below
There are two phases of compilation.
a. Analysis (Machine Independent/Language Dependent)
b. Synthesis (Machine Dependent/Language independent)
The compilation process is partitioned into a series of subprocesses called phases.
Lexical Analysis: - The LA, or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
Syntax Analysis: - The second stage of translation is called Syntax analysis or parsing. In this phase
expressions, statements, declarations etc… are identified by using the results of lexical analysis. Syntax
analysis is aided by using techniques based on formal grammar of the programming language.
Intermediate Code Generations: - An intermediate representation of the final machine language code is
produced. This phase bridges the analysis and synthesis phases of translation.
Code Optimization: - This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation: - The last phase of translation is code generation. Several optimizations to reduce the length
of machine language program are carried out during this phase. The output of the code generator is the machine
language program of the specified computer.
Table Management (or) Book-keeping: - This portion keeps track of the names used by the program and records essential information about each. The data structure used to record this information is called a "Symbol Table".
Fig: - Structure of Compiler.
Error Handling: -
One of the most important functions of a compiler is detecting and reporting errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
LEXICAL ANALYSIS

OVERVIEW OF LEXICAL ANALYSIS


o To identify the tokens, we need some method of describing the possible tokens that can appear in the input stream. For this purpose, we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
o Secondly, having decided what the tokens are, we need some mechanism to recognize them in the input stream. This is done by token recognizers, which are designed using transition diagrams and finite automata.

ROLE OF LEXICAL ANALYZER


The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Upon receiving a get-next-token command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.
LEXICAL ANALYSIS VS PARSING:

Lexical analysis: A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc. The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens; from there, the "parser" proper turns those whole tokens into sentences of your grammar.

Parsing: A parser converts this list of tokens into a tree-like object representing how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence). A parser does not give the nodes any meaning beyond structural cohesion; the next step is to extract meaning from this structure (sometimes called contextual analysis).

INPUT BUFFERING
The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice, each buffering scheme adopts one convention: either a pointer is at the symbol last read or at the symbol it is ready to read.

The distance the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled into the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that the amount of lookahead is limited.
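The sentinel technique can be sketched in C as follows. This is a hedged illustration rather than code from the text: the buffer size, the reload helper and the pointer handling are assumptions chosen to show the idea that each buffer half ends in an EOF sentinel, so the common case of advancing the lookahead pointer costs a single comparison.

#include <stdio.h>

#define N 100                      /* size of each buffer half (illustrative) */

static char buf[2 * N + 2];        /* two halves, each followed by a sentinel slot */
static char *forward;              /* the lookahead pointer */

/* Fill one half from the source file and mark its end with the EOF sentinel. */
static void reload(char *half, FILE *src) {
    size_t n = fread(half, 1, N, src);
    half[n] = (char)EOF;           /* sentinel: end of valid characters */
}

/* Prime the buffer before scanning begins. */
static void init_buffer(FILE *src) {
    reload(buf, src);
    buf[2 * N + 1] = (char)EOF;    /* sentinel closing the second half */
    forward = buf;
}

/* Advance the lookahead pointer and return the next character. In the common
   case only the single comparison against the sentinel is needed. */
static int next_char(FILE *src) {
    char c = *forward++;
    if (c != (char)EOF)
        return (unsigned char)c;
    if (forward == buf + N + 1) {           /* sentinel ending the first half */
        reload(buf + N + 1, src);           /* refill the second half */
        return next_char(src);
    }
    if (forward == buf + 2 * N + 2) {       /* sentinel ending the second half */
        reload(buf, src);                   /* refill the first half */
        forward = buf;
        return next_char(src);
    }
    return EOF;                             /* sentinel inside a half: real end of input */
}

As in the classic scheme, a literal 0xFF byte in the input would need extra handling; the sketch assumes text input.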
TOKEN, LEXEME, PATTERN:
Token: Token is a sequence of characters that can be treated as a single logical entity. Typical tokens are,
1) Identifiers 2) Keywords 3) Operators 4) Special symbols 5) Constants
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

Example:
Description of tokens

Token      Lexeme                 Pattern
const      const                  const
if         if                     if
relation   <, <=, =, <>, >=, >    < or <= or = or <> or >= or >
id         pi                     letter followed by letters and digits
num        3.14                   any numeric constant
literal    "core"                 any characters between " and " except "

A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, that is, when there is no way to recognize a lexeme as a valid token. Syntax errors, on the other hand, are thrown by the parser when a given sequence of already recognized valid tokens does not match any of the right-hand sides of the grammar rules. A simple panic-mode error-handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.
Error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character in to the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings.
Components of a regular expression:
x       the character x
.       any character, usually except a newline
[xyz]   any of the characters x, y, z, …
R?      an R or nothing (i.e., an optional R)
R*      zero or more occurrences of R
R+      one or more occurrences of R
R1R2    an R1 followed by an R2
R1|R2   either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular expression notation we would write
identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet ∑:
o ε is a regular expression denoting { ε }, that is, the language containing only the empty string.
o For each 'a' in ∑, a is a regular expression denoting { a }, the language with only one string, consisting of the single symbol 'a'.
o If R and S are regular expressions denoting the languages L(R) and L(S), then
(R) | (S) denotes L(R) ∪ L(S)
(R).(S) denotes L(R).L(S)
(R)* denotes L(R)*
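For example, building (a|b)*abb from these rules: a and b are regular expressions denoting {a} and {b}; (a)|(b) denotes {a, b}; (a|b)* denotes the set of all strings of a's and b's, including the empty string; and (a|b)*abb denotes the set of all strings of a's and b's that end in abb.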
Regular Definitions
For notational convenience, we may wish to give names to regular expressions and to define regular
expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of strings.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter (letter | digit)*
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id

For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals", because it presents an interesting structure of lexemes. The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions:

digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space. It is the following token that gets returned to the parser.
Lexeme        Token Name    Attribute Value
Any ws        –             –
if            if            –
then          then          –
else          else          –
Any id        id            pointer to table entry
Any number    number        pointer to table entry
<             relop         LT
TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled "start" entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.
As an intermediate step in the construction of an LA, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code.

if = if
then = then
else = else
relop = < | <= | = | > | >=
id = letter (letter | digit)*
num = digit+
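The following C fragment is a hedged sketch (the function name and state numbering are illustrative assumptions) of how the identifier diagram can be turned into code, with one switch case per state:

#include <ctype.h>

/* Returns 1 and leaves *p just past the lexeme if an identifier
   letter (letter | digit)* starts at *p; returns 0 otherwise. */
int match_id(const char **p) {
    const char *s = *p;
    int state = 0;                      /* start state of the diagram */
    for (;;) {
        switch (state) {
        case 0:                         /* expect a letter */
            if (isalpha((unsigned char)*s)) { s++; state = 1; }
            else return 0;
            break;
        case 1:                         /* loop on letters and digits */
            if (isalnum((unsigned char)*s)) s++;
            else state = 2;             /* delimiter seen: lexeme ends here */
            break;
        case 2:                         /* accepting state; the diagram's * means the
                                           delimiter just examined is not part of the
                                           lexeme, so we do not advance past it */
            *p = s;
            return 1;
        }
    }
}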
AUTOMATA
An automaton is defined as a system where information is transmitted and used for performing some functions without direct participation of man.
1. An automaton in which the output depends only on the input is called an automaton without memory.
2. An automaton in which the output depends on the input and the state is called an automaton with memory.
3. An automaton in which the output depends only on the state of the machine is called a Moore machine.
4. An automaton in which the output depends on the state and the input at any instant of time is called a Mealy machine.
DESCRIPTION OF AUTOMATA
1. An automaton has a mechanism to read input from an input tape.
2. Any language is recognized by some automaton; hence these automata are basically language 'acceptors' or 'language recognizers'.
Types of Finite Automata
Deterministic Automata
Non-Deterministic Automata.
DETERMINISTIC AUTOMATA
A deterministic finite automaton has at most one transition from each state on any input. A DFA is a special case of an NFA in which:
1. it has no transitions on input ε, and
2. each input symbol has at most one transition from any state.
A DFA is formally defined by the 5-tuple notation M = (Q, ∑, δ, q0, F), where
Q is a finite, non-empty set of states,
∑ is the input alphabet,
q0 is the initial state, with q0 ∈ Q,
F ⊆ Q is the set of final states, and
δ is the transition (mapping) function δ: Q × ∑ → Q, using which the next state can be determined.
A regular expression is converted into a minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA
A finite automaton is called a DFA if there is only one path for a specific input from the current state to the next state.

NONDETERMINISTIC AUTOMATA
An NFA is a mathematical model that consists of:
A set of states S.
A set of input symbols ∑.
A transition function move that maps state-symbol pairs to sets of states.
A state s0 distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
There may be any number of transitions from a state on a single symbol.
An NFA can be diagrammatically represented by a labeled directed graph, called a transition graph, in which the nodes are the states and the labeled edges represent the transition function.
This graph looks like a transition diagram, but the same character can label two or more transitions out of one state, and edges can be labeled by the special symbol ε as well as by input symbols.
The transition graph for an NFA that recognizes the language (a|b)*abb is shown.
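As a small hedged sketch (not from the text), the DFA obtained from this NFA can be run in C with a transition table, one row per state:

/* DFA for (a|b)*abb: states 0..3, with state 3 accepting.
   move[state][c] where c = 0 means 'a' and c = 1 means 'b'. */
static const int move[4][2] = {
    {1, 0},   /* state 0: a -> 1, b -> 0 */
    {1, 2},   /* state 1: a -> 1, b -> 2 */
    {1, 3},   /* state 2: a -> 1, b -> 3 */
    {1, 0},   /* state 3: a -> 1, b -> 0 */
};

int accepts(const char *w) {
    int state = 0;
    for (; *w; w++) {
        if (*w != 'a' && *w != 'b') return 0;  /* symbol outside the alphabet */
        state = move[state][*w == 'b'];
    }
    return state == 3;                          /* accept iff we end in state 3 */
}

With this table, accepts("aabb") yields 1 while accepts("abab") yields 0; the same table-driven shape works for any DFA produced from a regular expression.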

DEFINITION OF CFG
It involves four quantities:
A CFG contains terminals, nonterminals, a start symbol and productions.
Terminals are the basic symbols from which strings are formed.
Nonterminals are syntactic variables that denote sets of strings.
In a grammar, one nonterminal is distinguished as the start symbol, and the set of strings it denotes is the language defined by the grammar.
The productions of the grammar specify the manner in which the terminals and nonterminals can be combined to form strings.
Each production consists of a nonterminal, followed by an arrow, followed by a string of nonterminals and terminals.

DEFINITION OF SYMBOL TABLE

A symbol table is an extensible array of records.
Each identifier has an associated record that contains the information collected about that identifier.
FUNCTION identify(identifier name)
RETURNING a pointer to identifier information, which contains:

the actual string,
a macro definition,
a keyword definition,
a list of type, variable and function definitions,
a list of structure and union name definitions, and
a list of structure and union field selector definitions.
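A hedged C sketch of such an extensible array of records (the record fields and the growth strategy are illustrative assumptions; a production table would typically use hashing instead of linear search):

#include <stdlib.h>
#include <string.h>

struct entry {                 /* one record of collected information */
    char *name;                /* the actual string */
    int   token;               /* keyword, identifier, ... */
    int   type;                /* type information, if any */
};

static struct entry *table = NULL;
static size_t used = 0, cap = 0;

/* Return a pointer to the record for name, inserting a fresh record if absent. */
struct entry *identify(const char *name) {
    for (size_t i = 0; i < used; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];              /* already present */
    if (used == cap) {                     /* grow the extensible array */
        cap = cap ? 2 * cap : 16;
        table = realloc(table, cap * sizeof *table);
    }
    table[used].name = strdup(name);
    table[used].token = table[used].type = 0;
    return &table[used++];
}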
Creating a lexical analyzer with Lex

Lex specifications:
A Lex program (the .l file ) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g., #define PIE 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form :
p1 {action 1}
p2 {action 2}
p3 {action 3}
… …
… …
where each p is a regular expression and each action is a program fragment describing what action the
lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these
procedures can be compiled separately and loaded with the lexical analyzer.

Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.
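For illustration only, here is a minimal sketch of a complete Lex specification in this three-part format (the token codes and the exact patterns are assumptions for this example, not taken from the book):

%{
#include <stdio.h>
#define ID    1      /* illustrative token codes */
#define NUM   2
#define RELOP 3
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* strip white space: nothing is returned to the parser */ }
{id}      { return ID; }
{number}  { return NUM; }
"<"|"<="|"="|"<>"|">"|">="   { return RELOP; }
%%
int yywrap(void) { return 1; }   /* auxiliary procedure required by Lex */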

UNIT 2

SYNTAX ANALYSIS

ROLE OF THE PARSER


The parser obtains a string of tokens from the lexical analyzer and verifies that it can be generated by the grammar for the source program. The parser should report any syntax errors in an intelligible fashion. The two types of parsers employed are:

1. Top-down parser: builds parse trees from the top (root) to the bottom (leaves).
2. Bottom-up parser: builds parse trees from the leaves and works up to the root.
Therefore, there are two types of parsing methods: top-down parsing and bottom-up parsing.

Context free Grammars (CFG)


A CFG is used to specify the syntax of a language. A grammar naturally describes the hierarchical structure of most programming language constructs.
Formal Definition of Grammars
A context-free grammar has four components:

1. A set of terminal symbols, sometimes referred to as "tokens." The terminals are the elementary
symbols of the language defined by the grammar.
2. A set of nonterminals, sometimes called "syntactic variables." Each non-terminal represents a set of
strings of terminals, in a manner we shall describe.
3. A set of productions, where each production consists of a nonterminal, called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals, called the body or right side of the production. The intuitive intent of a production is to specify one of the written forms of a construct; if the head nonterminal represents a construct, then the body represents a written form of the construct.
4. A designation of one of the nonterminals as the start symbol.
A production is for a nonterminal if the nonterminal is the head of the production. A string of terminals is a sequence of zero or more terminals. The string of zero terminals, written as ε, is called the empty string.

Derivations
A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the
body of a production for that nonterminal. The terminal strings that can be derived from the start symbol form
the language defined by the grammar.
Leftmost and Rightmost Derivation of a String

• Leftmost derivation − A leftmost derivation is obtained by applying a production to the leftmost variable in each step.
• Rightmost derivation − A rightmost derivation is obtained by applying a production to the rightmost variable in each step.
• Example
Let the production rules in a CFG be
X → X+X | X*X | X | a
over the alphabet {a, +, *}.
The leftmost derivation for the string "a+a*a" is
X → X+X → a+X → a + X*X → a+a*X → a+a*a
The rightmost derivation for the above string "a+a*a" is
X → X*X → X*a → X+X*a → X+a*a → a+a*a

Derivation or Yield of a Tree


The derivation or the yield of a parse tree is the final string obtained by concatenating the labels of the leaves of the tree from left to right, ignoring the nulls. However, if all the leaves are null, the derivation is null.
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. If nonterminal A has a production A → XYZ, then a parse tree may have an interior node labeled A with three children labeled X, Y, and Z, from left to right.
Given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a terminal or by ε.
3. Each interior node is labeled by a nonterminal.

If A is the nonterminal labeling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then there must be a production A → X1X2…Xn. Here, X1, X2, …, Xn each stand for a symbol that is either a terminal or a nonterminal. As a special case, if A → ε is a production, then a node labeled A may have a single child labeled ε.

Ambiguity
A grammar can have more than one parse tree generating a given string of terminals. Such a grammar is said
to be ambiguous. To show that a grammar is ambiguous, all we need to do is find a terminal string that is the
yield of more than one parse tree. Since a string with more than one parse tree usually has more than one
meaning, we need to design unambiguous grammars for compiling applications, or to use ambiguous
grammars with additional rules to resolve the ambiguities.
Example: Suppose we used a single nonterminal string and did not distinguish between digits and lists.

The figure shows that an expression like 9-5+2 has more than one parse tree with this grammar. The two trees for 9-5+2 correspond to the two ways of parenthesizing the expression: (9-5)+2 and 9-(5+2). The second parenthesization gives the expression the unexpected value 2 rather than the customary value 6.
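As a worked illustration (assuming the single-nonterminal grammar string → string + string | string - string | 0 | 1 | … | 9 suggested above), the two parse trees correspond to these two leftmost derivations:

string ⇒ string + string ⇒ string - string + string ⇒ 9 - string + string ⇒ 9 - 5 + string ⇒ 9 - 5 + 2     (grouping as (9-5)+2)
string ⇒ string - string ⇒ 9 - string ⇒ 9 - string + string ⇒ 9 - 5 + string ⇒ 9 - 5 + 2     (grouping as 9-(5+2))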

Two parse trees for 9-5+2

Verifying the language generated by a grammar

The set of all strings that can be derived from a grammar is said to be the language generated by that grammar. The language generated by a grammar G is the subset of ∑* formally defined by
L(G) = { w | w ∈ ∑*, S ⇒* w in G }
If L(G1) = L(G2), the grammar G1 is equivalent to the grammar G2.
Example
If there is a grammar
G: N = {S, A, B}, T = {a, b}, P = {S → AB, A → a, B → b}
Here S produces AB, and we can replace A by a and B by b. The only accepted string is ab, i.e., L(G) = {ab}.

Writing a grammar
A grammar consists of a number of productions. Each production has an abstract symbol called a nonterminal
as its left-hand side, and a sequence of one or more nonterminal and terminal symbols as its right-hand
side. For each grammar, the terminal symbols are drawn from a specified alphabet.
Starting from a sentence consisting of a single distinguished nonterminal, called the goal symbol, a given
context-free grammar specifies a language, namely, the set of possible sequences of terminal symbols that can
result from repeatedly replacing any nonterminal in the sequence with a right-hand side of a production for
which the nonterminal is the left-hand side.
There are four categories in writing a grammar :
1. Lexical Vs Syntax Analysis
2. Eliminating ambiguous grammar.
3. Eliminating left-recursion
4. Left-factoring.
Each parsing method can handle grammars only of a certain form; hence, the initial grammar may have to be rewritten to make it parsable.

1. Lexical Vs Syntax Analysis


Reasons for using the regular expression to define the lexical syntax of a language

a) Regular expressions provide a more concise and easier to understand notation for tokens than
grammars.
b) The lexical rules of a language are simple, and to describe them we do not need a notation as powerful as grammars.
c) Efficient lexical analyzers can be constructed automatically from RE than from grammars.
d) Separating the syntactic structure of a language into lexical and nonlexical parts provides a
convenient way of modularizing the front end into two manageable-sized components.

Eliminating ambiguous grammar.


Ambiguity of the grammar that produces more than one parse tree for leftmost or rightmost derivation can be
eliminated by re-writing the grammar.
Consider this example,
G: stmt→if expr then stmt
|if expr then stmt else stmt
|other

This grammar is ambiguous since the string if E1 then if E2 then S1 else S2 has the following two parse
trees for leftmost derivation
Two parse trees for an ambiguous sentence

The general rule is: "Match each else with the closest unmatched then." This disambiguating rule can be incorporated directly into the grammar.

To eliminate the ambiguity, the following grammar may be used:

stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt

Eliminating left-recursion

Because we try to generate a leftmost derivation by scanning the input from left to right, grammars of the form A → Aα may cause endless recursion. Such grammars are called left-recursive, and they must be transformed if we want to use a top-down parser.

◼ A grammar is left recursive if for a nonterminal A there is a derivation A ⇒+ Aα

◼ To eliminate direct left recursion, replace

1) A → Aα | β with A → βA', A' → αA' | ε
2) A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
with
A → β1B | β2B | ... | βnB

B → α1B | α2B | ... | αmB | ε

Left-factoring

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing.
When it is not clear which of two alternative productions to use to expand a non-terminal A, we can rewrite the
A-productions to defer the decision until we have seen enough of the input to make the right choice.
◼ Consider S → if E then S else S | if E then S
◼ Which of the two productions should we use to expand non-terminal S when the next
token is if?
We can solve this problem by factoring out the common part in these rules.

A → 1 | 2 |...| n | 


becomes
A → B| 
B → 1 | 2 |...| n

Consider the grammar G: S → iEtS | iEtSeS | a
E → b

Left factored, this grammar becomes:
S → iEtSS' | a
S' → eS | ε
E → b
