CSC303 - Compiler Design - 060624
FACULTY OF COMPUTING
COMPUTER SCIENCE DEPARTMENT
COURSE TITLE: COMPILER CONSTRUCTION & DESIGN
INTRODUCTION
We generally write a computer program in a high-level
language, that is, a language that humans can readily
understand. A program written this way is called source code.
However, a computer does not understand a high-level language.
It only understands programs written in binary, as 0's and 1's,
called machine code.
To convert source code into machine code, we use either a
compiler or an interpreter.
Both compilers and interpreters are used to convert a program
written in a high-level language into machine code understood by
computers. However, there are differences in how an
interpreter and a compiler work.
Compiler
A compiler is a program that converts a program written
in a high-level language into a low-level language (the object
or target language).
It also reports errors present in the source program.
[Figure: translation diagram. Recoverable labels: Source Program, Input, Interpreter, Output, Error messages.]
Interpreter | Compiler
Translates the program one statement at a time. | Scans the entire program and translates it as a whole into machine code.
Slow in speed. | Fast in speed.
No intermediate object code is generated, hence the memory requirement is less. | Generates intermediate object code, which further requires linking, hence requires more memory.
Errors: continues translating the program until the first error is encountered, then stops. Errors are easy to detect. | Errors: all errors are displayed together at once, hence they are difficult to detect.
Interpreters are small in size. | Compilers are large in size.
Examples: Perl, Python, Ruby, Matlab, etc. | Examples: C, C++, Scala, etc.
Language Processing System
We have learnt that any computer system is made of hardware
and software. The hardware understands a language that
humans cannot understand. So we write programs in a
high-level language, which is easier for us to understand and
remember. These programs are then fed into a series of tools
and OS components to get the desired code that can be used by
the machine. This is known as a Language Processing System.
[Figure: language processing system. Recoverable label (preprocessor stage): removes directives, adds files, and performs macro expansion.]
Working Principle of Compiler
Analysis Phase
The analysis phase is known as the front end of the compiler.
It reads the source program, divides it into core parts, and
then checks for lexical, grammar, and syntax errors. It
generates an intermediate representation of the source program
and a symbol table, which are fed to the synthesis phase as
input.
Synthesis Phase
The synthesis phase is known as the back end of the compiler.
It generates the target program with the help of the
intermediate representation and the symbol table.
A compiler can have many phases and passes.
Pass: A pass refers to the traversal of a compiler through the
entire program.
Phase: A phase of a compiler is a distinguishable stage, which
takes input from the previous stage, processes and yields output
that can be used as input for the next stage. A pass can have
more than one phase.
Phases vs. Passes:
Phases of a compiler are the subtasks that must be
performed to complete the compilation process. Passes
refer to the number of times the compiler has to traverse
the entire program.
Phases of Compiler
The compilation process is a sequence of various phases. Each
phase takes input from its previous stage, has its own
representation of source program, and feeds its output to the
next phase of the compiler. Let us understand the phases of a
compiler.
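The idea that each phase consumes the output of the previous one can be sketched with a toy example. This is only an illustration under strong assumptions: the "language" is limited to sums of integers such as 1 + 2 + 3, the stack-machine instructions PUSH and ADD are hypothetical, and the function names are invented for this sketch. A real compiler has more phases (semantic analysis, optimization, and so on).

```python
import re


def lexical_analysis(source):
    """Phase 1: character stream -> list of tokens (numbers and '+')."""
    return re.findall(r"\d+|\+", source)


def syntax_analysis(tokens):
    """Phase 2: tokens -> left-associative tree for sums like 1 + 2 + 3."""
    tree = ("num", int(tokens[0]))
    for i in range(1, len(tokens), 2):
        if tokens[i] != "+" or i + 1 >= len(tokens):
            raise SyntaxError("expected '+' followed by a number")
        tree = ("+", tree, ("num", int(tokens[i + 1])))
    return tree


def code_generation(tree):
    """Phase 3: tree -> instructions for a hypothetical stack machine."""
    if tree[0] == "num":
        return [f"PUSH {tree[1]}"]
    _, left, right = tree
    return code_generation(left) + code_generation(right) + ["ADD"]


def compile_expression(source):
    """Chain the phases: each one takes the previous phase's output."""
    tokens = lexical_analysis(source)
    tree = syntax_analysis(tokens)
    return code_generation(tree)


print(compile_expression("1 + 2 + 3"))
# ['PUSH 1', 'PUSH 2', 'ADD', 'PUSH 3', 'ADD']
```

Each function has its own representation of the program (a token list, a tree, an instruction list), which mirrors the description of phases above.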
[Figure: phases of a compiler. Recoverable labels: High Level Language, Tokens, Parse tree, Optimized code.]
int main( )
{
// 2 variables declared below
int a, b;
a = 10;
return 0;
}
18 tokens (comments are omitted)
Tokens, Lexemes and Pattern
Questions: Count the number of tokens
printf("Never give up");
5 tokens
int main( )
{
int a = 10, b = 20;
printf("sum is = %d", a + b);
return 0;
}
27 tokens
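The token counts above can be checked with a small tokenizer. The sketch below is a minimal illustration, not a full C lexer: the pattern TOKEN_PATTERN and the function tokenize are names invented here, and the pattern covers only the small subset of C used in these examples. It follows the same counting rules as the slides: comments and whitespace are skipped, and a string literal counts as one token.

```python
import re

# Hypothetical token pattern for a small C subset (an assumption of this sketch).
TOKEN_PATTERN = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)        # comments: skipped, not counted
  | (?P<string>"(?:\\.|[^"\\])*")          # a string literal is one token
  | (?P<number>\d+)                        # integer constant
  | (?P<identifier>[A-Za-z_]\w*)           # keyword or identifier
  | (?P<operator>==|!=|<=|>=|&&|\|\||[-+*/%=<>!])
  | (?P<punctuation>[(){}\[\];,])
  | (?P<space>\s+)                         # whitespace: skipped
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return a list of (kind, lexeme) pairs, ignoring comments and whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup not in ("comment", "space"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


PROGRAM_WITH_COMMENT = '''
int main( )
{
// 2 variables declared below
int a, b;
a = 10;
return 0;
}
'''

PROGRAM_WITH_PRINTF = '''
int main( )
{
int a = 10, b = 20;
printf("sum is = %d", a + b);
return 0;
}
'''

print(len(tokenize('printf("Never give up");')))  # 5
print(len(tokenize(PROGRAM_WITH_COMMENT)))        # 18
print(len(tokenize(PROGRAM_WITH_PRINTF)))         # 27
```

Each (kind, lexeme) pair shows the distinction in the heading: the lexeme is the matched text, the kind is the token class, and the regular expression alternative is the pattern.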
Specifications of Tokens
Let us understand how language theory defines the
following terms:
Alphabets
An alphabet is any finite set of symbols.
{0,1} is the binary alphabet.
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet.
{a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols from an alphabet is called a
string. The length of a string is the total number of
symbols in it, e.g., the string S = "NIGERIA" has
length 7, denoted by |S| = 7.
A string having no symbols, i.e. a string of zero
length, is known as the empty string and is denoted by
ε (epsilon).
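These definitions map directly onto ordinary string handling. The short sketch below assumes alphabets are represented as Python sets of one-character strings. The helper is_string_over is a name made up for this illustration.

```python
BINARY_ALPHABET = {"0", "1"}
HEX_ALPHABET = set("0123456789ABCDEF")

EPSILON = ""  # the empty string, written ε in the notes


def is_string_over(alphabet, s):
    """Return True if every symbol of s belongs to the given alphabet."""
    return all(symbol in alphabet for symbol in s)


S = "NIGERIA"
print(len(S))                                   # 7, i.e. |S| = 7
print(len(EPSILON))                             # 0
print(is_string_over(BINARY_ALPHABET, "0110"))  # True
print(is_string_over(BINARY_ALPHABET, "0120"))  # False: '2' is not a binary symbol
print(is_string_over(HEX_ALPHABET, "1F4A"))     # True
```

Note that the empty string is a string over every alphabet, since it contains no symbol that could fall outside the alphabet.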
Language
A language is a set of strings over some finite alphabet.
The set may be finite or infinite. Because languages are
sets, mathematical set operations can be performed
on them. Regular languages can be described by means of
regular expressions.
Regular Expressions
The lexical analyzer needs to scan and identify only
the valid strings/tokens/lexemes that belong to the
language in hand. It searches for the patterns defined by the
language rules. Regular expressions can express such
languages by defining a pattern for finite strings
of symbols. The grammar defined by regular expressions is
known as a Regular Grammar. The language defined by a
regular grammar is known as a Regular Language.
A regular expression is an important notation for specifying
patterns. Each pattern matches a set of strings, so regular
expressions serve as names for a set of strings. Programming
language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive
definition. Regular languages are easy to understand and have
efficient implementation.
There are a number of algebraic laws that are obeyed by regular
expressions, which can be used to manipulate regular
expressions into equivalent forms.
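A few of these laws can be checked by brute force. The sketch below uses Python's re module and compares two expressions on every string up to a fixed length over a small alphabet. The function names, the alphabet "ab", and the length bound are assumptions of this illustration. Agreement on short strings is evidence that two expressions are equivalent, not a proof.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len fully matched by `pattern`."""
    compiled = re.compile(pattern)
    matched = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if compiled.fullmatch(s):
                matched.add(s)
    return matched


def agree_up_to(p, q, alphabet="ab", max_len=6):
    """True if p and q accept exactly the same strings up to max_len."""
    return language_up_to(p, alphabet, max_len) == language_up_to(q, alphabet, max_len)


# Union is commutative: r|s = s|r
print(agree_up_to("a|b", "b|a"))          # True
# Concatenation distributes over union: r(s|t) = rs|rt
print(agree_up_to("a(b|ab)", "ab|aab"))   # True
# Closure unrolls: r* = ε | r r*   (written here as (aa*)?)
print(agree_up_to("a*", "(aa*)?"))        # True
# Concatenation is NOT commutative: rs differs from sr in general
print(agree_up_to("ab", "ba"))            # False
```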
Operations
The various operations on languages are:
1. Union of two languages L and M is written as:
L U M = {s | s is in L or s is in M}
2. Concatenation of two languages L and M is written as:
LM = {st | s is in L and t is in M}
3. The Kleene Closure of a language L is written as:
L* = {s | s is a concatenation of zero or more strings from L}
For a single symbol x, x* means zero or more occurrences of x,
i.e., it generates { ε, x, xx, xxx, xxxx, … }
Notations
If r and s are regular expressions denoting the languages L(r)
and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) U L(s)
Concatenation : (r)(s) is a regular expression denoting
L(r)L(s)
Kleene closure : (r)* is a regular expression denoting
(L(r))*
Note: (r) is a regular expression denoting L(r)
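The same three notations exist in most regular expression libraries. As a small illustration, the sketch below builds them with Python's re module. The sample expressions r = xy and s = z and the helper denotes are choices made for this example only.

```python
import re

r, s = "xy", "z"  # sample regular expressions (assumed for illustration)

union = f"({r})|({s})"          # denotes L(r) U L(s)
concatenation = f"({r})({s})"   # denotes L(r)L(s)
closure = f"({r})*"             # denotes (L(r))*


def denotes(pattern, string):
    """True if `string` belongs to the language denoted by `pattern`."""
    return re.fullmatch(pattern, string) is not None


print(denotes(union, "xy"), denotes(union, "z"))   # True True
print(denotes(union, "xyz"))                       # False
print(denotes(concatenation, "xyz"))               # True
print(denotes(closure, ""))                        # True: zero occurrences
print(denotes(closure, "xyxy"))                    # True: two occurrences
print(denotes(closure, "xyx"))                     # False
```

re.fullmatch is used rather than re.search so that the whole string must belong to the language, not just a substring of it.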
Example:
Given the regular languages
A = {xy, z} and B = {k, mn}
Perform the following:
(i) A* (ii) B* (iii) A U B (iv) A o B
Solution
(i) A* = {ε, xy, z, xyz, xyxy, zz, xyxyxy, zzz, …}
(ii) B* = {ε, k, mn, kmn, kk, mnmn, kkk, mnmnmn, …}
(iii) A U B = {xy, z, k, mn}
(iv) A o B = {xyk, xymn, zk, zmn}
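The worked example can be reproduced with Python sets. This is a sketch under one stated assumption: since A* and B* are infinite, kleene_closure only builds the finite slice obtained from at most max_factors concatenations. The function names are invented for this illustration.

```python
def union(L, M):
    """L U M = {s | s is in L or s is in M}"""
    return L | M


def concatenation(L, M):
    """LM = {st | s is in L and t is in M}"""
    return {s + t for s in L for t in M}


def kleene_closure(L, max_factors):
    """Finite slice of L*: concatenations of at most max_factors strings of L."""
    result = {""}   # zero factors gives the empty string ε
    layer = {""}
    for _ in range(max_factors):
        layer = concatenation(layer, L)
        result |= layer
    return result


A = {"xy", "z"}
B = {"k", "mn"}

print(sorted(union(A, B)))           # ['k', 'mn', 'xy', 'z']
print(sorted(concatenation(A, B)))   # ['xyk', 'xymn', 'zk', 'zmn']
print(sorted(kleene_closure(A, 2)))  # ['', 'xy', 'xyxy', 'xyz', 'z', 'zxy', 'zz']
```

The slice of A* also contains zxy, which the written solution leaves to the trailing "…". The empty string '' printed first corresponds to ε.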