
UNIVERSITY OF DELTA, AGBOR

FACULTY OF COMPUTING
COMPUTER SCIENCE DEPARTMENT
COURSE TITLE: COMPILER CONSTRUCTION & DESIGN

COURSE CODE: CSC 303


LESSON 1

INTRODUCTION
• We generally write computer programs in a high-level
language. A high-level language is one that humans can
understand. A program in this form is called the source code.
• However, a computer does not understand a high-level language.
It only understands programs written in binary, as 0s and 1s,
called machine code.
• To convert source code into machine code, we use either a
compiler or an interpreter.
• Both compilers and interpreters convert a program written in a
high-level language into machine code understood by computers.
However, there are differences in how an interpreter and a
compiler work.
Compiler
• A compiler is a software program that converts a program written
in a high-level language to a low-level language (the object/target
language).
• It also reports errors present in the source program.
   Source program (input) --> [ Compiler ] --> Target program (output)
                                   |
                                   +--> Error messages

Types of Compilers
a) Single-pass compilers: compilers that process the source code
only once. Example: the Turbo Pascal compiler.
b) Multi-pass compilers: compilers that process the source code
multiple times while converting from a high-level language to a
low-level language. Example: the GCC compiler.
   Single-pass compiler           Multi-pass compiler

   High-level language            High-level language
           |                              |
   [ all passes in one ]          [  first pass  ]
   [   single module   ]                  |
           |                      [  second pass ]
   Low-level language                     |
                                  Low-level language

Compilation process/phase

• Analysis: this phase breaks the source program down into
smaller parts and creates an intermediate code or representation
of the source program.

• Synthesis: this phase takes the intermediate code or representation
of the source program as input and creates the desired target code or
program.
Interpreters

• An interpreter translates code line by line during
execution, making it easier to detect errors but potentially slowing
down the program.

   Source program --+
                    +--> [ Interpreter ] --> Output
   Input -----------+           |
                                +--> Error messages
Interpreter vs Compiler
• Translation: an interpreter translates the program one statement
at a time; a compiler scans the entire program and translates it
as a whole into machine code.
• Speed: interpretation is slow; compiled code is fast.
• Memory: an interpreter generates no intermediate object code, so
its memory requirement is less; a compiler generates intermediate
object code which further requires linking, hence it requires
more memory.
• Errors: an interpreter continues translating the program until
the first error is encountered and then stops, so errors are easy
to detect; a compiler displays all errors at once (together),
which makes them harder to isolate.
• Size: interpreters are small; compilers are large.
• Examples: Perl, Python, Ruby, MATLAB, etc. are interpreted;
C, C++, Scala, etc. are compiled.
Language Processing System
We have learnt that any computer system is made of hardware
and software. The hardware understands a language which
humans cannot easily understand, so we write programs in a
high-level language that is easier for us to understand and
remember. These programs are then fed into a series of tools
and OS components to obtain the desired code that can be used by
the machine. This is known as a language processing system.
   Source code (high-level language)
              |
       [ Preprocessor ]   removes directives, includes files and
              |           performs macro expansion
       [   Compiler   ]
              |
       [  Assembler   ]
              |
       [    Linker    ]
              |
       [    Loader    ]
              |
   Executable program in memory

   The Language Processing System
Preprocessor
A preprocessor, generally considered a part of the compiler, is a
tool that produces input for the compiler. It deals with the
following:
• High-level language (source code) is converted to pure
HLL by removing preprocessor directives (#define,
#include <stdio.h>, etc.) and adding the respective files (file
inclusion).
• It performs macro expansion and operator conversion (e.g.
a++;  =>  a = a + 1).
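As an illustration, here is a small C program before and after
preprocessing (the file name example.c and the macros are made up
for this sketch; with GCC you can view the preprocessed output
yourself using gcc -E example.c):

/* example.c, before preprocessing */
#include <stdio.h>
#define PI 3.14159
#define AREA(r) (PI * (r) * (r))

int main(void) {
    printf("%f\n", AREA(2.0));   /* macro call */
    return 0;
}

After preprocessing, the #include line is replaced by the full text
of stdio.h, the #define lines are removed, and the macro call is
expanded in place:

    printf("%f\n", (3.14159 * (2.0) * (2.0)));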
Compiler
The compiler translates a high-level language into low-level
machine language. The difference from an interpreter lies in the way
it reads the source code or input: a compiler reads the whole source
code at once, creates tokens, checks semantics, generates intermediate
code and translates the whole program into target code in one go.
Assembler
An assembler translates assembly language programs into machine
code. The output of an assembler is called an object file,
which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object
files together in order to make an executable file. These files might
have been compiled by separate assemblers. The major task of a
linker is to search for and locate referenced modules/routines in a
program and to determine the memory locations where these codes will
be loaded, giving the program instructions absolute references.
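As a concrete sketch (the file names main.c and util.c are
hypothetical), separate compilation and linking with GCC looks like
this; the -c flag stops after producing object files, and the final
command invokes the linker:

gcc -c main.c -o main.o       # compile to an object file (not yet executable)
gcc -c util.c -o util.o       # compile a second module separately
gcc main.o util.o -o program  # link both object files into one executable
./program                     # the loader now loads and runs it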
Loader
The loader is a part of the operating system and is responsible for
loading executable files into memory and executing them. It
calculates the size of a program (instructions and data), creates
memory space for it, and initializes various registers to initiate
execution.
Native-compiler
A compiler that runs on platform (A) and is capable of generating
executable code for platform (A) is called a native-compiler.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating
executable code for platform (B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming
language and translates it into the source code of another
programming language is called a source-to-source compiler.
Compiler-Writing Tools
A number of tools have been developed to help construct
compilers. These tools range from scanner and parser generators
to complex systems, variously called compiler-compilers, compiler-
generators or translator-writing systems.
The input specification for these systems may contain:
1. A description of the lexical and syntactic structure of the
source language.
2. A description of what output is to be generated for each
source-language construct.
3. A description of the target machine.
The principal aids provided by compiler-compilers are:
1. For scanner generators, regular expressions are used.
2. For parser generators, context-free grammars are used.
NOTE: A compiler is characterized by three languages:
1. source language
2. object language
3. The language in which it is written.
Compiler Architecture
A compiler can broadly be divided into two phases based on the
way it compiles.

Analysis Phase
The analysis phase, known as the front-end of the compiler, reads
the source program, divides it into core parts, and then checks for
lexical, grammar and syntax errors. It generates an intermediate
representation of the source program and a symbol table, which
are fed to the synthesis phase as input.
Synthesis Phase
The synthesis phase, known as the back-end of the compiler,
generates the target program with the help of the intermediate
source-code representation and the symbol table. A compiler can
have many phases and passes.
Pass: a pass refers to one traversal of a compiler through the
entire program.
Phase: a phase of a compiler is a distinguishable stage, which
takes input from the previous stage, processes it, and yields output
that can be used as input for the next stage. A pass can have
more than one phase.
Phases v/s Passes:
Phases of a compiler are the subtasks that must be
performed to complete the compilation process. Passes
refer to the number of times the compiler has to traverse
the entire program.
Phases of Compiler
The compilation process is a sequence of various phases. Each
phase takes input from its previous stage, has its own
representation of the source program, and feeds its output to the
next phase of the compiler. Let us understand the phases of a
compiler.
   High-level language
          |  (lexical analysis)
   Tokens
          |  (syntax analysis)
   Parse tree
          |  (semantic analysis)
   Parse tree (verified semantically)
          |  (intermediate code generation)
   Three-address code
          |  (code optimization)
   Optimized code
          |  (code generation)
   Assembly code

   Architecture of the compiler

Lexical Analysis
The first phase of the compiler is also known as the scanner. The
scanner works as a text scanner: it scans the source code as a stream
of characters and converts it into meaningful lexemes. The lexical
analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
• It reads the source code/program and converts it into tokens,
typically using a generator tool such as LEX (a sketch follows below).
• Tokens are defined by regular expressions, which are understood by
the lexical analyzer.
• The lexical analyzer removes white spaces, comments, tabs, etc.
from the source code.
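The sketch below shows what a minimal LEX (Flex) specification might
look like for a tiny C-like language; the token names printed here
(KEYWORD, ID, NUMBER, SYMBOL) are illustrative choices, not a fixed
standard. Each rule pairs a regular expression with an action that
runs when the expression matches:

%option noyywrap
%%
[ \t\n]+                 ;  /* skip whitespace */
"//".*                   ;  /* skip line comments */
int|if|else|while|return { printf("KEYWORD %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*   { printf("ID      %s\n", yytext); }
[0-9]+                   { printf("NUMBER  %s\n", yytext); }
[-+*/=;(){},]            { printf("SYMBOL  %s\n", yytext); }
.                        { printf("ERROR   %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }

Running flex on this file generates a C scanner; feeding it the
input a = 10; prints ID a, SYMBOL =, NUMBER 10 and SYMBOL ;.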
Syntax Analysis
• Takes the tokens one by one and uses a context-free grammar (CFG)
to construct the parse tree. If the parse tree cannot be
constructed, the input is syntactically incorrect and an error
message is displayed. (An example grammar and parse tree follow
this list.)
• Using the productions of the CFG, we can represent what the
program actually is.
• The input has to be checked for whether it is in the desired
format or not.
• Syntax errors are detected here if the input does not conform to
the given grammar.
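For illustration, consider the standard textbook expression grammar
below (not tied to any particular language) and the parse tree the
parser builds for the token stream id + id * id:

E -> E + T | T
T -> T * F | F
F -> ( E ) | id

Parse tree for id + id * id:

              E
            / | \
           E  +  T
           |    /|\
           T   T * F
           |   |   |
           F   F   id
           |   |
           id  id

Because * appears lower in the tree than +, this grammar also encodes
the fact that multiplication binds more tightly than addition.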
Semantic Analysis
• Semantic analysis checks whether the parse tree constructed
is meaningful, i.e. whether it follows the rules of the language.
For example, it checks type casting, type conversion issues
and so on (see the example below).
• The semantic analyzer also keeps track of identifiers, their
types and expressions: whether identifiers are declared
before use or not, etc.
• The semantic analyzer produces an annotated syntax tree
as output.
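As a small C illustration (the variable names are made up, and the
snippet deliberately contains faults), these are the kinds of
problems a semantic analyzer reports even though the program parses
correctly:

int main(void) {
    int x = 10;
    y = x + 1;     /* error: identifier y used before declaration */
    int *p = x;    /* error/warning: int assigned to a pointer (type mismatch) */
    double d = x;  /* accepted: implicit int-to-double conversion is recorded */
    return 0;
}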
Intermediate Code Generation
• After semantic analysis, the compiler generates an
intermediate code of the source program for the target
machine.
• It represents a program for some abstract machine; it is in
between the high-level language and the machine language.
• This intermediate code should be generated in such a way
that it is easy to translate into the target machine code. The
intermediate code may be three-address code or assembly-like
code (a worked example follows).
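A short worked example (the temporaries t1 and t2 are conventional
names, not fixed syntax): in three-address code, each instruction has
at most one operator on its right-hand side, so a compound expression
is broken into steps:

Source statement:
    a = b + c * d;

Three-address code:
    t1 = c * d
    t2 = b + t1
    a  = t2

Note that the multiplication is emitted first, respecting the
operator precedence established by the parse tree.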
Code Optimization
• The next phase, code optimization, is an optional phase.
Optimization can be thought of as something that removes
unnecessary code lines and arranges the sequence of statements to
speed up program execution without wasting resources such as CPU
time and memory. The output of this phase is optimized
intermediate code.
• Hence, the code optimization phase attempts to improve the
intermediate code so that it runs faster and consumes fewer
resources (see the example below).
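For illustration, here are two common optimizations applied to
hypothetical three-address code: constant folding (evaluating
constant expressions at compile time) and dead-code elimination
(removing values that are never used):

Before optimization:
    t1 = 4 * 2        ; constant expression
    t2 = b            ; t2 is never used again (dead code)
    t3 = a + t1
    c  = t3

After optimization:
    t3 = a + 8        ; 4 * 2 folded to 8; the dead assignment is gone
    c  = t3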
Code Generation
• In this phase, the code generator takes the optimized
representation of the intermediate code and maps it to the target
machine language.
• The code generator translates the intermediate code into a sequence
of relocatable machine code (assembly code): a sequence of machine
instructions that performs the same task as the intermediate code
would (a sketch follows).
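A sketch of this final mapping, assuming a generic two-address
register machine (the mnemonics MOV/ADD and the register R1 are
illustrative, not a specific real instruction set):

Three-address code:
    t1 = b + c
    a  = t1

Generated (relocatable) code:
    MOV R1, b    ; load b into register R1
    ADD R1, c    ; R1 = R1 + c
    MOV a, R1    ; store the result into a

Note how the temporary t1 disappears: the code generator keeps it in
a register instead of memory.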
Symbol Table
• The symbol table is also known as bookkeeping.
• It is a data structure maintained throughout all the
phases of a compiler. All the identifiers' names, along
with their information such as type, size, etc., are stored here
(a sketch of one record follows).
• The symbol table makes it easier for the compiler to
quickly search for and retrieve an identifier's record.
• The symbol table is also used for scope management (all
phases interact with the symbol table).
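A minimal sketch in C of what a single symbol-table record might hold
(the field names and sizes are illustrative; real compilers typically
keep such records in a hash table, with one table or scope marker per
nesting level):

#include <stdio.h>

/* one record in the symbol table */
struct symbol {
    char name[32];  /* identifier name, e.g. "count" */
    char type[16];  /* declared type, e.g. "int" */
    int  size;      /* storage size in bytes */
    int  scope;     /* nesting level of the declaration */
};

int main(void) {
    struct symbol s = { "count", "int", 4, 0 };
    printf("%s : %s, %d bytes, scope %d\n", s.name, s.type, s.size, s.scope);
    return 0;
}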
Error Handler
• It is a module which takes care of all errors encountered
during compilation.
• It ensures that the compilation process can continue even when
errors are encountered.
• The tasks of the error-handling process are to detect each
error, report it to the user, and devise and implement a recovery
strategy to handle it.
Summary
• A compiler is a program that converts a high-level language to
assembly language.
• A linker tool is used to link all the parts of the program
together for execution. A loader loads all of them into
memory and then the program is executed.
• A compiler that runs on one machine and produces executable
code for another machine is called a cross-compiler.
Summary
• A compiler is divided into two parts, namely analysis and
synthesis. The compilation process is done in various
phases.
• Two or more phases can be combined to form a pass.
• A parser should be able to detect and report any error in the
program.
Assignment
1. Differentiate between a compiler and an interpreter.
2. List five (5) programming languages each that use a
compiler and an interpreter.
3. Write a short note on compiler-writing tools.
4. Differentiate between a linker and a loader.
5. Explain bootstrapping.
6. Differentiate between the analysis phase and the synthesis phase.
7. Describe the phases of the compiler.
MAIN FUNCTION OF LEXICAL ANALYZER
The lexical analysis phase converts source programs into
streams of tokens. This phase is also called the scanning phase.
Functions
• It reads the input program character by character, produces
a stream of tokens, and passes the data to the
syntax analyzer on demand.
• Removing whitespaces/tabs.
• Removing comments from the source program.
• Generating errors, giving the line number of each error.
Parse Tree

Suppose we pass the following statement:

    a = b + c;

We will get tokens like:

    id = id + id

where each id refers to its variable in the symbol table.
Tokens, Lexemes and Pattern
• Tokens: a token is a sequence of characters that can be treated as a
unit or single logical entity. Typical tokens are: keywords (for, if,
while), identifiers (variable names), operators (+, -, *, /),
separators (;) etc.
int a = 5; (has 5 tokens)
int is a keyword, a is an identifier, = is an operator, 5 is a constant
and ; is a separator.

• Lexemes: a lexeme is a sequence of characters in the source
program that is matched by the pattern for a token, i.e. a sequence of
input characters that comprises a single token.
Tokens, Lexemes and Pattern
• Pattern: a pattern is a rule describing all the lexemes that can
represent a particular token in a source language, and it is defined
by means of a regular expression. In other words, a pattern is a set
of predefined rules by which every lexeme is identified as a valid
token; these rules are defined by the grammar.
Tokens, Lexemes and Pattern
• Questions: count the number of tokens.

• Int max (int i);          -- 7 tokens

• int main( )
{
// 2 variables declared below
int a, b;
a = 10;
return 0;
}                           -- 18 tokens (comments are omitted)
Tokens, Lexemes and Pattern
• Questions: count the number of tokens.

• printf("Never give up");          -- 5 tokens

• printf("%d Hello", &x);           -- 8 tokens

• int main( )
{
int a = 10, b = 20;
printf("sum is = %d", a + b);
return 0;
}                                   -- 27 tokens
Specifications of Tokens
Let us understand how language theory considers the
following terms:
• Alphabets
Any finite set of symbols is an alphabet:
{0,1} is the set of binary alphabets;
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of
hexadecimal alphabets;
{a-z, A-Z} is the set of English-language alphabets.
• Strings
Any finite sequence of symbols from an alphabet is called a
string. The length of a string is the total number of symbols
in it; e.g., if the string S is "NIGERIA", then the length of
S is 7, denoted |S| = 7.
A string having no symbols, i.e. a string of zero
length, is known as the empty string and is denoted by
ε (epsilon).
Language
A language is considered as a finite set of strings over some
finite set of alphabets. Computer languages are considered as
finite sets, and mathematically set operations can be performed
on them. Finite languages can be described by means of
regular expressions.
Regular Expressions
The lexical analyzer needs to scan and identify only
the finite set of valid strings/tokens/lexemes that belong to the
language in hand. It searches for the patterns defined by the
language rules. Regular expressions have the capability to
express finite languages by defining a pattern for finite strings
of symbols. The grammar defined by regular expressions is
known as a regular grammar, and the language defined by a
regular grammar is known as a regular language.
A regular expression is an important notation for specifying
patterns. Each pattern matches a set of strings, so regular
expressions serve as names for sets of strings. Programming-
language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive
definition. Regular languages are easy to understand and have
efficient implementations.
There are a number of algebraic laws obeyed by regular
expressions, which can be used to manipulate regular
expressions into equivalent forms.
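For illustration, here is how typical token classes are specified as
regular expressions (the exact notation varies from tool to tool; the
letter/digit shorthand below is the common textbook style):

letter     = [A-Za-z]
digit      = [0-9]
identifier = letter (letter | digit)*    e.g. sum, x1, totalCount
integer    = digit digit*  i.e. [0-9]+   e.g. 0, 42, 1989
whitespace = (blank | tab | newline)+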
Operations
The various operations on languages are:
1. Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
2. Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
3. The Kleene closure of a language L is written as
L* = zero or more occurrences of the language L.
x* means zero or more occurrences of x, i.e. it can
generate { ε, x, xx, xxx, xxxx, … }
Notations
If r and s are regular expressions denoting the languages L(r)
and L(s), then:
• Union: (r)|(s) is a regular expression denoting L(r) U L(s)
• Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure: (r)* is a regular expression denoting (L(r))*
Note: (r) is a regular expression denoting L(r)
Example:
Given the regular languages
A = {xy, z} and B = {k, mn}
perform the following:
(i) A*   (ii) B*   (iii) A U B   (iv) A o B

Solution
(i) A* = {ε, xy, z, xyz, xyxy, zz, xyxyxy, zzz, …}
(ii) B* = {ε, k, mn, kmn, kk, mnmn, kkk, mnmnmn, …}
(iii) A U B = {xy, z, k, mn}
(iv) A o B = {xyk, xymn, zk, zmn}
