Unit 1 Compiler Design
Preliminaries Required
• Basic knowledge of programming languages.
• Ability to complete programming assignments.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 1986.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
• Syntax-Directed Translation
• Attribute Definitions
• Run-Time Organization
• Code Optimization
• Code Generation
Compiler - Introduction
• A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
• An important role of the compiler is to report any errors in the source program that it detects during the translation process (error messages).
Compiler vs Interpreter
• An interpreter is another common kind of language
processor. Instead of producing a target program as a
translation, an interpreter appears to directly execute
the operations specified in the source program on
inputs supplied by the user.
Compiler Applications
• Machine Code Generation
– Convert source language program to machine understandable one
– Takes care of semantics of varied constructs of source language
– Considers limitations and specific features of target machine
– Automata theory helps in syntactic checks: distinguishing valid from invalid programs
– Compilation also generates code for syntactically correct programs
Other Applications
• In addition to the development of a compiler, the techniques used in compiler design can be applied to many other problems, such as query interpreters for SQL.
• Many software systems having a complex front-end may need techniques used in compiler design.
• A symbolic equation solver, which takes an equation as input, also uses parsing techniques from compiler design to analyze that equation.
Analysis and Synthesis
• In the analysis phase, an intermediate representation is created from the given source program.
• In the synthesis phase, the equivalent target program is created from this intermediate representation.
Phases of A Compiler
Lexical Analyzer
• Lexical Analyzer reads the source program character by character and returns
the tokens of the source program.
• A token describes a pattern of characters having the same meaning in the source
program (such as identifiers, operators, keywords, numbers, delimiters, and so
on)
Ex: newval := oldval + 12 => tokens:
newval  identifier
:=      assignment operator
oldval  identifier
+       add operator
12      a number
• This phase scans the source code as a stream of characters and converts it
into meaningful lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form
<token-name, attribute-value>.
• It passes this on to the subsequent phase, syntax analysis.
• token-name is an abstract symbol that is used during syntax analysis.
• attribute-value points to an entry in the symbol table for this token; information
from the symbol-table entry is needed for semantic analysis and code generation.
Lexical Analysis
• Lexical analysis breaks up a program into tokens
• Grouping characters into non-separable units (tokens)
• Changing a stream of characters into a stream of tokens
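As a minimal sketch of this character-to-token grouping, the routine below scans an input like newval := oldval + 12 into tokens. The token kinds and the tokenize helper are illustrative names invented for this example, not part of any real compiler.

```c
#include <ctype.h>
#include <string.h>

/* Illustrative token kinds for a tiny language. */
typedef enum { T_ID, T_ASSIGN, T_PLUS, T_NUM } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];
} Token;

/* Scan `src` into `out` (at most `max` tokens); returns the token count. */
int tokenize(const char *src, Token *out, int max) {
    int n = 0;
    const char *p = src;
    while (*p && n < max) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        Token *t = &out[n];
        int len = 0;
        if (isalpha((unsigned char)*p)) {           /* identifier */
            while (isalnum((unsigned char)p[len])) len++;
            t->kind = T_ID;
        } else if (isdigit((unsigned char)*p)) {    /* number */
            while (isdigit((unsigned char)p[len])) len++;
            t->kind = T_NUM;
        } else if (p[0] == ':' && p[1] == '=') {    /* assignment operator */
            len = 2; t->kind = T_ASSIGN;
        } else if (*p == '+') {                     /* add operator */
            len = 1; t->kind = T_PLUS;
        } else {
            p++; continue;                          /* skip unknown characters */
        }
        memcpy(t->lexeme, p, (size_t)len); t->lexeme[len] = '\0';
        p += len; n++;
    }
    return n;
}
```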
Phases of Compiler-Symbol
Table Management
• Symbol table is a data structure holding information about all symbols defined in the
source program
• Not part of the final code, however used as reference by all phases of a compiler
• Typical information stored there include name, type, size, relative offset of variables
• Generally created by lexical analyzer and syntax analyzer
• Good data structures needed to minimize searching time
• The data structure may be flat or hierarchical
Syntax Analysis
• A syntax analyzer creates the syntactic structure (generally a parse tree) of the
given program.
• A syntax analyzer is also called a parser.
• A parse tree describes the syntactic structure of the input.
Phases of Compiler-Syntax
Analysis
• This is the second phase; it is also called parsing
• It takes the token produced by lexical analysis as input and generates a parse tree (or
syntax tree).
• In this phase, token arrangements are checked against the source code grammar, i.e.
the parser checks if the expression made by the tokens is syntactically correct.
• A syntax analyzer checks whether a given program satisfies the rules implied by
a CFG or not.
• If it satisfies, the syntax analyzer creates a parse tree for the given program.
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing techniques.
• Top-Down Parsing,
• Bottom-Up Parsing
• Top-Down Parsing:
• Construction of the parse tree starts at the root, and proceeds towards the leaves.
• Bottom-Up Parsing:
• Construction of the parse tree starts at the leaves, and proceeds towards the root.
• Normally efficient bottom-up parsers are created with the help of some software tools.
• The syntax analyzer works on the smallest meaningful units (tokens) in a source program.
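Top-down parsing can be sketched with a small recursive-descent recognizer. The toy grammar below (expr -> term relop term | term, term -> id | number) follows the token grammar used later in these notes; the function names and the string-based token representation are illustrative.

```c
#include <string.h>

/* Parser state over an array of token strings (illustrative representation). */
static const char **toks;
static int pos, ntoks;

static int is_relop(const char *s) {
    return strcmp(s, "<") == 0 || strcmp(s, "<=") == 0 || strcmp(s, "=") == 0 ||
           strcmp(s, ">") == 0 || strcmp(s, ">=") == 0 || strcmp(s, "<>") == 0;
}

/* term -> id | number : accept any single non-relop token here */
static int term(void) {
    if (pos < ntoks && !is_relop(toks[pos])) { pos++; return 1; }
    return 0;
}

/* expr -> term relop term | term */
static int expr(void) {
    if (!term()) return 0;
    if (pos < ntoks && is_relop(toks[pos])) {   /* optional relop term */
        pos++;
        return term();
    }
    return 1;
}

/* Returns 1 iff the whole token sequence derives from expr. */
int parse_expr(const char **tokens, int n) {
    toks = tokens; pos = 0; ntoks = n;
    return expr() && pos == ntoks;
}
```

Starting from the root nonterminal expr and expanding toward the leaves is exactly the top-down strategy described above.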
Semantic
Analysis
Phases of Compiler-Semantic
Analysis
• Semantic analysis checks whether the parse tree constructed follows the rules
of language.
• The semantic analyzer uses the syntax tree and the information in the symbol
table to check the source program for semantic consistency with the language
definition.
• It also gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking
• Suppose that position, initial, and rate have been declared to be floating-
point numbers and that the lexeme 60 by itself forms an integer.
• The type checker in the semantic analyzer discovers that the operator * is applied
to a floating-point number rate and an integer 60; the integer may then be
converted into a floating-point number (an inttofloat conversion).
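The coercion step can be sketched on a tiny expression tree. Everything below (the Node structure, the OP_INTTOFLOAT node, the check_binary helper) is an illustrative reconstruction of the idea, not the text's own implementation.

```c
/* Minimal sketch of type checking with only int and float types. */
typedef enum { TY_INT, TY_FLOAT } Type;
typedef enum { OP_LEAF, OP_MUL, OP_ADD, OP_INTTOFLOAT } Op;

typedef struct Node {
    Op op;
    Type type;
    struct Node *left, *right;
} Node;

/* Wrap an int-typed node in an inttofloat conversion node taken from `pool`. */
static Node *coerce(Node *n, Node *pool, int *used) {
    Node *c = &pool[(*used)++];
    c->op = OP_INTTOFLOAT; c->type = TY_FLOAT;
    c->left = n; c->right = 0;
    return c;
}

/* Type-check a binary node: if one operand is float and the other int,
 * insert a conversion so both sides agree, and return the result type. */
Type check_binary(Node *n, Node *pool, int *used) {
    if (n->left->type == TY_FLOAT && n->right->type == TY_INT)
        n->right = coerce(n->right, pool, used);
    else if (n->left->type == TY_INT && n->right->type == TY_FLOAT)
        n->left = coerce(n->left, pool, used);
    n->type = (n->left->type == TY_FLOAT || n->right->type == TY_FLOAT)
                  ? TY_FLOAT : TY_INT;
    return n->type;
}
```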
Intermediate Code
Generation
Phases of Compiler-Intermediate
Code Generation
• After semantic analysis, the compiler generates an intermediate representation of the source program for an abstract machine; it should be easy to produce and easy to translate into the target program.
• An intermediate form called three-address code is often used, consisting of instructions with at most three operands.
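For the running example position = initial + rate * 60, the three-address code might look as follows (following the classic example from Aho et al.; id1, id2, id3 denote the symbol-table entries for position, initial, and rate):

```
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
```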
Code
Optimization
Phases of Compiler-Code
Optimization
• The next phase does code optimization of the intermediate code.
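For the running example position = initial + rate * 60, an optimizer might deduce that the conversion of 60 to floating point can be done once at compile time, and that a temporary used only once can be eliminated; the optimized three-address code would then look roughly like this (a hedged illustration, following Aho et al.'s example):

```
t1 = id3 * 60.0
id1 = id2 + t1
```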
Code
Generation
Phases of Compiler-Code
Generation
• In this phase, the code generator takes the optimized intermediate representation of the program as input and maps it to the target machine language, producing machine code
• For example, using registers R1 and R2, the intermediate code
might get translated into the machine code
• The first operand of each instruction specifies a destination. The F
in each instruction tells us that it deals with floating-point
numbers.
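The machine code itself is not reproduced above; a hedged reconstruction, following the running example position = initial + rate * 60 from Aho et al. (register usage is illustrative), would be:

```
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
```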
Phases of Compiler-Translation of
assignment statement
Assembler
• Assembly code is a mnemonic version of machine code, in which names are used instead of
binary codes for operations, and names are given to memory addresses
MOV a,R1
ADD #2,R1
MOV R1,b
• Some compilers produce assembly code, which is passed to an assembler for further
processing
• Other compilers perform the job of the assembler, producing relocatable machine code that is
passed directly to the loader/link editor
Two-Pass
Assembler
• This is the simplest form of assembler: it makes two passes over the input, where the first pass finds the identifiers and assigns them storage addresses, and the second pass translates the operations
Identifier Address
a 0
b 4
Loader/Link
editor
• Loading – it loads the relocatable machine code to the
proper location in memory
• Link editor allows us to make a single program from
several files of relocatable machine code
• Specification of tokens
• Recognition of tokens
• Finite automata
[Figure: the parser repeatedly calls getNextToken; the lexical analyzer reads the source program and returns the next token to the parser, which passes its result to semantic analysis; both components consult the symbol table.]
Reasons to separate lexical analysis from parsing:
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Lexical Analyzer
• Lexical Analyzer reads the source program character by character to
produce tokens.
• Normally a lexical analyzer doesn’t return a list of tokens at one shot; it returns the next token when the parser asks for it
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
• fi (a == f(x)) … (fi may be a misspelled if, or a valid function identifier)
• However, the lexical analyzer can catch errors such as:
• d = 2r (2r cannot form a valid lexeme)
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed (recognizable) token
Token
• Token represents a set of strings described by a pattern.
• Identifier represents a set of strings which start with a letter and continue with letters and
digits
• The actual string (newval) is called the lexeme.
• Since a token can represent more than one lexeme, additional information should be held
for that specific lexeme. This additional information is called the attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information
Token
• Some attributes:
• <id, attr> where attr points to the symbol-table entry for that identifier
• <number, value> where value is the actual value of the number
• Operators, keywords, and punctuation usually need no attribute value
• Example: newval := oldval + 12 yields <id, attr1> <assgop> <id, attr2> <addop> <number, 12>
Terminology of Languages
• Alphabet : a finite set of symbols (ASCII characters)
• String : a finite sequence of symbols from the alphabet
Terminology of Languages
• Operations on Strings:
• Concatenation: xy is the string formed by appending y to the strings
x and y.
• sƐ = s and Ɛs = s (Ɛ is the empty string)
• sn = s s s .. s ( n times)
• s0 = Ɛ
Input buffering
• Sometimes the lexical analyzer needs to look ahead some symbols beyond the lexeme itself to decide which token to return
• In Fortran: DO 5 I = 1,25 (we cannot tell that DO is a keyword until we reach the comma, so we cannot safely return a token after reading DO alone)
• Example input in the buffer: E = M * C * * 2 eof
Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}
Specification of tokens
• In the theory of compilation, regular expressions are used to denote (specify)
languages
• Example:
• letter_(letter_ | digit)*
• Each regular expression is a pattern specifying the form of
strings
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
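The id definition above can be checked with a small hand-written routine. This is only a sketch; the function name is illustrative.

```c
#include <ctype.h>

/* Recognize the regular definition  id -> letter_ (letter_ | digit)*.
 * Returns 1 iff s is a non-empty identifier. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                        /* must start with letter_ */
    for (int i = 1; s[i]; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return 0;                    /* continue with letter_ | digit */
    return 1;
}
```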
Extensions
• One or more instances: (r)+
• Zero or one instance: (r)?
• Example:
• digits -> digit+
Recognition of tokens
• Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr -> term relop term
| term
term -> id
| number
Operations on Languages
• Concatenation:
• L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union:
• L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation:
• L0 = {Ɛ}, L1 = L, L2 = LL
• Kleene Closure:
• L* = ∪i≥0 Li
• Positive Closure:
• L+ = ∪i≥1 Li
Example
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
Regular Expressions
• We use regular expressions to describe tokens of a
programming language.
• A regular expression is built up of simpler regular expressions; each regular
expression r denotes a language L(r), which is also called a regular set.
• (r)+ = (r)(r)*
• (r)? = (r) | Ɛ
• Ex:
• ∑ = {0,1}
• 0|1 => {0,1}
• (0|1)(0|1) => {00,01,10,11}
• 0* => {Ɛ,0,00,000,0000,....}
• (0|1)* => all strings with 0 and 1, including the empty string
Regular Definitions
• Writing a regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions.
• A regular definition is a sequence of definitions of the form:
d1 -> r1
d2 -> r2
...
dn -> rn
where each di is a distinct name and each ri is a regular expression over the basic symbols and the previously defined names {d1, d2, ..., di-1}.
digit -> 0 | 1 | ... | 9
digits -> digit+
opt-fraction -> ( . digits )?
opt-exponent -> ( E (+|-)? digits )?
unsigned-num -> digits opt-fraction opt-exponent
Checking some strings against the definition above:
• 341.00 matches unsigned-num
• 341E does not match (E must be followed by digits)
• 341.10E does not match (the exponent digits are missing)
• 341.10E+-1 does not match (at most one sign is allowed after E)
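The unsigned-num definition maps directly onto a POSIX extended regular expression; the sketch below assumes a POSIX system (regex.h) and an illustrative function name.

```c
#include <regex.h>

/* Check a string against  digits ( . digits )? ( E (+|-)? digits )?  using
 * a POSIX extended regex. Returns 1 on match, 0 on no match, -1 on error. */
int is_unsigned_num(const char *s) {
    regex_t re;
    if (regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$", REG_EXTENDED) != 0)
        return -1;
    int ok = (regexec(&re, s, 0, 0, 0) == 0);
    regfree(&re);
    return ok;
}
```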
Transition diagrams
• Transition diagram for relop
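The diagram itself is not reproduced here, but its logic can be sketched as a small state machine in C. The token names (REL_LT, REL_LE, ...) and function name are illustrative.

```c
/* Sketch of the relop transition diagram for <, <=, <>, =, >, >= . */
typedef enum { REL_NONE, REL_LT, REL_LE, REL_NE, REL_EQ, REL_GT, REL_GE } Relop;

/* Scan a relop at the start of s; *len receives the number of characters used. */
Relop scan_relop(const char *s, int *len) {
    switch (s[0]) {                    /* start state */
    case '<':                          /* state: saw '<' */
        if (s[1] == '=') { *len = 2; return REL_LE; }
        if (s[1] == '>') { *len = 2; return REL_NE; }
        *len = 1; return REL_LT;       /* other character: retract, token is '<' */
    case '=':
        *len = 1; return REL_EQ;
    case '>':                          /* state: saw '>' */
        if (s[1] == '=') { *len = 2; return REL_GE; }
        *len = 1; return REL_GT;
    default:
        *len = 0; return REL_NONE;
    }
}
```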
Design of a Lexical
Analyzer
• LEX is a software tool that automatically constructs a lexical
analyzer from a specification program
• The specification consists of pattern/action rules of the form
P1 {action 1}
P2 {action 2}
--
--
Example
Consider the following patterns and their actions:
a     {action A1 for pattern p1}
abb   {action A2 for pattern p2}
a*b*  {action A3 for pattern p3}
LEX in use
• An input file, which we call lex.l, is
written in the Lex language and
describes the lexical analyzer to be
generated.
• The Lex compiler transforms lex.l
to a C program, in a file that is
always named lex.yy.c.
• The latter file is compiled by the C
compiler into a file called a.out.
• The C-compiler output is a working
lexical analyzer that can take a
stream of input characters and
produce a stream of tokens.
General
format
• The declarations section includes declarations
of variables, manifest constants (identifiers
declared to stand for a constant, e.g., the
name of a token)
• The translation rules each have the form
Pattern { Action }
• Each pattern is a regular expression, which
may use the regular definitions of the
declaration section.
• The actions are fragments of code, typically
written in C, although many variants of Lex
using other languages have been created.
• The third section holds whatever additional
functions are used in the actions.
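As an illustration of this three-section layout, a minimal Lex specification might look as follows; the patterns and actions are invented for the example, not taken from the text:

```
%{
/* declarations section: C code copied into lex.yy.c */
#include <stdio.h>
%}
digit   [0-9]
letter_ [A-Za-z_]
%%
{letter_}({letter_}|{digit})*   { printf("id: %s\n", yytext); }
{digit}+                        { printf("number: %s\n", yytext); }
[ \t\n]                         { /* skip whitespace */ }
%%
/* auxiliary functions section */
int yywrap(void) { return 1; }
```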
Lexical Analyzer Generator - Lex
[Figure: the Lex source program lex.l is given to the Lex compiler, which produces lex.yy.c; lex.yy.c is compiled by the C compiler into a.out; an input stream fed to a.out produces a sequence of tokens.]
[Figure: the input buffer with two pointers, lexemeBegin and forward, feeding an automaton simulator.]
Finite Automata
• An input alphabet ∑
• A set of states S
• A start state n
• A set of accepting states F ⊆ S
• A set of transitions from state to state on input symbols
Finite Automata
• Transition: s1 --a--> s2
• Is read: in state s1, on input a, go to state s2
• If, at the end of input, the automaton is in an accepting state, the string is accepted; otherwise it is rejected
Finite Automata State Graphs
• A state is drawn as a circle
• An accepting state is drawn as a double circle
• A transition is drawn as an arrow between states, labeled with an input symbol such as a
Finite Automata
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language and “no” otherwise.
• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Which one?
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
• Algorithm 1: Regular Expression → NFA → DFA (convert the regular expression into an NFA, then the NFA into a DFA)
• Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
FINITE STATE AUTOMATA
• A finite automaton, or finite state machine, is an abstract machine defined by five elements (a 5-tuple).
• It has a set of states and rules for moving from one state to another but it depends upon the applied
input symbol.
• Two Types: Deterministic Finite state automata (DFA); Non Deterministic Finite
State Automata (NFA)
FINITE STATE AUTOMATA (…)
Example: Consider r1={a,b} r2={0,1}
1.Union:
r1+r2={a,b,0,1} // set of alphabets or digits
2.Concatenation:
r1.r2=r1r2={a0,a1,b0,b1} //set of alphabets and digits
3. Kleene Closure
r1* = {Є, a, b, aa, ab, ba, bb, …} // zero or more symbols from r1, in any order
4. Positive Closure
r1+ = {a, b, aa, ab, ba, bb, …} // one or more symbols from r1, in any order
Non-Deterministic Finite Automaton
(NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
• S - a set of states
• ∑ - a set of input symbols (alphabet)
• move - a transition function mapping state-symbol pairs to sets of states
• a start state s0 and a set of accepting states F ⊆ S
• Ɛ-transitions are allowed in NFAs; in other words, we can move from one state to another without consuming any input symbol.
• A string x is accepted by the NFA if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
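The acceptance test described above can be sketched by simulating the NFA on a set of states at once (here a bit mask). The example machine, which accepts strings over {a,b} ending in abb, and all names below are illustrative.

```c
/* States 0..3; bit i of a mask means "state i is active".
 * This NFA accepts strings over {a,b} ending in "abb". */
enum { NSTATES = 4 };

/* move_tab[state][symbol]: successor state set; symbol 0 is 'a', 1 is 'b'. */
static const unsigned move_tab[NSTATES][2] = {
    { 0x3, 0x1 },   /* state 0: on a -> {0,1}, on b -> {0} */
    { 0x0, 0x4 },   /* state 1: on b -> {2} */
    { 0x0, 0x8 },   /* state 2: on b -> {3} */
    { 0x0, 0x0 },   /* state 3: accepting, no moves */
};
static const unsigned eps[NSTATES] = { 0, 0, 0, 0 };  /* no Ɛ-moves here */

/* Ɛ-closure of a state set: add states reachable by Ɛ-edges until stable. */
static unsigned closure(unsigned set) {
    unsigned changed = 1;
    while (changed) {
        unsigned next = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) next |= eps[s];
        changed = (next != set);
        set = next;
    }
    return set;
}

/* Returns 1 iff the NFA accepts the input (accepting state 3 active at end). */
int nfa_accepts(const char *input) {
    unsigned set = closure(1u << 0);          /* start in state 0 */
    for (; *input; input++) {
        int sym = (*input == 'a') ? 0 : 1;
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) next |= move_tab[s][sym];
        set = closure(next);
    }
    return (set & (1u << 3)) != 0;
}
```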
• Deterministic Finite Automata (DFA): at most one transition per input symbol from each state; no Ɛ-moves
• Nondeterministic Finite Automata (NFA): can have several transitions for one input symbol from a state; can have Ɛ-moves
• Finite automata have finite memory (only the current state)
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
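This automaton can be sketched directly in C. The state numbering below is illustrative: state 0 is the start (still reading 1's), state 1 is accepting (the single 0 has been seen), state 2 is a dead state.

```c
/* DFA for the language 1*0: any number of 1's followed by a single 0. */
int accepts_ones_then_zero(const char *input) {
    int state = 0;
    for (; *input; input++) {
        switch (state) {
        case 0:                       /* reading leading 1's */
            state = (*input == '1') ? 0 : (*input == '0') ? 1 : 2;
            break;
        case 1: state = 2; break;     /* any character after the 0 rejects */
        case 2: break;                /* dead state */
        }
    }
    return state == 1;                /* accept iff we ended right after the 0 */
}
```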
NFA Construction (Thompson's construction)
[Figure: NFA for r1 | r2: a new start state i has Ɛ-edges into N(r1) and N(r2), and both sub-automata have Ɛ-edges into a new final state f. NFA for r*: a new start state i has Ɛ-edges into N(r) and directly to a new final state f, with Ɛ-edges allowing N(r) to repeat. Example NFAs for (a|b)* and (a|b)*aa are built by composing these constructions.]
[Figure: a DFA with states S0, S1, S2 and transitions labeled a and b.]
[Figure: an example NFA and its transition table.]