Compiler Design Note1

Introduction to Compiler Design

What is a compiler?
 A program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language).
 Why do we design compilers?
 Why do we study compiler construction techniques?
Compilers provide an essential interface between applications and architectures.
Compilers embody a wide range of theoretical techniques.

Since different platforms (hardware architectures together with operating systems such as Windows, macOS, or Unix) require different machine code, most programs must be compiled separately for each platform.

Programs related to compilers


 Interpreter:

 Is a program that reads a source program and executes it


 Works by analyzing and executing the source program commands one at a time

 Does not translate the whole source program into object code
 Interpretation is important when:
Programmer is working in interactive mode and needs to view and update
variables
Running speed is not important
Commands have simple formats, and thus can be quickly analyzed and
executed
Modification or addition to user programs is required as execution proceeds

Interpreter and compiler differences

 Interpreter:

o The interpreter takes one statement, translates it, executes it, and then takes the next statement.

o The interpreter stops translation as soon as it encounters the first error.

o An interpreter takes less time to analyze the source code.

o Overall execution speed is slower.

 Compiler:
 A compiler translates the entire program in one go; the translated program is then executed.
 Generates the error report after the translation of the entire program.
 Takes more time to analyze and process the high-level language code.
 Overall execution time of the compiled program is faster.

 Interpreter:

Well-known examples of interpreters:


o Basic interpreter, Lisp interpreter, UNIX shell command interpreter, SQL interpreter, Java interpreter…
In principle, any programming language can be either interpreted or compiled:
o Some languages are designed to be interpreted, others are designed to be
compiled
Interpreters involve large overheads:
o Execution speed degradation can vary from 10:1 to 100:1
o Substantial space overhead may be involved

E.g., Compiling Java Programs

 The Java compiler produces bytecode not machine code

 Bytecode is converted into machine code using a Java Interpreter

 You can run bytecode on any computer that has a Java Interpreter installed

Android and Java

 Assemblers:

Translator for the assembly language.

Assembly code is translated into machine code.

Output is relocatable machine code.

 Linker

Links object files separately compiled or assembled

Links object files to standard library functions

Generates a file that can be loaded and executed

 Loader

Loads the executable code, which is the output of the linker, into main memory.

 Pre-processors

A pre-processor is a separate program that is called by the compiler before actual


translation begins.

Such a pre-processor:

o Produces input to a compiler

o Deletes comments

o Performs macro processing (substitutions)

o Includes other files...

The translation process
A compiler consists internally of a number of steps, or phases, that perform distinct logical operations.

The phases of a compiler are shown below, together with three auxiliary components that interact with some or all of the phases:

 The symbol table,

 the literal table,

 and the error handler.

There are two important parts in compilation process:

Analysis and Synthesis

Analysis and Synthesis

Analysis (front end)


Breaks up the source program into constituent pieces and
Creates an intermediate representation of the source program.
During analysis, the operations implied by the source program are determined and
recorded in hierarchical structure called a tree.

Synthesis (back end)


The synthesis part constructs the desired program from the intermediate representation.

Analysis of the source program

 Analysis consists of three phases:

1) Linear/Lexical analysis
2) Hierarchical/Syntax analysis
3) Semantic analysis

Lexical analysis or Scanning


The stream of characters making up the source program is read from left to right and is
grouped into tokens.
A token is a sequence of characters having a collective meaning.
A lexical analyzer, also called a lexer or a scanner, receives a stream of characters from
the source program and groups them into tokens.
Examples:
 Identifiers
 Keywords
 Symbols (+, -, …)
 Numbers …

Blanks, new lines, tabulation marks will be removed during lexical analysis.
Example:

a[index] = 4 + 2;

a       identifier
[       left bracket
index   identifier
]       right bracket
=       assignment operator
4       number
+       plus operator
2       number
;       semicolon

(all of these are tokens)

A scanner may perform other operations along with the recognition of tokens.

 It may enter identifiers into the symbol table, and

 It may enter literals into the literal table.

Lexical Analysis Tools

There are tools available to assist in the writing of lexical analyzers.

o lex - produces C source code (UNIX/Linux).

o flex - produces C source code (GNU).

o JLex - produces Java source code.

We will use Lex.

Syntax analysis or Parsing


The parser receives the source code in the form of tokens from the scanner and performs
syntax analysis.

The results of syntax analysis are usually represented by a parse tree or a syntax tree.

Syntax tree: each interior node represents an operation and the children of the node represent the arguments of the operation.

The syntactic structure of a programming language is determined by a context-free grammar (CFG).

Ex. Consider again the line of C code: a[index] = 4 + 2

Sometimes syntax trees are called abstract syntax trees, since they represent a further
abstraction from parse trees. Example is shown in the following figure.

Syntax Analysis Tools

There are tools available to assist in the writing of parsers.

o yacc - produces C source code (UNIX/Linux).

o bison - produces C source code (gnu).

o CUP - produces Java source code.

We will use yacc.

Semantic analysis

The semantics of a program are its meaning, as opposed to its syntax or structure.

The semantics consist of:

o Runtime semantics – behavior of program at runtime

o Static semantics – checked by the compiler

Static semantics include:

o Declarations of variables and constants before use

o Calling functions that exist (predefined in a library or defined by the user)

o Passing parameters properly

o Type checking.

The semantic analyzer does the following:

o Checks the static semantics of the language

o Annotates the syntax tree with type information

Ex. Consider again the line of C code: a[index] = 4 + 2

Synthesis of the target program

 The target code generator

 Intermediate code generator

Intermediate code generator

Comes after syntax and semantic analysis

Separates the compiler front end from its backend

Intermediate representation should have 2 important properties:

o Should be easy to produce

o Should be easy to translate into the target program

Intermediate representation can have a variety of forms:

o Three-address code, P-code for an abstract machine, Tree or DAG


representation

Code generator

The machine code generator receives the (optimized) intermediate code, and then it
produces either:

o Machine code for a specific machine, or

o Assembly code for a specific machine and assembler.

Code generator

o Selects appropriate machine instructions

o Allocates memory locations for variables

o Allocates registers for intermediate computations

The code generator takes the IR code and generates code for the target machine.

Here we will write target code in assembly language: a[index]=6

MOV R0, index ;; value of index -> R0

MUL R0, 2 ;; double value in R0

MOV R1, &a ;; address of a ->R1

ADD R1, R0 ;; add R0 to R1

MOV *R1, 6 ;; constant 6 -> address in R1

&a – the address of a (the base address of the array); the index is doubled because array elements are assumed here to be two bytes wide

*R1 – indirect register addressing (the last instruction stores the value 6 at the address contained in R1)

Grouping of phases
The discussion of phases deals with the logical organization of a compiler.

In practice most compilers are divided into:

o Front end - language-specific and machine-independent.

o Back end - machine-specific and language-independent.

Compiler passes:

A pass consists of reading an input file and writing an output file.

Several phases may be grouped in one pass.

For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis,
and intermediate code generation might be grouped together into one pass.

Single pass:

o Is a compiler that passes through the source code of each compilation unit only
once

o A one-pass compiler does not "look back" at code it previously processed.

o A one-pass compiler is faster than a multi-pass compiler.

o However, it is less able to generate efficient programs, due to the limited scope of the information available.

Multi pass:

o Is a type of compiler that processes the source code or abstract syntax tree of a
program several times

o A collection of phases is done multiple times

Major Data and Structures in a Compiler


Token

o Represented by an integer value or an enumeration literal

o Sometimes, it is necessary to preserve the string of characters that was scanned

o For example, the name of an identifier or the value of a literal

Syntax Tree

o Constructed as a pointer-based structure

o Dynamically allocated as parsing proceeds

o Nodes have fields containing information collected by the parser and semantic
analyzer

Symbol Table

o Keeps information associated with all kinds of tokens:

 Identifiers, numbers, variables, functions, parameters, types, fields, etc.

o Tokens are entered by the scanner and parser

o Semantic analyzer adds type information and other attributes

o Code generation and optimization phases use the information in the symbol table

Performance Issues

o Insertion, deletion, and search operations need to be efficient because they are
frequent

o More than one symbol table may be used
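To keep these frequent operations fast, symbol tables are commonly implemented as hash tables. Below is a minimal sketch in C of a chained hash table; the bucket count, field choices, and function names are illustrative assumptions, not taken from the original notes:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

typedef struct Sym {
    char *name;            /* identifier spelling */
    int   type;            /* filled in later by the semantic analyzer */
    struct Sym *next;      /* collision chain */
} Sym;

static Sym *table[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Look a name up; insert it if absent (what the scanner does for identifiers). */
Sym *enter(const char *name) {
    unsigned h = hash(name);
    for (Sym *p = table[h]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;   /* already present */
    Sym *p = malloc(sizeof *p);
    p->name = strdup(name);
    p->type = 0;
    p->next = table[h];
    table[h] = p;
    return p;
}

With this design, both lookup and insertion are one hash plus a short chain walk, which is why hashing is the usual choice here.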

Literal Table

o Stores constant values and string literals in a program.

o One literal table applies globally to the entire program.

o Used by the code generator to:

 Assign addresses for literals.

o Avoids the replication of constants and strings.

o Quick insertion and lookup are essential

Compiler construction tools


Various tools are used in the construction of the various parts of a compiler.

Scanner generators

o Ex. Lex, flex, JLex

o These tools generate a scanner /lexical analyzer/ if given a regular expression.

Parser Generators

o Ex. Yacc, Bison, CUP

o These tools produce a parser /syntax analyzer/ if given a Context Free Grammar
(CFG) that describes the syntax of the source language.

Syntax directed translation engines

o Ex. Cornell Synthesizer Generator

o It produces a collection of routines that walk the parse tree and execute some
tasks.
Automatic code generators

o Take a collection of rules that define the translation of the IC to target code and
produce a code generator.

 This completes our brief description of the phases of a compiler (chapter 1).
 If anything is unclear, or if you have comments, questions, or doubts, please do not hesitate to let me know.

Review Exercise
1) What is a compiler?

2) Why is it necessary to design compilers?

3) What are the main phases of compiler construction? Explain each.

4) Consider the line of C++ code: float [index] = a-c. Write its:

A. Lexical analysis
B. Semantic analysis
C. Code generation
D. Syntax analysis
E. Intermediate code generation

5) Consider the line of C code: int [index] = (4 + 2) / 2. Write its:

A. Lexical analysis
B. Semantic analysis
C. Syntax analysis
D. Intermediate code generation
E. Code generation

6) Explain diagrammatically how high-level languages are translated to machine-understandable language (machine code).

a) What is the role of compilers in this process?

b) What other programs are used in this process, and what makes them different from compilers?

Chapter 2

Lexical analysis

Introduction
The role of lexical analyzer is:

o to read a sequence of characters from the source program

o group them into lexemes and

o Produce as output a sequence of tokens for each lexeme in the source program.

The scanner can also perform the following secondary tasks:

o stripping out blanks, tabs, new lines

o stripping out comments

o keep track of line numbers (for error reporting)

Interaction of the Lexical Analyzer with the Parser

token: the smallest meaningful sequence of characters of interest in the source program

Token, pattern, lexeme


A token is a sequence of characters from the source program having a collective
meaning.

A token is a classification of lexical units.

 For example: id and num

Lexemes are the specific character strings that make up a token.

 For example: abc and 123A

Patterns are rules describing the set of lexemes belonging to a token.

 For example: “letter followed by letters and digits”

Patterns are usually specified using regular expressions. [a-zA-Z]*

Example: printf("Total = %d\n", score);

Example: The following table shows some tokens and their lexemes in Pascal (a high-level, case-insensitive programming language).

In general, in programming languages, the following are tokens:

o Keywords, operators, identifiers, constants, literals, punctuation symbols…

Specification of patterns using regular expressions


o Regular expressions

o Regular expressions for tokens

Regular expression: Definitions

Represents patterns of strings of characters.

An alphabet Σ is a finite set of symbols (characters)

A string s is a finite sequence of symbols from Σ

o |s| denotes the length of string s

o ε denotes the empty string, thus |ε| = 0

A language L is a specific set of strings over some fixed alphabet Σ

A regular expression is one of the following:

Symbol: a basic regular expression consisting of a single character a, where a is from:

o an alphabet Σ of legal characters;

o the metacharacter ε; or

o the metacharacter ø.

In the first case, L(a)={a}; in the second case, L(ε)= {ε}; and in the third case, L(ø)= { }.

{ } contains no strings at all, while {ε} contains the single string consisting of no characters.

Alternation: an expression of the form r|s, where r and s are regular expressions.
o In this case, L(r|s) = L(r) ∪ L(s)

Concatenation: an expression of the form rs, where r and s are regular expressions.

o In this case, L(rs) = L(r)L(s)

Repetition: an expression of the form r*, where r is a regular expression.

o In this case, L(r*) = L(r)* = {ε} ∪ L(r) ∪ L(r)L(r) ∪ …

Regular expression: Language Operations


Union of L and M

o L ∪ M = {s |s ∈ L or s ∈ M}

Concatenation of L and M

o LM = {xy | x ∈ L and y ∈ M}

Exponentiation of L

o L^0 = {ε}; L^i = L^(i-1) L

Kleene closure of L

o L* = L^0 ∪ L^1 ∪ L^2 ∪ …

Positive closure of L

o L+ = L^1 ∪ L^2 ∪ …
Note: The following short hands are often used:

r+ = rr*
r* = r+ | ε
r? = r | ε

REs: Examples

a) L(01) = ?

b) L(01|0) = ?

c) L(0(1|0)) = ?

o Note order of precedence of operators.

L(0*) = ?

L((0|10)*(ε|1)) = ?

REs: Examples Solution

L(01) = {01}.

L(01|0) = {01, 0}.

L(0(1|0)) = {01, 00}.

o Note order of precedence of operators.

L(0*) = {ε, 0, 00, 000,… }.

L((0|10)*(ε|1)) = all strings of 0's and 1's without two consecutive 1's.

REs: Examples (more)

1- a | b = ?

2- (a|b)a = ?

3- (ab) | ε = ?

4- ((a|b)a)* = ?

 Reverse

1 – Even binary numbers =?

2 – An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of
all strings over this alphabet that contains exactly one b.

REs: Examples (more) Solutions

1- a | b = {a,b}

2- (a|b)a = {aa,ba}

3- (ab) | ε ={ab, ε}

4- ((a|b)a)* = {ε, aa,ba,aaaa,baba,....}

 Reverse

1 – Even binary numbers (0|1)*0

2 – An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}. Consider the set of
all strings over this alphabet that contains exactly one b.

(a|c)*b(a|c)*, e.g. {b, abc, abaca, baaaac, ccbaca, cccccb}

Exercises

Describe the languages denoted by the following regular expressions:

1- a(a|b)*a

2- ((ε|a)b*)*

3- (a|b)*a(a|b)(a|b)

4- a*ba*ba*ba*

Regular Expressions (Summary)

Definition: A regular expression over ∑ is a string formed by the following rules:

1- ε, Ø, and a ∈ ∑ are regular expressions

2- If α and β are regular expressions, so is αβ

3- If α and β are regular expressions, so is α+β

4- If α is a regular expression, so is α*

5- Nothing else is a regular expression if it does not follow from (1) to (4)

Let α be a regular expression; the language represented by α is denoted by L(α).

Regular expressions for tokens


Regular expressions are used to specify the patterns of tokens.

Each pattern matches a set of strings. It falls into different categories:

Reserved (Key) words: They are represented by their fixed sequence of characters,

o Ex. if, while and do....

If we want to collect all the reserved words into one definition, we could write it as follows:

Reserved = if | while | do |...

Special symbols: including arithmetic operators, assignment and equality such as =, :=, +, -, *

Identifiers: which are defined to be a sequence of letters and digits beginning with letter,

o we can express this in terms of regular definitions as follows:

letter = A|B|…|Z|a|b|…|z

digit = 0|1|…|9

or

letter= [a-zA-Z]

digit = [0-9]

identifiers = letter(letter|digit)*

Numbers: Numbers can be:

o sequence of digits (natural numbers), or

o decimal numbers, or

o numbers with exponent (indicated by an e or E).

Example: 2.71E-2 represents the number 0.0271.

We can write regular definitions for these numbers as follows:

nat = [0-9]+

signedNat = (+|-)? nat

number = signedNat("." nat)? (E signedNat)?

Literals or constants: This can include:

o numeric constants such as 42, and

o String literals such as "hello, world".

relop → < | <= | = | <> | > | >=

Comments: Ex. /* this is a C comment*/

delimiter → newline | blank | tab | comment

whitespace = (delimiter)+

Example: Divide the following Java program into appropriate tokens.

public class Dog {

    private String name;
    private String color;

    public Dog(String n, String c) {
        name = n;
        color = c;
    }

    public String getName() { return name; }

    public String getColor() { return color; }

    public void speak() {
        System.out.println("Woof");
    }
}

Automata
Abstract machines

Characteristics

Input: input values (from an input alphabet ∑) are applied to the machine

Output: outputs of the machine

States: at any instant, the automaton can be in one of several states

State relation: the next state of the automaton at any instant is determined by the present state and the present input

Types of automata

o Finite State Automata (FSA)

 Deterministic FSA (DFSA)

 Nondeterministic FSA (NFSA)

o Push Down Automata (PDA)

 Deterministic PDA (DPDA)

 Nondeterministic PDA (NPDA)

Finite State Automaton

o Finite Automaton, Finite State Machine, FSA or FSM

o An abstract machine which can be used to implement regular expressions (etc.).

o Have a finite number of states, and a finite amount of memory (i.e., the current
state).

o Can be represented by directed graphs or transition tables

Design of a Lexical Analyzer/Scanner

Finite Automata

Lex turns its input specification into a lexical analyzer.

Finite automata are recognizers; they simply say "yes" or "no" about each possible input
string.

Finite automata come in two flavors:

a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges.

ε, the empty string, is a possible label.

b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.

The Whole Scanner Generator Process

Overview

Direct construction of a nondeterministic finite automaton (NFA) to recognize a given regular expression.

o Easy to build in an algorithmic way

o Requires ε-transitions to combine regular sub expressions

Construct a deterministic finite automaton (DFA) to simulate the NFA

o Use a set-of-state construction

Minimize the number of states in the DFA (optional)

Generate the scanner code.

Design of a Lexical Analyzer …

o Token → Pattern

o Pattern → Regular Expression

o Regular Expression → NFA

o NFA → DFA

o DFA's or NFA's for all tokens → Lexical Analyzer

Non-Deterministic Finite Automata (NFA)

Definition:

An NFA M consists of a 5-tuple: (Σ, S, T, S0, F)

o a set of input symbols Σ, the input alphabet

o a finite set of states S,

o a transition function T: S × (Σ ∪ {ε}) → a set of next states,

o a start state S0 from S, and

o a set of accepting/final states F from S.

The language accepted by M, written L(M), is defined as:

The set of strings of characters c1c2...cn, with each ci from Σ ∪ {ε}, such that there exist states s1 in T(s0, c1), s2 in T(s1, c2), ..., sn in T(sn-1, cn) with sn an element of F.

It is a finite automaton which has a choice of edges

o The same symbol can label edges from one state to several different states.

An edge may be labeled by ε, the empty string

o We can have transitions without any input character consumption.

Transition Graph

The transition graph for an NFA recognizing the language of regular expression
(a|b)*abb

Transition Table

The mapping T of an NFA can be represented in a transition table

The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb
for the example NFA

Acceptance of input strings by NFA

An NFA accepts input string x if and only if there is some path in the transition graph
from the start state to one of the accepting states

The string aabb is accepted by the NFA:

Another NFA

An -transition is taken without consuming any character from the input.

What does the NFA above accepts?

aa*|bb*

Deterministic Finite Automata (DFA)


A deterministic finite automaton is a special case of an NFA

o No state has an ε-transition

o For each state S and input symbol a there is at most one edge labeled a leaving S

Each entry in the transition table is a single state

o At most one path exists to accept a string

o Simulation algorithm is simple

DFA example

A DFA that accepts (a|b)*abb

Simulating a DFA: Algorithm

How to apply a DFA to a string?

INPUT:

o An input string x terminated by an end-of-file character eof.

o A DFA D with start state S0, accepting states F, and transition function move.

OUTPUT: Answer "yes" if D accepts x; "no" otherwise.

METHOD

o Apply the simulation algorithm sketched below to the input string x.

o The function move(s, c) gives the state to which there is an edge from state s on input c.

o The function nextChar() returns the next character of the input string x.

Example:

DFA: Exercise

Construct DFAs for the string matched by the following definition:

digit = [0-9]

nat = digit+

signedNat = (+|-)? nat

number = signedNat("." nat)? (E signedNat)?

Why do we study REs, NFAs, and DFAs?

Goal: To scan the given source program

Process:

o Start with Regular Expression (RE)

o Build a DFA

 How?

 We can build a nondeterministic finite automaton, NFA (Thompson's construction)

 Convert that to a deterministic one, DFA (subset construction)

 Minimize the DFA (optional) (different algorithms)

 Implement it

 Existing scanner generator: Lex/Flex

RENFADFA Minimize DFA states

Step 1: Come up with a Regular Expression

(a|b)*ab

Step 2: Use Thompson's construction to create an NFA for that expression

r RENFADFA Minimize DFA states

Step 1: Come up with a Regular Expression (a|b)*ab

Step 2: Use Thompson's construction to create an NFA for that expression

RENFADFA Minimize DFA states

Step 3: Use subset construction to convert the NFA to a DFA

States 0 and 2 behave the same way, so they can be merged.

Step 4: Minimize the DFA states

Design of a Lexical Analyzer Generator

Two algorithms:

1- Translate a regular expression into an NFA (Thompson's construction)

2- Translate NFA into DFA (Subset construction)

From regular expression to an NFA


It is known as Thompson's construction.

Rules:

1- For the basic regular expressions ε and a single character a, construct:

2- For a composition of regular expression:

Case 1: Alternation: regular expression (s|r), assume that NFAs equivalent to r and s
have been constructed.

Case 2: Concatenation: regular expression sr.
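As a worked example (using the usual textbook state numbering, which is an assumption here and not taken from the missing figures): the NFA for a|b has a new start state 0 with ε-edges to states 1 and 3; edges 1 —a→ 2 and 3 —b→ 4; and ε-edges from 2 and 4 to a new accepting state 5. For the concatenation ab, the accepting state of the NFA for a is joined by an ε-edge to the start state of the NFA for b, so the machine for a runs first and then the machine for b.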

From RE to NFA: Exercises

Construct NFA for token identifier.

letter(letter|digit)*

Construct NFA for the following regular expression:

(a|b)*abb

From an NFA to a DFA (subset construction algorithm)

Rules:

The start state of D is assumed to be unmarked.

The start state of D is ε-closure(S0), where S0 is the start state of N.

ε-closure

ε-closure(S) is the set of states with the following characteristics:

1- S itself belongs to ε-closure(S)

2- if t ∈ ε-closure(S) and there is an edge labeled ε from t to v, then v ∈ ε-closure(S)

3- Repeat step 2 until no more states can be added to ε-closure(S).

E.g., for the NFA of (a|b)*abb:

ε-closure(0) = {0, 1, 2, 4, 7} and ε-closure(1) = {1, 2, 4}
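A compact sketch in C of this ε-closure computation, representing a set of NFA states as a bit mask (it assumes at most 32 states; the type and function names are illustrative):

#include <stdint.h>

typedef uint32_t Set;   /* bit i is set <=> NFA state i is in the set */

/* eps[s] is the bit mask of states reachable from s by one ε-edge. */
Set eps_closure(Set T, const Set eps[], int nstates) {
    Set c = T;
    int changed = 1;
    while (changed) {                 /* step 3: repeat until nothing new is added */
        changed = 0;
        for (int s = 0; s < nstates; s++)
            if ((c >> s) & 1) {       /* s is already in the closure */
                Set grown = c | eps[s];   /* step 2: add ε-successors of s */
                if (grown != c) { c = grown; changed = 1; }
            }
    }
    return c;
}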

Algorithm

While there is an unmarked state X = {s0, s1, s2, ..., sn} of D do

Begin

Mark X

For each input symbol a do

Begin

Let T be the set of states to which there is a transition on a from some state si in X

Y = ε-closure(T)

If Y has not been added to the set of states of D, then mark Y as an "unmarked" state of D, and add a transition from X to Y labeled a if not already present

End

End
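Continuing the bit-mask sketch from the ε-closure example above, this marking algorithm can be written as a small worklist loop in C (the alphabet is fixed to {a, b} and all names are illustrative assumptions):

/* delta[s][a] = bit mask of NFA states reachable from s on input symbol a. */
Set move_on(Set X, int a, const Set delta[][2], int nstates) {
    Set T = 0;
    for (int s = 0; s < nstates; s++)
        if ((X >> s) & 1) T |= delta[s][a];
    return T;
}

void subset_construction(Set start, const Set delta[][2],
                         const Set eps[], int nstates) {
    Set dstates[64];
    int ndfa = 0, marked = 0;
    dstates[ndfa++] = eps_closure(start, eps, nstates);  /* start state of D */
    while (marked < ndfa) {               /* an unmarked DFA state remains */
        Set X = dstates[marked++];        /* mark X */
        for (int a = 0; a < 2; a++) {     /* for each input symbol */
            Set Y = eps_closure(move_on(X, a, delta, nstates), eps, nstates);
            int found = 0;
            for (int j = 0; j < ndfa; j++)
                if (dstates[j] == Y) { found = 1; break; }
            if (!found && Y != 0) dstates[ndfa++] = Y;   /* new unmarked state */
            /* record the DFA transition X --a--> Y here */
        }
    }
}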

NFA for identifier: letter(letter|digit)*

Example: Convert the following NFA into the corresponding DFA. letter (letter|digit)*

Exercise: convert NFA of (a|b)*abb in to DFA.

Other Algorithms

 How to minimize a DFA? (see Dragon Book 3.9, pp.173)

 How to convert RE to DFA directly? (see Dragon Book 3.9.5 pp.179)

The Lexical- Analyzer Generator: Lex
The first phase in a compiler reads the input source and converts strings in the source to tokens.

Lex: generates a scanner (lexical analyzer or lexer) given a specification of the tokens using
REs.

o The input notation for the Lex tool is referred to as the Lex language and

o The tool itself is the Lex compiler.

The Lex compiler transforms the input patterns into a transition diagram and generates
code, in a file called lex.yy.c, that simulates this transition diagram.

By using regular expressions, we can specify patterns to lex that allow it to scan and match
strings in the input.

Each pattern in lex has an associated action.

Typically an action returns a token, representing the matched string, for subsequent use by
the parser.

It uses patterns that match strings in the input and converts the strings to tokens.
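For instance, a tiny lex specification along these lines pairs each pattern with an action; this is a hedged sketch, not taken from the original notes, and the printed token names are illustrative:

%{
#include <stdio.h>
%}
digit   [0-9]
letter  [a-zA-Z]
%%
{digit}+                      { printf("NUMBER: %s\n", yytext); }
{letter}({letter}|{digit})*   { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+                      ; /* strip blanks, tabs, new lines */
.                             { printf("SYMBOL: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }

In a real compiler the actions would return token codes to the parser instead of printing them.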

General Compiler Infra-structure

Scanner, Parser, Lex and Yacc

Generating a Lexical Analyzer using Lex

We will see more about lex and how to construct scanners with it in the lab sessions.

Summary of Chapter 2
Tokens: The lexical analyzer scans the source program and produces as output a
sequence of tokens, which are normally passed, one at a time to the parser. Some tokens
may consist only of a token name while others may also have an associated lexical value
that gives information about the particular instance of the token that has been found on
the input.

Lexemes: Each time the lexical analyzer returns a token to the parser, it has an associated
lexeme - the sequence of input characters that the token represents.

Buffering: Because it is often necessary to scan ahead on the input in order to see where the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input. Using a pair of buffers cyclically and ending each buffer's contents with a sentinel that warns of its end are two techniques that accelerate the process of scanning the input.

Patterns: Each token has a pattern that describes which sequences of characters can form the lexemes corresponding to that token. The set of words, or strings of characters, that match a given pattern is called a language.

Regular Expressions: These expressions are commonly used to describe patterns.
Regular expressions are built from single characters, using union, concatenation, and the
Kleene closure, or any-number-of, operator.

Regular Definitions: Complex collections of languages, such as the patterns that


describe the tokens of a programming language, are often defined by a regular definition,
which is a sequence of statements that each define one variable to stand for some regular
expression. The regular expression for one variable can use previously defined variables
in its regular expression.

Transition Diagrams: The behavior of a lexical analyzer can often be described by a transition diagram. These diagrams have states, each of which represents something about the history of the characters seen during the current search for a lexeme that matches one of the possible patterns. There are arrows, or transitions, from one state to another, each of which indicates the possible next input characters that cause the lexical analyzer to make that change of state.

Finite Automata: These are a formalization of transition diagrams that include a designation of a start state and one or more accepting states, as well as the set of states, input characters, and transitions among states. Accepting states indicate that the lexeme for some token has been found. Unlike transition diagrams, finite automata can make transitions on empty input as well as on input characters.

Deterministic Finite Automata: A DFA is a special kind of finite automaton that has
exactly one transition out of each state for each input symbol. Also, transitions on empty
input are disallowed. The DFA is easily simulated and makes a good implementation of a
lexical analyzer, similar to a transition diagram.

Nondeterministic Finite Automata: Automata that are not DFA's are called nondeterministic. NFA's are often easier to design than DFA's. Another possible architecture for a lexical analyzer is to tabulate all the states that the NFA's for each of the possible patterns can be in, as we scan the input characters.

Conversion among Pattern Representations: It is possible to convert any regular


expression into an NFA of about the same size, recognizing the same language as the
regular expression defines. Further, any NFA can be converted to a DFA for the same
pattern, although in the worst case (never encountered in common programming
languages) the size of the automaton can grow exponentially. It is also possible to convert
any nondeterministic or deterministic finite automaton into a regular expression that
defines the same language recognized by the finite automaton.

Lex: There is a family of software systems, including Lex and Flex, that are lexical-
analyzer generators. The user specifies the patterns for tokens using an extended regular-
expression notation. Lex converts these expressions into a lexical analyzer that is

essentially a deterministic finite automaton that recognizes any of the patterns.
Minimization of Finite Automata: For every DFA there is a minimum-state DFA accepting the same language. Moreover, the minimum-state DFA for a given language is unique except for the names given to the various states.

Review Exercise
1) Divide the following C++ program:

float limitedSquare(x) float x {
/* returns x-squared, but never more than 100 */
return (x<=-10.0||x>=10.0)?100:x*x;
}

into appropriate lexemes. Which lexemes should get associated lexical values? What should those values be?
2) Write regular definitions for the following languages:
a) All strings of lowercase letters that contain the five vowels in order.
b) All strings of lowercase letters in which the letters are in ascending lexicographic order.
c) Comments, consisting of a string surrounded by /* and */, without an intervening */,
unless it is inside double-quotes (").
d) All strings of digits with no repeated digits. Hint: Try this problem first with a few digits, such as {0, 1, 2}. !!
e) All strings of digits with at most one repeated digit. !!
f) All strings of a's and b's with an even number of a's and an odd number of b's.
g) The set of Chess moves, in the informal notation, such as p-k4 or kbp x qn.!!
h) All strings of a's and b's that do not contain the substring abb.
i) All strings of a's and b's that do not contain the subsequence abb.
3) Construct the minimum-state DFA's for the following regular expressions:
a) (a|b)*a(a|b)
b) (a|b)*a(a|b) (a|b)
c) (a|b)*a(a|b) (a|b)(a|b)

Chapter – 3

Syntax analysis

Introduction
Syntax: the way in which tokens are put together to form expressions, statements, or
blocks of statements.

o The rules governing the formation of statements in a programming language.

Syntax analysis: the task concerned with fitting a sequence of tokens into a specified
syntax.

Parsing: To break a sentence down into its component parts with an explanation of the
form, function, and syntactical relationship of each part.

The syntax of a programming language is usually given by the grammar rules of a context-free grammar (CFG).

Parser

The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a CFG or not.

o If it satisfies, the parser creates the parse tree of that program.

o Otherwise, the parser gives the error messages.

A CFG:

o Gives a precise syntactic specification of a programming language.

o A grammar can be directly converted into a parser by some tools (e.g., yacc).

The parser can be categorized into two groups:

Top-down parser

o The parse tree is created top to bottom, starting from the root to leaves.

Bottom-up parser

o The parse tree is created bottom to top, starting from the leaves to root.

Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).

Efficient top-down and bottom-up parsers can be implemented by making use of context-free grammars.

o LL for top-down parsing

o LR for bottom-up parsing

Context free grammar (CFG)


A context-free grammar is a specification for the syntactic structure of a programming
language.

Context-free grammar has 4-tuples:

G = (T, N, P, S) where

o T is a finite set of terminals (a set of tokens)

o N is a finite set of non-terminals (syntactic variables)

o P is a finite set of productions of the form A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)

S ∈ N is a designated start symbol (one of the non-terminal symbols)

Example: grammar for simple arithmetic expressions

Derivation
A derivation is a sequence of replacements of structure names by choices on the right
hand sides of grammar rules.

Example: E → E + E | E – E | E * E | E / E | -E

E→(E)

E → id

E => E + E means that E + E is derived from E

o we can replace E by E + E
o we have to have a production rule E → E+E in our grammar.

E => E + E => id + E => id + id means that such a sequence of replacements of non-terminal symbols is called a derivation of id+id from E.

If we always choose the left-most non-terminal in each derivation step, this derivation is
called left-most derivation.

Example: E=>-E=>-(E)=>-(E+E)=>-(id+E)=>-(id+id)

If we always choose the right-most non-terminal in each derivation step, this derivation is
called right-most derivation.

Example: E=>-E=>-(E)=>-(E+E)=>-(E+id)=>-(id+id)

We will see that the top-down parser tries to find the left-most derivation of the given source program.

We will see that the bottom-up parser tries to find the right-most derivation of the given source program, in reverse order.

Parse tree
A parse tree is a graphical representation of a derivation.

It filters out the order in which productions are applied to replace non-terminals.

A parse tree corresponding to a derivation is a labeled tree in which:

o the interior nodes are labeled by non-terminals,

o the leaf nodes are labeled by terminals, and

o the children of each internal node represent the replacement of the associated non-terminal in one step of the derivation.

Parse tree and Derivation

Ambiguity: example

Ambiguity: example…

Elimination of ambiguity Precedence/Association


These two derivations point out a problem with the grammar: the grammar does not have a notion of precedence, or an implied order of evaluation.

To add precedence

o Create a non-terminal for each level of precedence

o Isolate the corresponding part of the grammar

o Force the parser to recognize high precedence sub expressions first

For algebraic expressions

o Multiplication and division, first (level one)

o Subtraction and addition, next (level two)

To add association

o Left-associative: the next-level (higher-precedence) non-terminal is placed at the end of the production

o Elimination of ambiguity

o To disambiguate the grammar:

o we can use precedence of operators as follows:

* : higher precedence (left associative)

+ : lower precedence (left associative)

o We get the following unambiguous grammar (the same grammar used as G1 later in these notes):

E → E + T | T
T → T * F | F
F → (E) | id

Left Recursion

Elimination of Left recursion

A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.

Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed.

To eliminate left recursion, the single pair of productions A → Aα | β can be replaced by the non-left-recursive productions:

A → βA′

A′ → αA′ | ε

Generally, we can eliminate immediate left recursion by the following technique. First we group the A-productions as:

A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

where no βi begins with A. Then we replace the A-productions by:

A → β1A′ | β2A′ | … | βnA′

A′ → α1A′ | α2A′ | … | αmA′ | ε
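For example, applying this transformation to the left-recursive expression rule E → E + T | T (here α = + T and β = T) yields E → T E′ and E′ → + T E′ | ε.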

Left factoring
When a non-terminal has two or more productions whose right-hand sides start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing.

A predictive parser (a top-down parser without backtracking) insists that the grammar be left-factored.

In general: A → αβ1 | αβ2, where α is non-empty and is the common prefix of the two right-hand sides.

When processing α we do not know whether to expand A to αβ1 or to αβ2, but if we rewrite the grammar as follows:

A → αA′

A′ → β1 | β2

then we can immediately expand A to αA′.

Example: given the following grammar:

S  iEtS | iEtSeS | a

Eb

Left factored, this grammar becomes:

S  iEtSS‟ | a

S‟  eS | ε

Eb

Syntax analysis

Every language has rules that prescribe the syntactic structure of well-formed programs.

The syntax can be described using Context Free Grammars (CFG) notation.

The use of CFGs has several advantages:

o helps in identifying ambiguities

o it is possible to have a tool which produces automatically a parser using the grammar

o a properly designed grammar helps in modifying the parser easily when the language
changes

Top-down parsing
Recursive Descent Parsing (RDP)

This method of top-down parsing can be considered as an attempt to find the leftmost derivation for an input string. It may involve backtracking.

To construct the parse tree using RDP:

o We create one node tree consisting of S.

o Two pointers, one for the tree and one for the input, will be used to indicate where the
parsing process is.

o Initially, they will be on S and the first input symbol, respectively.

o Then we use the first S-production to expand the tree. The tree pointer will be
positioned on the left most symbol of the newly created sub-tree.

As long as the symbol pointed to by the tree pointer matches the symbol pointed to by the input pointer, both pointers are moved to the right.

Whenever the tree pointer points to a non-terminal, we expand it using the first production of that non-terminal.

Whenever the pointers point to different terminals, the production that was used is not correct, so another production should be used. We have to go back to the step just before we replaced the non-terminal and use another production.

If we reach the end of the input and the tree pointer passes the last symbol of the tree, we
have finished parsing.

Example: G: S → cAd

A → ab | a

Draw the parse tree for the input string cad using the above method.
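A minimal sketch in C of a recursive descent parser with backtracking for this grammar; the function and variable names are illustrative assumptions:

#include <stdio.h>

static const char *input;   /* pointer to the remaining input */

/* A -> ab | a : try the alternatives in order, backing up on failure */
static int A(void) {
    const char *save = input;
    if (input[0] == 'a' && input[1] == 'b') { input += 2; return 1; }
    input = save;                        /* backtrack to try A -> a */
    if (input[0] == 'a') { input += 1; return 1; }
    return 0;
}

/* S -> cAd */
static int S(void) {
    if (*input != 'c') return 0;
    input++;
    if (!A()) return 0;
    if (*input != 'd') return 0;
    input++;
    return 1;
}

int main(void) {
    input = "cad";
    printf("%s\n", (S() && *input == '\0') ? "accepted" : "rejected");
    return 0;
}

On cad, A first tries ab, fails, backtracks, and succeeds with a, mirroring the pointer movements described above.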

Exercise:

1) Consider the following grammar:

S → A

A → A + A | B++

B → y

Draw the parse tree for the input "y+++y++"


2) Using the grammar below, construct a parse tree for the following string using RDP
algorithm: ( ( id . id ) id ( id ) ( ( ) ) )

S→E

E → id

|(E.E)

|(L)

|()

L→LE

|E

Non-recursive predictive parsing


It is possible to build a non-recursive parser by explicitly maintaining a stack.

This method uses a parsing table that determines the next production to be applied. The
input buffer contains the string to be parsed followed by $ (the right end marker)

The stack contains a sequence of grammar symbols with $ at the bottom.

Initially, the stack contains the start symbol of the grammar followed by $.

The parsing table is a two dimensional array M[A, a] where A is a non-terminal of the
grammar and a is a terminal or $.

The parser program behaves as follows.

The program always considers

o X, the symbol on top of the stack and

o a, the current input symbol.

Predictive Parsing…

There are three possibilities:

1. X = a = $: the parser halts and announces a successful completion of parsing

2. X = a ≠ $: the parser pops X off the stack and advances the input pointer to the next symbol

3. X is a non-terminal: the program consults entry M[X, a], which can be an X-production or an error entry.

If M[X, a] = {X → uvw}, X on top of the stack will be replaced by uvw (u at the top of the stack).

As an output, any code associated with the X-production can be executed.

If M[X, a] = error, the parser calls the error recovery method.

A Predictive Parser table

A Predictive Parser: Example

Non-recursive predictive parsing Example: G:

E → TR

R → +TR

R → -TR

R → ε

T → 0 | 1 | … | 9

Input: 1+2

FIRST and FOLLOW
The construction of both top-down and bottom-up parsers are aided by two functions, FIRST
and FOLLOW, associated with a grammar G.

During top-down parsing, FIRST and FOLLOW allow us to choose which production to
apply, based on the next input symbol. During panic-mode error recovery, sets of tokens
produced by FOLLOW can be used as synchronizing tokens.

We need to build a FIRST set and a FOLLOW set for each symbol in the grammar. The elements of FIRST and FOLLOW are terminal symbols.

o FIRST(α) is the set of terminal symbols that can begin any string derived from α.

o FOLLOW(A) is the set of terminal symbols that can follow A: t ∈ FOLLOW(A) if and only if there is a derivation containing At.

Construction of a predictive parsing table

Makes use of two functions: FIRST and FOLLOW.

FIRST

o FIRST(α) = set of terminals that begin the strings derived from α.

o If α => ε in zero or more steps, ε is in FIRST(α).

o FIRST(X), where X is a grammar symbol, can be found using the following rules:

1- If X is a terminal, then FIRST(X) = {X}

2- If X is a non-terminal: two cases

a) If X → ε is a production, then add ε to FIRST(X)

b) For each production X → Y1Y2…Yk, place a in FIRST(X) if for some i, a ∈ FIRST(Yi) and ε ∈ FIRST(Yj) for all 1 ≤ j < i. If ε ∈ FIRST(Yj) for j = 1, …, k, then ε ∈ FIRST(X)

For any string y = X1X2…Xn:

a- Add all non-ε symbols of FIRST(X1) to FIRST(y)

b- Add all non-ε symbols of FIRST(Xi), for i ≠ 1, if for all j < i, ε ∈ FIRST(Xj)

c- ε ∈ FIRST(y) if ε ∈ FIRST(Xi) for all i

FOLLOW

FOLLOW(A) = set of terminals that can appear immediately to the right of A in some sentential form.

o Place $ in FOLLOW(A), where A is the start symbol.

o If there is a production B → αAβ, then everything in FIRST(β), except ε, should be added to FOLLOW(A).

o If there is a production B → αA, or B → αAβ where ε ∈ FIRST(β), then all elements of FOLLOW(B) should be added to FOLLOW(A).
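As a worked example, applying these rules to the earlier grammar E → TR, R → +TR | -TR | ε, T → 0 | … | 9: FIRST(T) = {0, …, 9}, FIRST(R) = {+, -, ε}, and FIRST(E) = FIRST(T). FOLLOW(E) = {$}; FOLLOW(R) = FOLLOW(E) = {$}, since R ends the right-hand sides of both the E-production and the R-productions; and FOLLOW(T) = (FIRST(R) \ {ε}) ∪ FOLLOW(R) = {+, -, $}, since T is always followed by R and ε ∈ FIRST(R).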

Exercises:

1) Consider the following grammars G, find FIRST and FOLLOW sets.

2) Find FIRST and FOLLOW sets for the following grammar G:

3) Consider the following grammar over the alphabet { g,h,i,b}

A  BCD

B  bB | ε

C  Cg | g | Ch | i

D  AB | ε

Fill in the table below with the FIRST and FOLLOW sets for the non-terminals in this grammar:
Construction of predictive parsing table

o Input Grammar G

o Output Parsing table M

For each production of the form A → α of the grammar do:

 For each terminal a in FIRST(α), add A → α to M[A, a]

 If ε ∈ FIRST(α), add A → α to M[A, b] for each b in FOLLOW(A)

 If ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add A → α to M[A, $]

 Make each undefined entry of M be an error.
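As a sketch of these rules in action, take the example grammar E → TR, R → +TR | -TR | ε, T → 0 | … | 9 with the FIRST and FOLLOW sets computed earlier: M[E, d] = E → TR for every digit d; M[T, d] = T → d; M[R, +] = R → +TR; M[R, -] = R → -TR; and, because ε ∈ FIRST(R) and FOLLOW(R) = {$}, M[R, $] = R → ε. All other entries are errors.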

Non-recursive predictive parsing…

Exercise:

1) Consider the following grammars G, Construct the predictive parsing table and parse the
input symbols:

2) Construct the predictive parsing table for the grammar G:

3) Let G be the following grammar:

S  [ SX ] | a

X  ε | +SY | Yb

Y  ε | -SXc

A – Find FIRST and FOLLOW sets for the non-terminals in this grammar.

B – Construct predictive parsing table for the grammar above.

C – Show a top down parse of the string [a+a-ac]

LL (1) Grammars…
Exercises: 1) Consider the following grammar G:

A‟  A

A  xA | yA |y

a) Find FIRST and FOLLOW sets for G:

b) Construct the LL(1) parse table for this grammar.

c) Explain why this grammar is not LL(1).

d) Transform the grammar into a grammar that is LL(1).

e) Give the parse table for the grammar created in (d).

Solution:

2) Given the following grammar:

S  WAB | ABCS

A  B | WB

B  ε |yB

Cz

Wx

a) Find FIRST and FOLLOW sets of the grammar.

b) Construct the LL(1) parse table.

c) Is the grammar LL(1)? Justify your answer.

3) Consider the following grammar:

S  ScB | B

B  e | efg | efCg

C  SdC | S

a) Justify whether the grammar is LL(1) or not?

b) If not, translate the grammar into LL(1).

c) Construct predictive parsing table for the above grammar.

Bottom-Up and Top-Down Parsers


Top-down parsers:

Starts constructing the parse tree at the top (root) of the tree and move down
towards the leaves.

Easy to implement by hand, but work with restricted grammars.

Example: predictive parsers

Bottom-up parsers:

Build the nodes on the bottom of the parse tree first.

Suitable for automatic parser generation, handle a larger class of grammars.

Example: shift-reduce parser (or LR (k) parsers)

A bottom-up parser, or a shift-reduce parser, begins at the leaves and works up to the top of the tree. The reduction steps trace a rightmost derivation in reverse.

We want to parse the input string abbcde. This parser is known as an LR Parser because it
scans the input from Left to right, and it constructs a rightmost derivation in reverse order.

Example of Bottom-up parser (LR parsing)

S  aABe

A  Abc | b

Bd

abbcde  aAbcde  aAde  aABe  S

At each step, we have to find α such that α is a substring of the sentence and replace α by A,
where A  α

Stack implementation of shift/reduce parsing

In LR parsing the two major problems are:

o locate the substring that is to be reduced

o locate the production to use

A shift/reduce parser operates:

o By shifting zero or more input symbols onto the stack until the right side of a handle is on top of the stack.

o The parser then replaces the handle by the non-terminal of the corresponding production.

o This is repeated until the start symbol is in the stack and the input is empty, or until
error is detected.

Four actions are possible:

o Shift: the next input is shifted on to the top of the stack

o Reduce: the parser knows the right end of the handle is at the top of the stack. It
should then decide what non-terminal should replace that substring

o Accept: the parser announces successful completion of parsing

o Error: the parser discovers a syntax error

Example: the operations of a shift/reduce parser on the grammar G: E → E + E | E * E | (E) | id
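The example table itself is not reproduced here, but the following illustrative trace for the input id + id shows the four actions (the stack grows to the right):

Stack        Input       Action
$            id+id$      shift
$ id         +id$        reduce by E → id
$ E          +id$        shift
$ E +        id$         shift
$ E + id     $           reduce by E → id
$ E + E      $           reduce by E → E + E
$ E          $           accept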

Conflict during shift/reduce parsing

Grammars for which we can construct an LR (k) parsing table are called LR (k) grammars.

Most of the grammars that are used in practice are LR (1).

There are two types of conflicts in shift/reduce parsing:

o Shift/reduce conflict: when we have a situation where the parser knows the entire stack content and the next k symbols but cannot decide whether it should shift or reduce (an ambiguity).

o Reduce/reduce conflict: when the parser cannot decide which of the several
productions it should use for a reduction.

E → T

E → id (with an id on the top of the stack)

T → id

LR parser

The LR(k) stack stores strings of the form S0 X1 S1 X2 S2 … Xm Sm, where

o Si is a new symbol called state that summarizes the information contained in
the stack

o Sm is the state on top of the stack

o Xi is a grammar symbol

The parser program decides the next step by using:

 the top of the stack (Sm),

 the input symbol (ai), and

 the parsing table which has two parts: ACTION and GOTO.

 then consulting the entry ACTION[Sm , ai] in the parsing action table

Structure of the LR Parsing Table

The parsing table consists of two parts:

o a parsing-action function ACTION and

o a goto function GOTO.

The ACTION function takes as arguments a state i and a terminal a (or $, the input
endmarker).

The value of ACTION[i, a] can have one of four forms:

o Shift j, where j is a state. The parser shifts the input symbol a onto the top of the stack, but uses state j to represent it.

o Reduce A → β: the parser reduces β on the top of the stack to the head A.

o Accept: the parser accepts the input and finishes parsing.

o Error: the parser discovers an error.

GOTO function, defined on sets of items, to states.

o GOTO[Ii, A] = Ij, then GOTO maps a state i and a non-terminal A to state j.

LR parser configuration

The behavior of an LR parser is described by its configuration, which captures the complete state of the parser.

A configuration of an LR parser is a pair:

(S0 X1 S1 X2 S2 … Xm Sm, ai ai+1 … an $)

where the first component is the stack contents and the second is the remaining input. This configuration represents the right-sentential form X1 X2 … Xm ai ai+1 … an.

Behavior of LR parser

The parser program decides the next step by using:

o the top of the stack (Sm),

o the input symbol (ai), and

o the parsing table which has two parts: ACTION and GOTO.

o then consulting the entry ACTION[Sm , ai] in the parsing action table

1. If Action[Sm, ai] = shift S, the parser program shifts both the current input symbol ai and
state S on the top of the stack, entering the configuration

(S0 X1 S1 X2 S2 … Xm Sm ai S, ai+1 … an $)

2. Action[Sm, ai] = reduce A → β: the parser pops the top 2r symbols off the stack, where r = |β| (at this point, Sm-r will be the state on top of the stack), entering the configuration,

(S0 X1 S1 X2 S2 … Xm-r Sm-r A S, ai ai+1 … an $)

o Then A and S are pushed on top of the stack where S = goto[Sm-r, A]. The input buffer is
not modified.

3. Action[Sm, ai] = accept, parsing is completed.

4. Action[Sm, ai] = error, parsing has discovered an error and calls an error recovery routine.

LR-parsing algorithm

let a be the first symbol of w$;

while (1) { /* repeat forever */
    let S be the state on top of the stack;
    if (ACTION[S, a] = shift t) {
        push t onto the stack;
        let a be the next input symbol;
    } else if (ACTION[S, a] = reduce A → β) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t, A] onto the stack;
        output the production A → β;
    } else if (ACTION[S, a] = accept) break; /* parsing is done */
    else call the error-recovery routine;
}

Example: Let G1 be the unambiguous expression grammar from earlier:

(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id

Legend: Si means shift to state i; Rj means reduce by production j.

This grammar can be parsed with the action and goto table below.

Example: The following example shows how a shift/reduce parser parses an input string w = id
* id + id using the parsing table shown above.
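The table is not reproduced here, but using the state numbers produced by the LR(0) item-set construction later in this section (I0–I11), the parse of id * id + id proceeds as follows (a sketch):

Stack            Input        Action
0                id*id+id$    shift 5
0 id 5           *id+id$      reduce F → id, goto(0, F) = 3
0 F 3            *id+id$      reduce T → F, goto(0, T) = 2
0 T 2            *id+id$      shift 7
0 T 2 * 7        id+id$       shift 5
0 T 2 * 7 id 5   +id$         reduce F → id, goto(7, F) = 10
0 T 2 * 7 F 10   +id$         reduce T → T * F, goto(0, T) = 2
0 T 2            +id$         reduce E → T, goto(0, E) = 1
0 E 1            +id$         shift 6
0 E 1 + 6        id$          shift 5
0 E 1 + 6 id 5   $            reduce F → id, goto(6, F) = 3
0 E 1 + 6 F 3    $            reduce T → F, goto(6, T) = 9
0 E 1 + 6 T 9    $            reduce E → E + T, goto(0, E) = 1
0 E 1            $            accept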

Constructing SLR parsing tables

This method is the simplest of the three methods used to construct an LR parsing table. It is
called SLR (simple LR) because it is the easiest to implement. However, it is also the
weakest in terms of the number of grammars for which it succeeds. A parsing table
constructed by this method is called SLR table. A grammar for which an SLR table can be
constructed is said to be an SLR grammar.

LR (0) item

An LR(0) item (item for short) is a production of a grammar G with a dot at some position of the right side.

For example, for the production A → XYZ we have four items:

A → .XYZ

A → X.YZ

A → XY.Z

A → XYZ.

For the production A → ε we only have one item:

A → .

An item indicates how much of a production we have seen and what we hope to see. The central idea in the SLR method is to construct, from the grammar, a deterministic finite automaton to recognize viable prefixes.

A viable prefix is a prefix of a right sentential form that can appear on the stack of a
shift/reduce parser.

o If you have a viable prefix in the stack it is possible to have inputs that will reduce
to the start symbol.

o If you don't have a viable prefix on top of the stack, you can never reach the start symbol; therefore you have to call the error recovery procedure.

The closure operation

If I is a set of items of G, then Closure(I) is the set of items constructed by the following rules:

o Initially, every item in I is added to Closure(I)

o If A → α.Bβ is in Closure(I) and B → γ is a production, then add B → .γ to Closure(I).

o This rule is applied until no more new items can be added to Closure(I).

Example G1′:

E′ → E

E → E + T

E → T

T → T * F

T → F

F → (E)

F → id

I = {[E′ → .E]}

Closure(I) = {[E′ → .E], [E → .E + T], [E → .T], [T → .T * F], [T → .F], [F → .(E)], [F → .id]}

The Goto operation

The second useful function is Goto(I, X), where I is a set of items and X is a grammar symbol. Goto(I, X) is defined as the closure of the set of all items [A → αX.β] such that [A → α.Xβ] is in I.

Example:

I = {[E′ → E.], [E → E. + T]}. Then Goto(I, +) = {[E → E + .T], [T → .T * F], [T → .F], [F → .(E)], [F → .id]}

The set of Items construction

Below is an algorithm to construct C, the canonical collection of sets of LR(0) items for an augmented grammar G′.

Procedure Items (G′);

Begin

C := {Closure({[S′ → .S]})}

Repeat

For each set of items I in C and each grammar symbol X such that Goto(I, X) is not empty and not in C do

Add Goto(I, X) to C;

Until no more sets of items can be added to C

End

Example: Construction of the sets of items for the augmented grammar G1′ above.

I0 = {[E′ → .E], [E → .E + T], [E → .T], [T → .T * F], [T → .F], [F → .(E)], [F → .id]}

I1 = Goto(I0, E) = {[E′ → E.], [E → E. + T]}

I2 = Goto(I0, T) = {[E → T.], [T → T. * F]}

I3 = Goto(I0, F) = {[T → F.]}

I4 = Goto(I0, () = {[F → (.E)], [E → .E + T], [E → .T], [T → .T * F], [T → .F], [F → .(E)], [F → .id]}

I5 = Goto(I0, id) = {[F → id.]}

I6 = Goto(I1, +) = {[E → E + .T], [T → .T * F], [T → .F], [F → .(E)], [F → .id]}

I7 = Goto(I2, *) = {[T → T * .F], [F → .(E)], [F → .id]}

I8 = Goto(I4, E) = {[F → (E.)], [E → E. + T]}

o Goto(I4, T) = {[E → T.], [T → T. * F]} = I2
o Goto(I4, F) = {[T → F.]} = I3
o Goto(I4, () = I4
o Goto(I4, id) = I5

I9 = Goto(I6, T) = {[E → E + T.], [T → T. * F]}

o Goto(I6, F) = I3
o Goto(I6, () = I4
o Goto(I6, id) = I5

I10 = Goto(I7, F) = {[T → T * F.]}

o Goto(I7, () = I4
o Goto(I7, id) = I5

I11 = Goto(I8, )) = {[F → (E).]}

o Goto(I8, +) = I6
o Goto(I9, *) = I7

The LR(0) automaton

SLR table construction algorithm

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G′.

2. State i is constructed from Ii:

a) If [A → α.aβ] is in Ii and Goto (Ii, a) = Ij (a is a terminal), then Action[i, a] = shift j

b) If [A → α.] is in Ii, then Action[i, a] = reduce A → α for all a in Follow (A), for A ≠ S′

c) If [S′ → S.] is in Ii, then Action[i, $] = accept.

o If no conflicting action is created by rules (a) and (b), the grammar is SLR(1); otherwise it is not.

3. For all non-terminals A, if Goto (Ii, A) = Ij then Goto[i, A] = j

4. All entries of the parsing table not defined by rules 2 and 3 are made error

5. The initial state is the one constructed from the set of items containing [S′ → .S]

Example: Construct the SLR parsing table for the grammar G1′.

Follow (E) = {+, ), $}    Follow (T) = {+, *, ), $}    Follow (F) = {+, *, ), $}

(0) E′ → E
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id

By following the method we obtain the parsing table used earlier.
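For reference, carrying out the construction on the item sets I0–I11 computed earlier
gives the table below (a reconstruction of the standard table for this grammar; blank
entries denote error):

    State |  id    +    *    (    )    $   |  E    T    F
      0   |  s5              s4           |  1    2    3
      1   |       s6                 acc  |
      2   |       r2   s7        r2   r2  |
      3   |       r4   r4        r4   r4  |
      4   |  s5              s4           |  8    2    3
      5   |       r6   r6        r6   r6  |
      6   |  s5              s4           |       9    3
      7   |  s5              s4           |            10
      8   |       s6             s11      |
      9   |       r1   s7        r1   r1  |
     10   |       r3   r3        r3   r3  |
     11   |       r5   r5        r5   r5  |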

Legend: Si means shift to state i; Rj means reduce by production j.

Exercise: Construct the SLR parsing table for the following grammar: /* Grammar G2′ */

S′ → S

S → L = R

S → R

L → * R

L → id

R → L

Answer

C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9}

I0 = {[S′ → .S], [S → .L = R], [S → .R], [L → .*R],
      [L → .id], [R → .L]}

I1 = Goto (I0, S) = {[S′ → S.]}

I2 = Goto (I0, L) = {[S → L. = R], [R → L.]}

I3 = Goto (I0, R) = {[S → R.]}

I4 = Goto (I0, *) = {[L → *.R], [L → .*R], [L → .id],
      [R → .L]}

I5 = Goto (I0, id) = {[L → id.]}

I6 = Goto (I2, =) = {[S → L = .R], [R → .L], [L → .*R],
      [L → .id]}

I7 = Goto (I4, R) = {[L → *R.]}

I8 = Goto (I4, L) = {[R → L.]}

o Goto (I4, *) = I4
o Goto (I4, id) = I5

I9 = Goto (I6, R) = {[S → L = R.]}

o Goto (I6, L) = I8
o Goto (I6, *) = I4
o Goto (I6, id) = I5

Follow (S) = {$}    Follow (R) = {$, =}    Follow (L) = {$, =}

We have a shift/reduce conflict: = is in Follow (R) and [R → L.] is in I2, so the parser
should reduce on =, but [S → L. = R] is also in I2 with Goto (I2, =) = I6, which calls
for shifting =. Every SLR(1) grammar is unambiguous, but there are many unambiguous
grammars that are not SLR(1). G2′ is not an ambiguous grammar; however, it is not SLR.
This is because the SLR parser is not powerful enough to remember enough left context
to decide whether to shift or reduce when it sees an =.

Exercise

1) Given the following grammar:

(1) S → A

(2) S → B

(3) A → a A b

(4) A → 0

(5) B → a B b b

(6) B → 1

A. Construct the SLR parsing table.

B. Show the actions of an LR parser on the string aa1bbbb.

The Parser Generator: Yacc


Yacc stands for "yet another compiler-compiler". Yacc is a tool for automatically
generating a parser given a grammar written in a Yacc specification (.y file). The Yacc
parser calls the lexical analyzer to collect tokens from the input stream. Tokens are
organized using grammar rules; when a rule is recognized, its action is executed.

Note:

o lex tokenizes the input and yacc parses the tokens, taking the right actions in context.

Scanner, Parser, Lex and Yacc

Yacc…

There are four steps involved in creating a compiler in Yacc:

1) Specify the grammar:

o Write the grammar in a .y file (also specify here the actions, written in C, that
are to be taken when a rule is recognized).

o Write a lexical analyzer to process input and pass tokens to the parser. This
can be done using Lex.

o Write a function that starts parsing by calling yyparse().

o Write error handling routines (like yyerror()).

2) Generate the parser by running Yacc over the grammar file.

3) Compile the code produced by Yacc as well as any other relevant source files.

4) Link the object files to appropriate libraries for the executable parser.
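As a concrete illustration (the file names calc.y and calc.l here are only hypothetical
examples), the usual build sequence with the classic tools is:

    yacc -d calc.y           # writes y.tab.c (the parser) and y.tab.h (token codes)
    lex calc.l               # writes lex.yy.c (the scanner)
    cc y.tab.c lex.yy.c -o calc

The -d flag makes Yacc emit y.tab.h, so that the Lex scanner and the parser agree on
the token codes.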

Review Exercise

1) Consider the context-free grammar and the string: aa + a*.


a) Give a leftmost derivation for the string.
b) Give a rightmost derivation for the string.
c) Give a parse tree for the string.
d) Is the grammar ambiguous or unambiguous? Justify your answer.
e) Describe the language generated by this grammar.
2) Design grammars for the following languages:
a) The set of all strings of 0s and 1s such that every 0 is immediately followed by at
least one 1.
b) The set of all strings of 0s and 1s that are palindromes; that is, the string reads the
same backward as forward.
c) The set of all strings of 0s and 1s with an equal number of 0s and 1s.
d) The set of all strings of 0s and 1s with an unequal number of 0s and 1s.
e) The set of all strings of 0s and 1s in which 011 does not appear as a substring.
f) The set of all strings of 0s and 1s of the form xy, where x ≠ y and x and y are of
the same length.
3) The following is a grammar for regular expressions over symbols a and b only, using + in
place of | for union, to avoid conflict with the use of the vertical bar as a metasymbol in
grammars:

rexpr → rexpr + rterm | rterm
rterm → rterm rfactor | rfactor
rfactor → rfactor * | rprimary
rprimary → a | b
a) Left factor this grammar.
b) Does left factoring make the grammar suitable for top-down parsing?
c) In addition to left factoring, eliminate left recursion from the original grammar.
d) Is the resulting grammar suitable for top-down parsing?
4) The grammar S → a S a | a a generates all even-length strings of a's. We can devise a
recursive-descent parser with backtracking for this grammar. If we choose to expand by
production S → a a first, then we shall only recognize the string aa. Thus, any
reasonable recursive-descent parser will try S → a S a first.
a) Show that this recursive-descent parser recognizes inputs aa, aaaa, and aaaaaaaa, but
not aaaaaa.
b) What language does this recursive-descent parser recognize?
5) Show that the following grammar is LL(1) but not SLR(1).

S → A a A b | B b B a

A → ε

B → ε

6) Show that the following grammar is SLR(1) but not LL(1).

S → S A | A

A → a

CHAPTER 4

Syntax-Directed Translation

Introduction

Grammar symbols are associated with attributes to associate information with the
programming language constructs that they represent. Values of these attributes are
evaluated by the semantic rules associated with the production rules.

Evaluation of these semantic rules:

o may generate intermediate codes

o may put information into the symbol table

o may perform type checking

o may issue error messages

o may perform some other activities

o in fact, they may perform almost any activity.

Syntax-Directed Definitions and Translation Schemes


When we associate semantic rules with productions, we use two notations:

 Syntax-Directed Definitions

 Translation Schemes

Syntax-Directed Definitions:

o give high-level specifications for translations

o hide many implementation details such as order of evaluation of semantic actions.

o We associate a production rule with a set of semantic actions, and we do not say
when they will be evaluated.

Translation Schemes:

o indicate the order of evaluation of semantic actions associated with a production rule.

o In other words, translation schemes give more information about implementation


details.

Syntax-Directed Definitions

A syntax-directed definition is a generalization of a context-free grammar in which:

o Each grammar symbol is associated with a set of attributes. This set of attributes for
a grammar symbol is partitioned into two subsets called synthesized and inherited
attributes of that grammar symbol. Each production rule is associated with a set of
semantic rules.

An attribute can represent anything we choose:

o a string, a number, a type, a memory location, an intermediate program representation,
etc.

o The value of a synthesized attribute is computed from the values of attributes at the
children of that node in the parse tree.

o The value of an inherited attribute is computed from the values of attributes at the
siblings and parent of that node in the parse tree.

Semantic rules set up dependencies between attributes which can be represented by a


dependency graph. This dependency graph determines the evaluation order of these semantic
rules. Evaluation of a semantic rule defines the value of an attribute. But a semantic rule may
also have some side effects such as printing a value.

A depth-first traversal algorithm traverses the parse tree thereby executing semantic rules to
assign attribute values. After the traversal is completed the attributes contain the translated
form of the input.

In a syntax-directed definition, each production A → α is associated with a set of semantic
rules of the form:

b = f(c1, c2, …, cn), where f is a function, and b can be one of the following:

 b is a synthesized attribute of A and c1, c2, …, cn are attributes of the grammar
symbols in the production A → α.

For A → C: A.b = C.c

OR

 b is an inherited attribute of one of the grammar symbols in α (on the right side of the
production), and c1, c2, …, cn are attributes of the grammar symbols in the production A → α.

For A → C: C.c = A.b

Annotated Parse Tree
A parse tree showing the values of attributes at each node is called an annotated parse tree.
The process of computing the attributes values at the nodes is called annotating (or
decorating) of the parse tree. Of course, the order of these computations depends on the
dependency graph induced by the semantic rules.

Annotating a Parse Tree with Depth-First Traversals


procedure visit(n : node);

begin

    for each child m of n, from left to right do

        visit(m);

    evaluate semantic rules at node n

end

Example 4.1: A synthesized-attribute grammar that calculates the value of an expression

Production      Semantic Rules

L → E n         print(E.val)

E → E1 + T      E.val = E1.val + T.val

E → T           E.val = T.val

T → T1 * F      T.val = T1.val * F.val

T → F           T.val = F.val

F → ( E )       F.val = E.val

F → digit       F.val = digit.lexval

It specifies a simple calculator that reads an input line containing an arithmetic expression
involving:

o digits, parentheses, the operators + and *, followed by a newline character n, and

o prints the value of the expression.

Example 4.2: The same synthesized-attribute grammar, evaluated on a concrete input

Production      Semantic Rules

L → E n         print(E.val)

E → E1 + T      E.val = E1.val + T.val

E → T           E.val = T.val

T → T1 * F      T.val = T1.val * F.val

T → F           T.val = F.val

F → ( E )       F.val = E.val

F → digit       F.val = digit.lexval

Input: 9+5+2n

Symbols E, T, and F are associated with a synthesized attribute val. The token digit has a
synthesized attribute lexval (it is assumed that it is evaluated by the lexical analyzer).
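The effect of this SDD can be sketched as a small C program (an illustrative sketch, not
part of the course text: the left-recursive productions E → E + T and T → T * F are
realized as loops, and the input comes from a fixed string instead of a real lexer):

    #include <stdio.h>

    static const char *p;            /* cursor into the input line            */
    static int E(void);              /* forward declaration: F refers to E    */

    static int F(void) {             /* F -> ( E ) | digit                    */
        if (*p == '(') { p++; int v = E(); p++; /* skip ')' */ return v; }
        return *p++ - '0';           /* F.val = digit.lexval                  */
    }
    static int T(void) {             /* T -> T * F | F, written as a loop     */
        int v = F();
        while (*p == '*') { p++; v = v * F(); }  /* T.val = T1.val * F.val    */
        return v;
    }
    static int E(void) {             /* E -> E + T | T, written as a loop     */
        int v = T();
        while (*p == '+') { p++; v = v + T(); }  /* E.val = E1.val + T.val    */
        return v;
    }

    int main(void) {
        p = "9+5+2";                 /* assumes well-formed input             */
        printf("%d\n", E());         /* L -> E n : print(E.val), here 16      */
        return 0;
    }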

Depth-First Traversals: Example

Annotated Parse Tree: Example

Exercise:

1) For the synthesized-attribute grammar above that calculates the value of an expression:

a) Verify that, given the expression 5+3*4 followed by a newline, the program prints 17.

b) Draw the decorated parse tree for input: 1*2n

c) Draw the annotated parse tree for input: 5*3+4n

d) Draw the annotated parse tree for input: 5*(3+4)n

2) By making use of the SDD of example 4.2, give annotated parse trees for the following
expressions:

a) (3+4) * (5+6)n

b) 7*5*9*(4+5)n

c) (9+8*(7+6)+5)*4n

Dependency Graphs for Attributed Parse Trees


An annotated parse tree shows the values of attributes; a dependency graph helps us
determine how those values can be computed. The attributes must be evaluated in a
certain order because they depend on one another; this dependency of the attributes is
represented by a dependency graph. There is an edge b(j) → a(i) if and only if there
exists a semantic action such as a(i) := f(... b(j) ...).

Dependency Graphs for Attributed Parse Trees

Annotated Parse Tree: Example

Dependency Graph

Syntax-Directed Definition: Inherited Attributes

Production      Semantic Rules

D → T L         L.inh = T.type
T → int         T.type = integer
T → real        T.type = real
L → L1 , id     L1.inh = L.inh; addtype(id.entry, L.inh)
L → id          addtype(id.entry, L.inh)

Symbol T is associated with a synthesized attribute type. Symbol L is associated with an
inherited attribute inh. Input: real id1, id2, id3

A Dependency Graph – Inherited Attributes

SDD based on a grammar suitable for top-down parsing

Production      Semantic Rules

T → F T′        T′.inh = F.val
                T.val = T′.syn

T′ → * F T1′    T1′.inh = T′.inh × F.val
                T′.syn = T1′.syn

T′ → ε          T′.syn = T′.inh

F → digit       F.val = digit.lexval

The SDD above computes terms like 3 * 5 and 3 * 5 * 7. Each of the non-terminals T and
F has a synthesized attribute val; the terminal digit has a synthesized attribute lexval.

The non-terminal T′ has two attributes:

o an inherited attribute inh and

o a synthesized attribute syn.

Annotated parse tree for 3*5

Dependency graph for the annotated parse tree of 3*5

Exercises

Production      Semantic Rules

N → L1 . L2     N.v = L1.v + L2.v / 2^(L2.l)

L1 → L2 B       L1.v = 2 * L2.v + B.v
                L1.l = L2.l + 1

L → B           L.v = B.v
                L.l = 1

B → 0           B.v = 0

B → 1           B.v = 1

Draw the decorated parse tree and the dependency graph for input:

a) 1011.01

b) 11.1

c) 1001.001

Evaluation Order
A topological sort of a directed acyclic graph (DAG) is any ordering m1, m2, …, mn of
the nodes of the graph such that, if mi → mj is an edge, then mi appears before mj in the
ordering.

Any topological sort of a dependency graph gives a valid evaluation order of the semantic
rules.
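One way to obtain such an order is Kahn's algorithm: repeatedly pick a node with no
unevaluated predecessors. A minimal sketch in C, assuming (hypothetically) a dependency
graph of four attribute instances m1..m4 given as an adjacency matrix:

    #include <stdio.h>
    #define N 4

    /* edge[i][j] = 1 iff attribute i feeds attribute j (here: a simple chain) */
    static const int edge[N][N] = {
        {0,1,0,0}, {0,0,1,0}, {0,0,0,1}, {0,0,0,0}
    };

    int main(void) {
        int indeg[N] = {0}, done[N] = {0};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) indeg[j] += edge[i][j];
        for (int emitted = 0; emitted < N; emitted++)
            for (int i = 0; i < N; i++)
                if (!done[i] && indeg[i] == 0) {       /* all its inputs are ready */
                    printf("evaluate attribute m%d\n", i + 1);
                    done[i] = 1;
                    for (int j = 0; j < N; j++) indeg[j] -= edge[i][j];
                    break;
                }
        return 0;
    }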

Example Parse Tree with Topologically Sorted Actions

S-Attributed Definitions
Syntax-directed definitions are used to specify syntax-directed translations that guarantee
an evaluation order. We would like to evaluate the semantic rules during parsing (i.e., in a
single pass we both parse and evaluate the semantic rules).

We will look at two sub-classes of the syntax-directed definitions:

o S-Attributed Definitions: only synthesized attributes used in the syntax-directed


definitions.

o L-Attributed Definitions: in addition to synthesized attributes, we may also use


inherited attributes.

These classes of SDD can be implemented efficiently in connection with top-down and
bottom-up parsing.

S-Attributed Definitions

A syntax-directed definition that uses synthesized attributes exclusively is called an S-
attributed definition (or S-attributed grammar). A parse tree of an S-attributed definition
can be annotated with a single bottom-up traversal.

A bottom-up parser performs, in effect, a depth-first traversal. A separate stack is
maintained to store the values of the attributes, as in the following example. Yacc/Bison
only support S-attributed definitions.

Example: Attribute Grammar in Yacc

%{
#include <stdio.h>
void yyerror(char *);
%}

%token INTEGER

%%

program:
        program expr '\n'   { printf("%d\n", $2); }
        |
        ;

expr:
        INTEGER             { $$ = $1; }
        | expr '+' expr     { $$ = $1 + $3; }   /* $$ is the synthesized attribute   */
        | expr '-' expr     { $$ = $1 - $3; }   /* of the parent node expr           */
        ;

%%
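A matching Lex specification might look as follows (a sketch in the style of the classic
lex/yacc calculator examples; it assumes the Yacc file is processed with yacc -d so that
y.tab.h defines INTEGER):

    %{
    #include <stdlib.h>
    #include "y.tab.h"
    void yyerror(char *);
    %}
    %%
    [0-9]+      { yylval = atoi(yytext); return INTEGER; }
    [-+\n]      { return *yytext; }
    [ \t]       ;   /* skip whitespace */
    .           { yyerror("invalid character"); }
    %%
    int yywrap(void) { return 1; }

together with, for example,

    void yyerror(char *s) { fprintf(stderr, "%s\n", s); }
    int main(void) { yyparse(); return 0; }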

Bottom-Up Evaluation of S-Attributed Definitions


We put the values of the synthesized attributes of the grammar symbols into a parallel stack.

o When an entry of the parser stack holds a grammar symbol X (terminal or non-
terminal), the corresponding entry in the parallel stack will hold the synthesized
attribute(s) of the symbol X.

We evaluate the values of the attributes during reductions.

A → XYZ    A.a = f(X.x, Y.y, Z.z), where all attributes are synthesized.

Production      Semantic Rules

L → E n         print(val[top-1])

E → E1 + T      val[ntop] = val[top-2] + val[top]    ($$ = $1 + $3; in yacc)

E → T

T → T1 * F      val[ntop] = val[top-2] * val[top]

T → F

F → ( E )       val[ntop] = val[top-1]

F → digit

At each shift of digit, we also push digit.lexval onto the val-stack. At all other shifts, we do
not put anything into the val-stack because the other terminals do not have attributes (but we
increment the stack pointer of the val-stack).

Canonical LR(0) Collection for The Grammar

Bottom-Up Evaluation of S-Attributed Definitions in Yacc: Example

At each shift of digit, we also push digit.lexval into val-stack.

CHAPTER 5
Type checking
Introduction

The compiler must check that the source program follows both the syntactic and semantic
conventions of the source language.

Semantic Checks

o Static – done during compilation

o Dynamic – done during run-time

This checking is called static checking (to distinguish it from dynamic checking executed
during execution of the target program). Static checking ensures that certain kind of
errors will be detected and reported.

Position of type checker

Static versus Dynamic Checking


Static checking: the compiler enforces the programming language's static semantics

o Program properties that can be checked at compile time

Dynamic semantics: checked at run time

o The compiler generates verification code to enforce the programming language's
dynamic semantics

Type checking is one of these static checking operations.

o We may not do all type checking at compile time.

o Some systems also use dynamic type checking.

Why static checking?

Parsing finds syntactic errors

o An input that can't be derived from the grammar

Static checking finds semantic errors

o Calling a function with the wrong number/kind of arguments

o Applying operators to the wrong kinds of arguments

o Using undeclared variables

o Invalid conditions (not Boolean) in conditionals

o inappropriate instructions

 return, break, or continue used in the wrong place

Other Static Checks

A variety of other miscellaneous static checks can be performed

o Check for return statements outside of a function


o Check for case statements outside of a switch statement
o Check for duplicate cases in a case statement
o Check for break or continue statements outside of any loop
o Check for goto statements that jump to undefined labels
o Check for goto statements that jump to labels not in scope

Most such checks can be done using 1 or 2 traversals of (part of) the parse tree

The Need for Type checking

We want to generate machine code

Memory layout

o Different data types have different sizes

 In C, char, short, int, long, float, double usually have different sizes
 Need to allocate different amounts of memory for different types

Choice of instructions

o Machine instructions are different for different types

 add (for i386 ints)


 fadd (for i386 floats)

One important kind of static checking is type checking

o Do operators match their operands? Do types of variables match the values assigned
to them? Do function parameters match the function declarations? Have called
function and variable names been declared?

Not all languages can be completely type checked. All compiled languages must be at least
partially type checked

Type checking can be done bottom up using the parse tree. For convenience, we may create
one or more pseudo-types for error handling purposes

o Error type can be generated when a type checking error occurs

 e.g., adding a number and a string


o Unknown type can be generated when the type of an expression is unknown
 e.g., an undeclared variable

Static checking

Typical examples of static checking are:

o Type checks

o Flow-of-control checks

o Uniqueness checks…

Type checks:

o A compiler should report an error if an operator is applied to an incompatible operand.

Example: an error occurs if an array name is added to an integer variable:

int a, c[10], d;

d = c + d;

Flow of control check:

o Statements that cause flow of control to leave a construct must have some place to which
to transfer the flow of control.

o Example: a break statement in C causes control to leave the smallest enclosing while, for,
or switch statement.

o An error occurs if such an enclosing statement does not exist.

for (i = 0; i < attempts; i++) {

    cout << "Please enter your password: ";

    cin >> password;

    if (verify(password))

        break;   // OK: inside a loop

    cout << "incorrect\n";

}

Flow of control example…

Uniqueness check:

o Variables or objects must be defined exactly once.

o Example: in most programming languages, an identifier must be declared uniquely.

One-Pass versus Multi-Pass Static Checking

One-pass compiler: static checking in C, Pascal, Fortran, and many other languages is
performed in one pass while intermediate code is generated

– Influences design of a language: placement constraints

Multi-pass compiler: static checking in Ada, Java, and C# is performed in a separate phase,
sometimes by traversing a syntax tree multiple times. A separate type-checking pass between
parsing and intermediate code generation.

In this chapter, we focus on type checking. A type checker verifies that the type construct
matches that expected by its context.

For example:

o The type checker should verify that the type of a value assigned to a variable is
compatible with the type of the variable.

o The built-in operator mod requires integer operands.

o Indexing is applied only to arrays.

o Dereferencing is applied only to pointers.

o A user-defined function is applied to the correct number and types of arguments.

Type systems
A type system is a collection of rules for assigning type expressions to the parts of a
program. A type checker implements a type system. A sound type system eliminates run-
time type checking for type errors.

A programming language is strongly-typed, if every program its compiler accepts will


execute without type errors.

o In practice, some of type checking operations is done at run-time (so, most of the
programming languages are not strongly-typed).

o Ex: int x[100]; … x[i] — most compilers cannot guarantee that i will be
between 0 and 99

Type expressions

The type of a language construct is denoted by a type expression.

A type expression is either:

o a basic type, or

o formed by applying an operator called a type constructor to other type expressions.

The following are type expressions:

o A basic type is a type expression.

Example: boolean, char, integer, and real

o A special basic type, type_error, signals an error during type checking.

o A basic type void, denoting "the absence of a value", allows statements to be
checked.

The following are type constructors:

o Arrays: if I is an index set and T is a type expression, then array(I, T) is a type expression.

In Java: int[] A = new int[10];

In C++: int A[10];

In Pascal: var A: array [1..10] of integer; associates the type expression
array(1..10, integer) with A.

o Products: if T1 and T2 are type expressions, then the Cartesian product T1 x T2 is a
type expression. x is left associative.

Example: foo(int, char), int x char — (1,'a'), (2,'b'), …

o Pointers: if T is a type expression, then pointer(T) is a type expression. It denotes the
type "pointer to an object of type T."

var p: ^integer — pointer(integer)

int *a;

o Functions: the type expression of a function has the form D → R, where:

o D is the type expression of the parameters and

o R is the type expression of the returned value.

For example:

o mod has domain type int x int (a pair of integers) and range type int; thus
mod has type int x int → int

The type expression corresponding to the Pascal declaration

function f (a, b : char): ^integer

is char x char → pointer(integer).

A convenient way to represent a type expression is to use a graph (tree or DAG).

For example, the type expression corresponding to the above function declaration can
be represented with the tree shown below:

TE: char x char → pointer(integer)

Example: tree and DAG

int *foo(char *, char *)

Type Expression (summary)

The type of a language construct is denoted by a type expression.

A type expression can be:

o A basic type

 a primitive data type such as integer, real, char, boolean, …

 type-error to signal a type error

 void : no type

o A type name

 a name can be used to denote a type expression.

o A type constructor applies to other type expressions.

 arrays: If T is a type expression, then array(I, T) is a type expression,
where I denotes an index range. Ex: array(0..99, int)

 products: If T1 and T2 are type expressions, then their Cartesian product
T1 x T2 is a type expression. Ex: int x int

 pointers: If T is a type expression, then pointer(T) is a type expression.
Ex: pointer(int)

 functions: We may treat functions in a programming language as a mapping
from a domain type D to a range type R. So, the type of a function can be
denoted by the type expression D → R, where D and R are type expressions.
Ex: int → int represents the type of a function which takes an int value as
parameter, and whose return type is also int.
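One possible concrete representation (a sketch in C, not prescribed by the text) stores a
type expression as a tree whose internal nodes are the type constructors:

    #include <stdlib.h>

    typedef enum { T_BOOLEAN, T_CHAR, T_INTEGER, T_VOID, T_ERROR,
                   T_ARRAY, T_POINTER, T_PRODUCT, T_FUNCTION } Tag;

    typedef struct Type {
        Tag tag;
        int low, high;              /* index range, used only for T_ARRAY       */
        struct Type *left, *right;  /* children, used only by the constructors  */
    } Type;

    static Type *node(Tag t, Type *l, Type *r) {
        Type *n = malloc(sizeof *n);
        n->tag = t; n->low = 0; n->high = 0; n->left = l; n->right = r;
        return n;
    }
    static Type *basic(Tag t)        { return node(t, NULL, NULL); }
    static Type *pointer_to(Type *t) { return node(T_POINTER, t, NULL); }

    /* array(1..10, integer) would be array_of(1, 10, basic(T_INTEGER)) */
    static Type *array_of(int lo, int hi, Type *elem) {
        Type *n = node(T_ARRAY, elem, NULL);
        n->low = lo; n->high = hi;
        return n;
    }

    /* char x char -> pointer(integer): the type of function f(a, b: char): ^integer */
    static Type *f_type(void) {
        Type *dom = node(T_PRODUCT, basic(T_CHAR), basic(T_CHAR));
        return node(T_FUNCTION, dom, pointer_to(basic(T_INTEGER)));
    }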

Specification of a simple type checker


In this section, we specify a type checker for a simple language. The type of each
identifier must be declared before the identifier is used. The type checker is a translation
scheme:

o Synthesizes the type of each expression from the types of its sub expressions.

The type checker can handle:

o arrays,

o pointers,

o statements, and

o Functions.

A Simple Language example


This grammar generates programs, represented by the non-terminal P, consisting of a
sequence of declarations D followed by a single expression E or statement S.

P → D ; E | D ; S

D → D ; D | id : T

T → boolean | char | integer | array [ num ] of T | ^T

E → true | false | literal | num | id | E mod E | E [ E ] | E ^
    | E = E | E + E

S → id := E | if E then S | while E do S | S ; S

One program generated by the grammar is:

key : integer;

key mod 100

Types in the language

The language has three basic types: boolean, char and integer. Type_error is used to signal
errors; void is used to check statements. All arrays start at 1. For example:

array [256] of char leads to the type expression array(1..256, char),

consisting of the constructor array applied to the subrange 1..256 and the type char.
The prefix operator ^ in declarations builds a pointer type, so ^integer leads to the type
expression pointer(integer), consisting of the constructor pointer applied to the type integer.

Specification of a simple type checker

Translation schemes for Declarations

P → D ; E | D ; S

D → D ; D

D → id : T              {addtype(id.entry, T.type)}

T → boolean             {T.type := boolean}

T → char                {T.type := char}

T → integer             {T.type := integer}

T → ^T1                 {T.type := pointer(T1.type)}

T → array [ num ] of T1 {T.type := array(1..num.val, T1.type)}

The purpose of the above semantic actions is:

o to synthesize the type expression corresponding to a declaration of a type and

o to add the type expression to the symbol-table entry corresponding to the variable
identifier.

Translation scheme for type checking of Expressions:

E → true          {E.type := boolean}

E → false         {E.type := boolean}

E → literal       {E.type := char}

E → num           {E.type := integer}

E → id            {E.type := lookup(id.entry)}

E → E1 mod E2     {E.type := if E1.type = integer and E2.type = integer then integer
                   else type_error}

E → E1 [ E2 ]     {E.type := if E2.type = integer and E1.type = array(s, t) then t
                   else type_error}

E → E1 ^          {E.type := if E1.type = pointer(t) then t
                   else type_error}

                  (t is the type of the objects pointed to by the operand)

E → E1 + E2       {if (E1.type = int and E2.type = int) then E.type = int
                   else if (E1.type = int and E2.type = real) then E.type = real
                   else if (E1.type = real and E2.type = int) then E.type = real
                   else if (E1.type = real and E2.type = real) then E.type = real
                   else E.type = type_error}

E → E1 = E2       {E.type := if E1.type = boolean and E2.type = boolean then boolean
                   else type_error}

Translation scheme for type checking of Statements:

S → id := E         {S.type := if id.type = E.type then void
                     else type_error}

S → if E then S1    {S.type := if E.type = boolean then S1.type
                     else type_error}

S → while E do S1   {S.type := if E.type = boolean then S1.type
                     else type_error}

S → S1 ; S2         {S.type := if S1.type = void and S2.type = void then void
                     else type_error}
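Using the tree representation of type expressions sketched in the previous section, two of
these rules might look as follows in C (again only an illustrative sketch):

    /* E -> E1 mod E2 : both operands must be integer */
    static Type *check_mod(Type *e1, Type *e2) {
        if (e1->tag == T_INTEGER && e2->tag == T_INTEGER)
            return basic(T_INTEGER);
        return basic(T_ERROR);                   /* type_error */
    }

    /* S -> if E then S1 : the condition must be boolean */
    static Type *check_if(Type *cond, Type *body) {
        return (cond->tag == T_BOOLEAN) ? body : basic(T_ERROR);
    }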

Exercises:

1) For the translation scheme of the simple type checker presented above, draw the
decorated parse tree for:
a) A: array [1..10] of ^array [1..5] of char
b) A: array [1..10] of ^array [1..5] of char; B: char
Solutions:
1) a)

1) b)

2) For the translation scheme of a simple type checker presented above, draw the
decorated parse tree for:

a: array [1..5] of integer;

b: integer;

c: char;

b=a[1];

if b = a[10] then

b = b + 1;

c = a[5];

CHAPTER 6

Intermediate Code Generation

Intermediate Representations
In a compiler, the front end translates the source program into an intermediate representation,
and the back end generates the target code from this intermediate representation. The benefits
of using a machine-independent intermediate code (IC) are:

o retargeting to another machine is facilitated

o optimization can be done on the machine-independent code

If type checking is done in a separate pass, the compiler is multi-pass; IC generation and
type checking can also be done at the same time, in one pass.

Decisions in IR design affect the speed and efficiency of the compiler. Some important
IR properties:

o Ease of generation

o Ease of manipulation

o Procedure size

o Level of abstraction

The importance of different properties varies between compilers

o Selecting an appropriate IR for a compiler is critical

Intermediate Code Generation

The intermediate language can be any of many different languages; the designer of the
compiler decides which one to use.

o A syntax tree can be used as an intermediate language.

o Postfix notation can be used as an intermediate language.

o Three-address code (quadruples) can be used as an intermediate language.

 We will use three-address code to discuss intermediate code generation.

 Three-address instructions are close to machine instructions, but they are not
actual machine instructions.

o Some programming languages have well defined intermediate languages.

 Java – the Java Virtual Machine

 Prolog – the Warren Abstract Machine

 In fact, there are byte-code emulators to execute instructions in these
intermediate languages.

Types of Intermediate Representations


Three major categories:

o Structural

 Graphically oriented
 Heavily used in source-to-source translators
 Tend to be large
 Examples: Trees, DAG

o Linear

 Pseudo-code for an abstract machine


 Level of abstraction varies
 Simple, compact data structures
 Easier to rearrange
 Examples: 3 address code and Stack machine code

o Hybrid

 Combination of graphs and linear code


 Example: control-flow graph

Intermediate languages
Syntax tree

While parsing the input, a syntax tree can be constructed. A syntax tree (abstract tree) is
a condensed form of parse tree useful for representing language constructs. For example,

for the string a+b, the parse tree in (a) below can be represented by the syntax tree shown
in (b); the keywords (syntactic sugar) that existed in the parse tree will no longer exist in
the syntax tree.

Syntax-Directed Translation of Abstract Syntax Trees


Production      Semantic Rules

S → id := E     S.nptr := mknode(':=', mkleaf(id, id.entry), E.nptr)

E → E1 + E2     E.nptr := mknode('+', E1.nptr, E2.nptr)

E → E1 * E2     E.nptr := mknode('*', E1.nptr, E2.nptr)

E → - E1        E.nptr := mknode('uminus', E1.nptr)

E → ( E1 )      E.nptr := E1.nptr

E → id          E.nptr := mkleaf(id, id.entry) ………..***

Abstract Syntax Trees

Abstract Syntax Trees versus DAGs

Syntax Tree representation

Postfix notation
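For instance, the expression a + b * c is written in postfix notation as a b c * +, while
(a + b) * c becomes a b + c *: the operands appear in order, and each operator follows
its operands, so no parentheses are needed.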

Stack Machine Code


Originally used for stack-based computers, now Java

Example:
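A sketch of stack-machine code for the assignment a := b + c, using generic mnemonics
(the exact instruction names vary by machine; e.g. the JVM uses iload/iadd/istore):

    push b      // push the value of b
    push c      // push the value of c
    add         // pop two values, push their sum
    store a     // pop the result into a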

Three-Address Code

A three address code is: x := y op z where x, y and z are names, constants or compiler-
generated temporaries; op is any operator. But we may also use the following notation for
three address code (much better notation because it looks like a machine code instruction)

op y,z,x apply operator op to y and z, and store the result in x.

We use the term “three-address code” because each statement usually contains three
addresses (two for operands, one for the result).

In three-address code:

o Only one operator at the right side of the assignment is possible, i.e. x + y * z is not
possible

o Similar to postfix notation, the three address code is a linear representation of a


syntax tree.

o It has been given the name three-address code because such an instruction usually
contains three addresses (the two operands and the result)

t1 = y * z

t2 = x + t1

Three-Address Statements
Binary Operator:

op y,z,result or result := y op z

Where op is a binary arithmetic or logical operator. This binary operator is applied to y and z,
and the result of the operation is stored in result.

Ex: add a,b,c

mul a,b,c

addr a,b,c

addi a,b,c

Unary Operator:

op y,, result or result := op y

Where op is a unary arithmetic or logical operator. This unary operator is applied to y, and
the result of the operation is stored in result.

Ex: uminus a,,c

not a,,c

inttoreal a,,c

Copy/ Move Operator:

mov y,,result or result := y where the content of y is copied into result.

Ex: mov a,,c

movi a,,c

movr a,,c

Unconditional Jumps:

jmp ,,L or goto L

We will jump to the three-address code with the label L, and the execution continues from
that statement.

Ex: jmp ,,L1 // jump to L1

jmp ,,7 // jump to the statement 7

Conditional Jumps:

jmprelop y,z,L or if y relop z goto L

We will jump to the three-address code with the label L if the result of y relop z is true, and
the execution continues from that statement. If the result is false, the execution continues
from the statement following this conditional jump statement.

Ex: jmpgt y,z,L1 // jump to L1 if y>z

jmpgte y,z,L1 // jump to L1 if y>=z

jmpe y,z,L1 // jump to L1 if y==z

jmpne y,z,L1 // jump to L1 if y!=z

Our relational operator can also be a unary operator.

jmpnz y,,L1 // jump to L1 if y is not zero

jmpz y,,L1 // jump to L1 if y is zero

jmpt y,,L1 // jump to L1 if y is true

jmpf y,,L1 // jump to L1 if y is false

Procedure Parameters:

param x,,  or  param x        where x is an actual parameter

Procedure Calls:

call p,n,  or  call p,n       invokes the procedure p with n parameters

Ex: param x1,,

    param x2,,

    ...                       i.e., the call p(x1, ..., xn)

    param xn,,

    call p,n,
f(x+1, y)   ⇒   add x,1,t1

                param t1,,

                param y,,

                call f,2,

Indexed Assignments:

move y[i],,x or x := y[i]

move x,,y[i] or y[i] := x

Address and Pointer Assignments:

moveaddr y,,x or x := &y

movecont y,,x or x := *y

Three Address Statements (summary)

o Assignment statements: x := y op z, x := op y

o Indexed assignments: x := y[i], x[i] := y

o Pointer assignments: x := &y, x := *y, *x := y

o Copy statements: x := y

o Unconditional jumps: goto L

o Conditional jumps: if y relop z goto L

o Function calls: param x… call p, n

 return y

Syntax-Directed Translation into Three-Address Code


Syntax directed translation can be used to generate the three-address code. Generally,
either:

o the three-address code is generated as an attribute of the attributed parse tree or

o the semantic actions have side effects that write the three-address code statements
in a file.

When the three-address code is generated, it is often necessary to use temporary variables
and temporary names. The following functions are used to generate 3-address code:
newtemp() - each time this function is called, it gives distinct names that can be used for
temporary variables.

o returns t1, t2,…, tn in response to successive calls

newlabel() - each time this function is called, it gives distinct names that can be used for
label names.

gen() – generates a single three-address statement given the necessary information:

o variable names and operations.

gen will produce a three-address statement after concatenating all the parameters.

For example: if id1.lexeme = x, id2.lexeme = y and id3.lexeme = z:

gen(id1.lexeme, ':=', id2.lexeme, '+', id3.lexeme) will produce the three-
address code: x := y + z

Note: variables and attribute values are evaluated by gen before being concatenated with
the other parameters.
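These helpers might be sketched in C as follows (illustrative only; here gen simply prints
the statement instead of appending it to a code attribute):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *newtemp(void) {            /* t1, t2, ... on successive calls */
        static int n = 0;
        char buf[16];
        sprintf(buf, "t%d", ++n);
        return strdup(buf);                 /* a distinct name on every call   */
    }

    static char *newlabel(void) {           /* L1, L2, ... on successive calls */
        static int n = 0;
        char buf[16];
        sprintf(buf, "L%d", ++n);
        return strdup(buf);
    }

    static void gen(const char *result, const char *left,
                    const char *op, const char *right) {
        printf("%s := %s %s %s\n", result, left, op, right);
    }

    int main(void) {
        char *t1 = newtemp();
        gen(t1, "y", "*", "z");             /* t1 := y * z  */
        gen("x", "x", "+", t1);             /* x := x + t1  */
        return 0;
    }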

Use attributes:

E.place: the name that will hold the value of E.

o Identifiers will be assumed to already have the place attribute defined.

E.code: holds the three-address code statements that evaluate E (this is the
'translation' attribute).

Productions:

S → id := E | while E do S

E → E + E | E * E | - E | ( E ) | id | num

Attributes:

S.code     three-address code for S
S.begin    label for the start of S, or nil
S.after    label for the end of S, or nil
E.code     three-address code for E
E.place    a name holding the value of E

Implementation of Three-Address Statements
The description of three-address instructions specifies the components of each type of
instruction. However, it does not specify the representation of these instructions in a data
structure. In a compiler, these statements can be implemented as objects or as records with
fields for the operator and the operands.

Three such representations are:

o Quadruples

o Triples and

o Indirect triples

Quadruples: A quadruple (or just "quad") has four fields, which we call op, arg1, arg2, and result.

Triples: A triple has only three fields, which we call op, arg1, and arg2.

Indirect Triples: consists of a listing of pointers to triples, rather than a listing of triples
themselves.

The benefit of Quadruples over Triples can be seen in an optimizing compiler, where
instructions are often moved around. With quadruples, if we move an instruction that
computes a temporary t, then the instructions that use t require no change.

With triples, the result of an operation is referred to by its position, so moving an instruction
may require changing all references to that result. This problem does not occur with indirect
triples.
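For example, for x := y * z + y the representations might look as follows (a sketch, using
the temporary names and mov mnemonic introduced earlier; in a triple, (i) refers to the
value computed by triple number i):

Three-address code:
    t1 := y * z
    t2 := t1 + y
    x  := t2

Quadruples:                        Triples:
     op    arg1  arg2  result          op    arg1  arg2
(0)  mul   y     z     t1         (0)  mul   y     z
(1)  add   t1    y     t2         (1)  add   (0)   y
(2)  mov   t2          x          (2)  mov   x     (1)

Indirect triples add a separate instruction listing, e.g. 35: (0), 36: (1), 37: (2); reordering
is then done on that listing, leaving the triples themselves untouched.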

Implementation of Three-Address Statements: Quads

Implementation of Three-Address Statements: Triples

More triplet representations

The major tradeoff between quads and triples is compactness versus ease of manipulation:

o In the past, compile time and space were critical

o Today, speed may be more important

Implementation of Three-Address Statements: Indirect Triples

Exercises

Translate the arithmetic expression a + -(b + c) into

a) A syntax tree and DAG.

b) Quadruples.

c) Triples.

d) Indirect triples, by making use of the translation scheme given in ………..*** above.

Three address code for an assignment statement and an expression

Productions     Semantic actions

S → id := E     S.code := E.code || gen(id.lexeme, ':=', E.place); S.begin = S.after = nil

E → E1 + E2     E.place := newtemp();
                E.code := E1.code || E2.code || gen(E.place, ':=', E1.place, '+', E2.place)

E → E1 * E2     E.place := newtemp();
                E.code := E1.code || E2.code || gen(E.place, ':=', E1.place, '*', E2.place)

E → - E1        E.place := newtemp();
                E.code := E1.code || gen(E.place, ':=', 'uminus', E1.place)

E → ( E1 )      E.place := E1.place
                E.code := E1.code

E → id          E.place := id.lexeme
                E.code := ''   /* empty code */

E → num         E.place := newtemp();
                E.code := gen(E.place, ':=', num.value)

Three address code for flow-of-control statements

S → while E do S1           S.begin = newlabel();
                            S.after = newlabel();
                            S.code = gen(S.begin ':') || E.code ||
                                     gen('if' E.place '=' '0' 'goto' S.after) || S1.code ||
                                     gen('goto' S.begin) ||
                                     gen(S.after ':')

S → if E then S1 else S2    S.else = newlabel();
                            S.after = newlabel();
                            S.code = E.code ||
                                     gen('if' E.place '=' '0' 'goto' S.else) ||
                                     S1.code ||
                                     gen('goto' S.after) ||
                                     gen(S.else ':') || S2.code ||
                                     gen(S.after ':')

E → E1 < E2                 E.place = newtemp();
                            E.code = E1.code || E2.code ||
                                     gen(E.place, '=', E1.place, '<', E2.place)

Code for flow-of-control statements

Syntax-Directed Translation (cont.)

S → while E do S1           S.begin = newlabel();
                            S.after = newlabel();
                            S.code = gen(S.begin ':') || E.code ||
                                     gen('jmpf' E.place ',,' S.after) || S1.code ||
                                     gen('jmp' ',,' S.begin) ||
                                     gen(S.after ':')

S → if E then S1 else S2    S.else = newlabel();
                            S.after = newlabel();
                            S.code = E.code ||
                                     gen('jmpf' E.place ',,' S.else) || S1.code ||
                                     gen('jmp' ',,' S.after) ||
                                     gen(S.else ':') || S2.code ||
                                     gen(S.after ':')

Exercises:

1) Draw the decorated parse tree and generate three-address code by using the translation
schemes given:

a) A := B + C                          d) while a < b do a := a + b

b) A := C * (B + D)                    e) a := b * -c + b * -c

c) while a < b do a := (a + b) * c

Solutions for:
Three address code of A := B + C
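Applying the scheme above (with temporaries supplied by newtemp()), one possible result is:

    t1 := B + C
    A := t1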

Three address code of A := C * (B + D)
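Similarly, one possible result is:

    t1 := B + D
    t2 := C * t1
    A := t2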



Note: please do the remaining questions yourself.



CHAPTER 7

Code Generation

Introduction
Position of a Code Generator

The final phase in our compiler model is code generator. It takes as input the intermediate
representation (IR) produced by the front end of the compiler, along with relevant symbol
table information, and produces as output a semantically equivalent target program.

Requirements imposed on a code generator

o Preserving the semantic meaning of the source program and being of high quality

o Making effective use of the available resources of the target machine

o The code generator itself must run efficiently.

A code generator has three primary tasks:

o Instruction selection, register allocation and assignment, and instruction ordering

Issue in the Design of a Code Generator


General tasks in almost all code generators:

o instruction selection,

o register allocation and assignment and

o instruction ordering

The details are also dependent on:

o the specifics of the intermediate representation,

o the target language, and



o the run-time system.

The most important criterion for a code generator is that it should produce correct code.

Most serious issues in the design of a code generator are:

o Input to the Code Generator

o The Target Program

o Instruction Selection

o Register Allocation

o Choice of Evaluation Order

Input to the Code Generator

The input to the code generator is

o the intermediate representation of the source program produced by the frontend along
with

o information in the symbol table that is used to determine the run-time address of the
data objects denoted by the names in the IR.

Choices for the IR

o Three-address representations: quadruples, triples, indirect triples

o Virtual machine representations: such as byte codes and stack-machine code

o Linear representations: such as postfix notation

o Graphical representation: such as syntax trees and DAG‟s

Assumptions

o The front end has scanned, parsed, and translated the source program into a relatively
low-level IR.

o All syntactic and static semantic errors have been detected.

The Target Program

The most common target-machine architectures are RISC, CISC, and stack based.

o A RISC machine typically has many registers, three-address instructions, simple


addressing modes, and a relatively simple instruction-set architecture.



o A CISC machine typically has few registers, two-address instructions, a variety of
addressing modes, and variable-length instructions.

o In a stack-based machine, operations are done by pushing operands onto a stack and
then performing the operations on the operands at the top of the stack.

Producing the target program as

o Absolute machine code (executable code)

o Relocatable machine code (Object files for linker and loader)

o Assembly language (assembler)

o Byte code forms for interpreters (e.g. JVM)

In this chapter

o Use very simple RISC-like computer as the target machine.


o Add some CISC-like addressing modes
o Use assembly code as the target language.

Instruction Selection

The code generator must map the IR program into a code sequence that can be executed by
the target machine. The complexity of the mapping is determined by factors such as:

o The level of the IR

o The nature of the instruction-set architecture

o The desired quality of the generated code

If the IR is high level, use code templates to translate each IR statement into a sequence of
machine instruction.

o Produces poor code, needs further optimization.

If the IR reflects some of the low-level details of the underlying machine, then it can use this
information to generate more efficient code sequence. The nature of the instruction set of the
target machine has a strong effect on the difficulty of instruction selection. For example,

o The uniformity and completeness of the instruction set are important factors.

o Instruction speeds are another important factor.



o If we do not care about the efficiency of the target program, instruction
selection is straightforward.

Example: consider the following statement: x := x + 1

o Using a general ADD instruction (the straightforward choice) is more costly.

o Using an INC instruction is less costly.

A straightforward translation may not always be the best one; it can lead to unacceptably
inefficient target code.

Suppose we translate three-address code:
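For instance (a sketch using the load/store instructions introduced later in this chapter),
translating the pair of statements a = b + c; d = a + e one statement at a time might give:

    LD  R0, b        // R0 = b
    ADD R0, R0, c    // R0 = R0 + c
    ST  a, R0        // a = R0
    LD  R0, a        // R0 = a  -- redundant: a is already in R0
    ADD R0, R0, e    // R0 = R0 + e
    ST  d, R0        // d = R0

The fourth instruction is redundant, which illustrates why statement-by-statement
translation needs further optimization.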

Register Allocation

Efficient and careful management of registers results in a faster program. A key problem
in code generation is deciding what values to hold in what registers.

o Use of registers imposes two problems:


o Register allocation: select the variables that will reside in registers.
o Register assignment: pick the register that a variable will reside in.

Finding an optimal assignment of registers to variables is mathematically difficult. In


addition, the hardware/OS may require some register usage rules to be followed.

Example:



Choice of Evaluation Order

The order in which computations are performed can affect the efficiency of the target code.
Some computation orders require fewer registers to hold intermediate results than others.
However, Selection of the best evaluation order is also mathematically difficult. When
instructions are independent, their evaluation order can be changed.

A Simple Target Machine Model

Implementing code generation requires complete understanding of the target machine


architecture and its instruction set.

Our (hypothetical) machine:

o Byte-addressable (word = 4 bytes)

o Has n general purpose registers R0, R1, …, Rn-1

o All operands are integers

o Three-address instructions of the form op dest, src1, src2

Assume the following kinds of instructions are available:

o Load operations

o Store operations



o Computation operations

o Unconditional jumps

o Conditional jumps

Load operations

The instruction LD dst, addr loads the value in location addr into location dst. This
instruction denotes the assignment dst = addr. The most common form of this instruction is
LD r, x which loads the value in location x into register r. An instruction of the form LD r1,
r2 is a register-to-register copy in which the contents of register r2 are copied into register r1.

Store operations

The instruction ST x, r stores the value in register r into the location x. This instruction
denotes the assignment x = r.

Computation operations

Has the form OP dst, src1, src2, where OP is an operator like ADD or SUB, and dst, src1,
src2 are locations, not necessarily distinct.

The effect of this machine instruction is to apply the operation represented by OP to the
values in locations src1 and src2, and place the result of this operation in location dst.

For example, SUB r1, r2, r3 computes r1 = r2 – r3; any value formerly stored in r1 is lost, but
if r1 is r2 or r3 the old value is read first. Unary operators that take only one operand do not
have a src2.

Unconditional Jumps

The instruction BR L causes control to branch to the machine instruction with label L. (BR
stands for branch)

Conditional Jumps

Has the form Bcond r, L, where: r is a register, L is a label, and cond is any of the common
tests on values in the register r.

For example: BLTZ r, L causes a jump to label L if the value in register r is less than zero,
and allows control to pass to the next machine instruction if not.

The Target Machine: Addressing Modes


We assume that our target machine has a variety of addressing modes:



o In instructions, a location can be a variable name x referring to the memory location
that is reserved for x.

o Indexed address, a(r), where a is a variable and r is a register.

LD R1, a(R2)        R1 = contents(a + contents(R2))

This addressing mode is useful for accessing arrays.

o A memory location can be an integer indexed by a register, for example,

LD R1, 100(R2) R1 = contents (100 + contents (R2))

Useful for following pointers.

o Two indirect addressing modes: *r and *100(r)

LD R1, *100(R2)     R1 = contents(contents(100 + contents(R2)))

This loads into R1 the value in the memory location stored in the memory location obtained by
adding 100 to the contents of register R2.

o Immediate constant addressing mode: the constant is prefixed by #.

The instruction LD R1, #100 loads the integer 100 into register R1, and ADD R1, R1,
#100 adds the integer 100 to register R1:

R1 = R1 + 100

Comments at the end of instructions are preceded by //.

Op-codes (op), for example:

LD and ST   (move the content of the source to the destination)

ADD         (add the content of the source to the destination)

SUB         (subtract the content of the source from the destination)



A Simple Target language (assembly language)

Example:

Program and Instruction Costs


Cost is associated with compiling and running a program. Its measures are:

o The length of compilation time

o The size, running time, and power consumption of the target program

For simplicity, we take:

o The cost of an instruction = one + the costs associated with the addressing modes
of the operands.

Addressing modes involving:

o only registers have zero additional cost,

o a memory location or a constant have an additional cost of one.
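Under this cost model, for example:

    LD R0, R1          cost = 1   (only registers involved)
    LD R0, M           cost = 2   (one memory operand)
    LD R1, *100(R2)    cost = 2   (the constant 100 adds one)
    LD R0, #1          cost = 2   (the constant operand adds one)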

Examples



