CHAPTER TWO
2. PROGRAM LANGUAGE TRANSLATION
Programming languages are notations for describing computations to people and to machines. A
program must be translated into a form in which it can be executed by a computer. The software
systems that do this translation are known as compilers. This course is about how to design and
implement compilers.
A compiler translates (or compiles) a program written in a high-level programming language that
is suitable for human programmers into the low-level machine language that is required by
computers. During this process, the compiler will also attempt to spot and report obvious
programmer mistakes.
2.1. Language Processors
A compiler is a program that can read a program written in one language, called the source
language, and translate it into an equivalent program in another language, called the target
language (see Figure 1.1). The target program is then given input to produce output.
[Figure 1.1: Compiler. Source program -> Compiler -> Target program; the compiler also emits error messages.]
A compiler also reports any errors in the source program that it detects during the translation process. If
the target program is an executable machine-language program, it can then be called by the user to process
input and produce output.
A compiler acts as a translator, transforming human-oriented programming languages into computer-
oriented machine languages. It is a program that translates an executable program in one language into an
executable program in another language.
[Figure: language-processing pipeline. Source program -> Preprocessor -> Compiler -> Assembler]
A high-level language program is converted into binary machine code in several stages. A
compiler converts the high-level language into assembly language. Similarly, an assembler
converts the assembly language into machine-level language.
Preprocessor: A preprocessor, generally considered a part of the compiler, is a tool that produces
input for compilers. It deals with macro processing, augmentation, file inclusion, language
extension, etc.
When the preprocessor is a separate program, its task is to collect the modules of a program that are
stored in separate files. It may also expand shorthands, called macros, into source-language statements.
The modified source program is then fed to a compiler.
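File inclusion and macro expansion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the directive spellings #include and #define, the in-memory files dictionary (a stand-in for real files), and the function name preprocess are all hypothetical choices made for the example, not part of any particular compiler.

```python
import re

def preprocess(name, files, macros=None):
    """Return the expanded source lines of files[name].

    files maps a file name to its text (hypothetical stand-in for a file system).
    macros maps each macro name seen so far to its replacement text.
    """
    macros = {} if macros is None else macros
    out = []
    for line in files[name].splitlines():
        parts = line.split()
        if parts[:1] == ["#include"]:
            # File inclusion: splice in the expanded text of the named file.
            out.extend(preprocess(parts[1], files, macros))
        elif parts[:1] == ["#define"]:
            # Macro definition: remember the shorthand and its replacement.
            macros[parts[1]] = " ".join(parts[2:])
        else:
            # Macro expansion: replace every whole word that names a macro.
            out.append(re.sub(r"\b\w+\b",
                              lambda m: macros.get(m.group(0), m.group(0)),
                              line))
    return out

files = {
    "rates.h": "#define SECONDS 60",
    "main.src": "#include rates.h\nposition = initial + rate * SECONDS",
}
expanded = preprocess("main.src", files)
```

Here expanded holds the single modified source line that would be handed to the compiler, with SECONDS replaced by 60.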
Interpreter: An interpreter, like a compiler, processes a program written in a high-level
language. The difference lies in the way they read the source code or input. A compiler
reads the whole source code, creates tokens, checks semantics, generates intermediate
code, and translates the whole program, possibly in many passes; it does not execute the program.
In contrast, an interpreter reads a statement from the input, converts it to an intermediate code,
executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops
execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.
An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in the
source program on input supplied by the user.
The machine-language target produced by a compiler is usually much faster than an interpreter at
mapping inputs to outputs. An interpreter can usually give better error diagnostics than a
compiler, because it executes the source program statement by statement. Several other programs
may be needed in addition to a compiler to create an executable program as shown in Figure 1.2.
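The statement-by-statement behaviour of an interpreter can be sketched as follows. The tiny language (statements of the form x = a op b or x = a, with operands separated by blanks) and the names interpret, env and value are assumptions made for this example only.

```python
def interpret(source, env):
    """Execute one statement at a time, updating env (variable name -> value)."""
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

    def value(operand, lineno):
        # An operand is either a known variable or a numeric literal.
        if operand in env:
            return env[operand]
        try:
            return float(operand)
        except ValueError:
            raise NameError(f"line {lineno}: undefined name {operand!r}") from None

    for lineno, stmt in enumerate(source.splitlines(), start=1):
        target, _, expr = stmt.partition("=")
        parts = expr.split()
        if len(parts) == 1:
            result = value(parts[0], lineno)
        else:
            left, op, right = parts
            result = ops[op](value(left, lineno), value(right, lineno))
        # The effect of this statement is visible before the next one is read.
        env[target.strip()] = result
    return env

program = "t = rate * 60\nposition = initial + t\nx = missing + 1\ny = 1"
env = {"initial": 10.0, "rate": 2.0}
try:
    interpret(program, env)
except NameError as err:
    message = str(err)
```

The first two statements have already run when the error in the third is reported, and the fourth is never read. The error message can name the exact statement being executed, which is the sense in which an interpreter can give better diagnostics.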
Assembler: An assembler translates assembly language programs into machine code. The output
of an assembler is called an object file, which contains a combination of machine instructions as
well as the data required to place these instructions in memory.
The compiler may produce an assembly-language program as its output, because assembly
language is easier to produce as an output and easier to debug. The assembly-language program
is then processed by a program called an assembler, which produces relocatable machine code as its
output.
Linker: A linker is a computer program that links and merges various object files together in order
to make an executable file. These files might have been produced by separate assemblers. The
major task of a linker is to search for and locate the referenced modules and routines in a program.
Large programs are often compiled in pieces, so that the relocatable machine code may have to
be linked with other relocatable object files and library files into the code that actually runs on the
machine. The linker resolves external memory addresses, where the code in one file may refer to
a location in another file.
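The resolution of external addresses can be illustrated with a small sketch. The object-file layout used here (a dictionary with code, defs and refs entries) and the function name link are hypothetical simplifications for this example; real object formats are far richer.

```python
def link(objects):
    """Merge object files into one code image and resolve external references.

    Each object is a dict with (all names are assumptions of this sketch):
      code: list of words,
      defs: {symbol: offset of its definition within this object},
      refs: {offset within this object: symbol whose address belongs there}.
    """
    image, symbols, pending = [], {}, []
    for obj in objects:
        base = len(image)                      # where this object lands in the image
        for name, offset in obj["defs"].items():
            symbols[name] = base + offset      # relocate each definition
        for offset, name in obj["refs"].items():
            pending.append((base + offset, name))
        image.extend(obj["code"])
    for address, name in pending:
        if name not in symbols:
            raise KeyError(f"undefined symbol: {name}")
        image[address] = symbols[name]         # patch in the final address
    return image, symbols

main_o = {"code": ["call", None, "halt"], "defs": {"main": 0}, "refs": {1: "square"}}
lib_o = {"code": ["mul", "ret"], "defs": {"square": 0}, "refs": {}}
image, symbols = link([main_o, lib_o])
```

The placeholder in main_o is filled with the address that square receives once lib_o has been placed after main_o in the merged image.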
Loader: The loader is a part of the operating system and is responsible for loading executable files into
memory and executing them. It calculates the size of a program (instructions and data) and creates
memory space for it. It initializes various registers to initiate execution. It puts together all
executable object files into memory for execution.
Principles of Compiler Design (SEng 4031) 2
DMIoT School of Computing Software Engineering Academic Program
[Figure: Lexical Analyzer -> Syntax Analyzer -> Semantic Analyzer -> Code Optimizer -> Code Generator -> Target Program]
[Figure 1.3: Phases of a compiler. Character stream -> Lexical Analyzer -> token stream -> Syntax Analyzer -> syntax tree -> Semantic Analyzer -> syntax tree -> intermediate representation -> Code Generator -> target-machine code]
A typical decomposition of a compiler into phases is shown in the above figures. In practice,
several phases may be grouped together, and the intermediate representations between the
grouped phases need not be constructed explicitly.
The analysis phase consists of the Lexical Analyzer, Syntax Analyzer and Semantic Analyzer. The
synthesis phase comprises the Intermediate Code Generator, Code Optimizer, and Code
Generator.
The symbol table, which stores information about the entire source program, is used by all
phases of the compiler.
The lexical analyzer groups the characters of the source program into meaningful sequences
called lexemes. The text is read and divided into tokens, each of which corresponds to a symbol
in the programming language, e.g., a variable name, keyword or number.
For each lexeme the lexical analyzer produces a token of the form:
<token-name, attribute-value>
that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-
name is an abstract symbol that is used during syntax analysis, and the second component
attribute-value points to an entry in the symbol table for this token. Information from the
symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the following assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract
symbol standing for identifier and 1 points to the symbol table entry for position. The
symbol table entry holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token < = >. Since it needs
no attribute value, the second component is omitted. We could have used any abstract
symbol such as assign for the token-name, but for notational convenience we have
chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that would be mapped into a token <id, 2>, where 2 points to the
symbol table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that would be mapped into a token <id, 3>, where 3 points to the symbol
table entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
The blanks separating the lexemes would normally be eliminated during lexical
analysis.
After lexical analysis, the assignment statement (1.1) is represented by the sequence of tokens
<id, 1> < = > <id, 2> <+> <id, 3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment,
addition, and multiplication operators, respectively.
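The mapping from lexemes to tokens described above can be sketched in Python. The regular expression, the tuple representation of tokens, and the names TOKEN_RE, tokenize and symtab are assumptions of this example; it handles only the identifiers, integers and the three operators that appear in statement (1.1).

```python
import re

# One alternative per kind of lexeme; the group name records which one matched.
TOKEN_RE = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+*])")

def tokenize(text):
    """Return (tokens, symtab) for a source string."""
    symtab = []   # symbol table: entry i (counting from 1) is the i-th distinct identifier
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1                      # blanks between lexemes are discarded
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = m.end()
        lexeme = m.group(0)
        if m.lastgroup == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            # <id, n>: the attribute value points to the symbol-table entry.
            tokens.append(("id", symtab.index(lexeme) + 1))
        else:
            # Operators and numbers use the lexeme itself as the token name.
            tokens.append((lexeme,))
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
```

With this input, tokens corresponds one-to-one to the sequence (1.2), and symtab lists position, initial and rate as entries 1, 2 and 3.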
        =
       / \
 <id, 1>  +
         / \
   <id, 2>  *
           / \
     <id, 3>  60
This tree shows the order in which the operations in the assignment position = initial + rate * 60
are to be performed. The tree has an interior node labeled * with <id, 3> as its left child and the
integer 60 as its right child. The node <id, 3> represents the identifier rate. The node labeled *
makes it explicit that we must first multiply the value of rate by 60. The node labeled + indicates
that we must add the result of this multiplication to the value of initial. The root of the tree,
labeled =, indicates that we must store the result of this addition into the location for the
identifier position. This ordering of operations is consistent with the usual conventions of
arithmetic which tell us that multiplication has higher precedence than addition, and hence that
the multiplication is to be performed before the addition.
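One common way to obtain a tree with this precedence is a recursive-descent parser with one function per precedence level. The sketch below is an assumption-laden illustration, not the method prescribed by these notes: it works directly on blank-separated lexemes rather than on tokens, represents interior nodes as nested tuples (operator, left, right), and uses the hypothetical names parse_assignment, expr, term and factor.

```python
def parse_assignment(lexemes):
    """Parse `id = expr` where expr uses + and *, with * binding tighter than +."""
    pos = 0

    def peek():
        return lexemes[pos] if pos < len(lexemes) else None

    def advance():
        nonlocal pos
        lexeme = peek()
        if lexeme is None:
            raise SyntaxError("unexpected end of input")
        pos += 1
        return lexeme

    def factor():
        # A factor is an identifier or a number (a leaf of the tree).
        lexeme = advance()
        if lexeme in ("=", "+", "*"):
            raise SyntaxError(f"operand expected, found {lexeme!r}")
        return lexeme

    def term():
        # term -> factor (* factor)*   : multiplication is grouped first
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():
        # expr -> term (+ term)*       : addition combines whole terms
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = factor()
    if advance() != "=":
        raise SyntaxError("expected '='")
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r} after expression")
    return tree

tree = parse_assignment("position = initial + rate * 60".split())
```

Because term is called from inside expr, the * node ends up below the + node, which in turn sits below the = root, matching the tree described above.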
2.2.3. Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that each
operator has matching operands. For example, many programming language definitions require
an array index to be an integer; the compiler must report an error if a floating-point number is
used to index an array.
This phase analyses the syntax tree to determine if the program violates certain consistency
requirements, e.g., if a variable is used but not declared or if it is used in a context that does not
make sense given the type of the variable.
The language specification may permit some type conversions called coercions. For example, a
binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point
numbers. If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number.
Suppose that position, initial and rate have been declared to be floating-point numbers, and
lexeme 60 by itself forms an integer. The type checker in the semantic analyzer in Fig. 1.5
discovers that the operator * is applied to a floating-point number rate and an integer 60. In this
case, the integer may be converted into a floating-point number. In Fig. 1.7, notice that the
output of the semantic analyzer has an extra node for the operator inttofloat, which explicitly
converts its integer argument into a floating-point number. The semantic analyzer thus converts
the integer 60 to a floating-point number before * is applied.
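The coercion step can be sketched as a bottom-up walk over an expression tree. This is a minimal illustration under assumptions: trees are nested tuples (operator, left, right) with strings as leaves, only the types int and float exist, the declared types are supplied in a dictionary, and the function name check is hypothetical. It covers the right-hand side of the assignment only.

```python
def check(node, types):
    """Return (typed_tree, type) for an expression tree, inserting inttofloat nodes.

    types maps each declared identifier to "int" or "float".
    """
    if isinstance(node, str):
        if node.isdigit():
            return node, "int"            # an integer literal such as 60
        if node not in types:
            raise NameError(f"undeclared identifier {node!r}")
        return node, types[node]
    op, left, right = node
    left, ltype = check(left, types)
    right, rtype = check(right, types)
    if ltype != rtype:
        # Mixed operands: coerce the integer side to floating point.
        if ltype == "int":
            left, ltype = ("inttofloat", left), "float"
        else:
            right, rtype = ("inttofloat", right), "float"
    return (op, left, right), ltype

types = {"position": "float", "initial": "float", "rate": "float"}
typed, result_type = check(("+", "initial", ("*", "rate", "60")), types)
```

The returned tree has an extra inttofloat node wrapped around 60, as in the output of the semantic analyzer described above, and an undeclared identifier is reported as an error.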
t1 = id3 * 60.0
id1 = id2 + t1 (1.4)
Compiler-construction tools are most useful when they hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include:
1. Scanner generators that produce lexical analyzers from a regular expression description
of the tokens of a language.
2. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
3. Syntax-directed translation engines that produce collections of routines for walking a
parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a
target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values
are transmitted from one part of a program to each other part. Data-flow analysis is a key
part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.