Compiler L 400

A compiler is a computer program that translates a high-level language into a lower-level target language. It takes source code as input and produces an executable program as output. An interpreter translates source code line-by-line without producing an executable file. A translator converts one programming language into another while maintaining the original meaning. The compilation process involves lexical analysis, syntax analysis, semantic analysis, code optimization, and code generation.

Uploaded by

Reindolf Chambas

Compilers and Translators

What is a Translator

What is a Compiler

A compiler is a computer program that translates one language into another.
A compiler takes a program written in the source language and produces an equivalent program written in the target language.

Usually the source language is a high-level language such as C or C++, and the target language is object code (sometimes called machine code) for the target machine, that is, code written in the machine instructions of the computer on which it is to be executed.

Source program -> Compiler -> Target program

A compiler is a fairly complex program, typically between 10,000 and 1,000,000 lines of code.
Writing one, or even understanding one, is not simple.

A translator is a computer program that translates instructions written in one programming language into instructions in another programming language without loss of the original meaning.
Put another way, a translator takes a program in language Q and produces a program in language Q'. Here Q stands for the meaning, which is preserved, and the ' (dash) marks the change of language.
Some advanced translators will even change the logic (not the meaning), or simplify the logic without losing its essence.
Types
1. If the translator translates a high-level language into assembly or machine language, it is called a compiler. Languages that are usually compiled include Ada, ALGOL, BASIC, COBOL, FORTRAN, PL/I and C/C++.
2. If the translator translates a high-level language into an intermediate code that is immediately executed, it is called an interpreter. Languages that are usually interpreted include APL, ASP, CYBOL, LISP, Smalltalk, PHP and Perl.
3. If the compiled program can run on a computer whose CPU or operating system is different from the one on which the compiler runs, the compiler is known as a cross-compiler.
Why Compilers?
Initially, programs were written in machine language: numeric codes that represent the actual machine operations. For example,
C7 06 0000 0002
moves the number 2 to location 0000 (shown in hexadecimal, for Intel 8x86 processors).
It looks easy, right?

Machine language was replaced by assembly language, in which instructions and memory locations are given symbolic forms, e.g.
MOV X, 2
An assembler translates the symbolic codes and memory locations of assembly language into the corresponding numeric codes of machine language.
Assembly language improved the speed and accuracy with which programs could be written.
It still has some defects: it is not easy to write, and it is difficult to read and understand.
- Assembly language is extremely dependent on the particular machine for which it was written.
- Therefore code written for one computer must be completely rewritten for another machine.
Solution?
- Write the operations of a program in a more concise form, nearly resembling mathematical notation or natural language.
- Make that form independent of any one particular machine, yet capable of being translated by a program into executable code.

e.g. the previous assembly language code becomes X = 2

PROGRAMS RELATED TO COMPILERS

Interpreters – an interpreter is a language translator like a compiler. It differs from a compiler in that it executes the source program immediately rather than generating object code that is executed after translation is completed.
Assemblers – an assembler is a translator for the assembly language of a particular computer. Assembly language is a symbolic form of machine language and is quite easy to translate. Sometimes a compiler will generate assembly language as its target language and then rely on an assembler to finish the translation into object code.
Linkers – both compilers and assemblers often rely on a linker, which collects code separately compiled or assembled in different object files into a file that is directly executable.
A linker also connects an object program to the code for standard library functions and to resources supplied by the operating system of the computer, such as memory allocators and I/O devices.
Linkers now perform the task that was originally one of the principal activities of a compiler, hence the word compile: to construct by collecting from different sources.
Editors – compilers usually accept source programs written using any editor that produces a standard file, such as an ASCII file.
More recently, compilers have been bundled together with editors and other programs into an interactive development environment, or IDE.
Such editors are oriented toward the format or structure of the programming language and are called structure based. A structure-based editor can inform the programmer of errors as the program is being written rather than when it is compiled.
Debuggers – a debugger is used to determine execution errors in a compiled program.
It is often part of an IDE. Running a program with a debugger differs from straight execution in that the debugger keeps track of source code information such as line numbers and the names of variables and procedures.
It can also halt execution at pre-specified locations called breakpoints.

Project Managers – a project manager coordinates the merging of separate versions of the same file produced by different programmers. It maintains a history of changes to each of a group of files, so that coherent versions of a program under development can be maintained, e.g. sccs (source code control system) and rcs (revision control system).
THE TRANSLATION PROCESS
The phases of a compiler

Source Code
  -> Scanner
Tokens
  -> Parser
Syntax Tree
  -> Semantic Analyzer
Annotated Tree
  -> Source Code Optimizer
Intermediate Code
  -> Code Generator
Target Code
  -> Target Code Optimizer
Target Code

Auxiliary components that interact with the phases: Literal Table, Symbol Table, Error Handler.
The Scanner – does the actual reading of the source code, which is usually in the form of a stream of characters. It performs lexical analysis: it collects sequences of characters into meaningful units called tokens, which are like the words of a natural language such as English. In that sense the scanner performs a function similar to spelling.
e.g. in a C program: a[index] = 4 + 6
This code contains 12 non-blank characters, but only 8 tokens:
a        identifier
[        left bracket
index    identifier
]        right bracket
=        assignment
4        number
+        plus sign
6        number
Each token consists of one or more characters that are collected into a unit before further processing takes place.
The scanner may enter identifiers into the symbol table, and may enter literals into the literal table.
Literals include numeric constants such as 3.141 and quoted strings of text such as "Hello, World!"
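
The scanner described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular compiler does it: the token names follow the list above, while the function name scan and the table TOKEN_SPEC are made up for this example, and only the tokens needed for a[index] = 4 + 6 are handled.

```python
import re

# (token name, pattern) pairs; names follow the token list in the notes.
# TOKEN_SPEC and scan are hypothetical names chosen for this sketch.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("left bracket", r"\["),
    ("right bracket", r"\]"),
    ("assignment", r"="),
    ("plus sign", r"\+"),
    ("skip", r"\s+"),  # blanks separate tokens but are not tokens themselves
]
MASTER = re.compile(
    "|".join(f"(?P<g{i}>{pat})" for i, (_, pat) in enumerate(TOKEN_SPEC))
)


def scan(source):
    """Collect the characters of source into (text, token name) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        name = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if name != "skip":
            tokens.append((m.group(), name))
        pos = m.end()
    return tokens


tokens = scan("a [index] = 4 + 6")
for text, name in tokens:
    print(f"{text:<6} {name}")
```

Counting the characters in the token texts gives the 12 non-blank characters mentioned above, grouped into 8 tokens.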
The Parser – receives the source in the form of tokens from the scanner and performs syntax analysis, which determines the structure of the program.
This is similar to performing grammatical analysis on a sentence in a natural language. Syntax analysis determines the structural elements of the program as well as their relationships. The results of syntax analysis are usually represented as a parse tree or syntax tree.
e.g. the parse tree for the C statement a[index] = 4 + 6:

expression
  assigned-expression
    expression
      subscript-expression
        expression
          identifier (a)
        [
        expression
          identifier (index)
        ]
    =
    expression
      additive-expression
        expression
          number (4)
        +
        expression
          number (6)
Abstract syntax tree for the same statement:

assigned-expression
  subscript-expression
    identifier (a)
    identifier (index)
  additive-expression
    number (4)
    number (6)
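
As a rough illustration of how a parser could turn the token stream into the abstract syntax tree, here is a minimal recursive-descent sketch. It assumes the tokens have already been produced by a scanner, it handles only statements of the shape identifier[expression] = expression with + as the only operator, and the token kind names and the function parse_assignment are invented for this example.

```python
def parse_assignment(tokens):
    """Build a nested-tuple syntax tree for: identifier [ expr ] = expr."""
    pos = 0

    def peek():
        # Return the kind of the next token, or "end" when none are left.
        return tokens[pos][1] if pos < len(tokens) else "end"

    def expect(kind):
        # Consume one token of the given kind or report a syntax error.
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        text = tokens[pos][0] if pos < len(tokens) else None
        pos += 1
        return text

    def primary():
        if peek() == "number":
            return ("number", int(expect("number")))
        return ("identifier", expect("identifier"))

    def additive():
        node = primary()
        while peek() == "plus":
            expect("plus")
            node = ("additive-expression", node, primary())
        return node

    target = ("identifier", expect("identifier"))
    expect("lbracket")
    index = additive()
    expect("rbracket")
    expect("assign")
    value = additive()
    expect("end")
    return ("assigned-expression",
            ("subscript-expression", target, index),
            value)


# Token stream for a[index] = 4 + 6, written out by hand for this sketch.
tokens = [("a", "identifier"), ("[", "lbracket"), ("index", "identifier"),
          ("]", "rbracket"), ("=", "assign"), ("4", "number"),
          ("+", "plus"), ("6", "number")]
tree = parse_assignment(tokens)
print(tree)
```

The brackets and the = sign guide the parse but do not appear in the result, which is what makes the tree abstract.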

Semantic Analyser
The semantics of a program are its meaning, as opposed to its syntax, or structure.
The semantics of a program determine its runtime behaviour.
Most programming languages have features that can be determined prior to execution and yet cannot be conveniently expressed as syntax and analysed by the parser. These features are called static semantics, and analysing them is the work of the semantic analyser.
The dynamic semantics of a program are those properties that can only be determined by executing it. They cannot be determined by a compiler, since it does not execute the program.
Typical static semantic features of common programming languages include declarations and type checking.

Extra information computed by the semantic analyser, such as data types, is called attributes. Attributes are added to the tree as annotations or decorations.
For the statement a[index] = 4 + 6, the information gathered before analysing this line might be that a is an array of integer values with subscripts from a subrange of the integers, and that index is an integer variable.
The semantic analyser would then annotate the syntax tree with the types of all the expressions and check that the assignment makes sense for these types, declaring a type mismatch error if it does not.
assigned-expression
  subscript-expression : integer
    identifier (a) : array of integer
    identifier (index) : integer
  additive-expression : integer
    number (4) : integer
    number (6) : integer
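
The type annotation and the mismatch check can be sketched as a recursive walk over the tree. This is a simplified illustration under stated assumptions: the tree is written as nested tuples, the only types are "integer" and "array of integer", and the names DECLARED and check are hypothetical.

```python
# Declarations assumed to have been gathered earlier (hypothetical table).
DECLARED = {"a": "array of integer", "index": "integer"}


def check(node):
    """Return the type of a tree node, raising TypeError on a mismatch."""
    kind = node[0]
    if kind == "number":
        return "integer"
    if kind == "identifier":
        if node[1] not in DECLARED:
            raise TypeError(f"undeclared identifier: {node[1]}")
        return DECLARED[node[1]]
    left, right = check(node[1]), check(node[2])
    if kind == "subscript-expression":
        if left != "array of integer" or right != "integer":
            raise TypeError("subscript needs an integer array and an integer index")
        return "integer"
    if kind == "additive-expression":
        if left != "integer" or right != "integer":
            raise TypeError("operands of + must be integers")
        return "integer"
    if kind == "assigned-expression":
        if left != right:
            raise TypeError(f"type mismatch: cannot assign {right} to {left}")
        return left
    raise ValueError(f"unknown node kind: {kind}")


tree = ("assigned-expression",
        ("subscript-expression", ("identifier", "a"), ("identifier", "index")),
        ("additive-expression", ("number", 4), ("number", 6)))
print(check(tree))
```

Assigning a number directly to the array a, rather than to one of its elements, would make check raise a type mismatch error.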
Source Code Optimizer
Individual compilers exhibit wide variation, not only in the kinds of optimisations performed but also in where in the compilation process they are performed.

In the example, the expression 4 + 6 can be precomputed by the compiler to the result 10.
This optimisation is called constant folding.

assigned-expression
  subscript-expression : integer
    identifier (a) : array of integer
    identifier (index) : integer
  number (10) : integer
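
Constant folding on the tree can be sketched as follows. This is a minimal illustration that assumes the tree is written as nested tuples and that + is the only operator; the function name fold is made up for this example.

```python
def fold(node):
    """Replace additions of two constants by a single number node."""
    kind = node[0]
    if kind == "additive-expression":
        left, right = fold(node[1]), fold(node[2])
        if left[0] == "number" and right[0] == "number":
            return ("number", left[1] + right[1])
        return (kind, left, right)
    if kind in ("assigned-expression", "subscript-expression"):
        return (kind, fold(node[1]), fold(node[2]))
    return node  # identifiers and numbers are left unchanged


tree = ("assigned-expression",
        ("subscript-expression", ("identifier", "a"), ("identifier", "index")),
        ("additive-expression", ("number", 4), ("number", 6)))
folded = fold(tree)
print(folded)
```

An addition that involves an identifier, such as index + 1, is left alone because its value is not known at compile time.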
In a linearised form, using a variable t to store the intermediate result, the statement becomes:
t = 4 + 6
a[index] = t
The optimiser would improve this code in two steps:

1. t = 10
   a[index] = t
2. a[index] = 10

The code above is written in three-address code, which is one form of intermediate code, or intermediate representation (IR).
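
The two optimisation steps on three-address code can be sketched like this. The representation is an assumption made for the example: each instruction is a (destination, right-hand side) pair, where the right-hand side is a constant, a name, or an ("+", left, right) tuple, and the function names are hypothetical.

```python
def fold_constants(code):
    """Step 1: evaluate additions whose operands are both constants."""
    folded = []
    for dest, rhs in code:
        if isinstance(rhs, tuple):
            op, left, right = rhs
            if op == "+" and isinstance(left, int) and isinstance(right, int):
                rhs = left + right
        folded.append((dest, rhs))
    return folded


def propagate_temporaries(code, temporaries):
    """Step 2: substitute constant-valued temporaries into later uses."""
    known = {}

    def substitute(value):
        return known.get(value, value) if isinstance(value, str) else value

    result = []
    for dest, rhs in code:
        if isinstance(rhs, tuple):
            rhs = (rhs[0], substitute(rhs[1]), substitute(rhs[2]))
        else:
            rhs = substitute(rhs)
        if dest in temporaries and isinstance(rhs, int):
            known[dest] = rhs  # remember the value and drop the assignment
        else:
            result.append((dest, rhs))
    return result


code = [("t", ("+", 4, 6)), ("a[index]", "t")]
step1 = fold_constants(code)
step2 = propagate_temporaries(step1, temporaries={"t"})
print(step1)
print(step2)
```

The sketch assumes each temporary is assigned exactly once, which holds for the two-line example above.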

Code Generator
Takes the IR and generates code for the target machine.
Most compilers generate object code directly, but here we go through assembly language for ease of understanding.
The properties of the target machine now become a major factor, e.g. the representation of data, such as how many bytes or words variables of integer and floating-point data types occupy in memory.
The way integers are stored also matters for array indexing.
Sample code in a hypothetical assembly language:

MOV R0, index    ;; value of index -> R0
MUL R0, 2        ;; double value in R0
MOV R1, &a       ;; address of a -> R1
ADD R1, R0       ;; add R0 to R1
MOV *R1, 10      ;; constant 10 -> address in R1

&a means the address of a.
*R1 means indirect register addressing: the operand is the memory location whose address is in R1.
Assume the machine performs byte addressing and that integers occupy 2 bytes of memory, hence the use of 2 as the multiplication factor in the 2nd instruction.
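
A code generator for this one kind of statement could be sketched as a function that emits the instruction sequence above. The function name generate_array_store and the element_size parameter are assumptions for this example; the parameter is there to show where a property of the target machine (the size of an integer) enters the generated code.

```python
def generate_array_store(array, index, value, element_size=2):
    """Emit hypothetical assembly for: array[index] = value."""
    return [
        f"MOV R0, {index}",         # value of index -> R0
        f"MUL R0, {element_size}",  # scale index by the size of one element
        f"MOV R1, &{array}",        # address of array -> R1
        "ADD R1, R0",               # address of array[index] -> R1
        f"MOV *R1, {value}",        # constant -> address in R1
    ]


code = generate_array_store("a", "index", 10)
print("\n".join(code))
```

On a machine where integers occupy 4 bytes, calling the function with element_size=4 changes only the multiplication factor.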

Target Code Optimizer – the compiler attempts to improve the target code generated by the code generator. Improvements include:
- choosing addressing modes to improve performance
- replacing slow instructions with faster ones
- eliminating redundant and unnecessary operations
Improving the above code:
- Replace the multiplication instruction in line 2 with a shift instruction.
- Use a more powerful addressing mode, such as indexed addressing, to perform the array store.
The improved code will look like:
MOV R0, index      ;; value of index -> R0
SHL R0             ;; double value in R0
MOV &a[R0], 10     ;; constant 10 -> address a + R0
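
The first of these improvements, replacing a multiplication by 2 with a shift, can be sketched as a simple pass over the instruction list. This is an illustrative sketch only: it works on instruction strings in the hypothetical assembly language used above, and the function name replace_mul_by_shift is made up.

```python
def replace_mul_by_shift(code):
    """Rewrite 'MUL reg, 2' as 'SHL reg'; leave other instructions alone."""
    improved = []
    for instruction in code:
        opcode, _, operands = instruction.partition(" ")
        args = [arg.strip() for arg in operands.split(",")]
        if opcode == "MUL" and len(args) == 2 and args[1] == "2":
            improved.append(f"SHL {args[0]}")  # shift left once doubles the value
        else:
            improved.append(instruction)
    return improved


before = ["MOV R0, index", "MUL R0, 2", "MOV R1, &a", "ADD R1, R0", "MOV *R1, 10"]
after = replace_mul_by_shift(before)
print("\n".join(after))
```

The second improvement, folding the address computation into an indexed store, needs to look at several instructions at once and is not attempted in this sketch.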
Symbol Table – keeps information associated with identifiers: functions, variables, constants, and data types. The symbol table interacts with almost every phase of the compiler: the scanner, parser or semantic analyzer may enter identifiers into the table; the semantic analyzer will add data types and other information; and the optimization and code generation phases will use the information provided by the symbol table.
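
A symbol table can be pictured as a dictionary keyed by identifier name. The sketch below is one simple way to model the interactions just described; the class and method names are hypothetical, and real compilers also have to deal with nested scopes, which this sketch ignores.

```python
class SymbolTable:
    """Maps identifier names to the information known about them."""

    def __init__(self):
        self._entries = {}

    def enter(self, name, kind):
        # Called by the scanner or parser; a repeated name reuses its entry.
        return self._entries.setdefault(name, {"kind": kind, "type": None})

    def set_type(self, name, data_type):
        # Called by the semantic analyzer once the declaration is understood.
        self._entries[name]["type"] = data_type

    def lookup(self, name):
        # Called by later phases such as optimization and code generation.
        return self._entries[name]


table = SymbolTable()
table.enter("a", "variable")
table.enter("index", "variable")
table.set_type("a", "array of integer")
table.set_type("index", "integer")
print(table.lookup("a"))
```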

Literal Table – stores the constants and strings used in the program.
It need not allow deletion, since its data applies globally to the program, and a constant or string appears only once in this table.
The literal table is important in reducing the size of the program in memory by allowing the reuse of constants and strings. It is also needed by the code generator to construct symbolic addresses for literals and for entering data definitions in the target code file.
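
The reuse of constants and strings can be sketched as a table that hands back the same slot whenever the same literal is added again. The class name, method names and the L0, L1, ... label format are assumptions for this example, not a real convention.

```python
class LiteralTable:
    """Stores each distinct literal once; entries are never deleted."""

    def __init__(self):
        self._slots = {}
        self._values = []

    def add(self, value):
        # The type name is part of the key so that 1 and 1.0 stay distinct.
        key = (type(value).__name__, value)
        if key not in self._slots:
            self._slots[key] = len(self._values)
            self._values.append(value)
        return self._slots[key]

    def label(self, value):
        # A symbolic address the code generator could refer to (hypothetical).
        return f"L{self.add(value)}"

    def __len__(self):
        return len(self._values)


literals = LiteralTable()
first = literals.add("Hello, World!")
second = literals.add(3.141)
again = literals.add("Hello, World!")
print(first, second, again, len(literals))
```

Adding the same string twice returns the same slot, so the table holds two entries rather than three.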
Analysis And Synthesis
Compiler operations that analyse the source program to compute its properties are classified as the analysis part.
Operations involved in producing translated code are called the synthesis part.
Which phases belong to each part?
Synthesis – code generation.
Analysis – lexical analysis, syntax analysis and semantic analysis.

Front End and Back End
Front End – operations that depend only on the source language

Back End – operations that depend only on the target language

This is very similar to analysis and synthesis

Source code -> Front End -> Intermediate code -> Back End -> Target code

Regular Expressions
Regular expressions represent patterns of strings of characters.
A regular expression r is completely defined by the set of strings that it matches. This set is called the language generated by the regular expression and is written L(r).
Language here simply means a set of strings over some alphabet of characters, e.g. the ASCII character set.
Basic Regular Expressions – these are just the single characters from the alphabet, which match themselves.
Given any character a from the alphabet ∑, the RE a matches the character a, which is written:
L(a) = {a}     (on the left, a is the character a used as a pattern)
The empty string is a string that contains no characters, written Ɛ. The RE Ɛ matches it:
L(Ɛ) = {Ɛ}
The empty set matches no string at all. It is written { } or ɸ:
L(ɸ) = { }
What is the difference between the empty set and the empty string?
{ }     contains no strings
{Ɛ}     contains a single string, consisting of no characters
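
The same distinction can be seen with Python sets, if a language is modelled as a set of strings and the empty string Ɛ as "". The variable names are chosen for this illustration only.

```python
empty_set = set()            # { }  : a language containing no strings
only_empty_string = {""}     # {Ɛ} : a language containing one string

print(len(empty_set))           # number of strings in { }
print(len(only_empty_string))   # number of strings in {Ɛ}
print("" in empty_set)          # is Ɛ a member of { } ?
print("" in only_empty_string)  # is Ɛ a member of {Ɛ} ?
```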

Regular Expression Operations – there are 3 operations in REs:

1) Choice among alternatives, which is indicated by the metacharacter | (vertical bar)
2) Concatenation, which is indicated by juxtaposition (without a metacharacter)
3) Repetition or "closure", which is indicated by the metacharacter *

Choice Among Alternatives – if r and s are REs, then r|s is an RE which matches any string that is matched either by r or by s.
In terms of languages, the language of r|s is the union of the languages of r and s, or
L(r|s) = L(r) U L(s)
e.g. consider the RE a|b: it matches either of the characters a or b,
i.e. L(a|b) = L(a) U L(b) = {a} U {b} = {a, b}
Also, a|Ɛ matches either the single character a or the empty string (consisting of no characters), i.e. L(a|Ɛ) = {a, Ɛ}
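
The rule L(r|s) = L(r) U L(s) can be checked with a short sketch that models languages as Python sets and the empty string Ɛ as "". The helper names choice and matches are made up for this example; the second half uses Python's re module, where the pattern a| (an empty alternative after the bar) plays the role of a|Ɛ.

```python
import re


def choice(language_r, language_s):
    """L(r|s) is the union of L(r) and L(s)."""
    return language_r | language_s


L_a, L_b, L_empty_string = {"a"}, {"b"}, {""}
print(choice(L_a, L_b))
print(choice(L_a, L_empty_string))


def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("a|b", "a"), matches("a|b", "b"), matches("a|b", "c"))
print(matches("a|", "a"), matches("a|", ""))
```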

ICT 441: COMPILER AND TRANSLATORS
This course introduces the concepts of compilation and illustrates them with a compiler for a small Pascal-like language. It further develops an understanding of how compilers work and a deep understanding of the syntax of programming languages, the efficiency and memory considerations of the available control structures and data types, issues in separate compilation, differences between programming languages, and the implications of processor architecture.
Topics include the compilation process (stages, phases, passes); language definition (syntax, grammar, regular and context-free languages); lexical analysis; parsing; semantic analysis; storage allocation; and code generation.
