CD Unit3,4

The document discusses error handling in compilers. It describes the different types of errors like lexical, syntactic and semantic errors. It also discusses various error recovery techniques used by compilers like panic mode recovery and statement mode recovery. Finally, it explains the purpose of symbol tables in compilers for tasks like scope resolution, type checking and code generation.

Uploaded by Arpit gupta

Compiler Design

Error Detection and Recovery in Compilers

The process of locating errors and reporting them to the user is called the error handling process.
Functions of the error handler:

● Detection
● Reporting
● Recovery

Classification of errors:
1. Compile time
2. Run time

Classification of compile-time errors:

1. Lexical: misspellings of identifiers, keywords or operators
2. Syntactic: a missing semicolon or unbalanced parentheses
3. Semantic: incompatible value assignments or type mismatches between operator and operand
4. Logical: unreachable code, infinite loops

Finding or reporting an error – The viable-prefix property of a parser allows early detection of syntax errors.

● Goal: detect an error as soon as possible, without consuming further unnecessary input.
● How: report an error as soon as the prefix of the input seen so far is not a prefix of any string in the language.
Lexical phase errors

These errors are detected during the lexical analysis phase. Typical lexical errors are:

● Exceeding the maximum length of identifiers or numeric constants
● Appearance of illegal characters
● Unterminated strings

Syntactic phase errors

These errors are detected during the syntax analysis phase. Typical syntax errors are:

● Errors in structure
● Missing operators
● Misspelled keywords
● Unbalanced parentheses
Semantic errors

These errors are detected during the semantic analysis phase. Typical semantic errors are:

● Incompatible types of operands
● Undeclared variables
● Actual arguments that do not match the formal parameters

For example, given int a[10]; the assignment a = b; generates a semantic error because the types of a and b are incompatible.
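A type checker in the semantic analysis phase might flag such an incompatibility. A minimal sketch, using illustrative type names rather than any real compiler's API:

```python
def check_assignment(lhs_type, rhs_type):
    """Return an error message if the types are incompatible, else None."""
    if lhs_type == rhs_type:
        return None
    # An array type cannot be assigned a scalar, and vice versa.
    return f"semantic error: cannot assign {rhs_type} to {lhs_type}"

print(check_assignment("int[10]", "int"))   # incompatible, as in int a[10]; a = b;
print(check_assignment("int", "int"))       # compatible: no error
```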
Error recovery:

1. Panic Mode Recovery
   ○ In this method, successive symbols are removed from the input one at a time until a token from a designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }.
   ○ Advantage: it is easy to implement and is guaranteed not to enter an infinite loop.
   ○ Disadvantage: a considerable amount of input is skipped without being checked for additional errors.
2. Statement Mode Recovery
   ● In this method, when the parser encounters an error, it performs a local correction on the remaining input so that the rest of the statement allows the parser to continue.
   ● The correction may be deleting an extra semicolon, replacing a comma with a semicolon, or inserting a missing semicolon.
   ● While performing a correction, utmost care must be taken not to enter an infinite loop.
   ● Disadvantage: it has difficulty handling situations where the actual error occurred before the point of detection.
3. Error Productions
   ● If the compiler writer knows the common errors that may be encountered, the grammar can be augmented with error productions that generate the erroneous constructs.
   ● During parsing, these productions allow appropriate error messages to be generated while parsing continues.
   ● Disadvantage: the augmented grammar is difficult to maintain.
4. Global Correction
   ● The parser examines the whole program and tries to find the closest error-free match for it.
   ● The closest match is the program requiring the fewest insertions, deletions and changes of tokens to recover from the erroneous input.
   ● Due to its high time and space complexity, this method is not implemented in practice.
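Panic-mode recovery, described above, can be sketched over a token stream. This is a minimal illustration, assuming tokens are plain strings and that ; and } are the synchronizing delimiters:

```python
SYNC_TOKENS = {";", "}"}

def panic_mode_skip(tokens, error_pos):
    """Skip tokens from error_pos until a synchronizing token is found.
    Returns the position just after the synchronizing token, where
    parsing can resume."""
    i = error_pos
    while i < len(tokens) and tokens[i] not in SYNC_TOKENS:
        i += 1                           # discard one symbol at a time
    return min(i + 1, len(tokens))       # resume after the delimiter

tokens = ["x", "=", "@", "#", ";", "y", "=", "2", ";"]
resume = panic_mode_skip(tokens, 2)      # error detected at '@'
print(tokens[resume:])                   # parsing resumes at 'y'
```

Note how everything between the error and the semicolon is skipped unchecked, which is exactly the disadvantage listed above.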
Error recovery actions for semantic errors:

● If an "Undeclared Identifier" error is encountered, the compiler recovers by creating a symbol table entry for the corresponding identifier.
● If the data types of two operands are incompatible, the compiler performs automatic type conversion where possible.

Abstract Syntax Trees

Parse tree representations are not easy for the compiler to work with, as they contain more detail than is actually needed. An abstract syntax tree keeps only the essential structure of a construct, omitting punctuation and single-production chains.
Symbol Table:

A symbol table is an important data structure created and maintained by the compiler to keep track of the semantics of names: it stores scope and binding information about the occurrences of various entities such as variable names, function names, objects, classes, interfaces, etc. The symbol table is used by both the analysis and the synthesis parts of a compiler.

A symbol table may serve the following purposes, depending upon the language in hand:

● To store the names of all entities in a structured form in one place.
● To verify whether a variable has been declared.
● To implement type checking, by verifying that assignments and expressions in the source code are semantically correct.
● To determine the scope of a name (scope resolution).

The symbol table is built during the lexical and syntax analysis phases. The information is collected by the analysis phases of the compiler and used by the synthesis phases to generate code. It helps the compiler achieve compile-time efficiency.

It is used by the various phases of the compiler as follows:
1. Lexical Analysis: creates new entries in the table, for example entries for tokens.
2. Syntax Analysis: adds information regarding attributes such as type, scope, dimension, line of reference and use to the table.
3. Semantic Analysis: uses the information in the table to verify that expressions and assignments are semantically correct (type checking), and updates it accordingly.
4. Intermediate Code Generation: consults the table to know how much run-time storage of which type is allocated, and uses it to record temporary variable information.
5. Code Optimization: uses the information in the table for machine-dependent optimization.
6. Target Code Generation: generates code using the address information of identifiers present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that support the compiler in different phases.

Items stored in the symbol table:

● Variable names and constants
● Procedure and function names
● Literal constants and strings
● Compiler-generated temporaries
● Labels in the source language

Information used by the compiler from the symbol table:

● Data type and name
● Declaring procedure
● Offset in storage
● For structures or records, a pointer to the structure table
● For parameters, whether passing is by value or by reference
● Number and types of arguments passed to a function
● Base address

Implementation of Symbol table –

1. List – In this method, an array is used to store names and their associated information.
● A pointer "available" is maintained at the end of all stored records, and new names are added in the order in which they arrive.
● To search for a name, we scan from the beginning of the list up to the available pointer; if the name is not found, we report the error "use of undeclared name".
● While inserting a new name, we must ensure that it is not already present; otherwise the error "multiply defined name" occurs.
● Appending a record is fast, but because of the duplicate check and the linear scan, insertion and lookup are O(n) on average.
● Its advantage is that it takes a minimal amount of space.
2. Linked List
3. Hash Table
4. Binary Search Tree (BST)
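The list implementation above can be sketched as follows; the class and method names are illustrative, not from any particular compiler:

```python
class ListSymbolTable:
    """Array-based symbol table: names appended in arrival order,
    lookup scans linearly from the beginning."""

    def __init__(self):
        self.records = []   # the "available" pointer is len(self.records)

    def insert(self, name, info):
        if self.lookup(name) is not None:
            raise KeyError(f"multiply defined name: {name}")
        self.records.append((name, info))

    def lookup(self, name):
        for n, info in self.records:   # linear scan: O(n)
            if n == name:
                return info
        return None                    # "use of undeclared name"

table = ListSymbolTable()
table.insert("interest", {"type": "int", "attribute": "static"})
print(table.lookup("interest"))
print(table.lookup("rate"))            # None: undeclared
```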
A symbol table is simply a table, which can be either linear or a hash table. It maintains an entry for each name in the following format:

<symbol name, type, attribute>

For example, if a symbol table has to store information about the following variable declaration:

static int interest;

then it should store an entry such as:

<interest, int, static>

Scope Management
A compiler maintains two types of symbol tables: a global symbol table, which can be accessed by all procedures, and scope symbol tables, which are created for each scope in the program. To determine the scope of a name, the symbol tables are arranged in a hierarchical structure:

● first, a symbol is searched for in the current scope, i.e. the current symbol table;
● if the name is found, the search is complete; otherwise it is searched for in the parent symbol table,
● until either the name is found or the global symbol table has been searched.
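The hierarchical lookup above can be sketched with each scope table holding a link to its parent (names here are illustrative):

```python
class ScopeTable:
    def __init__(self, parent=None):
        self.symbols = {}
        self.parent = parent        # None for the global symbol table

    def define(self, name, info):
        self.symbols[name] = info

    def resolve(self, name):
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent    # search the enclosing scope
        return None                 # not found, even in the global table

glob = ScopeTable()
glob.define("g", "int")
local = ScopeTable(parent=glob)
local.define("x", "float")
print(local.resolve("x"))   # found in the current scope
print(local.resolve("g"))   # found via the parent (global) table
print(local.resolve("y"))   # None: undeclared everywhere
```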

A program is a sequence of instructions combined into a number of procedures. The execution of a procedure is called its activation. An activation record contains all the information required to call a procedure. An activation record may contain the following units (depending upon the source language used):

● Temporaries: stores temporary and intermediate values of expressions.
● Local Data: stores the local data of the called procedure.
● Machine Status: stores the machine status, such as registers and the program counter, before the procedure is called.
● Control Link: stores the address of the activation record of the caller procedure.
● Access Link: stores information about data that is outside the local scope.
● Actual Parameters: stores the actual parameters, i.e., the parameters used to send input to the called procedure.
● Return Value: stores the return value.


Whenever a procedure is executed, its activation record is stored on the stack, also known as the control stack.

We assume that program control flows sequentially and that, when a procedure is called, control is transferred to the called procedure. When the called procedure finishes, it returns control to the caller. This type of control flow makes it easy to represent a series of activations as a tree, known as the activation tree.

To understand this concept, we take a piece of code as an example:

. . .
printf("Enter Your Name: ");
scanf("%s", username);
show_data(username);
printf("Press any key to continue...");
. . .

int show_data(char *user)
{
    printf("Your name is %s", user);
    return 0;
}
Storage Allocation
The runtime environment manages runtime memory requirements for the following entities:

● Code: known as the text part of a program, it does not change at runtime, so its memory requirements are known at compile time.
● Procedures: their text part is static, but they are called in an unpredictable order. That is why stack storage is used to manage procedure calls and activations.
● Variables: variables are known only at runtime, unless they are global or constant. The heap memory allocation scheme is used for managing allocation and de-allocation of memory for such variables at runtime.

Static Allocation
In this allocation scheme, the compilation data is bound to a fixed location in memory and does not change while the program executes. As the memory requirements and storage locations are known in advance, no runtime support package for memory allocation and de-allocation is required.

Stack Allocation
Procedure calls and their activations are managed by means of stack memory allocation. It works in a last-in-first-out (LIFO) manner, and this allocation strategy is very useful for recursive procedure calls.

Heap Allocation
Variables local to a procedure are allocated and de-allocated only at runtime. Heap allocation is used to dynamically allocate memory to such variables and reclaim it when they are no longer required.

Unlike the statically allocated memory area, both the stack and the heap can grow and shrink dynamically and unpredictably, so they cannot be given a fixed amount of memory. They are therefore arranged at opposite extremes of the total memory allocated to the program and grow toward each other.
Parameter Passing

r-value

The value of an expression is called its r-value. The value contained in a variable also becomes an r-value when the variable appears on the right-hand side of an assignment operator.

l-value

The memory location (address) where an expression's value is stored is known as the l-value of that expression. It appears on the left-hand side of an assignment operator.

Code Optimization

Optimization is a program transformation technique that tries to improve the code by making it consume fewer resources (i.e. CPU, memory) and run faster.

In optimization, high-level general programming constructs are replaced by very efficient low-level programming code. A code-optimizing process must follow the three rules given below:

● The output code must not, in any way, change the meaning of the program.
● Optimization should increase the speed of the program and, if possible, the program should demand fewer resources.
● Optimization should itself be fast and should not delay the overall compilation process.

Efforts toward optimized code can be made at various levels of the compilation process:
● At the beginning, users can change or rearrange the code, or use better algorithms to write it.
● After generating intermediate code, the compiler can modify the intermediate code by improving address calculations and loops.
● While producing the target machine code, the compiler can make use of the memory hierarchy and CPU registers.

Optimization can be categorized broadly into two types: machine-independent and machine-dependent.

Machine-independent Optimization

In this optimization, the compiler takes in the intermediate code and transforms the part of the code that does not involve any CPU registers and/or absolute memory locations. For example:

do
{
    item = 10;
    value = value + item;
} while (value < 100);

This code involves the repeated assignment of the identifier item, which can be moved out of the loop:

item = 10;
do
{
    value = value + item;
} while (value < 100);

Machine-dependent Optimization
Machine-dependent optimization is done after the target code has been generated, when the code is transformed according to the target machine architecture. It involves CPU registers and may use absolute memory references rather than relative ones. Machine-dependent optimizers strive to take maximum advantage of the memory hierarchy.

Basic Blocks

A source program contains sequences of instructions that are always executed one after another; such a maximal straight-line sequence is called a basic block. There are no jump statements within a basic block: control enters at its first instruction and leaves only at its last.

Basic blocks are important concepts from both the code generation and the optimization points of view. They play an important role in identifying variables that are used more than once within a single block: if a variable is used more than once, the register allocated to it need not be freed until the block finishes execution.
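One common way to partition three-address code into basic blocks is the "leader" rule: the first instruction, any labeled instruction (a jump target), and any instruction immediately following a jump each start a new block. A sketch, under the assumption that instructions are plain strings and labels are written as "L<n>: ...":

```python
def split_basic_blocks(instructions):
    """Split a list of three-address instructions into basic blocks
    using the leader rule."""
    leaders = {0}                          # first instruction is a leader
    for i, ins in enumerate(instructions):
        if ins.startswith("L"):            # labeled instruction: jump target
            leaders.add(i)
        if "goto" in ins and i + 1 < len(instructions):
            leaders.add(i + 1)             # instruction after a jump
    order = sorted(leaders)
    blocks = []
    for j, start in enumerate(order):
        end = order[j + 1] if j + 1 < len(order) else len(instructions)
        blocks.append(instructions[start:end])
    return blocks

code = ["t = 0", "L1: t = t + 1", "if t < 10 goto L1", "x = t"]
for b in split_basic_blocks(code):
    print(b)
```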

Control Flow Graph

The basic blocks of a program can be represented by means of a control flow graph. A control flow graph depicts how program control passes among the blocks. It is a useful tool that aids optimization by helping to locate unwanted loops in the program.
Loop Optimization

Most programs spend the bulk of their time in loops, so it is important to optimize loops in order to save CPU cycles and memory. Loops can be optimized using the following techniques:

● Invariant code: a fragment of code that resides in the loop and computes the same value at each iteration is called loop-invariant code. It can be moved out of the loop and computed only once, rather than at each iteration.
● Induction analysis: a variable is called an induction variable if its value is altered within the loop by a loop-invariant amount on each iteration.
● Strength reduction: some expressions consume more CPU cycles, time, and memory than necessary. These expressions should be replaced with cheaper expressions without changing the result. For example, multiplication (x * 2) is more expensive in CPU cycles than a shift (x << 1), yet yields the same result.
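Loop-invariant code motion, the first technique above, can be illustrated as a before/after pair (the function names are illustrative):

```python
# Before: the invariant expression y * z is recomputed on every iteration.
def before(n, y, z):
    total = 0
    for i in range(n):
        c = y * z          # loop-invariant: its operands never change
        total += c + i
    return total

# After: the invariant computation is hoisted out of the loop.
def after(n, y, z):
    c = y * z              # computed once, before the loop
    total = 0
    for i in range(n):
        total += c + i
    return total

assert before(5, 3, 4) == after(5, 3, 4)
```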

Dead-code Elimination

Dead code consists of one or more statements that are:

● either never executed (unreachable),
● or, if executed, compute a value that is never used.

Thus, dead code plays no role in any program operation and can simply be eliminated.

Partially dead code

Some statements compute values that are used only under certain circumstances, i.e., sometimes the values are used and sometimes they are not. Such code is known as partially dead code.

Consider, for instance, a fragment in which a variable 'a' is assigned the result of the expression 'x * y' before a loop, but that value of 'a' is never used inside the loop; immediately after control leaves the loop, 'a' is assigned the value of a variable 'z', which is used later in the program. The first assignment to 'a' is therefore never used anywhere and is eligible for elimination.

Partial Redundancy

An expression is redundant if it is computed more than once along parallel paths without any change in its operands, whereas a partially redundant expression is computed more than once along some, but not all, paths without any change in its operands.

Directed Acyclic Graph

A Directed Acyclic Graph (DAG) is a tool that depicts the structure of a basic block, helps to see the flow of values among its statements, and supports optimization. A DAG makes it easy to transform a basic block:

● Leaf nodes represent identifiers, names or constants.
● Interior nodes represent operators.
● Interior nodes also represent the results of expressions, or the identifiers/names where the values are to be stored or assigned.

Example:

t0 = a + b
t1 = t0 + c
d = t0 + t1
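DAG construction for the example above can be sketched by interning each (operator, left, right) triple, so that a repeated expression maps to an existing node rather than a new one:

```python
def dag_node(table, op, left, right):
    """Return the node for (op, left, right), creating it only once."""
    key = (op, left, right)
    if key not in table:
        table[key] = f"n{len(table)}"   # fresh interior node label
    return table[key]

nodes = {}
t0 = dag_node(nodes, "+", "a", "b")     # t0 = a + b
t1 = dag_node(nodes, "+", t0, "c")      # t1 = t0 + c
d  = dag_node(nodes, "+", t0, t1)       # d  = t0 + t1 (reuses the t0 node)
again = dag_node(nodes, "+", "a", "b")  # same triple: no new node
print(t0 == again)   # True: a + b is represented by a single shared node
print(len(nodes))    # 3 interior nodes for 3 distinct expressions
```

This sharing is what exposes common subexpressions for elimination within the block.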

Peephole Optimization

This optimization technique works locally on the code to transform it into optimized code. By locally, we mean a small portion of the code block at hand. These methods can be applied to intermediate code as well as to target code. A small window of statements is analyzed and checked for the following possible optimizations:

Redundant instruction elimination

At the source code level, the user can perform the following successive simplifications:

int add_ten(int x)
{
    int y, z;
    y = 10;
    z = x + y;
    return z;
}

int add_ten(int x)
{
    int y;
    y = 10;
    y = x + y;
    return y;
}

int add_ten(int x)
{
    int y = 10;
    return x + y;
}

int add_ten(int x)
{
    return x + 10;
}

At the compilation level, the compiler searches for redundant instructions. Multiple load and store instructions may carry the same meaning even if some of them are removed. For example:

MOV x, R0
MOV R0, R1

We can delete the first instruction and rewrite the pair as:

MOV x, R1
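The MOV-pair rewrite above can be sketched as a peephole pass over a list of instruction strings. This is a simplification: the rewrite is only valid if R0 is not used afterwards, which the sketch assumes rather than checks:

```python
def peephole_mov(instrs):
    """Collapse 'MOV a, R; MOV R, S' into 'MOV a, S' (assumes R is dead)."""
    out = []
    i = 0
    while i < len(instrs):
        if (i + 1 < len(instrs)
                and instrs[i].startswith("MOV ")
                and instrs[i + 1].startswith("MOV ")):
            src, dst = [s.strip() for s in instrs[i][4:].split(",")]
            src2, dst2 = [s.strip() for s in instrs[i + 1][4:].split(",")]
            if dst == src2:                  # value flows straight through
                out.append(f"MOV {src}, {dst2}")
                i += 2
                continue
        out.append(instrs[i])
        i += 1
    return out

print(peephole_mov(["MOV x, R0", "MOV R0, R1"]))   # ['MOV x, R1']
```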

Unreachable code
Unreachable code is a part of the program that is never executed because of the programming constructs around it. Programmers may accidentally write a piece of code that can never be reached.

Example:

int add_ten(int x)
{
    return x + 10;
    printf("value of x is %d", x);
}

In this code segment, the printf statement will never be executed, as program control returns before it can execute; hence printf can be removed.

Flow of control optimization

There are instances in code where program control jumps back and forth without performing any significant task. Such jumps can be removed. Consider the following chunk of code:

...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1

In this code, the jump to label L1 can be redirected, as L1 merely passes control to L2. Instead of jumping to L1 and then to L2, control can reach L2 directly, as shown below:

...
MOV R1, R2
GOTO L2
...
L2 : INC R1
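This jump-threading rewrite can be sketched as a pass that follows chains of unconditional GOTOs to their final destination (instruction syntax as in the example above; removing the now-unused label is left out of the sketch):

```python
def thread_jumps(instrs):
    """Redirect each GOTO whose target is itself an unconditional GOTO."""
    # Map each label to the label it immediately jumps to, e.g. L1 -> L2.
    trampoline = {}
    for ins in instrs:
        if ":" in ins and "GOTO" in ins:
            label, target = ins.split(":")
            trampoline[label.strip()] = target.split("GOTO")[1].strip()
    out = []
    for ins in instrs:
        if ins.startswith("GOTO"):
            target = ins.split()[1]
            while target in trampoline:      # follow the chain to its end
                target = trampoline[target]
            out.append(f"GOTO {target}")
        else:
            out.append(ins)
    return out

code = ["MOV R1, R2", "GOTO L1", "L1 : GOTO L2", "L2 : INC R1"]
print(thread_jumps(code))
```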

Algebraic expression simplification

There are occasions where algebraic expressions can be simplified. For example, the expression a = a + 0 can be replaced by a itself, and the expression a = a + 1 can simply be replaced by INC a.

Strength reduction
Some operations consume more time and space than others. Their 'strength' can be reduced by replacing them with operations that consume less time and space but produce the same result.

For example, x * 2 can be replaced by x << 1, which involves only one left shift. Similarly, although a * a and a² produce the same value, computing a * a directly is much more efficient than calling a general power routine.

Accessing machine instructions

The target machine may provide more sophisticated instructions that can perform specific operations much more efficiently. If the target code can use those instructions directly, it will not only improve the quality of the code but also yield more efficient results.

Code Generator
A code generator is expected to have an understanding of the target machine's runtime environment and its instruction set. The code generator should take the following into consideration:

● Target language: the code generator has to be aware of the nature of the target language into which the code is to be transformed. That language may provide machine-specific instructions that help the compiler generate code in a more convenient way. The target machine may have either a CISC or a RISC processor architecture.
● IR type: intermediate representation has various forms. It can be an Abstract Syntax Tree (AST) structure, Reverse Polish Notation, or 3-address code.
● Selection of instructions: the code generator takes the intermediate representation as input and converts (maps) it into the target machine's instruction set. One representation can be converted in many ways (instruction sequences), so it is the responsibility of the code generator to choose the appropriate instructions wisely.
● Register allocation: a program has a number of values to be maintained during execution. The target machine's architecture may not allow all of them to be kept in registers, so the code generator decides which values to keep in registers and which registers to use for them.
● Ordering of instructions: finally, the code generator decides the order in which the instructions will be executed, creating a schedule for them.

Descriptors
The code generator has to track both the registers (for availability) and addresses (locations of values) while generating code. For this, the following two descriptors are used:

● Register descriptor: informs the code generator about the availability of registers by keeping track of the value stored in each register. Whenever a new register is required during code generation, this descriptor is consulted.
● Address descriptor: the values of the names (identifiers) used in the program may be stored at different locations during execution. Address descriptors keep track of the locations where the values of identifiers are stored. These locations may include CPU registers, the heap, the stack, memory, or a combination of these.

The code generator keeps both descriptors updated as it emits code. For a load statement LD R1, x, it:
● updates the register descriptor of R1 to record that it holds the value of x, and
● updates the address descriptor of x to show that one instance of x is in R1.

Code Generation
Basic blocks consist of sequences of three-address instructions, and the code generator takes these sequences as input.

getReg: the code generator uses the getReg function to determine the status of the available registers and the locations of name values. getReg works as follows:

● If variable y is already in register R, it uses that register.
● Else, if some register R is available, it uses that register.
● Else, if neither of the above is possible, it chooses a register that requires a minimal number of load and store instructions.
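The three cases of getReg above can be sketched over a register descriptor, here modeled as a dictionary mapping register names to the value they hold (None when free); the spill choice in the last case is deliberately naive:

```python
def get_reg(registers, y):
    """Pick a register for y following the getReg policy."""
    # 1. If y already lives in some register, reuse it.
    for r, value in registers.items():
        if value == y:
            return r
    # 2. Otherwise, pick any free register.
    for r, value in registers.items():
        if value is None:
            return r
    # 3. Otherwise a register must be spilled; here we naively pick the
    #    first one (a real allocator would minimise loads and stores).
    return next(iter(registers))

regs = {"R0": "x", "R1": None}
print(get_reg(regs, "x"))   # R0: x already lives there
print(get_reg(regs, "y"))   # R1: a free register
```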

For an instruction x = y OP z, the code generator may perform the following actions. Let us assume that L is the location (preferably a register) where the result of y OP z is to be saved:

● Call the function getReg to decide the location L.
● Determine the present location (register or memory) of y by consulting the address descriptor of y. If y is not presently in L, generate the following instruction to copy the value of y to L:

MOV y', L

where y' represents the current copy of y.

● Determine the present location of z using the same method as in the previous step, and generate:

OP z', L

where z' represents the current copy of z.

● Now L contains the value of y OP z, which is to be assigned to x. If L is a register, update its descriptor to indicate that it contains the value of x, and update the descriptor of x to indicate that it is stored at location L.

● If y and z have no further use, their registers can be released back to the allocator.
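The steps above can be sketched with dictionary-based register and address descriptors; the descriptor layout and the fixed choice of R0 as L are illustrative assumptions:

```python
def gen_binary(x, y, op, z, reg_desc, addr_desc):
    """Emit code for x = y OP z, updating both descriptors."""
    code = []
    L = "R0"                                  # assume getReg chose R0
    if reg_desc.get(L) != y:                  # step 2: bring y into L
        code.append(f"MOV {addr_desc[y]}, {L}")
    code.append(f"{op} {addr_desc[z]}, {L}")  # step 3: apply the operator
    reg_desc[L] = x                           # step 4: L now holds x
    addr_desc[x] = L                          #         and x lives in L
    return code

reg_desc = {"R0": None}
addr_desc = {"y": "mem_y", "z": "mem_z"}
for line in gen_binary("x", "y", "ADD", "z", reg_desc, addr_desc):
    print(line)
print(addr_desc["x"])   # R0
```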
