
MODULE-2: INTRODUCTION, LEXICAL ANALYSIS

SYLLABUS:
Language processors; The structure of a Compiler; The evolution of programming
languages; The science of building a Compiler; Applications of compiler technology;
Programming language basics.
Lexical analysis: The Role of Lexical Analyzer; Input Buffering; Specifications of
Tokens; Recognition of Tokens, Lexical Analyzer Generator and Finite Automata.

Overview of Compilers

A compiler is system software that takes a source program as input and converts it into a
target program. Source programs are independent of the machine on which they are
executed, whereas target programs are machine dependent. The source program can be written in any high-level
language like C, C++, etc., and the target program may be in assembly language or
machine language.

To solve any problem using a computer, it is required to generate a set of instructions to
solve the problem. These instructions can be in machine-level code, assembly-level language
or a high-level language. Machine-level code consists of instructions written in machine code,
i.e., sequences of 0s and 1s.

Example: A3 06 0000 0075

The above instruction is used to move contents from register AX to BX. These codes are
easily understood by the computer system and hence their execution is faster; they can be
executed on the system without any intermediary software. However, it is difficult for the
programmer to read, write and debug instructions in machine code.

In assembly level language, instruction consists of mnemonics and operands.

Example: MOV AX,BX

The above instruction moves the contents of register AX to BX. These instructions are
written based on the number and type of general-purpose registers available, the addressing
modes and the organization of memory. Though assembly code is easier to read and write when
compared to machine code, the programmer must know how to use the registers efficiently and
choose appropriate instructions for faster execution and better utilization of memory. It
requires intermediary software called an assembler, which converts assembly code to
machine code before execution; hence it is slower when compared to machine code.

In high level language, instructions are defined using programming language like C, C++,
Java etc.,

Example: c=a+b
This instruction adds the values of variables a and b, and stores the result in c. Such instructions are
very easy for the programmer to read, write and debug, but are difficult to understand for the


system. Hence intermediary software is required to convert the instructions to machine code
and also check for errors. Compilers serve this purpose: they convert a program written in a high-level
language to assembly-level language or machine language.
Compilers can also be used to convert code written in one high-level language to another, like
‘C’ to Java or Pascal to ‘C’.

1.1 Language Processors


A compiler is a program that can read a program in one language, i.e., the source
language, and translate it into an equivalent program in another language, i.e., the target
language.
The pictorial representation of a compiler is given in the text book.

An important role of the compiler is to report any errors in the source program that it
detects during the translation process.
If the target program is an executable machine-language program, it can then be called
by the user to process inputs and produce outputs.
Running of the target program is likewise shown in the text book.

An interpreter is another common kind of language processor. Instead of producing


a target program as a translation, an interpreter appears to directly execute the operations
specified in the source program on inputs supplied by the user.
The pictorial representation of an interpreter is given in the text book.

The machine language target program produced by a compiler is usually much faster
than an interpreter at mapping inputs to outputs. An interpreter can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.

Example:
Java language processors combine compilation and interpretation as shown below.

A Java source program may first be compiled into an intermediate form called
bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this
arrangement is that bytecodes compiled on one machine can be interpreted on another
machine, perhaps across a network.


In order to achieve faster processing of inputs to outputs, some Java compilers, called
just-in-time compilers, translate the bytecodes into machine language immediately before
they run the intermediate program to process the input.

A block diagram of the language processing system is as shown below.

Source program -> Preprocessor -> Modified source program -> Compiler -> Target assembly
program -> Assembler -> Relocatable machine code -> Linker/Loader -> Target machine code
(The linker/loader also takes library files and relocatable object files as input.)

The different steps involved in converting instructions in high-level code to machine-level
code are collectively called language processing. The different components involved in
language processing are

a. Preprocessor

b. Compiler

c. Assembler

d. Linker /loader

Preprocessor

The first step in language processing is preprocessing. The input to this phase is the source
program. Different parts of the source program may be stored in different files; for example, a
function definition may be in one file and the main program in another file.
The preprocessor collects all these files and creates a single file. It also performs macro
expansion. Macros are small sets of instructions written to perform specific operations; in C,
#define and #include are expanded during preprocessing. Some preprocessors also delete
comments from the source program.
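For instance, a small C fragment before preprocessing might look like the sketch below (the macro names and values are made up for illustration). The preprocessor replaces the #include line with the full contents of stdio.h, expands every use of the macros textually, and strips the comments.

#include <stdio.h>               /* replaced by the full contents of stdio.h  */
#define PI 3.14159               /* object-like macro                         */
#define AREA(r) (PI * (r) * (r)) /* function-like (parameterized) macro       */

int main(void)
{
    /* After preprocessing, the next line reads:
       printf("%f\n", (3.14159 * (2.0) * (2.0)));                             */
    printf("%f\n", AREA(2.0));
    return 0;
}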

Compiler

The compiler takes the pre-processed file and generates assembly-level code. It also builds a
symbol table and a literal table. The compiler has an error handler which displays error messages and
performs some error recovery if necessary. To reduce execution time and improve the
utilization of memory, the compiler generates an intermediate form of the
code and optimizes it. The functionality of the compiler is divided into multiple phases; each


phase performs a specific set of operations; for example, the lexical analyzer generates tokens and the
code optimizer improves the intermediate code.

Assembler

The assembler takes assembly code as input and converts it into relocatable object code.
An instruction in assembly code has two parts: an opcode part and an operand part.
The opcode specifies the type of operation, like ADD for addition, SUB for subtraction, INC for
increment, etc. The operand part consists of the operands on which the operation is
to be applied; these operands may be memory locations, registers or immediate data.
Assemblers may be single-pass or two-pass. In a single-pass assembler, reading the
assembly code, generation of the symbol table and conversion of opcodes to machine instructions
are all done in a single pass. In a two-pass assembler, the first pass reads the input file and stores
the identifiers in the symbol table; in the second pass, it translates opcodes to sequences of bits
(machine code or relocatable code) with the help of the symbol table.

Linker/Loader

In the final step, an executable code is generated with the help of linker and loader.
Linkers are used to link system wide libraries and resources supplied by operating system
such as I/O devices, memory allocator etc. Loader resolves all relocatable address relative to
starting address and produces absolute executable code.

1.2 The Structure of a Compiler


Up to this point we have treated a compiler as a single box that maps a source
program into a semantically equivalent target program. If we open up this box a little, we see
that there are two parts to this mapping: analysis and synthesis.

The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate
representation of the source program. If the analysis part detects that the source program is
either syntactically ill-formed or semantically unsound, then it must provide informative
messages, so the user can take corrective action.

The analysis part also collects information about the source program and stores it in a
data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.

The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table.

The analysis part is often called the front end of the compiler. The synthesis part is
the back end of the compiler.


The different phases of a compiler are as shown below.

Character stream
-> Lexical Analyzer -> token stream
-> Syntax Analyzer -> syntax tree
-> Semantic Analyzer -> syntax tree
-> Intermediate Code Generator -> intermediate representation
-> Machine-Independent Code Optimizer -> intermediate representation
-> Code Generator -> target-machine code
-> Machine-Dependent Code Optimizer -> target-machine code

The symbol table is consulted and updated by all of these phases.

Lexical Analyzer
Lexical Analyzer reads the source program character by character and returns the
tokens of the source program.
 Lexical Analyzer also called as Scanner.
 It reads the stream of characters making up the source program and groups the
characters into meaningful sequences called Lexemes.
 For each lexeme, it produces as output a token of the form
<token_name, attribute_value>


token_name is an abstract symbol that is used during syntax analysis, and the second
component attribute_value points to an entry in the symbol table for this token.
 A token describes a pattern of characters having the same meaning in the source program
(such as identifiers, operators, keywords, numbers, delimiters and so on).

Example: suppose a source program contains the assignment statement.


position = initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped
into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that is mapped into the token <id,1>, where id is an abstract
symbol standing for identifier and 1 points to the symbol table entry for position.
2. The assignment symbol = is a lexeme that is mapped into the token < = >.
3. Initial is a lexeme that is mapped into the token <id,2>, where 2 points to the symbol
table entry for initial.
4. + is a lexeme that is mapped into the token < + >.
5. rate is a lexeme that is mapped into the token <id,3>, where 3 points to the symbol
table entry for rate.
6. * is a lexeme that is mapped into the token < * >.
7. 60 is a lexeme that is mapped into the token <60>.
 The lexical analyzer puts information about identifiers into the symbol table (not all attributes are known at this stage).
 Regular expressions are used to describe tokens (lexical constructs).
 A (deterministic) finite state automaton (DFA) can be used in the implementation of
a lexical analyzer.
 Blank spaces are removed.
The representation of the assignment statement after lexical analysis is the sequence of
tokens
<id,1> < = > <id,2> <+> <id,3> <*> <60>
In this representation, the token names =, +, and * are abstract symbols for the assignment,
addition, and multiplication operators respectively.
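A very small C sketch of this grouping is shown below. It is an illustration only: the token format, the toy symbol table and the function name install_id are assumptions, not the text book's code, and it prints 60 as <num,60> rather than <60>.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *symtab[16];   /* toy symbol table: index = attribute value */
static int nsyms = 1;            /* entry 0 unused, so identifiers start at 1 */

static int install_id(const char *lexeme)
{
    for (int i = 1; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i;                          /* lexeme already in the table */
    symtab[nsyms] = strdup(lexeme);
    return nsyms++;
}

int main(void)
{
    const char *p = "position = initial + rate * 60";
    char lexeme[32];
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }      /* skip blanks */
        if (isalpha((unsigned char)*p)) {                        /* identifier  */
            int n = 0;
            while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<id,%d> ", install_id(lexeme));
        } else if (isdigit((unsigned char)*p)) {                 /* number      */
            int n = 0;
            while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("<num,%s> ", lexeme);
        } else {
            printf("<%c> ", *p++);                               /* operator    */
        }
    }
    printf("\n");   /* prints: <id,1> <=> <id,2> <+> <id,3> <*> <num,60>       */
    return 0;
}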
Syntax Analyzer
A syntax analyzer creates the syntactic structure (generally a parse tree) of the given
program.
 A syntax analyzer is also called a parser; this phase is known as parsing or syntax analysis.
 The parser uses the first components of the tokens produced by the lexical analyzer to
create a tree-like intermediate representation that depicts the grammatical structure of
the token stream.
 A typical representation is a syntax tree in which each interior node represents an
operation and the children of the node represent the arguments of the operation.
 A syntax tree for the token stream is shown below as the output of the syntactic
analyzer.
              =
            /   \
       <id,1>    +
               /   \
          <id,2>    *
                  /   \
             <id,3>    60


The tree has an interior node labelled * with <id,3> as its left child and the integer 60
as its right child. The node <id,3> represents the identifier rate. The node labelled * makes it
explicit that we must first multiply the value of rate by 60.
The node labelled + indicates that we must add the result of this multiplication to the
value of initial. The root of the tree, labelled =, indicates that we must store the result of this
addition into the location for the identifier position.
This ordering of operations is consistent with the usual conventions of arithmetic
which tell us that multiplication has higher precedence than addition, and hence that the
multiplication is to be performed before the addition.
The subsequent phases of the compiler use the grammatical structure to help analyze
the source program and generate the target program.

Semantic Analyzer
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the symbol
table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks
that each operator has matching operands.
Example:
many programming language definitions require an array index to be an integer, the
compiler must report an error if a floating point number is used to index an array.

The language specification may permit some type conversions called coercions.
Example:
A binary arithmetic operator may be applied to either a pair of integers or to a pair of
floating point numbers. If the operator is applied to a floating-point number and an integer,
the compiler may convert or coerce the integer into a floating point number. Such a coercion
appears as shown below.
              =
            /   \
       <id,1>    +
               /   \
          <id,2>    *
                  /    \
             <id,3>   inttofloat
                          |
                          60
Intermediate Code Generation
In the process of translating a source program into target code, a compiler may
construct one or more intermediate representations, which can have a variety of forms.
After syntax and semantic analysis of the source program, many compilers generate
an explicit low-level or machine-like intermediate representation, which we can think of as a
program for an abstract machine.
This intermediate representation should have two important properties:
 It should be easy to produce
 It should be easy to translate into the target machine.


We consider an intermediate form called three-address code, which consists of a
sequence of assembly-like instructions with three operands per instruction. Each operand can
act like a register. The output of the intermediate code generator consists of the three-address
code sequence as follows:
t1=inttofloat(60)
t2=id3*t1
t3=id2+t2
id1=t3
Three address instructions
 Each three-address assignment instruction has at most one operator on the right side.
 The compiler must generate a temporary name to hold the value computed by a three-address
instruction.
 Some three-address instructions have fewer than three operands.
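One common way to hold such instructions inside a compiler is as quadruples (operator, first operand, second operand, result). The small C sketch below stores the sequence shown above; the struct and field names are assumptions chosen for illustration.

#include <stdio.h>

struct quad {
    const char *op;      /* operator: "*", "+", "inttofloat", "="         */
    const char *arg1;    /* first operand                                 */
    const char *arg2;    /* second operand ("" if the op takes only one)  */
    const char *result;  /* temporary or variable receiving the value     */
};

int main(void)
{
    struct quad code[] = {                 /* t1=inttofloat(60) ... id1=t3 */
        { "inttofloat", "60",  "",   "t1"  },
        { "*",          "id3", "t1", "t2"  },
        { "+",          "id2", "t2", "t3"  },
        { "=",          "t3",  "",   "id1" },
    };
    for (int i = 0; i < 4; i++)
        printf("(%s, %s, %s, %s)\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}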

Code optimization
The machine-independent code optimization phase attempts to improve the
intermediate code so that better target code will result. Usually better means faster, but other
objectives may also be desired, such as shorter code or target code that consumes less power.
This phase is needed to generate good target code.
The optimizer can deduce that the conversion of 60 from integer to floating point can
be done once and for all at compile time.
t1=id3*60.0
id1=id2+t1
There is a great variation in the amount of code optimization different compilers
perform. There are simple optimizations that significantly improve the running time of the
target program without slowing down compilation too much.

Code generation
The code generator takes as input an intermediate representation of the source
program and maps it into the target language.
If the target language is machine code, registers or memory locations are selected for
each of the variables used by the program.
The crucial aspect of code generation is the judicious assignment of registers to hold
variables.
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction
tells us that it deals with floating point numbers.

Symbol table management


An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name.
These attributes may provide information about the storage allocated for a name, its
type, its scope and in the case of procedure names, such things as the number and types of its
arguments, the method of passing each argument and the type returned.


The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name.
The data structure should be designed to allow the compiler to find the record for each
name quickly and to store or retrieve data from that record quickly.
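A minimal sketch of one such data structure is given below: a chained hash table keyed on the name, with one record per entry. The field names, table size and helper names are assumptions for illustration, not a prescribed design.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

struct symbol {
    char *name;            /* lexeme of the identifier                  */
    char *type;            /* e.g. "int", "float"                       */
    int   scope;           /* block nesting depth where it is declared  */
    struct symbol *next;   /* chaining for hash collisions              */
};

static struct symbol *buckets[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the existing record for name, or create one in the given scope. */
struct symbol *lookup_or_insert(const char *name, const char *type, int scope)
{
    unsigned h = hash(name);
    for (struct symbol *q = buckets[h]; q; q = q->next)
        if (strcmp(q->name, name) == 0) return q;
    struct symbol *p = malloc(sizeof *p);
    p->name  = strdup(name);
    p->type  = strdup(type);
    p->scope = scope;
    p->next  = buckets[h];
    buckets[h] = p;
    return p;
}

int main(void)
{
    struct symbol *s = lookup_or_insert("rate", "float", 0);
    printf("%s : %s (scope %d)\n", s->name, s->type, s->scope);
    return 0;
}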

Grouping of phases into passes


In an implementation, activities from several phases may be grouped together into a
pass that reads an input file and writes an output file.
For example,
The front-end phases of lexical analysis, syntax analysis and semantic analysis, together with
intermediate code generation, might be grouped together into one pass.
Code optimization might be an optional pass.
Then there could be a back-end pass consisting of code generation for a particular
target machine.
Some compiler collections have been created around carefully designed intermediate
representations that allow the front end for a particular language to interface with the back
end for a certain target machine.
With these collections, we can produce compilers for different source languages for
one target machine by combining different front ends with the back end for that target
machine.
Similarly, we can produce compilers for different target machines by combining a
front end with back ends for different target machines.

Compiler-construction tools
Some commonly used compiler construction tools include.
 Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
 Scanner generators that produce lexical analyzers from a regular expression
description of the tokens of a language.
 Syntax directed translation engines that produce collections of routines for walking a
parse tree and generating intermediate code.
 Code generator generators that produce a code generator from a collection of rules
for translating each operation of the intermediate language into the machine language
for a target machine.
 Data flow analysis engines that facilitate the gathering of information about how
values are transmitted from one part of a program to every other part. Data-flow analysis is a key part of
code optimization.
 Compiler construction toolkits that provide an integrated set of routines for
constructing various phases of a compiler.

1.3 The Evolution of Programming Languages


The first electronic computers appeared in the 1940’s and were programmed in
machine language by sequences of 0’s and 1’s that explicitly told the computer what
operations to execute and in what order.

The operations themselves were low level: move data from one location to another,
add the contents of two registers, compare two values, and so on. This kind of programming


was slow, tedious, and error prone. And once written, the programs were hard to understand
and modify.

Move to higher-level languages

The first step towards programming languages was the development of mnemonic
assembly languages in the early 1950’s.

Initially, the instructions in an assembly language were just mnemonic representations


of machine instructions. Later, macro instructions were added to assembly languages so that a
programmer could define parameterized shorthands for frequently used sequences of machine
instructions.

Higher-level languages appeared in the latter half of the 1950s, with the development of Fortran for scientific
computation, Cobol for business data processing, and Lisp for symbolic computation.

In the following decades, many more languages were created with innovative features
to help make programming easier, more robust and more natural.

Today, there are thousands of programming languages. They can be classified in a


variety of ways.

One classification is by generation.

First generation languages: machine languages. Second generation languages: assembly
languages. Third generation languages: higher-level languages. Fourth generation languages:
languages designed for specific applications, like SQL for database applications, NOMAD for report
generation and Postscript for text formatting. Fifth generation languages: logic- and
constraint-based languages like Prolog and OPS5.
Another classification of languages uses the terms imperative and declarative.
Imperative languages: languages in which a program specifies how a computation is to be
done.
e.g.: C, C++.
Declarative languages: languages in which a program specifies what computation is to be done.

e.g.: Prolog.

The term von Neumann language is applied to programming languages whose computational
model is based on the von Neumann computer architecture.

An object-oriented language is one that supports object-oriented programming, a


programming style in which a program consists of a collection of objects that interact with
one another.

Eg: Simula 67, Smalltalk, C++, C#, Java and Ruby

Scripting languages are interpreted languages with high-level operators designed for “gluing
together” computations. These computations were originally called scripts.

Eg: Awk, JavaScript, Perl, PHP, Python, Ruby and Tcl.

Impacts on Compilers
High performance compilers (i.e., the code generated performs well) are crucial for
the adoption of new language concepts and computer architectures. Also important is the
resource utilization of the compiler itself.

Compiler writing is challenging. A compiler by itself is a large program. Moreover,


many modern language processing systems handle several source languages and target
machines within the same framework.
A compiler must translate correctly the potentially infinite set of programs that could
be written in the source language.

1.4 The science of building a compiler


• Compiler design deals with complicated real-world problems.
• First, the problem is taken.
• A mathematical abstraction of the problem is formulated.
• The abstraction is solved using mathematical techniques.

Modeling in compiler design and implementation


The study of compilers is mainly a study of how we design the right mathematical
models and choose the right algorithm, while balancing the need for generality and power
against simplicity and efficiency.
Fundamental models – finite state machine, regular expression, context free grammar.

The science of code optimization


Optimization: an attempt made by the compiler to produce code that is more efficient than
the obvious code.
Compiler optimizations-Design objectives
 Must improve performance of many programs.
 Optimization must be correct.
 Compilation time must be kept reasonable.
 Engineering effort required must be manageable.

1.5 Applications of Compiler Technology

 Implementation of high level programming languages.


The programmer expresses an algorithm using the language, and the compiler must translate that
program to the target language.
Generally, high-level programming languages are easier to program in, but are less efficient, i.e., the target
programs run more slowly.
Programmers using a low-level programming language have more control over a computation and can produce more
efficient code.
Unfortunately, low-level programs are harder to write and, still worse, less portable, more prone to errors
and harder to maintain.
Optimizing compilers include techniques to improve the performance of generated code,
thus offsetting the inefficiency introduced by high-level abstractions.


 Optimizations for computer architectures


The rapid evolution of computer architectures has also led to an insatiable demand for new
compiler technology.
Almost all high-performance systems take advantage of the same two basic techniques:
parallelism and memory hierarchies.
• Parallelism can be found at several levels: at the instruction level, where multiple
operations are executed simultaneously, and at the processor level, where different
threads of the same application are run on different processors.
• Memory hierarchies are a response to the basic limitation that we can build very fast
storage or very large storage, but not storage that is both fast and large.

Parallelism
All modern microprocessors exploit instruction-level parallelism. This can be hidden
from the programmer.
The hardware scheduler dynamically checks for dependencies in the sequential
instruction stream and issues them in parallel when possible.
Whether the hardware reorders the instruction or not, compilers can rearrange the
instruction to make instruction-level parallelism more effective.

Memory Hierarchies
A memory hierarchy consists of several levels of storage with different speeds and
sizes.
A processor usually has a small number of registers holding hundreds of bytes,
several levels of caches containing kilobytes to megabytes, and finally secondary storage that
contains gigabytes and beyond.
Correspondingly, the speed of accesses between adjacent levels of the hierarchy can
differ by two or three orders of magnitude.
The performance of a system is often limited not by the speed of the processor but by
the performance of the memory subsystem.
While compilers traditionally focus on optimizing processor execution, more
emphasis is now placed on making the memory hierarchy more effective.

 Design of new computer architectures.


In modern computer architecture development, compilers are developed in the processor
design stage, and compiled code, running on simulators, is used to evaluate the proposed
architectural features.
One of the best-known examples of how compilers influenced the design of computer
architecture was the invention of the RISC (reduced instruction set computer) architecture.
Over the last three decades, many architectural concepts have been proposed. They
include data-flow machines, vector machines, VLIW (very long instruction word) machines, and
multiprocessors with shared memory and with distributed memory.
The development of each of these architectural concepts was accompanied by the
research and development of corresponding compiler technology.
Compiler technology is not only needed to support programming of these architectures,
but also to evaluate the proposed architectural designs.

 Program translations


Although we normally think of compiling as the translation of a high-level language to machine-level
language, the same technology can be applied to translate between different kinds of
languages.
The following are some of the important applications of program translation techniques:
BINARY TRANSLATION
Compiler technology can be used to translate the binary code for one machine to that
of another, allowing a machine to run programs originally compiled for another instruction set.
This technology has been used by various computer companies to increase the availability of
software for their machines.
HARDWARE SYNTHESIS
Not only is most software written in high-level languages; even hardware designs are
mostly described in high-level hardware description languages like Verilog and VHDL (very
high speed integrated circuit hardware description language).
Hardware designs are typically described at the register transfer level (RTL).
Hardware synthesis tools translate RTL descriptions automatically into gates, which
are then mapped to transistors and eventually to a physical layout. Unlike compilers for
programming languages, this process can take many hours to optimize the circuits.
DATABASE QUERY INTERPRETERS
Query languages like SQL are used to search databases.
These database queries consist of relational and Boolean operators.
They can be compiled into commands to search a database for the records satisfying that
query.
COMPILED SIMULATION
Simulation is a general technique used in many scientific and engineering disciplines
to understand a phenomenon or to validate a design.
Inputs to a simulator usually include the description of the design and specific input
parameters for that particular simulation run.
Simulations can be very expensive.
Instead of writing a simulator that interprets the design, it is faster to compile the
design to produce machine code that simulates that particular design natively.
Compiled simulation can run orders of magnitude faster than an interpreter-based
approach.
Compiled simulation is used in many state-of-the-art tools that simulate designs
written in verilog or VHDL.

 Software productivity tools


There are several ways in which program analysis, building on techniques originally developed to
optimize code in compilers, has improved software productivity.
Type checking
It is an effective and well-established technique to catch inconsistencies in programs.
It can be used to catch errors.
Example:
Where an operation is applied to the wrong type of object, or if parameters passed to a
procedure do not match the signature of the procedure.
Program analysis can go beyond finding type errors by analyzing the flow of data
through a program.
Example:
If a pointer is assigned null and then immediately dereferenced, the program is clearly
in error.


Bounds checking
It is easier to make mistakes when programming in a lower-level language than a
higher level one.
Example:
Many security breaches in systems are caused by buffer overflows in programs
written in C. Because C does not have array-bound checks, it is up to the user to ensure that
the arrays are not accessed out of bounds.
Had the program been written in a safe language that includes automatic range
checking, this problem would not have occurred.
The same data-flow analysis that is used to eliminate redundant range checks can also
be used to locate buffer overflows.
Memory-management tools
Garbage collection is another excellent example of the tradeoff between efficiency
and a combination of ease of programming and software reliability.
Automatic memory management obliterates all memory-management errors, which
are a major source of problems in C and C++ programs.
Various tools have been developed to help programmers find memory management
errors.
Example:
Purify is a widely used tool that dynamically catches memory management errors as
they occur.
Tools that help identify some of these problems statically have also been developed.

1.6 Programming language basics


 Static/Dynamic Distinction
If a language uses a policy that allows the compiler to decide an issue, then the issue is
decided statically, i.e., at compile time.
If the decision can be made only during execution of the program, the issue is said to be
dynamic, i.e., decided at run time.
Scope of declarations:
A language uses static scope or lexical scope if it is possible to determine the scope of
a declaration by looking only at the program text.
With dynamic scope, as the program runs, the same use of a name x could refer to any of
several different declarations of x.
Eg: public static int x;
 Environments and States
Another important distinction is whether changes occurring as the program runs affect the
values of data elements or affect the interpretation of names for that data.
Eg: x=y+1; changes the value denoted by the name x. More specifically, the assignment
changes the value in whatever location is denoted by x. It may be less clear that the
location denoted by x can itself change at run time.
The two-stage mapping from names to values is:

names  --environment-->  locations (variables)  --state-->  values

The association of names with locations in memory (the store) and then with values can
be described by two mappings that change as the program runs, as shown:


1. The environment is a mapping from names to locations in the store. Since variables
refer to locations, we could alternatively define an environment as a mapping from
names to variables.
2. The state is a mapping from locations in store to their values.

Eg: …..
int i; /*global i */
…..
void f(….)
{
int i; /*local i */
…..
i=3; /* use of local i */
…..
}
…….
x=i+1; /* use of global i */
 Static Scope and Block Structure
Most languages, including C and its family, use static scope. The scope rules for C
are based on program structure; the scope of a declaration is determined implicitly by where the
declaration appears in the program. Later languages, such as Java, C++ and C#, provide
explicit control over scopes by using keywords like public, private, and protected.
Static scope rules for a language with blocks: a block is a grouping of declarations and statements.
e.g.: C uses braces, as in the program below.
#include <iostream>
using namespace std;

int main()
{
    int a = 1;
    int b = 1;
    {
        int b = 2;
        {
            int a = 3;
            cout << a << b;   // prints 3 2 (innermost declarations of a and b)
        }
        {
            int b = 4;
            cout << a << b;   // prints 1 4
        }
        cout << a << b;       // prints 1 2
    }
    cout << a << b;           // prints 1 1
}

(refer text book following topics)


 Explicit access control
 Dynamic scope
 Parameter passing mechanisms (call by value, call by reference and call by
name)


 Aliasing.

1.7 The role of the lexical analyzer


The lexical analyzer reads the source program character by character, groups the characters into
lexemes, and produces as output a sequence of tokens, one for each lexeme in the source
program. Normally a lexical analyzer doesn't return a list of tokens in one shot; it returns
a token only when the parser asks for one.

Some Other Issues in Lexical Analyzer


• Skipping comments (stripping out comments & white space)
– Normally the lexical analyzer doesn't return a comment as a token.
– It skips a comment and returns the next token (which is not a comment) to the parser.
– So the comments are only processed by the lexical analyzer, and don't complicate the
syntax of the language.
• Correlating error messages
- It can associate a line number with each error message.
- In some compilers it makes a copy of the source program with the error messages inserted
at the appropriate positions.
- If the source program uses a macro-processor, the expansion of macros may be performed
by the lexical analyzer.
• Symbol table interface
– symbol table holds information about tokens (at least lexeme of identifiers)
– how to implement the symbol table, and what kind of operations.
• hash table – open addressing, chaining
• putting into the hash table, finding the position of a token from its lexeme.
Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input,
such as deletion of comments and compaction of consecutive whitespace characters
into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.


 Lexical analysis versus parsing


There are a number of reasons why the analysis portion of a compiler is normally separated
into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of lexical and
syntactic analysis often allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized technique that serve only the lexical task, not the job of parsing. In
addition, specialized buffering techniques for reading input characters can speed up
the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted
to the lexical analyzer.
Token: A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit.
For example, identifiers, keywords and constants are tokens.
Patterns: A pattern is a description of the form that the lexemes of a token may take.
Lexeme: A sequence of character in the source program that are matched with the pattern of
the token and is identified by the lexical analyzer as an instance of that token.
Eg: int, I, num, ans etc

List out lexeme and token in the following example
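As an illustration (this is not the text book's example), consider the C statement int total = count + 10; its lexemes and the tokens they map to are:

Lexeme    Token
int       keyword int
total     id (pointer to the symbol-table entry for total)
=         assignment operator
count     id (pointer to the symbol-table entry for count)
+         arithmetic operator +
10        number, with attribute value 10
;         punctuation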


Another limitation of the lexical analyzer:
If the statement if(a==b) is mistyped as fi(a==b), the lexical analyzer cannot rectify the mistake;
it simply returns fi as a valid identifier token, and it is left to a later phase to report the error.
1.8 Input Buffering
To recognize tokens, the source program must be read from the hard disk. Accessing the
hard disk each time is time consuming, so special buffering techniques have been developed to
reduce the amount of overhead required.
- One such technique is the two-buffer scheme, in which the two buffers are alternately reloaded.
- The size of each buffer is N (the size of a disk block), e.g. 4096 bytes.
– One read command is used to read N characters.
– If fewer than N characters remain in the input file, then a special character, represented by
eof, marks the end of the source file.
A sentinel is a special character that cannot be part of the source program; eof is used as the sentinel.
• Two pointers into the input are maintained:
– Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are
attempting to determine.
– Pointer forward scans ahead until a pattern match is found.

(Refer to the diagram in the text book, i.e., buffer pairs using sentinels.)
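A C sketch of the scheme is given below. It is an illustration under assumptions: the buffer size, the sentinel value EOF_CH and the function names are made up, only the forward pointer is shown (lexemeBegin is omitted), and it is not the text book's exact algorithm.

#include <stdio.h>
#include <stdlib.h>

#define BUFSIZE 4096          /* size of each buffer half (one disk block)    */
#define EOF_CH  0x1A          /* assumed sentinel byte, never in source text  */

static char  buf[2 * BUFSIZE + 2];   /* two halves, each ending in a sentinel */
static char *forward = buf;          /* scanning pointer                      */
static FILE *src;

static void load(char *half)         /* read up to BUFSIZE characters         */
{
    size_t n = fread(half, 1, BUFSIZE, src);
    half[n] = EOF_CH;                /* sentinel marks the end of valid data  */
}

static void init(const char *path)   /* open the source file, prime 1st half  */
{
    src = fopen(path, "rb");
    if (!src) { perror(path); exit(1); }
    load(buf);
    forward = buf;
}

/* Return the next source character. Each character costs one comparison
   (against the sentinel) instead of two separate end-of-buffer tests.        */
static int next_char(void)
{
    char c = *forward++;
    if (c != EOF_CH)
        return (unsigned char)c;
    if (forward == buf + BUFSIZE + 1) {              /* end of first half     */
        load(buf + BUFSIZE + 1);
        return next_char();
    }
    if (forward == buf + 2 * BUFSIZE + 2) {          /* end of second half    */
        load(buf);
        forward = buf;
        return next_char();
    }
    return EOF;                                      /* real end of input     */
}

int main(int argc, char **argv)
{
    init(argc > 1 ? argv[1] : "source.c");   /* hypothetical input file name  */
    for (int c = next_char(); c != EOF; c = next_char())
        putchar(c);                          /* here we simply echo the input */
    return 0;
}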

1.9 Specification of tokens


Alphabet: a finite set of symbols (e.g. the ASCII characters).
String:
– A finite sequence of symbols drawn from an alphabet.
– Sentence and word are also used as synonyms for string.
– epsilon (ε) is the empty string.
– |s| is the length of string s.
Language: any set of strings over some fixed alphabet.
– phi (∅), the empty set, is a language.
– {ε}, the set containing only the empty string, is a language.
– The set of well-formed C programs is a language.
– The set of all possible identifiers is a language.
Operators on strings:
– Concatenation: xy represents the concatenation of strings x and y; s ε = s and ε s = s.
– Exponentiation: s^n = s s ... s (n times), and s^0 = ε.

Operations on Languages
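Briefly, the standard operations (the full table is in the text book) are:
Union: L ∪ M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = zero or more concatenations of strings from L (includes ε)
Positive closure: L+ = one or more concatenations of strings from L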


Example: refer the examples in the text book.

Regular Expressions
• We use regular expressions to describe tokens of a programming language.
• A regular expression is built up of simpler regular expressions (using defining rules)
• Each regular expression denotes a language.
• A language denoted by a regular expression is called a regular set.

Regular Expressions (Rules)
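Briefly, the defining rules (stated fully in the text book) are: ε is a regular expression denoting the language {ε}; for each symbol a in the alphabet, a is a regular expression denoting {a}; and if r and s are regular expressions denoting L(r) and L(s), then (r)|(s) denotes L(r) ∪ L(s), (r)(s) denotes L(r)L(s), (r)* denotes (L(r))*, and (r) denotes L(r).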


Note: algebraic laws for regular expressions refer text book.

Regular Definitions
• To write a regular expression for some languages can be difficult, because their regular
expressions can be quite complex. In those cases, we may use regular definitions.
• We can give names to regular expressions, and we can use these names as symbols to define
other regular expressions.
• A regular definition is a sequence of definitions of the form d1 → r1, d2 → r2, …, dn → rn,
where each di is a new symbol not in the alphabet and each ri is a regular expression over the
symbols of the alphabet together with the previously defined names d1, …, di-1.

Extensions of regular expression


1. One or more instances: the unary postfix operator +.
2. Zero or one instance: the unary postfix operator ?.
3. Character classes: [ ].
Example:
1. Using these shorthands, write the regular definition for C identifiers.
2. Using these shorthands, write the regular definition for unsigned numbers.
(One possible solution for each is sketched below.)
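A possible answer for each, in the text book's notation (the names letter_, digit and digits are the usual ones and are chosen here only for illustration):

C identifiers:
letter_ -> A | B | ... | Z | a | b | ... | z | _
digit   -> 0 | 1 | ... | 9
id      -> letter_ ( letter_ | digit )*

Unsigned numbers:
digit   -> 0 | 1 | ... | 9
digits  -> digit+
number  -> digits ( . digits )? ( E ( + | - )? digits )?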

1.10 Recognition of tokens


The patterns for these tokens (the keywords, the relational operators grouped as the single token relop, identifiers and numbers; refer to the grammar and pattern table in the text book) can all be described with regular expressions.
We also want the lexer to remove whitespace so we define a new token

ws → ( blank | tab | newline ) +

where blank, tab, and newline are symbols used to represent the corresponding ASCII
characters.
Recall that the lexer will be called by the parser when the latter needs a new token. If the
lexer then recognizes the token ws, it does not return it to the parser but instead goes on to
recognize the next token, which is then returned. Note that you can't have two consecutive ws
tokens in the input because, for a given token, the lexer will match the longest lexeme
starting at the current position that yields this token. The table of tokens, patterns and
attribute values in the text book summarizes the situation.
For the parser, all the relational ops are to be treated the same so they are all the same token,
relop. Naturally, other parts of the compiler, for example the code generator, will need to
distinguish between the various relational ops so that appropriate code is generated. Hence,
they have distinct attribute values.

To recognize tokens there are 2 steps


1. Design of Transition Diagram
2. Implementation of Transition Diagram

Transition Diagrams
A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisions that must be made based on the input seen.
The two main components are circles representing states (think of them as decision points of
the lexer) and arrows representing edges (think of them as the decisions made).
The transition diagram for relop is shown in the text book.
1. The double circles represent accepting or final states at which point a lexeme has been
found. There is often an action to be done (e.g., returning the token), which is written to the
right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or more) stars
are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where to begin
the process.
It is fairly clear how to write code corresponding to this diagram. You look at the first
character, if it is <, you look at the next character. If that character is =, you return (relop,LE)
to the parser. If instead that character is >, you return (relop,NE). If it is another character,
return (relop,LT) and adjust the input buffer so that you will read this character again since
you have not used it for the current lexeme. If the first character was =, you return
(relop,EQ).


Recognition of Reserved Words and Identifiers


The next transition diagram corresponds to the regular definition given previously.
Note again the star affixed to the final state.
Two questions remain.
1. How do we distinguish between identifiers and keywords such as then, which also
match the pattern in the transition diagram?
2. What is (gettoken(), installID())?
We will continue to assume that the keywords are reserved, i.e., may not be used as
identifiers. (What if this is not the case, as in PL/I, which had no reserved words? Then the
lexer does not distinguish between keywords and identifiers and the parser must.) We will
use the method mentioned earlier and have the keywords installed into the identifier
table prior to any invocation of the lexer. The table entry will indicate that the entry is a
keyword.
installID() checks if the lexeme is already in the table. If it is not present, the lexeme is
installed as an id token. In either case a pointer to the entry is returned.
gettoken() examines the lexeme and returns the token name, either id or a name
corresponding to a reserved keyword.
The text also gives another method to distinguish between identifiers and keywords.
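A compact C sketch of this idea is given below. The token codes, the tiny linear table and the pre-installed keywords are assumptions for illustration; a real compiler would use a hash-table symbol table such as the one sketched earlier.

#include <stdio.h>
#include <string.h>

enum token_name { ID = 256, IF, THEN, ELSE };   /* assumed token codes        */

struct entry { const char *lexeme; int token; };

static struct entry table[1024] = {             /* keywords installed first   */
    { "if", IF }, { "then", THEN }, { "else", ELSE },
};
static int nentries = 3;

/* Return the table entry for lexeme, installing it as an id if it is new.    */
static struct entry *installID(const char *lexeme)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0) return &table[i];
    table[nentries].lexeme = strdup(lexeme);
    table[nentries].token  = ID;
    return &table[nentries++];
}

/* The token name is whatever the entry records: a keyword code, or ID.       */
static int gettoken(const struct entry *e) { return e->token; }

int main(void)
{
    printf("%d %d\n", gettoken(installID("then")),   /* THEN: a keyword       */
                      gettoken(installID("rate")));  /* ID: a new identifier  */
    return 0;
}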


Recognizing Whitespace
The diagram itself is quite simple reflecting the simplicity of the corresponding regular
expression.
The delim in the diagram represents any of the whitespace characters, say space,
tab, and newline.
The final star is there because we needed to find a non-whitespace character in
order to know when the whitespace ends and this character begins the next token.
There is no action performed at the accepting state. Indeed the lexer does not
return to the parser, but starts again from its beginning as it still must find the next
token.
Recognizing Numbers
This certainly looks formidable, but it is not that bad; it follows directly from the regular
expression.
Compare the regular expression with the corresponding parts of the diagram (refer text book).
When an accepting state is reached, action is required but is not shown on the diagram. Just
as identifiers are stored in an identifier table and a pointer is returned, there is a corresponding
number table in which numbers are stored. These numbers are needed when code is
generated. Depending on the source language, we may wish to indicate in the table whether
this is a real or an integer. A similar, but more complicated, transition diagram could be
produced if the language permitted complex numbers as well.

Architecture of a Transition-Diagram-Based Lexical Analyzer


The idea is that we write a piece of code for each decision diagram. I will show the one for
relational operations below. This piece of code contains a case for each state, which typically
reads a character and then goes to the next case depending on the character read.
The numbers in the circles are the names of the cases.

Accepting states often need to take some action and return to the parser. Many of these
accepting states (the ones with stars) need to restore one character of input. This is called
retract() in the code. What should the code for a particular diagram do if at one state the
character read is not one of those for which a next state has been defined? That is, what if the
character read is not the label of any of the outgoing arcs? This means that we have failed to
find the token corresponding to this diagram.
The code calls fail(). This is not an error case. It simply means that the current input does not
match this particular token. So we need to go to the code section for another diagram after
restoring the input pointer, so that we start the next diagram at the point where this failing
diagram started. If we have tried all the diagrams, then we have a real failure and need to
print an error message and perhaps try to repair the input.
Note that the order the diagrams are tried is important. If the input matches more than one
token, the first one tried will be chosen.
TOKEN getRelop()                        // TOKEN has two components
{
    TOKEN retToken = new(RELOP);        // first component (token name) set here
    while (true) {                      // repeat until a token is returned or we fail
        switch (state) {
        case 0:
            c = nextChar();
            if      (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();                // lexeme is not a relop
            break;
        case 1: ...
        ...
        case 8:
            retract();                  // an accepting state with a star
            retToken.attribute = GT;    // second component (attribute) set here
            return retToken;
        }
    }
}

Alternate Methods

The book gives two other methods for combining the multiple transition-diagrams (in
addition to the one above).
1. Unlike the method above, which tries the diagrams one at a time, the first new method tries
them in parallel. That is, each character read is passed to each diagram (that hasn't already
failed). Care is needed when one diagram has accepted the input, but others still haven't failed
and may accept a longer prefix of the input.
2. The final possibility discussed, which appears to be promising, is to combine all the
diagrams into one. That is easy for the example we have been considering because all the
diagrams begin with different characters being matched. Hence we just have one large start
state with multiple outgoing edges. It is more difficult when there is a character that can begin
more than one diagram.

NOTE:
Refer text book
1. Attribute of tokens
2. Lexical error
3. Transition diagrams for relop, unsigned numbers, white space
4. Lexical analyzer generator and finite automata.


Unit 1:
1. Give the general structure of a compiler. Show the working of the different phases of a
compiler taking an example. [june/jul 12] (10 Marks)
2. List and explain the reasons for separating the analysis portion of a compiler into lexical analysis
and syntax analysis phases. [june/jul 12] (06 Marks)
3. Why is a two-buffer scheme used in lexical analysis? Write an algorithm for “look-ahead
code with sentinels”. [june/jul 12] (04 Marks)
4. Explain with a neat diagram, the phases of a compiler.[may/june 2010](10 Marks)
5. Construct a transition diagram for recognizing unsigned numbers. Sketch the program
segments to implement it, showing the first two states and one final state. [may/june 2010]
(10 Marks)
6. What is meant by input buffering ? Explain the use of sentinels in recognizing tokens
[june/jul 09] ( 08 Marks)
7. With the help of a diagram, explain the various phases of a compiler [june/jul 09]
(12 Marks)

