Module-2: Introduction, Lexical Analysis: Syllabus
SYLLABUS:
Language processors; The structure of a Compiler; The evolution of programming
languages; The science of building a Compiler; Applications of compiler technology;
Programming language basics.
Lexical analysis: The Role of Lexical Analyzer; Input Buffering; Specifications of
Tokens; Recognition of Tokens, Lexical Analyzer Generator and Finite Automata.
Overview of Compilers
A compiler is system software that takes a source program as input and converts it into a
target program. Source programs are independent of the machine on which they are executed,
whereas target programs are machine dependent. The source program can be in any high-level
language such as C or C++, and the target program may be in assembly language or
machine language.
A machine-language instruction is a sequence of binary codes; for example, one such
instruction moves the contents of register AX to register BX. These codes are understood
directly by the computer system and hence their execution is fast: they can be executed
without any intermediary software. However, it is difficult for the programmer to read, write
and debug instructions in machine code.
The same operation written in assembly language, for example MOV BX, AX, moves the
contents of register AX to BX. Assembly instructions are written based on the number and
type of general-purpose registers available, the addressing modes and the organization of
memory. Though assembly code is easier to read and write than machine code, the
programmer must still know how to use registers efficiently and choose appropriate
instructions for faster execution and better utilization of memory. Assembly code also
requires an intermediary program called an assembler, which converts it to machine code
before execution; hence it is slower when compared to machine code.
In a high-level language, instructions are written using a programming language like C, C++,
Java, etc.
Example: c=a+b
This instruction adds the values of the variables a and b and stores the result in c. Such
instructions are very easy for the programmer to read, write and debug, but they are difficult
for the computer system to understand directly, so a compiler is needed to translate them.
An important role of the compiler is to report any errors in the source program that it
detects during the translation process.
If the target program is an executable machine language program, it can then be called
by the user to process inputs and produce outputs.
[Figure: running the target program — input → Target Program → output]
The machine language target program produced by a compiler is usually much faster
than an interpreter at mapping inputs to outputs. An interpreter can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.
Example:
Java language processors combine compilation and interpretation, as described below.
A Java source program may first be compiled into an intermediate form called
bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this
arrangement is that bytecodes compiled on one machine can be interpreted on another
machine, perhaps across a network.
In order to achieve faster processing of inputs to outputs, some Java compilers, called
just-in-time compilers, translate the bytecodes into machine language immediately before
they run the intermediate program to process the input.
[Figure: A language-processing system — source program → Preprocessor → modified source
program → Compiler → target assembly program → Assembler → relocatable machine code →
Linker/Loader (together with library files and other relocatable object files) → target
machine code]
The different steps involved in converting instructions in high-level code to machine-level
code are collectively called language processing. The components involved in language
processing are:
a. Preprocessor
b. Compiler
c. Assembler
d. Linker /loader
Preprocessor
The preprocessor operates on the source program before compilation. It expands macros and
shorthand notations and inserts the contents of included header files, producing the modified
source program that is given to the compiler.
Compiler
The compiler takes the pre-processed file and generates assembly-level code. It also builds a
symbol table and a literal table. The compiler has an error handler which displays error
messages and performs some error recovery if necessary. In order to reduce execution time
and make better use of memory, the compiler generates an intermediate form of the code and
optimizes it. The functionality of the compiler is divided into multiple phases, each of which
performs a set of operations; for example, lexical analysis generates tokens and the code
optimizer improves the intermediate code.
Assembler
The assembler takes assembly code as input and converts it into relocatable object code.
An instruction in assembly code has two parts: an opcode and an operand part. The opcode
specifies the type of operation, such as ADD for addition, SUB for subtraction or INC for
increment. The operand part consists of the operands on which the operation is applied; an
operand may be a memory location, a register or immediate data.
Assemblers may be single-pass or two-pass. In a single-pass assembler, reading the assembly
code, building the symbol table and converting opcodes to machine instructions are all done
in one pass. In a two-pass assembler, the first pass reads the input file and stores the
identifiers (labels) in the symbol table; in the second pass, it translates the opcodes into
sequences of bits (machine code or relocatable code) with the help of the symbol table.
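As a rough illustration of the first pass, the following C sketch collects labels and their
addresses into a symbol table. It is not from the text: the line format (an optional
"LABEL:" followed by one instruction per line, one word per instruction) and the names
symtab and pass1 are assumptions made for the example.

    #include <stdio.h>
    #include <string.h>

    #define MAX_SYMS 256

    struct symbol { char name[32]; int address; };
    static struct symbol symtab[MAX_SYMS];
    static int nsyms = 0;

    /* Pass 1: record each label together with the address (location counter)
       of the instruction it precedes. */
    void pass1(FILE *src)
    {
        char line[128];
        int location_counter = 0;
        while (fgets(line, sizeof line, src)) {
            char *colon = strchr(line, ':');
            if (colon != NULL && nsyms < MAX_SYMS) {
                *colon = '\0';                          /* isolate the label text */
                strncpy(symtab[nsyms].name, line, sizeof symtab[nsyms].name - 1);
                symtab[nsyms].address = location_counter;
                nsyms++;
            }
            location_counter++;                         /* assume one word per line */
        }
    }

The second pass would then rescan the file, translate each opcode to its bit pattern and
replace every label operand with the address recorded in symtab.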
Linker/Loader
In the final step, an executable code is generated with the help of linker and loader.
The linker links system-wide libraries and resources supplied by the operating system, such
as I/O routines and the memory allocator. The loader resolves all relocatable addresses
relative to a starting address and produces absolute executable code.
The Structure of a Compiler
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate
representation of the source program. If the analysis part detects that the source program is
either syntactically ill-formed or semantically unsound, then it must provide informative
messages, so that the user can take corrective action.
The analysis part also collects information about the source program and stores it in a
data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table.
The analysis part is often called the front end of the compiler. The synthesis part is
the back end of the compiler.
[Figure: Phases of a compiler — character stream → Lexical Analyzer → token stream →
Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code
Generator → intermediate representation → Machine-Independent Code Optimizer →
intermediate representation → Code Generator → target-machine code → Machine-Dependent
Code Optimizer → target-machine code; the symbol table is shared by all phases]
Lexical Analyzer
Lexical Analyzer reads the source program character by character and returns the
tokens of the source program.
The lexical analyzer is also called a scanner.
It reads the stream of characters making up the source program and groups the
characters into meaningful sequences called Lexemes.
For each lexeme, it produces as output a token of the form
<token_name, attribute_value>
token_name is an abstract symbol that is used during syntax analysis, and the second
component attribute_value points to an entry in the symbol table for this token.
A token describes a pattern of characters having same meaning in the source program.
(such as identifiers, operators, keywords, numbers, delimiters and so on)
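As an illustration, a token of the form <token_name, attribute_value> can be represented by
a small structure. This is only a sketch; the enum constants and field names below are not
taken from the text.

    /* Illustrative token representation. */
    enum token_name { ID, NUMBER, PLUS, TIMES, ASSIGN /* ... */ };

    struct token {
        enum token_name name;   /* abstract symbol used during syntax analysis          */
        int attribute;          /* e.g. index of the lexeme's entry in the symbol table */
    };

    /* The lexeme "rate" might be returned as the pair {ID, 3}, i.e. <id,3>,
       where 3 is the position of "rate" in the symbol table. */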
Syntax Analyzer
The syntax analyzer (parser) uses the tokens produced by the lexical analyzer to create a
tree-like intermediate representation (a syntax tree) that depicts the grammatical structure
of the token stream.
[Figure: syntax tree for the assignment position = initial + rate * 60, i.e.
<id,1> = <id,2> + <id,3> * 60 — root =, with left child <id,1> and right child +; the
children of + are <id,2> and *, and the children of * are <id,3> and 60]
The tree has an interior node labelled * with <id,3> as its left child and the integer 60
as its right child. The node <id,3> represents the identifier rate. The node labelled * makes
it explicit that we must first multiply the value of rate by 60.
The node labelled + indicates that we must add the result of this multiplication to the
value of initial. The root of the tree, labelled =, indicates that we must store the result
of this addition into the location for the identifier position.
This ordering of operations is consistent with the usual conventions of arithmetic
which tell us that multiplication has higher precedence than addition, and hence that the
multiplication is to be performed before the addition.
The subsequent phases of the compiler use the grammatical structure to help analyze
the source program and generate the target program.
Semantic Analyzer
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the symbol
table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks
that each operator has matching operands.
Example:
Many programming language definitions require an array index to be an integer; the
compiler must report an error if a floating-point number is used to index an array.
The language specification may permit some type conversions called coercions.
Example:
A binary arithmetic operator may be applied to either a pair of integers or to a pair of
floating point numbers. If the operator is applied to a floating-point number and an integer,
the compiler may convert or coerce the integer into a floating point number. Such a coercion
appears as shown below.
[Figure: syntax tree after coercion — root =, with left child <id,1> and right child +; the
children of + are <id,2> and *, and the children of * are <id,3> and inttofloat(60)]
Intermediate Code Generation
In the process of translating a source program into target code, a compiler may
construct one or more intermediate representations, which can have a variety of forms.
After syntax and semantic analysis of the source program, many compilers generate
an explicit low-level or machine-like intermediate representation, which we can think of as a
program for an abstract machine.
This intermediate representation should have two important properties:
It should be easy to produce
It should be easy to translate into the target machine.
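For the running example position = initial + rate * 60, the intermediate representation is
usually shown as three-address code; the following sketch is consistent with the optimized
code given in the next section, though the temporary names are only illustrative.

    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3

Each three-address instruction has at most one operator on the right-hand side and uses
compiler-generated temporaries (t1, t2, t3) to hold intermediate values.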
Code optimization
The machine independent code optimization phase attempts to improve the
intermediate code so that better target code will result. Usually better means faster, but
other objectives may also be desired, such as shorter code or target code that consumes less
power.
It is required to generate good target code.
The optimizer can deduce that the conversion of 60 from integer to floating point can
be done once and for all at compile time.
t1=id3*60.0
id1=id2+t1
There is a great variation in the amount of code optimization different compilers
perform. There are simple optimizations that significantly improve the running time of the
target program without slowing down compilation too much.
Code generation
The code generator takes as input an intermediate representation of the source
program and maps it into the target language.
If the target language is machine code, registers or memory locations are selected for
each of the variables used by the program.
The crucial aspect of code generation is the judicious assignment of registers to hold
variables.
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction
tells us that it deals with floating point numbers.
Symbol-Table Management
The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name.
The data structure should be designed to allow the compiler to find the record for each
name quickly and to store or retrieve data from that record quickly.
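As a rough sketch (the record fields and function name are illustrative, not from the text),
a symbol table can be kept as an array of records with a combined lookup/insert routine:

    #include <string.h>

    struct sym_entry {
        char name[64];     /* the identifier's lexeme                    */
        char type[16];     /* attribute: e.g. "int" or "float"           */
        int  offset;       /* attribute: storage location assigned later */
    };

    static struct sym_entry table[512];
    static int count = 0;

    /* Return the index of name, inserting a new record if it is not present. */
    int lookup_or_insert(const char *name)
    {
        for (int i = 0; i < count; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;
        strncpy(table[count].name, name, sizeof table[count].name - 1);
        return count++;
    }

The linear search above is only for clarity; to meet the requirement that records be found
quickly, real compilers typically organize the table as a hash table.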
Compiler-construction tools
Some commonly used compiler-construction tools include:
Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
Scanner generators that produce lexical analyzers from a regular expression
description of the tokens of a language.
Syntax directed translation engines that produce collections of routines for walking a
parse tree and generating intermediate code.
Code generator generators that produce a code generator from a collection of rules
for translating each operation of the intermediate language into the machine language
for a target machine.
Data-flow analysis engines that facilitate the gathering of information about how
values are transmitted from one part of a program to each other part. Data-flow
analysis is a key part of code optimization.
Compiler construction toolkits that provide an integrated set of routines for
constructing various phases of a compiler.
The Evolution of Programming Languages
The first programs were written in machine language. The operations themselves were low
level: move data from one location to another,
add the contents of two registers, compare two values, and so on. This kind of programming
was slow, tedious, and error prone. And once written, the programs were hard to understand
and modify.
The first step towards programming languages was the development of mnemonic
assembly languages in the early 1950’s.
Higher-level languages followed in the latter half of the 1950's, with the development of
Fortran for scientific computation, Cobol for business data processing, and Lisp for
symbolic computation.
In the following decades, many more languages were created with innovative features
to help make programming easier, more robust and more natural.
e.g., Prolog.
The term von Neumann language is applied to programming languages whose computational
model is based on the von Neumann computer architecture.
Scripting languages are interpreted languages with high-level operators designed for “gluing
together” computations. These computations were originally called scripts.
Impacts on Compilers
High performance compilers (i.e., the code generated performs well) are crucial for
the adoption of new language concepts and computer architectures. Also important is the
resource utilization of the compiler itself.
Parallelism
All modern microprocessors exploit instruction-level parallelism. This can be hidden
from the programmer.
The hardware scheduler dynamically checks for dependencies in the sequential
instruction stream and issues them in parallel when possible.
Whether or not the hardware reorders the instructions, compilers can rearrange the
instructions to make instruction-level parallelism more effective.
Memory Hierarchies
A memory hierarchy consists of several levels of storage with different speeds and
sizes.
A processor usually has a small number of registers amounting to a few hundred bytes,
several levels of caches containing kilobytes to megabytes, and finally secondary storage
that contains gigabytes and beyond.
Correspondingly, the speed of accesses between adjacent levels of the hierarchy can
differ by two or three orders of magnitude.
The performance of a system is often limited not by the speed of the processor but by
the performance of the memory subsystem.
While compilers traditionally focus on optimizing processor execution, more emphasis is now
placed on making the memory hierarchy more effective.
Program translations
Bounds checking
It is easier to make mistakes when programming in a lower-level language than a
higher level one.
Example:
Many security breaches in systems are caused by buffer overflows in programs
written in C. Because C does not have array-bound checks, it is up to the user to ensure that
the arrays are not accessed out of bounds.
Had the program been written in a safe language that includes automatic range
checking, this problem would not have occurred.
The same data-flow analysis that is used to eliminate redundant range checks can also
be used to locate buffer overflows.
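For instance, the following (deliberately unsafe) C sketch compiles without complaint even
though it can write past the end of the array; the function name and buffer size are made up
for the example.

    #include <string.h>

    void copy_name(const char *input)
    {
        char buf[8];
        strcpy(buf, input);   /* no bounds check: an input longer than 7 characters
                                 overflows buf -- a classic buffer overflow */
    }

In a language with automatic range checking, the same out-of-bounds write would raise a
run-time error instead of silently corrupting memory.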
Memory-management tools
Garbage collection is another excellent example of the tradeoff between efficiency
and a combination of ease of programming and software reliability.
Automatic memory management obliterates all memory-management errors, which
are a major source of problems in C and C++ programs.
Various tools have been developed to help programmers find memory management
errors.
Example:
Purify is a widely used tool that dynamically catches memory management errors as
they occur.
Tools that help identify some of these problems statically have also been developed.
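A typical error that such tools catch is a use-after-free, as in the following small C
sketch (illustrative only):

    #include <stdlib.h>

    int main(void)
    {
        int *p = malloc(sizeof *p);
        if (p == NULL)
            return 1;
        *p = 42;
        free(p);
        return *p;   /* use after free: a tool such as Purify flags this access */
    }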
Environment and State
names --(environment)--> locations (variables) --(state)--> values
The association of names with locations in memory (the store) and then with values can
be described by two mappings that change as the program runs, as shown above.
1. The environment is a mapping from names to locations in the store. Since variables
refer to locations, we could alternatively define an environment as a mapping from
names to variables.
2. The state is a mapping from locations in store to their values.
Eg: …..
int i; /*global i */
…..
void f(….)
{
int i; /*local i */
…..
i=3; /* use of local i */
…..
}
…….
x=i+1; /* use of global i */
Static Scope and Block Structure
Most languages, including C and its family, use static scope. The scope rules for C are
based on program structure: the scope of a declaration is determined implicitly by where the
declaration appears in the program. Later languages, such as C++, Java and C#, provide
explicit control over scopes through keywords like public, private and protected.
Static scope rules apply to languages with blocks, where a block is a grouping of
declarations and statements. In C, blocks are delimited by braces { }. The example below
shows nested blocks; the comments indicate which declarations are visible at each output
statement.
main()
{
    int a = 1;
    int b = 1;
    {
        int b = 2;
        {
            int a = 3;
            cout << a << b;   /* prints a = 3, b = 2 */
        }
        {
            int b = 4;
            cout << a << b;   /* prints a = 1, b = 4 */
        }
        cout << a << b;       /* prints a = 1, b = 2 */
    }
    cout << a << b;           /* prints a = 1, b = 1 */
}
Aliasing
Two names (for example, two pointers or two reference parameters) are aliases when they
refer to the same memory location. Aliasing makes it harder for the compiler to reason about
a program, because an assignment through one name changes the value seen through the other,
as illustrated below.
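A small C illustration (not from the text): after the declarations below, *p, *q and x all
name the same location.

    int x = 0;
    int *p = &x;
    int *q = &x;   /* p and q are aliases for x               */
    *p = 5;        /* also changes the value seen as *q and x */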
A limitation of the lexical analyzer: if the statement if(a==b) is mistyped as fi(a==b), the
lexical analyzer cannot detect the mistake on its own, since fi is a valid lexeme (it could
be an identifier); the error is reported by a later phase.
1.8 Input Buffering
To recognize tokens, the lexical analyzer must read the source program from the hard disk.
Accessing the hard disk for every character is time consuming, so special buffering
techniques have been developed to reduce the amount of overhead required.
– One such technique is the two-buffer scheme, in which two buffer halves are alternately
reloaded.
– The size of each buffer is N, the size of a disk block, e.g. 4096 bytes.
– One read command is used to read N characters at a time.
– If fewer than N characters remain in the input file, then a special character, represented
by eof, marks the end of the source file.
– A sentinel is a special character that cannot be part of the source program; eof is used
as the sentinel.
• Two pointers into the input are maintained:
– The pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are
attempting to determine.
– The pointer forward scans ahead until a pattern match is found.
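A minimal C sketch of the two-buffer scheme with sentinels is given below. It is an
illustration, not the textbook's algorithm verbatim: buffer initialization (an initial call
to load(buf)) and the lexemeBegin pointer are omitted, and the names load and next_char are
made up for the example.

    #include <stdio.h>

    #define N 4096                       /* size of each buffer half (one disk block)    */

    static char buf[2 * N + 2];          /* two halves, each followed by a sentinel slot */
    static char *forward = buf;          /* scanning pointer                             */
    static FILE *src;                    /* source file, opened elsewhere                */

    /* Load one half with up to N characters and place the eof sentinel after them. */
    static void load(char *half)
    {
        size_t n = fread(half, 1, N, src);
        half[n] = (char)EOF;             /* sentinel marks the end of valid data */
    }

    /* Return the next input character; reload the other half when the sentinel
       at the end of a half is reached.  Returns EOF at the true end of input.  */
    static int next_char(void)
    {
        char c = *forward++;
        if (c != (char)EOF)
            return (unsigned char)c;
        if (forward - 1 == buf + N) {            /* sentinel ending the first half  */
            load(buf + N + 1);
            forward = buf + N + 1;
            return next_char();
        }
        if (forward - 1 == buf + 2 * N + 1) {    /* sentinel ending the second half */
            load(buf);
            forward = buf;
            return next_char();
        }
        return EOF;                              /* eof inside a half: end of input */
    }

Because every half ends in a sentinel, the common case needs only one test per character;
the two pointer comparisons are made only when a sentinel is actually seen.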
Operations on Languages
The most important operations on languages are:
– Union: L ∪ M = { s | s is in L or s is in M }
– Concatenation: LM = { st | s is in L and t is in M }
– Kleene closure: L*, the set of strings obtained by concatenating zero or more strings of L
– Positive closure: L+, the set of strings obtained by concatenating one or more strings of L
Example: if L is the set of letters and D the set of digits, then L(L ∪ D)* denotes the set
of all strings of letters and digits that begin with a letter.
Regular Expressions
• We use regular expressions to describe tokens of a programming language.
• A regular expression is built up of simpler regular expressions (using defining rules)
• Each regular expression denotes a language.
• A language denoted by a regular expression is called a regular set.
Regular Definitions
• Writing a regular expression for some languages can be difficult, because their regular
expressions can become quite complex. In those cases, we may use regular definitions.
• We can give names to regular expressions, and we can use these names as symbols to define
other regular expressions.
• A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where each di is a new symbol not in the alphabet and distinct from the other d's, and each
ri is a regular expression over the alphabet together with the previously defined names
d1, ..., di-1.
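For example, identifiers of a C-like language are usually described by the regular
definition:

    letter → A | B | ... | Z | a | b | ... | z | _
    digit  → 0 | 1 | ... | 9
    id     → letter ( letter | digit )*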
The tokens of a language (keywords, identifiers, numbers, relational operators) can all be
specified with regular expressions in this way. We also want the lexer to remove whitespace,
so we define a new token
ws → ( blank | tab | newline )+
where blank, tab, and newline are symbols used to represent the corresponding ASCII
characters.
Recall that the lexer will be called by the parser when the latter needs a new token. If the
lexer then recognizes the token ws, it does not return it to the parser but instead goes on to
recognize the next token, which is then returned. Note that you can't have two consecutive ws
tokens in the input because, for a given token, the lexer will match the longest lexeme
starting at the current position that yields this token. The table in the textbook
summarizes the tokens, their lexemes and their attribute values.
For the parser, all the relational ops are to be treated the same so they are all the same token,
relop. Naturally, other parts of the compiler, for example the code generator, will need to
distinguish between the various relational ops so that appropriate code is generated. Hence,
they have distinct attribute values.
Transition Diagrams
A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisions that must be made based on the input seen.
The two main components are circles representing states (think of them as decision points of
the lexer) and arrows representing edges (think of them as the decisions made).
The transition diagram for relop is shown in the textbook.
1. The double circles represent accepting or final states at which point a lexeme has been
found. There is often an action to be done (e.g., returning the token), which is written to the
right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or more) stars
are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where to begin
the process.
It is fairly clear how to write code corresponding to this diagram. You look at the first
character, if it is <, you look at the next character. If that character is =, you return (relop,LE)
to the parser. If instead that character is >, you return (relop,NE). If it is another character,
return (relop,LT) and adjust the input buffer so that you will read this character again since
you have not used it for the current lexeme. If the first character was =, you return
(relop,EQ).
Recognizing Whitespace
The diagram itself is quite simple reflecting the simplicity of the corresponding regular
expression.
The delim in the diagram represents any of the whitespace characters, say space,
tab, and newline.
The final star is there because we needed to find a non-whitespace character in
order to know when the whitespace ends and this character begins the next token.
There is no action performed at the accepting state. Indeed the lexer does not
return to the parser, but starts again from its beginning as it still must find the next
token.
Recognizing Numbers
This certainly looks formidable, but it is not that bad; it follows from the regular
expression.
In class go over the regular expression and show the corresponding parts in the diagram.
When an accepting state is reached, action is required but is not shown on the diagram. Just
as identifiers are stored in an identifier table and a pointer is returned, there is a
corresponding number table in which numbers are stored. These numbers are needed when code
is generated. Depending on the source language, we may wish to indicate in the table whether
the number is a real or an integer. A similar, but more complicated, transition diagram could be
produced if the language permitted complex numbers as well.
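For reference, the regular definition that the transition diagram for unsigned numbers
follows is usually written as:

    digit  → 0 | 1 | ... | 9
    digits → digit digit*
    number → digits ( . digits )? ( E ( + | - )? digits )?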
Accepting states often need to take some action and return to the parser. Many of these
accepting states (the ones with stars) need to restore one character of input. This is called
retract() in the code. What should the code for a particular diagram do if at one state the
character read is not one of those for which a next state has been defined? That is, what if the
character read is not the label of any of the outgoing arcs? This means that we have failed to
find the token corresponding to this diagram.
The code calls fail(). This is not an error case. It simply means that the current input does not
match this particular token. So we need to go to the code section for another diagram after
restoring the input pointer so that we start the next diagram at the point where this failing
diagram started. If we have tried all the diagrams, then we have a real failure and need to
print an error message and perhaps try to repair the input.
Note that the order the diagrams are tried is important. If the input matches more than one
token, the first one tried will be chosen.
TOKEN getRelop()                          // TOKEN has two components
{
    TOKEN retToken = new(RELOP);          // First component set here
    while (true) {
        switch (state) {
        case 0:
            c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();
            break;
        case 1: ...
        ...
        case 8:
            retract();                    // an accepting state with a star
            retToken.attribute = GT;      // second component
            return(retToken);
        }
    }
}
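For completeness, a C-style sketch of the whole relop recognizer is given below. It fills in
the states elided above, following the standard transition diagram; TOKEN, new, nextChar,
retract and fail are the same assumed helpers used in the fragment above, and LT, LE, EQ,
NE, GT, GE are the attribute codes for the six operators.

    TOKEN getRelop()
    {
        TOKEN retToken = new(RELOP);
        int state = 0;
        char c;
        while (true) {
            switch (state) {
            case 0:                                   /* start state         */
                c = nextChar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();                          /* not a relational op */
                break;
            case 1:                                   /* seen '<'            */
                c = nextChar();
                if (c == '=') state = 2;
                else if (c == '>') state = 3;
                else state = 4;
                break;
            case 2: retToken.attribute = LE; return(retToken);   /* "<=" */
            case 3: retToken.attribute = NE; return(retToken);   /* "<>" */
            case 4: retract();                                    /* one char too far */
                    retToken.attribute = LT; return(retToken);    /* "<"  */
            case 5: retToken.attribute = EQ; return(retToken);    /* "="  */
            case 6:                                   /* seen '>'            */
                c = nextChar();
                if (c == '=') state = 7;
                else state = 8;
                break;
            case 7: retToken.attribute = GE; return(retToken);    /* ">=" */
            case 8: retract();                                     /* one char too far */
                    retToken.attribute = GT; return(retToken);     /* ">"  */
            }
        }
    }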
Alternate Methods
The book gives two other methods for combining the multiple transition-diagrams (in
addition to the one above).
1. Unlike the method above, which tries the diagrams one at a time, the first new method tries
them in parallel. That is, each character read is passed to each diagram (that hasn't already
failed). Care is needed when one diagram has accepted the input, but others still haven't failed
and may accept a longer prefix of the input.
2. The final possibility discussed, which appears to be promising, is to combine all the
diagrams into one. That is easy for the example we have been considering because all the
diagrams begin with different characters being matched. Hence we just have one large start
state with multiple outgoing edges. It is more difficult when there is a character that can begin
more than one diagram.
NOTE:
Refer text book
1. Attribute of tokens
2. Lexical error
3. Transition diagrams for relop, unsigned numbers, white space
4. Lexical analyzer generator and finite automata.
Unit 1:
1. Give the general structure of a compiler. Show the working of different phases of a
compiler taking an example. [june/jul 12] (10 Marks)
2. List and explain reasons for separating the analysis portion of a compiler into lexical
analysis and syntax analysis phases. [june/jul 12] (06 Marks)
3. Why two-buffer scheme is used in lexical analysis? Write an algorithm for “look ahead
code with sentinels”. [june/jul 12] (04 Marks)
4. Explain with a neat diagram, the phases of a compiler.[may/june 2010](10 Marks)
5. Construct a transition diagram for recognizing unsigned numbers. Sketch the program
segments to implement it, showing the first two states and one final state. [may/june 2010]
(10 Marks)
6. What is meant by input buffering ? Explain the use of sentinels in recognizing tokens
[june/jul 09] ( 08 Marks)
7. With the help of a diagram, explain the various phases of a compiler [june/jul 09]
(12 Marks)