Chapter 2 Lexical Analysis
Recap - Compilation Sequence
Introduction
• The syntax analysis portion of a language processor nearly always consists of two parts:
  – A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
  – A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)
[Diagram: source program → lexical analyzer (scanner) → tokens → syntax analyzer (parser), with both components communicating with the symbol table manager]
Reasons to Separate Lexical and Syntax Analysis
• Simplicity - less complex approaches can be used for lexical analysis; separating the two simplifies the parser
• Efficiency - separation allows the lexical analyzer to be optimized independently
• Portability - parts of the lexical analyzer may not be portable, but the parser is always portable
Tasks of Lexical Analyzer
– scan the source-code string,
– collect characters into logical groupings (lexemes), and
– assign internal codes (tokens)
Tasks of Lexical Analyzer (cont.)
• The lexical analyzer may take care of a few other things as well, unless they are handled by a preprocessor:
– Removal of Comments
– Case Conversion
– Removal of White Space
– Interpretation of Compiler Directives
– Communication with the Symbol Table
– Preparation of Output Listing
Example
• Given the statement
  – if distance >= rate*(end_time - start_time) then distance := maxdist;
the lexical analyzer hands the parser a stream of <token, tokenval> pairs, where tokenval is the token's attribute. For the simpler assignment y := 31 + 28*x the stream is:
  <id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">
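A hedged C sketch of what such <token, tokenval> pairs might look like in a scanner's interface (the struct layout and names are illustrative, not from the slides):

    #include <stdio.h>

    /* Token codes the scanner can emit (illustrative subset). */
    enum token_code { TOK_ID, TOK_ASSIGN, TOK_NUM, TOK_PLUS, TOK_STAR };

    /* A <token, tokenval> pair as passed to the parser. */
    struct token {
        enum token_code code;
        const char     *lexeme;  /* attribute for identifiers */
        int             value;   /* attribute for numbers     */
    };

    int main(void) {
        /* Token stream for: y := 31 + 28*x */
        struct token stream[] = {
            { TOK_ID,     "y", 0  }, { TOK_ASSIGN, 0, 0 },
            { TOK_NUM,    0,   31 }, { TOK_PLUS,   0, 0 },
            { TOK_NUM,    0,   28 }, { TOK_STAR,   0, 0 },
            { TOK_ID,     "x", 0  },
        };
        printf("%zu tokens\n", sizeof stream / sizeof stream[0]);
        return 0;
    }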
Example: Non-Tokens
  Type                    Examples
  comment                 /* ignored */
  preprocessor directive  #include <foo.h>, #define NUMS 5, 6
  macro                   NUMS
  whitespace              \t \n \b
Buffering
• In principle, the analyzer goes through the source string a character at a time;
• In practice, it must be able to access substrings of the source.
• Hence the source is normally read into a buffer.
• The scanner needs two subscripts to note places in the buffer
  – lexeme start & current position
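A minimal C sketch of this two-subscript scheme (the names lexeme_begin and forward are illustrative assumptions, not mandated by the slides):

    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE 4096

    static char buf[BUF_SIZE];     /* source text read into memory      */
    static int  lexeme_begin = 0;  /* index of first char of the lexeme */
    static int  forward      = 0;  /* index of the char being scanned   */

    /* Scan one blank-delimited lexeme starting at lexeme_begin. */
    static void scan_word(void) {
        while (buf[forward] && buf[forward] != ' ' && buf[forward] != '\n')
            forward++;                        /* advance current position */
        printf("lexeme: %.*s\n",
               forward - lexeme_begin, buf + lexeme_begin);
        while (buf[forward] == ' ' || buf[forward] == '\n')
            forward++;                        /* skip white space         */
        lexeme_begin = forward;               /* next lexeme starts here  */
    }

    int main(void) {
        strcpy(buf, "count := 42\n");
        while (buf[forward])
            scan_word();
        return 0;
    }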
Finite State Automata
• The compiler writer defines tokens in the language by means of regular expressions.
• The lexical analyzer is best implemented as a finite state machine, or finite state automaton.
Example - Finite State Automata
[State-transition diagram from the original slide not reproduced]
Transition Table
[Transition table from the original slide not reproduced]
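Since the diagram and table were images in the original, here is a hedged C sketch of a table-driven finite automaton for identifiers of the form letter (letter|digit)* (state names and table layout are illustrative):

    #include <ctype.h>
    #include <stdio.h>

    enum state  { START, IN_ID, DEAD, NSTATES };
    enum cclass { LETTER, DIGIT, OTHER, NCLASSES };

    /* Transition table: next_state[current state][character class]. */
    static const enum state next_state[NSTATES][NCLASSES] = {
        /* START */ { IN_ID, DEAD,  DEAD },
        /* IN_ID */ { IN_ID, IN_ID, DEAD },
        /* DEAD  */ { DEAD,  DEAD,  DEAD },
    };

    static enum cclass classify(int c) {
        if (isalpha(c)) return LETTER;
        if (isdigit(c)) return DIGIT;
        return OTHER;
    }

    /* Run the DFA over s; accept iff we end in state IN_ID. */
    int is_identifier(const char *s) {
        enum state st = START;
        for (; *s; s++)
            st = next_state[st][classify((unsigned char)*s)];
        return st == IN_ID;
    }

    int main(void) {
        printf("%d %d\n", is_identifier("x42"), is_identifier("4x")); /* 1 0 */
        return 0;
    }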
Specification of Patterns for Tokens: Definitions
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
  – |s| denotes the length of string s
  – ε denotes the empty string; thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
Specification of Patterns for Tokens: String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
  s⁰ = ε
  sⁱ = sⁱ⁻¹s for i > 0
  – note that sε = εs = s
Specification of Patterns for Tokens: Language Operations
• Union
  L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
  LM = {xy | x ∈ L and y ∈ M}
  (e.g., with L = {a, b} and M = {0, 1}, LM = {a0, a1, b0, b1})
• Exponentiation
  L⁰ = {ε}; Lⁱ = Lⁱ⁻¹L
• Kleene closure
  L* = L⁰ ∪ L¹ ∪ L² ∪ …
• Positive closure
  L⁺ = L¹ ∪ L² ∪ …
Specification of Patterns for Tokens: Regular Expressions
• Basis symbols:
  – ε is a regular expression denoting the language {ε}
  – a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  – r|s is a regular expression denoting L(r) ∪ M(s)
  – rs is a regular expression denoting L(r)M(s)
  – r* is a regular expression denoting L(r)*
  – (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set
Specification of Patterns for Tokens: Regular Expressions
• Tokens are described using regular expressions.
• A regular expression over an alphabet Σ is a combination of characters from Σ and certain operators indicating concatenation, selection, or repetition:
  b*  -- 0 or more b's (Kleene star)
  b+  -- 1 or more b's
  a|b -- choice (a or b)
Specification of Patterns for Tokens: Regular Expressions
• Lexical analysis and syntactic analysis are typically table-driven.
• These tables are large and laborious to build.
• Therefore, we use a program to build the tables.
• But there are two major problems:
  – How do we represent a token for the table-generating program?
  – How does the program convert this into the corresponding FSA?
Specification of Patterns for Tokens: Regular Expressions
• REs can describe only a limited variety of languages, but they are powerful enough to be used to define tokens.
Specification of Patterns for Tokens: Regular Definitions
• Regular definitions introduce a naming convention:
  d₁ → r₁
  d₂ → r₂
  …
  dₙ → rₙ
  where each rᵢ is a regular expression over Σ ∪ {d₁, d₂, …, dᵢ₋₁}
• Any dⱼ in rᵢ can be textually substituted in rᵢ to obtain an equivalent set of definitions
Specification of Patterns for Tokens: Regular Definitions
• Example:
  letter → A|B|…|Z|a|b|…|z
  digit  → 0|1|…|9
  id     → letter ( letter | digit )*
Specification of Patterns for Tokens: Notational Shorthand
• The following shorthands are often used:
  r+    = rr*
  r?    = r | ε
  [a-z] = a|b|c|…|z
• Examples:
  digit → [0-9]
  num   → digit+ (. digit+)? ( E (+|-)? digit+ )?
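These shorthands survive almost unchanged in practical tools. As a hedged illustration (using the POSIX regex.h API, not anything from the slides), the num pattern above can be tested as an extended regular expression in C:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        /* ERE form of: digit+ (. digit+)? ( E (+|-)? digit+ )? */
        const char *pat = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
        regex_t re;
        if (regcomp(&re, pat, REG_EXTENDED) != 0)
            return 1;
        const char *tests[] = { "31", "3.14", "6.02E+23", "1.", "E5" };
        for (int i = 0; i < 5; i++)
            printf("%-8s %s\n", tests[i],
                   regexec(&re, tests[i], 0, NULL, 0) == 0 ? "num" : "no");
        regfree(&re);
        return 0;
    }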
Regular Definitions and Grammars
Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
  expr → term relop term
       | term
  term → id
       | num
Regular definitions:
  if    → if
  then  → then
  else  → else
  relop → < | <= | <> | > | >= | =
  id    → letter ( letter | digit )*
  num   → digit+ (. digit+)? ( E (+|-)? digit+ )?
Approaches to building a lexical analyzer:
1. Write a formal description of the token patterns using a descriptive language based on regular expressions.
2. Design a state-transition diagram that describes the token patterns of the language, and write a program that implements the diagram (see the sketch below).
3. Design a state-transition diagram that describes the token patterns of the language, and hand-construct a table-driven implementation of the state diagram.
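A hedged sketch of approach 2: the relop transition diagram from the previous slide hand-coded directly as C control flow (the enum and function names are illustrative):

    #include <stdio.h>

    enum relop { LT, LE, NE, GT, GE, EQ, NOT_RELOP };

    /* Hand-coded transition diagram for relop; *len receives the lexeme length. */
    enum relop scan_relop(const char *s, int *len) {
        switch (s[0]) {
        case '<':
            if (s[1] == '=') { *len = 2; return LE; }
            if (s[1] == '>') { *len = 2; return NE; }
            *len = 1; return LT;
        case '>':
            if (s[1] == '=') { *len = 2; return GE; }
            *len = 1; return GT;
        case '=':
            *len = 1; return EQ;
        default:
            *len = 0; return NOT_RELOP;
        }
    }

    int main(void) {
        int n;
        printf("%d\n", scan_relop(">=", &n)); /* prints GE's code; n == 2 */
        return 0;
    }

Each state of the diagram becomes a point in the control flow, and each transition becomes a character test; this is the hand-written counterpart of the table-driven version in approach 3.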
Implementing Lexical Analyzers
• Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (easiest to implement; least efficient)
• Programming it in a language such as C, using the I/O facilities of the language. (intermediate in ease and efficiency)
• Writing it in assembly language and explicitly managing the input. (hardest to implement, but most efficient)
The Lex and Flex Scanner Generators
• Lex and its newer cousin flex are scanner generators.
• They systematically translate regular definitions into C source code for efficient scanning.
• The generated code is easy to integrate into C applications.
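For instance, a minimal flex specification built from the regular definitions used earlier (a sketch; the printf actions and the main function are illustrative, not a fixed part of flex):

    %option noyywrap
    %{
    #include <stdio.h>
    %}
    letter  [A-Za-z]
    digit   [0-9]
    %%
    [ \t\n]+                                ; /* skip white space */
    if|then|else                            printf("KEYWORD %s\n", yytext);
    {letter}({letter}|{digit})*             printf("ID %s\n", yytext);
    {digit}+(\.{digit}+)?(E[+-]?{digit}+)?  printf("NUM %s\n", yytext);
    .                                       printf("OTHER %s\n", yytext);
    %%
    int main(void) { yylex(); return 0; }

flex translates this specification into a C file (lex.yy.c) containing the scanner tables and the yylex() routine, which can be compiled and linked like any other C source file.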
Recognizing Tokens
• The scanner must ignore white space (except to note the end of a token)
  – Add a white-space transition from the Start state to the Start state.
• When you enter an accept state, announce it
  – (therefore you cannot pass through accept states)
  – Otherwise the string matched might be the entire program.
• One accept state for each token, so we know what we found.
• Identifier/keyword differences
  – Accept everything as an identifier, and then look up keywords in a table (as sketched below).
  – Or pre-load the symbol table with the keywords.
• Character strings
  – single or double quotes?
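A hedged C sketch of the look-up strategy (the keyword list comes from the earlier grammar slide; the function name is illustrative):

    #include <stdio.h>
    #include <string.h>

    static const char *keywords[] = { "if", "then", "else" };

    /* Return 1 if the lexeme is a reserved word; otherwise it is an identifier. */
    int is_keyword(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return 1;
        return 0;
    }

    int main(void) {
        printf("%s\n", is_keyword("then")     ? "keyword" : "identifier");
        printf("%s\n", is_keyword("distance") ? "keyword" : "identifier");
        return 0;
    }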
Assignment
• Give a brief summary of the Lex/Flex lexical analyzer and use examples to show how it works.