
Chapter 2

Lexical Analysis
Recap - Compilation Sequence
(diagram of the compilation sequence not reproduced here)
Introduction
• The syntax analysis portion of a language processor
nearly always consists of two parts:
– A low-level part called a lexical analyzer (mathematically, a
finite automaton based on a regular grammar)
– A high-level part called a syntax analyzer, or parser
(mathematically, a push-down automaton based on a
context-free grammar, or BNF)
source program → lexical analyzer (scanner) → tokens → syntax analyzer (parser)
(both components communicate with the symbol table manager)
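To make the hand-off concrete, here is a minimal C sketch of this control flow; the names (next_token, TOK_EOF) are illustrative assumptions, not from the chapter:

    /* Minimal sketch (names are assumptions): the parser calls the
       scanner each time it needs the next token. */
    enum { TOK_EOF = 0, TOK_ID, TOK_NUM, TOK_ASSIGN /* ... */ };

    /* Stub scanner for illustration; a real one reads characters,
       groups them into lexemes, and returns the matching token code. */
    int next_token(void) { return TOK_EOF; }

    void parse(void) {
        for (int tok = next_token(); tok != TOK_EOF; tok = next_token()) {
            /* grammar-driven processing; symbol table consulted as needed */
        }
    }

The point is only the direction of control: the parser drives, and the scanner produces one token per call.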
Reasons to Separate Lexical and
Syntax Analysis
• Simplicity - less complex approaches can be used for lexical analysis;
– separating the two simplifies the parser
• Efficiency - separation allows optimization of the lexical analyzer
• Portability - parts of the lexical analyzer may not be portable, but the parser is always portable

Tasks of Lexical Analyzer
– scan the source-code string,
– collect characters into logical groupings (lexemes), and
– assign internal codes (tokens)

Tasks of Lexical Analyzer - cont
• The Lexical Analyzer may take care of a few
other things as well, unless they are
handled by a preprocessor:
– Removal of Comments
– Case Conversion
– Removal of White Space
– Interpretation of Compiler Directives
– Communication with the Symbol Table
– Preparation of Output Listing

Example
• Given the statement
– if distance >= rate*(end_time - start_time) then
distance := maxdist;

• The lexical analyzer must be able to isolate the
– keywords {if, then}
– identifiers {distance, rate, …}
– operators {*, -, :=}
– relational operator {>=}
– parentheses
– closing semicolon
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units
(defining an atomic element)
– For example: id and num
• Lexemes are the specific character strings that
make up a token (match some pattern)
– For example: abc and 123
• Patterns are rules describing the set of
lexemes belonging to a token
– For example: “letter followed by letters and digits”
and “non-empty sequence of digits”
Example of Tokens
TOKEN      SAMPLE LEXEMES        INFORMAL DESCRIPTION OF PATTERN
const      const                 const
if         if                    if
relation   <, <=, =, <>, >, >=   < or <= or = or <> or > or >=
id         pi, count, D2         letter followed by letters or digits
num        3.1416, 0, 6.02E23    any numeric constant
literal    "core dumped"         any characters between " and " except "
Attributes of Tokens

y := 31 + 28*x
    ↓ lexical analyzer
<id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">
    ↓ as (token, tokenval) pairs, where tokenval is the token attribute
parser
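These (token, tokenval) pairs map naturally onto a tagged struct in C. A minimal sketch; the type and field names (Token, TokenKind, attr) are my assumptions, not from the chapter:

    #include <stdio.h>

    typedef enum { ID, ASSIGN, NUM, PLUS, TIMES } TokenKind;

    typedef struct {
        TokenKind kind;        /* the token */
        union {                /* tokenval: the token attribute, if any */
            const char *name;  /* ID: lexeme or symbol-table reference  */
            int value;         /* NUM: the numeric value                */
        } attr;
    } Token;

    int main(void) {
        /* the stream produced above for  y := 31 + 28*x  */
        Token stream[] = {
            { ID,     { .name  = "y" } },
            { ASSIGN, { .name  = 0   } },
            { NUM,    { .value = 31  } },
            { PLUS,   { .name  = 0   } },
            { NUM,    { .value = 28  } },
            { TIMES,  { .name  = 0   } },
            { ID,     { .name  = "x" } },
        };
        printf("%zu tokens\n", sizeof stream / sizeof stream[0]);
        return 0;
    }

A real scanner would fill in such records one at a time; the array form here is only to show the stream for this example.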
Examples of Non-Tokens

Type                    Examples
comment                 /* ignored */
preprocessor directive  #include <foo.h>
                        #define NUMS 5, 6
macro                   NUMS
whitespace              \t \n \b
Buffering
• In principle, the analyzer goes through the source string one character at a time;
• In practice, it must be able to access substrings of the source.
• Hence the source is normally read into a buffer.
• The scanner needs two subscripts to note places in the buffer:
– the lexeme start and the current position (sketched below)
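A rough C sketch of the two-subscript scheme; the buffer size, the names (lexeme_start, forward), and the single-buffer layout are assumptions (production scanners usually use a pair of buffers with sentinel characters):

    #include <stdio.h>
    #include <string.h>

    #define BUFSIZE 4096

    static char buf[BUFSIZE];        /* the source text                   */
    static int  lexeme_start = 0;    /* first character of current lexeme */
    static int  forward      = 0;    /* current scan position             */

    /* Copy out the lexeme once forward has passed its last character. */
    static void get_lexeme(char *out, size_t outsize) {
        size_t len = (size_t)(forward - lexeme_start);
        if (len >= outsize) len = outsize - 1;
        memcpy(out, buf + lexeme_start, len);
        out[len] = '\0';
        lexeme_start = forward;      /* next lexeme starts here */
    }

    int main(void) {
        strcpy(buf, "distance>=rate");
        while (buf[forward] && buf[forward] != '>')  /* scan an identifier */
            forward++;
        char lexeme[64];
        get_lexeme(lexeme, sizeof lexeme);
        printf("lexeme: %s\n", lexeme);              /* prints: distance */
        return 0;
    }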
Finite State Automata
• The compiler writer defines tokens in the
language by means of regular expressions.
• The lexical analyzer is best implemented as a
finite state machine or a finite state
automaton.

Example - Finite State Automata
(state-transition diagram not reproduced here)

Transition Table
(transition table not reproduced here)
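Since neither the diagram nor the table survived reproduction, here is a hedged stand-in: a table-driven automaton in C for id → letter ( letter | digit )*. The state names and table layout are illustrative assumptions:

    #include <ctype.h>
    #include <stdio.h>

    enum { START = 0, IN_ID = 1, DEAD = 2 };   /* IN_ID is the accepting state */

    /* Map a character to a class: 0 = letter, 1 = digit, 2 = other. */
    static int char_class(int c) {
        if (isalpha(c)) return 0;
        if (isdigit(c)) return 1;
        return 2;
    }

    /* transition[state][class] = next state */
    static const int transition[3][3] = {
        /*           letter  digit  other */
        /* START */ { IN_ID, DEAD,  DEAD },
        /* IN_ID */ { IN_ID, IN_ID, DEAD },
        /* DEAD  */ { DEAD,  DEAD,  DEAD },
    };

    static int is_identifier(const char *s) {
        int state = START;
        for (; *s; s++)
            state = transition[state][char_class((unsigned char)*s)];
        return state == IN_ID;
    }

    int main(void) {
        printf("%d %d\n", is_identifier("D2"), is_identifier("2D"));  /* 1 0 */
        return 0;
    }

The table is the whole machine: recognition is just repeated indexing, which is why scanner generators emit exactly this shape of code.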
Specification of Patterns for
Tokens: Definitions
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

Specification of Patterns for
Tokens: String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by

s^0 = ε
s^i = s^(i-1)s for i > 0

note that sε = εs = s

Specification of Patterns for
Tokens: Language Operations
• Union
L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation
LM = { xy | x ∈ L and y ∈ M }
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ∪_{i=0,…,∞} L^i
• Positive closure
L+ = ∪_{i=1,…,∞} L^i

Specification of Patterns for
Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting the language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set
Specification of Patterns for
Tokens: Regular Expressions
• Tokens are described using regular
expressions.
• A regular expression over an alphabet Σ is a combination of characters from Σ and certain operators indicating concatenation, selection, or repetition:
– b* denotes 0 or more b's (Kleene star)
– b+ denotes 1 or more b's
– a|b denotes a choice between a and b
Specification of Patterns for
Tokens: Regular Expressions
• Lexical analysis and syntax analysis are typically table-driven.
• These tables are large and laborious to build by hand.
• Therefore, we use a program to build the tables.
• But there are two major problems:
– How do we represent a token for the table
generating program?
– How does the program convert this into the
corresponding FSA?
Specification of Patterns for
Tokens: Regular Expressions
• REs can be used to describe only a limited
variety of languages, but they are powerful
enough to be used to define tokens.

• One limitation: many languages put length limitations on their tokens;
– REs have no means of enforcing such limitations.

Specification of Patterns for
Tokens: Regular Definitions
• Regular definitions introduce a naming convention:
d1 → r1
d2 → r2
…
dn → rn

where each ri is a regular expression over
Σ ∪ { d1, d2, …, di-1 }
• Any dj in ri can be textually substituted in ri to obtain
an equivalent set of definitions

Specification of Patterns for
Tokens: Regular Definitions
• Example:

letter → A|B|…|Z|a|b|…|z
digit → 0|1|…|9
id → letter ( letter | digit )*

• Regular definitions are not recursive:

digits → digit digits | digit        wrong!

Specification of Patterns for
Tokens: Notational Shorthand
• The following shorthands are often used:

r+ = rr*
r? = r | ε
[a-z] = a|b|c|…|z

• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

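As a rough cross-check, the num definition carries over almost symbol-for-symbol to a POSIX extended regular expression. The pattern string below is my own rendering of the definition, using the standard <regex.h> API:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        /* num -> digit+ (. digit+)? ( E (+|-)? digit+ )?  as a POSIX ERE */
        regex_t re;
        regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$", REG_EXTENDED);

        const char *samples[] = { "3.1416", "0", "6.02E23", "3.", "E23" };
        for (int i = 0; i < 5; i++)
            printf("%-8s %s\n", samples[i],
                   regexec(&re, samples[i], 0, NULL, 0) == 0 ? "num" : "not num");
        regfree(&re);
        return 0;
    }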
Regular Definitions and Grammars
Grammar:

stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num

Regular definitions:

if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Approaches to building a lexical
analyzer:
1. Write a formal description of the token patterns using a descriptive language related to REs.
2. Design a state transition diagram that describes the token patterns of the language, and write a program that implements the diagram.
3. Design a state transition diagram that describes the token patterns of the language, and hand-construct a table-driven implementation of the state diagram.

Implementing Lexical Analyzers
• Using a scanner generator, e.g., lex or flex.
– This automatically generates a lexical analyzer
from a high-level description of the tokens.
(easiest to implement; least efficient)
• Programming it in a language such as C, using
the I/O facilities of the language.
(intermediate in ease, efficiency)
• Writing it in assembly language and explicitly
managing the input.
(hardest to implement, but most efficient)
The Lex and Flex Scanner
Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications

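A minimal flex specification in this spirit; the token codes and the particular rules are illustrative assumptions rather than anything prescribed by the chapter:

    %{
    /* illustrative token codes for the parser */
    enum { TOK_IF = 256, TOK_THEN, TOK_ID, TOK_NUM };
    %}
    %option noyywrap

    letter  [A-Za-z]
    digit   [0-9]

    %%
    [ \t\n]+                                  ;  /* skip white space */
    "if"                                      return TOK_IF;
    "then"                                    return TOK_THEN;
    {letter}({letter}|{digit})*               return TOK_ID;
    {digit}+(\.{digit}+)?(E[+-]?{digit}+)?    return TOK_NUM;
    .                                         return yytext[0];
    %%

flex translates such a file into C (lex.yy.c); the generated yylex() returns the next token code each time the parser calls it.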
Recognizing Tokens
• The scanner must ignore white space (except
to note the end of a token)
– Add white space transition from Start state to
Start state.
• When you enter an accept state, announce it
– (therefore you cannot pass through accept states)
– The string may be the entire program.

• One accept state for each token, so we
know what we found.

• Identifier/Keyword differences
– Accept everything as an identifier, and then
look up keywords in table.
– Or pre-load the Symbol Table with Keywords.

• When you read an identifier, you must read the next character in order to tell that the identifier has ended.
• You then need to back up (put that character back on the input stream), as in the sketch below.
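A C sketch combining both ideas, with an assumed three-entry keyword table; the one-character lookahead is pushed back with ungetc:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static const char *keywords[] = { "if", "then", "else" };

    /* Read a word from stdin; returns 1 for a keyword, 0 for an identifier. */
    static int read_word(char *out, size_t outsize) {
        size_t n = 0;
        int c = getchar();
        while (c != EOF && (isalnum(c) || c == '_')) {
            if (n + 1 < outsize) out[n++] = (char)c;
            c = getchar();               /* read one character past the end... */
        }
        out[n] = '\0';
        if (c != EOF) ungetc(c, stdin);  /* ...then back up: push it back      */
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(out, keywords[i]) == 0) return 1;
        return 0;
    }

    int main(void) {
        char word[64];
        int kw = read_word(word, sizeof word);
        printf("%s: %s\n", word, kw ? "keyword" : "identifier");
        return 0;
    }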
• Comments
– Recognize the beginning of a comment, and then ignore everything until the end of the comment (see the sketch after this slide).
– What if there are multiple types of comments?

• Character Strings
– single or double quotes?

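A C sketch of the "recognize the opening, ignore until the close" rule for block comments; stream handling is simplified and nested comments are deliberately not handled:

    #include <stdio.h>

    /* Discard a block comment, assuming the opening delimiter was just read. */
    static void skip_comment(FILE *in) {
        int prev = 0, c;
        while ((c = getc(in)) != EOF) {
            if (prev == '*' && c == '/')   /* closing delimiter reached */
                return;
            prev = c;
        }
        /* EOF inside a comment: a real scanner would report an error here */
    }

    int main(void) {
        skip_comment(stdin);               /* discard up to the first close */
        int c;
        while ((c = getchar()) != EOF)     /* echo the rest of the input    */
            putchar(c);
        return 0;
    }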
Assignment
• Give a brief summary of the Lex/Flex lexical
analyzer and use examples to show how it
works

