Lecture (2)

Lexical Analysis (Part 1)

Dr. Wafaa Samy

CSE439: Design of Compilers (Spring 2024)

Contents
• Lexical Analysis
• Issues in Compiler Design
• Regular Expressions

Lexical Analysis
• The first phase of a compiler.
• The scanner, or lexical analyzer, performs the lexical analysis.
• The main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes,
and produce as output a sequence of tokens (i.e. a token for
each lexeme in the source program).
• When the lexical analyzer discovers a lexeme constituting an
identifier, it enters that lexeme into the symbol table.
• The stream of tokens is sent to the parser for syntax analysis.
• The lexical analyzer may perform other tasks besides
identifying lexemes, such as stripping out comments and
whitespace (blank, newline, tab, etc.).
[Figure: Source Code → Scanner → Parser]

Lexical Analysis (Cont.)
• What do we want to do?
• Example:
    if (i == j)
        z = 0;
    else
        z = 1;

• The input is just a string of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition input string into substrings (lexemes) to identify tokens.
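
As a minimal sketch of this goal, the snippet below partitions the character string from the slide into lexemes. The `LEXEME` pattern is a hypothetical one chosen only for this example (identifiers or keywords, integers, `==`, or any other single non-blank character); it is not a full C scanner.

```python
import re

# The input exactly as the scanner sees it: one flat string of characters.
source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"

# Hypothetical lexeme pattern for this example only. Whitespace matches
# none of the alternatives, so findall simply skips over it.
LEXEME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|==|\S")

lexemes = LEXEME.findall(source)
print(lexemes)
```

Listing `==` before the single-character fallback `\S` matters: otherwise the two equal signs would be split into two separate lexemes.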

[Figure: Source program text → Scanner → Tokens]
Example (1)
• Consider the following line of code, which could be part of a C program:
a[index] = 4 + 2
• This code contains 12 non-blank characters but only 8 lexemes.
• The lexemes identified and tokens produced by the scanner are as follows:
Lexeme   Description     Token
a        identifier      < id , pointer to symbol-table entry for a >
[        left bracket    <[>
index    identifier      < id , pointer to symbol-table entry for index >
]        right bracket   <]>
=        assignment      <=>
4        number          < number , integer value 4 >
+        plus sign       <+>
2        number          < number , integer value 2 >
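
The table above can be reproduced with a small scanner. This is a sketch under stated assumptions: `TOKEN_SPEC`, `scan`, and the use of a list index as the "pointer" into the symbol table are all illustrative choices, not part of the lecture.

```python
import re

# Hypothetical token specification: (token name, pattern), tried in order.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("punct",  r"[\[\]=+]"),   # tokens that need no attribute value
    ("skip",   r"\s+"),        # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return (tokens, symbol_table); id tokens carry a symbol-table index."""
    symbol_table = []          # identifier lexemes; the index plays the role of the pointer
    tokens = []
    # Note: finditer silently skips characters that match no pattern;
    # a real scanner would report them as lexical errors.
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "skip":
            continue
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))
    return tokens, symbol_table

tokens, table = scan("a[index] = 4 + 2")
for tok in tokens:
    print(tok)
```

The 12 non-blank characters yield 8 tokens, and the symbol table ends up with one entry each for `a` and `index`.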
Tokens, Patterns, and Lexemes
• Three related but distinct terms.
• A token is a pair consisting of a token name and an
optional attribute value.
< token-name, attribute-value >
o The token name is an abstract symbol representing a kind of
lexical unit, e.g. a particular keyword, or a sequence of input
characters denoting an identifier.
• A pattern is a description of the form that the lexemes of
a token may take.
o For example, in a case-insensitive language, the lexemes
associated with the IF token are: if , IF , iF , and If.
• A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that
token.
Examples:
• The sequence of characters “static int” is recognized as two
tokens, representing the two lexemes “static” and “int”.
• The sequence of characters “*x++” is recognized as three
tokens, representing the lexemes “*”, “x” and “++”.
Tokens, Patterns, and Lexemes (Cont.)
• The table shows some typical tokens, their informally described patterns,
and some sample lexemes.

Typical Tokens in Programming Languages
• In many programming languages, the following classes cover most or all of the
tokens:
1. One token for each keyword.
o The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison (as in the table on the previous slide).
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal
strings.
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
Examples of Tokens and Non-Tokens
• Examples of Tokens:
o Operators = + - > <= == < >
o Punctuation ( ) { } , ;
o Keywords if while for int double
o Numeric 43 6.035 -3.6e10 0x13F3A
o Character literals ‘a’ ‘~’ ‘\’’
o String literals “3.142” “aBcDe” “\”
o Identifiers first total f

• Examples of non-tokens:
o White space space(‘ ’) tab(‘\t’) eoln(‘\n’)
o Comments /*this is not a token*/
Example (2)
• Consider the following C statement:
printf("Total = %d\n", score);
• Both printf and score are lexemes matching the pattern for token id,
• "Total = %d\n" is a lexeme matching literal, and
• ( , ) ; are lexemes matching punctuation symbols.

Attributes for Tokens
• The string of characters represented by a token is called its string
value or its lexeme.
o Some tokens have only one lexeme such as reserved words (i.e.
keywords).
o A token may represent many lexemes, such as identifiers. They are all
represented by the token id, but they have many different string values.
• If more than one lexeme can match the pattern for a token, the
scanner must indicate the actual lexeme that matched.
o E.g. the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was
found in the source program.
o Any value associated with a token is called an attribute of the token, and the
string value is an example of an attribute.
Attributes for Tokens (Cont.)
• Attribute: “Value of interest” about a token.
o Numerical value of an integer token.
o Name (string) associated with an identifier token.
• For the token id, we need to associate information about the
identifier with the token, such as its lexeme.
• This information is kept in the symbol table.
• Thus, the appropriate attribute value for an
identifier is a pointer to the symbol-table
entry for that identifier.
Symbol Table
• Records the identifiers used in the source program.
• Collects various associated information as attributes:
o Variables: type, scope, storage allocation, etc.
o Procedure: number and types of arguments, method of
argument passing, etc.
• It is a data structure holding a collection of records.
o Different fields are collected and used at different phases
of compilation.
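
One way to picture the symbol table is as a list of records plus a lookup index. The sketch below is illustrative only: the names `SymbolEntry`, `SymbolTable`, and `install`, and the choice of `type` and `scope` as the extra fields, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SymbolEntry:
    lexeme: str                     # recorded by the scanner
    type: Optional[str] = None      # filled in by a later phase
    scope: Optional[str] = None     # filled in by a later phase

class SymbolTable:
    def __init__(self):
        self.entries = []           # the collection of records
        self._index = {}            # lexeme -> position in entries

    def install(self, lexeme):
        """Return the entry index for lexeme, creating the record on first sight."""
        if lexeme not in self._index:
            self._index[lexeme] = len(self.entries)
            self.entries.append(SymbolEntry(lexeme))
        return self._index[lexeme]

table = SymbolTable()
ptr = table.install("count")        # scanner emits < id , ptr >
table.entries[ptr].type = "int"     # a later phase adds the type
print(table.entries[ptr])
```

Installing the same lexeme twice returns the same index, so every occurrence of an identifier points at one shared record.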
Example (3)
• Consider the following program statement:

count = 545

• yields the following token-attribute pairs:

< id , pointer to symbol-table entry for count >
< assign-op >
< number , integer value 545 >

Example (4)
• The token names and associated attribute values for the Fortran statement:

E = M * C ** 2

are written below as a sequence of pairs:

< id , pointer to symbol-table entry for E >
< assign-op >
< id , pointer to symbol-table entry for M >
< mult-op >
< id , pointer to symbol-table entry for C >
< exp-op >
< number , integer value 2 >

• Note that in certain pairs, especially operators, punctuation, and keywords, there is no
need for an attribute value.
• In this example, the token number has been given an integer-valued attribute.
Issues in Compiler Design
• Compilation appears to be very simple, but there are many pitfalls:
o Design of programming languages has a big impact on the complexity of the
compiler.
o How are erroneous programs handled?

Tricky Problems when recognizing Tokens
• FORTRAN rule: Whitespace is insignificant. Blanks are ignored.
o E.g. VAR1 is the same as VA R1
• Consider the following Fortran statements:
o DO 5 I = 1,25
o DO 5 I = 1.25
• The second statement is DO5I = 1.25
o It is not apparent that the first lexeme is DO5I, an identifier, until
we see the dot following the 1.
• The first statement is DO 5 I = 1 , 25
o If we see a comma instead of the dot, we have a do-statement in
which the first lexeme is the keyword DO.

Fortran DO-Statement
DO a b = c , d , e
a CONTINUE
a – line label
b – control variable
c – start value
d – end value
e – increment value (optional)
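
The DO example can be sketched in a few lines. The function below is a deliberately simplified illustration, not how a real Fortran scanner works: `classify_fortran` is a hypothetical helper, it assumes uppercase input, and it would be fooled by commas nested inside parentheses on the right-hand side.

```python
import re

def classify_fortran(stmt):
    """Classify a statement as a DO loop or an assignment (simplified sketch)."""
    compact = stmt.replace(" ", "")      # Fortran rule: blanks are ignored
    # A DO statement has the shape DO<label><var>=<start>,<end>[,<step>].
    # Only the comma after the start value separates it from DO5I=1.25.
    m = re.fullmatch(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)", compact)
    if m:
        return ("do-statement", m.group(1), m.group(2))
    name, _, value = compact.partition("=")
    return ("assignment", name, value)

print(classify_fortran("DO 5 I = 1,25"))
print(classify_fortran("DO 5 I = 1.25"))
```

The decision cannot be made when `DO` is read; the whole statement up to the comma or dot has to be examined first, which is exactly the lookahead problem described above.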
Tricky Problems when recognizing Tokens (Cont.)
• The previous slide illustrates a problem when recognizing tokens, but there are
many situations where we need to look at least one additional character ahead.

• Examples:
o Even simple examples have lookahead issues.
▪ We cannot be sure we've seen the end of an identifier until we see a
character that is not a letter or digit, and therefore is not part of the
lexeme for id.
i vs. if
▪ In C, single-character operators like -, =, or < could also be the beginning
of a two-character operator like ->, ==, or <=.
= vs. ==
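
Both lookahead cases can be sketched with one character of lookahead. The helper names `next_operator` and `scan_identifier` and the small `TWO_CHAR_OPS` set are assumptions for this example; a real C scanner handles many more operators.

```python
# Hypothetical subset of C's two-character operators.
TWO_CHAR_OPS = {"->", "==", "<="}

def next_operator(text, pos):
    """Return the operator lexeme at text[pos], peeking one character ahead."""
    pair = text[pos:pos + 2]
    if pair in TWO_CHAR_OPS:
        return pair              # prefer the longer operator
    return text[pos]

def scan_identifier(text, pos):
    """Consume letters and digits until a character that cannot extend the lexeme."""
    end = pos
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[pos:end]

print(next_operator("a==b", 1), next_operator("a=b", 1))
print(scan_identifier("i=0", 0), scan_identifier("if(x)", 0))
```

In both helpers the scanner only knows where a lexeme ends after seeing a character that does not belong to it.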

Tricky Problems when recognizing Tokens (Cont.)
• PL/I keywords are not reserved, so they can also be used as identifiers:

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN

• In Ada, array reference syntax and function call syntax are similar.
arr(2,3) vs. fn(1,2)
• In C++, array reference syntax and function call syntax are different.
arr[2][3] vs. fn(1,2)

Lexical Errors
• In what situations do errors occur?
o The lexical analyzer is unable to proceed because none of the patterns for tokens
matches a prefix of the remaining input.
• Sometimes, it is hard for a lexical analyzer to tell, without the aid of other
components, that there is a source-code error.
• For instance, if the string fi is encountered for the first time in a C program in the
context:

fi ( a == f(x) ) ...

• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the
token id to the parser and let some other phase of the compiler (e.g. the parser)
handle an error due to transposition of the letters.
Error Detection, Recovery and Reporting
• Each phase can encounter errors.
• Specific types of errors can be detected by specific phases.
• Examples:
o Lexical Error: int abc, 1num;
o Syntax Error: total = capital + rate year;
o Semantic Error: value = myarray [realIndex];
• The compiler should be able to proceed and process the rest of the program after
an error is detected.
• The compiler should be able to link the error with the source program.
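
The two requirements above (keep going after an error, and tie the error to a source position) can be sketched for the lexical phase as follows. The token set in `TOKEN` and the recovery strategy (skip the single offending character) are assumptions chosen for this illustration; other recovery strategies exist.

```python
import re

# Hypothetical token patterns for a tiny language.
TOKEN = re.compile(
    r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>[=+\-*/;,()\[\]])"
    r"|(?P<ws>[ \t]+)|(?P<nl>\n)"
)

def scan_with_errors(text):
    """Return (tokens, errors); each error is (line, column, character)."""
    tokens, errors = [], []
    pos, line, col = 0, 1, 1
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # No pattern matches a prefix of the remaining input:
            # record where it happened, drop one character, carry on.
            errors.append((line, col, text[pos]))
            pos += 1
            col += 1
            continue
        kind = m.lastgroup
        if kind == "nl":
            line += 1
            col = 1
        else:
            if kind != "ws":
                tokens.append((kind, m.group()))
            col += len(m.group())
        pos = m.end()
    return tokens, errors

tokens, errors = scan_with_errors("x = 1 @ 2;\ny = $3;")
print(errors)
```

Scanning continues past both bad characters, so the remaining tokens are still delivered and each error carries the line and column needed for a useful message.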

Lexical Analyzer
• Designing a Lexical Analyzer:
1. Define a finite set of tokens.
o Tokens describe all items of interest, like identifiers, integers, keywords, etc.
2. Describe which strings belong to each token (i.e. the pattern defines the strings).
• Implementing a Lexical Analyzer:
1. Recognize tokens from the corresponding lexemes.
o Partition the input string by reading left-to-right, recognizing one token at a time. Lookahead is
sometimes required to decide where one token ends and the next token begins.
▪ Partition the input string into lexemes.
▪ Identify the token of each lexeme.
2. Return the type of the token and the value (attribute).
o 6036 → < number , integer value 6036 >
o X6035 → < id , pointer to symbol-table entry for X6035 >
3. Eliminate whitespace, comments, etc. that do not contribute to parsing.
Specification of Tokens
• We need:
o A way to describe the lexemes of each token.
o A way to resolve ambiguities.
▪ Is if two variables i and f?
▪ Is == two equal signs = =?

• Typically, lexemes associated with a token (type) form a regular
language. So, use Regular Expressions to specify tokens.
o Regular expressions are an important notation for specifying lexeme patterns.

Languages
• Languages are sets of strings.
• Def.
o Let alphabet Σ be a set of characters.
o A language over Σ is a set of strings of characters drawn from Σ.

• Examples of Languages:
1. Alphabet = English characters
Language = English sentences
Not every string of English characters is an English sentence.

2. Alphabet = ASCII
Language = C programming language
Note: ASCII character set is different from English character set.
Regular Expressions
• Regular expressions represent patterns of strings of characters.
• A regular expression r is defined by the set of strings that it matches.
This set is called the language generated by the regular expression
and is written L(r).
• This language depends on the character set (symbols) that is
available.
o The set of legal symbols is called the alphabet Σ.
o An alphabet Σ is any finite set of symbols.
▪ Typical examples of symbols are letters, digits, and punctuation.

Regular Expressions (Cont.)
• Basic regular expressions:
• Given any character a from the alphabet Σ, L(a) = { a }.

• The empty string ε: the string that contains no characters at all.

L(ε) = {ε} = {“”}.

• ∅ is the symbol that matches no strings at all. Its language is the
empty set { }.

L(∅) = { }.
Regular Expressions (Cont.)
• Choice among alternatives:
If r and s are regular expressions,
r|s is a regular expression.
L(r|s) = L(r) ∪ L(s) (union)
• Concatenation:
If r and s are regular expressions,
rs is a regular expression.
L(rs) = L(r)L(s)
• Shorthand: A regular expression a1 | a2 | … | an , where
the ai's are each symbols of the alphabet, can
be replaced by the shorthand [a1 a2 … an].
E.g. [abc] is equivalent to a | b | c
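
Because a language is just a set of strings, the union and concatenation rules can be tried out directly on small finite languages. This is only an illustration using Python sets; the helper names `union` and `concat` are chosen for this example.

```python
def union(l1, l2):
    """L(r|s) = L(r) ∪ L(s)."""
    return l1 | l2

def concat(l1, l2):
    """L(rs) = every string of L(r) followed by every string of L(s)."""
    return {x + y for x in l1 for y in l2}

L_a = {"a"}          # L(a)
L_b = {"b"}          # L(b)
L_eps = {""}         # the language containing only the empty string
L_empty = set()      # the empty language: no strings at all

print(union(L_a, L_b))       # the language of a|b
print(concat(L_a, L_b))      # the language of ab
print(concat(L_a, L_eps))    # concatenating with the empty string changes nothing
print(concat(L_a, L_empty))  # concatenating with the empty language gives no strings
```

The last two lines show why the empty string and the empty language must be kept apart: one is a neutral element for concatenation, the other wipes the result out.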

Regular Expressions (Cont.)
• Repetition:
r* indicates zero or more occurrences of r (Kleene closure)
L(r*) = L(r)*
• One or more repetitions:
r+ indicates one or more occurrences of r (positive closure)
r+ = r r*
• Optional sub-expressions:
r? indicates zero or one occurrence of r; strings matched by r are optional.
r? = r | ε
Example: A number may or may not have a leading sign.
natural = [0-9]+
signedNatural = (+ | -)? natural
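
The signedNatural definition carries over almost directly to Python's `re` module. One practical difference, noted as a property of that library rather than of the lecture notation: `+` is itself a repetition operator there, so the literal plus sign has to be escaped. The name `is_signed_natural` is hypothetical.

```python
import re

natural = r"[0-9]+"
signed_natural = rf"(\+|-)?{natural}"   # literal '+' must be escaped in Python's re

def is_signed_natural(s):
    """True if the whole string s matches signedNatural."""
    return re.fullmatch(signed_natural, s) is not None

for s in ["42", "+42", "-7", "", "+", "--1"]:
    print(repr(s), is_signed_natural(s))
```

Using `fullmatch` rather than `search` matters here: the definition describes the entire lexeme, not a substring of it.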
Example (5)
• Consider the alphabet consisting of three alphabetic characters:
Σ = {a, b, c}
• Consider the set of all strings over this alphabet that contain exactly
one b.
• This set is generated by the regular expression:
(a | c)* b (a | c)*
• All the following strings are matched by the above regular expression:
b , abc , abaca , baaac , ccbaca , ....
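
The listed strings can be checked mechanically. The sketch below writes the expression without spaces, since Python's `re` module treats a space as a literal character; `has_exactly_one_b` is a name made up for this example.

```python
import re

EXACTLY_ONE_B = re.compile(r"(a|c)*b(a|c)*")

def has_exactly_one_b(s):
    """True if the whole string s is matched by (a|c)* b (a|c)*."""
    return EXACTLY_ONE_B.fullmatch(s) is not None

for s in ["b", "abc", "abaca", "baaac", "ccbaca", "ac", "bb", "abcb"]:
    print(s, has_exactly_one_b(s))
```

The last three strings are added here as counterexamples: they contain zero or two b's and are rejected.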
Regular Expressions (Cont.)
• Any character:
A meta-character that is used to express a match of any character is the period
“.”
Example:
A regular expression for all strings that contain at least one b is:
.* b .*
• A range of characters:
Use square brackets and a hyphen.
[a-z] for lowercase letters
[0-9] for the digits
[a-zA-Z] represents all lowercase and uppercase letters.
[a-z] is equivalent to a | b | . . . | z
• Negated character class:
[^A-Z] matches any character EXCEPT an uppercase letter.
Precedence of Operations
* , + , ? are given the highest precedence,
concatenation is given the next highest,
and | is given the lowest.

• For example:
a|bc* is interpreted as a|(b(c*))

ab|c*d is interpreted as (ab)|((c*)d)

a b* = a (b*)
If you want (a b)* you must use parentheses.
a | b c = a | (b c)
If you want (a | b) c you must use parentheses.
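
Python's `re` module follows the same precedence rules, so the interpretations above can be spot-checked by comparing two expressions on every short string. This is a bounded check, not a proof, and `same_language` is a hypothetical helper written for this example.

```python
import re
from itertools import product

def same_language(r1, r2, alphabet, max_len=5):
    """Compare two regexes on every string over alphabet up to max_len characters."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(r1, s)) != bool(re.fullmatch(r2, s)):
                return False
    return True

print(same_language("a|bc*", "a|(b(c*))", "abc"))         # same grouping
print(same_language("a|bc*", "(a|b)c*", "abc"))           # different: "ac" separates them
print(same_language("ab|c*d", "(ab)|((c*)d)", "abcd"))    # same grouping
print(same_language("ab*", "(ab)*", "ab"))                # different: "" separates them
```

The two mismatches show that the parentheses in (a | b) c and (a b)* are not optional: leaving them out changes the language.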
Algebraic Laws for Regular Expressions
[Table: algebraic laws for regular expressions]
Example (6)
• Consider the alphabet consisting of three alphabetic characters:
Σ = {a, b, c}
• Consider the set of all strings over this alphabet that contain at most
one b.
• This set is generated by the regular expression:
(a | c)* | (a | c)* b (a | c)*
• An alternative solution:
(a | c)* (b | ε) (a | c)*
Example (7)
• Consider the strings over the alphabet
Σ = {a, b, c} that contain no two consecutive b’s.
• The regular expression is:
(a | c | ba | bc)* (b | ε)
• An alternative solution:
(b | ε) (a | c | ab | cb)*
• An alternative solution:
(notb | b notb)* (b | ε)
where notb = a | c
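
That the three solutions describe the same language is not obvious at a glance, so a brute-force comparison is a useful sanity check. The sketch below tests every string over {a, b, c} up to length 6 against the plain-language property "contains no bb". It is a bounded check rather than a proof; ε is written as an empty alternative, which Python's `re` accepts.

```python
import re
from itertools import product

candidates = [
    r"(a|c|ba|bc)*(b|)",
    r"(b|)(a|c|ab|cb)*",
    r"((a|c)|b(a|c))*(b|)",     # notb expanded to (a|c)
]

def mismatches(pattern, max_len=6):
    """Strings up to max_len where the regex disagrees with 'no two consecutive b's'."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            accepted = re.fullmatch(pattern, s) is not None
            if accepted != ("bb" not in s):
                bad.append(s)
    return bad

for p in candidates:
    print(p, mismatches(p))
```

An empty mismatch list for each pattern means all three agree with the property on every string tested.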
Example (8)
• Consider the alphabet:
Σ = {a, b, c}
• and the regular expression:
((b | c)* a (b | c)* a)* (b | c)*

• Describe the language it generates.

• This generates the language of all strings containing an even number
of a’s.
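
The claimed description can be checked the same way: enumerate short strings and compare the regular expression against a direct count of a's. The helper name `agrees_with_even_a` is an assumption for this sketch, and the check is bounded to strings of length at most 6.

```python
import re
from itertools import product

EVEN_A = re.compile(r"((b|c)*a(b|c)*a)*(b|c)*")

def agrees_with_even_a(max_len=6):
    """True if, on every string up to max_len, matching equals 'even number of a's'."""
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            accepted = EVEN_A.fullmatch(s) is not None
            if accepted != (s.count("a") % 2 == 0):
                return False
    return True

print(agrees_with_even_a())
```

Each pass through the outer star consumes exactly two a's, and the trailing (b|c)* consumes none, which is why the total is always even. Zero counts as even, so strings without any a are included.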

Regular Expressions for Programming Language Tokens
• Numbers:
• Numbers can be sequences of digits (natural numbers), or decimal
numbers, or numbers with an exponent.

nat = [0-9]+
signedNat = (+ | -)? nat
number = signedNat (“.” nat)? (E signedNat)?
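
These three definitions translate nearly one-to-one into Python's `re` syntax by building each pattern from the previous one. The escapes for `+` and `.` are requirements of that library; `is_number` is a name chosen for this sketch. As written in the definition, only an uppercase E introduces the exponent.

```python
import re

nat = r"[0-9]+"
signed_nat = rf"[+-]?{nat}"
number = rf"{signed_nat}(\.{nat})?(E{signed_nat})?"

def is_number(s):
    """True if the whole string s matches the number pattern."""
    return re.fullmatch(number, s) is not None

for s in ["42", "-3.14", "6.02E23", "1E-9", ".5", "1.", "1e5"]:
    print(s, is_number(s))
```

The rejected cases follow from the definition: a fraction needs digits on both sides of the dot, and lowercase e is not part of this pattern.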

Regular Expressions for Programming Language Tokens (Cont.)
• Reserved words:
Reserved = if | while | do | …

• Identifiers:
• Identifier must begin with a letter and contain only letters and digits.

letter = [a-zA-Z]
digit = [0-9]
identifier = letter( letter | digit)*
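
Every reserved word also matches the identifier pattern, so a scanner needs a rule to separate them. One common approach, shown here as an assumption rather than as the lecture's prescribed method, is to match the identifier pattern first and then look the lexeme up in a table of reserved words. The `RESERVED` set contains only the three words named above.

```python
import re

letter = r"[a-zA-Z]"
digit = r"[0-9]"
identifier = rf"{letter}({letter}|{digit})*"

RESERVED = {"if", "while", "do"}    # only the words listed on the slide

def classify(lexeme):
    """Return ('keyword', lexeme), ('id', lexeme), or None if it is neither."""
    if re.fullmatch(identifier, lexeme) is None:
        return None
    kind = "keyword" if lexeme in RESERVED else "id"
    return (kind, lexeme)

for lexeme in ["while", "whileX", "x1", "1x"]:
    print(lexeme, classify(lexeme))
```

Because the whole lexeme is matched before the lookup, `whileX` is an ordinary identifier and not the keyword `while` followed by `X`.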

References
• Compilers: Principles, Techniques, and Tools, by Aho, Lam, Sethi, and
Ullman, 2nd edition, 2007. (Chapter 3)

• Compiler Construction: Principles and Practice, by Kenneth C. Louden,
1997, PWS Publishing Company, ISBN 0-534-93972-4. (Chapter 2)
