Lecture (2)

Lexical Analysis (Part 1)

Dr. Wafaa Samy

CSE439: Design of Compilers (Spring 2024)


Contents
• Lexical Analysis
• Issues in Compiler Design
• Regular Expressions

Lexical Analysis
• The first phase of a compiler.
• The scanner or lexical analyzer performs the lexical analysis.
• The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens (i.e. a token for each lexeme in the source program).
• When the lexical analyzer discovers a lexeme constituting an identifier, it enters that lexeme into the symbol table.
• The stream of tokens is sent to the parser for syntax analysis.
• The lexical analyzer may perform other tasks besides identification of lexemes, like stripping out comments and whitespace (blank, newline, tab, etc.).

[Figure: Source Code → Scanner → Parser]

Lexical Analysis (Cont.)
• What do we want to do?
• Example:
if (i == j)
    z = 0;
else
    z = 1;

• The input is just a string of characters:


\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition input string into substrings (lexemes) to identify tokens.
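The partitioning step can be sketched in a few lines of Python. This is only an illustration under assumptions: the `LEXEME` pattern and the `partition` helper are hypothetical names, and the pattern covers just the characters that occur in this example.

```python
import re

# Hypothetical lexeme pattern for this tiny example only: identifiers or
# keywords, integers, the two-character operator "==", and single punctuation.
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|==|[=;()]")

def partition(source: str) -> list[str]:
    """Return the lexemes of `source` in order.

    Whitespace matches none of the alternatives, so findall skips it.
    (It would also silently skip any other unmatched character; a real
    scanner would report that as a lexical error instead.)
    """
    return LEXEME.findall(source)

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
lexemes = partition(source)
print(lexemes)
```

Note that `==` is listed before `=` in the pattern so that the two-character operator is preferred over two separate assignment signs.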


[Figure: Source program text → Scanner → Tokens]
Example (1)
• Consider the following line of code, which could be part of a C program:
a[index] = 4 + 2
• This code contains 12 non-blank characters but only 8 lexemes.
• The lexemes identified and tokens produced by the scanner are as follows:
Lexeme    Kind             Token
a         identifier       < id , pointer to symbol-table entry for a >
[         left bracket     <[>
index     identifier       < id , pointer to symbol-table entry for index >
]         right bracket    <]>
=         assignment       <=>
4         number           < number , integer value 4 >
+         plus sign        <+>
2         number           < number , integer value 2 >
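A minimal sketch of a scanner that produces these eight tokens is shown below. The names `TOKEN_SPEC`, `MASTER`, and `scan` are hypothetical, and a list index stands in for the "pointer to symbol-table entry"; a real compiler would use a richer symbol table.

```python
import re

# Hypothetical token specification covering only the lexemes in this example.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("punct", r"[\[\]=+]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Return (tokens, symbol_table) for `source`.

    Each token is a (name, attribute) pair. For identifiers the attribute is
    the index of the lexeme in symbol_table (an assumption standing in for a
    pointer); for numbers it is the integer value; otherwise it is None.
    """
    symbol_table = []
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue  # whitespace produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))  # operators and brackets need no attribute
    return tokens, symbol_table

tokens, symbol_table = scan("a[index] = 4 + 2")
for token in tokens:
    print(token)
```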
Tokens, Patterns, and Lexemes
• Three related but distinct terms.
• A token is a pair consisting of a token name and an optional attribute value.
< token-name, attribute-value >
o The token name is an abstract symbol representing a kind of lexical unit, e.g. a particular keyword, or a sequence of input characters denoting an identifier.
• A pattern is a description of the form that the lexemes of a token may take.
o For example, in a case-insensitive language, the lexemes associated with the IF token are: if , IF , iF , and If.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Examples:
• The sequence of characters “static int” is recognized as two tokens, representing the two lexemes “static” and “int”.
• The sequence of characters “*x++” is recognized as three tokens, representing the lexemes “*”, “x” and “++”.
Tokens, Patterns, and Lexemes (Cont.)
• The table shows some typical tokens, their informally described patterns,
and some sample lexemes.

Typical Tokens in Programming Languages
• In many programming languages, the following classes cover most or all of the
tokens:
1. One token for each keyword.
o The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison (as in the table on the previous slide).
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal
strings.
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
Examples of Tokens and Non-Tokens
• Examples of Tokens:
o Operators = + - > <= == <>
o Punctuation ( ) { } , ;
o Keywords if while for int double
o Numeric 43 6.035 -3.6e10 0x13F3A
o Character literals ‘a’ ‘~’ ‘\’’
o String literals “3.142” “aBcDe” “\”
o Identifiers first total f

• Examples of non-tokens:
o White space space(‘ ’) tab(‘\t’) eoln(‘\n’)
o Comments /*this is not a token*/
Example (2)
• Consider the following C statement:
printf (“Total = %d\n”, score);
• Both printf and score are lexemes matching the pattern for token id,
• “Total = %d\n” is a lexeme matching literal, and
• ( , ) ; are lexemes matching punctuation symbols.

Attributes for Tokens
• The string of characters represented by a token is called its string
value or its lexeme.
o Some tokens have only one lexeme such as reserved words (i.e.
keywords).
o A token may represent many lexemes, such as identifiers. They are all
represented by the token id, but they have many different string values.
• If more than one lexeme can match the pattern for a token, the
scanner must indicate the actual lexeme that matched.
o E.g. the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was
found in the source program.
o Any value associated with a token is called an attribute of the token, and the
string value is an example of an attribute.
Attributes for Tokens (Cont.)
• Attribute: “Value of interest” about a token.
o Numerical value of an integer token.
o Name (string) associated with an identifier token.
• For the token id, we need to associate with the token information about an identifier, such as its lexeme.
• This information is kept in the symbol table.
• Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Symbol Table
• Records the identifiers used in the source program.
• Collects various associated information as attributes:
o Variables: type, scope, storage allocation, etc.
o Procedure: number and types of arguments, method of argument passing, etc.
• It’s a data structure with a collection of records.
o Different fields are collected and used at different phases of compilation.
Example (3)
• Consider the following program statement:

count = 545

• yields the following token-attribute pairs:


< id , pointer to symbol-table entry for count >
< assign-op >
< number , integer value 545 >
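
One possible way to model the symbol table behind these pairs is sketched below. `SymbolRecord`, `SymbolTable`, `enter`, and `record` are hypothetical names, an integer index plays the role of the pointer, and the `"type"` attribute is an invented example of a field filled in by a later phase.

```python
from dataclasses import dataclass, field

@dataclass
class SymbolRecord:
    """One record of the table: the lexeme plus attributes added by later phases."""
    lexeme: str
    attributes: dict = field(default_factory=dict)

class SymbolTable:
    def __init__(self):
        self._records = []   # the collection of records
        self._index = {}     # lexeme -> position in _records

    def enter(self, lexeme: str) -> int:
        """Insert the lexeme if it is new and return its 'pointer' (an index)."""
        if lexeme not in self._index:
            self._index[lexeme] = len(self._records)
            self._records.append(SymbolRecord(lexeme))
        return self._index[lexeme]

    def record(self, pointer: int) -> SymbolRecord:
        return self._records[pointer]

table = SymbolTable()
# Token-attribute pairs for: count = 545
tokens = [("id", table.enter("count")), ("assign-op", None), ("number", 545)]

# A later phase (assumed here for illustration) fills in another field.
table.record(tokens[0][1]).attributes["type"] = "int"
print(tokens, table.record(0))
```

Entering the same lexeme twice returns the same pointer, so every occurrence of `count` in the program refers to a single record.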

Example (4)
• The token names and associated attribute values for the Fortran statement:

E = M * C ** 2

are written below as a sequence of pairs:


< id , pointer to symbol-table entry for E >
< assign-op >
< id , pointer to symbol-table entry for M >
< mult-op >
< id , pointer to symbol-table entry for C >
< exp-op >
< number , integer value 2 >

• Note that in certain pairs, especially operators, punctuation, and keywords, there is no
need for an attribute value.
• In this example, the token number has been given an integer-valued attribute.
Issues in Compiler Design
• Compilation appears to be very simple, but there are many pitfalls:
o Design of programming languages has a big impact on the complexity of the
compiler.
o How are erroneous programs handled?

Tricky Problems when recognizing Tokens
• FORTRAN rule: Whitespace is insignificant. Blanks are ignored.
o E.g. VAR1 is the same as VA R1
• Consider the following Fortran statements:
o DO 5 I = 1,25
o DO 5 I = 1.25
• The second statement is DO5I = 1.25
o It is not apparent that the first lexeme is DO5I, an identifier, until we see the dot following the 1.
• The first statement is DO 5 I = 1 , 25
o If we see a comma instead of the dot, we have a do-statement in which the first lexeme is the keyword DO.

Fortran DO-Statement
DO a b = c , d , e
a CONTINUE
a – line label
b – control variable
c – start value
d – end value
e – increment value (optional)
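The need to look far ahead can be illustrated with a rough sketch. The function `classify_fortran_line` is hypothetical and deliberately simplified: it removes all blanks and then checks whether a comma follows the `=`, ignoring complications such as commas inside parentheses or string constants.

```python
import re

# DO <label> <variable> = <start> , <rest>   (blanks already removed)
DO_STATEMENT = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")
# <identifier> = <expression>
ASSIGNMENT = re.compile(r"[A-Z][A-Z0-9]*=.+")

def classify_fortran_line(line: str) -> str:
    """Classify a line as 'do-statement', 'assignment', or 'unknown'.

    Simplifying assumption: the first comma after '=' decides the case.
    """
    compact = line.replace(" ", "").upper()  # blanks are insignificant
    if DO_STATEMENT.fullmatch(compact):
        return "do-statement"
    if ASSIGNMENT.fullmatch(compact):
        return "assignment"
    return "unknown"

print(classify_fortran_line("DO 5 I = 1,25"))  # comma: keyword DO, label 5, variable I
print(classify_fortran_line("DO 5 I = 1.25"))  # dot: assignment to identifier DO5I
```

The decision cannot be made at the characters `DO`; the scanner has to read past the `=` before it knows what the first lexeme was.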
Tricky Problems when recognizing Tokens
(Cont.)
• The previous slide illustrates a problem when recognizing tokens, but there are
many situations where we need to look at least one additional character ahead.

• Examples:
o Even simple examples have lookahead issues.
▪ We cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.
i vs. if
▪ In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.
= vs. ==
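
Both cases can be sketched with one character of lookahead. The helper names `scan_identifier` and `scan_operator` and the two operator sets are assumptions made for this illustration; they cover only the operators named above.

```python
TWO_CHAR_OPERATORS = {"->", "==", "<="}
ONE_CHAR_OPERATORS = {"-", "=", "<"}

def scan_identifier(text: str, pos: int):
    """Read letters/digits starting at pos (assumed to be a letter).

    The identifier ends only when a character that is not a letter or digit
    is seen, or when the input ends.
    """
    end = pos
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[pos:end], end

def scan_operator(text: str, pos: int):
    """Prefer the two-character operator; fall back to the single character."""
    pair = text[pos:pos + 2]
    if pair in TWO_CHAR_OPERATORS:
        return pair, pos + 2
    single = text[pos:pos + 1]
    if single in ONE_CHAR_OPERATORS:
        return single, pos + 1
    raise ValueError(f"no operator at position {pos}")

print(scan_identifier("if(x)", 0))
print(scan_identifier("i+1", 0))
print(scan_operator("a==b", 1))
print(scan_operator("a=b", 1))
```

Both functions return the lexeme together with the position where scanning should resume, so the lookahead character is not consumed.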

Tricky Problems when recognizing Tokens
(Cont.)
• PL/I keywords are not reserved, so they can also be used as identifiers:

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN

• In Ada, array reference syntax and function call syntax are similar.
arr(2,3) vs. fn(1,2)
• In C++, array reference syntax and function call syntax are different.
arr[2,3] vs. fn(1,2)

Lexical Errors
• In what situations do errors occur?
o Lexical analyzer is unable to proceed because none of the patterns for tokens
matches a prefix of remaining input.
• Sometimes, it is hard for a lexical analyzer to tell, without the aid of other
components, that there is a source-code error.
• For instance, if the string fi is encountered for the first time in a C program in the
context:

fi ( a == f(x)) ...

• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the
token id to the parser and let some other phase of the compiler (e.g. the parser)
handle an error due to transposition of the letters.
Error Detection, Recovery and Reporting
• Each phase can encounter errors.
• Specific types of errors can be detected by specific phases.
• Examples:
o Lexical Error: int abc, 1num;
o Syntax Error: total = capital + rate year;
o Semantic Error: value = myarray [realIndex];
• The compiler should be able to proceed and process the rest of the program after an error is detected.
• The compiler should be able to link the error with the source program.
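
For the lexical case, the last two points can be sketched as follows. This is an illustration under assumptions, not a prescribed design: `TOKEN`, `BAD_RUN`, and `scan` are hypothetical names, keywords such as `int` are simply treated as identifiers, and recovery skips the whole run of letters and digits that contains the error.

```python
import re

# Hypothetical token patterns for this small example. `\d+\b` refuses a
# digit run that is immediately followed by a letter, as in 1num.
TOKEN = re.compile(
    r"(?P<id>[A-Za-z_]\w*)"
    r"|(?P<number>\d+\b)"
    r"|(?P<punct>[=+;,\[\]])"
)
# What to skip when no token matches: a run of word characters, or one character.
BAD_RUN = re.compile(r"\w+|.")

def scan(source: str):
    """Return (tokens, errors); each error is (line, column, offending text)."""
    tokens, errors = [], []
    line, col, pos = 1, 1, 0
    while pos < len(source):
        ch = source[pos]
        if ch == "\n":
            line, col, pos = line + 1, 1, pos + 1
            continue
        if ch in " \t":
            col, pos = col + 1, pos + 1
            continue
        match = TOKEN.match(source, pos)
        if match:
            tokens.append((match.lastgroup, match.group()))
        else:
            # Record where the error is, then skip it and keep scanning.
            match = BAD_RUN.match(source, pos)
            errors.append((line, col, match.group()))
        pos += len(match.group())
        col += len(match.group())
    return tokens, errors

tokens, errors = scan("int abc, 1num;\nx = 2;")
print(errors)
```

The scanner reports `1num` with its line and column and still tokenizes the second line.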

Lexical Analyzer
• Designing a Lexical Analyzer:
1. Define a finite set of tokens.
o Tokens describe all items of interest like Identifiers, integers, keywords, etc.
2. Describe which strings belong to each token (i.e. pattern defines the strings).
• Implementing a Lexical Analyzer:
1. Recognize tokens from the corresponding lexemes.
o Partition the input string by reading left-to-right, recognizing one token at a time. Lookahead is sometimes required to decide where one token ends and the next token begins.
▪ Partition the input string into lexemes.
▪ Identify the token of each lexeme.
2. Return the type of the token and the value (attribute).
o 6036 → < number , integer value 6036 >
o X6035 → < id , pointer to symbol-table entry for X6035 >
o Eliminate whitespace, comments, etc. that do not contribute to parsing.
Specification of Tokens
• We need:
o A way to describe the lexemes of each token.
o A way to resolve ambiguities.
▪ Is if two variables i and f?
▪ Is == two equal signs = =?

• Typically, lexemes associated with a token (type) form a regular language. So, use Regular Expressions to specify tokens.
o Regular expressions are an important notation for specifying lexeme patterns.
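
A common way to resolve the two ambiguities above is to take the longest match and, on a tie, the rule listed first. The slides do not state this rule at this point, so the sketch below is an assumption; `RULES` and `next_token` are hypothetical names.

```python
import re

# Ordered rules: on a tie in length, the earlier rule wins (assumed policy).
RULES = [
    ("if", re.compile(r"if")),
    ("eq", re.compile(r"==")),
    ("assign", re.compile(r"=")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]

def next_token(text: str, pos: int):
    """Return (token name, lexeme) for the longest match starting at pos."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if match and len(match.group()) > best_len:
            best_name, best_len = name, len(match.group())
    if best_name is None:
        raise ValueError(f"no token matches at position {pos}")
    return best_name, text[pos:pos + best_len]

print(next_token("if", 0))    # keyword, not the identifiers i and f
print(next_token("ifx", 0))   # longest match makes this one identifier
print(next_token("==", 0))    # one equality operator, not two assignments
```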

Languages
• Languages are sets of strings.
• Def.
o Let alphabet Σ be a set of characters.
o A language over Σ is a set of strings of characters drawn from Σ.

• Examples of Languages:
1. Alphabet = English characters
Language = English sentences
Not every string of English characters is an English sentence.

2. Alphabet = ASCII
Language = C programming language
Note: The ASCII character set is different from the English character set.
Regular Expressions
• Regular expressions represent patterns of strings of characters.
• A regular expression r is defined by the set of strings that it matches.
This set is called the language generated by the regular expression
and is written L(r).
• This language depends on the character set (symbols) that is
available.
o The set of legal symbols is called the alphabet Σ.
o An alphabet Σ is any finite set of symbols.
 Typical examples of symbols are letters, digits, and punctuation.

Regular Expressions (Cont.)
• Basic regular expressions:
• Given any character a from the alphabet Σ, L(a) = { a }.

• The empty string ε: the string that contains no characters at all.
L(ε) = {ε}.

• ∅ is the symbol that matches no strings at all, whose language is the empty set { }.
L(∅) = { }.

• In other words:
L(ε) = {“”}
L(∅) = {}
Regular Expressions (Cont.)
• Choice among alternatives:
If r and s are regular expressions, r|s is a regular expression.
L(r|s) = L(r) ∪ L(s) (union)
• Concatenation:
If r and s are regular expressions, rs is a regular expression.
L(rs) = L(r)L(s)
• Shorthand:
A regular expression a1 | a2 | … | an , where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1 a2 … an].
E.g. [abc] is equivalent to a | b | c

Regular Expressions (Cont.)
• Repetition:
r* means “zero or more occurrences of” r (Kleene closure)
L(r*) = L(r)*
• One or more repetitions:
r+ means “one or more occurrences of” r (Positive closure)
r+ = r r*
• Optional sub-expressions:
r? means “zero or one occurrence of” r, i.e. strings matched by r are optional.
r? = r | ε
Example: A number may or may not have a leading sign.
natural = [0-9]+
signedNatural = (+ | -)? natural
Example (5)
• Consider the alphabet consisting of three alphabetic characters:
Σ = {a, b, c}
• Consider the set of all strings over this alphabet that contain exactly
one b.
• This set is generated by the regular expression:
(a | c)* b (a | c)*
• All the following strings are matched by the above regular expression:
b , abc , abaca , baaac , ccbaca , ....
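
A claim like this can be checked mechanically for short strings. The sketch below uses Python's `re` module, whose syntax for `|`, `*`, and parentheses agrees with the notation here; `all_strings` and `mismatches` are hypothetical names, and the length bound of 6 is an arbitrary choice, so this is a sanity check rather than a proof.

```python
import re
from itertools import product

EXACTLY_ONE_B = re.compile(r"(a|c)*b(a|c)*")

def all_strings(alphabet: str, max_len: int):
    """Yield every string over `alphabet` of length 0..max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

# Strings where the regular expression and the plain-English property disagree.
mismatches = [
    s for s in all_strings("abc", 6)
    if (EXACTLY_ONE_B.fullmatch(s) is not None) != (s.count("b") == 1)
]
print(mismatches)  # an empty list means no disagreement up to length 6

for sample in ["b", "abc", "abaca", "baaac", "ccbaca"]:
    print(sample, EXACTLY_ONE_B.fullmatch(sample) is not None)
```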
Regular Expressions (Cont.)
• Any character:
A meta-character that is used to express a match of any character is the period “.”
Example:
A regular expression for all strings that contain at least one b is:
.* b .*
• A range of characters:
Use square brackets and a hyphen.
[a-z] for lower case letters
[0-9] for the digits
[a-zA-Z] represents all lowercase and uppercase letters.
[a-z] is equivalent to a | b | . . . | z
• Negated character class:
[^A-Z] matches any character EXCEPT an uppercase letter.
Precedence of Operations
* , + , ? are given the highest precedence,
concatenation is given the next highest,
and | is given the lowest.

• For example:
a|bc* is interpreted as a|(b(c*))

ab|c*d is interpreted as (ab)|((c*)d)


a b* = a (b*)
If you want (a b)* you must use parentheses.
a | b c = a | (b c)
If you want (a | b) c you must use parentheses.
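
Python's `re` module uses the same precedence, so the groupings above can be compared by enumerating short strings. The helpers `language` and `same_language` are hypothetical, and comparing only strings up to length 4 over the alphabet `abcd` is an assumption that is enough to separate these particular expressions.

```python
import re
from itertools import product

def language(pattern: str, alphabet: str, max_len: int) -> set:
    """All strings over `alphabet` up to max_len that the whole pattern matches."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for length in range(max_len + 1)
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    }

def same_language(p: str, q: str, alphabet: str = "abcd", max_len: int = 4) -> bool:
    return language(p, alphabet, max_len) == language(q, alphabet, max_len)

print(same_language("a|bc*", "a|(b(c*))"))        # same grouping
print(same_language("a|bc*", "(a|b)c*"))          # different: "ac" matches only the second
print(same_language("ab|c*d", "(ab)|((c*)d)"))    # same grouping
print(same_language("ab*", "(ab)*"))              # different: "abab" matches only the second
```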
Algebraic Laws for Regular Expressions
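As a reminder of the kind of identities meant here, the following laws hold for arbitrary regular expressions r, s, and t. They are the standard textbook identities; the selection and ordering shown are an assumption.

```latex
\begin{align*}
r \mid s &= s \mid r && \text{($\mid$ is commutative)}\\
r \mid (s \mid t) &= (r \mid s) \mid t && \text{($\mid$ is associative)}\\
r(st) &= (rs)t && \text{(concatenation is associative)}\\
r(s \mid t) &= rs \mid rt, \quad (s \mid t)r = sr \mid tr && \text{(concatenation distributes over $\mid$)}\\
\varepsilon r &= r\varepsilon = r && \text{($\varepsilon$ is the identity for concatenation)}\\
r^{*} &= (r \mid \varepsilon)^{*} && \text{($\varepsilon$ is guaranteed in a closure)}\\
r^{**} &= r^{*} && \text{($*$ is idempotent)}
\end{align*}
```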

Example (6)
• Consider the alphabet consisting of three alphabetic characters:
Σ = {a, b, c}
• Consider the set of all strings over this alphabet that contain at most
one b.
• This set is generated by the regular expression:
(a | c)* | (a | c)* b (a | c)*
• An alternative solution:
(a | c)* (b | ε) (a | c)*
Example (7)
• Consider the strings over the alphabet
Σ = {a, b, c} that contain no two consecutive b’s.
• The regular expression is:
(a | c | ba | bc)* (b | ε)
• An alternative solution:
(b | ε) (a | c | ab | cb)*
• An alternative solution:
(notb | b notb)* (b | ε)
Where notb = a | c
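
That the three solutions describe the same language can be checked for short strings. In the sketch below, (b | ε) is written as `b?` in Python's `re` syntax and notb is expanded to `(a|c)`; `SOLUTIONS`, `strings_up_to`, and `disagreements` are hypothetical names, and the bound of 6 characters is arbitrary.

```python
import re
from itertools import product

SOLUTIONS = [
    r"(a|c|ba|bc)*b?",       # (a | c | ba | bc)* (b | ε)
    r"b?(a|c|ab|cb)*",       # (b | ε) (a | c | ab | cb)*
    r"((a|c)|b(a|c))*b?",    # (notb | b notb)* (b | ε)
]

def strings_up_to(alphabet: str, max_len: int):
    """Yield every string over `alphabet` of length 0..max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

def disagreements(max_len: int = 6) -> list:
    """Pairs (pattern, string) where a solution disagrees with 'no bb inside'."""
    bad = []
    for s in strings_up_to("abc", max_len):
        expected = "bb" not in s
        for pattern in SOLUTIONS:
            if (re.fullmatch(pattern, s) is not None) != expected:
                bad.append((pattern, s))
    return bad

print(disagreements())  # an empty list means all three agree up to length 6
```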
Example (8)
• Consider the alphabet:
Σ = {a, b, c}
• and the regular expression:
((b | c)* a (b | c)* a)* (b | c)*

• Describe the language it generates.


• This generates the language of all strings containing an even number
of a’s.
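
A few sample strings make the answer plausible. The sketch compares the regular expression with a direct parity count; `EVEN_AS`, `has_even_as`, and the chosen samples are assumptions made for this illustration.

```python
import re

EVEN_AS = re.compile(r"((b|c)*a(b|c)*a)*(b|c)*")

def has_even_as(s: str) -> bool:
    """Direct statement of the property: the number of a's is even."""
    return s.count("a") % 2 == 0

samples = ["", "bcb", "aa", "abca", "a", "aaa", "cabacab", "baacaba"]
results = {s: EVEN_AS.fullmatch(s) is not None for s in samples}

for s in samples:
    print(repr(s), results[s], has_even_as(s))
```

Each pass through the outer group consumes exactly two a's, which is why strings with an odd number of a's are rejected.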

Regular Expressions for Programming
Language Tokens
• Numbers:
• Numbers can be sequences of digits (natural numbers), or decimal
numbers, or numbers with an exponent.

nat = [0-9]+
signedNat = (+ | -)? nat
number = signedNat (“.” nat)? (E signedNat)?
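
These three definitions translate almost directly into Python's `re` syntax. In the sketch below the names `NAT`, `SIGNED_NAT`, and `NUMBER` are hypothetical; the sign choice `(+ | -)` becomes the character class `[+-]`, and the quoted period becomes `\.` so that it matches a literal dot.

```python
import re

NAT = r"[0-9]+"
SIGNED_NAT = rf"[+-]?{NAT}"
# number = signedNat ("." nat)? (E signedNat)?
NUMBER = re.compile(rf"{SIGNED_NAT}(\.{NAT})?(E{SIGNED_NAT})?")

def is_number(text: str) -> bool:
    """True if the whole text is a lexeme of the number token."""
    return NUMBER.fullmatch(text) is not None

for text in ["42", "-3.14", "6.02E+23", "+7E-2", ".5", "1.", "1e5"]:
    print(text, is_number(text))
```

As written, the definition requires digits on both sides of the decimal point and an uppercase E, so `.5`, `1.`, and `1e5` are not accepted.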

Regular Expressions for Programming
Language Tokens (Cont.)
• Reserved words:
Reserved = if | while | do | …

• Identifiers:
• Identifier must begin with a letter and contain only letters and digits.

letter = [a-zA-Z]
digit = [0-9]
identifier = letter( letter | digit)*
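
Because every reserved word also matches the identifier pattern, one common approach is to match identifiers first and then look the lexeme up in a table of reserved words. That approach is an assumption here, as are the names `IDENTIFIER`, `RESERVED`, and `classify`; the reserved set contains only the three words named above, since the list is left open.

```python
import re

# identifier = letter ( letter | digit )*
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
# Only the reserved words named in the text; the full list is not given.
RESERVED = {"if", "while", "do"}

def classify(lexeme: str) -> str:
    """Return 'reserved', 'identifier', or 'error' for a candidate lexeme."""
    if IDENTIFIER.fullmatch(lexeme) is None:
        return "error"
    return "reserved" if lexeme in RESERVED else "identifier"

for lexeme in ["while", "x1", "iffy", "1x"]:
    print(lexeme, classify(lexeme))
```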

References
• “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, and Ullman, 2007, 2nd edition. (Chapter 3)

• “Compiler Construction: Principles and Practice” by Kenneth C. Louden, 1997, PWS Publishing Company, ISBN 0-534-93972-4. (Chapter 2)

Thank You
Dr. Wafaa Samy
