2-Lexical Analysis Part1
Lexical Analysis
[Figure: Source Code → Scanner → Parser]
• The first phase of a compiler.
• The scanner or lexical analyzer performs the lexical analysis.
• The main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes,
and produce as output a sequence of tokens (i.e. a token for
each lexeme in the source program).
• When the lexical analyzer discovers a lexeme constituting an
identifier, it enters that lexeme into the symbol table.
• The stream of tokens is sent to the parser for syntax analysis.
• Besides identifying lexemes, the lexical analyzer may perform other
tasks, such as stripping out comments and whitespace (blank, newline,
tab, etc.).
Lexical Analysis (Cont.)
• What do we want to do?
• Example:
if (i == j)
    Z = 0;
else
    Z = 1;
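One way to picture the goal is a small Python sketch that splits this example into (token, lexeme) pairs. The token names (KEYWORD, ID, NUMBER, OP, PUNCT) and the function name tokenize are illustrative assumptions, not taken from the slides.

```python
import re

# Token classes and their patterns. Order matters where two patterns could
# match the same text: KEYWORD is tried before ID, and "==" before "=".
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else)\b"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER",  r"[0-9]+"),
    ("OP",      r"==|="),
    ("PUNCT",   r"[();]"),
    ("SKIP",    r"\s+"),   # whitespace is recognized but is not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token, lexeme) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # No pattern matches a prefix of the remaining input.
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

program = "if (i == j)\n    Z = 0;\nelse\n    Z = 1;"
for token, lexeme in tokenize(program):
    print(token, lexeme)
```

The whitespace and newlines disappear from the output, and == comes out as a single operator lexeme rather than two = signs.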
Typical Tokens in Programming Languages
• In many programming languages, the following classes cover most or all of the
tokens:
1. One token for each keyword.
o The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison (i.e. as in table in previous slide).
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal
strings.
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
Examples of Tokens and Non-Tokens
• Examples of Tokens:
o Operators = + - > <= == <>
o Punctuation ( ) { } , ;
o Keywords if while for int double
o Numeric 43 6.035 -3.6e10 0x13F3A
o Character literals ‘a’ ‘~’ ‘\’’
o String literals “3.142” “aBcDe” “\””
o Identifiers first total f
• Examples of non-tokens:
o White space space(‘ ’) tab(‘\t’) eoln(‘\n’)
o Comments /*this is not a token*/
Example (2)
• Consider the following C statement:
printf (“Total = %d\n”, score);
• Both printf and score are lexemes matching the pattern for token id,
• “Total = %d\n” is a lexeme matching literal, and
• ( , ) ; are lexemes matching punctuation symbols.
Attributes for Tokens
• The string of characters represented by a token is called its string
value or its lexeme.
o Some tokens have only one lexeme such as reserved words (i.e.
keywords).
o A token may represent many lexemes, such as identifiers. They are all
represented by the token id, but they have many different string values.
• If more than one lexeme can match the pattern for a token, the
scanner must indicate the actual lexeme that matched.
o E.g. the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was
found in the source program.
o Any value associated with a token is called an attribute of the token, and the
string value is an example of an attribute.
Attributes for Tokens (Cont.)
• Attribute: “Value of interest” about a token.
o Numerical value of an integer token.
o Name (string) associated with an identifier token.
• For the token id, we need to associate with the token information about the
identifier, such as its lexeme.
• This information is kept in the symbol table.
• Thus, the appropriate attribute value for an identifier is a pointer to the
symbol-table entry for that identifier.
Symbol Table
• Records the identifiers used in the source program.
• Collects various associated information as attributes:
o Variables: type, scope, storage allocation, etc.
o Procedure: number and types of arguments, method of argument passing, etc.
• It’s a data structure with a collection of records.
o Different fields are collected and used at different phases of compilation.
Example (3)
• Consider the following program statement:
count = 545
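A minimal sketch of how a scanner might attach attributes to the tokens of this statement. The class SymbolTable, the function scan, the token name assign, and the record fields are assumptions made for illustration; the "pointer" to a symbol-table entry is modelled as a list index, and lexemes are assumed to be separated by spaces to keep the sketch short.

```python
import re

class SymbolTable:
    """A collection of records; the 'pointer' to an entry is its index."""

    def __init__(self):
        self.records = []   # one record (dict) per distinct identifier
        self._index = {}    # lexeme -> position in self.records

    def enter(self, lexeme):
        """Add the lexeme if it is new and return the index of its record."""
        if lexeme not in self._index:
            self._index[lexeme] = len(self.records)
            # Later phases would fill in fields such as type and scope.
            self.records.append({"lexeme": lexeme, "type": None, "scope": None})
        return self._index[lexeme]

def scan(source, table):
    """Return <token, attribute> pairs; lexemes are assumed space-separated."""
    tokens = []
    for lexeme in source.split():
        if re.fullmatch(r"[0-9]+", lexeme):
            tokens.append(("number", int(lexeme)))       # integer-valued attribute
        elif re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", lexeme):
            tokens.append(("id", table.enter(lexeme)))   # pointer to table entry
        elif lexeme == "=":
            tokens.append(("assign", None))              # no attribute needed
        else:
            raise SyntaxError(f"no token pattern matches {lexeme!r}")
    return tokens

table = SymbolTable()
print(scan("count = 545", table))
print(table.records)
```

Entering the same identifier twice returns the same index, so every occurrence of count refers to one record.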
Example (4)
• The token names and associated attribute values for the Fortran statement:
• Note that in certain pairs, especially operators, punctuation, and keywords, there is no
need for an attribute value.
• In this example, the token number has been given an integer-valued attribute.
Issues in Compiler Design
• Compilation appears to be very simple, but there are many pitfalls:
o Design of programming languages has a big impact on the complexity of the
compiler.
o How are erroneous programs handled?
Tricky Problems when recognizing Tokens
Fortran DO-Statement
• Examples:
o Even simple examples have lookahead issues.
We cannot be sure we've seen the end of an identifier until we see a
character that is not a letter or digit, and therefore is not part of the
lexeme for id.
i vs. if
In C, single-character operators like -, =, or < could also be the beginning
of a two-character operator like ->, ==, or <=.
= vs. ==
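The two lookahead cases above can be sketched with a hand-written scanner step. The function name next_token, the token names, and the operator sets are hypothetical, and only the operators mentioned on this slide are handled.

```python
TWO_CHAR_OPS = {"->", "==", "<="}
ONE_CHAR_OPS = {"-", "=", "<"}

def next_token(text, pos):
    """Scan one token starting at text[pos]; return (token, lexeme, next_pos)."""
    ch = text[pos]
    if ch.isalpha():
        end = pos + 1
        # Keep consuming until the lookahead character is not a letter or digit;
        # only then do we know where the identifier ends.
        while end < len(text) and text[end].isalnum():
            end += 1
        lexeme = text[pos:end]
        token = "if" if lexeme == "if" else "id"
        return token, lexeme, end
    if ch in ONE_CHAR_OPS:
        # Peek one character ahead before deciding which operator this is.
        pair = text[pos:pos + 2]
        if pair in TWO_CHAR_OPS:
            return "op", pair, pos + 2
        return "op", ch, pos + 1
    raise SyntaxError(f"no token pattern matches at position {pos}")

for sample in ["i=1", "if(", "ifx", "==", "=1", "->"]:
    print(sample, next_token(sample, 0))
```

In both branches the scanner prefers the longest lexeme it can form, which is why ifx is one identifier and == is one operator.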
Tricky Problems when recognizing Tokens (Cont.)
• PL/I keywords are not reserved: they can also be used as identifiers.
• In Ada, array reference syntax and function call syntax are similar.
arr(2,3) vs. fn(1,2)
• In C++, array reference syntax and function call syntax are different.
arr[2,3] vs. fn(1,2)
Lexical Errors
• In what situations do errors occur?
o Lexical analyzer is unable to proceed because none of the patterns for tokens
matches a prefix of remaining input.
• Sometimes, it is hard for a lexical analyzer to tell, without the aid of other
components, that there is a source-code error.
• For instance, if the string fi is encountered for the first time in a C program in the
context:
Lexical Analyzer
• Designing a Lexical Analyzer:
1. Define a finite set of tokens.
o Tokens describe all items of interest like Identifiers, integers, keywords, etc.
2. Describe which strings belong to each token (i.e. pattern defines the strings).
• Implementing a Lexical Analyzer:
1. Recognize tokens from the corresponding lexemes.
o Partition of input string by reading left-to-right to recognize one token at a time => lookahead
sometimes required to decide where one token ends and the next token begins.
Partition the input string into lexemes.
Identify the token of each lexeme.
2. Return the type of the token and the value (attribute).
o 6036: <number, integer value 6036>
o X6035: <id, pointer to symbol-table entry for X6035>
o Eliminate whitespace, comments, etc. that do not contribute to parsing.
Specification of Tokens
• We need:
o A way to describe the lexemes of each token.
o A way to resolve ambiguities.
Is if two variables i and f?
Is == two equal signs = =?
Languages
• Languages are sets of strings.
• Def.
o Let alphabet Σ be a set of characters.
o A language over Σ is a set of strings of characters drawn from Σ.
• Examples of Languages:
1. Alphabet = English characters
Language = English sentences
Not every string of English characters is an English sentence.
2. Alphabet = ASCII
Language = C programming language
Note: ASCII character set is different from English character set.
Regular Expressions
• Regular expressions represent patterns of strings of characters.
• A regular expression r is defined by the set of strings that it matches.
This set is called the language generated by the regular expression
and is written L(r).
• This language depends on the character set (symbols) that is
available.
o The set of legal symbols is called the alphabet Σ.
o An alphabet Σ is any finite set of symbols.
Typical examples of symbols are letters, digits, and punctuation.
Regular Expressions (Cont.)
• Basic regular expressions:
• Given any character a from the alphabet Σ, L(a) = { a }.
Regular Expressions (Cont.)
• Repetition:
r*
L(r*) = L(r)*
0 or more occurrences (Kleene closure)
• One or more repetitions:
r+ indicates one or more repetitions of r (Positive closure)
“one or more occurrences of” r+ = r r*
• Optional sub-expressions:
“zero or one occurrence of” r? = r | ε
r? strings matched by r are optional.
Example: A number may or may not have a leading sign.
natural = [0-9]+
signedNatural = (+ | -)? natural
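The identities r+ = r r* and r? = r | ε can be tried out with Python's re module. This is only a sketch: the helper name full and the sample strings are assumptions, and + must be escaped in Python because it is a literal sign here, not the repetition operator.

```python
import re

def full(pattern, s):
    """True if the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

natural = r"[0-9]+"
signed_natural = r"(\+|-)?" + natural   # "\+" is the literal plus sign

samples = ["7", "+42", "-0", "", "+", "4-2", "007"]

for s in samples:
    # r+ = r r*
    assert full(r"[0-9]+", s) == full(r"[0-9][0-9]*", s)
    # r? = r | ε, with ε written in Python as an empty alternative
    assert full(r"(\+|-)?[0-9]+", s) == full(r"((\+|-)|)[0-9]+", s)
    print(repr(s), full(signed_natural, s))
```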
Example (5)
• Consider the alphabet consisting of three alphabetic characters:
Σ = {a, b, c}
• Consider the set of all strings over this alphabet that contain exactly
one b.
• This set is generated by the regular expression:
(a | c)* b (a | c)*
• All the following strings are matched by the above regular expression:
b , abc , abaca , baaac , ccbaca , ....
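The claim can be checked by brute force for short strings. The sketch below compares the regular expression against the plain-language condition "exactly one b" on every string over Σ up to length 6; the length bound and the names all_strings and mismatches are arbitrary choices for the illustration.

```python
import re
from itertools import product

SIGMA = "abc"
exactly_one_b = re.compile(r"(a|c)*b(a|c)*")

def all_strings(max_len):
    """Yield every string over SIGMA of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(SIGMA, repeat=n):
            yield "".join(chars)

# Strings on which the regular expression and the condition disagree.
mismatches = [
    s for s in all_strings(6)
    if (exactly_one_b.fullmatch(s) is not None) != (s.count("b") == 1)
]
print(mismatches)   # an empty list means they agree on all tested strings

for s in ["b", "abc", "abaca", "baaac", "ccbaca"]:
    print(s, exactly_one_b.fullmatch(s) is not None)
```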
Regular Expressions (Cont.)
• Any character:
A meta-character that is used to express a match of any character is the period
“.”
Example:
A regular expression for all strings that contain at least one b is:
.* b .*
• A range of characters:
use square brackets and a hyphen.
[a-z] for lower case letters
[0-9] for the digits
[a-zA-Z] represents all lowercase and uppercase letters.
[a-z] is equivalent to a | b | . . . | z
• “Negated character class”:
[^A-Z] Any character EXCEPT an uppercase letter.
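These notations carry over almost unchanged to Python's re module, which makes them easy to try. In this sketch the helper full is an assumption; note also that in Python the period matches any character except a newline unless the DOTALL flag is given.

```python
import re
import string

def full(pattern, s):
    """True if the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

# Any character: strings that contain at least one b.
print(full(r".*b.*", "aaxbzz"), full(r".*b.*", "aaxzz"))

# A range is shorthand for a long alternation: [a-z] vs. a|b|...|z.
long_form = "|".join(string.ascii_lowercase)
same = all(full(r"[a-z]", ch) == full(long_form, ch) for ch in string.printable)
print(same)

# Negated character class: anything except an uppercase letter.
print(full(r"[^A-Z]", "q"), full(r"[^A-Z]", "7"), full(r"[^A-Z]", "Q"))
```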
Precedence of Operations
* , + , ? are given the highest precedence,
concatenation is given the next highest,
and | is given the lowest.
• For example:
a|bc* is interpreted as a|(b(c*))
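Python's re module follows the same precedence rules, so the grouping can be checked directly. The pattern named other is a deliberately different grouping, added only for contrast; the helper full and the sample strings are assumptions.

```python
import re

def full(pattern, s):
    """True if the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

implicit = r"a|bc*"       # relies on precedence
explicit = r"a|(b(c*))"   # the same expression, fully parenthesized
other = r"(a|b)c*"        # a different grouping, for contrast

for s in ["a", "b", "bcc", "ac", "abc"]:
    print(s, full(implicit, s), full(explicit, s), full(other, s))
```

The string ac separates the two readings: it is rejected by a|bc* but accepted by (a|b)c*.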
Example (6)
• Consider the alphabet consisting of three alphabetic characters:
Σ = {a, b, c}
• Consider the set of all strings over this alphabet that contain at most
one b.
• This set is generated by the regular expression:
(a | c)* | (a | c)* b (a | c)*
• An alternative solution:
(a | c)* (b | ε) (a | c)*
Example (7)
• Consider the strings over the alphabet
Σ = {a, b, c} that contain no two consecutive b’s.
• The regular expression is:
(a | c | ba | bc)* (b | ε)
• An alternative solution:
(b | ε) (a | c | ab | cb)*
• An alternative solution:
(notb | b notb)* (b | ε)
Where notb = a | c
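A sketch that checks, for all strings over Σ up to length 6, that the three solutions agree with the condition "no two consecutive b's". ε is written as an empty alternative, notb is expanded to (a|c), and the names CANDIDATES, all_strings, and disagreements are assumptions made for the example.

```python
import re
from itertools import product

SIGMA = "abc"
CANDIDATES = {
    "first":  r"(a|c|ba|bc)*(b|)",
    "second": r"(b|)(a|c|ab|cb)*",
    "third":  r"((a|c)|b(a|c))*(b|)",   # notb expanded to (a|c)
}

def all_strings(max_len):
    """Yield every string over SIGMA of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(SIGMA, repeat=n):
            yield "".join(chars)

def disagreements(pattern, max_len=6):
    """Strings where the pattern and the 'no bb' condition differ."""
    compiled = re.compile(pattern)
    return [
        s for s in all_strings(max_len)
        if (compiled.fullmatch(s) is not None) != ("bb" not in s)
    ]

for name, pattern in CANDIDATES.items():
    print(name, disagreements(pattern))   # empty lists mean agreement
```

A bounded test like this is evidence, not a proof, that the expressions are equivalent.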
Example (8)
• Consider the alphabet:
Σ = {a, b, c}
• and the regular expression:
((b | c)* a (b | c)* a)* (b | c)*
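A quick experiment helps to see which language this expression generates. Each pass through the outer group consumes exactly two a's, so every match should contain an even number of a's; the sketch below checks this on all strings over Σ up to length 6. The names and the length bound are assumptions for the illustration.

```python
import re
from itertools import product

SIGMA = "abc"
pattern = re.compile(r"((b|c)*a(b|c)*a)*(b|c)*")

def all_strings(max_len):
    """Yield every string over SIGMA of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(SIGMA, repeat=n):
            yield "".join(chars)

# Strings where "matched" and "even number of a's" disagree.
mismatches = [
    s for s in all_strings(6)
    if (pattern.fullmatch(s) is not None) != (s.count("a") % 2 == 0)
]
print(mismatches)   # an empty list means they agree on all tested strings
```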
Regular Expressions for Programming Language Tokens
• Numbers:
• Numbers can be sequences of digits (natural numbers), or decimal
numbers, or numbers with an exponent.
nat = [0-9]+
signedNat = (+ | -)? nat
number = signedNat (“.” nat)? (E signedNat)?
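The three definitions translate directly into Python regular expressions. In this sketch the names NAT, SIGNED_NAT, NUMBER, and is_number are assumptions; the literal + and . are escaped because they are operators in Python's syntax, and the exponent marker is an uppercase E exactly as in the definition above.

```python
import re

NAT = r"[0-9]+"
SIGNED_NAT = r"(\+|-)?" + NAT
NUMBER = re.compile(SIGNED_NAT + r"(\." + NAT + r")?(E" + SIGNED_NAT + r")?")

def is_number(s):
    """True if the whole string s is a lexeme of the token number."""
    return NUMBER.fullmatch(s) is not None

for s in ["545", "-3.6E10", "6.035", "+1E-5", ".5", "1.", "E10", "3.6e10"]:
    print(s, is_number(s))
```

With these definitions a number needs digits on both sides of the decimal point, so .5 and 1. are rejected, and a lowercase e is not accepted as the exponent marker.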
Regular Expressions for Programming Language Tokens (Cont.)
• Reserved words:
Reserved = if | while | do | …
• Identifiers:
• Identifier must begin with a letter and contain only letters and digits.
letter = [a-zA-Z]
digit = [0-9]
identifier = letter( letter | digit)*
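Every reserved word also matches the identifier pattern, so a scanner has to decide between the two. One common approach, sketched here as an assumption rather than as the method of these slides, is to match the identifier pattern first and then look the lexeme up in a table of reserved words. The set RESERVED holds only the three words listed above, and the function name classify is hypothetical.

```python
import re

RESERVED = {"if", "while", "do"}   # only the words listed on the slide
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")   # letter (letter | digit)*

def classify(lexeme):
    """Return the token for a lexeme, or None if it is not identifier-shaped."""
    if IDENTIFIER.fullmatch(lexeme) is None:
        return None
    # A reserved word is its own token; everything else is the token id.
    return lexeme if lexeme in RESERVED else "id"

for lexeme in ["while", "while1", "X6035", "do", "6036"]:
    print(lexeme, classify(lexeme))
```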
References
• “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, and
Ullman, 2nd edition, 2007 (Chapter 3).
Dr. Wafaa Samy