Compiler
Lexical Analyzer
Kamalika Bhattacharjee
Asst Prof., Dept. of CSE, NIT Trichy
Lexical Analysis
• Recognize tokens and ignore white spaces, comments
• Generate a token stream
• Discard whatever does not contribute to parsing
• white spaces (blanks, tabs, newlines) and comments
• Construct constants:
• convert numbers to token num and pass the number as its attribute
● Ex: integer 31 → <num, 31>
• Recognize keywords and identifiers
● Ex: counter = counter + increment → id = id + id
• Find word boundaries
• Report Errors
• Lexical Errors
○ Ex: fi ( a == f ( x ) ) ... Here fi may be a misspelt if or a valid identifier; since fi matches the pattern for id, the lexer returns id and leaves the decision to a later phase
○ Other cases: Panic mode error recovery
■ Delete one character from the remaining input
■ Insert a missing character into the remaining input
■ Replace a character by another character
■ Transpose two adjacent characters
❖ Model using regular expressions
❖ Recognize using Finite State Automata
❖ Use a symbol table
❖ Keep track of line numbers
Units of Lexical Analyzer
● Token: a syntactic category
○ Sentences consist of strings of tokens
○ Ex: number, identifier, keyword, string etc.
● Lexeme: an actual sequence of characters forming an instance of a token
○ Ex: 100.01, counter, const, "I am happy." etc.
● Pattern: Rule describing the set of strings (lexemes) for a token
○ Ex: letter (letter | digit)*
○ In general, there is a set of strings in the input for which the same token is produced as output.
■ Described by a rule called a pattern associated with the token
■ This pattern is said to match each string in the set.
○ A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
○ The patterns are specified using regular expressions.
Lexical Analysis
Tricky Problems:
• Fixed format vs. dynamic format
• Unreserved keywords
○ Ex (PL/I): if then then then = else; else else = then
○ Ex (PL/I): if if then then = then + 1
Role of Lexical Analyzer
• Push back is required due to lookahead
• Implemented through a buffer
• Keep input in a buffer
• Move pointers over the input
• Use a pair of input buffers, alternately reloaded
• Two buffers of the same size, usually the size of a disk block
• Two pointers to move: the beginning of the current lexeme and the forward scanning pointer
Each character read needs two tests: is it the end of the buffer, and which character is it
• Use a special character to mark the end of each buffer as well as the end of input → Sentinels
• Only one test per character to determine what was read
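The scheme below is a minimal C sketch of the buffer pair with sentinels; it assumes the byte '\0' as the sentinel (so it cannot handle NUL bytes in the input) and leaves out the guard that keeps the lexeme-begin half from being overwritten. The buffer size and helper names are illustrative.

    #include <stdio.h>

    #define BUF_SIZE 4096                 /* usually one disk block */

    static char buf[2 * BUF_SIZE + 2];    /* two halves, one sentinel slot each */
    static char *forward;                 /* forward scanning pointer */
    static FILE *src;

    /* Fill one half and terminate it with the sentinel; a short read
       leaves the sentinel early, which then marks end of input. */
    static void load(char *half) {
        size_t n = fread(half, 1, BUF_SIZE, src);
        half[n] = '\0';
    }

    /* One test per character: only when the sentinel is seen do we check
       whether it ends a buffer half (reload) or the whole input. */
    static int next_char(void) {
        for (;;) {
            char c = *forward++;
            if (c != '\0')
                return (unsigned char)c;
            if (forward - 1 == buf + BUF_SIZE) {                /* end of half 1 */
                load(buf + BUF_SIZE + 1);
                forward = buf + BUF_SIZE + 1;
            } else if (forward - 1 == buf + 2 * BUF_SIZE + 1) { /* end of half 2 */
                load(buf);
                forward = buf;
            } else {
                return EOF;               /* sentinel inside a half: input done */
            }
        }
    }

    int main(int argc, char **argv) {
        src = argc > 1 ? fopen(argv[1], "r") : stdin;
        if (!src) return 1;
        load(buf);
        forward = buf;
        for (int c; (c = next_char()) != EOF; )
            putchar(c);                   /* echo: stands in for the scanner */
        return 0;
    }

Note how the common case costs exactly one comparison; the boundary checks run only when a sentinel is actually read.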
Implementation Approaches
• Use assembly language: most efficient, but the most difficult to implement
• Use a high-level language like C: efficient, but still difficult to implement
• Use tools like LEX or FLEX: easy to implement, but not as efficient as the first two
Usual Approach
• Start with a tool-based implementation and move towards an implementation in a high-level language
• Then replace the I/O operations by fast and efficient assembly language routines
Construct a Lexical Analyzer
• Allow white spaces, numbers and arithmetic operators in an expression
• Return tokens and attributes to the syntax analyzer
• A global variable tokenval is used to return the attribute value of the token (lexeme)
• A finite set of tokens is defined
• Patterns describe the strings belonging to each token
Problems:
• Scans text character by character
• The lookahead character determines the type of token and the word boundary
• The first character alone cannot determine the type of token
• Large computational overhead per input character
Approach: Systematically construct the lexical analyzer (see the sketch below)
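A sketch of such an analyzer in C, in the style of the classic lexan() routine; the token codes NUM and DONE are assumed names.

    #include <ctype.h>
    #include <stdio.h>

    #define NUM  256     /* token codes beyond the character range: assumed */
    #define DONE 257

    int tokenval = 0;    /* global attribute value, as in the slides */
    int lineno   = 1;

    /* Return the next token; white space is stripped, digits are
       accumulated into tokenval, and operators return themselves. */
    int lexan(void) {
        for (;;) {
            int c = getchar();
            if (c == ' ' || c == '\t')
                ;                              /* discard blanks and tabs */
            else if (c == '\n')
                lineno++;
            else if (isdigit(c)) {
                tokenval = c - '0';
                while (isdigit(c = getchar()))
                    tokenval = tokenval * 10 + (c - '0');
                ungetc(c, stdin);              /* push back the lookahead */
                return NUM;
            }
            else if (c == EOF)
                return DONE;
            else {
                tokenval = c;                  /* operators carry themselves */
                return c;
            }
        }
    }

    int main(void) {
        for (int t; (t = lexan()) != DONE; )
            if (t == NUM) printf("<num, %d>\n", tokenval);
            else          printf("<op, %c>\n", t);
        return 0;
    }

On the input 12 + 3 this emits <num, 12> <op, +> <num, 3>; the single pushed-back character is exactly the lookahead that decides the word boundary.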
Symbol Table
• Stores information for subsequent phases
• Minimum Functionality
• Insert(s,t): save lexeme s and token t and return pointer
• Lookup(s): return index of entry for lexeme s or 0 if s is not found
• Implementation
• Fixed amount of space to store lexemes
• Wastes space: not advisable
• Store lexemes in a separate array
• Each lexeme is terminated by eos
• The symbol table keeps pointers into the lexeme array
• Can save ~70% of the space wasted by the fixed-size scheme
• 'Other Attributes' are filled in by later phases
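A compact C sketch of this layout: lexemes live in one packed array, separated by eos ('\0'), and each table entry points into it. The array sizes and the demo token code are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define STRMAX 999    /* size of the packed lexeme array: assumed */
    #define SYMMAX 100    /* size of the symbol table: assumed */

    struct entry { char *lexptr; int token; };  /* other attributes later */

    static char lexemes[STRMAX];
    static struct entry symtable[SYMMAX];
    static int lastchar  = -1;   /* last used position in lexemes */
    static int lastentry = 0;    /* entry 0 stays unused: 0 means not found */

    /* lookup(s): index of the entry for lexeme s, or 0 if absent */
    int lookup(const char *s) {
        for (int p = lastentry; p > 0; p--)
            if (strcmp(symtable[p].lexptr, s) == 0)
                return p;
        return 0;
    }

    /* insert(s, t): copy s into lexemes (eos-terminated), record token t */
    int insert(const char *s, int token) {
        int len = (int)strlen(s);
        if (lastentry + 1 >= SYMMAX || lastchar + len + 2 >= STRMAX) {
            fprintf(stderr, "symbol table full\n");
            exit(1);
        }
        symtable[++lastentry].token = token;
        symtable[lastentry].lexptr  = &lexemes[lastchar + 1];
        lastchar += len + 1;
        strcpy(symtable[lastentry].lexptr, s);
        return lastentry;
    }

    int main(void) {
        insert("count", 258);             /* 258: an assumed id token code */
        printf("%d %d\n", lookup("count"), lookup("counter"));  /* 1 0 */
        return 0;
    }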
Implementation Issues
• Handling keywords as reserved
• Consider the keywords themselves as lexemes
• Store entries for all keywords in the symbol table during initialization (as sketched below)
• Look up every new lexeme
• A nonzero return value means a corresponding entry already exists in the Symbol Table, i.e. the lexeme is a keyword
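A sketch of that initialization step, building on the insert() routine above; the keyword set and token codes are assumptions for illustration.

    /* Seed the symbol table with the reserved keywords at start-up. */
    #define IF   260
    #define THEN 261
    #define ELSE 262

    extern int insert(const char *s, int token);   /* from the sketch above */

    static const struct { const char *lexeme; int token; } keywords[] = {
        { "if", IF }, { "then", THEN }, { "else", ELSE }, { NULL, 0 }
    };

    void init_keywords(void) {
        for (int i = 0; keywords[i].lexeme; i++)
            insert(keywords[i].lexeme, keywords[i].token);
    }

After this, a nonzero lookup() on a freshly scanned word both detects the keyword and yields its token code, so no separate keyword matching is needed.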
• Handling of blanks
• In FORTRAN blanks are insignificant: Counter is the same as Count er
• If keywords are not reserved [PL/I]
if then then then = else; else else = then
if if then then = then + 1
DECLARE(arg1, arg2, arg3, ..., argn)
• DECLARE may be a keyword or an array name; this cannot be decided until the character after the closing parenthesis is seen
• Requires arbitrary lookahead and very large buffers
• How to specify tokens?
• Tokens may have similar prefixes
• Each character should be looked at only once
• How to describe tokens?
• Regular languages
• Finite Automata
Regular Definitions
• Take a fax number: 91-(431)-250-0133
• Take an email id: kamalika@[Link]
• Identifiers
• Floating point numbers
Use of shorthand notations (a possible set of definitions is sketched below):
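One possible set of regular definitions for these examples, written in the deck's pattern notation plus the usual shorthands ([...] for character classes, ? for optional, + for one or more); the definitions are illustrative, not canonical.

    letter → [A-Za-z]
    digit  → [0-9]
    id     → letter ( letter | digit )*
    num    → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
    fax    → digit digit "-(" digit digit digit ")-" digit digit digit "-" digit digit digit digit

Here num matches 31, 3.14 or 3.1E+2, and fax matches 91-(431)-250-0133.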
Implementation of Specifications
• Regular expressions are only specifications; implementation is still required
• Just yes/no answer on validity of the token is not enough
• Goal: Partition the input into tokens
• If a token belongs to more than one category, priority rules are needed to remove the ambiguity
• Give priority to the tokens listed earlier
• Reserved keyword policy
• Pick the longest possible string in L(R)
• The principle of "maximal munch"
• Regular expressions provide a concise and useful notation for string patterns
• Good algorithms require a single pass over the input
• How to break up text: should elsex=0 be tokenized as else x = 0 or as elsex = 0? Maximal munch chooses the single identifier elsex (see the sketch after this list)
• Regular expressions alone are not enough
• Lexical definitions consist of regular definitions, priority rules and the maximal munch principle
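A small self-contained C illustration of maximal munch on exactly this input; the token names in the output are informal.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Read the longest run of letters/digits first, then decide keyword
       vs identifier, so "elsex=0" yields the identifier "elsex" rather
       than the keyword "else" followed by "x". */
    int main(void) {
        const char *input = "elsex=0", *p = input;
        while (*p) {
            if (isalpha((unsigned char)*p)) {
                const char *start = p;
                while (isalnum((unsigned char)*p)) p++;   /* longest match */
                int n = (int)(p - start);
                if (n == 4 && strncmp(start, "else", 4) == 0)
                    printf("<keyword, else>\n");
                else
                    printf("<id, %.*s>\n", n, start);
            } else if (isdigit((unsigned char)*p)) {
                const char *start = p;
                while (isdigit((unsigned char)*p)) p++;
                printf("<num, %.*s>\n", (int)(p - start), start);
            } else {
                printf("<'%c'>\n", *p++);
            }
        }
        return 0;
    }

Running it prints <id, elsex>, <'='>, <num, 0>: the longest match wins, and keyword recognition happens only after the whole word is consumed.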
Recognition of Tokens
Construct an analyzer that will return <token, attribute> pairs for:
• Relational operators
• Identifiers
• White spaces
• Unsigned numbers
Implementation of Transition Diagram
• Switch-case based structure (see the sketch below)
• Unsigned numbers: another transition diagram
• As the complexity of the transition diagram increases, implementation becomes more difficult and more error-prone
• Tradeoff: may need to unget() a large number of characters
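A C sketch of the switch-case structure for the relational-operator diagram; the state numbers follow the usual textbook numbering, and the attribute names (LT, LE, ...) are assumed.

    #include <stdio.h>

    enum relop { NONE, LT, LE, EQ, NE, GT, GE };  /* attribute values: assumed */

    /* Each diagram state becomes one case; accepting states reached with
       one character of lookahead retract it with ungetc(). */
    enum relop scan_relop(FILE *in) {
        int state = 0;
        for (;;) {
            int c = fgetc(in);
            switch (state) {
            case 0:
                if (c == '<')      state = 1;   /* may be <, <= or <> */
                else if (c == '=') return EQ;
                else if (c == '>') state = 6;   /* may be > or >= */
                else { ungetc(c, in); return NONE; }
                break;
            case 1:                             /* seen '<' */
                if (c == '=') return LE;
                if (c == '>') return NE;
                ungetc(c, in);                  /* retract the lookahead */
                return LT;
            case 6:                             /* seen '>' */
                if (c == '=') return GE;
                ungetc(c, in);
                return GT;
            }
        }
    }

    int main(void) {
        printf("relop code: %d\n", scan_relop(stdin));
        return 0;
    }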
Lexical Analyzer Generator
• Input to the generator
• List of regular expressions in priority order
• Associated actions for each regular expression (generate the kind of token and other bookkeeping information)
• Output of the generator
• A program that reads the input character stream and breaks it into tokens
• Reports lexical errors, if any
LEX regular expressions
• Implementing lookahead: a pattern is matched only when followed by the given right context
DO 10 I = 1.25 (an assignment to the variable DO10I)
DO 10 I = 1,25 (the header of a DO loop)
• With blanks insignificant, the lexer cannot classify DO until it sees the . or the ,
• Specification for DO as keyword: DO/(letter|digit)*=(letter|digit)*,
Lexical Analyzer Generator
• Structure of a LEX program: declarations %% translation rules %% auxiliary functions
• Translation rule: pattern { action }
• How does LEX work?
• Regular expressions to describe the tokens
• Translate each regular expression into NFA
• Convert the NFA into an equivalent DFA
• Minimize the DFA to reduce number of states
• Generate code driven by the DFA tables
• installID() returns a pointer into the symbol table, which is placed in yylval
• Two other variables are available:
• yytext: pointer to the beginning of the lexeme
• yyleng: length of the lexeme
• yylval is a global variable shared with the parser (see the sketch below)
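A minimal LEX specification exercising this whole pipeline; the token codes and the install helpers are assumed names, and the stubs exist only so the sketch builds (e.g. flex scan.l && cc lex.yy.c).

    %{
    /* Declarations section: C definitions copied into the scanner. */
    #include <stdio.h>
    #define IF  258
    #define ID  259
    #define NUM 260
    int yylval;                    /* attribute value handed to the parser */
    int installID(void);
    int installNum(void);
    %}
    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+(\.{digit}+)?
    %%
    {ws}      { /* discard white space: no token is returned */ }
    if        { return IF; }            /* listed before {id}: priority */
    {id}      { yylval = installID();  return ID;  }
    {number}  { yylval = installNum(); return NUM; }
    .         { return yytext[0]; }     /* single-character tokens */
    %%
    /* Auxiliary functions. A real installID would insert yytext (of
       length yyleng) into the symbol table and return its index. */
    int installID(void)  { return 0; }
    int installNum(void) { return 0; }
    int yywrap(void)     { return 1; }

    int main(void) {
        int t;
        while ((t = yylex()) != 0)
            printf("<%d, %s>\n", t, yytext);
        return 0;
    }

Listing the if rule before {id} encodes the priority rule from the earlier slide, and LEX itself applies maximal munch when several patterns match.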