Compiler - Lexical Analyzer-2

The document discusses the role and implementation of a lexical analyzer, which recognizes tokens, generates token streams, and reports lexical errors using regular expressions and finite state automata. It covers the concepts of tokens, lexemes, patterns, and the structure of a symbol table, along with implementation approaches and challenges. Additionally, it highlights the use of tools like LEX for generating lexical analyzers and the importance of handling keywords and ambiguities in token recognition.

Compiler

Lexical Analyzer

Kamalika Bhattacharjee
Asst. Prof., Dept. of CSE, NIT Trichy
Lexical Analysis
• Recognize tokens and ignore white spaces and comments
• Generate the token stream
• Discard whatever does not contribute to parsing:
  • white spaces (blanks, tabs, newlines) and comments
• Construct constants:
  • convert a number to the token num and pass the number as its attribute
  • Ex: integer 31 → <num, 31>
• Recognize keywords and identifiers
  • Ex: counter = counter + increment → id = id + id
• Find word boundaries

• Report errors
  ● Lexical errors: Ex. fi (a == f(x)) ... (here fi could be a misspelling of the keyword if or an undeclared function identifier)
  ● Other cases: panic mode error recovery
    ○ Delete one character from the remaining input
    ○ Insert a missing character into the remaining input
    ○ Replace a character by another character
    ○ Transpose two adjacent characters

Approach:
❖ Model tokens using regular expressions
❖ Recognize them using finite state automata
❖ Use a symbol table
❖ Track line numbers
Units of Lexical Analyzer
● Token: a syntactic category
  ○ Sentences consist of a string of tokens
  ○ Ex: number, identifier, keyword, string, etc.
● Lexeme: a sequence of characters forming an instance of a token
  ○ Ex: 100.01, counter, const, "I am happy.", etc.
● Pattern: the rule describing a token
  ○ Ex: letter (letter | digit)*
  ○ In general, there is a set of strings in the input for which the same token is produced as output.
    ■ This set is described by a rule called a pattern associated with the token.
    ■ The pattern is said to match each string in the set.
  ○ A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
  ○ Patterns are specified using regular expressions.
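For instance (a sketch, not from the slides), a hand-coded C check of whether a string matches the identifier pattern letter (letter | digit)*:

    #include <ctype.h>

    /* Return 1 if s matches letter (letter | digit)*, else 0. */
    int matches_id(const char *s)
    {
        if (!isalpha((unsigned char)*s))      /* first character: a letter */
            return 0;
        for (s++; *s; s++)
            if (!isalnum((unsigned char)*s))  /* the rest: letters or digits */
                return 0;
        return 1;
    }

Here matches_id("counter") returns 1, while matches_id("100.01") returns 0; that lexeme matches the pattern for numbers instead.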
Lexical Analysis

Tricky Problems:
• Fixed format vs. dynamic (free) format languages
• Unreserved keywords, e.g. in PL/1:
  if then then then = else else else = then else if if then then = then + 1
Role of Lexical Analyzer
• Push back is required due to lookahead
  • Implemented through a buffer: keep the input in a buffer and move pointers over it
• Buffer pairs: a pair of input buffers, alternately reloaded
  • Two buffers of the same size, usually the size of a disk block
  • Two pointers move over the input
  • Each character read needs two tests: end of buffer, and which character was read
• Sentinels: a special character marks the end of each buffer as well as the end of the input
  • Only one test per character to understand what was read
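A sketch of the sentinel scheme, assuming input on stdin, a buffer size N of one disk block, and '\0' as a sentinel byte that never occurs in the source text:

    #include <stdio.h>

    #define N 4096            /* buffer size, typically one disk block */
    #define SENTINEL '\0'     /* assumed never to occur in source text */

    char buf[2][N + 1];       /* two buffers, each with room for a sentinel */
    char *forward;            /* lookahead pointer; init: fill_buffer(0); forward = buf[0]; */
    int  cur = 0;             /* which buffer 'forward' currently points into */

    /* Refill buffer i and plant the sentinel right after the last valid byte. */
    void fill_buffer(int i)
    {
        size_t n = fread(buf[i], 1, N, stdin);
        buf[i][n] = SENTINEL;
    }

    /* Advance 'forward' by one character. The common case costs a single
       comparison; a second test is needed only on hitting a sentinel. */
    int next_char(void)
    {
        int c = *forward++;
        if (c == SENTINEL) {
            if (forward == buf[cur] + N + 1) {   /* sentinel at end of a full buffer */
                cur = 1 - cur;                   /* switch to the other buffer */
                fill_buffer(cur);
                forward = buf[cur];
                c = *forward++;
                if (c == SENTINEL) return EOF;   /* refill produced nothing: end of input */
            } else {
                return EOF;                      /* sentinel mid-buffer: end of input */
            }
        }
        return c;
    }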

Implementation Approaches
• Use assembly language: most efficient, but most difficult to implement
• Use a high-level language like C: efficient, but difficult to implement
• Use tools like LEX or FLEX: easy to implement, but not as efficient as the first two

Usual Approach
• Start with a tool-based implementation, then move towards an implementation in a high-level language
• Then replace the I/O operations with fast and efficient assembly language routines
Construct a Lexical Analyzer
• Allow white spaces, numbers, and arithmetic operators in an expression
• Return tokens and attributes to the syntax analyzer
• A global variable tokenval is used to return the value (lexeme) of the token
• A finite set of tokens must be defined
• Patterns describe the strings belonging to each token

Problems:
• Scans the text character by character
• The lookahead character determines the type of the token and the word boundary
• The first character alone cannot determine the type of the token
• Large computational overhead in processing the input characters

Approach: Systematically construct the lexical analyzer; a minimal sketch follows
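A minimal sketch of such an analyzer; the token codes NUM and DONE are illustrative choices, and tokenval is the global attribute variable described above:

    #include <ctype.h>
    #include <stdio.h>

    #define NUM  256   /* token code for numbers (illustrative) */
    #define DONE 257   /* token code for end of input (illustrative) */

    int tokenval = 0;  /* attribute value returned alongside the token */

    /* Return the next token; for NUM, the numeric value is left in tokenval. */
    int lexan(void)
    {
        int c;
        for (;;) {
            c = getchar();
            if (c == ' ' || c == '\t') {
                /* strip white space */
            } else if (isdigit(c)) {
                tokenval = 0;                 /* accumulate the number's value */
                do {
                    tokenval = tokenval * 10 + (c - '0');
                    c = getchar();
                } while (isdigit(c));
                ungetc(c, stdin);             /* push back the lookahead character */
                return NUM;
            } else if (c == EOF) {
                return DONE;
            } else {
                return c;                     /* operators: the character itself */
            }
        }
    }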


Symbol Table
• Stores information for subsequent phases
• Minimum functionality
  • insert(s, t): save lexeme s and token t, and return a pointer (index) to the new entry
  • lookup(s): return the index of the entry for lexeme s, or 0 if s is not found
• Implementation
  • Reserving a fixed amount of space per entry to store lexemes wastes space: not advisable
  • Instead, store the lexemes in a separate array, each lexeme terminated by eos; the symbol table holds pointers into this array
  • Can save roughly 70% of the space wasted by the fixed-size scheme
  • 'Other attributes' are to be filled in by the later phases
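A sketch of this layout in C (array sizes are illustrative, '\0' plays the role of eos, and bounds checks are omitted for brevity):

    #include <string.h>

    #define STRMAX 999    /* size of the lexeme array (illustrative) */
    #define SYMMAX 100    /* size of the symbol table (illustrative) */

    struct entry {
        int lexptr;       /* offset of the lexeme in the lexemes array */
        int token;        /* token for this entry; other attributes come later */
    };

    char lexemes[STRMAX];           /* lexemes stored end to end, eos = '\0' */
    struct entry symtable[SYMMAX];
    int lastchar  = -1;             /* last used position in lexemes */
    int lastentry = 0;              /* entry 0 is reserved so lookup can return 0 */

    /* Return the index of the entry for lexeme s, or 0 if s is not found. */
    int lookup(const char *s)
    {
        for (int p = lastentry; p > 0; p--)
            if (strcmp(&lexemes[symtable[p].lexptr], s) == 0)
                return p;
        return 0;
    }

    /* Save lexeme s and token t; return the index of the new entry. */
    int insert(const char *s, int t)
    {
        int len = (int)strlen(s);
        lastentry++;
        symtable[lastentry].token  = t;
        symtable[lastentry].lexptr = lastchar + 1;
        memcpy(&lexemes[lastchar + 1], s, (size_t)(len + 1));  /* copy eos too */
        lastchar += len + 1;
        return lastentry;
    }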
Implementation Issues
• Handling keywords as reserved (see the initialization sketch after this list)
  • Consider the keywords themselves as lexemes
  • Store an entry for each keyword in the symbol table during initialization
  • Look up every new lexeme: a nonzero return value means a corresponding entry already exists in the symbol table
• Handling of blanks
  • In FORTRAN, Counter is the same as Count er
• If keywords are not reserved, as in PL/1:
  if then then then = else else else = then else if if then then = then + 1
  Declare(arg1, arg2, arg3, ..., argn)
  • Requires arbitrary lookahead and very large buffers
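The reserved-keyword initialization sketched above, reusing insert() from the symbol table sketch (the token codes are illustrative):

    int insert(const char *s, int t);   /* from the symbol table sketch */

    #define IF   258   /* illustrative token codes for the keywords */
    #define THEN 259
    #define ELSE 260

    static struct { const char *lexeme; int token; } keywords[] = {
        { "if", IF }, { "then", THEN }, { "else", ELSE },
        { 0, 0 }                        /* end marker */
    };

    /* Load every keyword into the symbol table before scanning starts, so a
       plain lookup() distinguishes keywords from newly seen identifiers. */
    void init(void)
    {
        for (int i = 0; keywords[i].lexeme; i++)
            insert(keywords[i].lexeme, keywords[i].token);
    }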

• How to specify tokens?
  • Tokens may have similar prefixes
  • Each character should be looked at only once
• How to describe tokens?
  • Regular languages
  • Finite automata

Regular Definitions
• Take a fax number: 91-(431)-250-0133
• Take an email id: kamalika@[Link]
• Identifiers
• Floating point numbers
• Use of shorthand notations: a reconstructed example for identifiers and numbers follows
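As one possible set of such shorthand definitions (a standard reconstruction in LEX definition syntax, not taken verbatim from the slides) for white space, identifiers, and unsigned/floating point numbers:

    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    digits   {digit}+
    number   {digits}(\.{digits})?(E[+-]?{digits})?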
Implementation of Specifications
• Regular expressions are only specifications; an implementation is still required
• A plain yes/no answer on the validity of a token is not enough
• Goal: partition the input into tokens
• Give priority to tokens listed earlier
  • Reserved keyword policy: if a lexeme belongs to more than one category, priority rules are needed to remove the ambiguity
• Pick the longest possible string in L(R)
  • The principle of "maximal munch"
• Regular expressions provide a concise and useful notation for string patterns
• Good algorithms require a single pass over the input
• How to break up the text elsex=0: as else x=0 or as elsex=0? Maximal munch takes the longest match, so elsex is a single identifier.
• Regular expressions alone are not enough: lexical definitions consist of regular definitions, priority rules, and the maximal munch principle
Recognition of Tokens
Construct an analyzer that will return <token, attribute> pairs
• Relational operators
• Identifiers
• White spaces
• Unsigned numbers
Implementation of Transition Diagram
• Switch-case based structure; a sketch for the relational operators follows
• Unsigned numbers: another transition diagram
• As the complexity of the transition diagram grows, the implementation becomes more difficult and more prone to errors
• Tradeoff: may need to unget() a large number of characters
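A switch-case sketch of the classic transition diagram for relational operators; the token code RELOP, the attribute constants, and the helpers nextchar()/retract() are illustrative names:

    #include <stdio.h>

    #define RELOP 270                        /* illustrative token code */
    enum { LT, LE, EQ, NE, GT, GE };         /* attribute values for RELOP */

    int attribute;                           /* attribute of the last token */

    static int  nextchar(void) { return getchar(); }
    static void retract(int c) { ungetc(c, stdin); }  /* unget one character */

    /* Walk the diagram for <, <=, <>, =, >, >=; each case is one state.
       Returns RELOP on success, -1 if the input is not a relational operator. */
    int relop(void)
    {
        int state = 0, c;
        for (;;) {
            switch (state) {
            case 0:
                c = nextchar();
                if      (c == '<') state = 1;
                else if (c == '=') { attribute = EQ; return RELOP; }
                else if (c == '>') state = 2;
                else { retract(c); return -1; }
                break;
            case 1:                          /* seen '<' */
                c = nextchar();
                if      (c == '=') { attribute = LE; return RELOP; }
                else if (c == '>') { attribute = NE; return RELOP; }
                else { retract(c); attribute = LT; return RELOP; }
            case 2:                          /* seen '>' */
                c = nextchar();
                if (c == '=') { attribute = GE; return RELOP; }
                else { retract(c); attribute = GT; return RELOP; }
            }
        }
    }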
Lexical Analyzer Generator
• Input to the generator
  • A list of regular expressions in priority order
  • An associated action for each regular expression (generates the kind of token and other bookkeeping information)
• Output of the generator
  • A program that reads the input character stream and breaks it into tokens
  • Reports lexical errors, if any

LEX regular expressions
• Implementing lookahead: in FORTRAN,
  DO 10 I = 1.25 is an assignment, while
  DO 10 I = 1,25 begins a DO loop
• Specification for DO as a keyword, using the LEX lookahead operator /:
  DO/(letter|digit)*=(letter|digit)*,
Lexical Analyzer Generator
• Structure of a LEX program: definitions, translation rules, and auxiliary routines
• How does LEX work?
  • Regular expressions describe the tokens
  • Translate each regular expression into an NFA
  • Convert the NFA into an equivalent DFA
  • Minimize the DFA to reduce the number of states
  • Generate code driven by the DFA tables

• installID() returns a pointer into the symbol table, which is placed in yylval
• Two other variables are available:
  • yytext: pointer to the beginning of the lexeme
  • yyleng: length of the lexeme
• yylval is a global variable
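A sketch of a small LEX specification in this style; installID() and installNum() are the user-supplied routines referred to above, and the token codes plus the helper bodies are assumed here rather than given in the slides:

    %{
    #define IF     258   /* token codes; normally imported from the parser */
    #define ELSE   259
    #define ID     260
    #define NUMBER 261
    #define RELOP  262
    enum { LT, LE };         /* attribute values for RELOP */
    int yylval;              /* attribute of the last token */
    int installID(void);     /* assumed: enters yytext (length yyleng) in the symbol table */
    int installNum(void);    /* assumed: enters the constant in a separate table */
    %}
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
    %%
    {ws}      { /* no action: strip white space */ }
    if        { return IF; }
    else      { return ELSE; }
    {id}      { yylval = installID();  return ID; }
    {number}  { yylval = installNum(); return NUMBER; }
    "<"       { yylval = LT; return RELOP; }
    "<="      { yylval = LE; return RELOP; }
    %%

Because if and else are listed before {id}, they win a tie in length, while maximal munch still makes a longer lexeme such as elsex match {id} as a single identifier.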