Module-I
INTRODUCTION TO COMPILER AND LEXICAL ANALYSIS
1.1 STRUCTURE OF A COMPILER
A compiler is a program that translates source code written in a high-level programming language, such as C, Java, or Python, into machine code or an intermediate code that can be executed on a computer. A compiler is generally organized as a sequence of small, largely independent phases or components, each performing one step of the translation. The key phases of a compiler are explained briefly below, and a short worked example follows the list.
This phase-by-phase organization ensures that a program written in a high-level language is effectively translated into machine code that can be executed on the target hardware.
1. Lexical Analysis (Scanner)
Purpose: The lexical analyzer (or scanner) reads the source code and converts it into a
sequence of tokens. Tokens are the smallest meaningful units of the language (e.g.,
keywords, operators, identifiers, literals).
Output: A list of tokens.
Functionality:
o Recognizes the basic syntactic components of the language.
o Ignores irrelevant details such as whitespace and comments.
2. Syntax Analysis (Parser)
Purpose: The parser takes the list of tokens produced by the lexical analyzer and arranges
them into a syntax tree (also known as a parse tree) according to the grammar rules of the
programming language.
Output: A parse tree or abstract syntax tree (AST).
Functionality:
o Checks the syntactic correctness of the token sequence.
o Detects syntax errors and provides error messages.
3. Semantic Analysis
Purpose: This phase checks the logical consistency of the program. It verifies whether the
program follows the semantic rules of the language (e.g., type checking, variable declaration,
scope rules).
Output: Annotated syntax tree or intermediate representation (IR).
Functionality:
o Ensures that operations are valid (e.g., checking types in expressions, ensuring
variables are defined before use).
o Resolves symbol references (e.g., variables, functions).
4. Intermediate Code Generation
Purpose: The intermediate code generator translates the abstract syntax tree or the semantic
information into an intermediate representation (IR) that is easier to manipulate than the
original high-level code, but not yet machine-specific.
Output: An intermediate representation (IR), such as three-address code or another machine-independent lower-level form.
Functionality:
o Provides a representation that is simpler to manipulate than the source language while remaining independent of any specific machine architecture.
o Allows optimization and code generation to be more flexible.
5. Optimization
Purpose: This phase improves the intermediate code to make the generated program run
more efficiently. Optimizations can be done at various levels, such as optimizing loops,
removing dead code, or reducing memory usage.
Output: Optimized intermediate code.
Functionality:
o Reduces the program's execution time or memory usage.
o Examples: Constant folding, dead code elimination, loop unrolling, etc.
6. Code Generation
Purpose: The code generator translates the optimized intermediate code into the target
machine code or assembly code, which can be executed on the computer's hardware.
Output: Machine code or assembly code.
Functionality:
o Maps the intermediate representation to specific instructions for the target
architecture (e.g., x86, ARM).
o Ensures that the generated machine code follows the conventions of the target
platform, such as register allocation, memory addressing, etc.
7. Code Optimization (Optional)
Purpose: Further optimization after code generation to make the machine code even more
efficient.
Output: Optimized machine code.
Functionality:
o Can involve optimizations like instruction scheduling, register allocation, etc., to
improve the performance of the machine code.
8. Code Linking and Assembly (Optional)
Purpose: Linking involves combining the compiled program with libraries or other modules
that are needed to create an executable. The assembler converts assembly code into machine
code.
Output: Executable file.
Functionality:
o Resolves symbols and addresses in the code.
o Produces an executable or a relocatable object file.
9. Error Handling and Reporting
Throughout all phases, the compiler needs to identify errors (e.g., lexical, syntax, semantic)
and provide useful feedback to the programmer.
Functionality: Reports errors or warnings and provides information to help the programmer
correct issues.
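As a brief, purely illustrative trace (not the output of any particular compiler), here is how the C-style statement sum = a + 2 * 3; might move through these phases:
Source statement:        sum = a + 2 * 3;
After lexical analysis:  id(sum)  =  id(a)  +  num(2)  *  num(3)  ;
After syntax analysis:   assignment tree  =( id(sum), +( id(a), *( num(2), num(3) ) ) )
After intermediate code generation (three-address code):
                         t1 = 2 * 3
                         t2 = a + t1
                         sum = t2
After optimization (constant folding):
                         sum = a + 6
After code generation (schematic target code):
                         LOAD a
                         ADD #6
                         STORE sum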
1.2 INTERPRETATION – INTERPRETERS, RECURSIVE INTERPRETERS AND ITERATIVE
INTERPRETERS
In the context of compiler design, interpretation refers to executing a program directly, rather than first generating a complete machine-code executable. Execution can be organized either as a recursive interpreter or as an iterative interpreter, each with a different method of interpreting and executing the program. Below is an overview of each in the context of compiler design.
1. Interpreter
An interpreter in the context of compiler design refers to a system that directly executes instructions in
a high-level programming language without first converting them into machine code. It processes the
source code line-by-line or statement-by-statement and performs the specified actions immediately.
Interpreters are often used in languages that are designed to be executed directly without a compilation
step (such as Python, JavaScript, or Ruby).
Role in Compiler Design:
o Interpreters can be used in just-in-time compilation (JIT), where some parts of the
code are interpreted and then compiled into machine code at runtime for
optimization.
o In interpreted languages, the interpreter is responsible for managing and executing
the program without an intermediate machine code step.
o Interpreters generally focus on runtime execution, error handling, and providing
feedback as the code is executed.
Features of an Interpreter:
o No Compilation to Machine Code: It translates and executes code directly.
o Line-by-Line Execution: The program is executed step by step.
o Error Reporting: Errors are identified during execution, rather than at compile time.
2. Recursive Interpreters
A recursive interpreter uses recursion to handle the execution of nested expressions, function calls, and
subprograms. In recursive interpreters, the interpreter repeatedly calls itself to handle nested expressions
or recursive function calls. This is particularly useful when interpreting languages that heavily rely on
recursive constructs (such as functional programming languages like Lisp or Scheme).
Characteristics of Recursive Interpreters:
Recursive Evaluation: A recursive interpreter will invoke itself to handle expressions that
depend on recursive data structures or recursive functions. For example, evaluating a
recursive function will cause the interpreter to call itself with new arguments.
Handling Recursion: Recursive interpreters naturally handle recursive calls without
needing a separate mechanism, such as a stack, to simulate recursion.
Role in Compiler Design:
Abstract Syntax Trees (ASTs): Recursive interpreters can be used to traverse and evaluate
Abstract Syntax Trees (ASTs) produced by a parser.
Symbol Tables: They often utilize symbol tables to track variables, functions, and scopes in
recursive function calls.
Memory Consumption: Recursive interpreters tend to consume more memory and stack
space because each recursive call requires additional memory on the call stack.
Example:
In languages like Lisp or Scheme, where functions are often recursive by nature, a recursive interpreter is
well-suited for interpreting such recursive constructs. The interpreter might recursively evaluate an
expression that contains further nested expressions or function calls.
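To make this concrete, here is a minimal sketch of a recursive interpreter in C. The AST layout (a Node with an op tag, a value, and left/right children) and the helper names eval and mk are purely illustrative, not taken from any real interpreter:
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical AST node: either a number or a binary '+' / '*' operation. */
typedef struct Node {
    char op;                   /* 'n' = number, '+' or '*' = binary operation */
    int value;                 /* used only when op == 'n'                    */
    struct Node *left, *right; /* used only for '+' and '*'                   */
} Node;

/* Recursive evaluation: the interpreter calls itself on each sub-expression. */
int eval(Node *n) {
    switch (n->op) {
    case 'n': return n->value;
    case '+': return eval(n->left) + eval(n->right);
    case '*': return eval(n->left) * eval(n->right);
    default:  fprintf(stderr, "unknown node\n"); exit(1);
    }
}

/* Small helper to build nodes for the example below. */
Node *mk(char op, int value, Node *l, Node *r) {
    Node *n = malloc(sizeof(Node));
    n->op = op; n->value = value; n->left = l; n->right = r;
    return n;
}

int main(void) {
    /* Evaluate 10 + 5 * 2 by recursively walking its AST; prints 20. */
    Node *tree = mk('+', 0,
                    mk('n', 10, NULL, NULL),
                    mk('*', 0, mk('n', 5, NULL, NULL), mk('n', 2, NULL, NULL)));
    printf("%d\n", eval(tree));
    return 0;
}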
3. Iterative Interpreters
An iterative interpreter, as opposed to a recursive interpreter, processes the source code using loops
rather than recursive function calls. Instead of invoking itself recursively to evaluate sub-expressions, it
uses an explicit loop to process the program step by step.
Characteristics of Iterative Interpreters:
Loop-Based Execution: Iterative interpreters use loops (such as while or for loops) to
sequentially fetch, parse, and execute each statement in the source code.
Memory Efficiency: Iterative interpreters typically consume less memory because they
avoid the overhead of maintaining deep recursion stacks.
Handling Recursion: While iterative interpreters do not inherently support recursion, they
can manage recursion manually by implementing an explicit stack to simulate the behavior.
Role in Compiler Design:
Execution of Intermediate Representations: Iterative interpreters may execute the
intermediate representations generated during compilation (such as bytecode or intermediate
code), rather than the original source code.
Optimization: In some cases, iterative interpreters may be used in a just-in-time (JIT)
compilation approach to optimize execution on the fly.
Example:
An example of an iterative interpreter is one that interprets a simplified programming language (or its bytecode), processing statements such as assignments, arithmetic operations, and function calls in a loop without recursion. Iterative interpretation is a natural fit for flat instruction sequences such as bytecode, and for languages that are not highly recursive; a minimal sketch in C follows.
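Below is a minimal sketch of an iterative interpreter in C, assuming a hypothetical stack-machine instruction set (PUSH, ADD, MUL, PRINT, HALT); the encoding and names are illustrative only:
#include <stdio.h>

/* Hypothetical instruction set for a tiny stack machine. */
enum { PUSH, ADD, MUL, PRINT, HALT };

typedef struct { int op; int arg; } Instr;

int main(void) {
    /* Program equivalent to: print 10 + 5 * 2 */
    Instr program[] = {
        {PUSH, 10}, {PUSH, 5}, {PUSH, 2}, {MUL, 0}, {ADD, 0}, {PRINT, 0}, {HALT, 0}
    };
    int stack[64], sp = 0;
    int pc = 0;

    /* One explicit loop fetches and executes instructions; no recursion is used. */
    for (;;) {
        Instr i = program[pc++];
        switch (i.op) {
        case PUSH:  stack[sp++] = i.arg; break;
        case ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
        case PRINT: printf("%d\n", stack[--sp]); break;   /* prints 20 */
        case HALT:  return 0;
        }
    }
}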
Comparison of Recursive vs. Iterative Interpreters in Compiler Design
Evaluation method: a recursive interpreter invokes itself to evaluate nested expressions, whereas an iterative interpreter processes statements in an explicit loop.
Memory usage: recursive interpreters consume more call-stack space for deeply nested programs; iterative interpreters are generally more memory-efficient.
Handling recursion: recursive interpreters handle recursive constructs naturally; iterative interpreters must simulate recursion with an explicit stack.
Typical use: recursive interpreters suit AST evaluation and highly recursive languages (e.g., Lisp, Scheme); iterative interpreters suit bytecode or other intermediate representations.
1.3 LEXICAL ANALYSIS
Lexical analysis is the first phase of a compiler. During this phase, the lexical analyzer (also called a
scanner) breaks the input source code into a sequence of tokens. Tokens are the smallest units of
meaningful data in the programming language (such as keywords, identifiers, operators, literals, and
punctuation).
The primary role of lexical analysis is to ensure that the source code is divided into these manageable,
meaningful components for further processing by the syntax analyzer (parser). The lexical analyzer also
helps catch basic errors like unrecognized characters or malformed tokens.
1.3.1 THE ROLE OF THE LEXICAL ANALYZER
The lexical analyzer, also known as a scanner, is the first phase of the compiler and plays a crucial role
in transforming raw source code into a form that can be further processed by subsequent phases of the
compiler, such as the syntax analyzer (parser). The primary responsibility of the lexical analyzer is to
break the input source code into tokens, which are the smallest units of meaningful data in a
programming language.
Key Responsibilities of the Lexical Analyzer:
1. Tokenization:
o Tokenization is the process of dividing the source code into tokens, which are
categorized sequences of characters that represent syntactically meaningful elements
of the program.
o These tokens can represent a variety of constructs such as keywords (e.g., if, while),
identifiers (e.g., variable names), literals (e.g., numbers, strings), operators (e.g., +, -
, *), and punctuation (e.g., parentheses, semicolons).
o The lexical analyzer reads the raw source code character by character, groups
characters together based on predefined rules (such as regular expressions), and
produces these tokens.
2. Eliminating Whitespace and Comments:
o The lexical analyzer discards irrelevant elements like whitespace (spaces, tabs,
newlines) and comments, which do not contribute to the syntactic structure of the
program.
o These elements are useful for human readability but are unnecessary for the
compilation process. By removing them, the lexical analyzer simplifies the input
stream and reduces the work for later stages of the compiler.
3. Error Detection:
o The lexical analyzer helps catch lexical errors. A lexical error occurs when a
sequence of characters cannot be recognized as a valid token, such as an illegal
character or an invalid number format.
o For example, if the source code contains an invalid character (e.g., # in a language
where it is not allowed), the lexical analyzer will detect this and report an error before
the parsing phase.
o Error handling is a key part of the lexical analysis process. Lexical errors are
typically flagged, and informative error messages are provided, often pointing out
the exact location (line and column) of the issue.
4. Token Classification:
o The lexical analyzer classifies tokens into different token classes such as:
Keywords: Reserved words with predefined meanings in the language (e.g.,
if, for, return).
Identifiers: Names used to identify variables, functions, types, etc.
Literals: Constant values such as numbers (5, 3.14) and strings ("hello").
Operators: Symbols used to perform operations (e.g., +, -, *, =, ==).
Delimiters and Punctuation: Symbols like semicolons (;), parentheses (()),
curly braces ({}), and commas (,).
o By recognizing these tokens, the lexical analyzer helps structure the source code into
meaningful units that the parser can later use to build a syntax tree.
5. Providing a Stream of Tokens to the Parser:
o After processing the input source code and breaking it into tokens, the lexical
analyzer passes the token stream to the next phase of the compiler, which is the
syntax analyzer (parser).
o The parser expects the input in the form of a sequence of tokens, and it is the lexical
analyzer’s job to ensure that the tokens are correctly identified and passed on to the
parser.
How the Lexical Analyzer Works:
Input Processing: The lexical analyzer takes the source code as input, usually a sequence
of characters, and processes them to identify tokens.
Token Recognition: Using patterns specified by regular expressions or finite automata, the
lexical analyzer matches substrings in the source code to specific token patterns (e.g.,
identifiers, literals, keywords, etc.).
Token Generation: Each time a token is recognized, the lexical analyzer creates a token
object (or a simple representation of the token) and passes it to the parser.
Example of Lexical Analysis:
Consider the following simple source code:
int sum = 10 + 5;
The lexical analyzer would break this into the following tokens:
Keyword: int
Identifier: sum
Operator: =
Integer Literal: 10
Operator: +
Integer Literal: 5
Punctuation: ;
These tokens are passed to the syntax analyzer (parser) for further processing.
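The sketch below shows, in plain C, how a hand-written scanner might produce exactly these tokens for this statement; the helper name emit and the buffer size are purely illustrative:
#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Print the token class and lexeme, mirroring the list above. */
static void emit(const char *class, const char *lexeme) {
    printf("%s: %s\n", class, lexeme);
}

int main(void) {
    const char *p = "int sum = 10 + 5;";
    char buf[64];

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip whitespace         */
        if (isalpha((unsigned char)*p) || *p == '_') {       /* keyword or identifier   */
            int n = 0;
            while (isalnum((unsigned char)*p) || *p == '_') buf[n++] = *p++;
            buf[n] = '\0';
            emit(strcmp(buf, "int") == 0 ? "Keyword" : "Identifier", buf);
        } else if (isdigit((unsigned char)*p)) {             /* integer literal         */
            int n = 0;
            while (isdigit((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            emit("Integer Literal", buf);
        } else {                                             /* operator or punctuation */
            buf[0] = *p++; buf[1] = '\0';
            emit(strchr("=+-*/", buf[0]) ? "Operator" : "Punctuation", buf);
        }
    }
    return 0;
}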
Benefits of Lexical Analysis:
1. Simplifies Parsing:
o By breaking the input into tokens, the lexical analyzer simplifies the job of the syntax
analyzer. The parser only needs to handle structured tokens instead of raw characters.
2. Error Detection Early:
o It allows for early detection of certain types of errors, particularly lexical errors,
before the program is passed to more complex stages like parsing or semantic
analysis.
3. Optimized Execution:
o Lexical analysis enables more efficient execution in modern compilers. Techniques
like input buffering (e.g., double-buffering) improve performance by reducing the
number of input operations needed.
4. Modularity:
o Lexical analysis modularizes the compilation process by separating the task of
recognizing tokens from more complex tasks like syntax checking, making the
compiler easier to design and maintain.
By performing these tasks efficiently, the lexical analyzer ensures that the compiler can process the source
code correctly and efficiently, setting the stage for subsequent phases of compilation.
1.3.2 INPUT BUFFERING
Input buffering is a technique used by the lexical analyzer (scanner) in a compiler to efficiently read
and process characters from the source code during lexical analysis. The main goal of input buffering is
to minimize the overhead of repeated input operations and improve the performance of the lexical
analyzer.
Since the lexical analyzer processes the source code character by character, if it reads directly from the
input source (e.g., a file or a stream) every time a character is needed, it could result in a lot of unnecessary
I/O operations. To address this, buffering is used to hold a block of input characters in memory, allowing
the lexical analyzer to process them efficiently in bulk.
Key Concepts of Input Buffering
1. Buffering Basics:
o Buffer: A contiguous block of memory used to hold a portion of the input source
code. The lexical analyzer reads from the buffer instead of directly from the source
file.
o By using a buffer, the program can access characters from memory, which is faster
than accessing characters from disk or other input devices.
2. The Role of Input Buffering:
o The lexical analyzer needs to read the source code in chunks and process tokens.
Instead of constantly reading from the source (which can be slow), the input is read
into a buffer. The buffer acts as an intermediary between the input source and the
lexical analyzer.
o The buffer allows the lexical analyzer to look ahead and consume characters as
needed without having to frequently perform input operations.
3. Challenges:
o The main challenge is that the lexical analyzer often needs to look at multiple
characters ahead in the input (to determine where a token starts and ends).
o It also needs to handle situations where a token might span multiple reads from the
buffer or the source code.
Types of Input Buffering
There are two common types of input buffering techniques used in lexical analysis:
1. Single Buffering:
o In single buffering, only one buffer is used to hold the input characters.
o The lexical analyzer reads from this buffer character by character.
o Once the buffer is exhausted, it needs to be refilled from the source code, which may
cause some delays as the analyzer waits for new input.
Drawback: Single buffering requires the analyzer to frequently access the input source (e.g., disk or
terminal), which can be inefficient.
2. Double Buffering:
o In double buffering, two buffers are used to store input characters.
o While the lexical analyzer is reading from one buffer, the other buffer is
simultaneously being filled with the next block of characters from the source code.
o This method minimizes the idle time spent waiting for new input because one buffer
is always ready to be processed while the other is being populated with data from the
source code.
Advantage: Double buffering significantly improves performance by reducing the frequency of input
operations and ensuring that characters are always available for processing.
How Double Buffering Works
In double buffering, the input is divided into two buffers (usually named Buffer 1 and Buffer 2):
1. Initial Setup: The lexical analyzer starts by filling Buffer 1 with a chunk of characters from
the input source.
2. Processing: The lexical analyzer processes the characters in Buffer 1. While it is reading
and processing Buffer 1, Buffer 2 is simultaneously filled with the next chunk of input
characters from the source.
3. Buffer Swap: Once Buffer 1 is fully processed (or when more input is needed), the lexical
analyzer switches to Buffer 2 and starts reading from it. The roles of the buffers are swapped,
with Buffer 1 being filled with the next chunk of input while the lexical analyzer processes
Buffer 2.
4. Loop: This process continues, with the buffers alternating between being read and being
filled, allowing the lexical analyzer to process characters efficiently without waiting for
input operations.
Example of Double Buffering:
Let’s say the input source is the string "int sum = 10 + 5;". In double buffering:
Buffer 1 could initially contain: "int sum = ".
Buffer 2 could be filled simultaneously with the next part of the input: "10 + 5;".
The lexical analyzer reads the characters from Buffer 1 while Buffer 2 is being filled. Once Buffer 1 is
exhausted, the analyzer switches to Buffer 2, and Buffer 1 is filled with the next chunk of data.
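The following C sketch illustrates the idea, assuming two fixed-size buffers, a '\0' sentinel to mark the end of valid data, and a hypothetical input file program.c; in a production scanner the refill of the idle buffer would overlap with processing rather than happen sequentially as it does here:
#include <stdio.h>

#define BUF_SIZE 4096                       /* size of each buffer */

static char buf[2][BUF_SIZE + 1];           /* Buffer 1 and Buffer 2, plus room for a sentinel */
static int  cur = 0, pos = 0;               /* current buffer and position within it */
static FILE *src;

/* Fill the given buffer with the next chunk of the source and add a sentinel. */
static void refill(int which) {
    size_t n = fread(buf[which], 1, BUF_SIZE, src);
    buf[which][n] = '\0';                   /* '\0' marks the end of valid data       */
}                                           /* (assumes the source contains no NULs)  */

/* Return the next source character, swapping buffers when one is exhausted. */
static int next_char(void) {
    if (buf[cur][pos] == '\0') {            /* current buffer exhausted                     */
        int old = cur;
        cur = 1 - cur;                      /* switch to the other, already-filled buffer   */
        pos = 0;
        refill(old);                        /* refill the exhausted buffer; a real scanner  */
                                            /* would overlap this read with processing      */
        if (buf[cur][pos] == '\0') return EOF;   /* both buffers empty: end of input        */
    }
    return buf[cur][pos++];
}

int main(void) {
    src = fopen("program.c", "r");          /* illustrative file name */
    if (!src) return 1;
    refill(0); refill(1);                   /* initial setup: load both buffers */
    for (int c; (c = next_char()) != EOF; )
        putchar(c);                         /* a real lexer would group chars into tokens */
    fclose(src);
    return 0;
}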
Advantages of Input Buffering
1. Improved Efficiency:
o Input buffering reduces the number of times the lexical analyzer has to perform an
I/O operation, which makes the process of lexical analysis much faster, especially
for large programs.
o Double buffering is particularly effective because it ensures that the lexical analyzer
is never waiting for input; there's always a buffer ready for processing.
2. Reduced Latency:
o By having a second buffer ready, double buffering minimizes delays that would occur
if the analyzer had to wait for new characters to be read from the source.
3. Optimized Token Recognition:
o Input buffering ensures that the lexical analyzer can look ahead at characters to
correctly identify tokens, which is especially important for languages with complex
tokens (e.g., where a token might start in one buffer and end in another).
Example: How Buffering Helps Lexical Analysis
Consider the following input:
int a = 5 + 3;
Without input buffering, the lexical analyzer would read each character individually, making many separate trips to the input source and incurring I/O overhead on each one.
With double buffering, the lexical analyzer reads two blocks at a time (say, "int a =" in one buffer and "5
+ 3;" in another). This improves efficiency by reducing I/O overhead and enabling the lexer to quickly
process the input and tokenize it.
Buffering in Action in Lexical Analyzers
1. Efficient Reading: In a typical lexical analyzer, input buffering ensures that after the initial
load of characters, the analyzer spends most of its time processing tokens, not waiting for
more data.
2. Look-Ahead: Input buffering allows the lexer to efficiently perform look-ahead in
tokenization. For example, in the case of a multi-character token, such as an identifier or
keyword, the lexical analyzer can read ahead and determine where the token ends.
Input buffering is a crucial optimization in lexical analysis, enabling the lexical analyzer to efficiently
process input by reducing the number of direct I/O operations. Double buffering is particularly effective,
ensuring continuous input availability while the lexer is processing the current buffer. These techniques
improve the performance of the compiler, especially when analyzing large programs, by minimizing
delays caused by waiting for input and enabling faster token recognition.
1.3.3 SPECIFICATION OF TOKENS
In compiler design, the specification of tokens refers to the process of defining the patterns or rules that
identify the basic building blocks of a programming language. Tokens are the smallest units of meaning
in a programming language and serve as the input for the syntax analyzer (parser). These tokens can
include keywords, operators, identifiers, literals, and punctuation marks.
The process of token specification is typically done using regular expressions or context-free
grammars, which describe the pattern or structure of valid tokens.
1. What are Tokens?
A token is a sequence of characters in the source code that matches a pattern defined by the programming
language's syntax rules. Tokens are categorized into classes, each representing a specific type of construct
in the language.
Common types of tokens include:
Keywords: Reserved words in the programming language that have a predefined meaning,
such as if, while, return, int, for, etc.
Identifiers: Names used for variables, functions, classes, etc. An identifier typically starts
with a letter or underscore and may be followed by letters, digits, or underscores.
Literals: Constants or fixed values in the program. These can be:
o Integer literals (e.g., 10, 0, -5).
o Floating-point literals (e.g., 3.14, 0.1, -0.56).
o String literals (e.g., "hello", "world").
o Character literals (e.g., 'a', '%').
Operators: Symbols that represent operations, such as +, -, *, /, ==, etc.
Delimiters: Punctuation marks used for separating different components in the source code,
such as ;, ,, (), {}, [], etc.
2. Specifying Tokens Using Regular Expressions
In lexical analysis, regular expressions (regex) are commonly used to define the patterns that tokens must
match. A regular expression describes a set of strings that belong to a particular class of tokens.
For example:
Identifiers: In many languages, an identifier consists of a letter (or underscore) followed by any number
of letters, digits, or underscores. The regular expression for an identifier might be:
[a-zA-Z_][a-zA-Z0-9_]*
This regex ensures that the token begins with a letter or underscore, followed by any number of
alphanumeric characters or underscores.
Integer Literals: An integer literal consists of one or more digits. The regex for an integer might be:
[0-9]+
This pattern matches one or more digits.
Keywords: Keywords are typically specified directly, as they are reserved words with a fixed set. For
example:
if | else | return | int | for
Operators: The regex for operators such as +, -, *, /, ==, etc., can be:
\+ | - | \* | / | == | != | < | >
String Literals: String literals are typically enclosed in double quotes and may contain escaped characters
like \n, \t, etc. The regex might look like:
\"(\\.|[^\\"])*\"
This regex matches a string literal, handling escape sequences within the string.
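As a small illustration of how such a pattern can be checked, the sketch below uses the POSIX <regex.h> interface to test a few arbitrary strings against the identifier pattern; the test strings are chosen only for illustration:
#include <stdio.h>
#include <regex.h>

int main(void) {
    /* Identifier pattern from above, anchored so the whole string must match. */
    const char *pattern = "^[a-zA-Z_][a-zA-Z0-9_]*$";
    const char *tests[] = { "sum", "_tmp1", "9lives", "int" };
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 1;

    for (int i = 0; i < 4; i++) {
        int match = (regexec(&re, tests[i], 0, NULL, 0) == 0);
        printf("%-8s %s\n", tests[i], match ? "identifier" : "not an identifier");
    }
    /* Note: a real lexer would also check matched strings against the keyword  */
    /* list, so "int" would be reported as a keyword rather than an identifier. */
    regfree(&re);
    return 0;
}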
3. Token Specification Examples
Here’s a table of some common tokens and their corresponding regular expressions in a typical programming language:
Token class              Regular expression
Keyword                  if | else | while | return | int
Identifier               [a-zA-Z_][a-zA-Z0-9_]*
Integer literal          [0-9]+
Floating-point literal   [0-9]+\.[0-9]+
String literal           \"(\\.|[^\\"])*\"
Operator                 \+ | - | \* | / | == | != | < | >
Delimiter                ; | , | \( | \) | \{ | \}
4. Using Finite Automata (DFA) for Token Specification
A deterministic finite automaton (DFA) is a mathematical model used by lexical analyzers to recognize
tokens. Regular expressions can be translated into a DFA, which then processes the input stream character
by character.
The process works as follows (a small table-driven sketch appears after these steps):
1. Conversion to DFA: The regular expressions for token classes are converted into a DFA.
2. Input Processing: The DFA processes the input source code character by character,
transitioning between states based on the input symbols.
3. Token Recognition: When the DFA reaches an accepting state, a token has been successfully
recognized, and the lexical analyzer outputs the token.
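Putting these steps together, here is a minimal table-driven DFA sketch in C that accepts identifiers matching [a-zA-Z_][a-zA-Z0-9_]*; the state numbering and character classes are illustrative:
#include <stdio.h>
#include <ctype.h>

/* Character classes: 0 = letter or '_', 1 = digit, 2 = anything else. */
static int char_class(char c) {
    if (isalpha((unsigned char)c) || c == '_') return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Transition table: state 0 = start, state 1 = accepting (identifier), state 2 = reject. */
static const int delta[3][3] = {
    /* letter/_  digit  other */
    {     1,       2,     2 },   /* state 0: start     */
    {     1,       1,     2 },   /* state 1: accepting */
    {     2,       2,     2 },   /* state 2: reject    */
};

/* Run the DFA over the whole string; accept iff it ends in the accepting state. */
static int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = delta[state][char_class(*s)];
    return state == 1;
}

int main(void) {
    const char *tests[] = { "sum", "_x9", "10", "a-b" };
    for (int i = 0; i < 4; i++)
        printf("%-4s %s\n", tests[i], is_identifier(tests[i]) ? "identifier" : "rejected");
    return 0;
}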
5. Example of Token Specification Process
Consider a simple language with the following token types:
Keywords: if, else, int
Identifiers: Any string starting with a letter or underscore, followed by letters, digits, or
underscores.
Integer literals: A sequence of digits.
Here’s how token specification might work:
1. Keywords: We define keywords as a special class using direct matching (e.g., if, else, int).
2. Identifiers: We define identifiers using a regular expression, [a-zA-Z_][a-zA-Z0-9_]*.
3. Integer Literals: We define integer literals with the regular expression [0-9]+.
Sample Input: int x = 10;
The lexical analyzer would process this input and recognize the following tokens:
Keyword: int
Identifier: x
Operator: =
Integer Literal: 10
Punctuation: ;
6. Lexical Analyzer Tools (e.g., LEX)
Tools like LEX (a lexical analyzer generator) automate the token specification process. A user writes
regular expressions for token classes in the LEX input file, and LEX generates the C code that implements
the lexical analyzer. For example:
%%
int { return KEYWORD_INT; }
[0-9]+ { return INTEGER; }
[a-zA-Z_][a-zA-Z0-9_]* { return IDENTIFIER; }
%%
This LEX specification defines tokens for the keyword int, integer literals, and identifiers, and generates
a scanner that can recognize these tokens.
The specification of tokens is an essential part of the lexical analysis phase in compiler design. By using
regular expressions or finite automata, the compiler defines patterns for different types of tokens,
including keywords, identifiers, literals, operators, and punctuation. This allows the lexical analyzer to
efficiently break down the source code into meaningful components, which can then be processed by later
phases of the compiler (like the syntax analyzer).
1.3.4 THE LEXICAL-ANALYZER GENERATOR LEX
LEX is a powerful tool used for generating lexical analyzers (also known as scanners) in compilers. It
automates the process of turning regular expressions, which specify the patterns of tokens, into a C-based
program that can identify those tokens in a given input stream. LEX is widely used to simplify the
implementation of the lexical analysis phase in a compiler.
What is LEX?
LEX is a lexical analyzer generator that takes a set of regular expressions (used to define the patterns of
tokens) as input and generates a C program that performs lexical analysis on a source code file. The
generated C code uses a finite automaton to recognize the defined patterns (tokens) and perform actions
based on what it matches.
How LEX Works
1. Input Specification: The user provides a specification file for the lexical analyzer, which
consists of a set of regular expressions for token classes and associated C code for actions
to be taken when tokens are matched.
2. Regular Expressions: LEX uses regular expressions to define patterns for different tokens.
For example, it might have regular expressions for keywords (if, else), identifiers (variable
names), integer literals, etc.
3. Action Code: Alongside each regular expression, the user provides action code (usually in
C) that is executed when a pattern is matched. The action might involve returning a token to
the parser, performing an operation, or printing an error message.
4. Finite Automaton Generation: LEX translates the regular expressions into a finite
automaton, which is a state machine that can process the input string character by character,
transitioning between states based on the input and matching tokens.
5. C Code Generation: After processing the regular expressions and actions, LEX generates a
C program that includes the logic to scan input based on the defined regular expressions.
This C program can then be compiled and used to perform lexical analysis.
Structure of a LEX Specification File
A LEX specification file generally consists of three sections:
1. Definitions Section:
o This section is used to define macros, regular expressions, or include necessary
libraries.
o It typically contains global declarations or defines for constants and regular
expressions.
2. Rules Section:
o This section is the heart of the LEX specification.
o Each rule consists of a regular expression (defining the token) followed by a block
of C code that is executed when the token is recognized.
o Rules are separated by new lines, and LEX processes them in the order they are
specified.
3. User Code Section:
o This section allows for additional C code that will be included in the generated
scanner.
o It often includes function definitions and main() function code that initializes the
scanner and calls the lexical analyzer functions.
Example LEX Specification
Here’s an example of a simple LEX specification for a lexical analyzer that recognizes integer literals,
identifiers, and a few keywords (if, else, int):
%{
/* Definitions section: include necessary libraries or macros */
#include <stdio.h>
#include <stdlib.h>
%}
/* Rules section: specify token patterns and actions */
%%
if { printf("Keyword: if\n"); }
else { printf("Keyword: else\n"); }
int { printf("Keyword: int\n"); }
[0-9]+ { printf("Integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]* { printf("Identifier: %s\n", yytext); }
\n { /* ignore newlines */ }
[ \t] { /* ignore whitespace */ }
. { printf("Invalid character: %s\n", yytext); }
%%
/* User code section */
int main(int argc, char *argv[]) {
yylex(); /* Call the lexical analyzer */
return 0;
}
Explanation of the Example
Definitions Section:
o This section includes the necessary libraries (stdio.h for input/output and stdlib.h for
general functions).
o There are no macros or advanced definitions in this simple example.
Rules Section:
o if, else, and int: These are keywords in the language. When one of these keywords
is matched, the corresponding action (printf) is executed.
o [0-9]+: This regular expression matches one or more digits, representing an integer
literal. The action prints the matched number using yytext, which is a special
variable that holds the current matched token.
o [a-zA-Z_][a-zA-Z0-9_]*: This matches an identifier (a sequence starting with a
letter or underscore, followed by letters, digits, or underscores).
o [ \t]: This matches whitespace characters (spaces and tabs), which are ignored.
o \n: This matches a newline character, which is also ignored.
o .: This matches any character not matched by the previous rules and prints an error
message.
User Code Section:
o The main() function calls yylex(), which is the function generated by LEX that
performs the lexical analysis.
o The yylex() function is the entry point for the lexical analyzer. It reads input, applies
the rules, and invokes the corresponding actions.
LEX Output
When this LEX file is processed, the lex tool generates a C file (conventionally named lex.yy.c) that contains a lexical
analyzer. This C file is then compiled, resulting in an executable that can read an input stream and perform
lexical analysis. For example, given the input:
int x = 10;
if (x > 5) {
x = 20;
}
The output from running the generated program might be:
Keyword: int
Identifier: x
Invalid character: =
Integer: 10
Invalid character: ;
Keyword: if
Invalid character: (
Identifier: x
Invalid character: >
Integer: 5
Invalid character: )
Invalid character: {
Identifier: x
Invalid character: =
Integer: 20
Invalid character: ;
Invalid character: }
Key Features of LEX
1. Token Definition: LEX allows you to define tokens using regular expressions, which are
simple and concise.
2. Action Code: LEX allows you to specify C code that is executed when a token is recognized.
This can include returning tokens to the parser, printing information, or handling errors.
3. Efficient: LEX uses finite automata to match regular expressions efficiently. This ensures
that lexical analysis is fast even for large inputs.
4. Error Handling: LEX allows you to define actions for unmatched characters, making it
easy to handle lexical errors.
Advantages of Using LEX
1. Automation: LEX automates the creation of lexical analyzers, reducing the amount of code
that needs to be written manually.
2. Efficiency: The finite automaton approach used by LEX ensures that lexical analysis is
performed efficiently.
3. Portability: LEX generates C code, which can be compiled and run on different platforms,
making it portable across systems.
4. Flexibility: LEX allows you to define a wide range of token patterns and actions, making it
highly customizable for different programming languages.