Introduction to Lexical Analysis
Last Updated: 26 Aug, 2025
Lexical analysis, also known as scanning, is the first phase of a compiler. It reads the source program character by character, from left to right, and groups the characters into tokens.
What is a Token?
A token is a sequence of characters that is treated as a single unit in the grammar of a programming language.
Categories of Tokens
- Keywords: In C, keywords are reserved words with specific meanings that define the language's structure, such as if, else, for, and void. They cannot be used as variable names or other identifiers; doing so causes a compilation error. The original ANSI C standard (C89) defines 32 keywords, and later revisions of the standard add more.
- Identifiers: Identifiers in C are names for variables, functions, arrays, or other user-defined items. They must start with a letter or an underscore (_) and can include letters, digits, and underscores. C is case-sensitive, so uppercase and lowercase letters are different. Identifiers cannot be the same as keywords like if, else or for.
- Constants: Constants, also known as literals, are fixed values that cannot change during a program's execution. In C, constants include integers, floating-point numbers, characters, and strings.
- Operators: Operators are symbols in C that perform actions on variables or other data items, called operands.
- Special Symbols: Special symbols in C are compiler tokens used for specific purposes, such as separating code elements or defining operations. Examples include ; (semicolon) to end statements, , (comma) to separate values, {} (curly braces) for code blocks, and [] (square brackets) for arrays. These symbols play a crucial role in the program's structure and syntax.
What is a Lexeme?
A lexeme is the actual string of characters in the source program that matches the pattern for a token and is therefore recognized as an instance of that token.
For example: "float", "abs_zero_Kelvin", "=", "-", "273", ";".
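To make the lexeme/token distinction concrete, the following C sketch maps a lexeme such as those above to one of the token categories listed earlier. The enum names, the helper functions, and the shortened keyword list are assumptions made for illustration; a real scanner would also handle floating-point, character, and string constants as well as multi-character operators.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical category names mirroring the token categories in the text. */
typedef enum {
    TOK_KEYWORD,
    TOK_IDENTIFIER,
    TOK_CONSTANT,
    TOK_OPERATOR,
    TOK_SPECIAL,
    TOK_UNKNOWN
} TokenCategory;

/* Only a handful of C keywords, for illustration; not the full set. */
static const char *const KEYWORDS[] = {
    "if", "else", "for", "void", "int", "float", "return", "while"
};

static int is_keyword(const char *s) {
    size_t n = sizeof KEYWORDS / sizeof KEYWORDS[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(s, KEYWORDS[i]) == 0) return 1;
    return 0;
}

/* Identifier rule from the text: letter or underscore first,
   then letters, digits, or underscores. */
static int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_')) return 0;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_')) return 0;
    return 1;
}

/* Simplification: only unsigned decimal integers count as constants here. */
static int is_integer(const char *s) {
    if (s[0] == '\0') return 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (!isdigit((unsigned char)s[i])) return 0;
    return 1;
}

/* Keywords are checked before identifiers because every keyword
   also matches the identifier pattern. */
TokenCategory classify_lexeme(const char *s) {
    if (is_keyword(s)) return TOK_KEYWORD;
    if (is_identifier(s)) return TOK_IDENTIFIER;
    if (is_integer(s)) return TOK_CONSTANT;
    if (s[0] != '\0' && s[1] == '\0') {   /* single-character lexemes */
        if (strchr("+-*/=<>!", s[0])) return TOK_OPERATOR;
        if (strchr(";,{}[]()", s[0])) return TOK_SPECIAL;
    }
    return TOK_UNKNOWN;
}
```

With these assumptions, classify_lexeme("float") yields TOK_KEYWORD, classify_lexeme("abs_zero_Kelvin") yields TOK_IDENTIFIER, and classify_lexeme("273") yields TOK_CONSTANT.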
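Lexemes and Tokens Representation
The table below lists the lexemes and corresponding tokens for the statement while (a >= b) a = a - 2;
Lexemes | Tokens | Lexemes Continued... | Tokens Continued... |
---|---|---|---|
while | WHILE | a | IDENTIFIER |
( | LPAREN | = | ASSIGNMENT |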
a | IDENTIFIER | a | IDENTIFIER |
>= | COMPARISON | - | ARITHMETIC |
b | IDENTIFIER | 2 | INTEGER |
) | RPAREN | ; | SEMICOLON |
How Does a Lexical Analyzer Work?
Tokens in a programming language can be described using regular expressions. A scanner, or lexical analyzer, uses a Deterministic Finite Automaton (DFA) to recognize these tokens, because DFAs recognize exactly the regular languages. Each accepting (final) state of the DFA corresponds to a specific token type, which lets the scanner classify the input. The construction of a DFA from regular expressions can be automated, which makes token recognition both easy to specify and efficient.
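The DFA idea can be sketched as a small table-driven scanner in C. This is a minimal illustration, assuming only two token types (identifiers and unsigned integers); the state names, the character classes, and the scan_token function are hypothetical and not taken from any particular compiler.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { S_START, S_IDENT, S_INT, S_DEAD, NUM_STATES } State;
typedef enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES } CharClass;
typedef enum { T_NONE, T_IDENTIFIER, T_INTEGER } TokenType;

/* Transition table: one row per state, one column per character class. */
static const State TRANSITION[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IDENT, S_INT,   S_DEAD },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_DEAD },
    /* S_INT   */ { S_DEAD,  S_INT,   S_DEAD },
    /* S_DEAD  */ { S_DEAD,  S_DEAD,  S_DEAD },
};

/* Token type recognized in each state; T_NONE marks non-accepting states. */
static const TokenType ACCEPTS[NUM_STATES] = {
    T_NONE, T_IDENTIFIER, T_INTEGER, T_NONE
};

static CharClass classify_char(char c) {
    if (isalpha((unsigned char)c) || c == '_') return C_LETTER;
    if (isdigit((unsigned char)c)) return C_DIGIT;
    return C_OTHER;
}

/* Runs the DFA from the start of `input` and remembers the last accepting
   state seen, so the longest matching lexeme wins. Returns the lexeme
   length (0 if nothing matched) and stores its token type in *type. */
size_t scan_token(const char *input, TokenType *type) {
    State state = S_START;
    size_t longest = 0;
    *type = T_NONE;
    for (size_t i = 0; input[i] != '\0'; i++) {
        state = TRANSITION[state][classify_char(input[i])];
        if (state == S_DEAD) break;          /* no further match possible */
        if (ACCEPTS[state] != T_NONE) {
            longest = i + 1;
            *type = ACCEPTS[state];
        }
    }
    return longest;
}
```

For the input count1 = 5, scan_token returns 6 and reports T_IDENTIFIER, because the DFA reaches the dead state at the space after count1. Scanner generators build a similar, much larger table automatically from the regular expressions.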
The lexical analyzer detects lexical errors with the help of the automaton and the lexical rules of the language it is built for (for example, C or C++), and reports the row number and column number of each error.
Suppose we pass the statement a = b + c; through the lexical analyzer.
It generates the token sequence id = id + id; where each id refers to its variable's entry in the symbol table, which holds all the details about that variable.
For example, consider the program:
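A minimal C sketch of how a scanner might track the row and column of an error is shown here. It assumes, purely for illustration, that a lexical error is any character outside a fixed set of letters, digits, whitespace, and common C punctuation; the LexError struct and the find_first_error function are hypothetical names.

```c
#include <ctype.h>
#include <string.h>

/* Position and offending character of the first lexical error found. */
typedef struct {
    int line;    /* 1-based row number */
    int column;  /* 1-based column number; a tab counts as one column */
    char ch;
} LexError;

/* Returns 1 and fills *err if `src` contains a character that cannot
   start or continue any token in this simplified model; returns 0
   otherwise. String literals and comments are not treated specially. */
int find_first_error(const char *src, LexError *err) {
    int line = 1, column = 1;
    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        int ok = isalnum((unsigned char)c) || c == '_' ||
                 isspace((unsigned char)c) ||
                 strchr("+-*/%=<>!&|^~?:;,.(){}[]#\"'\\", c) != NULL;
        if (!ok) {
            err->line = line;
            err->column = column;
            err->ch = c;
            return 1;
        }
        if (c == '\n') {      /* a newline starts the next row */
            line++;
            column = 1;
        } else {
            column++;
        }
    }
    return 0;
}
```

For a two-line input whose second line is a = b @ c;, the function reports line 2, column 7, since @ is not part of any token in this model.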
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
The list above contains only the valid tokens; the comment is omitted because the lexical analyzer discards comments. As another example, a printf statement whose only argument is a string literal contains 5 valid tokens: printf, (, the string literal, ), and ;.
Exercise 1: Count the number of tokens:
int main()
{
    int a = 10, b = 20;
    printf("sum is:%d",a+b);
    return 0;
}
Answer: Total number of tokens: 27.
Exercise 2: Count the number of tokens:
int max(int i);
- The lexical analyzer first reads int, finds it valid, and accepts it as a token.
- It then reads max and accepts it as an identifier when it reaches (. The parser later determines that max names a function.
- ( is a token, int is a token, i is another token, followed by ) and finally ;.
Answer: Total number of tokens: 7 (int, max, (, int, i, ), ;).