
K Sai CPDS

This project report details the design and implementation of a mini compiler for a subset of the C programming language, developed in C. The compiler performs essential compilation phases including lexical, syntax, and semantic analysis, as well as intermediate code generation, focusing on basic constructs while simplifying complexities. The report also discusses challenges faced, strategies for error handling, and potential future enhancements for the compiler.


A Project report on

Mini Compiler for C Subset in C Language

SUBMITTED BY
Kanuri Sai Karthik (24261A0490)
Sakshi Tiwari (24261A04B6)
Inkulu Varshini (24261A087)

SUBMITTED TO

Dr. Y. Praveen Kumar Reddy

Assistant Professor

Department of Electronics and Communication Engineering (ECE)

MAHATMA GANDHI INSTITUTE OF TECHNOLOGY

(Autonomous)

Chaitanya Bharathi (PO), Kokapet(V), Gandipet (M), Ranga Reddy district, Hyd,
Telangana, India-500075

MGIT| Page1
TABLE OF CONTENTS
ABSTRACT
1. Introduction
1.1 Overview of the Project
1.2 Objectives and Scope
1.3 Importance of a Mini Compiler

2. Background and Literature Review


2.1 Overview of Compiler Design
2.2 Key Concepts in C Subset Compilation
2.3 Related Work

3. System Design
3.1 Architectural Overview
3.2 Components of the Mini Compiler
3.3 Design Constraints and Assumptions

4. Lexical Analysis
4.1 Role of the Lexer
4.2 Token Specification for C Subset
4.3 Implementation Details

5. Syntax Analysis
5.1 Basic Definition and Techniques Used
5.2 Grammar for the C Subset
5.3 Abstract Syntax Tree Generation

6. Semantic Analysis
6.1 Basic Definition and Working Principle
6.2 Symbol Table Management
6.3 Type Checking and Error Handling

7. Intermediate Code Generation


7.1 Intermediate Representation (IR)
7.2 Translation of Statements and Expressions

8. Testing and Debugging


8.1 Test Cases
8.2 Debugging Common Issues
8.3 Performance Analysis
9. Conclusion and Lessons Learnt
9.1 Lessons Learned
9.2 Future Enhancements

10. In Summary

REFERENCES

ABSTRACT
A compiler is a computer program that transforms source code written in a
programming language (the source language) into another computer
language (the target language), with the latter often having a binary form
known as object code. This report presents the design and implementation of a
mini compiler for a subset of the C programming language, developed using
the C language itself. The compiler translates high-level C subset code into
low-level machine code, enabling the execution of programs on a virtual or
physical machine. Key features of the compiler include lexical analysis,
syntax analysis, semantic analysis, intermediate code generation, and code
optimization. The project focuses on supporting essential programming
constructs such as variable declarations, expressions, conditional statements,
loops, and function calls, while simplifying complexities like advanced
pointer manipulations, memory management, and extensive library support.
The implementation leverages modular programming principles, ensuring
scalability and maintainability of the compiler components. This report details
the compiler’s architecture, including the development of a custom lexical
analyzer, parser, and code generator. Challenges such as error detection and
recovery, symbol table management, and type checking are discussed, along
with strategies employed to overcome them. Performance evaluations and
comparisons with existing compilers highlight the compiler’s efficiency and
accuracy. The report concludes with potential improvements and extensions,
such as support for additional C language features, optimizations for
execution speed, and compatibility with other hardware architectures. This
project serves as a foundational step for further research and development in
compiler design and programming language processing.

1. Introduction
1.1 Overview of the Project
The project focuses on the design and implementation of a mini compiler for a subset of
the C programming language. The mini compiler will serve as a tool to convert high-level
code written in the defined C subset into equivalent low-level machine code or an
intermediate representation. Using the provided codebase, the compiler processes and
executes simple C programs by performing the essential phases of compilation: lexical
analysis, syntax analysis, semantic analysis, and code generation. The project is
educational in nature, offering insight into compiler design through a hands-on approach.
It takes a minimalist approach by focusing on a reduced set of C language features.
The compiler processes basic C constructs,
including variable declarations, assignments, and print statements, to execute small
programs.

1.2 Objectives and Scope


The mini compiler is designed to handle a predefined subset of C language constructs,
such as basic variable declarations, arithmetic operations and print statements. However,
it does not support the entire C language and lacks advanced features. The primary
objective of this project is to build a functional compiler for educational purposes,
covering key phases such as lexical analysis, syntax analysis, semantic analysis, and code
generation. The scope is limited to a simplified C subset to reduce complexity. The project
focuses on academic learning and demonstration purposes rather than industrial-level use.

1.3 Importance of a Mini Compiler


Building a compiler is crucial for understanding how programming languages are
designed and translated into machine-executable instructions, forming the foundation of
modern computing. It integrates concepts from algorithms, data structures, and automata
theory, providing practical experience in language syntax, semantics, and code
optimization. Compilers play a vital role in detecting errors, improving performance, and
enabling efficient software development. This process not only deepens knowledge of
how programming languages work but also enhances problem-solving and diagnostic
skills essential for advanced computer science and software engineering.

2. Background and Literature Review
2.1 Overview of Compiler Design
A compiler is a software tool that translates high-level programming language code into
low-level machine code or an intermediate representation that can be executed by a
computer. It acts as a bridge between human-readable code and machine-executable code,
ensuring the program adheres to the syntactic and semantic rules of the programming
language. Compilers play a crucial role in making high-level programming efficient and
accessible by abstracting machine-level complexities. The provided code demonstrates
the fundamental phases of compiler design for a C subset. It includes lexical analysis,
syntax analysis, and semantic analysis. The evaluate_expression() function processes
arithmetic operations, while execution is handled directly without generating machine
code. This simple pipeline highlights the core components of a compiler, providing a
practical understanding of tokenization, parsing, and code evaluation.

2.2 Key Concepts in C Subset Compilation


C subset compilation streamlines the compilation process by focusing on a limited range
of C language features, such as variable declarations, basic arithmetic, assignments, and
print statements. It involves several key components: tokenization, which breaks the
source code into tokens like keywords and symbols; parsing, which validates the syntax
according to specific rules; and symbol table management, which stores variable names
and their values to ensure semantic correctness. Additionally, it includes expression
evaluation for handling simple arithmetic and resolving variable references, as well as
error handling to detect issues like undefined variables and provide clear error messages.
This approach offers a focused and educational framework for grasping fundamental
compiler functionalities without the complexities associated with full C compilation.

2.3 Related Work


Several mini compilers exist for educational purposes, often focusing on specific
languages. This project draws inspiration from tools like "Tiny C Compiler" and similar
lightweight implementations.

3. System Design
3.1 Architectural Overview
The mini compiler follows a modular architecture, where each phase of the compilation
process is implemented as an independent module. The workflow of the mini compiler
can be summarized as follows:

● Input Code: The user provides a source file written in a subset of the C language.
● Lexical Analysis: The source code is scanned to generate tokens.
● Syntax Analysis: Tokens are parsed to create an Abstract Syntax Tree (AST) that
represents the program's structure.
● Semantic Analysis: The AST is validated for type correctness, scope resolution,
and other semantic checks.
● Intermediate Code Generation: The validated AST is converted into three-address
code or another intermediate representation.
● Code Generation: The intermediate code is translated into target machine code or a
simpler assembly-like output.
● Output: The final output is a low-level representation or executable for a
predefined virtual machine or hardware.

Diagram of Workflow

Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis →
Intermediate Code Generation → Code Generation → Output

3.2 Components of the Mini Compiler


The key components of the mini compiler include the lexer, which tokenizes the source
code into meaningful symbols such as keywords, identifiers, and operators; the parser,
which organizes these tokens into a structured format (like an abstract syntax tree) and
checks for syntactical correctness; and the symbol table, which stores variable names
and their associated values for semantic analysis. Additionally, the compiler features an
expression evaluator that computes simple arithmetic operations and resolves variable
references. Error handling mechanisms are integrated to provide feedback on syntax
and semantic errors, ensuring a robust compilation process. Together, these
components enable the mini compiler to effectively process and execute a limited
subset of a programming language.

3.3 Design Constraints and Assumptions


The compiler assumes error-free input and supports a limited set of features.
These constraints keep the mini compiler focused on its educational and
functional goals while still giving an overview of the core functionality of
a compiler.

4. Lexical Analysis
4.1 Role of the Lexer
The lexer plays a crucial role in the mini compiler by serving as the first stage of the
compilation process. Its primary function is to read the raw source code and convert it
into a sequence of tokens, which are the fundamental building blocks for further analysis.
The lexer identifies and categorizes various elements of the code, including keywords,
identifiers, numeric literals, and symbols. By skipping whitespace and tracking line
numbers, the lexer ensures that the tokens are accurately represented for the parser.

4.2 Token Specification for C Subset


In a mini compiler designed for a subset of the C programming language, token
specification involves defining the various types of tokens that the lexer will recognize
and categorize from the source code. The token types are essential for the compilation
process, as they represent the fundamental building blocks of the language. The key token
types for the C subset include:

1. Keywords: These are reserved words with special meaning in the language. In this
subset, important keywords include:
● int: Used for declaring integer variables.
● print: Used for outputting values to the console.
● main: The entry point of the program.

2. Identifiers: These tokens represent variable names defined by the user. Identifiers
must follow specific naming conventions, typically starting with a letter or
underscore, followed by letters, digits, or underscores.
3. Numbers: Tokens that represent numeric literals, specifically integers in this
subset. They are recognized by their digit composition.
4. Operators and Symbols: These include various symbols that perform operations or
denote structure in the code:
● =: Assignment operator.
● +: Addition operator.
● ;: Semicolon, used to terminate statements.
● { and }: Curly braces, used to define code blocks.
● ( and ): Parentheses, used for grouping expressions and function calls.
5. End of File (EOF): A special token indicating the end of the input source code.
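
The token categories listed above could be represented in C roughly as follows. This is a minimal sketch: the enumerator names and the helper functions classify_word and classify_symbol are illustrative assumptions and are not taken from the project's source code.

```c
#include <string.h>

/* Token categories from the list above. All names here are
   hypothetical; the project's own identifiers may differ. */
typedef enum {
    TOK_INT, TOK_PRINT, TOK_MAIN,          /* keywords */
    TOK_IDENTIFIER, TOK_NUMBER,
    TOK_ASSIGN, TOK_PLUS, TOK_SEMICOLON,   /* operators and symbols */
    TOK_LBRACE, TOK_RBRACE, TOK_LPAREN, TOK_RPAREN,
    TOK_EOF
} TokenType;

/* Classify a word that was already scanned as letters, digits, and
   underscores: reserved words become keywords, anything else is an
   identifier. */
TokenType classify_word(const char *word) {
    if (strcmp(word, "int") == 0)   return TOK_INT;
    if (strcmp(word, "print") == 0) return TOK_PRINT;
    if (strcmp(word, "main") == 0)  return TOK_MAIN;
    return TOK_IDENTIFIER;
}

/* Map a single character to its operator or symbol token type.
   Returns 1 on success, 0 if the character is not in the subset. */
int classify_symbol(char c, TokenType *out) {
    switch (c) {
    case '=': *out = TOK_ASSIGN;    return 1;
    case '+': *out = TOK_PLUS;      return 1;
    case ';': *out = TOK_SEMICOLON; return 1;
    case '{': *out = TOK_LBRACE;    return 1;
    case '}': *out = TOK_RBRACE;    return 1;
    case '(': *out = TOK_LPAREN;    return 1;
    case ')': *out = TOK_RPAREN;    return 1;
    default:  return 0;
    }
}
```

Checking keywords before falling back to the identifier case ensures that a word such as int is never reported as a variable name.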

4.3 Implementation Details


The implementation of lexical analysis in the mini compiler revolves around the lexer,
which transforms raw source code into a sequence of tokens. The lexer is structured to
maintain the current state, including the source code, position, and line number, and is
initialized through the ‘lexer_init’ function. The core function, ‘lexer_next’, generates
tokens by skipping whitespace, detecting the end of input, and recognizing keywords,
identifiers, numbers, and symbols. It also includes error handling for unknown characters,
providing informative messages about the line number and character encountered. Each
token is represented by a structure that captures its type, lexeme, and line number for
error reporting. This implementation emphasizes clarity and efficiency, laying a solid
foundation for the subsequent phases of the compilation process. Here is an example of
how the lexer handles source code that contains errors.

[Figure: sample input containing errors and the corresponding lexer output]
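
The lexer described in this section can be sketched as follows. The function names lexer_init and lexer_next come from the report, but their signatures, the Lexer and Token layouts, the TokKind enumeration, and the wording of the error message are assumptions made for illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token kinds; T_ERROR marks an unknown character. */
typedef enum { T_KEYWORD, T_IDENTIFIER, T_NUMBER, T_SYMBOL, T_EOF, T_ERROR } TokKind;

typedef struct {
    TokKind kind;
    char lexeme[32];   /* text of the token, truncated if too long */
    int line;          /* line number, kept for error reporting */
} Token;

typedef struct {
    const char *src;   /* source code being scanned */
    size_t pos;        /* current position in src */
    int line;          /* current line number, starting at 1 */
} Lexer;

void lexer_init(Lexer *lx, const char *src) {
    lx->src = src;
    lx->pos = 0;
    lx->line = 1;
}

Token lexer_next(Lexer *lx) {
    Token t;
    size_t n = 0;

    /* Skip whitespace, counting newlines to track the line number. */
    while (isspace((unsigned char)lx->src[lx->pos])) {
        if (lx->src[lx->pos] == '\n') lx->line++;
        lx->pos++;
    }
    t.line = lx->line;
    t.lexeme[0] = '\0';

    char c = lx->src[lx->pos];
    if (c == '\0') { t.kind = T_EOF; return t; }

    if (isalpha((unsigned char)c) || c == '_') {        /* keyword or identifier */
        while (isalnum((unsigned char)lx->src[lx->pos]) || lx->src[lx->pos] == '_') {
            if (n < sizeof t.lexeme - 1) t.lexeme[n++] = lx->src[lx->pos];
            lx->pos++;
        }
        t.lexeme[n] = '\0';
        t.kind = (strcmp(t.lexeme, "int") == 0 || strcmp(t.lexeme, "print") == 0 ||
                  strcmp(t.lexeme, "main") == 0) ? T_KEYWORD : T_IDENTIFIER;
        return t;
    }
    if (isdigit((unsigned char)c)) {                     /* integer literal */
        while (isdigit((unsigned char)lx->src[lx->pos])) {
            if (n < sizeof t.lexeme - 1) t.lexeme[n++] = lx->src[lx->pos];
            lx->pos++;
        }
        t.lexeme[n] = '\0';
        t.kind = T_NUMBER;
        return t;
    }

    /* Single-character symbol, or an unknown character. */
    t.lexeme[0] = c;
    t.lexeme[1] = '\0';
    lx->pos++;
    if (strchr("=+;{}()", c)) {
        t.kind = T_SYMBOL;
    } else {
        t.kind = T_ERROR;
        fprintf(stderr, "Line %d: unknown character '%c'\n", t.line, c);
    }
    return t;
}
```

A caller would invoke lexer_init once and then call lexer_next repeatedly until a token of kind T_EOF is returned. In this sketch an unknown character is reported and consumed so that scanning can continue past it.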

5. Syntax Analysis (Parser)

5.1 Basic Definition and Technique Used


The parser verifies the token sequence against grammatical rules and ensures syntactic
correctness. Using recursive descent parsing, functions like parse_program() and
parse_statement() process the grammar of the subset (e.g., int main() { statements }) and
report syntax errors when rules are violated.

5.2 Grammar for the C Subset


The grammar for the C subset implemented in the provided mini compiler code defines
the syntactic structure of the language, specifying how various constructs can be formed.
This grammar is relatively simple and focuses on essential elements of the C
programming language. Key components of the grammar include:

● Program: int main() { statements }
● Statement: print IDENTIFIER; | int IDENTIFIER = expression;
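
A recursive descent recognizer for this grammar might look like the sketch below. The function names parse_program and parse_statement appear in the report, but the signatures, the Kind enumeration, and the Parser structure are assumptions. The report does not spell out the expression rule, so the sketch assumes that an expression is a number or identifier followed by zero or more "+ operand" pairs. To stay self-contained, it works on an array of token kinds that must end with K_EOF.

```c
#include <stddef.h>

/* Hypothetical token kinds for the C subset. */
typedef enum { K_INT, K_MAIN, K_PRINT, K_IDENT, K_NUMBER, K_ASSIGN, K_PLUS,
               K_SEMI, K_LBRACE, K_RBRACE, K_LPAREN, K_RPAREN, K_EOF } Kind;

typedef struct {
    const Kind *toks;  /* token kinds, terminated by K_EOF */
    size_t pos;        /* index of the next token to examine */
} Parser;

/* Consume the next token if it has kind k. K_EOF is never consumed,
   so the parser cannot read past the end of the array. */
int match(Parser *p, Kind k) {
    if (p->toks[p->pos] != k) return 0;
    if (k != K_EOF) p->pos++;
    return 1;
}

/* expression: operand { '+' operand }   (assumed rule) */
int parse_expression(Parser *p) {
    if (!match(p, K_NUMBER) && !match(p, K_IDENT)) return 0;
    while (match(p, K_PLUS))
        if (!match(p, K_NUMBER) && !match(p, K_IDENT)) return 0;
    return 1;
}

/* statement: print IDENTIFIER ; | int IDENTIFIER = expression ; */
int parse_statement(Parser *p) {
    if (match(p, K_PRINT))
        return match(p, K_IDENT) && match(p, K_SEMI);
    if (match(p, K_INT))
        return match(p, K_IDENT) && match(p, K_ASSIGN) &&
               parse_expression(p) && match(p, K_SEMI);
    return 0;
}

/* program: int main ( ) { statements }
   Returns 1 if the token sequence is syntactically valid, else 0. */
int parse_program(Parser *p) {
    if (!(match(p, K_INT) && match(p, K_MAIN) && match(p, K_LPAREN) &&
          match(p, K_RPAREN) && match(p, K_LBRACE)))
        return 0;
    while (p->toks[p->pos] != K_RBRACE && p->toks[p->pos] != K_EOF)
        if (!parse_statement(p)) return 0;
    return match(p, K_RBRACE) && match(p, K_EOF);
}
```

Each grammar rule maps to one function, which is what makes recursive descent easy to follow. A fuller version would report the line number of the offending token instead of returning 0.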

5.3 Abstract Syntax Tree Generation


An Abstract Syntax Tree (AST) is a hierarchical tree representation of the abstract
syntactic structure of source code. It is a crucial component in the compilation process,
serving as an intermediate representation that simplifies the analysis and transformation
of code. In this project, however, an AST is not explicitly constructed, for the sake of
simplicity; the design can be extended to build one.

6. Semantic Analysis
6.1 Basic Definition and Working Principle
Semantic analysis is the process of ensuring that a program's declarations and
statements are semantically correct, meaning their usage aligns with the intended
control structures and data types. It involves comparing information within
different parts of a parse tree, such as verifying that variable references match their
declarations and that function call parameters align with their definitions.
Implementing semantic actions is more straightforward in recursive descent
parsing, as they can be integrated into the recursive procedures. Key functions of
semantic analysis include maintaining and updating the symbol table, checking for
semantic errors and warnings like type mismatches, variable scope issues,
redefinitions, and the use of undeclared variables.

6.2 Symbol Table Management


Symbol table management in the provided mini compiler is essential for tracking
variable names and their associated values during compilation. Implemented as an
array of Symbol structures, the symbol table allows for efficient searching, adding,
and updating of variables. The ‘find_symbol’ function checks for existing
variables, while new entries are created when a variable is declared. The table also
facilitates value updates during assignment statements and includes error handling
for semantic checks, such as ensuring variables are declared before use and
preventing redefinitions. Overall, effective symbol table management is crucial for
maintaining variable state and ensuring adherence to variable scope and usage
rules throughout the compilation process. The symbol table stores variable
names and their associated values.
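
An array-based symbol table of the kind described above could be sketched as follows. Only the names Symbol and find_symbol come from the report. The field layout, the SymbolTable wrapper, the MAX_SYMBOLS limit, and the helpers declare_symbol and lookup_value are assumptions for illustration.

```c
#include <string.h>

#define MAX_SYMBOLS 100   /* assumed capacity */

typedef struct {
    char name[32];   /* variable name (names over 31 characters are truncated) */
    int value;       /* current integer value */
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    int count;       /* number of entries in use */
} SymbolTable;

/* Linear search; returns the index of the entry or -1 if absent. */
int find_symbol(const SymbolTable *t, const char *name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0) return i;
    return -1;
}

/* Add a new variable. Returns 0 on success, or -1 if the name is
   already declared (redefinition) or the table is full. */
int declare_symbol(SymbolTable *t, const char *name, int value) {
    if (find_symbol(t, name) >= 0 || t->count >= MAX_SYMBOLS) return -1;
    Symbol *s = &t->entries[t->count++];
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->value = value;
    return 0;
}

/* Read a variable's value. Returns 0 on success, or -1 if the
   variable was never declared. */
int lookup_value(const SymbolTable *t, const char *name, int *out) {
    int i = find_symbol(t, name);
    if (i < 0) return -1;
    *out = t->entries[i].value;
    return 0;
}
```

A linear search is adequate for the handful of variables in a small test program. A hash table would be the usual replacement if the subset grew.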

6.3 Type Checking and Error Handling
Type checking and error handling in the provided mini compiler are essential
aspects of semantic analysis that ensure the correctness of operations and
expressions. Type checking verifies that variables are assigned compatible data
types and that arithmetic operations are performed on appropriate types. The
compiler checks for semantic errors, such as the use of undeclared variables,
through symbol table management, generating error messages when issues arise.
Additionally, the ‘error’ function provides informative feedback, including line
numbers and error descriptions, to help users identify and correct problems in their
code. Overall, these mechanisms enhance the robustness of the semantic analysis
phase, ensuring adherence to rules of type usage and variable scope. The variable
assignment case below illustrates this process.

Variable Assignment:

Valid Assignment

Invalid Assignment

Type Checking Implementation: The compiler would check the type of the value being
assigned to x and raise an error if it does not match the expected type (integer).
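
One way to express this check in C is sketched below. Since the subset has only the int type, the check reduces to confirming that the assigned literal consists solely of digits. The function names is_integer_literal and check_int_assignment and the error message format are hypothetical. The report's own ‘error’ function may behave differently.

```c
#include <ctype.h>
#include <stdio.h>

/* Returns 1 if the lexeme is a non-empty run of decimal digits. */
int is_integer_literal(const char *lexeme) {
    if (*lexeme == '\0') return 0;
    for (; *lexeme != '\0'; lexeme++)
        if (!isdigit((unsigned char)*lexeme)) return 0;
    return 1;
}

/* Check an assignment of a literal to an int variable. Returns 1 if
   the value is an integer literal; otherwise prints an error with
   the line number and returns 0. */
int check_int_assignment(const char *name, const char *value_lexeme, int line) {
    if (is_integer_literal(value_lexeme)) return 1;
    fprintf(stderr, "Line %d: cannot assign '%s' to int variable '%s'\n",
            line, value_lexeme, name);
    return 0;
}
```

For example, check_int_assignment("x", "42", 1) accepts the assignment, while check_int_assignment("x", "3.14", 2) rejects it and reports line 2.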

7. Intermediate Code Generation


7.1 Intermediate Representation (IR)

Intermediate Representation (IR) is a crucial concept in compiler design that serves
as a bridge between the high-level source code and the low-level machine code. It
provides a way to represent the program in a form that is easier for the compiler to
analyze and optimize, while still being abstract enough to be independent of the
target architecture. In this project, the IR consists of the tokenized and parsed
structures, ready for evaluation.

7.2 Translation of Statements and Expressions


In the mini compiler, the translation of statements and expressions involves converting
high-level constructs from the source code into executable actions. For assignment
statements (e.g., int a = 5;), the compiler matches the int keyword and variable identifier,
stores the variable in the symbol table if it’s new, and evaluates the right-hand expression
to assign its value. For print statements (e.g., print a;), it retrieves the variable's value
from the symbol table and outputs it. The evaluation of expressions includes checking
token types, converting numbers, and handling arithmetic operations, allowing for the
execution of complex expressions. This translation process is essential for accurately
representing the program's logic in a form that can be executed by the underlying system.
Expressions are evaluated recursively.
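
Recursive evaluation of such expressions can be sketched as follows. The name evaluate_expression is taken from the report, but the signature is an assumption. To keep the example self-contained it reads directly from a string rather than from lexer tokens, and it uses a small hypothetical Var array in place of the symbol table. Only the + operator is handled, matching the token list in Section 4.2.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a symbol table entry (hypothetical). */
typedef struct { const char *name; int value; } Var;

/* Read one operand: an integer literal or a variable name.
   Sets *ok to 0 if the operand is missing or the variable is unknown. */
int eval_operand(const char **s, const Var *vars, int nvars, int *ok) {
    while (isspace((unsigned char)**s)) (*s)++;
    if (isdigit((unsigned char)**s)) {
        char *end;
        long v = strtol(*s, &end, 10);   /* convert the digits */
        *s = end;
        return (int)v;
    }
    if (isalpha((unsigned char)**s) || **s == '_') {
        const char *start = *s;
        while (isalnum((unsigned char)**s) || **s == '_') (*s)++;
        size_t len = (size_t)(*s - start);
        for (int i = 0; i < nvars; i++)
            if (strlen(vars[i].name) == len && strncmp(vars[i].name, start, len) == 0)
                return vars[i].value;    /* resolve the variable reference */
    }
    *ok = 0;
    return 0;
}

/* expression: operand [ '+' expression ]
   The caller sets *ok to 1 before the call and checks it afterwards. */
int evaluate_expression(const char **s, const Var *vars, int nvars, int *ok) {
    int left = eval_operand(s, vars, nvars, ok);
    while (isspace((unsigned char)**s)) (*s)++;
    if (*ok && **s == '+') {
        (*s)++;
        return left + evaluate_expression(s, vars, nvars, ok);
    }
    return left;
}
```

With a equal to 5 and b equal to 7, the input "a + b + 10" evaluates to 22. The right-recursive form is safe here because addition is associative; subtraction or division would need a left-associative loop instead.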

8. Testing and Debugging


8.1 Test Cases and Benchmarks
Testing the mini compiler involves running a variety of input programs through it to
check that the lexer correctly identifies tokens, the parser accurately recognizes the
program structure, and the semantic analysis checks type correctness and variable
usage. The tests also confirm that errors are handled gracefully and that the expected
output is produced. Here are some examples.

Sample Input (Test Case 1):

Obtained Output (Test Case 1):

Sample Input (Test Case 2):

Obtained Output (Test Case 2):

8.2 Debugging Common Issues
Below are some common issues encountered during development, along with strategies
for debugging:
8.2.1 Lexical Analysis Issues
Issue: Scanner fails to recognize a valid identifier or token.
Debugging Steps:
● Verify the regular expressions used for token recognition.
● Check the order of token matching rules to avoid conflicts (e.g., if being matched
as an identifier instead of a keyword).
● Add debug statements to log unrecognized characters.

8.2.2 Parsing Errors


Issue: The parser fails to build a syntax tree for valid input.
Debugging Steps:
● Verify the grammar rules and ensure they are unambiguous.
● Use a step-by-step parser trace to identify where the input deviates from expected
rules.
● Test each grammar rule independently using small inputs.

8.2.3 Semantic Analysis Errors


Issue: The symbol table does not resolve variable types correctly.
Debugging Steps:
● Check for scope-related issues (e.g., variable redeclaration in the same scope).
● Ensure that all identifiers are added to the symbol table during declaration.
● Add logging to trace symbol table lookups and ensure correctness.

8.2.4 Intermediate Code Generation Errors


Issue: Incorrect or missing intermediate code for certain operations.
Debugging Steps:
● Test intermediate code generation for simple expressions (e.g., a = b + c).
● Verify that the correct order of operations is maintained in the three-address code.
● Debug the handling of control flow statements like if and for.

8.2.5 Runtime Errors


Issue: The generated code produces incorrect results or crashes.

Debugging Steps:
● Simulate the execution of the intermediate code and verify its correctness step by
step.
● Check register allocation and memory management for errors.
● Add debug output to the generated code to trace execution.

8.3 Performance Analysis


The compiler performs well on small inputs with minimal overhead. The compilation time
is a crucial metric for evaluating the performance of the mini compiler. The mini compiler
performs slightly better in terms of memory usage compared to most alternatives. It
exhibits faster compilation times for small to medium-sized programs, making it suitable
for educational and lightweight purposes.

9. Conclusion and Lessons Learnt


9.1 Lessons Learned
The project offered several valuable insights. Below are some of the technical lessons
we learned while developing the code and resolving the errors we encountered along
the way:

● Understanding Compiler Phases: Each phase of the compilation process requires
careful planning and implementation to ensure correctness and efficiency.
● Complexity of Parsing: Designing a context-free grammar and implementing an
LL or LR parser deepened understanding of syntax analysis.
● Optimization Techniques: Even basic optimizations, such as constant folding and
dead code elimination, can significantly enhance the efficiency of the generated
code.
● Debugging Challenges: Implementing and debugging each phase of the compiler
required attention to detail and rigorous testing to handle edge cases.

9.2 Future Enhancements
9.2.1 Adding More Features of the C Language
Currently, the mini compiler supports a limited subset of the C language. Adding more
features would make it more versatile and closer to a full-fledged C compiler. Possible
enhancements include:

● Support for Pointers: Introducing pointer support, including dereferencing, pointer
arithmetic, and memory allocation functions like malloc and free.
● Structures and Unions: Adding support for user-defined data types to handle
complex data structures efficiently.
● File I/O: Enabling support for file handling functions such as fopen, fclose, fread,
and fwrite.
● Advanced Data Types: Supporting additional data types like long, double, and
unsigned.
● Preprocessor Directives: Including support for macros (#define), conditional
compilation (#ifdef, #endif), and file inclusion (#include).

9.2.2 Improving Optimization Techniques


The current optimization techniques (e.g., dead code elimination, constant folding) can be
further enhanced to produce more efficient code. Potential improvements include:

● Loop Unrolling: Transforming loops to execute multiple iterations in a single pass,
reducing overhead and improving performance.
● Function Inlining: Replacing function calls with the actual function body to reduce
function call overhead for small functions.
● Global Code Optimization: Performing interprocedural optimizations to analyze
and optimize across multiple functions.
● Register Allocation Enhancements: Improving the register allocation algorithm to
reduce the number of memory accesses and enhance execution speed.
● Peephole Optimization: Scanning generated assembly code for patterns of
inefficient instructions and replacing them with optimized sequences.

10. In Summary
In conclusion, the development of a mini compiler for a subset of the C programming
language represents a significant step in understanding the principles of compiler design
and implementation. Through the exploration of lexical analysis, syntax parsing,
semantic analysis, and code generation, we have demonstrated the fundamental processes
that transform high-level code into executable machine instructions. This project not only
highlights the intricacies involved in compiling a programming language but also serves
as a practical application of theoretical concepts in computer science. The mini compiler
effectively showcases the ability to parse and execute a limited set of C constructs,
providing a foundation for further enhancements and expansions. Future work could
involve extending the compiler's capabilities to support additional C features, optimizing
the generated code, or even implementing error handling mechanisms to improve user
experience. Overall, this mini compiler serves as a valuable educational tool, fostering a
deeper understanding of compiler architecture and paving the way for more complex
programming language implementations.

REFERENCES
● YouTube (Cobb Coding)
● GitHub Projects
● GeeksforGeeks
