Grammar Error Correction Project Report
Submitted in partial fulfillment of the Requirements for the award of the Degree of
By
2026091
Asst. Professor
2023-2024
PROFORMA FOR THE APPROVAL OF PROJECT PROPOSAL
Date: …..................................
RIZVI COLLEGE OF ARTS, SCIENCE & COMMERCE
CERTIFICATE
This is to certify that the project entitled "Grammar Error Correction" is the bona fide
work of Risal Shabbir Khan, bearing Seat No. 2026091, submitted in partial fulfilment
of the requirements for the award of the degree of BACHELOR OF SCIENCE in
INFORMATION TECHNOLOGY from the University of Mumbai.
External Examiner
Abstract
What do you think when you read this sentence: "She like playing in park and come
here every week."? If you know the English language, you might say that the sentence is
grammatically incorrect, and that the correct version is "She likes playing in the park
and comes here every week." That is not too difficult for us if we know English, but can
we make a computer rectify an incorrect sentence? You might wonder why we would need
to do that. The answer is that everyone needs it: whether you are writing a mail to a
client, a cover letter for a dream job, or a social media post, spelling or grammatical
errors can be distracting and make a proposal look unprofessional, and we want to avoid
this to make a good impression. The obvious question is, how do we do that? The NLP
literature offers a wide range of techniques, from classic rule-based approaches to
state-of-the-art deep learning methods. Grammatical error correction is the task of
automatically correcting grammatical errors in a text. A grammatical error correction
system takes an erroneous sentence as input and is expected to find all such errors and
transform the sentence into a corrected version.
Text summarization is the process of generating a short, fluent, and, most importantly,
accurate summary of a longer text document. The main idea behind automatic text
summarization is to find a short subset of the most essential information in the entire
text and present it in a human-readable format. As online textual data grows, automatic
text summarization methods have the potential to be very helpful, because more useful
information can be read in a short time.
ACKNOWLEDGEMENT
I am grateful to the people who were a part of this project in numerous ways, people
who gave their unending support right from the stage when the project idea was
conceived. The four things that make a successful endeavor are dedication, hard work,
patience, and correct guidance.
I would like to thank our principal, Mr. Khan Ashfaq Ahmad, who has always been a
source of inspiration. I am also thankful to Mr. Arif Patel, our coordinator, for all the
help he has rendered to ensure the successful completion of the project.
I take this opportunity to offer sincere thanks to Mrs. Hina Mahmood, who was kind
enough to give us the idea, guide us throughout our project work, and help us with the
project documentation.
I am thankful to all the teaching staff (I.T.) who shared their experience and gave
suggestions for developing our project in a better way.
Finally, I would like to thank all our friends and family members (PAPA, MUMMY) for
their support, and all others who have contributed directly or indirectly to the
completion of this project.
DECLARATION
I hereby declare that the project entitled "Grammar Error Correction & Text
Summarization", done at RIZVI COLLEGE OF ARTS, SCIENCE AND COMMERCE, has not been
duplicated or submitted to any other university for the award of any degree. To the
best of my knowledge, no one other than me has submitted it to any other university.
The project is done in partial fulfilment of the requirements for the award of the
degree of BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY) and is submitted as the
final semester project as part of our curriculum.
TABLE OF CONTENTS
Introduction
1.1 Background
1.2 Objective
1.3 Purpose, Scope and Applicability
1.3.1 Purpose
1.3.2 Scope
1.3.3 Applicability
Survey of Technologies
2.1 List of Technologies Used
2.2 Python vs R
2.3 Flask vs Django vs FastAPI
2.4 Other Popular Programming Languages for Deep Learning
2.5 Other Python Web Frameworks
Requirement and Analysis
3.1 Problem Definition
3.2 Requirements Specification
3.3 Planning and Scheduling
3.3.1 Gantt Chart
3.4 Requirements Specification
3.4.1 Hardware Requirements
3.4.2 Software Requirements
3.5 Conceptual Models
3.5.1 Data Flow Diagram
3.5.2 System Flowchart
3.5.3 Class Diagram
System Design
4.1 Basic Modules
4.1.1 User Interface (UI) Framework
4.1.2 Grammar Error Correction Module
4.1.3 Text Summarization Module
4.1.4 Integration and User Flow Management
4.1.5 Output Presentation Module
4.2 System Architecture (Flowchart)
4.3 Workflow of Deep Learning (GEC Task)
4.4 Dataset Preparation
4.5 Define Evaluator
4.6 Model Training
4.7 Testing
4.8 How to Do Text Summarization
4.8.1 Text Cleaning
4.8.2 Sentence Tokenization
4.8.3 Word Tokenization
4.8.4 Word-Frequency Table
4.8.5 Summarization
Implementation and Testing
5.1 Implementation Approaches
5.2 Code Details
5.2.1 Algorithms (Dataset and Training of that Dataset for GEC)
5.2.2 Algorithms (for Text Summarization)
5.2.3 [Link]
5.3 Code Efficiency
5.4 Testing Approach
Results and Discussion
Conclusions
7.1 Conclusion
7.2 Limitations and Future Scope of the Project
References
LIST OF FIGURES
Chapter 1
Introduction
1.1 Background
In recent years, the rapid advancement of Natural Language Processing (NLP) and deep
learning techniques has revolutionized the way we interact with and analyze textual data. One of
the prominent areas in which these advancements have made significant strides is in the field of
grammar error checking and text summarization. These two tasks, grammar error checking and
text summarization, have immense practical implications across various domains, including
education, communication, content creation, and information retrieval.
Grammatical Error Correction (GEC) systems aim to detect and correct grammatical
mistakes in text; Grammarly is an example of such a product. The GEC task can be framed
as a sequence-to-sequence task in which a Transformer model is trained to take an
ungrammatical sentence as input and return a grammatically correct sentence. Due to the
growing number of learners of English, English GEC has received increasing attention
over the past decade. By training on diverse and extensive datasets, these models can
capture complex language patterns, syntactic structures, and even the idiosyncrasies of
individual writers. As the volume of online text grows, reading everything in full
becomes impractical. Text summarization, the process of distilling lengthy documents
into concise and coherent summaries, addresses this challenge by enabling efficient
information consumption. Manual summarization is time-consuming and often lacks
objectivity, making automated summarization techniques an asset.
1.2 Objective
• Develop a grammar error checking system using deep learning from extensive datasets.
• Implement an extractive text summarization model, also using NLP techniques.
• Build a user-friendly interface for users to interact with the developed systems for grammar
error checking and text summarization.
1.3 Purpose, Scope and Applicability
1.3.1 Purpose
• To find a short subset of the most essential information from the entire set and present
it in a human-readable format.
• To develop a trained Transformer model that will take an ungrammatical sentence as
input and return a grammatically correct sentence.
1.3.2 Scope
• The scope of a grammar error checking model is wide-ranging and can have a significant
impact on various users and industries, such as academic writing, content writing, and
blogging.
• The scope of extractive text summarization is likewise substantial, with applications
across various industries and domains. Summarizing news articles allows for quick
digestion of information and easy access to essential details; researchers can use
summarization to get an overview of relevant papers, helping them quickly identify the
most pertinent information; and summarization can generate brief descriptions or
previews of recommended content, helping users decide what to engage with.
1.3.3 Applicability
Chapter 2
Survey of Technologies
2.1 List of Technologies Used
• T5 Model: The T5 model from Google is a text-to-text model, meaning it can be trained
to map input text in one format to output text in another format. It can be used with
many different objectives, such as summarization and text classification, and has even
been used to build a trivia bot that can retrieve answers from memory without any
provided context. T5 can be used for many tasks for a few reasons: 1. it can be applied
to any text-to-text task; 2. it achieves good accuracy on downstream tasks after
fine-tuning; 3. it is easy to train using Hugging Face.
• C4_200M Dataset: For the training of our grammar corrector, we use the C4_200M
dataset recently released by Google. This dataset consists of 200 million examples of
synthetically generated grammatical corruptions along with the correct text.
2.2 Python vs R
Python and R are both formidable contenders in the realms of deep learning and Natural
Language Processing (NLP). Python stands out as the go-to language for these tasks, owing to its
versatile libraries like TensorFlow, PyTorch, and Keras, which provide robust support for deep
learning models. Its syntax is intuitive, making it easier for developers to implement complex
neural networks. Additionally, Python boasts an extensive ecosystem of NLP libraries such as
NLTK and SpaCy, offering a wide range of tools for text processing. While R has some capabilities
in these domains, it's generally considered more suitable for statistical analysis and data
visualization rather than heavy-duty deep learning or NLP tasks. As such, for projects primarily
focused on these advanced techniques, Python remains the language of choice for its
comprehensive toolset and active community support.
2.3 Flask vs Django vs FastAPI
Flask is a lightweight micro-framework that leaves most architectural choices to the
developer. Django is a full-featured framework that ships with many built-in components.
FastAPI is a newer asynchronous framework, making it great for projects where speed and
real-time updates matter; it stands out for its speed and automatic data validation.
Chapter 3
Requirement and Analysis
The width of the horizontal bars in the graph shows the duration of each activity. As the project
progresses, the chart's bars are shaded to show which tasks have been completed.
3.5 Conceptual models
3.5.1 Data Flow Diagram
A data flow diagram (DFD) is a visual representation of how data moves through a process
or system. DFDs use standardized symbols and notations to describe a business's
operations, and they can be divided into logical and physical diagrams.
3.5.2 System Flowchart
A system flowchart is a diagram that shows how data flows through a system and how
decisions affect this process. It helps you recognize the flow of operations in the
system.
System Flowchart for GEC task
3.5.3 Class Diagram
A class diagram is a diagram used in software design and modeling to describe classes
and their relationships. In software engineering, a class diagram in the Unified
Modeling Language (UML) is a type of static structure diagram that describes the
structure of a system by showing the system's classes, their attributes, operations (or
methods), and the relationships among objects.
Class Diagram for GEC task
Chapter 4
System Design
4.1.2 Grammar Error Correction Module: This module is responsible for identifying
and rectifying grammatical errors in the provided text using NLP techniques and the T5
model. It processes the input text, identifies errors (such as spelling, grammar, and
punctuation), and provides corrected suggestions or feedback to the user.
4.1.3 Text Summarization Module: This module focuses on generating concise and
extractive summaries of lengthy texts, utilizing NLP techniques. It processes the input text,
extracts key information, and generates a summarized version, enabling users to grasp the essence
of the content quickly.
4.1.4 Integration and User Flow Management: This module ensures that the Grammar
Error Correction and Text Summarization components do not mix up each other's data. It
orchestrates the flow of user interactions, manages data transfer between modules, and
ensures the cohesive operation of the entire system.
4.1.5 Output Presentation Module: This module governs the presentation of results to the
user. It formats and displays the corrected text (from Grammar Error Correction) or the
summarized content (from Text Summarization) in an easily comprehensible manner.
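The modular split described above can be sketched as plain Python interfaces. This is a minimal illustrative sketch, not the project's actual code: the class and method names (`GrammarCorrectionModule`, `TextSummarizationModule`, `UserFlowManager`, `handle`) are hypothetical, and stubs stand in for the real T5 and summarization pipelines.

```python
class GrammarCorrectionModule:
    """Wraps the GEC model; a stub stands in for the fine-tuned T5 pipeline."""
    def correct(self, text: str) -> str:
        # In the real system this would run the trained T5 model.
        return text  # stub: echo the input unchanged

class TextSummarizationModule:
    """Wraps the extractive summarizer described in Section 4.8."""
    def summarize(self, text: str) -> str:
        # In the real system this would score and select sentences.
        return text.split(".")[0] + "."  # stub: first sentence only

class UserFlowManager:
    """Routes each request to exactly one module, keeping their data separate."""
    def __init__(self):
        self.gec = GrammarCorrectionModule()
        self.summarizer = TextSummarizationModule()

    def handle(self, task: str, text: str) -> str:
        if task == "correct":
            return self.gec.correct(text)
        if task == "summarize":
            return self.summarizer.summarize(text)
        raise ValueError(f"unknown task: {task}")
```

The single `handle` entry point is one way to realize the "Integration and User Flow Management" role: each request is dispatched to exactly one module, so the two components never share intermediate data.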
4.2 System Architecture (Flowchart)
4.3 Workflow of Deep Learning (GEC Task)
4.4 Dataset Preparation
For the training of our grammar corrector, we use the C4_200M dataset recently released
by Google. This dataset consists of 200 million examples of synthetically generated
grammatical corruptions along with the correct text.
One of the biggest challenges in GEC is getting a good variety of data that simulates
the errors typically made in written language. If the corruptions were random, they
would not be representative of the distribution of errors encountered in real use cases.
For this purpose, we extracted 550K sentences from C4_200M. The C4_200M dataset is
available on TF datasets. We extracted the sentences we needed and saved them as a CSV.
A screenshot of the C4_200M dataset is below. The input is the incorrect sentence, and the
output is the grammatically correct sentence. These random examples show that the dataset covers
inputs from different domains and a variety of writing styles.
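The two-column CSV layout described above can be sketched with the standard library. The rows below are invented toy examples for illustration only, not actual C4_200M entries, and the column names `input`/`output` are assumptions based on the description above.

```python
import csv
import io

# Each row pairs a corrupted sentence ("input") with its correction ("output").
rows = [
    {"input": "She like playing in park.", "output": "She likes playing in the park."},
    {"input": "I must to buy a new book.", "output": "I must buy a new book."},
]

# Write the pairs to a CSV (an in-memory buffer here; a file path in practice).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["input", "output"])
writer.writeheader()
writer.writerows(rows)

# Read the pairs back, as a training loader would.
buf.seek(0)
pairs = list(csv.DictReader(buf))
print(pairs[0]["output"])  # → She likes playing in the park.
```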
4.5 Define Evaluator
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to
evaluate the quality of machine-generated text, such as summaries or translations, by
comparing it to a set of reference (or gold-standard) texts created by humans. These
metrics are commonly used in natural language processing tasks such as text
summarization and machine translation.
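As a rough illustration of how these metrics work, ROUGE-1 measures unigram overlap between a candidate and a reference text. The sketch below is simplified (it ignores stemming and multiple references; production evaluation would use a library such as `rouge_score`):

```python
def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 precision, recall, and F1 (simplified)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped overlap: each reference token can be matched at most once.
    ref_counts = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    overlap = 0
    for tok in cand:
        if ref_counts.get(tok, 0) > 0:
            overlap += 1
            ref_counts[tok] -= 1
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# 5 of 6 unigrams overlap, so precision = recall = f1 = 5/6.
scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
```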
4.6 Model Training
T5 is preferred for many tasks for a few reasons: 1. it can be used for any text-to-text
task; 2. it achieves good accuracy on downstream tasks after fine-tuning. We set the
incorrect sentence as the input and the corrected text as the label. Both the inputs and
targets are tokenized using the T5 tokenizer.
4.7 Testing
We utilize the trained T5-based model, enabling users to input text and obtain a
corrected version as output. This function can be employed in various applications where
automated grammar correction is desired.
4.8.2 Sentence tokenization
This involves splitting the text into individual sentences. This can be done using a simple
heuristic, such as splitting the text at periods, exclamation points, and question marks. However,
more sophisticated methods can also be used, such as using a part-of-speech tagger to identify
sentence boundaries.
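The simple heuristic mentioned above, splitting at periods, exclamation points, and question marks, can be sketched with the standard library. Note that it will mis-split abbreviations such as "Dr.", which is why smarter boundary detection (e.g. in spaCy) exists:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., ! or ? when followed by whitespace; keep the punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sents = split_sentences("She likes the park. Does she come here every week? Yes!")
print(sents)
# → ['She likes the park.', 'Does she come here every week?', 'Yes!']
```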
4.8.5 Summarization
This involves generating an extractive summary of the text based on the word-frequency
table and other factors, such as the sentence structure of the text.
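Putting steps 4.8.2 through 4.8.5 together, a minimal frequency-based extractive summarizer might look like the sketch below. This illustrates the general technique only; the project's actual implementation uses spaCy and may differ in detail, and the tiny stopword list here is illustrative rather than complete.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # tiny illustrative list

def summarize(text: str, n_sentences: int = 2) -> str:
    # 4.8.2 Sentence tokenization: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # 4.8.3 Word tokenization and 4.8.4 word-frequency table (stopwords excluded).
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    # 4.8.5 Summarization: score each sentence by the frequencies of its words.
    def score(sent: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sent.lower()))
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original sentence order in the summary.
    return " ".join(s for s in sentences if s in top)
```

Sentences containing frequent content words score highest, so the summary keeps the sentences most representative of the document's vocabulary.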
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 Implementation Approaches
• Evaluation and Analysis: We have used the ROUGE score as the metric for evaluating
the T5 model.
5.2 Code Details
5.2.1 Algorithms (Dataset and Training of that dataset for GEC)
5.2.2 Algorithms (For Text Summarization)
5.2.3 [Link]
5.3 Code Efficiency
Here are some broad pointers that might be useful when trying to increase the efficiency
of a GEC and text summarization system:
• Pre-process the data by ensuring that it is correctly cleaned, normalized, and indexed
before it is used by the GEC system.
• To accelerate training of the model, I used Google Colab, where hardware acceleration
techniques such as GPUs or TPUs are available.
• Use Weights & Biases to monitor the performance of the model as it trains.
• Use the C4_200M dataset from Google, which consists of 200 million examples of
synthetically generated grammatical corruptions along with the correct text.
• The T5 model by Google stands out for its efficiency and performance due to its unified
text-to-text architecture, pre-training on extensive datasets, and transfer learning
approach. By treating all NLP tasks as text generation problems, T5 simplifies model
design and adaptation to various tasks, achieving state-of-the-art results across
benchmarks. Its flexibility in fine-tuning, availability in different sizes, and integration
into the Hugging Face Transformers library contribute to its widespread adoption in
natural language processing applications.
• Utilizing the spaCy library for tokenization provides efficient and fast tokenization of the
input text. SpaCy's tokenization is optimized and language-aware, enhancing overall
processing speed. 
5.4 Testing Approach
5.4.1 Categorize Grammar Errors:
grammar_module.py
Test Cases:
| Test Case Id | Description | Test Data | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| TC-01 | Missing Comma | Alan came to my house and Jim joined him | Alan came to my house, and Jim joined him | Alan came to my house, and Jim joined him | Pass |
| TC-02 | Apostrophe Usage | It is my friends house in England | It is my friend's house in England | It is my friend's house in England | Pass |
| TC-03 | Mixing up similar words | The book has a good affect on my mood | The book has a good effect on my mood | The book has a good effect on my mood | Pass |
| TC-04 | Pronoun Disagreement | Every girl must bring their books to school | Every girl must bring her books to school | Every girl must bring her books to school | Pass |
| TC-05 | Comparison | She is more taller | She is taller | She is taller | Pass |
| TC-06 | Prepositions | I went to church at Sunday | I went to church on Sunday | I went to church on Sunday | Pass |
| TC-07 | Subject–Verb disagreement | People is coming to my party | People are coming to my party | People are coming to my party | Pass |
| TC-08 | Wrong Tense | I have been to New York last summer | I was in New York last summer | I was in New York last summer | Pass |
| TC-09 | Misusing Adverbs–Adjectives | I want to speak English good | I want to speak good English | I want to speak good English. | Pass |
| TC-10 | Wrong use of words | I must to buy a new cartoon book | I must buy a new cartoon book | I must buy a new cartoon book | Pass |
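The test cases above can be driven by a small harness that feeds each input to the correction function and compares the result with the expected output. This is an illustrative sketch: `run_test_cases` and `KNOWN` are hypothetical names, and the toy lookup merely stands in for the project's T5-based corrector so the harness itself can be exercised.

```python
def run_test_cases(correct_fn, cases):
    """Run (test_id, input, expected) cases; return result rows with a status."""
    results = []
    for test_id, text, expected in cases:
        actual = correct_fn(text)
        status = "Pass" if actual == expected else "Fail"
        results.append((test_id, expected, actual, status))
    return results

# Toy stand-in for the real model: a lookup of known corrections.
KNOWN = {
    "She is more taller": "She is taller",
    "People is coming to my party": "People are coming to my party",
}
cases = [
    ("TC-05", "She is more taller", "She is taller"),
    ("TC-07", "People is coming to my party", "People are coming to my party"),
]
results = run_test_cases(lambda t: KNOWN.get(t, t), cases)
```

Because the harness only depends on a callable, the same code can later wrap the real T5 inference function without changes.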
text="""
In 2018, twenty-three days after Thanos erased half of all life in the universe, [a] Carol
Danvers rescues Tony Stark and Nebula from deep space and they reunite with the
remaining Avengers-Bruce Banner, Steve Rogers, Thor, Natasha Romanoff, and James
Rhodes and Rocket on Earth. Locating Thanos on an uninhabited planet, they plan to
use the Infinity Stones to reverse his actions, only to find that Thanos has already
destroyed them to prevent any further use. Enraged, Thor decapitates Thanos. Five
years later, Scott Lang escapes from the Quantum Realm. [b] Reaching the Avengers
Compound, he explains that he experienced only five hours while trapped. Theorizing
that the Quantum Realm allows time travel, they ask a reluctant Stark to help them
retrieve the Stones from the past to reverse the actions of Thanos in the present. Stark,
Rocket, and Banner, who has since merged his intelligence with the Hulk's strength,
build a time machine. Banner notes that altering the past does not affect their present;
any changes create alternate realities.
"""
print(summary)  # `summary` is produced by the summarization code in Section 5.2.2
Theorizing that the Quantum Realm allows time travel, they ask a reluctant Stark to help
them retrieve the Stones from the past to reverse the actions of Thanos in the present.
In 2018, twenty-three days after thanos erased half of all life in the universe, [a] Carol
Danver's rescues Tony Stark and Nebula from deep space and they reunite with the
remaining
[Link]
[Link]
Chapter 6
Results and Discussion
summarization_home.html
[Link]
/summarization_result
/url_summarizer_result
url_summarizer.html (Wikipedia Url)
/url_summarizer_result
/file_summarizer_result
Chapter 7
Conclusions
7.1 Conclusion
Building this project had its challenges, such as learning the data science side of
Python. We considered the R programming language but decided on Google's T5 model for
the machine-learning component; T5 is a widely used tool for training text models, and
comparing the options helped us pick the right tool for the job. Our model found
mistakes in small text samples but had some issues, showing that it needs improvement.
It is tough to catch all errors in English, as the language changes a lot. Still, we
believe there is a chance to build a more accurate model with different approaches.
With all the current research in Natural Language Processing, a revolutionary new way of
doing NLP may emerge that does not revolve around deep learning; if it does, the
algorithm used to train on the dataset should reflect it. A more practical approach,
though, is to find a greater amount of data than was used in this model and run many
more iterations of training. The key is to find data that is known to be grammatically
correct, which can be difficult. Many published bodies of text are a good resource for
this type of data, but it is important to stay away from non-reliable data such as
blogs, Twitter feeds, or anything where there is no verification of correctness.
7.2 Limitations and Future Scope of the Project
A practical path forward is to acquire a larger dataset than the one employed in this
model and undergo numerous additional training iterations. The crux lies in acquiring
data known for its grammatical correctness, a task that proves challenging. While
numerous published texts serve as valuable resources, caution is essential to steer
clear of unreliable sources such as blogs, Twitter feeds, or any unverified content.
Looking at advancements in Natural Language Processing (NLP), there might be a new
way without relying on Deep Learning. Adjusting the training method to match these new ideas
is crucial. Another idea is to get a bigger dataset than we have and do more training. The
challenge is finding data with correct grammar. Books and articles are good, but we need to
avoid unreliable data like blogs or Twitter. With more data and training, we hope to get a better
model.
To make our dataset better, we will ask users for feedback through the app. They can
correct examples and add them to the dataset. Filtering out wrong corrections is a
challenge, but having more data from users is a good tradeoff. As more people use the
app worldwide, we will get more data, which helps machine-learning tools grow and create
better systems.
There are promising avenues for enhancing the model's performance. Exploring
alternative attention mechanisms presents an opportunity to refine the score further, while
augmenting the training process with a more diverse dataset can contribute significantly to
overall improvement. By combining these approaches, the model stands to benefit from a more
robust and effective foundation for future applications.
References
• [Link]
• [Link]
• [Link]
• [Link]
• [Link]using-spacy-f19c9fbcfca8
• [Link]8750b1b6e404
• [Link]
• [Link]