Grammar Error Correction Project Report

The project report details the development of a Grammar Error Correction and Text Summarization system as part of the Bachelor of Science in Information Technology program. It outlines the objectives, technologies used, and the significance of automating grammar correction and summarization for effective communication and information processing. The report also includes acknowledgments, a declaration of originality, and a comprehensive table of contents for the project's structure.


GRAMMAR ERROR CORRECTION

& TEXT SUMMARIZATION


A Project Report

Submitted in partial fulfillment of the Requirements for the award of the Degree of

BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY)

By

Risal Shabbir Khan

2026091

Under the esteemed guidance of

Mrs. Hina Mahmood

Asst. Professor

DEPARTMENT OF INFORMATION TECHNOLOGY

RIZVI COLLEGE OF ARTS, SCIENCE & COMMERCE

(Affiliated to University of Mumbai)

MUMBAI, 400050 MAHARASHTRA

2023-2024

PROFORMA FOR THE APPROVAL OF PROJECT PROPOSAL

PNR NO.: …................................. ROLL NO.: …......................

1. Name of the Student: Risal Shabbir Khan

2. Title of the Project: Grammar Error Correction & Text Summarization

3. Name of the Guide: Prof. Mrs. Hina Mahmood

4. Teaching experience of the Guide: …................................

5. Is this your first submission? Yes No

Signature of the Student Signature of the Guide

Date: …............................ Date: …...........................

Signature of the Coordinator

Date: …..................................

RIZVI COLLEGE OF ARTS, SCIENCE & COMMERCE

(Affiliated to University of Mumbai)

MUMBAI, 400050 MAHARASHTRA

DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that the project entitled "Grammar Error Correction & Text
Summarization" is the bonafide work of Risal Shabbir Khan, bearing Seat No. 2026091,
submitted in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF SCIENCE in INFORMATION TECHNOLOGY from the University of Mumbai.

Internal Guide Coordinator

External Examiner

Date: College Seal

Abstract
What do you think when you read the sentence "She like playing in park and come
here every week"? If you know English, you can immediately tell that the sentence is
grammatically incorrect and that the corrected version is "She likes playing in the park
and comes here every week." That is easy for a human, but can we make a computer
rectify an incorrect sentence? You might wonder why we would need to. The answer is
that everyone needs it: whether you are writing an email to a client, a cover letter for a
dream job, or a social media post, spelling and grammatical errors are distracting and
make the writing look unprofessional, and we want to avoid this to make a good
impression. The obvious question is how we do that. The NLP literature offers a wide
range of techniques, from classic rule-based approaches to state-of-the-art deep learning
methods. Grammatical error correction is the task of automatically correcting
grammatical errors in a text: a grammatical error correction system takes an erroneous
sentence as input and is expected to find all such errors and transform the sentence into
its corrected version.

Text summarization is the process of generating a short, fluent, and, most importantly,
accurate summary of a longer text document. The main idea behind automatic text
summarization is to find a short subset of the most essential information in the full text
and present it in a human-readable format. As the volume of online textual data grows,
automatic text summarization methods have the potential to be very helpful, because
more useful information can be read in less time.

ACKNOWLEDGEMENT
I am grateful to the people who were part of this project in numerous ways,
people who gave their unending support right from the stage when the project idea was
conceived. The four things that go to make a successful endeavour are dedication, hard
work, patience and correct guidance.

I would like to thank our principal, Mr. Khan Ashfaq Ahmad, who has always been
a source of inspiration. I am also thankful to our coordinator, Mr. Arif Patel, for all the
help he has rendered to ensure the successful completion of the project.

I take this opportunity to offer sincere thanks to Mrs. Hina Mahmood, who was
kind enough to give us the idea, guide us throughout the project work, and help us
with the project documentation.

I am thankful to all the teaching staff of the I.T. department, who shared their
experience and gave suggestions for developing the project in a better way.

Finally, I would like to thank all my friends and family members (Papa and
Mummy) for their support, and all others who have contributed, directly or indirectly,
to the completion of this project.

DECLARATION

I hereby declare that the project entitled "Grammar Error Correction & Text
Summarization", done at RIZVI COLLEGE OF ARTS, SCIENCE AND COMMERCE,
has not in any way been duplicated from, or submitted to, any other university for the
award of any degree. To the best of my knowledge, no one other than me has submitted
it to any other university.

The project is done in partial fulfilment of the requirements for the award of the
degree of BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY) and is
submitted as the final-semester project as part of our curriculum.

Name and Signature of the Student

TABLE OF CONTENTS
Introduction ..................................................................................................................... 10
1.1 Background ............................................................................................................ 10
1.2 Objective ................................................................................................................ 10
1.3 Purpose, Scope and Applicability ........................................................................ 11
1.3.1 Purpose ............................................................................................................ 11
1.3.2 Scope ................................................................................................................ 11
1.3.3 Applicability .................................................................................................... 11
Survey of Technologies ................................................................................................... 12
2.1 List of Technology used ........................................................................................ 12
2.2 Python vs R ............................................................................................................ 13
2.3 Flask vs Django vs FastAPI .................................................................................. 13
2.4 Other popular Programming language for Deep learning ................................ 14
2.5 Other Python Web Frameworks .......................................................................... 14
Requirement and Analysis ............................................................................................. 15
3.1 Problem definition ................................................................................................. 15
3.2 Requirements Specification .................................................................................. 15
3.3 Planning and Scheduling ...................................................................................... 15
3.3.1 Gantt chart ...................................................................................................... 15
3.4 Requirements Specification .................................................................................. 16
3.4.1 Hardware Requirements: .............................................................................. 16
3.4.2 Software Requirements:................................................................................. 16
3.5 Conceptual models ................................................................................................ 17
3.5.1 Data Flow Diagram ........................................................................................ 17
3.5.2 System Flowchart ........................................................................................... 17
3.5.3 Class Diagram ................................................................................................. 18
System Design .................................................................................................................. 20
4.1 Basic Modules ........................................................................................................ 20
4.1.1 User Interface (UI) Framework: ................................................................... 20
4.1.2 Grammar Error Correction Module: ........................................................... 20
4.1.3 Text Summarization Module:........................................................................ 20
4.1.4 Integration and User Flow Management: .................................................... 20
4.1.5 Output Presentation Module: .................................................................... 20

4.2 System Architecture (Flowchart) ......................................................................... 21
4.3 Workflow of Deep Learning (GEC Task) ........................................................... 22
4.4 Dataset Preparation .............................................................................................. 23
4.5 Define Evaluator.................................................................................................... 23
4.6 Model Training ...................................................................................................... 24
4.7 Testing .................................................................................................................... 24
4.8 How to Do Text Summarization .......................................................................... 24
4.8.1 Text Cleaning .................................................................................................. 24
4.8.2 Sentence tokenization ..................................................................................... 25
4.8.3 Word tokenization .......................................................................................... 25
4.8.4 Word-frequency table .................................................................................... 25
4.8.5 Summarization ................................................................................................ 25
IMPLEMENTATION AND TESTING ........................................................................ 26
5.1 Implementation Approaches ................................................................................ 26
5.2 Code Details ........................................................................................................... 27
5.2.1 Algorithms (Dataset and Training of that dataset for GEC) ...................... 27
5.2.2 Algorithms (For Text Summarization) ......................................................... 33
5.2.3 [Link]............................................................................................................... 34
5.3 Code Efficiency ...................................................................................................... 37
5.4 Testing Approach .................................................................................................. 38
Results and Discussion.................................................................................................... 42
Conclusions ...................................................................................................................... 47
7.1 Conclusion .............................................................................................................. 47
7.2 Limitations and Future Scope of the Project ...................................................... 47
References ........................................................................................................................ 49

LIST OF FIGURES

Sr. No Figure Page no.


1 Gantt Chart 15
2 Data Flow Diagram for GEC task 16
3 Data Flow Diagram for Text Summarization 16
4 System Flowchart for GEC task 17
5 System Flowchart for Text Summarization 17
6 Class Diagram for GEC task 18
7 Class Diagram for Text Summarization 18
8 Output Presentation Module 19
9 System Architecture (Flowchart) 20
10 Workflow of Deep Learning (GEC Task) 21
11 Screenshot of Dataset 22
12 Evaluator: F1 22
13 T5 Model 23
14 Word tokenization 24

Chapter 1
Introduction
1.1 Background
In recent years, the rapid advancement of Natural Language Processing (NLP) and deep
learning techniques has revolutionized the way we interact with and analyze textual data. Two of
the areas in which these advancements have made significant strides are grammar error checking
and text summarization. Both tasks have immense practical implications across various domains,
including education, communication, content creation, and information retrieval.

Grammatical Error Correction (GEC) is the task of detecting and correcting grammatical
errors in text; Grammarly is an example of a commercial grammar-correction product. The GEC
task can be framed as a sequence-to-sequence task in which a Transformer model is trained to take
an ungrammatical sentence as input and return a grammatically correct sentence. Due to the
growing number of learners of English, GEC for English has received increasing attention over the
past decade. By training on diverse and extensive datasets, these models can capture complex
language patterns, syntactic structures, and even the idiosyncrasies of individual writers. Text
summarization, the process of distilling lengthy documents into concise and coherent summaries,
addresses the challenge of information overload by enabling efficient information consumption.
Manual summarization is time-consuming and often lacks objectivity, making automated
summarization techniques a valuable asset.

1.2 Objective
• Develop a grammar error checking system using deep learning from extensive datasets.
• Implement an extractive text summarization model also using NLP.
• Build a user-friendly interface for users to interact with the developed systems for grammar
error checking and text summarization.

1.3 Purpose, Scope and Applicability

1.3.1 Purpose

• To find a short subset of the most essential information from the entire set and present
it in a human-readable format.
• To develop a trained Transformer model that will take an ungrammatical sentence as
input and return a grammatically correct sentence.

1.3.2 Scope

• The scope of a Grammar Error Checking model is wide-ranging and can have a significant
impact on various users and industries, such as academic writing, content writing, and blogging.
• The scope of extractive text summarization is also substantial, with applications across many
industries and domains: summarizing news articles allows quick digestion of information and
easy access to essential details; researchers can use summarization to get an overview of
relevant papers and quickly identify the most pertinent information; and recommendation
systems can use summarization to generate brief descriptions or previews of content, helping
users decide what to engage with.

1.3.3 Applicability

The project's outcomes have wide-ranging applications, benefiting writers, educators,
professionals, and anyone striving for high-quality written content. The grammar error checker is
relevant in academia, business, creative writing, and language learning. The text summarization
model addresses the need for efficient information consumption amidst information overload.
Additionally, the project's exploration of NLP advances technology-driven language processing,
benefiting the many industries that rely on efficient content summarization.

Chapter 2
Survey of Technologies

2.1 List of Technology used


• Python: Python is a popular programming language widely used in the field of deep
learning. Its simplicity, readability, and extensive libraries make it an excellent choice for
developing complex neural networks and models. Python is also extensively used in
Natural Language Processing (NLP), a subfield of artificial intelligence that focuses on
enabling computers to understand, process, and generate human language.
• Flask: Flask is a web framework for building websites and web applications using the
Python programming language. It provides a set of tools and templates that make it easier
to create web pages, handle user interactions, and manage data. Think of Flask as a toolkit
that helps you assemble the different pieces needed to create a functional and interactive
website.
• NLTK: NLTK, short for Natural Language Toolkit, is a Python library for working with
human language. It is a toolbox full of tools that make it easier to understand and manipulate
words and sentences: breaking sentences into words, figuring out what each word means, and
even analysing the grammar. NLTK helps computers read, analyze, and generate language in a
smarter way.
• spaCy: spaCy is an industrial-strength NLP library for Python that helps computers
understand and work with human language. It makes tasks such as parsing sentences, finding
important words, and determining the role each word plays fast and straightforward. Just as
you can quickly read and understand a story, spaCy helps computers read and understand text,
making it very useful for building smarter language-based applications.

• T5 Model: T5, from Google, is a text-to-text model, meaning it can be trained to map input
text in one format to output text in another. It can be used with many different objectives, such
as summarization and text classification, and has even been used to build a trivia bot that
retrieves answers from memory without any provided context. T5 is suitable for a wide range
of tasks for a few reasons: 1. it can be used for any text-to-text task; 2. it achieves good
accuracy on downstream tasks after fine-tuning; 3. it is easy to train using Hugging Face.
• C4_200M Dataset: For the training of our Grammar Corrector, we use the C4_200M
dataset recently released by Google. This dataset consists of 200 million examples of
synthetically generated grammatical corruptions along with the corrected text.
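
Since the project serves both modules through Flask, a minimal sketch may help. The route names, form field, and placeholder logic below are illustrative assumptions, not the project's actual code.

```python
# Minimal Flask sketch (illustrative only): two hypothetical endpoints,
# /correct and /summarize, standing in for the real GEC and summarizer modules.
from flask import Flask, request

app = Flask(__name__)

@app.route("/correct", methods=["POST"])
def correct():
    text = request.form.get("text", "")
    # Placeholder: the real app would call the T5-based corrector here.
    return {"corrected": text}

@app.route("/summarize", methods=["POST"])
def summarize():
    text = request.form.get("text", "")
    # Placeholder: the real app would call the extractive summarizer here.
    return {"summary": text[:100]}

# Served locally during development with: app.run(debug=True)
```

Returning a dict from a view makes Flask serialize it to JSON automatically, which keeps the two modules' outputs cleanly separated.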

2.2 Python vs R
Python and R are both formidable contenders in the realms of deep learning and Natural
Language Processing (NLP). Python stands out as the go-to language for these tasks, owing to its
versatile libraries like TensorFlow, PyTorch, and Keras, which provide robust support for deep
learning models. Its syntax is intuitive, making it easier for developers to implement complex
neural networks. Additionally, Python boasts an extensive ecosystem of NLP libraries such as
NLTK and SpaCy, offering a wide range of tools for text processing. While R has some capabilities
in these domains, it's generally considered more suitable for statistical analysis and data
visualization rather than heavy-duty deep learning or NLP tasks. As such, for projects primarily
focused on these advanced techniques, Python remains the language of choice for its
comprehensive toolset and active community support.

2.3 Flask vs Django vs FastAPI


Flask is like a simple set of tools. It gives you the basics you need to start building a
website. You have more freedom to choose which tools you want to use for different tasks. It's like
building with individual Lego pieces. Django, on the other hand, is like a bigger toolbox with
more predefined tools. It comes with a lot of built-in features like user authentication, databases,
and more. It's like a set of Lego pieces that are already designed to work together in specific ways.
FastAPI is like a speed racer for building web applications with Python. It's known for its high
performance and automatic validation of data. It helps you create APIs quickly and efficiently,

making it great for projects where speed and real-time updates matter. FastAPI stands out for its
speed and automatic data validation.

2.4 Other popular Programming language for Deep learning


• Julia: It is like a supercharged engine for deep learning. It's designed to do complex
calculations quickly, which is a big advantage when you're training big neural networks
and need them to learn fast.
• MATLAB: It is like a teacher who can help you learn deep learning step by step. It
provides tools and functions specifically for neural networks and machine learning, making
it easier to experiment and develop models. It's like a guided path into the world of deep
learning.
• Java: Java is a versatile programming language that can run on the Java Virtual Machine
(JVM). It integrates well with existing Java libraries and frameworks, including those used
in machine learning and NLP like Apache OpenNLP.

2.5 Other Python Web Frameworks


• Tornado: It is a framework designed for handling asynchronous operations. It's
particularly useful for applications that need to handle many concurrent connections, such
as real-time applications that involve machine learning predictions.
• Bottle: It is a minimalistic framework that's easy to use and lightweight. It's suitable for
small projects and can be used to create simple web applications that incorporate machine
learning models.
• Streamlit: While not a traditional web framework, Streamlit is designed specifically for
creating data-driven web applications quickly. It's great for building interactive dashboards
and apps to showcase your deep learning models.
• Dash (Plotly): It is another tool for building interactive web applications with Python.
It's well-suited for creating data visualization and analytics dashboards that can incorporate
deep learning models.

Chapter 3
Requirement and Analysis

3.1 Problem definition


Correcting grammar errors is crucial for effective communication as it prevents confusion,
maintains professionalism, and aids language learners. Deep learning algorithms power tools that
automatically identify and rectify grammar mistakes. In my project, I aim to create a deep learning
model focused on this task. The goal is to enhance written content's quality and clarity, thereby
improving communication and comprehension. The project may involve training the model on a
dataset with varying degrees of grammar errors and evaluating its performance based on correction
accuracy and efficiency.
Using a text summarizer offers numerous advantages. Firstly, it saves time by allowing
you to swiftly grasp the main points of a lengthy document. This is especially valuable for dense
or time-sensitive materials. Secondly, it enhances comprehension by providing a clearer overview
of the document's structure and the connections between ideas. It utilizes extractive
summarization, selecting sentences from the original document based on factors like keyword
frequency, sentence position, and relationships between sentences. This process automatically
generates a concise summary capturing the document's key points.

3.2 Requirements Specification

• A sequence-to-sequence task where a Transformer model is trained to take an
ungrammatical sentence as input and return a grammatically correct sentence.
• To be able to find a short subset of the most essential information from the entire set and
present it in a human-readable format.

3.3 Planning and Scheduling


Planning and scheduling are processes that turn project action plans for scope, time, cost
and quality into an operating timetable. Planning is largely concerned with choosing the necessary
rules and procedures to fulfill the project’s objectives.

3.3.1 Gantt chart


A Gantt chart is a type of bar chart that illustrates a project schedule. It is a popular project
management tool that allows project managers to view the progress of a project at a glance. A
Gantt chart lists the tasks to be performed on the vertical axis, and time intervals on the horizontal axis.

The width of the horizontal bars in the graph shows the duration of each activity. As the project
progresses, the chart's bars are shaded to show which tasks have been completed.

3.4 Requirements Specification

3.4.1 Hardware Requirements:


• CPU: A quad-core or higher CPU is recommended.
• GPU: A dedicated GPU is recommended for faster performance.
• RAM: 8GB of RAM is the minimum requirement. 16GB or more is recommended for
better performance.
• Storage: A large amount of storage space is required for the datasets and models.

3.4.2 Software Requirements:


• A programming language that supports NLP and machine learning, such as Python.
• A text processing library, such as NLTK or spaCy.
• A machine learning library, such as TensorFlow.
• A large dataset of text with annotated grammar errors or human-written summaries.

3.5 Conceptual models
3.5.1 Data Flow Diagram
A data flow diagram (DFD) is a visual representation of how data moves through a process
or system. DFDs use standardized symbols and notations to describe a business's operations, and
they can be divided into logical and physical diagrams.

Data Flow Diagram for GEC task

Data Flow Diagram for Text Summarization

3.5.2 System Flowchart

A system flowchart is a diagram that shows how data flows through a system and how
decisions affect this process. It can help you recognize the flow of operations in the system.

System Flowchart for GEC task

System Flowchart for Text Summarization

3.5.3 Class Diagram

A class diagram is a diagram used in software design and modeling to describe classes and
their relationships. In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system by showing
the system's classes, their attributes, operations (or methods), and the relationships among objects.

Class Diagram for GEC task

Class Diagram for Text Summarization

Chapter 4
System Design

4.1 Basic Modules


4.1.1 User Interface (UI) Framework: This module forms the backbone of the application,
providing the graphical interface for user interaction. It handles user inputs, displays relevant
information, and facilitates seamless navigation between the Grammar Error Correction and Text
Summarization components.

4.1.2 Grammar Error Correction Module: This module is responsible for identifying
and rectifying grammatical errors in the provided text using NLP techniques and T5 model. It
processes the input text, identifies errors (such as spelling, grammar, punctuation), and provides
corrected suggestions or feedback to the user.

4.1.3 Text Summarization Module: This module focuses on generating concise and
extractive summaries of lengthy texts, utilizing NLP techniques. It processes the input text,
extracts key information, and generates a summarized version, enabling users to grasp the essence
of the content quickly.

4.1.4 Integration and User Flow Management: This module ensures that the Grammar
Error Correction and Text Summarization components operate on their own data without
interfering with each other. It orchestrates the flow of user interactions, manages data transfer
between modules, and ensures the cohesive operation of the entire system.

4.1.5 Output Presentation Module: This module governs the presentation of results to the
user. It formats and displays the corrected text (from Grammar Error Correction) or the
summarized content (from Text Summarization) in an easily comprehensible manner.

4.2 System Architecture (Flowchart)

4.3 Workflow of Deep Learning (GEC Task)

4.4 Dataset Preparation
For the training of our Grammar Corrector, we use the C4_200M dataset recently released
by Google. This dataset consists of 200 million examples of synthetically generated grammatical
corruptions along with the corrected text.
One of the biggest challenges in GEC is getting a good variety of data that simulates the
errors typically made in written language. If the corruptions are random, then they would not be
representative of the distribution of errors encountered in real use cases.
For this purpose, we extracted 550K sentences from C4_200M. The C4_200M dataset is
available on TF datasets. We extracted the sentences we needed and saved them as a CSV.
A screenshot of the C4_200M dataset is below. The input is the incorrect sentence, and the
output is the grammatically correct sentence. These random examples show that the dataset covers
inputs from different domains and a variety of writing styles.
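
The extract-and-save step can be sketched with Python's standard csv module. The two sentence pairs below are invented for illustration; the real pipeline extracts 550K rows from C4_200M via TF Datasets.

```python
import csv

# Invented example pairs (corrupted sentence, corrected sentence) standing in
# for rows extracted from the C4_200M dataset.
pairs = [
    ("She like playing in park.", "She likes playing in the park."),
    ("He go to school yesterday.", "He went to school yesterday."),
]

# Save in the two-column layout described above: input and output.
with open("gec_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "output"])
    writer.writerows(pairs)

# Read the rows back to confirm the layout.
with open("gec_sample.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

The same two-column CSV is what the training step later consumes, with the incorrect sentence as the model input and the corrected sentence as the label.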

4.5 Define Evaluator


The Evaluator is designed to compare the output generated by the T5 model against a
reference or ground truth, which consists of corrected sentences devoid of any grammatical
mistakes. Through a systematic evaluation process, the Evaluator provides valuable metrics such
as precision, recall, F1-score, and other relevant indicators, offering insights into the model's
proficiency. These metrics serve as essential benchmarks for fine-tuning and optimizing the T5
model, ultimately enhancing its ability to produce linguistically accurate and contextually
appropriate outputs. We will be using rouge scores as the metric.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to
evaluate the quality of machine-generated text, such as summaries or translations, by comparing
it to a set of reference (or gold standard) texts created by humans. These metrics are commonly
used in natural language processing tasks like text summarization and machine translation.
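
As a rough sketch of how such an overlap metric works, ROUGE-1 can be computed by counting overlapping unigrams. This simplified function is illustrative only; a real evaluation would use a maintained library (for example, Google's rouge-score package), which also handles stemming and the ROUGE-2/ROUGE-L variants.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram precision, recall and F1 between a reference and a candidate.

    Simplified sketch: no stemming, single reference, ROUGE-1 only.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("she likes playing in the park",
                "she likes playing in park")
```

Here every candidate word appears in the reference (precision 1.0), while the reference word "the" is missed, so recall is 5/6.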

4.6 Model Training


We will use the ever-versatile T5 model from Google for this training. T5 is a text-to-text
model, meaning it can be trained to map input text in one format to output text in another. I have
personally used this model for several different objectives, such as summarization and text
classification, and have used it to build a trivia bot that can retrieve answers from memory without
any provided context.

T5 is preferable for many tasks for a few reasons: 1. it can be used for any text-to-text task;
2. it achieves good accuracy on downstream tasks after fine-tuning. We set the incorrect sentence
as the input and the corrected text as the label. Both the inputs and the targets are tokenized using
the T5 tokenizer.

4.7 Testing
We will utilize the specified T5-based model, enabling users to input text and obtain
corrected versions as output. This function can be employed in various applications where
automated grammar correction is desired.

4.8 How to Do Text Summarization

4.8.1 Text Cleaning


This involves removing unnecessary elements from the text, such as stop words, HTML
tags, punctuation, and special characters. This makes the text easier for the NLP model to
process.
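A minimal sketch of such a cleaning step, using only Python's standard library (the stop-word list here is a small illustrative subset, not the full list an NLP library such as NLTK or spaCy would provide):

```python
import re

# Illustrative subset of English stop words, not an exhaustive list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # strip punctuation and special characters
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

clean_text("<p>The model, surprisingly, works!</p>")
# → 'model surprisingly works'
```

Note that for summarization, stop-word removal is applied when scoring words, while the sentences shown to the user keep their original wording.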

4.8.2 Sentence tokenization
This involves splitting the text into individual sentences. This can be done using a simple
heuristic, such as splitting the text at periods, exclamation points, and question marks. However,
more sophisticated methods can also be used, such as using a part-of-speech tagger to identify
sentence boundaries.
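The simple heuristic described above can be sketched with a single regular expression. Note that it misfires on abbreviations such as "Dr." or "e.g.", which is one reason libraries like NLTK and spaCy ship trained sentence segmenters:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive heuristic: split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

split_sentences("It works. Does it? Yes!")
# → ['It works.', 'Does it?', 'Yes!']
```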

4.8.3 Word tokenization


This involves splitting each sentence into individual words. This can be done using a simple
whitespace tokenizer, which splits the text on spaces. However, more sophisticated methods can
also be used, such as a regular expression tokenizer that handles cases like hyphenated words
and contractions.
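A sketch of a regular-expression tokenizer of the kind described, which keeps hyphenated words and contractions as single tokens rather than splitting them apart:

```python
import re

def tokenize_words(sentence: str) -> list[str]:
    # A word is a run of alphanumerics, optionally joined by "'" or "-"
    # (so "don't" and "state-of-the-art" each stay one token).
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)

tokenize_words("Don't split state-of-the-art terms.")
# → ["Don't", 'split', 'state-of-the-art', 'terms']
```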

4.8.4 Word-frequency table


This involves creating a table that shows the number of times each word appears in the
text. This table can be used to identify the most important words in the text, which can then be
used to generate a summary.
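A minimal sketch of building such a table with `collections.Counter`, with counts normalized so the most frequent word scores 1.0 (a common convention in frequency-based summarizers):

```python
from collections import Counter

def word_frequencies(words: list[str]) -> dict[str, float]:
    counts = Counter(word.lower() for word in words)
    top = max(counts.values())
    # Normalize so the most frequent word scores 1.0.
    return {word: count / top for word, count in counts.items()}

word_frequencies(["NLP", "models", "nlp", "rock"])
# → {'nlp': 1.0, 'models': 0.5, 'rock': 0.5}
```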

4.8.5 Summarization
This involves generating an extractive summary of the text based on the word-frequency
table and other factors, such as the sentence structure of the text.
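Putting the steps above together, a minimal frequency-based extractive summarizer might look like the following sketch (the stop-word list is a small illustrative subset; a real pipeline would use a library's full list):

```python
import re
from collections import Counter

# Illustrative subset of English stop words.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "that", "it"}

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower())
             if w not in STOP_WORDS]
    if not words or not sentences:
        return ""
    freq = Counter(words)
    top = max(freq.values())
    weights = {word: count / top for word, count in freq.items()}
    # Score each sentence by the summed weight of its (non-stop) words.
    scores = {s: sum(weights.get(w, 0.0)
                     for w in re.findall(r"[a-z0-9']+", s.lower()))
              for s in sentences}
    best = set(sorted(sentences, key=scores.get, reverse=True)[:n_sentences])
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in best)
```

One design note: selected sentences are re-emitted in document order rather than score order, since shuffling sentences usually hurts the readability of an extractive summary.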

CHAPTER 5
IMPLEMENTATION AND TESTING

5.1 Implementation Approaches


• Data Acquisition and Preprocessing: We prioritized a robust dataset that
accurately reflects real-world error distributions. The C4_200M dataset, with its 200 million
synthetically generated examples, offered the versatility and quality we sought. To ensure
efficient model training, we extracted 550,000 sentences and saved them in CSV format,
suitable for further processing. We implemented a multi-step preprocessing pipeline to
prepare text for summarization: text cleaning, the removal of irrelevant elements such as
punctuation, stop words, and special characters to refine the core content, and sentence
tokenization, the segmentation of text into individual sentences for analysis.
• Model Selection and Training: We harnessed the power of Google's T5 model, a
versatile text-to-text transformer architecture, to learn grammatical correction patterns
from our dataset. To effectively model sentence-level corrections, we employed T5's
tokenizer with a maximum length of 64 tokens, aligning with the typical sentence length
in our data. We leveraged the Seq2Seq Trainer class from Huggingface for model
instantiation and integrated Weights & Biases (wandb) for seamless logging and
monitoring of training progress. For Text Summarization, we employed a word
tokenization technique to segment text into individual words, followed by the construction
of a word-frequency table to capture key content aspects.

• Evaluation and Analysis: We used the ROUGE score as the metric for evaluating
the T5 model.

5.2 Code Details
5.2.1 Algorithms (Dataset and Training of that dataset for GEC)

5.2.2 Algorithms (For Text Summarization)

5.2.3 [Link]

5.3 Code Efficiency
Here are some broad pointers that can help increase the efficiency of a GEC and Text
Summarization system:
• Pre-process the data by ensuring that it is correctly cleaned, normalized, and
indexed before it is utilized by the GEC system.
• To accelerate model training, I used Google Colab, which provides hardware
acceleration through GPUs and TPUs.
• Use Weights & Biases to monitor the performance of the model as it trains.
• Use the C4_200M dataset by Google, which consists of 200M examples of synthetically
generated grammatical corruptions along with the correct text.
• The T5 model by Google stands out for its efficiency and performance due to its unified
text-to-text architecture, pre-training on extensive datasets, and transfer learning
approach. By treating all NLP tasks as text generation problems, T5 simplifies model
design and adaptation to various tasks, achieving state-of-the-art results across
benchmarks. Its flexibility in fine-tuning, availability in different sizes, and integration
into the Hugging Face Transformers library contribute to its widespread adoption in
natural language processing applications.
• Utilizing the spaCy library for tokenization provides efficient and fast tokenization of the
input text. SpaCy's tokenization is optimized and language-aware, enhancing overall
processing speed. 

5.4 Testing Approach
5.4.1 Categorize Grammar Errors:

grammar_module.py

Test Cases:
| Test Case Id | Description | Test Data | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| TC-01 | Missing Comma | Alan came to my house and Jim joined him | Alan came to my house, and Jim joined him | Alan came to my house, and Jim joined him | Pass |
| TC-02 | Apostrophe Usage | It is my friends house in England | It is my friend's house in England | It is my friend's house in England | Pass |
| TC-03 | Mixing up similar words | The book has a good affect on my mood | The book has a good effect on my mood | The book has a good effect on my mood | Pass |
| TC-04 | Pronoun Disagreement | Every girl must bring their books to school | Every girl must bring her books to school | Every girl must bring her books to school | Pass |
| TC-05 | Comparison | She is more taller | She is taller | She is taller | Pass |
| TC-06 | Prepositions | I went to church at Sunday | I went to church on Sunday | I went to church on Sunday | Pass |
| TC-07 | Subject–Verb disagreement | People is coming to my party | People are coming to my party | People are coming to my party | Pass |
| TC-08 | Wrong Tense | I have been to New York last summer | I was in New York last summer | I was in New York last summer | Pass |
| TC-09 | Misusing Adverbs – Adjectives | I want to speak English good | I want to speak good English | I want to speak good English | Pass |
| TC-10 | Wrong use of words | I must to buy a new cartoon book | I must buy a new cartoon book | I must buy a new cartoon book | Pass |
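Test cases of this shape can be automated with a small harness. In the real project the `correct` function would call the fine-tuned T5 model; here it is a hypothetical stand-in, stubbed with a lookup table purely to illustrate the harness structure:

```python
# (case id, test data, expected result) triples, as in the test-case table.
CASES = [
    ("TC-06", "I went to church at Sunday", "I went to church on Sunday"),
    ("TC-07", "People is coming to my party", "People are coming to my party"),
]

# Stub standing in for model inference; NOT the real T5 call.
STUB_MODEL = {test_data: expected for _, test_data, expected in CASES}

def correct(sentence: str) -> str:
    return STUB_MODEL.get(sentence, sentence)  # placeholder for model inference

def run_cases():
    results = []
    for case_id, test_data, expected in CASES:
        status = "Pass" if correct(test_data) == expected else "Fail"
        results.append((case_id, status))
    return results

run_cases()
# → [('TC-06', 'Pass'), ('TC-07', 'Pass')]
```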

5.4.2 Testing for Text Summarization

text="""

In 2018, twenty-three days after Thanos erased half of all life in the universe, [a] Carol
Danvers rescues Tony Stark and Nebula from deep space and they reunite with the
remaining Avengers-Bruce Banner, Steve Rogers, Thor, Natasha Romanoff, and James
Rhodes and Rocket on Earth. Locating Thanos on an uninhabited planet, they plan to
use the Infinity Stones to reverse his actions, only to find that Thanos has already
destroyed them to prevent any further use. Enraged, Thor decapitates Thanos. Five
years later, Scott Lang escapes from the Quantum Realm. [b] Reaching the Avengers
Compound, he explains that he experienced only five hours while trapped. Theorizing
that the Quantum Realm allows time travel, they ask a reluctant Stark to help them
retrieve the Stones from the past to reverse the actions of Thanos in the present. Stark,
Rocket, and Banner, who has since merged his intelligence with the Hulk's strength,
build a time machine. Banner notes that altering the past does not affect their present;
any changes create alternate realities.
"""
print(summary)

Theorizing that the Quantum Realm allows time travel, they ask a reluctant Stark to help
them retrieve the Stones from the past to reverse the actions of Thanos in the present.
In 2018, twenty-three days after thanos erased half of all life in the universe, [a] Carol
Danver's rescues Tony Stark and Nebula from deep space and they reunite with the
remaining

[Link]

[Link]

Chapter 6

Results and Discussion


[Link]

[Link]

summarization_home.html

[Link]

/summarization_result

url_summarizer.html (News Url)

/url_summarizer_result

url_summarizer.html (Wikipedia Url)

/url_summarizer_result

file_summarizer.html (File based)

/file_summarizer_result

Chapter 7

Conclusions
7.1 Conclusion
Building this project had its challenges, such as learning the parts of Python used for data
science. We considered the R programming language but decided on Google's T5 model, a
widely used tool for training text-to-text systems. Our model found mistakes in small text
samples but had some issues, showing it needs improvement. It is difficult to catch every error
in English, as the language changes constantly. Still, we believe a more accurate model is
achievable with different approaches.
With all the current research in Natural Language Processing, a revolutionary new approach to
text NLP may emerge that does not revolve around Deep Learning. If it does, the algorithm used
to train on the dataset should reflect it. A more practical approach, though, is to gather a greater
amount of data than was used in this model and run many more iterations of training. The key is
to find data that is known to be grammatically correct, which can be difficult. Many published
bodies of text are a good resource for this type of data, but it is important to stay away from
unreliable data such as blogs, Twitter feeds, or anything without verification of correctness.

7.2 Limitations and Future Scope of the Project


The implementation of this model was shown to be effective in catching several mistakes in the
small text samples. However, there are many it did not do well on, which implies there is clear
room for improvement. Since English is a complex language with several dimensions depending
on context, there may never be a computer system that can catch 100 percent of the errors all
the time. Humans themselves struggle to keep up with all the new lingo and the different ways
words gain or lose meanings every day. However, a more accurate model than the one generated
is possible, and several approaches can be taken.
In the realm of ongoing Natural Language Processing research, there is potential for an
innovative approach to text NLP that does not center on Deep Learning. Should this materialize,
the training algorithm should align with it. However, a more practical strategy involves sourcing
a larger dataset than the one employed in this model and undergoing numerous additional
training iterations. The crux lies in acquiring data known for its grammatical correctness, a task
that proves challenging. While numerous published texts serve as valuable resources, caution is
essential to steer clear of unreliable sources such as blogs, Twitter feeds, or any unverified
content.
To make our dataset better, we'll ask users for feedback through the app. They can correct
examples and add them to the dataset. Figuring out which user corrections are wrong is a
challenge, but having more data from users is a good tradeoff. As more people use the app
worldwide, we'll get more data. This helps machine learning tools grow and create better
systems.
There are promising avenues for enhancing the model's performance. Exploring
alternative attention mechanisms presents an opportunity to refine the score further, while
augmenting the training process with a more diverse dataset can contribute significantly to
overall improvement. By combining these approaches, the model stands to benefit from a more
robust and effective foundation for future applications.

