Grammar Error Correction Project Report

The project report details the development of a Grammar Error Correction and Text Summarization system as part of the Bachelor of Science in Information Technology program. It outlines the objectives, technologies used, and the significance of automating grammar correction and summarization for effective communication and information processing. The report also includes acknowledgments, a declaration of originality, and a comprehensive table of contents for the project's structure.


GRAMMAR ERROR CORRECTION

& TEXT SUMMARIZATION


A Project Report

Submitted in partial fulfillment of the Requirements for the award of the Degree of

BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY)

By

Risal Shabbir Khan

2026091

Under the esteemed guidance of

Mrs. Hina Mahmood

Asst. Professor

DEPARTMENT OF INFORMATION TECHNOLOGY

RIZVI COLLEGE OF ARTS, SCIENCE & COMMERCE

(Affiliated to University of Mumbai)

MUMBAI, 400050 MAHARASHTRA

2023-2024

PROFORMA FOR THE APPROVAL OF PROJECT PROPOSAL

PNR NO.: …................................. ROLL NO.: …......................

1. Name of the Student: Risal Shabbir Khan

2. Title of the Project: Grammar Error Correction & Text Summarization

3. Name of the Guide: Prof. Mrs. Hina Mahmood

4. Teaching experience of the Guide: …................................

5. Is this your first submission? Yes No

Signature of the Student Signature of the Guide

Date: …............................ Date: …...........................

Signature of the Coordinator

Date: …..................................

RIZVI COLLEGE OF ARTS, SCIENCE & COMMERCE

(Affiliated to University of Mumbai)

MUMBAI, 400050 MAHARASHTRA

DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that the project entitled "Grammar Error Correction & Text
Summarization" is the bonafide work of Risal Shabbir Khan, bearing Seat No. 2026091,
submitted in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF SCIENCE in INFORMATION TECHNOLOGY from the University of Mumbai.

Internal Guide Coordinator

External Examiner

Date: College Seal

Abstract
What do you think when you read the sentence "She like playing in park and come
here every week"? If you know English, you can immediately tell that the sentence is
grammatically incorrect and that the corrected version is "She likes playing in the park
and comes here every week." That is easy for a human, but can we make a computer
rectify an incorrect sentence? You might wonder why we would need to. The answer is
that everyone needs it: whether you are writing an email to a client, a cover letter for a
dream job, or a social media post, spelling and grammatical errors are distracting and
make the writing look unprofessional, and we want to avoid this to make a good
impression. The obvious question is how we do that. The NLP literature offers a wide
range of techniques, from classic rule-based approaches to state-of-the-art deep learning
methods. Grammatical error correction is the task of automatically correcting
grammatical errors in a text: a grammatical error correction system takes an erroneous
sentence as input and is expected to find all such errors and transform the sentence into
its corrected version.

Text summarization is the process of generating a short, fluent, and, most importantly,
accurate summary of a longer text document. The main idea behind automatic text
summarization is to find a short subset of the most essential information in the full text
and present it in a human-readable format. As the volume of online textual data grows,
automatic text summarization methods have the potential to be very helpful, because
more useful information can be read in less time.

ACKNOWLEDGEMENT
I am grateful to the people who were part of this project in numerous ways,
people who gave their unending support right from the stage when the project idea was
conceived. The four things that go to make a successful endeavour are dedication, hard
work, patience and correct guidance.

I would like to thank our principal, Mr. Khan Ashfaq Ahmad, who has always been
a source of inspiration. I am also thankful to our coordinator, Mr. Arif Patel, for all the
help he has rendered to ensure the successful completion of the project.

I take this opportunity to offer sincere thanks to Mrs. Hina Mahmood, who was
kind enough to give us the idea, guide us throughout the project work, and help us
with the project documentation.

I am thankful to all the teaching staff of the I.T. department, who shared their
experience and gave suggestions for developing the project in a better way.

Finally, I would like to thank all my friends and family members (Papa and
Mummy) for their support, and all others who have contributed, directly or indirectly,
to the completion of this project.

DECLARATION

I hereby declare that the project entitled "Grammar Error Correction & Text
Summarization", done at RIZVI COLLEGE OF ARTS, SCIENCE AND COMMERCE,
has not in any way been duplicated from, or submitted to, any other university for the
award of any degree. To the best of my knowledge, no one other than me has submitted
it to any other university.

The project is done in partial fulfilment of the requirements for the award of the
degree of BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY) and is
submitted as the final-semester project as part of our curriculum.

Name and Signature of the Student

TABLE OF CONTENTS
Introduction ..................................................................................................................... 10
1.1 Background ............................................................................................................ 10
1.2 Objective ................................................................................................................ 10
1.3 Purpose, Scope and Applicability ........................................................................ 11
1.3.1 Purpose ............................................................................................................ 11
1.3.2 Scope ................................................................................................................ 11
1.3.3 Applicability .................................................................................................... 11
Survey of Technologies ................................................................................................... 12
2.1 List of Technology used ........................................................................................ 12
2.2 Python vs R ............................................................................................................ 13
2.3 Flask vs Django vs FastAPI .................................................................................. 13
2.4 Other popular Programming language for Deep learning ................................ 14
2.5 Other Python Web Frameworks .......................................................................... 14
Requirement and Analysis ............................................................................................. 15
3.1 Problem definition ................................................................................................. 15
3.2 Requirements Specification .................................................................................. 15
3.3 Planning and Scheduling ...................................................................................... 15
3.3.1 Gantt chart ...................................................................................................... 15
3.4 Requirements Specification .................................................................................. 16
3.4.1 Hardware Requirements: .............................................................................. 16
3.4.2 Software Requirements:................................................................................. 16
3.5 Conceptual models ................................................................................................ 17
3.5.1 Data Flow Diagram ........................................................................................ 17
3.5.2 System Flowchart ........................................................................................... 17
3.5.3 Class Diagram ................................................................................................. 18
System Design .................................................................................................................. 20
4.1 Basic Modules ........................................................................................................ 20
4.1.1 User Interface (UI) Framework: ................................................................... 20
4.1.2 Grammar Error Correction Module: ........................................................... 20
4.1.3 Text Summarization Module:........................................................................ 20
4.1.4 Integration and User Flow Management: .................................................... 20
4.1.5 Output Presentation Module: .................................................................... 20

4.2 System Architecture (Flowchart) ......................................................................... 21
4.3 Workflow of Deep Learning (GEC Task) ........................................................... 22
4.4 Dataset Preparation .............................................................................................. 23
4.5 Define Evaluator.................................................................................................... 23
4.6 Model Training ...................................................................................................... 24
4.7 Testing .................................................................................................................... 24
4.8 How to Do Text Summarization .......................................................................... 24
4.8.1 Text Cleaning .................................................................................................. 24
4.8.2 Sentence tokenization ..................................................................................... 25
4.8.3 Word tokenization .......................................................................................... 25
4.8.4 Word-frequency table .................................................................................... 25
4.8.5 Summarization ................................................................................................ 25
IMPLEMENTATION AND TESTING ........................................................................ 26
5.1 Implementation Approaches ................................................................................ 26
5.2 Code Details ........................................................................................................... 27
5.2.1 Algorithms (Dataset and Training of that dataset for GEC) ...................... 27
5.2.2 Algorithms (For Text Summarization) ......................................................... 33
5.2.3 [Link]............................................................................................................... 34
5.3 Code Efficiency ...................................................................................................... 37
5.4 Testing Approach .................................................................................................. 38
Results and Discussion.................................................................................................... 42
Conclusions ...................................................................................................................... 47
7.1 Conclusion .............................................................................................................. 47
7.2 Limitations and Future Scope of the Project ...................................................... 47
References ........................................................................................................................ 49

LIST OF FIGURES

Sr. No Figure Page no.


1 Gantt Chart 15
2 Data Flow Diagram for GEC task 16
3 Data Flow Diagram for Text Summarization 16
4 System Flowchart for GEC task 17
5 System Flowchart for Text Summarization 17
6 Class Diagram for GEC task 18
7 Class Diagram for Text Summarization 18
8 Output Presentation Module 19
9 System Architecture (Flowchart) 20
10 Workflow of Deep Learning (GEC Task) 21
11 Screenshot of Dataset 22
12 Evaluator: F1 22
13 T5 Model 23
14 Word tokenization 24

Chapter 1
Introduction
1.1 Background
In recent years, the rapid advancement of Natural Language Processing (NLP) and deep
learning techniques has revolutionized the way we interact with and analyze textual data. Two of
the areas in which these advancements have made significant strides are grammar error checking
and text summarization. Both tasks have immense practical implications across various domains,
including education, communication, content creation, and information retrieval.

Grammatical Error Correction (GEC) is the task of detecting and correcting grammatical
errors in text; Grammarly is an example of a commercial grammar-correction product. The GEC
task can be framed as a sequence-to-sequence task in which a Transformer model is trained to take
an ungrammatical sentence as input and return a grammatically correct sentence. Due to the
growing number of learners of English, GEC for English has received increasing attention over the
past decade. By training on diverse and extensive datasets, these models can capture complex
language patterns, syntactic structures, and even the idiosyncrasies of individual writers. Text
summarization, the process of distilling lengthy documents into concise and coherent summaries,
addresses the challenge of information overload by enabling efficient information consumption.
Manual summarization is time-consuming and often lacks objectivity, making automated
summarization techniques a valuable asset.

1.2 Objective
• Develop a grammar error checking system using deep learning from extensive datasets.
• Implement an extractive text summarization model also using NLP.
• Build a user-friendly interface for users to interact with the developed systems for grammar
error checking and text summarization.

1.3 Purpose, Scope and Applicability

1.3.1 Purpose

• To find a short subset of the most essential information from the entire set and present
it in a human-readable format.
• To develop a trained Transformer model that will take an ungrammatical sentence as
input and return a grammatically correct sentence.

1.3.2 Scope

• The scope of a Grammar Error Checking model is wide-ranging and can have a significant
impact on various users and industries, such as academic writing, content writing, and blogging.
• The scope of extractive text summarization is also substantial, with applications across many
industries and domains: summarizing news articles allows quick digestion of information and
easy access to essential details; researchers can use summarization to get an overview of
relevant papers and quickly identify the most pertinent information; and recommendation
systems can use summarization to generate brief descriptions or previews of content, helping
users decide what to engage with.

1.3.3 Applicability

The project's outcomes have wide-ranging applications, benefiting writers, educators,
professionals, and anyone striving for high-quality written content. The grammar error checker is
relevant in academia, business, creative writing, and language learning. The text summarization
model addresses the need for efficient information consumption amidst information overload.
Additionally, the project's exploration of NLP advances technology-driven language processing,
benefiting the many industries that rely on efficient content summarization.

Chapter 2
Survey of Technologies

2.1 List of Technology used


• Python: Python is a popular programming language widely used in the field of deep
learning. Its simplicity, readability, and extensive libraries make it an excellent choice for
developing complex neural networks and models. Python is also extensively used in
Natural Language Processing (NLP), a subfield of artificial intelligence that focuses on
enabling computers to understand, process, and generate human language.
• Flask: Flask is a web framework for building websites and web applications using the
Python programming language. It provides a set of tools and templates that make it easier
to create web pages, handle user interactions, and manage data. Think of Flask as a toolkit
that helps you assemble the different pieces needed to create a functional and interactive
website.
• NLTK: NLTK, short for Natural Language Toolkit, is a Python library for working with
human language. It is a toolbox full of tools that make it easier to understand and manipulate
words and sentences: breaking sentences into words, figuring out what each word means, and
even analysing the grammar. NLTK helps computers read, analyze, and generate language in a
smarter way.
• spaCy: spaCy is an industrial-strength NLP library for Python that helps computers
understand and work with human language. It makes tasks such as parsing sentences, finding
important words, and determining the role each word plays fast and straightforward. Just as
you can quickly read and understand a story, spaCy helps computers read and understand text,
making it very useful for building smarter language-based applications.

• T5 Model: T5, from Google, is a text-to-text model, meaning it can be trained to map input
text in one format to output text in another. It can be used with many different objectives, such
as summarization and text classification, and has even been used to build a trivia bot that
retrieves answers from memory without any provided context. T5 is suitable for a wide range
of tasks for a few reasons: 1. it can be used for any text-to-text task; 2. it achieves good
accuracy on downstream tasks after fine-tuning; 3. it is easy to train using Hugging Face.
• C4_200M Dataset: For the training of our Grammar Corrector, we use the C4_200M
dataset recently released by Google. This dataset consists of 200 million examples of
synthetically generated grammatical corruptions along with the corrected text.
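
Since the project serves both modules through Flask, a minimal sketch may help. The route names, form field, and placeholder logic below are illustrative assumptions, not the project's actual code.

```python
# Minimal Flask sketch (illustrative only): two hypothetical endpoints,
# /correct and /summarize, standing in for the real GEC and summarizer modules.
from flask import Flask, request

app = Flask(__name__)

@app.route("/correct", methods=["POST"])
def correct():
    text = request.form.get("text", "")
    # Placeholder: the real app would call the T5-based corrector here.
    return {"corrected": text}

@app.route("/summarize", methods=["POST"])
def summarize():
    text = request.form.get("text", "")
    # Placeholder: the real app would call the extractive summarizer here.
    return {"summary": text[:100]}

# Served locally during development with: app.run(debug=True)
```

Returning a dict from a view makes Flask serialize it to JSON automatically, which keeps the two modules' outputs cleanly separated.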

2.2 Python vs R
Python and R are both formidable contenders in the realms of deep learning and Natural
Language Processing (NLP). Python stands out as the go-to language for these tasks, owing to its
versatile libraries like TensorFlow, PyTorch, and Keras, which provide robust support for deep
learning models. Its syntax is intuitive, making it easier for developers to implement complex
neural networks. Additionally, Python boasts an extensive ecosystem of NLP libraries such as
NLTK and SpaCy, offering a wide range of tools for text processing. While R has some capabilities
in these domains, it's generally considered more suitable for statistical analysis and data
visualization rather than heavy-duty deep learning or NLP tasks. As such, for projects primarily
focused on these advanced techniques, Python remains the language of choice for its
comprehensive toolset and active community support.

2.3 Flask vs Django vs FastAPI


Flask is like a simple set of tools. It gives you the basics you need to start building a
website. You have more freedom to choose which tools you want to use for different tasks. It's like
building with individual Lego pieces. Django, on the other hand, is like a bigger toolbox with
more predefined tools. It comes with a lot of built-in features like user authentication, databases,
and more. It's like a set of Lego pieces that are already designed to work together in specific ways.
FastAPI is like a speed racer for building web applications with Python. It's known for its high
performance and automatic validation of data. It helps you create APIs quickly and efficiently,

making it great for projects where speed and real-time updates matter. FastAPI stands out for its
speed and automatic data validation.

2.4 Other popular Programming language for Deep learning


• Julia: It is like a supercharged engine for deep learning. It's designed to do complex
calculations quickly, which is a big advantage when you're training big neural networks
and need them to learn fast.
• MATLAB: It is like a teacher who can help you learn deep learning step by step. It
provides tools and functions specifically for neural networks and machine learning, making
it easier to experiment and develop models. It's like a guided path into the world of deep
learning.
• Java: Java is a versatile programming language that can run on the Java Virtual Machine
(JVM). It integrates well with existing Java libraries and frameworks, including those used
in machine learning and NLP like Apache OpenNLP.

2.5 Other Python Web Frameworks


• Tornado: It is a framework designed for handling asynchronous operations. It's
particularly useful for applications that need to handle many concurrent connections, such
as real-time applications that involve machine learning predictions.
• Bottle: It is a minimalistic framework that's easy to use and lightweight. It's suitable for
small projects and can be used to create simple web applications that incorporate machine
learning models.
• Streamlit: While not a traditional web framework, Streamlit is designed specifically for
creating data-driven web applications quickly. It's great for building interactive dashboards
and apps to showcase your deep learning models.
• Dash (Plotly): It is another tool for building interactive web applications with Python.
It's well-suited for creating data visualization and analytics dashboards that can incorporate
deep learning models.

Chapter 3
Requirement and Analysis

3.1 Problem definition


Correcting grammar errors is crucial for effective communication as it prevents confusion,
maintains professionalism, and aids language learners. Deep learning algorithms power tools that
automatically identify and rectify grammar mistakes. In my project, I aim to create a deep learning
model focused on this task. The goal is to enhance written content's quality and clarity, thereby
improving communication and comprehension. The project may involve training the model on a
dataset with varying degrees of grammar errors and evaluating its performance based on correction
accuracy and efficiency.
Using a text summarizer offers numerous advantages. Firstly, it saves time by allowing
you to swiftly grasp the main points of a lengthy document. This is especially valuable for dense
or time-sensitive materials. Secondly, it enhances comprehension by providing a clearer overview
of the document's structure and the connections between ideas. It utilizes extractive
summarization, selecting sentences from the original document based on factors like keyword
frequency, sentence position, and relationships between sentences. This process automatically
generates a concise summary capturing the document's key points.

3.2 Requirements Specification

• A sequence-to-sequence task where a Transformer model is trained to take an
ungrammatical sentence as input and return a grammatically correct sentence.
• To be able to find a short subset of the most essential information from the entire set and
present it in a human-readable format.

3.3 Planning and Scheduling


Planning and scheduling are processes that turn project action plans for scope, time, cost
and quality into an operating timetable. Planning is largely concerned with choosing the necessary
rules and procedures to fulfill the project’s objectives.

3.3.1 Gantt chart


A Gantt chart is a type of bar chart that illustrates a project schedule. It is a popular project
management tool that allows project managers to view the progress of a project at a glance. A
Gantt chart lists the tasks to be performed on the vertical axis, and time intervals on the horizontal axis.

The width of the horizontal bars in the graph shows the duration of each activity. As the project
progresses, the chart's bars are shaded to show which tasks have been completed.

3.4 Requirements Specification

3.4.1 Hardware Requirements:


• CPU: A quad-core or higher CPU is recommended.
• GPU: A dedicated GPU is recommended for faster performance.
• RAM: 8GB of RAM is the minimum requirement. 16GB or more is recommended for
better performance.
• Storage: A large amount of storage space is required for the datasets and models.

3.4.2 Software Requirements:


• A programming language that supports NLP and machine learning, such as Python.
• A text processing library, such as NLTK or spaCy.
• A machine learning library, such as TensorFlow.
• A large dataset of text with annotated grammar errors or human-written summaries.

3.5 Conceptual models
3.5.1 Data Flow Diagram
A data flow diagram (DFD) is a visual representation of how data moves through a process
or system. DFDs use standardized symbols and notations to describe a business's operations, and
they can be divided into logical and physical diagrams.

Data Flow Diagram for GEC task

Data Flow Diagram for Text Summarization

3.5.2 System Flowchart

A system flowchart is a diagram that shows how data flows through a system and how
decisions affect this process. It can help you recognize the flow of operations in the system.

System Flowchart for GEC task

System Flowchart for Text Summarization

3.5.3 Class Diagram

A class diagram is a diagram used in software design and modeling to describe classes and
their relationships. In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system by showing
the system's classes, their attributes, operations (or methods), and the relationships among objects.

Class Diagram for GEC task

Class Diagram for Text Summarization

Chapter 4
System Design

4.1 Basic Modules


4.1.1 User Interface (UI) Framework: This module forms the backbone of the application,
providing the graphical interface for user interaction. It handles user inputs, displays relevant
information, and facilitates seamless navigation between the Grammar Error Correction and Text
Summarization components.

4.1.2 Grammar Error Correction Module: This module is responsible for identifying
and rectifying grammatical errors in the provided text using NLP techniques and T5 model. It
processes the input text, identifies errors (such as spelling, grammar, punctuation), and provides
corrected suggestions or feedback to the user.

4.1.3 Text Summarization Module: This module focuses on generating concise and
extractive summaries of lengthy texts, utilizing NLP techniques. It processes the input text,
extracts key information, and generates a summarized version, enabling users to grasp the essence
of the content quickly.

4.1.4 Integration and User Flow Management: This module ensures that the Grammar
Error Correction and Text Summarization components operate on their own data without
interfering with each other. It orchestrates the flow of user interactions, manages data transfer
between modules, and ensures the cohesive operation of the entire system.

4.1.5 Output Presentation Module: This module governs the presentation of results to the
user. It formats and displays the corrected text (from Grammar Error Correction) or the
summarized content (from Text Summarization) in an easily comprehensible manner.

4.2 System Architecture (Flowchart)

4.3 Workflow of Deep Learning (GEC Task)

4.4 Dataset Preparation
For the training of our Grammar Corrector, we use the C4_200M dataset recently released
by Google. This dataset consists of 200 million examples of synthetically generated grammatical
corruptions along with the corrected text.
One of the biggest challenges in GEC is getting a good variety of data that simulates the
errors typically made in written language. If the corruptions are random, then they would not be
representative of the distribution of errors encountered in real use cases.
For this purpose, we extracted 550K sentences from C4_200M. The C4_200M dataset is
available on TF datasets. We extracted the sentences we needed and saved them as a CSV.
A screenshot of the C4_200M dataset is below. The input is the incorrect sentence, and the
output is the grammatically correct sentence. These random examples show that the dataset covers
inputs from different domains and a variety of writing styles.
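
The extract-and-save step can be sketched with Python's standard csv module. The two sentence pairs below are invented for illustration; the real pipeline extracts 550K rows from C4_200M via TF Datasets.

```python
import csv

# Invented example pairs (corrupted sentence, corrected sentence) standing in
# for rows extracted from the C4_200M dataset.
pairs = [
    ("She like playing in park.", "She likes playing in the park."),
    ("He go to school yesterday.", "He went to school yesterday."),
]

# Save in the two-column layout described above: input and output.
with open("gec_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "output"])
    writer.writerows(pairs)

# Read the rows back to confirm the layout.
with open("gec_sample.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

The same two-column CSV is what the training step later consumes, with the incorrect sentence as the model input and the corrected sentence as the label.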

4.5 Define Evaluator


The Evaluator is designed to compare the output generated by the T5 model against a
reference or ground truth, which consists of corrected sentences devoid of any grammatical
mistakes. Through a systematic evaluation process, the Evaluator provides valuable metrics such
as precision, recall, F1-score, and other relevant indicators, offering insights into the model's
proficiency. These metrics serve as essential benchmarks for fine-tuning and optimizing the T5
model, ultimately enhancing its ability to produce linguistically accurate and contextually
appropriate outputs. We will be using rouge scores as the metric.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to
evaluate the quality of machine-generated text, such as summaries or translations, by comparing
it to a set of reference (or gold standard) texts created by humans. These metrics are commonly
used in natural language processing tasks like text summarization and machine translation.
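
As a rough sketch of how such an overlap metric works, ROUGE-1 can be computed by counting overlapping unigrams. This simplified function is illustrative only; a real evaluation would use a maintained library (for example, Google's rouge-score package), which also handles stemming and the ROUGE-2/ROUGE-L variants.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram precision, recall and F1 between a reference and a candidate.

    Simplified sketch: no stemming, single reference, ROUGE-1 only.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("she likes playing in the park",
                "she likes playing in park")
```

Here every candidate word appears in the reference (precision 1.0), while the reference word "the" is missed, so recall is 5/6.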

4.6 Model Training


We will use the ever-versatile T5 model from Google for this training. T5 is a text-to-text
model, meaning it can be trained to map input text in one format to output text in another. I have
personally used this model for several different objectives, such as summarization and text
classification, and have used it to build a trivia bot that can retrieve answers from memory without
any provided context.

T5 is preferable for many tasks for a few reasons: 1. it can be used for any text-to-text task;
2. it achieves good accuracy on downstream tasks after fine-tuning. We set the incorrect sentence
as the input and the corrected text as the label. Both the inputs and the targets are tokenized using
the T5 tokenizer.

4.7 Testing
We will utilize the specified T5-based model, enabling users to input text and obtain
corrected versions as output. This function can be employed in various applications where
automated grammar correction is desired.

4.8 How to Do Text Summarization

4.8.1 Text Cleaning


This involves removing unnecessary elements from the text, such as stop words, HTML
tags, punctuation, and special characters. This makes the text easier for the NLP model to
process.
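A minimal sketch of such a cleaning step, using only Python's standard library (the stop-word list here is a small illustrative subset, not the full list an NLP library such as NLTK or spaCy would provide):

```python
import re

# Illustrative subset of English stop words, not an exhaustive list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # strip punctuation and special characters
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

clean_text("<p>The model, surprisingly, works!</p>")
# → 'model surprisingly works'
```

Note that for summarization, stop-word removal is applied when scoring words, while the sentences shown to the user keep their original wording.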

4.8.2 Sentence tokenization
This involves splitting the text into individual sentences. This can be done using a simple
heuristic, such as splitting the text at periods, exclamation points, and question marks. However,
more sophisticated methods can also be used, such as using a part-of-speech tagger to identify
sentence boundaries.
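The simple heuristic described above can be sketched with a single regular expression. Note that it misfires on abbreviations such as "Dr." or "e.g.", which is one reason libraries like NLTK and spaCy ship trained sentence segmenters:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive heuristic: split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

split_sentences("It works. Does it? Yes!")
# → ['It works.', 'Does it?', 'Yes!']
```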

4.8.3 Word tokenization


This involves splitting each sentence into individual words. This can be done using a simple
whitespace tokenizer, which splits the text on spaces. However, more sophisticated methods can
also be used, such as a regular expression tokenizer that handles cases like hyphenated words
and contractions.
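A sketch of a regular-expression tokenizer of the kind described, which keeps hyphenated words and contractions as single tokens rather than splitting them apart:

```python
import re

def tokenize_words(sentence: str) -> list[str]:
    # A word is a run of alphanumerics, optionally joined by "'" or "-"
    # (so "don't" and "state-of-the-art" each stay one token).
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)

tokenize_words("Don't split state-of-the-art terms.")
# → ["Don't", 'split', 'state-of-the-art', 'terms']
```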

4.8.4 Word-frequency table


This involves creating a table that shows the number of times each word appears in the
text. This table can be used to identify the most important words in the text, which can then be
used to generate a summary.
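A minimal sketch of building such a table with `collections.Counter`, with counts normalized so the most frequent word scores 1.0 (a common convention in frequency-based summarizers):

```python
from collections import Counter

def word_frequencies(words: list[str]) -> dict[str, float]:
    counts = Counter(word.lower() for word in words)
    top = max(counts.values())
    # Normalize so the most frequent word scores 1.0.
    return {word: count / top for word, count in counts.items()}

word_frequencies(["NLP", "models", "nlp", "rock"])
# → {'nlp': 1.0, 'models': 0.5, 'rock': 0.5}
```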

4.8.5 Summarization
This involves generating an extractive summary of the text based on the word-frequency
table and other factors, such as the sentence structure of the text.
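Putting the steps above together, a minimal frequency-based extractive summarizer might look like the following sketch (the stop-word list is a small illustrative subset; a real pipeline would use a library's full list):

```python
import re
from collections import Counter

# Illustrative subset of English stop words.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "that", "it"}

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower())
             if w not in STOP_WORDS]
    if not words or not sentences:
        return ""
    freq = Counter(words)
    top = max(freq.values())
    weights = {word: count / top for word, count in freq.items()}
    # Score each sentence by the summed weight of its (non-stop) words.
    scores = {s: sum(weights.get(w, 0.0)
                     for w in re.findall(r"[a-z0-9']+", s.lower()))
              for s in sentences}
    best = set(sorted(sentences, key=scores.get, reverse=True)[:n_sentences])
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in best)
```

One design note: selected sentences are re-emitted in document order rather than score order, since shuffling sentences usually hurts the readability of an extractive summary.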

CHAPTER 5
IMPLEMENTATION AND TESTING

5.1 Implementation Approaches


• Data Acquisition and Preprocessing: We prioritized a robust dataset that
accurately reflects real-world error distributions. The C4_200M dataset, with its 200 million
synthetically generated examples, offered the versatility and quality we sought. To ensure
efficient model training, we extracted 550,000 sentences and saved them in CSV format,
suitable for further processing. We implemented a multi-step preprocessing pipeline to
prepare text for summarization: text cleaning, the removal of irrelevant elements such as
punctuation, stop words, and special characters to refine the core content, and sentence
tokenization, the segmentation of text into individual sentences for analysis.
• Model Selection and Training: We harnessed the power of Google's T5 model, a
versatile text-to-text transformer architecture, to learn grammatical correction patterns
from our dataset. To effectively model sentence-level corrections, we employed T5's
tokenizer with a maximum length of 64 tokens, aligning with the typical sentence length
in our data. We leveraged the Seq2Seq Trainer class from Huggingface for model
instantiation and integrated Weights & Biases (wandb) for seamless logging and
monitoring of training progress. For Text Summarization, we employed a word
tokenization technique to segment text into individual words, followed by the construction
of a word-frequency table to capture key content aspects.

• Evaluation and Analysis: We used the ROUGE score as the metric for evaluating
the T5 model.

5.2 Code Details
5.2.1 Algorithms (Dataset and Training of that dataset for GEC)

5.2.2 Algorithms (For Text Summarization)

5.2.3 [Link]

5.3 Code Efficiency
Here are some broad pointers that can help increase the efficiency of a GEC and Text
Summarization system:
• Pre-process the data by ensuring that it is correctly cleaned, normalized, and
indexed before it is utilized by the GEC system.
• To accelerate model training, I used Google Colab, which provides hardware
acceleration through GPUs and TPUs.
• Use Weights & Biases to monitor the performance of the model as it trains.
• Use the C4_200M dataset by Google, which consists of 200M examples of synthetically
generated grammatical corruptions along with the correct text.
• The T5 model by Google stands out for its efficiency and performance due to its unified
text-to-text architecture, pre-training on extensive datasets, and transfer learning
approach. By treating all NLP tasks as text generation problems, T5 simplifies model
design and adaptation to various tasks, achieving state-of-the-art results across
benchmarks. Its flexibility in fine-tuning, availability in different sizes, and integration
into the Hugging Face Transformers library contribute to its widespread adoption in
natural language processing applications.
• Utilizing the spaCy library for tokenization provides efficient and fast tokenization of the
input text. SpaCy's tokenization is optimized and language-aware, enhancing overall
processing speed. 

5.4 Testing Approach
5.4.1 Categorize Grammar Errors:

grammar_module.py

Test Cases:
| Test Case Id | Description | Test Data | Expected Result | Actual Result | Status |
|---|---|---|---|---|---|
| TC-01 | Missing Comma | Alan came to my house and Jim joined him | Alan came to my house, and Jim joined him | Alan came to my house, and Jim joined him | Pass |
| TC-02 | Apostrophe Usage | It is my friends house in England | It is my friend's house in England | It is my friend's house in England | Pass |
| TC-03 | Mixing up similar words | The book has a good affect on my mood | The book has a good effect on my mood | The book has a good effect on my mood | Pass |
| TC-04 | Pronoun Disagreement | Every girl must bring their books to school | Every girl must bring her books to school | Every girl must bring her books to school | Pass |
| TC-05 | Comparison | She is more taller | She is taller | She is taller | Pass |
| TC-06 | Prepositions | I went to church at Sunday | I went to church on Sunday | I went to church on Sunday | Pass |
| TC-07 | Subject–Verb disagreement | People is coming to my party | People are coming to my party | People are coming to my party | Pass |
| TC-08 | Wrong Tense | I have been to New York last summer | I was in New York last summer | I was in New York last summer | Pass |
| TC-09 | Misusing Adverbs – Adjectives | I want to speak English good | I want to speak good English | I want to speak good English | Pass |
| TC-10 | Wrong use of words | I must to buy a new cartoon book | I must buy a new cartoon book | I must buy a new cartoon book | Pass |
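Test cases of this shape can be automated with a small harness. In the real project the `correct` function would call the fine-tuned T5 model; here it is a hypothetical stand-in, stubbed with a lookup table purely to illustrate the harness structure:

```python
# (case id, test data, expected result) triples, as in the test-case table.
CASES = [
    ("TC-06", "I went to church at Sunday", "I went to church on Sunday"),
    ("TC-07", "People is coming to my party", "People are coming to my party"),
]

# Stub standing in for model inference; NOT the real T5 call.
STUB_MODEL = {test_data: expected for _, test_data, expected in CASES}

def correct(sentence: str) -> str:
    return STUB_MODEL.get(sentence, sentence)  # placeholder for model inference

def run_cases():
    results = []
    for case_id, test_data, expected in CASES:
        status = "Pass" if correct(test_data) == expected else "Fail"
        results.append((case_id, status))
    return results

run_cases()
# → [('TC-06', 'Pass'), ('TC-07', 'Pass')]
```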

5.4.2 Testing for Text Summarization

text="""

In 2018, twenty-three days after Thanos erased half of all life in the universe, [a] Carol
Danvers rescues Tony Stark and Nebula from deep space and they reunite with the
remaining Avengers-Bruce Banner, Steve Rogers, Thor, Natasha Romanoff, and James
Rhodes and Rocket on Earth. Locating Thanos on an uninhabited planet, they plan to
use the Infinity Stones to reverse his actions, only to find that Thanos has already
destroyed them to prevent any further use. Enraged, Thor decapitates Thanos. Five
years later, Scott Lang escapes from the Quantum Realm. [b] Reaching the Avengers
Compound, he explains that he experienced only five hours while trapped. Theorizing
that the Quantum Realm allows time travel, they ask a reluctant Stark to help them
retrieve the Stones from the past to reverse the actions of Thanos in the present. Stark,
Rocket, and Banner, who has since merged his intelligence with the Hulk's strength,
build a time machine. Banner notes that altering the past does not affect their present;
any changes create alternate realities.
"""
print(summary)

Theorizing that the Quantum Realm allows time travel, they ask a reluctant Stark to help
them retrieve the Stones from the past to reverse the actions of Thanos in the present.
In 2018, twenty-three days after thanos erased half of all life in the universe, [a] Carol
Danver's rescues Tony Stark and Nebula from deep space and they reunite with the
remaining

[Link]

[Link]

Chapter 6

Results and Discussion


[Link]

[Link]

summarization_home.html

[Link]

/summarization_result

url_summarizer.html (News Url)

/url_summarizer_result

url_summarizer.html (Wikipedia Url)

/url_summarizer_result

file_summarizer.html (File based)

/file_summarizer_result

Chapter 7

Conclusions
7.1 Conclusion
Building this project had its challenges, such as learning the parts of Python used for data
science. We considered the R programming language but decided on Google's T5 model, a
widely used tool for training text-to-text systems. Our model found mistakes in small text
samples but had some issues, showing it needs improvement. It is difficult to catch every error
in English, as the language changes constantly. Still, we believe a more accurate model is
achievable with different approaches.
With all the current research in Natural Language Processing, a revolutionary new approach to
text NLP may emerge that does not revolve around Deep Learning. If it does, the algorithm used
to train on the dataset should reflect it. A more practical approach, though, is to gather a greater
amount of data than was used in this model and run many more iterations of training. The key is
to find data that is known to be grammatically correct, which can be difficult. Many published
bodies of text are a good resource for this type of data, but it is important to stay away from
unreliable data such as blogs, Twitter feeds, or anything without verification of correctness.

7.2 Limitations and Future Scope of the Project


The implementation of this model was shown to be effective in catching several mistakes in the
small text samples. However, there are many it did not do well on, which implies there is clear
room for improvement. Since English is a complex language with several dimensions depending
on context, there may never be a computer system that can catch 100 percent of the errors all
the time. Humans themselves struggle to keep up with all the new lingo and the different ways
words gain or lose meanings every day. However, a more accurate model than the one generated
is possible, and several approaches can be taken.
In the realm of ongoing Natural Language Processing research, there is potential for an
innovative approach to text NLP that does not center on Deep Learning. Should this materialize,
the training algorithm should align with it. However, a more practical strategy involves sourcing
a larger dataset than the one employed in this model and undergoing numerous additional
training iterations. The crux lies in acquiring data known for its grammatical correctness, a task
that proves challenging. While numerous published texts serve as valuable resources, caution is
essential to steer clear of unreliable sources such as blogs, Twitter feeds, or any unverified
content.
To make our dataset better, we'll ask users for feedback through the app. They can correct
examples and add them to the dataset. Figuring out which user corrections are wrong is a
challenge, but having more data from users is a good tradeoff. As more people use the app
worldwide, we'll get more data. This helps machine learning tools grow and create better
systems.
There are promising avenues for enhancing the model's performance. Exploring
alternative attention mechanisms presents an opportunity to refine the score further, while
augmenting the training process with a more diverse dataset can contribute significantly to
overall improvement. By combining these approaches, the model stands to benefit from a more
robust and effective foundation for future applications.

