
Infiniti Script

Submitted for partial fulfillment of the requirements


for the award of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING -
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING
by

DADI DHARANI MAHESH - 20BQ1A4213


YADDALA HRUDAYESH - 20BQ1A4264
PANITAPU JASHWANTH - 20BQ1A4242
KAKARLA MOKSHITHA - 20BQ1A4225

Under the guidance of

DR. V. MURALIDHAR, ASSOCIATE PROFESSOR

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING -


ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
VASIREDDY VENKATADRI INSTITUTE OF TECHNOLOGY
Permanently Affiliated to JNTU Kakinada, Approved by AICTE
Accredited by NAAC with ‘A’ Grade, ISO 9001:2008 Certified
NAMBUR (V), PEDAKAKANI (M), GUNTUR – 522 508
Tel no: 0863-2118036, url: www.vvitguntur.com
April 2024
CERTIFICATE

This is to certify that the project entitled “Infiniti Script” is the bonafide work of Mr. D. Dharani Mahesh, Mr. Y. Hrudayesh, Mr. P. Jashwanth, and Ms. K. Mokshitha, bearing Reg. Nos. 20BQ1A4213, 20BQ1A4264, 20BQ1A4242, and 20BQ1A4225 respectively, who carried out the project under our supervision.

Project Guide Head of the Department


(Dr. V. Muralidhar, Associate Professor) (Dr. K. Suresh Babu, Professor)

Submitted for Viva-voce Examination held on

Internal Examiner External Examiner


DECLARATION

We, Mr. D. Dharani Mahesh, Mr. Y. Hrudayesh, Mr. P. Jashwanth, Ms. K. Mokshitha, hereby

declare that the Project Report entitled “Infiniti Script”, done by us under the guidance of

Dr. V. Muralidhar, Associate Professor, Computer Science Engineering - Artificial Intelligence

& Machine Learning at Vasireddy Venkatadri Institute of Technology is submitted for partial

fulfillment of the requirements for the award of Bachelor of Technology in Computer Science

Engineering - Artificial Intelligence & Machine Learning. The results embodied in this report

have not been submitted to any other University for the award of any degree.

DATE :

PLACE :

SIGNATURE OF THE CANDIDATES

Dadi Dharani Mahesh [20BQ1A4213]


Yaddala Hrudayesh [20BQ1A4264]
Panitapu Jashwanth [20BQ1A4242]
Kakarla Mokshitha [20BQ1A4225]
ACKNOWLEDGEMENT

We take this opportunity to express our deepest gratitude and appreciation to all those
people who made this project work easier with their words of encouragement, motivation, discipline,
and faith, by offering different places to look to expand our ideas, and who helped us towards the
successful completion of this project work.

First and foremost, we express our deep gratitude to Shri Vasireddy Vidya Sagar,
Chairman, Vasireddy Venkatadri Institute of Technology for providing necessary facilities
throughout the B.Tech program.

We express our sincere thanks to Dr. Y. Mallikarjuna Reddy, Principal, Vasireddy


Venkatadri Institute of Technology for his constant support and cooperation throughout the
B.Tech program.

We express our sincere gratitude to Dr. K. Suresh Babu, Professor & HOD, Computer
Science Engineering - Artificial Intelligence & Machine Learning, Vasireddy Venkatadri Institute
of Technology for his constant encouragement, motivation, and faith by offering different places
to look to expand our ideas.

We would like to express our sincere gratitude to our Guide Dr. V. Muralidhar,


Associate Professor, Computer Science Engineering - Artificial Intelligence & Machine Learning
department for his insightful advice, motivating suggestions, invaluable guidance, help and
support in successful completion of this project.

We would like to express our sincere gratitude and heartfelt thanks to our Project
Coordinator Mr. N. Balayesu, Assistant Professor, Computer Science Engineering - Artificial
Intelligence & Machine Learning department, for his valuable advice, motivating suggestions,
moral support, help, and coordination in the successful completion of this project.

We would like to take this opportunity to express our thanks to the Teaching and Non-
Teaching Staff in the Department of Computer Science Engineering - Artificial Intelligence &
Machine Learning, VVIT for their invaluable help and support.

Dadi Dharani Mahesh [20BQ1A4213]


Yaddala Hrudayesh [20BQ1A4264]
Panitapu Jashwanth [20BQ1A4242]
Kakarla Mokshitha [20BQ1A4225]
TABLE OF CONTENTS

CH No Title Page No

Contents i

List of Figures iv

Nomenclature v

Abstract vi

1 INTRODUCTION 1

1.1 Background and Need for Enhanced Handwritten Text Recognition 1

1.2 Objectives of the Research 1

1.3 Challenges in Current OCR Technologies 2


1.4 Review of Neural Network Architectures in OCR 2

1.5 Contribution of MXNet and Advanced Neural Architectures 4

1.6 Potential Applications and Impact of OCR 5

1.7 Background and Need for Language translation 5

1.8 Potential Applications and Impact of Language translation 6

1.9 Background and Need for Text Summarization 6

1.10 Potential Applications and Impact of Text Summarization 7

1.11 Background and Need for Text-To-Audio 7

1.12 Potential Applications and Impact of Text-To-Audio 7

1.13 Background and Need for Audio-To-Text 8

1.14 Potential Applications and Impact of Audio-To-Text 8

1.15 Background and Need for Text to Handwritten Text 8

1.16 Potential Applications and Impact of Handwritten Text 9

1.17 Background and Need for paraphrasing 9

1.18 Potential Applications and Impact of paraphrasing 9

1.19 Background and Need for Mathematical expression solving 10

1.20 Potential Applications and Impact of Mathematical expression solving 10

1.21 Background and Need for File Export 10

1.22 Potential Applications and Impact of File Export 10


i
2 REVIEW OF LITERATURE 11

3 METHODOLOGY 14

3.1 Overview of the Proposed Solution 14

3.2 Description of Technologies Used (CNN, BiLSTM, CTC, SSD, MSER) 14

3.3 System Architecture and Workflow 18

3.4 Data Collection and Privacy Measures 20

3.5 Integration with Existing OCR Technologies 21

3.6 Dataset Description and Preparation 22

3.7 Overview of Modules used for Language translation (Translator) 23

3.8 Overview of Modules used for Text Summarization (gpt-3.5-turbo-instruct) 23

3.9 Overview of Modules used for Text-to-Audio (gTTS, pdfplumber) 24

3.10 Overview of Modules used for Audio-to-Text (AssemblyAI) 24

3.11 Overview of Modules used for Paraphraser (Pegasus Tokenizer model) 25

3.12 Overview of Modules used for Mathematical expression solving (Gemini) 26

3.13 Overview of File Exporting (PSPDFKit) 26

4 IMPLEMENTATION 27
4.1 System Architecture Design 27

4.2 Data Preparation and Preprocessing 28

4.3 Neural Network Training 28

4.4 Feature Extraction and Character Recognition 28

4.5 Development Tools and Environment 29

4.6 Security and Privacy Measures 30

4.7 User Interface and Experience 31

4.8 Monitoring and Updates 31

4.9 Features Integration 32

4.10 UML Diagrams 35

4.11 Coding 41

ii
5 RESULTS 49

6 CONCLUSION AND FUTURE SCOPE 55

7 REFERENCES 56

APPENDIX 59

Paper Publishing Certificates

Published Article in the Journal

iii
LIST OF FIGURES

Figure No Figure Name Page No

1.1 OCR System Architecture 2


1.2 OCR System Workflow 4
1.3 Language Translation System Architecture 6
1.4 Text Summarization System Architecture 6
1.5 Audio to Text System Architecture 8
1.6 Paraphraser System Architecture 9
3.1 BiLSTM OCR Pipeline 16
3.2 Line SSD: Non Maximum Suppression 17
3.3 Word SSD 17
3.4 MSER Text Segmentation Process 18
3.5 Handwriting Sample Frequency Distribution 22
4.1 OCR Data Augmentation and Evaluation Cycle 27
4.2 Use Case Diagram 36
4.3 Class Diagram 37
4.4 Complete Functionalities of Class Diagram 38
4.5 Object Diagram 39
4.6 Sequence Diagram 40
4.7 Quantitative Results 47
5.1 CER Over Training Epochs 49
5.2 Outcome of Optical Character Recognition 50
5.3 Outcome of Language Translation 50
5.4 Outcome of Text Summarization 51
5.5 Outcome of Text to Audio 51
5.6 Outcome of Audio to Text 52
5.7 Outcome of Paraphrasing 53
5.8 Outcome of Mathematical expression solving 54
5.9 Complete Features of Entire Application 54

iv
NOMENCLATURE
OCR Optical Character Recognition
NLP Natural Language Processing
PEGASUS Pre-training with Extracted Gap-sentences for Abstractive Summarization
CNN Convolutional Neural Network
UML Unified Modelling Language
API Application Programming Interface
gTTS Google Text-to-Speech
GPT Generative Pre-trained Transformer
ASR Automatic Speech Recognition
BiLSTM Bidirectional Long Short-Term Memory
MSER Maximally Stable Extremal Regions
LSTM Long Short-Term Memory
SSD Single Shot Multi-box Detector

v
ABSTRACT

In response to the evolving landscape of information consumption and accessibility, this project
embarks on the development of an innovative and versatile text analysis and conversion
application leveraging Natural Language Processing (NLP) technology to cater to a diverse user
base. The application will streamline text-related tasks for users from various domains, including
students, professionals, journalists, freelancers, HR recruiters, travellers, employees, people with
dyslexia, language learners, and public speakers. Through a comprehensive set of features, the
application will enable accurate Optical Character Recognition (OCR) and Automatic Speech
Recognition (ASR), converting handwritten documents, images, and spoken content into editable
text. Furthermore, it will facilitate mathematical expression recognition and solving, enhancing
the efficiency of users' mathematical tasks. The application will offer language translation
capabilities, fostering effective cross-cultural communication and understanding. Additionally, it
will include text summarization functionalities, generating concise and informative summaries
from lengthy texts. With a focus on user-friendliness, the application will boast a well-designed
and intuitive user interface, catering to users of all backgrounds and technical expertise.
Performance optimization will ensure swift processing of text, images, and audio files, even
during peak user loads, ensuring a seamless and responsive user experience. To ensure data
security and privacy, robust encryption and authentication mechanisms will be implemented,
instilling trust and confidence among users. The application will adhere to accessibility standards
(WCAG), promoting inclusivity for users with disabilities. Reliability will be achieved through
advanced error detection and correction algorithms, enhancing the accuracy and dependability of
text recognition and conversions. The application's scalable architecture will accommodate
increasing user demands, ensuring consistent performance and uninterrupted user experience as
the user base expands.

Keywords: Natural Language Processing (NLP), Optical Character Recognition (OCR),


Automatic Speech Recognition (ASR), Handwritten document conversion, Image and spoken
content conversion, Mathematical expression recognition, Language translation, Text
summarization, User-friendly interface, Performance optimization, Data security and privacy,
Encryption and authentication mechanisms, Accessibility standards (WCAG), Error detection and
correction algorithms.

vi
CHAPTER 1
INTRODUCTION

1.1 Background and Need for Enhanced Handwritten Text Recognition Techniques
In today's world, where digital archiving and processing have become increasingly important, the
conversion of handwritten documents into digital formats remains a significant challenge.
Although OCR technology is improving, it still struggles with the complexities of human
handwriting. Traditional OCR systems often underperform when it comes to recognizing the
nuances and intricacies of handwritten texts. To address this issue, our research is exploring the
use of MXNet and a range of state-of-the-art neural models, such as Convolutional Neural
Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), Connectionist Temporal
Classification (CTC), and Single Shot Multi-box Detector (SSD) with Maximally Stable Extremal
Regions (MSER). Our goal is to significantly improve the accuracy of digitizing handwritten text.

1.2 Research Objectives:


The primary mission of the present research is to develop an Optical Character Recognition
system that can effectively address the formidable challenge of recognizing various forms of
handwritten text. By combining an array of sophisticated neural technologies, the research aspires
to yield a final solution with far superior accuracy in this type of recognition. The combination of
advanced neural networks incorporated into the proposed solution is expected to revolutionize the
field of handwritten document processing, resulting in a new standard of accuracy and dependability.

This research aims to deliver an OCR system that is unparalleled in its ability to accurately
interpret a wide range of handwriting styles. The goal is not merely to improve upon current OCR
technology but to redefine the limits of the technologies employed to convert documents. By
utilizing more advanced neural architectures than before, the present research intends to
dramatically improve the efficiency of digitizing handwritten documents. The principal product of
this research is expected to be an OCR mechanism that sets a new level of achievement in
recognition accuracy. Through the careful synergy of recent advances in neural models, the
research endeavor aims to address and overcome the subtle obstacles that have long impeded
progress in the recognition of handwritten text. Such an outcome would not only improve the
digitization process but would also improve the availability and usability of handwritten text in
digital archives, potentially having a significant impact on public understanding.

~1~
Figure 1.1 : OCR System Architecture
1.3 Challenges in Current OCR Technologies:
Due to the diverse range of characteristics present in handwritten documents, the current
OCR technologies face certain limitations in accurately deciphering them. One of the most
evident limitations is the inability to interpret cursive handwriting precisely. This is because
cursive writing involves letters that are connected in varying levels of granularity, making it
challenging for OCR algorithms to demarcate the boundaries between adjacent letters. Moreover,
the variations in character size and spacing also complicate the process as current OCRs lack the
flexibility, unlike humans, to perceive and interpret these dimensions. Apart from cursive
handwriting, texts that do not conform to standard form requirements are also often
inaccurately digitized, leading to further recognition gaps.

The diversity and complexity found in handwritten text emphasize the need for progress
in OCR technologies. This study is committed to developing solutions that address these gaps.
By using the most advanced neural network architectures and supporting them with sophisticated
data processing techniques, this research aims to bring unprecedented levels of accuracy and
adaptability to this problem. This development will enable OCR to handle varying cursive and
non-cursive texts, as well as accurately interpret unconventional texts. The transition to advanced
neural models is a significant step towards realizing OCR as a technology capable of accurately
reading human handwriting.

1.4 Review of Neural Network Architectures in OCR :


Neural network architectures play a crucial role in the development of OCR technology.
Existing models address the traditional difficulties of handwritten text recognition.

~2~
1. Convolutional Neural Networks for Feature Extraction: Convolutional Neural Networks
are the first to undertake feature extraction in the OCR process, particularly in recognizing visual
patterns in handwritten documents. In this research we exploit the CNN's capability to break image
text down into sub-parts such as lines, curves, and edges, which are necessary to discern between
different characters and symbols. CNNs achieve this abstraction through multiple layers of
processing that scour raw pixel data to identify patterns recurring across numerous letters, thereby
creating a basis for proper character recognition.

2. Bidirectional Long Short-Term Memory Networks and Sequence Modeling: To capture
the sequential nature of text, an aspect typically mishandled by traditional OCR systems, we
employ Bidirectional Long Short-Term Memory networks. The crucial advantage of this network
configuration is that it performs its computations in both the forward and backward directions,
so the system processes information with the context on both sides of a character in view, which
is essential for understanding the flow and structure of a particular piece of handwriting. Since
the meaning and appearance of many symbols are contextually dependent, this network preserves
the sequential regularity of the recognized text and allows the system to handle the variability of
handwriting.

3. Enhancing Decoding with Connectionist Temporal Classification: CTC is a critical
technology used to decode sequences of neural network outputs into the final text. In conventional
OCR pipelines, segmentation is a necessary step for translating variable-length input into output
sequences, yet handwritten characters are often neither cleanly segmented nor uniformly arranged.
By pairing CTC with the BiLSTM networks, the system can produce correct text output directly
from the sequence of features extracted by the CNNs, closing the gap between the raw input image
and the output text.

4. SSD and MSER for Advanced Text Detection: Single Shot Multi-box Detector (SSD) and
Maximally Stable Extremal Regions (MSER) technologies are incorporated into the OCR system
to promote advanced text detection and segmentation. SSD enhances the accuracy of detecting
text lines and word boundaries, which is essential for the processing of fragments of handwritten
documents. At the same time, MSER ensures the robustness of text segmentation by working with
various handwritings and various text colors, which also promotes the accuracy of their
separation.

5. Conclusion and Integration: We propose a solution to modern OCR challenges, particularly


handwritten text, by combining CNNs, BiLSTM networks, CTC, SSD, and MSER in a single
framework. This integration enhances the final text detection and segmentation processes,
providing stability and precision in OCR. Our study highlights the transformational potential of
innovative neural architectures in OCR and similar technologies, shifting the understanding and
processing of handwritten materials significantly.

~3~
1.5 Contribution of MXNet and Advanced Neural Architectures:

The integration of OCR technology with MXNet-powered advanced neural networks signifies a critical
breakthrough in recognizing complex handwritten texts. This section briefly outlines the
integration and operational functionality of Convolutional Neural Networks (CNNs), Long Short-
Term Memory networks (LSTMs), Connectionist Temporal Classification (CTC), Single Shot
Multi-box Detector (SSD), and Maximally Stable Extremal Regions (MSER) to achieve a high
level of accuracy during OCR applications.

CNNs Feature Extraction Performance: CNNs within MXNet are critical for generating
distinct visual features from handwritten text images, effectively extracting elements like strokes
and shapes for character differentiation. This precise feature recognition sets the stage for accurate
text interpretation.

LSTMs Sequence Recognition Performance: LSTMs are strategically incorporated to identify


sequence patterns and relationships between characters, essential for maintaining the contextual
flow of text. MXNet's LSTM support enhances sequence processing, ensuring a coherent textual
output.

SSD and MSER Text Localization Features Activation: SSD and MSER are crucial for
detecting text regions within images, with MSER adept at identifying stable regions amidst
environmental changes. Their integration marks a significant improvement in image preparation
for detailed analysis.

Figure 1.2 : OCR System Workflow

~4~
1.6 Potential Applications and Impact of OCR :

Expanding the Limits of Digital Archiving: The new OCR system, utilizing MXNet, CNN-
BiLSTM-CTC, and SSD-MSER, marks significant progress in digital archiving. With its
enhanced accuracy in recognizing handwritten texts, it could revolutionize how historical
documents are preserved and accessed. Libraries, archives, and museums worldwide could
digitize millions of handwritten pages, making them searchable and accessible globally.

Revolutionizing Data Entry and Management: This OCR system could transform data entry
and management by significantly reducing the resources spent on manual transcription. Sectors
like legal, healthcare, and government, which rely on historical or handwritten data, stand to gain.
The capability to quickly digitize handwritten notes into text can boost efficiency, accuracy,
and operational economy.

Empowering Assistive Technologies: The advancements in OCR technology could also benefit
assistive technologies. People with disabilities that hinder traditional computer use could gain
independence in handling textual content. Moreover, the technology could support software and
hardware for visually impaired individuals, enhancing their access to printed materials and
personal records.

1.7 Background and Need for Language translation :


The need for language translation has become increasingly vital in our globalized world, where
interactions among people from different linguistic backgrounds have become common. This
necessity spans various domains, including business, education, technology, and interpersonal
communications. The background of this need lies in the rich diversity of languages spoken
worldwide, each carrying unique cultures, ideas, and information. As the world becomes more
interconnected through the internet, travel, and international trade, the ability to understand and
communicate across language barriers plays a crucial role in fostering mutual understanding,
accessing knowledge, and facilitating collaboration. Language translation, whether through human
expertise or technological aids like machine translation, serves as a bridge that connects different
linguistic communities, enabling the sharing of ideas, cultures, and innovations on a global scale.

~5~
Figure 1.3 : Language Translation System Architecture

1.8 Potential Applications and Impact of Language Translation:


Education: Translation in education opens doors to global knowledge, allowing students and
researchers access to a vast array of learning materials and scholarly work beyond their native
language.
Tourism and Travel: For the tourism and travel industry, translation is key to making destinations
accessible and enjoyable for international visitors. From translated travel guides and websites to
multilingual support services, it enhances the travel experience, enabling travelers to fully engage
with new cultures and environments.

Entertainment Industry: Language translation in the entertainment industry bridges cultural gaps,
allowing films, TV shows, books, and music to reach a global audience. This not only expands the
market for entertainment products but also enriches cultural exchange, fostering a greater
understanding and appreciation of different cultures and storytelling traditions.

1.9 Background and Need for Text Summarization :

Figure 1.4 : Text Summarization System Architecture

~6~
The exponential growth of digital content has made information more accessible than ever, but it has
also led to information overload, making it challenging for individuals to sift through vast amounts
of data to find what they need. Text summarization emerges as a critical solution to this challenge,
distilling lengthy documents, articles, reports, and conversations into concise summaries that capture
the essence of the original text.
1.10 Potential Applications and Impact of Text Summarization :
Information Retrieval and Filtering: Text summarization aids in quickly extracting relevant
information from large documents or datasets, enabling users to efficiently navigate and filter through
vast amounts of textual content. This is particularly useful in search engines, content aggregation
platforms, and information retrieval systems, where summarized content helps users find relevant
information more effectively.

Document Summarization and Compression: In fields such as academia, research, and legal,
where documents can be lengthy and dense, text summarization techniques enable the creation of
concise summaries that capture the essential points and key findings of documents. This facilitates
faster comprehension, review, and decision-making processes, saving time and resources.

1.11 Background and Need for Text-To-Audio :


The need for flexible and easily accessible information consumption methods in today's fast-paced
world has made text-to-audio technology increasingly significant. This creative method serves a
broad spectrum of users, such as working professionals who listen to news or reports on their
commute, people with visual impairments who want easier access to written material, and students
who get more benefit from hearing information presented to them. With text-to-audio technology,
written materials like books, articles, emails, and documents can be read aloud, making it possible to
multitask and offering an alternative method of accessing information without having to read it
directly. This promotes inclusivity and equitable access to education by removing barriers for people
with reading difficulties or visual impairments and improving learning and information absorption
for auditory learners.

1.12 Potential Applications and Impact of Text-To-Audio :


Accessibility for Visually Impaired Individuals: Text-to-audio technology provides invaluable
accessibility for individuals with visual impairments, allowing them to access and consume written
content in audio format. This enhances their ability to engage with digital content, including books,
articles, websites, and educational materials, promoting inclusivity and equal access to information.
Enhanced User Experience in Digital Products: Integrating text-to-audio features into digital
products, such as e-books, news websites, and educational platforms, enhances the user experience
by providing alternative modes of content consumption. Users can choose between reading and
listening to content based on their preferences and situational needs, increasing engagement and
satisfaction.

~7~
1.13 Background and Need for Audio-To-Text :
The development of audio-to-text technology, which meets the increasing demand for accurate and
efficient speech-to-text conversion, is a major breakthrough in the field of natural language
processing. Although there is a wealth of audio content available in the digital age, including
podcasts, lectures, interviews, and conference calls, accessing and processing this content in written
form can be difficult and time-consuming. This problem is solved by audio-to-text technology, which
converts audio files into written text automatically.

Figure 1.5 : Audio to Text System Architecture

1.14 Potential Applications and Impact of Audio-To-Text :


Accessibility for Hearing Impaired Individuals: Audio-to-text technology provides invaluable
accessibility for individuals with hearing impairments by converting spoken content into written text.
This enhances their ability to access and comprehend spoken information, including conversations,
lectures, and presentations, promoting inclusivity and equal access to communication.

Transcription and Documentation: Audio-to-text technology facilitates the transcription and


documentation of spoken content, including meetings, interviews, lectures, and phone conversations.
This enables users to create accurate written records of spoken interactions, enhancing
documentation, record-keeping, and information retrieval.

1.15 Background and Need for Text to Handwritten Text :


The shift from digital to handwritten text reflects a yearning for the tangible and intimate in the
digital era of screens and keyboards. With the help of text to handwritten text technology, users
can convert typed text into aesthetically pleasing handwritten notes, letters, or documents, creating a
unique link between the digital and analog worlds. This technology can be used for many different
things, such as making personalized greeting cards, digital content personalization, and educational
resources that imitate handwriting. Text to handwritten text technology engages multiple sensory
modalities and promotes active learning in educational settings, helping to foster better retention and
understanding of content.

~8~
1.16 Potential Applications and Impact of Handwritten Text :
Personalization and Customization: Handwritten text technology enables the creation of
personalized and customized content, such as handwritten notes, letters, invitations, and greeting
cards. This fosters a sense of intimacy, authenticity, and emotional connection in communication,
enhancing relationships and engagement.
Artistic Expression and Creativity: Handwritten text technology provides a digital canvas for
artistic expression and creativity, allowing artists, designers, and illustrators to create digital artworks
using handwritten elements. This fosters artistic experimentation, innovation, and self-expression,
expanding the possibilities for digital art and design.

1.17 Background and Need for Paraphrasing :


Based on the fundamentals of good communication, paraphrasing is an essential tool for
communicating ideas clearly, precisely, and engagingly. Its importance stems from its capacity to
reiterate a text's or speech's meaning while utilizing new terminology. This is an essential practice
for multiple reasons. First of all, by simplifying difficult concepts and making them more
understandable to a wider audience, paraphrasing improves clarity and comprehension. It is also very
important in professional and academic settings because it helps writers avoid plagiarism and
properly credit their sources, ensuring ethical writing practices.

Figure 1.6 : Paraphraser System Architecture

1.18 Potential Applications and Impact of Paraphrasing :


Plagiarism Detection and Prevention: Paraphrasing technology plays a crucial role in plagiarism
detection and prevention by identifying similarities between texts and detecting instances of
plagiarism or content reuse. This ensures academic integrity, ethical writing practices, and copyright
compliance in academia, publishing, and online content creation.
Content Creation and Generation: Paraphrasing technology aids in content creation by generating
alternative versions of existing text, enabling writers, bloggers, and content creators to produce
unique and engaging content efficiently. This promotes creativity, diversity, and originality in content
creation, enhancing audience engagement and reach.

~9~
1.19 Background and Need for Mathematical expression solving :
Since mathematics is used as the language of quantification and analysis in many different fields, it
is necessary to solve mathematical expressions. Complex relationships, equations, and problems that
support scientific, engineering, financial, and technological endeavours are encapsulated in
mathematical expressions. These expressions are essential for solving problems because they allow
scientists, engineers, and analysts to work on practical issues like process optimization, outcome
prediction, and physical phenomenon modelling.

1.20 Potential Applications and Impact of Mathematical expression solving :


Data Analysis and Modeling: In data science and statistical analysis, mathematical expression
solving technology aids in data modeling, regression analysis, and predictive modeling by fitting
mathematical expressions to observed data. This enables data-driven decision-making, pattern
recognition, and insights generation in fields such as finance, healthcare, and marketing.

Education and Learning: Mathematical expression solving technology supports education and
learning by providing tools and resources for interactive learning, problem-solving, and exploration
of mathematical concepts. This includes software tools, online platforms, and educational games that
help students visualize, understand, and solve mathematical problems, enhancing learning outcomes
and engagement in mathematics education.

1.21 Background and Need for File Export :


The application provides users with the option to select the desired export format from a list of
supported formats, including PDF, DOC, DOCX, and TXT. The available formats are presented in a
user-friendly interface, allowing users to easily choose the format that best suits their needs. A
flexible exporting mechanism, capable of exporting content to multiple file formats, is implemented
by utilizing appropriate libraries or APIs for each target format, such as jsPDF for PDF, the
Microsoft Office API for DOC and DOCX, and plain text conversion for TXT.

1.22 Potential Applications and Impact of File Export :


Education and E-learning Platforms : Students and educators can export lecture notes, study
guides, and educational materials to different formats such as PDF, DOCX, or TXT. E-learning
platforms, online course portals, and educational websites can utilize this feature to facilitate content
sharing and collaboration among students and instructors.

Business and Office Productivity Tools : Employees can export reports, presentations, and
documents to various formats based on the requirements of stakeholders and clients. Business
management software, project management tools, and office productivity suites can integrate this
feature to streamline document management and communication within organizations.

Content Management Systems (CMS) : Website administrators and content creators can export
articles, blog posts, and web pages in different formats for publishing and distribution. CMS
platforms, blogging platforms, and online publishing tools can leverage this feature to enhance
content creation workflows and improve the accessibility of published content.

~ 10 ~
CHAPTER 2
REVIEW OF LITERATURE

Advanced Neural Networks for OCR Enhancement: Doe and Smith aimed to improve Optical
Character Recognition (OCR) through advanced neural networks. The researchers used a
combination of Convolutional Neural Networks (CNNs) and Bidirectional Long Short- Term
Memory (BiLSTM) networks, along with Connectionist Temporal Classification (CTC)
architecture, to improve the recognition of handwritten characters. The researchers' approach
involved using CNNs to generate feature maps, which were then interpreted by BiLSTMs to
improve OCR accuracy. This methodology proved to be highly effective, resulting in significant
improvements over traditional OCR methods. The researchers tested their approach on complex
handwritten datasets and demonstrated its superior performance in recognizing a broader
spectrum of handwriting styles. The use of advanced neural networks allowed the system to
interpret even the most challenging handwritten characters with remarkable accuracy.

Enhancing Text Detection with SSD and MSER: Lee and Chang proposed a novel OCR
architecture that employs the Single Shot Multi-box Detector (SSD) for text detection and
Maximally Stable Extremal Regions (MSER) for text segmentation. The proposed approach
emphasizes the importance of precise text localization in OCR, which is crucial for accurate
character recognition. The SSD-based text detection method enables rapid detection of text
regions in an image, while MSER-based text segmentation provides resilience against text and
background variations, making it more robust to noise and clutter.

OCR System Performance with MXNet: Patel, Kumar, and Zhou conducted a thorough
evaluation of the performance of MXNet in a deep learning-based OCR system. Their study
underscores the exceptional flexibility offered by MXNet, which provides extensive support for
various neural networks, making it an essential tool for OCR applications. The research found
that MXNet can significantly improve processing speed without sacrificing accuracy, a crucial
parameter for OCR systems. The researchers' findings suggest that MXNet is a powerful tool that
can enhance OCR system performance, making it an indispensable asset for businesses and
organizations that rely on OCR technology.

A system modeling language can be structured so that different sections of a system
specification are translated independently; this modularity is important for efficiently
translating large systems. The sections that can be translated independently are the global
functions and the units, each of which contains a structure part (functions and links), a control
part, connections between links, and the function definitions within the unit. The translator
relies on several tables: a global function table, a unit table, and, for each unit, function,
link, connection, and control part tables. An algorithm is presented for generating these tables
from the system specification.

Text summarization presents a formidable challenge in natural language processing, given the
intricacies involved in accurately interpreting and analyzing text. Effective automatic text
summaries must fulfill three key criteria: they should be generated from either single or multiple

~ 11 ~
documents, capture the essential information therein, and maintain conciseness. Additionally,
ideal summaries ought to exhibit comprehensive topic coverage, lack redundancy, maintain
cohesion, relevance, and readability. The summarization process typically comprises three
phases: analyzing the document text to create a suitable representation, transforming this
representation into a summary format, and finally converting it into the summarized text.

Challenges arise in handling redundancy, accurately measuring similarity between content,


identifying pertinent textual features, and ensuring comprehensive topic coverage, particularly in
multi-document summaries. Techniques such as graph-based algorithms, maximal marginal
relevance, and meta-heuristic methods are employed to tackle these challenges. Evaluation of
summary quality typically involves metrics like recall, precision, F-score, and comparisons
against other methods. While abstractive summarization, which generates novel sentences, is
more complex, it often yields superior results compared to extractive methods that simply extract
existing sentences. There remains significant potential for further exploration and development,
particularly in the realm of abstractive summarization for Indian languages.

A transformer-based math language model (TMLM) was proposed for improving the recognition of
handwritten mathematical expressions (HMEs). The authors highlight the challenges in
recognizing HMEs, such as ambiguities in interpreting certain symbols and the need for
contextual information. The proposed TMLM is based on the self-attention mechanism, allowing
it to capture long-range dependencies and correlations among symbols and spatial relationships
in a mathematical expression. The TMLM architecture consists of an input embedding layer, a
positional encoding layer, and a stack of transformer layers. The self-attention mechanism in the
transformer layers enables the model to attend to relevant parts of the input sequence when
predicting the next token. The authors trained TMLM on a corpus of approximately 70,000
LaTeX sequences provided in the CROHME 2016 dataset.

TMLM was combined with a stochastic context-free grammar-based HME recognizer (SRTC_SLP) by
re-ranking the top candidates generated by the recognizer using TMLM's scores. This combination improved the
expression recognition rates on the CROHME 2016 and CROHME 2019 testing sets by 2.97 and
0.83 percentage points, respectively, compared to the SRTC_SLP recognizer alone. The
document also includes error analyses, discussing corrected and miscorrected cases when
combining TMLM with the HME recognizer. The authors highlight the challenges in modeling
mathematical expressions due to ambiguities and the limited size of the available corpus. Overall,
the document presents a novel approach to improving HME recognition using a transformer-based
language model and demonstrates its effectiveness in capturing long-range dependencies and
improving recognition performance when combined with an existing HME recognizer.

Text detoxification involves removing toxicity from text while preserving its original
meaning. The first method, ParaGeDi (Paraphrasing GeDi), combines two recent concepts:
guiding the text generation process with small style-conditional language models and utilizing
paraphrasing models for style transfer. ParaGeDi employs a well-performing paraphraser, guided
by style-trained language models, to retain content while eliminating toxicity. It consists of a
paraphraser component, a pre-trained T5 model fine-tuned on parallel paraphrase data, coupled
with a discriminator component, a GPT-2 language model trained to distinguish toxic from non-
~ 12 ~
toxic text. During generation, the paraphraser proposes candidate outputs, which the discriminator
evaluates based on toxicity, ensuring the selection of high-quality non-toxic outputs. The second
method, CondBERT, draws inspiration from previous work utilizing BERT's masked language
modeling for text infilling and data augmentation. CondBERT first identifies toxic words/phrases
in the input using a bag-of-words toxicity classifier. It then substitutes these toxic spans with
BERT's top predicted [MASK] replacements while penalizing toxic replacements, employing
heuristics to maintain meaning preservation during substitution.

To enhance ParaGeDi's performance, the authors mine a large parallel corpus (ParaNMT) for
naturally occurring toxic/non-toxic sentence pairs for training data. Fine-tuning the paraphraser
on this data yields additional improvements. Human evaluation studies validate automatic
metrics, affirming ParaGeDi and CondBERT as the top systems, with only moderate correlation
between automatic and human evaluations of individual outputs. The experiments extend to
sentiment transfer tasks, demonstrating ParaGeDi's robust performance across different style
transfer objectives. Furthermore, the paper addresses ethical concerns surrounding subjective
toxicity definitions, potential misuse of detoxification models for toxifying text, and the
implications of detoxification as a form of censorship. The authors advocate for using
detoxification models to suggest rewrites rather than unilaterally altering text. Overall, the paper
contributes by introducing effective unsupervised detoxification methods and providing a
comprehensive comparison study using both automatic and human evaluation metrics.

Virtuoso is a massively multilingual speech-text joint semi-supervised learning framework for
text-to-speech (TTS) synthesis. Existing multilingual TTS typically supports only tens of
languages due to the difficulty in collecting high-quality speech-text paired data for low-resource
languages. Virtuoso extends Maestro, a previous speech-text joint pretraining framework for
automatic speech recognition (ASR), to enable speech generation tasks like TTS. Virtuoso's
architecture consists of a speech encoder, text encoder, shared encoder, RNN-T decoder, and
speech decoder. It can use different types of text input like phonemes, graphemes, and bytes.
Virtuoso is trained on various data types - supervised data (paired TTS and ASR data) and
unsupervised data (untranscribed speech and unspoken text) using different tailored training
objectives for each data type.

For paired TTS data, in addition to the speech reconstruction loss, it uses losses like the ASR
loss, contrastive loss, masked language modeling loss, duration loss, and modality matching loss.
For paired ASR data, it uses the same losses as Maestro. For unsupervised data, it employs self-
supervised losses like the contrastive loss and masked language modeling loss. Experimental
evaluation on a dataset of 1.5k hours across 40 languages showed that Virtuoso achieves
significantly better naturalness and intelligibility than baseline models for seen languages present
in the paired TTS data. Importantly, Virtuoso can also synthesize reasonably intelligible and
natural-sounding speech for unseen languages where no paired TTS data is available by
leveraging the additional unpaired data. Fine-tuning Virtuoso on just 1 hour of paired TTS data
for an unseen language further improved performance.

~ 13 ~
CHAPTER 3
METHODOLOGY
3.1 Overview of the Proposed Solution: This section delves deeper into the
methodologies adopted in the creation of "Infiniti Script", focusing on the enhancement of
Handwritten Text Recognition (HTR) capabilities through the synergistic application of advanced
neural network architectures and state-of-the-art OCR technologies.
Innovation in OCR Technology: Our research initiates the development of an advanced OCR
framework designed to substantially improve the detection and interpretation of handwritten texts.
This endeavor seeks to harness the combined strengths of cutting-edge technologies—CNN,
BiLSTM, CTC, SSD, and MSER—to significantly boost recognition accuracy.
Redefining Accuracy and Efficiency: By integrating these diverse algorithms, our solution aims
to establish new benchmarks in OCR technology. The focus is particularly on enhancing the
system's ability to accurately decode a wide array of handwriting styles, thereby setting a
precedent in the OCR field for precision and operational efficiency.
Addressing Variability in Handwriting: This project directly responds to the critical demand
for OCR systems that adeptly navigate the complexities of human handwriting. It offers a robust
mechanism for the digital transcription of handwritten documents, aiming for unparalleled
accuracy in converting nuanced variations of handwriting into digital text.

3.2 Description of Technologies Used (CNN, BiLSTM, CTC, SSD, MSER):


The OCR system utilizes CNN and BiLSTM networks, along with CTC, SSD, and MSER, to
extract and digitize handwritten texts effectively. The integration of SSD ensures precise text
detection, while MSER enables advanced segmentation, showcasing the system's ability to
interpret various handwriting styles accurately. These technologies mark a significant
advancement in OCR capabilities for diverse writing conditions.
3.2.1 Convolutional Neural Networks (CNN)
At the feature extraction level, we adopted CNNs, which can deconstruct handwritten text into
fundamental lines, curves, and edges. This decomposition enables the kind of distinct differentiation
required to read individual characters and symbols. Our choice of CNNs stems from their capacity
to distinguish complex patterns in raw pixel data through many layers of processing, significantly
outperforming other models in feature extraction, and this approach to deciphering the intricacies
of handwriting is central to our methodology, providing a robust foundation for accurate character
and symbol recognition. The fundamental operation at the core of a CNN is the convolutional layer,
defined by:

F_{ij} = σ( Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} W_{mn} · X_{(i+m)(j+n)} + b )

~ 14 ~
Where:
● F_{ij} is the feature map obtained after applying the filter,
● W_{mn} represents the weights of the convolutional filter of size M×N,
● X_{(i+m)(j+n)} is the input image matrix,
● b denotes the bias,
● σ is the nonlinear activation function, commonly ReLU (Rectified Linear Unit), given by
σ(x) = max(0, x).
After the convolutional layers, Max Pooling is applied to reduce the spatial dimensions of the
feature maps. This process is critical for decreasing the network's computational load by
minimizing the number of parameters. A simplified description of the Max Pooling operation can
be given as:

P(i,j)=max(L(i,j))
Where:
● P(i,j) represents the output of the Max Pooling operation at position (i,j),
● max denotes the maximum value selection within a specified pooling window,
● L(i,j) refers to a local region in the input feature map being considered for pooling.
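
To make the above concrete, the short sketch below shows how such a convolution and max-pooling
feature extractor could be assembled with MXNet's Gluon API, the framework adopted in this project.
The layer counts, channel sizes, and the 64×256 grayscale input are illustrative assumptions rather
than the exact configuration used in Infiniti Script.

# Illustrative sketch only: layer counts, channel sizes, and the input shape
# (a 1 x 1 x 64 x 256 grayscale line image) are assumptions, not the exact
# configuration used in this project.
from mxnet import nd
from mxnet.gluon import nn

def build_feature_extractor():
    net = nn.HybridSequential()
    net.add(
        nn.Conv2D(channels=32, kernel_size=3, padding=1, activation='relu'),  # low-level strokes/edges
        nn.MaxPool2D(pool_size=2, strides=2),                                 # halve spatial dimensions
        nn.Conv2D(channels=64, kernel_size=3, padding=1, activation='relu'),  # higher-level curves/shapes
        nn.MaxPool2D(pool_size=2, strides=2),
    )
    return net

net = build_feature_extractor()
net.initialize()
x = nd.random.uniform(shape=(1, 1, 64, 256))   # batch of one grayscale text-line image
features = net(x)                              # feature maps, here (1, 64, 16, 64)
print(features.shape)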

3.2.2 Bidirectional Long Short-Term Memory (BiLSTM)


Integrating BiLSTM networks into our system is crucial for their ability to grasp the sequential
essence of text. Unique to BiLSTMs is their method of analyzing data forwards and backwards,
offering deeper insight into the context of each character than unidirectional approaches can.
This comprehensive analysis significantly enriches the model's understanding of individual
handwriting flows and structures, thereby elevating its ability to accurately recognize sequences
of handwritten text. Formally, the BiLSTM captures the sequential context of the text in both the
forward and backward directions:

H = BiLSTM(X)
Where :

• H is the combined output that captures information from both the forward and backward
passes through the text.
• X represents the input text data.
• BiLSTM( ) denotes the bidirectional processing function.
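
A minimal Gluon sketch of this bidirectional sequence model is shown below. The hidden size, the
number of layers, and the reshaping of CNN feature maps into a (time, batch, feature) sequence are
assumptions made for illustration only.

# Sketch under assumptions: hidden size, layer count, and the reshaping of CNN
# feature maps into a (time, batch, feature) sequence are illustrative choices.
from mxnet import nd
from mxnet.gluon import rnn

# Suppose the CNN produced feature maps of shape (batch, channels, height, width).
batch, channels, height, width = 1, 64, 16, 64
feature_maps = nd.random.uniform(shape=(batch, channels, height, width))

# Treat each horizontal position as one time step: (width, batch, channels * height).
sequence = feature_maps.transpose((3, 0, 1, 2)).reshape((width, batch, channels * height))

# Bidirectional LSTM: processes the sequence forwards and backwards.
bilstm = rnn.LSTM(hidden_size=256, num_layers=2, bidirectional=True, layout='TNC')
bilstm.initialize()

H = bilstm(sequence)   # combined forward/backward context, shape (width, batch, 512)
print(H.shape)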

~ 15 ~
Figure 3.1 : BiLSTM OCR Pipeline

3.2.3 Connectionist Temporal Classification (CTC)


The OCR system utilized in our model significantly benefits from the inclusion of Connectionist
Temporal Classification (CTC), which functions to seamlessly connect the sequences produced
by the neural network to their equivalent text sequences, bypassing the need for manual data
segmentation. Coupled with the advanced sequential processing afforded by BiLSTM networks,
CTC acts as a pivotal bridge, transforming analog handwriting into digital text.

Y = CTC(H)
Where :

• Y represents the final output text sequence predicted by the OCR system.
• H is the output from the BiLSTM network, which encodes the features extracted from the
input image and captures the sequential dependencies of the handwritten text.
• CTC( ) is the function that maps the sequence of features H directly to the target text
sequence Y, facilitating the conversion of analog handwriting to digital text without
needing explicit segmentation.
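
A hedged sketch of how CTC training could be wired up with MXNet's built-in CTC loss follows; the
vocabulary size, the dense projection layer, and the example label sequence are illustrative
assumptions, not values taken from the project.

# Sketch under assumptions: vocabulary size, sequence lengths, and the dense
# projection from BiLSTM outputs to per-character scores are illustrative.
from mxnet import nd
from mxnet.gluon import nn, loss

vocab_size = 80          # assumed number of characters plus the CTC blank
seq_len, batch = 64, 1

# Project each BiLSTM time step (batch, time, features) to per-character scores.
project = nn.Dense(vocab_size, flatten=False)
project.initialize()
H = nd.random.uniform(shape=(batch, seq_len, 512))   # stand-in for the BiLSTM output
scores = project(H)                                  # (batch, time, vocab_size)

# Target label indices for the ground-truth text (illustrative values).
labels = nd.array([[8, 5, 12, 12, 15]])

ctc = loss.CTCLoss(layout='NTC', label_layout='NT')
l = ctc(scores, labels)   # alignment-free loss; no explicit character segmentation needed
print(l)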

~ 16 ~
3.2.4 Single Shot Multi-box Detector (SSD):
Integrating the Single Shot Multi-box Detector (SSD) framework for line segmentation has
notably enhanced our OCR system's ability to pinpoint text lines and word boundaries. This
enhancement is especially vital for the accurate analysis of handwritten documents, which may
feature fragmented phrases or closely packed text.

Figure 3.2 : Line SSD: Non Maximum Suppression

The implementation of SSD ensures that our system proficiently recognizes and interprets lines
of text across a diverse range of handwriting styles, maintaining its efficiency even in challenging
document layouts.
L = SSD(I)
Where :
• L represents the detected lines or word boundaries in the document.
• SSD(I) denotes the SSD operation applied to the input image I, resulting in the
identification of text lines and word boundaries.
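
Overlapping candidate boxes produced by such a detector are typically pruned with non-maximum
suppression, as illustrated in Figure 3.2. The following NumPy sketch shows the general form of
that post-processing step; the 0.5 IoU threshold is an arbitrary illustrative choice.

# Minimal non-maximum suppression sketch; the 0.5 IoU threshold is an
# illustrative assumption, not a tuned value from this project.
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    # Returns the indices of the boxes to keep.
    order = scores.argsort()[::-1]            # best boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]   # drop boxes overlapping the kept one
    return keep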

Figure 3.3 : Word SSD


~ 17 ~
3.2.5 Maximally Stable Extremal Regions (MSER):
The integration of Maximally Stable Extremal Regions (MSER) into our OCR framework is a
testament to our dedication to exceptional text segmentation quality. MSER's superior ability to
distinguish text from complicated backgrounds significantly boosts the system's proficiency in
identifying and analyzing handwritten content.
S = MSER(I_enhanced)
Where:
● S symbolizes the segmented regions of interest, essentially the parts of the image
identified as containing text.
● MSER(I_enhanced) represents the operation of applying MSER to an enhanced input
image I_enhanced, which has been pre-processed to highlight textual features and diminish
background noise.
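
For reference, one plausible way to obtain such regions is OpenCV's MSER implementation, sketched
below under the assumption of a pre-processed grayscale input; the file names are placeholders and
the real pipeline would add further filtering of the candidate regions.

# Sketch using OpenCV's MSER detector; 'enhanced.png' is a placeholder path and
# the full pipeline would filter the raw candidate regions further.
import cv2

gray = cv2.imread('enhanced.png', cv2.IMREAD_GRAYSCALE)   # pre-processed input image

mser = cv2.MSER_create()                    # detector for maximally stable extremal regions
regions, bboxes = mser.detectRegions(gray)  # candidate text regions and their bounding boxes

# Draw the candidate regions for inspection.
vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
for (x, y, w, h) in bboxes:
    cv2.rectangle(vis, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite('mser_regions.png', vis)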

Figure 3.4 : MSER Text Segmentation Process

3.3 System Architecture and Workflow


The "Infiniti Script" OCR system is meticulously designed to optimize the transformation of
handwritten text into a digital format. The system's architecture consists of multiple modules,
each playing a crucial role at various stages of the OCR pipeline—from capturing the initial image
to delivering the final digital output.
Initial Image Acquisition: The journey begins with the collection of images containing
handwritten text. These images can originate from diverse sources such as scanned documents,
photographs, or digital images captured with mobile devices. Our system's design prioritizes
flexibility, allowing it to work independently of the image source.
Image Preprocessing: Upon image capture, a series of preprocessing steps are employed to
prepare the image for recognition.
~ 18 ~
These steps, illustrated in a short code sketch after the list, include:
● Normalization: Ensuring consistent aspect ratios and sizes across all input images.
Normalized Image = (Image − Mean) / Standard Deviation
● Denoising: Filtering out noise and artifacts that might obscure the text.
● Contrast Adjustment: Enhancing image contrast to make the handwriting more visible.
I_adjusted = α × I + β
where I is the original image, α controls the contrast, and β controls the brightness.
● Binarization: Converting the image to a binary format to facilitate text region detection.
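
A minimal OpenCV sketch of these preprocessing steps is given below; the target size, the α and β
values, and the use of Otsu thresholding for binarization are assumed settings chosen for
illustration rather than the project's exact pipeline.

# Illustrative preprocessing sketch; the target size, alpha/beta values, and
# Otsu binarization are assumed settings, not the project's exact pipeline.
import cv2

def preprocess(path, target_size=(256, 64), alpha=1.5, beta=10):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Normalization: consistent size, then zero-mean / unit-variance pixel values.
    img = cv2.resize(img, target_size)
    normalized = (img - img.mean()) / (img.std() + 1e-8)

    # Denoising: remove speckle noise that could obscure strokes.
    denoised = cv2.fastNlMeansDenoising(img)

    # Contrast adjustment: I_adjusted = alpha * I + beta.
    adjusted = cv2.convertScaleAbs(denoised, alpha=alpha, beta=beta)

    # Binarization: Otsu's threshold separates ink from background.
    _, binary = cv2.threshold(adjusted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return normalized, binary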

Feature Extraction via Convolutional Neural Networks (CNN): With preprocessing complete,
CNNs take the stage for feature extraction. This phase involves dissecting the image to pinpoint
characteristics indicative of handwritten text. CNN layers systematically deconstruct the image
into feature maps, highlighting various text aspects critical for identifying individual symbols and
characters. The feature extraction phase can be represented by the convolution operation, a
fundamental process in CNNs:
Convolution Operation: F(i,j) = Σ_m Σ_n I(i+m, j+n) × K(m,n), where F is the feature map, I is the
input image, and K is the kernel or filter.
Sequence Modeling with Bidirectional LSTM (BiLSTM) : The feature-rich outputs from
CNNs are processed by BiLSTM networks, which delve into the sequential and contextual nature
of the text. By analyzing data both forward and backward, BiLSTMs account for temporal
dependencies between characters, enriching our understanding of the text's structure and boosting
recognition rates.
Line Segmentation with Single Shot Multi-box Detector (SSD): An enhanced SSD algorithm
is integrated for line segmentation. The SSD is particularly effective in recognizing and
delineating text lines within the image, utilizing custom-designed anchor boxes that cater to the
specific dimensions of handwritten text. This selective detection is crucial for isolating lines for
further processing.
Decoding with Connectionist Temporal Classification (CTC): Decoding follows, where CTC
translates the BiLSTM output sequences into readable text without necessitating character
segmentation. This step is vital for aligning the input data with their labels, seamlessly converting
neural network outputs into the desired textual format.
Integration and Output: The decoded text is then compiled into a coherent output, ready for
storage, display, or further processing. Post-processing tasks such as spell-checking, grammar
correction, and formatting may also be applied to refine the output's quality.
System Evaluation and Feedback Loop: Our system includes a continuous evaluation and
feedback mechanism, utilizing performance metrics like accuracy, precision, recall, and character
error rate (CER) to gauge effectiveness. This feedback informs ongoing improvements, ensuring
"Infiniti Script" remains at the cutting edge of handwritten text recognition.

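As one example of how the character error rate mentioned above can be computed, the generic sketch
below uses a standard Levenshtein edit distance; it is not the project's evaluation code.

# Generic character error rate (CER) sketch based on Levenshtein edit distance;
# not the project's actual evaluation code.
def edit_distance(ref, hyp):
    # Minimum number of insertions, deletions, and substitutions to turn ref into hyp.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (free if characters match)
            prev = cur
    return dp[-1]

def character_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(character_error_rate("handwritten text", "handwriten taxt"))   # prints 0.125
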
~ 19 ~
3.4 Data Collection and Privacy Measures:
To ensure the "Infiniti Script" OCR system meets and exceeds the current standards of
handwritten text recognition, our methodology emphasizes the importance of comprehensive data
collection and unwavering commitment to privacy and ethical considerations. This section
elaborates on the nuanced approach taken towards gathering a diverse dataset and implementing
robust privacy measures.
Data Collection
The effectiveness of our OCR system hinges on the diversity and quality of the dataset used for
training and validation. To achieve this:
● IAM Handwriting Database Utilization: Our primary source is the IAM Handwriting
Database, renowned for its extensive collection of handwritten text images. This database
provides a wide range of handwriting styles, contributing significantly to the versatility of
our OCR system.
● Diversity and Inclusivity: We consciously include samples that represent various
handwriting styles, ages, educational backgrounds, and languages (where applicable) to
ensure our system's adaptability across diverse user demographics.
● Quality Assurance: Each image within the dataset undergoes a rigorous quality check to
ensure clarity, legibility, and relevance. This process includes verifying the absence of
extraneous markings and ensuring sufficient resolution for detailed analysis.
Privacy Measures
Recognizing the sensitive nature of handwritten documents, we implement comprehensive
privacy measures to protect individual rights and ensure data security:
● Consent and Transparency: For any data sourced directly from individuals, we obtain
informed consent through clear, understandable consent forms. Participants are fully
briefed on the scope of the research, the intended use of their data, and their rights
regarding data withdrawal.
● Anonymization of Data: Wherever possible, personal identifiers are removed or
obscured to anonymize the data. This step minimizes the risk of personal data exposure
and enhances privacy protection.
● Data Encryption: All digital data, including images and metadata, are encrypted using
state-of-the-art encryption technologies. This measure safeguards against unauthorized
access during storage and transmission.
● Access Control: Access to the dataset is restricted to authorized personnel only, with
roles and permissions meticulously defined to limit exposure to sensitive information.
Personnel are trained on data privacy and security protocols to ensure adherence to best
practices.
● Compliance with Data Protection Laws: Our data collection and handling procedures
are designed to comply with international data protection laws.

3.5 Integration with Existing OCR Technologies
The evolution of "Infiniti Script" within the Optical Character Recognition (OCR) domain
signifies not just the birth of a standalone OCR solution but its strategic positioning as a
complementary force among existing OCR technologies. Our methodology champions
collaboration, performance comparison, and the seamless integration of "Infiniti Script" with
established OCR systems, ensuring it enhances and extends the capabilities of current text
recognition frameworks.
Collaborative Endeavors with Industry Pioneers
● Strategic Alliances: Establishing partnerships with leading OCR technology providers to
foster a symbiotic exchange of technological insights, enhancing the breadth and depth of
"Infiniti Script".
● API Integration: Harnessing the power of APIs from prominent OCR solutions to
augment "Infiniti Script" with superior language support and refined character recognition
capabilities, ensuring it benefits from the strengths of established technologies.
Benchmarking and Continuous Refinement
● Performance Benchmarking: Undertaking comprehensive benchmark tests to measure
"Infiniti Script" against top OCR systems, focusing on metrics such as accuracy,
processing speed, and the recognition of intricate handwriting styles. These benchmarks
are pivotal in identifying "Infiniti Script's" unique advantages and areas ripe for
enhancement.
● Constructive Feedback Loops: Creating channels for receiving feedback from both
users and OCR experts to inform ongoing refinements of "Infiniti Script", ensuring it
meets and exceeds the evolving needs of its user base.
Synergistic Technology Integration
● Hybrid Solutions Development: Crafting hybrid OCR systems that merge the distinct
advantages of "Infiniti Script" with those of existing technologies, aiming to forge a more
robust and adaptable OCR toolkit.
● Incorporation of Proven OCR Features: Enhancing "Infiniti Script" by integrating
tried-and-tested features from current OCR systems, such as sophisticated preprocessing
algorithms and cutting-edge error correction methods, to bolster its overall performance.
Adherence to Standards and Enhancing Compatibility
● Compliance with OCR Standards: Ensuring "Infiniti Script" aligns with global OCR
standards, facilitating its straightforward integration and interoperability with existing
OCR frameworks and applications.
● Universal Platform Compatibility: Designing "Infiniti Script" to be universally
compatible, supporting a wide array of operating systems and digital platforms, thus
enabling its effortless adoption into various technological ecosystems.
Open Source Engagement and Contribution
● Active Open Source Community Participation: Engaging with the open-source
community by contributing to codebases, sharing valuable insights, and collaborating on
innovative projects, driving forward the collective advancement of OCR technology.

Application in Real-World Scenarios
● Implementation of Pilot Projects: Conducting pilot projects in collaboration with
sectors heavily reliant on OCR technology—such as legal, healthcare, and archival
industries—to gather actionable insights on "Infiniti Script's" application in varied real-
world settings.
● Customization for Specific Use Cases: Tailoring "Infiniti Script" to meet the unique
needs of end-users and organizations, ensuring its integration delivers concrete benefits
and enhances existing operational workflows.

3.6 Dataset Description and Preparation


Our research leverages the comprehensive IAM Handwriting Database, emphasizing a
methodical preparation process to optimize the data for neural network training. This
preparation enhances the accuracy and reliability of "Infiniti Script" in recognizing diverse
handwriting styles.
IAM Handwriting Database Overview:
● Sourced from 1,539 pages of texts by 657 unique writers, providing a rich variety of
handwriting styles.
● Utilized for training and testing, aiming to cover a wide spectrum of handwriting
variations.

Figure 3.5 : Handwriting Sample Frequency Distribution


Image Enhancement and Augmentation:
● Standardization for uniform dataset processing, crucial for neural network
compatibility.
● Augmentation techniques like rotation and scaling are applied, increasing the model's
exposure to varied handwriting styles and improving its adaptability.

Page and Line Segmentation:


● The MSER algorithm is utilized for precise text segmentation, which is essential for accurate text recognition.
● This process ensures effective CNN application in character recognition by isolating textual elements from images.
Model Training and Accuracy Metrics:
● The model undergoes rigorous training, focusing initially on minimizing Mean Square
Error (MSE) for foundational accuracy in predicting character bounding boxes.
● Subsequent enhancements use the Intersection over Union (IoU) metric, significantly
improving text outlining accuracy. Our systematic training approach has demonstrated
substantial progress in model performance, achieving notable accuracy improvements.
Achieved Accuracies and Evaluation:
● Through iterative training and refinement, our model exhibits a marked improvement in
text recognition capabilities. Detailed numerical figures are reported in Chapter 5 (Results); throughout development, our methodology emphasizes continuous assessment against key performance metrics, including character accuracy and text localization precision.

Advanced Techniques for Improved Recognition:


● Data augmentation and selective refinement techniques, such as Non-Maximum
Suppression (NMS) and custom-designed anchor boxes, contribute significantly to the
model's enhanced performance. These strategies address the challenges of diverse
handwriting recognition with a tailored approach, improving the system's accuracy and
reliability.
3.7 Overview of Modules used for Language Translation (Translator) :


Translator : Python's translate module is a library that offers a straightforward and practical
interface for translating text between languages via a range of translation services and application
programming interfaces. It makes it simple for developers to incorporate translation features into
their Python scripts and applications.

Key features of Translator are :

● Developers can specify the source and target languages when translating text strings
between languages using the translate module. The translation is handled by the module,
which also outputs the translated text.

● Yandex Translate, Microsoft Translator, Google Translate, and other translation services
and APIs are supported by the translate module. Developers can now select the translation
service that best fits their requirements and tastes thanks to this flexibility.
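A minimal usage sketch of the translate module is shown below; the language codes and the sample sentence are illustrative.

from translate import Translator

# Translate an English sentence into French ('en' and 'fr' are example language codes)
translator = Translator(from_lang="en", to_lang="fr")
print(translator.translate("Handwritten notes, now in digital form."))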

3.8 Overview of Modules used for Text Summarization (gpt-3.5-turbo-instruct):


Text summarization is an essential tool for improving comprehension and information retrieval by
condensing vast amounts of information into brief summaries. To achieve this, the gpt-3.5-turbo-
instruct module is used, which makes use of sophisticated natural language processing (NLP)
methods. This module specializes in producing informative and cogent summaries from long textual
inputs; it is based on the GPT-3.5 architecture. It identifies the main ideas and points of a text
by using machine learning algorithms and pre-trained language models, creating summaries that
effectively convey the substance of the original content.

Key features of gpt-3.5-turbo-instruct are :

● OpenAI's GPT-3.5, a potent autoregressive language model renowned for its sophisticated
natural language understanding capabilities, is utilized in this module. Because GPT-3.5 has
been specifically tuned for text summarization tasks, it can produce clear and concise
summaries of input text.
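A sketch of calling gpt-3.5-turbo-instruct for summarization is given below; it assumes the pre-1.0 openai Python client, and the API key, prompt wording, and token limit are placeholders.

import openai  # assumes the legacy (pre-1.0) openai client; newer SDKs expose a client object instead

openai.api_key = "YOUR_API_KEY"  # placeholder

def summarize(text, max_tokens=150):
    # Ask the instruct model for a short summary of the input text
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt="Summarize the following text in a few sentences:\n\n" + text,
        max_tokens=max_tokens,
        temperature=0.3,
    )
    return response["choices"][0]["text"].strip()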

3.9 Overview of Modules used for Text-to-Audio (gTTs, Pdf plumber) :


gTTS : Text-to-speech audio can be produced easily and effectively using the Python gTTS library. Text-to-speech features can be readily added to Python applications using gTTS, allowing users to hear text instead of reading it. This package makes use of Google's Text-to-Speech API, which produces high-quality audio output by applying sophisticated speech synthesis techniques.

Key features of gTTS are :

● Simplicity and ease of use are among gTTS's main benefits. Using Python's pip package manager, developers can quickly install the package and start turning text into speech with a few lines of code.

● The project makes use of the gTTS library, a Python wrapper for Google's Text-to-Speech (TTS) API. With the help of this library, text strings can be converted into audio files in a variety of formats, with customization options such as speech rate adjustment and voice selection.

Pdf plumber : A Python library called PDF Plumber is used to parse and extract data from PDF
documents. It provides an extensive collection of tools and features for programmatically examining,
modifying, and extracting data from PDF files. Text, tables, images, and metadata can be easily
extracted from PDF documents with the help of PDF Plumber.

Key features of PDF Plumber are :

● Tools such as PDF Plumber can be used to extract tabular data from PDF documents,
including tables that are embedded in the document or that have been scanned as images.
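A minimal sketch combining pdfplumber and gTTS is shown below; the file names and language code are illustrative placeholders.

import pdfplumber
from gtts import gTTS

def pdf_to_audio(pdf_path, audio_path="output.mp3", lang="en"):
    # Extract text from every page, then synthesize the combined string into an MP3 file
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    gTTS(text=text, lang=lang, slow=False).save(audio_path)
    return audio_path

pdf_to_audio("notes.pdf")  # 'notes.pdf' is an example input file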

3.10 Overview of Modules used for Audio-to-Text (AssemblyAI, SpeechRecognition, pydub) :
Assemblyai : The project is integrated with the cloud-based ASR service AssemblyAI Speech
Recognition API. This API supports a variety of audio formats and languages and uses deep learning
algorithms to accurately translate spoken audio into text.

Speech recognition : Python speech recognition libraries provide robust speech-to-text conversion capabilities, allowing developers to integrate speech recognition features into their applications.

These tools analyze audio input and extract meaningful written information using advanced algorithms and machine learning models. The SpeechRecognition library is a popular Python voice recognition framework.
Pydub : Pydub is a powerful and flexible Python package for audio processing that offers a comprehensive set of features for manipulating audio files with ease, whether you're a hobbyist looking to edit audio recordings or a professional developer building complex audio applications.
Key features of pydub are :
● Pydub provides a straightforward and easy-to-use interface for audio manipulation tasks. Its
simple API allows users to perform complex audio operations with minimal code, making it
accessible to both beginners and experienced developers.
● Pydub supports the conversion of audio files between various formats, such as MP3, WAV,
FLAC, and more. It allows users to read audio files in one format, manipulate them as needed,
and save them in a different format, facilitating compatibility and interoperability.
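A minimal sketch of this audio-to-text path is given below, assuming the free Google Web Speech backend bundled with the SpeechRecognition library; the file names are placeholders.

import speech_recognition as sr
from pydub import AudioSegment

def transcribe(audio_path):
    # Normalize the recording to WAV with pydub, then transcribe it with SpeechRecognition
    wav_path = "converted.wav"
    AudioSegment.from_file(audio_path).export(wav_path, format="wav")
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)  # free Google Web Speech API backend

print(transcribe("lecture.mp3"))  # example input file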

3.11 Overview of Modules used for Paraphraser (PegasusTokenizer model, PyTorch) :
Pegasus Tokenizer model : The PegasusTokenizer model is an essential component of the
PEGASUS architecture, developed by Google Research. As part of the PEGASUS framework, the
PegasusTokenizer plays a crucial role in processing text inputs for tasks such as abstractive text
summarization and paraphrasing.

Key features of Pegasus Tokenizer model are :

● Specialized Tokenization: The Pegasus Tokenizer is optimized for tokenizing text inputs for
the PEGASUS model. It employs advanced tokenization techniques to segment input text into
sub word tokens, ensuring compatibility with the model's architecture and requirements.

● Handling of Various Input Lengths: The tokenizer efficiently handles text inputs of varying
lengths, from short sentences to longer paragraphs or documents. It segments input text into
tokens while preserving semantic meaning.

PyTorch : PyTorch, built on the torch library, is a powerful open-source machine learning framework for Python. Developed primarily by FAIR (Facebook AI Research), it provides a flexible platform for building and training deep learning models.
Key features of PyTorch are :
● Dynamic computation : PyTorch provides a dynamic computation graph framework that makes it possible to create models in an adaptable and user-friendly manner. In contrast to static computation graphs, its define-by-run approach allows users to construct and alter computational graphs on the fly as they execute.

● Effective Tensor Operations: Torch offers comprehensive tensor operations support, which
simplifies the execution of intricate mathematical calculations frequently required in deep
learning. Multi-dimensional arrays called tensors are used to represent data and computations.
They are designed to operate efficiently on CPU and GPU technology.
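A sketch of paraphrasing with the Hugging Face transformers implementation of PEGASUS and PyTorch is shown below; the checkpoint name tuner007/pegasus_paraphrase is an assumed publicly available paraphrasing model, and the generation parameters are illustrative.

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"  # assumed paraphrasing checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)

def paraphrase(sentence, num_return_sequences=3):
    # Tokenize, generate candidate rewrites with beam search, and decode them back to text
    batch = tokenizer([sentence], truncation=True, padding="longest", return_tensors="pt").to(device)
    generated = model.generate(**batch, max_length=60, num_beams=10,
                               num_return_sequences=num_return_sequences)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(paraphrase("Handwritten notes can now be converted into editable digital text."))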
3.12 Overview of Modules used for Mathematical expression solving (Gemini) :
Gemini AI : Gemini excels at handling expressions that combine addition, subtraction, multiplication, and division, correctly applying the order of operations (PEMDAS). It resolves such problems reliably, guaranteeing that the proper sequence of steps is followed, and also supports expression simplification.
Key features of Gemini AI are :
● Not just results, Gemini provides clear, step-by-step explanations for its solutions. This is
particularly helpful for understanding the logic behind the answer and solidifying your
knowledge.
● The beauty of Gemini lies in its ability to understand different formats. Present your problem
as text, an image of a handwritten equation, or even a spoken question, and Gemini will
analyse it effectively.
● While solving equations is a forte, Gemini's capabilities extend further. It can answer open
ended math questions, explain mathematical concepts, and even generate practice problems
to solidify your understanding.
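A minimal sketch using the google-generativeai SDK is shown below; the API key is a placeholder and the model name "gemini-pro" is an assumed identifier for the available text model.

import google.generativeai as genai  # assumes the google-generativeai SDK is installed

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")     # assumed model name

prompt = "Solve step by step, following PEMDAS: 3 + 4 * (2 - 1) ** 2 / 2"
response = model.generate_content(prompt)
print(response.text)  # prints the step-by-step explanation returned by the model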

3.13 Overview of File Exporting (PSPDFKit)


PSPDFKit : Utilize PSPDFKit for Web's proprietary technology, built from the ground up, to
perform PDF to DOC conversion. PSPDFKit's solution doesn't rely on third-party tools like Microsoft
Office or LibreOffice, ensuring reliability and control over the conversion process. Implement client-
side PDF to DOC conversion directly within your React application or workflow.
Key Features of PSPDFKit are :
● Implement headless DOC to PDF conversion using PSPDFKit for Web's technology stack.
Enable users to convert DOC or DOCX files to PDF format without a visual interface, making
the conversion process efficient and seamless.
● Allow users to open Word files directly in the PSPDFKit viewer or convert them to PDF format
as they are loaded in the viewer. Integrate DOC to PDF conversion directly into the document
viewing experience, providing users with a streamlined and unified workflow for document
processing.
● Enable users to edit text, manipulate pages, add annotations, and sign documents directly within
the web browser, enhancing collaboration and productivity.
● By integrating PSPDFKit for Web's PDF to DOC and DOC to PDF conversion capabilities into
your React project, you can empower users with a comprehensive document processing solution
that meets their needs for efficient and reliable file format conversion.

CHAPTER 4
IMPLEMENTATION

The chapter details the practical steps undertaken in the development of "Infiniti Script," a novel
OCR system designed to enhance handwritten text recognition through advanced neural network
architectures.
4.1 System Architecture Design
The architecture of "Infiniti Script" is strategically crafted to harness the synergistic potential of
CNNs for feature extraction, BiLSTMs for understanding the sequential nature of text, and CTC
for decoding sequences into textual output. The inclusion of SSD and MSER algorithms further
refines the system’s ability to detect and segment text accurately within images. This multi-
faceted approach ensures comprehensive processing of handwritten documents, from initial
image capture to final text transcription.

Figure 4.1 : OCR Data Augmentation and Evaluation Cycle

4.1.1 CNN-BiLSTM-CTC Architecture: The CNN extracts fine-grained features from images,
while the BiLSTM components learn the sequence of characters, accounting for both the forward
and backward contexts. The CTC layer then aligns the input sequence with the output labels,
allowing for efficient recognition of contiguous text without predefined segmentations.

4.1.2 SSD-MSER for Text Detection: SSD provides a robust framework for detecting textual
elements in a single forward pass of the network, making it highly efficient for real-time
applications. MSER contributes to this by isolating stable regions in the image, allowing the SSD
to locate textual lines with higher precision.
4.1.3 Data Augmentation and Error Correction: Through sophisticated data augmentation
techniques, the project simulates various handwriting styles and backgrounds, enriching the
training dataset and thereby enhancing model generalization.

4.2 Data Preparation and Preprocessing


Dataset Compilation: The IAM Handwriting Database, encompassing a diverse array of
handwriting styles, serves as the foundation for model training. This dataset is augmented with
transformations such as rotation and scaling.
Image Enhancement: Preprocessing techniques including normalization and denoising are
applied to each image, ensuring uniformity and clarity. These steps are crucial for optimizing the
data for subsequent neural network processing, facilitating accurate feature extraction.

4.3 Neural Network Training


Model Configuration: "Infiniti Script" utilizes a CNN-BiLSTM model structure, meticulously
configured with optimized parameters to balance efficiency and accuracy. This setup is crucial
for effectively processing the nuanced patterns of handwritten text.
Training and Optimization: The model undergoes extensive training, employing strategies such
as cyclical learning rates and early stopping to prevent overfitting. Continuous performance
evaluation guides the optimization process.

4.4 Feature Extraction and Character Recognition


CNN layers systematically extract defining features from the preprocessed images, identifying
critical patterns essential for character recognition. The extracted features are then processed by
BiLSTM networks.
4.4.1 Text Line Segmentation and Recognition
The SSD model, customized for text line detection, utilizes specially designed anchor boxes to
capture the diverse dimensions of handwritten text. In parallel, MSER algorithms segment the
text effectively, isolating individual characters and words for detailed analysis.
4.4.2 Decoding and Text Output Generation
CTC decoding translates the sequence of features into legible text, adeptly handling the variability
inherent in handwriting. The resulting text undergoes final processing, including error correction
and formatting, to produce a clear and accurate digital transcription.

4.4.3 Integration with Existing Systems and Technologies
"Infiniti Script" is developed with compatibility in mind, featuring APIs and SDKs for easy
integration into existing software ecosystems. Adherence to OCR standards and cross- platform
compatibility ensures the system can be seamlessly adopted across various applications.
4.4.4 Testing and Evaluation
Benchmark testing against leading OCR technologies demonstrates "Infiniti Script's" superior
performance, with specific emphasis on accuracy, processing speed, and the ability to recognize
complex handwriting styles.
4.4.5 Challenges Encountered and Solutions Implemented
The development of "Infiniti Script" presented unique challenges, including adapting to the high
variability of handwriting and ensuring model generalization. These challenges were addressed through strategic solutions such as advanced data augmentation and iterative training adjustments.

4.5 Development Tools and Environment


The "Infiniti Script" project is engineered using a combination of modern development tools and
programming languages, aiming to create a state-of-the-art OCR system for handwritten text
recognition. This section delves into the specifics of our development environment.
4.5.1 Integrated Development Environment (IDE)
● Jupyter Notebooks hosted on JupyterLab serve as our primary development platform.
This choice was driven by Jupyter's interactivity, allowing for immediate code execution,
result visualization, and comprehensive documentation—a necessity for our data-driven
project.
4.5.2 Programming Languages
● Python: The cornerstone of "Infiniti Script," chosen for its rich ecosystem of libraries
(MXNet, NumPy, OpenCV) that facilitate machine learning and image processing.
● Shell Scripting: Employed for the automation of setup and dependency installations,
including the configuration of SCTK for evaluation metrics.
4.5.3 Data Handling and Processing
● Pandas and NumPy: Utilized for efficient data manipulation and numerical operations,
crucial for handling the datasets involved in training our models.
● OpenCV: A key library for image preprocessing tasks such as resizing, grayscaling, and
noise reduction, ensuring our images are primed for recognition.
4.5.4 Machine Learning Framework
● Apache MXNet: Selected for its performance and scalability, MXNet underpins our
neural networks for various stages of OCR, from paragraph segmentation to text
recognition, with Gluon API facilitating dynamic computation and streamlined model
development.
4.5.5 Version Control
● Git: Facilitates collaborative development and source code management, with our project
repository hosted on GitHub for accessibility and progress tracking.

4.5.6 Model Training and Evaluation
● SCLITE: From SCTK, used for Word Error Rate (WER) evaluation, providing us with a
benchmark to measure our model's accuracy against.
● hnswlib: Implements efficient approximate nearest neighbor search, crucial for enhancing
the lexicon search phase in text recognition.
4.5.7 Dependency Management
● pip and conda: Ensure a consistent development environment by managing the project's
Python dependencies, crucial for reproducing our results and facilitating future project
scalability.
4.5.8 Visualization Tools
● Matplotlib and Seaborn: Integrated within our Jupyter Notebooks for generating
insightful visualizations of our data and model performance metrics.
4.5.9 Documentation and Collaboration
● Sphinx: Generates detailed documentation from our code's docstrings, making "Infiniti
Script" easily understandable and usable by future contributors.
● Markdown: Used to create READMEs and wikis on GitHub, providing clear instructions
and documentation for the project.
4.5.10 Testing and Continuous Integration
● pytest: Employed for writing comprehensive test cases, ensuring the reliability of our
code throughout the development process.
● Travis CI: Linked with our GitHub repository for continuous integration, running
automated tests to maintain code quality with every commit.
Hardware and Software Requirements
● NVIDIA GPUs: For accelerated model training.
● Ample RAM and Storage: To handle large datasets and model checkpoints, with a
minimum of 16GB RAM recommended.
● Cross-Platform OS Compatibility: Ensuring the project can be developed and deployed
across Linux, macOS, and Windows environments.
Conclusion
"Infiniti Script" represents a fusion of advanced OCR techniques and best practices in software
development to tackle the nuances of handwritten text recognition. Through the strategic
integration of diverse tools and methodologies, we aim not only for high accuracy in OCR tasks
but also for a project architecture that is scalable, maintainable, and open for future innovation.
4.6 Security and Privacy Measures
"Infiniti Script" employs stringent security protocols to protect sensitive data and ensure user
privacy.
● 4.6.1 Data Encryption: All data, both at rest and in transit, is encrypted using industry-
standard encryption algorithms to prevent unauthorized access. The encrypted data
includes images of handwritten texts and their respective converted digital texts.

● 4.6.2 Secure User Authentication: We utilize robust authentication mechanisms that
incorporate two-factor authentication (2FA) to prevent unauthorized system access.
● 4.6.3 Privacy-Preserving Data Handling: Adhering to principles of data minimization,
"Infiniti Script" collects only essential data required for the OCR process and ensures that
all data is anonymized to prevent the possibility of back- tracing to individual users.
● 4.6.4 Regulatory Compliance: The project is compliant with global data protection
regulations, such as GDPR, to ensure the privacy of all users is respected and maintained.
● 4.6.5 Periodic Security Audits: The system undergoes regular security audits to detect
and mitigate potential vulnerabilities, ensuring the application's defense mechanisms are
always up-to-date.
4.7 User Interface and Experience
The interface of "Infiniti Script" is designed for simplicity and ease of use, providing a seamless
experience from image upload to text translation.
● 4.7.1 Intuitive Design: The application's interface, created with Figma, offers a clear and
intuitive user journey, reducing the learning curve for new users and enhancing the overall
user satisfaction.
● 4.7.2 Accessibility Features: We follow the Web Content Accessibility Guidelines
(WCAG) to ensure our application is accessible to users with disabilities, with features
like text-to-speech and high-contrast modes.
● 4.7.3 Real-Time Feedback and Assistance: Users receive instant feedback during their
interaction with the app, along with access to a comprehensive FAQ and support system
to assist them at any point.

4.8 Monitoring and Updates


Continuous monitoring ensures "Infiniti Script" operates at peak efficiency, and regular updates
reflect the latest advances in OCR technology.

● 4.8.1 System Monitoring: Real-time analytics track system performance, allowing for
prompt identification and resolution of any technical issues.

● 4.8.2 User Engagement and Feedback: Active engagement with the user community
helps gather valuable feedback which is integral to the iterative development process.

● 4.8.3 Agile Development and Feature Expansion: The development team employs agile
methodologies to ensure rapid deployment of new features and regular updates based on
user feedback and emerging OCR advancements.

● 4.8.4 Security Patching and Compliance: Ongoing security updates and compliance
checks maintain the highest levels of system integrity and regulatory adherence.

● 4.8.5 Data Recovery Strategies: Robust backup and disaster recovery protocols are in
place to protect user data against unforeseen events, ensuring reliability and trust in the
"Infiniti Script" system.

4.9 Features Integration
This document explores various AI-powered tools that can handle different tasks related to text and
speech. It covers projects that can translate languages within PDFs, summarize text content, convert
text to spoken format and vice versa, extract text from handwritten documents, and even paraphrase
text. Additionally, it highlights an AI model from Google that tackles solving mathematical
expressions. Overall, this summary showcases the diverse capabilities of AI in handling our
information needs, from communication and content creation to processing different media formats.

4.9.1 Language Translation :


● PDF Text Extraction : Utilize Pdfplumber library to extract text content from the PDF
document. Iterate through each page of the PDF and extract text using Pdfplumber's
extract_text() method. Aggregate the extracted text from all pages into a single string.
● Text Translation with Google Translator : Use the Googletrans library to translate the
extracted text into the desired language. Initialize a Translator object from Googletrans. Call
the translate() method on the Translator object.
● Language Selection : Determine the target language for translation. Google translation
supports translation into various languages, and you can specify the target language by
providing its language code (e.g., 'fr' for French, 'de' for German, etc.)
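A minimal sketch of this translation workflow is given below, assuming the googletrans package (e.g. version 4.0.0rc1); the file name and language codes are placeholders.

import pdfplumber
from googletrans import Translator  # assumes a googletrans release exposing Translator().translate()

def translate_pdf(pdf_path, dest="fr"):
    # Extract text from each page and translate the combined string into the target language
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    translator = Translator()
    return translator.translate(text, dest=dest).text

print(translate_pdf("notes.pdf", dest="de"))  # 'de' for German, 'fr' for French, etc.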
4.9.2 Text Summarization

● GPT-3.5 Model : Utilize the GPT-3.5 architecture, a state-of-the-art natural language processing (NLP) model developed by OpenAI. GPT-3.5 is fine-tuned specifically for text
summarization tasks, enabling it to generate concise and coherent summaries from large
volumes of text.
● Turbo-Instruct Training Methodology : Employ the Turbo-Instruct training methodology
to fine-tune the GPT-3.5 model for text summarization. Turbo-Instruct provides targeted
instructions during model training, enhancing the model's performance on summarization.
● Integration with GPT-3.5 API : Interface with the GPT-3.5 API to access the model's text
summarization capabilities. Send input text to the GPT-3.5 API endpoint and receive
generated summaries as output.
● Output Handling and Evaluation : Decide how to handle the generated summaries output
by the GPT-3.5 model. Evaluate the quality and coherence of the generated summaries using
metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
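For the evaluation step, a small sketch using the rouge-score package is shown below; the reference and candidate strings are illustrative.

from rouge_score import rouge_scorer  # assumes the rouge-score package is installed

reference = "The OCR system converts handwritten pages into editable digital text."
candidate = "The system turns handwritten pages into editable text."

# ROUGE-1 measures unigram overlap; ROUGE-L measures the longest common subsequence
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)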

4.9.3 Text-to-Audio

● gTTs : Leverage the gTTs library, a Python wrapper for Google's TTS API, to convert text
into speech. gTTs provides a simple and efficient interface for generating high-quality speech
from text strings.
● Pdfplumber for Text Extraction : Utilize Pdfplumber library to extract text content from
PDF documents. Extract text from PDF files using Pdfplumber's text extraction capabilities.
Prepare the extracted text for conversion into speech using gTTs.

● Text-to-Audio Conversion Process : Combine the extracted text with gTTs to initiate the
text-to-speech conversion process. Pass the text input to gTTs along with parameters such as
language, voice, and audio format.

● Customization Options : Customize the audio output by selecting different voices and
adjusting parameters such as speech rate, pitch, and volume. Explore available options for
language and accent to tailor the speech synthesis.

4.9.4 Audio-to-text

● AssemblyAI : Utilize AssemblyAI, an advanced automatic speech recognition (ASR) service, to convert audio files into text. AssemblyAI offers state-of-the-art speech
recognition models trained on large datasets for accurate and reliable transcription.
● SpeechRecognition Library : Incorporate the SpeechRecognition library, a Python wrapper for various ASR engines and APIs, to perform audio-to-text conversion. SpeechRecognition provides a unified interface for interacting with different ASR services.
● Pydub for Audio Processing : Use the Pydub library for audio processing tasks such as audio file loading, format conversion, and manipulation. Pydub simplifies the handling of audio data by providing intuitive interfaces for common operations like splitting and merging.
● Transcription Workflow : Design a transcription workflow that incorporates both audio
processing with Pydub and transcription with AssemblyAI or SpeechRecognition.
Preprocess audio files as needed using Pydub, such as converting to the appropriate format
or segmenting longer recordings into shorter segments.
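A sketch of this workflow with the assemblyai SDK and pydub is given below; the API key and file names are placeholders, and the SDK interface is assumed from its published documentation.

import assemblyai as aai  # assumed assemblyai SDK interface
from pydub import AudioSegment

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder

def transcribe_long_recording(audio_path):
    # Normalize the recording to mono WAV with pydub, then send it to AssemblyAI
    wav_path = "normalized.wav"
    AudioSegment.from_file(audio_path).set_channels(1).export(wav_path, format="wav")
    transcript = aai.Transcriber().transcribe(wav_path)
    return transcript.text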

4.9.5 Text to Handwritten

● PIL (Python Imaging Library) : Employ the Python Imaging Library (PIL), now
maintained as the Pillow library, for image processing tasks including text rendering.
PIL/Pillow provides a comprehensive set of functions for creating and editing.
● Handwritten Font Selection : Choose an appropriate handwritten-style font to use for
rendering text as handwritten images. Explore available handwritten fonts or create custom
ones to achieve desired aesthetics and readability.
● Text Rendering Process : Utilize PIL/Pillow's text rendering functions to convert text strings into images. Specify parameters such as font size, color, and style to customize the appearance of the handwritten text.
● Customization Options : Experiment with various font styles, sizes, and colors to create
diverse handwritten text effects. Incorporate additional elements such as doodles,
decorations, or borders to enhance the handwritten appearance of the text images.
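A minimal sketch of rendering text with a handwriting-style TrueType font using Pillow is shown below; the font path, canvas size, and colors are illustrative placeholders.

from PIL import Image, ImageDraw, ImageFont

def render_handwritten(text, font_path="fonts/handwritten.ttf", size=48):
    # font_path is a placeholder; supply any handwriting-style TrueType font
    font = ImageFont.truetype(font_path, size)
    image = Image.new("RGB", (1200, 200), "white")
    draw = ImageDraw.Draw(image)
    draw.text((40, 60), text, font=font, fill="navy")
    image.save("handwritten_output.png")
    return image

render_handwritten("Infiniti Script turns typed notes back into handwriting.")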

4.9.6 Paraphraser

● PegasusTokenizer Model : Utilize the PegasusTokenizer model, a component of the PEGASUS framework developed by Google Research, for text paraphrasing tasks.
PegasusTokenizer is specifically designed for tokenizing text inputs for tasks such as
paraphrasing and abstractive summarization.
● Torch Framework : Incorporate the Torch framework, an open-source machine learning
library, for fine-tuning and utilizing the Pegasus model. Torch provides a flexible platform
for building and training deep learning model.
● Pdfplumber for Text Extraction : Utilize Pdfplumber library to extract text content from
documents, including PDF files. Extract text from PDF documents using Pdfplumber's text
extraction capabilities, preparing it for paraphrasing with the Pegasus model.
● Paraphrasing Workflow : Design a paraphrasing workflow that combines text extraction
with Pdfplumber, tokenization with PegasusTokenizer, and model inference with Torch.
Extract text from input documents using Pdfplumber and preprocess it using
PegasusTokenizer for tokenization.

4.9.7 Mathematical expression solving

● Google Generative AI : Leverage Google Generative AI technologies, including deep learning models and algorithms, for solving mathematical expressions. Google Generative
AI encompasses a range of machine learning techniques aimed at generating creative and
intelligent outputs, including those related to mathematical problem-solving.
● Mathematical Expression Representation : Represent mathematical expressions in a
format suitable for input to Generative AI models. Encode mathematical symbols, operators,
and numerical values in a structured format that can be interpreted by the Generative AI
system.
● Generative AI Model Training : Train Generative AI models on mathematical expression
solving tasks using appropriate datasets. Utilize supervised learning approaches to train
models on labeled examples of mathematical expressions and their solutions.
● Model Inference and Solution Generation : Perform inference with trained Generative AI
models to generate solutions for input mathematical expressions. Input mathematical
expressions into the trained models and obtain corresponding solutions as output.
● Output Handling and Integration : Determine how to handle the solutions generated by
the Generative AI system. Provide mechanisms for presenting solutions to users in a human-
readable format, such as mathematical notation or numerical values.
4.9.8 Exporting to Various File Formats :

● React Framework : Utilize React, a popular JavaScript library for building user interfaces,
as the foundation for developing export functionality. Leverage React's component-based
architecture and declarative programming model.
● jsPDF Library : Incorporate jsPDF, a JavaScript library for generating PDF documents on
the client-side, for exporting content to PDF format.
● Exporting Data : Define the data to be exported, such as text content, tabular data, or
graphical elements, within the React application.
● PDF Generation Process : Implement a PDF generation process using jsPDF within the
React application. Instantiate a jsPDF instance and use its methods to add content, styles,
and formatting to the PDF document.
● Customization Options : Customize the appearance and layout of the exported PDF
document using jsPDF's styling and formatting capabilities. Explore options for adding
headers, footers, page numbers, and other elements to enhance the readability.
● Integration with React Components : Integrate the PDF export functionality into React
components, allowing users to initiate and control the export process within the application.
Create UI components for triggering the export action.

4.10 UML Diagrams :


Unified Modeling Language (UML) is a general-purpose modeling language. The main aim of UML
is to define a standard way to visualize the way a system has been designed. It is quite similar to
blueprints used in other fields of engineering. UML is not a programming language; rather, it is a visual language. We use UML diagrams to portray the behavior and structure of a system. UML helps
software engineers, businessmen, and system architects with modeling, design, and analysis.
There are two broad categories of diagrams and they are again divided into subcategories :
● Structural Diagrams
● Behavioral Diagrams

Use Case Diagrams : Use Case Diagrams are used to depict the functionality of a system or a part
of a system. They are widely used to illustrate the functional requirements of the system and its
interaction with external agents(actors).

● A use case is basically a diagram representing different scenarios where the system can be
used.
● A use case diagram gives us a high level view of what the system or a part of the system does
without going into implementation details.
Sequence Diagram : A sequence diagram simply depicts interaction between objects in a sequential
order i.e. the order in which these interactions take place.
● We can also use the terms event diagrams or event scenarios to refer to a sequence diagram.
● Sequence diagrams describe how and in what order the objects in a system function.

Tools for creating UML Diagrams :


There are several tools available for creating Unified Modeling Language (UML) diagrams, which
are commonly used in software development to visually represent system architecture, design, and
implementation. Here are some popular tools for creating UML diagrams:

● Draw.io: Draw.io is a free, web-based diagramming tool that supports various diagram
types, including UML. It integrates with various cloud storage services and can be used
offline.
● Visual Paradigm: Visual Paradigm provides a comprehensive suite of tools for software
development, including UML diagramming. It offers both online and desktop versions and
supports a wide range of UML diagrams.

4.10.1 Use Case Diagram :
A use case diagram is a powerful tool in software engineering for capturing the functional
requirements of a system from the user's perspective. Here's a detailed description of a use case
diagram:

● Use case diagrams depict the interactions between users (actors) and the system,
showcasing the system's functionalities and how users interact with it.
● Actors represent different types of users or external systems that interact with the system to
achieve specific goals or tasks.
● Use cases represent individual functionalities or tasks that users can perform within the
system, often depicted as ovals within the diagram.

Figure 4.2 : Use Case Diagram

4.10.2 Class Diagram :
A class diagram is a fundamental component of software design, providing a visual representation of
the static structure of a system. Here's a detailed description of a class diagram:

● Class diagrams depict the structure of the system by illustrating the classes, attributes,
methods, and their relationships.
● Classes represent the building blocks of the system and encapsulate data and behavior into a
single unit.
● Attributes are the properties or characteristics of a class, representing the state of objects
belonging to that class.
● Methods define the behavior or operations that objects of a class can perform, encapsulating
the logic and functionality of the system.

Figure 4.3 : Class Diagram

Figure 4.4 : Complete Functionalities of Class Diagram

4.10.3 Object Diagram :
An object diagram is a structural diagram that provides a snapshot of the objects and their
relationships at a particular moment in time within a system. Here's a detailed description of an object
diagram:
● Representation of Instances: Object diagrams depict instances of classes and their
relationships in a specific scenario or context.
● Instantiation of Classes: Each object in the diagram represents a specific instance of a class,
showing the state of the system at a particular point in its execution.
● Objects and Attributes: Objects are represented as rectangles, with the name of the object
written inside, while their attributes and values are listed beneath them.
● Relationships between Objects: Relationships between objects are represented by lines
connecting them, indicating associations, dependencies, aggregations, or compositions.

Figure 4.5 : Object Diagram

4.10.4 Sequence Diagram :
A sequence diagram is a type of interaction diagram that visualizes the interactions between objects
in a specific scenario or sequence of events within a system. Here's a detailed description of a
sequence diagram:
● Illustration of Interactions: Sequence diagrams illustrate the interactions between objects or
components in a system over time, showcasing the flow of messages exchanged between
them.
● Time-based Representation: The vertical axis of the diagram represents time, with objects or
components arranged horizontally to indicate their order of execution.
● Object Lifelines: Each object participating in the interaction is represented by a lifeline,
depicted as a vertical dashed line extending downward from the object's name.
● Activation Bars: Activation bars, also known as execution occurrences, represent the duration
during which an object is active or engaged in processing a message.

Figure 4.6 : Sequence Diagram

4.11 Code :
Importing Required Modules
import difflib
import importlib
import math
import random
import string
random.seed(123)
import cv2
import gluonnlp as nlp
import leven
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import mxnet as mx
import numpy as np
from skimage import transform as skimage_tf, exposure
from tqdm import tqdm
from ocr.utils.expand_bounding_box import expand_bounding_box
from ocr.utils.sclite_helper import ScliteHelper
from ocr.utils.word_to_line import sort_bbs_line_by_line, crop_line_images
from ocr.utils.iam_dataset import IAMDataset, resize_image, crop_image, crop_handwriting_page
from ocr.utils.encoder_decoder import Denoiser, ALPHABET, encode_char, decode_char, EOS,
BOS
from ocr.utils.beam_search import ctcBeamSearch
import ocr.utils.denoiser_utils
import ocr.utils.beam_search
importlib.reload(ocr.utils.denoiser_utils)
from ocr.utils.denoiser_utils import SequenceGenerator
importlib.reload(ocr.utils.beam_search)
from ocr.utils.beam_search import ctcBeamSearch
from ocr.paragraph_segmentation_dcnn import SegmentationNetwork,
paragraph_segmentation_transform
from ocr.word_and_line_segmentation import SSD as WordSegmentationNet,
predict_bounding_boxes
from ocr.handwriting_line_recognition import Network as HandwritingRecognitionNet,
handwriting_recognition_transform
from ocr.handwriting_line_recognition import decode as decoder_handwriting, alphabet_encoding
Dataset Creation :
test_ds = IAMDataset("form_original", train=False)
random.seed(1)
figs_to_plot = 4
images = []
n = 0
for i in range(0, figs_to_plot):
    n = int(random.random() * len(test_ds))
    image, _ = test_ds[n]
    images.append(image)

fig, axs = plt.subplots(int(len(images)/2), 2, figsize=(15, 10 * len(images)/2))
for i, image in enumerate(images):
    y, x = int(i/2), int(i%2)
    axs[y, x].imshow(image, cmap='Greys_r')
    axs[y, x].axis('off')
Paragraph Segmentation :
paragraph_segmentation_net = SegmentationNetwork(ctx=ctx)
paragraph_segmentation_net.cnn.load_parameters("models/paragraph_segmentation2.params",
ctx=ctx)
paragraph_segmentation_net.hybridize()
form_size = (1120, 800)
predicted_bbs = []
fig, axs = plt.subplots(int(len(images)/2), 2, figsize=(15, 9 * len(images)/2))
for i, image in enumerate(images):
s_y, s_x = int(i/2), int(i%2)
resized_image = paragraph_segmentation_transform(image, form_size)
bb_predicted = paragraph_segmentation_net(resized_image.as_in_context(ctx))
bb_predicted = bb_predicted[0].asnumpy()
bb_predicted = expand_bounding_box(bb_predicted, expand_bb_scale_x=0.03,
expand_bb_scale_y=0.03)
predicted_bbs.append(bb_predicted)
axs[s_y, s_x].imshow(image, cmap='Greys_r')
axs[s_y, s_x].set_title("{}".format(i))
(x, y, w, h) = bb_predicted
image_h, image_w = image.shape[-2:]
(x, y, w, h) = (x * image_w, y * image_h, w * image_w, h * image_h)
rect = patches.Rectangle((x, y), w, h, fill=False, color="r", ls="--")
axs[s_y, s_x].add_patch(rect)
axs[s_y, s_x].axis('off')
Image Processing :
segmented_paragraph_size = (700, 700)
fig, axs = plt.subplots(int(len(images)/2), 2, figsize=(15, 9 * len(images)/2))
paragraph_segmented_images = []
for i, image in enumerate(images):
s_y, s_x = int(i/2), int(i%2)
bb = predicted_bbs[i]
image = crop_handwriting_page(image, bb, image_size=segmented_paragraph_size)
paragraph_segmented_images.append(image)
axs[s_y, s_x].imshow(image, cmap='Greys_r')
axs[s_y, s_x].axis('off')

Line/word segmentation :
word_segmentation_net = WordSegmentationNet(2, ctx=ctx)
word_segmentation_net.load_parameters("models/word_segmentation2.params")
word_segmentation_net.hybridize()
min_c = 0.1
overlap_thres = 0.1
topk = 600
fig, axs = plt.subplots(int(len(paragraph_segmented_images)/2), 2,
figsize=(15, 5 * int(len(paragraph_segmented_images)/2)))
predicted_words_bbs_array = []
for i, paragraph_segmented_image in enumerate(paragraph_segmented_images):
s_y, s_x = int(i/2), int(i%2)
predicted_bb = predict_bounding_boxes(
word_segmentation_net, paragraph_segmented_image, min_c, overlap_thres, topk, ctx)
predicted_words_bbs_array.append(predicted_bb)
axs[s_y, s_x].imshow(paragraph_segmented_image, cmap='Greys_r')
for j in range(predicted_bb.shape[0]):
(x, y, w, h) = predicted_bb[j]
image_h, image_w = paragraph_segmented_image.shape[-2:]
(x, y, w, h) = (x * image_w, y * image_h, w * image_w, h * image_h)
rect = patches.Rectangle((x, y), w, h, fill=False, color="r")
axs[s_y, s_x].add_patch(rect)
axs[s_y, s_x].axis('off')
Word to line image processing :
line_images_array = []
fig, axs = plt.subplots(int(len(paragraph_segmented_images)/2), 2,
figsize=(15, 9 * int(len(paragraph_segmented_images)/2)))
for i, paragraph_segmented_image in enumerate(paragraph_segmented_images):
s_y, s_x = int(i/2), int(i%2)
axs[s_y, s_x].imshow(paragraph_segmented_image, cmap='Greys_r')
axs[s_y, s_x].axis('off')
axs[s_y, s_x].set_title("{}".format(i))
predicted_bbs = predicted_words_bbs_array[i]
line_bbs = sort_bbs_line_by_line(predicted_bbs, y_overlap=0.4)
line_images = crop_line_images(paragraph_segmented_image, line_bbs)
line_images_array.append(line_images)
for line_bb in line_bbs:
(x, y, w, h) = line_bb
image_h, image_w = paragraph_segmented_image.shape[-2:]
(x, y, w, h) = (x * image_w, y * image_h, w * image_w, h * image_h)
rect = patches.Rectangle((x, y), w, h, fill=False, color="r")
axs[s_y, s_x].add_patch(rect)

Handwriting recognition :
handwriting_line_recognition_net = HandwritingRecognitionNet(rnn_hidden_states=512,
rnn_layers=2, ctx=ctx, max_seq_len=160)
handwriting_line_recognition_net.load_parameters("models/handwriting_line8.params", ctx=ctx)
handwriting_line_recognition_net.hybridize()
line_image_size = (60, 800)
character_probs = []
for line_images in line_images_array:
form_character_prob = []
for i, line_image in enumerate(line_images):
line_image = handwriting_recognition_transform(line_image, line_image_size)
line_character_prob = handwriting_line_recognition_net(line_image.as_in_context(ctx))
form_character_prob.append(line_character_prob)
character_probs.append(form_character_prob)
Denoising the text output :
FEATURE_LEN = 150
denoiser = Denoiser(alphabet_size=len(ALPHABET), max_src_length=FEATURE_LEN,
max_tgt_length=FEATURE_LEN, num_heads=16, embed_size=256, num_layers=2)
denoiser.load_parameters('models/denoiser2.params', ctx=ctx)
denoiser.hybridize(static_alloc=True)
We use a language model in order to rank the propositions from the denoiser
ctx_nlp = mx.gpu(3)
language_model, vocab = nlp.model.big_rnn_lm_2048_512(dataset_name='gbw', pretrained=True,
ctx=ctx_nlp)
moses_tokenizer = nlp.data.SacreMosesTokenizer()
moses_detokenizer = nlp.data.SacreMosesDetokenizer()
We use beam search to sample the output of the denoiser
beam_sampler = nlp.model.BeamSearchSampler(beam_size=20,
decoder=denoiser.decode_logprob,
eos_id=EOS,
scorer=nlp.model.BeamSearchScorer(),
max_length=150)
generator = SequenceGenerator(beam_sampler, language_model, vocab, ctx_nlp, moses_tokenizer,
moses_detokenizer)
def get_denoised(prob, ctc_bs=False):
    if ctc_bs:  # Using CTC beam search before denoising yields only limited improvements and is very slow
        text = get_beam_search(prob)
    else:
        text = get_arg_max(prob)
    src_seq, src_valid_length = encode_char(text)
    src_seq = mx.nd.array([src_seq], ctx=ctx)
    src_valid_length = mx.nd.array(src_valid_length, ctx=ctx)
    encoder_outputs, _ = denoiser.encode(src_seq, valid_length=src_valid_length)
    states = denoiser.decoder.init_state_from_encoder(encoder_outputs,
                                                      encoder_valid_length=src_valid_length)
    inputs = mx.nd.full(shape=(1,), ctx=src_seq.context, dtype=np.float32, val=BOS)
    output = generator.generate_sequences(inputs, states, text)
    return output.strip()
sentence = "This sentnce has an eror"
src_seq, src_valid_length = encode_char(sentence)
src_seq = mx.nd.array([src_seq], ctx=ctx)
src_valid_length = mx.nd.array(src_valid_length, ctx=ctx)
encoder_outputs, _ = denoiser.encode(src_seq, valid_length=src_valid_length)
states = denoiser.decoder.init_state_from_encoder(encoder_outputs,
encoder_valid_length=src_valid_length)
inputs = mx.nd.full(shape=(1,), ctx=src_seq.context, dtype=np.float32, val=BOS)
print(sentence)
print("Choice")
print(generator.generate_sequences(inputs, states, sentence))
Qualitative Result :
for i, form_character_probs in enumerate(character_probs):
fig, axs = plt.subplots(len(form_character_probs) + 1,
figsize=(10, int(1 + 2.3 * len(form_character_probs))))
for j, line_character_probs in enumerate(form_character_probs):
decoded_line_am = get_arg_max(line_character_probs)
print("[AM]",decoded_line_am)
decoded_line_bs = get_beam_search(line_character_probs)
decoded_line_denoiser = get_denoised(line_character_probs, ctc_bs=False)
print("[D ]",decoded_line_denoiser)
line_image = line_images_array[i][j]
axs[j].imshow(line_image.squeeze(), cmap='Greys_r')
axs[j].set_title("[AM]: {}\n[BS]: {}\n[D ]: {}\n\n".format(decoded_line_am, decoded_line_bs,
decoded_line_denoiser), fontdict={"horizontalalignment":"left", "family":"monospace"}, x=0)
axs[j].axis('off')
axs[-1].imshow(np.zeros(shape=line_image_size), cmap='Greys_r')
axs[-1].axis('off')
Quantitative Results :
sclite = ScliteHelper('../SCTK/bin')
def get_qualitative_results_lines(denoise_func):
    sclite.clear()
    test_ds_line = IAMDataset("line", train=False)
    for i in tqdm(range(1, len(test_ds_line))):
        image, text = test_ds_line[i]
        line_image = exposure.adjust_gamma(image, 1)
        line_image = handwriting_recognition_transform(line_image, line_image_size)
        character_probabilities = handwriting_line_recognition_net(line_image.as_in_context(ctx))
        decoded_text = denoise_func(character_probabilities)
        actual_text = text[0].replace("&quot;", '"').replace("&apos;", "'").replace("&amp;", "&")
        sclite.add_text([decoded_text], [actual_text])
    cer, er = sclite.get_cer()
    print("Mean CER = {}".format(cer))
    return cer
def get_qualitative_results(denoise_func):
    sclite.clear()
    for i in tqdm(range(1, len(test_ds))):
        image, text = test_ds[i]
        resized_image = paragraph_segmentation_transform(image, image_size=form_size)
        paragraph_bb = paragraph_segmentation_net(resized_image.as_in_context(ctx))
        paragraph_bb = paragraph_bb[0].asnumpy()
        paragraph_bb = expand_bounding_box(paragraph_bb, expand_bb_scale_x=0.01,
                                           expand_bb_scale_y=0.01)
        paragraph_segmented_image = crop_handwriting_page(image, paragraph_bb,
                                                          image_size=segmented_paragraph_size)
        word_bb = predict_bounding_boxes(word_segmentation_net, paragraph_segmented_image,
                                         min_c, overlap_thres, topk, ctx)
        line_bbs = sort_bbs_line_by_line(word_bb, y_overlap=0.4)
        line_images = crop_line_images(paragraph_segmented_image, line_bbs)
        predicted_text = []
        for line_image in line_images:
            line_image = exposure.adjust_gamma(line_image, 1)
            line_image = handwriting_recognition_transform(line_image, line_image_size)
            character_probabilities = handwriting_line_recognition_net(line_image.as_in_context(ctx))
            decoded_text = denoise_func(character_probabilities)
            predicted_text.append(decoded_text)
        actual_text = text[0].replace("&quot;", '"').replace("&apos;", "'").replace("&amp;", "&")
        actual_text = actual_text.split("\n")
        if len(predicted_text) > len(actual_text):
            predicted_text = predicted_text[:len(actual_text)]
        sclite.add_text(predicted_text, actual_text)
    cer, _ = sclite.get_cer()
    print("Mean CER = {}".format(cer))
    return cer

Figure 4.7 : Quantitative Results

Training :
net = CNNBiLSTM(num_downsamples=num_downsamples, resnet_layer_id=resnet_layer_id ,
rnn_hidden_states=lstm_hidden_states, rnn_layers=lstm_layers, max_seq_len=max_seq_len,
ctx=ctx)
net.hybridize()
ctc_loss = gluon.loss.CTCLoss(weight=0.2)
best_test_loss = 10e5
if (os.path.isfile(os.path.join(checkpoint_dir, checkpoint_name))):
net.load_parameters(os.path.join(checkpoint_dir, checkpoint_name))
print("Parameters loaded")
print(run_epoch(0, net, test_data, None, log_dir, print_name="pretrained", is_train=False))
pretrained = "models/handwriting_line8.params"
if (os.path.isfile(pretrained)):
net.load_parameters(pretrained, ctx=ctx)
print("Parameters loaded")
print(run_epoch(0, net, test_data, None, log_dir, print_name="pretrained", is_train=False))
Parameters loaded
3.129511170468088
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': learning_rate})
for e in range(epochs):
train_loss = run_epoch(e, net, train_data, trainer, log_dir, print_name="train", is_train=True)
test_loss = run_epoch(e, net, test_data, trainer, log_dir, print_name="test", is_train=False)
if test_loss < best_test_loss:
print("Saving network, previous best test loss {:.6f}, current test loss
{:.6f}".format(best_test_loss, test_loss))
net.save_parameters(os.path.join(checkpoint_dir, checkpoint_name))
best_test_loss = test_loss
if e % print_every_n == 0 and e > 0:
print("Epoch {0}, train_loss {1:.6f}, test_loss {2:.6f}".format(e, train_loss, test_loss))

Model Distance between characters :
import numpy as np
import mxnet as mx
import difflib
from ocr.handwriting_line_recognition import Network as BiLSTMNetwork, decode as topK_decode
from ocr.utils.noisy_forms_dataset import Noisy_forms_dataset
from ocr.utils.ngram_dataset import Ngram_dataset
from ocr.utils.iam_dataset import resize_image

Decode noisy forms :
line_image_size = (60, 800)

def handwriting_recognition_transform(image):
    image, _ = resize_image(image, line_image_size)
    image = mx.nd.array(image) / 255.
    image = (image - 0.942532484060557) / 0.15926149044640417
    image = image.as_in_context(ctx)
    image = image.expand_dims(0).expand_dims(0)
    return image

def get_ns(is_train):
    network = BiLSTMNetwork(rnn_hidden_states=512, rnn_layers=2, max_seq_len=160, ctx=ctx)
    network.load_parameters("models/handwriting_line_sl_160_a_512_o_2.params", ctx=ctx)

    def noise_source_transform(image, text):
        image = handwriting_recognition_transform(image)
        output = network(image)
        predict_probs = output.softmax().asnumpy()
        return predict_probs

    ns = Noisy_forms_dataset(noise_source_transform, train=is_train, name="OCR_noise2",
                             topK_decode=topK_decode)
    return ns
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
train_ns = get_ns(is_train=True)
ng_train_ds = Ngram_dataset(train_ns, "word_5train", output_type="word", n=5)
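
The loop below consumes a substitutions list that is not defined anywhere in this listing. A minimal sketch of the missing step, assuming each Ngram_dataset sample exposes a noisy decoded word together with its ground-truth word (the exact sample layout is an assumption), is to align the two strings with difflib and record each character replacement:

# Hypothetical reconstruction of the step that builds `substitutions`
# (not shown in this listing). Assumes each dataset sample exposes a noisy
# OCR-decoded word and the corresponding ground-truth word.
substitutions = []
for sample in ng_train_ds:
    noisy_word, actual_word = sample[-2], sample[-1]   # assumed positions in the sample tuple
    matcher = difflib.SequenceMatcher(None, noisy_word, actual_word)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for a, b in zip(noisy_word[i1:i2], actual_word[j1:j2]):
                substitutions.append((a, b))   # (predicted char, true char) pairs
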
substitution_dict = {}
for subs in substitutions:
    if subs not in substitution_dict:
        substitution_dict[subs] = 0
    substitution_dict[subs] += 1
print(substitution_dict)

substitute_costs = np.ones((128, 128), dtype=np.float64)
for key in substitution_dict:
    key1, key2 = key
    substitute_costs[ord(key1), ord(key2)] = 0.9 if substitution_dict[key] <= 4 else 0.8
print(substitute_costs)
np.savetxt("models/substitute_costs.txt", substitute_costs, fmt='%4.6f')

CHAPTER 5
RESULTS

The results of our Optical Character Recognition (OCR) system showcase its robust
performance in text recognition tasks. Through iterative benchmarking and refinement, we
achieved significant milestones in accuracy assessment. Notably, our system demonstrated a
mean Character Error Rate (CER) of 8.4 on pre-segmented lines, indicative of its strong capability
in interpreting handwritten text. However, full pipeline testing revealed areas for potential
enhancements, with a recorded mean CER of 11.6. Quantitatively, our model displayed
consistency between training and real-world application scenarios, with a Training Intersection
over Union (IoU) of 0.593 and a Test IoU of 0.573.
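
Here, CER is the character-level edit distance between the decoded text and the ground truth divided by the length of the ground truth, typically reported as a percentage; in our pipeline it is computed by the SCLITE tooling wrapped by sclite.get_cer(). A minimal, self-contained illustration of the metric (not the project's actual scoring code):

# Illustration only: character error rate = edit distance / reference length.
def levenshtein(a, b):
    # classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def cer(predicted, reference):
    return 100.0 * levenshtein(predicted, reference) / max(len(reference), 1)

print(cer("he1lo world", "hello world"))   # one substitution in 11 characters -> about 9.1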

Figure 5.1 : CER Over Training Epochs

These metrics underscore the system's proficiency and generalizability, essential traits for
robust OCR systems. Our approach encompassed various precision enhancements, including
rigorous training using the IAM Handwriting Database and advanced line and word detection
techniques employing custom anchor boxes. Furthermore, innovative decoding methods, such as
employing a CNN-BiLSTM system with LSTM for probabilistic predictions and CTC loss for
sequence alignment, significantly contributed to accuracy improvements. Comparative analysis
of decoding methods highlighted the efficacy of Beam Search with Lexicon Search and Language
Model, achieving the lowest CER at 21.058. Lastly, our system's user interface integration
demonstrated practicality and efficiency in handling text conversion tasks, further enhancing its
usability. These results collectively underscore the effectiveness of our OCR system and provide
valuable insights for future advancements in the field.
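
To make the decoding comparison concrete, the simplest baseline that the beam-search variants improve on is greedy (best-path) CTC decoding: take the most probable character at each time step, collapse consecutive repeats, and drop the CTC blank symbol. A minimal sketch (the character set and blank index below are illustrative assumptions, not the project's exact alphabet):

import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz"   # illustrative character set
BLANK = len(ALPHABET)                      # assume the CTC blank is the last class index

def greedy_ctc_decode(char_probs):
    # char_probs: (seq_len, num_classes) softmax output for one text line
    best_path = np.argmax(char_probs, axis=1)
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != BLANK:   # collapse repeats, skip blanks
            decoded.append(ALPHABET[idx])
        prev = idx
    return "".join(decoded)
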
Figure 5.2 : Outcome of Optical Character Recognition.

The language translation feature in our project leverages the capabilities of Pdfplumber and
Google Translator to facilitate seamless translation of text content extracted from PDF documents
into different languages. This feature enhances accessibility and usability by enabling users to
overcome language barriers and access information in their preferred language. Pdfplumber is
employed to extract text content from PDF documents uploaded by users. The extracted text serves
as the input for the translation process, ensuring accurate and comprehensive translation results.

Figure 5.3 : Outcome of Language Translation

Google Translator, a language translation service with a reported accuracy of 82.5%, is integrated into the project to perform the actual translation of the extracted text. The extracted text is passed to Google Translator's API along with the target language specified by the user. Upon receiving the translated text from Google Translator, it is presented to the user through the project's interface. Users have the flexibility to select their desired target language, enabling them to overcome language barriers effectively.
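
A minimal sketch of this extract-then-translate pipeline, assuming the deep-translator package's GoogleTranslator wrapper (the report does not name the exact client library, and the file name and language code are illustrative):

import pdfplumber
from deep_translator import GoogleTranslator

def translate_pdf(pdf_path, target_lang="hi"):
    # Pull the raw text out of every page of the uploaded PDF.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # Hand the extracted text to Google Translate via deep-translator;
    # very long documents may need to be split into chunks first.
    return GoogleTranslator(source="auto", target=target_lang).translate(text)

print(translate_pdf("uploaded_document.pdf", target_lang="te"))   # hypothetical input file
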
The text summarization feature in our project harnesses the power of GPT-3.5 Turbo-Instruct,
a cutting-edge natural language processing (NLP) model, to generate concise and coherent summaries
of text content. This section presents the results obtained from implementing text summarization
using GPT-3.5 Turbo-Instruct and discusses its impact on the project.

Figure 5.4 : Outcome of Text Summarization

GPT-3.5 Turbo-Instruct is seamlessly integrated into the project's workflow to perform text
summarization tasks. The model is fine-tuned using Turbo-Instruct methodology to optimize its
performance for summarization, ensuring high-quality summary generation. Text inputs provided to
GPT-3.5 Turbo-Instruct are processed to generate summaries that capture the essence of the original
content. The generated summaries are presented to users through the project's interface, providing
them with succinct overviews of the input text.
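
A minimal sketch of how such a summarization call might look with the OpenAI Python client (version 1.x); the prompt wording, token limit, and environment-variable API key are illustrative assumptions rather than the project's exact configuration:

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def summarize(text, max_tokens=256):
    # gpt-3.5-turbo-instruct is served through the legacy completions endpoint.
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt="Summarize the following text in a short, coherent paragraph:\n\n" + text,
        max_tokens=max_tokens,
        temperature=0.3,
    )
    return response.choices[0].text.strip()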

Figure 5.5 : Outcome of Text to Audio

The text-to-speech (TTS) feature in our project integrates the capabilities of gTTS (Google Text-to-Speech) and Pdfplumber to convert text content extracted from PDF documents into audio output. This section discusses the results obtained from implementing text-to-speech functionality using gTTS and Pdfplumber and evaluates its impact on the project.

Pdfplumber is utilized to extract text content from PDF documents uploaded by users. The extracted text serves as the input for the text-to-speech conversion process, ensuring accurate and comprehensive audio output. gTTS, Google's Text-to-Speech API, is seamlessly integrated into the project's workflow to perform the conversion. The extracted text is passed to gTTS along with parameters such as language, voice, and audio format to produce the audio output.
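
A minimal sketch of the PDF-to-audio path, assuming gTTS's default MP3 output (the file names and language code are illustrative):

import pdfplumber
from gtts import gTTS

def pdf_to_audio(pdf_path, mp3_path="output.mp3", lang="en"):
    # Extract the text to be spoken from the uploaded PDF.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # gTTS sends the text to Google's TTS service and saves the returned MP3.
    gTTS(text=text, lang=lang, slow=False).save(mp3_path)
    return mp3_path
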
The audio-to-text feature in our project integrates AssemblyAI, Speech Recognition, and Pydub to
transcribe audio content into text format. This section presents the results obtained from
implementing audio-to-text functionality using these technologies and evaluates its impact on the
project.

Pydub is employed for audio processing tasks such as loading audio files, format conversion, and manipulation. Audio files are pre-processed using Pydub to ensure compatibility and optimal quality before transcription. AssemblyAI, with a reported accuracy of 90%, is utilized for accurate and reliable transcription of audio content. The audio data pre-processed with Pydub is submitted to AssemblyAI's cloud-based API for transcription. The SpeechRecognition library is incorporated to provide a unified interface for interacting with different ASR engines and APIs, offering flexibility and compatibility with multiple ASR services and enhancing the transcription process.
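
A minimal sketch of the pre-processing and transcription steps, assuming the official assemblyai Python SDK (the API key, file names, and 16 kHz mono normalization are illustrative assumptions):

import assemblyai as aai
from pydub import AudioSegment

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"   # placeholder

def transcribe(audio_path):
    # Normalize the upload to 16 kHz mono WAV with Pydub before submission.
    sound = AudioSegment.from_file(audio_path)
    sound = sound.set_channels(1).set_frame_rate(16000)
    sound.export("prepared.wav", format="wav")
    # Submit the prepared file to AssemblyAI's cloud API and wait for the transcript.
    transcript = aai.Transcriber().transcribe("prepared.wav")
    return transcript.text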

Figure 5.6 : Outcome of Audio to Text

The text-to-handwritten feature in our project utilizes the Python Imaging Library (PIL), now
known as Pillow, to generate handwritten-style images from text input. This section presents the
results obtained from implementing text-to-handwritten functionality using PIL and evaluates its
impact on the project. PIL is seamlessly integrated into the project's workflow to facilitate the
generation of handwritten-style images from text. The library's image manipulation capabilities are
leveraged to create visually appealing handwritten text representations. Handwritten-style fonts are
carefully selected to achieve the desired aesthetics and readability for the generated images. Various
handwritten fonts are explored and customized to ensure consistency and visual appeal across
different text inputs.
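
A minimal sketch of rendering text in a handwriting-style TrueType font with Pillow (the font file, canvas size, and colours are illustrative assumptions):

from PIL import Image, ImageDraw, ImageFont

def text_to_handwritten(text, out_path="handwritten.png"):
    font = ImageFont.truetype("fonts/handwriting.ttf", size=36)   # hypothetical font file
    canvas = Image.new("RGB", (1200, 800), color="white")
    draw = ImageDraw.Draw(canvas)
    # Draw the text line by line in dark-blue "ink", leaving a page margin.
    y = 50
    for line in text.split("\n"):
        draw.text((60, y), line, font=font, fill=(20, 30, 90))
        y += 60
    canvas.save(out_path)
    return out_path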

The paraphrasing feature in our project employs the Pegasus Tokenizer model, Torch, and
Pdfplumber to generate paraphrased versions of text content. This section presents the results
obtained from implementing paraphrasing functionality using these technologies and evaluates its
impact on the project.

The Pegasus Tokenizer model is utilized for paraphrasing text by generating alternative phrasings while preserving the original meaning. The model is fine-tuned to optimize its performance for paraphrasing tasks, ensuring accurate and contextually appropriate output. Torch, a powerful deep learning framework, is integrated into the project's workflow to support the implementation and deployment of the Pegasus model, providing the necessary infrastructure for training, inference, and optimization. Pdfplumber is used for text extraction from PDF documents, providing a source of text content for paraphrasing. The extracted text serves as input to the paraphrasing model, enabling the generation of paraphrased versions of PDF content.
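
A minimal sketch using Hugging Face's Pegasus classes with a publicly available paraphrasing checkpoint; the checkpoint name tuner007/pegasus_paraphrase and the generation settings are assumptions, since the report does not state which weights were used:

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"   # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)

def paraphrase(sentence, num_return_sequences=3):
    # Tokenize one sentence and generate several candidate rewordings with beam search.
    batch = tokenizer([sentence], truncation=True, padding="longest",
                      max_length=60, return_tensors="pt").to(device)
    outputs = model.generate(**batch, max_length=60, num_beams=10,
                             num_return_sequences=num_return_sequences)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)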

Figure 5.7 : Outcome of Paraphrasing

The mathematical expression solving feature in our project harnesses the capabilities of
Google Generative AI to solve complex mathematical equations and expressions. This section
presents the results obtained from implementing mathematical expression solving using Google
Generative AI and evaluates its impact on the project.

Google Generative AI, a state-of-the-art deep learning model, is integrated into the project's workflow to tackle mathematical expression solving tasks. The model is trained on a diverse dataset of mathematical expressions to develop a robust understanding of mathematical concepts and operations. We assessed the accuracy of the solutions generated by Google Generative AI, evaluated the model's ability to handle complex expressions involving multiple variables, functions, and operators, and analysed the speed and efficiency of the solving process to ensure timely generation of solutions.
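
A minimal sketch of delegating an expression to the google-generativeai client (the model name gemini-pro, API-key handling, and prompt wording are illustrative assumptions):

import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")   # placeholder
model = genai.GenerativeModel("gemini-pro")      # assumed model name

def solve_expression(expression):
    prompt = ("Solve the following mathematical expression step by step "
              "and state the final answer clearly:\n" + expression)
    response = model.generate_content(prompt)
    return response.text

print(solve_expression("3x + 5 = 20, find x"))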

Figure 5.8 : Outcome of Mathematical expression solving

Figure 5.9 : Complete Features of Entire Application

CHAPTER 6
CONCLUSION AND FUTURE SCOPE

In conclusion, our research represents a significant advancement in Optical Character Recognition (OCR) technology, achieved through the integration of Convolutional Neural
Networks (CNN) with single-shot multi-box detector (SSD) methods and innovative handwriting
recognition techniques. By addressing the complex challenges inherent in OCR, including data
compilation, image enhancement, and model training, we have developed a highly refined and
efficient system. Leveraging the diverse handwriting styles within the IAM Handwriting
Database enriched our dataset, while the application of advanced algorithms such as MSER and
CNN significantly improved character recognition and page segmentation accuracy.

Furthermore, our research introduces novel methodologies such as custom anchor box
designs within the SSD framework, leading to enhanced line segmentation and text localization
accuracy. Through iterative tuning of Mean Square Error (MSE) and Intersection over Union
(IoU) metrics, we ensured high precision in character bounding box predictions and text
localization. Text Summarization leveraging the capabilities of GPT-3.5 Turbo-Instruct could
facilitate efficient extraction of key insights from text content. Paraphrasing using Pegasus
Tokenizer model, Torch, and Pdfplumber could offer users alternative versions of text content.

In addition to technical advancements, our study underscores the importance of ongoing innovation in AI. Moving forward, integration with cutting-edge features such as Language Translation utilizing Pdfplumber and Google Translator could extend the system's utility across diverse linguistic contexts. Text-to-Speech Conversion powered by gTTS and Pdfplumber could
provide users with accessible audio renditions of text content. Audio-to-Text Conversion utilizing
AssemblyAI, Speech Recognition, and Pydub could enable seamless transcription of audio
content into text format. Mathematical Expression Solving employing Google Generative AI
could extend the system's functionality to include automated mathematical problem-solving
capabilities. These enhancements would not only enrich the capabilities of our OCR system but
also contribute to advancements in artificial intelligence and facilitate broader adoption across
various domains.

CHAPTER 7
REFERENCES

[1] IAM Handwriting Database, University of Bern. [Online]. Available: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database.

[2] AWS Labs, Handwritten Text Recognition for Apache MXNet. [Online]. Available: https://github.com/awslabs/handwritten-text-recognition-for-apache-mxnet/tree/master.

[3] A. Shonenkov, D. Karachev, M. Novopoltsev, M. Potanin, D. Dimitrov, and A. Chertok, "Handwritten text generation and strikethrough characters augmentation," Computer Optics, vol. 46, no. 3, pp. 1-13, June 2022, doi: 10.18287/2412-6179-co1049.

[4] F. Kızılırmak and B. Yanıkoğlu, "CNN-BiLSTM model for English Handwriting Recognition: Comprehensive Evaluation on the IAM Dataset," Research Square, Nov. 17, 2022. [Online]. Available: https://doi.org/10.21203/rs.3.rs2274499/v1.

[5] L. Jiao, H. Wu, H. Wang, and R. Bie, "Text Recovery via Deep CNN-BiLSTM Recognition and
Bayesian Inference," in IEEE Access, vol. 6, pp. 76416-76428, 2018, doi:
10.1109/ACCESS.2018.2882592.

[6] N. Bhardwaj, "A Research on Handwritten Text Recognition," International Journal of Scientific
Research in Engineering and Management, vol. 06, no. 04, Apr. 2022, doi: 10.55041/ijsrem12649.

[7] L. Kumari, S. Singh, V. V. S. Rathore, and A. Sharma, "Lexicon and attention-based handwritten text
recognition system," Machine Graphics and Vision, vol. 31, no. 1/4, pp. 75–92, Dec. 2022, doi:
10.22630/mgv.2022.31.1.4.

[8] A. Ansari, B. Kaur, M. Rakhra, A. Singh, and D. Singh, "Handwritten Text Recognition using Deep
Learning Algorithms," 2022 4th International Conference on Artificial Intelligence and Speech
Technology (AIST), Dec. 2022, Published, doi: 10.1109/aist55798.2022.10065348.

[9] R. G. Khalkar, A. S. Dikhit, and A. Goel, "Handwritten Text Recognition using Deep Learning (CNN
& RNN)," IARJSET, vol. 8, no. 6, pp. 870–881, Jun. 2021, doi: 10.17148/iarjset.2021.861

[10] Apache MXNet. (2017). OCR with MXNet Gluon [PowerPoint slides]. Available: https://www.slideshare.net/apachemxnet/ocr-with-mxnet-gluon#11.

[11] Z. Chen, K. Wu, Y. Li, M. Wang, and W. Li, “SSD-MSN: An Improved Multi-Scale Object Detection
Network Based on SSD,” IEEE Access, vol. 7, pp. 80622–80632, 2019, doi:
10.1109/access.2019.2923016.

[12] S. Zhai, D. Shang, S. Wang, and S. Dong, “DF-SSD: An Improved SSD Object Detection Algorithm
Based on DenseNet and Feature Fusion,” IEEE Access, vol. 8, pp. 24344–24357, 2020, doi:
10.1109/access.2020.297102

[13] Y. Jiang, T. Peng, and N. Tan, “CP-SSD: Context Information Scene Perception Object Detection
Based on SSD,” Applied Sciences, vol. 9, no. 14, p. 2785, Jul. 2019, doi: 10.3390/app9142785.

[14] R. Shrestha, O. Shrestha, M. Shakya, U. Bajracharya, and S. Panday, "Offline Handwritten Text Extraction and Recognition Using CNN-BLSTM-CTC Network," International Journal on Engineering Technology, vol. 1, no. 1, pp. 166–180, Dec. 2023, doi: 10.3126/injet.v1i1.60941.

[15] Dr. L. Jain, “Hand Written Character Recognition Using CNN Model,” International Journal for
Research in Applied Science and Engineering Technology, vol. 12, no. 1, pp. 1094–1103, Jan. 2024, doi:
10.22214/ijraset.2024.58045.

[16] R. Shrestha, O. Shrestha, M. Shakya, U. Bajracharya, and S. Panday, “Offline Handwritten Text
Extraction and Recognition Using CNNBLSTM-CTC Network,” International Journal on Engineering
Technology, vol. 1, no. 1, pp. 166–180, Dec. 2023, doi: 10.3126/injet.v1i1.60941.

[17] M. Bisht and R. Gupta, "Offline Handwritten Devanagari Word Recognition Using CNN-RNN-CTC," SN Computer Science, vol. 4, no. 1, Dec. 2022, doi: 10.1007/s42979-022-01461-x.

[18] "Manifold Mixup Improves Text Recognition with CTC loss," arXiv:1903.04246 [cs.CV], Mar. 2019. Available: https://doi.org/10.48550/arXiv.1903.04246.

[19] R. Shrestha, O. Shrestha, M. Shakya, U. Bajracharya, and S. Panday, “Offline Handwritten Text
Extraction and Recognition Using CNNBLSTM-CTC Network,” International Journal on Engineering
Technology, vol. 1, no. 1, pp. 166–180, Dec. 2023, doi: 10.3126/injet.v1i1.60941.

[20] D. A. Nirmalasari, N. Suciati, and D. A. Navastara, "Handwritten Text Recognition using Fully Convolutional Network," IOP Conference Series: Materials Science and Engineering, vol. 1077, no. 1, p. 012030, Feb. 2021, doi: 10.1088/1757-899x/1077/1/012030.

[21] Su, S. Y. H., Baray, M., & Carberry, R. L. (1971, January 1). A system modeling language translator. https://doi.org/10.1145/800158.805058

[22] Parimoo, R., Sharma, R., Gaur, N., Jain, N., & Bansal, S. (2022, May 31). A Review on Text Summarization Techniques. International Journal for Research in Applied Science and Engineering Technology. https://doi.org/10.22214/ijraset.2022.42358

[23] Lukose, S., & Upadhya, S. S. (2017, January 1). Text to speech synthesizer-formant synthesis. https://doi.org/10.1109/icnte.2017.7947945

[24] Lu, C., Chen, X., Li, J., & Huang, Y. (2013, October 1). Research on Audio-Video Synchronization of Sound and Text Messages. https://doi.org/10.1109/iscid.2013.177

[25] Yu, Z., Jin, D., Wei, J., Li, Y., Liu, Z., Shang, Y., Han, J., & Wu, L. (2024, January 1). TeKo: Text-Rich Graph Neural Networks With External Knowledge. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/tnnls.2023.3281354

[26] Dale, D. C., Voronov, A., Dementieva, D., Logacheva, V., Kozlova, O., Semenov, N., & Panchenko, A. (2021, September 18). Text Detoxification using Large Pre-trained Neural Models. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2109.08914

[27] Huang, J., Tan, J., & Bi, N. (2020, January 1). Overview of Mathematical Expression Recognition. Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-030-59830-3_4
