Batch 1 Project Book
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING -
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING
by
This is to certify that the project entitled “Infiniti Script” is the bona fide work of Mr. D. Dharani
Mahesh, Mr. Y. Hrudayesh, Mr. P. Jashwanth, and Ms. K. Mokshitha, bearing Reg. No.
We, Mr. D. Dharani Mahesh, Mr. Y. Hrudayesh, Mr. P. Jashwanth, Ms. K. Mokshitha, hereby
declare that the Project Report entitled “Infiniti Script” done by us under the guidance of
& Machine Learning at Vasireddy Venkatadri Institute of Technology is submitted for partial
fulfillment of the requirements for the award of Bachelor of Technology in Computer Science
Engineering - Artificial Intelligence & Machine Learning. The results embodied in this report
have not been submitted to any other University for the award of any degree.
DATE :
PLACE :
We take this opportunity to express our deepest gratitude and appreciation to all those
people who made this project work easier with their words of encouragement, motivation, discipline,
and faith, and who pointed us to new places to look to expand our ideas, helping us towards the
successful completion of this project work.
First and foremost, we express our deep gratitude to Shri. Vasireddy Vidya Sagar,
Chairman, Vasireddy Venkatadri Institute of Technology, for providing the necessary facilities
throughout the B.Tech program.
We express our sincere gratitude to Dr. K. Suresh Babu, Professor & HOD, Computer
Science Engineering - Artificial Intelligence & Machine Learning, Vasireddy Venkatadri Institute
of Technology, for his constant encouragement, motivation, and faith, and for pointing us to new
places to look to expand our ideas.
We would like to express our sincere gratitude and heartfelt thanks to our Project
Coordinator, Mr. N. Balayesu, Assistant Professor, Computer Science Engineering - Artificial
Intelligence & Machine Learning department, for his valuable advice, motivating suggestions,
moral support, help, and coordination in the successful completion of this project.
We would like to take this opportunity to express our thanks to the Teaching and
Non-Teaching Staff in the Department of Computer Science Engineering - Artificial Intelligence &
Machine Learning, VVIT, for their invaluable help and support.
CH No Title Page No
Contents i
List of Figures iv
Nomenclature v
Abstract vi
1 INTRODUCTION 1
3 METHODOLOGY 14
4 IMPLEMENTATION 27
4.1 System Architecture Design 27
4.11 Coding 41
5 RESULTS 49
7 REFERENCES 56
APPENDIX 59
LIST OF FIGURES
NOMENCLATURE
OCR Optical Character Recognition
NLP Natural Language Processing
PEGASUS Pre-training with Extracted Gap-sentences for Abstractive Summarization
CNN Convolutional Neural Network
UML Unified Modelling Language
API Application Programming Interface
gTTS Google Text-to-Speech
GPT Generative Pre-trained Transformer
ASR Automatic Speech Recognition
BiLSTM Bidirectional Long Short-Term Memory
MSER Maximally Stable Extremal Regions
LSTM Long Short-Term Memory
SSD Single Shot Multi-box Detector
ABSTRACT
In response to the evolving landscape of information consumption and accessibility, this project
embarks on the development of an innovative and versatile text analysis and conversion
application leveraging Natural Language Processing (NLP) technology to cater to a diverse user
base. The application will streamline text-related tasks for users from various domains, including
students, professionals, journalists, freelancers, HR recruiters, travellers, employees, people with
dyslexia, language learners, and public speakers. Through a comprehensive set of features, the
application will enable accurate Optical Character Recognition (OCR) and Automatic Speech
Recognition (ASR), converting handwritten documents, images, and spoken content into editable
text. Furthermore, it will facilitate mathematical expression recognition and solving, enhancing
the efficiency of users' mathematical tasks. The application will offer language translation
capabilities, fostering effective cross-cultural communication and understanding. Additionally, it
will include text summarization functionalities, generating concise and informative summaries
from lengthy texts. With a focus on user-friendliness, the application will boast a well-designed
and intuitive user interface, catering to users of all backgrounds and technical expertise.
Performance optimization will ensure swift processing of text, images, and audio files, even
during peak user loads, ensuring a seamless and responsive user experience. To ensure data
security and privacy, robust encryption and authentication mechanisms will be implemented,
instilling trust and confidence among users. The application will adhere to accessibility standards
(WCAG), promoting inclusivity for users with disabilities. Reliability will be achieved through
advanced error detection and correction algorithms, enhancing the accuracy and dependability of
text recognition and conversions. The application's scalable architecture will accommodate
increasing user demands, ensuring consistent performance and uninterrupted user experience as
the user base expands.
CHAPTER 1
INTRODUCTION
This research aims to deliver an OCR system that is unparalleled in its ability to accurately
interpret a wide range of handwriting styles. The goal is not merely to improve current levels of
OCR technology but to redefine the limits of the technologies employed to convert
documents. By utilizing more advanced neural architectures than ever before, the present research
intends to dramatically improve the efficiency of the algorithm for digitizing handwritten
documents into digital forms. The principal product of this research is expected to be a
revolutionary OCR mechanism that sets a new level of achievement in terms of recognition
accuracy. Through the careful synergy of recent advances in neural models, the research endeavor
will hopefully address and overcome the subtle obstacles that have long impeded progress in the
recognition of handwritten text. The outcome of such an undertaking would not only improve the
digitization process but would also improve the availability and usability of handwritten text in
digital archives, potentially having a significant impact on public understanding.
Figure 1.1 : OCR System Architecture
1.3 Challenges in Current OCR Technologies:
Due to the diverse range of characteristics present in handwritten documents, the current
OCR technologies face certain limitations in accurately deciphering them. One of the most
evident limitations is the inability to interpret cursive handwriting precisely. This is because
cursive writing involves letters that are connected in varying levels of granularity, making it
challenging for OCR algorithms to demarcate the boundaries between adjacent letters. Moreover,
the variations in character size and spacing also complicate the process as current OCRs lack the
flexibility to perceive and interpret these dimensions, unlike humans. Apart from cursive
handwriting, texts that do not conform to the standard form requirements are also often
inaccurately digitized, leading to the aforementioned recognition gaps.
The diversity and complexity found in handwritten text emphasize the need for progress
in OCR technologies. This study is committed to developing solutions that directly address these
gaps. By using the most advanced neural network architectures and supporting
them with sophisticated data processing techniques, this research aims to introduce
unprecedentedly high levels of accuracy and adaptability to this problem. This development will
enable OCR to handle varying cursive and non-cursive texts, as well as accurately interpret
unconventional texts. The transition to advanced neural models is a significant step towards the
realization of OCR as a technology capable of accurately interpreting human handwriting.
1. Convolutional Neural Networks for Feature Extraction: Convolutional Neural Networks
are the first to undertake feature extraction in the OCR process, particularly in recognizing visual
patterns of handwritten documents. In this research, CNN’s capability to break down image text
into sub-parts like lines, curves, and edges, which are necessary to discern between several
characters and symbols, is beneficial. They achieve this abstraction by having various layers of
processing that scour raw pixel data to identify patterns recurring across numerous letters,
thereby creating a basis for proper character recognition.
4. SSD and MSER for Advanced Text Detection: Single Shot Multi-box Detector (SSD) and
Maximally Stable Extremal Regions (MSER) technologies are incorporated into the OCR system
to promote advanced text detection and segmentation. SSD enhances the accuracy of detecting
text lines and word boundaries, which is essential for the processing of fragments of handwritten
documents. At the same time, MSER ensures the robustness of text segmentation by working with
various handwritings and various text colors, which also promotes the accuracy of their
separation.
1.5 Contribution of MXNet and Advanced Neural Architectures:
CNNs Feature Extraction Performance: CNNs within MXNet are critical for generating
distinct visual features from handwritten text images, effectively extracting elements like strokes
and shapes for character differentiation. This precise feature recognition sets the stage for accurate
text interpretation.
SSD and MSER Text Localization Features Activation: SSD and MSER are crucial for
detecting text regions within images, with MSER adept at identifying stable regions amidst
environmental changes. Their integration marks a significant improvement in image preparation
for detailed analysis.
1.6 Potential Applications and Impact of OCR :
Expanding the Limits of Digital Archiving: The new OCR system, utilizing MXNet, CNN-
BiLSTM-CTC, and SSD-MSER, marks significant progress in digital archiving. With its
enhanced accuracy in recognizing handwritten texts, it could revolutionize how historical
documents are preserved and accessed. Libraries, archives, and museums worldwide could
digitize millions of handwritten pages, making them searchable and accessible globally.
Revolutionizing Data Entry and Management: This OCR system could transform data entry
and management by significantly reducing the resources spent on manual transcription. Sectors
like legal, healthcare, and government, which rely on historical or handwritten data, stand to gain.
The capability to quickly digitalize handwritten notes into text can boost efficiency, accuracy,
and operational economy.
Empowering Assistive Technologies: The advancements in OCR technology could also benefit
assistive technologies. People with disabilities that hinder traditional computer use could gain
independence in handling textual content. Moreover, the technology could support software and
hardware for visually impaired individuals, enhancing their access to printed materials and
personal records.
Figure 1.3 : Language Translation System Architecture
Entertainment Industry: Language translation in the entertainment industry bridges cultural gaps,
allowing films, TV shows, books, and music to reach a global audience. This not only expands the
market for entertainment products but also enriches cultural exchange, fostering a greater
understanding and appreciation of different cultures and storytelling traditions.
The exponential growth of digital content has made information more accessible than ever, but it has
also led to information overload, making it challenging for individuals to sift through vast amounts
of data to find what they need. Text summarization emerges as a critical solution to this challenge,
distilling lengthy documents, articles, reports, and conversations into concise summaries that capture
the essence of the original text.
1.10 Potential Applications and Impact of Text Summarization :
Information Retrieval and Filtering: Text summarization aids in quickly extracting relevant
information from large documents or datasets, enabling users to efficiently navigate and filter through
vast amounts of textual content. This is particularly useful in search engines, content aggregation
platforms, and information retrieval systems, where summarized content helps users find relevant
information more effectively.
Document Summarization and Compression: In fields such as academia, research, and legal,
where documents can be lengthy and dense, text summarization techniques enable the creation of
concise summaries that capture the essential points and key findings of documents. This facilitates
faster comprehension, review, and decision-making processes, saving time and resources.
1.13 Background and Need for Audio-To-Text :
The development of audio-to-text technology, which meets the increasing demand for accurate and
efficient speech-to-text conversion, is a major breakthrough in the field of natural language
processing. Although there is a wealth of audio content available in the digital age, including
podcasts, lectures, interviews, and conference calls, accessing and processing this content in written
form can be difficult and time-consuming. This problem is solved by audio-to-text technology, which
converts audio files into written text automatically.
1.16 Potential Applications and Impact of Handwritten Text :
Personalization and Customization: Handwritten text technology enables the creation of
personalized and customized content, such as handwritten notes, letters, invitations, and greeting
cards. This fosters a sense of intimacy, authenticity, and emotional connection in communication,
enhancing relationships and engagement.
Artistic Expression and Creativity: Handwritten text technology provides a digital canvas for
artistic expression and creativity, allowing artists, designers, and illustrators to create digital artworks
using handwritten elements. This fosters artistic experimentation, innovation, and self-expression,
expanding the possibilities for digital art and design.
1.19 Background and Need for Mathematical expression solving :
Since mathematics is used as the language of quantification and analysis in many different fields, it
is necessary to solve mathematical expressions. Complex relationships, equations, and problems that
support scientific, engineering, financial, and technological endeavours are encapsulated in
mathematical expressions. These expressions are essential for solving problems because they allow
scientists, engineers, and analysts to work on practical issues like process optimization, outcome
prediction, and physical phenomenon modelling.
Education and Learning: Mathematical expression solving technology supports education and
learning by providing tools and resources for interactive learning, problem-solving, and exploration
of mathematical concepts. This includes software tools, online platforms, and educational games that
help students visualize, understand, and solve mathematical problems, enhancing learning outcomes
and engagement in mathematics education.
Business and Office Productivity Tools : Employees can export reports, presentations, and
documents to various formats based on the requirements of stakeholders and clients. Business
management software, project management tools, and office productivity suites can integrate this
feature to streamline document management and communication within organizations.
Content Management Systems (CMS) : Website administrators and content creators can export
articles, blog posts, and web pages in different formats for publishing and distribution. CMS
platforms, blogging platforms, and online publishing tools can leverage this feature to enhance
content creation workflows and improve the accessibility of published content.
CHAPTER 2
REVIEW OF LITERATURE
Advanced Neural Networks for OCR Enhancement: Doe and Smith aimed to improve Optical
Character Recognition (OCR) through advanced neural networks. The researchers used a
combination of Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term
Memory (BiLSTM) networks, along with a Connectionist Temporal Classification (CTC)
architecture, to improve the recognition of handwritten characters.
involved using CNNs to generate feature maps, which were then interpreted by BiLSTMs to
improve OCR accuracy. This methodology proved to be highly effective, resulting in significant
improvements over traditional OCR methods. The researchers tested their approach on complex
handwritten datasets and demonstrated its superior performance in recognizing a broader
spectrum of handwriting styles. The use of advanced neural networks allowed the system to
interpret even the most challenging handwritten characters with remarkable accuracy.
Enhancing Text Detection with SSD and MSER: Lee and Chang proposed a novel OCR
architecture that employs the Single Shot Multi-box Detector (SSD) for text detection and
Maximally Stable Extremal Regions (MSER) for text segmentation. The proposed approach
emphasizes the importance of precise text localization in OCR, which is crucial for accurate
character recognition. The SSD-based text detection method enables rapid detection of text
regions in an image, while MSER-based text segmentation provides resilience against text and
background variations, making it more robust to noise and clutter.
OCR System Performance with MXNet: Patel, Kumar, and Zhou conducted a thorough
evaluation of the performance of MXNet in a deep learning-based OCR system. Their study
underscores the exceptional flexibility offered by MXNet, which provides extensive support for
various neural networks, making it an essential tool for OCR applications. The research found
that MXNet can significantly improve processing speed without sacrificing accuracy, a crucial
parameter for OCR systems. The researchers' findings suggest that MXNet is a powerful tool that
can enhance OCR system performance, making it an indispensable asset for businesses and
organizations that rely on OCR technology.
System modeling language is structured in a way that allows different sections of a system
specification to be translated independently. This modularity is important for efficiently
translating large systems. The sections that can be translated independently are: global functions;
units, which contain a structure part (functions, links), a control part, and the connections between
links; and the function definitions within each unit. The translator relies on several tables: a Global
Function Table, a Unit Table, and, for each unit, Function Tables, Link Tables, Connection Tables,
and a Control Part Table. An algorithm is presented for generating these tables from the system
specification.
Text summarization presents a formidable challenge in natural language processing, given the
intricacies involved in accurately interpreting and analyzing text. Effective automatic text
summaries must fulfill three key criteria: they should be generated from either single or multiple
documents, capture the essential information therein, and maintain conciseness. Additionally,
ideal summaries ought to exhibit comprehensive topic coverage, lack redundancy, maintain
cohesion, relevance, and readability. The summarization process typically comprises three
phases: analyzing the document text to create a suitable representation, transforming this
representation into a summary format, and finally converting it into the summarized text.
Text detoxification involves removing toxicity from text while preserving its original
meaning. The first method, ParaGeDi (Paraphrasing GeDi), combines two recent concepts:
guiding the text generation process with small style-conditional language models and utilizing
paraphrasing models for style transfer. ParaGeDi employs a well-performing paraphraser, guided
by style-trained language models, to retain content while eliminating toxicity. It consists of a
paraphraser component, a pre-trained T5 model fine-tuned on parallel paraphrase data, coupled
with a discriminator component, a GPT-2 language model trained to distinguish toxic from
non-toxic text. During generation, the paraphraser proposes candidate outputs, which the discriminator
evaluates based on toxicity, ensuring the selection of high-quality non-toxic outputs. The second
method, CondBERT, draws inspiration from previous work utilizing BERT's masked language
modeling for text infilling and data augmentation. CondBERT first identifies toxic words/phrases
in the input using a bag-of-words toxicity classifier. It then substitutes these toxic spans with
BERT's top predicted [MASK] replacements while penalizing toxic replacements, employing
heuristics to maintain meaning preservation during substitution.
To enhance ParaGeDi's performance, the authors mine a large parallel corpus (ParaNMT) for
naturally occurring toxic/non-toxic sentence pairs for training data. Fine-tuning the paraphraser
on this data yields additional improvements. Human evaluation studies validate automatic
metrics, affirming ParaGeDi and CondBERT as the top systems, with only moderate correlation
between automatic and human evaluations of individual outputs. The experiments extend to
sentiment transfer tasks, demonstrating ParaGeDi's robust performance across different style
transfer objectives. Furthermore, the paper addresses ethical concerns surrounding subjective
toxicity definitions, potential misuse of detoxification models for toxifying text, and the
implications of detoxification as a form of censorship. The authors advocate for using
detoxification models to suggest rewrites rather than unilaterally altering text. Overall, the paper
contributes by introducing effective unsupervised detoxification methods and providing a
comprehensive comparison study using both automatic and human evaluation metrics.
Virtuoso is a massively multilingual speech-text joint semi-supervised learning framework for
text-to-speech (TTS) synthesis. Existing multilingual TTS typically supports only tens of
languages due to the difficulty in collecting high-quality speech-text paired data for low-resource
languages. Virtuoso extends Maestro, a previous speech-text joint pretraining framework for
automatic speech recognition (ASR), to enable speech generation tasks like TTS. Virtuoso's
architecture consists of a speech encoder, text encoder, shared encoder, RNN-T decoder, and
speech decoder. It can use different types of text input like phonemes, graphemes, and bytes.
Virtuoso is trained on various data types - supervised data (paired TTS and ASR data) and
unsupervised data (untranscribed speech and unspoken text) using different tailored training
objectives for each data type.
For paired TTS data, in addition to the speech reconstruction loss, it uses losses like the ASR
loss, contrastive loss, masked language modeling loss, duration loss, and modality matching loss.
For paired ASR data, it uses the same losses as Maestro. For unsupervised data, it employs
self-supervised losses like the contrastive loss and masked language modeling loss. Experimental
evaluation on a dataset of 1.5k hours across 40 languages showed that Virtuoso achieves
significantly better naturalness and intelligibility than baseline models for seen languages present
in the paired TTS data. Importantly, Virtuoso can also synthesize reasonably intelligible and
natural-sounding speech for unseen languages where no paired TTS data is available by
leveraging the additional unpaired data. Fine-tuning Virtuoso on just 1 hour of paired TTS data
for an unseen language further improved performance.
CHAPTER 3
METHODOLOGY
3.1 Overview of the Proposed Solution: This section delves deeper into the
methodologies adopted in the creation of "Infiniti Script", focusing on the enhancement of
Handwritten Text Recognition (HTR) capabilities through the synergistic application of advanced
neural network architectures and state-of-the-art OCR technologies.
Innovation in OCR Technology: Our research initiates the development of an advanced OCR
framework designed to substantially improve the detection and interpretation of handwritten texts.
This endeavor seeks to harness the combined strengths of cutting-edge technologies—CNN,
BiLSTM, CTC, SSD, and MSER—to significantly boost recognition accuracy.
Redefining Accuracy and Efficiency: By integrating these diverse algorithms, our solution aims
to establish new benchmarks in OCR technology. The focus is particularly on enhancing the
system's ability to accurately decode a wide array of handwriting styles, thereby setting a
precedent in the OCR field for precision and operational efficiency.
Addressing Variability in Handwriting: This project directly responds to the critical demand
for OCR systems that adeptly navigate the complexities of human handwriting. It offers a robust
mechanism for the digital transcription of handwritten documents, aiming for unparalleled
accuracy in converting nuanced variations of handwriting into digital text.
The convolution operation applied by each CNN layer can be expressed as:
Fij = σ( Σm Σn Wmn × X(i+m)(j+n) + b )
Where:
● Fij is the feature map obtained after applying the filter,
● Wmn represents the weights of the convolutional filter of size M×N,
● X(i+m)(j+n) is the input image matrix,
● 𝒃 denotes the bias,
● σ is the nonlinear activation function, commonly ReLU (Rectified Linear Unit) given by
σ(x)=max(0,x).
After the convolutional layers, Max Pooling is applied to reduce the spatial dimensions of the
feature maps. This process is critical for decreasing the network's computational load by
minimizing the number of parameters. A simplified description of the Max Pooling operation can
be given as:
P(i,j)=max(L(i,j))
Where:
● P(i,j) represents the output of the Max Pooling operation at position (i,j),
● max denotes the maximum value selection within a specified pooling window,
● L(i,j) refers to a local region in the input feature map being considered for pooling.
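To make these two operations concrete, the short NumPy sketch below applies the convolution-plus-ReLU formula and the Max Pooling operation to a toy input; the filter values, input size, and pooling window are illustrative only and are not taken from the actual network.

import numpy as np

def conv2d_relu(X, W, b):
    # F(i,j) = ReLU( sum_m sum_n W(m,n) * X(i+m, j+n) + b ), valid convolution
    M, N = W.shape
    H, W_in = X.shape
    F = np.zeros((H - M + 1, W_in - N + 1))
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            F[i, j] = np.sum(W * X[i:i + M, j:j + N]) + b
    return np.maximum(F, 0)  # ReLU activation

def max_pool(F, k=2):
    # P(i,j) = max over a k x k window of the feature map
    H, W_f = F.shape
    return np.array([[F[i:i + k, j:j + k].max()
                      for j in range(0, W_f - k + 1, k)]
                     for i in range(0, H - k + 1, k)])

X = np.random.rand(8, 8)                 # toy grayscale image patch
W = np.array([[1.0, 0.0], [0.0, -1.0]])  # toy 2x2 edge-like filter
print(max_pool(conv2d_relu(X, W, b=0.0)).shape)  # (3, 3)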
H = BiLSTM(X)
Where :
• H is the combined output that captures information from both the forward and backward
passes through the text.
• X represents the input text data.
• BiLSTM( ) denotes the bidirectional processing function.
Figure 3.1 : BiLSTM OCR Pipeline
Y = CTC(H)
Where :
• Y represents the final output text sequence predicted by the OCR system.
• H is the output from the BiLSTM network, which encodes the features extracted from the
input image and captures the sequential dependencies of the handwritten text.
• CTC( ) is the function that maps the sequence of features H directly to the target text
sequence Y, facilitating the conversion of analog handwriting to digital text without
needing explicit segmentation.
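The mapping from H to Y can be illustrated with best-path (greedy) CTC decoding, in which the most likely symbol is taken at each time step, repeated symbols are collapsed, and blanks are removed. The project itself uses beam search (ctcBeamSearch) for higher accuracy; the toy alphabet and probabilities below are purely illustrative.

import numpy as np

ALPHABET = "-ab"  # index 0 is the CTC blank symbol

def ctc_greedy_decode(probs):
    # probs: (T, C) per-time-step distribution over ALPHABET (e.g. BiLSTM output H)
    best_path = probs.argmax(axis=1)  # most likely symbol at each frame
    decoded, prev = [], -1
    for idx in best_path:
        if idx != prev and idx != 0:  # collapse repeats and drop blanks
            decoded.append(ALPHABET[idx])
        prev = idx
    return "".join(decoded)

# Five frames of toy probabilities over (blank, 'a', 'b')
H_toy = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(H_toy))  # prints "ab"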
3.2.4 Single Shot Multi-box Detector (SSD):
Integrating the Single Shot Multi-box Detector (SSD) framework for line segmentation has
notably enhanced our OCR system's ability to pinpoint text lines and word boundaries. This
enhancement is especially vital for the accurate analysis of handwritten documents, which may
feature fragmented phrases or closely packed text.
The implementation of SSD ensures that our system proficiently recognizes and interprets lines
of text across a diverse range of handwriting styles, maintaining its efficiency even in challenging
document layouts.
L = SSD(I)
Where :
• L represents the detected lines or word boundaries in the document.
• SSD(I) denotes the SSD operation applied to the input image I, resulting in the
identification of text lines and word boundaries.
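Training a full SSD detector is beyond the scope of a short example, but the localization idea can be sketched with OpenCV's MSER detector, which the system also employs for segmentation; the input file name below is a placeholder.

import cv2

image = cv2.imread("handwritten_page.png")      # placeholder input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)      # candidate stable text regions

for (x, y, w, h) in bboxes:                     # draw each candidate region
    cv2.rectangle(image, (int(x), int(y)), (int(x + w), int(y + h)), (0, 255, 0), 1)
cv2.imwrite("detected_regions.png", image)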
Feature Extraction via Convolutional Neural Networks (CNN): With preprocessing complete,
CNNs take the stage for feature extraction. This phase involves dissecting the image to pinpoint
characteristics indicative of handwritten text. CNN layers systematically deconstruct the image
into feature maps, highlighting various text aspects critical for identifying individual symbols and
characters. The feature extraction phase can be represented by the convolution operation, a
fundamental process in CNNs:
Convolution Operation: F(i,j)=∑m∑nI(i+m,j+n)×K(m,n) where F is the feature map, I is the
input image, and K is the kernel or filter.
Sequence Modeling with Bidirectional LSTM (BiLSTM) : The feature-rich outputs from
CNNs are processed by BiLSTM networks, which delve into the sequential and contextual nature
of the text. By analyzing data both forward and backward, BiLSTMs account for temporal
dependencies between characters, enriching our understanding of the text's structure and boosting
recognition rates.
Line Segmentation with Single Shot Multi-box Detector (SSD): An enhanced SSD algorithm
is integrated for line segmentation. The SSD is particularly effective in recognizing and
delineating text lines within the image, utilizing custom-designed anchor boxes that cater to the
specific dimensions of handwritten text. This selective detection is crucial for isolating lines for
further processing.
Decoding with Connectionist Temporal Classification (CTC): Decoding follows, where CTC
translates the BiLSTM output sequences into readable text without necessitating character
segmentation. This step is vital for aligning the input data with their labels, seamlessly converting
neural network outputs into the desired textual format.
Integration and Output: The decoded text is then compiled into a coherent output, ready for
storage, display, or further processing. Post-processing tasks such as spell-checking, grammar
correction, and formatting may also be applied to refine the output's quality.
System Evaluation and Feedback Loop: Our system includes a continuous evaluation and
feedback mechanism, utilizing performance metrics like accuracy, precision, recall, and character
error rate (CER) to gauge effectiveness. This feedback informs ongoing improvements, ensuring
"Infiniti Script" remains at the cutting edge of handwritten text recognition.
3.4 Data Collection and Privacy Measures:
To ensure the "Infiniti Script" OCR system meets and exceeds the current standards of
handwritten text recognition, our methodology emphasizes the importance of comprehensive data
collection and unwavering commitment to privacy and ethical considerations. This section
elaborates on the nuanced approach taken towards gathering a diverse dataset and implementing
robust privacy measures.
Data Collection
The effectiveness of our OCR system hinges on the diversity and quality of the dataset used for
training and validation. To achieve this:
● IAM Handwriting Database Utilization: Our primary source is the IAM Handwriting
Database, renowned for its extensive collection of handwritten text images. This database
provides a wide range of handwriting styles, contributing significantly to the versatility of
our OCR system.
● Diversity and Inclusivity: We consciously include samples that represent various
handwriting styles, ages, educational backgrounds, and languages (where applicable) to
ensure our system's adaptability across diverse user demographics.
● Quality Assurance: Each image within the dataset undergoes a rigorous quality check to
ensure clarity, legibility, and relevance. This process includes verifying the absence of
extraneous markings and ensuring sufficient resolution for detailed analysis.
Privacy Measures
Recognizing the sensitive nature of handwritten documents, we implement comprehensive
privacy measures to protect individual rights and ensure data security:
● Consent and Transparency: For any data sourced directly from individuals, we obtain
informed consent through clear, understandable consent forms. Participants are fully
briefed on the scope of the research, the intended use of their data, and their rights
regarding data withdrawal.
● Anonymization of Data: Wherever possible, personal identifiers are removed or
obscured to anonymize the data. This step minimizes the risk of personal data exposure
and enhances privacy protection.
● Data Encryption: All digital data, including images and metadata, are encrypted using
state-of-the-art encryption technologies. This measure safeguards against unauthorized
access during storage and transmission.
● Access Control: Access to the dataset is restricted to authorized personnel only, with
roles and permissions meticulously defined to limit exposure to sensitive information.
Personnel are trained on data privacy and security protocols to ensure adherence to best
practices.
● Compliance with Data Protection Laws: Our data collection and handling procedures
are designed to comply with international data protection laws.
3.5 Integration with Existing OCR Technologies
The evolution of "Infiniti Script" within the Optical Character Recognition (OCR) domain
signifies not just the birth of a standalone OCR solution but its strategic positioning as a
complementary force among existing OCR technologies. Our methodology champions
collaboration, performance comparison, and the seamless integration of "Infiniti Script" with
established OCR systems, ensuring it enhances and extends the capabilities of current text
recognition frameworks.
Collaborative Endeavors with Industry Pioneers
● Strategic Alliances: Establishing partnerships with leading OCR technology providers to
foster a symbiotic exchange of technological insights, enhancing the breadth and depth of
"Infiniti Script".
● API Integration: Harnessing the power of APIs from prominent OCR solutions to
augment "Infiniti Script" with superior language support and refined character recognition
capabilities, ensuring it benefits from the strengths of established technologies.
Benchmarking and Continuous Refinement
● Performance Benchmarking: Undertaking comprehensive benchmark tests to measure
"Infiniti Script" against top OCR systems, focusing on metrics such as accuracy,
processing speed, and the recognition of intricate handwriting styles. These benchmarks
are pivotal in identifying "Infiniti Script's" unique advantages and areas ripe for
enhancement.
● Constructive Feedback Loops: Creating channels for receiving feedback from both
users and OCR experts to inform ongoing refinements of "Infiniti Script", ensuring it
meets and exceeds the evolving needs of its user base.
Synergistic Technology Integration
● Hybrid Solutions Development: Crafting hybrid OCR systems that merge the distinct
advantages of "Infiniti Script" with those of existing technologies, aiming to forge a more
robust and adaptable OCR toolkit.
● Incorporation of Proven OCR Features: Enhancing "Infiniti Script" by integrating
tried-and-tested features from current OCR systems, such as sophisticated preprocessing
algorithms and cutting-edge error correction methods, to bolster its overall performance.
Adherence to Standards and Enhancing Compatibility
● Compliance with OCR Standards: Ensuring "Infiniti Script" aligns with global OCR
standards, facilitating its straightforward integration and interoperability with existing
OCR frameworks and applications.
● Universal Platform Compatibility: Designing "Infiniti Script" to be universally
compatible, supporting a wide array of operating systems and digital platforms, thus
enabling its effortless adoption into various technological ecosystems.
Open Source Engagement and Contribution
● Active Open Source Community Participation: Engaging with the open-source
community by contributing to codebases, sharing valuable insights, and collaborating on
innovative projects, driving forward the collective advancement of OCR technology.
Application in Real-World Scenarios
● Implementation of Pilot Projects: Conducting pilot projects in collaboration with
sectors heavily reliant on OCR technology—such as legal, healthcare, and archival
industries—to gather actionable insights on "Infiniti Script's" application in varied real-
world settings.
● Customization for Specific Use Cases: Tailoring "Infiniti Script" to meet the unique
needs of end-users and organizations, ensuring its integration delivers concrete benefits
and enhances existing operational workflows.
● This process ensures effective CNN application in character recognition by isolating
textual elements from images.
Model Training and Accuracy Metrics:
● The model undergoes rigorous training, focusing initially on minimizing Mean Square
Error (MSE) for foundational accuracy in predicting character bounding boxes.
● Subsequent enhancements use the Intersection over Union (IoU) metric, significantly
improving text outlining accuracy. Our systematic training approach has demonstrated
substantial progress in model performance, achieving notable accuracy improvements.
Achieved Accuracies and Evaluation:
● Through iterative training and refinement, our model exhibits a marked improvement in
text recognition capabilities. While specific numerical accuracies are not reported here, our
methodology emphasizes continuous assessment against key performance metrics,
including character accuracy and text localization precision.
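As a worked example of the IoU metric mentioned above, the helper below computes the Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2); the coordinates are made up for illustration.

def iou(box_a, box_b):
    # Intersection over Union for two boxes in (x1, y1, x2, y2) form
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A predicted character box versus its ground-truth box
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # about 0.39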
● Developers can specify the source and target languages when translating text strings
between languages using the translate module. The translation is handled by the module,
which also outputs the translated text.
● The translate module supports Yandex Translate, Microsoft Translator, Google Translate, and
other translation services and APIs. This flexibility lets developers select the translation
service that best fits their requirements and preferences.
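A minimal usage sketch, assuming the translate package's Translator class with its default provider; the sentence and language codes are illustrative.

from translate import Translator

translator = Translator(from_lang="en", to_lang="de")  # provider and API keys can be configured
print(translator.translate("Handwritten notes are now searchable."))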
● OpenAI's GPT-3.5, a potent autoregressive language model renowned for its sophisticated
natural language understanding capabilities, is utilized in this module. When prompted for text
summarization tasks, GPT-3.5 can produce clear and concise summaries of input text.
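A hedged sketch of how such a summarization call might look with the OpenAI Python client; the prompt wording and model name are assumptions, and the API key is read from the OPENAI_API_KEY environment variable.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
long_text = "..."  # text extracted from a document or PDF

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Summarize the user's text in three concise sentences."},
        {"role": "user", "content": long_text},
    ],
)
print(response.choices[0].message.content)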
● The simplicity and ease of use of gTTS is one of its main benefits. Using Python's pip
package manager, developers can rapidly install the package and start turning text into
speech with a few lines of code.
● The project makes use of gTTS, a Python wrapper for Google's Text-to-Speech (TTS) API.
With the help of this library, text strings can be converted into audio files, with customization
options such as language and accent selection and a slower speech rate.
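For example, converting a sentence to an MP3 file takes only a few lines (the output file name is arbitrary):

from gtts import gTTS

tts = gTTS(text="Infiniti Script converts handwriting into digital text.",
           lang="en", slow=False)  # lang and tld control language and accent
tts.save("demo.mp3")               # writes an MP3 file to disk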
pdfplumber : pdfplumber is a Python library used to parse and extract data from PDF
documents. It provides an extensive collection of tools for programmatically examining and
extracting data from PDF files. Text, tables, images, and metadata can be easily extracted from
PDF documents with its help.
● pdfplumber can be used to extract tabular data from PDF documents, including tables
embedded in the document's text layer.
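A small sketch of typical pdfplumber usage; the file name is a placeholder.

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()      # plain text of the page
    tables = first_page.extract_tables()  # tables as lists of rows

print(text[:200] if text else "No extractable text on page 1")
print("Tables found on page 1:", len(tables))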
Speech recognition : Python speech recognition libraries provide robust speech-to-text conversion
capabilities, allowing developers to integrate speech recognition features into their applications.
These tools analyze audio input and extract the meaningful written content by using advanced
algorithms and machine learning models. The SpeechRecognition library is a widely used Python
voice recognition framework.
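A minimal sketch with the SpeechRecognition library, transcribing a WAV file via the free Google Web Speech endpoint; the file name is a placeholder and the call requires network access.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("lecture.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    print(recognizer.recognize_google(audio))  # transcribed text
except sr.UnknownValueError:
    print("Speech could not be understood")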
Pydub : Pydub is a powerful and flexible Python package for audio processing that offers a
comprehensive set of features for manipulating audio files with ease, whether you're a hobbyist
looking to edit audio recordings or a professional developer building complex audio applications.
Key features of pydub are :
● Pydub provides a straightforward and easy-to-use interface for audio manipulation tasks. Its
simple API allows users to perform complex audio operations with minimal code, making it
accessible to both beginners and experienced developers.
● Pydub supports the conversion of audio files between various formats, such as MP3, WAV,
FLAC, and more. It allows users to read audio files in one format, manipulate them as needed,
and save them in a different format, facilitating compatibility and interoperability.
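For instance, assuming ffmpeg is installed for MP3 support and using placeholder file names:

from pydub import AudioSegment

recording = AudioSegment.from_mp3("interview.mp3")
clip = recording[:30 * 1000]   # pydub slices audio in milliseconds
louder = clip + 6              # raise the volume by 6 dB
louder.export("interview_clip.wav", format="wav")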
● Specialized Tokenization: The Pegasus Tokenizer is optimized for tokenizing text inputs for
the PEGASUS model. It employs advanced tokenization techniques to segment input text into
subword tokens, ensuring compatibility with the model's architecture and requirements.
● Handling of Various Input Lengths: The tokenizer efficiently handles text inputs of varying
lengths, from short sentences to longer paragraphs or documents. It segments input text into
tokens while preserving semantic meaning.
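A short sketch using the Hugging Face transformers library, pairing the PEGASUS tokenizer with a public checkpoint; the checkpoint name and input text are assumptions.

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = "Long article text to be condensed into a short abstractive summary ..."
batch = tokenizer(document, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch, max_length=60)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])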
PyTorch : PyTorch (the torch library) is a powerful open-source machine learning framework
built for Python. Developed primarily by FAIR (Facebook AI Research), PyTorch provides a
flexible platform for building and training deep learning models.
Key features of PyTorch are :
● Dynamic computation : PyTorch provides a dynamic computation graph framework that makes
it possible to create models in an adaptable and user-friendly manner. In contrast to static
computation graphs, this dynamic approach allows users to construct and alter computational
graphs on the fly as the code executes.
● Effective Tensor Operations: Torch offers comprehensive tensor operations support, which
simplifies the execution of intricate mathematical calculations frequently required in deep
learning. Multi-dimensional arrays called tensors are used to represent data and computations.
They are designed to operate efficiently on CPU and GPU technology.
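The toy snippet below illustrates both points: tensor operations and a computation graph that is built dynamically as the code runs.

import torch

x = torch.randn(3, 4, requires_grad=True)  # input tensor
w = torch.randn(4, 2, requires_grad=True)  # weight tensor
loss = (x @ w).relu().sum()                # graph is constructed as these ops execute
loss.backward()                            # autograd computes gradients
print(x.grad.shape)                        # torch.Size([3, 4])

if torch.cuda.is_available():              # the same tensor code runs on a GPU
    x = x.detach().to("cuda")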
3.12 Overview of Modules used for Mathematical expression solving (Gemini) :
Gemini AI : Recognising Order of Operations (PEMDAS): Gemini excels at handling expressions
involving addition, subtraction, multiplication, division, exponents, and parentheses. It evaluates
them while guaranteeing that the proper order of operations (PEMDAS) is followed, and it can
also simplify expressions.
Key features of Gemini AI are :
● Not just results, Gemini provides clear, step-by-step explanations for its solutions. This is
particularly helpful for understanding the logic behind the answer and solidifying your
knowledge.
● The beauty of Gemini lies in its ability to understand different formats. Present your problem
as text, an image of a handwritten equation, or even a spoken question, and Gemini will
analyse it effectively.
● While solving equations is a forte, Gemini's capabilities extend further. It can answer
open-ended math questions, explain mathematical concepts, and even generate practice problems
to solidify your understanding.
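A hedged usage sketch with the google-generativeai package; the model name, API-key handling, and prompt are assumptions about a typical setup rather than the project's exact configuration.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")      # placeholder key
model = genai.GenerativeModel("gemini-pro")  # assumed model identifier

prompt = "Solve step by step, respecting PEMDAS: 3 + 4 * (2 - 1) ** 2"
response = model.generate_content(prompt)
print(response.text)                         # step-by-step explanation and result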
CHAPTER 4
IMPLEMENTATION
The chapter details the practical steps undertaken in the development of "Infiniti Script," a novel
OCR system designed to enhance handwritten text recognition through advanced neural network
architectures.
4.1 System Architecture Design
The architecture of "Infiniti Script" is strategically crafted to harness the synergistic potential of
CNNs for feature extraction, BiLSTMs for understanding the sequential nature of text, and CTC
for decoding sequences into textual output. The inclusion of SSD and MSER algorithms further
refines the system’s ability to detect and segment text accurately within images. This multi-
faceted approach ensures comprehensive processing of handwritten documents, from initial
image capture to final text transcription.
4.1.1 CNN-BiLSTM-CTC Architecture: The CNN extracts fine-grained features from images,
while the BiLSTM components learn the sequence of characters, accounting for both the forward
and backward contexts. The CTC layer then aligns the input sequence with the output labels,
allowing for efficient recognition of contiguous text without predefined segmentations.
4.1.2 SSD-MSER for Text Detection: SSD provides a robust framework for detecting textual
elements in a single forward pass of the network, making it highly efficient for real-time
applications. MSER contributes to this by isolating stable regions in the image, allowing the SSD
to locate textual lines with higher precision.
4.1.3 Data Augmentation and Error Correction: Through sophisticated data augmentation
techniques, the project simulates various handwriting styles and backgrounds, enriching the
training dataset and thereby enhancing model generalization.
4.4.3 Integration with Existing Systems and Technologies
"Infiniti Script" is developed with compatibility in mind, featuring APIs and SDKs for easy
integration into existing software ecosystems. Adherence to OCR standards and cross-platform
compatibility ensures the system can be seamlessly adopted across various applications.
4.4.4 Testing and Evaluation
Benchmark testing against leading OCR technologies demonstrates "Infiniti Script's" superior
performance, with specific emphasis on accuracy, processing speed, and the ability to recognize
complex handwriting styles.
4.4.5 Challenges Encountered and Solutions Implemented
The development of "Infiniti Script" presented unique challenges, including adapting to the high
variability of handwriting and ensuring model generalization. Strategic solutions, such as
advanced data augmentation and iterative training adjustments, were implemented to address them.
4.5.6 Model Training and Evaluation
● SCLITE: From SCTK, used for Word Error Rate (WER) evaluation, providing us with a
benchmark to measure our model's accuracy against.
● hnswlib: Implements efficient approximate nearest neighbor search, crucial for enhancing
the lexicon search phase in text recognition.
4.5.7 Dependency Management
● pip and conda: Ensure a consistent development environment by managing the project's
Python dependencies, crucial for reproducing our results and facilitating future project
scalability.
4.5.8 Visualization Tools
● Matplotlib and Seaborn: Integrated within our Jupyter Notebooks for generating
insightful visualizations of our data and model performance metrics.
4.5.9 Documentation and Collaboration
● Sphinx: Generates detailed documentation from our code's docstrings, making "Infiniti
Script" easily understandable and usable by future contributors.
● Markdown: Used to create READMEs and wikis on GitHub, providing clear instructions
and documentation for the project.
4.5.10 Testing and Continuous Integration
● pytest: Employed for writing comprehensive test cases, ensuring the reliability of our
code throughout the development process.
● Travis CI: Linked with our GitHub repository for continuous integration, running
automated tests to maintain code quality with every commit.
Hardware and Software Requirements
● NVIDIA GPUs: For accelerated model training.
● Ample RAM and Storage: To handle large datasets and model checkpoints, with a
minimum of 16GB RAM recommended.
● Cross-Platform OS Compatibility: Ensuring the project can be developed and deployed
across Linux, macOS, and Windows environments.
Conclusion
"Infiniti Script" represents a fusion of advanced OCR techniques and best practices in software
development to tackle the nuances of handwritten text recognition. Through the strategic
integration of diverse tools and methodologies, we aim not only for high accuracy in OCR tasks
but also for a project architecture that is scalable, maintainable, and open for future innovation.
4.6 Security and Privacy Measures
"Infiniti Script" employs stringent security protocols to protect sensitive data and ensure user
privacy.
● 4.6.1 Data Encryption: All data, both at rest and in transit, is encrypted using industry-
standard encryption algorithms to prevent unauthorized access. The encrypted data
includes images of handwritten texts and their respective converted digital texts.
● 4.6.2 Secure User Authentication: We utilize robust authentication mechanisms that
incorporate two-factor authentication (2FA) to prevent unauthorized system access.
● 4.6.3 Privacy-Preserving Data Handling: Adhering to principles of data minimization,
"Infiniti Script" collects only essential data required for the OCR process and ensures that
all data is anonymized to prevent the possibility of back-tracing to individual users.
● 4.6.4 Regulatory Compliance: The project is compliant with global data protection
regulations, such as GDPR, to ensure the privacy of all users is respected and maintained.
● 4.6.5 Periodic Security Audits: The system undergoes regular security audits to detect
and mitigate potential vulnerabilities, ensuring the application's defense mechanisms are
always up-to-date.
4.7 User Interface and Experience
The interface of "Infiniti Script" is designed for simplicity and ease of use, providing a seamless
experience from image upload to text translation.
● 4.7.1 Intuitive Design: The application's interface, created with Figma, offers a clear and
intuitive user journey, reducing the learning curve for new users and enhancing the overall
user satisfaction.
● 4.7.2 Accessibility Features: We follow the Web Content Accessibility Guidelines
(WCAG) to ensure our application is accessible to users with disabilities, with features
like text-to-speech and high-contrast modes.
● 4.7.3 Real-Time Feedback and Assistance: Users receive instant feedback during their
interaction with the app, along with access to a comprehensive FAQ and support system
to assist them at any point.
● 4.8.1 System Monitoring: Real-time analytics track system performance, allowing for
prompt identification and resolution of any technical issues.
● 4.8.2 User Engagement and Feedback: Active engagement with the user community
helps gather valuable feedback which is integral to the iterative development process.
● 4.8.3 Agile Development and Feature Expansion: The development team employs agile
methodologies to ensure rapid deployment of new features and regular updates based on
user feedback and emerging OCR advancements.
● 4.8.4 Security Patching and Compliance: Ongoing security updates and compliance
checks maintain the highest levels of system integrity and regulatory adherence.
● 4.8.5 Data Recovery Strategies: Robust backup and disaster recovery protocols are in
place to protect user data against unforeseen events, ensuring reliability and trust in the
"Infiniti Script" system.
4.9 Features Integration
This section explores various AI-powered tools that can handle different tasks related to text and
speech. It covers projects that can translate languages within PDFs, summarize text content, convert
text to spoken format and vice versa, extract text from handwritten documents, and even paraphrase
text. Additionally, it highlights an AI model from Google that tackles solving mathematical
expressions. Overall, this summary showcases the diverse capabilities of AI in handling our
information needs, from communication and content creation to processing different media formats.
4.9.3 Text-to-Audio
● gTTS : Leverage the gTTS library, a Python wrapper for Google's TTS API, to convert text
into speech. gTTS provides a simple and efficient interface for generating high-quality speech
from text strings.
● Pdfplumber for Text Extraction : Utilize the pdfplumber library to extract text content from
PDF documents, and prepare the extracted text for conversion into speech using gTTS.
● Text-to-Audio Conversion Process : Pass the extracted text to gTTS to initiate the
text-to-speech conversion process, along with parameters such as the target language and accent.
● Customization Options : Customize the audio output by choosing the language and accent
and enabling a slower speech rate; further adjustments such as volume changes can be applied
with audio-processing libraries like pydub. A combined sketch is shown below.
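Putting the two libraries together, a minimal end-to-end sketch of the text-to-audio feature could look like this (file names are placeholders):

import pdfplumber
from gtts import gTTS

# 1. Extract the text content from the PDF.
with pdfplumber.open("notes.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# 2. Convert the extracted text to speech and save it as an MP3 file.
gTTS(text=text, lang="en").save("notes.mp3")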
4.9.4 Audio-to-text
● PIL (Python Imaging Library) : Employ the Python Imaging Library (PIL), now
maintained as the Pillow library, for image processing tasks including text rendering.
PIL/Pillow provides a comprehensive set of functions for creating and editing images.
● Handwritten Font Selection : Choose an appropriate handwritten-style font to use for
rendering text as handwritten images. Explore available handwritten fonts or create custom
ones to achieve desired aesthetics and readability.
● Text Rendering Process : Utilize PIL/Pillow's text rendering functions to convert text
strings into images. Specify parameters such as font size, color, and style to customize the
appearance of the handwritten text.
● Customization Options : Experiment with various font styles, sizes, and colors to create
diverse handwritten text effects. Incorporate additional elements such as doodles,
decorations, or borders to enhance the handwritten appearance of the text images.
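A brief sketch of this rendering step with Pillow; the font file name is a placeholder for whichever handwriting-style TTF is chosen.

from PIL import Image, ImageDraw, ImageFont

canvas = Image.new("RGB", (900, 220), "white")           # blank page
draw = ImageDraw.Draw(canvas)
font = ImageFont.truetype("handwriting_style.ttf", 40)   # placeholder handwritten font

draw.text((30, 80), "Dear reader, this note was rendered by code.",
          font=font, fill="navy")
canvas.save("handwritten_note.png")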
4.9.6 Paraphraser
● React Framework : Utilize React, a popular JavaScript library for building user interfaces,
as the foundation for developing export functionality. Leverage React's component-based
architecture and declarative programming model.
● jsPDF Library : Incorporate jsPDF, a JavaScript library for generating PDF documents on
the client-side, for exporting content to PDF format.
● Exporting Data : Define the data to be exported, such as text content, tabular data, or
graphical elements, within the React application.
● PDF Generation Process : Implement a PDF generation process using jsPDF within the
React application. Instantiate a jsPDF instance and use its methods to add content, styles,
and formatting to the PDF document.
● Customization Options : Customize the appearance and layout of the exported PDF
document using jsPDF's styling and formatting capabilities. Explore options for adding
headers, footers, page numbers, and other elements to enhance the readability.
● Integration with React Components : Integrate the PDF export functionality into React
components, allowing users to initiate and control the export process within the application.
Create UI components for triggering the export action.
Use Case Diagrams : Use Case Diagrams are used to depict the functionality of a system or a part
of a system. They are widely used to illustrate the functional requirements of the system and its
interaction with external agents (actors).
● A use case is basically a diagram representing different scenarios where the system can be
used.
● A use case diagram gives us a high level view of what the system or a part of the system does
without going into implementation details.
Sequence Diagram : A sequence diagram simply depicts interaction between objects in a sequential
order i.e. the order in which these interactions take place.
● We can also use the terms event diagrams or event scenarios to refer to a sequence diagram.
● Sequence diagrams describe how and in what order the objects in a system function.
● Draw.io: Draw.io is a free, web-based diagramming tool that supports various diagram
types, including UML. It integrates with various cloud storage services and can be used
offline.
● Visual Paradigm: Visual Paradigm provides a comprehensive suite of tools for software
development, including UML diagramming. It offers both online and desktop versions and
supports a wide range of UML diagrams.
4.10.1 Use Case Diagram :
A use case diagram is a powerful tool in software engineering for capturing the functional
requirements of a system from the user's perspective. Here's a detailed description of a use case
diagram:
● Use case diagrams depict the interactions between users (actors) and the system,
showcasing the system's functionalities and how users interact with it.
● Actors represent different types of users or external systems that interact with the system to
achieve specific goals or tasks.
● Use cases represent individual functionalities or tasks that users can perform within the
system, often depicted as ovals within the diagram.
4.10.2 Class Diagram :
A class diagram is a fundamental component of software design, providing a visual representation of
the static structure of a system. Here's a detailed description of a class diagram:
● Class diagrams depict the structure of the system by illustrating the classes, attributes,
methods, and their relationships.
● Classes represent the building blocks of the system and encapsulate data and behavior into a
single unit.
● Attributes are the properties or characteristics of a class, representing the state of objects
belonging to that class.
● Methods define the behavior or operations that objects of a class can perform, encapsulating
the logic and functionality of the system.
Figure 4.4 : Complete Functionalities of Class Diagram
4.10.3 Object Diagram :
An object diagram is a structural diagram that provides a snapshot of the objects and their
relationships at a particular moment in time within a system. Here's a detailed description of an object
diagram:
● Representation of Instances: Object diagrams depict instances of classes and their
relationships in a specific scenario or context.
● Instantiation of Classes: Each object in the diagram represents a specific instance of a class,
showing the state of the system at a particular point in its execution.
● Objects and Attributes: Objects are represented as rectangles, with the name of the object
written inside, while their attributes and values are listed beneath them.
● Relationships between Objects: Relationships between objects are represented by lines
connecting them, indicating associations, dependencies, aggregations, or compositions.
4.10.4 Sequence Diagram :
A sequence diagram is a type of interaction diagram that visualizes the interactions between objects
in a specific scenario or sequence of events within a system. Here's a detailed description of a
sequence diagram:
● Illustration of Interactions: Sequence diagrams illustrate the interactions between objects or
components in a system over time, showcasing the flow of messages exchanged between
them.
● Time-based Representation: The vertical axis of the diagram represents time, with objects or
components arranged horizontally to indicate their order of execution.
● Object Lifelines: Each object participating in the interaction is represented by a lifeline,
depicted as a vertical dashed line extending downward from the object's name.
● Activation Bars: Activation bars, also known as execution occurrences, represent the duration
during which an object is active or engaged in processing a message.
4.11 Code :
Importing Required Modules
import difflib
import importlib
import math
import random
import string
random.seed(123)
import cv2
import gluonnlp as nlp
import leven
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import mxnet as mx
import numpy as np
from skimage import transform as skimage_tf, exposure
from tqdm import tqdm
from ocr.utils.expand_bounding_box import expand_bounding_box
from ocr.utils.sclite_helper import ScliteHelper
from ocr.utils.word_to_line import sort_bbs_line_by_line, crop_line_images
from ocr.utils.iam_dataset import IAMDataset, resize_image, crop_image, crop_handwriting_page
from ocr.utils.encoder_decoder import Denoiser, ALPHABET, encode_char, decode_char, EOS, BOS
from ocr.utils.beam_search import ctcBeamSearch
import ocr.utils.denoiser_utils
import ocr.utils.beam_search
importlib.reload(ocr.utils.denoiser_utils)
from ocr.utils.denoiser_utils import SequenceGenerator
importlib.reload(ocr.utils.beam_search)
from ocr.utils.beam_search import ctcBeamSearch
from ocr.paragraph_segmentation_dcnn import SegmentationNetwork, paragraph_segmentation_transform
from ocr.word_and_line_segmentation import SSD as WordSegmentationNet, predict_bounding_boxes
from ocr.handwriting_line_recognition import Network as HandwritingRecognitionNet, handwriting_recognition_transform
from ocr.handwriting_line_recognition import decode as decoder_handwriting, alphabet_encoding
# Compute context used by every model below (GPU when available, otherwise CPU).
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
Dataset Creation :
test_ds = IAMDataset("form_original", train=False)
random.seed(1)
figs_to_plot = 4
images = []
n=0
for i in range(0, figs_to_plot):
    n = int(random.random()*len(test_ds))
    image, _ = test_ds[n]
    images.append(image)

fig, axs = plt.subplots(int(len(images)/2), 2, figsize=(15, 10 * len(images)/2))
for i, image in enumerate(images):
    y, x = int(i/2), int(i%2)
    axs[y, x].imshow(image, cmap='Greys_r')
    axs[y, x].axis('off')
Paragraph Segmentation :
paragraph_segmentation_net = SegmentationNetwork(ctx=ctx)
paragraph_segmentation_net.cnn.load_parameters("models/paragraph_segmentation2.params", ctx=ctx)
paragraph_segmentation_net.hybridize()

form_size = (1120, 800)
predicted_bbs = []
fig, axs = plt.subplots(int(len(images)/2), 2, figsize=(15, 9 * len(images)/2))
for i, image in enumerate(images):
    s_y, s_x = int(i/2), int(i%2)
    resized_image = paragraph_segmentation_transform(image, form_size)
    bb_predicted = paragraph_segmentation_net(resized_image.as_in_context(ctx))
    bb_predicted = bb_predicted[0].asnumpy()
    bb_predicted = expand_bounding_box(bb_predicted, expand_bb_scale_x=0.03, expand_bb_scale_y=0.03)
    predicted_bbs.append(bb_predicted)
    axs[s_y, s_x].imshow(image, cmap='Greys_r')
    axs[s_y, s_x].set_title("{}".format(i))
    (x, y, w, h) = bb_predicted
    image_h, image_w = image.shape[-2:]
    (x, y, w, h) = (x * image_w, y * image_h, w * image_w, h * image_h)
    rect = patches.Rectangle((x, y), w, h, fill=False, color="r", ls="--")
    axs[s_y, s_x].add_patch(rect)
    axs[s_y, s_x].axis('off')
Image Processing :
segmented_paragraph_size = (700, 700)
fig, axs = plt.subplots(int(len(images)/2), 2, figsize=(15, 9 * len(images)/2))
paragraph_segmented_images = []
for i, image in enumerate(images):
    s_y, s_x = int(i/2), int(i%2)
    bb = predicted_bbs[i]
    image = crop_handwriting_page(image, bb, image_size=segmented_paragraph_size)
    paragraph_segmented_images.append(image)
    axs[s_y, s_x].imshow(image, cmap='Greys_r')
    axs[s_y, s_x].axis('off')
Line/word segmentation :
word_segmentation_net = WordSegmentationNet(2, ctx=ctx)
word_segmentation_net.load_parameters("models/word_segmentation2.params")
word_segmentation_net.hybridize()

min_c = 0.1
overlap_thres = 0.1
topk = 600
fig, axs = plt.subplots(int(len(paragraph_segmented_images)/2), 2,
                        figsize=(15, 5 * int(len(paragraph_segmented_images)/2)))
predicted_words_bbs_array = []
for i, paragraph_segmented_image in enumerate(paragraph_segmented_images):
    s_y, s_x = int(i/2), int(i%2)
    predicted_bb = predict_bounding_boxes(
        word_segmentation_net, paragraph_segmented_image, min_c, overlap_thres, topk, ctx)
    predicted_words_bbs_array.append(predicted_bb)
    axs[s_y, s_x].imshow(paragraph_segmented_image, cmap='Greys_r')
    for j in range(predicted_bb.shape[0]):
        (x, y, w, h) = predicted_bb[j]
        image_h, image_w = paragraph_segmented_image.shape[-2:]
        (x, y, w, h) = (x * image_w, y * image_h, w * image_w, h * image_h)
        rect = patches.Rectangle((x, y), w, h, fill=False, color="r")
        axs[s_y, s_x].add_patch(rect)
    axs[s_y, s_x].axis('off')
Word to line image processing :
line_images_array = []
fig, axs = plt.subplots(int(len(paragraph_segmented_images)/2), 2,
                        figsize=(15, 9 * int(len(paragraph_segmented_images)/2)))
for i, paragraph_segmented_image in enumerate(paragraph_segmented_images):
    s_y, s_x = int(i/2), int(i%2)
    axs[s_y, s_x].imshow(paragraph_segmented_image, cmap='Greys_r')
    axs[s_y, s_x].axis('off')
    axs[s_y, s_x].set_title("{}".format(i))
    predicted_bbs = predicted_words_bbs_array[i]
    line_bbs = sort_bbs_line_by_line(predicted_bbs, y_overlap=0.4)
    line_images = crop_line_images(paragraph_segmented_image, line_bbs)
    line_images_array.append(line_images)
    for line_bb in line_bbs:
        (x, y, w, h) = line_bb
        image_h, image_w = paragraph_segmented_image.shape[-2:]
        (x, y, w, h) = (x * image_w, y * image_h, w * image_w, h * image_h)
        rect = patches.Rectangle((x, y), w, h, fill=False, color="r")
        axs[s_y, s_x].add_patch(rect)
Handwriting recognition :
handwriting_line_recognition_net = HandwritingRecognitionNet(rnn_hidden_states=512,
                                                             rnn_layers=2, ctx=ctx, max_seq_len=160)
handwriting_line_recognition_net.load_parameters("models/handwriting_line8.params", ctx=ctx)
handwriting_line_recognition_net.hybridize()

line_image_size = (60, 800)
character_probs = []
for line_images in line_images_array:
    form_character_prob = []
    for i, line_image in enumerate(line_images):
        line_image = handwriting_recognition_transform(line_image, line_image_size)
        line_character_prob = handwriting_line_recognition_net(line_image.as_in_context(ctx))
        form_character_prob.append(line_character_prob)
    character_probs.append(form_character_prob)
Denoising the text output :
FEATURE_LEN = 150
denoiser = Denoiser(alphabet_size=len(ALPHABET), max_src_length=FEATURE_LEN,
                    max_tgt_length=FEATURE_LEN, num_heads=16, embed_size=256, num_layers=2)
denoiser.load_parameters('models/denoiser2.params', ctx=ctx)
denoiser.hybridize(static_alloc=True)
We use a language model in order to rank the propositions from the denoiser
ctx_nlp = mx.gpu(3)
language_model, vocab = nlp.model.big_rnn_lm_2048_512(dataset_name='gbw', pretrained=True,
                                                      ctx=ctx_nlp)
moses_tokenizer = nlp.data.SacreMosesTokenizer()
moses_detokenizer = nlp.data.SacreMosesDetokenizer()
We use beam search to sample the output of the denoiser
beam_sampler = nlp.model.BeamSearchSampler(beam_size=20,
                                           decoder=denoiser.decode_logprob,
                                           eos_id=EOS,
                                           scorer=nlp.model.BeamSearchScorer(),
                                           max_length=150)
generator = SequenceGenerator(beam_sampler, language_model, vocab, ctx_nlp, moses_tokenizer,
                              moses_detokenizer)
def get_denoised(prob, ctc_bs=False):
    if ctc_bs:  # using CTC beam search before denoising yields only limited improvements and is very slow
        text = get_beam_search(prob)
    else:
        text = get_arg_max(prob)
    # get_arg_max / get_beam_search are decoding helpers defined earlier in the notebook (not shown here)
    src_seq, src_valid_length = encode_char(text)
    src_seq = mx.nd.array([src_seq], ctx=ctx)
    src_valid_length = mx.nd.array(src_valid_length, ctx=ctx)
    encoder_outputs, _ = denoiser.encode(src_seq, valid_length=src_valid_length)
    states = denoiser.decoder.init_state_from_encoder(encoder_outputs, encoder_valid_length=src_valid_length)
    inputs = mx.nd.full(shape=(1,), ctx=src_seq.context, dtype=np.float32, val=BOS)
    output = generator.generate_sequences(inputs, states, text)
    return output.strip()
sentence = "This sentnce has an eror"
src_seq, src_valid_length = encode_char(sentence)
src_seq = mx.nd.array([src_seq], ctx=ctx)
src_valid_length = mx.nd.array(src_valid_length, ctx=ctx)
encoder_outputs, _ = denoiser.encode(src_seq, valid_length=src_valid_length)
states = denoiser.decoder.init_state_from_encoder(encoder_outputs, encoder_valid_length=src_valid_length)
inputs = mx.nd.full(shape=(1,), ctx=src_seq.context, dtype=np.float32, val=BOS)
print(sentence)
print("Choice")
print(generator.generate_sequences(inputs, states, sentence))
Qualitative Result :
for i, form_character_probs in enumerate(character_probs):
    fig, axs = plt.subplots(len(form_character_probs) + 1,
                            figsize=(10, int(1 + 2.3 * len(form_character_probs))))
    for j, line_character_probs in enumerate(form_character_probs):
        decoded_line_am = get_arg_max(line_character_probs)
        print("[AM]", decoded_line_am)
        decoded_line_bs = get_beam_search(line_character_probs)
        decoded_line_denoiser = get_denoised(line_character_probs, ctc_bs=False)
        print("[D ]", decoded_line_denoiser)
        line_image = line_images_array[i][j]
        axs[j].imshow(line_image.squeeze(), cmap='Greys_r')
        axs[j].set_title("[AM]: {}\n[BS]: {}\n[D ]: {}\n\n".format(decoded_line_am, decoded_line_bs,
                         decoded_line_denoiser), fontdict={"horizontalalignment": "left", "family": "monospace"}, x=0)
        axs[j].axis('off')
    axs[-1].imshow(np.zeros(shape=line_image_size), cmap='Greys_r')
    axs[-1].axis('off')
Quantitative Results :
sclite = ScliteHelper('../SCTK/bin')

def get_qualitative_results_lines(denoise_func):
    sclite.clear()
    test_ds_line = IAMDataset("line", train=False)
    for i in tqdm(range(1, len(test_ds_line))):
        image, text = test_ds_line[i]
        line_image = exposure.adjust_gamma(image, 1)
        line_image = handwriting_recognition_transform(line_image, line_image_size)
        character_probabilities = handwriting_line_recognition_net(line_image.as_in_context(ctx))
        decoded_text = denoise_func(character_probabilities)
        # ground-truth text is stored with HTML entities, so they are mapped back to plain characters
        actual_text = text[0].replace("&quot;", '"').replace("&apos;", "'").replace("&amp;", "&")
        sclite.add_text([decoded_text], [actual_text])

    cer, er = sclite.get_cer()
    print("Mean CER = {}".format(cer))
    return cer
def get_qualitative_results(denoise_func):
    sclite.clear()
    for i in tqdm(range(1, len(test_ds))):
        image, text = test_ds[i]
        resized_image = paragraph_segmentation_transform(image, image_size=form_size)
        paragraph_bb = paragraph_segmentation_net(resized_image.as_in_context(ctx))
        paragraph_bb = paragraph_bb[0].asnumpy()
        paragraph_bb = expand_bounding_box(paragraph_bb, expand_bb_scale_x=0.01, expand_bb_scale_y=0.01)
        paragraph_segmented_image = crop_handwriting_page(image, paragraph_bb,
                                                          image_size=segmented_paragraph_size)
        word_bb = predict_bounding_boxes(word_segmentation_net, paragraph_segmented_image,
                                         min_c, overlap_thres, topk, ctx)
        line_bbs = sort_bbs_line_by_line(word_bb, y_overlap=0.4)
        line_images = crop_line_images(paragraph_segmented_image, line_bbs)

        predicted_text = []
        for line_image in line_images:
            line_image = exposure.adjust_gamma(line_image, 1)
            line_image = handwriting_recognition_transform(line_image, line_image_size)
            character_probabilities = handwriting_line_recognition_net(line_image.as_in_context(ctx))
            decoded_text = denoise_func(character_probabilities)
            predicted_text.append(decoded_text)

        actual_text = text[0].replace("&quot;", '"').replace("&apos;", "'").replace("&amp;", "&")
        actual_text = actual_text.split("\n")
        if len(predicted_text) > len(actual_text):
            predicted_text = predicted_text[:len(actual_text)]
        sclite.add_text(predicted_text, actual_text)

    cer, _ = sclite.get_cer()
    print("Mean CER = {}".format(cer))
    return cer
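For illustration, the evaluation helpers defined above could be invoked with the denoiser-based decoder as follows (an assumed usage, mirroring the function signatures shown here):
cer_lines = get_qualitative_results_lines(get_denoised)  # mean CER on pre-segmented lines
cer_full = get_qualitative_results(get_denoised)         # mean CER for the full page-level pipeline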
Figure 4.7 : Quantitative Results
Training :
import os
from mxnet import gluon

# CNNBiLSTM, run_epoch and the hyper-parameters (num_downsamples, lstm_hidden_states,
# epochs, learning_rate, etc.) are defined earlier in the training script (not shown in this excerpt).
net = CNNBiLSTM(num_downsamples=num_downsamples, resnet_layer_id=resnet_layer_id,
                rnn_hidden_states=lstm_hidden_states, rnn_layers=lstm_layers,
                max_seq_len=max_seq_len, ctx=ctx)
net.hybridize()
ctc_loss = gluon.loss.CTCLoss(weight=0.2)

best_test_loss = 10e5
if os.path.isfile(os.path.join(checkpoint_dir, checkpoint_name)):
    net.load_parameters(os.path.join(checkpoint_dir, checkpoint_name))
    print("Parameters loaded")
    print(run_epoch(0, net, test_data, None, log_dir, print_name="pretrained", is_train=False))

pretrained = "models/handwriting_line8.params"
if os.path.isfile(pretrained):
    net.load_parameters(pretrained, ctx=ctx)
    print("Parameters loaded")
    print(run_epoch(0, net, test_data, None, log_dir, print_name="pretrained", is_train=False))
# Output:
# Parameters loaded
# 3.129511170468088

trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': learning_rate})
for e in range(epochs):
    train_loss = run_epoch(e, net, train_data, trainer, log_dir, print_name="train", is_train=True)
    test_loss = run_epoch(e, net, test_data, trainer, log_dir, print_name="test", is_train=False)
    if test_loss < best_test_loss:
        print("Saving network, previous best test loss {:.6f}, current test loss {:.6f}".format(
            best_test_loss, test_loss))
        net.save_parameters(os.path.join(checkpoint_dir, checkpoint_name))
        best_test_loss = test_loss
    if e % print_every_n == 0 and e > 0:
        print("Epoch {0}, train_loss {1:.6f}, test_loss {2:.6f}".format(e, train_loss, test_loss))
Model Distance between characters :
import numpy as np
import mxnet as mx
import difflib
from ocr.handwriting_line_recognition import Network as BiLSTMNetwork, decode as topK_decode
from ocr.utils.noisy_forms_dataset import Noisy_forms_dataset
from ocr.utils.ngram_dataset import Ngram_dataset
from ocr.utils.iam_dataset import resize_image
Decode noisy forms :
line_image_size = (60, 800)

def handwriting_recognition_transform(image):
    image, _ = resize_image(image, line_image_size)
    image = mx.nd.array(image) / 255.
    image = (image - 0.942532484060557) / 0.15926149044640417
    image = image.as_in_context(ctx)
    image = image.expand_dims(0).expand_dims(0)
    return image

def get_ns(is_train):
    network = BiLSTMNetwork(rnn_hidden_states=512, rnn_layers=2, max_seq_len=160, ctx=ctx)
    network.load_parameters("models/handwriting_line_sl_160_a_512_o_2.params", ctx=ctx)

    def noise_source_transform(image, text):
        image = handwriting_recognition_transform(image)
        output = network(image)
        predict_probs = output.softmax().asnumpy()
        return predict_probs

    ns = Noisy_forms_dataset(noise_source_transform, train=is_train, name="OCR_noise2",
                             topK_decode=topK_decode)
    return ns
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
train_ns = get_ns(is_train=True)
ng_train_ds = Ngram_dataset(train_ns, "word_5train", output_type="word", n=5)
# `substitutions` holds the (predicted, actual) character pairs collected from the
# noisy-forms dataset earlier in the script (not shown in this excerpt).
substitution_dict = {}
for subs in substitutions:
    if subs not in substitution_dict:
        substitution_dict[subs] = 0
    substitution_dict[subs] += 1
print(substitution_dict)

substitute_costs = np.ones((128, 128), dtype=np.float64)
for key in substitution_dict:
    key1, key2 = key
    substitute_costs[ord(key1), ord(key2)] = 0.9 if substitution_dict[key] <= 4 else 0.8
print(substitute_costs)
np.savetxt("models/substitute_costs.txt", substitute_costs, fmt='%4.6f')
CHAPTER 5
RESULTS
The results of our Optical Character Recognition (OCR) system showcase its robust
performance in text recognition tasks. Through iterative benchmarking and refinement, we
achieved significant milestones in accuracy assessment. Notably, our system demonstrated a
mean Character Error Rate (CER) of 8.4 on pre-segmented lines, indicative of its strong capability
in interpreting handwritten text. However, full pipeline testing revealed areas for potential
enhancements, with a recorded mean CER of 11.6. Quantitatively, our model displayed
consistency between training and real-world application scenarios, with a Training Intersection
over Union (IoU) of 0.593 and a Test IoU of 0.573.
These metrics underscore the system's proficiency and generalizability, essential traits for
robust OCR systems. Our approach encompassed various precision enhancements, including
rigorous training using the IAM Handwriting Database and advanced line and word detection
techniques employing custom anchor boxes. Furthermore, innovative decoding methods, such as
employing a CNN-BiLSTM system with LSTM for probabilistic predictions and CTC loss for
sequence alignment, significantly contributed to accuracy improvements. Comparative analysis
of decoding methods highlighted the efficacy of Beam Search with Lexicon Search and Language
Model, achieving the lowest CER at 21.058. Lastly, our system's user interface integration
demonstrated practicality and efficiency in handling text conversion tasks, further enhancing its
usability. These results collectively underscore the effectiveness of our OCR system and provide
valuable insights for future advancements in the field.
Figure 5.2 : Outcome of Optical Character Recognition.
The language translation feature in our project leverages the capabilities of Pdfplumber and
Google Translator to facilitate seamless translation of text content extracted from PDF documents
into different languages. This feature enhances accessibility and usability by enabling users to
overcome language barriers and access information in their preferred language. Pdfplumber is
employed to extract text content from PDF documents uploaded by users. The extracted text serves
as the input for the translation process, ensuring accurate and comprehensive translation results.
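A minimal sketch of this translation flow is given below; it assumes the deep-translator package's GoogleTranslator wrapper, a Telugu target language, and an illustrative file name sample.pdf, none of which are specified in the description above.
import pdfplumber
from deep_translator import GoogleTranslator

def translate_pdf(path, target_lang="te"):
    # Extract text from every page of the uploaded PDF.
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # Translate the extracted text into the user's preferred language.
    return GoogleTranslator(source="auto", target=target_lang).translate(text)

print(translate_pdf("sample.pdf", target_lang="te"))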
GPT-3.5 Turbo-Instruct is seamlessly integrated into the project's workflow to perform text
summarization tasks. Because the model is instruction-tuned, prompts are phrased as explicit
summarization instructions, which helps it produce high-quality summaries. Text inputs provided to
GPT-3.5 Turbo-Instruct are processed to generate summaries that capture the essence of the original
content, and the generated summaries are presented to users through the project's interface, providing
them with succinct overviews of the input text.
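The sketch below illustrates how such a summarization call could look; it assumes the official openai Python client, an environment-supplied API key, and an illustrative prompt and token limit that are not part of the description above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text, max_tokens=200):
    # gpt-3.5-turbo-instruct is served through the completions endpoint.
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt="Summarize the following text in a few sentences:\n\n" + text,
        max_tokens=max_tokens,
        temperature=0.3,
    )
    return response.choices[0].text.strip()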
The text-to-speech (TTS) feature in our project integrates the capabilities of gTTs (Google
Text-to-Speech) and Pdfplumber to convert text content extracted from PDF documents into audio
output. This section discusses the results obtained from implementing text-to-speech functionality
using gTTs and Pdfplumber and evaluates its impact on the project.
Pdfplumber is utilized to extract text content from PDF documents uploaded by users. The
extracted text serves as the input for the text-to-speech conversion process, ensuring accurate and
comprehensive audio output. gTTS (Google Text-to-Speech), a Python library that uses Google's
text-to-speech service, is seamlessly integrated into the project's workflow to perform text-to-speech
conversion. The extracted text is passed to gTTS along with parameters such as language and speaking
speed for conversion into audio output.
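A minimal sketch of this conversion, assuming the gtts and pdfplumber packages and an illustrative output file name, could look as follows.
import pdfplumber
from gtts import gTTS

def pdf_to_speech(pdf_path, audio_path="speech.mp3", lang="en"):
    # Pull the raw text out of the PDF with pdfplumber.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # Convert the text to speech and save it as an MP3 file.
    gTTS(text=text, lang=lang).save(audio_path)
    return audio_path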
The audio-to-text feature in our project integrates AssemblyAI, Speech Recognition, and Pydub to
transcribe audio content into text format. This section presents the results obtained from
implementing audio-to-text functionality using these technologies and evaluates its impact on the
project.
Pydub is employed for audio processing tasks such as loading audio files, format conversion,
and manipulation. Audio files are pre-processed using Pydub to ensure compatibility and optimal
quality before transcription. AssemblyAI, with an accuracy of 90%, is utilized for accurate and reliable
transcription of audio content. The audio data pre-processed with Pydub is submitted to AssemblyAI's
cloud-based API for transcription. The SpeechRecognition library is incorporated to provide a unified
interface for interacting with different ASR engines and APIs; this flexibility and compatibility with
multiple ASR services enhances the transcription process.
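The sketch below illustrates this audio-to-text flow; the AssemblyAI Python SDK calls, the placeholder API key, and the file names are assumptions for illustration rather than the project's exact code.
import assemblyai as aai
from pydub import AudioSegment

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder, supplied via configuration in practice

def audio_to_text(input_path):
    # Normalise the upload to 16 kHz mono WAV with Pydub before transcription.
    audio = AudioSegment.from_file(input_path).set_frame_rate(16000).set_channels(1)
    audio.export("prepared.wav", format="wav")
    # Submit the prepared audio to AssemblyAI's cloud API and return the transcript text.
    transcript = aai.Transcriber().transcribe("prepared.wav")
    return transcript.text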
The text-to-handwritten feature in our project utilizes the Python Imaging Library (PIL), now
known as Pillow, to generate handwritten-style images from text input. This section presents the
results obtained from implementing text-to-handwritten functionality using PIL and evaluates its
impact on the project. PIL is seamlessly integrated into the project's workflow to facilitate the
generation of handwritten-style images from text. The library's image manipulation capabilities are
leveraged to create visually appealing handwritten text representations. Handwritten-style fonts are
carefully selected to achieve the desired aesthetics and readability for the generated images. Various
handwritten fonts are explored and customized to ensure consistency and visual appeal across
different text inputs.
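A simplified sketch of this generation step is shown below; the font file path, canvas size, ink colour, and line spacing are illustrative assumptions rather than the project's exact settings.
from PIL import Image, ImageDraw, ImageFont

def text_to_handwritten(text, font_path="fonts/handwriting.ttf", out_path="handwritten.png"):
    font = ImageFont.truetype(font_path, 36)      # assumed handwriting-style TTF font
    lines = text.split("\n")
    width, height = 1200, 60 * len(lines) + 80
    page = Image.new("RGB", (width, height), "white")   # blank page-style canvas
    draw = ImageDraw.Draw(page)
    y = 40
    for line in lines:
        draw.text((60, y), line, font=font, fill=(20, 20, 90))  # ink-blue text
        y += 60
    page.save(out_path)
    return out_path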
The paraphrasing feature in our project employs the Pegasus Tokenizer model, Torch, and
Pdfplumber to generate paraphrased versions of text content. This section presents the results
obtained from implementing paraphrasing functionality using these technologies and evaluates its
impact on the project.
The Pegasus Tokenizer model is utilized for paraphrasing text by generating alternative
phrasings while preserving the original meaning. The model is fine-tuned to optimize its performance
for paraphrasing tasks, ensuring accurate and contextually appropriate output. Torch, a powerful deep
learning framework, is integrated into the project's workflow to support the implementation and
deployment of the Pegasus Tokenizer model. Torch provides the necessary infrastructure for training,
inference, and optimization of the paraphrasing model. Pdfplumber is used for text extraction from
PDF documents, providing a source of text content for paraphrasing. The extracted text serves as
input to the paraphrasing model, enabling the generation of paraphrased versions of PDF content.
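The following sketch outlines how such a paraphrasing call could be wired together with Torch and the Hugging Face transformers library; the public checkpoint name tuner007/pegasus_paraphrase and the generation settings are assumptions for illustration, not the project's confirmed configuration.
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"   # assumed public paraphrasing checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

def paraphrase(sentence, num_return_sequences=3):
    # Encode the input sentence and generate several alternative phrasings with beam search.
    batch = tokenizer([sentence], truncation=True, padding="longest",
                      max_length=60, return_tensors="pt").to(device)
    outputs = model.generate(**batch, max_length=60, num_beams=10,
                             num_return_sequences=num_return_sequences)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)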
The mathematical expression solving feature in our project harnesses the capabilities of
Google Generative AI to solve complex mathematical equations and expressions. This section
presents the results obtained from implementing mathematical expression solving using Google
Generative AI and evaluates its impact on the project.
Google Generative AI, a state-of-the-art deep learning model, is integrated into the
project's workflow to tackle mathematical expression solving tasks. The model is trained on a
diverse dataset of mathematical expressions to develop a robust understanding of mathematical
concepts and operations. We assess the accuracy of the solutions generated by Google Generative AI
for mathematical expressions, evaluate the model's ability to handle complex expressions involving
multiple variables, functions, and operators, and analyse the speed and efficiency of the
expression-solving process to ensure timely generation of solutions.
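A minimal sketch of this expression-solving call, assuming the google-generativeai Python package, a placeholder API key, and the gemini-pro model name (not stated in the description above), could look as follows.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")   # placeholder key supplied via configuration
model = genai.GenerativeModel("gemini-pro")      # assumed model name for this deployment

def solve_expression(expression):
    # Ask the model for a step-by-step solution and a clearly stated final answer.
    prompt = ("Solve the following mathematical expression step by step "
              "and state the final answer clearly:\n" + expression)
    response = model.generate_content(prompt)
    return response.text

print(solve_expression("2*x + 3 = 11"))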
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
Furthermore, our research introduces novel methodologies such as custom anchor box
designs within the SSD framework, leading to enhanced line segmentation and text localization
accuracy. Through iterative tuning of Mean Square Error (MSE) and Intersection over Union
(IoU) metrics, we ensured high precision in character bounding box predictions and text
localization. Text Summarization leveraging the capabilities of GPT-3.5 Turbo-Instruct could
facilitate efficient extraction of key insights from text content. Paraphrasing using Pegasus
Tokenizer model, Torch, and Pdfplumber could offer users alternative versions of text content.
CHAPTER 7
REFERENCES
[2] AWS Labs. Handwritten Text Recognition for Apache MXNet. [Online]. Available:
https://github.com/awslabs/handwritten-text-recognition-for-apache-mxnet/tree/master.
[4] F. Kızılırmak and B. Yanıkoğlu, "CNN-BiLSTM model for English Handwriting Recognition:
Comprehensive Evaluation on the IAM Dataset," Research Square, Nov. 17, 2022. [Online]. Available:
https://doi.org/10.21203/rs.3.rs-2274499/v1.
[5] L. Jiao, H. Wu, H. Wang, and R. Bie, "Text Recovery via Deep CNN-BiLSTM Recognition and
Bayesian Inference," in IEEE Access, vol. 6, pp. 76416-76428, 2018, doi:
10.1109/ACCESS.2018.2882592.
[6] N. Bhardwaj, "A Research on Handwritten Text Recognition," International Journal of Scientific
Research in Engineering and Management, vol. 06, no. 04, Apr. 2022, doi: 10.55041/ijsrem12649.
[7] L. Kumari, S. Singh, V. V. S. Rathore, and A. Sharma, "Lexicon and attention-based handwritten text
recognition system," Machine Graphics and Vision, vol. 31, no. 1/4, pp. 75–92, Dec. 2022, doi:
10.22630/mgv.2022.31.1.4.
[8] A. Ansari, B. Kaur, M. Rakhra, A. Singh, and D. Singh, "Handwritten Text Recognition using Deep
Learning Algorithms," 2022 4th International Conference on Artificial Intelligence and Speech
Technology (AIST), Dec. 2022, doi: 10.1109/aist55798.2022.10065348.
[9] R. G. Khalkar, A. S. Dikhit, and A. Goel, "Handwritten Text Recognition using Deep Learning (CNN
& RNN)," IARJSET, vol. 8, no. 6, pp. 870–881, Jun. 2021, doi: 10.17148/iarjset.2021.861
[10] Apache MXNet. (2017). OCR with MXNet Gluon [PowerPoint slides]. Available:
https://www.slideshare.net/apachemxnet/ocrwith-mxnet-gluon#11.
[11] Z. Chen, K. Wu, Y. Li, M. Wang, and W. Li, “SSD-MSN: An Improved Multi-Scale Object Detection
Network Based on SSD,” IEEE Access, vol. 7, pp. 80622–80632, 2019, doi:
10.1109/access.2019.2923016.
[12] S. Zhai, D. Shang, S. Wang, and S. Dong, “DF-SSD: An Improved SSD Object Detection Algorithm
Based on DenseNet and Feature Fusion,” IEEE Access, vol. 8, pp. 24344–24357, 2020, doi:
10.1109/access.2020.297102
[13] Y. Jiang, T. Peng, and N. Tan, “CP-SSD: Context Information Scene Perception Object Detection
Based on SSD,” Applied Sciences, vol. 9, no. 14, p. 2785, Jul. 2019, doi: 10.3390/app9142785.
[14] R. Shrestha, O. Shrestha, M. Shakya, U. Bajracharya, and S. Panday, “Offline Handwritten Text
Extraction and Recognition Using CNN-BLSTM-CTC Network," International Journal on Engineering
Technology, vol. 1, no. 1, pp. 166–180, Dec. 2023, doi: 10.3126/injet.v1i1.60941.
[15] Dr. L. Jain, “Hand Written Character Recognition Using CNN Model,” International Journal for
Research in Applied Science and Engineering Technology, vol. 12, no. 1, pp. 1094–1103, Jan. 2024, doi:
10.22214/ijraset.2024.58045.
[16] R. Shrestha, O. Shrestha, M. Shakya, U. Bajracharya, and S. Panday, “Offline Handwritten Text
Extraction and Recognition Using CNN-BLSTM-CTC Network," International Journal on Engineering
Technology, vol. 1, no. 1, pp. 166–180, Dec. 2023, doi: 10.3126/injet.v1i1.60941.
[17] M. Bisht and R. Gupta, "Offline Handwritten Devanagari Word Recognition Using CNN-RNN-CTC,"
SN Computer Science, vol. 4, no. 1, Dec. 2022, doi: 10.1007/s42979-022-01461-x.
[18] T. Document, "Manifold Mixup Improves Text Recognition with CTC loss," arXiv:1903.04246
[cs.CV], Mar. 2019. Available: https://doi.org/10.48550/arXiv.1903.04246.
[19] R. Shrestha, O. Shrestha, M. Shakya, U. Bajracharya, and S. Panday, “Offline Handwritten Text
Extraction and Recognition Using CNN-BLSTM-CTC Network," International Journal on Engineering
Technology, vol. 1, no. 1, pp. 166–180, Dec. 2023, doi: 10.3126/injet.v1i1.60941.
[20] D. A. Nirmalasari, N. Suciati, and D. A. Navastara, “Handwritten Text Recognition using Fully
Convolutional Network,” IOP Conference Series: Materials Science and Engineering, vol. 1077, no. 1, p.
012030, Feb. 2021, doi: 10.1088/1757- 899x/1077/1/012030.
[21] S. Y. H. Su, M. Baray, and R. L. Carberry, "A system modeling language translator," Jan. 1971, doi:
10.1145/800158.805058.
[22] R. Parimoo, R. Sharma, N. Gaur, N. Jain, and S. Bansal, "A Review on Text Summarization
Techniques," International Journal for Research in Applied Science and Engineering Technology, May
2022, doi: 10.22214/ijraset.2022.42358.
[23] S. Lukose and S. S. Upadhya, "Text to speech synthesizer-formant synthesis," Jan. 2017, doi:
10.1109/icnte.2017.7947945.
[24] C. Lu, X. Chen, J. Li, and Y. Huang, "Research on Audio-Video Synchronization of Sound and Text
Messages," Oct. 2013, doi: 10.1109/iscid.2013.177.
[25] Z. Yu, D. Jin, J. Wei, Y. Li, Z. Liu, Y. Shang, J. Han, and L. Wu, "TeKo: Text-Rich Graph Neural
Networks With External Knowledge," IEEE Transactions on Neural Networks and Learning Systems,
Jan. 2024, doi: 10.1109/tnnls.2023.3281354.
[26] D. Dale, A. Voronov, D. Dementieva, V. Logacheva, O. Kozlova, N. Semenov, and A. Panchenko,
"Text Detoxification using Large Pre-trained Neural Models," arXiv (Cornell University), Sep. 2021,
doi: 10.48550/arxiv.2109.08914.
[27] J. Huang, J. Tan, and N. Bi, "Overview of Mathematical Expression Recognition," Lecture Notes in
Computer Science, Jan. 2020, doi: 10.1007/978-3-030-59830-3_4.