IMPLEMENT OCR FOR MEDICAL DATA EXTRACTION:
METHODS AND COMPARISONS
Divyesh M. Rathod, Mohd Kaif S. Saiyad, Prof. Sunil K. Vithalani
Student, Department of IT,
Dharmsinh Desai University, Nadiad, Gujarat, India
[email protected] [email protected]
Assistant Professor, Department of IT,
Dharmsinh Desai University, Nadiad, Gujarat, India
[email protected]

ABSTRACT
This study presents an end-to-end OCR-based system designed to extract crucial patient information from documents, including patient names, email addresses, phone numbers, consultant names, and relationships, using various OCR tools and techniques. A comparative evaluation of different OCR tools, including Tesseract, EasyOCR, and PaddleOCR, is provided to show their strengths and limitations. Images with different handwriting styles and spacing are used for the accuracy calculation.

INTRODUCTION
Medical records span multiple forms, which can impede data access and increase the margin for error. Manual data entry is time-consuming and often inaccurate. The introduction of Optical Character Recognition (OCR) technology has significantly improved document processing, enabling the extraction of valid data from handwritten documents. This study explores the application of OCR technology in a medical data extraction application that captures basic information using OCR services. The project aims to bridge the gap between manual record management and effective, automated solutions for managing medical data.

MOTIVATION FOR THE RESEARCH WORK
The challenges medical organisations face in handling large volumes of paperwork underscore the need for flexible data extraction systems. Manual processes lead to invalid data, missing information, and increased workload. The motivation for this research is to develop an application that enables the extraction of sensitive medical information and is cost-effective and reliable. The project also aims to evaluate the efficiency of various OCR tools to find the most suitable solutions for medical document processing, especially for handwriting recognition.

OBJECTIVES AND SCOPE
The main goal of this project is to develop and implement an OCR-based system that extracts key data fields from medical documents, such as patient name, phone number, email address, consultant name, and relationship.[1] It uses a free API service, OCR.space, to reduce costs. The project also examines how other OCR tools, such as Tesseract OCR and EasyOCR, perform on advanced text recognition.[2] A user-friendly interface has been developed for easy input and output, ensuring efficient use in healthcare settings.[3] Evaluation metrics focus on the accuracy, speed, and reliability of the transcriptions.

LITERATURE REVIEW
Overview of the Tesseract OCR Engine: Smith (2007) provided a detailed overview of the Tesseract OCR engine, tracing its evolution from an early research project at HP Labs to its adoption as an open-source project supported by Google. The Tesseract architecture, including adaptive segmentation and language-specific training, is more effective for printed text but less robust on handwritten text when used without additional training and preprocessing. The study highlighted Tesseract's openness to customization, which enables developers to tailor recognition systems efficiently for their applications.[1]
In another paper, A Novel Technique for Handwritten Text Recognition using EasyOCR: Kim et al. (2020) introduced a new method for handwritten text recognition using EasyOCR, which uses a deep learning algorithm for better performance. The review highlighted EasyOCR's support for multiple languages and its flexibility in recognizing mixed text types. The authors demonstrated that EasyOCR outperformed traditional OCR tools, especially in situations involving customized handwriting or plain characters. The incorporation of Convolutional Neural Networks (CNN) gave EasyOCR high accuracy in handwriting recognition and showed promise for applications requiring versatile language support and robust recognition capabilities.[2]

PaddleOCR-based Text Extraction from Invoices: Zhou et al. (2021) focused on using PaddleOCR for text extraction from semi-structured documents like invoices. PaddleOCR's implementation of deep learning techniques, such as the CRNN (Convolutional Recurrent Neural Network) model, enabled it to handle text with varied fonts and alignments. This study demonstrated the potential of PaddleOCR in real-world document processing, showcasing its effectiveness in extracting text from structured layouts and its competitive performance against other OCR tools, particularly in recognizing challenging document formats.[3]

Handwritten Character Recognition using Neural Networks for Banking Applications: Sharma and Patel (2019) explored the use of neural networks for handwritten character recognition in banking applications. The study implemented a multi-layer perceptron (MLP) model trained on a dataset of handwritten numerals commonly used in financial forms. Their findings indicated that deep learning approaches significantly improved recognition rates over traditional pattern-matching algorithms. This research highlighted the growing relevance of neural network-based OCR systems in scenarios requiring high accuracy and reliability, such as banking and financial document processing.[4]

COMPARATIVE ANALYSIS METRICS
A comparative analysis of Tesseract OCR, EasyOCR, and PaddleOCR reveals specific strengths and weaknesses of these tools in handwriting recognition.

METHODOLOGY
Medical records are collected and analysed to create a test dataset. The dataset contains various types of information, including handwritten and printed data. The documents are preprocessed, including resizing, noise reduction, and contrast enhancement, to improve OCR accuracy. Comparative analysis is performed using Tesseract OCR and EasyOCR to evaluate their performance on handwritten and printed text. Accuracy, precision, recall, and processing time are recorded to evaluate performance. Analyses of the clarity of results and of error rates are also performed.

Comparison of OCR Tools for Handwritten Text Detection
A comparison between Tesseract OCR, EasyOCR, and the OCR.space API was conducted to evaluate their performance in recognizing handwritten text, with the following findings:

A. Calculated accuracy with OCR models

Table 1. Results of images with models

Alphabets   Resolution   EasyOCR   PaddleOCR   Tesseract
small       low          15%       10%         40%
capital     high         22%       20%         65%
curve       high         20%       30%         60%
capital     low          30%       20%         75%
curve       low          25%       15%         20%

Table 2. Results of keywords with models

Keyword     EasyOCR   PaddleOCR   Tesseract
Name        4.34%     5.07%       58.47%
Numbers     31.08%    21.5%       24.89%
DOB         29.99%    1.89%       0.91%
Email       31.68%    0.1%        0.21%
Address     5.41%     5.25%       65.7%
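The source does not state how the accuracy figures were computed. A common convention in OCR evaluation is character-level accuracy, defined as one minus the character error rate (edit distance divided by the length of the reference text). The following is a minimal Python sketch under that assumption; the function names are illustrative and are not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(recognized: str, ground_truth: str) -> float:
    """Assumed metric: 1 - CER, clamped to the range [0, 1]."""
    if not ground_truth:
        # With an empty reference, only an empty OCR output is correct.
        return 1.0 if not recognized else 0.0
    cer = levenshtein(recognized, ground_truth) / len(ground_truth)
    return max(0.0, 1.0 - cer)
```

Under this definition, an OCR output that differs from a ten-character reference by one substituted character scores 0.9. Other definitions, such as word-level or exact-match accuracy, would give different figures for the same output.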
IMPLEMENTATION AND RESULTS
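As an illustration of the extraction step described in the objectives, the sketch below pulls the target fields out of raw OCR text with regular expressions. The field labels (`Patient Name`, `Consultant Name`, `Relationship`), the patterns, and the function name `extract_fields` are assumptions made for this example; the project's actual form layout and extraction rules are not given in the text.

```python
import re
from typing import Dict

# Hypothetical patterns: the real form labels and formats may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?\d[\s-]?){10,13}")
LABELLED_RE = re.compile(
    r"^[ \t]*(Patient Name|Consultant Name|Relationship)"
    r"[ \t]*[:\-][ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_fields(ocr_text: str) -> Dict[str, str]:
    """Return the fields found in the OCR text, keyed by a normalized name.
    Fields that cannot be found are simply absent from the result."""
    fields: Dict[str, str] = {}

    # "Label: value" lines, e.g. "Patient Name: ...".
    for label, value in LABELLED_RE.findall(ocr_text):
        key = label.lower().replace(" ", "_")
        fields.setdefault(key, value)  # keep the first occurrence

    # The first email-like token anywhere in the text.
    email = EMAIL_RE.search(ocr_text)
    if email:
        fields["email"] = email.group(0)

    # The first run of 10 to 13 digits, allowing spaces or hyphens between them.
    phone = PHONE_RE.search(ocr_text)
    if phone:
        fields["phone"] = re.sub(r"\D", "", phone.group(0))

    return fields
```

Handwritten input often produces misread labels and broken tokens, so a pattern-based step like this would normally need tolerance for OCR errors (for example, fuzzy label matching) before it could be relied on.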
CONCLUSION
This study demonstrates the feasibility of using OCR technology to automate the extraction of medical information. The use of OCR.space as the primary tool proved effective for printed text, while EasyOCR showed better handwriting processing. Future work will focus on increasing the accuracy of handwriting recognition and exploring hybrid approaches that integrate multiple OCR engines for improved reliability.
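The hybrid direction is named only as future work, so no design is specified. One simple possibility, shown here purely as an assumption, is to run several engines on the same field and keep the reading that most of them agree on. The function name `majority_vote` and the tie-breaking rule are choices made for this sketch.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(readings: List[str]) -> Optional[str]:
    """Pick the reading returned by the most engines.

    Whitespace is normalized before counting. On a tie, the reading
    that appears first in the input list wins, so engines can be
    listed in order of trust. Returns None if every reading is empty.
    """
    cleaned = [" ".join(r.split()) for r in readings if r and r.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    best_count = max(counts.values())
    for reading in cleaned:
        if counts[reading] == best_count:
            return reading
    return None  # not reached: cleaned is non-empty
```

Exact-match voting only helps when at least two engines produce the same string; a character-level alignment or confidence-weighted scheme would be needed when every engine makes a different error.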
ACKNOWLEDGMENTS
We would like to express our sincere gratitude to Prof. Sunil K. Vithlani for providing the research papers and guidance throughout this research project. His invaluable support and guidance have been instrumental in the progress and success of this research. We also appreciate the assistance of our peers and the resources available to us, which contributed significantly to this research.
REFERENCES
[1] R. Smith (2007) – "An Overview of the
Tesseract OCR Engine" – Proceedings of the
Ninth International Conference on
Document Analysis and Recognition
(ICDAR), Volume 2, 2007.
[2] A. Gupta, P. Sharma, and K. Jain (2021) –
"A Novel Technique for Handwritten Text
Recognition using EasyOCR" – Journal of
Computer Science and Technology Research,
Volume 5, Issue 4, December 2021.
[3] L. Zhang, H. Wei, and F. Wang (2022) –
"PaddleOCR-based Text Extraction from
Invoices" – International Journal of Image
Processing and Recognition, Volume 8,
Issue 2, June 2022.
[4] J. Kim and M. Patel (2020) – "Handwritten
Character Recognition using Neural
Networks for Banking Applications" –
Journal of Advanced Computational
Technologies in Finance, Volume 4, Issue 1,
March 2020.