
Published in Towards Data Science

Qaisar Tanvir

Aug 7, 2021 · 18 min read

Multi Page Document Classification using Machine Learning and NLP

An approach to classifying documents with varying shapes, text, and page sizes.

Source: https://unsplash.com/photos/5cFwQ-WMcJU

This article describes a novel Multi Page Document Classification solution approach, which leverages advanced machine learning and textual analytics to solve one of the major challenges in the mortgage industry.

Abstract
Even in today’s technological era, most business is done using documents, and the amount of paperwork involved varies from industry to industry. Many of these industries need to scan through scanned document images (which usually contain non-selectable text) to get the information for the key index fields needed to operate their daily tasks.

To achieve this, the first major task is to index the different types of documents, which later helps in the extraction of information and metadata from a variety of complex documents. This blog post will show how advanced machine learning and NLP techniques can be leveraged to solve this major part of the puzzle, formally called Document Classification.

Introduction
In the mortgage industry, different companies perform mortgage loan audits for thousands of people.

Each individual audit is performed on an assortment of documents, submitted as a bundle called a Loan Package. A package is a combination of scanned pages, which can range from roughly 100 to 400 pages. There are multiple sub-components within the package, which may consist of roughly 1 to 30 pages each. Such sub-components are called Documents or document classes. The following table represents this visually.

The second row represents a package and the numbers show the ordered pages present in that package. The third row shows the occurrence of different kinds of documents within the package.

Background and Problem Statement


Traditionally, while evaluating loan audits, Document Classification is one of the major parts of the manual effort. The mortgage companies mostly outsource this work to third-party BPO companies, which execute this task using manual or partially automated classification techniques, e.g., rule engines and template matching. The underlying problem faced by the current implementations is that the Business Process Outsourcing (BPO) staff has to manually find and sort the documents present in the packages.

Although some degree of automation is achieved by a few third-party companies using keyword searches, regular expressions, etc., the accuracy and robustness of such solutions are questionable and the reduction in manual workload is still not satisfactory. Keyword searches and regular expressions mean that these solutions need to account for every new document or document variation that is presented, and rules need to be added for each. This in itself becomes a manual effort, and only partial automation is achieved. There still remains a chance that the system identifies a document class as “Doc A” when it is in fact “Doc B”, because of common rules present in both. Additionally, there is no degree of certainty attached to an identification. More often than not, manual verification is still required.

There are several hundred document types, so the BPO staff needs to have a knowledge base of how a certain document looks and what the different variations of the same document are, in order to classify documents. On top of that, when the manual workload is too high, human error tends to increase.

Objective

The document classification solution should significantly reduce the manual human effort. It should achieve a higher level of accuracy and automation with minimal human intervention.
The solution approach we will be discussing in this series of blogs is not limited to the mortgage industry; it can be applied wherever there are scanned document images and sorting of such documents is required. A few of the possible industries are financial organizations, academia, research institutes, and retail stores.

Characteristics of the documents


In order to build a solution pipeline, the first step is to know what the data is and what its different characteristics are. Since we have been working in the mortgage domain, we will define the characteristics of the data we process in the mortgage industry.

Within a package, there are many types of pages, but generally, these can be
categorized into three types:

Structured | Consistent forms and templates


For example: Surveys, Questionnaires, Tests, Claim Forms

Unstructured | Textual, no formatting and tables

For example: Contracts, Letters, Articles, Notes

Semi-Structured | Hybrid of above two, may have partial structure

For example: Invoices, Purchase Orders, Billings, EOBs

In terms of documents, the following are the characteristics which are observed in
the data.


The documents present in the packages are not in a consistent order. For example, in one package document “A” might come after document “B”, and in another it is the other way around.

There are many variations of the same document class. One document class can have different-looking variations; for example, a document class “A” page template/format might change across different US states. Within the mortgage domain, these represent the same information but differ in formatting and contents. In other words, if “cat” is a document, different breeds of cats would be the “variations”.

The document types have different kinds of scan deformities, e.g., noise, 2D and 3D rotations, bad scan quality, and page orientation issues, which degrade the OCR output for those documents.

Solution Methodology
In this section, we will abstractly explain how our solution pipeline works and how each component or module comes together to produce an end-to-end pipeline. The following flow diagram summarizes the solution.

Solution | Flow Diagram

Since the goal is to identify the documents within the package, we had to identify what characteristics within a document make it different from another one.


In our case, we decided that the text present in the document is the key, because intuitively we humans also do it this way. The next challenge was to figure out the location of each document within the package. In the case of multi-page documents, the boundary pages (start, end) have the most significance, because using these pages a document's page range can be identified.

Machine Learning Classes


In terms of machine learning, we treated this problem as a classification problem, where we decided to identify the first and last pages of each document. We categorized our Machine Learning classes (ML classes) into three types:

First Page Classes: These classes are the first pages of each document class and are responsible for identifying the start of a document.

Last Page Classes: These classes are the last pages of each document class and are responsible for identifying the end of a document. These classes are created only for the document classes that have samples with more than one page.

Other Class: This is a single class which contains the middle pages of all the document classes combined. Having this class helps the pipeline in the later stages: it reduces the instances where a middle page of a document is classified as the first or last page of the same document, which intuitively is possible because there can be similarities between all the pages, such as headers, footers and templates. This allows the model to learn more robust features.

The following diagram represents how these different types of ML classes look in terms of packages and documents.

Here A, B are the first page classes of documents A and B. Moreover, A-last, B-last are the last page classes of the same documents. All the middle pages of any document class are considered the Other class.

Machine Learning Engine


Once the ML classes are defined, the next step is to prepare the dataset for training the Machine Learning Engine (the data preparation part is discussed in detail in the next sections). The following diagram explains the inner workings of the Machine Learning Engine and is a more technical view of the solution pipeline.

End-to-End classification cycle.

Let’s describe the different phases of the solution step by step.

Step 1

The package (which is in PDF format) is split into individual pages (images).

Step 2

The individual pages are processed through OCR (Optical Character Recognition), which extracts the text from each image and generates text files. We have used a state-of-the-art OCR engine to produce the text in our case; there are also many free OCR offerings which can be used in this step.
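As an illustration only (the article uses a state-of-the-art OCR engine that is not named or included here), a minimal sketch of Steps 1 and 2 using the open-source pdf2image and pytesseract libraries could look like this:

```python
# Sketch of Steps 1-2: split a loan package PDF into page images and run OCR
# on each page. pdf2image and pytesseract are open-source stand-ins, not the
# engine used in the original pipeline; the file name is hypothetical.
from pdf2image import convert_from_path  # requires the poppler utilities
import pytesseract                       # requires the tesseract binary

def extract_page_texts(package_pdf_path):
    """Return one OCR text string per page of the package PDF."""
    pages = convert_from_path(package_pdf_path, dpi=300)  # list of PIL images
    return [pytesseract.image_to_string(page) for page in pages]

page_texts = extract_page_texts("loan_package.pdf")
print(f"Extracted text for {len(page_texts)} pages")
```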

Step 3

The text corresponding to each page is then passed to the Machine learning
engine where the Text Vectorizer (Doc2Vec) generates its feature vector
representation, which essentially is a list of floats.

Step 4


The feature vectors are then passed to the classifier (Logistic Regression). The classifier then predicts the class for each feature vector, which is essentially one of the ML classes we have previously discussed (first, last or other).

Additionally, the classifier returns the confidence scores for all the ML classes (the section on the far right of the diagram). For example, let (D1, D2, ...) be the ML classes; then for a single page the results may look like the following.
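As a minimal sketch of Steps 3 and 4, assuming a trained gensim Doc2Vec model and a fitted scikit-learn classifier (their training and the `clean_tokens` helper are covered later in this post), classifying a single page and collecting its confidence scores might look like this:

```python
# Sketch of Steps 3-4: vectorize one page's OCR text with Doc2Vec and get the
# predicted ML class plus per-class confidence scores from the classifier.
import numpy as np

def classify_page(page_text, d2v_model, classifier):
    # `clean_tokens`, `d2v_model` and `classifier` come from the cleaning and
    # training steps described later in the article (hypothetical names).
    tokens = clean_tokens(page_text)
    vector = d2v_model.infer_vector(tokens)          # fixed-length float vector
    probs = classifier.predict_proba([vector])[0]    # one score per ML class
    pred_class = classifier.classes_[int(np.argmax(probs))]
    scores = dict(zip(classifier.classes_, probs))   # e.g. {"D1": 0.91, "D1-last": 0.03, ...}
    return pred_class, scores
```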

Post Processing
Once the whole package is processed, we use the results/predictions to identify the boundaries of the documents. The results contain the predicted class and the confidence scores of the predictions for all the pages of the package. See the following table.

The following is the simple algorithm and the steps used to identify the document boundaries using the output from the Machine Learning Engine.


Document Range Identification | Post-Processing Algorithm
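The exact post-processing rules are given in the figure above; as a rough, illustrative approximation of the idea (pairing first-page and last-page predictions into page ranges, with "Other" pages absorbed into the current document), a sketch could be:

```python
# Rough approximation of the post-processing step: turn per-page predictions
# ("A", "A-last", "Other", ...) into document page ranges. The exact rules in
# the article's figure may differ; this is only an illustrative sketch.
def identify_document_ranges(page_predictions):
    """page_predictions: list of predicted ML classes, one per package page."""
    ranges = []                      # (doc_class, start_page, end_page), 1-based
    current_class, start = None, None
    for page_no, pred in enumerate(page_predictions, start=1):
        if pred.endswith("-last"):
            if current_class and pred == f"{current_class}-last":
                ranges.append((current_class, start, page_no))
                current_class, start = None, None
        elif pred != "Other":        # a first-page class starts a new document
            if current_class:        # previous document had no explicit last page
                ranges.append((current_class, start, page_no - 1))
            current_class, start = pred, page_no
    if current_class:                # close a document still open at package end
        ranges.append((current_class, start, len(page_predictions)))
    return ranges

print(identify_document_ranges(["A", "Other", "A-last", "B", "B-last", "Other"]))
# -> [('A', 1, 3), ('B', 4, 5)]
```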

Solution Details (Deep Dive TL;DR)

Data Preparation
In developing an end-to-end document classification pipeline, the very first, and arguably the most important, step is data preparation, because the solution is only as good as the data it uses. The data we used for our experiments were documents from the mortgage domain. The strategies we adopted can be applied to any document dataset in a similar fashion. The following steps were performed.

Definition: A Document Sample is an instance of a particular document. Usually it is a (PDF) file containing only the pages of that document.

Step 1

The first step is to decide which documents within a package are to be recognized and classified. Ideally, all the documents present in the packages should be selected. Once the document classes are decided, we move on to the extraction part. In our case, we decided to classify 44 document classes.

Step 2

To obtain the dataset, we collected PDFs of several hundred packages and manually extracted the selected documents from those packages. Once a document was identified in a package, the pages of that document were separated and concatenated together in the form of a PDF file. For example, if we found “Doc A” from page 4 to page 10 in a package, we would extract those 7 pages (4–10) and merge them into a 7-page PDF. This PDF constitutes a document sample. All the samples extracted for a particular document class were put into a separate folder. The following shows the folder structure. We collected 300+ document samples for each document class. Each document class was given a unique identifier, which we called “DocumentIdentifierID”.
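As an illustrative sketch of this extraction step (pypdf is an assumption here, and the file names and folder layout are hypothetical):

```python
# Sketch of building a document sample: extract a page range from a package
# PDF and save it as its own sample PDF inside a per-class folder.
from pathlib import Path
from pypdf import PdfReader, PdfWriter

def extract_sample(package_pdf, first_page, last_page, doc_class_id, out_dir="samples"):
    """Pages are 1-based and inclusive, e.g. (4, 10) extracts 7 pages."""
    reader = PdfReader(package_pdf)
    writer = PdfWriter()
    for i in range(first_page - 1, last_page):      # pypdf pages are 0-based
        writer.add_page(reader.pages[i])
    target = Path(out_dir) / str(doc_class_id)      # one folder per document class
    target.mkdir(parents=True, exist_ok=True)
    out_path = target / f"{Path(package_pdf).stem}_{first_page}-{last_page}.pdf"
    with open(out_path, "wb") as f:
        writer.write(f)
    return out_path

extract_sample("loan_package_001.pdf", 4, 10, doc_class_id=6853)
```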

Step 3

The next step is to apply OCR and extract text from all the pages present in the document samples. The OCR iterated over all the folders and generated Excel files containing the extracted text and some metadata. The following shows the format of the Excel files; each row represents one page.

Dataset Table with sample rows.

Loan Number, File Name: These are unique sample (PDF) identifiers. There are two (green, yellow) samples present in the table.


Document Identifier ID, Document Name: Represent the document class to which these samples belong.

Page Count: The total number of pages present in one particular sample (both samples have 2 pages).

Page Number: The ordered page number of each page within a sample.

IsLastPage: If 1, the page is the last page of that particular sample.

Page Text: The text returned from the OCR for that particular page.

Data Transformations
Once the data is generated in the above format, the next step is to transform it. In the transformation phase, the data is converted into the format required for training a machine learning model. The following transformations are applied to the dataset.

Step 1 | Generating ML classes

The first step of the transformation is to generate the first page, last page, and other page classes. To do this, the Page Number and IsLastPage column values are used. The following shows a conditional representation of the logic used.
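The conditional logic itself is shown as a figure; based on the description above, a minimal pandas sketch (column names follow the dataset table, the file name is hypothetical, and the exact rules in the figure may differ slightly) could be:

```python
# Sketch of Step 1: derive the ML class for each page from the Page Number
# and IsLastPage columns.
import pandas as pd

def ml_class(row):
    doc_id = str(row["Document Identifier ID"])
    if row["Page Number"] == 1:
        return doc_id                 # first page class, e.g. "6853"
    if row["IsLastPage"] == 1:
        return f"{doc_id}-last"       # last page class, e.g. "6853-last"
    return "Other"                    # all middle pages share one class

df = pd.read_excel("dataset.xlsx")
df["ML Class"] = df.apply(ml_class, axis=1)
```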


Moreover, the table below represents the resulting columns. Notice the yellow column, where 6853 represents the first page class and 6853-last represents the last page class, while the mid-pages are considered the Other class.

Step 2 | Data Split for Training and Testing the Pipeline

Once step 1 is complete, from that point on we only need two columns, “Page Text” and “ML Class”, to build the training pipeline. The other columns are used for testing evaluations.

The next step is to split the data for training and testing the pipeline. The data is split so that 80% is used for training and 20% is used for testing. The data is also randomly shuffled, but in a stratified fashion for each class. For more information, click the link.
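A minimal sketch of this 80/20 stratified split with scikit-learn, continuing from the dataframe in the previous sketch:

```python
# Sketch of the 80/20 stratified train/test split on the two columns we need.
from sklearn.model_selection import train_test_split

X = df["Page Text"]
y = df["ML Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,      # 20% held out for testing
    shuffle=True,
    stratify=y,          # preserve per-class proportions in both splits
    random_state=42,
)
```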

Step 3 | Data cleaning and transformation


The “Page Text” column, which contains the OCR text for each page, is cleaned; this process is applied to both the train and test data. The following processes are performed.

1. Case correction: All the text is converted to UPPER or lower case.

2. Regex for non-alphanumeric characters: All the characters which are not
alphanumeric are removed.

3. Word Tokenization: All the words are tokenized, which means the single Page Text string becomes a list of words.

4. Stopwords Removal: Stopwords are words which are too common in the English language and might not be helpful in classifying the individual documents, for example words like “the”, “is”, “a”. Stopwords can also be domain-specific; a domain list can be used to remove redundant words which are common across many different documents, e.g., in finance or mortgage, the word “price” can occur in many documents.

The following tables show the text before and after these transformations.
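A minimal sketch of this cleaning step, using gensim's generic English stopword list as a stand-in for the domain-specific list mentioned above, and continuing from the split in the previous sketch:

```python
# Sketch of Step 3: lowercase, strip non-alphanumeric characters, tokenize,
# and remove stopwords from every page's OCR text.
import re
from gensim.parsing.preprocessing import STOPWORDS

def clean_tokens(page_text):
    text = page_text.lower()                          # 1. case correction
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # 2. drop non-alphanumeric chars
    tokens = text.split()                             # 3. word tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 4. stopword removal

X_train_tokens = [clean_tokens(t) for t in X_train]
X_test_tokens = [clean_tokens(t) for t in X_test]
```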

Training Pipeline
In the previous Machine Learning Engine section, we abstractly discussed the inner workings of the Machine Learning Engine. The two main components were:

1. Text Vectorizer: In our case, we have used Doc2Vec

2. Classifier Model: Logistic Regressor is used for classification.


Text Vectorizer (Doc2Vec)


Since the beginning of Natural Language Processing (NLP), there has been a need to transform text into something a machine can understand. This means transforming textual information into a meaningful representation, usually known as a vector (or array) of numbers. The research community has been developing different methods to perform this task. In our research and development, we tried different techniques and found Doc2Vec to be the best among them.

Doc2Vec is based on the Word2Vec model. Word2Vec is a predictive Vector Space Model. To understand Word2Vec, let us begin with Vector Space Models.

Vector Space Models (VSMs): Embed words into a continuous vector space where semantically similar words are mapped to nearby points.

Two Approaches for VSM:

1. Count-Based Methods: Compute the statistics of how often some word co-occurs
with its neighbor words in a large text corpus, and then map these count-
statistics down to a small, dense vector for each word (e.g. TFIDF)

2. Predictive Methods: Predict a word from its neighbors in terms of learned small,
dense embedding vectors (e.g. Skip-Gram, CBOW). Word2Vec and Doc2Vec
belong to this category of models

Word2Vec Model

It is a computationally efficient predictive model for learning word embeddings from raw text. Word2Vec can be created using one of the following two models:

1. Skip-Gram: Creates a sliding window around the current word (the target word), then uses the current word to predict all surrounding words (the context words) (e.g. predicts ‘the cat sits on the’ from ‘mat’).

2. Continuous Bag-of-Words (CBOW): Creates a sliding window around the current word (the target word), then predicts the current word from the surrounding words (the context words) (e.g. predicts ‘mat’ from ‘the cat sits on the’).

For more details, read this article; it explains different aspects of these models in detail.

Doc2Vec Model


This text vectorization technique was introduced in the research paper Distributed Representations of Sentences and Documents. Further technical details can be found here.

Definition | It is an unsupervised algorithm that learns fixed-length feature vector representations from variable-length pieces of text. These vectors can then be used in any machine learning classifier to predict the class labels.

It is similar to the Word2Vec model except that it uses all the words in each text file to create a unique column in a matrix (called the Paragraph Matrix). Then a single-layer NN, like the one seen in the Skip-Gram model, is trained, where the input data are all the surrounding words of the current word along with the current paragraph column, to predict the current word. The rest is the same as in the Skip-Gram or CBOW models.

Doc2Vec | Distributed Bag of Words | Source: Distributed Representations of Sentences and Documents

The advantages of the Doc2Vec model:

On the sentiment analysis task, Doc2Vec achieves new state-of-the-art results, better than complex methods, yielding a relative improvement of more than 16% in terms of error rate.

On the text classification task, Doc2Vec convincingly beats bag-of-words models, giving a relative improvement of about 30%.
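As a sketch of how such a model can be trained with gensim on the cleaned training pages (the DBOW variant, matching the figure above; the hyperparameters are illustrative, not the ones used in the article):

```python
# Sketch of training a Doc2Vec (DBOW) model with gensim on the cleaned,
# tokenized training pages from the earlier data-cleaning sketch.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_train = [
    TaggedDocument(words=tokens, tags=[i])
    for i, tokens in enumerate(X_train_tokens)
]

d2v_model = Doc2Vec(
    dm=0,             # 0 = Distributed Bag of Words, as in the figure above
    vector_size=300,  # length of the fixed-size page vectors
    min_count=2,      # ignore words that appear only once
    epochs=40,
    workers=4,        # multi-processing, see Solution Features
)
d2v_model.build_vocab(tagged_train)
d2v_model.train(tagged_train, total_examples=d2v_model.corpus_count,
                epochs=d2v_model.epochs)
```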

Classifier Model (Logistic Regressor)


Once the text is converted to a vector format, it is ready for a machine learning classifier to learn the patterns present in the vectors of the different document types and identify the correct distinctions. Since there are many classification techniques which can be used here, we tried the best of the bunch and evaluated their results, i.e., Random Forest, SVM, Multi-Layer Perceptron and Logistic Regressor. Many different parameters were tried for each classifier to obtain optimal results. The Logistic Regressor was found to be the best among all of these models.

Training Procedure
Once the data is transformed, we first train the Doc2Vec model on the training split (as discussed in the data transformation section).

After the Doc2Vec model is trained, the training data is passed through it again, but this time the model is not trained; rather, we infer the vectors for the training samples. The last step is to pass these vectors and the actual ML class labels to the classification model (Logistic Regressor).

Once the models are trained on the training data, both models are saved to disk so that they can be loaded into memory for testing and, ultimately, production deployment. The following diagram shows the basic flow of this collaborative scheme.
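A condensed sketch of this collaborative scheme, continuing from the earlier Doc2Vec and cleaning sketches (the classifier parameters and the use of joblib for persistence are assumptions):

```python
# Sketch of the training procedure: infer a vector for every training page
# with the trained Doc2Vec model, fit the Logistic Regression classifier on
# those vectors, and persist both models for testing / production use.
import joblib
from sklearn.linear_model import LogisticRegression

train_vectors = [d2v_model.infer_vector(tokens) for tokens in X_train_tokens]

classifier = LogisticRegression(max_iter=1000)   # parameters are illustrative
classifier.fit(train_vectors, y_train)

d2v_model.save("doc2vec.model")                  # gensim's own persistence
joblib.dump(classifier, "classifier.joblib")     # later: joblib.load(...)
```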

Testing & Evaluation Pipeline


Once the pipeline is trained (which includes both the Doc2Vec model and the classifier), the following flow diagram shows how it is used to predict the document classes for the testing data split.


The transformed testing data is passed through the trained Doc2Vec model, where the vector representations of all the pages present in the testing data are inferred. These vectors are then classified by the classifier, which returns the predicted class and the confidence scores for all the ML classes.

For the detailed evaluation of the Machine Learning Engine, we generate an Excel file from the results. The following table shows the columns and the information generated in the testing phase.

Evaluation excel file which is generated in the testing phase.

Page Text, File Name, Page Number: These are the same columns we had in the data preparation stage; they are taken as-is from the source dataset.

ground, pred: ground shows the actual ML class of that page, while pred shows the ML class predicted by the ML engine.

Trained classes columns: The columns in this section represent the ML classes on which the model was trained and the confidence scores for those classes.

MaxProb, Range: MaxProb shows the maximum confidence score achieved by any of the columns in the Trained classes section (see the red colored text). Range shows the range in which MaxProb falls.
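A sketch of how such an evaluation table can be assembled for the test split, continuing from the earlier sketches (the identifier columns are omitted for brevity, and the Range buckets shown are illustrative):

```python
# Sketch of building the evaluation table: predicted class, per-class
# confidence scores, MaxProb and a Range bucket for every test page.
import numpy as np
import pandas as pd

test_vectors = [d2v_model.infer_vector(tokens) for tokens in X_test_tokens]
probs = classifier.predict_proba(test_vectors)            # rows: pages, cols: ML classes

results = pd.DataFrame(probs, columns=classifier.classes_)
results.insert(0, "Page Text", list(X_test))
results.insert(1, "ground", list(y_test))
results.insert(2, "pred", classifier.classes_[np.argmax(probs, axis=1)])
results["MaxProb"] = probs.max(axis=1)
results["Range"] = pd.cut(results["MaxProb"],             # illustrative buckets
                          bins=[0, 0.5, 0.8, 0.9, 1.0],
                          labels=["<50%", "50-80%", "80-90%", "90-100%"])
results.to_excel("evaluation.xlsx", index=False)
```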

Currently there are three levels of results evaluation.

1. Cumulative Error Evaluation Metric

2. Confusion Matrix


3. Class level confidence scores analysis



Cumulative Error Evaluation Metric


This evaluation calculates two metrics, Accuracy and F1-Score. For more details, check this blog. These provide an abstract insight into the overall goodness of the pipeline. The scores can range from 0 to 100, where a higher number represents how good the pipeline is at classifying the documents. In our experiments, we got the following accuracy and F1-score.
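For instance, with scikit-learn these two metrics can be computed from the evaluation table as follows (the weighted averaging for F1 is an assumption):

```python
# Sketch of the cumulative evaluation: overall accuracy and weighted F1-score.
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(results["ground"], results["pred"]) * 100
f1 = f1_score(results["ground"], results["pred"], average="weighted") * 100
print(f"Accuracy: {accuracy:.2f}  F1-score: {f1:.2f}")
```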

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a
classification model (or “classifier”) on a set of test data for which the true values
are known.

Essentially, it makes it easier to understand:

Which classes are not performing well?

What is the accuracy score of an individual class?

Which classes are confused with each other?

The plot below represents the confusion matrix we generated after our testing. It is an embedded link, so click it to view the confusion matrix.


Confusion Matrix Plot

The values on both the X-axis (True labels) and the Y-axis (Predicted labels) represent the document classes we trained on. The numbers within the cells show the percentage of the testing dataset belonging to the class on the left and bottom.

The values on the diagonal represent the percentage of data where the predicted classes were correct; a higher percentage is better, e.g., 0.99 means 99% of the testing data for that particular class was predicted correctly. All the other cells show wrong predictions, and the percentage shows how often a certain class was confused with another class.

As can be seen, the model is able to correctly classify most of the ML classes with more than 90% accuracy.
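A sketch of generating a row-normalized confusion matrix of this kind with scikit-learn (the axis orientation may differ from the plot above):

```python
# Sketch of the confusion matrix, normalized per true class so the diagonal
# shows the fraction of each class's test pages that were predicted correctly.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(
    results["ground"], results["pred"],
    normalize="true",          # each row sums to 1.0
    values_format=".2f",
    xticks_rotation="vertical",
)
plt.show()
```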

Class level confidence scores analysis


Although the confusion matrix gives details about the class confusions, it doesn't represent the confidence scores of the predictions, which in other words means:

“How confident the model is, when making a prediction about a document class?”

What is the need?

In the ideal situation, the model should have high confidence when predicting a correct ML class and low confidence when predicting a wrong ML class. But this is not strict behavior and depends on many factors, e.g., the performance of a particular class, actual domain similarities between document classes, etc. To evaluate whether this behavior exists and whether confidence scores can be a useful indication of true predictions, we devised an additional evaluation approach.

Approach

Since the task is to reduce the manual work, it was decided that only the predictions with high confidence will be accepted. This way, wrong predictions will mostly not slip through (because those usually won't have high confidence). The rest of the documents and pages will be verified manually by the BPO.

Threshold

In this step, the confidence scores of the classes are calculated and a threshold is defined. The threshold is a percentage, e.g., 80% or 75%, which is decided based on the following condition:

What is the confidence score value at which wrong predictions are insignificant in number and true predictions are high in number? In other words, it is about finding the sweet spot.

The following line plot shows the true positives (blue line) and false positives (red line). The X-axis shows the ML classes, and the Y-axis shows the percentage of the testing data for a particular class which is covered by true positives or false positives.

For example, in the case of the ML class 1330, true predictions cover almost 70% of the whole testing dataset for that class, which means the ML engine was able to predict 70% of the data correctly with a confidence score greater than 90%. Moreover, the false positives covered only 1% of the testing dataset, which means only 1% of the test data was predicted wrongly with a confidence score higher than 90%.


Although, because of the threshold, we sometimes lose true positives (when the confidence score is less than the threshold), that is not as bad as false positives with high confidence. Such pages/documents will be verified manually.

The previous plot is made with a threshold of 90% and above. In the following plot, the threshold is 80% and above. Notice that even when the threshold is dropped to 80%, the false positives do not increase, while the true positives increase significantly, which means that between the 90% and 80% thresholds, 80% is optimal.

While doing this analysis, all the levels are checked, i.e., 50%, 60%, 70%. The most optimal threshold is chosen using this evaluation metric.
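One way to reproduce the data behind these plots from the evaluation table (a sketch; the article's exact definition of coverage may differ):

```python
# Sketch of the class-level threshold analysis: for a given confidence
# threshold, what fraction of each class's test pages is covered by
# high-confidence correct predictions (TP) vs. high-confidence wrong ones (FP)?
def coverage_at_threshold(results, threshold=0.80):
    coverage = {}
    for cls, grp in results.groupby("ground"):
        confident = grp[grp["MaxProb"] >= threshold]
        tp = (confident["pred"] == cls).sum() / len(grp)
        fp = (confident["pred"] != cls).sum() / len(grp)
        coverage[cls] = {"true_positive_coverage": tp, "false_positive_coverage": fp}
    return coverage

for thr in (0.5, 0.6, 0.7, 0.8, 0.9):       # levels checked in the article
    cov = coverage_at_threshold(results, thr)
    print(thr, list(cov.items())[:2])       # spot-check a couple of classes
```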

Solution Features
Fast Predictions | The classification time for one page is under ~300ms. If we include the OCR time, one page can be classified well under 1 second. Moreover, multi-processing can reduce this further.

High Accuracy | The current solution pipeline is able to identify and classify documents with high accuracy and high confidence. For most of the classes we get more than 95% accuracy.

Labeled Data Requirements | Within our experiments we have observed that the pipeline can work well with around 300 samples per document class (as in the experiment we discussed in these blogs), but this depends on the variations and type of document class. Moreover, we see accuracy and confidence scores increasing with larger sample counts.

Confidence Score Threshold | The pipeline provides prediction confidence scores, which enables a tuning approach and allows a trade-off between true positives and false positives.

Multi-Processing | The Doc2Vec implementation allows for multi-processing. Moreover, our data transformation scripts are highly parallelized.

Conclusion
Machine Learning and Natural Language Processing have been doing wonders in many fields, and we have seen first hand how they helped reduce the manual effort and automate the task of Document Classification. The solution is not only fast, but also very accurate.

Because of the sensitive nature of the data used in this process, the code base is not available. I will rework the codebase on some dummy data, which will allow me to upload it to my GitHub. Please follow me on GitHub for further updates. Also check out some of my other projects ;)
