Multi Page Document Classification Using NLP and ML | Doc2Vec
Source: https://unsplash.com/photos/5cFwQ-WMcJU
Abstract
Even in today’s technological era, much of business is still done using documents, and the amount of paperwork involved varies from industry to industry. Many of these industries need to scan through scanned document images (which usually contain non-selectable text) to get the information for the key index fields needed to operate their daily tasks.
To achieve this, the first major task is to index the different types of documents, which later helps in extracting information and meta-data from a variety of complex documents. This blog post will show how advanced Machine Learning and NLP techniques can be leveraged to solve this major part of the puzzle, formally called Document Classification.
Introduction
In the mortgage industry, different companies perform mortgage loan audits for thousands of people.
The second row represents a package, and the numbers show the ordered pages present in that package. The third row shows the occurrence of different kinds of documents within the package.
need to account for every new document or document variation that is presented, and rules must be added for each one. This in itself becomes a manual effort, and only partial automation is achieved. There still remains a chance that the system identifies a document as class “Doc A” when it is in fact “Doc B”, because of common rules present in both. Additionally, there is no degree of certainty attached to an identification. More often than not, manual verification is still required.
There are several hundred document types, so the BPO staff needs a knowledge base of “how a certain document looks, and what are the different variations of the same document?” in order to classify documents. On top of that, when the manual workload is too high, human error tends to increase.
Objective
Within a package, there are many types of pages, but generally, these can be
categorized into three types:
In terms of documents, the following characteristics are observed in the data:
The documents present in the packages are not in a consistent order. For example, in one package document “A” might come after document “B”, and in another it is the other way around.
There are many variations of the same document class. One document class can have different-looking variations; for example, a document class “A” page template/format might change for different US states. Within the mortgage domain, these represent the same information but differ in formatting and contents. In other words, if “cat” is a document class, different breeds of cats would be the “variations”.
The document types have different kinds of scan deformities, e.g. noise, 2D and 3D rotations, bad scan quality, and page orientation issues, which degrade the OCR output for those documents.
Solution Methodology
In this section, we will abstractly explain how our solution pipeline works, and how
each component or module comes together to produce an end-to-end pipeline.
The following flow diagram shows the solution.
Since the goal is to identify the documents within the package, we had to determine which characteristics of a document make it different from another one.
In our case, we decided that the text present in the document is the key, because intuitively we humans also do it this way. The next challenge was to figure out the location of the document within the package. In the case of multi-page documents, boundary pages (start, end) have the most significance, because using these pages the page range of a document can be identified.
First Page Classes: These classes are the first pages of each document class and are responsible for identifying the start of a document.
Last Page Classes: These classes are the last pages of each document class and are responsible for identifying the end of a document. These classes are made only for the document classes which have samples with more than one page.
Other Class: This is a single class which contains the middle pages of all the document classes combined. Having this class helps the pipeline in the later stages: it reduces the instances where a middle page of a document is classified as the first or last page of the same document, which intuitively is possible because there can be similarities between all the pages, such as headers, footers and templates. This allows the model to learn more robust features.
The following diagram represents how these different types of ML classes would look in terms of packages and documents.
Here, A and B are the first page classes of documents A and B, while A-last and B-last are the last page classes of the same documents. All the middle pages of any document class are considered as the Other class.
Once the ML classes are defined, the next step is to prepare the dataset for training the Machine Learning Engine (the data preparation part will be discussed in detail in the next sections). The following diagram explains the inner workings of the Machine Learning Engine and gives a more technical view of the solution pipeline.
Step 1
The package (which is in PDF format) is split into individual pages (images).
Step 2
OCR is applied to each page image to extract its text.
Step 3
The text corresponding to each page is then passed to the Machine Learning Engine, where the Text Vectorizer (Doc2Vec) generates its feature vector representation, which essentially is a list of floats.
Step 4
The feature vectors are then passed to the classifier (Logistic Regression). The classifier predicts the class for each feature vector, which is one of the ML classes we have previously discussed (first, last or other). Additionally, the classifier returns the confidence scores for all the ML classes (the section on the far right of the diagram). For example, let (D1, D2, ...) be the ML classes; then for a single page the results may look like the following.
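As a rough illustration of Steps 3 and 4, the sketch below (assuming a trained gensim Doc2Vec model and a fitted scikit-learn classifier; the variable names and class labels are illustrative, not the ones used in our pipeline) shows how a single page's text could be vectorized and classified, and what the per-class confidence scores might look like.

```python
# Minimal sketch of Steps 3-4; doc2vec_model and classifier are assumed to be
# already trained (see the Training Procedure section later in this post).
from gensim.utils import simple_preprocess

def classify_page(page_text, doc2vec_model, classifier):
    """Vectorize one page's OCR text and return the predicted ML class
    together with the per-class confidence scores."""
    tokens = simple_preprocess(page_text)            # basic tokenization
    vector = doc2vec_model.infer_vector(tokens)      # feature vector: a list of floats
    predicted_class = classifier.predict([vector])[0]
    confidences = dict(zip(classifier.classes_,
                           classifier.predict_proba([vector])[0]))
    return predicted_class, confidences

# For a single page, the result might look like (class names are hypothetical):
# ('D1', {'D1': 0.87, 'D1-last': 0.04, 'D2': 0.02, 'Other': 0.07})
```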
Post Processing
Once the whole package is processed, we use the predictions to identify the boundaries of the documents. The results contain the predicted class and the confidence scores of the predictions for all the pages of the package; see the following table.
The following is the simple algorithm used to identify the document boundaries using the output from the Machine Learning Engine.
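As one plausible, simplified reading of this boundary logic (not necessarily the exact rules used in our pipeline): a document starts at a page predicted as a first-page class and ends at the matching last-page prediction, or just before the next first-page prediction. A minimal sketch:

```python
# Simplified sketch of boundary identification from per-page predictions.
def find_document_boundaries(page_predictions):
    """page_predictions: list of (page_number, predicted_class) in package order,
    where predicted_class is e.g. 'A' (first page of A), 'A-last', or 'Other'.
    Returns a list of (doc_class, start_page, end_page) tuples."""
    documents = []
    current_class, start_page = None, None
    for page_number, predicted in page_predictions:
        if predicted.endswith('-last') and current_class == predicted[:-len('-last')]:
            documents.append((current_class, start_page, page_number))    # close document
            current_class, start_page = None, None
        elif predicted != 'Other' and not predicted.endswith('-last'):
            if current_class is not None:                                 # previous doc had no last page
                documents.append((current_class, start_page, page_number - 1))
            current_class, start_page = predicted, page_number            # open a new document
    if current_class is not None:                                         # close a trailing document
        documents.append((current_class, start_page, page_predictions[-1][0]))
    return documents
```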
Data Preparation
In pursuit of developing an end-to-end document classification pipeline, the very first, and arguably the most important, step is data preparation, because the solution is only as good as the data it uses. The data we used for our experiments consisted of documents from the mortgage domain, but the strategies we adopted can be applied to any document dataset in a similar fashion. The following steps were performed.
Step 1
The first step is to decide which documents within a package are to be recognized and classified. Ideally, all the documents present in packages should be selected. Once the document classes are decided, we move on to the extraction part. In our case, we decided to classify 44 document classes.
Step 2
To obtain the data set, we collected PDFs of several hundred packages and manually extracted the selected documents from those packages. Once a document was identified in a package, the pages of that document were separated and concatenated together in the form of a PDF file. For example, if we found “Doc A” from page 4 to page 10 in a package, we would extract those 7 pages (4–10) and merge them into a 7-page PDF. This PDF constitutes a document sample. All the samples extracted for a particular document class were put into a separate folder. The following shows the folder structure. We collected 300+ document samples for each document class, and each document class was given a unique identifier which we called “DocumentIdentifierID”.
Step 3
The next step is to apply OCR and extract text from all the pages present in the document samples. The OCR iterated over all the folders and generated Excel files containing the extracted text and some meta-data. The following shows the format of the Excel files; each row represents one page.
Loan Number, File Name : These are unique sample (pdf) identifiers. There are two
(green, yellow) samples present in the table.
Page Count : Total number of pages present in one particular sample. (both
samples have 2 pages)
Page Number : Is the ordered page number of each page within a sample.
IsLastPage : If 1, it means the page is the last page of that particular sample.
Page Text : Is the text returned from the OCR for that particular page.
Data Transformations
Once the data is generated in the above format, the next step is to transform it. In the transformation phase, the data is converted into the format required for training a machine learning model. The following transformations are applied to the dataset.
The first step of the transformation is to generate the first page, last page, and other page classes. To do this, the Page Number and IsLastPage column values are used. The following shows a conditional representation of the logic used, along with a rough code sketch.
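As a rough pandas sketch of this class-assignment logic (the column names and file name here are illustrative assumptions, not necessarily those of our actual dataset):

```python
# Assign the ML class for each page row based on Page Number and IsLastPage.
import pandas as pd

def assign_ml_class(row):
    doc_id = str(row['DocumentIdentifierID'])
    if row['Page Number'] == 1:
        return doc_id                 # first-page class, e.g. '6853'
    if row['IsLastPage'] == 1:
        return doc_id + '-last'       # last-page class, e.g. '6853-last'
    return 'Other'                    # all middle pages collapse into one class

df = pd.read_excel('ocr_output.xlsx')          # hypothetical OCR output file
df['ML Class'] = df.apply(assign_ml_class, axis=1)
```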
Moreover, the table below represents the columns. Notice the yellow column, where 6853 represents the first page class and 6853-last represents the last page class, while the middle pages are considered as the Other class.
Once step 1 is complete, from that point on we only need two columns, “Page Text” and “ML Class”, for the training pipeline. The other columns are used for testing evaluations.
The next step is to split the data for training and testing the pipeline. The data is split so that 80% is used for training and 20% for testing. The data is also randomly shuffled, but in a stratified fashion for each class (a sketch of the split is shown below). For more information, click the link.
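A minimal sketch of this 80/20 stratified split, assuming the DataFrame built above with “Page Text” and “ML Class” columns and scikit-learn's train_test_split:

```python
# Stratified 80/20 split: class proportions are preserved in both splits.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df[['Page Text', 'ML Class']],
    test_size=0.20,            # 20% held out for testing
    shuffle=True,              # random shuffling
    stratify=df['ML Class'],   # stratified per class
    random_state=42,
)
```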
The “Page Text” column, which contains the OCR text for each page, is cleaned; this process is applied to both the train and test sets. The following processes are performed (a code sketch appears after the list).
2. Regex for non-alphanumeric characters: All the characters which are not alphanumeric are removed.
3. Word Tokenization: All the words are tokenized, which means the single Page Text string becomes a list of words.
4. Stopwords Removal: Stopwords are words which are too common in the English language and might not be helpful in classifying the individual documents, for example words like “the”, “is”, “a”. Stopwords can also be domain-specific: the list can be extended to remove redundant words which are common across many different documents, e.g. in finance or mortgage, the word “price” can occur in many documents.
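A sketch of these cleaning steps using NLTK follows; the lowercasing step, the exact stopword list, and the domain-specific words are assumptions, not necessarily what our pipeline used:

```python
# Cleaning sketch: regex filtering, tokenization, stopword removal.
# Requires: nltk.download('punkt'); nltk.download('stopwords')
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english')) | {'price'}   # example domain-specific word

def clean_page_text(page_text):
    text = page_text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)     # drop non-alphanumeric characters
    tokens = word_tokenize(text)                 # one string -> list of words
    return [token for token in tokens if token not in STOPWORDS]
```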
Training Pipeline
In the previous Machine Learning Engine section, we abstractly discussed the inner workings of the Machine Learning Engine. The two main components were:
Vector Space Models (VSMs): Embeds words into a continuous vector space where
semantically similar words are mapped to nearby points
1. Count-Based Methods: Compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count statistics down to a small, dense vector for each word (e.g. TF-IDF)
2. Predictive Methods: Predict a word from its neighbors in terms of learned small,
dense embedding vectors (e.g. Skip-Gram, CBOW). Word2Vec and Doc2Vec
belong to this category of models
Word2Vec Model
1. Skip-Gram: Creates a sliding window around the current word (the target word), then uses the current word to predict all surrounding words (the context words) (e.g. predicts ‘the cat sits on the‘ from ‘mat‘)
2. CBOW (Continuous Bag of Words): The inverse of Skip-Gram; uses the surrounding context words to predict the current (target) word (e.g. predicts ‘mat‘ from ‘the cat sits on the‘)
For more details, read this article; it explains different aspects of these models in detail.
Doc2Vec Model
This text vectorization technique was introduced in the scientific research paper Distributed Representations of Sentences and Documents. Further technical details can be found here.
It is similar to the Word2Vec model, except that it uses all the words in each text file to create a unique column in a matrix (called the Paragraph Matrix). Then a single-layer NN, like the one seen in the Skip-Gram model, is trained, where the input data is all the surrounding words of the current word along with the current paragraph column, to predict the current word. The rest is the same as the Skip-Gram or CBOW models.
Doc2Vec | Distributed Bag of Words | Source: Distributed Representations of Sentences and Documents
Once the vector representations are generated, a classifier is needed to learn the differences between the document classes and identify the correct distinctions. Since there are many classification techniques which can be used here, we tried the best of the bunch and evaluated their results, i.e. Random Forest, SVM, Multi-Layer Perceptron and Logistic Regression. Many different parameters were tried for each classifier to obtain the optimal results. Logistic Regression was found to be the best amongst all of these models.
Training Procedure
Once the data is transformed, we first train the Doc2Vec model on the training split (as discussed in the data transformation section).
After the Doc2Vec model is trained, the training data is passed through it again, but this time the model is not trained; rather, we infer the vectors for the training samples. The last step is to pass these vectors and the actual ML class labels to the classification model (Logistic Regression).
Once the models are trained on the training data, both models are saved to disk, so that they can be loaded into memory for testing and, ultimately, production deployment. The following diagram shows the basic flow of this collaborative scheme; a code sketch of the procedure is shown below.
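A minimal sketch of this two-stage training, assuming gensim's Doc2Vec and scikit-learn's LogisticRegression (the hyperparameters and file names are illustrative, not the ones used in our experiments):

```python
# Two-stage training: Doc2Vec for vectors, Logistic Regression for classes.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
import joblib

# train_df is assumed to hold cleaned token lists ('Page Text') and labels ('ML Class')
tagged_docs = [TaggedDocument(words=tokens, tags=[str(i)])
               for i, tokens in enumerate(train_df['Page Text'])]

doc2vec_model = Doc2Vec(vector_size=300, min_count=2, epochs=40, dm=0)  # PV-DBOW
doc2vec_model.build_vocab(tagged_docs)
doc2vec_model.train(tagged_docs,
                    total_examples=doc2vec_model.corpus_count,
                    epochs=doc2vec_model.epochs)

# Re-infer vectors for the training pages and fit the classifier on them
X_train = [doc2vec_model.infer_vector(tokens) for tokens in train_df['Page Text']]
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_df['ML Class'])

# Persist both models for testing and production use
doc2vec_model.save('doc2vec.model')
joblib.dump(classifier, 'classifier.joblib')
```

Training the vectorizer first and then fitting the classifier on inferred vectors keeps the two models decoupled, so either one can be retrained or swapped independently.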
The transformed testing data is passed through the trained Doc2Vec model, where the vector representations of all the pages present in the testing data are inferred. These vectors are then classified by the classifier, which returns the predicted class and the confidence scores for all the ML classes.
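A sketch of this inference path, assuming the two artifacts saved in the training sketch above:

```python
# Load the saved models and classify one cleaned test page.
from gensim.models.doc2vec import Doc2Vec
import joblib

doc2vec_model = Doc2Vec.load('doc2vec.model')
classifier = joblib.load('classifier.joblib')

def predict_page(tokens):
    """tokens: cleaned word list for one test page."""
    vector = doc2vec_model.infer_vector(tokens)
    predicted_class = classifier.predict([vector])[0]
    max_prob = classifier.predict_proba([vector])[0].max()   # becomes the MaxProb column
    return predicted_class, max_prob
```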
For a detailed evaluation of the Machine Learning Engine, we generate an Excel file from the results. The following table shows the columns and the information generated in the testing phase.
Page Text, File Name, Page Number : These are the same columns we had in the data preparation stage; they are taken as-is from the source dataset.
ground, pred : ground shows the actual ML class of that page, while pred shows the ML class predicted by the ML engine.
MaxProb, Range : MaxProb shows the max confidence score achieved by any of the columns in the Trained classes section (see the red colored text). Range shows the range in which the MaxProb falls.
Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a
classification model (or “classifier”) on a set of test data for which the true values
are known.
The plot below represents the confusion matrix we generated after our testing. It is an embedded link, so click it to view the confusion matrix.
Values on both the X-axis (true labels) and the Y-axis (predicted labels) represent the document classes we trained on. The numbers within the cells show the percentage of the testing dataset belonging to the class on the left and bottom.
The values on the diagonal represent the percentage of data where the predicted classes were correct; a higher percentage is better, i.e. 0.99 means 99% of the testing data for that particular class was predicted correctly. All the other cells show wrong predictions, and the percentage shows how often a certain class was confused with another class.
As can be seen, the model is able to correctly classify most of the ML classes with more than 90% accuracy.
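As an aside, a normalized confusion matrix like this can be produced with scikit-learn; here is a sketch assuming the evaluation DataFrame described above (called eval_df here, with ground and pred columns):

```python
# Build and plot a row-normalized confusion matrix from the evaluation results.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

ground_truth, predictions = eval_df['ground'], eval_df['pred']
labels = sorted(set(ground_truth))                       # trained ML class names
cm = confusion_matrix(ground_truth, predictions,
                      labels=labels, normalize='true')   # per-class fractions
ConfusionMatrixDisplay(cm, display_labels=labels).plot(xticks_rotation='vertical')
plt.show()
```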
“How confident is the model when making a prediction about a document class?”
In the ideal situation, the model should have high confidence when predicting a correct ML class, and low confidence when predicting a wrong ML class. But this is not strict behavior and depends on many factors, e.g. the performance of a particular class, actual domain similarities between document classes, etc. To evaluate whether this behavior holds, we used the following approach.
Approach
Since the task is to reduce the manual work, it was decided that only the predictions with high confidence will be accepted automatically. This way, wrong predictions will largely be avoided (because those will not have high confidence). The rest of the documents and pages will be verified manually by the BPO.
Threshold
In this step confidence scores of the classes are calculated and the threshold is
defined, Threshold is a percentage i.e. 80%, 75% which is decided based on
following conditions.
The following line plot shows the true positives (blue line) and false positives (red line). The X-axis shows the ML classes, and the Y-axis shows the percentage of the testing data for a particular class which is covered by true positives or false positives.
For example, in the case of the ML class 1330, true predictions cover almost 70% of the whole testing dataset for that class, which means the ML engine was able to predict 70% of the data correctly with a confidence score greater than 90%. Moreover, the false positives covered only 1% of the testing dataset, which means only 1% of the test data was predicted wrongly with a confidence score higher than 90%.
The previous plot was made with a threshold of 90% and above. In the following plot, the threshold is 80% and above. Notice that even when the threshold is dropped to 80%, the false positives do not increase, while the true positives increase significantly. This means that, between the 90% and 80% thresholds, 80% is optimal.
While doing this analysis, all the levels are checked, i.e. 50%, 60%, 70%. The most optimal threshold is chosen using this evaluation metric; a sketch of the coverage computation is shown below.
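A sketch of this coverage computation at a given threshold, again assuming the evaluation DataFrame with ground, pred and MaxProb columns (one plausible reading of the analysis, not the exact script used):

```python
# For each class, measure what fraction of its test pages is covered by
# confident correct predictions vs. confident wrong predictions.
def coverage_at_threshold(eval_df, threshold=0.80):
    stats = {}
    for ml_class, group in eval_df.groupby('ground'):
        confident = group[group['MaxProb'] >= threshold]
        tp = (confident['pred'] == ml_class).sum() / len(group)   # confident & correct
        fp = (confident['pred'] != ml_class).sum() / len(group)   # confident & wrong
        stats[ml_class] = {'true_positive_coverage': tp,
                           'false_positive_coverage': fp}
    return stats
```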
Solution Features
Fast Predictions | The classification time for one page is under ~300 ms; if we include the OCR time, one page can be classified well under 1 second. Moreover, if multi-processing is adopted, this can be made even faster.
High Accuracy | The current solution pipeline is able to identify and classify
documents with high accuracy and high confidence. In most of the classes we
get more than 95% accuracy.
Labeled Data Requirements | Within our experiments we have observed that the pipeline can work well with at most 300 samples per document class (like in the experiment we discussed in this blog). But this is dependent on the variations and type of document class. Moreover, we see accuracy and confidence scores increasing with higher sample counts.
Conclusion
Machine Learning and Natural Language Processing have been doing wonders in many fields, and we saw first hand how they helped reduce the manual effort and automate the task of Document Classification. The solution is not only fast, but also very accurate.
Because of the sensitive nature of the data used in this process, the code base is not available. I will rework the codebase on some dummy data, which will allow me to upload it to my GitHub. Please follow me on GitHub for further updates. Also check out some of my other projects ;)