0% found this document useful (0 votes)
173 views25 pages

Document Classification With LayoutLMv3

Document Classification with LayoutLMv3

Uploaded by

Marcos Luis
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
173 views25 pages

Document Classification With LayoutLMv3

Document Classification with LayoutLMv3

Uploaded by

Marcos Luis
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 25
9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp Blog > Document Classification with Layoutimv3 Document Classification with LayoutLMv3 Document Classification with Transformers and PyTorch | Setup & Preproce... In this tutorial, we will explore the task of document classification using layout information and image content. We will use the LayoutLMv3 model, a state-of-the-art model for this task, and PyTorch Lightning, a lightweight PyTorch wrapper for high- performance training. Join the AI BootCamp! Ready to dive into the world of Al and Machine Learning? Join the Al BootCamp to transform your career with the latest skills and hands-on project experience. Learn about LLMs, ML best practices, and much more! JOIN NOW https shwww-mlexpert iofblogidocumentclassiieaton-witrayourinv3 1125 asi, ssaMt Document Classfcation with LayouliMv8 |MLExper- Get Things Done wih Al Bootcamp We will start by preparing the dataset and data loaders, followed by building and training the model. We will then evaluate the performance of our model and analyze the results using a confusion matrix. Finally, we will explore ways to improve the performance of the model on specific classes. By the end of this tutorial, you will have a good understanding of how to use LayoutLMv3 for document classification and how to leverage PyTorch Lightning to train and evaluate deep learning models. © In this tutorial, we will be using Jupyter Notebook to run the code. If you prefer to. follow along, you can access the notebook here: open the notebook Notebook Setup We will begin by installing wkhtmltopdft, a utility that can convert HTML files into images 3%bash wget -q https: //github. com/wkhtmltopdf /packaging/releases/download/@.12.6-1/wkhtmltox_¢ cp [email protected]_and64.deb /usr/bin apt -qq install /usr/bin/wkhtmltox_0.12.6-1.bionic_amd64.deb Next, we will proceed to install all the necessary libraries: Ipip install -qqq transformers==4.27.2 --progress-bar off !pip install -qqq pytorch-lightning==1.9.4 --progress-bar off Ipip install -qqq torchnetrics==0.11.4 --progress-bar off Ipip install -qqq 4 2.3 --progress-bar off Ipip install -qqq 1.6.2 --progress-bar off !pip install -qqq 9.4.0 ~-progress-bar off Ipip install -qqq tensorboardx==2.5.1 --progress-bar off Ipip install -qqq_huggingface_hub Ipip install -qqq --upgrade --no-cache-dir gdown 11.1 --progress-bar off The essential libraries for this tutorial are: ntps:wwn.mlexpertiofblogldocument-classiiation-witriayoutinv’ 2125 97574, 643 AM Document Classification wih LayoullMv8 |MLExpert- Get Things Done wih Al Bootcamp * transformers : We'll be using the implementation of LayoutLMv3 from this library for our model. * pytorch-lightning :It will help us in fine-tuning our model. * torchnetrics : This library provides us with various metrics for classification and other tasks * easyocr : We'll be using this library to run OCR on the document images. Let's add all imports that we'll use: fron transformers import LayoutLMv3Featureéxtractor, LayoutLMv3TokenizerFast, LayoutLM fron tqdm import tqdm import torch fron torch.utils.data import Dataset, DataLoader import pytorch lightning as pl fron pytorch_lightning.callbacks import ModelCheckpoint fron PIL import Inage, InageDraw, ImageFont import nunpy as np fron sklearn.model_selection import train_test_split import imgkit import easyocr import torchvision.transforms as T from pathlib import Path import matplotlib.pyplot as plt import os import cv2 from typing import List import json fron torchmetrics import Accuracy fron huggingface_hub import notebook_login fron sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay %matplotlib inline pl.seed_everything(42) The last line sets the seed for PyTorch Lightning to 42. Setting a seed ensures that the random number generator used by PyTorch Lightning (and the underlying PyTorch framework) produces the same sequence of random numbers each time the code is run Data https:ihwww-mlexpertiofblogidocument classification witrayourinv3 3125 971924, 843. aM Document Classetion wih LayouMva | MLExpert- Gat Things Dane wih Al Bootcamp The data is from Kaggle - Financial Documents Clustering?. It contains HTML documents (tables) from the publically available Hexaware Technologies financial annual reports?, It has 5 categories: * Income Statements (317 files) © Balance Sheets (282 files) *© Cash Flows (36 files) * Notes (702 files) © Others (1236 files) Download and extract an exact copy of the Kaggle files from my Google Drive: Igdown tMZXonmajLPKSzhZ2dt-Cd2RTSSYFHy® lunzip -q financial -documents.zip Inv “TableClassifierQuaterlywithNotes” “documents” Convert HTML to Images The documents are in HTML format, which is not usable for our model. We'll convert them to images. First, let's change the folder names to "snake case": for dir in Path("documents").glob("*"): dir.rename(str(dir).lower().replace(" ", "_")) List (Path("“documents").glob("*")) [PosixPath( ‘documents/notes*), PosixPath( ‘documents/cash_flow'), PosixPath( ‘documents/balance_sheets'), PosixPath( ‘documents/income_statement"), PosixPath( ‘documents/others')] ntps:wwn.mlexpertofblogldocument-classiiation-witrayoutlnv’ 425, 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp We need a directory for the converted images and each class of documents for dir in Path("documents") .glob("*"): image_dir = Path(f"images/{dir.name}") image_dir.mkdir(exist_ok=True, parents=True) To convert the HTML files to images, we'll be utilizing the imgkit package: def convert_html_to_image(file_path: Path, images_dir: Path, scale: float = 1.0) -> Pai file_nane = file_path.with_suffix(".jpg”).name save_path = images_dir / file_path.parent.name / f"{file_name ingkit.from_file(str(file_path), save_path, options: ‘quiet’: '', ‘format’: ‘jpeg"? image = Image.open(save_path) width, height = image.size image = image.resize((int(width * scale), int(height * scale))) image. save(str(save_path)) return save_path document_paths = list(Path("“documents").glob("*/*")) for doc_path in tqdm(document_paths) : convert_html_to_image(doc_path, Path("image: ), scale=8.8) Let's look at a sample document image: image_paths = sorted(list(Path("images").glob("*/*.jpg"))) image = Image.open(image_paths[@]) . convert ("RGB") width, height = image. size image https hw mlexpert iofblogidocument classification witrayourinv3 5125 9115124, 8:43 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp As ataist [As at3ist_(Asat 31st March 2027 March 2038) March 2017 2 Financial ieee — ae pa igses? [958056 oe a Heerese (Tey nventones a2 i925 000292392530 | (_TtFinanciassets EE [lores raeate —__— (en Ine eee ere | eas equivalonis e832 [i7da 28382 a7r41 | [i s3845 57.08 /.838.45 i «ini io ao son Total Assets (= (8) ear | arma | [ieee ties je ees Equity & Labilties 7 ae a | DEquiy (0) Equity share capial Tesae yesene A.esSOs Tew OE [Tooter Equiy —————SSS—=iR SO tan gs ara | pee | Total Eauiy @ eisis fe 27098 a aa (_[euepites —— face Sates is ‘Giher Non Curent Financia iain 030 = — a —| (HLongiemprovsons aaa OL. Ese 2 () Deferod tax fables (Net) Tosisi (3996s oer si tt one a a (w) Trade payabies SBS7ATS [9s7263 257473 9.81262 [Other Curent Financilabiiies (20571 6d 9521 asad (__[fb) Other current liabilities 7as04 (iei1o —fresoa isis | [)Shor term provisions. 007 (38.26, —— 26 (@ Current tax Fates (Net) sees (7132 sexes 732 | {Cherm G. — smear jsearea paneer 25,027.08 seine mere https: shwwn-mlexpertiofblogidocumentclassiicaton-witrayourinv3 Document Classification wth LayoutLMV | MLExpert- Get Things Dane with Al Bootcamp quay ano Laces (Ayr (5) + le ki | 42,309. 192,302.51 41,973.88 EasyOCR EasyOCR is a Python library for optical character recognition (OCR), which is the process of extracting text from images. EasyOCR uses deep learning models to recognize text and can handle a wide range of font styles, sizes, and orientations. reader = easyocr.Reader([ en" ]) We'll feed our sample document into the EasyOCR reader and see what it detects: image_path = image_paths[0] ocr_result = reader.readtext (str(image_path)) The ocr_result has the following format: text box coordinates [x,y], text, confidence Here's the first row from the result: ([[279, 13], [327, 13], (327, 27], [279, 27]], ‘In lacs)’, @.46634036192148154) We'll examine the OCR output overlaid on top of the document image: def create_bounding_box(bbox_data): xs =] ys = ( for x, y in bbox_data: xs-append(x) ys-append(y) left = int(min(xs)) top = int(min(ys)) right = int(max(xs)) ‘tps: mlexper ilblogidocument-lasication-wtriayoutin3 7125 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp bottom = int(max(ys)) return [left, top, right, bottom] font_path = Path(cv2.__path__[@]) / “qt/fonts/DejavuSansCondensed. ttf" font = ImageFont.truetype(str(font_path), size=12) Fig, (ax1, ax2) = plt-subplots(1, 2, Figsize-(28, 28)) left_image = Tmage.open(image_path). convert ("RGB") right_image = Image.new("RGB", left_image.size, (255, 255, 255)) left_draw = ImageDraw.Draw(left_image) right_draw = InageDraw.Draw(right_image) for i, (bbox, word, confidence) in enumerate(ocr_result): box = create_bounding_box(bbox) left_draw.rectangle(box, outline="blue", width=2) left, top, right, bottom = box left_draw.text((right + 5, top), text-str(i + 1), fill-"red", font=font) right_draw.text((left, top), text=-word, fill="black", font=font) ax1. imshow(left_image) x2. imshow(right_image) axtaxis("off"); ax2.axis("off")3 https:ihwww-mlexpertiofblogidocument classification witrayourinv3 9115124, 8:43 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp 2. i ai ntps:hwwwmlexpertofblogldocument-classiiation-witriayoutinv’ 9115124, 843 AM Document Classification with Layout MV | MLExpert- Get Things Done wth Al Bootcamp Ee 9) ES STE] We define a helper function create_bounding_box() that takes text box coordinates The function finds the minimum and maximum values of xs and ys , and returns the coordinates of the resulting bounding box as a list in the format left, top, right, bottom We can extract the OCR (Optical Character Recognition) result from each image and then save the results in JSON files: for image_path in tqdm(image_paths): ocr_result = reader.readtext(str(image_path), batch_size=16) ocr_page = [] for bbox, word, confidence in ocr_result: ocr_page.append({ "word": word, “bounding box": create_bounding_box(bbox) » with image_path.with_suffix(".json").open("w") as f: json.dump(ocr_page, #) LayoutLMv3 LayoutLMv3¢ is a state-of-the-art pre-trained language model developed by Microsoft Research Asia. It is designed to handle document analysis tasks that require understanding of both text and layout information, such as document classification, information extraction, and question answering The model is built on top of the transformer architecture and trained on massive amounts of annotated document images and text. LayoutLMv3 is capable of recognizing and encoding both the textual content and the visual layout of a document, allowing it to provide superior performance on document analysis tasks. You can use LayoutLMv3 for various tasks, such as document classification, named entity recognition, and question answering. To use LayoutLMv3, you can fine-tune the pre ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 10125 97524, 843m Document Casicaton wih Layout |NLEXpert- Get Things Dane with Al Bootcamp trained model on your specific task with a small amount of task-specific data. Hugging Face Transformers and PyTorch provide easy-to-use APIs that will allow us to fine-tune LayoutLMv3 for document classification. Preprocessing LayoutLMVv3 uses text, bounding boxes and images as input. To prepare all, we can use the LayoutLMv3Processor. The processor combines OCR, tokenization and image preprocessing feature_extractor = LayoutLMv3Featureéxtractor(apply_ocr=False) tokenizer = LayoutLMv3TokenizerFast.from_pretrained( “microsoft/layoutInv3-base” d processor = LayoutLMv3Processor(feature_extractor, tokenizer) The LayoutLmv3Featureextractor uses Tesseract OCR as the default option. However, Tesseract OCR was very slow during my experiments. Instead, we'll use a custom OCR engine (EasyOCR). Consider Google Cloud Vision or Amazon Textract, if you require a faster and more accurate OCR solution. We'll apply the processor to the sample document. LayoutLMv3 requires that each bounding box be normalized to be on a 0-1000 scale. We'll need the image width and height scale for that: image_path = image_paths[@] image = Image.open(image_path) . convert ("RGB") width, height = image.size width_scale = 1000 / width height_scale = 1000 / height Next, we'll take the OCR and extract words and bounding boxes: def scale_bounding_box(box: List[int], width_scale : float = 1.0, height scale : float return [ ntps:wwn.mlexpertofblogldocument-classiiation-witrayoutlnv’ 125 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done with Al Boatcamp int(box[@] * width_scale), int(box[1] * height_scale), int(box[2] * width_scale), int(box[3] * height_scale) json_path = image_path.with_suffix(".json") with json_path.open("r") as f: ocr_result = json.load(f) words = [] boxes = [] for row in ocr_result: boxes append(scale_bounding_box(row["bounding_box"], width_scale, height_scale)) words. append(row[ "word" ]) len(words), Len(boxes) (a74, 174) We define the function scale_bounding_box() to apply the image scale to each bounding box. Next, we iterate over each row of the OCR results stored in ocr_result , extract the bounding box coordinates and word text for each recognized text region, and scale the bounding box coordinates using the scale_bounding_box() . encoding = processor( image, words, boxes=boxes, max_length=512, padding="max_length", truncation=True, print(#""" input_ids: {1ist(encoding["input_ids"].squeeze().shape)} word boxes: {list (encoding["bbox"].squeeze().shape)} image data: (List(encoding[ "pixel_values"].squeeze().shape)} image size: {image.size} ) ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 1205 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp input_ids: [512] word boxes: [512, 4] image data: [3, 224, 224] image size: (819, 1195) We have three pieces of information: input_ids from the tokenizer, bbox for the bounding boxes, and pixel_values for the image. Let's have a look at the encoded image: image_data = encoding["pixel_values"][@] transform = T.ToPILImage() transform(image_data) The image encoding is a 3-dimensional array of shape (channels, height, width) Next, we convert the tensor to a PIL image object using a transformation from torchvision Model Let's create an instance of LayoutLMv3: model = LayoutLMv3ForSequenceClassification.from_pretrained( “microsoft/lLayout1mv3-base", num_labels=2 ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 98124, 8:43 AM Document Classification wth LayoutLMV | MLExpert- Get Things Dane wth Al Bootcamp The sequence classification model is loaded from the microsoft/layout1mv3-base checkpoint. We set num_labels to 2, which indicates we'll use it for binary classification. We can run the encoded document through the model and look at the predictions: outputs = model (**encoding) outputs. logits tensor([[®.2644, 0.2629], grad_fn=) Naturally, our model is untrained and lacks the ability to comprehend the documents in our dataset. Let's train it! Training To fine-tune LayoutLMv3, we will utilize PyTorch Lightning. This is what we'll do: * Split the data into training and testing subsets * Create a PyTorch Dataset * Generating dataloaders * Define a LightningModule * Use the Trainer from PyTorch Lightning to train our model Let's start by preparing the data: train_images, test_images = train_test_split(image_paths, test_size=.2) DOCUMENT_CLASSES = sorted(List(map( lambda p: p.name, Path("images*).glob("*") ») DOCUMENT_CLASSES [ ntps:wwn.mlexpertiofblogldocument-classiiation-witriayoutinv’ 1425 9115724, 6:49AM Document Classification with Layout MvS | MLExpert- Gat Things Dane wit Al Bootcamp ‘balance_sheets', ‘cash flow", “incone_statement’, notes", ‘others* First, we split the document images into train and test subsets. Next, we extract the document classes from the document image directory names. This allows us to create a mapping from document image to its class. We have everything needed to create a PyTorch Dataset: class DocumentClassificationbataset (Dataset) def __init_(self, image_paths, processor): self.image_paths = image_paths self.processor = processor def _len_(self): return len(self.image_paths) def __getitem_(self, item): image_path = self.image_paths[item] json_path = image_path.with_suffix(".json") with json_path.open("r") as f: ocr_result = json.load(f) with Image.open(image_path).convert("RGB") as image: width, height = image.size width_scale = 1088 / width height_scale = 1000 / height words = [] boxes = [] for row in ocr_result: boxes . append(scale_bounding_box( row["bounding_box"], width_scale, height_scale » words. append(row[ "word" ]) ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 1525 98124, 8:43 AM Document Classification wth LayoutLMV | MLExpert- Get Things Dane wth Al Bootcamp encoding = self.processor( image, words, max_length=512, max_length", label = DOCUMENT_CLASSES. index(image_path. parent. name) return dict( input_ids=encoding[ "input_ids"].flatten(), attention_mask=encoding["attention_mask"].flatten(), bbox=encoding[ "box" ].flatten(end_dim=1), pixel_values=encoding| “pixel_values"].flatten(end_dim=1), Jabels=torch.tensor(1abel, dtype=torch. long) The class takes two arguments: © image_paths a list of paths to document images © processor : an instance of the LayoutLMv3Processor class The _1en__ method returns the number of images in the dataset, and the _getitem__ method loads and preprocesses the image and OCR results at a given index We can now create datasets and data loaders for the train and test documents: train_dataset = DocumentClassificationDataset(train_images, processor) test_dataset = DocumentClassificationDataset(test_images, processor) train_data_loader = DataLoader( train_dataset, batch_size=8, shuffle-True, num_workers=2 test_data_loader = DataLoader( test_dataset, ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 16125 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp shuffle-False, niun_workers=2 Let's implement a LightningModule using PyTorch Lightning. This will wrap all the components and allow us to train our model class ModelModule(pl.LightningModule) : def _init_(self, n_classes:int): super()._init_() self.model = LayoutLmv3ForSequenceClassification.from_pretrained( "microsoft/layoutImv3-base”, num_labels=n_classes ) self.model.config.id2label = {k: v for k, v in enumerate(DOCUMENT_CLASSES) } self.model.config.label2id = {v: k for k, v in enumerate(DOCUMENT_CLASSES)} self.train_accuracy = Accuracy(task="multiclass", num_classes=n_classes) self.val_accuracy = Accuracy(task="multiclass”, num_classes=n_classes) def forward(self, input_ids, attention_mask, bbox, pixel_values, labels=None): return self.model( input_ids, attention_mask=attention_mask, bbox=bbox, pixel_values=pixel_values, labels=labels def training_step(self, batch, batch_idx): input_ids = batch["input_ids"] attention_mask = batch["attention_mask"] bbox = batch["bbox"] pixel_values = batch["pixel_values”] labels = batch{"labels"] output = self(input_ids, attention_mask, bbox, pixel_values, labels) self.log("train_loss", output .1oss) self. 1og( “train_acc", self. train_accuracy (output. logits, labels), ) return output.loss def validation step(self, batch, batch_idx): https shwwn-mlexpertiofblogidocument classification witrayourinv3 1728 9115724, 843 AM Document Classification wth Layout M3 | MLExpert- Get Things Done with Al Booteamp input_ids = batch{"input_ids"] attention_mask = batch["attention_mask"] bbox = batch["bbox"] pixel_values = batch{"pixel_values”] labels = batch["labels"] output = self(input_ids, attention_mask, bbox, pixel_values, labels) self.log("val_loss", output. 1oss) self. 10g( "val_ace", self.val_accuracy(output.logits, labels), on_step-False, on_epoch=True ) return output. loss def configure_optimizers(self): optimizer = torch.optim.Adam(self.model.parameters(), 1r=0.00001) #1e-5 return optimizer The _init__ method initializes the LayoutLMv3 model for sequence classification with a specified number of classes, and sets up the accuracy metric for both training and validation. The forward method takes input tensors ( input_ids , attention_mask, bbox, and pixel_values ) and an optional labels tensor (only used during training), and returns the model output. The training_step and validation_step methods define the training and validation steps respectively. In each method, the input tensors are passed through the model, and the loss and accuracy are logged. The configure_optimizers method defines an Adam optimizer used for training. Let's create an instance of our ModelModule : model_module = ModelModule(1en(DOCUMENT_CLASSES)) We'll use Tensorboard to track the training progress load_ext tensorboard https:ihwww-mlexpertiofblogidocument classification witrayourinv3 18725 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done wth Al Boatcamp ‘Ktensorboard --logdir lightning_logs Finally, we need to set up the PyTorch Lightning Trainer: model_checkpoint = ModelCheckpoint ( filename="{epoch}-{step}-{val_loss:.4F}", save_last=True, save_top_k=3, monitor="vi trainer = pl.Trainer( accelerator="gpu", precision=16, devices: max_epochs=5, callbacks=[ model_checkpoint 1. ) The Modelcheckpoint callback is defined to save the model's weights after each epoch, with a specific naming format that includes the epoch number, training step, and validation loss. The Trainer will use a single GPU, mixed precision (16 bit) training, and train for 5 epochs. Let's train: trainer. fit(model_module, train_data_loader, test_data_loader) INFO: pytorch_lightning.accelerators.cuda:LOCAL_RANK: @ - CUDA_VISIBLE_DEVICES: [@] INFO: pytorch_Lightning.callbacks.model_summary: | Name | type | Params | model | LayoutLMv3ForSequenceClassification | 125 1 | train_accuracy | MulticlassAccuracy le | val_accuracy | MulticlassAccuracy Je 125M‘ Trainable parans ° Non-trainable params 225M ‘Total. params https:ihwww-mlexpertiofblogidocument classification witrayourinv3 19125 98124, 8:43 AM Document Classification wth LayoulL MVS | MLExpert- Get Things Done with Al Bootcamp 251.843 Total estimated model params size (MB) nitpssww.mlexpen joblog!documert-classifcation-withiayoutinyS 20128 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done with Al Boatcamp trained_model = ModelModule.load_from_checkpoint( model_checkpoint .best_model_path, n_classes=len(DOCUMENT_CLASSES) , local_files_only=True notebook_login() trained_model.model .push_to_hub( “layout Inv3-Financial-document classification” Once the model is uploaded, we can easily download it using its name or ID. We will load the model from the Hub and put it on the GPU for inference: DEVICE = “cuda:0" if torch.cuda.is_available() else "cpu model = LayoutLMv3ForSequenceClassification. from_pretrained( ‘curiousily/layout1mv3-financial-docunent-classification” ) model model.eval().to(DEVICE) We'll write a function to do inference for a single document image: def predict_document_image( image_path: Path, model: LayoutLMv3ForSequenceClassification, processor: LayoutLMv3Processor): json_path = image_path.with_suffix(".json") with json_path.open("r") as f: ocr_result = json.load(f) with Tmage.open(image_path).convert("RGB") as image: width, height = image.size width_scale = 1000 / width height_scale = 1000 / height words ol ol for row in ocr_result: boxes. append( tps mlexperioflogdocumentclassfeation with ayoutinv3 21 boxes 9115124, 843 AM Document Classification wth LayoutLMV | MLExpert- Get Things Dane wth Al Bootcamp scale_bounding_box( row[ "bounding box"], width_scale, height_scale d words. append(row[ "word" ]) encoding = processor( image, words, boxes=boxes, max_length=512, padding="max_length”, truncation=True, return_tensor pt” with torch. inference_mode(): output = model( input_ids-encoding[ "input_ids" ]to(DEVICE), attention_mask=encoding[ “attention_mask"].to(DEVICE), bbox-encoding["bbox"] .to(DEVICE), pixel_values=encoding["pixel_values"].to(DEVICE) predicted_class = output. logits.argnax() return model. config. id2label [predicted_class.item()] This function takes an image path as input, opens the image, extracts the OCR, scales the bounding boxes based on the image size, and preprocesses the image and text data using the previously defined processor . The preprocessed data is then sent to the model for inference on the GPU. Finally, the function returns the predicted class label for the input image. We can now execute the function on all test documents ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done with Al Boatcamp labels = [] predictions = [] for image_path in tadm(test_images): labels .append(image_path.parent..name) predictions. append( predict_document_image(image_path, model, processor) Given that the dataset is imbalanced, relying solely on accuracy as the evaluation metric may not provide a complete picture of the model's performance. Therefore, we will use a confusion matrix to gain deeper insights: cm = confusion_matrix(labels, predictions, labels-DOCUMENT_CLASSES) cm_display = ConfusionMatrixDi splay( confusion_matrix=cm, display_labels-DOCUMENT_CLASSES cm_display.plot() cm_display.ax_.set_xticklabels(DOCUMENT_CLASSES, rotation=45) cm_display.figure_.set_size_inches(16, 8) plt.show(); There is some confusion between the two most represented classes - others and notes . Could you create an improved model that makes more accurate predictions for those? ntps:hwwwmlexpertofblogldocument-classiiation-witrayoutinv’ 23125 9si24, 8:43 AM Document Classification with LayoulL Mv3 | MLExpert- Get Things Do balance sheets cash flow Income_statement others hitps:www-mlexper ioblogidocument-classcation-with layouts with Al Bootcamp 200 150 100 S 24128 9115124, 843 AM Document Classification wth Layout MV | MLExpert- Get Things Done with Al Boatcamp 3,000+ people already joined Join the The State of Al Newsletter Every week, receive a curated collection of cutting-edge Al developments, practical tutorials, and analysis, empowering you to stay ahead in the rapidly evolving field of Al Your Email Address SUBSCRIBE Iwon't send you any spam, ever! References 1. wkhtmitopdf - tool to render HTML into PDF/images & 2. Financial Documents Clustering © 3. Hexaware Technologies financial annual reports = 4, LayoutLMv3: Pre-training for Document Al with Unified Text and Image Masking © Dark 1020-2024 MLExpert™ by Venelin Valkov. All Rights Reserved. ntps:wwn.mlexpertofblogldocument-classiiation-witrayoutlnv’ 25125

You might also like