Extract text from PDF File using Python
Last Updated: 09 Aug, 2024
PDF stands for Portable Document Format and uses the .pdf extension. It is one of the most widely used formats for presenting and exchanging documents reliably, independent of software, hardware, or operating system.
In this article, we will extract text from PDF files using two Python libraries: pypdf and PyMuPDF.
Extracting text from a PDF file using the pypdf library.
The Python package pypdf handles text extraction, although it can do more than we need here: it can also be used to generate, decrypt, and merge PDF files. Note: For more information, refer to Working with PDF files in Python.
Installation
To install this package, type the command below in the terminal.
pip install pypdf
Example: The input PDF is a file named example.pdf in the working directory.
Python
# importing required modules
from pypdf import PdfReader
# creating a pdf reader object
reader = PdfReader('example.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
Output:
Let us try to understand the above code in chunks:
reader = PdfReader('example.pdf')
- We created an object of the PdfReader class from the pypdf module.
- The PdfReader class takes one required positional argument: the path to the PDF file.
print(len(reader.pages))
- The pages property gives a list-like sequence of PageObject instances, so we can use Python's built-in len() function to get the number of pages in the PDF file.
page = reader.pages[0]
- Because reader.pages can be indexed like a list, we can get a specific page of the PDF by its index. Python indexing starts from 0, so reader.pages[0] gives us the first page of the PDF file.
text = page.extract_text()
print(text)
- The page object has an extract_text() method that extracts the text from that PDF page.
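The walkthrough above reads only the first page. As a minimal sketch, the same calls can be looped over every page to collect the text of the whole document. The helper names join_page_texts and extract_pdf_text are hypothetical, introduced here for illustration, and the pypdf import sits inside the function only so that the joining logic can be tried without the library installed.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page in `pages`.

    `pages` can be any iterable of objects that offer an extract_text()
    method, such as the reader.pages sequence from pypdf.
    """
    parts = []
    for page in pages:
        # Fall back to an empty string if a page yields no text.
        parts.append(page.extract_text() or "")
    return separator.join(parts)


def extract_pdf_text(path):
    """Return the text of every page of the PDF at `path` (hypothetical helper)."""
    # Imported here so join_page_texts() stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_texts(reader.pages)
```

Calling extract_pdf_text('example.pdf') would then return one string with the pages separated by newlines. Scanned pages that contain only images typically produce empty strings, since there is no text layer to extract.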
Extracting text from a PDF file using the PyMuPDF library.
PyMuPDF is a Python library that supports file formats like XPS, PDF, CBR, and CBZ. But for now, in this article, we are going to concentrate on PDF (Portable Document Format) files.
Installation
pip install pymupdf
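Note: Do not also run pip install fitz. The fitz package on PyPI is an unrelated project that can conflict with PyMuPDF; PyMuPDF itself provides the fitz module imported in the code below.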
To extract the text from the PDF, we follow these steps:
- Importing the library
- Opening document
- Extracting text
Note: We are using the sample.pdf here; to get the pdf, use the link below.
sample.pdf – Link
1. Importing the library
Python
import fitz
2. Opening document
Python
doc = fitz.open('sample.pdf')
Here we created a document object named doc. The filename is passed as a Python string.
3. Extracting text
Python
for page in doc:
    text = page.get_text()
    print(text)
Here, we iterated over the pages of the PDF and used the get_text() method to extract the text of each page.
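When the text of each page needs to stay separate, the loop above can store the results instead of printing them. The following is a minimal sketch under that assumption; text_by_page and read_pdf_pages are hypothetical helper names, and pages are numbered from 1 purely as a convention for this example.

```python
def text_by_page(doc):
    """Map 1-based page numbers to the text of each page.

    `doc` can be any iterable of objects with a get_text() method,
    such as the document returned by fitz.open().
    """
    return {number: page.get_text() for number, page in enumerate(doc, start=1)}


def read_pdf_pages(path):
    """Open the PDF at `path` with PyMuPDF and return its text per page."""
    # Imported here so text_by_page() stays usable without PyMuPDF installed.
    import fitz

    # The with-statement closes the document once the text is collected.
    with fitz.open(path) as doc:
        return text_by_page(doc)
```

For example, read_pdf_pages('sample.pdf')[1] would give the text of the first page.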
Complete code to extract the text
Python
import fitz
doc = fitz.open('sample.pdf')
text = ""
for page in doc:
    text += page.get_text()
print(text)
Output:

Conclusion
We have seen two Python libraries, pypdf and PyMuPDF, that can extract text from a PDF file.