Introduction to Python PyPDF2 Library
Last Updated: 11 Sep, 2024
PyPDF2 is a Python library for working with PDF files. It lets us read, manipulate, and extract information from PDFs without separate PDF software. Using PyPDF2, we can split a single PDF into multiple files, merge multiple PDFs into one, extract text, rotate pages, and add watermarks. This article covers the most commonly used features of the library.
What is PyPDF2?
PyPDF2 is especially handy for large documents. Suppose we have a large PDF document and only need to send a few pages to someone. Instead of extracting those pages by hand, we can do it in a few lines of code. The library can also combine multiple PDF files into one, and it supports reading, text extraction, merging, splitting, rotating, and encrypting or decrypting PDF files.
Installing PyPDF2 via pip
Before using PyPDF2, we need to install it with pip. Open a command prompt or terminal and run the following command:
pip install PyPDF2
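To confirm that the installation succeeded, one option is to ask Python whether the package can be located. The sketch below uses only the standard library; the helper name is_installed is made up for this illustration and is not part of PyPDF2.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    # find_spec returns None when no top-level module of that name exists.
    return importlib.util.find_spec(package_name) is not None


# is_installed is a hypothetical helper, not part of PyPDF2.
print("PyPDF2 installed:", is_installed("PyPDF2"))
```

If this prints False, the pip command above most likely ran against a different Python interpreter than the one executing the script.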
Basic Concepts
Before looking at PyPDF2's features, let's go over a few key concepts:
- PDF Structure: A PDF file consists of objects like text, images, metadata, and page structure.
- Pages: PDF files contain multiple pages, and each page can be manipulated individually.
- Metadata: PDFs contain information such as the author, title, and creation date.
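These concepts map onto a few attributes of a PyPDF2 reader object. The following is a minimal sketch, assuming reader is a PyPDF2.PdfReader (or any object exposing the same pages and metadata attributes); the function name summarize_pdf is hypothetical.

```python
def summarize_pdf(reader):
    """Collect the page count and basic metadata from a PdfReader-like object.

    summarize_pdf is an illustrative helper, not a PyPDF2 function.
    """
    info = reader.metadata  # may be None when the file carries no metadata
    return {
        "pages": len(reader.pages),
        # getattr with a default also covers the case where info is None
        "title": getattr(info, "title", None),
        "author": getattr(info, "author", None),
    }
```

With PyPDF2 installed, it could be called as summarize_pdf(PyPDF2.PdfReader('example.pdf')).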
Key Features of PyPDF2
Some of the key features of PyPDF2 are listed below:
- It is used for reading PDF files.
- It is used for extracting text and metadata.
- It is used for merging, splitting, and rotating pages.
- It is used for encrypting and decrypting PDF files.
- It is also used for adding watermarks and modifying PDF content.
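Rotating pages, one of the features listed above, can be sketched as follows. This assumes a recent PyPDF2 release in which each page object has a rotate(angle) method that rotates clockwise in multiples of 90 degrees; the helper name rotate_pages is hypothetical, and reader and writer are expected to be PdfReader and PdfWriter objects.

```python
def rotate_pages(reader, writer, angle=90):
    """Rotate every page clockwise by `angle` degrees and add it to `writer`.

    rotate_pages is an illustrative helper, not a PyPDF2 function.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    for page in reader.pages:
        # page.rotate() returns the rotated page, so it can be added directly
        writer.add_page(page.rotate(angle))
    return writer
```

With PyPDF2 installed, a call such as rotate_pages(PyPDF2.PdfReader('example.pdf'), PyPDF2.PdfWriter(), 90) returns a writer whose write() method can then save the rotated pages to a new file.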
Working with PDF Files
1. Reading PDF Files
To read a PDF file, we first open it with PyPDF2. Suppose we have a simple PDF file named example.pdf.
Here is how we can read it using PyPDF2:
Python
import PyPDF2

# Open a PDF file
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Get the total number of pages
    total_pages = len(reader.pages)
    print(f"Total pages: {total_pages}")

    # Read the content of the first page
    first_page = reader.pages[0]
    text = first_page.extract_text()
    print(text)
Output:
Reading PDF content using PyPDF2
2. Extracting Text from PDF Files
We can extract text from a PDF by calling the extract_text() method on each page. This can be useful for parsing large documents.
Python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Print the text of every page in order
    for page in reader.pages:
        print(page.extract_text())
Output:
Extracting text from a PDF using PyPDF2
3. Extracting Metadata from PDF Files
PyPDF2 allows us to extract metadata such as the author, title, and creation date:
Python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Document-level information stored in the PDF
    metadata = reader.metadata
    print(f"Author: {metadata.author}")
    print(f"Title: {metadata.title}")
    print(f"Creation Date: {metadata.creation_date}")
Output:
Extracting metadata from a PDF using PyPDF2
Manipulating PDF Files using PyPDF2
Beyond reading, PyPDF2 can also modify PDFs. Let's look at a few ways to manipulate them.
1. Merging Multiple PDF Files
We can merge multiple PDF files into one using PyPDF2's PdfWriter class. Suppose we have another PDF file named example2.pdf.
The following code merges example.pdf and example2.pdf:
Python
from PyPDF2 import PdfReader, PdfWriter

pdf_writer = PdfWriter()

# Add PDFs to merge
for pdf_file in ['example.pdf', 'example2.pdf']:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        pdf_writer.add_page(page)

# Save the merged PDF
with open('merged.pdf', 'wb') as output_file:
    pdf_writer.write(output_file)
Running this produces a merged.pdf file containing the pages of both input files.
2. Splitting PDF Files into Individual Pages
If we want to split a PDF into separate pages, PyPDF2 makes this easy. Let's split the merged.pdf file.
Python
import PyPDF2

with open('merged.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)

        # Save each page as a new file
        with open(f'page_{i+1}.pdf', 'wb') as output_pdf:
            writer.write(output_pdf)
Output:
Splitting a PDF into multiple PDFs using PyPDF2
The files page_1.pdf and page_2.pdf will contain page 1 and page 2 of merged.pdf, respectively.
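The same idea extends to pulling a consecutive range of pages into a single file, which is the "send only a few pages" scenario described earlier. Below is a minimal sketch, assuming reader and writer are PdfReader and PdfWriter objects; the function name copy_page_range and its 1-based page numbering are choices made for this illustration.

```python
def copy_page_range(reader, writer, first, last):
    """Copy pages first..last (1-based, inclusive) from reader into writer.

    copy_page_range is an illustrative helper, not a PyPDF2 function.
    """
    total = len(reader.pages)
    if not 1 <= first <= last <= total:
        raise ValueError(f"page range {first}-{last} is outside 1-{total}")
    # Convert the 1-based range to the 0-based indices used by reader.pages
    for index in range(first - 1, last):
        writer.add_page(reader.pages[index])
    return writer
```

With PyPDF2 installed, copy_page_range(PyPDF2.PdfReader('merged.pdf'), PyPDF2.PdfWriter(), 1, 2) returns a writer holding the first two pages, which can be saved with writer.write() as in the examples above.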
3. Adding Watermarks to PDF Files
We can also add a watermark to a PDF file. For this we need another PDF file containing the watermark (such as a logo or text), which we overlay on each page of the main PDF file.
The following Python program adds the watermark from watermark.pdf to a PDF using PyPDF2.
Python
import PyPDF2

with open('example.pdf', 'rb') as main_pdf, open('watermark.pdf', 'rb') as watermark_pdf:
    reader = PyPDF2.PdfReader(main_pdf)
    watermark_reader = PyPDF2.PdfReader(watermark_pdf)
    writer = PyPDF2.PdfWriter()

    # Use the first page of watermark.pdf as the overlay
    watermark_page = watermark_reader.pages[0]

    for page in reader.pages:
        page.merge_page(watermark_page)
        writer.add_page(page)

    # Save the watermarked PDF
    with open('watermarked.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)
The result is saved as watermarked.pdf.
4. Encrypting PDF Files
We can also password-protect our PDF files using encryption. Below is an example:
Python
import PyPDF2

writer = PyPDF2.PdfWriter()

# Add pages to encrypt
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

    # Encrypt the PDF with a password
    writer.encrypt('password123')

    # Save the encrypted PDF
    with open('encrypted.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)
Output:
When we try to open encrypted.pdf, the PDF viewer prompts us to enter the password.
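Reading a protected file back from Python follows the reverse path. Here is a minimal sketch, assuming reader is a PyPDF2.PdfReader, which exposes an is_encrypted attribute and a decrypt(password) method; the helper name unlock is hypothetical.

```python
def unlock(reader, password):
    """Decrypt a PdfReader-like object in place and report whether it worked.

    unlock is an illustrative helper, not a PyPDF2 function.
    """
    if not reader.is_encrypted:
        return True  # nothing to decrypt
    # decrypt() returns a falsy value when the password does not match
    return bool(reader.decrypt(password))
```

With PyPDF2 installed, unlock(PyPDF2.PdfReader('encrypted.pdf'), 'password123') should return True, after which the reader's pages can be accessed as usual.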
Common Use Cases
- We can automate generating PDF reports by extracting or merging data.
- We can also combine PyPDF2 with other Python libraries like matplotlib or PIL for more advanced PDF generation.
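As one illustration of automating extraction, the sketch below finds the pages of a document that mention a keyword. It assumes reader is a PyPDF2.PdfReader whose pages provide extract_text(); the function name find_pages_with is hypothetical, and the match is a simple case-insensitive substring test.

```python
def find_pages_with(reader, keyword):
    """Return the 1-based numbers of pages whose extracted text contains keyword.

    find_pages_with is an illustrative helper, not a PyPDF2 function.
    """
    needle = keyword.lower()
    matches = []
    for number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string if a page yields no text
        text = page.extract_text() or ""
        if needle in text.lower():
            matches.append(number)
    return matches
```

The returned page numbers could then feed a report, or select which pages to copy into a new file.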
Conclusion
PyPDF2 is a simple yet powerful library for working with PDFs in Python. By following the steps above, we can read, merge, split, watermark, and encrypt PDF files, and then explore the other features PyPDF2 provides.