Introduction to Python PyPDF2 Library

Last Updated : 11 Sep, 2024

PyPDF2 is a Python library for working with PDF files. It lets us read, manipulate, and extract information from PDFs without dedicated PDF software. Using PyPDF2, we can split a single PDF into multiple files, merge multiple PDFs into one, extract text, rotate pages, and even add watermarks. In this article, we cover the most commonly used features of the library.

What is PyPDF2?

PyPDF2 is most useful when a PDF task would be tedious to do by hand. Suppose we have a large PDF document and only need to send a few pages to someone. Instead of extracting those pages manually, we can do it in a few lines of code. In the same way, we can combine multiple PDF files into one. Overall, the library handles reading, text extraction, merging, splitting, rotating, and encrypting/decrypting PDF files.

Installing PyPDF2 via pip

Before using PyPDF2, we need to install it. We can do this with pip by opening a command prompt or terminal and running the following command:

pip install PyPDF2
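
To confirm that the installation worked, one option is to ask Python for the installed version. The sketch below uses only the standard library (importlib.metadata, available in Python 3.8+); installed_version is a hypothetical helper name chosen for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if PyPDF2 is not installed
print(installed_version("PyPDF2"))
```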

Basic Concepts

Before exploring the features of PyPDF2, let's look at a few key concepts:

  • PDF Structure: A PDF file consists of objects like text, images, metadata, and page structure.
  • Pages: PDF files contain multiple pages, and each page can be manipulated individually.
  • Metadata: PDFs contain information such as the author, title, and creation date.

Key Features of PyPDF2

Some of the key features of PyPDF2 are given below:

  • It is used for reading PDF files.
  • It is used for extracting text and metadata.
  • It is used for merging, splitting, and rotating pages.
  • It is used for encrypting and decrypting PDF files.
  • It is also used for adding watermarks and modifying PDF content.

Working with PDF Files

1. Reading PDF Files

To read a PDF file, we first open it with PyPDF2. Let's say we have a PDF named example.pdf.

Figure: A simple PDF file

Here is how we can read a PDF using PyPDF2:

Python
import PyPDF2

# Open a PDF file
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Get the total number of pages
    total_pages = len(reader.pages)
    print(f"Total pages: {total_pages}")

    # Read the content of the first page
    first_page = reader.pages[0]
    text = first_page.extract_text()
    print(text)

Output:

Figure: Reading PDF content using PyPDF2

2. Extracting Text from PDF Files

We can extract text from PDF files using the extract_text() method of each page. Looping over reader.pages lets us process every page, which is useful for parsing large documents.

Python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    for page in reader.pages:
        print(page.extract_text())

Output:

Figure: Extracting text from a PDF using PyPDF2
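
Not every page yields text: PyPDF2 does not perform OCR, so a scanned page that contains only an image typically gives an empty result. The sketch below shows one way to keep track of such pages. collect_page_text is a hypothetical helper written for this example; it only relies on each page having an extract_text() method.

```python
def collect_page_text(pages):
    """Return (texts, empty_page_numbers) for an iterable of page objects.

    Each page only needs an extract_text() method, as PyPDF2 pages have.
    (collect_page_text is a hypothetical helper, not a PyPDF2 function.)
    """
    texts = []
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        # Treat a missing result the same as an empty string
        text = (page.extract_text() or "").strip()
        if not text:
            empty_pages.append(number)  # 1-based page number
        texts.append(text)
    return texts, empty_pages


# Example usage (assumes example.pdf exists):
# import PyPDF2
# with open("example.pdf", "rb") as pdf_file:
#     reader = PyPDF2.PdfReader(pdf_file)
#     texts, empty_pages = collect_page_text(reader.pages)
#     print(f"Pages without extractable text: {empty_pages}")
```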

3. Extracting Metadata from PDF Files

PyPDF2 allows us to extract metadata such as the author, title, and creation date:

Python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    
    metadata = reader.metadata
    print(f"Author: {metadata.author}")
    print(f"Title: {metadata.title}")
    print(f"Creation Date: {metadata.creation_date}")

Output:

Figure: Extracting metadata from a PDF using PyPDF2
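
Some PDFs carry no metadata at all, in which case reader.metadata is None and accessing metadata.author would raise an AttributeError. A minimal defensive sketch follows; summarize_metadata is a hypothetical helper name, and the three fields are the same ones used in the example above.

```python
def summarize_metadata(metadata):
    """Return author, title and creation date as a dict of possibly-None values.

    `metadata` is the object returned by reader.metadata, or None.
    (summarize_metadata is a hypothetical helper, not a PyPDF2 function.)
    """
    fields = ("author", "title", "creation_date")
    if metadata is None:
        # The PDF has no document information at all
        return {field: None for field in fields}
    # getattr with a default also covers objects lacking one of the fields
    return {field: getattr(metadata, field, None) for field in fields}


# Example usage (assumes example.pdf exists):
# import PyPDF2
# with open("example.pdf", "rb") as pdf_file:
#     reader = PyPDF2.PdfReader(pdf_file)
#     print(summarize_metadata(reader.metadata))
```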

Manipulating PDF Files using PyPDF2

Beyond reading, PyPDF2 can also modify PDFs. Let's see a few ways to manipulate them.

1. Merging Multiple PDF Files

We can merge multiple PDF files into one using PyPDF2's PdfWriter. Let's say we have another PDF file named example2.pdf.

Figure: example2.pdf

Merge example.pdf and example2.pdf:

Python
from PyPDF2 import PdfReader, PdfWriter

pdf_writer = PdfWriter()

# Add PDFs to merge
for pdf_file in ['example.pdf', 'example2.pdf']:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        pdf_writer.add_page(page)

# Save the merged PDF
with open('merged.pdf', 'wb') as output_file:
    pdf_writer.write(output_file)

This produces a merged.pdf file.

Figure: merged.pdf

2. Splitting PDF Files into Individual Pages

If we want to split a PDF into separate pages, PyPDF2 makes this easy. Let's split the merged.pdf file.

Python
import PyPDF2

with open('merged.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)

        # Save each page as a new file
        with open(f'page_{i+1}.pdf', 'wb') as output_pdf:
            writer.write(output_pdf)


Output:

Figure: Splitting a PDF into multiple PDFs using PyPDF2

page_1.pdf and page_2.pdf contain the first and second pages of merged.pdf, respectively.

3. Adding Watermarks to PDF Files

We can also add a watermark to a PDF file. For this, we need another PDF file containing the watermark (such as a logo or text), which we overlay on each page of our main PDF file.

Figure: watermark.pdf

Python program to add a watermark to a PDF using PyPDF2:

Python
import PyPDF2

with open('example.pdf', 'rb') as main_pdf, open('watermark.pdf', 'rb') as watermark_pdf:
    reader = PyPDF2.PdfReader(main_pdf)
    watermark_reader = PyPDF2.PdfReader(watermark_pdf)

    writer = PyPDF2.PdfWriter()
    watermark_page = watermark_reader.pages[0]

    for page in reader.pages:
        page.merge_page(watermark_page)
        writer.add_page(page)

    # Save the watermarked PDF
    with open('watermarked.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)

Figure: watermarked.pdf


4. Encrypting and Decrypting PDF Files

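We can also password-protect our PDF files using encryption. The encrypt() method of PdfWriter sets the password that will be required to open the output file.

Below is an example: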

Python
import PyPDF2

writer = PyPDF2.PdfWriter()

# Add pages to encrypt
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

# Encrypt the PDF with a password
writer.encrypt('password123')

# Save the encrypted PDF
with open('encrypted.pdf', 'wb') as output_pdf:
    writer.write(output_pdf)

Output:

When we try to open the file, we will need to enter the password:

Figure: Enter the password to view the PDF

Common Use Cases

  • We can automate generating PDF reports by extracting or merging data.
  • We can also combine PyPDF2 with other Python libraries like matplotlib or PIL for more advanced PDF generation.

Conclusion

PyPDF2 is a simple yet capable library for working with PDFs in Python. With the steps above, we can read PDFs, extract text and metadata, and merge, split, watermark, and encrypt files, and from there explore the remaining features PyPDF2 provides.
