Python 3: Extracting the Content of PDF Files with PyPDF2

Article link: https://blog.csdn.net/liranke/article/details/126525804

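This article describes how to extract text from PDF files with Python's PyPDF2 library. Step-by-step instructions and sample code show how to read a PDF file's content, including opening the file, getting the page count, and extracting the text of each page.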


1 Introduction to the PDF File Format

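PDF stands for Portable Document Format. As a file format it is independent of the operating system and is supported on Windows, Unix/Linux, macOS, and almost every other mainstream platform. It is also designed to print consistently on any printer, with accurate colors and layout: a PDF faithfully reproduces every character, color, and image of the original.
Unlike a plain text file, however, its content cannot be read directly. It requires dedicated software to open, just as a Word document requires Office.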
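
To make the contrast with plain text concrete, here is a minimal sketch that checks whether a file begins with the `%PDF-` marker found at the start of PDF files. The helper name `looks_like_pdf` is made up for illustration and is not part of any library; note that the file is read as bytes, not as text.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the '%PDF-' header bytes."""
    # Open in binary mode: a PDF is not a text file, so raw bytes are compared.
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'
```

A file that passes this check is not necessarily a valid PDF; the sketch only shows that the format has to be handled as binary data.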

This post focuses on how to extract the text of a PDF file with Python.

2 Introduction to the PyPDF2 Library

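PyPDF2 is a third-party Python library for working with PDF files. For example, it can:

Extract text from PDF files
Split a document page by page
Merge documents page by page
Crop pages
Merge multiple pages into a single page
Encrypt and decrypt PDF documents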
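
As an illustration of one item in this list, the sketch below splits a document into one file per page. It assumes a PyPDF2 release that provides the `PdfReader` and `PdfWriter` names (1.28 or later); the function name `split_pdf_by_page` and the `page_N.pdf` output naming are invented for this example.

```python
import os


def split_pdf_by_page(src_path, out_dir):
    """Write each page of src_path to its own PDF file inside out_dir."""
    # Imported here so PyPDF2 is only required when the function is called.
    from PyPDF2 import PdfReader, PdfWriter

    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    # Keep the source open while writing: pages are read from it lazily.
    with open(src_path, 'rb') as src:
        reader = PdfReader(src)
        for index, page in enumerate(reader.pages):
            writer = PdfWriter()
            writer.add_page(page)
            out_path = os.path.join(out_dir, f'page_{index + 1}.pdf')
            with open(out_path, 'wb') as out:
                writer.write(out)
            out_paths.append(out_path)
    return out_paths
```

The function returns the list of files it wrote, so a caller can check that the count matches the page count of the source document.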

For more information, see:

https://pypi.org/project/PyPDF2/

Installing PyPDF2:

pip install PyPDF2
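
PyPDF2 renamed many of its methods over time, and the legacy camelCase names used later in this article no longer work in PyPDF2 3.0 and later, so it can help to check which version was installed. The following is a small sketch using only the standard library; the helper name `installed_pypdf2_version` is illustrative.

```python
from importlib import metadata


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is missing."""
    try:
        return metadata.version('PyPDF2')
    except metadata.PackageNotFoundError:
        return None


print('PyPDF2 version:', installed_pypdf2_version())
```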

3 Using PyPDF2: Extracting the Content of a PDF File

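(1) Core code:

# Read the PDF (pdfFileObj is a file object opened in 'rb' mode)
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Get page N (page_index is zero-based)
pageObj = pdfReader.getPage(page_index)

# Extract the text
dataStr = pageObj.extractText()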
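
The calls above use PyPDF2's legacy names, which no longer work in PyPDF2 3.0 and later. As a hedged sketch of the same three steps with the current names (`PdfReader`, `reader.pages`, `extract_text()`), something like the following should work on recent releases. The function name `read_page_texts` and its `max_pages` parameter are made up for this example.

```python
def read_page_texts(path, max_pages=None):
    """Return a list with the extracted text of each page (up to max_pages)."""
    # Imported here so PyPDF2 is only required when the function is called.
    from PyPDF2 import PdfReader  # current name for PdfFileReader

    texts = []
    with open(path, 'rb') as f:
        reader = PdfReader(f)
        page_count = len(reader.pages)          # replaces numPages
        limit = page_count if max_pages is None else min(max_pages, page_count)
        for i in range(limit):
            page = reader.pages[i]              # replaces getPage(i)
            texts.append(page.extract_text())   # replaces extractText()
    return texts
```

Calling `read_page_texts(filename, max_pages=1)` mirrors the "first page only" behavior of the full example in this section.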

(2) The complete code:

# -*- coding: utf-8 -*-
import PyPDF2


def read_pdf(filename):
    '''Read a PDF file and print its text content.'''
    # Must be opened with 'rb'; 'rw' and 'r+' both fail
    pdfFileObj = open(filename, 'rb')
    # Does not work: open(filename, 'r+', encoding="utf-8")
    try:
        # Legacy PyPDF2 names (pre-3.0); newer releases use PdfReader,
        # reader.pages and extract_text() instead
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        print("pages cnt:", pdfReader.numPages)

        for i in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(i)
            dataStr = pageObj.extractText()

            print("current page index:", i)
            print("============text===============")
            print(dataStr)

            if i == 0:
                # For testing only: print just the first page
                break
    finally:
        # Always close the file, even if reading fails
        pdfFileObj.close()


if __name__ == '__main__':
    filename = 'Effective C++ 英文版.pdf'
    read_pdf(filename)

Output:

% python3 pdf1.py
pages cnt: 251
current page index: 0
============text===============
E ffe c tiv eC + +byScottMeyers
Back
to
Dedication
Continue
... (remaining characters omitted)
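
The output above shows that extracted text can arrive with words on separate lines. One possible clean-up step, which is not part of PyPDF2, is to collapse runs of whitespace into single spaces. The helper name `collapse_whitespace` is illustrative, and this sketch cannot repair splits inside a word such as "E ffe c tiv e".

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r'\s+', ' ', text).strip()


raw = 'Back\nto\nDedication\nContinue\n'
print(collapse_whitespace(raw))  # prints: Back to Dedication Continue
```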


Notes:

(1) pdfFileObj = open(filename, 'rb')  # 'rw' and 'r+' both fail

The file must be opened with 'rb', that is, read-only in binary mode. With 'rw', open() raises the following error:

"ValueError: must have exactly one of create/read/write/append mode"
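
The error can be reproduced without any PDF library, because it comes from Python's built-in `open()` rejecting the mode string. A minimal sketch, with the helper name `open_error` invented for illustration:

```python
def open_error(path, mode):
    """Return the message of the ValueError raised by open(path, mode), or None."""
    try:
        with open(path, mode):
            return None
    except ValueError as exc:
        # An invalid mode string such as 'rw' is rejected before the file is opened.
        return str(exc)
```

For an existing file, `open_error(path, 'rb')` returns `None`, while `open_error(path, 'rw')` returns the message quoted above.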

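(2) pdfReader = PyPDF2.PdfFileReader(pdfFileObj): creates a PdfFileReader object, which is used to read the data;

(3) pdfReader.numPages: the number of pages in the PDF;

(4) pageObj = pdfReader.getPage(i): gets the page object at index i (zero-based);

(5) dataStr = pageObj.extractText(): extracts the text of that page;

(6) pdfFileObj.close(): always close the file once the work is done.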
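
An alternative to calling close() by hand is a `with` block, which closes the file when the block exits, even if an exception is raised inside it. A small sketch using only the standard library; the function name `read_and_check_closed` is made up for this example:

```python
def read_and_check_closed(path):
    """Read a file inside a with block and report whether it was closed afterwards."""
    with open(path, 'rb') as pdf_file:
        data = pdf_file.read()  # a PDF reader would be given pdf_file here
    # Leaving the with block has already closed the file.
    return data, pdf_file.closed
```

The second element of the returned tuple is `True`, which shows that no explicit close() call is needed with this pattern.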