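Python Study Notes: pdfminer, a Library for Reading PDF Files
The previous section introduced pdfplumber, a library for extracting text and tables from PDFs. This one covers another PDF parsing library: pdfminer.
Installation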
pip install pdfminer3k
Or, to download through the Tsinghua PyPI mirror:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfminer3k
Note: on Python 2 the package is pdfminer; on Python 3 it is pdfminer3k. Either way, the module is imported as pdfminer.
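To check which of the two packages is actually installed in the current environment, something like the following may help. It is only a sketch: it assumes Python 3.8 or later (for importlib.metadata), and installed_version is a helper name made up for this note, not part of pdfminer.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two package names mentioned in the note above.
for name in ("pdfminer3k", "pdfminer"):
    version = installed_version(name)
    print(name, "->", version if version is not None else "not installed")
```

If both names report "not installed", run the pip command above first.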
Reading PDF Text
After searching through material online, I arrived at the following implementation:
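from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

# Keep the file open while pages are processed; it is closed automatically afterwards
with open('浪潮之巅.pdf', 'rb') as fp:
    # Create the parser and the document object, then link them to each other
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    # Initialize the document (no password supplied)
    doc.initialize()

    # Stop if the PDF does not permit text extraction
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Set up the resource manager, layout parameters, aggregator and interpreter
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in doc.get_pages():
        interpreter.process_page(page)
        # Layout of the page that was just processed
        layout = device.get_result()
        for x in layout:
            # Print only the text boxes
            if isinstance(x, LTTextBox):
                print(x.get_text())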
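The same steps can also be wrapped in functions that return the text instead of printing it. The version below is a rough sketch rather than a definitive implementation: collect_text, iter_layouts and extract_pdf_text are names invented for this note, and the pdfminer calls are the same pdfminer3k calls used above, imported lazily inside the functions.

```python
def collect_text(layouts, text_type):
    """Concatenate get_text() of every element of text_type found in the layouts."""
    chunks = []
    for layout in layouts:
        for element in layout:
            if isinstance(element, text_type):
                chunks.append(element.get_text())
    return "".join(chunks)


def iter_layouts(path):
    """Yield one layout object per page (assumes the pdfminer3k API used above)."""
    # Imported here so that collect_text stays usable without pdfminer installed.
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams

    with open(path, "rb") as fp:
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize()
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in doc.get_pages():
            interpreter.process_page(page)
            yield device.get_result()


def extract_pdf_text(path):
    """Return the text of all text boxes in the PDF at path as one string."""
    from pdfminer.layout import LTTextBox
    return collect_text(iter_layouts(path), LTTextBox)
```

With pdfminer3k installed, print(extract_pdf_text('浪潮之巅.pdf')) should produce the same text as the loop above. Keeping collect_text separate from the pdfminer-specific code makes it possible to try the filtering logic on any iterable of objects that have a get_text() method.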
The output is as follows: