Scrape Data From PDF Files Using Python Towards Data Science
Background
Data science professionals deal with data in all shapes and forms. Data could be
stored in a popular SQL database, such as PostgreSQL or MySQL, or in an old-fashioned
Excel spreadsheet. Sometimes, data might also be saved in an unconventional format, such
as PDF. In this article, I am going to talk about how to scrape data from PDF files using
Python libraries.
Required Libraries
tabula-py: to scrape text from PDF files
Install Libraries
Import Libraries
https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-fe2dc96b1e68 1/8
23/11/21 19:09 Scrape Data from PDF Files Using Python | by Aaron Zhu | Towards Data Science
import tabula as tb
import pandas as pd
import re
We just need to input the location of the tabular data on the PDF page by specifying the
top, left, bottom and right coordinates of the area. If the PDF page only includes the
target table, then we don't even need to specify the area; tabula-py should be able to
detect the rows and columns automatically.
file = 'state_population.pdf'
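A minimal sketch of how the area could be passed to tb.read_pdf. The coordinates, the page number and the helper names read_options and read_state_table are assumptions made for illustration; they are not values from this article. tabula-py expects the area as (top, left, bottom, right), measured in points by default.

```python
def read_options(area=None, pages="1"):
    """Build keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {"pages": pages}
    if area is not None:
        top, left, bottom, right = area
        if not (top < bottom and left < right):
            raise ValueError("area must be (top, left, bottom, right)")
        options["area"] = list(area)
    return options


def read_state_table(path, area=None):
    """Read the first table that tabula-py finds on the page (hypothetical wrapper)."""
    # Imported lazily because tabula-py needs a Java runtime when it is called.
    import tabula as tb

    tables = tb.read_pdf(path, **read_options(area))
    return tables[0]  # read_pdf returns a list of DataFrames by default


# Assumed coordinates, for illustration only.
print(read_options(area=(300, 0, 600, 800)))
```

Calling read_state_table('state_population.pdf') without an area leaves the table detection to tabula-py, which matches the case where the page contains only the target table.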
There are a few steps we need to take to transform the data into panel format.
As with data in a structured format, we use tb.read_pdf to import the unstructured
data. This time, we need to specify extra options to import the data properly.
file = 'payroll_sample.pdf'
Area and Columns: I've talked about area above. Here we will also need to use columns
to identify the locations of all relevant columns (one column in the left section and four
columns in the right section).
Stream and Lattice: if there are grid lines separating each cell, we can use lattice =
True to identify each cell automatically. If not, we can use stream = True together with
columns to specify each cell manually.
lattice (bool, optional) — Force PDF to be extracted using lattice-mode extraction (if
there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
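The options above could be combined roughly as follows. This is a sketch under assumptions: the area and column coordinates are placeholders, the helper names are hypothetical, and pandas_options={'header': None} is inferred from the integer column labels (0, 1, ...) used later in the article rather than stated in it. In tabula-py, columns is a list of x coordinates of the column boundaries.

```python
def payroll_read_options(area, columns, has_grid_lines=False):
    """Build keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {
        "pages": "1",
        "area": list(area),                   # (top, left, bottom, right)
        "pandas_options": {"header": None},   # keep the first line as data
    }
    if has_grid_lines:
        # Ruling lines separate the cells, so lattice mode can find them.
        options["lattice"] = True
    else:
        # No ruling lines: use stream mode and give the column boundaries.
        options["stream"] = True
        options["columns"] = list(columns)
    return options


def read_payroll(path, area, columns, has_grid_lines=False):
    """Read the first table from the page (hypothetical wrapper)."""
    import tabula as tb  # lazy import: tabula-py needs Java when called

    return tb.read_pdf(path, **payroll_read_options(area, columns, has_grid_lines))[0]


# Placeholder coordinates, for illustration only.
print(payroll_read_options(area=(0, 0, 300, 400), columns=[200, 265, 300, 320]))
```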
Now that we have some data to work with, we will use the Python library pandas to
manipulate the dataframe.
First, we will need to create a new column that can identify unique rows. We notice that
the employee names (Superman and Batman) seem to be useful for identifying the border
between different records. Each employee name follows a recognizable pattern: it starts
with a capital letter and ends with a lower-case letter. We can use the regular expression
'^[A-Z].*[a-z]$' to identify employee names, then use the pandas function cumsum
(cumulative sum) to create a row identifier.
df['row'] = df['border'].transform('cumsum')
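The border column used above is not defined in the code shown here. One way it could be built, sketched on a made-up stand-in for column 0 of the imported table (the sample values are assumptions, not the article's data):

```python
import re

import pandas as pd

# Toy stand-in for column 0 of the imported table; the values are made up.
df = pd.DataFrame({
    0: ["Superman", "Net 1,000.00", "01/01/2021",
        "Batman", "Net 2,000.00", "01/01/2021"],
})

# Starts with a capital letter and ends with a lower-case letter.
NAME_PATTERN = re.compile(r"^[A-Z].*[a-z]$")

# 1 marks a line that opens a new record (an employee name), 0 marks any other line.
df["border"] = df[0].astype(str).apply(lambda s: 1 if NAME_PATTERN.search(s) else 0)

# The running total of the markers gives every line of a record the same id.
df["row"] = df["border"].cumsum()

print(df)
```

Any line that appears before the first employee name would get row 0, which is why rows equal to 0 can later be filtered out.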
Step 3: Reshape the data (convert data from long form to wide form)
Next, we will reshape the data in both the left section and the right section. For the
left section, we create a new dataframe, employee, that includes employee_name,
net_amount, pay_date and pay_period. For the right section, we create another dataframe,
payment, that includes OT_Rate, Regular_Rate, OT_Hours, Regular_Hours, OT_Amt and
Regular_Amt. To convert the data into wide form, we can use the pandas function pivot.
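As an illustration of the long-to-wide step for the left section, here is a sketch on toy data. The sample values and the assumption that each record has exactly four lines (name, net amount, pay date, pay period) are mine; the intermediate name wide is hypothetical.

```python
import pandas as pd

# Toy long-form left section: one value per line, tagged with its row id.
employee = pd.DataFrame({
    0: ["Superman", "Net 1,000.00", "01/01/2021", "12/01/2020 - 12/31/2020",
        "Batman", "Net 2,000.00", "01/01/2021", "12/01/2020 - 12/31/2020"],
    "row": [1, 1, 1, 1, 2, 2, 2, 2],
})

# Position of each value within its record (1 = name, 2 = net amount, ...).
employee["index"] = employee.groupby("row").cumcount() + 1

# One output row per record, one output column per position.
wide = employee.pivot(index="row", columns="index", values=0).reset_index()
wide = wide.rename(columns={1: "employee_name", 2: "net_amount",
                            3: "pay_date", 4: "pay_period"})
wide.columns.name = None

# Drop the "Net" label that was extracted together with the amount.
wide["net_amount"] = wide["net_amount"].str.replace("Net", "", regex=False).str.strip()

print(wide)
```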
employee = employee[employee[0].notnull()]
employee['index'] = employee.groupby('row').cumcount()+1
employee['net_amount'] = employee['net_amount'].str.replace('Net', '', regex=False).str.strip()
payment = payment[payment[1].notnull()]
payment = payment[payment['row']!=0]
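The right section can be reshaped in a similar way. The sketch below uses toy data and assumes that column 1 holds the pay type (Regular or OT) and columns 2 to 4 hold the rate, hours and amount; that layout, the sample values and the descriptive intermediate names are assumptions for illustration. Renaming the integer columns before pivoting is a design choice that keeps the flattening step readable.

```python
import pandas as pd

# Toy right section as it might come out of the PDF: a header line (row 0),
# a blank line, and one Regular and one OT line per employee.
payment = pd.DataFrame({
    1: ["Type", "Regular", "OT", None, "Regular", "OT"],
    2: ["Rate", "10.00", "15.00", None, "20.00", "30.00"],
    3: ["Hours", "160", "10", None, "160", "5"],
    4: ["Amount", "1,600.00", "150.00", None, "3,200.00", "150.00"],
    "row": [0, 1, 1, 1, 2, 2],
})

payment = payment[payment[1].notnull()]   # drop blank lines
payment = payment[payment["row"] != 0]    # drop lines before the first employee

# Give the integer columns descriptive names before reshaping.
payment = payment.rename(columns={1: "pay_type", 2: "Rate", 3: "Hours", 4: "Amt"})

wide = payment.pivot(index="row", columns="pay_type", values=["Rate", "Hours", "Amt"])

# Flatten the (measure, pay type) column pairs into names such as "Regular_Rate".
wide.columns = [f"{pay_type}_{measure}" for measure, pay_type in wide.columns]
wide = wide.reset_index()

print(wide)
```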
Step 4: Join the data in the left section with the data in right section
Lastly, we use the pandas function merge to join the employee and payment dataframes on
the row identifier.
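A small sketch of the join on toy versions of the two dataframes. The sample values and the result name df_clean are assumptions; validate='one_to_one' is an optional safeguard I added, which makes pandas raise an error if a row identifier appears more than once on either side.

```python
import pandas as pd

# Toy wide-form dataframes, one line per employee record.
employee = pd.DataFrame({
    "row": [1, 2],
    "employee_name": ["Superman", "Batman"],
    "net_amount": ["1,000.00", "2,000.00"],
})
payment = pd.DataFrame({
    "row": [1, 2],
    "Regular_Amt": ["1,600.00", "3,200.00"],
    "OT_Amt": ["150.00", "150.00"],
})

# Inner join keeps only the records that appear in both sections.
df_clean = employee.merge(payment, on="row", how="inner", validate="one_to_one")

print(df_clean)
```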
Final Note
Even today, many companies still process PDF data manually. With the help of Python
libraries, we can save time and money by automating the process of scraping data from
PDF files and converting unstructured data into panel data.
Please keep in mind that when scraping data from PDF files, you should always carefully
read the terms and conditions posted by the author and make sure you have permission
to do so.
Stay tuned! I will talk about other tools for working with PDF data, such as PDFQuery and
PyPDF2, in future articles.