Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?
In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.
What is Parquet?
Traditional DBMSs are row-oriented: the data for each row – both on disk and in memory – is stored sequentially, with all its columns together. This structure is optimal for performing transactions. When inserting or updating a row, the engine doesn’t need to access several different locations to write all the various column values.
For data analytics, though, this is often less than ideal. Analytical queries typically read only a few columns, but across a great many rows. A columnar (column-oriented) database “flips the script” and stores the data for each column together, so retrieving a column’s value for every row can be done without reading any of the other columns. In a data warehousing environment, tables are often “flattened” to remove joins, which can add dozens of columns to the original table. For these extremely wide tables, a columnar database can be orders of magnitude faster: a query that needs only four columns from a 100-column table can skip the other 96 columns entirely.
Parquet is a hybrid format: it divides the data into “rowsets” (row groups, in Parquet’s own terminology) like a row-oriented format, but within each rowset, the data is stored in “column chunks” like a columnar database. Visually, this alternating format produces a tiled pattern somewhat resembling a wood parquet floor (which is in fact where Parquet takes its name).
Table Structures
If we imagine a table containing four columns per row (client, product, date, amount), the below image represents how each format would store the data:
[Image: side-by-side comparison of row-oriented storage and Parquet’s hybrid rowset/column-chunk layout for the four-column table]
This tiled format has advantages over a pure column layout. Each rowset holds metadata identifying the range of values for each column chunk. If a query’s parameters lie outside this range, the entire rowset can be skipped without being read. Rowsets can be any size you define, but for efficiency they tend to be very large – 100MB is generally considered a minimum for good performance, and 1GB is common.
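To make this concrete, here is a minimal sketch using PyArrow (a library we install later in this article) that prints the min/max statistics stored for the first column of each row group in a hypothetical existing Parquet file:

import pyarrow.parquet as pq

# Hypothetical single Parquet file; the path is just an example.
pf = pq.ParquetFile('C:\\Users\\myuser\\sales_2024.parquet')

# Each row group ("rowset") carries per-column statistics that query
# engines use to skip row groups whose value ranges can't match a filter.
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(0).statistics
    if stats is not None:
        print(f'Row group {rg}: min={stats.min}, max={stats.max}, rows={pf.metadata.row_group(rg).num_rows}')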
Parquet data can be compressed to save storage and increase throughput. While some row-based DBMSs also compress data, the columnar layout means that values stored near each other tend to be similar, which allows a better compression ratio. Parquet also supports dictionary encoding, where low-cardinality columns are stored as keys into a lookup table. For long text columns like city names, this can save dramatic amounts of space. With encoding and compression together, it’s not uncommon for a Parquet table to be a quarter the size (or smaller) of the same table in a traditional DBMS.
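As a rough illustration of how these options are exposed in practice, here is a minimal sketch (using Pandas and PyArrow, which we install later in this article, and made-up data) that writes a low-cardinality text column with dictionary encoding and a stronger compression codec:

import pandas as pd

# Made-up data: a low-cardinality text column alongside a numeric one.
df = pd.DataFrame({
    'city': ['Springfield', 'Shelbyville', 'Capital City'] * 10000,
    'amount': range(30000),
})

# 'snappy' is the usual default codec; 'zstd' trades a little CPU for a
# smaller file. use_dictionary=True asks PyArrow to dictionary-encode
# columns, which pays off for repetitive values like city names.
df.to_parquet('C:\\Users\\myuser\\cities.parquet',
              compression='zstd',
              use_dictionary=True)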
Partitioning
Another useful feature of Parquet-based tables is partitioning, which separates a table into multiple independently readable files. Partitioning is done by selecting one or more columns to separate by. For instance, if a table contains 100 different client IDs, partitioning by this column will divide the table into 100 files, each containing just one client ID.
Since within each partition all values for the partition column are identical, there’s no need to repeatedly store that column itself. This saves storage and increases performance. Queries that filter by the partition column can skip reading unnecessary partitions entirely, further boosting performance.
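To show what that pruning looks like from the reader’s side, here is a small sketch (again using PyArrow, with a hypothetical dataset partitioned by client_id) that touches only the partition folders it needs:

import pyarrow.parquet as pq

# Hypothetical dataset root, partitioned by client_id.
# The filter is pushed down to the partition level, so folders for
# other client IDs are never opened.
table = pq.read_table('C:\\Users\\myuser\\clients',
                      filters=[('client_id', '=', 42)])
print(f'Rows for client 42: {table.num_rows}')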
A few other details
Parquet also supports a number of advanced features like schema merging and schema evolution. These are beyond the scope of this article but are well worth investigating for anyone considering a Parquet repository. For some, Parquet’s biggest selling point is that it is open source; using it doesn’t tie one to any specific vendor or product. Parquet is integral to Apache’s Hadoop platform and supported by all the big-name cloud vendors: AWS, Azure, and Google Cloud. We should also reiterate for clarity that Parquet is simply a data storage format, not itself an engine for manipulating that data. Other programs will be used to write and query the data.
Incrementally updating data in a Parquet file
A typical business scenario involves new production data being migrated on a regular basis (i.e., nightly) to a separate data lake or warehouse. Usually, these repositories have a longer data retention policy: live servers may “age out” records in as little as a few months’ time, whereas warehouses often hold many years’ worth of data. So, in general, the problem involves taking a small subset of a large table and appending it to another, much larger table. How can we do this for Parquet?
There are tools and libraries that allow you to export SQL tables – or the results of a SQL query – to Parquet. But few allow appending to an existing Parquet file. For smaller tables, we could perform a “full refresh”, exporting the entire table each time. But for larger tables, this is very inefficient. And if the target environment holds data no longer on the live server, it’s not even possible. You could use a product that supports appending data to existing Parquet files, but these are scarce and they perform poorly in this mode.
The Problem Restated – and a Solution
Remember that Parquet is a format, not a specific software product. You can manipulate Parquet data with a wide variety of engines and libraries – but all these tools either disallow directly appending to an existing Parquet file, or do so very inefficiently, often dozens or even hundreds of times slower than the original write. Parquet is optimized for reading, not updating. Note: this is a problem for all columnar data formats, not Parquet in particular.
For a possible solution, we turn to Parquet’s partitioning feature. This converts a single Parquet file into a directory structure of sub-files, each containing a portion of the full data. This change is transparent to the end user: all these partitions appear together as a single “file” of data. If we write a new partition file into this directory structure, it’s automatically linked to the existing data. Since other partitions aren’t being disturbed, this is much faster and more efficient than directly appending to an existing file. We can use this feature to implement incremental feeds. If you’re migrating a day’s worth of new data nightly, you can partition the tables being exported by entry date. With this approach, each nightly feed creates a new partition in a separate file so, even though the Parquet “table” is technically being updated, no individual file ever needs to be rewritten. (This assumes the tables have a date or timestamp column that identifies when the rows were added).
Let’s test out this approach by creating a simple table of sales data, with a column “sdate” that records the date of the sale. The example will assume our source DBMS is PostgreSQL, though obviously with minor changes any SQL database can be supported.
CREATE TABLE sales (
    sales_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id INT NOT NULL,
    salesperson_id INT NOT NULL,
    product_id INT NOT NULL,
    sdate DATE NOT NULL,
    random_text VARCHAR(32) NOT NULL
);
A script to populate this table with random data is in Appendix I at the conclusion of this article.
Now: a Python script to migrate this to Parquet. The script uses a few libraries that need to be installed first: Pandas and PyArrow for writing the Parquet data, and Psycopg for querying PostgreSQL. If Python is properly installed, you can add these from a command line.
C:> pip install psycopg
C:> pip install pandas
C:> pip install pyarrow
In this section, we’ll build the Python script that exports the SQL data, laying it out in three portions before presenting the full script. Instead of PyArrow, we could use fastparquet, a well-known alternative; however, PyArrow is the engine Pandas prefers by default, and it’s the one we’ll use here.
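If you ever want to force that choice rather than rely on Pandas’ default engine resolution, to_parquet() accepts an engine parameter. A minimal sketch with a throwaway DataFrame:

import pandas as pd

# Throwaway one-row DataFrame, just to demonstrate the parameter.
df = pd.DataFrame({'sales_id': [1], 'sdate': ['2024-01-01']})

# engine='auto' (the default) tries PyArrow first; naming it explicitly
# makes the dependency obvious and fails fast if PyArrow isn't installed.
df.to_parquet('C:\\Users\\myuser\\demo.parquet', engine='pyarrow')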
First, the declarations:
import pandas as pd
import psycopg as py
You’ll notice we didn’t need to import PyArrow ourselves; Pandas uses it behind the scenes. Next, our script uses Psycopg to connect to a PostgreSQL database and retrieve the results of our SQL query:
# open database connection (insert your own values here)
conn = py.connect('dbname=postgres user=postgres password=mypass port=5432')
cr = conn.cursor()

# retrieve data from SQL query
cr.execute("SELECT * FROM sales WHERE sdate = '01-01-2024'")
tuples = cr.fetchall()
Notice our WHERE clause limits the rows received to one day (the first day of 2024). As part of a nightly export process, the WHERE clause would instead select the prior day’s data.
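In a scheduled job, you would typically compute that date at run time rather than hard-code it. A small sketch of that variation (the connection values are placeholders, as before), using a parameterized query so the date is passed safely:

from datetime import date, timedelta

import psycopg as py

# placeholder connection values, as in the main script
conn = py.connect('dbname=postgres user=postgres password=mypass port=5432')
cr = conn.cursor()

# select the prior day's rows; psycopg substitutes the value for %s
yesterday = date.today() - timedelta(days=1)
cr.execute("SELECT * FROM sales WHERE sdate = %s", (yesterday,))
tuples = cr.fetchall()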
Finally, the script retrieves the column names from the query, and uses this schema information and the data to create a Parquet file:
# retrieve column headers to pass to Pandas
column_hdrs = [d[0] for d in cr.description]
df = pd.DataFrame(tuples, columns=column_hdrs)

# write retrieved data to Parquet. Specified path must exist.
df.to_parquet('C:\\Users\\myuser\\sales', partition_cols='sdate', existing_data_behavior='delete_matching')
Before running this script, replace the root directory with an appropriate location on your drive. The full script appears like this:
# declarations
import pandas as pd
import psycopg as py

# open database connection (insert your own values here)
conn = py.connect('dbname=postgres user=postgres password=mypass port=5432')
cr = conn.cursor()

# retrieve data from SQL query
cr.execute("SELECT * FROM sales WHERE sdate = '01-01-2024'")
tuples = cr.fetchall()

# retrieve column headers to pass to Pandas
column_hdrs = [d[0] for d in cr.description]
df = pd.DataFrame(tuples, columns=column_hdrs)

# write retrieved data to Parquet. Specified path must exist.
df.to_parquet('C:\\Users\\myuser\\sales', partition_cols='sdate', existing_data_behavior='delete_matching')
Run this script a few times, each time altering the WHERE clause to the next day’s date, and you’ll see a directory structure that resembles this:
[Image: the sales output directory, with one sdate=… partition folder per exported day, each containing a *.parquet file]
Each run creates a new folder and a *.parquet file within it (if the number of rows for a single day’s data exceeds the size of a single Parquet rowset, multiple files will be created within each folder). Note that different Parquet libraries – and even different versions of the same library – may name these partition files differently. This isn’t a concern: the name of the Parquet “file” you’ll access the data by is now the directory root path, and doesn’t depend on these sub-files.
Also note that if you fail to update the WHERE clause date, the script will simply rewrite the same day’s data again.
If you want to verify that the new partitions are being added to the Parquet ‘file’, you can use the following script to open the table and return a count of total rows. You will see the row count rise each time you load a new day’s worth of data:
import pyarrow.parquet as pq
df = pq.read_table(source="C:\\Users\\myuser\\sales").to_pandas()
print(f'Row Count: {len(df.index)}')
You can also verify that existing partitions aren’t being rewritten by examining the timestamps on the individual files within this directory structure. Each new run of the load script generates a new file for the specified date, but leaves the others untouched.
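If you’d rather not check this by hand in a file explorer, a small sketch along these lines (assuming the same output directory as above) prints the last-modified time of every file in the dataset:

from datetime import datetime
from pathlib import Path

# Walk the partitioned dataset and print each file's last-modified time.
# After a nightly run, only the newly written partition should change.
root = Path('C:\\Users\\myuser\\sales')
for f in sorted(root.rglob('*.parquet')):
    mtime = datetime.fromtimestamp(f.stat().st_mtime)
    print(f'{f.relative_to(root)}  modified {mtime:%Y-%m-%d %H:%M:%S}')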
Working with Non-Timestamped Data
If the table you are trying to export doesn’t contain an appropriate date or timestamp to identify row age, you can achieve the same effect if you can identify each day’s worth of data by some other means: simply create a date pseudo-column in the SQL result set.
For example, if our sales table lacked a column like sdate that we could partition by easily, we could write the following query:
SELECT 'yesterday'::date AS sdate, *
FROM sales
WHERE (criteria for yesterday's data)
If each new day’s data is at least 100MB, this approach works well. But for tables that ingest less data daily, the partitions can be too small to perform well. We’re also creating a new partition each day. This usually isn’t a problem if the table holds ten years or less of data, but for tables spanning decades, the partition count can grow overly large, and impact performance.
There’s also the problem of delayed inserts. We write a single day’s data only once. If production data occasionally arrives late from remote sources (IoT devices, field personnel tablets, etc.), it can miss the migration window. Or, in the case of our sales table, each row might first be stored in an intermediate table before being validated and moved to the table being exported.
Instead of partitioning by individual days, partitioning by month can solve these problems. For most tables, a month of data is much closer to the “Goldilocks” range: not too large, and not too small. A monthly partition also reduces our partition count by a factor of 30. And if we rewrite our data throughout the month, we also solve the delayed entry problem.
Our Python script needs a few changes to effect this. First, we alter the SELECT statement to retrieve a full month’s data and to create pseudo-columns holding the sale date’s year and month. Second, we change the ‘partition_cols’ parameter in the call to to_parquet() to partition by year, then month. Lastly, we keep the ‘existing_data_behavior’ parameter set to ‘delete_matching’, so that when data already exists within a partition, the export deletes (and replaces) it.
The full script becomes:
import pandas as pd
import psycopg as py

# open database connection to fetch relational data
conn = py.connect('(connection string here)')
cr = conn.cursor()

# retrieve a full month's data, with year and month pseudo-columns
cr.execute("SELECT EXTRACT(year FROM sdate) AS yr, EXTRACT(month FROM sdate) AS mon, * "
           "FROM sales WHERE sdate BETWEEN '01-01-2024' AND '01-31-2024'")
tuples = cr.fetchall()

# retrieve column headers to pass to Pandas
column_hdrs = [d[0] for d in cr.description]
df = pd.DataFrame(tuples, columns=column_hdrs)

# write retrieved data to Parquet, replacing any existing data in the partition
df.to_parquet('C:\\Users\\myuser\\sales.parquet', partition_cols=['yr', 'mon'], existing_data_behavior='delete_matching')
Each day of the month (and the first day of the following month) the partition for that month would be rewritten to include that day’s data. The directory structure from this two-column partitioning now has an additional level. Over time, it’ll grow to look like this:
[Image: the output directory after several months, with yr=… folders each containing mon=… subfolders that hold the *.parquet files]
The “year” and “month” values become columns in the result set of the Parquet data: any tool that reads the table will pick these up. But they’re not actually stored in the data files, so they don’t consume space.
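A quick way to confirm this (a sketch that assumes the monthly script above has been run at least once) is to read the dataset back and look for the partition keys among the columns:

import pandas as pd

# Read the partitioned dataset back; yr and mon reappear as columns even
# though their values live only in the directory names, not the data files.
df = pd.read_parquet('C:\\Users\\myuser\\sales.parquet')
print(df[['yr', 'mon', 'sdate', 'sales_id']].head())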
By rewriting data through the month, we also solve the problem of delayed inserts. In the script’s current form, it will export late-entry rows for all but the final day of the month. But if we modify it slightly so that on the first two days of each month, the prior and current month’s data are both selected, we close this gap. The cost of doing this is minor – at most, 32 days’ data will be written at any time. This “backwards-looking” technique can be extended further, if rows are liable to be delayed longer.
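The scheduling logic for that backwards-looking window can be sketched with a little date arithmetic (this is an illustration of the idea, not part of the export script itself):

from datetime import date, timedelta

# Export the current month so far; during the first two days of a month,
# step the start date back so the prior month is re-exported as well.
today = date.today()
start = today.replace(day=1)
if today.day <= 2:
    start = (start - timedelta(days=1)).replace(day=1)   # first day of prior month

# start and today would then feed the BETWEEN clause, e.g. via a
# parameterized query: "... WHERE sdate BETWEEN %s AND %s", (start, today)
print(f'Exporting rows with sdate between {start} and {today}')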
Conclusion
We’ve outlined a basic strategy for incrementally exporting SQL data to Parquet. While our sample script was specific to PostgreSQL, it can easily be adapted to any other engine. In Part II of this series, we’ll benchmark Parquet performance in various data analytics scenarios.
Appendix I: Sample Data Script
--
-- Create sample table
--
CREATE TABLE sales (
    sales_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id INT NOT NULL,
    salesperson_id INT NOT NULL,
    product_id INT NOT NULL,
    sdate DATE NOT NULL,
    random_text VARCHAR(32) NOT NULL
);

--
-- Add 180 days of random data (100,000 rows total)
--
INSERT INTO sales (customer_id, salesperson_id, product_id, sdate, random_text)
SELECT ROUND(RANDOM() * 99999),
       ROUND(RANDOM() * 9999),
       ROUND(RANDOM() * 1000),
       '01/01/2024'::date + '1 DAY'::INTERVAL * ROUND(RANDOM() * 180),
       MD5(RANDOM()::text)
FROM generate_series(1, 100000);