Introduction to Vaex in Python


In data science, handling large datasets is one of the most important concerns. Such volumes of data are a challenge in terms of both memory management and speed of execution.

Vaex is a Python library designed specifically to solve such problems. It provides lazy, out-of-core DataFrames with an interface similar to Pandas: calculations are performed only when necessary, and the data does not have to fit in memory. It offers good solutions for big data analysis, manipulation, and visualisation. In this article we will explore the idea of Vaex, its features, and how to use it in Python.

Vaex takes advantage of modern hardware, such as multi-core CPUs and SSDs, to enable quick, efficient calculations.

Why Do We Need Vaex?

Vaex combines lazy evaluation, virtual columns, memory mapping, built-in visualization, and an expression system that allows efficient computation with reduced memory usage. Together, these features let us work with very large datasets quickly and efficiently. Vaex can overcome several limitations found in other libraries, including pandas.

Getting Started With Vaex

We can install Vaex in two ways:

Using pip: pip install --upgrade vaex
Using conda: conda install -c conda-forge vaex

Once installed, you can import and use it as follows

import vaex

Reading Data Performance

Vaex can open huge tabular datasets far more quickly than pandas. Let's compare the two by loading the same dataset into both libraries. Here I will use the example dataset that Vaex provides. The difference between Vaex and Pandas is easier to observe on large datasets, so try a large one if you want clearer results. If you deal with substantial datasets in Python, Vaex might just be the library for you.

Performance of Vaex

We will load the dataset that Vaex provides by calling vaex.example(). The data is stored in HDF5 format, which Vaex memory-maps instead of reading fully into RAM.

Example

import vaex
# %time is an IPython/Jupyter magic that reports how long the statement takes
%time df_v = vaex.example()
print(df_v.head(5))
df_v.describe()

Output

CPU times: user 10.7 ms, sys: 0 ns, total: 10.7 ms
Wall time: 10.7 ms
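The short loading time comes from memory mapping: the file is mapped into the address space, and only the parts that are actually touched are read from disk. The following is a minimal sketch of that idea using only Python's standard mmap module. It is not how Vaex is implemented; the file layout (a flat array of float64 values) and the helper name read_value are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile
from array import array

N = 100_000
ITEM = struct.calcsize("d")  # size in bytes of one native float64

# Write N doubles (0.0, 1.0, 2.0, ...) to a temporary binary file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(array("d", (float(i) for i in range(N))).tobytes())
    path = f.name


def read_value(path, index):
    """Return one float64 from the file without loading the whole file."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the page containing this offset needs to be read from disk.
            (value,) = struct.unpack_from("d", mm, index * ITEM)
            return value


value = read_value(path, 54_321)
print(value)  # 54321.0
os.remove(path)
```

Opening the mapping costs about the same whether the file holds a thousand values or a billion, which is why opening a memory-mapped dataset is nearly independent of its size.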

Performance of Pandas

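We will now build a Pandas DataFrame from the same data and time it. Note that the arrays are taken from the Vaex DataFrame loaded above, so this measures DataFrame construction from in-memory arrays rather than reading a file from disk.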

Example

import pandas as pd
columns = df_v.get_column_names()
data = {}
for column in columns:
   data[column] = df_v[column].values
%time df_p = pd.DataFrame(data)
print(df_p.head(5))
df_p.describe()

Output

CPU times: user 4.17 ms, sys: 5.06 ms, total: 9.23 ms
Wall time: 13.7 ms

From the above results we can see that Vaex took slightly less wall time than Pandas for the same dataset (10.7 ms versus 13.7 ms). On a dataset of this size the gap is modest.

Example

print("Size =")
print(df_p.shape)
print(df_v.shape)

Output

Size =
(330000,11)
(330000,11)
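The timings in this article come from single %time runs, and differences of a few milliseconds can be within measurement noise. A more robust comparison repeats each measurement and keeps the best run, which is what the %timeit magic does in IPython. The sketch below shows the same idea with the standard timeit module; the helper best_of and the sample workload build_squares are hypothetical names used only for illustration.

```python
import timeit


def best_of(func, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeated runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def build_squares():
    # Stand-in workload; replace with the operation you want to measure.
    return [i * i for i in range(10_000)]


seconds = best_of(build_squares)
print(f"best of 5 runs: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during the measurement.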

Data Manipulation/Lazy Computation With Vaex

Vaex uses a technique called "lazy evaluation": it delays the evaluation of an operation until its result is needed. This saves computing power and helps manage memory efficiently. Vaex represents operations through its expression system, and these expressions are evaluated lazily, so computations are executed only when necessary. Creating an expression is therefore very fast, because no data is processed at that point. Let's test it with a single computation

Pandas DataFrame

Example

%time df_p['x'] + df_p['y']

Output

CPU times: user 2.15 ms, sys: 10 µs, total: 2.16 ms
Wall time: 1.51 ms

Vaex DataFrame

Example

%time df_v.x + df_v.y

Output

CPU times: user 280 µs, sys: 31 µs, total: 311 µs
Wall time: 318 µs
Expression = (x + y)
Length: 330,000 dtype: float32 (expression)
-------------------------------------------
     0  0.83494
     1  3.49052
     2   1.2058
     3  9.30084
     4  19.2119
   ...
329995  2.78315
329996  4.43943
329997  13.3985
329998  1.34032
329999  17.4648
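The Vaex timing above is small because adding two columns only builds an expression; the preview rows are computed when the result is displayed. The following plain-Python sketch illustrates the principle. It is a toy model, not Vaex's expression system, and the names Expr, column, and calls are made up for this example.

```python
class Expr:
    """A tiny lazy expression: it records an operation instead of running it."""

    def __init__(self, label, compute):
        self.label = label
        self._compute = compute  # zero-argument callable that produces values

    def __add__(self, other):
        # Build a new expression; no element-wise addition happens here.
        return Expr(
            f"({self.label} + {other.label})",
            lambda: [a + b for a, b in zip(self.evaluate(), other.evaluate())],
        )

    def evaluate(self):
        return self._compute()

    def __repr__(self):
        return f"Expression = {self.label}"


calls = []  # records which columns were actually read


def column(name, values):
    def load():
        calls.append(name)
        return list(values)
    return Expr(name, load)


x = column("x", [1.0, 2.0, 3.0])
y = column("y", [10.0, 20.0, 30.0])

total = x + y                   # only builds the expression
reads_before = list(calls)      # [] because nothing has been read yet
result = total.evaluate()       # the work happens here
print(total)                    # Expression = (x + y)
print(reads_before, result)     # [] [11.0, 22.0, 33.0]
```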

Statistics Performance

Vaex can also compute statistics such as the mean, standard deviation, and count. Let's compare how pandas and Vaex perform when computing statistics

Pandas DataFrame

Example

%time df_p["L"].mean()

Output

Wall time: 4.23 ms
920.81793

Vaex DataFrame

Example

%time df_v.mean(df_v.L)

Output

Wall time: 2.49 ms
array(920.81803276)

Data Filtering

In contrast to Pandas, Vaex does not copy data in memory when filtering, selecting, or cleaning data. Take data filtering as an example: since Vaex does no memory copying, it needs little RAM and executes quickly.

Pandas DataFrame

Example

%time df_p_filtered = df_p[df_p['x'] > 0]

Output

CPU times: user 13 ms, sys: 1.74 ms, total: 14.7 ms
Wall time: 19.7 ms

Vaex DataFrame

Example

%time df_v_filtered = df_v[df_v['x'] > 0]

Output

CPU times: user 1.23 ms, sys: 20 µs, total: 1.25 ms
Wall time: 1.27 ms
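The difference between the two filters can be pictured as a copy versus a view. The sketch below is a simplified plain-Python model under that assumption, not Vaex code; FilteredView is a hypothetical class that keeps a reference to the original data plus the filter condition and applies the condition only when the rows are requested.

```python
class FilteredView:
    """A filter kept as a rule over shared data instead of a filtered copy."""

    def __init__(self, source, predicate):
        self.source = source        # reference to the original list, not a copy
        self.predicate = predicate  # the filter condition, applied on demand

    def __iter__(self):
        return (v for v in self.source if self.predicate(v))


data = [-2.0, 1.5, -0.5, 3.0]

copied = [v for v in data if v > 0]         # eager: allocates a new list
view = FilteredView(data, lambda v: v > 0)  # lazy: allocates no new data

shares_storage = view.source is data
filtered = list(view)
print(shares_storage, filtered)  # True [1.5, 3.0]
```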

Vaex can also compute a statistic for several selections in a single pass over the data

Example

df_v.select(df_v.id < 15,name='less_than')
df_v.select(df_v.id >= 15,name='greater_than')
%time df_v.mean(df_v.id, selection=['less_than', 'greater_than'])

Output

CPU times: user 19.3 ms, sys: 0 ns, total: 19.3 ms
Wall time: 15.5 ms
array([ 7.00641717, 23.49799197])
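To see why one pass is enough, consider the following plain-Python sketch. It keeps one running sum and count per selection while traversing the data once. This only illustrates the idea; the function means_by_selection and the sample ids are assumptions for the example and do not reflect Vaex's internals or the dataset above.

```python
def means_by_selection(values, selections):
    """Compute the mean of `values` under several predicates in one pass."""
    sums = [0.0] * len(selections)
    counts = [0] * len(selections)
    for v in values:  # the data is traversed exactly once
        for i, keep in enumerate(selections):
            if keep(v):
                sums[i] += v
                counts[i] += 1
    # An empty selection has no mean, so report NaN for it.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


ids = list(range(32))  # made-up ids 0..31
means = means_by_selection(ids, [lambda v: v < 15, lambda v: v >= 15])
print(means)  # [7.0, 23.0]
```

When reading the data is the expensive part, as it is for datasets larger than memory, sharing one traversal between several statistics saves most of the cost.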

Virtual Columns in Vaex

When we want to create new columns in a DataFrame from expressions, virtual columns come into play. They behave like regular columns but occupy no memory; instead, they store only the expression that defines them. Vaex makes no distinction between virtual and regular columns: the expression system treats both in the same way.

Example

%time df_v['new_col'] = df_v['x']**2
print(df_v.head())
df_v.mean(df_v['new_col'])

Output

As you can observe, a new column was added to the table.

array(52.94398942)
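A virtual column can be thought of as a formula that is stored next to the real columns and applied row by row whenever a result is needed. The sketch below models this in plain Python. The Table class and its methods are hypothetical and much simpler than Vaex; they only show that a mean over a derived column can be computed without ever storing that column.

```python
class Table:
    """Toy table with stored columns and formula-based virtual columns."""

    def __init__(self, **columns):
        self.columns = {k: list(v) for k, v in columns.items()}  # stored data
        self.virtual = {}  # name -> function of a row dict; holds no data

    def add_virtual(self, name, formula):
        self.virtual[name] = formula

    def values(self, name):
        """Yield a column's values, computing virtual ones on the fly."""
        if name in self.columns:
            yield from self.columns[name]
            return
        names = list(self.columns)
        for row in zip(*self.columns.values()):
            yield self.virtual[name](dict(zip(names, row)))

    def mean(self, name):
        total, count = 0.0, 0
        for v in self.values(name):  # streams values; nothing is materialized
            total += v
            count += 1
        return total / count


t = Table(x=[1.0, 2.0, 3.0, 4.0])
t.add_virtual("new_col", lambda row: row["x"] ** 2)

new_col_mean = t.mean("new_col")  # (1 + 4 + 9 + 16) / 4
stored_names = list(t.columns)    # the virtual column is not stored
print(new_col_mean, stored_names)  # 7.5 ['x']
```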

Visualization

Vaex integrates seamlessly with popular visualization libraries like Matplotlib and Bokeh, giving users the power to create highly detailed and interactive visualizations with large datasets.

Vaex can create visualizations in one, two, and three dimensions, even when working with very large datasets.

We will attempt to create a one-dimensional graph

Example

%time df_v.viz.histogram(df_v.x, limits = [0, 10])

Output

[Figure: one-dimensional histogram of x over the range 0 to 10]

We will attempt to create a two-dimensional graph

Example

df_v.viz.heatmap(df_v.x,df_v.y+df_v.z,limits=[-3, 20])

Output

[Figure: two-dimensional heatmap of x against y + z, with limits -3 to 20]

In addition, we can visualize a statistic of an expression instead of plain counts. The statistic is passed through the what argument, using the following syntax

Syntax

what='<statistic>(<expression>)', for example what='count(*)'

We can also use arithmetic and NumPy functions in these calculations.

Harischandra Prasad

Updated on: 2023-10-16T12:41:18+05:30
