
Introduction to Vaex in Python
In data science, handling large datasets is an important concern. Such volumes of data are a challenge for both memory management and speed of execution.
Vaex is a Python library designed specifically to address these problems. It provides lazy, out-of-core DataFrames with a pandas-like interface: calculations are performed only when a result is needed, and the data does not have to fit in memory. It supports analysis, manipulation, and visualisation of big data. This article explores the idea behind Vaex, its features, and how to use it in Python.
Vaex takes advantage of modern hardware, such as multi-core CPUs and SSDs, to enable fast, efficient calculations.
Why do we Need Vaex?
Vaex combines lazy evaluation, virtual columns, memory mapping, built-in visualization, and an expression system that allows efficient computation with reduced memory usage. Together, these features let us work with very large datasets quickly. Vaex can overcome several limitations found in other libraries, including pandas.
Getting Started With Vaex
We can install Vaex in two ways:
Using pip: pip install --upgrade vaex
Using conda: conda install -c conda-forge vaex
Once installed, you can import and use it as follows:
import vaex
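Before running the examples, it can help to confirm that the packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of Vaex.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two libraries compared in this article.
for name in ("vaex", "pandas"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, install it with one of the commands shown above and run the check again.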
Data Reading Performance
Vaex can open huge tabular datasets far more quickly than pandas. Let's compare the two by loading a dataset of equal size into both libraries, using the example dataset that Vaex provides. To observe a clear difference between Vaex and pandas, try a much larger dataset. If you are dealing with substantial datasets in Python, Vaex might be the library for you.
Performance of Vaex
We will load the example dataset provided by Vaex with the vaex.example() command. The dataset is stored in HDF5 format and is memory-mapped.
Example
import vaex
%time df_v = vaex.example()
print(df_v.head(5))
df_v.describe()
Output
CPU times: user 10.7 ms, sys: 0 ns, total: 10.7 ms
Wall time: 10.7 ms
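To illustrate why opening a memory-mapped file is cheap, here is a minimal sketch using the standard library `mmap` module. It is not how Vaex reads HDF5 files; the file name `column.bin` and the flat float64 layout are assumptions made for illustration only.

```python
import mmap
import os
import struct
import tempfile
from array import array

N_VALUES = 200_000
INDEX = 123_456  # the single row we want to read

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "column.bin")

    # Write one column of float64 values (native byte order) to disk.
    with open(path, "wb") as f:
        array("d", (float(i) for i in range(N_VALUES))).tofile(f)
    file_size = os.path.getsize(path)

    # Map the file into memory. No values are read at this point;
    # the operating system loads pages only when they are accessed.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        (value,) = struct.unpack_from("=d", mapped, INDEX * 8)

print(f"file size: {file_size} bytes, value at row {INDEX}: {value}")
```

Only the page containing the requested row needs to be read from disk, which is why the cost of opening a mapped file does not grow with its size.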

Performance of Pandas
We will load the same dataset into pandas and compare the reading performance. Note that the pandas DataFrame is built from the arrays that Vaex has already loaded.
Example
import pandas as pd

columns = df_v.get_column_names()
data = {}
for column in columns:
    data[column] = df_v[column].values
%time df_p = pd.DataFrame(data)
print(df_p.head(5))
df_p.describe()
Output
CPU times: user 4.17 ms, sys: 5.06 ms, total: 9.23 ms
Wall time: 13.7 ms

From the above results, Vaex took less wall time than pandas for the same dataset in this run (10.7 ms versus 13.7 ms), although the difference is small for a dataset of this size.
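A single %time measurement can be noisy, so small differences like this one are hard to interpret. As a hedged sketch, repeated measurements with the standard library `timeit` module give a more stable figure. The helper `best_time` and the stand-in workload `build_columns` are hypothetical; replace the workload with the load call you want to measure.

```python
import timeit


def best_time(func, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeated runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def build_columns():
    # Stand-in workload: build two small in-memory columns.
    return {"x": list(range(10_000)), "y": list(range(10_000))}


seconds = best_time(build_columns)
print(f"best of 5 runs: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over several runs reduces the influence of background activity on the machine.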
Example
print("Size =")
print(df_p.shape)
print(df_v.shape)
Output
Size =
(330000, 11)
(330000, 11)
Data Manipulation/Lazy Computation With Vaex
Vaex uses a technique called lazy evaluation, which delays an operation until its result is needed. This saves computing power and helps manage memory efficiently. Vaex's expression system builds expressions that are evaluated lazily, so computations are executed only when necessary. Let's test this with a single computation:
Pandas DataFrame
Example
%time df_p['x'] + df_p['y']
Output
CPU times: user 2.15 ms, sys: 10 µs, total: 2.16 ms
Wall time: 1.51 ms
Vaex DataFrame
Example
%time df_v.x + df_v.y
Output
CPU times: user 280 µs, sys: 31 µs, total: 311 µs
Wall time: 318 µs
Expression = (x + y)
Length: 330,000 dtype: float32 (expression)
-------------------------------------------
     0  0.83494
     1  3.49052
     2   1.2058
     3  9.30084
     4  19.2119
    ...
329995  2.78315
329996  4.43943
329997  13.3985
329998  1.34032
329999  17.4648
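The Vaex result above is an expression rather than a computed array, which is why it returns so quickly. The idea can be sketched in plain Python as a small expression tree. This is an illustration of lazy evaluation in general, not Vaex's implementation; the class names `Expr`, `Column`, and `BinaryOp` are hypothetical.

```python
import operator


class Expr:
    """Base class: arithmetic builds a new node instead of computing values."""

    def __add__(self, other):
        return BinaryOp("+", operator.add, self, other)

    def __mul__(self, other):
        return BinaryOp("*", operator.mul, self, other)


class Column(Expr):
    """A leaf node that wraps concrete values."""

    def __init__(self, name, values):
        self.name = name
        self.values = values

    def evaluate(self):
        return list(self.values)

    def __repr__(self):
        return self.name


class BinaryOp(Expr):
    """An inner node that remembers an operation and its two operands."""

    evaluations = 0  # counts how many times any BinaryOp has been computed

    def __init__(self, symbol, func, left, right):
        self.symbol, self.func = symbol, func
        self.left, self.right = left, right

    def evaluate(self):
        BinaryOp.evaluations += 1
        pairs = zip(self.left.evaluate(), self.right.evaluate())
        return [self.func(a, b) for a, b in pairs]

    def __repr__(self):
        return f"({self.left!r} {self.symbol} {self.right!r})"


x = Column("x", [1.0, 2.0, 3.0])
y = Column("y", [10.0, 20.0, 30.0])

expr = x + y                                # builds the tree only
evaluations_before = BinaryOp.evaluations   # nothing has been computed yet
result = expr.evaluate()                    # the work happens here

print(expr, evaluations_before, result)
```

Defining `x + y` costs the same regardless of how many rows the columns hold; the per-row work is deferred until `evaluate()` is called.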
Statistics Performance
Vaex can also compute statistics such as the mean, standard deviation, and count. Let's compare how pandas and Vaex perform when computing statistics:
Pandas DataFrame
Example
%time df_p["L"].mean()
Output
Wall time: 4.23 ms
920.81793
Vaex DataFrame
Example
%time df_v.mean(df_v.L)
Output
Wall time: 2.49 ms
array(920.81803276)
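Statistics such as the mean can be computed without holding the whole dataset in memory, which is the idea behind out-of-core processing. The following is a minimal sketch in plain Python, not Vaex's implementation; the helper names `chunked` and `streaming_mean` are hypothetical.

```python
def chunked(values, chunk_size):
    """Yield successive slices so that only one chunk is processed at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def streaming_mean(chunks):
    """Accumulate a running sum and count across chunks, then divide once."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Stand-in data: a range is sliced lazily, so no full list is ever built.
data = range(1, 1_000_001)
mean_value = streaming_mean(chunked(data, chunk_size=100_000))
print(mean_value)
```

Only the running sum and count are kept between chunks, so memory use stays constant as the dataset grows.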
Data Filtering
Unlike pandas, Vaex does not copy memory when filtering, selecting, or cleaning data. Take data filtering as an example: because Vaex does not copy the data, it needs little RAM and the operation executes quickly.
Pandas DataFrame
Example
%time df_p_filtered = df_p[df_p['x'] > 0]
Output
CPU times: user 13 ms, sys: 1.74 ms, total: 14.7 ms
Wall time: 19.7 ms
Vaex DataFrame
Example
%time df_v_filtered = df_v[df_v['x'] > 0]
Output
CPU times: user 1.23 ms, sys: 20 µs, total: 1.25 ms
Wall time: 1.27 ms
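One way to filter without copying rows is to keep a reference to the original columns together with a boolean mask. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not a description of Vaex internals, and the class name `FilteredView` is hypothetical.

```python
class FilteredView:
    """Holds a reference to the source columns plus a mask; no rows are copied."""

    def __init__(self, columns, mask):
        self.columns = columns  # shared reference to the original data
        self.mask = mask        # one byte per row: 1 keeps the row, 0 drops it

    def __len__(self):
        return sum(self.mask)

    def column(self, name):
        # Rows are produced on demand by walking the column and mask together.
        return (value for value, keep in zip(self.columns[name], self.mask) if keep)


columns = {
    "x": [-2.0, 1.5, 0.0, 3.0, -0.5],
    "y": [10, 20, 30, 40, 50],
}

# Equivalent in spirit to filtering on x > 0.
mask = bytearray(value > 0 for value in columns["x"])
view = FilteredView(columns, mask)

print(len(view), list(view.column("y")))
```

The only new memory is the mask itself, which is one byte per row here, regardless of how many columns the table has.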
Vaex can also perform multiple computations in a single pass over the data:
Example
df_v.select(df_v.id < 15, name='less_than')
df_v.select(df_v.id >= 15, name='greater_than')
%time df_v.mean(df_v.id, selection=['less_than', 'greater_than'])
Output
CPU times: user 19.3 ms, sys: 0 ns, total: 19.3 ms
Wall time: 15.5 ms
array([ 7.00641717, 23.49799197])
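The benefit of a single pass is that the data is read once while several results are accumulated side by side. Here is a minimal plain-Python sketch of that pattern, using the same selection names as the example above; the function `means_by_selection` is hypothetical and is not part of Vaex.

```python
def means_by_selection(values, selections):
    """Compute the mean of `values` for each named predicate in one pass."""
    totals = {name: 0.0 for name in selections}
    counts = {name: 0 for name in selections}
    for value in values:  # the data is traversed exactly once
        for name, predicate in selections.items():
            if predicate(value):
                totals[name] += value
                counts[name] += 1
    return {
        name: totals[name] / counts[name] if counts[name] else float("nan")
        for name in selections
    }


ids = list(range(32))  # stand-in for an id column
selections = {
    "less_than": lambda v: v < 15,
    "greater_than": lambda v: v >= 15,
}
result = means_by_selection(ids, selections)
print(result)
```

Adding another selection adds one more predicate check per row, but not another read of the data.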
Virtual Columns in Vaex
Virtual columns let us add new columns to a DataFrame that are defined by an expression. They behave like regular columns but do not occupy memory, because only the expression is stored. Vaex makes no distinction between virtual and regular columns: its expression system treats both the same way.
Example
%time df_v['new_col'] = df_v['x']**2
print(df_v.head())
df_v.mean(df_v['new_col'])
Output

As you can see, a new column was added to the table.
array(52.94398942)
Visualization
Vaex integrates with popular visualization libraries such as Matplotlib and Bokeh, so users can create detailed, interactive visualizations of large datasets.
Beyond one- and two-dimensional plots, Vaex also supports three-dimensional visualizations, even for very large and complex datasets.
Let's create a one-dimensional plot, a histogram:
Example
%time df_v.viz.histogram(df_v.x, limits = [0, 10])
Output

Next, let's create a two-dimensional plot, a heatmap:
Example
df_v.viz.heatmap(df_v.x,df_v.y+df_v.z,limits=[-3, 20])
Output

In addition, we can pass a statistical expression to control what is visualized. The expression is supplied as the what argument, using the following syntax:
Syntax
what='<statistic>(<expression>)'
Output

We can also add arithmetic and numpy functions to these calculations.
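To make the idea of a plotted statistic concrete, the sketch below shows what a binned statistic computes: values are assigned to bins along an axis, and a count or mean is accumulated per bin. This is a plain-Python illustration, not Vaex's implementation; the function `binned_stats` and the sample data are hypothetical.

```python
def binned_stats(x_values, weights, limits, n_bins):
    """Return per-bin counts and per-bin means of `weights`, binned by `x_values`."""
    low, high = limits
    width = (high - low) / n_bins
    counts = [0] * n_bins
    totals = [0.0] * n_bins
    for x, w in zip(x_values, weights):
        if low <= x < high:  # values outside the limits are ignored
            index = min(int((x - low) / width), n_bins - 1)
            counts[index] += 1
            totals[index] += w
    means = [t / c if c else float("nan") for t, c in zip(totals, counts)]
    return counts, means


x = [0.5, 1.5, 1.7, 9.9, 12.0]
e = [2.0, 4.0, 6.0, 8.0, 100.0]
counts, means = binned_stats(x, e, limits=(0, 10), n_bins=5)
print(counts)
print(means)
```

A histogram plots the counts, while a statistic such as a mean plots the second list instead; empty bins have no defined mean and are reported as NaN.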