
80x faster | 600x more efficient

Accelerate Your Python Development Journey With PyKX
Author: Steve Wilcockson
Python: 80x faster | 600x more memory efficient
Python is an amazing language, and NumPy, SciPy and Pandas are outstanding
packages. But sometimes you need to run Python models and analytics faster with
more data, deploy them more efficiently, and move to production sooner.

PyKX can help; see the example below.
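The example itself appears as a screenshot in the original. As a minimal sketch of the idea, assuming PyKX is installed and importable, the embedded q engine can be driven directly from Python:

    import pykx as kx

    # Generate 100 million random floats inside the embedded q engine,
    # then view the result as a NumPy array (normally a zero-copy view).
    floats = kx.q('100000000?1f')
    arr = floats.np()
    print(arr[:5])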



In this whitepaper, we investigate several use cases and examples – from fast and big data running in research notebooks through to mission-critical models. We also explore streamlining the journey from data analytics research to production, referencing workflows ranging from simple aggregations to deep neural networks.

We examine use cases like those depicted above, where PyKX can enhance
Python’s efficiency by over 600x and speed by 80x. Additionally, we dive into
scenarios where the performance gains might be less significant in terms of
multiples but equally impactful.

Let’s start by exploring the very foundations of Python, NumPy and Pandas
alongside kdb from KX, which has a surprisingly similar background.

Python and kdb: Yin and Yang
Python has an interesting history. It was developed in the early ‘90s as the brainchild of Dutchman Guido van Rossum, subsequently dubbed “Benevolent Dictator for Life of the Python Programming Language”. Its name can be attributed to his love of the British comedy series Monty Python’s Flying Circus – hence the supporting IDEs Eric and IDLE along the way.

It was used originally as much for software testing and web scripting as for data
analysis, but its versatility and ease of use contributed to widespread adoption
among programmers, engineers and scientists. As a result, it has become the
lingua franca of data scientists for data analytics, machine learning and artificial
intelligence. Much of that success can be attributed to some of its key design
features:

Easy-to-read syntax: Python uses indentation to define code blocks, making it highly readable.
Dynamic typing: types are checked at runtime, allowing flexible and dynamic programming.
Interpreted nature: programs are executed by an interpreter, enabling quick development and testing.
Cross-platform compatibility: it's available on major operating systems like Windows, macOS, and Linux, ensuring portability of code.

Add to that an unsurpassed ecosystem of libraries powering modern data science and AI, supported by millions of experts across the world.

The q programming language in kdb has a similar vintage. It grew from the k programming language, based on APL, and shares similar design principles in terms of its dynamic typing, interpreted nature and cross-platform compatibility. While it may not match Python's readability, those well versed in it would argue that this is compensated by mathematical and data management simplicity, which comes from its columnar design, vector processing and efficient compression.

Kdb and q are optimal for time series and vector-native workloads in data science and AI. Their performance is proven in multiple independent benchmarks from the Securities Technology Analysis Center (STAC), and they are used widely by major banks and by innovators in automotive, healthcare, telecommunications, and manufacturing.

In the capital markets, specifically at hedge fund AQR Capital Management, Wes McKinney introduced Pandas (PANel Data & AnalysiS), which sits alongside NumPy (Numerical Python) and the SciPy (Scientific Python) stack. This brought foundational mathematical operations and data types to Python, focusing on usability and convenience rather than performance, and helping the development of data management and analytics applications not just in finance but across industries and academia. However, ease of use and convenience do not automatically mean production-level performance or the ability to service big and fast data use cases.

Whether your Pythonic data analysis use case is in fraud detection, capital
markets, predictive maintenance, healthcare, or retail analytics—regardless of
data set and model type—consider this: How capable are your Python apps really?
Can they take on more data, deliver absolute performance and efficiency, and
become production-worthy sooner?
PyKX: Combining Ease of Use, Data Management and Hyper-Efficient Analytics
By adding PyKX into Jupyter Notebooks, Python users can tap into the capabilities
of kdb to instantly get the best of all worlds. Python users can:

Query kdb-served data from Python via an easily implemented API, ideal for the
Python-using analyst and data scientist wanting clean, fast, big data.
Store, query, manipulate and use high-performance q objects within a Python
process.
Embed Python functionality, for example deep learning capabilities, within q
sessions (for the kdb/q developers).

In addition, PyKX provides a module interface that can easily load q scripts,
namespaces, and contexts. Once loaded, their production-worthy hyper-efficient
functions can be accessed as Python modules. Add to that support for ANSI SQL
querying and your data science becomes a lot more powerful.
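As a rough sketch of these interaction styles (the table, script and function names here are hypothetical, and the parameterized SQL call assumes PyKX's SQL interface is available):

    import pykx as kx

    # Hold a q table as a high-performance q object inside the Python process
    tab = kx.q('([] sym:`a`b`c; price:1.5 2.5 3.5)')

    # Query it with ANSI SQL; $1 is a positional parameter
    res = kx.q.sql('SELECT sym, price FROM $1 WHERE price > 2.0', tab)
    print(res.pd())

    # Load a q script and call its functions like Python modules
    # (analytics.q and .stats.vol are hypothetical names for illustration)
    # kx.q('\\l analytics.q')
    # vol = kx.q.stats.vol(tab['price'])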

With PyKX, then, Python developers can deploy their skills, re-use their libraries and
enjoy the high performance and resource efficiency of kdb.
Successful Users Agree
The benefits of combining the power of kdb with the familiarity of Python were
compellingly told by Erin Stanton, Data Scientist at Virtu Financial, at KXCON[23].
She explained how it enabled them to accelerate their machine learning adoption
by allowing Python users in research, model, and analytics functions to focus on
model selection and feature engineering rather than data management. She
further explained how their response times were accelerated, citing one 8-hour
SQL-based report executing in 5 minutes with Python supported by kdb. That was
attractive enough to provide the same valuable ML services directly to their
customers, distinguishing Virtu from its peers.

At the same event, a leading high frequency hedge fund discussed how PyKX
helped embed kdb functionality directly into its Python applications to deliver real-
time analytics and process over 1 trillion events per day. For the fund, the more
agile, convenient Python programming language could take on production
capabilities that previously might have been limited to C++.

That same combination of Python for data manipulation and kdb for number crunching was commended by Alex Donoghue of TD Securities, who explained how data engineers, quants and electronic traders could leverage their Python skills and libraries over kdb without learning the q language. However, he also commented that once customers were familiar with q, further performance and efficiency doors opened.
Using PyKX
PyKX is easy to install and use - just add PyKX to your Jupyter Notebook as shown
below. From there on, it's simply “Python as normal”.
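The installation step is shown as a screenshot in the original; a minimal equivalent is:

    # In a terminal, or prefixed with ! in a notebook cell:
    #   pip install pykx

    import pykx as kx
    print(kx.q('til 5'))   # evaluate q from Python: 0 1 2 3 4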

Well, not quite “as normal” - it’s all the familiarity of Python but powered by kdb –
for the best of both worlds. Actually, it’s the best of three worlds. Along with
Pythonic usability and readability, users receive fast data and high-performance
analytics to:
Test with more data.
Run faster code, including real-time microsecond applications.
Take code from research to production faster (kdb serves both production and
research environments).

The code below shows the time difference when generating 100 million floats
compared to normal NumPy. The change is clear - in q it took 232 milliseconds, in
NumPy it took 648.
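The code itself is shown as a screenshot; a sketch of the comparison, using IPython's %timeit in a notebook (timings will vary by machine; the 232 ms and 648 ms figures are from the original):

    import numpy as np
    import pykx as kx

    # 100 million random floats in q versus NumPy
    %timeit kx.q('100000000?1f')
    %timeit np.random.rand(100_000_000)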

With support for ANSI SQL, data engineers can access the benefits of kdb without
having to learn q as its querying language.
Once again, both easy and fast. That speed is illustrated below in running a SQL
query in PyKX (2.31 milliseconds) compared to Pandas (39.5 milliseconds).
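The query isn't reproduced here; a hypothetical comparison along the same lines (the table and query are illustrative, and the SQL call assumes PyKX's parameterized SQL interface):

    import pykx as kx

    # Illustrative table of one million rows
    trades = kx.q('([] sym:1000000?`AAPL`MSFT`GOOG; price:1000000?100f)')
    df = trades.pd()   # a Pandas copy for comparison

    # In a notebook, compare the two:
    # %timeit kx.q.sql('SELECT sym, AVG(price) FROM $1 GROUP BY sym', trades)
    # %timeit df.groupby('sym')['price'].mean()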

As memory space is shared between Python and q, data access is simplified and
highly efficient, normally zero copy. This is shown below in transferring one million q
floats to NumPy in what is effectively a constant time operation. Similarly, data
can transfer from NumPy to q in the same manner.
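A sketch of that round trip, assuming the standard PyKX conversion methods:

    import pykx as kx

    qvec = kx.q('1000000?1f')   # one million q floats
    arr = qvec.np()             # to NumPy; normally zero-copy, effectively constant time
    back = kx.toq(arr)          # and from NumPy back to q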

Developers can also interact with PyKX tables. As well as supporting ANSI SQL and
an API for qSQL, a Pandas API allows users to reference the metadata as they
would in Pandas, but also index into PyKX data with the same syntax. Moreover, it
tends to be faster, as illustrated below. The same query takes 158 milliseconds when
operating on a standard Pandas dataframe, but 46 milliseconds on a PyKX table.
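A small illustration of that Pandas-style interaction on a hypothetical PyKX table:

    import pykx as kx

    tab = kx.Table(data={'sym': ['a', 'b', 'c'], 'price': [1.5, 2.5, 3.5]})

    print(tab.dtypes)    # inspect metadata as in Pandas
    print(tab['price'])  # column access with the same syntax
    print(tab.head(2))   # familiar Pandas-style methods on a PyKX table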
Sample Use Cases
1. Transaction Cost Analysis

Transaction Cost Analysis (TCA) measures what actually occurred versus what was
expected, particularly in financial markets. However, similar examples can apply in
other contexts. For instance, how do my actual discounted grocery sales over the
day compare to my expectation, or is my machine in my manufacturing pod
processing the expected number of components per unit of time?

For those interested in the detail, TCA is normally both internally useful and a regulatory and compliance obligation, quantifying trade efficiency for financial institutions. Because transactions happen at high volumes and velocities in finance, kdb is commonly the production environment, while Python is often the algorithmic prototyping environment.

Consequently, bringing these two technologies closer together makes a lot of sense. Interoperability lowers the overheads of the production platform, making it more accessible and powerful for larger Python development and quant teams.

In our example use case, we first ingest datasets of quotes (9 million), market trades (900K) and broker trades (20K) into Pandas dataframes and PyKX tables. To quantify slippage, we then calculate the difference between execution price and a chosen benchmark, assessing the bid-ask spread, its volatility and market impact. For volatility specifically, an existing, highly efficient q function is invoked from within PyKX. See below.
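The whitepaper's volatility function isn't reproduced here; as a hypothetical stand-in, a q lambda computing a moving standard deviation (q's mdev) can be defined and called like a Python function:

    import pykx as kx

    # Hypothetical stand-in for the whitepaper's q volatility function
    vol = kx.q('{[w;p] w mdev p}')   # moving standard deviation over window w
    prices = kx.q('1000?100f')       # illustrative price series
    print(vol(20, prices).np()[:5])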

Native Python functions combine with SQL and kdb to derive the bid-ask spread. In
this case a moving average calculation is used.
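A sketch of that pattern on hypothetical quote data, using q's mavg for the moving average:

    import pykx as kx

    # Illustrative quotes where ask >= bid
    quotes = kx.q('{[n] b:n?100f; ([] bid:b; ask:b+n?1f)}', 100)
    # 10-point moving average of the bid-ask spread
    spread = kx.q('{10 mavg x[`ask]-x[`bid]}', quotes)
    print(spread.np()[:5])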

To quantify liquidity, a PyKX-enabled SQL query within the code calculates trading volumes over 10-minute intervals during the trading day.
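The SQL itself isn't shown here; an equivalent sketch in qSQL, bucketing a hypothetical trades table into 10-minute bars with xbar:

    import pykx as kx

    # Illustrative trades across a 6.5-hour trading day (times in milliseconds)
    trades = kx.q('([] time:09:30:00.000+1000?23400000; size:1000?100)')
    # Total volume per 10-minute bucket
    vols = kx.q('{select volume:sum size by bar:10 xbar time.minute from x}', trades)
    print(vols.pd())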

Results are then visualized in the notebook. The graphs on the left-hand side (below) show heightened volatility at the start and end of the day, and how, as volatility increases, so too does the spread. The right-hand side shows the corresponding increase in trading volume at market opening and close. For those familiar with the domain, that’s standard market behaviour, but important to understand.

Having run these calculations to understand the so-called slippage factor — the
difference between the executed price and expected price — and the times it may
become elevated, the 'asof' join function combines quotes and executions tables.
The merged table accurately informs the calculation and plotting of the slippage
factor over time and across venues.
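A minimal sketch of that asof join on hypothetical tables, using q's aj to match each execution with the prevailing quote at or before its timestamp:

    import pykx as kx

    quotes = kx.q('([] sym:`a`a`b; time:09:30:00.000 09:31:00.000 09:30:00.000; bid:1.0 1.1 2.0)')
    execs = kx.q('([] sym:`a`b; time:09:30:30.000 09:30:45.000; px:1.05 2.02)')
    merged = kx.q('{aj[`sym`time; x; y]}', execs, quotes)
    print(merged.pd())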

Venue 1 has higher slippage than the other two venues, a useful insight for the
trading desk and its management.

But most importantly, whether you care about slippage and trade execution or not,
the execution times and resource usage when processing the files and running the
calculations are transformed. The example below shows an 85x response
improvement with PyKX compared to Pandas (1150 versus 13.5 milliseconds).

Even more impressive, memory usage shows an almost 630x reduction.
Sample Use Cases
2. Anomaly Detection

Anomaly detection is key to many use cases, including fraud detection, criminal investigations, and cybersecurity. It’s particularly important for predictive maintenance in manufacturing, helping to identify unusual behavior that can signify maintenance requirements. With early intervention, downtime can be reduced or eliminated, and yields improved. Achieving this requires the ability to process streams of data from the machinery, normally sensors, and correlate those streams with expectations drawn from historical data using statistical and machine learning techniques. Once again, coupling Python with kdb to integrate the data and allow it to be interrogated makes sense. In the use case below, we deploy Python with the low/no-code kdb Insights Enterprise.

kdb Insights Enterprise provides an intuitive GUI to define, run and maintain data
pipelines from ingestion and transformation to storage and publication. The drag-
and-drop interface, shown below, outlines how data is captured, decoded and
transformed from an MQTT messaging service in JSON format and written to a kdb
database.

Once the data is prepared, it can be accessed by SQL for simple queries and analytics. In this instance, a query calculates 3-sigma upper and lower control limits, accounting for 99.7% of expected temperature fluctuations. Temperatures recorded outside that range could signify trouble and justify investigation. This is a prescriptive approach.
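The whitepaper runs this in kdb Insights Enterprise's SQL; an equivalent qSQL sketch over hypothetical sensor readings:

    import pykx as kx

    readings = kx.q('([] temp:1000?10f)')
    # 3-sigma control limits: mean plus/minus three standard deviations
    limits = kx.q('{select lower:avg[temp]-3*dev temp, upper:avg[temp]+3*dev temp from x}',
                  readings)
    print(limits.pd())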

Alternatively, or in addition, a deep neural network regression model could run a
data-driven approach to assess the likelihood of breakdowns, dynamically
accessing live data with trained model sets. The architecture is illustrated below.

With Python, functionality can be written to calculate time intervals between threshold breaches and moving averages, as below.
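The function shown in the original isn't reproduced; a hypothetical sketch of the two calculations, mixing Python for the breach intervals with q's mavg for the moving average:

    import numpy as np
    import pykx as kx

    def breach_intervals(times: np.ndarray, temps: np.ndarray, upper: float) -> np.ndarray:
        """Time gaps between successive readings that breach the upper limit."""
        return np.diff(times[temps > upper])

    moving_avg = kx.q('{20 mavg x}')   # 20-point moving average in q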
They can then be incorporated as a parallel process within the data pipeline:

From a machine learning model registry, an appropriate model can be selected:



This too can be added to the data pipeline and fed data from the newly defined
processes:

Operators can then compare the prediction of breaches in temperature levels that would warrant intervention against "real-world" data, as below, further fine-tuning the model to improve its accuracy for production deployment.

Conclusion
Whatever your use case, if large amounts of data or constant data streams are to be analyzed, whether in real-time streaming or frequent batch analyses, Python can draw upon the ultra-efficient kdb world to provide data or models. Whether you want to stay in Python, NumPy, and Pandas or explore the full performance of q, as TD Securities and others did in the use case examples, efficiencies can be gained. This allows larger data volumes and faster velocities to be processed, improving the capacity of analytics and helping take solutions to production faster.
Getting Started and More Information
The PyKX documentation site provides an overview of PyKX, a comprehensive user
guide, details on its APIs, and supporting examples of it in use.
A thorough reference card of key functionality in kdb's q language is also available to Python developers.

Two ways to get started


1. Click here for more introductory information on PyKX. To gain hands-on experience, you can easily install it from PyPI using a simple “pip install pykx” command - this helpful guide will kickstart your journey.

2. Visit the KX Academy page to access PyKX in a sandbox environment and learn how to store, query, manipulate and use kdb objects.

Either way, you are on the brink of being able to access the power of kdb from the
familiarity of Python. Enjoy!

