DATA SCIENCE E-BOOK
PYTHON
PACKAGES
TO LEARN
DATA
SCIENCE
The packages recommended
for your learning
V.2.0
WRITTEN BY CORNELLIUS YUDHA WIJAYA
PYTHON PACKAGES TO
LEARN DATA SCIENCE
About
Please feel free to share this PDF with anyone for free.
The latest version of this book can be downloaded from:
https://cornelliusyudhawijay.gumroad.com/follow
This work is licensed under the Creative Commons Attribution 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/ or send a letter to
Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
TABLE OF
CONTENTS
I. Introduction
II. Exploratory Data Analysis
III. Statistics
IV. Mathematics
V. Big Data Processing
VI. Machine Learning
VII. Interpretability
VIII. Time Series
IX. NLP
X. Recommendation System
XI. Audio Project
XII. Outlier Detection
XIII. Machine Learning Validation
XIV. Synthetic Data
XV. Closing Remarks
I. Introduction
Why was this e-book created?
Data science is a broad subject, which is why learning
all of its concepts is not easy. Many people have asked
me how to properly learn data science and machine
learning concepts. For me, the answer is hands-on
learning with the current technology, and that is
Python programming. To help you learn data science
effectively, I want to introduce various Python
packages that support your data science learning.
E X P L O R A T O R Y
D A T A A N A L Y S I S
YData Profiling
ydata-profiling is a Python package that generates a data
analysis report with as few lines of code as possible. It offers a
nice, quick report of the dataset. This module works best in the
Jupyter environment.
pip install ydata-profiling[notebook]
#Enable the widget extension in Jupyter
jupyter nbextension enable --py widgetsnbextension
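For a quick feel of the API, here is a minimal sketch of the typical workflow (the CSV path is a placeholder of mine):

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("your_data.csv")  # placeholder dataset
profile = ProfileReport(df, title="EDA Report")
profile.to_notebook_iframe()       # render inside Jupyter
profile.to_file("report.html")     # or save as a standalone HTML file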
Sweetviz
Sweetviz is another open-source Python package for generating
a beautiful EDA report with a single line of code. The difference
from ydata-profiling is that the output is a fully self-contained
HTML application.
#Installing the sweetviz package via pip
pip install sweetviz
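A minimal usage sketch (the CSV path is a placeholder):

import pandas as pd
import sweetviz as sv

df = pd.read_csv("your_data.csv")  # placeholder dataset
report = sv.analyze(df)
report.show_html("sweetviz_report.html")  # self-contained HTML application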
PandasGUI
PandasGUI is different from the previous packages explained
above. Instead of generating a report, PandasGUI launches a GUI
(Graphical User Interface) we can use to analyze our Pandas
DataFrame in more detail.
#Installing via pip
pip install pandasgui
#or if you prefer directly from the source
pip install git+https://github.com/adamerose/pandasgui.git
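A minimal sketch of launching the GUI (the CSV path is a placeholder):

import pandas as pd
from pandasgui import show

df = pd.read_csv("your_data.csv")  # placeholder dataset
show(df)  # opens an interactive window to explore the DataFrame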
Missingno
Data exploration is not limited to the data present in the dataset; it
also includes the data missing from your dataset. Sometimes data is
missing by accident or pure chance, but often it is not. Missing data
might uncover insights we never knew before.
Introducing missingno, a package developed specifically to visualize
your missing data. It provides easy-to-use, insightful one-liners to
interpret the missing data and to show the missingness relationships
between features.
pip install missingno
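A minimal sketch of those one-liners (the CSV path is a placeholder):

import pandas as pd
import missingno as msno

df = pd.read_csv("your_data.csv")  # placeholder dataset
msno.matrix(df)   # nullity matrix: where the missing values sit
msno.bar(df)      # missing-value counts per column
msno.heatmap(df)  # nullity correlation between features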
AutoViz
AutoViz is an open-source visualization package under the AutoViML
library, which is designed to automate much of a data scientist's work.
Many of its projects are quick and straightforward but undoubtedly
helpful, including AutoViz.
AutoViz is a one-liner visualization package that automatically
produces data visualizations.
pip install autoviz
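A minimal sketch, assuming a CSV file of your own:

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
# Pass a CSV path (or a DataFrame via the dfte argument)
dft = AV.AutoViz("your_data.csv")  # placeholder dataset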
DataPrep
Data preparation is the initial step that any data professional takes.
Whether you want to analyze the data or preprocess it for a machine
learning model, you need to prepare it. Preparing data means collecting,
cleaning, and exploring the data. To do all these activities, there is a
Python package called DataPrep.
DataPrep is a Python package developed to prepare your data. The
package contains three main APIs for us to use:
Data Exploration (dataprep.eda)
Data Cleaning (dataprep.clean)
Data Collection (dataprep.connector)
DataPrep is designed for fast data exploration and works well with
Pandas and Dask DataFrame objects.
pip install -U dataprep
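A minimal sketch of the EDA API (the CSV path is a placeholder):

import pandas as pd
from dataprep.eda import create_report, plot

df = pd.read_csv("your_data.csv")  # placeholder dataset
plot(df)                           # interactive exploration of every column
create_report(df).show_browser()   # full EDA report in one call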
S T A T I S T I C S
Scipy.Stats
SciPy (pronounced "Sigh Pie") is an open-source computing tool for
performing scientific work in the Python environment. SciPy itself
is a collection of numerical algorithms and domain-specific
toolboxes used in many mathematical, engineering, and data
research settings.
One of the APIs available within SciPy is the statistical API called
Stats. According to the SciPy homepage, Scipy.Stats is a module that
contains a large number of probability distributions and a growing
library of statistical functions, especially for the study of
probability.
python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
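For example, a two-sample t-test and a distribution probability only take a few lines:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.5, scale=1.0, size=100)

t_stat, p_value = stats.ttest_ind(a, b)  # two-sample t-test
print(t_stat, p_value)
print(stats.norm.cdf(1.96))              # probability from a normal distribution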
Pingouin
Pingouin is an open-source statistical package that is mainly used for
statistical analysis. The package gives you many classes and functions to
learn basic statistics and hypothesis testing. According to the developer,
Pingouin is designed for users who want simple yet exhaustive stats
functions.
Pingouin is simple but exhaustive because the package gives you more
explanation of the data. Scipy.Stats returns only the t-value and the
p-value, when sometimes we want more explanation of the data.
In the Pingouin package, the calculation goes a few steps further. For
example, instead of returning only the t-value and p-value, the t-test from
Pingouin also returns the degrees of freedom, the effect size (Cohen's d),
the 95% confidence interval of the difference in means, the statistical
power, and the Bayes Factor (BF10) of the test.
pip install pingouin
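A minimal sketch of that richer t-test output:

import numpy as np
import pingouin as pg

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100)
y = rng.normal(0.5, 1.0, 100)

# One DataFrame with T, dof, p-value, Cohen's d, CI95%, power, and BF10
print(pg.ttest(x, y))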
Statsmodels
Statsmodels is a statistical modeling Python package that provides
many classes and functions for statistical estimation.
Statsmodels used to be part of SciPy, but it is now developed
separately.
What is the difference between Scipy.Stats and statsmodels? The
Scipy.Stats module focuses on statistical theory, such as
probability functions and distributions, while statsmodels
focuses on statistical estimation based on the data.
pip install statsmodels
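A minimal estimation sketch with synthetic data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()  # ordinary least squares
print(model.summary())  # coefficients, p-values, R-squared, and more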
Lifelines
The lifelines package in Python is a specialized library used for
survival analysis, a set of statistical approaches for analyzing the
expected duration of time until one or more events happen. This
type of analysis is commonly used in fields like biology,
engineering, and economics, especially for analyzing the time
until events like death, failure, or churn occur.
Lifelines' features include:
easy installation
internal plotting methods
simple and intuitive API
handles right, left and interval censored data
contains the most popular parametric, semi-parametric and
non-parametric models
pip install lifelines
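A minimal sketch using a dataset bundled with lifelines:

from lifelines import KaplanMeierFitter
from lifelines.datasets import load_waltons

df = load_waltons()  # example data with durations (T) and events (E)
kmf = KaplanMeierFitter()
kmf.fit(df["T"], event_observed=df["E"])
kmf.plot_survival_function()  # one of the internal plotting methods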
Scikit-posthocs
scikit-posthocs is a Python package that provides post hoc tests
for pairwise multiple comparisons. These tests are usually
performed in statistical data analysis to assess the differences
between group levels once a statistically significant ANOVA result
has been obtained.
scikit-posthocs aims to improve Python's statistical capabilities
by offering many parametric and nonparametric post hoc tests,
along with outlier detection and basic plotting methods.
pip install scikit-posthocs
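A minimal sketch of Dunn's post hoc test on toy long-format data (the column names are my own):

import pandas as pd
import scikit_posthocs as sp

df = pd.DataFrame({
    "value": [4, 5, 6, 10, 12, 11, 1, 2, 3],
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
})
# Pairwise p-values after a significant ANOVA/Kruskal-Wallis result
print(sp.posthoc_dunn(df, val_col="value", group_col="group"))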
linearmodels
linearmodels is a Python package that extends statsmodels with panel
regression, instrumental variable estimators, system estimators, and
models for estimating asset prices. The package's main features include:
Panel Data Models: It supports various models for panel data analysis,
such as panel regression with fixed effects, random effects, and
between estimators. It also includes support for first-difference models
and pooled models.
Instrumental Variable Estimators: It provides tools for two-stage least
squares (2SLS) instrumental variables regression, which is useful
when dealing with endogenous regressors.
Asset Pricing Models: The package includes models specifically
designed for asset pricing, such as the factor models used in finance.
System Estimators: These are used for estimating simultaneous
equations models, which are common in econometrics.
pip install linearmodels
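A minimal panel-regression sketch, assuming a DataFrame of your own indexed by entity and time (the file, index, and column names are hypothetical):

import pandas as pd
from linearmodels.panel import PanelOLS

# PanelOLS expects a two-level (entity, time) MultiIndex
df = pd.read_csv("panel_data.csv").set_index(["firm", "year"])
results = PanelOLS.from_formula("y ~ 1 + x + EntityEffects", data=df).fit()
print(results)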
ArviZ
ArviZ is a Python package for exploratory analysis of Bayesian
models. It serves as a backend-agnostic tool for diagnosing and
visualizing Bayesian inference.
pip install arviz
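A minimal sketch using a posterior sample bundled with ArviZ:

import arviz as az

idata = az.load_arviz_data("centered_eight")  # example InferenceData object
az.plot_trace(idata)      # trace plots for sampling diagnostics
print(az.summary(idata))  # effective sample size, R-hat, and more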
M A T H E M A T I C S
Numeric and
Mathematical Modules
Well, technically, what I want to outline at this point is not one
single package; it consists of several intertwined modules called
the Numeric and Mathematical Modules.
The modules are documented on the Python homepage, which
provides a complete explanation of each package.
Taken from the Python documentation, the packages listed in
these modules are:
numbers — Numeric abstract base classes
math — Mathematical functions
cmath — Mathematical functions for complex numbers
decimal — Decimal fixed-point and floating-point arithmetic
fractions — Rational numbers
random — Generate pseudo-random numbers
statistics — Mathematical statistics functions
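A small taste of what these standard-library modules offer:

import math
import random
import statistics
from decimal import Decimal
from fractions import Fraction

print(math.sqrt(2))                     # 1.4142135623730951
print(Fraction(1, 3) + Fraction(1, 6))  # 1/2, exact rational arithmetic
print(Decimal("0.1") + Decimal("0.2"))  # 0.3, no binary floating-point error
print(statistics.mean([1, 2, 3, 4]))    # 2.5
print(random.randint(1, 6))             # pseudo-random die roll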
SymPy
What is SymPy? It is a Python library for symbolic mathematics.
Well, what is symbolic computation? The tutorial in the SymPy
documentation explains that symbolic computation deals with
mathematical objects symbolically. In simpler terms, symbolic
mathematics represents mathematical objects exactly rather than
approximately. If a mathematical expression contains unevaluated
variables, they are left in symbolic form.
pip install sympy
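For example, differentiating and integrating an expression exactly:

import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x)
print(sp.diff(expr, x))               # exp(x)*sin(x) + exp(x)*cos(x)
print(sp.integrate(expr, x))          # exp(x)*sin(x)/2 - exp(x)*cos(x)/2
print(sp.limit(sp.sin(x) / x, x, 0))  # 1, computed symbolically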
Sage
Sage is open-source mathematics software that runs on top of the
Python programming language. Technically, Sage is not a Python
package but standalone software. Its usage is simple if you already
know Python, so you will not find the software hard to use.
Sage supports research and teaching in algebra, geometry,
number theory, cryptography, numerical computation, and
related areas. Many general and specific topics are covered by
Sage, including:
Basic Algebra and Calculus
Plotting
Basic Rings
Linear Algebra
Polynomials
Parents, Conversion, and Coercion
Finite Groups, Abelian Groups
Number Theory
Some More Advanced Mathematics
NetworkX
NetworkX is a Python package for the creation, manipulation,
and study of the structure, dynamics, and functions of complex
networks. The package offers various functions, including:
Data structures for graphs, digraphs, and multigraphs
Many standard graph algorithms
Network structure and analysis measures
Generators for classic graphs, random graphs, and synthetic
networks
Nodes can be "anything" (e.g., text, images, XML records)
Edges can hold arbitrary data (e.g., weights, time-series)
pip install networkx
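A minimal sketch of building and querying a graph:

import networkx as nx

G = nx.Graph()
G.add_edge("a", "b", weight=0.6)  # edges can hold arbitrary data
G.add_edge("b", "c", weight=0.2)
G.add_edge("a", "c", weight=0.9)

print(nx.shortest_path(G, "a", "c", weight="weight"))  # ['a', 'b', 'c']
print(nx.degree_centrality(G))  # one of many analysis measures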
B I G D A T A
P R O C E S S I N G
Polars
Polars is a DataFrame library designed to process data at lightning
speed; it is implemented in the Rust programming language and
uses Apache Arrow as its foundation. Polars' premise is to give
users a swifter experience compared to the Pandas package. The
ideal situation for Polars is when your data is too big for Pandas
but too small for Spark.
If you are familiar with the Pandas workflow, Polars will not feel
that different; there is some extra functionality, but overall they
are pretty similar.
pip install polars
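A minimal sketch of the expression-based API (the CSV path and column names are placeholders; on older Polars releases the method is spelled groupby):

import polars as pl

df = pl.read_csv("your_data.csv")  # placeholder dataset
result = (
    df.filter(pl.col("price") > 100)
      .group_by("category")
      .agg(pl.col("price").mean().alias("avg_price"))
)
print(result)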
Dask
Dask is a Python package for parallel computing in Python. There
are two main parts to Dask:
1. Task scheduling. Similar to Airflow, it optimizes the
computation process by automatically executing tasks.
2. Big data collections. Parallel collections such as NumPy-like
arrays and Pandas-like DataFrame objects, built for parallel processing.
In simpler terms, Dask offers data frame and array objects like the
ones you find in Pandas and NumPy, but processed in parallel for
faster execution time, and it offers a task scheduler.
#If you want to install dask completely
python -m pip install "dask[complete]"
#If you want to install dask core only
python -m pip install dask
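A minimal sketch of the lazy, parallel DataFrame (the glob path and column names are placeholders):

import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")  # many CSVs read as one parallel DataFrame
result = ddf.groupby("category")["price"].mean()
print(result.compute())  # nothing runs until .compute() is called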
Vaex
Vaex is a Python package used for processing and exploring big
tabular datasets with an interface similar to Pandas. The Vaex
documentation shows that it can calculate statistics such as
mean, sum, count, and standard deviation on an N-dimensional
grid at up to a billion (10⁹) objects/rows per second. This makes
Vaex a Pandas alternative that also improves execution time.
The Vaex workflow is similar to the Pandas API, which means
that if you are already familiar with Pandas, it will not be hard
for you to use Vaex.
pip install vaex
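A minimal sketch (the file path and column name are placeholders):

import vaex

df = vaex.open("big_data.hdf5")  # memory-mapped, so it opens instantly
print(df.mean(df.x))                   # statistics on a hypothetical column x
print(df.count(binby=df.x, shape=64))  # statistics computed on a grid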
M A C H I N E
L E A R N I N G
Scikit-Learn
The king of machine learning modeling in Python. There is no
way I would omit Scikit-Learn from my list of learning
references. If, for some reason, you have never heard of Scikit-
Learn, this module is an open-source Python machine learning
library built on top of SciPy.
Scikit-Learn contains all the common machine learning models
we use in our everyday data science work. According to the
homepage, Scikit-Learn supports supervised and unsupervised
learning modeling. It also provides various tools for model fitting,
data preprocessing, model selection and evaluation, and many
other utilities.
pip install -U scikit-learn
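The classic fit/predict workflow in a few lines:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))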
MLFlow
The current state of machine learning education is not limited to
the machine learning model; it extends to the automation of the
model lifecycle. This is what we call MLOps, or Machine Learning
Operations.
Many open-source Python packages support the MLOps lifecycle,
but in my opinion, MLflow has the most complete MLOps learning
material for a beginner.
According to the MLflow homepage, MLflow is an open-source
platform for managing the end-to-end machine learning lifecycle.
The package handles four functions:
Experiment tracking (MLflow Tracking),
Reproducible ML code (MLflow Projects),
Managing and deploying models (MLflow Models),
Central model lifecycle management (MLflow Model Registry).
pip install mlflow
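A minimal tracking sketch (the parameter and metric values are placeholders):

import mlflow

with mlflow.start_run():  # everything inside is logged as one run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.95)
# Browse the logged runs afterwards with the `mlflow ui` command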
River
River is a library for building online machine learning models.
Such models operate on data streams, though a data stream is a
bit of a vague concept.
River is not the only library that allows you to do online machine
learning, but it might be the simplest one to use in the Python
ecosystem. River plays nicely with Python dictionaries, making it
easy to use in the context of web applications where JSON
payloads are plentiful.
pip install river
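A minimal online-learning sketch on a toy stream of dictionaries:

from river import linear_model, metrics

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

stream = [({"x1": 0.1, "x2": 1.2}, True), ({"x1": 0.9, "x2": 0.3}, False)]
for x, y in stream:
    y_pred = model.predict_one(x)  # predict before learning
    metric.update(y, y_pred)
    model.learn_one(x, y)          # update the model one observation at a time
print(metric)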
PyCaret
PyCaret is an open-source, low-code machine learning library in
Python that automates machine learning workflows. It is an end-
to-end machine learning and model management tool that
exponentially speeds up the experiment cycle and makes you
more productive.
Compared with the other open-source machine learning libraries,
PyCaret is an alternate low-code library that can be used to
replace hundreds of lines of code with a few lines only. This
makes experiments exponentially fast and efficient. PyCaret is
essentially a Python wrapper around several machine-learning
libraries and frameworks, such as scikit-learn, XGBoost,
LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few
more.
pip install pycaret
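A minimal sketch of the low-code workflow (the CSV path and target column are placeholders):

import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_csv("your_data.csv")   # placeholder dataset
s = setup(data=df, target="label")  # hypothetical target column
best = compare_models()             # trains and ranks many models at once
print(best)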
TPOT
TPOT stands for Tree-based Pipeline Optimization Tool. Consider
TPOT as your Data Science Assistant. TPOT is a Python Automated
Machine Learning tool that optimizes machine learning pipelines
using genetic programming.
TPOT will automate the most tedious part of machine learning by
intelligently exploring thousands of possible pipelines to find the
best one for your data. Once TPOT is finished searching (or you get
tired of waiting), it provides the Python code for the best pipeline it
found so you can tinker with the pipeline from there.
pip install tpot
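A minimal sketch of a TPOT search:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # Python code for the best pipeline found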
EvalML
EvalML is an AutoML library that builds, optimizes, and evaluates
machine learning pipelines using domain-specific objective functions.
Key Functionality
Automation - Makes machine learning easier. Avoid training and
tuning models by hand. Includes data quality checks, cross-
validation, and more.
Data Checks - Catches and warns of problems with your data and
problem setup before modeling.
End-to-end - Constructs and optimizes pipelines that include
state-of-the-art preprocessing, feature engineering, feature
selection, and a variety of modeling techniques.
Model Understanding - Provides tools to understand and
introspect on models, to learn how they'll behave in your problem
domain.
Domain-specific - Includes repository of domain-specific objective
functions and an interface to define your own.
pip install evalml
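A minimal AutoML search on one of EvalML's demo datasets:

import evalml
from evalml.automl import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()
automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary")
automl.search()
print(automl.best_pipeline)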
I N T E R P R E T A B I L I T Y
Eli5
There are many advanced ML interpretation Python packages out
there, but most are too specialized, which leaves little room for
learning. In this case, I recommend Eli5 as your machine learning
interpretability study package, as it offers all the basic concepts
without many complicated ones.
Taken from the Eli5 documentation, the basic usage of this package is to:
1. inspect model parameters and try to figure out how the model
works globally;
2. inspect an individual prediction of a model and figure out
why the model makes that decision.
pip install eli5
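A minimal sketch of both usages (best viewed in a notebook):

import eli5
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

eli5.show_weights(clf)           # global: inspect the model parameters
eli5.show_prediction(clf, X[0])  # local: explain one individual prediction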
Yellowbrick
Yellowbrick is an open-source Python package that extends the
scikit-learn API with visual analysis and diagnostic tools. For
data scientists, Yellowbrick is used to evaluate model
performance and visualize model behavior.
Yellowbrick is a multi-purpose package that you can use in your
everyday modeling work. Even though most of Yellowbrick's
interpretation APIs are at a basic level, they are still useful for
our first modeling steps.
pip install yellowbrick
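A minimal sketch of one of its visualizers:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

viz = ClassificationReport(RandomForestClassifier())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)  # draws precision/recall/F1 per class
viz.show()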
SHAP
SHAP (SHapley Additive exPlanations) is a game-theoretic
approach to explaining the output of any machine learning model.
In simpler terms, SHAP uses Shapley values to explain the
importance of each feature, based on the difference between the
model's prediction and that of a null model.
SHAP is model-agnostic, similar to Permutation Importance,
so it is useful for any model.
pip install shap
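A minimal sketch of explaining a tree model:

import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=42).fit(X, y)

explainer = shap.Explainer(model, X)  # model-agnostic entry point
shap_values = explainer(X)
shap.plots.beeswarm(shap_values)      # global feature-importance view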
Mlxtend
Mlxtend, or machine learning extensions, is a Python package for
everyday data science work. The APIs within the package are
not limited to interpretability but extend to various functions,
such as statistical evaluation, data patterns, image extraction,
and many more. Here, however, we discuss its API for
interpretability: decision region plotting.
The decision regions plot API produces a decision region
plot to visualize how the features decide the classification model's
predictions. Let's try it using sample data, following the Mlxtend guide.
pip install mlxtend
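A minimal decision-regions sketch on two iris features:

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X = X[:, [0, 2]]  # two features so the regions can be drawn in 2D
clf = LogisticRegression(max_iter=1000).fit(X, y)

plot_decision_regions(X, y, clf=clf)
plt.show()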
PDPBox
PDP, or Partial Dependence Plot, is a plot that shows the marginal
effect of features on the predicted outcome of a machine
learning model. It is used to evaluate whether the relationship
between a feature and the target is linear, monotonic, or more
complex.
The advantage of interpreting with a partial dependence plot is
that it is easy for business people to interpret. The partial
dependence calculation also has a causal interpretation for the
model: we intervene on a feature and measure the changes in the
predictions.
pip install pdpbox
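PDPBox's API has changed across releases; a sketch following the classic pdp_isolate/pdp_plot interface:

import pandas as pd
from pdpbox import pdp
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
df, features = iris.frame, iris.feature_names
model = RandomForestClassifier(random_state=42).fit(df[features], df["target"])

pdp_iso = pdp.pdp_isolate(model=model, dataset=df,
                          model_features=features,
                          feature="petal length (cm)")
pdp.pdp_plot(pdp_iso, "petal length (cm)")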
InterpretML
InterpretML is a Python package that includes many machine
learning interpretability APIs. The purpose of this package is to
give you interactive, Plotly-based plots to understand your
prediction results.
InterpretML offers many ways to interpret your machine
learning models (globally and locally) using many of the
techniques we have discussed, namely SHAP and PDP. The
package also offers a glassbox model API, which gives you
interpretable models as you develop them.
pip install interpret
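A minimal glassbox sketch with the Explainable Boosting Machine:

from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier().fit(X, y)

show(ebm.explain_global())  # interactive, Plotly-based explanation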
T I M E S E R I E S
pmdarima
One of the forecasting models often used in time-series
analysis is ARIMA (AutoRegressive Integrated Moving Average).
ARIMA is a forecasting algorithm that predicts future
values based on past values of the time series, without any
additional information.
pmdarima is a statistical Python package that provides the
ARIMA API and all the basic time-series analysis APIs; here we
only try the Auto-ARIMA.
pip install pmdarima
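A minimal Auto-ARIMA sketch on a dataset bundled with pmdarima:

import pmdarima as pm

y = pm.datasets.load_wineind()  # monthly wine sales time series
model = pm.auto_arima(y, seasonal=True, m=12, suppress_warnings=True)
print(model.predict(n_periods=12))  # forecast the next 12 months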
sktime
Many people who learn machine learning with Python use
Sklearn as their starting point. The problem with Sklearn is
that it provides no time-series analysis module; this is
why the sktime package was developed. According to the homepage,
sktime specializes in time-series algorithms and scikit-learn-
compatible tools, including:
Forecasting,
Time series classification,
Time series regression.
pip install sktime
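A minimal forecasting sketch in sktime's scikit-learn-like style:

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()  # classic monthly airline-passengers series
forecaster = NaiveForecaster(strategy="last")
forecaster.fit(y)
print(forecaster.predict(fh=[1, 2, 3]))  # forecast 1-3 steps ahead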
fbprophet
fbprophet, or Prophet, is a time-series forecasting package
developed by Facebook's research team. According to the homepage,
fbprophet forecasts time-series data based on an additive model
where non-linear trends are fit with seasonality and holiday
effects.
fbprophet mentions that it works best with time series that have
strong seasonal effects and several seasons of historical data.
It also notes that it is robust to missing data and can
handle outliers well. From this, we can infer that fbprophet is
a good package for modeling time data with high seasonality.
pip install pystan==2.19.1.1
pip install prophet
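A minimal sketch (Prophet expects a DataFrame with columns ds and y; the CSV path is a placeholder):

import pandas as pd
from prophet import Prophet

df = pd.read_csv("your_timeseries.csv")  # placeholder dataset with ds, y
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
m.plot(forecast)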
tsfresh
tsfresh is a Python package that automatically calculates a large
number of time series characteristics, the so-called features.
Further, the package contains methods to evaluate the power and
importance of such characteristics for regression or classification
tasks.
The package provides systematic time-series feature extraction
by combining established algorithms from statistics, time-series
analysis, signal processing, and nonlinear dynamics with a robust
feature selection algorithm. In this context, the term time-series
is interpreted in the broadest possible sense, such that any types
of sampled data or even event sequences can be characterised.
pip install tsfresh
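A minimal extraction sketch (the CSV path and id/time column names are placeholders):

import pandas as pd
from tsfresh import extract_features

# Long format: one id column per series, one sort column, and the values
df = pd.read_csv("your_timeseries.csv")  # placeholder dataset
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)  # hundreds of automatically computed features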
darts
Darts is a Python library for user-friendly forecasting and anomaly
detection on time series. It contains a variety of models, from
classics such as ARIMA to deep neural networks. The forecasting
models can all be used similarly, using fit() and predict() functions,
similar to scikit-learn.
The library also makes it easy to backtest models, combine the
predictions of several models, and take external data into account.
Darts supports both univariate and multivariate time series and
models. The ML-based models can be trained on potentially large
datasets containing multiple time series, and some of the models
offer rich support for probabilistic forecasting.
pip install darts
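A minimal fit()/predict() sketch on a dataset bundled with Darts:

from darts.datasets import AirPassengersDataset
from darts.models import ExponentialSmoothing

series = AirPassengersDataset().load()
train, val = series[:-36], series[-36:]

model = ExponentialSmoothing()
model.fit(train)
forecast = model.predict(len(val))  # scikit-learn-like workflow
print(forecast)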
N L P
NLTK
Natural Language Toolkit, or NLTK, is an open-source Python
package developed specifically for working with human language.
It is arguably the most-used package by beginners and professionals
in the NLP field, as NLTK offers many useful APIs for NLP research.
According to the homepage, NLTK is suitable for any profession:
linguist, data scientist, researcher, student, and many more.
NLTK contains all the common APIs we use in everyday NLP
work. If we explore the homepage, we find that NLTK provides
various tools for parsing, stemming, tokenization, and many
more. It also includes APIs to read data from sources such as
Twitter.
pip install --user -U nltk
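A minimal tokenization and POS-tagging sketch:

import nltk

nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS-tagger models

tokens = nltk.word_tokenize("NLTK makes natural language processing approachable.")
print(nltk.pos_tag(tokens))  # [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]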
Pattern
The Pattern package is an open-source Python package developed for
text processing and web mining. The API provides many
functions, such as:
Data mining APIs for various sources (Google, Twitter, and Wikipedia)
NLP processing
Machine learning modeling
Network analysis
If we compare the Pattern package to the NLTK package, Pattern's
text-processing functions are less complete than NLTK's. However,
Pattern includes web-mining tools, which NLTK does not have.
This is because Pattern was developed with a focus on data mining.
pip install pattern
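Pattern runs best on older Python versions; a minimal sketch of its English NLP API:

from pattern.en import parse, sentiment

print(sentiment("The movie was absolutely wonderful!"))  # (polarity, subjectivity)
print(parse("The cat sat on the mat."))                  # POS-tagged output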
TextBlob
TextBlob is a Python text processing package that provides many
APIs to ease NLP project tasks. TextBlob is built on top of
the NLTK and Pattern packages, which means you will find
many familiar APIs.
TextBlob stands out for how beginner-friendly it makes NLP
work thanks to its simple API. The TextBlob package was
developed specifically to handle NLP tasks such as tagging,
translation, and sentiment analysis in a simple way.
pip install -U textblob
python -m textblob.download_corpora
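A minimal sketch of that simple API:

from textblob import TextBlob

blob = TextBlob("TextBlob makes NLP tasks surprisingly simple.")
print(blob.tags)          # part-of-speech tagging
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
print(blob.noun_phrases)  # noun phrase extraction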
SpaCy
SpaCy is an open-source Python package with the tagline
"Industrial-Strength Natural Language Processing". This means
SpaCy is developed more for production environments and industrial
use than for academic purposes.
Although it is developed for industrial purposes, its tutorials and
documentation are quite complete. The pages offer guides,
lessons, and online videos to learn NLP from the beginning and to
use SpaCy. For example, the spaCy 101 guide covers many subjects,
such as linguistic annotations, tokenization, POS tags and
dependencies, vocab, hashes, and lexemes, and many more.
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
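A minimal sketch of those linguistic annotations:

import spacy

nlp = spacy.load("en_core_web_sm")  # the small English model installed above
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.dep_)  # tokens, POS tags, dependencies
for ent in doc.ents:
    print(ent.text, ent.label_)                # named entities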
FastText
fastText is a library for efficient learning of word
representations and sentence classification. The package is
developed by Facebook Research.
Its features include:
Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.
pip install fasttext
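A minimal text-classification sketch (train.txt is a placeholder file with one example per line, labels prefixed with __label__):

import fasttext

model = fasttext.train_supervised(input="train.txt")  # placeholder training file
print(model.predict("which package should I use for NLP?"))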
R E C O M M E N D A T I O N
S Y S T E M
Surprise
Surprise is an open-source Python package for building a
recommendation system based on the rating data. The name SurPRISE
is an abbreviation for the Simple Python RecommendatIon System
Engine. The package provides all the necessary tools for building the
recommendation system — from loading the dataset, choosing the
prediction algorithm, and evaluating the model.
pip install scikit-surprise
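A minimal sketch using the built-in MovieLens data (downloaded on first use):

from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin("ml-100k")  # MovieLens 100k ratings
algo = SVD()
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)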
TensorFlow Recommenders
The TensorFlow framework contains a library to build the
recommendation system called TensorFlow Recommenders. Like the
other packages, TensorFlow Recommenders contains dataset
examples, recommender algorithms, model evaluation, and
deployment tooling. TensorFlow Recommenders allows us to build a
recommendation system based solely on the TensorFlow framework.
pip install tensorflow-recommenders
Recmetrics
Learning about recommendation system algorithms would not be
complete without the evaluation metrics. The previous pages
taught us some basic recommendation evaluation metrics, but
there is a Python package that focuses on the metrics: Recmetrics.
The package contains many evaluation metrics for
recommendation systems, such as:
Long Tail Plot
Coverage
Novelty
Personalization
Intra-List Similarity
pip install recmetrics
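A minimal sketch on toy top-k recommendation lists (the lists and catalog are my own placeholders):

import recmetrics

predicted = [["a", "b", "c"], ["a", "d", "e"], ["b", "c", "f"]]  # per-user lists
catalog = ["a", "b", "c", "d", "e", "f", "g", "h"]

print(recmetrics.personalization(predicted))               # dissimilarity across users
print(recmetrics.prediction_coverage(predicted, catalog))  # % of catalog recommended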
A U D I O P R O J E C T
Magenta
Magenta is an open-source Python package built on top of TensorFlow
for manipulating image and music data to train generative machine
learning models.
Magenta does not provide detailed API references to learn from;
instead, it offers many research demos and collaborator notebooks we
can try on our own.
For a first-timer on an audio data science project, I suggest visiting
their Hello World notebook for music creation. I learned a lot from
that notebook, especially the generative machine learning part,
where you can test various tones to produce your own music.
pip install magenta
Librosa
Librosa is a Python package developed for music and audio analysis. It
specializes in capturing audio information so that it can be transformed
into data blocks. The documentation and examples are good for
understanding how to work on audio data science projects.
pip install librosa
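A minimal sketch using an example clip bundled with librosa:

import librosa

y, sr = librosa.load(librosa.ex("trumpet"))         # bundled example audio
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)  # rhythm information
mfcc = librosa.feature.mfcc(y=y, sr=sr)             # features for ML models
print(tempo, mfcc.shape)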
pyAudioAnalysis
pyAudioAnalysis is a Python package for audio analysis tasks. It is
designed to do various analyses, such as:
Extract Audio Features
Train machine learning model for audio segmentation
Classification of unknown audio
Emotion recognition with a Regression model
Dimensional Reduction for audio data visualization
and many more. You could do many things with this package,
especially if you are new to audio data science projects.
git clone https://github.com/tyiannak/pyAudioAnalysis.git
pip install -r ./requirements.txt
pip install -e .
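A minimal feature-extraction sketch (the WAV path is a placeholder):

from pyAudioAnalysis import ShortTermFeatures, audioBasicIO

fs, x = audioBasicIO.read_audio_file("sample.wav")  # placeholder audio file
# 50 ms frames with a 25 ms step
features, feature_names = ShortTermFeatures.feature_extraction(
    x, fs, 0.050 * fs, 0.025 * fs)
print(features.shape, feature_names[:3])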
O U T L I E R
D E T E C T I O N
PyOD
PyOD, or Python Outlier Detection, is a Python toolkit for
detecting outlier data. The PyOD package boasts 30 outlier detection
algorithms, ranging from classic methods to the latest ones, which is
proof that PyOD is well maintained. Examples of its outlier detection
models include:
Angle-Based Outlier Detection
Cluster-Based Local Outlier Factor
Principal Component Analysis Outlier Detection
Variational Auto Encoder
PyOD makes outlier detection simple and intuitive, predicting the
outlier data with only a few lines of code. As in model training, PyOD
uses a classifier-style model to fit the data and predict outliers based
on the fitted model.
pip install pyod
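A minimal classifier-style sketch with PyOD's data generator:

from pyod.models.knn import KNN
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=200, train_only=True)

clf = KNN()
clf.fit(X_train)                  # train like any classifier
print(clf.labels_[:10])           # 0 = inlier, 1 = outlier
print(clf.decision_scores_[:10])  # raw outlier scores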
alibi-detect
The alibi-detect Python package is an open-source package that
focuses on outlier, adversarial, and drift detection. It can be used
on tabular and unstructured data such as images or text. The
alibi-detect package offers 10 methods for outlier detection.
pip install alibi-detect
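A minimal sketch, assuming the isolation-forest detector (method availability varies by version):

import numpy as np
from alibi_detect.od import IForest

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))

od = IForest(threshold=0.0)  # isolation-forest-based outlier detector
od.fit(X)
preds = od.predict(X)
print(preds["data"]["is_outlier"][:10])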
PyNomaly
PyNomaly is a Python package for detecting outliers based on LoOP
(Local Outlier Probabilities). LoOP is based on the Local Outlier
Factor (LOF), but its scores are normalized to the range [0, 1].
Local Outlier Factor, or LOF, is an algorithm proposed by Breunig et al.
(2000). The concept is simple: the algorithm tries to find anomalous
data points by measuring the local deviation of a given data point with
respect to its neighbors. LOF yields a score that tells
whether our data point is an outlier or not.
pip install PyNomaly
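A minimal LoOP sketch on random data:

import numpy as np
from PyNomaly import loop

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))

m = loop.LocalOutlierProbability(X).fit()
print(m.local_outlier_probabilities[:10])  # scores normalized to [0, 1]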
M A C H I N E
L E A R N I N G
V A L I D A T I O N
Evidently
Evidently is an open-source Python package for analyzing and monitoring
machine learning models. The package is explicitly developed to
establish an easy-to-use machine learning monitoring dashboard and to
detect drift in the data. It is specifically designed with production in
mind, so it is best used when a data pipeline is in place. However, you
can still use it in the development phase.
We can monitor our machine learning model's metrics as a whole and
per-feature prediction. The detail is good enough to know whether
there is a difference when new data comes in.
pip install evidently
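A minimal drift-report sketch using the Report API of recent Evidently releases (the CSV paths are placeholders):

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference_df = pd.read_csv("reference.csv")  # data the model was trained on
current_df = pd.read_csv("current.csv")      # newly incoming data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("drift_report.html")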
Deepchecks
Deepchecks is a Python package for validating our machine learning
model with a few lines of code. Many APIs are available for detecting
data drift, label drift, train-test comparison, model evaluation, and
many more. Deepchecks is perfect to use in the research phase and
before your model goes into production.
Deepchecks produces full suite reports containing much
information, such as a Confusion Matrix Report, Simple Model
Comparison, Mixed Data Types, Data Drift, and more. All the information
you need to check a machine learning model is available in a single
code run.
pip install deepchecks
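A minimal full-suite sketch (the CSV paths, label column, and model are placeholders):

import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv("train.csv")  # placeholder datasets
test_df = pd.read_csv("test.csv")
model = RandomForestClassifier().fit(
    train_df.drop(columns="target"), train_df["target"])  # hypothetical label

train_ds = Dataset(train_df, label="target")
test_ds = Dataset(test_df, label="target")
result = full_suite().run(train_dataset=train_ds, test_dataset=test_ds, model=model)
result.save_as_html("deepchecks_report.html")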
TensorFlow-Data-Validation
TensorFlow Data Validation, or TFDV, is a Python package developed by
the TensorFlow developers to manage data quality issues. It is used to
automatically describe data statistics, infer the data schema, and
detect anomalies in incoming data.
The TFDV package is not limited to generating statistical
visualizations; it is also helpful for detecting changes in
incoming data. To do this, we infer a schema from the original or
reference data.
pip install tensorflow-data-validation
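A minimal sketch of the statistics/schema/anomaly workflow (the CSV paths are placeholders):

import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.read_csv("train.csv")   # reference data
new_df = pd.read_csv("incoming.csv")  # newly incoming data

train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)  # reference schema

new_stats = tfdv.generate_statistics_from_dataframe(new_df)
anomalies = tfdv.validate_statistics(new_stats, schema)
tfdv.display_anomalies(anomalies)        # what changed in the new data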
S Y N T H E T I C D A T A
Faker
Faker is a Python package developed to simplify generating synthetic
data. Many subsequent synthetic data generator Python packages are
based on the Faker package. People love how simple and intuitive this
package is.
With Faker, we can generate various kinds of synthetic data. For
example, we can create synthetic names; each time we run Faker, the
result is different data from the previous iteration. This
randomization is important in generating synthetic data because we
want variation in our dataset.
pip install faker
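A minimal sketch; every run prints different values:

from faker import Faker

fake = Faker()
for _ in range(3):
    print(fake.name(), "|", fake.email(), "|", fake.address())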
SDV
SDV, or Synthetic Data Vault, is a Python package that generates
synthetic data based on a provided dataset. The generated data can be
single-table, multi-table, or time-series, depending on the scheme you
provide in the environment. The generated data also has the same
format, properties, and statistics as the provided dataset.
SDV generates synthetic data by applying mathematical techniques and
machine learning models, such as deep learning models. Even if the
data contains multiple data types and missing values, SDV will handle
it, so we only need to provide the data (and the metadata when required).
pip install sdv
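A minimal single-table sketch following the SDV 1.x API (older versions used sdv.tabular; the CSV path is a placeholder):

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_df = pd.read_csv("your_data.csv")  # placeholder dataset

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real_df)
print(synth.sample(num_rows=100))  # synthetic rows with matching statistics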
Gretel
Gretel, or Gretel Synthetics, is an open-source Python package based on
Recurrent Neural Networks (RNNs) for generating structured and
unstructured data. The package treats the dataset as text data and
trains the model on that text; the model then produces synthetic data
as text (we need to transform the output into our intended format).
Gretel requires a fair amount of computational power because it is
based on RNNs, so I recommend using a free Google Colab or Kaggle
notebook if your computer is not powerful enough.
pip install gretel-synthetics
Mimesis
Mimesis is a robust data generator for Python that can produce a
wide range of fake data in various languages. This tool is useful for
populating testing databases, creating fake API endpoints, filling
pandas DataFrames, generating JSON and XML files with custom
structures, and anonymizing production data, among other purposes.
The features include:
Multilingual: Supports multiple languages.
Extensibility: Supports custom data providers.
Easy: Offers a simple design and clear documentation for easy data
generation.
Performance: Widely recognized as the fastest data generator
among Python solutions.
Data variety: Includes a variety of data providers designed for
different use cases.
Schema-based generators: Offers schema-based data generators to
produce data of any complexity effortlessly.
pip install mimesis
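A minimal sketch of a few providers:

from mimesis import Address, Person
from mimesis.locales import Locale

person = Person(Locale.EN)
address = Address(Locale.EN)
print(person.full_name(), "|", person.email(), "|", address.city())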
XV. Closing Remarks
All the package recommendations in this e-book are purely my
opinion, and each package was tested personally.
The Python packages listed here might undergo name
changes or be discontinued by their respective developers
after this e-book was written, so be cautious about that.
Overall, credit goes to the developers of all these
amazing Python packages.
I hope this e-book helps you on your data science learning
journey.