Data Science Pipeline, EDA & Data Preparation
2019
All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by
any means, electronic or mechanical, including photocopying, recording or by any information storage
or retrieval system without written permission from the publisher.
Acknowledgement
Every attempt has been made to trace the copyright holders of materials reproduced in this unit. Should any infringement have occurred, SCDL apologises for the same and will be pleased to make the necessary corrections in future editions of this unit.
Objectives
Data science pipelines are sequences of processing and analysis steps applied to data for a specific purpose. They are useful in production projects, and also when one expects to encounter the same type of business question in the future, since a reusable pipeline saves design and coding time.
1) Problem Definition
Contrary to common belief, the hardest part of data science isn’t building an accurate model
or obtaining good, clean data. It is much harder to define feasible problems and come up with
reasonable ways of measuring solutions. Problem definition aims at understanding, in depth,
a given problem at hand.
Multiple brainstorming sessions are organised to define the problem correctly, because the end goal depends entirely on what problem you are trying to solve. Hence, if you go wrong during the problem definition phase itself, you will end up delivering a solution to a problem that never existed in the first place.
2) Hypothesis Testing
Data collection is the process of gathering and measuring information on variables of interest,
in an established systematic fashion that enables one to answer stated research questions,
test hypotheses, and evaluate outcomes. Moreover, the data collection component of
research is common to all fields of study including physical and social sciences, humanities,
business, etc.
While methods vary by discipline, the emphasis on ensuring accurate and honest collection
remains the same.
Data processing is a series of actions or steps performed on data to verify, organise, transform, integrate, and extract it in an appropriate output form for subsequent use. Methods of processing must be rigorously documented to ensure the utility and integrity of the data.
Once you have clean and transformed data, the next step for machine learning projects is to
become intimately familiar with the data using exploratory data analysis (EDA).
EDA is about numeric summaries, plots, aggregations, distributions, densities, reviewing all
the levels of factor variables and applying general statistical methods.
A clear understanding of the data provides the foundation for model selection, i.e. choosing
the correct machine learning algorithm to solve your problem.
Feature engineering is the process of determining which predictor variables will contribute
the most to the predictive power of a machine learning algorithm.
The process of feature engineering is as much of an art as a science. Often feature engineering
is a give-and-take process with exploratory data analysis to provide much-needed intuition
about the data. It’s good to have a domain expert around for this process, but it’s also good
to use your imagination.
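As a small illustration, the sketch below derives a few candidate features from a hypothetical retail transactions table using pandas. The column names (order_date, quantity, unit_price) and the derived features are assumptions made purely for the example, not part of any particular dataset.

import pandas as pd

# Hypothetical transactions data; in practice this would come from your own source.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2019-01-05", "2019-02-14", "2019-02-20"]),
    "quantity": [2, 5, 1],
    "unit_price": [250.0, 99.0, 1200.0],
})

# Derived features that often carry more predictive power than the raw columns.
df["revenue"] = df["quantity"] * df["unit_price"]        # interaction of two columns
df["order_month"] = df["order_date"].dt.month            # seasonality signal
df["order_dayofweek"] = df["order_date"].dt.dayofweek    # weekday vs weekend behaviour
df["is_bulk_order"] = (df["quantity"] >= 5).astype(int)  # simple domain-driven flag

print(df.head())

Which of these engineered columns actually helps is exactly the kind of question that EDA and domain expertise answer together.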
Machine learning can be used to make predictions about the future. You provide a model with
a collection of training instances, fit the model on this data set, and then apply the model to
new instances to make predictions.
Predictive modelling is useful because you can make products that adapt based on expected
user behaviour. For example, if a viewer consistently watches the same broadcaster on a
streaming service, the application can load that channel on application start-up.
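A minimal sketch of this fit-then-predict workflow is given below using scikit-learn. The synthetic data and the choice of logistic regression are assumptions made only for illustration; any other supervised learning algorithm follows the same pattern.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic training instances standing in for historical user behaviour.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training set.
model = LogisticRegression()
model.fit(X_train, y_train)

# Apply the fitted model to new instances to make predictions.
predictions = model.predict(X_new)
print(predictions[:10])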
6) Data Visualisation
Interpreting the data is about communicating your findings to the interested parties. If you cannot explain your findings to someone, then whatever you have done is of little use. Hence, this step becomes very crucial.
The objective of this step is to first identify the business insight and then correlate it to your
data findings. Secondly, you might need to involve domain experts in correlating the findings
with business problems.
Domain experts can help you in visualising your findings according to the business dimensions
which will also aid in communicating facts to a non-technical audience.
The first step in analytics is gathering data. Then, as you begin to analyse and dig deep for answers, it often becomes necessary to connect to and mash up information from a variety of data sources.
Data can be messy, disorganised, and contain errors. As soon as you start working with it, you will see
the need for enriching or expanding it, adding groupings and calculations. Sometimes it is difficult to
understand what changes have already been implemented.
Moving between data wrangling and analytics tools slows the analytics process—and can introduce
errors. It’s important to find a data wrangling function that lets you easily make adjustments to data
without leaving your analysis.
This is also called Data Munging. It follows certain steps: after the data is extracted from different data sources, it is sorted using a suitable algorithm, decomposed into a different, structured format, and finally stored in another database.
Typical steps include, among others, dropping unnecessary columns (such as those containing IDs or names) and removing outliers. Each of these steps has a special importance with respect to Data Science.
If you have two rows like Bombay and Mumbai representing the same city, this could lead to wrong results. One of the values has to be changed manually by the data analyst; this is done by creating a mapping on the fly in the visualisation tool and applying it to every row of data, and the process is repeated for other cities to detect more such issues.
Therefore, data is converted to a proper, feasible format before applying any model to it. By filtering, grouping and selecting appropriate data, the accuracy and performance of the model can be increased.
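A hedged pandas sketch of this kind of on-the-fly mapping, filtering and grouping is shown below. The Bombay/Mumbai values mirror the example above, while the column names and revenue figures are invented for illustration.

import pandas as pd

sales = pd.DataFrame({
    "city": ["Bombay", "Mumbai", "Delhi", "Delhi"],
    "revenue": [1200, 1500, 900, 1100],
})

# Map multiple representations of the same city to one canonical value.
city_mapping = {"Bombay": "Mumbai"}
sales["city"] = sales["city"].replace(city_mapping)

# Filter, group and aggregate to get the data into a feasible format for modelling.
high_value = sales[sales["revenue"] > 1000]
revenue_by_city = high_value.groupby("city")["revenue"].sum()
print(revenue_by_city)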
1. Cleaning the data - finding junk values and removing them, finding outliers and replacing them appropriately (with the 95th percentile, for example), etc.
2. Summary Statistics - finding the summary statistics - mean, median and if necessary, mode,
along with the standard deviation and variance of the particular distribution
3. Univariate analysis - a simple histogram that shows the frequency of a particular variable's occurrence, or a line chart that shows how a particular variable changes over time, used to look at all the variables in the data and understand them.
The idea is that, after performing Exploratory Data Analysis, you should have a sound understanding of the data you are about to dive into; a short sketch of the three steps above is given below. Further hypothesis-based analysis (post EDA) could involve statistical testing, bivariate analysis, etc.
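The following sketch illustrates these three steps on a hypothetical numeric column using pandas and matplotlib; the column values, the -999 junk code and the 95th-percentile capping rule are assumptions chosen to match the example in step 1.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily revenue figures with a junk value, a missing value and an outlier.
revenue = pd.Series([1200, 1350, np.nan, 1280, -999, 1310, 25000, 1295])

# 1. Cleaning: treat -999 as junk, drop missing values, cap outliers at the 95th percentile.
revenue = revenue.replace(-999, np.nan).dropna()
cap = revenue.quantile(0.95)
revenue = revenue.clip(upper=cap)

# 2. Summary statistics.
print(revenue.describe())            # mean, std, quartiles, etc.
print("median:", revenue.median())

# 3. Univariate analysis: a simple histogram of the variable.
revenue.hist(bins=10)
plt.xlabel("Daily revenue")
plt.ylabel("Frequency")
plt.show()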
We have all seen our mother taking a spoonful of soup to judge whether or not the salt in it is right. The act of tasting a spoonful of soup to check the salt level and to better understand its taste is exploratory data analysis. Based on that, our mothers decide the salt level; this is where they make inferences, and the validity of those inferences depends on whether or not the soup is well stirred, that is to say, whether or not the sample represents the whole population.
Say we are given some sales data with daily revenue numbers for a big retail chain, and we need to make sense of it. The question that arises now is: in what ways can we achieve this? What will you look for? Do you know what to look for? Will you immediately run code to find the mean, median, mode and other statistics?
The main objective is to understand the data inside out. The first step in any EDA is asking the right questions, the ones to which we want the answers; if our questions go wrong, the whole EDA goes wrong. So, the first step of any EDA is to list down as many questions as you can on a piece of paper. These are the questions that need to be asked before deciding on the next steps.
Exploratory data analysis (EDA) is very different from classical statistics. It is not about fitting models,
parameter estimation, or testing hypotheses, but is about finding information in data and generating
ideas.
So, this is the background of EDA. Technically, it involves steps like cleaning the data, calculating
summary statistics and then making plots to better understand the data at hand to make meaningful
inferences.
The next step after data cleaning is data reduction. This includes defining and extracting attributes,
decreasing the dimensions of data, representing the problems to be solved, summarising the data,
and selecting portions of the data for analysis.
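One common way of decreasing the dimensions of data is principal component analysis. The brief sketch below uses scikit-learn's PCA on synthetic data purely as an illustration of the idea; the numbers of samples, features and components are arbitrary.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data with 20 attributes, standing in for a wide real-world table.
X, _ = make_classification(n_samples=300, n_features=20, random_state=0)

# Reduce the 20 attributes to 5 principal components.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (300, 5)
print(pca.explained_variance_ratio_.sum())  # share of variance retained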
There are multiple data cleansing practices in vogue to clean and standardise bad data and make it effective, usable and relevant to business needs.
Organisations relying heavily on data-driven business strategies need to choose a practice that best fits their operational working. A standard practice is outlined below.
Detailed steps of this process are as follows:
1. Store the data:
Put together the data collected from all sources and create a data warehouse. Once your data is stored in one place, it is ready to be put through the cleansing process.
2. Identify errors:
Multiple problems contribute to lowering the quality of data and making it dirty: inaccuracy, invalid data, incorrect data entry, missing values, spelling errors, incorrect data ranges, and multiple representations of the same data. These are some of the common errors which should be taken care of when creating a cleansed data regime.
3. Remove duplication/redundancy:
Multiple employees work on a single file where they collect and enter data. Most of the time, they do not realise they are entering the same data collected by some other employee at some other time. Such duplicate data corrupts the results and must be weeded out.
4. Validate data accuracy:
Effective marketing occurs with high-quality data, and thus validating its accuracy is the utmost priority organisations aim for. However, the method of collection is independent of the cleansing process. A triple verification of the data will enhance the dataset and build trustworthiness among marketers and sales professionals to utilise the power of data.
5. Standardise the data:
Now that the data is validated, it is important to put all of it in a standardised and accessible format. This ensures the entered data is clean, enriched and ready to use.
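A minimal pandas sketch of these cleansing steps (standardising formats, removing duplicates and identifying invalid values) is shown below. The column names, sample values and validation rule are assumptions made only for the example.

import pandas as pd

customers = pd.DataFrame({
    "name": ["Asha", "asha ", "Ravi", "Meera"],
    "age": [29, 29, -5, 41],                  # -5 is an invalid entry
    "city": ["Mumbai", "Mumbai", "Delhi", None],
})

# Standardise formats so duplicates can be detected reliably.
customers["name"] = customers["name"].str.strip().str.title()
customers["city"] = customers["city"].str.title()

# Remove duplication/redundancy introduced by multiple data-entry points.
customers = customers.drop_duplicates(subset=["name", "age", "city"])

# Identify errors: mark invalid ranges as missing so they can be corrected later.
customers["age"] = customers["age"].where(customers["age"] >= 0)

print(customers.isna().sum())   # count of missing/invalid values per column
print(customers)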
There are several other best practices which also need to be followed while cleansing data.
Keep in mind, however, that all these are standard practices, and they may or may not apply every time to a given problem. For example, if we have numerical data, we might first want to handle missing values and NAs.
For textual data, tokenisation, removal of whitespace, punctuation and stopwords, and stemming can all be possible steps towards cleaning the data for further analysis.
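A hedged sketch of these text-cleaning steps using NLTK is given below. It assumes the punkt and stopwords resources can be downloaded, and the sample sentence is invented for illustration.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")

text = "  The customers were happily buying, returning, and reviewing products!  "

tokens = word_tokenize(text.lower().strip())                 # tokenisation + whitespace removal
tokens = [t for t in tokens if t not in string.punctuation]  # remove punctuation
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]          # remove stopwords
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]                   # stemming

print(tokens)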
Thus, Data Cleansing is imperative for model building. If the data is garbage, then the output will also be garbage, no matter how great a statistical analysis is applied to it.
Statistical modelling is, literally, building statistical models; a linear regression, for example, is a statistical model (a minimal fitting sketch is given after the list below). To do any kind of statistical modelling, it is essential to know the basics of statistics, such as:
Basic statistics: Mean, Median, Mode, Variance, Standard Deviation, Percentile, etc.
Probability Distribution: Geometric Distribution, Binomial Distribution, Poisson
distribution, Normal Distribution, etc.
Population and Sample: understanding the basic concepts, the concept of sampling
Confidence Interval and Hypothesis Testing: How to Perform Validation Analysis
Correlation and Regression Analysis: Basic Models for General Data Analysis
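As a minimal sketch tying these basics together, the example below fits a simple linear regression with statsmodels on synthetic data and reads off the coefficients, confidence intervals and p-values. Every number here is invented for illustration.

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

# Fit an ordinary least squares regression (a statistical model).
X = sm.add_constant(x)             # adds the intercept term
model = sm.OLS(y, X).fit()

print(model.params)                # estimated intercept and slope
print(model.conf_int())            # confidence intervals for the coefficients
print(model.pvalues)               # hypothesis tests on the coefficients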
Statistical modelling is a step which comes after data cleansing. The most important parts are model selection, configuration, prediction, evaluation and presentation.
1) Model Selection
One among many machine learning algorithms may be appropriate for a given predictive
modeling problem. The process of selecting one method as the solution is called model
selection.
This may involve a suite of criteria both from stakeholders in the project and the careful
interpretation of the estimated skill of the methods evaluated for the problem.
As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection. They are listed below, followed by a brief comparison sketch:
o Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the
result given an assumption or expectation about the result (presented using critical
values and p-values).
o Estimation Statistics. Methods that quantify the uncertainty of a result using
confidence intervals.
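The sketch below compares the cross-validated skill of two candidate models and applies a paired t-test to the fold scores. This is only one simple way to operationalise the ideas above, and the data and models are synthetic stand-ins chosen for the example.

from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

# Estimated skill of each candidate model across the same 10 cross-validation folds.
scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
scores_dt = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=10)

# Statistical hypothesis test: is the observed difference in skill plausibly due to chance?
t_stat, p_value = ttest_rel(scores_lr, scores_dt)
print("mean skill:", scores_lr.mean(), scores_dt.mean())
print("p-value for the difference:", p_value)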
2) Model Configuration
A given machine learning algorithm often has a suite of hyperparameters (parameters passed
to the statistical model which can be changed) that allow the learning method to be tailored
to a specific problem.
The configuration of the hyperparameters is often empirical in nature, rather than analytical,
requiring large suites of experiments in order to evaluate the effect of different
hyperparameter values on the skill of the model.
Hyperparameters can make or break a model, and hyperparameter tuning is a very common practice in the world of Data Science.
The two methods by which we can do hyperparameter tuning are listed below, followed by a short sketch:
o Grid Search
o Random Search
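A hedged sketch of both approaches using scikit-learn is shown below; the random-forest estimator and the parameter ranges are arbitrary choices made only for the example.

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=8, random_state=2)

# Grid search: exhaustively evaluate every combination in a fixed grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=2),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5,
)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Random search: sample a fixed number of combinations from distributions.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2),
    param_distributions={"n_estimators": randint(50, 200), "max_depth": [3, 5, None]},
    n_iter=10,
    cv=5,
    random_state=2,
)
random_search.fit(X, y)
print("random search best:", random_search.best_params_, random_search.best_score_)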
3) Model Evaluation
4) Model Presentation
Once a final model has been trained, it can be presented to stakeholders prior to being used
or deployed to make actual predictions on real data.
A part of presenting a final model involves presenting the estimated skill of the model.
Methods from the field of estimation statistics can be used to quantify the uncertainty in the
estimated skill of the machine learning model through the use of tolerance intervals and
confidence intervals.
o Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
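One simple way to produce such a confidence interval is to bootstrap the model's score on a held-out test set, as in the hedged sketch below; the data and model are again synthetic placeholders used only to show the idea.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Bootstrap the test set to quantify uncertainty in the estimated skill.
rng = np.random.default_rng(3)
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), len(y_test))   # resample with replacement
    scores.append(accuracy_score(y_test[idx], y_pred[idx]))

low, high = np.percentile(scores, [2.5, 97.5])
print(f"estimated accuracy {np.mean(scores):.3f}, 95% CI [{low:.3f}, {high:.3f}]")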
Data science is useless if you can’t communicate your findings to others, and visualisations are
imperative if you’re speaking to a non-technical audience. If you come into a board room without
presenting any visuals, you’re going to run out of work pretty soon.
More than that, visualisations are very helpful for data scientists themselves. Visual representations are much more intuitive to grasp than numerical abstractions.
Consider a chart showing total air passengers across time for a particular airline. Just by glancing at the chart for two seconds, we immediately recognise a seasonal pattern and a long-term trend. Identifying those patterns by analysing the numbers alone would require decomposing the signal in several steps.
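A small matplotlib sketch for producing such a chart is given below; it assumes a hypothetical CSV file airline_passengers.csv with Month and Passengers columns, in the style of the classic air-passengers dataset.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file with monthly totals, e.g. columns: Month, Passengers.
df = pd.read_csv("airline_passengers.csv", parse_dates=["Month"])

plt.plot(df["Month"], df["Passengers"])
plt.title("Total air passengers over time")
plt.xlabel("Month")
plt.ylabel("Passengers")
plt.show()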
You need to understand the data yourself, so you need to create visualisations which will probably never be shared. You also need to get the data's story across, and visualisation is usually the best way to do that.
Visualisations are helpful both in pre-processing and post-processing stages. They help us understand
our datasets and results in the form of shapes and objects which is somehow more real to the human
brain.
There are currently three key trends that are probably going to shape the future of data visualisation:
Interactivity, Automation, and storytelling (VR).
1) Interactivity
Interactivity has been a key element of online data visualisation for numerous years. But it is now beginning to overtake static visualisation as the predominant manner in which visualisations are presented, particularly in news media. It is increasingly expected that every online map, chart, and graph is interactive as well as animated.
The challenge of interactivity is to provide options accommodating an extensive range of users and their corresponding needs, without overcomplicating the user interface of the data visualisation. There are 7 key types of interactivity, as shown below:
Reconfigure
Choosing features
Encode
Abstract/elaborate
Explore
Connect
Filter
2) Automation
In the past, data visualisation was a tedious and troublesome process. The current challenge is to automate Big Data visualisation so that big-picture trends can be monitored without losing sight of the details of interest.
Best-practice visualisation and design standards are vital, but there should also be a match between the kind of visualisation and the purpose for which it will be used.
3) Storytelling and VR
Storytelling with data is popular, and rightfully so. Data visualisations are empty of meaning without a story, and stories can be enormously enhanced when supplemented with data visualisation.
The future of storytelling might be virtual reality. The human visual system is optimised for seeing and interacting in three dimensions. The full storytelling potential of data visualisation can be explored once it is no longer confined to flat screens.
Some of the best Data Visualisation tools for Data Science are:
1) Tableau
2) QlikView
3) PowerBi
4) QlikSense
5) FusionCharts
6) HighCharts
7) Plotly
But the most important one if you are working with R is ggplot2, and for Python it is Seaborn or Matplotlib.
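As a small illustration on the Python side, the sketch below draws a histogram and a scatter plot with Seaborn (which builds on Matplotlib), using Seaborn's tips example dataset; loading that dataset requires an internet connection the first time.

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small example dataset shipped for Seaborn demos

# Distribution of a single variable.
sns.histplot(data=tips, x="total_bill", bins=20)
plt.show()

# Relationship between two variables, split by a categorical column.
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()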
What is GGPLOT?
Ggplot2 is a data visualisation package for the statistical programming language R, which tries to take
the good parts of base and lattice graphics and none of the bad parts.
It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as
providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
There are several good reasons to explore ggplot2 for your day-to-day plotting.
Data visualisation will change the manner in which our analysts work with data. They will be expected to respond to issues more quickly, and required to dig for more insights - to look at information differently and more creatively.
Activity 1
Find and list more data visualisation tools.
Summary
Data Science is a combination of multiple fields which involves creation, preparation,
transformation, modelling, and visualisation of data.
Data Science pipeline consists of Data Wrangling, Data Cleansing & Extraction, EDA, Statistical
Model Building, and Data Visualisation.
Data Wrangling is a step in which the data needs to be transformed and aggregated into a usable format from which insights can be derived.
Data Cleansing is an important step in which the data needs to be cleansed: replacing missing values, replacing NaNs in the data, and removing outliers, along with standardisation and normalisation.
Data Visualisation is a process of visualising the data so as to derive insights from it at a glance.
It is also used to present results of the data science problem.
Statistical modelling is the core of a Data Science solution. It is the fitting of statistical equations to the data at hand in order to predict a certain value for future observations.
Keywords
Data Science Pipeline: The 7 major stages of solving a Data Science problem.
Data Wrangling: The art of transforming the data into a format through which it is easier to
draw insights from.
Data Cleansing: The process of cleaning the data of missing values, garbage values, NaNs and outliers.
Data Visualisation: The art of building graphs and charts so as to understand data easily and
find insights into it.
Statistical Modelling: The implementation of statistical equations on existing data.
Self-Assessment Questions
1. What is Data Science Pipeline?
2. Why is there a need for Data Wrangling?
3. What are the steps involved in Data Cleansing?
4. What are the basics required to perform statistical modelling?
5. What do you mean by Data Visualisation and where is it used?
Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. Field Cady, The Data Science Handbook.
3. Frank Kane, Hands-On Data Science and Python Machine Learning.
4. Data Science in Practice.