
Unit 4

Data Science Pipeline, EDA & Data Preparation


Structure:

4.1 Introduction to Data Science Pipeline

4.2 Data Wrangling

4.3 Exploratory Data Analysis

4.4 Data Extraction & Cleansing

4.5 Statistical Modelling

4.6 Data Visualisation

Summary

Keywords

Self-Assessment Questions

Answers to Check your Progress

Suggested Reading

Published by Symbiosis Centre for Distance Learning (SCDL), Pune

2019

Copyright © 2019 Symbiosis Open Education Society

All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by
any means, electronic or mechanical, including photocopying, recording or by any information storage
or retrieval system without written permission from the publisher.

Acknowledgement

Every attempt has been made to trace the copyright holders of materials reproduced in this unit.
Should any infringement have occurred, SCDL apologises for the same and will be pleased to make
necessary corrections in future editions of this unit.
Objectives

After going through this unit, you will be able to:

 Understand what is meant by Data Science Pipeline
 Understand the meaning of Data Wrangling and Exploratory Data Analysis
 Understand why cleansing the data is the most important part of Data Science
 Understand the basics of Statistical Modelling
 Know why visualising the data is an integral part of Data Science Work Cycle

4.1 INTRODUCTION TO DATA SCIENCE PIPELINE


A data science pipeline is the overall step-by-step process of obtaining, cleaning, visualising,
modelling, and interpreting data within a business or group.

Data science pipelines are sequences of processing and analysis steps applied to data for a specific
purpose.

They're useful in production projects, and they can also be useful if one expects to encounter the same
type of business question in the future, so as to save on design time and coding.

The stages of the Data Science Pipeline are as follows:

1) Problem Definition

 Contrary to common belief, the hardest part of data science isn’t building an accurate model
or obtaining good, clean data. It is much harder to define feasible problems and come up with
reasonable ways of measuring solutions. Problem definition aims at understanding, in depth,
a given problem at hand.
 Multiple brainstorming sessions are organised to correctly define a problem, because your end
goal depends upon what problem you are trying to solve. Hence, if you go wrong during the
problem definition phase itself, you will end up delivering a solution to a problem which never
existed in the first place.
2) Hypothesis Testing

 Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a
population parameter. The methodology employed by the analyst depends on the nature of
the data used and the reason for the analysis.
 Hypothesis testing is used to infer the result of a hypothesis performed on sample data from
a larger population. In simple words, we form some assumptions during problem definition
phase and then validate those assumptions statistically using data.
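As an illustration only, the following Python sketch tests a hypothetical assumption that a change has not affected mean daily revenue, using a two-sample t-test from SciPy. The data here is synthetic and the 5% significance level is an assumed convention, not a rule.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    # Synthetic samples standing in for daily revenue before and after a change
    before = rng.normal(loc=100, scale=10, size=50)
    after = rng.normal(loc=105, scale=10, size=50)

    # H0: the mean revenue is unchanged; H1: it is different
    t_stat, p_value = stats.ttest_ind(before, after)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    if p_value < 0.05:                 # assumed 5% significance level
        print("Reject H0: the difference is statistically significant")
    else:
        print("Fail to reject H0: no evidence of a difference")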

3) Data Collection and processing

 Data collection is the process of gathering and measuring information on variables of interest,
in an established systematic fashion that enables one to answer stated research questions,
test hypotheses, and evaluate outcomes. Moreover, the data collection component of
research is common to all fields of study including physical and social sciences, humanities,
business, etc.
 While methods vary by discipline, the emphasis on ensuring accurate and honest collection
remains the same.
 Data processing refers to a series of actions or steps performed on data to verify, organise,
transform, integrate, and extract it in an appropriate output form for subsequent use. Methods
of processing must be rigorously documented to ensure the utility and integrity of the data.

4) EDA and Feature Engineering

 Once you have clean and transformed data, the next step for machine learning projects is to
become intimately familiar with the data using exploratory data analysis (EDA).
 EDA is about numeric summaries, plots, aggregations, distributions, densities, reviewing all
the levels of factor variables and applying general statistical methods.
 A clear understanding of the data provides the foundation for model selection, i.e. choosing
the correct machine learning algorithm to solve your problem.
 Feature engineering is the process of determining which predictor variables will contribute
the most to the predictive power of a machine learning algorithm.
 The process of feature engineering is as much of an art as a science. Often feature engineering
is a give-and-take process with exploratory data analysis to provide much-needed intuition
about the data. It’s good to have a domain expert around for this process, but it’s also good
to use your imagination.
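As a minimal, hypothetical sketch of feature engineering in Python with pandas (the column names and data are invented for illustration), new predictors are derived from the raw fields:

    import pandas as pd

    # Hypothetical transaction data
    df = pd.DataFrame({
        "order_date": pd.to_datetime(["2019-01-05", "2019-02-14", "2019-03-20"]),
        "quantity": [2, 5, 1],
        "unit_price": [250.0, 99.0, 1200.0],
    })

    # Derived predictors: total order value and a day-of-week seasonality feature
    df["order_value"] = df["quantity"] * df["unit_price"]
    df["day_of_week"] = df["order_date"].dt.day_name()
    print(df)

Whether such derived features actually add predictive power is exactly the kind of question EDA helps answer.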

5) Modelling and Prediction

 Machine learning can be used to make predictions about the future. You provide a model with
a collection of training instances, fit the model on this data set, and then apply the model to
new instances to make predictions.
 Predictive modelling is useful because you can make products that adapt based on expected
user behaviour. For example, if a viewer consistently watches the same broadcaster on a
streaming service, the application can load that channel on application start-up.
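The fit-then-predict workflow described above might look like the following sketch with scikit-learn; the synthetic dataset and the choice of logistic regression are illustrative assumptions, not a prescription.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic training instances standing in for historical user behaviour
    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression()
    model.fit(X_train, y_train)           # fit the model on the training set
    predictions = model.predict(X_new)    # apply it to new instances
    print(predictions[:10])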

6) Data Visualisation

 Data visualisation is the process of displaying data/information in graphical charts, figures
and bars. It is used as a means to deliver visual reporting to users on the performance,
operations or general statistics of the data and of model predictions.
7) Insight Generation and implementation

 Interpreting the data means communicating your findings to the interested parties. If you
cannot explain your findings to someone, then whatever you have done is of no use.
Hence, this step is very crucial.
 The objective of this step is to first identify the business insight and then correlate it to your
data findings. Secondly, you might need to involve domain experts in correlating the findings
with business problems.
 Domain experts can help you in visualising your findings according to the business dimensions
which will also aid in communicating facts to a non-technical audience.

4.2 DATA WRANGLING


Data wrangling is often said to be 80% of what a data scientist does, and it is where most of the real value is created.

The first step in analytics is gathering data. Then as you begin to analyse and dig deep for answers, it
often becomes necessary to connect to and mashup information from a variety of data sources.

Data can be messy, disorganised, and contain errors. As soon as you start working with it, you will see
the need for enriching or expanding it, adding groupings and calculations. Sometimes it is difficult to
understand what changes have already been implemented.

Moving between data wrangling and analytics tools slows the analytics process—and can introduce
errors. It’s important to find a data wrangling function that lets you easily make adjustments to data
without leaving your analysis.

This is also called Data Munging. It typically follows steps such as extracting the data from different
data sources, sorting the data using a suitable algorithm, decomposing the data into a different
structured format, and finally storing the data in another database.

Some of the steps associated with Data Wrangling are:

1. Load, explore, and analyse your data

2. Drop the unnecessary columns like columns containing IDs, Names, etc.

3. Drop the columns which contain a lot of null or missing values

4. Impute missing values

5. Replace invalid values

6. Remove outliers

7. Log Transform Skewed Variables

8. Transform categorical variables to dummy variables

9. Binning the continuous numeric variables

10. Standardisation and Normalisation

Each of the above-mentioned steps has a special importance with respect to Data Science.
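To make the list above concrete, here is a minimal pandas sketch covering a few of the steps (imputation, outlier removal, log transform, dummy variables, binning and standardisation). The toy dataset and thresholds are assumptions for illustration only.

    import numpy as np
    import pandas as pd

    # Hypothetical raw customer data with the kinds of problems listed above
    df = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "age": [25, np.nan, 40, 120],              # a missing value and an outlier
        "income": [30000, 45000, 52000, 800000],   # heavily skewed
        "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    })

    df = df.drop(columns=["customer_id"])                  # drop ID-like columns
    df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
    df = df[df["age"] <= df["age"].quantile(0.95)]         # crude outlier removal
    df["log_income"] = np.log1p(df["income"])              # log-transform a skewed variable
    df = pd.get_dummies(df, columns=["city"])              # categorical to dummy variables
    df["age_bin"] = pd.cut(df["age"], bins=3)              # bin a continuous variable
    df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()   # standardise
    print(df.head())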

Let us look at an example.


If you want to visualise number of customers of a telecom provider by city, then you need to ensure
that there is only one row per city before data visualisation.

If you have two rows like Bombay and Mumbai representing the same city, this could lead to wrong
results. One of the rows has to be changed by the data analyst; this is typically done by creating a
mapping on the fly in the visualisation tool, which is then applied to every row of data. The data is
then checked for more such issues, and the process is repeated for other cities.
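A hypothetical pandas version of that on-the-fly mapping might look like the sketch below; the city names and counts are invented for illustration.

    import pandas as pd

    customers = pd.DataFrame({
        "city": ["Bombay", "Mumbai", "Pune", "Banglore", "Bangalore"],
        "customers": [120, 80, 95, 40, 60],
    })

    # Map alternate spellings to one canonical name per city
    city_map = {"Bombay": "Mumbai", "Banglore": "Bangalore"}
    customers["city"] = customers["city"].replace(city_map)

    # Ensure there is only one row per city before visualising
    per_city = customers.groupby("city", as_index=False)["customers"].sum()
    print(per_city)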

Need of Data Wrangling

Data wrangling is an important prerequisite for building a statistical model.

Therefore, data is converted into a proper, feasible format before any model is applied to it. By
filtering, grouping and selecting appropriate data, the accuracy and performance of the model
can be increased.

4.3 EXPLORATORY DATA ANALYSIS


Exploratory data analysis is, as the name suggests, a first look at the data you will be working with. Usually
this involves:

1. Cleaning the data - finding junk values and removing them, finding outliers and replacing them
appropriately (with the 95th percentile, for example), etc.
2. Summary statistics - computing the mean, median and, if necessary, mode, along with the
standard deviation and variance of the particular distribution.
3. Univariate analysis - examining each variable on its own, for example with a simple histogram
that shows the frequency of a particular variable's values, or a line chart that shows how a
particular variable changes over time.

The idea is that, after performing Exploratory Data Analysis, you should have a sound understanding
of the data you are about to dive into. Further hypothesis based analysis (post EDA) could involve
statistical testing, bi-variate analysis etc.

Let's understand this with the help of an example.

We all must have seen our mother taking a spoonful of soup to judge whether or not the salt is
appropriate in the soup. The act of tasting the soup to check the salt level and to better understand
the taste of the soup by taking a spoonful is exploratory data analysis. Based on that, our mothers decide
the salt level; this is where they make inferences, and their validity depends on whether or not the
soup is well stirred, that is to say, whether or not the sample represents the whole population.

Let us take another business case example,

Say we have been given some sales data with daily revenue numbers for a big retail chain.

The business problem is: the retail chain wants to improve its revenue.

The question that arises now is: what are the ways in which we can achieve this?

What will you look for? Do you know what to look for? Will you immediately run some code to find the
mean, median, mode and other statistics?

The main objective is to understand the data inside out. The first step in any EDA is asking the right
questions to which we want the answers. If our questions go wrong, the whole EDA goes wrong.

So, start by listing down as many questions as you can on a piece of paper.
What are some of the questions that we can ask? They are:

 How many total stores are there in the retail company?


 Which stores and regions are performing the best and the worst?
 What are the actual sales across each and every store?
 How many stores are selling products below the average?
 How many stores are exclusively selling best profit making products?
 On which days are the sales maximum?
 Do we see seasonal sales across products?
 Are there any abnormal sales numbers?

These are some of the questions that need to be asked before deciding on the next steps.

EDA gives some very interesting insights from the data, such as:

1. Listing the outliers and anomalies in our data


2. Identifying the most important variables
3. Understanding the relationship between variables
4. Checking for any errors such as missing values or incorrect entries
5. Knowing the data types of the dataset - whether continuous/discrete/categorical
6. Understanding how the data is distributed
7. Testing a hypothesis or checking assumptions related to a specific model

Exploratory data analysis (EDA) is very different from classical statistics. It is not about fitting models,
parameter estimation, or testing hypotheses, but is about finding information in data and generating
ideas.

So, this is the background of EDA. Technically, it involves steps like cleaning the data, calculating
summary statistics and then making plots to better understand the data at hand to make meaningful
inferences.
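A first EDA pass in Python might look like the sketch below; the file name and the column names ('store', 'revenue') are assumptions standing in for the retail data discussed above.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical daily store sales with 'store', 'date' and 'revenue' columns
    sales = pd.read_csv("store_sales.csv")

    print(sales.describe())           # mean, std, quartiles for numeric columns
    print(sales.isna().sum())         # missing values per column
    print(sales["store"].nunique())   # how many stores are there in total?

    # Univariate view: distribution of daily revenue
    sales["revenue"].hist(bins=30)
    plt.xlabel("Daily revenue")
    plt.ylabel("Frequency")
    plt.show()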

4.4 DATA EXTRACTION & CLEANSING


Data extraction & cleaning (sometimes also referred to as data cleansing or data scrubbing) is the act
of detecting and either removing or correcting corrupt or inaccurate records from a record set, table,
or database. Used mainly in cleansing databases, the process involves identifying incomplete, incorrect,
inaccurate or irrelevant items of data and then replacing, modifying, or deleting this “dirty”
information.

The next step after data cleaning is data reduction. This includes defining and extracting attributes,
decreasing the dimensions of data, representing the problems to be solved, summarising the data,
and selecting portions of the data for analysis.

There are multiple data cleansing practices in vogue to clean and standardise bad data and make it
effective, usable and relevant to business needs.

Organisations relying heavily on data-driven business strategies need to choose a practice that best
fits in with their operational working. The detailed steps of a standard practice are as follows:

1. Stored Data:

Put together the data collected from all sources and create a data warehouse. Once your data
is stored in one place, it is ready to be put through the cleansing process.

2. Identify errors:

Multiple problems contribute to lowering the quality of data and making it dirty: inaccuracy,
invalid data, incorrect data entry, missing values, spelling errors, incorrect data ranges, and
multiple representations of the same data.

These are some of the common errors which should be taken care of when creating a cleansed
data regime.

3. Remove duplication/redundancy

Multiple employees often work on a single file where they collect and enter data. Most of the
time, they don’t realise they are entering the same data already collected by some other
employee at some other time. Such duplicate data corrupts the results and must be weeded out.

4. Validate the accuracy of data

Effective marketing depends on a high quality of data, and thus validating its accuracy is the
foremost thing organisations aim for. Note, however, that the method of collection is
independent of the cleansing process.

A triple verification of the data will enhance the dataset and build trustworthiness among
marketers and sales professionals in utilising the power of the data.

5. Standardise data format

Now that the data is validated, it is important to put all of it into a standardised and
accessible format. This ensures the entered data is clean, enriched and ready to use.
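A minimal pandas sketch of steps 3 and 5 (standardising formats so that duplicates become detectable, then removing them) is shown below; the contact records are invented for illustration.

    import pandas as pd

    # Hypothetical contact records entered by different employees
    contacts = pd.DataFrame({
        "name":  ["Asha Rao", " asha rao ", "Ravi Kumar"],
        "email": ["asha@example.com", "ASHA@EXAMPLE.COM", "ravi@example.com"],
    })

    # Standardise formats first, so that duplicates become detectable
    contacts["name"] = contacts["name"].str.strip().str.title()
    contacts["email"] = contacts["email"].str.lower()

    # Then remove the duplicated entries
    contacts = contacts.drop_duplicates(subset=["email"])
    print(contacts)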

Some of the other best practices which need to be followed while Data Cleansing are:

 Sort data by different attributes
 For large datasets, cleanse the data stepwise and improve it with each step until you achieve
a good data quality
 For large datasets, break them into smaller chunks; working with less data will increase your
iteration speed
 To handle common cleansing tasks, create a set of utility functions/tools/scripts. These might
include remapping values based on a CSV file or SQL database, regex search-and-replace, or
blanking out all values that don’t match a regex
 If you have issues with data cleanliness, arrange them by estimated frequency and attack
the most common problems first
 Analyse the summary statistics for each column (standard deviation, mean, number of missing
values)
 Keep track of every data cleaning operation, so you can alter or remove operations if
required

But keep in mind that all these are standard practices and they may or may not apply every time
to a given problem. For example, if we have numerical data, we might first want to remove missing
values and NAs.

For textual data, tokenisation, removal of whitespace, punctuation and stopwords, and stemming can
all be possible steps towards cleaning the data for further analysis.
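A toy sketch of such text cleaning in plain Python is shown below; the stopword list is deliberately tiny and stemming is left out, since a real project would typically use a dedicated NLP library for both.

    import string

    # A deliberately tiny stopword list, for illustration only
    STOPWORDS = {"the", "is", "a", "of", "and", "to"}

    def clean_text(text):
        """Lowercase, strip punctuation, tokenise and drop stopwords."""
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        tokens = text.split()                     # whitespace tokenisation
        return [t for t in tokens if t not in STOPWORDS]

    print(clean_text("The quality of the data is key to the analysis!"))
    # ['quality', 'data', 'key', 'analysis']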

Thus Data Cleansing is imperative for model building. If the data is garbage, then the output will also
be garbage, no matter how great the statistical analysis applied to it.

4.5 STATISTICAL MODELLING


In simple terms, statistical modelling is a simplified, mathematically-formalized way to approximate
reality (i.e. what generates your data) and optionally to make predictions from this approximation.
The statistical model is the mathematical equation that is used.

Statistical modelling is, literally, building statistical models. A linear regression is a statistical model.

To do any kind of statistical modelling, it is essential to know the basics of statistics, such as:

 Basic statistics: Mean, Median, Mode, Variance, Standard Deviation, Percentile, etc.
 Probability Distribution: Geometric Distribution, Binomial Distribution, Poisson
distribution, Normal Distribution, etc.
 Population and Sample: understanding the basic concepts, the concept of sampling
 Confidence Interval and Hypothesis Testing: How to Perform Validation Analysis
 Correlation and Regression Analysis: Basic Models for General Data Analysis

Statistical modelling is a step which comes after data cleansing. The most important parts are model
selection, configuration, prediction, evaluation and presentation.

Let us look at each one of these in brief.

1) Model Selection

 One among many machine learning algorithms may be appropriate for a given predictive
modeling problem. The process of selecting one method as the solution is called model
selection.
 This may involve a suite of criteria both from stakeholders in the project and the careful
interpretation of the estimated skill of the methods evaluated for the problem.
 As with model configuration, two classes of statistical methods can be used to interpret the
estimated skill of different models for the purposes of model selection. They are:
o Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the
result given an assumption or expectation about the result (presented using critical
values and p-values).
o Estimation Statistics. Methods that quantify the uncertainty of a result using
confidence intervals.

2) Model Configuration

 A given machine learning algorithm often has a suite of hyperparameters (parameters passed
to the statistical model which can be changed) that allow the learning method to be tailored
to a specific problem.
 The configuration of the hyperparameters is often empirical in nature, rather than analytical,
requiring large suites of experiments in order to evaluate the effect of different
hyperparameter values on the skill of the model.
 Hyperparameters are the ones which can make or break a model. Hyperparameter tuning is a
very common practice in the world of Data Science.
 The two methods by which we can do hyperparameter tuning are:
o Grid Search
o Random Search
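As a hedged illustration with scikit-learn, grid search exhaustively tries every combination of candidate hyperparameter values, while random search samples a fixed number of combinations. The model and parameter grid below are assumptions chosen only to show the mechanics.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # Grid search: exhaustively evaluate every combination of candidate values
    param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)
    # RandomizedSearchCV would instead sample a fixed number of combinations

Random search is often preferred when the hyperparameter space is large, since it covers more distinct values per parameter for the same budget.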

3) Model Evaluation

 A crucial part of a predictive modeling problem is evaluating a learning method.


 This often requires the estimation of the skill of the model when making predictions on data
not seen during the training of the model.
 Generally, the planning of this process of training and evaluating a predictive model is called
experimental design. This is a whole subfield of statistical methods.
 Experimental Design. Methods to design systematic experiments to compare the effect of
independent variables on an outcome, such as the choice of a machine learning algorithm on
prediction accuracy.
 As part of implementing an experimental design, resampling methods are used to make
economical use of the available data when estimating the skill of the model. These also
represent a subfield of statistical methods.
 Resampling Methods. Methods for systematically splitting a dataset into subsets for the
purposes of training and evaluating a predictive model.
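A minimal sketch of both ideas, a hold-out split and k-fold cross-validation, is shown below with scikit-learn on a synthetic dataset; the specific model and split sizes are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=400, n_features=6, random_state=1)

    # Hold-out split: evaluate on data not seen during training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
    model = LogisticRegression().fit(X_train, y_train)
    print("hold-out accuracy:", model.score(X_test, y_test))

    # k-fold cross-validation: systematic resampling to make economical use of the data
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print("cross-validated accuracy:", scores.mean())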

4) Model Presentation

 Once a final model has been trained, it can be presented to stakeholders prior to being used
or deployed to make actual predictions on real data.
 A part of presenting a final model involves presenting the estimated skill of the model.
 Methods from the field of estimation statistics can be used to quantify the uncertainty in the
estimated skill of the machine learning model through the use of tolerance intervals and
confidence intervals.
 Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via
confidence intervals.
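One simple way to attach a confidence interval to the estimated skill is to bootstrap the per-fold scores; the scores below are made-up numbers, and the bootstrap is just one of several possible approaches.

    import numpy as np

    # Hypothetical per-fold accuracy scores from the evaluation step
    scores = np.array([0.81, 0.84, 0.79, 0.86, 0.83])

    # Bootstrap a 95% confidence interval for the model's estimated skill
    rng = np.random.default_rng(0)
    boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
                  for _ in range(1000)]
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    print(f"accuracy = {scores.mean():.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")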

4.6 DATA VISUALISATION


Data Visualisation is the representation of information in the form of charts, diagrams, pictures, etc.,
created as visual representations of the underlying data.
Importance of Data Visualisation:

 Absorb information quickly


 Understand your next steps
 Connect the dots
 Hold your audience longer
 Kick the need for data scientists
 Share your insights with everyone
 Find the outliers
 Memorise the important insights
 Act on your findings quickly

Some key elements of successful data visualisation are as follows:

 It tells a visual story


 It’s easy to understand
 It’s tailored for your target audience
 It’s user friendly
 It’s useful
 It’s honest
 It’s succinct
 It provides context

Data science is useless if you can’t communicate your findings to others, and visualisations are
imperative if you’re speaking to a non-technical audience. If you come into a board room without
presenting any visuals, you’re going to run out of work pretty soon.

More than that, visualisations are very helpful for data scientists themselves. Visual representations
are much more intuitive to grasp than numerical abstractions.

Let us consider an example.

Suppose we have a chart that shows total air passengers over time for a particular airline.
Just by glancing at the chart for two seconds, we immediately recognise a seasonal trend and a long-
term trend. Identifying those patterns by analysing the numbers alone would require decomposing
the signal in several steps.
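As a sketch, the following Python code generates a synthetic series with a long-term trend plus a seasonal component and plots it with Matplotlib, purely to illustrate the kind of pattern being described; the numbers are made up.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Synthetic monthly passenger counts: long-term trend plus yearly seasonality
    months = pd.date_range("2010-01", periods=96, freq="MS")
    trend = np.linspace(100, 300, 96)
    season = 30 * np.sin(2 * np.pi * np.arange(96) / 12)
    passengers = trend + season + np.random.default_rng(0).normal(0, 10, 96)

    plt.plot(months, passengers)
    plt.xlabel("Month")
    plt.ylabel("Air passengers")
    plt.title("Total air passengers over time")
    plt.show()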

Thus you require visualisations at two places:

 You need to understand the data yourself so you need to create visualisations which will
probably never be shared.
 You need to get the data’s story across and visualisation is usually the best way to go.

Visualisations are helpful both in the pre-processing and post-processing stages. They help us understand
our datasets and results in the form of shapes and objects, which are more tangible to the human
brain.

What is the future of data visualisation?

There are currently three key trends that are likely to shape the future of data visualisation:
interactivity, automation, and storytelling (including VR).

1) Interactivity

Interactivity has been a key element of online data visualisation for numerous years. But it is now
beginning to overtake static visualisation as the predominant manner in which visualisations
are presented, particularly in news media. It is increasingly expected that every online map, chart
and graph is interactive as well as animated.

The challenge of interactivity is to offer options that accommodate an extensive range of users and
their corresponding needs, without overcomplicating the UI of the data visualisation. There are 7 key
types of interactivity, as listed below:

 Reconfigure
 Choosing features
 Encode
 Abstract/elaborate
 Explore
 Connect
 Filter

2) Automation

In the past, data visualisation was a tedious and troublesome process. The current challenge is to
automate Big Data visualisation so that it reveals big-picture trends without losing sight of the
details of interest.

Best-practice visualisation and design standards are vital, but there should also be a match between
the kind of visualisation and the purpose for which it will be utilised.

3) Storytelling and VR

Storytelling with data is popular, and rightfully so. Data visualisations are devoid of meaning
without a story, and stories can be enormously enhanced when supplemented with data visualisation.

The future of storytelling might be virtual reality. The human visual system is optimised for
seeing and interacting in three dimensions. The full storytelling potential of data visualisation can
be explored once it is no longer confined to flat screens.
Some of the best Data Visualisation tools for Data Science are:

1) Tableau
2) QlikView
3) Power BI
4) QlikSense
5) FusionCharts
6) HighCharts
7) Plotly

But the most important one if you are working with R is ggplot2, and for Python it is Seaborn or
Matplotlib.

Let us discuss ggplot2 in a little more detail.

What is ggplot2?

ggplot2 is a data visualisation package for the statistical programming language R, which tries to take
the good parts of base and lattice graphics and none of the bad parts.

It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as
providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

The 5 main reasons why you should explore ggplot are as follows:

 It can do both quick-and-dirty and complex plots, so you only need one system
 The default colours and other aesthetics are nicer
 Never again lose an axis title (or get told your pdf can’t be created) due to wrongly specified
outer or inner margins
 You can save plots (or the beginnings of a plot) as objects
 Multivariate exploration is greatly simplified through faceting and colouring
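For readers working in Python rather than R, Seaborn (mentioned above) offers an analogous faceting-and-colouring workflow. The sketch below uses Seaborn's built-in 'tips' example dataset purely for illustration.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # The 'tips' example dataset ships with Seaborn
    tips = sns.load_dataset("tips")

    # Faceting (one panel per day) and colouring (by smoker) in a single call
    sns.relplot(data=tips, x="total_bill", y="tip", hue="smoker", col="day")
    plt.show()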

Data visualisation will change the manner in which our analysts work with data. They will be expected
to respond to issues more quickly and required to dig for more insights: to look at information
differently and more creatively.

Data visualisation will foster that kind of imaginative data analysis.

Check your Progress 1


1. What are the components of Data Science Pipeline?
2. Name some Data Visualisation tools.
3. What are the four steps involved in model building?
4. What is EDA?
5. What is Data Wrangling?

Activity 1
Find and list more data visualisation tools.
Summary
 Data Science is a combination of multiple fields which involves creation, preparation,
transformation, modelling, and visualisation of data.
 Data Science pipeline consists of Data Wrangling, Data Cleansing & Extraction, EDA, Statistical
Model Building, and Data Visualisation.
 Data Wrangling is a step in which the data needs to be transformed and aggregated into a usable
format from which insights can be derived.
 Data Cleansing is an important step in which the data is cleaned, for example by replacing
missing values, replacing NaNs in the data and removing outliers, along with standardisation and
normalisation.
 Data Visualisation is a process of visualising the data so as to derive insights from it at a glance.
It is also used to present results of the data science problem.
 Statistical modelling is the core of the solution to a Data Science problem. It is the fitting of statistical
equations to the data at hand in order to predict a certain value for future observations.

Keywords
 Data Science Pipeline: The 7 major stages of solving a Data Science problem.
 Data Wrangling: The art of transforming the data into a format through which it is easier to
draw insights from.
 Data Cleansing: The process of cleaning the data of missing values, garbage, NaNs and outliers.
 Data Visualisation: The art of building graphs and charts so as to understand data easily and
find insights into it.
 Statistical Modelling: The implementation of statistical equations on existing data.

Self-Assessment Questions
1. What is Data Science Pipeline?
2. Why is there a need for Data Wrangling?
3. What are the steps involved in Data Cleansing?
4. What are the basics required to perform statistical modelling?
5. What do you mean by Data Visualisation and where is it used?

Answers to Check your Progress


Check your Progress 1

1) Components of Data Science Pipeline are:


a. Identifying the problem
b. Hypothesis testing
c. Data collection & data wrangling
d. EDA
e. Statistical Modelling
f. Interpreting and communicating results
g. Data Visualisation and Insight Generation
2) Some Data Visualisation tools are:
a. Tableau
b. Power Bi
c. R & Python
d. QlikView and QlikSense
3) 4 steps involved in model building are:
a. Model selection
b. Model configuration
c. Model evaluation
d. Model presentation
4) EDA is exploratory data analysis, which refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test hypotheses and check
assumptions with the help of summary statistics and graphical representations.
5) Data wrangling is the process of cleaning and unifying messy and complex data sets for easy
access and analysis.

Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. The Data Science Handbook, Book by Field Cady.
3. Hands-On Data Science and Python Machine Learning, Book by Frank Kane.
4. Data Science in Practice.
