Unit 4
Data exploration, or exploratory data analysis (EDA), is the first step of data analysis. It is used to explore and visualize data in order to uncover insights from the start, or to identify areas or patterns worth digging into further. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get to insights faster.
Data Exploration — after data has been prepared, you “explore” the data to see what parts of it
will help reveal the answers you seek. You can also explore various hypotheses.
EDA is a step in the Data Analysis Process, where a number of techniques are used to better
understand the dataset being used.
‘Understanding the dataset’ can refer to a number of things including but not limited to…
Understanding the structure of the data and the distributions of its values
Extracting important variables and leaving behind useless variables
Identifying outliers, missing values, or human error
Understanding the relationship(s), or lack of, between variables
Ultimately, maximizing your insights of a dataset and minimizing potential error that
may occur later in the process
The entire process is conducted by a team of data analysts using visual analysis tools and some
advanced statistical software like R. Data exploration can use a combination of manual
methods and automated tools, such as data visualization, charts, and preliminary reports.
Data Refinement
Data refinement means ensuring the data put into a data analytics platform is relevant,
homogenized and categorized so the users can get meaningful results and pinpoint
discrepancies.
The data refinement process is a key part of establishing a data-driven company and
maintaining good habits.
“Data refinement standardizes, aggregates, categorizes, and analyzes raw data to gain
actionable insights. Most refinement models use statistical modeling to transform heaps of
crude data into something usable.”
Below are the steps involved to understand, clean and prepare your data for building your
predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our
refined model.
1. Variable Identification:
We first have to define the type of every variable (continuous, categorical, etc.) and its role in the dataset (input variable or output variable).
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
category of the variables.
Let’s understand this step more clearly by taking an example.
Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable, so that every variable ends up defined under its appropriate category. A small sketch of this step in code is shown below.
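As a minimal, hypothetical sketch of this step (pandas assumed; the column names and values are illustrative and not taken from the original table), variable identification can be done programmatically:

import pandas as pd

# Hypothetical mini version of the "play cricket" data set (illustrative only)
df = pd.DataFrame({
    "Student_ID": ["S1", "S2", "S3", "S4"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Height_cm": [152, 148, 160, 155],
    "Play_Cricket": ["Yes", "No", "Yes", "No"],
})

target = "Play_Cricket"                                   # output (target) variable
predictors = [c for c in df.columns if c not in (target, "Student_ID")]
print("Predictors:", predictors, "| Target:", target)

print(df.dtypes)                                          # data type of each variable
# Category of each predictor (categorical vs continuous) inferred from its dtype:
categorical = df[predictors].select_dtypes(include="object").columns.tolist()
continuous = df[predictors].select_dtypes(include="number").columns.tolist()
print("Categorical:", categorical, "| Continuous:", continuous)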
2. Univariate Analysis:
At this stage, we explore variables one by one. The method used to perform univariate analysis depends on whether the variable is categorical or continuous. Let's look at the methods and statistical measures for each type individually.
(Note: Univariate analysis is also used to highlight missing and outlier values.)
2.1 For continuous variables: We need to understand the central tendency and spread of the variable. These are measured using statistical metrics (mean, median, range, variance, standard deviation) and visualization methods such as histograms and box plots.
2.2 For categorical variables: We use a frequency table to understand the distribution of each category, which can also be read as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as the visualization, as in the sketch below.
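A small sketch of such a frequency table, assuming pandas and a hypothetical Gender variable:

import pandas as pd

# Hypothetical categorical variable
gender = pd.Series(["Male", "Female", "Male", "Male", "Female"], name="Gender")

freq = pd.DataFrame({
    "Count": gender.value_counts(),
    "Count%": (gender.value_counts(normalize=True) * 100).round(1),
})
print(freq)
# freq["Count"].plot(kind="bar")   # bar-chart visualization (needs matplotlib)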
3. Bi-variate Analysis:
Bi-variate Analysis finds out the relationship between two variables. Here, we look
for association and disassociation between variables at a pre-defined significance level.
We can perform bi-variate analysis for any combination of categorical and continuous
variables. The combination can be: Categorical & Categorical, Categorical & Continuous and
Continuous & Continuous. Different methods are used to tackle these combinations during
analysis process.
Let’s understand the possible combinations:
3.1 Categorical & Categorical: Various methods are there for this. But a Stacked Column
Chart is a good visualization that shows how the frequencies are spread between the two
categorical variables.
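A minimal sketch of this combination, assuming pandas and two hypothetical categorical variables; pd.crosstab builds the two-way frequency table behind such a stacked column chart:

import pandas as pd

# Two hypothetical categorical variables
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Play_Cricket": ["Yes", "No", "No", "Yes", "Yes"],
})

table = pd.crosstab(df["Gender"], df["Play_Cricket"])   # two-way frequency table
print(table)
# table.plot(kind="bar", stacked=True)   # stacked column chart (needs matplotlib)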
3.2 Continuous & Continuous: We can build a scatter plot to see how two continuous variables interact with each other. The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate the strength of that relationship. To find the strength of the relationship, we use correlation.
Correlation varies between -1 and +1:
-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation
Correlation can be derived using the following formula:
Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
Various tools have functions to compute the correlation between variables. For example, in Excel the CORREL() function returns the correlation between two variables, and the sketch below shows the same calculation in Python.
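As an illustrative sketch (NumPy assumed, data hypothetical), the formula above can be computed directly and compared with the built-in function:

import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
cov_xy = np.cov(x, y)[0, 1]                       # sample covariance
r_manual = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

r_builtin = np.corrcoef(x, y)[0, 1]               # built-in equivalent of CORREL()
print(round(r_manual, 4), round(r_builtin, 4))    # the two values agree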
3.3 Categorical & Continuous: While exploring the relationship between a categorical and a continuous variable, we can draw box plots for each level of the categorical variable, as in the sketch below.
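A minimal sketch, assuming pandas and matplotlib and a small hypothetical data set:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: a continuous variable split by a categorical one
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Height": [165, 170, 172, 158, 162, 160],
})

df.boxplot(column="Height", by="Gender")   # one box plot per category level
plt.show()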
4. Missing Values Treatment:
Missing values can distort the conclusions drawn from a data set. Consider a data set recording students' gender and whether they play cricket, summarized before and after missing-value treatment: before treatment, the data suggest that the chance of playing cricket is higher for males than for females; after treating the missing values (based on gender), females show a higher chance of playing cricket than males. Treating missing values correctly is therefore essential. The common methods are listed below.
1. Deletion: We remove observations (list-wise deletion) or specific value pairs (pair-wise deletion) that contain missing data.
o Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-randomly missing values can bias the model output.
2. Mean/Mode/Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to use known relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (for a quantitative attribute) or the mode (for a qualitative attribute) of all known values of that variable. It can be of two types:
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and replace the missing value with it. For example, if the variable "Manpower" has a missing value, we take the average of all non-missing values of "Manpower" (170/6 = 28.33) and replace the missing value with it.
o Similar case Imputation: In this case, we calculate the average of the non-missing values separately for each gender, "Male" (119/4 = 29.75) and "Female" (25/1 = 25), and replace the missing values based on gender: missing "Manpower" values for males are replaced with 29.75 and for females with 25.
3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. We divide the data set into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training data set of the model, while the second set (with missing values) is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and use it to populate the missing values of the test data set. Regression, ANOVA, logistic regression and various other modelling techniques can be used. This approach has two drawbacks:
1. The model-estimated values are usually better behaved than the true values.
2. If the attribute with missing values has no relationship with the other attributes in the data set, the model will not estimate the missing values precisely.
4. KNN Imputation: In this method, the missing values of an attribute are imputed using the given number (k) of instances that are most similar to the instance whose value is missing. The similarity of two instances is determined using a distance function. The method has certain advantages and disadvantages.
o Advantages:
k-nearest neighbours can predict both qualitative and quantitative attributes
A separate predictive model for each attribute with missing data is not required
Attributes with multiple missing values can be easily treated
The correlation structure of the data is taken into consideration
o Disadvantages:
The KNN algorithm is very time-consuming on large databases, since it searches through the entire dataset for the most similar instances.
The choice of the k value is critical: a higher k may include instances that are significantly different from the one being imputed, whereas a lower k may miss significant instances.
(A small code sketch of mean, similar-case and KNN imputation follows this list.)
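The following is a minimal sketch of generalized, similar-case and KNN imputation, assuming pandas and scikit-learn; the "Gender"/"Manpower" values are illustrative and do not reproduce the original table:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data echoing the "Manpower" example (values are illustrative)
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Male", "Female", "Female"],
    "Manpower": [30, 28, 25, 32, 29, np.nan, np.nan],
})

# Generalized imputation: replace missing values with the overall mean
df["Manpower_general"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: replace missing values with the mean for the same gender
df["Manpower_by_gender"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)

# KNN imputation: fill each missing value from the k most similar rows
# (numeric features only, so Gender is one-hot encoded first)
features = pd.get_dummies(df[["Gender", "Manpower"]], columns=["Gender"])
imputed = KNNImputer(n_neighbors=2).fit_transform(features)
df["Manpower_knn"] = imputed[:, features.columns.get_loc("Manpower")]
print(df)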
After dealing with missing values, the next task is to deal with outliers. We often tend to neglect outliers while building models; this is a bad practice, because outliers skew the data and reduce accuracy. Let's learn more about outlier treatment.
5. Outlier Treatment:
Transforming and binning values: Transforming a variable can reduce the influence of outliers; for example, taking the natural log of a value reduces the variation caused by extreme values, and so does binning the variable. We can also use the process of assigning weights to different observations.
Imputing: As with missing values, we can also impute outliers, using mean, median or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial; if it is artificial, we can impute it. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining observations as two different groups, build an individual model for each group, and then combine the outputs.
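As an illustrative sketch of spotting and capping outliers (pandas assumed, data hypothetical), using the common 1.5 × IQR rule, which is one convention among several rather than a method prescribed above:

import pandas as pd

# Hypothetical continuous variable with one extreme value
s = pd.Series([12, 14, 15, 13, 16, 14, 95])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # whisker limits

print("Outliers:", s[(s < lower) | (s > upper)].tolist())   # [95]
print("Capped:", s.clip(lower, upper).tolist())             # extreme value capped

Whether to cap, impute, or model the outliers separately still depends on whether they are natural or artificial, as discussed above.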
6. Feature Engineering:
Feature engineering is the science (and art) of extracting more information from existing data.
You are not adding any new data here, but you are actually making the data you already have
more useful.
During this phase we try to infer better variables/predictors out of the existing variables.
For example, let's say you are trying to predict footfall in a shopping mall based on dates. If you use the dates directly, you may not be able to extract meaningful insights from the data, because footfall is affected less by the day of the month than by the day of the week. So we can create new variables from the date, such as weekday/weekend, Monday/Tuesday, and so on. This information about the day of the week is implicit in your data; you need to bring it out to make your model better.
This exercise of bringing out information from data is known as feature engineering.
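A minimal sketch of the date example, assuming pandas and hypothetical footfall figures:

import pandas as pd

# Hypothetical footfall data recorded against dates
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-08"]),
    "footfall": [1200, 2100, 900],
})

# Derive new, more informative variables from the raw date
df["day_of_week"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5    # Saturday = 5, Sunday = 6
print(df)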
Data Summarization
Data Summarization in Data Mining is a key concept from which a concise description of a
dataset can be obtained to see what looks normal or out of place. A carefully chosen summary
of raw data would convey many trends and patterns of the data in an easily accessible manner.
The term 'data mining' refers to exactly this, i.e., extracting meaningful information from the raw data, and Data Summarization in Data Mining aims at presenting the extracted information and trends in a tabular or graphical format.
Data summaries usually present the dataset’s average (mean, median, and/or mode); standard
deviation from mean or interquartile range; how the data is distributed across the range of data
(for example is it skewed to one side of the range); and statistical dependence (if more than one
variable was captured in the dataset). Data summaries may be presented in numerical text
and/or in tables, graphs, or diagrams.
In general, data can be summarized numerically in the form of a table known as tabular
summarization or visually in the form of a graph known as data visualization.
The different types of Data Summarization in Data Mining are:
Tabular Summarization: This method instantly conveys patterns such as frequency distribution, cumulative frequency, etc., and
Data Visualization: Visualizations in a chosen graph style, such as a histogram, time-series line graph, or column/bar graph, can help to spot trends immediately in a visually appealing way.
There are three areas in which you can implement Data Summarization in Data Mining. These
are as follows:
Data Summarization in Data Mining: Centrality
Data Summarization in Data Mining: Dispersion
Data Summarization in Data Mining: Distribution of a Sample of Data
1) Data Summarization in Data Mining: Centrality
The principle of Centrality is used to describe the center or middle value of the data.
Several measures can be used to show centrality; the common ones are the average (also called the mean), the median, and the mode. Together they summarize the distribution of the sample data.
Mean: This is used to calculate the numerical average of the set of values.
The arithmetic mean is calculated by adding together the values in the sample.
The sum is then divided by the number of items in the sample.
Median: This identifies the value in the middle of all the values in the dataset when
values are ranked in order. For Example:
If you have an odd number of values in your sample:
4 6 7 8 9
the median is simply the middle value, i.e. 7 in this case.
When you have an even number of values:
2 3 4 7 8 9
the middle falls between two items, so you use the value mid-way between the two middle items. In this case that is mid-way between 4 and 7, which gives 5.5.
The most appropriate measure to use will depend largely on the shape of the dataset.
2) Data Summarization in Data Mining: Dispersion
Dispersion describes how spread out the values of the sample are around the centre. Common measures include the standard deviation, the variance, and the range.
Variance: This is closely related to the standard deviation; it measures how tightly or loosely the values are spread around the average.
Variance = s², i.e. the square of the standard deviation s.
Range: The range is the difference between the largest and the smallest values, thereby showing the distance between the extremes.
3) Data Summarization in Data Mining: Distribution of a Sample of Data
The distribution shows how the values of a sample are spread across size classes (bins), typically visualized with a tally plot or a histogram. In the tally plot of the first (normally distributed) sample, the first bin, labelled 18, contains values up to 18; there are two such values in that dataset (17 and 16). The next bin is 21 and therefore contains items that are greater than 18 but not greater than 21 (there are three: 21, 19 and 21).
The following dataset is not normally distributed:
21 36 18 17 16 22 20 19 20 22 25 19 17 21 19 21 31 22 19 19 16 23 21 16 30
Note that the same bins were used for the second dataset. The range for both
samples was 16-36. The data in the second sample are clearly not normally
distributed. The tallest size class is not in the middle and there is a long “tail”
towards the higher values. For these data the median and inter-quartile range would
be appropriate summary statistics.
Histograms: A histogram is like a bar chart. The bars represent the frequency of values
in the data sample that correspond to various size classes (bins). Generally the bars are
drawn without gaps between them to highlight the fact that the x-axis represents a
continuous variable.
There is little difference between a tally plot and a histogram but the latter can be
produced easily using a computer (you can sketch one in a notebook too).
To make a histogram you follow the same general procedure as for a tally plot but with
subtle differences:
o Determine the size classes.
o Work out the frequency for each size class.
o Draw a bar chart using the size classes as the x-axis and the frequencies on the
y-axis.
You can draw a histogram by hand or use your spreadsheet. The following histograms
were drawn using the same data as for the tally plots in the preceding section. The fig1
histogram shows normally distributed data. Fig2 histogram shows a non-parametric
distribution.
[Fig. 1: histogram of the normally distributed sample. Fig. 2: histogram of the non-parametric (skewed) sample.]
In both these examples the bars are shown with a small gap; more properly, the bars should be touching. The x-axis shows the size classes as a range under each bar. You can also show the maximum value for each size class. Ideally the histogram should have its labels at the divisions between size classes.
Skewness: This is a measure of how asymmetric the distribution of the sample is, i.e. how far the data lean towards one side of the range. In practice you'll use a computer to calculate skewness; Excel has a SKEW function that will compute it for you.
A positive value indicates that there is a long "tail" of higher values, with the bulk of the data towards the lower end of the range; a negative value indicates the opposite. The larger the absolute value, the more skewed the sample is.
Kurtosis: This is a measure of how pointed the distribution is; it shows how clustered the values are around the middle.
The formula to calculate kurtosis uses the number of items in the sample (the
replication, n) and the standard deviation, s.
In practice you’ll use a computer to calculate kurtosis; Excel has a KURT function that
will compute it for you.
A positive result indicates a pointed distribution, which will probably also have a low
dispersion. A negative result indicates a flat distribution, which will probably have high
dispersion. The higher the value the more extreme the pointedness or flatness of the
distribution.
Determining the shape of the distribution of your data goes a long way in helping you decide
which statistical option to choose from when performing data summarization and subsequent
analysis through data mining.
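As an illustrative sketch (pandas assumed), the summary measures discussed above can be computed for the skewed sample listed earlier:

import pandas as pd

# The skewed (second) sample listed above
data = pd.Series([21, 36, 18, 17, 16, 22, 20, 19, 20, 22, 25, 19, 17,
                  21, 19, 21, 31, 22, 19, 19, 16, 23, 21, 16, 30])

print("mean    :", data.mean())
print("median  :", data.median())
print("variance:", data.var())                  # square of the standard deviation
print("range   :", data.max() - data.min())
print("skewness:", round(data.skew(), 3))       # comparable to Excel's SKEW()
print("kurtosis:", round(data.kurt(), 3))       # comparable to Excel's KURT()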
Correlation Analysis
Correlation analysis is a statistical technique that allows you to determine whether there is a
relationship between two separate variables and how strong that relationship may be. Simply
put - correlation analysis calculates the level of change in one variable due to the change in
the other.
This type of analysis is only appropriate if the data is quantified and represented by a number.
It can’t be used for categorical data, such as gender, brands purchased, or colour.
The analysis produces a single number between +1 and −1 that describes the degree of relationship between the two variables. If the result is positive, the two variables are positively correlated, i.e. when one is high, the other tends to be high too. If the result is negative, the two variables are negatively correlated, i.e. when one is high, the other tends to be low.
A high correlation (in absolute value) points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.
When it comes to market research, researchers use correlation analysis to analyse quantitative
data collected through research methods like surveys and live polls. They try to identify the
relationship, patterns, significant connections, and trends between two variables or datasets.
Types of correlation
Correlation between two variables can be either a positive correlation, a negative correlation,
or no correlation. Let's look at examples of each of these three types.
Positive correlation: A positive correlation between two variables means both the
variables move in the same direction. An increase in one variable leads to an increase
in the other variable and vice versa.
For example, spending more time on a treadmill burns more calories.
Negative correlation: A negative correlation between two variables means that the
variables move in opposite directions. An increase in one variable leads to a decrease
in the other variable and vice versa.
For example, increasing the speed of a vehicle decreases the time you take to reach your
destination.
Weak/Zero correlation: No correlation exists when one variable does not affect the
other.
For example, there is no correlation between the number of years of school a person
has attended and the letters in his/her name.
However, if the relationship between the data is not linear, Pearson's coefficient will not accurately represent the relationship between the two variables, and Spearman's Rank must be used instead.
Pearson's coefficient requires the relevant data to be entered into a table similar to that used for Spearman's Rank but without the ranks, and the result produced is in the numerical form that all correlation coefficients produce, including Spearman's Rank and Pearson's coefficient: -1 ≤ r ≤ +1.
When to use: When no assumptions about the probability distribution may be made. Typically
applied to qualitative data, but can be applied to quantitative data if Spearman’s Rank is
insufficient.
Interpreting Results
Typically, the best way to gain a generalised but more immediate interpretation of the results
of a set of data, is to visualise it on a scatter graph such as these:
Positive Correlation
Any score from +0.5 to +1 indicates a strong positive correlation, which means that both variables increase together. The line of best fit, or trend line, is placed to best represent the data on the graph; in this case it follows the data points upwards, indicating the positive correlation.
Negative Correlation
Any score from -0.5 to -1 indicates a strong negative correlation, which means that as one variable increases, the other decreases proportionally. The line of best fit here indicates the negative correlation; in these cases it slopes downwards.
No Correlation
Very simply, a score of 0 indicates that there is no correlation, or relationship, between the two variables. Whichever formula is used, the larger the sample size the more accurate the result: the more data that is put into the formula, the more accurate the end result will be.
Outliers or anomalies must be accounted for in both correlation coefficients. Using a scatter
graph is the easiest way of identifying any anomalies that may have occurred, and running the
correlation analysis twice (with and without anomalies) is a great way to assess the strength of
the influence of the anomalies on the analysis. If anomalies are present, Spearman’s Rank
coefficient may be used instead of Pearson’s Coefficient, as this formula is extremely robust
against anomalies due to the ranking system used.
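A minimal sketch of this comparison, assuming SciPy and a hypothetical data set whose last point is an anomaly:

from scipy.stats import pearsonr, spearmanr

# Hypothetical paired data whose last point is an anomaly
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 7, 9, 10, 12, 60]

print("Pearson :", round(pearsonr(x, y)[0], 3))      # pulled around by the anomaly
print("Spearman:", round(spearmanr(x, y)[0], 3))     # rank-based, far more robust

# Re-running the analysis without the anomaly shows its influence
print("Pearson without anomaly:", round(pearsonr(x[:-1], y[:-1])[0], 3))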
Dimensionality reduction
(Reduce the size of your dataset while keeping as much of the variation as possible)
In both Statistics and Machine Learning, the number of attributes, features or input variables
of a dataset is referred to as its dimensionality. For example, let’s take a very simple dataset
containing 2 attributes called Height and Weight. This is a 2-dimensional dataset and any
observation of this dataset can be plotted in a 2D plot.
If we add another dimension called Age to the same dataset, it becomes a 3-dimensional dataset
and any observation lies in the 3-dimensional space.
Likewise, real-world datasets have many attributes. The observations of those datasets lie in
high-dimensional space which is hard to imagine. The following is a general geometric
interpretation of a dataset related to dimensionality considered by data scientists, statisticians
and machine learning engineers.
In a tabular dataset containing rows and columns, the columns represent the dimensions of the
n-dimensional feature space and the rows are the data points lying in that space.
Dimensionality reduction is the process of reducing the number of variables/ attributes in
high-dimensional data while keeping as much of the variability (information) in the original
data as possible. It either finds a new, smaller set of derived variables or keeps only the most important of the original variables; in either case, the result has fewer variables than the original. We should consider a good trade-off between the number of variables kept and the loss of variability from the original dataset.
It is a data preprocessing step meaning that we perform dimensionality reduction before
training the model.
In addition, high-dimensional data makes models more complex and expensive to train, and it can lead to overfitting, where the model fits the training data too closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the complexity of
the model and improving its generalization performance. There are two main approaches to
dimensionality reduction: feature selection and feature extraction.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
Feature selection: the process of selecting a subset of features relevant to the problem at hand for use in model construction, in other words, selecting the most important features.
In normal circumstances, domain knowledge plays an important role and we could select the features we feel would be the most important. For example, in predicting home prices, the number of bedrooms and the square footage are often considered important.
It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
Feature extraction: Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space, i.e. it reduces data in a high-dimensional space to a space with fewer dimensions. There are several methods for feature extraction, including principal component analysis (PCA) and linear discriminant analysis (LDA).
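A minimal sketch of both components using scikit-learn's bundled iris data (illustrative only; not part of the original material):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 4-dimensional feature space

# Feature selection (filter approach): keep the 2 most relevant original features
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: PCA builds 2 new features that retain most of the variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X.shape, "->", X_selected.shape, "and", X_pca.shape)
print("variance kept by PCA:", round(pca.explained_variance_ratio_.sum(), 3))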
Data Binning
Statistical data binning is a data preprocessing technique used in statistical analysis to group continuous values into a smaller number of bins. This technique is useful for exploring the distribution of a variable and identifying patterns or trends in the data.
It can also be used in multivariate statistics, binning in several dimensions simultaneously. For
example, if you have data about a group of people, you might want to arrange their ages into a
smaller number of age intervals, such as grouping every five years together.
1. Equal Frequency Binning: Bins have an equal frequency. This method involves dividing
a continuous variable into a specified number of bins, each containing an equal number of
observations. This method is useful for data with a large number of observations or when the
data is skewed.
For example, equal frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Each bin contains an equal number of elements.
2. Equal Width Binning: This method involves dividing a continuous variable into a specified number of bins of equal width. This method is useful for data with a normal distribution. If there are n bins, each bin has equal width w = (max − min) / n, and the bin boundaries are min + w, min + 2w, …, min + nw.
For example, equal width:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
If we want to bin this into 3 intervals, the width of each bin is
w = (max − min) / (number of bins) = (215 − 5) / 3 = 210 / 3 = 70,
so the first bin ends at min + w = 5 + 70 = 75, the second at min + 2w = 5 + 2×70 = 145, and the third at min + 3w = 5 + 3×70 = 215.
This gives the 3 bin ranges [5, 75], (75, 145] and (145, 215].
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
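Both binning schemes can be reproduced with pandas (a sketch, using the input list above; pd.qcut gives equal-frequency bins and pd.cut gives equal-width bins):

import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-frequency binning: 3 bins with the same number of items in each
equal_freq = pd.qcut(values, q=3)
print(values.groupby(equal_freq, observed=True).apply(list))

# Equal-width binning: 3 bins, each (215 - 5) / 3 = 70 units wide
equal_width = pd.cut(values, bins=3)
print(values.groupby(equal_width, observed=True).apply(list))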
Supervised Binning:
Supervised binning methods transform numerical variables into categorical counterparts
and refer to the target (class) information when selecting discretization cut points. Entropy-
based binning is an example of a supervised binning method.
Entropy-based Binning
The entropy-based method uses a split approach. The entropy (or information content) is calculated based on the class label. Intuitively, it finds the best split so that the bins are as pure as possible, that is, so that the majority of the values in a bin have the same class label. Formally, it is characterized by finding the split with the maximal information gain.
Example:
Discretize the temperature variable using the entropy-based binning algorithm on a sample data set of 24 records, 7 of which are failures:
Step 1: Calculate "Entropy" for the target.
E(Failure) = E(7, 17) = E(0.29, 0.71) = −0.29 × log2(0.29) − 0.71 × log2(0.71) = 0.871
The information gains for all three bins show that the best interval for
"Temperature" is (<=60, >60) because it returns the highest gain.