Unit 4

Data Exploration

Data exploration, or Exploratory Data Analysis (EDA), is the first step of data analysis. It is used to explore and visualize data to uncover insights from the start, or to identify areas or patterns worth digging into further. Using interactive dashboards and point-and-click data exploration, users can
better understand the bigger picture and get to insights faster.
Data Exploration — after data has been prepared, you “explore” the data to see what parts of it
will help reveal the answers you seek. You can also explore various hypotheses.

EDA is a step in the Data Analysis Process, where a number of techniques are used to better
understand the dataset being used.
‘Understanding the dataset’ can refer to a number of things including but not limited to…
 Understanding the structure of data, values distributions
 Extracting important variables and leaving behind useless variables
 Identifying outliers, missing values, or human error
 Understanding the relationship(s), or lack of, between variables
 Ultimately, maximizing your insights of a dataset and minimizing potential error that
may occur later in the process

The entire process is conducted by a team of data analysts using visual analysis tools and some
advanced statistical software like R. Data exploration can use a combination of manual
methods and automated tools, such as data visualization, charts, and preliminary reports.

Why is Data Exploration Important?


Have you heard of the phrase "garbage in, garbage out"?
With EDA, it's more like "garbage in, perform EDA, possibly garbage out."
By conducting EDA, you can turn an almost usable dataset into a completely usable one. EDA cannot magically make any dataset clean, but many EDA techniques can remedy common problems that are present in most datasets.
Exploratory Data Analysis does two main things:
1. It helps clean up a dataset.
2. It gives you a better understanding of the variables and the relationships between them.
Some other benefits of EDA are:
 Starting with data exploration helps users make better decisions about where to dig deeper into the data, and builds a broad understanding of the business before more detailed questions are asked later.
 With a user-friendly interface, anyone across an organization can familiarize
themselves with the data, discover patterns, and generate thoughtful questions that may
spur on deeper, valuable analysis.
 Data exploration and visual analytics tools build understanding, empowering users to
explore data in any visualization.
 This approach speeds up time to answers and deepens users’ understanding by
covering more ground in less time.
 Data exploration is important because it democratizes access to data and provides governed self-service analytics.
 Furthermore, businesses can accelerate data exploration by provisioning and delivering
data through visual data marts that are easy to explore and use.

Main Use Cases for Data Exploration


 Helps businesses explore large amounts of data quickly to better understand next steps in terms of further analysis.
 Gives businesses a more manageable starting point and a way to target areas of interest.
 In most cases, data exploration involves using data visualizations to examine the data
at a high level. By taking this high-level approach, businesses can determine which data
is most important and which may distort the analysis and therefore should be removed.
 Data exploration can also be helpful in decreasing time spent on less valuable analysis
by selecting the right path forward from the start.
 Can be used to answer questions, test business assumptions, and generate hypotheses
for further analysis.

Data Refinement
Data refinement means ensuring the data put into a data analytics platform is relevant,
homogenized and categorized so the users can get meaningful results and pinpoint
discrepancies.
The data refinement process is a key part of establishing a data-driven company and
maintaining good habits.
“Data refinement standardizes, aggregates, categorizes, and analyzes raw data to gain
actionable insights. Most refinement models use statistical modeling to transform heaps of
crude data into something usable.”

Data refinement: before and after


In its simplest form, data refinement is the process of making a dataset more legible. This before-and-after image of a sample dataset should give you a clear idea of the importance of data refining.
One of the major aspects of quality data is its consistency. One of the best industry practices to
refine the data is data normalization to ensure that corresponding data fields share similar
characteristics in a standard format throughout the dataset.
Data normalization has extensive applications in retail and E-commerce, ranging all the way
from augmenting customer experience to enhancing brand image.

Why is data refinement important for your business?


The importance of refined data makes immediate sense when you comprehend the
consequences bad data can have on your business. Every time a bad batch of data escapes your
quality control, it comes back to haunt you with interest later.
Data refinement helps you:
1. Save time
Data refinement frees data analysts to actually do data analysis. More often than not, refinement demands human intervention; analysts are estimated to spend 50 to 80 percent of their time refining data, so a disciplined refinement process saves significant time.
2. Make better decisions
Whether you want to monitor prices on an E-commerce platform or thoroughly analyze real
estate listings, data refinement endows you with actionable data to make informed decisions.
3. Gain deeper intelligence
The process of collecting actionable insights from any given dataset is much like extracting
iron from its ore. When the data is refined, it brings hidden information into sight and even
uncovers patterns previously buried in a deluge of illegible raw data.

4. Get actionable insights


When your data has been put through data refinement you can rest assured that it is accurate
and actionable. What only remains to be seen is if there’s more you can do with it. Or, if you
can collect more data and vet it the same way to uncover more ways of solving the problems
plaguing your business.

Data Visualization – From Unit 2

Steps for Data Exploration or Exploratory Data Analysis (EDA)


Remember, the quality of your inputs decides the quality of your output. So, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. Data exploration, cleaning and preparation can take up to 70% of the total project time.

Below are the steps involved to understand, clean and prepare your data for building your
predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our
refined model.

1. Variable Identification:
We first have to define the type of every variable (continuous, categorical, etc.) and its role in the dataset (input variable or output variable).
First, identify the Predictor (input) and Target (output) variables. Next, identify the data type and category of the variables.
Let’s understand this step more clearly by taking an example.

Example: Suppose we want to predict whether students will play cricket or not (refer to the dataset below). Here you need to identify the predictor variables, the target variable, the data type of the variables and the category of the variables. Below, the variables have been defined under these different categories:
2. Univariate Analysis:
At this stage, we explore variables one by one. The method used to perform univariate analysis depends on whether the variable is categorical or continuous. Let's look at these methods and statistical measures for categorical and continuous variables individually:

2.1 For continuous variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods. For example, we can build histograms and box plots for each continuous variable independently, as shown below:

(Note: Univariate analysis is also used to highlight missing and outlier values.)
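As an illustration, here is a minimal sketch in Python (assuming pandas and matplotlib are available; the small "Age" column is made up purely for demonstration) of how a histogram and box plot can be produced for a continuous variable:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical continuous variable (a stand-in for any real column)
df = pd.DataFrame({"Age": [23, 25, 27, 29, 31, 31, 34, 36, 41, 55]})

# Central tendency and spread: count, mean, std, min, quartiles, max
print(df["Age"].describe())

# Histogram and box plot for the same variable
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(df["Age"], bins=5)            # shape of the distribution
axes[0].set_title("Histogram of Age")
axes[1].boxplot(df["Age"], vert=False)     # median, IQR, potential outliers
axes[1].set_title("Box plot of Age")
plt.tight_layout()
plt.show()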

2.2 For categorical variables: For categorical variables, we'll use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as the visualization.
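A minimal sketch of the Count / Count% frequency table in Python (pandas; the "Gender" column is a made-up example):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical variable
df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F", "M", "F", "M"]})

# Frequency table: Count and Count% for each category
counts = df["Gender"].value_counts()
percents = df["Gender"].value_counts(normalize=True) * 100
summary = pd.DataFrame({"Count": counts, "Count%": percents.round(1)})
print(summary)

# Bar chart of the category frequencies
summary["Count"].plot(kind="bar", title="Frequency of each category")
plt.show()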

3. Bi-variate Analysis:
Bi-variate analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level.
We can perform bi-variate analysis for any combination of categorical and continuous variables. The combinations can be: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.
Let's understand the possible combinations:
Let’s understand the possible combinations:
3.1 Categorical & Categorical: Various methods exist for this, but a Stacked Column Chart is a good visualization that shows how the frequencies are spread between the two categorical variables.
3.2 Continuous & Continuous: We can build a scatter plot in order to see how two continuous variables interact with each other. The pattern of the scatter plot indicates the relationship between the variables. The relationship can be linear or non-linear.

A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use Correlation. Correlation varies between -1 and +1.
 -1: perfect negative linear correlation
 +1: perfect positive linear correlation
 0: no correlation
Correlation can be derived using following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
Various tools have function or functionality to identify correlation between variables. For
example: in Excel, function CORREL() is used to return the correlation between two variables
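The same calculation can be sketched in Python (pandas/NumPy), showing that the covariance formula above agrees with the built-in Pearson correlation (the equivalent of Excel's CORREL); the two series are made-up values:

import numpy as np
import pandas as pd

# Two hypothetical continuous variables
x = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
y = pd.Series([1.5, 3.9, 6.1, 7.8, 10.2])

# Correlation = Covariance(X, Y) / SQRT( Var(X) * Var(Y) )
manual_r = x.cov(y) / np.sqrt(x.var() * y.var())

# Built-in Pearson correlation (what CORREL() returns in Excel)
builtin_r = x.corr(y)

print(round(manual_r, 4), round(builtin_r, 4))   # both print the same value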

3.3 Categorical & Continuous: While exploring relation between categorical and continuous
variables, we can draw box plots for each level of categorical variables.

4. Detecting / Treating Missing Values / Observations


This phase is more of an art than a systematic approach, and it usually depends on the problem at hand.
4.1 Why is missing values treatment required?
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model because we have not analysed the behaviour of, and relationships with, other variables correctly. It can lead to wrong predictions or classifications.

Notice the missing values in the image shown above: in the left scenario, we have not treated the missing values. The inference from this data set is that the chances of playing cricket are higher for males than for females. On the other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender), we can see that females have a higher chance of playing cricket compared to males.

4.2 Why does my data have missing values?


We looked at the importance of treating missing values in a dataset. Now, let's identify the reasons these missing values occur. They may occur at two stages:
1. Data Extraction: It is possible that there are problems with extraction process. In such
cases, we should double-check for correct data with data guardians. Some hashing
procedures can also be used to make sure data extraction is correct. Errors at data
extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at time of data collection and are harder to correct.
They can be categorized in four types:
o Missing completely at random: This is a case when the probability of a missing value is the same for all observations. For example: respondents of a data collection process decide whether they will declare their earnings by tossing a fair coin; if a head occurs, the respondent declares his/her earnings, and vice versa. Here each observation has an equal chance of having a missing value.
o Missing at random: This is a case when the variable is missing at random and the missing ratio varies for different values/levels of the other input variables. For example: when collecting data on age, females may have a higher rate of missing values than males.
o Missing that depends on unobserved predictors: This is a case when the
missing values are not random and are related to the unobserved input variable.
For example: In a medical study, if a particular diagnostic causes discomfort,
then there is higher chance of drop out from the study. This missing value is not
at random unless we have included “discomfort” as an input variable for all
patients.
o Missing that depends on the missing value itself: This is a case when the
probability of missing value is directly correlated with missing value itself. For
example: People with higher or lower income are likely to provide non-response
to their earning.

4.3 Methods to Treat Missing Values


1. Deletion: It is of two types: listwise deletion and pairwise deletion.
o In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
o In pairwise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for the analysis. One disadvantage is that it uses different sample sizes for different variables.

o Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
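A minimal sketch of both deletion strategies in Python (pandas; the data frame and its missing values are made up):

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "Age":      [23, 25, np.nan, 31, 40],
    "Income":   [40, np.nan, 52, 61, 75],
    "Manpower": [10, 12, 14, np.nan, 20],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()
print(len(listwise), "rows remain after listwise deletion")

# Pairwise deletion: each statistic uses all rows where the relevant
# variables are present, so different variables use different sample sizes
print(df.corr())     # pandas computes pairwise correlations by default
print(df.count())    # per-variable sample sizes differ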

2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing values
with estimated ones. The objective is to employ known relationships that can be
identified in the valid values of the data set to assist in estimating the missing values.
Mean / Mode / Median imputation is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that
variable. It can be of two types:
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of that variable and then replace the missing values with it. In the table above, the variable "Manpower" has a missing value, so we take the average of all non-missing values of "Manpower" (170/6 = 28.33) and then replace the missing value with it.
o Similar case Imputation: In this case, we calculate the average of the non-missing values for gender "Male" (119/4 = 29.75) and "Female" (25/1 = 25) individually and then replace the missing value based on gender. For "Male" we replace missing values of Manpower with 29.75 and for "Female" with 25.
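A small sketch in Python (pandas) of both imputation variants; the data frame below is a hypothetical reconstruction that reproduces the figures quoted above (six non-missing Manpower values summing to 170, four "Male" values summing to 119, one "Female" value of 25):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Male", "Male", "Female", "Female", np.nan],
    "Manpower": [28,      29,     30,     32,     25,       np.nan,   26],
})

# Generalized imputation: fill with the overall mean (170 / 6 = 28.33)
generalized = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: fill with the gender-wise mean
# (Male: 119 / 4 = 29.75, Female: 25 / 1 = 25)
similar_case = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)

print(generalized.round(2).tolist())
print(similar_case.round(2).tolist())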

3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. In this case, we divide our data set into two sets: one with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second data set (with missing values) becomes the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modelling techniques to do this. There are two drawbacks to this approach:
1. The model-estimated values are usually more well-behaved than the true values.
2. If there are no relationships between the other attributes in the data set and the attribute with missing values, the model will not be precise for estimating missing values.
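A minimal sketch of the idea in Python (scikit-learn); the columns and values are hypothetical, and linear regression stands in for whichever model is chosen:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "Income" has missing values, "Age" and "Experience" do not
df = pd.DataFrame({
    "Age":        [23, 27, 31, 35, 40, 45, 50],
    "Experience": [1,   4,  7, 10, 15, 20, 25],
    "Income":     [30, 38, np.nan, 55, 64, np.nan, 82],
})

train = df[df["Income"].notna()]    # rows where the target variable is present
test  = df[df["Income"].isna()]     # rows whose Income must be estimated

# Fit the model on complete rows, then predict the missing values
model = LinearRegression().fit(train[["Age", "Experience"]], train["Income"])
df.loc[df["Income"].isna(), "Income"] = model.predict(test[["Age", "Experience"]])
print(df)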

4. KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using the given number of observations (neighbours) that are most similar to the observation whose values are missing. The similarity of two observations is determined using a distance function. The method has certain advantages and disadvantages.
o Advantages:
 k-nearest neighbours can predict both qualitative and quantitative attributes
 Creating a predictive model for each attribute with missing data is not required
 Attributes with multiple missing values can be easily treated
 The correlation structure of the data is taken into consideration
o Disadvantages:
 The KNN algorithm is very time-consuming on large databases: it searches through the entire dataset looking for the most similar instances.
 The choice of the k value is critical. A higher value of k would include neighbours that are significantly different from the observation being filled, whereas a lower value of k implies missing out on significant neighbours.
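A minimal sketch using scikit-learn's KNNImputer (the numeric matrix and the choice of k = 2 are made up for illustration):

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with missing entries (np.nan)
X = np.array([
    [25.0, 40.0,  3.0],
    [27.0, np.nan, 4.0],
    [30.0, 52.0,  np.nan],
    [45.0, 80.0, 12.0],
])

# Each missing entry is filled using the k most similar rows,
# where similarity is measured on the features that are observed
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))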
After dealing with missing values, the next task is to deal with outliers. Often, we tend to neglect outliers while building models. This is a bad practice: outliers tend to skew your data and reduce accuracy. Let's learn more about outlier treatment.

5. Detecting / Treating outliers:


Outlier
Outlier is a term commonly used by analysts and data scientists because outliers need close attention, else they can result in wildly wrong estimations.
Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Let's take an example: we do customer profiling and find out that the average annual income of customers is $0.8 million. But there are two customers having annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than the rest of the population, so these two observations will be seen as outliers.

Types of Outliers – From Unit 3


What causes Outliers? - From Unit 3
What is the impact of Outliers on a dataset? - - From Unit 3
How to detect Outliers? - From Unit 3

How to remove Outliers?


Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; the Decision Tree algorithm deals with outliers well because it bins the variables. We can also assign weights to different observations.
Imputing: As with missing values, we can also impute outliers. We can use mean, median or mode imputation methods. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead and impute values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as different groups, build an individual model for each group and then combine the outputs. (A small sketch of detection and treatment follows below.)
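A small sketch of detection and two of the treatments above in Python (pandas/NumPy); the income values echo the earlier customer-profiling example and are made up:

import numpy as np
import pandas as pd

# Hypothetical annual incomes (in millions); two values sit far from the rest
income = pd.Series([0.6, 0.7, 0.8, 0.75, 0.9, 0.85, 4.0, 4.2])

# Detect outliers with the 1.5 * IQR rule
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("Outliers:", income[(income < low) | (income > high)].tolist())

# Treatment 1: cap (winsorize) the values at the IQR fences
capped = income.clip(lower=low, upper=high)

# Treatment 2: natural-log transform to reduce the influence of extreme values
logged = np.log(income)

print(capped.round(2).tolist())
print(logged.round(2).tolist())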

6. Feature Engineering:
Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data here, but you are making the data you already have more useful.
During this phase we try to infer better variables/predictors out of the existing variables.
For example, let's say you are trying to predict footfall in a shopping mall based on dates. If you try to use the dates directly, you may not be able to extract meaningful insights from the data, because footfall is affected less by the day of the month than by the day of the week. So we can create new variables out of the date, such as weekday/weekend, Monday/Tuesday, and so on. This information about the day of the week is implicit in your data; you need to bring it out to make your model better (a small sketch follows below).
This exercise of bringing information out of data is known as feature engineering.
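A minimal sketch of the date example in Python (pandas); the dates and footfall figures are made up:

import pandas as pd

# Hypothetical daily footfall data
df = pd.DataFrame({
    "date":     pd.to_datetime(["2024-03-01", "2024-03-02",
                                "2024-03-03", "2024-03-04"]),
    "footfall": [1200, 2100, 2300, 1100],
})

# Derive new predictors from the raw date column
df["day_of_week"] = df["date"].dt.day_name()        # Monday, Tuesday, ...
df["is_weekend"]  = df["date"].dt.dayofweek >= 5    # Saturday/Sunday flag
df["month"]       = df["date"].dt.month
print(df)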
Data Summarization

Data Summarization in Data Mining is a key concept through which a concise description of a dataset can be obtained, to see what looks normal or out of place. A carefully chosen summary of raw data conveys many trends and patterns of the data in an easily accessible manner. The term 'data mining' refers exactly to this, i.e., extracting meaningful information from raw data, and Data Summarization in Data Mining aims at presenting the extracted information and trends in a tabular or graphical format.
Data summaries usually present the dataset’s average (mean, median, and/or mode); standard
deviation from mean or interquartile range; how the data is distributed across the range of data
(for example is it skewed to one side of the range); and statistical dependence (if more than one
variable was captured in the dataset). Data summaries may be presented in numerical text
and/or in tables, graphs, or diagrams.
In general, data can be summarized numerically in the form of a table known as tabular
summarization or visually in the form of a graph known as data visualization.
The different types of Data Summarization in Data Mining are:
 Tabular Summarization: This method instantly conveys patterns such as frequency
distribution, cumulative frequency, etc, and
 Data Visualization: Visualisations from a chosen graph style such as histogram, time-
series line graph, column/bar graphs, etc. can help to spot trends immediately in a
visually appealing way.

There are three areas in which you can implement Data Summarization in Data Mining. These
are as follows:
 Data Summarization in Data Mining: Centrality
 Data Summarization in Data Mining: Dispersion
 Data Summarization in Data Mining: Distribution of a Sample of Data
1) Data Summarization in Data Mining: Centrality
The principle of Centrality is used to describe the center or middle value of the data.
Several measures can be used to show centrality, of which the common ones are the average (also called the mean), the median, and the mode. The three of them summarize the distribution of the sample data.
 Mean: This is used to calculate the numerical average of the set of values.
The arithmetic mean is calculated by adding together the values in the sample.
The sum is then divided by the number of items in the sample.

 Mode: This shows the most frequently repeated value in a dataset.


It is calculated by working out how many there are of each value in your sample.
The one with the highest frequency is the mode. It is possible to get tied
frequencies, in which case you report both values. The sample is then said to be
bimodal. You might get more than two modal values!

 Median: This identifies the value in the middle of all the values in the dataset when the values are ranked in order. For example:
If you have an odd number of values in your sample:
4 6 7 8 9
the median is simply the middle value, i.e. 7 in this case.
When you have an even number of values:
2 3 4 7 8 9
the middle falls between two items. What you do is use a value mid-way between the two items in the middle; in this case mid-way between 4 and 7, which gives 5.5.

The most appropriate measure to use will depend largely on the shape of the dataset.
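The three measures can be computed directly in Python (pandas); the sample below is made up:

import pandas as pd

values = pd.Series([2, 3, 4, 7, 8, 9, 7])

print("Mean:  ", values.mean())            # sum of values / number of values
print("Median:", values.median())          # middle value of the sorted sample
print("Mode:  ", values.mode().tolist())   # may return more than one value (bimodal)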

2) Data Summarization in Data Mining: Dispersion


The dispersion of a sample refers to how spread out the values are around the average
(center). Looking at the spread of the distribution of data shows the amount of variation or
diversity within the data. When the values are close to the center, the sample has low dispersion
while high dispersion occurs when they are widely scattered about the center.
Different measures of dispersion can be used based on which is more suitable for your dataset
and what you want to focus on. The different measures of dispersion are as follows:
 Standard deviation: This provides a standard way of knowing what is normal,
showing what is extra large or extra small and helping you to understand the spread of
the variable from the mean. It shows how close all the values are to the mean.
The general formula for the sample standard deviation (consistent with the steps below) is:
s = SQRT( Σ(x − mean)² / (n − 1) )
To work out standard deviation follow these steps:

1. Subtract the mean from each value in the sample.


2. Square the results from step 1 (this removes negative values).
3. Add together the squared differences from step 2.
4. Divide the summed squared differences from step 3 by n-1, which is the
number of items in the sample (replication) minus one.
5. Take the square root of the result from step 4.

 Variance: This is similar to standard deviation but it measures how tightly or loosely
values are spread around the average.
Variance = s², i.e. the square of the standard deviation.
 Range: The range indicates the difference between the largest and the smallest values
thereby showing the distance between the extremes.
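The five steps above can be checked with a short Python (NumPy) sketch on a small made-up sample; the manual result matches NumPy's sample standard deviation:

import numpy as np

x = np.array([17, 26, 28, 27, 29, 28, 25, 26, 34, 32], dtype=float)

diffs = x - x.mean()              # 1. subtract the mean from each value
squared = diffs ** 2              # 2. square the results
total = squared.sum()             # 3. add the squared differences together
variance = total / (len(x) - 1)   # 4. divide by n - 1 (the sample variance)
std_dev = np.sqrt(variance)       # 5. take the square root

print(std_dev, np.std(x, ddof=1))   # the two values agree
print("Range:", x.max() - x.min())  # largest value minus smallest value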

3) Data Summarization in Data Mining: Distribution of a Sample of Data


The distribution of sample data values has to do with the shape, which refers to how the data values are distributed across the range of values in the sample. In simple terms, it means whether the values are clustered symmetrically around the average or whether there are more values to one side than the other. Two ways to explore the distribution of the sample data are graphically and through shape statistics.
To draw a picture of the data distribution graphically, frequency histograms and tally plots can
be used to summarize the data.
 Tally plots: A tally plot is a kind of data frequency distribution graph that can be used
to represent the values from a dataset.
You can sketch a tally plot in a notebook. This makes it a very useful tool for times
when you haven’t got a computer to hand.
To draw a tally plot follow these steps:
1. Determine the size classes (bins), you want around 7 bins.
2. Draw a vertical line (axis) and write the values for the bins to the left.
3. For each datum, determine which size class it fits into and add a tally mark to the
right of the axis, opposite the appropriate bin.
You will now be able to assess the shape of the data sample you’ve got.

The tally plot in the preceding figure shows a normal (parametric) distribution. You can see that the shape is more or less symmetrical around the middle, so here the mean and standard deviation would be good summary values to represent the data. The original dataset was:
17 26 28 27 29 28 25 26 34 32 23 29 24 21 26 31 31 22 26 19 36 23 21 16 30

The first bin, labelled 18, contains values up to 18. There are two in the dataset (17 and 16). The next bin is 21 and therefore contains items that are >18 but not greater than 21 (there are three: 21, 19 and 21).
The following dataset is not normally distributed:
21 36 18 17 16 22 20 19 20 22 25 19 17 21 19 21 31 22 19 19 16 23 21 16 30

These data produce a tally plot like so:

Note that the same bins were used for the second dataset. The range for both samples was 16-36. The data in the second sample are clearly not normally distributed: the tallest size class is not in the middle and there is a long "tail" towards the higher values. For these data the median and inter-quartile range would be appropriate summary statistics.

 Histograms: A histogram is like a bar chart. The bars represent the frequency of values
in the data sample that correspond to various size classes (bins). Generally the bars are
drawn without gaps between them to highlight the fact that the x-axis represents a
continuous variable.
There is little difference between a tally plot and a histogram but the latter can be
produced easily using a computer (you can sketch one in a notebook too).
To make a histogram you follow the same general procedure as for a tally plot but with
subtle differences:
o Determine the size classes.
o Work out the frequency for each size class.
o Draw a bar chart using the size classes as the x-axis and the frequencies on the
y-axis.
You can draw a histogram by hand or use your spreadsheet. The following histograms
were drawn using the same data as for the tally plots in the preceding section. The fig1
histogram shows normally distributed data. Fig2 histogram shows a non-parametric
distribution.

Fig. 1 (normally distributed sample) and Fig. 2 (non-parametric sample)
In both these examples the bars are shown with a small gap; more properly, the bars should be touching. The x-axis shows the size classes as a range under each bar. You can also show the maximum value for each size class. Ideally your histogram should have the labels at the divisions between size classes.
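A histogram of the first (normally distributed) sample can be sketched in Python (matplotlib), using roughly the same size classes as the tally plot; the exact bin edges are an assumption based on the description above:

import matplotlib.pyplot as plt

data = [17, 26, 28, 27, 29, 28, 25, 26, 34, 32, 23, 29, 24, 21, 26,
        31, 31, 22, 26, 19, 36, 23, 21, 16, 30]

bins = [15, 18, 21, 24, 27, 30, 33, 36]       # size classes (bins)
plt.hist(data, bins=bins, edgecolor="black")  # touching bars, continuous x-axis
plt.xlabel("Size class")
plt.ylabel("Frequency")
plt.title("Histogram of the sample")
plt.show()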

3.1) Shape Statistics


For shape statistics, skewness and kurtosis give numerical values for how central the average is and how the values are clustered around it.
 Skewness: This is a measure of how central the average is in the distribution. The
skewness of a sample is a measure of how central the average is to the overall spread
of values.
The formula to calculate skewness uses the number of items in the sample (the replication, n) and the standard deviation, s. The adjusted sample skewness (the form used by Excel's SKEW function) is:
Skewness = [ n / ((n − 1)(n − 2)) ] × Σ( (x − mean) / s )³
In practice you’ll use a computer to calculate skewness; Excel has a SKEW function
that will compute it for you.
A positive value indicates a long "tail" of higher values to the right of the distribution, with most of the data (and the peak) to the left of the average; a negative value indicates the opposite. The larger the absolute value, the more skewed the sample is.
 Kurtosis: This is a measure of how pointy the distribution is. The Kurtosis of a sample
is a measure of how pointed the distribution is, it shows how clustered the values are
around the middle.
The formula to calculate kurtosis uses the number of items in the sample (the replication, n) and the standard deviation, s. The adjusted sample (excess) kurtosis (the form used by Excel's KURT function) is:
Kurtosis = [ n(n + 1) / ((n − 1)(n − 2)(n − 3)) ] × Σ( (x − mean) / s )⁴ − 3(n − 1)² / ((n − 2)(n − 3))

In practice you’ll use a computer to calculate kurtosis; Excel has a KURT function that
will compute it for you.

A positive result indicates a pointed distribution, which will probably also have a low
dispersion. A negative result indicates a flat distribution, which will probably have high
dispersion. The higher the value the more extreme the pointedness or flatness of the
distribution.
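Both statistics can be computed for the two samples quoted earlier with a short Python (pandas) sketch; pandas' skew() and kurt() use bias-adjusted formulas comparable to Excel's SKEW and KURT:

import pandas as pd

normal_like = pd.Series([17, 26, 28, 27, 29, 28, 25, 26, 34, 32, 23, 29, 24,
                         21, 26, 31, 31, 22, 26, 19, 36, 23, 21, 16, 30])
skewed      = pd.Series([21, 36, 18, 17, 16, 22, 20, 19, 20, 22, 25, 19, 17,
                         21, 19, 21, 31, 22, 19, 19, 16, 23, 21, 16, 30])

# Positive skewness indicates a long tail of higher values
print("Skewness:", normal_like.skew().round(3), skewed.skew().round(3))
# Positive kurtosis indicates a more pointed distribution
print("Kurtosis:", normal_like.kurt().round(3), skewed.kurt().round(3))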

Determining the shape of the distribution of your data goes a long way in helping you decide
which statistical option to choose from when performing data summarization and subsequent
analysis through data mining.

Correlation Analysis
Correlation analysis is a statistical technique that allows you to determine whether there is a
relationship between two separate variables and how strong that relationship may be. Simply
put - correlation analysis calculates the level of change in one variable due to the change in
the other.
This type of analysis is only appropriate if the data is quantified and represented by a number.
It can’t be used for categorical data, such as gender, brands purchased, or colour.
The analysis produces a single number between +1 and −1 that describes the degree of relationship between the two variables. If the result is positive, the two variables are positively correlated, i.e. when one is high, the other tends to be high too. If the result is negative, the two variables are negatively correlated, i.e. when one is high, the other tends to be low.
A high correlation points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.
When it comes to market research, researchers use correlation analysis to analyse quantitative
data collected through research methods like surveys and live polls. They try to identify the
relationship, patterns, significant connections, and trends between two variables or datasets.
Types of correlation
Correlation between two variables can be either a positive correlation, a negative correlation,
or no correlation. Let's look at examples of each of these three types.
 Positive correlation: A positive correlation between two variables means both the
variables move in the same direction. An increase in one variable leads to an increase
in the other variable and vice versa.
For example, spending more time on a treadmill burns more calories.
 Negative correlation: A negative correlation between two variables means that the
variables move in opposite directions. An increase in one variable leads to a decrease
in the other variable and vice versa.
For example, increasing the speed of a vehicle decreases the time you take to reach your
destination.
 Weak/Zero correlation: No correlation exists when one variable does not affect the
other.
For example, there is no correlation between the number of years of school a person
has attended and the letters in his/her name.

Different kinds of correlation analysis


Here are the 2 types of correlation analysis;
 Spearman correlation
 Pearson correlation
1. Spearman's Rank Correlation Coefficient
This coefficient is used to see if there is any significant relationship between the two datasets, and operates under the assumption that the data being used is ordinal, which here means that the numbers do not indicate quantity but rather signify a position or place of the subject's standing (e.g. 1st, 2nd, 3rd, etc.).
This coefficient can be displayed on a data table to demonstrate the raw data, its rankings, and the difference between the two ranks.
The squared differences between the two ranks feed into the coefficient, and the ranked data can be shown on a scatter graph, which will clearly indicate whether there is a positive correlation, a negative correlation, or no correlation at all between the two variables. The constraint that this coefficient works under is -1 ≤ r ≤ +1, where a result of 0 means there is no relationship between the data whatsoever.
When to use this correlation analysis: when no assumptions can be made about the probability distribution of the data, or when the data is ordinal or contains outliers. Because it works on ranks, it is a non-parametric (distribution-free) measure.

2. Pearson Product-Moment Coefficient


This is the most widely used correlation analysis formula, which measures the strength of the
‘linear’ relationships between the raw data from both variables, rather than their ranks. This is
a dimensionless coefficient, meaning that there are no data-related boundaries to be considered
when conducting analyses with this formula, which is a reason why this coefficient is the first
formula researchers try.

However, if the relationship between the data is not linear, then that is when this particular
coefficient will not accurately represent the relationship between the two variables, and when
Spearman’s Rank must be implemented instead.
Pearson's coefficient requires the relevant data to be entered into a table similar to that for Spearman's Rank but without the ranks, and the result produced will be in the numerical form that all correlation coefficients take, including Spearman's Rank and Pearson's Coefficient: -1 ≤ r ≤ +1.
When to use: when the data is quantitative (interval or ratio scale), the relationship is expected to be linear, and the distributional assumptions (such as approximate normality) are reasonable. If these assumptions do not hold, or the data is ordinal, Spearman's Rank should be used instead.
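A minimal sketch in Python (SciPy) that contrasts the two coefficients on a made-up, non-linear but strictly increasing relationship:

from scipy import stats

x = [2, 4, 6, 8, 10, 12, 14]
y = [1, 3, 9, 20, 45, 90, 180]   # increases with x, but not linearly

pearson_r, _ = stats.pearsonr(x, y)     # strength of the linear relationship
spearman_r, _ = stats.spearmanr(x, y)   # strength of the monotonic (rank) relationship

print(round(pearson_r, 3))    # below 1, because the relationship is not linear
print(round(spearman_r, 3))   # exactly 1.0, because the ranks move together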

Interpreting Results
Typically, the best way to gain a generalised but more immediate interpretation of the results of a set of data is to visualise it on a scatter graph such as these:
Positive Correlation
Any score from +0.5 to +1 indicates a strong positive correlation, which means that both variables increase at the same time. The line of best fit, or the trend line, is placed to best represent the data on the graph. In this case, it follows the data points upwards to indicate the positive correlation.
Negative Correlation
Any score from -0.5 to -1 indicates a strong negative correlation, which means that as one variable increases, the other decreases proportionally. The line of best fit can be seen here to indicate the negative correlation; in these cases it slopes downwards from the point of origin.
No Correlation
Very simply, a score of 0 indicates that there is no correlation, or relationship, between the two variables. The larger the sample size, the more accurate the result; no matter which formula is used, this fact holds true. The more data that is put into the formula, the more accurate the end result will be.

Outliers or anomalies must be accounted for in both correlation coefficients. Using a scatter
graph is the easiest way of identifying any anomalies that may have occurred, and running the
correlation analysis twice (with and without anomalies) is a great way to assess the strength of
the influence of the anomalies on the analysis. If anomalies are present, Spearman’s Rank
coefficient may be used instead of Pearson’s Coefficient, as this formula is extremely robust
against anomalies due to the ranking system used.

Degrees of Correlation Analysis


We can quantify the degree of correlation between two variables through the correlation coefficient. Based on the coefficient of correlation, we can likewise determine whether the correlation is positive or negative and its degree.
1. Perfect correlation: If two variables vary in the same direction and in the same proportion, the correlation between the two is perfectly positive. According to Karl Pearson, the coefficient of correlation in this case is +1. On the other hand, if the variables vary in opposite directions and in the same proportion, the correlation is perfectly negative and its coefficient is -1. In practice, we rarely come across these kinds of correlations.
2. Absence of correlation: If two variables show no relationship between them, or a change in one variable does not lead to a change in the other, then we can firmly say that there is no correlation between the two variables. In this case, the coefficient of correlation is 0.
3. Limited degrees of correlation: If two variables are neither perfectly correlated nor completely uncorrelated, we term the correlation a limited correlation. Such a correlation may be positive, negative, or zero, but it lies within the limits ±1, i.e. the value of r is such that -1 ≤ r ≤ +1. The + and - signs are used for positive linear correlations and negative linear correlations respectively.

Uses of correlation analysis


Correlation analysis is used to study practical cases in which the researcher cannot manipulate individual variables.
For example, correlation analysis is used to measure the correlation between the patient's blood
pressure and the medication used.
Marketers use it to measure the effectiveness of advertising.
Researchers measure the increase/decrease in sales due to a specific marketing campaign.

Use of Correlation analysis in Business


Correlation analysis is also a quick way to identify potential company issues. If there is a
correlation between two variables, correlation analysis provides an opportunity for rapid
hypothesis testing, especially if the test is low risk and won’t require a significant investment
of time and money.
For example, you might find that there’s a positive correlation between customers looking at
reviews for a particular product and whether or not they purchase it.
You can't say for certain that the product reviews caused the purchase, but it indicates a place
where testing can provide more information.
If you can get 10% more people to look at product reviews, especially positive ones, can you
increase the number of purchases? Correlations can help to fuel different hypotheses that can
then be rapidly tested, especially in digital environments.

Advantages of correlation analysis


In statistics, correlation refers to the fact that there is a link between various events. One of the
tools to infer whether such a link exists is correlation analysis. Practical simplicity is
undoubtedly one of its main advantages.
To perform reliable correlation analysis, it is essential to make in-depth observations of the two variables, which in turn helps in obtaining trustworthy results. Some of the most notable benefits of correlation analysis are:
 Awareness of the behavior between two variables: A correlation helps to identify the
absence or presence of a relationship between two variables. It tends to be more relevant
to everyday life.
 Good starting point for research: It proves to be a good starting point when a
researcher starts investigating relationships for the first time.
 Uses for further studies: Researchers can identify the direction and strength of the
relationship between two variables and later narrow the findings down in later studies.
 Simple metrics: Research findings are simple to classify. The findings can range from
-1.00 to 1.00. There can be only three potential broad outcomes of the analysis.

Variable Importance and Dimension Reduction

Dimensionality reduction
(Reduce the size of your dataset while keeping as much of the variation as possible)
In both Statistics and Machine Learning, the number of attributes, features or input variables
of a dataset is referred to as its dimensionality. For example, let’s take a very simple dataset
containing 2 attributes called Height and Weight. This is a 2-dimensional dataset and any
observation of this dataset can be plotted in a 2D plot.

If we add another dimension called Age to the same dataset, it becomes a 3-dimensional dataset
and any observation lies in the 3-dimensional space.
Likewise, real-world datasets have many attributes. The observations of those datasets lie in
high-dimensional space which is hard to imagine. The following is a general geometric
interpretation of a dataset related to dimensionality considered by data scientists, statisticians
and machine learning engineers.
In a tabular dataset containing rows and columns, the columns represent the dimensions of the
n-dimensional feature space and the rows are the data points lying in that space.
Dimensionality reduction is the process of reducing the number of variables/ attributes in
high-dimensional data while keeping as much of the variability (information) in the original
data as possible. It either finds a new, smaller set of variables or keeps only the most important of the original variables (again fewer than the original number). We should consider a good trade-off between the number of variables we keep and the variability lost from the original dataset.
It is a data preprocessing step meaning that we perform dimensionality reduction before
training the model.
In addition, high-dimensional data can also lead to overfitting, where the model fits the training
data too closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the complexity of
the model and improving its generalization performance. There are two main approaches to
dimensionality reduction: feature selection and feature extraction.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
 Feature selection: This is the process of selecting a subset of features relevant to the problem at hand for use in model construction; in other words, the selection of the most important features.
In normal circumstances, domain knowledge plays an important role and we could select the features we feel would be the most important. For example, in predicting home prices the number of bedrooms and the square footage are often considered important.
It usually involves three approaches (a minimal filter-based sketch follows this list):
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. This reduces data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), etc.
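As a minimal sketch of the filter approach mentioned above (scikit-learn's SelectKBest on a small made-up dataset), note that feature selection keeps a subset of the original columns unchanged:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 3 features, of which only the first two relate to the target
X = np.array([[1, 10, 5], [2, 9, 7], [3, 8, 6],
              [8,  2, 5], [9, 1, 7], [10, 0, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Filter method: score each feature against the target and keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Feature scores:", selector.scores_.round(2))
print("Kept columns:  ", selector.get_support(indices=True))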

Methods of Dimensionality Reduction


As the number of variables in your dataset increases, we need to reduce them to obtain these advantages. This is where dimensionality reduction comes in. But keep in mind that when you reduce the number of variables in your dataset, you will lose some percentage (usually 1%-15%, depending on the number of components or features that you keep) of the variability in the original data. There are many techniques/methods for dimensionality reduction:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon the method used.
The prime linear method is Principal Component Analysis, or PCA.

Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum.

It involves the following steps:


 Construct the covariance matrix of the data.
 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some data
loss in the process. But, the most important variances should be retained by the remaining
eigenvectors.
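A minimal sketch of PCA in Python (scikit-learn), which performs the covariance/eigenvector computation described above internally; the 3-dimensional data is made up, and standardizing first is an optional but common preprocessing step:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 3-dimensional data (e.g. Height, Weight, Age)
X = np.array([[170, 65, 23], [165, 59, 25], [180, 81, 31],
              [175, 72, 29], [160, 55, 41], [185, 88, 36]], dtype=float)

# Standardize, then project onto the 2 directions of maximum variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                         # (6, 2): two new variables per row
print(pca.explained_variance_ratio_.round(3))  # variance retained by each component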

Advantages of Dimensionality Reduction


 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
 Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or 3D, which
can help in better understanding and analysis.
 Overfitting Prevention: High dimensional data may lead to overfitting in machine
learning models, which can lead to poor generalization performance. Dimensionality
reduction can help in reducing the complexity of the data, and hence prevent overfitting.
 Feature Extraction: Dimensionality reduction can help in extracting important
features from high dimensional data, which can be useful in feature selection for
machine learning models.
 Data Preprocessing: Dimensionality reduction can be used as a preprocessing step
before applying machine learning algorithms to reduce the dimensionality of the data
and hence improve the performance of the model.
 Improved Performance: Dimensionality reduction can help in improving the
performance of machine learning models by reducing the complexity of the data, and
hence reducing the noise and irrelevant information in the data.

Disadvantages of Dimensionality Reduction


 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes
undesirable.
 PCA fails in cases where mean and covariance are not enough to define datasets.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.
 Interpretability: The reduced dimensions may not be easily interpretable, and it may
be difficult to understand the relationship between the original features and the reduced
dimensions.
 Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially
when the number of components is chosen based on the training data.
 Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to
outliers, which can result in a biased representation of the data.
 Computational complexity: Some dimensionality reduction techniques, such as
manifold learning, can be computationally intensive, especially when dealing with large
datasets.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these features
may overlap.
In another condition, a classification problem that relies on both humidity and rainfall can be
collapsed into just one underlying feature, since both of the aforementioned are correlated to a
high degree. Hence, we can reduce the number of features in such problems.
A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2-dimensional space, and a 1-D problem to a simple line. The below figure illustrates
this concept, where a 3-D feature space is split into two 2-D feature spaces, and later, if found
to be correlated, the number of features can be reduced even further.

Feature Selection vs Dimensionality Reduction


Often, feature selection and dimensionality reduction are grouped together. While both
methods are used for reducing the number of features in a dataset, there is an important
difference.
Feature selection is simply selecting and excluding given features without changing them.
Dimensionality reduction transforms features into a lower dimension.

Binning -Reducing Number of Categories in Categorical Variables

Binning in Data Mining


(Conversion of a continuous variable to categorical)
Data binning, also called discretization, discrete binning or bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors. It is a form of quantization. The original data values are divided into small intervals known as bins, and they are then replaced by a general value calculated for that bin. This has a smoothing effect on the input data and may also reduce the chance of overfitting in the case of small datasets.
Binning can be applied to both numerical and categorical variables, and its primary purpose is
to simplify the data and make it more manageable for analysis. For instance, Binning in data
mining can be used to discretize a numerical variable, such as age, into age groups (e.g., 0-18,
19-30, 31-50, and 51+), which can be useful for analysis and modeling purposes.
Binning in data mining can be useful in various scenarios, such as reducing the noise in the
data, improving the accuracy of predictive models, and making the data easier to understand
and interpret. In the following sections, we'll answer questions about the different types
of binning techniques and how they are used in data mining.

Statistical data binning in data mining is a data preprocessing technique used in statistical
analysis to group continuous values into a smaller number of bins. This technique is useful for
exploring the distribution of a variable and identifying patterns or trends in the data.
It can also be used in multivariate statistics, binning in several dimensions simultaneously. For
example, if you have data about a group of people, you might want to arrange their ages into a
smaller number of age intervals, such as grouping every five years together.

Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. Unlike statistical binning, supervised binning
considers the joint distribution of the input variable and the target variable. The bin boundaries
are determined by a single-predictor decision tree, which considers the predictive power of
each bin for the target variable.
Supervised binning can be used for both numerical and categorical attributes.
It is especially useful for identifying nonlinear relationships between the input variable and the
target variable. By using supervised binning to create new features or input variables, we can
improve the performance of our predictive models.

Why is Binning Used?


Binning or discretization is used for the transformation of a continuous or numerical variable
into a categorical feature. Binning a continuous variable introduces non-linearity and tends to improve the performance of the model by improving resource utilization and model build response time without significant loss in model quality. Binning can improve model quality by strengthening the relationship between attributes. It can also be used to identify missing values or outliers.

Image Data Processing


 In image data processing, binning is a technique used to reduce the size of an image by
combining adjacent pixels into larger superpixels or bins. This can be useful for
reducing the computational complexity of image processing algorithms and improving
the image's signal-to-noise ratio.
 Binning is typically performed by dividing the image into a grid of equal-sized cells
and then averaging the pixel values within each cell to obtain a new, lower-
resolution image. The resulting image will have a smaller number of pixels. Still, each
pixel will represent the average value of a larger area, which can help to reduce noise
and improve the overall image quality.
 Let's say we have an image that is 1024 x 1024 pixels in size, and we want to reduce its
size by a factor of 2. We can do this using binning by dividing the image into a grid of
512 x 512 cells, each containing 2 x 2 pixels. We can then take the average or sum of
the pixel values within each cell to obtain a new, lower-resolution image that is 512 x
512 pixels in size.
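The 2 x 2 averaging described above can be sketched in a few lines of Python (NumPy); the random image simply stands in for real pixel data:

import numpy as np

# Hypothetical grayscale image, 1024 x 1024 pixels
image = np.random.rand(1024, 1024)

# 2 x 2 binning: group adjacent pixels into cells and average each cell
binned = image.reshape(512, 2, 512, 2).mean(axis=(1, 3))

print(image.shape, "->", binned.shape)   # (1024, 1024) -> (512, 512)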

Purpose of Binning Data


The purpose of binning data is to reduce the complexity of data and make it
more manageable and easier to analyze. Binning in data mining can be used for both numerical
and categorical data and involves grouping data into smaller, more manageable intervals or
categories, or bins.
There are several reasons why binning data can be useful -
 Simplification of data - Binning reduces the complexity of data by grouping values
into a smaller number of categories or intervals, which makes it easier to understand,
summarize and visualize.
 Reduction of noise - In some cases, binning can help reduce noise in the data by
smoothing out variations in individual data points and highlighting larger trends or
patterns.
 Facilitation of data analysis - Binning can make it easier to perform statistical analysis
and create visualizations, such as histograms, by reducing the number of unique values
in the data.
 Improvement of model performance - Binning can also be used to create new features
or input variables for predictive models. By grouping similar values, binning can
strengthen the relationship between attributes and improve the performance of machine
learning models.
Binning Techniques / Types of Binning:
 Unsupervised Binning: Equal width binning, Equal frequency binning
 Supervised Binning: Entropy-based binning

Methods for Binning Data


There are two broad methods of dividing data into bins:
Unsupervised Binning:
Unsupervised binning is a category of binning that transforms a numerical or continuous variable into categorical bins without taking the target class label into account.
Unsupervised binning is of two categories:

1. Equal Frequency Binning: Bins have an equal frequency. This method involves dividing
a continuous variable into a specified number of bins, each containing an equal number of
observations. This method is useful for data with a large number of observations or when the
data is skewed.
For example, equal frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Each bin contains equal number of elements.

2. Equal Width Binning: This method involves dividing a continuous variable into a specified
number of bins of equal width. This method is useful for data with a normal distribution. So if
there are n bins, each bin will have equal width w = (max − min) / (number of bins), and the bin boundaries are min + w, min + 2w, …, min + nw.
For example, equal width:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
If we want to bin this into 3 intervals, then the width of each bin would be
w = (max − min) / (number of bins) = (215 − 5) / 3 = 210 / 3 = 70.
First bin max range = min + w = 5 + 70 = 75; second bin = min + 2w = 5 + 2×70 = 145; third bin = min + 3w = 5 + 3×70 = 215.
This way, we have ranges for the 3 bins as [5, 75], [75, 145], [145, 215].
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
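Both unsupervised schemes can be reproduced on the input list above with a short Python (pandas) sketch; note that pd.cut extends the lowest bin edge slightly so that the minimum value is included:

import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width binning: 3 bins of width (215 - 5) / 3 = 70
equal_width = pd.cut(values, bins=3)
print(equal_width.value_counts().sort_index())

# Equal-frequency binning: 3 bins, each holding (roughly) the same number of values
equal_freq = pd.qcut(values, q=3)
print(equal_freq.value_counts().sort_index())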

Supervised Binning:
Supervised binning methods transform numerical variables into categorical counterparts
and refer to the target (class) information when selecting discretization cut points. Entropy-
based binning is an example of a supervised binning method.

Entropy-based Binning

The entropy-based method uses a split approach. The entropy (or information content) is calculated based on the class label. Intuitively, it finds the best split so that the bins are as pure as possible, that is, the majority of the values in a bin share the same class label. Formally, it is characterized by finding the split with the maximal information gain.
Example:
Discretize the temperature variable using the entropy-based binning algorithm on the following data (24 observations with a Temperature value and a binary Failure label: 7 failures and 17 non-failures):
Step 1: Calculate "Entropy" for the target.

E (Failure) = E(7, 17) = E(0.29, .71) = -0.29 x log2(0.29) - 0.71 x log2(0.71) = 0.871

Where, pi(7) = 7/ (7+17) = 0.29, pi(17) = 17/ (17+7) = 0.71

Step 2: Calculate "Entropy" for the target given a bin.

E (Failure,Temperature) = P(<=60) x E(3,0) + P(>60) x E(4,17) = 3/24 x 0 + 21/24 x 0.7= 0.615

Where, P(<=60)= 3/ (3+4+17)= 3/24 , P(>60)= (4+17)/ (3+4+17)= 21/24

E(3,0) = − (3/(3+0)) × log2(3/(3+0)) − (0/(3+0)) × log2(0/(3+0)) = −1 × log2(1) − 0 = 0

Step 3: Calculate "Information Gain" given a bin.

Information Gain (Failure, Temperature) = 0.256

The information gains for all three bins show that the best interval for
"Temperature" is (<=60, >60) because it returns the highest gain.

Example of Binning Continuous Data


Let’s have a look at one example of binning a continuous variable. Below is a bar chart
representing the number of persons for a given age. Now if we bin the age of persons, then the
data can be visualized for various age groups instead of a single age or single individuals, as
shown below.

Example of Binning Categorical Data


Let’s see an example of how categorical variables can be binned to simplify the analysis.
Suppose we have a dataset showing sales distribution for each fruit, as shown below for apples,
strawberries, blueberries, and bananas. We can group strawberries and blueberries into a single
group - berry fruits and simplify our further analysis.
