Statistical Modelling
for Data Science
(20CSE743)
Mrs. Snigdha Sen
Associate Professor- CSE, GAT
PhD Scholar, IIIT-Allahabad
Syllabus
Course Outcome
Assignment (10 M)
Mini Project and short report/
MOOC Course/
Any Online Course
Class Work
• Demonstration of a few concepts
using Google Colab once a week
on a rotation basis
• Study Material would be
provided
Why Learn This Course
• Seamless excellent blending of
Statistics+ Data Science
• Huge scope of jobs in product-based
companies
Outline & Content
• Introduction
• Standard Deviation
• Skewness
• Kurtosis
• Mean
• Applications
Data
• Most important in any analysis
• The characteristics of the data are very important
• Understanding your data is crucial
• A predictive model works better if the data is
well understood
Data Science Vs Statistical
Modelling
Data Science: Exploratory Data Analysis - missing value, visualization
Statistical Modelling: Data distribution, statistics, T-Test, Chi Square Test, Anova
Statistical Modelling
The science of statistics is the study
of how to learn from data. It helps
you collect the right data, perform
the correct analysis, and effectively
present the results with statistical
knowledge. Statistical modeling
is key to making scientific
discoveries, data-driven decisions,
and predictions.
Application
• Health Insurance Agency
Statistical Modelling
Q-Q plot: In statistics, a Q–Q plot (quantile–quantile plot) is a
probability plot, a graphical method for comparing two
probability distributions.
Kernel Density: Kernel functions are used
to estimate the density of random variables
and as weighting functions in non-parametric
regression.
Statistical Modelling
• Statistical modeling is the process of applying statistical analysis to a dataset.
• A statistical model is a mathematical representation (or mathematical model) of observed data.
• A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data.
Statistical Modelling
• Statistics is the grammar of science. – Karl Pearson
• A statistical model is non-deterministic, unlike other mathematical models where variables
have specific values. Variables in statistical models are stochastic, i.e. they have
probability distributions.
• Statistical models help understand the characteristics of known data and estimate the
properties of large populations based on it. It’s the central idea behind
machine learning.
• It allows you to find an error bar or confidence interval based on sample size and other
factors. For example, an estimate X calculated from 10 samples would have a wider
confidence interval than an estimate Y calculated from 10000 samples.
• Statistical modeling also supports hypothesis testing. It provides statistical evidence for
the occurrence of specific events.
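The effect of sample size on the confidence interval can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the helper `mean_confidence_interval`, the simulated population (mean 50, standard deviation 10) and the normal-approximation multiplier 1.96 are all assumptions made for the example. For a sample as small as 10, a t-based interval would normally be preferred.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean - z * std_err, mean + z * std_err

random.seed(0)
# Hypothetical population: values drawn from a normal distribution.
small = [random.gauss(50, 10) for _ in range(10)]       # estimate X: 10 samples
large = [random.gauss(50, 10) for _ in range(10_000)]   # estimate Y: 10000 samples

low_x, high_x = mean_confidence_interval(small)
low_y, high_y = mean_confidence_interval(large)
width_x = high_x - low_x
width_y = high_y - low_y

print(f"n=10:    interval width = {width_x:.2f}")
print(f"n=10000: interval width = {width_y:.2f}")
```

The standard error shrinks in proportion to 1/sqrt(n), which is why the interval from 10000 samples is much narrower.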
Machine learning vs. statistical modeling
Statistics and machine learning (ML) differ primarily in their purposes.
You can build ML models for predicting the future by making accurate
predictions without explicit programming,
while statistical models can explain the relationship between variables.
Need of SM
• Choosing models that meet your needs
• Improved data preparation for analysis
• Enhanced communication skills
Where are statistical models
used?
Case Study
The experiment included a total of 122 primary care physicians
affiliated with one of three major hospitals in the Texas Medical
Center of Houston. These physicians were sent a packet
containing a medical chart similar to the one they view upon
seeing a patient. This chart portrayed a patient who was
displaying symptoms of a migraine headache but was otherwise
healthy. Two variables (the gender and the weight of the
patient) were manipulated across six different versions of the
medical charts. The weight of the patient, described in terms of
Body Mass Index (BMI), was average (BMI = 23), overweight
(BMI = 30), or obese (BMI = 36).
Data
Standard deviation describes how dispersed a set of data is.
It compares each data point to the mean of all data points, and
standard deviation returns a calculated value that describes
whether the data points are in close proximity or whether they
are spread out.
Statistics In Data
Science
Mean: It measures the central
tendency.
Spread: It says how far the points typically
vary from the mean.
Variance: It says, “What is the average of
the squared distances of the points from the mean?”
Statistics In Data
Science
Standard deviation: Square root of the variance. It says,
roughly, “What is the typical deviation of points from the mean value?”
Median absolute deviation:
It has the same notion as standard deviation. It measures
how far the points are from the central tendency, which is
the median in this case.
Gaussian Distribution — N(μ,σ)
Also known as the normal distribution. It is fully
determined by two parameters, the mean (μ) and the
standard deviation (σ). The standard normal distribution
is the special case with μ = 0 and σ = 1.
Gaussian Distribution — N(μ,σ)
Distribution
Standard Deviation
Standard deviation is a statistical
measurement of how much the values
in a series vary from their average.
A low standard deviation means
that the data points lie close to the
average, so the average is a reliable
summary of the data.
A high standard deviation means
that the data points are spread widely
around the average, so the average is
a less reliable summary.
Standard Deviation
Numerical Problem
• Take the values 2, 1, 3, 2 and 4. Calculate
the standard deviation.
Numerical Problem
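The mean of the values 2, 1, 3, 2 and 4 is 2.4, and the
squared deviations sum to 5.2. The population standard
deviation is sqrt(5.2/5) ≈ 1.02. (Dividing by n − 1 = 4
instead gives the sample standard deviation, ≈ 1.14.)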
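The arithmetic for this problem can be checked with Python's standard `statistics` module. The sketch below reports both the population form (divide by n) and the sample form (divide by n − 1), since the slide does not say which is intended; the population value rounds to 1.02.

```python
import statistics

values = [2, 1, 3, 2, 4]

mean = statistics.fmean(values)                    # 12 / 5 = 2.4
squared_devs = [(v - mean) ** 2 for v in values]   # these sum to 5.2

population_sd = statistics.pstdev(values)  # sqrt(5.2 / 5)
sample_sd = statistics.stdev(values)       # sqrt(5.2 / 4)

print(f"mean = {mean}")
print(f"population standard deviation = {population_sd:.4f}")
print(f"sample standard deviation     = {sample_sd:.4f}")
```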
Statistical Analysis
Statistics
• Descriptive: Describe and summarize data
  Mean, median, mode, standard deviation, range
• Inferential: Make inferences and draw conclusions about a
  population based on sample data
  1. Student's T Test
  2. One Sample T Test
  3. Two Sample T Test
  4. Chi square test
  5. ANOVA
Summary Statistics
• The Sample Median
Find out sample median
Answer: The sample median is the middle
number, which is 68.31.
The Trimmed Mean
• Like the median, the trimmed mean is a
measure of center that is designed to be
unaffected by outliers.
• The trimmed mean is computed by arranging
the sample values in order, “trimming” an
equal number of them from each end, and
computing the mean of those remaining.
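The trimming procedure can be sketched as a small function. The name `trimmed_mean` and the ten-value sample are hypothetical, chosen only to show how a single outlier pulls the ordinary mean but not the trimmed mean. The number of observations to drop is rounded to the nearest integer; this is an assumption about how to handle a non-integer count.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` percent of the observations from each end."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of observations to trim from each end, rounded to the nearest integer.
    k = round(percent / 100 * n)
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one large outlier (100).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

plain_mean = sum(sample) / len(sample)
trimmed_10 = trimmed_mean(sample, 10)   # drops the values 1 and 100

print(f"mean             = {plain_mean}")
print(f"10% trimmed mean = {trimmed_10}")
```

For comparison, `scipy.stats.trim_mean` takes the proportion to cut as a fraction and truncates the count rather than rounding it.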
Problem 1
Compute the mean, median, and the 5%, 10%,
and 20% trimmed means.
Solution
N=24
• Mean= 195.42.
• The median is the average of the 12th
and 13th numbers, which is (191 +
223)/2 = 207.00.
To compute the 5% trimmed mean, we must drop 5% of the
data from each end. This comes to (0.05)(24) = 1.2
observations.
We round 1.2 to 1, and trim one observation off each end.
The 5% trimmed mean is the average of the remaining 22
numbers:
Mode and Range
• The range is the difference between the largest and smallest
values in a sample. It is a measure of spread.
There are three modes: 80, 179, and 232.
Each of these values appears twice, and no other
value appears more than once.
The range is 470 − 30 = 440.
Quartiles
• Quartiles: Quartiles divide the set into 4 equal parts.
• There are three quartiles Q1, Q2 and Q3, where Q2 is
the median of the distribution.
• Five number summary:
• Every dataset can be described using these 5 numbers
• Lowest value
• Q1: 25th percentile
• Q2: Median
• Q3: 75th percentile
• Highest Value
Quartiles
• Quartiles divide it as nearly as possible into quarters.
• Steps to calculate Quartile
• Let n represent the sample size.
• Order the sample values from smallest to
largest.
• To find the first quartile, compute the value
0.25(n + 1). The second quartile uses the value
0.5(n + 1), and the third quartile uses 0.75(n + 1).
• If this is an integer, then the sample value in
that position is the quartile.
• If not, then take the average of the sample
values on either side of this value.
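The steps above can be sketched as a short function. The name `quantile_by_position` and the eight-value sample are hypothetical. The function follows the (n + 1) position rule from the slide, so its results can differ slightly from library defaults that interpolate differently.

```python
import math

def quantile_by_position(values, fraction):
    """Quantile using the position rule fraction * (n + 1), with 1-based positions."""
    ordered = sorted(values)
    n = len(ordered)
    position = fraction * (n + 1)
    lower = math.floor(position)
    if math.isclose(position, round(position)):
        # Integer position: take the sample value at that position.
        return float(ordered[round(position) - 1])
    if lower < 1 or lower >= n:
        raise ValueError("position falls outside the sample")
    # Otherwise average the two sample values on either side of the position.
    return (ordered[lower - 1] + ordered[lower]) / 2

# Hypothetical sample of size n = 8, so the positions are 2.25, 4.5 and 6.75.
sample = [2, 4, 6, 8, 10, 12, 14, 16]

q1 = quantile_by_position(sample, 0.25)
q2 = quantile_by_position(sample, 0.50)
q3 = quantile_by_position(sample, 0.75)
print(q1, q2, q3)
```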
Quartiles
Solution
The sample size is n = 24. To find the first quartile, compute
(0.25)(25) = 6.25.
The first quartile is therefore found by averaging the 6th
and 7th data points:
(105 + 126)/2 = 115.5.
Third quartile: (0.75)(25) = 18.75, so average the 18th
and 19th data points:
(242 + 245)/2 = 243.5.
Interquartile Range
Interquartile range (IQR) is defined as the difference between the 75th percentile (Q3) and the 25th percentile (Q1): IQR = Q3 − Q1.
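A minimal sketch of the IQR using NumPy, on a hypothetical eight-value sample. Note that `np.percentile` interpolates linearly between neighbouring values by default, so its quartiles need not match those from a position rule based on (n + 1).

```python
import numpy as np

# Hypothetical sample.
data = np.array([2, 4, 6, 8, 10, 12, 14, 16])

q1, q3 = np.percentile(data, [25, 75])  # default: linear interpolation
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```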
Percentiles
• The pth percentile of a sample divides the sample so that, as
nearly as possible, p% of the sample values are below it.
• Steps
• Order the sample values from smallest to
largest, and then compute the quantity (p/100)(n + 1),
where n is the sample size.
• If this quantity is an integer, the sample value
in this position is the pth percentile. Otherwise
average the two sample values on either side.
• Find the 65th percentile of the asphalt data
The sample size is n = 24.
To find the 65th percentile, compute (0.65)(25) =
16.25.
The 65th percentile is therefore found by averaging the
16th and 17th data points, when the sample is
arranged in increasing order.
(236 + 240)/2 = 238.
Sample Statistics and Population Parameters
• A numerical summary of a sample is called a
statistic.
• A numerical summary of a population is
called a parameter.
• Statistics are often used to estimate
parameters.
Skewness in data
Skewness
• Skewness is the asymmetry in the distribution of data: a skewed
distribution is not symmetric about its centre.
• Skewed data can be of 2 types. Right-skewed data is also called
positively skewed data, and left-skewed data is called
negatively skewed data.
• A symmetric distribution has skewness = 0: the two sides of
the distribution mirror each other around the mean.
Skewed Distribution
Why is skewness a problem?
The tapering end, or tail region, of a skewed distribution behaves like a set of
outliers in the data, and outliers can severely damage the performance of a
statistical model. The best example is regression models, which tend to show
poor results when trained on skewed data.
Skewed Distribution
• In simple words, skewness is the measure of how
much the probability distribution of a random
variable deviates from a symmetric distribution
such as the normal distribution.
Effects of skewed data
• Degrades the model’s ability (especially
regression based models) to describe
typical cases as it has to deal with rare
cases on extreme values.
• A model trained on right-skewed data tends to
predict better on data points with lower values
than on those with higher values.
• Skewed data also does not work well
with many statistical methods.
However, tree based models are not
affected.
Dealing with skewed data
• Log transformation: transform skewed distribution to a normal distribution
• Remove outliers
• Normalize (min-max)
• Cube root: when values are too large. Can be applied on negative values
• Square root: applied only to positive values
• Box Cox transformation: transform non-normal to approximate a normal distribution
Some examples (figures): log transform, square root, Box-Cox transformation
Python package
• pip install scipy
• scipy.stats.skew()
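A short sketch of `scipy.stats.skew()` measuring skewness before and after some common transformations. The lognormal sample is simulated purely for illustration (it is not data from the course), and the seed and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Hypothetical right-skewed data: a lognormal sample (strictly positive values).
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

skew_before = skew(right_skewed)
skew_log = skew(np.log(right_skewed))    # log transform
skew_sqrt = skew(np.sqrt(right_skewed))  # square root (non-negative values only)
skew_cbrt = skew(np.cbrt(right_skewed))  # cube root (also defined for negative values)

print(f"original    : {skew_before:.2f}")
print(f"log         : {skew_log:.2f}")
print(f"square root : {skew_sqrt:.2f}")
print(f"cube root   : {skew_cbrt:.2f}")
```

For this sample the log transform brings the skewness close to 0, because the logarithm of lognormal data is normally distributed.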
Various Transformation
BOX Cox
The Box-Cox transformation depends on a single parameter \lambda,
which is searched over a range such as −5 to 5. The best value of \lambda is the
one for which the transformed data most closely approximates a normal curve.
This function requires input to be positive. Applying
the formula manually is laborious, so
many popular libraries provide this function.
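One such library function is `scipy.stats.boxcox`, which estimates \lambda by maximum likelihood when no value is supplied. The sketch below uses a simulated exponential sample as a stand-in for real data; the small offset that keeps the values strictly positive is an assumption of the example.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive data.
data = rng.exponential(scale=2.0, size=2000) + 0.01

# With no lambda given, boxcox returns the transformed data and the fitted lambda.
# The transform is (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0.
transformed, best_lambda = boxcox(data)

print(f"fitted lambda   = {best_lambda:.3f}")
print(f"skewness before = {skew(data):.3f}")
print(f"skewness after  = {skew(transformed):.3f}")
```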
How to handle an imbalanced dataset
– data approach
Kurtosis
• Kurtosis is a statistical measure that defines how heavily
the tails of a distribution differ from the tails of a normal
distribution
• Distributions with medium kurtosis (medium tails) are
mesokurtic.
• Distributions with low kurtosis (thin tails) are platykurtic.
• Distributions with high kurtosis (fat tails) are leptokurtic.
Kurtosis
• Skewness is a commonly used measure in
descriptive statistics that characterizes the
asymmetry of a data distribution, while kurtosis
describes the heaviness of the distribution tails.
Kurtosis is a useful measure of whether there is a
problem with outliers in a data set. A larger kurtosis
indicates more serious outlier problems, and may lead
the researcher to choose alternative statistical
methods.
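The three tail types can be illustrated with `scipy.stats.kurtosis`. By default it reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0. The three simulated samples below are assumptions chosen to represent each case.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
# Hypothetical samples, one per tail type.
normal_sample = rng.normal(size=20_000)          # mesokurtic
uniform_sample = rng.uniform(-1, 1, size=20_000) # platykurtic (thin tails)
laplace_sample = rng.laplace(size=20_000)        # leptokurtic (fat tails)

# fisher=True (the default) subtracts 3, giving excess kurtosis.
k_normal = kurtosis(normal_sample)
k_uniform = kurtosis(uniform_sample)
k_laplace = kurtosis(laplace_sample)

print(f"normal  : {k_normal:.2f}")
print(f"uniform : {k_uniform:.2f}")
print(f"laplace : {k_laplace:.2f}")
```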
Descriptive Statistics
of a Dataset
• Income-Expenditure Dataset
What is the Mean and Median Expense of a
Household?
• Income-Expenditure Dataset
income_df["Mthly_HH_Expense"].mean()
income_df["Mthly_HH_Expense"].median()
• Plot a histogram of the counts of the highest qualified
member
• Calculate the IQR (difference between the 75% and
25% quartiles)
• Calculate the standard deviation for the first 4 columns
• Calculate the variance for the first 3 columns
• Plot a histogram of the counts of
No_of_Earning_Members
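A minimal pandas sketch of these exercises. The rows below are made up for illustration and are not the real Income-Expenditure data. Only `Mthly_HH_Expense` and `No_of_Earning_Members` appear in the slides; the other column names are assumptions. Counts are computed with `value_counts()` instead of drawing the histograms, so the sketch runs without a plotting library.

```python
import pandas as pd

# Hypothetical rows standing in for the Income-Expenditure dataset.
income_df = pd.DataFrame({
    "Mthly_HH_Income": [5000, 6000, 10000, 10000, 12500, 14000],
    "Mthly_HH_Expense": [8000, 7000, 4500, 2000, 12000, 8000],
    "No_of_Fly_Members": [3, 2, 2, 1, 2, 2],          # assumed column name
    "No_of_Earning_Members": [1, 1, 1, 1, 2, 2],
    "Highest_Qualified_Member": [                      # assumed column name
        "Under-Graduate", "Illiterate", "Under-Graduate",
        "Illiterate", "Graduate", "Graduate",
    ],
})

expense = income_df["Mthly_HH_Expense"]
mean_expense = expense.mean()
median_expense = expense.median()

# IQR: difference between the 75% and 25% quantiles.
iqr_expense = expense.quantile(0.75) - expense.quantile(0.25)

# Standard deviation of the first 4 columns, variance of the first 3.
std_first4 = income_df.iloc[:, :4].std()
var_first3 = income_df.iloc[:, :3].var()

# Counts behind the two histograms; .plot(kind="bar") would draw them.
qualification_counts = income_df["Highest_Qualified_Member"].value_counts()
earning_counts = income_df["No_of_Earning_Members"].value_counts().sort_index()

print(mean_expense, median_expense, iqr_expense)
print(std_first4, var_first3, qualification_counts, earning_counts, sep="\n")
```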
Inferential Statistics
1.Normal deviate Z test
2.Student's T Test
3.One Sample T Test
4.Two Sample T Test
5.Chi square test
6.ANOVA
Application of statistics in data science and
modelling
1. Compare the given dataset characteristics (central values and
spread) with production data characteristics. Are they the same?
2. After fixing the missing values / outliers, does the data still
represent the process it is supposed to?
3. For classification problems, when we use the imblearn package to
address class imbalances, are the data distributions the same?
4. When we split data for training, validation and testing, do the
three datasets have similar characteristics?
5. When the models are built using multiple algorithms, are the
differences in the distributions of their scores significant?
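Question 4 can be examined with a two-sample test. The sketch below uses `scipy.stats.ks_2samp` (the two-sample Kolmogorov-Smirnov test) as one possible choice; the slides do not name a specific test. The feature values, the split sizes and the artificial shift of 10 are all hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
# Hypothetical numeric feature, shuffled and split 60/20/20.
feature = rng.normal(loc=100, scale=15, size=3000)
shuffled = rng.permutation(feature)
train, validation, test = np.split(shuffled, [1800, 2400])

# Same source distribution: the test usually finds no significant difference.
same_source = ks_2samp(train, test)

# Simulated drift: shifting the test set should be detected.
shifted = ks_2samp(train, test + 10)

print(f"train vs test         : D={same_source.statistic:.3f}, p={same_source.pvalue:.3f}")
print(f"train vs shifted test : D={shifted.statistic:.3f}, p={shifted.pvalue:.3g}")
```

A small p-value suggests that the two samples come from different distributions; a large one only means that no difference was detected.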
Q-Q plot (Quantile-Quantile Plot)
Since normal distribution is of so much importance, we need to check if
the collected data is normal or not.
Q stands for quantile and therefore, Q-Q plot represents
quantile-quantile plot.
QQ plots are very useful to determine:
• If two populations are of the same distribution
• If residuals follow a normal distribution. Having a normal error
term is an assumption in regression and we can verify if it’s met using this.
• Skewness of distribution.
If the data is normally distributed, the points in a Q-Q plot will lie
on a straight diagonal line.
Q-Q plot
• We plot the theoretical quantiles, i.e. the quantiles of the
standard normal variate (a normal distribution with a
mean of zero and a standard deviation of one), on the
x-axis.
• We plot the ordered values of the random variable that we
want to check for a Gaussian
distribution on the y-axis.
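The construction described above is what `scipy.stats.probplot` computes. The sketch below skips the drawing step and compares the correlation coefficient of the fitted line for a normal sample and a skewed sample; both samples are simulated and purely illustrative. Passing `plot=plt` (with matplotlib imported as `plt`) would draw the figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical samples: one normal, one right-skewed.
normal_data = rng.normal(loc=10, scale=2, size=500)
skewed_data = rng.exponential(scale=2, size=500)

# probplot returns (theoretical quantiles, ordered sample values) and the
# least-squares line through them as (slope, intercept, r).
(theoretical_q, ordered_values), (slope, intercept, r_normal) = stats.probplot(
    normal_data, dist="norm"
)
_, (_, _, r_skewed) = stats.probplot(skewed_data, dist="norm")

print(f"normal sample : r = {r_normal:.4f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"skewed sample : r = {r_skewed:.4f}")
```

An r value close to 1 means the points lie close to a straight line. For normal data the slope and intercept approximate the standard deviation and the mean.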
Q-Q plot examples (figures): normal distribution, uniform distribution, exponential distribution
Q-Q plots and skewness of data
• When the left side of the plot deviates from the line, the data is left-skewed.
When the right side of the plot deviates, it is right-skewed.
Independence
• The items in a sample are said to be
independent if knowing the values of some of
them does not help to predict the values of the
others.
• Items in a simple random sample may be
treated as independent in many cases
encountered in practice. The exception occurs
when the population is finite and the sample
consists of a substantial fraction (more than
5%) of the population.
Types of Experiments
• One-sample experiment: only one population of
interest, and a single sample is drawn from it.
• Multisample experiment: two or more
populations of interest, and a sample is drawn
from each population.
• Types of Data:
Visualization-Dot plot
• A dotplot is a graph that can be used to give a
rough impression of the shape of a sample. It is
useful when the sample size is not too large and
when the sample contains some repeated values.
Dot plot
Stem and leaf Plot
Each item in the sample is divided into two parts: a
stem, consisting of the leftmost one or two digits, and
the leaf, which consists of the next digit.
The stem consists of the tens digit and the leaf
consists of the ones digit. Each line of the stem-and-
leaf plot contains all of the sample items with a given
stem.
The stem-and-leaf plot is a compact way to represent
the data. It also gives some indication of its shape.
For the geyser data, we can see that there are
relatively few durations in the 60–69 minute interval,
compared with the 50–59, 70–79, or 80–89 minute
intervals.
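A stem-and-leaf plot for two-digit values can be sketched in a few lines of plain Python. The function name `stem_and_leaf` and the durations below are hypothetical; they are not the geyser data. Stems with no values are simply omitted in this sketch.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf lines: tens digit as the stem, ones digit as the leaf."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(groups.items())
    ]

# Hypothetical durations in minutes.
durations = [42, 45, 49, 51, 53, 53, 58, 60, 71, 75, 75, 80, 82, 86, 91]

lines = stem_and_leaf(durations)
for line in lines:
    print(line)
```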
Box plot
Histogram
• A histogram is a graphic that gives an idea of the
“shape” of a sample, indicating regions where
sample points are concentrated and regions
where they are sparse.
Steps
• Calculate frequency
• Calculate Relative frequency (Frequency/Total no.
of elements)
• Calculate Density (“Density” presents the relative
frequency divided by the class width.)
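The three steps can be sketched with NumPy. The data values and class boundaries below are hypothetical; unequal class widths are used on purpose so that density and relative frequency differ.

```python
import numpy as np

# Hypothetical data and class boundaries (widths 1, 2 and 2).
data = np.array([1.2, 1.9, 2.3, 2.8, 3.1, 3.4, 3.7, 4.5, 5.2, 5.9])
edges = np.array([1.0, 2.0, 4.0, 6.0])

# Step 1: frequency of each class.
frequency, _ = np.histogram(data, bins=edges)

# Step 2: relative frequency = frequency / total number of elements.
relative_frequency = frequency / frequency.sum()

# Step 3: density = relative frequency / class width.
class_width = np.diff(edges)
density = relative_frequency / class_width

print("frequency          :", frequency)
print("relative frequency :", relative_frequency)
print("density            :", density)
```

With densities on the vertical axis, the areas of the bars sum to 1; `np.histogram(data, bins=edges, density=True)` returns the same values directly.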
Unimodal and Bimodal
Histograms
• A histogram is unimodal if it has only one peak,
or mode, and bimodal if it has two clearly distinct
modes.
Multivariate Data
• Data for which each item consists of more than
one value is called multivariate data.
• When each item is a pair of values, the data are
said to be bivariate.
• One of the most useful graphical summaries for
numerical bivariate data is the scatterplot.
Solution
Summary Statistics for
Categorical Data