Module1 Introduction

Uploaded by

Kent Wells


Statistical Modelling

for Data Science

(20CSE743)

Mrs. Snigdha Sen

Associate Professor- CSE, GAT

PhD Scholar, IIIT-Allahabad
Syllabus
Course Outcome
Assignment (10 M)

Mini Project and short report / MOOC Course / Any Online Course
Class Work

• Demonstration of a few concepts using Google Colab once a week on a rotation basis
• Study material will be provided
Why learn this
• Seamless blending of Statistics + Data Science
• Wide scope for jobs in product-based companies
Outline & Content

• Introduction
• Standard Deviation
• Skewness
• Kurtosis
• Mean
• Applications
Data

• Data is the most important part of any analysis
• The characteristics of the data are very important
• Understanding your data is crucial
• A predictive model works better when the data is well understood
Data Science Vs Statistical Modelling

Data Science: Exploratory Data Analysis - missing values, visualization
Statistical Modelling: Data distribution, statistics, T-Test, Chi Square Test, ANOVA
Statistical Modelling

The science of statistics is the study of how to learn from data. It helps you collect the right data, perform the correct analysis, and effectively present the results with statistical knowledge. Statistical modeling is key to making scientific discoveries, data-driven decisions, and predictions.
Application

• Health Insurance Agency

Statistical Modelling
Statistical Modelling

Q-Q plot: In statistics, a Q–Q plot (quantile–quantile plot) is a probability plot, a graphical method for comparing two probability distributions.

Kernel Density: Kernel functions are used to estimate the density of random variables and as weighting functions in non-parametric regression.
Statistical Modelling

Statistical modeling is the process of applying statistical analysis to a dataset.

A statistical model is a mathematical representation (or mathematical model) of observed data.

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data.
Statistical Modelling
• Statistics is the grammar of science. – Karl Pearson
• A statistical model is non-deterministic, unlike other mathematical models in which variables have specific values. Variables in statistical models are stochastic, i.e. they have probability distributions.
• Statistical models help understand the characteristics of known data and estimate the
properties of large populations based on it. It’s the central idea behind
machine learning.
• It allows you to find an error bar or confidence interval based on sample size and other
factors. For example, an estimate X calculated from 10 samples would have a wider
confidence interval than an estimate Y calculated from 10000 samples.
• Statistical modeling also supports hypothesis testing. It provides statistical evidence for
the occurrence of specific events.
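The sample-size effect on the confidence interval mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the population mean of 50 and standard deviation of 10 are made up, the helper name `ci_half_width` is hypothetical, and the interval uses the normal approximation (z = 1.96 for roughly 95%).

```python
import numpy as np

def ci_half_width(sample, z=1.96):
    """Half-width of an approximate 95% confidence interval for the mean."""
    sample = np.asarray(sample, dtype=float)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    standard_error = sample.std(ddof=1) / np.sqrt(sample.size)
    return z * standard_error

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
small = rng.normal(loc=50, scale=10, size=10)       # assumed population: mean 50, SD 10
large = rng.normal(loc=50, scale=10, size=10_000)

print("n = 10     half-width:", ci_half_width(small))
print("n = 10000  half-width:", ci_half_width(large))
```

The interval computed from 10 observations should come out much wider than the one computed from 10000, because the standard error shrinks in proportion to 1/sqrt(n).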
Machine learning vs. statistical modeling
Statistics and machine learning (ML) differ primarily in their purposes. ML models are built to make accurate predictions without explicit programming, while statistical models can explain the relationships between variables.
Need of SM

• Choosing models that meet your needs

• Improved data preparation for analysis
• Enhanced communication skills
Where are statistical models
used?
Case Study
The experiment included a total of 122 primary care physicians
affiliated with one of three major hospitals in the Texas Medical
Center of Houston. These physicians were sent a packet
containing a medical chart similar to the one they view upon
seeing a patient. This chart portrayed a patient who was
displaying symptoms of a migraine headache but was otherwise
healthy. Two variables (the gender and the weight of the
patient) were manipulated across six different versions of the
medical charts. The weight of the patient, described in terms of
Body Mass Index (BMI), was average (BMI = 23), overweight
(BMI = 30), or obese (BMI = 36).
Data

Standard deviation describes how dispersed a set of data is. It compares each data point to the mean of all data points, and returns a calculated value that describes whether the data points are in close proximity or whether they are spread out.
Statistics In Data
Science
Mean: It measures the central tendency.

Spread: Describes how far the points typically vary from the mean.
Variance: The average of the squared distances of the points from the mean.
Statistics In Data
Science
Standard deviation: Square root of the variance. It answers, “What is the typical deviation of points from the mean value?”, in the same units as the data.

Median absolute deviation:
It has the same notion as standard deviation. It measures how far the points are from the central tendency, which is the median in this case.
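A minimal sketch of the measures above, using a small made-up sample with one outlier (the values are hypothetical, not from the course dataset). The variance here is the population form, which divides by n.

```python
import numpy as np

values = np.array([2.0, 3.0, 3.0, 4.0, 100.0])  # hypothetical data with one outlier

mean = values.mean()                       # central tendency
variance = values.var()                    # average squared distance from the mean (divides by n)
std_dev = np.sqrt(variance)                # square root of the variance
median = np.median(values)
mad = np.median(np.abs(values - median))   # median absolute deviation

print(f"mean={mean}, variance={variance:.2f}, std={std_dev:.2f}")
print(f"median={median}, MAD={mad}")
```

The single outlier pulls the mean and the standard deviation far from the bulk of the data, while the median and the median absolute deviation barely move.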
Gaussian Distribution — N(μ,σ)

Also known as the normal distribution, it is determined solely by two parameters: the mean (μ) and the standard deviation (σ). The standard normal distribution is the special case with μ = 0 and σ = 1.
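For reference, the density of a normal distribution with mean μ and standard deviation σ can be written as below. The second line is the standard normal case, obtained by setting μ = 0 and σ = 1.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^{2}}{2}\right)
```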
Gaussian Distribution — N(μ,σ)
Distribution
Standard Deviation

Standard deviation is a statistical measurement of how much the values in a series vary from their average.

A low standard deviation means that the values lie close to the average, so the average is a reliable summary of the data.

A high standard deviation means that the values are spread widely around the average, so the average is a less reliable summary.
Standard Deviation
Numerical Problem
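• Take the values 2, 1, 3, 2 and 4. Calculate the standard deviation.
Numerical Problem

The standard deviation of the values 2, 1, 3, 2 and 4 is approximately 1.02 when the squared deviations are divided by n = 5 (population formula), or approximately 1.14 when they are divided by n − 1 = 4 (sample formula).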
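Worked out step by step for the values 2, 1, 3, 2 and 4. The symbols σ (dividing by n) and s (dividing by n − 1) are notation introduced here for the population and sample forms.

```latex
\bar{x} = \frac{2 + 1 + 3 + 2 + 4}{5} = 2.4

\sum_{i=1}^{5} (x_i - \bar{x})^{2} = 0.16 + 1.96 + 0.36 + 0.16 + 2.56 = 5.2

\sigma = \sqrt{\frac{5.2}{5}} = \sqrt{1.04} \approx 1.02
\qquad
s = \sqrt{\frac{5.2}{4}} = \sqrt{1.3} \approx 1.14
```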
Statistical Analysis
Statistics

Descriptive: Describe and summarize data
Mean, median, mode, standard deviation, range

Inferential: Make inferences and draw conclusions about a population based on sample data
1. Student's T Test
2. One Sample T Test
3. Two Sample T Test
4. Chi square test
5. ANOVA
Summary Statistics
• The Sample Median

Find the sample median.

Answer: The sample median is the middle number, which is 68.31.
The Trimmed Mean
• Like the median, the trimmed mean is a
measure of center that is designed to be
unaffected by outliers.
• The trimmed mean is computed by arranging
the sample values in order, “trimming” an
equal number of them from each end, and
computing the mean of those remaining.
Problem 1

Compute the mean, median, and the 5%, 10%, and 20% trimmed means.

Solution
N=24
• Mean= 195.42.
• The median is the average of the 12th
and 13th numbers, which is (191 +
223)/2 = 207.00.
To compute the 5% trimmed mean, we must drop 5% of the
data from each end. This comes to (0.05)(24) = 1.2
observations.

We round 1.2 to 1, and trim one observation off each end.

The 5% trimmed mean is the average of the remaining 22
numbers:
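The trimming procedure described in this solution can be sketched in Python as follows. The data below are hypothetical, since the slide's 24 observations are not reproduced in this text, and the function name `trimmed_mean` is an illustrative choice.

```python
def trimmed_mean(values, percent):
    """Drop `percent`% of the observations from each end, then average the rest."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of observations to drop from each end, rounded to the nearest
    # integer (for n = 24 and 5%, 1.2 rounds to 1). Note that Python's round()
    # sends exact halves to the nearest even integer.
    k = round(n * percent / 100)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical data with one outlier
print("mean:            ", trimmed_mean(sample, 0))
print("10% trimmed mean:", trimmed_mean(sample, 10))
```

With one extreme value in the sample, the ordinary mean is pulled upward while the 10% trimmed mean stays near the center of the remaining values.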
Mode and Range
• The range is the difference between the largest and smallest values in a sample. It is a measure of spread.

There are three modes: 80, 179, and 232. Each of these values appears twice, and no other value appears more than once. The range is 470 − 30 = 440.
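A small sketch of finding the modes and the range with the Python standard library. The list below is a made-up sample chosen to have the same modes and range as the answer above; it is not the slide's dataset.

```python
from statistics import multimode

# Hypothetical sample: 80, 179 and 232 each appear twice
sample = [30, 80, 80, 179, 179, 232, 232, 470]

modes = multimode(sample)                 # every value tied for the highest count
value_range = max(sample) - min(sample)   # largest value minus smallest value

print("modes:", modes)
print("range:", value_range)
```

`statistics.multimode` (Python 3.8+) returns all tied modes, which is convenient when a sample has more than one.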
Quartiles
• Quartiles: Quartiles divide the ordered data set into 4 equal parts.
• There are three quartiles Q1, Q2 and Q3, where Q2 is the median of the distribution.
• Five number summary:
• Every dataset can be summarized using these 5 numbers
• Lowest value
• Q1: 25th percentile
• Q2: Median
• Q3: 75th percentile
• Highest value
Quartiles
• Quartiles divide the sample as nearly as possible into quarters.
• Steps to calculate a quartile
• Let n represent the sample size.
• Order the sample values from smallest to largest.
• To find the first quartile, compute the value 0.25(n + 1). The second quartile uses the value 0.5(n + 1), and the third quartile uses 0.75(n + 1).
• If this value is an integer, then the sample value in that position is the quartile.
• If not, then take the average of the sample values on either side of this position.
Quartiles

Solution
The sample size is n = 24. To find the first quartile, compute (0.25)(25) = 6.25. The first quartile is therefore found by averaging the 6th and 7th data points: (105 + 126)/2 = 115.5.

Third quartile: (0.75)(25) = 18.75, so average the 18th and 19th data points: (242 + 245)/2 = 243.5.
Interquartile Range
Interquartile range (IQR) is defined as the difference between the 75th percentile (Q3) and the 25th percentile (Q1): IQR = Q3 − Q1. For the quartiles computed above, IQR = 243.5 − 115.5 = 128.
Percentiles
• The pth percentile of a sample, for p between 0 and 100, divides the sample so that as nearly as possible p% of the sample values are below it and (100 − p)% are above it.
• Steps
• Order the sample values from smallest to largest, and then compute the quantity (p/100)(n + 1), where n is the sample size.
• If this quantity is an integer, the sample value in this position is the pth percentile. Otherwise, average the two sample values on either side.
• Find the 65th percentile of the asphalt data

The sample size is n = 24. To find the 65th percentile, compute (0.65)(25) = 16.25.

The 65th percentile is therefore found by averaging the 16th and 17th data points, when the sample is arranged in increasing order: (236 + 240)/2 = 238.
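The position rule used for quartiles and percentiles above can be sketched as one small function. The function name and the sample (the integers 1 to 24) are illustrative assumptions; the asphalt data themselves are not reproduced in this text. Library routines such as `numpy.percentile` interpolate linearly by default, so they may return slightly different values from this rule.

```python
def percentile_n_plus_1(values, p):
    """pth percentile using the (p/100)(n + 1) position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100          # 1-based position in the ordered sample
    if position < 1 or position > n:
        raise ValueError("percentile position falls outside the sample")
    lower = int(position)                 # integer part of the position
    if position == lower:
        return ordered[lower - 1]         # integer position: take that sample value
    # Otherwise average the sample values on either side of the position
    return (ordered[lower - 1] + ordered[lower]) / 2

sample = list(range(1, 25))               # hypothetical sample with n = 24
q1 = percentile_n_plus_1(sample, 25)      # position 6.25: average of 6th and 7th values
q3 = percentile_n_plus_1(sample, 75)      # position 18.75: average of 18th and 19th values
p65 = percentile_n_plus_1(sample, 65)     # position 16.25: average of 16th and 17th values
print("Q1:", q1, "Q3:", q3, "IQR:", q3 - q1, "65th percentile:", p65)
```

Quartiles are simply the cases p = 25, 50 and 75, so the same function also gives the interquartile range as Q3 − Q1.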

Sample Statistics and Population Parameters

• A numerical summary of a sample is called a statistic.
• A numerical summary of a population is called a parameter.
• Statistics are often used to estimate parameters.
Skewness in data
Skewness
• Skewness is asymmetry in the distribution of data: a skewed distribution of continuous data is not symmetric about its center.

• Skewed data can be of 2 types. Right-skewed data is also called positively skewed data, and left-skewed data is called negatively skewed data.

• Skewness = 0 indicates a symmetric distribution, i.e. the probability of falling on either side of the distribution’s mean is equal.
Skewed Distribution
Why is skewness a problem?

The tapering ends, or tail regions, of skewed data distributions behave like outliers, and outliers can severely damage the performance of a statistical model. A good example is regression models, which can show poor results when trained on skewed data.
Skewed Distribution

• In simple words, skewness is the measure of how much the probability distribution of a random variable deviates from the normal distribution.

Effects of skewed data
• Degrades the model’s ability (especially for regression-based models) to describe typical cases, as it has to deal with rare cases at extreme values.
• A model trained on right-skewed data will predict better on data points with lower values than on those with higher values.
• Skewed data also does not work well with many statistical methods. However, tree-based models are not affected.
Dealing with skewed data
• Log transformation: transforms a skewed distribution toward a normal distribution
• Remove outliers
• Normalize (min-max)
• Cube root: used when values are too large. Can be applied on negative values
• Square root: applied only to positive values
• Box Cox transformation: transforms a non-normal distribution to approximate a normal distribution
Some examples: Log Transform, Square root, Box Cox Transformation
Python package

• pip install scipy

• scipy.stats.skew()
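A short sketch of measuring skewness with `scipy.stats.skew` and checking the effect of a log transformation. The data are simulated from a lognormal distribution purely for illustration; the seed and parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Simulated right-skewed data (long right tail), strictly positive
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)

skew_before = skew(right_skewed)
skew_after = skew(np.log(right_skewed))   # log transform needs strictly positive values

print("skewness before log transform:", skew_before)
print("skewness after log transform: ", skew_after)
```

A positive value indicates right skew. After the log transform the skewness of this particular sample should be close to zero, because the logarithm of lognormal data is normally distributed.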
Various Transformation
BOX Cox
The Box-Cox transformation works by choosing a value of λ, which typically varies from −5 to 5. A value of λ is considered best if the transformed data approximates a normal curve most closely.

This function requires the input to be positive. Applying the formula manually is laborious, so many popular libraries provide this function.
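One such library routine is `scipy.stats.boxcox`, sketched below on simulated data (the lognormal sample and seed are assumptions for illustration). The transformation it applies is y = (x^λ − 1)/λ for λ ≠ 0 and y = log(x) for λ = 0; when no λ is supplied, SciPy estimates it by maximum likelihood.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(7)
# Simulated positive, right-skewed data; Box-Cox requires strictly positive input
data = rng.lognormal(mean=0.0, sigma=0.8, size=1_000)

# With lmbda left unset, boxcox returns the transformed data and the fitted lambda
transformed, best_lambda = boxcox(data)

print("fitted lambda:  ", best_lambda)
print("skewness before:", skew(data))
print("skewness after: ", skew(transformed))
```

For lognormal data the fitted λ should land near 0, which corresponds to a plain log transformation.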
How to handle an imbalanced dataset
– data approach
Kurtosis
• Kurtosis is a statistical measure of how heavily the tails of a distribution differ from the tails of a normal distribution.
• Distributions with medium kurtosis (medium tails) are mesokurtic.
• Distributions with low kurtosis (thin tails) are platykurtic.
• Distributions with high kurtosis (fat tails) are leptokurtic.
Kurtosis
Kurtosis
• Skewness is a commonly used measure in descriptive statistics that characterizes the asymmetry of a data distribution, while kurtosis describes the heaviness of the distribution's tails. Kurtosis is a useful measure of whether there is a problem with outliers in a data set. A larger kurtosis indicates a more serious outlier problem and may lead the researcher to choose alternative statistical methods.
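The three tail categories can be illustrated with `scipy.stats.kurtosis`, which by default reports excess kurtosis (a normal distribution scores about 0). The three simulated samples below are assumptions chosen to show one example of each category.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
n = 50_000
samples = {
    "uniform (platykurtic, thin tails)": rng.uniform(-1, 1, n),
    "normal (mesokurtic)": rng.normal(0, 1, n),
    "laplace (leptokurtic, fat tails)": rng.laplace(0, 1, n),
}

# Default fisher=True subtracts 3, so the normal distribution is the zero reference
excess = {name: kurtosis(x) for name, x in samples.items()}

for name, value in excess.items():
    print(f"{name}: excess kurtosis = {value:.2f}")
```

Negative excess kurtosis points to thinner tails than the normal distribution, and positive values point to heavier tails.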
Descriptive Statistics
of a Dataset
• Income-Expenditure Dataset
What is the Mean and Median Expense of a
Household?

• Income-Expenditure Dataset

income_df["Mthly_HH_Expense"].mean()

income_df["Mthly_HH_Expense"].median()
Plot the Histogram to count the Highest qualified
member
Calculate IQR (difference between the 75th and 25th percentiles)
Calculate Standard Deviation for first 4 columns

Calculate Variance for first 3 columns

Plot the Histogram to count the
No_of_Earning_Members
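A sketch of how these exercises might be approached with pandas. The DataFrame below is a hypothetical stand-in for `income_df`: only `Mthly_HH_Expense` and `No_of_Earning_Members` are column names that appear in the slides, and the other column names and all values are made up.

```python
import pandas as pd

# Hypothetical stand-in for income_df (values and two of the column names are assumed)
income_df = pd.DataFrame({
    "Mthly_HH_Income": [20000, 30000, 45000, 60000, 80000],
    "Mthly_HH_Expense": [8000, 12000, 15000, 20000, 25000],
    "No_of_Fly_Members": [3, 4, 4, 5, 6],
    "No_of_Earning_Members": [1, 1, 2, 2, 3],
})

expense = income_df["Mthly_HH_Expense"]
iqr = expense.quantile(0.75) - expense.quantile(0.25)   # 75th minus 25th percentile

std_first_four = income_df.iloc[:, :4].std()    # sample standard deviation (ddof=1)
var_first_three = income_df.iloc[:, :3].var()   # sample variance (ddof=1)

# Counts behind a histogram of the number of earning members
earning_counts = income_df["No_of_Earning_Members"].value_counts().sort_index()

print("IQR of monthly expense:", iqr)
print(std_first_four)
print(var_first_three)
print(earning_counts)
```

With matplotlib installed, `income_df["No_of_Earning_Members"].plot(kind="hist")` would draw the histogram itself. Note that `quantile` interpolates linearly by default, which can differ slightly from the (n + 1) position rule used earlier in these slides.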
Inferential Statistics

1. Normal deviate Z test

2. Student's T Test
3. One Sample T Test
4. Two Sample T Test
5. Chi square test
6. ANOVA
Application of statistics in data science and
modelling

1. Compare the given dataset characteristics (central values and spread) with production data characteristics. Are they the same?
2. After fixing the missing values / outliers, does the data still represent the process it is supposed to?
3. For classification problems, when we use the imblearn package to address class imbalances, are the data distributions the same?
4. When we split data for training, validation and testing, do the three datasets have similar characteristics?
5. When the models are built using multiple algorithms, are the differences in the distributions of their scores significant?
Q-Q plot (Quantile-Quantile Plot)
Since the normal distribution is so important, we need to check whether the collected data is normal or not.
Q stands for quantile; therefore, a Q-Q plot is a quantile-quantile plot.
Q-Q plots are very useful to determine:
• If two populations are of the same distribution
• If residuals follow a normal distribution. Having a normal error term is an assumption in regression, and we can verify whether it is met using this.
• Skewness of the distribution.
If the data is normally distributed, the points in a Q-Q plot will lie on a straight diagonal line.
Q-Q plot
• We plot the theoretical quantiles, basically known as the
standard normal variate (a normal distribution with
mean of zero and a standard deviation of one) on the x-
axis

• On the y-axis, we plot the ordered values of the random variable that we want to check for a Gaussian distribution.
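The two axes described above can be computed with `scipy.stats.probplot`, sketched here on simulated data (the samples, seed and parameters are illustrative assumptions). The function returns the theoretical quantiles, the ordered sample values, and a least-squares line fitted through the points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_sample = rng.normal(loc=10, scale=2, size=500)   # simulated Gaussian data
skewed_sample = rng.exponential(scale=2, size=500)      # simulated right-skewed data

# x-axis: theoretical standard normal quantiles; y-axis: ordered sample values
(theoretical_q, ordered_values), (slope, intercept, r_normal) = stats.probplot(
    normal_sample, dist="norm"
)
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("correlation with the straight line, normal sample:", r_normal)
print("correlation with the straight line, skewed sample:", r_skewed)
```

A correlation close to 1 means the points lie near a straight line, as expected for normal data; the skewed sample should score noticeably lower. Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the plot itself.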
Figure panels: Normal Distribution, Uniform Distribution, Exponential Distribution
Q-Q plots and skewness of data
• If the left end of the plot deviates from the line, the data is left-skewed. If the right end of the plot deviates, it is right-skewed.
Independence
• The items in a sample are said to be
independent if knowing the values of some of
them does not help to predict the values of the
others.
• Items in a simple random sample may be
treated as independent in many cases
encountered in practice. The exception occurs
when the population is finite and the sample
consists of a substantial fraction (more than
5%) of the population.
Types of Experiments
• One-sample experiment: only one population of
interest, and a single sample is drawn from it.
• Multisample experiment: two or more
populations of interest, and a sample is drawn
from each population.
• Types of Data:
Visualization-Dot plot
• A dotplot is a graph that can be used to give a
rough impression of the shape of a sample. It is
useful when the sample size is not too large and
when the sample contains some repeated values.
Dot plot
Stem and leaf Plot
Each item in the sample is divided into two parts: a
stem, consisting of the leftmost one or two digits, and
the leaf, which consists of the next digit.

The stem consists of the tens digit and the leaf consists of the ones digit. Each line of the stem-and-leaf plot contains all of the sample items with a given stem.

The stem-and-leaf plot is a compact way to represent the data. It also gives some indication of its shape. For the geyser data, we can see that there are relatively few durations in the 60–69 minute interval, compared with the 50–59, 70–79, or 80–89 minute intervals.
Box plot
Histogram
• A histogram is a graphic that gives an idea of the “shape” of a sample, indicating regions where sample points are concentrated and regions where they are sparse.
Steps
• Calculate frequency
• Calculate Relative frequency (Frequency / Total no. of elements)
• Calculate Density (“Density” presents the relative
frequency divided by the class width.)
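The three steps above can be sketched with NumPy as follows. The sample and the class boundaries are hypothetical, chosen with unequal class widths to show why density differs from relative frequency.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])   # hypothetical sample
edges = np.array([0, 2, 4, 6, 10])                # class boundaries (last class is wider)

frequency, _ = np.histogram(data, bins=edges)     # step 1: count per class
relative_frequency = frequency / frequency.sum()  # step 2: frequency / total no. of elements
class_width = np.diff(edges)
density = relative_frequency / class_width        # step 3: relative frequency / class width

for lo, hi, f, rf, d in zip(edges[:-1], edges[1:], frequency, relative_frequency, density):
    print(f"[{lo}, {hi}): frequency={f}, relative={rf:.2f}, density={d:.3f}")
```

Because each density is a relative frequency divided by its class width, the bar areas (density times width) add up to 1 even when the classes have different widths.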
Unimodal and Bimodal
Histograms
• A histogram is unimodal if it has only one peak,
or mode, and bimodal if it has two clearly distinct
modes.
Multivariate Data
• Data for which each item consists of more than
one value is called multivariate data.
• When each item is a pair of values, the data are
said to be bivariate.
• One of the most useful graphical summaries for
numerical bivariate data is the scatterplot.
Solution
Solution
Summary Statistics for
Categorical Data
