Descriptive Statistics

The document discusses descriptive statistics, focusing on measures of location (mean, median, mode) and measures of dispersion (range, variance, standard deviation). It explains the importance of feature engineering in machine learning and provides insights into percentiles and quartiles, including the interquartile range (IQR) for identifying outliers. Examples illustrate the concepts, including how to calculate outliers and visualize data using box plots.

Uploaded by

rgrewal112233

Descriptive Statistics

Measures of Location
• Measures of Location / Measures of Central Tendency : A single
value that represents the “centering” of a set of data, e.g. average
• Example: Marks obtained by 11 students, arranged in ascending
order … 45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91
• Possible measure of location: the middle value, 71 (the median)

Measures of Location: Mean, Mode, Median

Basic Usage
• Mean: Better if the data is normally distributed and there are no
outliers … Used for interval and ratio data
• Median: Better when the data is skewed (has extreme values) …
Used for ordinal, interval, and ratio data
• Mode: Useful for identifying the most common value or values in a
dataset … Can be used with all four measurement scales (nominal, ordinal,
interval, ratio) … Best for categorical data
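The guidance above can be illustrated with a minimal sketch using Python's standard `statistics` module. It compares the mean and median of the marks from the earlier example with those of a small illustrative data set containing one extreme value. The variable names are assumptions made for this sketch.

```python
from statistics import mean, median

# Marks from the earlier example (already sorted)
marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

# A small illustrative data set with one extreme value
with_outlier = [8, 11, 5, 9, 7, 6, 2500]

marks_mean = mean(marks)
marks_median = median(marks)
outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

# Without extreme values, mean and median are close to each other
print(f"Marks: mean = {marks_mean:.2f}, median = {marks_median}")

# One extreme value pulls the mean far away, while the median barely moves
print(f"With outlier: mean = {outlier_mean:.2f}, median = {outlier_median}")
```

In the second data set the mean is pulled far above every typical value, which is why the median is usually the safer summary for skewed data.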

[Figure: normally distributed data vs. skewed data]

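Mean
• Mean: The sum of all values divided by the number of values
Median
• Median: The middle value of the sorted data (the average of the two middle values when the number of values is even)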
Mode
• Mode: The value that occurs most frequently in a dataset
• Data: 62, 78, 84, 89, 91, 95, 97, 89, 91, 89
• Frequency: 62: 1, 78: 1, 84: 1, 89: 3, 91: 2, 95: 1, 97: 1
• Mode = 89
• What if multiple values share the same highest frequency? The data is
multimodal
• If we have two modes: bi-modal
• If we have three modes: tri-modal
• Not used much in practice
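The frequency count above can be reproduced with a short sketch using `collections.Counter` and `statistics.multimode` from the Python standard library. The `bimodal` list is a made-up example added only to show the multimodal case.

```python
from collections import Counter
from statistics import multimode

# Data from the mode example
data = [62, 78, 84, 89, 91, 95, 97, 89, 91, 89]

# Count how often each value occurs
frequency = Counter(data)

# multimode returns every value that shares the highest frequency
modes = multimode(data)

# Hypothetical data with two modes (bi-modal)
bimodal = [1, 2, 2, 3, 3, 4]
bimodal_modes = multimode(bimodal)

print(dict(frequency))
print("Mode(s):", modes)
print("Bi-modal example:", bimodal_modes)
```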
Feature Engineering
• Feature engineering: Transform raw data into meaningful features
• Why? Improve the performance of the machine learning models
• How?
• Create new columns (From Date of purchase, create weekday/weekend)
• Scale features (Bring features on the same scale, e.g. age and income)
• Encode categorical features (Gender: Convert F = 0, M = 1), since ML models
work with numeric data
• Handle missing data (Drop, Indicate using a Missing flag, or Impute with
mean/mode/median)
• Feature selection (Keep only the most relevant features)
• Feature interaction (From unit price and quantity, create bill amount)
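Several of the steps listed above can be sketched in plain Python. The records, field names, and the choice of median imputation below are all hypothetical and chosen only for illustration. Scaling and feature selection are left out to keep the sketch short.

```python
from datetime import date
from statistics import median

# Hypothetical raw records; None marks a missing age
rows = [
    {"purchase_date": date(2024, 1, 6), "gender": "F", "age": 34,
     "unit_price": 20.0, "quantity": 3},
    {"purchase_date": date(2024, 1, 8), "gender": "M", "age": None,
     "unit_price": 5.5, "quantity": 10},
    {"purchase_date": date(2024, 1, 9), "gender": "F", "age": 52,
     "unit_price": 12.0, "quantity": 1},
]

# Median of the known ages, used to impute missing ages
age_median = median(r["age"] for r in rows if r["age"] is not None)

# Encoding from the slide: F = 0, M = 1
gender_codes = {"F": 0, "M": 1}

features = []
for r in rows:
    age_missing = r["age"] is None
    features.append({
        # New column: weekday() returns 5 for Saturday and 6 for Sunday
        "is_weekend": int(r["purchase_date"].weekday() >= 5),
        # Encode the categorical feature as a number
        "gender_code": gender_codes[r["gender"]],
        # Handle missing data: impute and keep a Missing flag
        "age": age_median if age_missing else r["age"],
        "age_missing": int(age_missing),
        # Feature interaction: unit price x quantity
        "bill_amount": r["unit_price"] * r["quantity"],
    })

for f in features:
    print(f)
```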
Measures of Dispersion
• Spread / Measures of Dispersion / Scatter: How, and by how much, is
our data set spread out around its center?

Measures of Dispersion: Range, Variance, Standard Deviation

Range
• Range: Difference between the maximum value and the minimum
value in the data set
• Affected by outliers

• Example: 8, 11, 5, 9, 7, 6, 2500

• Range = Max – Min = 2500 – 5 = 2495, which is quite meaningless
• Solution: Inter Quartile Range (IQR)
• But first, we need to understand percentile and quartile
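A small sketch shows how strongly a single extreme value affects the range. The helper name `value_range` is an assumption made for this illustration.

```python
# Data from the range example
data = [8, 11, 5, 9, 7, 6, 2500]

def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

# Range of the full data set, including the extreme value
full_range = value_range(data)

# Range after dropping the single extreme value 2500
trimmed_range = value_range([v for v in data if v != 2500])

print("Range with 2500:", full_range)
print("Range without 2500:", trimmed_range)
```

Removing one observation changes the range from 2495 to 6, which is why the range alone is a fragile measure of spread.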
Percentile
• Percentile (relative) ≠ Percentage (absolute)
• Percentile: A value below which a certain percentage of observations lie
• Splits the data into two parts: below a certain cut-off, and above the
same cut-off
• kth percentile = k% of the data is below it, and the rest is above it
• Examples:
• If you are in the 90th percentile in an examination, 90% of the students scored
below you and 10% scored above you
• If a patient’s blood pressure is in the 60th percentile, 60% of patients have a
blood pressure lower than this patient, and 40% have a higher blood pressure than
this patient
• Median = 50th percentile
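The definition above can be turned around: given a value, count what percentage of the observations lie below it. The function name `percent_below` and the use of a strict "less than" comparison are assumptions for this sketch. Other references count ties differently.

```python
def percent_below(values, x):
    """Percentage of observations strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Marks from the measures-of-location example
marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

# 9 of the 11 marks are below 88
rank_of_88 = percent_below(marks, 88)

# 5 of the 11 marks are below the median mark 71
rank_of_71 = percent_below(marks, 71)

print(f"88 sits at roughly the {rank_of_88:.0f}th percentile")
print(f"71 sits at roughly the {rank_of_71:.0f}th percentile")
```

With only 11 observations the median does not land on exactly 50% under this strict counting. The gap shrinks as the data set grows.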
Percentile Example
• General graph: Score at the 62nd percentile
Percentile Example
• In some references, we might see Number of observations, rather than
Number of observations + 1 … Generally does not make a big difference
Percentile Example
• US Household Net Worth and Percentile (Source:
https://finance.yahoo.com/news/wealthy-net-worth-considered-poor-190014440.html)

Category      Percentile  Net Worth
Poor          20th        $10,000
Middle class  50th        $281,000
Wealthy       90th        $1.9 million
Quartile
• Q1 = 25th percentile, Q2 = 50th percentile (median), Q3 = 75th percentile

Inter Quartile Range (IQR)
• Inter Quartile Range (IQR) = Q3 – Q1 = Spread of the middle 50% of the data
• In the given example: IQR = Q3 – Q1 = 95.5 – 82 = 13.5
• Handles outliers better than range, since the extreme values at both the
ends are ignored in IQR
• Since it uses percentiles rather than actual values, it is less affected by
skewed data (See Skewness)
• Outliers: Data points that are significantly outside of the typical range of
values
• Lower bound: Q1 – (1.5 * IQR) = 82 – (1.5 * 13.5) = 61.75
• Upper bound: Q3 + (1.5 * IQR) = 95.5 + (1.5 * 13.5) = 115.75
• Points below the lower bound or above the upper bound are outliers
• In our example, there are no such points, so we do not have any outliers
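The bound calculation can be written as a small helper. The function name `iqr_bounds` and the parameter `k` are assumptions for this sketch. The quartile values are the ones quoted in the example above.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) outlier bounds for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from the example: Q1 = 82, Q3 = 95.5, so IQR = 13.5
lower, upper = iqr_bounds(82, 95.5)

print("Lower bound:", lower)
print("Upper bound:", upper)
```

Any observation below `lower` or above `upper` would be flagged as an outlier under the 1.5 * IQR rule.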
Outlier Example
• Commute times for 14 randomly selected adults in minutes: 16, 8, 35, 17,
13, 15, 15, 5, 16, 25, 20, 20, 12, 10
• Find outliers and draw a box plot
• Solution: First sort them: 5, 8, 10, 12, 13, 15, 15, 16, 16, 17, 20, 20, 25, 35
• Create a 5-number summary: Minimum, Q1, Q2, Q3, Maximum = 5, 12,
15.5, 20, and 35
• Outliers
• First calculate 1.5 * IQR = 1.5 x (20 – 12) = 1.5 x 8 = 12
• Outlier bounds: Q1 – 12 = 12 – 12 = 0 and Q3 + 12 = 20 + 12 = 32
• So, outliers = Commute time < 0 or > 32; here, 35 is the only outlier
• Boxplot: Draw a box from 12 to 20; draw a median line at 15.5; draw whiskers
from 5 to 25 (the smallest and largest values within the bounds); show the
outlier point 35 separately (See next slide)
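The hand calculation above can be checked with a short sketch. It assumes the "median of each half" convention for quartiles, which reproduces the values in this example. Library routines that interpolate between data points may return slightly different quartiles. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, Q2, Q3, maximum using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # When n is odd, skip the middle value so it belongs to neither half
    upper_half = data[half + (n % 2):]
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]

commute = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

minimum, q1, q2, q3, maximum = five_number_summary(commute)

# 1.5 * IQR rule for the outlier bounds
step = 1.5 * (q3 - q1)
lower_bound = q1 - step
upper_bound = q3 + step
outliers = [t for t in commute if t < lower_bound or t > upper_bound]

print("Five-number summary:", minimum, q1, q2, q3, maximum)
print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers)
```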
Outlier Code
import matplotlib.pyplot as plt
import seaborn as sns

# Data
commuter_times = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

# Create the box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=commuter_times, orient='h')

# Add titles and labels
plt.title('Box Plot of Commuter Times')
plt.xlabel('Minutes')

# Show the plot
plt.show()
Resulting Boxplot
