Descriptive Statistics
Measures of Location
• Measures of Location / Measures of Central Tendency : A single
value that represents the “centering” of a set of data, e.g. average
• Example: Marks obtained by 11 students, arranged in ascending
order … 45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91
• Possible measure of location: the middle value, 71 … 45, 56, 61, 65, 68, [71], 73, 79, 82, 88, 91
Measures of Location
Mean Mode Median
Basic Usage
• Mean: Better if the data is normally distributed and there are no
outliers … Used for interval and ratio data
• Median: Better when the data is skewed (has extreme values) …
Used for ordinal, interval, and ratio data
• Mode: Useful for identifying the most common value or values in a
dataset … Can be used with all four measurement scales (nominal, ordinal, interval, ratio) … Best for categorical data
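The difference between mean and median under extreme values can be checked with a short sketch. It uses the marks from the example above; the mistyped value 910 is a hypothetical data-entry error introduced only for illustration.

```python
import statistics

# Marks from the example above (11 values, already sorted)
marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

mean_marks = statistics.mean(marks)      # uses every value
median_marks = statistics.median(marks)  # middle value of the sorted data

# Hypothetical data-entry error: the top mark 91 is typed as 910
marks_with_outlier = marks[:-1] + [910]

mean_outlier = statistics.mean(marks_with_outlier)
median_outlier = statistics.median(marks_with_outlier)

print(f"Clean data:   mean = {mean_marks:.2f}, median = {median_marks}")
print(f"With outlier: mean = {mean_outlier:.2f}, median = {median_outlier}")
```

The single extreme value roughly doubles the mean, while the median stays at 71.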
[Figure: Normally distributed data vs. skewed data]
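Mean
• Mean: The sum of all the values divided by the number of values
Median
• Median: The middle value when the data is sorted (average of the two middle values if the count is even)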
Mode
• Mode: The value that occurs most frequently in a dataset
• Data: 62, 78, 84, 89, 91, 95, 97, 89, 91, 89
• Frequency: 62: 1, 78: 1, 84: 1, 89: 3, 91: 2, 95: 1, 97: 1
• Mode = 89
• What if multiple values share the same highest frequency? The data is multimodal
• Two modes: bi-modal
• Three modes: tri-modal
• Not used much in practice
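A minimal sketch of the mode calculation, using the data from this slide. The bi-modal variant adds one extra 91, which is a hypothetical value chosen only to create a tie.

```python
from collections import Counter
import statistics

data = [62, 78, 84, 89, 91, 95, 97, 89, 91, 89]

# Frequency table: value -> number of occurrences
frequency = Counter(data)
print(frequency.most_common(2))

# multimode returns every value that shares the highest frequency
modes = statistics.multimode(data)
print(modes)

# Hypothetical bi-modal variation: one extra 91 ties with 89
bimodal = data + [91]
bimodal_modes = statistics.multimode(bimodal)
print(bimodal_modes)
```

`statistics.multimode` is used instead of `statistics.mode` because it reports all tied values rather than only the first one.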
Feature Engineering
• Feature engineering: Transform raw data into meaningful features
• Why? Improve the performance of the machine learning models
• How?
• Create new columns (From Date of purchase, create weekday/weekend)
• Scale features (Bring features on the same scale, e.g. age and income)
• Encode categorical features (Gender: Convert F = 0, M = 1), since ML models
work with numeric data
• Handle missing data (Drop, Indicate using a Missing flag, or Impute with
mean/mode/median)
• Feature selection (Keep only the most relevant features)
• Feature interaction (From unit price and quantity, create bill amount)
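Several of the steps above can be sketched in plain Python. The records, field names, and values below are all invented for illustration; a real project would more likely use a data-frame library. Feature selection is not shown.

```python
from datetime import date
import statistics

# Hypothetical raw purchase records (all values invented for illustration)
raw = [
    {"purchase_date": date(2024, 3, 4), "gender": "F", "age": 25,
     "unit_price": 20.0, "quantity": 3},
    {"purchase_date": date(2024, 3, 9), "gender": "M", "age": None,
     "unit_price": 5.0, "quantity": 10},
    {"purchase_date": date(2024, 3, 10), "gender": "F", "age": 45,
     "unit_price": 12.5, "quantity": 2},
]

gender_codes = {"F": 0, "M": 1}  # encode a categorical feature

# Median of the known ages, used to impute the missing one
known_ages = [r["age"] for r in raw if r["age"] is not None]
median_age = statistics.median(known_ages)

features = []
for r in raw:
    features.append({
        # New column: Saturday (5) and Sunday (6) count as weekend
        "is_weekend": int(r["purchase_date"].weekday() >= 5),
        "gender_code": gender_codes[r["gender"]],
        # Missing flag plus median imputation
        "age_missing": int(r["age"] is None),
        "age": r["age"] if r["age"] is not None else median_age,
        # Feature interaction: unit price x quantity
        "bill_amount": r["unit_price"] * r["quantity"],
    })

# Min-max scaling brings age onto the 0..1 range
ages = [f["age"] for f in features]
lo, hi = min(ages), max(ages)
for f in features:
    f["age_scaled"] = (f["age"] - lo) / (hi - lo)

for f in features:
    print(f)
```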
Measures of Dispersion
• Spread / Measures of Dispersion / Scatter: How, and by how much,
is our data set spread out around its center?
Measures of Dispersion
Range Variance Standard Deviation
Range
• Range: Difference between the maximum value and the minimum
value in the data set
• Affected by outliers
[Figure: Range shown as the distance between Minimum and Maximum]
• Example: 8, 11, 5, 9, 7, 6, 2500
• Range = Max – Min = 2500 – 5 = 2495, which says little about the typical spread, since the other six values lie between 5 and 11
• Solution: Inter Quartile Range (IQR)
• But first, we need to understand percentile and quartile
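The effect of the single extreme value on the range can be checked directly. This is a small sketch using the example data above; dropping 2500 by hand is only for comparison and is not a general outlier rule.

```python
data = [8, 11, 5, 9, 7, 6, 2500]

# Range = maximum - minimum
data_range = max(data) - min(data)

# Same calculation with the extreme value removed by hand
without_extreme = [x for x in data if x != 2500]
range_without_extreme = max(without_extreme) - min(without_extreme)

print(data_range, range_without_extreme)
```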
Percentile
• Percentile (relative) ≠ Percentage (absolute)
• Percentile: A value below which a certain percentage of observations lie
• Splits the data into two parts: below a certain cut-off and above the
same cut-off
• kth percentile = k% of the data is below it, and the rest is above it
• Examples:
• If you are in the 90th percentile in an examination, 90% of the students are below you and
10% of the students are above you
• If a patient’s blood pressure is in the 60th percentile, 60% of the patients have a blood
pressure lower than this patient, and 40% of the patients have a higher blood pressure than
this patient
• Median = 50th percentile
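The "percentage of observations below a value" idea can be sketched as a small helper. The function name `percent_below` is hypothetical, and counting only values strictly below the cut-off is one convention among several.

```python
def percent_below(values, x):
    """Percentage of observations strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Marks of the 11 students from the Measures of Location example
marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

pct_82 = percent_below(marks, 82)  # 8 of the 11 marks are below 82
pct_71 = percent_below(marks, 71)  # 5 of the 11 marks are below the median

print(f"{pct_82:.1f}% of marks are below 82")
print(f"{pct_71:.1f}% of marks are below 71")
```

With a small sample the median does not land at exactly 50% under this convention, because the median value itself is not counted as "below".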
Percentile Example
• [Figure: General graph showing the score at the 62nd percentile]
• Note: In some references, we might see Number of observations, rather than Number of observations + 1, in the percentile formula … Generally does not make a big difference
US Household Net Worth and Percentile (Source:
https://finance.yahoo.com/news/wealthy-net-worth-considered-poor-190014440.html)
Category | Percentile | Net Worth
Poor | 20th | $10,000
Middle class | 50th | $281,000
Wealthy | 90th | $1.9 million
Quartile
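• Quartiles: Three cut points that split the sorted data into four equal parts
• Q1 = 25th percentile, Q2 = 50th percentile (median), Q3 = 75th percentile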
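Quartiles can be computed with Python's standard library, as a sketch. The data is the marks example from the Measures of Location slide. `statistics.quantiles` with `method="exclusive"` places a percentile using the number of observations + 1, while `method="inclusive"` uses a different convention; which one a given course or tool uses is an assumption to check.

```python
import statistics

marks = [45, 56, 61, 65, 68, 71, 73, 79, 82, 88, 91]

# n=4 asks for the three cut points Q1, Q2, Q3
q_exclusive = statistics.quantiles(marks, n=4, method="exclusive")
q_inclusive = statistics.quantiles(marks, n=4, method="inclusive")

print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
```

The two conventions agree on Q2 and differ slightly on Q1 and Q3 for this data set.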
Inter Quartile Range (IQR)
• Inter Quartile Range (IQR) = Q3 – Q1 = Spread of the middle 50% of the data
• In the given example: IQR = Q3 – Q1 = 95.5 – 82 = 13.5
• Handles outliers better than range, since the extreme values at both the
ends are ignored in IQR
• Since it uses percentiles rather than actual values, it is less affected by
skewed data (See Skewness)
• Outliers: Data points that are significantly outside of the typical range of
values
• Lower bound: Q1 – (1.5 * IQR) = 82 – (1.5 * 13.5) = 61.75
• Upper bound: Q3 + (1.5 * IQR) = 95.5 + (1.5 * 13.5) = 115.75
• Points below the lower bound or above the upper bound are outliers
• In our example, there are no such points, so we do not have any outliers
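The 1.5 * IQR rule can be written as a small helper. The function name `outlier_bounds` is hypothetical; the quartile values are the ones quoted in the example (Q1 = 82, Q3 = 95.5). Note that the lower bound starts from Q1 and the upper bound starts from Q3.

```python
def outlier_bounds(q1, q3, k=1.5):
    """Return (lower bound, upper bound) for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = outlier_bounds(82, 95.5)
print(lower, upper)

def is_outlier(x):
    """True if x lies outside the bounds computed above."""
    return x < lower or x > upper
```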
Outlier Example
• Commute times for 14 randomly selected adults in minutes: 16, 8, 35, 17,
13, 15, 15, 5, 16, 25, 20, 20, 12, 10
• Find outliers and draw a box plot
• Solution: First sort them: 5, 8, 10, 12, 13, 15, 15, 16, 16, 17, 20, 20, 25, 35
• Create a 5-number summary: Minimum, Q1, Q2, Q3, Maximum = 5, 12,
15.5, 20, and 35 (Q1 and Q3 are taken here as the medians of the lower and upper halves of the sorted data)
• Outliers
• First calculate 1.5 * IQR = 1.5 * (20 – 12) = 1.5 * 8 = 12
• Bounds: Q1 – 12 = 12 – 12 = 0 and Q3 + 12 = 20 + 12 = 32
• So, outliers = Commute time < 0 or > 32 … Here, only 35 is an outlier
• Boxplot: Draw a box from 12 to 20; Draw a median line at 15.5; Draw whiskers from the box to 5 and to 25 (the smallest and largest values that are not outliers); Show 35 as a separate outlier point (See next slide)
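The worked solution can be reproduced with a short sketch. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, which matches the values 12 and 20 above; library functions that interpolate may return slightly different quartiles. The helper name `five_number_summary` is hypothetical.

```python
import statistics

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum using the median-of-halves method.

    Assumes at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]  # skips the middle value when the count is odd
    return (data[0], statistics.median(lower_half), statistics.median(data),
            statistics.median(upper_half), data[-1])

commute = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

minimum, q1, q2, q3, maximum = five_number_summary(commute)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in commute if x < lower_bound or x > upper_bound]
# Whiskers stop at the most extreme values that are not outliers
whisker_low = min(x for x in commute if x >= lower_bound)
whisker_high = max(x for x in commute if x <= upper_bound)

print("Summary:", minimum, q1, q2, q3, maximum)
print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers)
print("Whiskers:", whisker_low, whisker_high)
```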
Outlier Code
import matplotlib.pyplot as plt
import seaborn as sns

# Data: commute times in minutes
commuter_times = [16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10]

# Create a horizontal box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=commuter_times, orient='h')

# Add titles and labels
plt.title('Box Plot of Commuter Times')
plt.xlabel('Minutes')

# Show the plot
plt.show()
Resulting Boxplot