MODULE–1: EXPLORATORY DATA ANALYSIS (EDA)
(As per VTU syllabus – Chapter 1)
These notes combine theory + numerical problems + code snippets in an exam-oriented format,
suitable for 5, 8, and 10 mark answers.
1. INTRODUCTION TO EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is the process of summarizing, visualizing, and understanding data
before applying advanced statistical or machine learning techniques. The main goals of EDA are:
• To understand the central tendency of data
• To measure variability or spread
• To identify outliers and anomalies
• To study relationships between variables
EDA relies heavily on robust statistics and visual tools rather than strict probabilistic assumptions.
2. ESTIMATES OF LOCATION
2.1 Definition
Estimates of location describe the central or typical value around which the data is distributed.
2.2 Measures of Location
(a) Mean
• Arithmetic average of all observations
• Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Sensitive to outliers
Use: Symmetric data without extreme values
(b) Median
• Middle value of ordered data
• Robust to outliers
Use: Skewed distributions (income, house prices)
(c) Trimmed Mean
• Mean after removing a fixed percentage of lowest and highest values
• Provides balance between mean and median
(d) Weighted Mean
• Assigns different importance (weights) to observations
• Formula: $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$
(e) Weighted Median
• Value at which the cumulative weight of the sorted data reaches half of the total weight
• Highly robust
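The weighted mean and weighted median can be sketched as below. The values and weights are hypothetical, and weighted_median is an illustrative helper using one common convention (the smallest value whose cumulative weight reaches half the total); other conventions exist.

```python
import numpy as np

# Hypothetical data: four values with made-up weights
values = np.array([10.0, 20.0, 30.0, 40.0])
weights = np.array([1.0, 1.0, 1.0, 5.0])

# Weighted mean: sum(w_i * x_i) / sum(w_i)
weighted_mean = np.average(values, weights=weights)

def weighted_median(x, w):
    """Smallest value whose cumulative weight reaches half the total weight."""
    order = np.argsort(x)              # sort values, carry weights along
    x_sorted, w_sorted = x[order], w[order]
    cum_w = np.cumsum(w_sorted)        # running total of weights
    idx = np.searchsorted(cum_w, 0.5 * cum_w[-1])
    return x_sorted[idx]

print(weighted_mean)                     # (10 + 20 + 30 + 5*40) / 8
print(weighted_median(values, weights))
```

Because the value 40 carries most of the weight, both measures are pulled toward it.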
2.3 Numerical Problem (10 Marks)
Problem: Given the data: 20, 22, 25, 27, 28, 30, 32, 35, 150
Find: 1. Mean 2. Median 3. 10% Trimmed Mean
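Solution:
• Mean = 369 / 9 = 41
• Median = 5th of the 9 ordered values = 28
• Trimmed Mean: 10% of 9 values ≈ 1 value from each end, so drop 20 and 150: 199 / 7 ≈ 28.43
Conclusion: The median is the best measure here because the outlier (150) pulls the mean up to 41.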
2.4 Python Code
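import numpy as np
from scipy import stats
data = [20, 22, 25, 27, 28, 30, 32, 35, 150]
print(np.mean(data))    # 41.0
print(np.median(data))  # 28.0
# trim_mean drops int(proportion * n) values from each end; with n = 9,
# 0.1 drops none, so 0.12 is used here to drop one value from each end.
print(stats.trim_mean(data, 0.12))  # ≈ 28.43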
3. ESTIMATES OF VARIABILITY
3.1 Definition
Variability measures the spread or dispersion of data around its center.
3.2 Measures of Variability
(a) Range
• Difference between maximum and minimum
• Very sensitive to outliers
(b) Variance
• Average of squared deviations from mean
• Formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
(c) Standard Deviation
• Square root of variance
• Same unit as data
(d) Mean Absolute Deviation
• Average of absolute deviations from the mean
(e) Median Absolute Deviation (MAD)
• Median of absolute deviations from median
• Highly robust
(f) Interquartile Range (IQR)
• Difference between Q3 and Q1
• Resistant to outliers
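A minimal sketch of the robust and non-robust spread measures listed above, using the data from Section 2.3. The quartiles assume NumPy's default linear-interpolation method; other quartile conventions can give slightly different IQR values.

```python
import numpy as np

data = np.array([20, 22, 25, 27, 28, 30, 32, 35, 150])  # data from Section 2.3

# Range: maximum minus minimum (dominated by the outlier 150)
data_range = data.max() - data.min()

# Mean absolute deviation: average distance from the mean
mean_abs_dev = np.mean(np.abs(data - np.mean(data)))

# Median absolute deviation (MAD): median distance from the median
mad = np.median(np.abs(data - np.median(data)))

# Interquartile range: Q3 - Q1
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(data_range, mean_abs_dev, mad, iqr)
```

The range and mean absolute deviation are inflated by the value 150, while MAD and IQR stay close to the spread of the remaining values.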
3.3 Numerical Problem
Problem: Data: 2, 4, 6, 8, 10
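Results:
• Mean = 6, so variance = (16 + 4 + 0 + 4 + 16) / (5 − 1) = 10
• Standard deviation = √10 ≈ 3.16
• IQR = Q3 − Q1 = 8 − 4 = 4 (quartiles by linear interpolation)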
3.4 Python Code
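import numpy as np
data = [2, 4, 6, 8, 10]
print(np.var(data, ddof=1))   # sample variance: 10.0
print(np.std(data, ddof=1))   # sample standard deviation ≈ 3.16
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)                # IQR: 4.0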
4. EXPLORING DATA DISTRIBUTIONS
4.1 Purpose
To understand the shape, skewness, spread, and outliers of the data.
4.2 Visualization Techniques
(a) Boxplot
• Shows median, quartiles, IQR, outliers
• Textbook Figure: Fig 1-2
(b) Histogram
• Frequency distribution using bins
• Textbook Figure: Fig 1-3
(c) Density Plot
• Smoothed histogram
• Textbook Figure: Fig 1-4
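The numbers behind a histogram and a boxplot can be computed without plotting, as sketched below. The sample is hypothetical, and the 1.5 × IQR rule for flagging outliers is an assumed convention (it is a common boxplot default, not something stated in these notes).

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])

# Histogram: number of observations in each of 4 equal-width bins
counts, edges = np.histogram(data, bins=4)

# Boxplot: quartiles and IQR
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(counts, edges)
print(q1, median, q3, outliers)
```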
4.3 Interpretation Problem
If a histogram has a long right tail, the distribution is positively skewed, and typically Mean > Median.
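A quick numerical check of this rule, assuming a hypothetical income sample with one very large value. The skewness here is the simple moment-based coefficient; libraries may apply bias corrections that change the value slightly but not the sign.

```python
import numpy as np

# Hypothetical incomes (in thousands) with a long right tail
incomes = np.array([20, 25, 30, 35, 40, 45, 200])

mean_income = incomes.mean()
median_income = np.median(incomes)

# Moment-based skewness: third central moment / (second central moment)^1.5
dev = incomes - mean_income
skewness = np.mean(dev**3) / np.mean(dev**2) ** 1.5

print(mean_income, median_income, skewness)
```

A positive skewness together with mean > median indicates a right-skewed sample.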
4.4 Python Code
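import matplotlib.pyplot as plt
data = [20, 22, 25, 27, 28, 30, 32, 35, 150]  # data from Section 2.3
plt.hist(data)   # histogram with the default 10 bins
plt.show()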
5. EXPLORING BINARY AND CATEGORICAL DATA
5.1 Binary Data
• Two outcomes (Yes/No, 0/1)
5.2 Categorical Data
• Multiple categories (Grade, Gender, Department)
5.3 Summary Measures
• Proportions
• Percentages
• Mode
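These summary measures can be sketched with the standard library alone. The grade list below is hypothetical.

```python
from collections import Counter

# Hypothetical categorical data: grades of 8 students
grades = ["A", "B", "B", "C", "B", "A", "C", "B"]

counts = Counter(grades)          # frequency of each category
n = len(grades)

proportions = {g: c / n for g, c in counts.items()}
percentages = {g: 100 * p for g, p in proportions.items()}
mode = counts.most_common(1)[0][0]   # most frequent category

print(proportions)
print(percentages)
print(mode)
```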
5.4 Expected Value
$EV = \sum_i p_i x_i$, where $p_i$ is the probability of outcome $x_i$
5.5 Numerical Problem
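Problem: Profit is ₹1000 with probability 0.1, ₹500 with probability 0.3, and ₹0 with probability 0.6. Find the expected profit.
Solution: EV = 0.1(1000) + 0.3(500) + 0.6(0) = 100 + 150 + 0
Expected profit = ₹250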
5.6 Python Code
values = [1000, 500, 0]
prob = [0.1, 0.3, 0.6]
print(sum(v*p for v,p in zip(values, prob)))
6. EXPLORING TWO OR MORE VARIABLES
6.1 Correlation
• Measures linear relationship
• Range: −1 to +1
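As an illustration, Pearson's correlation coefficient can be computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The x and y values below are hypothetical.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from the definition
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(r)                        # 6 / sqrt(60)
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in result for comparison
```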
6.2 Scatter Plot
• Visual relationship between two variables
• Textbook Figure: Fig 1-7
6.3 Correlation Matrix
• Pairwise correlations
• Textbook Figure: Fig 1-6
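A correlation matrix for several variables can be sketched with np.corrcoef, which treats each row as one variable. The three variables below (hours, marks, absences) are hypothetical.

```python
import numpy as np

# Hypothetical variables for 5 students
hours = [1, 2, 3, 4, 5]
marks = [35, 45, 50, 65, 80]
absences = [9, 7, 6, 3, 1]

# Each row is one variable; the result is a 3 x 3 symmetric matrix
corr = np.corrcoef([hours, marks, absences])

print(np.round(corr, 2))
```

The diagonal entries are 1 (each variable with itself), and entry [i, j] equals entry [j, i].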
6.4 Large Dataset Visualization
• Hexagonal binning (Fig 1-8)
• Contour plots (Fig 1-9)
6.5 Categorical vs Numeric
• Boxplots (Fig 1-10)
• Violin plots (Fig 1-11)
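A grouped boxplot compares the median and spread of a numeric variable across categories. The same comparison can be sketched numerically as below; the department names and marks are hypothetical, and quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical marks grouped by department
marks_by_dept = {
    "CSE": [62, 70, 75, 81, 90],
    "ECE": [55, 60, 64, 72, 95],
}

summary = {}
for dept, marks in marks_by_dept.items():
    q1, med, q3 = np.percentile(marks, [25, 50, 75])
    summary[dept] = {"median": med, "iqr": q3 - q1}

for dept, stats_ in summary.items():
    print(dept, stats_)
```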
6.6 Numerical Problem
For x = 1, 2, 3, 4 and y = 2, 4, 6, 8 (y = 2x exactly): perfect positive correlation → r = +1
6.7 Python Code
import numpy as np
x = [1,2,3,4]
y = [2,4,6,8]
print(np.corrcoef(x, y)[0, 1])  # off-diagonal entry of the 2×2 correlation matrix
7. IMPORTANT VTU EXAM POINTS
• Write definition + formula + example
• Draw one neat diagram if applicable
• Always give one-line interpretation
• For code: logic > syntax perfection
8. SUMMARY
Module-1 focuses on understanding data before modeling. It builds the foundation for all further
statistical inference and machine learning techniques.
✔ END OF MODULE–1 (COMPLETE & EXAM-READY)