EDA Techniques and Python Code Guide

Module-1 covers Exploratory Data Analysis (EDA), emphasizing the importance of summarizing, visualizing, and understanding data. It includes definitions, measures of location and variability, and practical examples with Python code for calculations and visualizations. The module serves as a foundational resource for statistical inference and machine learning techniques.

MODULE–1: EXPLORATORY DATA ANALYSIS (EDA)

(As per VTU syllabus – Chapter 1)

These notes combine theory + numerical problems + code snippets in an exam-oriented format,
suitable for 5, 8, and 10 mark answers.

1. INTRODUCTION TO EXPLORATORY DATA ANALYSIS


Exploratory Data Analysis (EDA) is the process of summarizing, visualizing, and understanding data
before applying advanced statistical or machine learning techniques. The main goals of EDA are:

• To understand the central tendency of data
• To measure variability or spread
• To identify outliers and anomalies
• To study relationships between variables

EDA relies heavily on robust statistics and visual tools rather than strict probabilistic assumptions.

2. ESTIMATES OF LOCATION

2.1 Definition

Estimates of location describe the central or typical value around which the data is distributed.

2.2 Measures of Location

(a) Mean

• Arithmetic average of all observations


• Formula:

x̄ = (1/n) ∑ xᵢ  (i = 1, …, n)

• Sensitive to outliers

Use: Symmetric data without extreme values

(b) Median

• Middle value of ordered data


• Robust to outliers

Use: Skewed distributions (income, house prices)

(c) Trimmed Mean

• Mean after removing a fixed percentage of lowest and highest values


• Provides balance between mean and median

(d) Weighted Mean

• Assigns different importance (weights) to observations


• Formula:

x̄w = ∑ wᵢ xᵢ / ∑ wᵢ

(e) Weighted Median

• Median considering weights


• Highly robust
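The weighted measures above can be sketched in Python; the data and weights below are invented for illustration (np.average computes the weighted mean directly, while NumPy has no built-in weighted median, so it is derived here from cumulative weights):

```python
import numpy as np

# Illustrative observations with weights (e.g. importance of each reading)
x = np.array([60, 70, 80, 90])
w = np.array([1, 2, 3, 4])

# Weighted mean: sum(w * x) / sum(w)
wmean = np.average(x, weights=w)
print(wmean)  # 80.0

# Weighted median: smallest x whose cumulative weight reaches half the total
order = np.argsort(x)
cumw = np.cumsum(w[order])
wmedian = x[order][np.searchsorted(cumw, cumw[-1] / 2)]
print(wmedian)  # 80
```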

2.3 Numerical Problem (10 Marks)

Problem: Given the data: 20, 22, 25, 27, 28, 30, 32, 35, 150

Find: 1. Mean 2. Median 3. 10% Trimmed Mean (here, one value trimmed from each end)

Solution:
- Mean = 369/9 = 41
- Median = 28 (5th of the 9 ordered values)
- Trimmed Mean = (22 + 25 + 27 + 28 + 30 + 32 + 35)/7 ≈ 28.43

Conclusion: Median is the best measure due to presence of outlier.

2.4 Python Code

import numpy as np
from scipy import stats

data = [20,22,25,27,28,30,32,35,150]
print(np.mean(data))               # 41.0
print(np.median(data))             # 28.0
# For n = 9, trim_mean cuts floor(0.2 * 9) = 1 value from each end;
# a proportion of 0.1 would cut floor(0.9) = 0 values and return the plain mean
print(stats.trim_mean(data, 0.2))  # ≈ 28.43

3. ESTIMATES OF VARIABILITY

3.1 Definition

Variability measures the spread or dispersion of data around its center.

3.2 Measures of Variability

(a) Range

• Difference between maximum and minimum


• Very sensitive to outliers

(b) Variance

• Average of squared deviations from mean


• Formula:

s² = ∑(xᵢ − x̄)² / (n − 1)

(c) Standard Deviation

• Square root of variance


• Same unit as data

(d) Mean Absolute Deviation

• Average of absolute deviations

(e) Median Absolute Deviation (MAD)

• Median of absolute deviations from median


• Highly robust

(f) Interquartile Range (IQR)

• Difference between Q3 and Q1


• Resistant to outliers

3.3 Numerical Problem

Problem: Data: 2, 4, 6, 8, 10

Solution: Mean = 6; sum of squared deviations = 40, so Variance = 40/4 = 10; Standard deviation = √10 ≈ 3.16; Q1 = 4, Q3 = 8, so IQR = 4

3.4 Python Code

import numpy as np

data = [2,4,6,8,10]
print(np.var(data, ddof=1))   # 10.0
print(np.std(data, ddof=1))   # ≈ 3.16
print(np.percentile(data, 75) - np.percentile(data, 25))   # IQR = 4.0
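The robust spread measures from 3.2 (MAD and IQR) can also be computed directly; a short sketch using SciPy (median_abs_deviation requires SciPy ≥ 1.3 and, with its defaults, returns the unscaled median of absolute deviations):

```python
import numpy as np
from scipy import stats

data = np.array([2, 4, 6, 8, 10])

# MAD: median of absolute deviations from the median
print(stats.median_abs_deviation(data))           # 2.0
print(np.median(np.abs(data - np.median(data))))  # 2.0 (manual check)

# IQR: Q3 - Q1
print(stats.iqr(data))                            # 4.0
```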

4. EXPLORING DATA DISTRIBUTIONS

4.1 Purpose

To understand: - Shape - Skewness - Spread - Outliers

4.2 Visualization Techniques

(a) Boxplot

• Shows median, quartiles, IQR, outliers


• Textbook Figure: Fig 1-2

(b) Histogram

• Frequency distribution using bins


• Textbook Figure: Fig 1-3

(c) Density Plot

• Smoothed histogram
• Textbook Figure: Fig 1-4

4.3 Interpretation Problem

If a histogram has a long right tail → positively skewed, and Mean > Median.
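This rule of thumb can be verified on a small invented right-skewed sample:

```python
import numpy as np

# Invented data with one large value forming a long right tail
data = [1, 2, 2, 3, 3, 3, 20]

print(np.mean(data))    # ≈ 4.86
print(np.median(data))  # 3.0
# The tail pulls the mean above the median → positive skew
```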

4.4 Python Code

import matplotlib.pyplot as plt

data = [20,22,25,27,28,30,32,35,150]
plt.hist(data)
plt.show()
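A density plot (4.2c) is a smoothed histogram; the smoothing itself can be sketched numerically with scipy.stats.gaussian_kde, evaluated on a grid so the estimated densities can be inspected without a plot:

```python
import numpy as np
from scipy.stats import gaussian_kde

data = [20, 22, 25, 27, 28, 30, 32, 35, 150]

kde = gaussian_kde(data)               # Gaussian kernel density estimate
grid = np.linspace(min(data), max(data), 200)
density = kde(grid)                    # estimated density at each grid point

print(grid[density.argmax()])          # peak lies near the 20-35 cluster
```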

5. EXPLORING BINARY AND CATEGORICAL DATA

5.1 Binary Data

• Two outcomes (Yes/No, 0/1)

5.2 Categorical Data

• Multiple categories (Grade, Gender, Department)

5.3 Summary Measures

• Proportions
• Percentages
• Mode
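These summaries need nothing beyond the standard library; a sketch with collections.Counter on invented grade data:

```python
from collections import Counter

grades = ["A", "B", "A", "C", "B", "A"]
counts = Counter(grades)
n = len(grades)

# Proportion and percentage of each category
for cat, c in counts.items():
    print(cat, c / n, f"{100 * c / n:.1f}%")

# Mode: the most frequent category
print(counts.most_common(1)[0][0])  # A
```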

5.4 Expected Value

EV = ∑ pi xi

5.5 Numerical Problem

Problem: An investment returns ₹1000 with probability 0.1, ₹500 with probability 0.3, and ₹0 with probability 0.6. Find the expected profit.

Expected profit = 0.1 × 1000 + 0.3 × 500 + 0.6 × 0 = ₹250

5.6 Python Code

values = [1000, 500, 0]
prob = [0.1, 0.3, 0.6]
print(sum(v*p for v, p in zip(values, prob)))   # 250.0

6. EXPLORING TWO OR MORE VARIABLES

6.1 Correlation

• Measures linear relationship


• Range: −1 to +1

6.2 Scatter Plot

• Visual relationship between two variables


• Textbook Figure: Fig 1-7

6.3 Correlation Matrix

• Pairwise correlations
• Textbook Figure: Fig 1-6
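A correlation matrix can be sketched with np.corrcoef by passing each variable as one row (illustrative data; for tabular data, pandas DataFrame.corr() produces the same matrix):

```python
import numpy as np

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]   # y = 2x, perfectly correlated with x
z = [4, 3, 2, 1]   # decreasing, perfectly anti-correlated with x

# Each row is one variable; the result is a 3x3 matrix of pairwise r values
R = np.corrcoef([x, y, z])
print(R[0, 1])  # 1.0
print(R[0, 2])  # -1.0
```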

6.4 Large Dataset Visualization

• Hexagonal binning (Fig 1-8)


• Contour plots (Fig 1-9)

6.5 Categorical vs Numeric

• Boxplots (Fig 1-10)


• Violin plots (Fig 1-11)

6.6 Numerical Problem

Perfect positive correlation → r = +1

6.7 Python Code

import numpy as np

x = [1,2,3,4]
y = [2,4,6,8]
print(np.corrcoef(x, y)[0][1])   # 1.0

7. IMPORTANT VTU EXAM POINTS
• Write definition + formula + example
• Draw one neat diagram if applicable
• Always give one-line interpretation
• For code: logic > syntax perfection

8. SUMMARY
Module-1 focuses on understanding data before modeling. It builds the foundation for all further
statistical inference and machine learning techniques.

✔ END OF MODULE–1 (COMPLETE & EXAM-READY)
