Data Preprocessing and Visualization Guide

The document outlines practical exercises for data analysis using Python, covering data preprocessing, visualization, descriptive and inferential statistics, and correlation analysis. Each section includes theoretical explanations, code examples, and conclusions drawn from the analyses performed on datasets like the Titanic and student exam scores. The importance of data cleaning, statistical measures, and visualizations in understanding and interpreting data is emphasized throughout.

Table of Contents

1. Write a program to read and preprocess a dataset and perform visualization using Matplotlib and Seaborn.
2. Write a program to implement descriptive statistics (mean, median, mode, range, variance, and standard deviation) and inferential statistics (t-test and z-test).
3. Write a program to implement correlation analysis using Pearson and Spearman coefficients.
4. Write a program to implement binomial, normal, and Poisson distributions.
5. Implement SVM and Decision Tree classification techniques.
6. Write a program to build a logistic regression and linear regression model.

Practical 1
Aim: Write a program to read and preprocess a dataset and perform visualization using Matplotlib and Seaborn.

Theory
Data preprocessing and visualization are essential steps in the data analysis pipeline. Before
any meaningful statistical or machine learning analysis can be performed, raw data must be
cleaned, organized, and visualized to understand its structure and patterns.

1. Data Reading
The first step in any data analysis task is reading the dataset into the program. In Python, this is
commonly done using libraries such as pandas, which provides the read_csv(), read_excel(),
and similar functions to load datasets into a DataFrame. The DataFrame structure allows easy
manipulation and analysis of tabular data.
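As a minimal sketch of this step (the CSV content below is invented for illustration; in practice you would pass a file path such as 'students.csv' to read_csv):

```python
import io

import pandas as pd

# In practice: df = pd.read_csv("students.csv").
# Here the CSV lives in an in-memory string so the sketch is self-contained.
csv_text = "name,score\nAsha,82\nRavi,74\nMeena,91\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (3, 2): three rows, two columns
print(df.dtypes)   # 'score' is inferred as an integer column
```

read_excel() and read_json() follow the same pattern, returning a DataFrame that can then be inspected with head(), info(), and describe().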

2. Data Preprocessing
Raw data often contains missing values, duplicate entries, or inconsistent formats.
Preprocessing prepares the data for analysis by handling such issues. Common preprocessing
tasks include:
- Handling missing values using methods like imputation (mean/median/mode replacement) or
deletion.
- Removing duplicates to ensure data integrity.
- Encoding categorical data by converting string labels into numerical values (e.g., one-hot
encoding, label encoding).
- Feature scaling such as normalization or standardization of numerical features to a common
scale.
- Data type conversion to ensure all columns are in appropriate formats.

This step ensures that the dataset is clean, consistent, and suitable for analysis or model
building.
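Two of these tasks, imputation and encoding, can be sketched on a small invented table (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [22.0, None, 30.0, 26.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Impute missing values: median for the numeric column, mode for the categorical one
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# One-hot encode the categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["city"])
print(df)
```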

3. Data Visualization
Data visualization is the graphical representation of information to identify patterns,
relationships, and trends within the dataset. Visualization helps in exploratory data analysis
(EDA), making it easier to understand the data’s distribution and correlations.

Two of the most popular Python libraries for visualization are:


- Matplotlib: A foundational plotting library that provides control over every aspect of a figure,
supporting line plots, histograms, bar charts, scatter plots, and more.

- Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for creating
attractive and informative statistical graphics. It simplifies complex visualizations such as pair
plots, heatmaps, and distribution plots.

4. Importance
By combining preprocessing and visualization:
- Data inconsistencies and anomalies can be detected early.
- Relationships between variables can be observed visually.
- Insights from visual patterns help guide further analysis or model selection.

In summary, data preprocessing ensures data quality, while data visualization provides insights
into the data’s underlying patterns—both being critical components of any data science or
machine learning workflow.

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

sns.set_style("whitegrid")

df = sns.load_dataset('titanic')

df.rename(columns={'sibsp': 'siblings_spouses', 'parch': 'parents_children',
                   'pclass': 'passenger_class'}, inplace=True)
print(f"Loaded Titanic dataset: {df.shape[0]} rows, {df.shape[1]} columns\n")

print("First rows:\n", df.head())

print("\nMissing values:\n", df.isnull().sum())
print("\nDuplicates:", df.duplicated().sum())

# Preprocessing: impute numeric columns with the median, others with the mode
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype in ['float64', 'int64']:
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

df.drop_duplicates(inplace=True)

# Label-encode string columns into numeric codes
le = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    df[col + '_encoded'] = le.fit_transform(df[col])

print("\nPreprocessing done!\n")

# Visualizations
numeric_cols = df.select_dtypes(include=[np.number]).columns[:4]

# 1. Distribution plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, col in enumerate(numeric_cols):
    sns.histplot(df[col], kde=True, ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'{col} Distribution')
plt.tight_layout()
plt.savefig('distributions.png')
plt.show()

# 2. Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.savefig('heatmap.png')
plt.show()

# 3. Box plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, col in enumerate(numeric_cols):
    sns.boxplot(y=df[col], ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'{col} Boxplot')
plt.tight_layout()
plt.savefig('boxplots.png')
plt.show()

print("Visualizations saved!")

Output

Distribution plots of the different classes of passengers

Box plots for the distribution summary of the different classes

Correlation heatmap between the different passengers’ classes

Conclusion
Thus, in this practical, we used data preprocessing and visualization techniques to analyze the
Titanic dataset. By cleaning the data, handling missing values, and encoding categorical
variables, we prepared the dataset for analysis. Using Matplotlib and Seaborn, we visualized
patterns in survival rates with respect to parameters such as age, passenger class, and sibling count.

These steps helped us gain insights into the factors affecting survival on the Titanic and
demonstrated the importance of preprocessing and visualization in understanding real-world
datasets.

Practical 2

Aim: Write a program to implement descriptive statistics (mean, median, mode, range, variance, and standard deviation) and inferential statistics (t-test and z-test).

Theory
Descriptive statistics are used to summarize and describe the main features of a dataset.
They provide simple quantitative summaries about the data. Key measures include:

• Mean: The sum of all data points divided by the number of points. It gives a measure of
central tendency.
• Median: The middle value when the data is sorted. It is useful when the data has
outliers.
• Mode: The value that appears most frequently in the dataset.
• Range: The difference between the maximum and minimum values, showing the spread
of the data.
• Variance: Measures how far the data points are spread out from the mean. A higher
variance indicates more spread.
• Standard Deviation: The square root of variance; gives an idea of how much data
deviates from the mean in the same units as the data.

Inferential statistics allow us to make predictions or inferences about a population based on a sample. They often involve hypothesis testing.

• t-Test: Used to compare the means of two groups to determine if they are statistically
different. Commonly used when the sample size is small and the population standard
deviation is unknown.
• z-Test: Used to compare means when the sample size is large or the population
standard deviation is known. It is used to test hypotheses about population means.

Two-sample t-test:

t = (x̄₁ − x̄₂) / (sₚ · √(1/n₁ + 1/n₂)),  where sₚ² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)

Where:

• x̄₁ = mean of sample 1
• x̄₂ = mean of sample 2
• s₁² = variance of sample 1
• s₂² = variance of sample 2
• n₁ = size of sample 1
• n₂ = size of sample 2

Degrees of Freedom: df = n₁ + n₂ − 2

Decision rule:
• If |t| > t(critical), reject H₀ (the means are significantly different).
• If |t| ≤ t(critical), fail to reject H₀ (no significant difference).

Used to test whether two independent groups have different means — for example,
comparing the average performance of two different student groups or test results before
and after applying a new teaching method.
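As a worked sketch (the two score samples are invented for the example), the pooled t statistic computed directly from the formula agrees with scipy's equal-variance test:

```python
import numpy as np
from scipy import stats

group_a = np.array([72, 75, 78, 71, 74, 77])   # e.g. scores after the new method
group_b = np.array([68, 70, 72, 69, 71, 70])   # e.g. scores with the old method

n1, n2 = len(group_a), len(group_b)
s1, s2 = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled variance; degrees of freedom = n1 + n2 - 2
sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(sp2 * (1/n1 + 1/n2))

# scipy's equal-variance two-sample t-test should give the same statistic
t_scipy, p = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_manual:.3f}, p = {p:.4f}")
```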

Single-sample z-test:

z = (x̄ − μ) / (σ / √n)

Where:

• x̄ = sample mean
• μ = population mean
• σ = population standard deviation
• n = sample size

Decision Rule:
• If |z| > z(critical), reject H₀ (there is a significant difference).
• If |z| ≤ z(critical), fail to reject H₀ (no significant difference).

Used for large samples (n ≥ 30) or when the population standard deviation is known,
for example, comparing a sample’s average score with a known population average.
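The formula can be checked with a quick sketch using illustrative numbers (x̄ = 52, μ = 50, σ = 5, n = 30):

```python
import math

from scipy.stats import norm

sample_mean, mu, sigma, n = 52, 50, 5, 30   # illustrative values

z = (sample_mean - mu) / (sigma / math.sqrt(n))
p = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value

print(f"z = {z:.3f}, p = {p:.4f}")
```

Here |z| ≈ 2.19 exceeds the 5% critical value of 1.96, so H₀ would be rejected.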

Descriptive statistics help in understanding the basic characteristics of data, while inferential
statistics help in decision making, predictions, and validating assumptions in research.

Code

import numpy as np
import scipy.stats as stats

def descriptive_statistics(data):
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(data, keepdims=True)[0][0]
    data_range = np.max(data) - np.min(data)
    variance = np.var(data, ddof=1)  # sample variance
    std_dev = np.std(data, ddof=1)   # sample standard deviation

    print("\n--- DESCRIPTIVE STATISTICS ---")
    print(f"Data: {data}")
    print(f"Mean: {mean:.3f}")
    print(f"Median: {median:.3f}")
    print(f"Mode: {mode}")
    print(f"Range: {data_range:.3f}")
    print(f"Variance: {variance:.3f}")
    print(f"Standard Deviation: {std_dev:.3f}")

def t_test(sample1, sample2):
    t_stat, p_val = stats.ttest_ind(sample1, sample2)
    print("\n--- INFERENTIAL STATISTICS: T-TEST ---")
    print(f"Sample 1 Mean: {np.mean(sample1):.3f}")
    print(f"Sample 2 Mean: {np.mean(sample2):.3f}")
    print(f"T-Statistic: {t_stat:.3f}")
    print(f"P-Value: {p_val:.5f}")
    if p_val < 0.05:
        print("Result: Significant difference (Reject H₀)")
    else:
        print("Result: No significant difference (Fail to reject H₀)")

def z_test(sample_mean, population_mean, std_dev, n):
    # Z = (sample_mean - population_mean) / (std_dev / sqrt(n))
    z = (sample_mean - population_mean) / (std_dev / np.sqrt(n))
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed
    print("\n--- INFERENTIAL STATISTICS: Z-TEST ---")
    print(f"Sample Mean: {sample_mean:.3f}")
    print(f"Population Mean: {population_mean:.3f}")
    print(f"Standard Deviation: {std_dev:.3f}")
    print(f"Sample Size: {n}")
    print(f"Z-Statistic: {z:.3f}")
    print(f"P-Value: {p_val:.5f}")
    if p_val < 0.05:
        print("Result: Significant difference (Reject H₀)")
    else:
        print("Result: No significant difference (Fail to reject H₀)")

if __name__ == "__main__":
    # Example dataset
    data = [23, 45, 12, 67, 34, 45, 34, 23, 45, 56]
    descriptive_statistics(data)

    # Example samples for t-test
    sample1 = [23, 45, 34, 23, 45, 34, 56, 45]
    sample2 = [34, 54, 56, 43, 45, 65, 43, 55]
    t_test(sample1, sample2)

    # Example z-test
    sample_mean = 52
    population_mean = 50
    std_dev = 5
    n = 30
    z_test(sample_mean, population_mean, std_dev, n)

Output

Conclusion
In this practical, we applied descriptive and inferential statistics on a sample dataset. Descriptive
measures like mean, median, mode, range, variance, and standard deviation helped us
summarize the data and understand its distribution. Inferential tests, such as the t-test and z-test, allowed us to draw conclusions about population parameters from the sample.

By performing these analyses, we gained insights into the dataset’s central tendency, variability,
and statistical significance, which is essential for data-driven decision-making in real-world
applications.

For example, if applied to a dataset of students’ marks, we could determine the average
performance, understand score variability, and test hypotheses about differences between
groups of students.

Practical 3

Aim: Write a program to implement correlation analysis using Pearson and Spearman
coefficients

Theory
Correlation analysis is a statistical method used to evaluate the strength and direction of the
relationship between two variables. It helps in understanding how one variable changes with
respect to another.

Pearson Correlation Coefficient


Pearson correlation measures the linear relationship between two continuous variables.
It ranges from -1 to +1:
• +1 indicates a perfect positive linear relationship,
• -1 indicates a perfect negative linear relationship,
• 0 indicates no linear correlation.

It assumes that the data is normally distributed and the relationship is linear.

Spearman Rank Correlation Coefficient


Spearman correlation measures the monotonic relationship between two variables using ranks
instead of actual values. It is useful when the data is not normally distributed or when the
relationship is not linear. Like Pearson, it ranges from -1 to +1, with similar interpretations.

Correlation analysis is widely used in fields like economics, social sciences, and biology to
identify relationships between variables. For example, it can help understand the relationship
between students’ study hours and their exam scores or between advertising spend and sales
revenue.
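The difference between the two coefficients shows up clearly on monotonic but non-linear data (values invented for the sketch): Spearman, which works on ranks, reports a perfect relationship while Pearson does not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 3               # monotonic but strongly non-linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```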

Code

import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr
import seaborn as sns
import matplotlib.pyplot as plt

# Example dataset: students' study hours vs exam scores
data = {
    'Study_Hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Exam_Score': [50, 55, 56, 60, 65, 70, 74, 78, 85, 88]
}

df = pd.DataFrame(data)
print("Dataset:\n", df, "\n")

pearson_corr, pearson_p = pearsonr(df['Study_Hours'], df['Exam_Score'])
print(f"Pearson Correlation Coefficient: {pearson_corr:.4f}")
print(f"P-value: {pearson_p:.4f}\n")

spearman_corr, spearman_p = spearmanr(df['Study_Hours'], df['Exam_Score'])
print(f"Spearman Correlation Coefficient: {spearman_corr:.4f}")
print(f"P-value: {spearman_p:.4f}\n")

# Visualizations
sns.pairplot(df, kind='reg')
plt.suptitle("Scatter plot with regression line", y=1.02)
plt.show()

corr_matrix = df.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Pearson Correlation Heatmap")
plt.show()

if abs(pearson_corr) > 0.8:
    relation = "strong"
elif abs(pearson_corr) > 0.5:
    relation = "moderate"
else:
    relation = "weak"

print(f"Interpretation: There is a {relation} linear relationship between study hours and exam score.")

Output

Distributions of the exam scores and study hours

Correlation heatmap representing how strongly study hours and exam scores are correlated with each other

Conclusion
In this practical, we performed correlation analysis using Pearson and Spearman coefficients on
a sample dataset of students’ marks and study hours, which helped analyze whether increased
study hours are associated with higher scores.

Pearson correlation helped us determine the strength and direction of a linear relationship
between variables, while Spearman correlation was used to evaluate monotonic relationships,
even when data did not follow normal distribution.

This analysis is essential for understanding dependencies between variables, making predictions, and informing data-driven decisions in real-world applications.

Practical 4

Aim: Write a program to implement binomial, normal, and Poisson distributions.

Theory
Binomial Distribution
The Binomial Distribution is a discrete probability distribution that describes the number of
successes in a fixed number of independent trials, where each trial has only two possible
outcomes — success or failure.

It is used when:

• The number of trials (n) is fixed.
• The probability of success (p) is constant.
• Each trial is independent of the others.

P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ

Where:
• n = number of trials
• k = number of successes
• p = probability of success
• P(X = k) = probability of getting exactly k successes

Example: Finding the probability of getting exactly 3 heads in 5 coin tosses.
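The coin-toss example can be computed directly from the formula and checked against scipy.stats.binom:

```python
from math import comb

from scipy.stats import binom

n, k, p = 5, 3, 0.5

manual = comb(n, k) * p**k * (1 - p)**(n - k)   # C(5,3) * 0.5^5 = 10/32
print(f"P(X = 3) = {manual:.4f}")               # 0.3125
```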



Normal distribution
The Normal Distribution (also known as Gaussian Distribution) is a continuous probability
distribution that is symmetric about the mean, forming a bell-shaped curve.

It is used to model real-world phenomena like height, weight, or marks distribution.

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where:
• μ = mean
• σ = standard deviation
• x = variable
• f(x) = probability density function

Example: Distribution of students’ marks in a large class.

Poisson Distribution
The Poisson Distribution is a discrete probability distribution that expresses the probability of a
given number of events occurring in a fixed interval of time or space, provided these events
occur independently and at a constant average rate (λ).

P(X = k) = (λᵏ · e^(−λ)) / k!

Where:
• λ = average number of occurrences (mean rate)
• k = number of occurrences
• e = Euler’s number (≈ 2.718)

Example: The number of emails received per hour or the number of accidents per day.
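For the email example, assuming an illustrative rate of λ = 4 emails per hour, the probability of receiving exactly 2 emails in an hour follows directly from the formula:

```python
from math import exp, factorial

from scipy.stats import poisson

lam, k = 4, 2   # λ = 4 events per interval (rate assumed for illustration)

manual = lam**k * exp(-lam) / factorial(k)
print(f"P(X = 2) = {manual:.4f}")
```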

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import comb
import math

plt.style.use('seaborn-v0_8-darkgrid')

class DistributionAnalyzer:
    """Class to implement and visualize statistical distributions"""

    def __init__(self):
        self.fig_count = 0

    # BINOMIAL DISTRIBUTION
    def binomial_pmf(self, k, n, p):
        return comb(n, k, exact=True) * (p ** k) * ((1 - p) ** (n - k))

    def plot_binomial(self, n=20, p=0.5):
        k_values = np.arange(0, n + 1)
        pmf_values = [self.binomial_pmf(k, n, p) for k in k_values]

        scipy_pmf = stats.binom.pmf(k_values, n, p)

        plt.figure(figsize=(10, 6))
        plt.bar(k_values, pmf_values, alpha=0.7, color='steelblue', edgecolor='black')
        plt.xlabel('Number of Successes (k)', fontsize=12)
        plt.ylabel('Probability', fontsize=12)
        plt.title(f'Binomial Distribution (n={n}, p={p})', fontsize=14,
                  fontweight='bold')
        plt.grid(axis='y', alpha=0.3)

        mean = n * p
        variance = n * p * (1 - p)
        plt.axvline(mean, color='red', linestyle='--', linewidth=2,
                    label=f'Mean = {mean:.2f}')
        plt.legend()

        textstr = f'Mean: {mean:.2f}\nVariance: {variance:.2f}\nStd Dev: {math.sqrt(variance):.2f}'
        plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes,
                 fontsize=10, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        plt.tight_layout()
        plt.show()

        return mean, variance

    # NORMAL DISTRIBUTION
    def normal_pdf(self, x, mu, sigma):
        coefficient = 1 / (sigma * np.sqrt(2 * np.pi))
        exponent = -0.5 * ((x - mu) / sigma) ** 2
        return coefficient * np.exp(exponent)

    def plot_normal(self, mu=0, sigma=1, x_range=None):
        if x_range is None:
            x_range = (mu - 4*sigma, mu + 4*sigma)

        x = np.linspace(x_range[0], x_range[1], 1000)
        y = self.normal_pdf(x, mu, sigma)

        scipy_pdf = stats.norm.pdf(x, mu, sigma)

        plt.figure(figsize=(10, 6))
        plt.plot(x, y, 'b-', linewidth=2, label='Normal PDF')
        plt.fill_between(x, y, alpha=0.3)

        plt.axvline(mu, color='red', linestyle='--', linewidth=2, label=f'Mean (μ) = {mu}')
        plt.axvline(mu - sigma, color='green', linestyle=':', linewidth=1.5, alpha=0.7)
        plt.axvline(mu + sigma, color='green', linestyle=':', linewidth=1.5,
                    alpha=0.7, label=f'±1σ = {sigma}')

        plt.xlabel('x', fontsize=12)
        plt.ylabel('Probability Density', fontsize=12)
        plt.title(f'Normal Distribution (μ={mu}, σ={sigma})', fontsize=14, fontweight='bold')
        plt.legend()
        plt.grid(alpha=0.3)

        textstr = f'Mean: {mu}\nStd Dev: {sigma}\nVariance: {sigma**2}'
        plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes,
                 fontsize=10, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        plt.tight_layout()
        plt.show()

        return mu, sigma**2

    # POISSON DISTRIBUTION
    def poisson_pmf(self, k, lam):
        return (lam ** k * math.exp(-lam)) / math.factorial(k)

    def plot_poisson(self, lam=3, k_max=15):
        k_values = np.arange(0, k_max + 1)
        pmf_values = [self.poisson_pmf(k, lam) for k in k_values]

        scipy_pmf = stats.poisson.pmf(k_values, lam)

        plt.figure(figsize=(10, 6))
        plt.bar(k_values, pmf_values, alpha=0.7, color='coral', edgecolor='black')

        plt.xlabel('Number of Events (k)', fontsize=12)
        plt.ylabel('Probability', fontsize=12)
        plt.title(f'Poisson Distribution (λ={lam})', fontsize=14, fontweight='bold')
        plt.grid(axis='y', alpha=0.3)

        plt.axvline(lam, color='red', linestyle='--', linewidth=2,
                    label=f'Mean = λ = {lam}')
        plt.legend()

        textstr = f'λ (lambda): {lam}\nMean: {lam}\nVariance: {lam}\nStd Dev: {math.sqrt(lam):.2f}'
        plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes,
                 fontsize=10, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        plt.tight_layout()
        plt.show()

        return lam, lam

    # COMPARISON PLOT
    def plot_all_distributions(self):
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))

        # Binomial
        n, p = 20, 0.5
        k_binom = np.arange(0, n + 1)
        pmf_binom = stats.binom.pmf(k_binom, n, p)
        axes[0].bar(k_binom, pmf_binom, alpha=0.7, color='steelblue', edgecolor='black')
        axes[0].set_title(f'Binomial (n={n}, p={p})', fontsize=12, fontweight='bold')
        axes[0].set_xlabel('k')
        axes[0].set_ylabel('Probability')
        axes[0].grid(axis='y', alpha=0.3)

        # Normal
        mu, sigma = 0, 1
        x_norm = np.linspace(-4, 4, 1000)
        pdf_norm = stats.norm.pdf(x_norm, mu, sigma)
        axes[1].plot(x_norm, pdf_norm, 'b-', linewidth=2)
        axes[1].fill_between(x_norm, pdf_norm, alpha=0.3)
        axes[1].set_title(f'Normal (μ={mu}, σ={sigma})', fontsize=12, fontweight='bold')
        axes[1].set_xlabel('x')
        axes[1].set_ylabel('Probability Density')
        axes[1].grid(alpha=0.3)

        # Poisson
        lam = 3
        k_pois = np.arange(0, 15)
        pmf_pois = stats.poisson.pmf(k_pois, lam)
        axes[2].bar(k_pois, pmf_pois, alpha=0.7, color='coral', edgecolor='black')
        axes[2].set_title(f'Poisson (λ={lam})', fontsize=12, fontweight='bold')
        axes[2].set_xlabel('k')
        axes[2].set_ylabel('Probability')
        axes[2].grid(axis='y', alpha=0.3)

        plt.tight_layout()
        plt.show()

if __name__ == "__main__":
    analyzer = DistributionAnalyzer()

    print("=" * 60)
    print("STATISTICAL DISTRIBUTIONS VISUALIZATION")
    print("=" * 60)

    print("\n1. BINOMIAL DISTRIBUTION")
    print("-" * 60)
    print("Example: Flipping a coin 20 times")
    mean_b, var_b = analyzer.plot_binomial(n=20, p=0.5)
    print(f"Mean: {mean_b}, Variance: {var_b}")

    print("\n2. NORMAL DISTRIBUTION")
    print("-" * 60)
    print("Example: Standard normal distribution")
    mean_n, var_n = analyzer.plot_normal(mu=0, sigma=1)
    print(f"Mean: {mean_n}, Variance: {var_n}")

    print("\n3. POISSON DISTRIBUTION")
    print("-" * 60)
    print("Example: Average of 3 events per interval")
    mean_p, var_p = analyzer.plot_poisson(lam=3)
    print(f"Mean: {mean_p}, Variance: {var_p}")

    print("\n4. COMPARING ALL DISTRIBUTIONS")
    print("-" * 60)
    analyzer.plot_all_distributions()

    print("\n" + "=" * 60)
    print("Visualization complete!")
    print("=" * 60)

Output

Conclusion
In this practical, we implemented Binomial, Normal, and Poisson distributions using Python to
understand their behavior and visualize them.

• The Binomial Distribution was used to model discrete outcomes with fixed trials.
• The Normal Distribution represented continuous data with a symmetric bell-shaped
curve.
• The Poisson Distribution modeled the probability of a certain number of events occurring
within a fixed interval.

Through visualization and computation, we observed how these distributions differ in shape,
spread, and application, and how they are used in real-life statistical analysis and data
modeling.

Practical 5

Aim: Implement SVM and Decision tree classification techniques.

Theory
Support Vector Machine (SVM)
Support Vector Machine is a supervised learning algorithm used for classification (and
regression). SVM finds the optimal hyperplane that separates classes by maximizing the margin
between the nearest points of different classes (support vectors). For non-linearly separable
data, SVM uses kernel functions to map input features into a higher-dimensional space where a
linear separator may exist. SVM is effective in high-dimensional spaces and when the number
of features exceeds the number of samples.

Key ideas:
• Margin: distance between the hyperplane and nearest data points (support vectors).
• Optimal hyperplane: maximizes the margin.
• Kernels: linear, polynomial, radial basis function (RBF), sigmoid — allow non-linear
decision boundaries.
• Regularization parameter (C): trade-off between maximizing margin and minimizing
classification error.
• Slack variables (ξ): allow soft margin (tolerate misclassification).

Decision function:

f(x) = sign(w · x + b)

Where:
• w = weight vector
• x = input vector
• b = bias term

Objective:

minimize (1/2)‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0

Where:
• C = regularization parameter
• ξᵢ = slack variables allowing misclassification

Dual form (with kernels):

maximize Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)  subject to  0 ≤ αᵢ ≤ C,  Σᵢ αᵢ yᵢ = 0

Prediction:

f(x) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b)
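These ideas can be sketched on invented, linearly separable toy data (cluster centers and the random seed are arbitrary); note that only a handful of points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2-D clusters (toy data, fixed seed for reproducibility)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("Support vectors per class:", clf.n_support_)   # the margin-defining points
print("Training accuracy:", clf.score(X, y))
```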

Decision Tree
A Decision Tree is a non-parametric supervised learning method used for classification and
regression. It builds a tree-like model of decisions by recursively splitting the dataset on feature
values to create homogeneous subsets (in terms of class labels). Each internal node represents
a test on a feature, each branch the outcome, and each leaf node a class label (or distribution).

Key ideas:

• Recursive partitioning: split data to reduce impurity.
• Impurity measures: Gini impurity, Entropy (information gain), or Classification Error.
• Stopping criteria: maximum depth, minimum samples per leaf, or when further splits do not improve impurity significantly.
• Pruning: pre-pruning or post-pruning to avoid overfitting.

Entropy:

H(S) = − Σₖ pₖ log₂ pₖ

Where pₖ is the proportion of class k in dataset S.

Information Gain:

IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

Where:
• A = attribute used for splitting
• Sᵥ = subset of S where attribute A takes value v

Gini Impurity:

G(S) = 1 − Σₖ pₖ²
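Both impurity measures are easy to compute by hand; the sketch below evaluates an illustrative node holding 8 samples of one class and 2 of another:

```python
from math import log2

def entropy(probs):
    """Entropy in bits for a list of class proportions."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity for a list of class proportions."""
    return 1 - sum(p * p for p in probs)

probs = [0.8, 0.2]   # an 8-vs-2 node
print(f"Entropy = {entropy(probs):.4f}")   # 0.7219 bits
print(f"Gini    = {gini(probs):.4f}")      # 0.3200
```

A pure node ([1.0]) scores 0 under both measures, while a 50/50 node maximizes them (entropy 1 bit, Gini 0.5).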

Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SVM (RBF kernel)
print("=== SVM Classifier ===")
svm = SVC(kernel='rbf', C=1.0, random_state=42)
svm.fit(X_train_scaled, y_train)
y_pred_svm = svm.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(classification_report(y_test, y_pred_svm, target_names=iris.target_names))

# Train Decision Tree
print("\n=== Decision Tree Classifier ===")
dt = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Tree depth: {dt.get_depth()}, Leaves: {dt.get_n_leaves()}")
print(classification_report(y_test, y_pred_dt, target_names=iris.target_names))

# Visualize decision boundaries
fig1, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

def plot_boundary(model, X, y, ax, title, scaled=False):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    mesh = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(scaler.transform(mesh) if scaled else mesh).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='black')
    ax.set_title(title)

plot_boundary(svm, X_test, y_test, ax1, 'SVM (RBF)', scaled=True)
plot_boundary(dt, X_test, y_test, ax2, 'Decision Tree (Gini)')
plt.tight_layout()
plt.show()

# Plot decision tree structure
fig2, ax = plt.subplots(figsize=(15, 10))
plot_tree(dt, ax=ax, feature_names=iris.feature_names[:2],
          class_names=iris.target_names, filled=True, rounded=True, fontsize=10)
ax.set_title('Decision Tree Structure', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Output

Conclusion
In this practical, SVM and Decision Tree classification techniques were implemented and
compared. SVM provides robust, margin-based classification that is powerful for high-dimensional and (with kernels) non-linear problems; it requires careful feature scaling and
hyperparameter tuning.

Decision Trees provide intuitive rule-based models that are easy to interpret and require less
preprocessing, but they are prone to overfitting and often need pruning or ensemble methods
for better generalization.

By preparing data, tuning hyperparameters, and evaluating with appropriate metrics, both
methods can be effectively applied to classification problems; the choice between them
depends on dataset characteristics (linearity, dimensionality, noise, interpretability needs) and
performance on validation/test sets.

Practical 6

Aim: Write a program to build a logistic regression and linear regression model.

Theory
Linear Regression and Logistic Regression are regression analysis models, which are used to
model the relationship between dependent and independent variables, and help in predicting or
estimating the value of one variable based on the values of others.

Linear Regression: Used when the dependent variable is continuous. It establishes a linear
relationship between the dependent variable (Y) and one or more independent variables (X).
The aim is to find the best-fitting straight line through the data points that minimizes the error
between predicted and actual values.

General equation:

Y = β₀ + β₁X + ε

Where:

• Y = Dependent variable
• X = Independent variable
• β₀ = Intercept
• β₁ = Slope coefficient
• ε = Error term

To calculate the slope and intercept:

β₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
β₀ = Ȳ − β₁X̄
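A worked sketch of these closed-form estimates on a small invented dataset:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Least-squares slope and intercept from the closed-form expressions
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"slope = {b1:.2f}, intercept = {b0:.2f}")   # slope = 0.60, intercept = 2.20
```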

Logistic Regression: Used when the dependent variable is categorical (binary). It predicts the
probability that a given input belongs to a particular class. Instead of fitting a straight line, it fits
an S-shaped (sigmoid) curve that outputs probabilities between 0 and 1.

The logistic regression model predicts the probability (p) that the dependent variable belongs to class 1 as follows:

p = 1 / (1 + e^(−(β₀ + β₁X)))

Where:
• p = Probability of success (Y = 1)
• β₀, β₁ = Model coefficients
• X = Predictor variable
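A minimal sketch of the sigmoid mapping, using illustrative coefficients β₀ = −4 and β₁ = 1.5 (not fitted to any data):

```python
import math

def predict_proba(x, b0=-4.0, b1=1.5):
    """Sigmoid of the linear score b0 + b1*x; coefficients are illustrative."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(f"p(x=1) = {predict_proba(1):.3f}")   # low probability of class 1
print(f"p(x=4) = {predict_proba(4):.3f}")   # high probability of class 1
```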

Code

# Import required libraries


import numpy as np
import [Link] as plt
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from [Link] import load_iris, fetch_california_housing
from [Link] import mean_squared_error, r2_score, accuracy_score, confusion_matrix,
ConfusionMatrixDisplay

print("===== LINEAR REGRESSION (California Housing Dataset) =====")

housing = fetch_california_housing()
X, y = [Link], [Link]
feature_names = housing.feature_names

df_housing = [Link](X, columns=feature_names)


df_housing['MedHouseVal'] = y

X_room = df_housing[['AveRooms']]
y_price = df_housing['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X_room, y_price, test_size=0.2, random_state=42)


36

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

y_pred = lin_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)


r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.3f}")


print(f"R² Score: {r2:.3f}")

# Visualization
plt.figure(figsize=(6, 4))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Average Number of Rooms (AveRooms)')
plt.ylabel('Median House Value')
plt.title('Linear Regression: House Value vs Rooms')
plt.legend()
plt.show()

print("\n===== LOGISTIC REGRESSION (Iris Dataset) =====")

iris = load_iris()
X = iris.data[:, :2]  # Use first two features for visualization
y = iris.target

# For binary classification
y_binary = (y == 0).astype(int)  # 1 if setosa, else 0

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

y_pred = log_model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc*100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Non-Setosa', 'Setosa'])
disp.plot(cmap='Purples')
plt.title("Logistic Regression - Confusion Matrix")
plt.show()

# Decision boundary visualization
plt.figure(figsize=(6, 4))
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = log_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Logistic Regression Decision Boundary (Iris)')
plt.show()
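The MSE and R² values reported by scikit-learn can be cross-checked against their definitions; the toy arrays below are illustrative, not taken from the housing data:

```python
# Minimal sketch (toy arrays): MSE and R^2 computed from their definitions
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

mse = np.mean((y_true - y_hat) ** 2)            # average squared error
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # fraction of variance explained

print(f"MSE = {mse:.3f}, R² = {r2:.3f}")  # MSE = 0.375, R² = 0.925
```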

Output

Conclusion

In this experiment, we implemented Linear Regression on the California Housing dataset to
predict median house values from the average number of rooms.

Furthermore, we employed Logistic Regression on the Iris dataset to classify flowers as
setosa or non-setosa based on their sepal measurements.

Linear Regression helped us model a continuous relationship between variables, while Logistic
Regression was used to predict categorical outcomes by estimating class probabilities.
