Data Preprocessing and Visualization Guide

The document outlines practical exercises for data analysis using Python, covering data preprocessing, visualization, descriptive and inferential statistics, and correlation analysis. Each section includes theoretical explanations, code examples, and conclusions drawn from the analyses performed on datasets like the Titanic and student exam scores. The importance of data cleaning, statistical measures, and visualizations in understanding and interpreting data is emphasized throughout.

Table of Contents

1. Write a program to read and preprocess a dataset and perform visualization using Matplotlib and Seaborn.
2. Write a program to implement descriptive statistics (mean, median, mode, range, variance, and standard deviation) and inferential statistics (t-test and z-test).
3. Write a program to implement correlation analysis using Pearson and Spearman coefficients.
4. Write a program to implement binomial, normal, and Poisson distributions.
5. Implement SVM and Decision Tree classification techniques.
6. Write a program to build a logistic regression and linear regression model.

Practical 1
Aim: Write a program to read and preprocess a dataset and perform visualization using Matplotlib and Seaborn.

Theory
Data preprocessing and visualization are essential steps in the data analysis pipeline. Before
any meaningful statistical or machine learning analysis can be performed, raw data must be
cleaned, organized, and visualized to understand its structure and patterns.

1. Data Reading
The first step in any data analysis task is reading the dataset into the program. In Python, this is
commonly done using libraries such as pandas, which provides the read_csv(), read_excel(),
and similar functions to load datasets into a DataFrame. The DataFrame structure allows easy
manipulation and analysis of tabular data.
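As a minimal sketch of this step (the CSV content below is invented for illustration; in practice you would pass a file path such as 'students.csv' to read_csv):

```python
import io

import pandas as pd

# In practice: df = pd.read_csv("students.csv").
# Here the CSV lives in an in-memory string so the sketch is self-contained.
csv_text = "name,score\nAsha,82\nRavi,74\nMeena,91\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (3, 2): three rows, two columns
print(df.dtypes)   # 'score' is inferred as an integer column
```

read_excel() and read_json() follow the same pattern, returning a DataFrame that can then be inspected with head(), info(), and describe().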

2. Data Preprocessing
Raw data often contains missing values, duplicate entries, or inconsistent formats.
Preprocessing prepares the data for analysis by handling such issues. Common preprocessing
tasks include:
- Handling missing values using methods like imputation (mean/median/mode replacement) or
deletion.
- Removing duplicates to ensure data integrity.
- Encoding categorical data by converting string labels into numerical values (e.g., one-hot
encoding, label encoding).
- Feature scaling such as normalization or standardization of numerical features to a common
scale.
- Data type conversion to ensure all columns are in appropriate formats.

This step ensures that the dataset is clean, consistent, and suitable for analysis or model
building.
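Two of these tasks, imputation and encoding, can be sketched on a small invented table (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [22.0, None, 30.0, 26.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Impute missing values: median for the numeric column, mode for the categorical one
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# One-hot encode the categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["city"])
print(df)
```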

3. Data Visualization
Data visualization is the graphical representation of information to identify patterns,
relationships, and trends within the dataset. Visualization helps in exploratory data analysis
(EDA), making it easier to understand the data’s distribution and correlations.

Two of the most popular Python libraries for visualization are:


- Matplotlib: A foundational plotting library that provides control over every aspect of a figure,
supporting line plots, histograms, bar charts, scatter plots, and more.

- Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for creating
attractive and informative statistical graphics. It simplifies complex visualizations such as pair
plots, heatmaps, and distribution plots.

4. Importance
By combining preprocessing and visualization:
- Data inconsistencies and anomalies can be detected early.
- Relationships between variables can be observed visually.
- Insights from visual patterns help guide further analysis or model selection.

In summary, data preprocessing ensures data quality, while data visualization provides insights
into the data’s underlying patterns—both being critical components of any data science or
machine learning workflow.

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

sns.set_style("whitegrid")

df = sns.load_dataset('titanic')

df.rename(columns={'sibsp': 'siblings_spouses', 'parch': 'parents_children',
                   'pclass': 'passenger_class'}, inplace=True)
print(f"Loaded Titanic dataset: {df.shape[0]} rows, {df.shape[1]} columns\n")

print("First rows:\n", df.head())

print("\nMissing values:\n", df.isnull().sum())
print("\nDuplicates:", df.duplicated().sum())

# Preprocessing: impute numeric columns with the median, others with the mode
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype in ['float64', 'int64']:
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

df.drop_duplicates(inplace=True)

# Label-encode string columns into numeric codes
le = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    df[col + '_encoded'] = le.fit_transform(df[col])

print("\nPreprocessing done!\n")

# Visualizations
numeric_cols = df.select_dtypes(include=[np.number]).columns[:4]

# 1. Distribution plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, col in enumerate(numeric_cols):
    sns.histplot(df[col], kde=True, ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'{col} Distribution')
plt.tight_layout()
plt.savefig('distributions.png')
plt.show()

# 2. Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.savefig('heatmap.png')
plt.show()

# 3. Box plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, col in enumerate(numeric_cols):
    sns.boxplot(y=df[col], ax=axes[i//2, i%2])
    axes[i//2, i%2].set_title(f'{col} Boxplot')
plt.tight_layout()
plt.savefig('boxplots.png')
plt.show()

print("Visualizations saved!")

Output

Distribution plots of the different classes of passengers

Box plots for the distribution summary of the different classes

Correlation heatmap between the different passengers’ classes

Conclusion
Thus, in this practical, we used data preprocessing and visualization techniques to analyze the
Titanic dataset. By cleaning the data, handling missing values, and encoding categorical
variables, we prepared the dataset for analysis. Using Matplotlib and Seaborn, we visualized
patterns in survival rates with respect to parameters such as age, passenger class, and sibling count.

These steps helped us gain insights into the factors affecting survival on the Titanic and
demonstrated the importance of preprocessing and visualization in understanding real-world
datasets.

Practical 2

Aim: Write a program to implement descriptive statistics (mean, median, mode, range, variance, and standard deviation) and inferential statistics (t-test and z-test).

Theory
Descriptive statistics are used to summarize and describe the main features of a dataset.
They provide simple quantitative summaries about the data. Key measures include:

• Mean: The sum of all data points divided by the number of points. It gives a measure of
central tendency.
• Median: The middle value when the data is sorted. It is useful when the data has
outliers.
• Mode: The value that appears most frequently in the dataset.
• Range: The difference between the maximum and minimum values, showing the spread
of the data.
• Variance: Measures how far the data points are spread out from the mean. A higher
variance indicates more spread.
• Standard Deviation: The square root of variance; gives an idea of how much data
deviates from the mean in the same units as the data.

Inferential statistics allow us to make predictions or inferences about a population based on a sample. They often involve hypothesis testing.

• t-Test: Used to compare the means of two groups to determine if they are statistically
different. Commonly used when the sample size is small and the population standard
deviation is unknown.
• z-Test: Used to compare means when the sample size is large or the population
standard deviation is known. It is used to test hypotheses about population means.

Two-sample t-test:

t = (x̄₁ − x̄₂) / (sₚ · √(1/n₁ + 1/n₂)),  where sₚ² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)

Where:

• x̄₁ = mean of sample 1
• x̄₂ = mean of sample 2
• s₁² = variance of sample 1
• s₂² = variance of sample 2
• n₁ = size of sample 1
• n₂ = size of sample 2

Degrees of Freedom: df = n₁ + n₂ − 2

Decision rule:
• If |t| > t(critical), reject H₀ (the means are significantly different).
• If |t| ≤ t(critical), fail to reject H₀ (no significant difference).

Used to test whether two independent groups have different means — for example,
comparing the average performance of two different student groups or test results before
and after applying a new teaching method.
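As a worked sketch (the two score samples are invented for the example), the pooled t statistic computed directly from the formula agrees with scipy's equal-variance test:

```python
import numpy as np
from scipy import stats

group_a = np.array([72, 75, 78, 71, 74, 77])   # e.g. scores after the new method
group_b = np.array([68, 70, 72, 69, 71, 70])   # e.g. scores with the old method

n1, n2 = len(group_a), len(group_b)
s1, s2 = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled variance; degrees of freedom = n1 + n2 - 2
sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(sp2 * (1/n1 + 1/n2))

# scipy's equal-variance two-sample t-test should give the same statistic
t_scipy, p = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_manual:.3f}, p = {p:.4f}")
```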

Single-sample z-test:

z = (x̄ − μ) / (σ / √n)

Where:

• x̄ = sample mean
• μ = population mean
• σ = population standard deviation
• n = sample size

Decision Rule:
• If |z| > z(critical), reject H₀ (there is a significant difference).
• If |z| ≤ z(critical), fail to reject H₀ (no significant difference).

Used for large samples (n ≥ 30) or when the population standard deviation is known,
for example, comparing a sample’s average score with a known population average.
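The formula can be checked with a quick sketch using illustrative numbers (x̄ = 52, μ = 50, σ = 5, n = 30):

```python
import math

from scipy.stats import norm

sample_mean, mu, sigma, n = 52, 50, 5, 30   # illustrative values

z = (sample_mean - mu) / (sigma / math.sqrt(n))
p = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value

print(f"z = {z:.3f}, p = {p:.4f}")
```

Here |z| ≈ 2.19 exceeds the 5% critical value of 1.96, so H₀ would be rejected.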

Descriptive statistics help in understanding the basic characteristics of data, while inferential
statistics help in decision making, predictions, and validating assumptions in research.

Code

import numpy as np
import scipy.stats as stats

def descriptive_statistics(data):
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(data, keepdims=True)[0][0]
    data_range = np.max(data) - np.min(data)
    variance = np.var(data, ddof=1)  # sample variance
    std_dev = np.std(data, ddof=1)   # sample standard deviation

    print("\n--- DESCRIPTIVE STATISTICS ---")
    print(f"Data: {data}")
    print(f"Mean: {mean:.3f}")
    print(f"Median: {median:.3f}")
    print(f"Mode: {mode}")
    print(f"Range: {data_range:.3f}")
    print(f"Variance: {variance:.3f}")
    print(f"Standard Deviation: {std_dev:.3f}")

def t_test(sample1, sample2):
    t_stat, p_val = stats.ttest_ind(sample1, sample2)
    print("\n--- INFERENTIAL STATISTICS: T-TEST ---")
    print(f"Sample 1 Mean: {np.mean(sample1):.3f}")
    print(f"Sample 2 Mean: {np.mean(sample2):.3f}")
    print(f"T-Statistic: {t_stat:.3f}")
    print(f"P-Value: {p_val:.5f}")
    if p_val < 0.05:
        print("Result: Significant difference (Reject H₀)")
    else:
        print("Result: No significant difference (Fail to reject H₀)")

def z_test(sample_mean, population_mean, std_dev, n):
    # Z = (sample_mean - population_mean) / (std_dev / sqrt(n))
    z = (sample_mean - population_mean) / (std_dev / np.sqrt(n))
    p_val = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed
    print("\n--- INFERENTIAL STATISTICS: Z-TEST ---")
    print(f"Sample Mean: {sample_mean:.3f}")
    print(f"Population Mean: {population_mean:.3f}")
    print(f"Standard Deviation: {std_dev:.3f}")
    print(f"Sample Size: {n}")
    print(f"Z-Statistic: {z:.3f}")
    print(f"P-Value: {p_val:.5f}")
    if p_val < 0.05:
        print("Result: Significant difference (Reject H₀)")
    else:
        print("Result: No significant difference (Fail to reject H₀)")

if __name__ == "__main__":
    # Example dataset
    data = [23, 45, 12, 67, 34, 45, 34, 23, 45, 56]
    descriptive_statistics(data)

    # Example samples for t-test
    sample1 = [23, 45, 34, 23, 45, 34, 56, 45]
    sample2 = [34, 54, 56, 43, 45, 65, 43, 55]
    t_test(sample1, sample2)

    # Example z-test
    sample_mean = 52
    population_mean = 50
    std_dev = 5
    n = 30
    z_test(sample_mean, population_mean, std_dev, n)

Output

Conclusion
In this practical, we applied descriptive and inferential statistics on a sample dataset. Descriptive
measures like mean, median, mode, range, variance, and standard deviation helped us
summarize the data and understand its distribution. Inferential tests, such as the t-test and z-test, allowed us to draw conclusions about population parameters from the sample.

By performing these analyses, we gained insights into the dataset’s central tendency, variability,
and statistical significance, which is essential for data-driven decision-making in real-world
applications.

For example, if applied to a dataset of students’ marks, we could determine the average
performance, understand score variability, and test hypotheses about differences between
groups of students.

Practical 3

Aim: Write a program to implement correlation analysis using Pearson and Spearman
coefficients

Theory
Correlation analysis is a statistical method used to evaluate the strength and direction of the
relationship between two variables. It helps in understanding how one variable changes with
respect to another.

Pearson Correlation Coefficient


Pearson correlation measures the linear relationship between two continuous variables.
It ranges from -1 to +1:
• +1 indicates a perfect positive linear relationship,
• -1 indicates a perfect negative linear relationship,
• 0 indicates no linear correlation.

It assumes that the data is normally distributed and the relationship is linear.

Spearman Rank Correlation Coefficient


Spearman correlation measures the monotonic relationship between two variables using ranks
instead of actual values. It is useful when the data is not normally distributed or when the
relationship is not linear. Like Pearson, it ranges from -1 to +1, with similar interpretations.

Correlation analysis is widely used in fields like economics, social sciences, and biology to
identify relationships between variables. For example, it can help understand the relationship
between students’ study hours and their exam scores or between advertising spend and sales
revenue.
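The difference between the two coefficients shows up clearly on monotonic but non-linear data (values invented for the sketch): Spearman, which works on ranks, reports a perfect relationship while Pearson does not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 3               # monotonic but strongly non-linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```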

Code

import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr
import seaborn as sns
import matplotlib.pyplot as plt

# Example dataset: students' study hours vs exam scores
data = {
    'Study_Hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Exam_Score': [50, 55, 56, 60, 65, 70, 74, 78, 85, 88]
}

df = pd.DataFrame(data)
print("Dataset:\n", df, "\n")

pearson_corr, pearson_p = pearsonr(df['Study_Hours'], df['Exam_Score'])
print(f"Pearson Correlation Coefficient: {pearson_corr:.4f}")
print(f"P-value: {pearson_p:.4f}\n")

spearman_corr, spearman_p = spearmanr(df['Study_Hours'], df['Exam_Score'])
print(f"Spearman Correlation Coefficient: {spearman_corr:.4f}")
print(f"P-value: {spearman_p:.4f}\n")

# Visualizations
sns.pairplot(df, kind='reg')
plt.suptitle("Scatter plot with regression line", y=1.02)
plt.show()

corr_matrix = df.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Pearson Correlation Heatmap")
plt.show()

if abs(pearson_corr) > 0.8:
    relation = "strong"
elif abs(pearson_corr) > 0.5:
    relation = "moderate"
else:
    relation = "weak"

print(f"Interpretation: There is a {relation} linear relationship between study hours and exam score.")

Output

Distributions of the exam scores and study hours

Correlation heatmap representing how strongly study hours and exam scores are correlated with each other

Conclusion
In this practical, we performed correlation analysis using Pearson and Spearman coefficients on
a sample dataset of students’ marks and study hours, which helped analyze whether increased
study hours are associated with higher scores.

Pearson correlation helped us determine the strength and direction of a linear relationship
between variables, while Spearman correlation was used to evaluate monotonic relationships,
even when data did not follow normal distribution.

This analysis is essential for understanding dependencies between variables, making predictions, and informing data-driven decisions in real-world applications.

Practical 4

Aim: Write a program to implement binomial, normal, and Poisson distributions.

Theory
Binomial Distribution
The Binomial Distribution is a discrete probability distribution that describes the number of
successes in a fixed number of independent trials, where each trial has only two possible
outcomes — success or failure.

It is used when:

• The number of trials (n) is fixed.
• The probability of success (p) is constant.
• Each trial is independent of the others.

P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ

Where:
• n = number of trials
• k = number of successes
• p = probability of success
• P(X = k) = probability of getting exactly k successes

Example: Finding the probability of getting exactly 3 heads in 5 coin tosses.
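The coin-toss example can be computed directly from the formula and checked against scipy.stats.binom:

```python
from math import comb

from scipy.stats import binom

n, k, p = 5, 3, 0.5

manual = comb(n, k) * p**k * (1 - p)**(n - k)   # C(5,3) * 0.5^5 = 10/32
print(f"P(X = 3) = {manual:.4f}")               # 0.3125
```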



Normal distribution
The Normal Distribution (also known as Gaussian Distribution) is a continuous probability
distribution that is symmetric about the mean, forming a bell-shaped curve.

It is used to model real-world phenomena like height, weight, or marks distribution.

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where:
• μ = mean
• σ = standard deviation
• x = variable
• f(x) = probability density function

Example: Distribution of students’ marks in a large class.

Poisson Distribution
The Poisson Distribution is a discrete probability distribution that expresses the probability of a
given number of events occurring in a fixed interval of time or space, provided these events
occur independently and at a constant average rate (λ).

P(X = k) = (λᵏ · e^(−λ)) / k!

Where:
• λ = average number of occurrences (mean rate)
• k = number of occurrences
• e = Euler’s number (≈ 2.718)

Example: The number of emails received per hour or the number of accidents per day.
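For the email example, assuming an illustrative rate of λ = 4 emails per hour, the probability of receiving exactly 2 emails in an hour follows directly from the formula:

```python
from math import exp, factorial

from scipy.stats import poisson

lam, k = 4, 2   # λ = 4 events per interval (rate assumed for illustration)

manual = lam**k * exp(-lam) / factorial(k)
print(f"P(X = 2) = {manual:.4f}")
```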

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import comb
import math

plt.style.use('seaborn-v0_8-darkgrid')

class DistributionAnalyzer:
    """Class to implement and visualize statistical distributions"""

    def __init__(self):
        self.fig_count = 0

    # BINOMIAL DISTRIBUTION
    def binomial_pmf(self, k, n, p):
        return comb(n, k, exact=True) * (p ** k) * ((1 - p) ** (n - k))

    def plot_binomial(self, n=20, p=0.5):
        k_values = np.arange(0, n + 1)
        pmf_values = [self.binomial_pmf(k, n, p) for k in k_values]

        scipy_pmf = stats.binom.pmf(k_values, n, p)

        plt.figure(figsize=(10, 6))
        plt.bar(k_values, pmf_values, alpha=0.7, color='steelblue', edgecolor='black')
        plt.xlabel('Number of Successes (k)', fontsize=12)
        plt.ylabel('Probability', fontsize=12)
        plt.title(f'Binomial Distribution (n={n}, p={p})', fontsize=14,
                  fontweight='bold')
        plt.grid(axis='y', alpha=0.3)

        mean = n * p
        variance = n * p * (1 - p)
        plt.axvline(mean, color='red', linestyle='--', linewidth=2,
                    label=f'Mean = {mean:.2f}')
        plt.legend()

        textstr = f'Mean: {mean:.2f}\nVariance: {variance:.2f}\nStd Dev: {math.sqrt(variance):.2f}'
        plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes,
                 fontsize=10, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        plt.tight_layout()
        plt.show()

        return mean, variance

    # NORMAL DISTRIBUTION
    def normal_pdf(self, x, mu, sigma):
        coefficient = 1 / (sigma * np.sqrt(2 * np.pi))
        exponent = -0.5 * ((x - mu) / sigma) ** 2
        return coefficient * np.exp(exponent)

    def plot_normal(self, mu=0, sigma=1, x_range=None):
        if x_range is None:
            x_range = (mu - 4*sigma, mu + 4*sigma)

        x = np.linspace(x_range[0], x_range[1], 1000)
        y = self.normal_pdf(x, mu, sigma)

        scipy_pdf = stats.norm.pdf(x, mu, sigma)

        plt.figure(figsize=(10, 6))
        plt.plot(x, y, 'b-', linewidth=2, label='Normal PDF')
        plt.fill_between(x, y, alpha=0.3)

        plt.axvline(mu, color='red', linestyle='--', linewidth=2, label=f'Mean (μ) = {mu}')
        plt.axvline(mu - sigma, color='green', linestyle=':', linewidth=1.5, alpha=0.7)
        plt.axvline(mu + sigma, color='green', linestyle=':', linewidth=1.5,
                    alpha=0.7, label=f'±1σ = {sigma}')

        plt.xlabel('x', fontsize=12)
        plt.ylabel('Probability Density', fontsize=12)
        plt.title(f'Normal Distribution (μ={mu}, σ={sigma})', fontsize=14, fontweight='bold')
        plt.legend()
        plt.grid(alpha=0.3)

        textstr = f'Mean: {mu}\nStd Dev: {sigma}\nVariance: {sigma**2}'
        plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes,
                 fontsize=10, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        plt.tight_layout()
        plt.show()

        return mu, sigma**2

    # POISSON DISTRIBUTION
    def poisson_pmf(self, k, lam):
        return (lam ** k * math.exp(-lam)) / math.factorial(k)

    def plot_poisson(self, lam=3, k_max=15):
        k_values = np.arange(0, k_max + 1)
        pmf_values = [self.poisson_pmf(k, lam) for k in k_values]

        scipy_pmf = stats.poisson.pmf(k_values, lam)

        plt.figure(figsize=(10, 6))
        plt.bar(k_values, pmf_values, alpha=0.7, color='coral', edgecolor='black')

        plt.xlabel('Number of Events (k)', fontsize=12)
        plt.ylabel('Probability', fontsize=12)
        plt.title(f'Poisson Distribution (λ={lam})', fontsize=14, fontweight='bold')
        plt.grid(axis='y', alpha=0.3)

        plt.axvline(lam, color='red', linestyle='--', linewidth=2,
                    label=f'Mean = λ = {lam}')
        plt.legend()

        textstr = f'λ (lambda): {lam}\nMean: {lam}\nVariance: {lam}\nStd Dev: {math.sqrt(lam):.2f}'
        plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes,
                 fontsize=10, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

        plt.tight_layout()
        plt.show()

        return lam, lam

    # COMPARISON PLOT
    def plot_all_distributions(self):
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))

        # Binomial
        n, p = 20, 0.5
        k_binom = np.arange(0, n + 1)
        pmf_binom = stats.binom.pmf(k_binom, n, p)
        axes[0].bar(k_binom, pmf_binom, alpha=0.7, color='steelblue', edgecolor='black')
        axes[0].set_title(f'Binomial (n={n}, p={p})', fontsize=12, fontweight='bold')
        axes[0].set_xlabel('k')
        axes[0].set_ylabel('Probability')
        axes[0].grid(axis='y', alpha=0.3)

        # Normal
        mu, sigma = 0, 1
        x_norm = np.linspace(-4, 4, 1000)
        pdf_norm = stats.norm.pdf(x_norm, mu, sigma)
        axes[1].plot(x_norm, pdf_norm, 'b-', linewidth=2)
        axes[1].fill_between(x_norm, pdf_norm, alpha=0.3)
        axes[1].set_title(f'Normal (μ={mu}, σ={sigma})', fontsize=12, fontweight='bold')
        axes[1].set_xlabel('x')
        axes[1].set_ylabel('Probability Density')
        axes[1].grid(alpha=0.3)

        # Poisson
        lam = 3
        k_pois = np.arange(0, 15)
        pmf_pois = stats.poisson.pmf(k_pois, lam)
        axes[2].bar(k_pois, pmf_pois, alpha=0.7, color='coral', edgecolor='black')
        axes[2].set_title(f'Poisson (λ={lam})', fontsize=12, fontweight='bold')
        axes[2].set_xlabel('k')
        axes[2].set_ylabel('Probability')
        axes[2].grid(axis='y', alpha=0.3)

        plt.tight_layout()
        plt.show()

if __name__ == "__main__":
    analyzer = DistributionAnalyzer()

    print("=" * 60)
    print("STATISTICAL DISTRIBUTIONS VISUALIZATION")
    print("=" * 60)

    print("\n1. BINOMIAL DISTRIBUTION")
    print("-" * 60)
    print("Example: Flipping a coin 20 times")
    mean_b, var_b = analyzer.plot_binomial(n=20, p=0.5)
    print(f"Mean: {mean_b}, Variance: {var_b}")

    print("\n2. NORMAL DISTRIBUTION")
    print("-" * 60)
    print("Example: Standard normal distribution")
    mean_n, var_n = analyzer.plot_normal(mu=0, sigma=1)
    print(f"Mean: {mean_n}, Variance: {var_n}")

    print("\n3. POISSON DISTRIBUTION")
    print("-" * 60)
    print("Example: Average of 3 events per interval")
    mean_p, var_p = analyzer.plot_poisson(lam=3)
    print(f"Mean: {mean_p}, Variance: {var_p}")

    print("\n4. COMPARING ALL DISTRIBUTIONS")
    print("-" * 60)
    analyzer.plot_all_distributions()

    print("\n" + "=" * 60)
    print("Visualization complete!")
    print("=" * 60)

Output

Conclusion
In this practical, we implemented Binomial, Normal, and Poisson distributions using Python to
understand their behavior and visualize them.

• The Binomial Distribution was used to model discrete outcomes with fixed trials.
• The Normal Distribution represented continuous data with a symmetric bell-shaped
curve.
• The Poisson Distribution modeled the probability of a certain number of events occurring
within a fixed interval.

Through visualization and computation, we observed how these distributions differ in shape,
spread, and application, and how they are used in real-life statistical analysis and data
modeling.

Practical 5

Aim: Implement SVM and Decision tree classification techniques.

Theory
Support Vector Machine (SVM)
Support Vector Machine is a supervised learning algorithm used for classification (and
regression). SVM finds the optimal hyperplane that separates classes by maximizing the margin
between the nearest points of different classes (support vectors). For non-linearly separable
data, SVM uses kernel functions to map input features into a higher-dimensional space where a
linear separator may exist. SVM is effective in high-dimensional spaces and when the number
of features exceeds the number of samples.

Key ideas:
• Margin: distance between the hyperplane and nearest data points (support vectors).
• Optimal hyperplane: maximizes the margin.
• Kernels: linear, polynomial, radial basis function (RBF), sigmoid — allow non-linear
decision boundaries.
• Regularization parameter (C): trade-off between maximizing margin and minimizing
classification error.
• Slack variables (ξ): allow soft margin (tolerate misclassification).

Decision function:

f(x) = sign(w · x + b)

Where:
• w = weight vector
• x = input vector
• b = bias term

Objective:

minimize (1/2)‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0

Where:
• C = regularization parameter
• ξᵢ = slack variables allowing misclassification

Dual form (with kernels):

maximize Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)  subject to  0 ≤ αᵢ ≤ C,  Σᵢ αᵢ yᵢ = 0

Prediction:

f(x) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b)
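These ideas can be sketched on invented, linearly separable toy data (cluster centers and the random seed are arbitrary); note that only a handful of points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2-D clusters (toy data, fixed seed for reproducibility)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("Support vectors per class:", clf.n_support_)   # the margin-defining points
print("Training accuracy:", clf.score(X, y))
```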

Decision Tree
A Decision Tree is a non-parametric supervised learning method used for classification and
regression. It builds a tree-like model of decisions by recursively splitting the dataset on feature
values to create homogeneous subsets (in terms of class labels). Each internal node represents
a test on a feature, each branch the outcome, and each leaf node a class label (or distribution).

Key ideas:

• Recursive partitioning: split data to reduce impurity.
• Impurity measures: Gini impurity, Entropy (information gain), or Classification Error.
• Stopping criteria: maximum depth, minimum samples per leaf, or when further splits do not improve impurity significantly.
• Pruning: pre-pruning or post-pruning to avoid overfitting.

Entropy:

H(S) = − Σₖ pₖ log₂ pₖ

Where pₖ is the proportion of class k in dataset S.

Information Gain:

IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

Where:
• A = attribute used for splitting
• Sᵥ = subset of S where attribute A takes value v

Gini Impurity:

G(S) = 1 − Σₖ pₖ²
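Both impurity measures are easy to compute by hand; the sketch below evaluates an illustrative node holding 8 samples of one class and 2 of another:

```python
from math import log2

def entropy(probs):
    """Entropy in bits for a list of class proportions."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity for a list of class proportions."""
    return 1 - sum(p * p for p in probs)

probs = [0.8, 0.2]   # an 8-vs-2 node
print(f"Entropy = {entropy(probs):.4f}")   # 0.7219 bits
print(f"Gini    = {gini(probs):.4f}")      # 0.3200
```

A pure node ([1.0]) scores 0 under both measures, while a 50/50 node maximizes them (entropy 1 bit, Gini 0.5).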

Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SVM (RBF kernel)
print("=== SVM Classifier ===")
svm = SVC(kernel='rbf', C=1.0, random_state=42)
svm.fit(X_train_scaled, y_train)
y_pred_svm = svm.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(classification_report(y_test, y_pred_svm, target_names=iris.target_names))

# Train Decision Tree
print("\n=== Decision Tree Classifier ===")
dt = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Tree depth: {dt.get_depth()}, Leaves: {dt.get_n_leaves()}")
print(classification_report(y_test, y_pred_dt, target_names=iris.target_names))

# Visualize decision boundaries
fig1, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

def plot_boundary(model, X, y, ax, title, scaled=False):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    mesh = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(scaler.transform(mesh) if scaled else mesh).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='black')
    ax.set_title(title)

plot_boundary(svm, X_test, y_test, ax1, 'SVM (RBF)', scaled=True)
plot_boundary(dt, X_test, y_test, ax2, 'Decision Tree (Gini)')
plt.tight_layout()
plt.show()

# Plot decision tree structure
fig2, ax = plt.subplots(figsize=(15, 10))
plot_tree(dt, ax=ax, feature_names=iris.feature_names[:2],
          class_names=iris.target_names, filled=True, rounded=True, fontsize=10)
ax.set_title('Decision Tree Structure', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Output

Conclusion
In this practical, SVM and Decision Tree classification techniques were implemented and
compared. SVM provides robust, margin-based classification that is powerful for high-dimensional and (with kernels) non-linear problems; it requires careful feature scaling and
hyperparameter tuning.

Decision Trees provide intuitive rule-based models that are easy to interpret and require less
preprocessing, but they are prone to overfitting and often need pruning or ensemble methods
for better generalization.

By preparing data, tuning hyperparameters, and evaluating with appropriate metrics, both
methods can be effectively applied to classification problems; the choice between them
depends on dataset characteristics (linearity, dimensionality, noise, interpretability needs) and
performance on validation/test sets.

Practical 6

Aim: Write a program to build a logistic regression and linear regression model.

Theory
Linear Regression and Logistic Regression are regression analysis models, which are used to
model the relationship between dependent and independent variables, and help in predicting or
estimating the value of one variable based on the values of others.

Linear Regression: Used when the dependent variable is continuous. It establishes a linear
relationship between the dependent variable (Y) and one or more independent variables (X).
The aim is to find the best-fitting straight line through the data points that minimizes the error
between predicted and actual values.

General equation:

Y = β₀ + β₁X + ε

Where:

• Y = Dependent variable
• X = Independent variable
• β₀ = Intercept
• β₁ = Slope coefficient
• ε = Error term

To calculate the slope and intercept:

β₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
β₀ = Ȳ − β₁X̄
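A worked sketch of these closed-form estimates on a small invented dataset:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Least-squares slope and intercept from the closed-form expressions
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"slope = {b1:.2f}, intercept = {b0:.2f}")   # slope = 0.60, intercept = 2.20
```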

Logistic Regression: Used when the dependent variable is categorical (binary). It predicts the
probability that a given input belongs to a particular class. Instead of fitting a straight line, it fits
an S-shaped (sigmoid) curve that outputs probabilities between 0 and 1.

The logistic regression model predicts the probability (p) that the dependent variable belongs to class 1 as follows:

p = 1 / (1 + e^(−(β₀ + β₁X)))

Where:
• p = Probability of success (Y = 1)
• β₀, β₁ = Model coefficients
• X = Predictor variable
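A minimal sketch of the sigmoid mapping, using illustrative coefficients β₀ = −4 and β₁ = 1.5 (not fitted to any data):

```python
import math

def predict_proba(x, b0=-4.0, b1=1.5):
    """Sigmoid of the linear score b0 + b1*x; coefficients are illustrative."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(f"p(x=1) = {predict_proba(1):.3f}")   # low probability of class 1
print(f"p(x=4) = {predict_proba(4):.3f}")   # high probability of class 1
```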

Code

# Import required libraries


import numpy as np
import [Link] as plt
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from [Link] import load_iris, fetch_california_housing
from [Link] import mean_squared_error, r2_score, accuracy_score, confusion_matrix,
ConfusionMatrixDisplay

print("===== LINEAR REGRESSION (California Housing Dataset) =====")

housing = fetch_california_housing()
X, y = [Link], [Link]
feature_names = housing.feature_names

df_housing = [Link](X, columns=feature_names)


df_housing['MedHouseVal'] = y

X_room = df_housing[['AveRooms']]
y_price = df_housing['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X_room, y_price, test_size=0.2, random_state=42)


36

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

y_pred = lin_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)


r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.3f}")


print(f"R² Score: {r2:.3f}")

# Visualization
plt.figure(figsize=(6, 4))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Average Number of Rooms (AveRooms)')
plt.ylabel('Median House Value')
plt.title('Linear Regression: House Value vs Rooms')
plt.legend()
plt.show()

print("\n===== LOGISTIC REGRESSION (Iris Dataset) =====")

iris = load_iris()
X = iris.data[:, :2]  # Use first two features for visualization
y = iris.target

# For binary classification
y_binary = (y == 0).astype(int)  # 1 if setosa, else 0

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

y_pred = log_model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc*100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Non-Setosa', 'Setosa'])
disp.plot(cmap='Purples')
plt.title("Logistic Regression - Confusion Matrix")
plt.show()

# Decision boundary visualization
plt.figure(figsize=(6, 4))
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = log_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Logistic Regression Decision Boundary (Iris)')
plt.show()
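The MSE and R² values reported by scikit-learn can be cross-checked against their definitions; the toy arrays below are illustrative, not taken from the housing data:

```python
# Minimal sketch (toy arrays): MSE and R^2 computed from their definitions
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

mse = np.mean((y_true - y_hat) ** 2)            # average squared error
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # fraction of variance explained

print(f"MSE = {mse:.3f}, R² = {r2:.3f}")  # MSE = 0.375, R² = 0.925
```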

Output

Conclusion

In this experiment, we implemented Linear Regression on the California Housing dataset to
predict median house values from the average number of rooms.

Furthermore, we employed Logistic Regression on the Iris dataset to classify flowers as
setosa or non-setosa based on their sepal measurements.

Linear Regression helped us model a continuous relationship between variables, while Logistic
Regression was used to predict categorical outcomes by estimating class probabilities.
