
Lab Report-4

Title: Unsupervised Learning and Decision Tree


Objective: This lab manual aims to equip students with practical skills in
unsupervised and supervised learning. For unsupervised learning, students will
implement and analyze K-means, hierarchical clustering, and DBSCAN,
focusing on their application, evaluation using metrics like silhouette score, and
the impact of dimensionality reduction. They'll understand how to select
appropriate algorithms based on data characteristics.

For supervised learning, students will build and evaluate decision tree models.
They'll learn to visualize and interpret tree structures, understand Gini impurity
and entropy, and assess performance using metrics like accuracy and F1-score.
Students will also tune hyperparameters to mitigate overfitting and optimize
model performance through cross-validation. Optionally, they'll compare
decision trees to other classification algorithms, analyzing their respective
strengths and weaknesses. The lab emphasizes hands-on application and critical
analysis of these fundamental machine learning techniques.
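
As a rough illustration of that tuning workflow, the minimal sketch below performs a cross-validated grid search over a decision tree's hyperparameters. It assumes the standard scikit-learn API and uses an arbitrary synthetic dataset and an illustrative parameter grid, not values prescribed by the lab.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic data used purely for demonstration
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Illustrative grid: limiting depth and leaf size is a common way to curb overfitting
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

y_pred = search.best_estimator_.predict(X_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print("Test F1-score:", f1_score(y_test, y_pred))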

Theory:
Unsupervised learning: Unsupervised learning is a type of machine learning
where the algorithm learns patterns from data that has not been labelled or
classified. In contrast to supervised learning, where the model is trained on
input-output pairs (labelled data), unsupervised learning works with data that
contains only inputs (features) and no corresponding outputs. The goal of
unsupervised learning is to identify underlying structures, relationships, or
patterns within the data. It is often used for tasks like clustering,
dimensionality reduction, and anomaly detection.
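
For instance, the clustering algorithms covered in this lab can be compared on the same unlabelled data using the silhouette score. The short sketch below is illustrative only: the synthetic dataset and parameter values (e.g., the DBSCAN eps) are placeholders, not tuned choices.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Unlabelled synthetic data (the true labels are discarded)
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

models = {
    "K-means": KMeans(n_clusters=3, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    # The silhouette score is defined only when at least two clusters are found
    if len(set(labels)) > 1:
        print(f"{name}: silhouette score = {silhouette_score(X, labels):.3f}")
    else:
        print(f"{name}: fewer than two clusters found")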
Decision Tree: A Decision Tree is a supervised machine learning algorithm
used for both classification and regression tasks. It works by recursively
splitting the data into subsets based on the most significant feature, creating a
tree-like structure of decisions. In a decision tree:
I. Nodes represent decisions or tests on attributes (features).
II. Branches represent the outcome of those tests (e.g., feature values).
III. Leaf nodes represent the final decision or prediction (class label or
continuous value).
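
As a concrete, minimal sketch of these ideas (assuming scikit-learn's DecisionTreeClassifier and the built-in Iris dataset, chosen here only for illustration), a tree can be trained with the Gini criterion, its structure printed and interpreted, and its predictions scored with accuracy and F1-score:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, f1_score

# Illustrative dataset; the lab may use different data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Gini impurity is the default splitting criterion; entropy is the alternative
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Text view of the learned tree: internal nodes are tests, leaves are predicted classes
print(export_text(tree, feature_names=list(iris.feature_names)))

y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-score (macro):", f1_score(y_test, y_pred, average="macro"))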
Source Code 1:

import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles, make_blobs
from sklearn.model_selection import train_test_split
from sklearn import mixture
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import cm
%matplotlib inline  # Jupyter notebook magic for inline plots

# Import helper functions from utilities.py, or define mock placeholders if it is missing
try:
    from utilities import color, super_scat_it, distance, initiate, estimate_centroid
except ModuleNotFoundError:
    print("utilities.py module not found. Using placeholder functions.")

    def super_scat_it(X, y, k):
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
        plt.title("Cluster Visualization (placeholder)")
        plt.show()

    # You can similarly define placeholder functions for other functions you need from `utilities.py`
    def color():
        pass

    def distance():
        pass

    def initiate():
        pass

    def estimate_centroid():
        pass

# Generate a synthetic blob dataset and visualize it
nb_obs = 1000
k = 2
std = 4
dim = 2
seed = 10

X, y = make_blobs(n_samples=nb_obs, centers=k, cluster_std=std,
                  n_features=dim, random_state=seed)

super_scat_it(X, y, k)

Source Code 2:

import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, data, k, seed=None):
        """
        Args:
            data: unlabeled data
            k: number of clusters

        Class Attributes:
            self.data: unlabeled data
            self.centroid: cluster centers
            self.label: cluster labels for each point
            self.iteration: number of iterations before k-means converges
        """
        self.data = data
        self.k = k
        self.seed = seed
        np.random.seed(seed)
        # Initialize centroids (this should use a method like initiate)
        self.centroid = self.initiate(data, k)
        # Initialize the cluster labels (each point initially assigned to the nearest centroid)
        self.label = np.argmin(self.distance(self.data, self.centroid), axis=1)
        self.iteration = 0

    def initiate(self, data, k):
        """Function to initialize centroids randomly."""
        # Randomly select k data points as the initial centroids
        random_indices = np.random.choice(data.shape[0], k, replace=False)
        centroids = data[random_indices]
        return centroids

    def distance(self, data, centroids):
        """Function to compute the distance between data points and centroids."""
        return np.linalg.norm(data[:, np.newaxis] - centroids, axis=2)

    def estimate_centroid(self, data, labels):
        """Function to estimate the centroids of the clusters."""
        centroids = np.array([data[labels == i].mean(axis=0) for i in range(self.k)])
        return centroids

    def fit(self):
        """Fit the KMeans model to the data."""
        # Run the algorithm until convergence
        while True:
            # Step 1: Update the cluster centers (centroids)
            self.centroid = self.estimate_centroid(self.data, self.label)
            # Step 2: Update the labels (assign each point to the nearest centroid)
            label_new = np.argmin(self.distance(self.data, self.centroid), axis=1)
            # Check for convergence (if labels haven't changed)
            if np.array_equal(label_new, self.label):
                break
            # Update the labels for the next iteration
            self.label = label_new
            self.iteration += 1
        # Compute the objective function (mean of minimum distances to centroids)
        self.objective = np.mean(np.min(self.distance(self.data, self.centroid), axis=1))
        print(f"Converged after {self.iteration} iterations with objective: {self.objective}")

    def visualize_clusters(self):
        """Visualize the clustered data points and centroids."""
        plt.figure(figsize=(8, 6))
        plt.scatter(self.data[:, 0], self.data[:, 1], c=self.label, cmap='viridis', s=50)
        plt.scatter(self.centroid[:, 0], self.centroid[:, 1], s=200, c='red', marker='X', label='Centroids')
        plt.title('K-Means Clustering')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.legend()
        plt.show()

# Example usage:
if __name__ == "__main__":
    # Generate some sample data
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

    # Create an instance of KMeans
    kmeans = KMeans(data=X, k=3, seed=42)

    # Fit the model
    kmeans.fit()

    # Visualize the clusters
    kmeans.visualize_clusters()

Source Code 3:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans  # KMeans for clustering the hidden representations

# Sample data and model initialization
X_train, y_train = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Train a neural network model (e.g., Multi-layer Perceptron)
aenn = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42)
aenn.fit(X_train, y_train)

# Select a subset of 500 samples for visualization
n_sub = 500  # Subset of data to visualize
X_sub = X_train[:n_sub]
y_sub = y_train[:n_sub]

# Get the hidden representations (activations) for the first hidden layer
hiddens = expit(np.dot(X_sub, aenn.coefs_[0]) + aenn.intercepts_[0])

# Perform KMeans clustering on the hidden representations to find centroids
kmeans = KMeans(n_clusters=3, random_state=42)  # Adjust n_clusters as needed
kmeans.fit(hiddens)

# Get the cluster centers (centroids)
centroids = kmeans.cluster_centers_

# Plotting the centroids in the 2D hidden space
fig = plt.figure()

# We take the first two dimensions of the centroids for visualization purposes
centroids_2d = centroids[:, :2]

# Plot the data points and color them by their cluster assignment
# (first hidden dimension on the x-axis, second on the y-axis, to match the labels below)
plt.scatter(hiddens[:, 0], hiddens[:, 1], c=kmeans.labels_, cmap='viridis', alpha=0.5)

# Plot the centroids
plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], color='red', marker='x', s=100, label='Centroids')

# Adding labels and title
plt.xlabel('First hidden dimension')
plt.ylabel('Second hidden dimension')
plt.title('Centroids of Clusters in Hidden Layer Activations')
plt.legend()
plt.tight_layout()
plt.show()

Conclusion: This lab provided hands-on experience in implementing
unsupervised learning and decision tree models. The experiments demonstrated
the importance of data pre-processing, model selection, and performance
evaluation. Future improvements could include trying more advanced models,
such as deep learning approaches, for better accuracy.
