Lab Report 4
For supervised learning, students will build and evaluate decision tree models.
They'll learn to visualize and interpret tree structures, understand Gini impurity
and entropy, and assess performance using metrics like accuracy and F1-score.
Students will also tune hyperparameters to mitigate overfitting and optimize
model performance through cross-validation. Optionally, they'll compare
decision trees to other classification algorithms, analyzing their respective
strengths and weaknesses. The lab emphasizes hands-on application and critical
analysis of these fundamental machine learning techniques.
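To make the tuning step concrete, the short sketch below (an illustrative example, not the lab hand-out code) fits a scikit-learn DecisionTreeClassifier on the Iris dataset and uses GridSearchCV to choose max_depth and min_samples_leaf by 5-fold cross-validation, then reports accuracy and macro F1 on a held-out test set; the dataset and the parameter grid are assumptions made purely for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Illustrative dataset; the actual lab data may differ
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune depth-related hyperparameters by 5-fold cross-validation to limit overfitting
grid = GridSearchCV(
    DecisionTreeClassifier(criterion='gini', random_state=0),
    param_grid={'max_depth': [2, 3, 4, 5, None], 'min_samples_leaf': [1, 2, 5, 10]},
    cv=5, scoring='f1_macro')
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print('Best parameters:', grid.best_params_)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Macro F1:', f1_score(y_test, y_pred, average='macro'))
Restricting max_depth and min_samples_leaf is one common way to keep the tree from memorizing the training data.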
Theory:
Unsupervised learning: Unsupervised learning is a type of machine learning
where the algorithm learns patterns from data that has not been labelled or
classified. In contrast to supervised learning, where the model is trained using
input-output pairs (labelled data), unsupervised learning works with data
that only contains inputs (features) without any corresponding outputs. The goal
of unsupervised learning is to identify underlying structures, relationships, or
patterns within the data. It’s often used for tasks like clustering, dimensionality
reduction, and anomaly detection.
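As a small illustration of unsupervised dimensionality reduction (clustering itself is implemented from scratch in the source code below), the following sketch projects the scikit-learn digits dataset onto two principal components with PCA; the dataset choice is an assumption made purely for illustration.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Only the inputs X are used; the labels are ignored, which is what makes this unsupervised
X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X.shape, '->', X_2d.shape)          # 64 pixel features reduced to 2 components
print(pca.explained_variance_ratio_)      # fraction of variance captured by each component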
Decision Tree: A Decision Tree is a supervised machine learning algorithm
used for both classification and regression tasks. It works by recursively
splitting the data into subsets based on the most significant feature, creating a
tree-like structure of decisions. In a decision tree (see the short sketch after this list):
I. Nodes represent decisions or tests on attributes (features).
II. Branches represent the outcome of those tests (e.g., feature values).
III. Leaf nodes represent the final decision or prediction (class label or
continuous value).
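A minimal sketch of this structure, assuming scikit-learn and its bundled Iris dataset, is shown below: export_text prints a fitted tree so the feature tests (nodes), their outcomes (branches), and the class predictions (leaves) can be read directly.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed structure small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Internal nodes are feature tests, branches are test outcomes, leaves are class predictions
print(export_text(tree, feature_names=list(iris.feature_names)))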
Source Code 1:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import sklearn as sk
%matplotlib inline

try:
    # Helper functions shipped with the lab in `utilities.py`
    from utilities import super_scat_it, color, distance, initiate, estimate_centroid
except ModuleNotFoundError:
    # Fallback so the notebook still runs without `utilities.py`
    def super_scat_it(X, y, k):
        plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
        plt.show()
    # You can similarly define placeholder functions for other functions you need from `utilities.py`
    def color():
        pass
    def distance():
        pass
    def initiate():
        pass
    def estimate_centroid():
        pass

# Generate a synthetic 2-D dataset (make_blobs and k = 3 are assumed stand-ins for the lab's data)
nb_obs = 1000
k = 3
X, y = make_blobs(n_samples=nb_obs, centers=k, random_state=0)
super_scat_it(X, y, k)
class KMeans:
    """K-Means clustering implemented from scratch.
    Args:
        data: (n_samples, n_features) array to cluster
        k: number of clusters
        seed: random seed for reproducible centroid initialization
    Class Attributes:
        centroid, label, iteration (updated during fit)
    """
    def __init__(self, data, k, seed=0):
        self.data = data
        self.k = k
        self.seed = seed
        np.random.seed(seed)
        self.centroid = self.initiate(data, k)
        # Initialize the cluster labels (each point initially assigned to the nearest centroid)
        self.label = np.argmin(self.distance(data, self.centroid), axis=1)
        self.iteration = 0

    @staticmethod
    def initiate(data, k):
        # Pick k distinct observations at random as the starting centroids
        random_indices = np.random.choice(len(data), size=k, replace=False)
        centroids = data[random_indices]
        return centroids

    @staticmethod
    def distance(data, centroids):
        # Euclidean distance from every point to every centroid, shape (n_samples, k)
        return np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)

    def fit(self):
        while True:
            # Step 1: Update the centroids (mean of the points currently assigned to each cluster)
            self.centroid = np.array([self.data[self.label == j].mean(axis=0) for j in range(self.k)])
            # Step 2: Update the labels (assign each point to the nearest centroid)
            label_new = np.argmin(self.distance(self.data, self.centroid), axis=1)
            if np.array_equal(label_new, self.label):
                break
            self.label = label_new
            self.iteration += 1

    def visualize_clusters(self):
        plt.figure(figsize=(8, 6))
        # Plot each cluster in its own colour, then mark the centroids
        for j in range(self.k):
            points = self.data[self.label == j]
            plt.scatter(points[:, 0], points[:, 1], s=10, label=f'Cluster {j}')
        plt.scatter(self.centroid[:, 0], self.centroid[:, 1], c='black', marker='x', s=100, label='Centroids')
        plt.title('K-Means Clustering')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.legend()
        plt.show()
# Example usage:
if __name__ == "__main__":
    # Generate some sample data and run the from-scratch K-Means on it
    X, _ = make_blobs(n_samples=nb_obs, centers=k, random_state=0)
    kmeans = KMeans(X, k, seed=0)
    kmeans.fit()
    kmeans.visualize_clusters()
Source Code 2:
from sklearn.cluster import KMeans  # KMeans for clustering the hidden representations (shadows the from-scratch class above)

# `aenn`, `X_train`, `y_train` and `n_sub` are assumed to be defined earlier in the lab
# (the autoencoder neural network and its training data)
aenn.fit(X_train, y_train)

# Keep a small subset of the training data for visualization
X_sub = X_train[:n_sub]
y_sub = y_train[:n_sub]

# Get the hidden representations (activations) for the first hidden layer
# (`hidden_activations` is an assumed helper of the lab's autoencoder class)
hiddens = aenn.hidden_activations(X_sub)

# Cluster the hidden representations with scikit-learn's KMeans
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)  # n_clusters assumed, e.g. one per class
kmeans.fit(hiddens)
centroids = kmeans.cluster_centers_

fig = plt.figure()
# We take the first two dimensions of the centroids (and hiddens) for visualization purposes
centroids_2d = centroids[:, :2]
plt.scatter(hiddens[:, 0], hiddens[:, 1], c=kmeans.labels_, s=10, label='Hidden representations')
plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], c='black', marker='x', s=100, label='Centroids')
plt.legend()
plt.tight_layout()
plt.show()