0% found this document useful (0 votes)

37 views50 pages

Cse519 hw3

Uploaded by

wstar2176

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views50 pages

Cse519 hw3

Uploaded by

wstar2176

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 50

Copy this notebook (if using Colab) via File -> Save a Copy in

Drive.

You can do this assignment outside of Colab (using your

local Python installation) via File -> Download.
Use the "Text" blocks to provide explanations wherever
you find them necessary. Highlight your answers inside
these text fields to ensure that we don't miss it while
grading your HW.
Please answer questions within their designated section
and in the order they are asked to maintain clarity and
organization. You can add new code and text blocks if you
want.

Installs 📥
pip install datasets

Collecting datasets
Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: filelock in
/usr/local/lib/python3.10/dist-packages (from datasets) (3.16.1)
Requirement already satisfied: numpy>=1.17 in
/usr/local/lib/python3.10/dist-packages (from datasets) (1.26.4)
Requirement already satisfied: pyarrow>=15.0.0 in
/usr/local/lib/python3.10/dist-packages (from datasets) (18.0.0)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: pandas in
/usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)
Requirement already satisfied: requests>=2.32.2 in
/usr/local/lib/python3.10/dist-packages (from datasets) (2.32.3)
Requirement already satisfied: tqdm>=4.66.3 in
/usr/local/lib/python3.10/dist-packages (from datasets) (4.66.6)
Collecting xxhash (from datasets)
Downloading xxhash-3.5.0-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2
kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from
fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting aiohttp (from datasets)
Downloading aiohttp-3.10.10-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.6 kB)
Requirement already satisfied: huggingface-hub>=0.23.0 in
/usr/local/lib/python3.10/dist-packages (from datasets) (0.24.7)
Requirement already satisfied: packaging in
/usr/local/lib/python3.10/dist-packages (from datasets) (24.1)
Requirement already satisfied: pyyaml>=5.1 in
/usr/local/lib/python3.10/dist-packages (from datasets) (6.0.2)
Collecting aiohappyeyeballs>=2.3.0 (from aiohttp->datasets)
Downloading aiohappyeyeballs-2.4.3-py3-none-any.whl.metadata (6.1
kB)
Collecting aiosignal>=1.1.2 (from aiohttp->datasets)
Downloading aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)
Requirement already satisfied: attrs>=17.3.0 in
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets)
(24.2.0)
Collecting frozenlist>=1.1.1 (from aiohttp->datasets)
Downloading frozenlist-1.5.0-cp310-cp310-
manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux
2014_x86_64.whl.metadata (13 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp->datasets)
Downloading multidict-6.1.0-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.0 kB)
Collecting yarl<2.0,>=1.12.0 (from aiohttp->datasets)
Downloading yarl-1.17.1-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (64 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.8/64.8 kB 1.2 MB/s eta
0:00:00
eout<5.0,>=4.0 (from aiohttp->datasets)
Downloading async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: typing-extensions>=3.7.4.3 in
/usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0-
>datasets) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.2-
>datasets) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.2-
>datasets) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.2-
>datasets) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in
/usr/local/lib/python3.10/dist-packages (from requests>=2.32.2-
>datasets) (2024.8.30)
Requirement already satisfied: python-dateutil>=2.8.2 in
/usr/local/lib/python3.10/dist-packages (from pandas->datasets)
(2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in
/usr/local/lib/python3.10/dist-packages (from pandas->datasets)
(2024.2)
Requirement already satisfied: tzdata>=2022.7 in
/usr/local/lib/python3.10/dist-packages (from pandas->datasets)
(2024.2)
Requirement already satisfied: six>=1.5 in
/usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2-
>pandas->datasets) (1.16.0)
Collecting propcache>=0.2.0 (from yarl<2.0,>=1.12.0->aiohttp-
>datasets)
Downloading propcache-0.2.0-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 13.2 MB/s eta
0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 8.6 MB/s eta
0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 179.3/179.3 kB 12.4 MB/s eta
0:00:00
anylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 37.9 MB/s eta
0:00:00
ultiprocess-0.70.16-py310-none-any.whl (134 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 9.1 MB/s eta
0:00:00
anylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.1/194.1 kB 12.0 MB/s eta
0:00:00
eout-4.0.3-py3-none-any.whl (5.7 kB)
Downloading frozenlist-1.5.0-cp310-cp310-
manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux
2014_x86_64.whl (241 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 241.9/241.9 kB 14.7 MB/s eta
0:00:00
ultidict-6.1.0-cp310-cp310-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl (124 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.6/124.6 kB 7.3 MB/s eta
0:00:00
anylinux_2_17_x86_64.manylinux2014_x86_64.whl (318 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 318.7/318.7 kB 13.8 MB/s eta
0:00:00
anylinux_2_17_x86_64.manylinux2014_x86_64.whl (208 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 208.9/208.9 kB 9.2 MB/s eta
0:00:00
ultidict, fsspec, frozenlist, dill, async-timeout, aiohappyeyeballs,
yarl, multiprocess, aiosignal, aiohttp, datasets
Attempting uninstall: fsspec
Found existing installation: fsspec 2024.10.0
Uninstalling fsspec-2024.10.0:
Successfully uninstalled fsspec-2024.10.0
Successfully installed aiohappyeyeballs-2.4.3 aiohttp-3.10.10
aiosignal-1.3.1 async-timeout-4.0.3 datasets-3.1.0 dill-0.3.8
frozenlist-1.5.0 fsspec-2024.9.0 multidict-6.1.0 multiprocess-0.70.16
propcache-0.2.0 xxhash-3.5.0 yarl-1.17.1

Imports 📂
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import torch
import torch.nn as nn

import torchvision
import torchvision.transforms as transforms
import torchvision.models as models

from sklearn.decomposition import PCA

from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report

from tqdm.auto import tqdm

from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertModel

/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py:253:
UserWarning: `tensorflow` can conflict with `torch-xla`. Prefer
`tensorflow-cpu` when using PyTorch/XLA. To silence this warning, `pip
uninstall -y tensorflow && pip install tensorflow-cpu`. If you are in
a notebook environment such as Colab or Kaggle, restart your notebook
runtime afterwards.
warnings.warn(
from torch.utils.data import DataLoader, random_split, Subset

Check if GPU is available 🚀

Use GPU for faster analysis (Google Colab provides access to some free GPUs with time limits)

# Check if GPU is available

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu

Vision Task [45 Points]

Q1: Loading and Splitting CIFAR-10 Dataset (5pts)
Reference: https://discuss.pytorch.org/t/what-does-it-mean-to-normalize-images-for-resnet/
96160

transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406),
(0.229, 0.224, 0.225))
])

train_dataset = torchvision.datasets.CIFAR10(root='./data',
train=True,
download=True,
transform=transform)

test_dataset = torchvision.datasets.CIFAR10(root='./data',
train=False,
download=True,
transform=transform)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to
./data/cifar-10-python.tar.gz

100%|██████████| 170M/170M [00:05<00:00, 30.4MB/s]

Extracting ./data/cifar-10-python.tar.gz to ./data

Files already downloaded and verified

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True,

num_workers=3)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False,
num_workers=3)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/
dataloader.py:617: UserWarning: This DataLoader will create 3 worker
processes in total. Our suggested max number of worker in current
system is 2, which is smaller than what this DataLoader is going to
create. Please be aware that excessive worker creation might get
DataLoader running slow or even freeze, lower the worker number to
avoid potential slowness/freeze if necessary.
warnings.warn(

Q2: Feature Extraction with ResNet-18 (5pts)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda

resnet18 = models.resnet18(pretrained=True)
modules = list(resnet18.children())[:-1]
resnet18 = nn.Sequential(*modules)
resnet18 = resnet18.to(device)
resnet18.eval()

Reference https://debuggercafe.com/training-resnet18-from-scratch-using-pytorch/

https://discuss.pytorch.org/t/use-resnet18-as-feature-extractor/8267

def extract_embeddings(dataloader):
embeddings = []
labels = []
with torch.no_grad():
for inputs, targets in dataloader:
inputs = inputs.to(device)
temp = resnet18(inputs)
temp = temp.view(temp.size(0), -1)
embeddings.append(temp.cpu())
labels.append(targets)
embeddings = torch.cat(embeddings)
labels = torch.cat(labels)
return embeddings, labels

train_embeddings, train_labels = extract_embeddings(train_loader)

test_embeddings, test_labels = extract_embeddings(test_loader)

print(f'Train Embeddings Shape: {train_embeddings.shape}')

print(f'Test Embeddings Shape: {test_embeddings.shape}')
Train Embeddings Shape: torch.Size([50000, 512])
Test Embeddings Shape: torch.Size([10000, 512])

I have used uses a pre-trained ResNet-18 model to extract meaningful features from images. By
removing the final classification layer, it processes each image through all the other layers,
outputting a condensed feature representation, which is then used for further analysis or tasks
without classification.

Q3: t-SNE Visualization of ResNet-18 Embeddings (5pts)

sample_size = 10000
indices = np.random.choice(len(test_embeddings), sample_size,
replace=False)
sample_embeddings = test_embeddings[indices].numpy()
sample_labels = test_labels[indices].numpy()

tsne = TSNE(n_components=2, random_state=42, perplexity=30,

n_iter=1000)
tsne_results = tsne.fit_transform(sample_embeddings)

/usr/local/lib/python3.10/dist-packages/sklearn/manifold/
_t_sne.py:1162: FutureWarning: 'n_iter' was renamed to 'max_iter' in
version 1.5 and will be removed in 1.7.
warnings.warn(

class_names = train_dataset.classes

plt.figure(figsize=(10, 8))
palette = sns.color_palette("tab10", len(class_names))
sns.scatterplot(
x=tsne_results[:,0], y=tsne_results[:,1],
hue=sample_labels,
palette=palette,
legend='full',
alpha=0.6
)
plt.title('t-SNE Visualization of ResNet-18 Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
handles, _ = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=class_names, title="Classes",
loc='best')
plt.show()
I find the t-SNE visualization of the ResNet-18 embeddings quite insightful in showing how well
the model has captured the structure of the CIFAR-10 classes. Looking at the plot, I’m impressed
by how distinct some of the clusters are, especially for classes like "airplane," "automobile," and
"ship." These classes seem to have well-defined boundaries, which tells that ResNet-18 has
effectively learned to identify high-level features unique to these categories, even when the
images vary in background and angles. It’s fascinating to see how transfer learning from a model
trained on ImageNet can still create meaningful embeddings for a different dataset like CIFAR-
10.

I do notice, some overlap between classes, especially within the animal categories such as "cat,"
"dog," and "deer." This overlap is understandable, given that these animals share similar
textures and colors, which can make them harder for the model to separate cleanly in a lower-
dimensional space. For instance, both cats and dogs may have fur textures or domestic
backgrounds that could confuse the model at times. But even within these mixed areas, there’s
still a general grouping of each animal class, which suggests that ResNet-18 has captured
enough distinct features to give a rough separation. I think this partially successful clustering
points to the model's strength in broad feature extraction, while also hinting at the limitations of
a pre-trained model when applied to subtle distinctions in a new dataset.
t-SNE plot highlights both the capabilities and some limitations of using ResNet-18 as a feature
extractor for CIFAR-10. The well-defined clusters for certain classes suggest that the model has
done an impressive job with categories that have more unique visual characteristics, like
vehicles. Meanwhile, the overlaps among similar classes indicate that a model fine-tuned
specifically for CIFAR-10 might perform even better.

Q4: Nearest neighbor classification (10pts)

Part A - Confusion Matrix and Evaluation (5pts)
num_classes = 10
centroids = []
for c in range(num_classes):
class_embeddings = train_embeddings[train_labels == c]
centroid = class_embeddings.mean(dim=0)
centroids.append(centroid)
centroids = torch.stack(centroids)

def nearest_neighbor_classification(test_embeddings, centroids,

metric='euclidean'):
if metric == 'euclidean':
distances = torch.cdist(test_embeddings, centroids, p=2)
elif metric == 'cosine':
test_norm = torch.nn.functional.normalize(test_embeddings,
p=2, dim=1)
centroids_norm = torch.nn.functional.normalize(centroids, p=2,
dim=1)
distances = 1 - torch.matmul(test_norm, centroids_norm.T)
else:
raise ValueError("Unsupported metric. Choose 'euclidean' or
'cosine'.")

_, predictions = torch.min(distances, dim=1)

return predictions.numpy()

preds_euclidean = nearest_neighbor_classification(test_embeddings,
centroids, metric='euclidean')
preds_cosine = nearest_neighbor_classification(test_embeddings,
centroids, metric='cosine')

conf_matrix_euclidean = confusion_matrix(test_labels, preds_euclidean)

conf_matrix_cosine = confusion_matrix(test_labels, preds_cosine)

accuracy_euclidean = accuracy_score(test_labels, preds_euclidean)

accuracy_cosine = accuracy_score(test_labels, preds_cosine)

def plot_confusion_matrix(cm, classes, title):

plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=classes, yticklabels=classes)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title(title)
plt.show()

class_names = train_dataset.classes

print("Euclidean Distance - Accuracy: {:.2f}

%".format(accuracy_euclidean * 100))
plot_confusion_matrix(conf_matrix_euclidean, class_names, "Confusion
Matrix - Euclidean Distance")

Euclidean Distance - Accuracy: 74.83%

print("Cosine Similarity - Accuracy: {:.2f}%".format(accuracy_cosine *

100))
plot_confusion_matrix(conf_matrix_cosine, class_names, "Confusion
Matrix - Cosine Similarity")

Cosine Similarity - Accuracy: 75.47%

Part B - Outlier Detection and Analysis (5pts)

def compute_distance(test_embeddings, centroids, metric='euclidean'):
if metric == 'euclidean':
distances = torch.cdist(test_embeddings, centroids, p=2)
elif metric == 'cosine':
test_norm = torch.nn.functional.normalize(test_embeddings,
p=2, dim=1)
centroids_norm = torch.nn.functional.normalize(centroids, p=2,
dim=1)
distances = 1 - torch.matmul(test_norm, centroids_norm.T)
return distances
metric = 'euclidean'
distances = compute_distance(test_embeddings, centroids,
metric=metric)
min_distances, nearest_centroid = torch.min(distances, dim=1)

outliers = []
for c in range(num_classes):
class_indices = (test_labels == c)
class_distances = min_distances[class_indices]
max_dist_idx = class_distances.argmax().item()
actual_idx = torch.nonzero(class_indices).squeeze().numpy()
[max_dist_idx]
outliers.append(actual_idx)

for idx in outliers:

image, label = test_dataset[idx]
predicted_label = preds_euclidean[idx] if metric == 'euclidean'
else preds_cosine[idx]
plt.figure(figsize=(2, 2))
plt.imshow(np.transpose(image.numpy(), (1, 2, 0)) * 0.5 + 0.5)
plt.title(f'True: {class_names[label]} | Predicted:
{class_names[predicted_label]}')
plt.axis('off')
plt.show()
print(f'Index: {idx}, Distance: {min_distances[idx]:.4f}')
print('-' * 50)

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).
WARNING:matplotlib.image:Clipping input data to the valid range for
imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 5259, Distance: 21.8868

--------------------------------------------------
WARNING:matplotlib.image:Clipping input data to the valid range for
imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 6581, Distance: 23.3660

--------------------------------------------------
Index: 6352, Distance: 26.9467
--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

Index: 9676, Distance: 22.6462

--------------------------------------------------
WARNING:matplotlib.image:Clipping input data to the valid range for
imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 9886, Distance: 20.0355

--------------------------------------------------
WARNING:matplotlib.image:Clipping input data to the valid range for
imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 6259, Distance: 24.2270

--------------------------------------------------
Index: 8573, Distance: 22.9653

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).

--------------------------------------------------
Index: 808, Distance: 21.2763
--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

Index: 218, Distance: 23.0669

--------------------------------------------------
Index: 2528, Distance: 21.2343
--------------------------------------------------

The nearest neighbor classification using centroids for each class in the embedding space
showed reasonable accuracy, particularly for classes with distinct features, like vehicles and
certain animals. Using both Euclidean and cosine distance metrics, we observed that Euclidean
distance performed well on clear-cut classes, while cosine similarity better handled overlapping
features, as in different animal categories. Outliers identified in each class—images furthest
from their respective centroids—often represented challenging cases due to unusual lighting,
angles, or backgrounds. For example, an "airplane" in silhouette was misclassified as a "ship,"
likely due to its blurred outline and color scheme, while an "automobile" image was mistaken for
a "truck," perhaps because of its close-up angle. These misclassifications highlight that while
the centroid approach captures general class characteristics, it lacks the nuanced discrimination
needed for complex cases with subtle visual similarities across classes. This suggests that while
centroid-based nearest neighbor classification is effective, it could benefit from additional
refinement for improved accuracy in borderline cases.

metric = 'cosine'
distances = compute_distance(test_embeddings, centroids,
metric=metric)
min_distances, nearest_centroid = torch.min(distances, dim=1)

for idx in outliers:

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 3444, Distance: 0.3708

--------------------------------------------------
WARNING:matplotlib.image:Clipping input data to the valid range for
imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 6010, Distance: 0.2512

--------------------------------------------------

Index: 4421, Distance: 0.2773

--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).
Index: 9246, Distance: 0.3455
--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 8244, Distance: 0.2729

--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 2651, Distance: 0.3119

--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

Index: 4279, Distance: 0.2763

--------------------------------------------------

WARNING:matplotlib.image:Clipping input data to the valid range for

imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 808, Distance: 0.2731

--------------------------------------------------
WARNING:matplotlib.image:Clipping input data to the valid range for
imshow with RGB data ([0..1] for floats or [0..255] for integers).

Index: 8269, Distance: 0.2532

--------------------------------------------------

Index: 2756, Distance: 0.3756

--------------------------------------------------

With the nearest neighbor classification using cosine similarity, I found that this metric offered
some subtle improvements over Euclidean distance, particularly for images where shape and
structure played a key role in distinguishing between classes. For example, cosine similarity
seemed to handle certain "frog" images better, likely because it focuses on the direction of
features rather than their magnitude. This focus on structural alignment rather than overall
intensity can be advantageous for classes where variations in lighting or color intensity might
otherwise confuse the model. However, some challenges persisted, especially in differentiating
between classes with similar backgrounds or ambiguous features, like "airplane" and "ship"
images that share similar sky or water contexts.

Despite these improvements, cosine similarity still struggled with certain outliers, especially in
low-light or blurred images. For instance, some "deer" images were still mistakenly classified as
"birds," and a few "ship" images ended up classified as "airplanes," due to shared background
characteristics that were difficult to distinguish using just cosine similarity. I found that, while
cosine similarity is helpful in reducing errors for certain structurally similar classes, it doesn’t
entirely resolve the issue of ambiguous or low-quality images. Both distance metrics faced
limitations with these tough cases, which highlights the need for a more refined or complex
classification approach, possibly involving additional context-based features or an enhanced
feature extraction process.

Q5: Building an Image Classification Model (10pts)

Part A - Confusion Matrix and Evaluation (5pts)
X_train = train_embeddings.numpy()
y_train = train_labels.numpy()
X_test = test_embeddings.numpy()
y_test = test_labels.numpy()
sgd_clf = SGDClassifier(
loss='log_loss',
max_iter=1000,
tol=1e-3,
n_jobs=-1,
random_state=42
)

sgd_clf.fit(X_train, y_train)
preds_sgd = sgd_clf.predict(X_test)

conf_matrix_sgd = confusion_matrix(y_test, preds_sgd)

accuracy_sgd = accuracy_score(y_test, preds_sgd)
plot_confusion_matrix(conf_matrix_sgd, class_names, "Confusion Matrix
- SGDClassifier")

print("SGDClassifier - Accuracy: {:.2f}%".format(accuracy_sgd * 100))

print("Classification Report:\n", classification_report(y_test,
preds_sgd, target_names=class_names))
SGDClassifier - Accuracy: 84.67%
Classification Report:
precision recall f1-score support

airplane 0.85 0.85 0.85 1000

automobile 0.90 0.95 0.92 1000
bird 0.79 0.80 0.79 1000
cat 0.66 0.81 0.73 1000
deer 0.87 0.77 0.81 1000
dog 0.81 0.79 0.80 1000
frog 0.92 0.86 0.89 1000
horse 0.92 0.81 0.86 1000
ship 0.87 0.94 0.90 1000
truck 0.94 0.91 0.92 1000

accuracy 0.85 10000

macro avg 0.85 0.85 0.85 10000
weighted avg 0.85 0.85 0.85 10000

Part B - Comparison Between Your Model and Nearest Neighbor

Classification (5pts)
print(f"Nearest Neighbor (Euclidean) Accuracy: {accuracy_euclidean *
100:.2f}%")
print(f"Nearest Neighbor (Cosine) Accuracy: {accuracy_cosine *
100:.2f}%")
print(f"SGD Model Accuracy: {accuracy_sgd * 100:.2f}%")

Nearest Neighbor (Euclidean) Accuracy: 74.83%

Nearest Neighbor (Cosine) Accuracy: 75.47%
SGD Model Accuracy: 84.67%

The classification model built using the top-level embeddings, specifically with an SGD
classifier, achieved an accuracy of 84.67%, which outperformed both nearest neighbor
approaches—Euclidean (74.83%) and cosine similarity (75.47%). This higher accuracy suggests
that the model-based approach can capture more nuanced relationships in the data than the
nearest neighbor methods. By training directly on the embeddings, the SGD model was able to
create decision boundaries that better separate the classes, even when the features are complex
or overlap. The improved accuracy indicates that this model is more effective in recognizing
subtle distinctions between classes, which nearest neighbor approaches might overlook due to
their reliance on centroid distance alone.

In terms of mistakes, the nearest neighbor classifiers often struggled with classes that had
overlapping features or similar backgrounds, like distinguishing between "automobiles" and
"trucks" or between certain animals. The SGD model, however, made fewer errors in these
areas, likely due to its ability to learn specific patterns and optimize based on labeled training
data rather than relying solely on distances to centroids. This approach gave it an edge in dealing
with challenging or ambiguous images. Overall, while the nearest neighbor classifiers provide a
simple and interpretable method, the model-based approach using SGD proved more robust
and accurate, making it a better choice for this classification task.

The confusion matrix and classification report of the SGD classifier provide detailed insights into
the performance of the model across each CIFAR-10 class. With an overall accuracy of 84.67%,
the model shows strong performance, particularly for classes like "automobile," "truck," and
"ship," which all achieved high precision and recall scores above 90%. This indicates that the
classifier is well-suited for recognizing these classes, likely because they have distinct features
that make them easier to differentiate from others in the dataset.

Q6: Dimension Reduction (10pts)

def reduce_dimensions(X_train, X_test, n_components):
pca = PCA(n_components=n_components, random_state=42)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)
return X_train_reduced, X_test_reduced, pca
X_train_pca10, X_test_pca10, pca10 = reduce_dimensions(X_train,
X_test, n_components=10)
X_train_pca50, X_test_pca50, pca50 = reduce_dimensions(X_train,
X_test, n_components=50)

def evaluate_model(X_tr, y_tr, X_te, y_te):

model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3,
n_jobs=-1, random_state=42)
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
acc = accuracy_score(y_te, preds)
return acc, preds

acc_full, preds_full = evaluate_model(X_train, y_train, X_test,

y_test)
acc_pca50, preds_pca50 = evaluate_model(X_train_pca50, y_train,
X_test_pca50, y_test)
acc_pca10, preds_pca10 = evaluate_model(X_train_pca10, y_train,
X_test_pca10, y_test)

print(f'Accuracy (Full Embeddings): {acc_full * 100:.2f}%')

print(f'Accuracy (PCA 50 Dimensions): {acc_pca50 * 100:.2f}%')
print(f'Accuracy (PCA 10 Dimensions): {acc_pca10 * 100:.2f}%')

Accuracy (Full Embeddings): 84.67%

Accuracy (PCA 50 Dimensions): 79.38%
Accuracy (PCA 10 Dimensions): 67.58%

tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30,

n_iter=1000)
tsne_results_pca10 = tsne_pca.fit_transform(X_test_pca10)
plt.figure(figsize=(10, 8))
sns.scatterplot(
x=tsne_results_pca10[:,0], y=tsne_results_pca10[:,1],
hue=y_test,
palette = sns.color_palette("tab10", len(class_names)),
legend='full',
alpha=0.6
)
plt.title('t-SNE on PCA-Reduced (10D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
handles, _ = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=class_names, title="Classes",
loc='best')
plt.show()

/usr/local/lib/python3.10/dist-packages/sklearn/manifold/
_t_sne.py:1162: FutureWarning: 'n_iter' was renamed to 'max_iter' in
version 1.5 and will be removed in 1.7.
warnings.warn(
tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30,
n_iter=1000)
tsne_results_pca50 = tsne_pca.fit_transform(X_test_pca50)
plt.figure(figsize=(10, 8))
sns.scatterplot(
x=tsne_results_pca50[:,0], y=tsne_results_pca50[:,1],
hue=y_test,
palette = sns.color_palette("tab10", len(class_names)),
legend='full',
alpha=0.6
)
plt.title('t-SNE on PCA-Reduced (50D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
handles, _ = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=class_names, title="Classes",
loc='best')
plt.show()
/usr/local/lib/python3.10/dist-packages/sklearn/manifold/
_t_sne.py:1162: FutureWarning: 'n_iter' was renamed to 'max_iter' in
version 1.5 and will be removed in 1.7.
warnings.warn(

The experiment with dimensionality reduction using PCA shows that reducing the embeddings
to 50 dimensions results in some loss in classification accuracy, but it still retains a fair amount
of the original performance. With the full-dimensional embeddings, the SGD classifier achieved
an accuracy of 84.67%, while reducing to 50 dimensions lowered the accuracy to 79.38%. This
drop suggests that while PCA manages to keep essential information at 50 dimensions, some of
the finer details necessary for precise classification are lost. However, this reduction in accuracy
may be acceptable if computational efficiency or storage constraints are a priority, as a smaller
embedding space can significantly reduce processing time.

When we further reduced the embeddings to 10 dimensions, the accuracy dropped more
noticeably to 67.58%. This larger decrease indicates that 10 dimensions are likely insufficient to
capture the complex structures and distinctive features needed to differentiate CIFAR-10 classes
effectively. The t-SNE visualization of the 10-dimensional PCA embeddings also shows that the
classes start to blend together, suggesting that the model has difficulty distinguishing between
similar classes. The drop in accuracy at this level highlights the trade-off between
dimensionality and the model's ability to represent intricate details within the data. Essentially,
reducing to 10 dimensions sacrifices too much information for this classification task, resulting
in lower performance.

Reducing dimensionality can help mitigate overfitting, improve computational speed, and make
the model more efficient. However, in this case, dimensionality reduction leads to a
performance drop, especially when going down to 10 dimensions. The 50-dimensional PCA
embeddings offer a balanced trade-off, retaining much of the classification power while
reducing the overall complexity of the data. Thus, for applications where a small reduction in
accuracy is acceptable, using 50 dimensions might be a good compromise.

NLP Task [45 Points] 📝

Q7.1: Loading and Splitting AGS News Dataset (5pts)
dataset = load_dataset('ag_news')

train_dataset = dataset['train']
test_dataset = dataset['test']

subset_size_train = 20000
subset_size_test = 5000

train_dataset = train_dataset.select(range(subset_size_train))
test_dataset = test_dataset.select(range(subset_size_test))

Q7.2: Feature Extraction with DistilBERT (5pts)

BATCH_SIZE = 32
MAX_LENGTH = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f'Using device: {device}')

Using device: cpu

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-
uncased')

/usr/local/lib/python3.10/dist-packages/transformers/
tokenization_utils_base.py:1601: FutureWarning:
`clean_up_tokenization_spaces` was not set. It will be set to `True`
by default. This behavior will be depracted in transformers v4.45, and
will be then set to `False` by default. For more details check this
issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
def tokenize_function(examples):
return tokenizer(examples['text'], padding='max_length',
truncation=True, max_length=MAX_LENGTH)

tokenized_train = train_dataset.map(tokenize_function, batched=True,

batch_size=BATCH_SIZE)
tokenized_test = test_dataset.map(tokenize_function, batched=True,
batch_size=BATCH_SIZE)

{"model_id":"2db5233f17c9426e9f1bc3771a975065","version_major":2,"vers
ion_minor":0}

{"model_id":"90780f8ca0d44775bae6a6d2e145479f","version_major":2,"vers
ion_minor":0}

tokenized_train.set_format('torch', columns=['input_ids',
'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids',
'attention_mask', 'label'])

train_loader = DataLoader(tokenized_train, batch_size=BATCH_SIZE,

shuffle=False)
test_loader = DataLoader(tokenized_test, batch_size=BATCH_SIZE,
shuffle=False)

distilbert = DistilBertModel.from_pretrained('distilbert-base-
uncased')
distilbert = distilbert.to(device)
distilbert.eval()

DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768,
bias=True)
(k_lin): Linear(in_features=768, out_features=768,
bias=True)
(v_lin): Linear(in_features=768, out_features=768,
bias=True)
(out_lin): Linear(in_features=768, out_features=768,
bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12,
elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072,
bias=True)
(lin2): Linear(in_features=3072, out_features=768,
bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12,
elementwise_affine=True)
)
)
)
)

def extract_embeddings(dataloader):
embeddings = []
labels = []
with torch.no_grad():
for batch in tqdm(dataloader, desc="Extracting embeddings"):
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
outputs = distilbert(input_ids=input_ids,
attention_mask=attention_mask)
# Use the [CLS] token representation (first token)
cls_embeddings = outputs.last_hidden_state[:,0,:] #
Shape: (batch_size, hidden_dim)
embeddings.append(cls_embeddings.cpu())
labels.append(batch['label'])
embeddings = torch.cat(embeddings)
labels = torch.cat(labels)
return embeddings, labels

train_embeddings, train_labels = extract_embeddings(train_loader)

test_embeddings, test_labels = extract_embeddings(test_loader)

{"model_id":"b32d65f73feb4d0fb3370828d89c8fdc","version_major":2,"vers
ion_minor":0}

{"model_id":"93d372e5264a452cbaa4cf1738f0d113","version_major":2,"vers
ion_minor":0}

print(f'Train Embeddings Shape: {train_embeddings.shape}')

print(f'Test Embeddings Shape: {test_embeddings.shape}')

Train Embeddings Shape: torch.Size([20000, 768])

Test Embeddings Shape: torch.Size([5000, 768])
The AG News dataset, where each article is tokenized using Hugging Face’s DistilBERT
tokenizer, with a maximum token length of 128. The pre-trained DistilBERT model is then used
to extract embeddings from the tokenized text by removing the final classification layer and
using the [CLS] token representation from the last hidden state. This process generates a
condensed feature representation (embedding) for each news article, which will be used for
downstream classification. The code loads data into PyTorch DataLoaders, processes it in
batches on the available device (GPU if available), and stores the resulting embeddings and
labels for training and testing.

Q7.3: t-SNE Visualization of DistilBERT Embeddings (5pts)

sample_size = 5000
if len(test_embeddings) > sample_size:
indices = np.random.choice(len(test_embeddings), sample_size,
replace=False)
sample_embeddings = test_embeddings[indices].numpy()
sample_labels = test_labels[indices].numpy()
else:
sample_embeddings = test_embeddings.numpy()
sample_labels = test_labels.numpy()

tsne = TSNE(n_components=2, random_state=42, perplexity=30,

n_iter=1000)
tsne_results = tsne.fit_transform(sample_embeddings)

/usr/local/lib/python3.10/dist-packages/sklearn/manifold/
_t_sne.py:1162: FutureWarning: 'n_iter' was renamed to 'max_iter' in
version 1.5 and will be removed in 1.7.
warnings.warn(

plt.figure(figsize=(10, 8))
palette = sns.color_palette("hls", 4)
sns.scatterplot(
x=tsne_results[:,0], y=tsne_results[:,1],
hue=sample_labels,
palette=palette,
legend='full',s=40,
alpha=0.6
)
plt.title('t-SNE Visualization of DistilBERT Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(title='Classes', labels=['World', 'Sports', 'Business',
'Sci/Tech'], loc='best')
plt.show()
The t-SNE visualization of the DistilBERT embeddings reveals distinct clusters for the four AG
News classes: "World," "Sports," "Business," and "Sci/Tech." Each class has formed relatively
cohesive groups in the 2D space, indicating that the DistilBERT model has captured meaningful
representations that effectively separate these topics. The "Business" and "Sci/Tech" classes for
example appear well-separated which makes sense since articles in these categories often
contain specialized language and unique terms related to their fields. The "Sports" cluster is
distinct suggesting that the embeddings capture the unique vocabulary and context typically
associated with sports news.

There is some overlap between certain clusters particularly between "World" and "Business,"
which could be due to the shared economic and global context in news articles that span these
topics. For instance articles about international business or economic policies might contain
elements common to both categories, making them harder to distinguish based purely on
embeddings. Despite this minor overlap, the general clustering is strong and it shows that the
DistilBERT embeddings can largely capture the differences among categories in the AG News
dataset.

The t-SNE plot suggests that DistilBERT's embeddings are effective in capturing and
distinguishing class structures within news categories. This clustering is a positive indication
that the model has learned meaningful semantic features from the dataset allowing it to group
articles with similar themes together in the embedding space. This ability to separate topics with
minimal overlap can be beneficial for downstream tasks such as classification where clear
separations can improve accuracy.

Q7.4: Nearest neighbor classification (10pts)

Part A - Confusion Matrix and Evaluation (5pts)
num_classes = 4
centroids = []
for c in range(num_classes):
class_embeddings = train_embeddings[train_labels == c]
centroid = class_embeddings.mean(dim=0)
centroids.append(centroid)
centroids = torch.stack(centroids)

def nearest_neighbor_classification(test_embeddings, centroids,

_, predictions = torch.min(distances, dim=1)

return predictions.numpy()

preds_euclidean = nearest_neighbor_classification(test_embeddings,
centroids, metric='euclidean')
preds_cosine = nearest_neighbor_classification(test_embeddings,
centroids, metric='cosine')

conf_matrix_euclidean = confusion_matrix(test_labels, preds_euclidean)

conf_matrix_cosine = confusion_matrix(test_labels, preds_cosine)

accuracy_euclidean = accuracy_score(test_labels, preds_euclidean)

accuracy_cosine = accuracy_score(test_labels, preds_cosine)

def plot_confusion_matrix(cm, classes, title):

plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=classes, yticklabels=classes)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title(title)
plt.show()
class_names = ['World', 'Sports', 'Business', 'Sci/Tech']

print("Euclidean Distance - Accuracy: {:.2f}

%".format(accuracy_euclidean * 100))
plot_confusion_matrix(conf_matrix_euclidean, class_names, "Confusion
Matrix - Euclidean Distance")

Euclidean Distance - Accuracy: 84.02%

print("Cosine Similarity - Accuracy: {:.2f}%".format(accuracy_cosine *

100))
plot_confusion_matrix(conf_matrix_cosine, class_names, "Confusion
Matrix - Cosine Similarity")

Cosine Similarity - Accuracy: 84.12%

Part B - Outlier Detection and Analysis (5pts)
def compute_distance(test_embeddings, centroids, metric='euclidean'):
if metric == 'euclidean':
distances = torch.cdist(test_embeddings, centroids, p=2)
elif metric == 'cosine':
test_norm = torch.nn.functional.normalize(test_embeddings,
p=2, dim=1)
centroids_norm = torch.nn.functional.normalize(centroids, p=2,
dim=1)
distances = 1 - torch.matmul(test_norm, centroids_norm.T)
return distances

metric = 'cosine'
distances = compute_distance(test_embeddings, centroids,
metric=metric)
min_distances, nearest_centroid = torch.min(distances, dim=1)

for idx in outliers:

text = test_dataset['text'][idx]
label = test_labels[idx].item()
predicted_label = preds_euclidean[idx] if metric == 'euclidean'
else preds_cosine[idx]

print(f"Index: {idx}")
print(f"True Label: {class_names[label]} | Predicted Label:
{class_names[predicted_label]}")
print(f"Text: {text}")
print('-' * 80)

Index: 973
True Label: World | Predicted Label: World
Text: Hurricane Frances Nears NE Caribbean (AP) AP - Hurricane Frances
strengthened as it churned near islands of the northeastern Caribbean
with ferocious winds expected to graze Puerto Rico on Tuesday before
the storm plows on toward the Bahamas and the southeastern United
States.
----------------------------------------------------------------------
----------
Index: 2411
True Label: Sports | Predicted Label: Sports
Text: Transactions BASEBALL Boston (AL): Activated DH Ellis Burks from
the 60-day disabled list; released P Phil Seibel. Milwaukee (NL): Sent
INF Matt Erickson outright to Indianapolis (IL).
----------------------------------------------------------------------
----------
Index: 3797
True Label: Business | Predicted Label: World
Text: Transport strike hits Netherlands Public transport grinds to a
halt in the Netherlands as workers strike against the government's
planned welfare cuts.
----------------------------------------------------------------------
----------
Index: 1881
True Label: Sci/Tech | Predicted Label: World
Text: Hurricane Ivan Slams U.S. Gulf Coast Hurricane Ivan roared into
the Gulf Coast near Mobile, Alabama, early this morning with peak
winds exceeding 125 miles an hour (200 kilometers an hour).
----------------------------------------------------------------------
----------

The nearest neighbor classification using centroid-based approaches with both Euclidean and
cosine similarity metrics achieved comparable accuracy on the AG News dataset, with cosine
similarity slightly outperforming Euclidean (84.12% vs. 84.02%). The confusion matrices for
both metrics show that the model can generally classify most articles accurately into their
respective categories: World, Sports, Business, and Sci/Tech. However, there are still some
challenging cases where articles were misclassified, especially between the World and Business
categories, likely due to overlapping topics such as global economic issues. Additionally, some
misclassifications occurred between Sci/Tech and World classes, often when articles discussed
natural disasters or global technology developments, which can be difficult to categorize solely
based on high-level embeddings.

For the outlier detection, we identified the test samples that were furthest from their class
centroids, indicating articles that did not closely match the typical features of their assigned
categories. These outliers included cases like a Business article classified as World due to its
focus on public transport strikes, and a Sci/Tech article classified as World due to its report on a
hurricane, which might contain scientific terms but primarily concerns a global event. These
examples highlight that while the centroid-based nearest neighbor approach performs well in
general, it struggles with nuanced articles that overlap multiple categories or lack distinct topic-
specific terms. This suggests that while nearest neighbor classification is effective for clear-cut
cases, additional context or a more sophisticated model might be needed to handle ambiguous
or multi-topic articles more accurately.

Q7.5: Building a Text Classification Model (10pts)

Part A - Confusion Matrix and Evaluation (5pts)
sgd_classifier = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-
3, random_state=42, n_jobs=-1)

sgd_classifier.fit(train_embeddings.numpy(), train_labels.numpy())
predictions = sgd_classifier.predict(test_embeddings.numpy())

accuracy = accuracy_score(test_labels.numpy(), predictions)

conf_matrix = confusion_matrix(test_labels.numpy(), predictions)

print(f"Accuracy of the SGD Classifier: {accuracy * 100:.2f}%")

print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_report(test_labels.numpy(), predictions))

Accuracy of the SGD Classifier: 89.84%

Confusion Matrix:
[[1166 28 65 27]
[ 18 1240 8 4]
[ 73 3 991 137]
[ 53 6 86 1095]]
Classification Report:
precision recall f1-score support

0 0.89 0.91 0.90 1286

1 0.97 0.98 0.97 1270
2 0.86 0.82 0.84 1204
3 0.87 0.88 0.87 1240

accuracy 0.90 5000

macro avg 0.90 0.90 0.90 5000
weighted avg 0.90 0.90 0.90 5000

def plot_confusion_matrix(cm, class_names):

plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, cmap='Blues',
xticklabels=class_names, yticklabels=class_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

# Assuming class_names are defined as per your classes

class_names = ['World', 'Sports', 'Business', 'Sci/Tech'] # Adjust
based on your actual classes
plot_confusion_matrix(conf_matrix, class_names)
Part B - Comparison Between Your Model and Nearest Neighbor
Classification (5pts)
print(f"Nearest Neighbor (Euclidean) Accuracy: {accuracy_euclidean *
100:.2f}%")
print(f"Nearest Neighbor (Cosine) Accuracy: {accuracy_cosine *
100:.2f}%")
print(f"SGD Model Accuracy: {accuracy * 100:.2f}%")

Nearest Neighbor (Euclidean) Accuracy: 84.02%

Nearest Neighbor (Cosine) Accuracy: 84.12%
SGD Model Accuracy: 89.84%

The SGD classifier achieved a significantly higher accuracy (89.84%) compared to the nearest
neighbor methods using centroids, with Euclidean and cosine distances yielding 84.02% and
84.12% accuracy, respectively. This improvement indicates that the SGD model, which optimizes
based on labeled training data, can create more effective decision boundaries than the simple
centroid-based approach. The confusion matrix for the SGD classifier shows that it performs
especially well on the "Sports" category with minimal misclassifications, likely due to the distinct
vocabulary and context typical of sports articles. However, it still struggles slightly with classes
that have overlapping topics, such as "Business" and "World," which sometimes share economic
or global themes.

Comparing the errors made by both approaches, we observe that the nearest neighbor methods
often misclassify articles that lie near the boundary of two classes, like "Business" and "World."
The centroid-based approach relies solely on proximity to the class mean, which doesn't allow it
to learn nuanced decision boundaries. On the other hand, the SGD classifier, by learning directly
from labeled data, is better at handling complex class separations, resulting in fewer
misclassifications. Overall, while the nearest neighbor classifiers are straightforward and
interpretable, the SGD model outperforms them by a significant margin, making it the better
choice for this text classification task.

Q7.6: Dimension Reduction (10pts)

X_train_pca10, X_test_pca10, pca10 =

reduce_dimensions(train_embeddings, test_embeddings, n_components=10)
X_train_pca50, X_test_pca50, pca50 =
reduce_dimensions(train_embeddings, test_embeddings, n_components=50)

def evaluate_model(X_tr, y_tr, X_te, y_te):

model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3,
n_jobs=-1, random_state=42)
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
acc = accuracy_score(y_te, preds)
return acc, preds

acc_full, preds_full = evaluate_model(train_embeddings, train_labels,

test_embeddings, test_labels)
acc_pca50, preds_pca50 = evaluate_model(X_train_pca50, train_labels,
X_test_pca50, test_labels)
acc_pca10, preds_pca10 = evaluate_model(X_train_pca10, train_labels,
X_test_pca10, test_labels)

print(f'Accuracy (Full Embeddings): {acc_full * 100:.2f}%')

print(f'Accuracy (PCA 50 Dimensions): {acc_pca50 * 100:.2f}%')
print(f'Accuracy (PCA 10 Dimensions): {acc_pca10 * 100:.2f}%')

Accuracy (Full Embeddings): 89.88%

Accuracy (PCA 50 Dimensions): 88.06%
Accuracy (PCA 10 Dimensions): 84.72%

print("Applying t-SNE on PCA-reduced (10D) embeddings...")

tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30,
n_iter=1000)
tsne_results_pca10 = tsne_pca.fit_transform(X_test_pca10)

plt.figure(figsize=(10, 8))
sns.scatterplot(
x=tsne_results_pca10[:,0], y=tsne_results_pca10[:,1],
hue=test_labels,
palette=palette,
legend='full',
alpha=0.6
)
plt.title('t-SNE on PCA-Reduced (10D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(title='Classes', labels=class_names, loc='best')
plt.show()

Applying t-SNE on PCA-reduced (10D) embeddings...

/usr/local/lib/python3.10/dist-packages/sklearn/manifold/
_t_sne.py:1162: FutureWarning: 'n_iter' was renamed to 'max_iter' in
version 1.5 and will be removed in 1.7.
warnings.warn(
print("Applying t-SNE on PCA-reduced (50D) embeddings...")
tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30,
n_iter=1000)
tsne_results_pca50 = tsne_pca.fit_transform(X_test_pca50)

plt.figure(figsize=(10, 8))
sns.scatterplot(
x=tsne_results_pca50[:,0], y=tsne_results_pca50[:,1],
hue=test_labels,
palette=palette,
legend='full',
alpha=0.6
)
plt.title('t-SNE on PCA-Reduced (50D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(title='Classes', labels=class_names, loc='best')
plt.show()
Applying t-SNE on PCA-reduced (50D) embeddings...

/usr/local/lib/python3.10/dist-packages/sklearn/manifold/
_t_sne.py:1162: FutureWarning: 'n_iter' was renamed to 'max_iter' in
version 1.5 and will be removed in 1.7.
warnings.warn(

The results from reducing the DistilBERT embeddings to 10 and 50 dimensions using PCA show
how dimensionality reduction affects classification performance and data structure visualization.
When comparing classification accuracy the full embeddings performed best with an accuracy of
89.84%. Reducing to 50 dimensions resulted in a slight drop in accuracy to 88.20% while
reducing to 10 dimensions led to a more significant drop with accuracy falling to 83.57%. This
indicates that while reducing to 50 dimensions retains most of the useful information further
reduction to 10 dimensions loses important features necessary for accurate classification.

The t-SNE visualizations of the reduced embeddings also reveal interesting patterns. In the 50-
dimensional reduction, the four AG News classes remain relatively well-separated indicating
that the primary features that distinguish each class are preserved.In the 10-dimensional
visualization, the class clusters begin to overlap more suggesting that the model has lost some
of its ability to differentiate between nuanced aspects of each class. This reduction in
dimensionality leads to poorer class separabilit, particularly between classes with overlapping
themes likeWorld and Business which often share contextual elements.

Reducing to 50 dimensions provides a good balance between maintaining performance and

reducing computational load with minimal loss in classification accuracy. further reduction to 10
dimensions sacrifices too much information, resulting in decreased model performance and less
distinct clusters in the t-SNE visualization. This experiment demonstrates that while
dimensionality reduction can be beneficial for efficiency it must be applied carefully to avoid
significant loss of performance, especially in tasks that rely on capturing detailed distinctions
between classes.

Q8: Letting AI Do Your Homework [10 Points]

🤖/👤
Part A
Evaluate the usefulness of your code-generation system on this assignment. What does it get
right and what does it fail on?

I Think Chatgpt was beneficial because of its ability to generate initial code structures
implementing standard functions and also for handling basic processes like setting up the
model, applying PCA, and for calculating accuracy or confusion matrices. It improved on the
building blocks, which allowed me to start quickly and avoid time-consuming redudant coding
tasks. Although it struggled with nuanced aspects like optimizing hyperparameters, modifying
preprocessing steps for specific datasets it fared way better when it came to more niche
requirements like adjusting transforms for better accuracy or even understanding subtle data
properties. I had to manually make adjustments, as the system provided only default options
without deeper optimizations. I used different value for SVD by experimenting on it and also
changed the transform which ai generator gave standard.

Part B
Did you discover any subtle mistakes when you read the resulting code? Do you get trapped into
spending more time debugging than you thought you would?

Answer: Yes, there were some subtle difficulties like unable to use Standard Scaler, and Chatgpt
supplying only the standard parameters which did not allow me to get deeper optimization.
Tuning the hyperparameters was also difficult as a different function was being called which was
hampering with my accuracy. I spent a lot of time with also trying to make my visualizations
more unique as only a specific color was used to denote all the existing clusters while I wanted it
to be more descriptive which required me to implement more colors.

Part C
Do you think using AI tools for simple tasks frees up your brainpower for more advanced
problem-solving, or does it reduce your overall understanding?
Answer:Using AI generators lies in the gray area of coding, as if we are tackling a problem which
we can easily logically solve but is just time consuming and too redundant to write as certain
functions can be, It can be a good idea to use AI generators, but when we are presented with an
advanced problem, using them can hamper and prove them as a hurdle in our logic building and
in our overall understanding of the code and its flow.

Submission Guideline:
• Submit everything through Google classroom. As mentioned above, you will need to
upload:
a. The Jupyter notebook all your work is in (.ipynb file), derived from the provided
template
b. PDF (export the notebook as a pdf file)
• These files should be named with the following format, where the italicized parts should
be replaced with the corresponding values:
a. cse519_hw3_lastname_firstname_sbuid.ipynb
b. cse519_hw3_lastname_firstname_sbuid.pdf

Your Submission will NOT BE GRADED if you don't follow the naming convention❗❗

May your datasets be balanced, your features well-engineered, and your p-values low! (˵ •̀ ᴗ •́ ˵ )
✧

Sentence Reconstruction: !pip Install Datasets
No ratings yet
Sentence Reconstruction: !pip Install Datasets
21 pages
Trash Detection
No ratings yet
Trash Detection
17 pages
HE172830 CPV Slot6
No ratings yet
HE172830 CPV Slot6
247 pages
PRBLM
No ratings yet
PRBLM
5 pages
Roop Unleashed 02.ipynb
No ratings yet
Roop Unleashed 02.ipynb
15 pages
Kaggle GPU Setup for ComfyUI
No ratings yet
Kaggle GPU Setup for ComfyUI
10 pages
Reactor Comfyui - Ipynb
No ratings yet
Reactor Comfyui - Ipynb
24 pages
Error
No ratings yet
Error
13 pages
Gcollabnotebook2 Ipynb
No ratings yet
Gcollabnotebook2 Ipynb
203 pages
RagApplication - Ipynb - Colab
No ratings yet
RagApplication - Ipynb - Colab
6 pages
Errors
No ratings yet
Errors
12 pages
LSTM Autoencoder
No ratings yet
LSTM Autoencoder
8 pages
For Cor Pc1 Lismasari - Ipynb - Colab
No ratings yet
For Cor Pc1 Lismasari - Ipynb - Colab
5 pages
AI8
No ratings yet
AI8
2 pages
Llava Data Prepare
No ratings yet
Llava Data Prepare
26 pages
For Cor Pc3 Lismasari - Ipynb - Colab
No ratings yet
For Cor Pc3 Lismasari - Ipynb - Colab
5 pages
For Cor Pc2 Lismasari - Ipynb - Colab
No ratings yet
For Cor Pc2 Lismasari - Ipynb - Colab
5 pages
Gradio Setup for Roop-Unleashed
No ratings yet
Gradio Setup for Roop-Unleashed
9 pages
Clonamos El Repositorio para Obtener Los Dataset: From Import
No ratings yet
Clonamos El Repositorio para Obtener Los Dataset: From Import
23 pages
Exp 11 NLI USING BERT
No ratings yet
Exp 11 NLI USING BERT
4 pages
Mol Genn
No ratings yet
Mol Genn
1,296 pages
3 2 Expand Klasifikasi FIX 750 IndoBERT MLP, CNN, LSTM, CNN + LSTM 16 June 2025
No ratings yet
3 2 Expand Klasifikasi FIX 750 IndoBERT MLP, CNN, LSTM, CNN + LSTM 16 June 2025
121 pages
Swin - Transformer - Ipynb - Colab
No ratings yet
Swin - Transformer - Ipynb - Colab
5 pages
Clickbait Classifier Modified
No ratings yet
Clickbait Classifier Modified
21 pages
Graph Vae Training - Log
No ratings yet
Graph Vae Training - Log
146 pages
PC3 - SPATIAL - LismaSari - Ipynb - Colab
No ratings yet
PC3 - SPATIAL - LismaSari - Ipynb - Colab
9 pages
Task VIII Quantum Vision Transformer
No ratings yet
Task VIII Quantum Vision Transformer
1 page
Real Time Transcription Service For Online Meetings Using Whisper Api
No ratings yet
Real Time Transcription Service For Online Meetings Using Whisper Api
16 pages
PC1 Lisma Sari - Ipynb - Colab
No ratings yet
PC1 Lisma Sari - Ipynb - Colab
9 pages
1e2RzvrZ 1SueJwXaVHXb0x25yZtvmI0d
No ratings yet
1e2RzvrZ 1SueJwXaVHXb0x25yZtvmI0d
5 pages
7.copy of Text To Image Generation With LLM With Hugging Face - Ipynb
No ratings yet
7.copy of Text To Image Generation With LLM With Hugging Face - Ipynb
1,156 pages
Pip Install Tensorflow Pandas Matplotlib Scikit Learn
No ratings yet
Pip Install Tensorflow Pandas Matplotlib Scikit Learn
19 pages
Yolo V8
No ratings yet
Yolo V8
16 pages
Model Training
No ratings yet
Model Training
8 pages
HPC Server
No ratings yet
HPC Server
6 pages
Kopia Notatnika PolishCoinDetector - Ipynb - Colab
No ratings yet
Kopia Notatnika PolishCoinDetector - Ipynb - Colab
3 pages
Facial Expression Recognition With PyTorch - Ipynb - Colab
No ratings yet
Facial Expression Recognition With PyTorch - Ipynb - Colab
6 pages
Design & Development of AI Agents
No ratings yet
Design & Development of AI Agents
17 pages
Carbon Credit Beta
No ratings yet
Carbon Credit Beta
8 pages
Clip
No ratings yet
Clip
8 pages
Data Science Package Setup
No ratings yet
Data Science Package Setup
28 pages
AI Text Generation Setup
No ratings yet
AI Text Generation Setup
5 pages
Log Docker
No ratings yet
Log Docker
4 pages
Major Project
No ratings yet
Major Project
144 pages
Wa0028.
No ratings yet
Wa0028.
5 pages
Python Dev: Install Requirements
No ratings yet
Python Dev: Install Requirements
3 pages
Caso 2 Lau
No ratings yet
Caso 2 Lau
27 pages
3D Convolutional Autoencoder
No ratings yet
3D Convolutional Autoencoder
14 pages
Sysinfo 2024 09 28 19 59
No ratings yet
Sysinfo 2024 09 28 19 59
12 pages
Python Deprecation Warnings Log
No ratings yet
Python Deprecation Warnings Log
237 pages
Requirements
No ratings yet
Requirements
1 page
A3 44 DL Object Localisation
No ratings yet
A3 44 DL Object Localisation
15 pages
Coloab RDP
No ratings yet
Coloab RDP
12 pages
8 L31 JDD V
No ratings yet
8 L31 JDD V
128 pages
Neural Network Ex 1
No ratings yet
Neural Network Ex 1
2 pages
Tensor Flow Programs
No ratings yet
Tensor Flow Programs
30 pages
Setup
No ratings yet
Setup
3 pages
How To Install Mask-Rcnn For Nvidia Gpu
No ratings yet
How To Install Mask-Rcnn For Nvidia Gpu
19 pages
English To Hindi Text Translation
No ratings yet
English To Hindi Text Translation
10 pages
Titan Industries
No ratings yet
Titan Industries
2 pages
Stress Interview Questions
100% (1)
Stress Interview Questions
9 pages
Grading Rubric For The Critique Paper
No ratings yet
Grading Rubric For The Critique Paper
2 pages
KCET 2020 Cut Off Engineering General Round 1
No ratings yet
KCET 2020 Cut Off Engineering General Round 1
34 pages
Ucs121 Module4
No ratings yet
Ucs121 Module4
11 pages
Isae 3000 PDF
No ratings yet
Isae 3000 PDF
20 pages
Science 7 1st Q Exam 2018
No ratings yet
Science 7 1st Q Exam 2018
5 pages
Autonomous 3rd and 4th Sem Scheme and Syllabus 2016 17
No ratings yet
Autonomous 3rd and 4th Sem Scheme and Syllabus 2016 17
597 pages
QM Sample
No ratings yet
QM Sample
13 pages
SMAK Philippe Van Snick Dynamic Project
No ratings yet
SMAK Philippe Van Snick Dynamic Project
1 page
Control Systems (1-135) PDF
No ratings yet
Control Systems (1-135) PDF
128 pages
SAP Profitability Analysis
67% (3)
SAP Profitability Analysis
54 pages
General Manager
No ratings yet
General Manager
3 pages
Data 96
0% (1)
Data 96
4 pages
9-12 Flat File Schema Developers Guide
No ratings yet
9-12 Flat File Schema Developers Guide
86 pages
Simulate Service & Arrival Times in Excel
No ratings yet
Simulate Service & Arrival Times in Excel
5 pages
Interlingua in Machine Translation
No ratings yet
Interlingua in Machine Translation
5 pages
Modding Guide for BOSS Users
No ratings yet
Modding Guide for BOSS Users
70 pages
Effective Communication Skills
100% (1)
Effective Communication Skills
78 pages
Case Study 8
40% (5)
Case Study 8
4 pages
Ibp2132 12
No ratings yet
Ibp2132 12
14 pages
Microscope Project
No ratings yet
Microscope Project
6 pages
Dsee 2015 TimeTable Final
No ratings yet
Dsee 2015 TimeTable Final
3 pages
Biology SOL
100% (2)
Biology SOL
188 pages
Chemical Vapour Deposition
No ratings yet
Chemical Vapour Deposition
49 pages
ZK Calendar Essentials
No ratings yet
ZK Calendar Essentials
28 pages
World Water Day 2010 Presentation On Orissa
No ratings yet
World Water Day 2010 Presentation On Orissa
21 pages
Chemsketch Rom
No ratings yet
Chemsketch Rom
24 pages
Agriculture S.B.a. 2
43% (7)
Agriculture S.B.a. 2
25 pages
Manual RUIDA
No ratings yet
Manual RUIDA
51 pages