Cse519 hw3
Drive.
Installs 📥
pip install datasets
Successfully installed aiohappyeyeballs-2.4.3 aiohttp-3.10.10
aiosignal-1.3.1 async-timeout-4.0.3 datasets-3.1.0 dill-0.3.8
frozenlist-1.5.0 fsspec-2024.9.0 multidict-6.1.0 multiprocess-0.70.16
propcache-0.2.0 xxhash-3.5.0 yarl-1.17.1
Imports 📂
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import DataLoader, random_split, Subset
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225))
])
train_dataset = torchvision.datasets.CIFAR10(root='./data',
                                             train=True,
                                             download=True,
                                             transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data',
                                            train=False,
                                            download=True,
                                            transform=transform)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to
./data/cifar-10-python.tar.gz
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load ImageNet-pretrained ResNet-18 and drop the final FC layer,
# leaving a global-average-pooled 512-d feature extractor.
resnet18 = models.resnet18(pretrained=True)
modules = list(resnet18.children())[:-1]
resnet18 = nn.Sequential(*modules)
resnet18 = resnet18.to(device)
resnet18.eval()
References:
https://debuggercafe.com/training-resnet18-from-scratch-using-pytorch/
https://discuss.pytorch.org/t/use-resnet18-as-feature-extractor/8267
def extract_embeddings(dataloader):
    embeddings = []
    labels = []
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs = inputs.to(device)
            temp = resnet18(inputs)
            temp = temp.view(temp.size(0), -1)  # flatten to (batch, 512)
            embeddings.append(temp.cpu())
            labels.append(targets)
    embeddings = torch.cat(embeddings)
    labels = torch.cat(labels)
    return embeddings, labels
I used a pre-trained ResNet-18 model to extract meaningful features from images. By removing the final classification layer, each image is processed through the remaining layers, producing a condensed feature representation that can be used for further analysis without performing classification.
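The t-SNE projection used by the plot below (`tsne_results`, `sample_labels`) is not in the captured cells; a minimal sketch, assuming scikit-learn's `TSNE` on a subsample of the embeddings. Random data stands in for the real 512-d ResNet-18 features so the snippet runs on its own.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for a subsample of extract_embeddings() output:
# 500 embeddings of dimension 512 with CIFAR-10 class ids.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 512)).astype("float32")
sample_labels = rng.integers(0, 10, size=500)

# Project to 2-D for plotting; perplexity must stay below the sample count.
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_results = tsne.fit_transform(embeddings)
print(tsne_results.shape)  # (500, 2)
```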
class_names = train_dataset.classes
plt.figure(figsize=(10, 8))
palette = sns.color_palette("tab10", len(class_names))
sns.scatterplot(
    x=tsne_results[:,0], y=tsne_results[:,1],
    hue=sample_labels,
    palette=palette,
    legend='full',
    alpha=0.6
)
plt.title('t-SNE Visualization of ResNet-18 Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
handles, _ = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=class_names, title="Classes", loc='best')
plt.show()
I find the t-SNE visualization of the ResNet-18 embeddings quite insightful in showing how well the model has captured the structure of the CIFAR-10 classes. Looking at the plot, I'm impressed by how distinct some of the clusters are, especially for classes like "airplane," "automobile," and "ship." These classes have well-defined boundaries, which suggests that ResNet-18 has effectively learned high-level features unique to these categories, even when the images vary in background and angle. It's fascinating that transfer learning from a model trained on ImageNet can still produce meaningful embeddings for a different dataset like CIFAR-10.
I do notice some overlap between classes, especially within the animal categories such as "cat," "dog," and "deer." This overlap is understandable, given that these animals share similar textures and colors, which can make them harder for the model to separate cleanly in a lower-dimensional space. For instance, both cats and dogs may have fur textures or domestic backgrounds that could confuse the model at times. But even within these mixed areas, there is still a general grouping of each animal class, which suggests that ResNet-18 has captured enough distinct features to give a rough separation. I think this partially successful clustering points to the model's strength in broad feature extraction, while also hinting at the limitations of a pre-trained model when applied to subtle distinctions in a new dataset.
The t-SNE plot highlights both the capabilities and some limitations of using ResNet-18 as a feature
extractor for CIFAR-10. The well-defined clusters for certain classes suggest that the model has
done an impressive job with categories that have more unique visual characteristics, like
vehicles. Meanwhile, the overlaps among similar classes indicate that a model fine-tuned
specifically for CIFAR-10 might perform even better.
preds_euclidean = nearest_neighbor_classification(test_embeddings, centroids, metric='euclidean')
preds_cosine = nearest_neighbor_classification(test_embeddings, centroids, metric='cosine')
class_names = train_dataset.classes
outliers = []
for c in range(num_classes):
    class_indices = (test_labels == c)
    class_distances = min_distances[class_indices]
    max_dist_idx = class_distances.argmax().item()
    actual_idx = torch.nonzero(class_indices).squeeze().numpy()[max_dist_idx]
    outliers.append(actual_idx)
--------------------------------------------------
Index: 808, Distance: 21.2763
--------------------------------------------------
The nearest neighbor classification using centroids for each class in the embedding space
showed reasonable accuracy, particularly for classes with distinct features, like vehicles and
certain animals. Using both Euclidean and cosine distance metrics, we observed that Euclidean
distance performed well on clear-cut classes, while cosine similarity better handled overlapping
features, as in different animal categories. Outliers identified in each class—images furthest
from their respective centroids—often represented challenging cases due to unusual lighting,
angles, or backgrounds. For example, an "airplane" in silhouette was misclassified as a "ship,"
likely due to its blurred outline and color scheme, while an "automobile" image was mistaken for
a "truck," perhaps because of its close-up angle. These misclassifications highlight that while
the centroid approach captures general class characteristics, it lacks the nuanced discrimination
needed for complex cases with subtle visual similarities across classes. This suggests that while
centroid-based nearest neighbor classification is effective, it could benefit from additional
refinement for improved accuracy in borderline cases.
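The helpers used above (`compute_distance`, `nearest_neighbor_classification`) and the centroid construction are not in the captured cells; a minimal sketch, under the assumption that each centroid is the per-class mean of the training embeddings and that cosine distance is 1 minus cosine similarity.

```python
import torch
import torch.nn.functional as F

def compute_centroids(embeddings, labels, num_classes):
    # One centroid per class: the mean embedding of that class.
    return torch.stack([embeddings[labels == c].mean(dim=0)
                        for c in range(num_classes)])

def compute_distance(embeddings, centroids, metric='euclidean'):
    if metric == 'euclidean':
        return torch.cdist(embeddings, centroids)
    # Cosine distance: 1 - cosine similarity, so smaller still means closer.
    e = F.normalize(embeddings, dim=1)
    c = F.normalize(centroids, dim=1)
    return 1.0 - e @ c.T

def nearest_neighbor_classification(embeddings, centroids, metric='euclidean'):
    return compute_distance(embeddings, centroids, metric=metric).argmin(dim=1)

# Toy check with two obvious clusters.
emb = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
lab = torch.tensor([0, 0, 1, 1])
cents = compute_centroids(emb, lab, 2)
print(nearest_neighbor_classification(emb, cents).tolist())  # [0, 0, 1, 1]
```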
metric = 'cosine'
distances = compute_distance(test_embeddings, centroids, metric=metric)
min_distances, nearest_centroid = torch.min(distances, dim=1)
outliers = []
for c in range(num_classes):
    class_indices = (test_labels == c)
    class_distances = min_distances[class_indices]
    max_dist_idx = class_distances.argmax().item()
    actual_idx = torch.nonzero(class_indices).squeeze().numpy()[max_dist_idx]
    outliers.append(actual_idx)
With the nearest neighbor classification using cosine similarity, I found that this metric offered
some subtle improvements over Euclidean distance, particularly for images where shape and
structure played a key role in distinguishing between classes. For example, cosine similarity
seemed to handle certain "frog" images better, likely because it focuses on the direction of
features rather than their magnitude. This focus on structural alignment rather than overall
intensity can be advantageous for classes where variations in lighting or color intensity might
otherwise confuse the model. However, some challenges persisted, especially in differentiating
between classes with similar backgrounds or ambiguous features, like "airplane" and "ship"
images that share similar sky or water contexts.
Despite these improvements, cosine similarity still struggled with certain outliers, especially in
low-light or blurred images. For instance, some "deer" images were still mistakenly classified as
"birds," and a few "ship" images ended up classified as "airplanes," due to shared background
characteristics that were difficult to distinguish using just cosine similarity. I found that, while
cosine similarity is helpful in reducing errors for certain structurally similar classes, it doesn’t
entirely resolve the issue of ambiguous or low-quality images. Both distance metrics faced
limitations with these tough cases, which highlights the need for a more refined or complex
classification approach, possibly involving additional context-based features or an enhanced
feature extraction process.
sgd_clf.fit(X_train, y_train)
preds_sgd = sgd_clf.predict(X_test)
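The setup of `sgd_clf` is not in the captured cells; presumably it is scikit-learn's `SGDClassifier` trained directly on the embeddings. A sketch on toy data (two well-separated Gaussian clusters standing in for the real embedding matrices):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the train embeddings and labels.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 512)),
                     rng.normal(3.0, 1.0, (100, 512))])
y_train = np.array([0] * 100 + [1] * 100)

# Linear classifier trained by stochastic gradient descent.
sgd_clf = SGDClassifier(max_iter=1000, random_state=42)
sgd_clf.fit(X_train, y_train)
print((sgd_clf.predict(X_train) == y_train).mean())
```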
The classification model built using the top-level embeddings, specifically with an SGD
classifier, achieved an accuracy of 84.67%, which outperformed both nearest neighbor
approaches—Euclidean (74.83%) and cosine similarity (75.47%). This higher accuracy suggests
that the model-based approach can capture more nuanced relationships in the data than the
nearest neighbor methods. By training directly on the embeddings, the SGD model was able to
create decision boundaries that better separate the classes, even when the features are complex
or overlap. The improved accuracy indicates that this model is more effective in recognizing
subtle distinctions between classes, which nearest neighbor approaches might overlook due to
their reliance on centroid distance alone.
In terms of mistakes, the nearest neighbor classifiers often struggled with classes that had
overlapping features or similar backgrounds, like distinguishing between "automobiles" and
"trucks" or between certain animals. The SGD model, however, made fewer errors in these
areas, likely due to its ability to learn specific patterns and optimize based on labeled training
data rather than relying solely on distances to centroids. This approach gave it an edge in dealing
with challenging or ambiguous images. Overall, while the nearest neighbor classifiers provide a
simple and interpretable method, the model-based approach using SGD proved more robust
and accurate, making it a better choice for this classification task.
The confusion matrix and classification report of the SGD classifier provide detailed insights into
the performance of the model across each CIFAR-10 class. With an overall accuracy of 84.67%,
the model shows strong performance, particularly for classes like "automobile," "truck," and
"ship," which all achieved high precision and recall scores above 90%. This indicates that the
classifier is well-suited for recognizing these classes, likely because they have distinct features
that make them easier to differentiate from others in the dataset.
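The confusion matrix and classification report discussed above can be produced with scikit-learn's metrics; toy arrays stand in here for the true test labels and the SGD predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Stand-ins for test_labels and preds_sgd.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred, digits=4))
```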
tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
tsne_results_pca50 = tsne_pca.fit_transform(X_test_pca50)
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=tsne_results_pca50[:,0], y=tsne_results_pca50[:,1],
    hue=y_test,
    palette=sns.color_palette("tab10", len(class_names)),
    legend='full',
    alpha=0.6
)
plt.title('t-SNE on PCA-Reduced (50D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
handles, _ = plt.gca().get_legend_handles_labels()
plt.legend(handles=handles, labels=class_names, title="Classes", loc='best')
plt.show()
The experiment with dimensionality reduction using PCA shows that reducing the embeddings
to 50 dimensions results in some loss in classification accuracy, but it still retains a fair amount
of the original performance. With the full-dimensional embeddings, the SGD classifier achieved
an accuracy of 84.67%, while reducing to 50 dimensions lowered the accuracy to 79.38%. This
drop suggests that while PCA manages to keep essential information at 50 dimensions, some of
the finer details necessary for precise classification are lost. However, this reduction in accuracy
may be acceptable if computational efficiency or storage constraints are a priority, as a smaller
embedding space can significantly reduce processing time.
When we further reduced the embeddings to 10 dimensions, the accuracy dropped more
noticeably to 67.58%. This larger decrease indicates that 10 dimensions are likely insufficient to
capture the complex structures and distinctive features needed to differentiate CIFAR-10 classes
effectively. The t-SNE visualization of the 10-dimensional PCA embeddings also shows that the
classes start to blend together, suggesting that the model has difficulty distinguishing between
similar classes. The drop in accuracy at this level highlights the trade-off between
dimensionality and the model's ability to represent intricate details within the data. Essentially,
reducing to 10 dimensions sacrifices too much information for this classification task, resulting
in lower performance.
Reducing dimensionality can help mitigate overfitting, improve computational speed, and make
the model more efficient. However, in this case, dimensionality reduction leads to a
performance drop, especially when going down to 10 dimensions. The 50-dimensional PCA
embeddings offer a balanced trade-off, retaining much of the classification power while
reducing the overall complexity of the data. Thus, for applications where a small reduction in
accuracy is acceptable, using 50 dimensions might be a good compromise.
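The PCA reductions compared above (`X_test_pca50` and the 10-d variant) are not in the captured cells; a sketch assuming scikit-learn's `PCA` fitted on the training embeddings, with random data standing in for the real 512-d features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for the train/test embedding matrices.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 512))
X_test = rng.normal(size=(200, 512))

# Fit PCA on the training set only, then project both splits.
pca50 = PCA(n_components=50).fit(X_train)
X_test_pca50 = pca50.transform(X_test)
pca10 = PCA(n_components=10).fit(X_train)
X_test_pca10 = pca10.transform(X_test)
print(X_test_pca50.shape, X_test_pca10.shape)  # (200, 50) (200, 10)
```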
train_dataset = dataset['train']
test_dataset = dataset['test']
subset_size_train = 20000
subset_size_test = 5000
train_dataset = train_dataset.select(range(subset_size_train))
test_dataset = test_dataset.select(range(subset_size_test))
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length',
                     truncation=True, max_length=MAX_LENGTH)
tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')
distilbert = distilbert.to(device)
distilbert.eval()
DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
          (activation): GELUActivation()
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
    )
  )
)
def extract_embeddings(dataloader):
    embeddings = []
    labels = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Extracting embeddings"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = distilbert(input_ids=input_ids, attention_mask=attention_mask)
            # Use the [CLS] token representation (first token)
            cls_embeddings = outputs.last_hidden_state[:, 0, :]  # (batch_size, hidden_dim)
            embeddings.append(cls_embeddings.cpu())
            labels.append(batch['label'])
    embeddings = torch.cat(embeddings)
    labels = torch.cat(labels)
    return embeddings, labels
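The [CLS]-pooling step above in isolation: given a batch of hidden states of shape (batch, seq_len, hidden), keep the first token's vector. A random tensor stands in for the DistilBERT output here.

```python
import torch

# Stand-in for outputs.last_hidden_state: batch of 4 sequences,
# 128 tokens each, 768-d hidden states.
last_hidden_state = torch.randn(4, 128, 768)

# Keep only the first ([CLS]) token per sequence.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # torch.Size([4, 768])
```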
plt.figure(figsize=(10, 8))
palette = sns.color_palette("hls", 4)
sns.scatterplot(
    x=tsne_results[:,0], y=tsne_results[:,1],
    hue=sample_labels,
    palette=palette,
    legend='full', s=40,
    alpha=0.6
)
plt.title('t-SNE Visualization of DistilBERT Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(title='Classes', labels=['World', 'Sports', 'Business', 'Sci/Tech'], loc='best')
plt.show()
The t-SNE visualization of the DistilBERT embeddings reveals distinct clusters for the four AG News classes: "World," "Sports," "Business," and "Sci/Tech." Each class forms a relatively cohesive group in the 2D space, indicating that the DistilBERT model has captured meaningful representations that effectively separate these topics. The "Business" and "Sci/Tech" classes, for example, appear well-separated, which makes sense since articles in these categories often contain specialized language and unique terms related to their fields. The "Sports" cluster is also distinct, suggesting that the embeddings capture the vocabulary and context typically associated with sports news.
There is some overlap between certain clusters, particularly between "World" and "Business," which could be due to the shared economic and global context of news articles that span these topics. For instance, articles about international business or economic policies might contain elements common to both categories, making them harder to distinguish based purely on embeddings. Despite this minor overlap, the general clustering is strong, and it shows that the DistilBERT embeddings largely capture the differences among categories in the AG News dataset.
The t-SNE plot suggests that DistilBERT's embeddings are effective at capturing and distinguishing class structure within news categories. This clustering is a positive indication that the model has learned meaningful semantic features from the dataset, allowing it to group articles with similar themes together in the embedding space. The ability to separate topics with minimal overlap is beneficial for downstream tasks such as classification, where clear separations can improve accuracy.
preds_euclidean = nearest_neighbor_classification(test_embeddings, centroids, metric='euclidean')
preds_cosine = nearest_neighbor_classification(test_embeddings, centroids, metric='cosine')
metric = 'cosine'
distances = compute_distance(test_embeddings, centroids, metric=metric)
min_distances, nearest_centroid = torch.min(distances, dim=1)
outliers = []
for c in range(num_classes):
    class_indices = (test_labels == c)
    class_distances = min_distances[class_indices]
    max_dist_idx = class_distances.argmax().item()
    actual_idx = torch.nonzero(class_indices).squeeze().numpy()[max_dist_idx]
    outliers.append(actual_idx)
print(f"Index: {idx}")
print(f"True Label: {class_names[label]} | Predicted Label: {class_names[predicted_label]}")
print(f"Text: {text}")
print('-' * 80)
Index: 973
True Label: World | Predicted Label: World
Text: Hurricane Frances Nears NE Caribbean (AP) AP - Hurricane Frances
strengthened as it churned near islands of the northeastern Caribbean
with ferocious winds expected to graze Puerto Rico on Tuesday before
the storm plows on toward the Bahamas and the southeastern United
States.
--------------------------------------------------------------------------------
Index: 2411
True Label: Sports | Predicted Label: Sports
Text: Transactions BASEBALL Boston (AL): Activated DH Ellis Burks from
the 60-day disabled list; released P Phil Seibel. Milwaukee (NL): Sent
INF Matt Erickson outright to Indianapolis (IL).
--------------------------------------------------------------------------------
Index: 3797
True Label: Business | Predicted Label: World
Text: Transport strike hits Netherlands Public transport grinds to a
halt in the Netherlands as workers strike against the government's
planned welfare cuts.
--------------------------------------------------------------------------------
Index: 1881
True Label: Sci/Tech | Predicted Label: World
Text: Hurricane Ivan Slams U.S. Gulf Coast Hurricane Ivan roared into
the Gulf Coast near Mobile, Alabama, early this morning with peak
winds exceeding 125 miles an hour (200 kilometers an hour).
--------------------------------------------------------------------------------
The nearest neighbor classification using centroid-based approaches with both Euclidean and
cosine similarity metrics achieved comparable accuracy on the AG News dataset, with cosine
similarity slightly outperforming Euclidean (84.12% vs. 84.02%). The confusion matrices for
both metrics show that the model can generally classify most articles accurately into their
respective categories: World, Sports, Business, and Sci/Tech. However, there are still some
challenging cases where articles were misclassified, especially between the World and Business
categories, likely due to overlapping topics such as global economic issues. Additionally, some
misclassifications occurred between Sci/Tech and World classes, often when articles discussed
natural disasters or global technology developments, which can be difficult to categorize solely
based on high-level embeddings.
For the outlier detection, we identified the test samples that were furthest from their class
centroids, indicating articles that did not closely match the typical features of their assigned
categories. These outliers included cases like a Business article classified as World due to its
focus on public transport strikes, and a Sci/Tech article classified as World due to its report on a
hurricane, which might contain scientific terms but primarily concerns a global event. These
examples highlight that while the centroid-based nearest neighbor approach performs well in
general, it struggles with nuanced articles that overlap multiple categories or lack distinct topic-
specific terms. This suggests that while nearest neighbor classification is effective for clear-cut
cases, additional context or a more sophisticated model might be needed to handle ambiguous
or multi-topic articles more accurately.
sgd_classifier.fit(train_embeddings.numpy(), train_labels.numpy())
predictions = sgd_classifier.predict(test_embeddings.numpy())
The SGD classifier achieved a significantly higher accuracy (89.84%) compared to the nearest
neighbor methods using centroids, with Euclidean and cosine distances yielding 84.02% and
84.12% accuracy, respectively. This improvement indicates that the SGD model, which optimizes
based on labeled training data, can create more effective decision boundaries than the simple
centroid-based approach. The confusion matrix for the SGD classifier shows that it performs
especially well on the "Sports" category with minimal misclassifications, likely due to the distinct
vocabulary and context typical of sports articles. However, it still struggles slightly with classes
that have overlapping topics, such as "Business" and "World," which sometimes share economic
or global themes.
Comparing the errors made by both approaches, we observe that the nearest neighbor methods
often misclassify articles that lie near the boundary of two classes, like "Business" and "World."
The centroid-based approach relies solely on proximity to the class mean, which doesn't allow it
to learn nuanced decision boundaries. On the other hand, the SGD classifier, by learning directly
from labeled data, is better at handling complex class separations, resulting in fewer
misclassifications. Overall, while the nearest neighbor classifiers are straightforward and
interpretable, the SGD model outperforms them by a significant margin, making it the better
choice for this text classification task.
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=tsne_results_pca10[:,0], y=tsne_results_pca10[:,1],
    hue=test_labels,
    palette=palette,
    legend='full',
    alpha=0.6
)
plt.title('t-SNE on PCA-Reduced (10D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(title='Classes', labels=class_names, loc='best')
plt.show()
print("Applying t-SNE on PCA-reduced (50D) embeddings...")
tsne_pca = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
tsne_results_pca50 = tsne_pca.fit_transform(X_test_pca50)
plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=tsne_results_pca50[:, 0], y=tsne_results_pca50[:, 1],
    hue=test_labels,
    palette=palette,
    legend='full',
    alpha=0.6
)
)
plt.title('t-SNE on PCA-Reduced (50D) Embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(title='Classes', labels=class_names, loc='best')
plt.show()
Applying t-SNE on PCA-reduced (50D) embeddings...
The results from reducing the DistilBERT embeddings to 10 and 50 dimensions using PCA show
how dimensionality reduction affects classification performance and data-structure visualization.
Comparing classification accuracy, the full embeddings performed best at 89.84%. Reducing to
50 dimensions caused only a slight drop, to 88.20%, while reducing to 10 dimensions led to a
more significant drop, with accuracy falling to 83.57%. This indicates that 50 dimensions retain
most of the useful information, whereas further reduction to 10 dimensions loses features
necessary for accurate classification.
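This pattern is consistent with how PCA concentrates variance in its leading components. A minimal sketch, using synthetic data as a stand-in for the 768-D embeddings (the decaying variance profile is an assumption, chosen to mimic typical embedding spectra):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for 768-D embeddings: per-direction standard
# deviations decay as 1/k, so variance concentrates up front.
n, d = 500, 768
scales = 1.0 / np.arange(1, d + 1)
X = rng.normal(size=(n, d)) * scales

# Fit PCA once with 50 components and read off the cumulative
# explained-variance ratio at 10 and at 50 components.
pca = PCA(n_components=50, random_state=0).fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print(f"variance kept by 10 components: {cum[9]:.2%}")
print(f"variance kept by 50 components: {cum[49]:.2%}")
```

When the cumulative ratio at 50 components is close to 1 but noticeably higher than at 10, an accuracy profile like 88.20% vs 83.57% is exactly what one would expect.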
The t-SNE visualizations of the reduced embeddings also reveal interesting patterns. In the
50-dimensional reduction, the four AG News classes remain relatively well separated, indicating
that the primary features that distinguish each class are preserved. In the 10-dimensional
visualization, the class clusters begin to overlap more, suggesting that the model has lost some
of its ability to differentiate between nuanced aspects of each class. This reduction in
dimensionality leads to poorer class separability, particularly between classes with overlapping
themes like World and Business, which often share contextual elements.
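The visual impression of cluster overlap can also be quantified. One option (not used in the notebook itself, so this is an added suggestion) is the silhouette score, sketched here on toy 2-D points standing in for t-SNE outputs:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Toy 2-D "t-SNE outputs": four clusters at the corners of a square,
# once with tight spread and once with heavy overlap.
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]], dtype=float)
labels = np.repeat(np.arange(4), 100)

tight = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in centers])
loose = np.vstack([rng.normal(c, 4.0, size=(100, 2)) for c in centers])

# Silhouette ranges from -1 to 1; higher means better class separation.
print("well-separated:", silhouette_score(tight, labels))
print("overlapping:   ", silhouette_score(loose, labels))
```

Applied to the 50-D and 10-D t-SNE outputs with the true test labels, a lower score on the 10-D projection would confirm the loss of separability seen in the plots.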
I think ChatGPT was beneficial because of its ability to generate initial code structures,
implement standard functions, and handle basic processes like setting up the model, applying
PCA, and calculating accuracy or confusion matrices. It provided the building blocks, which
allowed me to start quickly and avoid time-consuming, redundant coding tasks. However, it
struggled with nuanced aspects like optimizing hyperparameters and modifying preprocessing
steps for specific datasets, and it fared worse with more niche requirements like adjusting
transforms for better accuracy or understanding subtle data properties. I had to make
adjustments manually, as the system provided only default options without deeper
optimizations. For example, I experimented with different values for SVD and changed the
transform from the standard one the AI generator produced.
Part B
Did you discover any subtle mistakes when you read the resulting code? Do you get trapped into
spending more time debugging than you thought you would?
Answer: Yes, there were some subtle difficulties, such as being unable to use StandardScaler,
and ChatGPT supplying only standard parameters, which did not allow for deeper optimization.
Tuning the hyperparameters was also difficult, as a different function was being called, which
was hurting my accuracy. I also spent a lot of time trying to make my visualizations more
distinctive: only a single color was used to denote all the clusters, while I wanted the plots to be
more descriptive, which required implementing more colors.
Part C
Do you think using AI tools for simple tasks frees up your brainpower for more advanced
problem-solving, or does it reduce your overall understanding?
Answer: Using AI generators lies in a gray area of coding. If we are tackling a problem we could
easily solve logically but that is time-consuming and too redundant to write out, as certain
functions can be, it is a good idea to use AI generators. But when we are presented with an
advanced problem, relying on them can become a hurdle to our logic-building and to our overall
understanding of the code and its flow.
Submission Guideline:
• Submit everything through Google classroom. As mentioned above, you will need to
upload:
a. The Jupyter notebook all your work is in (.ipynb file), derived from the provided
template
b. PDF (export the notebook as a pdf file)
• These files should be named with the following format, where the italicized parts should
be replaced with the corresponding values:
a. cse519_hw3_lastname_firstname_sbuid.ipynb
b. cse519_hw3_lastname_firstname_sbuid.pdf
Your Submission will NOT BE GRADED if you don't follow the naming convention❗❗
May your datasets be balanced, your features well-engineered, and your p-values low! (˵ •̀ ᴗ •́ ˵ )
✧