Netflix Movies & TV Show Clustering using Unsupervised ML

In the ever-evolving landscape of digital entertainment, Netflix stands out as a dominant platform, providing a vast library of movies and TV shows to its global audience. The challenge, however, lies in effectively managing and recommending content to users. Unsupervised machine learning (ML), particularly clustering algorithms, plays a pivotal role in this endeavor by grouping similar content and enhancing user experience and satisfaction.

This article delves into unsupervised ML techniques for clustering Netflix's extensive collection of movies and TV shows.

What is Clustering in Machine Learning?

Clustering is a form of unsupervised learning in which the algorithm identifies natural groupings within data based on similarity. Unlike supervised learning, which requires labeled data, unsupervised learning operates on unlabeled data, making it well suited to exploratory analysis.

Key Clustering Algorithms:

K-Means Clustering: This algorithm partitions data into K distinct clusters based on feature similarity. It iteratively assigns data points to clusters and adjusts the cluster centroids until convergence.
Hierarchical Clustering: This method builds a hierarchy of clusters either through a bottom-up (agglomerative) or top-down (divisive) approach, resulting in a tree-like structure called a dendrogram.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN identifies clusters based on the density of data points, making it effective in identifying clusters of varying shapes and sizes, and distinguishing outliers.
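To make the three algorithms concrete, here is a minimal sketch that runs each of them on synthetic 2D data. The toy blobs, the `eps` value, and the variable names are assumptions chosen for illustration; none of them come from the Netflix dataset.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): three well-separated 2D blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.6, random_state=42)

# K-Means: the number of clusters must be chosen up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Agglomerative (bottom-up hierarchical) clustering, cut at three clusters
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN: no cluster count is given; density parameters decide the result
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1, so exclude it from the count
n_dbscan_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)

print(len(set(kmeans_labels)), len(set(agglo_labels)), n_dbscan_clusters)
```

Note the difference in inputs: K-Means and agglomerative clustering are told how many clusters to produce, while DBSCAN infers the count from `eps` and `min_samples`.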

Benefits of Clustering for Netflix

Enhanced Recommendations: Clustering enables Netflix to recommend content more effectively by identifying groups of similar movies and TV shows. For instance, if a user has shown a preference for a particular cluster of action movies, Netflix can recommend other movies within the same cluster.
Improved Content Organization: Organizing the vast library of content into meaningful clusters helps Netflix in structuring its catalog, making it easier for users to navigate and discover new content.
Personalization: Clustering allows for deeper personalization. By understanding the clusters that a user interacts with, Netflix can tailor its recommendations to better match the user's tastes and viewing habits.
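The recommendation idea above can be sketched in a few lines. The catalog, the titles, and the helper `recommend_from_cluster` are all hypothetical; this is not how Netflix implements recommendations, only an illustration of "suggest other items from the same cluster".

```python
import pandas as pd

# Hypothetical titles and cluster labels, for illustration only
catalog = pd.DataFrame({
    "title": ["Action A", "Action B", "Action C", "Drama A", "Drama B"],
    "cluster": [0, 0, 0, 1, 1],
})

def recommend_from_cluster(catalog, watched_title, n=5):
    """Return up to n other titles that share a cluster with watched_title."""
    cluster_id = catalog.loc[catalog["title"] == watched_title, "cluster"].iloc[0]
    same_cluster = catalog[(catalog["cluster"] == cluster_id)
                           & (catalog["title"] != watched_title)]
    return same_cluster["title"].head(n).tolist()

recommendations = recommend_from_cluster(catalog, "Action A")
print(recommendations)  # ['Action B', 'Action C']
```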

Steps to perform Clustering on Netflix Data

Step 1: Import Libraries

First, import all the necessary libraries for data manipulation, preprocessing, clustering, and visualization.

Python

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

Step 2: Load the Dataset

Load the dataset into a pandas DataFrame. The code below assumes the file is named netflix_titles.csv; adjust the path to wherever your copy is stored.

Python

df = pd.read_csv('/content/netflix_titles.csv')  # Update this line to the path of your CSV file

Step 3: Drop Rows with Critical Missing Values

Drop rows that have missing values in critical columns: 'director', 'cast', 'country', 'rating', and 'duration'.

Python

df = df.dropna(subset=['director', 'cast', 'country', 'rating', 'duration'])

# Check the shape of the DataFrame after dropping rows
print(f"Shape after dropping rows with missing critical values: {df.shape}")

Output:

Shape after dropping rows with missing critical values: (5332, 12)

Step 4: Preprocess the Duration

Create a function to preprocess the 'duration' column, converting it into a numerical format.

Python

def preprocess_duration(duration):
    if 'min' in duration:
        return int(duration.split(' ')[0])
    elif 'Season' in duration:
        return int(duration.split(' ')[0]) * 60  # Assume each season is equivalent to 60 minutes
    return 0

df['duration'] = df['duration'].apply(preprocess_duration)
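As a quick check of the conversion, the sketch below repeats the function and applies it to a few sample strings. The sample values are made up for illustration, in the style of the dataset's 'duration' column.

```python
def preprocess_duration(duration):
    # Same logic as in Step 4: movies are in minutes, shows are in seasons
    if 'min' in duration:
        return int(duration.split(' ')[0])
    elif 'Season' in duration:
        return int(duration.split(' ')[0]) * 60  # one season counted as 60 minutes
    return 0

# Illustrative inputs (assumed formats)
samples = ["90 min", "1 Season", "3 Seasons", "unknown"]
converted = [preprocess_duration(s) for s in samples]
print(dict(zip(samples, converted)))
# {'90 min': 90, '1 Season': 60, '3 Seasons': 180, 'unknown': 0}
```

Because 'Season' is a substring of 'Seasons', both the singular and plural forms are handled by the same branch. The 60-minutes-per-season figure is a rough placeholder, so movies and shows end up on a shared but only loosely comparable scale.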

Step 5: Normalize Numerical Features

Standardize the 'release_year' and 'duration' columns with StandardScaler so that each has zero mean and unit variance.

Python

scaler = StandardScaler()
df[['release_year', 'duration']] = scaler.fit_transform(df[['release_year', 'duration']])
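Scaling matters here because K-Means relies on Euclidean distance, so a column with large raw values (years near 2000) would otherwise dominate one with small values. The following sketch, on made-up numbers, shows what StandardScaler does to each column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration
toy = pd.DataFrame({"release_year": [1990, 2005, 2020],
                    "duration": [60, 90, 150]})

scaled = StandardScaler().fit_transform(toy)

# After scaling, every column has mean 0 and (population) standard deviation 1
means = scaled.mean(axis=0)
stds = scaled.std(axis=0)
print(means.round(6), stds.round(6))
```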

Step 6: Handle NaN Values in Text Features

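Fill NaN values in the text features with empty strings. Because Step 3 already dropped rows with missing 'director', 'cast', or 'country' values, this step acts only as a safeguard so that the string concatenation in the next step cannot produce NaN.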

Python

df['director'] = df['director'].fillna('')
df['cast'] = df['cast'].fillna('')
df['country'] = df['country'].fillna('')

Step 7: Combine Text Features

Create a new column 'text_features' by combining relevant text columns to be used for text vectorization.

Python

df['text_features'] = df['type'] + ' ' + df['title'] + ' ' + df['director'] + ' ' + df['cast'] + ' ' + df['country'] + ' ' + df['rating'] + ' ' + df['listed_in'] + ' ' + df['description']

Step 8: Use TF-IDF Vectorizer

Use TF-IDF vectorizer with n-grams to transform the text features into numerical features.

Python

tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.9, min_df=5, max_features=1000)
tfidf_matrix = tfidf.fit_transform(df['text_features'])

# Convert to DataFrame for easy manipulation
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=df.index, columns=tfidf.get_feature_names_out())

Step 9: Combine Numerical and Text Features

Combine normalized numerical features and TF-IDF features into a final DataFrame for clustering.

Python

final_df = pd.concat([df[['release_year', 'duration']], tfidf_df], axis=1)
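One detail worth checking at this point is index alignment. After `dropna`, the DataFrame's index has gaps, and `pd.concat(..., axis=1)` aligns rows by index label, which is why Step 8 builds `tfidf_df` with `index=df.index`. The toy frames below (hypothetical names and values) show what happens with and without that argument.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap, as after dropna(): labels 0, 2, 3
df_toy = pd.DataFrame({"release_year": [0.1, 0.2, 0.3, 0.4]}).drop(index=[1])
features = np.array([[1.0], [2.0], [3.0]])

aligned = pd.DataFrame(features, index=df_toy.index, columns=["term"])
misaligned = pd.DataFrame(features, columns=["term"])  # default index 0, 1, 2

good = pd.concat([df_toy, aligned], axis=1)
bad = pd.concat([df_toy, misaligned], axis=1)

print(good.isna().sum().sum())  # 0 missing values
print(bad.shape, bad.isna().sum().sum())  # extra row and NaNs from misalignment
```

A quick `final_df.isna().sum().sum()` before clustering is a cheap way to confirm that the combined frame contains no NaN values, since K-Means would reject them.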

Step 10: Elbow Method for Optimal k

Use the elbow method to find the optimal number of clusters for K-Means clustering.

Python

sse = []
sil_scores = []
for k in range(2, 11):  # Start from 2 as 1 is not a valid number of clusters
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(final_df)
    sse.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(final_df, kmeans.labels_))

# Plot the Elbow graph and Silhouette Scores
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(range(2, 11), sse, marker='o')
plt.title('Elbow Method For Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')

plt.subplot(1, 2, 2)
plt.plot(range(2, 11), sil_scores, marker='o')
plt.title('Silhouette Score For Different k')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

Output:

The first graph on the left, titled "Elbow Method For Optimal k," plots the Sum of Squared Errors (SSE) against the number of clusters, k, ranging from 2 to 10. SSE measures the total squared distance between each data point and the centroid of its assigned cluster. As the number of clusters increases, the SSE naturally decreases because more clusters mean that data points are generally closer to their respective centroids. The purpose of the elbow method is to identify a point where the rate of SSE reduction slows significantly.

The second graph on the right, titled "Silhouette Score For Different k," shows the silhouette score plotted against the number of clusters, k, also ranging from 2 to 10. The silhouette score measures how similar a data point is to its own cluster compared to other clusters, with a score ranging from -1 to 1. A higher silhouette score indicates well-defined, distinct clusters. In this graph, the highest silhouette score is observed at k = 2, but there is a notable drop as k increases. However, the scores for k = 3 and k = 4 are still relatively high compared to higher values of k, indicating that these cluster counts might also provide a good balance between cluster quantity and quality.
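Reading the plots by eye can be complemented by picking the k with the highest silhouette score programmatically. The sketch below does this on synthetic blobs (an assumption for illustration, not the Netflix features), where the true number of groups is known to be four.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=42)

ks = list(range(2, 8))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# The k with the highest average silhouette score
best_k = ks[int(np.argmax(scores))]
print(best_k)
```

On real data the maximum is often less clear-cut than in this toy case, so the silhouette maximum is best treated as one input alongside the elbow plot and how interpretable the resulting clusters are.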

Step 11: Apply K-Means Clustering

Apply K-Means clustering with the optimal number of clusters determined from the elbow method and silhouette score.

Python

k = 5  # Update this value based on the optimal k from the Elbow Method and Silhouette Score
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(final_df)

# Add the cluster labels to the original dataframe
df['cluster'] = kmeans.labels_

# Display the first few rows with cluster labels
print(df.head())

# Calculate the silhouette score
score = silhouette_score(final_df, kmeans.labels_)
print(f'Silhouette Score: {score}')

Output:

Silhouette Score: 0.614701517836973277

Step 12: Reduce Dimensions to 2D Using PCA

Use PCA for dimensionality reduction to visualize clusters in 2D.

Python

pca = PCA(n_components=2, random_state=42)
pca_df = pca.fit_transform(final_df)

# Create a DataFrame with PCA components and cluster labels
pca_df = pd.DataFrame(data=pca_df, columns=['PC1', 'PC2'])
pca_df['cluster'] = kmeans.labels_

# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=pca_df['cluster'], cmap='viridis', marker='o')
plt.title('Clusters of Netflix Movies and TV Shows')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

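Output: a 2D scatter plot of the first two principal components, with each point colored by its K-Means cluster.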
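When reading a 2D PCA plot, it helps to know how much of the total variance the two components retain, because a low share means the picture hides most of the structure. The sketch below shows the check on random toy data (an assumption for illustration); with the article's objects, the same attribute is available on the fitted `pca`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Toy data: 10 features, with most of the variance in the first two
scales = np.array([5.0, 3.0] + [0.5] * 8)
X = rng.normal(size=(200, 10)) * scales

pca = PCA(n_components=2, random_state=42).fit(X)

# Fraction of total variance captured by the two plotted components
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.2%}")
```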

Step 13: Try DBSCAN Clustering

Apply DBSCAN clustering as an alternative approach.

Python

dbscan = DBSCAN(eps=0.5, min_samples=5, n_jobs=-1)
dbscan_labels = dbscan.fit_predict(final_df)

# Add the DBSCAN cluster labels to the original dataframe
df['cluster_dbscan'] = dbscan_labels

# Display the first few rows with DBSCAN cluster labels
print(df.head())

# Evaluate DBSCAN
unique_labels = set(dbscan_labels)
print(f'Number of clusters found by DBSCAN: {len(unique_labels) - (1 if -1 in dbscan_labels else 0)}')

Output:

Number of clusters found by DBSCAN: 0
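A result of zero clusters means every point was labeled as noise, which typically indicates that `eps` is too small for the scale of the data. A common heuristic for choosing `eps` is to inspect the sorted distance from each point to its `min_samples`-th nearest neighbor. The sketch below applies it to synthetic blobs; the data and the 95th-percentile cutoff are assumptions for illustration, not a tuned recipe for the Netflix features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)
min_samples = 5

def count_clusters(labels):
    # DBSCAN uses -1 for noise, which is not a cluster
    return len(set(labels)) - (1 if -1 in labels else 0)

# An eps that is far too small: every point ends up as noise
n_small = count_clusters(DBSCAN(eps=0.01, min_samples=min_samples).fit_predict(X))

# k-distance heuristic: querying the training data returns each point itself
# at distance 0, matching DBSCAN's convention that min_samples includes the point
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

eps = float(np.percentile(k_distances, 95))  # assumed cutoff
n_tuned = count_clusters(DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X))

print(n_small, round(eps, 3), n_tuned)
```

Plotting `k_distances` and looking for the bend in the curve is the usual manual version of this heuristic. In high-dimensional feature spaces such as TF-IDF, distances tend to be large and similar to one another, so a default-sized `eps` can easily leave every point as noise.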

Step 14: Visualize DBSCAN Clusters

Visualize DBSCAN clusters using the same PCA components from earlier.

Python

pca_df_dbscan = pca_df[['PC1', 'PC2']].copy()  # reuse the PCA coordinates from Step 12
pca_df_dbscan['cluster'] = dbscan_labels

plt.figure(figsize=(10, 7))
plt.scatter(pca_df_dbscan['PC1'], pca_df_dbscan['PC2'], c=pca_df_dbscan['cluster'], cmap='viridis', marker='o')
plt.title('DBSCAN Clustering of Netflix Movies and TV Shows')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Output:

Since DBSCAN found no clusters with these parameters (every point was labeled as noise), we return to the K-Means labels and try a different visualization next.

Step 15: Perform t-SNE for Better Visualization

Use t-SNE for dimensionality reduction to visualize clusters in 2D with more separation.

Python

# Note: scikit-learn 1.5 renamed n_iter to max_iter; on older versions, pass n_iter=300 instead
tsne = TSNE(n_components=2, random_state=42, max_iter=300)
tsne_df = tsne.fit_transform(final_df)

# Create a DataFrame with t-SNE components and cluster labels
tsne_df = pd.DataFrame(data=tsne_df, columns=['TSNE1', 'TSNE2'])
tsne_df['cluster'] = kmeans.labels_

# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], c=tsne_df['cluster'], cmap='viridis', marker='o')
plt.title('t-SNE Clustering of Netflix Movies and TV Shows')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()

Output:

The plot visualizes the K-Means clusters of Netflix movies and TV shows in the 2D t-SNE embedding. Each point represents a movie or TV show, and its color indicates the cluster to which it belongs.

Challenges and Considerations

While clustering offers significant advantages, it also presents challenges:

Feature Selection: Choosing the right features is crucial. Irrelevant or redundant features can lead to poor clustering performance.
Scalability: Netflix’s dataset is enormous, requiring efficient algorithms and computational resources to process.
Dynamic Nature of Data: User preferences and content popularity change over time, necessitating periodic re-clustering and model updates.
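For the scalability point, one option in scikit-learn is `MiniBatchKMeans`, which updates centroids from small random batches instead of the full dataset on every iteration. The sketch below runs it on synthetic data; the sizes and parameters are assumptions for illustration, and the approach is a suggestion rather than something the steps above used.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a larger feature matrix
X, _ = make_blobs(n_samples=10_000, centers=5, n_features=20, random_state=42)

# Each update uses a batch of 1024 samples rather than all 10,000
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(X)

print(labels.shape, mbk.cluster_centers_.shape)
```

`MiniBatchKMeans` also offers `partial_fit`, which can fold in new batches of data over time; that may help with periodic updates, although a full re-clustering is still the safer choice when the catalog changes substantially.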

Conclusion

Unsupervised machine learning, particularly clustering, is a powerful tool for managing and recommending content on Netflix. By leveraging algorithms like K-Means, Hierarchical Clustering, and DBSCAN, Netflix can effectively group its movies and TV shows, enhancing user experience through personalized recommendations and streamlined content organization. As technology advances, the integration of more sophisticated ML techniques will continue to refine and improve the ways in which we discover and enjoy digital entertainment.
