Netflix offers a vast library of movies and TV shows to a global audience, which makes organizing and recommending that content a real challenge. Unsupervised machine learning (ML), particularly clustering, helps by grouping similar titles so that related content can be organized and recommended together.
This article delves into unsupervised ML techniques for clustering Netflix's extensive collection of movies and TV shows.
What is Clustering in Machine Learning?
Clustering is a form of unsupervised learning in which the algorithm identifies natural groupings within data based on similarity. Unlike supervised learning, which requires labeled data, unsupervised learning operates on unlabeled data, making it well suited to exploratory analysis.
Key Clustering Algorithms:
- K-Means Clustering: This algorithm partitions data into K distinct clusters based on feature similarity. It iteratively assigns data points to clusters and adjusts the cluster centroids until convergence.
- Hierarchical Clustering: This method builds a hierarchy of clusters either through a bottom-up (agglomerative) or top-down (divisive) approach, resulting in a tree-like structure called a dendrogram.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN identifies clusters based on the density of data points, making it effective in identifying clusters of varying shapes and sizes, and distinguishing outliers.
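To make the three algorithms above concrete, here is a minimal sketch that runs each of them on synthetic 2-D data. The blob centers, `eps`, and `min_samples` values are illustrative assumptions chosen for this toy data, not settings taken from the Netflix dataset.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

# K-Means and agglomerative clustering need the number of clusters up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

for name, labels in [
    ("K-Means", kmeans_labels),
    ("Agglomerative", agglo_labels),
    ("DBSCAN", dbscan_labels),
]:
    n_clusters = len(set(labels) - {-1})
    n_noise = int(np.sum(labels == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

Note the difference in inputs: K-Means and agglomerative clustering are told how many clusters to produce, while DBSCAN is told what counts as a dense neighborhood.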
Benefits of Clustering for Netflix
- Enhanced Recommendations: Clustering enables Netflix to recommend content more effectively by identifying groups of similar movies and TV shows. For instance, if a user has shown a preference for a particular cluster of action movies, Netflix can recommend other movies within the same cluster.
- Improved Content Organization: Organizing the vast library of content into meaningful clusters helps Netflix in structuring its catalog, making it easier for users to navigate and discover new content.
- Personalization: Clustering allows for deeper personalization. By understanding the clusters that a user interacts with, Netflix can tailor its recommendations to better match the user's tastes and viewing habits.
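As a rough illustration of cluster-based recommendation, the sketch below looks up the cluster of a watched title and returns other titles from the same cluster. The catalog, the cluster assignments, and the helper `recommend_from_cluster` are hypothetical and exist only for this example; this is not how Netflix's production system works.

```python
import pandas as pd

# Hypothetical catalog with cluster labels already assigned
catalog = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C", "Title D", "Title E"],
    "cluster": [0, 0, 1, 0, 1],
})

def recommend_from_cluster(catalog, watched_title, n=3):
    """Return up to n other titles that share a cluster with watched_title."""
    cluster_id = catalog.loc[catalog["title"] == watched_title, "cluster"].iloc[0]
    same_cluster = catalog[
        (catalog["cluster"] == cluster_id) & (catalog["title"] != watched_title)
    ]
    return same_cluster["title"].head(n).tolist()

recommendations = recommend_from_cluster(catalog, "Title A")
print(recommendations)  # ['Title B', 'Title D']
```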
Steps to perform Clustering on Netflix Data
Step 1: Import Libraries
First, import all the necessary libraries for data manipulation, preprocessing, clustering, and visualization.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
Step 2: Load the Dataset
Load the dataset into a Pandas DataFrame. The code below expects the file netflix_titles.csv.
df = pd.read_csv('/content/netflix_titles.csv') # Update this line to the path of your CSV file
Step 3: Drop Rows with Critical Missing Values
Drop rows that have missing values in critical columns: 'director', 'cast', 'country', 'rating', and 'duration'.
df = df.dropna(subset=['director', 'cast', 'country', 'rating', 'duration'])
# Check the shape of the DataFrame after dropping rows
print(f"Shape after dropping rows with missing critical values: {df.shape}")
Output:
Shape after dropping rows with missing critical values: (5332, 12)
Step 4: Preprocess the Duration
Create a function to preprocess the 'duration' column, converting it into a numerical format.
def preprocess_duration(duration):
    if 'min' in duration:
        return int(duration.split(' ')[0])
    elif 'Season' in duration:
        return int(duration.split(' ')[0]) * 60 # Assume each season is equivalent to 60 minutes
    return 0
df['duration'] = df['duration'].apply(preprocess_duration)
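A quick sanity check of the conversion, using a few sample strings in the two formats the function handles. The sample values are made up for illustration, and the function is restated so the snippet runs on its own.

```python
import pandas as pd

def preprocess_duration(duration):
    # Movies look like "90 min"; TV shows look like "1 Season" or "3 Seasons"
    if 'min' in duration:
        return int(duration.split(' ')[0])
    elif 'Season' in duration:
        return int(duration.split(' ')[0]) * 60  # same 60-minutes-per-season assumption
    return 0

sample = pd.Series(["90 min", "1 Season", "3 Seasons"])
converted = sample.apply(preprocess_duration)
print(converted.tolist())  # [90, 60, 180]
```

Because of the 60-minute assumption, a one-season show ends up shorter than a 90-minute movie. The numeric 'duration' feature therefore mixes two different units, which is worth keeping in mind when interpreting the clusters.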
Step 5: Normalize Numerical Features
Normalize the 'release_year' and 'duration' columns using StandardScaler.
scaler = StandardScaler()
df[['release_year', 'duration']] = scaler.fit_transform(df[['release_year', 'duration']])
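StandardScaler replaces each value with its z-score: it subtracts the column mean and divides by the column standard deviation. The sketch below checks this on a few made-up release years.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up release years, one column
values = np.array([[2000.0], [2010.0], [2020.0]])

scaled = StandardScaler().fit_transform(values)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and unit variance, so 'release_year' and 'duration' contribute on comparable scales to the distance calculations used by K-Means.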
Step 6: Handle NaN Values in Text Features
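Fill NaN values in text features with empty strings. Rows with missing values in these three columns were already dropped in Step 3, so this acts as a safeguard in case that step is changed or skipped.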
df['director'] = df['director'].fillna('')
df['cast'] = df['cast'].fillna('')
df['country'] = df['country'].fillna('')
Step 7: Combine Text Features
Create a new column 'text_features' by combining relevant text columns to be used for text vectorization.
df['text_features'] = df['type'] + ' ' + df['title'] + ' ' + df['director'] + ' ' + df['cast'] + ' ' + df['country'] + ' ' + df['rating'] + ' ' + df['listed_in'] + ' ' + df['description']
Step 8: Use TF-IDF Vectorizer
Use TF-IDF vectorizer with n-grams to transform the text features into numerical features.
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.9, min_df=5, max_features=1000)
tfidf_matrix = tfidf.fit_transform(df['text_features'])
# Convert to DataFrame for easy manipulation
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=df.index, columns=tfidf.get_feature_names_out())
Step 9: Combine Numerical and Text Features
Combine normalized numerical features and TF-IDF features into a final DataFrame for clustering.
final_df = pd.concat([df[['release_year', 'duration']], tfidf_df], axis=1)
Step 10: Elbow Method for Optimal k
Use the elbow method to find the optimal number of clusters for K-Means clustering.
sse = []
sil_scores = []
for k in range(2, 11): # Start from 2 because the silhouette score needs at least 2 clusters
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(final_df)
    sse.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(final_df, kmeans.labels_))
# Plot the Elbow graph and Silhouette Scores
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(range(2, 11), sse, marker='o')
plt.title('Elbow Method For Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.subplot(1, 2, 2)
plt.plot(range(2, 11), sil_scores, marker='o')
plt.title('Silhouette Score For Different k')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()
Output:
The first graph on the left, titled "Elbow Method For Optimal k," plots the Sum of Squared Errors (SSE) against the number of clusters, k, ranging from 2 to 10. SSE measures the total squared distance between each data point and the centroid of its assigned cluster. As the number of clusters increases, the SSE naturally decreases because more clusters mean that data points are generally closer to their respective centroids. The purpose of the elbow method is to identify a point where the rate of SSE reduction slows significantly.
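The SSE described here is what scikit-learn exposes as `inertia_`. As a minimal check on synthetic data (the blobs are an assumption for illustration), the sketch below recomputes it by hand from the fitted centroids and labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = model.cluster_centers_[model.labels_]
manual_sse = float(np.sum((X - assigned_centroids) ** 2))

print(manual_sse, model.inertia_)
print(np.isclose(manual_sse, model.inertia_))  # True
```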
The second graph on the right, titled "Silhouette Score For Different k," shows the silhouette score plotted against the number of clusters, k, also ranging from 2 to 10. The silhouette score measures how similar a data point is to its own cluster compared to other clusters, with a score ranging from -1 to 1. A higher silhouette score indicates well-defined, distinct clusters. In this graph, the highest silhouette score is observed at k = 2, but there is a notable drop as k increases. However, the scores for k = 3 and k = 4 are still relatively high compared to higher values of k, indicating that these cluster counts might also provide a good balance between cluster quantity and quality.
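Reading the best k off a plot can also be done programmatically. The following sketch, on synthetic data with four clearly separated groups (an assumption for illustration), picks the k with the highest silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[-6, 0], [0, 6], [6, 0], [0, -6]],
    cluster_std=0.5,
    random_state=1,
)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Choose the k whose clustering has the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the maximum is often less clear-cut than in this toy case, so the silhouette maximum is best treated as one input alongside the elbow plot rather than as a definitive answer.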
Step 11: Apply K-Means Clustering
Apply K-Means clustering with the optimal number of clusters determined from the elbow method and silhouette score.
k = 5 # Update this value based on the optimal k from the Elbow Method and Silhouette Score
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(final_df)
# Add the cluster labels to the original dataframe
df['cluster'] = kmeans.labels_
# Display the first few rows with cluster labels
print(df.head())
# Calculate the silhouette score
score = silhouette_score(final_df, kmeans.labels_)
print(f'Silhouette Score: {score}')
Output:
Silhouette Score: 0.614701517836973277
Step 12: Reduce Dimensions to 2D Using PCA
Use PCA for dimensionality reduction to visualize clusters in 2D.
pca = PCA(n_components=2, random_state=42)
pca_df = pca.fit_transform(final_df)
# Create a DataFrame with PCA components and cluster labels
pca_df = pd.DataFrame(data=pca_df, columns=['PC1', 'PC2'])
pca_df['cluster'] = kmeans.labels_
# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(pca_df['PC1'], pca_df['PC2'], c=pca_df['cluster'], cmap='viridis', marker='o')
plt.title('Clusters of Netflix Movies and TV Shows')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
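Output: (scatter plot titled "Clusters of Netflix Movies and TV Shows", Principal Component 1 against Principal Component 2, points colored by cluster)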
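When reading a 2D PCA plot, it helps to know how much of the total variance the two components retain. The sketch below shows how to check this with `explained_variance_ratio_`. The random data is an assumption for illustration; on the real feature matrix the retained share may be much lower.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 10-dimensional data in which one direction dominates the variance
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
X[:, 0] *= 10

pca_check = PCA(n_components=2, random_state=42)
coords = pca_check.fit_transform(X)

retained = float(pca_check.explained_variance_ratio_.sum())
print(coords.shape)  # (200, 2)
print(f"Variance retained by 2 components: {retained:.2%}")
```

If the retained share is small, clusters that overlap in the 2D plot may still be well separated in the full feature space.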
Step 13: Try DBSCAN Clustering
Apply DBSCAN clustering as an alternative approach.
dbscan = DBSCAN(eps=0.5, min_samples=5, n_jobs=-1)
dbscan_labels = dbscan.fit_predict(final_df)
# Add the DBSCAN cluster labels to the original dataframe
df['cluster_dbscan'] = dbscan_labels
# Display the first few rows with DBSCAN cluster labels
print(df.head())
# Evaluate DBSCAN
unique_labels = set(dbscan_labels)
print(f'Number of clusters found by DBSCAN: {len(unique_labels) - (1 if -1 in dbscan_labels else 0)}')
Output:
Number of clusters found by DBSCAN: 0
Step 14: Visualize DBSCAN Clusters
Visualize DBSCAN clusters using the same PCA components from earlier.
pca_df_dbscan = pca_df[['PC1', 'PC2']].copy() # Reuse the PCA coordinates without the K-Means labels
pca_df_dbscan['cluster'] = dbscan_labels
plt.figure(figsize=(10, 7))
plt.scatter(pca_df_dbscan['PC1'], pca_df_dbscan['PC2'], c=pca_df_dbscan['cluster'], cmap='viridis', marker='o')
plt.title('DBSCAN Clustering of Netflix Movies and TV Shows')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
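Output: (scatter plot titled "DBSCAN Clustering of Netflix Movies and TV Shows", Principal Component 1 against Principal Component 2)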
Since DBSCAN found no clusters with these parameters (every point is labeled as noise), we return to the K-Means labels and visualize them with the next approach.
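A result of zero clusters, as in the Step 13 output, usually means `eps` is too small for the scale of distances in the feature space. One common heuristic, not used in the steps above, is to look at each point's distance to its k-th nearest neighbor and choose `eps` near the point where the sorted distances start rising sharply. The sketch below computes those distances on synthetic data; the data and the percentile summary are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data for illustration
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

min_samples = 5

# Querying the training data returns each point as its own nearest neighbor,
# which matches DBSCAN's convention that min_samples includes the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from each point to its farthest of the min_samples neighbors, sorted
k_distances = np.sort(distances[:, -1])

print(np.percentile(k_distances, [50, 90, 99]))
```

Plotting `k_distances` and picking a value near the bend gives a starting point for `eps`. In a high-dimensional TF-IDF space the distances tend to be large and similar to one another, so DBSCAN may remain hard to tune there.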
Step 15: Perform t-SNE for Better Visualization
Use t-SNE for dimensionality reduction to visualize clusters in 2D with more separation.
tsne = TSNE(n_components=2, random_state=42, n_iter=300) # Note: scikit-learn 1.5 renames n_iter to max_iter
tsne_df = tsne.fit_transform(final_df)
# Create a DataFrame with t-SNE components and cluster labels
tsne_df = pd.DataFrame(data=tsne_df, columns=['TSNE1', 'TSNE2'])
tsne_df['cluster'] = kmeans.labels_
# Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(tsne_df['TSNE1'], tsne_df['TSNE2'], c=tsne_df['cluster'], cmap='viridis', marker='o')
plt.title('t-SNE Clustering of Netflix Movies and TV Shows')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
Output:
The plot visualizes the clustering of Netflix movies and TV shows based on features extracted from the dataset. Each point represents a movie or TV show, and its color indicates the K-Means cluster to which it belongs.
Challenges and Considerations
While clustering offers significant advantages, it also presents challenges:
- Feature Selection: Choosing the right features is crucial. Irrelevant or redundant features can lead to poor clustering performance.
- Scalability: Netflix’s dataset is enormous, requiring efficient algorithms and computational resources to process.
- Dynamic Nature of Data: User preferences and content popularity change over time, necessitating periodic re-clustering and model updates.
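On the scalability point, one option scikit-learn offers is `MiniBatchKMeans`, which updates centroids from small random batches instead of the full dataset on every iteration. The sketch below is an illustration on synthetic data under assumed settings (`batch_size`, `n_init`), not a recommendation tuned for Netflix-scale data.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a larger feature matrix
X, _ = make_blobs(n_samples=10_000, centers=5, n_features=20, random_state=42)

# Each iteration uses a random batch of 1024 rows rather than all rows
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(X)

print(np.bincount(labels))
print(mbk.cluster_centers_.shape)  # (5, 20)
```

The same estimator also provides `partial_fit`, which can help with the periodic re-clustering mentioned above by updating centroids as new data arrives.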
Conclusion
Unsupervised machine learning, particularly clustering, is a powerful tool for managing and recommending content on Netflix. By leveraging algorithms like K-Means, Hierarchical Clustering, and DBSCAN, Netflix can effectively group its movies and TV shows, enhancing user experience through personalized recommendations and streamlined content organization. As technology advances, the integration of more sophisticated ML techniques will continue to refine and improve the ways in which we discover and enjoy digital entertainment.