Dl Project

The document outlines the development of a movie recommendation system called MoviePal, which uses item-based collaborative filtering to suggest movies based on user preferences. It details the use of the MovieLens-small dataset, the implementation of K-Nearest Neighbors with cosine similarity, and the preprocessing steps necessary for effective recommendations. Key challenges addressed include handling sparse data and ensuring personalized recommendations for users with limited ratings history.

Uploaded by

Ruchita Maaran

TITLE

Movie Recommendation
System - MoviePal
Objective:
The primary objective of this project is to build a movie
recommendation system that can suggest movies to users based on
their historical preferences. By leveraging item-based collaborative
filtering techniques, we aim to recommend movies that are similar
to those the user has already enjoyed.
The system will use ratings from users to predict preferences and
suggest top movies that the user might like. The system will be built
using the MovieLens-small dataset for simplicity and effectiveness.
The model will generate a list of 10 similar movies based on a given
movie input, helping users discover new content aligned with their
tastes.
Define the Problem Statement:
The problem this project addresses is the challenge of
recommending movies to users based on their previous ratings and
the ratings of other users with similar tastes. Movie recommendation
systems are widely used by streaming services such as Netflix,
Amazon Prime, and Hulu. By applying collaborative filtering
techniques, we can predict which movies a user might like based on
the preferences of others who have similar viewing histories.
Key challenges in building the system include:
• Handling sparse data, as users tend to rate only a small subset
of available movies.
• Ensuring the recommendations are relevant and personalized.
• Addressing the cold start problem, where the system may
struggle to recommend movies for new users with no ratings
history.
The solution involves creating an item-based collaborative filtering
model that:
• Takes in a movie title.
• Finds similar movies based on user ratings and recommends
them.
• Uses cosine similarity to determine how similar two movies are
based on their ratings.
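To make the cosine similarity step concrete, here is a minimal sketch with NumPy. The three rating vectors are hypothetical, and the helper name cosine_similarity is an illustrative choice rather than part of the project code. Each vector holds one movie's ratings from four users, with 0 meaning "not rated".

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A movie with no ratings has a zero vector; treat its similarity as 0
    return float(a @ b / denom) if denom else 0.0

# Hypothetical ratings from four users (0 = not rated)
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 2]
movie_c = [1, 0, 5, 4]

sim_ab = cosine_similarity(movie_a, movie_b)
sim_ac = cosine_similarity(movie_a, movie_c)
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

Movies A and B were rated alike by the same users, so they score much closer to 1 than A and C do. Scikit-learn's 'cosine' metric reports the distance 1 - similarity, so smaller values mean more similar movies.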

Gather or Create Datasets:


To build a recommendation system, we need datasets containing
movie ratings and user behavior. In this project, we will use the
MovieLens-small dataset, which is a publicly available dataset that
contains user ratings of movies.
The dataset includes two key files:
• movies.csv: Contains the movie details, including the unique
movieId and the movie title.
• ratings.csv: Contains user ratings for movies. It includes the
userId, movieId, and the rating given by the user.
The MovieLens dataset is ideal because it is already cleaned and
formatted for use in recommendation system projects.

Set up the Simple Tool:


For this project, we will use Python as the primary programming
language, along with the following libraries:
• Pandas: For data manipulation and handling CSV files.
• NumPy: For numerical operations.
• Scikit-learn: For building machine learning models and using
nearest neighbors algorithms.
• SciPy: For sparse matrix operations, which are necessary to
handle large datasets efficiently.
• Matplotlib: For visualizations (optional, but useful for exploring
the data).
We will also use Jupyter Notebook for development, as this tool
provides a rich environment for writing and testing Python code
interactively.

Load & Explore the Datasets:
First, we load the datasets to inspect the data structure. The
movies.csv contains information about movies, and the ratings.csv
contains ratings provided by users. We can load and inspect these
datasets using Pandas.
import pandas as pd
# Load the datasets
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
# Inspect the first few rows of each dataset
print(movies.head())
print(ratings.head())
The movies.csv file has the following columns:
• movieId: The unique identifier for each movie.
• title: The name of the movie.
• genres: The genres associated with the movie (e.g., Action,
Comedy).
The ratings.csv file has the following columns:
• userId: A unique identifier for each user.
• movieId: The identifier for the movie that the user has rated.
• rating: The rating given by the user to the movie (in
MovieLens-small, from 0.5 to 5.0 in half-star steps).
• timestamp: The time when the rating was given.
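Beyond head(), a quick way to explore the ratings is to count users and movies and measure how sparse the data is. The sketch below uses a tiny hypothetical stand-in for ratings.csv (same column names, made-up values) so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv (same column names, made-up values)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()
# Fraction of user-movie pairs that have no rating
sparsity = 1 - len(toy_ratings) / (n_users * n_movies)

print(f"{n_users} users, {n_movies} movies, sparsity = {sparsity:.2f}")
print(toy_ratings["rating"].describe())
```

Running the same three calculations on the real ratings DataFrame gives a feel for how few of the possible user-movie pairs actually carry a rating.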

Preprocess the Data:


Before building the recommendation system, it is necessary to
preprocess the data. This involves handling missing values and
ensuring that the dataset is structured in a way that makes it easier
for the algorithm to process.
1. Pivot the Ratings Data: The first step in data preprocessing is
to pivot the ratings data so that each movie corresponds to a
row and each user corresponds to a column. The values will
represent the ratings.
final_dataset = ratings.pivot(index='movieId', columns='userId', values='rating')
final_dataset.fillna(0, inplace=True)
2. Remove Noise: We should filter out movies that have been
rated by only a few users and users who have rated too few
movies. This can be achieved by setting thresholds:
• A movie must have at least 10 users who rated it.
• A user must have rated at least 50 movies.
# Keep movies rated by at least 10 users
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
final_dataset = final_dataset.loc[no_user_voted[no_user_voted >= 10].index, :]
# Keep users who have rated at least 50 movies
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')
final_dataset = final_dataset.loc[:, no_movies_voted[no_movies_voted >= 50].index]
3. Convert to Sparse Matrix: Since the ratings matrix is sparse
(most entries are zero because most users have not rated most
movies), we convert it into a sparse matrix using SciPy's
csr_matrix function. This saves memory and improves
computational efficiency when performing similarity calculations.
from scipy.sparse import csr_matrix
csr_data = csr_matrix(final_dataset.values)
final_dataset.reset_index(inplace=True)
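The three preprocessing steps can be traced end to end on a toy example. The ratings below are hypothetical, and the thresholds are lowered to 2 (instead of 10 and 50) so that the filtering is visible on six rows; all names prefixed with toy_ or min_ are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings: 3 users, 3 movies
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 5.0, 2.0, 4.5],
})

# Step 1: movies as rows, users as columns, missing ratings as 0
toy_matrix = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)

# Step 2: drop rarely rated movies and low-activity users (toy thresholds)
min_users_per_movie = 2
min_movies_per_user = 2
movie_counts = toy_ratings.groupby("movieId")["rating"].count()
user_counts = toy_ratings.groupby("userId")["rating"].count()
toy_matrix = toy_matrix.loc[
    movie_counts[movie_counts >= min_users_per_movie].index,
    user_counts[user_counts >= min_movies_per_user].index,
]

# Step 3: store only the non-zero entries
toy_csr = csr_matrix(toy_matrix.values)
print(toy_matrix)
print(toy_csr.shape, toy_csr.nnz)
```

Movie 30 (one rating) and user 3 (one rating) fall below the toy thresholds, leaving a 2 x 2 matrix of movies 10 and 20 against users 1 and 2.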

Split Datasets (Training & Test Data):
Typically, in machine learning, we split the dataset into training and
testing sets. However, the nearest-neighbor model used here has no
parameters to fit: it only stores the rating vectors and compares
them. We therefore use the entire dataset to build the model and
judge the recommendations qualitatively, so no explicit train/test
split is required in this case.

Choose & Train a Model:


We will use K-Nearest Neighbors (KNN) with cosine similarity as the
similarity metric to find similar movies. KNN is an efficient method
for finding the most similar items based on user ratings.
from sklearn.neighbors import NearestNeighbors
# KNN model with cosine similarity
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)
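To see what the fitted model returns, the sketch below repeats the same NearestNeighbors setup on a hypothetical 4 x 4 movie-by-user matrix. The names toy and toy_knn are illustrative and the ratings are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user ratings (rows = movies, columns = users, 0 = not rated)
toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],   # movie 0
    [4.0, 5.0, 0.0, 2.0],   # movie 1: rated much like movie 0
    [1.0, 0.0, 5.0, 4.0],   # movie 2
    [0.0, 1.0, 4.0, 5.0],   # movie 3: rated much like movie 2
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_knn.fit(toy)

# Query with movie 0's own row; results come back sorted by increasing distance
distances, indices = toy_knn.kneighbors(toy[0], n_neighbors=2)
print(indices, distances)
```

The first neighbor is the query movie itself (cosine distance of about 0), followed by movie 1. This is why a recommendation function needs to ask for one more neighbor than it wants to show and then drop the first hit.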

Test & Evaluate the Model:


Once the model is trained, we evaluate it by inputting a movie name
and checking how well the model can recommend similar movies.
The evaluation will be based on the relevance of the
recommendations, i.e., whether the top 10 recommended movies
are similar to the input movie.
def get_movie_recommendation(movie_name):
    n_movies_to_recommend = 10
    # Match the title as a literal substring (not a regular expression)
    movie_list = movies[movies['title'].str.contains(movie_name, regex=False)]
    if len(movie_list):
        movie_id = movie_list.iloc[0]['movieId']
        matches = final_dataset[final_dataset['movieId'] == movie_id]
        if matches.empty:
            # The title exists but was removed by the rating-count filter
            return "Not enough ratings for this movie to recommend similar ones."
        movie_idx = matches.index[0]
        # Find the 10 most similar movies (the nearest neighbor is the movie itself)
        distances, indices = knn.kneighbors(csr_data[movie_idx],
                                            n_neighbors=n_movies_to_recommend+1)
        # Sort by distance, drop the query movie, and list from farthest to closest
        rec_movie_indices = sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())),
                                   key=lambda x: x[1])[:0:-1]
        recommend_frame = []
        for val in rec_movie_indices:
            movie_id = final_dataset.iloc[val[0]]['movieId']
            idx = movies[movies['movieId'] == movie_id].index
            recommend_frame.append({'Title': movies.iloc[idx]['title'].values[0], 'Distance': val[1]})
        df = pd.DataFrame(recommend_frame, index=range(1, n_movies_to_recommend+1))
        return df
    else:
        return "No movies found. Please check your input."

get_movie_recommendation('Iron Man')
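The [:0:-1] slice in the function is easy to misread, so here is a small sketch of what it does. The (row index, distance) pairs are hypothetical, and closest_first is shown only as an alternative ordering, not as the project's behavior.

```python
# Hypothetical (row index, cosine distance) pairs; the query movie has distance 0
pairs = [(7, 0.21), (3, 0.0), (1, 0.48), (9, 0.35)]

ranked = sorted(pairs, key=lambda x: x[1])   # closest first, query movie at position 0
farthest_first = ranked[:0:-1]               # walk backwards, stopping before position 0
closest_first = ranked[1:]                   # alternative: skip the query, keep ascending order

print(farthest_first)
print(closest_first)
```

With [:0:-1] the query movie is dropped and the remaining neighbors are listed from farthest to closest, so the best match appears in the last row of the returned table. Using [1:] instead would put the best match first.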

Visualize Result:
Visualizing the data shows how ratings are distributed and helps
justify the filtering thresholds chosen during preprocessing. In this
case, we can use Matplotlib to plot the number of users who rated
each movie against the 10-rating threshold; the number of votes by
each user can be plotted the same way.
import matplotlib.pyplot as plt
# Visualize number of users who voted for each movie
f, ax = plt.subplots(1, 1, figsize=(16, 4))
plt.scatter(no_user_voted.index, no_user_voted, color='mediumseagreen')
plt.axhline(y=10, color='r') # Threshold for minimum 10 users per movie
plt.xlabel('MovieId')
plt.ylabel('No. of users voted')
plt.show()
Improve the Model:
Improvements can be made to the model by experimenting with:
• Matrix Factorization techniques like SVD (Singular Value
Decomposition).
• Hybrid models that combine collaborative filtering with
content-based filtering.
• Optimization of KNN parameters like the number of neighbors
and the distance metric used.
Additionally, feature engineering and tuning can be applied to
improve accuracy.
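As one possible starting point for the matrix factorization idea, the sketch below compresses a hypothetical movie-by-user matrix with scikit-learn's TruncatedSVD and then runs the same cosine nearest-neighbor search on the compressed vectors. This is an assumption about how SVD could be plugged in, not a method used in this project; the matrix, the choice of 2 components, and the variable names are all illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Hypothetical movie-by-user ratings (rows = movies, columns = users, 0 = not rated)
toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
]))

# Reduce each movie's rating vector to 2 latent factors
svd = TruncatedSVD(n_components=2, random_state=0)
movie_factors = svd.fit_transform(toy)       # shape: (n_movies, 2)

# Same neighbor search as before, now on the dense latent vectors
factor_knn = NearestNeighbors(metric="cosine", algorithm="brute")
factor_knn.fit(movie_factors)
distances, indices = factor_knn.kneighbors(movie_factors[[0]], n_neighbors=2)
print(movie_factors.shape, indices)
```

On the real dataset the number of components would need tuning, and the recommendations should be compared against those of the plain KNN model before adopting this approach.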

Conclusion:
This code demonstrates a simple movie recommendation system
using item-based collaborative filtering. By using the MovieLens-
small dataset and applying the KNN algorithm with cosine similarity,
the system finds and recommends similar movies based on the
user's input movie.
Key Takeaways:
• Collaborative filtering can be used for item-based
recommendations by identifying similar movies.
• Data preprocessing such as removing noise and handling
sparsity is crucial to improving the recommendation system's
accuracy and efficiency.
• KNN and cosine similarity are effective methods for finding
similarities in datasets with user ratings.
With this model, users can input a movie title and get
recommendations for similar films, improving the movie-watching
experience by offering personalized suggestions based on past user
behavior.

BY: RUCHITA MAARAN & SHOBANA M (252310022 & 252310027)
