Project III Report
On
Movie Recommendation System Using Machine Learning
BACHELOR OF ENGINEERING
(INFORMATION TECHNOLOGY)
To
Department of Information Technology
University Institute of Engineering and Technology
Panjab University, Chandigarh
4th year
Table of Contents
4. Proposed Methodology
8. Conclusion
9. Bibliography/References
Declaration
We, the undersigned, hereby declare that the project titled "Movie Recommendation System
Using Machine Learning and Deployment on Website" is our original work, undertaken as
part of our Bachelor of Engineering in Information Technology (BE IT) curriculum during
our final year at University Institute of Engineering and Technology, Chandigarh, Panjab
University. The project was completed under the guidance of Dr Rajkumari.
This project involves the design and development of a system that recommends movies to
users based on their preferences and behaviours. The work encompasses data collection,
preprocessing, model building, website development, and deployment.
We affirm that the work presented in this project is the result of our collective effort and has
not been submitted elsewhere for any certification or publication. All sources of data,
references, and external materials used in this work have been appropriately acknowledged.
We take full responsibility for the authenticity of the information and results presented in
this project.
Team Members:
Nirmal Kumar Sharma
Parth Sood
Pranav Bhambri
Institution:
University Institute of Engineering and Technology, Chandigarh, Panjab University.
Acknowledgement
We, the undersigned, would like to express our sincere gratitude to everyone who has supported and guided
us throughout the course of our project titled "Movie Recommendation System Using Machine Learning and
Deployment on Website."
First and foremost, we extend our heartfelt thanks to our supervisor, Dr Rajkumari, for their invaluable
guidance, constructive feedback, and constant encouragement, which played a crucial role in the successful
completion of this project. Their expertise and mentorship have been instrumental in shaping our
understanding and approach.
We are grateful to the Department of Information Technology at University Institute of Engineering and
Technology, Panjab University, Chandigarh for providing us with the necessary resources and a conducive
environment for research and development. We also thank the faculty members and technical staff for their
assistance and insights during various stages of the project.
A special note of appreciation goes to our peers and friends, whose valuable suggestions and moral support
inspired us to push our limits and complete this project successfully.
Finally, we express our deepest gratitude to our families for their unwavering support, patience, and
motivation throughout this endeavour.
This project has been a great learning experience, helping us enhance our technical skills, teamwork, and
problem-solving abilities. We remain thankful to all those who contributed to this journey, directly or
indirectly.
Introduction to the Project
In today’s interconnected world, recommendation systems have become an integral part of our daily lives.
These systems are designed to provide personalized suggestions to users, enhancing their experience by
saving time and effort in decision-making. Popular platforms like Amazon, Flipkart, and Netflix utilize
sophisticated recommendation engines to suggest products, movies, or shows based on user preferences and
behaviours. Even offline businesses, such as retail stores and supermarkets, implement recommendation
strategies through loyalty programs and purchase history to improve customer satisfaction.
The role of recommendation systems extends far beyond entertainment and shopping. They are used in
various domains, including education (course recommendations), healthcare (personalized treatment plans),
and social media (content recommendations). By analysing large volumes of user data, these systems offer
tailored suggestions that are both efficient and relevant.
Types of Recommendation Systems
There are three primary types of recommendation systems, each with its own strengths and limitations:
1. Content-Based Recommendation Systems
Content-based systems analyse the characteristics of items and match them with user preferences.
These systems rely on item attributes, such as genre, keywords, or descriptions, and compare them to
the user's past choices. For example, in a movie recommendation system, if a user enjoys action
films, the system suggests similar action-packed titles.
While content-based systems are effective in generating relevant recommendations for individual
users, they often struggle with the "cold start" problem, where there is insufficient data for new users
or items. Additionally, these systems may lack diversity in suggestions, as they focus solely on
similarities to previously selected items.
2. Collaborative Filtering Recommendation Systems
Collaborative filtering focuses on user behaviour and interactions rather than item attributes. It
identifies patterns among users with similar tastes and makes recommendations based on shared
preferences. For instance, if two users have rated similar movies highly, the system might suggest
movies one user has seen but the other has not.
Collaborative filtering is powerful in discovering new and diverse recommendations. However, it
faces challenges such as data sparsity, where insufficient ratings or interactions limit its
effectiveness. It also encounters issues with scalability in systems with large datasets.
3. Hybrid Recommendation Systems
Hybrid recommendation systems combine the strengths of both content-based and collaborative
filtering approaches. They address the limitations of individual methods, such as the cold start
problem in content-based systems and data sparsity in collaborative filtering. By integrating multiple
techniques, hybrid systems deliver more accurate, diverse, and robust recommendations.
Most modern platforms adopt hybrid systems to enhance user experiences. For example, Netflix
employs a hybrid approach, combining collaborative filtering with content-based analysis to
recommend movies and shows.
YouTube, for example, initially relied on content-based recommendations alone but has since moved to a hybrid approach.
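As a rough illustration of how a hybrid system merges its two inputs, a hybrid score can be a weighted blend of the content-based and collaborative scores for each candidate movie. The sketch below uses hypothetical titles and scores, and the weight alpha is an assumed value, not a tuned one:

```python
# Hypothetical sketch: blending content-based and collaborative scores
# into a single hybrid ranking. Titles and scores are illustrative only.

def hybrid_scores(content, collaborative, alpha=0.6):
    """Weighted blend: alpha weights the content-based score,
    (1 - alpha) the collaborative score. Missing scores default to 0."""
    movies = set(content) | set(collaborative)
    return {
        m: alpha * content.get(m, 0.0) + (1 - alpha) * collaborative.get(m, 0.0)
        for m in movies
    }

content_scores = {"Inception": 0.9, "Interstellar": 0.8, "Up": 0.2}
collab_scores = {"Inception": 0.7, "Up": 0.9, "Frozen": 0.6}

ranked = sorted(hybrid_scores(content_scores, collab_scores).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # the movie with the highest blended score
```

A real system would tune alpha (or learn it) against held-out user interactions rather than fixing it by hand.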
Project Overview
This project focuses on designing and implementing a movie recommendation system using machine
learning. The core engine implemented here is content-based, matching movies by their attributes, while
the proposed methodology also explores collaborative filtering so that the two can later be combined into
a hybrid system for more accurate and personalized suggestions. The system analyzes user preferences,
movie features, and historical interactions to generate recommendations that cater to individual tastes.
The project involves the following phases:
1. Data Collection and Preparation
The first step is gathering a comprehensive movie dataset that includes features such as genre, cast,
director, and user ratings. The dataset will be preprocessed to remove inconsistencies, handle missing
values, and normalize the data for machine learning algorithms.
2. Model Development
The recommendation engine will be built using machine learning techniques. Content-based filtering
will utilize cosine similarity to match movies based on their attributes, while collaborative filtering
will employ matrix factorization techniques like Singular Value Decomposition (SVD) to predict
user ratings. The hybrid approach will combine the outputs of both methods for improved accuracy.
3. Evaluation and Optimization
The system’s performance will be evaluated using metrics such as precision, recall, and Root Mean
Square Error (RMSE). Hyperparameter tuning and cross-validation will be employed to optimize the
model for better recommendations.
4. Website Deployment
The final system will be deployed as a user-friendly web application. Frameworks like Flask or
Django will be used for the backend, while HTML, CSS, and JavaScript will handle the frontend.
The application will allow users to input preferences, browse recommendations, and interact with the
system seamlessly.
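The evaluation metrics named in phase 3 can be computed directly from their definitions. The sketch below uses hypothetical recommendation lists and ratings purely to show the formulas for precision, recall, and RMSE:

```python
# Illustrative computation of precision, recall, and RMSE on toy data.
import math

def precision_recall(recommended, relevant):
    """Precision: fraction of recommendations that were relevant.
    Recall: fraction of relevant items that were recommended."""
    hits = len(set(recommended) & set(relevant))
    return hits / len(recommended), hits / len(relevant)

def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical evaluation: 4 recommended items, 3 truly relevant ones.
prec, rec = precision_recall(["A", "B", "C", "D"], ["B", "D", "E"])
print(prec, rec)  # 0.5 and 2/3

# Hypothetical predicted vs. actual ratings for three movies.
print(rmse([4.0, 3.5, 5.0], [4.5, 3.0, 4.0]))
```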
Significance of the Project
Recommendation systems play a pivotal role in enhancing user satisfaction by offering tailored suggestions.
This project demonstrates the practical application of machine learning in solving real-world problems. By
creating a movie recommendation system, we aim to bridge the gap between user preferences and content
discovery, making entertainment more accessible and enjoyable.
The successful implementation of this project will not only deepen our understanding of machine learning
and web development but also showcase the potential of hybrid recommendation systems to transform user
experiences across various domains.
Components/Modules/Objectives of The Project
Modules
The project is structured around five key modules that ensure the recommendation system is both functional
and scalable. These modules provide a high-level overview of the system's architecture, with each module
contributing to the overall goal of creating a personalized movie recommendation engine.
1. Data Collection
The Data Collection module is the first step in building the movie recommendation system. It involves
gathering relevant movie-related data from multiple sources, ensuring the model has enough input to make
accurate recommendations.
• Source Identification: Identify the most reliable and comprehensive sources for movie data. These
can include open-source datasets, APIs, and movie metadata platforms.
• Data Retrieval: Fetch large datasets (at least 5000 movies) using APIs or by scraping websites like
IMDb, TMDB, and Kaggle.
• Ensuring Diversity and Quality: Data should include various genres, languages, and movie
metadata to ensure that the recommendation system caters to a wide audience.
2. Preprocessing
Data preprocessing is a crucial step in transforming raw data into a clean, usable format for machine
learning algorithms. This module ensures that the data is ready for analysis by removing errors,
inconsistencies, and irrelevant information.
3. Model Building
The Model Building module focuses on the development of the recommendation algorithm itself. This is
where machine learning comes into play to analyze movie features and predict which movies a user is most
likely to enjoy.
• Algorithm Selection: Choose the appropriate recommendation technique. For this project, a
content-based filtering approach is used, leveraging textual features like movie descriptions, genres,
and keywords.
• Similarity Metrics: Implement similarity metrics, such as cosine similarity or Euclidean distance, to
measure how similar different movies are based on their attributes.
• Model Training: Train the model using the preprocessed data to generate recommendations.
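The similarity step above can be sketched in pure Python: each movie's attributes become a bag-of-words count vector, and cosine similarity ranks the other movies against a query title. The titles and tag strings below are invented for illustration, and a real pipeline would use a vectorizer library rather than hand-rolled counts:

```python
# Minimal sketch of content-based matching: bag-of-words counts over
# combined movie "tags" (genres + keywords), compared by cosine similarity.
import math
from collections import Counter

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

movies = {  # hypothetical tag strings per movie
    "Mad Max": "action chase desert action",
    "John Wick": "action revenge assassin",
    "Toy Story": "animation family toys",
}
vectors = {title: Counter(tags.split()) for title, tags in movies.items()}

def recommend(title, k=2):
    """Rank every other movie by similarity to the query title."""
    sims = [(other, cosine_sim(vectors[title], vectors[other]))
            for other in vectors if other != title]
    return [t for t, _ in sorted(sims, key=lambda kv: kv[1], reverse=True)][:k]

print(recommend("Mad Max"))  # most similar movies first
```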
4. Website Development
The Website Development module ensures that the recommendation system is accessible to users through
an intuitive web interface. This module involves both frontend and backend development to create a fully
functional website where users can input their preferences and get recommendations.
• Frontend Design: Design an interactive user interface using HTML, CSS, and JavaScript, allowing
users to search for movies and receive recommendations.
• Backend Development: Develop the backend using a framework like Flask or Django to handle
user requests, process them through the recommendation model, and return relevant results.
• API Integration: Integrate the machine learning model into the web application by creating APIs
that interface between the frontend and backend.
5. Deployment
The Deployment module involves deploying the movie recommendation system on a cloud platform so that
users can access it from anywhere. This module ensures that the system is available for public use and can
scale to accommodate multiple users.
• Cloud Hosting: Host the web application on a platform like Heroku, AWS, or Google Cloud to
ensure reliability and scalability.
• Version Control and Deployment Pipeline: Use tools like Git for version control and automate the
deployment process to ensure smooth updates and bug fixes.
• Post-Deployment Monitoring: Monitor the system's performance to identify and fix any issues that
arise after deployment.
Components
Now that the high-level modules are outlined, let's break down the components within each module in more
detail. These components represent specific tasks that need to be completed to make each module functional.
2. Preprocessing Components
• Data Cleaning:
Raw data may contain missing or inconsistent entries. Preprocessing involves identifying missing
values and handling them through imputation or removal. Any duplicate data will also be removed,
ensuring the dataset is accurate.
• Feature Extraction:
The goal of feature extraction is to transform text data into useful numerical representations. In this
project, we will extract features like movie genres, overviews, and keywords to create a unified
content-based profile for each movie.
• Text Preprocessing:
Text data like movie overviews need to be cleaned and transformed into a format that can be used by
the machine learning model. This involves steps such as tokenization (breaking down text into
words), stop-word removal (removing common words like "the" or "and"), and stemming (reducing
words to their root form).
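The three text-preprocessing steps above can be sketched as follows. The stop-word list and the suffix-stripping stemmer are deliberately tiny stand-ins; a real system would use an established stemmer (such as Porter's) and a full stop-word list:

```python
# Sketch of tokenization, stop-word removal, and stemming on a sample
# sentence. The stop-word list and stemmer are illustrative stand-ins.
import re

STOP_WORDS = {"the", "and", "a", "of", "in", "is"}  # tiny illustrative list

def stem(word):
    """Crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stem(t) for t in tokens]                      # stemming

print(preprocess("The hero is fighting in the burning city"))
```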
4. Website Development Components
• Frontend Design:
The user interface will be developed using HTML, CSS, and JavaScript to create a clean and
interactive design. Users will be able to search for movies, view recommendations, and explore
movie details directly from the web interface.
• Backend Development (Flask or Django):
The backend will be developed using a lightweight framework like Flask (or Django, depending on
the project scope). This will handle HTTP requests, interact with the recommendation model, and
return the recommendations to the frontend.
• API Integration:
APIs will be created to interact with the machine learning model. These APIs will handle requests
such as submitting a movie title and returning similar movies.
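The request/response contract such an API might follow is sketched below. To keep the sketch framework-agnostic and self-contained, the handler is a plain function; in Flask, its body would sit inside a route such as @app.route("/recommend"). The movie lookup table is a hypothetical stand-in for the trained model:

```python
# Framework-agnostic sketch of the recommendation API contract:
# JSON in ({"title": ...}), JSON out (recommendations or an error).
import json

SIMILAR = {  # hypothetical stand-in for the similarity model's output
    "Inception": ["Interstellar", "The Prestige", "Memento"],
}

def recommend_handler(request_body: str) -> str:
    """Accepts a JSON body like {"title": "Inception"} and returns a JSON
    response with similar movies, or an error for unknown titles."""
    payload = json.loads(request_body)
    title = payload.get("title", "")
    if title not in SIMILAR:
        return json.dumps({"error": f"movie '{title}' not found"})
    return json.dumps({"title": title, "recommendations": SIMILAR[title]})

print(recommend_handler('{"title": "Inception"}'))
```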
5. Deployment Components
• Deployment on Heroku:
The web application will be deployed on Heroku, which offers a scalable environment for hosting
web applications. The deployment process involves setting up the necessary files (e.g., Procfile
and requirements.txt), pushing the code to Heroku via Git, and ensuring that the app runs
smoothly on the cloud.
• Post-Deployment Monitoring:
After deployment, the system's performance will be monitored for errors, slow response times, or
any other issues that users may encounter. Tools like Heroku’s built-in monitoring or third-party
services like New Relic can be used for this.
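Assuming a Flask application object named app defined in app.py and gunicorn as the production server (both assumptions, since the actual module names depend on the project), the two files mentioned above typically look like this:

```
# Procfile
web: gunicorn app:app

# requirements.txt (illustrative; pin exact versions in practice)
flask
gunicorn
pandas
scikit-learn
```

Once these files are committed, pushing the repository to Heroku (git push heroku main) triggers the build and starts the web process.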
Objectives
The objective of this project is to build a movie recommendation system that can provide personalized
movie suggestions based on user preferences. The system will be built and deployed as a website, offering
users an easy and engaging way to discover new movies.
Detailed Objectives:
1. Personalized Recommendations:
o Offer users movie recommendations tailored to their preferences based on the content of
movies they have already watched.
o Ensure that recommendations are relevant and accurate by leveraging machine learning
algorithms like content-based filtering.
2. User-Friendly Interface:
o Create an intuitive and aesthetically appealing user interface, making it easy for users to
navigate the system, input movie titles, and view recommendations.
3. Scalable and Accessible Platform:
o Host the recommendation system on a cloud platform like Heroku, ensuring that the system
can handle multiple users and scale as needed.
o Make the system accessible globally, allowing users to interact with it from any device with
internet access.
4. Seamless Integration of Machine Learning:
o Integrate the machine learning model smoothly with the web application, ensuring that movie
recommendations are generated in real-time and delivered promptly to users.
5. Efficient Deployment and Maintenance:
o Ensure that the deployed system runs efficiently and remains up-to-date with periodic
maintenance and bug fixes.
o Monitor performance to ensure a smooth user experience and troubleshoot issues as they
arise.
Data Used
The TMDB 5000 dataset (distributed on Kaggle as tmdb_5000_movies.csv and
tmdb_5000_credits.csv) is a comprehensive collection of movie data structured into two
primary parts, the movie dataset and the credit dataset, which together contain the
information needed for building a content-based movie recommendation system. By
providing key details about 5000 movies, including their financials and associated cast and
crew, this dataset offers the necessary components for creating personalized movie
recommendations based on both movie characteristics and the people involved in their
creation.
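A typical first step is loading both parts and joining them on the shared movie id. The sketch below builds small in-memory frames in place of the real CSV files (which would come from pd.read_csv); the rows are abbreviated samples:

```python
# Sketch of joining the movie and credit parts of the dataset with pandas.
# In the real project, both frames come from pd.read_csv on the CSV files.
import pandas as pd

movies = pd.DataFrame({
    "id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "genres": ["Action Adventure Fantasy", "Adventure Fantasy Action"],
})
credits = pd.DataFrame({
    "movie_id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "cast": ["Sam Worthington, Zoe Saldana", "Johnny Depp, Orlando Bloom"],
})

# Join movie attributes with cast/crew information on the shared id;
# the duplicated title column from credits is suffixed to avoid a clash.
merged = movies.merge(credits, left_on="id", right_on="movie_id",
                      suffixes=("", "_credits"))
print(merged.shape)
```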
1. Movie Dataset
The movie dataset is the first part of the TMDB_5000_Credits.CSV file and includes
essential movie-related attributes. This portion of the dataset is used to understand the
general attributes of the movies themselves, providing a foundation for building a
content-based recommendation system.
Key Attributes in the Movie Dataset:
• Movie ID:
Each movie in the dataset is assigned a unique identifier (ID). This ID serves as the
primary key that links each movie's features to other associated data, such as cast
and crew, budget, collection, and more.
• Title:
This attribute includes the name of each movie, which is crucial for displaying
recommendations to users. It also serves as the basis for user queries when searching
for similar movies.
• Genres:
The genres field lists the types or categories of the movie, such as Action, Drama,
Comedy, etc. This is a key feature for content-based filtering, where movies with
similar genres are often recommended. By leveraging this information, the
recommendation system can identify movies within the same genre that a user may
enjoy.
• Budget:
The budget field provides information about the financial investment in the
production of the movie. This can provide additional insights into the movie’s
production quality, but it is typically not used directly in generating
recommendations. However, for users interested in movies with high or low budgets,
this can be a valuable filtering feature.
• Collection:
The collection attribute refers to the total box office earnings or the collection of a
movie. Like the budget, this data is useful for filtering or sorting movies based on
their financial success. For instance, a user may want to find blockbuster films that
made large profits.
• Overview:
The overview field includes a brief description or synopsis of the movie. This is a
critical feature for content-based recommendations, as it allows the system to match
movies based on similar plot themes, storylines, or other textual content. By
analyzing the movie descriptions, the system can recommend movies with similar
themes.
• Release Date:
The release date indicates when the movie was first released to the public. This
attribute can be used to filter movies by year or decade, allowing users to find films
within specific time periods. Additionally, it can help generate recommendations
based on movie popularity during certain eras.
• Runtime:
The runtime refers to the total duration of the movie in minutes. For users who prefer
short films or longer epics, this attribute can be leveraged to refine the
recommendation process.
• Language:
This attribute shows the primary language in which the movie was produced. For
users who prefer films in a specific language, filtering by language can enhance the
recommendation process, providing tailored results based on language preferences.
• Production Companies and Countries:
These fields provide insight into the entities responsible for the production and
distribution of the movie. While not directly used in content-based
recommendations, these fields can be helpful for providing additional context or
filtering films based on regional preferences.
2. Credit Dataset: Cast and Crew
The credit dataset contains detailed information about the individuals involved in the
making of each movie, focusing on the cast (actors/actresses) and crew (directors,
producers, writers, etc.). This dataset is crucial for creating recommendations based on the
involvement of certain individuals, providing the option to recommend films that feature a
favorite actor, director, or producer.
Key Attributes in the Credit Dataset:
• Cast:
This field contains information about the main actors and actresses who performed in
the movie. Each movie may have a list of cast members, each with their role in the
film. This attribute is essential for making actor-based recommendations. For
instance, if a user enjoys movies starring a particular actor, the recommendation
system can suggest other films featuring that actor.
• Crew:
The crew dataset contains details about the people responsible for the behind-the-scenes
work, such as directors, producers, screenwriters, and other key staff. This
data is helpful for making recommendations based on directors or producers who are
linked to films that a user has enjoyed in the past. For example, if a user has watched
several movies directed by a specific filmmaker, the system can suggest other films
from that director.
• Job Titles in Crew:
The job titles attribute in the crew dataset provides specific roles each crew member
held in the creation of the movie. Roles such as director, producer, writer, and
cinematographer are included. This level of detail allows the system to offer
recommendations based on more granular preferences, such as movies directed by a
specific filmmaker or produced by a particular producer.
• Cast and Crew Relationships:
Understanding the relationships between cast and crew is important because movies
that involve similar people in multiple roles (e.g., the same director and cast) may
have similar themes or production styles. This allows the recommendation system to
consider these connections when suggesting films to the user.
• Cast and Crew Popularity:
Many datasets also include a measure of popularity for both cast and crew members,
which can further refine recommendations. A user may be more interested in movies
featuring famous or popular actors, directors, or producers. By including these
popularity metrics, the system can give higher priority to movies with well-known
personnel, providing more relevant recommendations.
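In the Kaggle distribution, the cast and crew columns are stored as JSON-like strings, so a parsing step usually precedes any feature extraction. The sketch below shows that step on a single abbreviated row (the field values are illustrative):

```python
# Parsing the stringified cast/crew columns of the credit dataset.
# ast.literal_eval safely evaluates the list-of-dicts string literals.
import ast

crew_raw = '[{"name": "James Cameron", "job": "Director"}, {"name": "Jon Landau", "job": "Producer"}]'
cast_raw = '[{"name": "Sam Worthington", "character": "Jake Sully"}]'

crew = ast.literal_eval(crew_raw)
cast = ast.literal_eval(cast_raw)

# Typical feature extraction: the director's name and the top-billed cast.
director = next((m["name"] for m in crew if m["job"] == "Director"), None)
top_cast = [m["name"] for m in cast[:3]]

print(director, top_cast)
```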
The dataset containing movie details such as genres, budget, and production companies was loaded
into a dataframe, which we named movies.
To build an effective content-based movie recommendation system, selecting and utilizing appropriate
datasets is a crucial step. This project relied on several well-known data sources that offer extensive and
reliable information about movies, their attributes, and user interactions. Below is an expanded discussion of
the key data sources used, their significance, and how they contribute to the system's functionality.
1. Kaggle
Kaggle is a leading platform for data science and machine learning that provides access to a wide variety of
datasets, including movie-related data. Datasets on Kaggle are often well-structured and accompanied by
comprehensive descriptions, making them highly suitable for machine learning projects.
• Example Datasets:
o The Movie Dataset: This dataset contains movie metadata such as genres, cast, crew, and
keywords, which are essential for content-based filtering.
o IMDb 5000 Movie Dataset: This dataset includes movie titles, release years, ratings, and
revenue figures.
• Significance in the Project:
Kaggle's datasets offer detailed metadata that serves as the backbone of the recommendation system.
Features like genres and cast enable the system to identify similarities between movies, which is
critical for generating tailored recommendations.
• Advantages:
o Regular updates and community contributions.
o Clean and pre-processed data, saving significant time.
o Availability of kernels and notebooks that provide implementation ideas and insights.
• Website: https://www.kaggle.com
2. TMDB (The Movie Database)
TMDB is a community-driven platform that provides an extensive database of movies, TV shows, and
celebrities. Known for its rich and detailed metadata, TMDB is a popular choice for building
recommendation systems.
• Data Features:
o Metadata: Includes genres, production companies, release dates, languages, and overviews.
o Ratings and Popularity Scores: Aggregated user ratings and popularity indices that can
complement content-based features.
o Keywords and Tags: Useful for natural language processing to enhance movie comparisons.
• Integration into the System:
TMDB data enhances the recommendation system by providing detailed descriptions and additional
features like keywords. For instance, plot summaries can be analyzed using TF-IDF and other
text-processing techniques to identify similar movies.
• Advantages:
o High-quality data from a verified community.
o Open API for seamless integration into machine learning workflows.
o Rich metadata that enables more nuanced recommendations.
• Website: https://www.themoviedb.org
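The TF-IDF idea mentioned above can be sketched compactly in pure Python: terms that appear in every overview receive zero weight, rare terms receive high weight, and movies are compared by cosine similarity of the weighted vectors. The overviews below are invented for illustration; a real pipeline would use a library vectorizer:

```python
# Compact pure-Python TF-IDF over hypothetical movie overviews,
# compared with cosine similarity.
import math
from collections import Counter

overviews = {
    "Movie A": "a heist crew plans one last heist",
    "Movie B": "a crew of thieves attempts a daring heist",
    "Movie C": "a family reunites for the holidays",
}
docs = {t: o.split() for t, o in overviews.items()}
n_docs = len(docs)

def idf(term):
    df = sum(1 for words in docs.values() if term in words)
    return math.log(n_docs / df)  # rarer terms get higher weight

def tfidf(words):
    counts = Counter(words)
    return {t: (c / len(words)) * idf(t) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = {t: tfidf(w) for t, w in docs.items()}
sim_ab = cosine(vecs["Movie A"], vecs["Movie B"])
sim_ac = cosine(vecs["Movie A"], vecs["Movie C"])
print(sim_ab > sim_ac)  # the two heist movies should be closer
```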
3. IMDb (Internet Movie Database)
IMDb is one of the most widely used online platforms for movies, TV shows, and celebrity information. Its
extensive dataset includes user ratings, reviews, and detailed metadata, making it invaluable for movie
recommendation systems.
• Data Features:
o Ratings: Aggregated user ratings provide insights into a movie's popularity and quality.
o Cast and Crew: Includes director, actors, and production teams, which are significant factors
for recommendation.
o Genres and Keywords: Allow for effective categorization and comparison of movies.
• Relevance in the Project:
IMDb’s datasets play a key role in providing user-centric information, which can complement
metadata-based recommendations. By incorporating user ratings, the system can prioritize movies
that align with a user's preference for quality or popularity.
• Advantages:
o Comprehensive database with a global audience.
o Regular updates and accurate data.
o API availability for direct data retrieval.
• Website: https://www.imdb.com
4. UCI Machine Learning Repository
The UCI Machine Learning Repository is a trusted source for machine learning datasets, offering a variety
of real-world data for research and development.
• Example Dataset:
o Movie Data for Recommendation Systems: This dataset includes metadata such as genres,
ratings, and user preferences, which are essential for building content-based systems.
• Significance in the Project:
UCI datasets are particularly useful for benchmarking algorithms and testing system performance.
They often provide standardized formats that are easy to integrate into machine learning workflows.
• Advantages:
o Free access to high-quality datasets.
o Detailed documentation accompanying the datasets.
o Community and academic recognition for reliability.
• Website: https://archive.ics.uci.edu/ml/index.php
Technology Stack for Content-Based Movie
Recommendation System
In the development of a content-based movie recommendation system, a well-defined technology stack is
essential for efficient data processing, modeling, deployment, and user interaction. The stack typically
includes various tools and libraries that cater to specific stages of the development process, such as data
collection, preprocessing, modeling, and deployment. Below is an expanded discussion of the core
technologies used in the development of the movie recommendation system:
1. Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text. It is particularly popular in the data
science and machine learning community for its interactivity, ease of use, and compatibility with Python.
Jupyter Notebook is often employed for:
• Exploratory Data Analysis (EDA): Jupyter provides a dynamic environment where developers can
experiment with different data processing and visualization techniques. The interactive interface
allows you to run code in smaller chunks, making it easier to debug and understand the flow of the
program.
• Data Preprocessing: Data preprocessing, including cleaning, transforming, and normalizing data,
can be carried out in an organized manner within Jupyter Notebooks. With real-time visualization
and code execution, developers can immediately observe the results of their data processing steps.
• Model Training and Evaluation: You can build and evaluate machine learning models in Jupyter
Notebooks. The notebook environment allows you to iteratively train different models, tweak
parameters, and visualize results without the need for complex setup.
• Documentation and Communication: Jupyter allows developers to combine code with explanatory
text, making it easy to document the entire process. This is helpful for sharing the code with
teammates or for presenting findings to stakeholders in a visually appealing manner.
Jupyter Notebooks are particularly beneficial during the development and experimentation phase of building
the recommendation system, as they streamline the process of trying different approaches and visualizing
results.
2. PyCharm
PyCharm is a powerful Integrated Development Environment (IDE) for Python development. Developed
by JetBrains, PyCharm is designed to assist Python developers with writing, testing, and debugging code. It
is highly favored by developers for its robustness and suite of features that facilitate efficient software
development. Some of the reasons PyCharm is widely used in the development of a recommendation system
include:
• Code Assistance: PyCharm provides intelligent code completion, syntax highlighting, and
suggestions, which speed up coding and help avoid common programming errors. For large-scale
projects, such as developing a movie recommendation system, these features improve productivity.
• Project Management: The IDE allows developers to organize their projects effectively by
structuring files and directories. It provides an efficient way to manage dependencies, virtual
environments, and version control systems (such as Git), ensuring smooth collaboration.
• Integrated Debugging: Debugging tools in PyCharm allow for step-by-step execution of the code,
identifying bugs, and understanding how the recommendation model is functioning. This is essential
for detecting issues during the model-building phase.
• Seamless Integration with Libraries: PyCharm integrates smoothly with libraries commonly used
in data science and machine learning, including Pandas, NumPy, Scikit-learn, and more. This
allows developers to easily build, test, and deploy machine learning models directly from within the
IDE.
• Testing and Optimization: PyCharm supports unit testing and profiling, which helps ensure the
quality of the recommendation system code and optimize performance. Automated testing is
particularly useful in verifying that the recommendation algorithm produces consistent and accurate
results.
PyCharm is ideal for developing the recommendation system as it offers a professional environment for
managing larger projects and ensuring the robustness of the code.
3. Heroku
Heroku is a cloud platform that enables developers to build, run, and operate applications entirely in the
cloud. It abstracts away much of the complexity of managing servers, making it a go-to platform for
deploying applications. Heroku offers several advantages in the context of deploying a movie
recommendation system:
• Easy Deployment: With Heroku, developers can deploy Python-based applications (such as the
recommendation system) directly from a Git repository. The platform supports a variety of
programming languages and frameworks, and it integrates easily with Python web frameworks like
Flask or Django.
• Scalability: Heroku is designed to scale applications with ease. As the recommendation system gains
more users, Heroku allows for easy scaling of resources (e.g., adding more processing power or
database storage) to accommodate the increased load.
• Add-ons and Integrations: Heroku offers numerous add-ons for data storage, monitoring, and
analytics. For instance, using PostgreSQL as the database, you can store user preferences and movie
data efficiently. It also supports Redis for caching movie recommendations to speed up user queries.
• Fast Prototyping: Heroku allows for rapid prototyping and iteration. You can quickly deploy
different versions of the recommendation system, making it easier to test various improvements and
experiment with new features.
• Custom Domain and SSL: For production deployment, Heroku allows you to attach custom
domains to your applications, along with SSL certificates for secure communication. This ensures
that your movie recommendation system is both professional and secure.
For creating and hosting a movie recommendation system, Heroku provides a cost-effective and simple
solution for deploying web applications in the cloud.
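As an illustrative sketch (file contents here are assumptions, not taken from the project), a Flask-based version of the system needs only a `Procfile` telling Heroku how to start the web process:

```
web: gunicorn app:app
```

With a `requirements.txt` listing flask, gunicorn, pandas, and scikit-learn alongside it, pushing the Git repository (`git push heroku main`) triggers the build and launches the dyno.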
4. Google Colab
Google Colab is a cloud-based Jupyter Notebook environment that allows you to write and execute Python
code in a collaborative setting. It is a powerful tool for machine learning and data analysis, offering several
key benefits for building a recommendation system:
• Free GPU/TPU Support: Google Colab provides free access to powerful hardware accelerators,
such as GPU (Graphics Processing Unit) and TPU (Tensor Processing Unit), which are beneficial
for training deep learning models that may be part of a more advanced recommendation system.
• Collaboration: Since Colab is cloud-based, multiple developers can collaborate on the same
notebook in real-time, which is particularly useful for team-based projects. It’s easy to share
notebooks with other developers or stakeholders for feedback or improvements.
• Preinstalled Libraries: Google Colab comes with many popular data science libraries preinstalled,
such as Pandas, NumPy, Matplotlib, and Scikit-learn, making it quick to start the project without
worrying about setting up the environment.
• Integration with Google Drive: Colab is integrated with Google Drive, allowing you to store
datasets, models, and other project-related files. This seamless integration makes it easy to manage
large datasets and save the output of machine learning models.
• Environment Flexibility: You can install specific package versions in Google Colab to ensure
compatibility with the libraries your project needs. Additionally, notebooks have internet access,
so they can fetch data or interact with external APIs.
Google Colab is an excellent choice for initial experimentation, especially for users who don't have access to
powerful local hardware, and it's useful for collaborative machine learning projects.
The Python ecosystem provides several powerful libraries for data science and machine learning that are
integral to building a content-based movie recommendation system. Key libraries include:
• Pandas:
Pandas is a highly efficient library for data manipulation and analysis. It provides powerful data
structures like DataFrames that make it easy to clean, filter, and analyze datasets. In a
recommendation system, Pandas is used to load, preprocess, and manipulate large movie datasets
(such as tmdb_5000_credits.csv) before feeding them into machine learning models.
• NumPy:
NumPy is essential for performing numerical operations in Python. It provides support for large,
multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate
on these arrays. In a movie recommendation system, NumPy is often used for handling large-scale
data computations and transforming data into arrays or matrices for machine learning algorithms.
• Scikit-learn:
Scikit-learn is a robust library for building machine learning models. It offers a variety of algorithms
for supervised and unsupervised learning, including regression, classification, clustering, and
dimensionality reduction techniques. For a content-based recommendation system, Scikit-learn
provides useful utilities like TF-IDF (Term Frequency-Inverse Document Frequency) for text-
based features, cosine similarity for comparing movie features, and various vectorization
techniques.
• Matplotlib & Seaborn:
For visualizing the results of data analysis and model performance, libraries like Matplotlib and
Seaborn are commonly used. They help create graphs and charts to interpret the data, which can be
particularly useful when visualizing the popularity of movies, distribution of genres, or the
relationships between movie features.
• TensorFlow & Keras:
For more advanced recommendation systems that involve deep learning techniques, TensorFlow and
Keras are often employed. These libraries support the creation of neural networks and deep learning
models, allowing for more complex recommendation algorithms that may go beyond traditional
content-based filtering.
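The libraries above can be combined in a few lines. A minimal sketch, using invented placeholder overviews rather than the real TMDB data:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the real TMDB overviews
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": [
        "space battle action adventure",
        "space exploration adventure",
        "romantic comedy in paris",
    ],
})

# TF-IDF turns each overview into a weighted term vector
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(movies["overview"])

# Cosine similarity compares every pair of movies
sim = cosine_similarity(matrix)
print(sim.shape)              # one row of similarities per movie
print(sim[0][1] > sim[0][2])  # A is closer to B (shared words) than to C
```

Pandas holds and cleans the data, scikit-learn vectorizes and compares it; NumPy works underneath both.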
Project demonstration/screenshots/results achieved
Preprocessing
The integration of the movie dataset and the credit dataset is essential for building a comprehensive movie
recommendation system. By merging the data based on the movie ID, the system can utilize both movie-specific
information (such as genre, budget, and overview) and personnel-specific details (such as cast and crew involvement)
to generate recommendations.
We keep only the parameters that are decisive for the recommendation system:
1) movie_id
2) title
3) overview
4) genres
5) keywords
6) cast
7) crew
For the cast and crew columns, we join each person's first and last names into a single token (for example, "Sam Worthington" becomes "SamWorthington") so that actors who share a first name are not confused with one another.
Next, we add a new column, tag, formed by concatenating overview, genres, keywords, cast, and crew.
Finally, we build a new dataframe that contains only movie_id, title, and tag.
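The merging and tag-building steps can be sketched as follows; the two dataframes below are toy stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv, and parsing of the original JSON-like columns is omitted:

```python
import pandas as pd

# Toy stand-ins for the two TMDB CSV files
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Avatar", "Spectre"],
    "overview": ["a marine on an alien moon", "a cryptic message from the past"],
    "genres": [["Action", "SciFi"], ["Action", "Thriller"]],
    "keywords": [["space"], ["spy"]],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "cast": [["Sam Worthington"], ["Daniel Craig"]],
    "crew": [["James Cameron"], ["Sam Mendes"]],
})

# 1. Merge on movie_id so movie info and personnel info live in one frame
df = movies.merge(credits, on="movie_id")

# 2. Collapse multi-word names into single tokens (SamWorthington, JamesCameron)
for col in ("cast", "crew"):
    df[col] = df[col].apply(lambda names: [n.replace(" ", "") for n in names])

# 3. Build the tag column from overview + genres + keywords + cast + crew
df["tag"] = (
    df["overview"].str.split()
    + df["genres"] + df["keywords"] + df["cast"] + df["crew"]
).apply(lambda words: " ".join(words).lower())

# 4. Keep only movie_id, title, and tag
final = df[["movie_id", "title", "tag"]]
print(final.loc[0, "tag"])
```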
Vectorization is a fundamental process in building content-based movie recommendation systems, where textual
data needs to be transformed into a numerical format that can be processed by machine learning algorithms. In the
context of movie recommendation, we need to represent the textual information—such as movie tags, descriptions,
or genres—in a form that captures meaningful patterns and relationships. The process involves several important
steps, including corpus creation, applying the Bag of Words model, stop-word removal, and handling word variations
like stemming. Below is a detailed breakdown of these steps:
To begin the vectorization process, we need to transform the raw text data into numerical vectors. In the case of
movie recommendation systems, each movie can be described by its set of tags (e.g., "action," "comedy," "drama"),
and the goal is to represent these tags in vector format.
a. Corpus Creation
The first step in vectorization is to create a corpus, which is essentially a collection of all the words that appear across
all movie tags. Since our dataset contains 5000 movies, each with a set of tags, we aggregate all these tags into one
large body of text. This corpus becomes the vocabulary from which we can extract meaningful features.
The Bag of Words (BoW) model is one of the most commonly used techniques for vectorizing text data. It involves
creating a "bag" or collection of words from the text data, where the frequency of each word is recorded while
ignoring the order of the words. The BoW model treats the text as a set of independent words and represents each
document (in this case, movie tags) by the frequency of occurrence of words.
• Step 1: Tokenization
The first step in applying the BoW model is tokenization. This process involves breaking down the tags
associated with each movie into individual words or tokens.
• Step 2: Stop-Word Removal
Stop words are commonly used words such as "and," "the," "of," "to," and "from" that do not carry
significant meaning in the context of a recommendation system. Including them in the vectorization
process adds noise and unnecessary dimensions to the resulting vectors, so it is common practice to
remove them from the dataset.
By excluding these words, the resulting vectors become more meaningful, with a focus on the words that truly
describe the movie’s content. Libraries like NLTK (Natural Language Toolkit) offer built-in stop word lists, which can be
used to filter out these common but unimportant words.
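A dependency-light sketch of stop-word filtering, using scikit-learn's built-in English list (NLTK's list works the same way):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tag = "the marine and the alien fight for the fate of a moon"

# Keep only the words that carry meaning about the movie's content
filtered = [w for w in tag.split() if w not in ENGLISH_STOP_WORDS]
print(filtered)  # ['marine', 'alien', 'fight', 'fate', 'moon']
```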
3. 5000-Dimensional Feature Vector for Each Movie
At this point, we have a vocabulary of 5000 most frequent words across all the movie tags. Each movie will now be
represented as a vector in a 5000-dimensional space. Each dimension corresponds to one of these frequent words,
and the value at each position reflects the frequency of that word in the movie’s tags.
Each movie's vector will be a 5000-dimensional vector, where the value in each dimension represents the occurrence
count of the corresponding word from the vocabulary. For example, if the word "action" appears 3 times in a movie's
tag, its corresponding position in the vector will have the value 3. Words that do not appear in a movie’s tags will
have a value of 0 for that dimension.
This 5000-dimensional vector allows us to represent the content of each movie in a numerical form that can be easily
compared with other movies.
While the Bag of Words model is useful, it does not address issues like synonyms (e.g., “action” and “adventure”) or
different forms of a word (e.g., “acting” and “actor”). To improve the quality of the feature vectors, we can use
techniques like stemming or lemmatization to reduce words to their base or root form.
a. Stemming
Stemming is the process of reducing words to their root form by stripping off prefixes and suffixes. For example, the
words "acting," "actor," and "action" can all be reduced to the root word "act." This allows us to consolidate similar
words under a single feature, reducing the dimensionality of the vector and improving the effectiveness of the
recommendation system.
b. Lemmatization
Lemmatization is a more advanced technique than stemming, which reduces words to their base form by considering
their meaning and context. For instance, the word “better” may be reduced to “good” through lemmatization.
Although lemmatization is more computationally expensive than stemming, it provides more accurate results,
especially in systems where precise meaning matters.
Both stemming and lemmatization help ensure that variations of the same word are treated as one feature,
improving the quality of the recommendation system.
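Stemming can be sketched with NLTK's PorterStemmer (assuming NLTK is installed; note that the actual Porter algorithm conflates some, but not all, of the word variants mentioned above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["acting", "dancing", "loved", "running"]

# Each word is stripped to its (sometimes non-dictionary) root form
print([stemmer.stem(w) for w in words])  # ['act', 'danc', 'love', 'run']
```

In the pipeline, this stemmer is typically applied to every word in the tag column before vectorization.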
The CountVectorizer from the scikit-learn library is a popular tool for transforming a collection of text documents into
a matrix of token counts. It allows us to easily tokenize the text, remove stop words, and create a numerical
representation of the data.
Once the text is processed using CountVectorizer, it produces a matrix where each row corresponds to a movie, and
each column corresponds to one of the selected words from the vocabulary. The values in the matrix represent the
frequency of the corresponding word in the movie’s tags.
Once the movies are represented as 5000-dimensional vectors, we can use various techniques to measure the
similarity between movies. One common approach is cosine similarity, which computes the cosine of the angle
between two vectors. This measure allows us to quantify how similar two movies are based on their tag vectors.
However, working with 5000-dimensional vectors can be computationally expensive. To address this, we can use
dimensionality reduction techniques, such as Principal Component Analysis (PCA), to reduce the number of
dimensions while preserving the most important features. This helps improve both the efficiency and the accuracy of
the recommendation system.
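Both steps can be sketched together; TruncatedSVD stands in for PCA here because it operates directly on sparse count matrices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

tags = [
    "action space marine alien",
    "action space adventure",
    "comedy romance paris",
]
vectors = CountVectorizer().fit_transform(tags)

# Pairwise cosine similarity: similarity[i][j] is the score between movies i and j
similarity = cosine_similarity(vectors)
print(similarity.shape)  # (3, 3)

# Optional: reduce the high-dimensional vectors to fewer components before comparing
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(vectors)
print(reduced.shape)     # (3, 2)
```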
7. Conclusion
The vectorization process is crucial for building a content-based movie recommendation system, as it transforms
textual data into a numerical format that can be used by machine learning algorithms. By applying the Bag of Words
model, removing stop words, and employing techniques like stemming and lemmatization, we can create meaningful
vectors that accurately represent each movie's content. These vectors allow us to compare movies based on their
tags and recommend movies that are most similar to the user’s preferences, ultimately providing tailored movie
recommendations.
Now we calculate the distance between movie vectors. We use cosine distance rather than Euclidean
distance, because Euclidean distance becomes unreliable in high-dimensional spaces. Using
cosine_similarity, we compute the distance of every movie from every other movie.
Here, similarity is an array of arrays holding the distance of each movie from all other movies. For
example, similarity[1] is the array of distances from the 2nd movie to every other movie.
Function for movie recommendation
The function receives a movie name, looks up the movie's index in the array, and then uses that index
to retrieve the most similar movies from the similarity matrix.
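A sketch of such a function, with a toy dataframe and similarity matrix standing in for the real 5000-movie versions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

new_df = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["Avatar", "Star Wars", "Notting Hill", "Alien"],
    "tag": [
        "space alien marine action",
        "space action adventure jedi",
        "comedy romance london",
        "space alien horror",
    ],
})
vectors = CountVectorizer().fit_transform(new_df["tag"])
similarity = cosine_similarity(vectors)

def recommend(movie, top_n=2):
    """Return the top_n titles most similar to the given movie."""
    # Look up the movie's row index in the dataframe
    idx = new_df[new_df["title"] == movie].index[0]
    # Sort all movies by similarity to it, skipping the movie itself
    ranked = sorted(enumerate(similarity[idx]), key=lambda p: p[1], reverse=True)
    return [new_df.iloc[i]["title"] for i, score in ranked[1:top_n + 1]]

print(recommend("Avatar"))
```

The real function is the same shape, only with the full dataframe and the precomputed 5000×5000 similarity matrix.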
Output 1 and Output 2: sample recommendation results (screenshots).
Making of the website
We use PyCharm as the development environment and Streamlit to build and serve the web interface.
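A minimal Streamlit front end might look like the sketch below; the file names, the pickled artifacts, and the recommend helper are assumptions based on the pipeline described above, not verbatim project code:

```python
# app.py — run with: streamlit run app.py
import pickle
import streamlit as st

# Artifacts saved from the notebook stage (names are illustrative)
movies = pickle.load(open("movies.pkl", "rb"))          # dataframe: movie_id, title, tag
similarity = pickle.load(open("similarity.pkl", "rb"))  # precomputed cosine matrix

def recommend(movie, top_n=5):
    idx = movies[movies["title"] == movie].index[0]
    ranked = sorted(enumerate(similarity[idx]), key=lambda p: p[1], reverse=True)
    return [movies.iloc[i]["title"] for i, _ in ranked[1:top_n + 1]]

st.title("Movie Recommendation System")
choice = st.selectbox("Pick a movie you like:", movies["title"].values)
if st.button("Recommend"):
    for title in recommend(choice):
        st.write(title)
```

Streamlit turns this script into an interactive web page, which is what makes it a convenient choice for deploying the model without writing separate front-end code.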
In this project, we successfully implemented a content-based movie recommendation system that provides tailored
movie recommendations based on a user's past interactions and preferences. This approach primarily relies on
analyzing the characteristics of the movies that a user has already watched or rated highly. These characteristics can
include genres, actors, directors, and even plot summaries. By leveraging these features, the system identifies
similarities between watched movies and other movies in the dataset to suggest relevant recommendations that
align with the user's taste.
The foundation of this system lies in the content-based filtering approach, which is particularly useful when
historical data about other users is unavailable or sparse. Unlike collaborative filtering, which depends on user-user
or item-item interactions, content-based systems utilize metadata about items to generate personalized
recommendations. This methodology ensures that users with unique or niche preferences can still receive highly
relevant recommendations, as the system focuses solely on their specific input rather than relying on the broader
community's behavior.
To build this recommendation system, we utilized machine learning techniques and natural language processing
tools. Libraries such as scikit-learn were instrumental in implementing vectorization methods like TF-IDF (Term
Frequency-Inverse Document Frequency) and cosine similarity to quantify the relationship between movies.
Additionally, datasets from platforms like Kaggle and the UCI Machine Learning Repository provided a wealth of
information, including movie descriptions, genres, and ratings, which served as input features for the model.
The advantage of this approach is its ability to explain recommendations clearly. For instance, if a user enjoyed a
movie due to its genre or director, the system can identify and recommend similar movies based on these attributes.
This transparency helps build trust in the recommendation process, as users can easily see why a particular movie
was suggested to them.
However, content-based systems have their limitations. One significant drawback is the "cold start problem" for new
users who have not interacted with the system enough to establish their preferences. Similarly, new movies with
insufficient metadata might not get recommended, as the system lacks sufficient information to compare them with
the user's profile. Furthermore, content-based methods can sometimes lead to a lack of diversity in
recommendations, as the system tends to suggest movies that are very similar to the ones the user has already
watched, potentially overlooking other genres or themes they might enjoy.
Despite these challenges, our system addresses some of these issues by incorporating user feedback mechanisms.
For instance, users can rate the recommendations they receive, which helps refine the system's understanding of
their preferences. Additionally, integrating features such as user-defined filters (e.g., excluding certain genres)
enhances the system's flexibility and usability.
In conclusion, the content-based movie recommendation system developed in this project demonstrates the practical
application of machine learning in personalization. By tailoring suggestions based on a user's preferences, the system
offers a unique and engaging way to discover new content. With further enhancements, such as combining content-
based filtering with collaborative methods (hybrid systems), the system could overcome its inherent limitations and
provide even more accurate and diverse recommendations. Ultimately, such systems have immense potential to
enhance user experiences in the entertainment industry, making them a valuable tool for platforms that aim to keep
their users engaged and satisfied.
Bibliography
1) Kaggle.
Kaggle is a prominent online platform for data science and machine learning practitioners. It provides publicly
available datasets, including movie datasets such as MovieLens, for developing and testing recommendation systems.
It also hosts competitions and kernels with pre-built solutions, offering insights into content-based and collaborative
filtering methods.
Website: https://www.kaggle.com
2) ChatGPT (OpenAI).
ChatGPT, developed by OpenAI, is a conversational AI model that assists in learning concepts, coding, and debugging
in machine learning. It is a useful tool for understanding machine learning algorithms, explaining documentation, and
building prototypes for recommendation systems.
Website: https://openai.com/chatgpt
3) Library documentation:
• scikit-learn: https://scikit-learn.org/stable/documentation.html
• TensorFlow: https://www.tensorflow.org/
• PyTorch: https://pytorch.org/