
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

ASSIGNMENT-1 REPORT
ON

“YOUTUBE COMMENT SENTIMENT ANALYSIS USING PYSPARK”

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted by
Akarsh Kiran Gowda
SRN: R23EF017
Under the Guidance of
Dr. A Ajil
Assistant Professor
School of Computer Science and Engineering

2025-2026
REVA UNIVERSITY
Rukmini Knowledge Park, Kattigenahalli, Yelahanka, Bengaluru-560064
[Link]
Table of Contents
Abstract ......................................................................................................................................................... 2
Introduction .................................................................................................................................................. 3
Problem Statement ....................................................................................................................................... 4
Project Overview........................................................................................................................................... 5
Implementation ............................................................................................................................................ 6
Output and Results ....................................................................................................................................... 9
Conclusion ................................................................................................................................................... 13
Future Work ................................................................................................................................................ 14
References .................................................................................................................................................. 15

List of Figures
Figure 1: YouTube Comment Sentiment Distribution Bar Graph .................................................................. 9
Figure 2: Sentiment Distribution for Top 5 Videos ...................................................................... 10
Figure 3: Top Words in Positive Comments ................................................................................. 11
Figure 4: Top Words in Negative Comments ............................................................................... 11
Figure 5: Top Words in Neutral Comments ................................................................................. 12
Figure 6: Confusion Matrix Plot ................................................................................................... 12

Abstract

In the digital age, vast quantities of unstructured textual data are generated daily on platforms
such as YouTube. Extracting insights from this data is vital to understanding public opinion and
behavioral trends. This project focuses on sentiment analysis of YouTube comments using Apache
PySpark, leveraging its distributed computing capability for handling large datasets efficiently.
The system processes raw comment data, performs text cleaning and preprocessing, and applies
machine learning techniques to classify sentiments as positive, negative, or neutral. Additionally,
visualization tools such as word clouds and bar charts are used to represent trends and emotional
distribution in the dataset. This implementation demonstrates the scalability, accuracy, and
efficiency of PySpark in large-scale text analytics.

Keywords: Sentiment Analysis, PySpark, YouTube Comments, Machine Learning, NLP

Introduction

With the exponential growth of social media and video-sharing platforms like YouTube, user-
generated comments have become a major source of public opinion and feedback. Each video
uploaded to YouTube attracts thousands of comments from users expressing their thoughts,
appreciation, criticism, or suggestions. Analysing these comments provides significant insights
into audience engagement, content quality, and overall sentiment toward creators or topics.
However, due to the massive and unstructured nature of this data, manual evaluation is impractical,
and traditional single-machine processing tools often fall short in terms of scalability and
performance.

To overcome these challenges, this project employs Apache PySpark, the Python API for the
Apache Spark distributed computing framework. PySpark allows efficient handling of large-scale
datasets through parallel processing across multiple cores or machines. Its built-in libraries, such
as MLlib for machine learning and the SQL/DataFrame APIs for data manipulation, make it an
ideal tool for big data analytics and natural language processing (NLP) applications.

In this project, PySpark is used to perform end-to-end sentiment analysis on YouTube comments.
The process begins with reading and cleaning the dataset, followed by text preprocessing tasks
such as tokenization, stop word removal, and normalization. Next, relevant features are extracted
using vectorization techniques like TF-IDF, which convert textual information into numerical
format suitable for machine learning models. PySpark’s MLlib is then utilized to train and test a
classification model capable of predicting whether a comment is positive, negative, or neutral.
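The TF-IDF idea mentioned above can be illustrated in a few lines of plain Python. This is a conceptual sketch only, not MLlib's implementation; the smoothed IDF formula log((n+1)/(df+1)) used here matches Spark's default, but the function name and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights per document; docs are lists of tokens."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by smoothed inverse document frequency.
        weights.append({t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                        for t in tf})
    return weights
```

A term appearing in every document (like "video" across all comments about a video) receives weight zero, while rarer, more discriminative terms are weighted up, which is exactly why TF-IDF features help the classifier.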

The model’s output is evaluated using standard performance metrics like accuracy and confusion
matrix. In addition, graphical visualizations, including bar charts and word clouds, are generated
to represent the distribution of sentiments and frequently used words. These visual outputs enhance
the interpretability of the results and provide an intuitive understanding of public opinion trends.

By combining big data processing power with natural language understanding, this project
demonstrates how PySpark can transform raw, unstructured user comments into meaningful
insights. It not only highlights the potential of scalable sentiment analysis systems but also
provides a foundation for future developments in real-time opinion monitoring and social media
analytics.

Problem Statement

The exponential growth of user-generated content on platforms like YouTube has resulted in
millions of comments being added every day. Manually analysing these comments to determine
the general sentiment is neither feasible nor efficient. Traditional Python libraries, such as pandas
or scikit-learn, are limited in their capacity to process very large datasets. Therefore, a scalable
and efficient solution is required.

The problem addressed in this project is automating sentiment classification of YouTube
comments by leveraging PySpark’s distributed processing capabilities. The goal is to build a
pipeline that:

1. Cleans and preprocesses text data efficiently.

2. Applies a machine learning-based classification model for sentiment prediction.

3. Visualizes sentiment trends and frequent words for interpretation.

This helps content creators, marketers, and analysts understand user opinions at scale and make
informed data-driven decisions.

Project Overview

i. Objectives

• To design a scalable sentiment analysis system using PySpark capable of handling large
volumes of YouTube comments.

• To preprocess and clean unstructured text data for machine learning applications.

• To train and evaluate a sentiment classification model with measurable accuracy.

• To visualize sentiment trends using bar graphs and word clouds for better
interpretability.

ii. Goals

• Implement data ingestion and cleaning using PySpark DataFrames.

• Apply Natural Language Processing (NLP) steps such as tokenization, stop word
removal, and text normalization.

• Train a machine learning model (e.g., Logistic Regression or Naïve Bayes) using
PySpark MLlib.

• Evaluate model performance through accuracy and confusion matrix.

• Generate visual insights using matplotlib and word cloud libraries.

Implementation

i. Problem Analysis and Description

The dataset, [Link], contains thousands of comments collected from YouTube videos.
Each comment includes fields such as video_id, comment_text, and other metadata. However, the
primary focus is on analysing the textual content (CONTENT column) to determine sentiment
polarity.

The challenges identified include:

• Handling large-scale data efficiently.

• Cleaning noisy text (emojis, links, special symbols).

• Feature extraction suitable for machine learning.

• Building an interpretable and accurate sentiment classification model.

PySpark’s distributed processing framework and MLlib are chosen to overcome these challenges.

ii. Modules Identified

1. Data Ingestion Module

o Loads the dataset into a PySpark DataFrame.

o Removes missing and null values in the comment text field.

2. Data Preprocessing Module

o Converts text to lowercase.

o Removes punctuation, numbers, and special characters.

o Tokenizes text and removes stopwords.

o Prepares clean text suitable for model training.

3. Feature Engineering Module

o Converts textual data into numerical features using TF-IDF or CountVectorizer.

4. Machine Learning Module

o Implements a classification model using PySpark MLlib.

o Trains the model on labeled data to predict sentiment categories.

o Evaluates accuracy and generates a confusion matrix.

5. Visualization Module

o Generates bar graphs showing sentiment distribution.

o Produces a Word Cloud for frequently used words in comments.

iii. Implementation Steps

1. Data Loading
The [Link] dataset was loaded into a PySpark DataFrame. Null and empty entries were
dropped to ensure data quality.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("YouTubeSentiment").getOrCreate()
df = spark.read.csv("[Link]", header=True, inferSchema=True)
df = df.na.drop(subset=["CONTENT"])

2. Data Cleaning
The CONTENT column was converted to lowercase and cleaned using regular expressions to remove
URLs, numbers, and punctuation.

from pyspark.sql.functions import col, lower, regexp_replace

df = df.withColumn("cleaned", lower(col("CONTENT")))
df = df.withColumn("cleaned", regexp_replace(col("cleaned"), r"http\S+|www\S+|[^a-z\s]", ""))

3. Feature Engineering and Vectorization
The cleaned text was tokenized and transformed using TF-IDF or CountVectorizer to prepare for
model input.

from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="cleaned", outputCol="words")
wordsData = tokenizer.transform(df)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filteredData = remover.transform(wordsData)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
cvModel = cv.fit(filteredData)
featurizedData = cvModel.transform(filteredData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
finalData = idfModel.transform(featurizedData)
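The training step that follows expects a numeric `label` column, but the report does not show how the labels were produced. As a hedged sketch only (the seed word lists and the 0/1/2 encoding are assumptions for illustration, not taken from the project), a simple rule-based labeller that could be wrapped in a Spark UDF might look like this:

```python
# Hypothetical seed lexicons; a real labelling pass would use a proper
# sentiment resource or human-annotated data.
POSITIVE = {"love", "great", "awesome", "amazing", "best"}
NEGATIVE = {"hate", "awful", "worst", "terrible", "bad"}

def rule_label(tokens):
    """Map a list of filtered tokens to 0 = negative, 1 = neutral, 2 = positive."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return 2
    if score < 0:
        return 0
    return 1

# In Spark this could be applied as a UDF, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import IntegerType
#   finalData = finalData.withColumn("label", udf(rule_label, IntegerType())("filtered"))
```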

4. Model Training
A classification model (e.g., Logistic Regression) was trained using PySpark MLlib. The model
achieved an accuracy of 75.05% on the test dataset.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

(trainingData, testData) = finalData.randomSplit([0.8, 0.2], seed=42)
lr = LogisticRegression(featuresCol="features", labelCol="label")
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)

5. Evaluation
The performance was evaluated using metrics such as accuracy and confusion matrix.

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
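Only accuracy is printed above; the confusion matrix shown in Figure 6 can be tabulated from collected (label, prediction) pairs, e.g. via `predictions.select("label", "prediction").collect()`. A minimal plain-Python helper (the 0/1/2 class encoding is an assumption here):

```python
from collections import Counter

def confusion_matrix(pairs, classes=(0, 1, 2)):
    """Build a confusion matrix: rows = true label, columns = predicted label."""
    counts = Counter((int(t), int(p)) for t, p in pairs)
    return [[counts[(t, p)] for p in classes] for t in classes]
```

The diagonal of the returned matrix holds the correctly classified counts, so accuracy can be cross-checked as the diagonal sum divided by the total.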

6. Visualization
Sentiment distribution was represented using bar plots, and a Word Cloud was generated to show
frequent terms from positive and negative comments.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sentiment Distribution
sentiment_counts = predictions.groupBy("label").count().toPandas()
plt.bar(sentiment_counts["label"], sentiment_counts["count"])
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution")

# Word Cloud
text = " ".join(df.select("cleaned").rdd.flatMap(lambda x: x).collect())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
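Figure 2 aggregates sentiment per video. Assuming the predictions retain a `video_id` column from the original metadata (the column name is an assumption), the per-video counts could be pivoted in pandas after collecting, for example:

```python
import pandas as pd

def top_video_sentiment(rows, top_n=5):
    """rows: (video_id, sentiment) pairs, e.g. collected from the Spark DataFrame.
    Returns a video_id x sentiment count table for the top_n most-commented videos."""
    dfp = pd.DataFrame(rows, columns=["video_id", "sentiment"])
    top = dfp["video_id"].value_counts().head(top_n).index
    return (dfp[dfp["video_id"].isin(top)]
            .groupby(["video_id", "sentiment"]).size()
            .unstack(fill_value=0))
```

The resulting table can be passed directly to `DataFrame.plot(kind="bar", stacked=True)` to reproduce a grouped chart like Figure 2.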

Output and Results

Results Obtained:

• Model Accuracy: 75.05 %

• Total Comments Processed: 691,680

YouTube Comment Sentiment Distribution Bar Graph:

Figure 1: YouTube Comment Sentiment Distribution Bar Graph

Sentiment Distribution for Top 5 Videos:

Figure 2: Sentiment Distribution for Top 5 Videos

Word Cloud Visualization:

i. Top Words in Positive Comments:

Figure 3: Top Words in Positive Comments

ii. Top Words in Negative Comments:

Figure 4: Top Words in Negative Comments

iii. Top Words in Neutral Comments:

Figure 5: Top Words in Neutral Comments

Confusion Matrix Plot:

Figure 6: Confusion Matrix Plot

Conclusion

This project successfully demonstrates the potential of PySpark as a powerful framework for
large-scale data analysis and sentiment classification tasks. Through the YouTube Comments
Sentiment Analysis project, we explored how massive unstructured textual data obtained from
social media can be transformed into meaningful insights using distributed data processing and
machine learning techniques.

The system efficiently handled a dataset consisting of more than six hundred thousand YouTube
comments, showcasing the scalability of PySpark’s in-memory computation engine. The
preprocessing phase ensured that noisy data such as URLs, emojis, punctuation, and mixed cases
were cleaned and normalized for effective text analysis. Subsequent tokenization, stop word
removal, and TF-IDF vectorization enabled the transformation of textual content into numerical
representations suitable for modelling.

The Logistic Regression model built using PySpark’s MLlib achieved a commendable accuracy
of 75.05%, indicating its capability to correctly classify the polarity of user opinions into positive,
negative, or neutral sentiments. The evaluation metrics and visualization outputs, including
sentiment distribution graphs and word clouds, further provided clear and interpretable
representations of the public opinion trends embedded within the data.

This analysis also illustrates the practical integration of data preprocessing, machine learning, and
visualization within a single PySpark-based pipeline. It highlights how cloud-compatible,
distributed frameworks like Spark can significantly accelerate model training and evaluation
compared to traditional Python libraries when dealing with large-scale datasets.

In conclusion, the project achieved its objective of performing end-to-end sentiment analysis on
real-world social media data while maintaining accuracy, scalability, and interpretability. It
demonstrates the potential for using Big Data technologies to monitor audience reactions, brand
perception, and public sentiment, thus providing a foundation for future analytical systems in
domains like marketing, entertainment, and digital media analytics.

Future Work

While the current project successfully established a functional and accurate sentiment analysis
pipeline using PySpark, there remain several opportunities for enhancement and extension in
future iterations.

Firstly, the sentiment classification process can be improved by incorporating deep learning
models such as LSTMs (Long Short-Term Memory networks) or Transformers (e.g., BERT),
which have proven to outperform traditional machine learning algorithms in understanding
linguistic context and sentiment nuances. Integrating these models through frameworks like
TensorFlowOnSpark or Spark NLP could enable more context-aware classification and better
handling of sarcasm, idioms, and multilingual comments.

Secondly, the dataset used in this project primarily contained English-language comments. Future
work can explore multilingual sentiment analysis, where models can detect and classify emotions
across diverse languages and cultures. This would make the system more robust and globally
applicable.

Another potential direction is the real-time sentiment monitoring of YouTube comment streams.
By connecting the pipeline with the YouTube Data API, live comments could be ingested and
processed continuously using Spark Streaming, enabling dynamic dashboards that visualize
changing audience reactions over time.

Additionally, expanding the analytical scope to include emotion classification (e.g., joy, anger,
sadness, fear) instead of just polarity-based sentiment could provide more detailed insights into
audience behavior. This could further be extended to topic modelling, helping identify trending
discussion themes within the comment sections.

Lastly, a web-based visualization dashboard could be developed to make the analysis interactive
and accessible to non-technical users. Tools such as Flask, Streamlit, or Dash can be integrated
with Spark outputs to present live metrics, sentiment heatmaps, and temporal trend graphs.

Through these enhancements, the system can evolve from a static analytical model into a real-
time, multi-language, and context-sensitive sentiment intelligence platform, offering broader
research and commercial applications.

References

1. Apache Software Foundation, Apache Spark Documentation, 2025. Available: [Link]
2. Kaggle, YouTube Comment Dataset, Datasnaek, 2020. Available:
[Link]
3. Bird, Steven, Edward Loper, and Ewan Klein, Natural Language Processing with Python,
O’Reilly Media, 2009. (NLTK library reference)
4. Dean, Jeffrey, and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
5. Zaharia, Matei, et al., “Apache Spark: A Unified Engine for Big Data Processing,”
Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
6. Pedregosa, Fabian, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
7. Rehurek, Radim, and Petr Sojka, “Software Framework for Topic Modelling with Large
Corpora,” Proceedings of the LREC Workshop on New Challenges for NLP Frameworks,
2010.
8. Mikolov, Tomas, et al., “Efficient Estimation of Word Representations in Vector Space,”
arXiv preprint arXiv:1301.3781, 2013.
9. YouTube Data API Documentation, Google Developers, 2025. Available:
[Link]
10. McKinney, Wes, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython, 2nd Edition, O’Reilly Media, 2017.

