SCHOOL OF COMPUTER SCIENCE AND ENGINEERING
ASSIGNMENT-1 REPORT
ON
“YOUTUBE COMMENT SENTIMENT ANALYSIS USING PYSPARK”
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Akarsh Kiran Gowda
SRN: R23EF017
Under the Guidance of
Dr. A Ajil
Assistant Professor
School of Computer Science and Engineering
2025-2026
REVA UNIVERSITY
Rukmini Knowledge Park, Kattigenahalli, Yelahanka, Bengaluru-560064
Table of Contents
Abstract
Introduction
Problem Statement
Project Overview
Implementation
Output and Results
Conclusion
Future Work
References
List of Figures
Figure 1: YouTube Comment Sentiment Distribution Bar Graph
Figure 2: Sentiment Distribution for Top 5 Videos
Figure 3: Top Words in Positive Comments
Figure 4: Top Words in Negative Comments
Figure 5: Top Words in Neutral Comments
Figure 6: Confusion Matrix Plot
Abstract
In the digital age, vast quantities of unstructured textual data are generated daily on platforms
such as YouTube. Extracting insights from this data is vital to understanding public opinion and
behavioral trends. This project focuses on sentiment analysis of YouTube comments using PySpark,
the Python API for Apache Spark, leveraging its distributed computing capability to handle large datasets efficiently.
The system processes raw comment data, performs text cleaning and preprocessing, and applies
machine learning techniques to classify sentiments as positive, negative, or neutral. Additionally,
visualization tools such as word clouds and bar charts are used to represent trends and emotional
distribution in the dataset. This implementation demonstrates the scalability, accuracy, and
efficiency of PySpark in large-scale text analytics.
Keywords: Sentiment Analysis, PySpark, YouTube Comments, Machine Learning, NLP
Introduction
With the exponential growth of social media and video-sharing platforms like YouTube, user-
generated comments have become a major source of public opinion and feedback. Each video
uploaded to YouTube attracts thousands of comments from users expressing their thoughts,
appreciation, criticism, or suggestions. Analysing these comments provides significant insights
into audience engagement, content quality, and overall sentiment toward creators or topics.
However, due to the massive and unstructured nature of this data, manual evaluation is impractical,
and traditional single-machine processing tools often fall short in terms of scalability and
performance.
To overcome these challenges, this project employs PySpark, the Python API for the Apache Spark
distributed computing framework. PySpark enables efficient handling of large-scale datasets through
parallel processing across multiple cores or machines. Its built-in libraries such as MLlib for
machine learning and SQL/DataFrame APIs for data manipulation make it an ideal tool for big
data analytics and natural language processing (NLP) applications.
In this project, PySpark is used to perform end-to-end sentiment analysis on YouTube comments.
The process begins with reading and cleaning the dataset, followed by text preprocessing tasks
such as tokenization, stop word removal, and normalization. Next, relevant features are extracted
using vectorization techniques like TF-IDF, which convert textual information into numerical
format suitable for machine learning models. PySpark’s MLlib is then utilized to train and test a
classification model capable of predicting whether a comment is positive, negative, or neutral.
The model’s output is evaluated using standard performance metrics such as accuracy and a confusion
matrix. In addition, graphical visualizations, including bar charts and word clouds, are generated
to represent the distribution of sentiments and frequently used words. These visual outputs enhance
the interpretability of the results and provide an intuitive understanding of public opinion trends.
By combining big data processing power with natural language understanding, this project
demonstrates how PySpark can transform raw, unstructured user comments into meaningful
insights. It not only highlights the potential of scalable sentiment analysis systems but also
provides a foundation for future developments in real-time opinion monitoring and social media
analytics.
Problem Statement
The exponential growth of user-generated content on platforms like YouTube has resulted in
millions of comments being added every day. Manually analysing these comments to determine
the general sentiment is neither feasible nor efficient. Traditional single-machine Python libraries,
such as pandas or scikit-learn, are constrained by the memory and compute of a single node when
processing very large datasets. Therefore, a scalable and efficient solution is required.
The problem addressed in this project is automating sentiment classification of YouTube
comments by leveraging PySpark’s distributed processing capabilities. The goal is to build a
pipeline that:
1. Cleans and preprocesses text data efficiently.
2. Applies a machine learning-based classification model for sentiment prediction.
3. Visualizes sentiment trends and frequent words for interpretation.
This helps content creators, marketers, and analysts understand user opinions at scale and make
informed data-driven decisions.
Project Overview
i. Objectives
• To design a scalable sentiment analysis system using PySpark capable of handling large
volumes of YouTube comments.
• To preprocess and clean unstructured text data for machine learning applications.
• To train and evaluate a sentiment classification model with measurable accuracy.
• To visualize sentiment trends using bar graphs and word clouds for better
interpretability.
ii. Goals
• Implement data ingestion and cleaning using PySpark DataFrames.
• Apply Natural Language Processing (NLP) steps such as tokenization, stop word
removal, and text normalization.
• Train a machine learning model (e.g., Logistic Regression or Naïve Bayes) using
PySpark MLlib.
• Evaluate model performance using accuracy and a confusion matrix.
• Generate visual insights using matplotlib and word cloud libraries.
Implementation
i. Problem Analysis and Description
The dataset contains thousands of comments collected from YouTube videos. Each comment
includes fields such as video_id, comment_text, and other metadata. However, the primary focus is
on analysing the textual content (the CONTENT column) to determine sentiment
polarity.
The challenges identified include:
• Handling large-scale data efficiently.
• Cleaning noisy text (emojis, links, special symbols).
• Feature extraction suitable for machine learning.
• Building an interpretable and accurate sentiment classification model.
PySpark’s distributed processing framework and MLlib are chosen to overcome these challenges.
ii. Modules Identified
1. Data Ingestion Module
o Loads the dataset into a PySpark DataFrame.
o Removes missing and null values in the comment text field.
2. Data Preprocessing Module
o Converts text to lowercase.
o Removes punctuation, numbers, and special characters.
o Tokenizes text and removes stopwords.
o Prepares clean text suitable for model training.
3. Feature Engineering Module
o Converts textual data into numerical features using TF-IDF or CountVectorizer.
4. Machine Learning Module
o Implements a classification model using PySpark MLlib.
o Trains the model on labeled data to predict sentiment categories.
o Evaluates accuracy and generates a confusion matrix.
5. Visualization Module
o Generates bar graphs showing sentiment distribution.
o Produces a Word Cloud for frequently used words in comments.
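The tokenization and stop word removal performed by the preprocessing module can be sketched in plain Python; the stopword set below is a small illustrative sample, whereas the actual pipeline relies on StopWordsRemover's built-in English list.

```python
# Illustrative stopword subset only; Spark's StopWordsRemover ships a fuller English list.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def tokenize(text):
    # Lowercase, split on whitespace, then drop stopwords.
    return [t for t in text.lower().split() if t not in STOPWORDS]

print(tokenize("This is the best song"))  # → ['this', 'best', 'song']
```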
iii. Implementation Steps
1. Data Loading
The dataset was loaded into a PySpark DataFrame. Null and empty entries in the comment text were
dropped to ensure data quality.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("YouTubeSentiment").getOrCreate()
df = spark.read.csv("youtube_comments.csv", header=True, inferSchema=True)  # path is a placeholder
df = df.dropna(subset=["CONTENT"])
2. Data Cleaning
The CONTENT column was converted to lowercase and cleaned using regular expressions to remove
URLs, numbers, and punctuation.
from pyspark.sql.functions import col, lower, regexp_replace

df = df.withColumn("cleaned", lower(col("CONTENT")))
df = df.withColumn("cleaned", regexp_replace(col("cleaned"), r"http\S+|www\S+|[^a-z\s]", ""))
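The effect of this pattern can be checked on a single comment in plain Python: re.sub applies the same regular expression that regexp_replace uses. The sample string is made up for illustration.

```python
import re

# Hypothetical sample comment run through the same pattern as regexp_replace above.
sample = "Check out my channel!!! http://spam.example 100% FREE"
cleaned = re.sub(r"http\S+|www\S+|[^a-z\s]", "", sample.lower())
print(" ".join(cleaned.split()))  # → check out my channel free
```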
3. Feature Engineering and Vectorization
The cleaned text was tokenized and transformed with CountVectorizer followed by IDF weighting
(i.e., TF-IDF) to prepare it for model input.
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="cleaned", outputCol="words")
wordsData = tokenizer.transform(df)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filteredData = remover.transform(wordsData)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
cvModel = cv.fit(filteredData)
featurizedData = cvModel.transform(filteredData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
finalData = idfModel.transform(featurizedData)
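What CountVectorizer and IDF jointly compute can be illustrated on a toy corpus. Spark MLlib uses the smoothed formula IDF(t) = log((N + 1) / (df(t) + 1)), reproduced here in plain Python; the three tokenized documents are hypothetical.

```python
import math

# Toy tokenized corpus (hypothetical).
docs = [["great", "video"], ["bad", "video"], ["great", "great", "song"]]
N = len(docs)

# Document frequency: number of documents containing each term.
df_counts = {}
for doc in docs:
    for term in set(doc):
        df_counts[term] = df_counts.get(term, 0) + 1

# Smoothed IDF as in Spark MLlib: log((N + 1) / (df + 1)).
idf = {t: math.log((N + 1) / (c + 1)) for t, c in df_counts.items()}

# TF-IDF weight of "great" in the third document: raw count times IDF.
tf = docs[2].count("great")
print(round(tf * idf["great"], 4))
```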
4. Model Training
A classification model (e.g., Logistic Regression) was trained using PySpark MLlib. The model
achieved an accuracy of 75.05% on the test dataset.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

(trainingData, testData) = finalData.randomSplit([0.8, 0.2], seed=42)
lr = LogisticRegression(featuresCol="features", labelCol="label")
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
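The scoring rule underlying logistic regression can be shown for the binary case in plain Python; the weights and feature values below are invented for illustration, and MLlib in fact fits a multinomial model across the three sentiment classes.

```python
import math

def sigmoid(z):
    # Maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights, bias, and a 3-dimensional TF-IDF feature vector.
weights, bias = [1.2, -0.8, 0.5], 0.1
features = [0.6, 0.0, 1.0]

z = sum(w * x for w, x in zip(weights, features)) + bias
print(round(sigmoid(z), 3))  # probability of the positive class
```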
5. Evaluation
The performance was evaluated using metrics such as accuracy and confusion matrix.
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
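The confusion matrix itself is just a tally over (true label, predicted label) pairs. A plain-Python sketch follows; the pairs are hypothetical, whereas in the pipeline they would be collected from the predictions DataFrame.

```python
from collections import Counter

# Hypothetical (true label, predicted label) pairs for the three sentiment classes.
pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 2), (2, 0)]
cm = Counter(pairs)  # cm[(y, p)] = number of class-y comments predicted as p

correct = sum(n for (y, p), n in cm.items() if y == p)
accuracy = correct / sum(cm.values())
print(f"accuracy = {accuracy:.2%}")  # 4 of 6 pairs lie on the diagonal
```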
6. Visualization
Sentiment distribution was represented using bar plots, and a Word Cloud was generated to show
frequent terms from positive and negative comments.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sentiment Distribution
sentiment_counts = predictions.groupBy("label").count().toPandas()
plt.bar(sentiment_counts["label"], sentiment_counts["count"])
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution")
plt.show()

# Word Cloud
text = " ".join(df.select("cleaned").rdd.flatMap(lambda x: x).collect())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
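Behind each word cloud sits a simple per-class word frequency count; a minimal sketch over hypothetical positive comments:

```python
from collections import Counter

# Hypothetical cleaned comments from the positive class.
positive_comments = ["love this song", "love the video", "great video"]
counts = Counter(w for c in positive_comments for w in c.split())
print(counts.most_common(2))  # → [('love', 2), ('video', 2)]
```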
Output and Results
Results Obtained:
• Model Accuracy: 75.05%
• Total Comments Processed: 691,680
YouTube Comment Sentiment Distribution Bar Graph:
Figure 1: YouTube Comment Sentiment Distribution Bar Graph
Sentiment Distribution for Top 5 Videos:
Figure 2: Sentiment Distribution for Top 5 Videos
Word Cloud Visualization:
i. Top Words in Positive Comments:
Figure 3: Top Words in Positive Comments
ii. Top Words in Negative Comments:
Figure 4: Top Words in Negative Comments
iii. Top Words in Neutral Comments:
Figure 5: Top Words in Neutral Comments
Confusion Matrix Plot:
Figure 6: Confusion Matrix Plot
Conclusion
This project successfully demonstrates the potential of PySpark as a powerful framework for
large-scale data analysis and sentiment classification tasks. Through the YouTube Comments
Sentiment Analysis project, we explored how massive unstructured textual data obtained from
social media can be transformed into meaningful insights using distributed data processing and
machine learning techniques.
The system efficiently handled a dataset consisting of more than six hundred thousand YouTube
comments, showcasing the scalability of PySpark’s in-memory computation engine. The
preprocessing phase ensured that noisy data such as URLs, emojis, punctuation, and mixed cases
were cleaned and normalized for effective text analysis. Subsequent tokenization, stop word
removal, and TF-IDF vectorization enabled the transformation of textual content into numerical
representations suitable for modelling.
The Logistic Regression model built using PySpark’s MLlib achieved a commendable accuracy
of 75.05%, indicating its capability to correctly classify the polarity of user opinions into positive,
negative, or neutral sentiments. The evaluation metrics and visualization outputs, including
sentiment distribution graphs and word clouds, further provided clear and interpretable
representations of the public opinion trends embedded within the data.
This analysis also illustrates the practical integration of data preprocessing, machine learning, and
visualization within a single PySpark-based pipeline. It highlights how cloud-compatible,
distributed frameworks like Spark can significantly accelerate model training and evaluation
compared to traditional Python libraries when dealing with large-scale datasets.
In conclusion, the project achieved its objective of performing end-to-end sentiment analysis on
real-world social media data while maintaining accuracy, scalability, and interpretability. It
demonstrates the potential for using Big Data technologies to monitor audience reactions, brand
perception, and public sentiment, thus providing a foundation for future analytical systems in
domains like marketing, entertainment, and digital media analytics.
Future Work
While the current project successfully established a functional and accurate sentiment analysis
pipeline using PySpark, there remain several opportunities for enhancement and extension in
future iterations.
Firstly, the sentiment classification process can be improved by incorporating deep learning
models such as LSTMs (Long Short-Term Memory networks) or Transformers (e.g., BERT),
which have proven to outperform traditional machine learning algorithms in understanding
linguistic context and sentiment nuances. Integrating these models through frameworks like
TensorFlowOnSpark or Spark NLP could enable more context-aware classification and better
handling of sarcasm, idioms, and multilingual comments.
Secondly, the dataset used in this project primarily contained English-language comments. Future
work can explore multilingual sentiment analysis, where models can detect and classify emotions
across diverse languages and cultures. This would make the system more robust and globally
applicable.
Another potential direction is the real-time sentiment monitoring of YouTube comment streams.
By connecting the pipeline with the YouTube Data API, live comments could be ingested and
processed continuously using Spark Streaming, enabling dynamic dashboards that visualize
changing audience reactions over time.
Additionally, expanding the analytical scope to include emotion classification (e.g., joy, anger,
sadness, fear) instead of just polarity-based sentiment could provide more detailed insights into
audience behavior. This could further be extended to topic modelling, helping identify trending
discussion themes within the comment sections.
Lastly, a web-based visualization dashboard could be developed to make the analysis interactive
and accessible to non-technical users. Tools such as Flask, Streamlit, or Dash can be integrated
with Spark outputs to present live metrics, sentiment heatmaps, and temporal trend graphs.
Through these enhancements, the system can evolve from a static analytical model into a real-
time, multi-language, and context-sensitive sentiment intelligence platform, offering broader
research and commercial applications.
References
1. Apache Software Foundation, Apache Spark Documentation, 2025. Available:
[Link]
2. Kaggle, YouTube Comment Dataset, Datasnaek, 2020. Available:
[Link]
3. Bird, Steven, Edward Loper, and Ewan Klein, Natural Language Processing with Python,
O’Reilly Media, 2009. (NLTK library reference)
4. Dean, Jeffrey, and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
5. Zaharia, Matei, et al., “Apache Spark: A Unified Engine for Big Data Processing,”
Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
6. Pedregosa, Fabian, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
7. Rehurek, Radim, and Petr Sojka, “Software Framework for Topic Modelling with Large
Corpora,” Proceedings of the LREC Workshop on New Challenges for NLP Frameworks,
2010.
8. Mikolov, Tomas, et al., “Efficient Estimation of Word Representations in Vector Space,”
arXiv preprint arXiv:1301.3781, 2013.
9. YouTube Data API Documentation, Google Developers, 2025. Available:
[Link]
10. McKinney, Wes, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython, 2nd Edition, O’Reilly Media, 2017.