
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

ASSIGNMENT-1 REPORT
ON

“YOUTUBE COMMENT SENTIMENT ANALYSIS USING PYSPARK”

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

Submitted by
Akarsh Kiran Gowda
SRN: R23EF017
Under the Guidance of
Dr. A Ajil
Assistant Professor
School of Computer Science and Engineering

2025-2026
REVA UNIVERSITY
Rukmini Knowledge Park, Kattigenahalli, Yelahanka, Bengaluru-560064
[Link]
Table of Contents
Abstract ......................................................................................................................................................... 2
Introduction .................................................................................................................................................. 3
Problem Statement ....................................................................................................................................... 4
Project Overview........................................................................................................................................... 5
Implementation ............................................................................................................................................ 6
Output and Results ....................................................................................................................................... 9
Conclusion ................................................................................................................................................... 13
Future Work ................................................................................................................................................ 14
References .................................................................................................................................................. 15

List of Figures
Figure 1: YouTube Comment Sentiment Distribution Bar Graph .................................................................. 9
Figure 2: Sentiment Distribution for Top 5 Videos ...................................................................... 10
Figure 3: Top Words in Positive Comments ................................................................................. 11
Figure 4: Top Words in Negative Comments ............................................................................... 11
Figure 5: Top Words in Neutral Comments ................................................................................. 12
Figure 6: Confusion Matrix Plot ................................................................................................... 12

Abstract

In the digital age, vast quantities of unstructured textual data are generated daily on platforms
such as YouTube. Extracting insights from this data is vital to understanding public opinion and
behavioral trends. This project focuses on sentiment analysis of YouTube comments using Apache
PySpark, leveraging its distributed computing capability for handling large datasets efficiently.
The system processes raw comment data, performs text cleaning and preprocessing, and applies
machine learning techniques to classify sentiments as positive, negative, or neutral. Additionally,
visualization tools such as word clouds and bar charts are used to represent trends and emotional
distribution in the dataset. This implementation demonstrates the scalability, accuracy, and
efficiency of PySpark in large-scale text analytics.

Keywords: Sentiment Analysis, PySpark, YouTube Comments, Machine Learning, NLP

Introduction

With the exponential growth of social media and video-sharing platforms like YouTube, user-
generated comments have become a major source of public opinion and feedback. Each video
uploaded to YouTube attracts thousands of comments from users expressing their thoughts,
appreciation, criticism, or suggestions. Analysing these comments provides significant insights
into audience engagement, content quality, and overall sentiment toward creators or topics.
However, due to the massive and unstructured nature of this data, manual evaluation is impractical,
and traditional single-machine processing tools often fall short in terms of scalability and
performance.

To overcome these challenges, this project employs Apache PySpark, the Python API for the
Apache Spark distributed computing framework. PySpark allows efficient handling of large-scale
datasets through parallel processing across multiple cores or machines. Its built-in libraries, such
as MLlib for machine learning and the SQL/DataFrame APIs for data manipulation, make it an
ideal tool for big data analytics and natural language processing (NLP) applications.

In this project, PySpark is used to perform end-to-end sentiment analysis on YouTube comments.
The process begins with reading and cleaning the dataset, followed by text preprocessing tasks
such as tokenization, stop word removal, and normalization. Next, relevant features are extracted
using vectorization techniques like TF-IDF, which convert textual information into numerical
format suitable for machine learning models. PySpark’s MLlib is then utilized to train and test a
classification model capable of predicting whether a comment is positive, negative, or neutral.
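The TF-IDF idea mentioned above can be illustrated in a few lines of plain Python. This is a conceptual sketch only, not MLlib's implementation; the smoothed IDF formula log((n+1)/(df+1)) used here matches Spark's default, but the function name and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights per document; docs are lists of tokens."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by smoothed inverse document frequency.
        weights.append({t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                        for t in tf})
    return weights
```

A term appearing in every document (like "video" across all comments about a video) receives weight zero, while rarer, more discriminative terms are weighted up, which is exactly why TF-IDF features help the classifier.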

The model’s output is evaluated using standard performance metrics like accuracy and confusion
matrix. In addition, graphical visualizations, including bar charts and word clouds, are generated
to represent the distribution of sentiments and frequently used words. These visual outputs enhance
the interpretability of the results and provide an intuitive understanding of public opinion trends.

By combining big data processing power with natural language understanding, this project
demonstrates how PySpark can transform raw, unstructured user comments into meaningful
insights. It not only highlights the potential of scalable sentiment analysis systems but also
provides a foundation for future developments in real-time opinion monitoring and social media
analytics.

Problem Statement

The exponential growth of user-generated content on platforms like YouTube has resulted in
millions of comments being added every day. Manually analysing these comments to determine
the general sentiment is neither feasible nor efficient. Traditional Python libraries, such as pandas
or scikit-learn, are limited in their capacity to process very large datasets. Therefore, a scalable
and efficient solution is required.

The problem addressed in this project is automating sentiment classification of YouTube
comments by leveraging PySpark’s distributed processing capabilities. The goal is to build a
pipeline that:

1. Cleans and preprocesses text data efficiently.

2. Applies a machine learning-based classification model for sentiment prediction.

3. Visualizes sentiment trends and frequent words for interpretation.

This helps content creators, marketers, and analysts understand user opinions at scale and make
informed data-driven decisions.

Project Overview

i. Objectives

• To design a scalable sentiment analysis system using PySpark capable of handling large
volumes of YouTube comments.

• To preprocess and clean unstructured text data for machine learning applications.

• To train and evaluate a sentiment classification model with measurable accuracy.

• To visualize sentiment trends using bar graphs and word clouds for better
interpretability.

ii. Goals

• Implement data ingestion and cleaning using PySpark DataFrames.

• Apply Natural Language Processing (NLP) steps such as tokenization, stop word
removal, and text normalization.

• Train a machine learning model (e.g., Logistic Regression or Naïve Bayes) using
PySpark MLlib.

• Evaluate model performance through accuracy and confusion matrix.

• Generate visual insights using matplotlib and word cloud libraries.

Implementation

i. Problem Analysis and Description

The dataset, [Link], contains thousands of comments collected from YouTube videos.
Each comment includes fields such as video_id, comment_text, and other metadata. However, the
primary focus is on analysing the textual content (CONTENT column) to determine sentiment
polarity.

The challenges identified include:

• Handling large-scale data efficiently.

• Cleaning noisy text (emojis, links, special symbols).

• Feature extraction suitable for machine learning.

• Building an interpretable and accurate sentiment classification model.

PySpark’s distributed processing framework and MLlib are chosen to overcome these challenges.

ii. Modules Identified

1. Data Ingestion Module

o Loads the dataset into a PySpark DataFrame.

o Removes missing and null values in the comment text field.

2. Data Preprocessing Module

o Converts text to lowercase.

o Removes punctuation, numbers, and special characters.

o Tokenizes text and removes stopwords.

o Prepares clean text suitable for model training.

3. Feature Engineering Module

o Converts textual data into numerical features using TF-IDF or CountVectorizer.

4. Machine Learning Module

o Implements a classification model using PySpark MLlib.

o Trains the model on labeled data to predict sentiment categories.

o Evaluates accuracy and generates a confusion matrix.

5. Visualization Module

o Generates bar graphs showing sentiment distribution.

o Produces a Word Cloud for frequently used words in comments.

iii. Implementation Steps

1. Data Loading
The [Link] dataset was loaded into a PySpark DataFrame. Null and empty entries were
dropped to ensure data quality.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("YouTubeSentiment").getOrCreate()
df = spark.read.csv("[Link]", header=True, inferSchema=True)
df = df.na.drop(subset=["CONTENT"])

2. Data Cleaning
The CONTENT column was converted to lowercase and cleaned using regular expressions to remove
URLs, numbers, and punctuation.

from pyspark.sql.functions import col, lower, regexp_replace

df = df.withColumn("cleaned", lower(col("CONTENT")))
df = df.withColumn("cleaned", regexp_replace(col("cleaned"), r"http\S+|www\S+|[^a-z\s]", ""))

3. Feature Engineering and Vectorization
The cleaned text was tokenized and transformed using TF-IDF or CountVectorizer to prepare for
model input.

from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="cleaned", outputCol="words")
wordsData = tokenizer.transform(df)
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filteredData = remover.transform(wordsData)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
cvModel = cv.fit(filteredData)
featurizedData = cvModel.transform(filteredData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
finalData = idfModel.transform(featurizedData)
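The training step that follows expects a numeric `label` column, but the report does not show how the labels were produced. As a hedged sketch only (the seed word lists and the 0/1/2 encoding are assumptions for illustration, not taken from the project), a simple rule-based labeller that could be wrapped in a Spark UDF might look like this:

```python
# Hypothetical seed lexicons; a real labelling pass would use a proper
# sentiment resource or human-annotated data.
POSITIVE = {"love", "great", "awesome", "amazing", "best"}
NEGATIVE = {"hate", "awful", "worst", "terrible", "bad"}

def rule_label(tokens):
    """Map a list of filtered tokens to 0 = negative, 1 = neutral, 2 = positive."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return 2
    if score < 0:
        return 0
    return 1

# In Spark this could be applied as a UDF, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import IntegerType
#   finalData = finalData.withColumn("label", udf(rule_label, IntegerType())("filtered"))
```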

4. Model Training
A classification model (e.g., Logistic Regression) was trained using PySpark MLlib. The model
achieved an accuracy of 75.05% on the test dataset.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

(trainingData, testData) = finalData.randomSplit([0.8, 0.2], seed=42)
lr = LogisticRegression(featuresCol="features", labelCol="label")
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)

5. Evaluation
The performance was evaluated using metrics such as accuracy and confusion matrix.

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
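Only accuracy is printed above; the confusion matrix shown in Figure 6 can be tabulated from collected (label, prediction) pairs, e.g. via `predictions.select("label", "prediction").collect()`. A minimal plain-Python helper (the 0/1/2 class encoding is an assumption here):

```python
from collections import Counter

def confusion_matrix(pairs, classes=(0, 1, 2)):
    """Build a confusion matrix: rows = true label, columns = predicted label."""
    counts = Counter((int(t), int(p)) for t, p in pairs)
    return [[counts[(t, p)] for p in classes] for t in classes]
```

The diagonal of the returned matrix holds the correctly classified counts, so accuracy can be cross-checked as the diagonal sum divided by the total.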

6. Visualization
Sentiment distribution was represented using bar plots, and a Word Cloud was generated to show
frequent terms from positive and negative comments.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sentiment Distribution
sentiment_counts = predictions.groupBy("label").count().toPandas()
plt.bar(sentiment_counts["label"], sentiment_counts["count"])
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution")

# Word Cloud
text = " ".join(df.select("cleaned").rdd.flatMap(lambda x: x).collect())
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
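Figure 2 aggregates sentiment per video. Assuming the predictions retain a `video_id` column from the original metadata (the column name is an assumption), the per-video counts could be pivoted in pandas after collecting, for example:

```python
import pandas as pd

def top_video_sentiment(rows, top_n=5):
    """rows: (video_id, sentiment) pairs, e.g. collected from the Spark DataFrame.
    Returns a video_id x sentiment count table for the top_n most-commented videos."""
    dfp = pd.DataFrame(rows, columns=["video_id", "sentiment"])
    top = dfp["video_id"].value_counts().head(top_n).index
    return (dfp[dfp["video_id"].isin(top)]
            .groupby(["video_id", "sentiment"]).size()
            .unstack(fill_value=0))
```

The resulting table can be passed directly to `DataFrame.plot(kind="bar", stacked=True)` to reproduce a grouped chart like Figure 2.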

Output and Results

Results Obtained:

• Model Accuracy: 75.05 %

• Total Comments Processed: 691,680

YouTube Comment Sentiment Distribution Bar Graph:

Figure 1: YouTube Comment Sentiment Distribution Bar Graph

Sentiment Distribution for Top 5 Videos:

Figure 2: Sentiment Distribution for Top 5 Videos

Word Cloud Visualization:

i. Top Words in Positive Comments:

Figure 3: Top Words in Positive Comments

ii. Top Words in Negative Comments:

Figure 4: Top Words in Negative Comments

iii. Top Words in Neutral Comments:

Figure 5: Top Words in Neutral Comments

Confusion Matrix Plot:

Figure 6: Confusion Matrix Plot

Conclusion

This project successfully demonstrates the potential of PySpark as a powerful framework for
large-scale data analysis and sentiment classification tasks. Through the YouTube Comments
Sentiment Analysis project, we explored how massive unstructured textual data obtained from
social media can be transformed into meaningful insights using distributed data processing and
machine learning techniques.

The system efficiently handled a dataset consisting of more than six hundred thousand YouTube
comments, showcasing the scalability of PySpark’s in-memory computation engine. The
preprocessing phase ensured that noisy data such as URLs, emojis, punctuation, and mixed cases
were cleaned and normalized for effective text analysis. Subsequent tokenization, stop word
removal, and TF-IDF vectorization enabled the transformation of textual content into numerical
representations suitable for modelling.

The Logistic Regression model built using PySpark’s MLlib achieved a commendable accuracy
of 75.05%, indicating its capability to correctly classify the polarity of user opinions into positive,
negative, or neutral sentiments. The evaluation metrics and visualization outputs, including
sentiment distribution graphs and word clouds, further provided clear and interpretable
representations of the public opinion trends embedded within the data.

This analysis also illustrates the practical integration of data preprocessing, machine learning, and
visualization within a single PySpark-based pipeline. It highlights how cloud-compatible,
distributed frameworks like Spark can significantly accelerate model training and evaluation
compared to traditional Python libraries when dealing with large-scale datasets.

In conclusion, the project achieved its objective of performing end-to-end sentiment analysis on
real-world social media data while maintaining accuracy, scalability, and interpretability. It
demonstrates the potential for using Big Data technologies to monitor audience reactions, brand
perception, and public sentiment, thus providing a foundation for future analytical systems in
domains like marketing, entertainment, and digital media analytics.

Future Work

While the current project successfully established a functional and accurate sentiment analysis
pipeline using PySpark, there remain several opportunities for enhancement and extension in
future iterations.

Firstly, the sentiment classification process can be improved by incorporating deep learning
models such as LSTMs (Long Short-Term Memory networks) or Transformers (e.g., BERT),
which have proven to outperform traditional machine learning algorithms in understanding
linguistic context and sentiment nuances. Integrating these models through frameworks like
TensorFlowOnSpark or Spark NLP could enable more context-aware classification and better
handling of sarcasm, idioms, and multilingual comments.

Secondly, the dataset used in this project primarily contained English-language comments. Future
work can explore multilingual sentiment analysis, where models can detect and classify emotions
across diverse languages and cultures. This would make the system more robust and globally
applicable.

Another potential direction is the real-time sentiment monitoring of YouTube comment streams.
By connecting the pipeline with the YouTube Data API, live comments could be ingested and
processed continuously using Spark Streaming, enabling dynamic dashboards that visualize
changing audience reactions over time.

Additionally, expanding the analytical scope to include emotion classification (e.g., joy, anger,
sadness, fear) instead of just polarity-based sentiment could provide more detailed insights into
audience behavior. This could further be extended to topic modelling, helping identify trending
discussion themes within the comment sections.

Lastly, a web-based visualization dashboard could be developed to make the analysis interactive
and accessible to non-technical users. Tools such as Flask, Streamlit, or Dash can be integrated
with Spark outputs to present live metrics, sentiment heatmaps, and temporal trend graphs.

Through these enhancements, the system can evolve from a static analytical model into a real-
time, multi-language, and context-sensitive sentiment intelligence platform, offering broader
research and commercial applications.

References

1. Apache Software Foundation, Apache Spark Documentation, 2025. Available: [Link]
2. Kaggle, YouTube Comment Dataset, Datasnaek, 2020. Available:
[Link]
3. Bird, Steven, Edward Loper, and Ewan Klein, Natural Language Processing with Python,
O’Reilly Media, 2009. (NLTK library reference)
4. Dean, Jeffrey, and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
5. Zaharia, Matei, et al., “Apache Spark: A Unified Engine for Big Data Processing,”
Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
6. Pedregosa, Fabian, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
7. Rehurek, Radim, and Petr Sojka, “Software Framework for Topic Modelling with Large
Corpora,” Proceedings of the LREC Workshop on New Challenges for NLP Frameworks,
2010.
8. Mikolov, Tomas, et al., “Efficient Estimation of Word Representations in Vector Space,”
arXiv preprint arXiv:1301.3781, 2013.
9. YouTube Data API Documentation, Google Developers, 2025. Available:
[Link]
10. McKinney, Wes, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython, 2nd Edition, O’Reilly Media, 2017.

