Article
Arabic Sentiment Analysis of YouTube Comments: NLP-Based
Machine Learning Approaches for Content Evaluation
Dhiaa A. Musleh, Ibrahim Alkhwaja, Ali Alkhwaja *, Mohammed Alghamdi, Hussam Abahussain, Faisal Alfawaz, Nasro Min-Allah and Mamoun Masoud Abdulqader
Abstract: YouTube is a popular video-sharing platform that offers a diverse range of content. Assess-
ing the quality of a video without watching it poses a significant challenge, especially considering
the recent removal of the dislike count feature on YouTube. Although comments have the potential
to provide insights into video content quality, navigating through the comments section can be
time-consuming and overwhelming for both content creators and viewers. This paper proposes
an NLP-based model to classify Arabic comments as positive or negative. It was trained on a novel
dataset of 4212 labeled comments, with a Kappa score of 0.818. The model uses six classifiers: SVM,
Naïve Bayes, Logistic Regression, KNN, Decision Tree, and Random Forest. It achieved 94.62% accu-
racy and an MCC score of 91.46% with NB. Precision, Recall, and F1-measure for NB were 94.64%,
94.64%, and 94.62%, respectively. The Decision Tree had a suboptimal performance with 84.10%
accuracy and an MCC score of 69.64% without TF-IDF. This study provides valuable insights for
content creators to improve their content and audience engagement by analyzing viewers’ sentiments
toward the videos. Furthermore, it bridges a literature gap by offering a comprehensive approach to
Arabic sentiment analysis, which is currently limited in the field.
Keywords: Arabic sentiment analysis; YouTube comments; machine learning; natural language processing; supervised learning

1. Introduction

Sentiment analysis involves analyzing people's responses, emotions, and opinions in the form of text toward an object, such as a book, service, or video. This type of analysis can be applied to YouTube comments using machine learning algorithms and techniques to classify them as positive or negative. Classifying a YouTube video based on its comments can save viewers' time and assist YouTube content creators in gauging the impressions of their viewers, leading to an improvement in content quality on the platform in general.

Arabic is the fourth most popular language on the internet [1]. It holds an essential place within the realm of sentiment analysis yet remains under-researched. Several characteristics of the Arabic language make it unique compared to other languages. First, it has 28 letters and no capitalization. In addition, words are written by connecting Arabic letters; for instance, a letter written alone differs from the same letter written in the middle of a word [2]. Second, Arabic can be written differently depending on the dialect; on the internet, and on social media in particular, it is common to see the language written in various dialects.

Natural Language Processing (NLP) is a branch of Artificial Intelligence; it is a technique used to analyze and process human language by enabling the computer to interpret the intent and sentiment of the writer or speaker and to understand the whole meaning of a sentence. The popularity of NLP has recently increased noticeably; however, the number of its applications in Arabic remains small compared with English, and English research papers clearly outnumber Arabic ones. Arabic YouTube comments can be capitalized on to save viewers the time of deciding whether a specific video is worth watching.
This study uses machine learning and NLP techniques to evaluate the sentiment of Arabic YouTube comments. Since many research papers have been released for the English language, this study focuses on Arabic to expand this under-researched domain. Our objective is to propose a novel model for sentiment analysis of
Arabic comments on YouTube videos, addressing the challenge of evaluating video quality
without watching the content. By utilizing NLP techniques and ML algorithms, the model
classifies comments as positive or negative. Our study includes the evaluation of the
model’s performance on a balanced dataset of manually labeled Arabic comments and
compares various classifiers. It provides valuable insights for content creators to improve
their content and audience engagement by analyzing viewers’ sentiments.
This paper is divided into the following segments: Section 2 discusses a Review of
Related Literature and Background. Section 3 discusses the Description of the Methodology.
Section 4 discusses Performance Measurement. Section 5 illustrates the Experimental
Results and Discussion. The final section encompasses the Conclusion and Future Work.
Table 1. Summary of related studies.

| Year | Author & Reference | Dataset Used & Source | Language | Best Technique | Best Result |
|---|---|---|---|---|---|
| 2017 | Al-Tamimi et al. [10] | 5986 comments collected from YouTube. | Arabic | SVM-RBF | Precision, Recall, and F1-Score = 88.8% |
| 2018 | Alakrot et al. [11] | 15,050 comments gathered from YouTube. | Arabic | SVM | Accuracy = 90.05%; Precision = 88%; Recall = 77%; F1-Score = 82% |
| 2019 | Mohaouchane et al. [12] | 15,050 YouTube comments provided by other previously published work. | Arabic | CNN | Accuracy = 87.84%; Precision = 86.10%; Recall = 82.24%; F1-Score = 84.05% |
| 2019 | Mohammed and Kora [13] | 40,000 tweets from Twitter. | Arabic | LSTM using data augmentation | Accuracy = 88.71%; Precision = 88.43%; Recall = 88.59%; F-Score = 88.43% |
| 2020 | Ombabi et al. [14] | 15,100 reviews provided by other previously published work. | Arabic | FastText (Skip-gram)-CNN-LSTM | Accuracy = 90.75%; Precision = 89.10%; Recall = 92.14%; F1-Score = 92.44% |
| 2022 | Hadwan et al. [15] | 8000 reviews from social media, Google Play, and the App Store. | Arabic | KNN | Accuracy = 78.46%; Precision = 79.94%; Recall = 78.01%; F-Score = 78.96% |
| 2022 | Khabour et al. [16] | 15,572 reviews provided by other previously published work. | Arabic | Semantic orientation approach using ontology | Accuracy = 79.20%; Precision = 81.87%; Recall = 79.20%; F-Score = 78.75% |
| 2023 | Alqarni and Rahman [17] | Several datasets with a total of 90,187 preprocessed tweets from Twitter. | Arabic | CNN | Accuracy = 92.80%; Precision = 92.97%; Recall = 93.06%; F-Score = 92.99% |
3. Description of Methodology
This section addresses the collection, labeling, description, and preprocessing of
the dataset, as well as the subsequent steps of feature extraction and model generation.
The data collection, data pre-processing, and classification pipeline is shown in Figure 1.
Figure 1. The process of collecting data, preprocessing it, and utilizing classifier models.

The positive class comprises comments that commend the video or its content. On the other hand, the negative class encompasses comments that are derogatory and critical of the YouTube video's content. Five native Arabic individuals labeled each comment individually in a CSV file that includes the 4760 Arabic YouTube comments. A comment that was labeled three or more times for a specific class was assigned that class as its final label. Following the labeling process, we further investigated the comments to eliminate irrelevant ones, resulting in a total of 4212 comments divided equally between 2106 positive and 2106 negative comments.
Evaluation of Inter-Rater Reliability Using Fleiss' Kappa

The agreement between the five Arabic individuals (raters) who examined the 4212 comments was measured using Fleiss' Kappa. The resulting Kappa value was found to be 0.818, indicating strong agreement among the labels provided by the group of raters. The corresponding z-score of approximately 168 suggests a significant deviation from the null hypothesis of no agreement. These findings highlight the strong inter-rater agreement observed in the dataset, providing valuable insight into the reliability of the ratings provided by the group of individual Arabic raters. Figure 2 shows the dataset analysis obtained by measuring Fleiss' Kappa.

Figure 2. Results of implementing Fleiss' Kappa on the dataset.
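For illustration, this agreement measure could be reproduced with a short script such as the minimal sketch below; it uses the statsmodels implementation of Fleiss' Kappa, and the example ratings are hypothetical rather than taken from our dataset:

```python
# Minimal sketch: Fleiss' Kappa for five raters labeling comments as
# negative (0) or positive (1). Illustrative only; not the study's script.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = label assigned by rater j to comment i
ratings = np.array([
    [1, 1, 1, 1, 0],  # four of five raters vote positive -> final label "Positive"
    [0, 0, 0, 1, 0],  # majority negative -> final label "Negative"
    [1, 1, 1, 1, 1],  # unanimous positive
])

counts, _ = aggregate_raters(ratings)  # per-comment counts for each category
print(fleiss_kappa(counts))            # 1.0 would indicate perfect agreement
```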
3.3. Dataset Pre-Processing

Pre-processing is a vital component of machine learning systems [18], as it involves cleaning and refining the dataset to facilitate subsequent classifier training. In our dataset, YouTube comments are written in various dialects of Arabic rather than in standard Arabic. To overcome the challenges associated with Arabic-language comments, we leveraged NLP techniques incorporating normalization, stemming, and stop-word removal. Figure 1 above demonstrates the main pre-processing phases that are carried out before building the model.
3.3.2. Normalization

The normalization process is used to remove noise from the data and correct spelling errors in Arabic. This step has shown a noticeable improvement in model accuracy scores across various studies. For instance, Huq et al. [19] were able to raise the accuracy score of an SVM classifier model from 61% to 80% by applying normalization and other preprocessing techniques. To normalize certain Arabic letters, such as آ to ا, we employed a script. First, the script cleans up any leading or trailing whitespace in the given text. After that, Arabic character variants are replaced with their standard versions using regular expressions. For instance, it replaces different representations of the letter alif (ٱ, آ, أ, إ, ا) with a single "ا". It also changes the letter "ى" (sometimes used as the final form of "ي") to "ي", and it changes the letters "ؤ" and "ئ" to "ء" to normalize their usage. Furthermore, it substitutes "ه" for the feminine marker "ة". The output of the function is the normalized text. For text processing and analysis activities, these normalization methods help guarantee that Arabic characters are represented consistently. Figure 3 demonstrates the script used for applying normalization.
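A minimal sketch of such a normalization function, assuming the letter mappings described above (an illustration rather than the exact script used in this work), is shown below:

```python
# Illustrative Arabic normalization following the steps described above.
import re

def normalize_arabic(text: str) -> str:
    text = text.strip()                 # remove leading/trailing whitespace
    text = re.sub("[ٱآأإ]", "ا", text)   # unify alif variants to bare alif
    text = re.sub("ى", "ي", text)       # final alif maqsura -> ya
    text = re.sub("[ؤئ]", "ء", text)    # hamza carriers -> bare hamza
    text = re.sub("ة", "ه", text)       # feminine marker ta marbuta -> ha
    return text
```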
3.3.3. Tokenization

Tokenization is an essential step before analyzing the text since it reduces the variation of typos found in words [5]. In addition, performing the tokenization phase can help performance when including a feature extraction method such as bag of words (BOW) [20]. Other elements can also affect tokenization, for instance, N-grams, co-occurrence, and stemming. In tokenization, spaces, tabs, and newlines are viewed as delimiters. Other characters, such as (), ?, and !, can also be considered delimiters. Arabic users usually express their opinions in informal Arabic on social media; thus, comma usage can sometimes be wrong, so it is better to also treat a comma as a delimiter when it occurs between words. One of the usages of tokenization is to tally the frequency of word appearances.
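A simple sketch of this delimiter-based tokenization and frequency tally (the exact delimiter set is assumed from the description above) could look as follows:

```python
# Illustrative tokenizer: whitespace plus (), ?, ! and the Arabic comma "،"
# are treated as delimiters; token frequencies are then tallied.
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return [t for t in re.split(r"[\s(),?!،]+", text) if t]

print(Counter(tokenize("جميل جدا، جميل!")))  # Counter({'جميل': 2, 'جدا': 1})
```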
3.3.4. Stop-Word Removal

There are two sources of stop words. The first source is the "NLTK" library, using the "stopwords" class with the "words" method and selecting the Arabic language. The second source is stop words written manually in the code. Once the two sources of stop words are prepared, the words within the comments are separated, and each word is compared against the stop words from the two previously created sources. If a comparison matches, the word is considered a stop word and is therefore discarded. The loop keeps iterating, performing the same comparison until the last word of the comments.
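The following minimal sketch illustrates this two-source comparison loop; the manually added stop words here are hypothetical examples:

```python
# Illustrative two-source stop-word removal: NLTK's Arabic list plus a
# manual list written in the code. Manual entries below are hypothetical.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
arabic_stops = set(stopwords.words("arabic"))  # first source: NLTK
manual_stops = {"يعني", "طيب"}                  # second source: manual additions

def remove_stop_words(comment: str) -> list[str]:
    tokens = comment.split()
    return [w for w in tokens if w not in arabic_stops and w not in manual_stops]
```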
3.3.5. Stemming

Stemming eliminates prefixes and suffixes from a word, restoring it to its root form. We used the ISRIStemmer provided by the NLTK library [21].
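A short usage example of the ISRI stemmer (illustrative):

```python
# Illustrative use of NLTK's ISRI stemmer for Arabic.
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()
print(stemmer.stem("جميلة"))  # strips affixes, returning the stem/root
```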
For feature extraction, we used N-gram ranges and TF-IDF methods. To train the dataset, we employed six different supervised machine learning models, specifically SVM, RF, LR, KNN, DT, and NB.
3.4.2. N-Gram

N-gram is a technique that facilitates machine comprehension of word meaning in its context. This method utilizes the words preceding and following the target word to aid the machine in understanding the sentence's meaning. Spell-checking, next-word prediction, and language identification are all made easier with the use of n-grams. The term "grams" refers to the individual words, whereas the letter "N" signifies the total number of words to be examined. Various types of N-grams exist, such as Unigram (N = 1), Bigram (N = 2), and Trigram (N = 3), among others. For instance, the N-gram approach can be used to compute the following N-gram versions for the statement "هذا المقطع جميل جدا" ("this clip is very nice"):
• Unigrams: "هذا", "المقطع", "جميل", "جدا".
• Bigrams: "هذا المقطع", "المقطع جميل", "جميل جدا".
• Trigrams: "هذا المقطع جميل", "المقطع جميل جدا".
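In scikit-learn, such N-gram features can be extracted with a count vectorizer; the sketch below (with illustrative parameter choices, not our exact configuration) mirrors the (1, 3) range used in our experiments:

```python
# Illustrative N-gram feature extraction with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

comments = ["هذا المقطع جميل جدا"]                 # the example sentence above
vectorizer = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams, trigrams
counts = vectorizer.fit_transform(comments)
print(vectorizer.get_feature_names_out())

# TF-IDF weighting can optionally be applied on top of the raw counts:
tfidf = TfidfTransformer().fit_transform(counts)
```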
3.5. Generating the Model
Building models using various classifiers, such as Multinomial NB, SVM, DT, Random
Forest, Logistic Regression, and KNN, can be achieved by using a number of classes from
the Python scikit-learn (sklearn) library, such as GridSearchCV, Confusion Matrix, and
Classification Report.
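For illustration, one of the six classifiers could be assembled as follows; this is a minimal sketch with assumed parameter values, not our exact configuration:

```python
# Illustrative model generation: N-gram counts, optional TF-IDF, and a
# Multinomial NB classifier tuned with GridSearchCV. Values are assumed.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
param_grid = {"tfidf__use_idf": [True, False], "clf__alpha": [0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
# search.fit(train_comments, train_labels)  # labels: "Positive" / "Negative"
```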
3.5.1. Naïve Bayes (NB)

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)} \tag{1}$$

Equation (1) yields a probability indicating the likelihood that the input belongs to a particular class [4].

3.5.2. Support Vector Machine (SVM)

The SVM hyperplane is a line that separates the data into two categories. This hyperplane is defined by the values of the optimized weights on the margins. The primary objective of the training is to find a separating hyperplane that maximizes the distance between the classes of positive and negative examples while at the same time having the smallest possible number of misclassifications. We can describe the hyperplane as:

$$w \cdot x + b = 0 \tag{2}$$

where $w$ is the set of weights, $x$ represents the input vector, and $b$ denotes the bias. The margin to be maximized is given by:

$$\frac{2}{\|w\|} \tag{3}$$
Figure 4 below demonstrates the concept of SVM with two classes: blue circles represent the positive class, and red circles represent the negative class.

Figure 4. SVM concept with two classes [23].
3.5.3. Random Forest (RF)

Random Forest is classified as an ensemble technique that has gained widespread popularity for its flexibility and effectiveness in solving a broad range of problems [24]. It consists of a group of decision trees, each of which generates a prediction for a given input instance, and the conclusive result generated by the model is based on the majority vote of the individual trees. In Figure 5, the green circles represent the attributes that are randomly chosen for the split at each node, while the blue circles represent the attributes that are ignored during the construction process. Random Forest has demonstrated success in diverse domains, including image classification, bioinformatics, and NLP.

Figure 5. Random Forest technique [25].
3.5.4. K-Nearest-Neighbor (KNN)

K-Nearest Neighbors (KNN) is a simple method used in supervised machine learning for regression and classification tasks. KNN runs on the presumption that fresh data is categorized based on its nearest neighbors [26]. Figure 6 demonstrates three circles belonging to three distinct classes represented by the colors blue, orange, and black. The gray circle represents a data point that needs to be predicted or classified. However, arrows are used to indicate only the closest neighbors, as determined by a K value of seven.

Figure 6. An example of a KNN working mechanism.
There are several techniques for determining the spatial separation between points. However, one of the most widely used methods is the Euclidean distance, which determines the distance between two points as:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{4}$$
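Equation (4) translates directly into code; a small illustrative helper:

```python
# Illustrative helper implementing Equation (4) for two 2D points.
import math

def euclidean(p1: tuple[float, float], p2: tuple[float, float]) -> float:
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```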
3.5.5. Decision Tree (DT)

A Decision Tree is considered a non-parametric approach in supervised learning used to organize a sequence of roots in a tree structure, as shown in Figure 7. The decision tree consists of three primary nodes: the root node represents the overall population under consideration, the decision node is created when a sub-node is divided into other sub-nodes, and the leaf node does not split into further nodes [27]. The decision tree algorithm utilizes entropy to determine the degree of variation in the data, whereby larger entropy has an adverse effect on the selection of the next point.

$$\mathrm{Entropy} = \sum_{i=1}^{c} \left(-p_i \times \log_2 p_i\right) \tag{5}$$
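A small illustrative helper implementing Equation (5) over a node's class proportions:

```python
# Illustrative helper implementing Equation (5).
import math

def entropy(proportions: list[float]) -> float:
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: maximal impurity for two balanced classes
```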
Figure 7. Structure of the decision tree.
4. Performance Measures
Assessing the quality of sentiment analysis involves various metrics and indicators
to measure its performance. Several research papers used Precision, Recall, F-score, Accuracy, and Matthews correlation coefficient (MCC) as their performance measures, namely,
Novendri et al. [29], Musleh et al. [30], Singh and Tiwari [31], and more. Each of these has
a unique formula for calculating the rate of software performance. We evaluated how well
classifiers performed by using mathematical formulas. The symbols used in the formulas
are as per the following definitions: “TP” signifies instances that are accurately classified as
positive, “TN” represents instances that are accurately classified as negative, “FP” denotes
instances that are erroneously classified as positive, and “FN” refers to instances that are
erroneously classified as negative. The formulas are:
Accuracy is obtained by performing the calculation of true positives and negatives
divided by the total comments [32].
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$
The precision of a classifier was evaluated by computing the number of false positives
it produced [33].
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
The Recall of a classifier was evaluated by computing the number of false negatives it
generated [33].
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
The F1-measure or F1-score was determined by finding the weighted harmonic mean
of both the recall and precision measurements [34].
$$\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
The calculation of the Matthews Correlation Coefficient relies on the information
derived from the confusion matrix [35].
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{10}$$
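All five measures are available in scikit-learn; the following short sketch (with hypothetical labels) shows how they can be computed:

```python
# Illustrative computation of the reported metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = [1, 0, 1, 1, 0]   # hypothetical labels (1 = positive, 0 = negative)
y_pred = [1, 0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred), matthews_corrcoef(y_true, y_pred))
```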
5. Experimental Results and Discussion

The dataset was divided with a stratified split into 70% for training and 30% for testing, utilizing 5-fold cross-validation to ensure the robustness and reliability of the results.
| Classifier | N-Gram Range | Count Vectorizer | TF-IDF | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|---|---|
| SVM | (1, 1) | With | With | 94.38 | 94.39 | 94.38 | 94.38 | 88.81 |
| SVM | (1, 1) | With | Without | 92.33 | 92.27 | 92.40 | 92.31 | 84.34 |
| NB | (1, 3) | With | With | 94.38 | 94.40 | 94.40 | 94.38 | 90.82 |
| NB | (1, 3) | With | Without | **94.62** | **94.64** | **94.64** | **94.62** | **91.46** |
| LR | (1, 3) | With | With | 93.75 | 93.83 | 93.79 | 93.75 | 87.46 |
| LR | (1, 3) | With | Without | 92.09 | 92.13 | 92.08 | 92.08 | 85.88 |
| KNN | (1, 3) | With | With | 92.48 | 92.49 | 92.50 | 92.48 | 84.24 |
| KNN | (1, 3) | With | Without | 87.82 | 88.55 | 87.45 | 87.65 | 72.04 |
| DT | (1, 3) | With | With | 84.18 | 84.20 | 84.20 | 84.18 | 73.49 |
| DT | (1, 3) | With | Without | 84.10 | 84.20 | 84.09 | 84.08 | 69.64 |
| RF | (1, 2) | With | With | 91.77 | 91.80 | 91.77 | 91.77 | 83.02 |
| RF | (1, 2) | With | Without | 82.67 | 85.72 | 80.05 | 81.05 | 79.75 |

Best results are shown in bold.
Figure 8. The influence of TF-IDF on the classification of YouTube comments.
Figure 9 depicts word clouds showcasing the occurrence of lexemes that are indicative of positive affective orientations in the YouTube comments. The size of the font used for each Arabic word in the word clouds corresponds to its frequency of use, with more frequently used words appearing in larger font sizes. Therefore, Figure 9a highlights the positive Arabic words, such as "شكرا" meaning "Thanks", while Figure 9b presents the same positive words translated into English.
Figure 9. Word cloud of positive comments: (a) original Arabic words; (b) the same words translated into English.
The violin plot presented in Figure 10 visualizes the distribution of word counts in comments, categorized as "Positive" or "Negative", with the aim of examining the relationship between comment length and sentiment. The analysis reveals that comments of a prolix nature, containing 100 words or more, tend to convey a negative sentiment from the comment writers, as indicated by the taller portion of the violin plot for the "Negative" category. Furthermore, the graph displays a denser concentration of comments with word counts around 37 on the positive side, suggesting a higher likelihood of positive sentiment.
Figure 10. Distribution of word counts in comments categorized as 'Positive' or 'Negative'.
6. Conclusions
Based on the findings, it is apparent that TF-IDF can improve the performance of
most classifiers used in the investigation, with the exception of Naïve Bayes. Naïve Bayes
showed satisfactory results when using only n-grams and a count-vectorizer. However, the
performance of Naïve Bayes was slightly better without incorporating TF-IDF compared to
when TF-IDF was enabled. Specifically, Naïve Bayes achieved an accuracy of 94.62% and
an MCC score of 91.46%. In contrast, the Decision Tree exhibited low performance with
an accuracy of 84.10% and an MCC score of 69.64% when TF-IDF was not enabled. The
study specifically focused on sentiment analysis of YouTube comments from various videos,
with the objective of predicting whether the video content was categorized as positive
or negative. Manual labeling of comments into “Positive” and “Negative” classes was
conducted after preprocessing the dataset. The dataset achieved a Kappa score of 0.818,
indicating a substantial level of agreement between the annotators. A total of six distinct
supervised ML text classifiers were employed to predict the video recommendation status.
Notably, Naïve Bayes exhibited the best performance, attaining an accuracy rate of 94.62% even in the absence of TF-IDF integration.
In the upcoming stages, we aim to enhance the performance of our ML model in
various ways, including expanding the dataset by adding more comments to it, testing
other types of machine learning classifiers, and even exploring the growing popularity of
techniques such as RNNs, transformer-based neural network models, and large language
models, which have witnessed significant progress in recent times. Finally, we could also
augment the dataset with supplementary features like user engagement metrics such as
the number of likes, dislikes, or replies received by a comment. By implementing these
improvements, we can aim to make our model even more effective and accurate.
References
1. Tiwari, S.; Trivedi, M.C.; Kolhe, M.L.; Mishra, K.K.; Singh, B.K. Advances in Data and Information Sciences, Proceedings of ICDIS 2021;
Springer Nature Singapore Pte Ltd.: Singapore, 2022; ISBN 978-981-16-5688-0.
2. AlOtaibi, S.; Khan, M.B. Sentiment analysis challenges of informal Arabic language. Int. J. Adv. Comput. Sci. Appl. 2017, 8.
[CrossRef]
3. Rao, L. Sentiment Analysis of English Text with Multilevel Features. Sci. Program. 2022, 2022, 7605125. [CrossRef]
4. Samsir, S.; Kusmanto, K.; Dalimunthe, A.H.; Aditiya, R.; Watrianthos, R. Implementation Naïve Bayes Classification for Sentiment
Analysis on Internet Movie Database. Build. Inform. Technol. Sci. 2022, 4, 1–6. [CrossRef]
5. Geetha, R.; Padmavathy, T.; Anitha, R. Prediction of the academic performance of slow learners using efficient machine learning
algorithm. Adv. Comput. Intell. 2021, 1, 5. [CrossRef]
6. Umer, M.; Ashraf, I.; Mehmood, A.; Kumari, S.; Ullah, S.; Sang Choi, G. Sentiment analysis of tweets using a unified convolutional
neural network-long short-term memory network model. Comput. Intell. 2021, 37, 409–434. [CrossRef]
7. Murthy, G.S.N.; Allu, S.R.; Andhavarapu, B.; Bagadi, M.; Belusonti, M. Text based Sentiment Analysis using LSTM. Int. J. Eng.
Res. 2020, 9. [CrossRef]
8. Agrawal, S.; Awekar, A. Deep learning for detecting cyberbullying across multiple social media platforms. In Proceedings of the
European Conference on Information Retrieval, Grenoble, France, 26–29 March 2018; Springer International Publishing: Cham,
Switzerland, 2018; pp. 141–153.
9. Benkhelifa, R.; Laallam, F.Z. Opinion extraction and classification of real-time youtube cooking recipes comments. In Proceedings
of the International Conference on Advanced Machine Learning Technologies and Applications, Cairo, Egypt, 22–24 February
2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 395–404.
10. Al-Tamimi, A.K.; Shatnawi, A.; Bani-Issa, E. Arabic sentiment analysis of YouTube comments. In Proceedings of the 2017 IEEE
Jordan Conference on Applied Electrical Engineering and Computing Technologies, AEECT 2017, Aqaba, Jordan, 11–13 October
2017; pp. 1–6.
11. Alakrot, A.; Murray, L.; Nikolov, N.S. Towards Accurate Detection of Offensive Language in Online Communication in Arabic.
Procedia Comput. Sci. 2018, 142, 315–320. [CrossRef]
12. Mohaouchane, H.; Mourhir, A.; Nikolov, N.S. Detecting Offensive Language on Arabic Social Media Using Deep Learning. In
Proceedings of the 2019 6th International Conference on Social Networks Analysis, Management and Security, SNAMS 2019,
Granada, Spain, 22–25 October 2019; pp. 466–471.
13. Mohammed, A.; Kora, R. Deep learning approaches for Arabic sentiment analysis. Soc. Netw. Anal. Min. 2019, 9, 52. [CrossRef]
14. Ombabi, A.H.; Ouarda, W.; Alimi, A.M. Deep learning CNN–LSTM framework for Arabic sentiment analysis using textual
information shared in social networks. Soc. Netw. Anal. Min. 2020, 10, 53. [CrossRef]
15. Hadwan, M.; Al-Hagery, M.; Al-Sarem, M.; Saeed, F. Arabic Sentiment Analysis of Users’ Opinions of Governmental Mobile
Applications. Comput. Mater. Contin. 2022, 72, 4675–4689. [CrossRef]
16. Khabour, S.M.; Al-Radaideh, Q.A.; Mustafa, D. A new ontology-based method for Arabic sentiment analysis. Big Data Cogn.
Comput. 2022, 6, 48. [CrossRef]
17. Alqarni, A.; Rahman, A. Arabic Tweets-Based Sentiment Analysis to Investigate the Impact of COVID-19 in KSA: A Deep
Learning Approach. Big Data Cogn. Comput. 2023, 7, 16. [CrossRef]
18. Fischer, T.; Krauss, C. Deep learning with long short-term memory networks for financial market predictions. Eur. J. Oper. Res.
2018, 270, 654–669. [CrossRef]
19. Huq, M.R.; Ahmad, A.; Rahman, A. Sentiment analysis on Twitter data using KNN and SVM. Int. J. Adv. Comput. Sci. Appl. 2017,
8. [CrossRef]
20. Hiraoka, T.; Shindo, H.; Matsumoto, Y. Stochastic tokenization with a language model for neural text classification. In Proceedings
of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1620–1629.
21. Deng, L.; Liu, Y. Deep Learning in Natural Language Processing; Springer: Singapore, 2018; ISBN 9789811052095.
22. Sahni, T.; Chandak, C.; Chedeti, N.R.; Singh, M. Efficient Twitter sentiment classification using subjective distant supervision. In
Proceedings of the 2017 9th International Conference on Communication Systems and Networks (COMSNETS), Bengaluru, India,
4–8 January 2017; pp. 548–553.
23. Liu, G.; Mao, S.; Kim, J.H. A mature-tomato detection algorithm using machine learning and color analysis. Sensors 2019, 19, 2023.
[CrossRef] [PubMed]
24. Ghallab, A.; Mohsen, A.; Ali, Y. Arabic Sentiment Analysis: A Systematic Literature Review. Appl. Comput. Intell. Soft Comput.
2020, 2020, 7403128. [CrossRef]
25. Sharma, A. Decision Tree vs. Random Forest—Which Algorithm Should You Use? Available online: https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/ (accessed on 18 October 2022).
26. Duwairi, R.M.; Qarqaz, I. Arabic sentiment analysis using supervised classification. In Proceedings of the 2014 International
Conference on Future Internet of Things and Cloud, FiCloud 2014, Barcelona, Spain, 27–29 August 2014; pp. 579–583.
27. Hammad, M.; Al-Awadi, M. Sentiment analysis for Arabic reviews in social networks using machine learning. In Advances in
Intelligent Systems and Computing; Springer: Berlin/Heidelberg, Germany, 2016; Volume 448, pp. 131–139, ISBN 9783319324661.
28. Kleinbaum, D.G.; Dietz, K.; Gail, M.; Klein, M.; Klein, M. Logistic Regression; Springer: Berlin/Heidelberg, Germany, 2002;
ISBN 0387953973.
29. Novendri, R.; Callista, A.S.; Pratama, D.N.; Puspita, C.E. Sentiment analysis of YouTube movie trailer comments using naïve
bayes. Bull. Comput. Sci. Electr. Eng. 2020, 1, 26–32. [CrossRef]
30. Musleh, D.A.; Alkhales, T.A.; Almakki, R.A.; Alnajim, S.E.; Almarshad, S.K.; Alhasaniah, R.S.; Aljameel, S.S.; Almuqhim, A.A.
Twitter arabic sentiment analysis to detect depression using machine learning. Comput. Mater. Contin 2022, 71, 3463–3477.
[CrossRef]
31. Singh, R.; Tiwari, A. Youtube comments sentiment analysis. Int. J. Sci. Res. Eng. Manag. 2021, 5.
32. Aribowo, A.S.; Basiron, H.; Yusof, N.F.A.; Khomsah, S. Cross-domain sentiment analysis model on indonesian youtube comment.
Int. J. Adv. Intell. Inform. 2021, 7, 12–25. [CrossRef]
33. Al-Twairesh, N.; Al-Negheimish, H. Surface and deep features ensemble for sentiment analysis of arabic tweets. IEEE Access 2019,
7, 84122–84131. [CrossRef]
34. Alsubait, T.; Alfageh, D. Comparison of Machine Learning Techniques for Cyberbullying Detection on YouTube Arabic Comments.
Int. J. Comput. Sci. Netw. Secur. 2021, 21, 1–5.
35. Muaad, A.Y.; Jayappa, H.; Al-antari, M.A.; Lee, S. ArCAR: A novel deep learning computer-aided recognition for character-level
Arabic text representation and recognition. Algorithms 2021, 14, 216. [CrossRef]