Toxic Comment Analysis For Online Learning

The paper discusses the increasing prevalence of toxic comments in online learning environments, exacerbated by the COVID-19 pandemic, and proposes a machine learning approach to classify and analyze these comments. Utilizing a self-prepared dataset combined with Kaggle's toxic comment dataset, the authors apply various algorithms including Logistic Regression and Random Forest to categorize comments into types such as 'Toxic', 'Severely Toxic', and 'Insult'. The goal is to reduce online bullying and create a healthier educational atmosphere for both teachers and students.

2021 2nd International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS) | 2-4 September 2021 | Ernakulam

Toxic Comment Analysis for Online Learning


Manaswi Vichare, Dept. of Computer Science and Engineering, MIT School of Engineering, MIT Art, Design and Technology University, Pune, India, [email protected]
Sakshi Thorat, Dept. of Computer Science and Engineering, MIT School of Engineering, MIT Art, Design and Technology University, Pune, India, [email protected]
Cdt. Saiba Uberoi, Dept. of Computer Science and Engineering, MIT School of Engineering, MIT Art, Design and Technology University, Pune, India, [email protected]
Sheetal Khedekar, Dept. of Computer Science and Engineering, MIT School of Engineering, MIT Art, Design and Technology University, Pune, India, [email protected]
Sagar Jaikar, Dept. of Computer Science and Engineering, MIT School of Engineering, MIT Art, Design and Technology University, Pune, India, [email protected]

978-1-7281-7136-4/21/$31.00 ©2021 IEEE | DOI: 10.1109/ACCESS51619.2021.9563344

Abstract—Due to the recent circumstances of the pandemic, online platforms are becoming more and more essential for communication in many sectors. But because of this, a lot of negativity and toxic comments are surfacing, resulting in degradation and online abuse. Educational systems and institutions rely heavily on such platforms for e-learning, leading to unrestricted attacks of toxic and negative comments towards teachers and students. This work aims to reduce the issues of constant bullying and online abuse. The comments are classified according to the labels of our self-prepared dataset combined with Kaggle's toxic comment dataset, namely toxic, severely toxic, obscene, threat, insult, and identity hate. Machine Learning algorithms such as Logistic Regression, Random Forest, and Multinomial Naive Bayes are used. For evaluation, ROC curves and Hamming scores are used. The output is shown as the rate of each category as a percentage and in a graphical format. This work will help reduce the online bullying and harassment faced by teachers and students and help create a non-toxic learning environment. In this way, the main focus will be on studying rather than on being de-motivated and discouraged by hateful comments, and the number of people posting toxic comments will also be reduced.

Keywords—toxic, comments, e-learning, machine learning, analysis

I. INTRODUCTION
The tough times of the Covid-19 pandemic have opened the doors of online interaction to the maximum. The lockdown norms due to Covid-19 have led to restrictions in social and professional interactions in every sector. But this hasn't stopped us from working from home using the numerous online platforms available to conduct meetings, seminars, etc. Online platforms provide countless advantages, which include uninterrupted communication between people regardless of the distance between them, time and cost efficiency, easy access to information, etc. Due to these benefits, one of the sectors that has been relying heavily on such platforms is education.

E-Learning has gained more and more recognition due to the changing times. Unlike offline teaching, in online learning you can access the study material an unlimited number of times, conduct lectures multiple times, and access updated content, and it also helps in saving time. Even though there is a bright side to online learning, a drawback that is arising is the toxic and impolite comments directed towards teachers and students. Teachers are facing a lot of issues due to this serious problem, and are helpless while facing them if they are to maintain their professionalism. Such comments degrade teachers' morale and lead to psychological harassment. They can also lead to severe anxiety and glossophobia.

This paper will help to understand how we can classify and analyze toxic comments faced in online learning using Machine Learning techniques. The comments are classified, with the help of our self-prepared dataset combined with Kaggle's toxic comment dataset, as "Toxic", "Severely Toxic", "Obscene", "Threat", "Insult" and "Identity hate", keeping in mind the e-learning environment. The input is the comment, followed by the training and classification of the model using Machine Learning. The output is shown as the rate of each category as a percentage and in a graphical format. This work will help reduce the online bullying and harassment faced by teachers and students and help create a non-toxic learning environment. In this way, the main focus will be on studying rather than on being de-motivated and discouraged by hateful comments, and the number of people posting toxic comments will also be reduced.

II. LITERATURE REVIEW
In paper [1], the author focuses on analyzing any piece of text for different types of toxicity, for example, obscenity, insults, threats, and racism. The author explains the result in two sub-sections, that is, Individual and Collective results. A Mean Validation Accuracy of 98.08% and an Absolute Validation Accuracy of 91.61% have been achieved by the Six Headed Model. The author concludes that the paper is intended as a way to promote fair online conversation. As a future scope, the author wants to develop a more robust model by using Grid Search Algorithms. In paper [2], the authors wanted to build a Machine Learning model to identify toxicity in online conversations. They developed three different models to predict scores for toxicity, severe toxicity, insult, threat, obscenity, and identity attack. The methods used are the Naive Bayes SVM model, the LSTM model, the BERT model, and Ensembling. An F1 score of 81.19% and an EM score of 95.54% were achieved as the best performance of a single model. However, when two ensemble methods were applied, the F1 score increased to 84.28%, with an EM score of 95.14%.


In paper [3], the authors introduce deep learning techniques to classify toxicity in comments. The paper investigates the effects of three different neural network models, that is, Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), and Long Short-Term Memory network (LSTM), at word and character level granularity on binary and multi-label classification tasks. The authors conclude that LSTM and CNN were the best models for word-level and multi-label classification. In paper [4], the authors discuss how commenting spaces in various online forums have become a target of cyber bullying and online abuse. Deep learning models such as recurrent neural networks and convolutional neural networks are used with word embeddings, along with another model that uses a CNN with character embeddings. Upon successful implementation, an accuracy of 0.94 was observed. Moreover, it was found that despite having more training iterations than other models, CNNs with character embedding provided the highest performance.

In paper [5], the authors detect hate speech on Twitter using deep learning techniques. The models used in this paper are CNN and LSTM, along with FastText. These networks are trained with back-propagation using labeled data. After training, a tweet is tested and classified as racist or not. The best accuracy values were obtained when deep neural network embeddings were paired with gradient boosted decision trees. In paper [6], the solution makes use of a publicly accessible embedding model that has been evaluated against a Twitter hate speech corpus. The authors also verified the results against a prominent sentiment dataset to ensure that they are reliable. With a neural network architecture, the authors have illustrated the utility of ensemble approaches.

III. PROPOSED METHODOLOGY
A. Dataset
The dataset is self-made and combined with Kaggle's toxic comment dataset, keeping in mind the environment of online learning. It contains sub-categories of toxicity such as toxic, severely toxic, obscene, threat, insult, and identity hate. It also contains an id number and a corresponding text comment. Each toxicity category has a target value of 0 or 1. Fig. 1 shows a glimpse of the Toxic comment dataset.

Fig. 1. Toxic Comment Dataset

B. Data Preprocessing
Various data preprocessing methods can be seen in [7]. Let us consider our training dataset, in which we have classified each category for the comments that we have used, whether it is a positive comment or a negative comment. Firstly, we remove the stop-words (such as: is, the, able, on), which do not add anything to the classification. Then we have to calculate the probability of positive and negative comments. To make the machine understand the comments, we have to convert the categorical data into numeric data using the term frequency-inverse document frequency (TF-IDF). The main aim of TF-IDF is to improve the efficiency and performance of the classification algorithm by removing unwanted data. Data preprocessing is implemented to keep useful words and avoid unnecessary data that won't help in classification. Hence, this helps the machine to get better input. The following steps help to eliminate the irrelevant data from the document.

• Read the document.
• Every word in the document is split into small tokens; this is done by Tokenization.
• After tokenizing, we find out which words are to be used and which are not, and stop-words are removed as well.
• Lemmatization is done to remove inflectional endings and give the base word.

The conversion of categorical data to numeric data is necessary in data preprocessing. For the text to be understandable by the machine, we have to convert all the data to numeric form. This can be done by two methods: the first is Bag of Words, and the second is TF-IDF. Bag of Words is not used here because it has some drawbacks; for example, it weights words by raw count alone.

The data pre-processing is done using the following techniques:

• Lexicon Normalization: The task of translating or transforming a nonstandard text to a standard register is known as lexical normalization. Replacements are done at the word level.
• Lemmatization: Lemmatization uses a vocabulary and morphological analysis of words to remove only inflectional endings and return a word's base or dictionary form, known as the lemma.
• TF-IDF: TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a typical approach for converting text into a meaningful numerical representation, which is then used to fit a machine learning algorithm for prediction. Term Frequency is the ratio of the number of occurrences of the word to the total number of words in a document; as the occurrences increase, the term frequency increases too. Inverse Document Frequency is calculated as log(N/n), as explained in equations (1) and (2). TF-IDF algorithm:
1. Follow the data preprocessing steps.
2. Then calculate both TF and IDF using,

$$\mathrm{TF} = \frac{\text{number of occurrences of the word}}{\text{total number of words in the document}} \tag{1}$$

$$\mathrm{IDF} = \log\left(\frac{N}{n}\right) \tag{2}$$

where N is the number of documents and n is the number of documents that contain the word, as referred in [8].

C. Block Diagram
The block diagram, Fig. 2, gives a diagrammatic representation of how the attributes will function. The


workflow begins by feeding the model an input comment, followed by data preprocessing and feature extraction. The input is evaluated against the trained data and classified into the specific toxic comment category to which it belongs.

Fig. 2. Block Diagram of Toxic Comment Analysis for Online Learning

D. Machine Learning Techniques
Steps to apply for Toxic Comment Analysis for Online Learning:

1. Importing the required libraries.
2. Assigning different classes.
3. Importing the dataset.
4. Converting categorical data to numeric.
5. Models: Logistic Regression, Random Forest, Multinomial Naive Bayes.
6. Training and testing the data.
7. Model Evaluation.
8. Plotting of Histograms.

Three machine learning algorithms are used: Logistic Regression, Multinomial Naive Bayes and Random Forest.

• Logistic Regression:
An overview of the Logistic Regression model is given in [9] and [10]. Logistic regression, a supervised learning algorithm, is used to predict the probability of a target variable. The target variable is binary in nature, that is, the data is expressed as 1 or 0. The confusion matrix, specificity, sensitivity, and accuracy score can all be used to evaluate a logistic regression model, with a focus on the true positive or false negative. Logistic regression predicts its output using the sigmoid function, which maps any real value to a value between 0 and 1 and therefore returns a probability.

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{3}$$

where X is the input value, β0 is the intercept term, and β1 is the coefficient for the single input value X.

Logistic Regression is mathematically represented as,

$$\log\left[\frac{y}{1 - y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \tag{4}$$

Equation (4) expresses the log-odds as a linear function of the inputs, in the same form as Linear Regression. After successful implementation of data preprocessing, the dependent and independent variables are extracted. The dataset is then split into training and testing sets using 'train_test_split' from 'sklearn.model_selection'. To fit the model to the training set, we use the 'LogisticRegression' class from the 'sklearn.linear_model' library. The parameters solver and penalty select the optimization algorithm and specify the norm to be used in penalization. Once the fitting is done, the test result prediction is carried out by defining a prediction variable. The data visualization is carried out using ROC curves across true positive rate and false positive rate axes.

• Multinomial Naive Bayes:
A detailed overview of the Naïve Bayes Algorithm is given in [11] and [12]. This classification algorithm is based on the Bayes theorem, which determines the probability P(c|x), where c is the class of probable outcomes and x is the supplied case to be identified, which represents some specific characteristics.

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{5}$$

Multinomial Naïve Bayes uses term frequency, that is, the number of times a specific term appears in a document. After normalization, term frequency can be used to estimate the conditional probability using maximum likelihood estimates based on the training data. To look at an individual word and not a sentence, we have assumed the Naive condition, under which every word is treated as independent of the others.

In this algorithm, we fit the model to the training set using the 'MultinomialNB' class found in the 'sklearn.naive_bayes' library. Accordingly, the test result prediction and data evaluation are implemented.

• Random Forest:
A thorough explanation of the Random Forest Algorithm is given in [13] and [14]. Random Forest is an ensemble learning algorithm that constructs multiple decision trees and combines them to obtain a


stable and accurate prediction. A class prediction is produced by each decision tree, and the model's prediction is the class with the highest number of votes. The random forest algorithm, as shown in Fig. 3, takes place in two phases; the first is to combine N decision trees to produce the random forest, and the second is to make predictions for each tree produced in the first phase.

Fig. 3. Example of Random Forest Algorithm

The created dataset is given to the Random Forest Classifier. Each decision tree generates a prediction during the training phase, and when a new data point is introduced, the Random Forest classifier predicts the final decision based on the majority of results. After the data preprocessing, to fit the model to the training set we use the 'RandomForestClassifier' class from the 'sklearn.ensemble' library. The parameter 'n_estimators' sets the number of trees required. Once the fitting is done, the test result prediction is carried out by defining a prediction variable. The data visualization is carried out using ROC curves across true positive rate and false positive rate axes.

E. Data Evaluation
The data evaluation of the used models is done by using ROC curves and Hamming score.

• ROC Curves:
A receiver operating characteristic curve (ROC curve) is a graph, as shown in Fig. 4, which shows how well a classification model performs across all classification thresholds. Two parameters are plotted on this curve: True Positive Rate and False Positive Rate, which are computed from the counts shown in Table I. ROC Curve implementation can also be observed in [15].

TABLE I. ROC CURVE

| True class | Predicted: - or Null | Predicted: + or Non-Null | Total |
|---|---|---|---|
| - or Null | True Negative (TN) | False Positive (FP) | N |
| + or Non-Null | False Negative (FN) | True Positive (TP) | P |
| Total | N* | P* | |

The values of True Positive Rate and False Positive Rate are calculated as,

$$\text{True Positive Rate} = \frac{TP}{TP + FN} \tag{6}$$

$$\text{False Positive Rate} = \frac{FP}{FP + TN} \tag{7}$$

Fig. 4. ROC Curve Graphical Representation

• Hamming Score:
For each instance, accuracy is the proportion of correctly predicted labels to the total number of labels (predicted and actual). The overall accuracy is calculated as the average over all instances. It is also referred to as the Hamming Score of the model.

IV. RESULTS
Results obtained after successful implementation are tabulated in Table II. The hamming score obtained by using Logistic Regression with L2 Penalty and 3 Fold CV is 0.87 and the CV score obtained for each label is in the range of 0.89 to 0.96. The mean area under the curve for Logistic Regression is 0.92. The ROC curve obtained is shown in Fig. 5. For the Multinomial Naive Bayes Algorithm, a hamming score of 0.85 is obtained. The ROC curve is shown in Fig. 6 and the mean area under the curve is 0.92. The Random Forest Algorithm, shown in Fig. 7, provided a hamming score of 0.85 with a mean area under the curve of 0.89.

TABLE II. RESULTS

| Algorithm | Mean AUC | Hamming Score | Label AUC range |
|---|---|---|---|
| Logistic Regression | 0.92 | 0.92 | 0.89 - 0.96 |
| Multinomial Naive Bayes | 0.92 | 0.85 | 0.89 - 0.96 |
| Random Forest | 0.89 | 0.85 | 0.84 - 0.94 |

In the graphs below, we observe the bend of the curve for each category. A classifier performs well when its curve bends towards the top-left corner; in each graph, the 'severely toxic' category has the most pronounced bend of all the categories considered. This means that this particular category is the one that is classified best.
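As a minimal sketch of the evaluation measures described in the Data Evaluation subsection, the following Python code computes the True Positive Rate and False Positive Rate from confusion-matrix counts, and the Hamming Score as the per-instance ratio of correctly predicted labels to the union of predicted and actual labels, averaged over all instances. The function names, the toy label sets, the example counts, and the convention of scoring an instance with no actual and no predicted labels as 1.0 are illustrative assumptions and are not taken from the implementation used in this paper.

```python
from typing import List, Set


def true_positive_rate(tp: int, fn: int) -> float:
    """TPR = TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn)


def hamming_score(actual: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Mean over instances of |actual & predicted| / |actual | predicted|."""
    scores = []
    for a, p in zip(actual, predicted):
        union = a | p
        if not union:
            # Assumption: a comment with no actual and no predicted labels
            # is counted as a perfect match.
            scores.append(1.0)
        else:
            scores.append(len(a & p) / len(union))
    return sum(scores) / len(scores)


# Hypothetical toy data: label sets for three comments.
actual = [{"toxic", "insult"}, {"threat"}, set()]
predicted = [{"toxic"}, {"threat"}, set()]

# Hypothetical confusion-matrix counts for a single label.
tpr = true_positive_rate(tp=8, fn=2)
fpr = false_positive_rate(fp=1, tn=9)
score = hamming_score(actual, predicted)
print(tpr, fpr, score)
```

Plotting the (FPR, TPR) pairs obtained at different decision thresholds gives the ROC curve for one label; the mean AUC is then the average of the per-label areas.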


The maximum AUC for the 'severely toxic' category is observed to be 0.96.

Fig. 5. ROC Curve for Logistic Regression

Fig. 6. ROC Curve for Multinomial NB

Fig. 7. ROC Curve for Random Forest

A histogram is a data representation that looks like a bar graph and buckets a variety of outcomes into columns along the x-axis. The y-axis can be used to illustrate data distributions by representing the numerical count or percentage of occurrences in each column. The histograms shown in Table III provide the most used words in the dataset for each toxic category, that is, for toxic, severely toxic, insult, identity hate, threat, and obscene. For example, in the "Threat" category the most used word is "die", which is used in almost 16% of the dataset.

TABLE III. HISTOGRAMS
(Histogram panels: Toxic, Severely toxic, Obscene, Threat, Insult, Identity hate)

V. CONCLUSION
With the rise in online interactions, there's always a fear of getting harassed or bullied in online conversations, which limits the potential to express ourselves. Due to their numerous advantages, online platforms are being favored by most sectors, such as IT companies, organizations, etc., in these tough times of Covid-19. This includes the educational sector as well, which is adopting E-learning through online platforms. But, unfortunately, cases of teachers being ill-mouthed in online conversations during lectures or seminars are surfacing, which is causing a lot of distress and de-motivation to the teaching faculty. To tackle this issue, the algorithms help to classify and analyze the data for the online learning environment.

In this paper, three models for toxic comment classification are proposed: the Logistic Regression model, Multinomial Naive Bayes and Random Forest. The models are used to classify the toxic comments as toxic, severely toxic, insult, threat, obscene, and identity hate. After data collection and preprocessing with the help of lemmatization, lexicon normalization and the TF-IDF algorithm, we train and test the models and evaluate them using ROC curves and hamming score. Based on the obtained results, we can conclude that the most effective is the Logistic Regression model, which provides the best accuracy: 0.92 when testing on a training data set. As a future scope, a better and faster model in place of Random Forest can be implemented. We can implement deep learning models to obtain a much higher accuracy rate.
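The TF and IDF definitions used for feature extraction in this paper can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization, the natural logarithm, and a hypothetical three-comment corpus; the function names are illustrative and not part of the paper's implementation. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ from this plain form.

```python
import math
from collections import Counter
from typing import Dict, List


def term_frequency(tokens: List[str]) -> Dict[str, float]:
    """TF = occurrences of the word / total number of words in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """IDF = log(N / n): N documents in total, n documents containing the word."""
    n_docs = len(docs)
    # Count each word once per document it appears in.
    doc_counts = Counter(word for tokens in docs for word in set(tokens))
    return {word: math.log(n_docs / n) for word, n in doc_counts.items()}


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: TF * IDF} mapping per document."""
    idf = inverse_document_frequency(docs)
    return [
        {word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
        for tokens in docs
    ]


# Hypothetical toy corpus standing in for preprocessed comments.
corpus = ["you are great", "you are rude", "great lecture today"]
docs = [comment.lower().split() for comment in corpus]
weights = tf_idf(docs)

for comment, w in zip(corpus, weights):
    print(comment, {word: round(value, 3) for word, value in w.items()})
```

In this toy corpus, a word that appears in only one comment (such as "rude") receives a higher weight than a word shared by two comments (such as "you"), which is the behaviour that lets TF-IDF down-weight common words.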


REFERENCES
[1] N. Chakrabarty, "A Machine Learning Approach to Comment Toxicity Classification," Computational Intelligence in Pattern Recognition, 2019, arXiv:1903.06765
[2] Hao Li, Weiquan Mao, Hanyuan Liu, "Toxic Comment Detection and Classification," Stanford University, CS229 Machine Learning, 2019
[3] Kevin Khieu, Neha Narwal, "Detecting and Classifying Toxic Comments," Stanford University, CS224N, 2017
[4] Theodora Chu, Kylie Jue, Max Wang, "Comment Abuse Classification with Deep Learning," Stanford University, CS224N, 2017
[5] Manish Gupta, Vasudeva Varma, Pinkesh Badjatiya, Shashank Gupta, "Deep learning for hate speech detection in tweets," ACM WWW'17 Companion, Perth, Western Australia, April 2017, DOI: 10.1145/3041021.3054223
[6] Steven Zimmerman, Chris Fox, Udo Kruschwitz, "Improving Hate Speech Detection with Deep Learning Ensembles," Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), 2018
[7] F. Mohammad, "Is preprocessing of text really worth your time for toxic comment classification?," Proceedings on the International Conference on Artificial Intelligence (ICA), 2018, arXiv:1806.02908
[8] Mujahed A. Saif, Alexander N. Medvedev, Maxim A. Medvedev, Todorka Atanasova, "Classification of online toxic comments using the logistic regression and neural networks models," AIP Conference Proceedings 2048, 060011, 2018, DOI: 10.1063/1.5082126
[9] Patrick Ozoh, Adepeju Abeke Adigun, Olayiwola M. O., "Identification and Classification of Toxic Comments on Social Media using Machine Learning Techniques," International Journal of Research and Innovation in Applied Science (IJRIAS), 2019
[10] Chao-Ying Joanne Peng, Kuk Lida Lee, Gary M. Ingersoll, "An Introduction to Logistic Regression Analysis and Reporting," The Journal of Educational Research, September 2002, DOI: 10.1080/00220670209598786
[11] Zaheri Sara, Leath Jeff, and Stroud David, "Toxic Comment Classification," SMU Data Science Review, 2020
[12] Pouria Kaviani, Sunita Dhotre, "Short Survey on Naive Bayes Algorithm," International Journal of Advance Engineering and Research Development, Volume 4, Issue 11, November 2017
[13] Stephen Marsland, "Machine Learning: An Algorithmic Perspective, Second Edition," Chapman & Hall/CRC Machine Learning & Pattern Recognition, November 2014
[14] Jehad Ali, Rehanullah Khan, Nasir Ahmad, Imran Maqsood, "Random Forests and Decision Trees," IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No 3, September 2012
[15] Pallam Ravi, Hari Narayana Batta, Greeshma S, Shaik Yaseen, "Toxic Comment Classification," International Journal of Trend in Scientific Research and Development (IJTSRD), June 2019, DOI: 10.31142/ijtsrd23464
