Applied Sciences
Article
Arabic Hate Speech Detection Using Deep Recurrent
Neural Networks
Faisal Yousif Al Anezi

Management Information System Department, Prince Mohammad bin Fahd University,
Al-Khobar 34754, Saudi Arabia; [email protected]

Abstract: With the vast number of comments posted daily on social media and other platforms,
manually monitoring internet activity for possible national security risks or cyberbullying is an
impossible task. However, with recent advances in machine learning (ML), the automatic monitoring
of such posts for possible national security risks and cyberbullying becomes feasible. There is still the
issue of privacy on the internet; however, in this study, only the technical aspects of designing an
automated system that could monitor and detect hate speech in the Arabic language were targeted,
which many companies, such as Facebook, Twitter, and others, could use to prevent hate speech
and cyberbullying. For this task, a unique dataset consisting of 4203 comments classified into
seven categories (content against religion, racist content, content against gender equality,
insulting/bullying content, violent/offensive content, normal positive comments, and normal
negative comments) was designed. The dataset was extensively preprocessed and labeled, and its
features were extracted. In addition, the use of deep recurrent neural networks (RNNs) was proposed
for the classification and detection of hate speech. The proposed RNN architecture, called DRNN-2,
consisted of 10 layers and was trained with a batch size of 32 for 50 iterations for the classification task. Another model
consisting of five hidden layers, called DRNN-1, was used only for binary classification. Using the
proposed models, a recognition rate of 99.73% was achieved for binary classification, 95.38% for the
three classes of Arabic comments, and 84.14% for the seven classes of Arabic comments. This accuracy
was high for the classification of a complex language, such as Arabic, into seven different classes. The
achieved accuracy was higher than that of similar methods reported in the recent literature, whether
for binary classification, three-class classification, or seven-class classification, as discussed in the
literature review section.

Keywords: hate speech detection; Arabic comment classification; deep learning; recurrent neural
networks; bidirectional RNN; natural language processing

Citation: Anezi, F.Y.A. Arabic Hate Speech Detection Using Deep Recurrent Neural Networks. Appl. Sci. 2022, 12, 6010. https://doi.org/10.3390/app12126010

Academic Editor: Valentino Santucci

Received: 29 April 2022
Accepted: 5 June 2022
Published: 13 June 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Appl. Sci. 2022, 12, 6010. https://doi.org/10.3390/app12126010 https://www.mdpi.com/journal/applsci

1. Introduction
In Saudi Arabia, the future of the country, which has a mostly young population, will
depend on the cohesiveness of the communities and on authorities to keep the population
living as a cohesive unit. In the future, discrimination will not be tolerated against any part
of the population, be it expatriates or nationals with different views politically, socially,
economically, and in all other aspects within the community. For Saudi Arabia to prosper
in the future, it must be more welcoming to foreigners, tourists, businessmen, and others.
However, there will always be a small portion of any community that may disagree and
like to spread hate and intolerance against others. Thus, it has become a national security
issue to monitor hate speech across the different communication mediums (email, websites,
SMS, social media, voice, etc.); currently, there is a need for a fully automated machine
learning-based system for Arabic language communication monitoring. To monitor these
mediums manually is an impossible task, and to monitor them electronically using conventional
means would also yield poor results. In the age of artificial intelligence (AI), smart
intelligent systems that can progressively learn to identify hate, racism, and discriminative
words, phrases, and sentences will be required. This system will be smart enough to
identify these kinds of conversations and able to determine if they are one-time phrases or
if they fit a certain pattern for an individual or organization. In addition, the system
will be able to alert the authorities when repetitive unacceptable behavior is observed from
an individual, group, or organization. The system will be called a “Peace Monitor” and may
also be used to identify bullying on social media. Thus, an automatic machine learning-based
system that is able to detect and identify hate speech is especially needed in
countries that are currently progressing to be more inclusive; it will
also be useful for the prevention of cyberbullying, particularly among school-aged children with
access to social media platforms. A great deal of work has been carried out on the detection
of hate speech and cyberbullying in other languages, especially English; however, little
work has been completed in Arabic for various reasons, including the complex nature of
the Arabic language and the lack of unified large datasets. The aim of this study was to
develop a large Arabic language dataset with seven classes that could be used as the basis
for the development of an even larger dataset with more classes to be used as a unified
dataset for researchers in this field. In addition, this study aimed to develop a modified
machine learning algorithm that was able to detect and accurately classify hate speech
in Arabic.
Machine learning and deep learning algorithms have gained momentum in the past
decade as a means to automate many tasks that have conventionally been performed by
humans. For example, machine learning and deep learning algorithms are being explored
for the automated diagnosis of many diseases and health conditions [1–4]. However, this
is just an example, and the use of these algorithms has been researched in relation to all
aspects of human life, from self-driving cars to monitoring the quality of food ordered in
fast-food restaurants. The fast pace at which technology has advanced to produce high-
speed computing machines and the advancement of the Internet of Things (IoT), where
sensors are attached to everything and everything is connected to either the cloud or the
internet, has increased the availability of big data that can be analyzed to produce results
that are productive and ease human life. This has paved the way for machine learning and
deep learning algorithms to be explored in all walks of life to automate tasks previously
carried out by humans at much faster speeds and with more accuracy [5,6]. One of the areas
in which these algorithms have been explored is text recognition, whether it is handwritten
or typed text [7].
The main contributions of this research can be summarized as follows:
1. Built a new hate speech dataset in Arabic with seven different classes.
2. Modified machine learning algorithms to be able to correctly classify and detect hate
speech in Arabic.
The final product of such a system has various advantages and uses, and some of the
benefits of the final product are as follows:
1. The proposed system can be used in Arabic-speaking countries to ensure a future that
is consistent with peace and tolerance.
2. The system can be used by the authorities to save lives because it can predict violent
behavior before it occurs.
3. The system can be used in smart cities as a smart feature, which can have several
added features to ensure safety and security.
4. The system can be expanded to include other Arabic dialects and thus can be used in
other Arab countries.
5. The system can be used by western countries for Arabic-speaking populations.
The rest of the paper is organized as follows: Section 2 includes a literature review of
the algorithms explored for text classification and recognition. Section 3 details the novel
methodology proposed in this paper. Section 4 highlights the experimental results and
discussion. Section 5 highlights the conclusion.
2. Literature Review
A great deal of research has been carried out on hate speech in various languages,
especially English. The authors of [8] proposed the use of a convolutional neural network
for the classification of hate speech. The dataset they used divided hate speech into several
sections, including racist sentences, sexist sentences, harmful words, and threatening
phrases, such as death threats. The database consisted of 6655 tweets: 91 racism tweets,
946 sexism tweets, 18 tweets with both racism and sexism, and 5600 non-hate-speech tweets.
Among the different algorithms used, they reported that the best results achieved were
a precision of 86.61%, a recall of 70.42%, and an F-score of 77.38%. The authors of [9]
proposed the use of a support vector machine (SVM) for multiclass classification. Their
dataset consisted of 14,509 tweets annotated to three categories: hate, offensive, and OK.
In these experiments, the authors also proposed surface n-grams, word skip-grams, and
Brown clusters as the features to be extracted. The total dataset consisted of 2399 sentences
classified as hate, 4836 sentences classified as offensive, and 7274 sentences classified as
OK. They reported that the best accuracy of 78% was achieved using a character 4-gram model.
The authors of [10] collected 17,567 Facebook posts annotated as no hate, weak hate, and
strong hate as a dataset for their proposed system. They proposed the use of two methods
to capture and classify hate phrases and speech, which are published daily on one of
the largest social networking sites—Facebook. The two methods used in this study were
support vector machines (SVM) and long short-term memory (LSTM), which is a recurrent
neural network (RNN). SVMs are well suited to analyzing and classifying linguistic
texts, and they have performed well in polarity (sentiment)
classification tasks.
The LSTM, on the other hand, was used to capture longer-range context in
longer sentences that were not classified accurately on a first pass. Among their results,
the authors reported the highest accuracy of 80.6% using the SVM classifier. In [11], the
authors proposed the use of an ensemble neural network for the classification of hate
speech and used two publicly available datasets. Although accuracy was not reported as a
result, the reported mean of the ensemble was 78.62%.
The authors of [12] used seven publicly available datasets along with a state-of-the-
art linear SVM. However, this study concentrated on feature extraction and specifically
whether to use the extracted feature or engineer a set of features that might produce better
results. The authors concluded that the selection (engineering) of a set of features produced
better results than using all the extracted features. Moving away from the English language,
the authors of [13] created a new dataset consisting of 1100 tweets annotated and labeled
as hate speech or non-hate speech. The features used in this study were word n-gram
with n = 1 and n = 2 and character n-gram with n = 3 and n = 4 plus negative sentiment.
Several machine learning algorithms were tested, including naïve Bayes (NB), support
vector machine (SVM), Bayesian logistic regression (BLR), and random forest decision tree
(RF). The highest F-measure of 93.5% was reported when using RF as a classifier. The
authors of [14] used a publicly available dataset of tweets that were divided into 260 tweets
labeled “hate speech” and 445 tweets labeled “non-hate speech”. The five proposed and
tested classifiers included NB, K-nearest neighbors (KNN), maximum entropy (ME), RF,
and SVM. The authors used two ensemble methods: hard voting and soft voting. They
concluded that the best results were achieved using ensemble methods and reported the
best result when using soft voting with an F1 measure of 79.8% on an unbalanced dataset
and 84.7% on a balanced dataset.
The authors of [15] developed a dataset consisting of 10,000 tweets divided into the
following classes: 1915 offensive tweets, of which 225 were deemed vulgar and 506 were
labelled as hate speech, and the remaining 8085 tweets that were deemed clean. They used
various classifiers, such as AdaBoost, Gaussian NB, perceptron, gradient boosting, logistic
regression (LR), and SVM. They reported the best result with a precision of 88.6% using
SVM. In [16], the authors developed a dataset for Arabic speech collected from various
sources, such as Facebook, Instagram, YouTube, and Twitter. They collected a total of
20,000 posts, tweets, and comments. They tested the dataset with 12 machine learning
algorithms and two deep learning algorithms. They reported that the highest accuracy
of 98.7% was achieved using the RNN. In [17], the authors responded to a competition,
OSACT4, to develop a machine learning algorithm for the classification of Arabic text and
the detection of offensive text. They were provided with a dataset of 10,000 tweets. Their
proposed method consisted of using ULMFiT, which was originally developed for English
language detection. By using forward and backward training and finding the average of
the two, they reported an accuracy of 96%.
The authors of [18] also responded to a competition for the detection of Arabic text
that was deemed ironic. The competition was held by the Forum for Information Retrieval
(FIRE2019). They provided the contestants with a dataset of 4024 tweets divided into
2091 ironic and 1933 non-ironic tweets. The authors extracted several features, including TF-
IDF word n-gram features, bag-of-words features, topic modeling features, and sentiment
features. They used three types of ensemble learning: hybrid, deep, and classical. They
reported their best F1 score of 84.4 and achieved third place. The authors of [19] collected
450,000 tweets, of which 2000 were annotated and labeled. They extracted various features
from the tweets, including syntactic dependencies between terms, their claim of well-
founded/justified discrimination against social groups, and incitement to respond with
antagonistic action. They found that the ensemble classification approach was the best at
detecting hate speech.
In [20], the authors used a publicly available dataset of 15,050 comments collected
from controversial YouTube videos about Arabic celebrities. After preprocessing, the
authors tested the data using the following classifiers: convolutional neural network (CNN),
bidirectional long short-term memory (Bi-LSTM), Bi-LSTM and attention mechanisms, and
finally a combined CNN-LSTM architecture. They also used Bayesian optimization to
tune the hyperparameters of the network models. They reported that their best result was
achieved using CNN-LSTM, with a recall of 83.46%.
In [21], the authors developed a dataset consisting of 3235 tweets concerning religious
issues, of which 2590 were labeled as “not hate” and 642 were labeled as “hate speech.”
They tested the dataset using many classifiers, including RF, complement NB, decision
tree (DT), and CNN. They also employed two deep learning methods: CNN and RNN.
CNN with FastText achieved the highest F1 measure of 52%. In [22], the authors used a
publicly available dataset from Fox News user comments consisting of 1528 comments, of
which 435 were labeled as hateful. The authors employed the use of LSTM, bidirectional
LSTM, bidirectional LSTM with attention, and an ensemble model. They found that the
best results were achieved using the ensemble model, with an accuracy of 77.9%. Table 1
presents a summary of these recent studies on text classification.
In this study, the recent research was thoroughly analyzed for the latest feature extrac-
tion methods, such as those listed and detailed in [23,24]. In addition, with reference to the
preprocessing stage, the literature detailing the syntax and prejudice of text [25] and that
addressing the clash between symbolic AI and ML was thoroughly analyzed [26].
As can be seen from the literature presented above, Arabic text detection and recogni-
tion is still an open area of research that has not received the full attention it deserves. It
should be noted here that Arabic is spoken by nearly 422 million people, and 25 countries
use Arabic as their official language. This research area is still lacking in having a unified
extensive dataset for researchers to use as well as in the achievement of accuracy levels that
can be implemented in practice. In this paper, a dataset that was developed from various
tweets and divided into seven different classes will be presented. This is by far the largest
number of classes for a dataset of Arabic hate speech according to the knowledge of the
author. In addition, a novel technique to achieve the highest accuracy compared to that
reported in the extant literature will also be proposed.
Table 1. Summary of recent studies on text classification.

Source | Year | Method Name | Dataset Name | Accuracy (%)
[8] | 2017 | CNN | Twitter hate speech | 77.75
[9] | 2017 | SVM | Hate speech detection | 77.5
[10] | 2017 | SVM | Strong hate, weak hate, and no hate | 64.61
[10] | 2017 | LSTM | Strong hate, weak hate, and no hate | 60.50
[11] | 2018 | CNN | Summary of datasets, totals for each class | 84
[14] | 2018 | Mazajak/SVM | Distribution of hate speech types | 88.6
[16] | 2020 | CNN | YouTube dataset | 79
[16] | 2020 | Deep Learning | Twitter dataset | 86
[17] | 2020 | TF-IDF | Test sets | 86.5
[18] | 2019 | SVM–nGram | nGram with reduced typed dependencies and hateful terms | 89
[19] | 2014 | Soft Voting | Hate speech detection performance using balanced dataset | 84.7
[20] | 2019 | DL-CNN | YouTube comments | 87.84
[21] | 2020 | CNN + Word2Vec | Sunnah Shia Arabic comments dataset | 79
[22] | 2017 | CNN | Fox News user’s Comments | 77.9

3. Materials and Methods


The system was built using machine learning, particularly deep learning-based neural
networks, which are artificial intelligence-based learning algorithms that can be designed to
identify words, phrases, and sentences from all communication media, including websites,
social media, and others. The first step in designing an intelligent system is to train the
system using a large dataset of various words, phrases, and sentences that are labeled as
hate, racist, bullying, discriminatory, etc. The proposed methodology for this system is as
follows: Initially, the dataset is divided into two datasets (one that includes the positive and
negative comments and the other that includes the rest of the classes). The datasets
go through a preprocessing phase to clean the data and perform other operations, such as
labeling and the removal of symbols. The dataset is then passed to another phase in which
the text vectorization is performed. In this phase, the words are tokenized. The phase
after that is used to extract the Arabic text features and divide the data for training and
validation. The training data is passed to the classifiers that should classify the data into
one of the seven classes. It was envisioned that the complete system would be fully
autonomous, with accuracy high enough for implementation in practical systems. The
system can then be linked to a cloud-based service able to host large amounts of memory
and allow access to computing power for real-time processing.
The proposed system is shown in Figure 1. The first phase consists of dividing the
dataset into two categories: one that holds the negative and positive comments, and another
that holds the five hate speech classes (see Table 2). The combined dataset undergoes a preprocessing
phase, which includes data cleaning, symbol removal, labeling, etc. The dataset is then
passed to the second phase for text vectorization or word tokenizing. The very important
phase of text feature extraction is then completed. The data are then split into two parts for
training and validation, with 80% training and 20% validation. The 20% validation data are
also split into 80% validation and 20% verification. Then, the classifiers are trained for the
classification of text into one of the seven classes listed in the figure.
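The two-stage split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_dataset`, the fixed random seed, and the use of shuffling before splitting are not taken from the paper, which does not say how the comments were ordered or sampled.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle, split 80/20, then split the held-out 20% again 80/20.

    Assumption: a seeded shuffle precedes the split; the paper does not
    state how comments were assigned to each part.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 80 // 100
    train, held_out = shuffled[:n_train], shuffled[n_train:]
    n_val = len(held_out) * 80 // 100
    validation, verification = held_out[:n_val], held_out[n_val:]
    return train, validation, verification

# Integer IDs stand in for the 4203 comments.
train, validation, verification = split_dataset(range(4203))
print(len(train), len(validation), len(verification))
```

With integer rounding as written, the 4203 comments are divided into 3362 training, 672 validation, and 169 verification items.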

3.1. Experimental Dataset and Setup


A large Arabic dataset consisting of thousands of sentences, words, and social media
comments was compiled from various sources and labeled accordingly into different
classes. The dataset in itself is novel because it identifies words, phrases, and sentences in
Arabic and divides them into seven distinct classes, making it unique among the currently
published datasets for Arabic hate speech that usually contain two to four classes only. This
new dataset will open the door for broad research from a wide audience of researchers in
the AI field. The dataset contains a total of 4203 comments collected from various sources
on social media. The comments were annotated and labeled by a group of undergraduate
students whose mother tongue was Arabic and who were proficient in the dialect; the resulting
class sizes are shown in Table 2. Each comment was annotated and labeled by at least three students
and was kept only if they agreed unanimously on the class; otherwise, the comment was discarded. The experiments were
performed using a Windows computer with i7-8700K (3.70 GHz) CPU, 32 GB memory, and
NVIDIA GeForce GTX1080 GPU; the software code was written in Python.
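The annotation rule above (at least three annotators, unanimous agreement, otherwise discard) can be expressed as a short filter. The sketch below is hypothetical: the function name `keep_unanimous`, the dictionary layout, and the example comments are illustrative and do not come from the paper.

```python
def keep_unanimous(annotations, min_annotators=3):
    """Keep a comment only if enough annotators labeled it and all agree.

    annotations: dict mapping a comment to the list of class IDs
    (1 to 7, as in Table 2) assigned by its annotators.
    """
    kept = {}
    for comment, labels in annotations.items():
        if len(labels) >= min_annotators and len(set(labels)) == 1:
            kept[comment] = labels[0]
    return kept

# Hypothetical annotations, for illustration only.
example = {
    "comment A": [2, 2, 2],  # unanimous: kept
    "comment B": [4, 5, 4],  # disagreement: discarded
    "comment C": [6, 6],     # fewer than three annotators: discarded
}
print(keep_unanimous(example))
```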

Figure 1. Workflow of the proposed system for hate speech detection from Arabic text.

Table 2. Summary of the newly constructed class and data size for hate speech detection.

Class ID # | Class Name | Class Data Size
1 | Comments Against Religion | 407
2 | Racist Comments | 394
3 | Comments Against Gender Equality | 410
4 | Insulting/Bullying Comments | 404
5 | Violent/Offensive Comments | 788
6 | Normal Positive Comments | 900
7 | Normal Negative Comments | 900

3.2. Preprocessing and Word Embedding Layer for Arabic Hate Speech Dataset
The cleaning and preprocessing of data is an extremely vital step to increase accuracy.
The Arabic hate speech dataset was subjected to a preprocessing stage in which the data
were cleaned and irrelevant data were discarded. In addition, diacritics, symbols, special
characters, and emojis were removed from the data. It should be noted that the same Arabic
character can appear under different Unicode representations; thus, a dictionary was created to
map these variant representations to a single fixed character.
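A minimal sketch of such a cleaning step is shown below. The paper does not list the contents of its dictionary, so the `CANONICAL` mapping here (folding alef variants and alef maksura) is an assumed example of a common Arabic normalization, and the regular expressions are likewise illustrative rather than the author's actual rules.

```python
import re

# Assumed example mapping; the paper's actual dictionary is not given.
CANONICAL = {
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
}
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")     # symbols, digits, emojis

def clean_comment(text):
    """Remove diacritics, fold variant characters, and drop non-letters."""
    text = DIACRITICS.sub("", text)
    text = "".join(CANONICAL.get(ch, ch) for ch in text)
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace

# A vocalized word followed by punctuation and an emoji.
print(clean_comment("\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627!! \U0001F600"))
```

Removing diacritics before the character mapping keeps the dictionary small, because each variant only has to be listed once in its bare form.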
Each space-separated segment of a comment was taken, and a default
tokenizer was applied to remove symbols from the string. The result was then converted to a
tensor of numerical data, which is suitable for neural networks. The data were
then extensively labeled; this was a hard, laborious, and time-consuming task. To ensure
that the labeling was performed correctly, randomly selected text was shown to a different
set of students who were tasked with labeling it; the two sets of labels were compared to check
whether the labeling task had been carried out accurately. The labels were converted from an
integer to a tensor or float data type so that each label contained a vector of the numeral that
represented the label. The default batch size was set at 32; however, this, along with other
hyperparameters, was changed during each experiment.
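The conversion from tokens and integer labels to numerical vectors might look like the sketch below. Several details are assumptions rather than facts from the paper: the reserved `<pad>` and `<unk>` entries, the fixed sequence length, and the one-hot encoding of labels (the paper only says each label became a vector). The 25,000-word default mirrors the vocabulary size reported in this section.

```python
from collections import Counter

def build_vocab(tokenized_comments, max_size=25000):
    """Map the most frequent tokens to integer IDs (0 = padding, 1 = unknown)."""
    counts = Counter(tok for comment in tokenized_comments for tok in comment)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

def encode(comment, vocab, max_len):
    """Turn a token list into a fixed-length list of integer IDs."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in comment][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

def one_hot(class_id, num_classes=7):
    """Assumed label encoding; class IDs are 1-based, as in Table 2."""
    vec = [0.0] * num_classes
    vec[class_id - 1] = 1.0
    return vec

# Placeholder tokens stand in for cleaned Arabic words.
comments = [["a", "b", "a"], ["a", "c"]]
vocab = build_vocab(comments)
print(encode(["a", "z"], vocab, max_len=4))
print(one_hot(3))
```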
Word2Vec [27] and Global Vectors (GloVe) [28] were the models chosen for word embedding,
which allowed the system to learn word vectors according to their co-occurrence in
sentences. This representation of the words in a lower dimension of vector space allowed
the deep learning algorithms to map the words based on the similarity of the semantic
properties. As shown in Figure 2, two architectures are available for Word2Vec: continuous
bag of words (CBOW) and skip-gram (SG). The former predicts the target word
from its context, i.e., the words before and after it within a window in the sentence.
The latter predicts the surrounding words based on the target
word. After this extensive preprocessing, a vocabulary set of 25,000 words was created,
and its size was changed based on the various experiments. The data were split into two as
mentioned earlier, with 80% for training and 20% for validation.
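To make the difference between the two Word2Vec architectures concrete, the sketch below builds the training pairs each one would see for a short token sequence. It only illustrates how the prediction targets differ; it does not train embeddings, and the function names, window size, and placeholder tokens are assumptions made for the example.

```python
def cbow_pairs(tokens, window=2):
    """CBOW: each pair is (context words, target word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram: each pair is (target word, one context word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["w1", "w2", "w3", "w4"]  # placeholder tokens
print(cbow_pairs(tokens, window=1))
print(skipgram_pairs(tokens, window=1))
```

CBOW produces one example per position with the whole window as input, whereas skip-gram produces one example per (target, neighbor) combination.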

Figure 2. Word2Vec architecture based on CBOW and SG (the Arabic example sentence in the figure
means ‘I clean with my hand which is better, it is forbidden to spend money on it’).

3.3. Proposed Deep Recurrent Neural Network Model for Hate Speech Detection
With the advancement of more sophisticated processing power in computing machines,
deep learning has been widely adopted in almost all data processing problems. A
convolutional neural network (CNN) is a deep neural network that
can exploit the inherent properties of data, particularly image data
but text as well. Recurrent neural networks (RNNs) are similar to CNNs; however, RNNs
also handle the time domain. In a unidirectional (forward-only) RNN, the model learns
word by word from start to end.
In the Arabic language, most meanings of words and sentences change based on where
the word has occurred as well as the next words in the sentence. In bidirectional recurrent
neural networks, two sequences are considered: the forward sequence (left to right) and
the backward sequence (right to left). For example, an Arabic sequence can be used in the
same order for the forward layer; however, in the backward layer, the order of the text will
be reversed. In Arabic text classification, the time domain, which represents the order of
the words in the input text, plays a vital role by solving the problem more effectively and
enhancing the classification results. Long short-term memory (LSTM) can be used to solve
the vanishing gradient problem, which occurs if a long bidirectional RNN is used alone.
The proposed deep bidirectional RNN model for hate speech detection from the
Arabic social media comments consists of 10 layers, including the input layer, bidirectional
RNN layers, flatten layer, dense layers, dropout layers, and output layers. Figure 3 shows
the details of the RNN forward and backward layers used to extract the features. The
embedding layer produced by Word2Vec and GloVe detailed earlier is then input into the
backward RNN layer followed by the forward RNN layer. The output of the RNN is then
concatenated, flattened, and input into the dense and dropout layers, as detailed in Figure 4.
The data are then classified into one of the seven classes.
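The forward and backward passes can be illustrated with a toy bidirectional recurrent layer written in plain Python. This is a sketch under strong assumptions, not the paper's DRNN: it uses a simple tanh recurrence instead of LSTM cells, random weights, tiny dimensions, and keeps only the final state of each direction. All function names are hypothetical.

```python
import math
import random

def rnn_pass(sequence, w_in, w_rec, hidden_size):
    """Run a simple tanh RNN over a list of vectors; return the final state."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = [
            math.tanh(
                sum(w_in[k][i] * x[i] for i in range(len(x)))
                + sum(w_rec[k][j] * h[j] for j in range(hidden_size))
            )
            for k in range(hidden_size)
        ]
    return h

def bidirectional_features(sequence, fwd, bwd, hidden_size):
    """Concatenate the final states of the forward and backward passes."""
    h_fwd = rnn_pass(sequence, *fwd, hidden_size)
    h_bwd = rnn_pass(sequence[::-1], *bwd, hidden_size)  # reversed word order
    return h_fwd + h_bwd

def random_weights(rng, hidden_size, input_size):
    """Illustrative random weights (input-to-hidden, hidden-to-hidden)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    return w_in, w_rec

rng = random.Random(0)
fwd = random_weights(rng, hidden_size=3, input_size=2)
bwd = random_weights(rng, hidden_size=3, input_size=2)
embedded = [[0.1, 0.2], [0.3, -0.1], [-0.4, 0.5]]  # three toy word vectors
features = bidirectional_features(embedded, fwd, bwd, hidden_size=3)
print(len(features))
```

The concatenated vector has twice the hidden size, which corresponds to the concatenation step that precedes the flatten, dense, and dropout layers in the description above.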

Figure 3. Proposed architecture of the deep recurrent neural network model for feature extraction
from Arabic text (‘محفظة مفاتيح رائعة’ means ‘keys wallet is amazing’).

In the proposed deep bidirectional recurrent neural network, the tanh activation
function was used for the dense layers while the sigmoid activation function was used
for the output layer. The experiments were performed using a learning rate of 0.001 with
a batch size of 32 for 50 iterations. The model with the best validation accuracy across the
50 iterations was saved and then used on the test dataset to generate and compare results.
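Keeping the model with the best validation accuracy amounts to a simple selection over the training history. The sketch below assumes the history is available as a list of tuples; the function name `best_checkpoint`, the placeholder weight names, and the accuracy values are hypothetical and are not results from the paper.

```python
def best_checkpoint(history):
    """Return the (iteration, val_accuracy, weights) entry with the highest
    validation accuracy; on ties, the earliest iteration is kept."""
    best = None
    for iteration, val_acc, weights in history:
        if best is None or val_acc > best[1]:
            best = (iteration, val_acc, weights)
    return best

# Hypothetical history: strings stand in for saved model weights.
history = [
    (1, 0.71, "weights_iter_1"),
    (2, 0.84, "weights_iter_2"),
    (3, 0.79, "weights_iter_3"),
]
iteration, val_acc, weights = best_checkpoint(history)
print(iteration, val_acc, weights)
```

Selecting on validation accuracy rather than on the last iteration explains why a "Best" row can differ from the final iteration in a results table.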

3.4. Evaluation Metrics


Several metrics were used to evaluate the effectiveness of the proposed model. The
metrics used were typical metrics that allowed for a comparison with other techniques
from the literature, including the confusion matrix, precision, recall, F1 score, and accuracy.
The metrics are defined and described in Table 3.
Table 3. Evaluation metrics used for the experimental results. TP: true positive; TN: true negative;
FP: false positive; FN: false negative.

Evaluation Metric | Equation | Description
Confusion Matrix | NA | A table that summarizes true positives, true negatives, false positives, and false negatives.
Precision | $\frac{TP}{TP + FP}$ | Quantifies the portion of the test cases classified as positive that were actually positive.
Recall | $\frac{TP}{TP + FN}$ | Also known as sensitivity; quantifies the portion of actual positive test cases that were correctly classified.
F1 Score | $\frac{2TP}{2TP + FP + FN}$ | The harmonic mean of precision and recall.
Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Quantifies the ratio of the sum of all correct classifications to the total number of classifications.
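The formulas in Table 3 translate directly into code. The sketch below assumes a binary setting with nonzero denominators; the function name `classification_metrics` and the example counts are illustrative and are not taken from the paper's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 3 metrics from confusion-matrix counts.

    Assumes every denominator is nonzero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The F1 expression used here is algebraically the same as the harmonic mean 2PR / (P + R) of precision P and recall R.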

Figure 4. Architecture of the convolutional operations for the hate speech detection from the
extracted features.

4. Experimental Results and Discussion


Three different types of experiments were performed using the newly built dataset and
other existing datasets of Arabic comments. The experiments were performed using hardware
with a Nvidia 1080TI GPU and 32-gigabyte memory; the software was implemented
using Python programming. The experimental dataset was divided into 80% training and
20% validation. The 20% validation data were also split into 80% validation and 20% verification.
Different machine learning methods, such as decision trees (DT), multilayer
perceptron (MLP), naïve Bayes (NB), and linear regression (LR), were also used so that their
results could be compared with those of the deep recurrent neural network models. The experimental results
were evaluated based on their accuracy, precision, recall, F1 score, and confusion matrices.

4.1. Positive and Negative Arabic Comment Classification


The first experiment was based on the binary classification of comments as negative
or positive. Table 4 shows the results of using the DT, MLP, NB, and LR classifiers. It can be
seen from Table 4 that the highest accuracy (63.94%) was achieved using LR; however, this
was still a very low percentage. LR also achieved the highest F1 score (0.65), with a precision
of 0.63 and a recall of 0.61, while NB achieved a marginally higher precision (0.64) and recall (0.62).

Table 4. Comparison of experimental results using typical classification methods for the two classes
of Arabic comments.

Method Accuracy Precision Recall F1 Score


DT 56.970% 0.57 0.57 0.59
MLP 57.576% 0.58 0.58 0.58
NB 62.810% 0.64 0.62 0.59
LR 63.937% 0.63 0.61 0.65

A repeat of the same experiment using the proposed deep RNN model showed
a significant improvement in the binary classification of Arabic comments, as shown
in Table 5. There were two proposed architectures: one consisting of five deep layers
with 231,841 parameters, referred to as DRNN-1, and the other with 10 deep layers and
1,682,787 parameters, referred to as DRNN-2. Using the DRNN-1 architecture, the training
accuracy increased to 99.73%, with a training loss of 0.0044, a validation loss of 0.9439,
and a validation accuracy of 83.22%.

Table 5. Experimental results using the proposed deep recurrent neural network model-based
classification method for two classes of Arabic comments.

Model     Iterations   Train Loss   Train Acc.   Val. Loss   Val. Acc.
DRNN-1 (five deep layers with 231,841 parameters)
DRNN-1    10           0.0709       98.28%       0.4815      75.17%
DRNN-1    25           0.0074       99.78%       0.7993      81.88%
DRNN-1    50           0.0041       99.93%       0.9914      82.55%
DRNN-1    Best         0.0044       99.73%       0.9439      83.22%
DRNN-2 (10 deep layers with 1,682,787 parameters)
DRNN-2    10           0.0289       99.19%       0.8572      79.39%
DRNN-2    25           0.0038       99.87%       1.1484      81.82%
DRNN-2    50           0.0073       99.80%       1.1335      83.64%
DRNN-2    Best         0.0578       98.18%       0.3574      90.30%

With DRNN-2, the best result had a training accuracy of 98.18% with a training
loss of 0.0578, while the validation loss and validation accuracy were 0.3574 and 90.30%,
respectively. DRNN-1 took 328 s to train, while DRNN-2, as a more complex
network, took 492 s. DRNN-1 achieved the higher training accuracy (99.73% versus 98.18%),
whereas DRNN-2 achieved the higher validation accuracy (90.30% versus 83.22%); both
achieved better results than the classifiers listed in Table 4.
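
Table 5 reports a "Best" row alongside the results at fixed iteration counts, but the paper does not state how that row was selected. One common convention, assumed here purely for illustration, is to keep the iteration with the highest validation accuracy. The sketch below applies that rule to the three DRNN-2 rows reported at 10, 25, and 50 iterations; the names `EpochLog` and `select_best` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpochLog:
    iteration: int
    train_loss: float
    train_acc: float  # percent
    val_loss: float
    val_acc: float  # percent


def select_best(history):
    """Return the entry with the highest validation accuracy.

    Ties are broken in favour of the lower validation loss. This rule is an
    assumption; the paper does not describe how its 'Best' row was chosen.
    """
    return max(history, key=lambda log: (log.val_acc, -log.val_loss))


# DRNN-2 rows reported in Table 5 at 10, 25, and 50 iterations.
drnn2_history = [
    EpochLog(10, 0.0289, 99.19, 0.8572, 79.39),
    EpochLog(25, 0.0038, 99.87, 1.1484, 81.82),
    EpochLog(50, 0.0073, 99.80, 1.1335, 83.64),
]

best = select_best(drnn2_history)
```

Among these three rows, the rule selects the entry at 50 iterations (validation accuracy of 83.64%). The reported "Best" row (90.30%) corresponds to an iteration that is not listed in Table 5, so it cannot be reproduced from the tabulated rows alone.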

4.2. Hate Speech Detection from Positive and Negative Arabic Comments
The second experiment was based on three classes of comments: negative, positive, or
hate speech. Table 6 shows the results of using the DT, MLP, NB, LR, and RF classifiers. It can
be seen from Table 6 that the highest accuracy (56.192%) was achieved using DT. Again,
this was still a very low percentage. The precision, recall, and F1 score were the highest for
DT as well, with values of 0.56, 0.56, and 0.56, respectively.
A repeat of the same experiment using the proposed deep RNN model (DRNN-2)
showed a significant improvement in the classification of the three classes of Arabic
comments, as shown in Table 7. The DRNN-2 results are shown because they are significantly
better than those achieved by DRNN-1. With the use of DRNN-2, the training accuracy
increased to 95.38%, with a training loss of 0.1391 and a validation loss and validation
accuracy of 0.4148 and 86.40%, respectively. It can be observed that the proposed model
showed better results than those listed in Table 6 as well as those listed in Table 1 from the
extant literature while targeting the classification of Arabic comments into three classes.

Table 6. Comparison of experimental results using the typical classification methods for the three
classes of Arabic comments.

Method Accuracy Precision Recall F1 Score


DT 56.192% 0.56 0.56 0.56
MLP 47.505% 0.51 0.48 0.49
NB 54.710% 0.53 0.54 0.5
LR 51.449% 0.48 0.54 0.48
RF 50.881% 0.54 0.52 0.46

Table 7. Experimental results using the proposed deep recurrent neural network model (DRNN-2)
for three classes of Arabic comments.

Iterations Train Loss Train Acc. Val. Loss Val. Acc.


10 0.0576 98.26% 1.2928 77.00%
25 0.0257 99.31% 2.1474 69.61%
50 0.2016 92.48% 0.4729 84.80%
Best 0.1391 95.38% 0.4148 86.40%

Figure 5 shows the confusion matrix for the classification of Arabic comments into
three classes. The diagonal cells indicate that confusion was minimal for
normal positive and normal negative comments. For the hate speech class, however,
the confusion matrix shows more misclassifications, i.e., hate speech comments were
identified less reliably than the other two classes.

Figure 5. Confusion matrix of the experimental results using the proposed deep recurrent neural
network model-based classification method for three classes of Arabic comments.
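
To make the reading of a multiclass confusion matrix concrete, the following sketch builds a three-class matrix and derives the per-class recall from its rows. The labels and predictions are hypothetical toy data, not the values behind Figure 5, and the function names are chosen for illustration only.

```python
from collections import Counter

LABELS = ["positive", "negative", "hate"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true class and columns index the predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix):
    """Diagonal entry divided by the row total, for each true class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


# Hypothetical toy labels, used only to demonstrate the computation.
y_true = ["positive", "positive", "negative", "negative", "hate", "hate", "hate"]
y_pred = ["positive", "positive", "negative", "positive", "hate", "negative", "hate"]

cm = confusion_matrix(y_true, y_pred, LABELS)
recall = per_class_recall(cm)
```

In this toy example, the matrix is [[2, 0, 0], [1, 1, 0], [0, 1, 2]]: the off-diagonal entries show one negative comment predicted as positive and one hate speech comment predicted as negative, giving per-class recalls of 1.0, 0.5, and about 0.67.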

4.3. Multiclass Hate Speech Classification of Arabic Comments


The third experiment was based on seven classes, namely comments against religion,
racist comments, comments against gender equality, violent/offensive comments,
insulting/bullying comments, normal positive comments, and normal negative comments.
To the author’s knowledge, this may be the first study targeting seven classes of Arabic
comments. Table 8 shows the results of using the DT, MLP, NB, LR, and RF classifiers. It can be
seen from Table 8 that the highest accuracy (34.761%) was achieved using DT; this was still
a very low percentage. The precision, recall, and F1 score were the highest for DT as well,
with values of 0.35, 0.35, and 0.34, respectively.

Table 8. Comparison of experimental results using the typical classification methods for seven classes
of Arabic comments.

Method Accuracy Precision Recall F1 Score


DT 34.761% 0.35 0.35 0.34
MLP 17.884% 0.22 0.18 0.19
NB 24.175% 0.24 0.22 0.16
LR 17.692% 0.16 0.29 0.21
RF 21.762% 0.23 0.24 0.21

A repeat of the same experiment using the proposed deep RNN model (DRNN-2)
showed a significant improvement in the classification of the seven classes of Arabic
comments, as shown in Table 9. Again, the choice to show the results of DRNN-2 for these
experiments was because it produced a significant improvement in accuracy over DRNN-1.
The training accuracy increased to 84.14%, with a training loss of 0.4752 and a validation
loss and validation accuracy of 1.6282 and 58.26%, respectively. It can be observed that the
proposed model showed better results than those listed in Table 8 as well as those listed in
Table 1.

Table 9. Experimental results using the proposed deep recurrent neural network model (DRNN-2)
for seven classes of Arabic comments.

Iterations Train Loss Train Acc. Val. Loss Val. Acc.


10 0.1303 96.54% 2.2865 57.14%
25 0.0601 98.54% 3.1065 53.22%
50 0.0398 98.91% 3.3132 56.86%
Best 0.4752 84.14% 1.6282 58.26%

Figure 6 shows the confusion matrix for the classification of Arabic comments into
seven classes. The diagonal cells indicate that confusion was minimal for
normal positive comments; moderate for racist, violent/offensive, and
insulting/bullying comments; and highest for normal negative comments, comments
against religion, and comments against gender equality.
A complete cloud-based system will be designed with various smart functions, such
as alerting the authorities, producing tangible reports, producing customized reports
according to need, and a human–computer interface for focusing on
certain aspects in certain situations. The system prototype is expected to be developed
towards the end of this project, and publication in a Scopus-indexed or ISI-indexed journal
may result from the complete system.

Figure 6. Confusion matrix of the experimental results from the proposed deep recurrent neural
network model-based classification method for seven classes of Arabic comments.

5. Conclusions
Many social media platforms prohibit users from posting hate speech, racism, cyberbullying,
and similar content. In many countries, privacy laws prevent governments from monitoring the
internet and the online activities of individuals, even for national security reasons. However,
there are many countries in which the government is allowed to monitor social media and
other citizen activities for various reasons, including national security and the prevention
of cyberbullying, hate speech, and racism. In this paper, a deep learning algorithm
was proposed for the automatic classification and detection of Arabic hate speech.
Publicly available datasets are scarce, and those that are published do not
contain the type of written comments that were targeted in this study. Therefore, one of the
main contributions of this work was the development of a unique dataset that contained
4203 comments gathered from various social media platforms and labeled into one of the
following seven classes: content against religion, racist content, content against gender
equality, violent/offensive content, insulting/bullying content, normal positive comments,
and normal negative comments.
The dataset was subjected to a thorough preprocessing phase and the comments were
labeled in a scientific manner that ensured the accuracy of the labeling and classification. In
this work, the use of deep recurrent neural networks (RNNs) for the automatic classification
and detection of Arabic hate speech was proposed. The proposed model, referred to as
DRNN-2, consisted of 10 layers and was trained with a batch size of 32 for
50 iterations. Another model consisting of five layers, referred to as DRNN-1, was reported
for the binary classification experiment only. Using the proposed models, a training accuracy
of 99.73% was achieved for binary classification, 95.38% for the three classes of Arabic
comments, and 84.14% for the seven classes of Arabic comments (the corresponding validation
accuracies were 83.22%, 86.40%, and 58.26%). This was a high accuracy
for the classification of a complex language, such as Arabic, into seven different classes.
The achieved accuracy was higher than that of any similar method reported in the recent literature,
whether for binary classification, three-class classification, or seven-class classification.
Future work should include the continuous development of the dataset in order to
increase its size and number of classes and to develop a unified dataset for researchers in
this field. Future research should also focus on the continued development and modification of
machine learning techniques to achieve better accuracy in the classification of hate speech
in Arabic. Finally, the author will work to develop a prototype
that can be applied within social media platforms for real-time testing and processing.

Funding: This research received no external funding.


Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The author declares no conflict of interest.

References
1. Bashar, A.; Latif, G.; Ben Brahim, G.; Mohammad, N.; Alghazo, J. COVID-19 Pneumonia Detection Using Optimized Deep
Learning Techniques. Diagnostics 2021, 11, 1972. [CrossRef] [PubMed]
2. Latif, G.; Al Anezi, F.Y.; Sibai, F.N.; Alghazo, J. Lung Opacity Pneumonia Detection with Improved Residual Networks. J. Med.
Biol. Eng. 2021, 41, 581–591. [CrossRef]
3. Latif, G.; Alghazo, J.; Sibai, F.N.; Iskandar, D.A.; Khan, A.H. Recent advancements in Fuzzy C-means based techniques for brain
MRI Segmentation. Curr. Med. Imaging 2021, 17, 917–930. [CrossRef]
4. Lundervold, A.S.; Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Für Med. Phys. 2019, 29,
102–127. [CrossRef] [PubMed]
5. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning-based Text Classification: A
Comprehensive Review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [CrossRef]
6. Latif, G.; Bouchard, K.; Maitre, J.; Back, A.; Bédard, L.P. Deep-Learning-Based Automatic Mineral Grain Segmentation and
Recognition. Minerals 2022, 12, 455. [CrossRef]
7. Al-Hmouz, A.; Latif, G.; Alghazo, J.; Al-Hmouz, R. Enhanced numeral recognition for handwritten multi-language numerals
using a fuzzy set-based decision mechanism. Int. J. Mach. Learn. Comput. 2020, 10, 99–107. [CrossRef]
8. Gambäck, B.; Sikdar, U.K. Using convolutional neural networks to classify hate speech. In Proceedings of the First Workshop on
Abusive Language Online, Vancouver, BC, Canada, 4 August 2017; pp. 85–90.
9. Malmasi, S.; Zampieri, M. Detecting hate speech in social media. arXiv 2017, arXiv:1712.06427.
10. Del Vigna, F.; Cimino, A.; Dell’Orletta, F.; Petrocchi, M.; Tesconi, M. Hate me, hate me not: Hate speech detection on Facebook. In
Proceedings of the First Italian Conference on Cybersecurity, Venice, Italy, 17–20 January 2017; pp. 86–95.
11. Zimmerman, S.; Kruschwitz, U.; Fox, C. Improving hate speech detection with deep learning ensembles. In Proceedings of the
Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018.
12. Robinson, D.; Zhang, Z.; Tepper, J. Hate speech detection on Twitter: Feature engineering vs feature selection. In Proceedings of the
European Semantic Web Conference, Heraklion, Greece, 3–7 June 2018; Springer: Cham, Switzerland, 2018; pp. 46–49.
13. Alfina, I.; Mulia, R.; Fanany, M.I.; Ekanata, Y. Hate speech detection in the Indonesian language: A dataset and preliminary study.
In Proceedings of the International Conference on Advanced Computer Science and Information Systems (ICACSIS), Yogyakarta,
Indonesia, 28–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 233–238.
14. Fauzi, M.A.; Yuniarti, A. Ensemble method for Indonesian twitter hate speech detection. Indones. J. Electr. Eng. Comput. Sci. 2018,
11, 294–299. [CrossRef]
15. Mubarak, H.; Rashed, A.; Darwish, K.; Samih, Y.; Abdelali, A. Arabic offensive language on Twitter: Analysis and experiments.
arXiv 2020, arXiv:2004.02192.
16. Omar, A.; Mahmoud, T.M.; Abd-El-Hafeez, T. Comparative performance of machine learning and deep learning algorithms for
Arabic hate speech detection in osns. In Proceedings of the International Conference on Artificial Intelligence and Computer
Vision, Cairo, Egypt, 8–10 April 2020; Springer: Cham, Switzerland, 2020; pp. 247–257.
Appl. Sci. 2022, 12, 6010 15 of 15

17. Abdellatif, M.; Elgammal, A. Offensive language detection in Arabic using ULMFiT. In Proceedings of the 4th Workshop on
Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France,
11–16 May 2020; pp. 82–85.
18. Khalifa, M.; Hussein, N. Ensemble Learning for Irony Detection in Arabic Tweets. In FIRE (Working Notes); CEUR-WS: New Delhi,
India, 2019; pp. 433–438.
19. Burnap, P.; Williams, M.L. Hate speech, machine classification and statistical modeling of information flow on Twitter: Inter-
pretation and communication for policy decision making. In Proceedings of the Internet, Policy & Politics, Oxford, UK, 25–26
September 2014; pp. 1–18.
20. Mohaouchane, H.; Mourhir, A.; Nikolov, N.S. Detecting offensive language on Arabic social media using deep learning. In
Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS),
Granada, Spain, 22–25 October 2019; pp. 466–471.
21. Aref, A.; Al Mahmoud, R.H.; Taha, K.; Al-Sharif, M. Hate Speech Detection of Arabic Shorttext. CS IT Conf. Proc. 2020, 10, 81–94.
22. Gao, L.; Huang, R. Detecting online hate speech using context-aware models. arXiv 2017, arXiv:1710.07395.
23. Sitaula, C.; Basnet, A.; Mainali, A.; Shahi, T.B. Deep Learning-Based Methods for Sentiment Analysis on Nepali COVID-19-Related
Tweets. Comput. Intell. Neurosci. 2021, 2021, 2158184. [CrossRef] [PubMed]
24. Shahi, T.B.; Sitaula, C.; Paudel, N. A Hybrid Feature Extraction Method for Nepali COVID-19-Related Tweets Classification.
Comput. Intell. Neurosci. 2022, 2022, 5681574. [CrossRef] [PubMed]
25. Mastromattei, M.; Ranaldi, L.; Fallucchi, F.; Zanzotto, F.M. Syntax and prejudice: Ethically-charged biases of a syntax-based hate
speech recognizer unveiled. PeerJ Comput. Sci. 2022, 8, e859. [CrossRef] [PubMed]
26. Ranaldi, L.; Fallucchi, F.; Zanzotto, F.M. Dis-Cover AI Minds to Preserve Human Knowledge. Future Internet 2021, 14, 10.
[CrossRef]
27. Hu, W.; Gu, Z.; Xie, Y.; Wang, L.; Tang, K. Chinese Text Classification Based on Neural Networks and Word2vec. In Proceedings
of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019;
pp. 284–291.
28. Hossain, T.; Mauni, H.Z.; Rab, R. Reducing the Effect of Imbalance in Text Classification Using SVD and GloVe with Ensemble
and Deep Learning. Comput. Inform. 2022, 41, 98–115. [CrossRef]
