Applied Sciences: Arabic Hate Speech Detection Using Deep Recurrent Neural Networks
Article
Arabic Hate Speech Detection Using Deep Recurrent
Neural Networks
Faisal Yousif Al Anezi
Abstract: With the vast number of comments posted daily on social media and other platforms,
manually monitoring internet activity for possible national security risks or cyberbullying is an
impossible task. However, with recent advances in machine learning (ML), the automatic monitoring
of such posts for possible national security risks and cyberbullying becomes feasible. There is still the
issue of privacy on the internet; however, in this study, only the technical aspects of designing an
automated system that could monitor and detect hate speech in the Arabic language were targeted.
Many companies, such as Facebook, Twitter, and others, could use such a system to prevent hate
speech and cyberbullying. For this task, a unique dataset consisting of 4203 comments classified into
seven categories (content against religion, racist content, content against gender equality,
violent/offensive content, insulting/bullying content, normal positive comments, and normal
negative comments) was designed. The dataset was extensively preprocessed and labeled, and its
features were extracted. In addition, the use of deep recurrent neural networks (RNNs) was proposed
for the classification and detection of hate speech. The proposed RNN architecture, called DRNN-2,
consisted of 10 layers and was trained with a batch size of 32 for 50 iterations for the classification
task. Another model consisting of five hidden layers, called DRNN-1, was used only for binary
classification. Using the proposed models, a recognition rate of 99.73% was achieved for binary
classification, 95.38% for the three classes of Arabic comments, and 84.14% for the seven classes of
Arabic comments. This accuracy was high for the classification of a complex language, such as
Arabic, into seven different classes. The achieved accuracy was higher than that of similar methods
reported in the recent literature, whether for binary classification, three-class classification, or
seven-class classification, as discussed in the literature review section.
Keywords: hate speech detection; Arabic comment classification; deep learning; recurrent neural
networks; bidirectional RNN; natural language processing
Citation: Anezi, F.Y.A. Arabic Hate Speech Detection Using Deep Recurrent Neural Networks.
Appl. Sci. 2022, 12, 6010. https://doi.org/10.3390/app12126010
Academic Editor: Valentino Santucci
words, phrases, and sentences will be required. This system will be smart enough to
identify these kinds of conversations and able to determine if they are one-time phrases or
if they profile a certain pattern for an individual or organization. In addition, the system
will be able to alert the authorities when repetitive unacceptable behavior is observed from
an individual, group, or organization. The system will be called a “Peace Monitor” and may
also be used to identify bullying on social media. Thus, an automatic machine learning-based
system that is able to detect and identify hate speech is especially needed in countries that
are currently progressing to be more inclusive; it will also be useful for the prevention of
cyberbullying, particularly in school-aged children with access to social media platforms.
A great deal of work has been carried out on the detection
of hate speech and cyberbullying in other languages, especially English; however, little
work has been completed in Arabic for various reasons, including the complex nature of
the Arabic language and the lack of unified large datasets. The aim of this study was to
develop a large Arabic language dataset with seven classes that could be used as the basis
for the development of an even larger dataset with more classes to be used as a unified
dataset for researchers in this field. In addition, this study aimed to develop a modified
machine learning algorithm that was able to detect and accurately classify hate speech
in Arabic.
Machine learning and deep learning algorithms have gained momentum in the past
decade as a means to automate many tasks that have conventionally been performed by
humans. For example, machine learning and deep learning algorithms are being explored
for the automated diagnosis of many diseases and health conditions [1–4]. However, this
is just an example, and the use of these algorithms has been researched in relation to all
aspects of human life, from self-driving cars to monitoring the quality of food ordered in
fast-food restaurants. The fast pace at which technology has advanced to produce high-
speed computing machines and the advancement of the Internet of Things (IoT), where
sensors are attached to everything and everything is connected to either the cloud or the
internet, has increased the availability of big data that can be analyzed to produce results
that are productive and ease human life. This has paved the way for machine learning and
deep learning algorithms to be explored in all walks of life to automate tasks previously
carried out by humans at much faster speeds and with more accuracy [5,6]. One of the areas
in which these algorithms have been explored is text recognition, whether it is handwritten
or typed text [7].
The main contributions of this research can be summarized as follows:
1. Built a new hate speech dataset in Arabic with seven different classes.
2. Modified machine learning algorithms to be able to correctly classify and detect hate
speech in Arabic.
The final product of such a system has various advantages and uses, and some of the
benefits of the final product are as follows:
1. The proposed system can be used in Arabic-speaking countries to ensure a future that
is consistent with peace and tolerance.
2. The system can be used by the authorities to save lives because it can predict violent
behavior before it occurs.
3. The system can be used in smart cities as a smart feature, which can have several
added features to ensure safety and security.
4. The system can be expanded to include other Arabic dialects and thus can be used in
other Arab countries.
5. The system can be used by western countries for Arabic-speaking populations.
The rest of the paper is organized as follows: Section 2 includes a literature review of
the algorithms explored for text classification and recognition. Section 3 details the novel
methodology proposed in this paper. Section 4 highlights the experimental results and
discussion. Section 5 highlights the conclusion.
Appl. Sci. 2022, 12, 6010 3 of 15
2. Literature Review
A great deal of research has been carried out on hate speech in various languages,
especially English. The authors of [8] proposed the use of a convolutional neural network
for the classification of hate speech. The dataset they used divided hate speech into several
sections, including racist sentences, sexist sentences, harmful words, and threatening
phrases, such as death threats. The database consisted of 6655 tweets: 91 racism tweets,
946 sexism tweets, 18 tweets with both racism and sexism, and 5600 non-hate-speech tweets.
Among the different algorithms used, they reported that the best results achieved were
a precision of 86.61%, a recall of 70.42%, and an F-score of 77.38%. The authors of [9]
proposed the use of a support vector machine (SVM) for multiclass classification. Their
dataset consisted of 14,509 tweets annotated into three categories: hate, offensive, and OK.
In these experiments, the authors also proposed surface n-grams, word skip-grams, and
Brown clusters as the features to be extracted. The total dataset consisted of 2399 sentences
classified as hate, 4836 sentences classified as offensive, and 7274 sentences classified as
OK. They reported that the best accuracy of 78% was achieved using a character 4-gram model.
The authors of [10] collected 17,567 Facebook posts annotated as no hate, weak hate, and
strong hate as a dataset for their proposed system. They proposed the use of two methods
to capture and classify hate phrases and speech, which are published daily on one of
the largest social networking sites—Facebook. The two methods used in this study were
support vector machines (SVM) and long short-term memory (LSTM), which is a recurrent
neural network (RNN). The SVM system provides excellent and wide levels for analyzing
and classifying linguistic texts, and these features have been used well in polar sentiment
classification tasks.
On the other hand, the LSTM system was used to expand the ranges in order to dissect
longer sentences that were not observably accurate from the first scan. Among their results,
the authors reported the highest accuracy of 80.6% using the SVM classifier. In [11], the
authors proposed the use of an ensemble neural network for the classification of hate
speech and used two publicly available datasets. Although accuracy was not reported as a
result, the reported mean of the ensemble was 78.62%.
The authors of [12] used seven publicly available datasets along with a state-of-the-
art linear SVM. However, this study concentrated on feature extraction and specifically
whether to use the extracted feature or engineer a set of features that might produce better
results. The authors concluded that the selection (engineering) of a set of features produced
better results than using all the extracted features. Moving away from the English language,
the authors of [13] created a new dataset consisting of 1100 tweets annotated and labeled
as hate speech or non-hate speech. The features used in this study were word n-gram
with n = 1 and n = 2 and character n-gram with n = 3 and n = 4 plus negative sentiment.
Several machine learning algorithms were tested, including naïve Bayes (NB), support
vector machine (SVM), Bayesian logistic regression (BLR), and random forest decision tree
(RF). The highest F-measure of 93.5% was reported when using RF as a classifier. The
authors of [14] used a publicly available dataset of tweets that were divided into 260 tweets
labeled “hate speech” and 445 tweets labeled “non-hate speech”. The five proposed and
tested classifiers included NB, K-nearest neighbors (KNN), maximum entropy (ME), RF,
and SVM. The authors used two ensemble methods: hard voting and soft voting. They
concluded that the best results were achieved using ensemble methods and reported the
best result when using soft voting with an F1 measure of 79.8% on an unbalanced dataset
and 84.7% on a balanced dataset.
The authors of [15] developed a dataset consisting of 10,000 tweets divided into the
following classes: 1915 offensive tweets, of which 225 were deemed vulgar and 506 were
labelled as hate speech, and the remaining 8085 tweets that were deemed clean. They used
various classifiers, such as AdaBoost, Gaussian NB, perceptron, gradient boosting, logistic
regression (LR), and SVM. They reported the best result with a precision of 88.6% using
SVM. In [16], the authors developed a dataset for Arabic speech collected from various
sources, such as Facebook, Instagram, YouTube, and Twitter. They collected a total of
20,000 posts, tweets, and comments. They tested the dataset with 12 machine learning
algorithms and two deep learning algorithms. They reported that the highest accuracy
of 98.7% was achieved using the RNN. In [17], the authors responded to a competition,
OSACT4, to develop a machine learning algorithm for the classification of Arabic text and
the detection of offensive text. They were provided with a dataset of 10,000 tweets. Their
proposed method consisted of using ULMFiT, which was originally developed for English
language detection. By using forward and backward training and finding the average of
the two, they reported an accuracy of 96%.
The authors of [18] also responded to a competition for the detection of Arabic text
that was deemed ironic. The competition was held by the Forum for Information Retrieval
(FIRE2019). They provided the contestants with a dataset of 4024 tweets divided into
2091 ironic and 1933 non-ironic tweets. The authors extracted several features, including TF-
IDF word n-gram features, bag-of-words features, topic modeling features, and sentiment
features. They used three types of ensemble learning: hybrid, deep, and classical. They
reported their best F1 score of 84.4 and achieved third place. The authors of [19] collected
450,000 tweets, of which 2000 were annotated and labeled. They extracted various features
from the tweets, including syntactic dependencies between terms, their claim of well-
founded/justified discrimination against social groups, and incitement to respond with
antagonistic action. They found that the ensemble classification approach was the best at
detecting hate speech.
In [20], the authors used a publicly available dataset of 15,050 comments collected
from controversial YouTube videos about Arabic celebrities. After preprocessing, the
authors tested the data using the following classifiers: convolutional neural network (CNN),
bidirectional long short-term memory (Bi-LSTM), Bi-LSTM and attention mechanisms, and
finally a combined CNN-LSTM architecture. They also used Bayesian optimization to
tune the hyperparameters of the network models. They reported that their best result was
achieved using CNN-LSTM, with a recall of 83.46%.
In [21], the authors developed a dataset consisting of 3235 tweets concerning religious
issues, of which 2590 were labeled as “not hate” and 642 were labeled as “hate speech.”
They tested the dataset using many classifiers, including RF, complement NB, decision
tree (DT), and CNN. They also employed two deep learning methods: CNN and RNN.
CNN with FastText achieved the highest F1 measure of 52%. In [22], the authors used a
publicly available dataset from Fox News user comments consisting of 1528 comments, of
which 435 were labeled as hateful. The authors employed the use of LSTM, bidirectional
LSTM, bidirectional LSTM with attention, and an ensemble model. They found that the
best results were achieved using the ensemble model, with an accuracy of 77.9%. Table 1
presents a summary of these recent studies on text classification.
In this study, the recent research was thoroughly analyzed for the latest feature
extraction methods, such as those listed and detailed in [23,24]. In addition, with reference
to the preprocessing stage, the literature detailing the syntax and prejudice of text [25] and
that addressing the clash between symbolic AI and ML was thoroughly analyzed [26].
As can be seen from the literature presented above, Arabic text detection and recognition
is still an open area of research that has not received the full attention it deserves. It
should be noted here that Arabic is spoken by nearly 422 million people, and 25 countries
use Arabic as their official language. This research area still lacks a unified extensive
dataset for researchers to use, and the accuracy levels achieved so far are not yet high
enough to be implemented in practice. In this paper, a dataset that was developed from various
tweets and divided into seven different classes will be presented. To the best of the author's
knowledge, this is by far the largest number of classes for a dataset of Arabic hate speech.
In addition, a novel technique that achieves a higher accuracy than that
reported in the extant literature will also be proposed.
on social media. The comments were annotated and labeled by a group of undergraduate
students whose mother tongue was Arabic and who were proficient in the dialect, as
shown in Table 2. Each comment was annotated and labeled by at least three students that
agreed 100% on the class; otherwise, the comment was discarded. The experiments were
performed using a Windows computer with i7-8700K (3.70 GHz) CPU, 32 GB memory, and
NVIDIA GeForce GTX1080 GPU; the software code was written in Python.
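The unanimous-agreement rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (at least three annotators, 100% agreement, otherwise discard); the function name `consensus_label` and its parameters are hypothetical and do not come from the author's code.

```python
from typing import List, Optional


def consensus_label(votes: List[str], min_annotators: int = 3) -> Optional[str]:
    """Return the agreed label, or None if the comment should be discarded.

    A comment is kept only when at least `min_annotators` students labeled it
    and every one of them chose the same class.
    """
    if len(votes) < min_annotators:
        return None
    return votes[0] if len(set(votes)) == 1 else None
```

For example, three identical votes keep the comment under that label, whereas a single dissenting vote discards it.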
Figure 1. Workflow of the proposed system for hate speech detection from Arabic text.
Table 2. Summary of the newly constructed class and data size for hate speech detection.
3.2. Preprocessing and Word Embedding Layer for Arabic Hate Speech Dataset
The cleaning and preprocessing of data is a vital step to increase accuracy.
The Arabic hate speech dataset was subjected to a preprocessing stage in which the data
were cleaned and irrelevant data were discarded. In addition, diacritics, symbols, special
characters, and emojis were removed from the data. It should be noted that the same Arabic
character can have several different Unicode representations; thus, a dictionary was created
to map these variant representations to a single fixed character.
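A minimal Python sketch of this kind of cleaning is given below. It is an illustration under stated assumptions, not the author's implementation: the mapping in `CHAR_MAP` is a hypothetical example of a variant-to-fixed-character dictionary (the actual dictionary is not listed in the paper), and the choice to keep only Arabic letters and whitespace is likewise assumed.

```python
import re
import unicodedata

# Arabic diacritics (U+064B..U+0652), superscript alef (U+0670), tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Hypothetical variant-to-fixed-character dictionary (assumption, not from the paper).
CHAR_MAP = {
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
}


def normalize_arabic(text: str) -> str:
    # NFKC folds presentation forms (e.g., U+FE8D) into their base letters.
    text = unicodedata.normalize("NFKC", text)
    # Remove diacritics and tatweel.
    text = DIACRITICS.sub("", text)
    # Map variant letter forms to one fixed character.
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    # Replace everything that is not an Arabic letter or whitespace
    # (symbols, digits, Latin text, emojis) with a space.
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Calling `normalize_arabic` on a diacritized word followed by an emoji and punctuation returns the bare word only.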
Each comment was split on whitespace, and a default tokenizer was applied to remove symbols
from the resulting strings. The tokens were then converted to a tensor of numerical data
suitable for neural networks. The data were
then extensively labeled; this was a hard, laborious, and time-consuming task. To ensure
that the labeling was performed correctly, randomly selected text was shown to a different
set of students who were tasked with labeling it; the two labels were compared to indicate
if the labeling task had been carried out accurately. The labels were converted from an
integer to a float tensor so that each label was represented by a numerical vector.
The default batch size was set at 32; however, this, along with other
hyperparameters, was changed during each experiment.
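The conversion from text and integer labels to numerical vectors can be sketched as follows. The sketch uses plain Python lists in place of tensors and makes several assumptions that are not stated in the paper: the reserved `PAD` and `UNK` indices, frequency-based vocabulary ranking, fixed-length padding, and a one-hot reading of the label vector. All function names are hypothetical.

```python
from typing import Dict, Iterator, List

PAD, UNK = 0, 1  # reserved indices (assumption, not from the paper)


def build_vocab(comments: List[str], max_size: int = 25000) -> Dict[str, int]:
    """Map the most frequent whitespace-separated tokens to integer ids."""
    counts: Dict[str, int] = {}
    for comment in comments:
        for token in comment.split():
            counts[token] = counts.get(token, 0) + 1
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i + 2 for i, tok in enumerate(ranked[: max_size - 2])}


def encode(comment: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn a comment into a fixed-length list of token ids (truncate, then pad)."""
    ids = [vocab.get(tok, UNK) for tok in comment.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def one_hot(label: int, num_classes: int) -> List[float]:
    """Represent an integer class label as a float vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def batches(items: List, batch_size: int = 32) -> Iterator[List]:
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With seven classes, `one_hot(2, 7)` yields a vector of length seven with a single 1.0 in the third position; the default `batch_size` of 32 mirrors the default reported above.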
Word2Vec [27] and global vectors (GloVe) [28] were the models chosen for word embedding,
which allowed the system to learn word vectors according to their co-occurrence in
sentences. This representation of the words in a lower-dimensional vector space allowed
the deep learning algorithms to map the words based on the similarity of their semantic
properties. As shown in Figure 2, two architectures are available for Word2Vec:
continuous bag of words (CBOW) and skip-gram (SG). The former predicts the target word
from its context, i.e., a window of the words that come before and after it in the
sentence. The latter predicts the surrounding words based on the target
word. After this extensive preprocessing, a vocabulary set of 25,000 words was created,
and its size was changed based on the various experiments. The data were split into two as
mentioned earlier, with 80% for training and 20% for validation.
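The difference between the two Word2Vec architectures is easiest to see in the training examples each one generates. The following sketch only builds those examples and does not train any embeddings; the helper names and the default window size of 2 are illustrative assumptions.

```python
from typing import List, Tuple


def context(tokens: List[str], i: int, window: int) -> List[str]:
    """Words within `window` positions before and after position i."""
    return tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: (context words) -> target word."""
    return [(context(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: target word -> one surrounding word per example."""
    return [
        (tokens[i], neighbor)
        for i in range(len(tokens))
        for neighbor in context(tokens, i, window)
    ]
```

For a three-word sentence and a window of 1, CBOW produces one example per word (its neighbors as input, the word as output), whereas skip-gram produces one example per (word, neighbor) pair.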
3.3. Proposed Deep Recurrent Neural Network Model for Hate Speech Detection
With the advancement of processing power in computing machines,
deep learning has been widely adopted in almost all data processing problems. A
convolutional neural network (CNN) is a deep neural network that
exploits the inherent structure of the data, particularly in image data
but in text as well. Recurrent neural networks (RNNs) are similar to CNNs; however, RNNs
also manage the time domain. In a unidirectional (forward-only) RNN, the model learns
word by word from start to end.
In the Arabic language, most meanings of words and sentences change based on where
the word has occurred as well as the next words in the sentence. In bidirectional recurrent
neural networks, two sequences are considered: the forward sequence (left to right) and
the backward sequence (right to left). For example, an Arabic sequence can be used in the
same order for the forward layer; however, in the backward layer, the order of the text will
be reversed. In Arabic text classification, the time domain, which represents the order of
the words in the input text, plays a vital role by solving the problem more effectively and
enhancing the classification results. Long short-term memory (LSTM) can be used to solve
the vanishing gradient problem, which occurs if a long bidirectional RNN is used alone.
The proposed deep bidirectional RNN model for hate speech detection from the
Arabic social media comments consists of 10 layers, including the input layer, bidirectional
RNN layers, flatten layer, dense layers, dropout layers, and output layers. Figure 3 shows
the details of the RNN forward and backward layers used to extract the features. The
embedding layer produced by Word2Vec and GloVe detailed earlier is then input into the
backward RNN layer followed by the forward RNN layer. The output of the RNN is then
concatenated, flattened, and input into the dense and dropout layers, as detailed in Figure 4.
The data are then classified into one of the seven classes.
Figure 3. Proposed architecture of the deep recurrent neural network model for feature extraction
from Arabic text (‘محفظة مفاتيح رائعة’ means ‘keys wallet is amazing’).
In the proposed deep bidirectional recurrent neural network, the tanh activation
function was used for the dense layers while the sigmoid activation function was used
for the output layer. The experiments were performed using a learning rate of 0.001 with
a batch size of 32 for 50 iterations. The model with the best validation accuracy over the
50 iterations was saved and then used on the test dataset to generate and compare results.
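The forward and backward passes of a bidirectional RNN can be sketched in pure Python as shown below. This is a toy illustration of the general technique, not the DRNN-1 or DRNN-2 architecture: it assumes a simple (Elman) recurrent cell with a tanh activation and no bias, uses randomly initialized weights, and omits the flatten, dense, dropout, and output layers. All names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_pass(inputs: List[Vector], w_in: Matrix, w_rec: Matrix) -> List[Vector]:
    """Simple RNN: h_t = tanh(W_in x_t + W_rec h_{t-1}), starting from h_0 = 0."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def bidirectional(
    inputs: List[Vector],
    fwd: Tuple[Matrix, Matrix],
    bwd: Tuple[Matrix, Matrix],
) -> List[Vector]:
    """Concatenate forward-pass and backward-pass states for every position."""
    forward_states = rnn_pass(inputs, *fwd)
    # The backward layer reads the reversed sequence; its states are reversed
    # again so that position t lines up with the forward state at position t.
    backward_states = rnn_pass(inputs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward_states, backward_states)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
```

Each position in the output carries a vector of twice the hidden size: the first half summarizes the words read so far from the start, and the second half summarizes the words read so far from the end.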
Table 3. Evaluation metrics used for the experimental results. TP —true positive; TN —true negative;
FP —false positive; FN —false negative.
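The metrics built from these four counts can be computed as in the following sketch. It assumes the standard definitions of accuracy, precision, recall, and F1 score; the function name and the convention of returning 0.0 when a denominator is zero are illustrative choices.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For multiclass experiments, the same function can be applied per class by treating that class as positive and all remaining classes as negative.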
Figure 4. Architecture of the convolutional operations for the hate speech detection from the
extracted features.
Table 4. Comparison of experimental results using typical classification methods for the two classes
of Arabic comments.
A repeat of the same experiment using the proposed deep RNN model showed
a significant improvement in the binary classification of Arabic comments, as shown
in Table 5. There were two proposed architectures: one consisting of five deep layers
with 231,841 parameters, referred to as DRNN-1, and the other with 10 deep layers and
1,682,787 parameters, referred to as DRNN-2. Using the DRNN-1 architecture, the training
accuracy increased to 99.73%, with a training loss of 0.0044 and a validation loss and
validation accuracy of 0.94 and 83.22%, respectively.
Table 5. Experimental results using the proposed deep recurrent neural network model-based
classification method for two classes of Arabic comments.
Model    Deep Layers  Parameters  Iterations  Train Loss  Train Acc.  Val. Loss  Val. Acc.
DRNN-1   5            231,841     10          0.0709      98.28%      0.4815     75.17%
DRNN-1   5            231,841     25          0.0074      99.78%      0.7993     81.88%
DRNN-1   5            231,841     50          0.0041      99.93%      0.9914     82.55%
DRNN-1   5            231,841     Best        0.0044      99.73%      0.9439     83.22%
DRNN-2   10           1,682,787   10          0.0289      99.19%      0.8572     79.39%
DRNN-2   10           1,682,787   25          0.0038      99.87%      1.1484     81.82%
DRNN-2   10           1,682,787   50          0.0073      99.80%      1.1335     83.64%
DRNN-2   10           1,682,787   Best        0.0578      98.18%      0.3574     90.30%
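Selecting a best model by validation accuracy, as the "Best" rows reflect, can be sketched as follows. The function name, the tie-breaking rule (earliest iteration wins), and the history values in the usage note are hypothetical; the paper does not report at which iteration each best model was obtained.

```python
from typing import Iterable, Optional, Tuple


def best_iteration(history: Iterable[Tuple[int, float]]) -> Tuple[Optional[int], float]:
    """Return (iteration, validation accuracy) of the best checkpoint.

    `history` holds one (iteration, validation_accuracy) pair per iteration.
    The comparison is strict, so the earliest iteration wins a tie.
    """
    best_iter, best_acc = None, float("-inf")
    for iteration, val_acc in history:
        if val_acc > best_acc:
            best_iter, best_acc = iteration, val_acc
    return best_iter, best_acc
```

With a made-up history such as `[(1, 0.70), (2, 0.82), (3, 0.79), (4, 0.82)]`, the function selects iteration 2. In a real training loop, the model weights would be saved whenever the comparison succeeds.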
With DRNN-2, the best model achieved a training accuracy of 98.18% with a training
loss of 0.0578, while the validation loss and validation accuracy were 0.3574 and 90.30%,
respectively. DRNN-1 took 328 s to train the model, while DRNN-2, as a more complex
network, took 492 s to train the model. It was observed that DRNN-1 achieved the higher
training accuracy (99.73% vs. 98.18%), whereas DRNN-2 achieved the higher validation
accuracy (90.30% vs. 83.22%); both achieved better results than the classifiers listed in Table 4.
4.2. Hate Speech Detection from Positive and Negative Arabic Comments
The second experiment was based on three classes of comments: negative, positive, or
hate speech. Table 6 shows the results of using the DT, MLP, NB, and LR classifiers; the
highest accuracy (56.192%) was achieved using DT. Again,
this was still a very low percentage. The precision, recall, and F1 score were the highest for
DT as well, with values of 0.56, 0.56, and 0.56, respectively.
A repeat of the same experiment using the proposed deep RNN model (DRNN-2)
showed a significant improvement in the classification of the three classes of Arabic
comments, as shown in Table 7. The DRNN-2 results are shown because they are significantly
better than those achieved by DRNN-1. With the use of DRNN-2, the training accuracy
increased to 95.38%, with a training loss of 0.1391 and a validation loss and validation
accuracy of 0.4148 and 86.40%, respectively. It can be observed that the proposed model
showed better results than those listed in Table 6 as well as those listed in Table 1 from the
extant literature while targeting the classification of Arabic comments into three classes.
Table 6. Comparison of experimental results using the typical classification methods for the three
classes of Arabic comments.
Table 7. Experimental results using the proposed deep recurrent neural network model-based
classification method for three classes of Arabic comments.
Figure 5 shows the confusion matrix for the classification of Arabic comments into
three classes. The diagonal squares indicate that the confusion was minimal when it came
to normal positive and normal negative comments. However, with the hate speech class,
the confusion matrix shows that there were abnormalities in the true identification of hate
speech comments.
Figure 5. Confusion matrix of the experimental results using the proposed deep recurrent neural
network model-based classification method for three classes of Arabic comments.
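Per-class precision and recall can be read off a confusion matrix as in the sketch below. It assumes that rows hold the true classes and columns hold the predicted classes; this convention, the function name, and the counts used in the usage note are illustrative assumptions and are not taken from Figure 5.

```python
from typing import Dict, List


def per_class_scores(matrix: List[List[int]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision and recall per class (rows = true class, columns = predicted class)."""
    scores = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        row_total = sum(matrix[k])                   # all comments truly in class k
        col_total = sum(row[k] for row in matrix)    # all comments predicted as class k
        scores[label] = {
            "recall": true_positive / row_total if row_total else 0.0,
            "precision": true_positive / col_total if col_total else 0.0,
        }
    return scores
```

With hypothetical counts `[[50, 3, 2], [4, 45, 6], [10, 8, 30]]` for the classes positive, negative, and hate speech, the hate speech recall is 30/48 = 0.625, which illustrates how a class can be identified less reliably than the others even when the diagonal dominates.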
Table 8. Comparison of experimental results using the typical classification methods for seven classes
of Arabic comments.
A repeat of the same experiment using the proposed deep RNN model (DRNN-2)
showed a significant improvement in the classification of the seven classes of Arabic
comments, as shown in Table 9. Again, the choice to show the results of DRNN-2 for these
experiments was because it produced a significant improvement in accuracy over DRNN-1.
The training accuracy increased to 84.14%, with a training loss of 0.4752 and a validation
loss and validation accuracy of 1.6282 and 58.26%, respectively. It can be observed that the
proposed model showed better results than those listed in Table 8 as well as those listed in
Table 1.
Table 9. Experimental results using the proposed deep recurrent neural network model-based
classification method for seven classes of Arabic comments.
Figure 6 shows the confusion matrix for the classification of Arabic comments into
seven classes. The diagonal squares indicate that the confusion was minimal when it
came to normal positive comments; medium when it came to racist, violent/offensive, and
bullying comments; and worst when it came to normal negative, religious, and gender
inequality comments.
A complete cloud-based system will be designed with various smart functions, such
as a system to alert the authorities, a function to produce tangible reports, one to produce
various reports according to need, and a human–computer interface in order to focus on
certain aspects in certain situations. The system prototype is expected to be developed
towards the end of this project, and publication in a Scopus-indexed or ISI-indexed journal
may result from the complete system.
Figure 6. Confusion matrix of the experimental results from the proposed deep recurrent neural
network model-based classification method for seven classes of Arabic comments.
5. Conclusions
Many social media platforms prevent users from using hate speech, racism, cyberbullying,
etc. In many countries, privacy laws prevent governments from monitoring the internet
activities of individuals, even for national security reasons. However,
there are many countries in which the government is allowed to monitor social media and
other citizen activities for various reasons, including national security and the prevention
of cyberbullying, hate speech, and racism. In this paper, a deep machine learning algorithm
was proposed for the automatic classification and detection of Arabic hate speech. The
availability of publicly published datasets is scarce, and those that are published do not
contain the type of written comments that were targeted in this study. Therefore, one of the
main contributions of this work was the development of a unique dataset that contained
4203 comments gathered from various social media platforms and labeled into one of the
following seven classes: content against religion, racist content, content against gender
equality, violent/offensive content, insulting/bullying content, normal positive comments,
and normal negative comments.
The dataset was subjected to a thorough preprocessing phase and the comments were
labeled in a scientific manner that ensured the accuracy of the labeling and classification. In
this work, the use of deep recurrent neural networks (RNNs) for the automatic classification
and detection of Arabic hate speech was proposed. The proposed model, which was called
DRNN-2, consisted of 10 layers and was trained with a batch size of 32 for 50 iterations.
Another model consisting of five layers, which was called DRNN-1, was used
in the binary classification experiment only. Using the proposed models, a recognition
rate of 99.73% was achieved for binary classification, 95.38% for the three classes of Arabic
comments, and 84.14% for the seven classes of Arabic comments. This was a high accuracy
for the classification of a complex language, such as Arabic, into seven different classes.
The achieved accuracy was higher than any similar method reported in the recent literature,
whether for binary classification, three-class classification, or seven-class classification.
Future work should include the continuous development of the dataset in order to
increase its size and classes and develop a unified dataset for researchers in this field. In
addition, future research should focus on the continuous development and modification of
machine learning techniques to achieve better accuracies for the classification of hate speech
in Arabic. In addition, the author of the current study will work to develop a prototype
that can be applied within social media platforms for real-time testing and processing.
References
1. Bashar, A.; Latif, G.; Ben Brahim, G.; Mohammad, N.; Alghazo, J. COVID-19 Pneumonia Detection Using Optimized Deep
Learning Techniques. Diagnostics 2021, 11, 1972. [CrossRef] [PubMed]
2. Latif, G.; Al Anezi, F.Y.; Sibai, F.N.; Alghazo, J. Lung Opacity Pneumonia Detection with Improved Residual Networks. J. Med.
Biol. Eng. 2021, 41, 581–591. [CrossRef]
3. Latif, G.; Alghazo, J.; Sibai, F.N.; Iskandar, D.A.; Khan, A.H. Recent advancements in Fuzzy C-means based techniques for brain
MRI Segmentation. Curr. Med. Imaging 2021, 17, 917–930. [CrossRef]
4. Lundervold, A.S.; Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Für Med. Phys. 2019, 29,
102–127. [CrossRef] [PubMed]
5. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning-based Text Classification: A
Comprehensive Review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [CrossRef]
6. Latif, G.; Bouchard, K.; Maitre, J.; Back, A.; Bédard, L.P. Deep-Learning-Based Automatic Mineral Grain Segmentation and
Recognition. Minerals 2022, 12, 455. [CrossRef]
7. Al-Hmouz, A.; Latif, G.; Alghazo, J.; Al-Hmouz, R. Enhanced numeral recognition for handwritten multi-language numerals
using a fuzzy set-based decision mechanism. Int. J. Mach. Learn. Comput. 2020, 10, 99–107. [CrossRef]
8. Gambäck, B.; Sikdar, U.K. Using convolutional neural networks to classify hate speech. In Proceedings of the First Workshop on
Abusive Language Online, Vancouver, BC, Canada, 4 August 2017; pp. 85–90.
9. Malmasi, S.; Zampieri, M. Detecting hate speech in social media. arXiv 2017, arXiv:1712.06427.
10. Del Vigna, F.; Cimino, A.; Dell’Orletta, F.; Petrocchi, M.; Tesconi, M. Hate me, hate me not: Hate speech detection on Facebook. In
Proceedings of the First Italian Conference on Cybersecurity, Venice, Italy, 17–20 January 2017; pp. 86–95.
11. Zimmerman, S.; Kruschwitz, U.; Fox, C. Improving hate speech detection with deep learning ensembles. In Proceedings of the
Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018.
12. Robinson, D.; Zhang, Z.; Tepper, J. Hate speech detection on Twitter: Feature engineering vs feature selection. In Proceedings of the
European Semantic Web Conference, Heraklion, Greece, 3–7 June 2018; Springer: Cham, Switzerland, 2018; pp. 46–49.
13. Alfina, I.; Mulia, R.; Fanany, M.I.; Ekanata, Y. Hate speech detection in the Indonesian language: A dataset and preliminary study.
In Proceedings of the International Conference on Advanced Computer Science and Information Systems (ICACSIS), Yogyakarta,
Indonesia, 28–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 233–238.
14. Fauzi, M.A.; Yuniarti, A. Ensemble method for Indonesian twitter hate speech detection. Indones. J. Electr. Eng. Comput. Sci. 2018,
11, 294–299. [CrossRef]
15. Mubarak, H.; Rashed, A.; Darwish, K.; Samih, Y.; Abdelali, A. Arabic offensive language on Twitter: Analysis and experiments.
arXiv 2020, arXiv:2004.02192.
16. Omar, A.; Mahmoud, T.M.; Abd-El-Hafeez, T. Comparative performance of machine learning and deep learning algorithms for
Arabic hate speech detection in osns. In Proceedings of the International Conference on Artificial Intelligence and Computer
Vision, Cairo, Egypt, 8–10 April 2020; Springer: Cham, Switzerland, 2020; pp. 247–257.
Appl. Sci. 2022, 12, 6010 15 of 15
17. Abdellatif, M.; Elgammal, A. Offensive language detection in Arabic using ULMFiT. In Proceedings of the 4th Workshop on
Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France,
11–16 May 2020; pp. 82–85.
18. Khalifa, M.; Hussein, N. Ensemble Learning for Irony Detection in Arabic Tweets. In FIRE (Working Notes); CEUR-WS: New Delhi,
India, 2019; pp. 433–438.
19. Burnap, P.; Williams, M.L. Hate speech, machine classification and statistical modeling of information flow on Twitter: Inter-
pretation and communication for policy decision making. In Proceedings of the Internet, Policy & Politics, Oxford, UK, 25–26
September 2014; pp. 1–18.
20. Mohaouchane, H.; Mourhir, A.; Nikolov, N.S. Detecting offensive language on Arabic social media using deep learning. In
Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS),
Granada, Spain, 22–25 October 2019; pp. 466–471.
21. Aref, A.; Al Mahmoud, R.H.; Taha, K.; Al-Sharif, M. Hate Speech Detection of Arabic Shorttext. CS IT Conf. Proc. 2020, 10, 81–94.
22. Gao, L.; Huang, R. Detecting online hate speech using context-aware models. arXiv 2017, arXiv:1710.07395.
23. Sitaula, C.; Basnet, A.; Mainali, A.; Shahi, T.B. Deep Learning-Based Methods for Sentiment Analysis on Nepali COVID-19-Related
Tweets. Comput. Intell. Neurosci. 2021, 2021, 2158184. [CrossRef] [PubMed]
24. Shahi, T.B.; Sitaula, C.; Paudel, N. A Hybrid Feature Extraction Method for Nepali COVID-19-Related Tweets Classification.
Comput. Intell. Neurosci. 2022, 2022, 5681574. [CrossRef] [PubMed]
25. Mastromattei, M.; Ranaldi, L.; Fallucchi, F.; Zanzotto, F.M. Syntax and prejudice: Ethically-charged biases of a syntax-based hate
speech recognizer unveiled. PeerJ Comput. Sci. 2022, 8, e859. [CrossRef] [PubMed]
26. Ranaldi, L.; Fallucchi, F.; Zanzotto, F.M. Dis-Cover AI Minds to Preserve Human Knowledge. Future Internet 2021, 14, 10.
[CrossRef]
27. Hu, W.; Gu, Z.; Xie, Y.; Wang, L.; Tang, K. Chinese Text Classification Based on Neural Networks and Word2vec. In Proceedings
of the 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Hangzhou, China, 23–25 June 2019;
pp. 284–291.
28. Hossain, T.; Mauni, H.Z.; Rab, R. Reducing the Effect of Imbalance in Text Classification Using SVD and GloVe with Ensemble
and Deep Learning. Comput. Inform. 2022, 41, 98–115. [CrossRef]