Article
Deep Learning-Based Depression Detection from Social Media:
Comparative Evaluation of ML and Transformer Techniques
Biodoumoye George Bokolo * and Qingzhong Liu
Department of Computer Science, Sam Houston State University, Huntsville, TX 77341, USA; [email protected]
* Correspondence: [email protected]
Abstract: Detecting depression from user-generated content on social media platforms has garnered
significant attention due to its potential for the early identification and monitoring of mental health
issues. This paper presents a comprehensive approach for depression detection from user tweets
using machine learning techniques. The study utilizes a dataset of 632,000 tweets and employs data
preprocessing, feature selection, and model training with logistic regression, Bernoulli Naive Bayes,
random forests, DistilBERT, SqueezeBERT, DeBERTa, and RoBERTa models. Evaluation metrics such
as accuracy, precision, recall, and F1 score are employed to assess the models’ performance. The
results indicate that the RoBERTa model achieves the highest accuracy of 0.981 and the highest
mean accuracy of 0.97 (across 10 cross-validation folds) in detecting depression from tweets. This
research demonstrates the effectiveness of machine learning and advanced transformer-based models
in leveraging social media data for mental health analysis. The findings offer valuable insights into
the potential for early detection and monitoring of depression using online platforms, contributing to
the growing field of mental health analysis based on user-generated content.
Keywords: depression detection; social media analysis; deep learning models; NLP techniques; user
tweets; mental health identification; sentiment analysis; large language models
2. Literature Review
In recent years, there has been a growing interest in the use of artificial intelligence
(AI) and machine learning (ML) to improve mental health care. As Jain et al. (2022) [11]
discussed, AI and ML can be used to detect and diagnose mental health conditions, develop
AI-powered interventions, and improve access to mental health care services.
There has been a growing body of research exploring the detection of depression from
social media data [12], particularly utilizing machine learning techniques. This section
provides an overview of key studies and methodologies in the field, highlighting the
advancements made in detecting depression through user tweets.
Negative comments or expressions of pessimism are often associated with depressive tendencies [13]. Research studies have explored the link between negative language use and depression, providing evidence to support this association [9].
In a study conducted by Coppersmith et al. [5], the researchers analyzed social media data and found a significant correlation between the language used in tweets and the prevalence of depression symptoms. They observed that individuals with higher levels of depression were more likely to express negative sentiments in their tweets.
In another study, De Choudhury et al. [9] investigated the association between language markers and depression on social media platforms. They found that individuals with depressive symptoms
tended to use more negative language, indicating a correlation between negative expression
and depression.
Gkotsis et al. [14] employed informed deep learning techniques to characterize mental
health conditions in social media. They utilized a large-scale dataset of social media posts and applied deep learning algorithms to detect mental health conditions, including depression.
Their approach showcased the potential of leveraging deep learning models to gain insights
from user-generated content and improve mental health monitoring.
Moreover, the study by Resnik et al. [15] explored the role of sentiment analysis and
linguistic markers in detecting depression from Twitter data. They developed a machine
learning framework that incorporated sentiment analysis features to predict depression
levels in individuals. Their findings highlighted the importance of sentiment analysis in
capturing emotional states and identifying signs of depression.
The study [16], which focused on detecting depression using social media data and machine learning, employed various text classification algorithms, including Support Vector Machines (SVMs) and random forests, to classify tweets as depressive or non-depressive. As
explained by Kim (2017) [17], SVMs work by finding a hyperplane in the data that sepa-
rates the two classes (depressed vs. not depressed) with the maximum margin. The study
achieved promising results in terms of classification accuracy, demonstrating the potential
of machine learning approaches for depression detection. While the study demonstrated
effective depression detection from social media data, it primarily focused on traditional
machine learning algorithms. Incorporating more advanced deep learning models such as
recurrent neural networks or transformers could potentially improve the performance and
capture complex patterns within the tweet data.
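For reference, the maximum-margin hyperplane mentioned above is obtained by solving the standard (hard-margin) SVM optimization problem

\min_{\mathbf{w},\, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) \geq 1, \quad i = 1, \ldots, n,

where the resulting margin between the two classes is 2 / \lVert \mathbf{w} \rVert.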
In [18], the researchers examined the use of natural language processing techniques
to analyze social media posts for detecting depression. They applied sentiment analysis
and topic modeling to identify linguistic markers associated with depression. The study
highlighted the importance of linguistic cues in identifying mental health conditions from
social media data. Although the study provided valuable insights into the linguistic mark-
ers of depression, the research focused solely on Twitter activity and did not explore the
potential of utilizing additional contextual information from user profiles or network inter-
actions. Integrating these additional features could enhance the accuracy and robustness of
depression detection models.
Guntuku et al. (2017) [19] investigated the relationship between language patterns
on Twitter and depression symptoms. Their research explored the differences in linguistic
style, linguistic content, and social engagement between depressed and non-depressed
individuals. The study highlighted the importance of considering social context and
interaction patterns in depression detection. It also provided a comprehensive review
of the existing literature on detecting depression from social media. However, further
research is needed to explore the generalizability of the findings to diverse populations
and different social media platforms, as user behavior and language use may vary across
platforms and cultural contexts.
Other studies have also emphasized the significance of incorporating contextual
information from social media platforms [20]. For example, Nguyen et al. [21] investigated
the relationship between social context and depression detection, analyzing not only the
content of tweets but also the social network connections between users. Their research
highlighted the potential benefits of considering the social context in understanding mental
health indicators on social media.
Overall, these studies collectively demonstrate the potential of using machine learning
techniques for detecting depression from user tweets. By analyzing linguistic patterns,
social interactions, and contextual information, researchers have made strides in developing
computational models capable of identifying individuals at risk of depression.
We utilized an existing dataset, the Sentiment140 dataset, which was first introduced by Go, Bhayani, and Huang (2009) [22] and was originally intended for sentiment analysis. For the purposes of this study on depression detection, we applied a custom algorithm to repurpose and re-label the dataset, as described in Section 3.2. The Sentiment140 dataset consists of approximately 1.6 million English-language tweets collected
from Twitter. These tweets were gathered from April to June 2009 and covered a wide
range of topics and sentiments expressed by users. Each tweet in the dataset is labeled with
sentiment polarity, specifically as either positive or negative, as can be seen in Table 1.
Table 1. Features of the original Sentiment140 dataset.

Feature   Details
Target    The sentiment of the tweet (0 = negative, 2 = neutral, 4 = positive)
ID        The ID of the tweet (e.g., "2087")
Date      The date of the tweet (e.g., "Sat May 16 23:58:44 UTC 2009")
UserID    The user that tweeted (e.g., "coolboy21")
Text      The text of the tweet (e.g., "James is cool")
It is important to note that while this adaptation of the dataset, whose features are described in Table 2, provides a starting point for depression detection, it may not capture all nuances of depression; further refinement and validation, some of which fall outside the scope of this study, may be necessary.
Table 2. Features of the re-labeled dataset.

Feature   Details
Label     Whether the tweet shows depressive symptoms or not
Tweet     The text of the user's Twitter post
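The actual re-labeling algorithm is the one described in Section 3.2; since it is not reproduced here, the following is a purely hypothetical sketch of what a keyword-assisted re-labeling of Sentiment140 could look like. The lexicon, the decision rule, and the file names are illustrative assumptions only, not the study's procedure.

# Hypothetical re-labeling sketch; DEPRESSION_TERMS and the rule below
# are illustrative, not the algorithm described in Section 3.2.
import pandas as pd

DEPRESSION_TERMS = {"depressed", "hopeless", "worthless", "lonely", "empty"}  # hypothetical lexicon

# Raw Sentiment140 columns; the file also carries a query "Flag" column.
cols = ["Target", "ID", "Date", "Flag", "UserID", "Text"]
df = pd.read_csv("sentiment140.csv", encoding="latin-1", names=cols)

def relabel(row) -> int:
    words = set(row["Text"].lower().split())
    # Negative sentiment plus at least one depression-related term -> depressive.
    return int(row["Target"] == 0 and bool(words & DEPRESSION_TERMS))

df["Label"] = df.apply(relabel, axis=1)
df[["Label", "Text"]].rename(columns={"Text": "Tweet"}).to_csv("relabelled_tweets.csv", index=False)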
After the transformation and re-labeling of the dataset, word cloud diagrams were used to analyze the new labels. The word clouds, shown in Figures 3 and 4, visually compare the most frequently occurring words in depressive and non-depressive tweets, with more frequent words rendered larger than less frequent ones; a sketch of how such diagrams can be generated follows.
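As an illustration, the comparison word clouds can be produced with the wordcloud package; the input file and column names below follow the illustrative Table 2 schema and are assumptions, not the study's exact code.

# Illustrative word-cloud comparison for depressive vs. non-depressive tweets.
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("relabelled_tweets.csv")  # hypothetical file from the re-labeling sketch

for label, title in [(1, "Depressive tweets"), (0, "Non-depressive tweets")]:
    text = " ".join(df.loc[df["Label"] == label, "Tweet"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")  # more frequent words render larger
    plt.axis("off")
    plt.title(title)
plt.show()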
• Removing Stop Words: Stop words, such as “and”, “the”, or “is”, do not typically
carry significant sentiment information. Hence, we removed them from the tweet
text to reduce noise and improve the accuracy of the analysis.
• Applying Stemming and Lemmatization: We performed stemming and lemmatization to reduce words to their root forms. Stemming truncates words to their base or stem (e.g., "running" to "run"), while lemmatization converts words to their base form based on their dictionary meaning (e.g., "better" to "good"). These techniques help standardize the text and reduce the dimensionality of the dataset. By applying these cleaning operations, we ensured that the tweet data were normalized, free of noise, and ready for subsequent feature extraction and machine learning tasks. A minimal code sketch of these cleaning steps is given below.
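The following is a minimal sketch of the cleaning steps above using NLTK; the exact regular expressions and resources used in the study are not shown in this paper, so these are illustrative choices.

# Illustrative tweet cleaning: stop-word removal, lemmatization, stemming.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)  # stop-word list
nltk.download("wordnet", quiet=True)    # lemmatizer dictionary

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    """Lowercase, strip URLs/mentions/punctuation, drop stop words, lemmatize, stem."""
    text = re.sub(r"https?://\S+|@\w+|[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens)

print(clean_tweet("I was running all day and it is better now https://t.co/x"))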
• Data Length: The __len__ method shown in Figure 7 returns the total number of samples in the dataset, which is equivalent to the number of text entries; a minimal sketch of such a Dataset class is shown below.
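The following is a minimal sketch of the kind of PyTorch Dataset implied by Figure 7; the field names (encodings, labels) are assumptions, not the study's exact code.

# Minimal PyTorch Dataset wrapping tokenized tweets and their labels.
import torch
from torch.utils.data import Dataset

class TweetDataset(Dataset):
    def __init__(self, encodings: dict, labels: list):
        self.encodings = encodings  # tokenizer output: input_ids, attention_mask
        self.labels = labels

    def __len__(self) -> int:
        # Total number of samples = number of text entries.
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item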
Figure 10. Model architecture used for fine-tuning of the DeBERTa model.
Figure 11. Model architecture used for fine-tuning of the DistilBERT model.
Figure 12. Model architecture used for fine-tuning of the SqueezeBERT model.
• Loss function used: Cross-entropy loss
Cross-entropy measures the discrepancy between the predicted probability distribution and the true labels [34]. In this study, the cross-entropy loss function was utilized to compute the loss during the training process [35]. The cross-entropy loss encourages the model to assign higher probabilities to the correct class and lower probabilities to the incorrect classes. By minimizing this loss, the model learns to make accurate predictions and capture the underlying patterns in the data (its standard form is given after this list).
• Optimizer used: Adam optimizer
Adam combines the benefits of two other extensions of stochastic gradient descent. In particular:
– The adaptive gradient algorithm (AdaGrad), which improves performance on problems with sparse gradients (such as computer vision and natural language tasks) by maintaining a per-parameter learning rate.
– Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates, adapted according to the average of recent gradient magnitudes for each weight (i.e., how quickly the gradients are changing). This makes the algorithm well suited to non-stationary, online scenarios [36].
In this study, we employed the Adam optimizer, allowing for faster convergence and better performance compared to traditional optimization methods. It efficiently updated the model parameters during training, adjusting the learning rate based on the gradient magnitudes and past gradients.
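For reference, the binary cross-entropy loss [34] and the Adam update rule [36] can be written in their standard forms, where N is the number of training examples, p_i the predicted probability of the positive class for example i, g_t the gradient at step t, and \alpha, \beta_1, \beta_2, \epsilon the usual hyperparameters:

\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.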
By utilizing these models, we aimed to explore different approaches to detecting de-
pression from user tweets. Each model has its strengths and characteristics, allowing us to
compare their performance and determine the most effective approach for our specific task.
4. Results
The methodology used to evaluate the performance of the models for depression
detection from user tweets is outlined here. The evaluation employed common classification
metrics, including accuracy, precision, recall, F1 score, and confusion matrix diagram.
1. Accuracy measures the proportion of correctly classified instances out of the total
instances. It is a fundamental metric that provides an overall assessment of the
model’s performance. In depression detection, it indicates the model’s ability to
correctly identify both depressive and non-depressive tweets, offering a clear picture
of its general effectiveness.
2. The F1 score is the harmonic mean of precision and recall, each of which relates the model's positive predictions to the actual positive instances. It provides a balanced measure that accounts for both false positives and false negatives. In depression detection, the F1 score helps strike a balance between minimizing both types of errors: a high F1 score indicates a model that effectively identifies depressive tweets while maintaining a low rate of misclassification.
3. Precision measures the proportion of true positive predictions out of all positive
predictions made by the model. In the context of depression detection, precision
signifies the model’s ability to correctly identify tweets as depressive without making
many false positive predictions. This is crucial, as misclassifying non-depressive
tweets as depressive could have negative consequences.
4. Recall, also known as sensitivity or true positive rate, measures the proportion of true
positive predictions out of all actual positive instances. In depression detection, recall
indicates how well the model captures all the depressive tweets present in the dataset.
It is particularly important because missing depressive tweets (false negatives) can be
as problematic as false positives.
5. The confusion matrix diagram provides a detailed breakdown of the model’s pre-
dictions, including true positives, true negatives, false positives, and false negatives.
It offers insights into the distribution of correct and incorrect predictions, helping
researchers understand where the model excels and where it may need improvement.
Visualizing the confusion matrix can also aid in identifying specific areas of concern and fine-tuning the model accordingly. For reference, the formulas for these metrics are given below.
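Letting TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, these metrics take the standard forms:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},

\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.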
2. SqueezeBERT: The SqueezeBERT model was trained on the dataset using the selected
methodology. The accuracy metric indicates the proportion of correctly classified
instances out of the total number of instances. In this case, the SqueezeBERT model
achieved an accuracy rate of 95.48%, demonstrating its ability to effectively classify
tweets as indicative of depression or not.
The model also demonstrated strong performance in cross-validation, with a mean accuracy of 95.598% and a low standard deviation of 0.4416, indicating consistent predictions across the folds.
4. Bernoulli Naive Bayes: The evaluation results of the Bernoulli Naive Bayes model
indicated promising performance in detecting depression from user tweets. The
model achieved an accuracy of 90.09%, a precision of 90.12%, a recall of 90.01%, and
an F1 score of 90.07%.
The high accuracy suggests that the model was able to make correct predictions for
a significant proportion of the instances in the evaluation set. The precision value
indicates a strong ability to correctly identify tweets indicative of depression, while
the recall value demonstrates the model’s capability to capture a large portion of the
actual positive instances.
The Bernoulli Naive Bayes model was evaluated using a 10-fold cross-validation approach. The mean accuracy across the 10 folds was 90%, with a standard deviation of 0.0091. The confusion matrix is shown in Table 6 below, and a minimal sketch of this cross-validation setup follows.
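To make the evaluation procedure concrete, the following is a minimal sketch of 10-fold cross-validation of a Bernoulli Naive Bayes classifier with scikit-learn. The vectorizer settings and the input file (carried over from the illustrative re-labeling sketch) are assumptions, not the study's exact pipeline.

# Illustrative 10-fold CV for Bernoulli Naive Bayes on re-labeled tweets.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("relabelled_tweets.csv")  # hypothetical input file
texts, labels = df["Tweet"], df["Label"]

# BernoulliNB binarizes its input features internally (binarize=0.0 by default).
model = make_pipeline(TfidfVectorizer(max_features=50_000), BernoulliNB())
scores = cross_val_score(model, texts, labels, cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.4f}, std: {scores.std():.4f}")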
5. Random Forests: The random forests model achieved an accuracy of 0.949 on the evaluation set, indicating that roughly 95% of the instances were correctly classified. Its precision was 0.964, meaning about 96% of the tweets predicted as indicating depression were actual positive instances, and its recall (or sensitivity) was 0.933, meaning the model captured about 93% of all actual positive instances. The F1 score, which balances precision and recall, was 95%. To gain further insights into the model's performance, a confusion matrix was constructed.
The confusion matrix, as seen in Table 7, provides a detailed breakdown of the model’s
predictions by comparing them to the actual class labels.
The random forests model demonstrated strong performance in the cross-validation
process, achieving a mean accuracy of 92.7% with a low standard deviation of 0.0017.
This indicates consistent and reliable predictions across the different folds of the
dataset, highlighting the model’s robustness in detecting the target variable.
6. RoBERTa: Owing to compute constraints, the RoBERTa model was trained on about 50% of the dataset (300,000 entries) using the aforementioned methodology.
After training, the model’s performance was evaluated using various metrics to assess its
effectiveness in detecting depression from user tweets. Additionally, the confusion matrix
in Table 8 visually demonstrates the model’s ability to discriminate between depressive
and non-depressive tweets. The cross-validation result for the RoBERTa model yielded a
mean accuracy of 96.842% with a standard deviation of 0.847, showcasing its robust and
consistent performance in detecting the desired outcome.
The evaluation results demonstrated the promising performance of the RoBERTa model. It achieved an accuracy of 0.981, indicating that 98.1% of the tweets were correctly classified as indicative of depression or not, with an evaluation loss of 0.2016. The precision, recall, and F1 scores were 0.99, 0.99, and 0.98, respectively, indicating a good balance between correctly predicted positive instances and overall performance. The loss and accuracy curves tracing the training history of the model are shown in Figures 17 and 18, and a condensed fine-tuning sketch is given below.
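For illustration, the following is a condensed sketch of RoBERTa fine-tuning with the Hugging Face transformers and datasets libraries; the hyperparameters, file name, and column names are assumptions rather than the study's reported settings.

# Condensed RoBERTa fine-tuning sketch for binary depression classification.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

ds = load_dataset("csv", data_files="relabelled_tweets.csv")["train"]
ds = ds.rename_column("Label", "labels").train_test_split(test_size=0.2)

def tokenize(batch):
    return tokenizer(batch["Tweet"], truncation=True, padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="roberta-depression", num_train_epochs=2,
                         per_device_train_batch_size=32, learning_rate=2e-5)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], compute_metrics=accuracy)
trainer.train()
print(trainer.evaluate())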
7. DeBERTa: The DeBERTa model was also trained on about 50% of the dataset using the chosen methodology. After training, the model achieved an accuracy of 98% on the evaluation set, indicating that it distinguished depressive from non-depressive tweets with a high level of reliability. The confusion matrix, as seen in Table 9, provides a detailed breakdown of the model's predictions by comparing them to the actual class labels.
The DeBERTa model demonstrated strong performance during cross-validation,
achieving an average accuracy of 95.688% with a low standard deviation of 0.504.
This indicates consistent and reliable predictions across different subsets of the data,
highlighting the model’s robustness in detecting the target variable.
Furthermore, the loss value obtained for the DeBERTa model was 0.3438. The loss function measures the discrepancy between the predicted and actual values during training, with lower values indicating a better fit of the model to the data. The training process was monitored by tracking the history curves of accuracy and loss: the accuracy curve depicts the improvement of the model's accuracy over each training epoch, while the loss curve represents the decrease in the loss value over the same period. These curves provide insights into the model's learning progress and convergence, as shown in Figures 19 and 20.
The results and evaluation of the various models in this study demonstrated their
effectiveness in detecting depression from user tweets. The logistic regression, Bernoulli
Naive Bayes, and random forests models showcased strong performance with high accuracy
and precise predictions. Moreover, the advanced transformer-based models, DistilBERT,
SqueezeBERT, RoBERTa, and DeBERTa, exhibited exceptional accuracy and robustness,
showcasing their potential for accurate and nuanced depression detection. These findings
highlight the significance of leveraging machine learning algorithms and advanced models
in analyzing social media data for mental health monitoring and support.
5. Conclusions
In conclusion, this paper presented a comprehensive approach for detecting depression
from user tweets using machine learning techniques. The study utilized a large dataset
consisting of 632,000 tweets, and a series of preprocessing steps were applied, including
feature selection and data cleaning. By employing logistic regression, Bernoulli Naive Bayes,
random forests, DistilBERT, SqueezeBERT, RoBERTa, and DeBERTa models, we achieved
promising results (Table 10) in accurately identifying tweets indicative of depression.
The evaluation of the models revealed the effectiveness and robustness of the method-
ology. The RoBERTa model demonstrated exceptional performance, achieving a mean
accuracy of 0.97 (across 10 cross-validation folds) and an accuracy of 98.1% on the eval-
uation set. These high accuracies indicate the model’s ability to accurately distinguish
between tweets associated with depression and those that are not.
The findings of this study have significant implications for mental health analysis
using social media data. By leveraging machine learning techniques, we have demon-
strated the potential for the early detection and monitoring of depression through user
tweets. This approach has the advantage of being scalable and accessible, as social media
platforms provide a vast amount of user-generated data that can be leveraged for mental
health insights.
It is important to note that this study is just the beginning in the field of depression
detection from user tweets. Further research can explore additional feature engineering
techniques, investigate other machine learning algorithms, and explore the transferability
of models to different social media platforms and languages. Moreover, combining textual
information with other contextual data, such as user demographics or temporal patterns,
can enhance the accuracy and reliability of depression detection models.
Author Contributions: Conceptualization, B.G.B. and Q.L.; Methodology, B.G.B. and Q.L.; Software,
B.G.B.; Validation, B.G.B. and Q.L.; Formal analysis, B.G.B.; Investigation, B.G.B.; Resources, B.G.B.
and Q.L.; Data curation, B.G.B. and Q.L.; Writing—original draft, B.G.B. and Q.L.; Writing—review &
editing, B.G.B.; Visualization, B.G.B.; Supervision, B.G.B.; Project administration, B.G.B.; Funding
acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.
Funding: The research received no funding.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can
be found here: http://help.sentiment140.com/for-students (accessed on 2 October 2023).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lépine, J.P.; Briley, M. The increasing burden of depression. Neuropsychiatr. Dis. Treat. 2011, 7, 3–7. [PubMed]
2. WHO. Other Common Mental Disorders: Global Health Estimates; World Health Organization: Geneva, Switzerland, 2017; Volume 24.
3. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders: DSM-IV; American Psychiatric Association: Washington, DC, USA, 1994; Volume 4.
4. Nesi, J. The impact of social media on youth mental health: Challenges and opportunities. North Carol. Med. J. 2020, 81, 116–121.
[CrossRef] [PubMed]
5. Coppersmith, G.; Dredze, M.; Harman, C. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on
Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Baltimore, MD, USA, 27 June 2014;
pp. 51–60.
6. Hassani, H.; Beneki, C.; Unger, S.; Mazinani, M.T.; Yeganegi, M.R. Text mining in big data analytics. Big Data Cogn. Comput. 2020,
4, 1. [CrossRef]
7. Aggarwal, C.C. Neural Networks and Deep Learning: A Textbook; Springer: Berlin/Heidelberg, Germany, 2018; pp. 399–431.
8. Golrooy Motlagh, F. Novel Natural Language Processing Models for Medical Terms and Symptoms Detection in Twitter. Ph.D.
Thesis, Wright State University, Dayton, OH, USA, 2022.
9. De Choudhury, M.; Gamon, M.; Counts, S.; Horvitz, E. Predicting depression via social media. In Proceedings of the International AAAI Conference on Web and Social Media, Cambridge, MA, USA, 8–11 July 2013; Volume 7, pp. 128–137.
10. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed.; Pearson Education: London, UK, 2010.
11. Jain, S.; Pandey, K.; Jain, P.; Seng, K.P. Artificial Intelligence, Machine Learning, and Mental Health in Pandemics: A Computational
Approach; Academic Press: Cambridge, MA, USA, 2022.
12. Torous, J.; Bucci, S.; Bell, I.H.; Kessing, L.V.; Faurholt-Jepsen, M.; Whelan, P.; Carvalho, A.F.; Keshavan, M.; Linardon, J.; Firth, J.
The growing field of digital psychiatry: Current evidence and the future of apps, social media, chatbots, and virtual reality. World
Psychiatry 2021, 20, 318–335. [CrossRef] [PubMed]
13. Pyszczynski, T.; Holt, K.; Greenberg, J. Depression, self-focused attention, and expectancies for positive and negative future life
events for self and others. J. Personal. Soc. Psychol. 1987, 52, 994. [CrossRef] [PubMed]
14. Gkotsis, G.; Oellrich, A.; Hubbard, T.J.P.; Dobson, R.J.; Liakata, M.; Velupillai, S. Characterisation of mental health conditions in
social media using Informed Deep Learning. Sci. Rep. 2017, 7, 1–10. [CrossRef] [PubMed]
15. Resnik, P.; Garron, A.; Resnik, R. Detecting depression with lexical clues. In Proceedings of the Conference on Empirical Methods
in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1146–1151.
16. Li, A.; Jiao, D.; Zhu, T. Detecting depression stigma on social media: A linguistic analysis. J. Affect. Disord. 2018, 232, 358–362.
[CrossRef] [PubMed]
17. Kim, P. MATLAB Deep Learning: with Machine Learning, Neural Networks, and Artificial Intelligence; Apress: New York, NY, USA, 2017.
18. Tsugawa, S.; Kikuchi, Y.; Kishino, F.; Nakajima, K.; Itoh, Y.; Ohsaki, H. Recognizing depression from twitter activity. In Proceedings
of the 33rd Annual ACM Conference on Human Factors in Computing Systems, Seoul, Republic of Korea, 18–23 April 2015;
pp. 3187–3196.
19. Guntuku, S.C.; Yaden, D.B.; Kern, M.L.; Ungar, L.H.; Eichstaedt, J.C. Detecting depression and mental illness on social media: An
integrative review. Curr. Opin. Behav. Sci. 2017, 18, 43–49. [CrossRef]
20. Mylonas, P. Types of contextual information in the social networks era. In Proceedings of the 2016 11th International Workshop
on Semantic and Social Media Adaptation and Personalization (SMAP), Thessaloniki, Greece, 20–21 October 2016; pp. 46–52.
21. Nguyen, D.P.; Doğruöz, A.S.; Rosé, C.P.; de Jong, F.; Song, M. Exploring the relationship between social context and depression
detection in social media. In Proceedings of the 8th International Conference on Weblogs and Social Media, Ann Arbor, MI, USA,
1–4 June 2014; pp. 308–317.
22. Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report; Stanford University: Stanford, CA, USA, 2009.
Electronics 2023, 12, 4396 20 of 20
23. Yazdavar, A.H.; Al-Olimat, H.S.; Ebrahimi, M.; Bajaj, G.; Banerjee, T.; Thirunarayan, K.; Pathak, J.; Sheth, A. Semi-supervised
approach to monitoring clinical depressive symptoms in social media. In Proceedings of the 2017 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining 2017, Sydney, Australia, 31 July–3 August 2017; pp. 1191–1198.
24. Sarvani, K.; Priya, Y.S.; Teja, C.; Lokesh, T.; Rao, E.B.B. Rainfall Analysis and Prediction Using Machine Learning Techniques. J.
Eng. Sci. 2021, 12, 531–537.
25. Jo, T. Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning; Springer: Cham, Switzerland, 2021.
26. Alfarizi, M.I.; Syafaah, L.; Lestandy, M. Emotional Text Classification Using TF-IDF (Term Frequency–Inverse Document Frequency) and LSTM (Long Short-Term Memory). JUITA J. Inform. 2022, 10, 225–232. [CrossRef]
27. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: New York, NY, USA, 2013.
28. McCallum, A.; Nigam, K. A comparison of event models for naive bayes text classification. In Proceedings of the AAAI-98
Workshop on Learning for Text Categorization, Madison, WI, USA, 26–27 July 1998; Volume 752, pp. 41–48.
29. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
30. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly
Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
31. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2020, arXiv:2006.03654.
32. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv
2020, arXiv:1910.01108.
33. Iandola, F.N.; Shaw, A.E.; Krishna, R.; Keutzer, K.W. SqueezeBERT: What Can Computer Vision Teach NLP about Efficient Neural Networks? arXiv 2020, arXiv:2006.11316.
34. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
35. Shore, J.; Johnson, R. Properties of cross-entropy minimization. IEEE Trans. Inf. Theory 1981, 27, 472–482. [CrossRef]
36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.