Review (3) A Comprehensive Review On Email Spam Classification Using Machine Learning Algorithms
Charles Sturt University, Study Centre
Melbourne VIC 3000, Australia
[email protected]

College of Computer and Information Sciences
Al Imam Mohammad Ibn Saud Islamic University (IMSIU)
Riyadh 11432, Saudi Arabia
[email protected] [email protected]
Abstract—Email is the most widely used official communication method for business purposes. Email usage continues to increase despite the availability of other communication methods. Automated management of emails is important in today's context, as the volume of emails grows day by day. More than 55 percent of all emails are identified as spam. This spam consumes users' time and resources while generating no useful output. Spammers use sophisticated and creative methods to carry out their criminal activities through spam emails. Therefore, it is vital to understand the different spam email classification techniques and their mechanisms. This paper mainly focuses on spam classification approaches using machine learning algorithms. Furthermore, this study provides a comprehensive analysis and review of research on different machine learning techniques and the email features used in different machine learning approaches. It also provides future research directions and challenges in the spam classification field that can be useful for future researchers.

Index Terms—Spam Detection, Spam Classification, Spam Filter, E-mail, Supervised Learning, Machine Learning Algorithms, Email Classification, Spam Email Detection, Email Categorization, Email Feature Set Analysis, Spam Detection Using Machine Learning Algorithms.

I. INTRODUCTION
Email has become the most extensively used official communication mechanism for most internet users. In the past few years, the growth in email usage has escalated the problems caused by spam emails. Spam, or junk email, refers to the distribution of bulk unsolicited messages [3], whereas emails that are meaningful and of the opposite nature are called 'ham'. An average email user receives about 40-50 emails per day. Spammers earn around USD 3.5 million from spam every year, causing financial losses on both the personal and institutional fronts [4]. Because of spam, users spend a significant amount of working time on these emails. According to [1], spam accounts for more than 50 percent of email server traffic, transmitting a massive volume of unwanted and unsolicited bulk emails. These emails consume user resources without producing useful output, reducing productivity. Spam is propagated for marketing purposes and to carry out malicious criminal activities such as identity theft, financial disruption, theft of sensitive information and reputational damage. For these reasons, email management and the classification of spam emails are a vital necessity for organizations in order to increase their productivity and reduce financial losses.

A. Relevant Spam Statistics
In this subsection, we highlight some global statistics on spam and its financial impact. Some country-specific metrics for Australia are also discussed. As per [5], a little over 4 billion email accounts were in active use in 2020, and this number is projected to grow to 4.48 billion by 2024. This means that nearly half of the world population was actively using email in 2020. Spam accounted for 57.26 percent of total email traffic in 2019 [5]. This shows that more than half of the emails circulating worldwide are unwanted, unsolicited spam. For 2019, the FBI reported a global financial loss of $12.5 billion to businesses due to business email compromise and email account compromise as a result of spam and phishing [6]. These losses are expected to rise sharply in the coming years as email usage continues to grow. For the Australian context, Table I lists the statistics for digital scams in previous years [7] (Australian Competition and Consumer Commission, 2019). These statistics show that digital scams are a global issue and that Australia also has a significant problem with them. In addition, the data show that both the losses and the number of cases are trending upwards each year. Investment scams offer promising but fraudulent business opportunities in exchange for monetary investments. Dating scams focus on victimizing individuals who are looking for romantic partners on the internet. Email is the method most used by scammers to deliver malware and other fraudulent scams. The solutions currently in use often lag behind because of the innovativeness and problem-solving skills of the spammers. For these reasons, it is important to understand and develop systems that can detect spam emails and separate them from legitimate ham emails.
Authorized licensed use limited to: Central Michigan University. Downloaded on May 14,2021 at 06:41:36 UTC from IEEE Xplore. Restrictions apply.
TABLE I
TOP THREE DIGITAL SCAM LOSSES (AU$) AND FREQUENCY (NUMBER OF CASES) IN AUSTRALIA FROM 2017 – 2020. [1]
B. Naïve Bayes Machine Learning Algorithm
This is one of the most commonly used supervised machine learning algorithms. It is built on Bayes' rule, which derives the probability of an event from prior knowledge of conditions related to that event [12]. The approach is highly scalable, fast and easy to implement. The Naïve Bayes algorithm treats the features as independent of each other. It has been used in the system developed by [13] to address the problem of independence of random variables with 23 different classification rules. That system uses a decision tree along with Naïve Bayes to generate the expected outcome. The main drawback of this algorithm is that it relies on the assumption that the input features are 'completely independent of each other', which does not always hold in practice.

C. Support Vector Machine
Support Vector Machine (SVM) is another well-established and frequently used machine learning classification algorithm, described in [14]. Some systems use SVM alone as their classification algorithm, while other researchers combine SVM with other algorithms. The authors of [15] used a system with SVM and weighted SVM, where the weights reflect the importance of the different analysis categories ('classes'). According to the researchers, the weighted SVM algorithm achieves higher performance metrics. In the SVM algorithm, a hyperplane is created that separates the classes based on the features derived from the dataset. SVM can be applied in any number of dimensions: in two dimensions the separating boundary is a line, in three dimensions it is a plane, and in higher dimensions it is a hyperplane.

D. Decision Tree (DT)
The decision tree is another algorithm that has been commonly used in the reviewed supervised learning studies. It is popular because it is easy to use, explain and visualize. It can be used with both large and small datasets, and it can handle both numerical and categorical data [4]. In the system developed by [16], DT is used along with other algorithms, in the tier-three stage, for the binomial categorization of spam and ham emails. The model can classify spam in real time; DT contributes to this because its simple computational mechanism suits the efficiency that real-time processing requires.

IV. UNSUPERVISED LEARNING BASED MACHINE LEARNING ALGORITHMS
In this section, we focus on the unsupervised machine learning algorithms that have been used in the reviewed systems. The adoption of unsupervised machine learning algorithms is low compared to supervised machine learning algorithms. Two algorithms are used by the researchers.

A. K-nearest Neighbour Machine Learning Algorithm (KNN)
This algorithm is effective when there is noise in the input dataset. It can be used to generate both classification and regression outputs. The main drawback of this algorithm is that it is highly sensitive to outliers in the dataset. In addition, its computational cost is comparatively higher than that of other machine learning algorithms [4]. This may be the main reason that it has not been adopted more commonly in the reviewed studies.

B. K-means Clustering Machine Learning Algorithm
This algorithm has a straightforward implementation, and its computational cost is comparatively lower than that of the KNN algorithm. For these reasons, it is one of the commonly used unsupervised machine learning algorithms in the spam classification field [4]. In K-means clustering, the process starts with a first group of randomly selected centroids, one for each cluster. Repeated calculations are then carried out from those centroids to optimize their positions.

V. ANALYSIS AND FINDINGS
In this section, we focus on key insights and findings derived from the critically analyzed studies. We start with a basic overview of the testing approaches and the machine learning algorithms used by each of the reviewed papers. Table II shows the nature of the approaches taken by the studies and the distribution of the different machine learning algorithms used to classify spam and ham.

A. High Adoption of Supervised Based Approach
The pie chart in Figure 1 shows the distribution of the different approaches in the reviewed systems. High adoption of the supervised learning technique can be seen, with 54 percent of the selected sample. The next most adopted framework is the unsupervised approach. As Figure 1 suggests, the majority of researchers have adopted supervised learning as their first choice. This clearly indicates that there are considerable opportunities to expand research in the field of semi-supervised and unsupervised machine learning approaches.

B. Consistent and Higher Accuracy in Supervised Based Approach
The main objective of the majority of researchers is to increase the accuracy of spam detection in the developed systems. As the scatter plot in Figure 2 shows, the accuracy of the systems using the supervised learning model is distributed at a high level, within a tight range and with minimal variation. This shows that the outcomes are consistent with higher accuracy (average accuracy is above 90 percent for the supervised learning model).
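To make the supervised approach discussed above more concrete, the following is a minimal sketch of a Naïve Bayes spam classifier of the kind described in the Naïve Bayes subsection. It is not the implementation used in any of the reviewed studies. The whitespace tokenizer, the add-one (Laplace) smoothing, the function names and the four toy messages are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(messages, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(messages, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior: share of training messages belonging to this class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Independence assumption: per-word likelihoods are multiplied
            # (added in log space), with add-one smoothing.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, not taken from any reviewed dataset.
train_messages = [
    "win money now",
    "free prize claim now",
    "meeting agenda attached",
    "project status report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_messages, train_labels)

print(predict(model, "claim free money"))
print(predict(model, "status meeting"))
```

The per-word log-likelihoods are simply summed, which is where the feature-independence assumption enters. A real system would need a proper tokenizer, a much larger labelled corpus and a held-out test set before any accuracy figure could be reported.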
TABLE II
SUMMARY OF MACHINE LEARNING ALGORITHMS AND RESULTS OBTAINED

Columns: Multilayer Perceptron; Swarm Optimization; Decision Tree/C4.5; Maximum Entropy; Gradient Boosting; Semi-Supervised; Key word-based; Weighted SVM; P, F Score, R; Unsupervised; Naïve Bayes; Accuracy%; Supervised; K-means; KPCM; KFCM; ANN

Study   Accuracy%   Other metrics (P, F, R)
T1      88          R, F
T2      75
T3      91
T4      80          F, R
T5      71          P
T6      98          P
T7
T8      61
T9      82
T10     88
T11     93          P
T12     93          F

Sum     7 3 2 4 6 5 7 1 1 1 1 1 1 1 1 1
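As a small worked example, the accuracy values listed in Table II can be summarised with a few lines of Python. This sketch assumes that the bare number in each row is the Accuracy% column and that T7 reports no accuracy; the dictionary and variable names are illustrative. Because the rows mix supervised, semi-supervised and unsupervised systems, the resulting mean describes all twelve studies together and is not the supervised-only average quoted in the text.

```python
from statistics import mean, pstdev

# Accuracy (%) per study as listed in Table II; None marks a missing value.
accuracy_by_study = {
    "T1": 88, "T2": 75, "T3": 91, "T4": 80, "T5": 71, "T6": 98,
    "T7": None, "T8": 61, "T9": 82, "T10": 88, "T11": 93, "T12": 93,
}

# Keep only the studies that actually report an accuracy figure.
reported = [value for value in accuracy_by_study.values() if value is not None]

summary = {
    "studies_with_accuracy": len(reported),
    "mean": mean(reported),
    "population_std": pstdev(reported),  # spread across all reported studies
    "min": min(reported),
    "max": max(reported),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Splitting the same calculation by learning approach would require the per-study approach marks from the table.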
that there is a greater opportunity to develop spam detection systems that analyse email features such as attachments, the structure of the email and spam hyperlinks added to emails. These areas have been approached by only a few current researchers, leaving them as high-opportunity areas for future studies.

F. Analysing the Email Content
Almost all the studies reviewed in this paper have developed their systems based on analysing the content, such as checking for keywords or phrases that are pre-identified as typical of spam emails. This approach is reasonable and logical to a certain extent in the current context. However, as explained in the previous section, spammers adopt new techniques day by day, so spamming techniques keep evolving. Hence, this traditional approach will not be adequate in the future, as spammers will find creative and efficient techniques to evade this detection method. Table IV provides a summary of the machine learning algorithms.

VI. CHALLENGES AND FUTURE RESEARCH DIRECTIONS
In this section, we highlight several research challenges and open issues identified during the review process. Future research work that is needed to enhance the performance of spam email classification in different application areas and features is presented below.

A. Real Time Spam Classification
The majority of the reviewed studies are based on datasets that do not include real-time environment features and elements. Therefore, these studies are unable to classify spam emails in real time. Only one of the studies in the review database could detect spam in real time. As we live in a fast-paced environment, it is essential to work in real time. This is one of the major untapped directions that future studies need to focus on. Streaming and analysing emails online for spam is more complicated and advanced than what the existing studies address; therefore, there will be considerable research challenges to address when developing systems that cater to these requirements.

B. Dynamic Feature Updates
Another area that future studies need to focus on is the addition and removal of spam detection features without rebuilding or restructuring the entire system. As the techniques used by spammers change and develop day by day, it is essential to have a detection system that can be updated easily with minimum time and resource consumption. Most of the reviewed systems do not mention whether they can be updated easily, which suggests that this area received little attention in those studies.

C. Reduction of System Process Time
Reducing the processing and classification time of a spam classification system by using advanced hardware rather than traditional hardware is another area that future studies need to focus on. Real-time classification and user-centric email classification are the next generation of spam classification, and they demand more processing. Therefore, to reduce classification time, suitable advanced hardware technologies should be adopted. Future researchers should focus on this area in order to cater to future spam classification requirements.

TABLE III
EMAIL FEATURES ANALYZED BY ML ALGORITHMS
Columns: Article; Term Frequency; Hyper Links; Attachments; Structure; Header; Other; Body; BoW
Articles: A1 to A12
Sum: 3 7 8 3 3 1 3 1

D. Emphasis on Different Email Features
From the review, we found that the majority of the designed systems focus on detecting spam emails by analysing the bag of words (BoW) or the body of the email. However, spam can come in different formats, such as hyperlinks, images and attachments. Future studies should focus more on efficient methods to detect spam delivered in these forms.

E. Focus on Semi-Supervised and Unsupervised Approaches
From the review, we understood that most of the machine learning approaches are based on supervised learning techniques. Therefore, there are untapped opportunities to develop spam email classification systems using the other two approaches: semi-supervised and unsupervised.

F. Reducing the False Positive Rate
More research should be carried out to deliver outcomes with a reduced false positive rate. A false positive occurs when a legitimate email is marked as spam, which creates the risk of losing important emails. A well-designed system should have a low false positive rate. This is another area that future studies should focus on, apart from achieving higher accuracy.

TABLE IV
SUMMARY OF THE MACHINE LEARNING ALGORITHMS
Article | Machine Learning Algorithm | Results and Summary

A1
Machine Learning Algorithm: Naive Bayes (NB); SVM; Decision Tree (DT); Maximum Entropy (MaxEnt)
Results and Summary: Hybrid ensemble learning approach to detect review spam. Active semi-supervised learning approach. Dealing with duplicates precisely.

A2
Machine Learning Algorithm: DT with binary classification
Results and Summary: Fixed features improve ANN classification. Classifies spam emails from Arabic and English languages.

A3
Machine Learning Algorithm: DT with binary classification
Results and Summary: Recognizes the spam features more accurately. Detects a pattern of repetitive keywords in spam. Classifies spam based on structure such as Cc/Bcc, domain and header.

A4
Machine Learning Algorithm: C4.5; DT classifier; Multilayer perceptron; Naive Bayes classifier
Results and Summary: Filters spam from valid emails with low error rates and high efficiency using a multilayer perceptron model. Demonstrates higher efficiency than the NB classifier and J48, with a low rate of false positives.

A5
Machine Learning Algorithm: SVM; C4.5; Random forest
Results and Summary: Identifies spam using a feature representation that preserves class separability in a lower-dimensional space. Identifies spam when the feature size is small, with good generalization irrespective of the data source.

A6
Machine Learning Algorithm: SVM; NB and active learning; K-Nearest Neighbours (KNN) classifiers
Results and Summary: Reduces the time consumed by email classification while guaranteeing the accuracy. The proposed method performs better than other methods on the two corpora using the F1 measure.

A7
Machine Learning Algorithm: NB classifier; ML algorithm with a combination of semantic-based, keyword-based and machine learning in Python
Results and Summary: Effectively increases the accuracy of Naive Bayes and reduces false positives in detecting spam. Detects text modifications and correctly classifies the email as spam or ham. Spam detection with an error rate of 38 % and an accuracy of 62 %. Discovery of an exponential regression relationship between email length and spam score.

A8
Machine Learning Algorithm: Integrated particle swarm optimization based on decision tree algorithm with unsupervised filtering; SVM; K-means
Results and Summary: Spam and non-spam mails are classified with 98.32 % accuracy for the experimental

A9
Machine Learning Algorithm: Support Vector Machine (SVM)
Results and Summary: Evaluates the performance of non-linear SVM based classifiers with two different kernel functions. Compares the training and testing accuracy of the kernels to find out which kernel is better. Decision tree classifier requires more memory. System identifies spam emails from their contents. Spam can be blocked by the user and ham can be retained by the user.

A10
Machine Learning Algorithm: SVM; NB and active learning; K-nearest neighbours (KNN) classifiers; DT classifiers; SVM linear kernel; Gradient boosting (GB)
Results and Summary: Out of the analysed algorithms, SVM has the highest spam classification accuracy for both text and file classification.

VII. CONCLUSION
After a comprehensive analysis of the selected research studies, we have identified several research findings and observations, which are discussed in detail in the previous sections. In this section, we focus on the main findings and conclusions of the study. A high adoption rate of the supervised machine learning approach can be seen throughout the review. This approach is used mainly because it produces higher accuracy with less variation, giving highly consistent results. Apart from that, we found that certain algorithms, such as Naïve Bayes and SVM, are in high demand compared to other machine learning algorithms. Systems that combine multiple algorithms are more common than single-algorithm systems, as they deliver better outcomes. Researchers have focused mostly on email features such as BoW and body text, which creates future research opportunities to develop systems that detect spam based on other email features.

REFERENCES
[1] "Global spam volume as percentage of total e-mail traffic from January 2014 to September 2019, by month." [Online]. Available: https://www.statista.com/statistics/420391/spam-email-traffic-share/.
[2] T. Ouyang, S. Ray, M. Allman, and M. Rabinovich, "A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise," Elsevier, vol. 2015, pp. 101–102.
[3] O. Saad, A. Darwish, and R. Faraj, "A survey of machine learning techniques for Spam filtering," IJCSNS Int. J. Comput. Sci. Netw. Secur.
[4] K. Asif, A. Sami, S. Bharindhan, and K. Krishan, "A Comprehensive Survey for Intelligent Spam Email Detection," IEEE Xplore, 2019.
[5] "Number of e-mail users worldwide from 2017 to 2024." [Online]. Available: https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/.
[6] M. Guntrip, [Online]. Available: https://www.proofpoint.com/us/corporate-blog/post/fbi-reports-125-billion-global-financial-losses-due-business-email-compromise.
[7] "Australian Competition and Consumer Commission," Scam Stat., [Online]. Available: https://www.scamwatch.gov.au/scam-statistics?scamid=all&date=2018.
[8] K. Jackowski, B. Krawczyk, and M. Woźniak, "Application of adaptive splitting and selection classifier to the spam filtering problem," Cybern. Syst. An Int. J.
[9] Sathya and A. Abraham, "Comparison of supervised and unsupervised learning algorithms for pattern classification," ResearchGate.
[10] F. Qian, Y. C. H. Abhinav Pathak, Z. M. Mao, and Y. Xie, "A case for unsupervised-learning-based spam filtering," Univ. Minnesota J., 2010.
[11] Y. Alamlahi and A. Muthana, An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems. Modern Education and Computer Science Press, 2018.
[12] L. Melian and A. Nursikuwagus, "Prediction student eligibility in vocation school with Naïve-Byes decision algorithm," 2018.
[13] A. S. Aski and N. K. Sourati, "Proposed efficient algorithm to filter spam using machine learning techniques," Elsevier, vol. 2016, pp. 145–149.
[14] K. Pawar and M. Patil, "Pattern classification under attack on spam filtering," IEEE Xplore, 2015.
[15] V. V. and A. K. Rajan, "An Improved Spam Detection Method With Weighted Support Vector Machine," IEEE Xplore.
[16] H. Kaur and A. Sharma, "Improved Email Spam Classification Method Using Integrated Particle Swarm Optimization and Decision Tree," IEEE Xplore, vol. 2016, pp. 516–521.