Review (3) A Comprehensive Review On Email Spam Classification Using Machine Learning Algorithms

This document provides a review of machine learning algorithms for classifying email spam. It discusses how over 55% of total emails are spam, costing businesses billions annually in lost productivity and finances. The paper focuses on analyzing different machine learning techniques and email features used for spam classification, and provides directions for future research challenges.

A Comprehensive Review on Email Spam Classification using Machine Learning Algorithms


Mansoor RAZA and Nathali Dilshani Jayasinghe
School of Computing and Mathematics
Charles Sturt University, Study Centre
Melbourne VIC 3000, Australia
[email protected]
[email protected]

Muhana Magboul Ali Muslam
Department of Information Technology
College of Computer and Information Sciences
Al Imam Mohammad Ibn Saud Islamic University (IMSIU)
Riyadh 11432, Saudi Arabia
[email protected]

2021 International Conference on Information Networking (ICOIN) | 978-1-7281-9101-0/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICOIN50884.2021.9334020

Abstract—Email is the most widely used method of official communication for business purposes. Email usage continues to increase despite the availability of other communication methods. Automated management of emails is important today, as the volume of emails grows day by day. More than 55 percent of all emails are identified as spam. Spam therefore consumes users' time and resources while generating no useful output. Spammers use sophisticated and creative methods to carry out criminal activities through spam emails. Therefore, it is vital to understand the different spam email classification techniques and their mechanisms. This paper focuses mainly on spam classification approaches that use machine learning algorithms. Furthermore, this study provides a comprehensive analysis and review of research on different machine learning techniques and the email features used in different machine learning approaches. It also provides future research directions and the challenges in the spam classification field that can be useful for future researchers.

Index Terms—Spam Detection, Spam Classification, Spam Filter, E-mail, Supervised Learning, Machine Learning Algorithms, Email Classification, Spam Email Detection, Email Categorization, Email Feature Set Analysis, Spam Detection Using Machine Learning Algorithms.

I. INTRODUCTION

Email has become the most extensively used official communication mechanism for most internet users. In the past few years, the growth in email usage has escalated the problems caused by spam emails. Spam, or junk email, refers to the distribution of bulk unsolicited messages [3], whereas emails of the opposite nature, which are meaningful to the recipient, are called 'ham'. An average email user receives about 40-50 emails per day. Spammers earn around USD 3.5 million from spam every year, causing financial losses on both the personal and institutional fronts [4]. As a result, users spend a significant amount of working time on these emails. According to [1], spam accounts for more than 50 percent of email server traffic, transmitting a massive volume of unwanted and unsolicited bulk emails. These emails consume user resources with no useful output, reducing productivity. Spam is propagated for purposes ranging from marketing to malicious criminal activities such as identity theft, financial disruption, theft of sensitive information and reputational damage. For these reasons, email management and the classification of spam emails are a vital necessity for organizations in order to increase their productivity and reduce financial losses.

A. Relevant Spam Statistics

In this subsection, we highlight some global statistics on spam and its financial impact. Some country-specific metrics for Australia are also discussed. As per [5], a little over 4 billion email accounts were actively in use in 2020, and this number is projected to grow to 4.48 billion by 2024. This means that nearly half of the world's population was actively using email in 2020. Spam accounted for 57.26 percent of total email traffic in 2019 [5]. This shows that more than half of all emails sent around the world are unwanted, unsolicited spam. For 2019, the FBI reported a global financial loss of $12.5 billion incurred by businesses due to business email compromise and email account compromise as a result of spam and phishing [6]. These losses are expected to rise sharply in the coming years as email usage continues to grow. For the Australian context, Table I lists the statistics for digital scams in previous years [7] (Australian Competition and Consumer Commission, 2019). These statistics show that digital scams are a global issue and that Australia also has a significant problem with them. In addition, the data demonstrate that both the losses and the number of cases are trending upwards each year. Investment scams offer promising but fraudulent business opportunities in exchange for monetary investments. Dating scams target individuals who are looking for romantic partners on the internet. Email is the method most used by scammers for delivering malware and other fraudulent scams. Current solutions often lag behind because of the inventiveness and problem-solving skills of spammers. For these reasons, it is important to understand and develop systems that can detect spam emails and separate them from legitimate ham emails.

978-1-7281-9101-0/21/$31.00 ©2021 IEEE 327 ICOIN 2021

Authorized licensed use limited to: Central Michigan University. Downloaded on May 14,2021 at 06:41:36 UTC from IEEE Xplore. Restrictions apply.
TABLE I
TOP THREE DIGITAL SCAM LOSSES (AU$) AND FREQUENCY (NUMBER OF CASES) IN AUSTRALIA FROM 2017 – 2020. [1]

Year  Total Loss    Digital Scam                  Amount       Digital Scam     Frequency
2017  $90,801,407   Investment Scams              $31,326,476  Phishing         26,386
                    Dating and Romance            $20,530,578  Identity Theft   15,703
                    Other business & employment   $5,270,948   False billing    13,455
2018  $107,001,471  Investment Scams              $38,846,635  Phishing         24,291
                    Dating and Romance            $24,648,024  Threats to Life  19,455
                    False Billing                 $5,512,502   Identity Theft   12,800
2019  $142,898,217  Investment Scams              $61,813,801  Phishing         25,170
                    Dating and Romance            $28,606,215  Threats to Life  13,375
                    False Billing                 $10,110,753  Identity Theft   11,373
2020  $52,971,358   Investment Scams              $20,650,486  Phishing         10,689
                    Dating and Romance            $14,708,686  Threats to Life  4,255
                    False Billing                 $4,378,559   Identity Theft   4,237

II. LITERATURE REVIEW

As we are reviewing spam detection systems that use Machine Learning (ML) algorithms, it is important to review the history of ML in the field and the different algorithms currently used to classify spam. Researchers have pointed out that the content and operational mechanisms of spam emails change over time. Therefore, techniques that work now may not be useful soon. This phenomenon is known as concept drift [8]. Machine learning is the engineering approach formulated to enable computational instruments to act without being explicitly programmed. This approach is a huge boon for detecting and tackling spam because of an ML system's ability to evolve over time, minimizing the effect of concept drift. In the following subsections, we discuss a number of ML techniques, approaches and algorithms and their associated benefits under the supervised, unsupervised and semi-supervised machine learning approaches.

A. Supervised Machine Learning Algorithms

Supervised machine learning algorithms learn from a set of pre-labelled data in which the expected outputs for the corresponding inputs have already been given [9]. The algorithm learns gradually from the labelled data provided and eventually builds its own probabilistic mapping to apply to new inputs. This technique has two subtypes, regression and classification [9]. It is mostly used to generate outputs that are categorical in nature; in this context, spam and ham are the two categories.

B. Unsupervised Machine Learning Algorithms

As the name suggests, in this technique there is no labelled data or explicit instruction with which to pre-train the designed model. These systems are therefore not given labelled training examples. Instead, the algorithm analyses the dataset and identifies the common characteristics, structures and features of a group. It then rearranges the output data according to the structure or pattern found [10]. The output can be organized in different forms, such as clustering, anomaly detection, association and autoencoders.

C. Semi-supervised Machine Learning Algorithms

In this approach the system is trained with both labelled and unlabelled data in the training phase, and the analysis is carried out using both techniques. The main objective of this approach is to achieve better accuracy and precision than the traditional supervised and unsupervised approaches. It has two types of output: semi-supervised clustering and semi-supervised classification.

All the selected research papers are categorized using the above approaches so that the analysis can be carried out effectively. The categorization details and their analysis are presented in the Analysis and Findings section. In the following sections we focus on the different machine learning algorithms used in the reviewed studies, analyzed after categorizing them under the approaches discussed above.

III. SUPERVISED LEARNING BASED MACHINE LEARNING ALGORITHMS

This section discusses the ML algorithms used in the supervised approach. A system may use one or several ML algorithms to achieve the expected performance measurements.

A. Artificial Neural Network (ANN)

The authors of [11] used an ANN approach to classify spam and ham emails. The model is based on thirteen pre-labelled, fixed email features that are associated with spam emails. An ANN is built from artificial neurons, hence the name. The number of artificial neurons used in the system can vary and depends on the requirements of the system. These neurons are organized into layers: an input layer, hidden layers and an output layer. ANN systems 'learn' through a process named 'back-propagation'. The output produced by the network is compared with the ideal output that should have been produced. The difference is taken into account to adjust the weights of the connections between neurons over many iterations [4].
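The back-propagation loop described for the ANN can be illustrated with a minimal sketch. It is not the model of [11]: the two input features, the hidden-layer size, the learning rate and the toy data are all assumptions chosen only to keep the example small, and the function name `train_tiny_ann` is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_tiny_ann(samples, labels, hidden=3, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer network with plain back-propagation.

    Returns (predict, losses): a prediction function and the mean squared
    error recorded after each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0])
    # One weight per input plus a bias, for every hidden neuron and the output.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_h]
        y = sigmoid(sum(w * v for w, v in zip(w_o, h + [1.0])))
        return h, y

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, t in zip(samples, labels):
            h, y = forward(x)
            err = y - t  # difference between produced and ideal output
            total += err * err
            # Error signal at the output, then propagated back to the hidden layer.
            d_out = err * y * (1 - y)
            d_hid = [d_out * w_o[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            # Adjust the connection weights by gradient descent.
            for j, v in enumerate(h + [1.0]):
                w_o[j] -= lr * d_out * v
            for j in range(hidden):
                for i, v in enumerate(x + [1.0]):
                    w_h[j][i] -= lr * d_hid[j] * v
        losses.append(total / len(samples))

    return (lambda x: forward(x)[1]), losses


# Hypothetical toy data: [share of spam keywords, scaled link count]; 1 = spam, 0 = ham.
samples = [[0.9, 0.8], [0.8, 0.6], [0.1, 0.0], [0.2, 0.1]]
labels = [1, 1, 0, 0]
predict, losses = train_tiny_ann(samples, labels)
print("first-epoch loss:", losses[0], "last-epoch loss:", losses[-1])
```

The loss recorded per epoch makes the 'many iterations' visible: each pass nudges the weights in the direction that reduces the gap between the produced and the ideal output.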

B. Naïve Bayes Machine Learning Algorithm

This is one of the most commonly used supervised machine learning algorithms. It is built on Bayes' rule, which derives the probability of an event from prior knowledge of conditions related to that event [12]. The approach is highly scalable, fast and easy to implement. The Naïve Bayes algorithm treats the features as independent of each other. It was used in the system developed by [13] to address the problem of independence of random variables, with 23 different classification rules. That system uses a decision tree along with Naïve Bayes to generate the expected outcome. The main drawback of this algorithm is that it assumes the input features are 'completely independent of each other', which is not always the case in practice.

C. Support Vector Machine

Support Vector Machine (SVM) is another well-established and frequently used machine learning classification algorithm, discussed in [14]. Some systems use SVM alone as their classification algorithm, while other researchers combine SVM with other algorithms. The authors of [15] used a system with SVM and weighted SVM, in which the weights reflect the importance of the different analysis categories ('classes'). According to the researchers, the weighted SVM algorithm achieves higher performance metrics. In the SVM algorithm, a hyperplane is constructed that separates the classes, based on the features derived from the dataset. SVM can be applied in any number of dimensions: in two dimensions the separating boundary is a line, in three dimensions it is a plane, and in higher dimensions it is a hyperplane.

D. Decision Tree (DT)

The decision tree is another algorithm commonly used in the reviewed supervised learning studies. It is popular because it is easy to use, explain and visualize. It can be used with both large and small datasets and can handle both numerical and categorical data [4]. The system developed by [16] uses DT along with other algorithms. DT is used in the third tier for binary categorization of spam and ham emails. The model could classify spam in real time; DT contributes to this feature because its simple computational mechanism suits efficient real-time processing.

IV. UNSUPERVISED LEARNING BASED MACHINE LEARNING ALGORITHMS

This section focuses on the unsupervised machine learning algorithms used in the reviewed systems. The adoption of unsupervised machine learning algorithms is low compared to supervised machine learning algorithms. Two algorithms are used by the researchers.

A. K-nearest Neighbour machine learning algorithm (KNN)

This algorithm is effective when there is noise in the input dataset. It can be used to generate both classification and regression outputs. Its main drawback is that it is highly sensitive to outliers in the dataset. In addition, its computational cost is comparatively higher than that of other machine learning algorithms [4]. This may be the main reason that it has not been adopted more commonly in the reviewed studies.

B. K-means Clustering machine learning algorithm

This algorithm has a straightforward implementation, and its computational cost is lower than that of the KNN algorithm. For these reasons it is one of the most commonly used unsupervised machine learning algorithms in the spam classification field [4]. In K-means clustering, the process starts with a first group of centroids selected at random, one for each cluster. Repeated calculations are then carried out starting from these centroids to optimize their positions.

V. ANALYSIS AND FINDINGS

In this section we focus on some key insights and findings derived from the critically analyzed studies. We start with a basic overview of the approaches and the machine learning algorithms used by each of the reviewed papers. Based on the information in Table II, we can see the nature of the approaches taken by the studies and the distribution of the different machine learning algorithms used to classify spam and ham.

A. High adoption of Supervised Based Approach

The pie chart in Fig. 1 shows the distribution of the different approaches in the reviewed systems. The supervised learning technique is highly adopted, accounting for 54 percent of the selected sample. The next most adopted framework is the unsupervised approach. As Fig. 1 suggests, the majority of researchers have adopted supervised learning as their first choice. This clearly signifies that there is considerable opportunity to expand research in the field of semi-supervised and unsupervised machine learning approaches.

B. Consistent and Higher accuracy in Supervised Based Approach

The main objective of the majority of the researchers is to increase the accuracy of spam detection in the developed systems. As the scatter plot in Fig. 2 shows, the accuracy of systems using the supervised learning model is distributed at a high level, within a tight range and with minimal variation. This shows that the outcomes are consistent, with high accuracy (the average accuracy is above 90 percent for the supervised learning model).

TABLE II
SUMMARY OF MACHINE LEARNING ALGORITHMS AND RESULTS OBTAINED

Column groups
Model Types: Supervised, Semi-Supervised, Unsupervised
Primary Algorithms Used: Support Vector Machine, Multilayer Perceptron, Swarm Optimization, Decision Tree/C4.5, Maximum Entropy, Gradient Boosting, Key word-based, Weighted SVM, Naïve Bayes, K-means, KPCM, KFCM, ANN
Result: Accuracy %, P, F Score, R

Techniques  Accuracy %  P / F Score / R
T1          88          R, F
T2          75
T3          91
T4          80          F, R
T5          71          P
T6          98          P
T7
T8          61
T9          82
T10         88
T11         93          P
T12         93          F
Sum 7 3 2 4 6 5 7 1 1 1 1 1 1 1 1 1
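The accuracy column of Table II can be summarized with a short calculation. The sketch below pools the eleven accuracy values listed in the table across all model types, so it is an overall summary and not the supervised-only average quoted in the text; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Accuracy (%) per technique, copied from the Accuracy % column of Table II.
# T7 is omitted because the table lists no accuracy value for it.
table2_accuracy = {
    "T1": 88, "T2": 75, "T3": 91, "T4": 80, "T5": 71, "T6": 98,
    "T8": 61, "T9": 82, "T10": 88, "T11": 93, "T12": 93,
}

values = list(table2_accuracy.values())
summary = {
    "count": len(values),
    "mean": round(mean(values), 1),
    "population_std": round(pstdev(values), 1),
    "min": min(values),
    "max": max(values),
}
print(summary)
```

Splitting the same calculation by model type would show how much tighter the supervised results are than the others, which is the comparison made in finding B.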

Fig. 1. Distribution of types of framework adopted.

From findings A and B, we can conclude that there is a huge opportunity for future researchers to develop systems using semi-supervised and unsupervised frameworks. In addition, the performance benchmark set by existing systems using these approaches is very low compared to the supervised framework, meaning there is more room to achieve higher-percentage outcomes.

C. Algorithm Preference

The bar chart in Fig. 3 illustrates the distribution of the machine learning algorithms used in the reviewed papers. The majority of the developed systems use the SVM machine learning algorithm. Some studies use their own advanced algorithms derived from a main ML algorithm; for the purpose of the above analysis, these are counted under the original ML algorithm. They can be analyzed further in future studies.

D. Using Single Algorithm vs. Multi Algorithm Framework

As shown in Table II, most of the systems (83 percent) use a combination of different ML algorithms in order to achieve better results. Out of twelve studies, only two systems use a single ML algorithm. All the other studies combine two or more algorithms.

Fig. 2. Usage of number of ML algorithms in system.

E. Email Features Analysed using ML algorithms

Table III shows the different email features analysed in the reviewed articles. The majority of the developed systems focus on detecting spam using the body of the email and the BoW. BoW (bag of words) is the approach of analysing the email using the keywords, phrases and text in the email. From this we can identify that there is a significant opportunity to develop spam detection systems that analyse email features such as attachments, the structure of the email and spam hyperlinks added to emails. These areas have been approached by only a few current researchers, leaving them as high-opportunity areas for future studies.

TABLE III
EMAIL FEATURES ANALYZED BY ML ALGORITHMS

Email features analyzed by ML Algorithms: Term Frequency, Hyper Links, Attachments, Structure, Header, Other, Body, BoW
Articles: A1 to A12
Sum 3 7 8 3 3 1 3 1

F. Analysing the Email Content

Almost all the studies reviewed in this paper have developed systems based on analysing content, such as checking for keywords or phrases that are pre-identified as typical of spam emails. This approach is reasonable and logical in the current context. However, as explained in the previous sections, spammers adopt new techniques day by day, so spamming techniques keep evolving. Hence, this traditional approach will not be adequate for detecting spam in the future, as spammers will find creative and efficient techniques to evade this detection method. Table IV provides a summary of the machine learning algorithms.

VI. CHALLENGES AND FUTURE RESEARCH DIRECTIONS

In this section, we highlight several research challenges and open issues identified during the review process. The subsections below present future research work that is yet to be done to enhance the performance of spam email classification across different application areas and features.

A. Real Time Spam Classification

The majority of the reviewed studies are based on datasets that do not include real-time environment features and elements. Therefore, these studies are unable to classify spam emails in real time. Only one of the studies in the review database could detect spam in real time. As we live in a fast-paced environment, it is essential to work in real time. This is one of the major untapped directions that future studies need to focus on. Streaming and analysing emails online for spam is more complicated and advanced than what existing studies address; therefore, there will be substantial research challenges to address when developing systems that cater to these requirements.

B. Dynamic Feature Updates

Another research area that future studies need to focus on is the addition and removal of spam detection features without rebuilding or restructuring the entire system. As the techniques used by spammers change and develop day by day, it is essential to have a detection system that can be updated easily with minimum time and resource consumption. Most of the reviewed systems do not mention whether they can be updated easily, which suggests that this area received little attention in those studies.

C. Reduction of System Process time

Reducing the processing and classification time of a spam classification system by using advanced hardware rather than traditional hardware is another area that future studies need to focus on. Real-time classification and user-centric email classification are the next generation of spam classification, and they require more processing time. Therefore, to reduce classification time, suitable advanced hardware technologies should be adopted. Future researchers should focus on this area in order to cater to future spam classification requirements.

D. Emphasis on different email features

From the review we found that the majority of the designed systems focus on detecting spam emails by analysing the BoW or the body of the email. But spam can come in different formats, such as hyperlinks, images and attachments. Future studies should focus more on efficient methods to detect spam delivered in these forms.

E. Focus on Semi Supervised and Unsupervised approach

From the review conducted, we understand that most of the machine learning approaches are based on supervised learning techniques. Therefore, there are untapped opportunities to develop spam email classification systems using the other two approaches: semi-supervised and unsupervised.

F. Reducing the false positive rate

More research should be carried out to deliver outcomes with a reduced false positive rate. A false positive is a legitimate email marked as spam, which carries the risk of losing important emails in the process. A well-designed system should have a low false positive rate. This is another area that future studies should focus on, in addition to achieving a higher accuracy rate.

TABLE IV
SUMMARY OF THE MACHINE LEARNING ALGORITHMS

Article  Machine Learning Algorithm / Results and Summary

A1   Algorithms: Naive Bayes (NB); SVM; Decision Tree (DT); Maximum Entropy (MaxEnt)
     Results: Hybrid ensemble learning approach to detect review spam. Active semi-supervised learning approach. Dealing with duplicates precisely.

A2   Algorithms: DT with binary classification
     Results: Fixed features improve ANN classification. Classifies spam emails from Arabic and English languages.

A3   Algorithms: DT with binary classification
     Results: Recognizes the spam features more accurately. Detects a pattern of repetitive keywords in spam. Classifies spam based on structure such as Cc/Bcc, domain and header.

A4   Algorithms: C4.5; DT classifier; Multilayer perceptron; Naive Bayes classifier
     Results: Filters spam from valid emails with low error rates and high efficiency using a multilayer perceptron model. Demonstrates higher efficiency than NB classifier and J48, with a low rate of false positives.

A5   Algorithms: SVM; C4.5; Random forest
     Results: Identifies spam using feature representation preserving class separability and lower dimensional space. Identifies spam when feature size is small, with good generalization irrespective of the data source.

A6   Algorithms: SVM; NB and active learning; K-Nearest Neighbours (KNN) classifiers
     Results: Reduces the time consumed by email classification while guaranteeing the accuracy. Proposed method performs better than other methods on the two corpuses by using F1 measurement.

A7   Algorithms: NB classifier; ML algorithm with combination of semantic-based, key-word based and machine learning in Python
     Results: Effectively increases the accuracy of Naive Bayes and reduces false positives in detecting spam. Detects text modifications and correctly classifies the email as spam or ham. Spam detection with an error rate of 38 % and an accuracy of 62 %. Discovery of an exponential regression relationship between email length and spam score.

A8   Algorithms: Integrated particle swarm optimization based on decision tree algorithm with unsupervised filtering; SVM; K means
     Results: Spam and non-spam mails are classified with 98.32 % accuracy for the experimental

A9   Algorithms: Support Vector Machine (SVM)
     Results: Evaluates the performance of non-linear SVM based classifiers with two different kernel functions. Compares the training and testing accuracy of the kernels to find out which kernel is better. Decision tree classifier requires more memory. System identifies the spam emails from their contents. Spam can be blocked by the user and ham can be retained by the user.

A10  Algorithms: SVM; NB and active learning; K-nearest neighbours (KNN) classifiers; DT classifiers; SVM linear kernel; Gradient boosting (GB)
     Results: Out of the analysed algorithms, SVM for both text and file classification has the highest spam classification accuracy.

A11  Algorithms: Modified K-Means; NB classification
     Results: Improved classification accuracy, 96 % precision in detection. Decreased the number of iteration steps.

A12  Algorithms: SVM linear kernel; Weighted SVM; KPCM (Kernel based Probabilistic C-Means); KFCM (Kernel based
     Results: Evaluates the impact of spam detection using SVM, WSVM with KPCM and WSVM with KFCM.UC

VII. CONCLUSION

After a comprehensive analysis of the selected research studies, we have identified several research findings and observations. These have been discussed in detail in the preceding sections. In this section, we focus on the main findings and conclusions of the study. A high adoption rate for the supervised machine learning approach can be seen throughout the review. This approach is used mainly because it generates higher-accuracy results with less variation, giving high consistency. In addition, we found that certain algorithms, such as Naïve Bayes and SVM, are in high demand compared to other machine learning algorithms. Systems that use multiple algorithms are more common than single-algorithm systems, as they deliver better outcomes. Researchers have focused mostly on email features such as BoW and body text, creating future research opportunities to develop systems that detect spam using other email features.

REFERENCES

[1] "Global spam volume as percentage of total e-mail traffic from January 2014 to September 2019, by month." https://www.statista.com/statistics/420391/spam-email-traffic-share/.
[2] T. Ouyang, S. Ray, M. Allman, and M. Rabinovich, "A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise," Elsevier, vol. 2015, pp. 101–102.
[3] O. Saad, A. Darwish, and R. Faraj, "A survey of machine learning techniques for Spam filtering," IJCSNS Int. J. Comput. Sci. Netw. Secur.
[4] K. Asif, A. Sami, S. Bharindhan, and K. Krishan, "A Comprehensive Survey for Intelligent Spam Email Detection," IEEE Xplore, 2019.
[5] "Number of e-mail users worldwide from 2017 to 2024." [Online]. Available: https://www.statista.com/statistics/255080/number-of-e-mail-users-worldwide/.
[6] M. Guntrip, "https://www.proofpoint.com/us/corporate-blog/post/fbi-reports-125-billion-global-financial-losses-due-business-email-compromise." [Online]. Available: https://www.proofpoint.com/us/corporate-blog/post/fbi-reports-125-billion-global-financial-losses-due-business-email-compromise.
[7] "Australian Competition and Consumer Commission," Scam Stat., [Online]. Available: https://www.scamwatch.gov.au/scam-statistics?scamid=all&date=2018.
[8] K. Jackowski, B. Krawczyk, and M. Woźniak, "Application of adaptive splitting and selection classifier to the spam filtering problem," Cybern. Syst. An Int. J.
[9] Sathya and A. Abraham, "Comparison of supervised and unsupervised learning algorithms for pattern classification," ResearchGate.
[10] F. Qian, Y. C. H. Abhinav Pathak, Z. M. Mao, and Y. Xie, "A case for unsupervised-learning-based spam filtering," Univ. Minnesota J., 2010.
[11] Y. Alamlahi and A. Muthana, An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems. Modern Education and Computer Science Press, 2018.
[12] L. Melian and A. Nursikuwagus, "Prediction student eligibility in vocation school with Naïve-Byes decision algorithm," 2018.
[13] A. S. Aski and N. K. Sourati, "Proposed efficient algorithm to filter spam using machine learning techniques," Elsevier, vol. 2016, pp. 145–149.
[14] K. Pawar and M. Patil, "Pattern classification under attack on spam filtering," IEEE Xplore, 2015.
[15] V. V. and A. K. Rajan, "An Improved Spam Detection Method With Weighted Support Vector Machine," IEEE Xplore.
[16] H. Kaur and A. Sharma, "Improved Email Spam Classification Method Using Integrated Particle Swarm Optimization and Decision Tree," IEEE Xplore, vol. 2016, pp. 516–521.
