Detection of Phising Websites Using Machine Learning Approaches

The document discusses the detection of phishing websites using machine learning approaches. It presents a study that implemented three supervised learning models (Decision Tree, K-Nearest Neighbor, and Random Forest) on a dataset of 11,055 observations and 32 variables to classify websites as legitimate or illegitimate. The purpose is to conduct a mini-review of existing phishing detection techniques and experiments to identify malicious websites. Accurately detecting phishing websites is important as phishing can result in financial loss, identity fraud, or infecting devices with malware.

Conference Paper · October 2021


DOI: 10.1109/ICoDSA53588.2021.9617482

2021 International Conference on Data Science and Its Applications (ICoDSA)

Detection of Phising Websites using Machine Learning Approaches

Farashazillah Yahya, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, [email protected]
Ryan Isaac W Mahibol, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, [email protected]
Chong Kim Ying, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, [email protected]
Magnus Bin Anai, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, [email protected]
Sidney Allister Frankie, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, [email protected]
Eric Ling Nin Wei, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia, [email protected]
Rio Guntur Utomo, School of Computing, Telkom University, Bandung, Indonesia, [email protected]

2021 International Conference on Data Science and Its Applications (ICoDSA) | 978-1-6654-4303-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICODSA53588.2021.9617482

Abstract—As the world responded to the Coronavirus Disease 2019 (COVID-19) pandemic in 2020, digital operations became more important, and people started to depend on new initiatives such as the cloud and mobile infrastructure. Consequently, the number of cyberattacks such as phishing has increased. Phishing websites can be detected using machine learning by classifying websites as legitimate or illegitimate. The purpose of the study is to conduct a mini-review of the existing techniques and to implement experiments that detect whether a website is malicious. The dataset consists of 11,055 observations and 32 variables. Three supervised learning models are implemented in this study: Decision Tree, K-Nearest Neighbour (KNN), and Random Forest. The three algorithms were chosen because they provide a better understanding and are more suitable for the dataset. In the experiments, Decision Tree reached an accuracy of 91.16%, the lowest of the three models; KNN reached 97.6%, the highest; and Random Forest reached 94.44%. Judged on accuracy alone, KNN is the prime candidate for the best model. However, Random Forest is deemed more suitable for the dataset despite its lower accuracy, because it has the lowest false-negative value of the three models. The experiments can be extended with different datasets and models for comparative analysis.

Keywords—website, phishing, malicious URL, prediction, machine learning

I. INTRODUCTION

Phishing is one of the oldest forms of cyberattack, dating back to the 1990s, and it remains one of the most common and malicious attacks. Throughout the years, phishing messages, tactics, and techniques have evolved. Phishing techniques include email phishing scams, spear-phishing, and whaling. Phishing targets victims by posing as a trustworthy source. Typically, the attacker poses as a believable real person or a legitimate agency to trick individuals into supplying sensitive data, including bank data, credit and debit card numbers, passwords, and private credentials [1]. The data is then used to obtain access to sensitive accounts such as online accounts, which can result in financial loss or even identity fraud. Phishing can also infect the victim's devices with malware.

Nowadays, most work is conducted on digital platforms. In many ways, having a computer and access to the internet makes work and personal lives easier. Moreover, digital networks enable the prompt execution of transactions and activities across sectors. Many individuals are also aware of the benefits of using the internet for activities such as online shopping, bill payment, mobile top-up, and banking transactions. With the advancement of mobile and wireless technologies, users who require access to a local network can effortlessly connect to the Internet from anywhere and at any time. Nonetheless, although this connectivity offers many conveniences, it has also revealed some major security flaws [2].

Users frequently have several accounts on numerous websites, including social networking sites, email, and financial services. As a result, the most vulnerable targets for this attack are innocent online users, because most individuals are unaware of the risks to their important information, which aids in the success of the attack. Generally, a phishing attack uses social engineering to trick the target into clicking on a falsified link that leads to a fake web page. Most of the time, the fake website is designed to look like the real one. The fake link is posted on prominent websites or sent to the victim by email. As a result, rather than going to the genuine web server, the victim's response may go to the attacker's server. This paper undertakes a mini-review of the techniques and the state of the art in the detection of phishing websites using machine learning approaches. The objectives of this work are to:

1. Review existing techniques and related work
2. Detect phishing website URLs using Decision Tree, K-Nearest Neighbour (KNN), and Random Forest models
3. Evaluate the performance of the machine learning models

The three machine learning algorithms used in the experiments are Decision Tree, KNN, and Random Forest. The researchers chose the Decision Tree algorithm for the first experiment because it provides a better understanding


through the tree visualisation. KNN was chosen for the second experiment because it makes no assumptions about the dataset and is easier to implement. Finally, the researchers chose the Random Forest algorithm for the third experiment because it is better at handling a large amount of data and an imbalanced dataset [3].

The paper is arranged in seven sections. The first section introduces the idea and concepts. The next section discusses the related works. This is followed by the description of the methodology and then the results and discussion. Afterwards, the trends and opportunities of the work are presented. Finally, a conclusion summarises the work.

II. RELATED WORK

In this section, the relevant works on detecting phishing websites are discussed (Table I). The researchers have added to the existing review and state-of-the-art work [4], [5]. Justin et al. [6] proposed machine learning-based techniques that provide better coverage than blacklisting-based approaches for detecting malicious websites using classification models. The researchers used a normal dataset extracted from the DMOZ Open Directory Project and the Yahoo directory, acquired a phishing dataset from PhishTank and Spamscatter, and then applied three classifiers to the data: Naive Bayes, Support Vector Machine (SVM), and logistic regression. They evaluated their models and determined that logistic regression is the best approach to perform automatic feature selection, as it achieved a high classification accuracy of between 95% and 99% while also producing an interpretable linear model of the training data.

Mahajan et al. [7] discuss an anti-phishing approach that identifies and chooses the best machine learning algorithms for phishing website detection by comparing false-positive rates, false-negative rates, and accuracy. The authors focused on machine learning algorithms to overcome the emerging issues of heuristics-based and blacklist methods. They investigated a total of 36,711 URLs, consisting of 17,058 benign URLs and 19,653 phishing URLs obtained from Alexa and PhishTank. They divided the data set into training and testing sets in ratios of 50:50, 70:30, and 90:10 to achieve more accurate results. They tested three machine learning classification algorithms on the collected data set: Decision Tree, Random Forest, and SVM. Their most accurate phishing website detection algorithm was Random Forest, which achieved an accuracy of 97.14% with a lower false-negative rate than Decision Tree and SVM. Furthermore, they discovered that the performance of phishing website detection improves when more data is used for training.

Mohammad et al. [8] derived several features using their software tool to detect phishing websites more accurately by reducing the false-negative rate, that is, the rate at which phishing websites are classified as legitimate. The dataset consists of 2,500 phishing URLs obtained from PhishTank and Millersmiles and 450 non-phishing URLs. The authors tested four rule-based classification algorithms: C4.5, which derives a decision tree from the data set and originates from information theory; RIPPER, which uses a separate-and-conquer technique; PRISM, which is categorized under the covering algorithms; and CBA, which implements the Apriori algorithm. The Apriori algorithm performs frequent itemset mining that can generate association rules. Their best performing algorithm is CBA because it achieved the lowest error rate of 4.5%.

Lakshmi et al. [9] proposed a machine learning algorithm as an alternative for the modelling of efficient phishing website detection. One of the features used to detect phishing websites is the use of third-party services, such as blacklists and search engines, which contribute to a more reliable prediction of phishing websites. They implemented supervised learning classification algorithms, namely Multilayer Perceptron (MLP), Decision Tree Induction (DT), and Naïve Bayes (NB), to build models for phishing website detection. They used a dataset that consisted of 100 legitimate websites and 100 phishing websites. In addition, they applied 10-fold cross-validation to evaluate the robustness of the classifiers. They found that the decision tree classifier achieved a better performance than the other two models.

An innovative machine learning approach to classifying phishing websites is presented in Akanbi et al. [10]. They used data pre-processing techniques in several stages, such as feature extraction, dataset division, normalization, and attribute weighting, to ensure that the classifier can comprehend the dataset and assign the samples correctly to the reference classes. The data set, consisting of phishing and non-phishing data, is separated into three groups for training and testing, with split ratios of 50:50, 70:30, and 30:70. Moreover, they selected pruned decision trees and classifiers such as Linear Regression, KNN, Decision Tree C4.5, and SVM to determine the accuracy of classification across different algorithms. To avoid overfitting, they applied a pessimistic pruning technique as post-pruning with Decision Tree on the data set. Overall, they found that pruning decision trees can reduce the complexity of the phishing website classification process.

Sahingoz et al. [11] proposed a real-time anti-phishing system that uses seven different classification algorithms and natural language processing (NLP). The seven classification algorithms are Naive Bayes, Random Forest, KNN (n = 3), Adaboost, K-star, SMO, and Decision Tree. A total of 73,575 URLs, containing 36,400 legitimate URLs and 37,175 phishing URLs, were obtained from PhishTank and the Yandex Search API. Based on their results, Random Forest with NLP-based features has the highest accuracy (97.98%).

A phishing website detection approach based on 12 selected features is presented in [12] to identify the best algorithm for recognizing unknown URLs. The author extracted 3,547 malicious web pages from PhishTank and 3,511 benign web pages from the DMOZ directory as datasets. A ratio of 4:1 is used to train and test the models. The classifiers tested in this study are the logistic regression classifier, SVM classifier, Naive Bayes classifier, and Decision Tree classifier. In sum, without considering the text features, the decision tree has the highest accuracy (90.04%), followed by the SVM (88.98%), then the logistic regression (85.06%), and lastly Naive Bayes (70.06%).

Basnet et al. [13] presented a methodology that can be used for developing an anti-phishing tool. The authors carried out a heuristic-based approach to classify phishing URLs. 11,361 phishing URLs have been


collected by them from the “OldPhishTank” dataset, and to track the evolving tactics used by scammers, they collected a second batch of 5,456 phishing URLs from the “NewPhishTank” dataset. The authors also obtained 22,213 legitimate URLs from the Yahoo directory and 9,636 non-phishing URLs from the DMOZ Open Directory Project. The features that the authors used to train the models are lexicon-based features, keyword-based features, reputation-based features, and search engine-based features. They evaluated 7 supervised batch-learning classifiers and compared their performance in detecting phishing URLs. The classifiers they evaluated include SVMs with RBF kernels, Multilayer Perceptron, Random Forest, Naive Bayes, and logistic regression. They used 10 times 10-fold cross-validation to evaluate the classifiers. Overall, Random Forest performs the best in all performance metrics.

Kulkarni et al. [14] proposed a method of defence that differentiates phishing websites from their authentic counterparts by categorizing those websites. A dataset containing 548 legitimate, 702 phishing, and 103 suspicious URLs was used in this study. In this paper, 60% of randomly selected samples were used to train the neural network, 20% were used for validation, and another 20% were used for testing. As for the decision tree, Naive Bayes classifier, and SVM, 40% of randomly selected samples were used for training and the remaining 60% were used for testing. Based on the results, the pruned Decision Tree had the highest accuracy among all 4 classifiers at 91.5%, followed by SVM (86.69%), then the Naive Bayes classifier (86.14%), and lastly the Neural Network (84.87%).

Lokesh and BoreGawda [15] proposed an approach to detect phishing websites using effective machine learning algorithms. The dataset was collected mainly from MillerSmiles and PhishTank using data mining techniques. They divided the data set into training and testing sets in the proportion of 80:20 to get more accurate outcomes. In feature extraction, 30 site features were chosen and utilized to detect phishing. The classifiers used in this study are the Random Forest classifier, Decision Tree classifier, KNN, SVM, and Linear Support Vector classifier (SVC). In the comparison, Random Forest obtained the highest accuracy of 96.8%, followed by Decision Tree (96.05%), then KNN (93.53%), whereas SVM had the lowest accuracy of about 48.56%.

Various machine learning algorithms for detecting phishing websites are compared, and the results presented, in Shahrivari et al. [16]. The researchers obtained 11,000 sample websites from PhishTank; the phishing website dataset consists of 4,898 phishing websites and 6,157 legitimate websites. According to their results, Random Forest and XGBoost perform well in terms of computational response time and accuracy.

Hasan et al. [17] examined and evaluated several machine learning models for identifying phishing websites. The machine learning methods investigated in this study are logistic regression, Random Forest, SVM, KNN, Naïve Bayes, and the XGBoost classifier. The authors used datasets consisting of 6,080 authentic websites and 4,974 malicious websites. The authors utilised 70% of the sample websites for training and 30% for testing. Overall, they found that among the six algorithms employed, Random Forest has the highest accuracy (97.17%). Alam et al. [18], on the other hand, presented a work that utilises two machine learning models for detecting phishing attacks, i.e. Decision Tree and Random Forest. The authors retrieved an online dataset of phishing attacks from kaggle.com. The 32 features were extracted using feature selection. The models were evaluated based on a confusion matrix. The result shows that Random Forest performed better, with an accuracy of 97%.

TABLE I. RELATED WORK IN PHISHING WEBSITES DETECTION USING MACHINE LEARNING

Author ML Algorithm Description

[6] Naive Bayes, SVM, and Logistic Regression
− Detect malicious websites from suspicious URLs.
− Normal dataset from the DMOZ Open Directory Project and the Yahoo directory; phishing dataset acquired from PhishTank and Spamscatter.
− Logistic Regression is the best approach to perform automatic feature selection as it has achieved high classification accuracy, between 95% and 99%.

[7] Decision Tree, Random Forest, and SVM
− Detect phishing websites using machine learning algorithms.
− 36,711 URLs in the dataset, consisting of 17,058 benign URLs and 19,653 phishing URLs. The dataset size ratio is set to 50:50, 70:30, and 90:10.
− Random Forest is the most accurate phishing website detection algorithm as it achieved an accuracy of 97.14% with the lowest false-negative rate.

[8] C4.5, RIPPER, PRISM, CBA
− Detect phishing websites accurately using a software tool that derives several features to reduce the false-negative rate.
− 2,500 phishing URLs obtained from PhishTank and Millersmiles, and 450 non-phishing URLs.
− CBA is the best performing algorithm because it achieved the lowest error rate of 4.5%.

[9] Multilayer Perceptron (MLP), Decision Tree Induction, and Naïve Bayes (NB)
− Machine learning algorithm as an alternative for the modelling of efficient phishing website detection.
− 100 legitimate websites and 100 phishing websites, evaluated with 10-fold cross-validation.
− The decision tree achieved better performance than the other two models.


[10] Linear Regression, KNN, Decision Tree C4.5, and SVM
− Classify phishing websites through a machine learning approach.
− The dataset size ratio is set to 50:50, 70:30, and 30:70.
− A pruned decision tree classifier can be used to reduce the complexity of the phishing website classification process.

[11] Naive Bayes, Random Forest, KNN, Adaboost, K-star, SMO, and Decision Tree
− Real-time anti-phishing system.
− 73,575 URLs containing 36,400 legitimate URLs and 37,175 phishing URLs, obtained from PhishTank and the Yandex Search API.
− Random Forest with NLP-based features has the highest accuracy (97.98%).

[12] Logistic Regression classifier, SVM classifier, Naive Bayes classifier, and Decision Tree classifier
− Detect phishing websites based on the machine learning algorithm.
− 3,547 malicious web pages from PhishTank and 3,511 benign web pages from the DMOZ directory.
− The decision tree classifier has the highest accuracy (90.04%).

[13] SVM (RBF kernel), MLP, Random Forest, Naive Bayes, and Logistic Regression
− Develop a tool to detect phishing URLs.
− 11,361 phishing URLs collected from the “OldPhishTank” dataset and 5,456 phishing URLs from the “NewPhishTank” dataset; 22,213 legitimate URLs from the Yahoo directory and 9,636 non-phishing URLs from the DMOZ Open Directory Project.
− Random Forest performs the best in all performance metrics.

[14] Decision Tree, Naive Bayes, SVM, and Neural Network
− Develop a method of defence to differentiate phishing websites and their authentic counterparts.
− A dataset containing 548 legitimate, 702 phishing, and 103 suspicious URLs was used in this study.
− The pruned Decision Tree had the highest accuracy among all 4 classifiers, at 91.5%.

[15] Random Forest, Decision Tree, KNN, SVM
− Detect phishing websites using effective machine learning algorithms.
− The dataset size ratio is set to 80:20.
− The Random Forest classifier had the highest accuracy among all 4 classifiers, at 96.8%.

[16] Logistic Regression, Decision Tree, Random Forest, Ada-Boost, SVM, KNN, Artificial Neural Networks, Gradient Boosting, XGBoost
− Compare the results of various machine learning algorithms to detect phishing websites.
− 11,000 sample websites obtained from PhishTank, consisting of 4,898 phishing websites and 6,157 legitimate websites.
− Random Forest and XGBoost performed better for computational response time and accuracy.

[17] Logistic Regression, Random Forest, SVM, KNN, Naïve Bayes, and XGBoost
− Examine and evaluate various machine learning methods for identifying phishing websites.
− Datasets obtained consist of 6,080 legitimate websites and 4,974 phishing websites.
− Random Forest has the highest accuracy (97.17%).

[18] Random Forest, Decision Tree
− Detect phishing attacks using the Decision Tree and Random Forest.
− The Random Forest model achieved the highest accuracy of 97%.

The internet is transforming how people study and work as it becomes more deeply integrated with social life, but it is also exposing everyone to increasingly significant potential cybersecurity threats [1], [2], [19]. According to Craigen et al. [20], cybersecurity can be defined as the organization and collection of resources, processes, and structures used to protect cyberspace and cyberspace-enabled systems from occurrences that misalign de jure from de facto property rights. The number of cybercriminals has increased drastically [2] as the chance for financial gain has increased and additional motives have emerged. The latest technologies, such as mobile computing, cloud computing, e-commerce, and online banking, contain sensitive confidential information that is crucial to the security and economic well-being of every country. As a result, it is compulsory to enhance cybersecurity and safeguard information infrastructures by implementing a comprehensive cybersecurity strategy to prevent data loss [21].

Machine learning has been adopted across a variety of industries as an innovative approach to analysing massive volumes of data and building evolutionary approaches for the prevention of unauthorized access to and destruction of networks. The method by Hasan et al. [22] implements machine learning approaches using Random Forest as feature selection for intrusion detection systems. Firstly, they used a permutation importance index to rank the features. Secondly, to select the best subset of features for classification, avoid overfitting, and improve model performance, the Random Forest algorithm is used to reduce the feature set used to build an Intrusion Detection System (IDS). An IDS acts as a mandatory addition to the security infrastructure of most organizations to overcome cybersecurity threats. The KDD’99 data set was used to see how well this method worked. The findings indicate that Random Forest classification with reduced features can generate more accurate results.
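The idea of ranking features by a permutation-based importance score, as mentioned for [22], can be illustrated with a small sketch. It is not the implementation used in [22]: the toy data, the stand-in `predict` function (used here in place of a trained Random Forest), and the helper names are all assumptions made for illustration. The sketch shuffles one feature column at a time and records how much the accuracy drops.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for r, y in zip(rows, labels) if predict(r) == y)
    return correct / len(rows)

def permutation_importance(predict, rows, labels, seed=0):
    """Accuracy drop observed when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, shuffled, labels))
    return importances

# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
rows = [[data_rng.choice([-1, 1]), data_rng.choice([-1, 1])] for _ in range(200)]
labels = [r[0] for r in rows]

def predict(row):
    # Stand-in for a trained classifier; it only looks at feature 0.
    return row[0]

scores = permutation_importance(predict, rows, labels)
# Feature indices ordered from most to least important.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

A feature whose shuffling leaves the accuracy unchanged contributes nothing to the stand-in model and would be a candidate for removal when reducing the feature set.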


Furthermore, there has been a trend in recent years to apply machine learning techniques to big data [23], cloud [2], environmental [24], and network security [25], driven by significant security concerns regarding cloud computing as virtualized data centres become more popular as cost-effective infrastructure and solutions. After using the labelled UNSW datasets to train different supervised machine learning models, Bhamare et al. [26] explored the robustness of these learned models by testing them on the ISOT dataset, which was collected from a range of different experimental environments and settings. They compared the performance of algorithms such as J48, Naïve Bayes, logistic regression, and SVM. In their results, J48 and logistic regression achieved an accuracy of around 95%, while SVM was estimated to be around 90%. However, they conclude that supervised machine learning models that perform well on one dataset might not perform well on other datasets created under different simulated or experimental circumstances, because the models performed poorly in identifying the anomalous packets, which might be dangerous in real-time circumstances. They encourage more significant rework and research to perform better in terms of cloud security.

III. METHODOLOGY

The detection of phishing websites using machine learning can be conducted using one of the most common modelling tasks, which is classification. The aim is to decide whether a website belongs to one category or another, so three machine learning algorithms were chosen: Decision Tree (First Experiment), KNN (Second Experiment), and Random Forest (Third Experiment). The Decision Tree was chosen for the first model and KNN for the second model to compare the results obtained from an eager learner and a lazy learner. Besides that, 80% of the dataset is randomly selected for the training set, whereas 20% of the dataset goes into the testing set. The dataset was divided in this ratio based on the 80/20 rule of the Pareto principle and because of the size of the dataset.

In this work, a secondary dataset available online is used, as published by the UCI Machine Learning Repository: Phishing Websites Data Set [27]. The dataset has 11,055 observations and 32 variables. The researcher of the dataset has provided a description of the phishing website features that further explains the variables and justifies the characteristics which would be considered legitimate or phishing. The imported dataset undergoes a data cleaning process to determine whether there are any missing, duplicated, or too narrow or wide values. After that, the training set is fed into the classification algorithm for each of the machine learning algorithms mentioned. In addition, the unseen data, which is the testing set, is fed into the classifier. Finally, a confusion matrix is used for evaluating the performance of all three models. Determining which variables are suitable and important has proven challenging for a multi-variable dataset. There are 32 variables or columns in the dataset and 11,055 rows of values. In the end, 15 variables were used; because all variables proved important, the choice of which variables to keep has only a small impact. After the data splitting, there are 8,805 rows of data in the training set and 2,250 rows of data in the testing set. In the three experiments, the variables that were chosen are all the same. Only the 2nd to the 15th variables of the 32 variables in the dataset are used, based on the feature selection method. For the feature selection method, the ‘Boruta’ function and package are used to determine whether a variable is important or not. Based on Fig. 1, all the variables are deemed important.

Fig. 1. No attributes are deemed unimportant.

The three experiments used a confusion matrix to evaluate their performance. Based on the confusion matrix table, the accuracy of the experiments that were conducted can be calculated. The accuracy is given in Equation (1). It is calculated by dividing the number of all correct predictions by the total number of observations in the dataset.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

Where TP is True Positive, which are values correctly predicted as actual positives; TN is True Negative, which are values correctly predicted as actual negatives; FP is False Positive, which are negative values that are predicted as positive; and FN is False Negative, which are positive values that are predicted as negative.

Other than (1), other basic measures are derived from the confusion matrix, such as error rate, sensitivity (recall), specificity, precision, and F-measure. The equations for these measures are as follows:

Equation (2) is calculated by dividing the number of all incorrect predictions by the total number of observations in the dataset. The worst error rate is 1.0 and the best error rate is 0.0.

$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} \qquad (2)$$

Equation (3) is calculated by dividing the number of correct positive predictions by the total number of positives in the dataset. The worst sensitivity is 0.0 and the best sensitivity is 1.0.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3)$$

Equation (4) is calculated by dividing the number of correct negative predictions by the total number of negatives in the dataset. The worst specificity is 0.0 and the best specificity is 1.0.

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (4)$$

Equation (5) is calculated by dividing the number of correct positive predictions by the total number of positive predictions in the dataset. The worst precision is 0.0 and the best precision is 1.0.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

Equation (6) helps to measure the Recall and Precision of a model at the same time.

2021 International Conference on Data Science and Its Applications (ICoDSA)

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$
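As an illustration of the measures described for Equations (2) to (6), the following is a minimal sketch in Python rather than the R code used in the experiments. It computes each measure from the four counts of a binary confusion matrix. The function name `confusion_metrics`, the example counts, and the choice to return 0.0 when a denominator is empty are assumptions made for illustration only; they are not taken from the study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute error rate, sensitivity, specificity, precision and F-measure
    from the counts of a binary confusion matrix."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when a measure is undefined
        # (for example, precision when nothing was predicted positive).
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # also called recall
    precision = ratio(tp, tp + fp)
    return {
        "error_rate": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f_measure": ratio(2 * precision * sensitivity,
                           precision + sensitivity),
    }


# Hypothetical counts, not results from the paper.
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is reported alongside the individual measures.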

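The shuffle followed by an 80/20 partition described in the methodology can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used in the experiments: the helper name `train_test_split`, the seed value 42, and the stand-in list of row indices are all hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (training set, testing set)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the shuffle repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 11,055 observations of the dataset (row indices only).
observations = list(range(11055))
train, test = train_test_split(observations)
print(len(train), len(test))
```

Fixing the seed plays the same role as a seeded random number generator in any language: rerunning the script produces the same partition, so the reported accuracy can be reproduced.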
IV. RESULT AND DISCUSSION

A. First Experiment (Decision Tree)
The first experiment uses the Decision Tree algorithm, a supervised learning algorithm that can be used for both classification and regression tasks; in this experiment, the Decision Tree classifier is used. The goal of this experiment is to classify the type of websites by learning simple decision rules based on the selected variables. The model in this experiment uses a total of four packages: ‘rpart’, ‘rpart.plot’, ‘Boruta’, and ‘dplyr’. ‘rpart’ and ‘rpart.plot’ are used for the decision tree algorithm, whereas the ‘Boruta’ package is used for the feature selection process. The ‘dplyr’ package is imported for its ‘select’ function, which is used for selecting the 15 variables.

After the feature selection process, the dataset undergoes a data cleaning process to determine whether there are any missing, duplicated, wide or narrow values. The dataset is shuffled before the data partition process and is then split into an 80% training set and a 20% testing set. After that, the ‘rpart’ function is called with the independent and dependent variables as well as the training set. The method is set to ‘class’ because this is a decision tree classifier.

Fig. 2 is the visualization for the decision tree algorithm using the ‘rpart.plot’ function. Only 3 variables can be seen in the plot because, based on the result of the ‘Boruta’ algorithm, ‘SSL_Final_State’ and ‘URL_of_Anchor’ are in the top 3 of the most important variables, whereas ‘Prefix_Suffix’ is in the top 6. From this, it can be concluded that the plot is simplified and only shows the most important variables out of the 15 selected variables.

The first node in Fig. 2 contains all the data; if the value of ‘SSL_Final_State’ is less than 1 then it is a legit website, otherwise it is a phishing website. After that, it splits into two nodes, and if ‘URL_of_Anchor’ is less than 0 then it is a legit website and vice versa. Lastly, if the value of ‘Prefix_Suffix’ is less than 0, then it is a legit website and vice versa. The performance of the first experiment is evaluated using Equation (1) and achieves an accuracy of around 91%, as can be seen in Fig. 3, which means that the aim of this experiment has been achieved.

Fig. 2. Visualization of the result of the decision tree algorithm.

Fig. 3. The accuracy for the first experiment (91%).

B. Second Experiment (K-Nearest Neighbor)
The KNN algorithm can be used for solving nearly any supervised learning problem, including security issues. Some studies have used the KNN approach to categorize phishing websites. In KNN, elements that share the same characteristics are grouped. KNN classifiers are non-parametric classification algorithms. This classifier generalizes only when it needs to classify an instance. By retaining all the information until an instance is classified, it ensures that no information is lost, as noted by Toolan and Carthy [28]. KNN has also been demonstrated to produce accurate results, at times even more accurate than symbolic classifiers. The KNN classifier was shown to achieve a 99% detection rate in a study carried out by Kim and Huh in 2011.

Firstly, the dataset is imported into the model and the ‘str’ function is used to inspect the structure of the data. After that, the data is normalized so that all the values are transformed to a common scale. Next, the ‘set.seed’ function is used to ensure the same result when the code is rerun multiple times. In addition, the dataset is split into an 80% training set and a 20% testing set, using 15 attributes out of the 32 variables. ‘NROW’ was then used to find the number of observations. Since the performance of KNN is primarily determined by the choice of K, which depends on the size of the dataset and the type of the classification problem, the best K was found by varying it from 94 to 95; based on the result, KNN performs best when K = 94. After that, the predictions can be checked against the actual values in tabular form. Other than that, a couple of libraries were included: ‘class’ so that KNN can be performed, and ‘caret’ and ‘e1071’ for the confusion matrix. The accuracy for each candidate K was printed using a loop. Lastly, ‘plot’ was used to get the data visualization shown in Fig. 4.

Fig. 4. The visualization for KNN using ‘plot’.

Based on the model that was built for this experiment, KNN with K = 94 is highly accurate, at around 97.69% as shown in Fig. 5. Nevertheless, the high accuracy obtained using KNN raised concerns


as it can be seen as overfitting. Overfitting is defined as a modelling error that introduces bias to the model because it is too closely related to the data set.

Fig. 5. The accuracy result for the second experiment (97%).

C. Third Experiment (Random Forest)
The third experiment is implemented using the Random Forest algorithm. Random Forest is a supervised learning algorithm that can be used for both classification and regression problems. The packages imported are ‘randomForest’ and ‘caret’. The dataset is loaded into RStudio. The next code segment tells R that the label is a categorical variable, to ensure that classification is carried out instead of regression. The set.seed() function is used to make the results reproducible. Then the dataset is shuffled, and the data partition is carried out. The dataset is divided into an 80% training set and a 20% testing set. Next, 14 variables are chosen to train the model in this experiment. In the next code segment, prediction is done on the testing set and the accuracy of this experiment is obtained.

Based on the output, the accuracy of the third experiment is 94.44% as shown in Fig. 6, which is in between the first experiment and the second experiment. The results of the three experiments show that the accuracy of the second experiment (KNN) is the highest, followed by the third experiment (Random Forest) and then the first experiment (Decision Tree), as shown in Table II.

Fig. 6. The accuracy of the third experiment (94.44%).

TABLE II. RESULTS FROM THE THREE EXPERIMENTS

Experiment             Accuracy (%)
Decision Tree          91.51
K-Nearest Neighbor     97.69
Random Forest          94.44

Even though the second experiment achieved the highest accuracy, it has a high risk of overfitting. The third experiment is deemed the best of the three experiments, and this is further supported by research showing that the Random Forest algorithm is the most accurate and suitable, with the lowest false-negative rate [7].

V. OPPORTUNITIES
One of the ways to improve the results of this work is to study the learning curves. Learning curves are obtained by testing against a test set as the number of training instances is increased. Thus, the difference between the in-sample and out-of-sample errors can be easily identified. To elaborate, a large initial difference indicates high estimated variance; on the other hand, errors that are both large and similar indicate a biased model. Bias measures the rigidity and inflexibility of a predictive model and is also known as underfitting; it indicates that the model is not acquiring all the signals that it could from the data. Variance, in contrast, measures the inconsistency of the model, which arises when the model is so flexible that it learns random patterns in the data. As a result, models with high variance tend to perform well on some data points but incredibly poorly on others.

Another essential way to improve models is stacking, which can help to achieve better performance. In the first stage of this technique, different algorithms are used to produce multiple predictions, each of them learning from the features found in the data. Next, a new model is given the predictions of the previously trained models, instead of the features, to learn from. This stacking approach is suited to approximating complex target functions. In brief, implementing a stacking model leads to a more powerful model.

Selecting features is equally important for improving the model. Feature selection achieves a better outcome by pruning some features if the estimated variance is high and the machine learning algorithm is relying on too many features. Subsequently, when reducing the number of features in the data, the features with the highest predictive value must be kept. Regularization amounts to implicit feature selection, as it relies on smart implementations of training algorithms and tells the algorithm to use as few features as possible, removing noisy and uninformative features.

VI. CONCLUSION AND FUTURE WORK
Phishing has become a significant network security issue that has been on the rise in the last few years. On this basis, the researcher has implemented phishing website detection using Machine Learning (ML). According to the result and discussion of each developed model, the researcher has achieved one of the main objectives of this research, which is to identify whether a website or URL is legitimate or phishing using machine learning. To detect phishing websites, the investigation employs a variety of methods. The researcher has chosen to apply and evaluate a dataset of phishing websites that consists of 11,055 observations of 15 variables from Kaggle.com, with the variable named ‘Result’ as the dependent (target) variable. The machine learning algorithms implemented to build models for the experiments are Decision Tree (first experiment), KNN (second experiment), and Random Forest (third experiment). The main reason behind choosing these machine learning algorithms is to determine the process of


lazy learners and eager learners to construct a classification model. In addition, the construction of an effective feature list is a critical step for improving the phishing website detection accuracy.

Hence, the feature selection process was accomplished by using the ‘Boruta’ package, which contains an algorithm that can measure the significant variables. As can be seen from the above analysis in the result and discussion, the confusion matrix was computed to evaluate the performance of the three different algorithms. Compared to the second and third experiments, the first experiment achieved the lowest accuracy, which is 91.16%. Meanwhile, the second experiment obtained the highest detection accuracy among the three experiments, with a result of 97.6%. The third experiment achieved 94.44% accuracy. On the other hand, although the second experiment achieved the highest accuracy among the three experiments, the remarkably high accuracy of the KNN algorithm suggests a greater chance of overfitting. Hence, the researcher observed that the third experiment using Random Forest suits the dataset better because it has the lowest false-negative rate compared to the other two algorithms. For this reason, the researcher determines that the third experiment is the best-suited classification model in this research to identify whether a website is a legitimate or phishing website. Correspondingly, the researcher succeeds in achieving the main objective of the research, which is to identify and evaluate machine learning models for detecting phishing websites.

In future work, a fully functional system that acts as a tool to detect phishing websites will be developed. The identified and evaluated models will be integrated into the system. Machine learning is used to explore better ways of detecting cyberattacks in order to ensure the safety of browsing the internet.

ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their comments.

REFERENCES
[1] F. Yahya, R. J. Walters, and G. B. Wills, “Goal-based security components for cloud storage security framework: A preliminary study,” 2016 Int. Conf. Cyber Secur. Prot. Digit. Serv. Cyber Secur. 2016, 2016, doi: 10.1109/CyberSecPODS.2016.7502338.
[2] F. Yahya, R. J. Walters, and G. B. Wills, “Analysing threats in cloud storage,” in 2015 World Congress on Internet Security (WorldCIS), 2015, pp. 44–48, doi: 10.1109/WorldCIS.2015.7359411.
[3] S. Gupta, “Pros and cons of various Classification ML algorithms,” Towards Data Science, Jun. 2020, [Online]. Available: https://towardsdatascience.com/pros-and-cons-of-various-classification-ml-algorithms-3b5bfb3c87d6.
[4] L. Tang and Q. H. Mahmoud, “A Survey of Machine Learning-Based Solutions for Phishing Website Detection,” Mach. Learn. Knowl. Extr., vol. 3, no. 3, pp. 672–694, 2021, doi: 10.3390/make3030034.
[5] E. Gandotra and D. Gupta, “An Efficient Approach for Phishing Detection using Machine Learning,” in Multimedia Security: Algorithm Development, Analysis and Applications, K. J. Giri, S. A. Parah, R. Bashir, and K. Muhammad, Eds. Singapore: Springer Singapore, 2021, pp. 239–253.
[6] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious URLs: an application of large-scale online learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 681–688.
[7] R. Mahajan and I. Siddavatam, “Phishing website detection using machine learning algorithms,” Int. J. Comput. Appl., vol. 181, no. 23, pp. 45–47, 2018.
[8] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Intelligent rule-based phishing websites classification,” IET Inf. Secur., vol. 8, no. 3, pp. 153–160, 2014.
[9] V. S. Lakshmi and M. S. Vijaya, “Efficient prediction of phishing websites using supervised learning algorithms,” Procedia Eng., vol. 30, pp. 798–805, 2012.
[10] O. Akanbi, A. Abunadi, and A. Zainal, “Phishing Website Classification: A Machine Learning Approach,” J. Inf. Assur. & Secur., vol. 9, no. 5, 2014.
[11] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Syst. Appl., vol. 117, pp. 345–357, 2019.
[12] W. Bai, “Phishing Website Detection Based on Machine Learning Algorithm,” in 2020 International Conference on Computing and Data Science (CDS), 2020, pp. 293–298.
[13] R. B. Basnet and T. Doleck, “Towards developing a tool to detect phishing URLs: A machine learning approach,” in 2015 IEEE International Conference on Computational Intelligence & Communication Technology, 2015, pp. 220–223.
[14] A. D. Kulkarni, L. L. Brown III, et al., “Phishing websites detection using machine learning,” 2019.
[15] G. H. Lokesh and G. BoreGowda, “Phishing website detection based on effective machine learning,” J. Cyber Secur. Technol., vol. 5, no. 1, pp. 1–14, 2021, doi: 10.1080/23742917.2020.1813396.
[16] V. Shahrivari, M. M. Darabi, and M. Izadi, “Phishing Detection Using Machine Learning Techniques,” CoRR, vol. abs/2009.1, 2020, [Online]. Available: https://arxiv.org/abs/2009.11116.
[17] S. M. Hasan, N. M. Jakilim, and M. F. Rabbi, “Determine the Most Effective Machine Learning Technique for Detecting Phishing Websites,” 2021.
[18] M. N. Alam, D. Sarma, F. F. Lima, I. Saha, R.-E.- Ulfath, and S. Hossain, “Phishing Attacks Detection using Machine Learning Approach,” in 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1173–1179, doi: 10.1109/ICSSIT48917.2020.9214225.
[19] F. I. Salih, N. A. Abu Bakar, N. H. Hassan, F. Yahya, N. Kama, and J. Shah, “IoT Security Risk Management Model for Healthcare Industry,” Malaysian J. Comput. Sci., vol. 0, no. 0 SE-Articles, pp. 131–144, Dec. 2019, doi: 10.22452/mjcs.sp2019no3.9.
[20] D. Craigen, N. Diakun-Thibault, and R. Purse, “Defining cybersecurity,” Technol. Innov. Manag. Rev., vol. 4, no. 10, 2014.
[21] F. Yahya, R. J. Walters, and G. B. Wills, “Protecting data in personal cloud storage with security classifications,” in 2015 Science and Information Conference (SAI), 2015, pp. 838–843, doi: 10.1109/SAI.2015.7237241.
[22] M. A. M. Hasan, M. Nasser, S. Ahmad, and K. I. Molla, “Feature selection for intrusion detection using random forest,” J. Inf. Secur., vol. 7, no. 3, pp. 129–140, 2016.
[23] L. Wang and R. Jones, “Big Data Analytics in Cyber Security: Network Traffic and Attacks,” J. Comput. Inf. Syst., vol. 00, no. 00, pp. 1–8, 2020, doi: 10.1080/08874417.2019.1688731.
[24] B. M. Fazli et al., “Improvement of Dam Management in Terms of WAM Using Machine Learning,” in ICDSME 2019, Water Resources Development and Management, L. Mohd Sidek, G. Salih, and M. Boosroh, Eds., 2020, pp. 226–236, doi: https://doi.org/10.1007/978-981-15-1971-0_23.
[25] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, “A survey of deep learning-based network anomaly detection,” Cluster Comput., 2019, doi: 10.1007/s10586-017-1117-8.
[26] D. Bhamare, T. Salman, M. Samaka, A. Erbad, and R. Jain, “Feasibility of supervised machine learning for cloud security,” in 2016 International Conference on Information Science and Security (ICISS), 2016, pp. 1–5.
[27] R. M. A. Mohammad, L. McCluskey, and F. Thabtah, “UCI Machine Learning Repository: Phishing Websites Data Set.” [Online]. Available: https://archive.ics.uci.edu/mL/datasets/Phishing+Websites.
[28] F. Toolan and J. Carthy, “Feature selection for spam and phishing detection,” in 2010 eCrime Researchers Summit, 2010, pp. 1–12.
