
Computer Networks 218 (2022) 109407


PhishNot: A Cloud-Based Machine-Learning Approach to Phishing URL Detection

Mohammed M. Alani a,b,∗, Hissam Tawfik c,d

a Computer Science Department, Toronto Metropolitan University, Toronto, Canada
b School of IT Administration and Security, Seneca College of Applied Arts and Technology, Toronto, Canada
c College of Engineering, University of Sharjah, Sharjah, United Arab Emirates
d School of Built Environment, Engineering, and Computing, Leeds Beckett University, Leeds, United Kingdom

ARTICLE INFO

Keywords: Phishing, Machine learning, Security, Attack, Cloud, URL

ABSTRACT

Phishing is constantly growing to be one of the most adopted tools for conducting cyber-attacks. Recent statistics indicated that 97% of users could not recognize a sophisticated phishing email. With over 1.5 million new phishing websites being created every month, legacy blacklists and rule-based filters can no longer mitigate the increasing risks and sophistication level of phishing. Phishing can deploy various malicious payloads that compromise the network's security. In this context, machine learning can play a crucial role in adapting the capabilities of computer networks to recognize current and evolving phishing patterns. In this paper, we present PhishNot, a phishing URL detection system based on machine learning. Our work uses a primarily ''learning from data'' driven approach, validated with a representative scenario and dataset. The input features were reduced to 14 to assure the system's practical applicability. Experiments showed that Random Forest presented the best performance with a very high accuracy of 97.5%. Furthermore, the design of our system also lends itself to being more adoptable in practice through a combination of high phishing detection rate and high speed (an average of 11.5 μs per URL) when deployed on the cloud.

1. Introduction

Phishing is a social engineering attack that exploits the weakness in system processes caused by system users [1]. An attacker can send a phishing Uniform Resource Locator (URL) such that when the user clicks on that link, it takes the user to a phishing website. Phishing URLs are delivered in various ways, including emails, text messages, or on other suspicious websites, with email being the primary phishing medium. The phishing website might have a URL that resembles a legitimate link, such as a social media website, banking website, or an email website, and the webpage on the phishing URL would resemble a legitimate service webpage. It would typically ask the user to log in. At this stage, once the users type their login credentials, they are stolen, and the users are usually redirected to the original login page. In other phishing attacks, clicking on a link could download malware or spyware, install backdoors, or steal session information.

Fig. 1 shows the growth in phishing websites from Q1 2017 to Q1 2021. This rapid growth, as shown in the figure, of about three times in the last two years of this 5-year period indicates malicious actors' reliance on phishing as one of the most successful attack vectors. This ''spike-like'' increase in Q2/Q3 of 2020 might be linked to the massive increase in online living and working due to Covid-19, a trend that is likely to continue.

One of the main challenges is that the network perimeter can be protected with state-of-the-art firewalls and intrusion detection systems but could still suffer from phishing. Phishing penetrates these protected network borders through encrypted web traffic or via emails. Once the user clicks this phishing URL, malicious activity proceeds to infect the target's device with malware or perform other harmful actions. Hence, protecting users from phishing is an integral part of securing the network.

With the increasing reliance on technology, phishing has become more wide-ranging, intense, and sophisticated. Spear phishing attacks have increased in number and improved in quality. In a spear-phishing attack, the attacker gathers information about a specific user or a small group of users and creates highly-crafted spoofed emails, usually impersonating well-known companies, trusted relationships, or contexts [2]. Another type of phishing is called vishing (voice phishing), in which the attack vector is a phone call instead of an email.

Lack of user awareness contributes heavily to the success rates of phishing. According to [3], only 52% of users raise an alarm upon

∗ Corresponding author at: School of IT Administration and Security, Seneca College of Applied Arts and Technology, Toronto, Canada.
E-mail address: [email protected] (M.M. Alani).

https://doi.org/10.1016/j.comnet.2022.109407
Received 7 January 2022; Received in revised form 16 June 2022; Accepted 3 October 2022
Available online 13 October 2022
1389-1286/© 2022 Elsevier B.V. All rights reserved.
M.M. Alani and H. Tawfik Computer Networks 218 (2022) 109407

Fig. 1. Number of phishing websites from 2017 to Q1 2021.

receiving a suspected phishing email within 5 min. This behavior indicates weak user awareness about phishing and its potentially harmful impact. It became increasingly challenging when many organizations moved to work from home due to the Covid-19 pandemic. Phishing has become the most widely used attack vector to deliver malicious payloads to targets. According to the 2022 Verizon Data Breach Report, 82% of data breaches involved a human element [4].

Within a networking context, phishing is a commonly used technique at the ''delivery'' stage within the cyber kill-chain [5]. After reconnaissance and weaponization, the malicious actor intends to deliver the malicious payload to the target in the least suspicious way possible. When successful, phishing jeopardizes the network's security by enabling malicious actors to implant malware and Trojan horses or establish covert connections back to the command-and-control center. This action could create a strong foothold for the attacker to move vertically or horizontally in the network. While firewalls, intrusion detection systems, or other network security appliances can help defend the network border, they do not protect the network from phishing.

With malicious actors developing new techniques, static rule-based detection approaches do not provide sufficient protection against phishing. Hence, the machine learning paradigm presents itself as a possible method to build phishing detection systems that are inherently capable of adaptively protecting networks from current and evolving techniques of phishing attacks and the repercussions they can bring to the networks.

Conventional techniques to detect phishing rely on old assumptions and static or not easily adaptable approaches that cannot catch up with the fast-evolving nature of technological development and phishing methods. In this ''data'' and ''cloud'' era, machine learning paradigms present a natural opportunity to design, implement, deploy, and evolve better phishing detection techniques.

1.1. Contributions

This paper presents a machine-learning-based phishing detection system that extracts features from the URL along with external features from other sources about that URL. This research focuses on producing a high-accuracy, implementable, efficient, and easily-accessible phishing URL detection system. The following points summarize this research's contributions:

1. Build a high-accuracy machine-learning-based phishing detection system that relies on the URL only, without the need to read the target webpage, to reduce the threats to the network's attack surface. This approach provides better network protection when compared to other approaches that require accessing the phishing webpage to detect the phishing attack.
2. Utilize a minimal number of critical features through recursive feature elimination (RFE) in the feature selection stage. This method not only improves efficiency by reducing the number of features fed into the machine learning classifier, but also reduces the number of features captured and extracted at the data acquisition phase.
3. Deploy the machine-learning-based phishing detection system with high detection accuracy on the cloud as an API that can be utilized to create browser plugins, email client plugins, or any other deployment architecture. This deployment enables easy access to the service by various networks and provides higher availability compared to locally-hosted solutions.

In addition to the above, our research produced a smaller version of the dataset with only 14 features that future research can use in phishing detection.

1.2. Paper layout

The following section discusses related works in phishing detection using both machine-learning and non-machine-learning solutions. In Section 3, the proposed system is explained with an overview of how the detection process works. Section 4 describes the dataset, collected data, and features used in the training and testing of the proposed classifier. The detailed steps of the experiments with their results are presented in Section 5. The results are discussed and compared to previous works in Section 6. The last section provides the conclusions along with directions for future research.

2. Related works

Phishing has been a problem for a long time and a subject of study in many research publications. In this section, we discuss examples of recent and relevant research in phishing detection.


2.1. Classical phishing detection

In 2017, Sonowal and Kuppusamy presented a multilayer phishing detection system named PhiDMA [6]. PhiDMA provides a model to detect phishing by incorporating five layers: an auto-upgrade whitelist layer, a URL features layer, a lexical signature layer, a string matching layer, and an accessibility score comparison layer. They built a prototype implementation of the proposed PhiDMA model. The testing results showed that the model could detect phishing sites with an accuracy of 92.72%.

Rao and Pais presented, in 2019, an application named Jail-Phish that relies on search engine-based techniques to detect phishing sites [7]. The focus of the proposed system was on Phishing Sites Hosted on Compromised Servers (PSHCS) and the detection of newly registered legitimate sites. Jail-Phish compares the suspicious site and the matched domain in the search results to calculate the similarity score between them. Testing results showed Jail-Phish to achieve an accuracy of 98.61%, a true positive rate of 97.77%, and a false positive rate of less than 0.64%.

2.2. Machine-learning based phishing detection

In 2018, Chin et al. presented a phishing detection system relying on Software Defined Networks (SDN) and Deep Packet Inspection (DPI) named Phishlimiter [8]. Phishlimiter starts with DPI and then leverages it with SDN to identify phishing activities through e-mail and web-based communications. The proposed DPI approach consists of two components, phishing signature classification and real-time DPI. The proposed system performs well in the SDN environment but is not suited for applications in end-user environments that are not reliant on SDNs.

Wei et al. presented, in 2019, a lightweight deep learning algorithm to detect malicious URLs and enable a real-time and energy-saving phishing detection sensor [9]. The proposed deep learning classifier achieved an accuracy of 86.630%. The focus of the study was to create a low-power phishing detector for sensors and Internet of Things (IoT) devices. Although the proposed system acquitted itself well as a lightweight solution, it has a relatively lower and less practicable accuracy performance.

In 2019, Sahingoz et al. presented a machine-learning-based phishing URL detector based on Natural Language Processing (NLP) [10]. The proposed system used seven classification algorithms and NLP features. Testing showed the proposed system could achieve an accuracy of 97.98% using an RF classifier, with a relatively high false-positive rate of 3%. These results were achieved using 27 features.

Also in 2019, Chiew et al. presented another machine-learning phishing URL detector named HEFS [11]. The proposed work suggests a feature reduction from 48 to 10 using a Cumulative Distribution Function gradient (CDF-g) algorithm. Testing showed that the proposed model achieved 94.6% accuracy with 48 features, while accuracy dropped to 94.60% when the features were reduced to 10. They performed another testing phase with a second dataset, where the proposed system achieved 94.27% accuracy with 30 features.

Jain and Gupta published, in 2019, a paper proposing a machine-learning phishing URL detection approach based on analysis of the HTML code of the web page [12]. The proposed approach extracts 12 features from the HTML code of the page linked by the URL and feeds that information into a classifier. It achieved 98.4% accuracy on the logistic regression classifier. However, this method is considered high-risk because it reads the actual page contents before deciding whether it is a phishing or benign link. This process can be particularly dangerous if the phishing page contains malware or is used for drive-by attacks.

Abu Tair et al. proposed, in 2019, a case-based reasoning phishing detection system (CBR-PDS) that relies on previous cases to detect phishing attacks [13]. CBR-PDS aims to improve the detection accuracy and the reliability of the results by identifying a set of discriminative features and discarding irrelevant features. CBR-PDS relies on a two-stage hybrid procedure using information gain and genetic algorithms. The reduction of the data dimensionality results in an improved accuracy rate and a reduced processing time. Testing shows that CBR-PDS can achieve an accuracy of 95%. However, the system requires high processing power and high memory.

Zhu et al. proposed, in 2020, the Decision Tree and Optimal Features based Artificial Neural Network (DTOF-ANN) to target proper feature selection to help the ANN classifier perform better [14]. The proposed system starts with improving the traditional k-medoids clustering algorithm with an incremental selection of initial centers to remove duplicate points from the public datasets. Then, an optimal feature selection algorithm based on the newly defined feature evaluation index, decision tree, and local search method prunes out the negative and useless features. Finally, an optimal structure of the neural network classifier is constructed through finely-tuned parameters and trained by the selected optimal features. Testing results demonstrated that DTOF-ANN could achieve an accuracy of 97.80%.

In 2021, Mourtaji et al. introduced a hybrid rule-based solution for phishing URL detection using Convolutional Neural Networks (CNN) [15]. The proposed system extracts 37 features from seven different methods, including the blacklisted method, lexical and host method, content method, identity method, identity similarity method, visual similarity method, and behavioral method. When tested, the proposed system achieved 97.945% accuracy with the CNN model and 93.216% with the MLP model.

Gandotra and Gupta presented, in 2021, a study on the role of feature selection methods in detecting phishing webpages efficiently and effectively [16]. Their comparative analysis of machine learning algorithms was carried out based on their performance without and with feature selection. The experiments demonstrate that employing a feature selection method along with machine learning algorithms improves the build time of classification models for phishing detection without compromising their accuracy.

In 2021, Wazirali et al. proposed another SDN-based phishing detection technique based on clustering and CNN [17]. The proposed work uses Recursive Feature Elimination (RFE) with a Support Vector Machine (SVM) algorithm for feature selection. The SDN transfers the URL phishing detection process out of the user's hardware to the controller layer, continuously trains on new data, and then sends its outcomes to the SDN switches. RFE-SVM and CNN increase the accuracy of phishing detection. The experimental results showed 99.5% phishing detection accuracy. The proposed work consumed about 500 MB of memory, which could heavily overload the SDN devices where the classifier operates. Another shortfall of the proposed system is that it uses online training, which can make the classifier model susceptible to adversarial ML attacks.

Some of the machine-learning systems mentioned earlier achieved relatively high accuracy. However, we argue that the proposed system needs to attain a combination of high accuracy, practical implementation, and adaptability to a wide range of phishing attacks, which the previous literature did not sufficiently address. This combination of objectives makes our proposed system a stronger candidate for realistic implementation and high accuracy.

Further information on previous works can be found in [18–21].

3. PhishNot

The proposed system was based on the following design goals:

• High accuracy: This was achieved by experimenting with several types of classifiers and choosing the one with the highest accuracy. In addition, feature selection contributed to maintaining high accuracy by removing redundant or minimally relevant features that can negatively impact ML's efficiency and prediction accuracy.


Fig. 2. PhishNot system overview.

• Implementability: The proposed system relies on a few features to simplify the feature extraction in a real-life deployment. In addition, the features selected are easy to acquire and extract. The feature selection process focused on feature importance, in that the features extracted at the deployment stage can feed directly into the system without preprocessing.
• Efficiency: The proposed system needs to achieve high efficiency by using a small number of features. This efficiency translates to lower data acquisition time and lower prediction time. In addition, the actual processing takes place on the cloud, and the system performance should be efficient even on low-processing-power devices.
• Ease of Access: Being cloud-based, the detection system deploys as a web service with an easy-to-access Application Programming Interface (API). Thus, the system can be embedded as a plug-in in email clients, web browsers, or any other service where URLs are shared.

Fig. 2 shows an overview of the proposed system. The proposed system is split into two stages, a pre-deployment stage and a deployment stage. At the pre-deployment stage, the dataset undergoes a preprocessing stage that prepares data, addresses missing data, and establishes data balancing. The preprocessing output feeds into the feature reduction and selection process, where a relatively small number of features is selected. The resulting feature-reduced dataset is then used for training and testing of a pipeline of ML classifiers. After the initial testing phase, the best-performing model is further tested to ensure that it generalizes well beyond the training dataset. After systematic testing, the model is stored for later deployment into a cloud environment.

At the deployment stage, the client captures the URL and sends it through an API call to the cloud-based server. As the URL might contain one or more symbols, the URL is encoded using UTF-8 encoding and then sent in a simple API call with the URL as an HTTP parameter. The server receives the URL, extracts the features, and requests the external features from their sources. Once the extracted features are gathered, they are sent as input to the trained ML classifier. The classifier then produces a prediction of ''benign'' or ''phishing''. The server returns this prediction to the client, where the client can accordingly decide to block or allow the URL access.

4. The dataset

The dataset chosen to train and test our models was the one introduced in [22]. The dataset was built by collecting data about well-known phishing URLs from PhishTank [23] and benign URLs from Alexa [24].

The original dataset included 111 features extracted for 88,646 URLs. Within these URL instances, 58,000 were labeled ''benign'', and 30,646 were labeled ''phishing''. Of the 111 features collected, 96 were extracted from the URL itself, while 15 others were collected from external sources such as Google search indexing.

Features were extracted from the URL after dissecting it into four parts, as shown in Fig. 3. These features counted the occurrences of specific special characters, such as the number of ''.'' within the domain name or the number of ''/'' within the directory part. In addition, some of these features were included for the whole URL. The dataset also included several external features from external sources, such as domain age and whether the domain has an indexed Alexa website ranking or not. Details of all extracted features can be found in [22].


Fig. 3. Anatomy of a URL.

5. Experiments and results

The experiment included three phases: a preprocessing phase, model training, and cloud deployment. Algorithm 1 shows the main steps of the conducted experiments.

Algorithm 1: High-level Experimental Steps
  Input: Raw Dataset (88,646 instances, 111 features)
  Output: Preprocessed Dataset (57,024 instances, 14 features), Trained Model, Experiment Results, Cloud Deployment
  load RawDataset
  PreprocessedDataset = Preprocessing(RawDataset)
  create MLModelPipeline
  train MLModelPipeline with PreprocessedDataset
  test MLModelPipeline → InitialResults
  ReducedDataset ← FeatureSelection(PreprocessedDataset)
  test MLModelPipeline → ExperimentResults
  select BestMLModel
  store BestMLModel
  deploy BestMLModel

5.1. Implementation environment

The hardware and software specifications of the implementation environment used in the preprocessing, training, testing, and storage of the trained model can be described in the following points:

• Desktop computer with AMD Ryzen 5 3600 CPU, 4.2 GHz, 16 GB RAM, NVidia GeForce GT710 GPU
• Python v3.8.5 [25]
• TensorFlow v2.3.0 [26]
• Keras v2.4.3 [27]
• Sci-Kit Learn v1.0.1 [28]

5.2. Performance measures

The standard four performance measures of a binary ML-based classifier are:

1. True Positive (TP): the number of test instances whose true value and predicted value are both 1.
2. True Negative (TN): the number of test instances whose true value and predicted value are both 0.
3. False Positive (FP): the number of test instances whose true value is 0 and predicted value is 1.
4. False Negative (FN): the number of test instances whose true value is 1 and predicted value is 0.

These four measures, when combined, generate the confusion matrix. Our work uses the main four performance parameters:

1. Accuracy: This measures the ratio of correct predictions using the equation:

   Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)

2. Precision: This measures the accuracy of positive predictions using the equation:

   Precision = TP / (TP + FP)   (2)

3. Recall: This measures the ratio of positive instances that are correctly detected by the classifier using the equation:

   Recall = TP / (TP + FN)   (3)

4. F1 Score: This measures the harmonic mean of precision and recall using the equation:

   F1 Score = 2 * (Recall * Precision) / (Recall + Precision)   (4)

5.3. Dataset preprocessing

This stage is used to address the following aspects:

• Handling 'Invalid' features: Some features are deemed invalid, such as the feature qty_questionmark_domain that indicates the number of question marks in a domain name. According to the DNS RFC [29], domain names cannot contain a symbol such as an underscore, comma, semicolon, or question mark. We found 16 invalid features in the dataset.
• Handling 'Redundant' features: Some features can be easily mathematically deduced from other features, such as the qty_hyphen_url feature that indicates the number of hyphen (-) symbols. Although this is a valid feature, four other features count the number of hyphens in the domain, folder, file, and parameter parts. Thus, this feature is redundant. Based on our domain knowledge and close examination of the features, 19 redundant features were found and removed.
• Handling instances with 'Missing values': Some instances were missing some values.

Based on the observations mentioned earlier, we created the preprocessing phase as shown in algorithm 2.

Algorithm 2: Data Preprocessing
  Input: Raw Dataset with 111 features
  Output: Balanced Dataset with no missing data and 77 features
  Array ← RawDataset
  Remove invalid features from Array
  Remove redundant features from Array
  for instance ∈ Array do
    if feature ∈ instance is empty then
      Remove instance
    end
  end

The first two steps of preprocessing, as shown in algorithm 2, identified and removed invalid and redundant features. This process results in the reduction of the number of features to 77. Next, the last step removed all instances with missing values from the dataset. This process reduced the number of instances from 88,646 to 57,024. Table 1 shows the number of features and instances before and after preprocessing.

As shown in Table 1, the number of ''benign'' instances after preprocessing was 38,962, while the number of ''phishing'' instances was 18,062. Thus, the minority class is 31.64% of the dataset, which does not cause a severe imbalance problem.
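For concreteness, Eqs. (1)–(4) can be computed directly from the four confusion-matrix counts. The numbers below are an invented toy example, not results from this paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Eqs. (1)-(4) from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (1)
    precision = tp / (tp + fp)                            # Eq. (2)
    recall = tp / (tp + fn)                               # Eq. (3)
    f1 = 2 * (recall * precision) / (recall + precision)  # Eq. (4)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=90, tn=80, fp=10, fn=20)
# accuracy = 0.85, precision = 0.90, recall ~= 0.818
```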

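The three steps of Algorithm 2 map directly onto pandas operations. The tiny frame below is invented purely to show the shape of the transformation: 'qty_questionmark_domain' stands in for an invalid feature and 'qty_hyphen_url' for a redundant one; this is a sketch, not the authors' preprocessing code.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, invalid: list, redundant: list) -> pd.DataFrame:
    """Sketch of Algorithm 2: drop invalid/redundant feature columns,
    then remove every instance (row) that has a missing value."""
    df = df.drop(columns=invalid + redundant)
    return df.dropna()

raw = pd.DataFrame({
    "qty_questionmark_domain": [0, 0, 0],   # invalid: '?' cannot occur in a domain
    "qty_hyphen_url": [1, 0, 2],            # redundant: sum of per-part hyphen counts
    "domain_length": [17, np.nan, 12],      # one instance has a missing value
    "phishing": [0, 1, 1],
})
clean = preprocess(raw, invalid=["qty_questionmark_domain"],
                   redundant=["qty_hyphen_url"])
# clean keeps 2 of the 3 instances and 2 of the 4 columns
```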

Table 1 Algorithm 3: Recursive Feature-Elimination Using Feature Impor-


Number of features and instances before and after preprocessing.
tance
Specification Raw dataset After preprocessing
Input: Dataset with 77 features
Number of instances 88,646 57,024
Number of benign instances 58,000 38,962
Output: Dataset with 14 features
Number of phishing instances 30,646 18,062 𝐴𝑟𝑟𝑎𝑦 ← 𝐷𝑎𝑡𝑎𝑠𝑒𝑡
Number of features 111 77
𝑚𝑜𝑑𝑒𝑙 = 𝑅𝑎𝑛𝑑𝑜𝑚𝐹 𝑜𝑟𝑒𝑠𝑡𝐶𝑙𝑎𝑠𝑠𝑖𝑓 𝑖𝑒𝑟
𝑇 𝑎𝑟𝑔𝑒𝑡𝐹 𝑒𝑎𝑡𝑢𝑟𝑒𝑠 = 14
Table 2 while 𝐹 𝑒𝑎𝑡𝑢𝑟𝑒𝑠(𝐷𝑎𝑡𝑎𝑠𝑒𝑡) > 𝑇 𝑎𝑟𝑔𝑒𝑡𝐹 𝑒𝑎𝑡𝑢𝑟𝑒𝑠 do
Initial results with 77 features. train 𝑚𝑜𝑑𝑒𝑙 with 𝐴𝑟𝑟𝑎𝑦
Model Accuracy Precision Recall 𝐹1 Score 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 = 𝐹 𝑒𝑎𝑡𝑢𝑟𝑒𝐼𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒(𝑚𝑜𝑑𝑒𝑙)
RF 0.9715 0.9692 0.9683 0.9687 𝑖 = index of feature with lowest importance
LR 0.9210 0.9248 0.9005 0.9108 𝐴𝑟𝑟𝑎𝑦.𝐷𝑒𝑙𝑒𝑡𝑒𝐹 𝑒𝑎𝑡𝑢𝑟𝑒(𝑖)
DT 0.9508 0.9458 0.9461 0.9460
GNB 0.7979 0.8407 0.7220 0.7408
end
MLP 0.9122 0.9049 0.9001 0.9024 Store 𝐴𝑟𝑟𝑎𝑦 → 𝐷𝑎𝑡𝑎𝑠𝑒𝑡

5.4. Initial model training

After preprocessing, a pipeline of five machine-learning classifiers


was created. These five classifiers were:

• Random Forest (RF)


• Logistic Regression (LR)
• Decision Tree (DT)
• Gaussian Naive Bayes (GNB)
• Multi-Layer Perceptron (MLP)

The data was split into two parts; 75% used for training and 25%
used for testing. This random split considers the balance of the two
classes when creating the two subsets. For data splitting, we used the
SciKit learn built-in function named train_test_split with shuffling to
perform the task [28]. After the initial training and testing phase, the Fig. 4. Impact of feature reduction on 𝐹1 score.
best-performing model was selected for further testing to assure that
it generalizes well beyond the training subset of the dataset and that
it does not suffer from overfitting. The selected model would then be
deployments and not just the dimensionality of the data input to the
stored and deployed on a cloud instance for performance testing.
system. It enables more efficient data acquisition, training, testing, and
Table 2 shows the performance measures of the initial testing of the
lightweight real-life deployment.
five trained classifiers.
Several other papers, as explained in Section 2, relied on feature im-
As shown in Table 2, the RF classifier outperforms all the other portance to select the features with the highest importance. However,
classifiers in terms of accuracy and 𝐹1 Score. The second in terms of performance was the DT classifier.

5.5. Feature selection

This research aims to create a high-accuracy phishing URL detector based on a sub-dataset from the original dataset, utilizing a smaller number of features while maintaining high accuracy. The selected features for the deployed model can be easily extracted during data acquisition. This direction rules out some statistical dimensionality reduction algorithms, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA), because these methods do not reduce the number of acquired features and would add preprocessing steps that could hinder the performance of real-life deployments.

The method we used to select the features in this research was RFE based on feature reduction using feature importance. The steps followed are shown in algorithm 3.

As shown in algorithm 3, a Random Forest classifier model was created and trained with 75% of the dataset instances and tested with the remaining 25%. After testing, feature importance was calculated for each feature. The feature with the lowest feature importance was then removed, another cycle of training and testing followed, and this step was repeated until reaching the minimum number of features that still produces high prediction accuracy. This reduction method lowers the number of features that need to be used in prediction in live deployments.

Our research followed an alternative method. Instead of choosing the features with the highest feature importance, our proposed method relies on repetitive elimination of the feature with the lowest importance and re-training the model. This successive elimination considers that the importance of one feature might be impacted by the existence (and therefore the elimination of) another feature. Hence, we re-train and re-calculate the importance after the elimination of each feature.

Our feature reduction phase was terminated at 14 features because further reduction would cause noticeable performance degradation, as validated by our experimentation results. Fig. 4 shows the change in average 𝐹1 Score with feature reduction, where the score drops rapidly below 14 features.

The 14 features selected at the end of the feature selection process are summarized in Table 3. Table 4 shows the performance measures of the classifiers when trained and tested with the feature-reduced dataset. The RF classifier still outperforms the other classifiers. Notably, GNB performance improved significantly, while RF, LR, DT, and MLP had marginal improvements.

Fig. 5 shows the confusion matrix plot of the RF classifier with 14 features. As shown in the figure, the FP rate is 2.18%, while the FN rate is 3.22%.

5.6. Validation with 10-fold cross-validation

To assure that the high accuracy measured earlier was not due to overfitting, we used 10-fold cross-validation on the reduced-feature
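The elimination loop of algorithm 3 can be sketched generically in Python. The `fit_and_score` callback and the toy importance values below are illustrative assumptions; in the paper, each round trains and tests a Random Forest on a 75/25 split and reads its per-feature importances.

```python
# Sketch of the successive feature-elimination loop (algorithm 3), assuming a
# caller-supplied fit_and_score(features) -> (score, importances) callback.

def eliminate_features(features, fit_and_score, min_features=14):
    """Drop the least-important feature and retrain until min_features remain.

    Importances are re-computed every round, because removing one feature
    can change the importance of the others.
    """
    current = list(features)
    while len(current) > min_features:
        _score, importances = fit_and_score(current)
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
    return current

# Toy demonstration with fixed, hypothetical importance values.
toy = {"a": 0.50, "b": 0.30, "c": 0.10, "d": 0.06, "e": 0.04}
selected = eliminate_features(toy, lambda feats: (0.97, toy), min_features=3)
print(selected)  # ['a', 'b', 'c']
```

In the paper's pipeline, the callback would wrap a scikit-learn [28] `RandomForestClassifier` fit and its `feature_importances_` attribute.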

M.M. Alani and H. Tawfik Computer Networks 218 (2022) 109407

Table 3
Features resulting from the feature-reduction phase.

Feature name              Feature description
qty_dot_domain            Number of "." in the domain
qty_vowels_domain         Number of English language vowels in the domain name
domain_length             Number of characters in the domain name
qty_dot_directory         Number of "." in the directory part of the URL
qty_slash_directory       Number of "/" in the directory part of the URL
directory_length          Number of characters in the directory part of the URL
qty_dot_file              Number of "." in the filename part of the URL
file_length               Number of characters in the filename part of the URL
params_length             Number of characters in the URL's parameters
time_response             Domain lookup time response (in seconds)
asn_ip                    Autonomous System Number to which the domain's IP address belongs
time_domain_activation    Number of days since domain activation
time_domain_expiration    Number of days remaining until domain expiry
ttl_hostname              Time-to-Live (TTL) when contacting the domain (in milliseconds)

Algorithm 4: 10-Fold Cross-Validation

Input: Dataset with 14 features
Output: testing results for 10 runs; results

Array ← Dataset
Split Array into 10 folds randomly to produce fold(1 to 10)
model = RandomForestClassifier
for i = 1 to 10 do
    TestingData = fold(i)
    TrainingData = Null
    for j = 1 to 10 do
        if j ≠ i then
            TrainingData = TrainingData.Append(fold(j))
        end
    end
    train model(TrainingData)
    test model(TestingData) to produce results(i)
end
print results
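A minimal sketch of the fold bookkeeping in Algorithm 4, using only the standard library; the per-fold Random Forest training and testing is indicated in comments rather than implemented here.

```python
import random

def ten_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs: each fold is tested exactly once
    while the remaining nine folds form the training data."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)         # random assignment to folds
    folds = [idx[i::10] for i in range(10)]  # ten roughly equal folds
    for i in range(10):
        train_idx = [s for k, fold in enumerate(folds) if k != i for s in fold]
        yield train_idx, folds[i]

splits = list(ten_fold_splits(100))
# Every sample is used for testing exactly once across the ten cycles:
tested = sorted(s for _, test in splits for s in test)
print(len(splits), tested == list(range(100)))  # 10 True
# In the paper, each cycle would train a RandomForestClassifier on train_idx
# and record its performance measures on test_idx.
```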
Table 4
PhishNot testing results with 14 features.

Model  Accuracy  Precision  Recall  F1 Score
RF     0.9748    0.9692     0.9730  0.9711
LR     0.9300    0.9255     0.9110  0.9177
DT     0.9600    0.9533     0.9545  0.9539
GNB    0.8757    0.8808     0.8261  0.8460
MLP    0.9387    0.9248     0.9361  0.9301

Table 5
RF classifier 10-fold cross-validation results.

Fold  Accuracy  Precision  Recall    F1 Score
1     0.975101  0.955100   0.965807  0.960424
2     0.974399  0.953577   0.966242  0.959868
3     0.978082  0.960677   0.970751  0.965688
4     0.976854  0.956660   0.973118  0.964819
5     0.975447  0.958243   0.965574  0.961894
6     0.973518  0.954595   0.962493  0.958528
7     0.974570  0.957119   0.962942  0.960022
8     0.980533  0.963227   0.975542  0.969345
9     0.971764  0.952302   0.958147  0.955216
10    0.976149  0.953553   0.969835  0.961625
Mean  0.975642  0.956505   0.967045  0.961743
Sdev  0.002328  0.003265   0.005017  0.003795
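The Mean and Sdev rows of Table 5 can be reproduced from the accuracy column with Python's statistics module; note that the reported Sdev corresponds to the population form of the standard deviation (pstdev).

```python
from statistics import mean, pstdev

# Accuracy column of Table 5 (folds 1-10).
acc = [0.975101, 0.974399, 0.978082, 0.976854, 0.975447,
       0.973518, 0.974570, 0.980533, 0.971764, 0.976149]

print(round(mean(acc), 6))    # 0.975642 (Mean row)
print(round(pstdev(acc), 6))  # 0.002328 (Sdev row, population standard deviation)
```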

data, shown in algorithm 4. In a 10-fold cross-validation process, the data is split into ten subsets randomly. Then, the data go through ten cycles of training and testing. In each cycle, one subset of the ten is excluded from the training process and used for the testing process. This procedure repeats ten times until all ten subsets have been used for testing exactly once. Each cycle produces a classifier with specific performance parameters. If these parameters have high variance, the classifier suffers from over-fitting and is not generalizing properly within the dataset. If the variance is low, then the mean values of the performance parameters are reliable results.

Table 5 shows the 10-fold cross-validation results for the RF classifier. As the table details, there is minimal variance in the values of the performance measures for the 10 folds. In addition, the standard deviation was as little as 0.002 in accuracy while maintaining a mean accuracy of 0.9756. Thus, the classifier generalizes well beyond its training dataset.

Fig. 5. Confusion matrix plot for the RF classifier tested on the original dataset.

5.7. Validation using a second dataset

To further validate the generalization of our proposed model, we conducted additional training and testing using the dataset from [10]. This dataset contains 73,575 URLs (36,400 benign and 37,175 phishing).

We extracted the 14 features selected in our research and examined the dataset to ensure that it does not require any preprocessing before training and testing. The dataset was randomly split into a 75% training subset and a 25% testing subset, similarly to our original dataset.

Fig. 6 shows the confusion matrix plot for the testing results with the second dataset. Our trained model achieved an accuracy of 0.9976, precision of 0.9963, recall of 0.9974, and 𝐹1 Score of 0.9968. As the performance metrics and the figure illustrate, the overall performance was better than on our original dataset and better than the original results from [10], where the accuracy was 0.98 with an FP rate of 3% and an FN rate of 1%. These figures indicate that the selected features and the trained model can generalize well beyond our original dataset.

5.8. Cloud-based deployment

After the classifier testing and cross-validation, the model was saved using a Python library called joblib [30]. The next step created the deployment web service that would use the saved model to produce


predictions. Fig. 7 shows an overview of the server-side operation after deployment.

As shown in Fig. 7, the process starts with the client sending an HTTP request to the running web service. The format of the request is http://serveraddress/?url=URL-to-inspect, where URL-to-inspect is the URL that the system must inspect to flag as "phishing" or "benign". All URLs passed to the server must be encoded using URI encoding [31]. Once the HTTP request arrives at the web service, the URL is extracted from the request and parsed. During the parsing process, the URL is split into the domain name, folder, file name, and additional parameters. After removing the protocol name (HTTP:// or HTTPS://), the first occurrence of a "/" marks the end of the domain name, while the last occurrence of a "/" marks the start of the file name (if any), and all the text in between (if any) is considered the folder name, including any "/". Then, the parsed URL is passed to two feature extraction units. The first unit extracts internal features (i.e., features to be extracted from the URL itself, such as qty_dot_domain). The second unit contacts other services to collect the required data, including the following features:

1. time_response: This feature is obtained by measuring the domain lookup time.
2. asn_ip: This feature is obtained by querying a whois database using a package named cymruwhois [32], which queries whois servers for information about the domain name, including the ASN.
3. time_domain_activation: This value represents the number of days since the domain was first registered. In our work, this information is obtained using a library called pythonwhois [33].
4. time_domain_expiration: This value represents the number of days left until the domain registration expires. It is also obtained using the pythonwhois library.
5. ttl_hostname: The TTL measure is obtained by sending a PING request to the domain name and reading the TTL from the response message.

Once all the features are obtained, the data is fed, in order, into the trained RF classifier. The trained ML classifier then provides its prediction in the output form of either "phishing" or "benign". The client then fetches this response to be used to block phishing URLs.

If one of the external parameters could not be obtained, such as a website not responding to PING, or a missing domain age, the missing feature would be assigned a value of −1.

For testing purposes, the classifier was deployed on an Amazon Web Services (AWS) instance in the cloud. Figs. 8 and 9 show a sample cloud server response to a phishing URL and a benign URL, respectively.

As shown in Figs. 8 and 9, the server adopts a simple RESTful call method where the URL to be inspected is passed as a parameter to the server using the HTTP protocol. Then, the response is sent as the simple text "phishing" or "benign". This response can be passed to an email client plugin, a web browser plugin, or even a mobile text message plugin to prevent the user from clicking on phishing URLs. The average request processing time on the cloud was 11.5 μs per URL.

Fig. 6. Confusion matrix plot for the second dataset testing.

6. Discussion

The majority of the previous research discussed in Section 2 only focused on accuracy. Our research focuses on proposing a phishing detection system capable of achieving high accuracy while maintaining practical applicability and delivering high efficiency and ease of access. Achieving these goals delivers a system highly suitable for practical implementation, with the elasticity and availability advantages of cloud deployment.

This section contains three subsections: impact of feature engineering, real-world implementation considerations, and a comparative analysis of results with previous research.

6.1. Impact of feature engineering

Creating an implementable model was one of the main aims of this research. This goal requires that data acquisition be easy and realizable, and that features can be extracted with minimal preprocessing in real-life deployment. This requirement is one of the main reasons behind choosing successive feature elimination based on feature importance. In addition to reducing the number of features to improve the efficiency, accuracy, and generalizability of the ML classification algorithms, the features require no preprocessing before being fed into the classifier, which is another important practical consideration.

Fig. 10 shows the change in performance measures of the RF classifier with the full dataset of 77 features and the reduced version of 14 features. The figure clearly shows the improvement in the ML classification performance when the features were reduced. This improvement was expected when reducing the number of irrelevant features that might otherwise limit the classifier's training and generalization performance. In particular, and crucially, the "Recall" performance reflects the ML's ability to correctly identify the cases of "Phishing". This ability is significantly enhanced after feature reduction and is reflected by improvement in the overall F1 and accuracy scores.

6.2. Real-world implementation considerations

The proposed system reduced the number of required features from 111 to 14 in a manner that not only boosted accuracy, but also boosted implementability. The 14 features selected in the proposed system are easy to acquire and extract, as explained in Section 5. Nine of the selected features can be extracted from the URL itself without the need for external sources. The remaining five features are easily obtained by accessing publicly available Whois databases or by a simple ping command. The successful deployment of the proposed system on AWS cloud infrastructure demonstrates its practical implementability.

Fig. 11 shows the feature importance of the 14 selected features. It is clear from Fig. 11 that four features contribute to the classification process significantly more than the other ten. These features are directory_length, time_domain_activation, qty_slash_directory, and file_length. The feature time_domain_activation, which is the number of days since the domain was registered, is a strong indicator of phishing domains, as most domains used in phishing are registered recently and do not stay registered for a long time. In most cases, these domain


Fig. 7. Overview of the server-side operation.

Fig. 8. A sample phishing URL sent to the cloud server.

Fig. 9. A sample benign URL sent to the cloud server.

Fig. 10. RF classifier performance with 77 and 14 features.

Fig. 11. Feature importance of the 14 selected features.

names get blocked, taken over by law enforcement, or blacklisted, so attackers regularly register new domains to use in phishing. The remaining three features usually have high values in phishing URLs. This includes the number of "/" in the directory part of the URL (which indicates the use of many sub-directories), the length of the file name, and the length of the directory part.

Deploying the proposed system on the cloud gives it the advantages of high scalability, redundancy, and high availability. Cloud infrastructure is highly scalable, and resources are dynamically created whenever needed. These features enable the proposed system to


Table 6
Comparison of performance measures of PhishNot with other research.

Reference  Dataset  Language  Features  Instances  Balanced  Classifier  Accuracy  FP      FN
[10]       [10]     English   27        73,575     ✓         RF          0.980     0.030   0.010
[11]       [11]     English   48        10,000     ✓         RF          0.9617    –       –
                              10                             RF          0.9460    –       –
[12]       [12]     English   12        2544       ✓         RF          0.9737    0.0315  0.0197
                                                             LR          0.9842    0.0161  0.0152
[15]       [15]     English   37        40,000     ×         CNN         0.9794    –       –
                                                             MLP         0.9321    –       –
PhishNot   [22]     English   14        57,024     ✓         RF          0.9748    0.0322  0.0218
           [10]     English   14        73,575     ✓                     0.9976    0.002   0.003

provide the service to a large user base. In terms of redundancy, the cloud infrastructure eliminates the single point of failure in a single-server architecture. The cloud also provides the basis for easier future updates as new datasets become available: a newer trained ML classification model can be deployed centrally into the cloud infrastructure, eliminating the need for distributed updates.

6.3. Comparative analysis of results

As mentioned earlier in Section 2, phishing detection using machine learning has been the subject of many research articles. However, the system we proposed has contributions on four fronts:

• Implementability: The number and type of the selected features make them easy to acquire and extract in real-life deployments.
• Accuracy: Our proposed system has improved accuracy, benefiting from systematic preprocessing and feature selection.
• Efficiency: The proposed system is considered highly efficient due to the low number of features selected and the cloud deployment.
• Scalability: As the proposed system is deployed and tested on the cloud, the scalability of the cloud is extended to the system and makes it very well placed to serve a huge number of users.

Table 6 shows a performance comparison with other state-of-the-art research.

In terms of the ML classification performance, as shown in Table 6, the proposed system outperforms the other systems in combining high accuracy with a low number of features. Other research that achieved better accuracy had a noticeably higher number of features, making it difficult to implement, and hence less practical.

When compared to [10], using the same dataset, our proposed system achieved even higher accuracy and lower FP and FN measures. These figures indicate that the chosen 14 features are effective and widely applicable.

There were challenges in utilizing datasets from other related works to test our proposed system, for two reasons. Some previous works had created their own datasets that were not publicly available, while other related studies made their datasets available but used different data representations and extracted different features. Hence, it was not possible to find the raw data used, and we could not extract the features that fit our proposed system to create an ideal comparison.

Compared to [12], we found that PhishNot achieved higher accuracy with fewer features. Upon closer examination, that research relies on features from the web page hosted at the URL rather than examining the URL itself. Therefore, we argue that this approach carries a high risk, as phishing pages can contain malware that is automatically downloaded upon access. In addition, many malicious actors use phishing URLs to deliver drive-by attacks. Hence, our proposed system is more secure because it does not download the contents of the web page for inspection. It strikes a very good overall balance between accuracy and the number of features used.

On the downside, our proposed system relies on extracting five of the selected 14 features from external sources. This process means that there is a possibility, albeit low, that these services might not be available or might cause delay. However, this possibility is deemed an acceptable risk given the valuable contribution of these features to the decision process.

7. Conclusions and future work

In this paper, we presented PhishNot, a cloud-based, machine-learning phishing URL detector. The detector relies on features extracted from the URL itself, in addition to external features, such as the domain age, to infer whether the received URL is phishing or benign. The feature set used in training the classifier was reduced from 111 to 14 using successive feature elimination based on feature importance. The proposed ML classifier was tested and recorded an accuracy of 0.9748, with an FP rate of 2.18% and an FN rate of 3.22%. The trained model was deployed on the AWS cloud as a web service and tested successfully. The combination of high-performing ML classification, a low number of features, and a cloud basis positions our system favorably in terms of accuracy of phishing detection, efficiency, scalability, and implementability.

Future directions in this research can focus on exploring the potential of our system as a web service in a fog or edge environment. Within a smart city context, the proposed work could be expanded to provide better efficiency using edge and fog computing concepts. The improvement in efficiency can be noticeable as these techniques rely on high-performing networks, such as 5G and beyond.

CRediT authorship contribution statement

Mohammed M. Alani: Conceptualization, Methodology, Software, Formal Analysis, Writing – Original Draft. Hissam Tawfik: Conceptualization, Methodology, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] M. Khonji, Y. Iraqi, A. Jones, Phishing detection: a literature survey, IEEE Commun. Surv. Tutor. 15 (4) (2013) 2091–2121.
[2] D.D. Caputo, S.L. Pfleeger, J.D. Freeman, M.E. Johnson, Going spear phishing: Exploring embedded training and awareness, IEEE Secur. Priv. 12 (1) (2013) 28–38.
[3] Time to report phishing email 2020 | Statista, 2021, [Online; accessed 24. Dec. 2021]. URL https://www.statista.com/statistics/1256790/time-to-report-phishing-email.

[4] Key findings from the 2022 Verizon data breach investigations report (DBIR) underscore the role of the human element in data breaches | Proofpoint US, 2022, [Online; accessed 3. Jun. 2022]. URL https://www.proofpoint.com/us/blog/email-and-cloud-threats/key-findings-2022-verizon-data-breach-investigations-report-dbir.
[5] T. Yadav, A.M. Rao, Technical aspects of cyber kill chain, in: International Symposium on Security in Computing and Communication, Springer, 2015, pp. 438–452.
[6] G. Sonowal, K. Kuppusamy, PhiDMA – A phishing detection model with multi-filter approach, J. King Saud Univ. Comput. Inf. Sci. 32 (1) (2020) 99–112.
[7] R.S. Rao, A.R. Pais, Jail-Phish: An improved search engine based phishing detection system, Comput. Secur. 83 (2019) 246–267.
[8] T. Chin, K. Xiong, C. Hu, Phishlimiter: A phishing detection and mitigation approach using software-defined networking, IEEE Access 6 (2018) 42516–42531.
[9] B. Wei, R.A. Hamad, L. Yang, X. He, H. Wang, B. Gao, W.L. Woo, A deep-learning-driven light-weight phishing detection sensor, Sensors 19 (19) (2019) 4258.
[10] O.K. Sahingoz, E. Buber, O. Demir, B. Diri, Machine learning based phishing detection from URLs, Expert Syst. Appl. 117 (2019) 345–357.
[11] K.L. Chiew, C.L. Tan, K. Wong, K.S. Yong, W.K. Tiong, A new hybrid ensemble feature selection framework for machine learning-based phishing detection system, Inform. Sci. 484 (2019) 153–166.
[12] A.K. Jain, B.B. Gupta, A machine learning based approach for phishing detection using hyperlinks information, J. Ambient Intell. Humaniz. Comput. 10 (5) (2019) 2015–2028.
[13] H. Abutair, A. Belghith, S. AlAhmadi, CBR-PDS: a case-based reasoning phishing detection system, J. Ambient Intell. Humaniz. Comput. 10 (7) (2019) 2593–2606.
[14] E. Zhu, Y. Ju, Z. Chen, F. Liu, X. Fang, DTOF-ANN: An artificial neural network phishing detection model based on decision tree and optimal features, Appl. Soft Comput. 95 (2020) 106505.
[15] Y. Mourtaji, M. Bouhorma, D. Alghazzawi, G. Aldabbagh, A. Alghamdi, Hybrid rule-based solution for phishing URL detection using convolutional neural network, Wirel. Commun. Mob. Comput. 2021 (2021).
[16] E. Gandotra, D. Gupta, An efficient approach for phishing detection using machine learning, in: Multimedia Security, Springer, 2021, pp. 239–253.
[17] R. Wazirali, R. Ahmad, A.A.-K. Abu-Ein, Sustaining accurate detection of phishing URLs using SDN and feature selection approaches, Comput. Netw. 201 (2021) 108591.
[18] A. El Aassal, S. Baki, A. Das, R.M. Verma, An in-depth benchmarking and evaluation of phishing detection research for security needs, IEEE Access 8 (2020) 22170–22192.
[19] Z. Dou, I. Khalil, A. Khreishah, A. Al-Fuqaha, M. Guizani, Systematization of knowledge (SoK): A systematic review of software-based web phishing detection, IEEE Commun. Surv. Tutor. 19 (4) (2017) 2797–2819.
[20] A. Khormali, J. Park, H. Alasmary, A. Anwar, M. Saad, D. Mohaisen, Domain name system security and privacy: A contemporary survey, Comput. Netw. 185 (2021) 107699.
[21] A.A. Zuraiq, M. Alkasassbeh, Phishing detection approaches, in: 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), IEEE, 2019, pp. 1–6.
[22] G. Vrbančič, I. Fister Jr., V. Podgorelec, Datasets for phishing websites detection, Data Brief 33 (2020) 106438.
[23] PhishTank | join the fight against phishing, 2021, [Online; accessed 24. May 2021]. URL http://phishtank.org.
[24] citizenlab, Test-lists, 2021, [Online; accessed 24. May 2021]. URL https://github.com/citizenlab/test-lists.
[25] Welcome to python.org, 2021, [Online; accessed 29. Apr. 2021]. URL https://www.python.org.
[26] TensorFlow, 2021, [Online; accessed 29. Apr. 2021]. URL https://www.tensorflow.org.
[27] K. Team, Keras: the python deep learning API, 2021, [Online; accessed 29. Apr. 2021]. URL https://keras.io.
[28] Scikit-learn: machine learning in python – scikit-learn 1.0.1 documentation, 2022, [Online; accessed 2. Jan. 2022]. URL https://scikit-learn.org/stable.
[29] RFC 1035, 2021, [Online; accessed 23. Dec. 2021]. URL https://datatracker.ietf.org/doc/html/rfc1035.
[30] Joblib: running python functions as pipeline jobs – joblib 1.2.0.dev0 documentation, 2021, [Online; accessed 25. Dec. 2021]. URL https://joblib.readthedocs.io/en/latest.
[31] HTML URL encoding reference, 2022, [Online; accessed 21. May 2022]. URL https://www.w3schools.com/tags/ref_urlencode.asp.
[32] Welcome to cymruwhois's documentation! – cymruwhois v1.0 documentation, 2021, [Online; accessed 25. Dec. 2021]. URL https://pythonhosted.org/cymruwhois.
[33] Pythonwhois, 2021, [Online; accessed 25. Dec. 2021]. URL https://pypi.org/project/pythonwhois.

Mohammed M. Alani holds a Ph.D. in Computer Engineering with specialization in network security. He has worked as a professor and a cybersecurity expert in many countries around the world. His experience includes serving as VP of Academic Affairs in the United Arab Emirates, network and security consultancies in the Middle East, and Cybersecurity Program Manager in Toronto, Canada. He currently works as a Cybersecurity Professor at Seneca College, Toronto, Canada.

He has authored four books and edited one in different areas of networking and cybersecurity, along with many research papers published in highly ranked journals and conferences. He currently serves on the editorial board of the Journal of Reliable and Intelligent Environments and has guest edited many special issues in highly ranked journals.

He also holds many industrial certifications, such as Security+, Cybersecurity Analyst+ (CySA+), PenTest+, CompTIA Advanced Security Practitioner+ (CASP+), Server+, Cisco Certified Network Associate, CCAI, Microsoft Azure Data Science Associate, and Microsoft AI Fundamentals. His current interests include applications of ML in cybersecurity and ML security.

Hissam Tawfik is a Professor of Artificial Intelligence. He holds a Ph.D. in Computer Engineering from the University of Manchester (then UMIST) and has a research track record of more than 150 refereed publications in reputable international journals and conference proceedings in Applied Artificial Intelligence, Biologically Inspired Systems, and Data Science.

Hissam is currently working as a professor of Artificial Intelligence at the College of Engineering in Sharjah University, Sharjah, UAE. He also leads the Computer Science and Artificial Intelligence research at the School of Built Environment, Engineering and Computing. Before joining Leeds Beckett University in 2015, Hissam worked for the University of Salford and Liverpool Hope University.

Hissam is an editor of the International Journal of Future Generation Computer Systems (Elsevier), the International Journal of Neural Computing and Applications (Springer), and the International Journal of Reliable Intelligent Environments (Springer). He is a visiting Professor at the University of Seville (Spain) and is a chair of the International Conference Series on Developments in eSystems Engineering (DESE).

Hissam is currently guest editing the journal special issue on "Expert Decision Making for Data Analytics with Applications" in the International Journal of Applied Soft Computing (Elsevier), and the journal special issue on "Robotics, Sensors and Industry 4.0"; Journal of

