Abedin 2020
Department of Computer Science and Engineering, East Delta University, Chittagong, Bangladesh, faisal.a@[Link]
Department of Computer Science and Engineering, East Delta University, Chittagong, Bangladesh, 161000112@[Link]
Department of Computer Science and Engineering, East Delta University, Chittagong, Bangladesh, 161000412@[Link]
Abstract- Phishing attacks are the most common form of attack that can happen over the internet. In this method, attackers attempt to collect a user's data without his/her consent through emails, URLs, or any other link that leads to a deceptive page, where the user is persuaded to take specific actions that lead to the successful completion of an attack. These attacks can allow an attacker to collect vital information about the user, which often lets the attacker impersonate the victim and do things that only the victim should be able to do, such as carrying out transactions, messaging someone else, or simply accessing the victim's data. Many studies have discussed possible approaches to prevent such attacks. This research work uses three machine learning algorithms to predict the phishing status of any website. In the experiments, these models are trained on URL-based features, and the proposed software attempts to prevent zero-day attacks by differentiating legitimate websites from phishing websites through analysis of the website's URL. From our observations, the random forest classifier achieved a precision of 97%, a recall of 99%, and an F1 score of 97%. The proposed model is fast and efficient because it works only on the URL and does not use other resources for analysis, as was the case in past studies.
Keywords- Phishing Attack; Phishing Attack Detection; Artificial Intelligence; Machine Learning; Decision Tree
I. INTRODUCTION
Today, every individual is connected to others through the internet. The connections are established using different hardware and software, and over time everything is becoming connected. Today, 16% of the world's population uses the internet. Despite the benefits the internet provides, there are dire consequences to using it without proper knowledge of cyber security [1, 25, 32]. Cyber attackers lurk on the internet, deceive users into trusting their fake websites, and lead them through actions that allow information to be leaked to the attackers. The solution is, of course, not to avoid the internet but to gain knowledge about these attacks and take care not to fall victim to them [2, 3, 30].
Cyber attacks are improving along with the technological improvements around us [4, 5]. Attackers can now create fakes of actual websites that are increasingly difficult to distinguish from the originals [6]. People are deceived by these fake pages quite quickly, and they are not exactly to blame if their knowledge of cyber security is limited [7-9]. Expecting users to tell these sites apart from visual cues alone would be unfair, after all. Yet this innocent gap in one's knowledge can potentially lead him/her to suffer social or economic damage someday [10-12].
Considering the magnitude of these consequences as a challenge, this research work aims to build a solution that classifies phishing and legitimate websites concretely and saves users from being exploited [13-15]. Cases of phishing are now common in almost every sector, including online banking, e-commerce, HR and finance, and social networking [16, 17]. While many current methods, such as blacklist- and whitelist-based techniques, can help against these attacks, they are not capable of detecting zero-day attacks [18-21, 31].
II. BACKGROUND AND RELATED WORK
Supervised machine learning approaches are well suited to this type of classification problem. To train these classifiers, the features of both phishing and legitimate websites need to be extracted and used machine learning
Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on June 16,2021 at [Link] UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]
IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3
Feature              Description
Favicon              The 16x16 pixel icon used for branding the website
NonStdPort           Non-standard ports, used for a purpose other than their default assignments
HTTPSDomainURL       Secure HTTP, used with the TLS/SSL protocol
RequestURL           Used by the client to request resources from the server
AnchorURL            Clickable content in text form used as a hyperlink
LinksInScriptTags    Used to link at the script tag to manipulate the image
ServerFormHandler    Used to process, on the server, the contents sent from the client
InfoEmail            Email used with the domain or business website
AbnormalURL          The reverse of a normal URL; unlikely to occur
WebsiteForwarding    Used to redirect multiple sources to a single web address
StatusBarCust        Used to show system information at the bottom of the screen
DisableRightClick    Used to prevent the web contents of the website from being saved
UsingPopupWindow     A menu that pops up on the screen and disappears immediately after a click
IframeRedirection    Used to inspect a website and redirect later on
AgeofDomain          The length of time the domain has existed
DNSRecording         Used to get information such as the IP address
WebsiteTraffic       Used to log the users who visited a website
PageRank             The web page ranking tool used by the Google search engine
GoogleIndex          An indexing tool for adding webpages to Google exploration
LinksPointingToPage  Used to rank the website
StatsReport          Used to get information about all transferred files
class                The label: 1 means phishing and 0 means legitimate
B. Data Preprocessing
Feature scaling is the process of normalizing or standardizing the independent variables of the training dataset to a fixed range, in order to handle the variance in values among different independent variables. Splitting the dataset into two portions, one for training and one for testing, is equally important: the model must be trained on a subset of the full dataset and tested on the rest so that its performance can be evaluated satisfactorily. We split the dataset in an 80:20 ratio, with 80% of the dataset used for training and 20% for testing, using a stratified sampling technique. We performed the train-test split with the Scikit-Learn library in the Python programming language.
... labelled data to acquire a function that predicts the outcome when new unlabelled data is given. In this research, the KNN algorithm uses the 80% labelled data to acquire a function that predicts whether a website is real or phishing. The second classifier is logistic regression, a statistical model that uses a logistic function to model a binary dependent variable. In our analysis, it uses the 80% labelled data to acquire a logistic function that predicts whether a website is legitimate or phishing. The third classifier in this research is the random forest, a supervised learning algorithm. It is an ensemble of decision trees, which together build the forest, usually trained with the "bagging" technique. The main idea of bagging is that a combination of learning models improves the overall result.
IV. RESULTS AND DISCUSSIONS
In our study, we used confusion matrices, ROC curves, precision, recall, and F1 score to evaluate the performance of the three machine learning classifiers.
Fig. 2 Classification report for KNN
Fig. 2, Fig. 3, and Fig. 4 show the performance of the KNN algorithm. Fig. 2 shows the precision, recall, and F1 score for the KNN algorithm. The precision is 91% for phishing websites and 86% for legitimate websites. The recall and F1 score are 94% and 93%, respectively, for phishing websites, and 79% and 82%, respectively, for legitimate websites.
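The pipeline described in this section (feature scaling, a stratified 80:20 split with Scikit-Learn, and the KNN, logistic regression, and random forest classifiers evaluated with precision, recall, and F1 score) can be sketched roughly as follows. This is only an illustrative sketch: the synthetic data from `make_classification` stands in for the URL-feature dataset, and the choice of `MinMaxScaler`, the hyperparameters, and the random seeds are assumptions rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: synthetic data stands in for the URL-feature dataset
# (label 1 = phishing, 0 = legitimate, as in the feature table).
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=10, random_state=0
)

# Stratified 80:20 split: class proportions are preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Hypothetical hyperparameters; the paper does not state the ones it used.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

reports = {}
for name, clf in models.items():
    # The scaler is fitted on the training portion only (inside the pipeline),
    # so no information from the test portion leaks into preprocessing.
    pipe = make_pipeline(MinMaxScaler(), clf)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    reports[name] = classification_report(
        y_test, y_pred,
        target_names=["legitimate", "phishing"],
        output_dict=True,
    )
    print(name)
    print(classification_report(
        y_test, y_pred, target_names=["legitimate", "phishing"]
    ))
```

Wrapping the scaler and the classifier in one pipeline is a design choice of this sketch; it keeps the scaling parameters tied to the training split, which matches the description of scaling "the training dataset".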
Fig. 3 shows the confusion matrix results for KNN. The left diagonal values are higher than the values of the right diagonal, which means our proposed system successfully detects the phishing website.
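To make the link between a confusion matrix and the reported metrics concrete, the following minimal sketch derives precision, recall, F1 score, and accuracy from the four cells of a binary confusion matrix. The counts are hypothetical and are not taken from Fig. 3; the cell order follows the Scikit-Learn convention for labels 0 (legitimate) and 1 (phishing), where the left diagonal holds the correct predictions.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute metrics for the positive (phishing) class.

    Cell layout, as returned by sklearn.metrics.confusion_matrix
    for labels [0, 1]:
        [[tn, fp],
         [fn, tp]]
    Assumes tp + fp > 0 and tp + fn > 0 (no zero denominators).
    """
    precision = tp / (tp + fp)           # of predicted phishing, how many are phishing
    recall = tp / (tp + fn)              # of actual phishing, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tn=80, fp=20, fn=10, tp=90)
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

When the left diagonal (tn and tp) dominates the right diagonal (fp and fn), all four metrics are high, which is the pattern the text describes.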
Fig. 8 Confusion Matrix for Random Forest
Fig. 8 shows the confusion matrix results for the random forest. The left diagonal values are higher than the values of the right diagonal, which means our proposed system successfully detects the phishing website.
References
[1] M. Humayun, M. Niazi, N. Z. Jhanjhi, M. Alshayeb, and S. Mahmood, "Cyber Security Threats and Vulnerabilities: A Systematic Mapping Study," (in English), Arabian Journal for Science and Engineering, Article vol. 45, no. 4, pp. 3171-3189, Apr 2020.
[2] E. D. Frauenstein and S. Flowerday, "Susceptibility to phishing on social network sites: A personality information processing model," (in English), Computers & Security, Article vol. 94, p. 18, Jul 2020, Art. no. Unsp 101862.
[3] A. Kulkarni and L. L. Brown, "Phishing Websites Detection using Machine Learning," (in English), International Journal of Advanced Computer Science and Applications, Article vol. 10, no. 7, pp. 8-13, Jul 2019.
[4] M. Botacin, F. Ceschin, P. de Geus, and A. Gregio, "We need to talk about antiviruses: challenges & pitfalls of AV evaluations," (in English), Computers & Security, Article vol. 95, p. 15, Aug 2020, Art. no. Unsp 101859.
[5] E. S. Gualberto, R. T. De Sousa, T. P. D. Vieira, J. Da Costa, and C. G. Duque, "From Feature Engineering and Topics Models to Enhanced Prediction Rates in Phishing Detection," (in English), IEEE Access, Article vol. 8, pp. 76368-76385, 2020.
[6] "General Practice and the Community: Research on health service, quality improvements and training. Selected abstracts from the EGPRN Meeting in Vigo, Spain, 17-20 October 2019 Abstracts," (in English), European Journal of General Practice, Article vol. 26, no. 1, pp. 42-50, Dec 2020.
[7] H. Alqahtani, I. H. Sarker, A. Kalim, S. M. Minhaz Hossain, S. Ikhlaq, and S. Hossain, "Cyber Intrusion Detection Using Machine Learning Classification Techniques," in Computing Science, Communication and Security, Singapore, 2020, pp. 121-131: Springer Singapore.
[8] J. A. Bland, M. D. Petty, T. S. Whitaker, K. P. Maxwell, and W. A. Cantrell, "Machine Learning Cyberattack and Defense Strategies," (in English), Computers & Security, Article vol. 92, p. 23, May 2020, Art. no. Unsp 101738.
[9] S. C. Sethuraman, V. Vijayakumar, and S. Walczak, "Cyber Attacks on Healthcare Devices Using Unmanned Aerial Vehicles," (in English), Journal of Medical Systems, Article vol. 44, no. 1, p. 10, Jan 2020, Art. no. 29.
[10] M. A. Kosan, O. Yildiz, and H. Karacan, "Comparative analysis of machine learning algorithms in detection of phishing websites," (in Turkish), Pamukkale University Journal of Engineering Sciences-Pamukkale Universitesi Muhendislik Bilimleri Dergisi, Article vol. 24, no. 2, pp. 276-282, 2018.
[11] O. S. Lih et al., "Comprehensive electrocardiographic diagnosis based on deep learning," (in English), Artificial Intelligence in Medicine, Article vol. 103, p. 8, Mar 2020, Art. no. Unsp 101789.
[12] D. Zhang et al., "Automatic corneal nerve fiber segmentation and geometric biomarker quantification," (in English), European Physical Journal Plus, Article vol. 135, no. 2, p. 16, Feb 2020, Art. no. 266.
[13] A. Cuzzocrea, F. Martinelli, and F. Mercaldo, Applying Machine Learning Techniques to Detect and Analyze Web Phishing Attacks (Iiwas2018: The 20th International Conference on Information Integration and Web-Based Applications & Services). New York: Assoc Computing Machinery, 2014, pp. 355-359.
[14] X. W. Liu and J. M. Fu, "SPWalk: Similar Property Oriented Feature Learning for Phishing Detection," (in English), IEEE Access, Article vol. 8, pp. 87031-87045, 2020.
[15] J. Mao et al., "Detecting Phishing Websites via Aggregation Analysis of Page Layouts," in 2017 International Conference on Identification, Information and Knowledge in the Internet of Things, vol. 129, R. Bie, Y. Sun, and J. Yu, Eds. (Procedia Computer Science). Amsterdam: Elsevier Science Bv, 2018, pp. 224-230.
[16] N. Shulzhenko and S. Romashkin, "Internet fraud and transnational organized crime," (in English), Juridical Tribune-Tribuna Juridica, Article vol. 10, no. 1, pp. 162-172, Mar 2020.
[17] A. Zamir et al., "Phishing web site detection using diverse machine learning algorithms," (in English), Electronic Library, Article vol. 38, no. 1, pp. 65-80, Jan 2020.
[18] A. Belabed, E. Aimeur, A. Chikh, and IEEE, A personalized whitelist approach for phishing webpage detection (2012 Seventh International Conference on Availability, Reliability and Security). Los Alamitos: IEEE Computer Soc, 2012, pp. 249-254.
[19] E. Buber, O. Demir, O. K. Sahingoz, and IEEE, Feature Selections for the Machine Learning based Detection of Phishing Websites (2017 International Artificial Intelligence and Data Processing Symposium). New York: IEEE, 2017.
[20] V. Patil, P. Thakkar, C. Shah, T. Bhat, S. P. Godse, and IEEE, Detection and Prevention of Phishing Websites using Machine Learning Approach (2018 Fourth International Conference on Computing Communication Control and Automation). New York: IEEE, 2018.
[21] G. Sonowal and K. S. Kuppusamy, "PhiDMA - A phishing detection model with multi-filter approach," (in English), Journal of King Saud University-Computer and Information Sciences, Article vol. 32, no. 1, pp. 99-112, Jan 2020.
[22] D. Sarma, W. Alam, I. Saha, M. N. Alam, M. J. Alam and S. Hossain, "Bank Fraud Detection using Community Detection Algorithm," 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2020, pp. 642-646.
[23] S. Hossain, D. Sarma, T. Mittra, M. N. Alam, I. Saha and F. T. Johora, "Bengali Hand Sign Gestures Recognition using Convolutional Neural Network," 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2020, pp. 636-641.
[24] S. Hossain, A. Abtahee, I. Kashem, M. M. Hoque, and I. H. Sarker, "Crime Prediction Using Spatio-Temporal Data," in Computing Science, Communication and Security, Singapore, 2020, pp. 277-289: Springer Singapore.
[25] H. Alqahtani, I.H. Sarker, A. Kalim, S.M.M. Hossain, S. Ikhlaq and S. Hossain, "Cyber Intrusion Detection Using Machine Learning Classification Techniques," in Computing Science, Communication and Security, Singapore, 2020, pp. 121-131: Springer Singapore.
[26] S. Hossain, F. Islam, R. Karim and K.N. Siddique, "A Critical Comparison between Distributed Database Approach and Data Warehousing Approach." International Journal of Scientific & Engineering Research, Article 5.1 (2014): 196-201.
[27] S. Hossain, D. Sarma, F. Tuj-Johora, J. Bushra, S. Sen and M. Taher, "A Belief Rule Based Expert System to Predict Student Performance under Uncertainty," in 2019 22nd International Conference on Computer and Information Technology (ICCIT), 2019, pp. 1-6.
[28] F. Ahmed, Fatema-Tuj-Johora, R. J. Chakma, S. Hossain and D. Sarma, "A Combined Belief Rule based Expert System to Predict Coronary Artery Disease," in 2020 International Conference on Inventive Computation Technologies (ICICT), 2020, pp. 252-257.
[29] S. Hossain, D. Sarma, R. J. Chakma, W. Alam, M. M. Hoque, and I. H. Sarker, "A Rule-Based Expert System to Assess Coronary Artery Disease Under Uncertainty," in Computing Science, Communication and Security, Singapore, 2020, pp. 143-159: Springer Singapore.
[30] M. N. Alam, D. Sarma, F. F. Lima, I. Saha, R. -E. -. Ulfath and S. Hossain, "Phishing Attacks Detection using Machine Learning Approach," 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2020, pp. 1173-1179, doi: 10.1109/ICSSIT48917.2020.9214225.
[31] I. Saha, D. Sarma, R. J. Chakma, M. N. Alam, A. Sultana and S. Hossain, "Phishing Attacks Detection using Deep Learning Approach," 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2020, pp. 1180-1185, doi: 10.1109/ICSSIT48917.2020.9214132.
[32] S. Hossain, D. Sarma and R. J. Chakma, "Machine Learning-Based Phishing Attack Detection," International Journal of Advanced Computer Science and Applications (IJACSA), Article vol. 11, no. 9, 2020, pp. 378-388, doi: 10.14569/IJACSA.2020.0110945