Phishing Website Detection Machine Learning Review
Aditya Deshmukh1, Akash Yadav2, Pratham Maske3, Shreyash Kathane4, Dr. D.S Adane5
Student, Information & Technology, RCOEM, Nagpur, India1-4
Professor of Information & Technology, RCOEM, Nagpur, India5
Abstract— As technology develops, the chance of cybercrime increases. Phishing attacks based on URLs are among the most common threats to Internet users. Such attacks do not rely on technical vulnerabilities; instead, they exploit human weaknesses and are launched against both organizations and individuals. Attackers deceive users into clicking on URLs that appear trustworthy, leading them to reveal sensitive information or install malware. Various machine learning techniques used for phishing URL detection classify URLs as phishing or legitimate. These models remain under development and refinement as researchers work to make them as accurate and efficient as possible. This paper reviews different machine learning techniques for detecting phishing URLs, together with the URL features and datasets used to train the models. It further discusses the methods proposed by researchers to enhance the detection accuracy of these models.

1. INTRODUCTION
In the year 2024, we only deepened our reliance on technology, which further exposed us to more cyber threats. The ongoing digital transformation, with major impetus from the global pandemic, has created fertile ground for cybercriminals. Recent analyses and reports point to a surge in security breaches, which have caused financial losses and personal information exposures of astronomical proportions. Phishing continues to be prevalent among these crimes, using both social engineering and technical deception to steal an individual's personal identity data and financial account credentials. Attackers build fake versions of trusted websites with the aim of tricking people into voluntarily divulging their usernames, passwords, banking details, and other sensitive information. These phishing URLs are typically distributed through e-mail, instant messages, or text messages, so users should remain alert to the threat and follow sound cybersecurity practices.

The FBI's Internet Crime Complaint Center (IC3) 2023 report [1] highlights a significant rise in online fraud, with 880,418 complaints and potential losses exceeding $12.5 billion, marking a 10% increase in complaints and a 22% rise in losses from 2022. California reported the highest number of complaints and losses, with nearly 80,000 complaints and over $2 billion in losses.

Investment scams were the most damaging: they alone robbed victims of $4.57 billion, an increase of 38% from the previous year. Crypto-investment fraud alone accounted for $3.94 billion, a 53% rise. Phishing-type schemes are among the most reported crimes, with over 298,000 complaints, which make up about 34% of all complaints.

Fig 1: Complaints and losses over the last five years [1]

The report stresses the importance of public reporting to IC3 in assisting the FBI in combating cyber threats. The FBI encourages consumers to look out for and read consumer and industry alerts about cybercrime, notify financial institutions if victimized, and file a report with IC3 or local law enforcement.
2. BACKGROUND

A. Phishing Detection

A URL-based phishing attack is carried out by sending users malicious links that seem legitimate and tricking them into clicking on them. In phishing detection, an incoming URL is classified as phishing or legitimate by analysing its features. Different machine learning algorithms are trained on various datasets of URL features to classify a given URL as phishing or legitimate.

B. Phishing Detection Approaches

List-Based Phishing Detection Systems
These systems rely on two lists to classify a website as either phishing or non-phishing. The whitelist contains safe and legitimate websites, while the blacklist includes those identified as phishing. Researchers have used whitelists to ensure that only URLs on the list are accessible. Another approach is the blacklist method, where URLs are checked against a list of known phishing sites. However, these systems have a significant drawback: even a small change in the URL can prevent it from being matched in the list. Additionally, they struggle to catch new, zero-day attacks.[3]

Rule-Based Phishing Detection Systems
The feature sets for rule-based systems stem from relational rule mining. The rules weight the characteristics most prevalent in phishing URLs. When used in the system, these rules provide better accuracy than can be achieved by classifying on the features alone. For example, researchers in the CANTINA study used TF-IDF and some specific rules to identify phishing attacks. Similar works have combined features and rules to achieve higher detection accuracy.[3]

Visual Similarity-Based Phishing Detection Systems
These systems compare web pages with phishing sites visually to detect phishing attempts. They compare phishing and non-phishing sites from a server perspective and use image processing techniques to identify minor visual differences that users would not notice. Fake sites are designed to look similar to the original ones; however, these techniques make the slight differences visible. Studies have shown that visual similarity-based systems can be effective detection models against phishing attacks by comparing generic visual elements.[3]

Machine Learning-Based Phishing Detection Systems
Machine learning-based systems detect phishing websites by classifying specified features using artificial intelligence techniques. These features can include URL structure, domain name, website content, and more. Due to their dynamic nature, these systems are particularly popular for detecting anomalies on websites. Machine learning models can adapt to new phishing tactics, making them highly effective in protecting users from evolving threats.[3]

3. LITERATURE REVIEW

In this section, a few of the research works that deploy the above-mentioned algorithms are reviewed and their results are summarized.

The study conducted by Dr. Nitin N. Sakhare et al. [2] integrated conventional machine learning models such as XGBoost, LightGBM, and a referenced but inactive Random Forest classifier alongside a Graph Neural Network (GNN). The XGBoost classifier gave an accuracy of 92.09%, while LightGBM gave the highest accuracy of 93.29%. Apart from this, they implemented another tree-based machine learning algorithm, CatBoost, which gave an accuracy of 92.98%. The GNN's performance left considerable scope for improvement. LightGBM emerged as the standout performer, with a precision score of 0.93 and a recall score of 0.93.

B. Sucharitha et al. [3] investigated the application of machine learning algorithms to classify phishing websites. The dataset for this research comprises 32 features including IP address, URL length, URL shortening service employed, and state of SSL, among others. The study presents these as salient features of malicious URLs that identify phishing websites. The authors considered different machine learning models, namely Decision Trees, Random Forest, and Gradient Boosting. These models were evaluated using metrics such as accuracy, precision, recall, and F1-score. Among all the models, Gradient Boosting achieved the highest score, with accuracy 98.9%, precision 99.0%, recall 99.4%, and F-value just slightly lower at 98.6%. Thus, the authors concluded that ensemble methods such as Gradient Boosting and Random Forest can provide accurate results and strong generalization capabilities when detecting phishing websites. The authors stress the importance of using features from varied sources and suggest that combining machine learning models with other phishing detection techniques can enhance detection capabilities further. This research exemplifies the use of machine learning in the detection of phishing websites and points toward further improvement through hybrid models and additional features.

Machikuri Santoshi Kumari et al. [4] detect phishing using models enhanced by blacklisting and machine-learning methods. Several machine-learning algorithms, such as XGBoost, Random Forest, Decision Tree, and Multilayer Perceptrons, were used for the detection. Other datasets were used in addition to the PhishTank dataset, namely one containing phishing websites and a second containing phonemy features. A total of 30 features were used, of which the most important were HTTPS, followed by Anchor URL, Website Traffic, etc. XGBoost gave the maximum training accuracy of 100% and the best test accuracy, 96.7%, of all the algorithms. They concluded that "using the XGBoost algorithm to detect phishing improves prediction accuracy."
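
The list-based and feature-based approaches discussed in this review can be combined, as in the blacklist-plus-classifier systems surveyed here. The following is a minimal sketch and not any cited author's implementation: the blacklist entry, the `heuristic_score` rule, and the 0.5 threshold are all hypothetical stand-ins, and a trained classifier would replace the heuristic in practice. It also illustrates the drawback of exact list matching, since a slightly altered URL no longer matches the blacklist and has to be caught by the feature-based stage.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of known phishing URLs (illustrative only).
BLACKLIST = {"http://login-paypa1.example.com/verify"}


def normalize(url: str) -> str:
    """Lowercase scheme and host and drop a trailing slash before lookup."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def heuristic_score(url: str) -> float:
    """Stand-in for a trained classifier: fraction of warning signs present.

    The four signs and their thresholds are assumptions for illustration.
    """
    host = urlsplit(url).netloc
    signs = [
        "@" in url,           # '@' symbol anywhere in the URL
        "-" in host,          # prefix or suffix joined to the host with '-'
        host.count(".") > 2,  # several subdomain levels
        len(url) > 54,        # unusually long URL
    ]
    return sum(signs) / len(signs)


def classify(url: str, threshold: float = 0.5) -> str:
    """Check the blacklist first, then fall back to feature-based scoring."""
    if normalize(url) in BLACKLIST:
        return "phishing (blacklist)"
    if heuristic_score(url) >= threshold:
        return "phishing (features)"
    return "legitimate"


if __name__ == "__main__":
    for candidate in (
        "http://login-paypa1.example.com/verify",
        "http://secure-login-paypa1.accounts.example.com/verify",
        "https://www.example.org/",
    ):
        print(candidate, "->", classify(candidate))
```

The second URL in the usage loop is not on the blacklist, so only the feature-based stage can flag it; this ordering keeps the cheap exact lookup first and reserves scoring for unseen URLs.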
A. Orunsolu et al. [5] proposed a scalable architecture that combines incremental learning with a modular approach. Using a dataset from PhishTank (comprising 2,541 phishing URLs) and Alexa (containing 2,500 legitimate URLs), the model attained 99.96% accuracy with a low false positive rate of 0.04%. In conducting
comparative performance studies, the authors used Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms. The study provides a criterion for assessing feature importance based on how often phishing and legitimate datasets favor certain features. Features are therefore selected for maximum relevance with minimum redundancy. The URL features include, but are not limited to, length, presence of '@', and hexadecimal codes. The webpage features investigated include the validity of SSL certificates and congruency with domain names, while patterns of behavior, like cookie handling, and the age of the domain also qualify as important features. The incremental methodology processes these features in stages, starting with URL analysis, followed by webpage properties, and finally webpage behaviors if needed. This modular approach ensures scalability and adaptability to new phishing tactics. The study's results demonstrate the effectiveness of the proposed system, though limitations remain, such as dataset diversity, lack of real-time testing, and absence of benchmarking.

Korkmaz et al. [6] addressed the persistent problem of phishing through URL analysis, employing machine-learning techniques to detect attacks that exploit vulnerabilities inherent in human nature by imitating legitimate sites in a bid to obtain sensitive data. The work concentrates primarily on the attributes of URLs in order to improve efficiency. The authors employed eight machine learning algorithms, including Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM), which were tested on three datasets with over 126,000 URLs. The datasets combined phishing URLs from PhishTank with legitimate URLs from the Alexa and Common Crawl databases. The system extracted and used 48 key features from the URLs, including domain structure, special character presence, and length metrics, without recourse to third-party services for efficiency reasons. The experimental results indicate that the Random Forest algorithm had the highest accuracy across the datasets (up to 94.59%) and better accuracy than previous studies. The approach is efficient enough to be used for real-time detection. However, the limited coverage noted in the paper provides directions for further work, such as expanding upon the initial dataset.

Alam et al. (2020) [7] investigated phishing attack detection using decision tree (DT) and random forest algorithms for the classification of attacks. The dataset, which came from Kaggle, had 30 significant features for identifying phishing URLs. A detailed preprocessing step produced clean, noise-free data, followed by feature selection using algorithms like PCA. The performance of each algorithm was analyzed using confusion matrices and the following performance measures: accuracy, precision, recall, and F1-score. Random forests outperformed DTs, offering 97% accuracy compared to 91.94% for DTs, and dealt effectively with overfitting and variability issues. The study asserted that random forests, ensemble approaches, and techniques used in web-based search to filter out spam substantially assist phishing detection on large data.

Rashid et al. (2020) [8] presented a machine learning approach for phishing detection that uses Support Vector Machines (SVM) for classification. The dataset, obtained from repositories such as PhishTank and Alexa, consists of legitimate and phishing URLs, together with internal features, such as the length of the URL, and external features derived from third-party services. Principal Component Analysis (PCA) was performed for dimensionality reduction to facilitate more efficient processing. The model achieved 95.66% accuracy using SVM with only five features, higher than that achieved by other techniques such as Random Forest, which showed an accuracy of 94.27% with 30 features. This reduction in the feature set improved computational efficiency while maintaining good detection rates. The authors indicated how robust their solution is at identifying new and transient phishing sites, which makes it a practical defense against cyber threats.

The research by Vahid Shahrivari and Mohammad Mahdi Darabi [9] deals with the application of various machine-learning algorithms for the detection of phishing websites. It uses a dataset of 30 features, such as IP address presence, URL length, whether shortening services are used, and SSL state, among others. Characteristics common to phishing URLs are used to distinguish phishing websites from legitimate ones. The machine learning algorithms tried were logistic regression, decision tree, random forest, AdaBoost, KNN, SVM, gradient boosting, XGBoost, and neural networks. Accuracy, precision, recall, and F1 score were used to assess the performance of the different models. XGBoost proved most accurate at 98.32%, Random Forest came second at 97.27%, and the neural network also performed well, achieving 96.98% accuracy. The authors concluded that ensemble methods such as Random Forest and XGBoost are good at detecting phishing websites due to their high accuracy and robustness. They stressed the usefulness of employing multiple features and suggested that detection performance might be enhanced by coupling machine learning models with other phishing detection methods. This work exemplifies the potential of machine learning to help discern phishing websites, and its further promise of improvement with hybrid models and novel features.

Jitendra Kumar et al. [10] described the training of Logistic Regression, Naive Bayes, Random Forest, Decision Tree and K-Nearest Neighbor classifiers using features derived from the lexical structure of URLs. They carefully created a dataset to avoid common problems like data imbalance, biased training, variance and overfitting. The preprocessed dataset was evenly split between phishing and trusted URLs and was further divided in a 70:30 ratio for training and testing. Interestingly, all classifiers had similar AUC (Area Under Curve) values, but the Naive Bayes classifier was reported as the best performer with the highest AUC value. It achieved an accuracy of 98% with a precision of 1, recall of 0.95 and F1-score of 0.97. The study thus makes a point regarding the importance of a balanced dataset and further emphasizes Naive Bayes as a strong candidate for phishing detection.

Kulkarni and Brown (2019) [11] studied the detection of phishing websites with machine learning techniques. The dataset was obtained from the University of California, Irvine Machine Learning Repository and contains 1,353 URLs labeled as phishing, suspicious, or legitimate. Nine features were extracted from the URLs, including URL length, age of domain, presence of an IP address, and others. Four classifiers were evaluated: Decision Tree, Support Vector Machine
(SVM), Naïve Bayes, and Neural Network. The Decision Tree classifier achieved an accuracy of 91.5%, with a True Positive Rate (TPR) of 90.97% and a False Positive Rate (FPR) of 7.81%. The SVM followed with an accuracy of 86.69%, and Naïve Bayes and Neural Network trailed slightly at 86.14% and 84.87%, respectively. The study stated that Decision Trees are quite good with discrete feature values, but they need pruning to deal with overfitting. The authors concluded that more features and larger datasets would help classifier performance and recommended ensemble methods and rule-based approaches for future work.

Rishikesh Mahajan and Irfan Siddavatam [12] examined three classification algorithms: Decision Tree, Random Forest, and Support Vector Machine. The dataset was constructed from 17,058 benign URLs from Alexa and 19,653 phishing URLs from PhishTank, all with 16 features. The data were partitioned into training and testing sets with proportions of 50:50, 70:30, and 90:10. Performance was judged according to accuracy, false negative rate, and false positive rate. Random Forest stood out, achieving 97.14% accuracy with the lowest false negative rate. Their conclusion was that the more data used for training, the better the accuracy.

4. DATASETS

The datasets have been collected from various sites such as PhishTank [13], Alexa, etc., which hold data about phishing websites and keep it updated. The datasets contain all features and their respective values.

5. FEATURE EXTRACTION

URLs have certain characteristics and patterns that can be considered as their features. In URL-based analysis for designing machine learning models, we need to extract these features in order to form a dataset that can be used for training and testing. There are four categories of features that are most commonly considered for feature extraction, as in [9]. They are as follows:

1) Address Bar based features
2) Abnormal based features
3) HTML and JavaScript based features
4) Domain based features

Address Bar Based Features

1) Having IP Address: Whether an IP address is used instead of the domain name in the URL, such as http://217.102.24.235/sample.html
2) URL Length: Phishers can use a long URL to hide the doubtful part in the address bar.
3) Shortening Service: A short link that leads to a webpage with a long URL. For example, the URL http://sharif.hud.ac.uk/ can be shortened to bit.ly/1sSEGTB.
4) Having @ Symbol: Using the @ symbol in the URL leads the browser to ignore everything preceding the @ symbol, and the real address often follows the @ symbol.
5) Double Slash Redirection: The existence of // within the URL means that the user will be redirected to another website.
6) Prefix Suffix: Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example, http://www.Confirme-paypal.com.
7) Having Sub Domain: Whether the URL contains subdomains.
8) SSL State: Indicates whether the website uses SSL.
9) Domain Registration Length: Based on the fact that a phishing website lives for a short period.
10) Favicon: A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from another domain, then the webpage is likely to be considered a phishing attempt.
11) Using Non-Standard Port: To control intrusions, it is much better to open only the ports that are needed. Several firewalls, proxy and Network Address Translation (NAT) servers will, by default, block all or most of the ports.
12) HTTPS token: A deceptive https token in the domain part of the URL. For example, http://https-www-mellat-phish.ir

Abnormal Based Features

13) Request URL: Examines whether the external objects contained within a webpage, such as images, videos, and sounds, are loaded from another domain.
14) URL of Anchor: An anchor is an element defined by the <a> tag. This feature is treated exactly as Request URL.
15) Links In Tags: It is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document, <Script> tags to create a client-side script, and <Link> tags to retrieve other web resources.
16) Server Form Handler: Whether the domain name in the server form handler (SFH) is different from the domain name of the webpage.
17) Submitting Information To E-mail: A phisher might redirect the user's information to his email.
18) Abnormal URL: It is extracted from the WHOIS database. For a legitimate website, identity is typically part of its URL.

HTML & JavaScript Based Features

19) Website Redirect Count: Whether the website redirects more than four times.
20) Status Bar Customization: Use of JavaScript to show users a fake URL in the status bar.
21) Disabling Right Click: This is treated exactly like using onMouseOver to hide the link.
22) Using Pop-up Window: Whether the webpage shows pop-up windows.
23) IFrame: IFrame is an HTML tag used to display an additional webpage within the one that is currently shown.

Domain Based Features

24) Age of Domain: Whether the age of the domain is less than a month.
25) DNS Record: Whether the domain has a DNS record.
26) Web Traffic: This feature measures the popularity of the website by determining the number of visitors.
27) Page Rank: PageRank is a value ranging from 0 to 1 that aims to measure how important a webpage is on the Internet.
28) Google Index: This feature examines whether a website is in Google's index or not.
29) Links Pointing To Page: The number of links pointing to the web page.
30) Statistical Report: Whether the IP address belongs to the top phishing IPs or not.
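
Several of the address bar based features listed in this section can be computed from the URL string alone. The sketch below is a minimal illustration under stated assumptions, not the extraction code of any cited paper: the function name `address_bar_features`, the shortener list, the 54-character length threshold, and the "more than two dots" subdomain rule are hypothetical choices, and each feature is returned as a plain boolean rather than the encoded values used in the datasets. URLs are assumed to include a scheme such as `http://`.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, non-exhaustive list of URL shortening services.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}


def host_is_ip(host: str) -> bool:
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url: str, long_url: int = 54) -> dict:
    """Compute a few address bar based features from the URL string only."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    is_ip = host_is_ip(host)
    # Everything after the scheme separator, so that the '//' following
    # 'http:' is not counted as a redirection.
    after_scheme = url.split("://", 1)[-1]
    return {
        "having_ip_address": is_ip,
        "long_url": len(url) > long_url,
        "shortening_service": host in SHORTENERS,
        "having_at_symbol": "@" in url,
        "double_slash_redirect": "//" in after_scheme,
        "prefix_suffix": "-" in host,
        "many_sub_domains": (not is_ip) and host.count(".") > 2,
        "https_token": "https" in host,
    }


if __name__ == "__main__":
    for example in (
        "http://217.102.24.235/sample.html",
        "http://https-www-mellat-phish.ir",
        "http://bit.ly/1sSEGTB",
    ):
        print(example, address_bar_features(example))
```

Features such as SSL state, domain registration length, and favicon origin are left out because they need network lookups or page content rather than the URL string.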

5.1) LEXICAL STRUCTURE OF A URL[10]

The structure of a URL can reveal a lot of hidden
information. A URL starts with a protocol name like
HTTP or HTTPS. The fully qualified domain name
(FQDN) is the complete domain name of the server
hosting the website, which is then translated into an IP
address using DNS servers. The domain name consists
of a second-level domain (SLD) and a top-level domain
(TLD). This domain name is unique and registered with
a domain registrar.

Fig 2: Lexical structure of a URL [10]

Let us consider this URL:

http://amazon.com-verification
accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d
The lexical analysis of the above URL reveals the parts shown in Fig 2. Attackers obfuscate the URL in such a way that the actual domain name is not easily visible to an ordinary user and is nested deep inside the URL.
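
The decomposition described above can be sketched with Python's standard `urllib.parse` module. This is only an illustration: the URL in the example is hypothetical and uses the reserved `.example` name, and the `split_domain` helper naively treats the last label as the TLD and the one before it as the SLD, an assumption that fails for multi-label suffixes such as `co.uk` (a public suffix list would be needed for those).

```python
from urllib.parse import urlsplit


def split_domain(hostname: str) -> dict:
    """Naively split a hostname into subdomain, SLD, and TLD labels."""
    labels = hostname.lower().split(".")
    return {
        "subdomain": ".".join(labels[:-2]),
        "sld": labels[-2] if len(labels) >= 2 else "",
        "tld": labels[-1],
    }


def lexical_parts(url: str) -> dict:
    """Return the protocol, FQDN, path, and domain labels of a URL."""
    parts = urlsplit(url)
    info = {
        "protocol": parts.scheme,
        "fqdn": parts.hostname or "",
        "path": parts.path,
    }
    info.update(split_domain(info["fqdn"]))
    return info


# Hypothetical obfuscated URL: a trusted brand name appears in the
# subdomain, while the registered domain is something else entirely.
parts = lexical_parts("http://amazon.com-verify.accounts.example/Sign-in/abc123")
for name, value in parts.items():
    print(f"{name}: {value}")
```

In this example the brand name ends up in the `subdomain` field, while `sld` and `tld` expose the domain that is actually registered, which is the part an ordinary user tends to overlook.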
TABLE I. RESULT ANALYSIS

| Paper | Model Used | Suitable Models | Accuracy score |
|---|---|---|---|
| [2] | XGBoost, LightGBM, Graph Neural Network (GNN) and CatBoost applied. Performance evaluated using accuracy, precision, recall and F1-score. | LightGBM gave highest accuracy with precision 0.93 and recall score 0.93 | XGBoost: 92.09%, LightGBM: 93.29%, GNN: 70%, CatBoost: 92.98% |
| [3] | | GB produces reliable results in terms of accuracy, precision, recall, and F1 score. | GB: 98.9%, RF: 96.9%, DT: 96.0% |
| [4] | Combined blacklisting with ML algorithms (XGBoost, RF, DT, and Multilayer Perceptrons) applied to a dataset with features; phishing URLs collected from Phishtank and OpenPhish. | Among these, XGBoost was found to be the most accurate model. | XGBoost: 96.7%, RF: 92.5%, DT: 90.5%, Multilayer Perceptrons: 88% |
| [5] | Support Vector Machine (SVM) and Naïve Bayes (NB) with features based on maximum relevance with minimum redundancy. Phishtank (2,541 phishing URLs) and Alexa (2,500 legitimate URLs) datasets. | Both Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers have TPR of 99.96, FNR of 0.04, TNR of 99.96, and FPR of 0.04 | SVM: 99.96%, NB: 99.96% |
| [6] | Random Forest (RF), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), Decision Tree (DT), Naive Bayes (NB), XGBoost | Random Forest (RF) was the best-suited model based on its highest accuracy and overall performance in detecting phishing URLs. | RF: 94.59%, ANN: 94.35%, XGBoost: 92.95%, DT: 92.59%, KNN: 91.49%, LR: 91.31%, NB: 88.35%, SVM: 87.03% |
| [7] | DT and RF applied to a Kaggle dataset with 30 features. PCA used for feature selection. Performance evaluated using accuracy, precision, recall, and F1-score. | RF outperformed DT, addressing overfitting and variability effectively | RF: 97%, DT: 91.94% |
| [8] | Applied SVM on data from PhishTank and Alexa, with internal and external features, and PCA for dimensionality reduction | RF and NB classifiers had better accuracies. In terms of AUC, Gaussian Naive Bayes had a slightly higher value of 0.991. | SVM: 95.66%, RF: 94.27% |
| [9] | The examined classifiers are Logistic Regression, Decision Tree, Support Vector Machine, Ada Boost, Random Forest, Neural Networks, KNN, Gradient Boosting, and XGBoost. | Very good performance in ensembling classifiers, namely Random Forest and XGBoost, both on computation duration and accuracy | Logistic regression: 92.6%, Decision tree: 96.5%, Random forest: 97.2%, Adabooster: 93.6%, KNN: 95%, SVM: 94.9%, Gradient boosting: 94.8%, XGBoost: 98.3% |
| [10] | A balanced dataset was utilized to train classifiers such as Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and k-Nearest Neighbors (k-NN), using features derived from the lexical structure of URLs. | Random Forest and Naive Bayes demonstrated superior accuracy | Random Forest: 98.03%, Gaussian Naive Bayes: 97.18% |
| [11] | Four classifiers (DT, SVM, Naïve Bayes, Neural Network) applied to a UCI dataset with 1,353 labeled URLs and 9 extracted features | DT performed best with 91.5% accuracy but required pruning to address overfitting. Ensemble methods were recommended | DT: 91.5%, SVM: 86.69%, Naïve Bayes: 86.14%, Neural Network: 84.87% |
| [12] | The dataset was divided into split ratios of 50:50, 70:30, and 90:10. Decision Tree (DT), Random Forest (RF), and (SVM) classifiers were applied. | The Random Forest classifier demonstrated superior accuracy and the lowest false negative rate. | 50:50 split ratio: 96.72%, 70:30 split ratio: 96.84%, 90:10 split ratio: 97.14% |

6) PERFORMANCE EVALUATION METRICS
A set of selected metrics is used to evaluate the performance of a detection system. The metrics are Accuracy, Precision, Recall, F1 Score, and the ROC curve, all derived from the values of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
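
As a minimal sketch of how these metrics follow from the four counts, the function below computes accuracy, precision, recall, and F1 score, treating phishing as the positive class. The function name `classification_metrics` and the example counts are hypothetical and are not taken from any of the reviewed papers; the ROC curve is omitted because it requires classifier scores rather than a single set of counts.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for 100 phishing and 100 legitimate test URLs.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting precision and recall alongside accuracy matters for phishing detection because a missed phishing URL (FN) and a blocked legitimate URL (FP) carry different costs.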
In the context of URL classification:
True Positive (TP): The number of phishing URLs
correctly detected as phishing.
True Negative (TN): The number of legitimate URLs
correctly detected as legitimate.
False Positive (FP): The number of legitimate URLs review builds a good basis for future researchers taking their
incorrectly classified as phishing. next step at improving phishing detection systems.
False Negative (FN): The number of phishing URLs
incorrectly classified as legitimate. REFERENCES
[1] 2023 Internet Crime Report FBI. Retrieved from:
A Confusion Matrix represents these values in terms https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Repo
of how it indicates the performance of the classification rt.pdf
model.
[2] Dr. Nitin N. Sakhare, Jyoti L. Bangare, Dr. Radhika G.
Purandare, Disha S. Wankhede, Pooja Dehankar, “Phishing
Website Detection Using Advanced Machine Learning
Techniques”, International Journal of Intelligent Systems and
Applications in Engineering 2024.
[10]
[3] Sucharitha, B., Chandini, B., Kumar, D. S., Surendra, M., &
Kumar, G. K. (2024). Detecting phishing websites using
machine learning. IJARCCE, 13(4).
https://doi.org/10.17148/ijarcce.2024.134145
[10]
[4] Machikuri Santoshi Kumari, Chiguru Keerthi Priya, Gondhi
Bhavya Haridas Neha, Monisha Awasthi, Surendra Tripathi, ”
Viable Detection of URL Phishing using Machine Learning
Approach”, 15th International Conference on Materials
Processing and Characterization (ICMPC 2023).
[5].A.A. Orunsolu, A. S. Sodiya, and A. T. Akinwale, “A
[10] predictive model for phishing detection,” Journal of King Saud
University – Computer and Information Sciences, vol. 34, no.
2, pp. 232–247, 2022.
[6] Korkma, M., Sahingoz, O. K., & Diri, B. (2020). Detection
of Phishing Websites by Using Machine Learning-Based URL
Analysis. Presented at the 11th International Conference on
Computing, Communication and Networking Technologies
[10] (ICCCNT), July 1-3, 2020, IIT Kharagpur, India. IEEE.
[7] Mohammad Nazmul Alam, Dhiman Sarma et al., “Phishing
OBSERVATIONS attacks detection using machine learning approach,” 3rd
International Conference on Smart Systems and Inventive
Technology (ICSSIT), 2020.

[8] Junaid Rashid, “Phishing Detection Using Machine Learning Technique,” First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), 2020.

[9] Vahid Shahrivari, Mohammad Mahdi Darabi, and Mohammad Izadi, “Phishing Detection Using Machine Learning Techniques,” arXiv preprint arXiv:2009.11116, 2020.

[10] Jitendra Kumar, A. Santhanavijayan, B. Janet, Balaji Rajendran, and Bindhumadhava BS, “Phishing website classification and detection using machine learning,” International Conference on Computer Communication and Informatics (ICCCI), 2020.

[11] Arun Kulkarni and Leonard L. Brown, “Phishing Websites Detection using Machine Learning,” IJACSA International Journal of Advanced Computer Science and Applications, vol. 10, no. 7, 2019.

[12] Rishikesh Mahajan and Irfan Siddavatam, “Phishing website detection using machine learning algorithms,” International Journal of Computer Applications (0975-8887), vol. 181, no. 23, 2018.

[13] PhishTank: https://phishtank.org/

Phishing attacks are constantly evolving, and new types of attack
appear frequently. Hence no single detection approach or algorithm
can be labeled the best in every case. The literature survey shows
that Random Forest gives better results in most scenarios. However,
the performance of each algorithm varies with the dataset used, the
train-test split ratio, the feature selection techniques applied,
and similar factors. Researchers aim to build machine learning
models that achieve the best values on the evaluation metrics with
the least training time. Future work should therefore focus on these
aspects of phishing detection.

6. CONCLUSION

Due to the growing demand for the security of personal,
financial, and professional data in this digital era, phishing
detection has become a highly critical area of research.
URL-based analysis is one way to improve both detection
speed and detection accuracy. By extracting features from
a given URL and applying feature selection and
dimensionality reduction techniques, models are refined by
eliminating unnecessary data and focusing on the most
informative features. Numerous machine learning algorithms
have shown strong performance on phishing URL
classification, including Random Forest, XGBoost, and
Support Vector Machines. In this paper, we retrospectively
examined phishing detection, focusing on different
methodologies and their performance.
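
As a rough illustration of the URL-based feature extraction summarized above, the following minimal Python sketch derives a few lexical features from a URL using only the standard library. The function name `lexical_features` and the particular feature set (URL length, dots and hyphens in the host, presence of "@", IP-address host, HTTPS scheme, path depth) are assumptions chosen for illustration; they are not the feature set of any specific surveyed paper, and the resulting dictionary would still need to be passed to a feature selection step and a classifier such as Random Forest.

```python
from urllib.parse import urlparse
import ipaddress


def lexical_features(url: str) -> dict:
    """Return a small, illustrative set of lexical features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is None when the URL has no host

    # A raw IP address in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # Count the non-empty path segments, e.g. "/login/verify" has depth 2.
    path_depth = len([segment for segment in parsed.path.split("/") if segment])

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
        "path_depth": path_depth,
    }


if __name__ == "__main__":
    # Hypothetical example URLs, used only to show the output format.
    for example in ("https://example.com/", "http://192.168.0.1/login/verify"):
        print(example, lexical_features(example))
```

Each feature here is a simple count or boolean, so the output can be converted directly into a numeric row of a training matrix. Which of these features actually help depends on the dataset, which is why a feature selection step is usually applied before training.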

