
Phishing Website Detection Machine Learning Review
Aditya Deshmukh1, Akash Yadav2, Pratham Maske3, Shreyash Kathane4, Dr. D.S Adane5
Student, Information & Technology, RCOEM, Nagpur, India1-4
Professor of Information & Technology, RCOEM, Nagpur, India5
Abstract: As technology develops, the chance of cybercrime increases. Phishing attacks based on URLs are among the most common threats toward Internet users. Such attacks are not built upon technical vulnerabilities; instead, they exploit human weaknesses and are often launched against organizations and individuals. Attackers deceive users into clicking on URLs that appear trustworthy, leading them to reveal sensitive information or install malware. Various machine learning techniques for phishing URL detection classify URLs into phishing and legitimate ones. Models remain under development and refinement as researchers work to make them as accurate and efficient as possible. This paper reviews different machine learning techniques for detecting phishing URLs, together with the URL features and datasets used to train the models. The paper further discusses the methods put forth by researchers to enhance the detection accuracy of these models.

1. INTRODUCTION

In the year 2024, our reliance on technology only deepened, further exposing us to cyber threats. The ongoing digital transformation, given major impetus by the global pandemic, has created fertile ground for cybercriminals. Recent analyses and reports point to a surge in security breaches, which have caused financial losses and personal information exposure of astronomical proportions. Phishing continues to be prevalent among these cybercrimes, using both social engineering and technical deception to steal an individual's personal identity data and financial account credentials. Attackers build fake versions of trusted websites with the aim of tricking people into voluntarily divulging their usernames, passwords, banking details, and other sensitive information. These phishing URLs are typically distributed through e-mail, instant messages, or text messages, so users should remain alert and follow sound cybersecurity practices.

The FBI's Internet Crime Complaint Center (IC3) 2023 report [1] highlights a significant rise in online fraud, with 880,418 complaints and potential losses exceeding $12.5 billion, marking a 10% increase in complaints and a 22% rise in losses from 2022. California reported the highest number of complaints and losses, with nearly 80,000 complaints and over $2 billion in losses.

Investment scams were the most damaging: they alone robbed victims of $4.57 billion, an increase of 38% from the previous year. Crypto-investment fraud alone accounted for $3.94 billion, a 53% rise. Phishing schemes are among the most reported crimes, with over 298,000 complaints, which make up about 34% of all complaints.

Fig 1: Complaints and losses of last 5 years [1]

The report places importance on public reporting to IC3 so as to assist the FBI in combating cyber threats. The FBI encourages consumers to look out for and read consumer and industry alerts about cybercrime, notify financial institutions if victimized, and file a report with IC3 or local law enforcement.
2. BACKGROUND

A. Phishing Detection

A URL-based phishing attack is carried out by sending malicious links that seem legitimate to users and tricking them into clicking on them. In phishing detection, an incoming URL is identified as phishing or not by analysing different features of the URL, and is classified accordingly. Different machine learning algorithms are trained on various datasets of URL features to classify a given URL as phishing or legitimate.

B. Phishing Detection Approaches

List-Based Phishing Detection Systems
These systems rely on two lists to classify a website as either phishing or non-phishing. The whitelist contains safe and legitimate websites, while the blacklist includes those identified as phishing. Researchers have used whitelists to ensure that only URLs on the list are accessible. Another approach is the blacklist method, where URLs are checked against a list of known phishing sites. However, these systems have a significant drawback: even a small change in the URL can prevent it from being matched in the list. Additionally, they struggle to catch new, zero-day attacks.[3]

Rule-Based Phishing Detection Systems
The feature sets for rule-based systems stem from relational rule mining. The rules provide a weighting of the characteristics most prevalent in phishing URLs. These rules, when used with the system, provide better accuracy than can be achieved with features alone in classification. For example, researchers in the CANTINA study used TF-IDF and some specific rules to identify phishing attacks. Researchers have implemented a combination of features and rules to achieve higher detection accuracy in similar works.[3]

Visual Similarity-Based Phishing Detection Systems
These systems compare web pages with phishing sites visually to detect phishing attempts. They perform a server-side comparison of both phishing and non-phishing sites and use image processing techniques to identify minor visual differences which users would not notice. Fake sites are designed to look similar to the original ones; however, slight differences become visible with these techniques. Studies have shown that visual similarity-based systems can be effective detection models against phishing attacks by comparing generic visual elements.[3]

Machine Learning-Based Phishing Detection Systems
Machine learning-based systems detect phishing websites by classifying specified features using artificial intelligence techniques. These features can include URL structure, domain name, website content, and more. Due to their dynamic nature, these systems are particularly popular for detecting anomalies on websites. Machine learning models can adapt to new phishing tactics, making them highly effective in protecting users from evolving threats.[3]

3. LITERATURE REVIEW

In this section, a few of the research works that deploy the above-mentioned algorithms are reviewed and their results are summarized.

The study conducted by Dr. Nitin N. Sakhare et al. [2] integrated conventional machine learning models such as XGBoost and LightGBM, along with a referenced but inactive Random Forest classifier, alongside a Graph Neural Network (GNN). The XGBoost classifier gives an accuracy of 92.09%, while LightGBM gives the highest accuracy of 93.29%. Apart from this, they implement another tree-based machine learning algorithm, CatBoost, which gives an accuracy of 92.98%. The GNN's performance left considerable scope for improvement. LightGBM emerged as the standout performer, giving a precision score of 0.93 alongside a recall score of 0.93.

B. Sucharitha et al. [3] investigated the application of machine learning algorithms to classify phishing websites. The dataset for this research comprises 32 features including IP address, URL length, URL shortening service employed, and state of SSL, among others. The study presents these as salient features of malicious URLs that identify phishing websites. The authors considered different machine learning models, namely Decision Trees, Random Forest, and Gradient Boosting. These models were evaluated using metrics such as accuracy, precision, recall, and F1-score. Among all models, Gradient Boosting achieved the highest score with accuracy 98.9%, precision 99.0%, recall 99.4%, and F-value just slightly lower at 98.6%. Thus, the authors concluded that ensemble methods such as Gradient Boosting and Random Forest can provide accurate results and strong generalization capabilities when detecting phishing websites. The authors stress the importance of using features from varied sources and suggest that combining machine learning models with other phishing detection techniques can enhance detection capabilities further. This research illustrates the use of machine learning in the detection of phishing websites and points toward further improvement through hybrid models and additional features.

Machikuri Santoshi Kumari et al. [4] detect phishing using models that combine blacklisting with machine learning methods. Several machine learning algorithms, such as XGBoost, Random Forest, Decision Tree, and Multilayer Perceptrons, were used for the detection. Other datasets were used in addition to the PhishTank dataset, namely one containing phishing websites and a second containing phonemy features. A total of 30 features were used, of which the most important were HTTPS, followed by Anchor URL, Website Traffic, etc. XGBoost gave the maximum training accuracy of 100% and the best test accuracy, 96.7%, of all the algorithms. They concluded that "using the XGBoost algorithm to detect phishing improves prediction accuracy."
A. Orunsolu et al. [5] proposed a scalable architecture combined with incremental learning and showed that this modular approach was effective. Utilizing a dataset from PhishTank (comprising 2,541 phishing URLs) and Alexa (containing 2,500 legitimate URLs), the model attained 99.96% accuracy with a low false positive rate of 0.04%. Support Vector Machine (SVM) and Naïve Bayes (NB) algorithms were used in comparative performance studies. The study provides a criterion for assessing feature importance based on how often phishing and legitimate datasets favor certain features. The selection therefore introduces features according to maximum relevance with minimum redundancy. The URL features consist of, but are not limited to, length, presence of '@', and hexadecimal codes. The webpage features investigated include validity of SSL certificates and congruency with domain names, while patterns of behavior, like cookie handling, and the age of the domain also qualify as important features. The incremental methodology processes these features in stages, starting with URL analysis, followed by webpage properties, and finally webpage behaviors if needed. This modular approach ensures scalability and adaptability to new phishing tactics. The study's results demonstrate the effectiveness of the proposed system, though limitations remain, such as dataset diversity, lack of real-time testing, and absence of benchmarking.

Korkmaz et al. [6] addressed the persistent problem of phishing through URL analysis, employing machine learning techniques to detect attacks that exploit human nature by imitating legitimate sites in order to obtain sensitive data. The study concentrates primarily on URL attributes to improve efficiency. The authors employed eight machine learning algorithms, including Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM), which were tested on three datasets with over 126,000 URLs. The datasets combined phishing URLs from PhishTank with legitimate URLs from the Alexa and Common Crawl databases. The system extracted and used 48 key features from the URLs, including domain structure, special character presence, and length metrics, without recourse to third-party services, for efficiency reasons. The experimental results indicate that the Random Forest algorithm had the highest accuracy across the datasets (up to 94.59%) and better accuracy than previous studies. The experiment also indicates that the approach is efficient enough to be used for real-time detection. However, the limited coverage mentioned in the paper points to directions for further work, such as expanding the initial dataset.

Phishing attack detection was investigated by Alam et al. (2020) [7], who used decision tree and random forest algorithms for the classification of attacks. The dataset, which came from Kaggle, had 30 significant features for identifying phishing URLs. A detailed preprocessing step was carried out to obtain clean and noise-free data, followed by feature selection using algorithms like PCA. The performance of each algorithm was analyzed in terms of confusion matrices and the following performance measures: accuracy, precision, recall, and F1-score. Random forests outperformed decision trees (DTs), offering 97% accuracy compared to 91.94% for DTs, and dealt with overfitting and variability issues effectively. The study asserted that random forests and other ensemble approaches, along with techniques used to filter spam in web-based search, substantially assist phishing detection on large datasets.

Rashid et al. (2020) [8] presented a machine learning approach for phishing detection that uses Support Vector Machines (SVM) for classification. The dataset, obtained from repositories such as PhishTank and Alexa, consists of valid and phishing URLs, together with internal features, such as the length of the URL, and external features that are derived from third-party services. Principal Component Analysis (PCA) was performed for dimensionality reduction to facilitate more efficient processing. The model achieved 95.66% accuracy using SVM with only five features, higher than that achieved using other techniques, for example Random Forest, which showed an accuracy of 94.27% with 30 features. This reduction in the feature set improved computational efficiency while maintaining good detection rates. The authors indicated that their solution is robust at identifying new and transient phishing sites and presented it as a practical countermeasure against cyber threats.

The research by Vahid Shahrivari and Mohammad Mahdi Darabi [9] deals with the application of various machine learning algorithms for the detection of phishing websites. This research uses a dataset of 30 features, such as IP address presence, URL length, whether shortening services are used, and SSL state, among others. Characteristics common to such URL layouts are employed to distinguish phishing websites from legitimate ones. Logistic regression, decision tree, random forest, AdaBoost, KNN, SVM, gradient boosting, XGBoost, and neural networks were the machine learning algorithms tried out. Besides accuracy, precision, recall, and F1 score are also used to assess the performance of the different models. XGBoost proved most accurate at 98.32%, Random Forest came second at 97.27%, and the Neural Network also performed well, achieving 96.98% accuracy. The authors concluded that ensemble methods such as Random Forest and XGBoost are good at detecting phishing websites due to their high accuracy and robustness. They stressed the usefulness of employing multiple features and suggested that one way to enhance detection performance might be coupling machine learning models with other phishing detection methods. This work illustrates the potential of machine learning to help discern phishing websites, and its further promise of improvement with hybrid models and novel features.

Jitendra Kumar et al. [10] described the training of Logistic Regression, Naive Bayes, Random Forest, Decision Tree, and K-Nearest Neighbor classifiers using features derived from the lexical structure of URLs. They carefully created a dataset to avoid common problems like data imbalance, biased training, variance, and overfitting. The preprocessed dataset was evenly split between phishing and trusted URLs and was further divided in a 70:30 ratio for training and testing. All classifiers had similar AUC (Area Under Curve) values, but the Naive Bayes classifier was reported as the best performer with the highest AUC value. It achieved an accuracy of 98% with precision of 1, recall of 0.95, and F1-score of 0.97. The study thus makes a point regarding the importance of a balanced dataset and emphasizes Naive Bayes as a strong candidate for phishing detection.

Kulkarni and Brown (2019) [11] studied the detection of phishing websites with machine learning techniques. The dataset was obtained from the University of California, Irvine Machine Learning Repository and contains 1353 URLs labeled as phishing, suspicious, and legitimate. Nine features were extracted from URLs, including URL length, age of domain, presence of an IP address, and others. Four classifiers were applied: Decision Tree, Support Vector Machine
(SVM), Naïve Bayes, and Neural Network. The accuracy achieved by the Decision Tree classifier was 91.5%, with a True Positive Rate (TPR) of 90.97% and a False Positive Rate (FPR) of 7.81%. The SVM was behind, achieving an accuracy of 86.69%, and Naïve Bayes and Neural Network trailed slightly at 86.14% and 84.87%, respectively. The study stated that Decision Trees are quite good with discrete feature values, but they need pruning to deal with overfitting. The authors concluded that more features and larger datasets would improve classifier performance and recommended ensemble methods and rule-based approaches for future work.

Rishikesh Mahajan and Irfan Siddavatam [12] examined three classification algorithms: Decision Tree, Random Forest, and Support Vector Machine. The dataset was constructed from 17,058 benign URLs from Alexa and 19,653 phishing URLs from PhishTank, all with 16 features. The data were partitioned into training and testing sets with proportions of 50:50, 70:30, and 90:10. The performance was judged according to accuracy, false negative rate, and false positive rate. Random Forest stood out as the best algorithm, achieving 97.14% accuracy with the lowest false negative rate. Their conclusion was that the more data used for training, the better the accuracy.

4. DATASETS

The datasets have been collected from various sites such as PhishTank [13], Alexa, etc., which hold data about phishing websites and keep it updated. The datasets contain all features and their respective values.

5. FEATURE EXTRACTION

URLs have certain characteristics and patterns that can be considered as their features. In URL-based analysis for designing machine learning models, we need to extract these features in order to form a dataset that can be used for training and testing. There are four categories of features that are most commonly considered for feature extraction, as in [9]. They are as follows:
1) Address Bar based features
2) Abnormal based features
3) HTML and JavaScript based features
4) Domain based features

Address Bar Based Features

1) Having IP Address: An IP address is used instead of the domain name in the URL, such as http://217.102.24.235/sample.html
2) URL Length: Phishers can use a long URL to hide the doubtful part in the address bar.
3) Shortening Service: A short link points to a webpage that has a long URL. For example, the URL http://sharif.hud.ac.uk/ can be shortened to bit.ly/1sSEGTB.
4) Having @ Symbol: Using the @ symbol in the URL leads the browser to ignore everything preceding the @ symbol, and the real address often follows the @ symbol.
5) Double Slash Redirection: The existence of // within the URL path, which means that the user will be redirected to another website.
6) Prefix Suffix: Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For example, http://www.Confirme-paypal.com.
7) Having Sub Domain: The URL contains subdomains.
8) SSL State: Shows whether the website uses SSL.
9) Domain Registration Length: Based on the fact that a phishing website lives for a short period.
10) Favicon: A favicon is a graphic image (icon) associated with a specific webpage. If the favicon is loaded from another domain, then the webpage is likely to be considered a phishing attempt.
11) Using Non-Standard Port: To control intrusions, it is much better to open only the ports that are needed. Several firewalls, proxy, and Network Address Translation (NAT) servers will, by default, block all or most of the ports.
12) HTTPS token: Having a deceiving https token in the URL. For example, http://https-www-mellat-phish.ir

Abnormal Based Features

13) Request URL: Examines whether the external objects contained within a webpage, such as images, videos, and sounds, are loaded from another domain.
14) URL of Anchor: An anchor is an element defined by the <a> tag. This feature is treated exactly as Request URL.
15) Links In Tags: It is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document, <Script> tags to create a client-side script, and <Link> tags to retrieve other web resources.
16) Server Form Handler: Whether the domain name in the server form handler (SFH) is different from the domain name of the webpage.
17) Submitting Information To E-mail: A phisher might redirect the user's information to his e-mail.
18) Abnormal URL: It is extracted from the WHOIS database. For a legitimate website, identity is typically part of its URL.

HTML & JavaScript Based Features

19) Website Redirect Count: Whether the page is redirected more than four times.
20) Status Bar Customization: Use of JavaScript to show a fake URL in the status bar to users.
21) Disabling Right Click: It is treated exactly as using onMouseOver to hide the link.
22) Using Pop-up Window: Whether the webpage shows pop-up windows.
23) IFrame: IFrame is an HTML tag used to display an additional webpage inside the one that is currently shown.
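
The HTML and JavaScript based features above can be approximated with simple checks on the page source. The following is a minimal sketch only, not the extraction method of any reviewed paper: the function name `html_js_features`, the `redirect_count` parameter, and the string patterns (`window.status`, `event.button == 2`, `window.open(`) are illustrative assumptions about how such checks might be written.

```python
import re
from html.parser import HTMLParser


class _TagScanner(HTMLParser):
    """Counts <iframe> tags and onmouseover handlers that touch window.status."""

    def __init__(self):
        super().__init__()
        self.iframes = 0
        self.status_bar_handlers = 0

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes += 1
        for name, value in attrs:
            # Assumption: an onmouseover handler that sets window.status
            # is treated as status-bar customization.
            if name == "onmouseover" and value and "window.status" in value:
                self.status_bar_handlers += 1


def html_js_features(html: str, redirect_count: int = 0) -> dict:
    """Return boolean flags for features 19-23 (heuristic, illustrative only)."""
    scanner = _TagScanner()
    scanner.feed(html)
    return {
        # Feature 19: more than four redirects (count supplied by the caller).
        "many_redirects": redirect_count > 4,
        # Feature 20: fake URL shown in the status bar.
        "status_bar_customization": scanner.status_bar_handlers > 0,
        # Feature 21: assumed pattern for scripts that block right click.
        "right_click_disabled": bool(re.search(r"event\.button\s*==\s*2", html)),
        # Feature 22: assumed pattern for opening a pop-up window.
        "popup_window": bool(re.search(r"window\.open\s*\(", html)),
        # Feature 23: page embeds another page through an iframe.
        "iframe": scanner.iframes > 0,
    }
```

Each flag would become one column of the training dataset. Real pages can hide these behaviors behind obfuscated or dynamically loaded scripts, so string matching of this kind is a starting point rather than a reliable detector.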

Domain Based Features

24) Age of Domain: Whether the age of the domain is less than a month.
25) DNS Record: Whether the domain has a DNS record.
26) Web Traffic: This feature measures the popularity of the website by determining the number of visitors.
27) Page Rank: Page rank is a value ranging from 0 to 1. PageRank aims to measure how important a webpage is on the Internet.
28) Google Index: This feature examines whether a website is in Google's index or not.
29) Links Pointing To Page: The number of links pointing to the web page.
30) Statistical Report: Whether the IP belongs to the top phishing IPs or not.

5.1 LEXICAL STRUCTURE OF A URL [10]

The structure of a URL can reveal a lot of hidden information. A URL starts with a protocol name like HTTP or HTTPS. The fully qualified domain name (FQDN) is the complete domain name of the server hosting the website, which is then translated into an IP address using DNS servers. The domain name consists of a second-level domain (SLD) and a top-level domain (TLD). This domain name is unique and registered with a domain registrar.

Fig 2: Lexical structure of a URL [10]

Let us consider this URL:

http://amazon.com-verification
accounts.darotob.com/Sign-in/5b60fcc60b36d1c3d

The lexical analysis of the above URL reveals the parts shown in Fig 2. Attackers obfuscate the URL in such a way that the actual domain name is nested deep inside the URL and might not be easily recognized by a normal user.
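
Several of the address bar features, and the lexical split into host and domain, can be computed from the URL string alone. The sketch below is illustrative and makes stated assumptions: the function name `address_bar_features` and the feature keys are hypothetical, and the "registered domain" is guessed naively as the last two host labels, which is wrong for suffixes such as `co.uk` (a real system would consult the Public Suffix List).

```python
import ipaddress
from urllib.parse import urlsplit


def address_bar_features(url: str) -> dict:
    """Extract a few lexical features from a URL string (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # lower-cased host, without userinfo or port

    # Feature 1: is the host a literal IP address?
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False

    labels = host.split(".") if host and not has_ip else []
    after_scheme = url.split("://", 1)[-1]

    return {
        "host": host,
        "has_ip_address": has_ip,
        "url_length": len(url),                          # feature 2
        "has_at_symbol": "@" in url,                     # feature 4
        "double_slash_redirect": "//" in after_scheme,   # feature 5
        "prefix_suffix": "-" in host,                    # feature 6
        "subdomain_labels": max(len(labels) - 2, 0),     # feature 7
        "non_standard_port": parts.port not in (None, 80, 443),  # feature 11
        "https_token_in_host": "https" in host,          # feature 12
        # Naive assumption: the registered domain is the last two labels.
        "naive_registered_domain": ".".join(labels[-2:]) if labels else host,
    }
```

For a URL such as `http://paypal.com.secure-login.example/signin` (a made-up address), the familiar brand name appears first, but the naive registered domain is `secure-login.example`, which is the kind of nesting the paragraph above describes.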
TABLE I. RESULT ANALYSIS

| Paper | Model Used | Suitable Models | Accuracy score |
|---|---|---|---|
| [2] | XGBoost, LightGBM, Graph Neural Network (GNN) and CatBoost applied. Performance evaluated using accuracy, precision, recall and F1-score. | LightGBM gave highest accuracy with precision 0.93 and recall score 0.93 | XGBoost: 92.09%, LightGBM: 93.29%, GNN: 70%, CatBoost: 92.98% |
| [3] | Decision Tree (DT), Random Forest (RF), and Gradient Boosting (GB) | GB produces reliable results in terms of accuracy, precision, recall, and F1 score. | GB: 98.9%, RF: 96.9%, DT: 96.0% |
| [4] | Combined blacklisting with ML algorithms: XGBoost, RF, DT, and Multilayer Perceptrons applied to a dataset with features; phishing URLs collected from Phishtank and OpenPhish. | Among these, XGBoost was found to be the most accurate model. | XGBoost: 96.7%, RF: 92.5%, DT: 90.5%, Multilayer Perceptrons: 88% |
| [5] | Support Vector Machine (SVM) and Naïve Bayes (NB) with features based on maximum relevance with minimum redundancy. Phishtank (2,541 phishing URLs) and Alexa (2,500 legitimate URLs) datasets. | Both Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers have TPR of 99.96, FNR of 0.04, TNR of 99.96, and FPR of 0.04 | SVM: 99.96%, NB: 99.96% |
| [6] | Random Forest (RF), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), Decision Tree (DT), Naive Bayes (NB), XGBoost | Random Forest (RF) was the best-suited model based on its highest accuracy and overall performance in detecting phishing URLs. | RF: 94.59%, ANN: 94.35%, XGBoost: 92.95%, DT: 92.59%, KNN: 91.49%, LR: 91.31%, NB: 88.35%, SVM: 87.03% |
| [7] | DT and RF applied to a Kaggle dataset with 30 features. PCA used for feature selection. Performance evaluated using accuracy, precision, recall, and F1-score. | RF outperformed DT, addressing overfitting and variability effectively | RF: 97%, DT: 91.94% |
| [8] | Applied SVM on data from PhishTank and Alexa, with internal and external features, and PCA for dimensionality reduction | RF and NB classifiers had better accuracies. In terms of AUC, Gaussian Naive Bayes had a slightly higher value of 0.991. | SVM: 95.66%, RF: 94.27% |
| [9] | The examined classifiers are Logistic Regression, Decision Tree, Support Vector Machine, Ada Boost, Random Forest, Neural Networks, KNN, Gradient Boosting, and XGBoost. | Very good performance in ensembling classifiers, namely Random Forest and XGBoost, both on computation duration and accuracy | Logistic regression: 92.6%, Decision tree: 96.5%, Random forest: 97.2%, Adabooster: 93.6%, KNN: 95%, SVM: 94.9%, Gradient boosting: 94.8%, XGBoost: 98.3% |
| [10] | A balanced dataset was utilized to train classifiers such as Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and k-Nearest Neighbors (k-NN), using features derived from the lexical structure of URLs. | Random Forest and Naive Bayes demonstrated superior accuracy | Random Forest: 98.03%, Gaussian Naive Bayes: 97.18% |
| [11] | Four classifiers (DT, SVM, Naïve Bayes, Neural Network) applied to a UCI dataset with 1,353 labeled URLs and 9 extracted features | DT performed best with 91.5% accuracy but required pruning to address overfitting. Ensemble methods were recommended | DT: 91.5%, SVM: 86.69%, Naïve Bayes: 86.14%, Neural Network: 84.87% |
| [12] | The dataset was divided into split ratios of 50:50, 70:30, and 90:10. Decision Tree (DT), Random Forest (RF), and (SVM) classifiers were applied. | The Random Forest classifier demonstrated superior accuracy and the lowest false negative rate. | 50:50 split ratio: 96.72%, 70:30 split ratio: 96.84%, 90:10 split ratio: 97.14% |

6. PERFORMANCE EVALUATION METRICS
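
As a quick reference for the evaluation metrics covered in this section, the sketch below computes them from the four confusion-matrix counts (true and false positives and negatives, defined in the text that follows). It assumes the standard textbook definitions; the function name `classification_metrics` is illustrative, and the counts in the usage note are made-up numbers, not results from any reviewed paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute common metrics from confusion-matrix counts.

    tp: phishing URLs correctly flagged as phishing
    tn: legitimate URLs correctly passed as legitimate
    fp: legitimate URLs wrongly flagged as phishing
    fn: phishing URLs wrongly passed as legitimate
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also the true positive rate
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }
```

For example, with hypothetical counts `tp=90, tn=95, fp=5, fn=10`, accuracy is 185/200 = 0.925 and recall is 90/100 = 0.9. The ROC curve is obtained by plotting recall against the false positive rate as the classifier's decision threshold is varied.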
A selected set of parameters is used to evaluate the performance of the system. The associated metrics are Accuracy, Precision, Recall, F1 Score, and the ROC curve, all derived from the values of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

In the context of URL classification:
True Positive (TP): The number of phishing URLs correctly detected as phishing.
True Negative (TN): The number of legitimate URLs correctly detected as legitimate.
False Positive (FP): The number of legitimate URLs incorrectly classified as phishing.
False Negative (FN): The number of phishing URLs incorrectly classified as legitimate.

A Confusion Matrix arranges these values to indicate the performance of the classification model. [10]

7. OBSERVATIONS

Phishing attacks are constantly evolving, and the cyber world is often hit by new types of attacks. Hence no particular detection approach or algorithm can be tagged as the best one giving exact results. Through the literature survey, it is evident that Random Forest gives better results in most scenarios. However, the performance of each algorithm varies depending on the dataset used, the train-test split ratio, the feature selection techniques applied, etc. Researchers prefer to create machine learning models that perform phishing detection with the best values for the evaluation parameters and the least training time. Therefore, future work should focus on these aspects of phishing detection.

8. CONCLUSION

Due to the greater demand for the security of personal, financial, and professional data in this digital era, phishing detection has become a highly critical area of research. URL-based analysis is one of the ways to enhance both detection speed and detection accuracy. By extracting features from the given URL and applying feature selection and dimensionality reduction techniques, models are refined by eliminating unnecessary data and focusing on the most informative features. Numerous machine learning algorithms have shown strong performance on phishing URL classification, including Random Forest, XGBoost, and Support Vector Machines. In this paper, we retrospectively examined phishing detection, focusing on different methodologies and their performance. The review builds a good basis for future researchers taking their next step at improving phishing detection systems.

REFERENCES

[1] 2023 Internet Crime Report, FBI. Retrieved from: https://www.ic3.gov/Media/PDF/AnnualReport/2023_IC3Report.pdf

[2] Dr. Nitin N. Sakhare, Jyoti L. Bangare, Dr. Radhika G. Purandare, Disha S. Wankhede, Pooja Dehankar, “Phishing Website Detection Using Advanced Machine Learning Techniques”, International Journal of Intelligent Systems and Applications in Engineering, 2024.

[3] Sucharitha, B., Chandini, B., Kumar, D. S., Surendra, M., & Kumar, G. K. (2024). Detecting phishing websites using machine learning. IJARCCE, 13(4). https://doi.org/10.17148/ijarcce.2024.134145

[4] Machikuri Santoshi Kumari, Chiguru Keerthi Priya, Gondhi Bhavya Haridas Neha, Monisha Awasthi, Surendra Tripathi, “Viable Detection of URL Phishing using Machine Learning Approach”, 15th International Conference on Materials Processing and Characterization (ICMPC 2023).

[5] A. A. Orunsolu, A. S. Sodiya, and A. T. Akinwale, “A predictive model for phishing detection,” Journal of King Saud University – Computer and Information Sciences, vol. 34, no. 2, pp. 232–247, 2022.

[6] Korkmaz, M., Sahingoz, O. K., & Diri, B. (2020). Detection of Phishing Websites by Using Machine Learning-Based URL Analysis. Presented at the 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 1-3, 2020, IIT Kharagpur, India. IEEE.

[7] Mohammad Nazmul Alam, Dhiman Sarma et al., “Phishing attacks detection using machine learning approach,” 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.

[8] Junaid Rashid, “Phishing Detection Using Machine Learning Technique”, First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), 2020.

[9] Vahid Shahrivari, Mohammad Mahdi Darabi, Mohammad Izadi, “Phishing Detection Using Machine Learning Techniques”, arXiv preprint arXiv:2009.11116, 2020. Retrieved from arXiv.

[10] Jitendra Kumar, A. Santhanavijayan, B. Janet, Balaji Rajendran, and Bindhumadhava BS, “Phishing website classification and detection using machine learning,” International Conference on Computer Communication and Informatics (ICCCI), 2020.

[11] Arun Kulkarni, Leonard L. Brown, “Phishing Websites Detection using Machine Learning”, IJACSA International Journal of Advanced Computer Science and Applications, Vol. 10, No. 7, 2019.

[12] Rishikesh Mahajan and Irfan Siddavatam, “Phishing website detection using machine learning algorithms,” International Journal of Computer Applications (0975-8887), vol. 181, no. 23, 2018.

[13] PhishTank: https://phishtank.org/
