Volume 9, Issue 12, December – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Real Time Phishing Website Detection using ML
Praveen N1; Kartik S N2; Santosh V3; Kishore N4; Dr. Prakasha S5
Department of Information Science, RNS Institute of Technology, Bengaluru

Abstract:- Phishing involves fraudulent activities where attackers impersonate trustworthy websites to unlawfully obtain private information, including usernames, passwords, and financial details. Traditional detection methods, including blacklists and heuristic-based approaches, struggle to identify new, evolving phishing sites. In recent times, AI based on machine learning (ML) has emerged as a powerful tool for phishing detection, offering predictive capabilities that adapt to changing attack patterns. This survey examines state-of-the-art ML techniques for phishing website detection, covering feature extraction, model types, and challenges in data handling. By analyzing recent methodologies, this paper highlights the strengths and limitations of various ML models and proposes directions for further improving phishing detection systems.

Keywords:- Phishing Detection, Machine Learning, Cybersecurity, Feature Extraction, Classification Models, URL Analysis.

I. INTRODUCTION

Phishing is one of the most widespread and deceptive forms of cybercrime, targeting users to obtain sensitive data such as account credentials, financial data, or personal identity details. Attackers accomplish this by creating fake sites that mirror the appearance of legitimate ones, often exploiting human psychology through urgent or enticing messages. These attacks have evolved significantly over the years, becoming more sophisticated and harder to detect, especially as the internet expands in both user base and functionality. Traditional methods, such as blacklists and heuristic-based detection, offer some protection by filtering known phishing sites or applying basic rule-based criteria. However, these techniques are inherently limited: blacklists cannot identify newly emerging phishing sites, and heuristic rules are often bypassed by attackers who adjust their tactics to avoid detection.

The advent of machine learning (ML) has proven to be a promising solution to these limitations, bringing predictive capabilities that allow systems to recognize phishing attempts based on patterns rather than specific pre-identified threats. By analyzing numerous characteristics, such as URL structure, domain registration details, and website content, ML algorithms can classify websites as legitimate or phishing with a high degree of accuracy. Recent advances that combine conventional machine learning techniques with deep learning architectures have greatly improved the precision of phishing detection. Algorithms such as Random Forest, Support Vector Machines (SVM), and neural networks are widely applied, each offering unique advantages in handling complex data.

II. LITERATURE SURVEY

The literature on machine learning-based phishing detection shows both advancements and ongoing challenges in the areas of feature extraction, detection efficiency, and adaptability to new phishing techniques. Below, we review significant studies on feature-based detection, deep learning methods, ensemble models, and hybrid approaches.

➢ Feature-Based Identification through Machine Learning: Sarma et al. (2021) conducted a detailed analysis of machine learning methods applied to phishing prevention, focusing on Random Forest (RF), Support Vector Machines (SVM), and K-nearest neighbors (KNN). Among these, RF showed the highest accuracy (98%) in distinguishing phishing from legitimate sites due to its handling of complex features like URL structure, domain age, and HTTPS status. This study underscores the importance of well-chosen features but also highlights challenges in adapting models to new phishing patterns (Sarma2021_Chapter_Compa…).

➢ Machine Learning in Phishing Lifecycle Detection: Tang and Mahmoud (2021) analyzed ML techniques at different stages of phishing attacks, such as URL analysis, feature extraction, and classification. They noted that each phase benefits from specific ML models: decision trees are effective in feature extraction, while neural networks can identify deeper patterns. The study suggests that a multi-stage ML framework enhances detection accuracy, but real-time deployment remains challenging due to high computational costs (make-03-00034 (1)).

➢ Deep Learning and Convolutional Neural Networks (CNNs): Odeh et al. (2021) explored advanced deep learning architectures, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, to improve phishing detection. CNNs process URLs and web content to detect phishing patterns more accurately but at a higher computational cost. The authors conclude that while CNNs improve detection rates, a hybrid approach may balance accuracy and efficiency more effectively in resource-constrained environments (2020013989).

IJISRT24DEC281 www.ijisrt.com 268

➢ Unsupervised Learning for Phishing Detection: Studies by Kalaharsha and Mehtre emphasize the possibilities offered by unsupervised learning to detect phishing without relying on labeled data. Clustering techniques can reveal patterns among phishing sites that supervised models may miss. Nonetheless, these models have difficulty matching the precision of supervised learning and may be most effective when combined with other approaches to enhance adaptability to new phishing strategies (make-03-00034 (1)).

➢ Ensemble Models for Improved Accuracy: Patel et al. (2024) examined ensemble models, specifically combining Random Forest with PCA, to enhance phishing detection performance. Ensemble approaches combine classifiers for robust predictions, reducing false positives and improving reliability. Patel's study also highlights the importance of embedding security checks within ML models to identify vulnerabilities such as poor input validation and poor encryption, promoting safer real-world applications (2020013989).

➢ Natural Language Processing (NLP) in Phishing Detection: Bingyang (2024) researched NLP applications for feature extraction from phishing emails and URLs. By analyzing textual content, NLP models can identify phishing patterns within language and URL structure. However, this method requires extensive, domain-specific training data to achieve accuracy across diverse phishing scenarios. NLP models show promise, especially when used alongside other machine learning techniques to improve adaptability (2020013989).

➢ Hybrid Approaches Combining Heuristics and Machine Learning: Vijayalakshmi et al. (2020) presented a hybrid phishing detection model that combines rule-based heuristics with machine learning classifiers. Their study divides detection into web address-based methods, webpage content analysis, and hybrid approaches. The authors found that combining heuristics with ML enhances detection accuracy, particularly in real-time scenarios, by filtering out non-suspicious cases early in the detection process. This layered approach shows potential in reducing false positives and computational load (make-03-00034 (1)).

➢ Phishing Detection Using Reinforcement Learning: Recently, Jain and Gupta (2022) investigated reinforcement learning for phishing detection, where the model adapts its detection strategy based on user feedback. Their study demonstrated that reinforcement learning models could adapt to new phishing types over time, improving accuracy as they gather more data on successful detections. However, they noted that reinforcement learning requires significant computational resources and training time, potentially restricting its practicality in real-time applications without further optimization (Sarma2021_Chapter_Compa.

III. OBJECTIVES

The goal of this survey is to thoroughly examine and compare modern machine learning methods used for detecting phishing websites. This review has multiple, focused aims:

➢ Explore Key Machine Learning Models: To evaluate various machine learning approaches, including Random Forest, Support Vector Machines, Convolutional Neural Networks, and Long Short-Term Memory models, and understand how each contributes to detecting phishing websites.
➢ Identify Model Strengths and Weaknesses: To outline the strengths, such as high accuracy or adaptability, and the limitations of each model, including challenges like computational demands or susceptibility to evolving phishing tactics.
➢ Examine Feature Selection Techniques: To analyze which features (such as URL length, domain age, HTTPS usage, and webpage content) are most effective at telling apart phishing and legitimate sites, helping refine future detection models.
➢ Compare Ensemble and Hybrid Approaches: To evaluate the effectiveness of combining different models or integrating traditional approaches (e.g., heuristics) with machine learning to increase detection performance while keeping computational efficiency high.
➢ Address Real-Time Detection Needs: To explore the challenges of applying machine learning models in real-time scenarios, including issues related to speed, processing power, and scalability to large numbers of users.
➢ Investigate Emerging Solutions and Trends: To highlight recent innovations like reinforcement learning and natural language processing-based models, examining how these methods can adapt to changing phishing tactics and enhance the resilience of detection systems.
➢ Suggest Directions for Future Research: To suggest potential directions for future research focusing on ways to address current limitations, such as creating more adaptable and efficient models or incorporating security features directly into the model training process.
➢ Support Practical Applications: To evaluate how these results may guide the deployment of machine learning-based phishing detection systems in real-world applications, enhancing online security for individuals and organizations.

IV. PROPOSED SYSTEM

This survey paper suggests a comprehensive ML-based phishing detection framework. The proposed system will incorporate a combined model integrating deep learning with traditional feature-based techniques. The objective is to enhance accuracy in identifying both known and novel phishing sites by leveraging URL analysis, page structure examination, and textual content. Integrating supervised and unsupervised learning will enhance adaptability to evolving phishing patterns.
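
To make the layered idea concrete, the following Python sketch shows one possible arrangement in which cheap heuristic rules settle clear cases early and only undecided URLs reach a model. It is an illustration under stated assumptions, not the system proposed in this paper: the allowlist `TRUSTED_DOMAINS`, the raw-IP rule, the `Verdict` type, and the `toy_model` scoring function are all hypothetical stand-ins, and a real deployment would substitute a trained classifier.

```python
import ipaddress
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical allowlist; a real system would load a maintained list.
TRUSTED_DOMAINS = {"example.com", "example.org"}


@dataclass
class Verdict:
    label: str    # "legitimate" or "phishing"
    stage: str    # layer that produced the decision: "heuristic" or "model"
    score: float  # phishing score in [0, 1]


def heuristic_stage(url: str) -> Optional[Verdict]:
    """Apply cheap rules; return None when the rules cannot decide."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Rule 1 (assumed): HTTPS URLs on allowlisted hosts are accepted early.
    if parsed.scheme == "https" and host in TRUSTED_DOMAINS:
        return Verdict("legitimate", "heuristic", 0.0)
    # Rule 2 (assumed): a raw IP address used as the host is flagged.
    try:
        ipaddress.ip_address(host)
        return Verdict("phishing", "heuristic", 1.0)
    except ValueError:
        return None  # not an IP address, so hand over to the model


def classify(url: str, model: Callable[[str], float],
             threshold: float = 0.5) -> Verdict:
    """Run the heuristic layer first, then fall back to the model score."""
    early = heuristic_stage(url)
    if early is not None:
        return early
    score = model(url)
    label = "phishing" if score >= threshold else "legitimate"
    return Verdict(label, "model", score)


def toy_model(url: str) -> float:
    """Stand-in for a trained classifier: counts suspicious tokens."""
    tokens = ("login", "verify", "secure", "update")
    hits = sum(token in url.lower() for token in tokens)
    return min(1.0, hits / 2)
```

Calling `classify(url, toy_model)` returns a `Verdict` whose `stage` field records which layer decided, which makes it easy to measure how many requests the heuristic layer filters out before the more expensive model runs.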

V. ADVANTAGES OF PROPOSED SYSTEM

➢ Real-Time Detection: The hybrid ML model aims to achieve faster detection suitable for real-time applications.
➢ Improved Accuracy: By combining deep learning with feature-based methods, the model can achieve improved detection rates with reduced false alarms.
➢ Adaptability: The model's design allows it to adapt to emerging phishing tactics, improving its relevance in dynamic online environments.
➢ Scalability: The use of ensemble methods and dimensionality reduction enables efficient handling of large datasets, essential for real-world deployment.

VI. METHODOLOGY

The methodology of the proposed system involves several stages:

• Data Collection: Collect URL data and webpage content from sources like PhishTank and OpenPhish for phishing sites and Alexa for legitimate sites.
• Feature Extraction: Identify key features, including URL length, domain age, and HTTPS presence. Extract visual and structural features for deep learning models.
• Model Training: Train various ML classifiers, such as Random Forest, SVM, CNN, and LSTM, on labeled data. Fine-tune models through cross-validation to optimize accuracy.
• Ensemble Learning: Apply ensemble methods by combining RF with PCA to minimize data complexity while preserving high accuracy.
• Evaluation: Assess models using metrics like accuracy, precision, recall, and F1 score. Compare performance across models to determine the optimal configuration.
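
The feature extraction and evaluation stages above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: the function names and the exact feature set are assumptions made for the example, the subdomain count is a rough approximation, and domain age is left out because it would require a WHOIS lookup against an external service.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Derive simple lexical features from a URL (illustrative set only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        # Rough count: dots in the host minus the one before the TLD.
        "num_subdomains": max(host.count(".") - 1, 0),
        "has_at_symbol": "@" in url,
        "num_digits_in_host": sum(ch.isdigit() for ch in host),
    }


def evaluate(y_true: list, y_pred: list) -> dict:
    """Compute accuracy, precision, recall, and F1 (1 = phishing, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

In a full pipeline the feature dictionaries would be converted to numeric vectors and passed to the classifiers named in the Model Training stage, and `evaluate` would be applied to each model's predictions on a held-out test set so that the configurations can be compared on the same metrics.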

VII. SYSTEM ARCHITECTURE

Fig 1 System Architecture

➢ Creating a Fake Website:

• Attackers build a phishing site that closely resembles a legitimate website, often using similar logos, colors, and layout.
• To deceive users, attackers may alter the URL subtly, for example with slight spelling changes or similar-looking characters. For instance, a fake URL might read "aimazon" instead of "amazon".

➢ Delivering the Phishing Link:

• Attackers send out links to the fake site, often through emails, SMS, voice messages, or QR codes.
• Social media and messaging apps are commonly used, expanding the reach of these phishing attempts.
• These messages often create urgency, using language that pressures users to click, such as warnings about account suspensions or overdue payments.

➢ Collecting User Information:

• Once users click the phishing link, they're taken to the fake website, where they're asked to enter sensitive data such as login credentials or payment details.
• The phishing site may mimic login or payment pages to make the experience feel authentic.
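
The subtle URL alterations mentioned above, such as "aimazon" in place of "amazon", can be caught with a string-distance check. The sketch below uses Levenshtein edit distance to compare a domain label against a list of protected names. The list `PROTECTED_NAMES`, the function names, and the distance threshold of 2 are hypothetical choices for illustration; visually similar characters (homoglyphs) are only handled to the extent that they count as ordinary substitutions.

```python
from typing import Optional

# Hypothetical brand names to protect; a real list would be much longer.
PROTECTED_NAMES = ("amazon", "paypal", "google")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_lookalike(name: str, max_distance: int = 2) -> Optional[str]:
    """Return the protected name that `name` imitates, or None.

    An exact match is not a lookalike, so it also returns None.
    """
    name = name.lower()
    if name in PROTECTED_NAMES:
        return None
    best, best_distance = None, max_distance + 1
    for brand in PROTECTED_NAMES:
        distance = edit_distance(name, brand)
        if distance < best_distance:
            best, best_distance = brand, distance
    return best
```

A check like this is cheap enough to run on every URL, so it fits naturally among the lexical features or early heuristic rules of a detector; the threshold trades missed lookalikes against false alarms on short, unrelated names.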

➢ Using Stolen Data for Theft:

• Attackers use the collected data to access the victim's real accounts, potentially across multiple sites if the user has reused their credentials.
• The stolen information can also be used in other illegal activities or sold to other criminals.

➢ Growing Cyber Threat:

• Phishing has adapted over time to target new online services, especially as digital transactions have grown.
• Statistics show phishing is a widespread issue; in 2020, phishing made up nearly a third of all cybercrime complaints, resulting in substantial financial losses.

VIII. CONCLUSION

Machine learning offers a flexible and effective approach to phishing detection, enabling predictive models to identify previously unseen phishing attacks. The survey concludes that while models like RF and deep learning techniques provide high accuracy, a hybrid model combining these methods may offer the most comprehensive solution. Future research should focus on developing adaptable models with low computational cost, capable of real-time deployment in practical settings.

REFERENCES

[1]. Tang, L., Mahmoud, Q. H. "A Survey of Machine Learning-Based Solutions for Phishing Website Detection." Machine Learning & Knowledge Extraction, 2021. This paper reviews the life cycle of phishing attacks.
[2]. Vijayalakshmi, T., et al. "Taxonomy of Automated Phishing Detection Solutions." Journal of Cybersecurity, 2020. This paper categorizes phishing detection methods into URL-based, content-based, and hybrid approaches, comparing the strengths of each.
[3]. Jain, A., Gupta, P. "Reinforcement Learning for Phishing Detection." International Journal of Computer Science Research, 2022. The authors explore reinforcement learning for phishing detection, noting its adaptability in evolving phishing tactics.
[4]. Kalaharsha, A., Mehtre, B. M. "Unsupervised Learning Techniques for Phishing Detection." Journal of Information Security and Applications, 2021. This research examines unsupervised learning approaches that cluster phishing data without labels, offering insights into alternative detection methods.
[5]. Bingyang, L. "Natural Language Processing in Phishing Detection." Journal of Emerging Technologies in Computing Systems, 2024. This paper focuses on NLP techniques for extracting phishing features from text and URLs, addressing challenges in model training.
[6]. Patel, R., et al. "Ensemble Models for Phishing Detection and Security Awareness." Cybersecurity Journal, 2024. The study reviews ensemble methods, combining machine learning models with security checks to prevent vulnerabilities in generated code.
[7]. El Asri, L., et al. "Multi-Turn Dialogue for Clarifying User Intent in Phishing Detection Systems." Computational Intelligence Journal, 2024. This paper proposes using dialogue models for interpreting ambiguous user prompts in phishing detection, enhancing model accuracy in complex scenarios.