Dept : Information Technology
Title
Deep Learning Based Phishing Detection
System Using URLs and Website Content
Date of Presentation : 19-03-2025
Team Members:
Abhitej Reddy 160121737054
N Abhishek 160121737305
Kadali Sathvik 160121737034
Project Guide:
Mr. G Srikanth
Assistant Professor
ABSTRACT
Phishing websites pose a significant cybersecurity threat by deceiving users into providing sensitive information such as login
credentials and financial data. Traditional detection methods, primarily based on regular expressions and limited feature sets,
often fail to keep up with evolving phishing techniques. This study presents an advanced machine learning-based approach that
utilizes 32 distinct features to enhance phishing detection accuracy. Unlike conventional models that rely solely on URL
analysis, our system incorporates domain attributes, webpage content, and behavioral indicators to provide a comprehensive
classification framework. By leveraging feature selection techniques and machine learning algorithms, our model effectively
distinguishes between legitimate and fraudulent websites. Exploratory data analysis, including correlation heatmaps and class
distribution visualizations, has been employed to understand feature importance and improve classification
efficiency. Performance evaluation using accuracy, precision, recall, and F1-score demonstrates the effectiveness of our
approach. The proposed system surpasses traditional methods by offering a scalable, adaptable, and robust phishing detection
mechanism. Future work will focus on expanding the dataset, optimizing model parameters, and deploying a web-based
application to enable real-time phishing detection, making cybersecurity more accessible and reliable.
INTRODUCTION
Growing Cyber Threat: Phishing attacks have become a major cybersecurity challenge, with attackers creating fake
websites to steal sensitive user data, including passwords and financial information.
Limitations of Traditional Methods: Earlier detection techniques primarily relied on regular expressions and limited feature
sets, making them ineffective against evolving phishing tactics.
Need for Advanced Detection: A more data-driven and feature-rich approach is required to accurately differentiate between
legitimate and phishing websites.
Comprehensive Feature Utilization: Our model leverages 32 distinct features, including domain attributes, webpage
content, and behavioral indicators, to improve classification accuracy.
Machine Learning Approach: The system integrates feature selection techniques and machine learning algorithms, ensuring
a robust and scalable phishing detection mechanism.
PROBLEM STATEMENT
Phishing website detection systems rely heavily on the effectiveness of feature extraction
and classification techniques to accurately differentiate between legitimate and fraudulent
websites. However, existing detection methods face challenges in capturing a sufficient
number of high-quality features, leading to misclassification and reduced reliability. The
dynamic nature of phishing attacks, coupled with inconsistencies in feature selection and
model generalization, further hampers the ability of traditional approaches to detect
evolving threats with high confidence. This limitation affects the overall trustworthiness of
phishing detection models, necessitating a more robust machine learning-based approach
that leverages key URL, website security, and behavioral features to enhance accuracy and
detection performance.
OBJECTIVE
Develop an Accurate Phishing Detection Model: Build a robust system that effectively distinguishes
between legitimate and phishing websites using a diverse set of 32 features.
Enhance Detection Capabilities: Utilize advanced feature selection techniques to improve
classification accuracy beyond traditional regex-based methods.
Implement Machine Learning Algorithms: Train and evaluate multiple models, leveraging correlation
analysis and visualization techniques to optimize performance.
Ensure Scalability and Efficiency: Design the system to be fast, scalable, and adaptable to detect
evolving phishing threats in real-time.
Deploy as a Web-Based Solution: Integrate the model into a user-friendly web application to provide
real-time phishing detection for broader accessibility and usability.
LITERATURE SURVEY
Title: Phishing Detection from URLs Using Deep Learning Approach
Author: S. Singh, M. P. Singh, and R. Pandey
Methodology: Pre-processing URLs to extract features; using a CNN for classification; comparing performance against conventional ML algorithms
Proposed System: A CNN-based phishing detection system using URL features, tested on a dataset of phishing and legitimate URLs
Cons: Limited to URL features; dataset size could be expanded; may not capture all phishing nuances
Conclusion: Deep learning (via CNN) improves detection over traditional methods; future work includes dataset expansion and combining CNN with other algorithms
Title: A Deep Learning-Based Phishing Detection System Using CNN, LSTM
Author: Kumar and M. S. Kaur
Methodology: Feature extraction from URL structures and webpage content; feeding data into CNN for spatial patterns and LSTM for sequential patterns
Proposed System: A hybrid model combining CNN and LSTM to capture both spatial and temporal characteristics, validated on a novel dataset
Cons: Increased computational complexity; requires extensive training and optimization
Conclusion: Integration of CNN and LSTM yields better detection rates; future work to further enhance feature extraction and experiment with additional deep learning paradigms
Title: DEPHIDES: Deep Learning Based Phishing Detection System
Author: Kumar and R. Sharma
Methodology: Combining multiple deep learning architectures (ANN, CNN, RNN); training individual models; employing a voting mechanism for improved accuracy
Proposed System: An ensemble system that leverages the strengths of ANN, CNN, and RNN through a voting classifier to detect a broad range of phishing attacks
Cons: Higher complexity in training multiple models; increased computational overhead; complex hyperparameter tuning
Conclusion: Ensemble techniques outperform single-model approaches; further fine-tuning and integration of additional methods are recommended
Title: A Weighted Ensemble Model for Phishing Website Detection
Author: Gupta and R. K. Jain
Methodology: Utilizing performance-based weighting to combine Random Forest and DNN models; analyzing URL and website content features
Proposed System: A weighted ensemble model that assigns different weights to RF and DNN based on their performance, enhancing overall detection accuracy
Cons: Weighting mechanism can be complex; results may be dataset-specific; potential issues with generalizability
Conclusion: The weighted ensemble significantly improves detection accuracy compared to individual models; future work will extend the framework with other methods
Title: Machine Learning and Deep Learning for Phishing Page Detection
Author: R. Sharma and P. Kaur
Methodology: Comparative analysis of ML and DL techniques; extracting features from URL and HTML content; evaluation of models such as XGBoost, SVM, and CNN
Proposed System: A framework for assessing various ML/DL models for phishing page detection rather than a single unified system
Cons: Lacks a novel system implementation; dependent on existing methods; may not address emerging phishing techniques
Conclusion: XGBoost shows efficiency over conventional ML; deep learning models are promising; future work should improve feature selection and model interpretability
EXISTING SYSTEM
Existing phishing detection models often rely on simplistic techniques such as regular expression (regex) matching
and use a minimal set of features, which leads to the creation of small, less representative datasets. These limitations
hinder their ability to capture the complex, evolving patterns of modern phishing attacks, resulting in high false-
positive rates and inadequate detection accuracy. Moreover, the dependency on low-dimensional feature spaces
restricts the model's capacity to generalize across diverse attack vectors and adapt to emerging threats. In contrast,
our project addresses these challenges by incorporating an extensive set of 32 features that encompass domain
attributes, webpage content, and behavioral indicators. This comprehensive feature extraction, combined with
advanced machine learning and deep learning methodologies, is designed to improve robustness, accuracy, and
scalability, ultimately overcoming the drawbacks of traditional regex-based approaches and limited datasets.
Methodology
Data Collection and Preparation:
Gather a comprehensive dataset consisting of labeled instances of phishing and legitimate
websites.
Include a wide range of URL-based, security-related, and behavioral features (e.g., UsingIP,
HTTPS, WebsiteTraffic, PageRank, GoogleIndex, LinksInScriptTags, etc.).
Preprocessing
Handling Missing Values: Remove or impute missing data to maintain dataset integrity.
Feature Encoding: Convert categorical variables into numerical representations for model
compatibility.
Feature Scaling: Normalize numerical features to prevent bias in model learning.
Class Balancing: Use oversampling or undersampling techniques to address imbalances
between phishing and legitimate websites.
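The preprocessing steps listed here can be sketched in plain Python. This is a minimal illustration and not the project's actual pipeline: the helper names (`impute_mode`, `min_max_scale`, `random_oversample`), the use of `None` to mark missing values, and the toy rows are all assumptions made for the example.

```python
import random
from collections import Counter


def impute_mode(rows):
    """Replace None in each column with that column's most common value."""
    n_cols = len(rows[0])
    modes = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not None]
        modes.append(Counter(present).most_common(1)[0][0])
    return [[modes[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0 for i, v in enumerate(r)]
        for r in rows
    ]


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (hypothetical): three numeric features per site, labels 1 / -1.
X = [[1, None, 120], [-1, 0, 30], [1, 1, 75], [None, 1, 60]]
y = [1, -1, -1, -1]

X_clean = impute_mode(X)
X_scaled = min_max_scale(X_clean)
X_bal, y_bal = random_oversample(X_scaled, y)
```

In practice, oversampling is usually applied only to the training split so that duplicated rows do not leak into the evaluation data.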
Feature Engineering
Tokenization and Character Embedding: Extract meaningful patterns from domain names and URLs.
Web Content Feature Extraction: Capture structural and contextual details such as HTTPS usage, favicon presence, and
website redirection behaviors.
Feature Selection Using RFE: Identify the most relevant features while removing redundant ones to improve
computational efficiency.
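Character-level tokenization of a URL can be illustrated as follows. This is a sketch under stated assumptions: the vocabulary, the padding index 0, the unknown-character index 1, and the fixed length of 32 are illustrative choices, not values taken from the project. The resulting integer sequence is the kind of input a character embedding layer would consume.

```python
import string

# Assumed vocabulary: lowercase letters, digits, and common URL punctuation.
VOCAB = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD, UNK = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}


def encode_url(url, max_len=32):
    """Map a URL to a fixed-length list of integer character IDs."""
    # Lowercase, truncate to max_len, and map unknown characters to UNK.
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url.lower()[:max_len]]
    # Right-pad short URLs so every sequence has the same length.
    return ids + [PAD] * (max_len - len(ids))


encoded = encode_url("http://example.com/login")
```

Truncating from the right keeps the scheme and domain, which is where several of the URL-based indicators are found; a different truncation policy may suit other feature sets.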
Machine Learning Models
Train and evaluate multiple classifiers, including:
• Gradient Boosting Classifier (targeting high accuracy).
• CatBoost Classifier & Random Forest (leveraging ensemble learning for robustness).
• Support Vector Machine (SVM) (finding optimal hyperplanes for classification).
• Multi-layer Perceptron (MLP) (capturing complex feature relationships).
• Decision Tree & K-Nearest Neighbors (KNN) (using simpler, effective models).
• Logistic Regression & Naïve Bayes (for comparative analysis).
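Of the classifiers listed, K-Nearest Neighbors is simple enough to sketch without a library. The example below is illustrative only: the function name `knn_predict`, the choice of squared Euclidean distance, and the toy rows are assumptions, and a library implementation would normally be used for the full comparison.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Squared Euclidean distance is sufficient for ranking neighbours.
    dists = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    ]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy rows (hypothetical): three features, each encoded as -1 or 1.
train_X = [[1, 1, 1], [1, 1, -1], [-1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Labels follow the dataset description: 1 = phishing, -1 = not phishing.
train_y = [-1, -1, 1, 1, 1]

pred = knn_predict(train_X, train_y, [-1, -1, -1], k=3)
```

The same fit-then-predict pattern applies to the other models in the list; only the decision rule changes.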
Model Evaluation
•Assess model performance using key metrics:
• Accuracy: Overall correctness of classifications.
• Precision: Proportion of predicted phishing websites that are truly phishing.
• Recall: Ability of the model to identify actual phishing websites.
• F1-score: Balance between precision and recall for overall effectiveness.
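The four metrics above can be computed directly from the confusion-matrix counts. The sketch below assumes phishing is the positive class and is labeled 1, as in the dataset description; the function name `classification_metrics` and the toy label lists are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): 1 = phishing, -1 = legitimate.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
metrics = classification_metrics(y_true, y_pred)
```

For phishing detection, recall is often watched most closely, since a missed phishing site (false negative) is typically costlier than a false alarm.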
Feature Explanation
•Index: Serves as a unique identifier for each record.
•UsingIP: Indicates whether an IP address is used instead of a domain name in the URL.
•LongURL: Measures the length of the URL, with longer URLs potentially being more suspicious.
•ShortURL: Identifies whether a URL has been shortened, which can obscure its true destination.
•Symbol@: Detects the presence of the "@" symbol, often used to mislead users.
•Redirecting//: Checks for abnormal redirection patterns within the URL.
•PrefixSuffix-: Looks for hyphens in the domain name, which might suggest deception.
•SubDomains: Counts the number of subdomains, as excessive subdomains can be indicative of phishing.
•HTTPS: Verifies if the secure HTTPS protocol is used.
•DomainRegLen: Reflects the duration of the domain registration, with shorter durations being riskier.
•Favicon: Assesses the presence and authenticity of the website’s favicon.
•NonStdPort: Determines whether non-standard ports are being used in the URL.
•HTTPSDomainURL: Confirms if the domain URL supports HTTPS.
•RequestURL: Analyzes the URL used for making requests to the server.
•AnchorURL: Examines the links within anchor tags for suspicious patterns.
•LinksInScriptTags: Counts links found in script tags that might be malicious.
•ServerFormHandler: Evaluates how the server handles form submissions.
•InfoEmail: Detects the presence of informational or contact email addresses.
•AbnormalURL: Flags URLs that deviate from normal patterns.
•WebsiteForwarding: Checks if the website employs forwarding techniques.
•StatusBarCust: Identifies if the status bar has been customized, which could hide malicious actions.
•DisableRightClick: Detects if right-click functionality is disabled to prevent source code inspection.
•UsingPopupWindow: Flags the use of popup windows that may be used for malicious intents.
•IframeRedirection: Identifies if iframes are used for redirection, a common phishing tactic.
•AgeofDomain: Determines the age of the domain, with newer domains being more suspect.
•DNSRecording: Checks whether a DNS record exists for the domain, which may indicate domain legitimacy.
•WebsiteTraffic: Assesses the amount of traffic a website receives as a proxy for legitimacy.
•PageRank: Utilizes Google PageRank as an indicator of the website’s credibility.
•GoogleIndex: Determines if the website is indexed by Google, which can imply authenticity.
•LinksPointingToPage: Counts the number of external links pointing to the page.
•StatsReport: Provides additional statistical information relevant to the website’s features.
•Class: Represents the target label indicating whether the website is phishing or legitimate.
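Several of the URL-based features described above can be derived from the URL string alone. The sketch below shows one possible way to compute a few of them; the function name `url_features`, the length threshold of 54 characters, and the short list of URL-shortening hosts are assumptions for illustration, and the dataset's own encoding of these features may differ.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed, non-exhaustive list of URL-shortening hosts.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_features(url, long_threshold=54):
    """Compute a few URL-based indicators as booleans."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        # Succeeds only when the host is a literal IPv4 or IPv6 address.
        ipaddress.ip_address(host)
        using_ip = True
    except ValueError:
        using_ip = False
    return {
        "UsingIP": using_ip,
        "LongURL": len(url) >= long_threshold,
        "ShortURL": host in SHORTENERS,
        "Symbol@": "@" in url,
        # A "//" after the scheme separator suggests an embedded redirect.
        "Redirecting//": "//" in url.split("://", 1)[-1],
        "PrefixSuffix-": "-" in host,
        "HTTPS": parsed.scheme == "https",
    }


features = url_features("http://192.168.0.1/secure-login//verify@account")
```

Features such as AgeofDomain, WebsiteTraffic, or PageRank cannot be derived this way; they need external lookups and are outside the scope of this sketch.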
SYSTEM ARCHITECTURE
DATASETS
The dataset used in this analysis is sourced from Kaggle and can be accessed via the following link:
Phishing Website Detector Dataset.
This dataset contains over 11,000 website URLs, with each sample comprising 30 website parameters and a class
label that identifies whether the website is phishing (labeled as 1) or not (labeled as -1).
Dataset Overview:
Number of Samples: 11,054
Number of Features: 32 (30 website parameters, the Index identifier, and the Class target variable)
SAMPLE DATASET
IMPLEMENTATION
RESULTS
REFERENCES
1. S. Singh, M. P. Singh, and R. Pandey, "Phishing Detection from URLs Using Deep Learning Approach," International Journal of Computer
Applications, vol. 975, pp. 1–7, 2020. Published: November 15, 2020.
2. Kumar and M. S. Kaur, "A Deep Learning-Based Phishing Detection System Using CNN, LSTM," Electronics, vol. 12, Article 1232, 2023.
Published: January 15, 2023.
3. Kumar and R. Sharma, "DEPHIDES: Deep Learning Based Phishing Detection System," Journal of Network and Computer Applications, vol.
210, Article 103511, 2024. Published: March 5, 2024.
4. Gupta and R. K. Jain, "A Weighted Ensemble Model for Phishing Website Detection," Electronics, vol. 12, Article 232, 2023. Published: February 1,
2023.
5. R. Sharma and P. Kaur, "Machine Learning and Deep Learning for Phishing Page Detection," Journal of Information Security and
Applications, vol. 67, Article 103213, 2023. Published: April 10, 2023.
6. Verma and S. Gupta, "Using Machine Learning to Detect and Classify URLs," International Journal of Information Security, vol. 21, pp. 345–
356, 2023. Published: May 5, 2023.
7. M. Jha and R. Kumar, "BERT-Based Approaches to Identifying Malicious URLs," IEEE Transactions on Information Forensics and Security, vol.
18, pp. 1234–1245, 2023. Published: July 20, 2023.