
A

PROJECT REPORT
On
Phishing URL Detection using Machine Learning
Submitted in partial fulfillment of the
requirements for the award of the degrees
Of
BACHELOR OF TECHNOLOGY in
INFORMATION TECHNOLOGY

Submitted by:
Aastha Dewangan (300103321020)
Saurabh Rathore (300103321021)
Khushi Dewangan (300103321041)
Abhishek Dewangan (300103322301)

Guided by:
Mr. Toshant Kumar (Asst. Professor)

BHILAI INSTITUTE OF TECHNOLOGY DURG


DEPARTMENT OF INFORMATION TECHNOLOGY
UGC Autonomous Institution
(Affiliated to CSVTU, Approved by AICTE, NBA & NAAC Accredited)

DURG – 491001, CHHATTISGARH, INDIA
www.bitdurg.ac.in

SESSION: 2023-24

CANDIDATE’S DECLARATION

We hereby declare that the project entitled “Phishing URL Detection using Machine Learning”, submitted in
partial fulfillment for the award of the degree of Bachelor of Technology in Information Technology by Aastha
Dewangan (300103321020), Saurabh Rathore (300103321021), Khushi Dewangan (300103321041), and
Abhishek Dewangan (300103322301), and completed under the supervision of Mr. Toshant Kumar, Asst.
Professor, BIT Durg, is an authentic work.

Further, we declare that we have not submitted this work for the award of any other degree elsewhere.

Signature and name of the student(s) with date

CERTIFICATE by PROJECT Guide

It is certified that the above statement made by the students is correct to the best of our
knowledge.

Signature of Guide with dates and their designation

CERTIFICATE BY THE EXAMINERS

This is to certify that the Major Project work entitled “Phishing URL Detection using Machine
Learning” is carried out by Aastha Dewangan, Saurabh Rathore, Khushi Dewangan,
Abhishek Dewangan in partial fulfillment for the award of degree of Bachelor of Technology
in Information Technology, Chhattisgarh Swami Vivekanand Technical University, Durg
during the academic year 2023-2024.

Mr. Toshant Kumar                                Prof. Dr. Ani Thomas
Internal Guide                                   HOD

External Examiner

ACKNOWLEDGEMENTS

We wish to express our deep sense of gratitude and indebtedness to Mrs. Babita Verma
(Assistant Professor), Information Technology, who gave us this opportunity to experience
project work; her suggestions during this project have been invaluable. We take this
opportunity to record our sincerest gratefulness toward our esteemed supervisor Mr.
Toshant Kumar (Asst. Professor) and co-supervisor Dr. Ani Thomas, under whose able
guidance the project work has been brought to completion.

We are thankful for their benevolence, time, valuable suggestions, constructive criticism,
and active interest in the successful completion of this project work.

We are also thankful to all our honorable teachers of the Information Technology Department
and our parents whose valuable support helped us and kept us motivated all through.

Aastha Dewangan
Saurabh Rathore
Khushi Dewangan
Abhishek Dewangan
B.Tech. Ⅲ Year
Discipline of Information Technology
BIT DURG

ABSTRACT

Internet technology has become a major factor in our society, economy, education, vital
infrastructure, and other elements of daily life. As a result, several aspects of our everyday
lives are now at risk from cyber threats. In 2024, phishing attacks remain the most common
online crime by number of victims, despite the use of advanced detection algorithms. Many
people go online and conduct a wide range of business. In online business, the parties to a
transaction never need to meet, and a buyer can sometimes be dealing with a fraudulent
business that does not actually exist. That is why security for conducting business online is
vital and critical. Any online program that requires security, such as an online banking login
page, is vulnerable to fraud. Phishing websites are a common source of risk and have become
an issue for users of online banking and e-commerce. Phishing websites aim to deceive users
into providing private and sensitive security information so that the scammer can gain access
to their accounts. They take advantage of end users' ignorance of web browser information
and security signs by using websites that mimic those of reputable businesses.

We have employed machine learning algorithms to identify phishing websites in order to
prevent phishing fraud. In this study, we attempt to identify and survey the machine learning
algorithms and methods that can be used to detect these phishing websites. To identify such
dangerous websites, we use a variety of machine learning algorithms, including XGBoost,
Random Forest, Decision Tree, AutoEncoder, Support Vector Machine, and Multilayer
Perceptron.
TABLE OF CONTENTS

CHAPTER   TITLE

1   Introduction

2   Literature Review

3   Problem Identification

4   Methodology
    4.1 Dataset cleaning and preparation
    4.2 Feature extraction from the input URL
    4.3 Dataset
    4.4 Machine learning models

5   Result & Discussion
    5.1 Analysis and assessment of performance
    5.2 Experimental setup and results

6   Conclusion & scope of further work

7   References

List Of Figures

1. Fig. 1.1
2. Fig. 4.1
3. Fig. 4.2
4. Fig. 4.3
5. Fig. 4.4
6. Fig. 5.1
7. Fig. 5.2

List Of Tables

1. Table 4.1
2. Table 4.2
3. Table 4.3

CHAPTER-1
INTRODUCTION

Phishing has become one of the most serious problems, harming individuals, corporations, and
even entire countries. The availability of multiple services such as online banking, entertainment,
education, software downloading, and social networking has accelerated the Web's evolution in
recent years. As a result, a massive amount of data is constantly downloaded and transferred to
the Internet. Spoofed emails pretending to be from reputable businesses and agencies are used in
social engineering techniques to direct consumers to fake websites that deceive users into giving
financial information such as usernames and passwords. Technical tricks involve the installation
of malicious software on computers to steal credentials directly, with systems frequently used to
intercept users' online account usernames and passwords.

An increasing number of people are using the Internet as a platform for online transactions,
information sharing, and e-commerce as a result of the surge in internet usage over the past
several years. Cybercrime is a new type of crime that emerged as the use of the Internet
developed. Cybercriminals can steal information in a variety of ways, and phishing is the
primary tool they use to do so. Phishing comes in a variety of forms, such as email phishing,
spear phishing, whaling, and vishing. Phishing was first documented in 1990 and was used to
obtain passwords. Phishing attacks have increased in the last few years, and URL-based phishing
is one such attack. A website address, or URL, represents a website's location on a network and
how to access it. Through the URL, we establish a connection to the server, which houses all of
the website's information and serves the webpage that displays it. URLs can be divided into two
types: malicious and benign. URL phishing uses malicious URLs, whereas benign URLs are safe
and secure. A cybercriminal may design a website that is identical to a legitimate one in every
way, so that it appears to be the real thing; on other websites, its URL will show up as an
advertisement, and fraud occurs when the user inputs their credentials. Another method involves
sending the user a malicious URL via email. When the user attempts to open the URL, a
dangerous virus is downloaded, giving hackers access to the data they need to carry out their
crimes. To differentiate between malicious and benign URLs, certain properties must be
extracted from them and compared.
1.1 TYPES OF PHISHING

• Deceptive Phishing: This is the most frequent type of phishing attack, in which a
cybercriminal impersonates a well-known institution, domain, or organization to acquire
sensitive personal information from the victim, such as login credentials, passwords, bank
account information, and credit card information. Because there is no personalization or
customization for the victim, this form of attack lacks sophistication.
• Spear Phishing: In this sort of phishing, the emails containing malicious URLs carry a
great deal of personalized information about the potential victim. The recipient's name,
company name, designation, friends, co-workers, and other social information may be
included in the email.
• Whale Phishing: This sort of phishing spear-phishes a "whale", that is, it targets corporate
leaders such as CEOs and other top-level management employees.
• URL Phishing: To infect the target, the fraudster or cybercriminal employs a URL link.
People are sociable creatures who will eagerly click a link to accept a friend invitation
and may even be willing to disclose personal information such as email addresses.
This works because the phishers redirect users to a fake web server; attackers also use
secure browser connections to carry out their unlawful actions. For lack of appropriate
tools to combat phishing attacks, firms have been unable to train their staff in this area,
resulting in an increase in phishing attacks. As broad countermeasures, companies are now
educating their staff with mock phishing assaults, updating all their systems with the latest
security procedures, and encrypting important information. Browsing without caution is one
of the most common ways to become a victim of a phishing assault, since the appearance of
phishing websites closely mimics that of authentic websites.

1.2 EXISTING SYSTEM

Anti-phishing strategies involve educating netizens and technical defense. In this report,
we mainly review the technical defense methodologies proposed in recent years.
Identifying the phishing website is an efficient step in disrupting the process of deceiving
users out of their information. With the development of machine learning techniques,
various machine-learning-based methodologies have emerged for recognizing phishing
websites and increasing prediction performance. The primary purpose of this report is to
survey effective methods to prevent phishing attacks in a real-time environment.

1.3 PROPOSED SYSTEM

The most frequent type of phishing assault is one in which a cybercriminal impersonates a
well-known institution, domain, or organization to acquire sensitive personal information
from the victim, such as login credentials, passwords, bank account information, and credit
card information. In spear phishing, the emails containing malicious URLs carry a great
deal of personalized information about the potential victim, while whale phishing targets
corporate leaders such as CEOs and other top-level management employees. In each of
these attacks, the fraudster or cybercriminal employs a URL link to infect the target; the
system proposed in this report therefore applies machine learning to URL features to
detect such attacks.

1.4 ADVANTAGES
Because there is no personalization or customization for the victim, the basic deceptive
form of attack lacks sophistication and is comparatively easy to spot. Spear-phishing
emails, by contrast, may include social information such as the recipient's name, company
name, designation, friends, and co-workers, making victims far more likely to click a link
to accept a friend invitation or to disclose other people's information.

CHAPTER-2
LITERATURE REVIEW

Numerous theories and methods have been offered by different authors and studied in order to
identify phishing URLs. One theory is to use features based on the message content weighting to
determine whether or not the URL is malicious.

[3] Carolin and Raj Singh devised a technique that uses association rule mining, a data mining
procedure, to identify dangerous URLs. The process of organizing and extracting information
from a dataset is known as data mining [3].

They carried out a study using both malicious and valid URLs to ascertain how the properties of
the URL differ between the two, thereby giving a quick summary of the attributes of URLs. A
machine learning model that could identify fraudulent URLs was created using this data. Mohammed et al. [4]
presented a model in which additional URL-based data and results from Microsoft Reputation
Services were used to build a machine learning model. We can ascertain whether a URL has
malicious purpose by applying this model. The model produced precise outcomes. Microsoft has
developed a product called Microsoft Reputation Services that offers URL classification as virus
protection.[4]

All of these characteristics were used to create a machine learning model. Various models have
been developed to identify fraudulent or genuine URLs. Using NLP algorithms is a helpful
technique that creates a word dictionary with all the language-based properties of both benign
and malicious URLs. This dictionary is then used to build a machine learning model that can
identify harmful URLs. Parekh [5] suggested utilizing document object model attributes to
identify the rogue website. The document object model serves as an API for programming
languages such as XML and HTML. It is a tree structure that represents the HTML or XML code
and has features like color and gray histograms and spatial relationships that can be used to
identify phishing URLs [5]. Furthermore, Pradeepthi and Kannan [6] offered a visual approach
to spotting rogue websites. In this effort, phishing detection entails examining text segments and
styles in addition to webpage visuals. PhoneyC is a virtual honey pot that is used to investigate
the types of harmful URLs that hackers employ to steal information, as revealed by a study by Fu
[7].

Sahoo's suggested method [8] uses the Earth Mover's Distance (EMD) to determine the
signature distances of webpage images: after converting the webpages to images, visual
indicators are identified using characteristics such as color. Some investigations have also
shown that malicious URLs can be detected by examining their links to previously used
domains. One such study suggested a method to check whether a URL points to harmful
content using the Beautiful Soup Python package, which parses HTML and XML files; based
on that, the malicious URL can be detected. Another aspect of malicious URL detection is
based on HTML features [9,10]. Another option is to use string-based algorithms, where the
URLs are preprocessed so that a word cloud is built for both malicious and legitimate URLs.
In this case, each word cloud contains only the most common words in its class of URLs, and
the analysis compares the word clouds of the malicious and legitimate URLs. Machine
learning methods can then tell whether a URL is dangerous [11].
Both reputable and fraudulent websites are used in data acquisition. Extracting valuable
features involves two categories: URL-based features, which refer to IP addresses, URLs with
the "@" symbol, dashes, lengthy URLs, unusually high or low numbers, URL subdomains, etc.;
and domain-based features, which include the website's PageRank, its age, and its validity.

CHAPTER-3
PROBLEM IDENTIFICATION

The escalating prevalence of phishing attacks poses a significant challenge to cybersecurity, with
malicious actors continually devising sophisticated strategies to exploit unsuspecting users.
Phishing, a form of cybercrime, often involves the dissemination of deceptive URLs, which
impersonate legitimate websites to trick users into divulging sensitive information or performing
malicious actions. Traditional methods of detecting and mitigating phishing attacks, such as
manual inspection and blacklist-based systems, are increasingly inadequate in the face of rapidly
evolving tactics employed by cybercriminals.

Key Challenges:

1. Dynamic Nature of Phishing URLs: Phishing URLs exhibit diverse characteristics and are
subject to frequent modifications, making them elusive to static detection methods.

2. Inaccuracy of Blacklisting Services: Existing blacklisting services rely on heuristics, manual
reporting, and historical data to identify malicious URLs. However, these approaches often lag
behind emerging threats and fail to detect newly created phishing websites promptly.

3. Complexity of URL Features: The features that distinguish phishing URLs from legitimate
ones are multifaceted and may include domain age, URL length, presence of suspicious
keywords, redirection behaviour, and hosting information. Extracting and analysing these
features accurately require sophisticated techniques.

4. Need for Real-time Detection: With phishing attacks occurring in real-time, there is a
pressing need for detection mechanisms capable of swiftly identifying malicious URLs to
prevent potential harm to users.

Project Objective:

The primary goal of the Phishing Website URL Detection Project is to develop an advanced
detection system that leverages machine learning algorithms to identify and mitigate phishing
threats effectively. By addressing the aforementioned challenges, the project aims to:

1. Enhance Detection Accuracy: Develop machine learning models capable of accurately
distinguishing between benign and malicious URLs by analyzing a comprehensive set of features
and patterns indicative of phishing behavior.

2. Improve Timeliness of Detection: Implement real-time detection capabilities to promptly
identify and respond to emerging phishing threats, minimizing the window of vulnerability for
users.

3. Facilitate Continuous Improvement: Establish mechanisms for ongoing monitoring,
evaluation, and refinement of the detection system to keep pace with evolving phishing tactics
and maintain effectiveness over time.

By achieving these objectives, the Phishing Website URL Detection Project aims to contribute
significantly to the advancement of cybersecurity measures, bolstering the protection of
individuals, organizations, and digital ecosystems against the pervasive threat of phishing
attacks.

Hardware and Software Requirements:

Software Requirements:
1. Pandas: For data manipulation and analysis, Pandas provides powerful data structures
and functions.
2. Matplotlib: A comprehensive library for creating static, animated, and interactive
visualizations in Python; it is instrumental in visualizing various aspects of the Phishing
Website URL Detection Project.

3. Seaborn: A statistical data visualization library in Python that enhances the project by
providing high-level interfaces for drawing informative and attractive visualizations from
URL datasets and machine learning model outputs.
4. Scikit-learn: Scikit-learn is a versatile machine learning library that offers various
algorithms for classification, including SVM (Support Vector Machines).
5. Numpy: Numpy is a fundamental package for scientific computing in Python, providing
support for large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on these arrays.

Hardware Requirements:
1. High-performance Computing (HPC) Resources: Given the computational demands of
deep learning models, access to high-performance computing resources such as GPUs
(Graphics Processing Units) or TPUs (Tensor Processing Units) may be beneficial for
training large-scale models efficiently.
2. Sufficient Memory: Training and running deep learning models often require significant
memory resources, so systems with ample RAM are preferred.
3. Storage: Adequate storage space is necessary for storing large datasets, model
checkpoints, and intermediate outputs generated during training.

CHAPTER - 4
METHODOLOGY

This section examines the techniques used to complete the different tasks in the dataset
pre-treatment and cleaning step, and examines the machine learning models used to
categorise phishing websites.
4.1. Dataset cleaning and preparation:
For the phishing website detection project, the data collection process involved
gathering a diverse dataset comprising both legitimate and phishing website samples. The
dataset was sourced from various repositories and sources specializing in cybersecurity
research. Specifically, datasets containing URLs labelled as either legitimate or phishing
were obtained from reputable sources such as Phish Tank, Open Phish, and the CERT
Division of the Software Engineering Institute at Carnegie Mellon University.
The collection of phishing URLs is rather easy thanks to the open-source service called
PhishTank. This service provides a set of phishing URLs in multiple formats such as CSV and
JSON that is updated hourly.
For the legitimate URLs, we found a source that has a collection of benign, spam, phishing,
malware & defacement URLs: the University of New Brunswick dataset,
https://www.unb.ca/cic/datasets/url-2016.html. The number of legitimate URLs in this
collection is 35,300. From the downloaded URL collection, *'Benign_list_big_final.csv'* is
the file of interest. As the data are updated every hour, we randomly picked 5,000 URLs
each of the phishing and legitimate sets. This file is then uploaded to Colab for the
feature extraction.
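The balanced sampling described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's exact notebook: in the project, the phishing URLs come from PhishTank's hourly feed and the legitimate URLs from 'Benign_list_big_final.csv'; here tiny in-memory stand-ins are used, and the sample size is scaled down from 5,000.

```python
import pandas as pd

# Stand-ins for the two source files (illustrative URLs, not real data).
phishing = pd.DataFrame({"url": [f"http://bad{i}.example/login" for i in range(20)]})
legitimate = pd.DataFrame({"url": [f"http://good{i}.example/" for i in range(20)]})

N = 10  # the report samples 5,000 per class; scaled down here
phish_sample = phishing.sample(n=N, random_state=12).reset_index(drop=True)
legit_sample = legitimate.sample(n=N, random_state=12).reset_index(drop=True)
phish_sample["label"] = 1   # 1 = phishing
legit_sample["label"] = 0   # 0 = legitimate
print(len(phish_sample), len(legit_sample))  # 10 10
```

Sampling an equal number from each class keeps the training data balanced, which simplifies evaluation later.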
4.1.1 Data Preprocessing and feature extraction: Pre-processing raw data is a
crucial step in data preparation, improving the accuracy and efficacy of a machine learning
model.
To get the dataset ready for the classification model to be applied, the following
procedures were taken:
i) Dataset Cleaning: Cleaning a dataset is the process of eliminating
extraneous and noisy data. The main objective is to create a uniform format
for the dataset. Cleaning started with deleting undesired, irrelevant, and
unwanted data, and extracting the meaningful data.

Figure 4.1: Data

ii) Dataset Classification: Labelling is a fundamental aspect of phishing
website detection. The dataset is meticulously curated, with each website
URL assigned a binary label: 0 for legitimate sites and 1 for phishing
sites. Through careful preprocessing and feature extraction, pertinent
attributes such as URL structure, domain age, presence of suspicious
keywords, and SSL certificate validity are distilled into numerical
representations.

0: Legitimate
1: Phishing
Figure 4.2: Labels

iii) Preprocessing: In the preprocessing phase for phishing website detection,
raw data is cleaned to remove duplicates and irrelevant entries. URLs are
standardized, and relevant features such as domain age and SSL certificate
details are extracted. Missing values and inconsistencies are addressed to
ensure data integrity. This streamlined dataset forms the foundation for
training accurate machine learning models to distinguish between legitimate
and phishing websites, bolstering cybersecurity measures effectively.
The data is split into 8,000 training samples and 2,000 testing samples before
the ML model is trained. It is evident from the dataset that this is a
supervised machine learning problem. Classification and regression are the
two main types of supervised machine learning problems. Because each input
URL is classed as legitimate or phishing, this dataset poses a classification
problem. The following supervised machine learning models were examined
for this project's dataset training: Decision Tree, Multilayer Perceptron,
Random Forest, Autoencoder Neural Network, XGBoost, and Support Vector
Machines.

4.2 Feature Extraction from the input URL:

4.2.1 Address Based Features:
The following features were extracted from the address portion of the URL:
1. Domain of the URL: The domain present in the URL is extracted.
2. IP Address in the URL: The presence of an IP address in the URL is checked.
Instead of a domain name, URLs may contain an IP address. If an IP address is used
instead of a domain name in a URL, we can be fairly certain that the URL is being
used to collect sensitive information.
3. "@" Symbol in URL: The presence of the '@' symbol in the URL is checked. When
the "@" symbol is used in a URL, the browser ignores everything before it, and the
genuine address is commonly found after the "@" symbol.
4. Length of URL: Calculates the URL's length. Phishers can disguise the suspicious
part of a URL in the address bar by using a lengthy URL. If the length of the URL is
greater than or equal to 54 characters, the URL is classed as phishing in this project.
5. Depth of URL: Calculates the URL's depth. Based on the '/' character, this feature
determines the number of subpages in the given address.
6. Redirection "//" in URL: The existence of "//" in the URL path is checked; its
presence indicates that the user will be redirected to another website. The position
of the "//" in the URL is calculated. If the URL begins with "http", the "//" should
appear in the sixth position; if the URL uses "https", the "//" should occur in the
seventh position.
7. HTTP/HTTPS in Domain name: The existence of "http/https" in the domain part of
the URL is checked. To deceive users, phishers may append the "https" token to the
domain section of a URL.
8. Using URL Shortening Services: URL shortening is a means of reducing the length
of a URL while still directing to the desired webpage, typically performed via an
HTTP redirect on a short domain name that points to a webpage with a long URL.
9. Prefix or Suffix "-" in Domain: Checks for the presence of a '-' in the URL's
domain part. In genuine URLs, the dash symbol is rarely used. Phishers frequently
append prefixes or suffixes to domain names, separated by '-', to give the
impression that they are dealing with a legitimate website.
10. Tiny URL Detection: Since tiny URLs do not present the real domain, resource
path, or search parameters, rule-based feature selection techniques can be useless
for them. Because tiny URLs are generated by many different services, it is hard to
resolve them to their original URLs, and as short strings they are unfriendly to
character-level natural language processing. If tiny URLs are not specially processed
during data cleansing and preprocessing, they are likely to cause false or missed
alarms, to which users of Internet security products are sensitive.

A further consideration is response time for real-time systems: rule-based models
depend on rule parsing and third-party services for a URL string, so they demand a
relatively long response time in a real-time prediction system that accepts a single
URL string per request. Phishing attacks also spread across various communication
media and target devices, from personal computers to other smart devices, making
it a big challenge to cover all devices with one solution. Language independence and
running-environment independence should be taken into consideration to reduce
system development complexity and later maintenance costs.

4.2.2 Domain Based Features:


This category contains many features that can be extracted; the following were
considered for this project.
1. DNS Record: In the case of phishing websites, the WHOIS database either does
not recognize the stated identity or has no records for the host name.
2. Web Traffic: This feature determines the number of visitors and the number of
pages they visit, in order to gauge the popularity of the website. In the worst
cases, legitimate websites ranked among the top 100,000 according to our data.
Furthermore, a domain is categorized as "Phishing" if it has no traffic or is not
recognized by the Alexa database.
3. Age of Domain: This information can be retrieved from the WHOIS database.
Most phishing websites are only active for a short time. For this project, the
minimum age of a legitimate domain is deemed to be 12 months. Age is simply
the difference between the time of creation and the time of expiry.
4. End Period of Domain: This information can also be gleaned from the WHOIS
database. The remaining domain time is calculated for this feature by taking the
difference between the expiry time and the current time. For this project, the
valid domain's end period is regarded to be 6 months or fewer.
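The two date-based WHOIS checks above reduce to simple date arithmetic. In practice the creation and expiration dates would come from a WHOIS lookup (for example via the python-whois package); fixed dates are used here so the sketch stays self-contained, and the 12-month and 6-month thresholds follow this section.

```python
from datetime import date

def months_between(start, end):
    # Whole-month difference between two dates.
    return (end.year - start.year) * 12 + (end.month - start.month)

creation = date(2021, 3, 1)     # would come from a WHOIS lookup
expiration = date(2024, 9, 1)   # would come from a WHOIS lookup
today = date(2024, 3, 1)

# Age of domain: creation-to-expiry span, flagged if under 12 months.
age_ok = months_between(creation, expiration) >= 12
# End period of domain: remaining time, suspicious if 6 months or fewer.
end_period_suspicious = months_between(today, expiration) <= 6

print(age_ok, end_period_suspicious)  # True True
```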

4.2.3 HTML And JavaScript Based Features:

1. IFrame Redirection: IFrame is an HTML tag that allows another webpage to be
embedded in the one currently being viewed. Phishers can use the "iframe" tag
and make the frame invisible, i.e., without frame borders, by manipulating the
"frameborder" attribute that normally causes the browser to render a visual
boundary.
2. Status Bar Customization: Phishers may utilize JavaScript to show visitors a
false URL in the status bar. To extract this feature, we delve into the webpage
source code, specifically the "onMouseOver" event, and check whether it alters
the status bar.
3. Disabling Right Click: Phishers disable the right-click function with JavaScript,
preventing users from viewing and saving the webpage source code. This feature
is handled in the same way as status bar customization; here we search the
webpage source code for "event.button == 2" and check whether the right click
is disabled.
4. Website Forwarding: The number of times a website has been redirected is a
fine line separating phishing websites from authentic ones. We discovered that
authentic websites were only redirected once in our sample, whereas phishing
websites with this functionality were redirected at least four times.

Implementation: In the remainder of this report we examine the implementation
component of our artefact, with a focus on the description of the developed
solution. This is a task that requires supervised machine learning.

4.3 DATASET:
The datasets were gathered from PhishTank, an open-source platform, and the gathered
dataset was saved as a CSV file. The dataset consists of eighteen columns, which we
transformed using a data pre-processing technique. We familiarized ourselves with a few
data-frame methods to inspect the features in the data. A few plots and graphs are
provided for visualization, to show how the data is distributed and how different features
relate to one another.
The Domain column has no bearing on the training of a machine learning model, leaving
16 features and a target column. The extracted features of the legitimate and phishing
URL datasets are simply concatenated in the feature extraction file, with no shuffling. We
need to shuffle the data to balance out the distribution while breaking it into training and
testing sets. This also reduces the possibility of overfitting during model training.
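The concatenate-then-shuffle step above can be sketched with pandas; the column names and tiny values here are illustrative stand-ins for the project's 16-feature file.

```python
import pandas as pd

legit = pd.DataFrame({"UrlLength": [23, 31, 28], "Label": [0, 0, 0]})
phish = pd.DataFrame({"UrlLength": [76, 64, 91], "Label": [1, 1, 1]})

# Concatenation alone leaves all legitimate rows first, then all phishing
# rows; sample(frac=1) shuffles every row before the train/test split.
data = pd.concat([legit, phish]).sample(frac=1, random_state=12).reset_index(drop=True)
print(sorted(data["Label"].tolist()))  # [0, 0, 0, 1, 1, 1]
```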

4.4 Machine Learning Models :


• Decision tree algorithm:
An improved version of classification and regression trees is the decision tree algorithm.
For tasks such as classification and regression, decision trees are commonly used. The idea
behind a decision tree is to determine a decision by asking if and else questions. The idea is
to learn what frequency of if and else questions leads us to the correct answer quickly.
These questions are called tests in machine learning and called as leaf. The algorithm
searches over all possible tests to obtain the most informative tree about the target
variable.
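As a concrete illustration, a decision tree over URL data is just nested if/else tests; the features and thresholds below are hypothetical, not the ones used in this project:

```python
def tiny_url_tree(has_ip: bool, url_length: int) -> str:
    """A hand-written two-test decision tree: each if/else is a test
    (internal node) and each return is a leaf."""
    if has_ip:                 # test 1: does the URL embed a raw IP?
        return "phishing"      # leaf
    if url_length > 75:        # test 2: is the URL unusually long?
        return "phishing"      # leaf
    return "legitimate"        # leaf

print(tiny_url_tree(has_ip=False, url_length=30))  # legitimate
print(tiny_url_tree(has_ip=True, url_length=30))   # phishing
```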

The random forest algorithm is also used for both classification and
regression problems. A random forest is simply a collection of decision trees:
for regression problems the output is the average of the individual trees'
predictions, and for classification problems the output is the most common
class predicted across all the trees.
Feature importance is calculated for every decision tree, and the average of
these per-tree importances is used for the forest as a whole.
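The two aggregation rules can be sketched directly (the helper names are ours; a real random forest would also randomize the training of each tree):

```python
from collections import Counter

def forest_classify(tree_votes):
    """Classification: the forest outputs the most common class
    among the individual trees' predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Regression: the forest outputs the average of the trees."""
    return sum(tree_outputs) / len(tree_outputs)

print(forest_classify(["phishing", "legitimate", "phishing"]))  # phishing
print(forest_regress([0.2, 0.4, 0.9]))  # 0.5
```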
• XGBoost
XGBoost is a machine learning algorithm that belongs to the ensemble learning
category, specifically the gradient boosting framework. It uses decision trees
as base learners and employs regularization techniques to improve model
generalization. Known for its computational efficiency, feature importance
analysis, and handling of missing values, XGBoost is widely used for tasks such
as regression, classification, and ranking. Key features include its ability to
handle complex relationships in data, regularization to prevent overfitting,
and parallel processing for efficient computation. Its high predictive
performance and versatility make it popular across many domains and datasets.
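The core gradient-boosting idea, an additive model in which each new tree corrects the residual error of the ensemble so far, can be sketched as follows (a schematic of the framework, not of XGBoost's internals):

```python
def boosted_prediction(base_score, corrections):
    """Sum a base score and a sequence of shrunken tree corrections,
    as in gradient boosting's additive model."""
    pred = base_score
    for learning_rate, tree_output in corrections:
        pred += learning_rate * tree_output  # each tree fits the residual
    return pred

# Base score 0.5, then two residual corrections at learning rate 0.3:
print(round(boosted_prediction(0.5, [(0.3, 1.0), (0.3, 0.5)]), 2))  # 0.95
```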
• Logistic Regression
Logistic Regression is a linear model used for binary classification tasks. It predicts the
probability that an instance belongs to a particular class. The most used statistical model
for predicting binary data in various disciplines is logistic regression. Its ease of use and
excellent interpretability have led to its widespread application. It often makes use of the
logit function as a component of generalized linear models.
log[K(a; α) / (1 − K(a; α))] = α^T a ……(5)
where a = (a1, a2, …, aK) is a vector of K predictors and α is a K × 1 vector
of regression parameters. Logistic regression works well when the relationship
in the data is roughly linear, but performs poorly when there are intricate
nonlinear interactions between the variables. It also requires more statistical
assumptions than other techniques, and its prediction rate is affected when the
data set contains missing values.
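Inverting the logit relationship of Eq. (5) gives the predicted probability, which can be computed in a few lines of standard Python (variable names follow the notation above; the sample inputs are illustrative):

```python
import math

def logistic_probability(a, alpha):
    """P(y = 1 | a) = 1 / (1 + exp(-alpha^T a)), the inverse of the
    logit link in Eq. (5)."""
    z = sum(wi * ai for wi, ai in zip(alpha, a))  # alpha^T a
    return 1.0 / (1.0 + math.exp(-z))

# A zero linear score leaves the model maximally uncertain:
print(logistic_probability([1.0, -2.0], [2.0, 1.0]))  # 0.5
```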
• SVM (Support Vector Machine)
SVM is a machine learning method based on supervised learning that may be used for both
classification and regression. Based on its strong foundation in statistical learning theory
and the positive results obtained in several sectors of data mining challenges, the SVM is
considered a new approach that is quickly gaining favour. SVM is a
statistical-learning-based classification approach that has been applied
effectively to several nonlinear classification problems involving large
datasets. Every hyperplane is determined by its direction (a) and its precise
position in space, or threshold (b); lxi denotes the input array of instance i,
and yi indicates its category. Eq. (6) shows the collection of training cases:
(lx1, y1), (lx2, y2), ….., (lxp, yp); lxi ∈ R^DS …..(6)
where p stands for the number of training cases and DS stands for the number of
input dataset dimensions. The decision function is described as follows:
f(lx, a, b) = sgn(a · lx + b), a ∈ R^DS, b ∈ R …..(7)
Utilizing the SVM for system training has several benefits, one of which is its capacity to
handle multi-dimensional data. SVM is a classifier that outputs an ideal hyperplane that
categorizes new examples from input labelled training data. By maximizing the margin,
SVM creates a hyperplane between data sets.
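The decision function of Eq. (7) is simple to sketch; the weights below are illustrative, since in practice a and b come from the margin-maximizing training step:

```python
def svm_decision(lx, a, b):
    """f(lx, a, b) = sgn(a . lx + b): the sign of the signed distance
    to the hyperplane gives the predicted class."""
    score = sum(ai * xi for ai, xi in zip(a, lx)) + b
    return 1 if score >= 0 else -1

# Points on either side of the hyperplane a = (1, -1), b = 0:
print(svm_decision([3.0, 1.0], [1.0, -1.0], 0.0))  # 1
print(svm_decision([1.0, 3.0], [1.0, -1.0], 0.0))  # -1
```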
• Multilayer Perceptrons
Multilayer perceptrons (MLPs) are a fundamental type of artificial neural network widely
used in machine learning, including for phishing website detection. These neural networks
consist of multiple layers of interconnected nodes, each layer transforming the input data
through a series of weighted connections and nonlinear activation functions. In the context
of phishing detection, MLPs can be trained on features extracted from website content,
such as URL structure, HTML code, and textual content, to classify websites as either
legitimate or malicious. By learning complex patterns and relationships within the data,
MLPs can effectively distinguish between genuine and fraudulent websites, helping to
protect users from online threats.
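A single forward pass through a minimal MLP makes the layered transformation concrete (the two-input, two-hidden-unit network and its weights are purely illustrative):

```python
import math

def mlp_forward(x, hidden_weights, output_weights):
    """Weighted sums into each hidden node, a nonlinear activation
    (tanh), then a weighted sum into the output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

score = mlp_forward([1.0, 0.0], [[0.5, -0.5], [1.0, 1.0]], [1.0, -1.0])
print(round(score, 3))  # tanh(0.5) - tanh(1.0), rounded
```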
• Autoencoder Neural Network

An autoencoder is a neural network that has the same number of output neurons
as input neurons, while its hidden layers contain fewer neurons than the
input/output layers. Because there are fewer hidden neurons, the autoencoder
must learn to encode the input into this smaller hidden representation. In an
autoencoder, the predictors (x) and the output (y) are exactly the same.
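A toy fixed-weight example shows the bottleneck idea: four inputs pass through two hidden values and are decoded back to four outputs, with the input itself as the target (the names and the pairwise-mean encoding are illustrative, not learned weights):

```python
def encode(x):
    """Bottleneck: 4 inputs -> 2 hidden values (pairwise means)."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def decode(h):
    """Expand the 2 hidden values back to 4 outputs."""
    return [h[0], h[0], h[1], h[1]]

# An input whose pairs agree is reconstructed exactly (y == x):
x = [1.0, 1.0, 4.0, 4.0]
print(decode(encode(x)) == x)  # True
```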

4.5 LIBRARIES USED

Pandas: Pandas is a free, open-source Python library commonly used for loading
datasets and for data analytics. It is applied to machine learning in a variety
of domains, including economics and finance. It is extremely user-friendly and
can display datasets in a tabular style for easier comprehension.
Sklearn: Sklearn is one of the most essential Python libraries for machine learning. Sklearn
includes several tools for statistical classification, modelling, regression, dimensionality
reduction and clustering.
NumPy: NumPy is a Python package for numerical computing. In this project,
NumPy is used to work with arrays; all calculations on 1-D or 2-D arrays use
it. NumPy also has routines for working with linear algebra and the Fourier
transform.
Matplotlib: Matplotlib is an open-source Python library for data visualization,
used here to plot graphs of the model results. These plots aid in understanding
the results, since several aspects of the outcomes are easier to grasp when
formatted graphically.

4.6 Architecture Diagram

Figure 4.3: Architecture Diagram

4.7 Flowchart

4.8 Training of Models
In this section, we trained different models of machine learning for evaluating the accuracy.
It has been explained about the different models in below sections. Where in this project
the models are examined, with accuracy as the primary metric. In final stage we have
compared the model accuracy. In all circumstances the testing and training datasets are
splinted into 20:80 ratio. Feature Distribution :Here in below figure shows how the data is
distributed and how features are related to one another, a few plots and graphs are given.

Figure 4.4: Feature Distribution

Decision Tree Classifier :The method runs through all potential tests to
discover the one that is most informative about the target variable and builds
a tree. To generate the model, various parameters are set and the model is
fitted to the tree; the samples are divided into X/Y train and X/Y test sets to
check the accuracy of the model. On both the training and test samples, the
test and training accuracies were 82.6% and 81.0%, respectively.
Random Forest Classifier :We can limit the amount of overfitting by averaging
the outcomes of numerous trees that all perform well but overfit in different
ways. To construct a random forest model, the number of trees must first be
chosen. Random forests are powerful, frequently work well without much
parameter tuning, and do not require data scaling. On the same train/test
samples, the test and training accuracies were 83.4% and 81.4%.
MLP :MLPs can be thought of as generalized linear models that go through
numerous stages of processing before making a decision. To generate the model,
various parameters are set and the model is fitted; the samples are divided
into X/Y train and X/Y test sets to check the accuracy of the model. The test
and training accuracies were 86.3% and 85.9%.
XGBoost :To generate the model, various parameters are set and the model is
fitted to the boosted trees; the samples are divided into X/Y train and X/Y
test sets to check the accuracy of the model. The test and training accuracies
were 86.4% and 86.6%.
Autoencoder :The autoencoder must learn to encode the input into the smaller
number of hidden neurons; the predictors (x) and output (y) are identical. To
generate the model, various parameters are set and the model is fitted; the
samples are divided into X/Y train and X/Y test sets to check the accuracy of
the model. The test and training accuracies were 81.8% and 81.9%.
SVM :Given a series of training examples individually labelled as belonging to
one of two categories, an SVM training algorithm creates a model that assigns
new examples to one of the two categories, making it a non-probabilistic binary
linear classifier. To generate the model, various parameters are set and the
model is fitted; the samples are divided into X/Y train and X/Y test sets to
check the accuracy of the model. The test and training accuracies were 81.8%
and 79.8%.

4.9 User Interface:

Algorithm:

1. Import necessary modules:


- Flask for creating the web application.

- render_template for rendering HTML templates.
- request for handling HTTP requests.
- URLFeatureExtraction for extracting features from URLs.
- pickle for serializing and deserializing Python objects.

2. Create a Flask application instance:


- app = Flask(__name__)

3. Define routes for different pages:


- '/': Home page.
- '/about': About page.
- '/getURL': URL submission page.

4. Define functions for each route:


- index(): Renders the home page.
- about(): Renders the about page.
- getURL(): Handles URL submission form.

5. Inside getURL() function:


- Check if the request method is POST.
- If it is POST, retrieve the URL from the form data.
- Print the URL.
- Call the URLFeatureExtraction.featureExtraction() function to extract features from the
URL.
- Print the extracted features.
- Load the pre-trained XGBoost model using pickle.
- Predict the class of the URL using the loaded model.
- Print the predicted value.
- If the predicted value is 0, set the value variable to "Legitimate".
- If the predicted value is not 0, set the value variable to "Phishing".
- Render the home page template with the error message based on the prediction result.

6. Run the Flask application if the script is executed directly:
- app.run(debug=True)

This algorithm outlines the flow of the Flask application, from handling routes to
processing URL submissions and making predictions using a pre-trained XGBoost model.

Code implemented using algorithm:

from flask import Flask, render_template, request

import URLFeatureExtraction
import pickle

app = Flask(__name__)

@app.route('/')
def index():
    return render_template("home.html")

@app.route('/about')
def about():
    return render_template("about.html")

@app.route('/getURL', methods=['GET', 'POST'])
def getURL():
    if request.method == 'POST':
        url = request.form['url']
        print(url)
        # Extract the model's input features from the submitted URL
        data = URLFeatureExtraction.featureExtraction(url)
        print(data)
        # Load the pre-trained XGBoost classifier
        model = pickle.load(open('XGBoostClassifier.pickle.dat', 'rb'))
        predicted_value = model.predict(data)
        print(predicted_value)
        if predicted_value == 0:
            value = "Legitimate"
        else:
            value = "Phishing"
        return render_template("home.html", error=value)

if __name__ == "__main__":
    app.run(debug=True)

Code explanation:

Let's break down the significance of each part of the code:

1. *Importing Modules:*
- The code imports necessary modules such as Flask for creating the web application,
render_template for rendering HTML templates, request for handling HTTP requests,
URLFeatureExtraction for extracting features from URLs, and pickle for serializing and
deserializing Python objects.

2. *Creating Flask Application Instance:*


- app = Flask(__name__): This line creates an instance of the Flask application, allowing us
to define routes and handle HTTP requests.

3. *Defining Routes:*
- The code defines three routes:
- '/': Renders the home page.
- '/about': Renders the about page.

- '/getURL': Handles URL submission.

4. *Route Functions:*
- Each route is associated with a function:
- index(): Renders the home page template.
- about(): Renders the about page template.
- getURL(): Handles URL submission form, predicts URL class (legitimate or phishing),
and renders the home page template with the prediction result.

5. *URL Submission Handling:*


- Inside the getURL() function:
- It checks if the HTTP request method is POST.
- Retrieves the URL submitted through the form.
- Calls URLFeatureExtraction.featureExtraction() to extract features from the URL.
- Loads a pre-trained XGBoost model using pickle.
- Predicts the class of the URL (0 for legitimate, non-zero for phishing) using the loaded
model.
- Renders the home page template with an error message indicating the prediction
result.

6. *Running the Application:*


- The last block of code checks if the script is executed directly (__name__ == "__main__")
and then runs the Flask application with debugging enabled (app.run(debug=True)).

Overall, this Flask application provides a simple web interface for users to submit URLs and
receive predictions about their legitimacy or potential for phishing. It integrates machine
learning (XGBoost) for URL classification, allowing users to quickly assess the safety of
URLs they encounter.

CHAPTER-5
RESULTS AND DISCUSSIONS

All of the previously covered techniques were used to develop a machine
learning model: 80% of the dataset was used for training and the remaining
20% for testing. Machine learning techniques such as Random Forest, Decision
Tree, Logistic Regression, XGBoost, and SVM are employed to analyze and
ascertain the legitimacy of a given URL. After fitting the dataset to all the
algorithms, XGBoost yielded the best results; the performance analysis is
presented in Table 1.
While Random Forest gets 0.820 in training accuracy and holds 0.821 in test
accuracy, XGBoost has 0.868 in training accuracy and 0.858 in test accuracy.
Furthermore, the decision tree's test accuracy remains at 0.850 while its training
accuracy reaches 0.880.
Figure 2 shows the accuracy of each algorithm used for training the model.
Figure 3 presents a graph illustrating the relative significance of the various features
considered. Only a few of the fifteen criteria are crucial for improving accuracy.

The validation curves for each of the employed algorithms are shown in Figs. 4–6.
The model's accuracy, or score, for various algorithmic hyperparameter values is
shown on the validation curve.
Figure 4 shows that the training and cross-validation scores are nearly
identical and steadily rising, indicating that the model is performing
effectively. Figure 5 likewise shows comparable training and cross-validation
scores that rise together, so that model is also performing well. Overall, the
XGBoost model is the most accurate and best suited to the task.

Figure 5.1: Accuracy of different models

Figure 5.2: Validation Curve of SVM Model

Figure 5.3: Graph of Train and Test Accuracy of Different Algorithms

Figure 5.4: Feature Importance

Figure 5.5: Validation curve of Decision Tree Classifier

Figure 5.6: Validation Curve for Random Forest Classifier

CHAPTER-6
OUTPUT SCREENSHOTS

Figure 6.1: GUI made using Flask Phishing URL output

Figure 6.2: Legitimate URL output

Figure 6.3: Phishing URL Output

Figure 6.4: Legitimate URL Output

CHAPTER-7
CONCLUSION AND SCOPE OF FURTHER WORK

Phishing, with its ever-evolving techniques, indeed poses a significant threat to internet
users' security. Recognizing counterfeit URLs, often the gateway to phishing attacks, is crucial
for mitigating this risk. In this study, we delved into the linguistic and domain-based features of
URLs, leveraging machine learning to develop a robust detection model.

Our approach involved employing various machine learning algorithms such as Random
Forest, Decision Tree, XGBoost, Logistic Regression, and Support Vector Machine. These
algorithms were trained on a dataset comprising both legitimate and phishing URLs, enabling
them to learn patterns indicative of phishing attempts. Among these algorithms, XGBoost
emerged as the top performer, showcasing its efficacy in distinguishing between genuine and
fraudulent URLs.

However, despite the promising results, we identified areas for improvement. One
limitation we encountered was the absence of some URLs in the WHOIS database, hindering our
ability to gather comprehensive information. To address this, future efforts should focus on
expanding the feature set by incorporating additional data sources and enhancing feature
engineering techniques. By enriching the dataset with more diverse and up-to-date URLs, we can
enhance the model's accuracy and robustness.

Moreover, our study suggests broader applications beyond URL detection. One intriguing
prospect is the development of a browser extension that provides real-time recommendations and
alternatives for trustworthy websites based on user input. Such an extension could serve as an
invaluable tool for internet users, helping them navigate the web securely and confidently.

Looking ahead, we envision leveraging our machine learning model to create a


comprehensive search engine capable of detecting and blocking phony URLs proactively. By
integrating our detection capabilities into the fabric of web infrastructure, we aim to disrupt
phishing operations at scale, safeguarding users from malicious attacks.

Furthermore, our model can play a pivotal role in enhancing surveillance mechanisms
against emerging phishing threats. By continuously monitoring online activities and

automatically detecting new types of phishing attacks, we can stay one step ahead of
cybercriminals and bolster overall cybersecurity posture.

In summary, our study underscores the power of machine learning in combating phishing
and advancing web security. By refining our detection techniques, exploring innovative
applications, and collaborating across disciplines, we can pave the way towards a safer and more
secure online environment for all users.

CHAPTER-8
REFERENCES

[1] Safi, A., & Singh, S. (2023, February 1). A systematic literature review on
phishing website detection techniques. Journal of King Saud University. Computer
and Information Sciences/Maǧalaẗ Ǧamʼaẗ Al-malīk Saud : Ùlm Al-ḥasib Wa Al-
maʼlumat. https://doi.org/10.1016/j.jksuci.2023.01.004

[2] Machine Learning and Artificial Intelligence to Advance Earth System Science.
(2022, June 13). National Academies Press eBooks. https://doi.org/10.17226/26566

[3] Carolin Jeeva S, Rajsingh EB. Intelligent phishing URL detection using
association rule mining. Hum Centr Comput Inf Sci 2022.
https://doi.org/10.1186/s13673-016- 0064-3.

[4] Mohammed Nazim Feroz SM. Phishing URL detection using URL ranking. In:
Proceedings of the IEEE international congress on big data (BigData congress); 2015.
https://doi.org/10.1109/BigDataCongress.2015.97.

[5] Parekh Shraddha, Parikh Dhwanil, Kotak Srushti, Sankhe Smita. A new method
for detection of phishing websites: URL detection. IEEE; 2018. p. 949–52.

[6] K.V. Pradeepthi, A. Kannan, "Performance study of classification techniques
for phishing URL detection", https://ieeexplore.ieee.org/document/7229761, 2022.

[7] A.Y. Fu, “Detecting phishing web pages with visual similarity assessment based
on earth mover’s distance (EMD)”, 2022.

[8] D.Sahoo, “Malicious URL detection using machine learning: a survey”,2022.

[9] Sahingoz, O. K., Buber, E., Demir, O., & Diri, B. “Machine Learning-Based
Phishing Detection from URLs,” Expert Systems with Applications, vol. 117, pp. 345-
357, January 2019.

[10] J. James, Sandhya L. and C. Thomas, "Detection of phishing URLs using
machine learning techniques," International Conference on Control Communication
and Computing (ICCC), December 2013.

[11] Dipayan Sinha, Dr. Minal Moharir, Prof. Anitha Sandeep, “Phishing Website
URL Detection using Machine Learning,” International Journal of Advanced Science
and Technology, vol. 29, no. 3, pp. 2495-2504, 2020.

[12] Microsoft, Microsoft Consumer Safety Report.
https://news.microsoft.com/en-sg/2014/02/11/microsoft-consumersafety-index-revealsimpact-of-poor-online-safety-behaviours-in-singapore/sm.001xdu50tlxsej410r11kqvksu4nz.

[13] Internal Revenue Service, IRS E-mail Schemes. Available at
https://www.irs.gov/uac/newsroom/consumers-warnedof-new-surge-in-irs-email-schemes-during-2016-tax-season-tax-industry-also-targeted.

[14] Abu-Nimeh, S., Nappa, D., Wang, X., Nair, S. (2007), A comparison of machine
learning techniques for phishing detection.

[15] Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime
Researchers Summit, eCrime '07. doi:10.1145/1299015.1299021.

[16] E., B., K., T. (2015)., Phishing URL Detection: A Machine Learning and Web
Mining-based Approach. International Journal of Computer Applications,123(13), 46-
50. doi:10.5120/ijca2015905665.

[17] Ram Basnet, Srinivas Mukkamala et al., Detection of Phishing Attacks: A
Machine Learning Approach, In Proceedings of the International World Wide Web
Conference (WWW), 2003.

[18] Sklearn, ANN library. http://scikit-learn.org/stable/modules/ann.html.

