Phishing Website Detection

Submitted in partial fulfillment of the requirements

of the degree of

Bachelor of Engineering

by

Megha Agarwal (04)


Arieyshma Chowhan (20)
Shruti Jani (44)
Hansika Koli (54)

Supervisor:
Prof. Renuka Nagpure

Department of Information Technology

Atharva College of Engineering


Year: 2022-2023

ATHARVA COLLEGE OF ENGINEERING
MALAD (W), MUMBAI 400 095
YEAR: 2021-22

CERTIFICATE
This is to certify that

Megha Agarwal
Arieyshma Chowhan
Shruti Jani
Hansika Koli

have submitted the project report for the requirements of the Bachelor of
Engineering in Information Technology satisfactorily
on

“Phishing Website Detection”

As prescribed by the University of Mumbai Under the guidance of

PROJECT GUIDE H.O.D. PRINCIPAL

INTERNAL EXAMINER COLLEGE SEAL EXTERNAL EXAMINER

B.E. Mini-Project Report Approval

This mini-project synopsis entitled Phishing Website Detection by


Megha Agarwal, Arieyshma Chowhan, Shruti Jani, Hansika Koli
is approved for the degree of Information Technology from University
of Mumbai.

Examiners

1.

2.

Date:

Place:

Declaration

I declare that this written submission represents my ideas in my own words
and where others' ideas or words have been included, I have adequately cited
and referenced the original sources. I also declare that I have adhered to all
principles of academic honesty and integrity and have not misrepresented or
fabricated or falsified any idea/data/fact/source in my submission. I understand
that any violation of the above will be cause for disciplinary action by the
Institute and can also evoke penal action from the sources which have thus not
been properly cited or from whom proper permission has not been taken when
needed.

-----------------------------------------

(Signature)

-----------------------------------------

Megha Agarwal (04)


Arieyshma Chowhan (20)
Shruti Jani (44)
Hansika Koli (54)

Date:

Table of Contents

Chapter 1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Objectives
1.4 Scope
Chapter 2 Review of Literature
Chapter 3 Report on Present Investigation
3.1 Proposed System
3.1.1 Block diagram
3.2 Implementation
3.2.1 ML Algorithm
3.2.2 Dataset description / Data Preparation / Feature Engineering
Chapter 4 Model Implementation
• Training of Model
• Evaluation of Model
Chapter 5 Results and Discussion (Screenshots of the output with description)
5.1 Parameter Tuning and Inference
Chapter 6 Conclusion
Chapter 7 Future Scope

References

List of Figures
Figure No. Figure Name
3.1 Block Diagram

List of Tables
Table No. Table Name
3.1 Literature Review

Chapter 1
Introduction
In recent years, advancements in Internet and cloud technologies have led to a significant
increase in electronic trading, in which consumers make online purchases and transactions.
This growth has also created more opportunities for unauthorized access to users' sensitive
information and for damage to the resources of an enterprise. Phishing is a common attack
that tricks users into accessing malicious content and giving up their information. In terms
of website interface and uniform resource locator (URL), most phishing webpages look nearly
identical to the legitimate webpages they imitate.

1.1 Motivation

Website phishing costs internet users billions of dollars per year. Phishers steal personal
information and financial account details such as usernames and passwords, leaving users
vulnerable in the online space. According to the CheckPoint Research Security Report 2018,
77% of IT professionals feel their security teams are unprepared for today's cybersecurity
challenges, and 64% of organizations have experienced a phishing attack in the past year.
Detecting phishing websites is not easy: attackers use URL obfuscation to shorten the URL,
link redirection, and link manipulation so that a link looks trustworthy, among other tricks.
This motivates a switch from traditional programming methods to a machine learning
approach.

1.2 Problem Statement
Phishing detection techniques suffer from low detection accuracy and a high false alarm
rate, especially when novel phishing approaches are introduced. Besides, the most common
technique, the blacklist-based method, is inefficient in responding to emerging phishing
attacks: since registering a new domain has become easier, no blacklist can ensure a
perfectly up-to-date database.

1.3 Objectives

The objectives are as follows:

• To develop a novel approach to detect malicious URLs and alert users.
• To apply ML techniques in the proposed approach in order to analyze real-time URLs and
produce effective results.
• To implement the concept of RNN, a familiar ML technique that has the capability
to handle huge amounts of data.

The rest of the paper is organized as follows: Section 1 introduces the concept of malicious
URLs and the objective of the study. The background of the study and the related literature
on detecting malicious URLs are discussed in Section 2. Section 3 presents the methodology
of the research. Results and discussion are presented in Section 4. Finally, Section 5
concludes the study with its future direction.

1.4 Scope

Website phishing costs internet users billions of dollars per year. Phishers steal personal
information and financial account details such as usernames and passwords, leaving users
vulnerable in the online space.
The COVID-19 pandemic has boosted the use of technology in every sector, shifting
activities such as official meetings, classes, shopping, and payments from physical to
online spaces. This means more opportunities for phishers to carry out attacks that impact
victims financially, psychologically, and professionally.

Chapter 2
Review of Literature

Table 3.1 Literature Review

Sr No. | Author/Year | Title | Work
1 | Amani Alswailem, 2019 | Detecting Phishing Websites Using Machine Learning | We have selected the Random Forest. They conclude their paper with a combination of 26 features.
2 | Weiheng Bai, 2020 | Detection of Phishing Website using Machine Learning | This software is designed to show awareness of the extensive level of its functionality, whereas our software blacklists the particular website.
3 | Abdulhamit Subasi, 2020 | Comparison of Adaboost with MultiBoosting for Phishing Website Detection | This paper aims to enhance the detection method to detect phishing websites using SVM.
4 | Gururaj Harinahalli Lokesh, 2020 | Phishing website detection based on effective machine learning approach | This paper aims to enhance the detection method to detect phishing websites using Random Forest and K-nearest neighbors.

A phishing attack is one of the simplest ways to obtain sensitive information from
unsuspecting users. These papers deal with machine learning techniques for the detection
of phishing URLs by extracting and analyzing various features of legitimate and phishing
URLs. Machine learning algorithms such as decision tree, random forest, and support vector
machine are used to detect phishing websites.
These papers report accuracy above 85%, and their results show that classifiers perform
better when more data is used as training data.

Chapter 3
Report on Present Investigation

3.1 Proposed System

The proposed system uses different machine learning models trained over features such as
whether the URL contains an @ symbol, whether it uses double-slash redirection, the page
rank of the URL, and the number of external links embedded on the webpage.
A neural network perceptron trained on such machine learning data was able to achieve
better accuracy. This approach could reach up to a 92% true positive rate and a 0.4% false
positive rate.
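
As an illustration of the address-bar features mentioned above, the following is a minimal sketch of how a few of them might be computed from a raw URL. The function name `extract_url_features` and the feature names are hypothetical and are not the column names of the dataset used in this report; page rank and external-link counts are omitted because they need data beyond the URL string.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative address-bar features (hypothetical names)."""
    parsed = urlparse(url)
    # In an ordinary URL the only "//" follows the scheme (index 5 for http,
    # index 6 for https). A later "//" suggests a redirect hidden in the path.
    last_double_slash = url.rfind("//")
    return {
        "has_at_symbol": int("@" in url),
        "double_slash_redirect": int(last_double_slash > 7),
        "url_length": len(url),
        "num_dots_in_host": (parsed.hostname or "").count("."),
        "uses_https": int(parsed.scheme == "https"),
    }


suspicious = extract_url_features("http://example.com@evil.test//login")
ordinary = extract_url_features("https://www.example.com/index.html")
```

Each feature is encoded as a number so that the resulting dictionary can be turned directly into one row of a training table.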

3.1.1 Block diagram

Figure 3.1

Steps for the training and evaluation of model:


• Dataset Collection - The set of phishing URLs is collected from an open data service
called Kaggle. This service provides a set of phishing URLs in multiple formats such as
csv and json, and it gets updated on a regular basis. To download the data:
https://www.kaggle.com/datasets/shashwatwork/phishing-dataset-for-machine-learning

• Data Exploration - This step helps identify patterns and issues inside the dataset, as
well as decide which model or algorithm to apply in the next steps.
• Extracting Features and Feature Selection - Address Bar based Features, Domain
based Features, HTML & JavaScript based Features.
Altogether, 48 features are extracted from the 10,000 URL dataset and are
stored in the 'Phishing_Legitimate_full.csv' file in the DataFiles folder.

• Model Training and Classification - Before starting the ML model training, the data is
split 80-20, i.e., 8,000 training samples and 2,000 testing samples. From the dataset, it
is clear that this is a supervised machine learning task. There are two major types of
supervised machine learning problems, called classification and regression.
This dataset is a classification problem, because the input URL is classified as
phishing (1) or legitimate (0). The supervised machine learning models (classification)
considered to train on the dataset in this project are:
Logistic Regression
K-Neighbors Classifier
Random Forest
Decision Tree
XGBoost

All these models are trained on the dataset, and evaluation of each model is done with
the test dataset. The elaborate details of the models and their training are mentioned in
https://colab.research.google.com/drive/1t6eUJFBhe-rfBK2NMc2DU1-TPcHMt7In?usp=sharing

• Model Evaluation - From the obtained results of the above models, the XGBoost Classifier
has the highest model performance, at 99%.
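
The 80-20 split and the evaluation step described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `split_rows` and `evaluate` are hypothetical, the metrics shown (accuracy, true positive rate, false positive rate) are one possible reading of "model performance", and the report's own notebook may compute these differently.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(y_true, y_pred):
    """Return accuracy, true positive rate and false positive rate.

    Labels follow the report's convention: phishing = 1, legitimate = 0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# 10,000 rows split 80-20 gives 8,000 training and 2,000 testing samples.
train_rows, test_rows = split_rows(range(10000))
metrics = evaluate([1, 1, 0, 0], [1, 0, 0, 1])
```

Fixing the random seed keeps the split reproducible, so that every model in the comparison is trained and tested on the same samples.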

3.2 Implementation

3.2.1 ML Algorithm
XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient,
flexible and portable. It implements machine learning algorithms under the Gradient Boosting
framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that
solves many data science problems in a fast and accurate way.
STEPS:
Step 1: Load the important libraries
Step 2: Import dataset.
Step 3: Divide the dataset into train and test
Step 4: Initialize the models
Step 5: Fit the models
Step 6: Make predictions
Step 7: Evaluate the model's performance
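
To make the gradient boosting idea behind these steps concrete, here is a minimal pure-Python sketch that boosts one-split decision trees (stumps) on the log-loss gradient. It is not XGBoost itself, which adds second-order gradients, regularization, and parallel tree construction; the function names, the toy data, and the hyperparameters below are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None  # (squared error, feature, threshold, left value, right value)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, lv, rv)
    return None if best is None else best[1:]


def fit_boosting(X, y, rounds=20, learning_rate=0.5):
    """Fit stumps to the log-loss gradient; assumes both classes appear in y."""
    p = sum(y) / len(y)
    base = math.log(p / (1 - p))  # initial score: log-odds of the positive class
    scores = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of the log-loss: label minus predicted probability.
        residuals = [t - sigmoid(s) for t, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, thr, lv, rv = stump
        stumps.append(stump)
        scores = [s + learning_rate * (lv if row[j] <= thr else rv)
                  for row, s in zip(X, scores)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, lr = model
    score = base + sum(lr * (lv if row[j] <= thr else rv) for j, thr, lv, rv in stumps)
    return int(sigmoid(score) >= 0.5)


# Toy data: the first feature alone determines the label (1 = phishing).
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]
model = fit_boosting(X, y)
predictions = [predict(model, row) for row in X]
```

Each round fits a new stump to the errors of the current ensemble, which is the core loop that libraries such as XGBoost optimize and scale up.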

3.2.2 Dataset description / Data Preparation/Feature Engineering

Data Preparation is the process of collecting, cleaning, and consolidating data into one file or
data table, primarily for use in analysis.
The major tasks we use in data preparation are as follows:
• Data discretization
• Data cleaning
• Data integration
• Data transformation
• Data reduction
We have collected the dataset from Kaggle under the name Phishing_Legitimate_full.
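
As a rough sketch of the cleaning and transformation tasks listed above, the snippet below reads a small CSV sample, skips incomplete rows, drops an identifier column, and separates the features from the label. The column names (`id`, `NumDots`, `UrlLength`, `AtSymbol`, `CLASS_LABEL`) and the helper `load_feature_table` are assumptions made for illustration and should be checked against the actual Phishing_Legitimate_full file.

```python
import csv
import io


def load_feature_table(csv_text, label_column="CLASS_LABEL", drop_columns=("id",)):
    """Parse CSV text into (features, labels), skipping incomplete rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        # Data cleaning: skip rows with extra, missing or empty fields.
        if None in row or any(v is None or v.strip() == "" for v in row.values()):
            continue
        labels.append(int(row[label_column]))
        # Data reduction: drop identifier columns that carry no signal.
        features.append({
            name: float(value)
            for name, value in row.items()
            if name != label_column and name not in drop_columns
        })
    return features, labels


# Hypothetical three-row sample; the second row has a missing UrlLength.
sample = (
    "id,NumDots,UrlLength,AtSymbol,CLASS_LABEL\n"
    "1,3,72,0,1\n"
    "2,2,,0,0\n"
    "3,1,30,0,0\n"
)
features, labels = load_feature_table(sample)
```

Skipping incomplete rows is the simplest cleaning policy; imputing missing values would be an alternative when dropping rows loses too much data.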

Chapter 4
Model Implementation
• Training of Model

• Evaluation of Model

Chapter 5
Results and Discussion

5.1 Parameter Tuning and Inference

Chapter 6
Conclusion
To the best of our knowledge, the present study is the first review which included results
from all studies that applied machine learning methods to the detection of phishing websites.

The proposed study treats phishing detection as a classification problem, in which
phishing website detection involves the automatic categorization of websites into a
predetermined set of class values based on several features and the class variable. ML-based
phishing techniques depend on website features to gather information that can help classify
websites and detect phishing sites. The problem of phishing cannot be eliminated, but it can
be reduced by fighting it in two ways: improving targeted anti-phishing techniques, and
informing the public on how fraudulent phishing websites can be detected and identified.
To combat the ever-evolving and increasingly complex phishing attacks and tactics, ML
anti-phishing techniques are essential. The outcome of this study shows that the proposed
method gives better results than the existing deep learning methods. The model achieved
higher accuracy and F1-score within a limited amount of time. The future direction of this
study is to develop an unsupervised deep learning method to generate insight from a URL.
In addition, the study can be extended to generate an outcome for a larger network and to
protect the privacy of an individual.

Chapter 7
Future Scope
• This project can be further extended by creating a browser extension or developing a GUI
which takes the URL and predicts its nature, i.e., legitimate or phishing.
• As of now, we are working towards the creation of a browser extension for this
project, and may also attempt the GUI option.
• Further developments will be updated at the earliest.
• We look forward to making a full-fledged application that directly blocks the
website instead of only checking it.

References

• Alswailem, A. (2019). Detecting Phishing Websites Using Machine Learning.
• Bai, W. (2020). Phishing Website Detection Based on Machine Learning Algorithm.
• Subasi, A. (2020). Comparison of Adaboost with MultiBoosting for Phishing Website
Detection.
• Boregowda, G. (2020). Phishing website detection based on effective machine learning
approach.
