Phishing Website Detection
of the degree of
Bachelor of Engineering
by
Supervisor:
Prof. Renuka Nagpure
ATHARVA COLLEGE OF ENGINEERING
MALAD (W), MUMBAI 400 095
YEAR: 2021-22
CERTIFICATE
This is to certify that
Megha Agarwal
Arieyshma Chowhan
Shruti Jani
Hansika Koli
have submitted the project report for the requirements of the Bachelor of
Engineering in Information Technology satisfactorily
on
B.E. Mini-Project Report Approval
Examiners
1.
2.
Date:
Place:
Declaration
-----------------------------------------
(Signature)
-----------------------------------------
Date:
Date:
Table of Contents
Chapter 1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Objectives
1.4 Scope
Chapter 2 Review of Literature
Chapter 3 Report on Present Investigation
3.1 Proposed System
3.1.1 Block diagram
3.2 Implementation
3.2.1 ML Algorithm
3.2.2 Dataset description / Data Preparation / Feature Engineering
Chapter 4 Model Implementation
• Training of Model
• Evaluation of Model
Chapter 5 Results and Discussion (Screenshots of the output with description)
5.1 Parameter Tuning and Inference
Chapter 6 Conclusion
References
List of Figures
Figure No. Figure Name
3.1 Block Diagram
List of Tables
Table No. Table Name
3.1 Literature Review
Chapter 1
INTRODUCTION
In recent years, advancements in Internet and cloud technologies have led to a significant
increase in electronic trading, in which consumers make purchases and transactions online.
This growth has also created more opportunities for unauthorized access to users' sensitive
information and for damage to enterprise resources. Phishing is one of the most common
attacks: it tricks users into accessing malicious content in order to obtain their information.
In terms of website interface and uniform resource locator (URL), most phishing webpages
look nearly identical to the legitimate webpages they imitate.
1.1 Motivation
Website phishing costs internet users billions of dollars per year. Phishers steal personal
information and financial account details such as usernames and passwords, leaving users
vulnerable in the online space. According to the CheckPoint Research Security Report 2018,
77% of IT professionals feel their security teams are unprepared for today's cybersecurity
challenges, and 64% of organizations have experienced a phishing attack in the past year.
Detecting phishing websites is not easy: attackers use URL obfuscation such as URL
shortening, link redirection, and manipulation of links so that they look trustworthy, among
other tricks. This motivates a switch from traditional programming methods to a machine
learning approach.
1.2 Problem Statement
Phishing detection techniques suffer from low detection accuracy and a high false alarm rate,
especially when novel phishing approaches are introduced. Besides, the most commonly used
technique, the blacklist-based method, is inefficient in responding to emerging phishing
attacks: since registering a new domain has become easier, no blacklist can guarantee a
complete, up-to-date database.
1.3 Objectives
The rest of the report is organized as follows: Chapter 1 introduces the concept of malicious
URLs and the objective of the study. The background of the study and related literature on
detecting malicious URLs are discussed in Chapter 2. Chapter 3 presents the methodology of
the research, and Chapter 4 covers model implementation. Results and discussion are
presented in Chapter 5. Finally, Chapter 6 concludes the study and Chapter 7 outlines its
future direction.
1.4 Scope
The COVID-19 pandemic has boosted the use of technology in every sector, shifting activities
such as official meetings, classes, shopping, and payments from physical to online spaces.
This gives phishers more opportunities to carry out attacks that affect victims financially,
psychologically, and professionally.
Chapter 2
Review of Literature
Table 3.1: Literature Review
Forest, K nearest neighbors.
Chapter 3
Report on Present Investigation
The proposed system uses different machine learning models trained on features such as
whether the URL contains an @ symbol, whether it uses double-slash redirection, the page
rank of the URL, and the number of external links embedded in the webpage.
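As an illustration of the address-bar features mentioned above, the sketch below computes a few of them from the URL string alone. It is a minimal sketch, not the project's feature extractor: the function name `extract_url_features`, the feature names, and the position threshold of 7 used for the double-slash check are assumptions made for this example. Page rank and external-link counts need network access and are left out.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few address-bar features for one URL (illustrative subset)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # In an ordinary URL the only "//" follows the scheme (index 5 or 6).
    # A later "//" suggests a redirect embedded in the path.
    last_double_slash = url.rfind("//")
    return {
        "has_at_symbol": int("@" in url),
        "double_slash_redirect": int(last_double_slash > 7),  # assumed threshold
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "uses_https": int(parsed.scheme == "https"),
    }


suspicious = extract_url_features("http://paypal.com@evil.example//redirect")
ordinary = extract_url_features("https://example.com/login")
print(suspicious)
print(ordinary)
```

Each URL becomes one row of numeric values, which is the form the classifiers in this report expect as input.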
A neural network perceptron applied to data provided by Machine Learning was able to
achieve better accuracy: this approach reached up to a 92% true positive rate and a 0.4%
false positive rate.
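For reference, the two rates quoted above are computed from confusion-matrix counts as TPR = TP / (TP + FN) and FPR = FP / (FP + TN), with phishing treated as the positive class (label 1). A minimal sketch follows; the function name `tpr_fpr` and the toy labels are illustrative assumptions, not results from this project.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of phishing sites caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of legitimate sites flagged
    return tpr, fpr


# Toy labels for illustration only: 1 = phishing, 0 = legitimate
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_tpr, toy_fpr = tpr_fpr(toy_true, toy_pred)
print(toy_tpr, toy_fpr)
```

A low false positive rate matters here because every false alarm blocks or flags a legitimate website.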
Figure 3.1: Block Diagram
• Data Exploration - This step helps identify patterns and issues in the dataset, and
helps decide which model or algorithm to apply in the next steps.
• Extracting Features and Feature Selection - Address Bar based Features, Domain
based Features, HTML & JavaScript based Features.
In total, 48 features are extracted from the 10,000-URL dataset and
stored in the 'Phishing_Legitimate_full.csv' file in the DataFiles folder.
All the models are trained on the dataset and evaluated on the test dataset.
The details of the models and their training are available at
https://colab.research.google.com/drive/1t6eUJFBhe-rfBK2NMc2DU1-TPcHMt7In?usp=sharing
• Model Evaluation - From the results of the above models, the XGBoost Classifier
has the highest model performance, at 99%.
3.2 Implementation
3.2.1 ML Algorithm
XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient,
flexible and portable. It implements machine learning algorithms under the Gradient Boosting
framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that
solves many data science problems in a fast and accurate way.
STEPS:
Step 1: Load the required libraries.
Step 2: Import the dataset.
Step 3: Divide the dataset into train and test sets.
Step 4: Initialize the models.
Step 5: Fit the models.
Step 6: Generate predictions.
Step 7: Evaluate the model's performance.
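The seven steps can be sketched end to end as follows. To keep the example self-contained, it does not call the XGBoost library. It uses a small hand-written boosting loop over one-split trees (stumps) with an XGBoost-style second-order gain and regularised leaf values, and a synthetic stand-in dataset instead of Phishing_Legitimate_full.csv. All function names, the labelling rule, and the hyperparameter values are assumptions made for illustration.

```python
# Step 1: load the libraries (standard library only in this sketch)
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, resid, hess, lam):
    """Choose the one-feature threshold split with the highest second-order gain."""
    g_total, h_total = sum(resid), sum(hess)
    best_gain, best = -1.0, None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:  # keep both sides non-empty
            gl = sum(g for row, g in zip(X, resid) if row[j] <= thr)
            hl = sum(h for row, h in zip(X, hess) if row[j] <= thr)
            gr, hr = g_total - gl, h_total - hl
            gain = gl * gl / (hl + lam) + gr * gr / (hr + lam)
            if gain > best_gain:
                best_gain, best = gain, (j, thr, gl / (hl + lam), gr / (hr + lam))
    return best


def stump_value(stump, row):
    j, thr, left, right = stump
    return left if row[j] <= thr else right


def fit(X, y, rounds, eta, lam):
    """Fit stumps one after another to the residuals y - p of the log loss."""
    base = sum(y) / len(y)
    f0 = math.log(base / (1.0 - base))  # initial score: log-odds of the base rate
    scores, stumps = [f0] * len(X), []
    for _ in range(rounds):
        probs = [sigmoid(s) for s in scores]
        resid = [t - p for t, p in zip(y, probs)]
        hess = [p * (1.0 - p) for p in probs]
        stump = fit_stump(X, resid, hess, lam)
        stumps.append(stump)
        scores = [s + eta * stump_value(stump, row) for s, row in zip(scores, X)]
    return f0, stumps


def predict(model, X, eta):
    """Label a row as phishing (1) when its boosted score is positive."""
    f0, stumps = model
    return [int(f0 + eta * sum(stump_value(s, r) for s in stumps) > 0.0) for r in X]


# Step 2: import the dataset (synthetic stand-in rows with an assumed labelling rule)
random.seed(0)


def make_row():
    has_at = int(random.random() < 0.3)
    double_slash = int(random.random() < 0.4)
    num_dots = random.randint(1, 5)
    label = int(has_at == 1 or (double_slash == 1 and num_dots >= 3))
    return [has_at, double_slash, num_dots], label


rows = [make_row() for _ in range(300)]

# Step 3: divide the dataset into train (80%) and test (20%)
random.shuffle(rows)
split = int(0.8 * len(rows))
X_train, y_train = [f for f, _ in rows[:split]], [t for _, t in rows[:split]]
X_test, y_test = [f for f, _ in rows[split:]], [t for _, t in rows[split:]]

# Step 4: initialise the model settings (assumed values)
params = {"rounds": 60, "eta": 0.5, "lam": 1.0}

# Step 5: fit the model
model = fit(X_train, y_train, **params)

# Step 6: come up with predictions on the held-out rows
y_pred = predict(model, X_test, params["eta"])

# Step 7: evaluate the model's performance
accuracy = sum(int(p == t) for p, t in zip(y_pred, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.3f}")
```

With the actual library, Steps 4 to 6 would be replaced by the library's own classifier, and Step 2 by reading the CSV file. The split, fit, predict, evaluate order stays the same.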
Data Preparation is the process of collecting, cleaning, and consolidating data into one file or
data table, primarily for use in analysis.
The major tasks we use in data preparation are as follows:
• Data discretization
• Data cleaning
• Data integration
• Data transformation
• Data reduction
We have collected the dataset from Kaggle under the name Phishing_Legitimate_full.
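Three of the preparation tasks listed above (cleaning, reduction, and transformation) can be illustrated on a few rows. This is a minimal sketch under stated assumptions: the inline sample stands in for the real file, the column names `NumDots`, `UrlLength`, `AtSymbol`, and `CLASS_LABEL` are assumed rather than confirmed against the Kaggle file, and the helper names are hypothetical.

```python
import csv
import io

# Tiny stand-in for the CSV file: one duplicate row and one row with a missing cell
SAMPLE_CSV = """NumDots,UrlLength,AtSymbol,CLASS_LABEL
3,72,0,1
3,72,0,1
1,25,0,0
2,,0,0
5,144,0,1
"""


def load_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: drop rows with empty cells and exact duplicates."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row.values())
        if "" in key or key in seen:
            continue
        seen.add(key)
        kept.append({k: float(v) for k, v in row.items()})
    return kept


def drop_constant_columns(rows):
    """Data reduction: remove columns that hold a single value in every row."""
    keep = [k for k in rows[0] if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keep} for r in rows]


def min_max_scale(rows, column):
    """Data transformation: rescale one column to the range [0, 1]."""
    lo = min(r[column] for r in rows)
    hi = max(r[column] for r in rows)
    for r in rows:
        r[column] = (r[column] - lo) / (hi - lo) if hi > lo else 0.0
    return rows


prepared = min_max_scale(drop_constant_columns(clean(load_rows(SAMPLE_CSV))), "UrlLength")
for row in prepared:
    print(row)
```

In this sample the `AtSymbol` column is dropped because it never varies, and `UrlLength` is rescaled so that features with large raw values do not dominate the others.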
Chapter 4
Model Implementation
• Training of Model
• Evaluation of Model
Chapter 5
Results and Discussion
5.1 Parameter Tuning and Inference
Chapter 6
Conclusion
To the best of our knowledge, the present study is the first review to include results from
all studies that applied machine learning methods to the detection of phishing websites.
This study treats phishing detection as a classification problem, in which websites are
automatically categorized into a predetermined set of class values based on several features
and the class variable. ML-based anti-phishing techniques depend on website features to
gather information that can help classify websites and detect phishing sites. The problem of
phishing cannot be eliminated, but it can be reduced by combating it in two ways: improving
targeted anti-phishing procedures and techniques, and informing the public on how
fraudulent phishing websites can be detected and identified. To combat the ever-evolving
and increasingly complex phishing attacks and tactics, ML anti-phishing techniques are
essential. The outcome of this study reveals that the proposed method offers superior
results compared with existing deep learning methods. The model achieved higher accuracy
and F1-score within a limited amount of time. The future direction of this study is to
develop an unsupervised deep learning method to generate insight from a URL. In addition,
the study can be extended to generate outcomes for a larger network and to protect the
privacy of individuals.
Chapter 7
Future Scope
• This project can be extended by creating a browser extension or developing a GUI
that takes a URL and predicts its nature, i.e., legitimate or phishing.
• At present, we are working towards a browser extension for this project, and may
also attempt the GUI option.
• Further developments will be updated at the earliest.
• We look forward to building a full-fledged application that directly blocks a
phishing website instead of only checking it.
References