Malicious URL Detection Using
Machine Learning
Presented By
Mr. Swapnil Thorat
TE (Computer Engineering)
Roll No. : TC52
Under the Guidance of
Prof. Dr. Sashikala Mishra
DEPARTMENT OF COMPUTER ENGINEERING
Hope Foundation’s
International Institute of Information Technology, Hinjewadi,
Pune-411057
Contents
1. Introduction
2. Identify the Social Problem to be solved using Computing Algorithms
3. Motivation
4. Literature Survey
5. Objective
6. Approach
7. Architecture
8. Details of Design and structure of module
9. Advantages and Disadvantages
10. Conclusion & Future work
11. References
Introduction
• Phishing is one of the most commonly used social engineering and cyber attacks.
• Through such attacks, the phisher targets naïve online users by tricking them
into revealing confidential information, with the purpose of using it
fraudulently.
• One countermeasure is to maintain a blacklist of phishing websites, which
requires that the website has already been detected as phishing.
• Another is to detect phishing websites early in their appearance, using
machine learning and deep neural network algorithms.
• Of these methods, the machine learning based method has been reported to be
more effective than the others.
• Even then, online users are still being trapped into revealing sensitive
information in phishing websites.
Identify the Social Problem to be solved using Computing Algorithms
• Malicious Web sites are the basis of most of the criminal activities over
the internet.
• The dangers that arise from malicious sites are enormous, and end-users
must be prevented from visiting such sites.
• Users should avoid clicking on such Uniform Resource Locators (URLs).
• The detection of malicious URLs is a binary classification problem.
Several machine learning algorithms, namely Random Forests, SVMs
and Naive Bayes, are trained on the training dataset. The Random Forest
classifier has been seen to perform better on this particular problem
than the SVM classifier.
Motivation
• Currently, risks to network information security are increasing
rapidly in number and level of danger. The methods most used by
hackers today are to attack end-to-end technology and to exploit
human vulnerabilities.
• These techniques include social engineering, phishing, pharming, etc.
One of the steps in conducting these attacks is to deceive users with
malicious Uniform Resource Locators (URLs). As a result, malicious URL
detection is of great interest nowadays.
• Several scientific studies have shown a number of methods to detect
malicious URLs based on machine learning and deep learning techniques.
• In this paper, we propose a malicious URL detection method using
machine learning techniques based on our proposed URL behaviors and
attributes.
• It is suggested that the proposed system may be considered an
optimized and user-friendly solution for malicious URL detection.
Literature Survey
1. Paper Name and Year: Empirical Study on Malicious URL Detection Using
   Machine Learning [2018]
   Author: Ripon Patgiri, Hemanth Katari, Ronit Kumar, and Dheeraj Sharma
   Summary: Malicious Web sites are the basis of most of the criminal
   activities over the internet. The dangers that arise due to the malicious
   sites are enormous and the end-users must be prohibited from visiting such
   sites. The users should prohibit themselves from clicking on such Uniform
   Resource Locator (URL).

2. Paper Name and Year: Detection of URL based Phishing Attacks using
   Machine Learning [Nov 2019]
   Author: Ms. Sophiya Shikalgar, Dr. S. D. Sawarkar, Mrs. Swati Narwane
   (all Department of Computer Engineering, Datta Meghe College of
   Engineering, Airoli, Navi Mumbai, INDIA)
   Summary: This paper addresses the widespread cybersecurity concern where
   threat actors bypass security defenses and use URLs to launch various
   forms of malicious attacks on unsuspecting individuals. In order to
   prevent such attacks, the paper proposes the use of machine learning
   algorithms to detect malicious URLs. The proposed MuD (Malicious URL
   Detection) model is trained using an existing dataset which contains
   URLs, each with unique features, and is applied to three different
   machine learning classifiers: support vector machine, logistic regression
   and Naïve Bayes. After training and testing the algorithms, it is
   observed that the Naïve Bayes classifier recorded the highest accuracy.

3. Paper Name and Year: Malicious URL Detection Based on Associative
   Classification [2020]
   Author: Sandra Kumi, ChaeHo Lim and Sang-Gon Lee
   Summary: Cybercriminals have invented sophisticated ways such as
   injecting malicious code into websites to disseminate malware in an
   attempt to infect target systems. Associative classification approaches
   to detect malicious URLs mainly focus on phishing websites. Regarding
   this, we present an approach based on the classification based on
   association (CBA) algorithm to detect malicious URLs comprising phishing,
   malware, and drive-by-download websites.

4. Paper Name and Year: Using Deep Learning to Detect Malicious URLs [2019]
   Author: Yuchen Liang (Shady Side Academy, Pittsburgh, PA, United States),
   Xiaodan Yan (Beijing University of Posts and Telecommunications, Beijing,
   China)
   Summary: This paper presents different approaches to detect DGA-generated
   domains based on the features of URLs. The result proves that the DBLSTM
   algorithm is superior to other conventional machine learning methods. The
   source code is posted on GitHub for other groups to use or to reproduce
   the same result
   (https://github.com/liangy2019/Using-Deep Learning-to-Detect-Malicious-URLs).
   The deep learning technique presented in the paper can be widely utilized
   in the realm of cybersecurity, especially for energy network security, to
   detect attacks initiated by different domain generation algorithms.
Objectives
• Collect a dataset consisting of a large number of URLs, both malicious
and non-malicious.
• Divide the collected dataset into two subsets in the ratio of 80:20 for
training and testing purposes.
• Extract features from the training data, categorized into lexical features,
network based features and host based features.
• Calculate the accuracy obtained using each of the algorithms.
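The 80:20 division can be sketched as follows. This is a minimal sketch using only the Python standard library. The function name `split_80_20`, the toy URLs and the fixed seed are illustrative assumptions, not the project's actual code. The split is done per class so that both subsets contain malicious and non-malicious URLs.

```python
import random

def split_80_20(samples, labels, train_ratio=0.8, seed=42):
    """Shuffle and split (sample, label) pairs, keeping the class mix in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(len(items) * train_ratio)  # first 80% of each class goes to training
        train += [(s, label) for s in items[:cut]]
        test += [(s, label) for s in items[cut:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data (hypothetical URLs): label 1 = malicious, 0 = non-malicious.
urls = [f"http://example{i}.test/page" for i in range(10)]
labels = [1] * 5 + [0] * 5
train_set, test_set = split_80_20(urls, labels)
print(len(train_set), len(test_set))
```

Splitting per class is a design choice for this sketch; a plain random split would also give an 80:20 ratio but could leave one class under-represented in a small test set.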
APPROACH
The steps involved in completing this project are:
• Collect a dataset containing phishing and legitimate websites from open source platforms.
• Write code to extract the required features from the URL database.
• Analyze and preprocess the dataset using EDA techniques.
• Divide the dataset into training and testing sets.
• Run selected machine learning and deep neural network algorithms, such as SVM, Random Forest
and Autoencoder, on the dataset.
• Write code to display the evaluation results in terms of accuracy metrics.
• Compare the results obtained for the trained models and specify which is better.
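The last two steps, displaying accuracy and comparing the trained models, can be sketched as below. This is a minimal sketch, not the project's evaluation code. The names `accuracy` and `compare_models`, the model names and the prediction lists are hypothetical and exist only to show the computation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def compare_models(y_true, predictions):
    """Return (name of the most accurate model, {model name: accuracy})."""
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical test labels and model outputs (1 = phishing, 0 = legitimate).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],  # 7 of 8 correct
    "model_b": [1, 1, 1, 0, 0, 1, 1, 0],  # 5 of 8 correct
}
best, scores = compare_models(y_test, predictions)
for name, score in scores.items():
    print(f"{name}: accuracy = {score:.3f}")
print("best:", best)
```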
Architecture of System
Technology
Naive Bayes:
This classifier is also known as a Generative
Learning Model. The classification is based on Bayes' theorem and
assumes independent predictors. In simple words, this classifier assumes
that the presence of a specific feature in a class is not related to the presence
of any other feature. Even if the features depend on each other
or on the presence of other features, each of them is treated as making an
independent contribution to the probability of the output. This classification
algorithm is useful for large datasets and is easy to use.
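The independence assumption can be illustrated with a small Bernoulli Naive Bayes over binary features. This is a minimal sketch under stated assumptions: the function names, the three feature names in the comment, the toy data and the use of Laplace smoothing are illustrative choices, not the project's implementation.

```python
import math

def train_bernoulli_nb(X, y):
    """Estimate class priors and P(feature = 1 | class), with Laplace smoothing."""
    n_features = len(X[0])
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_bernoulli_nb(model, x):
    """Pick the class with the highest log posterior.

    Each feature adds its own log-probability term, which is exactly the
    'independent contribution' assumption of Naive Bayes.
    """
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(x, probs):
            score += math.log(p if value == 1 else 1 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy binary features (hypothetical): [has_ip_address, has_at_symbol, is_long_url]
X = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = legitimate
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 1, 0]))  # 1
print(predict_bernoulli_nb(model, [0, 0, 0]))  # 0
```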
Random Forest:
This algorithm is an ensemble learning method for
classification, regression and other tasks. It works by building a
group of decision trees at training time and outputting the
class that is the mode of the individual trees' predictions (classification)
or their mean prediction (regression). It corrects for the tendency of
decision trees to overfit the training data set.
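The idea of building many trees and taking the mode of their predictions can be sketched as below. This is a simplified illustration, not a full Random Forest: each "tree" is a one-split decision stump on a single randomly chosen binary feature, trained on a bootstrap sample. The function names and toy data are assumptions made for the example.

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    """One-split 'tree': predict the majority label on each side of a binary feature."""
    sides = {0: [], 1: []}
    for x, label in zip(X, y):
        sides[x[feature]].append(label)
    default = Counter(y).most_common(1)[0][0]  # fallback when one side is empty
    rule = {value: (Counter(ls).most_common(1)[0][0] if ls else default)
            for value, ls in sides.items()}
    return feature, rule

def train_forest(X, y, n_trees=15, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # sample rows with replacement
        feature = rng.randrange(len(X[0]))         # random feature for this tree
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Output the mode (majority vote) of the individual tree predictions."""
    votes = [rule[x[feature]] for feature, rule in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy binary feature vectors (hypothetical): 1 = phishing, 0 = legitimate.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
     [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = train_forest(X, y)
print(predict_forest(forest, [1, 1, 1]))
```

An odd number of trees is used so that a two-class vote cannot tie.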
Support vector machine (SVM):
This is another supervised classification
algorithm that is easy to use. It can be used for both
classification and regression applications, but it is better known for its use
in classification. In this algorithm, each data item is plotted as a
point in an n-dimensional space, where 'n' is the number of features of the
data. Classification is done by finding the hyperplane that best
separates the classes, so that data points of different classes fall on
different sides of it.
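How a linear SVM assigns a class once the separating hyperplane is known can be sketched as below. This sketch covers only the prediction step. The weights `w` and bias `b` are hypothetical values chosen for illustration, whereas a real SVM would learn them from the training set.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 (e.g. phishing) or -1 (e.g. legitimate)."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical hyperplane for n = 3 features.
w = [3.0, 4.0, 0.0]
b = -5.0
print(classify(w, b, [1, 1, 0]))                # 3 + 4 - 5 = 2, so +1
print(classify(w, b, [0, 1, 1]))                # 4 - 5 = -1, so -1
print(distance_to_hyperplane(w, b, [1, 1, 0]))  # 2 / 5 = 0.4
```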
XGBoost:
XGBoost is a more recent algorithm that researchers have found
very useful for machine learning classification. It
is fast and performs well because it is an implementation of
boosted decision trees. This classification model is used to improve both the
performance and the speed of the model.
Data set:
The URL data is obtained from the Phishtank website, an
anti-phishing site. It contains URLs in unstructured form. Our
main objective is to detect whether a URL is phishing or legitimate based on
the extracted features. In preprocessing we perform feature extraction:
the URLs are passed to the feature extractor, which extracts
feature values through the predefined URL-based features. The features are
assigned binary values 0 and 1, which indicate whether the feature is present
or not, as shown in the figure below. The extracted feature values are stored
as a structured dataset and passed as input to the classifiers. We
use four classification methods, namely XGBoost, SVM, Naive Bayes and a
stacking classifier, to detect a URL as phishing or legitimate.
The classifier then determines whether a requested site is a phishing site.
When there is a page request, the URL of the requested site is passed to the
feature extractor, which extracts the feature values through the predefined
URL-based features. These feature values act as input for the classifier,
which tells us whether the site is phishing or not.
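A feature extractor that maps a URL to binary values can be sketched as below. This is a minimal sketch using only the Python standard library. The five features, their names and the length threshold of 54 characters are illustrative assumptions, not the project's actual predefined feature set.

```python
import re
from urllib.parse import urlparse

def extract_features(url):
    """Map a URL to binary features: 1 = feature present, 0 = absent."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Host is a raw IPv4 address instead of a domain name.
        "has_ip_address": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        # '@' anywhere in the URL.
        "has_at_symbol": int("@" in url),
        # Assumed threshold: URLs of 54 or more characters count as long.
        "is_long_url": int(len(url) >= 54),
        # Hyphen in the host name.
        "has_hyphen_in_domain": int("-" in host),
        "uses_https": int(parsed.scheme == "https"),
    }

# Hypothetical example URLs.
print(extract_features("http://192.168.0.1/login@secure"))
print(extract_features("https://www.example.com/"))
```

The resulting dictionary values can be stored as one row of the structured dataset that is passed to the classifiers.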
Advantages and Disadvantages/Limitations of System
Advantages:
• Provides a clear idea of the effectiveness of each classifier for phishing email detection
• High level of accuracy by taking advantage of many classifiers
• High level of accuracy
• Fast classification process, low memory consumption, high accuracy, evolves with time, works online
Disadvantages:
• Time consuming
• Huge number of features
• Memory consuming; non-standard classifier
• Time consuming because this technique has many layers to produce the final result
• Huge number of features and many classification algorithms, which makes it time consuming
• Higher cost
• Needs a large mail server and has a high memory requirement
• Lower accuracy because it depends on unsupervised learning; needs to be fed data continuously
Next work
• Working on this project was very informative and worth the effort.
• Through this project, one can learn a lot about phishing websites and
how they are differentiated from legitimate ones.
• This project can be taken further by creating a browser extension or
developing a GUI.
• These should classify the input URL as legitimate or phishing using
the saved model.
Conclusion & Future work
Phishing attacks are a serious threat, and it is important to have a
mechanism to detect them. Because very important and personal information of the
user can be leaked through phishing websites, it becomes even more critical to
address this issue. This problem can be addressed by using a machine learning
algorithm with a classifier.
The proposed technique is more secure because it detects both new and previously
known phishing sites. The proposed model is also planned to be deployed online by
integrating it as a Web browser plug-in capable of warning users of potentially
malicious URLs in real time. URLs clicked or typed will be checked based on their
features to determine whether they are malicious. If a URL is malicious or
suspected to be malicious, a pop-up will inform the user of the potential threat,
and the URL will be temporarily blocked unless the user chooses to navigate to it
anyway.
The future work also includes evaluating the proposed model against more
recent and diverse datasets along with using additional classifiers such as
decision trees and random forest.
References
1. F. Vanhoenshoven, G. Nápoles, R. Falcon, K. Vanhoof, M. Köppen, Detecting
malicious URLs using machine learning techniques, in 2016 IEEE Symposium Series
on Computational Intelligence (SSCI) (IEEE, 2016), pp. 1–8
2. A. Singh, N. Goyal, A comparison of machine learning attributes for detecting
malicious websites, in 2019 11th International Conference on Communication
Systems & Networks (COMSNETS) (IEEE, 2019), pp. 352–358
3. A.S. Manjeri, R. Kaushik, M. Ajay, P.C. Nair, A machine learning approach for
detecting malicious websites using URL features, in 2019 3rd International
conference on Electronics, Communication and Aerospace Technology (ICECA) (IEEE,
2019), pp. 555–561
4. Internet Security Threat Report (ISTR) 2019, Symantec.
https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf