Phishing Detection via Logistic Regression
Abstract— Nowadays, many people are switching from offline to online
services to save their precious time. They buy products online and make
their payments through online transactions across websites. These online
buyers are asked to provide details such as their name, address,
location, passwords, and other essential bank details on a particular
website. Unaware online buyers get caught by such sites, which leads to
phishing; these sites are called phishing websites. This research work
proposes an efficient prediction method based on machine learning to
analyze and predict these phishing websites. A novel classification
algorithm and techniques are used to analyze and extract the data that
might maliciously cause phishing. Essential traits such as the domain,
URL, and encryption technique of a website help identify these types of
phishing sites when detecting malicious data. This research work uses a
logistic regression algorithm to detect phishing websites. The logistic
regression algorithm is used to provide better performance than the
traditional classification algorithm. To protect sensitive user
information and to enable effective, secure payment transactions, many
e-commerce enterprises are using this application to stay on the safer
side.

Keywords— Phishing website, logistic regression, machine learning,
online, E-commerce, security

I. INTRODUCTION
Phishing is an instant, simple social engineering form of hacking.
Attackers widely target large companies to steal confidential
organizational data and financial details through suspicious e-mails.

Machine learning-based phishing website prediction is a hotspot in
today's research field. The features gathered from the datasets decide
the outcome of the machine learning model. Extracting and selecting
unique, advanced features before pre-processing the data is one of the
research field's most prominent goals. The URL addresses the resources
and the free URL on the website. The URL structure is vital when
identifying a phishing URL, such as one that begins with
"http://creation-paytm.us-com."

The structure is in the form of:
i. Protocol: HTTP
ii. Domain: 1544dhde6587.com
iii. Subdomain: creation-Paytm.us-com
iv. Host name: creation-paytym.us-com.148gdff8765
v. Free URL: /sign-in/
Using this structure, we can detect the original URL.

II. LITERATURE SURVEY
are added under blacklists within 12 hours, as shown in statistical
analysis studies [2]. Different clusters are taken into account to show
the differences between legitimate and phishing sites [3]. For 63% of
phishing websites, the site's existence is limited to just 2 hours.
Blacklists and whitelists, with high running speed and a small false
positive rate, make it easy to implement phishing site detection [3].
Current conventional machine learning procedures for detecting phishing
sites from the URL and the host start by extracting special features
such as layout and CSS (Cascading Style Sheets) from the particular
website under the study of statistical information [3].

Based on the URL (Uniform Resource Locator), string-based and
mathematical rules exploit the URL path and hostname of the targeted
page, and the exploited site is detected in this way. Verma et al. used
characteristics such as KS distance, Euclidean distance, KL distance,
character frequency, and edit distance for the specified target URL,
which rely on the contrasting parts of the threatening URL.
Group-specific characteristics that contrast typical English text with
a suspected URL lead to the detection of exploited sites [3]. To build
a many-sided profile of a phished site, features such as HTML
(Hypertext Markup Language), text features of web pages like border,
color, and font size, and unbiased site characteristics are grouped
with URL characteristics [21]. Other work concentrates on developing a
system that analyzes and classifies URLs to detect phishing attacks
[4]. The system proposed by Le et al. focuses on URL strings and
concentrates on drawing out linguistic features with Adaptive
Regularization of Weights (AROW) [5]. Google provides blacklists of
malicious websites across the web, which are updated quickly. Using the
Google Safe Browsing APIs, users can check the security of linked pages
[6]. Building on CANTINA, Xiang et al. proposed the CANTINA+ phishing
website detection framework.

An unconventional-language and accessible phishing website observation
and detection method was proposed by Marchal et al. [7]. To predict
phishing threats across the internet, Rami et al. built an intelligent
model that relies on structuring neural networks [8]. Phishing webpage
prediction using the maximum-relevance, minimum-redundancy criterion
was presented in the Journal of Theoretical and Applied IT, 188–205
[8]. Approaches that rely on a clustering process cannot include
specially labeled phishing specimens or official specimens. Mao et al.
extracted integer, binary, and host features from suspicious URLs [9].
Conventional backpropagation neural networks have been compared with
large feed-forward artificial neural networks such as the CNN
(Convolutional Neural Network). Conventional neural networks are not
suitable for time series prediction problems, but the Recurrent Neural
Network (RNN) is useful for them. The extraction of the syntactic and
mathematical characteristics of URLs was carried out by Bahnsen et al.
The advantage of the Convolutional Neural Network is that it captures
the local characteristics of a sequence, while the semantic features of
the sequence are extracted with an LSTM. Patil et al. showed how to
effectively detect malicious web pages with the help of static analysis
of the URL string [10]. Phishing alarm and CSS techniques are used to
detect phishing websites [11]. An interesting approach detects the
visual correspondence between web pages [12]. The authors of [13]
describe confidence-weighted linear classifiers, which bring attribute
confidence data into the linear classifier. Zuhair et al. [14]
explained new hybrid features and remodeled them as maximum, minimum
relevant, and robust features. Machine learning models can classify new
URLs before they are loaded in search engines to identify phishing
attacks [15]-[20].

III. PROPOSED WORK
This section depicts the proposed model of phishing threat detection
across websites. In this proposed model, we can check a website's
phishing characteristics, WHOIS database information, and blacklists,
which help identify a phished site and thereby protect a user's
important information. Based on the filtered features, we can easily
compare and differentiate between legitimate and spoofed web pages.
Some of the important filtered characteristics are as follows:
1. Protocol (set of rules)
2. Domain check
3. Subdomain
4. Hostname
5. Free URL
This model can be divided into a step-by-step process, which is listed
below:

1. Construction of phishing database
2. Cleaning and data pre-processing
3. Detection process of phishing website

Construction of phishing database
The process of collecting datasets is essential in machine learning.
The method of grouping and ordering information from various sources is
known as data collection. In machine learning, data collection is
always a real bottleneck. Websites may also contain blocked URLs. Thus,
the construction of the phishing database should include phishing
threat URLs and IP addresses together with their WHOIS information and
domain names. URLs for constructing the threat data are collected from
the phishing website data at http://archive.ics.uci.edu/ml/datasets/.
Generally, different datasets are needed for different purposes. The
dataset fed into the specified machine learning algorithm to train the
model is known as the training dataset. A dataset that is used to
evaluate the accuracy of the proposed model, but is not used in
training the model, is known as the testing dataset.

Cleaning and data pre-processing
This technique is used to transform the raw data into a clean data set.
Data pre-processing is an important step: it is the transformation
applied to the raw data before feeding it to the specified machine
learning algorithm. Missing data, noisy data, irrelevant data, and
unordered data are present in the collected suspected URLs, so the
collected data are passed to pre-processing. Data cleaning is the
process of identifying and reducing problems in the collected raw data,
which helps remove incorrect and irrelevant data. Missing data are
either ignored or replaced with a calculated mean value, and noisy data
are cleaned using the binning method. Data reduction increases the
efficiency of data storage and reduces analysis cost.
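The cleaning step described in this section (replacing missing values with a mean and smoothing noisy values by binning) can be illustrated with a minimal sketch. It assumes a single numeric feature column, here a hypothetical list of URL lengths, with `None` marking a missing value, and it uses equal-frequency bins smoothed by their bin mean. The paper does not state which binning variant or bin size is used, so both are assumptions, and the function names are illustrative only.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: sort the values, cut them into bins of
    bin_size items, and replace every value with the mean of its bin.
    The result is returned in the original row order."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    # Indices of the rows, ordered by the value they hold.
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        bin_indices = order[start:start + bin_size]
        bin_mean = mean(values[i] for i in bin_indices)
        for i in bin_indices:
            smoothed[i] = bin_mean
    return smoothed


# Hypothetical feature column: URL lengths with two missing entries
# and one noisy outlier (250). These numbers are made up.
url_lengths = [50, None, 60, 250, 58, None, 42, 62]

filled = impute_with_mean(url_lengths)
smoothed = smooth_by_bin_means(filled, bin_size=4)

print(filled)
print(smoothed)
```

Smoothing by bin means pulls an outlier toward its neighbours instead of discarding the row. That keeps the dataset size unchanged for the later training step, at the cost of blurring genuine extremes.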
[9] Cao, J., Dong, D., Mao, B., & Wang, T. (2013). Phishing detection
method based on URL features. Journal of Southeast University (English
Edition), 29, 134–138. doi:10.3969/j.issn.1003-7985.2013.02.005.
[10] Patil, D. R., & Patil, J. B. (2016). Malicious web pages detection
using static analysis of URLs. International Journal of Information
Security and Cybercrime, 5(2), 57–70.