
Phishing Detection via Logistic Regression

The document discusses a research paper presented at the International Conference on Inventive Research in Computing Applications (ICIRCA) that proposes using a logistic regression machine learning technique to detect phishing websites. Phishing is a social engineering process where attackers steal user information and credentials through suspicious emails and websites. The researchers extracted features from datasets to train their logistic regression model to better identify phishing websites compared to traditional classification algorithms. Some key features examined included the website domain, URL structure, and encryption techniques. The goal is to protect users from revealing sensitive information on illegitimate websites.

Proceedings of the International Conference on Inventive Research in Computing Applications (ICIRCA 2022)

IEEE Xplore Part Number: CFP22N67-ART; ISBN: 978-1-6654-9707-7

Logistic Regression based Machine Learning Technique for Phishing Website Detection

Soumya T. R., Assistant Professor, Dept. of CSE, Prathyusha Engineering College, Thiruvallur, Tamilnadu. [email protected]
Ramesh P., Associate Professor, Dept. of EEE, CMR Institute of Technology, Bengaluru, Karnataka. [email protected]
N. Arockia Rosy, Assistant Professor, Dept. of IT, RMD Engineering College, Thiruvallur, Tamilnadu. [email protected]
N. Pughazendi, Professor, Dept. of CSE, Panimalar Engineering College, Poonamalle, Chennai 600123. [email protected]
S. Padmapriya, Professor, Dept. of CSE, Prathyusha Engineering College, Thiruvallur, Tamilnadu. [email protected]
Rashmita Khilar, Professor, Dept. of IT, Saveetha School of Engineering, SIMATS, Saveetha University, Chennai. [email protected]

2022 4th International Conference on Inventive Research in Computing Applications (ICIRCA) | 978-1-6654-9707-7/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICIRCA54612.2022.9985643

Abstract— Nowadays, many people are switching from offline to online services to save time. They buy products online and make their payments through online transactions across websites. These online buyers are asked to provide details such as their name, address, location, passwords, and other essential bank details on the website concerned. Unaware buyers get caught by fraudulent sites of this kind, which are called phishing websites. This research work proposes an efficient prediction method based on machine learning to analyze and predict these phishing websites. Novel classification algorithms and techniques are used to analyze the datasets and extract the features that indicate phishing. Essential traits such as the domain, the URL and the encryption technique of a website help to identify these phishing sites while detecting malicious data. This research work uses a logistic regression algorithm for detecting phishing websites, since it provides better performance than traditional classification algorithms. To protect users' sensitive information and to ensure effective, secure payment transactions, many e-commerce enterprises are using this application to stay on the safer side.

Keywords— Phishing website, logistic regression, machine learning, online, E-commerce, security

I. INTRODUCTION

In the field of computer security, phishing is a fraudulent attempt by an attacker to acquire essential information such as names, passwords, and banking credentials by deceiving a user through technical means. The word phishing was coined by hackers around 1996. Various attempts have been made to protect people's sensitive information against social attacks. Phishing attacks have mainly had a negative impact on collaboration between organizations, revenues, customer interest, marketing, and corporate profit.

Phishing is the practice of sending a fraudulent e-mail or message that appears to come from a genuine, reputable firm. It wrongly gains the user's trust and influences them to give their information to an unofficial site that pretends to look like a genuine site. Phishing is a social engineering technique that allows a hacker to gain the user's trust, gather details about the individual, and break the inadequate framework of web security technologies. Phishing is one of the most significant cyber-attacks, faced by large and small organizations alike.

Phishing is an instant, simple social engineering process of hacking. Attackers widely target large companies to steal confidential organizational data and financial details through suspicious e-mails.

Machine learning-based phishing website prediction is a hot topic in today's research field. The features gathered from the datasets decide the outcome of the machine learning model. Extracting and selecting unique, advanced features before pre-processing the data is one of the most prominent goals of this research field. The URL addresses the resource and contains the free URL within the website. The URL structure is vital when identifying a phishing URL such as "http://creation-paytm.us-com".

The structure is in the form of:
i. Protocol: HTTP
ii. Domain: 1544dhde6587.com
iii. Subdomain: creation-Paytm.us-com
iv. Host name: creation-paytym.us-com.148gdff8765
v. Free URL: /sign-in/
Using this structure, we can detect the original URL.

II. LITERATURE SURVEY

Many security analysts and researchers have proposed various special characteristics of phishing sites from different analytical perspectives. Some works related to phishing website detection techniques are described below. According to the Anti-Phishing Working Group (APWG), phishing is a mechanism that combines technical deceit with social engineering to obtain individuals' personal and vital credentials [1]. Statistical analysis by Kaspersky Lab found that 29.4% of user computers were affected by at least one malware-class web attack within a year, and 199,455,606 unique URLs were identified as malicious by web antivirus components. Phishing is not limited to traditional channels such as SMS, e-mails and sudden pop-ups. The emerging fields of mobile internet and social networks have increased the impact on users: attackers have started to carry out spear phishing and QR code phishing attacks and to develop imitation mobile applications. Detection of phishing websites relies largely on whitelists and blacklists, which are integrated into present-day search engines to save the user from a phishing attack. Nearly 83% of phishing sites

978-1-6654-9707-7/22/$31.00 ©2022 IEEE 683


Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on September 24,2023 at 10:27:10 UTC from IEEE Xplore. Restrictions apply.

are added to blacklists within 12 hours, as statistical analysis studies show [2]. Different clusters are taken into account to show the differences between legitimate sites and phishing sites [3]. 63% of phishing websites exist for just 2 hours. Blacklists and whitelists, with their high running speed and small false-positive rate, make the detection of phishing sites easy to implement [3]. Current conventional machine learning procedures for detecting phishing sites from the URL and the host extract special features, such as layout and CSS (Cascading Style Sheets), from the website under study on the basis of statistical information [3].

Based on the URL (Uniform Resource Locator), string and mathematical rules are applied to the URL path and hostname of the targeted page, and the exploited site is detected in this way. Verma et al. used characteristics such as KS distance, Euclidean distance, KL distance, character frequency and edit distance of the targeted URL, which rely on the contrasting parts of the threatening URL. Characteristics that contrast typical English text with the suspected Uniform Resource Locator help to expose exploited sites [3]. To capture the many-sided characteristics of a phished site, various features such as HTML (Hypertext Markup Language), text features of web pages like border, color and font size, and independent site characteristics are grouped with the URL characteristics [21]. Other work concentrates on developing a system that analyzes and classifies URLs in order to detect phishing attacks [4]. The system proposed by Le et al. focuses on the strings of the Uniform Resource Locator and concentrates on extracting linguistic features, using Adaptive Regularization Of Weights (AROW) [5]. Blacklists of malicious websites across the web are provided by the Google search engine and are updated quickly. Using the Google Safe Browsing APIs, users can check the security of the pages behind URL links [6]. Building on CANTINA, Xiang et al. proposed the CANTINA+ phishing website detection framework.

A phishing website observation and detection method that handles unconventional language and is readily accessible was proposed by Marchal et al. [7]. For predicting phishing threats across the internet, an intelligent model relying on self-structuring neural networks was given by Rami et al. [8]. Phishing webpage prediction using the maximum relevance and minimum redundancy criterion was presented in the Journal of Theoretical and Applied Information Technology, 188–205 [14]. Approaches that rely on a clustering process do not include specially labeled phishing or legitimate specimens. Integer, binary and host features were extracted from suspicious URLs by Mao et al. [9]. Conventional backpropagation neural networks are compared with large feed-forward artificial neural networks such as the CNN (Convolutional Neural Network). Conventional neural networks are not suitable for time series prediction problems, whereas the Recurrent Neural Network (RNN) is useful for them. The extraction of the syntax and the mathematical characteristics of the uniform resource locator was established by Bahnsen et al. The advantage of the Convolutional Neural Network is that it captures the local characteristics of a sequence, while the semantic features of the sequence are extracted with the help of LSTM. Patil et al. showed how to detect malicious webpages effectively with the help of static analysis of the URL string [10]. Phishing-Alarm and CSS techniques are used to detect phishing websites [11]. An interesting approach was taken to detect the visual correspondence between web pages [12]. Dredze et al. [13] introduced confidence-weighted linear classifiers, which incorporate attribute confidence data into the linear classifier. Zuhair et al. [14] explained new hybrid features and remodeled them as maximum-relevance, minimum-redundancy and robust features. Machine learning models can classify new URLs before they are loaded in search engines in order to identify phishing attacks [15]-[20].

III. PROPOSED WORK

This section describes the proposed model for detecting phishing threats across websites. In the proposed model, we check a website's phishing characteristics, its WHOIS database information, and blacklists, which together help to find a phished site and so protect a user's important information. Based on the filtered features, we can easily compare and differentiate between legitimate and spoofed web pages. Some of the important filtered characteristics are as follows:
1. Protocol (set of rules)
2. Domain check
3. Subdomain
4. Hostname
5. Free URL

The model can be divided into a step-by-step process, which is listed below:
1. Construction of phishing database
2. Cleaning and data pre-processing
3. Detection Process of phishing website

Construction of phishing database
The process of collecting datasets is essential in machine learning. Grouping and ordering information from various sources is known as data collection, and in machine learning it is always a real bottleneck. Websites may also contain blocked URLs. The construction of the phishing database should therefore involve phishing threat URLs and IP addresses, together with their WHOIS information and domain name. The URLs used to construct the threat data were collected from the phishing website data at http://archive.ics.uci.edu/ml/datasets/. Generally, separate datasets are needed for different purposes. The dataset fed into the chosen machine learning algorithm to train the model is known as the training dataset. The dataset used to evaluate the accuracy of the proposed model, and not used in training it, is known as the testing dataset.

Cleaning and data pre-processing
This technique transforms the raw data into a clean data set. Data pre-processing, the transformation applied to the raw data before it is fed to the chosen machine learning algorithm, is an important step. The collected suspected URLs contain missing, noisy, irrelevant and unordered data, so the collected data are passed through pre-processing. Data cleaning is the process of identifying and reducing problems in the collected raw data, which helps to remove incorrect and irrelevant entries. Missing values are either ignored or replaced by a calculated mean value, and noisy data are cleaned using the binning method. Data reduction makes data storage more efficient and reduces the cost of analysis.
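As an illustration of the cleaning step described above, the following minimal sketch replaces missing values with the mean of the observed values and then smooths noisy values by equal-width binning, where each value is replaced by the mean of its bin. The feature name `url_lengths`, the sample values and the choice of three bins are assumptions made for this example; the paper does not specify them.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, n_bins):
    """Equal-width binning: replace each value by the mean of its bin."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are equal.
    width = (hi - lo) / n_bins or 1.0

    def bin_index(v):
        # Clamp so that the maximum value falls into the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    bins = {}
    for v in values:
        bins.setdefault(bin_index(v), []).append(v)
    bin_means = {i: mean(members) for i, members in bins.items()}
    return [bin_means[bin_index(v)] for v in values]

# Hypothetical URL-length feature with two missing entries.
url_lengths = [20, None, 24, 58, 62, None, 95, 101]
filled = impute_mean(url_lengths)
smoothed = smooth_by_bin_means(filled, n_bins=3)
```

Smoothing by bin means is only one variant of the binning method; smoothing by bin medians or bin boundaries would follow the same pattern.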


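The URL characteristics that the proposed model filters on (protocol, domain, subdomain, hostname, free URL) can be separated with Python's standard `urllib.parse` module. The sketch below is only illustrative: the example URL is made up, the dictionary keys are names chosen here, and treating the last two host labels as the registered domain is a simplifying assumption that fails for suffixes such as `.co.uk`.

```python
from urllib.parse import urlsplit

def url_components(url):
    """Split a URL into the parts used as features in the text."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Assumption: the registered domain is the last two labels of the host.
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return {
        "protocol": parts.scheme,
        "hostname": host,
        "domain": domain,
        "subdomain": subdomain,
        "free_url": parts.path,
    }

# Hypothetical phishing-style URL: the familiar brand name sits in the subdomain,
# while the registered domain is an unrelated string.
example = url_components("http://secure-login.example-bank.com.a1b2c3.net/sign-in/")
```

Comparing the `domain` field with the brand name that appears in `subdomain` is one simple way to expose a mismatch between the site a URL imitates and the site it actually resolves to.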
Detection Process of phishing website
The module used for detecting phishing sites analyzes the URL against the blacklist. The URL is studied on the basis of its structure, which includes characteristics such as protocol, subdomain, domain, and hostname. A URL contains the name of the set of rules (protocol) required to access a resource, and a resource name. The first part of a URL specifies which protocol is used as the primary access medium. The second part specifies the IP address or the domain name, and possibly a subdomain, to identify where the resource is located. Following the domain, a URL can also give the path to a particular page or file within the specified domain. A network port is used to create the connection. As Figure 1 indicates, search parameters are mainly used in URLs that return search results.

Figure 1. URL structure

Logistic regression

Figure 2. Linear model vs. logistic model

Logistic regression involves a binary dependent variable (0 or 1) that answers true or false; the outcome is always expressed in this binary form. For example, it can be used when we need to estimate the chance of a positive or a negative event. Figure 2 contrasts the linear model with the logistic model, in which the value Y is passed through a sigmoid function so that Y ranges from 0 to 1. Let x1 and x2 be the two predictors of the model. Each may be an indicator for a binary variable (taking the value 0 or 1) or a continuous variable.

Architecture:

Figure 3. Workflow Diagram

The proposed model, shown in Figure 3, extracts characteristics from suspected pages. To analyze and detect phishing sites, the analytical results are given as input to the chosen algorithm. Two preliminary screening modules are run before heuristics are applied to the websites. The first module performs pre-approved site identification: it checks pages against a private white-list maintained by the user. The second module looks for a login form and classifies a webpage as legitimate when no login form is present. These modules minimize redundant computation in the system and reduce the false-positive rate relative to false negatives. Using all modules, webpages can be classified with 99.8% precision and a false-positive rate of up to 0.4%. This work is well suited to safeguarding unaware users from online threats. The training data is the true and relevant dataset used to perform the machine learning tasks. Models under development learn through various APIs and algorithms and are trained so that the machine can work on its own. The given architecture classifies a suspected website as either a phishing website or a legitimate website.

IV. RESULTS

Figure 4. Comparative analysis of the nearest neighbor algorithm, logistic regression and naive Bayes
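To make the logistic model described earlier concrete, the sketch below fits P(y = 1) = sigmoid(b0 + b1*x1 + b2*x2) by plain batch gradient descent on a toy data set. The two binary predictors (whether the host is a raw IP address, whether the page uses HTTPS), the eight labelled rows, the learning rate and the number of epochs are all invented for illustration; they are not taken from the paper's data set or results.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """P(y = 1 | x) for weights [b0, b1, b2] and predictors x = (x1, x2)."""
    b0, b1, b2 = weights
    return sigmoid(b0 + b1 * x[0] + b2 * x[1])

def fit(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss."""
    weights = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y  # derivative of the log-loss w.r.t. z
            grad[0] += err
            grad[1] += err * x[0]
            grad[2] += err * x[1]
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

# Toy data (hypothetical): x1 = host is a raw IP address, x2 = page uses HTTPS.
# Label 1 = phishing, 0 = legitimate.
rows = [(1, 0), (1, 0), (1, 1), (0, 0), (0, 1), (0, 1), (0, 1), (1, 0)]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

weights = fit(rows, labels)
predictions = [1 if predict_proba(weights, x) >= 0.5 else 0 for x in rows]
```

A page is labelled as phishing when the predicted probability reaches 0.5; moving this threshold trades false positives against false negatives.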

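The precision and false-positive rate quoted for the workflow can be read against the usual confusion-matrix definitions: precision = TP / (TP + FP) and false-positive rate = FP / (FP + TN), with phishing as the positive class. The short sketch below computes both from true and predicted labels. The ten labels are made up for the example and do not reproduce the paper's reported figures.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def precision(tp, fp):
    """Share of pages flagged as phishing that really are phishing."""
    return tp / (tp + fp)

def false_positive_rate(fp, tn):
    """Share of legitimate pages that are wrongly flagged as phishing."""
    return fp / (fp + tn)

# Hypothetical labels for ten pages (1 = phishing, 0 = legitimate).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
prec = precision(tp, fp)
fpr = false_positive_rate(fp, tn)
```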

As Figure 4 shows, when the nearest neighbor, logistic regression and naive Bayes algorithms are compared, logistic regression performs best because its efficiency is comparatively higher than that of the other two algorithms. We have therefore chosen logistic regression for the detection of phishing attacks.

V. CONCLUSION

Predicting malicious phishing URLs plays a significant role in helping cybersecurity and business platforms to overcome the threats that organizations face. A large number of internet threats are launched when users click on malicious webpages. A user is influenced and drawn into an attacker's social engineering process, which makes them provide essential information on a phishing page. This meets the attacker's goal and results in a loss of the user's privacy. In our model, detecting malicious URLs with a machine learning algorithm such as logistic regression helps to achieve better accuracy than traditional algorithms such as random forest and naive Bayes.

VI. FUTURE ENHANCEMENT

To deploy detection of suspicious web content and provide the best prediction accuracy to all connected devices, future work will continue training and testing on datasets. Adding further characteristics, such as host-based features, should make the model more accurate in the future.

VII. REFERENCES

[1] Justin Ma, Saul L. K., Savage S., & Voelker G. M. (2011). Learning to detect malicious URLs. ACM Transactions on Intelligent Systems and Technology, 3(2), 1–24.

[2] Steve Sheng, Brad Wardman, Gary Warner, Lorrie Faith Cranor, Jason Hong, Chengshan Zhang (2009). An Empirical Analysis of Phishing Blacklists. CEAS - Sixth Conference on E-mail and Anti-Spam, July 16-17.

[3] P. Yang, G. Zhao and P. Zeng, "Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning," in IEEE Access, vol. 7, pp. 15196-15209, 2019, doi: 10.1109/ACCESS.2019.2892066.

[4] Verma R. & Das A. (2017). What's in a URL: Fast feature extraction and malicious URL detection. In 3rd International Workshop on Security and Privacy Analytics, pp. 55–63.

[5] A. Le, A. Markopoulou and M. Faloutsos, "PhishDef: URL names say it all," 2011 Proceedings IEEE INFOCOM, Shanghai, China, 2011, pp. 191-195, doi: 10.1109/INFCOM.2011.5934995.

[6] L. Ma, B. Ofoghi, P. Watters and S. Brown, "Detecting Phishing E-mails Using Hybrid Features," 2009 Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing, Brisbane, QLD, Australia, 2009, pp. 493-497, doi: 10.1109/UIC-ATC.2009.103.

[7] A. Niakanlahiji, B. Chu and E. Al-Shaer, "PhishMon: A Machine Learning Framework for Detecting Phishing Webpages," 2018 IEEE International Conference on Intelligence and Security Informatics (ISI), Miami, FL, USA, 2018, pp. 220-225, doi: 10.1109/ISI.2018.8587410.

[8] Mohammad, Rami. (2013). Predicting Phishing Websites based on Self-Structuring Neural Network. Neural Computing and Applications.

[9] Cao, J., Dong, D., Mao, B., & Wang, T. (2013). Phishing detection method based on URL features. Journal of Southeast University (English Edition), 29, 134-138. 10.3969/j.issn.1003-7985.2013.02.005.

[10] Patil D. R. & Patil J. B. (2016). Malicious web pages detection using static analysis of URLs. International Journal of Information Security and Cybercrime, 5(2), 57–70.

[11] J. Mao, W. Tian, P. Li, T. Wei and Z. Liang, "Phishing-Alarm: Robust and Efficient Phishing Detection via Page Component Similarity," in IEEE Access, vol. 5, pp. 17020-17030, 2017, doi: 10.1109/ACCESS.2017.2743528.

[12] T.-C. Chen, S. Dick, and J. Miller (2010), "Detecting visually similar Web pages: Application to phishing detection," ACM Trans. Internet Technol., vol. 10, no. 2, pp. 1–38.

[13] Mark Dredze, Koby Crammer, & Fernando Pereira. (2008). Confidence-weighted linear classification. In 25th International Conference on Machine Learning (ICML), pp. 264–271.

[14] Zuhair, H., Selamat, A., & Salleh, M. (2015). Selection of robust feature subsets for phish webpage prediction using maximum relevance and minimum redundancy criterion. Journal of Theoretical and Applied Information Technology, 81(2), 188–205.

[15] Tupsamudre, Harshal & Singh, Ajeet & Lodha, Sachin. (2019). Everything Is in the Name – A URL Based Approach for Phishing Detection. 10.1007/978-3-030-20951-3_21.

[16] G. Maria Jones and G. Godfrey Winster (2020), "Analysis of Crime Report by Data Analytics Using Python," in book: Challenges and Applications of Data Analytics in Social Perspectives. 10.4018/978-1-7998-2566-1.ch003.

[17] G. Parthasarathy and D. C. Tomar, "A novel approach for classification and clustering of biomedical citations", Biomedical Research-An International Journal of Medical Sciences, Scientific Publishers, pp. S22-S30, Vol. 27, 2016.

[18] G. Parthasarathy and D. C. Tomar, "Trends in citation analysis", Intelligent Computing, Communication and Devices, Springer, New Delhi, pp. 813-821, 2015.

[19] N, SaiSupriya, Rashmi S, Parthasarathy G and Priyanka, "Face Mask Detection Using CNN." Smart Intelligent Computing and Communication Technology 38, pp. 118, 2021.

[20] Parthasarathy G, Soumya T. R., Ramanathan L and Ramesh P, "Improvised Approach for Real Time Patient Health Monitoring System Using IoT." In Intelligent Systems and Computer Technology, pp. 78-83. IOS Press, 2020.

[21] Dhaya, R., and R. Kanthavel. "Comprehensively Meld Code Clone Identifier for Replicated Source Code Identification in Diverse Web Browsers." Journal of trends in Computer Science and Smart technology (TCSST) 2, no. 02 (2020): 109-119.

[22] Joby, P. P. "Expedient information retrieval system for web pages using the natural language modeling." Journal of Artificial Intelligence 2, no. 02 (2020): 100-110.
