Internship

The document is an internship report by Rohan.K on 'Phishing Web Sites Classification Based on Machine Learning' submitted to the University of Madras for a Master's degree in Computer Science. It includes a bonafide certificate, declaration, acknowledgments, and an abstract discussing the significance of detecting phishing websites using machine learning techniques. The report outlines the existing and proposed systems, objectives, software and hardware requirements, and the technologies used in the project.
BONAFIDE CERTIFICATE

This is to certify that the Internship report entitled

Phishing Web Sites Classification Based on Machine Learning

being submitted to the University of Madras, Chennai-600 005
by
Rohan.K
Register Number : 832400293

for the partial fulfillment for the award of the Degree of

MASTER OF COMPUTER SCIENCE

is the bonafide record of work carried out by him, under my guidance and supervision.

Signature of the Guide Head of the Department

CERTIFICATE

This is to certify that the Internship Report entitled “Phishing Web Sites
Classification Based on Machine Learning”, in partial fulfillment of the
requirements for the award of the Degree of Master of Computer Science, is a
record of original Internship undergone by Rohan.K (832400293) during the
year 2025-2026 of his study in the Master of Computer Science, under my
supervision, and the report has not formed the basis for the award of any Degree or
other similar title to any candidate of any University.

Signature of Guide HOD

Place:

Date:

DECLARATION

I, Rohan.K (832400293), hereby declare that the Internship Report entitled
“Phishing Web Sites Classification Based on Machine Learning”, submitted to
the SRM Arts and Science College in partial fulfillment of the requirements for
the award of the Master of Computer Science, is a record of original training
undergone by me during the period 2025-2026 under the supervision and
guidance of the Department of Computer Science, SRM Arts and Science College,
and it has not formed the basis for the award of any Degree or other similar title
to any candidate of any University.

Signature of the Student

Place:

Date:

ACKNOWLEDGEMENT

I would like to express my thanks to Dr. T. R. PAARIVENDHAR, [Link]., M.I.E,
Chairman, and Dr. R. VASUDEVARAJ, [Link]., [Link]., [Link]., MBA, Ph.D.,
Principal of our college, for their scholarly guidance and motivation throughout
the Internship as well as the curriculum.

I wish to express my sincere gratitude and respect to
Dr. K. R. ANANTHA PADMANABAN, MCA., [Link]., Ph.D., Professor and
Head, Computer Science and Applications, for his invaluable guidance and
support, and for providing the necessary infrastructure and environment to
complete this Internship.

My sincere thanks to Dr. R. KARTHIKEYAN, [Link]., [Link]., [Link]., Ph.D.,
Head, Department of Computer Science, for his useful guidance and helpfulness
in completing this Internship.

My sincere thanks go to my guide Mr. [Link], M.C.A, [Link], NET,
Assistant Professor, Department of Computer Science, for suggesting the
problem and for offering inspiring guidance and fruitful discussion throughout
the course of the work.

My special thanks to all the staff members of the Department of Computer
Science for their support and encouragement towards the successful completion
of the Internship.

I acknowledge my profound thanks to my parents for their encouragement
and for their social and economic support in completing this Internship.

Rohan.K

TABLE OF CONTENTS

CHAPTER NO.   PARTICULARS

1             Introduction
2             System Analysis
              a. Existing System
              b. Proposed System
3             System Requirements
4             Technology
5             Coding
6             Output Screen (or) Result
7             Conclusion
8             Bibliography

ABSTRACT:

Phishing websites have emerged as a major cybersecurity threat in recent times. They host
spam, malware, ransomware, drive-by exploits and similar content. A phishing website often
looks like a very popular website and lures an unsuspecting user into the trap. Victims of such
scams suffer monetary loss, loss of private information and loss of reputation. Hence, it is
imperative to find a solution that can mitigate such security threats in a timely manner.
Traditionally, phishing websites are detected using blacklists. Many popular websites host lists
of blacklisted websites, e.g. PhishTank. The blacklisting technique falls short in two respects:
blacklists might not be exhaustive, and they do not detect newly generated phishing websites.
In recent times, machine learning techniques have been used to classify and detect phishing
websites. In this paper, we compare different machine learning techniques for the phishing URL
classification task using multiple classifiers: Naive Bayes, Decision Tree and Random Forest.

Keywords: Extreme Learning Machine, Features Classification, Information Security,
Phishing, Naive Bayes, Decision Tree, Random Forest.

INTRODUCTION
Phishing is a type of cyber-security attack in which an attacker gains control of sensitive
user accounts by learning sensitive information, such as login credentials or credit card
information, by sending a malicious URL in an email or by masquerading as a reputable person in
email or through other communication channels. The victim receives a message that appears to
come from known contacts, persons, entities or organizations and that looks genuine. The
message might contain malicious links or software that targets the user's computer, or a link
might direct the user to a forged website that resembles a popular website in look and feel.
The victim might then be tricked into divulging personal information, e.g. credit card
information, login and password details, and other sensitive information such as account ID
details. Phishing is one of the most popular types of cyber-security attack and is very common
among attackers. Phishing attacks are generally easy to carry out because most victims are not
well aware of the intricacies of web applications, computer networks and their technologies,
and are easy prey for tricking or spoofing. It is much easier to phish unsuspecting users with
forged websites, luring them to click on links with promises of prizes and offers, than to
target a computer's defense systems. The malicious website is designed to have a similar look
and feel to the genuine one and appears authentic because it contains the organization's logos
and other copyrighted content. Many users unwittingly click on phishing website URLs, which
results in huge financial loss and loss of reputation for the person and the organization
concerned. A phishing email might also contain a PDF or Word document as a malicious
attachment.

EXISTING SYSTEM :

In the existing system, the authors used data mining techniques in order to identify the
most useful machine learning algorithm. A phishing website usually demonstrates abnormal
discrepancies between its web objects or HTTP transactions and its claimed identity. That
anti-phishing scheme is mainly built upon the detection of those identity-relevant anomalies.

PROPOSED SYSTEM :

In the proposed system, different machine learning models are trained on a phishing websites
dataset. We examined the phishing websites dataset through its features, such as the various
hints in website contents and web browser-based information. The purpose of this study is to
apply machine learning techniques to a phishing websites dataset.

Motivations :
The aim of these attacks is to steal the information used by individuals and organizations to
conduct transactions. Phishing is defined as imitating reliable websites in order to obtain the
proprietary information entered into websites every day for various purposes, such as
usernames, passwords and citizenship numbers. Phishing websites contain various hints among
their contents and web browser-based information.

Objectives of the work :

• The factors that lead to phishing websites change over time, since they depend on
multiple political and social factors.

• Hence, classifying phishing websites is necessary to protect people who submit their
data to websites.

Key features with scope of the features or overall scope of the work.
Machine learning approaches can aid in analyzing the likelihood that a website is a phishing
website, given the required data. The results of this work can help security agencies and
policy makers to curb phishing by taking relevant and effective measures.

Software Requirements Specifications

H/W System Configuration:

Processor   : Dual Core
Speed       : 1.1 GHz
RAM         : 1 GB (min)
Hard Disk   : 1 GB

S/W System Configuration:

Operating System : Windows 10
Technology       : Machine Learning
Front End        : GUI (tkinter)
IDLE             : Python 3.7 or higher

Functional Requirements :
The specific requirements are the user interfaces. The external users are the clients. All
clients can use this software for indexing and searching.

• Hardware Interfaces: The external hardware interface used for indexing and searching is
the clients' personal computers. The PCs may be laptops with wireless LAN, as the
internet connection provided will be wireless.

• Software Interfaces: The operating system can be any version of Windows.

• Performance Requirements: The PCs used must be at least Pentium 4 machines so that
they can give optimum performance of the product.

Non-Functional Requirements :
Non-functional requirements are the qualities offered by the system. They include time
constraints and constraints on the development process and standards. The non-functional
requirements are as follows:

• Speed: The system should process the given input into output within an appropriate time.

• Ease of use: The software should be user friendly, so that users can use it easily and
it does not require much training time.

• Reliability: The rate of failures should be low; only then is the system reliable.

• Portability: It should be easy to implement on any system.

Hardware requirements :

The most common set of requirements defined by any operating system or software
application is the physical computer resources, also known as hardware. A hardware
requirements list is often accompanied by a hardware compatibility list (HCL), especially
in the case of operating systems. An HCL lists tested, compatible, and sometimes
incompatible hardware devices for a particular operating system or application. The
following paragraphs discuss the various aspects of hardware requirements.

All computer operating systems are designed for a particular computer architecture. Most
software applications are limited to particular operating systems running on particular
architectures. Although architecture-independent operating systems and applications exist,
most need to be recompiled to run on a new architecture.

The power of the central processing unit (CPU) is a fundamental system requirement for any
software. Most software running on the x86 architecture defines processing power as the
model and the clock speed of the CPU. Many other features of a CPU that influence its speed
and power, such as bus speed, cache, and MIPS, are often ignored. This definition of power
is often erroneous, as AMD and Intel Pentium CPUs at similar clock speeds often have
different throughput.

• 10 GB HDD (min)

• 128 MB RAM (min)

• Pentium P4 Processor 2.8 GHz (min)

Software requirements

Software requirements deal with defining the software resources and prerequisites that
need to be installed on a computer to provide optimal functioning of an application.
These requirements or prerequisites are generally not included in the software
installation package and need to be installed separately before the software is
installed.

• Python 3.7 or higher

• PyCharm

• OpenCV

Overview of technologies

The technologies used are described below:

Python

• Python is a general-purpose, high-level programming language (high-level programming
languages are human-understandable languages).

• Python was developed by Guido van Rossum.

• Development started in 1989 at the national research institute for mathematics and
computer science (CWI) in the Netherlands.

• Python was officially made available to the public on 20 February 1991.

Python was conceived in the late 1980s, and its implementation started in December 1989 by
Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a
successor to the ABC language (itself inspired by SETL), capable of exception handling
and interfacing with the Amoeba operating system. Van Rossum remains Python's principal
author. His continuing central role in Python's development is reflected in the title given
to him by the Python community: Benevolent Dictator For Life (BDFL).

Technology :
1. Machine Learning (ML)

• Core technology of the project.
• Used to classify URLs as phishing or legitimate.
• Algorithms used:
o Naive Bayes
o Decision Tree
o Random Forest

These algorithms learn patterns from URL features (e.g., length, special characters, presence of
an IP address, the “@” symbol, etc.).
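
To make the idea of learning patterns from URL features concrete, here is a minimal sketch of a one-level decision tree (a decision stump) in plain Python. It is an illustration only and not the project's training code, which relies on scikit-learn. The toy feature rows, the labels (1 = phishing, 0 = legitimate) and the helper names `gini` and `best_split` are assumptions made for this example.

```python
# Illustrative decision stump: find the single feature/threshold split
# that best separates phishing (1) from legitimate (0) samples.
# The data below is made up for demonstration purposes.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical feature rows: [url_length, count_of_'@', has_ip_address]
rows = [
    [20, 0, 0],
    [25, 0, 0],
    [90, 1, 0],
    [75, 0, 1],
    [30, 0, 0],
    [110, 1, 1],
]
labels = [0, 0, 1, 1, 0, 1]

feature, threshold, impurity = best_split(rows, labels)
print(f"split on feature {feature} at <= {threshold}, impurity {impurity:.2f}")
```

On this toy data the stump picks URL length as the most informative feature. A full Decision Tree repeats the same search recursively on each side of the split, and a Random Forest averages many such trees built on random subsets of the data.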

2. Python Programming Language

• Entire system is implemented using Python because of its strong ML libraries and ease of
development.
• Main libraries used:
o pandas – data handling
o numpy – numerical computation
o scikit-learn – machine learning models
o tkinter – GUI development

3. Feature Extraction Technology

• Extracts meaningful lexical and structural features from URLs.
• Examples of features:
o URL length
o IP address in domain
o presence of “@”, “-”, “https”
o subdomain count
o special character frequency

These features are converted into numeric vectors used by ML models.

4. Graphical User Interface (GUI)

• Tkinter (Python’s standard GUI library) used to create:
o Simple input box for URL
o Output messages showing whether the site is phishing or legitimate

This makes the tool user-friendly and accessible to non-technical users.

5. Dataset & Data Processing Technology

• The model uses phishing datasets (e.g., UCI, PhishTank, or custom collected URLs).
• Data processing involves:
o Preprocessing URLs
o Extracting features
o Splitting into train/test sets
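
The splitting step can be sketched without any third-party library. The following is a minimal illustration of a shuffled 80/20 hold-out split; the function name `split_dataset`, the seed and the toy rows are assumptions for this example, while the project code itself uses scikit-learn's `train_test_split`.

```python
import random

def split_dataset(rows, labels, test_size=0.2, seed=42):
    """Shuffle the samples and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = max(1, round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data: ten feature vectors with alternating labels (made up).
rows = [[i, i % 3] for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = split_dataset(rows, labels)
print(len(x_train), "training samples,", len(x_test), "test samples")
```

Holding out samples that the model never sees during training is what makes the reported accuracy an estimate of performance on new URLs rather than a measure of memorization.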

Project Structure :

phishing_detection/
├── data/
│   └── [Link]
├── models/
│   └── best_model.pkl
├── [Link]
├── feature_extraction.py
├── gui_app.py
└── [Link]

1. feature_extraction.py – Extract 10+ features from a URL

# feature_extraction.py
# Extracts lexical features from a URL for phishing detection

import re
from urllib.parse import urlparse
import numpy as np

def has_ip_address(domain):
    # 1 if the host part is a dotted-quad IPv4 address (optionally with a port)
    return 1 if re.fullmatch(r'(\d{1,3}\.){3}\d{1,3}(:\d+)?', domain) else 0

def count_char(s, ch):
    # Number of occurrences of character ch in string s
    return s.count(ch)

def extract_features(url):
    """
    Returns a 1 x 11 NumPy array of numeric features from the URL
    """
    if not url.startswith("http"):
        url = "http://" + url  # default scheme when none is given
    parsed = urlparse(url)
    domain = parsed.netloc
    path = parsed.path

    features = [
        len(url),                    # 1: URL length
        has_ip_address(domain),      # 2: IP address in domain
        count_char(url, '@'),        # 3: '@' count
        1 if '-' in domain else 0,   # 4: '-' in domain
        domain.count('.'),           # 5: dots in domain (subdomain count)
        1 if url.startswith("https") else 0,  # 6: HTTPS or not
        len(path),                   # 7: path length
        count_char(url, '?'),        # 8: '?' count (query presence)
        count_char(url, '='),        # 9: '=' count
        count_char(url, '%'),        # 10: '%' count
        len(set(re.findall(r'[^A-Za-z0-9]', url)))  # 11: unique special chars
    ]
    return np.array(features).reshape(1, -1)
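
A pattern-based check such as the one in `has_ip_address` also accepts out-of-range octets like `999.1.1.1`. One possible stricter variant, sketched here as an assumption rather than as part of the original code, delegates validation to Python's standard `ipaddress` module. The helper name `host_is_ip` and the sample URLs are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the URL's host is a literal IPv4/IPv6 address, else 0."""
    if "://" not in url:
        url = "http://" + url  # assume plain HTTP when no scheme is given
    host = urlparse(url).hostname  # drops any port and user info
    if host is None:
        return 0
    try:
        ipaddress.ip_address(host)  # raises ValueError for non-IP hosts
        return 1
    except ValueError:
        return 0

print(host_is_ip("http://192.168.0.1:8080/login"))  # 1
print(host_is_ip("http://999.1.1.1/login"))         # 0 (octet out of range)
print(host_is_ip("example.com/index.html"))         # 0
```

Swapping such a check into the feature extractor would change the value of feature 2 for some URLs, so a model would need to be retrained on features produced by the same function.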

2. [Link] – Train ML Models & Save Best

# [Link]
# Trains Naive Bayes, Decision Tree & Random Forest

import os
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

DATA_PATH = "data/[Link]"  # CSV of numeric features, label in the last column
MODEL_DIR = "models"
os.makedirs(MODEL_DIR, exist_ok=True)

# Load dataset
df = pd.read_csv(DATA_PATH)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Models
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42)
}

best_model, best_acc = None, 0

# Train each model and keep the one with the highest test accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    print(f"\n✅ {name} Accuracy: {acc*100:.2f}%")
    print(classification_report(y_test, pred))
    if acc > best_acc:
        best_acc = acc
        best_model = model

# Save best model
with open(f"{MODEL_DIR}/best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
print(f"\n🏆 Best Model Saved with Accuracy: {best_acc*100:.2f}%")

3. gui_app.py – GUI for Prediction

# gui_app.py
# Tkinter-based GUI for phishing URL detection

import tkinter as tk
from tkinter import messagebox
import pickle
from feature_extraction import extract_features

# Load the trained model
with open("models/best_model.pkl", "rb") as f:
    model = pickle.load(f)

# GUI functions
def predict_url():
    url = entry.get().strip()
    if not url:
        messagebox.showwarning("Warning", "Please enter a URL")
        return
    try:
        features = extract_features(url)
        pred = model.predict(features)[0]
        if pred == 1:  # label 1 = phishing
            messagebox.showinfo("Result", "🚨 Phishing URL Detected")
        else:
            messagebox.showinfo("Result", "✅ Legitimate URL")
    except Exception as e:
        messagebox.showerror("Error", f"Prediction failed: {e}")

# Build GUI
root = tk.Tk()
root.title("Phishing URL Detection")
root.geometry("400x180")

tk.Label(root, text="Enter URL:", font=("Arial", 12)).pack(pady=10)

entry = tk.Entry(root, width=40)
entry.pack(pady=5)

tk.Button(root, text="Check", command=predict_url, bg="green", fg="white",
          width=15).pack(pady=10)
tk.Button(root, text="Quit", command=root.destroy, bg="red", fg="white",
          width=15).pack()

root.mainloop()

4. Optional: Dataset Creation Script (if using URLs)

# create_dataset.py
# Convert a list of URLs with labels into numeric feature dataset

import os
import pandas as pd
from feature_extraction import extract_features

# (URL, label) pairs: 1 = phishing, 0 = legitimate
data = [
    ("[Link]", 0),
    ("[Link]", 1),
    ("[Link]", 1),
    ("[Link]", 0)
]

features, labels = [], []

for url, label in data:
    features.append(extract_features(url).flatten())
    labels.append(label)

df = pd.DataFrame(features)
df['label'] = labels
os.makedirs("data", exist_ok=True)  # make sure the output folder exists
df.to_csv("data/[Link]", index=False)
print("✅ Dataset saved to data/[Link]")

5. [Link]
pandas
numpy
scikit-learn
# tkinter ships with the standard Python installer and is not installed through pip

RESULTS:

The trained classifiers are applied to the features extracted from phishing and legitimate
URLs, and the training script reports the accuracy of each classifier as its output. In this
way, the analysis of phishing attack features is carried out in our project.
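
Accuracy is the fraction of test URLs whose predicted label matches the true label. As a minimal sketch, assuming the label convention of 1 for phishing and 0 for legitimate, the function below computes accuracy together with precision and recall for the phishing class. The function name `evaluate` and the sample labels are invented for illustration and are not results from this project.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels for eight test URLs (not project results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Recall matters in this setting because a missed phishing URL (a false negative) is usually more costly than a false alarm, so it is worth reading alongside the accuracy figure.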

Output Screenshot :

CONCLUSION :
In this paper, we defined the features of phishing websites and proposed a classification
model to classify phishing attacks. The method consists of two parts: feature extraction from
website URLs and classification. For feature extraction, we clearly defined rules for
extracting phishing features and used these rules to obtain the features. To classify these
features, Random Forest (RF), Naive Bayes (NB) and Decision Tree (DT) classifiers were trained,
and the one with the highest accuracy score was retained.

Bibliography :
[1] Samuel Marchal, Jérôme François, Radu State, and Thomas Engel, “PhishStorm: Detecting
Phishing With Streaming Analytics,” IEEE Transactions on Network and Service Management,
vol. 11, issue 4, pp. 458-471, December 2014

[2] Mohammed Nazim Feroz, Susan Mengel, “Phishing URL Detection Using URL Ranking,”
IEEE International Congress on Big Data, July 2015

[3] Mahdieh Zabihimayvan, Derek Doran, “Fuzzy Rough Set Feature Selection to Enhance
Phishing Attack Detection,” International Conference on Fuzzy Systems (FUZZ-IEEE), New
Orleans, LA, USA, June 2019

[4] Moitrayee Chatterjee, Akbar Siami Namin, “Detecting Phishing Websites through Deep
Reinforcement Learning,” IEEE 43rd Annual Computer Software and Applications Conference
(COMPSAC), July 2019

[5] Chun-Ying Huang, Shang-Pin Ma, Wei-Lin Yeh, Chia-Yi Lin, Chien-Tsung Liu, “Mitigate web
phishing using site signatures,” TENCON 2010-2010 IEEE Region 10 Conference, January 2011

[6] Aaron Blum, Brad Wardman, Thamar Solorio, Gary Warner, “Lexical feature based phishing
URL detection using online learning,” 3rd ACM workshop on Artificial intelligence and security,
Chicago, Illinois, USA, pp. 54-60, August 2010

[7] Mohammed Al-Janabi, Ed de Quincey, Peter Andras, “Using supervised machine learning
algorithms to detect suspicious URLs in online social networks,” IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining 2017, Sydney, Australia, pp.
1104-1111, July 2017

[8] Erzhou Zhu, Yuyang Chen, Chengcheng Ye, Xuejun Li, Feng Liu, “OFSNN: An Effective
Phishing Websites Detection Model Based on Optimal Feature Selection and Neural Network,”
IEEE Access, vol. 7, pp. 73271-73284, June 2019

[9] Ankesh Anand, Kshitij Gorde, Joel Ruben Antony Moniz, Noseong Park, Tanmoy Chakraborty,
Bei-Tseng Chu, “Phishing URL Detection with Oversampling based on Text Generative
Adversarial Networks,” IEEE International Conference on Big Data (Big Data), December 2018

[10] Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker, “Learning to detect
malicious URLs,” ACM Transactions on Intelligent Systems and Technology (TIST),
vol. 2, issue 3, April 2011
