Final Report SIH

The document describes a project to create an AI/ML system to detect phishing domains. A team of six students is working on the project under the guidance of Prof. Bhagyashree Dhakulkar. The objectives are to analyze phishing and legitimate websites using machine learning classifiers and to create a browser extension application. The team analyzed datasets using CatBoost and Random Forest classifiers and achieved 94% accuracy in detection. They also developed a GUI using Tkinter.


TEAM: SF90    PS ID: 1454

REPORT
PROBLEM STATEMENT:
Create an intelligent system using AI/ML to detect phishing domains which imitate the look and
feel of genuine domains.
TEAM MEMBERS:
1. Arjun Salunke
2. Adarsh Saware
3. Siddhesh Raondure
4. Vaishnavi Dhorje
5. Aarya Jadhav
6. Gauri Pansambal
MENTOR:
Prof. Bhagyashree Dhakulkar
Aim: To create a user-friendly application for the detection of phishing domains.
Objectives:
1. Comparative Analysis of Phishing and Legitimate Websites Using Machine Learning Classifiers
2. Development of a Browser Extension Application for Phishing Detection
1. Introduction:
Brief overview of the study's objective.
Importance of distinguishing between phishing and legitimate websites.
Mention of the self-made datasets comprising 15,000 to 20,000 instances.
2. Dataset Composition:
Description of the self-made datasets.
Inclusion of both phishing and legitimate instances.
Equal representation of both classes (50% phishing, 50% legitimate).
Explanation of the significance of a balanced dataset.
3. Methodology:
Overview of the machine learning classifiers employed.
CatBoost Classifier:

CatBoost is well-suited to datasets with categorical features, which makes it particularly
advantageous for the Phishing Classifier Bot project. In the context of URLs, certain features,
such as the protocol used ('https' or 'http'), are categorical. CatBoost handles these
categorical features natively, without extensive preprocessing, which simplifies the
workflow.

Random Forest:
Random Forests operate on the principle of ensemble learning, where multiple decision trees
are created during the training phase. Each decision tree is trained on a random subset of the
data and makes individual predictions. The final prediction is then determined by combining
the outputs of all the trees. This ensemble approach enhances the model's robustness and
generalization ability.
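The ensemble idea above can be sketched with scikit-learn on toy data (illustrative features and labels, not the project's dataset):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy numeric URL features: [url_length, num_dots, has_at_symbol]
X = [[20, 1, 0], [110, 4, 1], [25, 2, 0], [130, 5, 1],
     [30, 1, 0], [95, 3, 1]]
y = [0, 1, 0, 1, 0, 1]  # 0 = legitimate, 1 = phishing

# Each of the 100 trees is trained on a bootstrap sample of the data;
# the final prediction is the majority vote across all trees.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.predict([[120, 4, 1]]))  # long URL with '@' -> likely phishing
```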

4. Data Preprocessing:
Steps taken to prepare the datasets for training and testing.
Feature selection and extraction methods applied.
Handling of missing or irrelevant data.

5. Model Training:
Details of the training process for each classifier.
Self-generated datasets containing 20,000 hyperlinks were prepared.
Parameters tuned for optimal performance.
Address Bar based Features
Abnormal Based Features
HTML and JavaScript based Features
Domain based Features
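As an illustration of the first group, a few simple address-bar features can be extracted as follows (the rules below are simplified examples of our own, not necessarily the project's exact feature set):

```python
from urllib.parse import urlparse

def address_bar_features(url: str) -> dict:
    """Extract a few illustrative address-bar features from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "uses_https": parsed.scheme == "https",
        "url_length": len(url),                 # phishing URLs tend to be long
        "has_at_symbol": "@" in url,            # '@' can hide the real host
        "num_subdomains": max(host.count(".") - 1, 0),
        "has_hyphen_in_host": "-" in host,      # common in look-alike domains
    }

print(address_bar_features("http://secure-paypa1.com.evil.tld/login"))
```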

Evaluation metrics used (accuracy, precision, recall, F1 score).
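These metrics can be computed with scikit-learn; a small sketch with toy labels (not the project's actual predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground truth (1 = phishing)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # classifier output

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```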


BROWSER EXTENSION:
 We built a Firefox extension that checks whether a given URL is legitimate or phishing.
 A URL is highlighted in red if it points to a phishing site; otherwise it is shown in
green, indicating a legitimate site.

6. Results and Analysis:


Execution with the above parameters gives an accuracy of 94% in distinguishing legitimate
from phishing websites.
Given an input URL, the browser extension reports whether the domain is legitimate or phishing.

EVALUATION ROUND 1
Methodology:
Website Screenshot and Favicon Analysis Using the VGG16 CNN Model
This report provides an in-depth analysis of website screenshots and favicons using
advanced techniques, including a VGG16 Convolutional Neural Network (CNN) model. The
investigation aims to enhance cybersecurity measures by identifying potential phishing
sites through image analysis and similarity calculations.
The VGG16 model is a CNN architecture introduced by the Visual Geometry Group (VGG) at
the University of Oxford. It became popular for its simplicity and effectiveness in image
classification tasks. Here's an overview of the VGG16 model:
VGG16 Architecture:
1. Input Layer:
The input layer takes in a fixed-size RGB image (typically 224x224 pixels).
2. Convolutional Blocks:
VGG16 consists of 13 convolutional layers, grouped into five convolutional blocks.
Each block consists of multiple convolutional layers followed by max pooling layers.
The convolutional layers use small 3x3 filters, and the max pooling layers reduce
spatial dimensions.
3. Fully Connected (Dense) Layers:
After the convolutional blocks, there are three fully connected layers.
The last layer produces the final output logits for classification.
4. Activation Function:
Rectified Linear Unit (ReLU) activation functions are used throughout the network to
introduce non-linearity.
5. Dropout:
To prevent overfitting, dropout layers are included after the fully connected layers.
6. Softmax Activation:
The final layer often employs softmax activation to convert the logits into class
probabilities.
Key Characteristics:
Parameter Size:
VGG16 has a large number of parameters (138 million) due to its deep architecture.
This contributes to both the model's expressiveness and the need for substantial
computational resources during training.
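That figure can be sanity-checked directly from the layer specification above (3x3 convolutions in the five blocks, followed by the three fully connected layers):

```python
# (input_channels, output_channels) for the 13 convolutional layers:
conv_channels = [(3, 64), (64, 64),                   # block 1
                 (64, 128), (128, 128),               # block 2
                 (128, 256), (256, 256), (256, 256),  # block 3
                 (256, 512), (512, 512), (512, 512),  # block 4
                 (512, 512), (512, 512), (512, 512)]  # block 5
# (in_features, out_features) for the 3 fully connected layers;
# the first takes the flattened 7x7x512 feature map:
fc_sizes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# weights (3x3 kernel * in * out) plus one bias per output channel/unit:
conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in conv_channels)
fc_params = sum(fin * fout + fout for fin, fout in fc_sizes)
total = conv_params + fc_params
print(total)  # 138_357_544, i.e. ~138 million
```

Note that the vast majority of the parameters sit in the first fully connected layer, not in the convolutions.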
Simplicity:
The architecture is straightforward and consists of repeating blocks of convolutional
layers. This simplicity aids in understanding and modifying the model for specific tasks.
Pre-Trained Models:
VGG16 is often used as a pre-trained model for image classification tasks. Pre-training
on large datasets such as ImageNet allows it to capture general features that can be
fine-tuned for specific tasks with smaller datasets.
Use Cases:
Image Classification:
VGG16 is primarily used for image classification tasks where the goal is to assign a label
to an input image.
Feature Extraction:
The intermediate layers of VGG16 can be used as a feature extractor for various
computer vision tasks.
Transfer Learning:
Due to its pre-trained nature, VGG16 is often employed in transfer learning scenarios
where it is fine-tuned on a smaller dataset for a specific task.

PROJECT FLOW DIAGRAM

[Flow diagram: the input URL is permuted; each candidate domain is checked for sitemap
details, image similarity, favicon existence, and web elements requested via Beautiful Soup
version 4; a voting step then scores each candidate, with 1 = suspicious and -1 = malicious.]
PROJECT IMPLEMENTATION
1. Website Screenshot Analysis:
Selenium for Screenshot Capture: Selenium, a powerful web testing tool, was employed to
capture full-page screenshots of target webpages.
Image Preprocessing: The captured screenshots underwent preprocessing to enhance quality
and facilitate accurate feature extraction.
Feature Extraction: The VGG16 CNN model was used to extract meaningful features from
the preprocessed screenshots.
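One common way to score the similarity of the feature vectors extracted from two screenshots is cosine similarity; a minimal pure-Python sketch (the vectors shown are made-up placeholders for CNN features):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical feature vectors for an original site and a suspect clone:
original = [0.1, 0.8, 0.3, 0.5]
suspect  = [0.12, 0.79, 0.28, 0.52]
print(round(cosine_similarity(original, suspect), 3))  # close to 1.0 -> similar
```

A score near 1.0 suggests the suspect page closely imitates the original's look and feel; in practice a threshold would be tuned on labeled examples.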
2. Favicon Analysis:

Image Hashing: Favicon images were hashed to generate unique identifiers for
comparison.
Similarity Calculation: The image hashing technique was employed to calculate the
similarity between the original and phishing site favicons.
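A minimal sketch of the idea behind perceptual image hashing, using a toy 8x8 average hash ('aHash') and Hamming distance (a simplified stand-in, not necessarily the exact hashing scheme used in the project):

```python
def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255) -> 64-bit int hash.
    Each bit records whether a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(h1, h2):
    """Number of differing bits between two hashes (0 = identical)."""
    return bin(h1 ^ h2).count("1")

icon_a = [[10] * 8] * 4 + [[200] * 8] * 4   # top half dark, bottom bright
icon_b = [[12] * 8] * 4 + [[198] * 8] * 4   # nearly identical favicon
print(hamming(average_hash(icon_a), average_hash(icon_b)))  # 0 -> very similar
```

A small Hamming distance between a site's favicon hash and a known brand's favicon hash flags a likely imitation.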
3. URL Analysis:

dnstwist: The dnstwist tool was used to identify potential phishing domains by
generating permutations of the original URL.
Similar URL Identification: Similar-looking URLs that were already registered were
identified, signaling potential phishing attempts.
Regular expressions were used for this matching.
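The permutation idea can be sketched in a few lines (the homoglyph table below is illustrative; dnstwist's real generators are far more extensive):

```python
# Illustrative look-alike substitutions; dnstwist covers many more cases
# (omission, repetition, bitsquatting, alternate TLDs, etc.).
HOMOGLYPHS = {"o": "0", "l": "1", "i": "1", "e": "3", "a": "4"}

def permutations(domain):
    """Return look-alike domains with exactly one character swapped."""
    results = set()
    for i, ch in enumerate(domain):
        if ch in HOMOGLYPHS:
            results.add(domain[:i] + HOMOGLYPHS[ch] + domain[i + 1:])
    return sorted(results)

print(permutations("paypal.com"))
```

Each generated candidate would then be checked against DNS records to see whether it is already registered.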
Results:

The analysis provided valuable insights into the following:


Screenshot Similarity: The VGG16 CNN model successfully extracted features, allowing
for accurate assessment of similarity between the original and potential phishing site
screenshots.
Favicon Similarity: Image hashing proved effective in identifying similarities between
favicons, aiding in the detection of phishing sites.
URL Permutations: dnstwist and URL permutation techniques revealed potential
phishing domains by identifying variations of the original URL.
Recommendations:

Regular Monitoring: Implement regular website monitoring to detect changes in
screenshot and favicon patterns.
Dynamic Parameter Adjustments: Fine-tune parameters based on changes in URL
permutations to enhance accuracy in phishing site detection.
Collaboration: Foster collaboration with cybersecurity communities to share insights and
improve the effectiveness of detection mechanisms.
Conclusion:

The use of advanced image analysis techniques, combined with URL permutation analysis,
proves to be a robust approach in identifying potential phishing sites. The integration of
VGG16 CNN models and Selenium testing contributes to a comprehensive and proactive
cybersecurity strategy.
The proposed work serves as a foundation for ongoing research and development in the field
of phishing site detection, emphasizing the importance of leveraging cutting-edge
technologies for enhanced online security.

EVALUATION ROUND 2
Development of GUI
A Graphical User Interface (GUI) is a form of user interface that allows users to interact
with computers through visual indicators such as icons, menus, and windows. It has
advantages over the Command Line Interface (CLI), where users interact with the computer
by typing commands and which is generally harder to use than a GUI.

Use of Tkinter
Tkinter is the built-in Python module used to create GUI applications. It is one of the
most commonly used modules for building GUIs in Python because it is simple and easy to
work with. There is no separate installation step, as Tkinter ships with Python. It
provides an object-oriented interface to the Tk GUI toolkit.
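A minimal sketch of such a Tkinter front end (widget layout and names are illustrative; `check_url` stands in for the trained classifier, and the red/green convention matches the extension described above):

```python
import tkinter as tk

def status_colour(is_phishing: bool) -> str:
    """Colour convention: red = phishing, green = legitimate."""
    return "red" if is_phishing else "green"

def build_gui(check_url):
    """check_url: callable taking a URL string, returning True if phishing."""
    root = tk.Tk()
    root.title("Phishing Domain Detector")
    entry = tk.Entry(root, width=50)
    entry.pack(padx=10, pady=5)
    result = tk.Label(root, text="Enter a URL")
    result.pack(pady=5)

    def on_check():
        phishing = check_url(entry.get())
        result.config(text="Phishing" if phishing else "Legitimate",
                      fg=status_colour(phishing))

    tk.Button(root, text="Check", command=on_check).pack(pady=5)
    return root  # caller starts the event loop with root.mainloop()
```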
Connectivity of Database
Database connectivity allows the client software to communicate with the database server
software. It is an interface that enables communication between the database and the
software application. Front-end elements of applications or websites, such as buttons,
fonts, or menus, need to be connected to the back-end database to deliver relevant
information to the end user. Database connectivity enables this data transfer between the
front-end and back-end.
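As a sketch, Python's built-in sqlite3 module can provide this connectivity (the table and column names here are illustrative, not necessarily the project's schema):

```python
import sqlite3

# In-memory database for the sketch; a real deployment would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (url TEXT, label TEXT)")
conn.executemany("INSERT INTO domains VALUES (?, ?)",
                 [("paypal.com", "legitimate"), ("paypa1.com", "phishing")])
conn.commit()

# The front end queries the back end with a parameterised statement:
row = conn.execute("SELECT label FROM domains WHERE url = ?",
                   ("paypa1.com",)).fetchone()
print(row[0])  # phishing
```

Parameterised queries (the `?` placeholders) also guard the front end against SQL injection.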
Use of Headless Mode in Selenium
Headless mode runs a full version of the browser without a visible UI while it is
controlled programmatically, via a command-line interface or network communication. This
means Selenium tests can run on servers without a graphics card or display.

CONCLUSION:
Thus, the implemented system identifies look-alike domains of legitimate websites from the
available database.
