Final Report SIH

The document describes a project to create an AI/ML system to detect phishing domains. A team of six students is working on the project under the guidance of Prof. Bhagyashree Dhakulkar. The objectives are to analyze phishing and legitimate websites using machine learning classifiers and to create a browser extension application. The team analyzed datasets using CatBoost and Random Forest classifiers and achieved 94% accuracy in detection. They also developed a GUI using Tkinter.


TEAM: SF90    PS ID: 1454

REPORT
PROBLEM STATEMENT:
Create an intelligent system using AI/ML to detect phishing domains which imitate the look and
feel of genuine domains.
TEAM MEMBERS:
1. Arjun Salunke
2. Adarsh Saware
3. Siddhesh Raondure
4. Vaishnavi Dhorje
5. Aarya Jadhav
6. Gauri Pansambal
MENTOR:
Prof. Bhagyashree Dhakulkar
Aim: To create a user-friendly application for the detection of phishing domains.
Objectives:
1. Comparative Analysis of Phishing and Legitimate Websites Using Machine Learning Classifiers
2. Development of a Browser Extension Application for Phishing Detection
1. Introduction:
Brief overview of the study's objective.
Importance of distinguishing between phishing and legitimate websites.
Mention of the self-made datasets comprising 15,000 to 20,000 instances.
2. Dataset Composition:
Description of the self-made datasets.
Inclusion of both phishing and legitimate instances.
Equal representation of both classes (50% phishing, 50% legitimate).
Explanation of the significance of a balanced dataset.
3. Methodology:
Overview of the machine learning classifiers employed.
CatBoost Classifier:

CatBoost is well-suited to datasets with categorical features, which makes it particularly
advantageous for the Phishing Classifier Bot project. In the context of URLs, certain features,
such as the protocol used ('https' or 'http'), are categorical. CatBoost handles these
categorical features natively, without extensive preprocessing, which simplifies the
workflow.

Random Forest:
Random Forests operate on the principle of ensemble learning, where multiple decision trees
are created during the training phase. Each decision tree is trained on a random subset of the
data and makes individual predictions. The final prediction is then determined by combining
the outputs of all the trees. This ensemble approach enhances the model's robustness and
generalization ability.
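The ensemble idea above can be sketched with scikit-learn on toy data (illustrative features and labels, not the project's dataset):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy numeric URL features: [url_length, num_dots, has_at_symbol]
X = [[20, 1, 0], [110, 4, 1], [25, 2, 0], [130, 5, 1],
     [30, 1, 0], [95, 3, 1]]
y = [0, 1, 0, 1, 0, 1]  # 0 = legitimate, 1 = phishing

# Each of the 100 trees is trained on a bootstrap sample of the data;
# the final prediction is the majority vote across all trees.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.predict([[120, 4, 1]]))  # long URL with '@' -> likely phishing
```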

4. Data Preprocessing:
Steps taken to prepare the datasets for training and testing.
Feature selection and extraction methods applied.
Handling of missing or irrelevant data.

5. Model Training:
Details of the training process for each classifier.
Self-generated datasets containing 20,000 hyperlinks were prepared.
Parameters tuned for optimal performance.
Address Bar based Features
Abnormal Based Features
HTML and JavaScript based Features
Domain based Features
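As an illustration of the first group, a few simple address-bar features can be extracted as follows (the rules below are simplified examples of our own, not necessarily the project's exact feature set):

```python
from urllib.parse import urlparse

def address_bar_features(url: str) -> dict:
    """Extract a few illustrative address-bar features from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "uses_https": parsed.scheme == "https",
        "url_length": len(url),                 # phishing URLs tend to be long
        "has_at_symbol": "@" in url,            # '@' can hide the real host
        "num_subdomains": max(host.count(".") - 1, 0),
        "has_hyphen_in_host": "-" in host,      # common in look-alike domains
    }

print(address_bar_features("http://secure-paypa1.com.evil.tld/login"))
```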

Evaluation metrics used (accuracy, precision, recall, F1 score).
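These metrics can be computed with scikit-learn; a small sketch with toy labels (not the project's actual predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground truth (1 = phishing)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # classifier output

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```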


BROWSER EXTENSION:
 We built a Firefox extension that checks whether a given URL is legitimate or phishing.
 A URL is highlighted in red if it points to a phishing site; otherwise it is shown in
green, indicating a legitimate site.

6. Results and Analysis:


Execution with the above parameters gives an accuracy of 94% in distinguishing legitimate
from phishing websites.
Given an input URL, the browser extension reports whether the domain is legitimate or phishing.

EVALUATION ROUND 1
Methodology:
Website Screenshot and Favicon Analysis Using the VGG16 CNN Model
This report provides an in-depth analysis of website screenshots and favicons using
advanced techniques, including a VGG16 Convolutional Neural Network (CNN) model. The
investigation aims to enhance cybersecurity measures by identifying potential phishing
sites through image analysis and similarity calculations.
The VGG16 model is a CNN architecture introduced by the Visual Geometry Group (VGG) at
the University of Oxford. It became popular for its simplicity and effectiveness in image
classification tasks. Here's an overview of the VGG16 model:
VGG16 Architecture:
1. Input Layer:
The input layer takes in a fixed-size RGB image (typically 224x224 pixels).
2. Convolutional Blocks:
VGG16 consists of 13 convolutional layers, grouped into five convolutional blocks.
Each block consists of multiple convolutional layers followed by max pooling layers.
The convolutional layers use small 3x3 filters, and the max pooling layers reduce
spatial dimensions.
3. Fully Connected (Dense) Layers:
After the convolutional blocks, there are three fully connected layers.
The last layer produces the final output logits for classification.
4. Activation Function:
Rectified Linear Unit (ReLU) activation functions are used throughout the network to
introduce non-linearity.
5. Dropout:
To prevent overfitting, dropout layers are included after the fully connected layers.
6. Softmax Activation:
The final layer often employs softmax activation to convert the logits into class
probabilities.
Key Characteristics:
Parameter Size:
VGG16 has a large number of parameters (138 million) due to its deep architecture.
This contributes to both the model's expressiveness and the need for substantial
computational resources during training.
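That figure can be sanity-checked directly from the layer specification above (3x3 convolutions in the five blocks, followed by the three fully connected layers):

```python
# (input_channels, output_channels) for the 13 convolutional layers:
conv_channels = [(3, 64), (64, 64),                   # block 1
                 (64, 128), (128, 128),               # block 2
                 (128, 256), (256, 256), (256, 256),  # block 3
                 (256, 512), (512, 512), (512, 512),  # block 4
                 (512, 512), (512, 512), (512, 512)]  # block 5
# (in_features, out_features) for the 3 fully connected layers;
# the first takes the flattened 7x7x512 feature map:
fc_sizes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# weights (3x3 kernel * in * out) plus one bias per output channel/unit:
conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in conv_channels)
fc_params = sum(fin * fout + fout for fin, fout in fc_sizes)
total = conv_params + fc_params
print(total)  # 138_357_544, i.e. ~138 million
```

Note that the vast majority of the parameters sit in the first fully connected layer, not in the convolutions.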
Simplicity:
The architecture is straightforward and consists of repeating blocks of convolutional
layers. This simplicity aids in understanding and modifying the model for specific tasks.
Pre-Trained Models:
VGG16 is often used as a pre-trained model for image classification tasks. Pre-training
on large datasets such as ImageNet allows it to capture general features that can be
fine-tuned for specific tasks with smaller datasets.
Use Cases:
Image Classification:
VGG16 is primarily used for image classification tasks where the goal is to assign a label
to an input image.
Feature Extraction:
The intermediate layers of VGG16 can be used as a feature extractor for various
computer vision tasks.
Transfer Learning:
Due to its pre-trained nature, VGG16 is often employed in transfer learning scenarios
where it is fine-tuned on a smaller dataset for a specific task.

PROJECT FLOW DIAGRAM

[Flow diagram: the input URL is permuted; each candidate domain is checked for sitemap
details, image similarity, favicon existence, and web elements requested via Beautiful Soup
version 4; a voting step then scores each candidate, with 1 = suspicious and -1 = malicious.]
PROJECT IMPLEMENTATION
1. Website Screenshot Analysis:
Selenium for Screenshot Capture: Selenium, a powerful web testing tool, was employed to
capture full-page screenshots of target webpages.
Image Preprocessing: The captured screenshots underwent preprocessing to enhance quality
and facilitate accurate feature extraction.
Feature Extraction: The VGG16 CNN model was used to extract meaningful features from
the preprocessed screenshots.
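One common way to score the similarity of the feature vectors extracted from two screenshots is cosine similarity; a minimal pure-Python sketch (the vectors shown are made-up placeholders for CNN features):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical feature vectors for an original site and a suspect clone:
original = [0.1, 0.8, 0.3, 0.5]
suspect  = [0.12, 0.79, 0.28, 0.52]
print(round(cosine_similarity(original, suspect), 3))  # close to 1.0 -> similar
```

A score near 1.0 suggests the suspect page closely imitates the original's look and feel; in practice a threshold would be tuned on labeled examples.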
2. Favicon Analysis:

Image Hashing: Favicon images were hashed to generate unique identifiers for
comparison.
Similarity Calculation: The image hashing technique was employed to calculate the
similarity between the original and phishing site favicons.
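A minimal sketch of the idea behind perceptual image hashing, using a toy 8x8 average hash ('aHash') and Hamming distance (a simplified stand-in, not necessarily the exact hashing scheme used in the project):

```python
def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255) -> 64-bit int hash.
    Each bit records whether a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(h1, h2):
    """Number of differing bits between two hashes (0 = identical)."""
    return bin(h1 ^ h2).count("1")

icon_a = [[10] * 8] * 4 + [[200] * 8] * 4   # top half dark, bottom bright
icon_b = [[12] * 8] * 4 + [[198] * 8] * 4   # nearly identical favicon
print(hamming(average_hash(icon_a), average_hash(icon_b)))  # 0 -> very similar
```

A small Hamming distance between a site's favicon hash and a known brand's favicon hash flags a likely imitation.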
3. URL Analysis:

dnstwist: The dnstwist tool was used to identify potential phishing domains by
generating permutations of the original URL.
Similar URL Identification: Similar-looking URLs that were already registered were
identified, signaling potential phishing attempts.
Regular expressions were used for this matching.
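The permutation idea can be sketched in a few lines (the homoglyph table below is illustrative; dnstwist's real generators are far more extensive):

```python
# Illustrative look-alike substitutions; dnstwist covers many more cases
# (omission, repetition, bitsquatting, alternate TLDs, etc.).
HOMOGLYPHS = {"o": "0", "l": "1", "i": "1", "e": "3", "a": "4"}

def permutations(domain):
    """Return look-alike domains with exactly one character swapped."""
    results = set()
    for i, ch in enumerate(domain):
        if ch in HOMOGLYPHS:
            results.add(domain[:i] + HOMOGLYPHS[ch] + domain[i + 1:])
    return sorted(results)

print(permutations("paypal.com"))
```

Each generated candidate would then be checked against DNS records to see whether it is already registered.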
Results:

The analysis provided valuable insights into the following:


Screenshot Similarity: The VGG16 CNN model successfully extracted features, allowing
for accurate assessment of similarity between the original and potential phishing site
screenshots.
Favicon Similarity: Image hashing proved effective in identifying similarities between
favicons, aiding in the detection of phishing sites.
URL Permutations: dnstwist and URL permutation techniques revealed potential
phishing domains by identifying variations of the original URL.
Recommendations:

Regular Monitoring: Implement regular website monitoring to detect changes in
screenshot and favicon patterns.
Dynamic Parameter Adjustments: Fine-tune parameters based on changes in URL
permutations to enhance accuracy in phishing site detection.
Collaboration: Foster collaboration with cybersecurity communities to share insights and
improve the effectiveness of detection mechanisms.
Conclusion:

The use of advanced image analysis techniques, combined with URL permutation analysis,
proves to be a robust approach in identifying potential phishing sites. The integration of
VGG16 CNN models and Selenium testing contributes to a comprehensive and proactive
cybersecurity strategy.
The proposed work serves as a foundation for ongoing research and development in the field
of phishing site detection, emphasizing the importance of leveraging cutting-edge
technologies for enhanced online security.

EVALUATION ROUND 2
Development of GUI
A Graphical User Interface (GUI) is a form of user interface that allows users to interact
with computers through visual indicators such as icons, menus, and windows. It has
advantages over the Command Line Interface (CLI), where users interact with the computer
by typing commands and which is generally harder to use than a GUI.

Use of Tkinter
Tkinter is the built-in Python module used to create GUI applications. It is one of the
most commonly used modules for building GUIs in Python because it is simple and easy to
work with. There is no separate installation step, as Tkinter ships with Python. It
provides an object-oriented interface to the Tk GUI toolkit.
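A minimal sketch of such a Tkinter front end (widget layout and names are illustrative; `check_url` stands in for the trained classifier, and the red/green convention matches the extension described above):

```python
import tkinter as tk

def status_colour(is_phishing: bool) -> str:
    """Colour convention: red = phishing, green = legitimate."""
    return "red" if is_phishing else "green"

def build_gui(check_url):
    """check_url: callable taking a URL string, returning True if phishing."""
    root = tk.Tk()
    root.title("Phishing Domain Detector")
    entry = tk.Entry(root, width=50)
    entry.pack(padx=10, pady=5)
    result = tk.Label(root, text="Enter a URL")
    result.pack(pady=5)

    def on_check():
        phishing = check_url(entry.get())
        result.config(text="Phishing" if phishing else "Legitimate",
                      fg=status_colour(phishing))

    tk.Button(root, text="Check", command=on_check).pack(pady=5)
    return root  # caller starts the event loop with root.mainloop()
```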
Connectivity of Database
Database connectivity allows the client software to communicate with the database server
software. It is an interface that enables communication between the database and the
software application. Front-end elements of applications or websites, such as buttons,
fonts, or menus, need to be connected to the back-end database to deliver relevant
information to the end user. Database connectivity enables this data transfer between the
front-end and back-end.
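As a sketch, Python's built-in sqlite3 module can provide this connectivity (the table and column names here are illustrative, not necessarily the project's schema):

```python
import sqlite3

# In-memory database for the sketch; a real deployment would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (url TEXT, label TEXT)")
conn.executemany("INSERT INTO domains VALUES (?, ?)",
                 [("paypal.com", "legitimate"), ("paypa1.com", "phishing")])
conn.commit()

# The front end queries the back end with a parameterised statement:
row = conn.execute("SELECT label FROM domains WHERE url = ?",
                   ("paypa1.com",)).fetchone()
print(row[0])  # phishing
```

Parameterised queries (the `?` placeholders) also guard the front end against SQL injection.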
Use of Headless Mode in Selenium
Headless mode runs a full version of the browser without a visible UI while it is
controlled programmatically, via a command-line interface or network communication. This
means Selenium tests can run on servers without a graphics card or display.

CONCLUSION:
Thus, the implemented system identifies look-alike domains of legitimate websites from the
available database.
