Phishing Website Detection Dissertation
to the
By
AJAY.K
C22PG101CSC001
VADACHENNIMALAI (P.O)
APR-2024
ARIGNAR ANNA GOVERNMENT ARTS COLLEGE
(Accredited with ‘C’ grade by NAAC)
(AFFILIATED TO PERIYAR UNIVERSITY)
VADACHENNIMALAI (P.O)
ATTUR (T.K), SALEM (D.T)
PIN CODE: 636121
APR-2024
PROJECT WORK
A Dissertation submitted in partial fulfillment of the requirements for the degree of
I express my sincere thanks to Dr. M. SUMATHI, M.Sc., B.Ed., Ph.D., Principal, Arignar
Anna Government Arts College, Attur, for providing me the right ambiance for carrying out the project
work.
I express my sincere thanks to Dr. K. SELVARAJ, M.Sc., M.Phil., MBA., Ph.D., Head and
Associate Professor, Department of Computer Science, for providing me the right ambiance for carrying
out the project work.
I also express my sincere thanks to all staff members in the Department of Computer Science
for their support throughout my project work.
I thank everyone who has helped me to complete this project work effectively. Finally, I am also
grateful to my family and friends, who inspired me all through the days of my project.
AJAY.K
DECLARATION
2 SYSTEM STUDY
2.1 EXISTING SYSTEM
2.1.1 DRAWBACKS
2.2 PROPOSED SYSTEM
2.2.1 FEATURES
5 CONCLUSION
BIBLIOGRAPHY
APPENDICES
A DATA FLOW DIAGRAM
B TABLE STRUCTURE
C SAMPLE CODING
D SAMPLE INPUT
E SAMPLE OUTPUT
SYNOPSIS
Phishing is an internet scam in which an attacker sends out fake messages that appear to
come from a trusted source. Phishing websites pose a significant threat to online security, aiming
to deceive users into revealing sensitive information such as login credentials, financial details, and personal
data. Traditional methods of detecting phishing websites rely on static blacklists and manual
analysis, which are often insufficient in keeping up with the rapidly evolving tactics of attackers.
In this study, we propose a novel approach to phishing website detection using machine learning
techniques. Our method leverages features extracted from website content, such as HTML
structure, textual content and visual elements, along with contextual information such as URL
characteristics and domain reputation. Overall, our proposed machine learning-based approach
offers a promising solution for proactive and automated phishing website detection, enhancing
online security and mitigating the risks associated with phishing attacks.
1. INTRODUCTION
1.1 SYSTEM SPECIFICATION
HARDWARE CONFIGURATION
Processor : Dual-core processor, 2.6 GHz
RAM : 8 GB
SOFTWARE SPECIFICATION
Platform : Windows 10 Pro
Back-End : SQL
Language : Python
2. SYSTEM STUDY
A detailed study to determine whether, to what extent, and how automatic data-processing
equipment should be used; it usually includes an analysis of the existing system and the design of
the new system, including the development of system specifications which provide a basis for the
selection of equipment.
EXISTING SYSTEM
The existing system for phishing website detection often relies on traditional methods such as
static blacklists, manual analysis, and heuristic rules. These methods have limitations in keeping up
with the dynamic nature of phishing attacks and may result in false positives or negatives.
DRAWBACKS
Unfortunately, many existing phishing-detection tools, especially those that depend on
a blacklist, suffer from limitations such as low detection accuracy and a high false-alarm rate. These
are often caused by delays in updating the blacklist, because classification involves a human
verification step.
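The limitation of a static blacklist can be illustrated with a minimal sketch. The hostnames and the is_blacklisted helper below are hypothetical and serve only as an illustration; they are not part of any existing tool.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of hostnames already reported as phishing.
BLACKLIST = {"secure-login.example.net", "paypa1.example.com"}


def is_blacklisted(url: str) -> bool:
    """Return True when the URL's hostname appears in the static blacklist."""
    host = urlsplit(url).hostname or ""
    return host.lower() in BLACKLIST


# A host that is already on the list is caught.
print(is_blacklisted("http://paypa1.example.com/signin"))
# A freshly registered variant is missed until someone updates the list.
print(is_blacklisted("http://paypa1-verify.example.org/signin"))
```

The second lookup fails only because the list has not yet been updated, which is the delay described above.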
PROPOSED SYSTEM
Our project aims to develop a machine learning-based phishing website detection system. We'll
collect a dataset with features like URL length, domain age, and SSL certificate presence.
Preprocessing involves cleaning and encoding data. Using algorithms like decision trees, we'll train
the model to distinguish between legitimate and phishing websites.
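As a rough illustration of the idea, the sketch below grows a very small decision tree in plain Python using Gini impurity. The eight rows are hand-made toy values, not the project dataset, and all function names are assumptions made for this example. In practice a library such as scikit-learn would normally be used instead.

```python
from collections import Counter

# Toy, hand-made rows (NOT real data): (url_length, domain_age_days, has_ssl)
X = [
    (95, 12, 0), (120, 30, 0), (88, 5, 1), (110, 20, 0),
    (22, 3650, 1), (30, 2000, 1), (18, 900, 1), (40, 1500, 0),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = phishing, 0 = legitimate


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; a leaf is the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk the tree until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


tree = build_tree(X, y)
print(predict(tree, (105, 10, 0)))   # long URL, very young domain
print(predict(tree, (25, 3000, 1)))  # short URL, old domain
```

On this toy data a single split on URL length separates the two classes, so the tree stays very shallow. Real data would need more features and a held-out test set.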
FEATURES
URL Characteristics: Length of the URL, presence of subdomains, use of special
characters or numbers, and similarity to known legitimate websites.
Domain Information: Domain age, registration length, presence of hyphens, and WHOIS
registration details.
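Some of the URL characteristics listed above can be computed directly from the URL string. The following is a minimal sketch; the function name, the feature names, and the set of "special" characters are assumptions chosen for illustration. Domain age and WHOIS details need an external lookup and are not covered here.

```python
import re
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Extract a few lexical features from a URL (illustrative only)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        # Labels beyond "name.tld"; naive, ignores multi-part TLDs such as .co.uk
        "subdomain_count": max(len(labels) - 2, 0),
        "hyphens_in_host": host.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "special_char_count": len(re.findall(r"[@_%=&?~]", url)),
    }


print(url_features("http://secure-login.paypa1.example.com/verify?id=42"))
```

Each value in the returned dictionary can become one column of the training dataset.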
3. SYSTEM DESIGN AND DEVELOPMENT
FILE DESIGN
System design is the process of planning a new system to complement or altogether replace the
old system. The purpose of the design phase is to plan a solution for the problem. This phase is the
first step in moving from the problem domain to the solution domain.
INPUT DESIGN
Input design is the process of converting user-originated inputs to a computer-based format. Input
design is one of the most expensive phases of the operation of a computerized system and is often
the major problem of a system. In the project, the input design is implemented in various web forms with
various methods. For example, in the user creation form, an empty username or password is not
allowed. If the username already exists in the database, the input is invalid and is not accepted. Likewise,
during the login process, the username is mandatory and must be present in the user list in the database.
Only then is login allowed.
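The validation rules described above can be sketched as follows. This is only an illustration: the users table, its columns, and the two helper functions are assumptions, and an in-memory SQLite database stands in for the project's back end.

```python
import sqlite3

# In-memory database with a hypothetical `users` table (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, password TEXT NOT NULL)")


def create_user(username: str, password: str) -> str:
    """Apply the user-creation rules; return 'ok' or the reason for rejection."""
    if not username.strip() or not password:
        return "username and password are required"
    exists = conn.execute(
        "SELECT 1 FROM users WHERE username = ?", (username,)
    ).fetchone()
    if exists:
        return "username already exists"
    # NOTE: a real system should store a salted hash, never the plain password.
    conn.execute("INSERT INTO users VALUES (?, ?)", (username, password))
    return "ok"


def can_login(username: str) -> bool:
    """Login is allowed only for a non-empty username present in the table."""
    if not username.strip():
        return False
    row = conn.execute(
        "SELECT 1 FROM users WHERE username = ?", (username,)
    ).fetchone()
    return row is not None


print(create_user("", "secret"))      # rejected: empty username
print(create_user("ajay", "secret"))  # accepted
print(create_user("ajay", "other"))   # rejected: duplicate username
print(can_login("ajay"), can_login("nobody"))
```

The queries use parameter placeholders (?) rather than string concatenation, so user input is never interpreted as SQL.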
OUTPUT DESIGN
Output design generally refers to the results and information that are generated by the
system. For many end-users, output is the main reason for developing the system and the basis on which
they evaluate the usefulness of the application. In the project, the mail details, the greetings details,
and the resume details are the web forms in which the output is available.
CODE DESIGN
Code design is the process by which an agent creates a specification of a software artifact intended to
accomplish goals, using a set of primitive components and subject to constraints. The term is
sometimes used broadly to refer to "all the activity involved in conceptualizing, framing,
implementing, commissioning, and ultimately modifying" the software, or more specifically "the
activity following requirements specification and before programming, as a stylized software
engineering process."
Software design usually involves problem-solving and planning a software solution. This includes both
low-level component and algorithm design and high-level architecture design. Code design is a
fundamental aspect of software development, encompassing a set of principles and practices that
guide the planning and organization of code. It involves breaking down complex systems into
manageable, modular components, adhering to SOLID principles, and utilizing design patterns to solve
common problems efficiently. Good code design prioritizes readability and maintainability, with well-
documented code that's easy to understand and test. It also considers architectural patterns and
distributed systems design when building larger applications. Recognizing and addressing code smells,
conducting regular code reviews, and practicing code refactoring are essential for continuous
improvement. Achieving high cohesion and low coupling in code modules is key to enhancing
maintainability. Overall, effective code design is an ongoing process that evolves with the project,
promoting robust, scalable, and well-structured software.
DATABASE DESIGN
A table is made up of rows and columns. A row is also called a record (or tuple). A database
is a collection of interrelated stored data. It provides an effective method of defining, storing, and retrieving
information, so that multiple applications and users can use the data contained in the database.
Database design is the organization of data according to a database model. The designer determines
what data must be stored and how the data elements interrelate. With this information, they can begin
to fit the data to the database model. A database management system manages the data accordingly.
Database design involves classifying data and identifying interrelationships. This theoretical
representation of the data is called an ontology.
This process is one which is generally considered part of requirements analysis, and
requires skill on the part of the database designer to elicit the needed information from those with the
domain knowledge. This is because those with the necessary domain knowledge often cannot
clearly express the system requirements for the database as they are unaccustomed to thinking in terms
of the discrete data elements which must be stored. Data to be stored can be determined by
Requirement Specification.
A standard piece of database design guidance is that the designer should create a fully normalized
design; selective denormalization can subsequently be performed, but only for performance reasons.
The trade-off is storage space vs. performance. The more normalized the design is, the less data
redundancy there is (and therefore, the less space it takes to store); however, common data retrieval
patterns may then need complex joins, merges, and sorts, which take more data reads and
compute cycles. Some modeling disciplines, such as the dimensional modeling approach to data
warehouse design, explicitly recommend non-normalized designs, i.e. designs that in large part do not
adhere to 3NF. Normalization consists of normal forms that are 1NF, 2NF, 3NF, Boyce-Codd NF
(3.5NF), 4NF and 5NF.
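The trade-off described above can be seen in a small sketch. The two tables and their columns are hypothetical and are not the table structure of this project. Each domain is stored once, and every URL refers to it by key, so reading a complete record requires a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized layout (table and column names are hypothetical):
# each domain is stored once, and every checked URL refers to it by key.
conn.executescript("""
CREATE TABLE domains (
    domain_id INTEGER PRIMARY KEY,
    name      TEXT UNIQUE NOT NULL,
    age_days  INTEGER NOT NULL
);
CREATE TABLE urls (
    url_id    INTEGER PRIMARY KEY,
    domain_id INTEGER NOT NULL REFERENCES domains(domain_id),
    path      TEXT NOT NULL
);
INSERT INTO domains VALUES (1, 'example.com', 4000), (2, 'examp1e.net', 7);
INSERT INTO urls VALUES (1, 1, '/home'), (2, 2, '/login'), (3, 2, '/verify');
""")

# The price of removing redundancy: reading a full record now needs a join.
rows = conn.execute("""
    SELECT d.name, u.path, d.age_days
    FROM urls AS u JOIN domains AS d ON d.domain_id = u.domain_id
    ORDER BY u.url_id
""").fetchall()
for row in rows:
    print(row)
```

The age of examp1e.net is stored in one place only, so updating it cannot leave two URLs with conflicting values.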
Document databases take a different approach. A document that is stored in such a database typically
would contain more than one normalized data unit and often the relationships between the units as
well. If all the data units and the relationships in question are often retrieved together, then this
approach optimizes the number of retrieves. It also simplifies how data gets replicated, because now
there is a clearly identifiable unit of data whose consistency is self-contained. Another consideration is
that reading and writing a single document in such databases will require a single transaction, which
can be an important consideration in a microservices architecture. In such situations, often, portions of
the document are retrieved from other services via an API and stored locally for efficiency reasons. If
the data units were to be split out across the services, then a read (or write) to support a service
consumer might require more than one service call, and this could result in management of multiple
transactions, which may not be preferred.
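For comparison with the normalized layout, the sketch below stores one self-contained document. The field names are hypothetical, and a plain Python dictionary of JSON strings stands in for a real document database.

```python
import json

# One self-contained document (field names are hypothetical): the scan result
# embeds the domain details instead of pointing to a separate table.
scan_document = {
    "url": "http://examp1e.net/login",
    "domain": {"name": "examp1e.net", "age_days": 7},
    "features": {"url_length": 24, "has_ssl": False},
    "label": "phishing",
}

# A toy in-memory "collection" keyed by URL, standing in for a document database.
collection = {scan_document["url"]: json.dumps(scan_document)}

# A single lookup returns the data units and their relationships together.
loaded = json.loads(collection["http://examp1e.net/login"])
print(loaded["domain"]["age_days"], loaded["label"])
```

One read retrieves the URL, its domain details, and its features together, at the cost of repeating the domain details in every document that mentions the same domain.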
SYSTEM DEVELOPMENT
HTML
Hypertext Markup Language (HTML) is the standard markup language for documents
designed to be displayed in a web browser. It defines the content and structure of web content. It is
often assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such
as JavaScript.
Web browsers receive HTML documents from a web server or from local storage and render the
documents into multimedia web pages. HTML describes the structure of a web page semantically and
originally included cues for its appearance.
HTML elements are the building blocks of HTML pages. With HTML constructs, images and other
objects such as interactive forms may be embedded into the rendered page. HTML provides a means to
create structured documents by denoting structural semantics for text such as headings, paragraphs,
lists, links, quotes, and other items. HTML elements are delineated by tags, written using angle
brackets. Tags such as <img> and <input> directly introduce content into the page. Other tags such
as <p> and </p> surround and provide information about document text and may include sub-element
tags. Browsers do not display the HTML tags but use them to interpret the content of the page.
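Because tags delimit the structure of a page, a program can read them in the same way a browser does. The sketch below uses the html.parser module from Python's standard library to count start tags and to record where a form submits its data. The TagCounter class and the sample page are assumptions made for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and record where forms submit their data."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "form":
            self.form_actions.append(dict(attrs).get("action", ""))


page = """
<html><body>
  <p>Please sign in.</p>
  <form action="http://collector.example.net/save">
    <input type="text" name="user"><input type="password" name="pass">
  </form>
  <img src="logo.png">
</body></html>
"""

parser = TagCounter()
parser.feed(page)
print(parser.tags["input"], parser.tags["img"], parser.form_actions)
```

Counts of this kind, such as the number of input fields or a form that posts to a different host, are examples of the HTML-structure features a detector could use.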
HTML can embed programs written in a scripting language such as JavaScript, which affects the
behavior and content of web pages. The inclusion of CSS defines the look and layout of content.
The World Wide Web Consortium (W3C), former maintainer of the HTML standard and current maintainer of
the CSS standards, has encouraged the use of CSS over explicit presentational HTML since 1997. A
form of HTML, known as HTML5, supports video and audio natively through the <video> and <audio>
elements, and scripted graphics through the <canvas> element together with JavaScript.
In 1980, physicist Tim Berners-Lee, a contractor at CERN, proposed and prototyped ENQUIRE, a
system for CERN researchers to use and share documents. In 1989, Berners-Lee wrote a memo
proposing an Internet-based hypertext system.[3] Berners-Lee specified HTML and wrote the browser
and server software in late 1990. That year, Berners-Lee and CERN data systems engineer Robert
Cailliau collaborated on a joint request for funding, but the project was not formally adopted by
CERN. In his personal notes of 1990, Berners-Lee listed "some of the many areas in which hypertext
is used"; an encyclopedia is the first entry.
The first publicly available description of HTML was a document called "HTML Tags", first
mentioned on the Internet by Tim Berners-Lee in late 1991. It describes 18 elements comprising the
initial, relatively simple design of HTML. Except for the hyperlink tag, these were strongly influenced
by SGMLguid, an in-house Standard Generalized Markup Language (SGML)-based documentation
format at CERN. Eleven of these elements still exist in HTML 4.
HTML is a markup language that web browsers use to interpret and compose text, images, and
other material into visible or audible web pages. Default characteristics for every item of HTML
markup are defined in the browser, and these characteristics can be altered or enhanced by the web
page designer's additional use of CSS. Many of the text elements are mentioned in the 1988 ISO
technical report TR 9537 Techniques for using SGML, which describes the features of early text
formatting languages such as that used by the RUNOFF command developed in the early 1960s for
the CTSS (Compatible Time-Sharing System) operating system. These formatting commands were
derived from the commands used by typesetters to manually format documents. However, the SGML
concept of generalized markup is based on elements (nested annotated ranges with attributes) rather
than merely print effects, with separate structure and markup. HTML has been progressively moved in
this direction with CSS.
custom tag for embedding in-line images, reflecting the IETF's philosophy of basing standards on
successful prototypes. Similarly, Dave Raggett's competing Internet Draft, "HTML+ (Hypertext
Markup Format)", from late 1993, suggested standardizing already-implemented features like tables
and fill-out forms.
After the HTML and HTML+ drafts expired in early 1994, the IETF created an HTML Working
Group. In 1995, this working group completed "HTML 2.0", the first HTML specification intended to
be treated as a standard against which future implementations should be based.
Further development under the auspices of the IETF was stalled by competing interests. Since
1996, the HTML specifications have been maintained, with input from commercial software vendors,
by the World Wide Web Consortium (W3C). In 2000, HTML became an international standard
(ISO/IEC 15445:2000). HTML 4.01 was published in late 1999, with further errata published through
2001. In 2004, development began on HTML5 in the Web Hypertext Application Technology
Working Group (WHATWG), which became a joint deliverable with the W3C in 2008, and was
completed and standardized on 28 October 2014.
Cascading Style Sheets (CSS) is a style sheet language used for specifying the presentation and
styling of a document written in a markup language such as HTML or XML (including XML dialects
such as SVG, MathML or XHTML). CSS is a cornerstone technology of the World Wide Web,
alongside HTML and JavaScript.
CSS is designed to enable the separation of content and presentation, including layout, colors, and
fonts. This separation can improve content accessibility; provide more flexibility and control in the
specification of presentation characteristics; enable multiple web pages to share formatting by
specifying the relevant CSS in a separate .css file, which reduces complexity and repetition in the
structural content; and enable the .css file to be cached to improve the page load speed between the
pages that share the file and its formatting.
Separation of formatting and content also makes it feasible to present the same markup
page in different styles for different rendering methods, such as on-screen, in print, by voice (via
speech-based browser or screen reader), and on Braille-based tactile devices. CSS also has rules for
alternate formatting if the content is accessed on a mobile device.
Values may be keywords, such as "center" or "inherit", or numerical values, such as 200px (200
pixels), 50vw (50 percent of the viewport width) or 80% (80 percent of the parent element's width).
Color values can be specified with keywords (e.g. "red"), hexadecimal values (e.g. #FF0000, also
abbreviated as #F00), RGB values on a 0 to 255 scale (e.g. rgb(255, 0, 0)), RGBA values that specify
both color and alpha transparency (e.g. rgba(255, 0, 0, 0.8)), or HSL or HSLA values (e.g.
hsl(0, 100%, 50%), hsla(0, 100%, 50%, 80%)).
Non-zero numeric values representing linear measures must include a length unit, which is
either an alphabetic code or abbreviation, as in 200px or 50vw; or a percentage sign, as in 80%. Some
units are absolute, which means that the rendered dimension does not depend upon the structure of the
page: cm (centimetre), in (inch), mm (millimetre), pc (pica), and pt (point). Others are relative, which
means that factors such as the font size of a parent element can affect the rendered measurement:
em (em), ex (ex), and px (pixel). These eight units were a feature of CSS
1 [14] and were retained in all subsequent revisions. The proposed CSS Values and Units Module Level 3
will, if adopted as a W3C Recommendation, provide seven further length units: ch; Q; rem; vh; vmax;
vmin; and vw. The name cascading comes from the specified priority scheme to determine which
declaration applies if more than one declaration of a property match a particular element. This
cascading priority scheme is predictable.
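The hexadecimal color notation mentioned earlier in this section maps directly onto the 0 to 255 RGB scale. The helper below is a minimal sketch of that conversion, including the three-digit shorthand. The function name hex_to_rgb is an assumption and is not part of any CSS library.

```python
def hex_to_rgb(value: str) -> tuple:
    """Convert '#RRGGBB' or the short '#RGB' form to an (r, g, b) tuple."""
    digits = value.lstrip("#")
    if len(digits) == 3:  # '#F00' is shorthand for '#FF0000'
        digits = "".join(ch * 2 for ch in digits)
    if len(digits) != 6:
        raise ValueError(f"not a hex color: {value!r}")
    # Each pair of hex digits is one channel on the 0 to 255 scale.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#FF0000"))
print(hex_to_rgb("#F00"))
```

Both calls describe the same red, which is why #F00 is described as an abbreviation of #FF0000.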
The CSS specifications are maintained by the World Wide Web Consortium (W3C). Internet media
type (MIME type) text/css is registered for use with CSS by RFC 2318 (March 1998). The W3C
operates a free CSS validation service for CSS documents. In addition to HTML, other markup
languages support the use of CSS including XHTML, plain XML, SVG, and XUL. CSS is also used in
the GTK widget toolkit. CSS, or Cascading Style Sheets, offers a flexible way to style web content,
with styles originating from browser defaults, user preferences, or web designers. These styles can be
applied inline, within an HTML document, or through external .css files for broader consistency. Not
only does this simplify web development by promoting reusability and maintainability, it also
improves site performance because styles can be offloaded into dedicated .css files that browsers can
cache. Additionally, even if the styles cannot be loaded or are disabled, this separation maintains the
accessibility and readability of the content, ensuring that the site is usable for all users, including those
with disabilities. Its multi-faceted approach, including considerations for selector specificity, rule
order, and media types, ensures that websites are visually coherent and adaptive across different
devices and user needs, striking a balance between design intent and user accessibility. Multiple style
sheets can be imported. Different styles can be applied depending on the output device being used; for
example, the screen version can be quite different from the printed version, so authors can tailor the
presentation appropriately for each medium. The style sheet with the highest priority controls the
content display. Declarations not set in the highest priority source are passed on to a source of lower
priority, such as the user agent style. The process is called cascading.
One of the goals of CSS is to
allow users greater control over presentation. Someone who finds red italic headings difficult to read
may apply a different style sheet. Depending on the browser and the website, a user may choose from
various style sheets provided by the designers, or may remove all added styles, and view the site using
the browser's default styling, or may override just the red italic heading style without altering other
attributes. Browser extensions like Stylish and Stylus have been created to facilitate the management
of such user style sheets. In the case of large projects, cascading can be used to determine which style
has a higher priority when developers do integrate third-party styles that have conflicting priorities,
and to further resolve those conflicts. Additionally, cascading can help create themed designs, which
help designers fine-tune aspects of a design without compromising the overall layout.
Structured Query Language (SQL), pronounced "S-Q-L" or sometimes "sequel" for historical reasons, is a
domain-specific language used to manage data, especially in a relational database management system
(RDBMS). It is particularly useful in handling structured data, i.e., data incorporating relations among
entities and variables.
Introduced in the 1970s, SQL offered two main advantages over older read–write APIs such as ISAM
or VSAM. Firstly, it introduced the concept of accessing many records with one single command.
Secondly, it eliminates the need to specify how to reach a record, i.e., with or without an index.
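Both advantages can be seen in a short sketch that uses Python's built-in sqlite3 module. The sites table and its columns are hypothetical and are used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of checked sites; the schema is an assumption.
conn.execute("CREATE TABLE sites (url TEXT, domain_age_days INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [
        ("http://a.example/login", 3, "unknown"),
        ("http://b.example/pay", 10, "unknown"),
        ("http://c.example/", 2500, "unknown"),
    ],
)

# One statement updates every matching record; no loop over rows is written.
cur = conn.execute("UPDATE sites SET label = 'suspicious' WHERE domain_age_days < 30")
print(cur.rowcount)

# The query states WHAT is wanted; whether an index is used is left to the engine.
suspicious = conn.execute(
    "SELECT url FROM sites WHERE label = 'suspicious' ORDER BY url"
).fetchall()
print(suspicious)
```

The single UPDATE statement changes two records at once, and neither statement says how the rows are to be located.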
Originally based upon relational algebra and tuple relational calculus, SQL consists of many types of
statements, which may be informally classed as sublanguages, commonly: Data Query Language
(DQL), Data Definition Language (DDL), Data Control Language (DCL), and Data Manipulation
Language (DML). The scope of SQL includes data query, data manipulation (insert, update, and delete),
data definition (schema creation and modification), and data access control. Although SQL is
essentially a declarative language (4GL), it also includes procedural elements.
SQL was one of the first
commercial languages to use Edgar F. Codd's relational model. The model was described in his
influential 1970 paper, "A Relational Model of Data for Large Shared Data Banks". Despite not
entirely adhering to the relational model as described by Codd, SQL became the most widely used
database language. SQL became a standard of the American National Standards Institute (ANSI) in
1986 and of the International Organization for Standardization (ISO) in 1987. Since then, the standard
has been revised multiple times to include a larger set of features and incorporate common extensions.
Despite the existence of standards, virtually no implementations in existence adhere to it fully, and
most SQL code requires at least some changes before being ported to different database systems. SQL
was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce after learning about
the relational model from Edgar F. Codd in the early 1970s. This version, initially called SEQUEL
(Structured English Query Language), was designed to manipulate and retrieve data stored in IBM's
original quasirelational database management system, System R, which a group at IBM San Jose
Research Laboratory had developed during the 1970s. Chamberlin and Boyce's first attempt at a
relational database language was SQUARE (Specifying Queries in A Relational Environment), but it
was difficult to use due to subscript/superscript notation. After moving to the San Jose Research
Laboratory in 1973, they began work on a sequel to SQUARE. The original name SEQUEL, which is
widely regarded as a pun on QUEL, the query language of Ingres, was later changed to SQL (dropping
the vowels) because "SEQUEL" was a trademark of the UK-based Hawker Siddeley Dynamics
Engineering Limited company. The label SQL later became the acronym for Structured Query
Language.
After testing SQL at customer test sites to determine the usefulness and practicality of the
system, IBM began developing commercial products based on their System R prototype, including
System/38, SQL/DS, and IBM Db2, which were commercially available in 1979, 1981, and 1983,
respectively.
In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of
the concepts described by Codd, Chamberlin, and Boyce, and developed their own SQL-based
RDBMS with aspirations of selling it to the U.S. Navy, Central Intelligence Agency, and other U.S.
government agencies. In June 1979, Relational Software introduced one of the first commercially
available implementations of SQL, Oracle V2 (Version 2) for VAX computers.
By 1986, ANSI and ISO
standard groups officially adopted the standard "Database Language SQL" language definition. New
versions of the standard were published in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016 and
most recently, 2023.
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code
readability with the use of significant indentation. Python is dynamically typed and garbage-collected.
It supports multiple programming paradigms, including structured (particularly procedural), object-
oriented and functional programming. It is often described as a "batteries included" language due to its
comprehensive standard library. Guido van Rossum began working on Python in the late 1980s as a
successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0
was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible
with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming languages, and has gained
widespread use in the machine learning community. Python was invented in the late 1980s[40] by
Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to
the ABC programming language, which was inspired by SETL, capable of exception handling and
interfacing with the Amoeba operating system.[10] Its implementation began in December 1989. Van
Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when
he announced his "permanent vacation" from his responsibilities as Python's "benevolent dictator for
life", a title the Python community bestowed upon him to reflect his long-term commitment as the
project's chief decision-maker. In January 2019, active Python core developers elected a five-member
Steering Council to lead the project. Python 2.0 was released on 16 October 2000, with many major
new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and
Unicode support. Python 3.0 was released on 3 December 2008, with many of its major features
backported to Python 2.6.x and 2.7.x. Releases of Python 3 include the 2to3 utility, which automates
the translation of Python 2 code to Python 3.
Python 2.7's end-of-life was initially set for 2015, then
postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported
to Python 3. No further security patches or other improvements will be released for it. Currently only
3.8 and later are supported (2023 security issues were fixed in e.g. 3.7.17, the final 3.7.x release).
While Python 2.7 and older is officially unsupported, a different unofficial Python implementation,
PyPy, continues to support Python 2, i.e. "2.7.18+" (plus 3.9 and 3.10), with the plus meaning (at least
some) "backported security updates".
In 2021 (and again twice in 2022), security updates were
expedited, since all Python versions were insecure (including 2.7) because of security issues leading to
possible remote code execution and web-cache poisoning. In 2022, Python 3.10.4 and 3.9.12 were
expedited, along with 3.8.13, because of many security issues. When Python 3.9.13 was released in May 2022,
it was announced that the 3.9 series (joining the older series 3.8 and 3.7) would only receive security
fixes in the future. On 7 September 2022, four new releases were made due to a potential denial-of-service
attack: 3.10.7, 3.9.14, 3.8.14, and 3.7.14.
As of October 2023, Python 3.12 is the stable release,
and 3.12 and 3.11 are the only versions with active (as opposed to just security) support. Notable
changes in 3.11 from 3.10 include increased program execution speed and improved error
reporting.
Every Python release since at least 3.5 has added some syntax to the language. Python 3.12 adds the
new (soft) keyword type, 3.11 adds syntax for exception handling, and 3.10 adds the match and case
(soft) keywords for structural pattern matching statements, along with a new type union operator.
Python 3.12 also drops outdated modules and functionality, and future versions will do the same.
Python 3.11 claims to be between 10 and 60% faster than Python 3.10, and Python 3.12 adds another
5% on top of that. It also has improved error messages, and many other changes.
Since 27 June 2023, Python 3.8 is the oldest supported
version of Python (albeit in the 'security support' phase), due to Python 3.7 reaching end-of-life.
Python is a multi-paradigm programming language. Object-oriented programming and structured
programming are fully supported, and many of its features support functional programming and
aspect-oriented programming (including metaprogramming and metaobjects). Many other paradigms
are supported via extensions, including design by contract and logic programming. Python uses
dynamic typing and a combination of reference counting and a cycle-detecting garbage collector for
memory management. It uses dynamic name resolution (late binding), which binds method and
variable names during program execution.
Its design offers some support for functional programming
in the Lisp tradition. It has filter, map, and reduce functions; list comprehensions, dictionaries, sets, and
generator expressions. The standard library has two modules (itertools and functools) that implement
functional tools borrowed from Haskell and Standard ML.
Its core philosophy is summarized in the aphorisms of the
Zen of Python (PEP 20). Python features regularly violate these
principles and have received criticism for adding unnecessary language bloat. Responses to these criticisms
are that the Zen of Python is a guideline rather than a rule. The addition of some new features had been
so controversial that Guido van Rossum resigned as Benevolent Dictator for Life following vitriol over
the addition of the assignment expression operator in Python 3.8. Nevertheless, rather than building all
of its functionality into its core, Python was designed to be highly extensible via modules. This
compact modularity has made it particularly popular as a means of adding programmable interfaces to
existing applications. Van Rossum's vision of a small core language with a large standard library and
easily extensible interpreter stemmed from his frustrations with ABC, which espoused the opposite
approach. Python claims to strive for a simpler, less-cluttered syntax and grammar while giving
developers a choice in their coding methodology. In contrast to Perl's "there is more than one way to
do it" motto, Python embraces a "there should be one, and preferably only one, obvious way to do it"
philosophy. In practice, however, Python provides many ways to achieve the same task. There are, for
example, at least three ways to format a string literal, with no certainty as to which one a programmer
should use. Alex Martelli, a Fellow at the Python Software Foundation and Python book author, wrote:
"To describe something as 'clever' is not considered a compliment in the Python culture." Python's
developers usually strive to avoid premature optimization and reject patches to non-critical parts of the
CPython reference implementation that would offer marginal increases in speed at the cost of clarity.
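The functional tools named earlier in this section (filter, map, reduce, list comprehensions, generator expressions, and the itertools and functools modules) can be illustrated briefly. The list of URL lengths is made-up sample data. In Python 3, reduce is imported from functools and is not a built-in.

```python
from functools import reduce
from itertools import islice

url_lengths = [18, 95, 22, 120, 40, 88]  # made-up sample values

long_ones = list(filter(lambda n: n > 50, url_lengths))  # keep long URLs
doubled = list(map(lambda n: n * 2, url_lengths))        # transform each item
total = reduce(lambda acc, n: acc + n, url_lengths, 0)   # fold to one value

# The same filter written as a list comprehension.
long_ones_lc = [n for n in url_lengths if n > 50]

# A generator expression is lazy; islice takes only the first two results.
first_two_squares = list(islice((n * n for n in url_lengths), 2))

print(long_ones, total, first_two_squares)
```

The filter call and the list comprehension give the same result, which is one small instance of Python offering more than one way to do the same task.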
17
Execution speed can be improved by moving speed-critical functions to extension modules written in
languages such as C, or by using a just-in-time compiler like PyPy. It is also possible to cross-compile
to other languages, but this either does not provide the full speed-up that might be expected, since
Python is a very dynamic language, or only a restricted subset of Python is compiled, possibly with
slightly changed semantics.
Python's developers aim for it to be fun to use. This is reflected in its name, a tribute to the British
comedy group Monty Python, and in occasionally playful approaches to tutorials and reference
materials, such as the use of the terms "spam" and "eggs" (a reference to a Monty Python sketch) in
examples, instead of the often-used "foo" and "bar". A common neologism in the Python community is
pythonic, which has a wide range of meanings related to program style. "Pythonic" code may use
Python idioms well, be natural or show fluency in the language, or conform with Python's minimalist
philosophy and emphasis on readability. Code that is difficult to understand or reads like a rough
transcription from another programming language is called unpythonic.
Python uses duck typing and has typed objects but untyped variable names. Type constraints are not
checked at compile time; rather, operations on an object may fail, signifying that it is not of a suitable
type. Despite being dynamically typed, Python is strongly typed, forbidding operations that are not
well-defined (for example, adding a number to a string) rather than silently attempting to make sense
of them. Python allows programmers to define their own types using classes, most often used for
object-oriented programming. New instances of classes are constructed by calling the class (for
example, SpamClass() or EggsClass()), and the classes are instances of the metaclass type (itself an
instance of itself), allowing metaprogramming and reflection. Before version 3.0, Python had two kinds
of classes (both using the same syntax): old-style and new-style. Current Python versions support only
the new-style semantics.
Python supports optional type annotations. These annotations are not enforced by the language, but
may be used by external tools such as mypy to catch errors. Mypy also supports a Python compiler
called mypyc, which leverages type annotations for optimization.
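The typing behaviour described above can be illustrated with a short sketch. The function names add_numbers and try_add are made up for this illustration and are not part of the project code; SpamClass is the placeholder class name used in the text.

```python
from typing import Union


def add_numbers(a: int, b: int) -> int:
    """Annotations document intent; the interpreter does not enforce them."""
    return a + b


def try_add(left: object, right: object) -> Union[object, str]:
    """Attempt left + right and report a TypeError instead of crashing."""
    try:
        return left + right  # type: ignore[operator]
    except TypeError as exc:
        return f"TypeError: {exc}"


class SpamClass:
    """A user-defined type; calling the class constructs an instance."""


if __name__ == "__main__":
    print(add_numbers(2, 3))
    # Runs despite the int annotations, because they are not checked at run time
    print(add_numbers("spam", "eggs"))
    # Strong typing: adding a number to a string is rejected, not guessed at
    print(try_add(1, "eggs"))
    # Classes are instances of the metaclass type, which is an instance of itself
    print(type(SpamClass) is type, type(type) is type)
```

An external checker such as mypy would be expected to flag the add_numbers("spam", "eggs") call, while the interpreter itself runs it without complaint.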
DESCRIPTION OF MODULES
1. Login page
5. URL info page
Login page
A login page is the entry point of a web application: users must pass through its login
URL to reach the authenticated URLs at the heart of the application. Authenticated
URLs are URLs that become accessible to users only after they have logged in
successfully through the login page.
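The idea of authenticated URLs can be sketched without any web framework. This is a minimal illustration, not the project's login code: the session dictionary, the /login path, the demo credentials and the function names are all assumptions made for the example.

```python
from functools import wraps
from typing import Callable, Dict

# Hypothetical in-memory session store; it holds {"user": <name>} after login.
session: Dict[str, str] = {}

LOGIN_URL = "/login"  # assumed path, not taken from the project


def login_required(view: Callable[[], str]) -> Callable[[], str]:
    """Send anonymous visitors to LOGIN_URL instead of running the view."""
    @wraps(view)
    def wrapper() -> str:
        if "user" not in session:
            return f"redirect:{LOGIN_URL}"
        return view()
    return wrapper


def log_in(username: str, password: str) -> bool:
    """Toy credential check; a real application would verify a stored password hash."""
    if username == "demo" and password == "demo-password":
        session["user"] = username
        return True
    return False


@login_required
def url_info_page() -> str:
    """Stands in for an authenticated URL of the application."""
    return "URL info page"
```

Before log_in succeeds, calling url_info_page() yields the redirect marker; afterwards it yields the page content.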
URL info page
A URL is the location of a web page or file that's been added to the internet.
You can see a web page's URL in the address bar of your web browser.
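As a small illustration of what a URL contains, the standard library can split one into its parts. The helper name describe_url and the example address are assumptions made for this sketch; they do not come from the project code.

```python
from urllib.parse import urlparse


def describe_url(url: str) -> dict:
    """Split a URL into the components a URL info page might display."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,    # protocol, e.g. https
        "host": parts.hostname,    # domain name or IP address
        "port": parts.port,        # None when the URL gives no explicit port
        "path": parts.path,        # location of the page on the server
        "query": parts.query,      # parameters after the question mark
    }


info = describe_url("https://www.example.com:8443/account/login?next=/home")
for name, value in info.items():
    print(f"{name}: {value}")
```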
4. SYSTEM TESTING AND IMPLEMENTATION
TEST PLAN
A test plan is a detailed document that describes the test strategy, objectives, schedule,
estimation, deliverables and resources required to perform testing for a software product. The test plan
helps us determine the effort needed to validate the quality of the application under test. It serves as
a blueprint for conducting software testing activities as a defined process, which is closely monitored
and controlled by the test manager.
Analyze the product.
Resource Planning.
Test case specification is a major activity in the testing process. In this project, I have performed three
levels of testing.
Unit testing
Validation Testing
Integration Testing
TEST DELIVERABLES
The following documents are required besides the test plan:
Error report
The test case specification for system testing has to be submitted for review before the system testing
commences.
Unit Testing:
The first test in the development process is the unit test. The source code is normally
divided into modules, which in turn are divided into smaller pieces called units. These units have
specific behavior. The test done on these units of code is called a unit test. Unit testing depends
on the language in which the project is developed.
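A unit test for one small unit might look like the following sketch. It uses Python's built-in unittest module and a binary search helper of the kind listed in the sample coding appendix; the test class, its method names and the sample domains are assumptions made for illustration.

```python
import unittest


def binary_search(arr, x):
    """Return 1 if x occurs in the sorted list arr, otherwise 0."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == x:
            return 1
        if arr[mid] < x:
            low = mid + 1
        else:
            high = mid - 1
    return 0


class BinarySearchTest(unittest.TestCase):
    def setUp(self):
        # The search only works on sorted input, so sort the sample domains first
        self.domains = sorted(["example.com", "example.org", "python.org"])

    def test_present_domain_is_found(self):
        self.assertEqual(binary_search(self.domains, "python.org"), 1)

    def test_absent_domain_is_not_found(self):
        self.assertEqual(binary_search(self.domains, "phish.example"), 0)

    def test_empty_list(self):
        self.assertEqual(binary_search([], "example.com"), 0)
```

Saved in a file, the tests can be run with the command python -m unittest followed by the file name.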
Integration Testing:
Integration testing follows unit testing and precedes system testing. Beta testing takes place after
the product is code complete. Betas are often widely distributed, or even distributed to the public at
large, in the hope that users will buy the final product when it is released.
Validation Testing
The process of evaluating software during the development process or at the end of the
development process to determine whether it satisfies specified business requirements.
Validation Testing ensures that the product actually meets the client's needs.
5. CONCLUSION
Phishing is a form of scam in which an attacker poses as a legitimate entity or person via a
website or other forms of communication. Phishing websites are frequently used by attackers to
distribute malicious links or attachments that can perform a variety of functions. Phishing is a form of
social engineering that involves communication via email, phone or text requesting that a user take
action, such as navigating to a fake website. In both phishing and social engineering attacks, the
collected information is used to gain unauthorized access to protected accounts or data.
6. BIBLIOGRAPHY
BOOK REFERENCES
1. Learning HTML and CSS: A Step-by-Step Guide to Creating Dynamic Websites – by Robin
Nixon
2. HTML, CSS & Myself Web Development – by Luke Welling
WEB REFERENCES
1. https://www.w3schools.com/html.asp
2. https://www.w3schools.com/css.asp
3. https://www.tutorialspoint.com/python_introduction.htm
4. https://en.wikipedia.org/wiki/sqlprogramming
5. https://en.wikipedia.org/wiki/htmlprogramming
6. https://www.tutorialspoint.com/javascript_introduction.htm
APPENDICES
A. DATA FLOW DIAGRAM
A two-dimensional diagram explains how data is processed and transferred in a system. The
graphical depiction identifies each source of data and how it interacts with other data sources to reach a
common output.
Symbol Description
A data flow.
This type of diagram helps business development and design teams visualize how data is processed and
identify or improve certain aspects.
LEVEL 0
DFD Level 0 is also called a Context Diagram. It’s a basic overview of the whole system or
process being analyzed or modeled. It’s designed to be an at-a-glance view, showing the system as a
single high-level process, with its relationship to external entities. It should be easily understood by a
wide audience, including stakeholders, business analysts, and data analysts.
[Figure: Level 0 data flow diagram. Elements: User, Phishing Kit, Verify URL]
LEVEL 1
[Figure: Level 1 data flow diagram. Elements: Log In Page, Verify URL, Show Source Code of URL, URL Info]
B. TABLE STRUCTURE
C. SAMPLE CODING
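# Flask entry point. The imports were missing from the listing and have been added.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from flask import Flask, render_template, request

from controller import Controller  # assumed module name for the Controller class

app = Flask(__name__)
controller = Controller()


@app.route('/', methods=['GET', 'POST'])
def home():
    try:
        url = request.form['url']
        result = controller.main(url)
        output = result
    except Exception:
        # No URL was submitted (for example, a plain GET request)
        output = 'NA'
    # Assumed: the return statement is not shown in the listing; 'index.html' is a placeholder name
    return render_template('index.html', output=output)


@app.route('/preview', methods=['POST'])
def preview():
    try:
        url = request.form.get('url')
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Make relative <link> targets absolute so the preview can load them
        for link in soup.find_all('link'):
            if link.get('href'):
                link['href'] = urljoin(url, link['href'])
        return str(soup)  # assumed: the return statement is not shown in the listing
    except Exception as e:
        return f"Error: {e}"  # assumed, mirroring view_source_code below


@app.route('/source-code', methods=['GET', 'POST'])
def view_source_code():
    try:
        url = request.form.get('url')
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        formatted_html = soup.prettify()
        return formatted_html  # assumed: the return statement is not shown in the listing
    except Exception as e:
        return f"Error: {e}"


if __name__ == '__main__':
    app.run(debug=True)  # debug mode is meant for development only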
from urllib.parse import urlparse, urlencode, quote, unquote

import tldextract

import model


class Controller:
    def __init__(self):
        self.BASE_SCORE = 50  # default trust score of URL out of 100
        self.model = model

    def main(self, url):
        try:
            # Default data
            extracted = tldextract.extract(url)
            domain = extracted.domain + '.' + extracted.suffix
            response = {'status': 'SUCCESS', 'url': url}
            trust_score = self.BASE_SCORE

            # Phishtank check
            phishtank_response = self.model.phishtank_search(url)
            if phishtank_response:
                response['msg'] = "This is a verified phishing link."

            # Website status
            # (url_validation is computed in a part of the method that is not shown in the listing)
            response['response_status'] = url_validation

            # Domain rank
            domain_rank = self.model.get_domain_rank(domain)
            trust_score = self.model.calculate_trust_score(trust_score, 'domain_rank', domain_rank)
            response['rank'] = domain_rank if domain_rank else '10,00,000+'

            # Is URL shortened
            is_url_shortened = self.model.is_url_shortened(url)
            trust_score = self.model.calculate_trust_score(trust_score, 'is_url_shortened', is_url_shortened)
            response['is_url_shortened'] = is_url_shortened

            # HSTS support
            hsts_support = self.model.hsts_support(url)
            trust_score = self.model.calculate_trust_score(trust_score, 'hsts_support', hsts_support)
            response['hsts_support'] = hsts_support

            # IP present
            ip_present = self.model.ip_present(url)
            trust_score = self.model.calculate_trust_score(trust_score, 'ip_present', ip_present)
            response['ip_present'] = ip_present

            # URL redirects
            url_redirects = self.model.url_redirects(url)
            trust_score = self.model.calculate_trust_score(trust_score, 'url_redirects', url_redirects)
            response['url_redirects'] = url_redirects

            # Get IP address
            ip = self.model.get_ip(domain)
            response['ip'] = 'Unavailable' if ip == 0 else ip
        except Exception as e:
            print(f"Error: {e}")
            response = {'status': 'ERROR', 'url': url,
                        'msg': "Some error occurred, please check the URL.", 'emsg': e}
        return response
import ipaddress
import re
from bs4 import BeautifulSoup
import requests
import whois
import urllib
import urllib.request
from datetime import datetime
import json
import csv
import time
import socket
import ssl

BASE_SCORE = 50  # default trust score of URL out of 100

# Relative weight of each property when the trust score is adjusted
PROPERTY_SCORE_WEIGHTAGE = {
    'domain_rank': 0.9,
    'domain_age': 0.3,
    'is_url_shortened': 0.8,
    'hsts_support': 0.1,
    'ip_present': 0.8,
    'url_redirects': 0.2,
    'too_long_url': 0.1,
    'too_deep_url': 0.5,
    'content': 0.1
}
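    # (end of the URL validation function; its beginning is not shown in the listing)
    except requests.exceptions.RequestException:
        return False


def include_protocol(url):
    # Prepend https:// when the URL does not start with http:// or https://
    try:
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'https://' + url
        return url
    except Exception:
        return url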
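def get_domain_rank(domain):
    # Function header inferred from the call in Controller; it is missing from the listing
    with open('static/data/sorted-top1million.txt') as f:
        top1million = f.read().splitlines()
    # Assumed reconstruction: the line that sets is_in_top1million is missing from the listing
    is_in_top1million = binary_search(top1million, domain)
    if is_in_top1million == 1:
        with open('static/data/domain-rank.json', 'r') as f:
            domain_rank_dict = json.load(f)
        rank = domain_rank_dict.get(domain, 0)
        return int(rank)
    else:
        return 0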
# binary search: returns 1 if x is in the sorted list arr, otherwise 0
def binary_search(arr, x):
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == x:
            return 1
        elif arr[mid] < x:
            low = mid + 1
        else:
            high = mid - 1
    return 0
# get whois data of domain
def whois_data(domain):
    try:
        whois_data = whois.whois(domain)
        creation_date = whois_data.creation_date
        data = {}
        # Some registries return several dates; use the first one and format all of them
        if isinstance(creation_date, list):
            creation_date = creation_date[0]
            whois_data['creation_date'] = [d.strftime('%Y-%m-%d %H:%M:%S') for d in whois_data.creation_date]
        # else:
        #     whois_data['creation_date'] = whois_data.creation_date.strftime('%Y-%m-%d %H:%M:%S')
        if isinstance(whois_data.updated_date, list):
            whois_data['updated_date'] = [d.strftime('%Y-%m-%d %H:%M:%S') for d in whois_data.updated_date]
        # else:
        #     whois_data['updated_date'] = whois_data.updated_date.strftime('%Y-%m-%d %H:%M:%S')
        if isinstance(whois_data.expiration_date, list):
            whois_data['expiration_date'] = [d.strftime('%Y-%m-%d %H:%M:%S') for d in whois_data.expiration_date]
        # else:
        #     whois_data['expiration_date'] = whois_data.expiration_date.strftime('%Y-%m-%d %H:%M:%S')
        # Domain age in years
        if creation_date is None:
            age = 'Not Given'
        else:
            age = (datetime.now() - creation_date).days / 365
        # (the code that assembles and returns the result is not shown in the listing)
    except Exception as e:
        print(f"Error: {e}")
        return False
def pascal_case(s):
    # Turn a snake_case key into title-cased words, e.g. 'creation_date' -> 'Creation Date'
    result = s.replace('_', ' ').title()
    return result


    # (excerpt from the URL depth check)
    if slashes > 5:
        return 1
    else:
        return 0


        # (excerpt from the page content check)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # check if onmouseover is enabled
        if soup.find(onmouseover=True):
            result['onmouseover'] = 1
        return result
    except Exception as e:
        # print(f"Error: {e}")
        return 0
def phishtank_search(url):
    # Returns 1 if PhishTank lists the URL as a valid phish, otherwise 0
    try:
        endpoint = "https://checkurl.phishtank.com/checkurl/"
        response = requests.post(endpoint, data={"url": url, "format": "json"})
        data = json.loads(response.content)
        if data['results']['valid'] == True:
            return 1
        return 0
    except Exception as e:
        # print(f"Error: {e}")
        return 0


def get_ip(domain):
    # Resolve the domain to an IPv4 address; returns 0 when resolution fails
    try:
        ip = socket.gethostbyname(domain)
        return ip
    except Exception as e:
        print(f"Error: {e}")
        return 0
def get_certificate_details(domain):
    try:
        context = ssl.create_default_context()
        with socket.create_connection((domain, 443)) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as sslsock:
                cert = sslsock.getpeercert()

                # Certificate Authority (CA) information
                issuer = dict(x[0] for x in cert['issuer'])
                if 'organizationName' in issuer:
                    ca_info = issuer['organizationName']
                else:
                    ca_info = issuer['commonName']

                # (common_name, not_before, not_after, days_to_expiry and revoked are set
                # in parts of this function that are not shown at this point of the listing)

                # Cipher suite
                cipher = sslsock.cipher()
                cipher_suite = cipher[0]

                # SSL/TLS version
                version = sslsock.version()

                return {
                    'Issued By': ca_info,
                    'Issued To': common_name,
                    'Valid From': not_before.strftime('%Y-%m-%d %H:%M:%S %Z'),
                    # 'sans': sans
                    'Valid Till': not_after.strftime('%Y-%m-%d %H:%M:%S %Z'),
                    'Days to Expiry': days_to_expiry,
                    'Version': version,
                    'Is Certificate Revoked': revoked,
                    'Cipher Suite': cipher_suite
                    # 'chain_info': chain_info,
                }
    except Exception as e:
        print(f"Error: {e}")
        return 0
def calculate_trust_score(current_score, case, value):
    # Adjust the trust score for one property (1L = 1 lakh = 100,000)
    score = current_score
    if case == 'domain_rank':
        if value == 0:  # not in top 10L rank
            score = current_score  # - (PROPERTY_SCORE_WEIGHTAGE['domain_rank'] * BASE_SCORE * 0.5)
        elif value < 100000:  # in top 1L rank
            score = current_score + (PROPERTY_SCORE_WEIGHTAGE['domain_rank'] * BASE_SCORE)
        elif value < 500000:  # in 1L - 5L rank
            score = current_score + (PROPERTY_SCORE_WEIGHTAGE['domain_rank'] * BASE_SCORE * 0.8)
        else:  # in 5L - 10L rank
            score = current_score + (PROPERTY_SCORE_WEIGHTAGE['domain_rank'] * BASE_SCORE * 0.6)
        return score
    elif case == 'hsts_support':
        if value == 1:
            score = current_score + (PROPERTY_SCORE_WEIGHTAGE['hsts_support'] * BASE_SCORE)
        else:
            score = current_score - (PROPERTY_SCORE_WEIGHTAGE['hsts_support'] * BASE_SCORE)
        return score
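The arithmetic behind the trust score can be followed with a small worked example. This is a sketch that uses only the base score, three of the weights and the branches shown in the listing; the helper names adjust and total and the two sample inputs are assumptions made for illustration.

```python
BASE_SCORE = 50  # starting trust score, as in the listing

# Subset of the weights from the listing
WEIGHTS = {"domain_rank": 0.9, "is_url_shortened": 0.8, "hsts_support": 0.1}


def adjust(score: float, prop: str, value: int) -> float:
    """Apply one property's adjustment, following the branches in the listing."""
    step = WEIGHTS[prop] * BASE_SCORE
    if prop == "domain_rank":
        if value == 0:            # not ranked: score unchanged
            return score
        if value < 100_000:       # top 1 lakh: full bonus
            return score + step
        if value < 500_000:       # 1 to 5 lakh: 80% of the bonus
            return score + step * 0.8
        return score + step * 0.6  # 5 to 10 lakh: 60% of the bonus
    if prop == "is_url_shortened":
        return score - step if value == 1 else score
    if prop == "hsts_support":
        return score + step if value == 1 else score - step
    raise ValueError(f"unknown property: {prop}")


def total(checks: dict) -> float:
    """Start from BASE_SCORE and apply every check in order."""
    score = float(BASE_SCORE)
    for prop, value in checks.items():
        score = adjust(score, prop, value)
    return score


# 50 + 0.9*50 (rank bonus) + 0.1*50 (HSTS bonus)
well_known = total({"domain_rank": 1200, "is_url_shortened": 0, "hsts_support": 1})
# 50 - 0.8*50 (shortened URL) - 0.1*50 (no HSTS)
suspicious = total({"domain_rank": 0, "is_url_shortened": 1, "hsts_support": 0})
print(well_known, suspicious)
```

With these inputs the first URL reaches the top of the 0 to 100 scale and the second falls to 5, which the result page would show in green and red respectively.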
import os
import json
import csv
import time


class OneTimeScript:
    def __init__(self):
        self.file_path = 'static/data/top-1m.csv'
        self.output_txt_path = 'static/data/sorted-top1million.txt'
        self.output_json_path = 'static/data/domain-rank.json'

    def check_file_existence(self):
        # Check if the required file exists
        if not os.path.exists(self.file_path):
            print("File does not exist.")
            print("Please add file", self.file_path)
            return False
        return True

    def create_sorted_arr_and_dict(self):
        # Create sorted list and dictionary from CSV data
        try:
            if not self.check_file_existence():
                return False
            start = time.time()
            domain_data_array = []
            domain_data_dict = {}
            # (the steps that read the CSV and write the output files are not shown in the listing)
            end = time.time()
        except Exception as e:
            print(f"Error: {e}")
            return False


if __name__ == "__main__":
    script = OneTimeScript()
    script.create_sorted_arr_and_dict()

function showLoadingSpinner() {
    var input = document.querySelector('input[name="url"]');
    var button = document.querySelector('button');
    // Replace the button with a spinner only when a URL has been entered
    if (input.value.trim() !== '') {
        button.style.display = 'none';
        var spinner = document.createElement('div');
        spinner.className = 'spinner';
        button.parentNode.appendChild(spinner);
    }
}

{% extends "base.html" %}
<!DOCTYPE html>
<html>
<body>
{% block content %}
<div class="container">
<div class="short-note">
<p itemprop="description">Protect yourself from <strong>phishing attacks</strong> with the help of <strong>FOSS</strong>.safe with <strong>websit</strong>.</p>
</div>
{% if output != "NA" %}
<div class="result">
{% if output.status == "SUCCESS" %}
<strong>
{% if output.trust_score >= 0 and output.trust_score < 60 %}
<span style="color: red; font-size: 1.25rem">Trust Score : {{output.trust_score}} / 100</span>
{% elif output.trust_score >= 60 and output.trust_score < 70 %}
<span style="color: orange; font-size: 1.25rem">Trust Score : {{output.trust_score}} / 100</span>
{% elif output.trust_score >= 70 and output.trust_score < 90 %}
<span style="color: yellowgreen; font-size: 1.25rem">Trust Score : {{output.trust_score}} / 100</span>
{% else %}
<span style="color: green; font-size: 1.25rem">Trust Score : {{output.trust_score}} / 100</span>
{% endif %}
</strong>
<br>
URL : {{output.url}}
{% if output.msg is defined %}
<br>
{{output.msg}}
{% endif %}
{% if output.response_status != False %}
<br><br>
<button class="preview-button" onclick="document.getElementById('preview').submit()">Preview URL within PHISHING KIT</button>
{% else %}
<br><br>
Can not access this page at the moment. Page may be down or may have blocked viewing with scripts.
{% endif %}
<br><br><br>
<br><br>
<table class="table-view">
<thead>
<tr>
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Global Rank</td>
<td>{{output.rank}}</td>
</tr>
<tr>
<td>HTTP Status Code</td>
<td>{{output.response_status}}</td>
</tr>
<tr>
<td>Domain Age</td>
<td>{{output.age}}</td>
</tr>
<tr>
<td>Use of URL Shortener</td>
<td>{% if output.is_url_shortened == 1%} YES {% else %} NO {% endif %}</td>
</tr>
<tr>
<td>HSTS Support</td>
<td>{% if output.hsts_support == 1%} YES {% else %} NO {% endif %}</td>
</tr>
<tr>
<td>IP instead of Domain</td>
<td>{% if output.ip_present == 1%} YES {% else %} NO {% endif %}</td>
</tr>
<tr>
<td>URL Redirects</td>
<td>{% if output.url_redirects == 0%} NO {% else %} {% for value in
output.url_redirects %} {{ value }} {% endfor %} {% endif %}</td>
</tr>
<tr>
<td>IP of Domain</td>
<td>{{output.ip}}</td>
</tr>
<tr>
<td>Too Long URL</td>
<td>{% if output.too_long_url == 1%} YES {% else %} NO {% endif %}</td>
</tr>
<tr>
<td>Too Deep URL</td>
<td>{% if output.too_deep_url == 1%} YES {% else %} NO {% endif %}</td>
</tr>
</tbody>
</table>
<br><br>
{% if output.ssl != 0 %}
<table class="table-view">
<thead>
<tr>
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
{% for key, value in output.ssl.items() %}
<tr>
<td>{{ key }}</td>
<td>{{ value }}</td>
</tr>
{% endfor %}
</tbody>
</table>
{% endif %}
<br><br>
<strong> WHOIS Data </strong>
<br><br>
<table class="table-view">
<thead>
<tr>
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
{% for key, value in output.whois.items() %}
<tr>
<td>{{ key }}</td>
<td>{{ value }}</td>
</tr>
{% endfor %}
</tbody>
</table>
{% else %} URL : {{output.url}} <br> Message : {{output.msg}} <br> {% endif %}
<br><br>
</div>
{% endif %} {% endblock %}
</body>
</html>
# Certificate validity period
not_before = datetime.strptime(cert['notBefore'], '%b %d %H:%M:%S %Y %Z')
not_after = datetime.strptime(cert['notAfter'], '%b %d %H:%M:%S %Y %Z')
days_to_expiry = (not_after - datetime.now()).days
# calculate_trust_score(current_score, case, value): 'is_url_shortened' branch
    elif case == 'is_url_shortened':
        if value == 1:
            score = current_score - (PROPERTY_SCORE_WEIGHTAGE['is_url_shortened'] * BASE_SCORE)
        return score
# TEST FUNCTION TO ADD NEW URL CHECKS
def test(domain):
    with open('sorted-top1million.txt') as f:
        top1million = f.read().splitlines()
    # res = content_check(url)
    # print(res)
# calculate_trust_score(current_score, case, value): 'too_deep_url' branch
    elif case == 'too_deep_url':
        if value == 1:
            score = current_score - (PROPERTY_SCORE_WEIGHTAGE['too_deep_url'] * BASE_SCORE)
        return score
D. SAMPLE INPUT
E. SAMPLE OUTPUT