
Fake Account Detection using Machine Learning

Abstract
Fake profiles play an important role in advanced persistent threats and are also involved in other malicious activities. The present paper focuses on identifying fake profiles in social media. Approaches to identifying fake profiles in social media can be classified into those that analyse profile data and those that analyse individual accounts. Fake profile creation on social networks is considered to cause more harm than any other form of cyber crime, and this crime has to be detected even before the user is notified about the fake profile creation. Many algorithms and methods have been proposed in the literature for the detection of fake profiles. This paper sheds light on the role of fake identities in advanced persistent threats and covers the mentioned approaches to detecting fake social media profiles. In order to make a relevant prediction of fake or genuine profiles, we assess the impact of three supervised machine learning algorithms: Random Forest (RF), Decision Tree (DT-J48), and Naïve Bayes (NB).

Keywords: User profiling, Fake profile detection, Machine learning


CHAPTER 1

INTRODUCTION

OVERVIEW

Social media is growing incredibly fast these days. This is very important for marketing companies and for celebrities who try to promote themselves by growing their base of followers and fans. Social networks are making our social lives better, but there are a lot of issues which need to be addressed. Issues related to social networking such as privacy, online bullying, misuse and trolling are most of the time carried out through fake profiles on social networking sites [1]. Fake profiles, created seemingly on behalf of organizations or people, can damage their reputations and decrease their numbers of likes and followers. Moreover, fake profile creation is considered to cause more harm than any other form of cyber crime, and this crime has to be detected even before the user is notified about the fake profile creation. It is in this context that this article appears, as part of a series of research conducted by our team on user profiling and profile classification in social networks. Facebook is one of the most famous online social networks. With Facebook, users can create a user profile, add other users as friends, exchange messages, post status updates, and share photos and videos. The Facebook website is becoming more popular day by day, and more and more people are creating user profiles on this site. Fake profiles are profiles which are not genuine, i.e. they are the profiles of persons with false credentials. Our research aims at detecting fake profiles on online social media websites using machine learning algorithms. In order to address this issue, we have chosen to use a Facebook dataset with 2,816 users (instances). The goal of the current research is to detect fake identities among a sample of Facebook users. This paper consists of three sections. In the first section, the machine learning algorithms that we chose to use to address our research issues are presented. Secondly, we present our architecture. In the third section, an evaluation model guided by the machine learning algorithms is advanced in order to identify fake profiles. The conclusion comes in the last section.
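As an illustration of how such a comparison can be set up in Python with scikit-learn, the following minimal sketch trains the three classifiers on the profile dataset; the file name facebook_profiles.csv and the binary label column "fake" are assumptions made for the example, and the DT-J48 algorithm is approximated here by an entropy-based decision tree.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical file and column names; assumes all attributes are numeric.
data = pd.read_csv("facebook_profiles.csv")
X = data.drop(columns=["fake"])   # profile attributes
y = data["fake"]                  # 1 = fake, 0 = genuine

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree (entropy, J48-like)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))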
1.1 SYSTEM SPECIFICATION
1.1.1 HARDWARE CONFIGURATION

Processor : Intel Core i3

RAM Capacity : 4 GB

Hard Disk : 90 GB

Mouse : Logical Optical Mouse

Keyboard : Logitech 107 Keys

Monitor : 15.6 inch

Mother Board : Intel

Speed : 3.3GHZ

1.1.2 SOFTWARE CONFIGURATION

Operating System : Windows 10

Front End : PYTHON

Middle Ware : ANACONDA (JUPYTER NOTEBOOK)

Back End : Python


1.2 ABOUT SOFTWARE

1.2.1 PYTHON

Python is an interpreted, object-oriented, high-level programming language with

dynamic semantics. Its high-level built in data structures, combined with dynamic

typing and dynamic binding, make it very attractive for Rapid Application

Development, as well as for use as a scripting or glue language to connect existing

components together. Python's simple, easy to learn syntax emphasizes readability

and therefore reduces the cost of program maintenance. Python supports modules and

packages, which encourages program modularity and code reuse. The Python

interpreter and the extensive standard library are available in source or binary form

without charge for all major platforms, and can be freely distributed.

Python Features

Python has few keywords, a simple structure, and a clearly defined syntax. Python code is more clearly defined and visible to the eyes, and Python source code is fairly easy to maintain. The bulk of Python's library is very portable and cross-platform compatible on UNIX, Windows, and Macintosh. Python has support for an interactive mode which allows interactive testing and debugging of snippets of code.

Portable

Python can run on a wide variety of hardware platforms and has the same interface on all platforms.

Extendable
It allows low-level modules to be added to the Python interpreter. These modules enable programmers to extend or customize their tools to be more efficient.

Databases

Python provides interfaces to all major commercial databases.

GUI Programming

Python supports GUI applications that can be created and ported to many system

calls, libraries and windows systems, such as Windows MFC, Macintosh, and the X

Window system of Unix.

Scalable

Python provides a better structure and support for large programs than shell scripting.

Object-Oriented Approach

One of the key aspects of Python is its object-oriented approach. This basically

means that Python recognizes the concept of class and object encapsulation thus

allowing programs to be efficient in the long run.

Highly Dynamic

Python is one of the most dynamic languages available in the industry today. There is

no need to specify the type of the variable during coding, thus saving time and

increasing efficiency.

Extensive Array of Libraries


Python comes inbuilt with many libraries that can be imported at any instance and be

used in a specific program.

Open Source and Free

Python is an open-source programming language which means that anyone can create

and contribute to its development. Python is free to download and use in any

operating system, like Windows, Mac or Linux.

1.3 ANACONDA

Anaconda is a free and open-source distribution of the Python and R programming

languages for scientific computing (data science, machine learning applications,

large-scale data processing, predictive analytics, etc.), that aims to simplify package

management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution includes data-science packages suitable for Windows, Linux, and macOS.

Anaconda Navigator is a desktop graphical user interface (GUI) included in

Anaconda distribution that allows users to launch applications and manage conda

packages, environments and channels without using command-line commands.

Navigator can search for packages on Anaconda Cloud or in a local Anaconda

Repository, install them in an environment, run the packages and update them.

It is available for Windows, MacOS and Linux.

1.4 JUPYTER NOTEBOOK


"Jupyter" is a loose acronym meaning Julia, Python, and R. These programming

languages were the first target languages of the Jupyter application. As a server-client

application, the Jupyter Notebook App allows you to edit and run your notebooks via

a web browser. The application can be executed on a PC without Internet access, or it can be installed on a remote server and accessed through the Internet.

A kernel is a program that runs and introspects the user’s code. The Jupyter Notebook

App has a kernel for Python code. "Notebook" or "Notebook documents" denote

documents that contain both code and rich text elements, such as figures, links, and equations. Because they mix code and text elements, these documents are an ideal place to bring together an analysis and its description, and they can be executed to perform the data analysis in real time.

Jupyter Notebook contains two components: the web application and notebook documents.

A web application is a browser-based tool for interactive authoring of documents

which combine explanatory text, mathematics, computations and their rich media

output.

A notebook document is a representation of all content visible in the web application, including the inputs and outputs of the computations, explanatory text, mathematics, images, and rich media representations of objects.

Structure of a notebook document

The notebook consists of a sequence of cells. A cell is a multiline text input field, and its contents can be executed by using Shift-Enter, or by clicking either the "Play" button in the toolbar, or Cell > Run in the menu bar. The execution behavior of a cell is determined by the cell's type. There are three types of cells, namely code cells, markdown cells, and raw cells. Every cell starts off being a code cell, but its type can be changed by using a drop-down on the toolbar.

Code cells

A code cell allows you to edit and write new code, with full syntax highlighting and

tab completion. The programming language you use depends on the kernel, and the

default kernel (IPython) runs Python code.

Markdown cells

Document the computational process in a literate way, alternating descriptive text

with code, using rich text. In IPython this is accomplished by marking up text with

the Markdown language. The corresponding cells are called Markdown cells. The

Markdown language provides a simple way to perform this text mark-up, to specify

which parts of the text should be emphasized (italics), bold, form lists, etc.

Raw cells

Raw cells provide a place in which you can write output directly. Raw cells are not

evaluated by the notebook. When passed through nbconvert, raw cells arrive in the

destination format unmodified.

1.5 MICROSOFT EXCEL

Microsoft Excel is a spreadsheet developed by Microsoft for Windows, MacOS,

Android and iOS. It features calculation, graphing tools, pivot tables and a macro programming language called Visual Basic for Applications (VBA).


FEATURES

Basic Operation

Microsoft Excel has the basic features of all spreadsheets, using a grid of cells

arranged in numbered rows and letter-named columns to organize data manipulations

like arithmetic operations. It has a battery of supplied functions to answer statistical,

engineering and financial needs. In addition, it can display data as line graphs,

histograms and charts, and with a very limited three-dimensional graphical display.

VBA Programming

The Windows version of Excel supports programming through Microsoft's Visual Basic for Applications (VBA), which is a dialect of Visual Basic. Programmers may write code directly using the Visual Basic Editor (VBE), which includes a window for writing code, debugging code, and organizing code modules. The user can implement numerical methods as well as automate tasks such as formatting or data organization in VBA, and guide the calculation using any desired intermediate results reported back to the spreadsheet.

Charts

Excel supports charts, graphs, or histograms generated from specified groups of cells.

The generated graphic component can either be embedded within the current sheet, or

added as a separate object. These displays are dynamically updated if the content of

cells change.

For example, suppose that the important design requirements are displayed visually;
then, in response to a user's change in trial values for parameters, the curves

describing the design change shape, and their points of intersection shift, assisting the

selection of the best design.

Data storage and communication

Number of rows and columns

Versions of Excel up to 7.0 had a limitation in the size of their data sets of 16K (2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K (2^16 = 65,536) rows and 256 columns (2^8, labelled 'IV'). Version 12.0 onwards, including the current Version 16.x, can handle over 1M (2^20 = 1,048,576) rows and 16,384 (2^14, labelled 'XFD') columns.

File formats

Microsoft Excel up until 2007 version used a proprietary binary file format called

Excel Binary File Format (.XLS) as its primary format. Excel 2007 uses Office Open

XML as its primary file format, an XML-based format that followed after a previous

XML-based format called "XML Spreadsheet" ("XMLSS"), first introduced in Excel

2002.

In addition, most versions of Microsoft Excel can read CSV, DBF, SYLK, DIF, and

other legacy formats. Support for some older file formats was removed in Excel

2007. The file formats were mainly from DOS-based programs.

Binary

OpenOffice.org has created documentation of the Excel format. Since then, Microsoft has made the Excel binary format specification available to download freely.

Export and migration of spreadsheets

Programmers have produced APIs to open Excel spreadsheets in a variety of applications and environments other than Microsoft Excel. These include opening Excel documents on the web using either ActiveX controls, or plugins like the Adobe Flash Player.

The Apache POI open source project provides Java libraries for reading and writing Excel spreadsheet files. Excel Package is another open-source project that provides server-side generation of Microsoft Excel 2007 spreadsheets. PHPExcel is a PHP library that converts Excel5, Excel 2003, and Excel 2007 formats into objects for reading and writing within a web application. Excel Services is a current .NET developer tool that can enhance Excel's capabilities. Excel spreadsheets can be accessed from Python with xlrd and openpyxl.
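For example, a worksheet can be read either directly with openpyxl or through pandas; the file name profiles.xlsx below is only a placeholder.

from openpyxl import load_workbook
import pandas as pd

# Hypothetical file name used for illustration.
wb = load_workbook("profiles.xlsx", read_only=True)
sheet = wb.active
for row in sheet.iter_rows(min_row=1, max_row=3, values_only=True):
    print(row)                      # first three rows as tuples

# pandas can read the same .xlsx file (it uses openpyxl under the hood).
df = pd.read_excel("profiles.xlsx")
print(df.head())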

CSV File

A comma-separated values (CSV) file is a delimited text file that uses a comma to

separate values. Each line of the file is a data record. Each record consists of one or

more fields, separated by commas. The use of the comma as a field separator is the

source of the name for this file format. A CSV file typically stores tabular data

(numbers and text) in plain text, in which case each line will have the same number

of fields. These files serve a few different business purposes. They help companies

export a high volume of data to a more concentrated database.

The following rules should be followed to format a CSV file:

● Each record (row of data) is located on a separate line, delimited by a line break.

● The last record in the file may or may not have an ending line break.

● There may be an optional header line appearing as the first line of the file, with the same format as normal record lines. It should contain the same number of fields as the records in the rest of the file, and the header names correspond to the fields in the file.

● In the header and each record, there may be one or more fields, separated by commas. The last field in the record must not be followed by a comma.

● Each field may or may not be enclosed in double quotes. If fields are not enclosed in double quotes, then double quotes may not appear inside the fields.

● Fields containing line breaks (CRLF), double quotes, or commas should be enclosed in double quotes.

● If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.
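Python's built-in csv module applies these quoting and escaping rules automatically, as the short sketch below shows (the column names are illustrative only).

import csv

rows = [
    ["profile_id", "status_text", "friends"],        # header
    [1, 'She said "hello", then left', 284],          # field with quotes and a comma
    [2, "plain text", 91],
]

# The writer adds double quotes and escapes embedded quotes where required.
with open("sample.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)

with open("sample.csv", newline="") as f:
    for record in csv.reader(f):
        print(record)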


CHAPTER 2

SYSTEM STUDY

System study covers the details of the existing system and the proposed system. The existing system is examined in order to develop the proposed system. To elicit the requirements of the system and to identify its elements, inputs, outputs, subsystems and procedures, the existing system had to be examined and analyzed in detail.

This increases the total productivity. The use of paper files is avoided and all the data are efficiently manipulated by the system. It also reduces the space needed to store larger paper files and records.

EXISTING SYSTEM

The existing systems use very few factors to decide whether an account is fake or not, and these factors largely determine the quality of the decision making. When the number of factors is low, the accuracy of the decision making is reduced significantly. Techniques for creating fake accounts have improved considerably, and this improvement is unmatched by the software or applications used to detect fake accounts, so the existing methods have become obsolete. The most common algorithm used by fake account detection applications is the Random Forest algorithm. The algorithm has a few downsides, such as its inefficiency in handling categorical variables that have different numbers of levels. Also, when the number of trees increases, the algorithm's time efficiency takes a hit.
PROPOSED SYSTEM

The existing system uses a random forest algorithm to identify fake accounts. It is efficient when it has the correct inputs and when it has all the inputs; when some of the inputs are missing, it becomes difficult for the algorithm to produce an output. To overcome such difficulties, the proposed system uses a gradient boosting algorithm. The gradient boosting algorithm is similar to the random forest algorithm in that it uses decision trees as its main component. We also changed the way we find fake accounts, i.e., we introduced new methods to identify suspicious accounts. The methods used are spam commenting, engagement rate and artificial activity. These inputs are used to form the decision trees that are used in the gradient boosting algorithm. This algorithm gives us an output even if some inputs are missing, which is the major reason for choosing it. Due to the use of this algorithm we were able to get highly accurate results.
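One way to sketch this behaviour in Python is with scikit-learn's HistGradientBoostingClassifier, a histogram-based gradient boosting implementation that accepts missing (NaN) feature values directly; the tiny example below uses made-up values for the spam commenting, engagement rate and artificial activity inputs.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Hypothetical feature columns: spam commenting rate, engagement rate, artificial activity.
# np.nan marks inputs that are missing for a given account.
X = np.array([
    [0.90, 0.01, 0.80],
    [0.05, 0.30, np.nan],   # artificial-activity score missing
    [np.nan, 0.25, 0.10],
    [0.70, np.nan, 0.95],
])
y = np.array([1, 0, 0, 1])  # 1 = fake, 0 = genuine

# Histogram-based gradient boosting handles NaN values natively,
# so a prediction is produced even when some inputs are missing.
clf = HistGradientBoostingClassifier(max_iter=50).fit(X, y)
print(clf.predict([[0.80, np.nan, np.nan]]))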
CHAPTER 3

SYSTEM DESIGN

The degree of interest in each design concept has varied over the years, yet each has stood the test of time. Each provides the software designer with a foundation from which more sophisticated design methods can be applied. Fundamental design concepts provide the necessary framework for "getting it right". During the design process the software requirements model is transformed into design models that describe the details of the data structures, system architecture, interface, and components. Each design product is reviewed for quality before moving to the next phase of software development.

3.1 INPUT DESIGN

Input design focuses on controlling the amount of data required as input, avoiding delay and keeping the process simple. The input is designed in such a way as to provide security. Input design considers the following steps:

 The dataset should be given as input.

 The dataset should be arranged.

 Methods for preparing input validations should be defined.

3.2 OUTPUT DESIGN

A quality output is one which meets the requirements of the user and presents the information clearly. In output design, it is determined how the information is to be displayed for immediate need. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that the user will find the system easy and effective to use.


3.3 DATABASE DESIGN

This phase contains the attributes of the dataset which are maintained in the database

table. The dataset collection can be of two types namely train dataset and test dataset.

3.4 DATAFLOW DIAGRAM

Data flow diagrams are used to graphically represent the flow of data in a business

information system. DFD describes the processes that are involved in a system to

transfer data from the input to the file storage and reports generation. Data flow

diagrams can be divided into logical and physical. The logical data flow diagram

describes flow of data through a system to perform certain functionality of a business.

The physical data flow diagram describes the implementation of the logical data flow.

A DFD graphically represents the functions, or processes, which capture, manipulate, store, and distribute data between a system and its environment and between the components of a system. The visual representation makes it a good communication tool between the user and the system designer. The objective of a DFD is to show the scope and boundaries of a system. The DFD is also called a data flow graph or bubble chart. It can be manual, automated, or a combination of both. It shows how data enters and leaves the system, what changes the information, and where data is stored.

Design Notation
3.5 SYSTEM DEVELOPMENT

3.5.1 DESCRIPTION OF MODULES

 DATASET COLLECTION

 HYPOTHESIS DEFINITION

 DATA EXPLORATION

 DATA CLEANING

 DATA MODELLING

 FEATURE ENGINEERING

DATASET COLLECTION
A data set is a collection of data. The Facebook profile dataset described in Chapter 1, containing 2,816 user profiles (instances), has been used as the dataset for the proposed work. Each instance describes the attributes of a user profile together with a class label indicating whether the profile is fake or genuine.

HYPOTHESIS DEFINITION

This is a very important step in analysing any problem. The first and foremost step is to understand the problem statement.

The idea is to find out the factors of a profile that have an impact on whether the profile is fake or genuine.

A null hypothesis is a type of hypothesis used in statistics that proposes that no statistical significance exists in a set of given observations.

An alternative hypothesis is one that states there is a statistically significant relationship between two variables.

DATA EXPLORATION

Data exploration is an informative search used by data consumers to form true

analysis from the information gathered.

Data exploration is used to analyse the data and information from the data to

form true analysis.

After having a look at the dataset, certain information about the data was
explored. Here the dataset is not unique while collecting the dataset. In this module,

the uniqueness of the dataset can be created.
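A small pandas sketch of this exploration step is shown below; the file name facebook_profiles.csv is assumed for illustration.

import pandas as pd

df = pd.read_csv("facebook_profiles.csv")   # hypothetical file name

print(df.shape)                 # number of instances and attributes
print(df.dtypes)                # attribute types
print(df.describe())            # summary statistics for numeric attributes
print(df.nunique())             # distinct values per attribute
print(df.duplicated().sum())    # how many duplicate records exist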

DATA CLEANING

The data cleaning module is used to detect and correct inaccurate records in the dataset and to remove duplicated attributes.

Data cleaning is used to correct dirty data, which may contain incomplete or outdated values as well as record fields that were improperly parsed from disparate systems. It plays a significant part in building a model.
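A minimal pandas sketch of such a cleaning step might look as follows (the file names and the median-filling strategy are illustrative assumptions):

import pandas as pd

df = pd.read_csv("facebook_profiles.csv")    # hypothetical file name

df = df.drop_duplicates()                    # remove duplicated records
df = df.dropna(axis=1, how="all")            # drop attributes that are entirely empty
df = df.fillna(df.median(numeric_only=True)) # fill remaining numeric gaps with the median

df.to_csv("facebook_profiles_clean.csv", index=False)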

DATA MODELLING

In the data modelling module, supervised machine learning algorithms are used to predict whether a profile is fake or genuine. Classifiers such as Random Forest, Decision Tree and Naive Bayes (described in Chapter 6) are applied to the labelled profiles.

The user provides the ML algorithm with a dataset that includes the desired inputs and outputs, and the algorithm finds a method to determine how to arrive at those results.

A supervised classification algorithm implements a statistical model of the relationship between the independent variables (the profile attributes) and the dependent variable (the fake/genuine label); when that relationship is captured well, the model shows optimal results.

The train dataset is used to fit the models, and the fitted models are then evaluated on the test dataset. The evaluation results are visualized as graphs.

FEATURE ENGINEERING

In the feature engineering module, the input data is transformed into features that are fed into the machine learning algorithms in order to improve the accuracy of the predictions.

A feature is an attribute or property shared by all the independent instances on which the prediction is to be done. Any attribute can be used as a feature, as long as it is useful to the model.

CHAPTER 4

SYSTEM ANALYSIS

4.1 FEASIBILITY STUDY

A feasibility analysis is used to determine the viability of an idea, such as ensuring a project is legally and technically feasible as well as economically justifiable. A feasibility study lets the developer foresee the project and the usefulness of the system proposal as per its workability. It assesses the impact on the organization, the ability to meet user needs and the effective use of resources. Thus, when a new application is proposed, it normally goes through a feasibility study before it is approved for development. The three key considerations involved in the feasibility analysis are:


 TECHNICAL FEASIBILITY

 OPERATIONAL FEASIBILITY

 ECONOMIC FEASIBILITY

4.1.1 TECHNICAL FEASIBILITY

This phase focuses on the technical resources available to the organization. It helps organizations determine whether the technical resources meet capacity and whether the ideas can be converted into a working system model. Technical feasibility also involves the evaluation of the hardware, software, and other technical requirements of the proposed system.

4.1.2 OPERATIONAL FEASIBILITY

This phase involves undertaking a study to analyse and determine how well the

organization’s needs can be met by completing the project.

Operational feasibility study also examines how a project plan satisfies the

requirements that are needed for the phase of system development.

4.1.3 ECONOMIC FEASIBILITY

This phase typically involves a cost-benefit analysis of the project and helps the organization determine the viability and the cost-benefits associated with the project before financial resources are allocated.

It also serves as an independent project assessment and enhances project credibility. It helps the decision-makers determine the positive economic benefits that the proposed project will provide to the organization.

CHAPTER 5

5.1 SYSTEM TESTING

System testing is the stage of implementation that is aimed at ensuring that the system works accurately and efficiently before live operation commences. Testing is vital to the success of the system. System testing makes the logical assumption that if all the parts of the system are correct, then the goal will be successfully achieved. System testing involves user training, testing of the developed system and successful running of the proposed system. The user tests the developed system and changes are made as per their needs. The testing phase involves testing the developed system using various kinds of data. While testing, errors are noted and corrections are made. The corrections are also noted for future use.

5.2 UNIT TESTING:

Unit testing focuses verification effort on the smallest unit of software design, the software component or module. Using the component-level design description as a guide, important control paths are tested to uncover errors within the boundary of the module. The relative complexity of the tests, and the errors they uncover, is limited by the constrained scope established for unit testing. The unit test focuses on the internal processing logic and data structures within the boundaries of a component. It is normally considered an adjunct to the coding step, and the design of unit tests can be performed before coding begins.

5.3 BLACK BOX TESTING

Black box testing, also called behavioural testing, focuses on the functional requirements of the software. This testing enables the derivation of sets of input conditions that fully exercise all functional requirements of a program. The technique focuses on the information domain of the software, deriving test cases by partitioning the input and output of a program.

5.4 WHITE BOX TESTING

White box testing, also called glass box testing, is a test-case design approach that uses the control structures described as part of component-level design to derive test cases. These test cases are derived to ensure that all statements in the program have been executed at least once during testing and that all logical conditions have been exercised.

5.5 INTEGRATION TESTING

Integration testing is a systematic technique for constructing the software architecture while conducting tests to uncover errors associated with interfacing. Top-down integration testing is an incremental approach to construction of the software architecture: modules are integrated by moving downward through the control hierarchy, beginning with the main control module. Bottom-up integration testing begins construction and testing with atomic modules. Because components are integrated from the bottom up, the processing required for components subordinate to a given level is always available.

5.6 VALIDATION TESTING

Validation testing begins at the culmination of integration testing, when individual components have been exercised and the software has been completely assembled as a package. The testing focuses on user-visible actions and user-recognizable output from the system. Testing is conducted to check that the functional characteristics conform to the specification and to uncover any deviation or error. The alpha test is conducted at the developer's site by end users, while the beta test is conducted at the end users' own sites.
CHAPTER 6

Machine Learning Algorithms Used:

1. Support Vector Machines (SVM)

2. Logistic Regression

3. K Neighbors Classifier

4. Naive Bayes

5. Gradient Boosting Classifier

6. Random Forest Classifier

7. Decision Tree Classifier


SUPPORT VECTOR MACHINES

Support Vector Machine or SVM is one of the most popular Supervised

Learning algorithms, which is used for Classification as well as Regression problems.

However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can

segregate n-dimensional space into classes so that we can easily put the new data

point in the correct category in the future. This best decision boundary is called a

hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:


Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors); it will look at the extreme cases of cats and dogs and, on the basis of the support vectors, classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text

categorization, etc.

Types of SVM

SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.

● Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes

in n-dimensional space, but we need to find out the best decision boundary that helps

to classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane. We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect the

position of the hyperplane are termed as Support Vector. Since these vectors support

the hyperplane, hence called a Support vector.


How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose

we have a dataset that has two tags (green and blue), and the dataset has two features

x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in

either green or blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best

boundary or region is called as a hyperplane. SVM algorithm finds the closest point

of the lines from both the classes. These points are called support vectors. The

distance between the vectors and the hyperplane is called as margin. And the goal of

SVM is to maximize this margin. The hyperplane with maximum margin is called

the optimal hyperplane.


Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for

non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,

we have used two dimensions x and y, so for non-linear data, we will add a third

dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the

below image:
Since we are now in 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
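A brief scikit-learn sketch of both SVM variants, using a built-in toy dataset in place of the profile data:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy example on a built-in dataset; the same pattern applies to profile features.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)            # linear SVM
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)   # non-linear SVM

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))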

2. DECISION TREE CLASSIFIER

“Decision Tree algorithm belongs to the family of supervised learning

algorithms. Unlike other supervised learning algorithms, the decision tree

algorithm can be used for solving regression and classification problems too”.

The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).

In Decision Trees, for predicting a class label for a record we start from the root of

the tree. We compare the values of the root attribute with the record’s attribute. On

the basis of comparison, we follow the branch corresponding to that value and jump

to the next node.

Types of Decision Trees

Types of decision trees are based on the type of target variable we have. It can be of

two types:

1. Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree.

2. Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.

Example:- Let’s say we have a problem to predict whether a customer will pay his

renewal premium with an insurance company (yes/ no). Here we know that the
income of customers is a significant variable but the insurance company does not

have income details for all customers. Now, as we know this is an important variable,

then we can build a decision tree to predict customer income based on occupation,

product, and various other variables. In this case, we are predicting values for the

continuous variables.

Important Terminology related to Decision Trees

1. Root Node: It represents the entire population or sample and this further gets

divided into two or more homogeneous sets.

2. Splitting: It is a process of dividing a node into two or more sub-nodes.

3. Decision Node: When a sub-node splits into further sub-nodes, then it is called

the decision node.

4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

5. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. It can be seen as the opposite of splitting.

6. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.

7. Parent and Child Node: A node, which is divided into sub-nodes is called a

parent node of sub-nodes whereas sub-nodes are the child of a parent node.
1. Entropy

“Entropy is a measure of the randomness in the information being

processed. The higher the entropy, the harder it is to draw any conclusions from

that information. Flipping a coin is an example of an action that provides

information that is random”.

From the above graph, it is quite evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because it reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.

ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy greater than zero needs further splitting.

Mathematically, entropy for one attribute is represented as:

E(S) = Σᵢ −pᵢ log₂(pᵢ)

Where S → current state, and pᵢ → probability of an event i of state S, or the percentage of class i in a node of state S.

Mathematically, entropy for multiple attributes is represented as:

E(T, X) = Σ (c ∈ X) P(c) E(c)

where T → current state and X → selected attribute.

Information Gain

Information gain or IG is a statistical property that measures how well a given

attribute separates the training examples according to their target classification.


Constructing a decision tree is all about finding an attribute that returns the highest

information gain and the smallest entropy.

Information Gain

Information gain is a decrease in entropy. It computes the difference between entropy

before split and average entropy after split of the dataset based on given attribute

values. ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.

Mathematically, IG is represented as:

IG(T, X) = Entropy(T) − Entropy(T, X)

In a much simpler way, we can conclude that:

Information Gain = Entropy(before) − Σ (j = 1 to K) Entropy(j, after)

Where "before" is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.

Gini Index

“You can understand the Gini index as a cost function used to evaluate splits in

the dataset. It is calculated by subtracting the sum of the squared probabilities of each

class from one. It favors larger partitions and easy to implement whereas information

gain favors smaller partitions with distinct values”.


Gini Index = 1 − Σᵢ (pᵢ)²

Gini Index works with the categorical target variable “Success” or “Failure”. It

performs only Binary splits.

Higher the value of Gini index higher the homogeneity.

Steps to Calculate Gini index for a split

1. Calculate Gini for sub-nodes, using the above formula for success(p) and

failure(q) (p²+q²).

2. Calculate the Gini index for split using the weighted Gini score of each node of

that split.

CART (Classification and Regression Tree) uses the Gini index method to create split

points.
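The entropy, information gain and Gini score calculations described above can be expressed in a few lines of Python; the toy fake/genuine split below is purely illustrative.

import numpy as np

def entropy(labels):
    # E(S) = sum over classes of -p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini_score(labels):
    # p^2 + q^2, the per-node score used in the Gini steps above
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p ** 2).sum())

def information_gain(parent, children):
    # Entropy(before) minus the weighted average entropy after the split
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["fake"] * 5 + ["genuine"] * 5
left = ["fake"] * 4 + ["genuine"]
right = ["fake"] + ["genuine"] * 4

print("Parent entropy:", entropy(parent))            # 1.0 for a perfectly mixed node
print("Information gain of split:", information_gain(parent, [left, right]))
print("Gini score of left node:", gini_score(left))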

Gain ratio

“Information gain is biased towards choosing attributes with a large

number of values as root nodes. It means it prefers the attribute with a large

number of distinct values”.

C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information

gain that reduces its bias and is usually the best option. Gain ratio overcomes the

problem with information gain by taking into account the number of branches that

would result before making the split. It corrects information gain by taking the

intrinsic information of a split into account.


Let us consider a dataset that has users and their movie genre preferences based on variables like gender, age group and rating. With the help of information gain, you split at 'Gender' (assuming it has the highest information gain), and now the variables 'Age Group' and 'Rating' could be equally important; with the help of the gain ratio, the variable with more distinct values is penalized, which helps us decide the split at the next level.

Gain Ratio = Information Gain / Split Info, where Split Info = − Σ (j = 1 to K) (|j, after| / |before|) × log₂(|j, after| / |before|)

Where "before" is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.

Reduction in Variance

“Reduction in variance is an algorithm used for continuous target

variables (regression problems). This algorithm uses the standard formula of

variance to choose the best split". The split with lower variance is selected as the criterion to split the population:

Variance = Σ (X − X̄)² / n

Above, X̄ is the mean of the values, X is an actual value and n is the number of values.
Steps to calculate Variance:

1. Calculate variance for each node.

2. Calculate variance for each split as the weighted average of each node

variance.

Chi-Square

“The acronym CHAID stands for Chi-squared Automatic Interaction

Detector. It is one of the oldest tree classification methods. It finds out the

statistical significance between the differences between sub-nodes and parent

node. We measure it by the sum of squares of standardized differences between

observed and expected frequencies of the target variable”.

It works with the categorical target variable “Success” or “Failure”. It can perform

two or more splits. Higher the value of Chi-Square higher the statistical significance

of differences between sub-node and Parent node.

It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Mathematically, Chi-square is represented as:

Chi-square = Σ ((Observed − Expected)² / Expected)
Steps to Calculate Chi-square for a split:


1. Calculate Chi-square for each individual node by calculating the deviation for both Success and Failure.

2. Calculate the Chi-square of the split using the sum of all Chi-square values of Success and Failure for each node of the split.

Decision trees classify the examples by sorting them down the tree from the root to

some leaf/terminal node, with the leaf/terminal node providing the classification of

the example.

Each node in the tree acts as a test case for some attribute, and each edge descending

from the node corresponds to the possible answers to the test case. This process is

recursive in nature and is repeated for every subtree rooted at the new node.
Assumptions while creating Decision Tree

Below are some of the assumptions we make while using Decision tree:

● In the beginning, the whole training set is considered as the root.

● Feature values are preferred to be categorical. If the values are continuous then

they are discretized prior to building the model.

● Records are distributed recursively on the basis of attribute values.

● The order of placing attributes as the root or as internal nodes of the tree is decided by using some statistical approach.

Decision Trees follow Sum of Product (SOP) representation. The Sum of product

(SOP) is also known as Disjunctive Normal Form. For a class, every branch from

the root of the tree to a leaf node having the same class is conjunction (product) of

values, different branches ending in that class form a disjunction (sum).

The primary challenge in the decision tree implementation is to identify which attributes need to be considered as the root node and at each level. Handling this is known as attribute selection. We have different attribute selection measures to identify the attribute which can be considered as the root node at each level.

How do Decision Trees work?

“The decision of making strategic splits heavily affects a tree’s accuracy.

The decision criteria are different for classification and regression trees”.

Decision trees use multiple algorithms to decide to split a node into two or more sub-

nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes.

In other words, we can say that the purity of the node increases with respect to the
target variable. The decision tree splits the nodes on all available variables and then

selects the split which results in most homogeneous sub-nodes.
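A short scikit-learn sketch of a decision tree classifier, where the splitting criterion can be switched between information gain (entropy) and the Gini index; the built-in Iris data stands in for the profile dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" uses information gain; criterion="gini" uses the Gini index.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))          # textual view of the learned splits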

3. LOGISTIC REGRESSION

This article discusses the basics of Logistic Regression and its implementation

in Python. Logistic regression is basically a supervised classification algorithm.

In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

Contrary to popular belief, logistic regression IS a regression model. The model

builds a regression model to predict the probability that a given data entry belongs to

the category numbered as “1”. Just like Linear regression assumes that the data

follows a linear function, Logistic regression models the data using the sigmoid

function.

Logistic regression becomes a classification technique only when a decision

threshold is brought into the picture. The setting of the threshold value is a very

important aspect of Logistic regression and is dependent on the classification problem

itself.
The decision for the value of the threshold value is majorly affected by the values of

precision and recall. Ideally, we want both precision and recall to be 1, but this

seldom is the case. In the case of a precision-recall trade-off, we use the following arguments to decide upon the threshold:

1. Low Precision/High Recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision value which has a low value of Precision or a high value of Recall. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as not affected, without giving much heed to whether the patient is being wrongfully diagnosed with cancer.

2. High Precision/Low Recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision value which has a high value of Precision or a low value of Recall. For example, if we are classifying customers by whether they will react positively or negatively to a personalised advertisement, we want to be absolutely sure that the customer will react positively to the advertisement, because otherwise a negative reaction can cause a loss of potential sales from the customer.

Based on the number of categories, Logistic regression can be classified as:

1. binomial: target variable can have only 2 possible types: “0” or “1” which may

represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.

2. multinomial: target variable can have 3 or more possible types which are not ordered (i.e. the types have no quantitative significance).


3. ordinal: it deals with target variables with ordered categories. For example, a

test score can be categorized as:“very poor”, “poor”, “good”, “very good”.

Here, each category can be given a score like 0, 1, 2, 3.
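The effect of the decision threshold on precision and recall can be sketched with scikit-learn's LogisticRegression, here on a built-in binary dataset used purely for illustration.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]       # sigmoid output: P(class = 1)

for threshold in (0.3, 0.5, 0.7):
    preds = (proba >= threshold).astype(int)    # the decision threshold turns probabilities into classes
    print(threshold,
          "precision:", round(precision_score(y_test, preds), 3),
          "recall:", round(recall_score(y_test, preds), 3))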

4. K-NEAREST NEIGHBOR(KNN)

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

● K-Nearest Neighbour is one of the simplest Machine Learning algorithms

based on Supervised Learning technique.

● The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.

● The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.

● K-NN algorithm can be used for Regression as well as for Classification but

mostly it is used for the Classification problems.

● K-NN is a non-parametric algorithm, which means it does not make any

assumption on underlying data.

● It is also called a lazy learner algorithm because it does not learn from the

training set immediately instead it stores the dataset and at the time of

classification, it performs an action on the dataset.

● KNN algorithm at the training phase just stores the dataset and when it gets

new data, then it classifies that data into a category that is much similar to the
new data.

● Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data that are similar to the cat and dog images, and based on the most similar features it will put the creature in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can

easily identify the category or class of a particular dataset. Consider the below

diagram:
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

● Step-1: Select the number K of the neighbors

● Step-2: Calculate the Euclidean distance of K number of neighbors

● Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

● Step-4: Among these k neighbors, count the number of the data points in each

category.

● Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

● Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

● Firstly, we will choose the number of neighbors, so we will choose the k=5.

● Next, we will calculate the Euclidean distance between the data points. The

Euclidean distance is the distance between two points, which we have already

studied in geometry. It can be calculated as:


● By calculating the Euclidean distance we got the nearest neighbors, as three

nearest neighbors in category A and two nearest neighbors in category B.

Consider the below image:


● As we can see the 3 nearest neighbors are from category A, hence this new data

point must belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN

algorithm:

● There is no particular way to determine the best value for "K", so we need to

try some values to find the best out of them. The most preferred value for K is

5.

● A very low value for K such as K=1 or K=2, can be noisy and lead to the

effects of outliers in the model.

● Large values for K are generally good, but a very large K may include points from other categories and blur the class boundaries.

Advantages of KNN Algorithm:

● It is simple to implement.

● It is robust to the noisy training data

● It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

● Always needs to determine the value of K which may be complex some time.

● The computation cost is high because of calculating the distance between the

data points for all the training samples.
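A minimal scikit-learn sketch of K-NN with K = 5, using a built-in dataset for illustration; the features are standardized because the algorithm depends on Euclidean distances.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters because K-NN relies on Euclidean distances.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5)        # K = 5, the commonly preferred default
knn.fit(scaler.transform(X_train), y_train)

print("Test accuracy:", knn.score(scaler.transform(X_test), y_test))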

5. RANDOM FOREST
Random Forest is a flexible, easy to use machine learning algorithm that produces,

even without hyper-parameter tuning, a great result most of the time. It is also one of

the most used algorithms, because of its simplicity and diversity (it can be used for

both classification and regression tasks). In this post we'll learn how the random

forest algorithm works, how it differs from other algorithms and how to use it.

Ensemble Learning

An Ensemble method is a technique that combines the predictions from multiple

machine learning algorithms together to make more accurate predictions than any

individual model. A model comprised of many models is called an Ensemble model.

Types of Ensemble Learning:

1. Boosting.

2. Bootstrap Aggregation (Bagging)

1. Boosting
Boosting refers to a group of algorithms that utilize weighted averages to make

weak learners into stronger learners. Boosting is all about “teamwork”. Each model

that runs, dictates what features the next model will focus on.

In boosting as the name suggests, one is learning from other which in turn boosts the

learning.

2. Bootstrap Aggregation (Bagging)

Bootstrap refers to random sampling with replacement. Bootstrap allows us

to better understand the bias and the variance with the dataset. Bootstrap involves

random sampling of small subset of data from the dataset.

It is a general procedure that can be used to reduce the variance for those algorithm

that have high variance, typically decision trees. Bagging makes each model run

independently and then aggregates the outputs at the end without preference to

any model.

Problems with Decision Trees

Decision trees are sensitive to the specific data on which they are trained. If the

training data is changed the resulting decision tree can be quite different and in turn

the predictions can be quite different.

Also Decision trees are computationally expensive to train, carry a big risk of

overfitting, and tend to find local optima because they can’t go back after they have

made a split.

To address these weaknesses, we turn to Random Forest, which illustrates the power of combining many decision trees into one model.

Random Forest

Random forest is a Supervised Learning algorithm which uses ensemble learning

method for classification and regression.

Random forest is a bagging technique, not a boosting technique.

It operates by constructing a multitude of decision trees at training time and

outputting the class that is the mode of the classes (classification) or mean

prediction (regression) of the individual trees.

A random forest is a meta-estimator (i.e. it combines the result of multiple

predictions) which aggregates many decision trees, with some helpful


modifications:

1. The number of features that can be split on at each node is limited to some percentage of the total (this percentage is a hyperparameter). This ensures that the ensemble model does not rely too heavily on any individual feature and makes fair use of all potentially predictive features.

2. Each tree draws a random sample from the original data set when generating its

splits, adding a further element of randomness that prevents overfitting.

The above modifications help prevent the trees from being too highly correlated.

For Example, See these nine decision tree classifiers below :


These decision tree classifiers can be aggregated into a random forest ensemble

which combines their input. Think of the horizontal and vertical axes of the above

decision tree outputs as features x1 and x2. At certain values of each feature, the

decision tree outputs a classification of “blue”, “green”, “red”, etc.

These above results are aggregated, through model votes or averaging, into a single

ensemble model that ends up outperforming any individual decision tree’s output.

The aggregated result for the nine decision tree classifiers is shown below :
Features and Advantages of Random Forest:

1. It is one of the most accurate learning algorithms available. For many data sets,

it produces a highly accurate classifier.

2. It runs efficiently on large databases.

3. It can handle thousands of input variables without variable deletion.

4. It gives estimates of which variables are important in the classification.

5. It generates an internal unbiased estimate of the generalization error as the

forest building progresses.

6. It has an effective method for estimating missing data and maintains

accuracy when a large proportion of the data are missing.
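Points 4 and 5 can be inspected directly in scikit-learn through the feature_importances_ and oob_score_ attributes; the sketch below uses a synthetic dataset, so no claim is made about which profile features matter in the project's data.

```python
# Sketch: variable importance and out-of-bag (OOB) generalization estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=3)

forest = RandomForestClassifier(
    n_estimators=300,
    oob_score=True,        # point 5: internal estimate of generalization error
    random_state=3,
)
forest.fit(X, y)

print("OOB score:", round(forest.oob_score_, 3))
# Point 4: per-feature importance scores (they sum to 1.0).
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```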

Disadvantages of Random Forest:
1. Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.
2. For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.

6. GRADIENT BOOSTING MACHINE (GBM)

A Gradient Boosting Machine or GBM combines the predictions from multiple

decision trees to generate the final predictions. Keep in mind that all the weak

learners in a gradient boosting machine are decision trees.


But if we are using the same algorithm, then how is using a hundred decision trees

better than using a single decision tree? How do different decision trees capture

different signals/information from the data?

Here is the trick: the nodes in every decision tree take a different subset of features when selecting the best split. This means that the individual trees are not all the same, and hence they are able to capture different signals from the data.

Additionally, each new tree takes into account the errors or mistakes made by the

previous trees. So, every successive decision tree is built on the errors of the previous

trees. This is how the trees in a gradient boosting machine algorithm are built

sequentially.
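A minimal scikit-learn sketch of a gradient boosting machine built from shallow decision trees fitted sequentially (synthetic data; the hyperparameters are illustrative rather than the values used in this project):

```python
# Gradient boosting sketch: shallow trees fitted sequentially on residual errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

gbm = GradientBoostingClassifier(
    n_estimators=200,   # number of sequential weak learners (decision trees)
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # keeps each individual tree weak
    random_state=5,
)
gbm.fit(X_train, y_train)
print("GBM accuracy:", gbm.score(X_test, y_test))
```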

When we try to predict the target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias. Ensembling helps to reduce the variance and bias components (noise is irreducible error). An ensemble is simply a collection of predictors whose outputs are combined (e.g. by taking the mean of all predictions) to give a final prediction. The reason we use ensembles is that many different predictors trying to predict the same target variable will do a better job than any single predictor alone. Ensembling techniques are further classified into bagging and boosting.

Bagging:
Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using some model averaging technique (e.g. weighted average, majority vote, or simple average). We typically take a random sub-sample/bootstrap of the data for each model, so that all the models are slightly different from each other. Each observation is chosen with replacement, so each model is trained on a different set of observations produced by the bootstrap process. Because this technique combines many uncorrelated learners into a final model, it reduces error by reducing variance. Random Forest is an example of a bagging ensemble; a short bagging sketch follows.
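Besides random forests, bagging is also available as a generic meta-estimator in scikit-learn; the sketch below (synthetic data, illustrative settings) bags plain decision trees trained on bootstrap samples and combines their votes.

```python
# Bagging sketch: independent trees on bootstrap samples, combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# Note: the base learner parameter is named "base_estimator" in older scikit-learn.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner trained on each sample
    n_estimators=100,                    # independent models, trained in parallel
    bootstrap=True,                      # sample observations with replacement
    random_state=11,
)
bagged.fit(X_train, y_train)
print("Bagged trees accuracy:", bagged.score(X_test, y_test))
```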

Boosting:
This technique employs the logic that subsequent predictors learn from the mistakes of the previous predictors. Therefore, the observations have an unequal probability of appearing in subsequent models, and the ones with the highest error appear most often. (So the observations are not chosen by the bootstrap process, but based on their error.) The predictors can be chosen from a range of models such as decision trees, regressors, classifiers, etc. Because new predictors learn from the mistakes committed by previous predictors, it takes less time/fewer iterations to get close to the actual predictions. However, we have to choose the stopping criteria carefully, or boosting could lead to overfitting on the training data. Gradient Boosting is an example of a boosting algorithm.

Fig 1. Bagging (independent models) & Boosting (sequential models).

Fig 2. Ensembling

Gradient Boosting algorithm
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
The objective of any supervised learning algorithm is to define a loss function and minimize it. Let us see how the mathematics works out for the gradient boosting algorithm. Say we use the mean squared error (MSE) as the loss, defined as:
MSE = (1/n) * Σ (y_i − ŷ_i)²
where y_i is the actual value and ŷ_i is the predicted value. We want our predictions to be such that the loss function (MSE) is minimum. By using gradient descent and updating our predictions based on a learning rate, we can find the values where the MSE is minimum. So, we are basically updating the predictions such that the sum of our residuals is close to 0 (or minimum) and the predicted values are sufficiently close to the actual values.
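A small numerical sketch of that update (not taken from the project code): since the gradient of the MSE with respect to a prediction is proportional to the negative residual, a gradient descent step simply nudges each prediction toward its actual value.

```python
# Sketch: gradient descent directly on the predictions to minimize MSE.
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])    # actual values (toy example)
y_pred = np.zeros_like(y)              # start from an arbitrary initial prediction
lr = 0.2                               # learning rate

for step in range(30):
    residuals = y - y_pred             # d(MSE)/d(y_pred) = -(2/n) * residuals
    y_pred = y_pred + lr * residuals   # move predictions against the gradient

print("final predictions:", np.round(y_pred, 3))
print("sum of residuals:", round(float(np.sum(y - y_pred)), 4))
```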

Intuition behind Gradient Boosting
The logic behind gradient boosting is simple and can be understood intuitively, without using mathematical notation. I expect that whoever is reading this section is familiar with simple linear regression modelling.
A basic assumption of linear regression is that the sum of its residuals is 0, i.e. the residuals should be spread randomly around zero. Now think of these residuals as mistakes committed by our predictor model. Although tree-based models (considering decision trees as the base models for our gradient boosting here) are not based on such assumptions, if we think logically (not statistically) about this assumption, we might argue that if we are able to see some pattern in the residuals around 0, we can leverage that pattern to fit a model. So, the intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in the residuals and strengthen a model with weak predictions to make it better. Once we reach a stage where the residuals do not have any pattern that could be modelled, we can stop modelling the residuals (otherwise it might lead to overfitting). Algorithmically, we are minimizing our loss function such that the test loss reaches its minimum.
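A minimal sketch of that intuition (illustrative only, not the project's implementation): each new shallow regression tree is fitted to the current residuals, and its shrunken predictions are added to the running model, so the residuals shrink round by round.

```python
# Sketch: gradient boosting by hand, fitting each new tree to the residuals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=2)

lr = 0.1                      # learning rate (shrinkage)
prediction = np.zeros_like(y)
trees = []

for _ in range(100):
    residuals = y - prediction                      # current mistakes
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += lr * tree.predict(X)              # strengthen the model
    trees.append(tree)

print("mean squared error after boosting:",
      round(float(np.mean((y - prediction) ** 2)), 2))
```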

ADVANTAGES:
• We first model the data with simple models and analyse the data for errors.
• These errors signify data points that are difficult to fit with a simple model.
• Then, for later models, we particularly focus on those hard-to-fit data points to get them right.
• In the end, we combine all the predictors by giving some weight to each predictor.
"The idea is to use the weak learning method several times to get a succession of hypotheses, each one refocused on the examples that the previous ones found difficult and misclassified. … Note, however, it is not obvious at all how this can be done"

CHAPTER 8

SYSTEM MAINTENANCE
The maintenance phase of the software life cycle is the time during which a software product performs useful work. After a system is successfully implemented, it should be maintained in a proper manner. System maintenance is an important aspect of the software development life cycle. System maintenance is needed to adapt the system to changes in its environment; there may be social, technical and other environmental changes which affect an implemented system. Software product enhancements may involve providing new functional capabilities, improving user displays and modes of interaction, or upgrading the performance characteristics of the system.
The maintenance phase identifies whether any changes are required in the current system. If changes are identified, an analysis is made to determine whether the changes are really required; cost-benefit analysis is a way to find out whether a change is essential. System maintenance conforms the system to its original requirements, and its purpose is to preserve the value of the software over time. That value can be enhanced by expanding the customer base, meeting additional requirements, becoming easier to use, becoming more efficient, and employing newer technology.

CHAPTER 9

CONCLUSION

In this project, we have come up with an ingenious way to detect fake accounts on OSNs. By using machine learning algorithms to their full extent, we have eliminated the need for manual identification of fake accounts, which requires considerable human effort and is also a time-consuming process. Existing systems have become obsolete due to advances in the way fake accounts are created, and the factors that the existing systems relied upon are unstable. In this work, we used more stable factors, such as engagement rate and artificial activity, to increase the accuracy of the prediction.

LITERATURE SURVEY

Detecting fake accounts in social media has become a tedious problem for many online social networking sites such as Facebook and Instagram. Generally, fake accounts are found using machine learning, but previously used methods of identifying fake accounts have become inefficient. In , multiple algorithms such as decision tree, logistic regression and support vector machine were used for detection. A major drawback of the decision tree algorithm is that the tree contains data sets for a single feature and not for multiple features. Thus, the models which came after this minimized the number of features, as done in , where comparing the age entered with the registered mail id and the location of the users were used as features. Improvements in the creation of fake accounts made these methods inefficient at detecting them. Thus, service providers changed the way they predict fake accounts by changing their algorithms, as done in , where the METIS clustering algorithm was used. This algorithm takes the data and clusters it into different groups, which made it easier to separate fake accounts from real accounts. In , the Naïve Bayes algorithm is used: the probability for the chosen features is calculated, substituted into the Naïve Bayes formula, and the computed value is checked against a reference value. If the computed value is less than the reference value, then that account is considered to be fake.

REFERENCES
1. "Detection of Fake Twitter Accounts with Machine Learning Algorithms", Ilhan Aydin, Mehmet Sevi, Mehmet Umut Salur.
2. "Detection of Fake Profile in Online Social Networks Using Machine Learning", Naman Singh, Tushar Sharma, Abha Thakral, Tanupriya Choudhury.
3. "Detecting Fake Accounts on Social Media", Sarah Khaled, Neamat El-Tazi, Hoda M. O. Mokhtar.
4. "Twitter Fake Account Detection", Buket Ersahin, Ozlem Aktas, Deniz Kilinc, Ceyhun Akyol.
5. "A New Heuristic of the Decision Tree Induction", Ning Li, Li Zhao, Ai-Xia Chen, Qing-Wu Meng, Guo-Fang Zhang.
6. "Statistical Machine Learning Used in Integrated Anti-Spam System", Peng-Fei Zhang, Yu-Jie Su, Cong Wang.
7. "A Study and Application on Machine Learning of Artificial Intelligence", Ming Xue, Changjun Zhu.
8. "Learning-Based Road Crack Detection Using Gradient Boost Decision Tree", Peng Sheng, Li Chen, Jing Tian.
9. "Verifying the Value and Veracity of Extreme Gradient Boosted Decision Trees on a Variety of Datasets", Aditya Gupta, Kunal Gusain, Bhavya Popli.
10. "Fake Account Identification in Social Networks", Loredana Caruccio.

9.2 SYSTEM FLOW DIAGRAM


A. DATA FLOW DIAGRAM

Figure A.1 Data flow diagram

Figure A.2 Accuracy comparison between DT, NB and RF algorithms
