Fake Profile Detection
Abstract
Fake profiles play an important role in advanced persistent threats and are also
involved in other malicious activities. The present paper focuses on identifying fake
profiles in social media. Approaches to identifying fake profiles in social media can
be classified into those that analyse profile data and those that analyse individual
accounts. Fake profile creation on social networks is considered to cause more harm
than any other form of cyber crime, and this crime has to be detected even before the
user is notified about the fake profile creation. Many algorithms and methods for
detecting fake profiles have been proposed in the literature. This paper sheds light on
the role of fake identities in advanced persistent threats and covers the mentioned
approaches to detecting fake social media profiles. In order to make a relevant
prediction of fake or genuine profiles, we assess the impact of three supervised
machine learning algorithms: Random Forest (RF), Decision Tree (DT-J48), and
Naïve Bayes (NB).
INTRODUCTION
Social media is growing incredibly fast these days, which is very important for
marketing companies and celebrities who try to promote themselves by growing their
base of followers and fans. Social networks make our social lives better, but there are
many issues that need to be addressed. Problems related to social networking such as
privacy violations, online bullying, misuse, and trolling are most often carried out
through fake profiles on social networking sites [1]. Fake profiles, created seemingly
on behalf of organizations or people, can damage their reputations and decrease their
numbers of likes and followers. Moreover, fake profile creation is considered to cause
more harm than any other form of cyber crime, and this crime has to be detected even
before the user is notified about the fake profile creation. It is in this context that this
article is situated; it is part of a series of research conducted by our team on user
profiling and profile classification in social networks. Facebook is one of the most
popular online social networks. With Facebook, users can create a user profile, add
other users as friends, exchange messages, post status updates, and share photos and
videos. The Facebook website is becoming more popular day by day, and more and
more people are creating user profiles on this site. Fake profiles are profiles which are
not genuine, i.e. they are the profiles of persons presenting false credentials. Our
research aims at detecting fake profiles on online social media websites using machine
learning algorithms. To address this issue, we have chosen to use a Facebook dataset
of 2,816 users (instances), and the goal of the current research is to detect fake
identities among this sample of Facebook users. The remainder of the paper is
organized as follows: in the first section, the machine learning algorithms that we
chose to address our research issues are presented; the second section presents our
architecture; in the third section, an evaluation model guided by machine learning
algorithms is advanced in order to identify fake profiles; the conclusion comes in the
last section.
1.1 SYSTEM SPECIFICATION
1.1.1 HARDWARE CONFIGURATION
RAM Capacity : 4 GB
Hard Disk : 90 GB
Processor Speed : 3.3 GHz
1.2.1 PYTHON
Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built-in data structures, combined with dynamic
typing and dynamic binding, make it very attractive for Rapid Application
Development, as well as for use as a scripting or glue language to connect existing
components together. Python's simple, easy-to-learn syntax emphasizes readability
and therefore reduces the cost of program maintenance. Python supports modules and
packages, which encourages program modularity and code reuse. The Python
interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed.
Python Features
Easy to learn and read
Python has few keywords, a simple structure, and a clearly defined syntax. Python
code is clearly defined and visible to the eyes, and Python source code is fairly easy
to maintain.
Interactive mode
Python supports an interactive mode which allows interactive testing and debugging
of snippets of code.
Portable
Python can run on a wide variety of hardware platforms and has the same interface on
all platforms.
Extendable
It allows adding low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
Databases
Python provides interfaces to all major commercial databases.
GUI Programming
Python supports GUI applications that can be created and ported to many system
calls, libraries and windowing systems, such as Windows MFC, Macintosh, and the X
Window system of Unix.
Scalable
Python provides a better structure and support for large programs than shell scripting.
Object-Oriented Approach
One of the key aspects of Python is its object-oriented approach. This basically
means that Python recognizes the concepts of classes and objects and encapsulates
code within them, which allows programs to be built from reusable components.
Highly Dynamic
Python is one of the most dynamic languages available in the industry today. There is
no need to specify the type of the variable during coding, thus saving time and
increasing efficiency.
Python is an open-source programming language, which means that anyone can
contribute to its development, and it is free to download and use in any project.
1.3 ANACONDA
Anaconda is a free and open-source distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications,
large-scale data processing, predictive analytics, etc.) that aims to simplify package
management and deployment. Anaconda Navigator is a desktop graphical user
interface included in the Anaconda distribution that allows users to launch
applications and manage conda packages and environments without using
command-line commands; packages can be searched for in the Anaconda
Repository, installed in an environment, run, and updated.
Jupyter Notebook takes its name from the core supported programming languages
Julia, Python and R, which were the first target languages of the Jupyter application.
As a server-client application, the Jupyter Notebook App allows you to edit and run
your notebooks via a web browser. The App can also be installed on a remote server
and accessed through the Internet.
A kernel is a program that runs and introspects the user’s code. The Jupyter Notebook
App has a kernel for Python code. "Notebook" or "notebook documents" denote
documents that contain both code and rich text elements, such as figures, links and
equations. Because of this mix of code and text elements, these documents are the
ideal place to bring together an analysis and its description, and they can be executed
to perform the data analysis in real time.
The Jupyter Notebook therefore combines two components: a web application, a
browser-based tool for the interactive authoring of documents which combine
explanatory text, mathematics, computations and their rich media output; and the
notebook documents themselves.
The notebook consists of a sequence of cells. A cell is a multiline text input field, and
its contents can be executed by using Shift-Enter, by clicking the "Play" button in the
toolbar, or via Cell, Run in the menu bar. The execution behavior of a cell is
determined by the cell’s type. There are three types of cells, namely code cells,
markdown cells, and raw cells. Every cell starts off as a code cell, but its type can be
changed by using a drop-down on the toolbar.
Code cells
A code cell allows you to edit and write new code, with full syntax highlighting and
tab completion. The programming language you use depends on the kernel; the
default kernel (IPython) runs Python code.
Markdown cells
You can document the computational process in a literate way, alternating descriptive
text with code, using rich text. In IPython this is accomplished by marking up text with
the Markdown language. The corresponding cells are called Markdown cells. The
Markdown language provides a simple way to perform this text mark-up, that is, to
specify which parts of the text should be emphasized (italics) or bold, to form lists, etc.
Raw cells
Raw cells provide a place in which you can write output directly. Raw cells are not
evaluated by the notebook; when passed through nbconvert, raw cells arrive in the
destination format unmodified.
1.4 MICROSOFT EXCEL
Microsoft Excel is a spreadsheet application developed by Microsoft for Windows,
macOS, Android and iOS. It features calculation, graphing tools, pivot tables and a
macro programming language called Visual Basic for Applications (VBA).
Basic Operation
Microsoft Excel has the basic features of all spreadsheets, using a grid of cells
arranged in numbered rows and letter-named columns to organize data manipulations
such as arithmetic operations. It has a battery of supplied functions to answer statistical,
engineering and financial needs. In addition, it can display data as line graphs,
histograms and charts, and with a very limited three-dimensional graphical display.
Excel can be programmed using Microsoft's Visual Basic for Applications (VBA),
which is a dialect of Visual Basic. Programmers may write code directly using the
Visual Basic Editor (VBE), which includes a window for writing code, debugging
code, and organizing code modules. The user can implement numerical methods as
well as automate tasks such as formatting or data organization in VBA and guide the
calculation using any desired intermediate results reported back to the spreadsheet.
Charts
Excel supports charts, graphs, or histograms generated from specified groups of cells.
The generated graphic component can either be embedded within the current sheet, or
added as a separate object. These displays are dynamically updated if the content of
cells change.
For example, suppose that the important design requirements are displayed visually;
then, in response to a user's change in trial values for parameters, the curves
describing the design change shape and their points of intersection shift, assisting the
selection of the best design.
Versions of Excel up to 7.0 had a limitation in the size of their data sets of 16K
(2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K (2^16 = 65,536)
rows and 256 columns (2^8, the last labelled 'IV'). Version 12.0 onwards, including
the current Version 16.x, can handle over 1M (2^20 = 1,048,576) rows and 16,384
(2^14, the last labelled 'XFD') columns.
File formats
Microsoft Excel up until the 2007 version used a proprietary binary file format called
Excel Binary File Format (.XLS) as its primary format. Excel 2007 uses Office Open
XML as its primary file format, an XML-based format that followed a previous
XML-based format called "XML Spreadsheet" ("XMLSS"), first introduced in Excel
2002.
In addition, most versions of Microsoft Excel can read CSV, DBF, SYLK, DIF, and
other legacy formats. Support for some older file formats was removed in Excel 2007.
Binary
OpenOffice.org has created documentation of the Excel binary format, and Microsoft
has since made the Excel binary format specification available to freely download.
Export and migration of spreadsheets
Programmers have produced APIs to open Excel spreadsheets in a variety of
applications and environments other than Microsoft Excel. These include opening
Excel documents on the web using either ActiveX controls or plugins. The Apache
POI open source project provides Java libraries for reading and writing Excel
spreadsheet files. ExcelPackage is another open-source project that provides server-
side generation of Microsoft Excel 2007 spreadsheets. PHPExcel is a PHP library that
converts Excel5, Excel 2003, and Excel 2007 formats into objects for reading and
writing within a web application. Excel Services is a current .NET developer tool that
can enhance Excel's capabilities. Excel spreadsheets can be accessed from Python
with xlrd and openpyxl.
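As a brief illustration (not part of the original project code), the following sketch
shows one common way to read an Excel sheet from Python, assuming the
third-party pandas and openpyxl packages are installed; the file name and sheet name
are hypothetical placeholders.

# Minimal sketch: reading an Excel sheet into Python with pandas + openpyxl.
# "profiles.xlsx" and "Sheet1" are hypothetical names used for illustration only.
import pandas as pd

df = pd.read_excel("profiles.xlsx", sheet_name="Sheet1", engine="openpyxl")
print(df.head())               # first five rows of the sheet
print(df.columns.tolist())     # column (attribute) names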
CSV File
A comma-separated values (CSV) file is a delimited text file that uses a comma to
separate values. Each line of the file is a data record. Each record consists of one or
more fields, separated by commas. The use of the comma as a field separator is the
source of the name for this file format. A CSV file typically stores tabular data
(numbers and text) in plain text, in which case each line will have the same number
of fields. These files serve a few different business purposes; for example, they let
companies export large volumes of tabular data in a simple format that almost any
program can read. The basic rules of the format are:
Each record is located on a separate line, delimited by a line break.
The last record in the file may or may not have an ending line break.
There may be an optional header line appearing as the first line of the file, with the
same format as the record lines. It should contain the same number of fields as the
records in the rest of the file.
In the header and each record, there may be one or more fields, separated by
commas.
If fields are not enclosed with double quotes, then double quotes may not appear
inside the fields.
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed
in double quotes.
If double quotes are used to enclose fields, then a double quote appearing inside a
field must be escaped by preceding it with another double quote.
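To make the rules above concrete, here is a small hedged example using Python's
standard csv module; the file name fake_profiles.csv and its field names are
hypothetical and only serve to illustrate the format.

# Minimal sketch: writing and reading a CSV file with Python's standard csv module.
# File name and field names are hypothetical examples.
import csv

rows = [
    {"profile_id": "1", "friends_count": "250", "is_fake": "0"},
    {"profile_id": "2", "friends_count": "12", "is_fake": "1"},
]

with open("fake_profiles.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["profile_id", "friends_count", "is_fake"])
    writer.writeheader()       # optional header line (first line of the file)
    writer.writerows(rows)     # one record per line, fields separated by commas

with open("fake_profiles.csv", newline="") as f:
    for record in csv.DictReader(f):   # quoting and escaping are handled automatically
        print(record)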
SYSTEM STUDY
The system study contains the existing and proposed system details; the existing
system is studied in order to develop the proposed system. To elicit the requirements
of the system and to identify its elements, inputs, outputs, subsystems and procedures,
the existing system had to be analysed. Computerizing the process increases total
productivity: the use of paper files is avoided, all the data are efficiently manipulated
by the system, and the space needed to store data is reduced.
EXISTING SYSTEM
The existing systems use very few factors to decide whether an account is fake or
not, and those factors largely affect the way decision making occurs. When the
number of factors is low, the accuracy of the decision making is reduced significantly.
Meanwhile, techniques for creating fake accounts have improved considerably, and
the software and applications used to detect fake accounts have not kept up; because
of this advancement in fake account creation, existing methods have become obsolete.
The most common algorithm used by fake account detection applications is the
random forest algorithm. This algorithm has a few downsides, such as inefficiency in
handling categorical variables which have different numbers of levels. Also, when the
number of trees increases, the algorithm's time efficiency takes a hit.
PROPOSED SYSTEM
The existing system uses a random forest algorithm to identify fake accounts. It is
efficient when it has the correct inputs and when it has all of the inputs; when some of
the inputs are missing, it becomes difficult for the algorithm to produce an output. To
overcome this drawback, we use the gradient boosting algorithm. Gradient boosting is
like the random forest algorithm in that it uses decision trees as its main component.
We also changed the way we find fake accounts, i.e., we introduced new methods to
characterize an account: spam commenting, engagement rate and artificial activity.
These inputs are used to form the decision trees that are used in the gradient boosting
algorithm. This algorithm gives us an output even if some inputs are missing, which is
the major reason for choosing it. Due to the use of this algorithm we were able to get
highly accurate results.
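As an illustrative sketch only (the report does not list the project's actual code or
dataset), a gradient boosting classifier of the kind described above could be trained
with scikit-learn as follows. The file name accounts.csv and the feature columns
spam_commenting, engagement_rate and artificial_activity are hypothetical
placeholders; HistGradientBoostingClassifier is used here because it tolerates missing
feature values (NaN), matching the behaviour described in this section.

# Hedged sketch of the proposed approach: gradient boosting over decision trees.
# Column names and the CSV file are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = pd.read_csv("accounts.csv")                       # hypothetical dataset
X = data[["spam_commenting", "engagement_rate", "artificial_activity"]]
y = data["is_fake"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = HistGradientBoostingClassifier()                 # boosted trees; handles NaN inputs
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))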
CHAPTER 3
SYSTEM DESIGN
Although the degree of interest in each design concept has varied over the years, each
has stood the test of time. Each provides the software designer with a foundation from
which more sophisticated design methods can be applied, and each helps provide the
necessary framework for "getting it right". During the design process, the software
requirements model is transformed into design models that describe the details of the
data structures, system architecture, interface, and components. Each design product
is reviewed for quality before moving to the next phase of software development.
The design of input focuses on controlling the amount of data required as input,
avoiding delay and keeping the process simple. The input is designed in such a way
that errors are avoided and the process remains easy for the user.
A quality output is one which meets the requirements of the user and presents the
information clearly. The output must be developed in an organized, well thought out
manner, while ensuring that each output element is designed so that the user will find
the system easy and effective to use.
This phase contains the attributes of the dataset, which are maintained in the database
table. The dataset collection can be of two types, namely the train dataset and the test
dataset.
Data flow diagrams (DFDs) are used to graphically represent the flow of data in a
business information system. A DFD describes the processes that are involved in a
system to transfer data from the input to file storage and report generation. Data flow
diagrams can be divided into logical and physical: the logical data flow diagram
describes the flow of data through a system to perform certain functions of the
business, while the physical data flow diagram describes the implementation of the
logical data flow. A DFD shows the processes that capture, manipulate, store, and
distribute data between a system and its environment and between components of the
system, which makes it a good communication tool between the user and the system
designer. The objective of a DFD is to show the scope and boundaries of a system.
The DFD is also called a data flow graph or bubble chart; it shows how information
enters and leaves the system, what changes the information, and where data is stored.
Design Notation
3.5 SYSTEM DEVELOPMENT
DATASET COLLECTION
HYPOTHESIS DEFINITION
DATA EXPLORATION
DATA CLEANING
DATA MODELLING
FEATURE ENGINEERING
DATASET COLLECTION
A data set is a collection of data. Departmental store data has been used as the dataset
here. The sales data has Item Identifier, Item Fat, Item Visibility, Item Type, Outlet
Type, Item MRP, Outlet Identifier, Item Weight, Outlet Size and Outlet Establishment
Year as its attributes.
HYPOTHESIS DEFINITION
This is a very important step in analysing any problem. The first and foremost task is
to frame a hypothesis: the idea is to find out the factors of a product that create an
impact on its sales.
DATA EXPLORATION
Data exploration is used to analyse the data and to extract information from it before
building a model. After having a look at the dataset, certain information about the data
was explored; since the collected records are not all unique, duplicate and inconsistent
entries are identified in this module.
DATA CLEANING
The data cleaning module is used to detect and correct inaccurate records in the
dataset. Data cleaning corrects dirty data, which contains incomplete or outdated
values and improperly parsed record fields from disparate systems, so that the
downstream modules receive consistent input.
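A hedged sketch of typical cleaning steps with pandas is shown below; the file name
raw_dataset.csv and the column name age are placeholders, not the project's actual
dataset.

# Minimal data-cleaning sketch with pandas (illustrative only; names are hypothetical).
import pandas as pd

df = pd.read_csv("raw_dataset.csv")                         # hypothetical input file
df = df.drop_duplicates()                                   # remove repeated records
df = df.dropna(how="all")                                   # drop rows that are entirely empty
df["age"] = pd.to_numeric(df["age"], errors="coerce")       # fix an improperly parsed field
df["age"] = df["age"].fillna(df["age"].median())            # fill remaining missing values
df.to_csv("clean_dataset.csv", index=False)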
DATA MODELLING
In the data modelling module, machine learning algorithms are applied to predict the
wave direction; linear regression and the K-means algorithm were used to model the
data. The user provides the ML algorithm with a dataset that includes the desired
inputs and outputs, and the algorithm finds a method to determine how to arrive at
those results. Linear regression fits a statistical model when relationships between the
independent variables and the dependent variable are approximately linear; this
algorithm is used to show the direction of the waves and to predict their height. The
train dataset is taken and its records are clustered using the K-means algorithm.
FEATURE ENGINEERING
In the feature engineering module, the imported data are transformed into the features
on which the prediction is to be done. Any attribute can be a feature if it is useful to
the model.
CHAPTER 4
SYSTEM ANALYSIS
A feasibility study lets the developer foresee the project and the usefulness of the
proposed system as per its workability, its impact on the organization, its ability to
meet user needs and its effective use of resources. Thus, when a new application is
proposed, it normally goes through a feasibility study before it is approved for
development. The study covers three aspects:
TECHNICAL FEASIBILITY
OPERATIONAL FEASIBILITY
ECONOMIC FEASIBILITY
Technical feasibility focuses on the technical resources available to the organization.
It helps organizations determine whether the technical resources meet capacity and
whether the ideas can be converted into a working system model. Technical
feasibility also involves the evaluation of the hardware, software, and other technical
requirements of the proposed system.
Operational feasibility involves undertaking a study to analyse and determine how
well the proposed system solves the identified problems. The operational feasibility
study also examines how well the project plan satisfies the requirements identified
during requirements analysis.
Economic feasibility typically involves a cost benefit analysis of the project and helps
the organization determine whether the expected benefits justify the costs.
CHAPTER 5
SYSTEM TESTING
System testing is the stage of implementation that is aimed at ensuring that the
system works accurately and efficiently before live operation commences. Testing is
vital to the success of the system. System testing makes logical assumption that if all
the parts of the system are correct, then the goal will be successfully achieved.
System testing involves user training, system testing, and the successful running of the
developed proposed system. The user tests the developed system and changes are
made per their needs. The testing phase involves the testing of developed system
using various kinds of data. While testing, errors are noted and the corrections are
made. The corrections are also noted for the future use.
Unit testing focuses verification effort on the smallest unit of software design, the
software component or module. Using the component-level design description as a
guide, important control paths are tested to uncover errors within the boundary of the
module. The relative complexity of the tests, and of the errors those tests uncover, is
limited by the constrained scope established for unit testing. The unit test focuses on
the internal processing logic and data structures within the boundaries of a component.
It is normally considered an adjunct to the coding step, and the design of unit tests can
be performed before coding begins or after source code has been generated.
Black box testing, also called behavioural testing, focuses on the functional
requirements of the software. This testing enables the derivation of sets of input
conditions that fully exercise all functional requirements for a program. The technique
focuses on the information domain of the software, deriving test cases by partitioning
the input and output of a program.
White box testing, also called glass box testing, is a test case design method that uses the
control structures described as part of component level design to derive test cases.
This test case is derived to ensure all statements in the program have been executed at
least once during the testing and that all logical conditions have been exercised.
Top-down integration testing is an incremental approach in which modules are
integrated by moving downward through the control hierarchy, beginning with the
main control module. Bottom-up integration testing begins construction and testing
with atomic modules; because components are integrated from the bottom up, the
processing required for components subordinate to a given level is always available
and the need for stubs is eliminated.
Validation testing focuses on user-visible actions and user-recognizable output from
the system, and has been conducted under the possible conditions in which each
function is used. The alpha test is conducted at the developer's site by end-users,
whereas the beta test is conducted at end-user sites.
CHAPTER 6
The classification algorithms considered in this chapter are:
1. Support Vector Machine (SVM)
2. Logistic Regression
3. K Neighbors Classifier
4. Naive Bayes
5. Random Forest
1. SUPPORT VECTOR MACHINE (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called as support vectors, and hence algorithm is termed as Support
Vector Machine. Consider the below diagram, in which there are two different
categories that are classified using a decision boundary or hyperplane.
Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we
want a model that can accurately identify whether it is a cat or dog, so such a model
can be created by using the SVM algorithm. We will first train our model with lots of
images of cats and dogs so that it can learn about different features of cats and dogs,
and then we test it with this strange creature. So as support vector creates a decision
boundary between these two data (cat and dog) and choose extreme cases (support
vectors), it will see the extreme cases of cat and dog. On the basis of the support
vectors, it will classify the creature as a cat or a dog.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
● Linear SVM: Linear SVM is used for linearly separable data, which means if
a dataset can be classified into two classes by using a single straight line, then
such data is termed as linearly separable data, and the classifier used is called the
Linear SVM classifier.
● Non-linear SVM: Non-linear SVM is used for non-linearly separated data,
which means if a dataset cannot be classified by using a straight line, then such
data is termed as non-linear data, and the classifier used is called the Non-linear
SVM classifier.
Hyperplane: There can be multiple lines or decision boundaries to segregate the classes
in n-dimensional space, but we need to find out the best decision boundary that helps
to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset,
which means that if there are 2 features, the hyperplane will be a straight line, and if
there are 3 features, the hyperplane will be a 2-dimensional plane. We always create
the hyperplane that has the maximum margin, i.e. the maximum distance between the
data points of the two classes.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support
the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features
x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as
either green or blue. As this is a 2-d space, by just using a straight line we can easily
separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point
of the lines from both the classes. These points are called support vectors. The
distance between the vectors and the hyperplane is called as margin. And the goal of
SVM is to maximize this margin. The hyperplane with the maximum margin is called
the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z, calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it
back into 2-d space by taking z = 1, the boundary becomes a circumference of radius 1
around the inner class.
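The kernel idea described above can be sketched with scikit-learn's SVC; the data
below is synthetic (generated only for illustration) and is not the project dataset.

# Illustrative sketch: linear vs. RBF-kernel SVM on synthetic non-linear 2-D data.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # implicit higher-dimensional mapping

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("rbf kernel accuracy:   ", rbf_svm.score(X_test, y_test))
print("support vectors per class:", rbf_svm.n_support_)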
2. DECISION TREE
"Decision tree is a supervised learning technique that is mostly preferred for solving
classification problems, but the algorithm can be used for solving regression and
classification problems too".
The goal of using a decision tree is to create a training model that can be used to predict
the class or value of the target variable by learning simple decision rules inferred from
prior data (training data).
In Decision Trees, for predicting a class label for a record we start from the root of
the tree. We compare the values of the root attribute with the record’s attribute. On
the basis of comparison, we follow the branch corresponding to that value and jump to
the next node.
Types of decision trees are based on the type of target variable we have. They can be
of two types: a categorical variable decision tree, which has a categorical target
variable, and a continuous variable decision tree, which has a continuous target
variable.
Example:- Let’s say we have a problem to predict whether a customer will pay his
renewal premium with an insurance company (yes/ no). Here we know that the
income of customers is a significant variable but the insurance company does not
have income details for all customers. Now, as we know this is an important variable,
then we can build a decision tree to predict customer income based on occupation,
product, and various other variables. In this case, we are predicting values for the
continuous variables.
1. Root Node: It represents the entire population or sample, and this further gets
divided into two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a
decision node.
4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: Removing sub-nodes of a decision node is called pruning; it is the
opposite of splitting.
6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the
parent node of those sub-nodes, whereas the sub-nodes are the children of the
parent node.
1. Entropy
Entropy is a measure of the randomness in the information being processed. The
higher the entropy, the harder it is to draw any conclusions from that information.
Mathematically, the entropy of a dataset X is H(X) = − Σ p(x) · log2 p(x), where p(x)
is the proportion of examples belonging to class x. The entropy H(X) is zero when the
probability of a class is either 0 or 1, and it is maximum when the probability is 0.5,
because that case projects perfect randomness in the data and there is no chance of
perfectly determining the outcome.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch
with entropy more than zero needs further splitting.
Information Gain
Information gain is a decrease in entropy: it computes the difference between the
entropy before a split and the (weighted) average entropy after the split of the dataset
based on the given attribute values. The ID3 (Iterative Dichotomiser 3) decision tree
algorithm uses information gain:
Information Gain = Entropy(before) − Σ (j = 1, …, K) w_j · Entropy(j, after)
where w_j = |j, after| / |before| is the fraction of records falling into subset j, "before"
is the dataset before the split, K is the number of subsets generated by the split, and
(j, after) is subset j after the split.
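The entropy and information-gain formulas above can be computed directly; the short
snippet below (illustrative only, not project code) does this for a toy binary split.

# Illustrative computation of entropy and information gain for a toy binary split.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

parent = ["fake"] * 5 + ["real"] * 5                          # dataset before the split
left = ["fake"] * 4 + ["real"] * 1                            # subset 1 after the split
right = ["fake"] * 1 + ["real"] * 4                           # subset 2 after the split

weighted_child_entropy = sum(len(s) / len(parent) * entropy(s) for s in (left, right))
info_gain = entropy(parent) - weighted_child_entropy
print("entropy(before) =", round(entropy(parent), 3))         # 1.0 for a 50/50 split
print("information gain =", round(info_gain, 3))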
Gini Index
"You can understand the Gini index as a cost function used to evaluate splits in
the dataset. It is calculated by subtracting the sum of the squared probabilities of each
class from one. It favors larger partitions and is easy to implement, whereas
information gain favors smaller partitions with distinct values."
The Gini index works with the categorical target variable "Success" or "Failure" and
performs only binary splits. It is computed as follows:
1. Calculate the Gini score for each sub-node, using the formula for success (p) and
failure (q): p² + q².
2. Calculate the Gini index for the split using the weighted Gini score of each node of
that split.
CART (Classification and Regression Tree) uses the Gini index method to create split
points.
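For completeness, a small illustrative snippet computing the Gini impurity of the same
kind of split (again not taken from the project code) is shown below.

# Illustrative computation of the Gini index for a binary split.
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

left = ["fake"] * 4 + ["real"] * 1
right = ["fake"] * 1 + ["real"] * 4
n = len(left) + len(right)

# Weighted Gini score of the split (step 2 above); a lower value means a purer split.
split_gini = len(left) / n * gini(left) + len(right) / n * gini(right)
print("Gini(left) =", gini(left), " Gini(right) =", gini(right), " split =", split_gini)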
Gain Ratio
Information gain is biased towards choosing attributes with a large number of values
as root nodes; it prefers the attribute with a large number of distinct values. C4.5, an
improvement over ID3, uses the gain ratio, a modification of information gain that
reduces this bias and is usually the better option. The gain ratio overcomes the
problem by taking into account the number of branches that would result before
making the split; it corrects information gain by taking the intrinsic information of a
split into account.
For example, suppose we have a dataset containing users and their movie genre
preferences based on variables like gender, age group, rating, and so on. With the
help of information gain, you split at "Gender" (assuming it has the highest
information gain), and now the variables "Age Group" and "Rating" could be equally
important; the gain ratio penalizes the variable with more distinct values, which helps
us decide the split at the next level.
Gain Ratio = Information Gain / Split Info, where
Split Info = − Σ (j = 1, …, K) (|j, after| / |before|) · log2(|j, after| / |before|);
"before" is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.
Reduction in Variance
Reduction in variance is used for continuous target variables (regression problems);
it uses the standard formula of variance to choose the best split. The split with lower
variance is selected as the criterion to split the population:
Variance = Σ (X − X̄)² / n
where X̄ is the mean of the values, X is the actual value, and n is the number of values.
Steps to calculate variance:
1. Calculate the variance of each node.
2. Calculate the variance of each split as the weighted average of the node variances.
Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one
of the oldest tree classification methods. It finds out the statistical significance of the
differences between sub-nodes and the parent node. It works with the categorical
target variable "Success" or "Failure" and can perform two or more splits. The higher
the value of Chi-Square, the higher the statistical significance of the differences
between a sub-node and its parent node.
Decision trees classify the examples by sorting them down the tree from the root to
some leaf/terminal node, with the leaf/terminal node providing the classification of
the example.
Each node in the tree acts as a test case for some attribute, and each edge descending
from the node corresponds to the possible answers to the test case. This process is
recursive in nature and is repeated for every subtree rooted at the new node.
Assumptions while creating Decision Tree
Below are some of the assumptions we make while using Decision tree:
● Feature values are preferred to be categorical. If the values are continuous, they
are discretized prior to building the model.
● The ordering of attributes as the root or internal nodes of the tree is done by using
some statistical approach.
Decision trees follow a Sum of Product (SOP) representation, also known as
Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf
node having the same class is a conjunction (product) of attribute values, and different
branches ending in that class form a disjunction (sum).
The primary challenge in the decision tree implementation is to identify which
attributes need to be considered as the root node and at each level; handling this means
identifying the attribute which can best serve as the root node at each level. "The
decision criteria are different for classification and regression trees."
Decision trees use multiple algorithms to decide whether to split a node into two or
more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant
sub-nodes; in other words, the purity of the node increases with respect to the target
variable. The decision tree splits the nodes on all available variables and then selects
the split which results in the most homogeneous sub-nodes.
3. LOGISTIC REGRESSION
This section discusses the basics of logistic regression and its implementation in
Python. Logistic regression is a supervised classification algorithm: for a given set of
features X, the target variable y can take only discrete values. Contrary to popular
belief, logistic regression is a regression model: it builds a regression model to predict
the probability that a given data entry belongs to the category numbered as "1". Just
as linear regression assumes that the data follows a linear function, logistic regression
models the data using the sigmoid function σ(z) = 1 / (1 + e^(−z)).
The output of the sigmoid is a probability between 0 and 1; to obtain a discrete class,
a threshold is brought into the picture. The setting of the threshold value is a very
important aspect of logistic regression and is dependent on the classification problem
itself.
The decision for the value of the threshold is majorly affected by the values of
precision and recall. Ideally, we want both precision and recall to be 1, but this is
seldom the case, so there is a trade-off. In applications where we want to reduce the
number of false negatives without necessarily reducing the number of false positives,
we choose a decision value which has a low value of precision or a high value of
recall; for example, in a cancer diagnosis application we do not want any affected
patient to be classified as not affected without giving much heed to whether the
patient is being wrongfully diagnosed. Conversely, in applications where we want to
reduce the number of false positives without necessarily reducing the number of false
negatives, we choose a decision value which has a high value of precision or a low
value of recall.
Based on the number of target categories, logistic regression can be classified as:
1. Binomial: the target variable can have only 2 possible types, "0" or "1".
2. Multinomial: the target variable can have 3 or more possible types which are not
ordered.
3. Ordinal: the target variable has ordered categories; for example, a test score can
be categorized as "very poor", "poor", "good", "very good".
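A brief hedged sketch of logistic regression with scikit-learn, including an explicit
classification threshold of the kind discussed above, is given below; the synthetic data
is purely illustrative.

# Illustrative sketch: logistic regression with an explicit decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]          # sigmoid output: P(class = 1)
for threshold in (0.3, 0.5, 0.7):             # lower threshold -> higher recall, lower precision
    pred = (proba >= threshold).astype(int)
    print(threshold, "precision:", round(precision_score(y, pred), 2),
          "recall:", round(recall_score(y, pred), 2))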
4. K-NEAREST NEIGHBOR(KNN)
● The K-NN algorithm assumes similarity between the new case/data and the
available cases and puts the new case into the category that is most similar to
the available categories.
● The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be easily
classified into a well-suited category using the K-NN algorithm.
● The K-NN algorithm can be used for regression as well as for classification, but
mostly it is used for classification problems.
● It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs an action on
it at the time of classification.
● The KNN algorithm at the training phase just stores the dataset, and when it gets
new data, it classifies that data into the category that is most similar to the
new data.
Example: suppose we have an image of a creature that looks similar to both a cat
and a dog, and we want to know whether it is a cat or a dog. For this identification,
we can use the KNN algorithm, as it works on a similarity measure. Our KNN
model will find the features of the new data set that are similar to those of the cat
and dog images, and based on the most similar features it will put the creature in
either the cat or the dog category.
Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1; in which of these categories will this data point lie? To solve
this type of problem, we need a K-NN algorithm. With the help of K-NN, we can
easily identify the category or class of a particular dataset. Consider the below
diagram:
How does K-NN work?
The working of K-NN can be explained on the basis of the below algorithm:
● Step-1: Select the number K of neighbors.
● Step-2: Calculate the Euclidean distance from the new point to the data points.
● Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
● Step-4: Among these K neighbors, count the number of data points in each
category.
● Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
● Firstly, we will choose the number of neighbors, so we will choose k = 5.
● Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry.
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
● There is no particular way to determine the best value for "K", so we need to
try some values to find the best out of them. The most preferred value for K is
5.
● A very low value of K, such as K = 1 or K = 2, can be noisy and lead to the
effects of outliers in the model.
● Large values of K are good, but the algorithm may find some difficulties.
Advantages of the KNN algorithm:
● It is simple to implement.
Disadvantages of the KNN algorithm:
● The value of K always needs to be determined, which may be complex at times.
● The computation cost is high because of calculating the distance between the
new data point and all the training samples.
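The K-NN steps listed above map directly onto scikit-learn's KNeighborsClassifier,
as in the hedged sketch below (synthetic data, k = 5 as suggested above).

# Illustrative sketch: K-Nearest Neighbors with k = 5 and Euclidean distance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)        # "training" just stores the data (lazy learner)

print("test accuracy:", knn.score(X_test, y_test))
print("predicted category of first test point:", knn.predict(X_test[:1])[0])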
5. RANDOM FOREST
Random Forest is a flexible, easy to use machine learning algorithm that produces,
even without hyper-parameter tuning, a great result most of the time. It is also one of
the most used algorithms, because of its simplicity and diversity (it can be used for
both classification and regression tasks). In this post we'll learn how the random
forest algorithm works, how it differs from other algorithms and how to use it.
Ensemble Learning
An ensemble method is a technique that combines the predictions from multiple
machine learning algorithms together to make more accurate predictions than any
individual model. Two common ensemble techniques are:
1. Boosting
2. Bootstrap Aggregation (Bagging)
1. Boosting
Boosting refers to a group of algorithms that utilize weighted averages to make
weak learners into stronger learners. Boosting is all about “teamwork”. Each model
that runs, dictates what features the next model will focus on.
In boosting as the name suggests, one is learning from other which in turn boosts the
learning.
2. Bootstrap Aggregation (Bagging)
Bootstrapping is a sampling technique in which subsets of observations are drawn
from the original dataset with replacement; it helps us to better understand the bias
and the variance of the dataset. Bootstrap involves random sampling of small subsets
of data from the dataset.
Bagging is a general procedure that can be used to reduce the variance of algorithms
that have high variance, typically decision trees. Bagging makes each model run
independently and then aggregates the outputs at the end without preference to
any model.
Decision trees are sensitive to the specific data on which they are trained. If the
training data is changed, the resulting decision tree can be quite different and, in turn,
the predictions can be quite different.
Also Decision trees are computationally expensive to train, carry a big risk of
overfitting, and tend to find local optima because they can’t go back after they have
made a split.
Random Forest
Random forest is an ensemble learning method that operates by constructing a
multitude of decision trees at training time and outputting the class that is the mode of
the classes (classification) or the mean prediction (regression) of the individual trees.
Random forest makes two key modifications to plain bagged trees:
1. The number of features that can be split on at each node is limited to some
percentage of the total (a hyperparameter). This ensures that the ensemble model
does not rely too heavily on any individual feature and makes fair use of all
potentially predictive features.
2. Each tree draws a random sample from the original data set when generating its
splits, adding a further element of randomness that prevents overfitting.
The above modifications help prevent the trees from being too highly correlated. To
see why the ensemble helps, think of a set of individual decision tree classifiers, each
outputting a class at certain values of the features x1 and x2; the random forest
combines their input. These results are aggregated, through model votes or averaging,
into a single ensemble model that ends up outperforming any individual decision
tree's output.
Features and Advantages of Random Forest:
1. It is one of the most accurate learning algorithms available; for many data sets, it
produces a highly accurate classifier.
Disadvantages of Random Forest:
1. Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks.
2. For data including categorical variables with different numbers of levels, random
forests are biased in favor of those attributes with more levels. Therefore, the
variable importance scores from random forest are not reliable for this type of data.
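A short hedged sketch of a random forest classifier showing the two modifications
described above (per-split feature subsampling and bootstrap samples) follows; the
data is synthetic and purely illustrative.

# Illustrative sketch: random forest = bagged decision trees + per-split feature subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees whose votes are aggregated
    max_features="sqrt",   # features considered at each split (modification 1)
    bootstrap=True,        # each tree sees a bootstrap sample (modification 2)
    random_state=3,
)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_.round(2))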
6. GRADIENT BOOSTING
A gradient boosting machine sequentially combines many weak learners, here
decision trees, to generate the final predictions. Keep in mind that all the weak
learners are decision trees; so how is using a hundred decision trees better than using
a single decision tree? How do different decision trees capture different signals from
the data?
Here is the trick: the nodes in every decision tree take a different subset of features
for selecting the best split. This means that the individual trees aren't all the same and
hence they are able to capture different signals from the data.
Additionally, each new tree takes into account the errors or mistakes made by the
previous trees. So, every successive decision tree is built on the errors of the previous
trees. This is how the trees in a gradient boosting machine algorithm are built
sequentially.
When we try to predict the target variable using any machine learning technique, the
main causes of difference in actual and predicted values are noise, variance, and bias.
Ensemble methods help to reduce these factors (except noise, which is irreducible
error). An ensemble is just a collection of predictors which come together (e.g. the
mean of all predictions) to give a final prediction. The reason we use ensembles is
that many different predictors trying to predict the same target variable will perform
a better job than any single predictor alone. Ensembling techniques are further
classified into Bagging and Boosting.
Bagging:
Bagging builds many independent models and combines them using some model-
averaging technique (e.g. weighted average, majority vote). We typically take a
random sub-sample/bootstrap of the data for each model, so that all the models are a
little different from each other. Each observation is chosen with replacement to be
used as input for each of the models, so each model sees different observations based
on the bootstrap process. Because this technique combines many uncorrelated
learners into a final model, it reduces error by reducing variance; random forest is an
example of a bagging algorithm.
Boosting:
This technique employs the logic in which the subsequent predictors learn from the
mistakes of the previous predictors. Therefore, the observations have an unequal
probability of appearing in subsequent models, and the ones with the highest error
appear most. (So the observations are not chosen based on the bootstrap process, but
based on the error.) The predictors can be chosen from a range of models like decision
trees, regressors, classifiers, etc. Because new predictors are learning from the
mistakes committed by previous predictors, it takes fewer iterations to reach close to
the actual predictions. But we have to choose the stopping criteria carefully, or it
could lead to overfitting on the training data. Gradient Boosting is an example of a
boosting algorithm.
Fig 2. Ensembling
The objective of any supervised learning algorithm is to define a loss function and
minimize it. Let's see how the maths works out for the gradient boosting algorithm.
Say we use mean squared error (MSE) as the loss, defined as
Loss = MSE = Σ (y_i − y_i^p)², where y_i is the i-th target value and y_i^p the i-th
prediction.
We want our predictions such that our loss function (MSE) is minimal. By using
gradient descent and updating our predictions based on a learning rate α, we can find
the values where MSE is minimal: each prediction is updated as
y_i^p := y_i^p − α · ∂MSE/∂y_i^p = y_i^p + 2α (y_i − y_i^p),
i.e. in the direction that shrinks the residual.
So, we are basically updating the predictions such that the sum of our residuals is
close to 0 (or minimal) and the predicted values are sufficiently close to the actual
values.
The logic behind gradient boosting is simple and can be understood intuitively,
without mathematical notation, by anyone familiar with simple linear regression
modeling. A basic assumption of linear regression is that the sum of its residuals is 0,
i.e. the residuals should be spread randomly around zero. The base models considered
here (decision trees) are not built on such assumptions, but if we think logically (not
statistically) about this assumption, we might argue that, if we are able to see some
pattern of residuals around 0, we can leverage that pattern to fit a model. So, the
intuition behind gradient boosting is to repetitively leverage the patterns in the
residuals to strengthen a model with weak predictions and make it better. Once we
reach a stage where the residuals do not have any pattern that could be modeled, we
stop modeling the residuals (otherwise it might lead to overfitting). Algorithmically,
we are minimizing our loss function such that the test loss reaches its minimum.
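To make the residual-fitting intuition concrete, here is a small hedged sketch
(illustrative only, on synthetic data) that repeatedly fits shallow regression trees to
the residuals of the current predictions.

# Illustrative sketch of gradient boosting for squared-error loss:
# each new shallow tree is fitted to the residuals of the current predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # synthetic target

learning_rate, n_rounds = 0.1, 50
pred = np.full_like(y, y.mean())         # start from a weak constant prediction
trees = []

for _ in range(n_rounds):
    residuals = y - pred                 # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # nudge predictions toward the targets
    trees.append(tree)

print("mean squared error after boosting:", round(np.mean((y - pred) ** 2), 4))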
ADVANTAGES:
• We first model data with simple models and analyze data for errors.
• These errors signify data points that are difficult to fit by a simple model.
• Then for later models, we particularly focus on those hard to fit data to get them
right.
• In the end, we combine all the predictors by giving some weights to each predictor.
“The idea is to use the weak learning method several times to get a succession of
hypotheses, each one refocused on the examples that the previous ones found difficult
and misclassified. … Note, however, it is not obvious at all how this can be done”
CHAPTER 8
SYSTEM MAINTENANCE
The maintenance phase of the software cycle is the time in which a software product
performs useful work; it is usually the longest phase of the software development life
cycle. The need for system maintenance is to keep the system adaptable to changes in
the system environment; there may be social, technical and other environmental
changes which affect the implemented system.
The maintenance phase identifies whether there are any changes required in the
current system. If changes are identified, then an analysis is made to determine
whether the changes are really required; a cost benefit analysis is a way to find out
whether a change is essential. System maintenance conforms the system to its original
requirements, and its purpose is to preserve the value of the software over time. This
value can be enhanced by meeting additional requirements and by making the system
easier to use and more efficient.
CHAPTER 9
CONCLUSION
LITERATURE SURVEY
Detecting fake accounts in social media has become a tedious problem for many
online social networking sites such as Facebook and Instagram. Generally, fake
accounts are found using machine learning, but previously used methods to identify
fake accounts have become inefficient. In [ ], multiple algorithms such as decision
tree, logistic regression and support vector machine were used for detection. A major
drawback of the decision tree approach is that each tree is built for a single feature
rather than for multiple features. Thus, the models which came after this minimized
the number of features, as done in [ ], where comparing the entered age with the
registered mail id and the location of the users were used as features. Improvements
in creating fake accounts made these methods inefficient at detecting them. Thus,
service providers changed the way they predict fake accounts by changing their
algorithms, as done in [ ], where the METIS clustering algorithm was used; this
algorithm takes the data and clusters it into different groups, which made it easier to
separate fake accounts from real accounts. In [ ], the Naïve Bayes algorithm is used:
the probabilities of the chosen features are calculated and substituted into the Naïve
Bayes formula, and the computed value is compared against a reference value to
classify the account as fake or genuine.
REFERENCES
1. "Detection of Fake Twitter Accounts with Machine Learning Algorithms", Ilhan
Aydin.
3. "Detecting Fake Accounts on Social Media", Sarah Khaled, Neamat El Tazi, Hoda
M. O. Mokhtar.
4. "Twitter Fake Account Detection", Buket Ersahin, Ozlem Aktas, Deniz Kilinc,
Ceyhun Akyol.
5. "A New Heuristic of the Decision Tree Induction", Ning Li, Li Zhao, Ai-Xia Chen,
Qing-Changjun Zhu.
8. "Learning-Based Road Crack Detection Using Gradient Boost Decision Tree",
Peng.
9. "Verifying the Value and Veracity of Extreme Gradient Boosted Decision Trees on
a".