Detecting Clusters of Fake Accounts in Online Social Networks

Cao Xiao, University of Washington and LinkedIn Corporation, xiaoc@uw.edu
David Mandell Freeman, LinkedIn Corporation, dfreeman@linkedin.com
Theodore Hwa, LinkedIn Corporation, thwa@linkedin.com

ABSTRACT

Fake accounts are a preferred means for malicious users of online social networks to send spam, commit fraud, or otherwise abuse the system. A single malicious actor may create dozens to thousands of fake accounts in order to scale their operation to reach the maximum number of legitimate members. Detecting and taking action on these accounts as quickly as possible is imperative in order to protect legitimate members and maintain the trustworthiness of the network. However, any individual fake account may appear to be legitimate on first inspection, for example by having a real-sounding name or a believable profile.

In this work we describe a scalable approach to finding groups of fake accounts registered by the same actor. The main technique is a supervised machine learning pipeline for classifying an entire cluster of accounts as malicious or legitimate. The key features used in the model are statistics on fields of user-generated text such as name, email address, company, or university; these include both frequencies of patterns within the cluster (e.g., do all of the emails share a common letter/digit pattern?) and comparison of text frequencies across the entire user base (e.g., are all of the names rare?).

We apply our framework to analyze account data on LinkedIn grouped by registration IP address and registration date. Our model achieved AUC 0.98 on a held-out test set and AUC 0.95 on out-of-sample testing data. The model has been productionalized and has identified more than 250,000 fake accounts since deployment.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Spam; I.2.6 [Artificial Intelligence]: Learning

Keywords

Spam detection, fake profiles, machine learning, data mining, clustering, classification

AISec'15, October 16, 2015, Denver, Colorado, USA.
© 2015 ACM. ISBN 978-1-4503-3826-4/15/10.
DOI: http://dx.doi.org/10.1145/2808769.2808779

1. INTRODUCTION

Today people around the world rely on online social networks (OSNs) to share knowledge, opinions, and experiences; seek information and resources; and expand personal connections. However, the same features that make OSNs valuable to ordinary people also make them targets for a variety of forms of abuse. For example, the large audience on a single platform is a prime target for spammers and scammers, and the trustworthiness of the platform may make the targets more amenable to falling for scams [2]. The "gamification" aspects of a site (e.g., "Like" or "Follow" counters) lend themselves to bots engaging in artificial actions to illegitimately promote products or services [33, 36]. Details about connections can be used to extract valuable business information [19]. And the large amount of member data available is enticing to scrapers who wish to bootstrap their own databases with information on real people [30]. According to statistics provided by the security firm Cloudmark, between 20% and 40% of Facebook accounts could be fake profiles [11]; Twitter and LinkedIn also face fake account problems to varying degrees [13, 21].

Regardless of the particular motivations for creating fake accounts, the existence of large numbers of fake accounts can undermine the value of online social networks for legitimate users. For example, they can weaken the credibility of the network if users start to doubt the authenticity of profile information [20]. They can also have a negative impact on the networks' ad revenue, since advertisers might question the rates they pay to reach a certain number of users if many of them are not real people.

Yet fake accounts are hard to detect and stop. A large-scale OSN may have millions of active users and billions of user activities, of which the fake accounts comprise only a tiny percentage. Given this imbalance, false positive rates must be kept very low in order to avoid blocking many legitimate members. While some fake accounts may demonstrate clear patterns of automation, many are designed to be indistinguishable from real ones. Security measures such as CAPTCHAs and phone verification via SMS have been designed to interrogate suspicious accounts and hence raise the barrier to creating fake accounts. However, the OSN must still select a subset of accounts to challenge (since challenging all accounts would place undue friction on the experience of real users), and once faced with these challenges,
spammers can either solve the challenges using CAPTCHA farms or SIM card farms [7], or they can use the feedback to learn how to avoid the fake account classifier [16].

While there has been a good deal of research on detecting fake accounts (see Section 6), including some using machine learning algorithms, the literature still suffers from the following gaps:

1. None of the existing approaches perform fast detection of clusters of fake accounts. Most published fake account detection algorithms make a prediction for each account [1, 26, 31, 36]. Since a large-scale OSN may register hundreds of thousands of new accounts per day and bad actors try to create accounts at scale, it is more desirable to have a cluster-level detection algorithm that can perform fast, scalable detection and catch all accounts in a cluster at once.

2. None of the existing approaches are designed to detect and take action on fake accounts before they can connect with legitimate members, scrape, or spam. Existing algorithms for fake account detection are in general based on the analysis of user activities and/or social network connections [10, 17, 27, 38, 39], which means the fake accounts must be allowed to stay in the network for a while in order to develop connections and accumulate enough activity data. In practice we want to catch fake accounts as soon as possible after they are registered in order to prevent them from interacting with real users. This creates a challenge, since we have only the basic information provided during the registration flow. Hence an algorithm that can capture as many patterns as possible from very limited profile information is urgently needed.

1.1 Our Contribution

In this work, we develop a scalable and time-sensitive machine learning approach to finding groups of fake accounts registered by the same actor. Our approach solves the challenges described above as follows:

1. The first step in our pipeline is to group accounts into clusters, and our machine learning algorithms take as input cluster-level features exclusively. All of our features are engineered to describe the whole cluster rather than individual accounts, and the resulting classification is on entire clusters. Our approach is scalable to OSNs that have large numbers of daily account registrations.

2. Our algorithms use only features available at registration time or shortly thereafter. In particular, we do not require graph data or activity data. However, since the raw data available at registration time is limited, we must cleverly construct features that enable us to distinguish good clusters from bad clusters. In Section 4 we describe three classes of features that allow us to achieve this goal. We also propose generic pattern encoding algorithms that allow us to collapse user-generated text into a small space on which we can compute statistical features.

We implemented our framework as an offline machine learning pipeline in Hadoop. The pipeline is comprised of three components: the Cluster Builder, which produces clusters of accounts to score; the Profile Featurizer, which extracts features for use in modeling; and the Account Scorer, which trains machine learning models and evaluates the models on new input data. Details of our pipeline can be found in Section 3.

1.2 Experimental Results

We evaluated our approach on LinkedIn account data. For training data we sampled approximately 275,000 accounts registered over a six-month period, of which 55% had been labeled as fake or spam by the LinkedIn Security team.¹ We grouped account-level labels into cluster-level labels for training our classifiers.

¹ Note that this sample is not representative of the LinkedIn member base, but rather is a sampling of accounts that had been flagged as suspicious for some reason.

We trained models using random forest, logistic regression, and support vector machine classifiers. We evaluated the classifiers' performance with 80-20 split in-sample testing and out-of-sample testing on a more recent data set. The latter test is a better approximation of real-life performance, since models are trained on data from the past and run on data from the present.

To measure the classifiers' performance, we computed AUC (area under the ROC curve) and recall at 95% precision. In practice the desired precision rates and thresholds for classification may be higher or lower depending on business needs and the relative cost of false positives and false negatives.

We found that the random forest algorithm provided the best results for all metrics. On the held-out test set, the random forest model produced AUC 0.98 and recall 0.90 at 95% precision. When run on out-of-sample testing data the random forest model again performed best, with AUC 0.95 and recall 0.72 at 95% precision.

1.3 Organization of the Paper

The rest of the paper is organized as follows. In Section 2 we provide an overview of the supervised learning methods that we use in our study, along with metrics for model evaluation. In Section 3 we describe the machine learning pipeline used to implement our system, and in Section 4 we describe our approach to feature engineering. Next, in Section 5 we provide results of our experiment on one embodiment of the proposed approach, describing the performance on testing data as well as results on live LinkedIn data. We discuss related work in Section 6, and we consider future directions in Section 7.

2. TRAINING METHODOLOGIES

2.1 Supervised Learning Methods

During model training, our goal is to construct and select subsets of features that are useful for building a good predictor. In our experiments, we considered the following three methods: logistic regression with L1 regularization [34], support vector machine with a radial basis function kernel [15], and random forest [4], a nonlinear tree-based ensemble learning method.
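For concreteness, the following is a minimal sketch of fitting and comparing these three classifier types on cluster-level data. It uses scikit-learn as a stand-in for the R packages actually used in our experiments (see Section 5.3), and the feature matrix, labels, and hyperparameter values are placeholders rather than our production configuration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Placeholder data: one row of engineered features per cluster,
    # label 1 = fake cluster, 0 = legitimate cluster.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = rng.integers(0, 2, size=1000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    models = {
        "logistic_l1": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
        "svm_rbf": SVC(kernel="rbf", gamma=0.1, probability=True),
        "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]   # probability of "fake"
        print(name, "AUC =", roc_auc_score(y_test, scores))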
Logistic Regression. Given a set S = {(x^(i), y^(i))}_{i=1}^m of m training samples with x^(i) as feature inputs and y^(i) ∈ {0, 1} as labels, logistic regression can be modeled as

    p(y = 1 | x, θ) = 1 / (1 + exp(−θ^T x)),    (1)

where θ ∈ R^n are the model parameters. Without regularization, logistic regression tries to find parameters using the maximum likelihood criterion, while with regularization, the goal is to control the tradeoff between fitting the data and having fewer variables chosen in the model. In our study, we use L1 penalization to regularize the logistic regression model. This technique maximizes the probability distribution of the class label y given a feature vector x, and also reduces the number of irrelevant features by using a penalty term to bound the coefficients θ in the L1 norm. The model parameters θ ∈ R^n are computed as

    arg min_θ Σ_{i=1}^m −log p(y^(i) | x^(i), θ) + β|θ|_1.    (2)

In this formulation, β is the regularization parameter and will be optimally chosen using cross-validation.

Support Vector Machine. The second learning algorithm we consider is the support vector machine (SVM) [3, 9, 29, 35]. The support vector machine algorithm looks for an optimal hyperplane as a decision function in a high-dimensional space. Our training dataset again consists of pairs (x^(i), y^(i)) ∈ R^n × {0, 1}. In our study, since we would like to use a nonlinear classifier, we used SVM with a radial basis function (RBF) kernel in training. The RBF kernel can be formulated as k(x, x′) = exp(−r‖x − x′‖²). The hyperparameter r is called the kernel bandwidth and is tuned based on results of cross-validation.

In principle, the SVM algorithm first maps x into a higher-dimensional space via a function ψ, then finds a hyperplane H in the higher-dimensional space which maximizes the distance between the point set ψ(x_i) and H. If this hyperplane is ⟨w, X⟩ = b (where X is in the higher-dimensional space), then the decision function is f(x) = ⟨w, ψ(x)⟩ − b. The sign of f(x) gives the class label of x. In practice, the function ψ is implicit and all calculations are done with the kernel k.

In our experiments, we adopt a probability model for classification using the R package "e1071" [25]. The decision values of the binary classifier are fitted to a logistic distribution using maximum likelihood to output numerical scores indicating probabilities. While we could just as easily use the raw SVM scores for classification, mapping the scores to probabilities allows us to compare SVM results with other models that output probability estimates.

Random Forest. The random forest algorithm [4] is an ensemble approach that combines many weak classifiers (decision trees) to form a strong classifier (random forest). For each decision tree, we first sample with replacement from the original training set to get a new training set of the same size. Then at each node of the decision tree, we choose m features at random, and split the decision tree according to the best possible split among those m features. The value of m needs to be chosen to balance the strength of individual trees (higher m is better) against the correlation between trees (lower m is better). Given a new sample, the resulting model scores it by running the sample through all trees and then combining the results; in the case of a binary classification problem like ours, the score is simply the percentage of trees that give a positive result on the sample.

2.2 Evaluation Metrics

All three of our classifiers output real-number scores that can be used to order the samples in a test set. To measure the classifiers' performance, we calculate the AUC (area under the ROC curve), precision, and recall. We can calculate each metric either on the cluster level or on the account level, where each account is assigned the score output by the classifier for its parent cluster.

The area under the receiver operating characteristic curve (AUC) is commonly used in model comparison and can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. A model with higher AUC is considered a better model. Advantages of AUC as a metric are that it doesn't require choosing a threshold for assigning labels to scores and that it is independent of class bias in the test set.

Precision and recall are well-known metrics for binary classification. In our application, precision is the fraction of predicted fake accounts that are truly fake, while recall is the fraction of fake accounts in the wild that are caught by the model. For a classifier that outputs a score or probability, precision and recall can be calculated for each score threshold, giving a parametric curve. Since false positives in a fake account model are very costly, our metric of choice for model evaluation is recall at the threshold that produces 95% precision. (The 95% rate is merely for baselining; in practice we aim for much higher precision.)

3. MACHINE LEARNING PIPELINE

To make the proposed fake account detection system scalable, we designed and implemented a practical machine learning pipeline involving a sequence of data pre-processing, feature extraction, prediction, and validation stages. The pipeline consists of three major components, which we describe below and illustrate in Figure 1.

[Figure 1: Our learning pipeline implementing the fake account cluster detection approach. We assemble accounts into clusters, extract features, train or evaluate the model, and assign scores to the accounts in each cluster.]

3.1 Cluster Builder

The Cluster Builder, as its name implies, takes the raw list of accounts and builds clusters of accounts along with their raw features. The module takes user-specified parameters for (1) minimum and maximum cluster size; (2) time span of accounts registered (e.g., last 24 hours, last week); and (3) clustering criteria. The clustering criteria can be as simple as grouping all accounts that share a common characteristic such as IP address, or a more complex clustering algorithm such as k-means. Once the initial clusters are built, user-defined criteria can be added to filter out some of the clusters that are not likely to be suspicious or may introduce high false positives. For example, one may wish to filter out accounts registered from the OSN's corporate IP space, as these are likely to be test accounts that should not be restricted.

The Cluster Builder takes raw member profile tables as input and outputs a table of accounts with the features that are needed for feature engineering, such as the member's name, company, and education. Each row of the table represents one account and contains a "cluster identifier" unique to that account's cluster. This table is used as input to the Profile Featurizer.
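In its simplest configuration this step is just a keyed aggregation over the registration log. The following pandas sketch illustrates the kind of grouping used in our experiments (registration IP and registration date, as in Section 5.1); the column names, example rows, and size limits are illustrative and not the production schema.

    import pandas as pd

    # Illustrative raw registration table; column names are hypothetical.
    accounts = pd.DataFrame({
        "member_id": [1, 2, 3, 4],
        "reg_ip":    ["203.0.113.7", "203.0.113.7", "198.51.100.2", "203.0.113.7"],
        "reg_date":  ["2014-05-01", "2014-05-01", "2014-05-01", "2014-05-02"],
        "name":      ["Charles Green", "Joseph Baker", "Ann Lee", "Thomas Adams"],
    })

    # One cluster per (registration IP, registration date) pair.
    accounts["cluster_id"] = accounts["reg_ip"] + "|" + accounts["reg_date"]

    # Optional filter mirroring the minimum/maximum cluster-size parameters.
    sizes = accounts.groupby("cluster_id")["member_id"].transform("size")
    clusters = accounts[(sizes >= 2) & (sizes <= 10000)]
    print(clusters[["member_id", "cluster_id", "name"]])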
In the training phase the Cluster Builder must also use account-level labels to label each cluster as real or fake. While most clusters have either all accounts or no accounts labeled as fake, there will in general be a few clusters with some accounts in each group. Thus to compute cluster labels, we choose a threshold x such that clusters with fewer than x percent fake accounts are labeled real and those with greater than x percent fake are labeled fake. The optimal choice of x depends on precision/recall tradeoffs (i.e., higher values of x increase precision at the expense of recall). However, as discussed in Section 5.2 below, in practice we find that the model is fairly insensitive to this choice.

3.2 Profile Featurizer

The Profile Featurizer is the key component of the pipeline. Its purpose is to convert the raw data for each cluster (i.e., the data for all of the individual accounts in the cluster) into a single numerical vector representing the cluster that can be used in a machine learning algorithm. It is implemented as a set of functions designed to capture as much information as possible from the raw features in order to discriminate clusters of fake accounts from clusters of legitimate accounts. The extracted features can be broadly grouped into three categories, which we describe at a high level here; further details can be found in Section 4.

1. Basic distribution features. For each cluster, we take basic statistical measures of each column (e.g., company name). Examples include the mean or quartiles for numerical features, or the number of unique values for text features.

2. Pattern features. We have designed "pattern encoding algorithms" that map user-generated text to a smaller categorical space. We then take basic distribution features over these categorical variables. These features are designed to detect malicious users (especially bots) that are following a pattern in their account signups.

3. Frequency features. For each feature value, we compute the frequency of that value over the entire account database. We then compute basic distribution features over these frequencies. In general we expect clusters of legitimate accounts to have some high-frequency data and some low-frequency data, while bots or malicious users will show less variance in their data frequencies, e.g., using only common or only rare names.

3.3 Account Scorer

The Account Scorer's function is to train the models and evaluate them on previously unseen data. The Account Scorer takes as input the output of the Profile Featurizer, i.e., one numerical vector for each cluster. The specific learning algorithm used is user-configurable; in our experiments we consider logistic regression, random forests, and support vector machines. In "training mode," the Account Scorer is given a labeled set of training data and outputs a model description as well as evaluation metrics that can be used to compare different models. In "evaluation mode," the Account Scorer is given a model description and an input vector of cluster features and outputs a score for that cluster indicating the likelihood of that cluster being composed of fake accounts.

Based on the cluster's score, accounts in that cluster can be selected for any of three actions: automatic restriction (if the probability of being fake is high), manual review (if the results are inconclusive), or no action (if the probability of being fake is low). The exact thresholds for selecting between the three actions are configured to minimize false positives and give human reviewers a mix of good and bad accounts. The manually labeled accounts can later be used as training data in further iterations of the model.
4. FEATURE ENGINEERING

The quality of the numerical features output by the Profile Featurizer is the single most important factor in the effectiveness of our classifiers. We now describe this process in greater detail.

4.1 Basic Distribution Features

We began with a manual survey of clusters of fake accounts in the LinkedIn data set (see Section 5 for details) that had already been detected and labeled as fake. We find that accounts within a large cluster generally show patterns in their user-entered data such as name, company, or education. Sometimes these patterns may be obvious; for example, all accounts may use identical text for the description of their current position. Such a pattern can be captured by what we term a basic distribution feature, in this case the feature being the number of unique position descriptions. The basic distribution features we consider include the following:

• For numerical features:
  – Min, max, and quartiles.
  – Mean and variance.

• For categorical features:
  – Number of distinct feature values in the cluster (both the raw count and as a fraction of cluster size).
  – Percentage of null values (i.e., empty fields).
  – Percentage of values belonging to the mode.
  – Percentage of values belonging to the top two feature values.
  – Percentage of values that are unique.
  – Numerical features (see above) on the array of value counts.
  – Entropy, computed as Σ_i −p_i log(p_i), where i ranges over the feature values and p_i = (number of instances of i) / (number of distinct feature values).

For categorical features that take two values we can encode the values as 0/1 and compute the numerical features described above; we can also consider text fields as categorical and compute the corresponding distribution features.
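To make the categorical statistics above concrete, here is a small Python sketch computing a few of them for one column of one cluster. The function name, the feature subset, and the toy company-name example are illustrative; this is not the Profile Featurizer implementation.

    from collections import Counter

    def categorical_distribution_features(values):
        """A few of the Section 4.1 statistics for one categorical column of one
        cluster; `values` holds the user-entered entries (None for empty fields)."""
        n = len(values)
        non_null = [v for v in values if v]
        counts = Counter(non_null).most_common()   # [(value, count), ...], most frequent first
        distinct = len(counts)
        return {
            "num_distinct": distinct,
            "frac_distinct": distinct / n,
            "frac_null": 1 - len(non_null) / n,
            "frac_mode": counts[0][1] / n if counts else 0.0,
            "frac_top2": sum(c for _, c in counts[:2]) / n if counts else 0.0,
            "frac_unique": sum(1 for _, c in counts if c == 1) / n,
        }

    # Example: company names within one cluster of five accounts.
    print(categorical_distribution_features(
        ["Acme Corp", "Acme Corp", "Acme Corp", "Initech", None]))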
4.2 Pattern features

We often find that when a single entity, whether bot or human, registers a cluster of fake accounts, the user-entered text in one or more columns always matches a certain pattern. For example, the email addresses on the accounts might appear as follows (this is a synthetic sample):

charlesgreen992@domain.com
josephbaker247@domain.com
thomasadams319@domain.com
chrisnelson211@domain.com
danielhill538@domain.com
paulwhite46@domain.com
markcampbell343@domain.com
donaldmitchell92@domain.com
georgeroberts964@domain.com
kennethcarter149@domain.com

Clearly all of these email addresses satisfy the regular expression [a-z]+[0-9]+@domain\.com . We can apply this regular expression to the email address to obtain a binary feature, upon which we can calculate the basic distribution features described above. In theory we could apply the techniques of Prasse et al. [28] to our training set to generate a list of spammy regular expressions and use each regular expression as a binary feature. However, this approach will generate a very sparse feature vector and will not generalize to unseen patterns.

Instead of relying on regular expressions, we have designed two "pattern encoding algorithms" that map arbitrary text to a smaller space. The first algorithm normalizes character classes: the universe of characters is divided into text classes such as uppercase, lowercase, digit, and punctuation, and each character is mapped to a representative character for its class, as described in Algorithm 1 below.

Algorithm 1 Pattern Encoding Algorithm (Length-Preserving)
Require: s.length > 0
 1: procedure Encode(s)            ▷ 'abc12' → 'LLLDD'
 2:   i ← 0
 3:   t ← ''
 4:   while i < s.length do
 5:     if isUpperCase(s[i]) then
 6:       t ← t + 'U'
 7:     else if isLowerCase(s[i]) then
 8:       t ← t + 'L'
 9:     else if isDigit(s[i]) then
10:       t ← t + 'D'
11:     else
12:       t ← t + 'O'
13:     end if
14:     i ← i + 1
15:   end while
16:   return t
17: end procedure

This algorithm, which is length-preserving, would be able to detect email addresses that were all eight letters plus three digits at the same domain. However, it would not detect the list of email addresses above, since the names and numbers are of varying length. To address this problem we use a length-independent variant that collapses consecutive instances of a class into a single representative, as described in Algorithm 2 below.

Algorithm 2 Pattern Encoding Algorithm (Length-Independent)
Require: s.length > 0
 1: procedure ShortEncode(s)       ▷ 'abc12' → 'LD'
 2:   i ← 0
 3:   s ← Encode(s)
 4:   curr ← ''
 5:   t ← ''
 6:   while i < s.length do
 7:     if curr ≠ s[i] then
 8:       t ← t + s[i]
 9:       curr ← s[i]
10:     end if
11:     i ← i + 1
12:   end while
13:   return t
14: end procedure

The output of this algorithm on the list of email usernames (i.e., the text before the @ sign) above would be "LD" in each case. Our experience shows that it is rare for a collection of legitimate users to all follow such a pattern, so this is a good feature for distinguishing clusters of accounts created by a single entity.

In addition to using the algorithm as described above, we can add new character classes such as punctuation or spaces, or we can collapse classes together (e.g., lowercase and uppercase letters). In this way simple metrics such as text length and word count can also be subsumed in the framework. Some of the patterns we used in our analysis are as follows:

• Encode() (Algorithm 1).
• ShortEncode() (Algorithm 2).
• Len(Encode()) using a single character class (i.e., text length).
• Len(ShortEncode()) using two character classes, space and non-space (i.e., word count).
• Binary features checking the existence of each character class in Encode().
• Encode() on the first character of the text.

Once mapped to a smaller categorical space, we apply the basic distribution features described in Section 4.1 to compute numerical features.
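For reference, Algorithms 1 and 2 translate directly into a few lines of Python. The sketch below follows the pseudocode above (four character classes, collapsing of consecutive repeats); only the lower-level details such as function names are our own.

    def encode(s):
        """Length-preserving pattern encoding (Algorithm 1): 'abc12' -> 'LLLDD'."""
        out = []
        for ch in s:
            if ch.isupper():
                out.append("U")
            elif ch.islower():
                out.append("L")
            elif ch.isdigit():
                out.append("D")
            else:
                out.append("O")
        return "".join(out)

    def short_encode(s):
        """Length-independent variant (Algorithm 2): collapse repeats, 'abc12' -> 'LD'."""
        out = []
        for ch in encode(s):
            if not out or out[-1] != ch:
                out.append(ch)
        return "".join(out)

    # The synthetic email usernames above all collapse to the same pattern "LD".
    for username in ("charlesgreen992", "josephbaker247", "paulwhite46"):
        print(username, encode(username), short_encode(username))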
4.3 Frequency features

Upon close examination of clusters of fake accounts, we often find patterns that are apparent to the trained eye but algorithmically hard to describe. For example, consider these two sets of names (again, a synthetic sample):

Cluster 1          Cluster 2
Charles Green      Shirely Lofgren
Joseph Baker       Tatiana Gehring
Thomas Adams       China Arzate
Chris Nelson       Marcelina Pettinato
Daniel Hill        Marilu Marusak
Paul White         Bonita Naef
Mark Campbell      Etta Scearce
Donald Mitchell    Paulita Kao
George Roberts     Alaine Propp
Kenneth Carter     Sellai Gauer

It's fairly apparent that these names are not randomly sampled from the population at large. The names in Cluster 1 are all common male names (in fact, they were generated by taking top-ranked first and last names from U.S. Census data) and the names in Cluster 2 are all exceedingly rare: it's possible there could be someone in the world named Bonita Naef, but the probability that she would register for a social network from the same IP address as Alaine Propp and all the others is quite low.

We quantify this intuition using data from the social network's entire member base. Specifically, for a given column of text (such as first name), we compute the frequency of that text among all of the social network's members. This gives a number between 0 and 1, on which we can compute basic distribution features as described in Section 4.1. We can do the same for the logarithms of the frequencies or the ranks of the features in the sorted list of frequencies, which can help to distinguish extremely rare entries from somewhat rare entries.
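A small sketch of this idea follows, assuming a precomputed table mapping each value to its frequency over the whole member base. The feature names, the tiny toy frequency table, and the handling of unseen values are illustrative assumptions, not the production feature set.

    import numpy as np

    def frequency_features(values, global_freq):
        """Frequency-based features for one text column of one cluster.
        `global_freq` maps a value to its frequency (between 0 and 1) over the
        entire member base; unseen values are treated as having frequency ~0."""
        freqs = np.array([global_freq.get(v, 1e-9) for v in values])
        logf = np.log10(freqs)
        return {
            "mean_freq": freqs.mean(),
            "min_freq": freqs.min(),
            "mean_log_freq": logf.mean(),
            "var_log_freq": logf.var(),
        }

    # Toy global table: common first names are frequent, the Cluster 2 names are rare.
    global_freq = {"Charles": 0.012, "Joseph": 0.011, "Paul": 0.009,
                   "Shirely": 1e-6, "Bonita": 2e-6}
    print(frequency_features(["Charles", "Joseph", "Paul"], global_freq))
    print(frequency_features(["Shirely", "Bonita", "Etta"], global_freq))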
5. EXPERIMENTAL RESULTS

5.1 Data Acquisition

We evaluated our model on labeled LinkedIn account data, where the labels were provided by LinkedIn's Security and/or Trust and Safety teams. Our approach first requires us to choose a method of clustering accounts. For our study, we created clusters of LinkedIn accounts by grouping on registration IP address² and registration date (in Pacific Time). We chose this approach primarily because this is a grouping for which we were able to obtain a large amount of manually labeled data; in Section 7 we discuss other clustering approaches.

² For accounts registered using IPv6 we grouped on the /56 subnet.

For our training set we collected labeled accounts from the 6-month period December 1, 2013 to May 31, 2014. During this time the accounts in all (IP, date) clusters satisfying an internal criterion for suspicious registration were sent to LinkedIn's Trust and Safety team for manual review and action. We extracted raw profile data for each account in these clusters and labeled the account as fake if it was restricted or as real if it was in good standing as of the survey time. The total number of labeled accounts was 260,644, of which 153,019 were fake accounts and 107,625 were legitimate.

In a similar fashion we obtained data from June 2014 to be used as "out-of-sample" testing data. This data included 30,550 accounts, of which 15,078 were fake accounts and 15,472 were legitimate.

5.2 Cluster Labeling

The labeled accounts in our data set fell into 20,559 distinct (IP, date) clusters. The median cluster size was 9; a histogram of the cluster sizes appears in Figure 2. For each cluster we calculated the percentage of accounts labeled as spam; a histogram of this data appears in Figure 3. We found that 89% of the clusters had either no fake accounts or all fake accounts, and only 3.8% of clusters had between 20% and 80% fake accounts.

[Figure 2: Distribution of cluster sizes for our training data.]

[Figure 3: Distribution of spam percentage per cluster for our training data.]

To determine a threshold for labeling a cluster as fake (see Section 3.1), we ran our random forest classifier using cluster labels generated by setting three different thresholds: 20%, 50%, and 80%. The resulting AUC metrics at the account level were 0.9765, 0.9777, and 0.9776, respectively. We conclude that the relative ordering of scored accounts is insensitive to the cluster-labeling threshold, and we chose 50% as our threshold for further experiments. At this threshold, 10,456 of our training clusters were labeled as spam and 10,103 were labeled as legitimate. The out-of-sample test set fell into 2,705 clusters, out of which 1,227 were spam and 1,478 were legitimate.
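The cluster-labeling rule itself is a one-liner over the account-level labels. A minimal sketch, with hypothetical cluster identifiers:

    def label_clusters(account_labels_by_cluster, threshold=0.5):
        """Turn account-level labels (1 = fake, 0 = real) into cluster-level labels:
        a cluster whose fraction of fake accounts exceeds `threshold` is labeled fake.
        As noted above, results were insensitive to thresholds of 0.2, 0.5, and 0.8."""
        return {
            cluster_id: int(sum(labels) / len(labels) > threshold)
            for cluster_id, labels in account_labels_by_cluster.items()
        }

    example = {
        "203.0.113.7|2014-05-01": [1, 1, 1, 0],   # 75% fake -> fake cluster
        "198.51.100.2|2014-05-01": [0, 0, 0],     # 0% fake  -> legitimate cluster
    }
    print(label_clusters(example))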
5.3 Performance Analysis

We evaluated our proposed approach using the machine learning algorithms described in Section 2: logistic regression, SVM, and random forest. We chose these three algorithms to illustrate possible approaches; in principle any binary classification algorithm can be used, and the best algorithm may change depending on the domain area. To make a fair comparison and evaluation, the parameters of all supervised learning algorithms were determined through an 80-20 split cross-validation process. Specifically, 80% of the training data was used to construct the classifier and the remaining 20% of the data was used for "in-sample" performance testing. The optimal parameter setting is the setting that maximizes the in-sample testing AUC.

We ran the three aforementioned algorithms on the training data using the R packages "glmnet" [14], "e1071" [25], and "randomForest" [23], respectively. Table 1 shows the in-sample prediction performance as measured by AUC and recall at 95% precision. The data show that random forest performs the best on both metrics. The other nonlinear classifier, SVM with RBF kernel, also has good performance in terms of its AUC value. However, its recall at 95% precision is not as good as that of random forest, which indicates that for SVM, although we have high confidence in the fake accounts we catch, there are still many fake accounts not caught by the model. Among all models, logistic regression has the worst performance, because the nonlinearity in the true patterns cannot be well modeled using a linear classifier.

Table 1: 80-20 split testing performance (cluster level)

Algorithm            AUC     Recall@p95
Random forest        0.978   0.900
Logistic regression  0.936   0.657
SVM                  0.963   0.837

Table 2 shows the testing AUC and recall at 95% precision for all three algorithms at the account level; that is, when each account is assigned the score computed for its cluster. The data show that our predictions are even more accurate at the account level, and an examination of the data shows that this is caused by the classifiers being more accurate on larger clusters. (See Table 5 for further evidence.)

Table 2: 80-20 split testing performance (account level)

Algorithm            AUC     Recall@p95
Random forest        0.978   0.935
Logistic regression  0.951   0.821
SVM                  0.961   0.889

[Figure 4: Comparison of ROC curves for different models on in-sample data.]
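The two metrics reported in these tables are straightforward to compute from model scores. The following sketch uses scikit-learn as a stand-in for the R-based evaluation actually used in our experiments; the toy labels and scores are placeholders.

    import numpy as np
    from sklearn.metrics import precision_recall_curve, roc_auc_score

    def recall_at_precision(y_true, scores, target_precision=0.95):
        """Highest recall achievable at a threshold whose precision meets the target."""
        precision, recall, _ = precision_recall_curve(y_true, scores)
        ok = precision >= target_precision
        return recall[ok].max() if ok.any() else 0.0

    # Toy data; in practice these are cluster-level (or account-level) model outputs.
    y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
    scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1])
    print("AUC =", roc_auc_score(y_true, scores))
    print("recall@p95 =", recall_at_precision(y_true, scores))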
We also tested our model on out-of-sample data from June 2014. The motivation for doing out-of-sample testing is that spammers' patterns in their fake account data will vary over time as they experiment and learn from failure. Performing out-of-sample testing simulates this scenario in production, and provides a practical and useful evaluation of the real-life performance of the model.

[Figure 5: Comparison of ROC curves for different models on out-of-sample data.]

Tables 3 and 4 give the out-of-sample performance comparison of the three models, trained on the training data, at the cluster level and account level, respectively. The data show that random forest still performs the best on all metrics. The recall at 95% precision for all three algorithms decreases as compared with the cross-validation results, which confirms our assumption that, given a certain level of precision (i.e., the fraction of predicted fake accounts that are truly fake), there are more fake accounts not being caught in the newer dataset. The results also indicate that we need to re-train our model regularly, so as to capture the newer patterns and increase the fraction of fake accounts caught.

Table 3: Out-of-sample testing performance (cluster level)

Algorithm            AUC     Recall@p95
Random forest        0.949   0.720
Logistic regression  0.906   0.127
SVM                  0.928   0.522

Table 4: Out-of-sample testing performance (account level)

Algorithm            AUC     Recall@p95
Random forest        0.954   0.713
Logistic regression  0.917   0.456
SVM                  0.922   0.311

One interesting finding here is that the performance of the SVM classifier decreases as we move from the cluster level to the account level. This result implies that, unlike logistic regression and random forests, SVM does better at classifying smaller clusters; it also suggests that an ensemble approach combining all of the classifiers may lead to still greater performance.

Analysis by Cluster Size. Table 5 shows the random forest results binned by cluster size. We see that as clusters grow bigger, the model performance improves. If a cluster has more than 30 accounts, meaning that more than 30 accounts signed up from one IP address on a single day, we can label the cluster and all accounts in it with near-perfect confidence. If a cluster has more than 100 accounts, then we reach 100% accuracy on all metrics for the cross-validation set.

Table 5: Random forest performance by cluster size

Cluster Size      AUC     Recall@p95
1 to 10           0.967   0.817
11 to 30          0.988   0.965
31 to 100         0.988   0.989
greater than 100  1.000   1.000

Analysis of Top Ranked Features. To gain more insight into the features in our study, we ranked them using the Gini importance index, which is calculated based on the Gini index [4]. In our model, the top features included the average frequency counts of the two least common last or first names, as well as the fraction of top patterns generated from our pattern encoding algorithms in name and email address.

5.4 False Positive and False Negative Analysis

We manually reviewed all accounts in our validation set and out-of-sample testing set that were predicted to be fake but were actually labeled as legitimate. We found that the majority of these were from organizational signups. A number of members all signed up from the same organization, which might all come from a single IP address, and some parts of their profile information may be similar. For example, their email addresses may follow a standard pattern (such as <last name><first initial>@<organization>.org). To address such false positives, we developed an organizational account detection model, and we configured our classifier in production so that organizational accounts that the model labels as fake are sent for manual review instead of being automatically restricted. This approach helped resolve the false positive issue greatly.
We also manually reviewed all accounts in both datasets that were predicted to be legitimate but were labeled by humans as fake. In many cases, it turned out the prediction from our model was correct. Usually, if there is a legitimate mass signup (e.g., during a LinkedIn marketing event), the large number of signups will fall into a single cluster. This will likely trigger some previous rule-based model to label them as fake, and the human labeler might also label them as fake for the same reason. However, as the size of the cluster grows, the account profile patterns within the cluster become more and more diverse, which looks more normal from the model's perspective, so the model is able to correctly label the cluster as good. That also explains why, as clusters grow bigger, the model becomes more and more accurate, as shown in Table 5. Where the model found errors in the previous human label (as confirmed by our subsequent manual review), we reversed the previous decision and relabeled those accounts as legitimate.

5.5 Running on Live Data

We implemented the system using Java, Hive, and R, and trained it on the dataset discussed in Section 5.1. Using Hadoop streaming, we ran the algorithm daily on new LinkedIn registrations. The highest-scoring accounts were automatically restricted. Scores in a "gray area" were sent to LinkedIn's Trust and Safety team for manual review and action. This process allows us to collect quality labeled data on borderline cases for training future models.

Since its rollout, the model has caught more than 15,000 clusters, comprising more than 250,000 fake LinkedIn accounts. The trend in the model's precision can be seen in Figure 6, which plots a 14-day moving average of the precision. The decrease in precision at one point is due to a large number of organizational signups that the model erroneously flagged; upon addition of the "organization detector" the precision returned to its previous levels.

[Figure 6: 14-day moving average of model precision (at the account level) since deployment.]

Figure 7 shows a histogram of cluster sizes on live data and the precision (on the level of clusters) within each bucket. Most of the clusters detected were relatively small; the median cluster size was 11 accounts. Contrary to our experience with the training data, we found that precision did not in general increase with cluster size, except for the very largest clusters (of size greater than 100).

[Figure 7: Distribution of cluster size and precision on live data.]

6. RELATED WORK

The problem of detecting fake accounts in online social networks has been approached from a number of different perspectives, including behavioral analysis, graph theory, machine learning, and system design.

Using a behavioral perspective, Malhotra et al. [24] develop features to detect malicious users who create fake accounts across different social networks. However, the features they propose are all account-level basic profile features. If the same spammer does not abuse different platforms using the same basic profile information, the effect of such features would be reduced.

Much research has been done to analyze fake accounts in OSNs from a graph-theoretic perspective. Two relevant surveys are those of Yu [38], who describes a number of specific sybil defense mechanisms, and Viswanath et al. [37], who point out that most existing Sybil defense schemes work by detecting local communities (i.e., clusters of nodes more tightly knit than the rest of the graph) around a trusted node.

In more recent graph-theoretic work, Jiang et al. [17] propose to detect fake accounts by constructing latent interaction graphs as models of user browsing behavior. They then compare these graphs' structural properties, evolution, community structure, and mixing times against those of both active interaction graphs and social graphs. Mohaisen et al. [27] detect Sybil nodes, which disrupt the fast mixing property of social networks, and thus propose several heuristics to improve the mixing of slow-mixing graphs using their topological structures. Conti et al. [8] analyze social network graphs from a dynamic point of view to detect adversaries who create fake profiles to impersonate real people and then interact with the real people's friends.

While the above graph-theoretic techniques may be applicable to the clusters of accounts we study in this work, our goal is to detect clusters before they can make the connections or engage in the behavior that produces the relevant graph structures. Thus our approach focuses on signals that are available at or shortly after registration time, which includes only a small amount of activity data and little to no connection data.

Many researchers have applied machine learning algorithms to the problem of spam detection on OSNs. Fire et al. [12] use topology anomalies, decision trees, and naive Bayes classifiers to identify spammers and fake profiles that are used in multiple social networks. Jin et al. [18] analyze the behavior of identity clone attacks and propose a detection framework. Cao et al. [5] develop a ranking algorithm to rank users in online services and detect fake accounts; this rank is calculated according to the degree-normalized probability of a short random walk in the non-Sybil region. Tan et al. [32] put the network spam detection problem in an unsupervised learning framework, deliberately removing non-spammers from the network and leveraging both the social graph and the user-link graph.

Rather than focus on detecting fake accounts after they penetrate the network, some researchers have focused on designing the systems themselves to prevent attacks in the first place. Lesniewski-Laas and Kaashoek [22] propose a novel routing protocol for distributed hash tables that is efficient and strongly resistant to Sybil attacks. Chiluka et al. [6] propose a new design point in the trade-off between network connectivity and attack resilience of social network-based Sybil defense schemes, where each node adds links to only a selective few of its 2-hop neighbors based on a minimum expansion contribution (MinEC) heuristic. Viswanath et al. [37] present a system that uses routing-based techniques to efficiently approximate credit payments over large networks.

While system-design techniques for preventing abuse can be effective, they are often not applicable in practice to a large-scale network that was originally designed to optimize for growth and engagement long before abuse became a significant issue.

7. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a machine learning pipeline for detecting fake accounts in online social networks. Rather than making a prediction for each individual account, our system classifies clusters of fake accounts to determine whether they have been created by the same actor. Our evaluation on both in-sample and out-of-sample data showed strong performance, and we have used the system in production to find and restrict more than 250,000 accounts.

In this work we evaluated our framework on clusters created by simple grouping on registration date and registration IP address. In future work we expect to run our model on clusters created by grouping on other features, such as ISP or company, and other time periods, such as week or month. Another promising line of research is to use more sophisticated clustering algorithms such as k-means or hierarchical clustering. While these approaches may be fruitful, they present obstacles to operating at scale: k-means may require too many clusters (i.e., too large a value of k) to produce useful results, and hierarchical clustering may be too computationally intensive to classify millions of accounts.

From a modeling perspective, one important direction for future work is to apply feature sets used in other spam detection models, and hence to realize multi-model ensemble prediction. Another direction is to make the system robust against adversarial attacks, such as a botnet that diversifies all features, or an attacker that learns from failures. A final direction is to construct more language-insensitive pattern matching features; our features assume the text is written in an alphabet that can be mapped to a small number of character classes (e.g., uppercase or lowercase), and this does not readily adapt to pictographic languages such as Chinese.

8. REFERENCES

[1] S. Adikari and K. Dutta. Identifying fake profiles in LinkedIn. Pacific Asia Conference on Information Systems Proceedings 2014, 2014.
[2] J. Beall. Publisher uses fake LinkedIn identities to attract submissions. http://scholarlyoa.com/2015/02/10/publisher-uses-fake-linkedin-identities-to-attract-submissions.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
[4] L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001.
[5] Q. Cao, M. Sirivianos, X. Yang, and T. Pregueiro. Aiding the detection of fake accounts in large scale social online services. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 15–15, Berkeley, CA, USA, 2012. USENIX Association.
[6] N. Chiluka, N. Andrade, J. Pouwelse, and H. Sips. Social networks meet distributed systems: Towards a robust sybil defense under churn. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '15, pages 507–518, New York, NY, USA, 2015. ACM.
[7] D. B. Clark. The bot bubble: How click farms have inflated social media currency. The New Republic, April 20, 2015. Available at http://www.newrepublic.com/article/121551/bot-bubble-click-farms-have-inflated-social-media-currency.
[8] M. Conti, R. Poovendran, and M. Secchiero. Fakebook: Detecting fake profiles in on-line social networks. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ASONAM '12, pages 1071–1078, Washington, DC, USA, 2012. IEEE Computer Society.
[9] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.
[10] G. Danezis and P. Mittal. Sybilinfer: Detecting sybil nodes using social networks. Technical Report MSR-TR-2009-6, Microsoft, January 2009.
[11] Digital Trends Staff. 40 pct. fake profiles on Facebook? http://www.digitaltrends.com/computing/fake-profiles-facebook/.
[12] M. Fire, G. Katz, and Y. Elovici. Strangers intrusion detection - detecting spammers and fake profiles in social networks based on topology anomalies. ASE Human Journal, 1(1):26–39, Jan. 2012.
[13] D. M. Freeman. Using Naive Bayes to detect spammy names in social networks. In A. Sadeghi, B. Nelson, C. Dimitrakakis, and E. Shi, editors, AISec'13, Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, Co-located with CCS 2013, Berlin, Germany, November 4, 2013, pages 3–12. ACM, 2013.
[14] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
[15] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
[16] L. Huang, A. D. Joseph, B. Nelson, B. I. P. Rubinstein, and J. D. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec 2011, Chicago, IL, USA, October 21, 2011, pages 43–58, 2011.
[17] J. Jiang, C. Wilson, X. Wang, W. Sha, P. Huang, Y. Dai, and B. Y. Zhao. Understanding latent interactions in online social networks. ACM Trans. Web, 7(4):18:1–18:39, Nov. 2013.
[18] L. Jin, H. Takabi, and J. B. Joshi. Towards active detection of identity clone attacks on online social networks. In Proceedings of the First ACM Conference on Data and Application Security and Privacy, CODASPY '11, pages 27–38, New York, NY, USA, 2011. ACM.
[19] P. Judge. Social klepto: Corporate espionage with fake social network accounts. https://www.rsaconference.com/writable/presentations/file upload/br-r32.pdf.
[20] K. Lee. Fake profiles are killing LinkedIn's value. http://www.clickz.com/clickz/column/2379996/fake-profiles-are-killing-linkedin-s-value.
[21] K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: a long-term study of content polluters on Twitter. In AAAI International Conference on Weblogs and Social Media (ICWSM), 2011.
[22] C. Lesniewski-Laas and M. F. Kaashoek. Whanau: A sybil-proof distributed hash table. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI'10, pages 8–8, Berkeley, CA, USA, 2010. USENIX Association.
[23] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
[24] A. Malhotra, L. Totti, W. Meira Jr., P. Kumaraguru, and V. Almeida. Studying user footprints in different online social networks. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ASONAM '12, pages 1065–1070, Washington, DC, USA, 2012. IEEE Computer Society.
[25] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, and C. Chang. R package "e1071". 2014.
[26] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: Inferring user profiles in online social networks. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 251–260, New York, NY, USA, 2010. ACM.
[27] A. Mohaisen and S. Hollenbeck. Improving social network-based sybil defenses by rewiring and augmenting social graphs. In Revised Selected Papers of the 14th International Workshop on Information Security Applications - Volume 8267, WISA 2013, pages 65–80, New York, NY, USA, 2014. Springer-Verlag New York, Inc.
[28] P. Prasse, C. Sawade, N. Landwehr, and T. Scheffer. Learning to identify regular expressions that describe email campaigns. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
[29] A. Rakotomamonjy. Variable selection using SVM based criteria. J. Mach. Learn. Res., 3:1357–1370, Mar. 2003.
[30] L. Ruff. Why do people create fake LinkedIn profiles? http://integratedalliances.com/blog/why-do-people-create-fake-linkedin-profiles.
[31] M. Singh, D. Bansal, and S. Sofat. Detecting malicious users in Twitter using classifiers. In Proceedings of the 7th International Conference on Security of Information and Networks, SIN '14, pages 247:247–247:253, New York, NY, USA, 2014. ACM.
[32] E. Tan, L. Guo, S. Chen, X. Zhang, and Y. Zhao. Unik: Unsupervised social network spam detection. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 479–488, New York, NY, USA, 2013. ACM.
[33] K. Thomas, D. McCoy, C. Grier, A. Kolcz, and V. Paxson. Trafficking fraudulent accounts: The role of the underground market in Twitter spam and abuse. In Proceedings of the 22nd USENIX Conference on Security, SEC'13, pages 195–210, Berkeley, CA, USA, 2013. USENIX Association.
[34] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
[35] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[36] B. Viswanath, M. A. Bashir, M. Crovella, S. Guha, K. P. Gummadi, B. Krishnamurthy, and A. Mislove. Towards detecting anomalous user behavior in online social networks. In Proceedings of the 23rd USENIX Conference on Security Symposium, SEC'14, pages 223–238, Berkeley, CA, USA, 2014. USENIX Association.
[37] B. Viswanath, M. Mondal, K. P. Gummadi, A. Mislove, and A. Post. Canal: Scaling social network-based sybil tolerance schemes. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 309–322, New York, NY, USA, 2012. ACM.
[38] H. Yu. Sybil defenses via social networks: A tutorial and survey. SIGACT News, 42(3):80–101, Oct. 2011.
[39] H. Yu, P. B. Gibbons, M. Kaminsky, and F. Xiao. Sybillimit: A near-optimal social network defense against sybil attacks. IEEE/ACM Trans. Netw., 18(3):885–898, June 2010.
