Journal of Computer Science 10 (10): 2135-2140, 2014
ISSN: 1549-3636
© 2014 Science Publications
doi:10.3844/jcssp.2014.2135.2140 Published Online 10 (10) 2014 (http://www.thescipub.com/jcs.toc)
A STUDY OF SPAM DETECTION ALGORITHM ON
SOCIAL MEDIA NETWORKS
1Saini Jacob Soman and 2S. Murugappan
1Department of Computer Science and Engineering, Sathyabama University, India
2Department of Computer Science and Engineering, Annamalai University, India
Received 2013-11-21; Revised 2014-05-12; Accepted 2014-07-07
ABSTRACT
In today’s world, the issue of identifying spammers has received increasing attention because of its practical
relevance in the field of social network analysis. The growing popularity of social networking sites has
made them prime targets for spammers. By allowing users to publicize and share their independently
generated content, online social networks become susceptible to different types of malicious and
opportunistic user actions. Because of spammers’ activity, social network users are fed irrelevant
information while surfing. Spam pervades any information system, such as e-mail, the web, social networks,
blogs or review platforms. Therefore, this study reviews various spam detection frameworks that deal
with the detection and elimination of spam from various sources.
Keywords: Spam Detection, Spam Analysis, Feature Extraction
Corresponding Author: Saini Jacob Soman, Department of Computer Science and Engineering, Sathyabama University, India

1. INTRODUCTION

Social networks such as Facebook, MySpace, LinkedIn, Friendster (Danah, 2004) and Tickle have
millions of members who use them for both social and business networking. Because of the vast amount
of information on the web, users find useful web pages by querying search engines. Given a query, a
search engine identifies the relevant pages on the web and presents the user with links to those
pages. Spammers try to raise the rank of a target web page in search results through Search Engine
Optimization (SEO), that is, the injection of artificially created pages into the web in order to
influence search engine results and drive traffic to certain pages for fun or profit. Spam first
appeared in email and later spread to social networks. In an email system, a spammer sends unsolicited
bulk email that redirects users to irrelevant websites. The success of such attacks depends almost
entirely on the click-through rate of the email: if the target does not click on the malicious link
presented in the email, the attack usually fails. Many techniques exist to improve click-through
rates, such as hiding the destination of hyperlinks, falsifying header information and creative use
of images (Markus and Ratkiewicz, 2006; Alex and Jakobsson, 2007).

Email messages also take advantage of shared context among friends on a social network, such as
birthday celebrations, residence in the same home town or participation in common events. This shared
context dramatically increases the apparent authenticity of an email, helps it pass filters and
raises the click-through rate for spam that contains advertisements, installs malicious software or
solicits sensitive personal information (Takeda and Takasu, 2008; Kamaliha et al., 2008). In the
context of blog platforms, by contrast, spammers post irrelevant comments on an existing post. They
pursue several kinds of spam promotion, such as splog (the whole blog is used to promote a product or
service), comment spam (comments promoting services with no relation to the blog topic) and trackback
spam (spam that takes advantage of the trackback ping feature of popular blogs to get links from
them). This study primarily surveys the literature that deals with comment spam in blogs. Since
comments are typically
short by nature, comment spam is essentially link spam originating from comments and responses added
to web pages that support dynamic user editing. The presence of spammers in a network decreases both
the quality of the service and the value of the data set as a representation of a real social
phenomenon. With the help of extracted features, it is possible to distinguish spammers from
legitimate users. Various machine learning methods, both supervised and unsupervised, have been used
in the literature to classify such spam. This study reviews a range of these spam detection
algorithms.

2. RELATED WORK

Initially, researchers concentrated on the development of honeypots to detect spam. Webb et al.
(2008) dealt with the automatic collection of deceptive spam profiles in social network communities,
based on the anonymous behavior of users, by using social honeypots. They created unique user
profiles with personal information such as age, gender and date of birth and geographic features such
as locality, and deployed them in the MySpace community. A spammer follows strategies such as staying
active on the web for long periods and sending friend requests. The honey profile monitors spammer
behavior by assigning bots. Once a spammer sends a friend request, the bots store the spammer’s
profile and crawl through the web pages to identify the target page where the advertisements
originated.

The spammer places a woman’s image with a link in the “About Me” section of its profile; the honey
profile bots crawl through the link, parse the profile, extract its URL and store the spammer’s
profile in the spam list. Because a URL does not always redirect during crawling, a “Redirection
Detection Algorithm” is executed to parse the web page and extract the redirection URL, which is then
accessed in order to find the source account. They also proposed a “Shingling Algorithm”, which
checks the collected spam profiles for duplicated content such as URLs, images and comments and
accurately clusters spam and non-spam profiles based on these features. In this way, spam was
eliminated.

Gianluca et al. (2010) used social honeypots to construct honey profiles manually with features such
as age, gender, date of birth, name and surname. The honey profiles were assigned to three social
network communities (MySpace, Facebook and Twitter). The study considered friend requests as well as
the messages (wall posts, status updates) received from spammers and validated them with honeypots.
The spam accounts identified with the help of spam bots share common traits, which were formalized as
features in the honeypots (first feature, URL ratio, message similarity, friend choice, messages
sent, friend number, etc.). The Weka framework with a random forest algorithm was used to classify
spammers with the best accuracy. Similarly, to detect spam campaigns, the spam bots that advertise
the same page, as observed from message content, were clustered based on spam profiles using a naïve
Bayesian classifier.

Similarly, Lee et al. (2010) dealt with social spam detection, which has become a tedious task in
social media. Social honeypots are constructed and deployed based on features such as number of
friends, text on the profile and age. Both legitimate and spam profiles are used as the initial
training set, and a Support Vector Machine is used for classification. An inspector is assigned to
validate the quality of the extracted spam candidates using the “learned classifier” and to provide
feedback to the spam classifier for correct prediction in the future. Three research challenges are
addressed in that study: first, whether the honeypot approach is capable of collecting profiles with
low false positives; second, whether users are correctly predicted; and finally, how effective the
approach is in fighting new and emerging spam attacks.

The first challenge is addressed using an automatic classifier that groups the spammers accurately.
The second considers demographic features for training the classifier with 10-fold cross validation.
The approach was tested on MySpace using a meta classifier; on Twitter, a bigram model was used for
classification along with preprocessing steps. Finally, post filters were used to check the links and
remove the spam label by applying a “Support Vector Machine” for correct prediction. The authors also
proposed that, in future, clique-based social honeypots could be applied with many honey profiles
across many social network communities.

Besides honeypots, spammers have been identified in the literature by analyzing content and link
based features of web pages. In this context, Sreenivasan and Lakshmipathi (2013) performed spam
detection in social media by considering content and link based features.

Web spam misleads search engines into giving a high page rank to pages of no importance by
manipulating the links and contents of the page. Here, to identify web spam, Kullback-Leibler
divergence techniques are used to find
the difference between source page features (anchor text, page title, meta tags) and target page
features (recovery degree, incoming links, broken links, etc.). Three unsupervised models were
considered and compared for web spam detection. The first is a Hidden Markov Model (HMM), which
captures different browsing patterns such as jumping between pages by typing URLs or by opening
multiple windows. The features mentioned above are given as input to the HMM, which is not visible to
the user. A link is then categorized as spam or non-spam based on how frequently a browser moves from
one page to another.

The second method uses “Self Organizing Maps”, a neural model, to classify training data (web links)
without human intervention. It classifies each web link as either spam or non-spam. A third method,
Adaptive Resonance Theory, has also been used to classify a link as either spam or not.

Another work in the literature (Karthick et al., 2011) dealt with the detection of link spam through
hyperlinked pages that are semantically related, using Qualified Link Analysis (QLA). The relation
between the source page and the target page is calculated by extracting features of the two pages
from the web link and comparing them with the contents extracted from these pages. In QLA, nepotistic
links are identified by extracting the URL, anchor text and cached page of the analyzed link stored
in the search engines. During query generation, once the page is available from the search engines,
the result is compared with the page features for easy prediction of spam and non-spam links.

In that study, QLA was combined with language model (LM) detection for better prediction of spam. In
language model detection, the KL divergence technique is used to calculate the difference between the
information of the source page and the content extracted from the link. If they match, the link is
clustered as non-spam; otherwise, as spam. The results of LM detection and QLA, together with
pre-trained link and content features, lead to accurate classification and detection.

Qureshi et al. (2011) handled the problem of eliminating irrelevant blogs when searching for a
general query on the web. The objective is to promote relevancy in the ordering of blogs and to
remove irrelevant blogs from the top search results. The irrelevancy is not caused by spam but by
inappropriate classification of a topic against a query. This approach uses both content and link
structure features for detection. The content features calculate the cosine similarity between the
blog text and the blog post title while searching for a particular blog. A correlation was shown to
exist between these two features, and spammer activity is detected based on the degree of similarity.
This detection achieved a precision of 1.0 and a recall of 0.87.

The blog link structure feature finds spammer activity by decoupling two classes of links (duplicate
and unique) up to three hops. Spammers tend to move within a closed group rather than across the rest
of the blogosphere. The duplicate links are identified and removed.

Wang and Lin (2011) focused on comment spam with hyperlinks. The content of the page for a post is
compared with the content of the link it points to in order to identify spam. The collected blogs are
preprocessed to find the stop word ratio, which is found to be lower in spammers’ posts. The contents
are extracted from the post and sorted, and the “Jaccard” and “Dice” coefficients are calculated,
which give the degree of overlap between words. The degree of overlap is used to calculate the
inter-comment similarity of a comment with respect to a post. Analysis of content features such as
inter-comment similarity and post-comment similarity, along with non-content features such as link
number, comment length and stop words, showed better results in identifying spam links.

Comment-based spam has also been discussed. Archana et al. (2009) dealt with spam that penetrates
blogs in the form of comments. A blog is a type of web content that contains a sequence of periodic
user comments and opinions on a particular topic. A spam comment is an irrelevant response to a blog
post in the form of a comment. These comments are analyzed using supervised and semi-supervised
methods. The analysis considers the following features to identify spam. The post similarity feature
is used to find the relevancy between the post and the comment. The WordNet tool is used to spot word
duplication. The word duplication feature identifies redundant words in comments and is found to be
higher for spam comments and lower for genuine comments. The anchor text feature counts the number of
links in a comment and predicts that spammers are the ones with higher counts. The noun concentration
feature is computed by extracting comments and part-of-speech tags from the sentences; legitimate
users have low noun concentration. The stop word ratio feature considers sentences with a finishing
point, where spammers have a lower stop word
ratio. The number of sentences feature counts the sentences in a comment and is found to be higher
for spammers. The spam similarity feature checks for the presence of listed spam words and
categorizes the comment accordingly. The words identified as spam after preprocessing are assigned a
weight, and content that exceeds the threshold is detected as spam comments. A supervised learning
method (naïve Bayes classifier) is used along with pre-classified training data to label a comment as
spam or non-spam. A further, unsupervised, method directly classifies the comments based on an
expert-specified threshold.

Work has also been carried out on bookmark spammers. Sakakura et al. (2012) deal with bookmark
spammers who create bookmark entries for a target web resource that contains advertisements or
inappropriate content, thereby creating hyperlinks that increase its ranking in search engine
results. A spammer may also create many social bookmark accounts to increase the ranking of that web
resource. Therefore, user accounts must be clustered based on the similarity between their sets of
bookmarks to a particular website or web resource and not based on the contents. In that study, data
preprocessing clusters bookmarks by extracting the website URL from the raw URL, since a spammer may
create different bookmark entries for the same URL.

Three similarities are calculated: the similarity based on raw URLs (the ratio of the number of
common URLs to the total number of URLs contained in the bookmarks of two accounts), the similarity
based on site URLs without duplicates (the ratio of the number of common site URLs to the total
number of all URLs in both accounts) and the similarity based on site URLs with duplicates (sites
weighted by the number of bookmarks common to the user accounts). Agglomerative hierarchical
clustering of accounts is performed based on one of these similarities. A cluster that is large and
has high cohesion is categorized as an intensive bookmark account spammer. The study reports a
precision of 100%.

Yang and Chen (2012) deal with online detection of SMS spam using a naïve Bayesian classifier that
considers both content and social network SMS features. The SMS social network is constructed from
historical data collected over a period with the help of a telecom operator. The extracted content
features are represented in a vector space, and weights are assigned to the vector using a term
frequency function. Feature selection methods such as information gain and odds ratio are used to
select words from SMS messages, with which class dependency and class particularity are found for
clustering “content based features”. The social network features capture both the sending behavior
of mobile users and closeness, in order to distinguish spammers from legitimate users. A Bloom filter
is used to test the membership between sender and receiver for removing spammers’ relationships. A
naïve Bayesian classifier is used to classify users as legitimate or spam using the above features.

Ravindran et al. (2010) deal with the problem of tag recommendation, in which popular tags are
suggested for particular bookmarks based on user feedback, and with filtering spam posts. A spammer
may increase the frequency of a particular tag so that the system suggests those higher-frequency
tags to the user. To address this problem, the study uses the “frequency move to set” model to choose
a set of tags suggested by users for a bookmark. To decide whether a tag is popular enough to be
placed in the suggestion set, tag features such as simple vocabulary similarity are considered. The
suggestion set is kept updated and measured using the stagnation rate, and unpopular tags are removed
randomly from the set. A decision tree classifier is used to classify tags as spam or non-spam. The
accuracy obtained with this approach is about 93.57%.

Ariaeinejad and Sadeghian (2011) deal with detecting email spam by considering plain text alone to
categorize a mail as spam or ham. The words common to spam and ham emails are stored in a white list.
The collected words are parsed by removing unwanted spaces and other signs among the words. The
parsed words are compared with the white list and the common words are eliminated. The cleaned words
are then checked, for making the decision, using the “Jaro-Winkler” technique. A fuzzy map is
constructed using interval type-2 fuzzy methods: in two dimensions, the horizontal vector represents
the distance of each word in the email to the word with the closest similarity in the dictionaries
and the vertical vector represents the weight of the words in the dictionary. A third dimension
considers the importance of a word in an email and its frequency, identified using the term
frequency-inverse document frequency technique. Email is categorized into spam, ham and uncertain
zone using fuzzy C-means clustering. Later, the words are updated consistently for correct
prediction.

Another work (Ishida, 2009) deals with the mutual detection of spam blogs and keywords based on their
co-occurrence in a cluster. A shared interest algorithm is applied to the collected blogs. This
algorithm constructs a bipartite graph between blogs and low-frequency keywords, from which clusters
of varying size are formed. A spam score is calculated for each cluster by multiplying the number of
blogs by the number of keywords in the cluster, and the clusters are ranked by this score. The spam
blog with the highest score is considered the spam seed and is stored in a list. A threshold is set
manually for the ranked spam blogs and keywords. Those that exceed the threshold are detected as spam
blogs, and spam keywords are removed from the list. This approach provides mutual detection, thereby
reducing the filtering cost, and the words are kept updated.

3. CONCLUSION

This survey has presented various approaches that identify or detect spam in social networks by
extracting the necessary information from web pages. Many researchers have worked on honeypot
profiles, whereas fewer have worked on identifying spam links. Work has also been carried out on
email and SMS spam. Nevertheless, this area remains in its infancy and more spam detection algorithms
need to be devised for social media networks.

4. REFERENCES

Alex, T. and M. Jakobsson, 2007. Deceit and deception: A large user study of phishing. Pennsylvania
State University.
Archana, B., V. Rus and D. Dasgupta, 2009. Characterizing comment spam in the blogosphere through
content analysis. Proceedings of the IEEE Symposium on Computational Intelligence in Cyber Security,
Mar. 3-3, IEEE Xplore Press, Nashville, TN, pp: 37-44. DOI: 10.1109/CICYBS.2009.4925088
Ariaeinejad, R. and A. Sadeghian, 2011. Spam detection system: A new approach based on interval
type-2 fuzzy sets. Proceedings of the 24th Canadian Conference on Electrical and Computer
Engineering, May 8-11, Niagara Falls, pp: 379-384. DOI: 10.1109/CCECE.2011.6030477
Danah, M.B., 2004. Friendster and publicly articulated social networking. Proceedings of Extended
Abstracts on Human Factors in Computing Systems, (CHI ’04), Vienna, Austria, pp: 1279-1282. DOI:
10.1145/985921.986043
Gianluca, S., C. Kruegel and G. Vigna, 2010. Detecting spammers on social networks. Proceedings of
the Annual Computer Security Applications Conference, Dec. 6-10, New York, pp: 1-9. DOI:
10.1145/1920261.1920263
Ishida, K., 2009. Mutual detection between spam blogs and keywords based on cooccurrence cluster
seed. Proceedings of the 1st International Conference on Networked Digital Technologies, Jul. 28-31,
IEEE Xplore Press, Ostrava, pp: 8-13. DOI: 10.1109/NDT.2009.5272171
Wang, J.H. and M.S. Lin, 2011. Using inter-comment similarity for comment spam detection in Chinese
blogs. Proceedings of the International Conference on Advances in Social Networks Analysis and
Mining, Jul. 25-27, IEEE Xplore Press, Kaohsiung, pp: 189-194. DOI: 10.1109/ASONAM.2011.49
Kamaliha, E., F. Riahi, V. Qazvinian and J. Adibi, 2008. Characterizing network motifs to identify
spam comments. Proceedings of the IEEE International Conference on Data Mining Workshops, Dec. 15-19,
IEEE Xplore Press, Pisa, pp: 919-928. DOI: 10.1109/ICDMW.2008.72
Karthick, K., V. Sathiya and J. Pugalendiran, 2011. Detecting nepotistic links based on qualified
link analysis and language models. Int. J. Comput. Trends Tech.
Lee, K., J. Caverlee and S. Webb, 2010. Uncovering social spammers: Social honeypots + machine
learning. Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in
Information Retrieval, Jul. 19-23, New York, pp: 435-442. DOI: 10.1145/1835449.1835522
Markus, J. and J. Ratkiewicz, 2006. Designing ethical phishing experiments: A study of (ROT13) rOnl
query features. Proceedings of the 15th International Conference on World Wide Web, May 22-26, IEEE
Xplore Press, New York, pp: 513-522. DOI: 10.1145/1135777.1135853
Qureshi, M.A., A. Younus, N. Touheed, M.S. Qureshi and M. Saeed, 2011. Discovering irrelevance in the
blogosphere through blog search. Proceedings of the International Conference on Advances in Social
Networks Analysis and Mining, Jul. 25-27, IEEE Xplore Press, Kaohsiung, pp: 457-460. DOI:
10.1109/ASONAM.2011.84
Ravindran, P.P., A. Mishra, P. Kesavan and S. Mohanavalli, 2010. Randomized tag recommendation in
social networks and classification of spam posts. Proceedings of the IEEE International Workshop on
Business Applications of Social Network Analysis, Dec. 15-15, IEEE Xplore Press, Bangalore, pp: 1-6.
DOI: 10.1109/BASNA.2010.5730294
Sakakura, Y., T. Amagasa and H. Kitagawa, 2012. Detecting social bookmark spams using multiple user
accounts. Proceedings of IEEE/ACM International Conference on Advances in Social Networks Analysis
and Mining, (NAM ’12), Washington, pp: 1153-1158. DOI: 10.1109/ASONAM.2012.199
Sreenivasan, S. and B. Lakshmipathi, 2013. An unsupervised model to detect web spam based on
qualified link analysis and language models. Int. J. Comput. Applic., 63: 33-37. DOI:
10.5120/10455-5163
Takeda, T. and A. Takasu, 2008. A splog filtering method based on string copy detection. Proceedings
of the 1st International Conference on Applications of Digital Information and Web Technologies,
Aug. 4-6, IEEE Xplore Press, Ostrava, pp: 543-548. DOI: 10.1109/ICADIWT.2008.4664407
Webb, S., J. Caverlee and C. Pu, 2008. Social honeypots: Making friends with a spammer near you.
Paper Presented Meeting CEAS.
Yang, Y. and Y. Chen, 2012. A novel content based and social network aided online spam short message
filter. Proceedings of 10th World Congress on Intelligent Control and Automation, Jul. 6-8, IEEE
Xplore Press, Beijing, pp: 444-449. DOI: 10.1109/WCICA.2012.6357916