Comparative Study of Document Similarity Algorithms and Clustering Algorithms For Sentiment Analysis
ISSN 2278-6856
2Department of Computer Engineering, P.E.S Modern College of Engineering, Shivajinagar, Pune, India
3Department of Computer Engineering, P.E.S Modern College of Engineering, Shivajinagar, Pune, India
4Department of Computer Engineering, P.E.S Modern College of Engineering, Shivajinagar, Pune, India
Abstract
Sentiment analysis is the field of study that analyzes people's
opinions, sentiments and emotions towards entities such as
products, services, events, topics, and their attributes. With the
explosive growth of social media (e.g. reviews, blogs, Twitter,
comments in social network sites) on the Web, individuals and
organizations are increasingly using the content in these media
for decision making. In the real world, businesses and
organizations always want to find consumer or public opinions
about their products and services. Individual consumers also want
to know the opinions of existing users of a product before
purchasing it. Document similarity is a metric defined over a set
of documents, where the idea of distance between them is based on
the likeness of their meaning or semantic content. Clustering is a
useful technique that organizes a large quantity of unordered text
documents into a small number of meaningful and coherent clusters.
1. INTRODUCTION
In this paper, we present a comparative study of
document similarity and clustering algorithms for sentiment
analysis. A basic task in sentiment analysis is classifying the
polarity of a given text at the document, sentence, or aspect
level: whether the opinion expressed in a document, sentence,
or aspect is positive, negative, or neutral. Beyond polarity,
sentiment classification looks at emotional states such as
"angry," "sad," and "happy" [2].
Similarity is the state or fact of being similar, where similar
means having a resemblance in appearance, character, or
quantity without being identical. In order to compute the
similarity of documents we need some mathematical
expression or an algorithm the computer can work with. This
3. CLUSTERING
Clustering is an unsupervised learning task where one seeks
to identify a finite set of categories, termed clusters, to
describe the data. A similarity metric is defined between
items of data, and then similar items are grouped together to
form clusters. The grouping of data into clusters is based on
the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity. A good clustering
method will produce high-quality clusters with high
intra-class similarity (objects are similar to one another
within the same cluster) and low inter-class similarity
(objects are dissimilar to the objects in other clusters). The
quality of a clustering result depends on both the similarity
measure used by the method and its implementation [4].
Clustering algorithms can be broadly classified into three
categories, which are discussed in the following subsections
together with specific algorithms:
Partitioning
Hierarchical
Density-based
[7].
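The principle of high intra-class and low inter-class similarity can be checked numerically for a given grouping. The following is a minimal sketch, not a method taken from this study: it assumes cosine similarity as the similarity measure, and the helper names (cosine, mean_intra_similarity, mean_inter_similarity) and the small term-count vectors are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_a * norm_b)


def mean_intra_similarity(clusters):
    """Average pairwise similarity of items that share a cluster."""
    sims = [cosine(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(sims) / len(sims) if sims else 0.0


def mean_inter_similarity(clusters):
    """Average pairwise similarity of items that lie in different clusters."""
    sims = [
        cosine(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical term-count vectors, already grouped into two clusters.
clusters = [
    [[5, 1, 0], [4, 2, 0]],
    [[0, 1, 5], [0, 2, 4]],
]

intra = mean_intra_similarity(clusters)
inter = mean_inter_similarity(clusters)
print(f"intra-class: {intra:.3f}, inter-class: {inter:.3f}")
```

For a grouping that follows the principle, the intra-class average should be clearly larger than the inter-class average, as it is for this toy data.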
For the cosine similarity, ta and tb are m-dimensional vectors
over the term set T = {t1, ..., tm}. Each dimension represents a
term with its weight in the document, which is always
non-negative. As a result, the cosine similarity is non-negative
and bounded in [0,1]. An important feature of the cosine
similarity is its independence of document length. For
example, if we combine two identical copies of a document
d to get a new pseudo document d', the cosine similarity
between d and d' will be 1, which means that these two
documents are regarded as identical. Meanwhile, given
another document l, d and d' will have the same similarity
value to l, i.e. sim(td, tl) = sim(td', tl). In other words,
documents with the same composition but different totals
will be treated identically. Strictly speaking, this does not
satisfy the second condition of a metric, because after all the
combination of two copies is a different object from the
original document. However, in practice the term vectors
are normalized to unit length, and in this case the
representations of d and d' are the same [3].
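The length-independence property can be verified with a short computation. The sketch below assumes the standard definition of cosine similarity, the dot product of the two term vectors divided by the product of their Euclidean norms. The function name cosine_similarity and the term counts are hypothetical and serve only as an illustration.

```python
import math


def cosine_similarity(ta, tb):
    """Dot product of ta and tb divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(ta, tb))
    norm_a = math.sqrt(sum(x * x for x in ta))
    norm_b = math.sqrt(sum(y * y for y in tb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_a * norm_b)


# Hypothetical term counts over a four-term vocabulary.
t_d = [3, 0, 1, 2]                  # document d
t_d_prime = [2 * w for w in t_d]    # pseudo document: two copies of d
t_l = [1, 1, 0, 2]                  # another document l

sim_d_dprime = cosine_similarity(t_d, t_d_prime)
sim_d_l = cosine_similarity(t_d, t_l)
sim_dprime_l = cosine_similarity(t_d_prime, t_l)

print(sim_d_dprime)           # equal to 1 up to floating-point rounding
print(sim_d_l, sim_dprime_l)  # the two values agree
```

Doubling every term count scales the vector without changing its direction, which is why both comparisons come out the same and why normalizing to unit length makes the two representations coincide.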
3.2 Hierarchical algorithms
Unlike partitioning methods that create a single partition,
hierarchical algorithms produce a nested hierarchy of clusters,
with a single all-inclusive cluster at the top and singleton
clusters of individual points at the bottom. The hierarchy can be
formed in top-down or bottom-up fashion and need not
5. CONCLUSION
6. ACKNOWLEDGEMENT
We would like to express our gratitude to Prof. Ms Deipali
Gore and Prof. Mrs Manisha Petare, who guided us on
matters where we needed clarity about the subject.
We are thankful for their inspiring guidance, invaluable
constructive criticism and advice during the course of this
study.
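[Table: comparison of clustering algorithms, including the density-based algorithm and the HC algorithm, with ratings such as Good, Better and Best and remarks such as "Easy to implement", "Complex", "Moderate", "Sensitive" and "More Sensitive"; the row and column layout is not recoverable.]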