Sentiment Analysis On Movie Reviews: Natural Language Processing UML602 Project Report
Project Report
Submitted by:
Submitted to:
1. INTRODUCTION
Sentiment analysis means analyzing the sentiment of a given text or document and assigning it to a specific class or category, such as positive or negative. In this project the classification is binary: every review is labeled either positive or negative. By definition, sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is also referred to as opinion mining, and it is most commonly applied to social media posts and customer-review data.
Figure 1.1
1.2 Libraries used
NLTK is a leading platform for building Python programs to work with human language data. It
provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing,
and semantic reasoning.
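The NLTK resources used later in the project have to be fetched once with NLTK's downloader; as a quick sketch:

import nltk

# One-off downloads of the corpora/resources used in this project.
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('punkt')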
2. STEPS OF WORKING
In this project, NLTK's movie_reviews corpus is used as the labeled training data. The movie_reviews corpus contains 2,000 movie reviews annotated with sentiment polarity: 1,000 positive and 1,000 negative. The two categories for classification are positive and negative. Since the corpus already provides these labels, a supervised classification technique is used, in which the classifier is trained on labeled training data.
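As a minimal sketch (variable names are illustrative, not the report's exact code), the labeled documents can be assembled directly from the corpus:

from nltk.corpus import movie_reviews

# One (list_of_words, label) pair per review; the labels are 'pos' and 'neg'.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]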
The figure below depicts the workflow followed while training and testing the model.
Figure 2.2
2.1 Pre-processing of data
Three different approaches are used to pre-process the data and build feature sets, with the aim of maximizing training and testing accuracy:
a document feature built from the top-N most frequently occurring words in the corpus
bag_of_words: extracts only unigram features from the movie-review words
bag_of_ngrams: extracts bigram features, which are combined with the unigram features
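The two extractor functions are shown only as figures in the report; as an illustrative sketch (the function bodies are assumptions, not the report's exact code), they might look like this:

from nltk import bigrams

def bag_of_words(words):
    # Unigram features: every word maps to True.
    return {word: True for word in words}

def bag_of_ngrams(words):
    # Bigram features combined with the unigram features.
    features = bag_of_words(words)
    features.update({bigram: True for bigram in bigrams(words)})
    return features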
2.2 Training of model
The model is trained using NLTK's Naïve Bayes classifier, which is built into the module. It is a simple, fast classifier that performs well on small datasets. It is a probabilistic classifier based on applying Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event.
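For reference, this is the relationship the classifier relies on, written for a class c and features f_1, ..., f_n under the naïve assumption that the features are conditionally independent given the class:

P(c \mid f_1, \dots, f_n) \propto P(c) \prod_{i=1}^{n} P(f_i \mid c)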
2.3 Testing of model
The model's accuracy is tested on data held out from the corpus as well as on custom reviews entered by the user.
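As a rough sketch of the custom-input test (the custom_review text is only an example; bag_of_words and classifier are the objects defined in the code section below):

from nltk import word_tokenize

custom_review = "The movie was absolutely wonderful, with a gripping plot."
features = bag_of_words(word_tokenize(custom_review.lower()))
print(classifier.classify(features))  # prints 'pos' or 'neg'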
3. CODE
Pre-processing of data
The code shown below builds a frequency distribution of all the words in the corpus and removes stop words and punctuation from the text; the resulting cleaned words are added to a new list.
Figure 3.1
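Figure 3.1 is not reproduced here; a minimal sketch of what this pre-processing step might look like (variable names are assumptions):

import string
from nltk import FreqDist
from nltk.corpus import movie_reviews, stopwords

stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Drop stop words and punctuation, keep the remaining words in a new list.
cleaned_words = [word.lower() for word in movie_reviews.words()
                 if word.lower() not in stop_words and word not in punctuation]

# Frequency distribution of all cleaned words in the corpus.
all_words_freq = FreqDist(cleaned_words)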
Creating document feature using top-N occurring words
The code shown below creates the document feature using the 2,000 most frequently occurring words, then trains the model with the Naïve Bayes classifier and prints the accuracy of the model.
Figure 3.2
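Figure 3.2 is not reproduced here; a sketch of this step under the assumptions above (the 2,000-word vocabulary comes from the report, while the 200-review test split is an illustrative choice):

import random
from nltk import NaiveBayesClassifier, classify

# Feature vocabulary: the 2,000 most frequent cleaned words.
word_features = [word for word, _ in all_words_freq.most_common(2000)]

def document_features(document_words):
    words = set(document_words)
    return {word: (word in words) for word in word_features}

feature_sets = [(document_features(words), label) for words, label in documents]
random.shuffle(feature_sets)
train_set, test_set = feature_sets[200:], feature_sets[:200]

classifier = NaiveBayesClassifier.train(train_set)
print(classify.accuracy(classifier, test_set))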
Creating feature sets using the bag of words method
The code shown below collects the positive and negative reviews into separate lists, which allows the positive and negative data to be pre-processed separately.
Figure 3.3
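Figure 3.3 is not reproduced here; as an illustrative sketch, the reviews can be split into positive and negative lists and turned into labeled feature sets with the bag_of_words extractor defined earlier:

from nltk.corpus import movie_reviews

pos_reviews = [movie_reviews.words(fileid) for fileid in movie_reviews.fileids('pos')]
neg_reviews = [movie_reviews.words(fileid) for fileid in movie_reviews.fileids('neg')]

# One labeled feature set per review, using only unigram features.
pos_features = [(bag_of_words(words), 'pos') for words in pos_reviews]
neg_features = [(bag_of_words(words), 'neg') for words in neg_reviews]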
Bi-Gram Feature
In bag of words feature extraction, we used only unigrams. In the example below, we use both unigram and bigram features, i.e. we deal with both single words and word pairs.
Figure 3.4
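Figure 3.4 is not reproduced here; reusing the bag_of_ngrams extractor sketched earlier, the combined unigram + bigram feature sets might be built like this:

# Combined unigram + bigram features for each review.
pos_features_ngrams = [(bag_of_ngrams(words), 'pos') for words in pos_reviews]
neg_features_ngrams = [(bag_of_ngrams(words), 'neg') for words in neg_reviews]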
Training the model
After pre-processing, NLTK's Naïve Bayes classifier is trained on the created feature sets.
Figure 3.5
Figure 3.6
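Figures 3.5 and 3.6 are not reproduced here; a sketch of the training step on the combined feature sets (the shuffle and the 200-review test split are illustrative assumptions):

import random
from nltk import NaiveBayesClassifier, classify

feature_sets = pos_features_ngrams + neg_features_ngrams
random.shuffle(feature_sets)
train_set, test_set = feature_sets[200:], feature_sets[:200]

# Train the Naive Bayes classifier and report accuracy on the held-out set.
classifier = NaiveBayesClassifier.train(train_set)
print(classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)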
4. RESULTS
Figure 4.1
We can see that custom negative reviews are categorized accurately, but in the case of custom positive reviews we get inaccurate results.
In the top-N feature approach, we only used the top 2,000 words in the feature set.
We combined the positive and negative reviews into a single list, randomized the list, and then separated the train and test sets.
This approach can result in an uneven distribution of positive and negative reviews across the train and test sets.
Bag of words Feature –
Figure 4.2
Using the bag of words feature we now get appropriate results on the custom test reviews, but the overall accuracy of the model decreases to 70%.
Bi-gram Feature –
Figure 4.3
The accuracy of the classifier increased significantly when it was trained with the combined (unigram + bigram) feature set, rising to 77%.
5. APPLICATIONS & FUTURE SCOPE
5.1 Brand Monitoring - also called reputation management. A good reputation matters greatly these days, when the majority of us check social media reviews as well as review sites before making a purchase decision.
5.2 Customer support - Social media are now primary channels of communication with your customers, and whenever they are unhappy about something related to you, whether or not it is your fault, they will call you out on Facebook, Twitter, or Instagram.
Such mentions will appear in your dashboard flagged in red, and it is best to start engaging with them as soon as they appear.
People nowadays expect brands to respond on social media almost immediately, and if you are not quick enough, you may well see them moving on to your competitors instead of waiting for your reply.