VERZEO
MACHINE
LEARNING JUNE
MAJOR PROJECT
PRESENTED BY:
MANEESHA NIDIGONDA
ABSTRACT
With the rise and growth of social networking, the Internet has become a
promising platform for online learning, exchanging ideas and sharing opinions.
Social media contain a huge amount of sentiment data in the form of tweets,
blogs, status updates, posts, and so on. In this paper, the most popular
microblogging platform, Twitter, is used. Twitter sentiment analysis applies
sentiment analysis to data from Twitter (tweets) in order to extract users' opinions
and sentiments. The main goal is to explore how text analysis techniques can be used
to dig into the data in a series of posts, focusing on trends in tweet languages and
tweet volumes on Twitter. Experimental evaluations show that the proposed machine
learning classifiers are efficient and perform well in terms of accuracy. The
proposed algorithm is implemented in Python.
Keywords – Machine Learning, Natural Language Processing, Python, Sentiment
Analysis
SENTIMENT ANALYSIS
Sentiment Analysis (SA) is an active field of research within text mining. SA is the
computational treatment of opinions, sentiments and subjectivity of text. This survey
paper gives a comprehensive overview of recent developments in the field. Many
recently proposed algorithm enhancements and various SA applications are
investigated and presented briefly in this survey. The articles are categorized
according to their contributions to the various SA techniques. Fields related to SA
that have recently attracted researchers (transfer learning, emotion detection, and
building resources) are discussed. The main aim of this survey is to give a nearly
complete picture of SA techniques and the related fields in brief. The main
contributions of this paper are the detailed categorization of a large number of
recent articles and the illustration of recent research trends in sentiment analysis
and its related areas.
1. Introduction
Sentiment Analysis (SA) or Opinion Mining (OM) is the computational study of
people's opinions, attitudes and emotions toward an entity. The entity can represent
individuals, events or topics. These topics are most likely to be covered by reviews.
The expressions SA and OM are often used interchangeably to express the same meaning.
However, some researchers state that OM and SA have slightly different notions [1].
Opinion Mining extracts and analyses people's opinions about an entity, while
Sentiment Analysis identifies the sentiment expressed in a text and then analyses it.
Therefore, the target of SA is to find opinions, identify the sentiments they express,
and then classify their polarity, as shown in Fig. 1.
Figure 1. Sentiment analysis process on product reviews.
Sentiment Analysis can be considered a classification process, as illustrated in
Fig. 1. There are three main classification levels in SA: document-level,
sentence-level, and aspect-level SA. Document-level SA aims to classify an opinion
document as expressing a positive or negative opinion or sentiment. It considers the
whole document a basic information unit (talking about one topic). Sentence-level SA
applies the same classification to each sentence. Aspect-level SA is needed because
opinion holders can give different opinions on different aspects of the same entity,
as in the sentence "The voice quality of this phone is not good, but the battery life
is long". This survey tackles the first two kinds of SA.
The data sets used in SA are an important issue in this field. The main sources of data
are product reviews. These reviews are important to business owners, who can take
business decisions based on the analysis of users' opinions about their products.
The reviews come mainly from review sites.
Sentiment Analysis Dataset
The dataset we use for sentiment analysis is the Internet Movie Database
(IMDb) review dataset: 50,000 reviews of movies from all over the world. It is
a binary classification dataset in which each review is labelled positive or
negative, with 25,000 samples for training and 25,000 for testing.
You don't need to download it separately for this project, but you can have a
look at it on its official website. Because it is a text dataset it is lightweight,
at around 80 MB.
We are going to code all of this in a Jupyter notebook on Google Colab to make
use of the free GPU. If you follow along on your own system, everything will be
much the same except for mounting Google Drive as a persistent storage
option.
• We begin by mounting our Google Drive and navigating to the folder
where we have to work.

import os
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/Data Flair/Sentiment')
!ls
Preparation of data
We are going to use Python for this project, and luckily it comes with
libraries that help speed up our work.
The torchtext library is a great tool for NLP projects. It has loaders for some
common NLP datasets, including the one we use here, and a complete pipeline
for vectorizing data, building data loaders and iterating over data.
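import random
import torch
# torchtext.legacy is available in torchtext 0.9 to 0.11.
from torchtext.legacy import data
from torchtext.legacy import datasets

# Fix the seeds so that runs are reproducible.
seed = 42
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Text field: spaCy tokenization, keeping the length of every review.
txt = data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm',
                 include_lengths=True)
labels = data.LabelField(dtype=torch.float)

# Download IMDb and split it into train/test, then carve a validation set
# out of the training set (split() defaults to a 70/30 ratio).
train_data, test_data = datasets.IMDB.splits(txt, labels)
# random.seed() seeds Python's RNG and returns None, so split() falls back
# to the seeded global random state.
train_data, valid_data = train_data.split(random_state=random.seed(seed))

# Keep the 25,000 most frequent words and attach pre-trained GloVe vectors.
num_words = 25_000
txt.build_vocab(train_data, max_size=num_words, vectors="glove.6B.100d",
                unk_init=torch.Tensor.normal_)
labels.build_vocab(train_data)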
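
The effect of max_size in the vocabulary step can be illustrated without torchtext. The sketch below is a simplified stand-in, not torchtext's implementation: it keeps the most frequent words of a toy corpus and maps everything else to an unknown token. The function names build_capped_vocab and numericalize and the toy corpus are hypothetical; the <unk> and <pad> tokens mirror the defaults of the torchtext field.

```python
from collections import Counter


def build_capped_vocab(tokenized_docs, max_size):
    """Map the `max_size` most frequent tokens to integer ids.

    Ids 0 and 1 are reserved for the unknown and padding tokens.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi):
    """Replace each token by its id, falling back to the <unk> id."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


# Toy corpus (hypothetical); the real vocabulary is built from IMDb reviews.
docs = [
    ["a", "great", "movie"],
    ["a", "dull", "movie"],
    ["a", "great", "cast"],
]
stoi, itos = build_capped_vocab(docs, max_size=3)
print(itos)
print(numericalize(["a", "great", "plot"], stoi))
```

Words outside the cap ("dull", "cast") and unseen words ("plot") all share the id of the unknown token, which is why the model only ever sees max_size + 2 distinct ids.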
Here we have downloaded the IMDb dataset and divided it into training,
validation and test splits. The dataset already comes divided into a training
and a test set; we create a validation set from the training set.
We also limit the vocabulary to 25,000 words: the 25,000 most frequent words
in the training data are kept for training, and all other words are mapped to
an unknown token. This significantly reduces the work of the model without
any real loss in accuracy.

We then create the batch iterators and define the LSTM model.

btch_size = 64
train_itr, valid_itr, test_itr = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=btch_size, sort_within_batch=True, device=device)

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, word_limit, dimension_embedding, dimension_hidden,
                 dimension_output, number_layers, bidirectional, dropout,
                 pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(word_limit, dimension_embedding,
                                      padding_idx=pad_idx)
        self.rnn = nn.LSTM(dimension_embedding, dimension_hidden,
                           num_layers=number_layers,
                           bidirectional=bidirectional, dropout=dropout)
        # The classifier reads the forward and backward hidden states.
        self.fc = nn.Linear(dimension_hidden * 2, dimension_output)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, length_txt):
        embedded = self.dropout(self.embedding(text))
        # Pack the batch so that the LSTM skips the padding positions.
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, length_txt.to('cpu'))
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # Concatenate the final forward and backward states of the top layer.
        hidden = self.dropout(
            torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)
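
BucketIterator groups reviews of similar length so that each batch needs little padding, and sort_within_batch=True orders each batch from longest to shortest, which is the order pack_padded_sequence expects by default. A rough pure-Python illustration of the idea follows. The helper bucket_batches is hypothetical and much simpler than torchtext's iterator, which also shuffles; the pad id of 1 is an assumption for the example.

```python
def bucket_batches(sequences, batch_size, pad_id=1):
    """Batch sequences of similar length and pad each batch to its longest item."""
    # Sort by length, longest first, so neighbours have similar lengths.
    ordered = sorted(sequences, key=len, reverse=True)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = len(chunk[0])
        # Pad every sequence in the chunk up to the longest one.
        padded = [seq + [pad_id] * (longest - len(seq)) for seq in chunk]
        lengths = [len(seq) for seq in chunk]
        batches.append((padded, lengths))
    return batches


# Four toy "reviews" given as lists of word ids (made-up values).
reviews = [[5, 6], [7, 8, 9, 10], [11], [12, 13, 14]]
for padded, lengths in bucket_batches(reviews, batch_size=2):
    print(padded, lengths)
```

Here the sorted batches need two padding tokens in total, whereas batching the reviews in their original order would need four. The lengths returned next to each batch play the role of length_txt in the forward method.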
We define the hyperparameters of the sentiment analysis model and pass them
to an instance of the model class we just defined: the input (vocabulary) size,
the embedding, hidden and output dimensions, the number of layers, the
dropout rate and the bidirectionality flag.

dimension_input = len(txt.vocab)
dimension_embedding = 100
dimension_hidden = 256
dimension_out = 1
layers = 2
bidirectional = True
dropout = 0.5
idx_pad = txt.vocab.stoi[txt.pad_token]

model = RNN(dimension_input, dimension_embedding, dimension_hidden,
            dimension_out, layers, bidirectional, dropout, idx_pad)
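
With these hyperparameters the size of the model can be estimated by hand before printing it. The sketch below counts the weights of the embedding, the stacked bidirectional LSTM and the linear layer, following the parameter layout of torch.nn.LSTM (four gates, two bias vectors per gate set). The vocabulary size of 25,002 is an assumption: 25,000 words plus the unknown and padding tokens. The helper name lstm_param_count is hypothetical.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional):
    """Count weights and biases of a stacked LSTM laid out like torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the concatenated outputs of all directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        # Four gates, each with input weights, recurrent weights and two biases.
        per_direction = 4 * (layer_input * hidden_size
                             + hidden_size * hidden_size
                             + 2 * hidden_size)
        total += directions * per_direction
    return total


vocab_size = 25_002  # assumption: 25,000 words plus <unk> and <pad>
embedding_dim, hidden_dim, output_dim = 100, 256, 1

embedding = vocab_size * embedding_dim
lstm = lstm_param_count(embedding_dim, hidden_dim, num_layers=2, bidirectional=True)
linear = (hidden_dim * 2) * output_dim + output_dim  # weights plus bias

print(f"embedding: {embedding:,}  lstm: {lstm:,}  linear: {linear:,}")
print(f"total: {embedding + lstm + linear:,}")
```

Under these assumptions the embedding table accounts for slightly more than half of the parameters, which is one reason why starting from pre-trained vectors is attractive.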
Now we print some details about the model, namely the number of trainable
parameters it contains.
We then take the pre-trained embedding weights and copy them into the model,
so that it does not have to learn the embeddings from scratch and can focus on
the job at hand: learning the sentiment associated with those embeddings.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

pretrained_embeddings = txt.vocab.vectors
print(pretrained_embeddings.shape)
# Copy the pre-trained GloVe vectors into the embedding layer.
model.embedding.weight.data.copy_(pretrained_embeddings)

# Zero the vectors of the unknown and padding tokens.
unique_id = txt.vocab.stoi[txt.unk_token]
model.embedding.weight.data[unique_id] = torch.zeros(dimension_embedding)
model.embedding.weight.data[idx_pad] = torch.zeros(dimension_embedding)
print(model.embedding.weight.data)

import torch.optim as optim

optimizer = optim.Adam(model.parameters())
# Binary cross-entropy on raw scores; the sigmoid is applied inside the loss.
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

def bin_acc(preds, y):
    # Round the sigmoid output to 0 or 1 and compare with the labels.
    predictions = torch.round(torch.sigmoid(preds))
    correct = (predictions == y).float()
    acc = correct.sum() / len(correct)
    return acc
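
The loss and the accuracy helper both start from the raw score (logit) that the model outputs. As a minimal sketch of what they compute, the pure-Python version below applies the numerically stable form of binary cross-entropy on logits and the same round-the-sigmoid rule for accuracy. The function names and the four example scores are made up for illustration and are not taken from the trained model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw score (logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def binary_accuracy(logits, targets):
    """Fraction of examples whose rounded sigmoid output equals the target."""
    hits = [round(sigmoid(z)) == t for z, t in zip(logits, targets)]
    return sum(hits) / len(hits)


# Made-up scores and labels for four reviews.
logits = [2.0, -1.5, 0.3, -0.2]
targets = [1, 0, 0, 1]

mean_loss = sum(bce_with_logits(z, t) for z, t in zip(logits, targets)) / len(logits)
print(f"loss={mean_loss:.4f} accuracy={binary_accuracy(logits, targets):.2f}")
```

Working on logits rather than on probabilities avoids taking the logarithm of a value that has already been rounded to 0 or 1 by floating-point arithmetic, which is the usual reason for preferring a loss that applies the sigmoid internally.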
We define the functions for training and evaluating the model. The process
here is standard. We loop over the epochs, and within each epoch the number
of iterations follows from the batch size that we defined. We pass the text to
the model, get its predictions, calculate the loss for each batch and then
backpropagate that loss.
The only major change in the evaluation function is that we do not
backpropagate the loss or update the weights: we switch the model to
evaluation mode and wrap the loop in torch.no_grad(), which disables
gradient tracking.

def train(model, itr, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for i in itr:
        optimizer.zero_grad()
        text, length_txt = i.text
        predictions = model(text, length_txt).squeeze(1)
        loss = criterion(predictions, i.label)
        acc = bin_acc(predictions, i.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(itr), epoch_acc / len(itr)

def evaluate(model, itr, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for i in itr:
            text, length_txt = i.text
            predictions = model(text, length_txt).squeeze(1)
            loss = criterion(predictions, i.label)
            acc = bin_acc(predictions, i.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(itr), epoch_acc / len(itr)
We build a helper function, epoch_time, for calculating and printing the time
each epoch takes to complete. We set the number of epochs to 5 and then
begin training, recording the training and validation loss at each epoch in
case we want to inspect or plot the training curve later. We save the
sentiment analysis model that has the best validation loss.
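
The timing helper and the training loop described here could look roughly like the sketch below. It is a sketch under stated assumptions, not the notebook's exact code: the function fit and its callback arguments are hypothetical, and the loss values are made-up stand-ins so that the example runs without a model. In the notebook the callbacks would wrap the train and evaluate functions, and saving would be a call such as torch.save(model.state_dict(), 'tut2-model.pt').

```python
import time


def epoch_time(start, end):
    """Split an elapsed time in seconds into whole minutes and seconds."""
    elapsed = end - start
    minutes = int(elapsed / 60)
    seconds = int(elapsed - minutes * 60)
    return minutes, seconds


def fit(n_epochs, train_one_epoch, validate, save_checkpoint):
    """Run the epochs, keep a loss history, and checkpoint on the best validation loss."""
    best_valid_loss = float("inf")
    history = []
    for epoch in range(n_epochs):
        start = time.time()
        train_loss = train_one_epoch()
        valid_loss = validate()
        mins, secs = epoch_time(start, time.time())
        history.append((train_loss, valid_loss))
        # Only overwrite the checkpoint when the validation loss improves.
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            save_checkpoint(epoch)
        print(f"Epoch {epoch + 1:02} | {mins}m {secs}s | "
              f"train loss {train_loss:.3f} | valid loss {valid_loss:.3f}")
    return history, best_valid_loss


# Stand-in losses (made-up numbers, not results from the project).
train_losses = iter([0.65, 0.55, 0.45, 0.40, 0.35])
valid_losses = iter([0.60, 0.50, 0.52, 0.44, 0.47])
saved_epochs = []

history, best = fit(
    n_epochs=5,
    train_one_epoch=lambda: next(train_losses),
    validate=lambda: next(valid_losses),
    save_checkpoint=saved_epochs.append,
)
```

Checkpointing on validation loss rather than on the last epoch means that a late epoch that starts to overfit does not replace a better earlier model.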
We load the saved checkpoint of the model and evaluate it on the test set that
we created earlier. In our run, the sentiment analysis model achieved a test
accuracy of 85.83%.

model.load_state_dict(torch.load('tut2-model.pt'))
test_loss, test_acc = evaluate(model, test_itr, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
We can also check the model on our own text. The model is trained on movie
reviews labelled positive or negative, so we pass it review-like sentences; a
neutral band is added afterwards by thresholding the score. For this we
import and load spaCy to tokenize the input we give to the model. Earlier,
while defining the preprocessing, we used the spaCy tokenizer built into
torchtext, but here we are not using batches, so the preprocessing can be
handled directly by the spaCy library. We define a prediction function for
this. After preprocessing, the sentence is converted into tensors that are
ready to be passed to the model.
import spacy

nlp = spacy.load('en_core_web_sm')

def pred(model, sentence):
    model.eval()
    # Tokenize, map tokens to vocabulary ids, and record the length.
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [txt.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    # Add a batch dimension: shape [sentence length, 1].
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()
We define another helper function that prints the sentiment of the
comment based on the score that the model provides.

sent = ["positive", "neutral", "negative"]

def print_sent(x):
    # Assumes low scores correspond to the positive class; check
    # labels.vocab.stoi and swap the two ends of the list if 'pos' maps to 1.
    if x < 0.3:
        print(sent[0])
    elif x < 0.7:
        print(sent[1])
    else:
        print(sent[2])
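
For testing or reuse it can be convenient to return the label instead of printing it. The variant below is a small sketch with the same cut-offs of 0.3 and 0.7; the name score_to_label and its keyword arguments are hypothetical. The label order follows the helper above, and whether low scores really mean "positive" depends on how the label vocabulary was built, which should be checked before relying on it.

```python
def score_to_label(score, low=0.3, high=0.7,
                   labels=("positive", "neutral", "negative")):
    """Map a sigmoid score in [0, 1] to one of three labels using two cut-offs."""
    if score < low:
        return labels[0]
    if score < high:
        return labels[1]
    return labels[2]


# Example scores (made up); in the notebook they would come from pred(model, text).
for score in (0.05, 0.3, 0.55, 0.9):
    print(score, score_to_label(score))
```

Using half-open intervals means every score, including the cut-off values themselves, falls into exactly one band.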
Python Sentiment Analysis Output
Summary
We have developed a Python sentiment analysis model based on an LSTM
network that reached a test accuracy of 85.83% in our run. As discussed
earlier, sentiment analysis has many use cases, and the model can be applied
to them as requirements dictate. It can similarly be trained on any other kind
of text data simply by changing the dataset according to our needs.