Assignment No 4 - KNN Twitter
Sentiment Analysis
It is the process of detecting positive or negative sentiment in text.
It’s often used by businesses to detect sentiment in social data, gauge brand
reputation, and understand customers.
Sentiment analysis models focus not only on polarity (positive, negative, neutral)
but also on feelings and emotions (angry, happy, sad, etc.), urgency (urgent, not
urgent), and even intentions (interested vs. not interested).
Depending on how you want to interpret customer feedback and queries, you can
define and tailor your categories to meet your sentiment analysis needs.
Automatically analyzing customer feedback, such as opinions in survey responses and
social media conversations, allows brands to learn what makes customers happy or
frustrated, so that they can tailor products and services to meet their customers’
needs.
For example, using sentiment analysis to automatically analyze 4,000+ reviews about
your product could help you discover if customers are happy about your pricing
plans and customer service.
It’s estimated that 90% of the world’s data is unstructured; in other words, it’s
unorganized. Huge volumes of unstructured business data are created every day:
emails, support tickets, chats, social media conversations, surveys, articles,
documents, etc.
Automatic Approaches
Automatic methods, contrary to rule-based systems, don't rely on manually crafted
rules, but on machine learning techniques.
A sentiment analysis task is usually modeled as a classification problem, whereby a
classifier is fed a text and returns a category, e.g. positive, negative, or
neutral.
Working of K-NN
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance from the new data point to every training
data point.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each
category.
Step-5: Assign the new data point to the category that has the largest number of
neighbors.
Step-6: Our model is ready.
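The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the assignment's actual code: the function names `euclidean` and `knn_predict` and the toy 2D points are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every training point
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: count the labels among those k neighbors
    votes = Counter(label for _, label in nearest)
    # Step 5: return the label with the most votes
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

prediction = knn_predict(train_X, train_y, [4, 4], k=3)
print(prediction)
```

Note that K-NN has no training phase beyond storing the data, which is why Step 6 follows directly. With an even K or more than two classes, votes can tie; this sketch simply returns whichever tied label was counted first.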
Implementation Algorithm
Storing the training and test datasets into their respective dataframes
Preprocessing
Parsing the stop_words.txt file and storing all the words in a list.
Building a list of all special characters that are to be removed.
With training and testing data
Removing all stopwords from all the tweets.
Removing hyperlinks from all the tweets. They are not needed for classification.
Removing usernames from all the tweets.
Removing hashtags, including their text, from all the tweets. Hashtags are not
useful here because their words cannot be split on spaces.
Removing all special characters from all the tweets
Finding all the unique words in training and testing data's Tweet column
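A rough sketch of these preprocessing steps using Python's `re` module is shown below. Several things here are assumptions rather than the assignment's code: the small `STOP_WORDS` set stands in for the contents of stop_words.txt, the function names are hypothetical, special characters are approximated as "anything that is not a letter or whitespace" (so digits are dropped too), and stop words are removed last rather than first so that punctuation does not hide them.

```python
import re

# Hypothetical stand-in for the words parsed from stop_words.txt
STOP_WORDS = {"the", "is", "a", "to", "and"}

def clean_tweet(text, stop_words=STOP_WORDS):
    """Return the list of tokens left after cleaning one tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"@\w+", " ", text)                    # usernames
    text = re.sub(r"#\w+", " ", text)                    # hashtags, including their text
    text = re.sub(r"[^a-z\s]", " ", text)                # special characters (and digits)
    return [tok for tok in text.split() if tok not in stop_words]

def vocabulary(token_lists):
    """Sorted list of the unique words across all cleaned tweets."""
    return sorted({tok for tokens in token_lists for tok in tokens})

tokens = clean_tweet("@united the flight is LATE again!! http://t.co/abc #fail")
print(tokens)
print(vocabulary([["flight", "late"], ["late", "great"]]))
```

The same `clean_tweet` function would be applied to both the training and the testing tweets so that the two sets share one vocabulary.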
Feature Extraction
Training and testing Data: Extracting features and storing them into the training
feature matrix
Calculating the distance between every test instance and every train instance.
This returns a 2D distance matrix (one row per test instance, one column per train
instance).
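As an illustration of these two steps, the sketch below builds word-count feature vectors and then the test-by-train distance matrix. It assumes a bag-of-words representation (one column per unique word, values are counts), which the slides do not state explicitly; `bag_of_words` and `distance_matrix` are hypothetical names, and `math.dist` requires Python 3.8 or later.

```python
import math

def bag_of_words(token_lists, vocab):
    """One row per tweet, one column per vocabulary word, values are word counts."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for tok in tokens:
            if tok in index:          # words outside the vocabulary are ignored
                row[index[tok]] += 1
        matrix.append(row)
    return matrix

def distance_matrix(test_X, train_X):
    """Entry [i][j] is the Euclidean distance from test row i to train row j."""
    return [[math.dist(t, r) for r in train_X] for t in test_X]

# Toy example (hypothetical tokens)
vocab = ["delay", "great", "late"]
train_X = bag_of_words([["late", "delay"], ["great"]], vocab)
test_X = bag_of_words([["late"]], vocab)
D = distance_matrix(test_X, train_X)
print(train_X, test_X, D)
```

Sorting each row of `D` and keeping the first K column indices gives the K nearest training tweets for that test tweet.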
K Nearest Neighbors & Performance Measures by plotting graphs
Making a general structure of our confusion matrix
Extracting values from the Frequency DataFrame and assigning to specific cells in
the confusion matrix.
Extracting the per-class recalls and precisions from the matrix to compute the
macro-averaged F1 score, recall, and precision.
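One way to sketch the confusion matrix and the macro-averaged measures is shown below. The layout (rows are actual classes, columns are predicted classes), the function names, and the toy labels are assumptions for illustration. Macro F1 is computed here as the mean of the per-class F1 scores; another convention takes the harmonic mean of macro precision and macro recall, and the slides do not say which one was used.

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] counts instances of actual class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def macro_scores(matrix):
    """Return (macro precision, macro recall, macro F1) from a confusion matrix."""
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        actual_total = sum(matrix[i])                           # row sum
        predicted_total = sum(matrix[r][i] for r in range(n))   # column sum
        p = tp / predicted_total if predicted_total else 0.0
        r = tp / actual_total if actual_total else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels (hypothetical)
labels = ["negative", "neutral", "positive"]
actual = ["negative", "negative", "neutral", "positive", "positive", "positive"]
predicted = ["negative", "neutral", "neutral", "positive", "positive", "negative"]

cm = confusion_matrix(actual, predicted, labels)
macro_p, macro_r, macro_f1 = macro_scores(cm)
print(cm, macro_p, macro_r, macro_f1)
```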
Performance Evaluation
Confusion matrix
Accuracy: 99.9%
F1 Score: The F1 score is needed when you want to strike a balance between
precision and recall.
Accuracy can be inflated by a large number of true negatives, which in most
business circumstances we do not focus on much, whereas false negatives and false
positives usually have business costs (tangible and intangible).
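A small worked example makes the gap between accuracy and F1 concrete. The counts below are invented for illustration and are not results from this assignment; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical imbalanced case: 1000 tweets, only 10 belong to the class of interest,
# and the classifier finds just 1 of them.
accuracy, precision, recall, f1 = binary_metrics(tp=1, fp=0, fn=9, tn=990)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the 990 true negatives push accuracy above 99% even though 9 of the 10 relevant tweets are missed, while the F1 score stays below 0.2.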