
Assignment No 4 - KNN Twitter

This document summarizes sentiment analysis and how it works using a KNN algorithm. Sentiment analysis detects positive and negative sentiment in text. It can analyze customer feedback to understand satisfaction. KNN is an algorithm that classifies new data based on similarity to existing classified data. It works by selecting K neighbors, calculating distances, and assigning the new data to the category of its K closest neighbors. The document implements KNN for sentiment analysis on tweets, preprocessing data, extracting features, performing KNN classification, and evaluating performance with metrics like accuracy and F1 score.

Uploaded by

Vaishnavi Gurav
Copyright: © All Rights Reserved

Honours* in Data Science, Fourth Year of Engineering (Semester VII)

410502: Machine Learning and Data Science Laboratory


Dr. Girija Gireesh Chiddarwar
Assignment No 4 - Text classification for sentiment analysis using KNN
Note: Use Twitter data

Sentiment Analysis
Sentiment analysis is the process of detecting positive or negative sentiment in
text.
Businesses often use it to detect sentiment in social data, gauge brand reputation,
and understand customers.
Sentiment analysis models focus on polarity (positive, negative, neutral), but also
on feelings and emotions (angry, happy, sad, etc.), urgency (urgent, not urgent),
and even intentions (interested vs. not interested).
Depending on how you want to interpret customer feedback and queries, you can
define and tailor your categories to meet your sentiment analysis needs.
Automatically analyzing customer feedback, such as opinions in survey responses and
social media conversations, allows brands to learn what makes customers happy or
frustrated, so that they can tailor products and services to meet their customers’
needs.
For example, using sentiment analysis to automatically analyze 4,000+ reviews about
your product could help you discover if customers are happy about your pricing
plans and customer service.

By some estimates, 90% of the world’s data is unstructured, in other words,
unorganized. Huge volumes of unstructured business data are created every day:
emails, support tickets, chats, social media conversations, surveys, articles,
documents, and so on.

How Does Sentiment Analysis Work?


Sentiment analysis systems generally take one of three approaches:
Rule-based: systems perform sentiment analysis using a set of manually crafted
rules.
Automatic: systems rely on machine learning techniques to learn from data.
Hybrid: systems combine both rule-based and automatic approaches.
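
To make the rule-based approach concrete, here is a minimal sketch in Python. The word lists and the function name `rule_based_sentiment` are assumptions made up for illustration; a real rule-based system would use a much larger lexicon and extra rules for negation, punctuation, and so on.

```python
# Hypothetical, deliberately tiny lexicons (not taken from the assignment).
POSITIVE_WORDS = {"good", "great", "happy", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "terrible", "sad", "hate", "awful"}


def rule_based_sentiment(text):
    """Label text by counting lexicon hits; ties and texts with no hits are neutral."""
    # Plain whitespace split: punctuation attached to a word (e.g. "love!")
    # is not handled in this sketch.
    words = text.lower().split()
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The rules here are fixed by hand, which is exactly what separates this approach from the automatic one, where the decision rule is learned from labelled data.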

Automatic Approaches
Automatic methods, contrary to rule-based systems, don't rely on manually crafted
rules, but on machine learning techniques.
A sentiment analysis task is usually modeled as a classification problem, whereby a
classifier is fed a text and returns a category, e.g. positive, negative, or
neutral.

Working of Sentiment Analysis

K-Nearest Neighbor (KNN) Algorithm for Machine Learning


K-Nearest Neighbour is one of the simplest machine learning algorithms and is based
on the supervised learning technique.
The K-NN algorithm assumes that similar cases lie close to each other, and puts a
new case into the category whose available cases it most resembles.
The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be easily
classified into a well-suited category by using the K-NN algorithm.
The K-NN algorithm can be used for regression as well as for classification, but it
is mostly used for classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption
about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the
training set immediately. Instead, it stores the dataset and works on it only at
the time of classification.
At the training phase, the KNN algorithm just stores the dataset; when it gets new
data, it classifies that data into the category that is most similar to the new
data.

Working of K-NN
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to every point in
the training data.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors
is maximum.
Step-6: Our model is ready.
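
The steps above can be sketched in plain Python as follows. This is only an illustrative sketch: the function names `euclidean_distance` and `knn_predict` and the toy 2D points are assumptions, not the assignment's actual code or data.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: distance from the new point to every training point.
    distances = [
        (euclidean_distance(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k nearest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: count how many of those neighbours fall in each category.
    votes = Counter(label for _, label in nearest)
    # Step 5: assign the category with the most neighbours.
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train_points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
train_labels = ["negative", "negative", "positive", "positive"]
prediction = knn_predict(train_points, train_labels, (5.5, 5.0), k=3)
```

An odd K is a common choice for two-class problems because it avoids tied votes.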

Implementation Algorithm
Storing the training and test datasets in their respective dataframes
Preprocessing
Parsing the stop_words.txt file and storing all the words in a list.
Defining the list of all special characters that are to be removed.
For both the training and testing data:
Removing all stopwords from all the tweets.
Removing hyperlinks from all the tweets. They are not needed for classification.
Removing usernames from all the tweets.
Removing hashtags, including their text, from all the tweets. The words in a hashtag
are run together, so they cannot be split on spaces.
Removing all special characters from all the tweets.
Finding all the unique words in the Tweet column of the training and testing data.
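
A minimal sketch of these preprocessing steps, using only Python's `re` module, might look like this. The stop-word set, the regular expressions, the lowercasing step, and the names `clean_tweet` and `build_vocabulary` are all assumptions for illustration; the assignment itself reads its stop words from stop_words.txt.

```python
import re

# Hypothetical stand-in for the words parsed from stop_words.txt.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}

# Assumed patterns for the items the outline says to remove.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
USERNAME_PATTERN = re.compile(r"@\w+")
HASHTAG_PATTERN = re.compile(r"#\w+")
SPECIAL_CHAR_PATTERN = re.compile(r"[^a-z0-9\s]")


def clean_tweet(tweet):
    """Return the list of remaining words after all removal steps."""
    text = tweet.lower()  # assumption: matching is case-insensitive
    text = URL_PATTERN.sub(" ", text)           # hyperlinks
    text = USERNAME_PATTERN.sub(" ", text)      # @usernames
    text = HASHTAG_PATTERN.sub(" ", text)       # hashtags, including their text
    text = SPECIAL_CHAR_PATTERN.sub(" ", text)  # remaining special characters
    # Stop words are filtered last so attached punctuation cannot hide them.
    return [word for word in text.split() if word not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Collect the unique words across all cleaned tweets, in sorted order."""
    return sorted({word for tokens in token_lists for word in tokens})
```

Note that this sketch removes stop words last rather than first; the end result is intended to be the same set of removals, but the order avoids missing a stop word that is followed by punctuation.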
Feature Extraction
Training and testing data: extracting features and storing them in the corresponding
feature matrices.
Calculating the distance between every test instance and all the train instances.
This returns a 2D distance matrix, with one row per test instance and one column per
train instance.
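
The source does not say which features are extracted, so the sketch below assumes simple word counts over the vocabulary (a bag-of-words representation) and Euclidean distance. The names `bag_of_words` and `distance_matrix` and the toy vocabulary are hypothetical.

```python
import math


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized tweet."""
    return [tokens.count(word) for word in vocabulary]


def distance_matrix(test_features, train_features):
    """Euclidean distances: rows are test instances, columns are train instances."""
    return [
        [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(test_row, train_row)))
            for train_row in train_features
        ]
        for test_row in test_features
    ]


# Toy data, invented for illustration only.
vocabulary = ["bad", "good", "service"]
train_tokens = [["good", "service"], ["bad", "service"]]
test_tokens = [["good", "good"]]

train_features = [bag_of_words(tokens, vocabulary) for tokens in train_tokens]
test_features = [bag_of_words(tokens, vocabulary) for tokens in test_tokens]
distances = distance_matrix(test_features, train_features)
```

Sorting each row of `distances` then gives the nearest training tweets for the corresponding test tweet.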
K Nearest Neighbors & Performance Measures by plotting graphs
Making a general structure of our confusion matrix.
Extracting values from the Frequency DataFrame and assigning them to specific cells
in the confusion matrix.
Extracting the per-class values from the matrix to measure the macro-averaged F1
score, recall, and precision.
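
As a rough sketch of the evaluation step, the code below builds confusion-matrix counts from true and predicted labels and derives macro-averaged precision, recall, and F1. The function names are hypothetical, and macro F1 is taken here as the mean of the per-class F1 scores, which is one common definition and may differ from the one used in the assignment.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Map each (true, predicted) pair to how often it occurs."""
    return Counter(zip(true_labels, predicted_labels))


def macro_scores(true_labels, predicted_labels):
    """Return (macro precision, macro recall, macro F1) over all classes."""
    counts = confusion_counts(true_labels, predicted_labels)
    classes = sorted(set(true_labels) | set(predicted_labels))
    precisions, recalls, f1_scores = [], [], []
    for cls in classes:
        true_positive = counts[(cls, cls)]
        predicted_as_cls = sum(counts[(other, cls)] for other in classes)
        actually_cls = sum(counts[(cls, other)] for other in classes)
        precision = true_positive / predicted_as_cls if predicted_as_cls else 0.0
        recall = true_positive / actually_cls if actually_cls else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Toy labels, invented for illustration only.
true_labels = ["pos", "pos", "neg", "neg"]
predicted_labels = ["pos", "neg", "neg", "neg"]
macro_precision, macro_recall, macro_f1 = macro_scores(true_labels, predicted_labels)
```

Macro averaging gives every class equal weight regardless of how many tweets it contains, which matters when the classes are imbalanced.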

Performance Evaluation
Confusion matrix

Accuracy - 99.9%

F1 Score - The F1 score is needed when you want a balance between precision and
recall.
A high accuracy can come largely from a large number of True Negatives, which most
business settings do not focus on, whereas False Negatives and False Positives
usually carry business costs (tangible and intangible).
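
The point about True Negatives can be illustrated with a small calculation. The counts below are invented for illustration and are not the assignment's results; `accuracy_and_f1` is a hypothetical helper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 for the positive class from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, f1


# Invented example: 1,000 tweets, only 10 of them truly positive.
# The classifier finds 1 of the 10 and labels everything else negative.
accuracy, f1 = accuracy_and_f1(tp=1, fp=0, fn=9, tn=990)
# accuracy is 0.991, while F1 is only about 0.18
```

Here the 990 True Negatives dominate the accuracy, while the F1 score exposes that 9 of the 10 positive tweets were missed.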