
Naive Bayes Sentiment Analysis (Twitter) - Step-by-Step

Teaching Guide
A clear, beginner-friendly walkthrough with toy example, formulas, code, common pitfalls and improvement
ideas.

Prepared for: You (the teacher) — use this as a PDF handout or lecture notes

Author: Generated by ChatGPT (assisted)


Date: 2025-08-11
Table of Contents
1. Project overview and goal
2. The twitter_samples dataset
3. Train/test splitting (and shuffle)
4. Preprocessing (process_tweet) — full walkthrough
5. Building the frequency dictionary (count_tweets)
6. Helper functions for frequencies
7. Training Naive Bayes (train_naive_bayes) — step-by-step
8. Log prior and log likelihood — formulas & code mapping
9. Predicting sentiment (naive_bayes_predict)
10. Testing & evaluation (test_naive_bayes + metrics)
11. Toy example — full numeric run
12. Edge cases, limitations & improvements
13. Interview Q&A (expected questions + answers)
14. Teaching tips & demo ideas
15. Appendix: Full code (cleaned)
16. References
1. Project overview and goal
Goal: Build a simple classifier that reads a tweet and predicts whether its sentiment is positive or negative.
We use the Naive Bayes algorithm because it is fast, interpretable, and effective for text classification.

2. The twitter_samples dataset


What it is:
NLTK provides a small labeled dataset called twitter_samples with files like 'positive_tweets.json' and
'negative_tweets.json'. Each contains ~5,000 tweets already labeled by sentiment. These are perfect for
learning and demos.
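
To follow along, you can fetch and inspect the dataset like this (a minimal sketch; the nltk.download calls only need to run once per machine):

import nltk
from nltk.corpus import twitter_samples

nltk.download('twitter_samples')   # one-time download of the labeled tweets
nltk.download('stopwords')         # needed later for preprocessing

all_pos = twitter_samples.strings('positive_tweets.json')
all_neg = twitter_samples.strings('negative_tweets.json')
print(len(all_pos), len(all_neg))  # ~5,000 tweets per file
print(all_pos[0])                  # one raw tweet string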

3. Train/test splitting (and shuffle)


Why split?
We split data into training and testing so the model learns on one set and is evaluated on unseen data.
Important: always shuffle before splitting to avoid ordering bias (tweets in dataset may be grouped).
Example (python):
import random
import numpy as np

all_pos = twitter_samples.strings('positive_tweets.json')
all_neg = twitter_samples.strings('negative_tweets.json')

# shuffle before slicing
random.seed(42)
random.shuffle(all_pos)
random.shuffle(all_neg)

train_pos = all_pos[:4000]
test_pos = all_pos[4000:]
train_neg = all_neg[:4000]
test_neg = all_neg[4000:]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
4. Preprocessing (process_tweet) — full walkthrough
Purpose:
Tweets are noisy. Preprocessing normalizes text so our model can focus on meaningful tokens. Steps
include removing links, mentions, hashtags (or the '#'), lowercasing, tokenizing, removing stopwords, and
stemming.
Detailed step explanations:
- Remove stock tickers: Regex r'\$\w*' removes tokens like $TSLA which add noise.
- Remove RT: Regex r'^RT[\s]+' removes old-style 'RT' markers.
- Remove hyperlinks: Regex r'https?:\/\/.*[\r\n]*' removes URLs.
- Hashtag handling: We remove the '#' but keep the word (e.g., #happy → happy).
- Tokenize: Use TweetTokenizer to keep emoticons and contractions sensible.
- Stopwords: Remove common words like 'the', 'is' which carry little sentiment.
- Punctuation: Remove punctuation tokens.
- Stemming: Reduce words to root form (PorterStemmer).
process_tweet example (python):
def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweet = str(tweet)
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for w in tweet_tokens:
        if w not in stopwords_english and w not in string.punctuation:
            tweets_clean.append(stemmer.stem(w))
    return tweets_clean
5. Building the frequency dictionary (count_tweets)
We store counts for (word, label) pairs. Example key: ('happy', 1) -> value: number of times 'happy'
appeared in positive tweets.
def count_tweets(result, tweets, ys):
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            result[pair] = result.get(pair, 0) + 1
    return result

# After running: freqs = count_tweets({}, train_x, train_y)
# freqs might look like: {('happy', 1): 27, ('happy', 0): 3, ('sad', 0): 15, ...}

Why this matters: Naive Bayes uses these counts to estimate P(word|class).
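
To make the link to probabilities concrete, here is a tiny illustrative lookup using the freqs dictionary built above (the numbers are hypothetical, and smoothing is deferred to Section 7):

# Assumes freqs was built with count_tweets as above; counts are illustrative.
# Note: after stemming, the token may be stored as 'happi' rather than 'happy'.
count_happy_pos = freqs.get(('happy', 1), 0)
total_pos_words = sum(v for (word, label), v in freqs.items() if label == 1)
# Unsmoothed estimate of P('happy' | positive); training adds Laplace smoothing.
p_happy_pos = count_happy_pos / total_pos_words if total_pos_words else 0.0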
6. Helper functions for frequencies
Short purpose:
These are small, reusable functions that fetch the positive and negative counts for a word, keeping later
code simpler and cleaner.
def freq_pos_count(word, freqs):
    return freqs.get((word, 1), 0)

def freq_neg_count(word, freqs):
    return freqs.get((word, 0), 0)
7. Training the Naive Bayes model (train_naive_bayes)
High-level idea:
Compute how often each word appears in positive vs negative tweets; turn those into probabilities. Also
compute the prior (how common positive vs negative tweets are).
Key variables computed in training:
- vocab (V): Set of unique words seen in training
- N_pos, N_neg: Total number of word occurrences in positive/negative tweets
- D_pos, D_neg: Number of positive and negative documents (tweets)
- logprior: log(D_pos) - log(D_neg)
- loglikelihood[word]: log(P(word|pos)) - log(P(word|neg)) for each word
Training code (python):
def train_naive_bayes(freqs, train_x, train_y):
    loglikelihood = {}
    vocab = set([k[0] for k in freqs.keys()])
    V = len(vocab)
    N_pos = N_neg = 0
    for pair in freqs.keys():
        if pair[1] > 0:
            N_pos += freqs[pair]
        else:
            N_neg += freqs[pair]
    D_pos = len(train_y[train_y == 1])
    D_neg = len(train_y[train_y == 0])
    logprior = np.log(D_pos) - np.log(D_neg)
    for word in vocab:
        freq_pos = freq_pos_count(word, freqs)
        freq_neg = freq_neg_count(word, freqs)
        p_w_pos = (freq_pos + 1) / (N_pos + V)  # Laplace smoothing
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos) - np.log(p_w_neg)
    return logprior, loglikelihood
8. Log prior & Log likelihood — formulas and relationship
Math (plain text):
- Bayes (ratio form): log P(Pos|tweet) - log P(Neg|tweet) = log P(Pos) - log P(Neg) + sum_i [ log P(w_i|Pos) - log P(w_i|Neg) ]
- logprior = log(D_pos) - log(D_neg)
- P(w|Pos) = (count(w,Pos) + 1) / (N_pos + V) # Laplace smoothing
- loglikelihood[word] = log(P(w|Pos)) - log(P(w|Neg))
- Final score = logprior + sum_{words} loglikelihood[word]
Interpretation: logprior is the baseline. Each word's loglikelihood nudges the score toward positive (if
positive value) or negative (if negative value).
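
As a quick numeric sanity check of these formulas (the counts below are hypothetical, chosen to match the toy example in Section 11):

import numpy as np

N_pos, N_neg, V = 9, 9, 10                # total word counts per class and vocabulary size
count_pos, count_neg = 1, 0               # counts of one word, e.g. 'love'
p_w_pos = (count_pos + 1) / (N_pos + V)   # Laplace smoothing
p_w_neg = (count_neg + 1) / (N_neg + V)
print(np.log(p_w_pos) - np.log(p_w_neg))  # log(2) ≈ 0.693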
9. Predicting sentiment (naive_bayes_predict)
Algorithm (simple):
1) Clean the tweet with process_tweet.
2) Start with p = logprior.
3) For each word in the cleaned tweet, if it is in the vocabulary, add loglikelihood[word].
4) If p > 0, predict Positive; otherwise Negative.
def naive_bayes_predict(tweet, logprior, loglikelihood):
    word_l = process_tweet(tweet)
    p = logprior
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
    return p  # sign determines class: > 0 -> positive
10. Testing & evaluation
In code you loop over the test set, predict labels and compare to true labels. Use metrics:
- Accuracy: Correct predictions / total
- Confusion matrix: Counts of TP, FP, FN, TN
- Precision: TP / (TP + FP) - how many predicted positive were actually positive
- Recall: TP / (TP + FN) - how many actual positives were found
- F1-score: Harmonic mean of precision and recall
Example (python):
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
y_hats = test_naive_bayes(test_x, test_y, logprior, loglikelihood)
print("Accuracy:", accuracy_score(test_y, y_hats))
print("Confusion matrix:\n", confusion_matrix(test_y, y_hats))
print("P/R/F1:", precision_recall_fscore_support(test_y, y_hats, average='binary'))
11. Toy example — full numeric run
Small dataset (3 pos, 3 neg) and step-by-step calculations:
Training tweets:
Pos: "I love this", "This is great", "I am happy"
Neg: "I hate this", "This is bad", "I am sad"

Vocabulary (unique words): i, love, this, is, great, am, happy, hate, bad, sad
V = 10

(For simplicity, this toy example lowercases the words but skips stopword removal and stemming.)

Positive word counts (total occurrences): N_pos = 9

Negative word counts (total occurrences): N_neg = 9

D_pos = 3, D_neg = 3
logprior = log(3) - log(3) = 0

Laplace smoothing: p(w|pos) = (count_pos(w) + 1) / (N_pos + V)

Example for 'love':

count_pos('love') = 1, count_neg('love') = 0
p(love|pos) = (1+1)/(9+10) = 2/19
p(love|neg) = (0+1)/(9+10) = 1/19
loglikelihood['love'] = log(2/19) - log(1/19) = log(2) ≈ +0.693

Predict tweet: "I love this"


score = logprior + loglikelihood['i'] + loglikelihood['love'] + loglikelihood['this']
loglikelihood['i'] = 0 and loglikelihood['this'] = 0, because both words appear twice in each class,
so score ≈ 0 + 0 + 0.693 + 0 = 0.693 > 0 -> Positive
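
The whole toy calculation can be reproduced with a few lines of Python (a self-contained sketch that, like the walkthrough above, skips stopword removal and stemming):

import numpy as np

pos_tweets = ["I love this", "This is great", "I am happy"]
neg_tweets = ["I hate this", "This is bad", "I am sad"]

# Count (word, label) pairs on lowercased, whitespace-split tokens
freqs = {}
for label, tweets in ((1, pos_tweets), (0, neg_tweets)):
    for tweet in tweets:
        for word in tweet.lower().split():
            freqs[(word, label)] = freqs.get((word, label), 0) + 1

vocab = {w for (w, _) in freqs}
V = len(vocab)                                                 # 10
N_pos = sum(v for (w, l), v in freqs.items() if l == 1)        # 9
N_neg = sum(v for (w, l), v in freqs.items() if l == 0)        # 9
logprior = np.log(len(pos_tweets)) - np.log(len(neg_tweets))   # 0.0

loglikelihood = {}
for w in vocab:
    p_w_pos = (freqs.get((w, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (freqs.get((w, 0), 0) + 1) / (N_neg + V)
    loglikelihood[w] = np.log(p_w_pos) - np.log(p_w_neg)

score = logprior + sum(loglikelihood[w] for w in "i love this".split() if w in loglikelihood)
print(round(score, 3))  # ≈ 0.693 -> Positive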
12. Edge cases, limitations & ways to improve
- Negation handling: Use n-grams or a small rule-based negation flip so 'not good' becomes a separate
feature.
- N-grams: Include bi-grams/trigrams to capture short phrases.
- TF-IDF: Use TF-IDF weighting instead of raw counts for less frequent but informative words (an n-gram + TF-IDF sketch follows this list).
- Word embeddings: Use pretrained Word2Vec/GloVe/BERT for semantic understanding.
- Model upgrade: Try Logistic Regression, SVM, or Transformer-based models for better accuracy.
- Cross-validation: Use k-fold cross-validation instead of single split to get robust performance estimates.
- More metrics: Check precision, recall, F1 and confusion matrix, not only accuracy.
- Handle OOV: Map unseen words to an 'UNK' token or expand training data.
- Class imbalance: Use oversampling, undersampling, or class weights if classes are unbalanced.
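
Several of these ideas (n-grams, TF-IDF, an off-the-shelf classifier) can be tried quickly with scikit-learn. A sketch, assuming the train_x/train_y and test_x/test_y lists built earlier and sklearn installed:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Unigrams + bigrams with TF-IDF weights, fed into multinomial Naive Bayes
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    MultinomialNB(alpha=1.0),   # alpha=1.0 is Laplace smoothing
)
clf.fit(train_x, train_y)
print("Test accuracy:", clf.score(test_x, test_y))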
13. Interview Q&A (expected questions + short answers)
Q: Why Naive Bayes for text? A: Simple, fast, works well with small text datasets and provides
interpretable word scores.
Q: Why remove stopwords? A: They add noise and little sentiment information.
Q: What is Laplace smoothing? A: Add 1 to counts so unseen words don't make probabilities zero.
Q: How to handle sarcasm? A: Hard for NB; use larger models (transformers) and context-aware
embeddings.
Q: How to improve accuracy? A: Use n-grams, TF-IDF, embeddings, more data, or stronger classifiers.
Q: Is independence assumption realistic? A: No, but NB still often works well in practice for text.
14. Teaching tips & demo ideas
- Start with the toy dataset: manually count words and compute one prediction by hand.
- Show preprocessing effects: compare raw tweet vs cleaned tokens.
- Visualize loglikelihood scores for top positive/negative words (see the snippet after this list).
- Demonstrate errors (sarcasm, negation) to show limitations.
- Provide a live demo where students type tweets and see prediction results.
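
For the visualization tip above, a few lines are enough to surface the most strongly weighted words (assumes the loglikelihood dictionary from the trained model):

# Sort words by their loglikelihood: most negative first, most positive last
ranked = sorted(loglikelihood.items(), key=lambda kv: kv[1])
print("Most negative words:", ranked[:10])
print("Most positive words:", ranked[-10:])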
15. Appendix: Full cleaned code (copy-paste ready)
# --- Full minimal Naive Bayes pipeline (cleaned version) ---
import re, string
import random
import numpy as np
from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer

# If the NLTK corpora are not installed yet, run once:
# import nltk; nltk.download('twitter_samples'); nltk.download('stopwords')

# 1. Load, shuffle and split (labels: 1 = positive, 0 = negative)
all_pos = twitter_samples.strings('positive_tweets.json')
all_neg = twitter_samples.strings('negative_tweets.json')
random.seed(42)
random.shuffle(all_pos); random.shuffle(all_neg)

train_pos = all_pos[:4000]; test_pos = all_pos[4000:]
train_neg = all_neg[:4000]; test_neg = all_neg[4000:]
train_x = train_pos + train_neg; test_x = test_pos + test_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

# 2. Preprocess
def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweet = str(tweet)
    tweet = re.sub(r'\$\w*', '', tweet)                 # remove stock tickers like $TSLA
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # remove old-style retweet marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)                     # drop '#' but keep the hashtag word
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and word not in string.punctuation):
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
    return tweets_clean

# 3. Build freqs
def count_tweets(result, tweets, ys):
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            result[pair] = result.get(pair, 0) + 1
    return result

freqs = count_tweets({}, train_x, train_y)

def freq_pos_count(word, freqs):
    return freqs.get((word, 1), 0)

def freq_neg_count(word, freqs):
    return freqs.get((word, 0), 0)

# 4. Train
def train_naive_bayes(freqs, train_x, train_y):
    vocab = set([k[0] for k in freqs.keys()])
    V = len(vocab)
    N_pos = sum([v for (w, l), v in freqs.items() if l == 1])
    N_neg = sum([v for (w, l), v in freqs.items() if l == 0])
    D_pos = len(train_y[train_y == 1])
    D_neg = len(train_y[train_y == 0])
    logprior = np.log(D_pos) - np.log(D_neg)
    loglikelihood = {}
    for w in vocab:
        freq_pos = freq_pos_count(w, freqs)
        freq_neg = freq_neg_count(w, freqs)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[w] = np.log(p_w_pos) - np.log(p_w_neg)
    return logprior, loglikelihood

logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

# 5. Predict
def naive_bayes_predict(tweet, logprior, loglikelihood):
    word_l = process_tweet(tweet)
    p = logprior
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
    return p

# 6. Test and metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

y_hats = [1 if naive_bayes_predict(t, logprior, loglikelihood) > 0 else 0 for t in test_x]
print("Accuracy:", accuracy_score(test_y, y_hats))
print("Confusion matrix:\n", confusion_matrix(test_y, y_hats))
print("P/R/F1:", precision_recall_fscore_support(test_y, y_hats, average='binary'))
16. References
NLTK twitter_samples. NLTK documentation. Other references: Naive Bayes tutorials and standard ML
texts.
