Naive Bayes Sentiment Analysis (Twitter) - Step-by-Step
Teaching Guide
A clear, beginner-friendly walkthrough with toy example, formulas, code, common pitfalls and improvement
ideas.
Prepared for: You (the teacher) — use this as a PDF handout or lecture notes
Author: Generated by ChatGPT (assisted)
Date: 2025-08-11
Table of Contents
1. Project overview and goal
2. The twitter_samples dataset
3. Train/test splitting (and shuffle)
4. Preprocessing (process_tweet) — full walkthrough
5. Building the frequency dictionary (count_tweets)
6. Helper functions for frequencies
7. Training Naive Bayes (train_naive_bayes) — step-by-step
8. Log prior and Log likelihood — formulas & code mapping
9. Predicting sentiment (naive_bayes_predict)
10. Testing & evaluation (test_naive_bayes + metrics)
11. Toy example — full numeric run
12. Edge cases, limitations & improvements
13. Interview Q&A (expected questions + answers)
14. Teaching tips & demo ideas
15. Appendix: Full code (cleaned)
16. References
1. Project overview and goal
Goal: Build a simple classifier that reads a tweet and predicts whether its sentiment is positive or negative.
We use the Naive Bayes algorithm because it is fast, interpretable, and effective for text classification.
2. The twitter_samples dataset
What it is:
NLTK provides a small labeled dataset called twitter_samples with files like 'positive_tweets.json' and
'negative_tweets.json'. Each contains ~5,000 tweets already labeled by sentiment. These are perfect for
learning and demos.
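The corpora must be downloaded once per environment before they can be loaded (a small setup step using NLTK's standard downloader):
import nltk
nltk.download('twitter_samples')   # 5,000 positive and 5,000 negative labeled tweets
nltk.download('stopwords')         # English stopword list used later in preprocessing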
3. Train/test splitting (and shuffle)
Why split?
We split data into training and testing so the model learns on one set and is evaluated on unseen data.
Important: always shuffle before splitting to avoid ordering bias (tweets in dataset may be grouped).
Example (python):
all_pos = twitter_samples.strings('positive_tweets.json')
all_neg = twitter_samples.strings('negative_tweets.json')
# shuffle before slicing
import random
import numpy as np
random.seed(42)
random.shuffle(all_pos)
random.shuffle(all_neg)
train_pos = all_pos[:4000]
test_pos = all_pos[4000:]
train_neg = all_neg[:4000]
test_neg = all_neg[4000:]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
4. Preprocessing (process_tweet) — full walkthrough
Purpose:
Tweets are noisy. Preprocessing normalizes text so our model can focus on meaningful tokens. Steps
include removing links, mentions, hashtags (or the '#'), lowercasing, tokenizing, removing stopwords, and
stemming.
Detailed step explanations:
- Remove stock tickers: Regex r'\$\w*' removes tokens like $TSLA which add noise.
- Remove RT: Regex r'^RT[\s]+' removes old-style 'RT' markers.
- Remove hyperlinks: Regex r'https?:\/\/.*[\r\n]*' removes URLs.
- Hashtag handling: We remove the '#' but keep the word (e.g., #happy → happy).
- Tokenize: Use TweetTokenizer to keep emoticons and contractions sensible.
- Stopwords: Remove common words like 'the', 'is' which carry little sentiment.
- Punctuation: Remove punctuation tokens.
- Stemming: Reduce words to root form (PorterStemmer).
process_tweet example (python):
def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweet = str(tweet)
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for w in tweet_tokens:
        if w not in stopwords_english and w not in string.punctuation:
            tweets_clean.append(stemmer.stem(w))
    return tweets_clean
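Quick check (python), using a made-up tweet; the exact tokens can vary slightly across NLTK versions:
sample = "RT @someone: I am LOVING this movie!!! #happy :) https://example.com"
print(process_tweet(sample))
# Typical output: ['love', 'movi', 'happi', ':)']
# URL, handle, 'RT', stopwords ('i', 'am', 'this') and punctuation are gone;
# 'LOVING' is lowercased and stemmed to 'love'; the emoticon ':)' is kept as one token.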
5. Building the frequency dictionary (count_tweets)
We store counts for (word, label) pairs. Example key: ('happy', 1) -> value: number of times 'happy'
appeared in positive tweets.
def count_tweets(result, tweets, ys):
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            result[pair] = result.get(pair, 0) + 1
    return result

# After running: freqs = count_tweets({}, train_x, train_y)
# freqs might look like: {('happi', 1): 27, ('happi', 0): 3, ('sad', 0): 15, ...}
# (note that the words appear in their stemmed form, e.g. 'happy' -> 'happi')
Why this matters: Naive Bayes uses these counts to estimate P(word|class).
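To make the counting concrete, here is a tiny run on two made-up tweets (a sketch; the tokens shown
assume the process_tweet above, so words appear in stemmed form):
toy_tweets = ["I am happy", "I am sad"]
toy_labels = [1, 0]
print(count_tweets({}, toy_tweets, toy_labels))
# Expected: {('happi', 1): 1, ('sad', 0): 1}   ('i' and 'am' are removed as stopwords)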
6. Helper functions for frequencies
Purpose:
Small, reusable functions that fetch the positive and negative counts for a word, keeping the later
code simpler and cleaner.
def freq_pos_count(word, freqs):
    return freqs.get((word, 1), 0)

def freq_neg_count(word, freqs):
    return freqs.get((word, 0), 0)
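Usage (hypothetical counts, only to show the lookup):
freqs_demo = {('happi', 1): 27, ('happi', 0): 3}
print(freq_pos_count('happi', freqs_demo))   # 27
print(freq_neg_count('happi', freqs_demo))   # 3
print(freq_pos_count('unseen', freqs_demo))  # 0 -- missing pairs default to 0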
7. Training the Naive Bayes model (train_naive_bayes)
High-level idea:
Compute how often each word appears in positive vs negative tweets; turn those into probabilities. Also
compute the prior (how common positive vs negative tweets are).
Key variables computed in training:
- vocab (V): Set of unique words seen in training
- N_pos, N_neg: Total number of word occurrences in positive/negative tweets
- D_pos, D_neg: Number of positive and negative documents (tweets)
- logprior: log(D_pos) - log(D_neg)
- loglikelihood[word]: log(P(word|pos)) - log(P(word|neg)) for each word
Training code (python):
def train_naive_bayes(freqs, train_x, train_y):
    loglikelihood = {}
    vocab = set([k[0] for k in freqs.keys()])
    V = len(vocab)
    N_pos = N_neg = 0
    for pair in freqs.keys():
        if pair[1] > 0:
            N_pos += freqs[pair]
        else:
            N_neg += freqs[pair]
    D_pos = len(train_y[train_y == 1])
    D_neg = len(train_y[train_y == 0])
    logprior = np.log(D_pos) - np.log(D_neg)
    for word in vocab:
        freq_pos = freq_pos_count(word, freqs)
        freq_neg = freq_neg_count(word, freqs)
        p_w_pos = (freq_pos + 1) / (N_pos + V)  # Laplace smoothing
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos) - np.log(p_w_neg)
    return logprior, loglikelihood
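After training, it helps to inspect a few extreme words (a small sketch; the exact words and scores
depend on the shuffle and NLTK version):
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print("logprior:", logprior)  # 0.0 here because the classes are balanced (4000 vs 4000)
# Rank words by how strongly they push a tweet toward negative or positive
ranked = sorted(loglikelihood.items(), key=lambda kv: kv[1])
print("most negative words:", ranked[:5])
print("most positive words:", ranked[-5:])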
8. Log prior & Log likelihood — formulas and relationship
Math (plain text):
- Bayes (ratio form): log P(Pos|tweet) - log P(Neg|tweet) = log P(Pos) - log P(Neg) + sum_i [ log P(w_i|Pos) - log P(w_i|Neg) ]
- logprior = log(D_pos) - log(D_neg)
- P(w|Pos) = (count(w,Pos) + 1) / (N_pos + V) # Laplace smoothing
- loglikelihood[word] = log(P(w|Pos)) - log(P(w|Neg))
- Final score = logprior + sum_{words} loglikelihood[word]
Interpretation: logprior is the baseline. Each word's loglikelihood nudges the score toward positive (if
positive value) or negative (if negative value).
9. Predicting sentiment (naive_bayes_predict)
Algorithm (simple):
1) Clean the tweet with process_tweet.
2) Start with p = logprior.
3) For each word in the cleaned tweet, if it appears in loglikelihood, add loglikelihood[word] to p.
4) If p > 0, predict Positive; otherwise predict Negative.
def naive_bayes_predict(tweet, logprior, loglikelihood):
    word_l = process_tweet(tweet)
    p = logprior
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
    return p  # sign determines class: > 0 -> positive
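Usage (the scores shown are illustrative; actual values depend on the trained model):
print(naive_bayes_predict("I am happy because I am learning :)", logprior, loglikelihood))
# a positive score (> 0) -> predicted Positive
print(naive_bayes_predict("this movie was terrible", logprior, loglikelihood))
# a negative score (< 0) -> predicted Negative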
10. Testing & evaluation
In code you loop over the test set, predict labels and compare to true labels. Use metrics:
- Accuracy: Correct predictions / total
- Confusion matrix: Counts of TP, FP, FN, TN
- Precision: TP / (TP + FP) - how many predicted positive were actually positive
- Recall: TP / (TP + FN) - how many actual positives were found
- F1-score: Harmonic mean of precision and recall
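The example below calls a helper named test_naive_bayes that is not defined earlier; a minimal sketch
consistent with this pipeline looks like this (it returns the predicted labels so the sklearn metrics can
use them):
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    # Predict 1 (positive) when the score is positive, otherwise 0 (negative)
    y_hats = [1 if naive_bayes_predict(tweet, logprior, loglikelihood) > 0 else 0
              for tweet in test_x]
    accuracy = np.mean(np.array(y_hats) == np.array(test_y))
    print("Naive Bayes accuracy:", accuracy)
    return y_hats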
Example (python):
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
y_hats = test_naive_bayes(test_x, test_y, logprior, loglikelihood)
print("Accuracy:", accuracy_score(test_y, y_hats))
print("Confusion matrix:\n", confusion_matrix(test_y, y_hats))
print("P/R/F1:", precision_recall_fscore_support(test_y, y_hats, average='binary'))
11. Toy example — full numeric run
Small dataset (3 pos, 3 neg) and step-by-step calculations:
Training tweets:
Pos: "I love this", "This is great", "I am happy"
Neg: "I hate this", "This is bad", "I am sad"
Vocabulary (unique words): i, love, this, is, great, am, happy, hate, bad, sad
V = 10
Positive word counts (sum): N_pos = 9
Negative word counts (sum): N_neg = 9
D_pos = 3, D_neg = 3
logprior = log(3) - log(3) = 0
Laplace smoothing: p(w|pos) = (count_pos(w)+1) / (N_pos + V)
Example for 'love':
count_pos('love') = 1, count_neg('love') = 0
p(love|pos) = (1+1)/(9 + 10) = 2/19
p(love|neg) = (0+1)/19 = 1/19
loglikelihood['love'] = log(2) ≈ +0.693
Predict tweet: "I love this"
score = logprior + loglikelihood['i'] + loglikelihood['love'] + loglikelihood['this']
Since 'i' and 'this' appear equally often in positive and negative tweets, loglikelihood['i'] = loglikelihood['this'] = 0, so
score ≈ 0 + 0 + 0.693 + 0 = 0.693 -> Positive
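The same numbers can be verified in a few lines (a sketch that uses plain lowercase splitting instead of
process_tweet, matching the hand counts above):
import numpy as np
pos = ["I love this", "This is great", "I am happy"]
neg = ["I hate this", "This is bad", "I am sad"]
freqs = {}
for label, tweets in [(1, pos), (0, neg)]:
    for t in tweets:
        for w in t.lower().split():
            freqs[(w, label)] = freqs.get((w, label), 0) + 1
V = len(set(w for w, _ in freqs))                        # 10
N_pos = sum(v for (w, l), v in freqs.items() if l == 1)  # 9
N_neg = sum(v for (w, l), v in freqs.items() if l == 0)  # 9
p_love_pos = (freqs.get(('love', 1), 0) + 1) / (N_pos + V)  # 2/19
p_love_neg = (freqs.get(('love', 0), 0) + 1) / (N_neg + V)  # 1/19
print(np.log(p_love_pos) - np.log(p_love_neg))              # ~0.693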
12. Edge cases, limitations & ways to improve
- Negation handling: Use n-grams or a small rule-based negation flip so 'not good' becomes a separate
feature (see the sketch after this list).
- N-grams: Include bi-grams/trigrams to capture short phrases.
- TF-IDF: Use TF-IDF weighting instead of raw counts for less frequent but informative words.
- Word embeddings: Use pretrained Word2Vec/GloVe/BERT for semantic understanding.
- Model upgrade: Try Logistic Regression, SVM, or Transformer-based models for better accuracy.
- Cross-validation: Use k-fold cross-validation instead of single split to get robust performance estimates.
- More metrics: Check precision, recall, F1 and confusion matrix, not only accuracy.
- Handle OOV: Map unseen words to an 'UNK' token or expand training data.
- Class imbalance: Use oversampling, undersampling, or class weights if classes are unbalanced.
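To illustrate the negation idea from the first item, a tiny rule-based marker (a hypothetical helper, not
part of the pipeline above) can prefix the tokens that follow a negation word, so 'not good' becomes the
distinct feature 'NOT_good':
NEGATIONS = {"not", "no", "never", "n't"}
def mark_negation(tokens, window=3):
    # Prefix up to `window` tokens after a negation word with 'NOT_'
    out, countdown = [], 0
    for tok in tokens:
        if tok in NEGATIONS:
            out.append(tok)
            countdown = window
        elif countdown > 0:
            out.append("NOT_" + tok)
            countdown -= 1
        else:
            out.append(tok)
    return out
print(mark_negation(["this", "is", "not", "good"]))   # ['this', 'is', 'not', 'NOT_good']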
13. Interview Q&A (expected questions + short answers)
Q: Why Naive Bayes for text? A: Simple, fast, works well with small text datasets and provides
interpretable word scores.
Q: Why remove stopwords? A: They add noise and little sentiment information.
Q: What is Laplace smoothing? A: Add 1 to counts so unseen words don't make probabilities zero.
Q: How to handle sarcasm? A: Hard for NB; use larger models (transformers) and context-aware
embeddings.
Q: How to improve accuracy? A: Use n-grams, TF-IDF, embeddings, more data, or stronger classifiers.
Q: Is independence assumption realistic? A: No, but NB still often works well in practice for text.
14. Teaching tips & demo ideas
- Start with the toy dataset: manually count words and compute one prediction by hand.
- Show preprocessing effects: compare raw tweet vs cleaned tokens.
- Visualize loglikelihood scores for top positive/negative words.
- Demonstrate errors (sarcasm, negation) to show limitations.
- Provide a live demo where students type tweets and see prediction results.
15. Appendix: Full cleaned code (copy-paste ready)
# --- Full minimal Naive Bayes pipeline (cleaned version) ---
import re, string
import numpy as np
from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
import random
# 1. Load and shuffle
all_pos = twitter_samples.strings('positive_tweets.json')
all_neg = twitter_samples.strings('negative_tweets.json')
random.seed(42)
random.shuffle(all_pos); random.shuffle(all_neg)
train_pos = all_pos[:4000]; test_pos = all_pos[4000:]
train_neg = all_neg[:4000]; test_neg = all_neg[4000:]
train_x = train_pos + train_neg; test_x = test_pos + test_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
# 2. Preprocess
def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    tweet = str(tweet)
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweets_clean = []
    for word in tweet_tokens:
        if word not in stopwords_english and word not in string.punctuation:
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
    return tweets_clean
# 3. Build freqs
def count_tweets(result, tweets, ys):
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            result[pair] = result.get(pair, 0) + 1
    return result

freqs = count_tweets({}, train_x, train_y)

def freq_pos_count(word, freqs):
    return freqs.get((word, 1), 0)

def freq_neg_count(word, freqs):
    return freqs.get((word, 0), 0)
# 4. Train
def train_naive_bayes(freqs, train_x, train_y):
    vocab = set([k[0] for k in freqs.keys()])
    V = len(vocab)
    N_pos = sum([v for (w, l), v in freqs.items() if l == 1])
    N_neg = sum([v for (w, l), v in freqs.items() if l == 0])
    D_pos = len(train_y[train_y == 1])
    D_neg = len(train_y[train_y == 0])
    logprior = np.log(D_pos) - np.log(D_neg)
    loglikelihood = {}
    for w in vocab:
        freq_pos = freq_pos_count(w, freqs)
        freq_neg = freq_neg_count(w, freqs)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[w] = np.log(p_w_pos) - np.log(p_w_neg)
    return logprior, loglikelihood
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
# 5. Predict
def naive_bayes_predict(tweet, logprior, loglikelihood):
    word_l = process_tweet(tweet)
    p = logprior
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
    return p
# 6. Test and metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
y_hats = [1 if naive_bayes_predict(t, logprior, loglikelihood) > 0 else 0 for t in test_x]
print("Accuracy:", accuracy_score(test_y, y_hats))
print("Confusion matrix:\n", confusion_matrix(test_y, y_hats))
print("P/R/F1:", precision_recall_fscore_support(test_y, y_hats, average='binary'))
16. References
- NLTK twitter_samples corpus (NLTK documentation).
- Standard Naive Bayes tutorials and machine learning texts.