ASSIGNMENT-3
Sentiment Analysis of Tweets during Election Campaigns
1. INTRODUCTION:
This project focuses on analyzing the sentiments expressed in tweets during election
campaigns. Social media platforms like Twitter play a vital role in shaping public opinion,
and analyzing these sentiments provides valuable insights into voters' perceptions and
political trends. The aim is to classify tweets as Positive, Negative, or Neutral using Natural
Language Processing (NLP) and Deep Learning techniques.
• Objective: Predict the sentiment of tweets posted during election campaigns.
• Dataset: A collection of tweets related to Indian election campaigns (bjp_tweets.csv).
• Approach: Use a deep learning model (LSTM) to capture the contextual meaning of
tweets and classify their sentiment.
2. ABSTRACT:
This project performs sentiment analysis on election-related tweets to understand public
opinion toward political campaigns. The dataset (bjp_tweets.csv) contains real tweets
related to Indian elections. The data undergoes preprocessing such as cleaning,
tokenization, stopword removal, and lemmatization. Tweets are converted into numerical
form using word embeddings, and an LSTM-based neural network is trained to classify
tweets into positive, negative, or neutral categories. The model performance is evaluated
using accuracy, precision, recall, and F1-score metrics. Word clouds and sentiment
distribution charts visually represent the results, offering insights into people's attitudes
during the campaign period.
3. DATA PREPROCESSING:
Data preprocessing ensures that the model learns effectively from clean and structured
input text.
Steps (a worked example follows this list):
• Loading Data: The dataset (bjp_tweets.csv) is loaded using pandas.
• Cleaning Text: Remove URLs, hashtags, mentions, emojis, and special characters.
• Lowercasing and Tokenization: Convert text to lowercase and split it into tokens.
• Stopword Removal and Lemmatization: Remove common stopwords that carry little
sentiment information, and reduce the remaining tokens to their base (dictionary) form.
• Word Embedding: Convert words into dense vector representations using Tokenizer and
Embedding layers in Keras.
• Train-Test Split: Split data into 80% training and 20% testing sets for model evaluation.
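For illustration, here is what the cleaning steps do to one hypothetical tweet (a minimal sketch; the full pipeline in the PROGRAM section uses NLTK's complete stopword list and a lemmatizer):

import re

raw = "Great rally by @some_leader today!! #Elections2019 https://t.co/xyz"  # hypothetical tweet

text = re.sub(r'http\S+', '', raw)       # drop the URL
text = re.sub(r'@\w+|#\w+', '', text)    # drop the mention and the hashtag
text = re.sub(r'[^a-zA-Z ]', '', text)   # drop punctuation and digits
text = text.lower()
print(text.split())                      # ['great', 'rally', 'by', 'today']

stop_words = {'by'}                      # stand-in for NLTK's English stopword list
print([w for w in text.split() if w not in stop_words])  # ['great', 'rally', 'today']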
4. MODEL DESIGN:
The deep learning model used is a Long Short-Term Memory (LSTM) network, suitable for
sequential text data.
Architecture (a parameter-count sketch follows the list):
• Embedding Layer: Converts each word into a fixed-length dense vector.
• LSTM Layer: Captures the sequential and contextual meaning of words.
• Dense Layers: Fully connected layers for classification.
• Output Layer: Softmax activation for three sentiment categories (Positive, Negative,
Neutral).
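As a quick sanity check on this design, the number of trainable parameters can be computed by hand; a sketch assuming the hyperparameters used in the PROGRAM section (vocabulary 5,000, embedding dimension 128, sequence length 100, 128 LSTM units):

# Hand-computed trainable parameters for the architecture above
embedding = 5000 * 128                     # 640,000
lstm = 4 * (128 + 128 + 1) * 128           # 131,584: 4 gates x (input + recurrent + bias) x units
dense1 = 128 * 64 + 64                     # 8,256
output = 64 * 3 + 3                        # 195
print(embedding + lstm + dense1 + output)  # 780,035, matching model.summary()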
Model Compilation:
• Loss Function: Categorical Crossentropy
• Optimizer: Adam
• Metrics: Accuracy
5. TRAINING PROCESS:
The model is trained on the processed tweet data for 10–15 epochs using mini-batches of
32 samples.
Early stopping is implemented to prevent overfitting by monitoring validation loss. After
training, the model is evaluated on the test dataset, and performance metrics are calculated.
Metrics Evaluated (a computation sketch follows the list):
• Accuracy
• Precision
• Recall
• F1-Score
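For reference, a minimal sketch of computing these four metrics with scikit-learn, macro-averaged over the three classes (y_true and y_pred are the integer class indices produced in the PROGRAM section):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Macro-averaging weights the three sentiment classes equally
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0)
print("Accuracy :", round(accuracy_score(y_true, y_pred), 3))
print("Precision:", round(precision, 3), "Recall:", round(recall, 3), "F1:", round(f1, 3))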
Visualizations such as the confusion matrix, sentiment distribution, and word clouds are
generated to interpret the model's results (the plotting code appears at the end of the
PROGRAM section).
PROGRAM:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the NLTK resources used below (needed only on the first run)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
# Load dataset
df = pd.read_csv('bjp_tweets.csv')
print(df.head())
# Preprocessing: clean, tokenize, remove stopwords, lemmatize
stop_words = set(stopwords.words('english'))   # build the set once, not per tweet
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'http\S+', '', text)          # remove URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)   # remove mentions
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)   # remove hashtags
    text = re.sub(r'[^a-zA-Z ]', '', text)       # remove emojis, digits, punctuation
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

df['clean_text'] = df['text'].astype(str).apply(clean_text)
# Label tweets using VADER if not labeled
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    score = sia.polarity_scores(text)['compound']
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['label'] = df['clean_text'].apply(get_sentiment)
# Encode labels
le = LabelEncoder()
df['encoded_label'] = le.fit_transform(df['label'])
# Tokenization
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['clean_text'])
X = tokenizer.texts_to_sequences(df['clean_text'])
X = pad_sequences(X, maxlen=100)
y = tf.keras.utils.to_categorical(df['encoded_label'])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build LSTM model
model = Sequential([
    Embedding(5000, 128, input_length=100),         # word index -> 128-d dense vector
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),  # sequential context of the tweet
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(3, activation='softmax')                  # Positive / Negative / Neutral
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train model; early stopping on validation loss guards against overfitting (Section 5)
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=15, batch_size=32,
                    validation_split=0.2, callbacks=[early_stop], verbose=1)
# Evaluate model on the held-out test set
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Classification Report:\n",
      classification_report(y_true, y_pred, target_names=le.classes_))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
OUTPUT:
The model achieved an accuracy between 80% and 90% on the test data.
The classification report and confusion matrix showed the distribution of correctly and
incorrectly classified tweets.
Word clouds showed that Positive tweets frequently contained terms such as 'development',
'leader', and 'India', while Negative tweets contained terms such as 'corruption', 'fail',
and 'unemployment'.
Sentiment distribution and training accuracy/loss curves were plotted to visualize model
performance.
REFERENCE LINKS:
1. Kaggle Indian Election Tweets Dataset: https://www.kaggle.com/datasets
2. NLTK Sentiment Analysis (VADER): https://www.nltk.org/howto/sentiment.html
3. TensorFlow Keras Documentation: https://www.tensorflow.org/guide/keras
4. Scikit-learn Documentation: https://scikit-learn.org/stable/
5. Matplotlib and Seaborn Visualization Libraries: https://matplotlib.org,
https://seaborn.pydata.org