Covid-19 Vaccination Analysis
On
DEGREE
Session 2023-24
in
INDIA
Jan, 2024
SCHOOL OF COMPUTER APPLICATION AND TECHNOLOGY
GALGOTIAS UNIVERSITY, GREATER NOIDA
CANDIDATE’S DECLARATION
We hereby certify that the work which is being presented in the project, entitled Covid-19
Vaccine Analysis, in partial fulfillment of the requirements for the award of the MCA degree
of Galgotias University, Greater Noida, is an original work carried out during the period
August 2023 to January 2024 under the supervision of Dr. Rajnesh Singh.
The matter presented in the project has not been submitted by us for the award of any other
degree of this or any other institution.
Vishal Bhatt(23SCSE2030541)
Sagar Bhatt(23SCSE2030541)
This is to certify that the above statement made by the candidates is correct to the best of
my knowledge.
Mr. Rajaumar P
Assistant Professor
TABLE OF CONTENTS
Candidate Declaration
Certificate
Acknowledgement
Abstract
Chapter 1: Introduction
Chapter 2: Problem Statement
Chapter 3: Methodology and Related work
Chapter 4: Diagrams
Chapter 5: Conclusion
ABSTRACT
The COVID-19 pandemic prompted an urgent need for effective vaccines to mitigate
its spread and impact on global health. Vaccine analysis plays a crucial role in
evaluating the efficacy, safety, and distribution strategies of these vaccines. Python,
with its versatile libraries and tools, serves as a powerful platform for conducting
comprehensive analyses of COVID-19 vaccine data. The analysis typically begins with
data collection from various sources, including clinical trials, public health databases,
and real-world vaccination campaigns. Python's libraries such as Pandas and NumPy
facilitate data manipulation, cleaning, and preprocessing to ensure data quality and
consistency. One fundamental aspect of vaccine analysis is assessing vaccine efficacy.
Python allows researchers to perform statistical analyses, including hypothesis testing
and confidence interval estimation, to determine the effectiveness of vaccines in
preventing COVID-19 infection, severe illness, and mortality.
Covid19 Vaccine Sentiment Analysis
Table Of Contents:
1. Importing Libraries
2. EDA and Visualisation
3. Text Processing
4. Most Prevalent Words in Tweets
5. Apply VADER Sentiment to the tweets to get labels
6. Time Series Analysis On Sentiments
7. Stop Word Removal and Lemmatization
8. Splitting the Data
9. Feature Extraction
10. Model Building
11. Conclusion
1. Importing Libraries
In [1]:
import numpy as np
import pandas as pd
import re
import string
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from nltk.stem import SnowballStemmer
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from collections import Counter
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
plt.style.use('fivethirtyeight')
In [2]:
df = pd.read_csv('vaccination_tweets.csv')  # Dataset source: Kaggle
# Note: some functionalities differ between Google Colab and Jupyter Notebook
2. EDA & Visualisation
2.1 Summary statistics
In [3]:
df.head()
Out[3]:
[df.head() output: first five rows of the dataset with columns id, user_name, user_location, user_description, user_created, user_followers, user_friends, user_favourites, user_verified, date, text, hashtags, source, retweets, favorites, is_retweet.]
In [4]:
df.shape  # Gives no. of rows and columns (rows --> number of examples / data points, columns --> number of attributes)
Out[4]:
(8082, 16)
In [5]:
df.drop(["id","user_created"],axis=1,inplace=True)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8082 entries, 0 to 8081
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_name 8082 non-null object
1 user_location 6452 non-null object
2 user_description 7576 non-null object
3 user_followers 8082 non-null int64
4 user_friends 8082 non-null int64
5 user_favourites 8082 non-null int64
6 user_verified 8082 non-null bool
7 date 8082 non-null object
8 text 8082 non-null object
9 hashtags 6133 non-null object
10 source 8081 non-null object
11 retweets 8082 non-null int64
12 favorites 8082 non-null int64
13 is_retweet 8082 non-null bool
dtypes: bool(2), int64(5), object(7)
memory usage: 773.6+ KB
In [7]:
df.isnull().sum()  # Let's handle the null values when it's required
Out[7]:
user_name 0
user_location 1630
user_description 506
user_followers 0
user_friends 0
user_favourites 0
user_verified 0
date 0
text 0
hashtags 1949
source 1
retweets 0
favorites 0
is_retweet 0
dtype: int64
In [8]:
df.isnull().values.sum()
Out[8]:
4086
In [9]:
df.describe()
Out[9]:
[Summary statistics (count, mean, std, min, quartiles, max) of user_followers, user_friends, user_favourites, retweets and favorites]
2.2 Visualisation
In [10]:
# Let's see the length of the tweets
seq_length = [len(i) for i in df['text']]
pd.Series(seq_length).hist(bins = 25)
Out[10]:
<AxesSubplot:>
In [11]:
sns.set_style('darkgrid')
In [12]:
# Percentage of Verified and Non-verified users
dict_ = df['user_verified'].value_counts().to_dict()
dict_['Verified'] = dict_.pop(True)
dict_['Not-Verified'] = dict_.pop(False)
plt.figure(figsize=(4,4))
plt.pie(x=dict_.values(), labels=dict_.keys(), autopct='%1.1f%%',
shadow=True, startangle=0, explode = [0.1, 0])
plt.show()
We can see that nearly 91% of the users who tweeted are not verified.
In [13]:
# Top 5 most used hashtag combinations in tweets
MostUsedTweets = df.hashtags.value_counts().sort_values(ascending=False)[:5]
colors = ['lightcoral', 'lightskyblue', 'yellowgreen', 'pink', 'orange']
explode = (0.1, 0.2, 0.1, 0.1, 0.1)
# Wedge properties
wp = {'linewidth': 0.5, 'edgecolor': "red"}
# Pie chart of the hashtag counts
fig, ax = plt.subplots(figsize=(8, 6))
wedges, texts, autotexts = ax.pie(MostUsedTweets.values, colors=colors, explode=explode,
                                  autopct='%1.1f%%', wedgeprops=wp)
# Adding legend
ax.legend(wedges, MostUsedTweets.keys(),
          title="Most used hashtags",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
In [14]:
# Number of Tweets Made Per Day
df['tweet_date'] = pd.to_datetime(df['date']).dt.date
tweet_date = df['tweet_date'].value_counts().to_frame().reset_index().rename(columns={'index': 'date', 'tweet_date': 'count'})
tweet_date['date'] = pd.to_datetime(tweet_date['date'])
tweet_date = tweet_date.sort_values('date', ascending=False)
fig = go.Figure(go.Scatter(x=tweet_date['date'],
                           y=tweet_date['count'],
                           mode='markers+lines',
                           name="Submissions",
                           marker_color='dodgerblue'))
fig.update_layout(
    title_text='Tweets per Day : ({} - {})'.format(
        df['tweet_date'].sort_values().iloc[0].strftime("%d/%m/%Y"),
        df['tweet_date'].sort_values().iloc[-1].strftime("%d/%m/%Y")),
    template="plotly_dark",
    title_x=0.5)
fig.show()
It can be seen that tweets related to the vaccine were more frequent during the initial phases of the vaccine launch.
If the plot does not appear when viewed on GitHub (due to a GitHub bug) without downloading the notebook, please check the plot attached as 'Tweets_per_Day.png' with this notebook.
In [15]:
# Days With Maximum Number of Tweets
df["date"] = pd.to_datetime(df["date"])
df["Month"] = df["date"].apply(lambda x: x.month)
df["day"] = df["date"].apply(lambda x: x.dayofweek)
dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
df["day"] = df["day"].map(dmap)
plt.title("Day with maximum tweets")
sns.countplot(x="day", data=df)
Out[15]:
<AxesSubplot:title={'center':'Day with maximum tweets'}, xlabel='day', ylabel='count'>
In [16]:
#Number of Retweets Made
y = df['is_retweet']
fig, ax = plt.subplots(figsize=(5, 5))
count = Counter(y)
ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
ax.set_title('is_retweet?')
plt.show()
In [17]:
# Number of verified and non-verified users who tweeted
plt.figure(figsize=(4, 4))
sns.countplot(x="user_verified", data=df, palette="Set1")
plt.title("Verified VS Unverified Users")
plt.xticks([0, 1], ['Unverified', 'Verified'])
plt.show()
3. Text Preprocessing
Processing the raw tweets using regex.
Text preprocessing is traditionally an important step for Natural Language Processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.
Convert to lowercase: Convert all the tweets to lowercase.
Removing Twitter Handles: Remove the Twitter handles (i.e. the usernames).
Removing Twitter Hashtags: Remove all the hashtags from the tweet.
Removing URLs: Remove the URLs present in the tweet.
Removing Non-Alphabets: Replace every character except digits and alphabets with a space.
Removing Short Words: Words with length less than 2 are removed.
Removing Consecutive Letters: 3 or more consecutive letters are replaced by 2 letters (e.g. "Heyyyy" to "Heyy").
Removing Multiple Spaces: Replace all multiple spaces with a single space.
In [18]:
#Convert to lowercase
df.text = df['text'].str.lower()
#remove hashtags
df.text = df.text.apply(lambda x:re.sub(r'\B#\S+','',x))
# Remove URLS
df.text = df.text.apply(lambda x:re.sub(r"http\S+", "", x))
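The remaining cleaning steps listed above (removing handles, non-alphabetic characters, short words, repeated letters and extra spaces) are not visible in this copy of the cell; a sketch consistent with that description could be:
# Remove Twitter handles (usernames)
df.text = df.text.apply(lambda x: re.sub(r'@\S+', '', x))
# Replace every character other than digits and alphabets with a space
df.text = df.text.apply(lambda x: re.sub(r'[^a-z0-9]', ' ', x))
# Remove short words (length less than 2)
df.text = df.text.apply(lambda x: re.sub(r'\b\w\b', '', x))
# Replace 3 or more consecutive letters with 2 letters (e.g. "heyyyy" -> "heyy")
df.text = df.text.apply(lambda x: re.sub(r'(.)\1\1+', r'\1\1', x))
# Collapse multiple spaces into a single space
df.text = df.text.apply(lambda x: re.sub(r'\s+', ' ', x).strip())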
In [19]:
# Helper to build and display a word cloud from a series of tweets
# (reconstructed to match the calls in the following cells)
def show_wordcloud(data, title=None):
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white',
                          max_words=100, width=700, height=400).generate(" ".join(data.astype(str)))
    plt.figure(figsize=(10, 8))
    if title:
        plt.title(title, fontsize=20)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
In [20]:
show_wordcloud(df['text'], title = 'Prevalent words in tweets')
In [21]:
india_df = df.loc[df.user_location=="India"]
show_wordcloud(india_df['text'], title='Prevalent words in tweets from India')
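The cell that creates the working DataFrame `data` and assigns a VADER label to every tweet is not visible in this copy. A minimal sketch of such a `get_sentiment` helper, consistent with the three labels used below (the `data = df` alias and the ±0.05 compound-score thresholds are assumptions), could look like this:
# nltk.download('vader_lexicon') may be required once
data = df  # assumption: the notebook reuses the same tweets frame under the name 'data'
sid = SIA()
def get_sentiment(frame):
    # Label each tweet from its VADER compound score (thresholds of +/-0.05 are an assumption)
    scores = frame['text'].apply(lambda x: sid.polarity_scores(x)['compound'])
    return scores.apply(lambda c: 'Positive' if c >= 0.05
                        else ('Negative' if c <= -0.05 else 'Neutral'))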
In [ ]:
data['Sentiment'] = get_sentiment(data)
sns.countplot(x="Sentiment", data=data, palette="Set2")
print(data.Sentiment.value_counts())
Neutral 3479
Positive 3210
Negative 1393
Name: Sentiment, dtype: int64
In [24]:
temp = data.groupby('Sentiment').count()['text'].reset_index().sort_values(by='text', ascending=False)
temp.style.background_gradient(cmap='Greens')
Out[24]:
Sentiment text
1 Neutral 3479
2 Positive 3210
0 Negative 1393
In [25]:
plt.figure(figsize=(12, 6))
fig = go.Figure(go.Funnelarea(
    text=temp.Sentiment,
    values=temp.text,
    title={"position": "top center", "text": "Funnel-Chart of Sentiment Distribution"}
))
fig.show()
<Figure size 864x432 with 0 Axes>
If the plot does not appear when viewed on GitHub (due to a GitHub bug) without downloading the notebook, please check the plot attached as 'Funnel_chart.png' with this notebook.
In [26]:
def get_word_cloud(sentiment):
    stop_words = set(stopwords.words('english'))
    remove_words = ['vaccin', 'pfizerbiontech', 'coronavirus', 'pfizer',
                    'covid', 'covidvaccin', 'pfizervaccin']
    stop_words = remove_words + list(stop_words)
    plt.figure(figsize=[15, 15])
    clean_tweets = "".join(list(data[data['Sentiment'] == sentiment]['Tidy Tweet'].values))
    wordcloud = WordCloud(width=700, height=400,
                          background_color='white', colormap='plasma', max_words=50,
                          stopwords=stop_words, collocations=False).generate(clean_tweets)
    plt.title(f"Top 50 {sentiment} words used in tweets", fontsize=20)
    plt.imshow(wordcloud)
    return plt.show()
In [27]:
# Type of tweets made on vaccine over a period of time
data['date'] = pd.to_datetime(data['date']).dt.date
negative_data = data[data['Sentiment'] == 'Negative'].reset_index()
positive_data = data[data['Sentiment'] == 'Positive'].reset_index()
grouped_data_neg = negative_data.groupby('date')['Sentiment'].count().reset_index()
grouped_data_pos = positive_data.groupby('date')['Sentiment'].count().reset_index()
merged_data = pd.merge(grouped_data_neg, grouped_data_pos, left_on='date',
                       right_on='date', suffixes=(' Negative', ' Positive'))
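The figure drawn from `merged_data` is not reproduced in this copy; a simple sketch of how the daily positive and negative tweet counts could be plotted (a plain matplotlib line chart is assumed here) is:
plt.figure(figsize=(12, 5))
# Columns carry the ' Negative' / ' Positive' suffixes from the merge above
plt.plot(merged_data['date'], merged_data['Sentiment Positive'], label='Positive tweets')
plt.plot(merged_data['date'], merged_data['Sentiment Negative'], label='Negative tweets')
plt.title('Positive and negative tweets per day')
plt.xlabel('Date')
plt.ylabel('Number of tweets')
plt.legend()
plt.show()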
Here we observe that there were more tweets when the vaccine was released, and the number of tweets about the vaccine decreases as time goes on.
In [28]:
# Get the Positive, Neutral and Negative Sentiment Scores
sid = SIA()
data['sentiments'] = data['text'].apply(lambda x:
    sid.polarity_scores(' '.join(re.findall(r'\w+', x.lower()))))
data['Positive Sentiment'] = data['sentiments'].apply(lambda x: x['pos'] + 1 * (10 ** -6))
data['Neutral Sentiment'] = data['sentiments'].apply(lambda x: x['neu'] + 1 * (10 ** -6))
data['Negative Sentiment'] = data['sentiments'].apply(lambda x: x['neg'] + 1 * (10 ** -6))
data.drop(columns=['sentiments'], inplace=True)
In [29]:
data.head(3)
Out[29]:
3 rows × 22 columns
In [30]:
# Distribution Of Sentiments across the tweets
plt.subplot(2, 1, 1)
plt.title('Distribution Of Sentiments Across Our Tweets', fontsize=19, fontweight='bold')
sns.kdeplot(data['Negative Sentiment'], bw=0.1)
sns.kdeplot(data['Positive Sentiment'], bw=0.1)
sns.kdeplot(data['Neutral Sentiment'], bw=0.1)
plt.subplot(2, 1, 2)
plt.title('CDF Of Sentiments Across Our Tweets', fontsize=19, fontweight='bold')
sns.kdeplot(data['Negative Sentiment'], bw=0.1, cumulative=True)
sns.kdeplot(data['Positive Sentiment'], bw=0.1, cumulative=True)
sns.kdeplot(data['Neutral Sentiment'], bw=0.1, cumulative=True)
plt.xlabel('Sentiment Value', fontsize=19)
plt.show()
In [31]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
import random
plt.rc('figure', figsize=(17, 13))
def get_word_cloud(sentiment):
    stop_words = set(stopwords.words('english'))
    remove_words = ['vaccin', 'pfizerbiontech', 'coronavirus', 'pfizer',
                    'covid', 'covidvaccin', 'pfizervaccin']
    stop_words = remove_words + list(stop_words)
    plt.figure(figsize=[10, 10])
    clean_tweets = "".join(list(data[data['Sentiment'] == sentiment]['text'].values))
    wordcloud = WordCloud(width=700, height=400,
                          background_color='white', colormap='plasma', max_words=50,
                          stopwords=stop_words, collocations=False).generate(clean_tweets)
    plt.title(f"Top 50 {sentiment} words used in tweets", fontsize=20)
    plt.imshow(wordcloud)
    return plt.show()
In [32]:
get_word_cloud(sentiment='Positive')
Here we can see that words such as dose, vaccine, thank, good and first contribute towards the positive sentiment.
In [33]:
get_word_cloud(sentiment='Negative')
Here we can see that words such as death, died, pain and stop contribute towards the negative sentiment.
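The cell that computes the daily mean sentiment scores (`b_date_mean`) and builds the two-row subplot figure used below does not appear in this copy. A sketch consistent with the trace, shape and annotation calls that follow (the subplot titles are assumptions) could be:
from plotly.subplots import make_subplots
# Daily mean of the VADER sentiment scores
b_date_mean = data.groupby('date')[['Positive Sentiment', 'Negative Sentiment']].mean().reset_index()
fig = make_subplots(rows=2, cols=1,
                    subplot_titles=('Positive Sentiment Over Time', 'Negative Sentiment Over Time'))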
fig.add_trace(
    go.Scatter(x=b_date_mean['date'], y=b_date_mean['Positive Sentiment'],
               name='Positive Sentiment Mean'),
    row=1, col=1
)
# positive mean
fig.add_shape(type="line",
    x0=b_date_mean['date'].values[0], y0=b_date_mean['Positive Sentiment'].mean(),
    x1=b_date_mean['date'].values[-1], y1=b_date_mean['Positive Sentiment'].mean(),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
    name='Mean'
)
fig.add_annotation(x=b_date_mean['date'].values[3], y=b_date_mean['Positive Sentiment'].mean(),
    text=r"$\mu : {:.2f}$".format(b_date_mean['Positive Sentiment'].mean()),
    showarrow=True,
    arrowhead=3,
    yshift=10)
fig.add_trace(
    go.Scatter(x=b_date_mean['date'], y=b_date_mean['Negative Sentiment'],
               name='Negative Sentiment Mean'),
    row=2, col=1
)
# negative mean
fig.add_shape(type="line",
    x0=b_date_mean['date'].values[0], y0=b_date_mean['Negative Sentiment'].mean(),
    x1=b_date_mean['date'].values[-1], y1=b_date_mean['Negative Sentiment'].mean(),
    line=dict(
        color="Red",
        width=2,
        dash="dashdot",
    ),
    name='Mean',
    xref='x2',
    yref='y2'
)
fig.add_annotation(x=b_date_mean['date'].values[3], y=b_date_mean['Negative Sentiment'].mean(),
    text=r"$\mu : {:.2f}$".format(b_date_mean['Negative Sentiment'].mean()),
    showarrow=True,
    arrowhead=3,
    yshift=10,
    xref='x2',
    yref='y2')
fig.add_annotation(x=b_date_mean['date'].values[5], y=b_date_mean['Negative Sentiment'].mean() + 0.01,
    text=r"Start Of Decline",
    showarrow=True,
    arrowhead=6,
    yshift=10,
    xref='x2',
    yref='y2')
fig.add_annotation(x=b_date_mean['date'].values[15], y=.024,
    text=r"Start Of Incline",
    showarrow=True,
    arrowhead=6,
    yshift=10,
    xref='x2',
    yref='y2')
fig['layout']['xaxis2']['title'] = 'Date'
fig.update_layout(height=700, width=900, title_text="Sentiment Average Change With Time")
fig.show()
Here we can see that there is no trend, cycle or seasonality over time in the positive and negative sentiments of the tweets, so we can conclude that time series analysis on the sentiments is not useful.
If the plot does not appear when viewed on GitHub without downloading the notebook, please check the plot attached as 'Sentiment_Average_Change_With_Time.png' with this notebook.
In [ ]:
In [35]:
# Check the most followed (famous) users tweet
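# (The body of this cell is not shown in this copy. The lines below are a
#  hypothetical reconstruction: they chart the most-followed accounts and the
#  VADER label of their tweets; the bar-chart styling is an assumption.)
top_users = (data.sort_values('user_followers', ascending=False)
                 .drop_duplicates('user_name')
                 .head(15))
fig = go.Figure(go.Bar(x=top_users['user_name'],
                       y=top_users['user_followers'],
                       text=top_users['Sentiment'],
                       marker_color='dodgerblue'))
fig.update_layout(title_text='Most followed users and the sentiment of their tweets',
                  template="plotly_dark", title_x=0.5)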
fig.show()
As is obvious, verified users and media channels make a significant impact on people.
ABP News, Economic Times and Business Standard, which are among the most famous media channels, tweeted negatively about the vaccine, while The Hindu, DD News and CGTN tweeted positively.
In [36]:
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
# 2x2 grid of axes for the ACF / PACF plots
fig, ax = plt.subplots(2, 2)
plot_acf(b_date_mean['Negative Sentiment'], lags=20, ax=ax[0, 0], title='Autocorrelation Negative')
plot_pacf(b_date_mean['Negative Sentiment'], lags=20, ax=ax[1, 0], title='Partial Autocorrelation Negative')
plot_acf(b_date_mean['Positive Sentiment'], lags=20, ax=ax[0, 1], color='tab:red', title='Autocorrelation Positive')
plot_pacf(b_date_mean['Positive Sentiment'], lags=20, ax=ax[1, 1], color='tab:red', title='Partial Autocorrelation Positive')
plt.show()
Here, from the graphs, we can observe that the ACF and PACF values for the positive and negative sentiments are nearly zero and there is no exponential decay in the ACF and PACF plots. Hence the p and q values are 0, and using time series forecasting models like ARMA or ARIMA does not make sense.
7. Stop Word Removal and Lemmatization
In [38]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
df['text'] = df['text'].apply(word_tokenize)
df.replace({'Negative': 0, 'Neutral':1 ,'Positive': 2}, inplace=True)
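The stop word removal and lemmatization steps themselves are not visible in this copy; a sketch using the modules imported above (stopwords and WordNetLemmatizer) could be:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Drop stop words and reduce every token to its lemma, then rebuild the tweet text
df['text'] = df['text'].apply(
    lambda tokens: ' '.join(lemmatizer.lemmatize(w) for w in tokens if w not in stop_words))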
8. Splitting the Data
Training data: the dataset on which the model is trained; it contains 95% of the data. Test data: the dataset against which the model is tested; it contains 5% of the data.
In [41]:
X = df['text']
y = df['Sentiment']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42)
In [42]:
X_train, X_test, y_train, y_test = X_train.astype(str), X_test.astype(str), y_train.astype(str), y_test.astype(str)
9. Feature Extraction
TF-IDF indicates how important a word is for understanding a document or dataset. Let us understand it with an example. Suppose you have a dataset where students write an essay on the topic My House. In this dataset, the word a appears many times; it is a high-frequency word compared to other words in the dataset. The dataset contains other words like home, house and rooms that appear less often, so their frequencies are lower and they carry more information than the word a. This is the intuition behind TF-IDF.
The TF-IDF Vectoriser converts a collection of raw documents to a matrix of TF-IDF features. The vectoriser is usually trained on the X_train dataset only.
ngram_range is the range of the number of words in a sequence (e.g. "very expensive" is a 2-gram that is treated as an extra feature, separately from "very" and "expensive", when the n-gram range is (1,2)).
max_features specifies the number of features to keep, ordered by feature frequency across the corpus. A sketch of fitting the vectoriser is given after the import below.
In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
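The cell that actually fits the vectoriser is not shown here; a sketch following the description above (the ngram_range and max_features values are assumptions) could be:
vectoriser = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
vectoriser.fit(X_train)  # learn the vocabulary from the training split only
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
print('No. of feature words:', len(vectoriser.get_feature_names_out()))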
10. Model Building
We build and compare four models:
i) Logistic Regression
ii) Naive Bayes
iii) Linear Support Vector Classification (LinearSVC)
iv) Random Forest
Since our dataset is skewed, i.e. it does not contain an equal number of positive and negative examples, we choose the F1-score as our evaluation metric. Furthermore, we plot the confusion matrix to understand how each model performs on the different classes.
In [45]:
# Evaluate Model Function
from sklearn.metrics import classification_report, confusion_matrix
def model_Evaluate(model):
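    # (The body of this function is truncated in this copy; the lines below are a
    #  minimal reconstruction that prints the classification report and a
    #  confusion-matrix heatmap; the heatmap styling is an assumption.)
    y_pred = model.predict(X_test)
    print('Classification Report')
    print(classification_report(y_test, y_pred))
    cf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted values')
    plt.ylabel('Actual values')
    plt.show()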
10.1 Logistic Regression
The multinomial logistic regression algorithm is an extension of the logistic regression model that changes the loss function to cross-entropy loss and the predicted probability distribution to a multinomial probability distribution, so that it natively supports multi-class classification problems.
In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
clf = LogisticRegression(random_state=0,max_iter=1000).fit(X_train, y_train)
model_Evaluate(clf)
Classification Report
precision recall f1-score support
0 0.93 0.51 0.66 72
1 0.75 0.90 0.82 178
2 0.84 0.82 0.83 155
10.2 Naive Bayes
In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Bayesian network models, but coupled with kernel density estimation they can achieve higher accuracy levels.
In [47]:
from sklearn.naive_bayes import MultinomialNB
BNBmodel = MultinomialNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
Classification Report
precision recall f1-score support
10.3 Linear Support Vector Classifier
The Support Vector Classifier, or SVC, is one of the most popular supervised learning algorithms; it can be used for classification as well as regression problems, though it is primarily used for classification problems in machine learning.
The goal of the SVC algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane.
In [48]:
from sklearn.svm import LinearSVC
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
Classification Report
precision recall f1-score support
10.4 RandomForestClassifier
A random forest is a machine learning technique that’s used to solve regression and classification
problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide
solutions to complex problems.
A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest
algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm
that improves the accuracy of machine learning algorithms.
The random forest algorithm establishes the outcome based on the predictions of its decision trees: for classification it takes a majority vote of the trees' outputs (and the average of the outputs for regression). Increasing the number of trees generally makes the outcome more stable.
In [49]:
from sklearn.ensemble import RandomForestClassifier
random_model = RandomForestClassifier(max_depth=500,random_state=0)
random_model.fit(X_train, y_train)
model_Evaluate(random_model)
Classification Report
precision recall f1-score support
11. Conclusion
We can clearly see that the Support Vector Classifier performs the best out of all the models we tried. It achieves nearly 83% accuracy and per-class F1-scores of 72%, 76% and 85% when classifying the sentiment of a tweet.
Although the Logistic Regression and Random Forest classifiers have nearly the same accuracy (80%) as the SVC, the overall F1-scores are better for the SVC.