0% found this document useful (0 votes)

22 views39 pages

Covid-19 Vaccination Analysis

It's a project report on Covid -19 vaccination analysis

Uploaded by

nk493209

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views39 pages

Covid-19 Vaccination Analysis

It's a project report on Covid -19 vaccination analysis

Uploaded by

nk493209

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

A Project Report

Covid-19 Vaccine Analysis

Submitted in partial fulfillment of the

requirement for the award of the degree of

MASTER OF COMPUTER APPLICATION

DEGREE
Session 2023-24
in

Covid-19 Vaccine Analysis

By
Mohd Nadeem(23SCSE2030560)
Vishal Bhatt(23SCSE2030543)
Sagar Bhatt(23SCSE2030541)

Under the guidance of

Mr. Rajakumar P

SCHOOL OF COMPUTER APPLICATION AND TECHNOLOGY

GALGOTIAS UNIVERSITY, GREATER NOIDA

INDIA

Jan, 2024
SCHOOL OF COMPUTER APPLICATION AND
TECHNOLOGY
GALGOTIAS UNIVERSITY, GREATER NOIDA

CANDIDATE’S DECLARATION

We hereby certify that the work which is being presented in the project, entitled Covid-19

Vaccine Analysis in partial fulfillment of the requirements for the award of the MCA

(Master of Computer Application) submitted in the School of Computer Application and

Technology of Galgotias University, Greater Noida, is an original work carried out during

the period of August, 2023 to Jan and 2024, under the supervision of Dr. Rajnesh Singh,

Department of Computer Science and Engineering/School of Computer Application and

Technology , Galgotias University, Greater Noida.

The matter presented in the thesis/project/dissertation has not been submitted by me/us

for the award of any other degree of this or any other places.

Mohd Nadeem (23SCSE2030560)

Vishal Bhatt(23SCSE2030541)

Sagar Bhatt(23SCSE2030541)

This is to certify that the above statement made by the candidates is correct to the best of

my knowledge.

Mr. Rajaumar P

Assistant Professor
TABLE OF CONTENTS
Candidate Declaration
Certificate
Acknowledgement
Abstract
Chapter 1: Introduction
Chapter 2: Problem Statement
Chapter 3: Methodology and Related work
Chapter 4: Diagrams
Chapter 5: Conclusion

3
ABSTRACT
The COVID-19 pandemic prompted an urgent need for effective vaccines to mitigate
its spread and impact on global health. Vaccine analysis plays a crucial role in
evaluating the efficacy, safety, and distribution strategies of these vaccines. Python,
with its versatile libraries and tools, serves as a powerful platform for conducting
comprehensive analyses of COVID-19 vaccine data.The analysis typically begins with
data collection from various sources, including clinical trials, public health databases,
and real-world vaccination campaigns. Python's libraries such as Pandas and NumPy
facilitate data manipulation, cleaning, and preprocessing to ensure data quality and
consistency.One fundamental aspect of vaccine analysis is assessing vaccine efficacy.
Python allows researchers to perform statistical analyses, including hypothesis testing
and confidence interval estimation, to determine the effectiveness of vaccines in
preventing COVID-19 infection, severe illness, and mortality.

Furthermore, Python enables researchers to investigate vaccine safety by analyzing

adverse event reports and conducting pharmacovigilance studies. Natural Language
Processing (NLP) techniques can be employed to extract insights from text data, such
as social media posts or medical records, regarding vaccine-related adverse
reactions.In addition to efficacy and safety analyses, Python facilitates the exploration
of vaccination coverage and distribution patterns. Visualization libraries like
Matplotlib and Seaborn aid in creating informative plots and charts to visualize
vaccine uptake across different demographics, regions, and time periods.Moreover,
machine learning algorithms implemented in Python can be utilized for predictive
modeling, forecasting vaccine demand, identifying high-risk populations, and
optimizing vaccination strategies.

In conclusion, Python serves as a versatile tool for conducting comprehensive analyses

of COVID-19 vaccines, encompassing efficacy, safety, distribution, and predictive
modeling. By leveraging Python's capabilities, researchers can derive valuable insights
to inform public health policies and strategies aimed at controlling the pandemic.

4
Covid19 Vaccine Sentiment Analysis
Table Of Contents:
1. Importing Libraries
2. EDA and Visulation
3. Text Processing
4. Most Prevalent Words in Tweet
5. Apply VADER Sentiment to the tweets to get labels
6. Time Series Analysis On Sentiments
7. Stop Word Removal and Lemmatization
8. Splitting the Data
9. Feature Extraction
10. Model Building
11. Conclusion

1. Importing Libraries
In [1]:
import numpy as np
import pandas as pd
import re
import string

import matplotlib.pyplot as plt

import seaborn as sns
import plotly.graph_objects as go
from collections import Counter
from plotly.subplots import make_subplots
from matplotlib import rcParams

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from nltk.stem import SnowballStemmer
from wordcloud import WordCloud, STOPWORDS

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
plt.style.use('fivethirtyeight')
In [2]:
df = pd.read_csv('vaccination_tweets.csv') # Dataset Source :
kaggle
# Note: some functionalities differ from google colab and jupyter notebook

5
2. EDA & Visualisation
2.1 Summary statistics
In [3]:
df.head()
Out[3]:
us us us us s r fa is
us use use
er user er er er o et v _r
er r_f r_f d
_n _des _c _f _v tex hasht u w o et
id _lo oll avo at
a cript re rie eri t ags r e ri w
cat ow uri e
m ion ate nd fie c et te ee
ion ers tes
e d s d e s s t

Sa
me
2 fol T
Aggr 0 ks w
La
egato 2 sai it
Cr 20
r of 0- d te
esc 09
R Asia 1 dai r
1340 ent -
ac n 2- ko f
5391 a- 04 ['Pfize Fa
he Ame 40 16 324 Fal 2 n o
0 1197 M - rBioN 0 0 ls
l rican 5 92 7 se 0 pa r
1516 ont 08 Tech'] e
R news 0 ste A
416 ros 17:
oh ; 6: co n
e, 52:
scan 0 uld d
C 46
ning 6: tre r
A
di... 4 at oi
4 a d
cyt
...

Sa Mark 20 2 W T
n eting 09 0 hil w
1338 Al
Fra dude - 2 e it
1585 be Fa
nci , tech 09 83 66 Fal 0- the
1 4335 rt 178 NaN te 1 1 ls
sco geek, - 4 6 se 1 wo r
9250 Fo e
, heav 21 2- rld W
433 ng
C y 15: 1 ha e
A meta 27: 3 s b

6
us us us us s r fa is
us use use
er user er er er o et v _r
er r_f r_f d
_n _des _c _f _v tex hasht u w o et
id _lo oll avo at
a cript re rie eri t ags r e ri w
cat ow uri e
m ion ate nd fie c et te ee
ion ers tes
e d s d e s s t

l& 30 1 be A
'80s . 6: en p
.. 2 on p
7: the
1 wr
3 on
g
sid
e
of
...

#c
or
on
2 avi T
0 rus w
2 #S it
20 ['coro
0- put te
20 naviru
1 nik r
1337 eli heil, - s',
Yo 2- V f
8581 �� hydr 06 'Sputn Fa
ur Fal 1 #A o
2 9914 a - 10 88 155 ikV', 0 0 ls
�� Be se 2 str r
0118 25 'Astra e
� d �☺ 2 aZ A
533 23: Zenec
0: en n
30: a',
3 ec d
28 'Pf...
3: a r
4 #P oi
5 fiz d
er
Bi
o...

3 1337 C Va Hosti 20 49
39 218 Tr
2 Fa NaN T 4 2 Fa
8557 ha nc ng 08 16 0 cts w 4 1 ls

7
us us us us s r fa is
us use use
er user er er er o et v _r
er r_f r_f d
_n _des _c _f _v tex hasht u w o et
id _lo oll avo at
a cript re rie eri t ags r e ri w
cat ow uri e
m ion ate nd fie c et te ee
ion ers tes
e d s d e s s t

3991 rle ou "Cha - 5 33 53 ue 2 are it 6 2 e

8835 s ver rlesA 09 0- im te 9
717 A , dlerT - 1 mu r
dl BC onig 10 2- tab W
er - ht" 11: 1 le, e
Ca Glob 28: 2 Se b
na al 53 2 nat A
da New 0: or, p
s 2 ev p
Radi. 3: en
.. 5 wh
9 en
yo
u'r
e...

Ex
pla
in
2 T
Citiz to
0 w
Ci en me
2 it
tiz New 20 ag
0- ['wher te
en s 20 ain
1 eareal r
1337 N Chan - wh
2- lthesi f
8540 e nel 04 y Fa
Na 15 58 147 Fal 1 ckpeo o
4 6460 w bring - we 0 0 ls
N 2 0 3 se 2 ple', r
4966 s ing 23 ne e
2 'Pfizer i
912 C you 17: ed
0: BioN P
ha an 58: a
1 Tech'] h
nn alter 42 va
7: o
el nati.. cci
1 n
. ne
9 e
@
Bo
r...

8
In [4]:
df.shape # Gives no. of rows and columns (rows --> number of
examples / data points , columns -> number of attributes)
Out[4]:
(8082, 16)
In [5]:
df.drop(["id","user_created"],axis=1,inplace=True)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8082 entries, 0 to 8081
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_name 8082 non-null object
1 user_location 6452 non-null object
2 user_description 7576 non-null object
3 user_followers 8082 non-null int64
4 user_friends 8082 non-null int64
5 user_favourites 8082 non-null int64
6 user_verified 8082 non-null bool
7 date 8082 non-null object
8 text 8082 non-null object
9 hashtags 6133 non-null object
10 source 8081 non-null object
11 retweets 8082 non-null int64
12 favorites 8082 non-null int64
13 is_retweet 8082 non-null bool
dtypes: bool(2), int64(5), object(7)
memory usage: 773.6+ KB
In [7]:
df.isnull().sum() # Lets handle the null values when it's required
Out[7]:
user_name 0
user_location 1630
user_description 506
user_followers 0
user_friends 0
user_favourites 0
user_verified 0
date 0
text 0
hashtags 1949
source 1
retweets 0
favorites 0
is_retweet 0
dtype: int64
In [8]:
df.isnull().values.sum()

9
Out[8]:
4086
In [9]:
df.describe()
Out[9]:
user_followers user_friends user_favourites retweets favorites

count 8.082000e+03 8082.000000 8.082000e+03 8082.000000 8082.000000

mean 3.550042e+04 1192.207127 1.513661e+04 1.472037 8.690671

std 2.914947e+05 2982.597309 4.882913e+04 12.922145 59.121769

min 0.000000e+00 0.000000 0.000000e+00 0.000000 0.000000

25% 1.100000e+02 165.000000 4.172500e+02 0.000000 0.000000

50% 4.805000e+02 465.000000 2.329000e+03 0.000000 1.000000

75% 2.089750e+03 1249.500000 1.124975e+04 1.000000 4.000000

max 1.371493e+07 103226.000000 1.166459e+06 678.000000 2315.000000

2.2 Visualisation
In [10]:
# Let's see the length of the tweets
seq_length = [len(i) for i in df['text']]

pd.Series(seq_length).hist(bins = 25)
Out[10]:
<AxesSubplot:>

10
In [11]:
sns.set_style('darkgrid')

df["num of words in text"] = df["text"].apply(lambda x: len(x))

plt.figure(figsize=(10,7))
sns.kdeplot(df["num of words in text"],shade=True, color='m')
plt.title("Distribution of words in text column")
plt.xlabel("Number of words")
plt.show()

11
In [12]:
# Percentage of Verified and Non-verified users

dict_ = df['user_verified'].value_counts().to_dict()
dict_['Verified'] = dict_.pop(True)
dict_['Not-Verified'] = dict_.pop(False)

plt.figure(figsize=(4,4))
plt.pie(x=dict_.values(), labels=dict_.keys(), autopct='%1.1f%%',
shadow=True, startangle=0, explode = [0.1, 0])
plt.show()

12
We can See that nearly 91% of users tweeted are not verified
In [13]:
# Top 10 Most Used Hashtags in tweets

MostUsedTweets = df.hashtags.value_counts().sort_values(ascending=False)[:5]
colors = ['lightcoral', 'lightskyblue', 'yellowgreen', 'pink', 'orange']
explode = (0.1, 0.2, 0.1, 0.1, 0.1)

# Wedge properties
wp = { 'linewidth' : 0.5, 'edgecolor' : "red" }

# Creating autocpt arguments

def func(pct, allvalues):
absolute = int(pct / 100.*np.sum(allvalues))
return "{:.1f}%\n({:d} g)".format(pct, absolute)

# Creating the plot

fig, ax = plt.subplots(figsize =(10, 7))
wedges, texts, autotexts = ax.pie(MostUsedTweets,
autopct = lambda pct: func(pct,
MostUsedTweets),
explode = explode,
labels = MostUsedTweets.keys(),
shadow = True,
colors = colors,
startangle = 90,
wedgeprops = wp,
textprops = dict(color ="black"))

# Adding legend
ax.legend(wedges, MostUsedTweets.keys(),
13
title ="Most used tweets",
loc ="center left",
bbox_to_anchor =(1, 0, 0.5, 1))

plt.setp(autotexts, size=9, weight="bold")

ax.set_title("Most used Hashtags")
plt.axis('equal')
plt.show()

In [14]:
# Number of Tweets Made Per Day

df['tweet_date']=pd.to_datetime(df['date']).dt.date
tweet_date=df['tweet_date'].value_counts().to_frame().reset_index().rename(c
olumns={'index':'date','tweet_date':'count'})
tweet_date['date']=pd.to_datetime(tweet_date['date'])
tweet_date=tweet_date.sort_values('date',ascending=False)

fig=go.Figure(go.Scatter(x=tweet_date['date'],
y=tweet_date['count'],
mode='markers+lines',
name="Submissions",
marker_color='dodgerblue'))

fig.update_layout(
title_text='Tweets per Day : ({} -
{})'.format(df['tweet_date'].sort_values()[0].strftime("%d/%m/%Y"),

df['tweet_date'].sort_values().iloc[-
1].strftime("%d/%m/%Y")),template="plotly_dark",
title_x=0.5)

fig.show()
14
It can be seen that the tweets related to the vaccine were more during the initial phases of the vaccine
launch.

If The Plot not appeared when viewed in github (due to some github bug) without downloading the notebook
then please check the plot attached as 'Tweets_per_Day.png' with this notebook
In [15]:
#Days With Maximum Number of Tweets

df["date"] = pd.to_datetime(df["date"])
df["Month"] = df["date"].apply(lambda x : x.month)
df["day"] = df["date"].apply(lambda x : x.dayofweek)
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df["day"] = df["day"].map(dmap)
plt.title("Day with maximun tweets")
sns.countplot(df["day"])
Out[15]:
<AxesSubplot:title={'center':'Day with maximun tweets'}, xlabel='day', ylabe
l='count'>

In [16]:
#Number of Retweets Made

y = df['is_retweet']
fig, ax = plt.subplots(figsize=(5, 5))
count = Counter(y)
ax.pie(count.values(), labels=count.keys(), autopct=lambda p:f'{p:.2f}%')
ax.set_title('is_retweet?')
15
plt.show()

# No retweets were made.

In [17]:
#Numer of verified and non verified users tweeted.

plt.figure(figsize=(4, 4))
sns.countplot(x ="user_verified",data=df, palette="Set1")
plt.title(" Verified VS Unverified Users")
plt.xticks([False,True],['Unverified','Verified'])
plt.show()

16
3.Text Preprocessing
Processing the raw tweets using regex
Text Preprocessing is traditionally an important step for Natural Language Processing (NLP) tasks. It
transforms text into a more digestible form so that machine learning algorithms can perform better.

The Preprocessing steps taken are:

Conert to lowerCase:Convert all the tweets to lowercase Removing Twitter Handles: Remove the
twitter handles (i.e the usernames)
Remove Twitter Hashtags : Remove all the hashtags from the tweet
Remove URL: Remove the url's present in the tweet.
Removing Non-Alphabets: Replacing characters except Digits and Alphabets with a space.
Removing Short Words: Words with length less than 2 are removed.
Removing Consecutive letters: 3 or more consecutive letters are replaced by 2 letters. (eg: "Heyyyy" to
"Heyy")
Removing Multiple Spaces: Replace all multiple spaces with single space.
In [18]:
#Convert to lowercase
df.text = df['text'].str.lower()

#remove twitter handlers

17
df.text = df.text.apply(lambda x:re.sub('@[^\s]+','',x))

#remove hashtags
df.text = df.text.apply(lambda x:re.sub(r'\B#\S+','',x))

# Remove URLS
df.text = df.text.apply(lambda x:re.sub(r"http\S+", "", x))

# Replace all non alphabets.

df.text = df.text.apply(lambda x:re.sub("[^a-zA-Z0-9]", ' ', x))

#Remove all single characters

df.text = df.text.apply(lambda x:re.sub(r'\s+[a-zA-Z]\s+', '', x))

# Substituting multiple spaces with single space

df.text = df.text.apply(lambda x:re.sub(r'\s+', ' ', x, flags=re.I))

# Replace 3 or more consecutive letters by 2 letter.

df.text = df.text.apply(lambda x:re.sub(r"(.)\1\1+",r"\1\1", x))

4. Most Prevalent Words in Tweet

In [19]:
stopwords = set(STOPWORDS)
def show_wordcloud(data, title = None):
wordcloud = WordCloud(
background_color='black',
stopwords=stopwords,
max_words=100,
max_font_size=40,
scale=5,
random_state=1
).generate(str(data))

fig = plt.figure(1, figsize=(10,10))

plt.axis('off')
if title:
fig.suptitle(title, fontsize=20)
fig.subplots_adjust(top=2.3)

plt.imshow(wordcloud)
plt.show()
In [20]:
show_wordcloud(df['text'], title = 'Prevalent words in tweets')

In [21]:
india_df = df.loc[df.user_location=="India"]
show_wordcloud(india_df['text'], title = 'Prevalent words in tweets from
India')

18
In [ ]:

5. Apply VADER Sentiment to the tweets

to get labels
The VADER Sentimental analysis module is used to label the example / datapoint as the positive, negative
or neutral . VADER sentimental analysis relies on a dictionary that maps lexical features to emotion
intensities known as sentiment scores. The sentiment score of a text can be obtained by summing up each
word's intensity in the text. For example,- Words like 'love,' 'enjoy,' 'happy,' 'like' all convey a positive
sentiment. Also, VADER is intelligent enough to understand these words' basic context, such as "did not
love" as a negative statement. It also understands the emphasis of capitalization and punctuation, such as
"ENJOY."
In [22]:
data = df.fillna('')
In [23]:
sentiment = SIA()
def get_sentiment(data):
sentiment_list = []
for text in list(data['text'].values):
if sentiment.polarity_scores(text)["compound"] > 0:
sentiment_list.append("Positive")
elif sentiment.polarity_scores(text)["compound"] < 0:
sentiment_list.append("Negative")
else:
sentiment_list.append("Neutral")
return sentiment_list

data['Sentiment'] = get_sentiment(data)
sns.countplot(x="Sentiment", data=data, palette="Set2")
print(data.Sentiment.value_counts())
Neutral 3479
Positive 3210
Negative 1393
Name: Sentiment, dtype: int64

19
In [24]:
temp =
data.groupby('Sentiment').count()['text'].reset_index().sort_values(by='text
',ascending=False)
temp.style.background_gradient(cmap='Greens')
Out[24]:
Sentiment text

1 Neutral 3479

2 Positive 3210

0 Negative 1393
In [25]:
plt.figure(figsize=(12,6))

fig = go.Figure(go.Funnelarea(
text =temp.Sentiment,
values = temp.text,
title = {"position": "top center", "text": "Funnel-Chart of Sentiment
Distribution"}
))
fig.show()
<Figure size 864x432 with 0 Axes>
20
If The Plot not appeared when viewed in github (due to some github bug) without downloading the notebook
then please check the plot attached as 'Funnel_chart.png' with this notebook
In [26]:
def get_word_cloud(sentiment):
stop_words = (set(stopwords.words('english')))
remove_words = ['vaccin', 'pfizerbiontech', 'coronavirus', 'pfizer',
'covid', 'covidvaccin', 'pfizervaccin']
stop_words = remove_words + list(stop_words)
plt.figure(figsize=[15,15])
clean_tweets= "".join(list(data[data['Sentiment']==sentiment]['Tidy
Tweet'].values))
wordcloud = WordCloud(width=700,height=400,
background_color='white',colormap='plasma', max_words=50,
stopwords=stop_words, collocations=False).generate(clean_tweets)
plt.title(f"Top 50 {sentiment} words used in tweets", fontsize=20)
plt.imshow(wordcloud)
return plt.show()
In [27]:
# Type of tweets made on vaccine over a period of time
data['date'] = pd.to_datetime(data['date']).dt.date
negative_data = data[data['Sentiment']=='Negative'].reset_index()
positive_data = data[data['Sentiment']=='Positive'].reset_index()
grouped_data_neg =
negative_data.groupby('date')['Sentiment'].count().reset_index()
grouped_data_pos =
positive_data.groupby('date')['Sentiment'].count().reset_index()
merged_data = pd.merge(grouped_data_neg, grouped_data_pos, left_on='date',
right_on='date', suffixes=(' Negative', ' Positive'))

merged_data.plot(x='date', y=['Sentiment Negative', 'Sentiment Positive'],

figsize=(14, 7), marker='o', xlabel='Date', ylabel='Count', title='Tweet
count over a period of time')
Out[27]:
<AxesSubplot:title={'center':'Tweet count over a period of time'}, xlabel='D
ate', ylabel='Count'>

21
Here We Observe that the there were More tweets when the vaccine was released and the number of
tweets about vaccine decreases as the time goes on
In [28]:
# Get the Positive, Neutral and Negative Sentiment Scores
sid = SIA()
data['sentiments'] = data['text'].apply(lambda x:
sid.polarity_scores(' '.join(re.findall(r'\w+',x.lower()))))
data['Positive Sentiment'] = data['sentiments'].apply(lambda x:
x['pos']+1*(10**-6))
data['Neutral Sentiment'] = data['sentiments'].apply(lambda x:
x['neu']+1*(10**-6))
data['Negative Sentiment'] = data['sentiments'].apply(lambda x:
x['neg']+1*(10**-6))

data.drop(columns=['sentiments'],inplace=True)
In [29]:
data.head(3)
Out[29]:
3 rows × 22 columns

In [30]:
#Distribution Of Sentiments across the tweets

plt.subplot(2,1,1)
plt.title('Distriubtion Of Sentiments Across Our
Tweets',fontsize=19,fontweight='bold')
sns.kdeplot(data['Negative Sentiment'],bw=0.1)
sns.kdeplot(data['Positive Sentiment'],bw=0.1)
sns.kdeplot(data['Neutral Sentiment'],bw=0.1)
plt.subplot(2,1,2)

22
plt.title('CDF Of Sentiments Across Our
Tweets',fontsize=19,fontweight='bold')
sns.kdeplot(data['Negative Sentiment'],bw=0.1,cumulative=True)
sns.kdeplot(data['Positive Sentiment'],bw=0.1,cumulative=True)
sns.kdeplot(data['Neutral Sentiment'],bw=0.1,cumulative=True)
plt.xlabel('Sentiment Value',fontsize=19)
plt.show()

In [31]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from wordcloud import WordCloud,STOPWORDS
from nltk.corpus import stopwords
import random
plt.rc('figure',figsize=(17,13))

# Get Wordclouds For Positive and Negative Tweets

def get_word_cloud(sentiment):
stop_words = (set(stopwords.words('english')))
remove_words = ['vaccin', 'pfizerbiontech', 'coronavirus', 'pfizer',
'covid', 'covidvaccin', 'pfizervaccin']
stop_words = remove_words + list(stop_words)
plt.figure(figsize=[10,10])
clean_tweets=
"".join(list(data[data['Sentiment']==sentiment]['text'].values))

23
wordcloud = WordCloud(width=700,height=400,
background_color='white',colormap='plasma', max_words=50,
stopwords=stop_words, collocations=False).generate(clean_tweets)
plt.title(f"Top 50 {sentiment} words used in tweets", fontsize=20)
plt.imshow(wordcloud)
return plt.show()
In [32]:
get_word_cloud(sentiment='Positive')

Here we can see that the words: dose, vaccine, thank, good, first etc.. contribute towards Positive
Sentiment
In [33]:
get_word_cloud(sentiment='Negative')

Here we can see that the words: death,died,pain,stop etc.. contribute towards Negative Sentiment

6. Time Series Analysis On Sentiments

In [34]:
b_date_mean = data.groupby(by='date').mean().reset_index()
b_date_std = data.groupby(by='date').std().reset_index()

fig = make_subplots(rows=2, cols=1,shared_xaxes=True,subplot_titles=('Daily

Average Positive Sentiment', 'Daily Average Negative Sentiment'))

fig.add_trace(
go.Scatter(x=b_date_mean['date'], y=b_date_mean['Positive
Sentiment'],name='Positive Sentiment Mean'),
row=1, col=1
)

#positive mean
fig.add_shape(type="line",
x0=b_date_mean['date'].values[0], y0=b_date_mean['Positive
Sentiment'].mean(), x1=b_date_mean['date'].values[-1],
y1=b_date_mean['Positive Sentiment'].mean(),
line=dict(
color="Red",
width=2,
dash="dashdot",
),
name='Mean'
)

fig.add_annotation(x=b_date_mean['date'].values[3], y=b_date_mean['Positive
Sentiment'].mean(),

24
text=r"$\mu : {:.2f}$".format(b_date_mean['Positive
Sentiment'].mean()),
showarrow=True,
arrowhead=3,
yshift=10)

fig.add_trace(
go.Scatter(x=b_date_mean['date'], y=b_date_mean['Negative
Sentiment'],name='Negative Sentiment Mean'),
row=2, col=1
)

#negative mean
fig.add_shape(type="line",
x0=b_date_mean['date'].values[0], y0=b_date_mean['Negative
Sentiment'].mean(), x1=b_date_mean['date'].values[-1],
y1=b_date_mean['Negative Sentiment'].mean(),
line=dict(
color="Red",
width=2,
dash="dashdot",
),
name='Mean',
xref='x2',
yref='y2'
)

fig.add_annotation(x=b_date_mean['date'].values[3], y=b_date_mean['Negative
Sentiment'].mean(),
text=r"$\mu : {:.2f}$".format(b_date_mean['Negative
Sentiment'].mean()),
showarrow=True,
arrowhead=3,
yshift=10,
xref='x2',
yref='y2')

fig.add_annotation(x=b_date_mean['date'].values[5], y=b_date_mean['Negative
Sentiment'].mean()+0.01,
text=r"Start Of Decline",
showarrow=True,
arrowhead=6,
yshift=10,
xref='x2',
yref='y2')

fig.add_annotation(x=b_date_mean['date'].values[15], y=.024,

25
text=r"Start Of Incline",
showarrow=True,
arrowhead=6,
yshift=10,
xref='x2',
yref='y2')

fig['layout']['xaxis2']['title'] = 'Date'
fig.update_layout(height=700, width=900, title_text="Sentiment Average
Change With Time")
fig.show()
Here we can see that the there is no trend or cycles or seasonality observed with Time in Positive and
Negative Sentiments of tweets. So we can conclude that time series analysis on Sentiments is pointless

If The Plot not appear when viewed in github without downloading then please check the plot attached as
'Sentiment_Average_Change_With_Time.png' with this notebook
In [ ]:

In [35]:
# Check the most followed (famous) users tweet

fig, (ax1, ax2, ax3) = plt.subplots(3,1, figsize=(10, 16))

sns.barplot(x="user_followers", y="user_name", orient="h", ax=ax1,
palette=["b"],
data=data[(data.Sentiment== "Positive")]\
.drop_duplicates(subset=["user_name"])\
.sort_values(by=["user_followers"], ascending=False)[["user_name",
"user_followers"]][:10])
ax1.set_title('Top 10 Accounts with Highest Followers who tweet Positive')
sns.barplot(x="user_followers", y="user_name", orient="h", ax=ax2,
palette=["g"],
data=data[(data.Sentiment == "Neutral")]
.drop_duplicates(subset=["user_name"])\
.sort_values(by=["user_followers"], ascending=False)[["user_name",
"user_followers"]][:10])
ax2.set_title('Top 10 Accounts with Highest Followers who tweet Neutral')
sns.barplot(x="user_followers", y="user_name", orient="h", ax=ax3,
palette=["r"],
data=data[(data.Sentiment == "Negative")]
.drop_duplicates(subset=["user_name"])\
.sort_values(by=["user_followers"], ascending=False)[["user_name",
"user_followers"]][:10])
ax3.set_title('Top 10 Accounts with Highest Followers who tweet Negative')

fig.show()

26
As, its obvious that verified users or media channels make a significant impact on people.
ABP news, Economic Times, Business Standard which are most famous media channels tweeted negatively
about the vaccine. while, the news channels Hindu, DD news, CGTN tweeted positively
In [36]:
27
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose

f, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 10))

ax[0,0].set_ylim(-1.1,1.1)
ax[1,0].set_ylim(-1.1,1.1)
ax[0,1].set_ylim(-1.1,1.1)
ax[1,1].set_ylim(-1.1,1.1)

plot_acf(b_date_mean['Negative Sentiment'],lags=20,
ax=ax[0,0],title='Autocorrelation Negative')
plot_pacf(b_date_mean['Negative Sentiment'],lags=20,
ax=ax[1,0],title='Partial Autocorrelation Negative')
plot_acf(b_date_mean['Positive Sentiment'],lags=20,
ax=ax[0,1],color='tab:red',title='Autocorrelation Positive')
plot_pacf(b_date_mean['Positive Sentiment'],lags=20,
ax=ax[1,1],color='tab:red',title='Partial Autocorrelation Positive')
plt.show()

Here, from graphs we can observe that the acf and pacf values for positive and negative sentiments are
nearly zero and there is no exponential decrease in acf and pacf plots. Hence, the p and q values are 0.
Hence using time series forecating models like ARMA or ARIMA doesn't make any sense.

7.Stop Word Removal and Lemmatization

28
In [37]:
df = data[['Sentiment','text']]

In [38]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Make Sure your intenrt is on :)

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

df['text'] = df['text'].apply(word_tokenize)
df.replace({'Negative': 0, 'Neutral':1 ,'Positive': 2}, inplace=True)

# Encoding 0 for negative, 1 for Neutral, 2 for Positive

[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
Removing Stopwords: Stopwords are the English words which does not add much meaning to a sentence.
They can safely be ignored without sacrificing the meaning of the sentence. (eg: "the", "he", "have")
In [39]:
stop = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: [item for item in x if item not in
stop])
Lemmatizing: Lemmatization is the process of converting a word to its base form. (e.g: “Great” to “Good”)
In [40]:
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
return [lemmatizer.lemmatize(w) for w in text]
df['text'] = df.text.apply(lemmatize_text)

8. Splitting the Data

The Preprocessed Data is divided into 2 sets of data:

Training Data: The dataset upon which the model would be trained on. Contains 95% data. Test Data:
The dataset upon which the model would be tested against. Contains 5% data.
In [41]:
X = df['text']
y = df['Sentiment']
29
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.05, random_state=42)
In [42]:
X_train,X_test,y_train,y_test =
X_train.astype(str),X_test.astype(str),y_train.astype(str),y_test.astype(str)

9. Feature Extraction
TF-IDF indicates what the importance of the word is in order to understand the document or dataset. Let
us understand with an example. Suppose you have a dataset where students write an essay on the topic,
My House. In this dataset, the word a appears many times; it’s a high frequency word compared to other
words in the dataset. The dataset contains other words like home, house, rooms and so on that appear less
often, so their frequency are lower and they carry more information compared to the word. This is the
intuition behind TF-IDF.

TF-IDF Vectoriser converts a collection of raw documents to a matrix of TF-IDF features. The Vectoriser
is usually trained on only the X_train dataset.

ngram_range is the range of number of words in a sequence. (e.g "very expensive" is a 2-gram that is
considered as an extra feature separately from "very" and "expensive" when you have a n-gram range of
(1,2))

max_features specifies the number of features to consider. (Ordered by feature frequency across the
corpus)
In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)

vectoriser.fit(X_train)
print(f'Vectoriser fitted.')
print('No. of feature_words: ', len(vectoriser.get_feature_names()))
Vectoriser fitted.
No. of feature_words: 45197
In [44]:
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
print(f'Data Transformed.')
Data Transformed.

10. Model Building

We're creating 4 different types of model for our sentiment analysis problem:

30
i) Logistic Regression
ii) Naive Bayes
iii) Linear Support Vector Classification (LinearSVC)
iv) Random forest

Since our dataset skewed, i.e. it has no equal number of Positive and Negative Predictions. We're
choosing F1-Score as our evaluation metric. Furthermore, we're plotting the Confusion Matrix to get an
understanding of how our model is performing on different classification types.
In [45]:
# Evaluate Model Function

from sklearn.metrics import classification_report

from sklearn.metrics import confusion_matrix

def model_Evaluate(model):

# Predict values for Test dataset

y_pred = model.predict(X_test)

# Print the evaluation metrics for the dataset.

print('\033[1m'+'\t\t\tClassification Report'+'\033[0m')
print(classification_report(y_test, y_pred))

# Compute and plot the Confusion matrix

cf_matrix = confusion_matrix(y_test, y_pred)
sns.set(rc = {'figure.figsize':(8,8)})
categories = ['Negative','Neutral','Positive']
sns.heatmap(cf_matrix, annot=True,fmt='d',xticklabels = categories,
yticklabels = categories)
plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
plt.ylabel("Actual values" , fontdict = {'size':14}, labelpad = 10)
plt.title ("Confusion Matrix", fontdict = {'size':18}, pad = 20)

10.1 Logistic Regression

Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-
vs-rest can allow logistic regression to be used for multi-class classification problems, although they
require that the classification problem first be transformed into multiple binary classification problems.

Instead, the multinomial logistic regression algorithm is an extension to the logistic regression model
that involves changing the loss function to cross-entropy loss and predict probability distribution to a
multinomial probability distribution to natively support multi-class classification problems.
In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
clf = LogisticRegression(random_state=0,max_iter=1000).fit(X_train, y_train)
model_Evaluate(clf)
Classification Report
precision recall f1-score support

31
0 0.93 0.51 0.66 72
1 0.75 0.90 0.82 178
2 0.84 0.82 0.83 155

accuracy 0.80 405

macro avg 0.84 0.75 0.77 405
weighted avg 0.82 0.80 0.80 405

10.2 Naive Bayes

32
In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying
Bayes' theorem with strong (naïve) independence assumptions between the features (see Bayes classifier).
They are among the simplest Bayesian network models,but coupled with kernel density estimation, they
can achieve higher accuracy levels
In [47]:
from sklearn.naive_bayes import MultinomialNB
BNBmodel = MultinomialNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
Classification Report
precision recall f1-score support

0 0.91 0.14 0.24 72

1 0.74 0.84 0.79 178
2 0.69 0.85 0.76 155

accuracy 0.72 405

macro avg 0.78 0.61 0.60 405
weighted avg 0.75 0.72 0.68 405

33
10.3 Linear Support Vector Classifier
Support Vector Classifier or SVC is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.

The goal of the SVC algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
In [48]:

34
from sklearn.svm import LinearSVC
SVCmodel = LinearSVC()

SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
Classification Report
precision recall f1-score support

0 0.86 0.61 0.72 72

1 0.81 0.92 0.86 178
2 0.86 0.85 0.85 155

accuracy 0.83 405

macro avg 0.84 0.79 0.81 405
weighted avg 0.84 0.83 0.83 405

35
10.4 RandomForestClassifier
A random forest is a machine learning technique that’s used to solve regression and classification
problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide
solutions to complex problems.

A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest
algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm
that improves the accuracy of machine learning algorithms.

36
The (random forest) algorithm establishes the outcome based on the predictions of the decision trees. It
predicts by taking the average or mean of the output from various trees. Increasing the number of trees
increases the precision of the outcome.
In [49]:
from sklearn.ensemble import RandomForestClassifier
random_model = RandomForestClassifier(max_depth=500,random_state=0)
random_model.fit(X_train, y_train)
model_Evaluate(random_model)
Classification Report
precision recall f1-score support

0 0.92 0.47 0.62 72

1 0.72 0.98 0.83 178
2 0.90 0.74 0.81 155

accuracy 0.80 405

macro avg 0.85 0.73 0.76 405
weighted avg 0.83 0.80 0.79 405

37
11. Conclusion
We can clearly see that the Support Vector Classifier Model performs the best out of all the different
models that we tried. It achieves nearly 83% accuracy and F1-Score's 72%,76%,85% while classifying the
sentiment of a tweet.

Although Logistic Regression and Random Forest Classifiers has nearly same accuracy (80%) as SVC
the overall F1 scores are better for SVC

38
39

Rishi Mini Project
No ratings yet
Rishi Mini Project
18 pages
Nishant Mini Project 1 Rishi
No ratings yet
Nishant Mini Project 1 Rishi
18 pages
COVID Vaccine Data Analysis Tools
No ratings yet
COVID Vaccine Data Analysis Tools
5 pages
Dsbda Mini Project Covid Sample
No ratings yet
Dsbda Mini Project Covid Sample
20 pages
DAC Phase4
No ratings yet
DAC Phase4
4 pages
DSBDA Mini Project
No ratings yet
DSBDA Mini Project
19 pages
Covid Vaccine Data Analysis Report
No ratings yet
Covid Vaccine Data Analysis Report
14 pages
COVID-19 Vaccination Data Analysis
No ratings yet
COVID-19 Vaccination Data Analysis
6 pages
Dac Phase3
No ratings yet
Dac Phase3
3 pages
Dac Phase 3 032926
No ratings yet
Dac Phase 3 032926
3 pages
COVID-19 Vaccination Analysis India
No ratings yet
COVID-19 Vaccination Analysis India
33 pages
Harshdeep
No ratings yet
Harshdeep
57 pages
COVID Vaccine Analytics India
No ratings yet
COVID Vaccine Analytics India
11 pages
Mini Report Python
No ratings yet
Mini Report Python
24 pages
Covid Vaccine Statewise Dataset Analysis
No ratings yet
Covid Vaccine Statewise Dataset Analysis
18 pages
BCA Covid-19 Data Analysis
No ratings yet
BCA Covid-19 Data Analysis
37 pages
Data Analytics With Python
No ratings yet
Data Analytics With Python
18 pages
Dsbda Miniproject Report
No ratings yet
Dsbda Miniproject Report
25 pages
Ieee
No ratings yet
Ieee
4 pages
COVID-19 Vaccination Progress Analysis
No ratings yet
COVID-19 Vaccination Progress Analysis
39 pages
COVID-19 Protein Analysis with Python
No ratings yet
COVID-19 Protein Analysis with Python
23 pages
COVID-19 Vaccination Data Analysis
No ratings yet
COVID-19 Vaccination Data Analysis
55 pages
Rinku 22306670010 Project Report
No ratings yet
Rinku 22306670010 Project Report
33 pages
Dsbda Mini Covid 1
No ratings yet
Dsbda Mini Covid 1
24 pages
Dsbda Mini Covid 2
No ratings yet
Dsbda Mini Covid 2
24 pages
Tracking The Pipeline Immunoinformatics and The COVID 19 Vaccine Design
No ratings yet
Tracking The Pipeline Immunoinformatics and The COVID 19 Vaccine Design
21 pages
Research Paper Tanishka
No ratings yet
Research Paper Tanishka
5 pages
First Dsbda
No ratings yet
First Dsbda
34 pages
Python Report (Rabeeeh)
No ratings yet
Python Report (Rabeeeh)
7 pages
DSBDA Mini Project Report
No ratings yet
DSBDA Mini Project Report
9 pages
Covid Data Report
No ratings yet
Covid Data Report
21 pages
COVID 19 Data Analyzer Unveiling Insights Through Data Processing and EDA
No ratings yet
COVID 19 Data Analyzer Unveiling Insights Through Data Processing and EDA
8 pages
COVID-19 Data Analysis Application
No ratings yet
COVID-19 Data Analysis Application
13 pages
COVID 19 Data Analyzer Unveiling Insights Through Data Processing and EDA
No ratings yet
COVID 19 Data Analyzer Unveiling Insights Through Data Processing and EDA
8 pages
COVID-19 Clinical Trials EDA Pandas (ML - FA - DA Projects)
No ratings yet
COVID-19 Clinical Trials EDA Pandas (ML - FA - DA Projects)
53 pages
Topic: The (Covid-19) Vaccine Process. Purpose: To Inform Audience: The Class Tone: Formal
No ratings yet
Topic: The (Covid-19) Vaccine Process. Purpose: To Inform Audience: The Class Tone: Formal
5 pages
Covid Vaccine Analysis Mini-Project
100% (1)
Covid Vaccine Analysis Mini-Project
7 pages
Bioengineering 09 00072
No ratings yet
Bioengineering 09 00072
16 pages
SVKLNVN
No ratings yet
SVKLNVN
7 pages
Name
No ratings yet
Name
23 pages
Vaccine and Public Health
No ratings yet
Vaccine and Public Health
8 pages
COVID-19 Data Analysis with Python
No ratings yet
COVID-19 Data Analysis with Python
15 pages
Covid-19 Pandemic Analysis System Documentation
67% (3)
Covid-19 Pandemic Analysis System Documentation
49 pages
Rahul 2021
No ratings yet
Rahul 2021
6 pages
Sentiment Analysis of COVID-19 Vaccine Tweets
No ratings yet
Sentiment Analysis of COVID-19 Vaccine Tweets
25 pages
Da Phase1
No ratings yet
Da Phase1
6 pages
COVID-19 Clinical Trials EDA Pandas
No ratings yet
COVID-19 Clinical Trials EDA Pandas
30 pages
COVID-19 Spread Analysis & Modeling
No ratings yet
COVID-19 Spread Analysis & Modeling
9 pages
Vaccination Report
No ratings yet
Vaccination Report
8 pages
COVID-19 Impact Analysis India 2024
No ratings yet
COVID-19 Impact Analysis India 2024
28 pages
My Dsbda Miniproject 1
No ratings yet
My Dsbda Miniproject 1
23 pages
Praveen - 2021 - Analyzing The Attitude of Indian Citizens Towards
No ratings yet
Praveen - 2021 - Analyzing The Attitude of Indian Citizens Towards
5 pages
Syadatajveez
No ratings yet
Syadatajveez
21 pages
COVID19 Impact Analysis Project
No ratings yet
COVID19 Impact Analysis Project
6 pages
Service Learning Report
No ratings yet
Service Learning Report
13 pages
An Assessment of Sentiment Analysis of Covid 19 Tweets
No ratings yet
An Assessment of Sentiment Analysis of Covid 19 Tweets
10 pages
Name
No ratings yet
Name
23 pages
Covid-19 Pandemic Analysis System Documentation
No ratings yet
Covid-19 Pandemic Analysis System Documentation
55 pages
Backtesting Algorithmic Crypto Strategies
No ratings yet
Backtesting Algorithmic Crypto Strategies
135 pages
Python Introspection Guide
No ratings yet
Python Introspection Guide
2 pages
Short Introduction To Python Basics: Geared Towards Data Analysis
No ratings yet
Short Introduction To Python Basics: Geared Towards Data Analysis
28 pages
1 - Maya Programming Introduction PDF
No ratings yet
1 - Maya Programming Introduction PDF
29 pages
Pymodbus Readthedocs Io en Stable
No ratings yet
Pymodbus Readthedocs Io en Stable
291 pages
PyXLL Version Logs and Errors
No ratings yet
PyXLL Version Logs and Errors
16 pages
Python QB
No ratings yet
Python QB
6 pages
Intermediate Python Nanodegree Program Syllabus
No ratings yet
Intermediate Python Nanodegree Program Syllabus
11 pages
Python Assignments
No ratings yet
Python Assignments
7 pages
VTMB ModDevGuide
No ratings yet
VTMB ModDevGuide
440 pages
Ip Project - To-Do-List
100% (2)
Ip Project - To-Do-List
26 pages
Software Engineering Bootcamp Syllabus
No ratings yet
Software Engineering Bootcamp Syllabus
16 pages
Snake Game . .
No ratings yet
Snake Game . .
26 pages
Split Up - AI - IX - 2024-25 KVS RO Guwahati
No ratings yet
Split Up - AI - IX - 2024-25 KVS RO Guwahati
3 pages
MCQ Test on Python Tuples and Lists
No ratings yet
MCQ Test on Python Tuples and Lists
3 pages
Python 01 Basic Syntax
No ratings yet
Python 01 Basic Syntax
9 pages
Smart Coffee Machine with Python
No ratings yet
Smart Coffee Machine with Python
14 pages
Embedded Developer at Sevya Multimedia
No ratings yet
Embedded Developer at Sevya Multimedia
6 pages
Python Programming Content
No ratings yet
Python Programming Content
2 pages
Python Basics: Control Flow & Syntax
No ratings yet
Python Basics: Control Flow & Syntax
18 pages
Python Class VI
100% (1)
Python Class VI
1 page
Openmatb: A Multi-Attribute Task Battery Promoting Task Customization, Software Extensibility and Experiment Replicability
No ratings yet
Openmatb: A Multi-Attribute Task Battery Promoting Task Customization, Software Extensibility and Experiment Replicability
11 pages
Robot Framework User Guide
100% (1)
Robot Framework User Guide
163 pages
Use of User-Defined Transform in Data Service 14 PDF
No ratings yet
Use of User-Defined Transform in Data Service 14 PDF
7 pages
Python Important Question Paper
No ratings yet
Python Important Question Paper
4 pages
Software Installation - Ubuntu
No ratings yet
Software Installation - Ubuntu
5 pages
Python 1695083170
No ratings yet
Python 1695083170
56 pages
PYTHON Lab Manual 2025
No ratings yet
PYTHON Lab Manual 2025
39 pages
Python Training for ECE Students
No ratings yet
Python Training for ECE Students
14 pages
M.Sc. Cyber Security Syllabus 2023-24
No ratings yet
M.Sc. Cyber Security Syllabus 2023-24
66 pages

Covid-19 Vaccination Analysis

Uploaded by

Covid-19 Vaccination Analysis

Uploaded by

A Project Report

Covid-19 Vaccine Analysis

requirement for the award of the degree of

MASTER OF COMPUTER APPLICATION

Covid-19 Vaccine Analysis

Under the guidance of

SCHOOL OF COMPUTER APPLICATION AND TECHNOLOGY

GALGOTIAS UNIVERSITY, GREATER NOIDA

(Master of Computer Application) submitted in the School of Computer Application and

Department of Computer Science and Engineering/School of Computer Application and

Technology , Galgotias University, Greater Noida.

Mohd Nadeem (23SCSE2030560)

Furthermore, Python enables researchers to investigate vaccine safety by analyzing

In conclusion, Python serves as a versatile tool for conducting comprehensive analyses

import matplotlib.pyplot as plt

3991 rle ou "Cha - 5 33 53 ue 2 are it 6 2 e

count 8.082000e+03 8082.000000 8.082000e+03 8082.000000 8082.000000

mean 3.550042e+04 1192.207127 1.513661e+04 1.472037 8.690671

std 2.914947e+05 2982.597309 4.882913e+04 12.922145 59.121769

min 0.000000e+00 0.000000 0.000000e+00 0.000000 0.000000

25% 1.100000e+02 165.000000 4.172500e+02 0.000000 0.000000

50% 4.805000e+02 465.000000 2.329000e+03 0.000000 1.000000

75% 2.089750e+03 1249.500000 1.124975e+04 1.000000 4.000000

max 1.371493e+07 103226.000000 1.166459e+06 678.000000 2315.000000

df["num of words in text"] = df["text"].apply(lambda x: len(x))

# Creating autocpt arguments

# Creating the plot

plt.setp(autotexts, size=9, weight="bold")

# No retweets were made.

The Preprocessing steps taken are:

#remove twitter handlers

# Replace all non alphabets.

#Remove all single characters

# Substituting multiple spaces with single space

# Replace 3 or more consecutive letters by 2 letter.

4. Most Prevalent Words in Tweet

fig = plt.figure(1, figsize=(10,10))

5. Apply VADER Sentiment to the tweets

merged_data.plot(x='date', y=['Sentiment Negative', 'Sentiment Positive'],

# Get Wordclouds For Positive and Negative Tweets

6. Time Series Analysis On Sentiments

fig = make_subplots(rows=2, cols=1,shared_xaxes=True,subplot_titles=('Daily

fig, (ax1, ax2, ax3) = plt.subplots(3,1, figsize=(10, 16))

f, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 10))

7.Stop Word Removal and Lemmatization

# Make Sure your intenrt is on :)

# Encoding 0 for negative, 1 for Neutral, 2 for Positive

8. Splitting the Data

vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)

10. Model Building

from sklearn.metrics import classification_report

# Predict values for Test dataset

# Print the evaluation metrics for the dataset.

# Compute and plot the Confusion matrix

10.1 Logistic Regression

accuracy 0.80 405

10.2 Naive Bayes

0 0.91 0.14 0.24 72

accuracy 0.72 405

0 0.86 0.61 0.72 72

accuracy 0.83 405

0 0.92 0.47 0.62 72

accuracy 0.80 405

You might also like