Project ML

This document discusses applying machine learning algorithms such as Naïve Bayes, KNN, bagging, and boosting to predict voter mindset from a survey dataset. It covers data preprocessing steps (handling null values, encoding categorical variables, splitting the data into train and test sets), then applies logistic regression, LDA, KNN, and Naïve Bayes along with bagging and boosting, and compares their performance using accuracy, confusion matrices, and ROC curves to find the best model for the voter-mindset prediction task.


2021

Machine Learning – Naïve Bayes, KNN, Bagging and Boosting on Voter Mindset Prediction for an Election

Anil Ulchala
12/4/2021

1 Contents
1. Action Required
2. Problem Statement
   2.1 Read the dataset. Do the descriptive statistics and the null value condition check. Write an inference on it.
   2.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
   2.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
   2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
   2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
   2.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
   2.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, ROC curve, and ROC_AUC score for each model. Final Model: Compare the models and write an inference on which model is best/optimized.
   2.8 Based on these predictions, what are the insights?
3. Problem Statement
   3.1 Find the number of characters, words, and sentences for the mentioned documents.
   3.2 Remove all the stopwords from all three speeches.
   3.3 Which word occurs the most number of times in each president's inaugural address? Mention the top three words (after removing the stopwords).
   3.4 Plot the word cloud of each of the three speeches (after removing the stopwords).

List of Tables:
Table 1 – Comparison between various models

This Business Report is generated based on the Data set extracted from reliable sources

List of Figures:
Figure 1 - Loading Dataset into Jupyter notebook
Figure 2 – Dropping unwanted columns/variables
Figure 3 – Shape and Data type information of the data set
Figure 4 – Checking for Null Values
Figure 5 – Proportions checking for Categorical Variables
Figure 6 – Descriptive Information
Figure 7 – Distribution of the variables
Figure 8 – Count plot for Vote Variable
Figure 9 – Strip plot Age vs Vote
Figure 10 – Strip plot Vote vs Economic.cond.national
Figure 11 – Economic conditional household vs Age
Figure 12 – Strip plot against Vote feature
Figure 13 – Heat map of the Variables
Figure 14 – Pair Plot for the variables
Figure 15 – Outlier Treatment
Figure 16 – Box plot after outlier Treatment
Figure 17 – Converting Target Variable to Integer type
Figure 18 – Dataset info after encoding the variables
Figure 19 – Test_Train_Split of the data
Figure 20 – Logistic Regression Model building
Figure 21 – Model for LDA
Figure 22 – Model for Gaussian NB
Figure 23 – Scaling the Data Set for model building
Figure 24 – Model building
Figure 25 – Model building (K = 7)
Figure 26 – Calculating Misclassification error for various values of K
Figure 27 – Plotting Misclassification error for various values of K
Figure 28 – Model building for K = 19
Figure 29 – Model using Random Forest
Figure 30 – Model using Random Forest and applying Bagging
Figure 31 – Model using Random Forest and applying Boosting
Figure 32 – Confusion Matrix of Train Data
Figure 33 – Confusion Matrix of Test Data
Figure 34 – AUC and RoC of Train Data
Figure 35 – AUC and RoC of Test Data
Figure 36 – Confusion Matrix of Train and Test Data
Figure 37 – Classification Report of Train and Test Data
Figure 38 – AUC and RoC of Train and Test Data
Figure 39 – Confusion Matrix of Train Data
Figure 40 – Confusion Matrix of Test Data
Figure 41 – AUC and RoC of Train Data
Figure 42 – AUC and RoC of Test Data
Figure 43 – Confusion Matrix of Train Data
Figure 44 – Confusion Matrix of Test Data
Figure 45 – AUC and RoC of Train and Test Data
Figure 46 – Confusion Matrix of Train Data
Figure 47 – Confusion Matrix of Test Data
Figure 48 – AUC and RoC of Train and Test Data
Figure 49 – Confusion Matrix of Train Data
Figure 50 – Confusion Matrix of Test Data
Figure 51 – AUC and RoC of Train and Test Data


Figure 52 - Loading Dataset into Jupyter notebook
Figure 53 – Length of data
Figure 54 – No of words in each speech
Figure 55 – User defined function to count no of sentences in a Text file
Figure 56 – No of sentences in each presidential speech
Figure 57 – Importing stopwords to python
Figure 58 – Code to remove punctuation and unnecessary words
Figure 59 – Removing stopwords from the speeches
Figure 60 – Converting tokens into Lower case
Figure 61 – Lemmatization with POS tag
Figure 62 – Word Frequency calculation for Roosevelt Speech
Figure 63 – Word Frequency calculation for Kennedy Speech
Figure 64 – Word Frequency calculation for Nixon Speech
Figure 65 – Word cloud for Roosevelt speech
Figure 66 – Word cloud for Kennedy speech
Figure 67 – Word cloud for Nixon speech


Business Case
1. Action Required:
The purpose of this exercise is to explore the dataset and perform exploratory data analysis.
Machine learning models are built on the data set using Naïve Bayes and KNN techniques, among
others, and then compared. Insights and recommendations are provided based on the output. In
addition, NLP techniques are applied to another data set containing speeches of former US
presidents, and a word cloud is created.

2. Problem Statement:
You are hired by CNBE, one of the leading news channels, which wants to analyze the recent
elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to
predict which party a voter will vote for on the basis of the given information, in order to create an
exit poll that will help predict the overall win and the seats covered by a particular party.

2.1 Read the dataset. Do the descriptive statistics and the null value
condition check. Write an inference on it.
Ans: To start the data analysis, we first need to load the data. The figure below shows the
loading step.

Figure 1 - Loading Dataset into Jupyter notebook

 Successfully loaded the data into Python.
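The loading and checking steps of Figures 1–6 can be sketched as follows. This is an illustrative sketch only: a small inline DataFrame stands in for the real 1525-row survey file, so the printed statistics differ from the report's.

```python
import pandas as pd

# Tiny synthetic stand-in for the survey data (hypothetical values, real file
# has 1525 rows and 9 columns after dropping the unwanted index column).
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 24],
    "economic.cond.national": [3, 4, 4, 2],
    "gender": ["female", "male", "male", "female"],
})

print(df.shape)                   # dimensions: (rows, columns)
print(df.isnull().sum())          # null-value condition check (all zeros here)
print(df.describe())              # descriptive statistics for numeric columns
print(df["vote"].value_counts())  # class proportions of the target variable
```

The same four calls (`shape`, `isnull().sum()`, `describe()`, `value_counts()`) reproduce the checks shown in the screenshots.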

2.1.1. Dropping the unnecessary columns from the dataset

Figure 2 – Dropping unwanted columns/variables


2.1.2. The data set is now ready for Exploratory Data Analysis. As per the output, the dimension or
shape of the data set is (1525, 9); therefore the data set has 1525 rows and 9 columns. Refer
to the figure below.

Figure 3 – Shape and Data type information of the data set


As per the figure, there are 2 variables of 'object' type and the remaining variables are of 'int'
type. There are also no duplicate values in the data set.

2.1.3. There are no null values in the data set. Refer to the figure below.

Figure 4 – Checking for Null Values

2.1.4. From the problem description it is clear that 'vote' is the dependent/target variable and the
remaining variables are independent variables. Going forward, the report uses the
terminology of dependent and independent variables to refer to the columns.
Next, proportions for the categorical variables are calculated.


Figure 5 – Proportions checking for Categorical Variables


From the figure we can infer that 1063 respondents cast their vote for Labour and 462 for
Conservative. The data set is reasonably balanced at roughly 70:30 proportions, which is suitable for
model building. Females outnumber males, with counts of 812 and 713 respectively. In addition, we
can conclude that the data set has no bad values.

2.1.5. Descriptive Stats:

Figure 6 – Descriptive Information


 As observed, the statistics for the target variable show that Labour received the higher
number of votes.
 Age ranges from 24 to 93 with a mean of 54 and a standard deviation of 16. The mean and
median are almost equal, suggesting an approximately normal distribution.
 economic.cond.national ranges from 1 to 5 with a mean of 3.24, where 1 means bad and 5
means good conditions. The mean and median are almost equal, again suggesting an
approximately normal distribution.
 economic.cond.household, Blair, and Hague range from 1 to 5 with means of 3.14, 3.33, and
2.7 respectively, where 1 means bad and 5 means good. Mean and median are almost equal
for each, suggesting approximately normal distributions.
 The 'Europe' feature, describing Eurosceptic sentiment, has a mean of 6 on a scale of 1 to
11, indicating that most voters are roughly neutral on this feature.
 Female vote casters outnumber male vote casters.


2.1.6. Distribution and boxplot of the variables


Inferences based on the boxplots and dist plots:
1. The 'Age' variable is normally distributed.
2. The remaining features (economic.cond.national, economic.cond.household, Blair,
Hague and political knowledge) are ordinal, not continuous, variables, so we see
multiple spikes in their distributions.

Figure 7 – Distribution of the variables

2.2 Perform Univariate and Bivariate Analysis. Do exploratory data


analysis. Check for Outliers.

2.2.1. Univariate Analysis

Figure 8 – Count plot for Vote Variable


 The majority of voters support Labour, so the vote bank lies largely with the Labour section.
2.2.2. Bivariate Analysis
1. Strip plot between Vote and Age

Figure 9 – Strip plot Age vs Vote


 Based on the strip plot, age shows only a weak relationship with the vote variable.
 Most of the voters in the Labour section are between 40 and 50 years of age.

2. Strip plot between Vote and Economic.cond.national

Figure 10 – Strip plot Vote vs Economic.cond.national

 Based on the data, there is no clear correlation.

3. Strip plot between Economic conditional household and Age


Figure 11 – Economic conditional household vs Age


 Based on the data, there is no clear correlation.

4. Strip plots of the remaining features against the target variable

Figure 12 – Strip plot against Vote feature


5. Correlation between the variables:

Figure 13 – Heat map of the Variables

 Correlation between the variables is very weak as per the heat map.
 Hague and economic.cond.national have a negative relation.
 Europe and economic.cond.national also have a negative relation.
 Age and political knowledge also have a negative relation, which is quite the opposite of
what one would expect.
6. Pairplot between the variables:

Figure 14 – Pair Plot for the variables


 From the pair plot, we can infer that there is no strong relation between any pair of
variables.
 Hence the dataset is not affected by multi-collinearity and is suitable for modelling.

2.2.3. Treating Outliers


Before label encoding and model building, we observed that some outliers exist in several
variables. We now treat the outliers in these variables.

Figure 15 – Outlier Treatment

Figure 16 –Box plot after outlier Treatment

2.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30)
Ans: To create a model for analysis we need to convert the categorical variables, so label
encoding is applied to them and the encoded data set is used for model building.

2.3.1. Data Encoding and Model building for Machine Learning Analysis

2.3.1.1. Converting Target Variable to Integer type using Categorical Function


 Here ‘0’ represents ‘Conservative’ and ‘1’ represents ‘Labour’
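The encoding step can be sketched with pandas' categorical codes. With alphabetical category order, 'Conservative' maps to 0 and 'Labour' to 1, matching the mapping above (a three-value sample stands in for the real target column).

```python
import pandas as pd

# Encode the target with pandas' Categorical codes; categories are ordered
# alphabetically, so 'Conservative' -> 0 and 'Labour' -> 1.
vote = pd.Series(["Labour", "Conservative", "Labour"])
vote_encoded = pd.Categorical(vote).codes
print(list(vote_encoded))  # [1, 0, 1]
```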


Figure 17 – Converting Target Variable to Integer type

2.3.1.2. Data information after conversion.

Figure 18 – Dataset info after encoding the variables

2.4 Apply Logistic Regression and LDA (linear discriminant analysis).


2.4.1. Model Building for Machine Learning Analysis – Split of data

Figure 19 – Test_Train_Split of the data

 The data is not scaled here, since the independent variables are on broadly similar
ranges; scaling is therefore skipped for all models except KNN.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.
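The split can be sketched as follows, with small synthetic arrays standing in for the encoded features and target; `random_state` fixes the shuffle so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the encoded feature matrix X and target y.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 70:30 train/test split, as used throughout the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```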

2.4.2. Logistic Regression model generation using the Grid search method


Figure 20 – Logistic Regression Model building


 Penalty: 'elasticnet', 'l2', 'none'
 Solver: 'newton-cg', 'saga'
 Tol: 0.001, 0.00001
Finally, based on the GridSearchCV results, the best estimator is captured as best_model and
predictions are made with this best_model.
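The grid search described above can be sketched as follows, on synthetic data. Only the solver and tol parts of the report's grid are reproduced here; the 'elasticnet' and 'none' penalties require specific solver and scikit-learn version combinations, so they are omitted from this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the encoded election survey.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Grid over solver and tolerance, mirroring part of the report's search space.
grid = {"solver": ["newton-cg", "saga"], "tol": [0.001, 0.00001]}
gs = GridSearchCV(LogisticRegression(max_iter=5000), grid, cv=3, scoring="accuracy")
gs.fit(X, y)

best_model = gs.best_estimator_  # predictions are then made with best_model
print(gs.best_params_)
```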

2.4.3. Creating model for LDA

Figure 21 – Model for LDA

 The data is not scaled here, since the independent variables are on broadly similar
ranges; scaling is therefore skipped for all models except KNN.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.


2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

2.5.1. Creating model using Gaussian NB

Figure 22 – Model for Gaussian NB

 The data is not scaled here, since the independent variables are on broadly similar
ranges; scaling is therefore skipped for all models except KNN.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.

2.5.2. Creating model using KNN


2.5.2.1 Scaling the data set for KNN model building

Figure 23 – Scaling the Data Set for model building


 Scaling is applied here because the KNN model is based on Euclidean distance
measurement, so a z-score transformation is applied to the train data set.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.
2.5.2.2 Building Model:

Figure 24 – Model building


2.5.2.3 Building Model (K value =7):


Figure 25 – Model building

 The K value is taken as 7.
 The model score is 71.41%.
2.5.2.4 Calculating Misclassification error of KNN model for a range from 1 to 19 at steps of 2

Figure 26 – Calculating Misclassification error for various values of K


 It is observed that the misclassification error is lowest at K = 19.
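The sweep over odd K values from 1 to 19 can be sketched as follows, on synthetic z-scored data (so the best K found here need not match the report's K = 19).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the survey; z-scored because KNN relies on
# Euclidean distances.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X = (X - X.mean(axis=0)) / X.std(axis=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Misclassification error for each odd K from 1 to 19.
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1 - knn.score(X_te, y_te)

best_k = min(errors, key=errors.get)
print(best_k, round(errors[best_k], 3))
```

Plotting `errors` against K gives the elbow chart shown in Figure 27.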

2.5.2.5 Plotting Misclassification Error for range of K values

Figure 27 – Plotting Misclassification error for various values of K


2.5.2.6 Building model for K=19

Figure 28 – Model building for K =19


 The model score is observed as 69.8%.

2.6 Model Tuning, Bagging (Random Forest should be applied for


Bagging), and Boosting.

2.6.1. Creating model using Random Forest Technique

Figure 29 – Model using Random Forest

 The model is created using an ensemble technique with 100 estimators.
 The train score is observed as 99.9%.
 The test score is observed as 84%, which suggests some overfitting. We will therefore
tune the model using bagging and boosting.

2.6.2. Creating model using Random Forest model and applying Bagging

Figure 30 – Model using Random Forest and applying Bagging


 The model is created using an ensemble technique, with base_estimator set to
RF_model.
 The train score is observed as 97.9%.
 The test score is observed as 84.27%, which still suggests some overfitting, so we also
try model tuning with boosting.
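The bagging step can be sketched as follows, on synthetic data, with a Random Forest as the base estimator as described above (the first positional argument of `BaggingClassifier` is the base estimator; its keyword name is `base_estimator` in older scikit-learn and `estimator` in newer versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Bagging over a Random Forest base estimator, as in the report.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
bag = BaggingClassifier(rf, n_estimators=10, random_state=1).fit(X_tr, y_tr)
print(round(bag.score(X_tr, y_tr), 3), round(bag.score(X_te, y_te), 3))
```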

2.6.3. Creating model using Random Forest model and applying Boosting

Figure 31 – Model using Random Forest and applying Boosting

 The model is created using an ensemble technique; gradient boosting is used in this
case.
 The train score is observed as 88.75%.
 The test score is observed as 83.84%. The train and test scores are in line with each
other, so this can be considered the best model.
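The boosting step can be sketched as follows, on synthetic data. Note that `GradientBoostingClassifier` grows its own ensemble of shallow trees; unlike the bagging step, it does not take a Random Forest as a base estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Gradient boosting with default hyperparameters.
gb = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
print(round(gb.score(X_tr, y_tr), 3), round(gb.score(X_te, y_te), 3))
```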

2.7 Performance Metrics: Check the performance of Predictions on Train


and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare the models and
write inference which model is best/optimized.

2.7.1. Performance Metrics of Logistic Regression

2.7.1.1. Confusion Matrix, accuracy and other metrics of Train data


Figure 32 – Confusion Matrix of Train Data


 Accuracy on the train data is 83.03%.
 Recall on the train data for class '1' is 0.91.
2.7.1.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 33 – Confusion Matrix of Test Data

 Accuracy on the test data is 84.93%.
 Recall on the test data for class '1' is 0.92.
 False positives come down to 45 on the test data, compared with 111 on the train data,
which is a good sign.


2.7.1.3. Accuracy Score AUC_score and RoC Curve for the train data

Figure 34 – AUC and RoC of Train Data


 AUC score of the train data is 87.72%

2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data

Figure 35 – AUC and RoC of Test Data

 AUC score of the test data is 87.72%
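The metrics used throughout section 2.7 can be sketched as follows; synthetic data and a plain logistic regression stand in for the report's models, so the printed numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a simple model standing in for the report's fitted models.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = model.predict(X_te)
prob = model.predict_proba(X_te)[:, 1]   # class-1 probabilities for the ROC curve
print(confusion_matrix(y_te, pred))      # rows: actual, columns: predicted
print(round(accuracy_score(y_te, pred), 3))
print(round(roc_auc_score(y_te, prob), 3))
```

The same three calls, applied to the train and test sets of each model, produce every number in Table 1.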


2.7.2. Performance Metrics of LDA

2.7.2.1. Confusion Matrix of Train and Test data

Figure 36 – Confusion Matrix of Train and Test Data

 From the confusion matrices, the number of misclassified class-'1' cases is 105 for the
train data and 41 for the test data.

2.7.2.2. Classification report of Train and Test data

Figure 37 – Classification Report of Train and Test Data

 Accuracy of the Train data is 82%


 Accuracy of the Test Data is 84%

2.7.2.3. AUC_score and RoC Curve for the train and test data


Figure 38 – AUC and RoC of Train and Test Data

 AUC score of the train data is 87.7%


 AUC score of the test data is 91.6%

2.7.3. Performance Metrics of Naïve Bayes

2.7.3.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 39 – Confusion Matrix of Train Data


 Accuracy on the train data is 82.66%.
 Recall on the train data for class '1' is 0.88.
2.7.3.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 40 – Confusion Matrix of Test Data

 Accuracy on the test data is 84.71%.
 Recall on the test data for class '1' is 0.90.
 False positives come down to 37 on the test data, compared with 97 on the train data,
which is a good sign.

2.7.3.3. Accuracy Score AUC_score and RoC Curve for the train data

Figure 41 – AUC and RoC of Train Data


 AUC score of the train data is 87.52%


2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data

Figure 42 – AUC and RoC of Test Data

 AUC score of the test data is 91.02%

2.7.4. Performance Metrics of KNN Model with K value as 19

2.7.4.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 43 – Confusion Matrix of Train Data

 Accuracy on the train data is 69.82%.
 Recall on the train data for class '1' is 0.99.


2.7.4.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 44 – Confusion Matrix of Test Data

 Accuracy on the test data is 67.46%.
 Recall on the test data for class '1' is 0.96.
 False positives come down to 137 on the test data, compared with 311 on the train
data.

2.7.4.3. AUC_score and RoC Curve for the train and test data

Figure 45 – AUC and RoC of Train and Test Data


 AUC score of the train data is 61.30%
 AUC score of the test data is 46.40%


2.7.5. Performance Metrics of Random Forest with Bagging Technique

2.7.5.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 46 – Confusion Matrix of Train Data

 Accuracy on the train data is 97.18%.
 Recall on the train data for class '1' is 0.99.

2.7.5.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 47 – Confusion Matrix of Test Data

 Accuracy on the test data is 84.27%.
 Recall on the test data for class '1' is 0.92.


2.7.5.3. AUC_score and RoC Curve for the train and test data

Figure 48 – AUC and RoC of Train and Test Data

 AUC score of the train data is 99.70%


 AUC score of the test data is 91.80%

2.7.6. Performance Metrics of Random Forest with Boosting Technique

2.7.6.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 49 – Confusion Matrix of Train Data

 Here the gradient boosting technique is used.
 Accuracy on the train data is 88.75%.
 Recall on the train data for class '1' is 0.94.


2.7.6.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 50 – Confusion Matrix of Test Data


 Accuracy on the test data is 83.84%.
 Recall on the test data for class '1' is 0.92.

2.7.6.3. AUC_score and RoC Curve for the train and test data

Figure 51 – AUC and RoC of Train and Test Data

 AUC score of the train data is 94.80%


 AUC score of the test data is 90.80%


2.7.7. Comparison between Models

                  LR Model      LDA Model     NB Model      KNN Model     Bagging       Boosting
                  Train  Test   Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
Accuracy          83.03  84.93  82.75  84.71  82.66  84.71  69.82  67.46  97.18  84.27  88.75  83.84
Precision (1)     86     87     86     88     87     89     70     69     97     86     90     86
AUC               87.72  87.72  87.7   91.6   87.52  91.02  61.3   46.4   99.7   91.8   94.8   90.8
Recall (1)        91     92     89     91     88     90     99     96     99     92     94     92
F1-Score (1)      88     90     88     89     88     89     82     80     98     89     92     89

Table 1 – Comparison between various models


 On comparing the models, the Random Forest technique with Boosting appears the most
consistent across the various model evaluation parameters
 Train and test metrics for the Boosting model are close to each other, indicating
consistent generalisation
 The Boosting model's accuracy ranks second, with 88.75% on the train data and
83.84% on the test data. Bagging has the highest train accuracy (97.18%), but its
test accuracy drops by almost 13 percentage points, indicating overfitting
 Overall, the Random Forest technique tuned with Boosting provides the
better model
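The model-selection reasoning above can be sketched in a few lines. This is an illustrative snippet, not part of the report's code: it ranks the models from Table 1 by the gap between train and test accuracy, using the accuracy figures reported above (a smaller gap suggests more consistent generalisation).

```python
# Accuracies (train, test) as reported in Table 1
accuracy = {
    "LR":       (83.03, 84.93),
    "LDA":      (82.75, 84.71),
    "NB":       (82.66, 84.71),
    "KNN":      (69.82, 67.46),
    "Bagging":  (97.18, 84.27),
    "Boosting": (88.75, 83.84),
}

# Absolute train/test accuracy gap per model
gaps = {name: abs(train - test) for name, (train, test) in accuracy.items()}

print(max(gaps, key=gaps.get))  # → Bagging (widest gap, i.e. most overfit)
```

This quantifies the observation above: Bagging's near-13-point gap flags overfitting, while Boosting's gap of about 5 points keeps it consistent.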

2.8 Based on these predictions, what are the insights? (5 marks).

On the whole, based on the model outcomes for this data set, the following insights are observed:
 Female voters outnumber male voters, so parties should work to attract male voters
through suitable means.
 Most Labour-leaning voters are between 40 and 50 years of age, and the Labour party
appears to attract this group, possibly through monetary benefits; a voter in this age range
therefore has a high chance of voting for the Labour party.
 The strip plot shows a high density of Labour voters among those who rate the national
economic condition highly, so such voters are very likely to vote for the
Labour party.
 The same holds for voters who rate their household economic condition highly.
 The Random Forest algorithm tuned with the Gradient Boosting technique provides the best model.
The model gives consistent results on both the training and test data.

3. Problem Statement:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
 President Franklin D. Roosevelt in 1941
 President John F. Kennedy in 1961
 President Richard Nixon in 1973.

3.1 Find the number of characters, words, and sentences for the
mentioned documents.

Ans: To start the counting, we first need to load the data. The figure below shows the
data-loading step.


Figure 52 - Loading Dataset into Jupyter notebook


 The nltk package is imported and the inaugural speech corpus is downloaded
 The required presidential speeches are extracted and assigned to their respective
variables.
3.1.1. Calculate length of data
Ans: To calculate the length of the data, the code below is executed

Figure 53 – Length of data


 Number of characters in Roosevelt file is 7571.
 Number of characters in Kennedy file is 7618.
 Number of characters in Nixon file is 9991.
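The character count shown above boils down to calling len() on the raw speech string, which counts every character including spaces and newlines. A minimal self-contained sketch, using a short stand-in snippet in place of the full inaugural.raw('1941-Roosevelt.txt') text:

```python
# Stand-in for the full raw speech text loaded from the inaugural corpus
roosevelt_raw = "We the People of the United States"

# len() counts every character, spaces included
print(len(roosevelt_raw))  # → 34
```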

3.1.2. Calculate No of words


Ans: To calculate the number of words in the data, the code below is executed. The split() function is used
to extract words.

Figure 54 – No of words in each speech


 Number of words in Roosevelt file is 1360.
 Number of words in Kennedy file is 1390.
 Number of words in Nixon file is 1819.
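The word count above relies on str.split(): called with no arguments it splits on any run of whitespace, so the length of the resulting list is the word count. A minimal sketch on a stand-in sentence:

```python
# Stand-in for a loaded speech string
speech = "Ask not what your country can do for you"

words = speech.split()   # split on any whitespace
print(len(words))        # → 9
```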


3.1.3. Calculate No of Sentences


Ans: To calculate the number of sentences in the data, the code below is executed. The sent_tokenize
command is used to count the sentences.
3.1.3.1 A “sentence_count” function is defined to count the number of sentences in a particular speech

Figure 55 – User defined function to count no of sentences in a Text file


3.1.3.2 Code to calculate the number of sentences in each speech

Figure 56 – No of sentences in each presidential speech


 Number of sentences in Roosevelt file is 68.
 Number of sentences in Kennedy file is 52.
 Number of sentences in Nixon file is 68.
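The report's sentence_count() helper wraps nltk.sent_tokenize. As a hedged, self-contained stand-in, the sketch below approximates it with a regex split after '.', '!' or '?' (unlike the punkt tokenizer, it will miscount abbreviations such as "Mr."):

```python
import re

def sentence_count(text):
    # Split wherever sentence-ending punctuation is followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(sentences)

speech = "We have nothing to fear but fear itself. Is that not so? It is."
print(sentence_count(speech))  # → 3
```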

3.2 Remove all the stopwords from all three speeches.

3.2.1. Import predefined stopwords from the nltk


Ans: To import the predefined stopwords from nltk, the code below is executed

Figure 57 – Importing stopwords to python


3.2.2. Step to remove all punctuation
Ans: In this step, punctuation and unnecessary words are removed from the text file.


Figure 58 – Code to remove punctuation and unnecessary words


 Punctuation marks in the defined list and additional unwanted words are removed

3.2.3. Removing stopwords from all the files


Ans: To remove stopwords, the code below is executed

Figure 59 – Removing stopwords from the speeches


 The output shows that the punctuation is removed; the cleaned text is stored in
variables with the suffix ‘_clean’
 We can see the comma is removed from the token list
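The filtering step above can be sketched in a few self-contained lines. The report pulls its stopword list from nltk.corpus.stopwords; the tiny hardcoded set below is a stand-in so the example runs on its own:

```python
import string

# Stand-in for the nltk English stopword list (assumption: a small subset)
stop_words = {"we", "the", "of", "and", "to", "a", "in", "is"}

tokens = ["we", "observe", "the", "state", "of", "the", "union", ","]

clean = [t for t in tokens
         if t not in stop_words             # drop stopwords
         and t not in string.punctuation]   # drop punctuation tokens

print(clean)  # → ['observe', 'state', 'union']
```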

3.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)

3.3.1. Step to convert tokens into lower case.


Ans: To find the most frequent words, all tokens are first converted to lower case. The algorithm
treats lower-case and upper-case words as different, which would distort the count/
frequency. Hence the built-in lower() function is used to convert all tokens to lower case and
keep all three speeches consistent. The lines of code below are executed


Figure 60 – Converting tokens into Lower case


 We can observe that words like ‘Mr’ are converted to ‘mr’ after the above
execution.
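The lower-casing step is a one-line normalisation, sketched below so that tokens like 'Mr'/'mr' or 'Nation'/'nation' are counted as the same word:

```python
# Stand-in token list before normalisation
tokens = ["Mr", "Vice", "President", "Nation", "nation"]

lower_tokens = [t.lower() for t in tokens]
print(lower_tokens)  # → ['mr', 'vice', 'president', 'nation', 'nation']
```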
3.3.2. Step to Lemmatization with POS
Ans: In this step lemmatization is performed to convert all tokens to their root words. For this,
WordNetLemmatizer is imported into the Python notebook.

Figure 61 – Lemmatization with POS tag


 In the above step, words are converted to their root forms based on their
parts of speech.
 Words that are not categorized under any part of speech are retained as they are.

3.3.3. Top three word frequency calculation for each speech

3.3.3.1: Word frequency is calculated through below code in Roosevelt Speech.


Figure 62 – Word Frequency calculation for Roosevelt Speech


 Top three words in Roosevelt speech are “Nation”, “It” and “Life”
3.3.3.2: Word frequency is calculated through below code in Kennedy Speech.

Figure 63 – Word Frequency calculation for Kennedy Speech


 Top three words in Kennedy speech are “World”, “Let” and “Side”
3.3.3.3: Word frequency is calculated through below code in Nixon Speech.

Figure 64 – Word Frequency calculation for Nixon Speech


 Top three words in Nixon speech are “America”, “Peace” and “World”
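The frequency calculation behind all three figures can be done with collections.Counter. A minimal sketch, using a stand-in token list in place of the cleaned, lemmatized speech tokens:

```python
from collections import Counter

# Stand-in for the cleaned token list of a speech
tokens = ["world", "peace", "america", "world", "peace", "world",
          "america", "peace", "world"]

freq = Counter(tokens)
print(freq.most_common(3))  # → [('world', 4), ('peace', 3), ('america', 2)]
```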


3.4 Plot the word cloud for each of the three speeches. (after removing
the stopwords)

Ans: A word cloud is a visual representation of the words in a given text file. The size of
each word indicates its frequency: the bigger the word, the higher its frequency.
3.4.1. Now we create the word cloud for the Roosevelt speech. WordCloud is imported from the
wordcloud package to generate it.

Figure 65 – Word cloud for Roosevelt speech


3.4.2. Now we create the word cloud for the Kennedy speech. WordCloud is imported from the
wordcloud package to generate it.

Figure 66 – Word cloud for Kennedy speech


3.4.3. Now we create the word cloud for the Nixon speech. WordCloud is imported from the
wordcloud package to generate it.

Figure 67 – Word cloud for Nixon speech
