Project ML

This document discusses applying machine learning algorithms such as Naïve Bayes, KNN, bagging, and boosting to predict voter mindset from a survey dataset. It covers data preprocessing steps (handling null values, encoding categorical variables, splitting the data into train and test sets), then applies logistic regression, LDA, KNN, and Naïve Bayes along with bagging and boosting, and compares their performance using accuracy, confusion matrices, and ROC curves to find the best model for the voter-mindset prediction task.


2021

Machine Learning – Naïve Bayes, KNN, Bagging and Boosting on Voter Mindset Prediction for an Election

Anil Ulchala
12/4/2021

1 Contents
1. Action Required
2. Problem Statement
   2.1 Read the dataset. Do the descriptive statistics and the null value condition check. Write an inference on it.
   2.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
   2.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
   2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
   2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
   2.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
   2.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, ROC curve, and ROC_AUC score for each model. Final Model: Compare the models and write an inference on which model is best/optimized.
   2.8 Based on these predictions, what are the insights?
3. Problem Statement
   3.1 Find the number of characters, words, and sentences for the mentioned documents.
   3.2 Remove all the stopwords from all three speeches.
   3.3 Which word occurs the most number of times in each president's inaugural address? Mention the top three words (after removing the stopwords).
   3.4 Plot the word cloud of each of the three speeches (after removing the stopwords).

List of Tables:
Table 1 – Comparison between various models

This Business Report is generated based on the Data set extracted from reliable sources

List of Figures:
Figure 1 - Loading Dataset into Jupyter notebook
Figure 2 – Dropping unwanted columns/variables
Figure 3 – Shape and Data type information of the data set
Figure 4 – Checking for Null Values
Figure 5 – Proportions checking for Categorical Variables
Figure 6 – Descriptive Information
Figure 7 – Distribution of the variables
Figure 8 – Count plot for Vote Variable
Figure 9 – Strip plot Age vs Vote
Figure 10 – Strip plot Vote vs Economic.cond.national
Figure 11 – Economic conditional household vs Age
Figure 12 – Strip plot against Vote feature
Figure 13 – Heat map of the Variables
Figure 14 – Pair Plot for the variables
Figure 15 – Outlier Treatment
Figure 16 – Box plot after outlier Treatment
Figure 17 – Converting Target Variable to Integer type
Figure 18 – Dataset info after encoding the variables
Figure 19 – Test_Train_Split of the data
Figure 20 – Logistic Regression Model building
Figure 21 – Model for LDA
Figure 22 – Model for Gaussian NB
Figure 23 – Scaling the Data Set for model building
Figure 24 – Model building
Figure 25 – Model building (K = 7)
Figure 26 – Calculating Misclassification error for various values of K
Figure 27 – Plotting Misclassification error for various values of K
Figure 28 – Model building for K = 19
Figure 29 – Model using Random Forest
Figure 30 – Model using Random Forest and applying Bagging
Figure 31 – Model using Random Forest and applying Boosting
Figure 32 – Confusion Matrix of Train Data
Figure 33 – Confusion Matrix of Test Data
Figure 34 – AUC and RoC of Train Data
Figure 35 – AUC and RoC of Test Data
Figure 36 – Confusion Matrix of Train and Test Data
Figure 37 – Classification Report of Train and Test Data
Figure 38 – AUC and RoC of Train and Test Data
Figure 39 – Confusion Matrix of Train Data
Figure 40 – Confusion Matrix of Test Data
Figure 41 – AUC and RoC of Train Data
Figure 42 – AUC and RoC of Test Data
Figure 43 – Confusion Matrix of Train Data
Figure 44 – Confusion Matrix of Test Data
Figure 45 – AUC and RoC of Train and Test Data
Figure 46 – Confusion Matrix of Train Data
Figure 47 – Confusion Matrix of Test Data
Figure 48 – AUC and RoC of Train and Test Data
Figure 49 – Confusion Matrix of Train Data
Figure 50 – Confusion Matrix of Test Data
Figure 51 – AUC and RoC of Train and Test Data


Figure 52 - Loading Dataset into Jupyter notebook
Figure 53 – Length of data
Figure 54 – No of words in each speech
Figure 55 – User defined function to count no of sentences in a Text file
Figure 56 – No of sentences in each presidential speech
Figure 57 – Importing stopwords to python
Figure 58 – Code to remove punctuation and unnecessary words
Figure 59 – Removing stopwords from the speeches
Figure 60 – Converting tokens into Lower case
Figure 61 – Lemmatization with POS tag
Figure 62 – Word Frequency calculation for Roosevelt Speech
Figure 63 – Word Frequency calculation for Kennedy Speech
Figure 64 – Word Frequency calculation for Nixon Speech
Figure 65 – Word cloud for Roosevelt speech
Figure 66 – Word cloud for Kennedy speech
Figure 67 – Word cloud for Nixon speech


Business Case
1. Action Required:
The purpose of this exercise is to explore the dataset and perform exploratory data analysis.
Machine learning models are built on the data set using Naïve Bayes and KNN techniques, among
others, and then compared. Insights and recommendations are provided based on the output. In
addition, NLP techniques are applied to another data set containing speeches of former US
presidents, and a word cloud is created.

2. Problem Statement:
You are hired by CNBE, one of the leading news channels, which wants to analyze the recent
elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to
predict which party a voter will vote for on the basis of the given information, in order to create an
exit poll that will help predict the overall win and the seats covered by a particular party.

2.1 Read the dataset. Do the descriptive statistics and the null value
condition check. Write an inference on it.
Ans: To start the data analysis, we first need to load the data. The figure below shows the
loading step.

Figure 1 - Loading Dataset into Jupyter notebook

 Successfully loaded the data into Python.
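The loading and checking steps of Figures 1–6 can be sketched as follows. This is an illustrative sketch only: a small inline DataFrame stands in for the real 1525-row survey file, so the printed statistics differ from the report's.

```python
import pandas as pd

# Tiny synthetic stand-in for the survey data (hypothetical values, real file
# has 1525 rows and 9 columns after dropping the unwanted index column).
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 24],
    "economic.cond.national": [3, 4, 4, 2],
    "gender": ["female", "male", "male", "female"],
})

print(df.shape)                   # dimensions: (rows, columns)
print(df.isnull().sum())          # null-value condition check (all zeros here)
print(df.describe())              # descriptive statistics for numeric columns
print(df["vote"].value_counts())  # class proportions of the target variable
```

The same four calls (`shape`, `isnull().sum()`, `describe()`, `value_counts()`) reproduce the checks shown in the screenshots.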

2.1.1. Dropping the unnecessary columns from the dataset

Figure 2 – Dropping unwanted columns/variables


2.1.2. The data set is now ready for Exploratory Data Analysis. As per the output, the dimension or
shape of the data set is (1525, 9); therefore the data set has 1525 rows and 9 columns. Refer
to the figure below.

Figure 3 – Shape and Data type information of the data set


As per the figure, there are 2 variables of 'object' type and the remaining variables are of 'int'
type. There are also no duplicate values in the data set.

2.1.3. There are no null values in the data set. Refer to the figure below.

Figure 4 – Checking for Null Values

2.1.4. From the problem description it is clear that 'vote' is the dependent/target variable and the
remaining variables are independent variables. Going forward, the report uses the
terminology of dependent and independent variables to refer to the columns.
Next, proportions for the categorical variables are calculated.


Figure 5 – Proportions checking for Categorical Variables


From the figure we can infer that 1063 respondents cast their vote for Labour and 462 for
Conservative. The data set is reasonably balanced at roughly 70:30 proportions, which is suitable for
model building. Females outnumber males, with counts of 812 and 713 respectively. In addition, we
can conclude that the data set has no bad values.

2.1.5. Descriptive Stats:

Figure 6 – Descriptive Information


 As observed, the statistics for the target variable show that Labour received the higher
number of votes.
 Age ranges from 24 to 93 with a mean of 54 and a standard deviation of 16. The mean and
median are almost equal, suggesting an approximately normal distribution.
 economic.cond.national ranges from 1 to 5 with a mean of 3.24, where 1 means bad and 5
means good conditions. The mean and median are almost equal, again suggesting an
approximately normal distribution.
 economic.cond.household, Blair, and Hague range from 1 to 5 with means of 3.14, 3.33, and
2.7 respectively, where 1 means bad and 5 means good. Mean and median are almost equal
for each, suggesting approximately normal distributions.
 The 'Europe' feature, describing Eurosceptic sentiment, has a mean of 6 on a scale of 1 to
11, indicating that most voters are roughly neutral on this feature.
 Female vote casters outnumber male vote casters.


2.1.6. Distribution and boxplot of the variables


Inferences based on the boxplots and dist plots:
1. The 'Age' variable is normally distributed.
2. The remaining features (economic.cond.national, economic.cond.household, Blair,
Hague and political knowledge) are ordinal, not continuous, variables, so we see
multiple spikes in their distributions.

Figure 7 – Distribution of the variables

2.2 Perform Univariate and Bivariate Analysis. Do exploratory data


analysis. Check for Outliers.

2.2.1. Univariate Analysis

Figure 8 – Count plot for Vote Variable


 The majority of voters support Labour, so the vote bank lies largely with the Labour section.
2.2.2. Bivariate Analysis
1. Strip plot between Vote and Age

Figure 9 – Strip plot Age vs Vote


 Based on the strip plot, age shows only a weak relationship with the vote variable.
 Most of the voters in the Labour section are between 40 and 50 years of age.

2. Strip plot between Vote and Economic.cond.national

Figure 10 – Strip plot Vote vs Economic.cond.national

 Based on the data, there is no clear correlation.

3. Strip plot between Economic conditional household and Age


Figure 11 – Economic conditional household vs Age


 Based on the data, there is no clear correlation.

4. Strip plots of the remaining features against the target variable

Figure 12 – Strip plot against Vote feature


5. Correlation between the variables:

Figure 13 – Heat map of the Variables

 Correlation between the variables is very weak as per the heat map.
 Hague and economic.cond.national have a negative relation.
 Europe and economic.cond.national also have a negative relation.
 Age and political knowledge also have a negative relation, which is quite the opposite of
what one would expect.
6. Pairplot between the variables:

Figure 14 – Pair Plot for the variables


 From the pair plot, we can infer that there is no strong relation between any pair of
variables.
 Hence the dataset is not affected by multi-collinearity and is suitable for modelling.

2.2.3. Treating Outliers


Before label encoding and model building, we observed that some outliers exist in several
variables. We now treat the outliers in these variables.

Figure 15 – Outlier Treatment

Figure 16 –Box plot after outlier Treatment

2.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30)
Ans: To create a model for analysis we need to convert the categorical variables, so label
encoding is applied to them and the encoded data set is used for model building.

2.3.1. Data Encoding and Model building for Machine Learning Analysis

2.3.1.1. Converting Target Variable to Integer type using Categorical Function


 Here ‘0’ represents ‘Conservative’ and ‘1’ represents ‘Labour’
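The encoding step can be sketched with pandas' categorical codes. With alphabetical category order, 'Conservative' maps to 0 and 'Labour' to 1, matching the mapping above (a three-value sample stands in for the real target column).

```python
import pandas as pd

# Encode the target with pandas' Categorical codes; categories are ordered
# alphabetically, so 'Conservative' -> 0 and 'Labour' -> 1.
vote = pd.Series(["Labour", "Conservative", "Labour"])
vote_encoded = pd.Categorical(vote).codes
print(list(vote_encoded))  # [1, 0, 1]
```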


Figure 17 – Converting Target Variable to Integer type

2.3.1.2. Data information after conversion.

Figure 18 – Dataset info after encoding the variables

2.4 Apply Logistic Regression and LDA (linear discriminant analysis).


2.4.1. Model Building for Machine Learning Analysis – Split of data

Figure 19 – Test_Train_Split of the data

 The data is not scaled here, since the independent variables are on broadly similar
ranges; scaling is therefore skipped for all models except KNN.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.
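The split can be sketched as follows, with small synthetic arrays standing in for the encoded features and target; `random_state` fixes the shuffle so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the encoded feature matrix X and target y.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 70:30 train/test split, as used throughout the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```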

2.4.2. Logistic Regression model generation using the Grid search method


Figure 20 – Logistic Regression Model building


 Penalty: 'elasticnet', 'l2', 'none'
 Solver: 'newton-cg', 'saga'
 Tol: 0.001, 0.00001
Finally, based on the GridSearchCV results, the best estimator is captured as best_model and
predictions are made with this best_model.
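The grid search described above can be sketched as follows, on synthetic data. Only the solver and tol parts of the report's grid are reproduced here; the 'elasticnet' and 'none' penalties require specific solver and scikit-learn version combinations, so they are omitted from this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the encoded election survey.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Grid over solver and tolerance, mirroring part of the report's search space.
grid = {"solver": ["newton-cg", "saga"], "tol": [0.001, 0.00001]}
gs = GridSearchCV(LogisticRegression(max_iter=5000), grid, cv=3, scoring="accuracy")
gs.fit(X, y)

best_model = gs.best_estimator_  # predictions are then made with best_model
print(gs.best_params_)
```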

2.4.3. Creating model for LDA

Figure 21 – Model for LDA

 The data is not scaled here, since the independent variables are on broadly similar
ranges; scaling is therefore skipped for all models except KNN.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.


2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

2.5.1. Creating model using Gaussian NB

Figure 22 – Model for Gaussian NB

 The data is not scaled here, since the independent variables are on broadly similar
ranges; scaling is therefore skipped for all models except KNN.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.

2.5.2. Creating model using KNN


2.5.2.1 Scaling the data set for KNN model building

Figure 23 – Scaling the Data Set for model building


 Scaling is applied here because the KNN model is based on Euclidean distance
measurement, so a z-score transformation is applied to the train data set.
 The test and train split is made with a 70:30 ratio.
 X is the array of independent variables and y is the target variable.
2.5.2.2 Building Model:

Figure 24 – Model building


2.5.2.3 Building Model (K value =7):


Figure 25 – Model building

 The K value is taken as 7.
 The model score is 71.41%.
2.5.2.4 Calculating Misclassification error of KNN model for a range from 1 to 19 at steps of 2

Figure 26 – Calculating Misclassification error for various values of K


 It is observed that the misclassification error is lowest at K = 19.
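The sweep over odd K values from 1 to 19 can be sketched as follows, on synthetic z-scored data (so the best K found here need not match the report's K = 19).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the survey; z-scored because KNN relies on
# Euclidean distances.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X = (X - X.mean(axis=0)) / X.std(axis=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Misclassification error for each odd K from 1 to 19.
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1 - knn.score(X_te, y_te)

best_k = min(errors, key=errors.get)
print(best_k, round(errors[best_k], 3))
```

Plotting `errors` against K gives the elbow chart shown in Figure 27.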

2.5.2.5 Plotting Misclassification Error for range of K values

Figure 27 – Plotting Misclassification error for various values of K


2.5.2.6 Building model for K=19

Figure 28 – Model building for K =19


 The model score is observed as 69.8%.

2.6 Model Tuning, Bagging (Random Forest should be applied for


Bagging), and Boosting.

2.6.1. Creating model using Random Forest Technique

Figure 29 – Model using Random Forest

 The model is created using an ensemble technique with 100 estimators.
 The train score is observed as 99.9%.
 The test score is observed as 84%, which suggests some overfitting. We will therefore
tune the model using bagging and boosting.

2.6.2. Creating model using Random Forest model and applying Bagging

Figure 30 – Model using Random Forest and applying Bagging


 The model is created using an ensemble technique, with base_estimator set to
RF_model.
 The train score is observed as 97.9%.
 The test score is observed as 84.27%, which still suggests some overfitting, so we also
try model tuning with boosting.
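The bagging step can be sketched as follows, on synthetic data, with a Random Forest as the base estimator as described above (the first positional argument of `BaggingClassifier` is the base estimator; its keyword name is `base_estimator` in older scikit-learn and `estimator` in newer versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Bagging over a Random Forest base estimator, as in the report.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
bag = BaggingClassifier(rf, n_estimators=10, random_state=1).fit(X_tr, y_tr)
print(round(bag.score(X_tr, y_tr), 3), round(bag.score(X_te, y_te), 3))
```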

2.6.3. Creating model using Random Forest model and applying Boosting

Figure 31 – Model using Random Forest and applying Boosting

 The model is created using an ensemble technique; gradient boosting is used in this
case.
 The train score is observed as 88.75%.
 The test score is observed as 83.84%. The train and test scores are in line with each
other, so this can be considered the best model.
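The boosting step can be sketched as follows, on synthetic data. Note that `GradientBoostingClassifier` grows its own ensemble of shallow trees; unlike the bagging step, it does not take a Random Forest as a base estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Gradient boosting with default hyperparameters.
gb = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
print(round(gb.score(X_tr, y_tr), 3), round(gb.score(X_te, y_te), 3))
```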

2.7 Performance Metrics: Check the performance of Predictions on Train


and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare the models and
write inference which model is best/optimized.

2.7.1. Performance Metrics of Logistic Regression

2.7.1.1. Confusion Matrix, accuracy and other metrics of Train data


Figure 32 – Confusion Matrix of Train Data


 Accuracy on the train data is 83.03%.
 Recall on the train data for class '1' is 0.91.
2.7.1.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 33 – Confusion Matrix of Test Data

 Accuracy on the test data is 84.93%.
 Recall on the test data for class '1' is 0.92.
 False positives come down to 45 on the test data, compared with 111 on the train data,
which is a good sign.


2.7.1.3. Accuracy Score AUC_score and RoC Curve for the train data

Figure 34 – AUC and RoC of Train Data


 AUC score of the train data is 87.72%

2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data

Figure 35 – AUC and RoC of Test Data

 AUC score of the test data is 87.72%
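The metrics used throughout section 2.7 can be sketched as follows; synthetic data and a plain logistic regression stand in for the report's models, so the printed numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a simple model standing in for the report's fitted models.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = model.predict(X_te)
prob = model.predict_proba(X_te)[:, 1]   # class-1 probabilities for the ROC curve
print(confusion_matrix(y_te, pred))      # rows: actual, columns: predicted
print(round(accuracy_score(y_te, pred), 3))
print(round(roc_auc_score(y_te, prob), 3))
```

The same three calls, applied to the train and test sets of each model, produce every number in Table 1.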


2.7.2. Performance Metrics of LDA

2.7.2.1. Confusion Matrix of Train and Test data

Figure 36 – Confusion Matrix of Train and Test Data

 From the confusion matrices, the number of misclassified class-'1' cases is 105 for the
train data and 41 for the test data.

2.7.2.2. Classification report of Train and Test data

Figure 37 – Classification Report of Train and Test Data

 Accuracy of the Train data is 82%


 Accuracy of the Test Data is 84%

2.7.2.3. AUC_score and RoC Curve for the train and test data


Figure 38 – AUC and RoC of Train and Test Data

 AUC score of the train data is 87.7%


 AUC score of the test data is 91.6%

2.7.3. Performance Metrics of Naïve Bayes

2.7.3.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 39 – Confusion Matrix of Train Data


 Accuracy on the train data is 82.66%.
 Recall on the train data for class '1' is 0.88.
2.7.3.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 40 – Confusion Matrix of Test Data

 Accuracy on the test data is 84.71%.
 Recall on the test data for class '1' is 0.90.
 False positives come down to 37 on the test data, compared with 97 on the train data,
which is a good sign.

2.7.3.3. Accuracy Score AUC_score and RoC Curve for the train data

Figure 41 – AUC and RoC of Train Data


 AUC score of the train data is 87.52%


2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data

Figure 42 – AUC and RoC of Test Data

 AUC score of the test data is 91.02%

2.7.4. Performance Metrics of KNN Model with K value as 19

2.7.4.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 43 – Confusion Matrix of Train Data

 Accuracy on the train data is 69.82%.
 Recall on the train data for class '1' is 0.99.


2.7.4.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 44 – Confusion Matrix of Test Data

 Accuracy on the test data is 67.46%.
 Recall on the test data for class '1' is 0.96.
 False positives come down to 137 on the test data, compared with 311 on the train
data.

2.7.4.3. AUC_score and RoC Curve for the train and test data

Figure 45 – AUC and RoC of Train and Test Data


 AUC score of the train data is 61.30%
 AUC score of the test data is 46.40%


2.7.5. Performance Metrics of Random Forest with Bagging Technique

2.7.5.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 46 – Confusion Matrix of Train Data

 Accuracy on the train data is 97.18%.
 Recall on the train data for class '1' is 0.99.

2.7.5.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 47 – Confusion Matrix of Test Data

 Accuracy on the test data is 84.27%.
 Recall on the test data for class '1' is 0.92.


2.7.5.3. AUC_score and RoC Curve for the train and test data

Figure 48 – AUC and RoC of Train and Test Data

 AUC score of the train data is 99.70%


 AUC score of the test data is 91.80%

2.7.6. Performance Metrics of Random Forest with Boosting Technique

2.7.6.1. Confusion Matrix, accuracy and other metrics of Train data

Figure 49 – Confusion Matrix of Train Data

 Here the gradient boosting technique is used.
 Accuracy on the train data is 88.75%.
 Recall on the train data for class '1' is 0.94.


2.7.6.2. Confusion Matrix, accuracy and other metrics of Test data

Figure 50 – Confusion Matrix of Test Data


 Accuracy on the test data is 83.84%.
 Recall on the test data for class '1' is 0.92.

2.7.6.3. AUC_score and RoC Curve for the train and test data

Figure 51 – AUC and RoC of Train and Test Data

 AUC score of the train data is 94.80%


 AUC score of the test data is 90.80%


2.7.7. Comparison between Models

                  LR Model      LDA Model     NB Model      KNN Model     Bagging       Boosting
                  Train  Test   Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
Accuracy          83.03  84.93  82.75  84.71  82.66  84.71  69.82  67.46  97.18  84.27  88.75  83.84
Precision (1)     86     87     86     88     87     89     70     69     97     86     90     86
AUC               87.72  87.72  87.7   91.6   87.52  91.02  61.3   46.4   99.7   91.8   94.8   90.8
Recall (1)        91     92     89     91     88     90     99     96     99     92     94     92
F1-Score (1)      88     90     88     89     88     89     82     80     98     89     92     89

Table 1 – Comparison between various models


 On comparing the models, the Random Forest technique with Boosting appears the most
consistent across the various model evaluation parameters
 Train and test metrics for the Boosting model are close to each other, indicating
consistent generalisation
 The Boosting model's accuracy ranks second, with 88.75% on the train data and
83.84% on the test data. Bagging has the highest train accuracy (97.18%), but its
test accuracy drops by almost 13 percentage points, indicating overfitting
 Overall, the Random Forest technique tuned with Boosting provides the
better model
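The model-selection reasoning above can be sketched in a few lines. This is an illustrative snippet, not part of the report's code: it ranks the models from Table 1 by the gap between train and test accuracy, using the accuracy figures reported above (a smaller gap suggests more consistent generalisation).

```python
# Accuracies (train, test) as reported in Table 1
accuracy = {
    "LR":       (83.03, 84.93),
    "LDA":      (82.75, 84.71),
    "NB":       (82.66, 84.71),
    "KNN":      (69.82, 67.46),
    "Bagging":  (97.18, 84.27),
    "Boosting": (88.75, 83.84),
}

# Absolute train/test accuracy gap per model
gaps = {name: abs(train - test) for name, (train, test) in accuracy.items()}

print(max(gaps, key=gaps.get))  # → Bagging (widest gap, i.e. most overfit)
```

This quantifies the observation above: Bagging's near-13-point gap flags overfitting, while Boosting's gap of about 5 points keeps it consistent.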

2.8 Based on these predictions, what are the insights? (5 marks).

On the whole, based on the model outcomes for this data set, the following insights are observed:
 Female voters outnumber male voters, so parties should work to attract male voters
through suitable means.
 Most Labour-leaning voters are between 40 and 50 years of age, and the Labour party
appears to attract this group, possibly through monetary benefits; a voter in this age range
therefore has a high chance of voting for the Labour party.
 The strip plot shows a high density of Labour voters among those who rate the national
economic condition highly, so such voters are very likely to vote for the
Labour party.
 The same holds for voters who rate their household economic condition highly.
 The Random Forest algorithm tuned with the Gradient Boosting technique provides the best model.
The model gives consistent results on both the training and test data.

3. Problem Statement:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
 President Franklin D. Roosevelt in 1941
 President John F. Kennedy in 1961
 President Richard Nixon in 1973.

3.1 Find the number of characters, words, and sentences for the
mentioned documents.

Ans: To start the counting, we first need to load the data. The figure below shows the
data-loading step.


Figure 52 - Loading Dataset into Jupyter notebook


 The nltk package is imported and the inaugural speech corpus is downloaded
 The required presidential speeches are extracted and assigned to their respective
variables.
3.1.1. Calculate length of data
Ans: To calculate the length of the data, the code below is executed

Figure 53 – Length of data


 Number of characters in Roosevelt file is 7571.
 Number of characters in Kennedy file is 7618.
 Number of characters in Nixon file is 9991.
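The character count shown above boils down to calling len() on the raw speech string, which counts every character including spaces and newlines. A minimal self-contained sketch, using a short stand-in snippet in place of the full inaugural.raw('1941-Roosevelt.txt') text:

```python
# Stand-in for the full raw speech text loaded from the inaugural corpus
roosevelt_raw = "We the People of the United States"

# len() counts every character, spaces included
print(len(roosevelt_raw))  # → 34
```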

3.1.2. Calculate No of words


Ans: To calculate the number of words in the data, the code below is executed. The split() function is used
to extract words.

Figure 54 – No of words in each speech


 Number of words in Roosevelt file is 1360.
 Number of words in Kennedy file is 1390.
 Number of words in Nixon file is 1819.
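The word count above relies on str.split(): called with no arguments it splits on any run of whitespace, so the length of the resulting list is the word count. A minimal sketch on a stand-in sentence:

```python
# Stand-in for a loaded speech string
speech = "Ask not what your country can do for you"

words = speech.split()   # split on any whitespace
print(len(words))        # → 9
```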


3.1.3. Calculate No of Sentences


Ans: To calculate the number of sentences in the data, the code below is executed. The sent_tokenize
command is used to count the sentences.
3.1.3.1 A “sentence_count” function is defined to count the number of sentences in a particular speech

Figure 55 – User defined function to count no of sentences in a Text file


3.1.3.2 Code to calculate the number of sentences in each speech

Figure 56 – No of sentences in each presidential speech


 Number of sentences in Roosevelt file is 68.
 Number of sentences in Kennedy file is 52.
 Number of sentences in Nixon file is 68.
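The report's sentence_count() helper wraps nltk.sent_tokenize. As a hedged, self-contained stand-in, the sketch below approximates it with a regex split after '.', '!' or '?' (unlike the punkt tokenizer, it will miscount abbreviations such as "Mr."):

```python
import re

def sentence_count(text):
    # Split wherever sentence-ending punctuation is followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(sentences)

speech = "We have nothing to fear but fear itself. Is that not so? It is."
print(sentence_count(speech))  # → 3
```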

3.2 Remove all the stopwords from all three speeches.

3.2.1. Import predefined stopwords from the nltk


Ans: To import the predefined stopwords from nltk, the code below is executed

Figure 57 – Importing stopwords to python


3.2.2. Step to remove all punctuation
Ans: In this step, punctuation and unnecessary words are removed from the text file.


Figure 58 – Code to remove punctuation and unnecessary words


 Punctuation marks in the defined list and additional unwanted words are removed

3.2.3. Removing stopwords from all the files


Ans: To remove stopwords, the code below is executed

Figure 59 – Removing stopwords from the speeches


 The output shows that the punctuation is removed; the cleaned text is stored in
variables with the suffix ‘_clean’
 We can see the comma is removed from the token list
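The filtering step above can be sketched in a few self-contained lines. The report pulls its stopword list from nltk.corpus.stopwords; the tiny hardcoded set below is a stand-in so the example runs on its own:

```python
import string

# Stand-in for the nltk English stopword list (assumption: a small subset)
stop_words = {"we", "the", "of", "and", "to", "a", "in", "is"}

tokens = ["we", "observe", "the", "state", "of", "the", "union", ","]

clean = [t for t in tokens
         if t not in stop_words             # drop stopwords
         and t not in string.punctuation]   # drop punctuation tokens

print(clean)  # → ['observe', 'state', 'union']
```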

3.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)

3.3.1. Step to convert tokens into lower case.


Ans: To find the most frequent words, all tokens are first converted to lower case. The algorithm
treats lower-case and upper-case words as different, which would distort the count/
frequency. Hence the built-in lower() function is used to convert all tokens to lower case and
keep all three speeches consistent. The lines of code below are executed


Figure 60 – Converting tokens into Lower case


 We can observe that words like ‘Mr’ are converted to ‘mr’ after the above
execution.
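The lower-casing step is a one-line normalisation, sketched below so that tokens like 'Mr'/'mr' or 'Nation'/'nation' are counted as the same word:

```python
# Stand-in token list before normalisation
tokens = ["Mr", "Vice", "President", "Nation", "nation"]

lower_tokens = [t.lower() for t in tokens]
print(lower_tokens)  # → ['mr', 'vice', 'president', 'nation', 'nation']
```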
3.3.2. Step to Lemmatization with POS
Ans: In this step lemmatization is performed to convert all tokens to their root words. For this,
WordNetLemmatizer is imported into the Python notebook.

Figure 61 – Lemmatization with POS tag


 In the above step, words are converted to their root forms based on their
parts of speech.
 Words that are not categorized under any part of speech are retained as they are.

3.3.3. Top three word frequency calculation for each speech

3.3.3.1: Word frequency is calculated through below code in Roosevelt Speech.


Figure 62 – Word Frequency calculation for Roosevelt Speech


 Top three words in Roosevelt speech are “Nation”, “It” and “Life”
3.3.3.2: Word frequency is calculated through below code in Kennedy Speech.

Figure 63 – Word Frequency calculation for Kennedy Speech


 Top three words in Kennedy speech are “World”, “Let” and “Side”
3.3.3.3: Word frequency is calculated through below code in Nixon Speech.

Figure 64 – Word Frequency calculation for Nixon Speech


 Top three words in Nixon speech are “America”, “Peace” and “World”
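The frequency calculation behind all three figures can be done with collections.Counter. A minimal sketch, using a stand-in token list in place of the cleaned, lemmatized speech tokens:

```python
from collections import Counter

# Stand-in for the cleaned token list of a speech
tokens = ["world", "peace", "america", "world", "peace", "world",
          "america", "peace", "world"]

freq = Counter(tokens)
print(freq.most_common(3))  # → [('world', 4), ('peace', 3), ('america', 2)]
```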


3.4 Plot the word cloud for each of the three speeches. (after removing
the stopwords)

Ans: A word cloud is a visual representation of the words in a given text file. The size of
each word indicates its frequency: the bigger the word, the higher its frequency.
3.4.1. Now we create the word cloud for the Roosevelt speech. WordCloud is imported from the
wordcloud package to generate it.

Figure 65 – Word cloud for Roosevelt speech


3.4.2. Now we create the word cloud for the Kennedy speech. WordCloud is imported from the
wordcloud package to generate it.

Figure 66 – Word cloud for Kennedy speech


3.4.3. Now we create the word cloud for the Nixon speech. WordCloud is imported from the
wordcloud package to generate it.

Figure 67 – Word cloud for Nixon speech
