Project ML
Anil Ulchala
12/4/2021
Contents
1. Action Required:
2. Problem Statement:
2.1 Read the dataset. Do the descriptive statistics and do the null value condition check? Write an inference on it.
2.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
2.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30)
2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
2.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
2.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
2.8 Based on these predictions, what are the insights? (5 marks).
3. Problem Statement:
3.1 Find the number of characters, words, and sentences for the mentioned documents.
3.2 Remove all the stopwords from all three speeches.
3.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
3.4 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
List of Tables:
Table 1 – Comparison between various models
This Business Report is generated based on the Data set extracted from reliable sources
Business Case
1. Action Required:
The purpose of this exercise is to explore the data set through exploratory data analysis, to
build machine learning models on it using techniques such as Naïve Bayes and KNN, and to compare
the models. Insights and recommendations are then provided based on the output. NLP techniques
are also applied to a second data set, containing speeches of former Presidents of the United
States, and a word cloud is created.
2. Problem Statement:
You are hired by one of the leading news channels, CNBE, which wants to analyze recent elections.
The survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit
poll that will help in predicting the overall win and the seats covered by a particular party.
2.1 Read the dataset. Do the descriptive statistics and do the null value
condition check? Write an inference on it.
Ans: To start the data analysis, we first need to load the data. The figure below shows a snapshot
of the data-loading step.
2.1.2. The data set is now ready for exploratory data analysis. As per the output, the dimension
(shape) of the data set is (1525, 9); the data set therefore has 1525 rows and 9 columns. Refer
to the figure below.
2.1.3. There are no null values in the data set. Refer to the figure below.
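The loading, shape and null-value checks described above can be sketched as follows. This is a minimal illustration, not the report's actual notebook: the file is replaced by a small in-memory CSV, and the column names and values in it are assumptions made only so that the example runs.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file. The real data set has 1525 rows
# and 9 columns; these column names and values are illustrative only.
SAMPLE_CSV = """vote,age,economic.cond.national,economic.cond.household,Hague,Europe,political.knowledge,gender
Labour,43,3,3,1,2,2,female
Labour,36,4,4,4,5,2,male
Labour,35,4,4,2,3,2,male
Conservative,24,4,2,1,4,0,female
Labour,41,2,2,1,6,2,male
Conservative,47,3,4,4,4,2,male
"""

# With a real file this would be pd.read_csv("<path to the survey file>").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)                    # (rows, columns)
print(df.describe(include="all"))  # descriptive statistics for every column

null_counts = df.isnull().sum()    # number of missing values per column
print(null_counts)
```

On the real data the shape check is expected to report (1525, 9), and a column of zeros from `isnull().sum()` corresponds to the "no null values" observation.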
2.1.4. From the problem description it is clear that ‘vote’ is the dependent (target) variable
and the remaining variables are independent variables. From here on, the report uses the terms
dependent and independent variable to refer to the columns.
Proportions are then calculated for the categorical variables.
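One way to compute such proportions is `value_counts(normalize=True)` in pandas. The sketch below uses a tiny made-up frame; the column names `vote` and `gender` and all values are assumptions for illustration, not figures from the survey.

```python
import pandas as pd

# Illustrative data only; not taken from the survey.
df = pd.DataFrame(
    {
        "vote": ["Labour", "Labour", "Conservative", "Labour"],
        "gender": ["female", "male", "female", "female"],
    }
)

categorical_cols = ["vote", "gender"]  # assumed categorical columns

# Share of each category within its column (each Series sums to 1.0).
proportions = {col: df[col].value_counts(normalize=True) for col in categorical_cols}

for col, shares in proportions.items():
    print(col)
    print(shares)
```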
The majority of the voters support Labour, so the larger vote bank lies with the Labour party.
2.1.2. Bivariate Analysis
1. Strip plot between Vote and Age
4. Scatter plot between old children and Salary with hue of target variable
Correlation between the variables is very weak, as per the heat map.
Hague and Economic.cond.national have a negative relation.
Europe and Economic.cond.national also have a negative relation.
Age and Political knowledge also have a negative relation, which is contrary to what one would
expect.
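The numbers behind a correlation heat map can be obtained with `DataFrame.corr()`. The sketch below is an assumption-laden illustration: the values are made up, so the signs it prints say nothing about the survey itself. It only shows how negatively correlated pairs can be listed.

```python
import pandas as pd

# Made-up numeric columns; names loosely follow the variables named in the text.
df = pd.DataFrame(
    {
        "age": [43, 36, 35, 24, 41, 47],
        "economic.cond.national": [3, 4, 4, 4, 2, 3],
        "Hague": [1, 4, 2, 1, 1, 4],
        "Europe": [2, 5, 3, 4, 6, 4],
        "political.knowledge": [2, 2, 2, 0, 2, 2],
    }
)

corr = df.corr()  # Pearson correlation matrix (the input to a heat map)
print(corr.round(2))

# List each pair of distinct variables whose correlation is negative.
negative_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] < 0
]
print(negative_pairs)
```

If seaborn is available, `seaborn.heatmap(corr, annot=True)` would draw this matrix as a heat map.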
6. Pairplot between the variables:
From the pair plot, we can infer that there is no strong relation between any pair of
variables.
Hence the data set does not suffer from multicollinearity and is suitable for modelling.
2.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30)
Ans: To build a model we need to convert the categorical variables to numeric form. We therefore
apply label encoding to the categorical variables and use the encoded data set for model building.
2.3.1. Data Encoding and Model building for Machine Learning Analysis
The data is not scaled here, since the independent variables do not differ much in range.
Scaling is therefore skipped for all models except KNN, which is distance-based and so is
sensitive to feature scale.
The train and test split is made in the ratio 70:30.
X is an array of the independent variables and Y is an array of the target variable.
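The encoding, split and KNN-only scaling steps can be sketched with scikit-learn as below. The data frame, its column names and the `random_state` value are assumptions chosen for illustration; the report's own code may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative data only (10 rows); not taken from the survey.
df = pd.DataFrame(
    {
        "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative",
                 "Labour", "Labour", "Conservative", "Labour", "Labour"],
        "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
        "gender": ["female", "male", "male", "female", "male",
                   "male", "male", "male", "female", "male"],
    }
)

# Label-encode the string columns (each category becomes an integer code).
encoded = df.copy()
for col in ["vote", "gender"]:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

X = encoded.drop(columns="vote")  # independent variables
y = encoded["vote"]               # target variable

# 70:30 train/test split; random_state is an arbitrary choice for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Scaling is fitted on the training part only and would be used for KNN.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training rows only keeps information from the test rows out of the model.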
2.4.2. Logistic Regression model generation using the Grid search method
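A grid search over Logistic Regression settings might look like the sketch below. The synthetic data from `make_classification`, the parameter grid and the 5-fold setting are all assumptions; the grid actually used in the report is not recoverable from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded survey data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid: regularisation strength and solver.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_  # refitted on the whole training set
test_accuracy = best_model.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```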
2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
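As a rough sketch of this step, the code below fits a KNN classifier (with scaling, since KNN is distance-based) and a Gaussian Naïve Bayes classifier on synthetic data. The data, `n_neighbors=5` and the choice of `GaussianNB` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded survey data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# KNN: the scaler is part of the pipeline, so it is fitted on training data only.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

# Naive Bayes: no scaling is applied here.
nb = GaussianNB()
nb.fit(X_train, y_train)

scores = {
    "KNN": (knn.score(X_train, y_train), knn.score(X_test, y_test)),
    "NaiveBayes": (nb.score(X_train, y_train), nb.score(X_test, y_test)),
}
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing the train and test accuracy of each model side by side is one simple way to spot overfitting.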
2.6.2. Creating model using Random Forest model and applying Bagging
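Bagging with a Random Forest as the base learner could be set up as in the sketch below. The synthetic data and the estimator counts are assumptions, and the base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Each of the 10 bagged members is a small Random Forest trained on a
# bootstrap sample of the training data.
bagging = BaggingClassifier(
    RandomForestClassifier(n_estimators=20, random_state=1),
    n_estimators=10,
    random_state=1,
)
bagging.fit(X_train, y_train)

bagging_train_acc = bagging.score(X_train, y_train)
bagging_test_acc = bagging.score(X_test, y_test)
print(bagging_train_acc, bagging_test_acc)
```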
2.6.3. Creating model using Random Forest model and applying Boosting
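For the boosting step, a minimal sketch with scikit-learn's `GradientBoostingClassifier` is shown below. Gradient Boosting is used because the report's insights mention it, but whether the report used this exact class, and with which settings, is an assumption; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Boosting adds shallow trees one after another, each correcting the
# errors of the ensemble built so far.
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=1)
gbc.fit(X_train, y_train)

boost_train_acc = gbc.score(X_train, y_train)
boost_test_acc = gbc.score(X_test, y_test)
print(boost_train_acc, boost_test_acc)
```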
2.7.1.3. Accuracy Score, AUC_score and RoC Curve for the train data
2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data
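The metrics named in these headings (accuracy, confusion matrix, ROC curve and ROC_AUC score) can be computed for the train and test sets with a small helper like the one below. The helper name `evaluate`, the synthetic data and the Logistic Regression model are assumptions used only to make the sketch runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def evaluate(fitted_model, features, labels):
    """Return accuracy, confusion matrix, ROC AUC and ROC curve points."""
    predicted = fitted_model.predict(features)
    # Probability of the positive class drives the ROC curve and AUC.
    positive_prob = fitted_model.predict_proba(features)[:, 1]
    fpr, tpr, _thresholds = roc_curve(labels, positive_prob)
    return {
        "accuracy": accuracy_score(labels, predicted),
        "confusion_matrix": confusion_matrix(labels, predicted),
        "roc_auc": roc_auc_score(labels, positive_prob),
        "roc_curve": (fpr, tpr),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)
print(train_metrics["accuracy"], train_metrics["roc_auc"])
print(test_metrics["accuracy"], test_metrics["roc_auc"])
print(test_metrics["confusion_matrix"])
```

The `(fpr, tpr)` arrays can be passed to `matplotlib.pyplot.plot` to draw the ROC curve; the same helper works for any classifier that provides `predict_proba`.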
2.7.2.3. AUC_score and RoC Curve for the train and test data
2.7.3.3. Accuracy Score, AUC_score and RoC Curve for the train data
2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data
2.7.4.3. AUC_score and RoC Curve for the train and test data
2.7.5.3. AUC_score and RoC Curve for the train and test data
2.7.6.3. AUC_score and RoC Curve for the train and test data
On the whole, based on the outcomes of the models for this data set, the following insights are observed:
Female voters outnumber male voters, so parties need to find ways to attract male voters.
Most Labour voters are between 40 and 50 years of age, which suggests the Labour party is
attracting this age group, possibly with some monetary benefits. A voter in this age range
therefore has a high chance of casting their vote for the Labour party.
Based on the strip plot, voter density for the Labour party is high among voters who rate the
national economic condition highly. A voter from this group therefore has a high chance of
casting their vote for the Labour party.
The same holds for voters who rate their household economic condition highly.
The Random Forest algorithm tuned with the Gradient Boosting technique provides the better model;
it gives good results on both the training and the test data.
3. Problem Statement:
In this project, we work on the inaugural corpus from the nltk library in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973.
3.1 Find the number of characters, words, and sentences for the
mentioned documents.
Ans: To start the counting, we first need to load the data. The figure below shows a snapshot of
the data-loading step.
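A simple way to obtain the three counts is sketched below. To keep the example runnable without downloading the corpus, it works on a short made-up string and uses regular expressions instead of nltk's tokenizers; both choices are assumptions, and the numbers it produces would differ from nltk's counts, which treat punctuation marks as separate tokens.

```python
import re

# With nltk and its corpus installed, the raw text of a speech could be read
# with: nltk.corpus.inaugural.raw("1941-Roosevelt.txt")
# (similarly "1961-Kennedy.txt" and "1973-Nixon.txt").


def count_text(raw):
    """Count characters, words and sentences in a raw text string."""
    words = re.findall(r"[A-Za-z0-9']+", raw)
    # Split after '.', '!' or '?' followed by whitespace (a rough heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    return {
        "characters": len(raw),
        "words": len(words),
        "sentences": len(sentences),
    }


sample = "This is a short sample. It has two sentences."  # made-up text
counts = count_text(sample)
print(counts)
```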
3.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)
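Removing stopwords and picking the three most frequent words can be sketched with `collections.Counter`. The stopword set and sample text below are tiny made-up stand-ins; with nltk installed, the English list from `nltk.corpus.stopwords.words("english")` would normally be used instead.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "it", "in"}


def top_words(raw, stopwords, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", raw.lower())      # lower-case word tokens
    kept = [tok for tok in tokens if tok not in stopwords]
    return Counter(kept).most_common(n)


# Made-up text, not taken from any speech.
sample = "The river and the bridge of the river meet the road of the bridge. The river flows."
top_three = top_words(sample, STOPWORDS)
print(top_three)
```

Lower-casing before counting makes "The" and "the" count as the same word, which matters for both the stopword filter and the frequencies.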
3.4 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)
Ans: A word cloud is a visual representation of the words in a particular text file. The size of
each word indicates its frequency: the bigger the word, the more often it occurs.
3.4.1. We now create the word cloud for the Roosevelt speech. WordCloud is imported from the
wordcloud package to create it.
3.4.3. We now create the word cloud for the Nixon speech. WordCloud is imported from the wordcloud
package to create it.