Machine Learning VIVEK

The document details the building of machine learning models to predict employee transportation preferences using data from ABC Consulting. Key points:
- The dataset contains 444 rows and 9 features on employee age, gender, education, salary, distance to work, licence status, and current transportation mode.
- Exploratory data analysis found that most employees are aged 24-30 and are male engineers without licences who currently use public transportation.
- Models are built on a 70% training set and evaluated on the remaining 30% test set to determine the best model for predictions.
- Feature engineering included one-hot encoding of categorical variables. Univariate analysis found roughly normal distributions for the continuous variables, while bivariate plots showed public transportation is preferred across groups.


University of Texas

MACHINE LEARNING
BUSINESS REPORT

SUBMITTED BY
VIVEK AJAYAKUMAR

DATE OF SUBMISSION: 28 MAR 2022
VERSION: 2.1
TABLE OF CONTENTS

Problem 1
1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary), and the basic descriptive statistics of the dataset.
2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3 Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
   a. Logistic Regression Model
   b. Linear Discriminant Analysis
   c. Decision Tree Classifier – CART model
   d. Naïve Bayes Model
   e. KNN Model
   f. Random Forest Model
   g. Boosting Classifier Model using Gradient boost
4 Which model performs the best?
5 What are your business insights?

Problem 2
1 Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
2 Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.
3 The following exercise is to be done for both the corpora:
   a) Find the number of characters for both the corpuses.
   b) Remove Stop Words from the corpora. (Words like 'also', 'made', 'makes', 'like', 'this', 'even' and 'company' are to be removed)
   c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?
   d) Plot the Word Cloud for both the corpora.
4 Refer to both the word clouds. What do you infer?
5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to secure a deal based on your analysis?

List of Figures
1 First 5 Rows of the Dataset
2 Data Summary of the Dataset
3 Univariate Analysis
4 Bivariate Analysis
5 Logistic Regression Model
6 Linear Discriminant Analysis
7 Decision Tree Classifier – CART Model
8 Naïve Bayes Model
9 KNN Model
10 Random Forest Model
11 Boosting Classifier Model using Gradient Boost
12 Key Words
13 Word Cloud
Problem 1: Machine Learning Models

You work for an office transport company. You are in discussions with ABC Consulting
for providing transport for their employees. For this purpose, you are tasked with
understanding how the employees of ABC Consulting presently commute between home and
office. Based on parameters like age, salary, and work experience given in the data set
'Transport.csv', you are required to predict the preferred mode of transport. The project
requires you to build several Machine Learning models and compare them so that a model
can be finalised.

Data set for the Problem:

Age : Age of the Employee in Years
Gender : Gender of the Employee
Engineer : Engineer = 1, Non-Engineer = 0
MBA : MBA = 1, Non-MBA = 0
Work Exp : Experience in years
Salary : Salary in Lakhs per Annum
Distance : Distance in Kms from Home to Office
license : Employee has a Driving Licence = 1, otherwise 0
Transport : Mode of Transport

The objective is to build various Machine Learning models on this data set and, based on
the accuracy metrics, decide which model should be finalised for predicting the mode of
transport chosen by an employee.

1.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations,
outliers, and missing values treatment (if necessary) and check the basic descriptive
statistics of the dataset.

Data from ABC Consulting is loaded for evaluation, and the data dictionary is shown below:

Age : Age of the Employee in Years
Gender : Gender of the Employee
Engineer : Engineer = 1, Non-Engineer = 0
MBA : MBA = 1, Non-MBA = 0
Work Exp : Experience in years
Salary : Salary in Lakhs per Annum
Distance : Distance in Kms from Home to Office
license : Employee has a Driving Licence = 1, otherwise 0
Transport : Mode of Transport

The data has 444 rows and 9 features. The first five rows are shown below:

The dataset has no null or duplicate values. For the analysis, the 'Transport' variable is
considered the dependent variable, i.e., the one to be predicted. The variables 'Age',
'Salary', 'Work Exp', and 'Distance' are continuous, while the other variables are
categorical and need to be encoded later for evaluation.
The data info is shown below:

Data description is shown below:

The dataset after converting the encoded categorical columns to object type (in place of
int64) is shown below.
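For reference, a minimal sketch of the encoding step, following the get_dummies call in the appended notebook (the file name Cars1.csv is the one the notebook reads):

import pandas as pd

car = pd.read_csv('Cars1.csv')

# One-hot encode the categorical columns; drop_first=True avoids redundant
# dummy columns. 'Transport' becomes the binary 'Transport_Public Transport'.
car1 = pd.get_dummies(car, columns=['Gender', 'Transport'], drop_first=True)
print(car1.dtypes)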

UNIVARIATE ANALYSIS

For the univariate analysis, the continuous variables ('Age', 'Work Exp', 'Salary',
'Distance') are plotted below.

The count plots of the categorical and discrete variables are shown below:
(Plots: Age, Gender, Engineer, MBA, License, Transport)

INFERENCES:
o Most of the continuous variables are approximately normally distributed, while Work
Experience and Salary are right-skewed.
o Outliers are present in all the continuous variables; Salary and Work Experience have
many outliers.
o From the count plots, most of the employees are between the ages of 24 and 30.
o Male employees outnumber female employees, and most of the staff are Engineers rather
than MBA holders.
o Most of the staff do not hold a valid driving licence, and they prefer to use Public
Transport.

BIVARIATE ANALYSIS

Bivariate analysis is done by plotting the dependent variable, Transport, against each
independent variable.

Bivariate plots are shown below:

(Plots: Engineer, Gender, and MBA against Transport)

INFERENCES:
o From all the plots, it is evident that staff prefer public transport over private
transport.
o More than two hundred engineers use Public Transport, and roughly half that number use
Private Transport. The same pattern repeats for non-MBA staff.
o From the gender data, males clearly favour Public Transport, with about half as many
using Private Transport, while female staff are split more evenly between Public and
Private Transport.

HEATMAP:

PAIR PLOT:

Outlier Treatment
(Boxplots before and after treatment)
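The capping logic is sketched below, a condensed version of the remove_outlier helper in the appended notebook (np.where replaced by the equivalent pandas clip):

import numpy as np

def iqr_bounds(col):
    # 1.5*IQR whisker limits for a numeric column
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap each continuous variable at its whisker limits instead of dropping rows.
for column in ['Age', 'Work Exp', 'Salary', 'Distance']:
    lower, upper = iqr_bounds(car1[column])
    car1[column] = car1[column].clip(lower, upper)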

INFERENCES:
o From both plots, 'Age' is strongly correlated with 'Work Exp' and 'Salary'; likewise,
'Work Exp' is strongly correlated with 'Salary'.
o 'MBA' and 'Salary' are weakly negatively correlated.
o Apart from these, most of the variables show no notable correlation with each other and
can be neglected.

1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?

The dependent variable, Transport, is encoded to 1 and 0, where 1 indicates Public
Transport and 0 indicates Private Transport.

Scaling is not mandatory for linear models such as Linear Regression, Logistic Regression,
and LDA, or for tree-based models such as Random Forest. However, scaling is a must for
distance-based models such as KNN and SVM: without it, features with larger magnitudes
dominate the distance computation and receive undue importance.

Scaling should be done only after the train-test split, i.e., the scaler is fitted on the
train set alone and then applied to the test set. This preserves the assumption that the
test set is unseen during the training phase.

In this dataset, a 70:30 split is used, and the Min-Max Scaler is used for scaling. The
train set has 310 rows and the test set has 134 rows.
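A minimal sketch of the split-then-scale pattern described above (note that fitting the scaler on the train set alone is the leakage-free variant; the appended notebook applies min-max scaling to the full frame before splitting):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = car1.drop('Transport_Public Transport', axis=1)
y = car1['Transport_Public Transport']

# 70:30 split, stratified so both sets keep the public/private ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit the scaler on the training data only, then apply it to both sets.
num_cols = ['Age', 'Work Exp', 'Salary', 'Distance']
scaler = MinMaxScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])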

1.3. Build the following models on the 70% training data and check the performance of
these models on the Training as well as the 30% Test data using the various inferences
from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune
the models wherever required for optimum performance:

Modelling of Dataset

Logistic Regression

Results of the Logistic Regression (classification report, confusion matrix, and ROC-AUC
scores for the Train and Test data):

Bar plot of feature coefficient values (Logistic Regression):

Feature        Coefficient
Age             0.158
Engineer       -0.025
MBA             0.247
Work Exp       -1.030
Salary         -1.489
Distance       -2.728
license        -1.711
Gender_Male     1.235

Inferences:
o The Train and Test accuracy & ROC-AUC scores are close, which implies that the model
does not suffer from overfitting.
o Order of importance by coefficient magnitude: 'Distance' (negative) > 'license'
(negative) > 'Salary' (negative) > 'Gender_Male' (positive).
o Variables like 'Engineer' and 'Age' have the least importance.

Linear Discriminant Analysis

Results of the model (classification report, confusion matrix, and ROC-AUC scores for the
Train and Test data):

Bar plot of feature coefficient values (LDA Model):

Feature        Coefficient
Age             5.231
Engineer       -0.083
MBA             0.342
Work Exp       -4.204
Salary         -3.035
Distance       -5.082
license        -2.364
Gender_Male     1.614

Inferences:
o The ROC-AUC scores of the Train and Test data are 0.842 and 0.786, which implies that
the model does not suffer from overfitting.
o From the test confusion matrix, the model classifies 109 of the 134 observations
correctly and 25 incorrectly, so the model works reasonably well.
o Order of importance: Age (positive influence) > Distance (negative influence) > Work
Exp (negative influence) > Salary (negative influence).
o Variables like 'Engineer' and 'MBA' have low influence on the model.

DECISION TREE CLASSIFIER – CART MODEL

Results of the model (classification report, confusion matrix, and ROC-AUC scores for the
Train and Test data):

INFERENCES:
o The ROC-AUC score on the Test data is noticeably lower than on the Train data, which
implies that the model suffers from overfitting.
o From the confusion matrix, the model often fails to predict the transport mode of the
staff.
o It is not a suitable model for prediction.

Naïve Bayes Model

Results of the model (classification report, confusion matrix, and ROC-AUC scores for the
Train and Test data):

INFERENCE:
o The ROC-AUC scores for the Train and Test data are 0.804 and 0.766. This implies that
the model does not suffer from overfitting; however, since both scores are at or below
0.80, the model can be categorised as underfitting.
o The f1-scores on both the Train and Test data imply that the model is not a good fit
for prediction.

KNN MODEL

Results of the model (classification report, confusion matrix, and ROC-AUC scores for the
Train and Test data):

INFERENCE:
o There is no significant difference between the Train and Test ROC-AUC scores, so the
model does not suffer from overfitting.
o The confusion matrix indicates that most of the values are predicted accurately.
o The f1-scores of the Train and Test data indicate that the model is fit for prediction.

RANDOM FOREST MODEL

Results of the model (classification report, confusion matrix, and ROC-AUC scores for the
Train and Test data):

Feature Importance Plot

INFERENCE:
o There is a significant difference between the Train and Test ROC-AUC scores, so the
model shows some overfitting.
o The classification report states that the model performs well in predicting the output.
o The f1-scores of the Train and Test data indicate that the model is good for prediction.
o 'Salary', 'Distance', and 'Age' are the three most important predictors.
o 'MBA' and 'Engineer' are the two least important predictors.

Boosting Classifier using Gradient Boost

Results of the model (classification report, confusion matrix, and ROC-AUC scores for the
Train and Test data, followed by the feature importance plot):

Inference:
o The ROC-AUC scores indicate that the model suffers from overfitting.
o The f1-scores indicate that the model is not ideally suited for prediction.
o 'Salary', 'Distance', and 'Age' are the three most prominent factors.
o 'Engineer' and 'MBA' are the two least important predictors.

1.4 Which model performs the best?

Model building is an iterative process, and performance is evaluated by comparing results
on the train and test datasets. A model can be improved through hyperparameter tuning,
dimensionality reduction, and feature engineering. In this case, we have tried different
model configurations and compared the outputs of the various models.

Consolidated output of the models:

Metric                      Logistic     LDA      CART     Naïve    KNN      Random   Boosting
                            Regression                     Bayes             Forest   (Gradient)
Accuracy Score              0.821        0.81     0.83     0.78     0.80     0.83     0.81
ROC AUC Score               0.758        0.786    0.743    0.766    0.823    0.879    0.863
Recall (Public Transport)   0.97         0.96     0.96     0.95     0.96     0.96     0.91
Recall (Private Transport)  0.51         0.51     0.56     0.44     0.47     0.56     0.60
f1 (Public Transport)       0.88         0.87     0.88     0.86     0.87     0.88     0.87
f1 (Private Transport)      0.65         0.64     0.68     0.57     0.60     0.68     0.68

Analysis of the model output:

Parameter                   Best Model
Accuracy Score              Decision Tree Classifier (CART)
ROC AUC Score               Random Forest
Recall (Public Transport)   Logistic Regression
Recall (Private Transport)  Boosting Classifier using Gradient Boost
f1 (Public Transport)       Logistic Regression, CART, Random Forest
f1 (Private Transport)      CART, Random Forest, Boosting Classifier using Gradient Boost

Inference:
• Depending on the output parameter that matters most, we can choose the corresponding
model.
• If multiple models perform equally well on the evaluation, we can choose among them
based on other characteristics of the data set.

1.5 What are your business insights?

Model performance depends on the input data and the distribution of the output variable.
Based on evaluation metrics such as accuracy, ROC-AUC score, recall, and f1-score,
different models can be chosen for different prediction goals. The input data shows that
the most preferred mode of transport is public transport, and more data on private
transport users might enhance model performance. The various models assign different
levels of importance to the input features, which implies that domain knowledge is
important to validate the model findings. In addition, subdividing the data by Gender or
Age might improve the models.

Problem 2
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making
their pitch to the VC sharks.
You will ONLY use “Description” column for the initial text mining exercise.
1. Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
2. Create two corpora, one with those who secured a Deal, the other with those who did
not secure a deal.
3. The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’,
‘this’, ‘even’ and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing
stop words)?
d) Plot the Word Cloud for both the corpora.
4. Refer to both the word clouds. What do you infer?
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices
are less likely to secure a deal based on your analysis?

2.1. Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.

The data has 495 entries with features 'deal', 'description', 'episode', 'category',
'entrepreneurs', 'location', 'website', 'askedFor', 'exchangeForStake', 'valuation',
'season', 'shark1', 'shark2', 'shark3', 'shark4', 'shark5', 'title', 'episode-season',
and 'Multiple Entreprenuers'. For our evaluation, the main features, 'deal' and
'description', are taken.

The first look of the dataset:

The dataset has two duplicate rows, which are removed for data quality. There are no null
values. The data info is shown below:
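A minimal sketch of this step (the Shark Tank file name is assumed for illustration; the column names are as listed above):

import pandas as pd

shark = pd.read_csv('SharkTank.csv')  # file name assumed

# Keep only the dependent variable and the pitch description,
# then drop the two duplicate rows found above.
df = shark[['deal', 'description']].drop_duplicates()
print(df.shape)  # 493 rows remain after removing the duplicates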

2.2 Create two corpora, one for those who secured a Deal, the other for those who did
not secure a deal.

The dataset is split into two corpora based on the feature 'deal': one subset where 'deal'
is True and one where it is False.
True Dataset

True Dataset has 250 data entries.


False Dataset

False Dataset has 243 data entries.


2.3. The following exercise is to be done for both the corpora: a) Find the number of
characters for both the corpuses. b) Remove Stop Words from the corpora. (Words like
‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and ‘company’ are to be removed) c) What
were the top 3 most frequently occurring words in both corpuses (after removing stop
words)? d) Plot the Word Cloud for both the corpora.

a) Find the number of characters for both the corpuses


Total Characters for unsecured deal is 34736
Total Characters for secured deal is 46959
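A sketch of how the two corpora and their character counts can be produced (the exact counts depend on how the descriptions are concatenated):

# Join each group of descriptions into a single corpus string.
secured = ' '.join(df.loc[df['deal'] == True, 'description'])
unsecured = ' '.join(df.loc[df['deal'] == False, 'description'])

print('Total Characters for secured deal:', len(secured))
print('Total Characters for unsecured deal:', len(unsecured))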

b) Remove Stop Words from the corpora. (Words like 'also', 'made', 'makes', 'like',
'this', 'even' and 'company' are to be removed)

The stop words were removed using the standard stop-word list imported from NLTK,
together with the custom words listed above, from both corpora.
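A sketch of the stop-word removal, assuming the secured/unsecured corpus strings from the previous step:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# NLTK's English stop words plus the custom words listed in the question.
stop = set(stopwords.words('english'))
stop.update(['also', 'made', 'makes', 'like', 'this', 'even', 'company'])

def clean_tokens(text):
    # lower-case, keep alphabetic tokens only, drop stop words
    return [w for w in nltk.word_tokenize(text.lower())
            if w.isalpha() and w not in stop]

secured_words = clean_tokens(secured)
unsecured_words = clean_tokens(unsecured)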

The output after stop-word removal is shown below:


True Dataset

False Dataset

c) What were the top 3 most frequently occurring words in both corpuses (after removing
stop words)?

True Dataset

False Dataset
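The frequency counts behind these outputs can be reproduced with a sketch like this (assuming the token lists built in the previous step):

from collections import Counter

# Top 3 most frequent words in each corpus after stop-word removal.
print(Counter(secured_words).most_common(3))
print(Counter(unsecured_words).most_common(3))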

d) Plot the Word Cloud for both the corpora.

True Dataset

False Dataset
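A sketch of the word-cloud plotting, assuming the wordcloud package and the token lists built above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

for title, words in [('Secured deal', secured_words),
                     ('Unsecured deal', unsecured_words)]:
    wc = WordCloud(width=800, height=400,
                   background_color='white').generate(' '.join(words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()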

4. Refer to both the word clouds. What do you infer?

The word clouds above are created on the basis of the parameter 'deal': one for pitches
that secured a deal (True) and one for those that did not (False).

Secured-deal pitches centre on three prominent words: product, design, and children. This
implies that entrepreneurs may benefit from focusing on these themes. Moreover, key words
from the word cloud can be used for search engine optimisation (SEO), and the same words
can inform how a product or service is developed and designed.

Unsecured-deal pitches centre on key words like help, device, and service. This suggests
recurring issues around devices, customer support, and service. Companies should
therefore investigate these aspects and improve.

5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices
are less likely to secure a deal based on your analysis?

Word clouds are a quick way to analyse pitch text, and they help firms strategise
improvements to the value of their product or service. In the unsecured-deal corpus, the
key words are help, device, and service, which suggests that device-related pitches are
associated with failing to secure a deal. Based on this analysis, we can conclude that
entrepreneurs who introduced devices were less likely to secure a deal. Further analysis
might help entrepreneurs address this issue.

Machine Learning Project 1 - Jupyter Notebook

In [70]:  import numpy as np


import pandas as pd
import re # this is the regular expression library which helps us manipulate text (strings) fairly easily and intuitively
import nltk # this is the Natural Language Tool Kit which contains a lot of functionalities for text analytics
import matplotlib.pyplot as plt
import string # this is used for string manipulations
import matplotlib
import seaborn as sns

In [71]:  car= pd.read_csv('Cars1.csv')

In [72]:  car.head()

Out[72]: Age Gender Engineer MBA Work Exp Salary Distance license Transport

0 28 Male 0 0 4 14.3 3.2 0 Public Transport

1 23 Female 1 0 4 8.3 3.3 0 Public Transport

2 29 Male 1 0 7 13.4 4.1 0 Public Transport

3 28 Female 1 1 5 13.4 4.5 0 Public Transport

4 27 Male 1 0 4 13.4 4.6 0 Public Transport

Review the data


In [73]:  car.isnull().sum()

Out[73]: Age 0

Gender 0

Engineer 0

MBA 0

Work Exp 0

Salary 0

Distance 0

license 0

Transport 0

dtype: int64

In [74]:  # no null values

In [75]:  car.duplicated().sum()

Out[75]: 0

In [76]:  # no duplicate values

In [77]:  car.shape

Out[77]: (444, 9)

In [78]:  car.columns

Out[78]: Index(['Age', 'Gender', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance',

'license', 'Transport'],

dtype='object')


In [79]:  car.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 444 entries, 0 to 443

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 444 non-null int64

1 Gender 444 non-null object

2 Engineer 444 non-null int64

3 MBA 444 non-null int64

4 Work Exp 444 non-null int64

5 Salary 444 non-null float64

6 Distance 444 non-null float64

7 license 444 non-null int64

8 Transport 444 non-null object

dtypes: float64(2), int64(5), object(2)

memory usage: 31.3+ KB

In [80]:  car.describe()

Out[80]: Age Engineer MBA Work Exp Salary Distance license

count 444.000000 444.000000 444.000000 444.000000 444.000000 444.000000 444.000000

mean 27.747748 0.754505 0.252252 6.299550 16.238739 11.323198 0.234234

std 4.416710 0.430866 0.434795 5.112098 10.453851 3.606149 0.423997

min 18.000000 0.000000 0.000000 0.000000 6.500000 3.200000 0.000000

25% 25.000000 1.000000 0.000000 3.000000 9.800000 8.800000 0.000000

50% 27.000000 1.000000 0.000000 5.000000 13.600000 11.000000 0.000000

75% 30.000000 1.000000 1.000000 8.000000 15.725000 13.425000 0.000000

max 43.000000 1.000000 1.000000 24.000000 57.000000 23.400000 1.000000


In [81]:  car.head()

Out[81]: Age Gender Engineer MBA Work Exp Salary Distance license Transport

0 28 Male 0 0 4 14.3 3.2 0 Public Transport

1 23 Female 1 0 4 8.3 3.3 0 Public Transport

2 29 Male 1 0 7 13.4 4.1 0 Public Transport

3 28 Female 1 1 5 13.4 4.5 0 Public Transport

4 27 Male 1 0 4 13.4 4.6 0 Public Transport


In [82]:  fig, axes = plt.subplots(nrows=4,ncols=2)


fig.set_size_inches(10,18)

a = sns.distplot(car['Age'] , ax=axes[0][0])
a.set_title("Age",fontsize=10)
a = sns.boxplot(car['Age'] , orient = "v" , ax=axes[0][1])
a.set_title(" Age ",fontsize=10)

a = sns.distplot(car['Work Exp'] , ax=axes[1][0])
a.set_title("Work Experience",fontsize=10)
a = sns.boxplot(car['Work Exp'] , orient = "v" , ax=axes[1][1])
a.set_title(" WorkExperience",fontsize=10)

a = sns.distplot(car['Salary'] , ax=axes[2][0])
a.set_title("Salary",fontsize=10)
a = sns.boxplot(car['Salary'] , orient = "v" , ax=axes[2][1])
a.set_title(" Salary",fontsize=10)

a = sns.distplot(car['Distance'] , ax=axes[3][0])
a.set_title("Distance",fontsize=10)
a = sns.boxplot(car['Distance'] , orient = "v" , ax=axes[3][1])
a.set_title(" Distance",fontsize=10)

C:\Users\vivek\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).

warnings.warn(msg, FutureWarning)

(The same FutureWarning, along with related seaborn positional-argument and orientation warnings, repeats for each subplot.)

Outlier check
In [83]:  plt.figure(figsize=(10,8))
car[['Age', 'Work Exp', 'Salary', 'Distance']].boxplot(vert=0)
plt.title('Outlier Check',fontsize=16)
plt.show()


In [84]:  # count plot for the discrete variables

In [85]:  car.columns

Out[85]: Index(['Age', 'Gender', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance',

'license', 'Transport'],

dtype='object')

In [86]:  col= ['Age', 'Gender', 'Engineer', 'MBA',


'license', 'Transport']
for i in col:
sns.countplot(car[i])
plt.show()

C:\Users\vivek\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable


as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other argu
ments without an explicit keyword will result in an error or misinterpretation.

warnings.warn(

In [87]:  # Bivariate analysis



In [88]:  sns.boxplot(x='Age',y='license',data=car)

Out[88]: <AxesSubplot:xlabel='Age', ylabel='license'>


In [89]:  sns.boxplot(x='Salary',hue='license',data=car)

Out[89]: <AxesSubplot:xlabel='Salary'>


In [90]:  ​
sns.countplot(x='Engineer',data=car,hue='Transport')

Out[90]: <AxesSubplot:xlabel='Engineer', ylabel='count'>


In [91]:  sns.countplot(x='Gender',data=car,hue='Transport')

Out[91]: <AxesSubplot:xlabel='Gender', ylabel='count'>


In [92]:  ​
sns.countplot(x='MBA',data=car,hue='Transport')

Out[92]: <AxesSubplot:xlabel='MBA', ylabel='count'>

In [93]:  # Correlation Plot

In [94]:  car.corr()

Out[94]: Age Engineer MBA Work Exp Salary Distance license

Age 1.000000 0.091935 -0.029090 0.932236 0.860673 0.352872 0.452311

Engineer 0.091935 1.000000 0.066218 0.085729 0.086762 0.059316 0.018924

MBA -0.029090 0.066218 1.000000 0.008582 -0.007270 0.036427 -0.027358

Work Exp 0.932236 0.085729 0.008582 1.000000 0.931974 0.372735 0.452867

Salary 0.860673 0.086762 -0.007270 0.931974 1.000000 0.442359 0.508095

Distance 0.352872 0.059316 0.036427 0.372735 0.442359 1.000000 0.290084

license 0.452311 0.018924 -0.027358 0.452867 0.508095 0.290084 1.000000


In [95]:  sns.heatmap(car.corr(),vmax=1,vmin=-1,cmap="YlGnBu",annot=True,mask=np.triu(car.corr(),+1))

Out[95]: <AxesSubplot:>


In [96]:  sns.pairplot(car)

Out[96]: <seaborn.axisgrid.PairGrid at 0x165acde7a60>


In [97]:  for feature in car.columns:


if car[feature].dtype == 'object':
print('\n')
print('feature:', feature)
print(car[feature].value_counts())
print(pd.Categorical(car[feature].unique()))
print(pd.Categorical(car[feature].unique()).codes)

feature: Gender

Male 316

Female 128

Name: Gender, dtype: int64

['Male', 'Female']

Categories (2, object): ['Female', 'Male']

[1 0]

feature: Transport

Public Transport 300

Private Transport 144

Name: Transport, dtype: int64

['Public Transport', 'Private Transport']

Categories (2, object): ['Private Transport', 'Public Transport']

[1 0]

In [98]:  car1=pd.get_dummies(data=car,columns=['Gender','Transport'],drop_first = True)


car1.head()

Out[98]: Age Engineer MBA Work Exp Salary Distance license Gender_Male Transport_Public Transport

0 28 0 0 4 14.3 3.2 0 1 1

1 23 1 0 4 8.3 3.3 0 0 1

2 29 1 0 7 13.4 4.1 0 1 1

3 28 1 1 5 13.4 4.5 0 0 1

4 27 1 0 4 13.4 4.6 0 1 1


In [99]:  car1.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 444 entries, 0 to 443

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 444 non-null int64

1 Engineer 444 non-null int64

2 MBA 444 non-null int64

3 Work Exp 444 non-null int64

4 Salary 444 non-null float64

5 Distance 444 non-null float64

6 license 444 non-null int64

7 Gender_Male 444 non-null uint8

8 Transport_Public Transport 444 non-null uint8

dtypes: float64(2), int64(5), uint8(2)

memory usage: 25.3 KB

Outlier Treatment
In [100]:  def remove_outlier(col):
sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range

In [101]:  car.columns

Out[101]: Index(['Age', 'Gender', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance',

'license', 'Transport'],

dtype='object')


In [102]:  cols=['Age','Work Exp','Salary','Distance']

In [103]:  for column in cols:


lr,ur=remove_outlier(car1[column])
car1[column]=np.where(car1[column]>ur,ur,car1[column])
car1[column]=np.where(car1[column]<lr,lr,car1[column])


In [104]:  # outlier treatment: re-check the boxplots after capping



plt.figure(figsize=(10,8))
car1[['Age', 'Work Exp', 'Salary', 'Distance']].boxplot(vert=0)
plt.title('Outlier Check',fontsize=16)
plt.show()

Scaling the data


In [105]:  num1=['Age','Work Exp','Salary','Distance']

In [106]:  car1[num1] = car[num1].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

In [107]:  car1.head()

Out[107]: Age Engineer MBA Work Exp Salary Distance license Gender_Male Transport_Public Transport

0 0.40 0 0 0.166667 0.154455 0.000000 0 1 1

1 0.20 1 0 0.166667 0.035644 0.004950 0 0 1

2 0.44 1 0 0.291667 0.136634 0.044554 0 1 1

3 0.40 1 1 0.208333 0.136634 0.064356 0 0 1

4 0.36 1 0 0.166667 0.136634 0.069307 0 1 1

Train-Test Split
In [108]:  car1.columns

Out[108]: Index(['Age', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance', 'license',

'Gender_Male', 'Transport_Public Transport'],

dtype='object')

In [109]:  # Copy all the predictor variables into X dataframe


X = car1.drop('Transport_Public Transport', axis=1)

# Copy target into the y dataframe.
y = car1['Transport_Public Transport']


In [110]:  X.head()

Out[110]: Age Engineer MBA Work Exp Salary Distance license Gender_Male

0 0.40 0 0 0.166667 0.154455 0.000000 0 1

1 0.20 1 0 0.166667 0.035644 0.004950 0 0

2 0.44 1 0 0.291667 0.136634 0.044554 0 1

3 0.40 1 1 0.208333 0.136634 0.064356 0 0

4 0.36 1 0 0.166667 0.136634 0.069307 0 1

In [111]:  y.head()

Out[111]: 0 1

1 1

2 1

3 1

4 1

Name: Transport_Public Transport, dtype: uint8

In [112]:  from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=car1['Transport_Public Transport'])

In [113]:  X_train.count()

Out[113]: Age 310

Engineer 310

MBA 310

Work Exp 310

Salary 310

Distance 310

license 310

Gender_Male 310

dtype: int64


In [114]:  X_test.count()

Out[114]: Age 134

Engineer 134

MBA 134

Work Exp 134

Salary 134

Distance 134

license 134

Gender_Male 134

dtype: int64

In [115]:  y_train.count()

Out[115]: 310

In [181]:  y_test.count()

Out[181]: Transport_Public Transport 134

dtype: int64

In [182]:  car1.head()

Out[182]: Age Engineer MBA Work Exp Salary Distance license Gender_Male Transport_Public Transport

0 0.40 0 0 0.166667 0.154455 0.000000 0 1 1

1 0.20 1 0 0.166667 0.035644 0.004950 0 0 1

2 0.44 1 0 0.291667 0.136634 0.044554 0 1 1

3 0.40 1 1 0.208333 0.136634 0.064356 0 0 1

4 0.36 1 0 0.166667 0.136634 0.069307 0 1 1

Logistic Regression


In [117]:  from sklearn.linear_model import LogisticRegression


model = LogisticRegression()
model.fit(X_train, y_train)

Out[117]: LogisticRegression()

In [118]:  ytrain_predict = model.predict(X_train)


ytest_predict = model.predict(X_test)

In [119]:  ytest_predict_prob=model.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()

Out[119]: 0 1

0 0.097419 0.902581

1 0.071571 0.928429

2 0.113186 0.886814

3 0.936392 0.063608

4 0.838924 0.161076

In [120]:  # Train Data


model.score(X_train, y_train)

Out[120]: 0.8129032258064516

In [121]:  # Test Data


model.score(X_test, y_test)

Out[121]: 0.8208955223880597

In [122]:  from sklearn import metrics


from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,plot_confusion_matrix


In [123]:  ## Train Data


# predict probabilities
probs = model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);

AUC: 0.831


In [124]:  ## Test Data


# predict probabilities
probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
test_auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % test_auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);

AUC: 0.758


In [125]:  confusion_matrix(y_train, ytrain_predict)

Out[125]: array([[ 53, 48],

[ 10, 199]], dtype=int64)

In [126]:  plot_confusion_matrix(model,X_train,y_train);

In [127]:  print(classification_report(y_train, ytrain_predict))

precision recall f1-score support

0 0.84 0.52 0.65 101

1 0.81 0.95 0.87 209

accuracy 0.81 310

macro avg 0.82 0.74 0.76 310

weighted avg 0.82 0.81 0.80 310


In [178]:  for idx, col_name in enumerate(X_train.columns):


print("The coefficient for {} is {}".format(col_name, model.coef_[0][idx]))

The coefficient for Age is 0.15799321368519664

The coefficient for Engineer is -0.02506398499055198

The coefficient for MBA is 0.24705417288664455

The coefficient for Work Exp is -1.029916628913061

The coefficient for Salary is -1.4885813126037621

The coefficient for Distance is -2.7276972880549057

The coefficient for license is -1.7114086950632315

The coefficient for Gender_Male is 1.235308617326227

In [ ]:  ​

Test Data
In [ ]:  confusion_matrix(y_test, ytest_predict)

In [ ]:  plot_confusion_matrix(model,X_test,y_test);

In [129]:  print(classification_report(y_test, ytest_predict))

precision recall f1-score support

0 0.88 0.51 0.65 43

1 0.81 0.97 0.88 91

accuracy 0.82 134

macro avg 0.84 0.74 0.76 134

weighted avg 0.83 0.82 0.81 134

In [ ]:  ​


Linear Discriminant Analysis

In [150]:  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [151]:  LDA_model = LinearDiscriminantAnalysis()


LDA_model.fit(X_train, y_train)

C:\Users\vivek\anaconda3\lib\site-packages\sklearn\utils\validation.py:63: DataConversionWarning: A column-vector y


was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

return f(*args, **kwargs)

Out[151]: LinearDiscriminantAnalysis()

Train Data
In [152]:  y_train_predict = LDA_model.predict(X_train)
model_score = LDA_model.score(X_train, y_train)

In [153]:  model_score

Out[153]: 0.8064516129032258

In [154]:  metrics.confusion_matrix(y_train, y_train_predict)

Out[154]: array([[ 53, 48],

[ 12, 197]], dtype=int64)


In [155]:  print(metrics.classification_report(y_train, y_train_predict))

precision recall f1-score support

0 0.82 0.52 0.64 101

1 0.80 0.94 0.87 209

accuracy 0.81 310

macro avg 0.81 0.73 0.75 310

weighted avg 0.81 0.81 0.79 310

Test Data
In [156]:  y_test_predict = LDA_model.predict(X_test)
model_score = LDA_model.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

0.8134328358208955

[[22 21]

[ 4 87]]

precision recall f1-score support

0 0.85 0.51 0.64 43

1 0.81 0.96 0.87 91

accuracy 0.81 134

macro avg 0.83 0.73 0.76 134

weighted avg 0.82 0.81 0.80 134

In [157]:  # Training Data Probability Prediction


pred_prob_train = LDA_model.predict_proba(X_train)


In [158]:  pred_prob_test = LDA_model.predict_proba(X_test)


In [159]:  pred_prob_train[:,1]

Out[159]: array([0.92022773, 0.78319804, 0.83586176, 0.01947157, 0.72095693,

0.3977989 , 0.94497793, 0.05207825, 0.5164211 , 0.08593226,

0.78142627, 0.51051003, 0.97092793, 0.86307424, 0.23020679,

0.95504789, 0.89668632, 0.67317881, 0.32734465, 0.9582693 ,

0.8316606 , 0.96756165, 0.88394446, 0.74137384, 0.75047192,

0.18027122, 0.66927472, 0.68240562, 0.26733835, 0.86043091,

0.68142995, 0.72349006, 0.65622649, 0.96951978, 0.96372393,

0.90946366, 0.81372452, 0.55589731, 0.95717005, 0.97031422,

0.74344949, 0.12517994, 0.19357446, 0.96857098, 0.91662527,

0.91085242, 0.96937786, 0.88904467, 0.05095959, 0.00652125,

0.95090273, 0.96620461, 0.86546601, 0.84085795, 0.89058096,

0.96633642, 0.94006771, 0.91012632, 0.26851716, 0.92211785,

0.95374515, 0.74998672, 0.89925013, 0.93145652, 0.91562067,

0.90007726, 0.01233538, 0.97069458, 0.75988497, 0.87514135,

0.06415813, 0.61147188, 0.79124164, 0.89269115, 0.58740298,

0.49053771, 0.87651487, 0.79649477, 0.72686826, 0.10931492,

0.91331073, 0.57578338, 0.83468353, 0.84152792, 0.92651834,

0.54515019, 0.92055898, 0.92478685, 0.97757567, 0.97135539,

0.02811748, 0.36049638, 0.92672214, 0.96949821, 0.87194518,

0.89178535, 0.86934081, 0.95316131, 0.88461367, 0.68748767,

0.66281191, 0.49879801, 0.24069523, 0.96763929, 0.89254032,

0.0399595 , 0.02459989, 0.98528873, 0.96453147, 0.97599778,

0.95491977, 0.13091913, 0.73078518, 0.65618063, 0.87498428,

0.81108847, 0.06718685, 0.9557436 , 0.50845022, 0.95843087,

0.69819292, 0.8513452 , 0.85540385, 0.5086539 , 0.3538545 ,

0.7683546 , 0.70412214, 0.06633841, 0.98617312, 0.08590929,

0.96507778, 0.53726827, 0.93252758, 0.24097309, 0.97923159,

0.33130927, 0.27782057, 0.28638754, 0.88888469, 0.91497759,

0.8740223 , 0.98238947, 0.99229348, 0.59147129, 0.01389672,

0.57645767, 0.8786097 , 0.89869402, 0.60093145, 0.3801811 ,

0.95498372, 0.78481908, 0.95841918, 0.89688922, 0.61764512,

0.96696711, 0.8157581 , 0.54676658, 0.85989152, 0.00578517,

0.85295832, 0.88223761, 0.96524153, 0.18279891, 0.01315408,

0.51926337, 0.49028633, 0.82880529, 0.72612899, 0.69183132,

0.86175131, 0.41156682, 0.66779148, 0.97259242, 0.8494421 ,

0.72860993, 0.9504837 , 0.90274437, 0.95069469, 0.85178666,

0.96411377, 0.00700433, 0.82270175, 0.92880233, 0.499093 ,

0.10058689, 0.64769149, 0.86873365, 0.94621306, 0.96041686,

0.01689526, 0.76476083, 0.65228768, 0.93509569, 0.64918632,


0.7341507 , 0.66984268, 0.907108 , 0.9078367 , 0.73788556,

0.34766282, 0.76685175, 0.93821193, 0.97049455, 0.04202097,

0.2511822 , 0.37508817, 0.30768868, 0.58214141, 0.00576029,

0.79409205, 0.24688856, 0.9556978 , 0.71520083, 0.87497974,

0.87299859, 0.98642788, 0.59262892, 0.49961783, 0.98474423,

0.56379881, 0.89462564, 0.88125149, 0.52234354, 0.00599964,

0.31090895, 0.89872373, 0.55127863, 0.61611158, 0.78103005,

0.95086441, 0.42306711, 0.97977586, 0.89551972, 0.47755264,

0.82720109, 0.95960096, 0.96388022, 0.56383339, 0.02606328,

0.97289623, 0.9598491 , 0.96705606, 0.79736208, 0.91394789,

0.98293264, 0.47707115, 0.03550854, 0.7794153 , 0.8961371 ,

0.95260056, 0.9461204 , 0.83908975, 0.86077485, 0.78622252,

0.48573911, 0.27630253, 0.82764714, 0.74052652, 0.56523353,

0.79758687, 0.95806269, 0.46017894, 0.98157077, 0.84136743,

0.82850144, 0.82135118, 0.28467853, 0.90714507, 0.82802828,

0.9834561 , 0.84249487, 0.83371691, 0.88272533, 0.54849099,

0.66606922, 0.94900574, 0.86848047, 0.6347598 , 0.65371619,

0.94662682, 0.9654427 , 0.94691404, 0.84271978, 0.84058662,

0.98342227, 0.9648222 , 0.72642151, 0.63081983, 0.75410982,

0.7072154 , 0.72783187, 0.84923728, 0.68462532, 0.84682323,

0.8996498 , 0.18301217, 0.8578388 , 0.80221857, 0.96359776,

0.87149685, 0.17187168, 0.82076429, 0.80977789, 0.76929842,

0.28090209, 0.9022878 , 0.68302325, 0.91519571, 0.9758166 ])

Train Data-AUC ROC Curve


In [160]:  # AUC and ROC for the training data



# calculate AUC
auc = metrics.roc_auc_score(y_train,pred_prob_train[:,1])
print('AUC for the Training Data: %.3f' % auc)

# calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(y_train,pred_prob_train[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label = 'Training Data')

AUC for the Training Data: 0.842

Out[160]: [<matplotlib.lines.Line2D at 0x165afc18310>]


In [161]:  plot_confusion_matrix(LDA_model,X_train,y_train);

Test Data


In [162]:  # AUC and ROC for the test data



# calculate AUC
auc = metrics.roc_auc_score(y_test,pred_prob_test[:,1])
print('AUC for the Test Data: %.3f' % auc)

# calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(y_test,pred_prob_test[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label='Test Data')
# show the plot
plt.legend(loc='best')
plt.show()

AUC for the Test Data: 0.786


In [163]:  plot_confusion_matrix(LDA_model,X_test,y_test);

In [179]:  for idx, col_name in enumerate(X_train.columns):


print("The coefficient for {} is {}".format(col_name, LDA_model.coef_[0][idx]))

The coefficient for Age is 5.231123671827778

The coefficient for Engineer is -0.08333089800227284

The coefficient for MBA is 0.34206272621634415

The coefficient for Work Exp is -4.203504416622364

The coefficient for Salary is -3.0350431298186664

The coefficient for Distance is -5.082460286571646

The coefficient for license is -2.36386133488761

The coefficient for Gender_Male is 1.6138695146493


Naive Bayes Model


In [164]:  # Copy all the predictor variables into X dataframe
X = car1.drop('Transport_Public Transport', axis=1)

# Copy target into the y dataframe.
y = car1[['Transport_Public Transport']]

In [165]:  from sklearn.naive_bayes import GaussianNB


from sklearn import metrics

In [166]:  # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=car1['Transport_Public Transport'])

In [167]:  NB_model = GaussianNB()


NB_model.fit(X_train, y_train)

C:\Users\vivek\anaconda3\lib\site-packages\sklearn\utils\validation.py:63: DataConversionWarning: A column-vector y


was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

return f(*args, **kwargs)

Out[167]: GaussianNB()

Train Data


In [168]:  y_train_predict = NB_model.predict(X_train)


model_score = NB_model.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

0.8064516129032258

[[ 50 51]

[ 9 200]]

precision recall f1-score support

0 0.85 0.50 0.62 101

1 0.80 0.96 0.87 209

accuracy 0.81 310

macro avg 0.82 0.73 0.75 310

weighted avg 0.81 0.81 0.79 310


In [169]:  # predict probabilities


probs = NB_model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);

AUC: 0.804

Confusion Matrix


In [170]:  plot_confusion_matrix(NB_model,X_train,y_train);


In [171]:  plot_confusion_matrix(NB_model,X_test,y_test);

Test Data


In [172]:  # predict probabilities


probs = NB_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);

AUC: 0.766


In [173]:  x=pd.DataFrame(NB_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()

---------------------------------------------------------------------------

AttributeError Traceback (most recent call last)

<ipython-input-173-4b922253d6b4> in <module>

----> 1 x=pd.DataFrame(NB_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)

2 plt.figure(figsize=(12,7))

3 sns.barplot(x[0],x.index,palette='rainbow')

4 plt.ylabel('Feature Name')

5 plt.xlabel('Feature Importance in %')

AttributeError: 'GaussianNB' object has no attribute 'feature_importances_'
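GaussianNB exposes no feature_importances_ attribute, hence the error above. If a feature ranking is still wanted, a model-agnostic sketch using scikit-learn's permutation_importance could be used instead (an alternative, not part of the original notebook):

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in model score.
result = permutation_importance(NB_model, X_test, y_test,
                                n_repeats=10, random_state=1)
for name, score in sorted(zip(X_test.columns, result.importances_mean),
                          key=lambda t: -t[1]):
    print(name, round(score, 4))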

KNN Model

In [ ]:  from scipy.stats import zscore

In [ ]:  plot_confusion_matrix(KNN_model,X_test,y_test);

In [ ]:  plot_confusion_matrix(KNN_model,X_train,y_train);

In [ ]:  X[['Age', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance', 'license']]=X[['Age', 'Engineer', 'MBA', 'Work Exp', '

In [ ]:  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=car1['Transport_Pu


In [ ]:  from sklearn.neighbors import KNeighborsClassifier

In [ ]:  KNN_model=KNeighborsClassifier()
KNN_model.fit(X_train,y_train)

In [ ]:  ## Performance Matrix on train data set


y_train_predict = KNN_model.predict(X_train)
model_score = KNN_model.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

In [ ]:  ## Performance Matrix on test data set


y_test_predict = KNN_model.predict(X_test)
model_score = KNN_model.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

In [ ]:  # predict probabilities


probs = KNN_model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);


In [ ]:  # predict probabilities


probs = KNN_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);

In [ ]:  x=pd.DataFrame(KNN_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()

RANDOM FOREST
In [ ]:  from sklearn import metrics
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,plot_confusion_matrix

In [ ]:  from sklearn.ensemble import RandomForestClassifier



RF_model=RandomForestClassifier(n_estimators=100,random_state=1)
RF_model.fit(X_train, y_train)


In [ ]:  ## Performance Matrix on train data set


y_train_predict = RF_model.predict(X_train)
model_score = RF_model.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

In [ ]:  ## Performance Matrix on test data set


y_test_predict = RF_model.predict(X_test)
model_score = RF_model.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))

In [ ]:  plot_confusion_matrix(RF_model,X_train,y_train);

In [ ]:  plot_confusion_matrix(RF_model,X_test,y_test);

In [ ]:  # predict probabilities


probs = RF_model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);


In [ ]:  # predict probabilities


probs = RF_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, train_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);


In [175]:  x=pd.DataFrame(RF_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x=x[0], y=x.index, palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()
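Since the brief asks for the models to be tuned wherever required, the random forest hyperparameters can be searched the same way as above. A minimal sketch; the grid values are assumptions, not results from the notebook:

In [ ]:  from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300],      # assumed grid values
              'max_depth': [5, 7, 10],
              'min_samples_leaf': [5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)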


Decision Tree
In [ ]:  from sklearn import tree


DT_model= tree.DecisionTreeClassifier()
DT_model.fit(X_train, y_train)

In [ ]:  ## Performance Matrix on train data set


y_train_predict = DT_model.predict(X_train)
model_score = DT_model.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

In [ ]:  plot_confusion_matrix(DT_model,X_test,y_test);

In [ ]:  plot_confusion_matrix(DT_model,X_train,y_train);


In [ ]:  ### TEST DATA


# predict probabilities
probs = DT_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(test_fpr, test_tpr);

In [ ]:  # predict probabilities


probs = DT_model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(train_fpr, train_tpr);

In [ ]:  x=pd.DataFrame(DT_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x=x[0], y=x.index, palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()


In [ ]:  ## Performance Matrix on test data set


y_test_predict = DT_model.predict(X_test)
model_score = DT_model.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
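An unpruned CART tree usually fits the training data perfectly (train AUC close to 1.0), so the gap between the train and test scores above is worth checking; limiting depth and leaf size is a simple way to regularise it. A sketch with assumed pruning values:

In [ ]:  # assumed pruning parameters; in practice tune them via cross-validation
DT_pruned = tree.DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=1)
DT_pruned.fit(X_train, y_train)
print('Train accuracy:', DT_pruned.score(X_train, y_train))
print('Test accuracy :', DT_pruned.score(X_test, y_test))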

In [ ]:  x=pd.DataFrame(DT_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()

Gradient Boosting
In [ ]:  from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(random_state=1)
gbcl = gbcl.fit(X_train, y_train)

In [ ]:  ## Performance Matrix on train data set


y_train_predict = gbcl.predict(X_train)
model_score = gbcl.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))

In [ ]:  ## Performance Matrix on test data set


y_test_predict = gbcl.predict(X_test)
model_score = gbcl.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
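AUC and ROC curves are reported for the other models; for a like-for-like comparison the same pattern applies to the gradient boosting model. A sketch reusing the imports already loaded above:

In [ ]:  # predict probabilities on the test set and plot the ROC curve, as done for the other models
probs = gbcl.predict_proba(X_test)[:, 1]
print('AUC: %.3f' % roc_auc_score(y_test, probs))
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(test_fpr, test_tpr);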


In [180]:  x=pd.DataFrame(gbcl.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x=x[0], y=x.index, palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()


PROBLEM 2 : TEXT MINING

In [1]:  ## Importing the necessary libraries along with the standard import

import numpy as np
import pandas as pd
import re # this is the regular expression library which helps us manipulate text (strings) fairly easily and intuiti
import nltk # this is the Natural Language Tool Kit which contains a lot of functionalities for text analytics
import matplotlib.pyplot as plt
import string # this is used for string manipulations
import matplotlib

In [2]:  ## Let us check the version of the various libraries


print('Numpy version:',np.__version__)
print('Pandas version:',pd.__version__)
print('Regular Expression version:',re.__version__)
print('Natural Language Tool Kit version:',nltk.__version__)
print('Matplotlib version:',matplotlib.__version__)

Numpy version: 1.20.2

Pandas version: 1.2.4

Regular Expression version: 2.2.1

Natural Language Tool Kit version: 3.6.2

Matplotlib version: 3.3.4


In [80]:  db=pd.read_csv('Shark Tank Companies.csv')


db.head()

Out[80]: (wide display; trailing columns cut off at the page edge)

   deal  | description                                        | episode | category            | entrepreneurs               | location     | website                             | askedFor | exchangeForStake | valuation
0  False | Bluetooth device implant for your ear.            | 1       | Novelties           | Darrin Johnson              | St. Paul, MN | NaN                                 | 1000000  | 15               | 6666667
1  True  | Retail and wholesale pie factory with two reta... | 1       | Specialty Food      | Tod Wilson                  | Somerset, NJ | http://whybake.com/                 | 460000   | 10               | 4600000
2  True  | Ava the Elephant is a godsend for frazzled par... | 1       | Baby and Child Care | Tiffany Krumins             | Atlanta, GA  | http://www.avatheelephant.com/      | 50000    | 15               | 333333
3  False | Organizing, packing, and moving services deliv... | 1       | Consumer Services   | Nick Friedman, Omar Soliman | Tampa, FL    | http://collegehunkshaulingjunk.com/ | 250000   | 25               | 1000000
4  False | Interactive media centers for healthcare waiti... | 1       | Consumer Services   | Kevin Flannery              | Cary, NC     | http://www.wispots.com/             | 1200000  | 10               | 12000000


In [81]:  db.tail()

Out[81]: (wide display; trailing columns cut off at the page edge)

     deal  | description                                        | episode | category           | entrepreneurs                                     | location         | website                        | askedFor | exchangeForStake | valuation
490  True  | Zoom Interiors is a virtual service for interi... | 28      | Online Services    | Beatrice Fischel-Bock, Madeine Fraser & Lizzie... | Philadelphia, PA | https://zoominteriors.com       | 100000   | 20               | 50000
491  True  | Spikeball started out as a casual outdoors gam... | 29      | Toys and Games     | Chris Ruder                                       | Chicago, IL      | http://spikeball.com           | 500000   | 10               | 500000
492  True  | Shark Wheel is out to literally reinvent the w... | 29      | Outdoor Recreation | David Patrick and Zack Fleishman                  | Lake Forest, CA  | http://www.sharkwheel.com      | 100000   | 5                | 200000
493  False | Adriana Montano wants to open the first Cat Ca... | 29      | Entertainment      | Adriana Montano                                   | Boca Raton, FL   | http://gatocafeflorida.com     | 100000   | 20               | 50000
494  True  | Sway Motorsports makes a three-wheeled, all-el... | 29      | Automotive         | Joe Wilcox                                        | Palo Alto, CA    | http://www.swaymotorsports.com | 300000   | 10               | 300000


In [4]:  db.columns

Out[4]: Index(['deal', 'description', 'episode', 'category', 'entrepreneurs',

'location', 'website', 'askedFor', 'exchangeForStake', 'valuation',

'season', 'shark1', 'shark2', 'shark3', 'shark4', 'shark5', 'title',

'episode-season', 'Multiple Entreprenuers'],

dtype='object')

In [5]:  sk=db.iloc[:,0:2]

In [6]:  sk.head()

Out[6]:
deal description

0 False Bluetooth device implant for your ear.

1 True Retail and wholesale pie factory with two reta...

2 True Ava the Elephant is a godsend for frazzled par...

3 False Organizing, packing, and moving services deliv...

4 False Interactive media centers for healthcare waiti...

In [7]:  sk.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 495 entries, 0 to 494

Data columns (total 2 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 deal 495 non-null bool

1 description 495 non-null object

dtypes: bool(1), object(1)

memory usage: 4.5+ KB


In [8]:  sk.duplicated().sum()

Out[8]: 2

In [9]:  sk.drop_duplicates(inplace=True)

In [10]:  sk.duplicated().sum()

Out[10]: 0

Creating two separate DataFrames


In [11]:  sk.columns

Out[11]: Index(['deal', 'description'], dtype='object')

In [12]:  # True Dataset

In [13]:  skt=sk[sk['deal']==True].copy()   # .copy() avoids the SettingWithCopyWarning raised by later column assignments

In [14]:  skt.head()

Out[14]:
deal description

1 True Retail and wholesale pie factory with two reta...

2 True Ava the Elephant is a godsend for frazzled par...

5 True One of the first entrepreneurs to pitch on Sha...

9 True An educational record label and publishing hou...

10 True A battery-operated cooking device that siphons...


In [15]:  skt.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 250 entries, 1 to 494

Data columns (total 2 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 deal 250 non-null bool

1 description 250 non-null object

dtypes: bool(1), object(1)

memory usage: 4.2+ KB

In [16]:  # False Dataset

In [17]:  skf=sk[sk['deal']==False].copy()   # .copy() avoids the SettingWithCopyWarning raised by later column assignments

In [18]:  skf.head()

Out[18]:
deal description

0 False Bluetooth device implant for your ear.

3 False Organizing, packing, and moving services deliv...

4 False Interactive media centers for healthcare waiti...

6 False A mixed martial arts clothing line looking to ...

7 False Attach Noted is a detachable "arm" that holds ...


In [19]:  skf.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 243 entries, 0 to 493

Data columns (total 2 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 deal 243 non-null bool

1 description 243 non-null object

dtypes: bool(1), object(1)

memory usage: 4.0+ KB

In [20]:  sk.isnull().sum()

Out[20]: deal 0

description 0

dtype: int64

In [21]:  # We are not using sklearn's train_test_split here; the data is simply partitioned on the 'deal' flag, so no shuffling is required.

In [22]:  # False Data Set


In [23]:  skf['description'] = skf['description'].apply(lambda x: " ".join(x.lower() for x in x.split()))


skf


Out[23]:
deal description

0 False bluetooth device implant for your ear.

3 False organizing, packing, and moving services deliv...

4 False interactive media centers for healthcare waiti...

6 False a mixed martial arts clothing line looking to ...

7 False attach noted is a detachable "arm" that holds ...

... ... ...

482 False buck mason makes high-quality men's clothing i...

484 False frameri answers the question, "why aren't your...

485 False the paleo diet bar is a nutrition bar that is ...

488 False sunscreen mist adds another point of access fo...

493 False adriana montano wants to open the first cat ca...

243 rows × 2 columns

In [24]:  # True DataSet


In [25]:  skt['description'] = skt['description'].apply(lambda x: " ".join(x.lower() for x in x.split()))


skt


Out[25]:
deal description

1 True retail and wholesale pie factory with two reta...

2 True ava the elephant is a godsend for frazzled par...

5 True one of the first entrepreneurs to pitch on sha...

9 True an educational record label and publishing hou...

10 True a battery-operated cooking device that siphons...

... ... ...

489 True syndaver labs makes synthetic body parts for u...

490 True zoom interiors is a virtual service for interi...

491 True spikeball started out as a casual outdoors gam...

492 True shark wheel is out to literally reinvent the w...

494 True sway motorsports makes a three-wheeled, all-el...

250 rows × 2 columns

In [26]:  # Calculating the number of characters in both the true and false datasets

In [27]:  # True Dataset


In [28]:  skt['char_count'] = skt['description'].str.len()


skt[['description','char_count']].head()


Out[28]:
description char_count

1 retail and wholesale pie factory with two reta... 73

2 ava the elephant is a godsend for frazzled par... 244

5 one of the first entrepreneurs to pitch on sha... 365

9 an educational record label and publishing hou... 122

10 a battery-operated cooking device that siphons... 117

In [29]:  # False Dataset


In [30]:  skf['char_count'] = skf['description'].str.len()


skf[['description','char_count']].head()


Out[30]:
description char_count

0 bluetooth device implant for your ear. 38

3 organizing, packing, and moving services deliv... 68

4 interactive media centers for healthcare waiti... 112

6 a mixed martial arts clothing line looking to ... 110

7 attach noted is a detachable "arm" that holds ... 91
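The question asks for the number of characters in each corpus; summing the per-description counts gives the corpus totals. A one-line sketch for each corpus:

In [ ]:  # total characters per corpus, summed over the per-description counts computed above
print('Characters in deal=True corpus :', skt['char_count'].sum())
print('Characters in deal=False corpus:', skf['char_count'].sum())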

In [31]:  # Introducing StopWords

In [32]:  import nltk
nltk.download('stopwords')   # the stop word corpus must be downloaded once before it can be loaded

In [33]:  stopwords = nltk.corpus.stopwords.words('english')


In [34]:  stopwords
from ,

'up',

'down',

'in',

'out',

'on',

'off',

'over',

'under',

'again',

'further',

'then',

'once',

'here',

'there',

'when',

'where',

'why',

'how',

'all',

' '
In [35]:  all_Words=[x for x in pd.Series(' '.join(sk['description']).split())]


In [36]:  all_Words

Out[36]: ['Bluetooth',

'device',

'implant',

'for',

'your',

'ear.',

'Retail',

'and',

'wholesale',

'pie',

'factory',

'with',

'two',

'retail',

'locations',

'in',

'New',

'Jersey.',

'Ava',

'th '


In [37]:  nltk.FreqDist(all_Words).most_common(50)

Out[37]: [('and', 714),

('the', 625),

('a', 507),

('to', 505),

('of', 351),

('for', 260),

('that', 256),

('is', 246),

('in', 237),

('with', 184),

('your', 150),

('A', 140),

('The', 133),

('are', 113),

('can', 105),

('on', 98),

('you', 95),

('from', 94),

('their', 92),

('or', 86),

('as', 82),

('it', 70),

('by', 66),

('into', 64),

('an', 62),

('be', 61),

('also', 60),

('made', 57),

('any', 56),

('they', 53),

('have', 47),

('which', 44),

('makes', 43),

('make', 42),

('other', 41),

('just', 41),

('has', 40),

('company', 40),

('at', 39),

localhost:8888/notebooks/Machine Learning/Text Mining Project.ipynb# 14/34


27/03/2022, 22:33 Text Mining Project - Jupyter Notebook

('even', 39),

('like', 38),

('out', 34),

('them', 34),

('up', 33),

('designed', 32),

('An', 32),

('its', 30),

('without', 30),

('more', 29),

('all', 29)]

In [38]:  stopwords = nltk.corpus.stopwords.words('english') +list(string.punctuation)

In [39]:  all_Words_clean = [word for word in all_Words if word not in stopwords]

In [40]:  all_words_freq = nltk.FreqDist(all_Words_clean)

In [41]:  word_features = [item[0] for item in all_words_freq.most_common(2000)]

In [42]:  all_words_freq

Out[42]: FreqDist({'A': 140, 'The': 133, 'also': 60, 'made': 57, 'makes': 43, 'make': 42, 'company': 40, 'even': 39, 'like':
38, 'designed': 32, ...})
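Note that adding string.punctuation to the stop list only removes tokens that are punctuation on their own; tokens such as 'ear.' keep their trailing period. Stripping punctuation from the text itself (as the commented-out str.replace later in this notebook hints) handles this. A sketch for illustration only, without modifying the working data:

In [ ]:  # illustration only: strip punctuation characters before tokenising
cleaned = sk['description'].str.replace(r'[^\w\s]', '', regex=True)
cleaned.head()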

Removal of StopWords
In [43]:  from nltk.corpus import stopwords
stop = stopwords.words('english')


In [44]:  # True Set

In [45]:  skt['description'] = skt['description'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))


skt['description'].head()


Out[45]: 1 retail wholesale pie factory two retail locati...

2 ava elephant godsend frazzled parents young ch...

5 one first entrepreneurs pitch shark tank, susa...

9 educational record label publishing house desi...

10 battery-operated cooking device siphons juice,...

Name: description, dtype: object

In [46]:  True_Words=[x for x in pd.Series(' '.join(skt['description']).split())]


In [47]:  True_Words

Out[47]: ['retail',

'wholesale',

'pie',

'factory',

'two',

'retail',

'locations',

'new',

'jersey.',

'ava',

'elephant',

'godsend',

'frazzled',

'parents',

'young',

'children',

'everywhere.',

'talking',

'medicine',

'di '
In [48]:  nltk.FreqDist(True_Words).most_common(3)

Out[48]: [('also', 42), ('makes', 32), ('made', 32)]

In [49]:  # False Set


In [50]:  skf['description'] = skf['description'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))


skf['description'].head()


Out[50]: 0 bluetooth device implant ear.

3 organizing, packing, moving services delivered...

4 interactive media centers healthcare waiting r...

6 mixed martial arts clothing line looking becom...

7 attach noted detachable "arm" holds post-it no...

Name: description, dtype: object

Common Words Removal


In [51]:  # True Data Set
freqTrue = pd.Series(' '.join(skt['description']).split()).value_counts()[:10]
freqTrue

Out[51]: also 42

made 32

makes 32

like 27

even 27

make 24

company 23

designed 19

easy 17

without 17

dtype: int64


In [52]:  skt['description'] = skt['description'].apply(lambda x: " ".join(x for x in x.split() if x not in freqTrue))


skt['description'].head()


Out[52]: 1 retail wholesale pie factory two retail locati...

2 ava elephant godsend frazzled parents young ch...

5 one first entrepreneurs pitch shark tank, susa...

9 educational record label publishing house get ...

10 battery-operated cooking device siphons juice,...

Name: description, dtype: object

In [53]:  # False Data Set


freqFalse = pd.Series(' '.join(skf['description']).split()).value_counts()[:10]
freqFalse

Out[53]: made 41

also 19

company 19

make 19

use 15

designed 15

system 14

even 14

without 14

water 14

dtype: int64


In [54]:  skf['description'] = skf['description'].apply(lambda x: " ".join(x for x in x.split() if x not in freqFalse))


skf['description'].head()


Out[54]: 0 bluetooth device implant ear.

3 organizing, packing, moving services delivered...

4 interactive media centers healthcare waiting r...

6 mixed martial arts clothing line looking becom...

7 attach noted detachable "arm" holds post-it no...

Name: description, dtype: object

In [55]:  all_True_Words=[x for x in pd.Series(' '.join(skt['description']).split())]


In [56]:  all_True_Words

Out[56]: ['retail',

'wholesale',

'pie',

'factory',

'two',

'retail',

'locations',

'new',

'jersey.',

'ava',

'elephant',

'godsend',

'frazzled',

'parents',

'young',

'children',

'everywhere.',

'talking',

'medicine',

'di '
In [57]:  nltk.FreqDist(all_True_Words).most_common(3)

Out[57]: [('line', 16), ('way', 15), ('new', 13)]

In [58]:  # Fasle Dataset


all_False_Words=[x for x in pd.Series(' '.join(skf['description']).split())]


In [59]:  all_False_Words

Out[59]: ['bluetooth',

'device',

'implant',

'ear.',

'organizing,',

'packing,',

'moving',

'services',

'delivered',

'college',

'women.',

'interactive',

'media',

'centers',

'healthcare',

'waiting',

'rooms',

'offering',

'patients',

' b'
In [60]:  nltk.FreqDist(all_False_Words).most_common(3)

Out[60]: [('product', 12), ('makes', 12), ('line', 11)]

Creating a Word Cloud


In [61]:  ## True Data Set


skt['description'].iloc[25:50] # checking a few of the descriptions at random

Out[61]: 59 move over, legos! qubits new construction toy ...

60 reinvented umbrella hands-free, impossible inv...

64 collar stays: can't live 'em, can't live 'em. ...

65 dance education centers children.

68 subscription toy service.

70 allergy season bringing down? maybe time give ...

73 mod mom furniture stylish, high-quality, made-...

74 flipoutz flexible silicon bracelets hold five ...

76 country-style apparel store sold sports author...

77 children's seat attachable luggage instantly c...

79 rubber band works money clip.

81 light-up decals car windows.

83 magnetic skin decorative finish applied applia...

84 new kind broom built-in scraper.

86 cakes scratch family recipes. ship nationwide ...

88 impact-resistant hydration system attaches sho...

94 rocket, company's flagship product, turn anyth...

95 hy-conn super-fast connector fire hydrants gar...

96 patented citikitty cat toilet training kit use...

98 shoes interchangeable uppers allowing user eas...

99 aldo orta jewelry fashion jewelry royal theme.

100 water container unscrews bottom well top ease ...

101 website people pre-purchase night town fear ge...

104 online service people buy made-to-order cartoo...

108 chord buddy one-of-a-kind learning system guit...

Name: description, dtype: object


In [62]:  ## False Data Set


skf['description'].iloc[25:50] # checking a few of the descriptions at random

Out[62]: 45 protective covering mp3 players similar devices.

46 printer ink refill service. send customer back...

47 chain children's play learning centers.

49 portable golf ball cleaner.

51 ethical eco-friendly apparel brand.

53 face-down tanning massage pillow storage pockets.

54 store kids shopping.

56 mobile novelty ice cream vending business.

57 coffeehouse legal resource center.

61 liquid, organic fertilizer "llama doo".

62 want shake things greeting card practices, del...

63 home-based shoe business sales model similar t...

66 premium wine sold glass individually-sealed pl...

67 food products chef big shake. flagship product...

69 alarm clock wakes user freshly cooked bacon.

71 vurtego high-tech pogo stick super lightweight...

72 rubber bands increase resistance exercise acti...

75 pure ayre odor eliminator anyone owns pets eve...

78 premium beef jerky additives preservatives.

80 aromatic lip gloss help curb appetite.

82 small cases mints, gum contact lenses featurin...

85 accessories casual golfers. flagship product p...

87 designer hospital gowns expecting moms.

89 mobile entertainment running specialized vehic...

90 barf bag worn around neck, small children pron...

Name: description, dtype: object

In [63]:  # Removing symbols and punctuations


# further_clean = Apple_tweets['Tweet'].str.replace('[^\w\s]','')

# Extending the list of stop words (including words like Apple, bitly, dear, please, etc.)
stop_words = list(stopwords.words('english'))
#stop_words.extend(["apple", "http","bit","bitly","bit ly", "dear", "im", "i'm", "please"])


In [64]:  #Removing stop words (extended list as above) from the corpus
# True Dataset
true_corpus = skt['description'].apply(lambda x: ' '.join([z for z in x.split() if z not in stop_words]))
true_corpus

Out[64]: 1 retail wholesale pie factory two retail locati...

2 ava elephant godsend frazzled parents young ch...

5 one first entrepreneurs pitch shark tank, susa...

9 educational record label publishing house get ...

10 battery-operated cooking device siphons juice,...

...

489 syndaver labs synthetic body parts use medical...

490 zoom interiors virtual service interior design...

491 spikeball started casual outdoors game, grown ...

492 shark wheel literally reinvent wheel. innovati...

494 sway motorsports three-wheeled, all-electric, ...

Name: description, Length: 250, dtype: object

In [65]:  # False DataSet


false_corpus = skf['description'].apply(lambda x: ' '.join([z for z in x.split() if z not in stop_words]))
false_corpus

Out[65]: 0 bluetooth device implant ear.

3 organizing, packing, moving services delivered...

4 interactive media centers healthcare waiting r...

6 mixed martial arts clothing line looking becom...

7 attach noted detachable "arm" holds post-it no...

...

482 buck mason makes high-quality men's clothing usa.

484 frameri answers question, "why glasses flexibl...

485 paleo diet bar nutrition bar gluten, soy, dair...

488 sunscreen mist adds another point access sunsc...

493 adriana montano wants open first cat cafe flor...

Name: description, Length: 243, dtype: object

In [66]:  wc_true = ' '.join(true_corpus)

In [67]:  wc_false = ' '.join(false_corpus)

In [68]:  wc_true

Out[68]: 'retail wholesale pie factory two retail locations new jersey. ava elephant godsend frazzled parents young childr
en everywhere. talking medicine dispenser administer medicine little ones turning experience playful providing po
sitive reinforcement. one first entrepreneurs pitch shark tank, susan knapp presented perfect pear, line pear-foc
used gourmet food products. sold across 650 retail stores, perfect pear product portfolio includes jams, jellies,
spreads, tapenades, vinegars, marinades, dressings many others, showcase flavors health benefits pears. education
al record label publishing house get students learning classic works literature. battery-operated cooking device
siphons juice, silicone basting brush injector tip marinades. line books written help children find inner calm. c
overplay slipcover children\'s play yards. much mattress, play yards can\'t laundered, yet babies children spend
lots time them, guess leads to. coverplay rescue! fitting snugly standard size play yards, coverplay offers quick
solution add another layer protection child germ-harboring surface play yard. 95% cotton 5% lycra, coverplay mach
ine-washable maintain. indeed, throwing coverplay slipcover wash much easier trying remove spill that\'s gone dir
ectly onto play yard. bright stylish designs, they\'ll play yard look good new. web-based buys back sells 10% unu
sed gift cards year u.s. online journaling service focused facilitating users\' progress towards achieving mainta
ining emotional well-being. fitness machine series bands varying weights pushups easier. made-to-order energy bar
s whole, natural ingredients. customers choose ingredients go highly personalized bars. sells award-winning barbe
cue spice rub products. stainless steel identifying charms stick food grill. wheat-free soy-based modeling clay,
children wheat allergies. web site allows college students buy sell class notes study guides. kids\' organizers l
ook stuffed animals children. women\'s apparel specially sizes 12 18. healthier line carbonated beverages featuri
ng organic ingredients, vitamins, antioxidants, 85 calories per can. faux golf club looks 7 iron conceals urine r
... (remainder of output truncated)


In [69]:  wc_false

Out[69]: 'bluetooth device implant ear. organizing, packing, moving services delivered college women. interactive media ce
nters healthcare waiting rooms offering patients web access educational information. mixed martial arts clothing
line looking become next big brand active sports / streetwear apparel. attach noted detachable "arm" holds post-i
t notes side laptop screen. safety device seatbelts. prevents driver starting vehicle unless seatbelt buckled. ho
usehold items twist: recycled chopsticks. guitars folding neck, fit backpack overhead compartment airplane. 50 st
ate capitals 50 fun minutes efficient entertaining method learn us geography. set flash cards combines phonetics,
cartoons, associations keep kids\' interest drive long-term learning retention. author ken bradford worked closel
y public private school teachers develop fun satisfying study aide. franchise-model offering professional graffit
i removal. owns trademarks words "coffee," "cappuccino," "java" highly-caffeinated words plush toys. inspirationa
l gifts accessories colorful umbrellas sandals leave messages sand. granola gourmet offers line granola bars diab
etics safely enjoy. unlike granola bars market, granola gourmet\'s bars low glycemic index. low glycemic index me
ans less prone causing spikes blood sugar, good anybody especially damaging diabetics. granola gourmet\'s bars in
gredients naturally low glycemic index, release carbohydrates slowly bloodstream. granola gourmet products tested
gi labs, developed glycemic index concept. ultimate fudge brownie bar glycemic index 23, well threshold considere
d low glycemic. funeral concierge service writes eulogy, officiates funeral service handles post-funeral family g
atherings. line surgical masks fashionable fun. sports bras engineered work woman\'s body based activity enjoys.
pitched "root beer float bottle". underwear minimize odor flatulence, airtight material elastic around legs filte
r back. fitness program fusing dance routines. concept live entertainment amusement attraction times square area
five venues featuring restaurants entertainment. selling socks "pairs" three instead two, aiming preempt problem
... (remainder of output truncated)

Word Cloud

True Data Set


In [70]:  pip install wordcloud

Requirement already satisfied: wordcloud in c:\users\vivek\anaconda3\lib\site-packages (1.8.1)

Requirement already satisfied: pillow in c:\users\vivek\anaconda3\lib\site-packages (from wordcloud) (8.2.0)

Requirement already satisfied: matplotlib in c:\users\vivek\anaconda3\lib\site-packages (from wordcloud) (3.3.4)

Requirement already satisfied: numpy>=1.6.1 in c:\users\vivek\anaconda3\lib\site-packages (from wordcloud) (1.20.2)

Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\vivek\anaconda3\lib\site-packag


es (from matplotlib->wordcloud) (2.4.7)

Requirement already satisfied: cycler>=0.10 in c:\users\vivek\anaconda3\lib\site-packages (from matplotlib->wordclo


ud) (0.10.0)

Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\vivek\anaconda3\lib\site-packages (from matplotlib->wo


rdcloud) (1.3.1)

Requirement already satisfied: python-dateutil>=2.1 in c:\users\vivek\anaconda3\lib\site-packages (from matplotlib-


>wordcloud) (2.8.1)

Requirement already satisfied: six in c:\users\vivek\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib->wo


rdcloud) (1.15.0)

Note: you may need to restart the kernel to use updated packages.


In [71]:  # Word Cloud


from wordcloud import WordCloud
wordcloud = WordCloud(width = 3000, height = 3000,
background_color ='black',
min_font_size = 10, random_state=100).generate(wc_true)

# plot the WordCloud image


plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.xlabel('Word Cloud')
plt.tight_layout(pad = 0)

print("Word Cloud for Shark deal True(after cleaning)!!")
plt.show()

Word Cloud for Shark deal True(after cleaning)!!


In [72]:  # False DataSet


In [74]:  # Word Cloud


from wordcloud import WordCloud
wordcloud = WordCloud(width = 3000, height = 3000,
background_color ='black',
min_font_size = 10, random_state=100).generate(wc_false)

# plot the WordCloud image


plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.xlabel('Word Cloud')
plt.tight_layout(pad = 0)

print("Word Cloud for Shark Deal False (after cleaning)!!")
plt.show()

Word Cloud for Shark Deal False (after cleaning)!!


In [79]:  ## N grams

from nltk.util import ngrams # function for making ngrams
import collections

wc_true
biGrams = ngrams(wc_true,3)
# get the frequency of each bigram in our corpus
biGramsFreq = collections.Counter(biGrams)

# what are the ten most popular ngrams here?
biGramsFreq.most_common(10)

Out[79]: [(('i', 'n', 'g'), 382),

(('n', 'g', ' '), 337),

(('e', 'd', ' '), 290),

(('e', 's', ' '), 255),

((' ', 'c', 'o'), 239),

(('e', 'r', ' '), 218),

(('s', ',', ' '), 198),

(('s', '.', ' '), 182),

((' ', 'p', 'r'), 171),

((' ', 'r', 'e'), 165)]
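Because wc_true is a single string, the tuples above are character-level trigrams rather than word bigrams. For word-level bigrams the corpus must be tokenised first; a minimal sketch:

In [ ]:  # word-level bigrams: split into tokens, then count adjacent pairs
wordBiGrams = collections.Counter(ngrams(wc_true.split(), 2))
wordBiGrams.most_common(10)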
