MACHINE LEARNING BUSINESS REPORT
SUBMITTED BY
VIVEK AJAYAKUMAR
CONTENTS

PROBLEM 1
1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3. Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
   a. Logistic Regression Model
   b. Linear Discriminant Analysis
   c. Decision Tree Classifier – CART model
   d. Naïve Bayes Model
   e. KNN Model
   f. Random Forest Model
   g. Boosting Classifier Model using Gradient boost
4. Which model performs the best?
5. What are your business insights?

PROBLEM 2
1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
2. Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.
3. The following exercise is to be done for both the corpora:
   a) Find the number of characters for both the corpuses.
   b) Remove Stop Words from the corpora. (Words like 'also', 'made', 'makes', 'like', 'this', 'even' and 'company' are to be removed)
   c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?
   d) Plot the Word Cloud for both the corpora.
4. Refer to both the word clouds. What do you infer?
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to secure a deal based on your analysis?
Problem 1: Machine Learning Models
You work for an office transport company. You are in discussions with ABC Consulting for providing transport for their employees. For this purpose, you are tasked with understanding how the employees of ABC Consulting presently prefer to commute between home and office. Based on parameters like age, salary and work experience given in the data set 'Transport.csv', you are required to predict the preferred mode of transport. The project requires you to build several Machine Learning models and compare them so that a model can be finalised.
1.1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations,
outliers, and missing values treatment (if necessary) and check the basic descriptive
statistics of the dataset.
Data from ABC Consulting is loaded for evaluation, and the data dictionary is shown below.
The data has 444 rows and 9 features. The first five rows of the data are shown below.
The dataset has no null values and no duplicates. For the analysis, the 'Transport' variable is considered the dependent variable, i.e., the one to be predicted. The variables 'Age', 'Salary', 'Work Exp' and 'Distance' are continuous, and the other variables are categorical; the categorical variables need to be encoded later for evaluation.
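As a minimal sketch of these checks (assuming the file name 'Transport.csv' from the problem statement):

import pandas as pd

# Load the dataset supplied with the problem statement
car = pd.read_csv('Transport.csv')

print(car.shape)               # expected: (444, 9)
print(car.isnull().sum())      # no null values
print(car.duplicated().sum())  # no duplicate rows
car.info()                     # data types of each feature
print(car.describe())          # basic descriptive statistics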
Data info is shown below:
Data description is shown below:
The dataset after converting the encoded categorical columns to object type (in place of int64) is shown below.
UNIVARIATE ANALYSIS
For the univariate analysis, the continuous variables 'Age', 'Work Exp', 'Salary' and 'Distance' are plotted below.
The count plots of the categorical variables ('Age', 'Gender', 'Engineer', 'MBA', 'License', 'Transport') are shown below.
INFERENCES:
o Most of the continuous variables have a roughly normal distribution; features like Work Experience and Salary are right-skewed.
o Outliers are present in all the continuous variables, and Salary and Work Experience have a lot of outliers.
o From the count plot, most of the employees are in the age group of 24 to 30.
o Male employees dominate compared to female staff, and most of the staff are Engineers rather than MBA holders.
o Most of the staff do not hold a valid driving license, and they prefer to use Public Transport.
BIVARIATE ANALYSIS
Bivariate analysis is done by plotting the dependent variable, Transport, against the independent variables 'Engineer', 'Gender' and 'MBA', as shown below.
INFERENCES:
o From all plots, it is evident that staff prefer to use public transport instead of private transport.
o More than two hundred engineers prefer to use Public Transport, roughly twice the number who use Private Transport. The same pattern is repeated for non-MBA staff.
o From the gender data, it is clear that males prefer Public Transport, with about half as many male staff using Private Transport. Female staff also prefer Public Transport over Private Transport.
HEATMAP:
PAIR PLOT
Outlier Treatment
Before Treatment After Treatment
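A minimal sketch of the IQR-based treatment, mirroring the remove_outlier helper in the appendix and assuming car1 is the encoded data frame; whether values were capped or dropped is not stated in the report, so capping (suggested by the before/after boxplots) is assumed:

import numpy as np

def remove_outlier(col):
    # IQR bounds: values beyond 1.5 * IQR from the quartiles are treated as outliers
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    return Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

for feature in ['Age', 'Work Exp', 'Salary', 'Distance']:
    low, high = remove_outlier(car1[feature])
    car1[feature] = car1[feature].clip(lower=low, upper=high)  # cap instead of dropping rows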
INFERENCES:
o From both plots, 'Age' is strongly correlated with 'Work Exp' and 'Salary'; likewise, 'Work Exp' is strongly correlated with 'Salary'.
o 'MBA' and 'Salary' are negatively correlated.
o Apart from that, most of the variables do not show any strong correlation, and those pairs can be neglected.
1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
Scaling is not mandatory for linear models like Linear Regression, Logistic Regression and LDA, or for tree-based models such as Random Forest. At the same time, scaling is a must for distance-based models such as KNN and SVM, because features with larger magnitudes would otherwise dominate the distance computation and receive undue importance.
Scaling should be done only after the train-test split: the scaler is fitted on the train set alone, under the assumption that the test set is unseen during the training phase, and is then applied to the test set.
In this dataset, the data is split 70:30, and the Min-Max scaler is used for scaling. The train set has 310 rows and the test set has 134 rows.
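A sketch of this step, assuming car1 is the encoded data frame and 'Transport_Public Transport' is the encoded target (as in the appendix); the Min-Max scaler is fitted on the training set only:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = car1.drop('Transport_Public Transport', axis=1)
y = car1['Transport_Public Transport']

# 70:30 split; stratify keeps the class balance the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit the scaler on the train set only, then apply it to the unseen test set
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (310, 8) and (134, 8)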
1.3 Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance.
Modelling of the Dataset
Logistic Regression
Confusion Matrix
ROC-AUC Scores
Bar plot of feature coefficient values (Logistic Regression):

Feature       Coefficient
AGE            0.1579932
ENGINEER      -0.025064
MBA            0.2470542
Work Exp      -1.029917
Salary        -1.488581
Distance      -2.727697
License       -1.711409
Gender_Male    1.2353086
Inferences:
o The Train and Test accuracy & ROC-AUC scores are somewhat close, which implies that the model does not suffer from overfitting.
o Order of importance by coefficient magnitude: 'Distance' (negative) > 'License' (negative) > 'Salary' (negative) > 'Gender_Male' (positive).
o Variables like 'Engineer', 'MBA' and 'Age' have the least importance.
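A hedged sketch of the workflow behind these figures, reusing the split from section 1.2; the exact solver and penalty settings used in the project are not shown in the report, so sklearn defaults are assumed:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

for name, Xs, ys in [('Train', X_train, y_train), ('Test', X_test, y_test)]:
    prob = model.predict_proba(Xs)[:, 1]
    print(name, 'accuracy:', model.score(Xs, ys))
    print(name, 'ROC-AUC :', roc_auc_score(ys, prob))
    print(confusion_matrix(ys, model.predict(Xs)))

# The coefficients drive the bar plot above
print(dict(zip(X_train.columns, model.coef_[0])))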
Linear Discriminant Analysis
Confusion Matrix
ROC-AUC Scores
Bar plot of feature coefficient values (LDA Model):

Feature       Coefficient
AGE            5.2311237
ENGINEER      -0.083331
MBA            0.3420627
Work Exp      -4.203504
Salary        -3.035043
Distance      -5.08246
License       -2.363861
Gender_Male    1.6138695
Inferences:
o The ROC-AUC scores of the Train and Test data are 0.842 and 0.786, which implies that the model does not suffer from overfitting.
o From the confusion matrix, the model predicts 109 of the 134 test values correctly and gets 25 wrong, which implies that the model works well.
o Order of importance: Age (positive influence) > Distance (negative influence) > Work Exp (negative influence) > Salary (negative influence).
o Variables like 'Engineer' and 'Gender' have a low influence on the model.
Decision Tree Classifier – CART Model
Confusion Matrix
ROC-AUC Scores
INFERENCES:
o The ROC-AUC score on the test data is considerably lower than on the train data, which implies that the model suffers from overfitting.
o From the confusion matrix, the model fails to predict the transport mode of the staff reliably.
o It is not a suitable model for prediction.
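The appendix fits an unrestricted DecisionTreeClassifier; as a hedged sketch, one standard way to tune away this overfitting is to limit the tree depth (max_depth=4 here is purely illustrative, not the project's actual setting):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# An unrestricted tree memorises the training data; a depth limit regularises it
DT_model = DecisionTreeClassifier(max_depth=4, random_state=1)
DT_model.fit(X_train, y_train)

print('Train AUC:', roc_auc_score(y_train, DT_model.predict_proba(X_train)[:, 1]))
print('Test AUC :', roc_auc_score(y_test, DT_model.predict_proba(X_test)[:, 1]))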
Naïve Bayes Model
Confusion Matrix
ROC-AUC Scores
INFERENCE:
o The ROC-AUC scores for the train data and test data are 0.804 and 0.766, which implies that the model does not suffer from overfitting. However, both scores are at or below 0.80, so the model can be categorised as an underfit model.
o The f1-scores in both the train and test classification reports imply that the model is not fit for prediction.
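A minimal sketch of the Naïve Bayes fit behind these scores (GaussianNB has no hyperparameters to tune in its basic form):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

NB_model = GaussianNB()
NB_model.fit(X_train, y_train)

print('Train AUC:', roc_auc_score(y_train, NB_model.predict_proba(X_train)[:, 1]))  # ~0.804
print('Test AUC :', roc_auc_score(y_test, NB_model.predict_proba(X_test)[:, 1]))    # ~0.766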
KNN MODEL
Results of the Model
Train Data Test Data
Classification Report
Confusion Matrix
ROC-AUC Scores
INFERENCE:
o There is no significant difference between the ROC-AUC scores of the train and test data, so the model does not suffer from overfitting.
o The confusion matrix indicates that most of the values are predicted accurately.
o The F1 scores of the train and test data indicate that the model is fit for prediction.
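Since KNN is distance based, the sketch below fits it on the Min-Max scaled features from section 1.2 (the default k=5 is assumed; k can be tuned via cross-validation):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

KNN_model = KNeighborsClassifier()  # default n_neighbors=5
KNN_model.fit(X_train_scaled, y_train)

print('Train AUC:', roc_auc_score(y_train, KNN_model.predict_proba(X_train_scaled)[:, 1]))
print('Test AUC :', roc_auc_score(y_test, KNN_model.predict_proba(X_test_scaled)[:, 1]))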
RANDOM FOREST MODEL
Results of the Model
Train Data Test Data
Classification Report
Confusion Matrix
ROC-AUC Scores
Feature Importance Plot
INFERENCE:
o There is a significant difference between the train and test ROC-AUC scores, so the model is overfitted.
o The classification report states that the model predicts the output significantly well.
o The F1-scores of the train and test data indicate that the model is good for prediction.
o 'Salary', 'Distance' and 'Age' are the first three important predictors.
o 'MBA' and 'Engineer' are the two least important predictors.
Gradient Boosting Classifier Model
Confusion Matrix
ROC-AUC Scores
Inference:
o The ROC-AUC scores indicate that the model suffers from overfitting.
o The F1 score indicates that the model is not suitable for prediction.
o 'Salary', 'Distance' and 'Age' are the first three prominent factors.
o 'Engineer' and 'MBA' are the two least important predictors.
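A minimal sketch matching the gradient-boosting cell in the appendix:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

gbcl = GradientBoostingClassifier(random_state=1)  # as fitted in the appendix
gbcl.fit(X_train, y_train)

print('Train AUC:', roc_auc_score(y_train, gbcl.predict_proba(X_train)[:, 1]))
print('Test AUC :', roc_auc_score(y_test, gbcl.predict_proba(X_test)[:, 1]))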
1.4 Which model performs the best?
Inference:
• Depending on the output parameter required, we can choose the respective model.
• If multiple models perform well on the evaluation, we can choose any of them, depending on the other parameters of the data set.
1.5 What are your business insights?
Model performance depends upon the input data and the distribution of the output variable. Based on evaluation metrics such as accuracy, ROC-AUC score, recall and F1 score, we can choose different models for the respective prediction task. The input data shows that the most preferred mode of transport is public transport, and more data related to private transport might enhance model performance. Different models give different levels of importance to the input features, which implies that domain knowledge is important to justify the model findings. In addition, subdividing the data on the basis of Gender or Age might improve the models.
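As a sketch, assuming the seven fitted models from the earlier sketches, a side-by-side comparison on the test set can be produced as follows (KNN uses the scaled features):

from sklearn.metrics import f1_score, roc_auc_score

models = {'Logistic Regression': model, 'LDA': LDA_model, 'CART': DT_model,
          'Naive Bayes': NB_model, 'KNN': KNN_model,
          'Random Forest': RF_model, 'Gradient Boost': gbcl}

for name, m in models.items():
    Xt = X_test_scaled if name == 'KNN' else X_test  # KNN was trained on scaled data
    auc = roc_auc_score(y_test, m.predict_proba(Xt)[:, 1])
    f1 = f1_score(y_test, m.predict(Xt))
    print(f'{name:20s} test AUC={auc:.3f} F1={f1:.3f}')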
Problem 2
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making
their pitch to the VC sharks.
You will ONLY use the “Description” column for the initial text mining exercise.
1. Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
2. Create two corpora, one with those who secured a Deal, the other with those who did
not secure a deal.
3. The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’,
‘this’, ‘even’ and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing
stop words)?
d) Plot the Word Cloud for both the corpora.
4. Refer to both the word clouds. What do you infer?
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices
are less likely to secure a deal based on your analysis?
2.1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.
The data has 495 entries with features such as 'deal', 'description', 'episode', 'category', 'entrepreneurs', 'location', 'website', 'askedFor', 'exchangeForStake', 'valuation', 'season', 'shark1', 'shark2', 'shark3', 'shark4', 'shark5', 'title', 'episode-season' and 'Multiple Entreprenuers'. For our evaluation, the two relevant features, 'deal' and 'description', are taken. The dataset has two duplicate rows, which are removed for data quality; there are no null values. The data info is shown below:
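A minimal sketch of this step; the file name 'SharkTank.csv' is assumed for illustration, as the report does not state it:

import pandas as pd

db = pd.read_csv('SharkTank.csv')  # hypothetical file name

sk = db[['deal', 'description']].copy()  # dependent variable plus the text column
sk.drop_duplicates(inplace=True)         # the two duplicate rows are removed
print(sk.shape)                          # 493 rows remain
print(sk.isnull().sum())                 # no null values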
2.2 Create two corpora, one for those who secured a Deal, the other for those who did not secure a deal.
The dataset is split into two corpora based on the feature 'deal': one data frame for the 'True' values and one for the 'False' values.
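Continuing the sketch above, the split and the character counts asked for in exercise 2.3 a) can be computed as:

# Two corpora based on the 'deal' flag
skt = sk[sk['deal'] == True].copy()   # pitches that secured a deal
skf = sk[sk['deal'] == False].copy()  # pitches that did not

# Exercise 2.3 a): number of characters in each corpus
skt['char_count'] = skt['description'].str.len()
skf['char_count'] = skf['description'].str.len()
print('Deal corpus characters   :', skt['char_count'].sum())
print('No-deal corpus characters:', skf['char_count'].sum())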
True Dataset
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’,
‘this’, ‘even’ and ‘company’ are to be removed)
The stop words were removed from both corpora using the stop-word list imported from NLTK, extended with the project-specific words listed above.
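A sketch of the removal, assuming NLTK's English list extended with the project-specific words:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
stop_words.update(['also', 'made', 'makes', 'like', 'this', 'even', 'company'])

# Keep only tokens that are not stop words
clean = lambda text: ' '.join(w for w in text.lower().split() if w not in stop_words)
skt['clean'] = skt['description'].apply(clean)
skf['clean'] = skf['description'].apply(clean)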
False Dataset
c) What were the top 3 most frequently occurring words in both corpuses (after removing
stop words)?
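A sketch of the frequency count behind the tables below, using the cleaned corpora from the previous step:

import nltk

true_freq = nltk.FreqDist(' '.join(skt['clean']).split())
false_freq = nltk.FreqDist(' '.join(skf['clean']).split())
print(true_freq.most_common(3))   # top 3 words, deal secured
print(false_freq.most_common(3))  # top 3 words, no deal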
True Dataset
False Dataset
d) Plot the Word Cloud for both the corpora.
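A sketch of the word-cloud generation, assuming the wordcloud package and the cleaned corpora from above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

for title, corpus in [('Deal secured', ' '.join(skt['clean'])),
                      ('No deal', ' '.join(skf['clean']))]:
    wc = WordCloud(background_color='white', max_words=100).generate(corpus)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()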
True Dataset
False Dataset
4. Refer to both the word clouds. What do you infer?
The word clouds above are created on the basis of the parameter 'deal': one for pitches that secured a deal (True) and one for those that did not (False).
The secured-deal pitches centre mainly on three important words: product, design, and children. This implies that entrepreneurs need to focus more on these themes. Moreover, key words from the word cloud can be used for search engine optimisation (SEO). At the same time, these words can be used to develop and design the product or service.
The unsecured-deal pitches centre on key words like help, device, and service. This implies that there are serious issues related to devices, customer support and service. Therefore, companies need to investigate these areas and improve.
5. Looking at the word clouds, is it true that the entrepreneurs who introduced devices
are less likely to secure a deal based on your analysis?
Word clouds are one of the best ways to analyse customer feedback, and they help firms strategise their plans to improve the value of the product or service. In the unsecured-deal corpus, the key words are help, device and service, which implies that issues related to devices and their service and support are correlated with failed pitches. Therefore, we can conclude that entrepreneurs who introduced devices might be less likely to secure a deal. Further analysis with more data might help entrepreneurs resolve this issue.
Machine Learning Project 1 - Jupyter Notebook
In [72]: car.head()
Out[72]: Age Gender Engineer MBA Work Exp Salary Distance license Transport
In [73]: car.isnull().sum()
Out[73]: Age 0
Gender 0
Engineer 0
MBA 0
Work Exp 0
Salary 0
Distance 0
license 0
Transport 0
dtype: int64
In [75]: car.duplicated().sum()
Out[75]: 0
In [77]: car.shape
Out[77]: (444, 9)
In [78]: car.columns
Out[78]: Index(['Age', 'Gender', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance',
       'license', 'Transport'],
      dtype='object')
In [79]: car.info()
<class 'pandas.core.frame.DataFrame'>
In [80]: car.describe()
In [81]: car.head()
Out[81]: Age Gender Engineer MBA Work Exp Salary Distance license Transport
Outlier check
In [83]: plt.figure(figsize=(10,8))
car[['Age', 'Work Exp', 'Salary', 'Distance']].boxplot(vert=0)
plt.title('Outlier Check',fontsize=16)
plt.show()
In [85]: car.columns
Out[85]: Index(['Age', 'Gender', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance',
       'license', 'Transport'],
      dtype='object')
In [88]: sns.boxplot(x='Age',y='license',data=car1)
In [89]: sns.boxplot(x='Salary',hue='license',data=car)
Out[89]: <AxesSubplot:xlabel='Salary'>
In [90]:
sns.countplot(x='Engineer',data=car,hue='Transport')
In [91]: sns.countplot(x='Gender',data=car,hue='Transport')
In [92]:
sns.countplot(x='MBA',data=car,hue='Transport')
In [94]: car.corr()
In [95]: sns.heatmap(car.corr(),vmax=1,vmin=-1,cmap="YlGnBu",annot=True,mask=np.triu(car.corr(),+1))
Out[95]: <AxesSubplot:>
In [96]: sns.pairplot(car)
feature: Gender
Male 316
Female 128
['Male', 'Female']
[1 0]
feature: Transport
[1 0]
Out[98]: Age Engineer MBA Work Exp Salary Distance license Gender_Male Transport_Public Transport
0 28 0 0 4 14.3 3.2 0 1 1
1 23 1 0 4 8.3 3.3 0 0 1
2 29 1 0 7 13.4 4.1 0 1 1
3 28 1 1 5 13.4 4.5 0 0 1
4 27 1 0 4 13.4 4.6 0 1 1
In [99]: car1.info()
<class 'pandas.core.frame.DataFrame'>
Outlier Treatment
In [100]: def remove_outlier(col):
    # IQR-based bounds: values beyond 1.5 * IQR from the quartiles are outliers
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range
In [101]: car.columns
Out[101]: Index(['Age', 'Gender', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance',
       'license', 'Transport'],
      dtype='object')
In [107]: car1.head()
Out[107]: Age Engineer MBA Work Exp Salary Distance license Gender_Male Transport_Public Transport
Train-Test Split
In [108]: car1.columns
Out[108]: Index(['Age', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance', 'license',
       'Gender_Male', 'Transport_Public Transport'],
      dtype='object')
In [110]: X.head()
Out[110]: Age Engineer MBA Work Exp Salary Distance license Gender_Male
In [111]: y.head()
Out[111]: 0 1
1 1
2 1
3 1
4 1
In [113]: X_train.count()
Out[113]: Age            310
Engineer       310
MBA            310
Work Exp       310
Salary         310
Distance       310
license        310
Gender_Male    310
dtype: int64
In [114]: X_test.count()
Out[114]: Age            134
Engineer       134
MBA            134
Work Exp       134
Salary         134
Distance       134
license        134
Gender_Male    134
dtype: int64
In [115]: y_train.count()
Out[115]: 310
In [181]: y_test.count()
Out[181]: 134
In [182]: car1.head()
Out[182]: Age Engineer MBA Work Exp Salary Distance license Gender_Male Transport_Public Transport
Logistic Regression
Out[117]: LogisticRegression()
In [119]: ytest_predict_prob=model.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[119]: 0 1
0 0.097419 0.902581
1 0.071571 0.928429
2 0.113186 0.886814
3 0.936392 0.063608
4 0.838924 0.161076
Out[120]: 0.8129032258064516
Out[121]: 0.8208955223880597
AUC: 0.831
AUC: 0.758
In [126]: plot_confusion_matrix(model,X_train,y_train);
Test Data
In [ ]: confusion_matrix(y_test, ytest_predict)
In [ ]: plot_confusion_matrix(model,X_test,y_test);
Linear Discriminant Analysis
Out[151]: LinearDiscriminantAnalysis()
Train Data
In [152]: y_train_predict = LDA_model.predict(X_train)
model_score = LDA_model.score(X_train, y_train)
In [153]: model_score
Out[153]: 0.8064516129032258
Test Data
In [156]: y_test_predict = LDA_model.predict(X_test)
model_score = LDA_model.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
0.8134328358208955
[[22 21]
[ 4 87]]
In [159]: pred_prob_train[:,1]
In [161]: plot_confusion_matrix(LDA_model,X_train,y_train);
Test Data
In [163]: plot_confusion_matrix(LDA_model,X_test,y_test);
In [166]: # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=car1['Transport_Public Transport'])
Naive Bayes
Out[167]: GaussianNB()
Train Data
0.8064516129032258
[[ 50 51]
[ 9 200]]
AUC: 0.804
Confusion Matrix
In [170]: plot_confusion_matrix(NB_model,X_train,y_train);
In [171]: plot_confusion_matrix(NB_model,X_test,y_test);
Test Data
AUC: 0.766
In [173]: x=pd.DataFrame(NB_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-173-4b922253d6b4> in <module>
----> 1 x=pd.DataFrame(NB_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
      2 plt.figure(figsize=(12,7))
      3 sns.barplot(x[0],x.index,palette='rainbow')
      4 plt.ylabel('Feature Name')

AttributeError: 'GaussianNB' object has no attribute 'feature_importances_'
KNN Model
In [ ]: plot_confusion_matrix(KNN_model,X_test,y_test);
In [ ]: plot_confusion_matrix(KNN_model,X_train,y_train);
In [ ]: # Cell truncated in the export; this reconstruction assumes the Min-Max scaling described in section 1.2
from sklearn.preprocessing import MinMaxScaler
X[['Age', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance', 'license']] = MinMaxScaler().fit_transform(X[['Age', 'Engineer', 'MBA', 'Work Exp', 'Salary', 'Distance', 'license']])
In [ ]: KNN_model=KNeighborsClassifier()
KNN_model.fit(X_train,y_train)
In [ ]: # Note: KNeighborsClassifier has no feature_importances_ attribute, so this cell raises an AttributeError
x=pd.DataFrame(KNN_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()
RANDOM FOREST
In [ ]: from sklearn import metrics
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,plot_confusion_matrix
In [ ]: plot_confusion_matrix(RF_model,X_train,y_train);
In [ ]: plot_confusion_matrix(RF_model,X_test,y_test);
In [175]: x=pd.DataFrame(RF_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()
Decision Tree
In [ ]: from sklearn import tree
DT_model= tree.DecisionTreeClassifier()
DT_model.fit(X_train, y_train)
In [ ]: plot_confusion_matrix(DT_model,X_test,y_test);
In [ ]: plot_confusion_matrix(DT_model,X_train,y_train);
In [ ]: x=pd.DataFrame(DT_model.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()
Gradient Boosting
In [ ]: from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(random_state=1)
gbcl = gbcl.fit(X_train, y_train)
In [180]: x=pd.DataFrame(gbcl.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x[0],x.index,palette='rainbow')
plt.ylabel('Feature Name')
plt.xlabel('Feature Importance in %')
plt.title('Feature Importance Plot')
plt.show()
In [1]: ## Importing the necessary libraries along with the standard import
import numpy as np
import pandas as pd
import re # this is the regular expression library which helps us manipulate text (strings) fairly easily and intuitively
import nltk # this is the Natural Language Tool Kit which contains a lot of functionalities for text analytics
import matplotlib.pyplot as plt
import string # this is used for string manipulations
import matplotlib
Out[80]: (reconstructed from the wrapped export; the 'season' column was clipped)
   deal   description                                         episode  category             entrepreneurs                location          website                              askedFor  exchangeForStake  valuation
0  False  Bluetooth device implant for your ear.             1        Novelties            Darrin Johnson               St. Paul, MN      NaN                                  1000000   15                6666667
1  True   Retail and wholesale pie factory with two reta...  1        Specialty Food       Tod Wilson                   Somerset, NJ      http://whybake.com/                  460000    10                4600000
2  True   Ava the Elephant is a godsend for frazzled par...  1        Baby and Child Care  Tiffany Krumins              Atlanta, GA       http://www.avatheelephant.com/       50000     15                333333
3  False  Organizing, packing, and moving services deliv...  1        Consumer Services    Nick Friedman, Omar Soliman  Tampa, FL         http://collegehunkshaulingjunk.com/  250000    25                1000000
4  False  Interactive media centers for healthcare waiti...  1        Consumer Services    Kevin Flannery               Cary, NC          http://www.wispots.com/              1200000   10                12000000
In [81]: db.tail()
Out[81]: (reconstructed from the wrapped export; the 'valuation' column was clipped)
     deal   description                                         episode  category            entrepreneurs                                       location          website                         askedFor  exchangeForStake  valuation (clipped)
490  True   Zoom Interiors is a virtual service for interi...  28       Online Services     Beatrice Fischel-Bock, Madeine Fraser & Lizzie...  Philadelphia, PA  https://zoominteriors.com       100000    20                50000
491  True   Spikeball started out as a casual outdoors gam...  29       Toys and Games      Chris Ruder                                         Chicago, IL       http://spikeball.com            500000    10                500000
492  True   Shark Wheel is out to literally reinvent the w...  29       Outdoor Recreation  David Patrick and Zack Fleishman                    Lake Forest, CA   http://www.sharkwheel.com       100000    5                 200000
493  False  Adriana Montano wants to open the first Cat Ca...  29       Entertainment       Adriana Montano                                     Boca Raton, FL    http://gatocafeflorida.com      100000    20                50000
494  True   Sway Motorsports makes a three-wheeled, all-el...  29       Automotive          Joe Wilcox                                          Palo Alto, CA     http://www.swaymotorsports.com  300000    10                300000
In [4]: db.columns
Out[4]: Index(['deal', 'description', 'episode', 'category', 'entrepreneurs',
       'location', 'website', 'askedFor', 'exchangeForStake', 'valuation',
       'season', 'shark1', 'shark2', 'shark3', 'shark4', 'shark5', 'title',
       'episode-season', 'Multiple Entreprenuers'],
      dtype='object')
In [5]: sk=db.iloc[:,0:2]
In [6]: sk.head()
Out[6]:
deal description
In [7]: sk.info()
<class 'pandas.core.frame.DataFrame'>
In [8]: sk.duplicated().sum()
Out[8]: 2
In [9]: sk.drop_duplicates(inplace=True)
In [10]: sk.duplicated().sum()
Out[10]: 0
In [13]: skt=sk[sk['deal']==True]
In [14]: skt.head()
Out[14]:
deal description
In [15]: skt.info()
<class 'pandas.core.frame.DataFrame'>
In [17]: skf=sk[sk['deal']==False]
In [18]: skf.head()
Out[18]:
deal description
In [19]: skf.info()
<class 'pandas.core.frame.DataFrame'>
In [20]: sk.isnull().sum()
Out[20]: deal 0
description 0
dtype: int64
In [21]: # We are not using the train-test split function from sklearn and hence the need to jumble the data set.
Out[23]:
deal description
485 False the paleo diet bar is a nutrition bar that is ...
493 False adriana montano wants to open the first cat ca...
Out[25]:
deal description
489 True syndaver labs makes synthetic body parts for u...
skt['char_count'] = skt['description'].str.len()
Out[28]:
description char_count
skf['char_count'] = skf['description'].str.len()
Out[30]:
description char_count
Out[33]: True
In [34]: stopwords
Out[34]: (NLTK English stop-word list; excerpt, truncated in the export)
['up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',
 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', ...]
In [35]: all_Words=[x for x in pd.Series(' '.join(sk['description']).split())]
In [36]: all_Words
Out[36]: ['Bluetooth',
'device',
'implant',
'for',
'your',
'ear.',
'Retail',
'and',
'wholesale',
'pie',
'factory',
'with',
'two',
'retail',
'locations',
'in',
'New',
'Jersey.',
'Ava',
'th '
In [37]: nltk.FreqDist(all_Words).most_common(50)
Out[37]: [('the', 625),
('a', 507),
('to', 505),
('of', 351),
('for', 260),
('that', 256),
('is', 246),
('in', 237),
('with', 184),
('your', 150),
('A', 140),
('The', 133),
('are', 113),
('can', 105),
('on', 98),
('you', 95),
('from', 94),
('their', 92),
('or', 86),
('as', 82),
('it', 70),
('by', 66),
('into', 64),
('an', 62),
('be', 61),
('also', 60),
('made', 57),
('any', 56),
('they', 53),
('have', 47),
('which', 44),
('makes', 43),
('make', 42),
('other', 41),
('just', 41),
('has', 40),
('company', 40),
('at', 39),
('even', 39),
('like', 38),
('out', 34),
('them', 34),
('up', 33),
('designed', 32),
('An', 32),
('its', 30),
('without', 30),
('more', 29),
('all', 29)]
In [42]: all_words_freq
Out[42]: FreqDist({'A': 140, 'The': 133, 'also': 60, 'made': 57, 'makes': 43, 'make': 42, 'company': 40, 'even': 39, 'like':
38, 'designed': 32, ...})
Removal of StopWords
In [43]: from nltk.corpus import stopwords
stop = stopwords.words('english')
In [47]: True_Words
Out[47]: ['retail',
'wholesale',
'pie',
'factory',
'two',
'retail',
'locations',
'new',
'jersey.',
'ava',
'elephant',
'godsend',
'frazzled',
'parents',
'young',
'children',
'everywhere.',
'talking',
'medicine',
'di '
In [48]: nltk.FreqDist(True_Words).most_common(3)
Out[51]: also 42
made 32
makes 32
like 27
even 27
make 24
company 23
designed 19
easy 17
without 17
dtype: int64
Out[53]: made 41
also 19
company 19
make 19
use 15
designed 15
system 14
even 14
without 14
water 14
dtype: int64
In [56]: all_True_Words
Out[56]: ['retail',
'wholesale',
'pie',
'factory',
'two',
'retail',
'locations',
'new',
'jersey.',
'ava',
'elephant',
'godsend',
'frazzled',
'parents',
'young',
'children',
'everywhere.',
'talking',
'medicine',
'di '
In [57]: nltk.FreqDist(all_True_Words).most_common(3)
In [59]: all_False_Words
Out[59]: ['bluetooth',
'device',
'implant',
'ear.',
'organizing,',
'packing,',
'moving',
'services',
'delivered',
'college',
'women.',
'interactive',
'media',
'centers',
'healthcare',
'waiting',
'rooms',
'offering',
'patients',
' b'
In [60]: nltk.FreqDist(all_False_Words).most_common(3)
In [64]: #Removing stop words (extended list as above) from the corpus
# True Dataset
true_corpus = skt['description'].apply(lambda x: ' '.join([z for z in x.split() if z not in stop_words]))
true_corpus
...
...
In [68]: wc_true
Out[68]: 'retail wholesale pie factory two retail locations new jersey. ava elephant godsend frazzled parents young childr
en everywhere. talking medicine dispenser administer medicine little ones turning experience playful providing po
sitive reinforcement. one first entrepreneurs pitch shark tank, susan knapp presented perfect pear, line pear-foc
used gourmet food products. sold across 650 retail stores, perfect pear product portfolio includes jams, jellies,
spreads, tapenades, vinegars, marinades, dressings many others, showcase flavors health benefits pears. education
al record label publishing house get students learning classic works literature. battery-operated cooking device
siphons juice, silicone basting brush injector tip marinades. line books written help children find inner calm. c
overplay slipcover children\'s play yards. much mattress, play yards can\'t laundered, yet babies children spend
lots time them, guess leads to. coverplay rescue! fitting snugly standard size play yards, coverplay offers quick
solution add another layer protection child germ-harboring surface play yard. 95% cotton 5% lycra, coverplay mach
ine-washable maintain. indeed, throwing coverplay slipcover wash much easier trying remove spill that\'s gone dir
ectly onto play yard. bright stylish designs, they\'ll play yard look good new. web-based buys back sells 10% unu
sed gift cards year u.s. online journaling service focused facilitating users\' progress towards achieving mainta
ining emotional well-being. fitness machine series bands varying weights pushups easier. made-to-order energy bar
s whole, natural ingredients. customers choose ingredients go highly personalized bars. sells award-winning barbe
cue spice rub products. stainless steel identifying charms stick food grill. wheat-free soy-based modeling clay,
children wheat allergies. web site allows college students buy sell class notes study guides. kids\' organizers l
ook stuffed animals children. women\'s apparel specially sizes 12 18. healthier line carbonated beverages featuri
ng organic ingredients, vitamins, antioxidants, 85 calories per can. faux golf club looks 7 iron conceals urine r
... (output truncated in the export)
In [69]: wc_false
Out[69]: 'bluetooth device implant ear. organizing, packing, moving services delivered college women. interactive media ce
nters healthcare waiting rooms offering patients web access educational information. mixed martial arts clothing
line looking become next big brand active sports / streetwear apparel. attach noted detachable "arm" holds post-i
t notes side laptop screen. safety device seatbelts. prevents driver starting vehicle unless seatbelt buckled. ho
usehold items twist: recycled chopsticks. guitars folding neck, fit backpack overhead compartment airplane. 50 st
ate capitals 50 fun minutes efficient entertaining method learn us geography. set flash cards combines phonetics,
cartoons, associations keep kids\' interest drive long-term learning retention. author ken bradford worked closel
y public private school teachers develop fun satisfying study aide. franchise-model offering professional graffit
i removal. owns trademarks words "coffee," "cappuccino," "java" highly-caffeinated words plush toys. inspirationa
l gifts accessories colorful umbrellas sandals leave messages sand. granola gourmet offers line granola bars diab
etics safely enjoy. unlike granola bars market, granola gourmet\'s bars low glycemic index. low glycemic index me
ans less prone causing spikes blood sugar, good anybody especially damaging diabetics. granola gourmet\'s bars in
gredients naturally low glycemic index, release carbohydrates slowly bloodstream. granola gourmet products tested
gi labs, developed glycemic index concept. ultimate fudge brownie bar glycemic index 23, well threshold considere
d low glycemic. funeral concierge service writes eulogy, officiates funeral service handles post-funeral family g
atherings. line surgical masks fashionable fun. sports bras engineered work woman\'s body based activity enjoys.
pitched "root beer float bottle". underwear minimize odor flatulence, airtight material elastic around legs filte
r back. fitness program fusing dance routines. concept live entertainment amusement attraction times square area
five venues featuring restaurants entertainment. selling socks "pairs" three instead two, aiming preempt problem
... (output truncated in the export)
Word Cloud
In [79]: ## N-grams
from nltk.util import ngrams # function for making ngrams
import collections
# tokenise first: ngrams() over a raw string would yield character n-grams
tokens = wc_true.split()
biGrams = ngrams(tokens, 2) # bigrams (the original cell passed the raw string with n=3)
# get the frequency of each bigram in our corpus
biGramsFreq = collections.Counter(biGrams)
# what are the ten most popular ngrams here?
biGramsFreq.most_common(10)