
Machine Learning Project - Parijat

The document analyzes a dataset containing information on employee commuting preferences. It provides descriptive statistics and explores relationships between variables through univariate, bivariate and other analyses. Models are built and evaluated to predict preferred transportation mode based on factors like age, experience, salary and distance from work.

Uploaded by

PARIJAT DEV

Machine Learning Project

By- Parijat Dev

Q1.1- Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
1.3 Build the following models on the 70% training data and check the performance of these models on the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum performance:
Logistic Regression Model
Linear Discriminant Analysis
Decision Tree Classifier – CART model
Naïve Bayes Model
KNN Model
Random Forest Model
Boosting Classifier Model using Gradient boost.
4. Which model performs the best?
5. What are your business insights?
Problem 2: Text Mining
Q2- Create two corpora, one for those who secured a Deal, the other for those who did not secure a deal.
2.3 The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?
d) Plot the Word Cloud for both the corpora.
2.4 Refer to both the word clouds. What do you infer?
2.5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to secure a deal based on your analysis?

Q1.1- Basic data summary, Univariate, Bivariate analysis, graphs,
checking correlations, outliers and missing values treatment (if
necessary) and check the basic descriptive statistics of the dataset.
Answer -

Data Head

Serial Number  Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license  Transport
0              28   Male    0         0    4         14.3    3.2       0        Public Transport
1              23   Female  1         0    4         8.3     3.3       0        Public Transport
2              29   Male    1         0    7         13.4    4.1       0        Public Transport
3              28   Female  1         1    5         13.4    4.5       0        Public Transport
4              27   Male    1         0    4         13.4    4.6       0        Public Transport
5              26   Male    1         0    4         12.3    4.8       1        Public Transport
6              28   Male    1         0    5         14.4    5.1       0        Private Transport
7              26   Female  1         0    3         10.5    5.1       0        Public Transport
8              22   Male    1         0    1         7.5     5.1       0        Public Transport
9              27   Male    1         0    4         13.5    5.2       0        Public Transport
10             25   Female  1         0    4         11.5    5.2       0        Public Transport
11             27   Male    1         0    4         13.5    5.3       1        Public Transport
12             24   Male    1         0    2         8.5     5.4       0        Public Transport
13             27   Male    1         0    4         13.4    5.5       1        Public Transport
14             32   Male    1         0    9         15.5    5.5       0        Public Transport
15             25   Male    1         1    4         11.5    5.6       0        Public Transport
16             34   Male    1         0    13        16.5    5.9       0        Public Transport

Statistical Description of the dataset
VARIABLE  COUNT  MEAN       STD        MIN   25%   50%   75%     MAX
AGE       444    27.747748  4.41671    18    25    27    30      43
ENGINEER  444    0.754505   0.430866   0     1     1     1       1
MBA       444    0.252252   0.434795   0     0     0     1       1
WORK EXP  444    6.29955    5.112098   0     3     5     8       24
SALARY    444    16.238739  10.453851  6.5   9.8   13.6  15.725  57
DISTANCE  444    11.323198  3.606149   3.2   8.8   11    13.425  23.4
LICENSE   444    0.234234   0.423997   0     0     0     0       1

The dataset provides a comprehensive statistical overview of various factors related to employee
commuting preferences for ABC Consulting. The mean age of employees is approximately 27.75 years,
with a standard deviation of 4.42. The majority of employees in the dataset are engineers (75.45%), while
only a quarter have an MBA degree (25.23%). The average work experience is around 6.30 years, with a
considerable standard deviation of 5.11, indicating a diverse range of experience levels. Salary ranges
from 6.5 to 57 lakhs per annum, with an average of 16.24 lakhs. The distance between home and office is
around 11.32 km on average, with a standard deviation of 3.61. A notable portion of employees (23.42%)
possesses a driving license. This statistical summary lays the foundation for understanding the key
characteristics of ABC Consulting employees, essential for building predictive models to determine their
preferred mode of transport.
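
The summary fields in the table above can be reproduced with Python's standard library. The sketch below is illustrative only: it uses the 17 ages visible in the Data Head table rather than the full 444-row dataset, so its numbers will not match the table, and the `describe` helper is a hypothetical name rather than the author's code.

```python
import statistics

# Ages of the 17 employees shown in the "Data Head" table (not the full dataset).
ages = [28, 23, 29, 28, 27, 26, 28, 26, 22, 27, 25, 27, 24, 27, 32, 25, 34]

def describe(values):
    """Return the same summary fields as the statistical description table."""
    # method="inclusive" interpolates linearly between the observed values.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

summary = describe(ages)
for name, value in summary.items():
    print(f"{name:>5}: {value:.4f}")
```

Comparing the mean with the 50% value is a quick skew check: in the full table, the mean Salary (16.24) sits well above the median (13.6), which points to a right-skewed salary distribution.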

Univariate Analysis

Bivariate Analysis

Distribution of Age with respect to preferred mode of transport

Correlation Heatmap

Pairplot

1.2 Split the data into train and test in the ratio 70:30. Is scaling
necessary or not?
Scaling is typically employed when variables differ notably in magnitude, since large-valued variables
can then dominate the model. In this dataset no variable has a much larger magnitude than the others:
every numeric value lies between 0 and 57 (the maximum Salary). Data scaling is therefore deemed
unnecessary for this exercise, with the caveat that distance-based models such as KNN remain sensitive
to the gap between the 0/1 indicator columns and Salary.
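
If scaling were wanted, for instance before a distance-based model, z-score standardization is one common option. The sketch below is a minimal illustration on a few Salary and license values taken from the Data Head table; the `standardize` helper is hypothetical, and dividing by the sample standard deviation is a choice made here, not something the report specifies.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# First nine rows of the Data Head table.
salary = [14.3, 8.3, 13.4, 13.4, 13.4, 12.3, 14.4, 10.5, 7.5]
license_flag = [0, 0, 0, 0, 0, 1, 0, 0, 0]

for name, column in [("Salary", salary), ("license", license_flag)]:
    scaled = standardize(column)
    print(name,
          "raw range:", round(max(column) - min(column), 2),
          "scaled range:", round(max(scaled) - min(scaled), 2))
```

After standardization both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.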

Following exploratory data analysis, the dataset is partitioned into independent and dependent
variables. Subsequently, a train-test split is executed, with a ratio of 70:30, to facilitate model training
and evaluation.
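
A 70:30 split can be sketched with the standard library alone. The function name, the seed value of 42, and the choice to round the test share up are all assumptions for illustration; the report does not say which tool or random state was used.

```python
import math
import random

def train_test_split_indices(n_rows, test_size=0.30, seed=42):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]

# The dataset described above has 444 rows.
train_idx, test_idx = train_test_split_indices(444)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, so the accuracies of different models are compared on the same test rows.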

1.3 Build the following models on the 70% training data and check the
performance of these models on the Training as well as the 30% Test
data using the various inferences from the Confusion Matrix and plotting
an AUC-ROC curve along with the AUC values. Tune the models wherever
required for optimum performance:
Logistic Regression Model
Linear Discriminant Analysis
Decision Tree Classifier – CART model
Naïve Bayes Model
KNN Model
Random Forest Model
Boosting Classifier Model using Gradient boost.
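
Each model below is judged by its confusion matrix and accuracy. As a minimal sketch of how those quantities are derived, the following code counts (actual, predicted) pairs for binary labels. The label vectors are made up for illustration and are not the report's predictions; the 0 = Private Transport, 1 = Public Transport encoding follows the convention used later in the report.

```python
def confusion_matrix(y_true, y_pred):
    """Return counts keyed by (actual, predicted) for binary labels 0/1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for actual, predicted in zip(y_true, y_pred):
        counts[(actual, predicted)] += 1
    return counts

def accuracy(counts):
    """Share of predictions on the diagonal of the confusion matrix."""
    correct = counts[(0, 0)] + counts[(1, 1)]
    return correct / sum(counts.values())

# Hypothetical labels: 0 = Private Transport, 1 = Public Transport.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print("accuracy:", accuracy(cm))
```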

A. Logistic Regression Model

Model Accuracy on Test Data = 82%

Model accuracy on Train Data = 81%

Model Performance on Test Data

Model Performance on Training Data

Confusion Matrix Testing Data

AUC-ROC chart for Testing Data
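
The AUC value that accompanies a ROC chart can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way from scratch; the scores are invented for illustration and the `roc_auc` helper is a hypothetical name, not the report's code.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities for class 1 (Public Transport).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
print(round(roc_auc(y_true, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it complements accuracy when the classes are imbalanced.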

B. Linear Discriminant Model

Model Accuracy on Test Data = 79%

Model accuracy on Train Data = 80%

Model Performance on Test Data

Model Performance on Training Data

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

C. Decision Tree Classifier – CART model

Model Accuracy on Test Data = 79%

Model accuracy on Train Data = 80%

Model Performance on Test Data

Model Performance on Training Data

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

D. Naïve Bayes Model

Model Accuracy on Test Data = 75%

Model Accuracy on Train Data = 77%

Model Performance on Train Data

Model Performance on Test Data

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

E. KNN

Model Accuracy on Test Data = 76%

Model Accuracy on Train Data = 100%

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data
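
A training accuracy of 100% alongside a much lower test accuracy is the pattern a nearest-neighbour model shows when k = 1, because every training point is its own nearest neighbour. Whether k = 1 was used here is an assumption; the report does not state the value of k. The sketch below, on made-up (Salary, Distance) rows, shows the effect.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point` (Euclidean)."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], point))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def knn_accuracy(train_X, train_y, eval_X, eval_y, k):
    """Share of evaluation points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)

# Hypothetical (Salary, Distance) rows; 0 = Private, 1 = Public Transport.
X = [(14.3, 3.2), (8.3, 3.3), (13.4, 4.1), (36.0, 14.2), (13.8, 4.0), (9.0, 5.0)]
y = [1, 1, 1, 0, 0, 1]

print("k=1 train accuracy:", knn_accuracy(X, y, X, y, k=1))
print("k=3 train accuracy:", knn_accuracy(X, y, X, y, k=3))
```

With a larger k the training accuracy drops below 100%, which is usually a sign of less memorization rather than a worse model; the test accuracy is the figure to tune against.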

F. Random Forest

Model Accuracy on Test Data = 74%

Model Accuracy on Train Data = 100%

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

G. Gradient Boosting

Model Accuracy on Test Data = 96%

Model Accuracy on Train Data = 79%

AUC-ROC chart for Testing Data

Confusion Matrix for Test Data

4. Which model performs the best?

The optimal models based on different performance metrics are as follows:

• Accuracy Scores: Linear Discriminant Analysis (LDA)
• ROC AUC Scores: Logistic Regression
• Recall of Private Transport (0’s): Logistic Regression and Linear Discriminant Analysis
• Recall of Public Transport (1’s): Tuned Random Forest Classifier
• F1 score of Private Transport (0’s): Linear Discriminant Analysis
• F1 score of Public Transport (1’s): Linear Discriminant Analysis
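
The per-class recall and F1 scores compared in this list are computed by treating each class in turn as the "positive" class. The sketch below shows the arithmetic on invented labels; it does not reproduce the report's results, and `class_report` is a hypothetical helper name.

```python
def class_report(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 0 = Private Transport, 1 = Public Transport.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

for label, name in [(0, "Private Transport"), (1, "Public Transport")]:
    print(name, class_report(y_true, y_pred, label))
```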

Considering the comprehensive evaluation across various metrics, it is evident that the Linear
Discriminant Analysis (LDA) model consistently outperforms others, closely followed by the Logistic
Regression model. Therefore, based on the collective results, the Linear Discriminant Analysis (LDA)
model is deemed the most suitable for the given task.

5. What are your business insights?

Exploratory Data Analysis (EDA) and model building reveal that employees residing closer to the
workplace tend to prefer public transport, whereas those residing farther away opt for private transport.

Employees with higher experience and corresponding higher salaries exhibit a preference for private
transport over public transport.

Strong positive correlations are observed among age, work experience, and salary; these variables move
together, and each is associated with the target variable (Transport).
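
A correlation of this kind can be checked with the Pearson coefficient. The sketch below computes it for Work Exp and Salary using only the 17 rows of the Data Head table, so the value is indicative and will differ from the full-dataset heatmap; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Work Exp and Salary columns from the Data Head table (17 rows only).
work_exp = [4, 4, 7, 5, 4, 4, 5, 3, 1, 4, 4, 4, 2, 4, 9, 4, 13]
salary = [14.3, 8.3, 13.4, 13.4, 13.4, 12.3, 14.4, 10.5, 7.5,
          13.5, 11.5, 13.5, 8.5, 13.4, 15.5, 11.5, 16.5]

print(round(pearson_r(work_exp, salary), 2))
```

When predictors are this strongly correlated they carry overlapping information, which matters for models such as Logistic Regression and Naïve Bayes that are sensitive to redundant inputs.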

Gender differences are evident in transportation choices, with male employees more likely to have a
license and use private transport, while female employees tend to rely on public transport, possibly due
to a lack of licenses.

The Linear Discriminant Analysis (LDA) model demonstrates superior predictive performance for the
output variable (Transport) with an accuracy of 83%, surpassing the Logistic Regression model.

Problem 2: Text Mining

1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.

Q2- Create two corpora, one for those who secured a Deal, the other for
those who did not secure a deal.
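
Building the two corpora amounts to grouping the Description text by the Deal flag and joining each group into one string. The rows below are invented stand-ins for the Deal and Description columns, and `build_corpora` is a hypothetical helper, so this is a sketch of the idea rather than the report's code.

```python
# Hypothetical rows standing in for the Deal and Description columns.
pitches = [
    {"deal": True, "description": "An easy to use online learning kit for children"},
    {"deal": False, "description": "A premium water bottle with a tracking device"},
    {"deal": True, "description": "Free design templates for small shops"},
    {"deal": False, "description": "A service to help manage a home security system"},
]

def build_corpora(rows):
    """Join descriptions into one text per outcome: (deal corpus, no-deal corpus)."""
    deal = " ".join(r["description"] for r in rows if r["deal"])
    no_deal = " ".join(r["description"] for r in rows if not r["deal"])
    return deal, no_deal

deal_corpus, no_deal_corpus = build_corpora(pitches)
print(deal_corpus)
print(no_deal_corpus)
```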

2.3 The following exercise is to be done for both the corpora:

a) Find the number of characters for both the corpuses.

b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)

c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?

d) Plot the Word Cloud for both the corpora.


Answer:

A-

Number of characters in Corpus with Deal: 64060


Number of characters in Corpus without Deal: 47184
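
A character count depends on whether whitespace is included, which the report does not specify. The sketch below shows both conventions on a made-up sentence; `count_characters` is a hypothetical helper.

```python
def count_characters(corpus, include_whitespace=True):
    """Count characters in a corpus, optionally ignoring spaces and newlines."""
    if include_whitespace:
        return len(corpus)
    return sum(1 for ch in corpus if not ch.isspace())

sample = "Easy to use kit for children"
print(count_characters(sample))
print(count_characters(sample, include_whitespace=False))
```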

B-

C- What were the top 3 most frequently occurring words in both corpora (after removing
stop words)?

The most frequently occurring words in each corpus are listed below. After stop-word removal, ties in
the counts mean that four words are shown for each corpus rather than three.

Before cleaning stop words:

For the deal corpus: 'and' (352), 'the' (249), 'to' (212)

For the no-deal corpus: 'the' (228), 'and' (206), 'a' (155)

After cleaning stop words:

For the deal corpus: 'products' (18), 'easy' (16), 'children' (16), 'make' (16)

For the no-deal corpus: 'make' (16), 'product' (12), 'system' (12), 'online' (12)
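
Stop-word removal and the frequency ranking can be sketched with `collections.Counter`. The custom stop words are the ones named in the exercise; the short list of common English stop words, the tokenizing regular expression, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# Custom stop words named in the exercise.
CUSTOM_STOP_WORDS = {"also", "made", "makes", "like", "this", "even", "company"}
# Illustrative subset of common English stop words (not the author's list).
COMMON_STOP_WORDS = {"a", "an", "and", "the", "to", "for", "of", "with", "is", "in"}
STOP_WORDS = CUSTOM_STOP_WORDS | COMMON_STOP_WORDS

def top_words(corpus, n=3):
    """Lowercase, tokenize, drop stop words, and return the n most common words."""
    words = re.findall(r"[a-z']+", corpus.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    return Counter(kept).most_common(n)

sample = ("This company makes easy products for children. "
          "The products are easy to clean and the children like the products.")
print(top_words(sample))
```

`most_common(n)` cuts the list at exactly n entries, so words tied with the last one are silently dropped; inspecting a few extra entries is a simple way to spot ties like those in the counts above.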

D. Plot the Word Cloud for both the corpora.

2.4 Refer to both the word clouds. What do you infer?
Key observations from the two word clouds:

Secured a Deal Word Cloud:

Positive attributes: 'one', 'design', 'free', 'children', 'offer', 'easy', 'online', 'use'.

Indicates that deals targeting children, offering free samples/products, being easy to use, and having
good design and creativity are more likely to secure a deal.

Notable focus on products for children's use.

Did Not Secure a Deal Word Cloud:

Negative attributes: 'one', 'designed', 'help', 'device', 'bottle', 'premium', 'use', 'service'.

Suggests that deals with mediocre design, less problem-solving capability, products involving water
bottles, higher and premium prices, and lower usability are less likely to secure a deal.

Emphasizes the importance of better service for products that did not secure a deal.

Common Words Across Both Word Clouds:

'one', 'designed', 'system', 'use'.

Indicates that these words may not be defining factors for whether a deal is made or not. Alternatively,
they might have been used in a different context in each scenario.

Overall Interpretation:

The importance of design, usability, and creativity is highlighted for securing a deal.

Products targeted towards children seem to have a higher likelihood of securing deals.

Service quality is emphasized as a critical factor, especially for products that did not secure a deal.
Taken together, the word clouds point to the characteristics and attributes associated with successful
and unsuccessful deals, and offer practical considerations for product development and marketing
strategies.

2.5 Looking at the word clouds, is it true that the entrepreneurs who
introduced devices are less likely to secure a deal based on your
analysis?
The term 'device' is notably absent from the "Secured a Deal" word cloud but features prominently in the
"Did Not Secure a Deal" word cloud. The word therefore occurs frequently in pitches where deals were
declined, which supports the assertion in the question. Based on this analysis, entrepreneurs who
introduced devices appear less likely to secure a deal, although word frequency alone does not show
which factors drive that outcome.
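
Because the two corpora differ in size (64060 versus 47184 characters), raw counts of 'device' are easier to compare as a rate per thousand words. The sketch below uses invented text for both corpora, and `rate_per_thousand` is a hypothetical helper, so it illustrates the check rather than reporting a result.

```python
import re

def rate_per_thousand(corpus, word):
    """Occurrences of `word` per 1000 words of the corpus (0.0 if it is empty)."""
    words = re.findall(r"[a-z']+", corpus.lower())
    if not words:
        return 0.0
    return 1000 * words.count(word) / len(words)

# Hypothetical corpora for illustration only.
deal_corpus = "easy kit for children"
no_deal_corpus = "a smart device and a premium bottle"

print("deal:", rate_per_thousand(deal_corpus, "device"))
print("no deal:", round(rate_per_thousand(no_deal_corpus, "device"), 1))
```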
