Machine Learning Project - Parijat

The document analyzes a dataset containing information on employee commuting preferences. It provides descriptive statistics and explores relationships between variables through univariate, bivariate and other analyses. Models are built and evaluated to predict preferred transportation mode based on factors like age, experience, salary and distance from work.

Machine Learning Project

By- Parijat Dev

Q1.1- Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
1.3 Build the following models on the 70% training data and check the performance of these models on
the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and
plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance.
Logistic Regression Model
Linear Discriminant Analysis
Decision Tree Classifier – CART model
Naïve Bayes Model
KNN Model
Random Forest Model
Boosting Classifier Model using Gradient boost.
4. Which model performs the best?
5. What are your business insights?
Problem 2: Text Mining
Q2- Create two corpora, one for those who secured a Deal, the other for those who did not secure a
deal.
2.3 The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?
d) Plot the Word Cloud for both the corpora.
2.4 Refer to both the word clouds. What do you infer?
2.5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?

Q1.1- Basic data summary, Univariate, Bivariate analysis, graphs,
checking correlations, outliers and missing values treatment (if
necessary) and check the basic descriptive statistics of the dataset.
Answer -

Data Head

Serial Number  Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license  Transport
0              28   Male    0         0    4         14.3    3.2       0        Public Transport
1              23   Female  1         0    4         8.3     3.3       0        Public Transport
2              29   Male    1         0    7         13.4    4.1       0        Public Transport
3              28   Female  1         1    5         13.4    4.5       0        Public Transport
4              27   Male    1         0    4         13.4    4.6       0        Public Transport
5              26   Male    1         0    4         12.3    4.8       1        Public Transport
6              28   Male    1         0    5         14.4    5.1       0        Private Transport
7              26   Female  1         0    3         10.5    5.1       0        Public Transport
8              22   Male    1         0    1         7.5     5.1       0        Public Transport
9              27   Male    1         0    4         13.5    5.2       0        Public Transport
10             25   Female  1         0    4         11.5    5.2       0        Public Transport
11             27   Male    1         0    4         13.5    5.3       1        Public Transport
12             24   Male    1         0    2         8.5     5.4       0        Public Transport
13             27   Male    1         0    4         13.4    5.5       1        Public Transport
14             32   Male    1         0    9         15.5    5.5       0        Public Transport
15             25   Male    1         1    4         11.5    5.6       0        Public Transport
16             34   Male    1         0    13        16.5    5.9       0        Public Transport

Statistical Description of the dataset
          COUNT  MEAN       STD        MIN  25%  50%   75%     MAX
AGE       444    27.747748  4.41671    18   25   27    30      43
ENGINEER  444    0.754505   0.430866   0    1    1     1       1
MBA       444    0.252252   0.434795   0    0    0     1       1
WORK EXP  444    6.29955    5.112098   0    3    5     8       24
SALARY    444    16.238739  10.453851  6.5  9.8  13.6  15.725  57
DISTANCE  444    11.323198  3.606149   3.2  8.8  11    13.425  23.4
LICENSE   444    0.234234   0.423997   0    0    0     0       1

The dataset provides a comprehensive statistical overview of various factors related to employee
commuting preferences for ABC Consulting. The mean age of employees is approximately 27.75 years,
with a standard deviation of 4.42. The majority of employees in the dataset are engineers (75.45%), while
only a quarter have an MBA degree (25.23%). The average work experience is around 6.30 years, with a
considerable standard deviation of 5.11, indicating a diverse range of experience levels. Salary ranges
from 6.5 to 57 lakhs per annum, with an average of 16.24 lakhs. The distance between home and office is
around 11.32 km on average, with a standard deviation of 3.61. A notable portion of employees (23.42%)
possesses a driving license. This statistical summary lays the foundation for understanding the key
characteristics of ABC Consulting employees, essential for building predictive models to determine their
preferred mode of transport.
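
The summary figures above are the kind of output a describe-style routine produces. As a minimal sketch, using only Python's standard library and only the 17 Age values visible in the Data Head (not the full 444-row dataset), the same statistics can be computed as shown below. The helper name `describe` is hypothetical, and the "inclusive" quartile method is an assumption about how the report's quartiles were calculated.

```python
import statistics

# Age values copied from the 17 rows shown in the Data Head table.
ages = [28, 23, 29, 28, 27, 26, 28, 26, 22, 27, 25, 27, 24, 27, 32, 25, 34]

def describe(values):
    """Return count, mean, sample std, min, quartiles and max (hypothetical helper)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

summary = describe(ages)
for name, value in summary.items():
    # Floats are shown to four decimals, integers as they are.
    print(f"{name:>5}: {value:.4f}" if isinstance(value, float) else f"{name:>5}: {value}")
```

Because only the first 17 rows are used, these numbers will differ from the full-dataset table; the point is the calculation, not the values.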

Univariate Analysis

Bivariate Analysis

Distribution of Age with respect to preferred mode of transport

Correlation Heatmap

Pairplot

1.2 Split the data into train and test in the ratio 70:30. Is scaling
necessary or not?
Scaling is typically applied when variables differ widely in magnitude, because variables with large
values can otherwise dominate the model. In this dataset no variable is on a much larger scale than the
others: every value in the statistical description lies between 0 and 57. Consequently, data scaling is
deemed unnecessary for this exercise.
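
One rough way to check this claim is to compare the spread of each variable using the MIN and MAX columns of the statistical description. The sketch below does only that; the dictionary name `min_max` is introduced here for illustration, and comparing ranges is a heuristic rather than a formal test.

```python
# (min, max) per numeric column, copied from the statistical description table.
min_max = {
    "Age": (18, 43),
    "Engineer": (0, 1),
    "MBA": (0, 1),
    "Work Exp": (0, 24),
    "Salary": (6.5, 57),
    "Distance": (3.2, 23.4),
    "license": (0, 1),
}

# Range (max - min) of each column.
ranges = {name: hi - lo for name, (lo, hi) in min_max.items()}

widest = max(ranges, key=ranges.get)
narrowest = min(ranges, key=ranges.get)

# List the columns from widest to narrowest range.
for name, width in sorted(ranges.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9} range = {width:g}")
print(f"widest / narrowest range: {ranges[widest] / ranges[narrowest]:g}")
```

Salary has the widest range (50.5) and the binary flags the narrowest (1), so all ranges fall within a factor of about 50 of each other.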

Following exploratory data analysis, the dataset is partitioned into independent and dependent
variables. Subsequently, a train-test split is executed, with a ratio of 70:30, to facilitate model training
and evaluation.
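
The report does not show how the split was coded. As a minimal sketch of a 70:30 split, the standard-library version below shuffles row indices and slices them; the helper name `train_test_split_indices` and the seed value 42 are assumptions made for illustration. Libraries such as scikit-learn offer `train_test_split` for the same purpose.

```python
import math
import random

def train_test_split_indices(n_rows, test_fraction=0.30, seed=42):
    """Shuffle row indices and split them into train and test lists (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# 444 is the row count given in the statistical description.
train_idx, test_idx = train_test_split_indices(444)
print(len(train_idx), len(test_idx))
```

With 444 rows this yields 310 training rows and 134 test rows; the independent variables and the Transport label would then be selected by these indices.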

1.3 Build the following models on the 70% training data and check the
performance of these models on the Training as well as the 30% Test
data using the various inferences from the Confusion Matrix and plotting
an AUC-ROC curve along with the AUC values. Tune the models wherever
required for optimum performance:
Logistic Regression Model
Linear Discriminant Analysis
Decision Tree Classifier – CART model
Naïve Bayes Model
KNN Model
Random Forest Model
Boosting Classifier Model using Gradient boost.
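
Each model below is judged on quantities read off a confusion matrix. As a minimal sketch of how accuracy, precision, recall and F1 follow from the four cell counts, the code below uses made-up labels (not the report's predictions), with 1 standing for Public Transport and 0 for Private Transport. Both function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only: 1 = Public Transport, 0 = Private Transport.
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

Passing `positive=0` gives the corresponding figures for the Private Transport class.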

A. Logistic Regression Model

Model Accuracy on Test Data = 82%

Model Accuracy on Train Data = 81%

Model Performance on Test Data

Model Performance on Training Data

Confusion Matrix Testing Data

AUC-ROC chart for Testing Data

B. Linear Discriminant Model

Model Accuracy on Test Data = 79%

Model Accuracy on Train Data = 80%

Model Performance on Test Data

Model Performance on Training Data

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

C. Decision Tree Classifier – CART model

Model Accuracy on Test Data = 79%

Model Accuracy on Train Data = 80%

Model Performance on Test Data

Model Performance on Training Data

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

D. Naïve Bayes Model

Model Accuracy on Test Data = 75%

Model Accuracy on Train Data = 77%

Model Performance on Train Data

Model Performance on Test Data

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

E. KNN

Model Accuracy on Test Data = 76%

Model Accuracy on Train Data = 100%

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

F. Random Forest

Model Accuracy on Test Data = 74%

Model Accuracy on Train Data = 100%

Confusion Matrix for Test Data

AUC-ROC chart for Testing Data

G. Gradient Boosting

Model Accuracy on Test Data = 96%

Model Accuracy on Train Data = 79%

AUC-ROC chart for Testing Data

Confusion Matrix for Test Data
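
The train and test accuracies reported for the seven models above can be put side by side to see where the two diverge. The sketch below only tabulates the reported percentages; the 10-point threshold is an arbitrary cut-off chosen for this illustration, not a figure from the report.

```python
# (train %, test %) accuracies as reported for the seven models above.
reported = {
    "Logistic Regression": (81, 82),
    "Linear Discriminant": (80, 79),
    "Decision Tree (CART)": (80, 79),
    "Naive Bayes": (77, 75),
    "KNN": (100, 76),
    "Random Forest": (100, 74),
    "Gradient Boosting": (79, 96),
}

GAP_THRESHOLD = 10  # percentage points; an arbitrary cut-off chosen for this sketch

for model, (train, test) in reported.items():
    gap = train - test
    flag = "  <- large train/test gap" if abs(gap) >= GAP_THRESHOLD else ""
    print(f"{model:<22} train={train:>3}%  test={test:>3}%  gap={gap:+d}{flag}")

# Models whose train and test accuracy differ by at least the threshold.
flagged = [m for m, (tr, te) in reported.items() if abs(tr - te) >= GAP_THRESHOLD]
```

A train accuracy far above the test accuracy (KNN, Random Forest) is the usual sign of overfitting. A test accuracy far above the train accuracy (Gradient Boosting) is unusual and may be worth re-checking.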

4. Which model performs the best?

The optimal models based on different performance metrics are as follows:

• Accuracy Scores: Linear Discriminant Analysis (LDA)

• ROC AUC Scores: Logistic Regression
• Recall values of Private Transport (0’s): Logistic Regression and Linear Discriminant Analysis
• Recall value of Public Transport (1’s): Tuned Random Forest Classifier
• F1 score of Private Transport (0’s): Linear Discriminant Analysis
• F1 score of Public Transport (1’s): Linear Discriminant Analysis

Considering the comprehensive evaluation across various metrics, it is evident that the Linear
Discriminant Analysis (LDA) model consistently outperforms others, closely followed by the Logistic
Regression model. Therefore, based on the collective results, the Linear Discriminant Analysis (LDA)
model is deemed the most suitable for the given task.
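
Since ROC AUC is one of the selection metrics above, it helps to recall what the number measures: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting on made-up scores (not the report's model outputs); the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of class 1, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.8, 0.4, 0.5, 0.7, 0.2, 0.6]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Pair counting is quadratic in the number of rows, which is acceptable for a test set of this size; library implementations use a sorted sweep instead.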

5. What are your business insights?

Exploratory Data Analysis (EDA) and model building reveal that employees residing closer to the
workplace tend to prefer public transport, whereas those residing farther away opt for private transport.

Employees with higher experience and corresponding higher salaries exhibit a preference for private
transport over public transport.

Strong positive correlations are observed among age, work experience, and salary variables, indicating a
positive relationship with the predictor variable.

Gender differences are evident in transportation choices, with male employees more likely to have a
license and use private transport, while female employees tend to rely on public transport, possibly due
to a lack of licenses.

The Linear Discriminant Analysis (LDA) model demonstrates superior predictive performance for the
output variable (Transport) with an accuracy of 83%, surpassing the Logistic Regression model in
performance.

Problem 2: Text Mining

1. Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.

Q2- Create two corpora, one for those who secured a Deal, the other for
those who did not secure a deal.
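
The report's own code for this step is not shown. As a minimal sketch, each corpus can be built by joining the descriptions that share a deal outcome. The rows below are made up, and the keys `deal` and `description` are assumed stand-ins for the Deal and Description columns.

```python
# Made-up rows standing in for the (Deal, Description) data frame; not the real data.
rows = [
    {"deal": True, "description": "An easy to use toy for children."},
    {"deal": False, "description": "A premium device for water bottles."},
    {"deal": True, "description": "Free online design tools."},
]

def build_corpus(rows, deal_flag):
    """Join the descriptions of all rows whose deal flag matches (hypothetical helper)."""
    return " ".join(r["description"] for r in rows if r["deal"] == deal_flag)

deal_corpus = build_corpus(rows, True)
no_deal_corpus = build_corpus(rows, False)

# Character counts include spaces and punctuation.
print(len(deal_corpus), len(no_deal_corpus))
```

Taking `len()` of each joined string gives the character count asked for in part a) of the next exercise.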

2.3 The following exercise is to be done for both the corpora:

a) Find the number of characters for both the corpuses.

b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)

c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?

d) Plot the Word Cloud for both the corpora.

Answer:

A-

Number of characters in Corpus with Deal: 64060

Number of characters in Corpus without Deal: 47184

B-

C- What were the top 3 most frequently occurring words in both corpuses (after removing
stop words)?

The most frequently occurring words in each corpus, before and after removing stop words, are as
follows (tied words are all listed, so four words appear per corpus after cleaning):

Before cleaning stop words:

For the deal corpus: 'and' (352), 'the' (249), 'to' (212)

For the no-deal corpus: 'the' (228), 'and' (206), 'a' (155)

After cleaning stop words:

For the deal corpus: 'products' (18), 'easy' (16), 'children' (16), 'make' (16)

For the no-deal corpus: 'make' (16), 'product' (12), 'system' (12), 'online' (12)
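
The counts above come from tokenising each corpus, dropping stop words and tallying what remains. The sketch below shows that pipeline on a made-up sentence. The stop-word set is a tiny stand-in (the report's full list is not shown), with the seven extra words named in the exercise added, and the helper name `top_words` is hypothetical.

```python
import re
from collections import Counter

# A tiny stand-in stop-word list; the report's full list is not shown.
# The second line holds the seven extra words named in the exercise.
STOP_WORDS = {"a", "and", "the", "to", "for", "of", "is",
              "also", "made", "makes", "like", "this", "even", "company"}

def top_words(corpus, stop_words, k=3):
    """Lowercase, tokenise on letters/apostrophes, drop stop words, return the k most common."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    return Counter(kept).most_common(k)

# Made-up text for illustration only.
sample = ("This company makes easy products for children. "
          "Easy products make children happy, and products sell.")

result = top_words(sample, STOP_WORDS)
print(result)
```

Note that `most_common(k)` cuts ties at rank k, whereas the report lists every tied word; raising `k` and inspecting the counts is one way to reproduce that behaviour.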

D. Plot the Word Cloud for both the corpora.

2.4 Refer to both the word clouds. What do you infer?
Key points from the two word clouds:

Secured a Deal Word Cloud:

Positive attributes: 'one', 'design', 'free', 'children', 'offer', 'easy', 'online', 'use'.

Indicates that deals targeting children, offering free samples/products, being easy to use, and having
good design and creativity are more likely to secure a deal.

Notable focus on products for children's use.

Did Not Secure a Deal Word Cloud:

Negative attributes: 'one', 'designed', 'help', 'device', 'bottle', 'premium', 'use', 'service'.

Suggests that deals with mediocre design, less problem-solving capability, products involving water
bottles, higher and premium prices, and lower usability are less likely to secure a deal.

Emphasizes the importance of better service for products that did not secure a deal.

Common Words Across Both Word Clouds:

'one', 'designed', 'system', 'use'.

Indicates that these words may not be defining factors for whether a deal is made or not. Alternatively,
they might have been used in a different context in each scenario.

Overall Interpretation:

The importance of design, usability, and creativity is highlighted for securing a deal.

Products targeted towards children seem to have a higher likelihood of securing deals.

Service quality is emphasized as a critical factor, especially for products that did not secure a deal.

Taken together, the word clouds highlight the characteristics and attributes associated with successful
and unsuccessful deals, and offer practical considerations for product development and marketing
strategies.

2.5 Looking at the word clouds, is it true that the entrepreneurs who
introduced devices are less likely to secure a deal based on your
analysis?
The term 'device' is absent from the "Secured a Deal" word cloud but prominent in the "Did Not Secure a
Deal" word cloud, which means the word occurs frequently in pitches where deals were declined. This
supports the assertion in the question: on this evidence, entrepreneurs who introduced devices were
less likely to secure a deal, although word frequency alone does not show why.
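
Because the two corpora differ in size (64060 versus 47184 characters), a raw count of 'device' is easier to compare once it is normalised. The sketch below computes occurrences per 1,000 words on made-up text, not the report's corpora; the helper name `rate_per_1000_words` is hypothetical.

```python
import re

def rate_per_1000_words(corpus, word):
    """Occurrences of `word` per 1,000 tokens, so corpora of different sizes are comparable."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word.lower()) / len(tokens)

# Made-up corpora for illustration only; not the report's data.
deal_corpus = "easy toy for children " * 50
no_deal_corpus = "a device to help " * 30 + "premium bottle service " * 10

deal_rate = rate_per_1000_words(deal_corpus, "device")
no_deal_rate = rate_per_1000_words(no_deal_corpus, "device")
print(deal_rate, no_deal_rate)
```

A clearly higher rate in the no-deal corpus would be consistent with the word-cloud reading, though it would still describe word usage rather than the cause of a declined deal.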
