Machine Learning Project - Parijat
Q1.1- Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
1.2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
1.3 Build the following models on the 70% training data and check the performance of these models on
the Training as well as the 30% Test data using the various inferences from the Confusion Matrix and
plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance:
Logistic Regression Model
Linear Discriminant Analysis
Decision Tree Classifier – CART model
Naïve Bayes Model
KNN Model
Random Forest Model
Boosting Classifier Model using Gradient boost.
1.4 Which model performs the best?
1.5 What are your business insights?
Problem 2: Text Mining
Q2- Create two corpora, one for those who secured a Deal, the other for those who did not secure a
deal.
2.3 The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?
d) Plot the Word Cloud for both the corpora.
2.4 Refer to both the word clouds. What do you infer?
2.5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?
Q1.1- Basic data summary, Univariate, Bivariate analysis, graphs,
checking correlations, outliers and missing values treatment (if
necessary) and check the basic descriptive statistics of the dataset.
Answer -
Data Head
Statistical Description of the dataset
            COUNT   MEAN        STD         MIN   25%   50%    75%      MAX
AGE         444     27.747748   4.41671     18    25    27     30       43
ENGINEER    444     0.754505    0.430866    0     1     1      1        1
MBA         444     0.252252    0.434795    0     0     0      1        1
WORK EXP    444     6.29955     5.112098    0     3     5      8        24
SALARY      444     16.238739   10.453851   6.5   9.8   13.6   15.725   57
DISTANCE    444     11.323198   3.606149    3.2   8.8   11     13.425   23.4
LICENSE     444     0.234234    0.423997    0     0     0      0        1
The table summarises the numeric factors related to employee commuting preferences at ABC Consulting
across 444 records. The mean age of employees is approximately 27.75 years, with a standard deviation
of 4.42. Most employees are engineers (75.45%), while about a quarter hold an MBA degree (25.23%).
The average work experience is around 6.30 years, with a standard deviation of 5.11 and a range of 0 to
24 years, indicating a wide spread of experience levels. Salary ranges from 6.5 to 57 lakhs per annum,
with a mean of 16.24 lakhs and a median of 13.6 lakhs, which points to a right-skewed distribution. The
distance between home and office averages 11.32 km, with a standard deviation of 3.61. Just under a
quarter of employees (23.42%) hold a driving license. This summary lays the foundation for
understanding the key characteristics of ABC Consulting employees, which is essential for building
predictive models of their preferred mode of transport.
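As a rough illustration of how the statistics in such a summary are obtained, the sketch below computes them with Python's standard library. The ages are made-up placeholders rather than the ABC Consulting data, and `describe` is a hypothetical helper name. It uses the sample standard deviation and linearly interpolated quartiles, which is assumed to match the conventions behind the table above.

```python
import statistics

def describe(values):
    """Return count, mean, sample std, min, quartiles and max for a list of numbers."""
    # method="inclusive" interpolates linearly between the min and max of the data.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Placeholder ages, NOT taken from the report's dataset.
ages = [22, 25, 27, 27, 30, 34, 43]
summary = describe(ages)
for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```

Comparing the mean with the 50% value in such output is a quick way to spot skew, as noted for salary above.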
Univariate Analysis
Bivariate Analysis
Correlation Heatmap
Pairplot
1.2 Split the data into train and test in the ratio 70:30. Is scaling
necessary or not?
Scaling is typically employed when variables differ notably in magnitude, since the larger ones can
dominate the model. In this dataset no variable is orders of magnitude larger than the others: every
numeric column lies between 0 and 57 (the maximum salary). Consequently, data scaling is deemed
unnecessary for this exercise.
Following exploratory data analysis, the dataset is partitioned into the independent variables and the
dependent variable (Transport). A train-test split with a ratio of 70:30 is then performed to support
model training and evaluation.
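The report does not show its splitting code. A library routine such as scikit-learn's `train_test_split` would normally be used. The standard-library sketch below only illustrates the idea of a reproducible 70:30 split. The helper name `train_test_split_rows`, the seed and the placeholder rows are assumptions, and only the row count of 444 comes from the summary table.

```python
import math
import random

def train_test_split_rows(rows, test_fraction=0.30, seed=42):
    """Shuffle row indices reproducibly and split them into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(rows))  # round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# 444 placeholder rows, matching only the row count reported in the table.
rows = [{"id": i} for i in range(444)]
train, test = train_test_split_rows(rows)
print(len(train), len(test))  # 310 134
```

Fixing the seed keeps the split identical between runs, so that the models are compared on the same test rows.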
1.3 Build the following models on the 70% training data and check the
performance of these models on the Training as well as the 30% Test
data using the various inferences from the Confusion Matrix and plotting
an AUC-ROC curve along with the AUC values. Tune the models wherever
required for optimum performance:
Logistic Regression Model
Linear Discriminant Analysis
Decision Tree Classifier – CART model
Naïve Bayes Model
KNN Model
Random Forest Model
Boosting Classifier Model using Gradient boost.
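Each model above is judged by its confusion matrix and AUC. As a minimal sketch of what those two measures compute, the code below derives them by hand for a binary target. The labels and scores are toy values, not output from any model in this report, and the helper names and the 0.5 threshold are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, NOT results from the report.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, scores)
print(tp, fp, fn, tn, accuracy, precision, recall, auc)
```

Accuracy, precision and recall depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why both are reported for each model.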
Model Performance on Training Data
AUC-ROC chart for Testing Data
Model Performance on Training Data
C. Decision Tree Classifier – CART model
AUC-ROC chart for Testing Data
D. Naïve Bayes Model
AUC-ROC chart for Testing Data
E. KNN
Model Accuracy on Test Data = 76%
F. Random Forest
G. Gradient Boosting
Confusion Matrix for Test Data
1.4 Which model performs the best?
Across the evaluation metrics, the Linear Discriminant Analysis (LDA) model consistently outperforms
the others, closely followed by the Logistic Regression model. Based on the collective results, the LDA
model is therefore deemed the most suitable for the given task.
1.5 What are your business insights?
Exploratory Data Analysis (EDA) and model building reveal that employees living closer to the
workplace tend to prefer public transport, whereas those living farther away opt for private transport.
Employees with more experience, and correspondingly higher salaries, prefer private transport over
public transport.
Strong positive correlations are observed among age, work experience, and salary, and these variables
also show a positive relationship with the target variable (Transport).
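For reference, the correlation behind such a statement is the Pearson coefficient. The sketch below computes it directly from its definition. The three short lists are placeholder values chosen to move together, not rows from the dataset, and `pearson` is a hypothetical helper name.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Placeholder values, NOT taken from the report's dataset.
age = [22, 25, 27, 30, 34, 40]
work_exp = [0, 3, 5, 8, 11, 18]
salary = [6.5, 9.8, 13.6, 15.7, 28.0, 45.0]

print(round(pearson(age, work_exp), 3))
print(round(pearson(age, salary), 3))
print(round(pearson(work_exp, salary), 3))
```

When predictors are this strongly correlated with one another they carry overlapping information, which is worth keeping in mind when interpreting the coefficients of models such as Logistic Regression.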
Gender differences are evident in transportation choices, with male employees more likely to have a
license and use private transport, while female employees tend to rely on public transport, possibly due
to a lack of licenses.
The Linear Discriminant Analysis (LDA) model gives the best predictive performance for the output
variable (Transport), with an accuracy of 83%, ahead of the Logistic Regression model.
Problem 2: Text Mining
1. Pick out the Deal (dependent variable) and Description columns into a separate data frame.
Q2- Create two corpora, one for those who secured a Deal, the other for
those who did not secure a deal.
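A minimal sketch of this step is given below, assuming each record carries a boolean deal flag and a free-text description. The three rows, the key names `deal` and `description`, and the helper `build_corpora` are placeholders for illustration, not the report's actual data or code.

```python
# Hypothetical rows standing in for the Deal and Description columns.
pitches = [
    {"deal": True, "description": "Easy to use toys for children."},
    {"deal": False, "description": "A premium water bottle device."},
    {"deal": True, "description": "Free online design tool."},
]

def build_corpora(rows):
    """Join descriptions into one text per outcome: (secured, not_secured)."""
    secured = " ".join(r["description"] for r in rows if r["deal"])
    not_secured = " ".join(r["description"] for r in rows if not r["deal"])
    return secured, not_secured

deal_corpus, no_deal_corpus = build_corpora(pitches)

# The number of characters in each corpus is simply the string length.
print(len(deal_corpus), len(no_deal_corpus))
```

Joining with a single space keeps the last word of one description from fusing with the first word of the next.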
2.3 The following exercise is to be done for both the corpora:
B) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)
C) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
Answer:
A-
B-
C- Top 3 most frequently occurring words in each corpus:
Before stop-word removal, the most frequent words are themselves stop words:
For the positive-deal corpus: 'and' (352), 'the' (249), 'to' (212)
For the negative-deal corpus: 'the' (228), 'and' (206), 'a' (155)
After stop-word removal (four words are listed for each corpus because of ties in the counts):
For the positive-deal corpus: 'products' (18), 'easy' (16), 'children' (16), 'make' (16)
For the negative-deal corpus: 'make' (16), 'product' (12), 'system' (12), 'online' (12)
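The counting behind such a list can be sketched with `collections.Counter`. The stop-word set below is a tiny illustrative one: it contains the extra words named in the exercise plus a few common English stop words, whereas a fuller standard list was presumably used for the report. The sample corpus and the helper `top_words` are made up for the example.

```python
import re
from collections import Counter

# Illustrative stop words only; the first row is an assumption, the second
# row holds the extra words named in the exercise.
STOP_WORDS = {
    "a", "an", "and", "the", "to", "of", "for", "is",
    "also", "made", "makes", "like", "this", "even", "company",
}

def top_words(corpus, n=3):
    """Lower-case, tokenise, drop stop words, and return the n most common words."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(kept).most_common(n)

# Made-up text, NOT from either corpus in the report.
corpus = ("This company makes toys for children and the toys are easy to use. "
          "Children like easy toys.")
print(top_words(corpus))
```

Note that `most_common(n)` cuts the list at exactly `n` entries, so words tied with the last one are dropped. Listing ties explicitly, as done above, avoids hiding equally frequent words.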
2.4 Refer to both the word clouds. What do you infer?
Key points from the two word clouds:
Positive attributes: 'one', 'design', 'free', 'children', 'offer', 'easy', 'online', 'use'.
These suggest that products targeting children, offering free samples or products, being easy to use,
and showing good design and creativity are more likely to secure a deal.
Negative attributes: 'one', 'designed', 'help', 'device', 'bottle', 'premium', 'use', 'service'.
These suggest that products with mediocre design, less problem-solving capability, water bottles,
premium pricing, and lower usability are less likely to secure a deal. The prominence of 'service' points
to the importance of better service for products that did not secure a deal.
Words that appear in both clouds, such as 'one' and 'use', may not be defining factors for whether a
deal is made. Alternatively, they might have been used in a different context in each scenario.
Overall Interpretation:
Design, usability, and creativity matter for securing a deal.
Products targeted towards children seem to have a higher likelihood of securing deals.
Service quality appears to be a critical factor, especially for products that did not secure a deal.
Taken together, the word clouds give insight into the characteristics associated with successful and
unsuccessful deals, and offer practical considerations for product development and marketing
strategies.
2.5 Looking at the word clouds, is it true that the entrepreneurs who
introduced devices are less likely to secure a deal based on your
analysis?
The term 'device' is notably absent from the "Secured a Deal" word cloud but prominently featured in the
"Did Not Secure a Deal" word cloud, which means the word occurs frequently in descriptions of products
that were declined. This supports the assertion in the question: entrepreneurs introducing devices
appear less likely to secure a deal. Word frequency alone does not show why, and other factors are
likely to play a part.
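A word cloud reflects raw counts, and the two corpora differ in size, so one way to check a claim like this is to compare how often a term occurs per 1,000 tokens in each corpus. The sketch below is an assumption-laden illustration: `term_rate` is a hypothetical helper and the two short strings are placeholders, not the actual corpora.

```python
import re

def term_rate(corpus, term):
    """Occurrences of `term` per 1,000 tokens, so corpora of different sizes are comparable."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(term) / len(tokens)

# Placeholder texts, NOT the report's corpora.
deal_corpus = "easy toys for children free online design"
no_deal_corpus = "premium device for water bottle a smart device service"

print(term_rate(deal_corpus, "device"))
print(term_rate(no_deal_corpus, "device"))
```

A clearly higher rate in the no-deal corpus would be consistent with the reading of the word clouds, although it still describes word usage rather than the reason a deal was declined.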