Business Report: Predictive Modelling

This document contains a business report on predictive modelling for a cubic zirconia manufacturer and a tour and travel agency. For the manufacturer, linear regression was used to predict stone prices from attributes like carat, cut, color, clarity and make recommendations. The best model used carat and clarity as the most important factors. For the agency, logistic regression and LDA were applied to predict which employees would opt for travel packages using factors like salary, age, education and children. Exploratory data analysis including missing value treatment, encoding and model performance metrics were conducted for both cases.

Uploaded by

hepzi selvam
Copyright: © All Rights Reserved

BUSINESS REPORT

PREDICTIVE
MODELLING

PHILIP VICTOR
DATE: 27/03/2022
COURSE: PG DSBA
TABLE OF CONTENTS

PROBLEM 1: LINEAR REGRESSION

DATA DICTIONARY

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.

1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of combining the
sub levels of a ordinal variables and take actions accordingly. Explain why you are combining these
sub levels with appropriate reasoning.

1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.

1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.

PROBLEM 2: LOGISTIC REGRESSION AND LDA

DATA DICTIONARY

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data
analysis.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model:
Compare Both the models and write inference which model is best/optimized.

2.4 Inference: Basis on these predictions, what are the insights and recommendations.

Problem 1: Linear Regression

You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You
are provided with a dataset containing the prices and other attributes of almost 27,000
cubic zirconia stones (an inexpensive diamond alternative with many of the same qualities
as a diamond). The company earns different profits in different price slots. You have to
help the company predict the price of a stone on the basis of the details given in the
dataset, so that it can distinguish between more profitable and less profitable stones and
achieve a better profit share. Also, provide the 5 attributes that are most
important.

Data Dictionary:

Variable Name    Description
Carat            Carat weight of the cubic zirconia.
Cut              Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color            Colour of the cubic zirconia, with D being the worst and J the best.
Clarity          Clarity refers to the absence of inclusions and blemishes. (In order from worst to best in terms of avg price) IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1
Depth            The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table            The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price            The price of the cubic zirconia.
X                Length of the cubic zirconia in mm.
Y                Width of the cubic zirconia in mm.
Z                Height of the cubic zirconia in mm.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
Sol: Data validation with EDA, followed by univariate and bivariate analysis.
head()
isnull()

describe().T
Univariate and Bivariate Analysis
Skewness of data

Most preferred cut

Plot based on colour and price

Multivariate Analysis
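
The validation steps listed above can be sketched in pandas roughly as follows. The report's data file is not shown, so the small DataFrame here is a hypothetical stand-in with made-up values; only the column names follow the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (assumption): the real cubic zirconia file is not shown.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n).round(2),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
    "color": rng.choice(list("DEFGHIJ"), n),
    "depth": rng.normal(61.7, 1.4, n).round(1),
    "price": rng.integers(300, 18000, n),
})
df.loc[:4, "depth"] = np.nan                            # inject 5 missing depth values
df = pd.concat([df, df.iloc[[10]]], ignore_index=True)  # inject one duplicate row

print(df.head())                                  # first rows
print(df.shape)                                   # (rows, columns)
print(df.dtypes)                                  # data types per column

null_counts = df.isnull().sum()                   # null values per column
n_duplicates = int(df.duplicated().sum())         # fully duplicated rows
summary = df.describe().T                         # descriptive statistics, transposed
skewness = df.select_dtypes("number").skew()      # skewness of numeric columns

# Simple univariate and bivariate views
cut_counts = df["cut"].value_counts()             # most frequent cut
mean_price_by_color = df.groupby("color")["price"].mean()
```

The last two lines correspond to the "most preferred cut" and "colour versus price" views; plotting them is left out to keep the sketch dependency-free.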
1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Check for the
possibility of combining the sub levels of a ordinal variables and take actions
accordingly. Explain why you are combining these sub levels with appropriate
reasoning.

Ans: Based on the output below, every column except depth is free of null values.

isnull()

Checking for Outliers
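
The report does not state how the nulls, zeros and outliers were handled, so the following is only a sketch under assumptions: median imputation for depth, dropping rows where a physical dimension is zero (a 0 mm length, width or height carries no meaning), and capping outliers at 1.5 times the interquartile range. The data and the helper `cap_outliers` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (assumption), with problems injected on purpose.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "depth": rng.normal(61.7, 1.4, 100),
    "x": rng.uniform(4.0, 8.0, 100),
    "y": rng.uniform(4.0, 8.0, 100),
    "z": rng.uniform(2.5, 5.0, 100),
})
df.loc[[3, 7], "depth"] = np.nan   # missing values
df.loc[5, "z"] = 0.0               # physically meaningless zero
df.loc[9, "x"] = 40.0              # extreme outlier

# 1. Impute nulls in depth with the median (assumed choice, robust to outliers).
df["depth"] = df["depth"].fillna(df["depth"].median())

# 2. Drop rows where any dimension is zero.
dims = ["x", "y", "z"]
zero_rows = (df[dims] == 0).any(axis=1)
df = df.loc[~zero_rows].reset_index(drop=True)

# 3. Cap outliers at the IQR whiskers (assumed treatment).
def cap_outliers(col):
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

df[dims] = df[dims].apply(cap_outliers)
```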


1.3 Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from statsmodel. Create multiple models and check
the performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj
Rsquare. Compare these models and select the best one with appropriate reasoning.

Ans: The categorical variables have to be encoded as dummy variables, since linear
regression models don’t accept string-valued categorical variables directly.

Splitting of the data: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
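
A minimal sketch of the encoding and the 70:30 split, assuming one-hot (dummy) encoding with `pd.get_dummies`. The DataFrame is a hypothetical stand-in; `test_size=0.3` and `random_state=1` are the values given in the report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (assumption).
rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
    "clarity": rng.choice(["IF", "VVS1", "VS1", "SI1", "I1"], n),
    "price": rng.uniform(300, 18000, n),
})

# Dummy-encode the string columns; drop_first avoids perfectly collinear dummies.
encoded = pd.get_dummies(df, columns=["cut", "clarity"], drop_first=True, dtype=int)

X = encoded.drop(columns="price")   # predictors
y = encoded["price"]                # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
```

Since cut and clarity are ordinal, mapping their levels to ordered integers would be an alternative to dummies; which one the report used beyond "dummies" is not stated.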
Linear Regression Model:

R Square and RMSE values for training and testing data are as below:
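
As an illustration of how these metrics can be computed, the sketch below fits scikit-learn's `LinearRegression` on synthetic data (an assumption, not the report's data) and derives adjusted R Square from R Square as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors. The helper `regression_scores` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption): 3 predictors with known coefficients.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 5.0 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def regression_scores(model, X, y):
    """Return R Square, RMSE and adjusted R Square for one data split."""
    pred = model.predict(X)
    r2 = r2_score(y, pred)
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    n, p = X.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"r2": r2, "rmse": rmse, "adj_r2": adj_r2}

train_scores = regression_scores(model, X_train, y_train)
test_scores = regression_scores(model, X_test, y_test)
print(train_scores)
print(test_scores)
```

Comparing the train and test dictionaries side by side is one way to check for overfitting when several candidate models are built.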
VIF Values:
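
For reference, a variance inflation factor can be computed directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below does this with scikit-learn on synthetic data rather than through statsmodels, so it is not necessarily how the report obtained its values; the helper `vif` and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors (assumption): x is built from carat, so the two are collinear.
rng = np.random.default_rng(4)
n = 500
carat = rng.uniform(0.2, 2.5, n)
X = pd.DataFrame({
    "carat": carat,
    "x": 6.0 * np.cbrt(carat) + rng.normal(scale=0.05, size=n),
    "table": rng.normal(57, 2, n),
})

def vif(X):
    """VIF per column: regress each predictor on the remaining ones."""
    values = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        values[col] = 1.0 / (1.0 - r2)
    return pd.Series(values, name="VIF")

vif_values = vif(X)
print(vif_values)
```

A large VIF for carat and x alongside a VIF near 1 for table signals that carat and x carry overlapping information, which is the kind of pattern that motivates dropping one of them in a later model.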

Best params summary:


1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.

Business Insights:

Based on the EDA, the ideal cut brings in the maximum profit to the
company, and the colors H, I and J bring in profit whereas the other colors don’t. Additionally,
the fair and good cuts are not bringing any profit to the company.

Recommendations:

The company should focus on the carat and clarity of the stone to increase pricing and thereby
profit. A good customer base and marketing strategy need to be adopted to attract customers
to buy the stones that give more profit.
Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which
deals in selling holiday packages. You are provided details of 872 employees of a company.
Among these employees, some opted for the package and some didn't. You have to help the
company in predicting whether an employee will opt for the package or not on the basis of
the information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

Data Dictionary:

Variable Name        Description
Holiday_Package      Opted for Holiday Package yes/no?
Salary               Employee salary
age                  Age in years
edu                  Years of formal education
no_young_children    The number of young children (younger than 7 years)
no_older_children    Number of older children
foreign              Foreigner yes/no

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Ans: The data was read in, and sample rows can be viewed below:
Data Validation
head()
info()

shape
(872, 8)
describe()

isnull()
Univariate Analysis
Plot Analysis
Inferences:
• Employees over the age of 50 seem less likely to take holiday packages than
younger employees
• Employees with a salary <150000 seem to be taking holiday packages
• Based on the analysis, it looks like only 45% of people are interested in the holiday
package
BIVARIATE ANALYSIS

There is not much difference in the data distribution between employees who opted for the holiday package and those who did not.
Multicollinearity is not seen in the data.
Outliers are visible in the given data.
After Outlier Treatment
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).

Ans: We have encoded the data and split it into train and test sets in the ratio of 70:30,
as shown below:
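
A minimal sketch of this step, assuming the yes/no columns are mapped to 1/0 and no scaling is applied. The DataFrame is a hypothetical stand-in using column names from the data dictionary, and `random_state=1` is carried over from Problem 1 as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (assumption).
rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "Holiday_Package": rng.choice(["yes", "no"], n),
    "Salary": rng.uniform(20000, 200000, n),
    "age": rng.integers(20, 63, n),
    "foreign": rng.choice(["yes", "no"], n),
})

# Encode the string columns as 0/1; numeric columns are left unscaled.
for col in ["Holiday_Package", "foreign"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

X = df.drop(columns="Holiday_Package")   # predictors
y = df["Holiday_Package"]                # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
```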
Logistic Regression Model

Accuracy Scores
Creating LDA Model

Model Scores & Classification report
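
The two models can be fitted and scored along the following lines. This is a sketch on synthetic data (an assumption) with default hyperparameters apart from `max_iter`; the report's actual settings are not shown.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary-target data (assumption): uptake falls with age, rises with education.
rng = np.random.default_rng(6)
n = 600
age = rng.uniform(20, 62, n)
edu = rng.uniform(1, 21, n)
X = np.column_stack([age, edu])
logit = 2.0 - 0.08 * age + 0.15 * edu
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns accuracy for classifiers
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, "train/test accuracy:", scores[name])
    print(classification_report(y_test, model.predict(X_test)))
```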


2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model:
Compare Both the models and write inference which model is best/optimized.

Ans: Accuracy of the data sets:

Confusion Matrix
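
One way to collect accuracy, the confusion matrix and the ROC_AUC score for both models on both splits is sketched below, again on synthetic data (an assumption). The `fpr` and `tpr` arrays returned by `roc_curve` are the points that would be passed to a plotting library to draw the ROC curve.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary-target data (assumption).
rng = np.random.default_rng(6)
n = 600
age = rng.uniform(20, 62, n)
edu = rng.uniform(1, 21, n)
X = np.column_stack([age, edu])
logit = 2.0 - 0.08 * age + 0.15 * edu
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, (X_s, y_s) in splits.items():
        pred = model.predict(X_s)                 # hard class labels
        proba = model.predict_proba(X_s)[:, 1]    # probability of class 1
        fpr, tpr, _ = roc_curve(y_s, proba)       # points of the ROC curve
        results[(name, split)] = {
            "accuracy": accuracy_score(y_s, pred),
            "confusion_matrix": confusion_matrix(y_s, pred),
            "roc_auc": roc_auc_score(y_s, proba),
            "roc_points": (fpr, tpr),
        }

for key, res in results.items():
    print(key, round(res["accuracy"], 3), round(res["roc_auc"], 3))
```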
Checking for the optimal values that give better accuracy
LDA works better when a categorical target variable is present; otherwise, the results of
both models are much the same.
2.4 Inference: Basis on these predictions, what are the insights and
recommendations.

Insights from the analysis are as below:

• Important factors predicting people’s interest in holiday packages are age,
salary and education.
• People above the age of 50 generally don’t prefer the holiday package, while people
with a salary of less than 50k have opted for it.

Recommendations:

• A survey to understand good destinations for people above 50 may help
attract them to take holiday packages.
• Targeting parents with younger children should be avoided, as the conversion rate
seems to be lower.
