Business Report: Predictive Modelling
PREDICTIVE
MODELLING
PHILIP VICTOR
DATE: 27/03/2022
COURSE: PG DSBA
TABLE OF CONTENTS
PROBLEM 1: LINEAR REGRESSION
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of combining the
sub levels of a ordinal variables and take actions accordingly. Explain why you are combining these
sub levels with appropriate reasoning.
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from statsmodel. Create multiple models and check the performance of Predictions on Train
and Test sets using Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one
with appropriate reasoning.
1.4 Inference: Basis on these predictions, what are the business insights and recommendations.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model:
Compare Both the models and write inference which model is best/optimized.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
Problem 1: Linear Regression
You are hired by a company Gem Stones co ltd, which is a cubic zirconia manufacturer. You
are provided with the dataset containing the prices and other attributes of almost 27,000
cubic zirconia (which is an inexpensive diamond alternative with many of the same qualities
as a diamond). The company is earning different profits on different price slots. You have to
help the company in predicting the price for the stone on the basis of the details given in the
dataset so it can distinguish between more profitable stones and less profitable stones so
as to have better profit share. Also, provide them with the best 5 attributes that are most
important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
Sol: Validate the data with EDA, then perform univariate and bivariate analysis.
head()
isnull()
describe().T
Univariate and Bivariate Analysis
Skewness of the data
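The checks listed above can be sketched in pandas as follows. This is only an illustration: the small DataFrame is a hypothetical stand-in for the cubic zirconia data, and the column names and values are assumptions, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cubic zirconia data (names and values assumed).
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42, 0.31, 0.30],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal", "Ideal", "Ideal"],
    "depth": [62.1, 60.8, np.nan, 61.6, 60.4, 62.1],
    "price": [499, 984, 6289, 1082, 779, 499],
})

print(df.head())                          # first rows
print(df.shape)                           # (rows, columns)
print(df.dtypes)                          # data type of each column

null_counts = df.isnull().sum()           # missing values per column
dup_count = int(df.duplicated().sum())    # number of fully duplicated rows
summary = df.describe().T                 # descriptive statistics, one row per numeric column
skewness = df.select_dtypes("number").skew()  # skewness of each numeric column

print(null_counts, dup_count, summary, skewness, sep="\n\n")
```

On the real data, the same calls would be run on the DataFrame read from the supplied file.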
Ans: Based on the output below, every column except depth is free of null values.
isnull()
Ans: Categorical variables have to be encoded as dummy variables, since linear regression
models do not accept categorical (string) inputs.
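A minimal sketch of dummy encoding with pandas is shown here. The DataFrame and its column names are made up for illustration, and dropping the first level of each category is one common choice rather than necessarily the one used in this report.

```python
import pandas as pd

# Hypothetical sample; the real data has more columns and rows.
df = pd.DataFrame({
    "carat": [0.30, 0.90, 0.42, 1.10],
    "cut": ["Ideal", "Premium", "Good", "Ideal"],
    "price": [499, 6289, 1082, 5400],
})

# One 0/1 column per category level; drop_first avoids a redundant column
# that would be perfectly collinear with the intercept.
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True, dtype=int)
print(encoded)
```

For an ordinal variable, mapping the levels to ordered integers is an alternative that keeps the ranking information.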
R-square and RMSE values for the training and testing data are as below:
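For reference, the three metrics can be computed as in the sketch below. It uses synthetic data, not the report's dataset, and the helper name `regression_report` is hypothetical. Adjusted R-square is derived from R-square as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded predictors and the price.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)


def regression_report(model, X, y):
    """Return R-square, RMSE and adjusted R-square for one data split."""
    pred = model.predict(X)
    r2 = r2_score(y, pred)
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    n, p = X.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"r2": r2, "rmse": rmse, "adj_r2": adj_r2}


train_scores = regression_report(model, X_train, y_train)
test_scores = regression_report(model, X_test, y_test)
print(train_scores)
print(test_scores)
```

Train and test scores that are close to each other suggest that the model is not overfitting.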
VIF Values:
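As a reminder of what these values measure, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it directly from that definition with scikit-learn on synthetic data; the helper `vif_table` and the column names are hypothetical. statsmodels also provides `variance_inflation_factor` for the same purpose.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column: regress each column on the remaining columns."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic predictors: "b" is almost a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

vif = vif_table(X)
print(vif)
```

Values well above roughly 5 to 10 are usually read as a sign of strong multicollinearity, so "a" and "b" would be flagged here while "c" would not.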
Business Insights:
Based on the EDA, it is clear that the ideal cut brings in the maximum profit to the
company, and the colors H, I and J bring in profit whereas the other colors do not. Additionally,
the fair and good cuts are not bringing any profit to the company.
Recommendations:
The company should focus on the carat and clarity of the stone to increase pricing and thereby
profit. A good customer base and marketing strategy need to be adopted to attract customers
to buy the stones that give more profit.
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which
deals in selling holiday packages. You are provided details of 872 employees of a company.
Among these employees, some opted for the package and some didn't. You have to help the
company in predicting whether an employee will opt for the package or not on the basis of
the information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.
Data Dictionary:
Variable Name        Description
Holiday_Package      Opted for Holiday Package yes/no?
Salary               Employee salary
age                  Age in years
edu                  Years of formal education
no_young_children    The number of young children (younger than 7 years)
no_older_children    Number of older children
foreign              Foreigner yes/no
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Ans: The data was loaded and sample rows can be viewed below:
Data Validation
head()
info()
shape
(872, 8)
describe()
isnull()
Univariate Analysis
Plot Analysis
Inferences:
• Employees over the age of 50 seem to take holiday packages less often than
younger employees.
• Employees with a salary below 150000 seem to be taking holiday packages.
• Based on the analysis, it looks like only 45% of employees are interested in the holiday
package.
BIVARIATE ANALYSIS
There is not much difference in the data distributions between employees who opted for the holiday package and those who did not.
Multicollinearity is not seen in the data.
Outliers can be seen in the data.
After Outlier Treatment
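The report does not state which outlier treatment was applied. One common option, assumed here purely for illustration, is to cap values at 1.5 times the interquartile range beyond the quartiles. The helper name `cap_outliers_iqr` and the salary values are made up.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the first/third quartile."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up salaries with one extreme value at the end.
salary = pd.Series([30000, 35000, 40000, 42000, 45000, 50000, 236961])
capped = cap_outliers_iqr(salary)
print(capped)
```

Capping keeps every row in the data, which matters for a dataset of only 872 records; dropping the outlying rows would be the alternative.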
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
Ans: We have split the data into train and test sets in the ratio 70:30, and the data has been
encoded as below.
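A minimal sketch of the encoding and the 70:30 split follows. The rows are invented for illustration and only a few of the columns from the data dictionary are included. Mapping yes/no to 1/0 is an assumed encoding, and `random_state=1` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows using some of the column names from the data dictionary.
df = pd.DataFrame({
    "Holiday_Package": ["yes", "no", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
    "Salary": [48412, 37207, 58022, 66503, 66734, 61590, 94344, 35987, 41140, 35826],
    "age": [30, 45, 46, 31, 44, 42, 51, 32, 39, 43],
    "foreign": ["no", "yes", "no", "no", "no", "yes", "no", "yes", "no", "no"],
})

# Encode the two string columns as 0/1; no scaling is applied.
for col in ["Holiday_Package", "foreign"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

X = df.drop(columns="Holiday_Package")
y = df["Holiday_Package"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```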
Logistic Regression Model
Accuracy Scores
Creating LDA Model
Confusion Matrix
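The modelling steps above can be sketched as follows. The example fits both models on synthetic data rather than the holiday package data, with default hyperparameters apart from `max_iter`, so the numbers it prints say nothing about the report's results.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the encoded dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)              # hard class labels
    prob = model.predict_proba(X_test)[:, 1]  # probability of class 1, used for ROC AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
        "roc_auc": roc_auc_score(y_test, prob),
    }

for name, scores in results.items():
    print(name, scores["accuracy"], scores["roc_auc"])
    print(scores["confusion_matrix"])
```

Repeating the same metrics on the training split shows whether either model overfits.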
Checking the optimal values that give better accuracy
LDA works better when a categorical target variable is present; otherwise the results of both
models are much the same.
2.4 Inference: Basis on these predictions, what are the insights and
recommendations.
Recommendations: