
PM Guided Project Sample Business Report

The document outlines a predictive modeling project focused on developing a pricing strategy for used and refurbished phones and tablets, driven by the growing market demand. It includes sections on problem statement, dataset loading, exploratory data analysis, data preprocessing, model building, and actionable insights. The project aims to utilize machine learning to predict prices based on various device attributes, leveraging a dataset collected in 2021.

Uploaded by

rachelsam11

Predictive Modelling

Project


BUSINESS REPORT

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
Table of Contents
List of Tables
List of Figures
1.1 Problem Statement
1.2 Loading the Dataset and Data Overview
1.3 Exploratory Data Analysis
1.4 Data Preprocessing
1.5 Model Building
1.6 Final Model Building
1.7 Actionable Insights and Recommendations

List of Tables
Table 1: First 5 rows of the dataset
Table 2: Last 5 rows of the dataset
Table 3: Information about the dataset
Table 4: Datatypes of the variables in the dataset
Table 5: Unique values in the dataset
Table 6: Statistical Summary of the dataset
Table 7: No. of missing values in the dataset
Table 8: Feature Engineering
Table 9: Linear Regression - Model Summary
Table 10: Performance Metrics for Linear regression Training data
Table 11: Performance Metrics for Linear regression Test data
Table 12: VIF values for the test of Multicollinearity
Table 13: Performance Metrics after removing the variables
Table 14: Linear Regression - Final Model Summary
Table 15: Performance Metrics for Linear regression Training data
Table 16: Performance Metrics for Linear regression Test data


List of Figures
Fig 1: Boxplot and Histogram of the target variable (normalized_used_price)
Fig 2: Boxplot and Histogram of normalized_new_price
Fig 3: Boxplot and Histogram of int_memory
Fig 4: Boxplot and Histogram of ram
Fig 5: Barplot of OS
Fig 6: Barplots of all the brand_name
Fig 7: Lineplot of the release_year vs normalized_used_price
Fig 8: Boxplot of the brand_name vs weight
Fig 9: Boxplot of the brand_name vs ram
Fig 10: Correlation Heatmap
Fig 11: Boxplot to identify outliers in the dataset
Fig 12: Plot for checking linearity of variables
Fig 13: Plot for checking Normality of residuals
Fig 14: Q-Q plot of residuals
1.1 Problem Statement

1.1.1 Context
Buying and selling used phones and tablets used to be something that happened on a handful
of online marketplace sites. But the used and refurbished device market has grown considerably
over the past decade, and a new IDC (International Data Corporation) forecast predicts that the
used phone market would be worth $52.7bn by 2023 with a compound annual growth rate
(CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for
used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers
and businesses that are looking to save money when purchasing one. There are plenty of other
benefits associated with the used device market. Used and refurbished devices can be sold with
warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such
as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices.
Maximizing the longevity of devices through second-hand trade also reduces their
environmental impact and helps in recycling and reducing waste. The impact of the COVID-19
outbreak may further boost this segment as consumers cut back on discretionary spending and
buy phones and tablets only for immediate needs.

1.1.2 Objective
The rising potential of this comparatively under-the-radar market fuels the need for an
ML-based solution to develop a dynamic pricing strategy for used and refurbished devices.
ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist.
They want you to analyze the data provided and build a linear regression model to predict the
price of a used phone/tablet and identify factors that significantly influence it.

1.1.3 Data Dictionary

The data contains the different attributes of used/refurbished phones and tablets. The data was
collected in the year 2021. The detailed data dictionary is given below.

● brand_name: Name of manufacturing brand
● os: OS on which the device runs
● screen_size: Size of the screen in cm
● 4g: Whether 4G is available or not
● 5g: Whether 5G is available or not
● main_camera_mp: Resolution of the rear camera in megapixels
● selfie_camera_mp: Resolution of the front camera in megapixels
● int_memory: Amount of internal memory (ROM) in GB
● ram: Amount of RAM in GB
● battery: Energy capacity of the device battery in mAh
● weight: Weight of the device in grams
● release_year: Year when the device model was released
● days_used: Number of days the used/refurbished device has been used
● normalized_new_price: Normalized price of a new device of the same model in euros
● normalized_used_price: Normalized price of the used/refurbished device in euros


1.2 Loading the Dataset and Data Overview

The dataset provided was loaded into a pandas dataframe for the analysis.

1.2.1 Getting the first 5 and last 5 rows



The dataset has been loaded successfully. It has 3454 rows and 15 columns.

● First 5 rows of the dataset:



Table 1: First 5 rows of the dataset

● Last 5 rows of the dataset:

Table 2: Last 5 rows of the dataset
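
The loading step can be sketched roughly as follows. The report does not give the file name, so a tiny in-memory CSV with made-up rows stands in for the real file; only the column names are taken from the data dictionary, and the brand labels and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. With the actual data this would be
# a call such as pd.read_csv("<path to the dataset>").
csv_text = """brand_name,os,screen_size,ram,normalized_used_price
BrandA,Android,14.5,3.0,4.31
BrandB,Android,17.3,8.0,5.16
BrandC,iOS,15.3,4.0,5.02
BrandA,Others,10.2,0.25,3.10
BrandB,Android,16.2,4.0,4.65
BrandC,Android,15.8,6.0,4.90
"""
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape  # (number of rows, number of columns)
first_rows = df.head()     # first 5 rows
last_rows = df.tail()      # last 5 rows
print(f"{n_rows} rows, {n_cols} columns")
```

On the real dataset, `df.shape` would be expected to report the 3454 rows and 15 columns stated above.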

1.2.2 Information about the dataset


Table 3: Information about the dataset

● There are 15 variables in the dataset with no null values.



1.2.3 Data types of the variables

Data type    No. of variables
object       4
float64      9
int64        2

Table 4: Datatypes of the variables in the dataset

1.2.4 Unique values in the attributes of the dataset

Table 5: Unique values in the dataset

● Among the variables in the dataset, normalized_used_price, normalized_new_price, and
days_used have the highest counts of unique values.

1.2.5 Check for Duplicate Records


● No duplicate records exist in this dataset.
1.2.6 Statistical Description of the Dataset
● A description of the first few columns of the dataset is given below:

Table 6: Statistical Summary of the dataset
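
The overview checks in sections 1.2.3 to 1.2.6 map onto a handful of pandas calls. The sketch below uses a small made-up frame (the values and brand labels are hypothetical; only the column names come from the data dictionary), so its numbers will not match the tables in this report.

```python
import pandas as pd

# Made-up rows; column names follow the data dictionary.
df = pd.DataFrame({
    "brand_name": ["BrandA", "BrandB", "BrandA", "BrandC"],
    "os": ["Android", "Android", "iOS", "Android"],
    "ram": [4.0, 4.0, 6.0, 2.0],
    "days_used": [127, 325, 162, 345],
})

dtype_counts = df.dtypes.astype(str).value_counts()  # number of variables per data type
unique_counts = df.nunique()                          # unique values per column
n_duplicates = int(df.duplicated().sum())             # count of fully duplicated rows
missing_counts = df.isnull().sum()                    # missing values per column
summary = df.describe(include="all").T                # statistical summary, one row per column
```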

● The weight ranges from 69g to 855g. This does not seem incorrect, as the data contains
feature phones and tablets too.
● Android is the most common OS for the used devices.
● There are 33 brands in the data, plus a category Others.
● There are a few unusually low values for the internal memory and RAM of used
devices, but those are likely due to the presence of feature phones in the data.
● The normalized price of the used devices ranges from 1.53 to 6.61 euros, with an
average of 4.36 euros.
● The normalized price of the new devices ranges from 2.91 to 7.85 euros, with an
average of 5.25 euros.
● Devices have been used for 674 days on average, ranging from 1 to 2081 days.

1.3 Exploratory Data Analysis

1.3.1 Univariate Analysis


In this section, we will analyze the distribution of the individual variables. It will help us identify
patterns among the variables and the effects they may have on the target variable.

First, let us see how the target variable (normalized_used_price) is distributed.
Fig 1: Boxplot and Histogram of the target variable (normalized_used_price)

From the above boxplot and histogram, we can see that the normalized_used_price variable is
almost normally distributed. Next, we plot the boxplot and histogram of each independent
variable to analyze its distribution.
Fig 2: Boxplot and Histogram of normalized_new_price

● The normalized prices of new device models are almost normally distributed.
Fig 3: Boxplot and Histogram of int_memory

● Few devices offer more than 256GB internal memory.
Fig 4: Boxplot and Histogram of ram

● Most of the devices offer 4GB RAM and very few offer greater than 8GB RAM.

From the above boxplots, it is seen that the variables are not evenly distributed. We can also
observe that there are numerous outliers present in the data.

Let's analyze the distribution of the categorical variables using bar plots. A bar plot displays
the frequency count of each category, which makes it easy to compare categories and see
which ones dominate.
Fig 5: Barplot of OS

● Android devices make up ~93% of the used devices in the data.
Fig 6: Barplots of all the brand_name

● Samsung has the most devices in the data, followed by Huawei and LG.
● 14.5% of the devices in the data are from brands other than the listed ones.
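
Percentages like the ones quoted for the bar plots can be read off a normalized frequency table. The following is a minimal sketch on a made-up `os` column (the 80/10/10 split is invented for illustration and is not the split in the real data).

```python
import pandas as pd

# Made-up categories standing in for the os column.
os_col = pd.Series(["Android"] * 8 + ["iOS", "Others"], name="os")

# Share of each category in percent, sorted from most to least frequent.
# These are the bar heights a percentage bar plot would show.
share_pct = os_col.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```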

1.3.2 Bivariate Analysis:


For Bivariate analysis, we can plot the boxplots of independent variables with respect to the
target variable. This will help us analyze the contribution of variables in determining the trends
and patterns of the used phone prices.


Fig 7: Lineplot of the release_year vs normalized_used_price

● The normalized used price increases with the release year: more recently released devices
fetch higher prices.

Boxplots are a good visual technique for comparing how a numerical variable is distributed
across categories, as seen below for some of the independent variables:

Fig 8: Boxplot of the brand_name vs weight

● A lot of brands offer devices that are not very heavy but have a large battery capacity.

● Some devices offered by brands like Vivo, Realme, Motorola, etc. weigh just about 200g
but offer great batteries.
● Some devices offered by brands like Huawei, Apple, Sony, etc. offer great batteries but
are heavier.

Fig 9: Boxplot of the brand_name vs ram

● 50% of the devices offered by most of the companies have 4GB of RAM.
● 50% of OnePlus devices have 6GB or more RAM, indicating that OnePlus devices offer
more RAM in general.

1.3.3 Correlation Analysis:


The heatmap of the correlation matrix can give a very good idea of the correlations between
the independent variables and the dependent variable.
Fig 10: Correlation Heatmap

● The normalized used device price is highly correlated with the normalized price of a new
device model.
○ This makes sense as the price of a new model is likely to affect the used device
price.
● The normalized used device price is also moderately correlated with the resolution of
the cameras provided, the size of the screen, and the battery capacity.
● The weight, screen size, and battery capacity of a device show a good amount of
correlation.
○ This makes sense as larger battery capacity requires bigger space, thereby
increasing screen size and weight.
● The number of days a device is used is negatively correlated with the resolution of its
front camera.
○ This makes sense as older devices did not offer as powerful front cameras as the
recent ones.
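
The correlation matrix behind such a heatmap can be computed and ranked against the target as sketched below. The data is synthetic and built so that the used price depends on the new price; the coefficients and noise levels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
new_price = rng.normal(5.2, 0.7, n)

# Synthetic frame: used price is constructed to follow new price plus noise.
df = pd.DataFrame({
    "normalized_new_price": new_price,
    "normalized_used_price": 0.8 * new_price + rng.normal(0, 0.2, n),
    "days_used": rng.integers(90, 1100, n),
    "brand_name": rng.choice(["BrandA", "BrandB"], n),
})

# Pearson correlations between the numeric columns only.
corr = df.select_dtypes(include="number").corr()

# Correlation of every numeric feature with the target, strongest first.
target_corr = (
    corr["normalized_used_price"]
    .drop("normalized_used_price")
    .sort_values(ascending=False)
)
print(target_corr)
```

A heatmap is then just a colored rendering of `corr`, for example with a plotting library of choice.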

1.4 Data Preprocessing

1.4.1 Missing Values Detection and Treatment


Missing values are the values or data that are not stored (or not present) for some variable/s in
the given dataset. They are usually represented as NaN, null, or None in the dataset.

We can check the number of missing values in the dataset.

Table 7: No. of missing values in the dataset

● For columns `main_camera_mp`, `selfie_camera_mp`, `int_memory`, `ram`, `battery`,
and `weight`, missing values were imputed using the median values grouped by
`release_year` and `brand_name`.
● For columns `main_camera_mp`, `selfie_camera_mp`, `battery`, and `weight`, any
remaining missing values were imputed using the median values grouped by
`brand_name`.
● The remaining missing values in the `main_camera_mp` column were filled with the
overall median value of the column.
● After each imputation step, the data was checked to ensure all missing values were
addressed.
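
The staged imputation described above can be sketched with `groupby(...).transform("median")`. The frame below is made up so that each stage fills at least one value; it illustrates the pattern on `main_camera_mp` and is not the report's actual code.

```python
import numpy as np
import pandas as pd

# Made-up rows chosen so that every imputation stage has something to fill.
df = pd.DataFrame({
    "brand_name":     ["A", "A", "A", "B", "B", "C"],
    "release_year":   [2019, 2019, 2020, 2019, 2020, 2020],
    "main_camera_mp": [13.0, np.nan, np.nan, 8.0, np.nan, np.nan],
})
col = "main_camera_mp"

# Stage 1: median within each (release_year, brand_name) group.
df[col] = df[col].fillna(
    df.groupby(["release_year", "brand_name"])[col].transform("median")
)
# Stage 2: median within each brand_name group.
df[col] = df[col].fillna(df.groupby("brand_name")[col].transform("median"))
# Stage 3: overall median of the column.
df[col] = df[col].fillna(df[col].median())

remaining_missing = int(df[col].isnull().sum())  # checked after imputation
```

A group with no observed values has a NaN median, which is why later, coarser stages are needed.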
1.4.2 Feature Engineering
We will create a new column, `years_since_release`, calculated from the `release_year` column
by using 2021 as the baseline year. After creating this new column, we will drop the
`release_year` column from the dataset.
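
A minimal sketch of this feature engineering step, on three made-up release years:

```python
import pandas as pd

BASELINE_YEAR = 2021  # year in which the data was collected, per the report

# Made-up release years for illustration.
df = pd.DataFrame({"release_year": [2013, 2016, 2020]})

# Age of the device model relative to the baseline year.
df["years_since_release"] = BASELINE_YEAR - df["release_year"]

# The original column is no longer needed once the new feature exists.
df = df.drop(columns="release_year")
```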

Table 8: Feature Engineering

● The median of `years_since_release` is 5.5 years: half of the used devices in the data were
released more than five years before the 2021 baseline.

1.4.3 Outliers Check


We can check the outliers present in the given dataset using the boxplots. It helps us identify
data points that stand out from the rest of the data.

Conventionally, outliers are identified based on the interquartile range (IQR) as follows:

● Q1 – 25th Percentile
● Q3 – 75th Percentile
● IQR = Q3 – Q1
● Lower outlier = Value < Q1 – 1.5 * IQR
● Upper outlier = Value > Q3 + 1.5 * IQR
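
The IQR rule can be sketched in pandas as follows, using the conventional fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The helper name `count_iqr_outliers` and the sample weights are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values lying outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return int(((values < lower_fence) | (values > upper_fence)).sum())

# Made-up weights in grams; 855 is far above the rest.
weights = pd.Series([140, 150, 155, 160, 165, 170, 180, 855], name="weight")
n_outliers = count_iqr_outliers(weights)
print(n_outliers)  # 1
```

Applying the helper column by column, for example with `df.select_dtypes("number").apply(count_iqr_outliers)`, gives one outlier count per numeric variable.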

Based on this definition, the number of outliers in the dataset is as follows:

Here are the boxplots where we have checked the number of outliers in each column of the
dataset.


Fig 11: Boxplot to identify outliers in the dataset

However, because these outliers are numerous and are genuine values, we will not treat them
and will leave them in the dataset.

1.4.4 Data Preparation for Modeling


As stated in the problem, the dataset has to be split in a 70:30 ratio for train and test sets. We
also set a random seed to make the results reproducible.

Based on the splits done, we end up with 2417 data points in the train set and 1037 data points
in the test set. Both the train and test sets have 49 attributes for modeling purposes.
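
A minimal sketch of the preparation step, assuming scikit-learn's `train_test_split` and one-hot encoding with `pd.get_dummies`; the report does not show which functions or which seed were used, so the encoding call and `random_state=1` are assumptions. The data is synthetic but has the same number of rows as the dataset, so the split sizes can be compared with the counts above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 3454  # number of rows reported for the dataset

# Synthetic stand-in with one categorical and one numeric predictor.
df = pd.DataFrame({
    "os": rng.choice(["Android", "iOS", "Others"], n),
    "ram": rng.choice([2.0, 4.0, 6.0], n),
    "normalized_used_price": rng.normal(4.4, 0.6, n),
})

X = df.drop(columns="normalized_used_price")
y = df["normalized_used_price"]

# One-hot encode the categorical column, dropping the first level as the reference.
X = pd.get_dummies(X, columns=["os"], drop_first=True, dtype=float)

# 70:30 split; the seed value is an assumption made for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(x_train.shape[0], x_test.shape[0])  # 2417 1037
```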

1.5 Model Building

1.5.1 Linear Regression:

A constant term is added to the independent variable matrix to account for the intercept in the
linear regression model.

The linear regression model is used to predict the prices of used and refurbished devices,
which in turn supports the pricing strategy.

We use the statsmodels library for building the linear regression model. The statsmodels library
offers detailed statistical inference and diagnostics for linear regression, aiding in hypothesis
testing and interpretation, while sklearn focuses more on predictive modeling without the same
level of statistical analysis.

Here is what the model summary looks like after the model is built:

Table 9: Linear Regression - Model Summary

● Both the R-squared and Adjusted R-squared of our model are greater than 0.8,
indicating that it can explain more than 80% of the variance in the normalized price of
used phones.
● This is a clear indication that we have been able to create a very good model which is
not underfitting the data.
● To be able to make statistical inferences from our model, we will have to test that the
linear regression assumptions are followed.

1.5.2 Model Evaluation Criterion:

Since we are interested in predicting the price of the used/refurbished device in euros, the
performance metrics utilized are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE),
and Mean Absolute Percentage Error (MAPE). These metrics are used to assess how accurately
the model predicts the prices of used and refurbished devices.

● Root Mean Squared Error (RMSE):
○ RMSE is a measure of the average magnitude of the residuals or prediction errors
between the actual and predicted values. It provides insight into how well the
model's predictions align with the actual data.

● Mean Absolute Error (MAE):
○ MAE is the average of the absolute differences between the actual values and
the predicted values. It offers a more straightforward understanding of prediction
errors without considering their direction.

● Mean Absolute Percentage Error (MAPE):
○ MAPE calculates the average percentage difference between the actual and
predicted values. It provides a relative measure of the prediction accuracy,
especially when dealing with varying scales of data.
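
The three metrics can be written out directly with NumPy. This is a sketch with hypothetical helper names and three made-up observations, not the evaluation code used for the tables in this report.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Three made-up observations; residuals are -0.2, 0.5, and -0.1.
actual = [4.0, 5.0, 2.0]
predicted = [4.2, 4.5, 2.1]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

Because RMSE squares the residuals before averaging, the single 0.5 error pulls it above the MAE on the same data.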

These metrics quantify the model's performance and its ability to predict used device prices
accurately. By analyzing RMSE, MAE, and MAPE, stakeholders can gauge the model's
predictive capabilities and make informed decisions regarding the pricing of the used
smartphone devices.

By incorporating these evaluation criteria, the linear regression model's success in predicting
used smartphone prices can be assessed comprehensively, checking that the predictions align
closely with the actual prices.

1.5.2.1 Linear Regression Model - Training Performance:

We need to evaluate the model performance on Training and Test data to see how the linear
regression model is performing. To evaluate the model performance, we will check the
evaluation metrics.

Training Performance Metrics:

Table 10: Performance Metrics for Linear regression Training data

1.5.2.2 Linear Regression Model - Test Performance


Testing Performance Metrics:

Table 11: Performance Metrics for Linear regression Test data



1.5.2.3 Linear Regression Model - Performance Observations:

● RMSE and MAE of train and test data are very close, which indicates that our model is
not overfitting the train data.

● MAE indicates that our current model can predict normalized used phone prices within a
mean error of ~0.24 euros on test data.

● The RMSE values are higher than the MAE values as the squares of residuals penalize the
model more for larger prediction errors.

● MAPE of ~4.5 on the test data indicates that the model can predict within ~4.5% of the
normalized used phone price.

1.5.3 Checking Linear Regression Assumptions
We will be checking the following Linear Regression assumptions:

1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity

1.5.3.1 Test for Multicollinearity

We will check for multicollinearity using variance_inflation_factor from the statsmodels library.

● If VIF is 1, then there is no correlation between the 𝑘th predictor and the remaining
predictor variables (indicating no multicollinearity).
● If VIF exceeds 5 or is close to exceeding 5, it suggests moderate multicollinearity.
● A VIF of 10 or higher indicates high multicollinearity.

The table below shows the VIF of all the features


Table 12: VIF values for the test of Multicollinearity

● Some of the numerical variables show high multicollinearity
● We will ignore the VIF for the constant and the dummy variables

To remove multicollinearity:

1. Drop, one at a time, each column that has a VIF score greater than 5, and refit the model without it.
2. Look at the adjusted R-squared and RMSE of all these models.
3. Drop the variable that makes the least change in the adjusted R-squared.
4. Check the VIF scores again.
5. Continue till all VIF scores are under 5.

We will treat multicollinearity by removing the high-VIF column whose exclusion affects model
performance the least, comparing the adjusted R-squared and RMSE values for each exclusion.

Table 13: Performance Metrics after removing the variables

● The column we are dropping is screen_size as it has VIF values greater than 5 and
it makes the least change in the adjusted R-squared.
● We are not dropping the categorical variables here because VIF is primarily used
to detect multicollinearity in continuous variables, not categorical ones.

The constant value has a high VIF, but we will retain it because we included it to account for the
intercept in the regression model.

1.5.3.2 Dropping high P-value variables

We will drop the predictor variables having a p-value greater than 0.05 as they do not
significantly impact the target variable.

But sometimes p-values change after dropping a variable. So, we'll not drop all variables at
once.

Instead, we will do the following:

● Build a model, check the p-values of the variables, and drop the column with the highest
p-value.
● Create a new model without the dropped feature, check the p-values of the variables,
and drop the column with the highest p-value.
● Repeat the above two steps till there are no columns with a p-value > 0.05.

The above process can also be done manually by picking one variable at a time that has a high
p-value, dropping it, and building a model again. But that might be a little tedious and using a
loop will be more efficient.

The features retained because their p-values do not exceed 0.05 are const,
main_camera_mp, selfie_camera_mp, ram, weight, normalized_new_price,
years_since_release, brand_name_Karbonn, brand_name_Samsung, brand_name_Sony,
brand_name_Xiaomi, os_Others, os_iOS, 4g_yes, and 5g_yes.

1.5.3.3 Re-building the model



Here is what the model summary looks like after the model is re-built:

Table 12: Linear Regression - Re-build Model Summary

Training Performance Metrics:



Table 13: Performance Metrics for Linear regression Training data



Testing Performance Metrics:

Table 14: Performance Metrics for Linear regression Testing data

Removing predictor variables with high p-values did not detrimentally impact the model's
performance, indicating these variables have minimal influence on the target variable.

Now we'll check for the rest of the assumptions on the new model.

1.5.3.4 Test for Linearity and Independence

We will test for linearity and independence by making a plot of fitted values vs residuals and
checking for patterns. If there is no pattern, then we say the model is linear and residuals are
independent. Otherwise, the model shows signs of non-linearity and residuals are not
independent.


Fig 12: Plot for checking linearity of variables

We see no pattern in the plot above. Hence, the assumptions of linearity and independence are
satisfied.

1.5.3.5 Test for Normality

We will test for normality by checking the distribution of residuals, by checking the Q-Q plot of
residuals, and by using the Shapiro-Wilk test. If the residuals follow a normal distribution, the
points in the Q-Q plot will fall along a straight line; otherwise they will not. If the p-value of the
Shapiro-Wilk test is greater than 0.05, we can say the residuals are normally distributed.
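
The Shapiro-Wilk decision rule can be sketched with `scipy.stats.shapiro`. The two samples below are synthetic stand-ins for model residuals (one drawn from a normal distribution, one strongly skewed) and are not the residuals of the model in this report.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for residuals.
normal_residuals = rng.normal(0, 0.23, 300)
skewed_residuals = rng.exponential(1.0, 300)

# shapiro returns the W statistic and the p-value of the test.
w_normal, p_normal = stats.shapiro(normal_residuals)
w_skewed, p_skewed = stats.shapiro(skewed_residuals)

# Decision rule from the text: p-value > 0.05 means normality is not rejected.
skewed_rejected = p_skewed < 0.05
print(p_normal, p_skewed)
```

With large samples the test becomes sensitive to small departures from normality, so it is usually read together with the histogram and the Q-Q plot.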

Fig 13: Plot for checking Normality of residuals

The histogram of residuals almost has a bell-shaped structure. Let's check the Q-Q plot.

Fig 14: Q-Q plot of residuals

The residuals almost follow a straight line.

Let's check the results of the Shapiro-Wilk test.

● Since the p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
● Strictly speaking, the residuals are not normal. However, the histogram is nearly bell-shaped
and the Q-Q plot is close to a straight line, so as an approximation, we can accept this
distribution as close to being normal.
● So, the assumption is treated as satisfied.
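
The Q-Q check above can be sketched with the Python standard library alone. This is an illustration on synthetic residuals, not the report's code or data, and the function names `qq_points` and `pearson` are hypothetical. It pairs each sorted, standardized residual with the matching standard normal quantile and measures how close the pairs are to a straight line with a correlation coefficient. The Shapiro-Wilk test itself is available as `scipy.stats.shapiro`; it is left out here only to keep the sketch dependency-free.

```python
import random
from statistics import NormalDist, fmean, pstdev


def qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(residuals)
    mu, sigma = fmean(residuals), pstdev(residuals)
    sample = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


# Synthetic residuals (assumptions for illustration only).
rng = random.Random(1)
normal_resid = [rng.gauss(0, 0.25) for _ in range(1000)]
skewed_resid = [rng.expovariate(1.0) for _ in range(1000)]

r_normal = pearson(qq_points(normal_resid))
r_skewed = pearson(qq_points(skewed_resid))

print(f"Q-Q correlation, normal residuals: {r_normal:.4f}")
print(f"Q-Q correlation, skewed residuals: {r_skewed:.4f}")
```

A correlation very close to 1 corresponds to points that "almost follow a straight line"; clearly skewed residuals give a visibly lower value.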

1.5.3.6 Test for Homoscedasticity

We will test for homoscedasticity by using the Goldfeld-Quandt test. If we get a p-value greater
than 0.05, we can say that the residuals are homoscedastic. Otherwise, they are
heteroscedastic.

Since the p-value is greater than 0.05, the residuals are homoscedastic. So, the assumption is
satisfied.
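
To show what the Goldfeld-Quandt test compares, here is a minimal sketch of its F statistic for a single predictor, using synthetic data. The report does not say which implementation produced its p-value, so the function names and the data below are assumptions. The idea is to order the observations, fit a separate regression on the lower and upper halves, and take the ratio of their residual variances. A ratio near 1 is consistent with homoscedasticity. Turning the ratio into a p-value requires the F distribution, which the standard library lacks; statsmodels offers `het_goldfeldquandt` for the full test.

```python
import random
from statistics import fmean


def rss_simple_ols(x, y):
    """Residual sum of squares of a one-predictor least-squares fit."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def goldfeld_quandt_f(x, y):
    """Ratio of residual variances: upper half over lower half, ordered by x."""
    pairs = sorted(zip(x, y))
    half = len(pairs) // 2
    low, high = pairs[:half], pairs[half:]
    x_low, y_low = zip(*low)
    x_high, y_high = zip(*high)
    k = 2  # parameters estimated in each sub-regression (slope and intercept)
    var_low = rss_simple_ols(x_low, y_low) / (len(low) - k)
    var_high = rss_simple_ols(x_high, y_high) / (len(high) - k)
    return var_high / var_low


# Synthetic data (assumptions for illustration only).
rng = random.Random(2)
x = [rng.uniform(0, 10) for _ in range(1000)]
y_constant_noise = [2.0 + 0.5 * a + rng.gauss(0, 0.3) for a in x]
y_growing_noise = [2.0 + 0.5 * a + rng.gauss(0, 0.05 * a) for a in x]

f_homo = goldfeld_quandt_f(x, y_constant_noise)
f_hetero = goldfeld_quandt_f(x, y_growing_noise)

print(f"F statistic, constant noise: {f_homo:.2f}")
print(f"F statistic, growing noise:  {f_hetero:.2f}")
```

With constant noise the statistic stays close to 1; when the noise grows with the predictor it is several times larger, which is the situation the test flags as heteroscedastic.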

All the assumptions of linear regression are satisfied. Let's rebuild our final model, check its
performance, and draw inferences from it.

1.6 Final Model Building

1.6.1 Linear Regression Model:



Here is what the final model summary looks like after the model is built:


Table 14: Linear Regression - Final Model Summary



1.6.2 Linear Regression Model - Training Performance:


Training Performance Metrics:

Table 15: Performance Metrics for Linear regression Training data

1.6.3 Linear Regression Model - Testing Performance:
Testing Performance Metrics:

Table 16: Performance Metrics for Linear regression test data

We can see that the performance metrics on the test data do not differ significantly from those
on the training data, which suggests the model is not overfitting.
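
For reference, metrics of the kind compared above can be computed as follows. This is a hedged sketch: the contents of Tables 15 and 16 are not reproduced in this text, so the choice of metrics (RMSE, MAE, R-squared, adjusted R-squared, MAPE), the function name `regression_metrics`, and the toy values are all assumptions. The same function would be called once on the training data and once on the test data.

```python
from math import sqrt


def regression_metrics(actual, predicted, n_predictors):
    """Common regression metrics for one set of actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation
    r2 = 1 - ss_res / ss_tot
    return {
        "RMSE": sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R-squared": r2,
        "Adj. R-squared": 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
        # MAPE is undefined if any actual value is zero.
        "MAPE": 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
    }


# Toy values (assumptions for illustration, not the report's data).
actual = [4.0, 4.5, 5.0, 5.5, 6.0]
predicted = [4.1, 4.4, 5.2, 5.4, 6.1]

metrics = regression_metrics(actual, predicted, n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing the two resulting dictionaries side by side is one way to confirm that training and test performance are close.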

1.7 Actionable Insights and Recommendations
● The model explains ~84% of the variation in the data and can predict the normalized
used device price within ~4.6%, so it is good for predictive purposes.

● The most significant predictors of the normalized used device price are the normalized
price of a new device of the same model, the weight of the device, the resolution of the
rear and front cameras, the years since the original release of the device, the amount of
RAM, and the availability of 4G and 5G networks.

○ A unit increase in the normalized new model price will result in a 0.4415 unit
increase in the normalized used device price, all other variables held constant
○ A unit increase in the amount of RAM will result in a 0.0207 unit increase in the
normalized used device price, all other variables held constant
○ A unit increase in the years since the original release of the device will result in a
0.0292 unit decrease in the normalized used device price, all other variables held
constant
○ A unit increase in the resolution of the front camera will result in a 0.0138 unit
increase in the normalized used device price, all other variables held constant
○ A unit increase in the resolution of the rear camera will result in a 0.0210 unit
increase in the normalized used device price, all other variables held constant
○ The normalized used device price for devices with 4G connectivity will be 0.0502
units more than those without 4G connectivity

● ReCell should look to attract people who want to sell used phones and tablets that were
originally released in recent years and have good front and rear camera resolutions.

● Devices with more RAM and 4G connectivity are also good candidates for reselling to
certain customer segments.

● ReCell should also try to source and list devices whose new models carry a high price, as
this may increase revenue.

● Additional data regarding customer demographics (age, gender, income, etc.) can be
collected and analyzed to gain better insights into the preferences of customers across
different price segments.

● ReCell can also look to sell other used gadgets, like smartwatches, which might attract
certain segments of customers.
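
The coefficient interpretations listed in the insights above can be turned into a small calculation. The sketch below uses only the coefficient values quoted in this section; the dictionary keys and the function name `predicted_change` are hypothetical labels, and the intercept and remaining predictors are omitted because they are not quoted here. It therefore gives the change in the normalized used device price for a change in predictors, all other variables held constant, not a full price prediction.

```python
# Coefficients as quoted in the insights; the key names are hypothetical labels.
COEFFICIENTS = {
    "normalized_new_price": 0.4415,
    "ram": 0.0207,
    "years_since_release": -0.0292,
    "front_camera_resolution": 0.0138,
    "rear_camera_resolution": 0.0210,
    "has_4g": 0.0502,
}


def predicted_change(deltas):
    """Change in normalized used price for the given predictor changes.

    `deltas` maps a predictor name to its change in units; predictors that
    are not listed are treated as held constant.
    """
    unknown = set(deltas) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"no coefficient quoted for: {sorted(unknown)}")
    return sum(COEFFICIENTS[name] * delta for name, delta in deltas.items())


# A device released one year earlier, everything else equal.
one_year_older = predicted_change({"years_since_release": 1})

# Two more units of RAM plus 4G connectivity, everything else equal.
more_ram_and_4g = predicted_change({"ram": 2, "has_4g": 1})

print(f"One year older:      {one_year_older:+.4f}")
print(f"More RAM and has 4G: {more_ram_and_4g:+.4f}")
```

Because the model is linear, the effects simply add up, which is why each statement in the list carries the "all other variables held constant" qualifier.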
