GStore Revenue Prediction Analysis

The document analyzes customer transaction data from the Google Merchandise Store to predict customer revenue, exploring features like device and date to find patterns, building linear regression, lightGBM, MLP and CNN models with the CNN achieving the best performance and lowest RMSE of 0.30, and including screenshots of exploring features and a prediction interface using one of the models.

Uploaded by

Rute Lopes

Google Merchandise Store Data Analysis

-- Google Analytics Customer Revenue Prediction

Team Members:
Ziyu Gu (zg2305)
Jingwei Han (jh4021)
Xiaoshu Cao (xc2418)
Motivation
For many businesses, a small percentage of customers produces most of the revenue. Targeting promotions
at the right customers therefore lets a company earn more while spending less on marketing. It also
spares the remaining customers unwanted promotional contact. For these reasons, customer revenue
prediction is quite important for companies.

In the Google Merchandise Store (GStore), the 80/20 rule still holds.

We can see that in GStore the ratio is even more extreme: about 2% of customers produce all of the revenue.

In our project, we analyze the GStore customer dataset and predict revenue per customer.

Our main goal is to enable more actionable operational changes and a better use of marketing budgets for
companies that choose to use data analysis on top of GStore data.
Dataset and Data Cleaning
Dataset:
Our dataset is 24 GB. The original version has 12 columns (such as fullVisitorId, channelGrouping, date,
device and geoNetwork) and about 1.7 million rows in total. Several columns contain JSON blobs of varying
depth, which produce sub-columns when flattened; flattening them completely yields more than 200 columns.

Data Cleaning:
Since our data is quite large, we split it into 9 pieces, flatten each piece separately and then
concatenate the results.
We first delete the constant and almost-constant columns.
We also explored the correlations between features to help us determine feature importance.
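The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not our exact script: the helper names, the list of JSON columns, the chunk size and the 0.99 threshold for "almost constant" are all assumptions, and the three-row CSV is made-up sample data.

```python
import io
import json

import pandas as pd

JSON_COLUMNS = ["device", "geoNetwork"]  # assumed subset of the JSON columns


def flatten_chunk(chunk):
    """Expand each JSON column into prefixed sub-columns, e.g. device_browser."""
    chunk = chunk.reset_index(drop=True)
    for col in JSON_COLUMNS:
        expanded = pd.json_normalize(chunk[col].map(json.loads).tolist())
        expanded.columns = [f"{col}_{sub}" for sub in expanded.columns]
        chunk = chunk.drop(columns=col).join(expanded)
    return chunk


def load_flattened(source, chunksize):
    """Read the CSV piece by piece, flatten each piece, then concatenate."""
    reader = pd.read_csv(source, dtype={"fullVisitorId": str}, chunksize=chunksize)
    return pd.concat([flatten_chunk(c) for c in reader], ignore_index=True)


def drop_near_constant_columns(df, max_share=0.99):
    """Drop columns whose most frequent value covers at least max_share of rows.

    max_share is a hypothetical threshold; 1.0 drops only truly constant columns.
    """
    to_drop = [
        col for col in df.columns
        if df[col].value_counts(dropna=False, normalize=True).iloc[0] >= max_share
    ]
    return df.drop(columns=to_drop)


# Made-up sample standing in for the real 24 GB file.
SAMPLE_CSV = '''fullVisitorId,device,geoNetwork
001,"{""browser"": ""Chrome"", ""isMobile"": false}","{""country"": ""United States""}"
002,"{""browser"": ""Firefox"", ""isMobile"": false}","{""country"": ""United States""}"
003,"{""browser"": ""Chrome"", ""isMobile"": false}","{""country"": ""Canada""}"
'''

flat = load_flattened(io.StringIO(SAMPLE_CSV), chunksize=2)
cleaned = drop_near_constant_columns(flat)
print(cleaned.columns.tolist())
```

Reading fullVisitorId as a string keeps leading zeros intact, which matters when the ID is later used to group rows per customer.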
Data Exploring -- ‘device_browser’
For an important feature, we want to explore its relationship to our target variable
('totals_transactionRevenue'). So we defined several functions that show, for each category of a given
feature (in %): how many times the category shows up; how many times it shows up when revenue > 0; and the
total and mean revenue it generates.
For example, in the feature ‘device_browser’, even though most of the traffic and total revenue come from
Chrome users, Firefox users spend more per purchase on average. It seems that GStore should pay more
attention to Firefox users to gain more. We can also treat all the less-used browsers as one type (others)
in modeling.
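A per-category summary of this kind might look like the sketch below. The function names, the output column names and the `min_count` cutoff are hypothetical, the mean is taken over purchases only (one possible reading of "mean revenue"), and the eight rows are made-up data rather than GStore figures.

```python
import pandas as pd

TARGET = "totals_transactionRevenue"


def lump_rare(series, min_count, other="others"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other)


def category_summary(df, feature, target=TARGET):
    """Share of visits, purchases and revenue per category, plus mean purchase."""
    work = pd.DataFrame({feature: df[feature], "revenue": df[target].fillna(0)})
    work["bought"] = work["revenue"] > 0
    grouped = work.groupby(feature)
    summary = pd.DataFrame({
        "visits_pct": 100 * grouped.size() / len(work),
        "purchases_pct": 100 * grouped["bought"].sum() / work["bought"].sum(),
        "revenue_pct": 100 * grouped["revenue"].sum() / work["revenue"].sum(),
        # Mean revenue per purchase; NaN for categories with no purchase.
        "mean_purchase": grouped["revenue"].sum() / grouped["bought"].sum(),
    })
    return summary.sort_values("visits_pct", ascending=False)


# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "device_browser": ["Chrome"] * 4 + ["Firefox"] * 2 + ["Opera", "Edge"],
    TARGET: [0, 10, 30, 20, 0, 50, 0, 0],
})
toy["device_browser"] = lump_rare(toy["device_browser"], min_count=2)
summary = category_summary(toy, "device_browser")
print(summary)
```

In this toy table Chrome has the largest share of visits and revenue while Firefox has the higher mean purchase, which is the kind of pattern the summary is meant to expose.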
Data Exploring -- ‘date’
To make full use of the ‘date’ column, we derived 4 new columns from it: ‘year’, ‘month’, ‘week’ and
‘weekday’, which hold the year, the month, the week of the year (from 1 to 52) and the day of the
week (from 0 to 6).
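A minimal pandas sketch of this derivation is shown below. It assumes the ‘date’ column stores dates in YYYYMMDD form; the function name `add_date_features` is ours for illustration, and the two sample dates are arbitrary.

```python
import pandas as pd


def add_date_features(df, date_col="date"):
    """Add year, month, week-of-year and weekday columns derived from date_col."""
    # Assumption: dates are stored as YYYYMMDD integers or strings.
    parsed = pd.to_datetime(df[date_col].astype(str), format="%Y%m%d")
    out = df.copy()
    out["year"] = parsed.dt.year
    out["month"] = parsed.dt.month
    out["week"] = parsed.dt.isocalendar().week.astype(int)  # ISO week number
    out["weekday"] = parsed.dt.weekday  # Monday = 0 ... Sunday = 6
    return out


sample = pd.DataFrame({"date": [20160801, 20171225]})
with_dates = add_date_features(sample)
print(with_dates)
```

Note that ISO week numbering can produce week 53 in some years, so the range of ‘week’ depends on the numbering scheme used.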
Taking the feature ‘month’ as an example, we can see that while the mean purchase does not change much,
the total revenue in August and December is higher than in other months. This may result from the
back-to-school season and Christmas. We therefore suggest that GStore run promotions in August and
December to increase profits.
Data visualization:

Density of users in different US states:
Modeling

Dataset: Our data contains user transactions from August 1st, 2016 to April 30th, 2018. We use the
data before January 1st, 2018 as training data and the data after it as test data.
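The time-based split can be sketched as below. The function name is hypothetical, the ‘date’ column is again assumed to be in YYYYMMDD form, and placing January 1st, 2018 itself in the test set is our assumption since the cutoff day is not specified.

```python
import pandas as pd

CUTOFF = pd.Timestamp("2018-01-01")


def split_by_date(df, date_col="date", cutoff=CUTOFF):
    """Return (train, test): rows strictly before the cutoff, and the rest."""
    when = pd.to_datetime(df[date_col].astype(str), format="%Y%m%d")
    return df[when < cutoff], df[when >= cutoff]


# Arbitrary sample dates spanning the period covered by the data.
visits = pd.DataFrame({"date": [20160801, 20171231, 20180101, 20180430]})
train, test = split_by_date(visits)
print(len(train), len(test))
```

Splitting by time rather than at random keeps future sessions out of the training data, so the test score reflects how the model would behave on later traffic.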

Models: We started with linear regression, then implemented LGBM, MLP and CNN models to achieve better
performance.

Linear Regression: we used ‘sklearn’ in Python to fit a linear regression. The resulting RMSE is 1.54
(the value listed in the model comparison).
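As an illustration of this step, the sketch below fits scikit-learn's `LinearRegression` and computes RMSE with a small helper. The arrays are synthetic (not GStore data) and the `rmse` helper is our own, so the printed value has no relation to the results reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
weights = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ weights + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))
y_test = X_test @ weights + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_train, y_train)
test_rmse = rmse(y_test, model.predict(X_test))
print(f"test RMSE: {test_rmse:.3f}")
```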
LGBM model:

We apply the LGBM model to our dataset to see its performance. We first identify the categorical
features, encode their values as numbers, and combine them with the numeric features to generate the
training set and run the model.
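The preparation described above (identify categorical columns, encode them, keep numeric columns alongside) can be sketched as follows. The function name and the "missing" placeholder are assumptions, the four rows are made up, and the sketch stops before model training: the resulting frame is what would be handed to the LGBM model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_features(df, target):
    """Label-encode non-numeric columns; return (features, target, encoders)."""
    features = df.drop(columns=[target])
    categorical = [
        col for col in features.columns
        if not pd.api.types.is_numeric_dtype(features[col])
    ]
    encoded = features.copy()
    encoders = {}
    for col in categorical:
        encoder = LabelEncoder()
        # Missing values get their own category so the encoder sees only strings.
        values = features[col].fillna("missing").astype(str)
        encoded[col] = encoder.fit_transform(values)
        encoders[col] = encoder
    return encoded, df[target], encoders


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "device_browser": ["Chrome", "Firefox", "Chrome", None],
    "totals_pageviews": [3, 10, 1, 5],
    "totals_transactionRevenue": [0.0, 25.0, 0.0, 0.0],
})
X, y, encoders = encode_features(raw, "totals_transactionRevenue")
print(X)
```

Keeping the fitted encoders makes it possible to apply the same mapping to new data, although `LabelEncoder.transform` raises an error for categories it has not seen during fitting.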
LGBM model:

According to LGBM’s feature importance, we can conclude that totals_pageviews, visitStartTime and
totals_hits are the top 3 features influencing the final revenue.
MLP and CNN model

Implement an MLP and then a CNN using PyTorch.

Use 80% of the whole data as the training set, leaving the remaining 20% as the test set.
The aim is to use the CNN to extract high-dimensional features, since the columns might be related
to one another to some degree.
Use a label encoder to transform string data into numeric types.
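To illustrate the idea of convolving over feature columns, here is a small PyTorch sketch. It is not the architecture used in this project: the class name `TabularCNN`, the layer sizes and the random input tensors are all assumptions chosen only to show the shape handling and one training step.

```python
import torch
from torch import nn


class TabularCNN(nn.Module):
    """Treats each row as a 1-channel sequence of feature values (illustrative)."""

    def __init__(self, hidden_channels=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over the feature axis
        )
        self.head = nn.Linear(hidden_channels, 1)

    def forward(self, x):
        # x: (batch, n_features) -> (batch, 1, n_features) for Conv1d
        h = self.features(x.unsqueeze(1))  # (batch, hidden_channels, 1)
        return self.head(h.squeeze(-1)).squeeze(-1)  # (batch,)


torch.manual_seed(0)
model = TabularCNN()
x = torch.randn(8, 20)  # random stand-in for 8 rows with 20 encoded features
y = torch.randn(8)      # random stand-in for the revenue target

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

pred = model(x)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

rmse_value = torch.sqrt(loss).item()  # RMSE is the square root of the MSE loss
print(f"batch RMSE: {rmse_value:.3f}")
```

Because the convolution only mixes neighbouring columns, the column order affects what the filters can pick up; that is a property of this sketch and may differ from the model used here.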
Model Structures:
Model results (RMSE):

CNN: 0.30
MLP: 4.0936
Comparison of our models (RMSE):

Models              RMSE
Linear Regression   1.54
LGBM                1.53
MLP                 4.09
CNN                 0.30 (*)

(*) The CNN could not be tested on the whole dataset due to GPU memory limitations; it was tested only
on 1,000,000 random samples of the given dataset.
Screenshot of our prediction system with a model:
Values typed in and the final result (you can try it out by following the README on GitHub!):
