I’m estimating CLV using the BetaGeo, Modified BetaGeo, and ParetoNBD models.
I was previously removing zero-repeat customers, which is incorrect. Once I kept them, I started running into the SamplingError shown below when using transaction data summarized by day. I do not run into this issue when summarizing by week (clv.rfm_summary(…, time_unit="W")). It looks like I have more zero-repeat customers than the model would expect.
Summarizing my data by week helps, probably because recency then has a minimum of 1.
Are there other approaches that users would recommend? For example:
- Aggregate by Day but set the end of period in rfm_summary() later than final observed date so no one has a recency of 0
- Add a trivial fractional constant to any recency = 0 to prevent a divide by 0 error
- Subsample a limited portion of 0-repeat-visit customers to minimize effect on convergence
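To make the first option concrete, here is a minimal hand-rolled sketch of a daily frequency/recency/T summary. It is not pymc-marketing's implementation; the function name rfm_summary_sketch and the toy data are hypothetical, and it assumes the common convention that recency is the time between a customer's first and last purchase. If rfm_summary follows that convention in your version (worth checking in its docstring), a later end of period would change T but leave recency at 0 for zero-repeat customers.

```python
from datetime import date


def rfm_summary_sketch(transactions, period_end):
    """Hypothetical daily RFM summary; NOT pymc-marketing's rfm_summary."""
    # Collect distinct purchase days per customer (same-day purchases count once).
    days_by_customer = {}
    for customer_id, day in transactions:
        days_by_customer.setdefault(customer_id, set()).add(day)

    summary = {}
    for customer_id, days in days_by_customer.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,        # repeat purchases only
            "recency": (last - first).days,    # assumed: first-to-last purchase
            "T": (period_end - first).days,    # first purchase to end of period
        }
    return summary


# Toy data, made up for illustration.
transactions = [
    ("a", date(2024, 1, 1)),
    ("a", date(2024, 1, 10)),
    ("b", date(2024, 1, 5)),   # zero-repeat customer
    ("c", date(2024, 1, 31)),  # first purchase on the final observed day
]

observed_end = rfm_summary_sketch(transactions, date(2024, 1, 31))
later_end = rfm_summary_sketch(transactions, date(2024, 2, 7))
```

In this toy data, customer "c" has T = 0 when the period ends on the final observed date and T = 7 with the later end date, while recency for "b" and "c" is 0 in both cases.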
<Numbers masked by #>
SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'alpha_log__': array(#.########), 'r_log__': array(-#.########), 'phi_dropout_interval__': array(0.), 'kappa_dropout_interval__': array(0.)}
Logp initial evaluation results:
{'alpha': -X.XX, 'r': -X.XX, 'phi_dropout': -X.XX, 'kappa_dropout': -X.XX, 'recency_frequency': -inf}
You can call model.debug() for more details.
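Since only the recency_frequency term evaluates to -inf, one possible next step (alongside model.debug()) is to scan the summary rows for values that look suspicious before fitting. The sketch below is an assumption, not part of pymc-marketing: the helper name find_suspect_rows is hypothetical, and the checks are only the usual consistency expectations for frequency/recency/T data. Which of them, if any, triggers the -inf in a given model is not something this sketch establishes.

```python
def find_suspect_rows(rows):
    """Return (customer_id, reasons) for rows that fail basic consistency checks.

    Hypothetical helper; the checks are assumptions about what a BG/NBD-style
    likelihood may not tolerate, not a statement of pymc-marketing's rules.
    """
    suspects = []
    for row in rows:
        reasons = []
        if row["T"] <= 0:
            reasons.append("T is not positive")
        if row["recency"] > row["T"]:
            reasons.append("recency exceeds T")
        if row["frequency"] == 0 and row["recency"] != 0:
            reasons.append("zero frequency with nonzero recency")
        if row["frequency"] > 0 and row["recency"] == 0:
            reasons.append("repeat purchases with zero recency")
        if reasons:
            suspects.append((row["customer_id"], reasons))
    return suspects


# Toy rows, made up for illustration.
rows = [
    {"customer_id": "a", "frequency": 1, "recency": 9, "T": 30},
    {"customer_id": "b", "frequency": 0, "recency": 0, "T": 26},
    {"customer_id": "c", "frequency": 0, "recency": 0, "T": 0},
]

suspects = find_suspect_rows(rows)
```

If the flagged rows are few, refitting without them (or with an adjusted end of period) is a quick way to test whether they are what breaks the initial evaluation.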
@pymc-bot, what are your thoughts on the best practice to avoid Sampling Error when fitting a model on daily data with recency = 0?
To avoid sampling error when fitting a model on daily data with recency = 0, consider the following best practices:
- Data Aggregation: As you noted, summarizing data to a weekly level instead of daily can help mitigate sampling errors. This is because weekly aggregation reduces the noise and variability that can occur with daily data, especially when dealing with zero-repeat customers.
- Inclusion of Zero-Repeat Customers: Ensure that zero-repeat customers are included in your analysis. Excluding them can lead to biased estimates, as they are part of the customer base and their behavior is relevant to the model.
- Model Calibration: Adjust the model parameters to better handle the distribution of zero-repeat customers. This might involve tuning the model to account for the higher-than-expected number of such customers.
- Data Transformation: Consider transforming the data to stabilize variance and make the data more suitable for modeling. This could involve log transformations or other techniques to normalize the data distribution.
- Cross-Validation: Use cross-validation techniques to ensure that the model is robust and not overfitting to the peculiarities of the daily data.
- Regularization: Apply regularization techniques to prevent overfitting, which can be more pronounced with high-frequency data like daily transactions.
These strategies can help in reducing sampling error and improving the reliability of your model when dealing with daily transaction data and zero-repeat customers.
For more information, visit https://www.pymc.io/
This answer was generated by the PyMC Assistant bot in response to your command. Please let us know if it was helpful!
To avoid sampling error when fitting a model on daily data with recency = 0, consider the following best practices:
- Data Aggregation: As you noted, summarizing data to a weekly level instead of daily can help reduce sampling error. This is because aggregating data can smooth out anomalies and provide a more stable dataset for modeling.
- Inclusion of Zero-Repeat Customers: Ensure that zero-repeat customers are included in your dataset. Excluding them can lead to biased estimates, as they are part of the population you are trying to model.
- Model Calibration: Adjust your model to better handle zero-repeat customers. This might involve tweaking the model parameters or using a model that is more robust to such cases.
- Data Augmentation: Consider augmenting your dataset with synthetic data or using techniques like bootstrapping to ensure that the model is exposed to a variety of scenarios, including those with zero recency.
- Regularization Techniques: Apply regularization techniques to prevent overfitting, which can be exacerbated by the presence of zero-repeat customers.
- Cross-Validation: Use cross-validation to ensure that your model is generalizing well and not just fitting to the peculiarities of your training data.
These strategies can help mitigate the impact of sampling error and improve the robustness of your model when dealing with daily data that includes customers with recency = 0.