EP-Experimentation Best Practices-200125-135749
Experimentation allows us to learn and deliver the right experiences to our Customers, creating better value for them.
Although experimentation seems straightforward, the risk of drawing inaccurate conclusions is high if an organization
does not follow best practices.
For example, here are two commonly made missteps in conducting experiments that can lead to inaccurate
conclusions and decisions:
1. Peeking (article here): If we do not lock down the testing time period ahead of time, we run into the peeking
problem, where checking the results and acting before the A/B test is over introduces side effects. The more
often you look at the intermediate results of an A/B test with the readiness to make a decision, the higher the
probability that the criterion will show a statistically significant difference when there is none (see the
simulation sketch after this list).
a. 2 peeking sessions double the p-value;
b. 5 peeking sessions increase the p-value by a factor of 3.2;
2. Simpson’s Paradox: This can occur when we change the test group allocations disproportionately mid-flight.
The latent segments within the test groups change their proportions when we change allocation percentages,
introducing error into the results. More formally, Simpson’s Paradox is a statistical phenomenon where an association
between two variables in a population emerges, disappears, or reverses when the population is divided into
subpopulations (a worked example follows the list).
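
To make the peeking effect in point 1 concrete, here is a minimal Monte Carlo sketch (Python; not part of our tooling) of an A/A test with no true difference between groups. Stopping at the first "significant" interim look inflates the false-positive rate well above the nominal alpha; the sample sizes, number of looks, and alpha below are illustrative assumptions.

```python
# Illustrative simulation of how peeking inflates false positives in an A/A test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations = 2000                 # simulated A/A experiments (no true effect)
n_per_arm = 10_000                   # final sample size per arm
looks = [0.2, 0.4, 0.6, 0.8, 1.0]    # peek at 20%, 40%, ... of the planned sample
alpha = 0.05

false_positives = 0
for _ in range(n_simulations):
    a = rng.normal(size=n_per_arm)   # control metric values
    b = rng.normal(size=n_per_arm)   # treatment metric values, same distribution
    # Declare "significant" if ANY interim look crosses alpha -- this is peeking.
    for frac in looks:
        n = int(n_per_arm * frac)
        _, p_value = stats.ttest_ind(a[:n], b[:n])
        if p_value < alpha:
            false_positives += 1
            break

print(f"Nominal alpha: {alpha:.2f}")
print(f"False-positive rate with peeking: {false_positives / n_simulations:.3f}")
```

With an early stop allowed at any of five looks, the observed false-positive rate lands well above the nominal 5%, which is the same effect the multipliers above describe.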
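
For point 2, here is a small, entirely made-up dataset that shows the reversal: treatment beats control inside every segment, yet appears to lose in the pooled view because the two groups carry different segment mixes, which is exactly what a disproportionate mid-flight allocation change can produce.

```python
# Hypothetical counts illustrating Simpson's Paradox after an allocation change.
import pandas as pd

rows = [
    # segment,   group,       users, conversions
    ("new",      "control",    1000,   50),   # 5.0%
    ("new",      "treatment",  4000,  240),   # 6.0%  (treatment better)
    ("existing", "control",    4000,  800),   # 20.0%
    ("existing", "treatment",  1000,  220),   # 22.0% (treatment better)
]
df = pd.DataFrame(rows, columns=["segment", "group", "users", "conversions"])

per_segment = df.groupby(["segment", "group"])[["users", "conversions"]].sum()
per_segment["rate"] = per_segment["conversions"] / per_segment["users"]

pooled = df.groupby("group")[["users", "conversions"]].sum()
pooled["rate"] = pooled["conversions"] / pooled["users"]

print(per_segment, "\n")
print(pooled)
# Treatment wins in both segments, but the pooled rate favors control because
# treatment is over-weighted toward the low-converting "new" segment.
```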
To ensure we can maximize the value from our experimentation practices and reduce inaccuracy of decisions, we
recommend following best practices across all experiments. This document outlines the best practices to adopt.
Experimentation Planning
**Define secondary metrics: Secondary metrics help us cross-validate whether the movement in the primary metric is real.**
Since we use a high alpha of 5% or 10%, it is possible that the impact to the primary metric occurred by chance, so the
secondary metrics help validate whether the change in the primary metric is real. As we improve the
cadence of testing, this becomes more important.
Example:
Test concept: Introduce a more prominent trial sign-up button in the new user registration flow.
Primary goal: D90 conversion rate. **As more people sign up for a trial in the new user flow, they
can better understand the total value of our product and move our D90 conversion rate by 5%.**
Possible noise: If the primary metric did move by 5% but the number of new users signing up for a trial
didn't increase, then the increased D90 conversion could be noise and needs further validation.
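
As one way to run this cross-check, the sketch below (hypothetical counts; assumes statsmodels is installed) applies a two-proportion z-test to both the primary metric and the supporting secondary metric. If D90 conversion appears to win while trial sign-ups did not move, the primary result should be treated as suspect.

```python
# Illustrative cross-validation of a primary metric against a secondary metric.
from statsmodels.stats.proportion import proportions_ztest

def compare_rates(label, successes_control, n_control, successes_test, n_test):
    """Two-proportion z-test; returns the p-value for the test vs. control rates."""
    _, p_value = proportions_ztest(
        count=[successes_test, successes_control],
        nobs=[n_test, n_control],
    )
    rate_control = successes_control / n_control
    rate_test = successes_test / n_test
    print(f"{label}: control={rate_control:.3%} test={rate_test:.3%} p={p_value:.4f}")
    return p_value

# Hypothetical counts for illustration only.
p_primary = compare_rates("D90 conversion", 950, 20_000, 1_050, 20_000)
p_secondary = compare_rates("Trial sign-ups", 4_000, 20_000, 4_060, 20_000)

if p_primary < 0.05 and p_secondary >= 0.05:
    print("Primary moved but the secondary did not -- possible noise; validate further.")
```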
**Define Guardrail metrics: These help us ensure we are not harming the business in the long run for short-term gains.**
These are metrics that track the long-term impact an experiment can have.
Example:
When experimenting to increase trials during the new user sign-up flow, we might set up D90 conversion or
Trial-to-Paid conversion as a guardrail metric.
Another example comes from a social media company that defines a "notification disable rate" guardrail
metric. The guardrail says that for every 1% lift in sessions/DAU from increased push notifications,
the increase in the "notification disable rate" should stay within X%.
Another example: A SaaS company might set up a 12-month churn rate as a guardrail metric and revisit the
analysis after 12 months.
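
As a hedged sketch of how a guardrail like the push-notification example could be automated, the function below checks that the disable-rate increase stays within the agreed budget per 1% of session lift; the threshold ("X%"), metric names, and readings are placeholders rather than real configuration.

```python
# Placeholder guardrail check for the push-notification example above.
MAX_DISABLE_RATE_INCREASE_PER_1PCT_SESSION_LIFT = 0.5  # the "X%" -- illustrative value

def guardrail_ok(sessions_per_dau_control, sessions_per_dau_test,
                 disable_rate_control, disable_rate_test):
    """Return True if the disable-rate increase stays within the lift-based budget."""
    session_lift_pct = (sessions_per_dau_test / sessions_per_dau_control - 1) * 100
    disable_increase_pct = (disable_rate_test / disable_rate_control - 1) * 100
    if session_lift_pct <= 0:
        return disable_increase_pct <= 0  # no lift earned, so no disable-rate budget
    budget = session_lift_pct * MAX_DISABLE_RATE_INCREASE_PER_1PCT_SESSION_LIFT
    return disable_increase_pct <= budget

# Hypothetical readings: a 2% session lift with a 1% increase in disable rate.
print(guardrail_ok(3.00, 3.06, 0.0100, 0.0101))  # True under the 0.5%-per-1% budget
```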
General Guidance for Test Group Sample Allocation Based on Risk and Critical Path
Experiment launch weights matrix: rows are failure risk (and the importance of the feature and impacted population); columns are the criticality of the critical path page or product feature.

| Failure risk / importance of feature and impacted population | Critical path page or critical product feature: L | M | H |
| --- | --- | --- | --- |
| H | 10% | 5% | 1% |
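
The matrix can be encoded as a simple lookup when wiring launch weights into experiment configuration. Only the high-risk row is defined in this document, so the other rows are deliberately left out of the sketch below until they are agreed.

```python
# Launch-weight lookup based on the matrix above; only the high-risk row is defined here.
LAUNCH_WEIGHTS = {
    # (failure risk, criticality of page/feature): initial launch weight
    ("H", "L"): 0.10,
    ("H", "M"): 0.05,
    ("H", "H"): 0.01,
    # ("M", ...) and ("L", ...) are not specified in this document.
}

def initial_launch_weight(risk: str, criticality: str) -> float:
    """Return the agreed initial allocation, or raise if the cell is undefined."""
    try:
        return LAUNCH_WEIGHTS[(risk, criticality)]
    except KeyError:
        raise ValueError(f"No agreed launch weight for risk={risk}, criticality={criticality}")

print(initial_launch_weight("H", "H"))  # 0.01 -> launch at 1%
```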
Experiment Verification
After the functionality is built by the Dev and Data Engineering teams, verify that the reporting data and UI work as intended in the Dev
or Pre-production environment.
Recommendation is to have any two of Dev, QA, and Analyst verify tracking and reporting.
Recommendation is to have any two of PM, QA, and Analyst verify UI functionality.
Paste the control and test experience screenshots into the experiment documentation for future reference.
Experiment Validation Post-Launch: Check that the results are flowing in as expected
1 or 2 days after go-live, verify that the reporting data for the experiment is valid.
Check for any skew in population assignment (a sample-ratio-mismatch check sketch is included at the end of this section).
If you launched at 1% weight, take an initial read after 2 or 3 days and move to 5% or 10% weights as per the previously
agreed plan.
The 1% experiment’s goal is only to ensure “things don’t break,” not to get a read on the results.
If you launched at 5% or 10%, change the weights to 50% as per your initial plan.
Beware that the customer base can differ between weekdays and weekends, so analyze results in full-week
increments in case of weekly seasonality and in full-month increments in case of monthly seasonality.
The above does not apply to the 1% launch, since the 1% test is not intended to get a read on the impact created by the
experimental experience.
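
One way to implement the population-skew check above is a sample-ratio-mismatch (SRM) test: a chi-square goodness-of-fit comparison of observed assignment counts against the configured allocation weights. The counts, weights, and significance threshold below are illustrative.

```python
# Illustrative SRM check: do observed assignments match the configured weights?
from scipy import stats

observed = [50_400, 49_600]      # users actually assigned to control / test
expected_ratio = [0.5, 0.5]      # configured allocation weights
expected = [r * sum(observed) for r in expected_ratio]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p={p_value:.2e}): investigate assignment before trusting results.")
else:
    print(f"Assignment is consistent with configured weights (p={p_value:.3f}).")
```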
Experimentation Governance
Based on initial feedback and observations on current experiment platform capabilities, below is a list of action items I
recommend we consider. The list still needs to be reviewed with the impacted parties to confirm the gaps still exist and to
agree on their priority.