
Experimentation Best Practices

Why Experimentation Best Practices?

Experimentation allows us to learn and deliver the right experiences to our Customers, creating better value for them.
Although experimentation seems straightforward, the risk of drawing inaccurate conclusions is high if an organization
does not follow best practices.

For example, here are a couple of commonly made missteps while conducting experiments that can lead to inaccurate
conclusions and decisions:

1. Peeking (article here): If we do not lock down the testing time period ahead of time, we end up with the peeking
problem, where checking the results and taking action before the A/B test is over introduces side effects. The more
often you look at the intermediate results of an A/B test with the readiness to make a decision, the higher the
probability that the criterion will show a statistically significant difference when there is none (see the simulation sketch after this list).
a. Two peeking sessions roughly double the effective false-positive rate;
b. Five peeking sessions increase it by a factor of 3.2.
2. Simpson’s Paradox: This can occur when we change the test group allocations disproportionately mid-flight. The
latent segments in the test groups change their proportions when we change allocation percentages, introducing error
into the results. More formally, Simpson’s Paradox is a statistical phenomenon where an association between two
variables in a population emerges, disappears, or reverses when the population is divided into subpopulations.
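For intuition, here is a minimal Monte Carlo sketch (not from the original document; the conversion rate, traffic, and peek schedule are illustrative assumptions) showing how repeated peeks at an A/A test, where there is no true difference, inflate the chance of declaring a false positive:

```python
# Simulate an A/A test (no real effect) and measure how often at least one of
# several evenly spaced interim looks is "significant" at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def false_positive_rate(n_peeks, n_sims=1000, n_per_day=500, days=20, p=0.10, alpha=0.05):
    peek_days = np.linspace(days / n_peeks, days, n_peeks).astype(int)
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(1, p, size=days * n_per_day)  # control conversions
        b = rng.binomial(1, p, size=days * n_per_day)  # "test" conversions (same rate)
        for d in peek_days:
            n = d * n_per_day
            # Two-proportion z-test at this interim look
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            z = (b[:n].mean() - a[:n].mean()) / se
            p_val = 2 * (1 - stats.norm.cdf(abs(z)))
            if p_val <= alpha:
                hits += 1
                break  # a decision is taken at the first "significant" peek
    return hits / n_sims

for peeks in (1, 2, 5):
    print(f"{peeks} peek(s): false-positive rate ~ {false_positive_rate(peeks):.1%}")
```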

To maximize the value of our experimentation practice and reduce the risk of inaccurate decisions, we recommend
following best practices across all experiments. This document outlines the best practices to adopt.

Experimentation Planning

Hypothesis: Define the upside

Before launching an experiment:

Define the business opportunity we are trying to improve


Explain how we are planning to enhance the business opportunity and why we want to take this particular approach.
If you have supporting data/analysis, document it.
Document the control UI and flow and test wireframe/concept

Goals: Without goals, test measurement is meaningless


Define the primary metric that we want the experiment to move. This is the metric we use to define the
rollout scenario.
Ideally, this should be just one metric (2 metrics max)
Ideally, this should tie to a business KPI (e.g. greater ARR, paid user signups, etc.)
Define the expected change magnitude and direction to calculate the sample size
Decide on a one- or two-tailed test and the duration of the test, based on the sample size and the direction(s) in
which we expect the KPI to change
Example:
Test concept: Reducing the number of steps in the new user sign up process
The primary metric: number of sign-ups; ideally this would also tie to a profit-driving KPI, such as # of paid
users
Define the secondary metric(s): These are metrics that help us validate and understand in detail why and how the
primary metric was impacted

Secondary metrics help us cross-validate if the movement in the primary metric is real.
Since we use a high alpha of 5% or 10%, it is possible that the impact on the primary metric occurred by chance,
so the secondary metrics can help validate whether the change in the primary metric is real or not. As we improve
the cadence of testing, this will become more important.
Example:
Test concept: Introduce a more prominent trial sign up button in the new user registration flow.
Primary goal: D90 conversion rate. Hypothesis: as more people sign up for the trial in the new user flow,
they better understand the total value of our product, moving our D90 conversion rate by 5%.
Possible noise: If the primary metric did move by 5% but the number of new users signing up for the trial
didn't increase, then the increased D90 conversion could be noise and needs further validation.
Define guardrail metrics: These help us ensure we are not harming the business in the long run for short-term gains.
These are metrics that track the long-term impact an experiment can have.
Example:
When experimenting to increase trials during the new user sign-up flow, we might want to set up D90 conversion or
trial-to-paid conversion as a guardrail metric.
Another example comes from a social media company that defines "notification disable rate" as a guardrail metric.
The guardrail states that for every 1% lift in sessions/DAU from increased push notifications, the increase in the
"notification disable rate" must stay within X% (see the check sketch after this section).
Another example: A SaaS company might set up the 12-month churn rate as a guardrail metric and revisit the
analysis after 12 months.
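To make the notification guardrail rule concrete, here is a minimal sketch; the function name, the inputs, and the X = 0.5% budget are hypothetical:

```python
# Check the guardrail: the disable-rate increase must stay within a budget that
# scales with the session lift (assumed here as 0.5% of disable-rate lift allowed
# per 1% of session lift).
def guardrail_ok(session_lift_pct: float,
                 disable_rate_lift_pct: float,
                 budget_per_1pct_sessions: float = 0.5) -> bool:
    if session_lift_pct <= 0:
        # No session upside: any increase in the disable rate violates the guardrail.
        return disable_rate_lift_pct <= 0
    return disable_rate_lift_pct <= budget_per_1pct_sessions * session_lift_pct

# Example: a 2% session lift with a 0.8% disable-rate increase passes the check.
print(guardrail_ok(session_lift_pct=2.0, disable_rate_lift_pct=0.8))  # True
```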

Pre-analysis: do the calculations upfront


Validate the problem, opportunity, and approach using data (e.g. descriptive analysis)
Example: There may not be significant value in testing an email header copy change for a specific email type if the
volume of that email is low; we will never achieve significance. In this case, consider a pre/post analysis or picking
a higher-traffic email type for testing.
Calculate the sample size needed using one tail/two tails, alpha (5%), and power (80%). Determine the experiment time
period (how long to run the experiment) upfront. A power-calculation sketch follows this list.
Predetermine if you need to use a higher alpha for rollout decisions.
Do a cost vs. benefit analysis of conducting the experiment. When the cost of experimentation in the
organization is high and we have limited tech resources, we must be able to understand the potential upside from
each test so that we prioritize the right ideas.
Calculate the potential yearly upside from this change in terms of a company KPI. When that's not possible, define
the upside to a KPI on a log scale (e.g. 0.01%, 0.1%, 1%, 10%, 100% impact), so we can make rough
comparisons between different ideas.
Push for step-function changes vs. incremental changes
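A minimal power-calculation sketch using statsmodels, assuming a binary conversion metric; the baseline rate, expected lift, and daily traffic below are illustrative assumptions:

```python
# Compute the per-group sample size for a proportion test at alpha = 5% and 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                 # current conversion rate (assumed)
expected = 0.105                # expected rate under the test, a 5% relative lift (assumed)
effect_size = proportion_effectsize(expected, baseline)  # Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                 # significance level
    power=0.80,                 # 1 - beta
    alternative="larger",       # one-tailed; use "two-sided" for a two-tailed test
)
print(f"Required sample size per group: {n_per_group:,.0f}")

# Divide by the eligible daily traffic per group to get the experiment duration.
daily_traffic_per_group = 5_000  # assumed
print(f"Days to run: {n_per_group / daily_traffic_per_group:.1f}")
```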

Experimentation setup plan: How to set up the test


Determine the # of variants needed to answer the business question we want the experiment to address (e.g. an
A/B/n or multivariate (MVT) setup)
Document the analysis plan including what hypothesis question can and can’t be answered.
Define instrumentation needed (the additional tracking data you need on top of what's already available).
Specify tracking needs as critical or nice to have so developers can discuss and implement the new tracking based
on effort and performance impact.
Define the population you are testing on, including segmentation or exclusion criteria
Document when you want to run the experiment in the experiment calendar and check for conflicts
Define the launch weights and ramp-up plan if the experiment is on a critical path or impacts a large user base
When ramping a high-risk test in a critical area, go from 1% to 5%/10% and then to 50%
Beware of Simpson’s paradox and only analyze test periods where the weight allocation is proportional and
comparable.
Specify the success criteria for ramping
Example: No significant impact to Primary metric, and we could detect a change as small as 5% (i.e. sensitivity)

General Guidance for Test Group Sample Allocation Based on Risk and Critical Path

Experiment launch weight matrix based on the potential code risk of failure and whether the change touches a critical
path page or critical product feature (considering the importance of the feature and the impacted population):

| Potential code risk for failure | Critical path / feature: L | M | H |
| --- | --- | --- | --- |
| L | 50% | 10% | 5% |
| M | 10% | 10% | 5% |
| H | 10% | 5% | 1% |

Experiment Verification

Experiment Validation Pre-Launch: Check that everything is in place

After the functionality is built by the Dev and Data Engineering teams, verify that the reporting data and UI work as
intended in the Dev or pre-production environment
The recommendation is to have any two of Dev, QA, and Analyst verify tracking and reporting
The recommendation is to have any two of PM, QA, and Analyst verify UI functionality
Paste screenshots of the control and test experiences in the experiment documentation for future reference

Experiment Validation Post-Launch: Check that the results are flowing in as expected
One or two days after go-live, verify that the reporting data for the experiment is valid
Check for any skew in population assignment (a sample-ratio-mismatch check sketch follows this section)
If you launched at a 1% weight, take an initial read after 2 or 3 days and set the weights to 5% or 10% as per the
previously agreed plan.
The 1% experiment's goal is only to ensure "things don't break", not to get a read on the results.
If you launched at 5% or 10%, change the weights to 50% as per your initial plan
Beware that the customer base can differ between weekdays and weekends, so analyze results in full-week
increments in case of weekly seasonality and in full-month increments in case of monthly seasonality
The above does not apply to the 1% stage, since the 1% test is not intended to get a read on the impact created by
the experimental experience.
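A minimal sketch of the assignment-skew (sample ratio mismatch) check: a chi-square goodness-of-fit test comparing observed assignment counts against the planned split. The counts and the p < 0.001 alert threshold are illustrative assumptions:

```python
# Flag a possible sample ratio mismatch (SRM) in a 50/50 experiment.
from scipy.stats import chisquare

observed = [50_421, 49_318]        # users actually assigned to control / test (assumed)
planned_split = [0.5, 0.5]         # intended allocation
total = sum(observed)
expected = [w * total for w in planned_split]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.5f}); investigate before trusting any results")
else:
    print(f"No evidence of assignment skew (p = {p_value:.3f})")
```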

Experimentation Analysis and Communication:


Ensure there is no bias in the assignment or experiment data
When possible, automate these validations, such as checks for assignment population bias
Analyze the test results using statistical significance testing
Leverage the secondary metrics to validate and better understand how and why the test won.
Analyze results by key dimensions to understand the rollout scenario and find any significant performance
difference by population cohort.
Beware of increased statistical noise when making multiple comparisons: form hypotheses for dimension-level
splits before the analysis, or use an adjusted p-value.
Analyze interactions with any pre-identified interacting experiments that were live at the same time as this
test.
Communicate results with statistical confidence (a readout sketch follows this list):
When there is a significant impact, communicate the results with a confidence interval (@80%)
When there is no significant impact, communicate the results with the sensitivity, so a general audience can
understand at what observed change the results would have been statistically significant.
Document the results in the wiki with a detailed description of the test, a ramp summary, key results, a snapshot of
metrics, insights, and next steps.
Present the results to partners (Tech and Business) and analytical peers to gather additional insights and educate
others on the learnings. Document any follow-up analysis and insights.
For PM: Ensure the experiment is rolled out or retired from code as per final result conclusions.
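A minimal readout sketch assuming a binary primary metric: a two-proportion z-test, an 80% confidence interval for the absolute lift, and the sensitivity (minimum detectable effect) to report when nothing is significant. All counts below are illustrative:

```python
import numpy as np
from scipy import stats

conv_t, n_t = 5_300, 50_000        # test conversions / users (assumed)
conv_c, n_c = 5_050, 50_000        # control conversions / users (assumed)

p_t, p_c = conv_t / n_t, conv_c / n_c
diff = p_t - p_c

# Two-proportion z-test (pooled standard error under H0)
pooled = (conv_t + conv_c) / (n_t + n_c)
se0 = np.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
p_value = 2 * (1 - stats.norm.cdf(abs(diff / se0)))

# 80% confidence interval for the absolute lift (unpooled standard error)
se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
z80 = stats.norm.ppf(0.90)         # two-sided 80% interval
print(f"p = {p_value:.3f}; 80% CI for lift: [{diff - z80 * se:+.4f}, {diff + z80 * se:+.4f}]")

# Sensitivity: an approximation of the smallest absolute lift detectable at
# alpha = 5% (two-tailed) and 80% power with this sample size.
z_alpha, z_beta = stats.norm.ppf(0.975), stats.norm.ppf(0.80)
mde = (z_alpha + z_beta) * np.sqrt(2 * p_c * (1 - p_c) / n_t)
print(f"Minimum detectable absolute lift: {mde:.4f} ({mde / p_c:.1%} relative)")
```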

Institutionalize Insights from Experimentation:


Along with business partners, present the detailed results to a broader org group to spur conversations on how others
can benefit from the learnings and what actions other teams can take, or collaborations they can pursue, to maximize value.
Ensure all results are documented and searchable by function or product tag to ensure we don't repeat the same
failed ideas. This enables us to do meta-analysis from multiple experiments to gather broader insights.
Ex: In a comparable company, we leveraged data from 2 dozen past examples to understand the relationship
between increased notification type and notification disablement.
Ex. In a comparable company, using metadata from ~50 past results, we understood the session lift by increasing
the volume of different email types.
Ex: In a comparable company, we leveraged ~8 past analyses to understand the incremental value of additional
cross merchandising spots.

Experimentation Governance

Automated alert system:


Enable an automated alert system to monitor significant negative impact to primary metrics from experiments.
This helps to avoid the need for results peeking.
We use p <= 0.01 or p <= 0.05 for two consecutive days
Enable an automated alert if an experiment has consistently significant positive results for x consecutive days (we
used 3 days to reduce random wins). An alert-rule sketch follows this list.
Review long-running experiments periodically to avoid performance impact on the product.
Document any experiment failures due to setup, tracking/data failures, wrong implementation, or conflicts/experiment
interactions. This enables us to monitor the health of the experimentation platform.
Create an experiment calendar to enable conflict management and to understand the volume and velocity of experiments.
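A minimal sketch of the alert rule; the data shape (one p-value per day for the primary metric), the threshold, and the day count are illustrative assumptions:

```python
from typing import Sequence

def should_alert(daily_p_values: Sequence[float],
                 threshold: float = 0.01,
                 consecutive_days: int = 2) -> bool:
    """Alert when the last `consecutive_days` daily p-values are all at or below
    the threshold (the direction of the impact is checked separately)."""
    recent = list(daily_p_values)[-consecutive_days:]
    return len(recent) == consecutive_days and all(p <= threshold for p in recent)

# Example: significant results two days in a row trigger an alert.
print(should_alert([0.30, 0.04, 0.009, 0.006]))  # True
```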

Opportunity Action Items for Consideration


Note: Short term means hours/days' worth of effort; long term means weeks/months' worth of effort

Based on initial feedback and observations of current experiment platform capabilities, this is a list of action items I
recommend we consider. They still need to be reviewed with the impacted parties to confirm the gaps still exist and to
agree on their priority.

Standardize sample size requirement with power calculations: (Priority H)


Short term: Define a standard sample size calculator using an alpha of 5% or 10%, one/two tails, and a power of 80%
Ensure the results dashboard incorporates significance and confidence levels (Priority H)
Short term: Incorporate significance calculation and confidence interval directly into Sisense Experiment analysis
framework using Z test formulas or Python functions.
Ability to split test results by dimensions with adjusted P-value threshold: (Priority M)
Long-term: Explore alternative tools or data tracking to enable unrestricted metric/funnel/dimension evaluation for
Experiments and split analysis by dimensions and filters.
Solve for Sample size problems (Priority H)
Short term: Ensure we define experiment sample size analysis using one-tailed or two-tailed tests to better
accommodate smaller sample sizes in experiments.
Short term: Leverage secondary metrics to build confidence in directional read on primary metrics.
Short term: Use guard rail metric for long-term impact measurement by pausing the experiment after collecting
enough samples and later measuring the impact on guard rail metric.
Long-term: Leverage a Bayesian experiment analysis framework to communicate uncertainty in the data for
small-sample-size problems (a minimal sketch follows this list).
Automated bias validation of test results: (Priority M)
Short term: Explore how to automatically flag when experiments have weights or population bias in results
Long term: Implement automated bias alerts when analyzing results.
Automated alerts for winning and losing experiments: (Priority L)
Short term: Explore how to create automated alerts for losing or consistently winning experiments based on P value
and days the results remain significant.
Analyze randomization within and across experiments for any bias: (Priority M)
Short term: Explore whether we need to validate randomization bias within and across experiment allocations, given
the limitations of existing systems, and analyze accordingly.
Long-term: Build automated systems to monitor bias continuously within and across experiments.
Experiment calendar: (Priority M)
Short term: Understand the current Experiment calendar and explore automated ways to validate accuracy of
testing calendar by comparing against actual experiment assignment data.
Long term: Have a single system that records all the experiments running across the organization and their
durations. Enable automated validation against actual test assignment data.
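As a starting point for the Bayesian action item above, here is a minimal Beta-Binomial sketch for a conversion metric; the counts, priors, and the 80% credible interval are illustrative assumptions:

```python
# Report P(test > control) and a credible interval for the relative lift instead
# of a p-value, which communicates uncertainty more directly for small samples.
import numpy as np

rng = np.random.default_rng(7)

conv_t, n_t = 48, 900    # small-sample test conversions / users (assumed)
conv_c, n_c = 35, 880    # small-sample control conversions / users (assumed)

# Beta(1, 1) uniform priors updated with observed conversions and non-conversions
post_t = rng.beta(1 + conv_t, 1 + n_t - conv_t, size=200_000)
post_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=200_000)

lift = post_t / post_c - 1
print(f"P(test > control) = {(post_t > post_c).mean():.1%}")
print(f"80% credible interval for relative lift: "
      f"[{np.percentile(lift, 10):+.1%}, {np.percentile(lift, 90):+.1%}]")
```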
