- Before testing
Not every idea is worth testing
Many ideas come from different stakeholders (UX, PM, Engineering)
Quantitative analysis: use historical data to estimate the opportunity size of each idea
Zoom out and Zoom in
If you get stuck, take a step back
Investigate users' behaviors
Limitation: historical data only tells us about the past; it cannot accurately predict the future
Qualitative analysis
Focus groups and surveys
pain points and preferences
The goal: choose which idea to A/B test
- Designing A/B tests
How long to run a test? (Test duration)
Rule of thumb: two weeks
Sample size of a test (rule-of-thumb formula: n = 16*sigma^2/delta^2, where sigma^2 is the metric variance and delta is the minimum detectable effect)
Divide the required sample size by the daily traffic to get the duration: Duration (days) = Sample Size / (Randomization Units per Day)
If the number is less than a week, we should run the experiment for at least seven days to capture the weekly pattern. It is typically recommended to run it for two weeks. When it comes to collecting data for a test, more is almost always better than not enough.
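A minimal sketch of the calculation above in Python; the baseline rate, MDE, and daily traffic are made-up example numbers.

```python
import math

def sample_size_per_group(sigma_sq: float, delta: float) -> int:
    """Rule-of-thumb sample size: n = 16 * sigma^2 / delta^2
    (approximation for alpha = 0.05 and power = 0.8)."""
    return math.ceil(16 * sigma_sq / delta ** 2)

def duration_days(total_sample_size: int, units_per_day: float) -> int:
    """Test duration = total sample size / randomization units per day."""
    return math.ceil(total_sample_size / units_per_day)

# Hypothetical example: baseline conversion rate 10%, MDE of 1 percentage point
p = 0.10
sigma_sq = p * (1 - p)                      # variance of a Bernoulli metric
n = sample_size_per_group(sigma_sq, delta=0.01)
days = duration_days(2 * n, units_per_day=5_000)
print(n, days)  # still run for at least 7 days (ideally 2 weeks) to capture weekly patterns
```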
Interference between control and treatment groups ❗
Independence assumption does not hold (social networks, two-sided markets)
Network effect (spillover)
Unlike social networks, where the treatment effect underestimates the real benefit of a new feature, in two-sided markets the treatment effect overestimates the actual effect. ❓
How do we design the test to prevent spillover between control and treatment? (See the assignment sketch after this list.)
Social network
Two sided market
Geo-based (has pitfalls)
Time-based randomization (use with care: best suited to features with short-term, time-sensitive effects)
Network clusters
Ego-cluster ❓
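A minimal sketch of cluster-level assignment (the same idea applies to geo regions, network clusters, or ego-clusters): hash a cluster ID so every user in a cluster lands in the same arm; the salt and cluster IDs are hypothetical.

```python
import hashlib

def assign_arm(cluster_id: str, salt: str = "expt-123") -> str:
    """Deterministically assign a whole cluster to one arm so that
    connected users share the same experience (limits spillover)."""
    digest = hashlib.md5(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Hypothetical cluster IDs (geo codes or graph-cluster labels)
for cid in ["nyc", "sf", "cluster-042"]:
    print(cid, assign_arm(cid))
```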
- Analyzing Results
Novelty and primacy effects
1) Compare new users' results in the control group to those in the treatment group to evaluate the novelty effect
2) Compare first-time users' results with existing users' results in the treatment group to get an actual estimate of the impact of the novelty or primacy effect
- Making decisions
complexity of implementation, project management effort, customer support cost, maintenance cost, opportunity cost
Is it to maximize engagement, retention, revenue, or something else?
Watch for a negative shift in a non-goal metric
Are guardrail metrics (counter metrics) hurt? (Check with a z-test, t-test, or chi-squared test)
Normality test ❓
Hypothesis Testing
Z-test or t-test
Conduct retrospective analysis by analyzing users' activity logs
Causal inference
Metric selection
Success metrics (Goal metrics, true north star)
Driver metrics (more sensitive than Goal metrics)
Business Goal
AARRR, HEART
Guardrail metrics (Counter metrics)
Organizational guardrails
Trustworthy-related guardrails
Website/App performance
Monitor the sample ratio: Control and Treatment samples should match the configured split (see the chi-squared sketch after this list)
Business goals
Latency: wait times for pages to load
Error logs: number of error messages
Client crashes: number of crashes per user.
Revenue: revenue per user and total revenue
Engagement: e.g., time spent per user, DAU, and pageviews per user.
Cache hit ratio should be the same for Control and Treatment
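A minimal sketch of a sample ratio mismatch check with a chi-squared goodness-of-fit test; the counts and the 50/50 configured split are made-up assumptions.

```python
from scipy import stats

observed = [50_980, 49_020]              # hypothetical control/treatment counts
total = sum(observed)
expected = [total * 0.5, total * 0.5]    # assumed 50/50 configuration

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:                      # a tiny p-value flags a likely SRM
    print(f"Possible sample ratio mismatch: p = {p_value:.2e}")
```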
Overall Evaluation Criterion (OEC)
Average-of-ratios vs. Ratio-of-averages
For click-through rate, we can use the average of ratios because it is more robust to outliers (see the numeric sketch below)
Caveats of ratio-of-averages
Less robust to outliers
Complicates variance calculation (consider using the bootstrap or the delta method)
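A small numeric illustration of the two aggregations; the per-user clicks and impressions are made up, with one heavy-usage outlier.

```python
import numpy as np

clicks      = np.array([1, 0, 2, 3, 500])        # hypothetical per-user clicks
impressions = np.array([10, 5, 20, 30, 10_000])  # hypothetical per-user impressions

avg_of_ratios = np.mean(clicks / impressions)    # mean of per-user CTRs
ratio_of_avgs = clicks.sum() / impressions.sum() # total clicks / total impressions

print(avg_of_ratios, ratio_of_avgs)  # the outlier user dominates the ratio of averages
```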
Choose randomization units
Choose the target population
Demographics
Platforms
segment of interest
estimated Variance
historical data
A/A tests
Power: 0.8 by rule of thumb
Alpha: the significance level (conventionally 0.05)
Delta: minimum detectable effect (MDE, a.k.a. practical significance); its value is discussed and decided by multiple stakeholders
E.g., allocate 5% of traffic, not 100%, so that a buggy or terrible experience does not hit all users at once
Find the 95% Confidence Interval (CI)
If the 95% CI includes 0, we cannot conclude there is a significant change
Statistical and practical significance (see the decision sketch at the end of this list)
Practically significant (Launch ✅)
Statistically significant: p < 0.05 and the lower bound of the 95% CI > 0
Not practically significant
Likely statistically/practically significant (experiment may be underpowered, run new experiments with more units if time and resources allow)
Statistically significant and likely practically significant
Practically significant: (new - old) > delta (the MDE)
Scenario 1:
Neither statistically nor practically significant (do not launch ❌)
Scenario 2:
Statistically significant but not practically significant
If implementation is costly (do not launch ❌)
If the cost is low (launch ✅)
Scenario 1: likely statistically significant, but the CI is too wide and contains 0, so we cannot tell whether the change helps or hurts
Scenario 2: likely practically significant
Statistically significant and likely practically significant
The CI does not contain 0, so there is a high chance the result is practically significant
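A rough sketch of how the scenarios above might map to a recommendation, given the 95% CI of (treatment - control) and the MDE; the thresholds and example numbers are hypothetical, and the real decision also weighs implementation cost.

```python
def launch_recommendation(ci_low: float, ci_high: float, mde: float) -> str:
    """Map a 95% CI for (treatment - control) and an MDE to a rough recommendation."""
    if ci_low > mde:
        return "statistically and practically significant -> launch"
    if ci_low > 0 and ci_high < mde:
        return "statistically but not practically significant -> launch only if cost is low"
    if ci_low > 0:
        return "statistically significant, likely practically significant -> consider launching or rerun with more power"
    if ci_high < mde:
        return "neither statistically nor practically significant -> do not launch"
    return "CI is too wide and contains 0 -> underpowered, rerun with more units"

print(launch_recommendation(ci_low=0.002, ci_high=0.030, mde=0.010))
```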
Randomization unit
User ID or login that users use across platforms and devices
Cookie
Event: a page view or a session (finer granularity)
Device ID
IP address (not recommended)
How to choose
Consistent user experience
Variability
Ethical considerations
Randomization unit vs. unit of analysis
The randomization unit should be the same as (or coarser than) the unit of analysis
Compare two proportions with a two-sample t-test (see the sketch after this list)
- Compute p_t, p_c, and the pooled (overall) p
- Compute SE
If pooling variances
if not pooling variance
- Compute t-statistic
- Find p-value and compare with Alpha
Old-school way (with df)
SciPy
- Find the 95% Confidence Interval (CI)
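A minimal sketch of the steps above using SciPy's normal distribution (with large samples the t and z statistics are essentially the same); the conversion counts are hypothetical.

```python
import numpy as np
from scipy import stats

x_c, n_c = 1_000, 10_000          # hypothetical control conversions / users
x_t, n_t = 1_100, 10_000          # hypothetical treatment conversions / users

p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)                                     # overall p

se_pooled   = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))     # for the test
se_unpooled = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)   # for the CI

z = (p_t - p_c) / se_pooled
p_value = 2 * stats.norm.sf(abs(z))                                    # two-sided p-value

ci = (p_t - p_c) + np.array([-1, 1]) * stats.norm.ppf(0.975) * se_unpooled
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```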
Compare two means with a two-sample t-test (see the sketch after this list)
- Compute SE
Pooled SE
unpooled SE
- Compute t-statistics
- Find p-value
- Find the 95% Confidence Interval (CI)
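A minimal sketch of the two-sample t-test with SciPy; the per-user revenue samples are simulated, not real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control   = rng.exponential(scale=10.0, size=5_000)   # simulated per-user revenue
treatment = rng.exponential(scale=10.5, size=5_000)

# Welch's t-test (unpooled SE); pass equal_var=True for the pooled version
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci = diff + np.array([-1, 1]) * stats.norm.ppf(0.975) * se   # large-sample 95% CI

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```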
Website/App performance
Latency, error logs, client crashes
Business goals
Revenue, engagement
Stop and assess
Priority concerns to address before analysis
Rerun the experiment
Form a hypothesis
- Envision the user journey
- Define goal metrics
- Refine the hypothesis
Validate the experiment system: make sure the assignments are correctly given and Type I error is controlled.
Validate no major bias between different groups of users.
Estimate the variance of certain metrics so that we can calculate the sample size of another experiment.
Always run a series of A/A tests before an A/B test!
Simulate thousands of A/A tests and plot the p-values: if the distribution is far from uniform, you're in trouble! (See the simulation sketch below.)
Common reasons behind failures: variance is calculated incorrectly (e.g., violating i.i.d. assumptions, randomization unit ≠ analysis unit), outliers, browser redirects, hardware differences, etc.
Even after A/A tests pass, it is still recommended to run A/A tests concurrently with A/B tests to ensure no regressions in the system
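A minimal simulation sketch: run many A/A comparisons on data drawn from the same distribution and check the p-values against Uniform(0, 1); the sample sizes and distribution are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = []
for _ in range(2_000):                        # many simulated A/A tests
    a = rng.normal(0, 1, size=1_000)          # both arms from the same distribution
    b = rng.normal(0, 1, size=1_000)
    p_values.append(stats.ttest_ind(a, b).pvalue)

# In a healthy setup the p-values should look ~Uniform(0, 1)
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"KS test vs. uniform: stat = {ks_stat:.3f}, p = {ks_p:.3f}")
```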
Common pitfalls
Post-test segmentation
Simpson's paradox
Lack of statistical power
Multiple testing problem
Analyzing too many segments or metrics creates a much higher probability of false positives
Preventative Solutions
Specify the target population
Use stratified sampling to create balanced datasets across segments
After the Test
qualitative analysis.
a follow-up test
Dealing with Network Effect
Geo-based randomization
Time-based randomization
Long-term Monitoring
How long is long-term? Three or more months, or after 10 exposures
What causes long-term effects
Internal factors
Business side
Launch new product/features
Engineering side
Latency and downtime
Models and software degrade
External factors
User's behavior changes
User-learned effects
Novelty effects
Primacy effects
Competitor and others
Network effects
Delayed experience/measurement
Seasonality
Competitors emerge
Policies
How to measure long-term effects
Long-running experiments
Holdback Group
Reverse experiment
Alternative methods
Focus groups
Survey
Run an A/A test after the A/B test
Cohort analysis
Lack of testing power
Stopping the experiment too early
The experiment ran as designed, but there were not enough randomization units due to system errors or a shift in demand
Multiple Hypotheses (Multiple Testing Problem)
Increases the probability of false positives (Type I errors)
How to fix it
Post-test segmentation
Alternatives to A/B tests
user experience research
terrible for scaling
focus groups
may fall into group think
human evaluation
survey
respondents may not be truthful or representative
log-based analysis
use historical data to understand baselines, metric distributions, form hypotheses
controlled experiments aren't always possible
interrupted time series
interleaved experiments
regression discontinuity design
propensity score matching
difference in differences
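As one example from the list above, a tiny difference-in-differences sketch with made-up before/after averages for a treated market and a comparable control market.

```python
# Hypothetical weekly metric averages (e.g., conversions) before/after a change
treated_before, treated_after = 100.0, 130.0   # market that received the change
control_before, control_after = 100.0, 110.0   # comparable market without the change

# DiD: change in the treated market minus change in the control market
did = (treated_after - treated_before) - (control_after - control_before)
print(f"Estimated lift attributable to the change: {did:.1f}")
```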
Group metrics
Bonferroni correction
Family-wise error rate
False discovery rate (FDR) adjustment
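A minimal sketch of both corrections using statsmodels; the p-values are made-up examples from testing several metrics/segments.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.012, 0.030, 0.045, 0.200])   # hypothetical p-values

reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh,   _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni (controls FWER):", reject_bonf)
print("Benjamini-Hochberg (controls FDR):", reject_bh)
```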
Unbalanced sample sizes for segments, i.e., segments that differ considerably in size
Post-test segmentation with unplanned comparisons between pairs or groups
Conduct a statistical test to see if the two groups differ significantly.
Check if there’s any discrepancy upstream of the randomization point.
Check if the variant assignment is done correctly.
Check bot detection and filtering, and the data pipeline.
For a fixed sample size, the achieved power depends on the MDE
When not to use A/B testing
lack of infrastructure
lack of impact
lack of traffic
lack of conviction
lack of isolation
By analyzing user activity logs
Make the product change, but then run a retrospective analysis on historical data
Caveat ⭐
Dealing with Non-Normality
If the metric's distribution is not normal
Thanks to plentiful data and the Central Limit Theorem
Z and t-tests
Bootstrapping ⭐ (see the sketch after this list)
Running alternative tests
Gathering more data
To invoke the CLT
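A minimal bootstrap sketch for a heavy-tailed metric; the lognormal samples are simulated stand-ins, not real data.

```python
import numpy as np

rng = np.random.default_rng(1)
control   = rng.lognormal(mean=1.00, sigma=1.2, size=3_000)   # simulated heavy-tailed metric
treatment = rng.lognormal(mean=1.05, sigma=1.2, size=3_000)

# Bootstrap the difference in means: resample each group with replacement
boot_diffs = [
    rng.choice(treatment, size=len(treatment)).mean()
    - rng.choice(control, size=len(control)).mean()
    for _ in range(5_000)
]
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the lift: ({ci_low:.3f}, {ci_high:.3f})")
```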
FWER
Dealing with Network effect
Cluster similar or connected users and isolate the clusters across arms
Dealing with Novelty Effect
Only analyze test results for new users
Easier to identify potential novelty effects
For bug fixes or sensitive changes, launch to entire user base
Groups are not balanced
leads to highly skewed results
Due to CLT, the z-test can be applied to estimate sample proportion ⭐
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
Ramping
Collect data for one week
Run simulations 1000 times
Goodness of fit test
KS test
Estimate ratio variance
Following p(1 - p), the maximum variance of 0.25 is often used for conversion rate, click-through rate, etc., when no prior information is available
Monitor and re-check, because the variance can change over time
It should be the absolute difference (a number), not the relative difference (a percentage)
Lack of statistical power in each segment
Prevention: before running the test, keep segments balanced, with the same ratio of each segment in both the control and treatment groups
After running the test, follow up with qualitative analysis
Sample size may not be enough
Sample size may not be enough
May still have interactions
Needs monitoring to detect
Can lead to bias
Not all experiments will be impacted by network effect
disadvantage
Survivorship bias
Dilution
Use a large number of employees for an internal holdback group
To address the ethical concerns of a holdback group
Scenarios
More than 2 metrics
More than two treatment groups
a segment of population
Multiple iterations
Multiple testing in parallel
A two-step rule of thumb
Divide the metrics into three groups and set a different p-value threshold for each, according to the predicted impact level
Causes
Tests with ramping up plans
Running multiple tests in parallel
Segmentation is based on attributes that can change
Bugs in the assignment pipeline
Sample ratio mismatch
Violation of SUTVA
Primacy and novelty effects