- Before testing
Not every idea is worth testing
Many ideas come from different stakeholders (UX, PM, Engineering)
Quantitative analysis: use historical data to estimate the opportunity size of each idea
Zoom out and Zoom in
If you get stuck, take a step back
Investigate users' behaviors
Limitation: historical data only tells us about the past; it cannot accurately predict the future
Qualitative analysis
Focus groups and surveys
pain points and preferences
The goal: choose which idea to A/B test
- Designing A/B tests
How long to run a test? (Test duration)
Rule of thumb: two weeks
Sample size of a test (rule-of-thumb formula: n = 16*sigma^2/delta^2, where sigma^2 is the metric variance and delta is the minimum detectable effect)
Divide the required sample size by the daily traffic to get the duration: Duration (days) = Sample Size / (Randomization Units per Day)
If the number is less than a week, we should run the experiment for at least seven days to capture the weekly pattern. It is typically recommended to run it for two weeks. When it comes to collecting data for a test, more is almost always better than not enough.
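A minimal sketch of the calculation above in Python; the baseline rate, MDE, and daily traffic are made-up example numbers.

```python
import math

def sample_size_per_group(sigma_sq: float, delta: float) -> int:
    """Rule-of-thumb sample size: n = 16 * sigma^2 / delta^2
    (approximation for alpha = 0.05 and power = 0.8)."""
    return math.ceil(16 * sigma_sq / delta ** 2)

def duration_days(total_sample_size: int, units_per_day: float) -> int:
    """Test duration = total sample size / randomization units per day."""
    return math.ceil(total_sample_size / units_per_day)

# Hypothetical example: baseline conversion rate 10%, MDE of 1 percentage point
p = 0.10
sigma_sq = p * (1 - p)                      # variance of a Bernoulli metric
n = sample_size_per_group(sigma_sq, delta=0.01)
days = duration_days(2 * n, units_per_day=5_000)
print(n, days)  # still run for at least 7 days (ideally 2 weeks) to capture weekly patterns
```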
Interference between control and treatment groups ❗
Independence assumption does not hold (social networks, two-sided markets)
Network effect (spillover)
Unlike social networks, where the treatment effect underestimates the real benefit of a new feature, in two-sided markets the treatment effect overestimates the actual effect. ❓
How do we design the test to prevent spillover between control and treatment? (See the assignment sketch after this list.)
Social network
Two sided market
Geo-based (has pitfalls)
Time-based randomization (use with care: best suited to features with short-term, time-sensitive effects)
Network clusters
Ego-cluster ❓
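A minimal sketch of cluster-level assignment (the same idea applies to geo regions, network clusters, or ego-clusters): hash a cluster ID so every user in a cluster lands in the same arm; the salt and cluster IDs are hypothetical.

```python
import hashlib

def assign_arm(cluster_id: str, salt: str = "expt-123") -> str:
    """Deterministically assign a whole cluster to one arm so that
    connected users share the same experience (limits spillover)."""
    digest = hashlib.md5(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Hypothetical cluster IDs (geo codes or graph-cluster labels)
for cid in ["nyc", "sf", "cluster-042"]:
    print(cid, assign_arm(cid))
```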
- Analyzing Results
Novelty and primacy effects
1) Compare new users' results in the control group to those in the treatment group to evaluate the novelty effect
2) Compare first-time users' results with existing users' results in the treatment group to get an actual estimate of the impact of the novelty or primacy effect
- Making decisions
complexity of implementation, project management effort, customer support cost, maintenance cost, opportunity cost
Is it to maximize engagement, retention, revenue, or something else?
Watch for a negative shift in a non-goal metric
Are guardrail metrics (counter metrics) hurt? (Check with a z-test, t-test, or chi-squared test)
Normality test ❓
Hypothesis Testing
Z-test or t-test
Conduct retrospective analysis by analyzing users' activity logs
Causal inference
Metric selection
Success metrics (Goal metrics, true north star)
Driver metrics (more sensitive than Goal metrics)
Business Goal
AARRR, HEART
Guardrail metrics (Counter metrics)
Organizational guardrails
Trustworthy-related guardrails
Website/App performance
Monitor the sample ratio: Control and Treatment samples should match the configured split (see the chi-squared sketch after this list)
Business goals
Latency: wait times for pages to load
Error logs: number of error messages
Client crashes: number of crashes per user.
Revenue: revenue per user and total revenue
Engagement: e.g., time spent per user, DAU, and pageviews per user.
Cache hit ratio should be the same for Control and Treatment
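A minimal sketch of a sample ratio mismatch check with a chi-squared goodness-of-fit test; the counts and the 50/50 configured split are made-up assumptions.

```python
from scipy import stats

observed = [50_980, 49_020]              # hypothetical control/treatment counts
total = sum(observed)
expected = [total * 0.5, total * 0.5]    # assumed 50/50 configuration

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:                      # a tiny p-value flags a likely SRM
    print(f"Possible sample ratio mismatch: p = {p_value:.2e}")
```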
Overall Evaluation Criterion (OEC)
Average-of-ratios vs. Ratio-of-averages
For click-through rate, we can use the average of ratios because it is more robust to outliers (see the numeric sketch below)
Caveats of ratio-of-averages
Less robust to outliers
Complicates variance calculation (consider using the bootstrap or the delta method)
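A small numeric illustration of the two aggregations; the per-user clicks and impressions are made up, with one heavy-usage outlier.

```python
import numpy as np

clicks      = np.array([1, 0, 2, 3, 500])        # hypothetical per-user clicks
impressions = np.array([10, 5, 20, 30, 10_000])  # hypothetical per-user impressions

avg_of_ratios = np.mean(clicks / impressions)    # mean of per-user CTRs
ratio_of_avgs = clicks.sum() / impressions.sum() # total clicks / total impressions

print(avg_of_ratios, ratio_of_avgs)  # the outlier user dominates the ratio of averages
```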
Choose randomization units
Choose the target population
Demographics
Platforms
segment of interest
estimated Variance
historical data
A/A tests
Power: 0.8 by rule of thumb
Alpha: the significance level (conventionally 0.05)
Delta: minimum detectable effect (MDE, a.k.a. practical significance); its value is discussed and decided by multiple stakeholders
E.g., allocate 5% of traffic, not 100%, so that a buggy or terrible experience does not hit all users at once
Find the 95% Confidence Interval (CI)
If the 95% CI includes 0, we cannot conclude there is a significant change
Statistical and practical significance (see the decision sketch at the end of this list)
Practically significant (Launch ✅)
Statistically significant: p < 0.05 and the lower bound of the 95% CI > 0
Not practically significant
Likely statistically/practically significant (experiment may be underpowered, run new experiments with more units if time and resources allow)
Statistically significant and likely practically significant
Practically significant: (new - old) > delta (the MDE)
Scenario 1:
Neither statistically nor practically significant (do not launch ❌)
Scenario 2:
Statistically significant but not practically significant
If implementation is costly (do not launch ❌)
If the cost is low (launch ✅)
Scenario 1: likely statistically significant, but the CI is too wide and contains 0, so we cannot tell whether the change helps or hurts
Scenario 2: likely practically significant
Statistically significant and likely practically significant
The CI does not contain 0, so there is a high chance the result is practically significant
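A rough sketch of how the scenarios above might map to a recommendation, given the 95% CI of (treatment - control) and the MDE; the thresholds and example numbers are hypothetical, and the real decision also weighs implementation cost.

```python
def launch_recommendation(ci_low: float, ci_high: float, mde: float) -> str:
    """Map a 95% CI for (treatment - control) and an MDE to a rough recommendation."""
    if ci_low > mde:
        return "statistically and practically significant -> launch"
    if ci_low > 0 and ci_high < mde:
        return "statistically but not practically significant -> launch only if cost is low"
    if ci_low > 0:
        return "statistically significant, likely practically significant -> consider launching or rerun with more power"
    if ci_high < mde:
        return "neither statistically nor practically significant -> do not launch"
    return "CI is too wide and contains 0 -> underpowered, rerun with more units"

print(launch_recommendation(ci_low=0.002, ci_high=0.030, mde=0.010))
```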
Randomization unit
User ID or login that users use across platforms and devices
Cookie
Event: a page view or a session (finer granularity)
Device ID
IP address (not recommended)
How to choose
Consistent user experience
Variability
Ethical considerations
Randomization unit vs. unit of analysis
The randomization unit should be the same as (or coarser than) the unit of analysis
Compare two proportions with a two-sample t-test (see the sketch after this list)
- Compute p_t, p_c, and the pooled (overall) p
- Compute SE
If pooling variances
if not pooling variance
- Compute t-statistic
- Find p-value and compare with Alpha
Old-school way (with df)
SciPy
- Find the 95% Confidence Interval (CI)
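A minimal sketch of the steps above using SciPy's normal distribution (with large samples the t and z statistics are essentially the same); the conversion counts are hypothetical.

```python
import numpy as np
from scipy import stats

x_c, n_c = 1_000, 10_000          # hypothetical control conversions / users
x_t, n_t = 1_100, 10_000          # hypothetical treatment conversions / users

p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)                                     # overall p

se_pooled   = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))     # for the test
se_unpooled = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)   # for the CI

z = (p_t - p_c) / se_pooled
p_value = 2 * stats.norm.sf(abs(z))                                    # two-sided p-value

ci = (p_t - p_c) + np.array([-1, 1]) * stats.norm.ppf(0.975) * se_unpooled
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```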
Compare two means with a two-sample t-test (see the sketch after this list)
- Compute SE
Pooled SE
unpooled SE
- Compute t-statistics
- Find p-value
- Find the 95% Confidence Interval (CI)
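A minimal sketch of the two-sample t-test with SciPy; the per-user revenue samples are simulated, not real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control   = rng.exponential(scale=10.0, size=5_000)   # simulated per-user revenue
treatment = rng.exponential(scale=10.5, size=5_000)

# Welch's t-test (unpooled SE); pass equal_var=True for the pooled version
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci = diff + np.array([-1, 1]) * stats.norm.ppf(0.975) * se   # large-sample 95% CI

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```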
Website/App performance
Latency, error logs, client crashes
Business goals
Revenue, engagement
Stop and assess
Priority concerns to address before analysis
Rerun the experiment
Form a hypothesis
- Envision the user journey
- Define goal metrics
- Refine the hypothesis
Validate the experiment system: make sure the assignments are correctly given and Type I error is controlled.
Validate no major bias between different groups of users.
Estimate the variance of certain metrics so that we can calculate the sample size of another experiment.
Always run a series of A/A tests before an A/B test!
Simulate thousands of A/A tests and plot the p-values: if the distribution is far from uniform, you're in trouble! (See the simulation sketch below.)
Common reasons behind failures: variance is calculated incorrectly (e.g., violating i.i.d. assumptions, randomization unit ≠ analysis unit), outliers, browser redirects, hardware differences, etc.
Even after A/A tests pass, it is still recommended to run A/A tests concurrently with A/B tests to ensure no regressions in the system
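A minimal simulation sketch: run many A/A comparisons on data drawn from the same distribution and check the p-values against Uniform(0, 1); the sample sizes and distribution are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = []
for _ in range(2_000):                        # many simulated A/A tests
    a = rng.normal(0, 1, size=1_000)          # both arms from the same distribution
    b = rng.normal(0, 1, size=1_000)
    p_values.append(stats.ttest_ind(a, b).pvalue)

# In a healthy setup the p-values should look ~Uniform(0, 1)
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"KS test vs. uniform: stat = {ks_stat:.3f}, p = {ks_p:.3f}")
```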
Common pitfalls
Post-test segmentation
Simpson's paradox
Lack of statistical power
Multiple testing problem
Analyzing too many segments or metrics creates a much higher probability of false positives
Preventative Solutions
Specify the target population
Use stratified sampling to create balanced datasets across segments
After the Test
qualitative analysis.
a follow-up test
Dealing with Network Effect
Geo-based randomization
Time-based randomization
Long-term Monitoring
How long is long-term? Three or more months, or after 10 exposures
What causes long-term effects
Internal factors
Business side
Launch new product/features
Engineering side
Latency and downtime
Models and software degrade
External factors
User's behavior changes
User-learned effects
Novelty effects
Primacy effects
Competitor and others
Network effects
Delayed experience/measurement
Seasonality
Competitors emerge
Policies
How to measure long-term effects
Long-running experiments
Holdback Group
Reverse experiment
Alternative methods
Focus groups
Survey
Run an A/A test after the A/B test
Cohort analysis
Lack of testing power
Stopping the experiment too early
The experiment ran as designed, but there were not enough randomization units due to system errors or a shift in demand
Multiple Hypotheses (Multiple Testing Problem)
Increases the probability of false positives (Type I errors)
How to fix it
Post-test segmentation
Alternatives to A/B tests
user experience research
terrible for scaling
focus groups
may fall into group think
human evaluation
survey
respondents may not be truthful or representative
log-based analysis
use historical data to understand baselines, metric distributions, form hypotheses
controlled experiments aren't always possible
interrupted time series
interleaved experiments
regression discontinuity design
propensity score matching
difference in differences
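As one example from the list above, a tiny difference-in-differences sketch with made-up before/after averages for a treated market and a comparable control market.

```python
# Hypothetical weekly metric averages (e.g., conversions) before/after a change
treated_before, treated_after = 100.0, 130.0   # market that received the change
control_before, control_after = 100.0, 110.0   # comparable market without the change

# DiD: change in the treated market minus change in the control market
did = (treated_after - treated_before) - (control_after - control_before)
print(f"Estimated lift attributable to the change: {did:.1f}")
```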
Group metrics
Bonferroni correction
Family-wise error rate
False discovery rate (FDR) adjustment
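A minimal sketch of both corrections using statsmodels; the p-values are made-up examples from testing several metrics/segments.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.012, 0.030, 0.045, 0.200])   # hypothetical p-values

reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh,   _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni (controls FWER):", reject_bonf)
print("Benjamini-Hochberg (controls FDR):", reject_bh)
```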
Unbalanced sample sizes for segments, i.e., segments that differ considerably in size
Post-test segmentation with unplanned comparisons between pairs or groups
Conduct a statistical test to see if the two groups differ significantly.
Check if there’s any discrepancy upstream of the randomization point.
Check if the variant assignment is done correctly.
Check bot detection and filtering, and the data pipeline.
For a fixed sample size, the achieved power depends on the MDE
When not to use A/B testing
lack of infrastructure
lack of impact
lack of traffic
lack of conviction
lack of isolation
By analyzing user activity logs
Make the product change, but then run a retrospective analysis on historical data
Caveat ⭐
Dealing with Non-Normality
If the metric's distribution is not normal
Thanks to plentiful data and the Central Limit Theorem
Z and t-tests
Bootstrapping ⭐ (see the sketch after this list)
Running alternative tests
Gathering more data
To invoke the CLT
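A minimal bootstrap sketch for a heavy-tailed metric; the lognormal samples are simulated stand-ins, not real data.

```python
import numpy as np

rng = np.random.default_rng(1)
control   = rng.lognormal(mean=1.00, sigma=1.2, size=3_000)   # simulated heavy-tailed metric
treatment = rng.lognormal(mean=1.05, sigma=1.2, size=3_000)

# Bootstrap the difference in means: resample each group with replacement
boot_diffs = [
    rng.choice(treatment, size=len(treatment)).mean()
    - rng.choice(control, size=len(control)).mean()
    for _ in range(5_000)
]
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the lift: ({ci_low:.3f}, {ci_high:.3f})")
```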
FWER
Dealing with Network effect
Cluster similar or connected users and isolate the clusters across arms
Dealing with Novelty Effect
Only analyze test results for new users
Easier to identify potential novelty effects
For bug fixes or sensitive changes, launch to entire user base
Groups are not balanced
leads to highly skewed results
Due to CLT, the z-test can be applied to estimate sample proportion ⭐
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
Ramping
Collect data for one week
Run simulations 1000 times
Goodness of fit test
KS test
Estimate ratio variance
Following p(1 - p), the maximum variance of 0.25 is often used for conversion rate, click-through rate, etc., when no prior information is available
Monitor and re-check, because the variance can change over time
It should be the absolute difference (a number), not the relative difference (a percentage)
Lack of statistical power in each segment
Prevention: before running the test, keep segments balanced, with the same ratio of each segment in both the control and treatment groups
After running the test, follow up with qualitative analysis
Sample size may not be enough
Sample size may not be enough
May still have interactions
Needs monitoring to detect
Can lead to bias
Not all experiments will be impacted by network effect
disadvantage
Survivorship bias
Dilution
Use a large number of employees for an internal holdback group
To address the ethical concerns of a holdback group
Scenarios
More than 2 metrics
More than two treatment groups
a segment of population
Multiple iterations
Multiple testing in parallel
A two-step rule of thumb
Divide the metrics into three groups and set a different p-value threshold for each, according to the predicted impact level
Causes
Tests with ramping up plans
Running multiple tests in parallel
Segmentation is based on attributes that can change
Bugs in the assignment pipeline
Sample ratio mismatch
Violation of SUTVA
Primacy and novelty effects