Trustworthy Online Controlled Experiments
Introductory Topics for Everyone
Why Experiment? Correlations, Causality, and Trustworthiness
Necessary Ingredients for Running Useful Controlled Experiments
Experimental units (e.g., users) that can be assigned to different variants with little or no interference between them
Enough experimental units (e.g., users)
Key metrics, ideally an OEC, are agreed upon and can be practically evaluated. If the goals are too hard to measure, it is important to agree on surrogates
Changes are easy to make
e.g., Amazon:
1. Users who bought X also bought Y
2. Users who viewed X also bought/viewed Y
3. Users who searched for X also bought Y
Running and Analyzing Experiments: An End-to-End Example
3. Designing the Experiment
The size of the experiment has a direct impact on the precision of the results: to detect a smaller change, or to be more confident in the conclusion, run a larger experiment with more users (see the sample-size sketch at the end of this subsection)
Use a purchase indicator (i.e., did the user purchase, yes/no, regardless of the purchase amount) instead of revenue-per-user as the OEC; its lower variance reduces the sample size needed
If you only care about detecting bigger changes, a smaller sample suffices
To require a stricter significance threshold (e.g., p < 0.01 instead of 0.05), increase the sample size
how long to run the experiment
More users: running longer accumulates more users and usually increases statistical power. Exceptions: for some metrics, such as number of sessions per user, the variance also increases over time, so power does not grow as expected
Day-of-week effect
Seasonality
Primacy and novelty effects
Other factors in deciding experiment size:
How safe is the experiment?
The need to share traffic with other experiments
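A rough sketch of the sample-size reasoning above, using the common rule of thumb of about 16σ²/δ² users per variant for 80% power at α = 0.05; the revenue and purchase-rate numbers are made-up illustrations, not figures from the book.

```python
# Rough sample-size rule of thumb: n per variant ≈ 16 * sigma^2 / delta^2
# (80% power, alpha = 0.05). All numbers below are illustrative, not from the book.

def samples_per_variant(sigma, delta, multiplier=16):
    """Approximate users needed per variant to detect an absolute change `delta`
    in a metric whose standard deviation is `sigma`."""
    return multiplier * (sigma / delta) ** 2

# Hypothetical OEC comparison: revenue-per-user has high variance, while a 0/1
# purchase indicator has variance p * (1 - p).
revenue_sigma, revenue_delta = 30.0, 0.25        # detect a $0.25 change
p = 0.05                                         # assume 5% of users purchase
indicator_sigma = (p * (1 - p)) ** 0.5
indicator_delta = 0.0025                         # detect a 0.25-point change

print(f"revenue-per-user OEC:   {samples_per_variant(revenue_sigma, revenue_delta):,.0f} users/variant")
print(f"purchase-indicator OEC: {samples_per_variant(indicator_sigma, indicator_delta):,.0f} users/variant")
# A stricter threshold (e.g., p < 0.01) or higher power raises the multiplier
# above 16, i.e., requires an even larger sample.
```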
4. Running the Experiment and Getting Data
Instrumentation
Infrastructure
5. Interpreting the Results
Invariant metrics: sanity-check metrics that should not differ between Control and Treatment
Trust-related guardrail metrics
Organizational guardrail metrics, such as latency
6. From Results to Decisions
1. Setting up the Example
2. Hypothesis Testing: Establishing Statistical Significance
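A minimal sketch of how statistical significance might be established with a two-sample t-test; the data is synthetic and the revenue distribution is an arbitrary assumption.

```python
# Minimal sketch: compare revenue-per-user between Control and Treatment
# with a two-sample t-test. Data is synthetic, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control   = rng.exponential(scale=3.5, size=100_000)   # simulated revenue per user
treatment = rng.exponential(scale=3.6, size=100_000)

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
delta = treatment.mean() - control.mean()
print(f"delta = {delta:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
# Declare statistical significance only if p is below the alpha chosen
# before the experiment started.
```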
Twyman’s Law and Experimentation Trustworthiness
Misinterpretation of the Statistical Results
Lack of Statistical Power
e.g., the Treatment only impacts a small subset of the population, so the effect measured over all users is diluted, or there are simply not enough users to detect the effect (see the dilution sketch below)
Remedy: define what change is practically significant and power the experiment to detect it
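A small sketch of the dilution effect: if the Treatment only touches a fraction of users, the effect measured over all users shrinks by that fraction, and the required sample size grows roughly with the square of its inverse. The numbers are hypothetical.

```python
# Sketch: dilution when only a fraction of users are affected by the change.
# Illustrative numbers, not from the book.
affected_fraction = 0.10     # Treatment only touches 10% of users
effect_on_affected = 0.05    # 5% lift for those users

# Measured over ALL users, the effect is diluted by the affected fraction...
diluted_effect = affected_fraction * effect_on_affected
# ...and since required sample size scales as 1 / delta^2, it grows
# roughly as 1 / fraction^2.
sample_size_inflation = 1 / affected_fraction ** 2

print(f"diluted effect over all users: {diluted_effect:.3%}")
print(f"required sample size grows by ~{sample_size_inflation:.0f}x")
# Analyzing only the triggered (affected) users avoids this dilution.
```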
Misinterpreting p-values
Don't continuously monitor p-values and stop as soon as one crosses the significance threshold; this "peeking" inflates the false-positive rate (see the simulation below)
How to avoid it: use sequential tests with always-valid p-values or a Bayesian testing framework, or determine statistical significance only at a predetermined experiment duration (e.g., one week)
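A small simulation sketch of why continuously monitoring p-values is a problem: under an A/A test with no true effect, stopping at the first p < 0.05 yields far more than 5% false positives. Batch sizes and counts are arbitrary.

```python
# Simulation sketch: under an A/A test (no true effect), checking the p-value
# after every batch and stopping at the first p < 0.05 inflates false positives
# far above 5%. All parameters are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_batches, batch_size = 500, 20, 500
false_positives = 0

for _ in range(n_experiments):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_batches):
        a = np.concatenate([a, rng.normal(size=batch_size)])
        b = np.concatenate([b, rng.normal(size=batch_size)])
        if stats.ttest_ind(a, b).pvalue < 0.05:   # "peeking" after every batch
            false_positives += 1
            break

print(f"false-positive rate with peeking: {false_positives / n_experiments:.1%}")
# A fixed-horizon analysis at the predetermined end keeps this near 5%.
```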
Multiple Hypothesis Tests
Pitfall: running multiple tests (many metrics, segments, or iterations) and reporting the lowest p-value inflates false positives; apply a multiple-comparison correction (see the sketch below)
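A sketch of two standard multiple-comparison corrections (Bonferroni and Benjamini-Hochberg) applied to a set of hypothetical p-values; the book raises the problem generally, so the specific procedures here are illustrative choices.

```python
# Sketch: correcting for multiple hypothesis tests instead of just picking
# the lowest p-value. The p-values below are hypothetical.
p_values = [0.004, 0.020, 0.035, 0.210, 0.650]
alpha = 0.05
m = len(p_values)

# Bonferroni: compare each p-value against alpha / m.
bonferroni_significant = [p < alpha / m for p in p_values]

# Benjamini-Hochberg: find the largest rank i with p_(i) <= (i / m) * alpha,
# then reject all hypotheses with rank up to i.
ranked = sorted(enumerate(p_values), key=lambda kv: kv[1])
bh_cutoff_rank = 0
for i, (_, p) in enumerate(ranked, start=1):
    if p <= i / m * alpha:
        bh_cutoff_rank = i
bh_significant = [False] * m
for i, (idx, _) in enumerate(ranked, start=1):
    if i <= bh_cutoff_rank:
        bh_significant[idx] = True

print("Bonferroni:        ", bonferroni_significant)
print("Benjamini-Hochberg:", bh_significant)
```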
Threats to Internal Validity
Violations of SUTVA (Stable Unit Treatment Value Assumption: a unit's behavior is unaffected by other units' variant assignment)
Survivorship Bias
Intention-to-Treat
Sample Ratio Mismatch (SRM): the observed ratio of users across variants differs from the configured ratio (e.g., 50/50); common causes below, detection sketch after the list
Browser redirects
Performance differences
Robots handle redirects differently
Redirects are asymmetric.
Lossy instrumentation
Residual or carryover effects
New experiments usually involve new code, and the bug rate tends to be higher. It is common for a new experiment to cause some unexpected egregious issue and be aborted or kept running while a quick bug fix is made; residual effects from the buggy period can then carry over into the analysis
cookie
Bad hash function for randomization
e.g., a hash function that failed to properly distribute users across variants when the system was generalized to overlapping (concurrent) experiments
If triggering is done based on attributes that are changing over time, then you must ensure that no attributes used for triggering could be impacted by the Treatment.
e.g., a campaign that triggers for users who have been inactive for 3 months: if the campaign is effective, those users become active, and the next iteration of the campaign could have an SRM
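A sketch of how an SRM check might be run: a chi-squared goodness-of-fit test of the observed user counts against the configured split. The counts are hypothetical.

```python
# Sketch: detecting a Sample Ratio Mismatch with a chi-squared goodness-of-fit
# test against the configured split. The counts below are hypothetical.
from scipy import stats

control_users, treatment_users = 821_588, 815_482   # observed user counts
expected_ratio = (0.5, 0.5)                          # configured 50/50 split

total = control_users + treatment_users
expected = [total * r for r in expected_ratio]
chi2, p_value = stats.chisquare([control_users, treatment_users], f_exp=expected)

print(f"SRM p-value: {p_value:.6f}")
# A very small p-value (e.g., < 0.001) means the observed split is unlikely to
# be chance: distrust the experiment results and debug the cause first.
```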
Time-of-Day Effects
Bot traffic
Threats to External Validity
Confidence Intervals
If the study were repeated many times, 95% of the confidence intervals computed would contain the true Treatment effect (see the sketch below)
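A minimal sketch of computing a 95% confidence interval for the Treatment-minus-Control difference in means using the normal approximation; the data is synthetic.

```python
# Sketch: 95% confidence interval for the difference in means (normal
# approximation). Data is synthetic, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control   = rng.exponential(scale=3.5, size=50_000)
treatment = rng.exponential(scale=3.6, size=50_000)

delta = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / treatment.size + control.var(ddof=1) / control.size)
z = stats.norm.ppf(0.975)               # ~1.96 for a 95% interval
ci_low, ci_high = delta - z * se, delta + z * se

print(f"delta = {delta:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
# Repeating the study many times, about 95% of intervals built this way
# would contain the true Treatment effect.
```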