Experiments: Getting Past Statistical Significance
What are you trying to do when experimenting?
When experimenting, you are trying to isolate the effect that taking some action on your website/mobile app/email has on one or more objective results.
For example, let's suppose you change the color of the main button on your page. Originally red, you try making it blue (this is the action you take). Will it increase the conversion rate (this is the objective result), and if so, by how much?
A/B testing: The application of “Randomized Controlled Trials”
“A/B tests” and other “multivariate tests” are applications of randomized controlled trials (RCT).
RCTs are experiments where:
- We take a sample of users and,
- randomly assign them to either a control group or a treatment group.
- The experimenter then collects performance data (conversions or purchase values) for each group.
We often talk about AB Testing as one activity or task. However, there are 3 main parts of the actual mechanics of testing.
- The data collection;
- The estimation of effect sizes;
- The assessment of our uncertainty about the effect size, and the mitigation of that uncertainty when making decisions.
Data Collection
There is going to be a tradeoff between more information (less variability) and cost. Why cost? Because we are investing time and foregoing potentially better options, in the hopes that we will find something better than what we are currently doing. We spend resources now to improve our estimates of the set of possible actions that we might take in the future. AB Testing, in and of itself, is not optimization. It is an investment in information.
What does a “sample” mean in this context?
First things first, a sample is simply a subset of the total population under consideration. What’s important to understand is that in most A/B testing situations, we don’t randomly sample, but we randomly assign users to treatments.
This may be confusing at first, but in the online situation where websites run A/B tests, the users present themselves for assignment. Think of it like this: they come to your homepage; you don’t bring the homepage specifically to them.
This is not just a detail, because it systematically leads to a certain selection bias if we don’t account for this non-random sampling in our data collection process. That selection bias makes it more difficult, if not impossible, to draw conclusions from our test results about the larger population we are interested in.
The most effective way to mitigate this is to run our experiments over an extended time (full weeks, or months, etc.) so that our samples better represent the true user/customer population. The longer, the better.
Why do we prefer “randomized” assignments?
This is because of what we call “confounding”. Confounding is when the treatment effect gets mixed with the effects of any other outside influence. Confounding is the single biggest issue in establishing a causal relationship between the treatments and the performance measure.
Let’s continue with the previous example, where we are interested in the treatment effect of our button color on the page’s conversion rate. When assigning users to a button color, it’s not good if we give everyone who visits the page on Sunday the “Blue” button treatment, and everyone who visits the page on Monday the “Red” button treatment.
Why is that not good? Because the “Blue” group combines Sunday users with the “Blue” button, while the “Red” group combines Monday users with the “Red” button. They are not the same kind of users. We have mixed the data in a way that any effects related to the day of visit are tangled together with the treatment effect of button color. This is the very reason why before/after testing is not accurate.
What matters is for each of the groups to both have:
- No confounding: They look like one another except for the treatment selection.
- No selection bias: They look like the population of interest.
On the other hand, if each day, we apply a 50/50 split of button color treatments, then the relationship between “day of visit” and “button color assignment” is broken, and we can estimate the average treatment effects without having to worry as much about the influences of outside effects. This is the foundation that makes A/B testing more accurate than before/after testing.
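To make this concrete, here is a minimal Python sketch of a 50/50 randomized assignment. The function name and experiment key are hypothetical; the point is that the variant a user sees depends only on a hash of their ID, not on the day they happen to visit.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "button_color") -> str:
    """Deterministically assign a user to 'control' or 'treatment' (50/50),
    independent of when they arrive (day of week, hour, etc.)."""
    # Hash the user id together with the experiment name so the same user
    # always sees the same variant within this experiment.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # bucket in 0..99
    return "treatment" if bucket < 50 else "control"

# Every user who shows up, on any day, has the same 50% chance of seeing
# the blue button, so "day of visit" is no longer tied to button color.
print(assign_variant("user-42"))
```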
Still, this isn’t perfect. Randomization balances the groups only on average, and it is possible, due to sampling error, that a given sample isn’t balanced over all of the cofactors and confounders. Be alert to any sample ratio mismatch, since it can reintroduce confounding.
Things can go wrong. If, for example, a bug in the assignment process means that users on certain types of old internet browsers never get assigned, then we no longer have a fully randomized assignment. The bug breaks the randomization and lets the effect of old browsers leak in and mix with the effect of our treatment (button color, etc.).
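One way to catch this kind of breakage is a sample ratio mismatch check: compare the observed assignment counts with the split you configured. Here is a minimal sketch using SciPy’s chi-square goodness-of-fit test; the counts are made up.

```python
from scipy.stats import chisquare

# Observed assignment counts at the end of the experiment (hypothetical numbers).
observed = [50_421, 48_902]            # control, treatment
expected_split = [0.5, 0.5]            # the split we configured
total = sum(observed)
expected = [p * total for p in expected_split]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible sample ratio mismatch (p={p_value:.2e}); check the assignment code.")
else:
    print("Assignment counts are consistent with the planned 50/50 split.")
```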
Confounding is the main issue you should be concerned about within A/B testing. Everything else is secondary.
Estimating the Effects of the Treatment
We have ensured the assignment to each group is randomized, and the experiment is running. It’s time to use the data we collected to calculate an estimate, for each group, of the conversion rate (or any other objective result we would like to measure).
The estimate from our sample will be our best guess of what the treatment effect might be for the population under study if we were to spread this change beyond testing. Usually, for most simple experiments, it’s enough to calculate the treatment effect using the “arithmetic mean” of the sample from each group, as follows:
Treatment effect on CR = (Treatment CR) – (Control CR)
Continuing with the previous example where “Red” is our control group and “Blue” is the treatment group: if we estimate that users exposed to a blue button convert 10% of the time (0.1) and those who are exposed to a red button convert 11% of the time (0.11), then the estimated treatment effect is -1% (0.1-0.11=-0.01).
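In code, the estimate is just two sample means and a difference. A small sketch using hypothetical counts consistent with the example above:

```python
# Hypothetical raw counts from the experiment.
control_visitors, control_conversions = 10_000, 1_100      # 11.0% (red button)
treatment_visitors, treatment_conversions = 10_000, 1_000  # 10.0% (blue button)

control_cr = control_conversions / control_visitors
treatment_cr = treatment_conversions / treatment_visitors

# Treatment effect on CR = (Treatment CR) - (Control CR)
treatment_effect = treatment_cr - control_cr
print(f"Estimated treatment effect: {treatment_effect:+.2%}")   # about -1.00%
```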
Estimating Uncertainty
It is essential to keep in mind that calculating the sample’s conversion rates is a means to an end only. The final purpose is to make statements about the population conversion rates.
The treatment effect derived from the samples we drew is not an intrinsic characteristic of the population; it is a property of those particular samples. Another sample would have led to different data and a different sample mean.
To assess uncertainty, we could estimate a “confidence interval” for the conversion rate of each of the treatment and the control group. It is an interval constructed to include the true population conversion rate with a frequency determined by the confidence level we choose. For example, over repeated experiments, 95% confidence intervals will contain the true population conversion rate 95% of the time.
The confidence interval can also be calculated around the “treatment effect”, which is the difference in conversion rates between our treatment group and the control group, as mentioned before.
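The article doesn’t prescribe a particular interval; one common choice is the normal-approximation (Wald) interval for a difference in proportions. A minimal sketch, reusing the hypothetical counts from above:

```python
from math import sqrt
from scipy.stats import norm

def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for the difference in
    conversion rates (treatment minus control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)      # ~1.96 for 95%
    return diff - z * se, diff + z * se

# Same hypothetical counts as above: red (control) vs blue (treatment).
low, high = diff_ci(1_100, 10_000, 1_000, 10_000)
print(f"95% CI for the treatment effect: [{low:+.2%}, {high:+.2%}]")
```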
To summarize what we have covered so far:
- We calculated the treatment effect, for example, “Having a blue button increased our conversion rate by 0.1%”.
- We obtained a measure of the uncertainty in the size of the treatment effect with no mention of hypothesis testing.
Mitigating Risk
When running a test, we want to control for two types of error we can make when taking an action based on our estimated treatment effect.
Type 1 Error AKA the “False Positive”.
In jargon, the first kind of error is the rejection of a true null hypothesis as the result of a test procedure. It is sometimes called an error of the first kind.
When a type I error occurs, we conclude that:
- the treatment has a positive effect size when it doesn’t have any real positive effect. (One tail)
- the treatment has a different effect than the control (either strictly better or strictly worse) when it doesn’t have any different effect than the control. (Two tail)
Type 2 Error AKA the “False Negative”.
The second kind of error is the failure to reject a false null hypothesis as the result of a test procedure. It is also referred to as an error of the second kind.
When a type II error occurs, we conclude that:
- the treatment does not have a positive effect size when it does have a real positive effect. (One tail)
- the treatment does not have a different effect than the control (it isn’t either strictly better or strictly worse) when it does have a different effect than the control. (Two tail)
How to specify and control the probability of these errors?
Controlling Type 1 errors
The probability that our test will make a Type 1 error is called the significance level of the test, or alpha level; choosing it is how we control this kind of error.
Running the test for an alpha of 0.05 means that we only make Type 1 errors up to 5% of the time. Perhaps an alpha of 1% would make more sense for another use case, or maybe 0.1%. It all depends on how damaging it would be for you to take some action based on a positive result when the effect doesn’t exist.
This doesn’t mean that if you get a significant result, that only 5% (or whatever your alpha is) of the time it will be a false positive. The rate at which a significant result is a false positive will depend on how often you run tests that have real effects.
Let’s imagine for a moment that you never run any experiments where the treatments were “truly” better than the control. In this case, you should expect to see significant results in up to 5% (alpha%) of your tests, and all of them will be false positives (Type 1 errors). You need to grok this idea to thoughtfully run your AB Tests.
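You can see this behaviour directly by simulating A/A tests, where the “treatment” is identical to the control, so every significant result is a false positive. A small sketch (the simulation parameters are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, true_cr, n = 0.05, 0.10, 10_000
n_experiments = 2_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups share the exact same true conversion rate:
    # any "significant" difference is a false positive.
    a = rng.binomial(n, true_cr) / n
    b = rng.binomial(n, true_cr) / n
    se = np.sqrt(a * (1 - a) / n + b * (1 - b) / n)
    z = (b - a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))        # two-tailed test
    false_positives += p_value < alpha

print(f"Significant results with no real effect: {false_positives / n_experiments:.1%}")
# Roughly alpha (5%) of these no-effect experiments come out "significant".
```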
Controlling Type 2 errors
Controlling Type 2 errors is based on the “Power” of the test, which is, in turn, based on “beta”. Power is the probability of avoiding a Type II error, i.e. Power = 1 − beta.
For example, with a beta of 0.2 (or Power of 0.8), your test would fail to discover, on average up to 20% of the time, that the treatment is superior to the control. As with alpha, it is up to you to set what beta should be. Again, this will depend on how costly this type of mistake is for you. Spend some time thinking about what this means.
What is amazing about hypothesis testing is that if you collect the data correctly, you are guaranteed to limit the probability of making these two types of errors (based on the alpha and beta you decided). If we are mindful of confounding, all we need to do is collect the correct amount of data.
Running the test after we have collected our pre-specified sample will assure we control these two errors at our specified levels.
What is the relationship between alpha, beta, and the sample size?
The sample size is the payment we must make to control Type 1 and Type 2 errors. Increasing the error control on one means you either have to lower the control on the other or increase the sample size. Power calculators compute the sample size needed, based on a minimum treatment effect size and the desired alpha and beta.
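A power calculator for comparing two conversion rates typically boils down to a formula like the one below (normal approximation, two-tailed test). The baseline rate and minimum detectable effect here are made up:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_group(p_control, mde, alpha=0.05, power=0.8):
    """Approximate sample size per group to detect an absolute lift `mde`
    over `p_control` with a two-tailed test (normal approximation)."""
    p_treat = p_control + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# e.g. baseline 10% CR, smallest effect we care about is +1 percentage point.
print(sample_size_per_group(0.10, 0.01))   # roughly 15,000 users per group
```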
What if I decide to stop before all of the planned data was collected?
You are free to do whatever you like. You can stop a test or make a decision earlier.
However, keep in mind that the Type 1 and Type 2 risk guarantees that you were looking to control will no longer hold. If they were important to you and your organization, losing these guarantees is the price you pay for stopping data collection early.
The Waiting Game
All we have to do is run the test and wait. Once we have collected our data based on the pre-specified sample size, in weekly or monthly blocks, we don’t have to deal with any issues of selection bias or biased treatment effects.
By waiting, we are doing the simplest thing possible, and in exchange we get the most robust estimation and risk control.
If you have more than one treatment, then you can adjust your Type 1 error control, because the more treatments you run, the greater the chance of making a Type 1 error.
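The article doesn’t name a specific adjustment; the simplest option is a Bonferroni-style correction, which splits your overall alpha across the treatment-vs-control comparisons. A minimal sketch:

```python
def bonferroni_alpha(overall_alpha: float, n_comparisons: int) -> float:
    """Split the overall Type 1 error budget equally across comparisons."""
    return overall_alpha / n_comparisons

# Three treatments compared against one control, keeping overall alpha at 5%.
per_test_alpha = bonferroni_alpha(0.05, 3)
print(f"Use alpha = {per_test_alpha:.4f} for each treatment-vs-control test.")
```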
Wrapping it up
At the end of the day, A/B Testing is just an application of sampling. We want to control the risk of making a change to the whole population. It is often a good idea to run tests to get estimates of treatment effects, and a measure of uncertainty around those effects. And the better you understand the relative costs of false positives and false negatives, the better you can determine whether you are willing to pay to reduce their occurrences.