If your A/B tests don’t seem to work, you may be making one of the common mistakes: peeking at results too early, splitting traffic incorrectly, or misinterpreting the outcome. Any of these can wipe out the benefit of the experiment and can even damage the business.

As a data scientist, I want to describe the design principles of A/B tests based on data science techniques. They will help you make sure that your A/B tests produce statistically valid results and move your business in the right direction.

# Define a quantifiable success metric

As you may already know, A/B testing, or split testing, is a randomized experiment in which you compare two variants and choose the better one. Typical use cases are a landing page redesign, headline testing, banner testing, and so on.

Before starting an experiment, the first step is to **define a quantifiable success metric**. It should be sensitive to the change you are testing and play a fundamental role in making the right decision.

The metrics usually reflect the business’s goals. The most popular metrics to measure a hypothesis are:

- conversion rate (page view to signup, page view to button click, etc.)
- economic metrics (revenue per shift, average check, etc.)
- behavioral metrics (depth of pageview, average session duration, user retention rate, feature usage)

For example, we want to add a product video on the Statsbot main landing page and test how it performs compared to the page with a product image. A success metric for us would be the conversion rate from a page view into a signup.

# Randomize your traffic

Let’s split our clients into two groups: A and B. The first group will continue to see the old version of the landing page, and the second will see the new page with a video.

The success of the whole A/B test depends on splitting users into groups correctly. The details can vary from case to case, but the main requirement is that **the two samples should be homogeneous**.

This is an extremely sensitive issue that influences all your further actions. People often come up with pseudo-random splits that actually correlate with age, gender, nationality, geography, etc. Data science approaches help prevent cases where the split groups depend on such factors.
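
One common way to keep the split independent of user attributes is to hash a stable user identifier. The sketch below is only an illustration under stated assumptions: the function name `assign_variant`, the experiment name `landing-video`, and the `user-N` identifiers are hypothetical and do not come from any particular tool.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Deterministically map a user to 'A' or 'B' (hypothetical helper)."""
    # Salting with the experiment name keeps assignments in different
    # experiments independent of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"

# Quick sanity check on made-up user ids: the split should be close to 50:50.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "landing-video")] += 1
print(counts)
```

Because the assignment depends only on the identifier and the experiment name, a returning user always sees the same variant, and the split does not depend on attributes such as age or geography.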

One well-known way to validate a split is the intraclass correlation coefficient (ICC), which shows how strongly the distribution of a feature differs between the split groups:

- The ICC value is close to 0 (it can even be slightly below 0, because Fisher's formula for the ICC is unbiased). This means that the splitting strategy is a good one.
- The ICC value is close to -1 or 1. This means that you’d better try another splitting strategy.
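
As a rough illustration of this check, the sketch below computes a one-way ANOVA estimate of the ICC for a single feature, treating the split groups as the classes. This estimator is an assumption made for the example and may differ from the exact formula you use; the ages are simulated, not real data.

```python
import random
from statistics import fmean

def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC; each group is a list of feature values."""
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    k = len(groups)
    grand_mean = fmean(x for g in groups for x in g)
    means = [fmean(g) for g in groups]
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size; equals the common size when groups are balanced.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

rng = random.Random(7)
ages_a = [rng.gauss(35, 10) for _ in range(1000)]
ages_b = [rng.gauss(35, 10) for _ in range(1000)]         # same age distribution
ages_b_skewed = [rng.gauss(50, 10) for _ in range(1000)]  # older users ended up in B

print(f"homogeneous split: ICC = {icc_oneway([ages_a, ages_b]):.4f}")    # near 0
print(f"biased split:      ICC = {icc_oneway([ages_a, ages_b_skewed]):.4f}")  # clearly above 0
```

In practice you would repeat the check for every feature you care about (age, geography, traffic source, and so on).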

Besides, you need to take into account the split ratio. **A 50:50 split is the most popular choice for simple A/B testing and leads to the quickest results**. Nevertheless, many companies are more cautious and split traffic 20:80 or even 10:90, with the smaller fraction given to the experiment group, because an unsuccessful experiment can lead to a big loss of conversions or income.

# Achieve statistical significance

Even if you see a breathtaking jump in your success metric, wait until the result is statistically significant before acting on it.

The main factors that affect the statistical significance of any experiment are the effect size, the sample size, and the significance level alpha.

In an A/B experiment we have two hypotheses (the null hypothesis H0 against the alternative H1) and calculate the test statistic that corresponds to the selected statistical test. For our example, we can test whether the means of the two samples are equal (H0) or differ (H1).

The **p-value** is the probability of observing a statistic at least as extreme as the one measured, assuming the null hypothesis is true. If the p-value is less than a chosen threshold, the significance level alpha (typically 0.05), we reject H0 and call the result statistically significant. Otherwise, we do not reject H0.

One of the most popular statistical tests for an A/B experiment is the two-sample Student’s t-test, which checks the equality of means. It performs well on small amounts of data, since it takes the sample size into account when assessing significance. Choose the statistical test that suits your metric and your data.
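
For a conversion-rate metric such as page view to signup, a common large-sample alternative to the t-test is the pooled two-proportion z-test. The sketch below uses it in place of the t-test so that it can run on the Python standard library alone; the visitor and signup counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No conversions at all (or only conversions): nothing to compare.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: 200 signups out of 5000 views (A) vs 250 out of 5000 (B).
z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

The normal approximation behind this test is only reasonable when each group has a fairly large number of conversions and non-conversions.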

There is also a large number of services that can help you calculate the number of visitors you need to achieve your goal.
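
If you prefer to estimate the sample size yourself, the standard normal-approximation formula for comparing two proportions can be sketched as follows. The statistical power of 0.8 is an assumption that the text above does not mention, and the baseline and target conversion rates are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.8):
    """Approximate visitors needed per group to detect p_base -> p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_target) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_target * (1 - p_target))
    ) ** 2
    return ceil(numerator / (p_target - p_base) ** 2)

# Hypothetical goal: lift the signup conversion rate from 4% to 5%.
print(sample_size_per_group(0.04, 0.05))
# A smaller expected effect needs many more visitors.
print(sample_size_per_group(0.04, 0.045))
```

The result assumes a 50:50 split; with an unequal split, the smaller group is the one that has to reach roughly this size.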

Always rely on statistical tests; don’t just compare the means or medians of the control and experimental groups, because that can lead you astray. Two experiments can show exactly the same means and still differ completely in how convincing the effect is, depending on the spread of the data.

The smaller the overlap between the two distributions, the more confident we can be that the effect is really significant.
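
To make this concrete, here is a small sketch that computes Welch's t statistic from summary statistics for two hypothetical experiments. Both have the same pair of means, but the spread of the data differs, so the overlap and the strength of the evidence differ too. All numbers are invented for illustration.

```python
from math import sqrt

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic for the difference of two sample means."""
    return (mean_b - mean_a) / sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Same means (10.0 vs 10.5) and same sample sizes in both experiments.
narrow = welch_t(10.0, 1.0, 100, 10.5, 1.0, 100)  # small spread, little overlap
wide = welch_t(10.0, 5.0, 100, 10.5, 5.0, 100)    # large spread, heavy overlap

print(f"narrow distributions: t = {narrow:.3f}")  # t = 3.536
print(f"wide distributions:   t = {wide:.3f}")    # t = 0.707
```

With the narrow distributions the statistic is far from zero, while with the wide ones the same difference in means is well within the noise.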

# Interpretation problems

Finally, we have finished the A/B test. If the experiment group has shown a statistically significant improvement in our success metric, we can add a product video to the homepage of our website with confidence.

If not, it’s important to **analyze the obtained result and revise the whole cycle of the experiment**. The problem may lie in a wrong split, in a testing period that affects specific subgroups of people, or in some other influence that wasn’t taken into account.

Such external and internal factors can be:

- advertising campaigns
- day of the week
- weather or seasonality
- spike of market activity
- call-center operations
- employee actions

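The most common and tricky mistake that can occur at this stage of A/B testing is the so-called peeking problem: checking the results repeatedly while the test is still running and stopping as soon as they look significant, which inflates the false-positive rate.

To avoid it, you have to define the sample size before the experiment and calculate the result only on this sample, neither earlier nor later.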
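
The effect of peeking can be illustrated with a simulated A/A test, where both groups have the same true conversion rate and any "significant" result is a false positive. The sketch below is a toy simulation under assumed parameters (10% conversion rate, 2000 users per group, 10 interim looks), using a pooled two-proportion z-test.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided p-value of a pooled two-proportion z-test with n users per group."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(n_sims=400, n_per_group=2000, looks=10, rate=0.1, alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, false-positive rate without)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p = 1.0
        for look in range(1, looks + 1):
            # Both groups convert at the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = p_value(conv_a, conv_b, look * step)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look  # would have stopped early
        fixed_hits += p < alpha                  # tested once, on the full sample
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate_aa()
print(f"false positives when peeking at every look: {peeking_rate:.1%}")
print(f"false positives when testing once at the end: {fixed_rate:.1%}")
```

Testing once on the planned sample keeps the false-positive rate near alpha, while stopping at the first significant look pushes it well above that level.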

# Key advice

A/B testing is a subtle and treacherous thing if you do it badly. The following advice, based on a data science approach, can help you benefit from your experiment.

- Define a success metric and effect size before starting the test.
- Never rely on your intuition, and don’t stop the experiment before it reaches the sample size you planned.
- Test for statistical significance only after the experiment has finished, to avoid the peeking problem.
- Think about the period of testing. It’s a bad idea to run an A/B test during the weekend or holidays. The right experiment should cover all weekdays, all traffic sources, and so on.
- Try to break a big change into smaller sub-changes, because the combined effect may show no improvement even when the smaller changes would each increase the metric on their own.
- Don’t expect a large improvement of the metric. The majority of successful A/B tests give a 1-2% increase of the metric.
- Be careful of the data noise in your test. Statistical criteria can't catch it. For example, many people can be interested in a new feature or new design when it is initially released. This leads to abnormal behavior and should be cleaned up in the final comparison.
- Don’t start A/B testing if you have only a few clients. Reaching statistical significance can take months or even years, and the result will be unreliable in most cases.
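
The advice about the testing period and about small client bases can be turned into a quick feasibility check. The sketch below is a hypothetical helper that converts a required sample size into a duration rounded up to whole weeks, so that every weekday is covered equally; the traffic figures are made up.

```python
from math import ceil

def duration_in_full_weeks(required_per_group, daily_visitors, share_in_test=1.0, groups=2):
    """Days needed to collect the sample, rounded up to whole weeks.

    Assumes an equal split between groups and steady daily traffic.
    """
    total_needed = required_per_group * groups
    days = ceil(total_needed / (daily_visitors * share_in_test))
    return ceil(days / 7) * 7

# Hypothetical site: 6745 visitors needed per group, 1500 visitors per day.
print(duration_in_full_weeks(6745, 1500))  # 14
# With only 50 visitors per day, the same test would run for many months.
print(duration_in_full_weeks(6745, 50))
```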

Don’t be afraid of A/B testing your hypotheses. Just do it right, using the data science approaches above.

What you need to do is form a hypothesis with a success metric, randomize your traffic correctly, evaluate statistical significance on the whole planned sample, and interpret the results taking into account as many factors as you can. Otherwise, you risk wasting a lot of resources to get insights that mess up your business.