How to A/B test things the right way

If you’ve run an A/B test before, there’s a good chance you did it wrong.

No, that doesn’t mean you didn’t learn anything. But it does mean you probably didn’t learn as much as you could have.

I’ve helped more than 300 startup founders run A/B tests to choose between logos, packaging concepts, ad copy, and product feature combinations. Here’s one of the BIGGEST things I’ve learned about what usually goes wrong with these kinds of studies, and how to avoid it in order to maximize the value of your A/B test for your startup.

What is an A/B test?

I probably shouldn’t be presumptuous. Not everyone knows what an A/B test is.

An A/B test (sometimes called a “split test”) is a randomized experiment wherein two (or more) versions of something are shown to different groups of people, selected randomly from the same sample population. These respondents then give feedback (direct or implied) on the version they saw, answering the same questions as the other group(s).

The goal is to help founders (or marketers, or product managers, etc.) determine which version of whatever it is they’re working on yields the best response, relative to the unique goals of that founder.

A/B tests are versatile. You can use them to compare reactions to competing logo designs, packaging concepts, or new feature ideas, and more broadly to bring data to any decision involving competing versions of something.

How an A/B test works

An A/B test usually involves three things:

  1. Uploading the images, videos, or other stimuli to be tested—that is, your A and B versions.
  2. The questions that follow the presentation of the stimuli—one set of questions that will be asked after the presentation of either concept A or concept B.
  3. A meaningful analysis of the results.

Of course, the audience is also important—people need to actually take your A/B test. But that’s not something you build into your study.

Once these things are set up, whatever tool you’re using should automatically assign every participant in this study to one of the two (or more) groups. If you’ve uploaded two versions of a logo concept, for example, your A/B test tool should assign half of participants to one group (say, the A group) and the other half to the other group (the B group).

The A group sees the A stimuli, the B group sees the B stimuli, then both groups answer the same set of questions about what they saw.
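
If it helps to see those mechanics spelled out, here’s a minimal sketch in Python of what that assignment step could look like. This is an illustration only, not the actual MarketResearchforStartups.com tool; the file names and question wording are placeholders.

    import random

    # Illustrative sketch: randomly assign each respondent to a group,
    # show that group's stimulus, and ask the SAME questions either way.
    STIMULI = {"A": "logo_concept_a.png", "B": "logo_concept_b.png"}  # placeholder files
    QUESTIONS = [
        "How interested are you in learning more about this product?",
        "How likely would you be to try this product?",
    ]

    def assign_respondent(respondent_id: str) -> dict:
        group = random.choice(["A", "B"])  # a simple 50/50 random split
        return {
            "respondent": respondent_id,
            "group": group,
            "stimulus": STIMULI[group],
            "questions": QUESTIONS,  # identical for both groups
        }

    print(assign_respondent("respondent-001"))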

That last point is critical. If you’re asking a different set of questions after each concept presentation, you’re not doing an A/B test, and you’re not going to be any closer to making a truly data-driven decision about which of the two versions is most likely to help you reach your goals.

What most people get wrong when A/B testing

Plain and simple: The biggest problem with the way most people run A/B tests is that they don’t isolate the differences between the two (or more) versions being tested, because they’re not thinking iteratively.

This is easiest to explain by example. So consider Sara.

Sara is a developer who’s built a website analytics tracking app called Saralytics in her spare time. She’s excited to launch, but she’s concerned that her lack of experience with branding and marketing means she won’t pick the best logo for her app.

So she decides to run a survey. She asks a friend to come up with two different logo concepts for Saralytics, and she uses MarketResearchforStartups.com to run an A/B test.

Her two logos look like this:

[Logo concept A]

[Logo concept B]

Nice, right?

Respondents to her survey see one of these two logos, and are then asked the following three questions:

  1. How interested are you in learning more about this app?
  2. How likely would you be to download this app?
  3. True or False: This looks like a reputable app.

Great follow-up questions. So we should expect to learn a great deal from this survey, right? Sara should be able to pick the optimal logo for her new app?

Well, unfortunately not. She may learn which one will perform better, but she won’t know whether she’s come anywhere close to maximizing the value that a logo could bring to her business.

Why?

Well, these two logos differ in more than one way—color and typeface. So the unfortunate fact for Sara is that even if results show that respondents strongly prefer one logo to the other, she has no way of knowing whether it was the color or the typeface driving most respondents’ decision.

Because Sara’s A/B test showed concepts that differed in two ways at once, she has no way of knowing which of those two differences drove the result.

Sure, Sara could simply pick the winning logo. And she’ll probably get more app downloads than if she picked the other one, or ran no A/B test whatsoever. But is it possible that the typeface from one and the color from the other would be 10X more popular than either of the two combinations presented here?

Of course! And that’s the problem.

How to A/B test the right way

It’s probably obvious by now, but let me be clear:

A/B tests need to be part of an iterative approach to optimization.

In other words, don’t try to answer every question with one survey. In the logo example, use separate A/B tests for color and typeface. And if budget allows, use separate A/B tests for higher-level look and feel (e.g., professional vs. quirky, corporate vs. boutique) before diving into specific colors and typefaces.

I say “if budget allows” because there really is no end to this process. Optimization is a continuous effort—a journey, not a destination. But I also understand that no company has an unlimited budget for this kind of work. So decide for yourself where your intuition ends—where you begin to be unsure about what’s optimal—and where your market research should start. That’s where you should start running A/B tests.

Anyway, to run an airtight A/B test that leaves you with a high level of confidence, make sure your A and B concepts differ in only one way. That way, you’ll know exactly what’s driving respondents’ decisions, and the results will give you concrete knowledge about how your target market thinks, not just whether they prefer one arbitrary concept to another.

One difference per A/B test!

So, back to Sara. The best approach for her actually entails two A/B tests. Since she’s unsure about both color and typeface, she should run separate A/B tests for each: start with a basic name and logo design, then test two (or more) typefaces to see which resonates best. Like this:

[Typeface concept A]

[Typeface concept B]

Let’s say concept A “wins.” Then Sara should take concept A and present it in another A/B test in two (or more) colors. Like this:

[Winning typeface in color A]

[Winning typeface in color B]

Once these two tests are done, she’ll be more confident about the final design than if she’d tested two very different concepts against each other.
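
If you like thinking in code, here’s a rough sketch of that two-step flow, with made-up scores standing in for real survey responses (the function and variant names are purely illustrative):

    # Illustrative sketch of the iterative flow, using made-up example scores.
    def pick_winner(responses):
        """responses maps a variant name to a list of 1-5 preference scores."""
        averages = {variant: sum(scores) / len(scores)
                    for variant, scores in responses.items()}
        return max(averages, key=averages.get)

    # Test 1: two typefaces, same base design, same neutral presentation.
    typeface_winner = pick_winner({
        "typeface_A": [4, 5, 3, 4, 5],
        "typeface_B": [3, 2, 4, 3, 3],
    })

    # Test 2: two colors, both applied only to the winning typeface.
    color_winner = pick_winner({
        "color_A": [4, 4, 5, 5, 4],
        "color_B": [3, 4, 3, 4, 3],
    })

    print(typeface_winner, color_winner)  # e.g. typeface_A color_A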

Now, there is a caveat here. Sometimes respondents will prefer one particular combination of elements (say, typeface A in color A) but not the individual parts of that combination presented differently (say, typeface A in color B). So while you may choose each individual element iteratively, through successive A/B tests, you may not end up with exactly the combination respondents would have preferred if they’d been able to select their favorite from all possible combinations at once.

However, I still strongly suggest you run A/B tests iteratively, in the manner described above, because the alternative is presenting hosts of random options to your respondents with no real rhyme or reason, effectively operating in the dark about respondents’ thinking and motivations.

Rather, by running successive, iterative tests, you’ll notice trends in the way people respond, and you’ll be able to decide, before each new A/B test, which combinations are even worth including. You’re more likely to intuit which options will resonate with respondents before a test even begins, because your research has already shown you that, for example, respondents consistently prefer red over blue, or a print typeface over a script one.

This might get you 90% of the way there before you even run a test, versus throwing anything at the wall just to see what sticks (which may well leave you with respondents’ most-preferred option being something that doesn’t actually excite them).

To make this even more robust…

(Obnoxiously specific details ahead…)

Perhaps you’ve done A/B tests like this before (or perhaps you’re just reading this very carefully), and you’re picking up on an important caveat here.

You see, when Sara determines the optimal typeface in one A/B test, then the optimal color in a second A/B test, how can she know that, had she simply presented the losing typeface in a different color, her study participants wouldn’t have liked that one better?

Well, she can’t know. Unless…the way she presents the two competing typefaces in study #1 is color-neutral. That means the stimuli uploaded to that first study don’t just show the typefaces in one color, but instead either:

  • Shows several versions of the same logo together in one image, such that respondents aren’t fixated on just one color, or…
  • Randomizes the color of each of the two typefaces presented, such that every respondent sees their typeface in one of, say, 12 different colors (a rough sketch of this follows below).
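
Here’s what that second option could look like when the stimuli are generated, in the same illustrative Python style as above. The color list and file naming are placeholders, not anything specific to a particular tool.

    import random

    # Illustrative sketch: each respondent sees their assigned typeface in a
    # randomly chosen color, so neither typeface is tied to any single color.
    COLORS = ["red", "orange", "yellow", "green", "teal", "blue",
              "navy", "purple", "pink", "brown", "gray", "black"]  # say, 12 colors

    def pick_stimulus(group: str) -> str:
        color = random.choice(COLORS)  # color varies within each typeface group
        return f"logo_typeface_{group}_{color}.png"  # placeholder file name

    print(pick_stimulus("A"))  # e.g. logo_typeface_A_teal.png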

This is a more sophisticated study, but it can be worthwhile if your level of intuition surrounding this decision is very low. Just determine for yourself the level of statistical confidence you require in order to feel good about the decision you’re facing. If you’re unsure, just reach out—I’ll respond ASAP.
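
For what it’s worth, one common way to put a number on that confidence is a two-proportion z-test on a yes/no question (say, the share of each group answering “True” to the reputability question). Here’s a minimal sketch with made-up counts; many survey tools report something equivalent for you.

    from math import erf, sqrt

    # Illustrative sketch: two-proportion z-test with made-up counts.
    def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
        p_a, p_b = successes_a / n_a, successes_b / n_b
        p_pool = (successes_a + successes_b) / (n_a + n_b)
        std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / std_err
        # Two-sided p-value from the standard normal distribution.
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Example: 130 of 200 "True" answers for concept A vs. 104 of 200 for concept B.
    z, p = two_proportion_z_test(130, 200, 104, 200)
    print(f"z = {z:.2f}, p = {p:.4f}")  # a p-value below your chosen threshold (e.g. 0.05) suggests a real difference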

Run an A/B Test now

Our A/B Test tool is designed for entrepreneurs and innovators seeking an iterative approach to optimization. Try it now!