The 4 Worst Things About Google Content Experiments You Need to Know

By Jenny DeGraff | Apr 14, 2014
More Articles by Jenny

Google Content Experiments has received many a good review and seems to be widely accepted as a promising testing platform. Admittedly, it does have a lot going for it. Experiments:

  • Is FREE
  • Is easy to set up
  • Is integrated with your Google Analytics account
  • Allows for advanced segmentation, filtering, and traffic allocation

Because Experiments is folded into Google Analytics, you can test against your already established Analytics goals as the test’s conversion goal. You are also able to view other goals, site usage, and ecommerce values as secondary conversion actions for supporting test conclusions. This is a great feature, helping to fully integrate your testing strategy into your digital marketing efforts. However, when evaluating a testing platform you should note that Google Content Experiments has several limitations and is a VERY basic tool.

1. No Multivariate Tests

Google Content Experiments tests on an A|B|n platform. As such, you must create full iterations of landing pages to test against each other. If you are testing a full layout change, the A|B|n model will suit your needs just fine. However, if you’d like to test multiple elements at one time or a small element, like button color, you would need to create a new landing page for each different color button you will test. This can become quite cumbersome. This platform will also present challenges if you would like to test the influence of one element across multiple pages. You will need a much more sophisticated tool for this type of testing.
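The combinatorial cost of testing multiple elements on an A|B|n platform is easy to see: every combination of elements needs its own full landing page. A quick sketch with two hypothetical elements (the element names and values below are invented for illustration):

```python
from itertools import product

# In an A/B/n tool with no multivariate support, every combination of
# elements must be built as its own complete landing-page variation.
buttons = ["green", "orange", "red"]
headlines = ["short", "long"]

pages = [f"button={b}, headline={h}" for b, h in product(buttons, headlines)]
print(len(pages), "separate pages to build")  # → 6 separate pages to build
```

Add a third element with three options and you are suddenly building 18 pages — a true multivariate tool would test the elements independently instead.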

2. Beware of The Multi-Armed Bandit

What the heck is a multi-armed bandit, you ask? Google support explains:

Twice per day, we take a fresh look at your experiment to see how each of the variations has performed, and we adjust the fraction of traffic that each variation will receive going forward. A variation that appears to be doing well gets more traffic, and a variation that is clearly underperforming gets less. The adjustments we make are based on a statistical formula that considers sample size and performance metrics together, so we can be confident that we’re adjusting for real performance differences and not just random chance. As the experiment progresses, we learn more and more about the relative payoffs, and so do a better job in choosing good variations.

This sounds awesome, right? Yes, in theory it would be much more efficient than your classic A|B testing method, concluding tests much quicker while reducing potential revenue loss from under-performing variations. Unfortunately, we have found that for testing with small sample sizes, like those commonly conducted in the B2B world, the multi-armed bandit has the potential to create an invalid test that will never declare a winner. This is because traffic to the variation is typically reduced so severely and so quickly that there is not a significant enough sample size to give it a chance at all.
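You can see this collapse happen in a toy simulation. The sketch below uses Thompson sampling, one common multi-armed bandit strategy that, like the Bayesian approach Google describes, routes traffic in proportion to each variation's estimated chance of being best. The arm names, "true" conversion rates, and visit counts are all invented for illustration:

```python
import random

def daily_allocation(arms, visits_per_day=40, days=6, draws=20_000, seed=1):
    """Toy Thompson-sampling bandit: each day, route traffic in proportion to
    each arm's probability of being best (from Beta posteriors over its
    observed conversions), then simulate that day's conversions.
    `arms` maps name -> assumed true conversion rate (simulation only)."""
    rng = random.Random(seed)
    stats = {name: [0, 0] for name in arms}          # name -> [conversions, visits]
    for day in range(1, days + 1):
        # Estimate P(arm is best) by sampling plausible rates for every arm.
        best = {name: 0 for name in arms}
        for _ in range(draws):
            samples = {n: rng.betavariate(1 + c, 1 + v - c)
                       for n, (c, v) in stats.items()}
            best[max(samples, key=samples.get)] += 1
        shares = {n: best[n] / draws for n in arms}
        # Allocate today's visits by share and simulate conversions.
        for name, rate in arms.items():
            v = round(visits_per_day * shares[name])
            stats[name][1] += v
            stats[name][0] += sum(rng.random() < rate for _ in range(v))
        print(f"day {day}:", {n: f"{shares[n]:.0%}" for n in arms})
    return stats

stats = daily_allocation({"original": 0.12, "challenger": 0.03})
```

With small daily traffic, a few unlucky early days are enough to starve the weaker arm of visits, so it never accumulates a sample large enough to recover.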

[Figure: sample experiment results by day]

By the third day the variation received no traffic at all (and it’s difficult to have any conversions with no traffic).

3. Has Problems Reaching Statistical Significance

The smaller your sample size, the longer your test will need to run to achieve statistical significance. It has been suggested that a good ballpark is to aim for at least 100 conversions per variation before looking at statistical confidence. The challenger variation from the above experiment stopped receiving traffic at only 40 visits and 3 conversions. As a result, there is no chance this experiment will reach statistical significance.
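The 100-conversions rule of thumb also gives you a quick way to estimate how long a test must run. A back-of-the-envelope sketch, using hypothetical traffic and conversion-rate numbers:

```python
def days_to_conversions(daily_visits, conv_rate, variations, target=100):
    """Rough days needed for each variation to reach `target` conversions
    when traffic is split evenly across all variations."""
    per_variation_visits = daily_visits / variations
    expected_daily_conversions = per_variation_visits * conv_rate
    return target / expected_daily_conversions

# Illustrative: 200 visits/day, a 5% conversion rate, and a simple A|B test.
print(f"{days_to_conversions(200, 0.05, 2):.0f} days")  # → 20 days
```

At B2B traffic levels the answer is often measured in months — and that assumes the traffic split stays even, which the multi-armed bandit does not guarantee.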

4. Google Wants the Challenger to Win (Really Badly)

While the original variation’s conversion rate remains steady with slight improvement, the challenger’s conversion rate nose-dives. As a result, by day 6 the original is outperforming the challenger by 158%. However, according to Google Experiments, there is still an 8.8% probability that the challenger will outperform the original. With no traffic allocated to the page, it is highly unlikely that the challenger’s conversion rate will do anything but continue to decline.
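That "probability of outperforming" figure is a Bayesian quantity, and you can approximate one like it yourself by sampling from Beta posteriors over each variation's conversions. In this sketch, the challenger's 3 conversions on 40 visits come from the experiment above, but the original's 30/155 is an invented stand-in, so the result will not match Google's 8.8% exactly:

```python
import random

def prob_challenger_beats(orig_conv, orig_n, chal_conv, chal_n, draws=100_000):
    """Monte-Carlo estimate of P(challenger's true rate > original's true rate)
    under independent Beta(1 + conversions, 1 + misses) posteriors."""
    wins = 0
    for _ in range(draws):
        theta_orig = random.betavariate(1 + orig_conv, 1 + orig_n - orig_conv)
        theta_chal = random.betavariate(1 + chal_conv, 1 + chal_n - chal_conv)
        wins += theta_chal > theta_orig
    return wins / draws

random.seed(0)
# Challenger frozen at 3/40 (from the experiment); original counts are invented.
prob = prob_challenger_beats(30, 155, 3, 40)
print(f"P(challenger beats original) ≈ {prob:.1%}")
```

Notice that the probability never reaches zero: with only 40 visits the challenger's posterior is so wide that the math keeps the door open, even though no new traffic will ever arrive to settle the question.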

Keys to Overcoming The Google Content Experiment Obstacles

Despite the above, if you want to conduct very simple (and free) tests, Google Content Experiments is not the worst tool. You should just be aware of its pitfalls. To overcome the Multi-Armed Bandit issue, don’t use it. When setting up your experiment be sure to turn on the option to distribute traffic evenly across all variations. You can find this under the Advanced Options.


To deal with the statistical significance issue, you may have to be satisfied with running your own numbers. KISSmetrics has a nice A|B significance tool that can help you determine the lift and significance reached from any A|B test.
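If you would rather run the numbers yourself, the lift and significance a calculator like KISSmetrics’ reports boil down to a standard two-proportion z-test. This sketch tests the challenger's 3/40 from above against an invented original of 20/200:

```python
from math import sqrt, erf

def ab_significance(conv_a, n_a, conv_b, n_b):
    """Lift of B over A and two-tailed p-value from a two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = (p_b - p_a) / p_a
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed
    return lift, p_value

# Original (invented): 20 conversions / 200 visits. Challenger: 3 / 40.
lift, p = ab_significance(20, 200, 3, 40)
print(f"lift {lift:+.0%}, p = {p:.2f}")
```

A p-value this far above 0.05 confirms the point: at these sample sizes the test is nowhere near a statistically significant conclusion.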

Have you had better luck with Google Content Experiments? Or do you have another favorite testing platform? Share your recommendations with me on Twitter @JennyDeGraff or in the comments below.





Comments

Charles R. Twardy:

You claim that “with no traffic allocated to the page” the conversion rate will “continue to decline”. But if there are no more visits, the conversion rate here is fixed at 3/40. Also, what is the evidence that traffic went to 0? The graph shows conversion rate per day, not traffic. The experiment has declared no winner yet, so should still be allocating traffic to both. Finally, your point about statistical significance is invalid: (a) you can’t argue against one design by saying it doesn’t match rules of thumb developed for another design, and (b) using the KISSmetrics tool you suggest shows that if this had been a flat design we’d already be *99%* confident that the original page is better.