Google Content Experiments has received plenty of good reviews and seems to be widely accepted as a promising testing platform. Admittedly, it does have a lot going for it.
Because Experiments is folded into Google Analytics, you can use your already-established Analytics goals as the test's conversion goal. You can also view other goals, site usage, and ecommerce values as secondary conversion actions to support test conclusions. This is a great feature, helping to fully integrate your testing strategy into your digital marketing efforts. However, when evaluating a testing platform you should note that Google Content Experiments does have a couple of limitations, and it is a VERY basic tool.
Google Content Experiments tests on an A|B|n platform. As such, you must create full iterations of landing pages to test against each other. If you are testing a full layout change, the A|B|n model will suit your needs just fine. However, if you'd like to test multiple elements at once, or a small element like button color, you would need to create a new landing page for each button color you want to test. This can become quite cumbersome. This platform will also present challenges if you would like to test the influence of one element across multiple pages. You will need a much more sophisticated tool for that type of testing.
What the heck is a multi-armed bandit, you ask? Google support explains:
Twice per day, we take a fresh look at your experiment to see how each of the variations has performed, and we adjust the fraction of traffic that each variation will receive going forward. A variation that appears to be doing well gets more traffic, and a variation that is clearly underperforming gets less. The adjustments we make are based on a statistical formula that considers sample size and performance metrics together, so we can be confident that we’re adjusting for real performance differences and not just random chance. As the experiment progresses, we learn more and more about the relative payoffs, and so do a better job in choosing good variations.
This sounds awesome, right? Yes, in theory it would be much more efficient than your classic A|B testing method, concluding tests much more quickly while reducing potential revenue loss from under-performing variations. Unfortunately, we have found that for testing with small sample sizes, like those commonly seen in the B2B world, the multi-armed bandit has the potential to create an invalid test that will never declare a winner. This is because traffic to the variation is typically reduced so severely and so quickly that it never accumulates a large enough sample to have a chance at all.
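To see how a variation can get starved this quickly, here is a minimal sketch of a multi-armed bandit using Thompson sampling. This is one common bandit strategy; Google does not publish its exact formula, and the conversion rates below are purely illustrative. Each visitor is routed to whichever variation draws the highest value from its Beta posterior, so an early unlucky streak for the challenger can choke off its traffic.

```python
import random

random.seed(42)

# Illustrative "true" conversion rates -- the challenger really is worse.
true_rates = {"original": 0.12, "challenger": 0.08}
successes = {v: 0 for v in true_rates}
failures = {v: 0 for v in true_rates}
traffic = {v: 0 for v in true_rates}

# Thompson sampling: for each visitor, draw one sample from each arm's
# Beta(successes + 1, failures + 1) posterior and route the visitor to
# the arm with the highest draw.
for _ in range(2000):
    draws = {v: random.betavariate(successes[v] + 1, failures[v] + 1)
             for v in true_rates}
    arm = max(draws, key=draws.get)
    traffic[arm] += 1
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

print(traffic)
```

Run this a few times with different seeds and you will typically see the challenger's share of traffic collapse long before it has enough visits for a statistically meaningful comparison, which is exactly the failure mode described above.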
The smaller your sample size, the longer your test will need to run to achieve statistical significance. It has been suggested that a good ballpark is to aim for at least 100 conversions per variation before looking at statistical confidence. The challenger variation from the above experiment stopped receiving traffic at only 40 visits and 3 conversions. As a result, there is no chance this experiment will reach statistical significance.
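The arithmetic behind that ballpark is simple. Using the challenger's observed numbers from above (3 conversions from 40 visits, a 7.5% conversion rate), here is a quick, back-of-the-envelope estimate of how many visits it would need to reach 100 conversions; the helper function is ours, not part of any testing tool.

```python
# Rough visits needed per variation to reach a conversion target,
# given an assumed conversion rate. Illustrative ballpark only.
def visits_needed(target_conversions, conversion_rate):
    return int(round(target_conversions / conversion_rate))

# The challenger converted 3 of 40 visits (7.5%). At that rate,
# hitting the 100-conversion ballpark would take:
print(visits_needed(100, 0.075))  # 1333 visits -- a far cry from 40
```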
While the original variation's conversion rate remains steady with slight improvement, the challenger's conversion rate nose-dives. As a result, by day 6 the original is outperforming the challenger by 158%. However, according to Google Content Experiments, there is still an 8.8% probability that the challenger will outperform the original. With no traffic allocated to the page, it is highly unlikely that the challenger's conversion rate will do anything other than continue to decline.
Despite the above, if you want to conduct very simple (and free) tests, Google Content Experiments is not the worst tool. You should just be aware of its pitfalls. To overcome the multi-armed bandit issue, simply don't use it: when setting up your experiment, be sure to turn on the option to distribute traffic evenly across all variations. You can find this under the Advanced Options.
To deal with the statistical significance issue, you may have to be satisfied with running your own numbers. KISSmetrics has a nice A|B significance tool that can help you determine the lift and significance reached from any A|B test.
Have you had better luck with Google Content Experiments? Or do you have another favorite testing platform? Share your recommendations with me on Twitter @JennyDeGraff or in the comments below.