Over the past 16 years, I have hired and managed a couple of dozen digital campaign managers, both for guaranteed and programmatic campaigns. In addition, I interviewed many more campaign managers working for premium publishers, DSPs and Agency Trading Desks. Lastly, I interacted with dozens of Digital Advertising and Marketing leaders. Based on these numerous interactions, I came to the following conclusion: manual (human) campaign optimization decisions are often made without correctly calculating statistical significance. As a result, many manual campaign optimization decisions have a strong gambling element.
At the opposite end of the spectrum, some underperforming campaign elements are allowed to run far too long. This happens because the campaign managers are not aware that they reached a statistically significant sample long ago and should have killed the underperformers much earlier.
Often, I’d challenge our newly hired campaign managers by asking them to confirm the statistical significance of their optimization decisions. After applying the statistical tools described later in this post, they’d realize that there was as much as a 35-40% chance that the optimization would have no effect, or a negative effect, on performance. A 55-60% chance of being right is better than a game of roulette, but still not good enough in my opinion, especially considering that there is a straightforward way to arrive at greater confidence.
Possibly the most annoying scenario for publisher campaign managers is when an agency media buyer attempts to optimize, or even cancels, a direct buy prematurely based on a small sample. Agency-side media buyers are notorious for being poorly trained, overworked and underpaid. Explaining the basic concepts of statistics to them was a losing proposition most of the time.
Don’t gamble with your clients’ money. Make sure that the optimization decisions that you make are statistically valid.
Many ad campaign optimization decisions, whether for programmatic or guaranteed campaigns, boil down to determining whether the conversion rate, CTR, or brand recall / favorability rate of Treatment A is better than that of Treatment B. A simple tool known as the z-Test for Two Population Proportions can be used to verify the statistical significance of a difference in ratios, such as the conversion rates or click-through rates of two elements of an ad campaign.
It applies in the following scenarios, among others.
- Audiences and targeting strategies: does Audience A perform better than Audience B?
- Inventory sources: is exchange A inventory really better than exchange B? Same for Publisher A vs Publisher B, or different placements within the publisher site.
- Creative concepts, themes, colors, call to action messages.
- Creative sizes. Does 300×250 really outperform 728×90?
- Geo locations. Does the campaign perform better in some states or DMAs than in others?
- Frequency capping schemes: is 2/day better than 5/day? This one is tricky. Do make sure that the user populations are mutually exclusive, since you don’t want to “cross-pollinate.”
- Day-parting. Does the campaign perform better during office hours or nights and weekends?
Like most statistical tests, it starts with a definition of the Null Hypothesis (H0), which typically states that there is no difference between the two treatments. H0 has to be disproven by an alternative hypothesis (H1), which typically says that there is a significant difference. Using a court analogy, you start with H0 = “suspect is innocent”. Then you have to prove beyond a reasonable doubt that the suspect is guilty (H1).
“Wrongfully accusing the innocent” is known as Type 1 Error. It is an incorrect rejection of a true null hypothesis. For our purposes, this would be concluding that there is a difference between the performance of two campaign elements, when in actuality there is no difference.
What’s “reasonable doubt?” Unless prior to starting your exciting career in Ad Tech you were a juror on Steven Avery’s trial, you’d want the probability of a Type 1 error (known as α) to be no higher than 5%. When your samples are large, such as for high-budget campaigns, high-converting campaigns, campaigns with CTR as a KPI (those still happen) or high-traffic publishers, then you could even go as tight as 1%.
There is a lot of good literature on the math behind the test. Here’s the short version. Since we are trying to confirm any difference, whether positive or negative, we will be looking at a two-tailed test. I.e.,
– H0: p1 = p2 (Conversion rates or CTRs are the same for both treatments)
– H1: p1 ≠ p2 (Conversion rates or CTRs differ between the two treatments)
The calculation is straightforward. First you calculate z-value:
z = (p1 – p2) / sqrt{ p * ( 1 – p ) * [ (1/n1) + (1/n2) ] }, where:
- p1 = conversion rate of treatment A sample
- p2 = conversion rate of treatment B sample
- n1 = impressions of treatment A sample
- n2 = impressions of treatment B sample
- p = combined conversion rate: (conversions A + conversions B) / (impressions A + impressions B)
You can insert this formula in Excel. Then, using the normal distribution table (or the NORMSDIST Excel function), you find the p-value that corresponds to your z. For a two-tailed test, p = 2 * (1 − NORMSDIST(ABS(z))). The p-value is your α (the probability of a Type 1 Error).
If you don’t want to bother with Excel, here’s a nifty online calculator. Your numerators would be your conversions, and your denominators would be your impressions.
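If you prefer code to spreadsheets, the same calculation is a few lines of Python. Here is a minimal sketch (the function name is mine, not a standard library API):

```python
from math import erf, sqrt

def z_test_two_proportions(conv_a, imp_a, conv_b, imp_b):
    """Two-tailed z-test for two population proportions.

    Returns (z, p_value); the p-value is the probability of a
    Type 1 error if you declare the two rates different.
    """
    p1 = conv_a / imp_a                       # rate of treatment A
    p2 = conv_b / imp_b                       # rate of treatment B
    p = (conv_a + conv_b) / (imp_a + imp_b)   # pooled (combined) rate
    se = sqrt(p * (1 - p) * (1 / imp_a + 1 / imp_b))
    z = (p1 - p2) / se
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # standard normal CDF
    return z, 2 * (1 - cdf)                   # two-tailed p-value
```

For example (made-up numbers), 120 conversions on 100,000 impressions vs 90 conversions on 100,000 impressions yields z ≈ 2.07 and p ≈ 0.038, i.e. significant at α = 0.05 but not at α = 0.01.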
More than Two Treatments? There Is a Way.
What to do when you have several treatments, such as A, B, C, and D? For example, you could be testing four creative concepts, or you could be running a Private Marketplace (PMP) buy on 11 publishers. The answer depends on what decision you need to make.
A) Selecting the winner. If you need to have only one winner, such as the best performing creative concept, then you could apply the following technique that my team used:
run z-Test on each pair:
- B vs A
- C vs A
- D vs A
- C vs B
- D vs B
- D vs C
As soon as any pair gets you p(α) < 0.05 (or 0.01), you eliminate the losing treatment. It is clearly not the winner. For example, if “C” is confirmed to be worse than “B”, then you are left with A, B, and D. Now you can split your ad impressions between three treatments instead of four, which means that you will arrive at a larger sample faster. Later, you can compare the remaining treatments and eliminate the next loser, after which you will be left with only two and finally only one.
- Tip: it’s always a good idea to split your traffic evenly between treatments. Not only do you arrive at significance sooner, but you can also redistribute the traffic of a loser among the remaining treatments without creating a seasonality bias.
- Note that p(α) < 0.05 means that you have about a 5% chance of being wrong about each winner. If you run 20 or so absolutely identical treatments and p(α) < 0.05 is good enough for you, there is a big chance that a “winner” will emerge purely by chance.
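One elimination round of the pairwise technique above can be sketched in Python (the helper and function names are mine, not a library API):

```python
from itertools import combinations
from math import erf, sqrt

def two_prop_p(c1, n1, c2, n2):
    """Two-tailed p-value of the z-test for two proportions."""
    p = (c1 + c2) / (n1 + n2)                 # pooled rate
    z = (c1 / n1 - c2 / n2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def eliminate_losers(treatments, alpha=0.05):
    """One round: test every surviving pair and drop any treatment
    that is significantly worse than another.

    `treatments` maps a name to (conversions, impressions).
    Returns the set of names that survive the round.
    """
    survivors = set(treatments)
    for a, b in combinations(treatments, 2):
        if a not in survivors or b not in survivors:
            continue                          # already eliminated
        (ca, na), (cb, nb) = treatments[a], treatments[b]
        if two_prop_p(ca, na, cb, nb) < alpha:
            # significant difference: the lower rate is not the winner
            survivors.discard(a if ca / na < cb / nb else b)
    return survivors
```

After each round, re-split the freed impressions evenly among the survivors, as described in the tip above.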
B) Selecting a loser among more than 2 treatments
Paradoxically, when you need to select one or two losers instead of one winner, your decision tree becomes a bit more complex. For example, your programmatic media buy produces a conversion rate below the goal. Some of the four inventory sources that you use seem to perform worse than average. You want to eliminate the underperformers and thus increase the average conversion rate, but you want to be fairly certain (p < 0.05) that you are making the right decision. In the previous exercise, it was enough to confirm that C is worse than D to make sure that C is not the winner among A, B, C, and D. However, just because C is worse than D does not mean that C is the loser among A, B, C, and D. A and B may end up being confirmed to be worse than C once enough sample is accumulated. Once again, split your traffic evenly between treatments if you can.
Confidence Intervals
Confidence intervals are a great visualization tool. They show the mean and the 95% (or 99%) confidence spread, which makes it very easy to visualize significance with multiple treatments. Plotting confidence intervals for multiple treatments lets you easily see whether any two of them are significantly different. For example, A may be clearly significantly different from C, while D is not yet significantly different from B or E.
Here’s the formula:
p ± z {sqrt[ p * ( 1 – p ) / n ]}
Of course, 99% confidence intervals (z = 2.58) are much wider than 95% confidence intervals (z = 1.96). As you accumulate bigger samples, the confidence intervals shrink. However, looking at the confidence intervals early lets you eyeball performance to see where you may or may not expect a big difference. You also get an approximate idea of when you might be able to optimize.
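The formula above translates into a one-liner worth of Python. A minimal sketch (function name is mine):

```python
from math import sqrt

def proportion_ci(conversions, impressions, z=1.96):
    """Confidence interval for a conversion rate or CTR.

    z = 1.96 gives a 95% interval, z = 2.58 a 99% interval.
    """
    p = conversions / impressions
    margin = z * sqrt(p * (1 - p) / impressions)
    return p - margin, p + margin
```

Compute this for every treatment and plot the intervals side by side: pairs whose intervals do not overlap are significantly different.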
Confidence Intervals of Difference
Confidence intervals of difference are yet another useful tool. Once the confidence interval of the difference no longer crosses the zero line (all possible values within the 95% (or 99%) confidence range are either clearly positive or clearly negative), you have achieved significance. If the interval crosses the zero line, then it is not yet clear whether the difference in population proportions is positive or negative.
Here is the calculation:
(p1-p2) ± z {sqrt [ p1(1-p1)/n1 + p2 (1-p2)/n2]}
As always, your z-scores are:
- 90%: 1.645
- 95%: 1.96
- 98%: 2.33
- 99%: 2.575
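Combining the formula with the z-score table above, a minimal sketch (names are mine):

```python
from math import sqrt

# z-scores for the common confidence levels
Z_SCORES = {90: 1.645, 95: 1.96, 98: 2.33, 99: 2.575}

def diff_ci(c1, n1, c2, n2, confidence=95):
    """Confidence interval for the difference of two rates."""
    p1, p2 = c1 / n1, c2 / n2
    margin = Z_SCORES[confidence] * sqrt(
        p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - margin, (p1 - p2) + margin

def is_significant(c1, n1, c2, n2, confidence=95):
    """True once the interval no longer crosses the zero line."""
    lo, hi = diff_ci(c1, n1, c2, n2, confidence)
    return lo > 0 or hi < 0
```

With made-up numbers, 120/100,000 vs 90/100,000 clears the zero line at 95% confidence but not yet at 99%.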
So, how long should I run my campaign before optimizing?
Uneducated folks have rules of thumb, such as 1 week, 2 weeks or a month. Unlike them, you already know better: you run as long as it takes to get to:
- p(α) < 5% (or 1%)
- confidence intervals that no longer overlap
- a confidence interval of difference that no longer crosses the zero line.
When we ran multivariate tests for a large publisher, we often launched tests at 8 pm. By 10 am the next morning we’d have significant samples. Yet other tests would run for weeks with no significance.
The good news is that if there’s a big difference, it does not take a large sample to confirm it. You can see this by playing around with the above formulas or the online calculator. Here is an illustration. Suppose you are measuring the % of blond people in two US Midwestern cities in neighboring states: Indianapolis, IN and Columbus, OH. Because the difference in blondness between the two populations is fairly small, you might end up adding thousands of randomly selected people to your samples before you arrive at significance. On the other hand, if you measure the % of blonds in Stockholm vs Beijing, a couple of dozen people will be enough to confirm that the difference is significant. The same goes for ad campaign elements with dramatically different performance.
The campaign has been running for a while, and there is no significance yet. What do I do?
I will have an in-depth post on this soon. Meanwhile, a quick note: you might not be seeing a significant difference for two reasons.
1) Conversions are infrequent, and the sample accumulates slowly. For example, we managed a campaign for a home sales company, and the conversion event was an online appointment to see a $350k 4-bedroom home (San Franciscans, do not envy). The entire budget was only $50k/month, and it was split across 15 regions, each with its own budget and creative set. Needless to say, each line item got only 3-5 conversions per month, and even two months into the campaign there was no statistically significant sample to do any optimization within a line item.
2) There simply is no significant difference. For example, your Creative Concept B is identical to Creative Concept A, except for substituting 11-point font with a 12-point font. You may wait until hell freezes over to produce a statistically significant sample. At some point, you will have to kill the test and pronounce that there is no significant difference—or you can just let both concepts run and not worry about killing either one of them.
Dollars? Yen? Tugriks?
The z-Test for two population proportions applies to ratios of counts or events, but it cannot be directly applied to revenue or ROI metrics. One $50.00 sale is not 50 or 5,000 events. It is still one event. If you are trying to use the z-Test to measure a difference in ROAS, CPA, eCPM or average sale size, using dollar amounts as your denominators, you would be wrong. To see why, imagine that instead of dollars you use, say, Indonesian rupiah. Suddenly, your denominators increase 13,182 times (as of the time of this writing), and you appear to have a statistically significant sample.
So what do I do if I optimize to CPA or ROAS, eCPM, RPM or eCPC instead of Conversion Rate or CTR? Two options:
1) My team (with the blessing of two Stats PhDs) applied the following trick. We normalized the number of conversions to the highest conversion value. Suppose you run Private Marketplace (PMP) impressions on two different properties. For the sake of simplicity, let us suppose that the number of impressions per property and the CPMs are the same. Property A produced 40 conversions, while Property B produced 50. Obviously, B has a better conversion rate. However, the folks who came from Property A ended up making bigger purchases, which resulted in higher overall sales for the buy on Property A. Is this a fluke? How can I be confident that Property A will keep producing more overall sales going forward?
The trick is to normalize the number of conversions to the average sale per conversion in each sample. Then you can apply the same method. To be strict, you want to normalize to the higher sales/conversion. I.e., you would count two $50 conversions as one $100 conversion, but not vice versa. That way your sample appears smaller, which is the conservative choice.
Example:
(same impressions, same CPM, but different conversions and a different average sale):
A: 200,000 impressions, 40 conversions, $5,000 in sales ($125/sale)
B: 200,000 impressions, 50 conversions, $3,000 in sales ($60/sale)
Your new (adjusted) number of conversions for B becomes: 50 × 60 / 125 = 24
Now you have:
A: 200,000 impressions, 40 conversions, $5,000 in sales ($125/sale)
B: 200,000 impressions, 24 conversions, $3,000 in sales ($125/sale)
z = 2.0002
p-value = 0.0455
Yes, the difference is significant at p < 0.05.
If you want to be really strict, then you’d calculate the variance and confidence intervals for both the Conversion Rate and the average amount of sales for both A and B and then combine the two by taking the best and the worst 95% confidence scenarios for both the conversion rate and sale per conversion. After all, you have variation not only in the conversion rate but in the average amount of sales. However, for practical purposes the above trick works just fine. In addition to being blessed by two Stats PhDs, this trick has been validated in practice by my team in thousands of tests.
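The normalization trick can be sketched in a few lines of Python (the function name is mine); with the numbers from the worked example above, it reproduces z ≈ 2.0002 and p ≈ 0.0455:

```python
from math import erf, sqrt

def normalized_z_test(conv_a, imp_a, sales_a, conv_b, imp_b, sales_b):
    """Scale both conversion counts to the higher average sale,
    then run the two-tailed z-test on the adjusted counts."""
    avg_a, avg_b = sales_a / conv_a, sales_b / conv_b
    high = max(avg_a, avg_b)
    # each count is scaled by (its avg sale / highest avg sale);
    # the higher-value side stays unchanged
    adj_a = conv_a * avg_a / high
    adj_b = conv_b * avg_b / high
    p = (adj_a + adj_b) / (imp_a + imp_b)     # pooled adjusted rate
    z = (adj_a / imp_a - adj_b / imp_b) / sqrt(
        p * (1 - p) * (1 / imp_a + 1 / imp_b))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

Calling `normalized_z_test(40, 200000, 5000, 50, 200000, 3000)` performs exactly the A-vs-B comparison shown above.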
Revenue from Multiple Sources? – t-Test for Mean to the Rescue!
What if you have multiple revenue sources and you’d like to measure their combined performance, such as total yield? Use the good old t-Test for Mean. Break down your test results into individual measurements. As a publisher, we applied this successfully when comparing the revenue performance of two radically different site design templates. Each template would produce ad revenue of multiple types: standard unit programmatic, standard unit direct / guaranteed, native programmatic, and subscription—each with its own variance. Changes in page templates would increase one or two of the revenue streams, but would decrease the other two. We needed a template that would maximize revenue across all four. The t-Test was the only way to go.
For example, for each of Treatment A and Treatment B, we would measure:
- Measurement 1: RPM yield of all even hours of day 1
- Measurement 2: RPM yield of all odd hours of day 1
- Measurement 3: RPM yield of all even hours of day 2
- Measurement 4: RPM yield of all odd hours of day 2
- Etc.
The “t” distribution is to be used when the number of measurements is < 30. With 30 or more measurements, the z (bell-shaped/Gaussian) distribution applies.
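A minimal stdlib-only sketch of the per-period comparison, using Welch's unequal-variance t statistic (the function name and the example RPM numbers are mine, purely illustrative):

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two lists of
    per-period yield measurements (e.g. odd/even-hour RPM buckets)."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2     # sample variances
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# hypothetical RPM yield per odd/even-hour bucket for templates A and B
rpm_a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.2]
rpm_b = [4.1, 4.3, 3.9, 4.2, 4.0, 4.4, 4.1, 4.2]
t, df = welch_t(rpm_a, rpm_b)
# compare |t| to the critical value from a t table for the computed df
```

With fewer than 30 measurements per side, look up the two-tailed critical value in a t table for the computed degrees of freedom; with 30 or more, the z thresholds listed earlier apply.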
Can machines be trusted to optimize?
Yes. I trust a machine a lot more than I trust you. Auto-optimization / machine learning algorithms do take statistical significance into account, and they are vetted by statisticians. Some use more complex and efficient math, such as Bayesian statistics. The trouble is, they are not always designed to optimize to the events that you need. Also, most systems are not designed to optimize to multiple outcomes, such as maximizing your subscription and ad revenue while keeping your bounce rate under control, all at the same time.
Well, enough for now. If you made it this far without falling asleep, then you must be a kindred ad tech geek spirit. We have much to discuss, my friend. In the next posts on the topic I will go deep into Multivariate Testing and introduce the concept of Design of Experiments (DOE), something I learned in my Lean Six Sigma Black Belt class and successfully applied in programmatic advertising. So, stay tuned.
Meanwhile, do connect on LinkedIn and Twitter.
Also, please leave your comments, good and bad.