
On Back-Tests (Part 2)

Qualities that distinguish good from bad back-tests

Samuel Lee 12 December, 2013 | 17:33

This article first appeared in the Morningstar ETFInvestor – September 2013.

In Part 1 of this article, we looked at how different audiences approach back-tests. Now we turn to the qualities that distinguish good back-tests from bad ones.

Don’t Trust, and Verify
To my knowledge, there is only one truly comprehensive study of whether back-tested equity strategies end up being genuinely predictive. Two respected researchers, R. David McLean and Jeffrey Pontiff, slaved away on a monumental working paper titled “Does Academic Research Destroy Stock Return Predictability?”1. They independently replicated and tested 82 equity characteristics that published academic studies claimed could predict excess returns. Like the Vanguard study, they looked at the excess returns the characteristic strategies produced in back-tests (“in sample”) and live (“out of sample”). If the characteristics were predictive only in sample, there are two possible explanations: the market efficiently arbitraged away the anomaly, or the observed pattern was the product of data-snooping. To distinguish the two effects, McLean and Pontiff cleverly split the out-of-sample period into two: pre- and post-publication. Because it can take years before a working paper is published, there is a stretch in which a characteristic is already out of sample but known only to a small group of academics. If a characteristic’s predictive power decayed completely during this working-paper phase, data-snooping is the likely culprit. If its power decays only after publication, then it is likely the market at work, arbitraging the anomaly away.
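To make the three-window split concrete, here is a minimal sketch, in Python, of how one might compare the periods for a single characteristic. It is not the authors’ code; the series name, dates, and decay formulas below are hypothetical, assuming only that one has a time series of monthly long-short returns for the characteristic.

```python
# A minimal sketch of the three-window comparison, assuming a hypothetical
# pandas Series `long_short` of monthly excess returns for one characteristic,
# indexed by month-end dates. The dates used below are placeholders.
import pandas as pd

def window_means(long_short: pd.Series, sample_end: str, pub_date: str) -> pd.Series:
    """Mean return in-sample, post-sample/pre-publication, and post-publication."""
    return pd.Series({
        "in_sample": long_short.loc[:sample_end].mean(),
        "pre_publication": long_short.loc[sample_end:pub_date].mean(),
        "post_publication": long_short.loc[pub_date:].mean(),
    })

# Decay between the first two windows points to statistical bias (data-snooping);
# further decay after publication points to the market arbitraging the effect away.
# means = window_means(long_short, sample_end="1999-12", pub_date="2003-06")
# bias_decay = 1 - means["pre_publication"] / means["in_sample"]
# post_pub_decay = (means["pre_publication"] - means["post_publication"]) / means["in_sample"]
```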

Interestingly, they could replicate only 72 of the 82 results. Of those, they found that the average out-of-sample decay due to statistical bias was 10% and the average post-publication decay was an additional 25%, for a total decay of approximately 35% from back-test to live performance. We can’t take these results at face value: their sample may over-represent the most cited and memorable studies, introducing survivorship bias.

By the standards of social science, that’s suspiciously impressive. At least two large-scale attempts to replicate promising biomedical studies found that the majority could not be reproduced; I had expected finance studies to fare even worse, given how easy back-testing is.

Despite these possible issues, the study suggests that back-tested equity strategies that have passed the academic publishing gauntlet are of higher quality than those produced by less rigorous and more conflicted parties (like, say, index and ETF purveyors). Though I’m still skeptical of much of the academic literature, I do believe academics have been able to identify market regularities in advance. For example, the “big three” factors (size, value, and momentum) were all discovered and established by the mid-1990s, and all three went on to earn excess returns over the subsequent two decades. The following are the qualities that distinguish good from bad back-tests, roughly from most to least important.

  1. Strong economic intuition. Can you tell a strong, evidence-based story beforehand that would justify the proposed effect?
  2. An intellectually honest source. Are the parties behind the back-test credible? Do they have any motivation to data-snoop or lie?
  3. Simple and transparent methodology. Complex models often underperform simple, robust ones in out-of-sample tests.
  4. Sample size. Academics usually expect at least several decades of data when assessing back-tested equity strategies; the highest-quality back-tests are run on big, high-quality data sets.
  5. Finally, effect size and statistical significance. Many analysts look for high returns and high statistical significance when deciding whether to accept a proposition. While statistical and economic significance are necessary, they are by themselves often weak predictors of a study’s validity: anyone can produce statistically significant results by data-snooping or even outright fabrication, as the simulation sketched after this list illustrates.
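To illustrate the data-snooping point, the simulation below (purely synthetic data, not drawn from the study) generates many random “strategies” with a true expected return of zero and reports the best t-statistic found. With enough tries, the winner almost always clears the conventional significance bar of roughly 2.

```python
# Purely illustrative simulation: snooping across many random strategies.
# Every "strategy" here is noise with a true mean return of zero.
import numpy as np

rng = np.random.default_rng(seed=0)
n_months, n_strategies = 360, 200      # 30 years of monthly data, 200 snooped variants
returns = rng.normal(0.0, 0.04, size=(n_strategies, n_months))

# t-statistic of the mean monthly return for each strategy.
t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_months))

print(f"Best t-stat among {n_strategies} random strategies: {t_stats.max():.2f}")
# Typically prints a value well above 2, i.e. "significant" at conventional
# levels, even though no strategy has any real predictive power.
```

This is why the items above matter more than the headline t-statistic: significance earned after many undisclosed tries is cheap.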

And then you’re still not done. You want several high-quality studies from skeptical, independent researchers that broadly find similar results before you conclude something is likely “true.” These are high hurdles, yes, but necessary if you want decent odds of striking nuggets of truth rather than fool’s gold.

1 R. David McLean and Jeffrey Pontiff. “Does Academic Research Destroy Stock Return Predictability?” Working paper, 2013.


About Author

Samuel Lee is an ETF strategist with Morningstar and editor of Morningstar ETFInvestor.
