Quantopian’s Academic Paper About In vs. Out-of-Sample Performance of Trading Algorithms

4 May 2016

A really good academic paper from the team behind Quantopian:

Authors: Wiecki, Campbell, Lent, Stauth

Title: All that Glitters Is Not Gold: Comparing Backtest and Out-of-Sample Performance on a Large Cohort of Trading Algorithms

Link: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2745220

Abstract:

When automated trading strategies are developed and evaluated using backtests on historical pricing data, there exists a tendency to overfit to the past. Using a unique dataset of 888 algorithmic trading strategies developed and backtested on the Quantopian platform with at least 6 months of out-of-sample performance, we study the prevalence and impact of backtest overfitting. Specifically, we find that commonly reported backtest evaluation metrics like the Sharpe ratio offer little value in predicting out of sample performance (R² < 0.025). In contrast, higher order moments, like volatility and maximum drawdown, as well as portfolio construction features, like hedging, show significant predictive value of relevance to quantitative finance practitioners. Moreover, in line with prior theoretical considerations, we find empirical evidence of overfitting – the more backtesting a quant has done for a strategy, the larger the discrepancy between backtest and out-of-sample performance. Finally, we show that by training non-linear machine learning classifiers on a variety of features that describe backtest behavior, out-of-sample performance can be predicted at a much higher accuracy (R² = 0.17) on hold-out data compared to using linear, univariate features. A portfolio constructed on predictions on hold-out data performed significantly better out-of-sample than one constructed from algorithms with the highest backtest Sharpe ratios.
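To make the R² figures in the abstract more concrete, here is a minimal Python sketch of how an in-sample (IS) versus out-of-sample (OOS) Sharpe ratio comparison could be set up. It is not the paper's code. The return series are synthetic, and the 252-day annualization factor, the zero risk-free rate, the window lengths and the helper names (sharpe_ratio, r_squared, simulate) are all assumptions made for illustration.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # assumed annualization factor for daily returns


def sharpe_ratio(returns, periods_per_year=TRADING_DAYS):
    """Annualized Sharpe ratio of a return series (risk-free rate assumed zero)."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)  # sample standard deviation
    return math.sqrt(periods_per_year) * mean / std


def r_squared(x, y):
    """R^2 of a univariate linear fit of y on x (squared Pearson correlation)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Synthetic illustration only: these are NOT the paper's 888 algorithms.
rng = random.Random(0)


def simulate(n_days, drift, vol=0.01):
    """Hypothetical daily returns: Gaussian noise around a constant drift."""
    return [rng.gauss(drift, vol) for _ in range(n_days)]


is_sharpes, oos_sharpes = [], []
for _ in range(200):
    drift = rng.gauss(0.0002, 0.0003)  # hypothetical "true" daily edge
    is_sharpes.append(sharpe_ratio(simulate(504, drift)))   # ~2 years in-sample
    oos_sharpes.append(sharpe_ratio(simulate(126, drift)))  # ~6 months out-of-sample

print(f"R^2 of OOS Sharpe on IS Sharpe: {r_squared(is_sharpes, oos_sharpes):.3f}")
```

With real data, the two lists would hold each algorithm's backtest Sharpe ratio and its Sharpe ratio over the subsequent live period. The number printed here depends entirely on the simulation settings and says nothing about the paper's findings.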

Notable quotations from the academic research paper:

"For the first time, to the best of our knowledge, we present empirical data that can be used to validate theoretical and anecdotal claims about the ubiquity of backtest overfitting and its impact on algorithm selection. This was possible by having access to a unique data set of 888 trading algorithms developed and tested by quants on the Quantopian platform. Analysis revealed several results relevant to the quantitative finance community at large – practitioners and academics alike.

Most strikingly, we find very weak correlations between IS and OOS performance in most common finance metrics including Sharpe ratio, information ratio, alpha. This result provides strong empirical support for the simulations carried out by Bailey et al. [2014]. More specifically, it supports the assumptions underlying their simulations without compensatory market forces to be present which would induce a negative correlation between IS and OOS Sharpe ratio. It is also interesting to compare different performance metrics in their predictability of OOS performance. Highest predictability was achieved by using the Sharpe ratio computed over the last IS year. This feature was also picked up by the random forest classifier as the most predictive feature.

Additionally, we find significant evidence that the more backtests a user ran, the bigger the difference between IS and OOS performance – a direct indication of the detrimental effect of backtest overfitting. This observed relationship is also consistent with Bailey et al.'s [2014] prediction that increased backtesting of multiple strategy variations (parameter tuning) would increase overfitting. Thus, our results further support the notion that backtest overfitting is common and wide-spread. The observed significant positive relationship between amount of backtesting and Sharpe shortfall (IS Sharpe – OOS Sharpe) provides support for a Sharpe ratio penalized by the amount of backtesting (e.g. the "deflated Sharpe ratio" by Bailey & Lopez de Prado [2014]). An attempt to calibrate such a backtesting penalty based on observed data is a promising direction for future research.

Together, these sobering results suggest that a reported Sharpe ratio (or related measure) based on backtest results alone can not be expected to prevail in future market environments with any reasonable confidence.

While the results described above are relevant by themselves, overall, predictability of OOS performance was low (R² < 0.25) suggesting that it is simply not possible to forecast profitability of a trading strategy based on its backtest data. However, we show that machine learning together with careful feature engineering can predict OOS performance far better than any of the individual measures alone. Using these predictions to construct a portfolio of strategies resulted in competitive cumulative OOS returns with a Sharpe ratio of 1.2 that is better than most portfolios constructed by randomly selecting strategies. While it is difficult to extract an intuition about how the Random Forest is deriving predictions, we have provided some indication of which features it deems important. It is interesting to note that among the most important features are those that quantify higher-order moments including skew and tail-behavior of returns (tail-ratio and kurtosis). Together, these results suggest that predictive information can indeed be extracted from a backtest, just not in a linear and univariate way. It is important to note that we cannot yet claim that this specific selection mechanism will work well on future data as the machine learning algorithm might learn to predict which strategy type worked well over the specific OOS time-period most of our algorithms were tested on (for a more detailed discussion of this point, see the limitations section). However, if these results are reproducible on an independent data set or the strategies identified continue to outperform the broad cohort over a much longer time frame, it should be of high relevance to quantitative finance professionals who now have a more accurate and automatic tool to evaluate the merit of a trading algorithm. As such, we believe our work highlights the potential of a data scientific approach to quantitative portfolio construction as an alternative to discretionary capital allocation."
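The Sharpe shortfall mentioned in the quotation (IS Sharpe minus OOS Sharpe) and its relationship to the amount of backtesting can be explored with a simple regression. The Python sketch below is only an illustration under stated assumptions: the records are synthetic, the overfitting effect is built into the simulation by hand, and regressing on the base-10 logarithm of the backtest count is a choice made here, not something taken from the paper. The helper names (sharpe_shortfall, ols_slope_intercept) are hypothetical.

```python
import math
import random
import statistics


def sharpe_shortfall(is_sharpe, oos_sharpe):
    """Sharpe shortfall as defined in the quotation: IS Sharpe minus OOS Sharpe."""
    return is_sharpe - oos_sharpe


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic illustration only: the overfitting effect below is an assumption
# baked into the simulation, not an estimate from the paper.
rng = random.Random(1)
records = []
for _ in range(300):
    n_backtests = rng.randint(1, 1000)
    true_sharpe = rng.gauss(0.5, 0.5)
    inflation = 0.3 * math.log10(n_backtests)  # hypothetical IS inflation
    is_sharpe = true_sharpe + inflation + rng.gauss(0, 0.3)
    oos_sharpe = true_sharpe + rng.gauss(0, 0.6)
    records.append((n_backtests, is_sharpe, oos_sharpe))

log_backtests = [math.log10(n) for n, _, _ in records]
shortfalls = [sharpe_shortfall(i, o) for _, i, o in records]
slope, intercept = ols_slope_intercept(log_backtests, shortfalls)
print(f"shortfall ~ {slope:.2f} * log10(backtests) + {intercept:.2f}")
```

A positive slope on real data would be in line with the relationship the authors report. Calibrating an actual penalty, such as the deflated Sharpe ratio they cite, requires more than this univariate fit.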
