Data are the key to systematic investing and trading strategies. Hypothesis testing, risk and return evaluation, correlation estimates, and factor loadings all rely on past data and backtests. As the pace of publication in finance has increased, critiques of quantitative strategies have emerged. Alphas seem to decay, post-publication returns tend to be lower, and many strategies turn out to be insignificant once rigorously tested, whether in- or out-of-sample. Moreover, some strategies may appear profitable purely by chance, a consequence of the repeated examination of the same dataset, such as CRSP stocks after 1963.
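A minimal simulation makes the multiple-testing point concrete. Under the illustrative assumption that we evaluate 1,000 candidate strategies with no true alpha on the same 50-year sample, a conventional significance cutoff will still flag roughly 5% of them as "profitable" purely by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 1,000 candidate strategies, each with 50 years
# of monthly returns drawn from a zero-mean distribution (no true alpha).
n_strategies, n_months = 1_000, 50 * 12
returns = rng.normal(loc=0.0, scale=0.04, size=(n_strategies, n_months))

# t-statistic of the mean monthly return for each strategy
t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_months))

# With a conventional |t| > 1.96 cutoff, ~5% look "significant" by chance alone
n_false = int(np.sum(np.abs(t_stats) > 1.96))
print(f"{n_false} of {n_strategies} zero-alpha strategies appear significant")
```

Repeated mining of one fixed dataset behaves just like this loop: the more hypotheses tested on CRSP post-1963 data, the more spurious "discoveries" we should expect.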
Is there a way to overcome these limitations? Partially: the design of novel machine-learning strategies, with separate training, validation, and testing sets, might help. Perhaps the most crucial part of such a scheme is the use of a purely out-of-sample dataset. In this regard, novel research by Baltussen et al. (2021) provides several valuable findings for the most recognized factors. The authors constructed a database of U.S. stocks, including dividends and market caps, covering 1,488 major stocks from 1866 to 1926. This sample spans the pre-CRSP period and offers independent, pre-publication, "out-of-sample" data that can serve as a perfect test for the factors used today.
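As a sketch of the train/validation/test scheme mentioned above, the following splits a return series chronologically and reserves a final segment that is never touched during model development, much as the pre-1926 sample plays the role of untouched data for today's factors. The function name, split fractions, and simulated series are hypothetical choices for illustration, not part of the original study:

```python
import numpy as np

def chronological_split(returns: np.ndarray, train_frac: float = 0.6,
                        val_frac: float = 0.2):
    """Split a return series chronologically into train/validation/test.

    The final segment is held out entirely, mimicking a purely
    out-of-sample dataset for the final evaluation.
    """
    n = len(returns)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return returns[:train_end], returns[train_end:val_end], returns[val_end:]

# Hypothetical usage with a simulated monthly return series
rng = np.random.default_rng(1)
series = rng.normal(0.005, 0.04, 720)  # 60 years of monthly observations
train, val, test = chronological_split(series)
print(len(train), len(val), len(test))  # 432 144 144
```

The key design choice is that the split is chronological rather than random: shuffling would leak future information into the training set and defeat the purpose of an out-of-sample test.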