A Robust Approach to Multi-Factor Regression Analysis

24 February 2021

Practitioners widely use asset pricing models such as CAPM or the Fama-French models to identify relationships between their portfolios and common factors. Moreover, each asset class has its own widely recognized asset pricing model, from equities through commodities to even cryptocurrencies.

However, which model can we use if our portfolio is complex and spans many asset classes? Which factors should we include and which should we omit, especially if we have a database of several hundred potential factors? Additionally, we know that equities influence bonds, commodities influence equities, and vice versa. What, then, about the cross-asset relationships?

These are the problems and questions we faced when looking for a methodology for our Multi-Factor Analysis report in the Quantpedia Pro platform. This blog post aims to introduce the model, its logic and the method we have decided to use.

Imagine that the investable universe consists of just three stocks, and we know their returns on ten consecutive days. If somebody holds a portfolio of these three stocks with fixed weights and tells us the daily returns of that portfolio, can we find the weights? Yes, we can, by solving a system of linear equations: the weights do not change, and the number of equations (ten) exceeds the number of unknowns (three).


The graph above is a visual interpretation of such a problem. Portfolios 1-3 are individual stocks, and Portfolio 4 is a composite portfolio with unknown weights, which can be found by solving a linear system. We know one equity curve and the investment universe, and since we have enough data, we can determine the exact weights.
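The three-stock example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the returns are synthetic, the "unknown" weights (`true_weights`) are made up for the demonstration, and NumPy's least-squares solver stands in for "solving a system of linear equations".

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten daily returns for three stocks (synthetic, for illustration only).
stock_returns = rng.normal(0.0, 0.01, size=(10, 3))

# Hypothetical weights that we pretend not to know.
true_weights = np.array([0.5, 0.3, 0.2])
portfolio_returns = stock_returns @ true_weights

# Ten equations, three unknowns: solve the overdetermined system
# by least squares. With noise-free data the fit is exact.
estimated_weights, *_ = np.linalg.lstsq(
    stock_returns, portfolio_returns, rcond=None
)
print(np.round(estimated_weights, 4))
```

Because the portfolio returns here are an exact linear combination of the stock returns, the recovered weights match the hidden ones up to floating-point error.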

However, this beautiful theory falls apart in the “real world”. Suppose the investment universe is vast and consists of stocks, bonds, ETFs, commodities, currencies, etc., and the portfolio consists of numerous complex strategies. In this case, it is impossible to find the exact weights for each instrument in the portfolio. Thankfully, investors and traders already know the composition of their portfolios. They are more interested in the interactions and dependencies with other assets or strategies. A well-known example is any asset-pricing model such as CAPM, which describes how our portfolio is affected by the market factor. Similarly, we can also look for relationships between our portfolio and other factors, assets or strategies.

Let us return to the above-mentioned complex portfolio, which consists of numerous assets traded according to systematic rules. For such a portfolio, traditional asset pricing models are not sufficient.

We can characterize the complex portfolio by finding relationships with other assets or systematic trading rules. Such a characterization can be the key to an informed decision, since we can understand how the different assets and strategies are related to our strategy. Subsequently, we can better diversify our strategy or even find new opportunities if we identify a profitable and uncorrelated strategy. Therefore, compared to the simplified example of the three stocks, we are interested not only in what our portfolio is composed of, but also in what its composition could or should be. However, this requires a dataset that consists of numerous assets and strategies, and a correct approach to finding the dependencies.

At Quantpedia Pro, we have a dataset consisting of numerous typical strategies based on quantitative rules. Together with numerous indices, this makes it possible to identify the dependencies and relationships between our dataset and any given portfolio. The only question left is how to find these relationships.

If we have an equity curve for a given portfolio and a large number of possible variables, it is easy to overfit any model that tries to find the dependencies. A good example is the traditional linear regression model. Using all the variables would yield an overfitted model, in which the interpretation of the parameters would hardly make any sense.
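A small experiment makes the overfitting risk concrete. In the sketch below, both the "portfolio" returns and the candidate factors are independent random noise (an assumption made purely for illustration), so there is nothing to explain. The in-sample R squared nevertheless grows as more factors are added.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_factors = 60, 50

# Portfolio returns and candidate factors are independent noise
# by construction, so any fit is spurious.
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_factors))

def r_squared(X_sub, y):
    """In-sample R^2 of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# R^2 can only go up when factors are added to a nested OLS model.
for k in (1, 10, 25, 50):
    print(k, round(r_squared(X[:, :k], y), 3))
```

With 50 noise factors and only 60 observations, most of the variation is "explained" even though no relationship exists, which is why the parameters of such a model cannot be interpreted.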

Still, linear regression has its place in econometrics and could be used for our task, but there are many caveats. Intuitively, we cannot use all the variables if we want to avoid overfitting, but which variables should we omit? Naturally, we would like to omit every variable that cannot explain a significant part of the equity curve’s variation. The word “significant” tempts us to use, for example, t-statistics. The algorithm could be as follows: estimate the parameters and check the t-statistic of each one. If a parameter is statistically significant, keep the corresponding variable; if not, drop it, and continue until only statistically significant explanatory variables are left. This process is a textbook example of stepwise regression with backward elimination, where we start with all the variables and, in each step, reduce their number based on predefined criteria.

While this process seems logical, there is still one problem. Both the t-test and the F-test, which are commonly used in econometric modelling, rely on assumptions. A common violation is that the error term is not normally distributed, yet the tests are based on normality. Additionally, there can be heteroskedasticity and autocorrelation. Without going further into statistics and econometrics (by considering various methods to estimate the covariance matrix of errors, or the central limit theorem), the key takeaway is that if the model assumptions are not met, the test statistics are difficult to interpret.
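For reference, backward elimination by t-statistics could look roughly like the following sketch. It is not the method used in the report; it only illustrates the procedure described above. The function name `backward_elimination`, the cut-off `t_threshold=2.0`, the choice to drop one factor per step, and the synthetic data are all assumptions. The standard errors are the classical OLS ones, which are exactly the quantities that become unreliable under heteroskedasticity or autocorrelation.

```python
import numpy as np

def backward_elimination(X, y, names, t_threshold=2.0):
    """Drop the least significant factor until all |t| reach the threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        dof = len(y) - A.shape[1]
        sigma2 = resid @ resid / dof
        # Classical OLS covariance: assumes homoskedastic, uncorrelated errors.
        cov = sigma2 * np.linalg.inv(A.T @ A)
        t_stats = beta / np.sqrt(np.diag(cov))
        t_factors = np.abs(t_stats[1:])  # ignore the intercept
        worst = int(np.argmin(t_factors))
        if t_factors[worst] >= t_threshold:
            break  # every remaining factor passes the cut-off
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic data: only f0 and f1 truly drive the portfolio returns.
rng = np.random.default_rng(2)
n_obs = 200
X = rng.normal(size=(n_obs, 6))
names = [f"f{i}" for i in range(6)]
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_obs)

kept = backward_elimination(X, y, names)
print(kept)
```

On this well-behaved synthetic data the two true factors survive, but a pure-noise factor can also pass the cut-off by chance, and with real return series the t-statistics themselves may not be trustworthy.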

Quantpedia’s approach

During the development of our Multi-Factor Analysis model, we found that the assumptions of traditional statistical tests are frequently violated, which is why we do not build our model around conventional statistical tests. Another possibility is to use non-parametric statistics and a method like the Theil-Sen estimator, but the multivariate version is very complex, given its dependence on the spatial median.

Linear regressions are often characterized by goodness of fit (R squared or adjusted R squared) or by various information criteria. All these measures are commonly used in practice, and although they are not true statistical “tests”, they can be used efficiently to compare models. Our model is based on Akaike’s Information Criterion (AIC), which is widely used in model selection. The AIC estimates the “quality” of each model, but only relative to other models fitted to the same dependent variable. Additionally, the AIC takes the number of parameters into account. The number of parameters (factors related to the given strategy) should not be too high, so that the model stays meaningful yet as simple as possible, with straightforward interpretations.

We employ the AIC for model selection using stepwise regression with forward selection.
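As a rough illustration of how the AIC trades fit against complexity, the sketch below uses the common Gaussian least-squares form, AIC = n ln(RSS / n) + 2k, where RSS is the residual sum of squares and k the number of fitted coefficients. This form holds up to an additive constant that is identical for all models fitted to the same data, so only differences between AIC values matter. The helper `ols_aic`, the inclusion of an intercept, and the synthetic series are assumptions for the example, not a description of the production model.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian OLS fit with intercept, up to an additive constant.

    AIC = n * ln(RSS / n) + 2 * k, with k the number of fitted coefficients.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    k = A.shape[1]
    return n * np.log(rss / n) + 2 * k

# Synthetic example: y depends on `market`, not on `noise_factor`.
rng = np.random.default_rng(3)
n = 250
market = rng.normal(size=n)
noise_factor = rng.normal(size=n)
y = 0.9 * market + rng.normal(scale=0.5, size=n)

aic_market = ols_aic(market[:, None], y)
aic_noise = ols_aic(noise_factor[:, None], y)
print(aic_market < aic_noise)  # lower AIC means the preferred model
```

Both candidate models have the same number of parameters here, so the comparison is decided by fit alone. The penalty term 2k starts to matter once models of different sizes are compared.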

Suppose we have the equity curve of some strategy (the dependent variable). We start with no variables in the model, but we have a pre-given set of n candidate “factors”, such as other strategies or indices. In the first step, we build one model per factor, each using only that single factor, which gives n models. Next, we compute the AIC of each model and select the best one. In the next step, we try to add another factor from the remaining n minus one factors: the algorithm builds n minus one two-factor models, computes the AIC of each and again picks the best. This process of adding one factor at a time continues until the AIC no longer improves. When the AIC stops improving, the gain in goodness of fit from any additional factor would not outweigh the added complexity of the model.

To sum up, Quantpedia’s approach to Multi-Factor Analysis is entirely automated, with an intuitive model selection that does not rely on assumptions which would need to be checked and are most likely violated. Additionally, the model aims to be as simple as possible to ensure the interpretability of the results.
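The forward selection loop described above can be sketched as follows. This is a minimal illustration, not Quantpedia's actual implementation: the function names `aic_for` and `forward_select`, the Gaussian least-squares form of the AIC, the intercept, and the use of an intercept-only model as the starting benchmark are all assumptions, and the data are synthetic.

```python
import numpy as np

def aic_for(factors, y, cols):
    """AIC (up to a constant) of an OLS fit of y on an intercept and `cols`."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [factors[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def forward_select(factors, y):
    """Add one factor at a time while the AIC keeps decreasing."""
    selected, remaining = [], list(range(factors.shape[1]))
    best_aic = aic_for(factors, y, selected)  # assumed intercept-only start
    while remaining:
        # One candidate model per factor that is not yet in the model.
        scores = {j: aic_for(factors, y, selected + [j]) for j in remaining}
        best_j = min(scores, key=scores.get)
        if scores[best_j] >= best_aic:
            break  # no candidate lowers the AIC, so stop
        best_aic = scores[best_j]
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_aic

# Synthetic data: eight candidate factors, only columns 2 and 5 matter.
rng = np.random.default_rng(4)
n_obs = 300
factors = rng.normal(size=(n_obs, 8))
y = 0.7 * factors[:, 2] - 0.4 * factors[:, 5] + rng.normal(scale=0.3, size=n_obs)

selected, best_aic = forward_select(factors, y)
print(selected)
```

The strongest factor enters first and the second true factor follows. A noise factor may occasionally be admitted as well, because the AIC penalty is mild, which is one reason the selected model should still be inspected before drawing conclusions.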

Author:
Matus Padysak, Senior Quant Analyst, Quantpedia
