Natural language processing (NLP) is the ability of a program to understand human language. Studies suggest there is a connection between investors' vocabulary and the profitability of their strategies. This research analyzes lexical metrics in 10-K & 10-Q reports. All publicly traded companies have to file 10-K & 10-Q reports periodically. These reports contain relevant information about financial performance. Nowadays, there is a gradual shift from numerical to text-based information, making the reports harder to analyze. Still, the 10-K & 10-Q reports rightfully receive great interest from academics, investors and analysts.
BRAIN is one of the companies that analyze 10-K & 10-Q reports using NLP. The main objective of the Brain Language Metrics on Company Filings (BLMCF) dataset is to monitor numerous language metrics on the 10-K and 10-Q reports of more than 6,000 US stocks. This paper focuses on the lexical metrics of the BLMCF dataset, specifically lexical richness, lexical density, and specific density.
The combination of the high and increasing volume of published 10-K & 10-Q reports and their gradual shift to nonnumerical information leads to the premise that fundamental analysts cannot pick out the crucial information about the company's current and future performance from the “white noise”. Companies like BRAIN, which analyze the 10-K & 10-Q reports using NLP and assign scores according to numerous language metrics, bridge the gap between nonnumerical and numerical data. The research suggests that the richer the vocabulary of an investor is, the higher the lexical score the company gets and the better it performs.
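To make the three lexical metrics more concrete, here is a minimal sketch of how such scores could be computed on a piece of text. It is not BRAIN's methodology, which is not described here. The sketch assumes common textbook definitions: lexical richness as the share of unique words, lexical density as the share of non-function (information-carrying) words, and specific density as the share of finance-related words. The word lists `FUNCTION_WORDS` and `FINANCE_WORDS` and the function `lexical_metrics` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, deliberately tiny word lists used only for illustration.
# A real system would rely on a full stop-word list and a financial dictionary.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "and", "or", "to", "in", "on", "for", "is", "are",
    "was", "were", "by", "with", "that", "this", "it", "its", "as", "at", "from",
}
FINANCE_WORDS = {
    "revenue", "income", "debt", "equity", "cash", "margin",
    "liabilities", "assets", "earnings", "dividend",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_metrics(text):
    """Return toy versions of the three lexical metrics for one text."""
    tokens = tokenize(text)
    if not tokens:
        return {"lexical_richness": 0.0, "lexical_density": 0.0, "specific_density": 0.0}
    total = len(tokens)
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    finance_words = [t for t in tokens if t in FINANCE_WORDS]
    return {
        # Assumed definition: unique words divided by all words (type-token ratio).
        "lexical_richness": len(set(tokens)) / total,
        # Assumed definition: information-carrying words divided by all words.
        "lexical_density": len(content_words) / total,
        # Assumed definition: finance-related words divided by all words.
        "specific_density": len(finance_words) / total,
    }


sample = "Revenue increased and net income improved as the company reduced its debt."
metrics = lexical_metrics(sample)
print(metrics)
```

In this toy sentence, 8 of the 12 words are not function words and 3 are on the finance list, so the sketch reports a lexical density of about 0.67 and a specific density of 0.25.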
Backtest period from source paper
Confidence in anomaly's validity
Notes to Confidence in Anomaly's Validity
Notes to Indicative Performance: Table on page 6, Compounding Annual Return
Period of Rebalancing
Notes to Period of Rebalancing
Notes to Estimated Volatility: Table on page 6, Annual Standard Deviation
Number of Traded Instruments
Notes to Number of Traded Instruments: Top 500 US stocks by dollar volume
Notes to Maximum drawdown: Table on page 6, Drawdown
Notes to Complexity Evaluation
Simple trading strategy
The investment universe consists of the top 500 US stocks by dollar volume. The stocks are sorted by their lexical density and specific density scores from the BLMCF dataset. Lexical density measures the structure and complexity of human communication in a text. A high lexical density indicates a large share of information-carrying words. Specific density measures how dense the report's language is from a financial point of view, in other words, how many finance-related words are used in the text. The investor goes long the top decile and short the bottom decile. The portfolio is rebalanced monthly.
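The sorting and decile selection described above can be sketched as follows. The text does not say how the two scores are combined or how positions are weighted, so this sketch makes two loudly labeled assumptions: the scores are combined by averaging each stock's cross-sectional percentile rank, and positions are equal-weighted with half of the gross exposure on each side. The function names and the sample scores are hypothetical.

```python
def percentile_ranks(values):
    """Map each ticker to a rank in (0, 1], smallest value first.

    Ties are broken by sort order rather than averaged.
    """
    ordered = sorted(values, key=values.get)
    return {ticker: (i + 1) / len(ordered) for i, ticker in enumerate(ordered)}


def decile_long_short(lexical_density, specific_density):
    """Return target weights: long the top decile, short the bottom decile.

    Both arguments are dicts mapping the same tickers to their scores.
    """
    ld_rank = percentile_ranks(lexical_density)
    sd_rank = percentile_ranks(specific_density)
    # Assumption: combine the two metrics by averaging their percentile ranks.
    combined = {t: (ld_rank[t] + sd_rank[t]) / 2 for t in lexical_density}
    ordered = sorted(combined, key=combined.get)

    decile_size = max(len(ordered) // 10, 1)
    weights = {ticker: 0.0 for ticker in ordered}
    # Assumption: equal weights, +50% total long exposure and -50% total short.
    for ticker in ordered[:decile_size]:
        weights[ticker] = -0.5 / decile_size
    for ticker in ordered[-decile_size:]:
        weights[ticker] = 0.5 / decile_size
    return weights


# Made-up scores for 20 hypothetical tickers; a monthly rebalance would
# recompute these weights from the latest scores.
tickers = [f"S{i:02d}" for i in range(20)]
lexical_density = {t: 0.40 + 0.01 * i for i, t in enumerate(tickers)}
specific_density = {t: 0.05 + 0.005 * i for i, t in enumerate(tickers)}

weights = decile_long_short(lexical_density, specific_density)
print({t: w for t, w in weights.items() if w != 0.0})
```

With 20 tickers each decile holds two names, so the two highest-scoring tickers receive +0.25 each and the two lowest-scoring tickers receive -0.25 each. With the 500-stock universe, each side would hold 50 names.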
Hedge for stocks during bear markets
Yes - Based on the backtest in QuantConnect, the strategy has a beta of -0.029. Visual inspection of the equity curve also suggests that the strategy performs well during bear markets.
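For readers who want to check a beta figure like this themselves, here is a minimal sketch of the standard calculation: the covariance of strategy and market returns divided by the variance of market returns. It is not QuantConnect's code, and the return series are made-up numbers chosen only for illustration; the function name `beta` is hypothetical.

```python
def beta(strategy_returns, market_returns):
    """Estimate beta as sample covariance over sample market variance."""
    n = len(strategy_returns)
    if n != len(market_returns) or n < 2:
        raise ValueError("need two return series of equal length (at least 2 points)")

    mean_s = sum(strategy_returns) / n
    mean_m = sum(market_returns) / n
    covariance = sum(
        (s - mean_s) * (m - mean_m)
        for s, m in zip(strategy_returns, market_returns)
    ) / (n - 1)
    variance = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    if variance == 0:
        raise ValueError("market returns have zero variance")
    return covariance / variance


# Made-up periodic returns: the strategy moves half as much as the market,
# in the opposite direction, so its beta is negative.
market = [0.02, -0.01, 0.03, -0.02]
strategy = [-0.5 * r for r in market]

estimated_beta = beta(strategy, market)
print(round(estimated_beta, 3))
```

A beta close to zero, such as the one reported for this strategy, means the strategy's returns move almost independently of the market in the backtest.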
Out-of-sample strategy's implementation/validation in QuantConnect's framework