How to Use Lexical Density of Company Filings

Quantpedia is The Encyclopedia of Quantitative Trading Strategies

We've already analyzed tens of thousands of financial research papers and identified more than 1000 attractive trading systems together with hundreds of related academic papers.

Browse Strategies

Unlock Screener & 300+ Advanced Charts
Browse 1000+ uncommon trading strategy ideas
Get new strategies on bi-weekly basis
Explore 2000+ academic research papers
View 800+ out-of-sample backtests
Design multi-factor multi-asset portfolios

Get subscription

Natural language processing (NLP) is the ability of a program to understand human language. Studies suggest there is a connection between an investor's vocabulary and the profitability of their strategies. This research analyzes lexical metrics in 10-K & 10-Q reports. All publicly traded companies have to file 10-K & 10-Q reports periodically. These reports contain relevant information about financial performance. There is a gradual shift from numerical to text-based information, which makes the reports harder to analyze. Still, the 10-K & 10-Q reports rightfully receive great interest from academics, investors and analysts.
BRAIN is one of the companies that analyze 10-K & 10-Q reports using NLP. The main objective of the Brain Language Metrics on Company Filings (BLMCF) dataset is to monitor numerous language metrics on 10-K and 10-Q company reports for more than 6000 US stocks. This paper focuses on the lexical metrics of the BLMCF dataset, specifically lexical richness, lexical density, and specific density.

Fundamental reason

The high and increasing volume of published 10-K & 10-Q reports, combined with their gradual shift to non-numerical information, leads to the premise that fundamental analysts cannot pick out the crucial information about the current and future performance of the company from the “white noise”. Companies like BRAIN, which analyze the 10-K & 10-Q reports using NLP and assign scores according to numerous language metrics, bridge the gap between the non-numerical and numerical data. The research suggests that the richer the vocabulary of a company's filings, the higher the lexical score the company gets and the better its stock performs.

Keywords

equity long short, alternative data, machine learning

Market Factors

Equities

Confidence in Anomaly's Validity

Strong

Period of Rebalancing

Monthly

Number of Traded Instruments

500

Notes to Number of Traded Instruments

Top 500 US stocks by dollar volume

Complexity Evaluation

Complex

Financial instruments

Stocks

Backtest period from source paper

2010 – 2021

Indicative Performance

8.16%

Notes to Indicative Performance

Table on page 6, Compounding Annual Return

Estimated Volatility

10.4%

Notes to Estimated Volatility

Table on page 6, Annual Standard Deviation

Notes to Maximum drawdown

Table on page 6, Drawdown

Sharpe Ratio

0.69

Regions

United States

Simple trading strategy

The investment universe consists of the top 500 US stocks by dollar volume. The stocks are sorted based on their lexical density and specific density scores from the BLMCF dataset. Lexical density measures the structure and complexity of human communication in a text. A high lexical density indicates a large amount of information-carrying words. Specific density measures how dense the report’s language is from a financial point of view, that is, how many finance-related words are used in the text. The investor goes long the top decile and short the bottom decile. The portfolio is rebalanced monthly.
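The sort-and-rank step described above could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the source does not say how the two scores are combined or how positions are weighted, so averaging the percentile ranks and using equal weights (50% long, 50% short) are assumptions, as are the function names and the toy tickers.

```python
from typing import Dict


def percentile_ranks(values: Dict[str, float]) -> Dict[str, float]:
    """Map each ticker to its rank in (0, 1]; ties are broken by ticker name."""
    ordered = sorted(values, key=lambda t: (values[t], t))
    n = len(ordered)
    return {ticker: (i + 1) / n for i, ticker in enumerate(ordered)}


def decile_long_short(
    lexical: Dict[str, float], specific: Dict[str, float]
) -> Dict[str, float]:
    """Return target weights for one monthly rebalance.

    `lexical` and `specific` map ticker -> lexical density / specific density
    score (hypothetical inputs standing in for the BLMCF fields).
    """
    # Only stocks that have both scores can be ranked.
    tickers = sorted(set(lexical) & set(specific))
    k = len(tickers) // 10  # number of stocks in one decile
    if k == 0:
        raise ValueError("need at least 10 stocks to form deciles")

    lex_rank = percentile_ranks({t: lexical[t] for t in tickers})
    spec_rank = percentile_ranks({t: specific[t] for t in tickers})

    # Assumption: the combined score is the average of the two percentile ranks.
    combined = {t: (lex_rank[t] + spec_rank[t]) / 2 for t in tickers}
    ordered = sorted(tickers, key=lambda t: (combined[t], t))

    # Assumption: equal weights, 50% of capital long and 50% short.
    weights = {t: 0.0 for t in tickers}
    for t in ordered[-k:]:  # top decile -> long
        weights[t] = 0.5 / k
    for t in ordered[:k]:   # bottom decile -> short
        weights[t] = -0.5 / k
    return weights


if __name__ == "__main__":
    # Toy universe of 20 made-up tickers with made-up scores.
    lexical = {f"S{i:02d}": i / 20 for i in range(20)}
    specific = {f"S{i:02d}": i / 20 for i in range(20)}
    for ticker, weight in decile_long_short(lexical, specific).items():
        if weight != 0.0:
            print(ticker, weight)
```

With a 500-stock universe, `k` would be 50, so each leg holds 50 names. Calling the function once a month with fresh scores would correspond to the monthly rebalancing.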

Hedge for stocks during bear markets

Yes – Based on the backtest in QuantConnect, the strategy has a slightly negative beta of -0.029. Visual inspection of the equity curve also suggests that the strategy performs well during bear markets.

Source paper

Hanicova, Kalus, Vojtko: How to Use Lexical Density of Company Filings

Abstract: This paper analyzes the application of natural language processing (NLP) on the 10-K and the 10-Q company reports. Using the Brain Language Metrics on Company Filings (BLMCF) dataset, which monitors numerous language metrics on 10-Ks and 10-Qs company reports, we analyze various lexical metrics such as lexical richness, lexical density, and specific density. In simple words, lexical richness says how many unique words are used by the author. The idea is that the more varied vocabulary the author has, the more complex the text is. Secondly, lexical density measures the structure and complexity of human communication in a text. A high lexical density indicates a large amount of information-carrying words. And lastly, specific density measures how dense the report's language is from a financial point of view. In other words, how many finance-related words are used in the text. Overall, we can say that this type of alternative data exhibits interesting results. Even though lexical richness produced the weakest results (of our strategies) when applied to the investment universe consisting of 500 stocks, it significantly improved when we expanded the investment universe to 3000 stocks. Moreover, the strategies based on the lexical density and specific density improved the Sharpe ratio even further. In the last section, we combine the two metrics (lexical density and specific density) in one strategy. Applying both of these metrics to the investment universe with 500 stocks produces a Sharpe ratio of 0.688.
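To make the three metrics from the abstract more concrete, here is a toy calculation. It is only a sketch under stated assumptions: the source does not give BRAIN's actual formulas, so the ratios below (unique words over all words, non-function words over all words, finance words over all words) are common textbook-style stand-ins, and the two word lists are hypothetical and far too small for real use.

```python
import re
from typing import Dict, Set

# Hypothetical, deliberately tiny word lists for illustration only.
FUNCTION_WORDS: Set[str] = {
    "the", "a", "an", "of", "and", "to", "in", "is", "was", "by", "for", "our", "we", "on",
}
FINANCE_WORDS: Set[str] = {
    "revenue", "debt", "earnings", "margin", "cash", "assets", "liabilities", "dividend",
}


def lexical_metrics(text: str) -> Dict[str, float]:
    """Compute toy versions of the three lexical metrics for one text."""
    # Lowercase the text and keep alphabetic runs as word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    n = len(tokens)
    return {
        # Assumed definition: share of distinct words among all words.
        "lexical_richness": len(set(tokens)) / n,
        # Assumed definition: share of information-carrying (non-function) words.
        "lexical_density": sum(t not in FUNCTION_WORDS for t in tokens) / n,
        # Assumed definition: share of finance-related words.
        "specific_density": sum(t in FINANCE_WORDS for t in tokens) / n,
    }


if __name__ == "__main__":
    sample = "Revenue increased and the debt of the company decreased"
    print(lexical_metrics(sample))
```

For the nine-word sample sentence, eight words are distinct, five are not in the function-word list, and two are in the finance list, so the three ratios are 8/9, 5/9 and 2/9.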

Other papers

Han, Henry and Wu, Yi and Zhao, Qianyu and Ren, Jie: Forecasting Stock Excess Returns With SEC 8-K Filings
Abstract: The stock excess return forecast with SEC 8-K filings via machine learning presents a challenge in business and AI. In this study, we model it as an imbalanced learning problem and propose an SVM forecast with tuned Gaussian kernels that demonstrate better performance in comparison with peers. It shows that the TF-IDF vectorization has advantages over the BERT vectorization in the forecast. Unlike general assumptions, we find that dimension reduction generally lowers forecasting effectiveness compared to using the original data. Moreover, inappropriate dimension reduction may increase the overfitting risk in the forecast or cause the machine learning model to lose its learning capabilities. We find that resampling techniques cannot enhance forecasting effectiveness. In addition, we propose a novel dimension reduction stacking method to retrieve both global and local data characteristics for vectorized data that outperforms other peer methods in forecasting and decreases learning complexities. The algorithms and techniques proposed in this work can help stakeholders optimize their investment decisions by exploiting the 8-K filings besides shedding light on AI innovations in accounting and finance.