Algotrading NLP on 10-K's

In this project, NLP Analysis was carried out on 10-k financial statements to generate an alpha factor. For the dataset, the end of day from Quotemedia and Loughran-McDonald sentiment word lists were used.

Installation

Use git clone to get a copy of this repository.

$ git clone https://github.com/lucaskienast/NLP-on-10Ks-from-EDGAR-DB.git
$ cd NLP-on-10Ks-from-EDGAR-DB

Method

define list of public companies and get their CIK's
use BeautifulSoup to get XML files for each CIK
download complete submission text file for all CIK's with sec_edgar_downloader
use re to get text content between <DOCUMENT> tags where <TYPE> is '10-k'
use BeautifulSoup to remove <HTML> tags and convert to lower case
lemmatize verbs using nltk.stem.WordNetLemmatizer
remove stopwords using nltk.corpus.stopwords
get Loughran-McDonald sentiment word list and lemmatize it
generate bag of words that counts number of sentiment words in each 10-K via sklearn.feature_extraction.text.CountVectorizer
compute Term Frequency Inverse Document Frequency (TFIDF) via sklearn.feature_extraction.text.TfidfVectorizer
compute cosine similarity to evaluate change in TFIDF over time using sklearn.metrics.pairwise.cosine_similarity
get stock data using zipline
implement cosine similarities as alpha factors and analyse factor returns using alphalens
compute Sharpe ratio for each sentiment factor

Results

Alpha factors (negative, positive, uncertainty, litigous, constraining, and interesting) do not display monotonicity in quantiles of factor returns. The sentiment word litigous achieved the highest Sharpe ratio at 2.23.