Sentiment Analysis for Event-Driven Stock Prediction

This project uses NLP to predict stock price movement based on news from Reuters. It consists of the following 5 steps:

  1. Data Collection

    1.1 get the whole ticker list

    1.2 crawl news from Reuters using BeautifulSoup

    1.3 crawl prices using urllib2 (Yahoo Finance API is outdated)

  2. Apply GloVe to train dense word vectors on the Reuters corpus in NLTK

    2.1 build the word-word co-occurrence matrix

    2.2 factorize the weighted log of the co-occurrence matrix

  3. Feature Engineering

    3.1 Unify word format: unify tense, singular & plural, remove punctuation & stop words

    3.2 Extract features using feature hashing based on the trained word vectors (step 2)

    3.3 Pad each word sequence (essentially a matrix) to keep the same dimensions

  4. Train a ConvNet to predict the stock price movement based on a reasonable parameter selection

  5. The result shows a significant 1-2% improvement on the test set

1. Data Collection

1.1 Download the ticker list from NASDAQ

./crawler_allTickers.py 20  # keep the top e.g. 20% marketcap companies
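The market-cap filter can be pictured with a minimal sketch. The `(ticker, market_cap)` pair layout and the function name are assumptions made for illustration, not the crawler's actual file format or code.

```python
import math

def keep_top_by_market_cap(companies, percent):
    """Return the tickers of the largest `percent`% of companies by market cap.

    `companies` is a list of (ticker, market_cap) pairs; this layout is an
    assumption for illustration, not the crawler's actual file format.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ranked = sorted(companies, key=lambda pair: pair[1], reverse=True)
    # Round up so that a small non-empty list still keeps at least one company.
    n_keep = math.ceil(len(ranked) * percent / 100)
    return [ticker for ticker, _ in ranked[:n_keep]]

# Made-up tickers and market caps, for illustration only.
sample = [("AAA", 50.0), ("BBB", 900.0), ("CCC", 120.0), ("DDD", 10.0), ("EEE", 300.0)]
top = keep_top_by_market_cap(sample, 20)
```

With five companies and a 20% cutoff, only the single largest company survives.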

1.2 Use BeautifulSoup to crawl news headlines from Reuters

Note: you may need over one month to fetch the news you want.

Suppose we find a news article about Facebook dated Dec. 13, 2016 on reuters.com

We can use the following script to crawl it and save it in our local file format

./crawler_reuters.py # links each news item to a company and a date, which is more precise than Bloomberg News

By brute-force iterating over company tickers and dates, we end up with a dataset of about 30,000 ~ 200,000 news items. Since a company may have multiple news items on a single day, the current version only deals with topStory and ignores the others.
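The brute-force loop can be sketched as follows. `fetch_headlines` is a hypothetical callable standing in for the real crawler, and the dictionary layout of the result is an assumption; only the idea of visiting every (ticker, date) pair and keeping the top story comes from the description above.

```python
from datetime import date, timedelta

def date_range(start, end):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

def collect_top_stories(tickers, start, end, fetch_headlines):
    """Visit every (ticker, date) pair and keep only the first headline.

    `fetch_headlines(ticker, day)` is a hypothetical callable that returns the
    headlines for that pair, top story first, or an empty list.
    """
    dataset = {}
    for ticker in tickers:
        for day in date_range(start, end):
            headlines = fetch_headlines(ticker, day)
            if headlines:
                # Keep the top story and ignore the rest.
                dataset[(ticker, day.isoformat())] = headlines[0]
    return dataset

# Stand-in for the real crawler; returns placeholder data for one pair only.
def fake_fetch(ticker, day):
    if ticker == "FB" and day == date(2016, 12, 13):
        return ["top story headline", "second headline"]
    return []

stories = collect_top_stories(
    ["FB", "AAPL"], date(2016, 12, 12), date(2016, 12, 14), fake_fetch
)
```

Passing the fetcher in as an argument keeps the iteration logic testable without touching the network.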

1.3 Use urllib2 to crawl historical stock prices

Improvement: use the return normalized over the S&P 500 [4] instead of the raw return.

./crawler_yahoo_finance.py # generate stock price raw data: stockPrices_raw.json, containing open, close, ..., adjClose
./create_label.py # use raw price data to generate stockReturns.json
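A minimal sketch of the normalized return, assuming it means the stock's return minus the S&P 500 return over the same period. The function names and the prices are made up for illustration and are not taken from create_label.py.

```python
def simple_return(prev_close, close):
    """Fractional price change between two adjusted closes."""
    return (close - prev_close) / prev_close

def normalized_return(stock_prev, stock_close, index_prev, index_close):
    """Stock return minus the index return over the same period."""
    return simple_return(stock_prev, stock_close) - simple_return(index_prev, index_close)

# Made-up prices: the stock gains 3% while the index gains 1%.
excess = normalized_return(100.0, 103.0, 2000.0, 2020.0)
```

Here the stock beats the index by about two percentage points, so a day on which the whole market rises is not mistaken for company-specific good news.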

2. Word Embedding

To use our own customized word vectors, apply GloVe to train word vectors on the Reuters corpus in NLTK

./embeddingWord.py

Read the detail of the method here, implementation here

We can also directly use a pretrained GloVe word vector from here
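The two GloVe steps (build the co-occurrence matrix, then fit the weighted log counts) can be sketched in plain Python. This is an illustration of the idea from the GloVe paper [2], not the code in embeddingWord.py; the function names, the window size, and the toy corpus are assumptions.

```python
import math
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """Count symmetric co-occurrences within `window` tokens.

    Each pair contributes 1/distance, so nearer words count more.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log(x_ij)."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2

# Toy corpus, for illustration only.
corpus = [["stocks", "rise", "on", "earnings"], ["stocks", "fall", "on", "earnings"]]
X = build_cooccurrence(corpus, window=2)
```

Training then amounts to minimizing the sum of `pair_loss` over all non-zero entries of `X` with respect to the word vectors and biases.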

3. Feature Engineering

Unify the word format and project each word onto its word vector, so that every sentence becomes a matrix.

Details of unifying the word format: convert to lower case, remove punctuation, get rid of stop words, and unify tense and singular & plural using en
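The first three cleaning steps can be sketched with the standard library alone. The stop-word list below is a tiny placeholder and `normalize` is a hypothetical name; unifying tense and number is left out because it needs a linguistics library.

```python
import string

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "to", "and", "is"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(headline):
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = headline.lower().translate(_PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

tokens = normalize("Shares of the Company Rise, Analysts Say.")
```

Note that deleting punctuation outright merges initials, so in this sketch "P.F" becomes "pf".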

Separate the test set from the training + validation set; otherwise we would get an overly optimistic result.

./genFeatureMatrix.py
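Turning a token list into a fixed-size matrix can be sketched as below. The plain dictionary lookup, the zero rows for unknown words, and the name `sentence_to_matrix` are assumptions of this sketch rather than a description of genFeatureMatrix.py.

```python
def sentence_to_matrix(tokens, vectors, max_len, dim):
    """Map tokens to vectors, then truncate or zero-pad to max_len rows.

    Tokens missing from `vectors` become zero rows; that choice is an
    assumption of this sketch.
    """
    zero = [0.0] * dim
    rows = [list(vectors.get(tok, zero)) for tok in tokens[:max_len]]
    # Pad with fresh zero rows until the matrix has exactly max_len rows.
    rows.extend(list(zero) for _ in range(max_len - len(rows)))
    return rows

# Made-up 2-dimensional vectors, for illustration only.
vectors = {"shares": [0.1, 0.2], "rise": [0.3, -0.1]}
matrix = sentence_to_matrix(["shares", "rise"], vectors, max_len=4, dim=2)
```

Every headline then yields a `max_len` x `dim` matrix, which is what lets the headlines be stacked into one input tensor.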

4. Train a ConvNet to predict the stock price movement.

For the sake of simplicity, I just applied a ConvNet in Keras. The detailed operations on text data are slightly different from those on images; we can use the architecture from Figure 1 in Yoon Kim's paper [1]

./model_cnn.py
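To show how convolution on text differs from convolution on images, here is a plain-Python sketch of one filter in the style of Yoon Kim's architecture [1]: the filter spans the full embedding width and slides only along the word axis, followed by max-over-time pooling. It is not the Keras model in model_cnn.py; ReLU as the nonlinearity and all numbers are assumptions.

```python
def conv_max_over_time(matrix, kernel, bias=0.0):
    """Apply one text filter and return its max-pooled activation.

    `matrix` is a sentence (rows = words, columns = embedding dimensions);
    `kernel` covers h consecutive words across all embedding dimensions.
    """
    h = len(kernel)
    if len(matrix) < h:
        raise ValueError("sentence is shorter than the filter height")
    activations = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for r in range(h):
            total += sum(x * w for x, w in zip(matrix[start + r], kernel[r]))
        activations.append(max(0.0, total))  # ReLU
    # Max-over-time pooling: one number per filter, whatever the sentence length.
    return max(activations)

# Made-up 4-word sentence with 2-dimensional embeddings and a 2-word filter.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_over_time(sentence, bigram_filter)
```

A full model applies many such filters of several heights and feeds the pooled features to a dense softmax layer.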

5. Prediction and analysis

As shown in the result, the prediction accuracy improves significantly, by around 1% - 2%, compared to random pick.
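One way to judge whether an edge of this size over random pick is meaningful is a z-score against 50% accuracy, using the normal approximation to the binomial. The accuracy and test-set sizes below are hypothetical numbers, not results from this project.

```python
import math

def z_score_vs_chance(accuracy, n_samples, chance=0.5):
    """z-score of an observed accuracy against a chance level (normal approximation)."""
    std_err = math.sqrt(chance * (1.0 - chance) / n_samples)
    return (accuracy - chance) / std_err

# Hypothetical numbers for illustration only, not results from this project.
z = z_score_vs_chance(accuracy=0.515, n_samples=10000)
```

A z-score above roughly 1.645 corresponds to p < 0.05 in a one-sided test, so the same 1.5% edge that clears the bar on 10,000 test examples would not clear it on 1,000.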

6. Future work

According to the work by Tim Loughran and Bill McDonald [3], some words strongly indicate positive or negative effects in finance; we may need to dig into these words to find more information. A very simple but interesting example can be found in Financial Sentiment Analysis part1, part2
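A dictionary-based tone score could be sketched as follows. The two word sets are placeholders chosen for illustration and are not taken from the Loughran-McDonald lists; `tone_score` is a hypothetical name.

```python
# Placeholder word sets; a real run would load the published dictionaries.
POSITIVE = {"gain", "strong"}
NEGATIVE = {"loss", "weak"}

def tone_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """(#positive - #negative) / #tokens; 0.0 for an empty headline."""
    if not tokens:
        return 0.0
    pos = sum(tok in positive for tok in tokens)
    neg = sum(tok in negative for tok in tokens)
    return (pos - neg) / len(tokens)

score = tone_score(["strong", "gain", "despite", "loss"])
```

Such a score could be appended to the word-vector features as one extra input.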

As suggested by H Lee et al. [4], we may consider including earnings-surprise features, given their great value

Issues

  1. remove_punctuation() handles middle name (e.g., P.F -> pf)

References:

  1. Yoon Kim, Convolutional Neural Networks for Sentence Classification, EMNLP, 2014
  2. J Pennington, R Socher, CD Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014
  3. Tim Loughran and Bill McDonald, 2011, “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks,” Journal of Finance, 66:1, 35-65.
  4. H Lee et al., On the Importance of Text Analysis for Stock Price Prediction, LREC, 2014
  5. Xiao Ding, Deep Learning for Event-Driven Stock Prediction, IJCAI, 2015
  6. Implementing a CNN for Text Classification in TensorFlow
  7. Keras predict sentiment-movie-reviews using deep learning
  8. Keras sequence-classification-lstm-recurrent-neural-networks
  9. tf-idf + t-sne
  10. Implementation of CNN in sequence classification
  11. Getting Started with Word2Vec and GloVe in Python