/Stock-Price-Prediction

This is a side project create by Lucy's and me, the main idea is use Bert model for fuse the stock news title and past stock price to predict the stock price in the future.

Primary LanguageJupyter NotebookMIT LicenseMIT

Stock Price Prediction

Contents

About

This is a side project create by Lucy and me, the main idea is use Bert model for fuse the stock news title and past stock price to predict the stock price in the future.

What we notice when investing in stocks is that in addition to observing the past values of stocks and short-term changes in stocks. We also refer to the news of the stock market, that is, we refer to the buying and selling conditions of other investors or the operating conditions of the company. This is why simple models such as LSTM or DNN will be inaccurate in stock forecasting - it loses the fundamental value of the stock.

Therefore, we hope to use bert to identify the positive or negative impact of news on stocks, that is, to use bert to solve downstream tasks. However, we are faced with a huge problem, we only have historical daily stock prices, not hourly or even minutely stock data. And representing dozens of news, their answers during training will be the same (news published within the same day). So we thought of another method: According to the free or fetching news API on the Internet, we can get the value of each news, and use this value as the ground truth of fine-tune. In inference, the average score of all news of the day is predicted as the predicted stock market rise.

Set up environment

$git clone git@github.com:henrytsui000/DataScienceProject.git
$cd $(THIS_PROJECT)
$conda create -n IDS python=3.8
$conda activate IDS
$pip install -r requirements.txt
# install pytorch, please choose your cuda version
$conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

Prepare Data

Raw Data

There are two ways to get the news data and the stock price data:

  1. Run the python code to get the data. You may apply an account at marketaux, and add an .env file which indicate the TOKEN
$python tools/update_data.py
$python tools/update_news.py
  1. Use the command below to download the files which I upload to the Google Drive.
$gdown --folder https://drive.google.com/drive/folders/1WR5bq9gzMtNFHMnTB6dLEm8AeJj8b9ZT?usp=share_link

The file structure you downloaded is as follows:

./data
├── news.csv
└── stock.csv 

0 directories, 2 files

Training Data

This step for transform data in news and stock to format data, easier for model reading.

$python tools/make_bert.py
$python tools/make_predict.py
$python tools/add_feature.py

Visualize Data

Before we train the model, we do explore data analysis. There are some example figure below, and you can find the more detail at here. More Detail

Correlation GIF News News mounts
Stock Price News Stock Risk

Go Predict

We attempt to use BERT's powerful language processing capabilities to increase our ability to understand the stock market to help us predict stocks. Perhaps this cannot allow us to accurately forecast stocks forever, after all, the stock market is affected by both market trading and company conditions. But we want to be able to maximize the mastery of the latter. Rather than other models blindly learning relationships between numbers that don't exist.

Data problems

Then we ran into a big problem, with dozens or even hundreds of news items every day. It is irrational to use the gap between stock prices in two days as ground truth, because such good news may be defined as ordinary news.

Therefore, we thought of a superb solution - we will get the evaluation of the news through the API through the official news. We first use BERT to add a simple MLP to predict the news evaluation.

But it is a bit unrealistic to take the average score of each news to the final stock price forecast. What I mean is that too much news information into a single number will blur the forecast results.

Solutions!

Therefore, in the actual prediction, we remove the MLP and regard the BERT output as the feature of the news. Input into the designed model to make predictions.

Fine Tune BERT

Architecture

As mentioned in the above paragraph, our bert model is shown below. First, there will be pre-trained bert (it is conceivable that we do not have enough data to do bert pre-training). Then, after the MLP constructed by myself, it is separated into feature and output, and the value obtained by output and news API is used as ground truth, and MSE loss is used for training. See experiment for detailed training status.

Train the BERT model

It takes about an hour to train Bert from scratch (single RTX3090). If the RAM is not large enough or you don’t want to train for too long, you can adjust it by modifying the parameters. The following training instructions are basic instructions for reference.

$conda activate IDS
$python model/news2score.py
# For Fast training
# !python model/news2score.py -b 128 -e 20 --lr 5e-5

Train the Prediction model

It's only take 1 minutes for training the prediction DNN model. Run the ipynb script could run the training process.

# Just run model/score2price.ipynb

Inference

Finally, we will get the two models weight. And use inference script would turn the stock price and news to predict price!

Use this command counld inference our testing data!

$python model/inference.py

Results

For visiualize data, run tools/predict.ipynb would visualize the data which just finish inferece on last chapter.

Experiment

BERT training process

Finetuning pretrained bert weight from huggingface. Input is news title, and output is features of news(dimension = 16). We also use news scores from Martex.com as ground truth to train. Scatter plots below is our predict news score v.s. ground truth news score during training process.

0 Epoch 10 Epochs 20 Epochs 30 Epochs 40 Epochs 50 Epochs

DNN predict stock returns training process

Getting MSE and MAE Loss of other model and our model. We can see in belaow chart, our model has better performance than others.

Method MSE MAE
LSTM 9.85 6.83
Shifting 9.89 6.18
Our Model 6.21 4.14

Acknowledgement

This visiulize source code is based on FFN, Numpy, Pandas. Thanks for their wonderful works.

The code for predict stock news is based on BERT, Thanks for GOOGLE.