Quantitative Trading System Based on Deep Learning and Reinforcement Learning



  1. main structure

  2. still working on

  3. conclusion


The system consists of:

  1. Data processing module
  2. Price prediction module
  3. Reinforcement learning module 1, based on a design with 6 actions (buy, short, hold_in_buy, hold_in_short, sell, cover)
  4. Reinforcement learning module 2, based on:

    1. Using the upper and lower lines of VWAP or BBIBOLL to rescale the price into (-1,1)
    2. Designing two separate reinforcement learning models for (buy, sell, hold) and (short, cover, hold)
    3. Setting a priority between the two models to decide which action is output

  5. Stock picking strategy based on price prediction and RL return

Data Processing Module

The data set comes from Kaggle and contains daily price and volume data for the US stock market: open price, close price, high price, low price and volume. Raw data of this kind does not work well for training deep learning and reinforcement learning models, so I wrote dozens of technical analysis functions to generate more input features. The extra features make the input easier for the DQN agent to interpret and speed up convergence.

I chose the technical analysis methods that are most popular on Thinkorswim and in China, and made them easy to generate from the original data set.

The technical analysis methods this module supports are:

  1. SMA
  2. EMA
  6. VWAP
  7. VWAP_UP
  12. RSI_EMA
  13. RSI_SMA
  14. TRIX
  15. TMA
  16. BIAS
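
As an illustration of how such indicators are derived from a price series, here is a minimal sketch of SMA, EMA and BIAS in plain Python. It is not the module's own code: the function names are hypothetical, the EMA uses the common smoothing factor 2/(window+1) seeded with the first price, and BIAS is taken as the percentage deviation of the price from its SMA. All of these are assumptions.

```python
from typing import List, Optional


def sma(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]  # drop the value that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def ema(prices: List[float], window: int) -> List[float]:
    """Exponential moving average (assumed alpha = 2 / (window + 1))."""
    alpha = 2.0 / (window + 1)
    out: List[float] = []
    for price in prices:
        # The first value seeds the average; later values are blended in.
        out.append(price if not out else alpha * price + (1 - alpha) * out[-1])
    return out


def bias(prices: List[float], window: int) -> List[Optional[float]]:
    """BIAS, assumed here to be (price - SMA) / SMA * 100."""
    return [
        None if mean is None else (price - mean) / mean * 100.0
        for price, mean in zip(prices, sma(prices, window))
    ]
```

Each function returns one value per input row, so the results can be appended to the original data as extra feature columns.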


If a formula is unfamiliar, searching for the indicator name online will turn up its definition.

This module also defines a function that takes the data of two technical lines (a, b) and outputs one of four numbers (-1, -0.5, 0.5, 1). "-1" means a crossed above b from below, "-0.5" means a stays above b, "0.5" means b stays above a, and "1" means b crossed above a from below. This can be used for MACD analysis.
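
The encoding described above can be sketched as follows. This is a minimal illustration rather than the module's code: the name `cross_signal` is hypothetical, ties (a equal to b) are treated as "a not above b", and the first point, which has no predecessor, is given the non-crossing value. Those choices are assumptions.

```python
from typing import List, Sequence


def cross_signal(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Encode the relation between two technical lines at each time step.

    -1.0: a crossed above b from below
    -0.5: a stays above b
     0.5: b stays above a
     1.0: b crossed above a from below
    """
    if len(a) != len(b):
        raise ValueError("both lines must have the same length")
    signals: List[float] = []
    prev_above = None
    for x, y in zip(a, b):
        above = x > y
        if prev_above is None:
            prev_above = above  # first point: no crossing can be detected
        if above:
            signals.append(-0.5 if prev_above else -1.0)
        else:
            signals.append(1.0 if prev_above else 0.5)
        prev_above = above
    return signals
```

For MACD, `a` and `b` would be the two MACD lines, so that crossings show up as the values -1 and 1.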

By creating 17 technical analysis methods, I greatly increased the number of features. Right now I can only get daily stock price data, so the data set is small and the advantage is hard to see. With 5-minute or 1-minute data, I believe the technical lines would improve the training results.

Price Prediction Module

In this module, the data is first split into training_input, training_output, testing_input and testing_output, according to the different input shapes required by XGBOOST, LSTM and CNN. Each data set is then normalized.

The supported functions are:

  1. Splitting the data for ML methods, for LSTM, and for CNN
  2. Splitting and normalizing the data for ML methods, for LSTM, and for CNN

The normalization method computes the ratio between each value in a window and the first value in that window.
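
A minimal sketch of this window normalization, assuming a one-dimensional price series and a prediction target equal to the value right after the window. The function names and the choice of target are assumptions made for illustration, not the module's actual interface.

```python
from typing import List, Sequence, Tuple


def normalize_window(window: Sequence[float]) -> List[float]:
    """Divide every value in the window by the window's first value."""
    first = window[0]
    if first == 0:
        raise ValueError("cannot normalize a window that starts at zero")
    return [value / first for value in window]


def windowed_samples(
    series: Sequence[float], window: int
) -> List[Tuple[List[float], float]]:
    """Build (normalized inputs, normalized target) pairs.

    The target is the value following the window, scaled by the same
    first value so that inputs and target share one reference point.
    """
    samples = []
    for start in range(len(series) - window):
        chunk = series[start:start + window]
        target = series[start + window] / chunk[0]
        samples.append((normalize_window(chunk), target))
    return samples
```

With this scheme every window starts at 1.0, so the model sees relative moves instead of absolute price levels.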

The price prediction models are: XGBOOST (regression), SVR, GRU, LSTM, a modified Deepsense, and Deepsense with an attention model.

In my experiments, SVR did the worst job, and XGBOOST did not do very well either. However, the results of XGBOOST depend largely on parameter tuning, which I am not very good at, so it may do better after tuning. GRU can be understood as a simplified LSTM: it is faster, but its results are not as good as LSTM's.


Deepsense is a deep learning architecture designed for analyzing mobile sensor data. In my view, the fluctuation of stock prices looks very similar to sensor data, so I adjusted the Deepsense structure to make it more suitable for my data set.


The reason for adding an attention model is that I read a paper on face recognition which reported that applying 1*1 convolution kernels three times and multiplying the result with the original image increases, after training, the weights of the most important features, namely the nose, eyes and mouth. I am trying to use the same idea to extract the most important feature information from the input data.

The final experiment shows that Deepsense with the attention model performs better than the original Deepsense model. The result is still not as good as LSTM, but I expect better results with 5-minute and 1-minute data.


In short, LSTM gets the best results, and its average R2 score over 10 stocks can be higher than 90%.

Reinforcement Learning Module 1

In this module, I use simple DQN, DDQN and Dueling DDQN to implement my ideas.

My idea is to consider long and short actions together, which means choosing one action from (buy, sell, hold, short, cover). But this needs too many constraints: for example, the agent may output "short" or "buy" repeatedly, which is not reasonable.

After some research, I found that I could split the "hold" action into "hold_in_buy" (holdb) and "hold_in_short" (holds). The advantage is that the last action can be one-hot encoded and concatenated into the input (state). Under this design, only two actions are possible at any time, determined by the last action (last action: possible next actions):

buy: holdb, sell

holdb: holdb, sell

short: holds, cover

holds: holds, cover

sell: buy, short

cover: buy, short
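
The transition table above can be written as a lookup that restricts the agent to the two legal actions. The following is a sketch under assumptions: the ordering of `ACTIONS`, the helper names and the use of a dictionary of Q values are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List

# Assumed fixed ordering of the six actions, used for one-hot encoding.
ACTIONS = ("buy", "holdb", "sell", "short", "holds", "cover")

# Last action -> the two actions allowed next (taken from the table above).
ALLOWED_NEXT = {
    "buy": ("holdb", "sell"),
    "holdb": ("holdb", "sell"),
    "short": ("holds", "cover"),
    "holds": ("holds", "cover"),
    "sell": ("buy", "short"),
    "cover": ("buy", "short"),
}


def one_hot(action: str) -> List[int]:
    """One-hot encoding of an action, to be concatenated into the state."""
    return [1 if name == action else 0 for name in ACTIONS]


def select_action(q_values: Dict[str, float], last_action: str) -> str:
    """Pick the allowed action with the highest Q value."""
    candidates = ALLOWED_NEXT[last_action]
    return max(candidates, key=lambda name: q_values[name])
```

For example, right after a "buy" the agent can only hold the long position or sell it, no matter how large the Q value of "short" is.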

As you can see, there is no hold action while no position is open. I wanted the agent to learn to buy low, sell high, short high and cover low, so a hold action in that condition would be useless.

Key points:

  1. The input consists of two parts:

    1. the technical analysis data and the original data generated by the data processing module.

    2. the one-hot encoding of the last action.

  2. The action selection is a little different from traditional DQN. In traditional DQN, each neuron of the last dense layer corresponds to one action. Here the possible actions depend on the last action, so the action is chosen by a more complex selection function. The last action is stored in a (2,1) queue.

  3. For the reward, I need the price recorded at the last opening action, and I use that record together with the current price to calculate the reward. A price_buffer records the price of the last buy and the last short, and is reset to zero when a sell or cover action is chosen.

  4. Just as when using DQN to play a game, a finish condition is needed. When an episode finishes, the Q value is the same as the reward. Here, trading one share at a time, the episode finishes if the return is higher than 40 or the loss is higher than 30.
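
Points 3 and 4 can be sketched together as below. This is only an illustration: the `PriceBuffer` class, the `step_reward` function and the choice to report the unrealized profit as the reward while holding are assumptions, and the thresholds 40 and 30 are used as plain per-share amounts.

```python
from typing import Tuple


class PriceBuffer:
    """Remembers the price of the last buy and the last short."""

    def __init__(self) -> None:
        self.buy_price = 0.0
        self.short_price = 0.0


def step_reward(
    buffer: PriceBuffer,
    action: str,
    price: float,
    take_profit: float = 40.0,
    stop_loss: float = 30.0,
) -> Tuple[float, bool]:
    """Return (reward, done) for one action on one share."""
    if action == "buy":
        buffer.buy_price = price  # record the entry price of the long
        return 0.0, False
    if action == "short":
        buffer.short_price = price  # record the entry price of the short
        return 0.0, False
    if action in ("holdb", "sell"):
        pnl = price - buffer.buy_price  # long side profits when price rises
    elif action in ("holds", "cover"):
        pnl = buffer.short_price - price  # short side profits when price falls
    else:
        raise ValueError(f"unknown action: {action}")
    if action == "sell":
        buffer.buy_price = 0.0  # position closed: clear the record
    if action == "cover":
        buffer.short_price = 0.0
    done = pnl > take_profit or pnl < -stop_loss
    return pnl, done
```

When `done` is true, the training target for that transition would simply be the reward, with no bootstrapped future value.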

Because the first records stored in the memory come from a naive agent, every 500 replay epochs the memory deletes its first 50 records.
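
A minimal sketch of such a memory, assuming that "every 500 epochs" means every 500 calls to the replay function. The class and parameter names are hypothetical.

```python
import random
from collections import deque
from typing import Any, List, Optional


class ReplayMemory:
    """Experience memory that periodically drops its oldest records."""

    def __init__(
        self,
        prune_every: int = 500,
        prune_count: int = 50,
        seed: Optional[int] = None,
    ) -> None:
        self.buffer: deque = deque()
        self.prune_every = prune_every
        self.prune_count = prune_count
        self.replay_calls = 0
        self.rng = random.Random(seed)

    def remember(self, transition: Any) -> None:
        """Store one (state, action, reward, next_state, done) record."""
        self.buffer.append(transition)

    def replay(self, batch_size: int) -> List[Any]:
        """Sample a random batch, pruning old records on schedule."""
        self.replay_calls += 1
        if self.replay_calls % self.prune_every == 0:
            # Delete the oldest records, which came from the untrained agent.
            for _ in range(min(self.prune_count, len(self.buffer))):
                self.buffer.popleft()
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)
```

A `deque` is used here because removing items from the front is cheap, unlike with a plain list.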

The Reinforcement Models:

This module supports three models: simple DQN, DDQN and Dueling DDQN. In theory, Dueling DDQN should give the best result. But after all, trading is not a game, so it is hard to say whether these tricks are useful in trading. I therefore keep all three models for testing.

Here is a brief introduction to the three models:

DQN is the Deep Q-Network, based on the idea of Q-learning. It replaces the Q-table with a neural network, which solves the capacity problem of Q-learning.

In my view, the key points of DQN are:

  1. the remember and replay functions. In my view, randomly sampling previous records (remember) and using them for training is closer to how a neural network is normally trained, and it gives better results.
  2. the target net, a copy of the network whose weights are updated with a delay. The target net is used to calculate the Qmax of the next state, which solves the problem that Qmax is too strongly correlated with the current estimate.

DDQN builds on DQN. The difference is that DDQN uses the eval net to find the action with the highest Q value in the next state, and then uses the target net to calculate Q(next state, action) for that action, in place of the traditional Qmax.
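
The difference between the two training targets can be shown with plain lists of Q values. This is a sketch: `q_eval_next` and `q_target_next` stand for the outputs of the eval net and the target net on the next state, and the function names and the discount factor `gamma` are illustrative assumptions.

```python
from typing import Sequence


def dqn_target(
    reward: float, done: bool, q_target_next: Sequence[float], gamma: float = 0.99
) -> float:
    """DQN: the target net both selects and evaluates the next action."""
    if done:
        return reward  # finished episode: the Q value equals the reward
    return reward + gamma * max(q_target_next)


def ddqn_target(
    reward: float,
    done: bool,
    q_eval_next: Sequence[float],
    q_target_next: Sequence[float],
    gamma: float = 0.99,
) -> float:
    """DDQN: the eval net selects the action, the target net evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_eval_next)), key=lambda i: q_eval_next[i])
    return reward + gamma * q_target_next[best_action]
```

When the two networks disagree about the best action, the DDQN target is never larger than the DQN target, which is how it reduces overestimation.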


Dueling DDQN builds on DDQN. The idea is to separate the state and the action inside the NN: at the very end of the model, the output is split into a state value and per-action values. The mean over all actions is then subtracted from each action value, and finally the state value and the action values are added together as the outputs. The reason is that in some states, whatever action is taken has little influence on the next state. In quantitative trading, I understand it this way: when your initial capital is not large, whatever action you take can hardly affect the stock price.
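
The final combination step of the dueling head can be written in a few lines. This sketch works on plain numbers instead of network layers, and the function name is hypothetical.

```python
from typing import List, Sequence


def dueling_q(state_value: float, action_values: Sequence[float]) -> List[float]:
    """Combine the two streams: Q(s, a) = V(s) + (A(s, a) - mean of A(s, .))."""
    mean_action_value = sum(action_values) / len(action_values)
    return [state_value + (value - mean_action_value) for value in action_values]
```

For example, `dueling_q(1.0, [1.0, 2.0, 3.0])` gives `[0.0, 1.0, 2.0]`: the ranking of the actions comes from the action stream, while the overall level comes from the state value.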

Reinforcement Learning Module 2

In the previous reinforcement learning module, I found that because price fluctuations in daily data are smaller than in day trading, bull and bear markets last a long time. The DQN agent therefore tends to buy at the start of a bull market and hold for a very long time. This is actually the right decision, but I want the agent to be more sensitive to the price, which means earning money from price fluctuations through both long and short trades.

So I designed a new model to fix this. It has two key points:


  1. If the price oscillates within a fixed range, the reinforcement learning agent can learn strategies better. So I use the upper and lower lines of VWAP (4 std) to rescale the price into (-1,1). Even with the bands set at 4 standard deviations, some prices still fall outside the range, so I set those values to 1.1 and -1.1, which means out of range.
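
The rescaling in point 1 can be sketched as a linear map from the band to (-1,1). The exact band formula is not given here, so the `vwap_bands` helper below, which uses a volume-weighted mean and a volume-weighted standard deviation over one window, is an assumption, as are all function names.

```python
import math
from typing import Sequence, Tuple


def vwap_bands(
    prices: Sequence[float], volumes: Sequence[float], num_std: float = 4.0
) -> Tuple[float, float]:
    """Return (lower, upper) bands around the VWAP of one window."""
    total_volume = sum(volumes)
    vwap = sum(p * v for p, v in zip(prices, volumes)) / total_volume
    variance = sum(v * (p - vwap) ** 2 for p, v in zip(prices, volumes)) / total_volume
    std = math.sqrt(variance)
    return vwap - num_std * std, vwap + num_std * std


def scale_to_band(
    price: float, lower: float, upper: float, overflow: float = 1.1
) -> float:
    """Map a price linearly so that lower -> -1 and upper -> 1.

    Prices outside the band are replaced by +/- overflow (1.1 by default),
    which marks them as out of range.
    """
    if upper <= lower:
        raise ValueError("upper band must be above lower band")
    scaled = 2.0 * (price - lower) / (upper - lower) - 1.0
    if scaled > 1.0:
        return overflow
    if scaled < -1.0:
        return -overflow
    return scaled
```

A price exactly in the middle of the band maps to 0, so the agent always sees the price relative to its recent range.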

  2. I use two reinforcement learning models to learn the action policies for (buy, sell, hold) and (short, cover, hold) separately. The final output action follows the rule that the buy_model has higher priority than the short_model: the action output by the buy_model is considered first, then the one from the short_model. This design may not be entirely reasonable, so I am trying to use NLP and GOOGLE TREND data to predict the future trend and use the result to adjust the priority.
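
One possible reading of the priority rule is sketched below: the higher-priority model decides unless it outputs "hold", in which case the other model's action is used. The function name and the `buy_first` switch, which stands in for a priority adjusted by a trend prediction, are hypothetical.

```python
def combine_actions(
    buy_model_action: str, short_model_action: str, buy_first: bool = True
) -> str:
    """Merge the outputs of the two models into one final action.

    buy_model_action is one of "buy", "sell", "hold";
    short_model_action is one of "short", "cover", "hold".
    """
    if buy_first:
        first, second = buy_model_action, short_model_action
    else:
        first, second = short_model_action, buy_model_action
    # The lower-priority model only acts when the other one holds.
    return first if first != "hold" else second
```

Flipping `buy_first` is the kind of adjustment a trend prediction could drive, for example giving the short model priority when a downtrend is expected.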

Unlike the previous reinforcement learning module, the finish condition, the reward policy and the action selection have all changed:

  1. finish condition: every completed (buy, sell) pair marks the end of an episode, and likewise every (short, cover) pair.
  2. reward policy: the reward is the difference between price values that have already been rescaled into (-1,1).
  3. action selection: the selection is not as complex as in the previous module, but any unreasonable action is output as "hold" and then receives a penalty.
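
The three rules can be combined into one step function for the (buy, sell, hold) model. This is a sketch under assumptions: the function name, the returned tuple layout and the penalty value of -0.1 are hypothetical, and the (short, cover, hold) model would mirror it with the sign of the reward reversed.

```python
from typing import Tuple


def long_model_step(
    in_position: bool,
    entry_price: float,
    action: str,
    scaled_price: float,
    penalty: float = -0.1,
) -> Tuple[str, float, bool, bool, float]:
    """One step of the (buy, sell, hold) model on rescaled prices.

    Returns (executed_action, reward, done, in_position, entry_price).
    """
    if action == "buy" and not in_position:
        return "buy", 0.0, False, True, scaled_price
    if action == "sell" and in_position:
        # Reward is the difference between the rescaled exit and entry prices.
        return "sell", scaled_price - entry_price, True, False, 0.0
    if action == "hold":
        return "hold", 0.0, False, in_position, entry_price
    # Unreasonable action (buy while long, sell while flat): output "hold"
    # and apply the penalty.
    return "hold", penalty, False, in_position, entry_price
```

A sell closes the (buy, sell) pair, so `done` is true only on that step.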

The choice of models is the same as in the previous module: DQN, DDQN and Dueling DDQN. The code may differ, but the fundamental ideas are the same.

Stock Picking Strategy Based on Price Prediction and RL Return

I use the LSTM to predict the price data over a future period. These predictions alone could already be used for stock picking.

In addition, I give the predicted price data as input to the reinforcement learning model and calculate the return. The stock with the highest return is the one chosen.

Because VWAP needs volume data for its calculation, and volume is really hard to predict, I use BBIBOLL to rescale the price data instead.
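
The selection step can be sketched as follows. The `simulate_return` argument stands for running the trained reinforcement learning model over the predicted prices. Here a simple buy-and-hold return is used as a placeholder, and the ticker names in the usage note are made up for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple


def pick_stock(
    predicted_prices: Dict[str, Sequence[float]],
    simulate_return: Callable[[Sequence[float]], float],
) -> Tuple[str, Dict[str, float]]:
    """Return the ticker with the highest simulated return, plus all returns."""
    returns = {
        ticker: simulate_return(prices)
        for ticker, prices in predicted_prices.items()
    }
    best = max(returns, key=lambda ticker: returns[ticker])
    return best, returns


def buy_and_hold_return(prices: Sequence[float]) -> float:
    """Placeholder for the RL model: return of holding over the whole period."""
    return prices[-1] / prices[0] - 1.0
```

With predictions `{"AAA": [10.0, 11.0], "BBB": [10.0, 15.0]}`, the call `pick_stock(predictions, buy_and_hold_return)` selects "BBB".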

The idea is simple, so I won't go into detail.


Still Working On

DDPG model: the code is finished. I am testing it now and will add it to the module as soon as possible.

NLP: the current idea is to use a web crawler to collect the title and first paragraph of financial news articles, then train an NN using the stock price change as the label.


GOOGLE TREND: there is a paper about using GOOGLE TREND to make predictions, but after several simple tests I found that the results were not good enough. I will keep trying.

Reinforcement learning priority: with the prediction results of the NLP model and the GOOGLE TREND model, I can adjust the priority in the second reinforcement learning module.


Conclusion

I spent a lot of time and energy on this system, and I learned a lot along the way: not only technical knowledge but, more importantly, knowledge of finance and quantitative trading.

I tried the paper money systems of Thinkorswim and Investopedia, and learned a lot about stocks, futures and options. I also asked popular day traders questions and learned many technical analysis tricks, which are very important when building the reinforcement learning models.

In short, the system cannot yet produce stable profits, but it has improved my own skills a great deal. If you are interested, have a look. Thanks for reading!










