
Code for thesis project on applying reinforcement learning to algorithmic trading


Reinforcement Learning for Algorithmic Trading

Getting Started

Python 3.6.5

I would recommend creating a virtual environment to avoid dependency issues. You can create one using Virtualenv, installing it first if it is not already available in your current Python interpreter. The current dependencies are listed in requirements-cpu.txt (or its GPU equivalent) and can be installed with the following commands.

pip3 install virtualenv
python3 -m virtualenv env

source env/bin/activate

pip install -r requirements-cpu.txt

The equivalent requirements for gpu support are inside requirements-gpu.txt.

Training

We are currently working on optimizing the distribution of funds between two assets. Run python main.py [source type], where the source type is one of the following:

  • markov
    • Markov memory 1 and a fixed asset with return rate 0
  • markov2
    • Markov memory 2 and a fixed asset with return rate 0
  • iid
    • IID uniform random variable and a fixed asset with return rate 0
  • mix
    • Markov memory 1 and IID uniform r.v
  • real
    • Real data

This will populate the contents of a Q table and display the result of following the obtained policy over 100 intervals of testing data. An overview of the code architecture is found below.

  • main.py
    • This is where you can specify episodes and initial investment
  • enviorment.py
    • Here you can manipulate parameters of the trading environment such as action space, observation space, reward function and done flag
  • Q_table.py
    • This class defines learning rate, gamma, epsilon parameters and contains choice action and learning methods
  • utils.py
    • A series of methods used to import, generate and manipulate data for training
  • /Matlab/
    • MATLAB scripts that have been used to determine quantizers, empirical distributions and policies.
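As a rough illustration of the Q_table.py description above (learning rate, gamma, epsilon, an action-choice method and a learning method), here is a minimal tabular Q-learning sketch. The class name `QTable`, its method names and its default hyperparameters are hypothetical and are not taken from the repository.

```python
import random
from collections import defaultdict


class QTable:
    """Hypothetical tabular Q-learning agent (not the repository's class)."""

    def __init__(self, actions, learning_rate=0.1, gamma=0.9, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # One row of action values per observed state, initialised to zero.
        self.q = defaultdict(lambda: [0.0] * len(self.actions))

    def choose_action(self, state):
        """Epsilon-greedy: explore with probability epsilon, else act greedily."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.actions))
        row = self.q[state]
        return row.index(max(row))

    def learn(self, state, action, reward, next_state, done):
        """One-step Q-learning update towards the bootstrapped target."""
        target = reward
        if not done:
            target += self.gamma * max(self.q[next_state])
        self.q[state][action] += self.lr * (target - self.q[state][action])
```

Here an action is an index into the list of allocations (for example 0%, 10%, ..., 100%), and a state is whatever observation the environment exposes, such as the previous quantized return rate.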

Q Learning Results

Below are some observations thus far. Each example is derived with an initial investment of $100 for 10 episodes.
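To make the setup concrete, the sketch below shows one way a sequence of allocations could turn an initial investment into a wealth trajectory. It assumes that the uninvested fraction sits in the fixed asset with return rate 0, so wealth is multiplied by (1 + allocation * return) at each step. The function name `simulate_wealth` is hypothetical, and the repository's environment may compute wealth differently.

```python
def simulate_wealth(returns, allocations, initial_investment=100.0):
    """Return the wealth trajectory for per-step allocations into the risky asset.

    Assumption: the uninvested fraction earns a return rate of 0, so each
    step multiplies wealth by (1 + allocation * return).
    """
    if len(returns) != len(allocations):
        raise ValueError("returns and allocations must have the same length")
    wealth = [initial_investment]
    for r, a in zip(returns, allocations):
        wealth.append(wealth[-1] * (1.0 + a * r))
    return wealth
```

For example, `simulate_wealth([0.2, -0.2], [1.0, 0.0])` invests everything before a +20% step and nothing before a -20% step.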

IID Source

When the input source is an iid random variable uniformly distributed on [-1,1] with steps of 0.2, the Q table is populated as follows. Note that there is no observation space, as return rates are independent.

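| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| None | 0.013 | -0.016 | -0.11 | -0.038 | -0.009 | -0.038 | 0.004 | 0.001 | -0.038 | -0.087 | -0.123 |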

This indicates that the best option is to invest 0% of capital into the IID stock at any given time.

When the input source is an IID random variable uniformly distributed on [0,0.3], the Q table is populated as follows.

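| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| None | 1923.39 | 2243.24 | 2064.73 | 2121.45 | 2028.12 | 2122.95 | 2088.61 | 2155.73 | 2096.5 | 2124.11 | 4931.78 |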

This indicates that the best option is to invest all capital into the IID stock at any given time.
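One way to sanity-check these two conclusions is to compute, for each allocation, the expected log-growth of wealth under the source distribution. The sketch below assumes wealth is multiplied by (1 + a * r) each step (the fixed asset returns 0) and uses expected log-growth as the yardstick; the repository's actual reward function may differ. The four-point grid for the [0, 0.3] source is also an assumption.

```python
import numpy as np


def expected_log_growth(return_values, allocations):
    """E[log(1 + a * r)] for each allocation a, with r uniform on return_values."""
    r = np.asarray(return_values, dtype=float)
    a = np.asarray(allocations, dtype=float)
    with np.errstate(divide="ignore"):  # log(0) = -inf when a = 1 and r = -1
        growth = np.log1p(np.outer(a, r))
    return growth.mean(axis=1)


allocations = np.linspace(0.0, 1.0, 11)
symmetric = np.linspace(-1.0, 1.0, 11)  # uniform on [-1, 1] in steps of 0.2
positive = np.linspace(0.0, 0.3, 4)     # assumed grid for the [0, 0.3] source

best_symmetric = allocations[np.argmax(expected_log_growth(symmetric, allocations))]
best_positive = allocations[np.argmax(expected_log_growth(positive, allocations))]
```

Under these assumptions the symmetric source favours investing nothing and the strictly non-negative source favours investing everything, which agrees with the greedy choices read off the two Q tables.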

Markov Memory 1 Source

Here the input source is markovian with the following transition probabilities.

If the possible return rates for the 3 states are -0.2, 0.0 and 0.2, the Q table is populated as follows.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prev Value: 0 | 218.074 | 111.159 | 127.127 | 105.276 | 112.809 | 102.041 | 98.145 | 97.792 | 113.613 | 95.798 | 106.162 |
| Prev Value: 0.2 | 181.137 | 191.67 | 261.255 | 187.992 | 172.182 | 179.056 | 223.266 | 186.9 | 182.004 | 217.171 | 629.632 |
| Prev Value: -0.2 | 244.331 | 58.172 | 62.42 | 48.51 | 56.312 | 55.855 | 50.747 | 63.276 | 77.343 | 52.846 | 55.282 |

This yields the policy that, when the previous value is:

  • 0 -> Do not invest
  • -0.2 -> Do not invest
  • 0.2 -> Invest 100% of capital into stock
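The policy above is simply the greedy choice per row of the Q table: for each previous value, pick the allocation with the largest entry. A short sketch using the values from the table above (the variable names are illustrative only):

```python
import numpy as np

allocations = np.arange(0, 101, 10)  # percent of capital placed in the stock
prev_values = [0.0, 0.2, -0.2]       # row order of the Q table above
q_table = np.array([
    [218.074, 111.159, 127.127, 105.276, 112.809, 102.041, 98.145, 97.792, 113.613, 95.798, 106.162],
    [181.137, 191.67, 261.255, 187.992, 172.182, 179.056, 223.266, 186.9, 182.004, 217.171, 629.632],
    [244.331, 58.172, 62.42, 48.51, 56.312, 55.855, 50.747, 63.276, 77.343, 52.846, 55.282],
])

# Greedy policy: for each state, take the allocation with the largest Q value.
policy = {prev: int(allocations[row.argmax()]) for prev, row in zip(prev_values, q_table)}
```

The resulting dictionary maps each previous value to the percentage of capital to invest.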

If the return rates for the 3 possible states are -0.1, 0, 0.4, the Q table below is produced.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prev Value: -0.1 | 559.623 | 62.7048 | 60.9816 | 68.2782 | 61.0308 | 66.2113 | 58.6067 | 56.0653 | 64.5082 | 60.6457 | 56.6935 |
| Prev Value: 0 | 69.0304 | 72.4873 | 64.9642 | 58.6928 | 61.7887 | 561.144 | 70.8124 | 69.7554 | 66.7248 | 63.1553 | 64.2676 |
| Prev Value: 0.4 | 159.824 | 196.479 | 200.312 | 181.142 | 205.622 | 200.464 | 184.247 | 178.793 | 196.372 | 172.393 | 1281.63 |

This yields the policy that, when the previous value is:

  • 0 -> Invest 50% of capital into stock
  • -0.1 -> Do not invest
  • 0.4 -> Invest 100% of capital into stock

These results are consistent with the optimal investment decision derived in the ./Matlab/ scripts, which indicates that the Q-learning is working effectively.

Using the script ./Matlab/ModelFitting.m, the one-step transition matrix for IBM is computed over the IBM return rates (excluding the last 1000 samples, which are held out for testing).

The states are x < -0.1%, -0.1% < x < 0.1% and x > 0.1%. We can then enter this transition matrix and these states in `utils.py` to generate 5000 samples and train the Q-learning agent on them. The policy obtained from Q-learning is then applied to the testing data for IBM (the last 1000 values).
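The two steps described here, mapping return rates to three states and drawing synthetic samples from a transition matrix, can be sketched as follows. The function names are hypothetical and do not reflect the contents of `utils.py`, the handling of values exactly on a threshold is an assumption, and the matrix is a placeholder for illustration only, not the one fitted to IBM.

```python
import numpy as np


def quantize_returns(returns, threshold=0.001):
    """Map return rates to states 0 (x < -0.1%), 1 (in between), 2 (x > 0.1%).

    Assumption: values exactly on a threshold fall into the higher state.
    """
    return np.digitize(returns, [-threshold, threshold])


def sample_markov_chain(transition_matrix, n_samples, initial_state=0, seed=None):
    """Draw a state sequence from a first-order Markov chain."""
    p = np.asarray(transition_matrix, dtype=float)
    if not np.allclose(p.sum(axis=1), 1.0):
        raise ValueError("each row of the transition matrix must sum to 1")
    rng = np.random.default_rng(seed)
    states = np.empty(n_samples, dtype=int)
    state = initial_state
    for t in range(n_samples):
        states[t] = state
        state = rng.choice(len(p), p=p[state])
    return states


# Placeholder matrix for illustration only; NOT the matrix fitted to IBM.
placeholder_matrix = [[0.5, 0.3, 0.2],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.3, 0.5]]
samples = sample_markov_chain(placeholder_matrix, n_samples=5000, seed=0)
```

Each sampled state index would then be mapped back to a representative return rate for its interval before being fed to the agent.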

However, this quantization is generic and was chosen arbitrarily. Consider now repeating the experiment, this time determining the quantization of the training data by running the Lloyd-Max algorithm on the real-data training set.
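For readers unfamiliar with it, the Lloyd-Max algorithm alternates between two steps until the levels stop moving: place each decision threshold midway between adjacent reconstruction levels, then move each level to the mean of the samples in its cell. The following is a minimal Python sketch of that iteration on empirical samples. It is not a port of the repository's MATLAB scripts, and the function name, quantile initialisation and stopping tolerance are assumptions.

```python
import numpy as np


def lloyd_max(samples, n_levels=3, n_iter=100, tol=1e-10):
    """Lloyd-Max scalar quantizer fitted to empirical samples.

    Returns (thresholds, levels) that approximately minimise the
    mean-squared quantization error on the given samples.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    # Initialise levels at evenly spaced quantiles of the data.
    levels = np.quantile(x, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        # Step 1: thresholds are midpoints between adjacent levels.
        thresholds = (levels[:-1] + levels[1:]) / 2.0
        cells = np.digitize(x, thresholds)
        # Step 2: each level moves to the mean of the samples in its cell.
        new_levels = levels.copy()
        for k in range(n_levels):
            members = x[cells == k]
            if members.size:  # keep the old level if a cell is empty
                new_levels[k] = members.mean()
        converged = np.max(np.abs(new_levels - levels)) < tol
        levels = new_levels
        if converged:
            break
    thresholds = (levels[:-1] + levels[1:]) / 2.0
    return thresholds, levels
```

With three levels, the two returned thresholds play the role that the fixed ±0.1% boundaries played in the previous experiment.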

Applying this quantization in `./Matlab/ModelFitting.m`, we generate the following transition matrix.

Training a Q agent on the dataset generated according to this empirical distribution and applying the policy obtained to the testing data yields the following result.

Repeating this experiment with the Microsoft data, we first obtain a Lloyd-Max quantizer as shown below.

Afterwards we can obtain an empirical approximation of the one-step transition matrix according to this quantizer, with respect to the MSFT data excluding the most recent 1000 entries.
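An empirical one-step transition matrix of this kind is obtained by counting transitions between consecutive quantized states and normalising each row. The sketch below is written in Python rather than MATLAB, and the function name is hypothetical; it illustrates the idea and is not the code in ./Matlab/ModelFitting.m.

```python
import numpy as np


def estimate_transition_matrix(states, n_states):
    """Empirical one-step transition matrix from a sequence of state indices."""
    states = np.asarray(states, dtype=int)
    counts = np.zeros((n_states, n_states))
    # Count each observed transition states[t] -> states[t + 1].
    np.add.at(counts, (states[:-1], states[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that never occur as a source are left as all zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Holding out the test period then amounts to passing only the training slice, for example `states[:-1000]` for a hypothetical array `states` of quantized returns.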

Using this quantizer and transition matrix to generate training data for the Q agent, the policies obtained have the following result on our testing data.

Comparing Q Learning Results for Markov Orders

To better understand which Markov order (if any) best represents the conditional dependence on past return rates, we experiment with manipulating the observations seen by the Q-learning agent. To start, we consider the Microsoft data. Below is the populated Q table if this stock is treated as IID. Here, the agent has no access to previous return rates, so the observation is always the same.

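| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| None | 39.2431 | 39.5498 | 39.2918 | 40.3238 | 39.5636 | 48.7792 | 39.8656 | 39.8663 | 39.4475 | 39.746 | 39.4284 |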

This policy implies investing 50% of assets into the stock at every time step, since 50% has the largest entry in the Q table. Applying this policy over the testing data yields the following.

Now, if we allow the Q-learning agent to observe the previous return rate as `-1, 0, 1`, corresponding to the negative, neutral and positive intervals of the uniform quantization over the training data, we observe the following Markov order 1 policy from the Q-learning agent.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prev "Down" | 8.69687 | 8.416 | 8.76448 | 8.32668 | 8.06351 | 8.44431 | 8.61975 | 19.2718 | 9.00001 | 7.86183 | 7.99984 |
| Prev "Neutral" | 21.6007 | 21.3949 | 21.1213 | 21.2595 | 21.0212 | 21.3045 | 20.5782 | 21.1772 | 20.8734 | 28.6595 | 21.2023 |
| Prev "Positive" | 8.626 | 8.10509 | 18.6383 | 8.12764 | 8.08328 | 8.12089 | 7.71774 | 8.01588 | 8.39773 | 8.66198 | 8.14364 |

This Markov policy yields the following performance over the testing data.

Allowing the Q agent to look at the previous two return rates yields the following Q table.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prev Seq: "Down", "Down" | 1.46329 | 1.34375 | 1.31343 | 1.51526 | 1.36004 | 1.47882 | 4.51021 | 1.35624 | 1.63156 | 1.34464 | 1.29957 |
| Prev Seq: "Down", "Neutral" | 4.35364 | 4.07422 | 3.88701 | 4.19443 | 4.04607 | 3.83908 | 4.51836 | 4.09829 | 10.6058 | 3.95363 | 4.04416 |
| Prev Seq: "Down", "Positive" | 1.16335 | 1.42665 | 1.21431 | 1.37276 | 1.42155 | 1.41948 | 1.36941 | 4.42286 | 1.15609 | 1.05923 | 1.19335 |
| Prev Seq: "Neutral", "Down" | 2.88785 | 2.68302 | 3.32496 | 2.92446 | 2.70594 | 2.6983 | 2.91641 | 2.71681 | 7.94342 | 3.07691 | 2.84623 |
| Prev Seq: "Neutral", "Neutral" | 15.2012 | 15.0627 | 22.1063 | 14.9272 | 14.9772 | 15.1129 | 14.9588 | 14.7254 | 14.772 | 15.2999 | 15.2734 |
| Prev Seq: "Neutral", "Positive" | 7.55047 | 3.08106 | 3.07143 | 3.17324 | 2.909 | 3.29652 | 2.78571 | 3.24299 | 3.09982 | 2.82638 | 3.24957 |
| Prev Seq: "Positive", "Down" | 4.64262 | 1.45491 | 1.49116 | 1.44225 | 1.37647 | 1.39101 | 1.6392 | 1.17189 | 1.3111 | 1.53721 | 1.38324 |
| Prev Seq: "Positive", "Neutral" | 3.74981 | 4.25076 | 4.28741 | 9.85611 | 4.12424 | 4.65483 | 4.37102 | 4.29442 | 3.81727 | 4.28817 | 3.76588 |
| Prev Seq: "Positive", "Positive" | 1.22974 | 1.00224 | 1.16756 | 1.04358 | 1.34929 | 1.13015 | 1.25551 | 3.65231 | 0.896386 | 1.18419 | 1.17435 |
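The nine rows above are all length-2 histories over three quantized symbols. A small sketch of how observations of a given Markov order could be enumerated and built from a symbol sequence is shown below. The function names are hypothetical, and the oldest-first ordering within each history is an assumption about how the rows are labelled.

```python
from itertools import product

SYMBOLS = ("Down", "Neutral", "Positive")


def history_states(order):
    """All length-`order` histories of quantized returns, oldest first."""
    return list(product(SYMBOLS, repeat=order))


def observations(symbol_sequence, order):
    """Sliding window of the previous `order` symbols for each decision step."""
    seq = list(symbol_sequence)
    return [tuple(seq[t - order:t]) for t in range(order, len(seq) + 1)]
```

With three symbols, order k gives 3**k states, so an order 2 agent has nine rows to fill from the same amount of training data that an order 1 agent spreads over three.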

Applying this Markov 2 policy to the testing set yields the following.

Interestingly enough, the second-order memory performed worse. This is due to the Q-learning agent not being able to adequately explore all states. To allow the agent to derive a Markov memory 2 policy, we must increase the number of training episodes.

Conducting the equivalent experiment for the IBM data yields the following results. When treated as IID, the following Q table is derived.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| None | 43.8113 | 43.8479 | 43.6593 | 44.0324 | 44.544 | 44.5033 | 43.8169 | 44.1369 | 43.9382 | 54.4591 | 44.3226 |

Applying this policy to the testing data yields the following.

Moving to Markov memory 1, we obtain the following Q table.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prev "Down" | 10.2676 | 10.2352 | 11.1674 | 10.8175 | 10.6088 | 11.0861 | 10.3039 | 22.9233 | 9.90889 | 10.1815 | 9.52493 |
| Prev "Neutral" | 23.2557 | 23.2025 | 23.4107 | 23.7157 | 23.1266 | 23.4601 | 31.6622 | 24.0064 | 22.8477 | 23.2966 | 23.6219 |
| Prev "Positive" | 7.81495 | 7.97147 | 8.75302 | 8.76693 | 8.49904 | 8.13567 | 8.53647 | 8.06919 | 8.06033 | 8.29538 | 20.0326 |

This yields the following testing results.

The difference between the IID test and the Markov 1 test is that the Markov 1 policy appears to be less "risky": it loses less money at the beginning of the testing period but gains less towards the end. The two policies appear to be very similar. Moving on to Markov order 2, we obtain the following Q table.

| Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prev Seq: "Down", "Down" | 2.01617 | 2.08723 | 1.95979 | 6.864 | 1.78769 | 1.70395 | 1.87764 | 2.22568 | 1.64296 | 1.81577 | 2.00709 |
| Prev Seq: "Down", "Neutral" | 5.53366 | 5.52372 | 5.12495 | 5.09426 | 12.9182 | 5.1679 | 5.40856 | 4.4439 | 4.89715 | 4.97982 | 5.47335 |
| Prev Seq: "Down", "Positive" | 1.18736 | 1.34069 | 1.44999 | 1.51097 | 1.20899 | 1.38332 | 4.57586 | 1.24534 | 1.38879 | 1.46921 | 1.37788 |
| Prev Seq: "Neutral", "Down" | 3.6141 | 3.97473 | 3.66598 | 3.66956 | 3.88415 | 9.45307 | 3.82271 | 3.77226 | 3.98146 | 3.84407 | 3.50747 |
| Prev Seq: "Neutral", "Neutral" | 16.3345 | 16.0558 | 15.8404 | 16.3348 | 15.6925 | 16.4471 | 16.2041 | 16.3338 | 16.1826 | 16.1871 | 24.4097 |
| Prev Seq: "Neutral", "Positive" | 3.13525 | 2.72395 | 2.83029 | 3.04262 | 8.08926 | 2.84946 | 3.14385 | 3.14187 | 2.75471 | 2.78054 | 2.84228 |
| Prev Seq: "Positive", "Down" | 1.49701 | 5.84822 | 1.38522 | 1.58129 | 1.71697 | 1.74671 | 1.58833 | 1.83914 | 1.59212 | 1.92212 | 1.68669 |
| Prev Seq: "Positive", "Neutral" | 4.31702 | 3.85078 | 10.0363 | 3.83889 | 3.75574 | 3.99425 | 3.21814 | 4.12956 | 3.47642 | 3.56835 | 3.76362 |
| Prev Seq: "Positive", "Positive" | 1.30869 | 1.09477 | 1.28534 | 1.12935 | 1.18675 | 0.935871 | 0.9024 | 0.860647 | 1.13079 | 1.10643 | 4.14827 |

This yields the following performance over our testing period.