stock-embeddings

Deep Learning Coursework Fall 2018, turned into stock embedding analysis


deeplearning - Two Sigma challenge


Timeline:

  • Report deadline: 19th January -- All

  • Programming deadline: 17th January

  • Implement everything in the proposal: 14th January [probably need 1 day to train?]

  • Implement remaining feature selection algorithms: 13th January -- Others

  • Implement feature selection algorithm (PCA): 11th January -- Kostas

  • Implement baseline NN (from Kaggle): 10th January [should work on random input data] -- All

  • Implement loss function: 8th January -- David

  • Implement peripherals: 8th January -- David


Some things to note:

  • Make sure to use Python 3.

  • To install packages, use the following command (inside a virtual environment!):

pip install -r requirements.txt

(Of course one can also use Anaconda, which may be the better choice - I'm just used to pip and the deadline is soon.)

Whenever you install a new package, please add it to requirements.txt together with its exact version, for example:
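
For instance (package name and version here are purely illustrative, not actual project pins):

pip install pandas
pip freeze | grep pandas   # prints e.g. pandas==0.23.4
echo "pandas==0.23.4" >> requirements.txt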

  • Here is the Kaggle dataset we use:
https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs/version/3

Download the data into the data folder inside the root directory of this project.

  • Please create a file called ".env" in the root directory of this project, containing the following items:
LEONHARD_UNAME="yedavid"

DATAPATH="/Users/david/deeplearning/data/Data/Stocks/"

DATAPATH_PROCESSED="/Users/david/deeplearning/data/processed/all.csv"
DATAPATH_PROCESSED_DEV="/Users/david/deeplearning/data/processed/dev.csv"

DATA_PICKLE="/Users/david/deeplearning/data/processed/pickle.pkl"
DATA_PICKLE_DEV="/Users/david/deeplearning/data/processed/pickle_dev.pkl"

MODEL_SAVEPATH_BASEPATH="/Users/david/deeplearning/data/"


(i.e. replace yedavid with your own username - the same goes for the paths)
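
A minimal sketch of how these variables might be read at runtime, assuming the python-dotenv package (which would then also belong in requirements.txt); the filename is just one example from the Kaggle dump:

import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # reads the .env file into the process environment

data_path = os.getenv("DATAPATH")
df = pd.read_csv(os.path.join(data_path, "aapl.us.txt"))  # one raw stock file
print(df.head())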


The following are the current TODOs. You can add new TODOs either (1) in the repo issues, or (2) here in the README.

Peripheral:

  • Data loader and preprocessor
  • Batch loader for training set (see the sketch after this list)
  • Batch loader for validation set
  • Batch loader for test set
  • Automatic deployment to Leonhard, and automated download of saved weights
  • Setup environment for feature algorithms, debug tools, prediction algorithms, and other features
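
A minimal sketch of the training-set batch loader, assuming X and Y are numpy arrays shaped as described in the Q & A below (all names are hypothetical):

import numpy as np

def batch_iter(X, Y, batch_size=32, shuffle=True):
    """Yield (x, y) minibatches over the stock axis."""
    idx = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

The validation and test loaders would be the same function with shuffle=False.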

Debug tools:

  • [] Tensorboard coupled to training loss and validation loss (see the sketch after this list)
  • [] Predictor on validation / test dataset
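
A sketch of the TensorBoard coupling, assuming PyTorch >= 1.1 (if the baseline stays in Keras, the keras.callbacks.TensorBoard callback is the equivalent); the loss values here are placeholders:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/baseline")
for epoch in range(10):
    # placeholders -- in the real loop these come from training and validation
    train_loss = 1.0 / (epoch + 1)
    val_loss = 1.2 / (epoch + 1)
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/val", val_loss, epoch)
writer.close()

View with: tensorboard --logdir runs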

Feature selection algorithms:
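
  • PCA (assigned to Kostas in the timeline; minimal scikit-learn sketch below, all sizes are placeholders)

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 40)                # placeholder: (samples, raw features)
pca = PCA(n_components=10)                   # keep the 10 strongest directions
X_reduced = pca.fit_transform(X)             # shape (1000, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained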

Prediction algorithms:

  • LSTM (Baseline from Kaggle) - see the sketch below
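
A minimal Keras sketch of what the baseline could look like; layer sizes, window length, and the 6-dim output are assumptions, not the actual Kaggle kernel:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_dates, n_features = 30, 12                 # placeholder window length / feature count
model = Sequential([
    LSTM(64, input_shape=(n_dates, n_features)),
    Dense(6),                                # high, low, open, close, volume, medium
])
model.compile(optimizer="adam", loss="mse")

# smoke test on random input data, as the timeline suggests
X = np.random.randn(8, n_dates, n_features)
Y = np.random.randn(8, 6)
model.fit(X, Y, epochs=1, verbose=0)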

Other features:

  • Loss function

Q & A:

  • Do we do the training-test split between stocks, or dates? (Or somehow both? See the sketch after this list for how the two options differ.)
  • X is of shape (n_stocks, n_dates, n_features). Should Y be of shape (n_stocks, n_dates, 6) (6 because: high, low, open, close, volume, medium)? Or what exactly do we want to predict?
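
A sketch of the two split options under the shapes above (the 80/20 ratio and all sizes are placeholders):

import numpy as np

n_stocks, n_dates, n_features = 500, 1000, 12
X = np.random.randn(n_stocks, n_dates, n_features)
Y = np.random.randn(n_stocks, n_dates, 6)

# date-based split: train on the past, test on the future,
# which avoids leaking future prices into training
cut = int(0.8 * n_dates)
X_train, X_test = X[:, :cut], X[:, cut:]
Y_train, Y_test = Y[:, :cut], Y[:, cut:]

# a stock-based split would instead slice the first axis:
# X_train, X_test = X[:400], X[400:]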

Backlog (ignore for now):

  • Implement dummy features
  • Implement autoregression visualization (checking whether it is symmetric from one timestep to the next, etc.)

TODO: Also include the loss (possibly graphs for each one)

Future work:

  • [] The embedding can be extended to a non-linear, deep latent space by using a sub-network (see the sketch below)
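
A Keras sketch of that idea (all sizes are placeholders): a plain embedding lookup followed by a small sub-network that makes the map into latent space non-linear:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

n_stocks, latent_dim = 500, 16
embed = Sequential([
    Embedding(n_stocks, 32, input_length=1),  # plain (linear) lookup table
    Flatten(),
    Dense(32, activation="relu"),             # sub-network: non-linear transform
    Dense(latent_dim),
])

ids = np.array([[0], [1], [2]])               # three stock ids
z = embed.predict(ids)                        # deep latent vectors, shape (3, 16)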

---- Some more TODOs

  • [/] The feature engineering part is now out of place and has to go; instead, we should expand the embeddings part

  • [] Either show more metrics along the lines of MAE, separately for positive and negative forecasts, or run an investment experiment directly -> enough?

  • The point is that people could come and ask: how many returns in your sample were positive (in that period, roughly estimated, significantly more than 50%)? What difference does an indicator with 54% accuracy then make vs., bluntly put, just buying blindly?

  • And showing only a single accuracy number for the stock experiment, idk -> do more analysis with a prediction ROC or a precision/recall matrix

  • All the figures and the formatting still need a clean revision pass