- Preprocessing [mostly implemented already, but needs error checking, PEP8 compliance, and improvements to the file/class structure] --> something is currently wrong with normalization and batch processing, so this needs thorough checking. Please check the open issues on GitHub.
- load all .csv files found in the specified data folder (either daily or hourly); more files will be added in the future, so load whatever is in the folder
- normalize the columns per file_name (pair/exchange), so that the same normalization can then be applied to the test set.
- currently a filename token is given as input to the model; this can probably be removed or made optional.
- do a 20/10/70% test/validation/training split: the last 20% of the rows of each file are for the test set.
- provide a production option whereby all data is used for training
- custom data loader with stratified sampling
Please check and address all the issues on GitHub.
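A minimal sketch of the split and per-file normalization described above, using only the standard library. The function names are made up for illustration. Two assumptions that the brief does not state: the validation rows sit directly before the test rows, and normalization is a z-score fitted on the training rows only.

```python
from statistics import mean, pstdev


def split_bounds(n_rows, test_frac=0.20, val_frac=0.10, production=False):
    """Return (train, val, test) slices over the rows of one file, in time order."""
    if production:
        # Production mode: every row is used for training.
        return slice(0, n_rows), slice(n_rows, n_rows), slice(n_rows, n_rows)
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    n_train = n_rows - n_val - n_test
    # Assumption: validation rows come right before the final test rows.
    return (slice(0, n_train),
            slice(n_train, n_train + n_val),
            slice(n_train + n_val, n_rows))


def fit_scaler(train_values):
    """Compute z-score statistics from the training rows of one column."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero on constant columns
    return mu, sigma


def apply_scaler(values, scaler):
    """Apply previously fitted statistics, e.g. to validation or test rows."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Toy example: one column of one file (one pair/exchange) with 10 rows.
close = [float(v) for v in range(1, 11)]
train, val, test = split_bounds(len(close))
scaler = fit_scaler(close[train])
close_test_scaled = apply_scaler(close[test], scaler)
```

Keeping one scaler per file_name and column would let the statistics fitted on the training rows be reused for the test set and for new production data.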
- Model
- allow the user to specify the y column to predict (e.g. Sell_p40_a4).
- allow a list of columns to be removed from the model input (i.e., Top_p15_a4,Btm_p15_a4,Buy_p15_a4,Sell_p15_a4,Top_p40_a1,Btm_p40_a1,Buy_p40_a1,Sell_p40_a1,Top,Btm,last_pivot)
- the model input has dimensions all_features x n, where n is the sequence length (n is configurable and could be, for instance, 14 days)
- Implement the following model architectures as classes:
- Transformer
- 2 layer LSTM
- 2 layer LSTM with self-attention
- Simple 2 layer FC
- Wavenet (will need a bigger input window, please allow easy changing of all model parameters when calling the class)
- Add training/evaluation function with loss/accuracy plot for training, validation, and test set
- Output the confusion matrix for test set
- function to save / load model and predict based on small input dataframe (m rows)
- batchnorm for training optimization
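The column handling asked for above (a user-chosen y column plus a list of columns to drop) could look roughly like the sketch below. `select_feature_columns` and its `include_y` flag are hypothetical names, and "open", "close" and "volume" are made-up feature columns. Excluding y from the inputs by default is an assumption that matches the n-to-1 variant.

```python
def select_feature_columns(all_columns, y_column, drop_columns=(), include_y=False):
    """Return the ordered list of columns that are fed to the model."""
    missing = [c for c in (y_column, *drop_columns) if c not in all_columns]
    if missing:
        raise KeyError(f"columns not found in data: {missing}")
    excluded = set(drop_columns)
    if not include_y:
        # Assumption: the target is not an input unless explicitly requested.
        excluded.add(y_column)
    return [c for c in all_columns if c not in excluded]


# "open", "close" and "volume" are made-up feature names for this example.
columns = ["open", "close", "volume", "Top", "Btm", "last_pivot",
           "Sell_p15_a4", "Sell_p40_a4"]
features = select_feature_columns(
    columns,
    y_column="Sell_p40_a4",
    drop_columns=["Top", "Btm", "last_pivot", "Sell_p15_a4"],
)
n = 14  # sequence length, e.g. 14 days
input_shape = (len(features), n)  # all_features x n
```

Raising an error on unknown column names is one way to cover the error checking requested for preprocessing. Silently ignoring them would hide typos in the drop list.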
- Predict
- create a predict function that loads the final production model (trained on the training + test set) and feeds in a csv from the data folder 'production_data_for_new_prediction'
- output dataframe with predictions
- output confusion matrix for this data
- allow me to control the threshold (0.5) for prediction cutoff so that I can increase the precision if needed.
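To illustrate the threshold control, here is a small standard-library sketch. It assumes the model outputs one probability per row for a binary label, which the brief implies but does not state. All function names are hypothetical.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels using the cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] (rows: actual label, columns: predicted label)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def precision(matrix):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    tp, fp = matrix[1][1], matrix[0][1]
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels and probabilities, invented for this example.
y_true = [0, 0, 1, 1, 1, 0]
probs = [0.2, 0.6, 0.55, 0.9, 0.4, 0.1]

default_cm = confusion_matrix(y_true, apply_threshold(probs))        # cutoff 0.5
strict_cm = confusion_matrix(y_true, apply_threshold(probs, 0.8))    # cutoff 0.8
```

On this toy data, raising the cutoff from 0.5 to 0.8 removes the one false positive, so precision rises from 2/3 to 1.0, at the cost of one more false negative.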
A. Variant: n-to-1
- predicts the next 1 element of the specified column.
B. Variant: n-to-m
- predicts the last m elements of column y.
- it can use the earlier elements of column y, from t=0 up to t=t-m, as input (this is not the case for Variant n-to-1)
- it can use the other x columns as input until t=t as usual.
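One possible reading of the two variants as sliding windows is sketched below. This is an interpretation, not a confirmed design. For n-to-1 the target is the y value right after an n-row window. For n-to-m the window has n rows, the x columns are visible for all n rows, y is visible for the first n-m rows, and the last m y values are the targets. The function names are hypothetical.

```python
def windows_n_to_1(x_rows, y, n):
    """Yield (x_window, target): n rows of features and the next y value."""
    for start in range(len(y) - n):
        end = start + n
        yield x_rows[start:end], y[end]


def windows_n_to_m(x_rows, y, n, m):
    """Yield (x_window, known_y, targets) for the n-to-m variant."""
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")
    for start in range(len(y) - n + 1):
        end = start + n
        # x is visible up to the end of the window; y only up to end - m.
        yield x_rows[start:end], y[start:end - m], y[end - m:end]


# Toy data: 6 time steps, 2 feature columns, binary y (all values invented).
x_rows = [[i, i * 10] for i in range(6)]
y = [0, 1, 0, 1, 1, 0]

one_step = list(windows_n_to_1(x_rows, y, n=3))
multi_step = list(windows_n_to_m(x_rows, y, n=4, m=2))
```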
Notes:
- the model can be trained on either the hourly folder or the daily folder. The datetime column should be flexible enough to accommodate both.
- Please document the code very well and follow the PEP8 standard. It's ok to create many files/classes etc.
Please push regularly to the repository.
We need easy functions to create/train/predict on new data (from other sources), which we can call from Colab, e.g.:
my_model = Model_n_to_1(n=14, layers=3, ...)
results = my_model.train(epochs=6, device=gpu_1, data='datafolder', production=False, save='filename')
my_model.load('filename')
predictions = my_model.predict(test_data=my_dataframe)
or a slightly better syntax if you can suggest it.
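One possible "slightly better syntax", offered only as a suggestion: collect the hyperparameters in a config object and make `load` a classmethod that returns a ready model. `ModelConfig`, `SequenceModel` and all field names are hypothetical. The sketch covers only construction and the save/load round trip. The network itself, `train` and `predict` are left out because they depend on the chosen framework, and a plain dict stands in for the weights.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ModelConfig:
    """All hyperparameters in one place (field names are illustrative)."""
    variant: str = "n_to_1"        # or "n_to_m"
    n: int = 14                    # input sequence length
    layers: int = 3
    y_column: str = "Sell_p40_a4"
    threshold: float = 0.5         # prediction cutoff


class SequenceModel:
    """Placeholder wrapper; the real network, train() and predict() are omitted."""

    def __init__(self, config):
        self.config = config
        self.weights = {}  # stand-in for real model weights

    def save(self, path):
        """Store the configuration together with the weights."""
        with open(path, "w") as handle:
            json.dump({"config": asdict(self.config), "weights": self.weights}, handle)

    @classmethod
    def load(cls, path):
        """Rebuild a model, including its configuration, from a saved file."""
        with open(path) as handle:
            payload = json.load(handle)
        model = cls(ModelConfig(**payload["config"]))
        model.weights = payload["weights"]
        return model


# Round trip: build, save, and restore a model with its configuration.
path = os.path.join(tempfile.mkdtemp(), "model.json")
SequenceModel(ModelConfig(n=30, layers=2)).save(path)
restored = SequenceModel.load(path)
```

Saving the configuration next to the weights means a Colab user would not need to remember which n, layers or threshold a saved model was created with.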