LeonardoBerti00/Axial-LOB-High-Frequency-Trading-with-Axial-Attention

Can't replicate Paper Performance

Opened this issue · 15 comments

Unfortunately I can't replicate the performance stated in the original paper, probably because of the hyperparameters; I don't have enough computational power to run a hyperparameter search.
The max F1-score that I've reached is 79% for k=10 and 78% for k=5.

Hahaha, that's true. BTW, what is the variable T used for in your code? I only found it used to determine the length of the dataset and to print hyperparameter information, but it is not involved in organizing the data or in the training process at all. :)

The FI-2010 dataset is constructed by taking a snapshot of the LOB every 10 events, so if the real horizon is 50, we need a variable equal to horizon/10 to compute the length of the datasets (train, val and test); this variable is T. The variable h is used only to select the right label column.
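
A small illustration of how T and h relate to the prediction horizon; the value of h for horizon 50 is inferred from the label mapping given later in this thread, so treat it as an assumption about the repo's code:

horizon = 50        # real prediction horizon in LOB events
T = horizon // 10   # = 5, used only to compute the dataset length
h = -2              # label column for horizon 50 (see the mapping below)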

The paper seems a bit ambiguous about the network architecture: fig. 1 shows only one axial attention block. Is your interpretation that the whole block (in the grey box) is repeated again afterwards?

Did you get the best performance with the hyper-parameters in the notebook? With those the model has 20,219 trainable params, compared to the 9,615 quoted in the paper - suggesting they used smaller channel dims, or maybe only one block as in fig. 1.

Yes, I think so, because they write "The main building component of the proposed model, shown in Fig. 1, is the gated axial attention block, which consists of two layers, each containing two multi-head axial attention modules with gated positional encodings", but I'm not sure.
As far as performance is concerned, I haven't been able to do a complete hyperparameter search.

Why is self.length = x.shape[0] - T - self.dim + 1 in the Dataset class?
Thank you

T is horizon/10; I explained the meaning of this variable in more detail in the previous comment. self.dim is the number of LOB snapshots in the input for every element of the dataset. To compute the total length we subtract T because the last T elements have no labels, and we subtract self.dim because we cannot make a prediction for the first self.dim (40) elements. The +1 is for indexing reasons. Let me know if it is clear now.
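
For illustration, a minimal sketch of a Dataset with this indexing logic; the class name and the exact label alignment in __getitem__ are hypothetical and may differ from the repo:

import torch
from torch.utils.data import Dataset

class LOBDataset(Dataset):
    def __init__(self, x, y, dim=40, T=5):
        self.x = x       # LOB snapshots, shape (n_snapshots, n_features)
        self.y = y       # labels aligned with the snapshots
        self.dim = dim   # snapshots per input window (40)
        self.T = T       # horizon / 10
        # the last T snapshots have no label and the first dim snapshots
        # cannot complete a window, hence the formula discussed above
        self.length = x.shape[0] - T - self.dim + 1

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        window = self.x[i : i + self.dim]   # dim consecutive snapshots
        label = self.y[i + self.dim - 1]    # label of the last snapshot in the window
        return torch.as_tensor(window), torch.as_tensor(label)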


very clear. Thank you


Not sure my understanding is correct.
Does T mean the future window size to predict (the prediction horizon in the paper)?
If T=5, does it mean using the current self.dim (40) snapshots of the current window to predict the data of the next 5 windows?

yes

Thanks for the reply!
In addition, where can we find a description of the FI-2010 data? The download link https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data does not include a description. We are also wondering how to select the right label from the data for different time windows (e.g. the code uses -2 to select the y label).

You can find it in the paper where the dataset was proposed, "Benchmark Dataset for Mid-Price Forecasting of LOB Data with ML". Anyway, -1 is horizon 100, -2 is 50, -3 is 30, -4 is 20 and -5 is 10.
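
As a hedged sketch of that mapping, assuming the labels sit in the last five rows of the raw FI-2010 matrix (the exact layout in the repo may differ):

import numpy as np

# Assumed mapping: prediction horizon -> row index of its label column in the raw matrix
HORIZON_TO_INDEX = {10: -5, 20: -4, 30: -3, 50: -2, 100: -1}

def select_labels(data: np.ndarray, horizon: int) -> np.ndarray:
    # data is assumed to have shape (n_features + 5 label rows, n_samples)
    return data[HORIZON_TO_INDEX[horizon], :]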

I ran into the same problem and could not replicate the performance reported in the paper.
Running the code, we got the best epoch at epoch 23:
Train Loss: 0.7474, Validation Loss: 0.8472, Duration: 3:11:56.469105, Best Val Epoch: 23
And the test dataset evaluation:
Test acc: 0.7913
              precision    recall  f1-score   support

           0     0.7600    0.7429    0.7514     38447
           1     0.8379    0.8466    0.8422     65996
           2     0.7366    0.7404    0.7385     35100

    accuracy                         0.7913    139543
   macro avg     0.7782    0.7766    0.7774    139543
weighted avg     0.7910    0.7913    0.7911    139543

What could be the reason for the performance difference from the paper?
In the paper, the F1 score for k=50 is 83.27.
What tuning would you suggest as a next step, e.g. a different optimizer or hyperparameter tuning?

I think these hyperparameters:
c_final = 4        # channel output size of the second conv
n_heads = 4
c_in_axial = 32    # channel output size of the first conv
c_out_axial = 32
pool_kernel = (1, 4)
pool_stride = (1, 4)
have to be tuned to reach the same performance as the paper (see the sketch below). Unfortunately, the model is slow to train.
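
For reference, a minimal random-search sketch over these hyperparameters; the search-space values and the train_and_evaluate helper are hypothetical placeholders for the repo's training loop:

import itertools
import random

# Hypothetical search space centred on the values listed above
search_space = {
    "c_final": [4, 8],
    "n_heads": [2, 4],
    "c_in_axial": [16, 32],
    "c_out_axial": [16, 32],
    "pool_kernel": [(1, 2), (1, 4)],
}

configs = [dict(zip(search_space, values))
           for values in itertools.product(*search_space.values())]
random.shuffle(configs)

best_f1, best_cfg = 0.0, None
for cfg in configs[:10]:                    # only try a few configs, since training is slow
    cfg["pool_stride"] = cfg["pool_kernel"] # keep stride equal to the pooling kernel
    f1 = train_and_evaluate(**cfg)          # hypothetical helper: trains AxialLOB, returns val F1
    if f1 > best_f1:
        best_f1, best_cfg = f1, cfg
print(best_cfg, best_f1)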

You might try torch.compile() in torch v2.0.

model = AxialLOB(W, dim, c_in_axial, c_out_axial, c_final, n_heads, pool_kernel, pool_stride)
model = torch.compile(model)  # compile the model graph (requires PyTorch >= 2.0)
model.to(device)

With this change, the time for one training step was reduced from 12 min to 9 min on Google Colab.

thank you