Does this model generalize well on your (other) dataset?
Thank you very much for the implementation. I wonder whether anyone has applied this method to other datasets, and how it performed?
When I apply this method to my datasets (a traffic dataset and a disease dataset), I see two problems: 1. The loss is very large, i.e., the model fails to learn the pattern and does much worse than a vanilla LSTM, which is weird. 2. In some cases the validation loss drops quickly, then explodes within 1-2 epochs.
I have tried min-max scaling and gradient norm clipping to address this, but neither helps. Since the encoder attends over the whole sequence, T cannot be too large, which limits the information fed to the model. Still, I think this model is hard to tune on other datasets. Has anyone had a similar experience?
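For anyone comparing notes, here is a minimal pure-Python sketch of the preprocessing and clipping steps mentioned above. It is not taken from this repository: the function names (`fit_minmax`, `apply_minmax`, `clip_by_global_norm`, `make_windows`) are hypothetical, and it assumes a single univariate series. The one detail worth checking is that the min-max statistics come from the training split only, so validation values may fall outside [0, 1].

```python
import math

def fit_minmax(train):
    """Return (lo, hi) computed from the training split only."""
    return min(train), max(train)

def apply_minmax(values, lo, hi):
    """Scale values with training-split statistics.

    Values outside the training range map outside [0, 1];
    a constant training series maps everything to 0.0.
    """
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

def make_windows(series, T):
    """Slice a series into (input window of length T, next value) pairs."""
    return [(series[i:i + T], series[i + T]) for i in range(len(series) - T)]
```

Logging the pre-clipping norm returned by `clip_by_global_norm` each step may help tell whether the sudden jump in validation loss coincides with a gradient spike.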
Did you have success with your problem?