Model performance is greatly affected by a long-tailed dataset

Question

earthpimp opened this issue 4 months ago · 1 comments

Sorry to bother you.
I trained PreDiff on my own dataset and found the results quite bad. I suspect the cause is data imbalance, which is common in precipitation nowcasting.
I am currently considering resampling, but I worry that it might hurt generalizability.
I noticed that your previous paper on TrajGRU applies pixelwise loss weighting to the radar sequences. How could I implement a similar approach in PreDiff?
I would appreciate any suggestions you could offer.

Answer 1 · 2024-05-23T07:06:47.000Z

Thank you for your interest in our work and for your question. Yes, we encountered a similar challenge when training PreDiff on highly imbalanced data such as HKO-7. Although it is not feasible to directly apply a TrajGRU-style weighted loss that intuitively rebalances training toward the long tail of the distribution, we found that adjusting the data sampling directly helps alleviate the problem. Specifically, we increase the sampling rate of rare data and decrease the sampling rate of common data.
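
For illustration, the resampling idea could be sketched roughly as below. This is not the actual PreDiff training code: the per-sequence intensity score, the bin edges, and the helper names `inverse_frequency_weights` and `sample_indices` are all assumptions made for the example. Each sequence is weighted by the inverse of how many sequences share its intensity bin, so rare heavy-rain sequences are drawn more often and common dry ones less often.

```python
import random
from collections import Counter


def inverse_frequency_weights(scores, bin_edges):
    """Weight each sample by 1 / (number of samples in its intensity bin).

    `scores` is a hypothetical per-sequence summary (e.g. mean rain rate);
    `bin_edges` are assumed thresholds separating intensity classes.
    """
    def bin_of(score):
        # Index of the first edge the score falls below; last bin otherwise.
        for i, edge in enumerate(bin_edges):
            if score < edge:
                return i
        return len(bin_edges)

    bins = [bin_of(s) for s in scores]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


def sample_indices(weights, num_samples, seed=0):
    """Draw dataset indices with replacement, proportionally to `weights`."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Toy data: 8 mostly-dry sequences and 2 heavy-rain sequences.
scores = [0.1] * 8 + [12.0] * 2
weights = inverse_frequency_weights(scores, bin_edges=[0.5, 5.0])
indices = sample_indices(weights, num_samples=1000)

# Fraction of draws that hit the two heavy-rain sequences (indices 8 and 9).
heavy_fraction = sum(i >= 8 for i in indices) / len(indices)
print(f"heavy-rain fraction after resampling: {heavy_fraction:.2f}")
```

With a PyTorch `DataLoader`, the same weights could be passed to `torch.utils.data.WeightedRandomSampler`. If fully balanced sampling hurts generalization, one option is to soften the weights (for example by raising them to a power below 1) so that the training distribution moves only part of the way toward uniform.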