openclimatefix/skillful_nowcasting

some confusions

sjjlr opened this issue · 9 comments

sjjlr commented

1. Should the original radar data be normalized to [0,1]? The paper says the radar data was transformed to rain rate. Is the rain rate then normalized to [0,1] or [-1,1]?
2. The output channels of the conditioning stack in the paper are not consistent with the code. For example, for the last layer it is 384 vs. 768, and the code does not work when the channel count is 384.
3. The predicted radar frames are very similar to the last frame of the input data, with no cloud movement. I don't know why. It seems that the ConvGRU does not work.

  1. The rain rate is normalized to [0,1], I believe.
  2. Yes, that's a known bug. The main branch doesn't match the paper; it's what is being changed in #5, which follows the Nature paper more closely.
  3. Yeah, I talked with the authors a while ago, and they said it is very hard to train from scratch. I haven't gotten this implementation working well yet. The hope is that with #5, we can use the pre-trained weights from DeepMind so that it works the same as theirs. I believe the ConvGRU implementation works, as I was using it in other working models in SatFlow before I split it out here. But feel free to improve it!
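For point 1, a minimal sketch of what scaling rain rate to [0,1] could look like, plus the shift to [-1,1] asked about in the question. The function names and the `max_rate` cap are hypothetical choices for illustration; neither the cap value nor this exact scheme is confirmed in this thread.

```python
def normalize_rain_rate(values, max_rate=128.0):
    """Clip rain rates to [0, max_rate] and scale them to [0, 1].

    max_rate is an assumed cap chosen for illustration only.
    """
    return [min(max(v, 0.0), max_rate) / max_rate for v in values]


def to_symmetric_range(unit_values):
    """Map values already in [0, 1] to [-1, 1]."""
    return [2.0 * v - 1.0 for v in unit_values]


rates = [-1.0, 0.0, 64.0, 256.0]          # negative and very large values get clipped
unit = normalize_rain_rate(rates)          # values in [0, 1]
symmetric = to_symmetric_range(unit)       # values in [-1, 1]
```

Whichever range is used, the same transform has to be applied to the inputs and the targets, and inverted on the predictions before comparing them with rain rates.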

I just swapped out the ConvGRU implementation for the one from our implementation of MetNet and fixed a few errors, so as long as you run the #5 branch, it should at least run, although the results are probably still not great.

sjjlr commented

Thanks a lot. Did the original authors say why it's hard to train?

Not that I remember. It was just a bit difficult to get going from scratch; I think it is just difficult to get the training to be stable.

sjjlr commented

Is it just because GANs are hard to train? Stable training is one thing, but I don't understand why there is no cloud movement when using the ConvGRU. There must be something wrong. In my opinion, even if the training is unstable, the clouds should move, even if the movement is in the wrong direction.
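One way to put a number on "no cloud movement" is to compare how far the prediction moves away from the last input frame with how far the ground truth moves. This is only a sketch: the function names are hypothetical, and frames are assumed to be flattened into plain lists of floats.

```python
def mean_squared_diff(a, b):
    """Mean squared difference between two equally sized, flattened frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def change_ratio(last_input, prediction, target):
    """Change in the prediction relative to the change in the ground truth.

    Both changes are measured against the last input frame. A value near 0
    means the prediction is close to a copy of the last input frame.
    """
    true_change = mean_squared_diff(target, last_input)
    if true_change == 0.0:
        raise ValueError("target equals the last input frame; ratio undefined")
    return mean_squared_diff(prediction, last_input) / true_change


last_frame = [0.0, 0.0, 1.0, 1.0]
target = [0.0, 1.0, 1.0, 0.0]                             # the rain has shifted
copied = change_ratio(last_frame, last_frame, target)     # prediction copies the input
```

A ratio that stays near 0 across many samples would support the observation that the model is only reproducing the last frame. A ratio near 1 does not by itself mean the motion is correct.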

One challenge we have faced is having large enough batch sizes and training long enough to get the model to learn anything. I don't quite remember what the authors' reasoning was, but in our own attempts at getting it working, we've been a bit constrained on getting large enough batches and parameters to train it. I'm hoping by using the weights from the official implementation, we can get it working well enough that fine-tuning it might work.

sjjlr commented

I notice that there are some differences between the forward function of ConvGRUCell and the original ConvGRU paper. Can you check it?

Meanwhile, if the model is trained using 256*256 cropped images, how does it work on the whole 1500*1200 images? By cropping and mosaicking? Or can the whole 1500*1200 image be input into the model directly?
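For reference, the crop-and-mosaic option could be sketched as below. This only illustrates the tiling arithmetic; it is not confirmed as what the authors do. The function names, the stride, and the choice of 1500 as the height are assumptions.

```python
def tile_offsets(length, tile, stride):
    """Start offsets of tiles covering [0, length).

    The last tile is placed flush with the edge, so it may overlap its
    neighbour. Assumes length >= tile.
    """
    if length <= tile:
        return [0]
    offsets = list(range(0, length - tile + 1, stride))
    if offsets[-1] != length - tile:
        offsets.append(length - tile)
    return offsets


def tile_boxes(height, width, tile=256, stride=256):
    """(top, left, bottom, right) boxes of tile x tile crops covering the image."""
    return [
        (top, left, top + tile, left + tile)
        for top in tile_offsets(height, tile, stride)
        for left in tile_offsets(width, tile, stride)
    ]


boxes = tile_boxes(1500, 1200)  # 6 rows x 5 columns of 256*256 crops
```

Each crop would be run through the model separately and the outputs written back at the same boxes, with overlapping regions averaged or taken from one tile. Predicting tiles independently can leave visible seams at tile borders.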

They did just add the pseudocode for almost the whole model, see the comments in #5, so I now have their implementation of the ConvGRU, etc. So soon this repo will be a nearly exact PyTorch port of their model, and will hopefully be able to load their weights quite easily.

Now that #5 is merged, all the layers in the model are set up as close as possible to the pseudocode, so it should work the same. There is an issue with #10, but otherwise it should work the same.