FastFlowNet - Hard and un-reproducible convergence
magsail opened this issue · 1 comments
Hi Henrique,
Thank you so much for sharing this collection of optical flow models. It helped me get up to speed with these models quickly.
I've been training FastFlowNet on the FlyingChairs dataset with your default configuration. I found that convergence is difficult and usually not reproducible. Sometimes training converges after 16 epochs (45k steps). Sometimes it converges after 47 epochs (130k steps). Sometimes it just does not converge at all.
I'm attaching the loss curves for the runs that converged at 16 epochs and 47 epochs as examples.
convergence starting with 16 epochs
convergence starting with 47 epochs
Did you see this phenomenon when you were training the model?
Besides that, I compared your loss calculation with the original FastFlowNet and PWC-Net papers. In both papers, the loss for each pyramid level is multiplied by a weight from the sequence
self._weights = [0.005, 0.01, 0.02, 0.08, 0.32]
where 0.005 multiplies the loss of the highest-resolution pyramid level (1/4 of the original image resolution) and 0.32 multiplies the loss of the lowest-resolution level (1/64 of the original resolution).
In your implementation, you reverse the weight sequence and replace the values with a proportional sequence, i.e.
self._weights = [0.32, 0.16, 0.08, 0.04, 0.02]
So in your implementation, 0.32 applies to the highest-resolution pyramid level.
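For reference, the per-level weighting described above can be sketched as a generic multi-scale endpoint-error loss. This is a minimal illustration, not the repository's actual loss code: the names `multiscale_loss`, `flow_preds`, and `flow_gt` are made up here, and the downsampling/rescaling of the ground truth is one common convention, assumed rather than taken from the implementation.

```python
import torch
import torch.nn.functional as F

# Weights from the FastFlowNet / PWC-Net papers, listed from the
# highest-resolution pyramid level (1/4 scale) to the lowest (1/64 scale).
PAPER_WEIGHTS = [0.005, 0.01, 0.02, 0.08, 0.32]

def multiscale_loss(flow_preds, flow_gt, weights=PAPER_WEIGHTS):
    """Illustrative multi-scale flow loss.

    flow_preds: list of (B, 2, h_i, w_i) tensors, highest resolution first.
    flow_gt: (B, 2, H, W) ground-truth flow at full resolution.
    """
    total = 0.0
    for pred, w in zip(flow_preds, weights):
        # Downsample the ground truth to the prediction's resolution and
        # rescale the flow vectors by the same spatial factor.
        scale = pred.shape[-1] / flow_gt.shape[-1]
        gt = F.interpolate(flow_gt, size=pred.shape[-2:],
                           mode="bilinear", align_corners=False) * scale
        # Per-pixel L2 endpoint error, averaged over batch and pixels.
        epe = torch.norm(pred - gt, p=2, dim=1).mean()
        total = total + w * epe
    return total
```

With this convention, reversing the weight list (as in the implementation's `[0.32, 0.16, 0.08, 0.04, 0.02]`) shifts most of the loss emphasis from the coarsest level to the finest one.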
Do you have any reason for making this change? Is it because the original weight sequence is even harder to converge with?
I'd really appreciate it if you could advise.
Best Regards!
David
Hi David, sorry to hear about your troubles.
Unfortunately, as I said in the training docs, I don't have the resources to train and verify these models myself, so I cannot guarantee they will train as intended. The default training routine is based on RAFT's, so I don't know how other models will behave with it. Based on your feedback, I think I should also make the train script print warnings similar to the docs, to inform more people about this restriction.
As for the weights, I think FastFlowNet didn't provide a training script, so I just borrowed the loss from FlowNet and didn't realize they were different. Since you mentioned this issue, I will take a look and fix it accordingly, but I don't know whether that will solve your problem.
Best