Hypernetwork training runs substantially slower
Leon-Schoenbrunn opened this issue · 3 comments
This is not really an issue, but there's not a discussion forum so I'm leaving this here.
Since yesterday I've been experimenting with this extension and the pull request that added gradient accumulation. The resulting hypernetworks were great, but I noticed a significant slowdown per step. My Train-Gamma setup is exactly the same as the one in the screenshot in the readme, and I can't really set my batch size to anything greater than 1 for now (8GB 3070).
Now here's my question: Is the slowdown due to the new latent sampling method, do steps actually work differently now with gradient accumulation, or did something else entirely change? (I think my it/s actually went up, but each step takes forever now.)
Some quick manual testing with a [1, 1.5, 1.5, 1] softsign network:
Gradient accumulation steps: 1, latent sampling: once, shuffle tags: Off - Time for 10 steps: ~4 seconds | baseline
Gradient accumulation steps: 1, latent sampling: deterministic, shuffle tags: Off - Time for 10 steps: ~4 seconds | no slowdown
Gradient accumulation steps: 1, latent sampling: once, shuffle tags: On - Time for 10 steps: ~5 seconds | slight slowdown??
Gradient accumulation steps: 4, latent sampling: once, shuffle tags: Off - Time for 10 steps: ~15 seconds | 3x as long / step
Gradient accumulation steps: 4, latent sampling: once, shuffle tags: On - Time for 10 steps: ~20 seconds | 4x as long / step
Gradient accumulation steps: 4, latent sampling: deterministic, shuffle tags: On - Time for 10 steps: ~20 seconds | 4x as long / step
Is the slowdown from gradient accumulation justified, and are we learning more per step?
Yes, batch size is not intended to be bigger than 1, because it's using the image size itself ('variable size' or 'no crop'). I can solve it, but I doubt the effect of batch size...
And training is slower because it's actually doing more steps than shown. If you set 'gradient accumulation step' to 8, it should be about 8x slower: it's doing 8 micro-steps per displayed step, so this is intended.
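To illustrate why the step counter slows down, here is a minimal sketch of gradient accumulation in plain Python. It is a toy model (fitting y = w * x with SGD), not the extension's actual training loop; the function `train` and its parameters are hypothetical names chosen for this example. The point it shows is that each displayed step contains `accumulation_steps` forward/backward passes, so the work per step grows roughly in proportion.

```python
def train(samples, accumulation_steps, optimizer_steps, lr=0.01):
    """Toy SGD on y = w * x, accumulating gradients over several micro-steps.

    Hypothetical illustration only; this is not the extension's code.
    """
    w = 0.0
    micro_steps = 0  # counts every forward/backward pass actually performed
    for _ in range(optimizer_steps):
        grad = 0.0
        for _ in range(accumulation_steps):
            x, y = samples[micro_steps % len(samples)]
            # Gradient of 0.5 * (w*x - y)**2 with respect to w, divided by the
            # number of micro-steps so the accumulated value is an average.
            grad += (w * x - y) * x / accumulation_steps
            micro_steps += 1
        # One weight update per displayed step, using the accumulated gradient.
        w -= lr * grad
    return w, micro_steps


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w_base, baseline_passes = train(samples, accumulation_steps=1, optimizer_steps=10)
w_accum, accumulated_passes = train(samples, accumulation_steps=4, optimizer_steps=10)

print(baseline_passes, accumulated_passes)  # prints: 10 40
```

With 10 displayed steps in both runs, the accumulated run performs four times as many passes, which is in line with the roughly 3x to 4x longer step times measured above. Each update then averages gradients from several samples, similar to what a larger batch would provide without the extra memory.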
Thanks for clearing this up. I wasn't quite sure whether we were running the steps as usual and applying the gradient accumulation afterwards, or running the accumulation within each step.