pytorch/torchtitan

Hard release criteria: Run and get convergence data on long running tests

gnadathur opened this issue · 2 comments

Hard release criteria: Run and get convergence data on long running tests
  • Run on 64 A100 GPUs
  • Later, run on 64 H100 GPUs

What are the hyperparameters for the convergence run?

  • Adjusted batch size to 1.
  • What should the learning rate be? @wanchaol, @lessw2020, maybe we could duplicate the earlier convergence tests from FSDP1.