EleutherAI/gpt-neox

Argument List too long error

kavlekar101 opened this issue · 2 comments

Describe the bug
I ran into a problem after modifying the config file, specifically 760.yml.
I cannot set checkpoint_factor to 10 while train_iters is set to 320000. With both values set, the launch fails with OSError: [Errno 7] Argument list too long: 'pdsh'

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'gpt-neox/configs/760.yml'
  2. Add a "save" key to the .yml file and set it equal to "checkpoint"
  3. Change the "checkpoint_factor" key to 10
  4. Change the "train_iters" key to 320000 if it is not that already
  5. Run the job.sh file
  6. The launch fails with OSError: [Errno 7] Argument list too long: 'pdsh'

Expected behavior
I expected training to run normally, with checkpoints being written along the way.

Proposed solution
Not sure how to solve this problem

Environment (please complete the following information):

  • GPUs: 6
  • Configs:

Additional context
Stack trace

  File "/fs/scratch/../gpt-neox/deepy.py", line 41, in <module>
    main()
  File "/fs/scratch/../gpt-neox/deepy.py", line 37, in main
    deepspeed.launcher.runner.main(deepspeed_main_args)
  File "/fs/scratch/../gpt-neox/.venv/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 555, in main
    result = subprocess.Popen(cmd, env=env)
  File "/usr/local/python/3.9-2022.05/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/python/3.9-2022.05/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
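
For anyone trying to reproduce the failure mode outside of gpt-neox: the trace ends inside subprocess.Popen, which is where the operating system rejects an oversized argument list. The sketch below is only an illustration of that mechanism, not the DeepSpeed launcher code. The helper name try_launch and the argument sizes are made up for the example, and the exact size at which the launch fails depends on the platform.

```python
import subprocess
import sys


def try_launch(arg_bytes):
    """Start a trivial child process with one argument of the given size.

    Hypothetical helper for illustration. Returns None if the process
    started, otherwise the errno of the OSError raised by subprocess.
    """
    big_arg = "x" * arg_bytes
    try:
        # The child does nothing; only the length of the argument matters.
        subprocess.run([sys.executable, "-c", "pass", big_arg], check=False)
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    print("small argument:", try_launch(16))
    # On Linux an argument this large is expected to be rejected with
    # errno 7 (E2BIG), the same errno reported in this issue.
    print("2 MiB argument:", try_launch(2 * 1024 * 1024))
```

The error is raised before the child program runs at all, which is why the trace never reaches pdsh itself.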

I've looked into this, and it is related to how the configs are passed to DeepSpeed.

Instead of being passed checkpoint_factor, DeepSpeed is passed a list of the iterations at which to save. So with train_iters set to 320000 and checkpoint_factor=10, the config expands to a list of 32,000 integers. The resulting command line is then too long for Linux to accept when it launches pdsh, and you get the OSError you see.
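
To put rough numbers on that, here is a small sketch of the arithmetic. It is not the gpt-neox code: checkpoint_steps and serialized_size are hypothetical helpers, the list is assumed to start at checkpoint_factor and to be serialized as plain JSON, and MAX_ARG_STRLEN is the per-argument limit on Linux assuming 4 KiB pages (32 pages, 131072 bytes). The real launcher may encode the config differently, which would change the size but not the conclusion.

```python
import json


def checkpoint_steps(train_iters, checkpoint_factor):
    """Hypothetical helper: every checkpoint_factor-th iteration up to train_iters."""
    return list(range(checkpoint_factor, train_iters + 1, checkpoint_factor))


def serialized_size(steps):
    """Size in bytes of the list when written out as JSON text."""
    return len(json.dumps(steps).encode("utf-8"))


# Assumed Linux limit for a single argument string: 32 pages of 4 KiB each.
MAX_ARG_STRLEN = 32 * 4096

if __name__ == "__main__":
    for factor in (10, 1000):
        steps = checkpoint_steps(320000, factor)
        size = serialized_size(steps)
        print(
            f"checkpoint_factor={factor}: {len(steps)} steps, "
            f"{size} bytes, over limit: {size > MAX_ARG_STRLEN}"
        )
```

Under these assumptions, a checkpoint_factor of 10 produces a list well past the limit, while a factor of 1000 yields only 320 entries and stays far below it.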

Do you intend to save a checkpoint every 10 iterations for 320,000 iterations?

Closing due to inactivity