huggingface/autotrain-advanced

[BUG] I am using 2 nodes of 4 GPUs each (8 GPUs total), but num_machines is always set to 1.

jackswl opened this issue · 4 comments

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config /home/xxx

UI Screenshots & Parameters

No response

Error Logs

I am trying to run on 2 nodes with 4 GPUs each, requested via this PBS directive:

#PBS -l select=2:ncpus=128:ngpus=4:mem=880GB

However, the accelerate launch command always shows num_machines=1 when I run the CLI command:

autotrain --config /home/xxx
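
For reference, here is a minimal sketch of what a multi-node launch would need to know. It is not autotrain behavior. It assumes the job runs under PBS with the `PBS_NODEFILE` environment variable pointing at the list of allocated hosts, and it uses a hypothetical `train.py` script and an arbitrary port (29500). It counts the distinct nodes and prints the `accelerate launch` arguments that would describe them, which helps confirm whether PBS really allocated 2 nodes.

```python
import os
from pathlib import Path


def unique_hosts(nodefile_text: str) -> list[str]:
    """Return hostnames in first-seen order, dropping duplicates and blank lines."""
    seen: dict[str, None] = {}
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host:
            seen.setdefault(host, None)
    return list(seen)


def accelerate_args(hosts, gpus_per_node, machine_rank, port=29500, script="train.py"):
    """Build an `accelerate launch` argument list for one node of a multi-node run.

    `script` and `port` are placeholders (assumptions), not values from the issue.
    """
    return [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines", str(len(hosts)),
        # num_processes is the total across all machines, not per node
        "--num_processes", str(len(hosts) * gpus_per_node),
        "--machine_rank", str(machine_rank),
        "--main_process_ip", hosts[0],  # first allocated host acts as the main node
        "--main_process_port", str(port),
        script,
    ]


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        hosts = unique_hosts(Path(nodefile).read_text())
        print(f"PBS allocated {len(hosts)} node(s): {hosts}")
        print(" ".join(accelerate_args(hosts, gpus_per_node=4, machine_rank=0)))
    else:
        print("PBS_NODEFILE is not set; probably not inside a PBS job.")
```

Each node would have to run the command with its own `--machine_rank`, so this only illustrates the parameters involved and is not a drop-in workaround for the autotrain CLI.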

@abhishekkrthakur any idea how to approach this? Am I right that autotrain does not support multi-node training? Is there a workaround?
Thanks!

Additional Information

No response

autotrain doesn't support multi-node yet.

Thanks for the reply.

By the way, do you happen to have a rough timeline for when autotrain will support multi-node usage?
Just asking so I know roughly when to check back on this topic once multi-node is out!
Thanks a lot.

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 20 days since being marked as stale.