huggingface/autotrain-advanced

[BUG] I am using 2 nodes with 4 GPUs each (8 GPUs total), but num_machines is always set to 1.

jackswl opened this issue 4 months ago · 4 comments

jackswl commented 4 months ago

Prerequisites

I have read the documentation.
I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config /home/xxx

UI Screenshots & Parameters

No response

Error Logs

I am trying to run on 2 nodes with 4 GPUs each, requested via this PBS directive:

#PBS -l select=2:ncpus=128:ngpus=4:mem=880GB

However, the accelerate launch command always shows num_machines=1 when I run the CLI command:

autotrain --config /home/xxx

@abhishekkrthakur any idea how to resolve this? Am I right that autotrain does not support multi-node training? If so, is there a workaround?
Thanks!

Additional Information

No response

abhishekkrthakur commented 4 months ago

autotrain doesn't support multi-node yet.
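
For anyone hitting the same wall: this is not an autotrain feature, only a rough sketch of how a 2-node PBS allocation could be mapped onto accelerate's own multi-node launch flags (--num_machines, --machine_rank, --main_process_ip, --main_process_port, --num_processes) when launching a training script directly. The helper names, the train.py script name, the port 29500, and the assumption that the node file lists one hostname per line are all hypothetical and not taken from autotrain.

```python
import os
import shlex


def read_pbs_hosts(nodefile_path):
    """Return the distinct hosts in a PBS node file, in first-seen order."""
    hosts = []
    with open(nodefile_path) as f:
        for line in f:
            host = line.strip()
            # PBS may repeat a host once per slot, so keep only the first mention.
            if host and host not in hosts:
                hosts.append(host)
    return hosts


def build_launch_command(hosts, machine_rank, gpus_per_node=4, port=29500,
                         script="train.py"):
    """Build the accelerate launch argument list for one node of the job.

    script is a placeholder for your own training entry point (hypothetical).
    """
    return [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines", str(len(hosts)),
        # accelerate expects the total process count across all machines.
        "--num_processes", str(len(hosts) * gpus_per_node),
        "--machine_rank", str(machine_rank),
        # The first host in the list acts as the rendezvous (main) node.
        "--main_process_ip", hosts[0],
        "--main_process_port", str(port),
        script,
    ]


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        hosts = read_pbs_hosts(nodefile)
        # Print the command each node would need to run (one rank per node).
        for rank, host in enumerate(hosts):
            print(host, shlex.join(build_launch_command(hosts, rank)))
```

Each node has to run its own command with its own machine rank, for example through the scheduler's remote-execution tool. If the node file lists only one host, the PBS request itself did not yield two nodes, which is worth ruling out first.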

jackswl commented 4 months ago

Thanks for the reply.

By the way, do you have a rough timeline for when autotrain will support multi-node training?
I'm asking so I know roughly when to check back on this.
Thanks a lot.

github-actions commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions commented 3 months ago

This issue was closed because it has been inactive for 20 days since being marked as stale.