ViTAE-Transformer/MTP

How to pretrain on a single machine (without using SLURM)

Opened this issue · 1 comment

Thank you for this amazing project.

I tried to run pretraining on a single machine, with an NVIDIA A100 GPU or just with a CPU, but I could not get it to work.

It seems the script file main_pretrain.py needs to be modified somehow.

Could you offer detailed help on this matter?

Thanks in advance.

@geonoon In fact, we have considered two cases for distributed pretraining: SLURM and a regular server, but I'm not sure whether MTP's main_pretrain.py can run on a regular server as-is. Maybe you can refer to this to revise the code related to distributed pretraining.

Here is a command example:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=1 --master_port=10001 --master_addr=[server ip] main_pretrain.py
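
If the SLURM-specific initialization in main_pretrain.py needs to be replaced, a minimal sketch of the environment-based setup that torch.distributed.launch / torchrun expect could look like the following. This is an assumption, not the actual MTP code, and the function name init_distributed_from_env is illustrative:

# Minimal sketch (assumption, not the MTP code) of env-based distributed init,
# matching the variables that torch.distributed.launch / torchrun export.
import os

import torch
import torch.distributed as dist


def init_distributed_from_env():
    # RANK and WORLD_SIZE are set by the launcher for every spawned process.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # torchrun (and recent torch.distributed.launch, or --use_env) also exports
    # LOCAL_RANK; older launch versions pass a --local_rank CLI argument instead.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Bind this process to its GPU before creating the process group.
    torch.cuda.set_device(local_rank)

    # MASTER_ADDR / MASTER_PORT come from --master_addr / --master_port,
    # so the default "env://" rendezvous works.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size, local_rank

On a single machine, [server ip] can usually be 127.0.0.1, and newer PyTorch versions recommend torchrun in place of python -m torch.distributed.launch.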