Train ResNet on ImageNet with DeepSpeed on a system with multiple GPUs.
To train on a single node, specify the paths in the command below and run it.
nohup deepspeed main.py --deepspeed --deepspeed_config /path/to/ds_config.json --data_dir /path/to/imagenet --out_dir /path/to/output > /path/to/out.log 2> /path/to/err.log &
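The file passed to --deepspeed_config is a JSON file. The sketch below writes a minimal one. The file name and all values are assumptions for illustration, and main.py may need more fields (optimizer, precision settings) than are shown here.

```shell
# Write a minimal example DeepSpeed config.
# The file name and the values are hypothetical; adjust them for your run.
cat > ds_config.example.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 32,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 100
}
EOF
```

Only the per-GPU batch size is set, so DeepSpeed derives the global batch size from the number of GPUs in use.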
To train on multiple nodes, first set the network interface names that NCCL will use.
export NCCL_SOCKET_IFNAME="all-names"
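One way to find the interface name on a Linux node is sketched below. The name eth0 is only an assumption; use the interface that connects your nodes. NCCL_SOCKET_IFNAME also accepts a comma-separated list of names.

```shell
# List the network interfaces on this node (Linux only).
ls /sys/class/net

# Hypothetical choice: the nodes are connected through eth0.
export NCCL_SOCKET_IFNAME="eth0"

# Optional: make NCCL log its setup, including the interface it selected.
export NCCL_DEBUG=INFO
```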
Specify the paths and nodes in the command below, and run it.
nohup deepspeed --hostfile=/path/to/hostfile --include="node-IP1:device-idxs[@node-IP2:device-idxs]" main.py --deepspeed --deepspeed_config /path/to/ds_config.json --data_dir /path/to/imagenet --out_dir /path/to/output > /path/to/out.log 2> /path/to/err.log &
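A hostfile lists one node per line with the number of GPU slots on it, and --include selects nodes and device indices from that list. The sketch below uses the hypothetical node names worker-1 and worker-2 with 4 GPUs each; replace them with your own hostnames or IPs. The launcher typically needs passwordless SSH access to every node listed.

```shell
# Write an example hostfile: one node per line, slots = GPUs on that node.
# The node names and slot counts are hypothetical.
cat > hostfile.example <<'EOF'
worker-1 slots=4
worker-2 slots=4
EOF

# Use GPUs 0 and 1 on each node; nodes are separated by "@".
INCLUDE="worker-1:0,1@worker-2:0,1"

# Print the resulting launcher arguments for inspection.
echo "deepspeed --hostfile=hostfile.example --include=\"$INCLUDE\" main.py ..."
```

The node names given in --include have to match entries in the hostfile.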