DDP training gets terminated in the middle of the training because of some SIGKILL received by a PID (forked child process)
Describe the bug
traceback : Signal 9 (SIGKILL) received by PID 10398
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 10398) of binary: /anaconda/envs/py37_default/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/py37_default/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/..../torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/.../torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/.../torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/.../torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/.../torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jupyte.../torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
To Reproduce
Just a normal stoke training script
stoke_model = Stoke(
    model=model,
    verbose=True,
    optimizer=optimizer,
    loss=loss,
    batch_size_per_device=opt.batchSize,
    gpu=True,
    fp16=None,
    distributed=DistributedOptions.ddp.value,
    fairscale_oss=True,
    fairscale_sddp=True,
    grad_accum_steps=1,
    configs=[amp_config, ddp_config, oss_config],
    grad_clip=ClipGradNormConfig(max_norm=opt.grad_clip, norm_type=2.0),
)
def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):
    example_ct = 0  # number of examples seen
    batch_ct = 0
    sum_loss = 0
    stoke_model.print_on_devices(f"Starting Epoch {epoch + 1}")
    stoke_model.model_access.train()
    for idx, (inputs, targets) in enumerate(train_dataloader):
        # Call the model through the stoke object interface
        outputs = stoke_model.model(inputs)
        train_loss = stoke_model.loss(outputs, targets)
        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")
        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)
        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()
        sum_loss += train_loss
        example_ct += len(inputs)
        batch_ct += 1
        # Report metrics every 50th batch
        if ((batch_ct + 1) % 50) == 0:
            train_log(train_loss, example_ct, epoch)
            # print(train_loss, example_ct, epoch)
    avg_loss = sum_loss / len(train_dataloader)
    return avg_loss

for epoch in tqdm(range(epochs), leave=True):
    train_loss = train(train_dataloader, stoke_model, scheduler1, scheduler2, epoch)
    val_loss = validate(val_dataloader, stoke_model, epoch)
    save_checkpoint(stoke_model, epoch, train_loss, val_loss)
The actual script is posted here: https://gist.github.com/rushi-the-neural-arch/bee47ba87e5ddabf0cb47def9bc0b013
Launch command:
env CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 Stoke-DDP.py --projectName "Stoke-4K-2X-DDP" --batchSize 18 --nEpochs 2 --lr 1e-3 --weight_decay 1e-4 --grad_clip 0.1
Error produced:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10401 closing signal SIGTERM
Expected behavior
I know this is more of a PyTorch DDP concern than a Stoke issue: many users have hit this problem, and there doesn't seem to be a definitive solution apart from downgrading torch. For example, in pytorch/pytorch#67538 a workaround was posted just 2 days ago where the user downgraded torch from 1.10 to 1.8, which resolved this particular error. Since Stoke requires a torch version greater than 1.8.1, I guess that is not an option for us. torch 1.10 was only recently released, so this may simply not be fixed upstream yet. Do you happen to know of any alternative approach or solution?
For more context: I first trained a very lightweight sample network and could train it easily, even with larger batch sizes. After some experiments, I switched to a heavier network with more parameters (~4.5M), and that is when this error started occurring. Initially I thought it was due to extra load on the RAM, so I decreased the batch size to 1, removed the gradient accumulation step, and played around with the num_workers parameter, but none of this fixed the error. What I have noticed is that the error occurs exactly after 125 steps, which seems weird because no code performs any operation after 125 steps or after any other specific number of steps.
EDIT: I tried FP16 training and the error still persists, but it now occurs after 145 steps.
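One way to check the RAM hypothesis is to log the process's peak resident memory every few steps: a steady climb up to the step where the job dies would point towards an OOM kill. Below is a minimal sketch using only the standard library. The helper `peak_rss_mib` and the simulated `leak` list are hypothetical stand-ins, not part of the actual script, and the unit conversion assumes Linux, where `ru_maxrss` is reported in KiB.

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


history = []
leak = []  # hypothetical stand-in for whatever accumulates across steps

for step in range(1, 6):
    # Simulate a training step that keeps a reference to 8 MiB of data.
    leak.append(b"\x01" * (8 * 1024 * 1024))
    history.append(peak_rss_mib())
    print(f"step {step}: peak RSS {history[-1]:.1f} MiB")
```

In a real training loop, the same call could go next to the existing "every 50th batch" logging.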
Screenshots/Code Snippets
Environment:
- OS: Ubuntu 18.04.5
- Python version: 3.7.7
- PyTorch Version: 1.10
- Deepspeed Version: 0.5.4
- Horovod Version: 0.23
- Fairscale Version: 0.4.0
- CUDA/cuDNN version: 11.2 / 7.6.2
- Stoke configuration: 0.2.0
Hmmmm... Return code -9 means it was most likely a SIGKILL probably due to an OOM kill from the OS or from CUDA (sometimes these errors throw nothing helpful).
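For reference, the mapping from exit code -9 to SIGKILL can be reproduced with the standard library: on POSIX, Python's `subprocess` reports a child terminated by signal N as return code -N. This is only an illustrative sketch; `describe_exit` is a hypothetical helper, not part of torch or Stoke, and the example kills the child itself rather than waiting for a real OOM kill.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess-style return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative value -N means the child was killed by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Start a child that just sleeps, then kill it with the same signal
# the OS OOM killer uses.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.kill()  # sends SIGKILL on POSIX
child.wait()

print(child.returncode, "->", describe_exit(child.returncode))
```

SIGKILL cannot be caught by the process, which is why the worker dies without a Python traceback of its own and only the launcher reports the failure.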
I think you are making a mistake in the summed loss (not detaching the tensor from the graph). See the added comments below on part of your code:
def train(train_dataloader, stoke_model: Stoke, scheduler1, scheduler2, epoch: int):
    for idx, (inputs, targets) in enumerate(train_dataloader):
        # Call the model through the stoke object interface
        outputs = stoke_model.model(inputs)
        ### This is the loss tensor(s)... remember here that this tensor is still attached to the compute graph and Stoke handles it no differently than base torch
        train_loss = stoke_model.loss(outputs, targets)
        ### This is just an ema of the step loss -- shouldn't ever get reset by Stoke
        stoke_model.print_ema_loss(prepend_msg=f"Step {idx+1} -- EMA Loss")
        # Call backward through the stoke object interface
        stoke_model.backward(loss=train_loss)
        # Call step through the stoke object interface
        stoke_model.step()
        scheduler1.step()
        scheduler2.step()
        ### I think this is where you have a mistake that would lead to an OOM SIGKILL -- typically you call .detach() on the tensor to remove it from the graph when you want to create a running sum of the loss
        # sum_loss += train_loss
        ### Swap with this handy stoke function that will sync and detach across all devices
        ### https://fidelity.github.io/stoke/reference/stoke/stoke/#detach_and_sync_loss
        sum_loss += stoke_model.detach_and_sync_loss(loss=train_loss)
    avg_loss = sum_loss / len(train_dataloader)
    return avg_loss
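The same point can be seen in plain PyTorch, without Stoke. The sketch below is a toy example under assumed names (a small `nn.Linear` model and random data stand in for the real network and dataloader): a running sum built from the raw loss stays attached to autograd, while one built from `loss.detach()` is just a value.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy data standing in for a dataloader: 5 batches of 8 examples each.
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]

attached_sum = 0.0  # pattern from the report: accumulates graph-attached tensors
detached_sum = 0.0  # suggested pattern: accumulates detached values only

for inputs, targets in batches:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    attached_sum = attached_sum + loss           # still tracked by autograd
    detached_sum = detached_sum + loss.detach()  # no autograd history

avg_loss = (detached_sum / len(batches)).item()

print("attached sum tracked by autograd:", attached_sum.requires_grad)
print("detached sum tracked by autograd:", detached_sum.requires_grad)
print("average loss:", avg_loss)
```

Because `attached_sum` carries a `grad_fn` chain that references every step's loss, none of that history can be garbage collected until the sum itself is dropped. Calling `.item()` on the loss instead of `.detach()` also works when a plain Python float is enough.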
@ncilfone Hey there, apologies for the delay! Thank you so much! This was precisely the mistake I was making (a silly one: I forgot to detach). However, the unhelpful error message led me into a spiral!
Closing this issue as it is solved!