microsoft/infinibatch

Training gets stuck if a data loader worker process raises an exception

rpengms opened this issue · 0 comments

If a data loader worker process raises an exception, training gets stuck.

It looks like we already have code to ensure that when the main training process ends, all the data loader processes are reaped.
The reverse may also be needed: if a data loader process throws an exception and terminates for any reason, either restart it or terminate the parent training process.
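
A minimal sketch of the "terminate the parent" option, to illustrate the idea rather than describe how infinibatch is actually implemented. It assumes the training process pulls batches from a queue filled by worker processes. The names `get_batch`, `WorkerDiedError`, and `poll_seconds` are hypothetical. Instead of blocking forever on the queue, the parent polls with a timeout and checks whether any worker has exited:

```python
import queue


class WorkerDiedError(RuntimeError):
    """Raised in the training process when a data loader worker has exited."""


def get_batch(batch_queue, workers, poll_seconds=1.0):
    """Return the next batch, or raise if a worker exited while we were waiting.

    batch_queue: any object whose get(timeout=...) raises queue.Empty on timeout,
                 e.g. multiprocessing.Queue or queue.Queue.
    workers:     objects exposing is_alive() and exitcode,
                 e.g. multiprocessing.Process instances.
    """
    while True:
        try:
            # Block for a bounded time only, so a dead worker cannot hang us forever.
            return batch_queue.get(timeout=poll_seconds)
        except queue.Empty:
            # No batch arrived within the timeout: check that the workers still exist.
            dead = [w for w in workers if not w.is_alive()]
            if dead:
                codes = [w.exitcode for w in dead]
                raise WorkerDiedError(
                    f"{len(dead)} data loader worker(s) exited, exit codes: {codes}"
                )
```

With this shape, a crashed worker surfaces as an exception in the training loop instead of a silent hang. Note that this sketch treats any exited worker as an error, including one that finished normally, so it only fits workers that are expected to run for the whole training job. Restarting the worker instead would need extra logic to restore its position in the data stream.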