Error: MPMD detected but reload is not supported yet
wfckl789 opened this issue · 1 comment
wfckl789 commented
Hi, I found that the error "MPMD detected but reload is not supported yet" occurs when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help look into this issue? Thanks so much!
I have attached the related scripts here. After downloading them, you can simply run ./run_simple_model_tp_pp.sh to reproduce the error.
Environment information:
EC2 Instance: trn1.32xlarge
OS: Ubuntu 20.04
Neuron PyTorch: Latest 2.18
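For context, the reported layout (dp=1, tp=8, pp=4) needs 32 workers. The short Python sketch below does that arithmetic and checks it against the instance. It is only an illustration: the helper name world_size is hypothetical, and the figure of 32 NeuronCores for this instance type (16 Trainium chips with 2 cores each) is stated here as an assumption, not taken from the attached scripts.

```python
# Parallel degrees reported in the issue.
DP, TP, PP = 1, 8, 4

# Assumption: the instance exposes 32 NeuronCores
# (16 Trainium chips x 2 cores each).
NEURON_CORES = 32


def world_size(dp: int, tp: int, pp: int) -> int:
    """Number of worker processes needed for a dp x tp x pp layout."""
    return dp * tp * pp


required = world_size(DP, TP, PP)
fits = required <= NEURON_CORES
print(f"workers required: {required}, cores available: {NEURON_CORES}, fits: {fits}")
```

Under these assumptions the layout uses every core on the instance, so the run is not oversubscribed and the error is unlikely to be a simple sizing problem.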
aws-rhsoln commented
It's a duplicate of this issue: aws-neuron/neuronx-distributed#21. Closing this one.