aws-neuron/aws-neuron-sdk

Error: MPMD detected but reload is not supported yet

wfckl789 opened this issue · 1 comments

Hi, I found that the error "MPMD detected but reload is not supported yet" occurs when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help look into this issue? Thanks so much!

I have attached the related scripts here. After downloading them, you can simply run ./run_simple_model_tp_pp.sh to reproduce the error.

scripts.zip

Environment information:

EC2 Instance: trn1.32xlarge

OS: Ubuntu 20.04

Neuron PyTorch: latest (2.18)

It's a duplicate of this issue: aws-neuron/neuronx-distributed#21. Closing this one.