dmlc/rabit

Fault tolerance does not work: Allreduce Recovered data size do not match the specification of function call

chinwy opened this issue · 3 comments

I ran xgboost on YARN to test whether fault tolerance works. I started 4 workers. Once xgboost began updating the model, I killed one worker (call it worker0). YARN started a replacement worker named worker0_1, but that worker eventually failed with this error: Allreduce Recovered data size do not match the specification of function call.
The corresponding code is:

allreduce_robust.cc (line 817)

    if (role == kRequestData || role == kHaveData) {
      utils::Check(data_size == size,
                   "Allreduce Recovered data size do not match the specification of function call.\n"
                   "Please check if calling sequence of recovered program is the "
                   "same the original one in current VersionNumber");
    }

I then printed the two values being compared: data_size is 800 and size is 8. I don't know why they differ.
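To make the failure mode concrete, here is a minimal sketch of the kind of mismatch the error message hints at ("check if calling sequence of recovered program is the same"). It is not rabit's implementation. `ResultCache`, `Record`, `Replay`, and `ReplayMatches` are hypothetical names invented for illustration. The 100-element versus 1-element split is only an assumption that happens to reproduce 800 and 8 bytes with 8-byte values; the actual cause in this issue is unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a cache of collective results keyed by call
// sequence number. This is NOT rabit's data structure, only an illustration.
class ResultCache {
 public:
  // Record the byte size of the result produced by call number `seqno`.
  void Record(int seqno, std::size_t bytes) {
    assert(seqno >= 0);
    sizes_[seqno] = bytes;
  }

  // Replay call `seqno` for a restarted worker that expects `requested` bytes.
  // Throws std::runtime_error when the cached size differs, which mirrors the
  // spirit of the data_size == size check quoted above.
  std::size_t Replay(int seqno, std::size_t requested) const {
    auto it = sizes_.find(seqno);
    if (it == sizes_.end()) {
      throw std::out_of_range("no cached result for this seqno");
    }
    if (it->second != requested) {
      throw std::runtime_error(
          "recovered data size " + std::to_string(it->second) +
          " does not match requested size " + std::to_string(requested));
    }
    return it->second;
  }

 private:
  std::map<int, std::size_t> sizes_;
};

// Returns true when a replay of `count` elements of `type_bytes` each matches
// the cached size for `seqno`, false when the size check fails.
bool ReplayMatches(const ResultCache& cache, int seqno,
                   std::size_t count, std::size_t type_bytes) {
  try {
    cache.Replay(seqno, count * type_bytes);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

For example, if the surviving workers recorded call 0 with `Record(0, 800)` (say, 100 values of 8 bytes), then `ReplayMatches(cache, 0, 100, 8)` succeeds, while a restarted worker whose first call covers a single 8-byte value, `ReplayMatches(cache, 0, 1, 8)`, fails the check. Under this assumption, the thing to look for is any collective call that the restarted worker issues in a different order, or with a different element count, than the original run.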

@chenqin FYI....long way to go

hcho3 commented

Closing, as Rabit has been moved into dmlc/xgboost. See discussion in dmlc/xgboost#5995.