Allreduce will sometimes hang
chenghuige opened this issue · 2 comments
chenghuige commented
The code below hangs in allreduce. If I remove "#pragma omp parallel for" it works, even though the allreduce is outside the parallel loop. During some runs I also see ":AssertError:check ack & check pt cannot occur together with normal ops" and ":zero size check point is not allowed". None of these occur if I remove "#pragma omp parallel for".
    vector<bool> needMoreStep(TrainData.NumFeatures, false);
    #pragma omp parallel for
    for (int featureIndex = 0; featureIndex < TrainData.NumFeatures; featureIndex++)
    {
        if (IsFeatureOk(featureIndex))
        {
            needMoreStep[featureIndex] = CalculateSamllerChildHistogram(featureIndex);
        }
    }
    for (int featureIndex = 0; featureIndex < TrainData.NumFeatures; featureIndex++)
    {
        if (IsFeatureOk(featureIndex) && needMoreStep[featureIndex])
        {
            // rabit allreduce here; this is where it hangs
            AllreduceSum((*_smallerChildHistogramArray)[featureIndex]);
        }
    }
tqchen commented
You need to make sure every node enters that if clause.
chenghuige commented
Well, I just checked and you are right, it's my fault. Thanks, Chen!
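
For later readers, here is a minimal sketch of the point made above: an allreduce is a collective call, so every node has to issue the same sequence of calls, and a node that skips one leaves the others waiting. The sketch does not use rabit. It only simulates the per-node flags, and the names Flags, CountCollectiveCalls, AllNodesAgree and SyncFlags are hypothetical helpers made up for this illustration. It assumes that merging the flags with an element-wise OR is acceptable for the application, which the thread does not confirm. It also stores the flags in std::vector<char>, because std::vector<bool> packs its elements into bits and concurrent writes to different elements from a parallel loop are not guaranteed to be race-free. Whether that played a part in the original report is not stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one node's local decisions: flags[i] != 0 means
// "this node wants to call the allreduce for feature i".
// std::vector<char> is used rather than std::vector<bool>, since the latter
// packs bits and concurrent writes to distinct elements may race.
using Flags = std::vector<char>;

// Number of collective calls a node would issue if it looked only at its own flags.
std::size_t CountCollectiveCalls(const Flags& flags) {
  std::size_t calls = 0;
  for (char f : flags) {
    if (f) ++calls;
  }
  return calls;
}

// True when every node holds identical flags, i.e. every node would enter
// the allreduce branch for exactly the same features.
bool AllNodesAgree(const std::vector<Flags>& perNode) {
  for (std::size_t n = 1; n < perNode.size(); ++n) {
    if (perNode[n] != perNode[0]) return false;
  }
  return true;
}

// Simulates what an element-wise OR reduction over the flags would leave on
// every node: a feature is selected everywhere if any node selected it.
std::vector<Flags> SyncFlags(const std::vector<Flags>& perNode) {
  if (perNode.empty()) return {};
  Flags merged(perNode[0].size(), 0);
  for (const Flags& flags : perNode) {
    assert(flags.size() == merged.size());  // all nodes see the same feature count
    for (std::size_t i = 0; i < flags.size(); ++i) {
      if (flags[i]) merged[i] = 1;
    }
  }
  return std::vector<Flags>(perNode.size(), merged);
}
```

In a real job, the merge step would itself be one collective call over the whole flag array, made unconditionally by every node before the second loop, so that the per-feature branch is then taken identically everywhere. Checking AllNodesAgree-style invariants in a debug build is one way to catch a mismatch before it shows up as a hang.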