dmlc/rabit

Allreduce sometimes hangs

chenghuige opened this issue · 2 comments

The code is below. It hangs in Allreduce, but works fine if I remove "#pragma omp parallel for", even though the Allreduce is outside the parallel loop. During some runs I also see ":AssertError:check ack & check pt cannot occur together with normal ops" and ":zero size check point is not allowed". None of these occur if I remove "#pragma omp parallel for".

    vector<bool> needMoreStep(TrainData.NumFeatures, false);
    #pragma omp parallel for
    for (int featureIndex = 0; featureIndex < TrainData.NumFeatures; featureIndex++)
    {
        if (IsFeatureOk(featureIndex))
        {
            needMoreStep[featureIndex] = CalculateSamllerChildHistogram(featureIndex);
        }
    }

    for (int featureIndex = 0; featureIndex < TrainData.NumFeatures; featureIndex++)
    {
        if (IsFeatureOk(featureIndex) && needMoreStep[featureIndex])
        {
            // rabit allreduce here.. will hang
            AllreduceSum((*_smallerChildHistogramArray)[featureIndex]);
        }
    }

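You need to make sure every node enters that if clause. Allreduce is a collective call, so if some nodes skip it for a feature while the others call it, the callers are left waiting for the nodes that never arrive.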
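A minimal sketch of one way to guarantee that, assuming the flags can differ between nodes because each node computes them from its own data. The helper `SyncHistograms` and its two callback parameters are hypothetical names, not part of rabit; the callbacks stand in for collectives such as `rabit::Allreduce<rabit::op::Max>(ptr, count)` and `rabit::Allreduce<rabit::op::Sum>(ptr, count)`. The flags are stored as `int` rather than `vector<bool>` so that they can be passed to a collective as a plain array. The sketch also assumes that any other condition in the branch, such as `IsFeatureOk`, already gives the same answer on every node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of rabit.
// allreduce_max(int* buf, size_t n) and allreduce_sum(double* buf, size_t n)
// are stand-ins for collective calls that every node must make together.
template <typename MaxFn, typename SumFn>
void SyncHistograms(std::vector<int>& need_more_step,
                    std::vector<std::vector<double>>& histograms,
                    MaxFn allreduce_max, SumFn allreduce_sum) {
    assert(need_more_step.size() == histograms.size());

    // Step 1: every node makes this call unconditionally. Afterwards all
    // nodes hold identical flags: a flag is set if any node set it.
    allreduce_max(need_more_step.data(), need_more_step.size());

    // Step 2: the branch now depends only on the agreed flags, so every
    // node issues the same sequence of collective calls.
    // Assumption: a node whose local flag was 0 still holds a histogram of
    // the right size for that feature (for example all zeros).
    for (std::size_t f = 0; f < histograms.size(); ++f) {
        if (need_more_step[f]) {
            allreduce_sum(histograms[f].data(), histograms[f].size());
        }
    }
}
```

With rabit, the two callbacks could be small lambdas that forward to the corresponding `rabit::Allreduce` calls. The extra Max reduction costs one small collective per pass, in exchange for all nodes taking the same branch for every feature.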

Well, I just checked and you are right, it's my fault. Thanks, Chen!