dmlc/rabit

Broadcast issue in OpenMP parallel loop

chenghuige opened this issue · 6 comments

    template<typename DType>
    static void Allgather(std::vector<DType>& datas)
    {
        // Hand the local chunk to this rank's slot; datas is left
        // empty and is refilled by the merges below.
        std::vector<std::vector<DType> > values(rabit::GetWorldSize());
        values[rabit::GetRank()] = std::move(datas);
        datas.clear();
        for (size_t i = 0; i < values.size(); i++)
        {
            // Broadcast rank i's chunk to all ranks, then append it.
            rabit::Broadcast(&values[i], static_cast<int>(i));
            gezi::merge(datas, values[i]);
        }
    }

    #pragma omp parallel for
    for (int i = 0; i < 16; i++)
    {
        std::vector<int> values;
        if (rabit::GetRank() == 0)
        {
            values = { 1, 2, 3 };
        }
        else
        {
            values = { rabit::GetRank(), 4, 5, 6 };
        }
        Rabit::Allgather(values);
    }

Running this causes the following crash:
    *** Aborted at 1434420821 (unix time) try "date -d @1434420821" if you are using GNU date ***
    AssertError:Allreduce: boundary check
    PC: @ 0x471f33 (anonymous namespace)::cpp_alloc()
    AssertError:maxdf must be smaller than FDSETSIZE
    AssertError:maxdf must be smaller than FDSETSIZE
    AssertError:PushTemp inconsistent
    ^CTraceback (most recent call last):
      File "/home/users/chenghuige/tools/rabit_demo.py", line 96, in
        tracker.submit(args.nworker, [], fun_submit = mthread_submit, verbose = args.verbose)
      File "/home/users/chenghuige/tools/rabit_tracker.py", line 316, in submit
        master.accept_slaves(nslave)
      File "/home/users/chenghuige/tools/rabit_tracker.py", line 258, in accept_slaves
        fd, s_addr = self.sock.accept()
      File "/home/users/chenghuige/.jumbo/lib/python2.7/socket.py", line 202, in accept
        sock, addr = self._sock.accept()

The rabit functions are not thread-safe, so you have to make sure to call them from only one thread. Use OpenMP to speed up computation, but do not use it to parallelize communication.
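
A minimal sketch of the pattern Chen describes, assuming the rabit C++ API of that era (`#include <rabit.h>`; the include path may differ by version): OpenMP parallelizes only the local, per-element computation, and the single serial thread issues the collective call.

    #include <vector>
    #include <rabit.h>

    int main(int argc, char *argv[]) {
        rabit::Init(argc, argv);

        std::vector<int> values(16);
        const int rank = rabit::GetRank();  // read rabit state before the parallel region

        // Parallelize only the thread-local computation ...
        #pragma omp parallel for
        for (int i = 0; i < 16; i++) {
            values[i] = rank * 16 + i;  // stand-in for any per-element work
        }

        // ... and communicate from the serial region, on one thread.
        rabit::Allreduce<rabit::op::Sum>(&values[0], values.size());

        rabit::Finalize();
        return 0;
    }

Here `Allreduce` stands in for whatever collective the computation needs; the point is only that rabit ever sees a single calling thread.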

I see, thanks Chen!

Hi Chen, can I call the rabit functions inside an omp critical section?

    #pragma omp critical
    {
        Rabit::Allgather(values);  // rabit call guarded by the critical section
    }
However, if the thread number is set > 1, the program hangs; after stopping it with Ctrl+C I find:

    @ 0x7f0c6cfca550 (unknown)
    @ 0x7f0c6c9a9b16 futex_wait.constprop.2
    @ 0x7f0c6c9a9baf gomp_mutex_lock_slow
    @ 0x7f0c6c9a58c0 gomp_mutex_lock

Another thing: sometimes I see `tcmalloc: large alloc` messages when using rabit; the single-machine program never showed this before. Have you met this? It seems it does not affect the final result, though.

Please try to use rabit outside OpenMP constructs. `critical` only ensures one thread calls the function at a time; it does not ensure the correct order of the calls.
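
To spell out the failure mode (a hypothetical trace, not taken from the report above): `critical` serializes the threads within each process, but nothing aligns the iteration order across processes, so the ranks can enter mismatched collectives and block forever.

    #include <vector>
    #include <rabit.h>

    // WARNING: reproduces the hang; shown only to illustrate why
    // `critical` is not enough.
    void broken_allgather(std::vector<std::vector<int> >& values) {
        #pragma omp parallel for
        for (int i = 0; i < rabit::GetWorldSize(); i++) {
            #pragma omp critical
            {
                // `critical` serializes threads within this process only.
                // Rank 0's first thread may arrive with i == 3 while rank
                // 1's first thread arrives with i == 7, so the processes
                // call Broadcast with different roots at the same time.
                // The mismatched collectives never complete.
                rabit::Broadcast(&values[i], i);
            }
        }
    }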

OK, thanks for the quick reply!

Well, I found another workaround: using #pragma omp ordered.
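
For reference, a minimal sketch of the workaround the commenter describes, assuming the loop index doubles as the broadcast root as in the snippets above: the ordered region runs in iteration order within each process, so every rank issues the collectives in the same sequence.

    #include <vector>
    #include <rabit.h>

    void ordered_allgather(std::vector<std::vector<int> >& values) {
        #pragma omp parallel for ordered schedule(static)
        for (int i = 0; i < rabit::GetWorldSize(); i++) {
            #pragma omp ordered
            {
                // Runs in iteration order on every process, so the
                // Broadcast roots line up across ranks: 0, 1, 2, ...
                rabit::Broadcast(&values[i], i);
            }
        }
    }

Note that this serializes the communication anyway, so it is correct but no faster than calling rabit outside the parallel region, as Chen suggested.
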
Rabit is so cool 👍