Broadcast issue in OpenMP parallel loop
chenghuige opened this issue · 6 comments
template<typename DType>
static void Allgather(std::vector<DType>& datas)
{
    std::vector<std::vector<DType> > values(rabit::GetWorldSize());
    values[rabit::GetRank()] = move(datas);
    for (size_t i = 0; i < values.size(); i++)
    {
        rabit::Broadcast(&values[i], i);
        gezi::merge(datas, values[i]);
    }
}
#pragma omp parallel for
for (size_t i = 0; i < 16; i++)
{
    vector<int> values;
    if (rabit::GetRank() == 0)
    {
        values = { 1, 2, 3 };
    }
    else
    {
        values = { rabit::GetRank(), 4, 5, 6 };
    }
    Rabit::Allgather(values);
}
This will cause
*** Aborted at 1434420821 (unix time) try "date -d @1434420821" if you are using GNU date ***
AssertError:Allreduce: boundary check
PC: @ 0x471f33 (anonymous namespace)::cpp_alloc()
AssertError:maxdf must be smaller than FDSETSIZE
AssertError:maxdf must be smaller than FDSETSIZE
AssertError:PushTemp inconsistent
^CTraceback (most recent call last):
File "/home/users/chenghuige/tools/rabit_demo.py", line 96, in
tracker.submit(args.nworker, [], fun_submit = mthread_submit, verbose = args.verbose)
File "/home/users/chenghuige/tools/rabit_tracker.py", line 316, in submit
master.accept_slaves(nslave)
File "/home/users/chenghuige/tools/rabit_tracker.py", line 258, in accept_slaves
fd, s_addr = self.sock.accept()
File "/home/users/chenghuige/.jumbo/lib/python2.7/socket.py", line 202, in accept
sock, addr = self._sock.accept()
The rabit functions are not thread-safe, so you have to make sure you only call them from one thread. Use OpenMP to speed up computation, but do not use it to parallelize communication.
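A minimal sketch of that split, assuming the per-iteration data can be prepared up front: the OpenMP loop only fills buffers, and a plain serial loop then makes the communication calls from one thread. `Collective` and `ComputeThenCommunicate` are hypothetical names used for illustration, not part of rabit; in a real program the callback would wrap something like the Allgather helper from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a collective call (e.g. the Allgather helper
// from the original post). It is not a rabit API.
using Collective = std::function<void(std::vector<int>&)>;

// Phase 1 fills one buffer per iteration in parallel, with no communication.
// Phase 2 invokes the collective from a single thread, in index order.
inline std::vector<std::vector<int>> ComputeThenCommunicate(
    int rank, std::size_t n, const Collective& collective) {
  assert(rank >= 0);
  std::vector<std::vector<int>> buffers(n);

  // Computation only: each iteration writes to its own slot.
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(n); i++) {
    if (rank == 0) {
      buffers[i] = {1, 2, 3};
    } else {
      buffers[i] = {rank, 4, 5, 6};
    }
  }

  // Communication only: serial, so every process issues the calls
  // from one thread and in the same sequence.
  for (std::size_t i = 0; i < n; i++) {
    collective(buffers[i]);
  }
  return buffers;
}
```

Built without OpenMP support the pragma is ignored and the function behaves the same, just without the parallel speedup in the first phase.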
I see, thanks Chen!
Hi Chen, can I put the rabit function calls inside an omp critical section?
#pragma omp critical
{
}
However, if I set the thread number > 1, the program hangs. After stopping it with Ctrl+C I find:
@ 0x7f0c6cfca550 (unknown)
@ 0x7f0c6c9a9b16 futex_wait.constprop.2
@ 0x7f0c6c9a9baf gomp_mutex_lock_slow
@ 0x7f0c6c9a58c0 gomp_mutex_lock
Another thing: sometimes I see "tcmalloc: large alloc" when using rabit. The single-machine program has not shown this before. Have you run into this? It does not seem to affect the final result, though.
Please try to use rabit outside of OpenMP constructs. Critical only ensures that one thread at a time calls the function; it does not ensure the correct order of the calls.
Ok, thanks for quick reply!
Well, I found another workaround: using #pragma omp ordered.
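A rough sketch of what that workaround might look like, not the actual code from this thread. The ordered region runs one iteration at a time and in loop order, which is the ordering guarantee that critical lacks. `RunOrdered` is a hypothetical name, and the `push_back` only marks where the communication call would go. Note that the calls may still come from different threads, so whether this is safe with rabit in general is an assumption based on the report above.

```cpp
#include <cassert>
#include <vector>

// Returns the order in which iterations entered the ordered region.
// The push_back is a placeholder for the point where a collective
// call would be made.
inline std::vector<int> RunOrdered(int n) {
  assert(n >= 0);
  std::vector<int> call_order;
  call_order.reserve(static_cast<std::size_t>(n));

  #pragma omp parallel for ordered schedule(static, 1)
  for (int i = 0; i < n; i++) {
    int local = i * i;  // per-iteration work, may run in parallel

    #pragma omp ordered
    {
      // Executed by one thread at a time, in loop order (0, 1, 2, ...),
      // so each process reaches this point in the same sequence.
      call_order.push_back(i);
      (void)local;
    }
  }
  return call_order;
}
```

Because the ordered region serializes the iterations, only the work outside it benefits from multiple threads.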
Rabit is so cool 👍