open-mpi/hwloc

hwloc_topology_set_pid seems to lack error checking on Linux

HadrienG2 opened this issue · 1 comment

As of hwloc v2.9.3, and at least on Linux, hwloc_topology_set_pid() seems to accept nonexistent PIDs as input without complaining (it keeps returning 0). Hopefully it does nothing in this scenario, or the error is reported later e.g. at topology loading time? But an error should probably be signaled early on to ease debugging if possible, because even errors at loading time are already ambiguous (it's not clear which configuration parameter caused the error).

Reporting an error early wouldn't help much, since the PID could disappear between set_pid() and load(), or even be reused by another process. There is a lot of work going on in Linux to avoid this kind of issue (the overall idea is to acquire a "pidfd" so that the PID cannot be reused in the middle). It's a really complicated problem in general, and lots of APIs and syscalls might change because of it.

In the end, our topology PID is only used for binding and for getting cgroup info. Binding will just fail. The cgroup won't be found, and hwloc will fall back to no restriction (the default cgroup). I am not sure we want load() to report a late error in that case.
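
For callers who still want an early diagnostic despite the race described above, one option is to probe the PID themselves before passing it to hwloc. The sketch below is only an illustration: pid_seems_alive() is a hypothetical helper name, not part of hwloc, and it relies on the POSIX behavior of kill() with signal 0, which performs the existence and permission checks without sending anything. The answer is only valid at the instant of the call, so the PID can still vanish or be reused before load().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper (not part of hwloc).
 * Returns 1 if `pid` named a process when the call was made,
 * 0 if it did not, and -1 on any other error.
 * This is a best-effort probe: the result may be stale immediately. */
static int pid_seems_alive(pid_t pid)
{
    if (pid <= 0)
        return 0;   /* kill() treats 0 and negative values as process groups */
    if (kill(pid, 0) == 0)
        return 1;   /* process exists and we are allowed to signal it */
    if (errno == EPERM)
        return 1;   /* process exists, but we lack permission to signal it */
    if (errno == ESRCH)
        return 0;   /* no such process */
    return -1;
}
```

An application could call this just before hwloc_topology_set_pid() and print its own warning when it returns 0. That only improves the error message in the common case of a mistyped PID and does not close the reuse window.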