rapidsai/ucx-py

implicit libnuma.so.1 dependency added in new ucx-1.11.1 in conda package

pseudotensor opened this issue · 6 comments

ldd shows:

python/lib/python3.8/site-packages/ucp/_libs/ucx_api.cpython-38-x86_64-linux-gnu.so:
        libnuma.so.1 => not found

This is with the following packages:

https://conda.anaconda.org/rapidsai/linux-64/ucx-1.11.1+gc58db6b-cuda11.2_0.tar.bz2
https://conda.anaconda.org/rapidsai/linux-64/ucx-proc-1.0.0-gpu.tar.bz2
https://conda.anaconda.org/rapidsai/linux-64/ucx-py-0.21.0-py38_gc58db6b_0.tar.bz2

which were installed after conda solved for a non-conflicting environment.

This is a new dependency: it was not present until 3 days ago, when the new ucx package was uploaded to https://anaconda.org/rapidsai/ucx/files

I expect this dependency was introduced by mistake, since no package in the dependency chain installs libnuma.
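For anyone who wants to script the ldd check from the top of this report, here is a minimal sketch. The helper names (missing_libraries, ldd_missing) are made up for illustration, and it assumes a Linux system where ldd is on the PATH and prints unresolved entries in the "name => not found" form shown above.

```python
import re
import subprocess
from typing import List

# Matches ldd lines of the form "    libnuma.so.1 => not found".
_MISSING = re.compile(r"^\s*(\S+)\s+=>\s+not found\s*$")


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the shared-library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        match = _MISSING.match(line)
        if match:
            missing.append(match.group(1))
    return missing


def ldd_missing(path: str) -> List[str]:
    """Run ldd on a shared object and return its unresolved dependencies.

    Assumes ldd is available; check=False because we only inspect stdout.
    """
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling ldd_missing on the ucx_api extension module from the report should list libnuma.so.1 on an affected system, and an empty list once libnuma is installed.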

So one now gets things like:

ImportError while importing test module '/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/tests/test_balanced_cut.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/h2oai/dai/python/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/__init__.py:14: in <module>
    from cugraph.community import (
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/community/__init__.py:14: in <module>
    from cugraph.community.louvain import louvain
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/community/louvain.py:14: in <module>
    from cugraph.community import louvain_wrapper
cugraph/community/louvain_wrapper.pyx:21: in init cugraph.community.louvain_wrapper
    ???
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/__init__.py:14: in <module>
    from cugraph.structure.graph_classes import (Graph,
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/graph_classes.py:15: in <module>
    from .graph_implementation import (simpleGraphImpl,
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/graph_implementation/__init__.py:14: in <module>
    from .simpleGraph import simpleGraphImpl
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/structure/graph_implementation/simpleGraph.py:14: in <module>
    from cugraph.structure import graph_primtypes_wrapper
cugraph/structure/graph_primtypes_wrapper.pyx:29: in init cugraph.structure.graph_primtypes_wrapper
    ???
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/comms/comms.py:14: in <module>
    from cugraph.raft.dask.common.comms import Comms as raftComms
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/__init__.py:16: in <module>
    from .common.comms import Comms
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/common/__init__.py:16: in <module>
    from .comms import Comms
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/common/comms.py:17: in <module>
    from .ucx import UCX
/opt/h2oai/dai/python/lib/python3.8/site-packages/cugraph/raft/dask/common/ucx.py:16: in <module>
    import ucp
/opt/h2oai/dai/python/lib/python3.8/site-packages/ucp/__init__.py:10: in <module>
    from .core import *  # noqa
/opt/h2oai/dai/python/lib/python3.8/site-packages/ucp/core.py:16: in <module>
    from . import comm
/opt/h2oai/dai/python/lib/python3.8/site-packages/ucp/comm.py:7: in <module>
    from ._libs import arr, ucx_api
E   ImportError: libnuma.so.1: cannot open shared object file: No such file or directory
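The traceback above only surfaces the missing library deep inside the import chain. As a rough sketch (not part of ucx-py; the can_load helper is hypothetical), one could probe for the library up front with the standard ctypes module, which asks the dynamic loader for the same soname:

```python
import ctypes


def can_load(soname: str = "libnuma.so.1") -> bool:
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


if __name__ == "__main__":
    if not can_load():
        print("libnuma.so.1 is missing; install it with the OS package manager")
```

Running this before importing ucp gives a clearer message than the nested ImportError, but it does not change what needs to be installed.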

It was always the intent to depend on libnuma (see https://github.com/rapidsai/ucx-split-feedstock/blob/master/recipe/install_ucx.sh#L19), and our users were always instructed to install it from their OS package manager. However, a bug was recently discovered and fixed in openucx/ucx#6782: before the fix, NUMA support was not enabled even when requested explicitly, as we do in our conda recipes.

In previous UCX 1.9 packages we didn't have that dependency due to the UCX bug above, and in new UCX 1.11 packages we do, as it's important for UCX on certain systems. Apparently, we could depend on https://anaconda.org/conda-forge/numactl-libs-cos7-x86_64 to resolve that dependency in conda directly, but because it's specifically targeted at CentOS 7, I'm not sure whether it's a reliable package for the general user. Any thoughts here, @jakirkham @raydouglass?

One thing I noticed is that RAPIDS 21.08 wasn't pinned to UCX 1.9, so a new environment picks up UCX 1.11, which wasn't supported back then. If we still want to support RAPIDS <= 21.08, we must pin UCX 1.9 or instruct users to specify ucx=1.9. What do you think, @raydouglass @quasiben?

@pentschev Ok, that's good to know. So rapids <=21.08 shouldn't be used with ucx1.11 then, I should go back to ucx1.9? I was also hit by this then, since the new conda solution upgraded ucx to 1.11 and I just assumed this was ok and was trying to resolve the libnuma issue to make that work.

> Ok, that's good to know. So rapids <=21.08 shouldn't be used with ucx1.11 then, I should go back to ucx1.9?

That's right.

> I was also hit by this then, since the new conda solution upgraded ucx to 1.11 and I just assumed this was ok and was trying to resolve the libnuma issue to make that work.

No, this wasn't anticipated. We recently started pinning some libraries to a maximum version, and I believe we should do the same with UCX.
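Until such a pin exists, an environment can be checked by hand. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the output of conda list --json is a JSON array of objects with "name" and "version" keys. It encodes only what this thread states, namely that RAPIDS <= 21.08 should run on UCX 1.9.

```python
import json
from typing import Optional


def installed_version(conda_list_json: str, package: str = "ucx") -> Optional[str]:
    """Find a package's version in the JSON text printed by `conda list --json`."""
    for entry in json.loads(conda_list_json):
        if entry.get("name") == package:
            return entry.get("version")
    return None


def is_ucx_1_9(version: str) -> bool:
    """True for 1.9.x versions, ignoring local suffixes such as '+gc58db6b'."""
    return version.split(".")[:2] == ["1", "9"]
```

Comparing the first two dot-separated components, rather than doing a plain prefix match, avoids treating a version such as 1.90 as 1.9.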

> In previous UCX 1.9 packages we didn't have that dependency due to the UCX bug above, and in new UCX 1.11 packages we do, as it's important for UCX on certain systems. Apparently, we could depend on https://anaconda.org/conda-forge/numactl-libs-cos7-x86_64 to resolve that dependency in conda directly, but because it's specifically targeted at CentOS 7, I'm not sure whether it's a reliable package for the general user. Any thoughts here, @jakirkham @raydouglass?

No, that's a CDT that just uses a vendored package from CentOS 7. It's only used at build time to make our build tooling happy; I would not use it at runtime. At present, users should continue to install libnuma from their OS package manager.

@jakirkham But that means the ucx package is not consistent with anything else in conda land. For no other package do I have to install something natively on the OS separately, except the NVIDIA drivers. This is quite a big, awkward change. It means the conda setup is not self-contained like it should be.

I know it is not ideal.

Unfortunately libnuma is one of the easier dependencies to install. The others (like MOFED) rely on someone already installing the right libraries, drivers, etc. on the system and have everything configured correctly.

We are discussing internally to see if there are ways to improve the situation to make this less of a pain to deploy.

Edit: We also previously raised openucx/ucx#4570 to discuss making libnuma optional at runtime.