aws/aws-ofi-nccl

No include folder after installation

Closed this issue · 5 comments

Hi, I am building this for my non-AWS system. I followed the instructions:

wget https://github.com/aws/aws-ofi-nccl/archive/refs/tags/v1.4.0.tar.gz
tar -xzf v1.4.0.tar.gz
cd aws-ofi-nccl-1.4.0
./autogen.sh
./configure --prefix=$PWD/install --with-libfabric=/libfabric-1.18.0/install --with-cuda=/cuda-12.0 --with-mpi=/ompi/install --with-nccl=/nccl-2.4.8-1/build
make -j
make -j install

After the build, I can only find the following in the installation:
[image: screenshot of the installation directory contents]

I have tried many versions, but it seems to never generate the include folder which should contain nccl.h.

Is there a reason to use such an old version? While we're not doing regular vetted releases of the non-AWS variant currently, we do our best to not break non-EFA Libfabric providers and it should work. (If it doesn't, please open an issue!)

Our install folder will not generally contain an include folder. We have a header-only dependency on NCCL so that we can implement the interface(s) it defines, and we commit those headers to this repo. We don't currently define any interface that we expect others to implement, so there's no use in shipping any headers in our installs.

The build result is expected to be just the plugin shared library that NCCL (hopefully) loads and uses for communication, alongside some static topology files for specific platforms (i.e., AWS instance types) that we support. What are you trying to use that header for?
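As a rough way to confirm that an install produced what is described above, the sketch below searches an install prefix for the plugin shared library. The helper name find_plugin and the default prefix of $PWD/install are assumptions for illustration; the library name libnccl-net.so is the one mentioned later in this thread.

```shell
# Hypothetical helper (not part of aws-ofi-nccl): print the path of the
# plugin shared library under an install prefix, or nothing if it is absent.
find_plugin() {
    prefix="$1"
    # A missing prefix simply means "nothing found".
    [ -d "$prefix" ] || return 0
    # Match libnccl-net.so with or without a version suffix; keep the first hit.
    find "$prefix" -name 'libnccl-net.so*' | head -n 1
}

# PREFIX is an assumption: use whatever was passed to ./configure --prefix.
plugin="$(find_plugin "${PREFIX:-$PWD/install}")"
if [ -n "$plugin" ]; then
    echo "plugin found: $plugin"
else
    echo "plugin not found"
fi
```

An install with no include directory but with the plugin library present would match what the maintainers describe as the expected result.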

Hi @sielicki, thanks for the explanation. In my application, I am trying to compile my own C code, which has #include <nccl.h>, so I need to specify the header's include path. If I understand your comments correctly, I can point the path, lib_path, and include_path at NCCL's own installation, and also add the path that this repo generates?

Also, I have a follow-up question. I see that the most recent releases say "This release is intended only for use on AWS P* instances...". Does this mean that it can only run on AWS and not on other systems?

If you're writing code that expects to use the NCCL APIs, at the build time of your code you shouldn't need anything from this repository. At runtime, you'll want to make sure that you have the libnccl-net.so from this repo in either your LD_LIBRARY_PATH or ld.so.cache. NCCL will look for that file and, if found, will use it as the backend. Try setting NCCL_DEBUG=info to enable verbose prints from NCCL -- it will report whether or not it's successfully found the ofi provider.
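The split between build time and runtime described above can be sketched as a dry run that only prints the commands it would use. Everything here is an assumption for illustration: the file name my_app.c, the helper names build_cmd and runtime_env, and the default paths (taken from the configure line earlier in the thread). The CUDA include and library paths are added because nccl.h pulls in CUDA runtime headers.

```shell
# Assumed locations; adjust to your own system.
NCCL_HOME="${NCCL_HOME:-/nccl-2.4.8-1/build}"
CUDA_HOME="${CUDA_HOME:-/cuda-12.0}"
PLUGIN_LIBDIR="${PLUGIN_LIBDIR:-$PWD/install/lib}"

# Build time: only NCCL's and CUDA's headers and libraries are referenced.
# Nothing from the aws-ofi-nccl install appears on this line.
build_cmd() {
    echo "gcc my_app.c -I$NCCL_HOME/include -I$CUDA_HOME/include" \
         "-L$NCCL_HOME/lib -L$CUDA_HOME/lib64 -lnccl -lcudart -o my_app"
}

# Runtime: put the plugin's directory on LD_LIBRARY_PATH so NCCL can load
# libnccl-net.so, and turn on NCCL's verbose logging.
runtime_env() {
    echo "LD_LIBRARY_PATH=$PLUGIN_LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} NCCL_DEBUG=INFO"
}

# Print the commands instead of running them (dry run).
build_cmd
echo "env $(runtime_env) ./my_app"
```

With NCCL_DEBUG enabled, NCCL's startup log should indicate whether the plugin was found and loaded; the exact wording of those log lines depends on the NCCL and plugin versions.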

The core contributors for this plugin are all at AWS today, and as a result we're mostly focused on prov/efa. The -aws releases are being done in an effort to be forthcoming about the fact that we have a testing gap for other providers and can't sign off on them in good faith. That being said, we consider it a bug if anything doesn't work with other providers, and we try our best to ensure that we're considering all cases. So tl;dr: the AWS releases should work on other systems (as should master), but we don't personally test them there and can't vouch for them right now. We would like to go back to making general releases if anyone is able to provide access to hardware and/or collaborate with us on CI.

Please reopen if you hit any other issues or if there's documentation that we could make clearer. Thanks!