aws/aws-ofi-nccl
This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
C · Apache-2.0
Issues
- 2
[Feature request] Topo file for g6e.48xlarge
#689 opened by sean-smith - 8
ncclInternalError: Internal check failed. | NET/OFI Unable to insert remote address into address vector for device 1
#663 opened by emorikawa - 0
- 9
register_mr_buffers:544 NCCL WARN NET/OFI Unable to register memory (type = 2) for device 0. RC: -22, Error: Invalid argument
#584 opened by visatish - 2
NCCL RDMA expects fi_cq_data_entry, but OPX provider fills CQ with fi_cq_tagged_entry
#609 opened by lsavers - 0
Initialization fails for OPX Libfabric Provider
#606 opened by lsavers - 4
- 6
Assistance to broader Tag releases
#395 opened by caio-davi - 2
Incorrect error message when setting configure flag: --enable-nvtx-trace-per-dev
#477 opened by ryanhankins - 9
RDMA protocol
#364 opened by eliekozah - 1
GPU direct
#313 opened by tks2004 - 3
Topology Discovery Regression
#298 opened by willgleich - 4
Support Amazon Linux 2023 (AL2023)
#282 opened by bryantbiggs - 3
Version
#391 opened by sean-smith - 5
No include folder after installation
#597 opened by YJHMITWEB - 5
Support Red Hat Enterprise Linux 9+
#290 opened by tmh97 - 3
Segfault after/during finalize with OpenMPI
#322 opened by tmh97 - 0
RDMA support for g6e nodes
#547 opened by Abhishek8394 - 0
rename library to match nccl docs
#508 opened by aws-nslick - 1
- 1
NCCL Cannot Find Tuner Symbols. Need to Export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so
#472 opened by zhanwenchen - 2
NCCL topology file for g5.12xlarge
#466 opened by abdulfatir - 6
- 1
- 6
Unable to find libcudart.so (1.7.1)
#244 opened by kwohlfahrt - 4
Support Ubuntu 22.04
#203 opened by tson1111 - 0
Propagate "Invalid address" to NCCL communicator
#346 opened by vmarkovtsev - 2
Add more examples with more recent cuda versions
#296 opened by tchaton - 4
Unable to force FI_HMEM to be used and FI_OPT_CUDA_API_PERMITTED is not respected by config scripts
#278 opened by tmh97 - 11
Plugin fails if compiled against Libfabric 1.18 but run against Libfabric 1.17 or older.
#241 opened by nvcastet - 0
What are some AI/ML workloads users can utilize to test performance of the plugin?
#269 opened by tmh97 - 0
Misleading comparison on unsigned integer
#236 opened by rauteric - 3
- 2
Support FI_CONTEXT2
#204 opened by tmh97 - 2
aws branch does not build on centos 7 with gcc 4.8.5
#183 opened by wenduwan - 1
- 2
NCCL WARN NET/OFI Only EFA provider is supported
#174 opened by mkserge - 14
- 2
[Feature Request] Allow custom NCCL_TOPO_FILE location
#150 opened by roywei - 6
Mellanox and EFA in Docker Image
#170 opened by mvpatel2000 - 2
aws-ofi-nccl makes unnecessary calls to ofi_iflush() when using the PSM3 transport.
#151 opened by mwheinz - 10
- 2
NCCL WARN NET/OFI Request completed with error. RC: 21. Error: unknown error
#161 opened by Ridhamz-nd - 2
- 6
How does ofi_iflush() work?
#149 opened by mwheinz - 1
Question - difference between main vs aws branches
#130 opened by kiukchung - 2
Training performance on p3dn.24xlarge on Amazon Linux 2 is worse than on Ubuntu 20.04 (with and without EFA)
#106 opened by yukunlin - 2
- 1
Crash with multirail providers.
#102 opened by dmaryin - 7
[Question] Is RDMA available on p3dn instances?
#104 opened by kiukchung