NVMe Driver not registered with nvidia-fs - GDS NVMe unsupported on Rocky 8.6
karanveersingh5623 opened this issue · 0 comments
Hi Team,
I am trying to enable GDS with NVMe support on a Dell R750xa server. Below is my OS configuration:
[root@node002 src]# cat /etc/centos-release
Rocky Linux release 8.6 (Green Obsidian)
[root@node002 src]#
[root@node002 src]# uname -r
4.18.0-477.27.1.el8_8.x86_64
[root@node002 src]# dnf list cuda-tools*
cuda-tools-12-2.x86_64 12.2.1-1 @cuda
[root@node002 src]# dnf list gds*
Installed Packages
gds-tools-12-2.x86_64 1.7.1.12-1 @cuda
Installed Packages
nvidia-fs.x86_64 2.17.3-1 @cuda
Check 1: Loaded kernel modules
[root@node002 src]# lsmod | grep nvidia_fs
nvidia_fs 253952 0
nvidia 56508416 3 nvidia_uvm,nvidia_fs,nvidia_modeset
[root@node002 src]# lsmod | grep nvme_core
nvme_core 139264 7 nvme,nvme_fc,nvme_fabrics
t10_pi 16384 3 nvmet,sd_mod,nvme_core
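The module check above can be scripted for quick re-runs. A minimal sketch — the inline text reuses the lsmod output captured above, and the `modinfo -n nvme` path check for MOFED builds is an assumption worth verifying on the host:

```shell
# Sketch: confirm nvidia_fs is loaded, using saved lsmod output.
# On the live host, pipe `lsmod` in directly instead of the variable.
lsmod_out='nvidia_fs 253952 0
nvme_core 139264 7 nvme,nvme_fc,nvme_fabrics'
if printf '%s\n' "$lsmod_out" | grep -q '^nvidia_fs '; then
  echo "nvidia_fs: loaded"
else
  echo "nvidia_fs: NOT loaded"
fi
# Also worth checking which nvme module the kernel resolves; with
# MLNX_OFED's NVMe modules installed, it should come from the mlnx-nvme
# package rather than the inbox driver (assumption, verify locally):
#   modinfo -n nvme
```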
IOMMU
[root@node002 src]# dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=node/vmlinuz.node002 initrd=node/initrd.node002 biosdevname=0 net.ifnames=0 nonm acpi=on nicdelay=0 rd.driver.blacklist=nouveau xdriver=vesa intel_iommu=off console=tty0 ip=192.168.61.92:192.168.61.88:192.168.61.2:255.255.255.0 BOOTIF=01-04-3f-72-dc-06-85
[ 0.000000] Kernel command line: BOOT_IMAGE=node/vmlinuz.node002 initrd=node/initrd.node002 biosdevname=0 net.ifnames=0 nonm acpi=on nicdelay=0 rd.driver.blacklist=nouveau xdriver=vesa intel_iommu=off console=tty0 ip=192.168.61.92:192.168.61.88:192.168.61.2:255.255.255.0 BOOTIF=01-04-3f-72-dc-06-85
[ 0.000000] DMAR: IOMMU disabled
[ 1.551025] iommu: Default domain type: Passthrough
PCIe topology => GPU and NVMe devices under the same PLX switch
[root@node002 ~]# lspci -tv | egrep -i "nvidia | Sams"
| \-02.0-[e3]----00.0 NVIDIA Corporation GA100 [A100 PCIe 80GB]
| \-02.0-[ca]----00.0 NVIDIA Corporation GA100 [A100 PCIe 80GB]
| \-02.0-[65]----00.0 NVIDIA Corporation GA100 [A100 PCIe 80GB]
| +-02.0-[31]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM174X
| +-03.0-[32]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
| \-02.0-[17]----00.0 NVIDIA Corporation GA100 [A100 PCIe 80GB]
[root@node002 ~]#
[root@node002 ~]# lspci | grep -i sams
31:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM174X
32:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
[root@node002 ~]#
[root@node002 ~]# lspci -vv -s 32:00.0 | grep 'ACS' -A2
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
--
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [178 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
[root@node002 ~]#
[root@node002 ~]# lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
65:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
e3:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
[root@node002 ~]#
[root@node002 ~]# lspci -vv -s 17:00.0 | grep 'ACS' -A2
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
--
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
Capabilities: [d00 v1] Lane Margining at the Receiver <?>
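Since an enabled ACS bit on any bridge along the switch path typically blocks GDS peer-to-peer DMA, the per-device checks above can be extended to an audit of the whole tree. A minimal sketch — the sample text is a stand-in (the BDF 30:02.0 is the bridge reported later in cufile.log; the device descriptions are illustrative); on the host feed it `lspci -vvv`:

```shell
# Flag any device whose ACSCtl line has a '+' (enabled) bit set.
# Sample stands in for: lspci -vvv
lspci_out='30:02.0 PCI bridge: sample upstream switch port
    ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
32:00.0 Non-Volatile memory controller: sample Samsung NVMe'
acs_hits=$(printf '%s\n' "$lspci_out" | awk '
  /^[0-9a-f]+:[0-9a-f]+\.[0-7] / { bdf = $1 }   # remember current device BDF
  /ACSCtl:/ && /\+/ { print bdf }               # report if any ACS bit is on
')
echo "devices with ACS enabled: $acs_hits"
```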
[root@node002 src]# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.7.2.11
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.17.5)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Enabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0
[root@node002 src]# cat /proc/driver/nvidia-fs/peer_affinity
GPU P2P DMA distribution based on pci-distance
(last column indicates p2p via root complex)
GPU :0000:ca:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GPU :0000:65:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GPU :0000:e3:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GPU :0000:17:00.0 :0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[root@node002 src]#
[root@node002 src]# cat /proc/driver/nvidia-fs/peer_distance
gpu peer peerrank p2pdist link gen numa np2p class
0000:ca:00.0 0000:98:00.1 0x00820070 0x0082 0x10 0x03 0x01 0 network
0000:ca:00.0 0000:98:00.0 0x00820070 0x0082 0x10 0x03 0x01 0 network
0000:ca:00.0 0000:33:00.0 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:ca:00.0 0000:31:00.0 0x01010090 0x0101 0x04 0x04 0x00 0 nvme
0000:ca:00.0 0000:04:00.1 0x0101009e 0x0101 0x01 0x02 0x00 0 network
0000:ca:00.0 0000:32:00.0 0x01010090 0x0101 0x04 0x04 0x00 0 nvme
0000:ca:00.0 0000:33:00.3 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:ca:00.0 0000:33:00.1 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:ca:00.0 0000:04:00.0 0x0101009e 0x0101 0x01 0x02 0x00 0 network
0000:ca:00.0 0000:33:00.2 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:65:00.0 0000:98:00.1 0x01010070 0x0101 0x10 0x03 0x01 0 network
0000:65:00.0 0000:98:00.0 0x01010070 0x0101 0x10 0x03 0x01 0 network
0000:65:00.0 0000:33:00.0 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:65:00.0 0000:31:00.0 0x00820090 0x0082 0x04 0x04 0x00 0 nvme
0000:65:00.0 0000:04:00.1 0x0082009e 0x0082 0x01 0x02 0x00 0 network
0000:65:00.0 0000:32:00.0 0x00820090 0x0082 0x04 0x04 0x00 0 nvme
0000:65:00.0 0000:33:00.3 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:65:00.0 0000:33:00.1 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:65:00.0 0000:04:00.0 0x0082009e 0x0082 0x01 0x02 0x00 0 network
0000:65:00.0 0000:33:00.2 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:e3:00.0 0000:98:00.1 0x00820070 0x0082 0x10 0x03 0x01 0 network
0000:e3:00.0 0000:98:00.0 0x00820070 0x0082 0x10 0x03 0x01 0 network
0000:e3:00.0 0000:33:00.0 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:e3:00.0 0000:31:00.0 0x01010090 0x0101 0x04 0x04 0x00 0 nvme
0000:e3:00.0 0000:04:00.1 0x0101009e 0x0101 0x01 0x02 0x00 0 network
0000:e3:00.0 0000:32:00.0 0x01010090 0x0101 0x04 0x04 0x00 0 nvme
0000:e3:00.0 0000:33:00.3 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:e3:00.0 0000:33:00.1 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:e3:00.0 0000:04:00.0 0x0101009e 0x0101 0x01 0x02 0x00 0 network
0000:e3:00.0 0000:33:00.2 0x01010088 0x0101 0x08 0x03 0x00 0 network
0000:17:00.0 0000:98:00.1 0x01010070 0x0101 0x10 0x03 0x01 0 network
0000:17:00.0 0000:98:00.0 0x01010070 0x0101 0x10 0x03 0x01 0 network
0000:17:00.0 0000:33:00.0 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:17:00.0 0000:31:00.0 0x00820090 0x0082 0x04 0x04 0x00 0 nvme
0000:17:00.0 0000:04:00.1 0x0082009e 0x0082 0x01 0x02 0x00 0 network
0000:17:00.0 0000:32:00.0 0x00820090 0x0082 0x04 0x04 0x00 0 nvme
0000:17:00.0 0000:33:00.3 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:17:00.0 0000:33:00.1 0x00820088 0x0082 0x08 0x03 0x00 0 network
0000:17:00.0 0000:04:00.0 0x0082009e 0x0082 0x01 0x02 0x00 0 network
0000:17:00.0 0000:33:00.2 0x00820088 0x0082 0x08 0x03 0x00 0 network
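For what it's worth, the table above does report both Samsung controllers (0000:31:00.0 and 0000:32:00.0) in class `nvme` for every GPU, so nvidia-fs can enumerate them as peers. Filtering the table down to NVMe rows is easy to script; a sketch over two rows copied from the output above (column 9 is the peer class, column 4 the p2p distance):

```shell
# Print only NVMe peers per GPU from the peer_distance table.
# On the host: awk '$9 == "nvme"' /proc/driver/nvidia-fs/peer_distance
nvme_rows=$(awk '$9 == "nvme" { print $1, "->", $2, "p2pdist", $4 }' <<'EOF'
0000:ca:00.0 0000:31:00.0 0x01010090 0x0101 0x04 0x04 0x00 0 nvme
0000:ca:00.0 0000:98:00.0 0x00820070 0x0082 0x10 0x03 0x01 0 network
EOF
)
printf '%s\n' "$nvme_rows"
```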
OFED info
[root@node002 src]# ofed_info
MLNX_OFED_LINUX-5.8-3.0.7.0 (OFED-5.8-3.0.7):
clusterkit:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/clusterkit-1.8.428-1.58101.src.rpm
dapl:
mlnx_ofed_dapl/dapl-2.1.10.1.mlnx-OFED.4.9.0.1.5.58033.src.rpm
dpcp:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/dpcp-1.1.37-1.58101.src.rpm
dump_pr:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/dump_pr-1.0-5.13.0.MLNX20221016.gac314ef.58101.src.rpm
hcoll:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/hcoll-4.8.3220-1.58101.src.rpm
ibdump:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/ibdump-6.0.0-1.58101.src.rpm
ibsim:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/ibsim-0.10-1.58101.src.rpm
ibutils2:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/ibutils2-2.1.1-0.156.MLNX20221016.g4aceb16.58101.src.rpm
iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
kernel-mft:
mlnx_ofed_mft/kernel-mft-4.22.1-307.src.rpm
knem:
knem.git mellanox-master
commit a805e8ff50104ac77b20f8a5eb496a71cd7c384c
libvma:
vma/source_rpms/libvma-9.7.2-1.src.rpm
libxlio:
/sw/release/sw_acceleration/xlio/2.0.7/libxlio-2.0.7-1.src.rpm
mlnx-dpdk:
https://github.com/Mellanox/dpdk.org mlnx_dpdk_20.11_last_stable
commit dfeb0f20c5807139a5f250e2ef1d58e9ac0130ce
mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
mlnx-ethtool:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mlnx-ethtool-5.18-1.58101.src.rpm
mlnx-iproute2:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mlnx-iproute2-5.19.0-1.58101.src.rpm
mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
mlnx-tools:
https://github.com/Mellanox/mlnx-tools mlnx_ofed_5_8
commit f7e5694e8371ef0c6a71273ea7755f7023c35517
mlx-steering-dump:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mlx-steering-dump-1.0.0-0.58101.src.rpm
mpi-selector:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mpi-selector-1.0.3-1.58101.src.rpm
mpitests:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mpitests-3.2.20-de56b6b.58101.src.rpm
mstflint:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/mstflint-4.16.1-2.58101.src.rpm
multiperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/multiperf-3.0-3.0.58101.src.rpm
ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc
openmpi:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/openmpi-4.1.5a1-1.58101.src.rpm
opensm:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/opensm-5.13.0.MLNX20221016.10d3954-0.1.58101.src.rpm
openvswitch:
https://gitlab-master.nvidia.com/sdn/ovs mlnx_ofed_5_8_1
commit 0565b8676ac4a40be3a2e07a8ce27a37ac792915
perftest:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/perftest-4.5-0.18.gfcddfe0.58101.src.rpm
rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_5_8
commit 6e6f497a3412148b1e05deda456b000865472dff
rshim:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/rshim-2.0.6-18.g955dbef.src.rpm
sharp:
mlnx_ofed_sharp/sharp-3.1.1.MLNX20221122.c93d7550.tar.gz
sockperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-1.0.1/SRPMS/sockperf-3.10-0.git5ebd327da983.58101.src.rpm
srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_8
commit 65e3aec417045faa2228224b4a9fb74c02742860
ucx:
mlnx_ofed_ucx/ucx-1.14.0-1.src.rpm
xpmem:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.8-3.0.5/SRPMS/xpmem-2.6.4-1.58305.src.rpm
Installed Packages:
-------------------
librdmacm-utils
dapl-utils
dpcp
ucx-knem
mlnxofed-docs
mlnx-tools
knem-modules
libxpmem
libibverbs-utils
opensm-devel
sharp
openmpi
mlnx-ofa_kernel-source
isert
libibumad
mlnx-ofa_kernel
kernel-mft
opensm
dapl-devel
mstflint
dump_pr
ucx-devel
ucx-rdmacm
hcoll
mlnx-iproute2
xpmem-modules
mlnx-nfsrdma
infiniband-diags
ibacm
mlnx-ofa_kernel-modules
knem
opensm-libs
dapl
perftest
ibutils2
ucx
ucx-ib
ucx-xpmem
mlnx-ethtool
mpitests_openmpi
iser
mlnx-nvme
rdma-core
rdma-core-devel
opensm-static
srp_daemon
ucx-cma
hcoll-cuda
mlnx-ofa_kernel-devel
srp
librdmacm
dapl-devel-static
ibdump
ucx-cuda
rshim
libibverbs
xpmem
mpi-selector
ibsim
gdscheck.py
[root@node002 src]# /usr/local/cuda-12.2/gds/tools/gdscheck.py -p
GDS release version: 1.7.1.12
nvidia_fs version: 2.17 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : false
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
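One note on the configuration above: `properties.use_compat_mode : false` means that while NVMe shows as Unsupported, cuFile calls fail outright instead of falling back to POSIX I/O. As a stop-gap while debugging, compat mode can be re-enabled in `/etc/cufile.json` — a minimal fragment, with the key mirroring the property name printed above:

```json
{
    "properties": {
        "use_compat_mode": true
    }
}
```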
=========
GPU INFO:
=========
GPU index 0 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 2 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
GPU index 3 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
nvidia-smi
[root@node002 src]# nvidia-smi
Thu Nov 9 14:57:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 42C P0 61W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:65:00.0 Off | 0 |
| N/A 41C P0 64W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 54C P0 78W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:E3:00.0 Off | Off |
| N/A 49C P0 73W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
cufile.log
02-11-2023 11:42:01:926 [pid=3540949 tid=3540949] NOTICE cufio-fs:408 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:30:02.0,device/transport:pcie,fsid:b6aba42f597f4a560x,numa_node:0,queue/logical_block_size:4096,wwid:eui.36544d3052b001630025384700000001,
02-11-2023 11:42:01:926 [pid=3540949 tid=3540949] NOTICE cufio:1036 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
05-11-2023 15:01:56:615 [pid=38031 tid=38031] ERROR cufio:480 cuInit Failed, error CUDA_ERROR_NO_DEVICE
05-11-2023 15:01:56:615 [pid=38031 tid=38031] ERROR cufio:583 cuFile initialization failed
06-11-2023 18:40:43:90 [pid=59193 tid=59193] NOTICE cufio-drv:720 running in compatible mode
06-11-2023 20:12:02:865 [pid=157254 tid=157254] ERROR cufio-drv:716 nvidia-fs.ko driver not loaded
06-11-2023 20:19:09:497 [pid=164679 tid=164679] ERROR cufio-drv:716 nvidia-fs.ko driver not loaded
07-11-2023 16:30:39:272 [pid=15680 tid=15680] ERROR cufio-drv:716 nvidia-fs.ko driver not loaded
07-11-2023 16:58:54:420 [pid=45874 tid=45874] ERROR cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
07-11-2023 16:58:54:421 [pid=45874 tid=45874] NOTICE cufio-fs:441 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:30:03.0,device/transport:pcie,ext4_journal_mode:ordered,fsid:f0578b196d5913c20x,numa_node:0,queue/logical_block_size:4096,wwid:eui.36344830526001490025384500000001,
07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR cufio:296 cuFileHandleRegister error, file checks failed
07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR cufio:338 cuFileHandleRegister error: GPUDirect Storage not supported on current file
07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
07-11-2023 17:00:42:361 [pid=47786 tid=47786] NOTICE cufio-fs:441 dumping volume attributes: DEVNAME:/dev/nvme1n1,ID_FS_TYPE:ext4,ID_FS_USAGE:filesystem,UDEV_PCI_BRIDGE:0000:30:03.0,device/transport:pcie,ext4_journal_mode:ordered,fsid:f0578b196d5913c20x,numa_node:0,queue/logical_block_size:4096,wwid:eui.36344830526001490025384500000001,
07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR cufio:296 cuFileHandleRegister error, file checks failed
07-11-2023 17:00:42:361 [pid=47786 tid=47786] ERROR cufio:338 cuFileHandleRegister error: GPUDirect Storage not supported on current file
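The cufile.log excerpt mixes several failure modes over time (no CUDA device, nvidia-fs.ko not loaded, and finally the "NVMe Driver not registered" state). Tallying errors by their component:line tag makes the progression easier to see; a small awk sketch over lines copied from the log above:

```shell
# Tally cufile.log ERROR lines by component:line tag (field 6).
# On the host: awk '/ ERROR /{n[$6]++} END{for(t in n) print n[t], t}' cufile.log
counts=$(awk '/ ERROR / { n[$6]++ } END { for (t in n) print n[t], t }' <<'EOF'
07-11-2023 16:58:54:420 [pid=45874 tid=45874] ERROR cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR cufio-fs:199 NVMe Driver not registered with nvidia-fs!!!
07-11-2023 16:58:54:421 [pid=45874 tid=45874] ERROR cufio:296 cuFileHandleRegister error, file checks failed
EOF
)
printf '%s\n' "$counts"
```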
Please let me know if you need more information.