tinygrad/open-gpu-kernel-modules

Low performance when running over NVLink

sheepymeh opened this issue · 6 comments

NVIDIA Open GPU Kernel Modules Version

Comparing with NVIDIA commit 12933b2

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

5.15.0-102-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3090

Describe the bug

Thank you for this project! It seems to be working well on 3090s. However, NVLink seems to underperform with this fork.

In the results below, the variation in performance between the PCIe-connected GPUs is caused by differing PCIe generations and lane counts. GPUs 2 and 3 are connected via NVLink (4 links, 56.25 GB/s theoretical unidirectional bandwidth); they are also connected via PCIe Gen 4 x8 (25 GB/s theoretical unidirectional bandwidth).
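
For reference, here is a minimal sketch (the device indices 2 and 3 for the NVLinked pair are an assumption matching the tables below) that asks the CUDA runtime whether peer access is possible between the two GPUs and which performance rank it assigns to the path. With the NVLink bridge in use, one would expect peer access to be reported and this pair to get the best (lowest) rank:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumed device indices: GPUs 2 and 3 are the NVLinked pair (adjust as needed).
    const int dev_a = 2, dev_b = 3;

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dev_a, dev_b);
    printf("GPU %d -> GPU %d peer access supported: %d\n", dev_a, dev_b, can_access);

    // A lower performance rank means a faster P2P path as seen by the runtime/driver.
    int rank = -1;
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, dev_a, dev_b);
    printf("GPU %d -> GPU %d P2P performance rank: %d\n", dev_a, dev_b, rank);
    return 0;
}
```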

Running p2pBandwidthLatencyTest with this fork:

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.80  11.33  11.67  10.49  15.66  11.40  11.11
     1  11.37 812.92   8.92   8.93  11.38   8.94  11.40
     2  11.23   8.94 838.70   8.97  11.14   8.98  11.27
     3  11.20   8.90   8.91 838.00  11.12   8.92  11.25
     4  15.48  11.35  11.57  11.55 838.93  11.39  16.07
     5  11.34   8.90   8.95   8.93  11.38 838.03  11.31
     6  15.86  11.39  10.57  11.67  16.05  10.95 838.48

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.13  25.42  26.00  25.98  50.82  25.46  51.20
     1  25.45 838.25  25.40  25.50  25.46  25.52  25.45
     2  25.95  25.45 837.58  17.27  25.99  25.46  25.99
     3  25.99  25.50  17.04 835.34  25.99  25.46  25.99
     4  50.18  25.46  26.00  25.98 838.25  25.42  51.21
     5  25.46  25.57  25.41  25.51  25.38 837.35  25.47
     6  50.20  25.46  25.99  25.98  51.22  25.47 839.83

With the original open-source driver:

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.13  11.36  11.66  10.49  15.65  11.37  15.92
     1  11.43 830.23   8.88   8.92  11.41   8.95  11.38
     2  11.18   8.93 837.80   8.97  11.13   8.99  11.26
     3  11.21   8.91   8.91 839.60  11.13   8.91  11.26
     4  15.51  11.38  11.56  11.57 838.70  11.41  16.01
     5  11.34   8.97   8.93   8.94  11.35 838.67  11.28
     6  15.86  11.35  11.66  11.68  11.66  11.27 838.03

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.56  11.35  11.66  11.65  15.60  11.32  15.92
     1  11.42 838.66   8.94   8.94  11.37   8.94  11.38
     2  11.21   8.94 838.70 101.69  11.14   8.94  11.26
     3  11.19   8.97 101.91 837.80  11.11   8.92  11.26
     4  15.50  11.37  11.57  11.57 838.48  11.37  15.84
     5  11.31   8.95   8.93   8.94  11.33 838.03  11.28
     6  15.80  11.35  11.70  10.43  16.07  11.28 838.93

We can see that the P2P driver improves PCIe peer-to-peer performance as expected with this fork (e.g. 15.80 GB/s -> 50.20 GB/s). However, the NVLink performance between GPUs 2 and 3 decreases from ~100 GB/s to ~17 GB/s.

To Reproduce

Run p2pBandwidthLatencyTest and compare the results against the original NVIDIA open-source driver.
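
For anyone without the CUDA samples handy, a minimal sketch of what one cell of the matrix measures (the device indices, buffer size, and iteration count here are arbitrary assumptions, and error checking is omitted): it enables peer access between the assumed NVLinked pair and times a unidirectional device-to-device copy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumed: GPUs 2 and 3 are the NVLinked pair; 256 MiB buffer, 20 timed copies.
    const int src = 2, dst = 3;
    const size_t bytes = 256u << 20;
    const int iters = 20;

    // Enable peer access in both directions (this is what "P2P=Enabled" means).
    cudaSetDevice(src); cudaDeviceEnablePeerAccess(dst, 0);
    cudaSetDevice(dst); cudaDeviceEnablePeerAccess(src, 0);

    void *buf_src, *buf_dst;
    cudaSetDevice(src); cudaMalloc(&buf_src, bytes);
    cudaSetDevice(dst); cudaMalloc(&buf_dst, bytes);

    // Time the copies on the source device's default stream.
    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        cudaMemcpyPeerAsync(buf_dst, dst, buf_src, src, bytes, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU %d -> GPU %d: %.2f GB/s\n", src, dst,
           (double)bytes * iters / (ms * 1e-3) / 1e9);
    return 0;
}
```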

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

Experiencing the same issue.

Ahh, yeah, this is real, and glad to see it working with 3090s. I have only tested on 4090s, where there's no NVLink to worry about.

The driver is forcing P2P to go over PCIe; I'm sure there's a way to avoid forcing that. I would merge a PR that fixes this, and I doubt it's too hard. Though we are only maintaining this driver for the tinybox, so the fix would have to come from an external contributor.
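
A quick way to sanity-check one half of this, assuming the problem is routing rather than the links themselves, is to ask NVML whether the NVLink links are still reported as active under the fork (GPU index 2 is just an example for one side of the bridge, and the file name in the build comment is illustrative):

```cuda
#include <cstdio>
#include <nvml.h>

// Build with something like: nvcc check_nvlink.cu -lnvidia-ml
int main() {
    nvmlInit_v2();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex_v2(2, &dev);  // GPU 2: one side of the NVLink bridge (assumed)

    // A 3090 exposes up to 4 NVLink links; NVML_NVLINK_MAX_LINKS is just the NVML upper bound.
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; link++) {
        nvmlEnableState_t active;
        if (nvmlDeviceGetNvLinkState(dev, link, &active) == NVML_SUCCESS)
            printf("link %u: %s\n", link,
                   active == NVML_FEATURE_ENABLED ? "active" : "inactive");
    }

    nvmlShutdown();
    return 0;
}
```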

@geohot if you connect two tinyboxes together, will that allow a GPU in one box to communicate P2P with a GPU in the second box if you connect them with Mellanox adapter cards in the OCP slots?

@zvorinji I've thought about this before, but it doesn't seem practical, either economically or technically.

  1. To match PCIe speeds over a Mellanox adapter (e.g. PCIe 4.0 x16, roughly 64 GB/s bidirectional), you would need at least a 500G adapter (and only 400G or 800G products actually exist?), which is very expensive. You would also give up at least one PCIe 4.0 x16 slot in each machine, and I'm not sure whether two IB/RDMA-capable adapters can actually sustain that bandwidth.
  2. I'm also not sure whether enabling P2P in this driver means RDMA is enabled as well; otherwise you would need two more RDMA-capable GPUs to move the P2P data inside each machine before it can be transferred between the two machines.
    Correct me if I'm wrong since I'm new to this area, but I also want to build a low-cost inference cluster :)

Wondering if anybody has found a workaround; I'm planning on using this driver with my NVLinked 3090s.

I'm also curious about this. It would be nice to be able to use this with NVLinked 3090s.