linux-nvme/nvme-stas

Scaling issues with nvme-stas

martin-gpy opened this issue · 8 comments

On a scaled-up SLES15 SP5 MU host (roughly 450 namespaces from 80 subsystems, with 2 NVMe/TCP controllers each), one ends up seeing the following stacd errors in /var/log/messages during I/O testing with faults:

stacd[26003]: Udev._process_udev_event()         - Error while polling fd: 3 [90414]
stacd[26003]: Udev._process_udev_event()         - Error while polling fd: 3 [90394]
stacd[26003]: Udev._process_udev_event()         - Error while polling fd: 3 [90580]
...

This manifests as dropped connections, failed paths, I/O errors, etc. on the SP5 host.
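
For context on where that message comes from: stacd watches kernel uevents through a pyudev netlink monitor plugged into the GLib main loop, and the error is logged when the watch on the monitor's file descriptor reports an error condition. The sketch below is only an approximation of that wiring (the names and the handler body are illustrative, not the actual nvme-stas code); one way such an error can occur is a burst of uevents overrunning the netlink receive buffer.

# Rough sketch (not the actual nvme-stas code) of how a pyudev netlink
# monitor is typically hooked into the GLib main loop, and where an error
# condition on the monitor's file descriptor would surface.
import pyudev
from gi.repository import GLib

context = pyudev.Context()
monitor = pyudev.Monitor.from_netlink(context)
monitor.filter_by(subsystem='nvme')
monitor.start()

def _process_udev_event(channel, condition):
    if condition & (GLib.IO_ERR | GLib.IO_HUP):
        # A burst of uevents can overrun the netlink receive buffer,
        # which shows up here as an error condition on the polled fd.
        print(f'Error while polling fd: {monitor.fileno()}')
        return GLib.SOURCE_CONTINUE
    device = monitor.poll(timeout=0)          # non-blocking read of one event
    if device is not None:
        print(f'{device.action}: {device.sys_name}')
    return GLib.SOURCE_CONTINUE

channel = GLib.IOChannel.unix_new(monitor.fileno())
GLib.io_add_watch(channel, GLib.PRIORITY_DEFAULT,
                  GLib.IO_IN | GLib.IO_ERR | GLib.IO_HUP,
                  _process_udev_event)
GLib.MainLoop().run()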

Config details below:

# uname -r
5.14.21-150500.55.7-default
# rpm -qa | grep nvme
libnvme-devel-1.4+27.g5ae1c39-150500.4.3.1.26528.1.PTF.1212598.x86_64
nvme-stas-2.2.2-150500.3.6.1.x86_64
nvme-cli-bash-completion-2.4+24.ga1ee2099-150500.4.3.1.26528.1.PTF.1212598.noarch
libnvme1-1.4+27.g5ae1c39-150500.4.3.1.26528.1.PTF.1212598.x86_64
nvme-cli-zsh-completion-2.4+24.ga1ee2099-150500.4.3.1.26528.1.PTF.1212598.noarch
python3-libnvme-1.4+27.g5ae1c39-150500.4.3.1.26528.1.PTF.1212598.x86_64
nvme-cli-2.4+24.ga1ee2099-150500.4.3.1.26528.1.PTF.1212598.x86_64
libnvme-mi1-1.4+27.g5ae1c39-150500.4.3.1.26528.1.PTF.1212598.x86_64

Hi @martin-gpy - I've been trying to reproduce the issue. I don't have access to real subsystems, so I'm using nvmet to simulate the many subsystems/namespaces that you have.

I created 80 subsystems, each having 6 namespaces, for a total of 480 namespaces. I simulated adding/removing subsystems to generate kernel events, but was not able to reproduce the issue.
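
For anyone else trying to reproduce this, a soft target of roughly that size can be stood up through the nvmet configfs interface. The sketch below is only illustrative and is not the exact setup used here: the NQNs, listening address and backing devices are made up, it needs root, and it assumes the nvmet and nvmet-tcp modules are loaded and enough backing block devices exist.

# Illustrative sketch: create 80 NVMe/TCP soft-target subsystems with
# 6 namespaces each via the nvmet configfs interface.
import os

NVMET = '/sys/kernel/config/nvmet'

def write(path, value):
    with open(path, 'w') as f:
        f.write(str(value))

# One TCP port on which all subsystems are exported.
port = os.path.join(NVMET, 'ports', '1')
os.makedirs(port, exist_ok=True)
write(os.path.join(port, 'addr_trtype'), 'tcp')
write(os.path.join(port, 'addr_adrfam'), 'ipv4')
write(os.path.join(port, 'addr_traddr'), '127.0.0.1')
write(os.path.join(port, 'addr_trsvcid'), '4420')

for s in range(80):
    nqn = f'nqn.2023-01.org.example:subsys{s:02d}'    # made-up NQN
    subsys = os.path.join(NVMET, 'subsystems', nqn)
    os.makedirs(subsys, exist_ok=True)
    write(os.path.join(subsys, 'attr_allow_any_host'), 1)
    for nsid in range(1, 7):                          # 6 namespaces per subsystem
        ns = os.path.join(subsys, 'namespaces', str(nsid))
        os.makedirs(ns, exist_ok=True)
        # Backing devices are made up; brd ram disks, null_blk or
        # file-backed loop devices all work here.
        write(os.path.join(ns, 'device_path'), f'/dev/ram{s * 6 + nsid - 1}')
        write(os.path.join(ns, 'enable'), 1)
    # Export the subsystem on the TCP port.
    os.symlink(subsys, os.path.join(port, 'subsystems', nqn))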

I was wondering if you're doing anything special to make the issue happen. For example, are you simulating network outages that would result in lots of keep-alive timeouts? In other words, is there a lot of activity going on that would result in lots of kernel events being generated?

Yes, that's right. We are running I/O and triggering storage failover events on the target end, so that results in a lot of path perturbations and associated events on the host, given the scale at which we are running this.

@martin-gpy - I tried to optimize uevent handling. Not sure it's going to make much difference. I was wondering if you could try it, since I don't have a setup to reproduce your large-scale system. You will need to build the latest code from the main branch.
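
To give a rough idea of the kind of optimization meant here (this is an illustrative sketch only, not the actual change in the main branch): rather than doing a full round of processing for every single uevent, the monitor callback can drain everything queued in one wakeup and coalesce duplicates per device, so a burst of events costs a single pass.

# Illustrative only; not the actual commit. A per-event callback is
# replaced by one that drains the whole uevent queue and keeps only the
# most recent event per device before dispatching.
import pyudev
from gi.repository import GLib

def _drain_udev_events(channel, condition, monitor):
    if condition & (GLib.IO_ERR | GLib.IO_HUP):
        return GLib.SOURCE_CONTINUE
    latest = {}                               # sys_name -> most recent event
    while True:
        device = monitor.poll(timeout=0)      # None once the queue is empty
        if device is None:
            break
        latest[device.sys_name] = device
    for device in latest.values():
        print(f'{device.action}: {device.sys_name}')  # real code would dispatch here
    return GLib.SOURCE_CONTINUE

# Hooked up the same way as a plain per-event callback, with the monitor
# passed as user data:
#   GLib.io_add_watch(channel, GLib.PRIORITY_DEFAULT,
#                     GLib.IO_IN | GLib.IO_ERR | GLib.IO_HUP,
#                     _drain_udev_events, monitor)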

If this doesn't fix the issue, I'm afraid there's not much I can do. The code is trying to handle the events as quickly as possible, but this cannot scale indefinitely. This raises the question: are 80 subsystems with 450 namespaces even a realistic setup?

OK. Will give it a try with your latest fixes and let you know.

Hey, would it be possible to provide a SLES15 SP5-based nvme-stas package (the current MU is nvme-stas-2.2.2-150500.3.6.1) containing these latest fixes? It would make it easier for us to try them out...

Unfortunately, the latest nvme-stas code (ver. 2.3) hasn't been through the full regression testing yet and cannot be included in an official release. Our test team does not have the time right now to fully test all the changes.

Successfully tested on a scaled-up SLES15 SP5 MU host (500 namespaces from 80 subsystems, with 2 controllers each, per SP5 host) using an nvme-stas test package containing the fixes below:

udev: Optimize uevent handling - 41add87

iputil: Reduce amount of netlink requests to the kernel - 06748ad

The Error while polling fd: 3 errors no longer appear in the host's /var/log/messages during storage failover operations. So these fixes look good. Thanks.

Great to hear that the changes improved things for you.

While this is good news, we need to keep in mind that there is a theoretical limit to the number of namespaces/subsystems/controllers that can be instantiated. Different factors, such as the amount of CPU available, will influence that limit.

I'm confident that in most cases users will have far fewer namespaces/subsystems than what you tested with. In other words, I do not plan to improve things beyond this.

Closing this as resolved.