spacejam/sled

io_uring: io_uring_setup syscall returns ENOMEM

Closed this issue · 5 comments

The io_uring_setup syscall returns ENOMEM when the code tries to allocate too many io_urings. Specifically, the test cargo test --features=testing,io_uring log_chunky_iterator spawns 100 threads, and io_uring_setup starts returning ENOMEM from roughly the 20th ring onward. This happens at the very beginning of io_uring setup, before the queues are even mmapped. I will trace the kernel functions to better understand where this limit comes from, but the fix in any case is to reduce the test's io_uring setup parallelism.
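For context, the failing step can be reproduced outside the test suite. Below is a minimal sketch (not sled code; the 256-entry ring size and the 100-ring loop are arbitrary stand-ins for the test's threads, and it assumes a >= 5.1 kernel with headers that expose __NR_io_uring_setup):

	#include <errno.h>
	#include <linux/io_uring.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void) {
		/* Create one ring per iteration and keep every fd open, so the
		 * pages accounted against the user's memlock budget accumulate. */
		for (int i = 0; i < 100; i++) {
			struct io_uring_params p;
			memset(&p, 0, sizeof(p));
			long fd = syscall(__NR_io_uring_setup, 256, &p);
			if (fd < 0) {
				/* ENOMEM is expected here once the budget is exhausted. */
				printf("ring %d failed: %s\n", i, strerror(errno));
				return 1;
			}
			printf("ring %d ok, fd=%ld\n", i, fd);
		}
		return 0;
	}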

It can be clearly observed with:

$ strace -f env cargo test --features=testing,io_uring log_chunky_iterator

Tracing notes: http://blog.vmsplice.net/2019/08/determining-why-linux-syscall-failed.html

$ sudo trace-cmd record -p function_graph -g __x64_sys_io_uring_setup
$ sudo trace-cmd report --cpu 0

It is indeed the case that the execution returns ENOMEM at:

log_chunky_iter-27409 [000] 40690.231487: funcgraph_exit:       ! 238.103 us |  } <-- good invocation
 log_chunky_iter-27466 [000] 40690.232135: funcgraph_exit:       ! 459.701 us |  } <-- good invocation
 log_chunky_iter-27387 [000] 40690.241284: funcgraph_exit:         2.434 us   |  } <-- bad invocation
 log_chunky_iter-27394 [000] 40690.243183: funcgraph_exit:         2.269 us   |  }
...
 log_chunky_iter-27437 [000] 40690.250563: funcgraph_entry:                   |  __x64_sys_io_uring_setup() {
 log_chunky_iter-27437 [000] 40690.250563: funcgraph_entry:                   |    io_uring_setup() {
 log_chunky_iter-27437 [000] 40690.250564: funcgraph_entry:                   |      capable() {
 log_chunky_iter-27437 [000] 40690.250564: funcgraph_entry:                   |        ns_capable_common() {
 log_chunky_iter-27437 [000] 40690.250564: funcgraph_entry:                   |          security_capable() {
 log_chunky_iter-27437 [000] 40690.250564: funcgraph_entry:                   |            cap_capable() {
 log_chunky_iter-27437 [000] 40690.250564: funcgraph_exit:         0.152 us   |            }
 log_chunky_iter-27437 [000] 40690.250564: funcgraph_exit:         0.489 us   |          }
 log_chunky_iter-27437 [000] 40690.250565: funcgraph_exit:         0.747 us   |        }
 log_chunky_iter-27437 [000] 40690.250565: funcgraph_exit:         0.984 us   |      }
 log_chunky_iter-27437 [000] 40690.250565: funcgraph_entry:                   |      free_uid() {
 log_chunky_iter-27437 [000] 40690.250565: funcgraph_exit:         0.143 us   |      }
 log_chunky_iter-27437 [000] 40690.250565: funcgraph_exit:         1.618 us   |    }
 log_chunky_iter-27437 [000] 40690.250565: funcgraph_exit:         2.215 us   |  }

From the 5.3.0 kernel (fs/io_uring.c):

	account_mem = !capable(CAP_IPC_LOCK);
	if (account_mem) {
		ret = io_account_mem(user,
				ring_pages(p->sq_entries, p->cq_entries));
		if (ret) {
			free_uid(user);
			return ret;
		}
	}

	ctx = io_ring_ctx_alloc(p);
	if (!ctx) {
		if (account_mem)
			io_unaccount_mem(user, ring_pages(p->sq_entries,
								p->cq_entries));
		free_uid(user);
		return -ENOMEM;
	}

So basically it fails in the accounting step (io_account_mem() above), at:

	/* Don't allow more pages than we can safely lock */
	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

i.e. the number of pages the kernel will account as locked for the user is capped by RLIMIT_MEMLOCK.
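To illustrate the arithmetic, the same budget can be computed from userspace; this is just a sketch using getrlimit() (the shift by PAGE_SHIFT is equivalent to dividing by the page size):

	#include <stdio.h>
	#include <sys/resource.h>
	#include <unistd.h>

	int main(void) {
		struct rlimit rl;
		if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
			return 1;
		if (rl.rlim_cur == RLIM_INFINITY) {
			printf("RLIMIT_MEMLOCK is unlimited\n");
			return 0;
		}
		long page_size = sysconf(_SC_PAGESIZE);
		/* Mirrors the kernel's: page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT.
		 * With a common 64 KiB soft limit and 4 KiB pages this is only 16 pages,
		 * which a few dozen rings' SQ/CQ allocations exhaust quickly. */
		unsigned long pages = (unsigned long)rl.rlim_cur / (unsigned long)page_size;
		printf("RLIMIT_MEMLOCK soft = %llu bytes -> %lu lockable pages\n",
		       (unsigned long long)rl.rlim_cur, pages);
		return 0;
	}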

axboe commented

This can be worked around by raising the per-user memlock rlimit. It's generally pretty low on systems. See /etc/security/limits.{d,conf}
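For example, an entry along these lines in /etc/security/limits.conf (or a drop-in file under /etc/security/limits.d/) raises it; the values are in KiB, 4096 matches what turned out to be enough for the tests below, and <user> is a placeholder for the account running them. The active value can be checked with ulimit -l after logging in again:

	# maximum locked-in-memory address space, in KiB
	<user>  soft  memlock  4096
	<user>  hard  memlock  4096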

@axboe thanks!

@sitano thanks for diving into this! that definitely helps clarify things for me around why this was happening

Yeah, the only option is increasing the allowed amount of locked memory pages.
4096 KB is enough for all tests to pass.