occlum/ngo

[BUG] Running iperf and the client side never exits

Opened this issue · 1 comments

Describe the bug

To test NGO network performance, run iperf server-side on host Linux, and run client inside NGO. And the client-side never exits. From the log, I can see the client hanging at a Recvfrom syscall.

With the same usage, running in Occlum can pass.

To reproduce

Steps to reproduce the behavior:

  1. Run iperf server on host: ./iperf -s -p 6888
  2. Run iperf client in Occlum: occlum run /bin/iperf -c 127.0.0.1 -p 6888 -P 1 -t 1

Expected behavior

The client-side should exit when the test ends.

Logs

Server-side (strace):
After the test ended, the server close the socket (fd = 4) and exit the thread. However, the client-side didn't respond to the socket close and kept waiting.

[pid 499387] 06:58:15.887235 <... recvfrom resumed> "", 131072, 0, NULL, NULL) = 0
[pid 499387] 06:58:15.887596 setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}, NULL) = 0
[pid 499387] 06:58:15.888252 futex(0x7fffe80012c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1653548296, tv_nsec=888089788}, 0xffffffff <unfinished ...>
[pid 499290] 06:58:15.892903 <... clock_nanosleep resumed> NULL) = 0
[pid 499290] 06:58:15.893203 futex(0x7fffe80012c0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 499387] 06:58:15.893509 <... futex resumed> ) = 0
[pid 499290] 06:58:15.893694 write(1, "[ ID] Interval       Transfer   "..., 93[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-3.13 sec   305 MBytes   816 Mbits/sec
 <unfinished ...>
[pid 499387] 06:58:15.893737 futex(0x7fffe80012c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 499290] 06:58:15.893747 <... write resumed> ) = 93
[pid 499387] 06:58:15.893755 <... futex resumed> ) = 0
[pid 499387] 06:58:15.893773 futex(0x7fffe80012c4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1653548296, tv_nsec=893715811}, 0xffffffff <unfinished ...>
[pid 499290] 06:58:15.893786 futex(0x7fffe80012c4, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 499387] 06:58:15.893795 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 499290] 06:58:15.893805 <... futex resumed> ) = 0
[pid 499387] 06:58:15.893813 futex(0x7fffe80012c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 499290] 06:58:15.893823 futex(0x555555591b2c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1653548296, tv_nsec=893817602}, 0xffffffff <unfinished ...>
[pid 499387] 06:58:15.893832 <... futex resumed> ) = 0
[pid 499387] 06:58:15.893851 close(4)   = 0

Client-side:
When the transmission ended, the client-side shut down write and looked like trying to receive a control message by Recvfrom and the second Recvfrom never returns.

�[0m[2022-05-26T06:58:13.758Z][TRACE][C:4][T4][#4907][Setitimer] Syscall { num = Setitimer }�[0m
�[0m[2022-05-26T06:58:13.758Z][TRACE][C:4][T4][#4907][Setitimer] ret = 0xffffffffffffffda�[0m
�[31m[2022-05-26T06:58:13.759Z][ERROR][C:4][T4][#4907][Setitimer] Error = ENOSYS (#38, Function not implemented): Unimplemented or unknown syscall [line = 854, file = src/entry/syscall.rs]�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4908][Shutdown] Syscall { num = Shutdown, fd = 3, how = 1 }�[0m
�[0m[2022-05-26T06:58:13.759Z][DEBUG][C:4][T4][#4908][Shutdown] shutdown: fd: 3, how: 1�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4908][Shutdown] ret = 0x0�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4909][ClockGettime] Syscall { num = ClockGettime, clockid = 0, ts_u = 0x7fff2ca94cf0 }�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4909][ClockGettime] ret = 0x0�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4910][Setsockopt] Syscall { num = Setsockopt, fd = 3, level = 1, optname = 20, optval = 0x7fff2ca94cf0, optlen = 16 }�[0m
�[0m[2022-05-26T06:58:13.759Z][DEBUG][C:4][T4][#4910][Setsockopt] setsockopt: fd: 3, level: 1, optname: 20, optval: 0x7fff2ca94cf0, optlen: 16�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4910][Setsockopt] ret = 0x0�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4911][Recvfrom] Syscall { num = Recvfrom, fd = 3, base = 0x7fff2c252010, len = 131072, flags = 0, addr = 0x0, addr_len = 0x0 }�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4911][Recvfrom] ret = 0x1c�[0m
�[0m[2022-05-26T06:58:13.759Z][TRACE][C:4][T4][#4912][Recvfrom] Syscall { num = Recvfrom, fd = 3, base = 0x7fff2c252010, len = 131072, flags = 0, addr = 0x0, addr_len = 0x0 }�[0m
�[0m[2022-05-26T06:58:13.765Z][DEBUG][C:0] Timer Wheel: will sleep 10s�[0m
�[0m[2022-05-26T06:58:13.765Z][TRACE][C:3][T3][#74][ClockNanosleep] ret = 0x0�[0m
�[0m[2022-05-26T06:58:13.765Z][TRACE][C:3][T3][#75][ClockNanosleep] Syscall { num = ClockNanosleep, clockid = 1, flags = 0, request = 0x7fff2c693e00, remain = 0x0 }�[0m
�[0m[2022-05-26T06:58:13.765Z][DEBUG][C:3][T3][#75][ClockNanosleep] Timer Wheel: try waking�[0m
�[0m[2022-05-26T06:58:13.765Z][DEBUG][C:0] Timer Wheel: woken up�[0m
�[0m[2022-05-26T06:58:13.782Z][DEBUG][C:0] Timer Wheel: will sleep 10s�[0m
�[0m[2022-05-26T06:58:13.782Z][TRACE][C:3][T3][#75][ClockNanosleep] ret = 0x0�[0m
�[0m[2022-05-26T06:58:13.782Z][TRACE][C:3][T3][#76][ClockNanosleep] Syscall { num = ClockNanosleep, clockid = 1, flags = 0, request = 0x7fff2c693e00, remain = 0x0 }�[0m
�[0m[2022-05-26T06:58:13.782Z][DEBUG][C:3][T3][#76][ClockNanosleep] Timer Wheel: try waking�[0m
�[0m[2022-05-26T06:58:13.783Z][DEBUG][C:0] Timer Wheel: woken up�[0m
�[0m[2022-05-26T06:58:13.799Z][DEBUG][C:0] Timer Wheel: will sleep 10s�[0m

Environment

  • HW: [e.g. SGX1, SGX2] SGX2
  • OS: [e.g. Ubuntu18.04, CentOS8.1] Ubuntu 20.04
  • Occlum version: [e.g. 0.17.0] NGO master (f9a8448)

Additional context

Compared with the client log running in Occlum, the second Recvfrom return when the server closes the socket.
Server-side:

[pid 498825] 06:22:58.381262 futex(0x7fffe80012c0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 498857] 06:22:58.381561 <... futex resumed> ) = 0
[pid 498825] 06:22:58.381605 <... futex resumed> ) = 1
[pid 498857] 06:22:58.381868 futex(0x7fffe80012c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 498825] 06:22:58.381927 write(1, "[ ID] Interval       Transfer   "..., 93 <unfinished ...>
[pid 498857] 06:22:58.382026 <... futex resumed> ) = 0
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.00 sec   830 MBytes  6.93 Gbits/sec
[pid 498825] 06:22:58.382107 <... write resumed> ) = 93
[pid 498825] 06:22:58.382188 futex(0x7fffe80012c4, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 498825] 06:22:58.382264 futex(0x555555591b2c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1653546179, tv_nsec=382243969}, 0xffffffff <unfinished ...>
[pid 498857] 06:22:58.382298 futex(0x7fffe80012c8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 498857] 06:22:58.382556 close(4)   = 0
[pid 498857] 06:22:58.383015 munmap(0x7ffff5f53000, 2641920) = 0
[pid 498857] 06:22:58.383613 futex(0x555555591be8, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 498857] 06:22:58.383716 munmap(0x7ffff69fa000, 135168 <unfinished ...>
[pid 498824] 06:22:58.384010 <... futex resumed> ) = 0

Client-side:

[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13323][Setitimer] Retval = 0xffffffffffffffda�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13324][Shutdown] Syscall { num = Shutdown, fd = 3, how = 1 }�[0m
�[0m[2022-05-26T06:22:58.371Z][DEBUG][T4][#13324][Shutdown] shutdown: fd: 3, how: 1�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13324][Shutdown] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13325][ClockGettime] Syscall { num = ClockGettime, clockid = 0, ts_u = 0x7fff0f4bfcf0 }�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13325][ClockGettime] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13326][Setsockopt] Syscall { num = Setsockopt, fd = 3, level = 1, optname = 20, optval = 0x7fff0f4bfcf0, optlen = 16 }�[0m
�[0m[2022-05-26T06:22:58.371Z][DEBUG][T4][#13326][Setsockopt] setsockopt: fd: 3, level: 1, optname: 20, optval: 0x7fff0f4bfcf0, optlen: 16�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13326][Setsockopt] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.371Z][TRACE][T4][#13327][Recvfrom] Syscall { num = Recvfrom, fd = 3, base = 0x7fff0ec7d010, len = 131072, flags = 0, addr = 0x0, addr_len = 0x0 }�[0m
�[0m[2022-05-26T06:22:58.372Z][TRACE][T4][#13327][Recvfrom] Retval = 0x1c�[0m
�[0m[2022-05-26T06:22:58.372Z][TRACE][T4][#13328][Recvfrom] Syscall { num = Recvfrom, fd = 3, base = 0x7fff0ec7d010, len = 131072, flags = 0, addr = 0x0, addr_len = 0x0 }�[0m
�[0m[2022-05-26T06:22:58.382Z][TRACE][T4][#13328][Recvfrom] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.382Z][TRACE][T4][#13329][ClockGettime] Syscall { num = ClockGettime, clockid = 0, ts_u = 0x7fff0f4bfd30 }�[0m
�[0m[2022-05-26T06:22:58.382Z][TRACE][T4][#13329][ClockGettime] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.382Z][TRACE][T4][#13330][ClockGettime] Syscall { num = ClockGettime, clockid = 0, ts_u = 0x7fff0f4bfbf0 }�[0m
�[0m[2022-05-26T06:22:58.382Z][TRACE][T4][#13330][ClockGettime] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.382Z][TRACE][T4][#13331][···Futex] Syscall { num = Futex, futex_addr = 0x7fff0c35a1c0, futex_op = 393, futex_val = 0, timeout = 140733450025968, futex_new_addr = 0x0, bitset = 4294967295 }�[0m
�[0m[2022-05-26T06:22:58.382Z][DEBUG][T4][#13331][···Futex] futex_wait_bitset addr: 0x7fff0c35a1c0, val: 0, timeout: Some(FutexTimeout { clock_id: CLOCK_REALTIME, ts: timespec_t { sec: 1653546179, nsec: 382870780 }, absolute_time: true }), bitset: 0xffffffff�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T3][#81][ClockNanosleep] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T3][#82][···Futex] Syscall { num = Futex, futex_addr = 0x7fff0c35a1c0, futex_op = 129, futex_val = 1, timeout = 0, futex_new_addr = 0x7fff0c35a198, bitset = 204841400 }�[0m
�[0m[2022-05-26T06:22:58.386Z][DEBUG][T3][#82][···Futex] futex_wake_bitset addr: 0x7fff0c35a1c0, max_count: 1, bitset: 0xffffffff�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T3][#82][···Futex] Retval = 0x1�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T4][#13331][···Futex] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T3][#83][···Write] Syscall { num = Write, fd = 1, buf = 0x7fff14000f70, size = 93 }�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T4][#13332][ClockGettime] Syscall { num = ClockGettime, clockid = 0, ts_u = 0x7fff0f4bfbf0 }�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T4][#13332][ClockGettime] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T4][#13333][···Futex] Syscall { num = Futex, futex_addr = 0x7fff0c35a1c8, futex_op = 129, futex_val = 1, timeout = 0, futex_new_addr = 0x0, bitset = 4294967295 }�[0m
�[0m[2022-05-26T06:22:58.386Z][DEBUG][T3][#83][···Write] write: fd: 1�[0m
�[0m[2022-05-26T06:22:58.386Z][DEBUG][T4][#13333][···Futex] futex_wake_bitset addr: 0x7fff0c35a1c8, max_count: 1, bitset: 0xffffffff�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T4][#13333][···Futex] Retval = 0x0�[0m
�[0m[2022-05-26T06:22:58.386Z][TRACE][T4][#13334][···Futex] Syscall { num = Futex, futex_addr = 0x7fff0c35a1c4, futex_op = 393, futex_val = 0, timeout = 140733450025968, futex_new_addr = 0x0, bitset = 4294967295 }�[0m
�[0m[2022-05-26T06:22:58.386Z][DEBUG][T4][#13334][···Futex] futex_wait_bitset addr: 0x7fff0c35a1c4, val: 0, timeout: Some(FutexTimeout { clock_id: CLOCK_REALTIME, ts: timespec_t { sec: 1653546179, nsec: 386620281 }, absolute_time: true }), bitset: 0xffffffff�[0m
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.01 sec   830 MBytes  6.88 Gbits/sec
�[0m[2022-05-26T06:22:58.386Z][TRACE][T3][#83][···Write] Retval = 0x5d�[0m

So when a side is close, on the other side of the stream, Recvfrom should return 0. From the man page of recv, it also states that:

When a stream socket peer has performed an orderly shutdown, the return value will be 0 (the traditional "end-of-file" return).

But in NGO, it shows a different phenomenon that the close didn't make the recvfrom return. I did a little digging, and it looks like the recvfrom request is submitted to the kernel but it never returns and the calling thread keeps waiting. Thus, it seems like a fault of io_uring that close/shutdown can't cancel pending requests and make them return.

From an issue in liburing (axboe/liburing#568), it looks like it won't be supported before 5.19.

Possible solution/Implementation

Besides this problem, iperf also uses a Netlink socket for other usages in this test. NGO is also not supported. To run this tool, there are mainly two solutions:
(1) Wait for the kernel 5.19 and hopefully, this usage is supported, and update all related libraries to use this feature. Also, add support for Netlink socket type.

(2) Add an ocall based implementation as a fallback solution when running applications like this. We could encourage the users to try io_uring based network first and if it can't use, use this fallback solution. At least, the in-enclave scheduling can bring some benefits. The drawback of this solution is that some of the network ocall will be blocking and can't release the CPU.

iouring is fine. The reason is that the client doesn't really "shutdown" the host fd and the server is not notified. The shutdown implementation needs ocall.