`EADDRINUSE` raised by a test on a specific platform
dinosaure opened this issue · 4 comments
For the release of 6.1.0, we got an error on opam-ci
on a test about sockets:
# ┌──────────────────────────────────────────────────────────────────────────────┐
# │ [FAIL] socket 0 two sockets connect via TCP. │
# └──────────────────────────────────────────────────────────────────────────────┘
# test.exe: [INFO] IPv4 socket stack: connect
# test.exe: [INFO] IPv4 socket stack: connect
# [exception] Unix.Unix_error(Unix.EADDRINUSE, "bind", "")
# Raised by primitive operation at Tcpip_stack_socket.V4.listen_tcpv4 in file "src/stack-unix/tcpip_stack_socket.ml", line 77, characters 6-88
# Called from Dune__exe__Test_socket.two_connect_tcp.(fun) in file "test/test_socket.ml", line 40, characters 2-60
# Called from Dune__exe__Test.run in file "test/test.ml" (inlined), line 37, characters 15-24
# Called from Dune__exe__Test.(fun) in file "test/test.ml", line 51, characters 42-47
# Called from Alcotest_engine__Core.Make.protect_test.(fun) in file "src/alcotest-engine/core.ml", line 277, characters 17-23
# Called from Alcotest_engine__Monad.Identity.catch in file "src/alcotest-engine/monad.ml", line 34, characters 31-35
Such error seems reproducible on:
- centos-8
- debian-testing
- opensuse-15.2
At this time, nobody reproduced such error locally. For more details, you can see this PR:
ocaml/opam-repository#18357 (comment)
After #443, wouldn't the ports only close when exiting the blocking calls to recvfrom, accept? Then they may still be listening when the next test runs (even after close).
https://github.com/mirage/mirage-tcpip/blob/master/src/stack-unix/tcpip_stack_socket.ml#L83
The tests may now be failing because we missed the exception in the past:
d63b65d#diff-cd11935cb53f2f3aa659cee9d12de7c8db4b502f94ba978a4b87d5df181dd0be
In the test we don't check that the result is read, so it could have worked before because it was sent to a different stack that was still listening:
https://github.com/mirage/mirage-tcpip/blob/master/test/test_socket.ml#L44
So according to d63b65d, we should add an try ... catch ...
on the Lwt_unix.bind
?
I can't test the code right now, but to me it looks like the test is now failing correctly, but it's not a new error -- we just didn't detect it before because the test connected to a stack started by a previous test and passed. The new close-implementation seems to be incomplete (it doesn't terminate existing blocking calls or callbacks?) so the port is still open when the test runs.
While debugging and working on #449, we read through this issue and my conclusion is:
- the socket stack (TCP and UDP, V4, V6, V4V6) should remember all sockets (file descriptors) that were created, and on
disconnect
close them - the tests that call Stack.connect need to call Stack.disconnect on termination
This way (and since we set SO_REUSEADDR so a future call to listen etc. will be able to REUSE a socket in FIN_WAIT2/CLOSED), the tests will be more reliable.