mirage/mirage-tcpip

`EADDRINUSE` raised by a test on a specific platform

dinosaure opened this issue · 4 comments

For the release of 6.1.0, we got an error on opam-ci on a test about sockets:

# ┌──────────────────────────────────────────────────────────────────────────────┐
# │ [FAIL]        socket                0   two sockets connect via TCP.         │
# └──────────────────────────────────────────────────────────────────────────────┘
# test.exe: [INFO] IPv4 socket stack: connect
# test.exe: [INFO] IPv4 socket stack: connect
# [exception] Unix.Unix_error(Unix.EADDRINUSE, "bind", "")
#             Raised by primitive operation at Tcpip_stack_socket.V4.listen_tcpv4 in file "src/stack-unix/tcpip_stack_socket.ml", line 77, characters 6-88
#             Called from Dune__exe__Test_socket.two_connect_tcp.(fun) in file "test/test_socket.ml", line 40, characters 2-60
#             Called from Dune__exe__Test.run in file "test/test.ml" (inlined), line 37, characters 15-24
#             Called from Dune__exe__Test.(fun) in file "test/test.ml", line 51, characters 42-47
#             Called from Alcotest_engine__Core.Make.protect_test.(fun) in file "src/alcotest-engine/core.ml", line 277, characters 17-23
#             Called from Alcotest_engine__Monad.Identity.catch in file "src/alcotest-engine/monad.ml", line 34, characters 31-35

Such error seems reproducible on:

  • centos-8
  • debian-testing
  • opensuse-15.2

At this time, nobody reproduced such error locally. For more details, you can see this PR:
ocaml/opam-repository#18357 (comment)

After #443, wouldn't the ports only close when exiting the blocking calls to recvfrom, accept? Then they may still be listening when the next test runs (even after close).

https://github.com/mirage/mirage-tcpip/blob/master/src/stack-unix/tcpip_stack_socket.ml#L83

The tests may now be failing because we missed the exception in the past:

d63b65d#diff-cd11935cb53f2f3aa659cee9d12de7c8db4b502f94ba978a4b87d5df181dd0be

In the test we don't check that the result is read, so it could have worked before because it was sent to a different stack that was still listening:

https://github.com/mirage/mirage-tcpip/blob/master/test/test_socket.ml#L44

So according to d63b65d, we should add an try ... catch ... on the Lwt_unix.bind?

I can't test the code right now, but to me it looks like the test is now failing correctly, but it's not a new error -- we just didn't detect it before because the test connected to a stack started by a previous test and passed. The new close-implementation seems to be incomplete (it doesn't terminate existing blocking calls or callbacks?) so the port is still open when the test runs.

While debugging and working on #449, we read through this issue and my conclusion is:

  • the socket stack (TCP and UDP, V4, V6, V4V6) should remember all sockets (file descriptors) that were created, and on disconnect close them
  • the tests that call Stack.connect need to call Stack.disconnect on termination

This way (and since we set SO_REUSEADDR so a future call to listen etc. will be able to REUSE a socket in FIN_WAIT2/CLOSED), the tests will be more reliable.