erlang/otp

DNS resolution fails due to einval error from gen_udp:connect

fhunleth opened this issue ยท 12 comments

Describe the bug

I'm using the OTP resolver and it is configured with two DNS servers. The first DNS server was properly queried, but when it returned an error (not an nxdomain one, but a refusal to answer the query). The second DNS server wasn't queried. In tracking this down, the udp_connect function was returning {error, einval}.

To Reproduce

To consistently reproduce the bug, I made the first DNS server always return an rcode 4 (not implemented). Looking at inet_res.erl, it looks like quite a few errors could be used to trigger the code to try the next DNS server in the list.

Next I ran inet_res:resolve("google.com", in, a, [{verbose, true}]) and watched the messages show the error from the first server, the "Try UDP server" print for the next server and then the immediate UDP send failed: {error, einval} print. Upon looking into this further, the {error, einval} didn't come from send, but from udp_connect.

I first thought that gen_udp:connect/3 was failing because the socket was closed. There are a couple calls to close the socket in inet_res.erl, but none of those were being called.

Removing the call to gen_udp:connect/3 fixes the issue. I.e., the resolver tries the first DNS server, the first server returns an error, then the resolver tries the second DNS server, gets back a good response and returns it successfully. Whatever was causing gen_udp:connect to return an error doesn't impact gen_udp:send from working. In fact, I was thinking that removing the call to gen_udp:connect/3 would be a fix to this issue but since I hadn't traced this to what was causing connect to fail, I wanted to raise the issue first. It could also be an issue with gen_udp:connect/3.

Expected behavior

inet_res:resolve tries the second DNS server in the list when the first one refuses to answer the query.

Affected versions

OTP 24.0.5

Additional information

Here's an example log that shows a failed run of :init_res.resolve/4. This runs quickly. There are no pauses or timeouts.

iex> :inet_res.resolve('google.com', :in, :a, verbose: true, nameservers: [{{127,0,0,53},53}, {{192,168,7,1},53}])
Query: {msg,[{header,{header,[{id,5},
                              {qr,0},
                              {opcode,query},
                              {aa,0},
                              {tc,0},
                              {rd,true},
                              {ra,0},
                              {pr,0},
                              {rcode,0}]}},
             {qdlist,[{dns_query,[{domain,"google.com"},
                                  {type,a},
                                  {class,in}]}]},
             {anlist,[]},
             {nslist,[]},
             {arlist,[]}]}
Try UDP server : {127,0,0,53}:53 (timeout=666)
Got reply: {msg,[{header,{header,[{id,5},
                                  {qr,true},
                                  {opcode,query},
                                  {aa,false},
                                  {tc,false},
                                  {rd,true},
                                  {ra,false},
                                  {pr,false},
                                  {rcode,4}]}},
                 {qdlist,[{dns_query,[{domain,"google.com"},
                                      {type,a},
                                      {class,in}]}]},
                 {anlist,[]},
                 {nslist,[]},
                 {arlist,[]}]}
UDP server error: {error,
                      {notimp,
                          {dns_rec,
                              {dns_header,5,true,query,false,false,true,
                                  false,false,4},
                              [{dns_query,"google.com",a,in}],
                              [],[],[]}}}
Try UDP server : {192,168,7,1}:53 (timeout=666)
UDP send failed: {error,einval}
{:error, :econnrefused}

This seems to be a Linux bug / pecularity in the implementation of connect(2) for a connectionless socket such as an UDP socket.

If I first connect to an address on the loopback interface, then to a more public address, it fails with EINVAL from the actual libc call connect(2). I have strace:d this sequence to verify that:

Eshell V12.0.2  (abort with ^G)
1> {ok, S} = gen_udp:open(0, []).       
{ok,#Port<0.5>}
2> gen_udp:connect(S, {127,0,0,53}, 53).
ok
3> gen_udp:connect(S, {8,8,8,8}, 53).   
{error,einval}
4> gen_udp:close(S).                    
ok
5> f(S), {ok, S} = gen_udp:open(0, []). 
{ok,#Port<0.6>}
6> gen_udp:connect(S, {8,8,8,8}, 53).  
ok
7> gen_udp:connect(S, {127,0,0,53}, 53).
ok
8> gen_udp:connect(S, {8,8,8,8}, 53).   
ok

If I take a step down and try it with the closer to the metal socket API, it shows the same behaviour:

Eshell V12.0.2  (abort with ^G)
1> {ok, S} = socket:open(inet, dgram).
{ok,{'$socket',#Ref<0.326745212.1599209476.194695>}}
2> socket:bind(S, any).                                               
ok
3> socket:connect(S, #{family => inet, addr => {127,0,0,1}, port => 53}).
ok
4> socket:connect(S, #{family => inet, addr => {8,8,8,8}, port => 53}).  
{error,einval}
5> socket:close(S).
ok
6> f(S), {ok, S} = socket:open(inet, dgram).                             
{ok,{'$socket',#Ref<0.326745212.1599209476.194735>}}
7> socket:bind(S, any).                                                  
ok
8> socket:connect(S, #{family => inet, addr => {8,8,8,8}, port => 53}).
ok
9> socket:connect(S, #{family => inet, addr => {127,0,0,1}, port => 53}).
ok
10> socket:connect(S, #{family => inet, addr => {8,8,8,8}, port => 53}).  
ok

It appears that Linux does something with the socket if it first connects to a loopback address, so it later can not connect to an address outside that interface, or?

The man page for connect(2) says:

connectionless protocol sockets may use connect() multiple times to change their association.

But not quite freely, it seems... :-(

The man page for connect(2) also says:

Connectionless sockets may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC (supported on Linux since kernel 2.2).

So I added some code to prim_socket_nif.c to handle more generic addresses, and then it turns out that if I insert

3b> socket:connect(S, #{family => 0, addr => <<0,0>>}).

i.e connect to AF_UNSPEC between connect to {127,0,0,1} and connect to {8,8,8,8}, this strange state gets cleared and the second connect works.

What do do?
  • Bug report this to Linux
  • Work around this by closing the socket and opening a new one in this situation
  • (Preferably not) Work around with AF_UNSPEC as i found - this will be cumbersome with inet_drv.c i.e gen_udp sockets, and maybe tricky to do in a portable way

Wow, thank you for digging into this. Fwiw, the cost of closing and opening a new socket in my app is trivial since trying the second DNS server shouldn't happen that often. I took a quick look at the Linux source code for handling UDP connect() calls, but I have no insights to report. I'm not sure how to assist further, but please let me know if there's a way.

I have tried to Google this, but get no good hits for connect(2) returning with errno == EINVAL. Note that that errno value is not documented for this function.

There should be Linux forums to ask in, bug report tools to use. But I am not familiar to them.

I dug some more... FreeBSD 10.4 (End Of Life) behaves as Linux. But not FreeBSD 11.4 (oldest supported), nor OpenBSD 6.9 (latest release). So FreeBSD has had the bug and fixed it, a while back. OpenBSD does not have it now. Might be worth mentioning for anyone reporting this to some Linux bug report tool...
Edit: Ubuntu 20.04 still has the bug. I am afraid we do not have any newer in our lab.

Unfortunately this place in the code is not prepared to re-open a socket, so the rewrite is non-trivial, but I will have a look at it...

I did some digging to see what the error path is in the kernel:

The EINVAL is coming from https://elixir.bootlin.com/linux/v5.4.138/source/net/ipv4/route.c#L2343:

	if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev)))
		if (ipv4_is_loopback(fl4->saddr) &&
		    !(dev_out->flags & IFF_LOOPBACK) &&
		    !netif_is_l3_master(dev_out))
			return ERR_PTR(-EINVAL);

The way that I read this is that it returns EINVAL if the source address is currently loopback and the proposed destination is not. That sounded good to me.

The stack trace is:

__mkroute_output
ip_route_output_key
ip_route_output_flow
ip_route_connect_init
ip_route_connect
__ip4_datagram_connect
ip4_datagram_connect
inet_dgram_connect

Perhaps of interest is how simple the AF_UNSPEC path is for disconnecting the socket at https://elixir.bootlin.com/linux/v5.4.138/source/net/ipv4/af_inet.c#L560. It totally avoids the ip_route_connect code path which has a number of EINVAL return paths.

I'm using Linux 5.4, so I was wondering if anything had changed in Linux 5.14. inet_dgram_connect looks the same - AF_UNSPEC disconnects, but otherwise it delegates to ip4_datagram_connect. (It seems like one could always disconnect first in inet_dgram_connect to simplify the "connect" code path, but I have no clue what the ramifications are of that.) Since that means that the code goes down the ip_route_connect path, it will still hit the if check in __mkroute_output that returns EINVAL. I.e., Linux 5.14 won't fix it.

A git blame for the check in question leads to torvalds/linux@d0daebc. I didn't know you could do this, so I tried it for fun. I don't view it as helpful.

I personally think that going down the ip_route_connect path without disconnecting first is way too complicated and possibly going to lead to other weird failures. It's leading me to think that running connect with AF_UNSPEC should always be done before connecting again just to avoid a hugely complicated code path. (re-opening the socket, is of course ok too.)

Thank you again for the time you're spending in addressing this. I'll post if I find any thing more when I get a next window of time today to look at this.

Just to elaborate:

Why connect an UDP socket?

The reason that inet_res uses the odd (and undocumented) solution to connect an UDP socket before sending a query to a name server instead of just plain UDP send to a destination is:

  • When an UDP socket is connected you are not supposed to get datagrams from other addresses than the connected; the kernel filters for you.
  • If you send to an address that has no active UDP socket the ICMP side channel will report back that the destination is bad, so the subsequent UDP receive will get a ECONNREFUSED and does not have to wait for the timeout to realize that there is no name server on the other end.

It seems Linux and the BSD:s are a in disagreement here.

Use case:

  1. Open a socket inet,dgram (UDP)
  2. bind to any
  3. connect to 8.8.8.8:53
  4. connect to 127.0.0.1:53

On the BSD:s, if I check sockname after 3) it is RoutableIP:EphemeralPort, and after 4) it is 127.0.0.1:EphemeralPort (the same EphemeralPort as after 3))

On Linux (Ubuntu 18.04, Linux-5.4.0-80), sockname after 4) is the same as after 3); RoutableIP:EphemeralPort, so Linux does not change the socket's source address when the connect address is changed. This is a quite different behaviour from the BSD:s, and only kind of works when changing to the loopback address (still; firewall and name server source address rules may get in the way). This also explains why changing from the loopback address does not work since it would not be possible to get a reply routed back to a loopback source address.

The only documented reasons for EINVAL from connect(2) for the BSD:s are; the length of the address does not match the address family, and, the destination address for a TCP connect is a UDP broadcast or multicast address. Quite obvious argument errors independent of the socket state, which is the norm for EINVAL.

So I think Linux is out on deep water here. Connecting an UDP socket to different addresses is an old feature.
I suspect this Linux misbehaviour is related to the advanced optimizations that they have made for connecting unbound TCP sockets which delays the address binding and allows re-using ports to a high degree by taking the destination address into consideration when selecting source port. I speculate that changing source address when re-connecting is too likely to collide on the source port, so they simply did not implement it, despite not being compatible with the old ways, the documentation, and the BSD:s

Now I have to meditate over where (Erlang or C, inet_drv.c sockets, socket sockets, and gen_udp compatibility sockets) and how it would be best to work around this Linuxism. Apparently the only safe approach is to, on Linux, always "dissolve the association" before reconnecting, to ensure connecting with a sensible source address.

Sorry, accidentally pressed the Close button

Then again, the simplest solution for us would be if Linux could fix this misbehaviour...

This fixes the problem for legacy gen_udp:

diff --git a/erts/emulator/drivers/common/inet_drv.c b/erts/emulator/drivers/common/inet_drv.c
index 6c722dd87f..d635c31712 100644
--- a/erts/emulator/drivers/common/inet_drv.c
+++ b/erts/emulator/drivers/common/inet_drv.c
@@ -1,7 +1,7 @@
 /*
  * %CopyrightBegin%
  *
- * Copyright Ericsson AB 1997-2020. All Rights Reserved.
+ * Copyright Ericsson AB 1997-2021. All Rights Reserved.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -1660,7 +1660,9 @@ static void *realloc_wrapper(void *current, ErlDrvSizeT size){
 #endif
 #   define ANC_BUFF_SIZE   INET_DEF_BUFFER/2 /* XXX: not very good... */
 
+
 #ifdef HAVE_UDP
+
 static int load_address(ErlDrvTermData* spec, int i, char* buf)
 {
     int n;
@@ -1722,7 +1724,12 @@ static int load_address(ErlDrvTermData* spec, int i, char* buf)
     }
     return i;
  }
-#endif
+
+#ifdef __linux__
+static struct sockaddr disconnect_sa;
+#endif /* #ifdef __linux__ */
+
+#endif /* #ifdef HAVE_UDP */
 
 
 #ifdef HAVE_SCTP
@@ -12345,6 +12352,13 @@ static udp_descriptor* sctp_inet_copy(udp_descriptor* desc, SOCKET s, int* err)
 #ifdef HAVE_UDP
 static int packet_inet_init()
 {
+#ifdef __linux__
+    sys_memzero((char *)&disconnect_sa, sizeof(disconnect_sa));
+#ifdef AF_UNSPEC
+    disconnect_sa.sa_family = AF_UNSPEC;
+#endif /* #ifdef AF_UNSPEC */
+#endif /* #ifdef __linux__ */
+
     return 0;
 }
 
@@ -12616,6 +12630,14 @@ static ErlDrvSSizeT packet_inet_ctl(ErlDrvData e, unsigned int cmd, char* buf,
                 (desc->sfamily, &desc->remote, &buf, &len)) != NULL)
                return ctl_xerror(xerror, rbuf, rsize);
            
+#ifdef __linux__
+            /* Workaround for Linux's misbehaviour to not
+             * change the source address when connecting
+             * a datagram socket to a new destination
+             */
+            sock_connect(desc->s, &disconnect_sa, sizeof(disconnect_sa));
+#endif /* #ifdef __linux__ */
+
            code = sock_connect(desc->s,
                                (struct sockaddr*) &desc->remote, len);
            if (IS_SOCKET_ERROR(code)) {

Now over to how to handle this in socket...

mikpe commented

There should be Linux forums to ask in, bug report tools to use. But I am not familiar to them.

netdev@vger.kernel.org for networking-specific stuff
linux-kernel@vger.kernel.org for general kernel stuff (but it's very noisy)

Plain text mode email preferred, and no top-posting. A self-contained reproducer in C is enough, plus a general description of the presumed bug.

There is/was a bugzilla, but people tend to ignore it.

Alternatively, report it to your distribution using their preferred methods.

Fix merged. Thank you for investigating this issue!

That's great!!! Thank you so much for fixing it!