golang/go

net: TestFilePacketConn fails on Scaleway

bradfitz opened this issue · 15 comments

On a Scaleway ARM host (where we're trying to move the ARM builders), the net package tests fail with:

--- FAIL: TestFilePacketConn (0.00s)
        file_test.go:113: write ip 127.0.0.1->127.0.0.1: sendto: bad address

Debug:

root@scw-105acb:~/go/src# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.10
DISTRIB_CODENAME=utopic
DISTRIB_DESCRIPTION="Ubuntu 14.10"
root@scw-105acb:~/go/src# ifconfig 
docker0   Link encap:Ethernet  HWaddr 56:84:7a:fe:97:99  
          inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth0      Link encap:Ethernet  HWaddr 00:07:cb:03:76:44  
          inet addr:10.1.34.160  Bcast:10.1.35.255  Mask:255.255.254.0
          inet6 addr: fe80::207:cbff:fe03:7644/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:357998 errors:0 dropped:0 overruns:0 frame:0
          TX packets:108129 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:532 
          RX bytes:352772865 (352.7 MB)  TX bytes:2078718437 (2.0 GB)
          Interrupt:24 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:20563 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20563 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:66891220 (66.8 MB)  TX bytes:66891220 (66.8 MB)

root@scw-105acb:~/go/src# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.1.34.1       0.0.0.0         UG    0      0        0 eth0
10.1.34.0       0.0.0.0         255.255.254.0   U     0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

Note that this machine has a Docker daemon running, but I'm not yet running the build inside a container. This failure was from running on the host machine, as part of evaluating the speed of these machines.

/cc @mikioh, @davecheney, @crawshaw, @adg

And the strace:

[pid 15756] socket(PF_INET, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_ICMP) = 3
[pid 15756] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 15756] bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
[pid 15756] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3058795512, u64=3058795512}}) = 0
[pid 15756] getsockname(3, {sa_family=AF_INET, sin_port=htons(1), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 15756] getpeername(3, 0x10649bac, [112]) = -1 ENOTCONN (Transport endpoint is not connected)
[pid 15756] fcntl(3, F_DUPFD_CLOEXEC, 0) = 5
[pid 15756] fcntl(5, F_GETFL)           = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 15756] fcntl(5, F_SETFL, O_RDWR)   = 0
[pid 15756] fcntl(5, F_DUPFD_CLOEXEC, 0) = 6
[pid 15756] fcntl(6, F_GETFL)           = 0x2 (flags O_RDWR)
[pid 15756] fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 15756] getsockopt(6, SOL_SOCKET, SO_TYPE, [3], [4]) = 0
[pid 15756] getsockname(6, {sa_family=AF_INET, sin_port=htons(1), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 15756] getpeername(6, 0x10649bdc, [112]) = -1 ENOTCONN (Transport endpoint is not connected)
[pid 15756] epoll_ctl(4, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3058795392, u64=3058795392}}) = 0
[pid 15756] sendto(6, "", 0, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EFAULT (Bad address)
[pid 15756] clock_gettime(CLOCK_REALTIME, {1430954968, 213392054}) = 0
[pid 15756] write(1, "--- FAIL: TestFilePacketConn (0."..., 529--- FAIL: TestFilePacketConn (0.04s)
        file_test.go:113: write ip 127.0.0.1->127.0.0.1: sendto: bad address
) = 529
[pid 15756] write(1, "FAIL\n", 5FAIL
)       = 5
[pid 15756] close(3)                    = 0
[pid 15756] exit_group(1)               = ?
[pid 15758] +++ exited with 1 +++
[pid 15757] +++ exited with 1 +++
+++ exited with 1 +++

The sendto EFAULT seems wrong.

       EFAULT An invalid user space address was specified for an argument.

The man page says:

       ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
                      const struct sockaddr *dest_addr, socklen_t addrlen);

Is a NULL buf (the const void * argument) okay, even with len 0?

Actually, the syscall package already tries hard to avoid a NULL *void:

// Single-word zero for use when we need a valid pointer to 0 bytes.
// See mksyscall.pl.
var _zero uintptr

func sendto(s int, buf []byte, flags int, to unsafe.Pointer, addrlen _Socklen) (err error) {
        var _p0 unsafe.Pointer
        if len(buf) > 0 {
                _p0 = unsafe.Pointer(&buf[0])
        } else {
                _p0 = unsafe.Pointer(&_zero)
        }
        _, _, e1 := Syscall6(SYS_SENDTO, uintptr(s), uintptr(_p0), uintptr(len(buf)), uintptr(flags), uintptr(to), uintptr(addrlen))
        if e1 != 0 {
                err = errnoErr(e1)
        }
        return
}

... yet &_zero (which should be non-nil) ends up as zero according to the strace.

Is Syscall6 doing the right thing?

This machine FWIW has 4 of these:

# cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 2 (v7l)
BogoMIPS        : 1332.01
Features        : half thumb fastmult vfp edsp thumbee vfpv3 tls idiva idivt vfpd32 lpae 
CPU implementer : 0x56
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0x584
CPU revision    : 2

/cc @minux @ianlancetaylor @rsc @davecheney @josharian

Maybe a dup of #7299?

No, I just can't read. The buf pointer is indeed non-zero. I was off by one reading all the empty values. And strace in raw mode (as well as some printlns in the syscall package) confirms:

[pid 16133] write(2, "sendto zero ", 12sendto zero ) = 12
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "0x3af964", 80x3af964)     = 8
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "0x3af964", 80x3af964)     = 8
[pid 16133] write(2, "\n", 1
)           = 1
[pid 16133] write(2, "sendto ", 7sendto )      = 7
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "290", 3290)          = 3
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "6", 16)            = 1
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "3864932", 73864932)      = 7
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "0", 10)            = 1
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "0", 10)            = 1
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "275330280", 9275330280)    = 9
[pid 16133] write(2, " ", 1 )            = 1
[pid 16133] write(2, "16", 216)           = 2
[pid 16133] write(2, "\n", 1
)           = 1
[pid 16133] sendto(0x6, 0x3af964, 0, 0, 0x106934e8, 0x10) = -1 (errno 14)

So it's only len and flags which are zero.
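To double-check the reading of that log, a quick sketch that converts the decimal values from the printlns to hex so they can be compared with the raw strace line. The labels and variable names are mine; I'm assuming the first value (290) is the sendto trap number on 32-bit ARM Linux and the remaining six are the Syscall6 arguments in order.

```go
package main

import "fmt"

// printed holds the decimal values from the "sendto" debug line above,
// in the order they were written.
var printed = []uint64{290, 6, 3864932, 0, 0, 275330280, 16}

// labels is an assumed mapping: trap number, then the six Syscall6 arguments.
var labels = []string{"trap", "fd", "buf", "len", "flags", "to", "addrlen"}

// hexArgs formats each value the way strace's raw mode shows it.
func hexArgs(vals []uint64) []string {
	out := make([]string, len(vals))
	for i, v := range vals {
		out[i] = fmt.Sprintf("%#x", v)
	}
	return out
}

func main() {
	for i, h := range hexArgs(printed) {
		fmt.Printf("%-8s %s\n", labels[i], h)
	}
}
```

The buf value comes out as 0x3af964 (the address of the zero word printed on the "sendto zero" line) and the destination address as 0x106934e8, matching the raw sendto call.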

Still no clue about the EFAULT, though.

Ah, if the error you are seeing is only

write ip 127.0.0.1->127.0.0.1: sendto: bad address

I'll take this issue. It seems to happen only in the top or middle half of the ICMP stack.

Not sure what that means but happy for a fix. (ICMP has three halves? :))

As a matter of convenience, I usually think of it as consisting of a socket-interface adaptation layer (or service access point layer), a protocol layer, and a transport (in this case IP) adaptation layer. I believe the root cause of this issue is simply that the test passes a corrupted ICMP packet to the kernel. The 4-year-old test cases certainly need to be updated for recent, more restrictive kernels.
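To illustrate what a non-corrupted packet would look like, here is a rough sketch that builds a minimal ICMPv4 echo request using only the standard library. This is an assumption about the kind of payload a kernel would accept, not the actual fix: the thread does not confirm the exact cause, although the strace above shows the test sending a zero-length payload on the raw ICMP socket. The function names and the ID, sequence, and payload values are made up for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// checksum computes the Internet checksum (RFC 1071) over b:
// the one's complement of the one's-complement sum of 16-bit words.
func checksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	if len(b)%2 == 1 {
		// Pad a trailing odd byte with a zero low byte.
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = sum&0xffff + sum>>16
	}
	return ^uint16(sum)
}

// echoRequest builds an ICMPv4 echo request: an 8-byte header
// (type 8, code 0, checksum, identifier, sequence number) plus payload.
func echoRequest(id, seq uint16, payload []byte) []byte {
	msg := make([]byte, 8+len(payload))
	msg[0] = 8 // type: echo request
	msg[1] = 0 // code
	binary.BigEndian.PutUint16(msg[4:], id)
	binary.BigEndian.PutUint16(msg[6:], seq)
	copy(msg[8:], payload)
	// The checksum is computed with its own field still zero, then stored.
	binary.BigEndian.PutUint16(msg[2:], checksum(msg))
	return msg
}

func main() {
	msg := echoRequest(1, 1, []byte("HELLO-R-U-THERE"))
	fmt.Printf("%d bytes: % x\n", len(msg), msg)
}
```

A handy sanity check is that running `checksum` over the finished message, checksum field included, gives 0. A message like this could then be passed to WriteTo instead of an empty slice.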

In addition, from Go 1.5, the full stack test cases for IPConn have been moved to the following:
golang.org/x/net/ipv4
golang.org/x/net/ipv6
golang.org/x/net/icmp

I'd be happy if the buildbots could eventually run the x/net tests with administrator privileges.

I'm going to just delete that test for now, then. You can re-enable it later when you identify how the test is broken.

Kernel is 3.19.1-181, FWIW.

moul commented

Subscribing, I'm from the Scaleway team

CL https://golang.org/cl/10090 mentions this issue.

CL https://golang.org/cl/10134 mentions this issue.

CL https://golang.org/cl/17476 mentions this issue.