Improve namerd and bad path error handling
stevej opened this issue · 0 comments
This morning I spent about 30 minutes debugging why requests were hanging. It turns out I forgot to add a path so namerd had nothing to resolve. I'm going to add my notes below to remind us how difficult it can be to debug issues with linkerd-tcp today.
Problem:
When no path is specified, requests to linkerd-tcp hang indefinitely.
While setting up a new linkerd-tcp, I noticed that requests were failing. (Later Oliver pointed out that my test requests failed to specify a 'Host' header so namerd resolutions were failing.)
Watching the process with strace I noticed the following behavior.
bind(6, {sa_family=AF_INET, sin_port=htons(7474), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 1024) = 0
ioctl(6, FIONBIO, [1])
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2, u64=2}}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674480478}) = 0
epoll_wait(3, [], 1024, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674517778}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674543177}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674570803}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674590894}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674611178}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674630385}) = 0
epoll_wait(3, [], 1024, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674659413}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674678174}) = 0
epoll_wait(3, Listening on http://0.0.0.0:9989.
[{EPOLLIN, {u32=2, u64=2}}], 1024, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 817271282}) = 0
write(5, "\1", 1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 817361718}) = 0
epoll_wait(3, [{EPOLLIN, {u32=4294967295, u64=18446744073709551615}}], 1024, 0) = 1
read(4, "\1", 128) = 1
read(4, 0x7fff4e9c86c0, 128) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {165947, 817751478}) = 0
accept4(6, {sa_family=AF_INET, sin_port=htons(47002), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = $
2
ioctl(12, FIONBIO, [1]) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4, u64=4}}) = 0
write(9, "\1", 1) = 1
accept4(6, 0x7fff4e9c7c00, 0x7fff4e9c7cd4, SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {165947, 818158355}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=4, u64=4}}], 1024, 0) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818226082}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818257060}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818292018}) = 0
write(9, "\1", 1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818508486}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818538988}) = 0
write(9, "\1", 1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818601446}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818636679}) = 0
write(9, "\1", 1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818711174}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818736119}) = 0
write(9, "\1", 1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818813994}) = 0
epoll_wait(3, [], 1024, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818871606}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818899081}) = 0
epoll_wait(3, [{EPOLLIN|EPOLLOUT, {u32=4, u64=4}}], 1024, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {165950, 323825513}) = 0
clock_gettime(CLOCK_MONOTONIC, {165950, 324029051}) = 0
epoll_wait(3, [{EPOLLIN|EPOLLOUT, {u32=4, u64=4}}], 1024, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {165950, 443832441}) = 0
clock_gettime(CLOCK_MONOTONIC, {165950, 443949129}) = 0
The :7474 listener is binding properly but using telnet to issue an HTTP request resulted in no response.
$ telnet localhost 7474
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.0
lsof
output while that HTTP request is in-flight.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
linkerd-t 4580 stevej cwd DIR 8,1 4096 270339 /home/stevej/src/linkerd-tcp
linkerd-t 4580 stevej rtd DIR 8,1 4096 2 /
linkerd-t 4580 stevej txt REG 8,1 9258304 271179 /home/stevej/src/linkerd-tcp/target/release/linkerd-tcp
linkerd-t 4580 stevej mem REG 8,1 1088952 1907 /lib/x86_64-linux-gnu/libm-2.24.so
linkerd-t 4580 stevej mem REG 8,1 1856752 1903 /lib/x86_64-linux-gnu/libc-2.24.so
linkerd-t 4580 stevej mem REG 8,1 92552 1931 /lib/x86_64-linux-gnu/libgcc_s.so.1
linkerd-t 4580 stevej mem REG 8,1 142400 1918 /lib/x86_64-linux-gnu/libpthread-2.24.so
linkerd-t 4580 stevej mem REG 8,1 31712 1920 /lib/x86_64-linux-gnu/librt-2.24.so
linkerd-t 4580 stevej mem REG 8,1 14608 1906 /lib/x86_64-linux-gnu/libdl-2.24.so
linkerd-t 4580 stevej mem REG 8,1 158512 1899 /lib/x86_64-linux-gnu/ld-2.24.so
linkerd-t 4580 stevej 0u CHR 136,1 0t0 4 /dev/pts/1
linkerd-t 4580 stevej 1u CHR 136,1 0t0 4 /dev/pts/1
linkerd-t 4580 stevej 2u CHR 136,1 0t0 4 /dev/pts/1
linkerd-t 4580 stevej 3u a_inode 0,11 0 9463 [eventpoll]
linkerd-t 4580 stevej 4r FIFO 0,10 0t0 89197 pipe
linkerd-t 4580 stevej 5w FIFO 0,10 0t0 89197 pipe
linkerd-t 4580 stevej 6u IPv4 89198 0t0 TCP *:7474 (LISTEN)
linkerd-t 4580 stevej 7u a_inode 0,11 0 9463 [eventpoll]
linkerd-t 4580 stevej 8r FIFO 0,10 0t0 87890 pipe
linkerd-t 4580 stevej 9w FIFO 0,10 0t0 87890 pipe
linkerd-t 4580 stevej 10u IPv4 87891 0t0 TCP *:9989 (LISTEN)
linkerd-t 4580 stevej 11u IPv4 87892 0t0 TCP localhost:54064->localhost:4180 (ESTABLISHED)
linkerd-t 4580 stevej 12u IPv4 88470 0t0 TCP localhost:7474->localhost:47002 (ESTABLISHED)
There is no open socket to any of the perf-cluster
hosts.
Running linkerd-tcp with trace logging reveals the culprit: we're failing to parse namerd responses but not failing the request.
TRACE:hyper::http::conn : Conn::flush = Ok(Ready(()))
TRACE:tokio_core::reactor : event Ready {Readable} Token(1)
DEBUG:tokio_core::reactor : loop process - 2 events, Duration { secs: 0, nanos: 85366 }
TRACE:linkerd_tcp::app : polling 4 running
TRACE:linkerd_tcp::app : polling runner 0
TRACE:hyper::client::response : Response::new
DEBUG:hyper::client::response : version=Http11, status=Ok
DEBUG:hyper::client::response : headers={"Content-Type": "application/json", "Content-Length": "15"}
TRACE:linkerd_tcp::namerd : parsing namerd response
INFO :linkerd_tcp::namerd : error parsing response: missing field `addrs` at line 1 column 14
Oliver explained that this behavior is caused by issuing a request with no path for the router to process. Since we don't specify a default path for now, we do a lookup to namerd with an empty path and that fails.
I think we should address this in two ways:
- Change namerd parsing errors to be at the
error!
level so a user will more quickly see the error. - Fast fail requests lacking a path until we decide to specify a default path in the configuration file or add dtab support.