linkerd/linkerd-tcp

Improve namerd and bad path error handling

stevej opened this issue · 0 comments

This morning I spent about 30 minutes debugging why requests were hanging. It turns out I forgot to add a path so namerd had nothing to resolve. I'm going to add my notes below to remind us how difficult it can be to debug issues with linkerd-tcp today.

Problem:
When no path is specified, requests to linkerd-tcp hang indefinitely.

While setting up a new linkerd-tcp, I noticed that requests were failing. (Later Oliver pointed out that my test requests failed to specify a 'Host' header so namerd resolutions were failing.)

Watching the process with strace I noticed the following behavior.

bind(6, {sa_family=AF_INET, sin_port=htons(7474), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 1024)                         = 0
ioctl(6, FIONBIO, [1])
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2, u64=2}}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674480478}) = 0
epoll_wait(3, [], 1024, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674517778}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674543177}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674570803}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674590894}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674611178}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674630385}) = 0
epoll_wait(3, [], 1024, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674659413}) = 0
clock_gettime(CLOCK_MONOTONIC, {165866, 674678174}) = 0
epoll_wait(3, Listening on http://0.0.0.0:9989.
[{EPOLLIN, {u32=2, u64=2}}], 1024, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 817271282}) = 0
write(5, "\1", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 817361718}) = 0
epoll_wait(3, [{EPOLLIN, {u32=4294967295, u64=18446744073709551615}}], 1024, 0) = 1
read(4, "\1", 128)                      = 1
read(4, 0x7fff4e9c86c0, 128)            = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {165947, 817751478}) = 0
accept4(6, {sa_family=AF_INET, sin_port=htons(47002), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = $
2
ioctl(12, FIONBIO, [1])                 = 0
epoll_ctl(3, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4, u64=4}}) = 0
write(9, "\1", 1)                       = 1
accept4(6, 0x7fff4e9c7c00, 0x7fff4e9c7cd4, SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {165947, 818158355}) = 0
epoll_wait(3, [{EPOLLOUT, {u32=4, u64=4}}], 1024, 0) = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818226082}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818257060}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818292018}) = 0
write(9, "\1", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818508486}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818538988}) = 0
write(9, "\1", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818601446}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818636679}) = 0
write(9, "\1", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818711174}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818736119}) = 0
write(9, "\1", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {165947, 818813994}) = 0
epoll_wait(3, [], 1024, 0)              = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818871606}) = 0
clock_gettime(CLOCK_MONOTONIC, {165947, 818899081}) = 0
epoll_wait(3, [{EPOLLIN|EPOLLOUT, {u32=4, u64=4}}], 1024, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {165950, 323825513}) = 0
clock_gettime(CLOCK_MONOTONIC, {165950, 324029051}) = 0
epoll_wait(3, [{EPOLLIN|EPOLLOUT, {u32=4, u64=4}}], 1024, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {165950, 443832441}) = 0
clock_gettime(CLOCK_MONOTONIC, {165950, 443949129}) = 0

The :7474 listener is binding properly but using telnet to issue an HTTP request resulted in no response.

$ telnet localhost 7474
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.0

lsof output while that HTTP request is in-flight.

COMMAND    PID   USER   FD      TYPE DEVICE SIZE/OFF   NODE NAME
linkerd-t 4580 stevej  cwd       DIR    8,1     4096 270339 /home/stevej/src/linkerd-tcp
linkerd-t 4580 stevej  rtd       DIR    8,1     4096      2 /
linkerd-t 4580 stevej  txt       REG    8,1  9258304 271179 /home/stevej/src/linkerd-tcp/target/release/linkerd-tcp
linkerd-t 4580 stevej  mem       REG    8,1  1088952   1907 /lib/x86_64-linux-gnu/libm-2.24.so
linkerd-t 4580 stevej  mem       REG    8,1  1856752   1903 /lib/x86_64-linux-gnu/libc-2.24.so
linkerd-t 4580 stevej  mem       REG    8,1    92552   1931 /lib/x86_64-linux-gnu/libgcc_s.so.1
linkerd-t 4580 stevej  mem       REG    8,1   142400   1918 /lib/x86_64-linux-gnu/libpthread-2.24.so
linkerd-t 4580 stevej  mem       REG    8,1    31712   1920 /lib/x86_64-linux-gnu/librt-2.24.so
linkerd-t 4580 stevej  mem       REG    8,1    14608   1906 /lib/x86_64-linux-gnu/libdl-2.24.so
linkerd-t 4580 stevej  mem       REG    8,1   158512   1899 /lib/x86_64-linux-gnu/ld-2.24.so
linkerd-t 4580 stevej    0u      CHR  136,1      0t0      4 /dev/pts/1
linkerd-t 4580 stevej    1u      CHR  136,1      0t0      4 /dev/pts/1
linkerd-t 4580 stevej    2u      CHR  136,1      0t0      4 /dev/pts/1
linkerd-t 4580 stevej    3u  a_inode   0,11        0   9463 [eventpoll]
linkerd-t 4580 stevej    4r     FIFO   0,10      0t0  89197 pipe
linkerd-t 4580 stevej    5w     FIFO   0,10      0t0  89197 pipe
linkerd-t 4580 stevej    6u     IPv4  89198      0t0    TCP *:7474 (LISTEN)
linkerd-t 4580 stevej    7u  a_inode   0,11        0   9463 [eventpoll]
linkerd-t 4580 stevej    8r     FIFO   0,10      0t0  87890 pipe
linkerd-t 4580 stevej    9w     FIFO   0,10      0t0  87890 pipe
linkerd-t 4580 stevej   10u     IPv4  87891      0t0    TCP *:9989 (LISTEN)
linkerd-t 4580 stevej   11u     IPv4  87892      0t0    TCP localhost:54064->localhost:4180 (ESTABLISHED)
linkerd-t 4580 stevej   12u     IPv4  88470      0t0    TCP localhost:7474->localhost:47002 (ESTABLISHED)

There is no open socket to any of the perf-cluster hosts.

Running linkerd-tcp with trace logging reveals the culprit: we're failing to parse namerd responses but not failing the request.

TRACE:hyper::http::conn                         : Conn::flush = Ok(Ready(()))
TRACE:tokio_core::reactor                       : event Ready {Readable} Token(1)
DEBUG:tokio_core::reactor                       : loop process - 2 events, Duration { secs: 0, nanos: 85366 }
TRACE:linkerd_tcp::app                          : polling 4 running
TRACE:linkerd_tcp::app                          : polling runner 0
TRACE:hyper::client::response                   : Response::new
DEBUG:hyper::client::response                   : version=Http11, status=Ok
DEBUG:hyper::client::response                   : headers={"Content-Type": "application/json", "Content-Length": "15"}
TRACE:linkerd_tcp::namerd                       : parsing namerd response
INFO :linkerd_tcp::namerd                       : error parsing response: missing field `addrs` at line 1 column 14

Oliver explained that this behavior is caused by issuing a request with no path for the router to process. Since we don't specify a default path for now, we do a lookup to namerd with an empty path and that fails.

I think we should address this in two ways:

  1. Change namerd parsing errors to be at the error! level so a user will more quickly see the error.
  2. Fast fail requests lacking a path until we decide to specify a default path in the configuration file or add dtab support.