cloudwego/kitex

服务几天没访问后就访问不了了

Closed this issue · 2 comments

Describe the bug

我有个测试环境,平时基本无请求量,经常遇到过几天后就访问不了了(etcd中已无注册信息,所以找不到该微服务),重启后问题就能解决。但是生产环境没遇到过这样的问题,今日测试环境又复现了所以提个issue。

本次bug时间点及Log:

  1. 2024/06/24 微服务启动
    1 2024/06/24 15:43:12.401003 server.go:83: [Info] KITEX: server listen at addr=172.31.251.191:10000
    2 2024/06/24 15:43:13.515344 etcd_registry.go:290: [Info] start keepalive lease 6c8d8e3743d9e86f for etcd registry

  2. 2024/06/30 微服务注册信息从etcd中丢失,导致后续其他服务访问不了该微服务
    01 2024/06/30 01:00:48.568084 etcd_registry.go:223: [Info] keep register service kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000
    02 {"level":"warn","ts":"2024-06-30T01:00:51.571+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0012728c0/mydomain.etcd.cbs:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    03 2024/06/30 01:00:51.577406 etcd_registry.go:227: [Warn] keep register grant lease kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000 failed with err: context deadline exceeded
    04 2024/06/30 01:01:01.624931 etcd_registry.go:223: [Info] keep register service kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000
    05 {"level":"warn","ts":"2024-06-30T01:01:04.578+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0012728c0/mydomain.etcd.cbs:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    06 2024/06/30 01:01:04.579596 etcd_registry.go:234: [Warn] keep register put kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000 failed with err: context deadline exceeded
    07 2024/06/30 01:01:14.626367 etcd_registry.go:223: [Info] keep register service kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000
    08 {"level":"warn","ts":"2024-06-30T01:01:17.580+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0012728c0/mydomain.etcd.cbs:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    09 2024/06/30 01:01:17.580799 etcd_registry.go:234: [Warn] keep register put kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000 failed with err: context deadline exceeded
    10 2024/06/30 01:01:27.627743 etcd_registry.go:223: [Info] keep register service kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000
    11 {"level":"warn","ts":"2024-06-30T01:01:30.582+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0012728c0/mydomain.etcd.cbs:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    12 2024/06/30 01:01:30.583166 etcd_registry.go:234: [Warn] keep register put kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000 failed with err: context deadline exceeded
    13 2024/06/30 01:01:40.630734 etcd_registry.go:223: [Info] keep register service kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000
    14 {"level":"warn","ts":"2024-06-30T01:01:43.585+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0012728c0/mydomain.etcd.cbs:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    15 2024/06/30 01:01:43.585891 etcd_registry.go:234: [Warn] keep register put kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000 failed with err: context deadline exceeded
    16 2024/06/30 01:01:43.585914 etcd_registry.go:255: [Error] keep register service kitex/registry-etcd/wsp.earn.qa.int/172.31.251.191:10000 failed times:5

  3. 重启该微服务后一切恢复正常。

从日志上分析看,是服务与etcd断连后,服务重试了5次没重新注册上etcd,后续就没再尝试重试注册了。虽然该服务没挂(挂了后会被监控程序拉起来,这种情况反而没事),但是在etcd上已无该服务注册信息,所以其他服务访问不到该微服务。

这种问题如何更好的规避下?一直尝试重新注册还是有更好的方案?
还有个疑问点,该服务启动注册在etcd上的信息怎么丢失了。

https://github.com/kitex-contrib/registry-etcd/blob/main/etcd_registry.go#L242
是不是测试环境etcd访问时间较长?这里的3s包含了调用get的时间,可能导致put调用时时间不够。
可以把重试的超时时间调整到5s试试

我先调整重试次数为0(即一直尝试),先跑一段时间看看效果先。

etcd.NewEtcdRegistryWithRetry(["ip:addr"], retry.NewRetryConfig(
retry.WithMaxAttemptTimes(0), // 无限次尝试
retry.WithObserveDelay(20time.Second),
retry.WithRetryDelay(5
time.Second),
))