Connection problem between runtime and gateway
lurenpluto opened this issue · 3 comments
After switching the active OOD, the runtime and the OOD can no longer connect to each other, and pings made over BDT fail.
OOD: 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p
runtime: 5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA
Runtime side
Last connection on runtime side
4904: [2023-05-06 20:47:15.658443 +08:00] INFO [ThreadId(4)] [component\cyfs-lib\src\requestor\bdt.rs:181] http-bdt request to 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p:84 success! during=208ms, seq=TempSeq(4211563353)
All subsequent connections to the gateway fail with timeout errors.
4933: [2023-05-06 20:47:26.396892 +08:00] WARN [ThreadId(8)] [component\cyfs-lib\src\requestor\bdt.rs:110] connect to 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p:84 failed! with_desc=true, during=5009ms, err: (Timeout, future has timed out, None)
Even after restarting the runtime side, it still cannot connect to the gateway, so the problem appears to be on the gateway side.
OOD side
Last request log received from runtime
[2023-05-06 20:47:18.905150 +08:00] INFO [ThreadId(7)] [component\cyfs-stack\src\interface\http_bdt_listener.rs:167] recv bdt http request. source=5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, seq=TempSeq(4211563355), method=POST, url=http://5bnzhzxqdrwun6nkzmgksuiradspunuwlbfyg91qc55p:84/non/, len=Some(26)
It looks like the OOD is running, but for some reason other devices cannot connect to it.
For 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, the final ping for keepalive is:
[2023-05-06 20:46:43.015978 +08:00] INFO [ThreadId(19)] [component\cyfs-bdt\src\sn\client\ping\clients.rs:461] PingClients{local:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p} ping-resp, sn: 5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt/L4udp120.24.55.87:8070, seq: 4211558552.
[2023-05-06 20:47:08.043060 +08:00] INFO [ThreadId(19)] [component\cyfs-bdt\src\sn\client\ping\clients.rs:461] PingClients{local:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p} ping-resp, sn: 5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt/L4udp120.24.55.87:8070, seq: 4211558553.
And the failures of 5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA connecting to 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p start from:
[2023-05-06 20:57:30.856635 +08:00] ERROR [ThreadId(7)] [component\cyfs-bdt\src\tunnel\builder\connect_stream\builder.rs:328] ConnectStreamBuilder{stream:StreamContainer {sequence:TempSeq(4211602500), local:5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, remote:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, port:84, id:1638150600 }} call sn session failed, sn=5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt, err=err: (NotFound, sn response error, None)
The sn-miner uses client_ping_timeout (default: 5 min) to purge its client cache. Because 5bn..55p had not sent a keep-alive ping for too long, the SN considered it dropped, so the call from 5bn..KwA returned NotFound and the connection failed.
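The purge behavior described above can be sketched roughly as follows. This is an illustrative model, not the actual sn-miner code: `ClientCache`, `on_ping`, `purge`, and `lookup` are hypothetical names, but the mechanism matches the logs — each ping-resp refreshes a last-seen timestamp, a sweep drops entries older than `client_ping_timeout`, and a call for a purged device returns NotFound.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch of the SN-side client cache.
struct ClientCache {
    ping_timeout: Duration,
    /// device id -> time of last keep-alive ping
    clients: HashMap<String, Instant>,
}

impl ClientCache {
    fn new(ping_timeout: Duration) -> Self {
        Self { ping_timeout, clients: HashMap::new() }
    }

    /// Called on every ping: refresh the client's last-seen time.
    fn on_ping(&mut self, device_id: &str, now: Instant) {
        self.clients.insert(device_id.to_string(), now);
    }

    /// Periodic sweep: drop clients that have not pinged within the timeout.
    fn purge(&mut self, now: Instant) {
        let timeout = self.ping_timeout;
        self.clients.retain(|_, last| now.duration_since(*last) < timeout);
    }

    /// A call from a peer: NotFound if the target has been purged.
    fn lookup(&self, device_id: &str) -> Result<(), &'static str> {
        if self.clients.contains_key(device_id) { Ok(()) } else { Err("NotFound") }
    }
}
```

With a 5-minute timeout, a client whose ping loop has stopped is purged on the next sweep after the deadline, which is exactly when peers start seeing NotFound.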
So is this problem also a case of BDT's SN ping stopping unexpectedly, causing the device to be considered offline by the SN server?
This should be the same problem as, or one similar to, the issue below:
#250
So we need to review the SN ping logic inside the BDT stack to see what could cause the ping loop to abort.
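The property that review should verify can be sketched as below. This is not the actual cyfs-bdt code; `ping_loop` and `send_ping` are stand-ins. The key invariant is that a failed ping is treated as a retryable event, never as a reason to exit the loop — otherwise a single transient error silently stops the keep-alive and the SN purges the client.

```rust
use std::time::Duration;

/// Illustrative keep-alive loop: runs `max_rounds` iterations and returns
/// how many pings were attempted. A ping error is logged and retried,
/// never propagated out of the loop.
fn ping_loop<F>(mut send_ping: F, interval: Duration, max_rounds: usize) -> usize
where
    F: FnMut() -> Result<(), &'static str>,
{
    let mut sent = 0;
    for _ in 0..max_rounds {
        match send_ping() {
            Ok(()) => { /* ping-resp received; client stays cached on the SN */ }
            Err(e) => {
                // A transient error must not abort the loop; log and retry.
                eprintln!("sn ping failed, will retry: {}", e);
            }
        }
        sent += 1;
        std::thread::sleep(interval);
    }
    sent
}
```

If the real loop instead uses `?` or an early `return` on error (or the task driving it is starved, as the blocked-task logs below suggest), the keep-alive stops exactly as observed here.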
All tasks were blocked, leaving the bdt-stack unable to communicate properly:
[2023-05-06 20:47:19.246450 +08:00] WARN [ThreadId(7)] [component\cyfs-stack\src\interface\http_bdt_listener.rs:198] bdt http request complete with error! status=501, seq=TempSeq(4211563355), during=341ms
[2023-05-06 20:48:00.881778 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:49:00.882825 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:50:00.883364 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:51:00.885583 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:52:00.887406 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:52:00.887488 +08:00] ERROR [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:122] task system dead timeout, now will exit process! last_active=13327850820839775
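The `check\dead.rs` watchdog pattern visible in these logs can be sketched as follows. Names here (`touch`, `task_system_dead`, `LAST_ACTIVE`) are illustrative, not the actual cyfs-debug API: a probe task scheduled on the async executor refreshes a shared timestamp, and a dedicated OS thread checks it; if the executor has been stuck past the timeout, the process exits so a supervisor can restart it.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Milliseconds since process start at which the probe task last ran.
static LAST_ACTIVE: AtomicU64 = AtomicU64::new(0);

/// Called from a task on the (possibly blocked) executor.
fn touch(start: Instant) {
    LAST_ACTIVE.store(start.elapsed().as_millis() as u64, Ordering::SeqCst);
}

/// Called from a dedicated OS thread that is never scheduled on the
/// executor: true if the probe task has not run within `timeout`.
fn task_system_dead(start: Instant, timeout: Duration) -> bool {
    let last = LAST_ACTIVE.load(Ordering::SeqCst);
    let now_ms = start.elapsed().as_millis() as u64;
    now_ms.saturating_sub(last) > timeout.as_millis() as u64
}
```

In the logs above the monitor thread (ThreadId(12)) keeps printing "process still alive" for itself, but the executor-side probe stops updating, so after the timeout it logs "task system dead timeout" and exits the process.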
After 5 minutes and 30 seconds, querying 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p through the SN will definitely fail, and during this period there is no communication with it:
The SN removed 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p's cache item, so it is not found:
[2023-05-06 20:57:30.856635 +08:00] ERROR [ThreadId(7)] [component\cyfs-bdt\src\tunnel\builder\connect_stream\builder.rs:328] ConnectStreamBuilder{stream:StreamContainer {sequence:TempSeq(4211602500), local:5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, remote:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, port:84, id:1638150600 }} call sn session failed, sn=5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt, err=err: (NotFound, sn response error, None)
The task is blocked, so none of the packets can be handled:
[2023-05-06 20:49:24.056021 +08:00] INFO [ThreadId(6)] [component\cyfs-bdt\src\tunnel\udp.rs:148] UdpTunnel{local:L4udp192.168.100.75:8051,remote:L4udp192.168.100.75:8050} dead for connecting timeout
When check\dead.rs detects this situation, will the gateway be restarted? If so, the device can be connected again after 5 minutes.