DawnbrandBots/bastion-bot

Timeouts calling search API

Closed this issue · 9 comments

Example for message 1071682782721495130
Received 2023-02-05T06:44:46.364Z
Timed out 2023-02-05T06:44:48.871Z

timings: {
    start: 1675579486366,
    socket: 1675579486366,
    lookup: 1675579486366,
    connect: 1675579486366,
    secureConnect: undefined,
    upload: 1675579486366,
    response: undefined,
    end: undefined,
    error: 1675579488869,
    abort: undefined,
    phases: {
      wait: 0,
      dns: 0,
      tcp: 0,
      tls: undefined,
      request: 0,
      firstByte: undefined,
      download: undefined,
      total: 2503
    }
  }

Service received request

2023-02-05T06:44:46.462216000Z {"level":30,"time":1675579486420,"pid":1,"hostname":"d11f6ccbdbb0","reqId":"req-81hj"

Many requests before and after get a response in under half a second. No response is ever logged for this one.

Possible workaround for message search: early timeout and then retry a couple times with exponential backoff.

Since the last deployment https://github.com/DawnbrandBots/bastion-bot/actions/runs/4901960311 (about the past week)

docker logs bastion-bot_bot_1 2>&1 | grep 'RequestError: Timeout' > timeouts.log
grep 'error:events:message:search' timeouts.log | wc -l

823 incidents proceeded to full timeout
1046 at least four retries
1391 at least three retries
1972 at least two retries
3441 at least one retry

140 non-message search or Yugipedia timeouts

$ grep -ve 'events:message:search' -e 'command:yugi' timeouts.log | wc -l
140
$ grep -ve 'events:message:search' -e 'command:yugi' timeouts.log | cut -d' ' -f2 | sort | uniq
bot:error:command:art
bot:error:command:deck
bot:error:command:id
bot:error:command:random
bot:error:command:rush
bot:error:command:search
bot:warn:command:rush

40 art, 4 deck, 1 id, 24 random, 17 search, 9 rush (45 autocomplete timeout warnings)

image

local locust test suggests the issues are one off (reason for increased failures unclear, droplet sucks?)

For message search, reverting to old 2.5 second timeout with exponential backoff provided by the library. Also will be adjusting some other commands accordingly.

Cause has to be something occurring on the host when the request is made. The local load test is much more successful than recent command uses.

image

image

image

Quiet week, resolving