cronokirby/alchemy

Bot becomes unresponsive if cache ever goes offline


Issue

If the Alchemy.Cache.Guilds guild child process is ever killed for any reason (e.g. a timeout), the bot becomes unresponsive and no commands or messages can be processed for that server.

Analysis

  • The cacher is the second GenStage step, which means that if it ever stops processing events, so does everything else downstream (like command handlers).
  • The cacher makes a synchronous call to the guild's GenServer.
    • The default GenServer call timeout is 5000 ms, so if the bot is backlogged processing other commands, or simply not getting much CPU time, the call can time out and crash the GenStage step.
  • After a crash, the guild's state is stuck with "unavailable" => true, and nothing seems to get it out of that state.
  • At this point, all further messages for that guild are discarded, and it is not possible (at least to my knowledge) to perform any actions through the bot.
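The timeout failure mode described above can be reproduced in isolation. A minimal sketch (the module name SlowServer is hypothetical and not part of Alchemy):

```elixir
defmodule SlowServer do
  use GenServer

  def start_link(), do: GenServer.start_link(__MODULE__, nil)

  def init(state), do: {:ok, state}

  # A handler that takes longer than the default 5000 ms call timeout.
  def handle_call(:slow, _from, state) do
    :timer.sleep(5_001)
    {:reply, :ok, state}
  end
end

{:ok, pid} = SlowServer.start_link()

# This exits with a :timeout in the *caller* -- which, in Alchemy's case,
# is the GenStage step, taking everything downstream with it.
GenServer.call(pid, :slow)
```

Passing an explicit timeout (e.g. GenServer.call(pid, :slow, :infinity)) would avoid the crash, but only by trading it for head-of-line blocking in the pipeline.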

Reproduction steps

  • The simplest way I've found is to add a :timer.sleep(5001) at the top of this handler
  • Then run a command like
msg = %{
  "activities" => [],
  "client_status" => %{},
  "game" => nil,
  "guild_id" => guild_id,
  "roles" => [],
  "status" => "offline",
  "user" => %{"id" => your_user_id}
}

1..8
|> Task.async_stream(fn _ -> Alchemy.Cache.Guilds.update_presence(msg) end,
  max_concurrency: 10,
  timeout: 30_000
)
|> Enum.to_list()

from iex.

  • A bunch of errors will be printed, and the bot will then enter the unrecoverable state.

To check the state:

children = Supervisor.which_children(Alchemy.Cache.Guilds.GuildSupervisor)
pids = Enum.map(children, fn child -> elem(child, 1) end)

has_been_restarted =
  Enum.any?(pids, fn pid ->
    state = :sys.get_state(pid)
    state["unavailable"] == true && state["id"] == guild_id
  end)

If has_been_restarted is true, things are broken. :sys.get_state/1 returns more useful information (the full state of the process), but for determining whether the bug has been triggered, the two keys above are all that's relevant.

Notes

I was attempting to submit a PR to fix this issue, but had trouble determining what the proper fix would be.

It seems like we just need to refresh the "seed" state of the cache when this happens, but it wasn't clear to me where that should happen (or where it currently happens). I was also unsure whether there was a hidden reason we couldn't do this when the GenStage process dies.
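One possible direction, just to make the idea concrete (this is a sketch, not Alchemy's actual code; safe_cache_call/3 is a hypothetical helper): have the cacher catch the exit raised by a timed-out call instead of crashing, so a timeout becomes a recoverable error that could trigger a reseed.

```elixir
defmodule CacheCallSketch do
  # Hypothetical wrapper: a call that times out returns {:error, :timeout}
  # instead of exiting the calling GenStage process. GenServer.call exits
  # with {:timeout, {GenServer, :call, ...}} on timeout, which the catch
  # clause below matches.
  def safe_cache_call(pid, request, timeout \\ 5_000) do
    {:ok, GenServer.call(pid, request, timeout)}
  catch
    :exit, {:timeout, _} -> {:error, :timeout}
  end
end
```

This alone wouldn't clear the "unavailable" => true state, but it would at least keep the pipeline alive so that a recovery step (like refreshing the seed state) has somewhere to run.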

Issues aside, I wanted to say thanks for the awesome library! I was only able to debug this in a couple of hours because of the great work you've put in so far.