joshkunz/co2-sensor

Backend is crashing every few days with "Too many files" errors

joshkunz opened this issue · 0 comments

Every few days, Gotham starts throwing these errors:

 ERROR gotham > Socket Error: Too many open files (os error 24)
 ERROR gotham > Socket Error: Too many open files (os error 24)
 ERROR gotham > Socket Error: Too many open files (os error 24)
 ERROR gotham > Socket Error: Too many open files (os error 24)

The errors show up on every connection request (typically every Prometheus scrape). This Gotham issue suggests that it's due to half-open connections, likely left by Prometheus, which makes sense. It points to Hyper, the underlying HTTP framework, as the culprit. Searching through Hyper's issues, it seems like there are plenty of instances of people running into similar errors, for example this issue. I'm really not sure what's going on here, though, since from searching around it looks like nobody else is having trouble with Prometheus leaving half-open connections. It seems likely to be some kind of cleanup problem in Gotham/Hyper.
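One way to confirm that descriptors really are leaking (rather than the limit simply being too low) would be to log the process's open file descriptor count over time. A minimal sketch, assuming the backend runs on Linux, where `/proc/self/fd` lists the descriptors of the current process; the `open_fd_count` helper is hypothetical and not part of Gotham or Hyper:

```rust
use std::fs;
use std::io;

/// Counts the entries in /proc/self/fd, i.e. the file descriptors this
/// process currently has open. Linux-only (assumption). The directory
/// handle used for the listing is itself a descriptor, so the result is
/// one higher than the "real" count.
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}
```

If this number climbs steadily between scrapes and never drops, sockets are not being closed, which would match the half-open connection theory. `os error 24` is `EMFILE`, the per-process descriptor limit.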

This is pretty Yikes to me. I would expect any major HTTP library (or server library in general) to be able to deal with half-open connections, since they're a juicy DoS opportunity. As I see it, I have a few options:

  1. Add a /healthy endpoint to the server + write a wrapper to poll it and re-start the server if it gets borked.
  2. Figure out what the issue within Gotham/Hyper is and fix it.
  3. Try to switch to something like actix.rs, which is not based on Hyper, and see if that fixes things.
  4. Re-write the backend in another language, like Go.
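For reference, option 1 could look roughly like the sketch below. It uses only the standard library and assumes the server answers `GET /healthy` with a `200` status; the `/healthy` path, the function names (`status_is_ok`, `is_healthy`, `supervise`), and the timeouts are all hypothetical, and none of this is an existing Gotham or Hyper API.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::process::Command;
use std::thread;
use std::time::Duration;

/// True if the response starts with an HTTP/1.x status line carrying code 200.
fn status_is_ok(head: &[u8]) -> bool {
    head.starts_with(b"HTTP/1.1 200") || head.starts_with(b"HTTP/1.0 200")
}

/// Sends `GET /healthy` to `addr` and reports whether a 200 came back
/// within `timeout`. Any connect, write, or read failure counts as unhealthy.
fn is_healthy(addr: SocketAddr, timeout: Duration) -> bool {
    let mut stream = match TcpStream::connect_timeout(&addr, timeout) {
        Ok(s) => s,
        Err(_) => return false,
    };
    if stream.set_read_timeout(Some(timeout)).is_err()
        || stream.set_write_timeout(Some(timeout)).is_err()
    {
        return false;
    }
    let request = format!(
        "GET /healthy HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        addr
    );
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }
    // Read until there are enough bytes to cover "HTTP/1.1 200".
    let mut head = Vec::new();
    let mut buf = [0u8; 64];
    while head.len() < 12 {
        match stream.read(&mut buf) {
            Ok(0) | Err(_) => break,
            Ok(n) => head.extend_from_slice(&buf[..n]),
        }
    }
    status_is_ok(&head)
}

/// Runs `program`, polls its health endpoint every `interval`, and restarts
/// it after `max_failures` consecutive failed checks or if it exits.
/// Only returns if spawning or waiting on the child fails.
fn supervise(
    program: &str,
    args: &[&str],
    addr: SocketAddr,
    interval: Duration,
    max_failures: u32,
) -> io::Result<()> {
    loop {
        let mut child = Command::new(program).args(args).spawn()?;
        let mut failures = 0;
        while failures < max_failures {
            thread::sleep(interval);
            if child.try_wait()?.is_some() {
                break; // the server exited on its own
            }
            if is_healthy(addr, Duration::from_secs(2)) {
                failures = 0;
            } else {
                failures += 1;
            }
        }
        let _ = child.kill(); // ignore the error if it has already exited
        child.wait()?;
    }
}
```

A wrapper binary would call something like `supervise("./backend", &[], "127.0.0.1:8080".parse().unwrap(), Duration::from_secs(30), 3)`, where the binary path and address are placeholders. Note that this only papers over the leak: once the descriptor limit is hit the health check itself cannot connect, so the restart happens after scrapes have already started failing.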

I don't really like any of these options, so I'm not sure what to do.