sfackler/r2d2

Failed connection attempts result in leaked file descriptors

Closed this issue · 2 comments

To preface: I'm using r2d2 within a web service, running on a freshly installed CentOS 8 server. r2d2 is imported into my project via Diesel's r2d2 feature. The databases that r2d2 talks to are on separate servers, and share a network.

Recently, one of my services suffered from a network outage that affected the network between the program that uses r2d2 and the database. When the network came back online, the program still wasn't able to respond to requests; its log was showing many Too many open files errors as it had hit the default soft limit for open files on the system (1024), causing all further network requests to also throw similar 'too many open file' errors until I manually restarted the process.

From my testing, any failed connection within r2d2 causes this behavior. To test, I first got an idea of what the program looks like with a working network:

  • Started up a new instance of the program
  • Grabbed its process ID via ps aux
  • Got an initial count of open files with lsof -p <pid> | wc -l. My program in particular starts with 90 open files.
  • Issued a few normal requests to get r2d2 to acquire some database connections, and remeasured the open file count. In my case, with the handful of different pools that would get used while processing those requests, it went up to 92 - 94, and won't go any higher than that regardless of how many requests come in. If I stop sending requests for long enough for r2d2 to release connections, the open file count will go down, as expected.

Then, I severed the connection between the program and a PostgreSQL database that it talks to. Anything that causes r2d2 to throw errors seems to work; in my case I removed PostgreSQL's entry from the database server's FirewallD zone. This made the program immediately start logging "No route to host" errors from r2d2.

Using that same lsof command from above every few minutes, I monitored the open file count, and it seems like it goes up once per failed attempt, starting from 92 open files right when the firewall was modified:

  • After the first five attempts: 97
  • 2 minutes of failed attempts: 110 open files
  • 5 minutes: 147
  • 10 minutes: 215
  • 30 minutes: 567
  • 35 minutes: 627

Inspecting the output of lsof at this point shows many entries that are near identical to the following:

COMMAND     PID     USER   FD      TYPE     DEVICE SIZE/OFF     NODE    NAME
xx          25972    xx   245u      sock      0,9      0t0     406107 protocol: TCP

At this point, I re-added PostgreSQL's port to the firewall, which allowed r2d2 to create connections successfully. I ran the lsof command again after waiting 20 minutes, and it showed that the count hadn't changed from 627. From what I saw after that network outage, r2d2 will never release those open files - I have about six hours worth of logs of nothing but r2d2 throwing 'too many open files' errors every time it'd try to acquire a new connection every few seconds, wherein the network came back online about 2 hours into those logs.

Some additional potentially helpful information:

  • The pool's type: Pool<ConnectionManager<PgConnection>>, where PgConnection comes from Diesel itself.
  • How is the program acquiring connections from the pool:
self.pool.get().map_err(|err| CustomError::PoolError(err))
  • How is the pool constructed:
Pool::builder()
        .max_size(2)
        .build_unchecked(ConnectionManager::new(
            CONFIG.database.connection_string.clone(),
        ));

It seems like the problem would have to be with Diesel rather than r2d2. If diesel fails to connect to the database, there's nothing for the pool to leak.

Thanks for the quick response!

You're right - I guess I was thrown off as my logs show r2d2 as the module logging these errors, like so:

ERROR r2d2 - could not create socket: Too many open files

When in reality, r2d2 just catches Errs when it tries to connect and prints them out on its own, here. Should've looked a bit more into it!

Closing this as I've moved the issue over to Diesel's repo.