Failed connection attempts result in leaked file descriptors
To preface: I'm using r2d2 within a web service running on a freshly installed CentOS 8 server. r2d2 is pulled into my project via Diesel's `r2d2` feature. The databases that r2d2 talks to are on separate servers and share a network.
Recently, one of my services suffered from a network outage that affected the network between the program that uses r2d2 and the database. When the network came back online, the program still wasn't able to respond to requests; its log showed many `Too many open files` errors, as it had hit the system's default soft limit for open files (1024), causing all further network requests to fail with the same 'too many open files' error until I manually restarted the process.
From my testing, any failed connection attempt within r2d2 causes this behavior. To test, I first got an idea of what the program looks like with a working network:
- Started up a new instance of the program.
- Grabbed its process ID via `ps aux`.
- Got an initial count of open files with `lsof -p <pid> | wc -l`. My program in particular starts with 90 open files.
- Issued a few normal requests to get r2d2 to acquire some database connections, and remeasured the open file count. In my case, with the handful of different pools that get used while processing those requests, it went up to 92 - 94, and won't go any higher than that regardless of how many requests come in. If I stop sending requests for long enough for r2d2 to release connections, the open file count goes down, as expected.
Then, I severed the connection between the program and a PostgreSQL database that it talks to. Anything that causes r2d2 to throw connection errors seems to reproduce this; in my case I removed PostgreSQL's entry from the database server's FirewallD zone. This made the program immediately start logging "No route to host" errors from r2d2.
Using that same `lsof` command from above every few minutes, I monitored the open file count, and it seems to go up once per failed attempt, starting from 92 open files right when the firewall was modified:
- After the first five attempts: 97
- 2 minutes of failed attempts: 110 open files
- 5 minutes: 147
- 10 minutes: 215
- 30 minutes: 567
- 35 minutes: 627
Inspecting the output of `lsof` at this point shows many entries that are near identical to the following:

```
COMMAND   PID USER  FD  TYPE DEVICE SIZE/OFF   NODE NAME
xx      25972   xx 245u sock    0,9      0t0 406107 protocol: TCP
```
At this point, I re-added PostgreSQL's port to the firewall, which allowed r2d2 to create connections successfully. I ran the `lsof` command again after waiting 20 minutes, and it showed that the count hadn't changed from 627. From what I saw after that network outage, r2d2 never releases those open files - I have about six hours' worth of logs of nothing but r2d2 throwing 'too many open files' errors each time it tried to acquire a new connection every few seconds; the network came back online about 2 hours into those logs.
Some additional, potentially helpful information:
- The pool's type: `Pool<ConnectionManager<PgConnection>>`, where `PgConnection` comes from Diesel itself.
- How the program acquires connections from the pool:

```rust
self.pool.get().map_err(|err| CustomError::PoolError(err))
```

- How the pool is constructed:

```rust
Pool::builder()
    .max_size(2)
    .build_unchecked(ConnectionManager::new(
        CONFIG.database.connection_string.clone(),
    ));
```
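One detail about that construction that may be relevant for anyone comparing setups: as far as I can tell from the r2d2 docs, `build_unchecked()` returns the pool without trying to establish any connections up front, so connection failures only surface later, when `get()` is called. A rough sketch of the checked variant, purely for illustration (the `build_pool` helper and its hard-coded `max_size` are made up, not part of my actual code):

```rust
use diesel::r2d2::{ConnectionManager, Pool};
use diesel::PgConnection;

// Mirrors the construction above, but uses `build()` instead of
// `build_unchecked()`: `build()` attempts to establish the pool's initial
// connections immediately and fails fast if it cannot, rather than
// deferring the error to the first `get()` call.
fn build_pool(connection_string: &str) -> Pool<ConnectionManager<PgConnection>> {
    Pool::builder()
        .max_size(2)
        .build(ConnectionManager::new(connection_string))
        .expect("could not establish initial database connections")
}
```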
It seems like the problem would have to be with Diesel rather than r2d2. If Diesel fails to connect to the database, there's nothing for the pool to leak.
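For context, the pool only talks to the database through r2d2's `ManageConnection` trait, paraphrased roughly below from the r2d2 docs (the doc comments are mine). When `connect` returns an `Err`, the only thing the pool receives is that error value, so a socket opened inside the manager before the failure is something the pool never knows about and has no way to close.

```rust
// Rough paraphrase of r2d2's ManageConnection trait.
pub trait ManageConnection: Send + Sync + 'static {
    type Connection: Send + 'static;
    type Error: std::error::Error + 'static;

    /// Attempts to create a new connection; a leak on failure would have to
    /// live inside the implementor (Diesel/libpq), not in r2d2.
    fn connect(&self) -> Result<Self::Connection, Self::Error>;

    /// Checks that an existing pooled connection is still usable.
    fn is_valid(&self, conn: &mut Self::Connection) -> Result<(), Self::Error>;

    /// Quick check for whether a connection is broken, used when it is
    /// returned to the pool.
    fn has_broken(&self, conn: &mut Self::Connection) -> bool;
}
```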
Thanks for the quick response!
You're right - I guess I was thrown off because my logs show r2d2 as the module logging these errors, like so:

`ERROR r2d2 - could not create socket: Too many open files`

when in reality r2d2 just catches the `Err`s it gets back when it tries to connect and logs them itself, here. Should've looked a bit more into it!
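For anyone who lands here from a similar `ERROR r2d2` log line: those messages come from the pool's error handler, which defaults to r2d2's `LoggingErrorHandler` and simply logs whatever error the connection manager returned. If I'm reading the r2d2 API right, that handler can be swapped via `Builder::error_handler`; a rough sketch of a custom one (the `CountingErrorHandler` name is made up, and the import assumes the r2d2 crate is available directly or re-exported through `diesel::r2d2`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use r2d2::HandleError;

/// Hypothetical handler that still prints the error the manager returned,
/// but also counts failures so an outage like this one is easier to spot.
#[derive(Debug, Default)]
struct CountingErrorHandler {
    failures: AtomicU64,
}

impl<E: std::error::Error> HandleError<E> for CountingErrorHandler {
    fn handle_error(&self, error: E) {
        self.failures.fetch_add(1, Ordering::Relaxed);
        eprintln!("database connection error: {}", error);
    }
}
```

It would then be passed to the builder with `.error_handler(Box::new(CountingErrorHandler::default()))`.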
Closing this as I've moved the issue over to Diesel's repo.