Connection pool hangs if authentication fails
levi-nz opened this issue · 6 comments
When using a connection pool with `tokio_postgres`, an incorrect password causes the connection attempt to hang until a `TimedOut` error is returned. This behaviour is undesirable; it should simply result in an authentication failure.
To reproduce, try to connect to any database using an incorrect password: the connection will hang until it times out.
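For illustration, here is a minimal reproduction sketch against bb8/bb8-postgres; the connection string, timeout, and use of `build_unchecked` are my own choices, not details from the report:

```rust
use std::time::Duration;

use bb8::Pool;
use bb8_postgres::PostgresConnectionManager;
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() {
    // Deliberately wrong password (placeholder connection string).
    let manager = PostgresConnectionManager::new_from_stringlike(
        "host=localhost user=postgres password=definitely-wrong dbname=postgres",
        NoTls,
    )
    .expect("invalid connection string");

    // build_unchecked() returns immediately without opening any connections.
    let pool = Pool::builder()
        .connection_timeout(Duration::from_secs(10))
        .build_unchecked(manager);

    // Observed behaviour from the report: instead of an authentication error,
    // this call hangs and eventually returns a timeout error from the pool.
    match pool.get().await {
        Ok(_) => println!("connected (unexpected with a bad password)"),
        Err(e) => println!("pool.get() failed after ~10s: {e}"),
    }
}
```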
I'd be happy to review a PR for this, but I don't think I'll be able to work on this myself in the near future.
I've spent a bit of time looking into this while debugging some TLS connection issues. It is decidedly not a simple fix, unfortunately. There are a few things that I've been doing to make life a little bit easier, however:
- If your deployment allows it, set `Builder::min_idle` to require at least one open connection before considering the pool ready. Consider testing the connection parameters on startup with a short-lived pool and a very low timeout.
- Be sure to register an `ErrorSink` via `Builder::error_sink`, which gives you an opportunity to log errors that would otherwise fall off the table (see the sketch after this list).
- I'll also be taking a look at using the `bb8_failsafe` crate here for more proactive connection error management.
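A rough sketch of the first two suggestions; the sink type, connection string, and timeout are illustrative, not from the comment above:

```rust
use std::time::Duration;

use bb8::{ErrorSink, Pool};
use bb8_postgres::PostgresConnectionManager;
use tokio_postgres::NoTls;

// A trivial sink that logs connection errors that would otherwise be dropped.
#[derive(Debug, Clone, Copy)]
struct LoggingErrorSink;

impl ErrorSink<tokio_postgres::Error> for LoggingErrorSink {
    fn sink(&self, error: tokio_postgres::Error) {
        eprintln!("pool connection error: {error}");
    }

    fn boxed_clone(&self) -> Box<dyn ErrorSink<tokio_postgres::Error>> {
        Box::new(*self)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let manager = PostgresConnectionManager::new_from_stringlike(
        "host=localhost user=app password=secret dbname=app", // placeholder
        NoTls,
    )?;

    let pool = Pool::builder()
        .min_idle(Some(1)) // require one idle connection for the pool to be "ready"
        .connection_timeout(Duration::from_secs(5)) // keep the startup check short
        .error_sink(Box::new(LoggingErrorSink))
        // build() (unlike build_unchecked()) tries to establish the idle
        // connections before returning, so bad credentials should show up
        // here rather than on the first get().
        .build(manager)
        .await?;

    let conn = pool.get().await?;
    conn.simple_query("SELECT 1").await?;
    Ok(())
}
```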
The touchpoint for the issue is `PoolInner::add_connection` (lines 199 to 225 in 5737736). That method uses the `ManageConnection` impl to make the connection, and (in the case of an error) keeps trying until it either gets one successfully or the pool's connection timeout expires.
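For readers following along, here is a paraphrased, self-contained sketch of what that loop does (not the actual bb8 code; the names, backoff, and generic signature are made up for illustration):

```rust
use std::future::Future;
use std::time::{Duration, Instant};

// Keep calling `connect` until it succeeds or the pool's connection timeout
// expires. Each concrete error is discarded along the way, so a caller
// waiting on the pool only ever sees a generic timeout.
async fn add_connection_sketch<C, E, F, Fut>(
    mut connect: F,
    connection_timeout: Duration,
) -> Result<C, &'static str>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<C, E>>,
{
    let deadline = Instant::now() + connection_timeout;
    let mut delay = Duration::from_millis(100);
    loop {
        match connect().await {
            Ok(conn) => return Ok(conn),
            Err(_discarded) => {
                if Instant::now() >= deadline {
                    return Err("TimedOut");
                }
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
}
```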
The right behavior in the general case is to delay returning any error until the timeout, but it would be great if there were a way to identify particular classes of errors that shouldn't wait for the timeout. It seems like that would require cooperation from the `ManageConnection` impl, which means it would require a new trait method, say something like `fn error_is_fatal(&self, _e: &Self::Error) -> bool { false }`, which could then be used to exit early. Then we could override the default impl of that trait method where appropriate. Unfortunately for my use case (`bb8-postgres`), the `tokio-postgres` crate does not expose error details in a way that could reasonably be used for this purpose. Long story short, doing this would be a lot of work.
As a side note for @djc or anyone else looking more deeply: this call stack, `Pool::get` => `PoolInner::get` => `PoolInner::make_pooled` => `PoolInner::spawn_replenishing_approvals`, raises my eyebrows. In particular, it seems like the waiter ought to be a `Result` channel so that a root cause can bubble back up. As it is, we lose the original error even in cases where we are actively waiting for the connection, which does not seem to be the intent of `spawn_replenishing_approvals` (lines 138 to 139 in 5737736).
> As a side note for @djc or anyone else looking more deeply: this call stack, `Pool::get` => `PoolInner::get` => `PoolInner::make_pooled` => `PoolInner::spawn_replenishing_approvals`, raises my eyebrows. In particular, it seems like the waiter ought to be a `Result` channel so that a root cause can bubble back up. As it is, we lose the original error even in cases where we are actively waiting for the connection, which does not seem to be the intent of `spawn_replenishing_approvals`.
That sounds sensible. Want to propose a PR?
> Unfortunately for my use case (`bb8-postgres`), the `tokio-postgres` crate does not expose error details in a way that could reasonably be used for this purpose. Long story short, doing this would be a lot of work.

Might be worth an issue against tokio-postgres? I would definitely be open to adding an `error_is_fatal()`-like hook to the `ManageConnection` trait.
> That sounds sensible. Want to propose a PR?
Happy to work something up. Since the waiter is stored in `internals::PoolInternals` and the error pops up in `inner::PoolInner::spawn_start_connections`, I think the move is to replace the line that throws to the `error_sink` (line 67 in 5737736) with a call to a new method on `SharedPool` that looks roughly like:
```rust
impl<M: ManageConnection> SharedPool<M> {
    fn forward_error(&self, mut err: M::Error) {
        let mut locked = self.internals.lock();
        while let Some(waiter) = locked.waiters.pop_front() {
            // Hand the error to the first waiter that is still listening; if the
            // receiver has gone away, take the error back and try the next one.
            match waiter.send(Err(err)) {
                Ok(()) => return,
                Err(Err(e)) => err = e,
                // We only ever send `Err(..)`, so we can't get an `Ok` back.
                Err(Ok(_)) => unreachable!(),
            }
        }
        // No live waiters: fall back to the configured error sink.
        self.statics.error_sink.sink(err);
    }
}
```
> Might be worth an issue against tokio-postgres? I would definitely be open to adding an `error_is_fatal()`-like hook to the `ManageConnection` trait.
It looks like it has been discussed a few times (sfackler/rust-postgres#571, sfackler/rust-postgres#583, sfackler/rust-postgres#790), though usually in a different context. There's even another project that is trying to do exactly the same thing: https://github.com/palfrey/wait-for-db/blob/c3fb9dbad94f06f91f83d6ef9e4af0b93ef64083/src/pg.rs#L34-L49
In at least one case a PR was merged that exposed a limited interface to the error kind (sfackler/rust-postgres#739), so it's reasonable to think a similar PR would be well received.
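For reference, the check those projects do boils down to something like this; the function name and the exact set of SQLSTATE codes are my own choices, but `Error::code()` and the `SqlState` constants are what tokio-postgres exposes today:

```rust
use tokio_postgres::error::SqlState;

/// Classify an error as an authentication failure via its SQLSTATE code,
/// one of the few details `tokio_postgres::Error` makes public.
fn is_auth_failure(err: &tokio_postgres::Error) -> bool {
    match err.code() {
        Some(code) => {
            *code == SqlState::INVALID_PASSWORD
                || *code == SqlState::INVALID_AUTHORIZATION_SPECIFICATION
        }
        None => false,
    }
}
```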
Yeah, the `forward_error()` approach seems reasonable. Should probably verify all `sink()` callers (at least in `PoolInner`) to see if they should use this.
Hmm, it's not so easy, I guess. The above strategy works, but only if the user has configured `retry_connection(false)`. If not, in practice we hit the "outer" timeout on the waiter channel here: line 142 in 03f95c9 and line 216 in 03f95c9.

After a quick review I don't immediately see a great way to get these to line up that doesn't feel like a hack (i.e. making the outer one "a little bit longer", or some such). The outer timeout can't really be removed, since we're relying on it in the face of pool contention. The inner timeout in the context of the retry loop looks locally correct, and it's hard to picture what would really need to change.

I'll go ahead and submit a PR to bubble up the error for the case of a pool configured with `retry_connection(false)`, since that at least makes a little bit of progress here.
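For completeness, a pool opting out of the retry loop would be configured roughly like this (a sketch; the connection string and timeout are placeholders, and the connect error only reaches the caller once the change above lands):

```rust
use std::time::Duration;

use bb8::Pool;
use bb8_postgres::PostgresConnectionManager;
use tokio_postgres::NoTls;

// With retry_connection(false), a failed connect attempt is not retried, so
// the error can be forwarded to waiters instead of being swallowed until the
// connection timeout fires.
async fn build_pool() -> Result<Pool<PostgresConnectionManager<NoTls>>, tokio_postgres::Error> {
    let manager = PostgresConnectionManager::new_from_stringlike(
        "host=localhost user=app password=wrong dbname=app", // placeholder
        NoTls,
    )?;

    Pool::builder()
        .retry_connection(false)
        .connection_timeout(Duration::from_secs(5))
        .build(manager)
        .await
}
```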