launchbadge/sqlx

Pool::acquire times out

skuzins opened this issue · 20 comments

I'm using a connection pool with a large number of Tokio tasks; see the example below. The tasks run a small query, do some work without holding on to a connection, then run another query. In the example, the second call to acquire quickly starts failing with PoolTimedOut, or it takes too long, roughly connect_timeout seconds.

Any ideas would be appreciated.

use sqlx::mysql::*;
async fn run_queries(i: u32, pool: &sqlx::Pool<MySql>) {
    {
        let mut conn = pool.acquire().await.unwrap();

        let _ = sqlx::query("SELECT 1").fetch_one(&mut conn).await.unwrap();

        // conn gets dropped here and should be returned to the pool
    }

    // (do some other work here without holding on to a connection)

    {
        let start = std::time::Instant::now();

        // this often fails (PoolTimedOut) or takes connect_timeout seconds
        let mut _conn = pool.acquire().await.unwrap();

        println!("{} pool.acquire() took {:?}", i, std::time::Instant::now() - start);

        // (run another query...)

    }
}

#[tokio::main]
async fn main() {
    let pool = MySqlPoolOptions::new()
        .max_connections(1) // also fails with higher counts, e.g. 5
        .connect(&std::env::var("DATABASE_URL").unwrap())
        .await
        .unwrap();

    let mut handles = vec![];
    for i in 0..100 {
        let pool = pool.clone();

        handles.push(tokio::spawn(async move {
            run_queries(i, &pool).await;
        }));
    }

    futures::future::join_all(handles).await;
}

Cargo.toml

[dependencies]
sqlx = { version = "0.4.0-beta.1", default-features = false, features = [ "runtime-tokio", "mysql" ] }
tokio = "0.2.22"
futures = "0.3.5"

Output snippets:

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: PoolTimedOut'
pool.acquire() took 30.001958452s

MySQL Engine version 8.0.17. Tested with stable-x86_64-unknown-linux-gnu and stable-x86_64-pc-windows-msvc.

I've run into the same issue. I am not seeing connections recycled when they are dropped, which causes problems when multiple queries happen on one web request/future.

I have worked around it by using deadpool for connection pooling, which appears to be more reliable in my limited tests.

Here's my (very simple) implementation. (Also note: I needed to use DerefMut to coax the compiler, as I'm not sure what combination of * and & is needed here.)

use std::ops::DerefMut;
use sqlx::{Connect, Connection, Error as SqlxError, Executor, PgConnection};

use async_trait::async_trait;
use deadpool::managed::{Manager, RecycleResult};
use log::*;
type Pool = deadpool::managed::Pool<PgConnection, SqlxError>;

struct DbPool {
    url: String,
}

impl DbPool {
    fn new(url: String, size: usize) -> Pool {
        Pool::new(DbPool { url }, size)
    }
}

#[async_trait]
impl Manager<PgConnection, SqlxError> for DbPool {
    // Open a new connection whenever the pool needs to grow.
    async fn create(&self) -> Result<PgConnection, SqlxError> {
        PgConnection::connect(&self.url).await
    }

    // Check that a returned connection is still alive before handing it out again.
    async fn recycle(&self, obj: &mut PgConnection) -> RecycleResult<SqlxError> {
        Ok(obj.ping().await?)
    }
}



async fn main() {
    ...
    let pool = DbPool::new(url, 16);
    ...
    sqlx::query_as("select * from users where email = $1")
      .bind(&email)
      .fetch_one(pool.get().await?.deref_mut())
      .await?;
     ... 
}

Thanks for sharing your workaround! deadpool looks great.

Currently the pool doesn't behave well if the acquire() future is cancelled; it fails to remove the waiter from the list, which holds up the queue. This can have a knock-on effect, with other acquire() calls timing out due to the same problem. Each time a connection is returned to the pool, it only attempts to wake the task at the head of the queue, so if there are more dead tasks than connections in the queue, the pool will stall.
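
A minimal illustration of that cancellation pattern (hypothetical usage, not code from this thread), assuming sqlx 0.4 on the Tokio runtime: wrapping acquire() in an external timeout drops the inner acquire future when the deadline fires, and on the affected versions the waiter it registered stays in the queue.

use std::time::Duration;
use sqlx::mysql::MySql;
use sqlx::Pool;

async fn acquire_with_external_timeout(pool: &Pool<MySql>) {
    match tokio::time::timeout(Duration::from_secs(1), pool.acquire()).await {
        Ok(Ok(_conn)) => { /* got a connection; it returns to the pool on drop */ }
        Ok(Err(e)) => eprintln!("acquire failed: {}", e),
        // The timeout cancels (drops) the inner acquire() future here; the
        // stale waiter it leaves behind can then hold up tasks queued after it.
        Err(_elapsed) => eprintln!("acquire cancelled by external timeout"),
    }
}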

@skuzins in your example, could you please try debug-printing the pool after the second .acquire() call times out, instead of panicking:

if let Err(e) = pool.acquire().await {
    println!("acquire error: {}, pool state: {:?}", e, pool);
}

I'm not sure any future would or should normally be cancelled in this example; I believe they should all just run to completion.

Here's the debug message; please let me know if you need more information.

acquire error: pool timed out while waiting for an open connection, pool state: 
Pool { size: 1, num_idle: 0, is_closed: false, options: PoolOptions { 
max_connections: 1, min_connections: 0, connect_timeout: 30s, 
max_lifetime: Some(1800s), idle_timeout: Some(600s), test_before_acquire: true } }

The timeout itself cancels the internal future that's actually waiting in the queue.

However, I'd expect to see num_idle: 1 if that was the case. How about with .max_connections(5)?

I am not seeing cancellations in my code and the example posted doesn't have them either. It feels more like an issue with waking up after the pool has stopped using a connection.

@abonander The thing is, there shouldn't be any timeouts. The task acquires the connection, performs a short query, drops the connection and does the same thing once more. The whole program should finish within milliseconds without any timeouts.

Here's the output with .max_connections(5):

acquire error: pool timed out while waiting for an open connection, pool state: Pool 
{ size: 5, num_idle: 0, is_closed: false, options: PoolOptions { 
max_connections: 5, min_connections: 0, connect_timeout: 30s, 
max_lifetime: Some(1800s), idle_timeout: Some(600s), test_before_acquire: true } }

FWIW, I think I'm hitting this (or something like it) when using try_acquire with Postgres.

I have a loop which waits 5 seconds and calls try_acquire to see if any connections are available. The loop also does some work with the pool in a spawned task.

Once the 5 seconds elapse, the pool shows all connections as idle, but try_acquire returns None.
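
A hypothetical reconstruction of that probing loop (not the poster's actual code), assuming sqlx 0.5 with tokio 1.x:

use std::time::Duration;
use sqlx::PgPool;

async fn probe(pool: PgPool) {
    loop {
        tokio::time::sleep(Duration::from_secs(5)).await;
        // Even when the debug output reports every connection as idle,
        // try_acquire() returns None on the affected versions.
        match pool.try_acquire() {
            Some(_conn) => println!("got an idle connection"),
            None => println!("try_acquire returned None, pool state: {:?}", pool),
        }
    }
}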

I'm not certain this bug has been fixed; I've had to use the deadpool workaround, which has entirely solved the problems we were having. PgPool seems to break under even very basic load in seemingly simple cases.

Yeah I think adding this periodic wake was a mistake: https://github.com/launchbadge/sqlx/blob/master/sqlx-core/src/pool/inner.rs#L180

After a single wait period it effectively makes the pool unfair again as it becomes dependent on the timing of when tasks wake and poll the future.

Something I realized in 0.6 that I want to backport: if a task is "woken" (as in its woken flag is set) but it never actually tries to acquire, then on drop of the Waiter it needs to wake the next task in the list so that task can attempt to acquire. Without that, there's a knock-on effect where a connection sits in the idle queue with tasks waiting, but none of them get woken because the first task was woken but never polled (either because it timed out or was cancelled).
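
A minimal sketch of that wake-on-drop idea, using hypothetical Waiter/WaitList types rather than the actual sqlx internals:

use std::collections::VecDeque;
use std::sync::Mutex;
use std::task::Waker;

struct WaitList {
    wakers: Mutex<VecDeque<Waker>>,
}

impl WaitList {
    fn wake_next(&self) {
        // Hand the wakeup to the next waiting task, if any.
        if let Some(waker) = self.wakers.lock().unwrap().pop_front() {
            waker.wake();
        }
    }
}

struct Waiter<'a> {
    woken: bool,
    acquired: bool,
    list: &'a WaitList,
}

impl Drop for Waiter<'_> {
    fn drop(&mut self) {
        // If this task was woken but never completed acquire() (it timed out
        // or was cancelled), pass the wakeup along so an idle connection
        // doesn't sit unclaimed while other tasks wait.
        if self.woken && !self.acquired {
            self.list.wake_next();
        }
    }
}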

The idle reaper might also be unintentionally preempting waiting tasks; I need to look into that.

Just spotted another bug, too; since waking a waiting task involves popping it from the SegQueue, if the task wakes and doesn't get a connection, it'll just... go back to sleep without putting itself back in the queue.

Using SegQueue was probably a bad idea anyway since waiting tasks can't retain their spot in line and have to get returned to the back of the queue. In 0.6 we're switching to an intrusive linked list where waiting tasks aren't removed until acquire() completes or is cancelled, so that should avoid this issue.
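
A rough sketch of that re-registration gap (hypothetical names, assuming crossbeam's SegQueue as the waiter queue; not the actual sqlx source): waking a task consumes its entry in the queue, so if the task then fails to get a connection it has to push a fresh waker back before returning Pending, or nothing will ever wake it again.

use crossbeam_queue::SegQueue;
use std::task::{Context, Poll, Waker};

struct Connection;

struct IdleQueue; // stand-in for the pool's idle-connection queue

impl IdleQueue {
    fn try_pop(&self) -> Option<Connection> {
        None // always empty in this sketch
    }
}

fn poll_acquire(
    waiters: &SegQueue<Waker>,
    idle: &IdleQueue,
    cx: &mut Context<'_>,
) -> Poll<Connection> {
    match idle.try_pop() {
        Some(conn) => Poll::Ready(conn),
        None => {
            // This is the step that was missing: our previous waker (if any)
            // was consumed when we were woken, so we must re-enqueue ourselves
            // before going back to sleep.
            waiters.push(cx.waker().clone());
            Poll::Pending
        }
    }
}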

The idle reaper is also definitely preempting some waiting tasks since it directly pops connections from the idle queue without checking if any tasks are waiting.

slhmy commented

I ran into a similar problem using a MySqlPool in an Actix-web 4.0-beta server.

My code is in #1199 (comment).

I make async queries by using the pool directly for each small query, and I issue queries over 512 concurrent connections.
Then I hit a PoolTimedOut error. The pool shouldn't block on connections, since each query on its own runs successfully.

@slhmy if you would, please try #1211 to see if that fixes the issue.

slhmy commented

@slhmy if you would, please try #1211 to see if that fixes the issue.

😍 Fixed; I checked with the Tokio runtime.

That's awesome to hear, thanks!

slckl commented

I'm also running into this issue with a somewhat silly use case: hundreds to low thousands of Tokio tasks running small write queries against a PgPool. It works for a couple of minutes, then buckles with a PoolTimedOut.
Unfortunately, #1211 does not seem to have any effect in my case.
I will try the deadpool version suggested above.

Edit: the deadpool version works just fine.

#1211 fixed my issue, thanks @abonander.

I have a minimal repro of the timeout issue in async code. This might be incorrect usage; please tell me if that's the case!

Cargo.toml

[package]
name = "repro_sqlx_timeout"
version = "0.1.0"
edition = "2021"

[dependencies]
sqlx = { version = "0.5", features = [
    "sqlite",
    "any",
    "runtime-tokio-native-tls",
] }

tokio = { version = "1", features = ["full"] }
futures = "0.3"

[dev-dependencies]
tempfile = "3.2.0"

main.rs

fn main() {}

#[cfg(test)]
mod tests {
    use futures::future::join_all;
    use sqlx::{migrate::MigrateDatabase, Executor};
    use std::{thread, time::Duration};

    #[tokio::test]
    async fn sqlx_timeout() {
        let dir = tempfile::tempdir().unwrap();
        let path = dir.path();
        let db_path = path.join("database.db3");
        let database_uri = format!("sqlite://{}", db_path.to_str().unwrap().replace('\\', "/"));

        sqlx::Any::create_database(&database_uri).await.unwrap();

        let connection = sqlx::any::AnyPoolOptions::new()
            .max_connections(1)
            .connect_timeout(Duration::from_millis(300))
            .connect(&database_uri)
            .await
            .unwrap();

        connection
            .execute("CREATE TABLE some_table(some_number BIGINT);")
            .await
            .unwrap();

        let mut vec_future = Vec::new();

        for _ in 1..3 {
            vec_future.push(async {
                connection
                    .execute(sqlx::query("SELECT * FROM some_table"))
                    .await
                    .unwrap(); // thread 'tests::sqlx_timeout' panicked at 'called `Result::unwrap()` on an `Err` value: PoolTimedOut'
                thread::sleep(Duration::from_secs(1));
            });
        }

        let _results = join_all(vec_future).await;
    }
}

@fjoanis-legion That thread::sleep() blocks the Tokio runtime and doesn't let the connection cleanup tasks execute. You want tokio::time::sleep(Duration::from_secs(1)).await; instead.
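
For reference, here is the repro's loop body with just that one change applied (same crates and versions as the repro above):

        for _ in 1..3 {
            vec_future.push(async {
                connection
                    .execute(sqlx::query("SELECT * FROM some_table"))
                    .await
                    .unwrap();
                // An async sleep yields back to the runtime, letting the pool's
                // background tasks run and the connection be returned and reused.
                tokio::time::sleep(Duration::from_secs(1)).await;
            });
        }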

Locking this issue to prevent further necroing. If anyone is encountering sporadic timeouts on SQLx 0.6.0, please open a new issue.