aws/aws-sdk-cpp

Liveness / performance bug in PooledThreadExecutor

hunjmes opened this issue · 3 comments

Describe the bug

My team noticed a performance / "liveness" bug when using the PooledThreadExecutor to handle lots of requests to S3. Our repro is reliable, but unfortunately too complex include with this report, so a description will have to suffice.

We observed lots of "producer" threads waiting for the SDK to complete requests that they had added to the PooledThreadExecutor's queue, while almost all of the PooledThreadExecutor's "consumer" ThreadTasks were idle, waiting at this line.

The simplest description of the bug is that function ThreadTask::MainTaskRunner() doesn't use or hold the same mutex, between checking m_executor.HasTasks() and when it waits on m_executor.m_sync.WaitOne(). This allows a concurrent "producer" thread to add a task to the queue, after ThreadTask::MainTaskRunner() has concluded that the queue is empty, but before ThreadTask::MainTaskRunner() waits on the Semaphore.

In such a case, the queue has a task to be performed, but the ThreadTask does not realize this.

Expected Behavior

When pushing a task to the PooledThreadExecutor's queue, if a ThreadTask is available, then it should execute that task.

Current Behavior

PooledThreadExecutor's queue is non-empty, but ThreadTasks remain idle.

In my team's reopro, overall throughput drops dramatically, with hundreds of threads all waiting, ultimately, on the PooledThreadExecutor's Semaphore. (Eventually, in our repro, the "producer" threads fill their queues, so they end up blocked on the ThreadTasks, which are not executing the work on the PooledThreadExecutor's queue.)

Reproduction Steps

Not feasible to repro at scale, here.

Possible Solution

Function ThreadTask::MainTaskRunner() needs to hold the mutex until it waits on the Semaphore. Unfortunately, the SDK's Semaphore class doesn't take a mutex parameter, so the function would probably have to use std::condition_variable, instead. (Note that this code already uses std::lock_guard and std::mutex, so also using std::unique_lock and std::condition_variable should not be a problem.)

Possible solution, something like this (with corresponding changes made elsewhere):

void ThreadTask::MainTaskRunner()
{
    while (m_continue)
    {        
        std::unique_lock lock(m_executor.m_queueLock);
        while (m_continue && m_executor.HasTasks() /*must not acquire m_queueLock anymore*/)
        {      
            auto fn = m_executor.PopTask();
            lock.unlock();
            if(fn)
            {
                (*fn)();
                Aws::Delete(fn);               
            }
            lock.lock();
        }
     
        if(m_continue)
        {
            m_executor.m_cv.wait(lock, ...);
        }
    }
}

Additional Information/Context

No response

AWS CPP SDK version used

1.11.x

Compiler and Version used

gcc 8

Operating System and version

Linux 5.4.258

jmklix commented

Thanks for opening this issue and creating a PR! Reviewing the PR

I'm also seeing this on tag 1.11.68

@jmklix is there an update on this? The PR still looks unreviewed