Liveness / performance bug in PooledThreadExecutor
hunjmes opened this issue · 3 comments
Describe the bug
My team noticed a performance / "liveness" bug when using the PooledThreadExecutor to handle a large volume of requests to S3. Our repro is reliable, but unfortunately too complex to include with this report, so a description will have to suffice.
We observed many "producer" threads waiting for the SDK to complete requests that they had added to the PooledThreadExecutor's queue, while almost all of the PooledThreadExecutor's "consumer" ThreadTasks were idle, waiting at the m_executor.m_sync.WaitOne() call in ThreadTask::MainTaskRunner().
The simplest description of the bug is that ThreadTask::MainTaskRunner() does not hold the same mutex between checking m_executor.HasTasks() and waiting on m_executor.m_sync.WaitOne(). This allows a concurrent "producer" thread to add a task to the queue after ThreadTask::MainTaskRunner() has concluded that the queue is empty, but before it waits on the Semaphore.
In such a case, the queue has a task to be performed, but the ThreadTask does not realize this.
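Reduced to a standalone sketch, the racy shape looks like this (illustrative names and types only, not the SDK's code; std::binary_semaphore stands in for the SDK's Semaphore, whose exact semantics differ):

    #include <atomic>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <semaphore>

    std::mutex queueLock;
    std::queue<std::function<void()>> tasks;
    std::binary_semaphore wakeSignal{0}; // stand-in for the SDK's Semaphore
    std::atomic<bool> keepRunning{true};

    bool hasTasks()
    {
        std::lock_guard<std::mutex> lock(queueLock);
        return !tasks.empty(); // the lock is released as soon as this returns
    }

    void workerLoop()
    {
        while (keepRunning)
        {
            while (keepRunning && hasTasks())
            {
                std::function<void()> fn;
                {
                    std::lock_guard<std::mutex> lock(queueLock);
                    fn = std::move(tasks.front());
                    tasks.pop();
                }
                fn();
            }
            // RACE WINDOW: no lock is held here. A producer can enqueue a
            // task after hasTasks() reported empty but before acquire()
            // parks this thread; because the emptiness check and the wait
            // are not covered by one mutex, the wait cannot atomically
            // re-check the queue.
            if (keepRunning)
            {
                wakeSignal.acquire();
            }
        }
    }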
Expected Behavior
When pushing a task to the PooledThreadExecutor's queue, if a ThreadTask is available, then it should execute that task.
Current Behavior
PooledThreadExecutor's queue is non-empty, but ThreadTasks remain idle.
In my team's repro, overall throughput drops dramatically, with hundreds of threads all waiting, ultimately, on the PooledThreadExecutor's Semaphore. (Eventually, in our repro, the "producer" threads fill their queues, so they end up blocked on the ThreadTasks, which are not executing the work on the PooledThreadExecutor's queue.)
Reproduction Steps
Our repro requires load at a scale that is not feasible to include here.
Possible Solution
The function ThreadTask::MainTaskRunner() needs to hold the mutex until it waits on the Semaphore. Unfortunately, the SDK's Semaphore class doesn't take a mutex parameter, so the function would probably have to use std::condition_variable instead. (Note that this code already uses std::lock_guard and std::mutex, so also using std::unique_lock and std::condition_variable should not be a problem.)
A possible solution would look something like this, with corresponding changes made elsewhere:
    void ThreadTask::MainTaskRunner()
    {
        while (m_continue)
        {
            std::unique_lock<std::mutex> lock(m_executor.m_queueLock);
            while (m_continue && m_executor.HasTasks() /*must not acquire m_queueLock anymore*/)
            {
                auto fn = m_executor.PopTask();
                lock.unlock();   // run the task without holding the queue lock
                if (fn)
                {
                    (*fn)();
                    Aws::Delete(fn);
                }
                lock.lock();     // re-acquire before re-checking HasTasks()
            }
            if (m_continue)
            {
                // wait() releases m_queueLock and re-acquires it atomically,
                // closing the window between the emptiness check and the sleep
                // (predicate elided here; it would re-check m_continue / HasTasks())
                m_executor.m_cv.wait(lock, ...);
            }
        }
    }
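For completeness, the "corresponding changes made elsewhere" would presumably land on the producer side: enqueue while holding m_queueLock, then signal the new condition variable. A hypothetical sketch (m_taskQueue and the allocation tag are assumed names, not the SDK's or the PR's actual code):

    bool PooledThreadExecutor::SubmitToThread(std::function<void()>&& fn)
    {
        // Allocate the task outside the critical section.
        auto* task = Aws::New<std::function<void()>>("PooledThreadExecutor", std::move(fn));
        {
            std::lock_guard<std::mutex> lock(m_queueLock);
            m_taskQueue.push(task);
        }
        m_cv.notify_one(); // wake one idle ThreadTask
        return true;
    }

Notifying after releasing the lock lets the woken ThreadTask acquire m_queueLock without immediately blocking on it; shutdown would likewise need to flip m_continue under m_queueLock and call notify_all() so that every waiting worker re-evaluates its predicate.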
Additional Information/Context
No response
AWS CPP SDK version used
1.11.x
Compiler and Version used
gcc 8
Operating System and version
Linux 5.4.258
Pull request: https://github.com/aws/aws-sdk-cpp/pull/2745/commits
Thanks for opening this issue and creating a PR! Reviewing the PR
I'm also seeing this on tag 1.11.68
@jmklix is there an update on this? The PR still looks unreviewed