aws/aws-sdk-cpp

Destructor S3CrtClient::~S3CrtClient() hangs, as of release 1.11.195

hunjmes opened this issue · 7 comments

Describe the bug

Symptoms are superficially similar to #2769 , except that issue involved calling Aws::ShutdownAPI(), while an S3CrtClient was still alive.

This bug involves the S3CrtClient::~S3CrtClient() destructor, itself. We see a hang/deadlock inside:

#0  0x00007fcfffe93377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fcffdc63800 in __gthread_cond_wait (__mutex=<optimized out>, __cond=0x7fcff3f246c8)
    at /local/p4clients/pkgbuild-v3X3O/workspace/build/LibGCC/LibGCC-gcc.238862.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/gcc-src/build/private/src/gcc-11.4.0-build2/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#2  std::__condvar::wait (__m=..., this=0x7fcff3f246c8)
    at /local/p4clients/pkgbuild-v3X3O/workspace/build/LibGCC/LibGCC-gcc.238862.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/gcc-src/build/private/src/gcc-11.4.0-build2/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/std_mutex.h:155
#3  std::condition_variable::wait (this=this@entry=0x7fcff3f246c8, __lock=...)
    at /local/p4clients/pkgbuild-v3X3O/workspace/build/LibGCC/LibGCC-gcc.238862.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/gcc-src/build/private/src/gcc-11.4.0/libstdc++-v3/src/c++11/condition_variable.cc:41
#4  0x00007fd001f4d79b in std::condition_variable::wait<Aws::Utils::Threading::Semaphore::WaitOne()::<lambda()> > (__p=..., __lock=..., this=0x7fcff3f246c8)
    at /opt/brazil-pkg-cache/packages/GCC/GCC-11.x.1214.0/AL2_x86_64/DEV.STD.PTHREAD/build/gcc-11.4.0/include/c++/11.4.0/condition_variable:103
#5  Aws::Utils::Threading::Semaphore::WaitOne (this=0x7fcff3f24690)
    at /local/p4clients/pkgbuild-3Vovk/workspace/build/Aws-sdk-cpp-core/Aws-sdk-cpp-core-1.11.x_StdMemoryManagement.96405.0/AL2_x86_64/DEV.STD.PTHREAD/build/private/aws-sdk-cpp/src/aws-cpp-sdk-core/source/utils/threading/Semaphore.cpp:21
#6  0x00007fd001a0b76f in Aws::S3Crt::S3CrtClient::~S3CrtClient (this=0x7ffd3bf79cb0, __in_chrg=<optimized out>)
    at /opt/brazil-pkg-cache/packages/GCC/GCC-11.x.1214.0/AL2_x86_64/DEV.STD.PTHREAD/build/gcc-11.4.0/include/c++/11.4.0/bits/shared_ptr_base.h:984

Expected Behavior

The S3CrtClient's destructor should not hang. (This was the behavior, as of release 1.11.175.)

Current Behavior

The S3CrtClient's destructor hangs, as of release 1.11.195.

Reproduction Steps

We don't have a formal repro yet. I am trying to nail down the root cause of the hang -- it might be unexpected/incorrect usage of the AWS API, by our client application -- concurrently with filing this bug.

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

1.11.195

Compiler and Version used

gcc 11

Operating System and version

Linux 5.10.216-182.855.amzn2int.x86_64

Also hangs in release 1.11.328 .

Looks like function CrtClientShutdownCallback() is not being called, to Release() the S3CrtClient's m_clientShutdownSem semaphore field.

Problem seems to be that if the caller creates and destroys an S3CrtClient, using RAII pattern, then the constructor creates a semaphore field:

  m_clientShutdownSem = Aws::MakeShared<Threading::Semaphore>(ALLOCATION_TAG, 0, 1);

-- and passes that semaphore to the wrapped Crt Client:

  m_wrappedData.clientShutdownSem = m_clientShutdownSem;
  s3CrtConfig.shutdown_callback = CrtClientShutdownCallback;
  s3CrtConfig.shutdown_callback_user_data = static_cast<void*>(&m_wrappedData);
  m_s3CrtClient = aws_s3_client_new(Aws::get_aws_allocator(), &s3CrtConfig);

Then, inside the destructor:

  aws_s3_client_release(m_s3CrtClient);
  if(m_clientShutdownSem)
  {
    m_clientShutdownSem->WaitOne(); // Wait aws_s3_client shutdown
  }

However, the call to aws_s3_client_release(...) doesn't seem to be releasing the Crt client.

I wonder if there's an implicit assumption, at the Crt layer, that there is always only 1 S3 (crt) client alive, at a time? What's supposed to happen, if the caller does:

{
  ClientConfiguration config;
  S3CrtClient client1(config);
  S3CrtClient client2(config);
...
}

?

My hanging/failing test is a Google-test "crash test," and the hanging process is the process fork()ed by the Google-test framework. My other unit tests, that don't use the "crash test" framework, do not appear to hang.

I wonder if some "Crt" state that used to be tied to a given S3CrtClient object has now been made static?

A quick note is that forking can lead to somewhat unexpected results currently when used in conjunction with CRT.
CRT does rely on some global state that can end up being mangled during fork.
This is something on the backlog for the team to improve, but we dont have a concrete ETA for that.
As a workaround, you could try to avoid forking while CPP SDK is in initialized state, i.e. explicitly calling ShutdownAPI before forking the process.

Confirmed that the proposed workaround, to call ShutdownAPI before forking, works. Thanks!

@hunjmes Did you have any other questions about this sdk?