ruby/net-imap

Flaky test with macOS platform

Opened this issue · 3 comments

hsbt commented

I sometimes got the following failure with ruby/ruby and make test-bundled-gems.

Testing the net-imap gem
/Users/hsbt/Documents/github.com/ruby/ruby/exe/ruby -C /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap /Users/hsbt/Documents/github.com/ruby/ruby/.bundle/bin/rake test
Loaded suite /Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/rake-13.2.1/lib/rake/rake_test_loader
Started
/Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/test-unit-3.6.2/lib/test/unit/fixture.rb:284: warning: the block passed to 'priority_setup' defined at /Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/test-unit-3.6.2/lib/test/unit/priority.rb:183 may be ignored
F
====================================================================================================================================================================================
Failure: test_imaps_with_ca_file(IMAPTest):
  Exception raised:
  <#<OpenSSL::SSL::SSLError: SSL_read: tlsv1 alert unknown ca (SSL alert number 48)>>
  Backtrace:
    /Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::SSL::SSLSocket#sysread'
    /Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::Buffering#fill_rbuff'
    /Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:236:in 'OpenSSL::Buffering#gets'
    /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2720:in 'Net::IMAP#get_response'
    /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2629:in 'Net::IMAP#receive_responses'
    /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2606:in 'block in Net::IMAP#start_receiver_thread'.
/Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::SSL::SSLSocket#sysread'
/Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::Buffering#fill_rbuff'
/Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:236:in 'OpenSSL::Buffering#gets'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2720:in 'Net::IMAP#get_response'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2629:in 'Net::IMAP#receive_responses'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2606:in 'block in Net::IMAP#start_receiver_thread'
====================================================================================================================================================================================
Finished in 8.730846 seconds.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
818 tests, 8689 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
99.8778% passed
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
93.69 tests/s, 995.21 assertions/s
rake aborted!
Command failed with status (1)
/Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/rake-13.2.1/exe/rake:27:in '<top (required)>'
Tasks: TOP => test
(See full trace by running task with --trace)
Tests failed with exit code 1

My environment is macOS 14.5 with Apple Silicon and using OpenSSL 3 library provided from homebrew.

@hsbt That's strange... is it only that one test? How easy is it to reproduce? Is it failing ~5% of the time or 95% or? I haven't noticed it failing in the CI workflow. I'm only running on Linux, and haven't seen it myself. I wonder if it's unique to Apple Silicon or a specific version of OpenSSL or something else that's different from macos-latest in GH actions.

I can see one thing that's unique about that test. For all of the tests that create an SSL connection:

This test is one of the four that uses imaps_test. And of those four, this is one of only two that expects a successful connection. And this is the only one that verifies the server's cert. So, if this is the only test that's flaky, the problem is probably in imaps_test.

At first, I couldn't see anything significant between how the server is configured for imaps_test vs any of the others. We're using the exact same SSL fixture files (ca_path, key, and cert) in all three places, and we seem to be creating the SSLContext identically. But, after looking more closely, I see that imaps_test is using OpenSSL::SSL::SSLServer.new(tcp_server, ssl_ctx) and the other two are using OpenSSL::SSL::SSLSocket.new(tcp_socket, ssl_ctx).

Looking at SSLServer, it seems like there's no difference between what it's doing and what FakeServer is doing with SSLSocket. The only difference I can discern is that SSLServer#initialize contains the following:

        unless ctx.session_id_context
          # see #6137 - session id may not exceed 32 bytes
          prng = ::Random.new($0.hash)
          session_id = prng.bytes(16).unpack1('H*')
          @ctx.session_id_context = session_id
        end

I have no idea why session_id_context would matter for this (it probably doesn't), but it's the only difference I could see.

We could experiment with simply converting imaps_test to use SSLSocket directly, or even convert those four tests over to run_fake_server_in_thread. But I'm still baffled and curious about why this would make a difference.

Should we ask about this over at https://github.com/ruby/openssl/issues?

I ran make test-bundled-gems on ruby/ruby ~10 times with no luck. I'm also running on Linux.

<#<OpenSSL::SSL::SSLError: SSL_read: tlsv1 alert unknown ca (SSL alert number 48)>>

This OpenSSL::SSL::SSLError on the client side seems weird. The error message says that it received the "unknown_ca" alert. It is usually sent when the server doesn't trust the client certificate, but I don't see any test cases in net-imap that use a client certificate.


session_id_context

I believe it's irrelevant. The seesion ID context is only used for TLS session resumption, which net-imap doesn't use.

hsbt commented

That's strange... is it only that one test? How easy is it to reproduce? Is it failing ~5% of the time or 95% or?

I see this fail once every day.

I haven't noticed it failing in the CI workflow.

Yes, I also didn't see this fail our GitHub Actions. It may be caused by my development environment.