aarch64 community box dropping connection to AMQP host
cole-h opened this issue
Recently, the number of live aarch64 builders has been shrinking over time, until the next redeploy brings them back to life. After a bit of debugging, we noticed that the connection to the AMQP host is somehow lost, but the builder doesn't exit. Heartbeats should have helped in this situation but didn't. Looking at the connections with `ss`, we see the following on a busted builder:
```
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
u_str ESTAB 0 0 * 44148 * 24754 users:(("builder",pid=25341,fd=2),("builder",pid=25341,fd=1),("grahamcofborg-b",pid=25335,fd=2),("grahamcofborg-b",pid=25335,fd=1))
```
and the following on a working builder:
```
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
u_str ESTAB 0 0 * 80906 * 33026 users:(("builder",pid=24827,fd=2),("builder",pid=24827,fd=1),("grahamcofborg-b",pid=24816,fd=2),("grahamcofborg-b",pid=24816,fd=1))
tcp ESTAB 0 0 xxx.xxx.xxx.xxx:47128 xxx.xxx.xxx.xxx:5671 users:(("builder",pid=24827,fd=3))
```
As you can see, the busted builder has dropped its connection to the AMQP host, while the working builder still has an established connection.
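One way to make this state impossible to linger in would be to crash the process as soon as the client library reports a connection error, and let the supervisor redeploy/restart it. A minimal sketch of that idea, assuming a recent lapin API (the `connect` helper name is hypothetical, and it is not ofborg's actual code):

```rust
use lapin::{Connection, ConnectionProperties};

// Hypothetical helper: connect to the AMQP host and make any later
// connection error fatal instead of leaving the builder running idle.
async fn connect(amqp_uri: &str) -> lapin::Result<Connection> {
    let conn = Connection::connect(amqp_uri, ConnectionProperties::default()).await?;

    // If the socket to the AMQP host dies (heartbeat timeout, reset, ...),
    // lapin invokes this callback; exiting lets the service manager
    // restart the builder instead of it lingering without a connection.
    conn.on_error(|err| {
        eprintln!("AMQP connection lost: {:?}", err);
        std::process::exit(1);
    });

    Ok(conn)
}
```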
Potentially unrelated, but in the stack trace of one of the busted builders, we also noticed the following thread:
```
TID 25367:
#0 0x0000ffff8f0eacd4 pthread_cond_wait@@GLIBC_2.17
#1 0x0000aaaae9de77d0 std::thread::park::h1fac58ddd22dac93
#2 0x0000aaaae98ac830 crossbeam_channel::context::Context::wait_until::h47058df4a5256735
#3 0x0000aaaae9987c88 crossbeam_channel::flavors::list::Channel$LT$T$GT$::recv::_$u7b$$u7b$closure$u7d$$u7d$::h02279d7921848ff9
#4 0x0000aaaae98adf18 crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::h6995da6c0885a3c7
#5 0x0000aaaae98aef2c crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::he1284795f72001bb
#6 0x0000aaaae99a9e80 std::thread::local::LocalKey$LT$T$GT$::try_with::h4fdf647ecd5ad711
#7 0x0000aaaae98ad0b4 crossbeam_channel::context::Context::with::hb484fda6ebec39c0
#8 0x0000aaaae9987b88 crossbeam_channel::flavors::list::Channel$LT$T$GT$::recv::hf6bf4df54bc8bfd4
#9 0x0000aaaae998289c crossbeam_channel::channel::Receiver$LT$T$GT$::recv::hec7e7cbef7223cc7
#10 0x0000aaaae9a403f4 lapin::socket_state::SocketState::wait::h1b7cc14ee32195dd
#11 0x0000aaaae995b200 lapin::io_loop::IoLoop::run::h5c2f4a08fb0d30a3
#12 0x0000aaaae995a7c4 lapin::io_loop::IoLoop::start::_$u7b$$u7b$closure$u7d$$u7d$::h5562f2f06733e920
#13 0x0000aaaae98ee454 std::sys_common::backtrace::__rust_begin_short_backtrace::h7549166dff606a08
#14 0x0000aaaae99aba04 std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h29c85edad72c9531
#15 0x0000aaaae9964600 _$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hf5ee88030140dc2f
#16 0x0000aaaae99ac00c std::panicking::try::do_call::h286d7e4a15b0a0ed
#17 0x0000aaaae9a6b038 __rust_try
#18 0x0000aaaae99abe74 std::panicking::try::hc9e0e7a4417d6ead
#19 0x0000aaaae99a94c0 std::panic::catch_unwind::ha411c12cbbd43248
#20 0x0000aaaae99ab61c std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hdf83f6ce30b202de
#21 0x0000aaaae99b2fb8 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1dd7bfb988fbf8ab
#22 0x0000aaaae9ded7ac std::sys::unix::thread::Thread::new::thread_start::haf3b724064391ad1
#23 0x0000ffff8f0e43b4 start_thread
#24 0x0000ffff8f0198dc thread_start
```
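Frames #9 and #10 show lapin's io-loop thread parked inside a blocking `crossbeam_channel::Receiver::recv()` via `SocketState::wait()`, waiting for a wakeup that apparently never arrives once the TCP connection is gone. A toy example (not lapin or ofborg code) of that parking behaviour, and of `recv_timeout()` as the variant that hands control back instead of waiting forever:

```rust
use std::time::Duration;
use crossbeam_channel::{unbounded, RecvTimeoutError};

fn main() {
    // A sender exists but never sends, mimicking a socket-state waker that
    // never fires after the connection silently goes away.
    let (tx, rx) = unbounded::<()>();
    let _keep_alive = tx;

    // rx.recv() would park this thread indefinitely (the pthread_cond_wait /
    // Context::wait_until frames above); recv_timeout() returns control so
    // the caller could re-check the socket and bail out.
    match rx.recv_timeout(Duration::from_secs(1)) {
        Ok(()) => println!("woken up"),
        Err(RecvTimeoutError::Timeout) => println!("timed out; connection state could be re-checked here"),
        Err(RecvTimeoutError::Disconnected) => println!("all senders are gone"),
    }
}
```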
Maybe this is related to something panicking, and thus somehow preventing a clean exit (or any exit at all)? Though it looks to me like those `catch_unwind` frames are just part of the standard `std::thread::Builder::spawn_unchecked` machinery (which catches panics in the spawned closure so they can be reported back through the join handle, rather than making the caller handle any errors at the call site).
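For reference on the "exit altogether" angle: a panic in a spawned thread unwinds only that thread, so the rest of the process keeps running unless something joins the handle and reacts, or the binary is built with `panic = "abort"`. A small standalone sketch (unrelated to the actual ofborg code) showing that behaviour:

```rust
use std::{thread, time::Duration};

fn main() {
    // The spawned thread panics, but the process does not exit; the panic
    // only surfaces as an Err when the handle is joined.
    let handle = thread::spawn(|| {
        panic!("pretend the io loop blew up");
    });

    thread::sleep(Duration::from_millis(100));
    println!(
        "main thread still alive; spawned thread panicked: {}",
        handle.join().is_err()
    );
}
```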