Thread hangs on exit
tsilen opened this issue · 26 comments
Recently we've run into an issue where a process using zk and redis_failover sometimes fails to die on exit, but instead hangs in an infinite sched_yield() loop taking all CPU. A SIGKILL is required to get rid of it. I guess it has to do with the program exiting without properly closing the connection first, but i guess it should still die cleanly.
gdb output gives this:
#0 0x00007fb7e6765a67 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fb7e6bd86f7 in gvl_yield (vm=0x1bfdf20, th=<optimized out>) at thread_pthread.c:125
#2 0x00007fb7e6bdac16 in rb_thread_schedule_limits (limits_us=0) at thread.c:1025
#3 rb_thread_schedule_limits (limits_us=0) at thread.c:1033
#4 rb_thread_schedule () at thread.c:1035
#5 0x00007fb7e6bdad5f in rb_thread_terminate_all () at thread.c:375
#6 0x00007fb7e6abf89e in ruby_cleanup (ex=0) at eval.c:140
#7 0x00007fb7e6abfa25 in ruby_run_node (n=0x24f0428) at eval.c:244
#8 0x00000000004007fb in main (argc=3, argv=0x7fff7725e948) at main.c:38
After adding some debug-code to ruby side to get a backtrace when the process is hung, I was able to get this:
Thread TID-t26i0
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/threadpool.rb:268:in `worker_thread_body'
When trying to reproduce it without redis_failover i was able to get it hang in a similar way, but in a different place:
Thread TID-ccaag
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common/queue_with_pipe.rb:59:in `pop'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:56:in `get_next_event'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:94:in `dispatch_thread_body'
and
Thread TID-alg44
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:110:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:122:in `wait_while'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/client/threaded.rb:533:in `block in reconnect_thread_body'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'
Code in all is somewhat similar (@cond.wait).
Any ideas? Ruby is 1.9.3-p194, but it also happens at least on 1.9.3-p0. ZK 1.7.1, zookeeper 1.3.0. OS: linux ubuntu 12.04.
At least in ruby 1.9.3 you can give a timeout to ConditionVariable's wait(), maybe that would help.
With this I was able to get it hang in a similar way quite often:
#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
$stdout.sync = true
@zkservers = "localhost:2181"
trap 'TTIN' do
Thread.list.each do |thread|
puts "Thread TID-#{thread.object_id.to_s(36)}"
puts thread.backtrace.join("\n")
end
end
def do_something
zk = ZK.new(@zkservers)
puts zk.children('/').inspect
sleep 1
end
puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
stack << Thread.new { do_something }
end
sleep rand(0)
Running it in while true; do ./test.rb; done loop until it gets stuck and then kill -TTIN prints the backtraces of the still alive threads.
One workaround would be calling close! in at_exit { }
Just wanted to let you know I got this. I'm not working today, but I'll try
to get a fix together by the EOW.
Jonathan Simms
(sent from my mobile device)
On Sep 26, 2012, at 9:07, tsilen notifications@github.com wrote:
One workaround would be calling close! in at_exit { }
—
Reply to this email directly or view it on
GitHubhttps://github.com//issues/50#issuecomment-8889029.
Any update on this one?
+1
hey guys, sorry, been bogged down with holidays and nonsense, i'll try to get to this one this week
After a quick look, it seems the problem is (as always) 1.8.7 compatibility (yes, i know, talk to @eric), where only Monitor's condvar takes a timeout. I'll have to review it a little more closely. I like the at_exit { zk.close! }
idea, i'm just not sure if that will cause a memory leak, as it will keep a reference to the zk instance. It shouldn't matter in that most people are going to only use one or two zk instances, but still.
I'll try to review the places where wait() is being called and at least get a branch together that people can test against.
Thanks for taking a look! With regards to most people not having many ZK instances - that might not always be true if you're using redis_failover with a connection pool (for example, what's used in Sidekiq). Sidekiq lets you create a connection pool with e.g. 10 or 20 redis clients. Each RedisFailover::Client instance will contain an underlying ZK instance.
By the way, I just updated (in master) RedisFailover::Client so that you can pass in an existing ZK client instance. You would use it like this:
zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
pool.with { |client| client.get('foo') }
this is probably how it should be done, just in terms of resource usage
(open file handles, etc).
On Mon, Oct 8, 2012 at 3:37 PM, Ryan LeCompte notifications@github.comwrote:
By the way, I just updated (in master) RedisFailover::Client so that you
can pass in an existing ZK client instance. You would use it like this:zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
pool.with { |client| client.get('foo') }—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-9237528.
Ok, so I think the test @tsilen came up with shows a bug in ruby. I've run into issues before with at-exit GC and thread primitives (in 1.8, but still). If you modify the above torture test:
#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
require 'logger'
$stdout.sync = true
$log = Logger.new($stderr).tap { |n| n.level = Logger::DEBUG }
@zkservers = "localhost:2181"
trap 'TTIN' do
Thread.list.each do |thread|
puts "Thread TID-#{thread.object_id.to_s(36)}"
puts thread.backtrace.join("\n")
end
end
def do_something
ZK.open(@zkservers) do |zk|
thr_id = Thread.current.object_id.to_s(36)
$log.debug { "#{thr_id} #{zk.children('/').inspect}" }
end
sleep 1
end
puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
stack << Thread.new { do_something }
end
# sleep rand(0)
stack.each do |th|
begin
th.join(3)
rescue Exception => e
puts "kill: #{e.class} #{e.message}"
end
end
By waiting for the threads to exit before "falling off the end of the script" I can't get it to hang.
Yes, the issue seems to happen when a thread is alive and in some specific state at the end of the script, so waiting for them to exit should also do the trick. I did some quick testing with the updated script and didn't get it to hang either, but I'll test a bit more tomorrow.
Ok, cool. I'm like 99% sure that this is a bug in the way ruby does ConditionVariable cleanup.
Looks like close! in at_exit indeed doesn't work in all cases. This is a quite a big problem, because for example rails (rack) processes many times hang when they should exit when released from e.g. Passenger's pool and keep eating a lot of resources. I think it's something that has to be handled within zk.
Yeah, you need to close your connections before "falling off the end of the script" as above. You cannot rely on ruby to shut down these clients properly because it's not handling cleaning up POSIX thread primitives properly. I tested in an at_exit and it didn't work, I had to wait for all of the clients to finish closing and join all threads before reaching the end.
I'm not sure how to handle this in terms of passenger, but I'd imagine that there's some kind of shutdown you can hook into.
Again, there's really nothing I can do on the ZK end, as I don't control the main thread. The only other thing you can do is do your work inside of a ZK.open() block, which will ensure that zk is shutdown when the block returns.
Can you modify the code to use 1.9's condvar api when running in 1.9?
We are seeing sidekiq hang occasionally in production on exit and yep we're running redis_failover.
If you close ZK before the process exits, the hang doesn't happen because
all the underlying posix stuff gets cleaned up. Clean shutdown is key.
The reason i have to use Monitor is because @eric, who's been a major
contributor is still running 1.8, and I've agreed to maintain backwards
compatibility. 1.8's ConditionVariable doesn't have a timeout option, which
makes it really difficult.
On Fri, Dec 7, 2012 at 11:46 AM, Mike Perham notifications@github.comwrote:
We are seeing sidekiq hang occasionally in production on exit and yep
we're running redis_failover.—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11136886.
Looking at the 1.9 code for monitor, it looks like it's just using a standard ConditionVariable
and Mutex
under the covers. There shouldn't be any real difference in how it acts on 1.9.
Except that ConditionVariable and Mutex in 1.9 have real posix thread
primitives backing them, and 1.9 doesn't clean up effectively before
exiting, so it deadlocks.
On Fri, Dec 7, 2012 at 2:35 PM, Eric Lindvall notifications@github.comwrote:
Looking at the 1.9 code for monitor, it looks like it's just using a
standard ConditionVariable and Mutex under the covers. There shouldn't be
any real difference in how it acts on 1.9.—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11143151.
If you are suggesting there's a bug in Ruby 1.9, that's quite possible. I found a bug in CondVar last year, made a reproducible test case and had it fixed in head the next morning.
Yes, the consensus is that there is a bug in the 1.9 Mutex
/ConditionVariable
implementation that causes the VM to hang if there are threads that are waiting on a ConditionVariable
when the process exits.
Does sidekiq do graceful shutdowns? If so, it's likely you could work around this by calling close!
on the ZK
object before exiting.
Where's the redmine ticket for the bug?
Sidekiq knows nothing about ZK. It's using the redis client, which is wrapped by the redis_failover client. Sidekiq does not bother to close sockets / clients before shutting down and redis_failover does not provide a close/shutdown operation.
Someone will have to pipe up if they've isolated the problem into a reproducible testcase, but from reading the thread you and @tsilen have described it as an "occasional" issue, which would lead me to believe neither of you have isolated it.
We're all volunteers here. I run ruby 1.8, and haven't run into this specific issue, but hopefully one of the self-motivated people on this thread who does run 1.9 and have been impacted by this issue could find the time to isolate it and submit a redmine ticket.
In the mean time, it may be worth investigating providing a mechanism in sidekiq for users of the library to register callbacks for performing a clean shutdown which would allow a user to perform a graceful shutdown of the redis_failover or zk client.
FYI, redis_failover does provide the RedisFailover::Client#shutdown method
which closes the ZK connection and closes the underlying Redis clients.
On Fri, Dec 7, 2012 at 12:54 PM, Eric Lindvall notifications@github.comwrote:
Someone will have to pipe up if they've isolated the problem into a
reproducible testcase, but from reading the thread you and @tsilenhttps://github.com/tsilenhave described it as an "occasional" issue, which would lead me to believe
neither of you have isolated it.We're all volunteers here. I run ruby 1.8, and haven't run into this
specific issue, but hopefully one of the self-motivated people on this
thread who does run 1.9 and have been impacted by this issue could find the
time to isolate it and submit a redmine ticket.In the mean time, it may be worth investigating providing a mechanism in
sidekiq for users of the library to register callbacks for performing a
clean shutdown which would allow a user to perform a graceful shutdown of
the redis_failover or zk client.—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11145976.
"Where's the redmine ticket for the bug?"
You're kidding.
On Fri, Dec 7, 2012 at 3:47 PM, Mike Perham notifications@github.comwrote:
Where's the redmine ticket for the bug?
Sidekiq knows nothing about ZK. It's using the redis client, which is
wrapped by the redis_failover client. Sidekiq does not bother to close
sockets / clients before shutting down and redis_failover does not provide
a close/shutdown operation.—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11145692.
Ryan, FYI it's not listed here which is linked from your readme: http://rubydoc.info/github/ryanlecompte/redis_failover/master/RedisFailover/Client