DataDog/dogstatsd-ruby

Segfault(s) with 5.5.0

aharpervc opened this issue · 3 comments

Hi, I'm in the process of upgrading a Rails app, and updated dogstatd-ruby from 3.3.0 to 5.5.0. The former version had been working fine without issue for years, but now with the upgraded version I'm now seeing a segfault. The only code that's changed is related to this gem. I'm on Rails 5.2.2.1 although hopefully that part shouldn't matter.

My code is basically this (somewhat simplified):

# config\initializers\datadog_instrumentation.rb
require 'datadog/statsd'
STATSD = Datadog::Statsd.new('localhost', 8125, buffer_flush_interval: 60)

class DataDogInstrumentation
  def self.notify app, event, *data
    STATSD.count "#{app}.#{event}", data.first[:count]
  end
end

# newly added to follow the v5 requirement of always calling `close`
at_exit do
  STATSD.close
end
# code elsewhere in the Rails app does this (simplified)
DataDogInstrumentation.notify "my_app", "my_event", count: 5

Here's the log (slightly edited & truncated for brevity; line 30 in the original code is STATSD.close in the at_exit block):

[BUG]
Segmentation fault at 0x00005563f7bb5f21

ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]

-- Ruby level backtrace information ----------------------------------------
/my-apps-path-here/releases/20220923182902/config/initializers/datadog_instrumentation.rb:30:in `block in <main>'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:380:in `close'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/forwarder.rb:109:in `close'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:107:in `stop'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:37:in `stop'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:37:in `synchronize'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:38:in `block in stop'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/timer.rb:38:in `signal'

That's this code:

def stop
return if @thread.nil?
@stop = true
@mx.synchronize do
@cv.signal
end
@thread.join
@thread = nil
end

I thought maybe there was something weird going on with the at_exit block, so I removed that part for testing, but I'm still getting a segfault. The stack trace is a bit different (again, somewhat edited/snipped for clarity/brevity):

/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:76:

[BUG]
Segmentation fault at 0x0000562bf467b1b5

ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]

-- Ruby level backtrace information ----------------------------------------
/my-apps-path-here/releases/20220923194611/config/initializers/datadog_instrumentation.rb:13:in `notify'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:162:in `increment'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:190:in `count'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd.rb:430:in `send_stats'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/forwarder.rb:75:in `send_message'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:70:in `add'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:70:in `synchronize'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:76:in `block in add'
/my-apps-path-here/shared/bundle/ruby/2.5.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/sender.rb:76:in `close'

That's this:

def add(message)
raise ArgumentError, 'Start sender first' unless message_queue
# if the thread does not exist, we assume we are running in a forked process,
# empty the message queue and message buffers (these messages belong to
# the parent process) and spawn a new companion thread.
if !sender_thread.alive?
@mx.synchronize {
# a call from another thread has already re-created
# the companion thread before this one acquired the lock
break if sender_thread.alive?
@logger.debug { "Statsd: companion thread is dead, re-creating one" } if @logger
message_queue.close if CLOSEABLE_QUEUES
@message_queue = nil
message_buffer.reset
start
@flush_timer.start if @flush_timer && @flush_timer.stop?
}
end
if message_queue.length <= @queue_size
message_queue << message
else
@telemetry.dropped_queue(packets: 1, bytes: message.bytesize) if @telemetry
end
end

So, different code, but line 76 is message_queue.close if CLOSEABLE_QUEUES, so maybe the same root cause if it's related to closing something.

At first glance this seems like an issue with dogstatsd-ruby itself, especially since there had been no issues at all on the prior version.

Thoughts?

A workaround suggested by DataDog technical support was to add the single_thread: true option to the initializer. This does appear to make the segfault(s) go away.

Thanks for the detailed report! I was able to reproduce this crash with the following ruby program:

require './datadog_instrumentation.rb' # using your snippet from the first post untouched

def do_work
        puts "Counting... " + Time.new.to_s
        DataDogInstrumentation.notify "my_app", "my_event", count: 5
end

def crash
        do_work
        sleep 1
        Process.fork() # leak the child
        do_work
        Process.exit()
end

def dontcrash
        do_work
        sleep 1
        do_work
        Process.exit()
end

puts ARGV.first
if ARGV.first == "crash"
        crash
else
        dontcrash
end

With uname -a:

Linux dev-ruby-repro 4.15.0-193-generic #204-Ubuntu SMP Fri Aug 26 19:25:57 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

I'm able to see a crash almost every single time running ruby main.rb crash against ruby 2.5.3.
Upgrading to ruby 2.5.9 fixes the crash for me, I'm completely unable to reproduce it.
(Note I'm testing on a slightly different platform, aarch64 vs amd64, so you may see slightly different results)

I would recommend upgrading ruby to2.5.9 version of ruby (or newer, 2.5 and 2.6 are both EOL according to this)

If this is not an option, then single_thread: true should help, however this doesn't completely remove the crash for me, it just makes it less frequent.

Thanks for investigating!

I would recommend upgrading ruby to2.5.9 version of ruby (or newer, 2.5 and 2.6 are both EOL according to this)

Oh I am well aware 😆. That is a separate project for us though. I'm satisfied that this is a reasonable answer to close this issue, though.