open-telemetry/opentelemetry-ruby

Performance impact on Ruby on Rails?

kedare opened this issue · 15 comments

kedare commented

Description of the bug

We had a sampling rate of 0.1 enabled for quite some time.
As soon as we enabled sampling at 1.0 (100%), we saw terrible performance on various spans, for example the ones related to PostgreSQL


We pushed similar changes to other parts of our infrastructure (Python and Go code) and did not face similar issues.

We even saw some spans (still related to the database) that usually take less than 100ms take up to 30s for no apparent reason; we had to roll back the configuration to the 0.1 sampling rate.

It's not clear what could cause this. I would expect some performance impact, but nothing near a 300x increase in operation duration.

I did not do any kind of optimization on the OTLP exporter settings or anything like that.

We are using Ruby on Rails with Puma.

Share details about your runtime

Operating system details: Linux, Ubuntu 20.04 LTS
RUBY_ENGINE: "ruby"
RUBY_VERSION: "3.2.1"
RUBY_DESCRIPTION: "ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]"

Configuration

OTEL_EXPORTER_OTLP_ENDPOINT=http://xxx:4318
OTEL_LOG_LEVEL=warn
OTEL_RESOURCE_ATTRIBUTES=env=us,version=2.3.21
OTEL_SERVICE_NAME=back
OTEL_TRACES_SAMPLER_ARG=1.0
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OpenTelemetry::SDK.configure do |config|
  is_otel_disabled = ENV.fetch('OTEL_SDK_DISABLED', 'false').to_b
  is_logging_disabled = Rails.const_defined?(:Console) || Rails.env.test? || is_otel_disabled

  config.service_name = service_name

  if is_logging_disabled
    config.logger = Logger.new(nil)
  else
    config.logger = Logger.new($stdout, level: ENV.fetch('OTEL_LOG_LEVEL', Logger::INFO), formatter: OpenTelemetryLogFormatter.new)
  end

  next if is_otel_disabled

  # Configure custom resource
  config.resource = OpenTelemetry::SDK::Resources::Resource.create(
    'git.commit.sha' => xxx::Git.commit,
    'git.repository_url' => xxx::Git::REPOSITORY
  )

  # Rails
  config.use 'OpenTelemetry::Instrumentation::Rack'
  config.use 'OpenTelemetry::Instrumentation::Rails'
  config.use 'OpenTelemetry::Instrumentation::Rake'
  config.use 'OpenTelemetry::Instrumentation::ActiveSupport'
  config.use 'OpenTelemetry::Instrumentation::ActionPack'
  config.use 'OpenTelemetry::Instrumentation::ActiveRecord'
  config.use 'OpenTelemetry::Instrumentation::ActionView'

  # Network libraries
  config.use 'OpenTelemetry::Instrumentation::ConcurrentRuby'
  config.use 'OpenTelemetry::Instrumentation::Faraday'
  config.use 'OpenTelemetry::Instrumentation::Net::HTTP'

  # Storage
  config.use 'OpenTelemetry::Instrumentation::AwsSdk'

  # Database and queues
  config.use 'OpenTelemetry::Instrumentation::Bunny'
  config.use 'OpenTelemetry::Instrumentation::PG'
  config.use 'OpenTelemetry::Instrumentation::Redis'

  # Worker
  config.use 'OpenTelemetry::Instrumentation::Sidekiq', {
    propagation_style: :child,
  }
end

What kind of optimization could be done?
Could it be some kind of locking issue with Puma multiprocessing? (I did see a warning at startup about the OpenTelemetry thread being started in Puma.) I found #472, which could disable it, but I am not sure whether this would have a performance impact.

I see this post contains a few settings that can be tuned; would this help?
https://uptrace.dev/get/opentelemetry-ruby.html#already-using-otlp-exporter

Thanks for reporting this issue.

Do you have any more details you may be able to share with us?

Have you performed or recorded any profiles?

Is this happening with all instrumentations or do you mostly observe it with specific ones like PG?

kedare commented

I checked and I could find spans related to PostgreSQL, ActiveRecord, and Redis all having this same issue (these are most of the spans our apps generate, so it's possible that all spans are affected).
Could there be some kind of lock or semaphore that could impact the code globally, outside of the OpenTelemetry thread? It looks like something is blocking it.
Unfortunately I could not do any profiling in production; what tool would you recommend for that?
I will try to reproduce this outside of the production environment; the strange thing is that I did not face this issue during previous load tests.

This is a proper trace with a low sampling rate (and same load)
Screenshot 2023-08-02 at 21 41 30

This is an example of a trace with the 100% sampling rate
Screenshot 2023-08-02 at 21 42 19

On the DB side we don't see any potential lock that could explain this (and the same query appears normal and fast in pg_stat_statements, so it's definitely not an issue on the database side).

Also, on the OTLP collector we did not add any specific processor or configuration that would be required for high load; maybe that's something we should also look at? (I am not very familiar with whether this part has to be tuned, and how.)

kedare commented

It's not the first time I have seen strange behaviour under high load; maybe it's related: #1418

Thanks for all of the details.

All OTel data structures are thread-safe and use a mutex during creation, mutation, processing, etc.

The BatchSpanProcessor has a worker thread that uses a mutex to ensure thread-safe access to the finished span buffer.

It's possible to disable this thread, or to disable the batch span processor altogether, and see if it suffers from the same issues. That may help us narrow down the problem a bit.
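If it helps to experiment, here's a rough sketch of both options, assuming the OTLP exporter gem and the SDK configurator API (adjust to your app's setup):

require 'opentelemetry/sdk'
require 'opentelemetry-exporter-otlp'

# Option 1: keep the BatchSpanProcessor but skip starting its export worker
# thread when the processor is constructed.
ENV['OTEL_RUBY_BSP_START_THREAD_ON_BOOT'] = 'false'

# Option 2: swap in the SimpleSpanProcessor, which exports each span
# synchronously as it finishes. Debugging only -- it adds an export round trip
# to every finished span. You'd likely also want OTEL_TRACES_EXPORTER=none so
# the default batch pipeline isn't installed alongside it.
OpenTelemetry::SDK.configure do |config|
  config.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::SimpleSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new
    )
  )
end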

For continuous production profiling the only open source option I can recommend is Pyroscope, but of course we would not be able to provide support for that.

kedare commented

Thank you for your response.

Which settings would you recommend for that?
Also, what would be the behaviour if the span buffer is full and it cannot send spans as fast as they arrive? Could this block the application code if the buffer is full?
I was thinking of testing with multiple span processors to see if that fixes the issue (I think I saw an env variable that allows this)

Also, what would be the behaviour if the span buffer is full and it cannot send spans as fast as they arrive?

They get dropped.

# Adds a span to the batch. Thread-safe; may block on lock.
def on_finish(span)
  return unless span.context.trace_flags.sampled?

  lock do
    reset_on_fork
    n = spans.size + 1 - max_queue_size
    if n.positive?
      spans.shift(n)
      report_dropped_spans(n, reason: 'buffer-full')
    end
    spans << span
    @condition.signal if spans.size > batch_size
  end
end

Could this block the application code if the buffer is full?

I unfortunately have a few applications that regularly overflow the buffer when they see a sudden burst in traffic, and we have not observed this behaviour. That said, it doesn't mean it isn't possible.

I was thinking of testing with multiple span processors to see if that fixes the issue (I think I saw an env variable that allows this)

The BatchSpanProcessor is the production-ready one. There's a simple span processor in the repo that I would caution you against using in a production environment; I don't know of anyone using it in their prod env. If that's what you meant.

I wonder if you're encountering a similar situation to the one described here https://deploy.equinix.com/blog/opentelemetry-provisioning-observability-distributed-tracing/

I'll try to avoid casually suggesting things for you to "test" in your production environment, but I wonder if it is worth drastically reducing the amount of instrumentation you have enabled, starting simply with Rack, ActionPack, and Net::HTTP. The idea behind using that subset is that you get the minimal amount of instrumentation while not breaking the trace graph if this service is part of a larger constellation of apps.

From there you could slowly increase the sample rate until it either just works or breaks. This approach will reduce the number of touch points OpenTelemetry has on your service. Assuming this works, I would then add instrumentation packages back individually, one at a time.
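For illustration, the reduced configuration could look something like this (a sketch based on the config you posted above, not a drop-in replacement):

OpenTelemetry::SDK.configure do |config|
  config.service_name = service_name

  # Minimal subset: inbound HTTP (Rack/ActionPack) and outbound HTTP (Net::HTTP)
  # keep the trace graph connected without touching DB, cache, or job spans.
  config.use 'OpenTelemetry::Instrumentation::Rack'
  config.use 'OpenTelemetry::Instrumentation::ActionPack'
  config.use 'OpenTelemetry::Instrumentation::Net::HTTP'
end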

Another question, have you added a lot/any of your own custom instrumentation to the application in question?

kedare commented

The BatchSpanProcessor is the production-ready one. There's a simple span processor in the repo that I would caution you against using in a production environment; I don't know of anyone using it in their prod env. If that's what you meant.

I was more talking about tuning the OTEL_BSP_MAX_CONCURRENT_EXPORTS variable they show there, which I think is related to the BatchSpanProcessor, along with all the other OTEL_BSP_... variables. But I think this concurrent-exports one is not supported in Ruby? I could not find any trace of it in the code of this repository.

Another question, have you added a lot/any of your own custom instrumentation to the application in question?
We only have one function with custom instrumentation, with very low throughput so far; all the rest are the automatic ones.
We do, however, add various attributes to the existing automatic spans (like user id, upload id, etc.).

I wonder if you're encountering a similar situation to the one described here https://deploy.equinix.com/blog/opentelemetry-provisioning-observability-distributed-tracing/

I will take a look at it, thank you

I will retry some load tests with the various OTel integrations enabled and disabled (I will try to make them configurable via environment variables to avoid a lot of code changes every time we want to toggle something)

Also, so far we don't set anything related to issue #472; could this be the cause? (We do get a warning when the Puma master process starts, but from what I understood it should automatically take care of starting a new thread on each worker?)

I was more talking about tuning the OTEL_BSP_MAX_CONCURRENT_EXPORTS variable they show there, which I think is related to the BatchSpanProcessor, along with all the other OTEL_BSP_... variables. But I think this concurrent-exports one is not supported in Ruby? I could not find any trace of it in the code of this repository.

I haven't encountered that env var before. A quick search tells me it exists in Rust, with no results elsewhere in any of the official language implementations.

Another question, have you added a lot/any of your own custom instrumentation to the application in question?
We only have one function with custom instrumentation, with very low throughput so far; all the rest are the automatic ones.

We do, however, add various attributes to the existing automatic spans (like user id, upload id, etc.).

Since you linked the Uptrace docs in your response (I'm not familiar with them), I would encourage you not to use the approach they suggest for adding attributes in their example.

  tracer.in_span('GET /posts/:id') do |child1|
    child1.set_attribute('http.method', 'GET')
    child1.set_attribute('http.route', '/posts/:id')
    child1.set_attribute('http.url', 'http://localhost:8080/posts/123')
    child1.set_attribute('http.status_code', 200)
    child1.record_exception(ArgumentError.new('error1'))
  end

Languages like Ruby and Go, for example, make sure spans are thread-safe, and they hold a lock when setting attributes.

I would be surprised if this was the source of your problem, but regardless, this is best practice if you want to minimize the impact instrumentation has on the code it is measuring.

Here's an explanation I have saved from a while back.

# Adding attributes on span creation is the most performant way
# to add attributes in general and should be the default choice
# when possible.
def a(arg)
  Tracer.in_span("a", attributes: { "arg" => arg }) {}
end 

# Sometimes we want to add an attribute to a span but
# we don't have it when we're creating the span.
# The span method `set_attribute` allows you set an attribute
# on an existing span.
# 
# Warning: As span mutation is thread-safe, setting an attribute
# on an existing span will hold a mutex lock.
def a
  Tracer.in_span("a") do |span|
    result = something_interesting
    span.set_attribute("result", result)
  end
end

# alternate form
def a
  Tracer.in_span("a") do |span|
    result = something_interesting
    span["result"] = result
  end
end

# Really bad
# Never do this as you're hitting a mutex lock for each call to set_attribute
def a
  Tracer.in_span("a") do |span|
    result = something_interesting
    span.set_attribute("a", result.a)
    span.set_attribute("b", result.b)
    span.set_attribute("c", result.c)
    span.set_attribute("d", result.d)
  end
end

# alternate form of really bad
def a
  Tracer.in_span("a") do |span|
    result = something_interesting
    span["a"] = result.a
    span["b"] = result.b
    span["c"] = result.c
    span["d"] = result.d
  end
end

# If we have a set of attributes only available after span creation.
# We're still hitting the mutex lock, but only once.
def a
  Tracer.in_span("a") do |span|
    result = something_interesting
    span.add_attributes({
      "a" => result.a,
      "b" => result.b,
      "c" => result.c,
      "d" => result.d,
    })
  end
end

I will retry some load tests with the various OTel integrations enabled and disabled (I will try to make them configurable via environment variables to avoid a lot of code changes every time we want to toggle something)

Disabling by env var is supported by default. See -> https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/base/lib/opentelemetry/instrumentation/base.rb#L313-L327

Here's an example using Active Record.

ENV["OTEL_RUBY_INSTRUMENTATION_ACTIVERECORD_ENABLED"] = "false"

Also, so far we don't set anything related to issue #472; could this be the cause? (We do get a warning when the Puma master process starts, but from what I understood it should automatically take care of starting a new thread on each worker?)

I'm using the default configuration here for all of our applications without issue. I think more investigation needs to be done before we can point at one particular piece of code being responsible for the issue.

kedare commented

Thanks for the info.
We usually only put one attribute on existing spans (sometimes two, never more), so it's unlikely to be related to this.

I will continue to investigate; so far it looks like I get better performance with these settings than with the defaults:

OTEL_BSP_MAX_EXPORT_BATCH_SIZE=2048
OTEL_BSP_MAX_QUEUE_SIZE=16384
OTEL_BSP_SCHEDULE_DELAY=2000
OTEL_RUBY_BSP_START_THREAD_ON_BOOT=false

But I have not yet tested this in production, where the biggest impact was.
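For reference, my understanding is that these env vars map onto the BatchSpanProcessor constructor arguments, so the equivalent explicit setup would look roughly like this (keyword names and units worth double-checking against the SDK source):

require 'opentelemetry/sdk'
require 'opentelemetry-exporter-otlp'

# Rough equivalent of the env vars above, built explicitly. Note: schedule_delay
# here appears to be in seconds, while OTEL_BSP_SCHEDULE_DELAY is in milliseconds.
processor = OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
  OpenTelemetry::Exporter::OTLP::Exporter.new,
  max_queue_size: 16_384,
  max_export_batch_size: 2_048,
  schedule_delay: 2,
  start_thread_on_boot: false
)

# e.g. inside OpenTelemetry::SDK.configure { |c| c.add_span_processor(processor) },
# with OTEL_TRACES_EXPORTER=none so the default pipeline isn't installed twice.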

With thread boot set to false you'll need to flush the span buffer yourself using the tracer provider's force_flush.

This is best done using either Rack's after_response or the Rack Events API.

For background workers you'll want to configure the options to force Active Job to do the same.
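A minimal sketch of the request-side flush, using a hypothetical Rack middleware (names are illustrative; the Active Job side depends on the options your instrumentation version exposes):

# Hypothetical middleware: flush buffered spans once the request has been handled.
# Note: force_flush is synchronous, so this adds export time to the request;
# deferring it to an after-response hook, as suggested above, avoids that.
class FlushOtelSpans
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    OpenTelemetry.tracer_provider.force_flush
  end
end

# e.g. in config/application.rb:
#   config.middleware.use FlushOtelSpans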

kedare commented

I think setting the thread boot option to false just prevents it from starting on the initial Puma master, no?
I could still get spans properly after setting it to false.

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.