getsentry/sentry-elixir

Selective reporting on Oban errors

Rodeoclash opened this issue · 17 comments

Currently, failing Oban jobs will always be reported to Sentry; that is, any job that raises an exception or returns an {:error, "reason"} tuple.

In some circumstances we have jobs that should fail with expected errors. For example, the delivery of our webhooks can sometimes fail with an error like %Req.TransportError{reason: :econnrefused}, which indicates that the recipient of the webhook has misconfigured their handler. In this scenario, we don't want to alert Sentry, but we do want the job to fail (and eventually retry, as the user will be notified of the failing webhook and can then fix the problem preventing delivery).

I'm not quite sure what the solution to this should look like, it could be something like:

  1. Error reporting on Oban jobs can be selectively applied.
  2. Error reporting can ignore certain error responses from jobs (this is my preference, as I would still like to know about other unexpected errors!). I imagine this taking the form of a list of error tuples that, when returned from a job, don't trigger error reporting.

I think this can currently be worked around by using a combination of the snooze response and manually tracking the attempts in the job, but that would be a workaround; the ability to selectively report errors somehow would solve this properly.

@Rodeoclash thanks for the feature suggestion!

One approach you can always take is to use an event filter, which lets you configure a global filter for events reported to Sentry, with any custom logic you desire.
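A minimal sketch of wiring such a filter through the before_send callback (the module and function names below are placeholders, and depending on your SDK version the option may be called before_send_event instead of before_send):

# config/config.exs
config :sentry,
  # Called with every event before it is sent; returning nil drops the event.
  before_send: {MyApp.SentryFilter, :before_send}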

@sorentwo pinging you here though in case you have good ideas on how to approach this on the Oban side?

A colleague pointed me to the event_filter feature after I raised this; I'll dig into that before I bother you again 🙂

Nah, no worries. I think for Oban we might do something more ergonomic, because I also encountered the same "annoyance" of returning {:error, ...} and not wanting it to be reported.

There's an old recipe that describes a way to do this using telemetry and protocols. However, that's not useful when Sentry is doing the reporting automatically.

One thing you can consider is that all error tuples are wrapped in an Oban.PerformError. If you filter out that particular error, you'd effectively silence failures from manual errors while retaining real exceptions.
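Roughly, a sketch of that filter using the SDK's before_send hook (assuming the reported event carries the wrapped exception under original_exception, as in the example further down in this thread):

defmodule MyApp.SentryFilter do
  @moduledoc false

  # Drop events for {:error, _} returns, which Oban wraps in Oban.PerformError,
  # while keeping crashes (Oban.CrashError) and every other exception.
  def before_send(%{original_exception: %Oban.PerformError{}}), do: nil
  def before_send(event), do: event
end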

@sorentwo oh, if the job crashes (exit/throw/raise) isn't that wrapped as an Oban.PerformError as well?

oh, if the job crashes (exit/throw/raise) isn't that wrapped as an Oban.PerformError as well?

No, that's wrapped in Oban.CrashError instead.

Aaaaah ok, then yeah, we don't need to do anything here. I'll improve the docs before closing this. Thanks Parker 💌

Hi, your solution helps with filtering errors, but what about attempts? For example, we make plenty of third-party API calls; a lot of them fail on the first attempt but succeed on the second (or a later) attempt. Is it possible to configure it so an error is captured only when the attempt number equals the max attempts?

I made a PR for it: #759

@sorentwo before merging the attempt in #759, is there a chance that maybe we'd want to include the job or the job's meta within Oban.PerformError?

is there a chance that maybe we'd want to include the job or the job's meta within Oban.PerformError?

We could, for more context on the exception.

I'm wondering what that would gain for selective reporting, since the exception is always bundled with telemetry meta that includes the job.
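As a sketch of what using that telemetry meta directly could look like (assuming you attach your own handler to Oban's [:oban, :job, :exception] event instead of relying on the automatic integration; the handler id and module name are placeholders, and here errors are reported only once the job has exhausted its attempts):

defmodule MyApp.ObanErrorReporter do
  @moduledoc false

  def attach do
    :telemetry.attach(
      "my-app-oban-errors",
      [:oban, :job, :exception],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, _measurements, %{job: job, reason: reason, stacktrace: stacktrace}, _config) do
    # Only report when this was the job's final attempt. As discussed above,
    # error tuples and crashes arrive here already wrapped in Oban.PerformError
    # or Oban.CrashError, so the reason can be handed straight to Sentry.
    if is_exception(reason) and job.attempt >= job.max_attempts do
      Sentry.capture_exception(reason,
        stacktrace: stacktrace,
        extra: %{worker: job.worker, attempt: job.attempt, max_attempts: job.max_attempts}
      )
    end
  end
end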

For what it's worth, this is what I ended up with:

defmodule SuperApi.Sentry do
  @moduledoc false

  # Filter webhook delivery worker errors before send
  def before_send(
        %{
          original_exception: %Oban.PerformError{reason: {:error, reason}},
          extra: %{worker: "SuperApi.Webhooks.DeliverWorker"}
        } = event
      ) do
    case reason do
      # We received an error from the partners web server for this webhook.
      :delivery_error -> nil
      # Handle common web request errors. This can occur from partners that
      # mistype a webhook destination, haven't configured it correctly or that
      # it has gone down temporarily.
      %Req.TransportError{reason: :timeout} -> nil
      %Req.TransportError{reason: :nxdomain} -> nil
      %Req.TransportError{reason: :econnrefused} -> nil
      # Let all other webhook delivery errors bubble up to Sentry.
      _ -> event
    end
  end

  def before_send(event), do: event
end

The main thing I didn't like is that I have to pattern match on the string under extra.worker to scope the filtering to the job itself. Being able to pattern match on the module itself would be optimal (and if that were available under %Oban.PerformError{}, even better).
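One possible workaround sketch, assuming the worker module is already loaded (true for a worker that just executed): convert the string back into the module before comparing.

# "SuperApi.Webhooks.DeliverWorker" -> SuperApi.Webhooks.DeliverWorker
worker_module = String.to_existing_atom("Elixir." <> worker)
worker_module == SuperApi.Webhooks.DeliverWorker
#=> true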

@sorentwo makes a good point though: all this data is already available in the Telemetry event that Sentry itself consumes. So this is not Oban's responsibility, I think; it's Sentry's. I think we might go with options to control reporting, like #759's approach. I'll take a look at that PR again 🙃

My biggest problems with my PR #759 are:

  • it's configured based on the number of attempts. There may be more complex logic to consider, and using the job state instead of raw numbers is worth considering, because Oban will change the state to discarded when attempt == max_attempts (and possibly via other logic in the future)
  • it's configured for the whole app, not per worker. I can imagine developers wanting to use this feature at the worker level instead of the app level.
  %Req.TransportError{reason: :timeout} -> nil
  %Req.TransportError{reason: :nxdomain} -> nil
  %Req.TransportError{reason: :econnrefused} -> nil
  # Let all other webhook delivery errors bubble up to Sentry.
  _ -> event

I have a really big problem with this kind of solution. It works, but what happens when your network/DNS/something crashes and you are not able to call external domains/APIs at all? Sentry will never let you know about that error. Even when a third-party service has an outage, you should be notified about it. That's the reason multiple attempts exist, and why max_attempts exists as well. That's why I implemented the PR to focus on those numbers.

I have a really big problem with this kind of solution. It works, but what happens when your network/DNS/something crashes and you are not able to call external domains/APIs at all?

In this case it's the delivery of a webhook from our server to a partner (who often misconfigures their system; we also alert them via email on failed delivery), but I agree in principle with #759 - it would suit us when we're calling flaky APIs that we expect occasional errors from, which aren't critical to know about unless they're ongoing.

Another example of where it would be nice to have this at the worker level, I think.