spandex-project/spandex

Performance problems with Optimal dependency

amandasposito opened this issue · 18 comments

I ran a profiler on my app using the eprof and fprof to identify bottlenecks to trace the execution of all functions in the code and report the time consumed with each.

Looking at the results 25% of the time was spent on Optimal, which is a lib responsible to validating code.

So I set up a benchmark script using Benchee to test the original code with the Optimal dependency and another one that overrides it with a simpler implementation.

The result was about 10% faster over the whole request without the Optimal dependency.

https://hexdocs.pm/mix/1.8.1/Mix.Tasks.Profile.Eprof.html
https://hexdocs.pm/mix/1.8.1/Mix.Tasks.Profile.Fprof.html

Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
Number of Available Cores: 3
Available memory: 5.82 GB
Elixir 1.8.1
Erlang 21.3.7
Benchmark suite executing with the following configuration:
warmup: 0 ns
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 15 s
Benchmarking 0_warmup...
Benchmarking 1_normal...
Benchmarking 2_optimal_overridden...
warning: redefining module Optaimal (current version loaded from _build/test/lib/optimal/ebin/Elixir.Optimal.beam)
  priv/myapp_web/script_optimal.exs:23
Name                           ips        average  deviation         median         99th %
2_optimal_overridden        270.69        3.69 ms    ±33.85%        3.26 ms        7.91 ms
1_normal                    248.06        4.03 ms    ±35.16%        3.58 ms        9.72 ms
0_warmup                    129.54        7.72 ms  ±1258.76%        3.45 ms       11.21 ms
Comparison:
2_optimal_overridden        270.69
1_normal                    248.06 - 1.09x slower +0.34 ms
0_warmup                    129.54 - 2.09x slower +4.03 ms

That is really useful info! I’d be happy to accept a PR that removed the optimal dependency.

Yeah I was surprised that the performance impact was so significant! My only concern would be that removing it would be a breaking change unless we somehow keep the behavior the same. In practice, it's probably fine to remove it or at least make it disable-able, because for most use-cases, people will only be sending valid spans once they figure out the right things to send.

If we removed Optimal, things might silently fail or fail in a different place, but at the end of the day, it would've crashed anyway.

I think an alternative would be to fork optimal (I actually wrote it originally, but its in my old employer's name) and just fix the performance problem.

They'd probably be amenable to a PR, also.

If someone makes a PR I'll gladly take a look; Hoping to take a glance at it if y'all don't get to it first!

We should just switch from optimal to nimble options. Anyone who wants to take a crack at this is welcome!

@zachdaniel I'm interested in picking this up! I've been using Spandex for over a year now and want to contribute back 😁

Awesome! No one else has claimed it, so go for it 💪

Hey @connorlay, need any help? We're seeing considerable performance overhead when using Spandex at Fresha - I'd like to factor out optimal but since it's a considerable amount of work I didn't want to duplicate it in case you've already started.

Hi @kamilkowalski! Feel free to take this over. I never made much progress on this after realizing the size of the refactor.

The best strategy here is likely just to use nimble options. We'll just need to rewrite the schemas to their syntax.

I'm also available / willing / interested in helping to make this migration happen. I think long-term we will likely just migrate away from Spandex in favor of OpenTelemetry, but given Datadog's timeline on supporting OpenTelemetry and the amount of effort it'll take to switch away from Spandex, this is a worthwhile investment in the meantime.

Well the API for NimbleOptions and Optimal are almost exactly the same, so I honestly don't think it would be too difficult to do. I'm too busy at the moment though unfortunately.

I looked at it briefly and I don't think it'll be that bad, it's just a lot of tedious verification that things are still doing what we want, and also a technically-breaking change that I doubt anyone cares about in practice: our API currently returns Optimal validation errors, so probably the better API design would be to either return some well-known set of Spandex validation errors, or just change the typespecs to be less-specific about what kind of error struct you might get, and just switch it to NimbleOptions instead of Optimal, with or without a major version bump to signal that there's a breaking change there, depending on what we find in testing/verification that things are working.

Another possible path could be to simply make Optimal be optional: if you have it in your app's deps, then options are validated and if not, then you don't and you just crash or something if you send an incorrect type. This would be the "high performance" production mode of operation after you've already validated that you're sending the right thing via dev/test.

The only issue with that would be default values/other logic around values. It doesn't just validate, it can also transform.

Won't it be better for performance to not do validation? For functions where low latency is required people would probably prefer performace over ease of development.

Yeah, I think it would be better for sure, at least to have it only validate in development mode or something like that. I'm just saying we'll have to manually write the code that isn't validation (that sets defaults for example).