launchdarkly/erlang-server-sdk

Long variation fetch times

zbarnes757 opened this issue · 4 comments

Describe the bug

We are seeing excessively long variation fetch times in our production environment. We are using Redis as our backend. In the attached span, you can see that every now and then, it takes >3s to do a variation fetch. This span is only for :ldclient.variation(key, context, fallback). Is there something during this fetch that could be hanging for this long? It is intermittent and we have not seen a pattern as to why this is happening.

Logs

There are no logs surrounding the event to indicate an internal issue.

SDK version
3.0.3

Language version, developer tools
erlang 26.1.2
elixir 1.15.7-otp-26

Additional context

Screenshot 2023-11-21 at 1 37 17 PM

When you use Redis as a persistent store, the SDK operates as a read-through cache with a TTL in front of Redis.

This means that each time the TTL expires, the SDK needs to fetch the data from Redis again. The only blocking operation in this procedure is the fetch from Redis.

Depending on the architecture you are using, and your reasons for using Redis, this may or may not be avoidable currently.

The SDK generally performs best when it is used without a backing store. In that case it receives flag updates, keeps them in memory, and always evaluates from memory. There are two possible downsides. First, during initialization the variation methods return default values, whereas with Redis the values are fetched from the store (which may or may not take less time than just initializing). Second, if LD is unreachable when an SDK instance is initialized, that instance cannot get the flag configuration.

There are two ways someone may be using a store: either the SDK is still connecting to LD and also writing updates to that store, or the store is populated by the Relay Proxy.

If you are connecting directly, then you can increase the TTL used for the store. You will still get fresh values because the SDK is connected and receiving updates. On each update it refreshes the cache first and then writes to the Redis storage.

If you are using the Relay Proxy to update the flag configuration in Redis, then increasing the cache TTL will result in less fresh values, but possibly fewer slow responses.
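
As a rough illustration of the TTL trade-off described above, here is a minimal configuration sketch. The option names are the ones used in the configuration posted later in this thread; the host, port, and the value 300 are hypothetical, and it assumes `cache_ttl` is expressed in seconds, which is worth checking against the SDK documentation.

```elixir
# Sketch only: hypothetical host, port, and TTL value.
options = %{
  feature_store: :ldclient_storage_redis,
  redis_host: ~c"localhost",
  redis_port: 6379,
  # Assumed to be in seconds. A longer TTL means fewer blocking reads from Redis.
  # With a direct connection to LD, values stay fresh because updates also refresh the cache.
  # With a Relay Proxy populating Redis, values can be stale for up to this long.
  cache_ttl: 300
}
```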

Are you able to use latency monitoring on your Redis instance?

Thank you,
Ryan

Hi @kinyoklion,

I'm sorry for not getting back to you sooner. We are still getting Redis latency monitoring set up for this service. To confirm, this is the correct configuration for using the Redis backend with an infinite TTL, right?

    redis_tls_options = [
      cacertfile: ~c"/etc/ssl/certs/ca-certificates.crt",
      verify: :verify_none,
      customize_hostname_check: [match_fun: :public_key.pkix_verify_hostname_match_fun(:https)]
    ]

    options =
      Map.merge(@default_client_opts, %{
        redis_host: String.to_charlist(redis_host),
        redis_port: redis_port,
        redis_tls: redis_tls_options,
        # infinite TTL for feature flags to avoid timeout issues when refreshing cache
        cache_ttl: -1,
        feature_store: :ldclient_storage_redis,
        http_options: %{
          tls_options: :ldclient_config.tls_basic_options()
        }
      })

    :ldclient.start_instance(ld_sdk_key, options)

@zbarnes757

That does look correct.

Do you know whether you are requesting flags that do not exist? The SDK cannot tell that a flag doesn't exist, only that it doesn't have it in the cache, so it would still reach out to Redis in that scenario.
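
One way to check this might be to time each evaluation and log the slow ones together with the flag key, so that slow fetches can be matched against keys that may not exist. The sketch below is only an illustration: the `SlowFlagProbe` module, its `timed/3` function, and the 100 ms threshold are hypothetical names and values, not part of the SDK.

```elixir
defmodule SlowFlagProbe do
  require Logger

  # Hypothetical threshold; tune it to what counts as "slow" for your service.
  @threshold_ms 100

  # Runs `fun`, returns {result, elapsed_ms}, and logs a warning when the
  # call took at least `threshold_ms` milliseconds.
  def timed(label, fun, threshold_ms \\ @threshold_ms) when is_function(fun, 0) do
    {micros, result} = :timer.tc(fun)
    elapsed_ms = div(micros, 1000)

    if elapsed_ms >= threshold_ms do
      Logger.warning("slow flag evaluation: #{inspect(label)} took #{elapsed_ms}ms")
    end

    {result, elapsed_ms}
  end
end

# Usage with the call from the original report (key, context, fallback as in your code):
# {value, _ms} =
#   SlowFlagProbe.timed(key, fn -> :ldclient.variation(key, context, fallback) end)
```

If the logged keys turn out to be flags that are missing from the environment, that would point at cache misses going to Redis rather than at TTL expiry.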

Thank you,
Ryan

This issue is marked as stale because it has been open for 30 days without activity. Remove the stale label or comment, or this will be closed in 7 days.