configcat/node-sdk

Add an `onConfigDownloadFailed` option to the options object

jdgarcia opened this issue · 10 comments

Is your feature request related to a problem? Please describe.

We're using this in our product, but we wanted to have a backup plan in case the ConfigCat service was down. We would like a way to be able to call a function in the event the config download fails.

Describe the solution you'd like

We'd like it if the options supported an onConfigDownloadFailed option, which would be a callback that gets called if the config download fails.

Describe alternatives you've considered

We do have a workaround but it's a bit "hacky". We put a wrapper around the logger's error function, and are checking the error message to see if it contains the string Failed to download feature flags & settings from ConfigCat., which comes from this line in the config-fetcher.

Additional context

I wasn't sure it it made more sense for the option to exist in the configcat/common-js package but I realized the download/error message were happening in this SDK, not the common library.

Hi @jdgarcia ,

Thank you for the report.

The library has a built-in fallback mechanism, you can pass the default parameter to the getValue function and this will return if some error occurs.

Can you explain your application and usage? I can not see why do you need the event based solution.

Thanks,
Endre

Hi, I'm one of @jdgarcia's teammates. I'll outline some of our thinking here:

  • We'd love to know when this problem is happening, so we can check configcat status, look for problems in our configuration that could prevent the machine from reaching the configcat API, etc. That means we need to be able to run code so we can log/alarm. We could make all default values some kind of special token that triggers logging in the callback, but that's kind of burdensome and requires us to do it everywhere we ever get any value.
  • It's also not clear that getValue is the proper place to look for network errors, since we're trying to get values from an internal model that has (hopefully) already been fetched from the API. If getValue is fetching for each call instead of using the internal model, this isn't viable since any rate limiting on configcat's part will probably quickly shut us down.
  • Finally, we'd like to be able to log that fetching from the API has started failing during auto-fetches, even though there's no immediate problem because our instance already has an internal model. This simply can't be done via getValue.

Hi @Mr-Wallet,

The getValue method usually won't trigger any http fetch. Except if you use the configcat SDK in lazyloading mode with low TTL value it can generate more fetch operations.

The main reason that the getValue has a default return value is if the client can not read the configs from the cache or the cache is empty the default value will serve.

If you would like to log the failures you can use the logger as your teammates mentioned.

I put this feature request into our backlog and I will discuss with the team.

Thanks,
Endre

Just to be clear, we are using an Auto-poll client, and the feature request is to be able to run a callback when autopoll fails, both during start-up and during periodic polling. That is why I am skeptical that getValue is the appropriate thing to focus on. I want to make sure that this is absolutely clear for the purpose of your backlog.

For the time being the best we can do is provide a logger which tries to parse the error as a string, looking for evidence of the particular kind of error. However string parsing to sniff for the error type is not a fantastic strategy as it is prone to breaking unexpectedly if this package ever decides to reword an error message, since the exact contents of the message is not part of the fixed API.

Hi @Mr-Wallet and @jdgarcia,

We had a discussion with the team this morning and we all agree that your feature request could add value to the SDK.

Since this request is the only one so far, I'd like to be sure I fully understand your situation. I'm more interested in high level business perspective your use case.
Can you tell me a bit more about your application? What it does? Why is important to get notified about the connection issues? What is this backup plan you mentioned earlier? What happens when this plan is triggered?

Thanks ;)

The use case is a desktop client with an API connection. The client periodically communicates with our API in order to check licensing/features, and has thousands of users checking in per hour. We're looking to use those check-ins to gradually roll out features or provide specific customers with previews/betas before releasing to the wider userbase; ConfigCat seems like a good fit for this.

Our API constructs an auto-poll client, and polls every minute. We strongly value high availability so we want to build a defense-in-depth on this system to make sure that nobody loses features, especially not ones they're paying for. So if ConfigCat isn't responding, we'd like the opportunity to detect that, and we'd like to know why. (Is ConfigCat down? Did something go wrong with a VPN tunnel? When we restarted the server, did it fail to load the ConfigCat credentials? Did the load balancer develop A.I. and decided it would be funny to route the responses to a different machine?) We want the opportunity to trigger alert systems in the event that the config fails to fetch, beyond simply monitoring the global console errors. It seems fairly self-evident that when a system goes down, we want to know about it, and soon. If the problem were in our configuration and not on ConfigCat's end, then we might merrily modify user features from e.g. the ConfigCat website, blissfully unaware that these changes were not propagating to users in a timely manner.

OK, so if we have a problem, do things keep running until we solve it? Well, the auto-polling client simply uses the previous config, so we're covered there. But what if the problem starts because of (or persists through) rebooting the API? Then the client has nothing in cache. Our solution is to store a copy of the ConfigCat features JSON in our database on every successful fetch, and if we fail to fetch from ConfigCat on startup, inject the backup into the client by setting the cache ourselves (basically this). So reason #2 we want to run arbitrary code when the fetch fails: if it fails and there's no cached config in memory, we want to run the fallback to our database version of the features. I understand that mucking around with the client's private cache might not be a "supported" use case, but we could just as easily have chosen something like, "don't ask the ConfigCat client, and just give every user this default set of features." Being able to detect such a failure at start-up, and then run application-specific fallback code, seems like a broadly-desirable feature regardless of the exact fallback.

Hey,

I just want to share some details about your last paragraph. All of our SDKs are supporting a custom cache mechanism. The default is an in-memory cache (https://github.com/configcat/common-js/blob/master/src/Cache.ts), but you can pass in e.g. a filesystem+in-memory cache combination anytime. Or it can use your database too, it totally depends on you.
I think this could be a great help to you to achieve more availability.
Here is a sample app: https://github.com/configcat/node-sdk/tree/master/samples/customcache
And here is the documentation part for it: https://configcat.com/docs/sdk-reference/node#using-custom-cache-implementation

Cheers,

@Mr-Wallet thank you for the detailed explanation, now I understand why do you need to be extra sure if the ConfigCat service is up.
At this moment I see 3 options:

  1. The solution mentioned above by @laliconfigcat might be good enough.
  2. Sine this feature adds value to the SDKs and benefit other customers, we can put it on our roadmap and implement it in all SDKs. This takes some time, but we will do it anyway with the next improvement cycle.
  3. For the short term, you could open a PR only for the Node SDK. We will take it as a priority and can review quickly and release a new version.

What do you guys think?

Thanks for the tip @laliconfigcat, we will investigate that option. Regardless of how we have the fallback implemented, we are okay for the time being, so we don't require any rush jobs. We just thought that it was a little brittle and awkward to have to basically regex all the error messages coming out of logger in order to detect when an autofetch fails so that we can run arbitrary code in response.

I suppose the real feature request here is better granularity on different classes of errors, since the exact text of human-readable error messages are "undocumented API" and have a tendency to change without warning. And also being forced to shim non-logging stuff into the logger option feels inappropriate as well.

All right. Added your feature request to the next batch of SDK improvements.