Gracefully closing unleash and redis
Describe the bug
We are using unleash-client-node and ioredis with NestJS.
Unleash is configured with initialize for two reasons:
- To allow our application to start up faster
- To fall back to our Redis in case of error
import { initialize } from 'unleash-client';

const unleash = initialize({
  url,
  appName,
  customHeaders: {
    Authorization: apiKey,
  },
  storageProvider,
})
  .on('ready', () => logger.log('ready'))
  .on('synchronized', () => logger.log('synchronized'))
  .on('warn', (msg) => logger.warn(msg))
  .on('error', (error) => logger.error(error));
Our StorageProvider is implemented like this:
import { Logger } from '@nestjs/common';
import { Cluster, Redis } from 'ioredis';

export class UnleashRedisStorage {
  private readonly logger = new Logger('CustomRedisStore');

  constructor(private readonly redis: Cluster | Redis) {}

  async set(key: string, data: any) {
    await this.redis.set(this.buildKey(key), JSON.stringify(data));
  }

  async get(key: string) {
    const data = await this.redis.get(this.buildKey(key));
    return data ? JSON.parse(data) : data;
  }

  private buildKey(key: string) {
    return `unleash:${key}`;
  }
}
And our Redis client (cluster or standalone) is configured as follows:
import { Logger } from '@nestjs/common';
import Redis, { Cluster } from 'ioredis';

export class RedisFactory {
  static async create(options: {
    clusterMode: boolean;
    host: string;
    port: number;
    namespace?: string;
    tls: boolean;
    timeout?: number;
    db?: number;
  }): Promise<Redis | Cluster> {
    const logger = new Logger('Redis');
    const configuredTimeout = options.timeout ?? 10000;
    let client: Redis | Cluster;
    const tls = options.tls ? {} : undefined;

    if (options.clusterMode) {
      client = new Redis.Cluster([{ host: options.host, port: options.port }], {
        keyPrefix: options.namespace && `${options.namespace}:`,
        enableOfflineQueue: false,
        enableReadyCheck: true,
        dnsLookup: (address, callback) => callback(null, address),
        redisOptions: {
          tls,
        },
        clusterRetryStrategy: (times) => (times <= 10 ? 300 : undefined),
      });
    } else {
      client = new Redis(options.port, options.host, {
        db: options.db ?? 0,
        keyPrefix: options.namespace && `${options.namespace}:`,
        enableOfflineQueue: false,
        tls,
        retryStrategy: (times) => (times <= 10 ? 300 : undefined),
      });
    }

    client.on('error', (error) =>
      logger.error(
        { error: error.message },
        error.stack,
        'Error on redis client',
      ),
    );

    return await new Promise((resolve, reject) => {
      const t = setTimeout(() => {
        logger.error(
          `Redis connection is taking more than ${configuredTimeout}ms to be ready`,
        );
        reject(new Error('redis not available'));
      }, configuredTimeout);
      client.once('ready', () => {
        logger.log('Redis connection is ready');
        clearTimeout(t);
        resolve(client);
      });
    });
  }
}
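For context, the storageProvider handed to initialize above is built from these two classes, roughly like this (a sketch; the concrete option values are placeholders, not our real configuration):

// Sketch of how the Redis-backed storage provider passed to initialize() is constructed.
// Option values below are placeholders for illustration only.
const redis = await RedisFactory.create({
  clusterMode: true,
  host: 'redis.example.internal',
  port: 6379,
  namespace: 'my-service',
  tls: true,
});

const storageProvider = new UnleashRedisStorage(redis);
// `storageProvider` is then handed to the initialize({ ..., storageProvider }) call shown earlier.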
When shutting down the application, using NestJS's built-in hooks, we do this:
await this.unleash.destroyWithFlush();
// On UnleashRedisStorage
this.redis.disconnect(false);
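Spelled out as a NestJS hook, the shutdown path looks roughly like this (a sketch; the class and property names are assumptions, only the two calls above come from our real code):

import { OnApplicationShutdown } from '@nestjs/common';
import { Unleash } from 'unleash-client';
import { Cluster, Redis } from 'ioredis';

// Sketch of the shutdown hook; FeatureFlagService and its properties are assumptions.
export class FeatureFlagService implements OnApplicationShutdown {
  constructor(
    private readonly unleash: Unleash,
    private readonly redis: Cluster | Redis,
  ) {}

  async onApplicationShutdown(signal?: string) {
    // Flush metrics and stop the SDK first...
    await this.unleash.destroyWithFlush();
    // ...then drop the Redis connection (false = do not reconnect).
    this.redis.disconnect(false);
  }
}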
This seems to trigger a race condition where set is called on UnleashRedisStorage right after Redis has disconnected. Logs attached below.
Is there any way to avoid this race condition? How can I know that Unleash has finished sending all commands before closing?
Steps to reproduce the bug
No response
Expected behavior
No response
Logs, error output, etc.
{"level":"error","time":"2024-11-05T13:30:46.536Z","pid":84462,"hostname":"MacBook-Pro-de-Matias.local","name":"dist","context":"CustomRedisStore","data":{"key":"default","data":{"version":2,"features":[{"strategies":[{"name":"flexibleRollout","constraints":[],"parameters":{"groupId":"sfp-arg-new-cohorts-test","rollout":"25","stickiness":"random"}}],"impressionData":true,"enabled":true,"name":"sfp-arg-new-cohorts-test","description":"Rollout test","project":"default","stale":false,"type":"experiment","variants":[]},{"strategies":[{"name":"default","constraints":[],"parameters":{}}],"impressionData":false,"enabled":false,"name":"nuvem-pago-br-geo-create-account","description":"Nuvem Pago Brasil create-account feature toggle","project":"default","stale":false,"type":"kill-switch","variants":[]},{"strategies":[{"name":"userWithId","constraints":[],"parameters":...,"message":"Error while executing set","error":{}}
/Users/matias/projects/services-nuvem-pago-argentina/node_modules/ioredis/built/Redis.js:332
command.reject(new Error(utils_1.CONNECTION_CLOSED_ERROR_MSG));
^
Error: Connection is closed.
at EventEmitter.sendCommand (/Users/matias/projects/services-nuvem-pago-argentina/node_modules/ioredis/built/Redis.js:332:28)
at EventEmitter.set (/Users/matias/projects/services-nuvem-pago-argentina/node_modules/ioredis/built/utils/Commander.js:90:25)
at UnleashRedisStorage.set (/Users/matias/projects/services-nuvem-pago-argentina/src/shared/feature-flag/unleash/infra/unleash-redis-storage.ts:23:22)
at UnleashRedisStorage.descriptor.value (/Users/matias/projects/services-nuvem-pago-argentina/src/shared/utils/log-and-throw/log-and-throw.ts:29:28)
at Repository.save (/Users/matias/projects/services-nuvem-pago-argentina/node_modules/unleash-client/src/repository/index.ts:202:32)
at Repository.fetch (/Users/matias/projects/services-nuvem-pago-argentina/node_modules/unleash-client/src/repository/index.ts:286:22)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async Promise.all (index 0)
at Repository.start (/Users/matias/projects/services-nuvem-pago-argentina/node_modules/unleash-client/src/repository/index.ts:150:5)
at async Promise.all (index 0)
Screenshots
No response
Additional context
No response
Unleash version
No response
Subscription type
None
Hosting type
None
SDK information (language and version)
3.14.1, although I upgraded to the latest version and it's the same.
Hi @matiasgarcia, thanks for your report. Since you're awaiting destroyWithFlush, it should already have finished by the time control returns to your function, so calling redis.disconnect at that point should be safe.
Our first impression is that there might be a missing await in our destroyWithFlush.
I'll double check our client code to make sure and make a PR if it needs to be fixed.
Hmm, I'm not seeing what I thought I would see.
destroyWithFlush in unleash.ts on main:
flushMetrics(): Promise<void> {
  return this.metrics.sendMetrics();
}

async destroyWithFlush(): Promise<void> {
  await this.flushMetrics();
  this.destroy();
}
which in turn calls the following destroy() method:
destroy() {
  this.repository.stop();
  this.metrics.stop();
  Unleash.instance = undefined;
  Unleash.configSignature = undefined;
  Unleash.instanceCount--;
}
The call to this.repository.stop()
stop() {
  this.stopped = true;
  if (this.timer) {
    clearTimeout(this.timer);
  }
  this.removeAllListeners();
}
should clear any timers, and therefore prevent any further calls to save/set on the storage implementation.
Do you have a minimal reproducible example you could share?
I added a repo to reproduce this issue: https://github.com/matiasgarcia/unleash-redis-nestjs. Unfortunately I am struggling to reproduce it there, because Unleash and Redis run in my local environment with an empty state and there is no latency.
In production, I noticed that the "synchronized" event only fires after the shutdown process has started, so calling unleash.destroyWithFlush() won't clean up the event listener that calls storageProvider.set(...), making it crash when the storage is backed by a database.
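As a stopgap on our side (not a fix in unleash-client), one thing we could do is make the storage provider drop writes once the Redis connection is no longer ready, using the status property ioredis exposes. A sketch of such a guarded set():

// Sketch of a defensive set() for UnleashRedisStorage: skip the write once the
// Redis connection is no longer ready. This only hides the symptom; the SDK
// still attempts the write after destroy.
async set(key: string, data: any) {
  if (this.redis.status !== 'ready') {
    this.logger.warn(`Skipping unleash write for "${key}", redis status is "${this.redis.status}"`);
    return;
  }
  await this.redis.set(this.buildKey(key), JSON.stringify(data));
}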
@chriswk I played around with this a bit more to understand why it happens in our production environment and why I cannot reproduce it locally.
What I realized is that two timings determine whether the race condition occurs:
- the get() request to the server: https://github.com/Unleash/unleash-client-node/blob/main/src/repository/index.ts#L346-L355
- the job time: https://github.com/matiasgarcia/unleash-redis-nestjs/blob/main/src/job.ts#L9 (simulated with sleep(200))
In my local environment, the get() request takes about 40ms to resolve; in production, it takes approximately 1s.
If get() time > job time when the shutdown process is initiated, there is a chance that the process will crash despite calling unleash.destroyWithFlush().
If get() time < job time when the shutdown process is initiated, it finishes successfully.
Error trace:
/Users/matias/projects/unleash-redis-race/node_modules/ioredis/built/Redis.js:332
command.reject(new Error(utils_1.CONNECTION_CLOSED_ERROR_MSG));
^
Error: Connection is closed.
at EventEmitter.sendCommand (/Users/matias/projects/unleash-redis-race/node_modules/ioredis/built/Redis.js:332:28)
at EventEmitter.set (/Users/matias/projects/unleash-redis-race/node_modules/ioredis/built/utils/Commander.js:90:25)
at UnleashRedisStorage.set (/Users/matias/projects/unleash-redis-race/src/unleash/unleash.redis.ts:13:22)
at Repository.save (/Users/matias/projects/unleash-redis-race/node_modules/unleash-client/src/repository/index.ts:202:32)
at Repository.fetch (/Users/matias/projects/unleash-redis-race/node_modules/unleash-client/src/repository/index.ts:287:23)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async Promise.all (index 0)
at Repository.start (/Users/matias/projects/unleash-redis-race/node_modules/unleash-client/src/repository/index.ts:150:5)
at async Promise.all (index 0)
at Unleash.start (/Users/matias/projects/unleash-redis-race/node_modules/unleash-client/src/unleash.ts:248:5)
Emitted 'error' event on Repository instance at:
at Repository.fetch (/Users/matias/projects/unleash-redis-race/node_modules/unleash-client/src/repository/index.ts:299:6)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
[... lines matching original stack trace ...]
at Unleash.start (/Users/matias/projects/unleash-redis-race/node_modules/unleash-client/src/unleash.ts:248:5)
What I am missing here is how to create a dockerized environment where the latency can be simulated. I will try to spend a bit more time on that, but I would appreciate any guidance.
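One way to simulate the production latency without a full dockerized setup might be a tiny delaying reverse proxy in front of the Unleash API, with the SDK's url pointed at it (a sketch using only Node's built-in http module; the ports, host, and delay are assumptions):

// delay-proxy.ts - minimal sketch: forward every request to the local Unleash API
// after an artificial 1s delay, so the SDK's get() takes roughly as long as in production.
import { createServer, request } from 'node:http';

const UPSTREAM_HOST = 'localhost'; // where the real Unleash API listens
const UPSTREAM_PORT = 4242;
const DELAY_MS = 1000;             // simulated network latency

createServer((req, res) => {
  setTimeout(() => {
    const upstream = request(
      {
        host: UPSTREAM_HOST,
        port: UPSTREAM_PORT,
        path: req.url,
        method: req.method,
        headers: req.headers,
      },
      (upstreamRes) => {
        res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
        upstreamRes.pipe(res);
      },
    );
    req.pipe(upstream);
  }, DELAY_MS);
}).listen(4243, () => console.log('delaying proxy listening on http://localhost:4243'));

Pointing the SDK's url at http://localhost:4243/api instead of the real Unleash URL should make the get() request slow enough to hit the window described above.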
Yeah, we faced the same issue.
IMO there should be a second check:
if (this.stopped) {
  return;
}
before calling await this.save(data, true); inside the fetch method.
Because if repo.fetch starts just a few milliseconds before we call instance.destroy(), it will still proceed and try to save the received data into storage (Redis, FS, etc.). But "destroy" should mean exactly that: after calling destroy, any logic in Unleash should be stopped, and we shouldn't expect any "surprises" from Unleash after calling this method.
Because of this, we are forced to add a wrapper around destroy to make it async and actually wait until any running sync job has completed:
async destroy(): Promise<void> {
  // Wait for the pending "sync" request to finish with any status.
  // `once` comes from 'node:events'; UnleashEvents is exported by 'unleash-client'.
  await Promise.race([
    once(this.unleash, UnleashEvents.Changed), // emitted right before the data is written to Redis
    once(this.unleash, UnleashEvents.Unchanged), // emitted when the Unleash server returns 304
    once(this.unleash, UnleashEvents.Error), // emitted on any error (including network errors)
  ]);
  this.unleash.destroy();
}
This is very relevant for setups with a low refreshInterval (1-3 seconds).
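For reference, in a NestJS setup the wrapper would slot into the shutdown hook roughly like this (a sketch; the service and property names are assumptions):

// Sketch: call the async destroy() wrapper before closing Redis.
async onApplicationShutdown() {
  await this.featureFlagService.destroy(); // waits for the in-flight sync via the Promise.race above
  await this.redis.quit();                 // quit() waits for pending replies, unlike disconnect()
}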
Great catch folks.
I'll bring it back to the team and see if we can't get a patch out.
I'll link the inevitable PR to this issue so you can keep track.
Thank you for clarifying the scenarios where it happens; it should make debugging and writing a fix easier.
Great hints @matiasgarcia and @ozonep 🙌
#673 should solve the issue. Once we merge it, please check whether it solves your cases.