WiserSolutions/quadro

Not able to recover MongoDB connection

ondrej-kvasnovsky opened this issue · 2 comments

Replication steps:

  1. Start MongoDB
  2. Start a quadro service that creates a connection to MongoDB (using quadro/services/mongo.js)
  3. After the service has started (and the connection to MongoDB has been made), kill MongoDB
  4. Try to make a MongoDB call and wait until it fails after 30 attempts (30 seconds):
{
  "name": "MongoError",
  "message": "failed to reconnect after 30 attempts with interval 1000 ms"
}
  5. Try to make a MongoDB call from the service again; you get a "Topology was destroyed" message because the client state is set to DESTROYED:
{
  "name": "MongoError",
  "message": "Topology was destroyed"
}

FYI: The client is destroyed once the reconnect attempts are exhausted (in mongodb-core/lib/connection/pool.js):

// Destroy the instance
self.destroy();
// Emit close event
self.emit('reconnectFailed'
  , new MongoError(f('failed to reconnect after %s attempts with interval %s ms', 
  self.options.reconnectTries, 
  self.options.reconnectInterval)));
  6. Start MongoDB
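
The sequence in the steps above can be modelled without a running MongoDB. The following is only a rough sketch: FakePool is a hypothetical, simplified stand-in written for this issue, not the mongodb-core pool, and its method names (attemptFailed, reconnected, command) are made up for illustration. It shows why a client that exhausts its retries stays unusable even after the server is back.

```javascript
const { EventEmitter } = require('events')

// FakePool is a hypothetical, simplified model of the behaviour described
// above. It is NOT the mongodb-core implementation.
class FakePool extends EventEmitter {
    constructor({ reconnectTries = 30, reconnectInterval = 1000 } = {}) {
        super()
        this.options = { reconnectTries, reconnectInterval }
        this.retriesLeft = reconnectTries
        this.state = 'connected'
    }

    // Called once for every reconnect attempt that fails
    attemptFailed() {
        if (this.state === 'destroyed') return
        this.state = 'disconnected'
        this.retriesLeft -= 1
        if (this.retriesLeft <= 0) {
            // Terminal state: nothing below ever leaves it again
            this.state = 'destroyed'
            this.emit('reconnectFailed', new Error(
                `failed to reconnect after ${this.options.reconnectTries} attempts ` +
                `with interval ${this.options.reconnectInterval} ms`))
        }
    }

    // Called when the server is reachable again
    reconnected() {
        if (this.state === 'destroyed') return
        this.state = 'connected'
        this.retriesLeft = this.options.reconnectTries
    }

    // Stand-in for any MongoDB call made by the service
    command() {
        if (this.state === 'destroyed') throw new Error('Topology was destroyed')
        if (this.state !== 'connected') throw new Error('not connected')
        return { ok: 1 }
    }
}

const pool = new FakePool({ reconnectTries: 2 })
pool.on('reconnectFailed', error => console.log(error.message))
pool.attemptFailed()
pool.attemptFailed() // retries exhausted, 'reconnectFailed' is emitted
pool.reconnected()   // MongoDB is back, but the pool ignores it
console.log(pool.state) // destroyed
```

In this model the only way out of the destroyed state is to build a new FakePool, which mirrors the need for a new connection described in this issue.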

Actual outcome

Try to use the service to connect to MongoDB; you get:

{
  "name": "MongoError",
  "message": "Topology was destroyed"
}

Expected outcome

Connection is recovered and service can use MongoDB.

Observations:

  • autoReconnect is set to true by default (see the driver's connection settings: http://mongodb.github.io/node-mongodb-native/2.2/reference/connecting/connection-settings/), but it only covers the reconnect window: the driver retries during the first 30 seconds and then gives up

  • when the connection state is set to DESTROYED, the connection will not try to reconnect and a new connection needs to be established

  • when there are no attempts to use MongoDB while it is down, the connection recovers because its state was never changed to DESTROYED

received serverDescriptionChanged
{
  "topologyId": 0,
  "address": "localhost:27017",
  "previousDescription": {
    "topologyType": "Single",
    "servers": [
      {
        "address": "localhost:27017",
        "arbiters": [],
        "hosts": [],
        "passives": [],
        "type": "Standalone"
      }
    ]
  },
  "newDescription": {
    "address": "localhost:27017",
    "arbiters": [],
    "hosts": [],
    "passives": [],
    "type": "Unknown"
  }
}
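
The 30-second window from the first observation is the product of the reconnectTries and reconnectInterval values that appear in the error message. A minimal sketch of how that window could be computed and widened follows. It assumes these settings are passed as options to MongoClient.connect(url, options) in the 2.2 driver. The helper names buildMongoOptions and reconnectWindowMs are hypothetical and not part of quadro.

```javascript
// Hypothetical helper, not part of quadro: builds the reconnect-related
// options that would be handed to MongoClient.connect(url, options).
function buildMongoOptions({ reconnectTries = 30, reconnectInterval = 1000 } = {}) {
    return { autoReconnect: true, reconnectTries, reconnectInterval }
}

// Approximate time the driver keeps retrying before the client is destroyed
function reconnectWindowMs({ reconnectTries, reconnectInterval }) {
    return reconnectTries * reconnectInterval
}

const defaults = buildMongoOptions()
console.log(reconnectWindowMs(defaults)) // 30000

// A longer window, for example roughly one hour of retries
const patient = buildMongoOptions({ reconnectTries: 3600 })
console.log(reconnectWindowMs(patient)) // 3600000
```

A larger window only postpones the DESTROYED state; it does not remove the need to handle it.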

Options:

Option 1

The quadro service could recover from the DESTROYED state. It proxies the collection function, and whenever the collection function is called, it checks whether the client is DESTROYED. If it is, it goes through all instances in the container and tries to reconnect the mongo client. Something like this:

const { MongoClient, Db } = require('mongodb')

module.exports = async function(config, app) {
    const dbName = `${app.name}_${app.env}`
    const defaultConnectionUrl = `mongodb://localhost:27017/${dbName}`
    const url = Q.config.get('db.endpoint', defaultConnectionUrl)
    return createConnection(url)
}

async function createConnection(url, options) {
    const client = new MongoClient(url)

    const connection = await client.connect(url, options)
    const handler = {
        apply: (target, thisArg, argumentsList) => {
            if (thisArg.topology.isDestroyed()) {
                tryToReconnect().catch(function(error) {
                    Q.log.warn('Not able to recover MongoDB client from destroyed state', { error })
                })
            }

            const collection = target.bind(thisArg)
            return collection(argumentsList[0], argumentsList[1], argumentsList[2])
        }
    }
    const proxy = new Proxy(connection.collection, handler)
    connection.collection = proxy
    return connection
}

// Walks every instance registered in the container and reopens any Db it holds
async function tryToReconnect() {
    const instances = Q.container.map
    for (const [key, value] of instances) {
        const instance = await value.instance
        if (instance) {
            const names = Object.getOwnPropertyNames(instance)
            for (const name of names) {
                if (instance[name] instanceof Db) {
                    Q.log.info('Reconnecting mongo instance', { key, name })
                    // Await so that a failed open() reaches the caller's catch handler
                    await instance[name].open()
                }
            }
        }
    }
}

Probably too much complexity to solve this issue. But maybe an interesting option to make quadro services more resilient?
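
The proxy mechanism from Option 1 can be tried in isolation. This is a minimal sketch under stated assumptions: FakeDb is a hypothetical stand-in for the driver's Db object, and guardCollection is an illustrative name. Only the Proxy `apply` trap and `Reflect.apply` are real language features.

```javascript
// FakeDb is a hypothetical stand-in for the driver's Db object
class FakeDb {
    constructor() {
        this.destroyed = false
        this.topology = { isDestroyed: () => this.destroyed }
    }

    collection(name) {
        return { name }
    }
}

// Replaces db.collection with a proxy that calls onDestroyed first
// whenever the call is made on a destroyed client.
function guardCollection(db, onDestroyed) {
    db.collection = new Proxy(db.collection, {
        apply(target, thisArg, args) {
            if (thisArg.topology.isDestroyed()) {
                onDestroyed(thisArg)
            }
            // Forward the call with the original receiver and all arguments
            return Reflect.apply(target, thisArg, args)
        }
    })
    return db
}

let recovered = 0
const db = guardCollection(new FakeDb(), fake => {
    fake.destroyed = false // pretend the reconnect succeeded
    recovered += 1
})

db.destroyed = true
console.log(db.collection('users').name, recovered) // users 1
```

Using `Reflect.apply` forwards every argument, so the wrapper does not need to know how many parameters collection accepts.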

Option 2

We add a health-check that verifies the service can connect to MongoDB. However, we had good reasons to exclude MongoDB connectivity from the health-check: when MongoDB is under heavy load it can lag, the service health-check can then time out, and instances are killed and recreated even though nothing is wrong with them.

module.exports = class {
    constructor(mongo) {
        this.mongo = mongo
    }

    async run() {
        const mongoPing = await this.mongo.command({ ping: '1' })
        const isMongoOk = mongoPing.ok === 1
        return isMongoOk
    }
}
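
One way the lagging-MongoDB concern might be softened is to bound the ping with a timer and treat a slow answer differently from a failed one. This is only a sketch of that idea, not something quadro does today. pingWithTimeout and isHealthy are hypothetical names, the 2000 ms default is an arbitrary assumption, and `mongo` is assumed to expose command() returning a promise as in the snippet above.

```javascript
// Hypothetical helper, not part of quadro. Resolves to 'ok', 'failed' or
// 'slow' and never hangs longer than timeoutMs.
function pingWithTimeout(mongo, timeoutMs) {
    let timer
    const timeout = new Promise(resolve => {
        timer = setTimeout(() => resolve('slow'), timeoutMs)
    })
    const ping = Promise.resolve()
        .then(() => mongo.command({ ping: 1 }))
        .then(result => (result.ok === 1 ? 'ok' : 'failed'))
        .catch(() => 'failed')
    // Whichever settles first wins; the timer is always cleaned up
    return Promise.race([ping, timeout]).finally(() => clearTimeout(timer))
}

// Only a definite failure marks the instance as unhealthy; a slow answer does not
async function isHealthy(mongo, timeoutMs = 2000) {
    return (await pingWithTimeout(mongo, timeoutMs)) !== 'failed'
}
```

With this shape a loaded MongoDB would not get healthy instances killed, at the cost of not detecting a server that hangs without ever answering.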

Option 3

We add a check whether the MongoDB client in a service is destroyed. That checks the status of the client rather than making a connection to a MongoDB server (this issue is, after all, about the client being destroyed).

module.exports = class {
    constructor(mongo) {
        this.mongo = mongo
    }

    run() {
        return this.mongo.topology.isConnected()
    }
}
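
Note that isConnected() also reports false while the driver is still retrying. A variant that fails the health-check only for a destroyed client might look like the sketch below. clientState and ClientStateCheck are hypothetical names, and `mongo` is assumed to expose topology.isDestroyed() and topology.isConnected() as in the other snippets of this issue.

```javascript
// Classifies the client into one of three states (hypothetical helper)
function clientState(mongo) {
    if (mongo.topology.isDestroyed()) return 'destroyed'
    return mongo.topology.isConnected() ? 'connected' : 'reconnecting'
}

class ClientStateCheck {
    constructor(mongo) {
        this.mongo = mongo
    }

    run() {
        // A client that is still retrying may recover on its own,
        // so only the terminal state is reported as unhealthy
        return clientState(this.mongo) !== 'destroyed'
    }
}

// Usage with a stand-in object instead of a real Db
const retrying = { topology: { isDestroyed: () => false, isConnected: () => false } }
console.log(clientState(retrying), new ClientStateCheck(retrying).run()) // reconnecting true
```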

I think the 3rd option is the best one, let me know what you think @igorshapiro @ankurjain86 @tomeresk

Option 4

Or we could do something like this: when we find that the mongo client is destroyed, we try to open it again (and so prevent replacement of the EC2 instance). But recovering from errors is not really the responsibility of a health-check...

services/healthcheck.js:

module.exports = class {
    constructor(mongo) {
        this.mongo = mongo
    }

    async run() {
        const connectionString = Q.config.get('db.endpoint')
        if (connectionString) {
            if (this.mongo.topology.isDestroyed()) {
                // Wait for the reopen attempt so the check below reflects its result
                await this.mongo.open()
            }
            return this.mongo.topology.isConnected()
        }
        return true
    }
}