telefonicaid/fiware-orion

Improve keepalive performance in mongo connection pool

rg2011 opened this issue · 7 comments

Is your feature request related to a problem / use case? Please describe.

It is related to a performance problem. We have noticed a high rate of slow queries in our mongo deployment, regarding the listDatabases command:

{
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "admin.$cmd",
    "command": {
      "listDatabases": 1,
      "$db": "admin",
  ... omitted for brevity ...
    "locks": {
      "ParallelBatchWriterMode": {
        "acquireCount": {
          "r": 140
        }
      },
      "FeatureCompatibilityVersion": {
        "acquireCount": {
          "r": 140
        }
      },
      "ReplicationStateTransition": {
        "acquireCount": {
          "w": 1
        }
      },
      "Global": {
        "acquireCount": {
          "r": 140
        }
      },
      "Mutex": {
        "acquireCount": {
          "r": 139
        }
      }
    },
   .... omitted for brevity ...
    "durationMillis": 354
  }
}

The log has been redacted for brevity, but I've left in the part about the locks. It seems that the command acquires 139 - 140 locks, which might be the reason why it is so slow.

The source IP addresses of these requests belong to Orion servers. We have several of them in our multi-tenant deployment. It seems that fiware-orion uses the listDatabases command as a keepalive:

// MongoDB has a ping command, but we are not using it, as it doesn't
// provide auth checking when user and pass are empty (it provides auth
// when we have user and pass, but that is not enough).
//
// In addition, note that command and database depend on mtenant. If we
// are in mtenant mode we will need at some point to look for all orion*
// databases, so the command to ping will be listDatabases in the admin DB.
// But if we run in non-mtenant mode, listCollections in the default
// database will suffice.
std::string cmd;
std::string effectiveDb;
if (mtenant)
{
  cmd = "listDatabases";

Deployed at scale, we are hitting around 300 - 500 ms per listDatabases request, as shown in the log. We would like to propose switching to a lighter command for keepalive, instead of listDatabases.

Describe the solution you'd like

Stop using listDatabases for keepalive in the mongo pool. Replace with a less expensive command.

Describe alternatives you've considered

Really not much besides increasing the resources of the mongo servers or splitting the mongo databases across different replicasets, but both options seem much more costly than changing the keepalive method.

Describe why you need this feature

Slow queries have an overall impact on cluster performance and might be degrading some of the actual work the cluster has to do.

Currently the listDatabases queries are not the only slow queries we have, but they amount to roughly 40% - 50% of all the slow queries in the replicaset.

Additional information

Do you have the intention to implement the solution

I can help with choosing a new command to use as keepalive in the pool. For instance, getParameter might be a good candidate, e.g. db.adminCommand({ getParameter:1, logLevel:1}).

I can also help evaluating the impact on performance once the command is changed.

The mentioned code corresponds to pingConnection(), which is used only at CB startup, so its impact is very limited.

listDatabases is used in another place in the code:

bool getOrionDatabases(std::vector<std::string>* dbsP)
{
  orion::BSONObj result;
  std::string err;
  orion::BSONObjBuilder bob;
  bob.append("listDatabases", 1);
  if (!orion::runDatabaseCommand("admin", bob.obj(), &result, &err))
  {
    return false;
  }

The getOrionDatabases() function is invoked from subCacheRefresh(). This is done with a frequency of -subCacheIval seconds (60 by default).

@rg2011 to confirm this theory... could you check if the "slow query" log regarding listDatabases happens at a frequency that matches the configuration of -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could have several "sequences".

Ref https://www.mongodb.com/docs/manual/reference/command/listDatabases/

Use nameOnly to make the operation lighter.
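
For illustration only, here is a minimal sketch of what the command document looks like with and without nameOnly. The helper listDatabasesCommand is hypothetical and not part of Orion; it just renders the command as JSON text, and it does not show how the command is actually built or sent in the Orion code base.

```cpp
#include <string>

// Hypothetical helper (not Orion code): renders the listDatabases command
// document as JSON text. When nameOnly is true, the nameOnly field described
// in the MongoDB reference linked above is added to the command.
std::string listDatabasesCommand(bool nameOnly)
{
  std::string cmd = "{ \"listDatabases\": 1";

  if (nameOnly)
  {
    cmd += ", \"nameOnly\": true";
  }

  cmd += " }";
  return cmd;
}
```

Since getOrionDatabases() only needs the database names, asking the server for names only looks like a natural fit for that call.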

@rg2011 to confirm this theory... could you check if the "slow query" log regarding listDatabases happens at a frequency that matches the configuration of -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could have several "sequences".

Yes, it's every 60 seconds.
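
For anyone wanting to repeat this check on their own logs, a rough sketch follows. It assumes the slow query timestamps have already been extracted as whole seconds since the epoch; phaseHistogram is a hypothetical helper name, and the exact modulo used here does not tolerate jitter, so real log data would probably need some bucketing.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper: groups slow query timestamps (non-negative seconds
// since the epoch) by their offset within the refresh period. A single CB
// refreshing every periodSec seconds should land on one offset; several CBs
// started at different moments should show up as several offsets, one per
// "sequence".
std::map<std::int64_t, int> phaseHistogram(const std::vector<std::int64_t>& timestamps,
                                           std::int64_t periodSec)
{
  std::map<std::int64_t, int> histogram;

  for (std::int64_t t : timestamps)
  {
    ++histogram[t % periodSec];
  }

  return histogram;
}
```

With a period of 60, a small number of heavily populated offsets would be consistent with the subCacheRefresh() theory, while timestamps spread evenly over all offsets would point to some other caller.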

The PR has been merged, but let's keep this issue open until it can be tested in the same environment where @rg2011 detected the problem.

This has been included in Orion 4.0.0.

Pending a test in that environment before closing the issue.

Deployed 4.0.0 in prod environment and confirmed decrease in slow queries. Thanks!