Improve keepalive performance in mongo connection pool
rg2011 opened this issue · 7 comments
Is your feature request related to a problem / use case? Please describe.
It is related to a performance problem. We have noticed a high rate of slow queries in our mongo deployment, regarding the listDatabases command:
{
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "admin.$cmd",
    "command": {
      "listDatabases": 1,
      "$db": "admin",
      ... omitted for brevity ...
    },
    "locks": {
      "ParallelBatchWriterMode": { "acquireCount": { "r": 140 } },
      "FeatureCompatibilityVersion": { "acquireCount": { "r": 140 } },
      "ReplicationStateTransition": { "acquireCount": { "w": 1 } },
      "Global": { "acquireCount": { "r": 140 } },
      "Mutex": { "acquireCount": { "r": 139 } }
    },
    ... omitted for brevity ...
    "durationMillis": 354
  }
}
The log has been redacted for brevity, but I've left the part about the locks in. It seems that the command acquires 139 - 140 locks, which might be the reason why it is so slow.
The source IP addresses of these requests belong to Orion servers. We have several of them in our multi-tenant deployment. It seems that fiware-orion uses the listDatabases command as a keepalive:
fiware-orion/src/lib/mongoDriver/mongoConnectionPool.cpp
Lines 105 to 120 in a0d75b3
Deployed at scale, we are hitting around 300 - 500 ms per listDatabases request, as shown in the log. We would like to propose changing to some lighter command for keepalive, instead of listDatabases.
Describe the solution you'd like
Stop using listDatabases for keepalive in the mongo pool. Replace it with a less expensive command.
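A minimal sketch of what the change amounts to, with the command documents written as plain objects (the actual pool code is C++ in mongoConnectionPool.cpp, the helper name below is hypothetical, and `ping` is shown as one possible lighter command):

```javascript
// Sketch only: a keepalive just needs a cheap round trip that errors when
// the connection is dead, so the command document can be trivial.
const heavyKeepalive = { listDatabases: 1 }; // current: scans every database, many locks
const lightKeepalive = { ping: 1 };          // candidate: no-op server round trip

// Hypothetical helper: returns the admin command document a probe would send.
function keepaliveCommand(useLight) {
  return useLight ? lightKeepalive : heavyKeepalive;
}
```

Either document would be sent with the driver's run-command API (db.adminCommand in mongosh); only the cost on the server side differs.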
Describe alternatives you've considered
Not much, really, besides increasing the resources of the mongo servers or splitting the mongo databases across different replica sets, but both options seem much more costly than changing the keepalive method.
Describe why you need this feature
Slow queries have an overall impact on cluster performance and might be degrading some of the actual work the cluster has to do.
Currently the listDatabases queries are not the only slow queries we have, but they amount to roughly 40% - 50% of all the slow queries in the replica set.
Additional information
Do you have the intention to implement the solution?
I can help with choosing a new command to use as keepalive in the pool. For instance, getParameter might be a good candidate, e.g. db.adminCommand({ getParameter: 1, logLevel: 1 }).
I can also help evaluate the impact on performance once the command is changed.
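One way that evaluation could be sketched: average the wall-clock time of each candidate command through whatever runner is available, e.g. a closure around mongosh's db.adminCommand (the helper below is illustrative, not part of Orion):

```javascript
// Average milliseconds per call of `runCommand(command)` over several runs.
// `runCommand` is whatever executes an admin command against the deployment
// under test (hypothetical; e.g. (cmd) => db.adminCommand(cmd) in mongosh).
function avgMillis(runCommand, command, repetitions = 10) {
  const start = Date.now();
  for (let i = 0; i < repetitions; i++) runCommand(command);
  return (Date.now() - start) / repetitions;
}

// Candidates discussed in this thread:
const candidates = [
  { listDatabases: 1 },             // current keepalive
  { getParameter: 1, logLevel: 1 }, // proposed replacement
];
```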
The mentioned code corresponds to pingConnection(), which is used only at CB startup, so its impact is very limited. However, listDatabases is used in another place in the code:
fiware-orion/src/lib/mongoBackend/MongoGlobal.cpp
Lines 228 to 239 in a0d75b3
The getOrionDatabases() function is invoked from subCacheRefresh(). This is done with a frequency of -subCacheIval seconds (60 by default).
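The expected request rate follows directly from this: each CB issues one listDatabases per refresh interval (illustrative arithmetic; the example numbers are not from the reporter's deployment):

```javascript
// With N CBs each refreshing every subCacheIval seconds, the replica set
// receives N listDatabases calls per interval.
function listDatabasesPerHour(numCBs, subCacheIvalSeconds) {
  return numCBs * (3600 / subCacheIvalSeconds);
}
```

For example, five CBs at the default 60-second interval would produce 300 listDatabases calls per hour.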
@rg2011 to confirm this theory... could you check if the "slow query" log regarding listDatabases happens at a frequency that matches the configuration of -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could see several "sequences".
Ref https://www.mongodb.com/docs/manual/reference/command/listDatabases/
Use nameOnly to make the operation lighter.
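As a sketch, the lighter variant from the linked docs written as a command document (the mongosh form is shown in the comment):

```javascript
// Per the listDatabases reference, nameOnly makes the command return only
// database names and skip the size statistics.
// In mongosh: db.adminCommand({ listDatabases: 1, nameOnly: true })
const lightListDatabases = { listDatabases: 1, nameOnly: true };
```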
> @rg2011 to confirm this theory... could you check if the "slow query" log regarding listDatabases happens at a frequency that matches the configuration of -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could see several "sequences".
Yes, it's every 60 seconds.
The PR has been merged, but let's keep this issue open until it can be tested in the same environment where @rg2011 detected the problem.
This has been included in Orion 4.0.0.
Pending a test in that environment before closing the issue.
Deployed 4.0.0 in the prod environment and confirmed a decrease in slow queries. Thanks!