cargomedia/cm

Performance of MongoDB queries to retrieve logs in admin

Closed this issue · 36 comments

njam commented

It's happening again. Probably also related to write queries happening at the same time?

Related:
#2160
#2228

mongos> db.cm_log.count()
4969165
mongos> db.cm_log.count({level:400})
20805
mongos> db.cm_log.count({level:300})
4193617
mongos> db.cm_log.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "example.cm_log"
    },
    {
        "v" : 1,
        "key" : {
            "level" : NumberLong(1),
            "createdAt" : NumberLong(1)
        },
        "name" : "level_1_createdAt_1",
        "ns" : "example.cm_log",
        "background" : true
    },
    {
        "v" : 1,
        "key" : {
            "context.extra.type" : NumberLong(1),
            "level" : NumberLong(1),
            "createdAt" : NumberLong(1)
        },
        "name" : "context_extra_type_1_level_1_createdAt_1",
        "ns" : "example.cm_log",
        "background" : true
    },
    {
        "v" : 1,
        "key" : {
            "message" : NumberLong(1)
        },
        "name" : "message_1",
        "ns" : "example.cm_log",
        "background" : true
    }
]
njam commented

It never happens for "/log/stats" but only for "/log?level=X".

@fvovan can you investigate?

Example long-running query:

        {
            "opid" : 166492112,
            "active" : true,
            "secs_running" : 1182,
            "microsecs_running" : NumberLong(1182201490),
            "op" : "query",
            "ns" : "example.cm_log",
            "query" : {
                "count" : "cm_log",
                "query" : {
                    "message" : {
                        "$exists" : true
                    },
                    "level" : {
                        "$in" : [
                            NumberLong(300)
                        ]
                    }
                }
            },
            "planSummary" : "IXSCAN { level: 1, createdAt: 1 }",
            "client" : "10.55.40.160:59820",
            "desc" : "conn217970",
            "threadId" : "0x7f3eb6b16700",
            "connectionId" : 217970,
            "locks" : {
                "^example" : "R"
            },
            "waitingForLock" : true,
            "numYields" : 44658,
            "lockStats" : {
                "timeLockedMicros" : {
                    "r" : NumberLong(1319777562),
                    "w" : NumberLong(0)
                },
                "timeAcquiringMicros" : {
                    "r" : NumberLong(370976651),
                    "w" : NumberLong(0)
                }
            }
        },
njam commented

The log was seeing ~25 inserts per second:
screen shot 2016-08-27 at 22 52 22

There were periods with far more inserts:
image

And the corresponding total query count for this peak is 88922:
image

Could it be a bug in CopperEgg?

Btw, are we going to migrate to version 3 on Jessie?

njam commented

I saw that too. It's only the config node. I guess it's an anomaly and we can ignore it.

MongoDB 3 upgrade is independent of Debian 8. You can track it here: cargomedia/puppet-packages#1255

So the problem is {"message" : {$exists: true}}.
I simulated MongoDB load of approximately 300-400 inserts per second,
and starting from a certain collection size (2M documents), the query
db.cm_log.count({"message" : {$exists: true}, "level" : 300})
takes more than 30 seconds and locks the collection, while the plain {"level": 300} query
executes almost instantly.

Unfortunately there is no explain option for count() in the MongoDB version we have now. But I tried to reconstruct the index usage, and the result is that MongoDB doesn't use indexes for the $exists operator.

Let's just modify the query; in any case, as checked, there are no documents without a message.
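Since every document is guaranteed to have a message, the non-selective `$exists` clause can be dropped on the application side before the query reaches MongoDB. A minimal sketch of that rewrite (function name and shape are illustrative, not the project's actual code):

```javascript
// Sketch: strip {$exists: true} clauses from a filter for fields that
// are guaranteed to be present on every document, so the remaining
// filter can be served by the { level: 1, createdAt: 1 } index alone.
function stripExistsTrue(filter, guaranteedFields) {
  const out = {};
  for (const [field, cond] of Object.entries(filter)) {
    const isExistsTrue =
      cond && typeof cond === 'object' &&
      Object.keys(cond).length === 1 && cond.$exists === true;
    if (isExistsTrue && guaranteedFields.includes(field)) continue; // drop it
    out[field] = cond;
  }
  return out;
}

const original = { message: { $exists: true }, level: { $in: [300] } };
const rewritten = stripExistsTrue(original, ['message']);
// rewritten: { level: { $in: [300] } }
```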

njam commented

sgtm, can you open a separate PR, so we can close this only once we've verified the impact on prod?

Cleanup could also be optimized. Currently every type is cleaned up individually in Maintenance_Cli instead of simply deleting all entries that are too old in one go.

Cleanup could also be optimized. Currently every type is cleaned up individually in Maintenance_Cli instead of simply deleting all entries that are too old in one go.

Yes, it probably could, but the thing is that cleanUp() can be overridden by certain types to preserve important log records; that's why all "typed" documents are processed separately.

Yes, I see they also have different values for maximum age.

Wouldn't it be better to use MongoDB's TTL mechanism for this? It's null on production, but records are still cleared through the cleanup mechanism. Different TTLs could be achieved by configuring different loggers for specific log types. I expect TTL-based removal to be much more efficient than this manual cleanup.

Maybe. It could be done without several loggers by injecting an expireAt (or similar) property into each document in the mongo handler, depending on the record type.
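The idea above can be sketched as follows: the handler stamps each record with an expireAt date based on its type, and a single TTL index (`{ expireAt: 1 }` with `expireAfterSeconds: 0`) removes expired records automatically. The type names and retention periods here are made up for illustration:

```javascript
// Hypothetical per-type retention (days); anything unknown falls back
// to the default.
const RETENTION_DAYS = { default: 30, 'payment-error': 365 };

// Return a copy of the record with an expireAt stamp the TTL index
// can act on.
function withExpireAt(record, now = new Date()) {
  const days = RETENTION_DAYS[record.type] || RETENTION_DAYS.default;
  const expireAt = new Date(now.getTime() + days * 24 * 3600 * 1000);
  return { ...record, expireAt };
}
```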

njam commented

It didn't help :/
Had a timeout on /log?level=400 just now:

MongoCursorTimeoutException: www3.example.com:27017: Read timed out after reading 0 bytes, waited for 30.000000 seconds in /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/PagingSource/MongoDb.php on line 89
0. {main} at /home/example/serve/public/index.php line 0
1. CM_Bootloader->execute(Closure) at /home/example/releases/20160902122350/public/index.php line 11
2. {closure}() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Bootloader.php line 163
3. CM_Http_Handler->processRequest(CM_Http_Request_Post) at /home/example/releases/20160902122350/public/index.php line 9
4. CM_Http_Response_Abstract->process() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Handler.php line 28
5. CM_Http_Response_View_Abstract->_process() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Abstract.php line 48
6. CM_Http_Response_Abstract->_runWithCatching(Closure, Closure) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/View/Abstract.php line 170
7. CM_Http_Response_View_Abstract->{closure}() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Abstract.php line 273
8. CM_Http_Response_View_Ajax->_processView([]) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/View/Abstract.php line 167
9. CM_View_Abstract->ajax_loadPage(['path' => '/log?level=300'], CM_Frontend_JavascriptContainer_View, CM_Http_Response_View_Ajax) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/View/Ajax.php line 33
10. CM_Http_Response_View_Abstract->loadPage(['path' => '/log?level=300'], CM_Http_Response_View_Ajax) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/View/Abstract.php line 36
11. CM_Http_Response_Abstract->process() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/View/Abstract.php line 87
12. CM_Http_Response_Page_Embed->_process() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Abstract.php line 48
13. CM_Http_Response_Page->_processContentOrRedirect() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Page/Embed.php line 30
14. CM_Http_Response_Page->_processPageLoop(CM_Http_Request_Get) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Page.php line 82
15. CM_Http_Response_Page->_processPage(CM_Http_Request_Get, CM_Http_Response_Page_ProcessingResult) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Page.php line 95
16. CM_Http_Response_Abstract->_runWithCatching(Closure, Closure) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Page.php line 152
17. CM_Http_Response_Page->{closure}() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Abstract.php line 273
18. CM_Http_Response_Page_Embed->_renderPage(Admin_Page_Log) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Page.php line 143
19. CM_RenderAdapter_Layout->fetchPage() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Http/Response/Page/Embed.php line 26
20. CM_RenderAdapter_Component->fetch() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/RenderAdapter/Layout.php line 87
21. CM_Frontend_Render->fetchViewResponse(CM_Frontend_ViewResponse) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/RenderAdapter/Component.php line 23
22. CM_Frontend_Render->fetchViewTemplate(Admin_Page_Log, 'default', []) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Frontend/Render.php line 487
23. CM_Frontend_Render->fetchTemplate('library/Admin/layout/default/P...', [], []) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Frontend/Render.php line 479
24. Smarty_Internal_TemplateBase->fetch() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Frontend/Render.php line 85
25. content_57c97007558fd6_26342174(Smarty_Internal_Template) at /home/example/releases/20160902122350/vendor/smarty/smarty/libs/sysplugins/smarty_internal_templatebase.php line 182
26. smarty_function_component([], Smarty_Internal_Template) at /home/example/releases/20160902122350/tmp/smarty/Admin_Page_Log_101_en^93d5e3bd10cf1a2ec1d667877804c236f1375238.file.default.tpl.php line 44
27. CM_RenderAdapter_Component->fetch() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/SmartyPlugins/function.component.php line 21
28. CM_Frontend_Render->fetchViewResponse(CM_Frontend_ViewResponse) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/RenderAdapter/Component.php line 23
29. CM_Frontend_Render->fetchViewTemplate(Admin_Component_LogList, 'default', []) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Frontend/Render.php line 487
30. CM_Frontend_Render->fetchTemplate('vendor/cargomedia/cm/layout/de...', [], []) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Frontend/Render.php line 479
31. Smarty_Internal_TemplateBase->fetch() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Frontend/Render.php line 85
32. content_57c97007a09953_61088818(Smarty_Internal_Template) at /home/example/releases/20160902122350/vendor/smarty/smarty/libs/sysplugins/smarty_internal_templatebase.php line 182
33. CM_Paging_Abstract->rewind() at /home/example/releases/20160902122350/tmp/smarty/Admin_Component_LogList_101_en^649fdec06fbf8ca1d48945985f87730b50a65993.file.default.tpl.php line 82
34. CM_Paging_Abstract->getItems() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Paging/Abstract.php line 481
35. CM_Paging_Abstract->_getItemsRaw() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Paging/Abstract.php line 31
36. CM_PagingSource_MongoDb->getItems(0, 51) at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/Paging/Abstract.php line 347
37. MongoCursor->rewind() at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/PagingSource/MongoDb.php line 89
38. {throw} at /home/example/releases/20160902122350/vendor/cargomedia/cm/library/CM/PagingSource/MongoDb.php line 89
njam commented

When loading the "logs" page just now we saw the following in CopperEgg.
Surge in disk reads:
screen shot 2016-09-08 at 12 09 40
And IO wait:
screen shot 2016-09-08 at 12 09 47

njam commented

What is the difference between:

  • /log/stats: using SK_App_Stats::getLog(start, stop, 400) - fast
  • /log?level=400: using new CM_Paging_Log([400]) - slow

The most obvious difference is that SK_App_Stats uses an exact match for level whereas CM_Paging_Log uses a range match, even when a single level is requested. MongoDB is notoriously bad at range matches, so it might help to do an exact match if only one level is requested.
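The suggested change can be sketched as a tiny query-builder tweak (the function name is illustrative): emit an exact match when exactly one level is requested, and fall back to `$in` otherwise.

```javascript
// Sketch: build the "level" clause so a single requested level becomes
// an exact match instead of an $in match.
function levelClause(levels) {
  return levels.length === 1 ? levels[0] : { $in: levels };
}

levelClause([400]);      // → 400
levelClause([300, 400]); // → { $in: [300, 400] }
```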

Oh, my bad.

It's the same {$exists: true}, but now used for context.extra.type.
What about configuring a dummy type of 0?
It sounds silly, but it seems that even in schemaless storage we have to keep a schema.
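The dummy-type idea amounts to normalizing every record on insert so the indexed field is always present, letting queries use an exact match on `context.extra.type` instead of `$exists`. A sketch (the helper name is hypothetical):

```javascript
// Sketch: always set context.extra.type on insert, using 0 as the
// dummy value when the record has no real type.
function normalizeLogRecord(record) {
  const r = { ...record, context: { ...(record.context || {}) } };
  r.context.extra = { ...(r.context.extra || {}) };
  if (r.context.extra.type === undefined) r.context.extra.type = 0;
  return r;
}
```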

even when a single level is requested. Mongodb is notoriously bad at range-matches, so it might help to do an exact match if only one level is requested.

I did a few benchmarks; there is no difference. Probably the query optimizer transforms it. But it's not difficult to set up on our side.

I'm leaving soon, but if you're okay with the idea, here is the PR:
#2326

If it doesn't make a difference then I wouldn't do it.

Regarding type: Another approach, if we don't want to enforce a dummy-type, would be to make the type-index sparse and hint it in the query-options. Maybe that helps.

Regarding type: Another approach, if we don't want to enforce a dummy-type, would be to make the type-index sparse and hint it in the query-options. Maybe that helps.

It seems it's not possible to use a hint for count() queries.

As far as I can see, the paging's count is only used for the pagination so it can display a link to the last available page. How about we try to remove that?

njam commented

As discussed: Vladimir will re-create a big MongoDB log collection, once it's ready let's investigate this together @alexispeter @fvovan

njam commented

As discussed: let's try always setting context.extra.type, to avoid a sparse index.

I can't say that setting a mandatory type helped a lot.
Although this trick gave some speed-up in the dev environment, it still sometimes throws on production.

njam commented

Loading the admin logs page still results in a lot of disk I/O, and has slow response times. Here with 400'000 warning records:
screen shot 2016-10-03 at 09 07 17

The last attempt was not good enough either.
Let's try the next "trick".
The only thing that needs to be considered is cleaning up the redundant indexes at the end.

Using a single index for both filtering and sorting finally gave positive results.
So the next step is adding an index for the "type only" query and cleaning up the obsolete indexes.

All listings work well now.
cc @njam

njam commented

How does it look @fvovan ?

The last remaining "all" listing is fixed,
so everything should work.

njam commented

👍

Are the "experimental" indexes removed already?
Can you summarize what was the issue and how you solved it eventually?

Are the "experimental" indexes removed already?

They were removed in #2384

Can you summarize what was the issue and how you solved it eventually?

Briefly, there were two problems. One was bad performance of $exists queries, which was fixed by introducing a dummy type for the cm_log collection.
The other was relying on MongoDB's index intersection, which worked well in the dev environment but not in production. As it turned out, it's necessary to satisfy the whole query (both its filtering and sorting parts) with a single index.

njam commented

Now we have these indices:

  • 1: _id
  • 2: level, createdAt
  • 3: context.extra.type, level, createdAt
  • 4: message
  • 5: context.extra.type, createdAt

(collections.json)

Do we need all of them? 2, 3 and 5 cannot be combined?

Do we need all of them? 2, 3 and 5 cannot be combined?

That's the whole point:
sometimes we query by level, sometimes by type, sometimes by both. They could probably be combined if some of these queries didn't contain a sort modifier. The docs say:

Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.

So even if we trust index intersection, 2 and 5 are necessary. The only one that could theoretically be deleted is 3; I doubt it's possible, especially for count queries.
This can only be checked on the live server because, as it turned out, the behaviour differs depending on server load or storage topology.
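The "single index for the whole query" rule can be illustrated with a simplified coverage check: a compound index can serve a filtered, sorted query when the equality-matched fields form a prefix of the index and the sort key follows immediately after. This is a simplification of MongoDB's actual planner rules, and the function is a sketch, not real project code:

```javascript
// Sketch: does a compound index cover equality filters plus a sort?
// (Equality fields must form an index prefix, with the sort key next.)
function indexCovers(indexFields, equalityFields, sortField) {
  const eq = new Set(equalityFields);
  let i = 0;
  while (i < indexFields.length && eq.has(indexFields[i])) {
    eq.delete(indexFields[i]);
    i++;
  }
  // All equality fields consumed, and the sort key is next in the index.
  return eq.size === 0 && indexFields[i] === sortField;
}

// Why indexes 2 and 5 are both needed for createdAt-sorted listings:
indexCovers(['level', 'createdAt'], ['level'], 'createdAt');                           // true
indexCovers(['context.extra.type', 'createdAt'], ['context.extra.type'], 'createdAt'); // true
indexCovers(['level', 'createdAt'], ['context.extra.type'], 'createdAt');              // false
```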

njam commented

Ok I get it now.
Closeable?

agree