naemon/naemon-livestatus

Missing statehist table

dirtyren opened this issue · 8 comments

Hey, just a quick question.
Was the statehist table, which is present in the original livestatus, taken out of the naemon-livestatus version, or was it something added to livestatus later and never ported to naemon-livestatus?

Tks.

sni commented

The statehist table was developed for mk-livestatus after the fork and has never been completely backported to naemon-livestatus.

sni commented

I once backported the statehist table for a customer here: https://github.com/ConSol/omd/blob/labs/packages/naemon-livestatus/patches/0001-backport-statehist-table.patch
But I haven't found the time to backport it to the latest naemon-livestatus HEAD.
See also #20
We already had a POC to see whether it makes more sense to backport the statehist table or to apply the naemon patches to a new fork of mk-livestatus.
Unfortunately naemon-livestatus does not contain the history of the original livestatus, so either approach takes quite some time.
At least there is some documentation on how to reunite those git repositories again: https://labs.consol.de/development/git/2017/09/08/reunite-separate-git-repositories.html
That approach has its own drawbacks, though, because mk-livestatus does not have a separate repository but is maintained in a subfolder of check_mk, which makes working with the upstream git repo super ugly.

dirtyren commented

Tks @sni.
I am trying to merge them to bring this feature to naemon-livestatus. If I get it to compile and work, I will open a pull request so more people can test it, and maybe you can merge it in the next release.
Let's see where it goes.
Tks a lot.

[]s.

ageric commented

The statehist tables were in mk-livestatus when naemon-livestatus was forked from it, but using them in networks larger than you could track with pen and paper meant memory usage would explode very quickly. Querying them also caused a lot of (large-ish) leaks that were exceedingly difficult to track down, so they were removed in favour of the event subscription service addendum, which makes it reasonably trivial to stash event history in a database (or a file, or a Kafka queue, or whatever).

dirtyren commented

Tks @ageric, I was thinking about the performance and memory overhead the other day. I will forget about that for now and look at other approaches.
We already store the state changes in a database, but the problem is finding the top 10 SLA offenders in real time to show on a dashboard when you are monitoring large networks. That was the plan for the statehist table.
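
For illustration only, a minimal sketch of that top-10 calculation against a stored state-change log; the state_changes table layout used here (host_name, service_description, state, changed_at as a unix timestamp) is just an assumed example, not the real schema:

```python
# Sketch: compute the last 30 days' top 10 downtime offenders from a
# hypothetical state_changes table. Adjust names/columns to the real schema.
import sqlite3
import time
from collections import defaultdict

MONTH_START = time.time() - 30 * 86400          # rough "last 30 days" window

conn = sqlite3.connect("statechanges.db")        # assumed example database
rows = conn.execute(
    "SELECT host_name, service_description, state, changed_at "
    "FROM state_changes WHERE changed_at >= ? "
    "ORDER BY host_name, service_description, changed_at",
    (MONTH_START,),
).fetchall()

downtime = defaultdict(float)   # (host, service) -> seconds not OK
problem_since = {}              # (host, service) -> when the current problem started

for host, svc, state, ts in rows:
    key = (host, svc)
    if state != 0 and key not in problem_since:
        problem_since[key] = ts                          # problem started
    elif state == 0 and key in problem_since:
        downtime[key] += ts - problem_since.pop(key)     # problem ended

now = time.time()
for key, since in problem_since.items():                 # still broken right now
    downtime[key] += now - since

for (host, svc), secs in sorted(downtime.items(), key=lambda i: -i[1])[:10]:
    print(f"{host};{svc}: {secs / 3600:.1f}h of downtime")
```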

Tks.

ageric commented

The statehist table would be too slow for that anyway, I think.

If I were you I'd run a statetracker externally that keeps tabs on problems and recoveries and updates a fixed-size db table every time something recovers. If you do that, and also run a livestatus query at dashboard load time to find the current problem hosts and services and calculate their state duration, you get the complete downtime picture at the least possible cost, and the data is as close to real time as you can get.
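
A minimal sketch of the livestatus half of that idea, assuming the socket sits at /var/cache/naemon/live (the path differs between installations):

```python
# Sketch: ask livestatus for all currently non-OK services and compute how
# long each has been in its current state.
import json
import socket
import time

LIVE_SOCKET = "/var/cache/naemon/live"   # adjust to where your livestatus socket lives

QUERY = (
    "GET services\n"
    "Columns: host_name description state last_state_change\n"
    "Filter: state > 0\n"
    "OutputFormat: json\n"
    "ResponseHeader: off\n"
    "\n"
)

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect(LIVE_SOCKET)
    s.sendall(QUERY.encode())
    s.shutdown(socket.SHUT_WR)           # signal end of query so livestatus answers
    data = b""
    while chunk := s.recv(4096):
        data += chunk

now = time.time()
for host, desc, state, since in json.loads(data):
    print(f"{host};{desc} state={state}, broken for {(now - since) / 60:.0f} min")
```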

The statetracker could be a module if you want to go the extra mile.

dirtyren commented

Tks @ageric.
We already log all state changes to the database, but I think your idea of only doing the availability calculations when something changes is a good one.
We already use gearmand to get data from naemon to the database, Esper and a couple of other things, so it wouldn't be difficult to add this availability calculation on the fly, with a new queue and a new worker that keeps tabs on a monthly basis.
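
Roughly, such a worker only needs a running monthly tally per host/service; a minimal, queue-agnostic sketch where handle_state_change() would be fed by whatever consumes the gearman queue:

```python
# Sketch of an incremental availability worker: it keeps a running monthly
# downtime tally per host/service and only does work when a state changes.
import time
from collections import defaultdict

monthly_downtime = defaultdict(float)   # (host, service) -> seconds not OK this month
problem_since = {}                      # (host, service) -> when the current problem began

def handle_state_change(host, service, state, timestamp):
    """Called for every state change event the worker pulls off the queue."""
    key = (host, service)
    if state != 0:
        problem_since.setdefault(key, timestamp)                       # problem begins
    elif key in problem_since:
        monthly_downtime[key] += timestamp - problem_since.pop(key)    # problem ends

def availability(key, month_seconds=30 * 86400):
    """Rough percentage of the month the host/service has been OK so far."""
    down = monthly_downtime[key]
    if key in problem_since:                                           # still broken
        down += time.time() - problem_since[key]
    return 100.0 * (1.0 - down / month_seconds)

if __name__ == "__main__":
    handle_state_change("web01", "HTTP", 2, time.time() - 600)   # went critical 10 min ago
    handle_state_change("web01", "HTTP", 0, time.time())         # just recovered
    print(f"{availability(('web01', 'HTTP')):.3f}% available this month")
```

The dashboard's top-10 list then becomes a simple sort of monthly_downtime instead of a heavy history query.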

[]s.