Enhancement - Calculate SLA for services and hosts
Closed this issue · 14 comments
Calculate SLA and take into account downtimes.
Calculate MTTR and acknowledgement delay during outages.
What would that look like? I think it's a great idea, but not sure how to calculate that...
Apologies for closing this request, I clicked the wrong button! I have Reopened it :)
These would be (very) nice to have, I want to have a look at "Nagios Operations Center" and then incorporate some of their ninja skillz:
http://splunk-base.splunk.com/apps/52020/nagios-operations-center
L.
We could experiment with MK Livestatus to retrieve the SLA data:
http://mathias-kettner.de/checkmk_livestatus.html
It becomes very difficult to do SLA calcs from the standard nagios log data, as Splunk doesn't keep state natively whereas nagios does out of the box...
MK Livestatus could be the way around that :)
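For reference, talking to Livestatus is just plain-text LQL over its Unix socket. A minimal sketch in Python, assuming a socket path like `/var/lib/nagios/rw/live` (the path and columns here are examples, not anything this app ships):

```python
import socket

def build_lql(table, columns):
    """Build a simple Livestatus (LQL) query string."""
    return "GET %s\nColumns: %s\nResponseHeader: fixed16\n\n" % (
        table, " ".join(columns))

def query_livestatus(sock_path, lql):
    """Send an LQL query to the Livestatus Unix socket; return the raw reply."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)        # e.g. /var/lib/nagios/rw/live (assumed path)
    s.sendall(lql.encode())
    s.shutdown(socket.SHUT_WR)  # tell Livestatus the query is complete
    data = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
    s.close()
    return data.decode()

# Usage (against a live socket):
# print(query_livestatus("/var/lib/nagios/rw/live",
#                        build_lql("services", ["host_name", "state"])))
```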
L.
Yeah, love the mk_livestatus integration actually, really awesome. I wonder if it's worth making mk_livestatus an input (maybe modular?). If not an input, I bet we could run a search to populate the lookups for pulldowns with mk_livestatus!
Hmm... The data to do the SLA calculations is all in the logs. Thruk only uses the nagios log data to calculate the SLA.
For downtimes, I have to check in the log, but it is my understanding that this data is also present.
See how Thruk handles the SLA calculations in their code. (It is Perl, but you will get the idea and the algorithm; yay for open-source!) EDIT: They use Livestatus to get the data from the Livestatus logstore and do the reporting from that.
You should NOT poll data from Livestatus other than for showing current state.
Of course, polling Livestatus for SLA data directly would be convenient, but it is totally not meant for that. It would kill performance unless the SLAs were pre-calculated in the background for pre-determined time frames, e.g. 1 day, 1 week, 1 month. But this would be a gross solution and would not permit any fancy statistical munging, which is the bread and butter of Splunk.
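To illustrate the log-based approach: availability over a window is the measured period minus outage time, where both exclude scheduled downtime. A rough sketch over pre-parsed state-change events (the event tuples are a simplification for illustration, not the raw nagios.log format):

```python
def sla_percent(events, downtimes, window_start, window_end):
    """events: time-sorted (timestamp, state) pairs, state 'UP' or 'DOWN'.
    downtimes: (start, end) scheduled-downtime intervals.
    Returns availability %% over the window, excluding downtime overlap."""
    def downtime_overlap(a_start, a_end):
        total = 0
        for d_start, d_end in downtimes:
            lo, hi = max(a_start, d_start), min(a_end, d_end)
            if hi > lo:
                total += hi - lo
        return total

    down, state, prev = 0, 'UP', window_start
    for ts, new_state in events:
        ts = min(max(ts, window_start), window_end)  # clamp to window
        if state == 'DOWN':
            down += (ts - prev) - downtime_overlap(prev, ts)
        prev, state = ts, new_state
    if state == 'DOWN':  # still down at the end of the window
        down += (window_end - prev) - downtime_overlap(prev, window_end)

    # the measured period also excludes scheduled downtime
    measured = (window_end - window_start) - downtime_overlap(window_start, window_end)
    return 100.0 * (measured - down) / measured if measured else 100.0
```

A 20-second outage in a 100-second window gives 80% availability; the same outage fully covered by a scheduled downtime gives 100%.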
I have created a broker module for Shinken which exports SERVICE and HOST logs (same format as nagios.log) to a raw TCP socket. TCP port 9514 by default. The universal forwarder would listen on this socket to process the data.
The SERVICE and HOST data includes state changes and downtimes.
You can find it at github xkilian/shinken branch syslog under shinken/modules/rawsocket_broker.py
I have not tested it yet.
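For anyone wanting to poke at the data stream, a toy receiver for those raw log lines might look like the following. The ALERT field layout is assumed from the standard nagios.log format, and in a real deployment a Splunk TCP input on port 9514 would take this role:

```python
import re
import socketserver

# "[<epoch>] SERVICE ALERT: host;service;state;type;attempt;output"
# (layout assumed from the standard nagios.log format)
ALERT_RE = re.compile(r"^\[(\d+)\] (SERVICE|HOST) ALERT: (.+)$")

def parse_alert(line):
    """Parse a nagios-style SERVICE/HOST ALERT line; None if it isn't one."""
    m = ALERT_RE.match(line)
    if not m:
        return None
    ts, kind, rest = m.groups()
    return {"time": int(ts), "kind": kind, "fields": rest.split(";")}

class LogLineHandler(socketserver.StreamRequestHandler):
    """Receive raw log lines from the rawsocket broker and print parsed events."""
    def handle(self):
        for raw in self.rfile:
            event = parse_alert(raw.decode(errors="replace").rstrip("\n"))
            if event:
                print(event)  # a real consumer would index/forward here

# Wiring it up (port 9514 matches the broker module's default):
# socketserver.TCPServer(("0.0.0.0", 9514), LogLineHandler).serve_forever()
```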
Still not tested, I will try and get around to it this week, but I have a whole fleet of monkeys on my back… sigh
I have added a script to request a host's service SLA by accessing MK Livestatus...
Example usage:
index=nagios src_host="eping.big-data.com.au" name="time" | head 1 | eval daysago=5 | dedup src_host,name | liveservicesla | stats max(liveservicesla) AS liveservicesla | eval liveservicesla=liveservicesla*100
Commit:
254cc90
L.
New Commit:
d40a1ad
FYI: the 'daysago' variable in the "Example usage" (above) gets parsed by the script for the SLA time window calculation for MK Livestatus :)
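For illustration, turning a `daysago` value into a Livestatus query window might look something like this. The `log`-table filter columns here are assumptions for the sketch; the actual liveservicesla script is in the commits above:

```python
import time

def sla_window(daysago, now=None):
    """Convert a 'daysago' value into (start, end) epoch timestamps."""
    end = int(now if now is not None else time.time())
    return end - int(daysago) * 86400, end

def build_log_query(host, service, daysago, now=None):
    """Build a hypothetical LQL query against the Livestatus 'log' table
    for one host/service over the SLA window (column names assumed)."""
    start, end = sla_window(daysago, now)
    return ("GET log\n"
            "Filter: host_name = %s\n"
            "Filter: service_description = %s\n"
            "Filter: time >= %d\n"
            "Filter: time <= %d\n\n" % (host, service, start, end))
```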
Was looking at that today! Maybe make a macro with a couple of inputs so that we can pass form inputs to it?
Will do :)