debops/debops-playbooks

Proposed role: debops.monitoring

tobijb opened this issue · 19 comments

Provide out of the box monitoring framework. Perhaps using the following?

Server and Service Monitoring: Sensu
System Metrics: Collectd
App Metrics: Statsd
Metrics visualization: Grafana
Metrics Storage and Collection: Influxdb
Alerts and Notification routing: Sensu
Integrations: Pagerduty, Hubot...?

I wholeheartedly agree; however it seems to me that most of these are currently outside of Debian repositories. Sure, they probably have their own APT repositories which can be added to sources.list, however by doing that we end up with Ubuntu PPAs all over again.

I would be in favor of pushing the software to Debian proper, all the way through unstable, testing to jessie-backports if possible and needed. Mail the authors and ask them if they could do that, for example through Debian Mentors repository. Adding software to Debian gives us proper integration with the rest of the system and standardized code base. At the same time I would give priority to alternatives already in Debian if they are viable.

I can see where you're coming from. Perhaps this is where something like "debops-extras monitoring" starts to come to life?

Sure, main playbook might need to be split at some point in the future due to number of roles, however currently playbooks are very "rigid" and require all roles to be present in order to function. I'm waiting for Ansible v2 to decide what to do about it.

ypid commented

I have been doing Monitoring for a couple of years now and Check_MK with Icinga works well for me and my/our customers.
The packages in jessie are quite recent and should do the trick. I have not tested the check-mk-multisite package in Debian in detail yet, because I still install Check_MK from source.

I'm leaning towards Icinga instead of Nagios due to licensing issues. Check-MK can be used by LibreNMS, so that's a plus.

ganto commented

@ypid : If you are looking for a way to integrate Check_MK with DebOps, I already hacked together something at ansible-checkmk_agent. I'm using it as a custom role in DebOps with a manually set up Icinga server.

Unfortunately I still didn't have time to try the role with the "official" DebOps LibreNMS setup...

ypid commented

@ganto your role looks really nice, thanks for the hint.

ypid commented

TL;DR I compared LibreNMS with Nagios/Icinga/Check_MK/PNP4Nagios. The latter setup, which I have been using for some time now, appears more mature, is highly configurable and fits my use case better. I will keep using it 😉

I had some time today to check out LibreNMS, which is supported by DebOps. Note that I only looked into LibreNMS for one day, so I have no long-term experience with it yet.
First of all, LibreNMS fits nicely from a moral/license point of view and I am grateful to Paul Gear for creating it based on Observium.
As I come from a Nagios/Icinga/Check_MK/PNP4Nagios background, I compared it with what I know and like.

Advantages of LibreNMS over Nagios/Icinga/…

  • Auto detection of whole networks using already available information (would require some effort in the Nagios-world), this particularly applies for network devices.
  • Nice web interface (Check_MK Multisite and WATO achieve a similar goal in regards to functionality, but UI points go to LibreNMS)
  • API access: built into the LibreNMS core; Nagios/Icinga offer the Livestatus module (not intended for network access, and it does not allow changing the configuration in Check_MK or Nagios/Icinga)
  • IPMI support is built in (you have to do some manual work to get this working with Check_MK. I have not tested how well it works in LibreNMS)

About equal

  • Support for many SNMP OIDs/entries (both systems support a lot of SNMP MIBs, I have no exact numbers which system supports more)
  • Legacy Nagios/Icinga checks (Both systems can use them once they are configured for hosts/devices)
  • Reconfiguration (adding/changing hosts) requires a reload of the Nagios/Icinga core, while LibreNMS (lacking a "core") does not. I listed this point under equal because Icinga actually does useful things during a reload, e.g. evaluating the parent relationships of network devices/dependencies (e.g. if a switch in front of a non-redundantly connected server goes down, the server goes into unreachable and might not be alerted on). Does LibreNMS support this?
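
The parent relationship mentioned in the last point can be illustrated with a small sketch. It is a simplification that assumes a single parent per host and only looks at the immediate parent; the host names and the `classify` helper are hypothetical and not taken from Nagios/Icinga code.

```python
from typing import Dict, Optional, Set


def classify(host: str,
             parents: Dict[str, Optional[str]],
             failed_checks: Set[str]) -> str:
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for a host.

    A host whose check failed is only reported as DOWN if its
    immediate parent is still answering; otherwise the failure is
    attributed to the parent and the host is UNREACHABLE.
    """
    if host not in failed_checks:
        return "UP"
    parent = parents.get(host)
    if parent is not None and parent in failed_checks:
        return "UNREACHABLE"
    return "DOWN"


# Hypothetical topology: two servers behind one non-redundant switch.
parents = {"core-switch": None, "web1": "core-switch", "db1": "core-switch"}
# Checks that currently fail: the switch and one server behind it.
failed = {"core-switch", "web1"}

states = {name: classify(name, parents, failed) for name in parents}
```

With this input only `core-switch` would be a candidate for a notification; `web1` is marked unreachable because the device in front of it is down.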

Advantages of Nagios/Icinga/… over LibreNMS

  • LibreNMS is written in PHP and check scheduling is done by CRON. Nagios/Icinga1 are written in straight C. (Icinga2 in C++ but Check_MK does not support that)
  • Writing plugins/checks (I have written a few SNMP plugins for Check_MK and must say the internal Python API for this is really nice. The definitions/source code that LibreNMS uses, e.g. ports.inc.php, do not convince me)
  • Configuration of Nagios/Icinga is much more powerful than in LibreNMS. Things like number of retry attempts, check interval, notification periods and so on. Big plus for Nagios.
  • Check_MK usefully abstracts this configuration power and makes it rule based. Check_MK can generate any Nagios/Icinga configuration and thus retains all options offered by Nagios.
  • Again configuration: using the rule-based notifications from Check_MK you can pretty much do without non-free cloud services for alerting. Things like a calendar would still need work, though (in my position, I don't need that right now).
  • PNP4Nagios does a better job in handling performance graphing (also based on RRDtool). collectd is supported by both LibreNMS and Nagios/Icinga (but I have not yet tried it).
  • There is a module for Nagios/Icinga in Ansible for scheduling downtime and toggling alerts; not yet for LibreNMS?
  • Although the web interface of LibreNMS is nice, I think the overview page of a host/device in Check_MK is much better suited and more functional.

Summing-up

LibreNMS is a nice project which I think is very well suited for telcos (for which Adam Armstrong originally created Observium). It can surely also be used for server monitoring, but I think there are other/better tools out there for this.

The real advantage of LibreNMS is its autodetection of network devices, but I think that might not be the number one priority in DebOps, as you guys probably have a CMDB or also manage your network stuff with Ansible, so you don't really need autodetection. At least that is what I would do if I were responsible for the network as well 😉

Options for integrating Monitoring into DebOps

  1. Creating the roles to setup one/multiple monitoring systems using the appropriate roles/playbooks. We would need to write for example: debops.icinga, debops.check_mk, debops.pnp4nagios, debops.rrdcached (or try to use something else here. Would need to be evaluated) and maybe debops.nagvis
  2. Run debops.check_mk_agent against all hosts.
  3. Using a different inventory variable, apply the debops.check_mk role to all hosts. The role then configures the host in the monitoring system(s) as to be monitored and triggers Check_MK to reinventorize the new hosts to find checks to run against them, then reload the core with the updated config.
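
As a rough illustration of step 3, the sketch below renders a Check_MK host list from inventory data. It assumes the classic `conf.d` format, in which `.mk` files are Python snippets that extend `all_hosts` with `"hostname|tag|tag"` strings; the `render_all_hosts` helper, the host names and the tags are hypothetical. A role would typically write the result below /etc/check_mk/conf.d and then trigger the inventory and core reload (for example with `cmk -I` and `cmk -O`).

```python
import json
from typing import Dict, List


def render_all_hosts(hosts: Dict[str, List[str]]) -> str:
    """Render a conf.d snippet that registers hosts and their tags.

    `hosts` maps a host name to its list of tags. Entries are sorted
    so that repeated runs produce identical files.
    """
    lines = ["# Managed by Ansible, do not edit by hand.", "all_hosts += ["]
    for name in sorted(hosts):
        entry = "|".join([name] + hosts[name])
        # json.dumps yields a double-quoted literal that is also
        # valid Python for plain ASCII host names and tags.
        lines.append(f"    {json.dumps(entry)},")
    lines.append("]")
    return "\n".join(lines) + "\n"


# Hypothetical inventory data.
snippet = render_all_hosts({
    "web1.example.org": ["linux", "prod"],
    "db1.example.org": ["linux"],
})
print(snippet)
```

Keeping the generated hosts in a file of their own leaves room for other files in the same directory to be managed by hand or by WATO.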

What do you guys think?

Maybe there is time next year to work on that 😄

ganto commented

Hi everyone

As I currently have the task of setting up new monitoring servers with Icinga/Check_MK at my day job, I have been evaluating and experimenting a lot with Ansible and Check_MK recently. I have also been running two mostly manually managed Icinga/Check_MK installations (without WATO), one in a Debian and one in a RHEL-based environment, for several years. There are a few issues which should be considered when using Ansible:

  1. Software source: There are various packages available for the Check_MK server components:

    • check-mk-server from the official Debian repository (or from EPEL for the RPM-world). Individual components such as icinga, PNP4Nagios, livestatus or the WATO Web interface can/(have to) be installed and configured separately
    • check-mk-raw, the official packages from the upstream projects, they are always up-to-date and already bundle a big bunch of (wanted) dependencies and features.
    • omd (Open Monitoring Distribution): another bundle package which contains even more features and makes it easy to experiment with different monitoring cores such as Icinga vs. Nagios vs. Shinken and with various Web interfaces and add-ons.

    When choosing the first approach, you have to do a lot of configuration fiddling to end up with a result similar to what check-mk-raw or omd gives you. The latter two options give you a handy tool, also called omd, which allows you to easily instantiate new monitoring sites, test-migrate to new versions and so on. After being sceptical about losing control at the beginning, I am starting to like it, as it really includes some nice features that a "classical" setup cannot give you.

  2. Software Setup: The Icinga/Nagios/Check_MK ecosystem is huge and sometimes quite confusing because of the large number of options you have for a certain task. E.g. there are 4-5 different ways to do distributed monitoring and/or fail-over, there are 4 different ways to integrate PNP4Nagios and so on... I tried splitting it up into different roles, but the coupling ended up being so tight that I concluded it would make more sense to have only one role. On the other hand, this makes it impractical again for e.g. only setting up Icinga without Check_MK.

  3. Configuration Responsibility: The WATO Web interface is very great and allows you to completely manage the monitoring site via Web browser. It also supports some cumbersome HTTP/JSON APIs which could be handy for some remote queries or configuration updates. The entire configuration is written down to plain text files which can even be automatically managed by git, however, it opens up the question who is responsible for the configuration (Ansible/DebOps or Monitoring-Admins)? Or how are they split up between the parties? The Web interface is quite intuitive after a while and I feel it's very powerful for quickly achieving results compared to the sometimes irritating configuration file layout (e.g. see here).

After first trying the upstream packages and then experimenting with the OMD distribution, I finally ended up with check-mk-raw and opted for a semi-automated configuration management. Ansible would only set up the initial configuration and do the client management. After that, the monitoring users will be responsible for the fine tuning of their checks. The environment I'm doing this for contains about 700 hosts with currently ~55'000 checks (only for the Linux service).

I still need some weeks to clean up the rough edges in my Ansible code, but then it shouldn't be too difficult to adjust the role(s) for DebOps. I will let you know, once I have something to show.

Btw, here is another link with some OMD advertisement: Best Monitoring Solution 2015 - OMD (Open Monitoring Distribution)

@ypid and @ganto, thank you for excellent writeups! It seems that some monitoring solutions have the server-side covered pretty well, and it would be interesting to look into client integration in DebOps first, so that for example hosts managed by DebOps can be easily integrated into existing OMD installation. @ganto do you think that @ypid's check_mk_agent role https://github.com/debops-contrib/ansible-checkmk_agent can be used with OMD?

ganto commented

The ansible-checkmk-agent role is originally written by me ;-) I use it in the Debian environment (partly managed by DebOps) I mentioned above. It's still a bit rough, but usable.

If you then also have the Check_MK server part with WATO, this would simplify a lot, as you can download the matching agent release from WATO (instead of using the upstream deb) and also make better use of the large number of agent plugins that are all available via public URL from WATO (instead of downloading from the upstream git).

@ganto Oops, my bad, I give credit where it's due. :-) Good to hear that it's usable. Right now I don't have any installation to use the agent against, maybe in the future I'll check it out.

ypid commented

@ganto Nice that you are working on that! I did look into OMD but I thought it might be too much magic for using it right now and proposed to go with the software packaged by Debian instead. But what works best is easier to figure out when actually trying it, so I am looking forward to seeing your work.

About automating the setup of additional hosts in the monitoring system as well: I think it could make sense to split the sections of configuration in /etc/check_mk/conf.d between CM and dedicated admins. So you could let Ansible manage your servers, and your network admins can manage their devices via WATO.

ganto commented

@ypid : ya, I felt so too at the beginning. But after trying it for a few weeks now, I can only say positive things about (at least) the omd command which is also part of check-mk-raw.

With the OMD package, however, I'm not so happy. At least not for a production setup. The current stable release 1.30 contained checkmk-1.2.6p12, which had some show-stopper bugs for my setup. As it is an all-in-one package, you cannot easily update to a newer checkmk release without creating your own fork of the package. Also, I'm not yet confident about how long this will still be maintained, as the entire package with all possible alternatives is quite complex and OMD 2.x with Icinga 2 (without Check_MK) has been in testing for a while now. To be honest, I didn't fully figure out the differences between OMD 1.x and Check_MK RAW yet, except that I feel the latter includes less unnecessary cruft (for my use case), is more up-to-date and better documented.

About splitting the configuration between Ansible and WATO: WATO already stores its configurations in a wato subdirectory of /etc/check_mk/conf.d, even separated into individual files per topic. That's pretty. However, I didn't try yet if WATO is able to display (read-only) what was configured outside of its configuration directory and I think it won't be easy (but still possible with some permissions juggling) to prevent someone from adding server checks in WATO in case this would be Ansible's responsibility.

ypid commented

OK sounds good. I only tried the whole OMD package and remembered that its support in Debian stable (jessie) was beta.

About WATO, yes, it stores its config in the wato directory. If you mean displaying things like global settings, then yes. WATO will not show you hosts configured outside of the wato directory. We can surely manage to get something working here. It might even be possible to read the Python configuration variables back into Ansible and then template the changed configuration back, so that admins and Ansible could both manage all the hosts. I am already generating check_mk configuration files with a script: https://github.com/hamcos/monitoring-scripts/blob/master/check_mk/wato_cvs_import.py
but some better "API" would definitely come in handy …
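
Reading the Python configuration variables back could look roughly like the sketch below. It assumes that a `.mk` file is plain Python that extends a few well-known variables such as `all_hosts`; the `load_all_hosts` helper and the sample content are hypothetical. Real WATO-written files may reference additional names, which would need to be seeded in the namespace as well, and because `exec` runs whatever the file contains, this is only suitable for trusted configuration.

```python
from typing import Any, Dict, List, Tuple


def load_all_hosts(mk_source: str) -> List[str]:
    """Evaluate a .mk snippet and return the resulting all_hosts list."""
    # Pre-seed the variables the snippet is expected to extend
    # (an assumption; extend this as needed for real files).
    namespace: Dict[str, Any] = {
        "all_hosts": [],
        "ipaddresses": {},
        "host_attributes": {},
    }
    exec(mk_source, namespace)  # trusted input only
    return list(namespace["all_hosts"])


def split_entry(entry: str) -> Tuple[str, List[str]]:
    """Split 'hostname|tag|tag' into the host name and its tags."""
    name, *tags = entry.split("|")
    return name, tags


# Hypothetical file content.
sample = (
    "all_hosts += [\n"
    '    "web1.example.org|linux|prod",\n'
    '    "db1.example.org|linux",\n'
    "]\n"
)

hosts = load_all_hosts(sample)
parsed = dict(split_entry(entry) for entry in hosts)
```

The resulting dictionary could then be merged with inventory data and templated back into a configuration file.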

About the permission juggling, WATO usually does not like that and will show you/your users stack traces when trying to change the specific file with WATO 😉

ypid commented

@ganto Just curious, how is it going?

ganto commented

Thanks for reminding me. Actually, I have a quite sophisticated role for Red Hat done. It's a bit of a hack in many places, so I don't dare to release it. However, I'm cleaning it up now and adjusting it for DebOps. You can find my progress at debops-contrib/ansible-checkmk_server.

I'll definitely still need some help later on. Hang on...

ypid commented

Thanks for the update 👍

ypid commented

@ganto and others, how do you see CheckMK in 2019? Still using it? I am still running it and find it unmatched for network and infrastructure monitoring.

I also checked out Prometheus which I find pretty strong in cloud and application monitoring but I will not run it for now because it does not fit into my environment.