# Stripe Datadog checks
This is a collection of plugins — checks in Datadog parlance — for the Datadog agent that Stripe has found useful.
## Motivation
We've sent a lot of patches to Datadog and we regularly work closely with them on our ideas. But sometimes we want something that isn't a fit for the mainline Datadog agent. To that end, we've created this repository to hold work that is either in flight or that we've decided isn't a fit for inclusion in the core agent set. We hope you find it useful!
## Using The Checks
Place the `.py` file you want to use into the checks directory — `/etc/dd-agent/checks.d` by default — and the YAML config file in the config directory — `/etc/dd-agent/conf.d` by default — and you should be ready to go! Restart the agent and run `/etc/init.d/datadog-agent info` to verify that the plugin is working.

Each plugin here is provided with a sample config file containing some documentation.
## Checks
Here's our list of checks!
### File
Uses Python's `glob.glob` to look for at least one file matching the provided `path`. You can control the success or failure of this check via `expect`, using one of `present` or `absent`. For example, if you use `expect: present` and the file does not exist, this check will fail. If you use `expect: absent` and the file is absent, it will emit OK!

The service check and any emitted metrics are tagged with the `path`, `expected_status` and `actual_status`. Its check message will be `"File %s that was expected to be %s is %s instead" % (path, expect, status)`.

If this check does find a path that matches, it will also emit a gauge, `file.age_seconds`, containing the age in seconds of the oldest file matching the path.
```yaml
---
init_config:

instances:
  # Puppet locks (these might turn stale):
  - path: '/etc/stripe/facts/puppet_locked.txt'
    expect: absent
  # Package upgrades requiring reboots
  - path: '/var/run/stripe/restart-required/*'
    expect: absent
```
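As a rough sketch of the age computation, here's one way to find the oldest matching file (the helper name is ours, and the real check may measure age differently):

```python
import glob
import os
import time

def oldest_file_age_seconds(path):
    """Age in seconds of the oldest file matching the glob, or None."""
    matches = glob.glob(path)
    if not matches:
        return None
    # The oldest file is the one with the smallest modification time.
    return time.time() - min(os.path.getmtime(p) for p in matches)
```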
### Jenkins Metrics
Fetches metrics from Jenkins' Metrics Plugin (which you must install separately). It fetches all the metrics under `vm.*` and emits them as gauges, except `vm.gc.*.count` and `vm.gc.*.time`, which are emitted as `monotonic_count`.
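The routing of metric names to types can be pictured like this (a minimal sketch assuming a glob-style match; `metric_type` is an illustrative helper, not the check's actual code):

```python
from fnmatch import fnmatch

def metric_type(name):
    # GC counts and times only ever grow, so they are better modeled
    # as monotonic counts than as point-in-time gauges.
    if fnmatch(name, 'vm.gc.*.count') or fnmatch(name, 'vm.gc.*.time'):
        return 'monotonic_count'
    return 'gauge'
```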
### Linux VM Extras
Fetches the following metrics by polling Linux's `/proc/vmstat`:

- `system.linux.vm`
  - `pgpgin` as `pages.in`
  - `pgpgout` as `pages.out`
  - `pswpin` as `pages.swapped_in`
  - `pswpout` as `pages.swapped_out`
  - `pgfault` as `pages.faults`
  - `pgmajfault` as `pages.major_faults`
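`/proc/vmstat` is a simple space-separated key/value file, so the collection boils down to something like this sketch (the mapping mirrors the list above; the function is illustrative, not the check's actual code):

```python
FIELDS = {
    'pgpgin': 'pages.in',
    'pgpgout': 'pages.out',
    'pswpin': 'pages.swapped_in',
    'pswpout': 'pages.swapped_out',
    'pgfault': 'pages.faults',
    'pgmajfault': 'pages.major_faults',
}

def read_vmstat():
    """Return the interesting /proc/vmstat counters, renamed for Datadog."""
    values = {}
    with open('/proc/vmstat') as f:
        for line in f:
            name, _, value = line.partition(' ')
            if name in FIELDS:
                values['system.linux.vm.' + FIELDS[name]] = int(value)
    return values
```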
### NSQ
Fetches the following metrics by polling NSQ's `/stats` endpoint:

- `nsq.topic_count`
- `nsq.topic.channel_count`
- `nsq.topic` (all tagged with `topic_name`):
  - `depth`
  - `backend_depth`
  - `message_count` (count, not gauge)
- `nsq.topic.channel` (all tagged with `topic_name` and `channel_name`):
  - `depth`
  - `backend_depth`
  - `in_flight_count`
  - `deferred_count`
  - `message_count` (count, not gauge)
  - `requeue_count`
  - `timeout_count`
  - `e2e_processing_latency.p50` (nanoseconds)
  - `e2e_processing_latency.p95` (nanoseconds)
  - `e2e_processing_latency.p99` (nanoseconds)
  - `e2e_processing_latency.p999` (nanoseconds)
  - `e2e_processing_latency.p9999` (nanoseconds)
- `nsq.topic.channel.client` (all tagged with `topic_name`, `channel_name`, `client_id`, `client_version`, `tls`, `user_agent`, `deflate` and `snappy`):
  - `ready_count`
  - `in_flight_count`
  - `message_count` (count, not gauge)
  - `finish_count`
  - `requeue_count`
If you have extra tags you would like to parse from any of your topic names, you can include `topic_name_regex` as a Python regex in your `init_config`. The regex will be applied to each topic name and, if there is a match, the name of each symbolic group and the value it captured will be included as a tag key/value pair.
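For example (a hedged sketch; the regex and topic name below are made up to show the mechanics of symbolic groups):

```python
import re

# Hypothetical convention: topics named "<service>-<env>".
topic_name_regex = re.compile(r'^(?P<service>[a-z_]+)-(?P<env>[a-z]+)$')

match = topic_name_regex.match('payments-prod')
if match:
    tags = ['%s:%s' % (k, v) for k, v in match.groupdict().items()]
    # tags -> ['service:payments', 'env:prod']
```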
### Nagios Runner
The Nagios Runner check takes a list of check "instances". The instances are each executed and, per the Nagios Plugin API, the return value is inspected and a service check is submitted using the provided name.

Note: The checks supplied are executed sequentially. You may run into performance issues if you attempt to run too many checks, or checks that execute very slowly. This will effectively block the agent and cause all sorts of hiccups!
```yaml
init_config:
  # Not needed

instances:
  - name: "stripe.check.is_llama_on_rocket"
    command: "/usr/lib/nagios/plugins/check_if_llama_is --on rocket"
  - name: "stripe.check.falafel_length"
    command: "/usr/lib/nagios/plugins/check_falafel -l 1234"
```
### OpenVPN
The OpenVPN check counts the number of active VPN connections per user. Combined with a Datadog monitor, it ensures that the same user isn't logged in too many times (e.g., multiple sustained VPN connections for the same user can be indicative of a laptop compromise).
Each VPN is accessible over both TCP and UDP, and is available to both privileged (Stripe employees) and unprivileged users (vendors). The unique combination of these is considered a VPN "level", and OpenVPN emits a status file every 10 seconds for each level to indicate the currently-active connections. When a user disconnects (e.g. if their Internet connection drops out) or if their IP address/port changes, they may appear in the status file multiple times. This is fine, as long as the number of connections per user drops down to 1 within a minute or so.
The status file also contains useful information such as the IP (which can be used for geolookups), the connection duration (which can be used to ensure that the VPN is online and that it isn't cycling users), and the number of bytes sent/received (which could be used to detect erratic behavior).
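As a hedged sketch of what counting connections per user means in practice, here's one way to tally the client list in a version-1 status file (the path and parsing details are assumptions, and the real check may read the file differently):

```python
from collections import Counter

def connections_per_user(status_path='/etc/openvpn/openvpn-status.log'):
    """Count entries per common name in the status file's client list."""
    counts = Counter()
    in_client_list = False
    with open(status_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('Common Name,'):
                in_client_list = True   # header row of the client list
                continue
            if line.startswith('ROUTING TABLE'):
                break                   # end of the client list section
            if in_client_list and ',' in line:
                counts[line.split(',')[0]] += 1
    return counts
```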
### Out of memory killer (OOM)
This check emits a failure when any process has been killed by the OOM killer since the system last started up. It continues to emit criticals until the log file is removed or the system is restarted (provided the log file contains uptimes with which to detect a reboot).

It reads the configured `logfile` as syslog kernel output, looking for lines matching the `kernel_line_regex` property. The regular expression should provide named capture groups for `message` and, optionally, `uptime`. The `uptime` capture group is how it detects system reboots; it will stop looking for OOM instances when it detects a reboot. The second configurable regular expression, `kill_message_regex`, extracts information from the `message` data itself, which is included in the service check message (not as tags, as this would pose problems with alert recovery).

An example configuration for a base Ubuntu system looks like this:
```yaml
---
init_config:

instances:
  - logfile: '/var/log/kern.log'
    kernel_line_regex: '^(?P<timestamp>.+?) (?P<host>\S+) kernel: \[\s*(?P<uptime>\d+(?:\.\d+)?)\] (?P<message>.*)$'
    kill_message_regex: '^Out of memory: Kill process (?P<pid>\d+) \((?P<pname>.*?)\) score (?P<score>.*?) or sacrifice child'
```
This file is included in `conf.d/oom.yaml`.
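To see how the two regexes compose, here's a sketch applying them to a typical Ubuntu kernel OOM line (the log line shown is illustrative):

```python
import re

kernel_line_regex = re.compile(
    r'^(?P<timestamp>.+?) (?P<host>\S+) kernel: '
    r'\[\s*(?P<uptime>\d+(?:\.\d+)?)\] (?P<message>.*)$')
kill_message_regex = re.compile(
    r'^Out of memory: Kill process (?P<pid>\d+) \((?P<pname>.*?)\) '
    r'score (?P<score>.*?) or sacrifice child')

line = ('Jan  1 00:00:00 myhost kernel: [ 1234.567890] '
        'Out of memory: Kill process 4242 (beefyprocess) '
        'score 901 or sacrifice child')

kernel_match = kernel_line_regex.match(line)
if kernel_match:
    # `uptime` is what lets the check notice reboots; `message` feeds
    # the second regex.
    kill_match = kill_message_regex.match(kernel_match.group('message'))
    if kill_match:
        print(kill_match.groupdict())
        # {'pid': '4242', 'pname': 'beefyprocess', 'score': '901'}
```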
Two error cases also emit service checks:
- If the log file is not present, a warning is emitted; this is not inherently a problem but could indicate misconfiguration
- If a permission error prevents dd-agent from reading the file, a critical is emitted; this is a definite failure and needs correcting
### Outdated Packages
This check verifies that the given packages are not outdated (currently, only on Ubuntu). You can specify a set of package names and versions (split out by release), and this check will report critical if the current version of that package is older than the specified version. For example:
```yaml
init_config:
  # Not needed

instances:
  - package: bash
    version:
      precise: "4.2-2ubuntu2.6"
      trusty: "4.3-7ubuntu1.5"
  - package: openssl
    version:
      precise: "1.0.1-4ubuntu5.31"
      trusty: "1.0.1f-1ubuntu2.15"
```
### Resque
Inspects the Redis storage for a Resque instance and outputs some metrics:

- `resque.jobs.failed_total` - number of jobs failed (monotonic_count)
- `resque.jobs.processed_total` - number of jobs processed (monotonic_count)
- `resque.queues_count` - number of queues (gauge)
- `resque.worker_count` - number of workers (gauge)
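For orientation, these numbers live in well-known keys under Resque's default `resque:` namespace; a sketch with the `redis` Python client (the connection details and namespace are assumptions about your deployment):

```python
import redis

r = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)

failed_total = int(r.get('resque:stat:failed') or 0)
processed_total = int(r.get('resque:stat:processed') or 0)
queues_count = r.scard('resque:queues')    # queue names are a Redis set
worker_count = r.scard('resque:workers')   # so are worker IDs
```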
### Slapd (OpenLDAP's Stand-alone LDAP Daemon)
This check queries and surfaces statistics from the `monitor` backend of a running `slapd` instance. It will emit the following metrics:

- `slapd.connect_time` - time taken to connect to the server (histogram)
- `slapd.connections.total` - total number of connections (monotonic_count)
- `slapd.connections.current` - current number of connections (gauge)
- `slapd.statistics.bytes_total` - total bytes sent (monotonic_count)
- `slapd.statistics.entries_total` - total entries sent (monotonic_count)
- `slapd.threads.active` - number of active threads (gauge)
- `slapd.threads.open` - number of open threads (gauge)
- `slapd.threads.pending` - number of pending threads (gauge)
- `slapd.threads.starting` - number of threads being started (gauge)
- `slapd.waiters.read` - the number of clients waiting to read (gauge)
- `slapd.waiters.write` - the number of clients waiting to write (gauge)

In addition, the check will emit a service check (`slapd.can_connect`) that indicates whether it was able to successfully connect to the LDAP server.
#### Slapd Configuration
To enable the `monitor` backend, you can add the following lines to `slapd.conf`:

```
moduleload back_monitor
database monitor
access to dn="cn=monitor"
    by peername=127.0.0.1 read
    by * none
```
This allows only clients on the local machine to access the backend, since it may contain potentially-sensitive information.
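The statistics themselves are ordinary LDAP entries under `cn=Monitor`. A hedged sketch of reading one of them with python-ldap (the DN shown is OpenLDAP's conventional location for the total-connections counter; verify it against your own server):

```python
import ldap

conn = ldap.initialize('ldap://127.0.0.1')
conn.simple_bind_s()  # anonymous bind; the ACL above permits localhost

results = conn.search_s(
    'cn=Total,cn=Connections,cn=Monitor',
    ldap.SCOPE_BASE,
    attrlist=['monitorCounter'])
# Attribute values come back as lists of bytes.
total_connections = int(results[0][1]['monitorCounter'][0].decode())
```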
### Storm REST API
This check comes in two parts: one is a cronjob-able script in `scripts/cache-storm-data` (intended to run every minute, or whatever interval doesn't overload your nimbus), and the other is a check that reads the generated JSON file and emits metrics.

For the check, we recommend running it at an interval 2x faster than the cache-storm-data cron job runs (using the `min_collection_interval: <Nsec>` config parameter in `init_config`).

You can configure the topologies considered for emission using the `topologies` regex, and the check will group all the matched metrics (picking the youngest `ACTIVE` metric for each that has name collisions).

The caching process can be very time-consuming, since Storm's executor and per-topology stats take a really long time to generate. It's best to run the cache script a few times across the lifetime of your storm topologies to get a feel for how long it takes and how resource-intensive the metrics-gathering can be.

The `storm_rest_api.yaml` config file is used by both the cache script and the check.
### Splunk
Collects metrics from a Splunk master about the status of a Splunk cluster. It assumes you are using Search Head Clustering and queries the SHC captain for search information.
It emits these service checks:

- `splunk.can_connect` when things break during fetching status
- `splunk.index.is_healthy` for "unhealthy" indices, tagged by `index_name`. See the message for more details.
- `splunk.peer.is_healthy` for "unhealthy" nodes, tagged by `peer_name`. See the message for more details.
It emits these metrics:

- `splunk.fixups`
  - `jobs_present` tagged by `index_name` and `fixup_level`
- `splunk.indexes` tagged by `index_name`
  - `replication` tagged by `index_copy`, for each "copy"
    - `actual_copies` - Number of copies that actually exist.
    - `expected_copies` - Number of copies that should exist.
  - `search` tagged by `index_copy`, for each "copy"
    - `actual_copies` - Number of copies that actually exist.
    - `expected_copies` - Number of copies that should exist.
  - `size_bytes` - The total size in bytes.
  - `total_excess_bucket_copies` - The total number of excess copies for all buckets.
  - `total_excess_searchable_copies` - The total number of excess searchable copies for all buckets.
- `splunk.peers` tagged by `peer_name` and `site`
  - `bucket_count` - The number of buckets on this peer, tagged additionally by `index`.
  - `bucket_status` - The number of buckets in a given status on this peer, tagged additionally by `bucket_status`.
  - `delayed_buckets_to_discard` - The number of buckets waiting to be discarded on this peer.
  - `peers_present` - The number of peers available (as a gauge), tagged additionally by `status`.
  - `primary_count` - The number of buckets for which the peer is primary in its local site, or the number of buckets that return search results from the same site as the peer.
  - `primary_count_remote` - The number of buckets for which the peer is primary that are not in its local site.
  - `replication_count` - The number of replications this peer is part of, as either source or target.
- `splunk.search_cluster`
  - `captains` - The count of captains, tagged by `site`. This can be used to ensure there is a captain and to detect splitbrain.
  - `member_statuses` - The number of members in the search cluster, tagged by `status` and `site`.
- `splunk.searches`
  - `in_progress` - In-progress search gauge, tagged by `is_saved` and `search_owner`.
You can configure it thusly:
```yaml
---
init_config:
  default_timeout: 30

instances:
  - url: https://localhost:8089
    username: obsrobot
    password: foobar
```
### SubDir Sizes
The SubDir Sizes check is a sister to Datadog's `directory` integration. Our needs required enough differences that making a new integration seemed the easier path and made for a less complex configuration. It takes a directory and emits a total size (in bytes) and a count of files therein for each subdirectory it finds. It can also use a regular expression to dynamically create tags for each subdirectory.

This integration is useful for getting tag-friendly metrics for backup directories and for things like Kafka that store data in subdirectories.
Here's the config we use for Kafka:
```yaml
init_config:

instances:
  - directory: "/pay/kafka/data"
    dirtagname: "name"
    subdirtagname: "topic"
    subdirtagname_regex: "(?P<topic>.*)-(?P<partition>\\d+)"
```
Note: The regular expression provided to `subdirtagname_regex` should use named groups, such that calling `groupdict()` on the resulting match provides name-value pairs for use as tags!
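For instance, with the Kafka config above, a subdirectory named `mytopic-3` would yield tags like this (a hedged sketch of the mechanics):

```python
import re

subdirtagname_regex = re.compile(r'(?P<topic>.*)-(?P<partition>\d+)')

match = subdirtagname_regex.match('mytopic-3')
tags = ['%s:%s' % (k, v) for k, v in match.groupdict().items()]
# tags -> ['topic:mytopic', 'partition:3']
```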
And here are the metrics, each of which will be tagged with `$dirtagname:$DIRECTORY`, `$subdirtagname:basename(subdir)`, and whatever tags come from `subdirtagname_regex`:

- `system.sub_dir.bytes`
- `system.sub_dir.files`