/go-carbon-old

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

Primary LanguageGoMIT LicenseMIT

Table of Contents

go-carbon Build Status

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

Architecture

Features

  • Receive metrics from TCP and UDP (plaintext protocol)
  • Receive metrics with Pickle protocol (TCP only)
  • Receive metrics from HTTP
  • Receive metrics from Apache Kafka
  • storage-schemas.conf
  • storage-aggregation.conf
  • Carbonlink (requests to cache from graphite-web)
  • Carbonlink-like GRPC api
  • Logging with rotation support (reopen log if it moves)
  • Many persister workers (using many cpu cores)
  • Run as daemon
  • Optional dump/restore restart on USR2 signal (config dump section): stop persister, start write new data to file, dump cache to file, stop all (and restore from files after next start)
  • Reload some config options without restart (HUP signal):
    • whisper section of main config, storage-schemas.conf and storage-aggregation.conf
    • graph-prefix, metric-interval, metric-endpoint, max-cpu from common section
    • dump section

Performance

Faster than default carbon. In all conditions :) How much faster depends on server hardware, storage-schemas, etc.

The result of replacing "carbon" to "go-carbon" on a server with a load up to 900 thousand metric per minute:

Success story

There were some efforts to find out maximum possible performance of go-carbon on a hardware (2xE5-2620v3, 128GB RAM, local SSDs).

The result of that effort (in points per second):

Performance

Stable performance was around 950k points per second with short-term peak performance of 1.2M points/sec.

Installation

Use binary packages from releases page or build manually (requires golang 1.8+):

# build binary
git clone https://github.com/lomik/go-carbon.git
cd go-carbon
make

Configuration

$ go-carbon --help
Usage of go-carbon:
  -check-config=false: Check config and exit
  -config="": Filename of config
  -config-print-default=false: Print default config
  -daemon=false: Run in background
  -pidfile="": Pidfile path (only for daemon)
  -version=false: Print version
[common]
# Run as user. Works only in daemon mode
user = "carbon"
# Prefix for store all internal go-carbon graphs. Supported macroses: {host}
graph-prefix = "carbon.agents.{host}"
# Endpoint for store internal carbon metrics. Valid values: "" or "local", "tcp://host:port", "udp://host:port"
metric-endpoint = "local"
# Interval of storing internal metrics. Like CARBON_METRIC_INTERVAL
metric-interval = "1m0s"
# Increase for configuration with multi persister workers
max-cpu = 4

[whisper]
data-dir = "/var/lib/graphite/whisper"
# http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf. Required
schemas-file = "/etc/go-carbon/storage-schemas.conf"
# http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-aggregation-conf. Optional
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
# Worker threads count. Metrics sharded by "crc32(metricName) % workers"
workers = 8
# Limits the number of whisper update_many() calls per second. 0 - no limit
max-updates-per-second = 0
# Softly limits the number of whisper files that get created each second. 0 - no limit
max-creates-per-second = 0
# Make max-creates-per-second a hard limit. Extra new metrics are dropped. A hard throttle of 0 drops all new metrics.
hard-max-creates-per-second = false
# Sparse file creation
sparse-create = false
# use flock on every file call (ensures consistency if there are concurrent read/writes to the same file)
flock = true
enabled = true
# Use hashed filenames for tagged metrics instead of human readable
# https://github.com/lomik/go-carbon/pull/225
hash-filenames = true
# specify to enable/disable compressed format. IMPORTANT: Only one process/thread could write to compressed whisper files at a time, especially when you are rebalancing graphite clusters (with buckytools, for example), flock needs to be enabled both in go-carbon and your tooling.
compressed = false

[cache]
# Limit of in-memory stored points (not metrics)
max-size = 1000000
# Capacity of queue between receivers and cache
# Strategy to persist metrics. Values: "max","sorted","noop"
#   "max" - write metrics with most unwritten datapoints first
#   "sorted" - sort by timestamp of first unwritten datapoint.
#   "noop" - pick metrics to write in unspecified order,
#            requires least CPU and improves cache responsiveness
write-strategy = "max"

[udp]
listen = ":2003"
enabled = true
# Optional internal queue between receiver and cache
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
# Optional internal queue between receiver and cache
buffer-size = 0

[pickle]
listen = ":2004"
# Limit message size for prevent memory overflow
max-message-size = 67108864
enabled = true
# Optional internal queue between receiver and cache
buffer-size = 0

# You can define unlimited count of additional receivers
# Common definition scheme:
# [receiver.<any receiver name>]
# protocol = "<any supported protocol>"
# <protocol specific options>
#
# All available protocols:
#
# [receiver.udp2]
# protocol = "udp"
# listen = ":2003"
# # Enable optional logging of incomplete messages (chunked by max UDP packet size)
# log-incomplete = false
#
# [receiver.tcp2]
# protocol = "tcp"
# listen = ":2003"
#
# [receiver.pickle2]
# protocol = "pickle"
# listen = ":2004"
# # Limit message size for prevent memory overflow
# max-message-size = 67108864
#
# [receiver.protobuf]
# protocol = "protobuf"
# # Same framing protocol as pickle, but message encoded in protobuf format
# # See https://github.com/lomik/go-carbon/blob/master/helper/carbonpb/carbon.proto
# listen = ":2005"
# # Limit message size for prevent memory overflow
# max-message-size = 67108864
#
# [receiver.http]
# protocol = "http"
# # This receiver receives data from POST requests body.
# # Data can be encoded in plain text format (default),
# # protobuf (with Content-Type: application/protobuf header) or
# # pickle (with Content-Type: application/python-pickle header).
# listen = ":2007"
# max-message-size = 67108864
#
# [receiver.kafka]
# protocol = "kafka
# # This receiver receives data from kafka
# # You can use Partitions and Topics to do sharding
# # State is saved in local file to avoid problems with multiple consumers
#
# # Encoding of messages
# # Available options: "plain" (default), "protobuf", "pickle"
# #   Please note that for "plain" you must pass metrics with leading "\n".
# #   e.x.
# #    echo "test.metric $(date +%s) $(date +%s)" | kafkacat -D $'\0' -z snappy -T -b localhost:9092 -t graphite
# parse-protocol = "protobuf"
# # Kafka connection parameters
# brokers = [ "host1:9092", "host2:9092" ]
# topic = "graphite"
# partition = 0
#
# # Specify how often receiver will try to connect to kafka in case of network problems
# reconnect-interval = "5m"
# # How often receiver will ask Kafka for new data (in case there was no messages available to read)
# fetch-interval = "200ms"
#
# # Path to saved kafka state. Used for restarts
# state-file = "/var/lib/graphite/kafka.state"
# # Initial offset, if there is no saved state. Can be relative time or "newest" or "oldest".
# # In case offset is unavailable (in future, etc) fallback is "oldest"
# initial-offset = "-30m"
#
# # Specify kafka feature level (default: 0.11.0.0).
# # Please note that some features (consuming lz4 compressed streams) requires kafka >0.11
# # You must specify version in full. E.x. '0.11.0.0' - ok, but '0.11' is not.
# # Supported version (as of 22 Jan 2018):
# #   0.8.2.0
# #   0.8.2.1
# #   0.8.2.2
# #   0.9.0.0
# #   0.9.0.1
# #   0.10.0.0
# #   0.10.0.1
# #   0.10.1.0
# #   0.10.2.0
# #   0.11.0.0
# #   1.0.0
# kafka-version = "0.11.0.0"
#
# [receiver.pubsub]
# # This receiver receives data from Google PubSub
# # - Authentication is managed through APPLICATION_DEFAULT_CREDENTIALS:
# #   - https://cloud.google.com/docs/authentication/production#providing_credentials_to_your_application
# # - Currently the subscription must exist before running go-carbon.
# # - The "receiver_*" settings are optional and directly map to the google pubsub
# #   libraries ReceiveSettings (https://godoc.org/cloud.google.com/go/pubsub#ReceiveSettings)
# #   - How to think about the "receiver_*" settings: In an attempt to maximize throughput the
# #     pubsub library will spawn 'receiver_go_routines' to fetch messages from the server.
# #     These goroutines simply buffer them into memory until 'receiver_max_messages' or 'receiver_max_bytes'
# #     have been read. This does not affect the actual handling of these messages which are processed by other goroutines.
# protocol = "pubsub"
# project = "project-name"
# subscription = "subscription-name"
# receiver_go_routines = 4
# receiver_max_messages = 1000
# receiver_max_bytes = 500000000 # default 500MB

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
# Close inactive connections after "read-timeout"
read-timeout = "30s"

# grpc api
# protocol: https://github.com/lomik/go-carbon/blob/master/helper/carbonpb/carbon.proto
# samples: https://github.com/lomik/go-carbon/tree/master/api/sample
[grpc]
listen = "127.0.0.1:7003"
enabled = true

# http://graphite.readthedocs.io/en/latest/tags.html
[tags]
enabled = false
# TagDB url. It should support /tags/tagMultiSeries endpoint
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
# Directory for send queue (based on leveldb)
local-dir = "/var/lib/graphite/tagging/"
# POST timeout
tagdb-timeout = "1s"

[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = "127.0.0.1:8080"
# Carbonserver support is still experimental and may contain bugs
# Or be incompatible with github.com/grobian/carbonserver
enabled = false
# Buckets to track response times
buckets = 10
# carbonserver-specific metrics will be sent as counters
# For compatibility with grobian/carbonserver
metrics-as-counters = false
# Read and Write timeouts for HTTP server
read-timeout = "60s"
write-timeout = "60s"
# Enable /render cache, it will cache the result for 1 minute
query-cache-enabled = true
# 0 for unlimited
query-cache-size-mb = 0
# Enable /metrics/find cache, it will cache the result for 5 minutes
find-cache-enabled = true
# Control trigram index
#  This index is used to speed-up /find requests
#  However, it will lead to increased memory consumption
#  Estimated memory consumption is approx. 500 bytes per each metric on disk
#  Another drawback is that it will recreate index every scan-frequency interval
#  All new/deleted metrics will still be searchable until index is recreated
trigram-index = true
# carbonserver keeps track of all available whisper files
# in memory. This determines how often it will check FS
# for new or deleted metrics.
scan-frequency = "5m0s"
# Maximum amount of globs in a single metric in index
# This value is used to speed-up /find requests with
# a lot of globs, but will lead to increased memory consumption
max-globs = 100
# Fail if amount of globs more than max-globs
fail-on-max-globs = false
# graphite-web-10-mode
# Use Graphite-web 1.0 native structs for pickle response
# This mode will break compatibility with graphite-web 0.9.x
# If false, carbonserver won't send graphite-web 1.0 specific structs
# That might degrade performance of the cluster
# But will be compatible with both graphite-web 1.0 and 0.9.x
graphite-web-10-strict-mode = true
# Allows to keep track for "last time readed" between restarts, leave empty to disable
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]

[dump]
# Enable dump/restore function on USR2 signal
enabled = false
# Directory for store dump data. Should be writeable for carbon
path = "/var/lib/graphite/dump/"
# Restore speed. 0 - unlimited
restore-per-second = 0

[pprof]
listen = "localhost:7007"
enabled = false

# Default logger
[[logging]]
# logger name
# available loggers:
# * "" - default logger for all messages without configured special logger
# @TODO
logger = ""
# Log output: filename, "stderr", "stdout", "none", "" (same as "stderr")
file = "/var/log/go-carbon/go-carbon.log"
# Log level: "debug", "info", "warn", "error", "dpanic", "panic", and "fatal"
level = "info"
# Log format: "json", "console", "mixed"
encoding = "mixed"
# Log time format: "millis", "nanos", "epoch", "iso8601"
encoding-time = "iso8601"
# Log duration format: "seconds", "nanos", "string"
encoding-duration = "seconds"

# You can define multiply loggers:

# Copy errors to stderr for systemd
# [[logging]]
# logger = ""
# file = "stderr"
# level = "error"
# encoding = "mixed"
# encoding-time = "iso8601"
# encoding-duration = "seconds"

OS tuning

It is crucial for performance to ensure that your OS tuned so that go-carbon is never blocked on writes, usually that involves adjusting following sysctl params on Linux systems:

# percentage of your RAM which can be left unwritten to disk. MUST be much more than
# your write rate, which is usually one FS block size (4KB) per metric.
sysctl -w vm.dirty_ratio=80

# percentage of yout RAM when background writer have to kick in and
# start writes to disk. Make it way above the value you see in `/proc/meminfo|grep Dirty`
# so that it doesn't interefere with dirty_expire_centisecs explained below
sysctl -w vm.dirty_background_ratio=50

# allow page to be left dirty no longer than 10 mins
# if unwritten page stays longer than time set here,
# kernel starts writing it out
sysctl -w vm.dirty_expire_centisecs=$(( 10*60*100 ))

Net effect of these 3 params is that with dirty_*_ratio params set high enough multiple updates to a metric don't trigger disk activity. Multiple datapoint writes are coalesced into single disk write which kernel then writes to disk in a background.

With settings above applied, best write-strategy to use is "noop"

Reported stats

metric description
cache.maxSize
cache.metrics
cache.size
cache.queueWriteoutTime Time in seconds to make a full cycle writing all metrics
carbonserver.cache_partial_hit Requests that was partially served from cache
carbonserver.cache_miss Total cache misses
carbonserver.cache_only_hit Requests fully served from the cache
carbonserver.cache_wait_time_overhead_ns Time spent getting copy of the cache
carbonserver.cache_wait_time_ns Time spent waiting for cache, including overhead
carbonserver.cache_requests Total metrics we've tried to fetch from cache
carbonserver.disk_wait_time_ns Time spent reading data from disk
carbonserver.disk_requests Amount of metrics we've tried to fetch from disk
carbonserver.points_returned Datapoints returned by carbonserver
carbonserver.metrics_returned Metrics returned by carbonserver
persister.maxUpdatesPerSecond
persister.workers
runtime.GOMAXPROCS
runtime.NumGoroutine

Changelog

master
  • Added new options and upgraded go-whisper library to have compressed format (cwhisper) support
version 0.14.0
  • Accept UDP messages in plain protocol without trailing newline
  • Added whisper.hard-max-creates-per-second option #242
  • No longer trying to combine separate UDP messages from one sender into single stream
  • [carbonserver] Added metrics for prometheus
  • [carbonserver] Improved compatibility with graphite-web (#250, #251)
version 0.13.0
  • Added whisper.max-creates-per-second option
  • Support multiple targets in carbonserver
  • Support new carbonapi_v3_pb protocol. This allows recent versions of carbonapi to get metadata alongside with data
version 0.12.0
  • Tags support was added (only with graphite-web)
  • flock support for persister and carbonserver
  • cache.max-size and cache.write-strategy can be changed without restart (HUP signal)
  • Google PubSub protocol was added. It receives data from PubSub Subscriptions and can decode protobuf, plain, or pickle messages.
    • The default format is plain. Specify protobuf or pickle by adding an attribute named 'content-type' to the PubSub messsages:
      • application/protobuf
      • application/python-pickle
    • Sample configuration:
[receiver.pubsub]
protocol = "pubsub"
project = "project-name"
subscription = "subscription-name"
# receiver_go_routines = 4
# receiver_max_messages = 1000
# receiver_max_bytes = 500000000 # default 500MB
version 0.11.0
  • GRPC api for query cache was added
  • Added support for an unlimited number of receivers
  • Protobuf protocol was added. Sample configuration:
[receiver.protobuf]
protocol = "protobuf"
listen = ":2005"
  • HTTP protocol was added. It receives data from POST requests body. Data can be encoded in plain, pickle (Content-Type: application/python-pickle) and protobuf (Content-Type: application/protobuf) formats. Sample configuration:
[receiver.http]
protocol = "http"
listen = ":2006"
  • Kafka protocol was added. It receives data from Kafka and can decode protobuf, plain or pickle messages. You need manually specify message format in the config file. Sample configuration:
[receiver.kafka]
protocol = "kafka"
parse-protocol = "protobuf" # can be also "plain" or "pickle"
brokers = [ "localhost:9092" ]
topic = "graphite"
partition = 0
state-file = "/var/lib/graphite/kafka.state"
initial-offset = "-30m" # In case of absent or damaged state file fetch last 30 mins of messages
version 0.10.0

Breaking changes:

  • common: logfile and log-level in common config section are deprecated
  • changed config defaults:
    • user changed to carbon
    • whisper directory changed to /var/lib/graphite/whisper/
    • schemas config changed to /etc/go-carbon/storage-schemas.conf
  • rpm:
    • binary moved to /usr/bin/go-carbon
    • configs moved to /etc/go-carbon/
  • deb:
    • binary moved to /usr/bin/go-carbon

Other changes:

  • common: Requires Go 1.8 or newer
  • common: Logging refactored. Format changed to structured JSON. Added support of multiple logging handlers with separate output, level and encoding
  • dump/restore: New dump format. Added go-carbon -cat filename command for printing dump to console. New version of go-carbon can read old dump
  • dump/restore: [fix] go-carbon can not stop after dump (with enabled dump and carbonserver)
  • carbonserver: [feature] IdleTimeout is now configurable in carbonserver part
  • carbonserver: [feature] support /render query cache (query-cache-* options in config file)
  • carbonserver: [feature] support /metrics/find cache (find-cache-* option in config file)
  • carbonserver: [feature] support /metrics/details handler, that returns information about metrics (require enabled trigram-index)
  • carbonserver: [feature] Add config option to disable trigram index (before that to disable index you should set scan-interval to 0)
  • carbonserver: [fix] fix #146 (metrics_known was broken if metrics were not sent as counters)
version 0.9.1
  • Always stop on USR2 signal (previously did not stop with disabled dump/restore) #135
version 0.9.0
  • Completely new internal architecture
  • Removed flush to whisper and stop on USR2 signal. Use dump/restore instead
  • Removed global queue (channel) between receivers and cache, added optional per-receiver queues
  • Built-in carbonserver (thanks Vladimir Smirnov)
  • Added runtime tunables to internal metrics #70
version 0.8.1
  • Bug fix: The synchronous config reload (HUP signal) and launch of the internal collecting statistics procedure (every "metric-interval") could cause deadlock (thanks Maxim Ivanov)
version 0.8.0
  • Fully refactored and optimizer cache module (core of go-carbon) (thanks Maxim Ivanov)
  • Added noop cache.write-strategy (thanks Maxim Ivanov)
  • New optional dump/restore functional for minimize data loss on restart
  • Refactored internal stat mechanics. common.graph-prefix and common.metric-interval now can be changed without restart (on HUP signal)
  • Customizable internal metrics endpoint. common.metric-endpoint param. Valid values: "local" and "" (write directly to whisper), "tcp://host:port", "udp://host:port"
version 0.7.3
  • Added cache.write-strategy option (values "max" or "sorted") (thanks Alexander Akulov)
  • commitedPoints metric renamed to committedPoints
version 0.7.2
  • Added sparse file creation (whisper.sparse-create config option)
  • Enable reload in init script (thanks Jose Riguera)
  • Clean up schemas parser code (thanks Dieter Plaetinck)
  • Better go-whisper error handling (thanks Hiroaki Nakamura)
  • Don't try to create whisper file if exists with bad permissions #21
version 0.7.1
  • Fixed problem: Points in queue (channel) between cache and persister subsystems was invisible for carbonlink
version 0.7
  • Grace stop on USR2 signal: close all socket listeners, flush cache to disk and stop carbon
  • Reload persister config (whisper section of main config, storage-schemas.conf and storage-aggregation.conf) on HUP signal
  • Fix bug: Cache may start save points only after first checkpoint
  • Decimal numbers in log files instead of hexademical #22
  • Fix bug: NaN values being saved in Whisper datafiles #17 (thanks Andrew Hayworth)
  • Fix bug: may crash on bad pickle message with big message size in header #30. Added option pickle.max-message-size with 64 MB default value
  • Improved throttling (max-updates-per-second) performance #32
version 0.6
  • metric-interval option
version 0.5.5
  • Cache module optimization
version 0.5.4
  • Fix RPM init script
version 0.5.3
  • Improved validation of bad wsp files
  • RPM init script checks config before restart
  • Debug logging of bad pickle messages
version 0.5.2
  • Fix bug in go-whisper library: UpdateMany saves first point if many points has identical timestamp
version 0.5.1
  • Reduced error level of "bad messages" in tcp and pickle receivers. Now info
  • Configurable logging level. log-level option
  • Fix wrong carbonlink request error in log
version 0.5.0
  • -check-config validates schemas and aggregation configs
  • Fix broken internal metrics tcp.active and pickle.active
  • Optional udp incomplete messages logging: log-incomplete setting
  • Fixes for working on x86-32
  • logging fsnotify: fix ONCE rotation bug
version 0.4.3
  • Optional whisper throttle setting #8: max-updates-per-second
version 0.4.2
  • Fix bug in go-whisper: points in long archives missed if metrics retention count >=3
version 0.4.1
  • Bug fix schemas parser
version 0.4
  • Code refactoring and improved test coverage (thanks Dave Rawks)
  • Bug fixes
version 0.3
  • Log "create wsp" as debug
  • Log UDP checkpoint (calculate stats every minute)
  • Rotate logfile by inotify event (without HUP)
  • Check logfile opened
  • storage-aggregation.conf support
  • Create and chown logfile before daemonize and change user
  • Debian package (thanks Dave Rawks)
version 0.2
  • Git submodule dependencies
  • Init script for CentOS 6
  • Makefile
  • "make rpm" script
  • Daemonize and run-as-user support
  • -check-config option
  • -pidfile option
version 0.1
  • First full-functional public version
  • Logging with HUP rotation support
  • UDP receiver
  • Tcp receiver
  • Pickle receiver
  • TOML-configs
  • Carbonlink
  • Multi-persister support
  • storage-schemas.conf support