For a detailed list of linked pull requests merged in each release, see CHANGELOG.md. For more readable information about recent changes, please see RELEASE_NOTES.md.
Refinery is a tail-based sampling proxy that operates at the the level of an entire trace. It examines whole traces and intelligently applies sampling decisions (whether to keep or discard) to each trace.
A tail-based sampling model allows you to inspect an entire trace at one time and make a decision to sample based on its contents. For example, perhaps the root span has the HTTP status code to serve for a request, whereas another span might have information on whether the data was served from a cache. Using Refinery, you can choose to keep only traces that had a 500
status code and were also served from a cache.
Refinery support several kinds of tail sampling:
- Dynamic sampling - This sampling type configures a key based on a trace's set of fields and automatically increases or decreases the sampling rate based on how frequently each unique value of that key occurs. For example, using a key based on
http.status_code
, requests that return200
could be included less often in the forwarded sample data, since200
indicates success and appears in higher frequency, compared to requests that return404
errors and require investigation. - Rules-based sampling - This sampling type enables you to define sampling rates for well-known conditions. For example, you can sample 100% of traces with an error and then fall back to dynamic sampling for all other traffic.
- Throughput-based sampling - This sampling type enables you to sample traces based on a fixed upper-bound for the number of spans per second. The sampler will dynamically sample traces with a goal to keep the throughput below the specified limit.
- Deterministic probability sampling - This sampling type consistently applies sampling decisions without considering the contents of the trace other than its trace ID. For example, you can include 1 out of every 12 traces in the sampled data sent to Honeycomb. This kind of sampling can also be done using head sampling, and if you use both, Refinery takes that into account.
Refinery lets you combine all of the above techniques to achieve your desired sampling behavior.
Refinery is designed to sit within your infrastructure where all traces can reach it. Refinery can run standalone or be deployed in a cluster of two or more Refinery processes accessible via a separate load balancer.
Refinery processes must be able to communicate with each other to concentrate traces on single servers.
Within your application (or other Honeycomb event sources) you would configure the API Host
to be http(s)://load-balancer/. Everything else remains the same (api key, dataset name, etc. - all that lives with the originating client).
Every Refinery instance should have a minimum of 2GB RAM and at least 2 cores, preferably more. Refinery clusters usually run best with fewer, larger servers; for large installations, 16 cores and 16GB of RAM per server is not unusual (with appropriate tuning).
Additional RAM and CPU can be used by increasing configuration values; in particular, CacheCapacity
is an important configuration value. Refinery's Stress Relief
system provides a good indication of how hard Refinery is working, and when invoked, logs (as reason
) the name of the Refinery configuration value that should be increased to reduce stress.
Refinery is available as a Helm chart in the Honeycomb Helm repository.
You can install Refinery with the following command:
helm repo add honeycomb https://honeycombio.github.io/helm-charts
helm install refinery honeycomb/refinery
This will use the default values file. You can also supply your own:
helm install refinery honeycomb/refinery --values /path/to/refinery-values.yaml
When operating in a cluster, Refinery expects to gather all of the spans in a trace onto a single instance so that it can make a trace decision. Since each span arrives independently, each Refinery instance needs to be able to communicate with all of its peers in order to distribute traces to the correct instance.
This communication can be managed in two ways: via an explicit list of peers in the configuration file, or by using self-registration via a shared Redis cache. Installations should generally prefer to use Redis. Even in large installations, the load on the Redis server is quite light, with each instance only making a few requests per minute. A single Redis instance with fractional CPU is usually sufficient.
Configuration is controlled by Refinery's two configuration files. For information on all of the configuration parameters that control Refinery's operation, please see the configuration documentation.
For information on sampler configuration, please see the sampling methods documentation.
Refinery is a typical linux-style command line application, and supports several command line switches.
refinery -h
will print an extended help text listing all command line options and supported environment variables.
Refinery supports the following key environment variables; please see the command line help or the online documentation for the full list. Command line switches take precedence over file configuration, and environment variables take precendence over both.
Environment Variable | Config Field |
---|---|
REFINERY_GRPC_LISTEN_ADDRESS |
GRPCListenAddr |
REFINERY_REDIS_HOST |
PeerManagement.RedisHost |
REFINERY_REDIS_USERNAME |
PeerManagement.RedisUsername |
REFINERY_REDIS_PASSWORD |
PeerManagement.RedisPassword |
REFINERY_HONEYCOMB_API_KEY |
HoneycombLogger.LoggerAPIKey |
REFINERY_HONEYCOMB_METRICS_API_KEY REFINERY_HONEYCOMB_API_KEY |
LegacyMetrics.APIKey |
REFINERY_QUERY_AUTH_TOKEN |
QueryAuthToken |
Note, REFINERY_HONEYCOMB_METRICS_API_KEY
takes precedence over REFINERY_HONEYCOMB_API_KEY
for the LegacyMetrics.APIKey
configuration.
When getting started with Refinery or when updating sampling rules, it may be helpful to verify that the rules are working as expected before you start dropping traffic. By enabling dry run mode, all spans in each trace will be marked with the sampling decision in a field called refinery_kept
. All traces will be sent to Honeycomb regardless of the sampling decision. The SampleRate will not be changed, but the calculated SampleRate will be stored in a field called meta.dryrun.sample_rate
. You can then run queries in Honeycomb to check your results and verify that the rules are working as intended. See the configuration documentation for information on enabling Dry Run mode.
When dry run mode is enabled, the metric trace_send_kept
will increment for each trace, and the metric for trace_send_dropped
will remain 0, reflecting that we are sending all traces to Honeycomb.
Refinery uses bounded queues and circular buffers to manage allocating traces, so even under high volume memory use shouldn't expand dramatically. However, given that traces are stored in a circular buffer, when the throughput of traces exceeds the size of the buffer, things will start to go wrong. If you have statistics configured, a counter named collect_cache_buffer_overrun
will be incremented each time this happens. The symptoms of this will be that traces will stop getting accumulated together, and instead spans that should be part of the same trace will be treated as two separate traces. All traces will continue to be sent (and sampled), but some sampling decisions will be made on incomplete data. The size of the circular buffer is a configuration option named CacheCapacity
. To choose a good value, you should consider the throughput of traces (e.g. traces / second started) and multiply that by the maximum duration of a trace (say, 3 seconds), then multiply that by some large buffer (maybe 10x). This will give you good headroom.
Determining the number of machines necessary in the cluster is not an exact science, and is best influenced by watching for buffer overruns. But for a rough heuristic, count on a single machine using about 2G of memory to handle 5000 incoming events and tracking 500 sub-second traces per second (for each full trace lasting less than a second and an average size of 10 spans per trace).
Refinery emits a number of metrics to give some indication about the health of the process. These metrics should be sent to Honeycomb, typically with Open Telemetry, and can also be exposed to Prometheus. The interesting ones to watch are:
- Sample rates: how many traces are kept / dropped, and what does the sample rate distribution look like?
- [incoming|peer]router*: how many events (no trace info) vs. spans (have trace info) have been accepted, and how many sent on to peers?
- collect_cache_buffer_overrun: this should remain zero; a positive value indicates the need to grow the size of the collector's circular buffer (via configuration
CacheCapacity
). - process_uptime_seconds: records the uptime of each process; look for unexpected restarts as a key towards memory constraints.
The default logging level of warn
is fairly quiet. The debug
level emits too much data to be used in production, but contains excellent information in a pre-production environment,including trace decision information. info
is somewhere between. Setting the logging level to debug
during initial configuration will help understand what's working and what's not, but when traffic volumes increase it should be set to warn
or even error
. Logs may be sent to stdout or to Honeycomb.
Refinery validates its configuration on startup or when a configuration is reloaded, and it emits diagnostics for any problems. On startup, it will refuse to start; on reload, it will not change the existing configuration.
Because the configuration file formats (YAML) can sometimes be confusing to read and write, it may be valuable to check the loaded configuration by using one of the /query
endpoints from the command line on a server that can access a refinery host.
The /query
endpoints are protected and can be enabled by specifying QueryAuthToken
in the configuration file or specifying REFINERY_QUERY_AUTH_TOKEN
in the environment. All requests to any /query
endpoint must include the header X-Honeycomb-Refinery-Query
set to the value of the specified token.
curl --include --get $REFINERY_HOST/query/allrules/$FORMAT --header "x-honeycomb-refinery-query: my-local-token"
will retrieve the entire rules configuration.
curl --include --get $REFINERY_HOST/query/rules/$FORMAT/$DATASET --header "x-honeycomb-refinery-query: my-local-token"
will retrieve the rule set that refinery will use for the specified dataset. It comes back as a map of the sampler type to its rule set.
curl --include --get $REFINERY_HOST/query/configmetadata --header "x-honeycomb-refinery-query: my-local-token"
will retrieve information about the configurations currently in use, including the timestamp when the configuration was last loaded.
For file-based configurations (the only type currently supported), the hash
value is identical to the value generated by the md5sum
command for the given config file.
For all of these commands:
$REFINERY_HOST
should be the url of your refinery.$FORMAT
can be one ofjson
,yaml
, ortoml
.$DATASET
is the name of the dataset you want to check.
Refinery can send telemetry that includes information that can help debug the sampling decisions that are made. To enable it, in the config file, set AddRuleReasonToTrace
to true
. This will cause traces that are sent to Honeycomb to include a field meta.refinery.reason
, which will contain text indicating which rule was evaluated that caused the trace to be included.
Refinery does not yet buffer traces or sampling decisions to disk. When you restart the process all in-flight traces will be flushed (sent upstream to Honeycomb), but you will lose the record of past trace decisions. When started back up, it will start with a clean slate.
Within each directory, the interface the dependency exports is in the file with the same name as the directory and then (for the most part) each of the other files are alternative implementations of that interface. For example, in logger
, /logger/logger.go
contains the interface definition and logger/honeycomb.go
contains the implementation of the logger
interface that will send logs to Honeycomb.
main.go
sets up the app and makes choices about which versions of dependency implementations to use (eg which logger, which sampler, etc.) It starts up everything and then launches App
app/app.go
is the main control point. When its Start
function ends, the program shuts down. It launches two Router
s which listen for incoming events.
route/route.go
listens on the network for incoming traffic. There are two routers running and they handle different types of incoming traffic: events coming from the outside world (the incoming
router) and events coming from another member of the Refinery cluster (peer
traffic). Once it gets an event, it decides where it should go next: is this incoming request an event (or batch of events), and if so, does it have a trace ID? Everything that is not an event or an event that does not have a trace ID is immediately handed to transmission
to be forwarded on to Honeycomb. If it is an event with a trace ID, the router extracts the trace ID and then uses the sharder
to decide which member of the Refinery cluster should handle this trace. If it's a peer, the event will be forwarded to that peer. If it's us, the event will be transformed into an internal representation and handed to the collector
to bundle spans into traces.
collect/collect.go
the collector is responsible for bundling spans together into traces and deciding when to send them to Honeycomb or if they should be dropped. The first time a trace ID is seen, the collector starts a timer. If the root span (aka a span with a trace ID and no parent ID) arrives before the timer expires, then the trace is considered complete. The trace is sent and the timer is canceled. If the timer expires before the root span arrives, the trace will be sent whether or not it is complete. Just before sending, the collector asks the sampler
for a sample rate and whether or not to keep the trace. The collector obeys this sampling decision and records it (the record is applied to any spans that may come in as part of the trace after the decision has been made). After making the sampling decision, if the trace is to be kept, it is passed along to the transmission
for actual sending.
transmit/transmit.go
is a wrapper around the HTTP interactions with the Honeycomb API. It handles batching events together and sending them upstream.
logger
and metrics
are for managing the logs and metrics that Refinery itself produces.
sampler
contains algorithms to compute sample rates based on the traces provided.
sharder
determines which peer in a clustered Refinery config is supposed to handle an individual trace.
types
contains a few type definitions that are used to hand data in between packages.