Investigate buildout into production-scalable tooling

Question

charles-dyfis-net opened this issue 8 years ago · 6 comments

Filing this ticket as a set of TODO items largely for myself, but hopefully also as a conversation-starter.

How expensive is it to have instrumentation check whether tracing is enabled and, finding that it isn't, skip tracing an invocation? (That is to say: is the overhead low enough for instrumentation to be present in production, but only invoked on a representative sample or when another condition is met?)
How expensive is it to collect traces, but then throw them away? (If we can trace a call, but only serialize and export those traces if an error condition is detected, this would be a powerful tool for detailed error reporting.)
How expensive is it to collect (or report) only profiling results, but not arguments and return values?
Is exporting sysdig trace events rather than accumulating state in-memory a workable approach?
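
To make the first and third questions concrete, here is a minimal sketch of the kind of guard being asked about. It is written in Python purely for illustration and is not sayid's API; `TraceConfig`, `CONFIG`, `RECORDS`, and `traced` are all hypothetical names. When tracing is off (or the call falls outside the sample), the wrapper does one flag check and calls straight through; when `capture_args` is off, only timing is recorded.

```python
import functools
import random
import time


class TraceConfig:
    """Hypothetical runtime switches; nothing here comes from sayid."""

    def __init__(self, enabled=False, sample_rate=1.0, capture_args=True):
        self.enabled = enabled            # master on/off switch
        self.sample_rate = sample_rate    # fraction of calls to trace
        self.capture_args = capture_args  # False = profiling data only


CONFIG = TraceConfig()
RECORDS = []  # in-memory accumulation, standing in for any real sink


def traced(fn):
    """Wrap fn so it is traced only when CONFIG allows it."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Fast path: one flag check (plus one random draw when enabled).
        if not CONFIG.enabled or random.random() >= CONFIG.sample_rate:
            return fn(*args, **kwargs)

        record = {"name": fn.__name__}
        if CONFIG.capture_args:
            record["args"] = args
            record["kwargs"] = kwargs
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            if CONFIG.capture_args:
                record["result"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["elapsed_s"] = time.perf_counter() - start
            RECORDS.append(record)

    return wrapper


@traced
def add(a, b):
    return a + b
```

Timing `add` with `CONFIG.enabled` set to `False` against an undecorated copy (for example with the standard `timeit` module) would give a rough answer to the first question for this particular design; the numbers for a JVM/Clojure implementation would of course need to be measured separately.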

Answer 1 · 2016-12-02T02:45:57.000Z

A bit more general-purpose noodling -- all on the conversation-starter side; checkbox items in the ticket itself remain the focus for POC/feasibility determination.

Hmm. Sysdig trace events make sense not only as a way to do high-performance transport out from the traced process, but also as a way to correlate with other monitoring/tracing. If we want to be able to look at the FD table as it existed when a function that later exited with an exception was started, or at what's going on in the front-end HTTP server when a given function is invoked, or at what the OS is logging to syslog within a 5-second window of a given exception, these are all things sysdig can do using chisels that already exist (so long as we generate recognizable events for them to operate against).

On the other hand, much of the value of having the data collected by sayid is in being able to query it in a content-dependent way -- we'd want to be able to filter using arbitrary Clojure predicates. Perhaps a plugin (for sysdig's Lua-based userspace tooling) running a ClojureScript coprocess to evaluate user-provided predicates?

Then again, more powerful queries require being able to operate on data over longer time spans than those for which (open-source) sysdig tooling is suitable. Examples:

Over our last 30 days, what percentage of sampled calls to X have argument Y?
Over our last week, plot the number of items in collection argument Z in sampled invocations of function N against the time required to execute that function.
What's the smallest value of argument B to function C for which exception E ever occurred? (If collection is cheap enough to allow retroactive tracing for calls that end in exceptions (HTTP requests that end in 500s, or other user-defined cases), this could extend beyond only randomly-sampled events to cover all such invocations!)
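
As a rough illustration of what the first and third example queries might look like once trace records have landed somewhere queryable, here is a sketch over plain in-memory records. It is in Python for illustration only; the `Call` record shape and both function names are assumptions, not anything sayid or sysdig defines.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Optional


@dataclass
class Call:
    """Hypothetical shape of one exported trace record."""
    fn: str                      # name of the traced function
    args: Dict[str, Any]         # argument name -> value
    elapsed_s: float             # wall-clock duration
    error: Optional[str] = None  # exception name, if the call failed


def fraction_matching(calls: Iterable[Call], fn: str,
                      pred: Callable[[Call], bool]) -> Optional[float]:
    """Fraction of recorded calls to `fn` for which `pred` holds."""
    relevant = [c for c in calls if c.fn == fn]
    if not relevant:
        return None  # no recorded calls, so no meaningful fraction
    return sum(1 for c in relevant if pred(c)) / len(relevant)


def smallest_failing_arg(calls: Iterable[Call], fn: str,
                         arg: str, error: str) -> Any:
    """Smallest value of `arg` among calls to `fn` that raised `error`."""
    values = [c.args[arg] for c in calls
              if c.fn == fn and c.error == error and arg in c.args]
    return min(values, default=None)
```

The predicate passed to `fraction_matching` plays the role of the arbitrary user-provided predicate mentioned earlier; a real deployment would push this filtering into whatever datastore holds the exported events.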

Exporting this (presumably from sysdig) to a suitable datastore, and documenting appropriate queries, is probably best deferred until after the concept is otherwise proved.

Answer 2 · 2016-12-02T04:40:16.000Z

Thinking about the details, some caveats have come up:

Making logging conditional on exceptions means that, if we're going with the Sysdig Trace approach, we'd need to log entry parameters on exit, since at entry we wouldn't yet know whether logging will be needed. (Fortunately, since data is immutable, this is possible so long as such entry parameters are not themselves reference types.) This wouldn't be an issue in a system where timestamps are application-provided as part of the data, and can thus refer to events which took place in the past; but in the context of Sysdig Trace, the metadata (the time of the trace call itself) is used, and can't be overridden in a correct and consistent way. It would also mean that we wouldn't get details except for the components of the call stack actually active at the time of the exception, unless we use a serialization format that allows args+result to be logged at once for the entire request, not just for that component of the stack. The latter would require substantial work to correlate with the identity of the relevant invocations; I'm guessing it's probably not worthwhile.

Sysdig Trace support still makes sense, but maybe not in conjunction with tracing details only on exception. Maybe provide a means to dump the full trace out-of-band from sysdig when an exception occurs, i.e. to a log or dump file?
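
A minimal sketch of that out-of-band idea, in Python for illustration only: every call in a request is buffered in memory with application-provided timestamps, and the buffer is serialized only if the request ends in an exception. `RequestTrace`, `run_request`, and the `dump` callback are hypothetical names, and arguments are snapshotted with `repr` at entry as a stand-in for the immutability that Clojure data would give for free.

```python
import json
import time


class RequestTrace:
    """Hypothetical per-request buffer of call events."""

    def __init__(self):
        self.events = []

    def call(self, fn, *args):
        # Snapshot arguments at entry so later mutation cannot alter them.
        event = {"fn": fn.__name__, "args": repr(args),
                 "started": time.time()}
        self.events.append(event)
        try:
            result = fn(*args)
            event["result"] = repr(result)
            return result
        except Exception as exc:
            event["error"] = repr(exc)
            raise
        finally:
            event["ended"] = time.time()


def run_request(handler, dump, *args):
    """Run handler(trace, *args); call dump(json_text) only on failure."""
    trace = RequestTrace()
    try:
        return handler(trace, *args)
    except Exception:
        # Includes calls that already returned, not just the active stack.
        dump(json.dumps(trace.events))
        raise
```

Here `dump` could append to a log or write a dump file. Because the buffer covers the whole request, calls that completed before the failure are included too, at the cost of holding every event in memory until the request finishes.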

Answer 3 · 2017-01-31T13:51:40.000Z

@charles-dyfis-net A couple weeks ago, I started working on a production version of sayid. Still not sure my approach is going to pan out, but I thought I'd leave a note for anyone interested.

Answer 4 · 2017-03-05T02:19:40.000Z

I'm looking for some direction. If you (anyone reading this) might be interested in a version of sayid designed for production, please consider taking this survey. Thanks.

http://sayidpro.com/

Answer 5 · 2017-04-04T13:17:04.000Z

Update for anyone out there who might be listening:
I got some very helpful feedback from the survey last month. What I learned definitely changed my idea of what an MVP should look like. I hope to make a public announcement of some kind in the next few weeks. We'll see.

Answer 6 · 2017-05-08T15:24:52.000Z

@charles-dyfis-net @achesnais @julienfantin @DonyorM @abcdw

Here it is:
Sayid Pro - Transparency for Clojure Production Environments
https://www.kickstarter.com/projects/1269641244/sayid-pro-transparency-for-clojure-production-envi