json-api-dotnet/JsonApiDotNetCore

Server-Timing header support

ebryn opened this issue · 7 comments

To aid in performance analysis of our API endpoints, it would be awesome if JADNC could implement support for the Server-Timing header at the framework level. This is something that could be optionally enabled and ideally not have any runtime performance cost when unused. If there isn't appetite to implement within the framework, perhaps hooks could be exposed so it could be more easily done at the application level?

We've begun using https://github.com/tpeczek/dotnet-server-timing within application code, in particular around Dapper queries, and it has been a really nice addition to our developer experience. The getting started tutorial gives a quick overview of Server-Timing and includes a screenshot of what the output looks like in the Chrome dev tools.
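For illustration, recording a Dapper query timing with that package looks roughly like this (a sketch; the repository/entity names are made up, and the exact IServerTiming/ServerTimingMetric shapes and namespaces may vary between versions of the package):

using System.Data;
using System.Diagnostics;
using Dapper;
using Lib.AspNetCore.ServerTiming;

public class OrderRepository(IDbConnection connection, IServerTiming serverTiming)
{
    public async Task<IEnumerable<Order>> GetOrdersAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        var orders = await connection.QueryAsync<Order>("SELECT id, total FROM orders");
        stopwatch.Stop();

        // Surfaces in the response as: Server-Timing: orders-query;dur=12.3;desc="Orders query"
        serverTiming.Metrics.Add(new ServerTimingMetric("orders-query",
            (decimal)stopwatch.Elapsed.TotalMilliseconds, "Orders query"));

        return orders;
    }
}

public record Order(long Id, decimal Total);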

Hi @ebryn, thanks for your suggestion. Adding metrics and counters is something I'd like too, but I haven't found time yet to dive into it.

Did you know that .NET 8 added built-in support for metrics? On top of that, there's OpenTelemetry integration, which provides platform/cloud-neutral distributed traces. The project at https://github.com/alexmg/Timingz seems to bridge the gap with the Server-Timing header, though I haven't looked deeply:

Integration with the .NET Activity API and OpenTelemetry .NET allows existing Activity based metrics to be included in the Server-Timing header and exported to OpenTelemetry exporters
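For reference, the built-in metrics support I mentioned is the System.Diagnostics.Metrics API, which the .NET 8 ASP.NET Core counters and the OpenTelemetry exporters build on. A minimal sketch (the meter and instrument names here are made up):

using System.Diagnostics.Metrics;

public static class ApiMetrics
{
    // A Meter groups related instruments; listeners such as OpenTelemetry or
    // dotnet-counters subscribe to it by name.
    private static readonly Meter Meter = new("MyCompany.MyApi", "1.0.0");

    // A histogram captures a distribution, e.g. per-request database time.
    public static readonly Histogram<double> DbQueryDuration =
        Meter.CreateHistogram<double>("myapi.db.query.duration", unit: "ms");
}

// At a measurement site:
// ApiMetrics.DbQueryDuration.Record(stopwatch.Elapsed.TotalMilliseconds);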

What do you think should be measured inside JsonApiDotNetCore specifically? What comes to mind is the request duration and SQL execution time, but I suspect there are already solutions for that.

@bkoelman My apologies for the late reply.

The initial questions we ask ourselves when an API is slower than ideal are:

  1. Is it the DB query? (Maybe we're missing an index?)
  2. Is it some custom code we wrote in our application?
  3. Is it framework overhead? (Is it JSON serialization? Is it EF?)

Big picture: being able to see, at a quick glance via the Chrome dev tools, the time spent executing queries in the database, time spent in application code, and time spent in framework code would provide significant value in our staging/production environments (even in development, as the server logs can be messy).

Thanks for the feedback. I've taken a closer look into this area.

To get the terminology right, observability/telemetry can be broken down into:

  • logging: events happening inside an executable
  • distributed tracing: correlate log entries, measure duration of parts in execution chain (frontend -> api -> db)
  • metrics/counters: queries per second, bytes written, memory/cpu usage, etc

I've done some experiments with tracing, replacing the built-in CodeTimingSessionManager usage with Activity (overhead cost is around 125ns, so we're good). Activity (= Span in OpenTelemetry) is built into the .NET runtime and its purpose is to capture correlated traces, so I'd prefer to use that instead of taking a dependency on a third-party library, or introducing an abstraction (this is quite performance-sensitive).
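In rough terms, the experiment swaps the stopwatch-based timing for something like this (a minimal sketch; the source and activity names are placeholders, not what JADNC would actually ship):

using System.Diagnostics;

internal static class JsonApiTelemetry
{
    // When no listener is attached, StartActivity() returns null, which is
    // why the overhead when tracing is unused stays negligible.
    private static readonly ActivitySource Source = new("JsonApiDotNetCore");

    public static void SerializeResponse()
    {
        using Activity? activity = Source.StartActivity("serialize-response");
        activity?.SetTag("jsonapi.resource-type", "todoItems");

        // ... the work being measured ...
    }
}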

Now there are several ways to consume activities. https://github.com/alexmg/Timingz converts them to Server-Timing headers, whereas Aspire provides a dashboard to display distributed traces (leveraging OpenTelemetry for collection).

Timingz output in Chrome DevTools: (screenshot)

Aspire (preview 2) dashboard: (screenshot)

As can be seen above, the output from Aspire is much richer. Not only is it capable of showing nested measurements (repository-get executes as part of service-get), it also includes details such as the SQL being executed, which comes directly from the Activity usage inside the Npgsql driver.

The big downside is that the Server-Timing HTTP header cannot include offsets/correlations. I've opened an issue to at least include downstream activities. But even then, the user needs to know the internal execution flow to make sense of the output.

The Aspire dashboard is tailored for local development. But its usage of OpenTelemetry paves the way for production analytics with tools like Prometheus/Grafana (metrics), and Jaeger or Zipkin (traces).

Big picture: being able to see (...) time spent in application code, and time spent in framework code

I don't think that's a realistic goal. It requires CLR-level interception to make the distinction, which is what profilers do, and that comes at a considerable cost, so it's not something to be running all the time. Consider, for example, a custom repository that calls queryLayer.ToString(): that's application code invoking framework code, so the boundary blurs.

Instead (and this is what every other library does), we should capture the duration of our own key execution points that don't involve application code. Such as serialization, query string parsing, running middleware, etc. So then the remaining gaps must be application code. Applications can optionally define their own activities to measure their own key logic blocks, and they'll get nested (similarly to Npgsql use) in Aspire. An open question is how to handle resource definition callbacks, where application code exists as well. To exclude them, we'd need to surround them with nested activities, but that makes the output very verbose.
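To illustrate, an application would wrap its own key logic block in an activity like this (a sketch with hypothetical names), and the resulting span nests under the surrounding framework spans in Aspire, just like the Npgsql ones do:

using System.Diagnostics;

public class OrderFulfillmentService
{
    private static readonly ActivitySource Source = new("MyApp");

    public async Task FulfillAsync(long orderId, CancellationToken cancellationToken)
    {
        // Becomes a child of whatever activity is current (e.g. service-get).
        using Activity? activity = Source.StartActivity("fulfill-order");
        activity?.SetTag("order.id", orderId);

        await ReserveStockAsync(orderId, cancellationToken);
        await ScheduleShipmentAsync(orderId, cancellationToken);
    }

    private static Task ReserveStockAsync(long orderId, CancellationToken cancellationToken) => Task.CompletedTask;
    private static Task ScheduleShipmentAsync(long orderId, CancellationToken cancellationToken) => Task.CompletedTask;
}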

at a quick glance via the Chrome dev tools, the time spent (...) in our staging/production environments

Wouldn't it be better to monitor what users observe in production, as opposed to what a single developer observes when performing a single request? What the developer sees in Chrome DevTools may not be representative. OpenTelemetry enables capturing and viewing the entire user experience:

(screenshot)

Activity names should be kept short (max ~16 chars); naming guidelines live at https://github.com/open-telemetry/semantic-conventions/blob/main/docs/README.md, which even covers GraphQL.

We should narrow down which activities we need (and per activity, what metadata is relevant to attach), but this largely depends on whether we can use nesting at all. If Server-Timing headers are the goal, our options are very limited (due to the header's lack of offsets and nesting).

In my experience, a simple Server-Timing: total;dur=25.3 header, implemented with a middleware like the one below,

using System.Diagnostics;
using System.Globalization;

internal class ServerTimingMiddleware(RequestDelegate next)
{
    public async Task InvokeAsync(HttpContext context)
    {
        var stopwatch = Stopwatch.StartNew();

        // Lets cross-origin pages read the timing values in DevTools.
        context.Response.Headers.Append("Timing-Allow-Origin", "*");

        // Headers must be set before the response body starts streaming.
        context.Response.OnStarting(() =>
        {
            // InvariantCulture guarantees a decimal point, as the spec requires.
            context.Response.Headers.Append("Server-Timing", "total;dur=" +
                stopwatch.Elapsed.TotalMilliseconds.ToString("0.0", CultureInfo.InvariantCulture));
            return Task.CompletedTask;
        });

        await next(context);
    }
}

is already super valuable, especially for easily differentiating between time spent on the network and time spent processing the request. That helps us prove to customers that a latency issue might not be caused by a performance issue on our side.
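For completeness, registering it first in Program.cs makes the measurement cover the whole pipeline (a minimal sketch):

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// First in the pipeline, so the timing spans all downstream middleware.
app.UseMiddleware<ServerTimingMiddleware>();

app.MapGet("/", () => "Hello");
app.Run();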

You could add more metrics than just total, but there is a risk of exposing potentially sensitive application and infrastructure information (see https://www.w3.org/TR/server-timing/#privacy-and-security). Also, as @bkoelman mentioned, there is no concept of parent/child relationships between Server-Timing metrics.

If the idea is to monitor the performance of a single request, then distributed traces will give you a high-level overview of where the time is spent. For example, I'm pushing some traces to Datadog using the OpenTelemetry SDK in a small personal project, and I can see that in the 5 ms request, most of the time was spent in Npgsql.
(screenshot)

I don't think JADNC needs to emit its own spans, as ASP.NET Core, HttpClient, and SQL clients are already instrumented, which gives you a high-level view of what's going on. Emitting spans for parsing the query or building the response object feels too verbose to me. Also, emitting a span is not free: the Activity is a pretty large object to allocate, generating the tag values can be expensive, and starting the Activity runs the sampling decision which, depending on your sampler, can add extra cost. Developers can still emit their own spans in their ResourceDefinition if they feel it's needed there.

Also, you can consider returning the trace ID of the request in a traceresponse header, to associate the trace with a response.
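A minimal sketch of that idea (note that traceresponse comes from the W3C Trace Context Level 2 draft, so the header name/format may still change):

using System.Diagnostics;

internal class TraceResponseMiddleware(RequestDelegate next)
{
    public async Task InvokeAsync(HttpContext context)
    {
        // With the default W3C ID format, Activity.Current.Id has the
        // traceparent shape: 00-{trace-id}-{span-id}-{flags}.
        string? traceId = Activity.Current?.Id;

        if (traceId != null)
        {
            context.Response.Headers.Append("traceresponse", traceId);
        }

        await next(context);
    }
}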

If the idea is to optimize the CPU performance of the application, adequate profiling tools should be used, such as perf and dotnet-trace. Or if you can afford it, continuous profiling (like Datadog APM or Grafana Pyroscope) is great.

To conclude, I don't think anything should be done in JADNC around tracing and server timing.

@verdie-g Thanks, these are great tips! I tend to agree with your conclusion. @ebryn thoughts?

Seems reasonable! Thanks for the discussion