w3c/network-error-logging

HTTP error types

Closed this issue · 7 comments

Hi all.
We're using NEL with the Reporting API across our two main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences; we hope it's useful and constructive.

As website operators already have access to 4xx/5xx error information via their server access logs, do these also need to be reported via NEL?

Additional reports generated for HTTP errors must be received, processed and stored, all of which has a cost that must be factored into the scaling of reporting endpoints. It may therefore be useful to avoid sending HTTP errors, since they're already available in access logs. Arguably, they're also not really network errors, being at the application level.

+@chrisn

I will translate this to a feature request: be able to specify in the NEL policy which network error types the browser must not report on, and +1 for this from me.

I believe it is a good thing to have 4xx/5xx errors in scope of NEL, because:

  • Some CDNs don't provide the raw logs to all customers (paid feature, or only for customers on top tier price plan)
  • Some CDNs provide logs with a big delay (can be many hours or even a full day)
  • The audience of NEL should include devs that either don't have access to server logs or need a lot of time to parse the logs (poor tooling, little experience, ...), like front-end devs

Yeah, that's a really good point @aaronpeters. In a perfect world, orgs would do the work to give their staff access to appropriate data, but pragmatism rules here, and with a directive in the NEL policy to control which event classes are included, we'd get the best of both worlds. I like it.

A detail point for discussion - personally, I'd prefer to set which event classes are included, rather than which are not (with default being "all").

For clarity also, by "event class", I mean:

  • dns
  • tcp
  • tls
  • http
  • h2
  • abandoned
  • unknown

Perhaps this would be an Array in the NEL JSON config e.g.:

{"report_to":"default","max_age":2592000,"include_subdomains":true,"failure_fraction":0.01,"event_classes":["dns","tcp","tls"]}

To include only dns, tcp and tls event class reports.
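As a rough illustration of how such an allow-list might be evaluated, here is a minimal Python sketch. It assumes the proposed `event_classes` field (which is not part of the NEL specification) and assumes the event class is the part of the report type before the first dot; the helper names `event_class` and `is_included` are made up for this example.

```python
import json

# Hypothetical policy: "event_classes" is the field proposed in this thread,
# not something the NEL specification defines.
POLICY = json.loads(
    '{"report_to":"default","max_age":2592000,"include_subdomains":true,'
    '"failure_fraction":0.01,"event_classes":["dns","tcp","tls"]}'
)

def event_class(report_type):
    """Return the top-level class of a NEL type, e.g. "tcp.timed_out" -> "tcp"."""
    return report_type.split(".", 1)[0]

def is_included(policy, report_type):
    """Apply the proposed allow-list; a missing field is treated as "all"."""
    allowed = policy.get("event_classes")
    if allowed is None:
        return True
    return event_class(report_type) in allowed
```

With this policy, `is_included(POLICY, "tcp.timed_out")` is true while `is_included(POLICY, "http.error")` is false, and a policy without the field keeps today's behaviour of reporting everything.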

One potential risk with a whitelisting approach is that you completely miss any new network error types the browser starts reporting on.

Some context around past discussions on this: #63 is where we first added "success" reports. The main justification is what you've already discussed up-thread: while you have info about successful connections in server logs, it's not always easy to join those with your NEL logs. There's a real benefit to having all of the information you need in one place. (We added the separate success/failure sampling rates at the same time to give some control over the amount of incoming report data.)

#72 / #79 are where we clarified the handling of 4xx/5xx responses — the user agent MUST classify them as failures, with error code http.error, which means that they fall under the failure sampling rate, and not the success sampling rate.
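To make the sampling consequence concrete, here is a small Python sketch of that classification. It is an illustration only, not the user agent algorithm from the spec: it ignores every failure mode other than the HTTP status, the function names are invented, and the defaults of 0.0 for `success_fraction` and 1.0 for `failure_fraction` are stated as I understand them from the spec.

```python
import random

def classify(status_code):
    """Map a completed response to a NEL type: 4xx/5xx count as failures."""
    return "http.error" if 400 <= status_code <= 599 else "ok"

def should_sample(policy, report_type, rng=random.random):
    """Pick the sampling rate by outcome, then roll against it."""
    if report_type == "ok":
        fraction = policy.get("success_fraction", 0.0)
    else:
        # http.error lands here, alongside dns.*, tcp.*, tls.* and the rest.
        fraction = policy.get("failure_fraction", 1.0)
    return rng() < fraction
```

So a 404 is sampled at `failure_fraction`, exactly like a TCP timeout, which is why raising that rate also raises the volume of HTTP error reports.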

The overall rationale here was similar to what we're discussing in #124: this was the balance we settled on between (a) giving you control without adding too much complexity on the client side, and (b) ensuring that reports contain enough information to do something more complex in the collector. For instance, the reason that the error codes have a hierarchical structure (tcp.[something], etc.) is so that you can do a simple string prefix comparison in your collector if you want to handle TCP errors differently from DNS ones.
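A collector-side version of that prefix comparison could look something like the Python sketch below. It assumes each report carries its error type at `body.type`, as NEL reports delivered via the Reporting API do; the function name `group_by_prefix` and the "other" bucket are choices made for this example.

```python
from collections import defaultdict

def group_by_prefix(reports, prefixes=("dns.", "tcp.", "tls.", "http.")):
    """Bucket NEL reports by the first matching type prefix."""
    buckets = defaultdict(list)
    for report in reports:
        error_type = report["body"]["type"]
        for prefix in prefixes:
            if error_type.startswith(prefix):
                buckets[prefix.rstrip(".")].append(report)
                break
        else:
            # Types without a listed prefix, e.g. "abandoned" or "unknown".
            buckets["other"].append(report)
    return dict(buckets)
```

A collector that only cares about connection-level problems could then drop or down-sample the "http" bucket after receipt, although that still incurs the cost of receiving those reports in the first place.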

> One potential risk with a whitelisting approach is that you completely miss any new network error types the browser starts reporting on.

Yeah, I get that. But you could flip it around and say that if you have deviated from the defaults, you won't get any surprises when new event classes land :-).

@dcreager thanks for the context. I recognise that I'm very green to all this. I wonder, though, would it be really complex to introduce something similar to the above (shown again here)?

{"report_to":"default","max_age":2592000,"include_subdomains":true,"failure_fraction":0.01,"event_classes":["dns","tcp","tls"]}

Where the default is "all".
Maybe it would, and that's understood if so. I just feel that, from my perspective, I'd definitely use it, and it'd allow me to crank up our sample rate without incurring costs. Our reporting endpoint costs are largely driven by handling the received reports, and we have a really nice access logging system, so I'd prefer to stick to dns, tcp and tls for NEL. This might be very much just us, but I wanted to try to convey our angle on it. I hope it helps, and I genuinely mean it as constructive, so apologies if it comes across otherwise :-).

I'll close this issue in favour of #133 which would achieve the same outcome and is more complete.