inadarei/rfc-healthcheck

response considerations

randallsquared opened this issue · 17 comments

Make it explicit that details objects are unstandardized?

Is cpu the percentage of the allowed CPU for the logical node that the application is using? The percentage of the allowed CPU that is in use? Something else? Not sure how unclear this actually is, but maybe it needs some thought. Things that would be useful to, say, a systems admin would be "how much CPU is in use", "how much CPU is this app I'm checking using", "what is the load average of the logical node", and maybe other things.

Memory has some of the same issues, and also the issue that in 10 years the absolute numbers will likely be much larger, since it's tied to kilobytes. Maybe a memory_unit that could be one of "perc", "kB", "KiB", "MB", "MiB", "GB", "GiB"?
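To make the memory_unit idea concrete, here is a minimal sketch of how a consumer might normalize a reported value to bytes. The helper name `memory_in_bytes` and the `total_bytes` parameter are assumptions made for illustration; nothing like this is defined in the draft.

```python
# Hypothetical helper, not part of the draft: normalize a memory reading
# to bytes using the unit strings proposed above.
BYTES_PER_UNIT = {
    "kB": 1000,        # decimal kilobyte
    "KiB": 1024,       # binary kibibyte
    "MB": 1000 ** 2,
    "MiB": 1024 ** 2,
    "GB": 1000 ** 3,
    "GiB": 1024 ** 3,
}


def memory_in_bytes(value, unit, total_bytes=None):
    """Convert a (value, memory_unit) pair to a byte count.

    "perc" is a percentage, so it needs the total memory (an assumed
    extra input) to be turned into an absolute number.
    """
    if unit == "perc":
        if total_bytes is None:
            raise ValueError("'perc' requires total_bytes")
        return value / 100 * total_bytes
    if unit not in BYTES_PER_UNIT:
        raise ValueError(f"unknown memory_unit: {unit!r}")
    return value * BYTES_PER_UNIT[unit]
```

For example, `memory_in_bytes(2, "kB")` gives 2000 while `memory_in_bytes(2, "KiB")` gives 2048, which is the kB/KiB distinction a consumer would otherwise have to guess at.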

I think details are standardized, just not yet documented :)

CPU percentage is the percentage of the allowed CPU that is in use.

I have thought about making memory and cpu more robust (objects). But the issue is that they would have to be arrays, since applications don't run on a single node anymore. And an array of objects feels a bit too much... Maybe it's just me and I should get over it... in time? :)

I would argue that memory and CPU info should be included in the details list instead of on the root of the document.

The RFC is for the responses of APIs, not necessarily of the server nodes serving those APIs. Granted, the values are optional, so any app could just omit them. However, there doesn't seem to be anything specific that makes CPU and memory special compared with a Cassandra sub-system, like the example listed, other than that they've been a part of Opts since the beginning and they'll probably be included more regularly.

Also, I'd advise adding a note that details is meant to be eventually standardized.

dret commented

Thanks, guys!

OK, let's standardize the details and then see if cpu and memory fit in there? Right now I feel like they won't but maybe they do.

The original motivation for responding with cpu and ram was to allow monitoring tools to decide whether things are getting worse. I.e., a monitoring tool could poll a service every 10 minutes and notice an upward trend in memory utilization, or it may detect "dangerously" high levels of cpu. It should be the monitoring tool that decides what the thresholds are, however, not the reporting service.
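A rough sketch of what that could look like on the monitoring side, assuming the tool keeps a list of the values it has polled. The function names and the "strictly increasing" definition of a trend are illustrative assumptions, not anything the draft prescribes.

```python
# Hypothetical monitoring-side checks. The service only reports numbers;
# the thresholds and the trend rule live in the monitoring tool.

def is_trending_up(samples, min_points=3):
    """Return True if every sample is strictly higher than the previous one.

    `samples` is the series of polled values, oldest first. Fewer than
    `min_points` samples is treated as "not enough data".
    """
    if len(samples) < min_points:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


def exceeds_threshold(samples, threshold):
    """Return True if the most recent sample is at or above `threshold`."""
    return bool(samples) and samples[-1] >= threshold
```

With memory readings of `[40, 55, 70]` from three consecutive polls, `is_trending_up` returns True; a tool that considers 90% CPU dangerous would call `exceeds_threshold(cpu_samples, 90)`.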

Obviously this kind of information, if exposed, should be protected (which the RFC does explicitly point out), but I don't think it is irrelevant. Its meaning has just changed, because apps don't run on a single machine anymore and are usually clustered, indeed. The health of the app is the aggregate of the health of the nodes it runs on?
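One possible reading of "aggregate", sketched under assumptions: each node reports a status, and the app takes the worst one. The status strings "pass", "warn", and "fail" and the worst-of rule are assumed here for illustration only.

```python
# Hypothetical worst-of aggregation over per-node statuses.
# The status names and their ordering are assumptions.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def aggregate_status(node_statuses):
    """Return the most severe status among the nodes."""
    if not node_statuses:
        raise ValueError("no node statuses to aggregate")
    # An unknown status string raises KeyError rather than being ignored.
    return max(node_statuses, key=lambda status: SEVERITY[status])
```

So a cluster reporting `["pass", "warn", "pass"]` would be "warn" overall. Other rules (majority, quorum) are equally plausible; this is just the simplest one.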

Looking at health of CPU and health of Cassandra system as a similar thing is a very interesting idea...

OK, let's see how @mastermatt's suggestion for details may look: https://github.com/inadarei/rfc-healthcheck/pull/7/files#diff-e8cd9c1cd7d648d99d1df69f6263ab48

And here's the rendered version: http://rawgit.com/inadarei/rfc-healthcheck/reusable-details/index.html (refreshes every 10-15 minutes, so may not always be the most fresh version)

I can definitely see the elegance. It feels a bit too verbose, especially for the CPU and memory use-cases, which used to be very simple arrays and now would be a sequence of larger objects. However, I think the key here is that most fields in details are optional, so implementers can use as much or as little as needed. The given example is the most verbose one, for demonstration purposes.

I don't know... What does everybody else think?

As mentioned for another reason in #8, I think it would be useful to collapse the details item into the base document spec.

Separately, though, let's just reference ISO/IEC 80000 units and abbreviations rather than creating new, incompatible ones (e.g., pb for petabyte rather than PB).

You mean the ISO/IEC 80000 that costs CHF158 to even view?

https://www.iso.org/standard/31898.html

That's problematic, of course. For time, we can use the abbreviations in the definitions section of RFC3339 if there's nothing newer.

For storage values it's more difficult, since while "everybody knows" the difference between kB and KiB, I haven't found a free, published standard.

One way to solve this is to suggest that standard units such as kB or KiB should be used appropriately, and avoid referencing any particular standard. The thing I'd prefer to avoid is defining custom abbreviations which do not match the actual abbreviations in technical use; whether some actual standard is referenced is at least a secondary concern.

Using RFC3339 for time abbrevs? I don't think that one covers things like millisecond or microsecond, but we can at least make it consistent with 3339. And it obviously doesn't cover byte abbrevs.

Regarding the ISO: I was extremely excited when you mentioned it, but since I am not familiar with the contents of ISO 80000, and it is not actually well-described on Wikipedia as opposed to ISO 8601, I tried to see what 80000 was about. Which is when they asked for money, and that obviously left a very bad taste. I don't know how they feel about such things in the older industries but on the web - if it ain't free, it ain't a good standard. I assume you agree. And hence my frustration :)

dret commented

What about having a "description" member for human-readable information that could accompany the status codes?

Would that be different from "notes"?

dret commented

The suggested change to details has been moved to: #12 with @randallsquared's enhancements to the original PR.

@dret, there is a top-level key description. It is at the bottom of the example JSON and easy to miss. We can put it at the top of the example and allow it in details as well, for the sake of the idiomatic design that @randallsquared is advocating for in #12.

dret commented

Thanks, @dret. I see what you are saying.

What about "statusDescription" to avoid collision with the "description" that is a description of the service itself?

dret commented