Improvements for Problem guideline (176)
ePaul opened this issue · 7 comments
We discovered that our current guideline about problem/json is interpreted somewhat differently (both between guild members, and by other developers in the company).
Some things to clarify (most related to each other):
- 1. Is the problem object meant just for the client developers (for debugging purposes), or is it also intended to be used by client applications?
  - 1a. If the latter, only for display to a human (e.g. in a UI), or also for automatic processing decisions?
- 2. What is the intended meaning of the `type` (+ `title`) member, compared to the status code?
  - 2a. Should the problem types be defined in your API? (How?) Should they be standardized across APIs?
  - 2b. Can clients rely on the type attribute?
  - 2c. Is this the same as an "error code"?
- 3. What is the meaning of the `instance` field?
  - 3a. Is pointing to a specific occurrence of an error meant to point to a code location, or to a set of input data or time when this error happened (e.g. via flow-ID)?
  - 3b. When is this needed/useful in addition to the type?
- 4. For cases where a server application doesn't have more detailed error information, should fields (e.g. `type` + `title`) be filled with default values replicating the status code information, or just be left empty/missing?
- 5. What are compatibility concerns for problem objects? Is it an incompatible change if in some specific situation now a different `type` value is returned?
- 6. How can/should the problem object be extended?
  - 6a. Use the `type` as a discriminator for subtyping?
  - 6b. How can extended problems and generic ones coexist?
- 7. Client robustness: Can clients rely on anything in the error object? If not, can they still use something?
(Also, if we are looking at this guideline: Fix the typo `instant` → `instance` in the first paragraph.)
Adding my opinions:
- 1. The problem object can be used both by developers and applications.
  - 1a: The problem object can be used both for display to a human and for automated processing.
    (But in either case the code needs to be able to deal with the fact that it might not be there (or not complete), and have sensible fallback behavior.)
- 2. The `type` should normally be more specific than the status code. If it doesn't give additional information, you can just omit it (i.e. default it to `about:blank`).
  - 2a: If the API designer knows what errors can happen, they should mention them in the API definition. This can be done by extending the problem object and adding an x-extensible-enum to the type property. (You'd still need to list the meanings in a description field, or point to some external documentation.)
    If you have a bunch of related APIs, using common types for related errors makes sense (and using different types for different errors makes even more sense) – this makes it easier for clients who relate to multiple APIs, as they now can share error handlers.
  - 2b: Clients can not rely on the type property being there, or having just values which are known. There could always be some infrastructure component injecting errors, which doesn't know your problem types.
  - 2c: Yes, in many cases an error code could be put into the type property.
- 3. –
- 4. If you don't have additional information to the status code, you can just omit the type (i.e. default it to `about:blank`) and title. A client can generate a generic error message when hitting an almost-empty problem object.
- 5. A client should always be able to deal with the fact that an unknown error type (or none at all) is returned, and have some reasonable fallback behavior. This might degrade the user experience, so servers should avoid this if possible, especially for documented error codes.
- 6. –
- 7. Clients can use properties in the problem object (especially those which are documented in the specification), but they can't rely on them being there, or having specific values. In case a problem object doesn't have needed information, the client needs to fall back to some reasonable default behavior (similar to when there is an error without a problem object).
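To make the client-side fallback concrete, here is a minimal sketch in Python. It assumes the client already has the HTTP status, the `Content-Type` header and the raw body at hand; `parse_problem` and `status_phrase` are hypothetical helper names, not part of any guideline. Missing or malformed problem objects are normalized to `about:blank` plus the standard status phrase, so the calling code always sees the same shape.

```python
import json
from http import HTTPStatus


def status_phrase(status: int) -> str:
    """Return the standard reason phrase for a status code, or a generic label."""
    try:
        return HTTPStatus(status).phrase
    except ValueError:
        return "Unknown Error"


def parse_problem(status: int, content_type: str, body: str) -> dict:
    """Normalize any error response into a dict with type, title and status."""
    problem = {}
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/problem+json":
        try:
            parsed = json.loads(body)
            if isinstance(parsed, dict):
                problem = parsed
        except ValueError:
            pass  # malformed body: fall through to the defaults below

    normalized = dict(problem)  # keep any extension members as they are
    problem_type = problem.get("type")
    if not (isinstance(problem_type, str) and problem_type):
        normalized["type"] = "about:blank"
    title = problem.get("title")
    if not (isinstance(title, str) and title):
        normalized["title"] = status_phrase(status)
    normalized["status"] = status  # the real HTTP status wins over the body
    return normalized
```

An HTML error page injected by a proxy and a fully populated problem object both come out in the same structure, so the rest of the client needs only one code path.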
I tried to dump a few thoughts on this without going through the whole discussion, so feel free to just skip over this and treat it as a brain dump. I'll try to go through the discussion and reconcile it with my thoughts in a bit :)
- Is the problem object meant just for the client developers (for debugging purposes), or is it also intended to be used by client applications?
Currently the spec heavily leans towards it being for developers. I'd tend to mostly agree there. The errors the API returns likely have to be interpreted and provided with the right messaging in the right context for them to be useful when embedded into a frontend.
For apps facing a huge audience of diverse customers (e.g. Fashion Store), this seems critical and we should just display "Unknown error" + a reference they can provide for debugging. For cases where the customer can actually do something about it, it should be a special case that looks at the Problem `type` and shows the appropriate localized in-context message.
For apps with a specific known audience, e.g. employees, displaying the `title` + `detail` + any other context directly is a lot easier and helps debugging. We're unlikely to localize this across many languages for a small set of users anyway.
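A rough sketch of the customer-facing case described above, in Python. The type URI, the message texts and the `user_message` helper are all made up for illustration; the point is only that known types map to localized messages and everything else falls back to "Unknown error" plus a reference.

```python
# Hypothetical mapping from documented problem types to localized messages.
LOCALIZED_MESSAGES = {
    "/problems/wrong-price": {
        "en": "The price has changed. Please review your cart.",
        "de": "Der Preis hat sich geändert. Bitte prüfen Sie Ihren Warenkorb.",
    },
}

# Generic fallback shown for any type the frontend does not know about.
GENERIC_MESSAGES = {"en": "Unknown error", "de": "Unbekannter Fehler"}


def user_message(problem: dict, locale: str, reference: str) -> str:
    """Pick a localized message for known types, else a generic one with a reference."""
    problem_type = problem.get("type")
    messages = None
    if isinstance(problem_type, str):
        messages = LOCALIZED_MESSAGES.get(problem_type)
    if messages and locale in messages:
        return messages[locale]
    generic = GENERIC_MESSAGES.get(locale, GENERIC_MESSAGES["en"])
    return f"{generic} (reference: {reference})"
```

Here `reference` could be something like a flow ID, so that the customer can quote it to support.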
1a. If the latter, only for display to a human (e.g. in a UI), or also for automatic processing decisions?
I'd say if there is an obvious automatic processing decision that can be taken (e.g. retry), the client can use `type` or `status` to make that decision, but it should not consider other fields.
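As a sketch of what such a decision could look like (Python; the retryable type URI and the choice of status codes are assumptions for the example, not something any guideline prescribes):

```python
from typing import Optional

# Assumed for illustration: statuses that usually indicate a transient failure.
RETRYABLE_STATUSES = {429, 502, 503, 504}
# Hypothetical problem type that an API might document as safe to retry.
RETRYABLE_TYPES = {"/problems/temporarily-unavailable"}


def should_retry(http_status: int, problem: Optional[dict] = None) -> bool:
    """Decide on a retry from `type` and the status only; ignore title/detail."""
    problem_type = (problem or {}).get("type")
    if isinstance(problem_type, str) and problem_type in RETRYABLE_TYPES:
        return True
    return http_status in RETRYABLE_STATUSES
```

The function deliberately never reads `title` or `detail`, so those can be reworded on the server without affecting client behavior.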
- What is the intended meaning of the type (+ title) member, compared to the status code?
`type` may be used to make automatic decisions in certain failure modes; `title` is an engineer-readable short version of `type`. They're application-specific status codes, while `status` should have the same meaning across all applications (as far as possible).
2a. Should the problem types be defined in your API? (How?) Should they be standardized across APIs?
I don't think they should be standardized across APIs and they don't necessarily need to be documented, unless they can/should be used for the automated decision making in the client. The different cases could be listed as different responses in the API spec, perhaps highlighting the type in the description?
2b. Can clients rely on the type attribute?
Yes, but I think in that case it should be documented in the API spec (or rather, vice versa), see 2a.
2c. Is this the same as an "error code"?
Yes. It's also good to differentiate between textual error codes and numerical error codes. `type` is only the same as the former.
- What is the meaning of the instance field?
It seems difficult to implement in many cases, but to me it seems like an identifier that points to a specific individual request error. E.g. `type` + a Flow ID + any other variable context.
3a. Is pointing to a specific occurrence of an error meant to point to a code location, or to a set of input data or time when this error happened (e.g. via flow-ID)?
I understand it more as the latter. The former I'd capture as a custom message in `detail` that you can Ctrl+F to find. I don't know if you could reference code location without leaking a lot of implementation details to the client (it feels closer to dumping a stack trace, which goes against REST#177).
3b. When is this needed/useful in addition to the type?
It seems a bit of an edge case, but it might be useful for debugging specific requests (e.g. looking up matching logs or traces).
- For cases where a server application doesn't have more detailed error information, should fields (e.g. type + title) be filled with default values replicating the status code information, or just be left empty/missing?
I think it's better to have a default fallback. It's not so bad for developers without one, but if you are exposing this to users, e.g. for internal apps, you otherwise have to implement a fallback in the client, which is less ideal.
In fact, I think for some cases you should always have the default / static type, e.g. for internal server errors, as it seems very unlikely that the client could make automated decisions based on internal API state, but I might be missing something. `type` seems more for client errors.
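A server-side default could be as small as the following sketch (Python; `make_problem` is a hypothetical helper, and using `about:blank` with the standard status phrase as the default is one option among those discussed in this thread, not a settled rule):

```python
from http import HTTPStatus
from typing import Optional


def make_problem(
    status: int,
    problem_type: Optional[str] = None,
    title: Optional[str] = None,
    detail: Optional[str] = None,
) -> dict:
    """Build a problem object, filling type and title from the status if absent."""
    if problem_type is None:
        problem_type = "about:blank"  # no more specific type is known
    if title is None:
        try:
            title = HTTPStatus(status).phrase  # e.g. "Internal Server Error"
        except ValueError:
            title = "Unknown Error"  # non-standard status code
    problem = {"type": problem_type, "title": title, "status": status}
    if detail is not None:
        problem["detail"] = detail
    return problem
```

With this, `make_problem(500)` yields a complete if generic object, while a handler that knows more can still pass its own `problem_type` and `title`.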
- What are compatibility concerns for problem objects? Is it an incompatible change if in some specific situation now a different type value is returned?
I'd say the API spec should be the deciding factor. If it's documented to return a specific type, it shouldn't change that behavior. Of course, in the real world it's a bit messier, so I would always try to keep `type` and `status` the same for the same class of problems. If they need to change, I'd treat it similarly to changing the status code.
The human readable messages (`title`, `detail`) should not be interpreted by clients, so they should be able to be changed without considering it to be a breaking change.
- How can/should the problem object be extended?
Not sure. There are two extensions I can think of off the top of my head, not sure if the guidelines have anything on these.
- Batch/bulk responses, what's the best way to provide multiple problems?
- Hierarchical problems. I believe there are some internal services that return problems like this, but it doesn't seem to be codified in the spec.
6a. Use the type as a discriminator for subtyping?
That seems sensible, but I'd only do that if it's really needed to drive a business case as otherwise it can get complicated very quickly.
6b. How can extended problems and generic ones coexist?
I don't see a big problem here, it's the same as other API responses. All extensions should only add fields, not change existing ones. Then consumers of the problems either just handle the generic problems or have the code to handle specific ones too. Extra unhandled fields should be ignored, as usual with JSON parsing.
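To illustrate the coexistence, a small Python sketch: a generic `Problem` container keeps the standard members and collects anything else as extensions, and a handler looks at extensions only for types it knows. The `/problems/validation-failed` type and its `invalid_fields` member are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Problem:
    type: str = "about:blank"
    title: Optional[str] = None
    status: Optional[int] = None
    detail: Optional[str] = None
    instance: Optional[str] = None
    # Members outside the standard set end up here and may be ignored.
    extensions: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Problem":
        known = {"type", "title", "status", "detail", "instance"}
        core = {k: v for k, v in data.items() if k in known and v is not None}
        extra = {k: v for k, v in data.items() if k not in known}
        return cls(extensions=extra, **core)


def handle(problem: Problem) -> str:
    """Use extension data for a known type, else fall back to generic handling."""
    if problem.type == "/problems/validation-failed":  # hypothetical type
        fields = problem.extensions.get("invalid_fields")  # hypothetical member
        if isinstance(fields, list):
            return "invalid: " + ", ".join(str(f) for f in fields)
    return problem.title or "Unknown error"
```

A consumer that only knows generic problems can skip `extensions` entirely and still parse every response.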
- Client robustness: Can clients rely on anything in the error object? If not, can they still use something?
I think this is on a case-by-case basis. Other points already touch on this quite a bit.
2a: If the API designer knows what errors can happen, they should mention them in the API definition. This can be done by extending the problem object and adding an x-extensible-enum to the type property. (You'd still need to list the meanings in a description field, or point to some external documentation.)
That's a good way to list them! I'd go with descriptions first as just the types can be endpoint-specific and are probably hard to interpret out of context (just in the enum itself).
If you have a bunch of related APIs, using common types for related errors makes sense (and using different types for different errors makes even more sense) – this makes it easier for clients who relate to multiple APIs, as they now can share error handlers.
I'm finding it difficult to see cases of two APIs being so similar that they share the same errors and even further that the clients can respond to the errors in the same way across multiple APIs. I'd tread carefully here as it seems easy to make assumptions on error semantics leading to unexpected behavior (= incidents) later. Did you have some specific examples in mind here?
2b: Clients can not rely on the type property being there, or having just values which are known. There could always be some infrastructure component injecting errors, which doesn't know your problem types.
Yeah that's a good point, there should always be an "unknown error" fallback on the client-side as well to handle cases where you e.g. don't even get JSON. At the same time, it seems useful to also implement fallbacks in the application for common cases (e.g. internal errors), so that you get a little more detail.
- If you don't have additional information to the status code, you can just omit the type (i.e. default it to `about:blank`) and title. A client can generate a generic error message when hitting an almost-empty problem object.
There are two cases.
- Fallback type provided, e.g. `/problem/internal-error`. This indicates to the client/user that we know that there was an error, but there's nothing you can do about it. In other words, the call failed successfully. 😅
- Unknown error (no `type`). We don't even know what happened, maybe the call didn't even work.
I don't know if it makes sense to differentiate between them, but it might be useful for engineers at least?
- Clients can use properties in the problem object (especially those which are documented in the specification), but they can't rely on them being there, or having specific values. In case a problem object doesn't have needed information, the client needs to fall back to some reasonable default behavior (similar to when there is an error without a problem object).
I'm not sure what this means exactly. If the client can't rely on e.g. "wrong price" being returned when the "price" constraint fails to show the right localized error message, what's the point in even having `type` there in the first place? Of course it doesn't make sense to expect a specific call failing in a specific way, because if you do, why would you even make it? :)
tl;dr: After going through the discussion, it seems the guidelines and/or the Problem schema would benefit from additional examples and explanation to make it clearer how to use the different fields and when.
I would discourage extending the Problem object to keep it simple. If it's really really needed then sure, but it should be a high bar. Two cases I can think of to look into might be batch/bulk and hierarchical problems. I'd love to see more use-cases if there are any though, because I might be missing something.
My 2 cents on this issue:
- 1. The problem details are usually not described in the API and must be expected to change over time. As long as the API does not define/enumerate values for `type` and `instance` to signal the commitment to keep problem information stable and define clear semantics, these should not be used by client applications to decide on actions.
  - If the semantics of the different values of `type` and `instance` are properly defined in the problem description, or the problem object states an explicit commitment to maintain the semantics, clients may make use of these fields to determine actions.
- 2. `type` values should be more specific than `status` codes to make best use of the problem information. However, since systems often lack more detailed information when encountering an error, it is okay in my opinion to fall back on a status code name when a better definition is missing.
  - The types need only be defined in the API when the API wants to signal a commitment to maintain stable error information that can be used by clients to derive actions.
- 3. The `instance` field should serve the purpose of defining a clear error instance from which the developer can derive the source location of the error and optionally also the call context of the error, i.e. the set of parameters relevant for the error. The latter may include the `flow-id` but should usually be more specific, if supported.
- 4. I agree with @SmilyOrg that blanking is a bad thing here since it requires clients/users to fill the semantic gap. So if no additional information is available, these should be prefilled with status code specific fallback information.
- 5. If problem resources are specified sufficiently to declare a commitment on their stability, I think the same compatibility requirements as for regular resources are expected to hold.
- 6. I think the idea of extended problems is to provide a distinct schema to express the context information of a specific error, which would otherwise be aggregated by the `instance` and `description` fields, to support clearer (unified) semantics and to express the commitment to support certain client side actions. I would encourage extending the problem and providing more structured information whenever an API wants to make a commitment to provide well defined, stable error behavior.
- 7. As long as the problem is only specified generically, clients should not rely on anything and be aware that information may instantly change. The (extended) problem specification should clearly specify on what clients can rely.
@tfrauenstein I think you agreed a few months back to compile these suggestions into some kind of proposal.
hint: pls. see also API Guild Initial discussion from May and RFC discussion from September