`HttpError::for_status` asserts when passed a 500-level error
gjcolombo opened this issue · 6 comments
for_status
takes any HTTP status code as an input, but it unconditionally passes the code to for_client_error
, which asserts that the status is 400-level:
dropshot/dropshot/src/error.rs
Lines 204 to 215 in c719b41
dropshot/dropshot/src/error.rs
Lines 143 to 155 in c719b41
I have some error handling code that's tripping this assert: it calls an external service, gets back a Progenitor result, checks to see if the result contains a status, and if so passes that status to HttpError::for_status
to re-encode it as an HTTP error. If there's a more idiomatic way to do this specific conversion, that'd be great too, but it seems strange that for_status
theoretically allows any status but will explode on a subset of them.
@bnaecker also hit something similar in https://github.com/oxidecomputer/dendrite/pull/1047
@davepacheco I think we might either remove the assert or move to a more type-safe mechanism to exclude non-4xx status codes.
@ahl What do you think about having see next commentfor_status
not use for_client_error
(so that we still enforce it for for_client_error
)?
(Sorry for the multi-comment)
First, sorry for the bug! This sounds like a pretty annoying one.
I did a bit more digging. Most of this code hasn't changed since June, 2020 (very early Dropshot days) and I think this was probably an old oversight. But I think I understand a bit better why it looks like this and it has to do with the internal messages for 500-level errors. Recall that:
- All errors have both an internal message and an external message.
- For 500-level errors, the external message is generally something generic ("Internal Server Error" or "Service Unavailable") and the internal message is caller-provided. We assume that the internal message has private server implementation details that are irrelevant to the client and maybe even sensitive.
- For 400-level errors, the internal message and external message are usually the same. Since these are client errors, we figure there's no need to include private server implementation details in either message.
This behavior is enforced by various constructors:
HttpError::for_internal_error()
(always produces a 500 with external message "Internal Server Error" and uses the caller-provided message for the internal message)HttpError::for_unavail()
(always produces a 503 with external message "Service Unavailable" and uses the caller-provided message for the internal message)HttpError::for_bad_request
(always produces a 400 with both external and internal messages matching the caller-provided message)HttpError::for_not_found
(this one's a bit of an exception; it produces a 404 with external message "Not Found" and internal message whatever was provided by the caller)HttpError::for_client_error()
(caller chooses the status code, which must be a 400-level error; both the internal and external messages match the caller-provided message).
In retrospect I think HttpError::for_status()
was created for the many 400-level errors where no additional context is needed (e.g., 405 ("Method Not Allowed")) and so the internal and external messages can just be the generic message for that status code. Indeed, the only caller in Dropshot today is using it to generate a 405. Here, there's nothing useful for the caller to put in either the internal or external message.
You could imagine an HttpError::for_status
that looked at whether the status code you gave it was 400-level (in which case it does what HttpError::for_status()
does today, which is to generate a generic message for the status code and use it for both the internal and external message) or 500-level (in which case it would generate a generic message for both internal and external messages). But this seems almost certainly not what we want for 500-level errors. This is saying: there was a problem on the server side (so there's no way the client could know why it failed) and we're declining to put any information in the log about the failure.
Instead I'd suggest that we:
- Rename it
for_generic_client_error
or the like. - Figure out what gap that leaves in the API, if any, and how best to fill it. I tried to dig into this a bit.
- From oxidecomputer/dendrite#1047, it sounds like there'd be no gap if we removed
HttpError::for_status()
. We'd just useHttpError::for_unavail
(which is good). - @bnaecker mentioned there we'd hit this a few other times. I'm sorry about that! It'd be useful to know if those were 500 and 503s, and if so, whether it'd have been fine if we just didn't have
HttpError::for_status()
. - @gjcolombo's use case above sounds a bit trickier. I wasn't clear on exactly what we're trying to do there and whether some other utility would be useful here.
- From oxidecomputer/dendrite#1047, it sounds like there'd be no gap if we removed
Sorry again for the trouble.
@davepacheco Thanks for the historical context, that's very useful. I think renaming it to HttpError::for_generic_client_error()
would be perfectly fine. We can put a # Panics
section in the docstring too, which I think would be useful. I was definitely confused by the fact that the name makes it appear like any status code would be valid, and the docstring doesn't clarify which kind is allowed today, as far as I could tell.
It'd be useful to know if those were 500 and 503s, and if so, whether it'd have been fine if we just didn't have HttpError::for_status().
Yeah, the most recent time I hit this was trying to return a 503. I'm aware of for_unavail()
, which I should have used directly. Removing for_status()
would also be fine by me -- we have the constructors for 500-level errors you mentioned above, and we already have for_client_error()
, in which case I would be fine with the existing assertion.
My thanks as well for the extra context here--I don't think I appreciated any of these details when I originally filed this issue. There are a couple of places in the Omicron sources where we've referred to it:
-
https://github.com/oxidecomputer/omicron/blob/81c808c31bfda62bb371a12a049bfb846f88624e/nexus/src/app/instance.rs#L117 - this is generating an error when Nexus gets a
CommunicationError
from a call to sled agent; it wants to return a 502 or 504, but there are no handy constructors for these, so it callsfor_internal_error
and then munges the status code instead. (This actually seems like a bug--the status code is right in this case but the message will still be "Internal Server Error".) -
https://github.com/oxidecomputer/omicron/blob/81c808c31bfda62bb371a12a049bfb846f88624e/sled-agent/src/sled_agent.rs#L207 - this is the case I was referring to in the OP. It's reached when sled agent has called Propolis, gotten an error, and repackaged it into a sled agent error type, which now needs to be converted to an
HttpError
so that sled agent can return it to Nexus. IIRC my intention here was to preserve the error from Propolis whenever possible; the code does this by passing client errors tofor_status
, converting 503 errors usingfor_unavail
, and turning everything else into a 500 usingfor_internal_error
. In the latter two cases the internal messages are generated by converting the original Propolis error to a string.
In the former case I think it'd be fine just to have for_bad_gateway
and for_gateway_timeout
constructors. In the latter case it seems like it might be easiest just to construct an HttpError
from scratch, preserving the status code received from Propolis and setting the internal_message
appropriately.
I was definitely confused by the fact that the name makes it appear like any status code would be valid, and the docstring doesn't clarify which kind is allowed today, as far as I could tell.
Totally -- this is pretty bad. It's more than reasonable for someone to expect they could use this for a 500-level error!
There are a couple of places in the Omicron sources where we've referred to it:
In the former case I think it'd be fine just to have
for_bad_gateway
andfor_gateway_timeout
constructors.
Tell me if you disagree but I don't feel like 502/504 are great for the external client to get for these cases. Although Nexus is sort of acting like a gateway, that's an implementation detail the client doesn't need to know or care about. I think from their perspective, this is a 503 -- the "server" (that is, the Oxide system) was unavailable.
I think it's fine if we do want to have specific functions for 502/504 or (I'd probably prefer) a for_server_error(status_code: StatusCode, internal_message: &str)
that someone could use to create any 500-level error with the generic external message and the specified internal message. But my gut would be to use 503/for_unavail
here.
In the latter case it seems like it might be easiest just to construct an HttpError from scratch, preserving the status code received from Propolis and setting the internal_message appropriately.
Yeah that makes sense to me at least for now. (In the limit I could see wanting Sled Agent to grok all the specific errors Propolis knows about and repackaging each one, and creating a 500 for unexpected errors. But that seems like a lot of overhead right now for probably not much benefit.)
Thank you both for the detail!