oxidecomputer/dropshot

`HttpError::for_status` asserts when passed a 500-level error

gjcolombo opened this issue · 6 comments

for_status takes any HTTP status code as an input, but it unconditionally passes the code to for_client_error, which asserts that the status is 400-level:

/// Generates an `HttpError` for the given HTTP `status_code` where the
/// internal and external messages for the error come from the standard label
/// for this status code (e.g., the message for status code 404 is "Not
/// Found").
pub fn for_status(
error_code: Option<String>,
status_code: http::StatusCode,
) -> Self {
// TODO-polish This should probably be our own message.
let message = status_code.canonical_reason().unwrap().to_string();
HttpError::for_client_error(error_code, status_code, message)
}

pub fn for_client_error(
error_code: Option<String>,
status_code: http::StatusCode,
message: String,
) -> Self {
assert!(status_code.is_client_error());
HttpError {
status_code,
error_code,
internal_message: message.clone(),
external_message: message,
}
}

I have some error handling code that's tripping this assert: it calls an external service, gets back a Progenitor result, checks to see if the result contains a status, and if so passes that status to HttpError::for_status to re-encode it as an HTTP error. If there's a more idiomatic way to do this specific conversion, that'd be great too, but it seems strange that for_status theoretically allows any status but will explode on a subset of them.

ahl commented

@bnaecker also hit something similar in https://github.com/oxidecomputer/dendrite/pull/1047

@davepacheco I think we might either remove the assert or move to a more type-safe mechanism to exclude non-4xx status codes.

@ahl What do you think about having for_status not use for_client_error (so that we still enforce it for for_client_error)? see next comment

(Sorry for the multi-comment)

First, sorry for the bug! This sounds like a pretty annoying one.

I did a bit more digging. Most of this code hasn't changed since June, 2020 (very early Dropshot days) and I think this was probably an old oversight. But I think I understand a bit better why it looks like this and it has to do with the internal messages for 500-level errors. Recall that:

  • All errors have both an internal message and an external message.
  • For 500-level errors, the external message is generally something generic ("Internal Server Error" or "Service Unavailable") and the internal message is caller-provided. We assume that the internal message has private server implementation details that are irrelevant to the client and maybe even sensitive.
  • For 400-level errors, the internal message and external message are usually the same. Since these are client errors, we figure there's no need to include private server implementation details in either message.

This behavior is enforced by various constructors:

  • HttpError::for_internal_error() (always produces a 500 with external message "Internal Server Error" and uses the caller-provided message for the internal message)
  • HttpError::for_unavail() (always produces a 503 with external message "Service Unavailable" and uses the caller-provided message for the internal message)
  • HttpError::for_bad_request (always produces a 400 with both external and internal messages matching the caller-provided message)
  • HttpError::for_not_found (this one's a bit of an exception; it produces a 404 with external message "Not Found" and internal message whatever was provided by the caller)
  • HttpError::for_client_error() (caller chooses the status code, which must be a 400-level error; both the internal and external messages match the caller-provided message).

In retrospect I think HttpError::for_status() was created for the many 400-level errors where no additional context is needed (e.g., 405 ("Method Not Allowed")) and so the internal and external messages can just be the generic message for that status code. Indeed, the only caller in Dropshot today is using it to generate a 405. Here, there's nothing useful for the caller to put in either the internal or external message.

You could imagine an HttpError::for_status that looked at whether the status code you gave it was 400-level (in which case it does what HttpError::for_status() does today, which is to generate a generic message for the status code and use it for both the internal and external message) or 500-level (in which case it would generate a generic message for both internal and external messages). But this seems almost certainly not what we want for 500-level errors. This is saying: there was a problem on the server side (so there's no way the client could know why it failed) and we're declining to put any information in the log about the failure.

Instead I'd suggest that we:

  • Rename it for_generic_client_error or the like.
  • Figure out what gap that leaves in the API, if any, and how best to fill it. I tried to dig into this a bit.
    • From oxidecomputer/dendrite#1047, it sounds like there'd be no gap if we removed HttpError::for_status(). We'd just use HttpError::for_unavail (which is good).
    • @bnaecker mentioned there we'd hit this a few other times. I'm sorry about that! It'd be useful to know if those were 500 and 503s, and if so, whether it'd have been fine if we just didn't have HttpError::for_status().
    • @gjcolombo's use case above sounds a bit trickier. I wasn't clear on exactly what we're trying to do there and whether some other utility would be useful here.

Sorry again for the trouble.

@davepacheco Thanks for the historical context, that's very useful. I think renaming it to HttpError::for_generic_client_error() would be perfectly fine. We can put a # Panics section in the docstring too, which I think would be useful. I was definitely confused by the fact that the name makes it appear like any status code would be valid, and the docstring doesn't clarify which kind is allowed today, as far as I could tell.

It'd be useful to know if those were 500 and 503s, and if so, whether it'd have been fine if we just didn't have HttpError::for_status().

Yeah, the most recent time I hit this was trying to return a 503. I'm aware of for_unavail(), which I should have used directly. Removing for_status() would also be fine by me -- we have the constructors for 500-level errors you mentioned above, and we already have for_client_error(), in which case I would be fine with the existing assertion.

My thanks as well for the extra context here--I don't think I appreciated any of these details when I originally filed this issue. There are a couple of places in the Omicron sources where we've referred to it:

  • https://github.com/oxidecomputer/omicron/blob/81c808c31bfda62bb371a12a049bfb846f88624e/nexus/src/app/instance.rs#L117 - this is generating an error when Nexus gets a CommunicationError from a call to sled agent; it wants to return a 502 or 504, but there are no handy constructors for these, so it calls for_internal_error and then munges the status code instead. (This actually seems like a bug--the status code is right in this case but the message will still be "Internal Server Error".)

  • https://github.com/oxidecomputer/omicron/blob/81c808c31bfda62bb371a12a049bfb846f88624e/sled-agent/src/sled_agent.rs#L207 - this is the case I was referring to in the OP. It's reached when sled agent has called Propolis, gotten an error, and repackaged it into a sled agent error type, which now needs to be converted to an HttpError so that sled agent can return it to Nexus. IIRC my intention here was to preserve the error from Propolis whenever possible; the code does this by passing client errors to for_status, converting 503 errors using for_unavail, and turning everything else into a 500 using for_internal_error. In the latter two cases the internal messages are generated by converting the original Propolis error to a string.

In the former case I think it'd be fine just to have for_bad_gateway and for_gateway_timeout constructors. In the latter case it seems like it might be easiest just to construct an HttpError from scratch, preserving the status code received from Propolis and setting the internal_message appropriately.

I was definitely confused by the fact that the name makes it appear like any status code would be valid, and the docstring doesn't clarify which kind is allowed today, as far as I could tell.

Totally -- this is pretty bad. It's more than reasonable for someone to expect they could use this for a 500-level error!

There are a couple of places in the Omicron sources where we've referred to it:

In the former case I think it'd be fine just to have for_bad_gateway and for_gateway_timeout constructors.

Tell me if you disagree but I don't feel like 502/504 are great for the external client to get for these cases. Although Nexus is sort of acting like a gateway, that's an implementation detail the client doesn't need to know or care about. I think from their perspective, this is a 503 -- the "server" (that is, the Oxide system) was unavailable.

I think it's fine if we do want to have specific functions for 502/504 or (I'd probably prefer) a for_server_error(status_code: StatusCode, internal_message: &str) that someone could use to create any 500-level error with the generic external message and the specified internal message. But my gut would be to use 503/for_unavail here.

In the latter case it seems like it might be easiest just to construct an HttpError from scratch, preserving the status code received from Propolis and setting the internal_message appropriately.

Yeah that makes sense to me at least for now. (In the limit I could see wanting Sled Agent to grok all the specific errors Propolis knows about and repackaging each one, and creating a 500 for unexpected errors. But that seems like a lot of overhead right now for probably not much benefit.)

Thank you both for the detail!