checking for NaN/infinity automatically
BurntSushi opened this issue · 2 comments
I'm ignorant of the underlying issues here, but I missed the part in the docs for Buffer::format
about the fact that it doesn't check for NaN/infinite floats for you. This led to a bug introduced on my behalf: BurntSushi/rust-csv#153
It's my fault for not reading the docs carefully, but is there a reason why this isn't handled for you automatically? In particular, the lower level APIs, such as d2s_buffered_n
, do handle this for you.
It seems to me like most uses of Buffer::format
will just need to wrap it in its own case analysis. If so, does it make sense to perhaps add a new API (at the same abstraction level as Buffer::format
) that does the case analysis for you?
I think the idea is that downstream code should consider what output they need for non-finite floats. If serde_json ignored the issue and were emitting something like {"float": NaN}
via some default behavior then that would be invalid JSON and very bad. Is there a use case where
- the caller would care for the performance of ryu or the formatting differences enough to use ryu over std::fmt,
- but is okay with whatever ryu wants to produce for non-finites without looking into what that is,
- and at the same time isn't okay with non-finites showing some safe unspecified string?
In other words adding some default representations would not absolve the caller of needing to evaluate whether those representations are correct for them. Code that forgot to think about and test non-finite would still likely be wrong. Meanwhile as downsides, either code that did think about and handle non-finites in their use-case specific correct way would be slower because of the redundant checks, or else we would need to double the surface area of this crate to expose functions with and without non-finite handling.
The d2s_buffered_n function is intended to match C ryu exactly while the rest of the crate isn't. I guess matching C exactly has been convenient for benchmarking vs C but isn't a good API; we can remove it.
I would be on board with documenting the current behavior more prominently in the crate level doc. Alternatively I may be convinced to split Buffer::format
/format_finite
where format
has some reasonable default behavior if such a thing exists, while format_finite
has the current behavior -- but it's not clear that there is any single default behavior that would be correct for sufficient fraction of callers to make this worthwhile.
I think CSV fits exactly into the use case you're describing. In particular, it solidly hits all three of your bullet points:
- I routinely see float-to-string (and string-to-float) conversions show up in a profile whenever parsing CSV data that ends up needing to parse floats. Here's an example with a few nice benchmark results: BurntSushi/rust-csv@6105a6a --- The formatting would be nice to control a bit more. std gets part of the way there via
format!
, but that ends up being slow. ryu helps here a bit by preferring more "human" friendly outputs, so the need for precision control is perhaps a bit less. But still, I'd happily use ryu for cases where you don't need output precision control, and std's stuff when you do. - I think as long as whatever ryu produces for non-finites can be roundtripped and parsed by string-to-float conversions, I'm happy.
- I think folks would definitely want to know if there calculations result in
NaN
rather than being potentially tricked with an unspecified decimal. CSV is used a lot of this sort of thing, and speaking from experience, introducing a bug into a calculation that results in aNaN
is pretty easy. (Non-finites less so, but I suspect the principle still applies.)
I would be on board with documenting the current behavior more prominently in the crate level doc. Alternatively I may be convinced to split Buffer::format/format_finite where format has some reasonable default behavior if such a thing exists, while format_finite has the current behavior -- but it's not clear that there is any single default behavior that would be correct for sufficient fraction of callers to make this worthwhile.
It does honestly kind of feel like the default here really should include sensible formatting for non-finites and NaNs. This would match std's behavior. But I am speaking through the lense of an end user with very little domain knowledge. From my perspective, having to repeat the case analysis every time I use ryu seems like a bummer. I guess the flip side is that an exceedingly common format, such as JSON, explicitly wants to alter this case analysis.
Perhaps CSV is a bit of a frankenstein in this respect, since there is no spec that governs floating point values in CSV files. Everything is just strings as far as the CSV spec goes (which isn't very far). But in most other formats, I imagine the formatting of floating point numbers is more tightly controlled, such that default handling of non-finites might not be right, and indeed, using explicit case analysis would probably be preferable.
Outside of that, I guess other use cases would involve ad hoc printing of floats. It's honestly not too hard in my experience to get bottlenecked by std's float-to-string conversion, so I'd probably be quick to use ryu in the future for anything in this regard. A sensible default would be great here. But as you say, it's not clear how often this arises in practice.
The d2s_buffered_n function is intended to match C ryu exactly while the rest of the crate isn't. I guess matching C exactly has been convenient for benchmarking vs C but isn't a good API; we can remove it.
Ah gotya. I missed that!