BurntSushi/jiff

strtime accepts formats with multiple format specifiers of the same type

Closed this issue · 8 comments

#57 is currently getting stuck on inputs like:

parse("%f%f", "10010101111111")

Because the second %f specifier overwrites the previous. To my understanding, there isn't really a case where one might want to use the same specifier twice (as it would necessarily just overwrite the other). As such, the format should probably be rejected if the same specifier is used twice.

It would indeed be strange to use a specifier twice, but it doesn't seem totally crazy in cases where components of a datetime are repeated. Otherwise, there are two reasons why I don't think this should return an error:

  1. There is nothing technically invalid about it. Like, it doesn't make the format string wrong, and its actual behavior is pretty reasonable and predictable: it overwrites what was there before.
  2. There would likely be a perf hit for dealing with this error case since you'd need to keep track of everything you've seen before. The strtime format currently doesn't require any such tracking. This would be quite unfortunate because the error checking would be for a very rare case of misuse, but would slow down every parse.

Can you say more about what it means for it to get stuck? Is there a way to make it unstuck without rejecting duplicate specifiers?

it doesn't seem totally crazy in cases where components of a datetime are repeated.

It would be impossible to determine if things are actually repeated then, no? I would argue that in this case, the values should be guaranteed to be equal!

There would likely be a perf hit for dealing with this error case since you'd need to keep track of everything you've seen before.

I somewhat doubt that a single branch to determine if the value is already filled would be too devastating to performance, especially in comparison to the rest of the parsing routine. If nothing else, having a safety mode to check this would probably be good, since duplicated specifiers is likely accidental.

Can you say more about what it means for it to get stuck?

The round-trip fuzzers rely on the fact that data remains consistent. Since the information is inconsistently determined (in this case, by overwriting parts of the input), we can't use it as an oracle for "is the parsing/unparsing consistent".

Is there a way to make it unstuck without rejecting duplicate specifiers?

We would need to drop the round-trip check or reject duplicate specifiers. I don't think there's any other (reasonable) option.

It would be impossible to determine if things are actually repeated then, no? I would argue that in this case, the values should be guaranteed to be equal!

Not necessarily... Just this past weekend, I had to add support for parsing weekdays that were inconsistent with the date because git supports it, and this was necessary flexibility to get Jiff integrated into gitoxide.

I somewhat doubt that a single branch to determine if the value is already filled would be too devastating to performance, especially in comparison to the rest of the parsing routine. If nothing else, having a safety mode to check this would probably be good, since duplicated specifiers is likely accidental.

It's not a single branch though. It's memory. There is no memory right now. Duplicate checking requires memory.

I'm opposed to a separate safety mode as well for reasons of API design that are difficult to articulate concisely.

We would need to drop the round-trip check or reject duplicate specifiers. I don't think there's any other (reasonable) option.

Probably the route I'd go then is to make a little mini-parser in the fuzzer that rejects the duplicate specifiers and sacrifice fuzz coverage of that area. I can work on that when I get a chance. It shouldn't be hard.

Probably the route I'd go then is to make a little mini-parser in the fuzzer that rejects the duplicate specifiers and sacrifice fuzz coverage of that area. I can work on that when I get a chance. It shouldn't be hard.

If it would be possible to track this in the normal parsing routine with #[cfg(fuzzing)]-based code changes? I'm thinking, some kind of Tracking struct that's normally a ZST but when #[cfg(fuzzing)] is present, it checks for duplicates?

Possible... I suppose. I'm not a huge fan of littering the code with fuzz-specific stuff, but I can abide if it helps unblock you. But I'll probably eventually want it moved into the fuzzer.

Sure, understood. This wouldn't be a public-facing API change and you can use it in other testing code as well.

I'm going to close this out because I don't think we should implement the restriction as requested.

If you would benefit from some small cfg-gated changes in the parser for fuzzing, I'd accept a PR. But my tolerance for that sort of thing is honestly not too high.