rust-bakery/nom

separated_list1 by multipspace0 sep failed

yejunjin opened this issue ยท 16 comments

Hello, and thank you for submitting an issue to nom!

First, please note that, for family reasons, I have limited time to work on
nom, so following the advice here will make sure I will quickly understand
your problem and answer as soon as possible.
Second, if I don't get to work on your issue quickly, that does not mean I
don't consider it important or useful. Major version releases happen once
a year, and a lot of fixes are done for the occasion, once I have had time
to think of the right solution. So I will get back to you :)

Prerequisites

Here are a few things you should provide to help me understand the issue:

  • Rust version : rustc 1.71.1 (eb26296b5 2023-08-03)
  • nom version :7.1.3
  • nom compilation features used: default

Test case

Please provide a short, complete (with crate import, etc) test case for
the issue, showing clearly the expected and obtained results.

Example test case:

    #[test]
    fn test_separated_list1() {
        fn parser(input: &str) -> IResult<&str, Vec<&str>> {
            separated_list1(multispace0, tag("abc"))(input)
        }
        assert_eq!(parser("abcabc"), Ok(("", vec!["abc", "abc"])));
       // thread 'parser::tests::test_cypher' panicked at 'assertion failed: `(left == right)`
       // left: `Err(Error(Error { input: "abc", code: SeparatedList }))`,
       // right: `Ok(("", ["abc", "abc"]))`
    }
epage commented

While its not in the docs, in the code is this comment

infinite loop check: the parser must always consume

Meaning, the separator parser. I assume this is protecting against writing code where both the separator and the core type are allowed to be empty, you'd get an infinite loop, so nom seems to say the core type may be empty but the separator must not be.

Maybe related with #1573

The sep parser in seperated_list must not be optional because if it is not present then consecutive items become just one item.

Perhaps the required sep parser is a alternate separators,

alt((sep1, sep2, ...))

I ran into the same thing. Is there a more accepted pattern currently for a sequence of items that are optionally separated by whitespace? I've switched something like a many1(preceded(multispace0, my_parser)), but that doesn't work exactly the same because it allows leading separators (which isn't actually a problem in my context, at least), and separated_list1 would read more clearly.

If the separator is optional then what is to separate two consecutive elements when the separator is absent?

without testing, this seems to be what you might want...

many1(tuple((element, many0(sep))))

Rust saves us from ourselves ;)

The return type on that isn't great, though. You'd have to map the tuple combinator to get the elements on their own. preceded or terminated are probably better options.

Geal commented

that check for non consuming input in repeating parser dates from the very beginning of nom. People kept running into issues with parsers going into infinite loop, in particular due to optional whitespace. Even now in some user tests without the check people trip on it ๐Ÿ˜…

Sure, that makes sense, but I think it would be more robustly handled by returning an error in the case that both sep and f both parse empty in a row, rather than one or the other. As long as one of the two isn't parsing empty, there's no risk of an infinite loop.

An optional sep in a separated list is conceptually invalid because if the separator does not exist the the elements are not separated and you then do not have the presence of two elements but one. Perhaps what is regarded as an optional is really an alternate combinator such as comma or space.

Another situation is when the elements themselves are bracketed and the separator is purely cosmetic such as a sequence of quoted bracketed strings. However, this later case isn't a separated list but a sequence of elements with optional suffix (or prefix).

I wonder if we cand define traits and auto implement them against parser results so as to use bounds to better document this functionality.

...pseudo

trait Consuming {}
trait NonConsuming {}
impl<..> Consuming for fn(I)->IResult<..>
impl<..> NonConsuming for fn(I)->IResult<I,Option<O>,E>
impl<..> NonConsuming for fn(I)->IResult<I,Result<O,E2>,E>

An optional sep in a separated list is conceptually invalid

You present this as fact when it's really just your opinion. The difference between separated_list and a prefix/suffix per element is that the former only looks for the separator between pairs of elements, while the latter also looks for a separator at the very beginning (prefix) or end (suffix). They are not equivalent, regardless of whether the separator is optional.

if the separator does not exist the the elements are not separated and you then do not have the presence of two elements but one

That depends on what's being parsed. In my case, I have a list of {...} blocks, which I want to be separated by any amount of whitespace (including zero whitespace). I don't want to allow leading or trailing whitespace, and I always want at least one block. separated_list1(multispace0, block) would have been perfect for my case, but it gave an unhelpful error instead, which is how I found this issue.

Perhaps what is regarded as an optional is really an alternate combinator such as comma or space.

That's not a problem, because if it's not optional, it's not optional. If somebody uses alt((tag(','), tag(' '))) or something, it can't be skipped anyway.

separated_list1(foo, bar) is a parser that has a mandatory list of bar items separated by mandatory foo items.

separated_list1(opt(foo), bar) is a parser with bar items possibly separated by foo items.

In no situation should it be unclear whether the separator is optional or not, because you can always use a mandatory combinator if you want it to be mandatory. This is one of the major strengths of combinators, that you can build exactly the behavior you want from the building blocks. The suggestion is not making the separator optional, but allowing it to be a potentially non-consuming parser.

Actually, the reason I am on this thread is because I had a similar presumption about this functionality for the optionality of sep. Specifically, that sep should be capable of being optional due to the fact that it, sep, and the element parser are self delimiting maximally consuming independent parsers. Perhaps a new version of separated_list, separatedb_list, can be added to the library that implements this wildly expected behaviour.

Incidentally, my apologizes if I sound emphatic or definitive. I am an infinite novice in this and many things. :)

// test cases: 
(sep, element) => [element]
(sep, opt_element) => [opt_element]
(opt_sep, element) => [element]
(opt_sep, opt_element) => [opt_element]
(element, element) => [element;1]
(opt_element, opt_element) => Err
Geal commented

I have moved the loop check in separated_list0 and separated_list1 to cover the application of both parsers: #1756
As @Taywee said

Sure, that makes sense, but I think it would be more robustly handled by returning an error in the case that both sep and f both parse empty in a row, rather than one or the other. As long as one of the two isn't parsing empty, there's no risk of an infinite loop.

This will offer enough protection against infinite loops

We should probably remove the comment that sep must be consuming...

/// # Arguments
/// * `sep` Parses the separator between list elements. Must be consuming.
/// * `f` Parses the elements of the list.