How to parse until a range of tags
frenetisch-applaudierend opened this issue · 3 comments
I would like to parse arbitrary text with embedded sequences which are delimited by different tags into their parts. E.g.
Test <#embedded sequence 1#> and (*embedded sequence 2*)
should be parsed to Text("Test "), Embedded1("embedded sequence 1"), Embedded2("embedded sequence 2"). Ideally all strings in the token should be borrowed from the input string.
The embedded sequences are straightforward, but I can't work out how to specify the parser for the Text tokens. Is it possible to take_until any one of a range of tags is encountered?
Hello @frenetisch-applaudierend.
I think the fifth chapter of The Nom Guide (Nominomicon) may be able to address your question.
Hi @coalooball
Thanks for the link, I haven't seen that one before!
However, I don't think it applies to my use case, since the parsers mentioned there only accept a predicate on single characters. I would need a predicate built from other parsers (i.e. take_until(tag("<#").or(tag("(*")))), but this does not seem to be handled (or I did not see it).
Hello again!
take_until really doesn't work that way, since it is the equivalent of a terminal node in BNF: it takes a single fixed tag, not another parser. I suppose you could use terminated to extract the arbitrary text.
Here is my method, which is a bit cumbersome; I don't know whether there is a more concise one:
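use nom::{
    branch::alt,
    bytes::complete::{tag, take_till, take_while1},
    character::{is_alphanumeric, is_space},
    sequence::{delimited, terminated},
    IResult,
};

// True for the inner delimiter bytes '*' and '#'.
fn is_delimiter(s: u8) -> bool {
    s == b'*' || s == b'#'
}

// Parses one embedded sequence, `<#...#>` or `(*...*)`, and returns its content.
// The opening and closing delimiters are matched independently, so mixed pairs
// such as `<*...#)` are accepted too, and the content cannot contain '*' or '#'.
fn embedded_sequence(s: &[u8]) -> IResult<&[u8], &[u8]> {
    delimited(
        alt((tag(b"<"), tag(b"("))),
        delimited(
            alt((tag(b"#"), tag(b"*"))),
            take_till(is_delimiter),
            alt((tag(b"#"), tag(b"*"))),
        ),
        alt((tag(b">"), tag(b")"))),
    )(s)
}

// Parses leading text (alphanumerics and spaces only) that is followed by one
// embedded sequence. Returns the text; the embedded sequence is consumed and dropped.
fn parse(s: &[u8]) -> IResult<&[u8], &[u8]> {
    terminated(
        take_while1(|x| is_alphanumeric(x) || is_space(x)),
        embedded_sequence,
    )(s)
}

fn main() {}

#[test]
fn test_embedded_sequence() {
    // The content is returned and the trailing input is left over.
    assert_eq!(
        embedded_sequence(b"<#embedded sequence 1#>111").unwrap(),
        (b"111".as_ref(), b"embedded sequence 1".as_ref())
    );
    // Only the leading text is returned; the first embedded sequence is consumed.
    assert_eq!(
        parse(b"Test <#embedded sequence 1#> and (*embedded sequence 2*)").unwrap(),
        (b" and (*embedded sequence 2*)".as_ref(), b"Test ".as_ref())
    );
}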
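For the original question (stop at whichever of several tags comes first, and keep everything borrowed from the input), the idea can also be sketched in plain Rust without nom. This is only a minimal illustration, not nom's API: the names take_until_any, tokenize and Token are hypothetical, the String error type is an assumption, and the text between two embedded sequences is emitted as its own Text token.

```rust
/// Tokens that borrow from the input string; variant names follow the issue's example.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Embedded1(&'a str),
    Embedded2(&'a str),
}

/// Hypothetical helper: splits `input` at the earliest occurrence of any of `tags`.
/// Returns (text before the tag, rest starting at the tag). If no tag occurs,
/// the whole input is returned as text and the rest is empty.
fn take_until_any<'a>(input: &'a str, tags: &[&str]) -> (&'a str, &'a str) {
    let cut = tags
        .iter()
        .filter_map(|tag| input.find(*tag))
        .min()
        .unwrap_or(input.len());
    input.split_at(cut)
}

/// Splits the whole input into Text / Embedded1 / Embedded2 tokens.
/// An opening tag without its closing tag is reported as an error (an assumption;
/// the issue does not say how unterminated sequences should be handled).
fn tokenize<'a>(mut input: &'a str) -> Result<Vec<Token<'a>>, String> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        if let Some(rest) = input.strip_prefix("<#") {
            let end = rest.find("#>").ok_or("unterminated <# ... #>")?;
            tokens.push(Token::Embedded1(&rest[..end]));
            input = &rest[end + 2..];
        } else if let Some(rest) = input.strip_prefix("(*") {
            let end = rest.find("*)").ok_or("unterminated (* ... *)")?;
            tokens.push(Token::Embedded2(&rest[..end]));
            input = &rest[end + 2..];
        } else {
            // The input does not start with a tag here, so the text is never empty
            // and the loop always makes progress.
            let (text, rest) = take_until_any(input, &["<#", "(*"]);
            tokens.push(Token::Text(text));
            input = rest;
        }
    }
    Ok(tokens)
}
```

Calling tokenize on the example line from the issue should yield Text("Test "), Embedded1("embedded sequence 1"), Text(" and ") and Embedded2("embedded sequence 2"), all slices of the input. The same "earliest of several tags" logic could presumably be wrapped in a custom nom parser function if the rest of the grammar is written with nom.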