nom::bytes::complete::escaped_transform woes?
kitchen opened this issue · 1 comments
I'm trying to use `nom::bytes::complete::escaped_transform` and running into some trouble. Specifically, the function wants an escape char, but I am trying to give it an escape byte, one that doesn't seem to play nicely with `as char` (specifically, `0xDB`).
It seems that in Rust, a `char` is a Unicode scalar value rather than a single byte, and it can take several bytes when encoded as UTF-8. If I'm understanding things correctly, `0xDB` is above decimal 127, which triggers the "there's another byte to this character" UTF-8 encoding thing, so the escape is treated as two bytes wide, something like `0xDB00`? Now that I think of it, I actually wrote a little test case to check for that, and sure enough, the parse only succeeds when an extra byte follows `0xDB`.
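For reference, here is a small std-only sketch of what the `as char` cast produces. The helper names `utf8_of_cast` and `utf8_width_of_cast` are made up for illustration and are not part of nom. It suggests the in-memory form is not literally `0xDB00`: the cast yields the code point U+00DB, whose UTF-8 encoding happens to be two bytes long.

```rust
/// Hypothetical helper: UTF-8 bytes of the `char` obtained by casting `byte`.
fn utf8_of_cast(byte: u8) -> Vec<u8> {
    // `u8 as char` maps the byte to the code point with the same value
    // (U+0000..=U+00FF); it does not reinterpret the byte as UTF-8.
    let c = byte as char;
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: how many bytes that `char` occupies in UTF-8.
fn utf8_width_of_cast(byte: u8) -> usize {
    (byte as char).len_utf8()
}
```

With these, `utf8_of_cast(0xDB)` is `[0xC3, 0x9B]` and `utf8_width_of_cast(0xDB)` is 2, while an ASCII byte such as `0x61` stays one byte wide.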
Anywho, this possibly raises a bigger issue: maybe this function should live in `nom::character::complete` instead of `bytes`, since it's clearly character-oriented, with a byte-oriented version placed in `nom::bytes::complete`? Also, I wonder how hard it would be to have the escape char argument be another parser, so you could use `tag` or something else in its place (not that I need that, but it might be useful to make it more generic).
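To make the "byte-oriented version" idea concrete, here is a rough std-only sketch of the behavior I'm after. It is not nom code and not a proposed API: `unescape_bytes` is a hypothetical name, and returning the offset of the bad escape as the error is an assumption made just for this example.

```rust
const FEND: u8 = 0xC0;
const FESC: u8 = 0xDB;
const TFEND: u8 = 0xDC;
const TFESC: u8 = 0xDD;

/// Hypothetical byte-oriented unescape (not part of nom).
/// On failure, returns the offset of the dangling or invalid escape byte.
fn unescape_bytes(input: &[u8]) -> Result<Vec<u8>, usize> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i] != FESC {
            // Ordinary byte: copy it through unchanged.
            out.push(input[i]);
            i += 1;
            continue;
        }
        // The escape is exactly one byte wide; the next byte selects the
        // replacement.
        match input.get(i + 1) {
            Some(&TFEND) => out.push(FEND),
            Some(&TFESC) => out.push(FESC),
            _ => return Err(i),
        }
        i += 2;
    }
    Ok(out)
}
```

Under these assumptions, `[0x61, 0x62, FESC, TFEND, 0x63, 0x64, 0x65]` unescapes to `[0x61, 0x62, FEND, 0x63, 0x64, 0x65]`, which is what I expected from `escaped_transform`.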
Thanks!
Prerequisites
❯ rustc --version
rustc 1.71.0 (8ede3aae2 2023-07-12)
❯ grep nom Cargo.toml
nom = "7.1.3"
Test case
use nom::branch::alt;
use nom::bytes::complete::{escaped_transform, is_not, tag};
use nom::combinator::value;
use nom::IResult;

const FEND: u8 = 0xC0;
const FESC: u8 = 0xDB;
const TFEND: u8 = 0xDC;
const TFESC: u8 = 0xDD;

pub fn unescape(input: &[u8]) -> IResult<&[u8], Vec<u8>> {
    escaped_transform(
        is_not([FESC]),
        FESC as char,
        alt((
            value(&[FEND][..], tag(&[TFEND])),
            value(&[FESC][..], tag(&[TFESC])),
        )),
    )(input)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn try_fesc() {
        let res = unescape(&[0x61, 0x62, FESC, TFEND, 0x63, 0x64, 0x65]);
        assert_eq!(res, Ok((&[][..], vec![0x61, 0x62, FEND, 0x63, 0x64, 0x65])))
    }

    #[test]
    fn try_fesczerozero() {
        // 0xDB as char internally gets turned into 0xDB00, it seems
        // this test case is *not* desired behavior, but I put it here
        // for insight into the implementation details
        let res = unescape(&[0x61, FESC, 0x00, TFEND, 0x63, 0x64]);
        assert_eq!(res, Ok((&[][..], vec![0x61, FEND, 0x63, 0x64])));
    }

    #[test]
    fn try_noesc() {
        let res = unescape(&[0x61, 0x62, 0x63]);
        assert_eq!(res, Ok((&[][..], vec![0x61, 0x62, 0x63])));
    }
}
output of test run:
❯ cargo test
Finished test [unoptimized + debuginfo] target(s) in 0.00s
Running unittests src/lib.rs (target/debug/deps/nomplayground-ec796cae7e096d2e)
running 3 tests
test tests::try_noesc ... ok
test tests::try_fesczerozero ... ok
test tests::try_fesc ... FAILED
failures:
---- tests::try_fesc stdout ----
thread 'tests::try_fesc' panicked at 'assertion failed: `(left == right)`
left: `Err(Error(Error { input: [99, 100, 101], code: Tag }))`,
right: `Ok(([], [97, 98, 192, 99, 100, 101]))`', src/lib.rs:29:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
tests::try_fesc
test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
error: test failed, to rerun pass `--lib`
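One reading that fits this output: in `try_fesc` the error input is `[99, 100, 101]`, the last three bytes of a seven-byte input, so the transform parser was tried at offset 4, two bytes past the escape at offset 2. The toy model below (a hypothetical `transform_start`, not nom's actual code) assumes the escape is skipped by the UTF-8 width of `escape as char` and reproduces both observations.

```rust
/// Toy model, not nom's implementation: the offset where the transform
/// parser would start if the escape were skipped by the UTF-8 width of
/// `escape as char` instead of by one byte.
fn transform_start(input: &[u8], escape: u8) -> Option<usize> {
    // Position of the first escape byte, if any.
    let pos = input.iter().position(|&b| b == escape)?;
    Some(pos + (escape as char).len_utf8())
}
```

For the `try_fesc` input this gives offset 4 (the `0x63`, so `tag` fails there), and for the `try_fesczerozero` input it gives offset 3 (the `TFEND` byte, so the parse succeeds).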
Right, it looks like it's missing something when looking at UTF-8 input.