ron-rs/ron

Deserializer doesn't support unicode identifiers


use ron;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct A {
    한글: String,
}

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct B {
    eng: String,
}
fn main() {
    let a = A {
        한글: "스트링".to_string(),
    };
    let b = B {
        eng: "string".to_string(),
    };
    let a_ser = ron::ser::to_string(&a).unwrap();
    let b_ser = ron::ser::to_string(&b).unwrap();
    let a_de = ron::de::from_str::<A>(&a_ser);
    let b_de = ron::de::from_str::<B>(&b_ser);
    println!("a_ser:\n{:#?}", a_ser );
    println!("a_de:\n{:#?}", a_de );
    println!("b_ser:\n{:#?}", b_ser );
    println!("b_de:\n{:#?}", b_de );
}
a_ser:
"(한글:\"스트링\")"
a_de:
Err(
    Error {
        code: ExpectedIdentifier,
        position: Position {
            line: 1,
            col: 2,
        },
    },
)
b_ser:
"(eng:\"string\")"
b_de:
Ok(
    B {
        eng: "string",
    },
)

Is this a bug in ron?

The bug is not that it cannot deserialize unicode identifiers, but that it doesn't use raw identifiers when serializing them.

EDIT: So far we don't support unicode in identifiers. I'll change this issue to be specific to unicode support; I've created a separate one for the incorrect serialization of raw idents: #322

All parsing operates on u8; for xid_start/xid_continue checks we would need to work with char.
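
For illustration, a char-based identifier scan might look like the sketch below. This is a sketch only: ident_len is a hypothetical helper, and the unicode-ident crate is an assumption, since this thread only names xid_start/xid_continue.

use unicode_ident::{is_xid_continue, is_xid_start};

// Returns the byte length of the identifier at the start of `input`,
// or None if `input` does not begin with one.
fn ident_len(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !is_xid_start(first) {
        return None;
    }
    // Stop at the first char that cannot continue an identifier.
    match chars.find(|&(_, c)| !is_xid_continue(c)) {
        Some((end, _)) => Some(end),
        None => Some(input.len()),
    }
}

fn main() {
    assert_eq!(ident_len("한글:\"스트링\""), Some(6)); // 한글 is 6 bytes of UTF-8
    assert_eq!(ident_len("eng:\"string\""), Some(3));
}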

I see 3 possible implementations:

  1. Make Bytes::new validate that the input is valid utf-8. This would allow us to use a bit of unsafe to still work with bytes but be able to also get the next unicode char, like so:

fn check_ident_other_char(&self, index: usize) -> bool {
    // Safe because `index` always points at a valid char boundary
    unsafe { from_utf8_unchecked(&self.bytes[index..]) }
        .chars()
        .next()
        .map_or(false, is_xid_continue)
}

  2. Create our own method (or pull in a crate that can do this) to validate only the next character in the byte array, something like the following (see the sketch after this list):

get_next_char(&self.bytes[index..]).map_or(false, is_xid_continue)

  3. We could also just use from_utf8, but that would mean that for every parsed ident the whole trailing input would be validated as well.
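
As a minimal sketch of what option (2) could look like using only std (next_char is a hypothetical helper, not part of ron's API):

fn next_char(bytes: &[u8]) -> Option<char> {
    // A UTF-8 code point is at most 4 bytes, so validating a 4-byte
    // window is enough to decode the next char.
    let prefix = &bytes[..bytes.len().min(4)];
    match std::str::from_utf8(prefix) {
        Ok(s) => s.chars().next(),
        // The window may merely cut a *later* code point in half;
        // everything up to valid_up_to() is still valid UTF-8.
        Err(e) => std::str::from_utf8(&prefix[..e.valid_up_to()])
            .ok()?
            .chars()
            .next(),
    }
}

The check from option (1) would then simply become next_char(&self.bytes[index..]).map_or(false, is_xid_continue), without requiring the whole input to be valid utf-8 up front.
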
juntyr commented

@ModProg Thank you for sharing your thoughts! I've been thinking about the UTF-8 support for a while too but haven't had the time to draft an implementation so far.

When we serialise ron to a string, we already assert that the document must be valid UTF-8. While we can currently parse just about any bytes, I'm not fully convinced that sticking to that really is necessary - arbitrary byte strings as suggested in #438 can always use escapes. So I think option (1) is a very valid approach.

Option 2 has the appeal of still giving us the flexibility of non-UTF-8 documents but at the cost of extra maintenance on our side. I think I'd still prefer option (1) here unless there is a very simple implementation or small and well-supported crate that we could depend on.

I have also thought about option (3) but the potential N^2 complexity blowup does make me wary about that option.

While we can currently parse just about any bytes

Well, we actually cannot: every byte we parse must be either inside a string and therefore valid utf-8, or outside a string and therefore ascii (a subset of utf-8).

Introducing a utf-8 restriction for the whole file therefore does not change anything currently; it would only lock us out of introducing some raw byte data later. But if we ever really want to do that, we can still change our approach then.

Do we have any benchmarks to measure performance degradation?

Do we have any benchmarks to measure performance degradation?

Oh, ignore me; I found them immediately after commenting: https://github.com/ron-rs/ron-bench