hexadecimal literal is not a Unicode scalar value

Question

LuckyTurtleDev opened this issue a year ago · 3 comments

What version of regex are you using?

If it isn't the latest version, then please upgrade and check whether the bug
is still present.

Describe the bug at a high level.

I tried to use this regex to match emojis.

 (\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])

But calling Regex::new() on it fails with the error "hexadecimal literal is not a Unicode scalar value".

What are the steps to reproduce the behavior?

A minimal example that contains only the Unicode escape that fails:

fn main() {
    regex:Regex::new(r"\ud83c").unwrap();
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=0c5febef6e07f5cc0906954de8261766

What is the actual behavior?

thread 'main' panicked at src/main.rs:2:34:
called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    \ud83c
      ^^^^
error: hexadecimal literal is not a Unicode scalar value
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What is the expected behavior?

return Ok instead of Err ❓

Answer 1 · 2023-10-27T15:37:28.000Z

You're using regex:Regex instead of regex::Regex. As a result, the error emitted from your reproduction is not the Unicode scalar value error.

Otherwise, the scalar value error is correct. \ud83c is a surrogate codepoint, not a scalar value. Surrogate codepoints are only used with UTF-16. This crate uses UTF-8.
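To make the surrogate point concrete, here is a minimal standard-library sketch (it does not use the regex crate). The helper name `combine_surrogates` is hypothetical and introduced only for illustration. It shows that a lone surrogate such as 0xD83C is not a valid `char`, and how a high/low surrogate pair maps to the single scalar value it encodes, assuming the low half lies in the valid range DC00 to DFFF.

```rust
/// Combine a UTF-16 surrogate pair into the Unicode scalar value it encodes.
/// Hypothetical helper for illustration; returns None when either half is
/// outside its surrogate range (high: D800..=DBFF, low: DC00..=DFFF).
fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    let scalar = 0x10000 + ((high as u32 - 0xD800) << 10) + (low as u32 - 0xDC00);
    char::from_u32(scalar)
}

fn main() {
    // A lone surrogate is a code point but not a scalar value, so it cannot be a `char`.
    assert_eq!(char::from_u32(0xD83C), None);

    // The pair range \ud83c[\udc00-\udfff] corresponds to one range of scalar values.
    let first = combine_surrogates(0xD83C, 0xDC00).unwrap();
    let last = combine_surrogates(0xD83C, 0xDFFF).unwrap();
    println!("U+{:X}..=U+{:X}", first as u32, last as u32);

    // The same emoji is one scalar value, two UTF-16 code units, four UTF-8 bytes.
    let grin = '\u{1F600}';
    assert_eq!(combine_surrogates(0xD83D, 0xDE00), Some(grin));
    assert_eq!(grin.len_utf16(), 2);
    assert_eq!(grin.len_utf8(), 4);
}
```

Under these assumptions, a pair range from a UTF-16 regex flavor can be rewritten for this crate as a range of scalar values, for example `[\x{1F000}-\x{1F3FF}]` in place of `\ud83c[\udc00-\udfff]`.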

Answer 2 · 2023-10-27T16:30:36.000Z

> You're using regex:Regex instead of regex::Regex. As a result, the error emitted from your reproduction is not the Unicode scalar value error.

Looks like I messed up the example without noticing. 🤦 I have fixed it now.

> Surrogate codepoints are only used with UTF-16. This crate uses UTF-8.

Can I still match it if I split it into two bytes, or will that produce an error because it is not valid UTF-8?
What happens if I match a regex against a UTF-16 string?
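On the UTF-16 question: the crate's `Regex` searches `&str` haystacks (and `regex::bytes::Regex` searches `&[u8]`), so UTF-16 data would need to be transcoded before matching. The sketch below uses only the standard library; `decode_for_matching` is a hypothetical helper name, and the choice between strict and lossy decoding is an assumption about what the caller wants.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: strictly decode UTF-16 code units into a `String`
/// that could then be handed to a UTF-8 based regex engine.
fn decode_for_matching(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}

fn main() {
    // Well-formed UTF-16 (surrogate pairs intact) decodes to ordinary UTF-8 text.
    let units: Vec<u16> = "hi 😀".encode_utf16().collect();
    let text = decode_for_matching(&units).expect("valid UTF-16");
    assert_eq!(text, "hi 😀");

    // An unpaired surrogate has no UTF-8 representation, so strict decoding fails.
    let broken = [0x0068, 0xD83C];
    assert!(decode_for_matching(&broken).is_err());

    // Lossy decoding substitutes U+FFFD for the unpaired surrogate instead.
    assert_eq!(String::from_utf16_lossy(&broken), "h\u{FFFD}");
}
```

After decoding, the emoji is a single scalar value in the `String`, so a pattern written in terms of scalar values applies to it directly.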

Answer 3 · 2023-10-27T17:44:31.000Z

Never mind, I can simply use \p{Emoji}.

But why is ♡ a Math char but not an Emoji?