hexadecimal literal is not a Unicode scalar value
LuckyTurtleDev opened this issue · 3 comments
What version of regex are you using?
If it isn't the latest version, then please upgrade and check whether the bug
is still present.
Describe the bug at a high level.
I tried to use this regex to match emojis.
(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])
But calling Regex::new() on it produces the error "hexadecimal literal is not a Unicode scalar value".
What are the steps to reproduce the behavior?
A minimal example that includes only the Unicode escape sequence that fails:
fn main() {
regex:Regex::new(r"\ud83c").unwrap();
}
What is the actual behavior?
thread 'main' panicked at src/main.rs:2:34:
called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
\ud83c
^^^^
error: hexadecimal literal is not a Unicode scalar value
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What is the expected behavior?
Return Ok instead of Err.
❓
You're using regex:Regex instead of regex::Regex. As a result, the error emitted from your reproduction is not the Unicode scalar value error.
Otherwise, the scalar value error is correct. \ud83c is a surrogate codepoint, not a scalar value. Surrogate codepoints are only used with UTF-16. This crate uses UTF-8.
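To make the distinction concrete, here is a minimal sketch using only the Rust standard library. It shows that a lone surrogate such as 0xD83C is not a valid `char`, and it translates each surrogate-pair alternative from the original pattern into a class of real scalar values. The helper names `surrogate_pair_to_scalar` and `scalar_class_for_high_surrogate` are hypothetical, and the sketch assumes the low half is meant to range over the full low-surrogate block (0xDC00 to 0xDFFF). The generated strings use the `\x{...}` hex escape, which should be usable in a pattern passed to Regex::new.

```rust
/// Combine a UTF-16 surrogate pair into the Unicode scalar value it encodes.
/// Returns None unless `high` and `low` form exactly one valid pair.
/// (Hypothetical helper, for illustration only.)
fn surrogate_pair_to_scalar(high: u16, low: u16) -> Option<char> {
    let mut decoded = char::decode_utf16([high, low]);
    // An unpaired surrogate decodes to Err, which becomes None here.
    let first = decoded.next()?.ok()?;
    // If a second char comes out, the two units were not a pair.
    if decoded.next().is_some() {
        return None;
    }
    Some(first)
}

/// Build a character class covering every scalar value whose UTF-16 form
/// starts with `high`. (Hypothetical helper, for illustration only.)
fn scalar_class_for_high_surrogate(high: u16) -> Option<String> {
    let first = surrogate_pair_to_scalar(high, 0xDC00)?;
    let last = surrogate_pair_to_scalar(high, 0xDFFF)?;
    Some(format!(r"[\x{{{:X}}}-\x{{{:X}}}]", first as u32, last as u32))
}

fn main() {
    // A surrogate on its own is a codepoint but not a scalar value.
    assert!(char::from_u32(0xD83C).is_none());

    // Paired with a low surrogate, it encodes a scalar value above U+FFFF.
    assert_eq!(surrogate_pair_to_scalar(0xD83C, 0xDF00), Some('\u{1F300}'));

    // Print the scalar-value class for each high surrogate in the pattern.
    for high in [0xD83C, 0xD83D, 0xD83E] {
        if let Some(class) = scalar_class_for_high_surrogate(high) {
            println!("\\u{:04x}[...] -> {}", high, class);
        }
    }
}
```

Under these assumptions the three surrogate alternatives cover the contiguous range U+1F000 to U+1FBFF, so they could be written as a single class of scalar values instead.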
> You're using regex:Regex instead of regex::Regex. As a result, the error emitted from your reproduction is not the Unicode scalar value error.
Looks like I messed up creating the example without noticing it. 🤦 I have fixed this now.
> Surrogate codepoints are only used with UTF-16. This crate uses UTF-8.
Can I still use this if I split it into two bytes, or will this create an error because it is not valid UTF-8?
What happens if I match a regex against a UTF-16 string?
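As a rough illustration of what these two questions involve, the sketch below uses only the standard library to transcode UTF-16 code units into a Rust String, which is always valid UTF-8. Transcoding before matching is an assumption about how one might proceed, not something confirmed in this thread, and `utf16_to_utf8` is a hypothetical helper name. It also shows that an emoji above U+FFFF occupies four bytes in UTF-8 and that a lone surrogate cannot be converted at all.

```rust
/// Transcode UTF-16 code units into a String (always valid UTF-8).
/// Returns None if the input contains an unpaired surrogate.
/// (Hypothetical helper, for illustration only.)
fn utf16_to_utf8(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // "a" followed by U+1F300, encoded as UTF-16 code units.
    let units: Vec<u16> = "a\u{1F300}".encode_utf16().collect();
    assert_eq!(units, [0x0061, 0xD83C, 0xDF00]);

    // After transcoding, the surrogate pair becomes one four-byte
    // UTF-8 sequence; no surrogate bytes remain in the text.
    let text = utf16_to_utf8(&units).expect("well-formed UTF-16");
    assert_eq!(text.as_bytes(), [0x61, 0xF0, 0x9F, 0x8C, 0x80]);

    // A lone surrogate has no valid UTF-8 encoding, so conversion fails.
    assert!(utf16_to_utf8(&[0xD83C]).is_none());

    // The lossy variant substitutes U+FFFD REPLACEMENT CHARACTER instead.
    assert_eq!(String::from_utf16_lossy(&[0xD83C]), "\u{FFFD}");
}
```

Once the text is a String, a pattern written in terms of scalar values (for example with \x{...} escapes or \p{...} classes) can be matched against it directly.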
Never mind, I can simply use \p{Emoji}.
But why is ♡ a Math char but not an Emoji?