Buggy behaviour when working with accented characters
WyohKnott opened this issue · 4 comments
When working with filenames containing accented or other non-ASCII characters, the {} placeholder is not replaced correctly.
For example:
seq 1 10 | parallel echo "Québec-q{}.webm"
gives:
Québec-1}.webm
Québec-2}.webm
Québec-3}.webm
Québec-4}.webm
Québec-5}.webm
Québec-6}.webm
Québec-7}.webm
Québec-8}.webm
Québec-9}.webm
Québec-10}.webm
instead of Québec-q1.webm and so on.
If there are more non-ASCII characters, the program panics:
seq 1 10 | RUST_BACKTRACE=1 parallel echo "Œuf_échaudé-q{}.webm"
gives
parallel: reading inputs from standard input
thread 'main' panicked at 'byte index 18 is not a char boundary; it is inside 'é' (bytes 17..19) of `echo Œuf_échaudé-q{}.webm`', /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
stack backtrace:
1: 0x560b9875899a - std::sys::imp::backtrace::tracing::imp::write::h9c41d2f69e5caabf
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
2: 0x560b98757cee - std::panicking::default_hook::{{closure}}::hcc803c8663cda123
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:351
3: 0x560b98756fdb - std::panicking::rust_panic_with_hook::hffbc74969c7b5d87
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:367
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:555
4: 0x560b98756b3f - std::panicking::begin_panic::hc4c5d184a1e3fb7c
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:517
5: 0x560b98756ac9 - std::panicking::begin_panic_fmt::h34f5b320b0f94559
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:501
6: 0x560b98765956 - core::panicking::panic_fmt::h1016b85b51d1931f
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:477
7: 0x560b98766e4f - core::str::slice_error_fail::h02b27cb27b0f1c1d
at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
8: 0x560b9875069f - parallel::main::h6c96215d2b4b63a7
9: 0x560b9875334f - main
10: 0x7f1a66c82400 - __libc_start_main
11: 0x560b9871d649 - _start
12: 0x0 - <unknown>
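The panic message can be reproduced outside the program. The following is a minimal sketch, not code from parallel itself: `checked_prefix` is a hypothetical helper, and the index 18 is taken from the panic message above. It shows that byte 18 of the command string is not a character boundary, and that the standard `str::get` method returns `None` in that case where `&s[..18]` would panic.

```rust
/// Return the prefix of `s` ending at `byte_index`, or `None` when the index
/// is out of range or falls inside a multi-byte character.
/// Hypothetical helper for illustration only.
fn checked_prefix(s: &str, byte_index: usize) -> Option<&str> {
    s.get(..byte_index)
}

fn main() {
    let command = "echo Œuf_échaudé-q{}.webm";

    // Byte 18 lies inside the second 'é', which occupies bytes 17..19.
    println!("boundary at 18: {}", command.is_char_boundary(18));

    // Slicing with `&command[..18]` would panic; `get` reports the problem instead.
    println!("prefix at 18: {:?}", checked_prefix(command, 18));

    // Byte 21 is where `{` really starts, so this slice is valid.
    println!("prefix at 21: {:?}", checked_prefix(command, 21));
}
```

Note that `{` is the 19th character of that string (character index 18), which is consistent with a character count being used as a byte index.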
The "shifting" seems directly correlated to the number of extra bytes that the "special" characters take up in UTF-8.
For example, the character 💖 is a single codepoint encoded as 4 bytes, so the {} replacement is shifted 3 bytes to the left:
seq 1 10 | parallel echo "test_💖_-q{}.webm"
test_💖1q{}.webm
test_💖2q{}.webm
test_💖3q{}.webm
test_💖4q{}.webm
test_💖5q{}.webm
test_💖6q{}.webm
test_💖7q{}.webm
test_💖8q{}.webm
test_💖9q{}.webm
test_💖10q{}.webm
Somewhere in your code there must be an assumption that 1 character = 1 byte, and it messes everything up for characters encoded with more than 1 byte.
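The size of the shift in each example can be checked by comparing where `{` sits when counted in characters and when counted in bytes. This is only an illustrative sketch; `char_position` and `byte_position` are hypothetical names, not functions from the project.

```rust
/// Position of the first `needle`, counted in characters.
fn char_position(s: &str, needle: char) -> Option<usize> {
    s.chars().position(|c| c == needle)
}

/// Position of the first `needle`, counted in bytes (what slicing expects).
fn byte_position(s: &str, needle: char) -> Option<usize> {
    s.find(needle)
}

fn main() {
    let inputs = ["plain-q{}.webm", "Québec-q{}.webm", "test_💖_-q{}.webm"];
    for input in inputs.iter() {
        let chars = char_position(input, '{').unwrap();
        let bytes = byte_position(input, '{').unwrap();
        // The difference is the number of extra bytes used by multi-byte
        // characters that appear before the placeholder.
        println!(
            "{}: char index {}, byte index {}, shift {}",
            input,
            chars,
            bytes,
            bytes - chars
        );
    }
}
```

The difference is 0 for the ASCII-only name, 1 for the name containing é, and 3 for the name containing 💖, matching the shifts seen in the outputs above.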
The issue is in the tokenizer. This is the stage that finds placeholders like {} in the command and converts them into their corresponding series of tokens.
I'll have the fix uploaded soon. The fix is just manually incrementing the index value by each character's actual encoded size, using the len_utf8() method. Example:
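let mut id = 0; // byte offset of the current character within `data`
for character in data.chars() {
    // actual work
    id += character.len_utf8(); // advance by the character's UTF-8 byte length
}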
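For comparison, the standard library's `char_indices()` yields the same byte offsets that the loop above accumulates by hand. The sketch below uses it to substitute `{}` safely. It is an assumption about how such a step could look, not the project's actual tokenizer, and `substitute_placeholder` is a hypothetical name.

```rust
/// Replace every `{}` in `template` with `value`, tracking byte offsets.
/// Hypothetical helper for illustration; not the project's tokenizer.
fn substitute_placeholder(template: &str, value: &str) -> String {
    let mut output = String::with_capacity(template.len() + value.len());
    let mut copied_up_to = 0; // byte offset of the first byte not yet copied
    let mut chars = template.char_indices().peekable();

    while let Some((start, character)) = chars.next() {
        if character == '{' {
            if let Some(&(_, '}')) = chars.peek() {
                chars.next(); // consume the closing brace
                // `start` is a byte offset, so this slice is always valid.
                output.push_str(&template[copied_up_to..start]);
                output.push_str(value);
                // `{` and `}` are one byte each in UTF-8.
                copied_up_to = start + 2;
            }
        }
    }
    output.push_str(&template[copied_up_to..]);
    output
}

fn main() {
    for template in ["Québec-q{}.webm", "Œuf_échaudé-q{}.webm", "test_💖_-q{}.webm"].iter() {
        println!("{}", substitute_placeholder(template, "1"));
    }
}
```

Because every slice boundary comes from `char_indices()`, or from adding the known one-byte widths of `{` and `}`, the slices cannot land inside a multi-byte character.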