Buggy behaviour when working with accentuated characters

Question

WyohKnott opened this issue 8 years ago · 4 comments

When working with filenames containing accented or other non-ASCII characters, the {} placeholder is not replaced correctly.

For example:
seq 1 10 | parallel echo "Québec-q{}.webm"

gives:

Québec-1}.webm
Québec-2}.webm
Québec-3}.webm
Québec-4}.webm
Québec-5}.webm
Québec-6}.webm
Québec-7}.webm
Québec-8}.webm
Québec-9}.webm
Québec-10}.webm

instead of Québec-q1.webm and so on.

If there are more non-ASCII characters, the program panics:

seq 1 10 | RUST_BACKTRACE=1 parallel echo "Œuf_échaudé-q{}.webm"

gives

parallel: reading inputs from standard input
thread 'main' panicked at 'byte index 18 is not a char boundary; it is inside 'é' (bytes 17..19) of `echo Œuf_échaudé-q{}.webm`', /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
stack backtrace:
   1:     0x560b9875899a - std::sys::imp::backtrace::tracing::imp::write::h9c41d2f69e5caabf
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2:     0x560b98757cee - std::panicking::default_hook::{{closure}}::hcc803c8663cda123
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:351
   3:     0x560b98756fdb - std::panicking::rust_panic_with_hook::hffbc74969c7b5d87
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:367
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:555
   4:     0x560b98756b3f - std::panicking::begin_panic::hc4c5d184a1e3fb7c
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:517
   5:     0x560b98756ac9 - std::panicking::begin_panic_fmt::h34f5b320b0f94559
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:501
   6:     0x560b98765956 - core::panicking::panic_fmt::h1016b85b51d1931f
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:477
   7:     0x560b98766e4f - core::str::slice_error_fail::h02b27cb27b0f1c1d
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
   8:     0x560b9875069f - parallel::main::h6c96215d2b4b63a7
   9:     0x560b9875334f - main
  10:     0x7f1a66c82400 - __libc_start_main
  11:     0x560b9871d649 - _start
  12:                0x0 - <unknown>

mmstick commented 8 years ago

7053e3a

Answer 1 · 2017-01-17T18:53:00.000Z

The "shifting" seems directly correlated to the number of extra UTF-8 bytes used by "special" characters.

For example, the character 💖 is a single codepoint encoded as 4 bytes in UTF-8, so the {} replacement is shifted 3 bytes to the left:

seq 1 10 | parallel echo "test_💖_-q{}.webm"
test_💖1q{}.webm
test_💖2q{}.webm
test_💖3q{}.webm
test_💖4q{}.webm
test_💖5q{}.webm
test_💖6q{}.webm
test_💖7q{}.webm
test_💖8q{}.webm
test_💖9q{}.webm
test_💖10q{}.webm

Somewhere in your code there must be an assumption that 1 character = 1 byte, and it breaks for characters encoded with more than 1 byte.
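To make the mismatch concrete, here is a minimal sketch (not the project's code; `char_and_byte_index` is a hypothetical helper) that compares the character index of `{` with its byte index in the template from the panic message. It assumes the accented letters are precomposed, which the reported byte range 17..19 suggests.

```rust
/// Hypothetical helper: returns (character index, byte index) of the
/// first occurrence of `needle` in `text`.
fn char_and_byte_index(text: &str, needle: char) -> Option<(usize, usize)> {
    text.char_indices()
        .enumerate()
        .find(|&(_, (_, c))| c == needle)
        .map(|(char_idx, (byte_idx, _))| (char_idx, byte_idx))
}

fn main() {
    let template = "echo Œuf_échaudé-q{}.webm";

    // The two lengths differ as soon as any character needs more than one byte.
    println!("chars: {}, bytes: {}", template.chars().count(), template.len());

    if let Some((char_idx, byte_idx)) = char_and_byte_index(template, '{') {
        println!("'{{' is character {} but byte {}", char_idx, byte_idx);
        // Slicing a &str at a byte offset that is not a char boundary panics.
        println!(
            "is byte {} a char boundary? {}",
            char_idx,
            template.is_char_boundary(char_idx)
        );
    }

    println!("UTF-8 width of 💖: {} bytes", '💖'.len_utf8());
}
```

Here `{` is character 18 but byte 21, and byte 18 falls inside the second 'é', which matches the index reported in the panic above.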

Answer 2 · 2017-01-17T19:01:46.000Z

The issue is in the tokenizer. This is the stage that finds placeholders like {} in the command and converts the command into its corresponding series of tokens.
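As an illustration of what a byte-offset-aware tokenizer could look like, here is a small sketch. The `Token` enum and `tokenize` function are hypothetical and are not taken from the project. The sketch relies on `char_indices()`, which yields the byte offset of each character, so every slice boundary is a valid char boundary.

```rust
/// Hypothetical token type: either literal text or a `{}` placeholder.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Literal(&'a str),
    Placeholder,
}

/// Splits `template` into literal runs and `{}` placeholders.
fn tokenize(template: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut start = 0; // byte offset where the current literal run begins
    let mut chars = template.char_indices().peekable();

    while let Some((id, character)) = chars.next() {
        // `id` is a byte offset, so slicing with it is always safe.
        if character == '{' && matches!(chars.peek(), Some(&(_, '}'))) {
            if start < id {
                tokens.push(Token::Literal(&template[start..id]));
            }
            tokens.push(Token::Placeholder);
            chars.next(); // consume the closing '}'
            start = id + 2; // '{' and '}' are one byte each
        }
    }

    if start < template.len() {
        tokens.push(Token::Literal(&template[start..]));
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("echo Québec-q{}.webm"));
}
```

Only the plain `{}` placeholder is handled here; any other placeholder forms the real tokenizer supports are out of scope for this sketch.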

Answer 3 · 2017-01-18T02:22:16.000Z

I'll have the fix uploaded soon. The fix is to manually increment the index by each character's actual size in bytes, using the len_utf8() method. Example:

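let mut id = 0; // byte offset of the current character within `data`
for character in data.chars() {
    // actual work, using `id` as a byte index into `data`
    id += character.len_utf8(); // advance by the encoded width (1 to 4 bytes)
}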
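A self-contained version of that loop can be checked against the standard library. In this sketch, `byte_offsets` is a hypothetical name; it records the manually tracked offset of every character so the result can be compared with what `char_indices()` reports.

```rust
/// Hypothetical helper: byte offset of each character in `data`,
/// tracked manually by adding `len_utf8()` after every character.
fn byte_offsets(data: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut id = 0;
    for character in data.chars() {
        offsets.push(id);
        id += character.len_utf8();
    }
    offsets
}

fn main() {
    let data = "test_💖_-q{}.webm";
    let manual = byte_offsets(data);
    let builtin: Vec<usize> = data.char_indices().map(|(i, _)| i).collect();

    // Both approaches agree, and every offset is a valid slicing point.
    assert_eq!(manual, builtin);
    assert!(manual.iter().all(|&i| data.is_char_boundary(i)));
    println!("{:?}", manual);
}
```

Since `char_indices()` yields the same offsets, it may be a simpler alternative wherever the loop only needs the index alongside the character.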