mmstick/parallel

Buggy behaviour when working with accented characters

WyohKnott opened this issue · 4 comments

When working with filenames containing accented or other non-ASCII characters, the {} placeholder is not replaced correctly.

For example:
seq 1 10 | parallel echo "Québec-q{}.webm"

gives:

Québec-1}.webm
Québec-2}.webm
Québec-3}.webm
Québec-4}.webm
Québec-5}.webm
Québec-6}.webm
Québec-7}.webm
Québec-8}.webm
Québec-9}.webm
Québec-10}.webm

instead of Québec-q1.webm and so on.

If there are more non-ASCII characters, the program panics:

seq 1 10 | RUST_BACKTRACE=1 parallel echo "Œuf_échaudé-q{}.webm"

gives

parallel: reading inputs from standard input
thread 'main' panicked at 'byte index 18 is not a char boundary; it is inside 'é' (bytes 17..19) of `echo Œuf_échaudé-q{}.webm`', /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
stack backtrace:
   1:     0x560b9875899a - std::sys::imp::backtrace::tracing::imp::write::h9c41d2f69e5caabf
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2:     0x560b98757cee - std::panicking::default_hook::{{closure}}::hcc803c8663cda123
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:351
   3:     0x560b98756fdb - std::panicking::rust_panic_with_hook::hffbc74969c7b5d87
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:367
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:555
   4:     0x560b98756b3f - std::panicking::begin_panic::hc4c5d184a1e3fb7c
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:517
   5:     0x560b98756ac9 - std::panicking::begin_panic_fmt::h34f5b320b0f94559
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:501
   6:     0x560b98765956 - core::panicking::panic_fmt::h1016b85b51d1931f
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:477
   7:     0x560b98766e4f - core::str::slice_error_fail::h02b27cb27b0f1c1d
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
   8:     0x560b9875069f - parallel::main::h6c96215d2b4b63a7
   9:     0x560b9875334f - main
  10:     0x7f1a66c82400 - __libc_start_main
  11:     0x560b9871d649 - _start
  12:                0x0 - <unknown>
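
The indices in the panic message can be checked in isolation. The following is a minimal sketch, assuming (this is a guess, the tokenizer source is not shown here) that the character position of `{` was used as a byte index; `brace_positions` is a hypothetical helper, not part of parallel.

```rust
/// Hypothetical helper: returns (character index, byte offset) of the
/// first `{` in `s`, or None if there is no `{`.
fn brace_positions(s: &str) -> Option<(usize, usize)> {
    // Position counted in Unicode scalar values (what a naive counter sees).
    let char_idx = s.chars().position(|c| c == '{')?;
    // Position counted in bytes (what string slicing actually needs).
    let byte_idx = s.find('{')?;
    Some((char_idx, byte_idx))
}

/// Returns true if slicing `s[..idx]` would succeed instead of panicking.
fn can_slice_at(s: &str, idx: usize) -> bool {
    s.is_char_boundary(idx)
}
```

For the command `echo Œuf_échaudé-q{}.webm`, the `{` is character number 18 but sits at byte offset 21. Byte 18 falls inside the second 'é' (bytes 17..19), which matches the index reported in the panic.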

The "shifting" seems directly correlated to the number of extra UTF-8 bytes used by "special" characters.

For example, the character 💖 is encoded as 4 bytes in UTF-8, so the {} placeholder is shifted 3 positions to the left:

seq 1 10 | parallel echo "test_💖_-q{}.webm"
test_💖1q{}.webm
test_💖2q{}.webm
test_💖3q{}.webm
test_💖4q{}.webm
test_💖5q{}.webm
test_💖6q{}.webm
test_💖7q{}.webm
test_💖8q{}.webm
test_💖9q{}.webm
test_💖10q{}.webm
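
The size of the shift can be computed directly. This is a small sketch, not code from parallel: `byte_shift` is a hypothetical helper that measures how far a per-character counter lags behind the real byte offset at the end of the text preceding the placeholder.

```rust
/// Hypothetical helper: difference between the byte length and the
/// character count of `prefix`, i.e. the number of "extra" bytes
/// contributed by multi-byte characters.
fn byte_shift(prefix: &str) -> usize {
    // `len()` counts bytes; `chars().count()` counts Unicode scalar values.
    prefix.len() - prefix.chars().count()
}

/// Encoded size in bytes of a single character (1 to 4 for UTF-8).
fn encoded_size(c: char) -> usize {
    c.len_utf8()
}
```

With the inputs from this report, `byte_shift("Québec-q")` is 1 and `byte_shift("test_💖_-q")` is 3, which agrees with the observed shifts of one and three positions.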

Somewhere in your code there must be an assumption that 1 character = 1 byte, and it messes everything up for characters encoded with more than 1 byte.
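
To illustrate that hypothesis, here is a hedged sketch of how such an assumption could produce exactly the outputs above. Both functions are hypothetical and written for this illustration only; they are not taken from the parallel source. The buggy variant uses `str::get` so that it returns `None` where direct slicing would panic.

```rust
/// Hypothetical buggy substitution: treats the *character* position of `{`
/// as if it were a *byte* offset into the template.
fn naive_substitute(template: &str, input: &str) -> Option<String> {
    let char_pos = template.chars().position(|c| c == '{')?;
    // `get` returns None when the index is not on a character boundary,
    // which is the case where `&template[..char_pos]` would panic.
    let head = template.get(..char_pos)?;
    let tail = template.get(char_pos + 2..)?;
    Some(format!("{}{}{}", head, input, tail))
}

/// Byte-correct substitution: `str::find` already returns a byte offset.
fn correct_substitute(template: &str, input: &str) -> Option<String> {
    let byte_pos = template.find("{}")?;
    let head = &template[..byte_pos];
    let tail = &template[byte_pos + 2..];
    Some(format!("{}{}{}", head, input, tail))
}
```

Under these assumptions, `naive_substitute("Québec-q{}.webm", "1")` yields `Québec-1}.webm` and `naive_substitute("test_💖_-q{}.webm", "1")` yields `test_💖1q{}.webm`, the same strings as in the report, while the `Œuf_échaudé` template hits a non-boundary index.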

The issue is in the tokenizer. This is the stage that strips out tokens like {} and converts them into their corresponding series of tokens.

I'll have the fix uploaded soon. The fix is to manually increment the index by each character's actual size in bytes, as returned by the len_utf8() method. Example:

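let mut id = 0; // byte offset of the current character within `data`
for character in data.chars() {
    // actual work, using `id` as a byte index into `data`
    id += character.len_utf8(); // advance by the encoded size (1 to 4 bytes)
}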
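
As an alternative to keeping the counter by hand, the standard library's `char_indices()` yields each character together with its byte offset. The sketch below is only an illustration of that approach, not the fix that was committed; `placeholder_offsets` is a hypothetical function name.

```rust
/// Hypothetical example: returns the byte offset of every `{}` placeholder
/// in `data`, using the byte offsets supplied by `char_indices()`.
fn placeholder_offsets(data: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut chars = data.char_indices().peekable();
    while let Some((id, character)) = chars.next() {
        // `id` is already a byte offset, so no manual bookkeeping is needed.
        if character == '{' {
            if let Some(&(_, '}')) = chars.peek() {
                offsets.push(id);
            }
        }
    }
    offsets
}
```

Each returned offset can be used directly for slicing, for example `&data[offset..offset + 2]` is always `"{}"`, because both braces are single-byte ASCII characters.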