RazrFalcon/pico-args

Can it do correct non-UTF-8 OsString/PathBuf filename handling?

vi opened this issue · 15 comments

vi commented

Or does it force everything to be String like gumdrop?

For example, one implement a /sbin/chroot analogue with pico-args that does not break passing arguments to executed process in edge cases?

Yes, everything is a string.

How can you hit this issue on linux? Are you using non-UTF-8 locale?

vi commented

How can you hit this issue on linux?

$ echo qwer > $(printf '\xFF')
$ ls -l
total 4
-rw-r--r-- 1 vi vi 5 Jul 22 17:14 ''$'\377'
$ find .
.
./?
$ find . -print0 | hd
00000000  2e 00 2e 2f ff 00                                 |.../..|
00000006
$ cat � 
qwer
$ rm $(printf '\xFF')

How would pico-args-based /bin/rm analogue delete this file?

Are you using non-UTF-8 locale?

Locale is UTF-8, but filenames are bytes, not strings in Linux.

As far as I remember, there may be tricky specifics about filename handling on other platforms as well.

Last time I've checked, rm was not able to remove such files too. But it was few years ago.

I will think about it. But I think that there are too much work for such an edge case.

vi commented

For me, on of reason to have proper OsString-capable argument parsing is ability to forward them to Command::args as is, fully transparent and unmodifined.

Missing this subtle corner case (typically panicking instead) is not a Rust Way (maybe also not a Unix Way). Imagine /usr/bin/sudo analogue of using pico-args. It's not allright to panic because of invalid data specified by user.

It is a question of correctness, not feature richness.

Actually, rm can remove such files with no problem:

darcy:tmp$ echo qwer > $(printf '\xFF')
darcy:tmp$ ls
?
darcy:tmp$ rm � 
darcy:tmp$ ls
darcy:tmp$

I typed the rm command using tab completion, since otherewise that character is hard to type. I could also have removed it with a rm * now that I think of it. In either case, rm accepts a single non-unicode string argument passed it by the shell, and removes that file.

I agree that this is important. Fortunately, it's also pretty easy. You just store a Vec<OsString> and convert when needed.

It's not really a corner case, it's a question of correctness for any program that handles file paths.

I don't think that this is easy at all, since the current code works with UTF-8 strings, not bytes.

Non UTF-8 arguments are not allowed anymore.

I'll just add that this is a show-stopper for me. No hard feelings, but failure to handle Path or PathBuf types correctly means it can't be used in a general-purpose argument processing crate.

pico-args designed to parse strings, not bytes. I'm open for suggestions, but I don't see how this can be implemented.

The only solution I see is to use *_from_os_str methods when you what to parse something like a Path.

vi commented

Non UTF-8 arguments are not allowed anymore.

  1. What happens if user specifies them? I see: 'from_env' returns a 'Result' now..
  2. If arbitrary file paths are not allowed as arguments, it should be documented in README and crate docs.

I've added OsString support by basically rewriting everything. It's no longer pico, since it's more than 500 LOC already.

The only limitation is that you can't parse --key=value when it's not a UTF-8 string. You have to separate them by space. But it's not possible anyway, since we don't know what encoding is used.

vi commented

It's no longer pico, since it's more than 500 LOC already.

That's unfortunate, but feels closer to Rust Way. Is there significant additional compile-time or exeutable size overhead?

But it's not possible anyway, since we don't know what encoding is used.

Non-ascii command-line option and flag names may be rarity. I think it is OK to fail (but not panic) in --key=<non-UTF8-value> when key itself is non-ascii.

Or maybe just handling only ASCII English option names is already OK for a "pico" library.

In C programs, command line parsing may be "ascii-only 1 byte = 1 char" territory where UTF-8 works only because of all [\x80-\xFF] characters are treatet as funny unknown letters.

No. I've managed to squeeze it into the original 20KiB margin.

The problem is that OsStr doesn't have an access to the underling data, so I can only convert it to a String. Unless I'm missing something.

vi commented

The problem is that OsStr doesn't have an access to the underling data.

On UNIX there is as_bytes method in extension trait.
On Windows clap seems to use unsafe { transmute } hackery, as shown in Reddit comment. But command line argument arrays are broken on Windows in the first place.