Can it do correct non-UTF-8 OsString/PathBuf filename handling?
vi opened this issue · 15 comments
Or does it force everything to be String like gumdrop
?
For example, one implement a /sbin/chroot
analogue with pico-args that does not break passing arguments to executed process in edge cases?
Yes, everything is a string.
How can you hit this issue on linux? Are you using non-UTF-8 locale?
How can you hit this issue on linux?
$ echo qwer > $(printf '\xFF')
$ ls -l
total 4
-rw-r--r-- 1 vi vi 5 Jul 22 17:14 ''$'\377'
$ find .
.
./?
$ find . -print0 | hd
00000000 2e 00 2e 2f ff 00 |.../..|
00000006
$ cat �
qwer
$ rm $(printf '\xFF')
How would pico-args-based /bin/rm
analogue delete this file?
Are you using non-UTF-8 locale?
Locale is UTF-8, but filenames are bytes, not strings in Linux.
As far as I remember, there may be tricky specifics about filename handling on other platforms as well.
Last time I've checked, rm
was not able to remove such files too. But it was few years ago.
I will think about it. But I think that there are too much work for such an edge case.
For me, on of reason to have proper OsString-capable argument parsing is ability to forward them to Command::args
as is, fully transparent and unmodifined.
Missing this subtle corner case (typically panicking instead) is not a Rust Way (maybe also not a Unix Way). Imagine /usr/bin/sudo
analogue of using pico-args. It's not allright to panic because of invalid data specified by user.
It is a question of correctness, not feature richness.
Actually, rm can remove such files with no problem:
darcy:tmp$ echo qwer > $(printf '\xFF')
darcy:tmp$ ls
?
darcy:tmp$ rm �
darcy:tmp$ ls
darcy:tmp$
I typed the rm command using tab completion, since otherewise that character is hard to type. I could also have removed it with a rm *
now that I think of it. In either case, rm accepts a single non-unicode string argument passed it by the shell, and removes that file.
I agree that this is important. Fortunately, it's also pretty easy. You just store a Vec<OsString>
and convert when needed.
It's not really a corner case, it's a question of correctness for any program that handles file paths.
I don't think that this is easy at all, since the current code works with UTF-8 strings, not bytes.
Non UTF-8 arguments are not allowed anymore.
I'll just add that this is a show-stopper for me. No hard feelings, but failure to handle Path
or PathBuf
types correctly means it can't be used in a general-purpose argument processing crate.
pico-args
designed to parse strings, not bytes. I'm open for suggestions, but I don't see how this can be implemented.
The only solution I see is to use *_from_os_str
methods when you what to parse something like a Path
.
Non UTF-8 arguments are not allowed anymore.
What happens if user specifies them?I see:'from_env' returns a 'Result' now.
.- If arbitrary file paths are not allowed as arguments, it should be documented in README and crate docs.
I've added OsString
support by basically rewriting everything. It's no longer pico, since it's more than 500 LOC already.
The only limitation is that you can't parse --key=value
when it's not a UTF-8 string. You have to separate them by space. But it's not possible anyway, since we don't know what encoding is used.
It's no longer pico, since it's more than 500 LOC already.
That's unfortunate, but feels closer to Rust Way. Is there significant additional compile-time or exeutable size overhead?
But it's not possible anyway, since we don't know what encoding is used.
Non-ascii command-line option and flag names may be rarity. I think it is OK to fail (but not panic) in --key=<non-UTF8-value>
when key
itself is non-ascii.
Or maybe just handling only ASCII English option names is already OK for a "pico" library.
In C programs, command line parsing may be "ascii-only 1 byte = 1 char" territory where UTF-8 works only because of all [\x80-\xFF]
characters are treatet as funny unknown letters.
No. I've managed to squeeze it into the original 20KiB margin.
The problem is that OsStr
doesn't have an access to the underling data, so I can only convert it to a String
. Unless I'm missing something.
The problem is that OsStr doesn't have an access to the underling data.
On UNIX there is as_bytes
method in extension trait.
On Windows clap
seems to use unsafe { transmute }
hackery, as shown in Reddit comment. But command line argument arrays are broken on Windows in the first place.