google/argh

Supporting non-UTF-8 paths

Byron opened this issue · 8 comments

Byron commented

I love argh for its simplicity, compile performance and leanness, and I find myself migrating more and more projects to it.

Just now it became evident that despite supporting PathBuf in struct fields, it will always assume valid UTF-8 in the arguments it parses. This might be an assumption that holds in Fuchsia, but may make it impossible to use for general purpose linux tools that consume paths in arguments.

I wonder if there is any chance to allow parsing PathBuf fields without making assumptions about their encoding?

Thank you

PS: This question was triggered by this post on reddit/r/rust.

Perhaps os_str_bytes could be of use.

Thank you for the kind words!

Like Freaky said, you can likely use std::ffi::OsString instead of a PathBuf. Is that sufficient for your use case?

Fuchsia does indeed have everything as UTF-8, but I'm 100% willing to add more general support.

That sounds like a bug. PathBuf is implemented as a newtype around OsString, so why does using it detour through String?

argh is fundamentally &str-driven:

argh/src/lib.rs

Line 188 in e59f2c8

fn from_args(command_name: &[&str], args: &[&str]) -> Result<Self, EarlyExit>;

I mentioned os_str_bytes because it looks useful for argument parsing, though I suspect the simplified approach argh takes makes it less important.

Byron commented

Thanks so much, @benbrittain, for your openness towards this improvement! With that added correctness, I think this will put argh right into the sweet spot between low-enough compile times, convenience, and low/no binary overhead.
Right now I am actually implementing a tool supporting both structopt and argh CLIs, and it's clear that the declaration of argh is easier to remember and less cluttered, so that opinionatedness does hit a sweet spot for me indeed.

My thoughts are as follows:

  • work with OsStr|OsString instances internally, and use std::env::args_os() instead of std::env::args()
  • when placing an OsStr into into a field that requires UTF-8 encoding, try to convert or fail gracefully
  • when wanting to call FromArgValue or from_str_fn(…), try to convert or fail gracefully
  • introduce a new FromArgValueOS or from_os_str_fn(…) (working titles :D) for those who want to support non-UTF-8 fully

Is there anything fundamental I am missing?
And how much effort do you think it is?

I am asking because maybe we sketch out the implementation a little and attract contributors :).

Byron commented

Hi @benbrittain, I think if I would hear your ideas about my initial thoughts stated above, I would consider giving the implementation a shot. That way I hope to avoid going off in the wrong direction.

Thanks a lot!

@Byron Sorry about the slow response! That sounds all sounds well thought out and I'm happy to work with you in moving argh in this direction! My only concern is with increasing binary overhead, so I want to ensure that we avoid that. I added #39 yesterday so that we actually track that

Byron commented

That sounds great, thank you! Regarding binary overhead, I would think that there is close to none, but then again I don't really know how Argh is working. As long as the code generator won't have to support two versions of everything, i.e. OsStr + str, I am optimistic supporting non-utf8 won't be detrimental.
Let's see how it goes :).