rust-lang/rust

std::fs::canonicalize returns UNC paths on Windows, and a lot of software doesn't support UNC paths

radix opened this issue Β· 38 comments

radix commented

Hi, I hope this is the right forum/format to register this problem, let me know if it's not.

Today I tried to use std::fs::canonicalize to make a path absolute so that I could execute it with std::process::Command. canonicalize returns so-called "UNC paths", which look like this: \\?\C:\foo\bar\... (sometimes the ? can be a hostname).

It turns out you can't pass a UNC path as the current directory when starting a process (i.e., Command::new(...).current_dir(unc_path)). In fact, a lot of other apps will blow up if you pass them a UNC path: for example, Microsoft's own cl.exe compiler doesn't support it: rust-lang/cc-rs#169

It feels to me that maybe returning UNC paths from canonicalize is the wrong choice, given that they don't work in so many places. It'd probably be better to return a simple "absolute path", which begins with the drive letter, instead of returning a UNC path, and instead provide a separate function specifically for generating UNC paths for people who need them.

Maybe if this is too much of an incompatible change, a new function for creating absolute paths should be added to std? I'd bet, however, that making the change to canonicalize itself would suddenly make more software suddenly start working rather than suddenly break.

canonicalize simply asks the kernel to canonicalize the path, which it does and happens to return the canonical path as a root local device path. Root local device paths are capable of representing some paths which normal absolute paths are incapable of representing accurately (such as components being named ".." or "." or having "/" in them), along with the fact that they're the only way to call many system functions with paths longer than MAX_PATH (aside from being on some recent version of Windows 10 and having a certain registry key enabled). As a result having libstd automatically strip the prefix would definitely break some situations. However I'm definitely in favor of having more support in libstd for converting between these different kinds of paths so the user can easily turn a root local device path into an absolute path. I'd also love to have an fs::normalize which merely normalizes a possibly relative path into an absolute path without hitting the filesystem on Windows.

In reference to your commit which referenced this PR, normalization is not the same as merely joining the path onto the current directory due to drive relative paths being relative to the current directory on the given drive. For example given a drive relative path of C:foo, and an env::current_dir() of D:\bar, normalizing C:foo will have to get the current directory for C:\ and could end up being normalized to something radically different such as C:\i\dont\even\foo.

radix commented

thanks, @retep998 :) it's just a hacked-together build tool that probably will eventually be replaced with something else, and I didn't intend to notify this ticket about my commit. but I guess it goes to show that a good way to get an absolute path in std would be really helpful.

Command::current_dir should be fixed. I doubt we will change canonlicalize.

Note, the i-wrong tag is only for the Command::current_dir, not the canonicalize behaviour.

Quick testing on Windows 10.0.15063 indicates that both SetCurrentDirectoryW and CreateProcessW are okay with a current directory starting with \\?\. They are not okay with a current directory that exceeds MAX_PATH regardless of \\?\. CreateProcessW is okay with the path to the process itself starting with \\?\ regardless of whether the first parameter is used. CreateProcessW is only okay with the path to the process exceeding MAX_PATH if it starts with \\?\ and is specified as the first parameter which Rust does not currently use. I tested std::process::Command::current_dir and it works as expected, accepting paths starting with \\?\ but rejecting any paths exceeding MAX_PATH.

Technically, AFAIK it is safe to strip the prefix in common simple cases (absolute path with a drive letter, no reserved names, shorter than max_path), and leave it otherwise.

So I think there's no need to compromise on correctness as far as stdlib goes. The trade-off is between failing early and exposing other software that doesn't support UNC paths vs maximizing interoperability with non-UNC software.

In an ideal world, I would prefer the "fail early" approach, so that limitations are quickly found and removed. However, Windows/DOS path handling has exceptionally long and messy history and decades of Microsoft bending over backwards to let old software not upgrade its path handling. If Microsoft can't push developers towards UNC, and fails to enforce this even in their own products, I have no hope of Rust shifting the Windows ecosystem to UNC. It will rather just frustrate Rust users and make Rust seem less reliable on Windows.

So in this case I suggest trying to maximize interoperability instead, and canonicalize to regular paths whenever possible (using UNC only for paths that can't be handled otherwise).

Also, careful stripping of the prefix done in stdlib will be much safer than other crates stripping it unconditionally (because realistically whenever someone runs into this problem, they'll just strip it unconditionally)

ofek commented

@kornelski I completely agree. The current behavior is unexpected in my opinion.

I hope this is helpful…

According to Microsoft:

Note File I/O functions in the Windows API convert / to \ as part of converting the name to an NT-style name, except when using the \\?\ prefix as detailed in the following sections.

Source: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx

And the Ruby language uses forward slashes for File paths and that works on Windows.

I've looked at this problem in detail. There are a few rules which need to be checked to safely strip the UNC prefix. It can be implemented as a simple state machine.

I've implemented that using public APIs, but because OsStr is opaque it's not nearly as nice as stdlib's implementation could have been:

https://lib.rs/dunce

So I'm still hoping canonicalize would do it automatically, because if it's done only for legacy-compatible paths there's no downside: all paths work for UNC-aware programs, and all paths that can work for legacy programs work too.

Another example of this issue that I encountered in alexcrichton/cargo-vendor#71:

url::URL.to_file_path() returns a non-UNC path (even if the URL was initialized with a UNC path). And std::path::Path.starts_with() doesn't normalize its arguments to UNC paths. So calling to_file_path() on a file: URL and then comparing it to the output of canonicalize() via starts_with() always returns false, even if the two paths represent the same resource:

extern crate url;

use std::path::Path;
use url::Url;

fn main() {
	// Path.canonicalize() returns a UNC path.
	let unc_path_buf = Path::new(r"C:\Windows\System").canonicalize().expect("path");
	let unc_path = unc_path_buf.as_path();

	// Meanwhile, Url.to_file_path() returns a non-UNC path,
	// even when initialized from a UNC path.
	let file_url = Url::from_file_path(unc_path).expect("url");
	let abs_path_buf = file_url.to_file_path().expect("path");
	let abs_path = abs_path_buf.as_path();

	// unc_path and abs_path refer to the same resource,
	// and they both "start with" themselves.
	assert!(unc_path.starts_with(unc_path));
	assert!(abs_path.starts_with(abs_path));

	// But they don't "start with" each other, so these fail.
	assert!(unc_path.starts_with(abs_path));
	assert!(abs_path.starts_with(unc_path));
}

Arguably, to_file_path() should return a UNC path, at least when initialized with one. And perhaps starts_with() should normalize its arguments (or perhaps clarify that it compares paths, not the resources to which they refer, and thus does no normalization). Also, the mitigation for consumers of this API is straightforward: canonicalize() all paths you compare if you do so to any of them. So maybe the current behavior is reasonable.

Nevertheless, it does feel like something of a footgun, so it's worth at least documenting how it differs from that of some other APIs on Windows.

comparing it to the output of canonicalize() via starts_with() always returns false, even if the two paths represent the same resource:

Comparing canonical paths is a footgun in general because it is the wrong thing to do! Things like hard links and so on mean that such comparisons will never be entirely accurate. Please don't abuse canonicalization for this use case.

If you want to tell whether two paths point to the same file, compare their file IDs! That's what same-file does and it works great!

but starts_with is not for is-file-a-file comparison, but is-file-in-a-directory check. There are no hardlinks involved (and AFAIK apart from private implementation detail of macOS time machine, no OS supports directory hardlinks).

There are more ways than just \\?\C:\ and C:\ to represent the same path, so unfortunately any sort of file in directory check is a hard problem. For example, paths can refer to drives via names other than drive letters.

I'ive been using conditional comipiling when canonicalizing, and not doing it on Windows.

ofek commented

Every time I get back into Rust I hit this, forget I've been here before, Google why canonicalize behaves differently on Windows than all the other languages' stdlib I've used, then return to this issue...

Is there an official way/well known crate to:

  1. get the normalized, absolute version of a path (no UNC) regardless of its existence, akin to Python's os.path.abspath
  2. get the normalized, absolute version of a path (no UNC) while resolving all symlinks, akin to Python's os.path.realpath

It's extremely painful to write cross-platform code due to this issue. You can even see the steady stream of bugs linking back to here.

I found:

but no progress.

Perhaps I should just look at how ripgrep handles things and do that.

It looks like it should be an innocent path-cleaning fn, but it prepends combos of / and ? on Windows that break things.

This feels like something where doing that's the right answer in some cases, but wrong in others. I think the answer is to have it take an enum where you have to explicitly specify whether you want the //?// bit. If you're not in one of those use cases, you're going to get burned. If you're developing on Linux, someone compiling the code for Windows will get burned.

This continues to be both an annoyance, and a source of bugs from oversimplified workarounds.

webdesus/fs_extra#29

Unless I'm somehow mistaken, it seems that one tool that doesn't support UNC paths is rustc itself: I tried to pass a UNC path to --extern and it claimed the path didn't exist until I trimmed off the \\?\.

ofek commented

Still bugs occurring that link back to this issue...

When trying to work around the issue from tauri::app::PathResolver::resolve_resource, which internally uses std::fs::canonicalize, I found out that the following code:

println!("{}", path.starts_with(r"\\?\"));
println!("{}", path.to_string_lossy().starts_with(r"\\?\"));
print!("{}", path.display());
std::io::stdout().flush().unwrap();

gives the following output:

false
true
\\?\C:\Users\USER\Documents\github\repalungs\tauri-client\src-tauri\target\debug\assets\default.nifti.gz

which prevents the following workaround

#[cfg(target_os = "windows")]
let path: PathBuf = match path.strip_prefix(r"\\?\") {
    Ok(path) => path.to_path_buf(),
    Err(_) => path,
};

from working. I'd really suggest to use dunce::simplified instead.

[...]
which prevents the following workaround

#[cfg(target_os = "windows")]
let path: PathBuf = match path.strip_prefix(r"\\?\") {
    Ok(path) => path.to_path_buf(),
    Err(_) => path,
};

from working. I'd really suggest to use dunce::simplified instead.

@JohnScience That "workaround" is most likely wrong in the first place since

UNC paths can change their meaning if they're reinterpreted as DOS paths.

Consider using https://lib.rs/crates/dunce which strips the prefix only when it is safe to do so.

See webdesus/fs_extra#29 for the source of the citation.

Canonicalizing a path has a specific meaning which std::fs::canonicalize implements. I don't think that should be changed to let users stay ignorant of that. This blog post shows a myriad of ways Go suffers from sacrificing correctness for simplicity and I don't think that is the right approach for Rust.
While the behaviour is correctly documented, it could maybe be made more prominently in the docs. Although 3 out of 5 lines are already dedicated to this behaviour...

@Tastaturtaste I'm the author of dunce crate and the bug you're citing. I think canonicalize should strip the UNC prefix whenever possible.

Naive unconditional stripping of the prefix is wrong, but there are clear stable rules for when it's perfectly safe and correct to do so. I'll reiterate that this is not a sacrifice of correctness. It's purely an improvement in compatibility β€” paths that can't be losslessly converted to classic DOS paths can stay as UNC.

And compatibility improvement is unfortunately needed. Even Microsoft's own MSVC does not support UNC paths. Given slow-moving backwards-looking nature of Windows, the situation is unlikely to improve, so Rust is just stuck with an incompatible oddity.

ofek commented

Has there been any update on potentially adding the correct functionality to the standard library?

There is now the (currently unstable) std::path::absolute that turns paths in to an absolute form without adding \\?\.

canonicalize is more tricky. It returns whatever is returned by the OS so if it's \\?\ prefixed then it'll need to be manually converted but only if the conversion is not lossy. However, deciding whether or not a path can round trip through a win32 path is not trivial and may change between versions of Windows. The only truly reliable way is to get Windows to actually round-trip it and see if you end up with the same path. There's also the tricky issue of changing long established behaviour. The \\?\ prefix may break some people but others may be relying on it (e.g. people wanting a path that will bypass max path limits).

Probably the best way to make progress here is to have a function that attempts to perform the conversion. This would also be more generally useful.

deciding whether or not a path can round trip through a win32 path is not trivial

Yes, but the rules are documented.

and may change between versions of Windows

This is extremely unlikely to change in an incompatible way, because the documented limits of these paths exist for backward compatibility with legacy software that may have them hardcoded (e.g. binaries may have 260-byte long buffers for paths).

Windows can relax some of the limits, e.g. some win32 APIs can be configured to accept more than 260 bytes. Maybe someday CON will be a legal name everywhere. But even if Windows relaxes the limits, Rust sticking to the old limits won't create any problems. At worst it will unnecessarily use UNC paths in some edge cases. Currently it uses UNC paths every time, so it literally can't be any worse than it is now.

ofek commented

it literally can't be any worse than it is now

I would like to echo this sentiment, it is not hyperbole https://twitter.com/mitsuhiko/status/1699381180536086714

The vast majority of code that deals with file paths and cares about Windows either uses https://docs.rs/dunce/ or has custom logic to strip the first UNC characters like https://github.com/sharkdp/fd/pull/72/files. Code that doesn't do one of these yet either will eventually when people open bug reports (see the dozens of issues that link to this issue) or will produce paths that cannot be used by everything else and cannot often be considered to support Windows.

Normally I wouldn't follow issues for this long because I have a great number of work, FOSS and personal matters that fill my time nearly completely but I happen to be a Windows user and this issue affects us. So, I will keep watching the developments in hope of a fix...

Yes, but the rules are documented.

Could you provide a link? I know of people reverse engineering the current algorithm but I'm not aware of Microsoft docs that describe the complete lossy transformation.

I'm aware of those links. The first one is the most relevant and has had more details added over its lifetime but still does not document the path parsing algorithm in its entirety. I still think the most robust solution is to attempt to round-trip the path and see if it changes.

It's more than 5 year now, do we have a reasonable solution?

We now have std::path::absolute which will work better if you don't need to resolve symlinks. We do need to work towards stabilizing it however.

Removing the \\?\ prefix still requires a third party crate. We could add a windows-only function to Path that attempts to remove it. We would however need to guard against changing the meaning of the path (which can happen in certain edge cases).

Thanks for that but it's still extremely frustrating. No matter what I try I still get:

\\?\UNC\applications2...

In my csv output file.

Using:

let file_path = path::absolute(path)?.to_string_lossy().to_string();

The expected value is \\applications2\... (it's a Windows network drive).

I tried dunce as well.

What's the input path? absolute shouldn't add \\?\UNC\ unless it's already in the input path (or in the current directory). I think dunce has issues with some paths, you could try omnipath instead.

It's like H:/opensight/ where H: is mapped to \\applications2\sousaa.

Hm, using std::path::absolute should resolve that as H:\opensight\ because it doesn't look up where it's mapped to.

You're right, it worked, it just that I used the wrong binary. Many thanks!