ForNeVeR/TruePath

Fast native file system path abstraction

ForNeVeR opened this issue · 10 comments

So, we need an abstraction over a file system path which:

  • is able to parse a file system path on the local file system
  • guarantees that the path is absolute (or relative, or either)
  • is either a struct or a value object (immutable)
  • is normalized (i.e. no double or trailing separators, automatic stripping of UNC prefix on Windows)
  • provides a set of operators to append either a string or a relative path
  • should be fast (i.e. virtual methods are unwanted if we can live without them)
  • provides a set of basic operations:
    • canonicalize,
    • concatenate,
    • check if a file or a directory,
    • convert to a string.
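
To make the "normalized" requirement a bit more concrete, here is a minimal sketch of separator normalization. The class and method names are hypothetical, and it assumes both '/' and '\' count as separators (Windows-style); prefix handling (UNC, drive letters) is deliberately left out, so a leading "\\" would be collapsed too.

```csharp
using System.Text;

static class PathNormalizationSketch
{
    // Hypothetical helper: collapses runs of '/' or '\' into a single
    // separator and drops a trailing one (a lone root separator is kept).
    public static string Normalize(string path, char separator)
    {
        var result = new StringBuilder(path.Length);
        foreach (var c in path)
        {
            var isSeparator = c == '/' || c == '\\';
            if (isSeparator)
            {
                // Skip a separator that directly follows another one.
                if (result.Length > 0 && result[result.Length - 1] == separator)
                    continue;
                result.Append(separator);
            }
            else
            {
                result.Append(c);
            }
        }

        // Keep "/" as is, strip any other trailing separator.
        if (result.Length > 1 && result[result.Length - 1] == separator)
            result.Length--;

        return result.ToString();
    }
}
```

A real implementation would avoid the StringBuilder allocation when the input is already normalized; this version only illustrates the rule itself.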

Open questions:

  • interning? (may be useful for normalized paths)
    • decided for now out of scope
  • definition of canonicalization? (to me, it's obvious we should convert paths to the right case on case-insensitive file systems, but file systems may behave their own way)
    • this is called normalization and now documented

Do we need to parse non-local file system paths? For example, parsing C:\Users on Unix.
What will the abstract root be? On a Unix system the root is always the same, but not on Windows. We probably need several abstract roots.
Do we need to compare relative paths from different file systems? How will we compare case-sensitive and case-insensitive paths?

I guess we have to come up with some examples.

In my opinion, parsing non-local file system paths is a thing for #2.

But you've raised a good question, actually. I guess we'll need different "codecs" (or something else) for different file systems anyway, and if we're having it – then why not make them user-visible?

OK, we can forbid comparing paths with different codecs for now.
So, about parsing. How do we determine whether a path is absolute or relative if it starts with a slash on a Unix system? I think the user has to pass some enum value manually. But I guess we do not need that for Windows paths. So our API can't be the same for all platforms.

Therefore I suggest developing support for a single platform (Windows?) at the start.

> How do we determine whether a path is absolute or relative if it starts with a slash on a Unix system?

I would say that such a path is always an absolute one. Do you think there are cases when it's not?

> Therefore I suggest developing support for a single platform (Windows?) at the start.

Yep, I think we could start from that.

> I would say that such a path is always an absolute one. Do you think there are cases when it's not?

A simple mistake like this:

var path = "/dev/random";
var rootPath = "/dev";
var relativePath = path.Replace(rootPath, ""); // yields "/random": meant as a relative path, but it starts with a slash
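
For comparison, a small sketch that puts the string-based approach next to Path.GetRelativePath (available since .NET Core 2.0), which works on whole path segments. The class and method names are made up for illustration.

```csharp
using System.IO;

static class RelativePathDemo
{
    // The string-based approach: the leading slash survives, so the result
    // still looks like a rooted path.
    public static string ViaReplace() => "/dev/random".Replace("/dev", "");

    // Segment-aware alternative from the BCL: no leading separator is left.
    public static string ViaGetRelativePath() =>
        Path.GetRelativePath("/dev", "/dev/random");
}
```

Note that Path.GetRelativePath resolves both arguments to full paths first, so it is not a pure string operation.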

So, will we support wildcard paths, such as C:\Users\*\**\*.* and so on?

Will we support \..\ segments? We can normalize them easily, but we need to keep information about them in our abstraction.

What will we do with restricted characters, such as those from Path.GetInvalidFileNameChars? I guess throwing an exception is enough.

There are DOS path specifiers (\\.\ and \\?\). Do we need to parse them too?

How do we store parsed data in memory? I think the best way is a collection of ReadOnlySpan<char> to reduce memory allocations.

How do we iterate through a path's segments? There is no full tree or graph, so a simple flat collection (Array or LinkedList) is enough.

> A simple mistake like this:
>
> var path = "/dev/random";
> var rootPath = "/dev";
> var relativePath = path.Replace(rootPath, ""); // yields "/random": meant as a relative path, but it starts with a slash

Please note that the default behavior of path-combining functions such as Path.Combine is to short-circuit on absolute paths. For example, Path.Combine(@"C:\Windows", @"C:\Users") will return @"C:\Users" (and the same happens on Unix, which is more questionable, I agree).

I would like to preserve this behavior by default.
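
A tiny sketch of that short-circuit, using a slash-rooted second argument so it behaves the same on Windows and Unix (the class and method names are only for illustration):

```csharp
using System.IO;

static class CombineDemo
{
    // The second argument is rooted, so Path.Combine discards the first one.
    public static string CombineRooted() => Path.Combine("/usr", "/etc");

    // The second argument is relative, so the two parts are joined
    // with the platform's directory separator.
    public static string CombineRelative() => Path.Combine("/usr", "lib");
}
```

CombineRooted() returns "/etc"; CombineRelative() returns "/usr" and "lib" joined by Path.DirectorySeparatorChar.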

But (and that's a big but!) the main point of the library is to make paths more strongly typed, to avoid such issues. The prototype API I imagine looks like this:

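// this is always absolute; will assert in the constructor
struct AbsolutePath
{
  // C# requires binary operators to be static and to take both operands;
  // bodies are placeholders, and LocalPath is not defined in this sketch

  // autodetect path kind, behave as Path.Combine if passed an absolute path
  public static LocalPath operator /(AbsolutePath basePath, string relativeOrAbsolute) => throw new System.NotImplementedException();

  // note no necessity of any kind of "autodetection"
  public static LocalPath operator /(AbsolutePath basePath, RelativePath relative) => throw new System.NotImplementedException();
}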

// in this struct, we may allow passing paths such as `/random` and convert them into `random`, perhaps not by default
// but via a ctor overload that accepts various flags such as "TreatUnixRootedPathAsRelative"
struct RelativePath {}

This is all debatable, of course.
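
To give the discussion something compilable, here is a minimal sketch along the lines of the prototype. Everything in it is an assumption rather than the library's actual API: the Value property, the use of Path.IsPathRooted as the absolute/relative check (on Windows it also accepts drive-relative paths such as "C:foo"), and the operators returning AbsolutePath instead of LocalPath.

```csharp
using System;
using System.IO;

// Hypothetical minimal types; names follow the prototype, bodies are assumptions.
readonly struct RelativePath
{
    public string Value { get; }

    public RelativePath(string value)
    {
        if (Path.IsPathRooted(value))
            throw new ArgumentException($"Path \"{value}\" is not relative.", nameof(value));
        Value = value;
    }
}

readonly struct AbsolutePath
{
    public string Value { get; }

    public AbsolutePath(string value)
    {
        if (!Path.IsPathRooted(value))
            throw new ArgumentException($"Path \"{value}\" is not absolute.", nameof(value));
        Value = value;
    }

    // No autodetection needed: the argument type already says "relative".
    public static AbsolutePath operator /(AbsolutePath basePath, RelativePath relative) =>
        new AbsolutePath(Path.Combine(basePath.Value, relative.Value));

    // Autodetects the path kind the same way Path.Combine does:
    // a rooted argument replaces the base path.
    public static AbsolutePath operator /(AbsolutePath basePath, string relativeOrAbsolute) =>
        new AbsolutePath(Path.Combine(basePath.Value, relativeOrAbsolute));
}
```

With these types, `new AbsolutePath("/usr") / new RelativePath("lib")` type-checks, while `new RelativePath("/random")` throws instead of silently producing a rooted "relative" path.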

> So, will we support wildcard paths, such as C:\Users\*\**\*.* and so on?

As of now, this is not one of the main points of the library, but I have nothing against implementing that in the near future.

> Will we support \..\ segments? We can normalize them easily, but we need to keep information about them in our abstraction.

I believe that we should normalize paths by default, but we may discuss.

> What will we do with restricted characters, such as those from Path.GetInvalidFileNameChars? I guess throwing an exception is enough.

I think yes, just throw an exception from the path constructor.

> There are DOS path specifiers (\\.\ and \\?\). Do we need to parse them too?

We should do something about them (and network paths, too), but it may be another "codec".

I plan to use \\?\ automatically in certain cases (when converting paths to strings for WinAPI).

> How do we store parsed data in memory? I think the best way is a collection of ReadOnlySpan<char> to reduce memory allocations.

That's a good question. I'm not sure about ReadOnlySpan: it is a stack-only type, right? I don't think we should aim for that. Maybe ReadOnlyMemory is enough?

In ReSharper, there's an interning system for their FileSystemPath which reduces allocations a lot. We could think about such a system, too.

> How do we iterate through a path's segments?

I guess, in this case, we can have an API like AbsolutePath.ForEach(Action<ReadOnlySpan<char>>).
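
A rough sketch of such a callback-based enumeration. One caveat, to my knowledge: before C# 13, ReadOnlySpan<char> cannot be used as a generic type argument, so Action<ReadOnlySpan<char>> does not compile there; the sketch therefore assumes a custom delegate. The names PathSegmentAction and ForEachSegment are hypothetical.

```csharp
using System;

// Custom delegate type, because a ref struct such as ReadOnlySpan<char>
// cannot be a generic type argument on older compilers.
delegate void PathSegmentAction(ReadOnlySpan<char> segment);

static class SegmentEnumerationSketch
{
    // Calls the action for every non-empty segment, slicing the original
    // string instead of allocating substrings.
    public static void ForEachSegment(string path, char separator, PathSegmentAction action)
    {
        var rest = path.AsSpan();
        while (!rest.IsEmpty)
        {
            var index = rest.IndexOf(separator);
            var segment = index < 0 ? rest : rest.Slice(0, index);
            if (!segment.IsEmpty)
                action(segment);
            rest = index < 0 ? ReadOnlySpan<char>.Empty : rest.Slice(index + 1);
        }
    }
}
```

For "/usr//lib/" with '/' as the separator, the callback is invoked for "usr" and then for "lib".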

> I would like to preserve this behavior by default.

To preserve it, we can use such functions internally, can't we?

> public LocalPath operator / (string relativeOrAbsolute);

Wow, I like it! Using the / operator is a great idea!

> I believe that we should normalize paths by default, but we may discuss.

Yep, I tried to imagine a use case for storing info about '..' but actually couldn't. So OK, let's normalize.

> We should do something about them (and network paths, too), but it may be another "codec".

How do you suggest choosing a specific codec? Implicitly or explicitly?

> Maybe ReadOnlyMemory is enough?

Yes, agree with you.

> In ReSharper, there's an interning system for their FileSystemPath which reduces allocations a lot. We could think about such a system, too.

How does it work? Could you show docs or something, please?

> I guess, in this case, we can have an API like AbsolutePath.ForEach(Action<ReadOnlySpan<char>>).

Yep, or implement IEnumerable.

How do we parse a raw path? Should we use a regex? I think a simple 'Split' is not enough (because of \\, \\?\, \\.\). Or we can implement our own state machine (the way a regex works) or something like that.

> > I would like to preserve this behavior by default.
>
> To preserve it, we can use such functions internally, can't we?

We can use them, of course (though this probably won't be in line with the "zero-alloc" approach), or we can reimplement them on our own. In my opinion, we were only discussing the behavior here, not the implementation.

> > We should do something about them (and network paths, too), but it may be another "codec".
>
> How do you suggest choosing a specific codec? Implicitly or explicitly?

This is something open for discussion. I have in mind the implementation of so-called "interaction contexts" from ReSharper (where each path gets its own "interaction context" and will parse the paths accordingly), but this isn't set in stone.

> > In ReSharper, there's an interning system for their FileSystemPath which reduces allocations a lot. We could think about such a system, too.
>
> How does it work? Could you show docs or something, please?

This isn't documented (and is thus subject to change), but basically it works like this: there's a static interning cache, where the key is the path passed to a method FileSystemPath.Parse("foo", InternStrategy.{Intern/DoNotIntern/TRY_GET_INTERNED_BUT_DO_NOT_INTERN}).

So, this is only optimized for cases when you pass the same path to FileSystemPath.Parse a lot.

Note that the keys in the cache are before canonicalization.

I'm not saying we should do something like this, but this is a possibility.
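
To illustrate the idea only (this is not ReSharper's code, and every name below is hypothetical, including the PascalCase spelling of the third strategy), a static interning cache keyed by the raw, not yet canonicalized string could be sketched like this:

```csharp
using System.Collections.Concurrent;

enum InternStrategy { Intern, DoNotIntern, TryGetInternedButDoNotIntern }

// Stand-in for a parsed path object; a real one would hold parsed data.
sealed class ParsedPath
{
    public string Original { get; }
    public ParsedPath(string original) => Original = original;
}

static class PathInterningSketch
{
    // Keys are the raw strings as passed in, before any canonicalization.
    private static readonly ConcurrentDictionary<string, ParsedPath> Cache =
        new ConcurrentDictionary<string, ParsedPath>();

    public static ParsedPath Parse(string raw, InternStrategy strategy)
    {
        switch (strategy)
        {
            case InternStrategy.Intern:
                // Reuse the cached instance or add a new one.
                return Cache.GetOrAdd(raw, key => new ParsedPath(key));
            case InternStrategy.TryGetInternedButDoNotIntern:
                // Reuse if present, but never grow the cache.
                return Cache.TryGetValue(raw, out var cached) ? cached : new ParsedPath(raw);
            default:
                return new ParsedPath(raw);
        }
    }
}
```

Since the key is the raw string, two spellings of the same path (for example "a/b" and "a//b") would occupy two cache entries, which matches the note about keys being taken before canonicalization.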

> How do we parse a raw path? Should we use a regex? I think a simple 'Split' is not enough (because of \\, \\?\, \\.\). Or we can implement our own state machine (the way a regex works) or something like that.

In any case, this is a very simple routine with linear complexity (until we start considering various weird path parameters like partial case-sensitivity). So, any simple implementation would work, provided it does no unnecessary allocations.
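
As an illustration of how simple the first step can be, here is a sketch that classifies the Windows-style prefixes mentioned in this thread before any segment splitting happens. The enum and method names are assumptions, and forms such as \\?\UNC\ or forward-slash variants are ignored.

```csharp
using System;

enum WindowsPathPrefix { None, Unc, DeviceDot, DeviceQuestion }

static class PrefixSketch
{
    // Checks the longer prefixes first, because "\\" is also
    // the start of "\\?\" and "\\.\".
    public static WindowsPathPrefix DetectPrefix(ReadOnlySpan<char> path)
    {
        if (path.StartsWith(@"\\?\".AsSpan())) return WindowsPathPrefix.DeviceQuestion;
        if (path.StartsWith(@"\\.\".AsSpan())) return WindowsPathPrefix.DeviceDot;
        if (path.StartsWith(@"\\".AsSpan())) return WindowsPathPrefix.Unc;
        return WindowsPathPrefix.None;
    }
}
```

The rest of the path can then be scanned once, separator by separator, without regex and without allocations.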

I think, by default our path should store a canonicalized path string inside of itself, and send parts of it when requested by APIs that enumerate its components. Whether we should add anything to work with Memory<char> I'm not sure yet. Maybe default to Memory and add string-based overloads?

> This isn't documented (and is thus subject to change), but basically it works like this: there's a static interning cache, where the key is the path passed to a method FileSystemPath.Parse("foo", InternStrategy.{Intern/DoNotIntern/TRY_GET_INTERNED_BUT_DO_NOT_INTERN}).
>
> So, this is only optimized for cases when you pass the same path to FileSystemPath.Parse a lot.
>
> Note that the keys in the cache are before canonicalization.
>
> I'm not saying we should do something like this, but this is a possibility.

Great idea, but I don't understand how it should be implemented. I don't think it is a first-priority task. We should create an issue and discuss it later.

> Maybe default to Memory and add string-based overloads?

Yep, string-based overloads with Memory<char>.ToString calls I guess.
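
A possible shape for that pair of overloads, as a sketch only: the LastSegment operation and its name are made up for illustration, with the memory-based version doing the work and the string-based one converting at the boundary.

```csharp
using System;

static class LastSegmentSketch
{
    // Memory-based core: returns a slice of the input, no new string.
    public static ReadOnlyMemory<char> LastSegment(ReadOnlyMemory<char> path, char separator)
    {
        var index = path.Span.LastIndexOf(separator);
        return index < 0 ? path : path.Slice(index + 1);
    }

    // String-based convenience overload: wraps the string and
    // materializes the resulting slice via ToString.
    public static string LastSegment(string path, char separator) =>
        LastSegment(path.AsMemory(), separator).ToString();
}
```

Only the string overload allocates, and only for the result; callers that can stay on ReadOnlyMemory<char> avoid that too.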

I am closing this issue as mostly implemented, and extracting the remaining parts to a set of separate, more focused issues.