Cysharp/Utf8StreamReader

File reading can take advantage of `RandomAccess(SFH)` for .NET 6+ targets

Closed this issue · 2 comments

Hi, and thank you for writing Utf8StreamReader. It is a really fast and nice lower-level stream reader type.

There is a particular bit that stood out to me in the readme on reading from a file: https://github.com/Cysharp/Utf8StreamReader?tab=readme-ov-file#optimizing-filestream

For .NET 6+ targets, a FileStream may not be necessary: for files of definite length, you can simply store the consumed offset and issue calls to RandomAccess.ReadAsync(SafeFileHandle, buffer, offset) directly, avoiding (potential) double-buffering and the extra overhead of dispatching through a file stream strategy.
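To make the idea concrete, here is a minimal sketch of the offset-tracking approach. It is not the U8Reader code; `CountNewlinesAsync`, the 64 KiB buffer size, and the temp-file setup are assumptions made up for illustration. Only `File.OpenHandle` and `RandomAccess.ReadAsync` are the actual .NET 6+ APIs being discussed.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Win32.SafeHandles;

// Hypothetical helper (not from U8Reader or Utf8StreamReader): counts '\n' bytes
// by reading through a SafeFileHandle and tracking the file offset manually.
static async Task<int> CountNewlinesAsync(string path)
{
    using SafeFileHandle handle = File.OpenHandle(
        path, FileMode.Open, FileAccess.Read, FileShare.Read, FileOptions.Asynchronous);

    byte[] buffer = new byte[64 * 1024]; // buffer size is an arbitrary choice
    long offset = 0;                     // consumed offset, kept by the caller
    int newlines = 0;

    while (true)
    {
        // Each call states the offset explicitly; there is no stream position.
        int read = await RandomAccess.ReadAsync(handle, buffer.AsMemory(), offset);
        if (read == 0) break; // end of file

        for (int i = 0; i < read; i++)
        {
            if (buffer[i] == (byte)'\n') newlines++;
        }
        offset += read;
    }

    return newlines;
}

// Usage: write a small UTF-8 file, then count its line terminators.
string path = Path.GetTempFileName();
await File.WriteAllBytesAsync(path, Encoding.UTF8.GetBytes("a\nb\nc\n"));
int lines = await CountNewlinesAsync(path);
Console.WriteLine(lines);
File.Delete(path);
```

Because the offset lives in a local variable, there is no shared position to synchronize, which is what makes this shape simple for files of definite length.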

This is what I eventually arrived at when implementing U8Reader:

(Arguably, it is slower on long lines and more allocatey because it has to perform validation and cannot assume the lifetime of returned data. It acts like a better regular StreamReader, so this is a really good counterpart.)

The easy path to implementing it is just making a Utf8FileReader copy of the original implementation to avoid extra object fields/object field overloading. However, I must put a disclaimer that this might not be worth it in light of async overhead in this particular case*. Still, posting it here as notes that may be of interest to you, should you find FileStream overhead and quirks noticeable.

* secret trick to low latency file IO on Unix that all code reviewers hate - using sync 😆

Thanks for the feedback.
I really appreciate your abstraction of Stream, FileHandle, and Socket in your design.

In Utf8StreamReader, it only executes ReadAsync(Memory<byte>) of FileStream.
This calls OSFileStreamStrategy.ReadAsync, which in turn calls RandomAccess.ReadAtOffsetAsync.
https://github.com/dotnet/runtime/blob/1fa699e9e82e96a7f4f1928f258984ce1cc38471/src/libraries/System.Private.CoreLib/src/System/IO/Strategies/OSFileStreamStrategy.cs#L290
(Since RandomAccess.ReadAsync internally calls RandomAccess.ReadAtOffsetAsync, they are essentially the same)

In other words, the current FileStream is a very thin wrapper around RandomAccess, so if you create a non-buffered FileStream, there is minimal overhead.
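For reference, a minimal sketch of what a non-buffered FileStream could look like. `ReadAllUnbufferedAsync`, the read buffer size, and the temp-file setup are hypothetical names and values for illustration. `bufferSize: 1` is used on the assumption that the target is .NET 6+, where a value of 0 or 1 disables FileStream's internal buffering.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: reads a file through a FileStream whose internal
// buffer is disabled, so each ReadAsync goes to the OS-level strategy.
static async Task<long> ReadAllUnbufferedAsync(string path)
{
    await using var stream = new FileStream(
        path, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 1,   // 0 or 1 disables FileStream buffering on .NET 6+
        useAsync: true);

    byte[] buffer = new byte[64 * 1024]; // caller-owned buffer, arbitrary size
    long total = 0;

    while (true)
    {
        int read = await stream.ReadAsync(buffer.AsMemory());
        if (read == 0) break; // end of stream
        total += read;
    }

    return total;
}

// Usage: "hello\nworld\n" is 12 bytes in UTF-8.
string path = Path.GetTempFileName();
await File.WriteAllBytesAsync(path, Encoding.UTF8.GetBytes("hello\nworld\n"));
long total = await ReadAllUnbufferedAsync(path);
Console.WriteLine(total);
File.Delete(path);
```

The reader supplies its own buffer here, so disabling the FileStream buffer avoids copying the same bytes twice.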

Thanks, I have submitted a PR (different issue), and it also adds a little more context on why and how the implementation ended up looking the way it does. With that said, I think it's often better to keep the implementation simple, which has been really useful throughout the years, as I learned a lot from your code back in the day! (It's a huge service to the .NET community.)

But yes, it's overall less risky and more effort-efficient compared to what U8Reader tries to do, and it generally makes sense. I don't even know whether your use case primarily involves consuming files or network streams, so it may have been irrelevant.