conaticus/FileExplorer

Cache-less fast file search by directly reading the file system

valaphee opened this issue · 3 comments

There is a way on Windows, and definitely also on Linux, to directly read the file system structure, which don't involve the operating systems file caching, search indices, etc.

And when searching for a file over the entire disk, it's way faster to just read for example the NTFS Master File Table; As a proof on concept, I created a project which currently implements this kind of file search for Windows NTFS file system: https://github.com/valaphee/curtain/blob/main/storage/examples/search.rs, and it is only limited by the speed of how fast the Master File Table can be read, and only takes about 1 second on my PC when searching on an 512GB NVMe volume with around 200k files.

This might also be combined with file system watches and caching. But without writing the cache to disk, as the load time of the cache is slower than just reading the file entries. And it also increases the disk wear level by an unnessary amount, especially as compressed data may completely change by just adding/removing one file.

The internals are little bit more complicated as it basically simulates the NTFS file system read operation and it also requires administrator privileges. One other downside is that it might be slower when limiting the search like for example searching for example by directory which has <= Total NTFS files / 2, which can be optimized by parsing the file structure, etc. with the cost of higher memory usage.

But its the technically the fastest way for searching files.

i second this

I just wanted to suggest the MFT store, however, I thought to use it to build the index/cache which would speed up subdirectory search when/if implemented.

I just wanted to suggest the MFT store, however, I thought to use it to build the index/cache which would speed up subdirectory search when/if implemented.

But you would have to overhaul the cache, as the "path" operation is really slow, because you have to walk the entire path to the root directory,

Which can be solved by normal tree walking utilizing the index attribute, or to directly build a tree from leaf to root.
And you could linearize the tree like:

Names (Linear vec, zero terminated strings)
0 C:\
1 | helloworld.txt
2 | System32\
3 | | ..
4 | Users\
5 | | valaphee\
6 | | | helloworld.txt
7 | | | helloworld\
8 | | | | helloworld.txt
9 | | | | ..
10| | | ..
11| | anotheruser\
12| | | helloworld.txt
13| | | ..
14| | ..
15| ..
16..
(will look like "C:\\\0helloworld.txt\0System32\\\0..\0Users\\\0valaphee\\\0...")

Ranges (HashMap)
C:\ 1-15
C:\System32\ 3-3
C:\Users\ 5-13
C:\Users\valaphee\ 6-8
C:\Users\valaphee\helloworld\ 8-8
C:\Users\anotheruser\ 12-12

Searching for "helloworld.txt" in C:\Users\valaphee:
- look in ranges for "C:\Users\valaphee\" => 6-8
- 6. found helloworld.txt
- 7. push helloworld\ (\ indicates directory)
- 8. found helloworld.txt
- (9.) pop last element (e.g. helloworld, .. indicates pop)

This method is a trade-off between linear search speed (CPU cache hits, because there is no pointer chasing like a Vec of Strings), hash lookup speed and also memory occupancy