ScanBytes is a command-line tool and a library for quickly scanning files for occurrences of a given set of bytes.
It is useful, for example, for building an index of a CSV/TSV file or of a plain file with lines, so that a given line or record can then be fetched quickly by its index.
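A minimal sketch of the lookup side, in Python. It assumes the index is a flat array of little-endian unsigned 64-bit offsets of the newline bytes, which is an assumption based on the example dump in this document. `build_newline_index` is a pure-Python stand-in for the scanner and `fetch_line` is a hypothetical helper; neither is part of ScanBytes.

```python
import os
import struct
import tempfile

# Assumed index entry layout: one little-endian uint64 offset per match.
ENTRY = struct.Struct("<Q")


def build_newline_index(data: bytes) -> bytes:
    """Stand-in for the scanner: pack the offset of every newline byte."""
    return b"".join(ENTRY.pack(i) for i, b in enumerate(data) if b == 0x0A)


def fetch_line(data_path: str, index: bytes, n: int) -> bytes:
    """Return line n (0-based, without its newline) using the offset index."""
    count = len(index) // ENTRY.size
    if not 0 <= n < count:
        raise IndexError(n)
    # Line n ends at the n-th newline and starts right after the previous one.
    end = ENTRY.unpack_from(index, n * ENTRY.size)[0]
    start = 0 if n == 0 else ENTRY.unpack_from(index, (n - 1) * ENTRY.size)[0] + 1
    with open(data_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def demo() -> list:
    """Index a small TSV file and fetch each record by its number."""
    data = b"id\tname\n1\tfox\n2\tdog\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        index = build_newline_index(data)
        return [fetch_line(path, index, n) for n in range(3)]
    finally:
        os.remove(path)
```

Each lookup costs two index reads and one seek, regardless of where the record sits in the file.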
- Multithreading brings performance benefits when data fits in disk caches.
- Two generic backends: one JIT-compiled, the other not. The JIT backend is supported only on certain platforms, currently only x86_64.
- Specialized hardcoded backend for scanning certain common cases:
  - lines in a file
  - TSV
  - CSV
- Automatic dispatching between backends.
- Built-in benchmark.
echo "The quick brown fox jumps over the lazy dog" > fox.txt
# 0123456789ABCDEF0123456789ABCDEF01234567
ScanBytes --alphabet " fh" s fox.txt | hd
00000000 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 |................|
00000010 09 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 |................|
00000020 10 00 00 00 00 00 00 00 13 00 00 00 00 00 00 00 |................|
00000030 19 00 00 00 00 00 00 00 1e 00 00 00 00 00 00 00 |................|
00000040 20 00 00 00 00 00 00 00 22 00 00 00 00 00 00 00 | .......".......|
00000050 27 00 00 00 00 00 00 00 |'.......|
00000058
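Judging from the hex dump above, the output appears to be a sequence of little-endian unsigned 64-bit offsets, one per matched byte. The following Python sketch decodes it under that assumption; `decode_offsets` is a hypothetical helper, and the dump is rebuilt here from the values listed above rather than read from the tool.

```python
import struct

# Rebuild the 88 bytes of the dump: each entry is one low byte plus 7 zero bytes.
LOW_BYTES = (0x01, 0x03, 0x09, 0x0F, 0x10, 0x13, 0x19, 0x1E, 0x20, 0x22, 0x27)
dump = b"".join(bytes([b]) + bytes(7) for b in LOW_BYTES)


def decode_offsets(raw: bytes) -> list:
    """Decode a buffer of little-endian uint64 values into a list of ints."""
    if len(raw) % 8:
        raise ValueError("length is not a multiple of 8 bytes")
    return [value for (value,) in struct.iter_unpack("<Q", raw)]


text = b"The quick brown fox jumps over the lazy dog\n"
offsets = decode_offsets(dump)
# Every decoded offset should point at a byte from the alphabet " fh".
matched = bytes(text[i] for i in offsets)
```

Here `offsets` holds the 11 positions of the spaces and the letters `f` and `h` in the input line.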
As you can see, the output contains a lot of redundancy. It could be compressed by encoding it into a more compact data structure, but this is not currently implemented in C++.
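One possible way to remove that redundancy, sketched in Python, is to store the gaps between consecutive offsets as variable-length integers (7 data bits per byte, high bit as a continuation flag). This is only an illustration of the idea, not the data structure the project plans to use; both function names are hypothetical.

```python
def encode_varint_deltas(offsets) -> bytes:
    """Encode sorted offsets as varint-encoded gaps between neighbours."""
    out = bytearray()
    prev = 0
    for off in offsets:
        delta = off - prev
        if delta < 0:
            raise ValueError("offsets must be sorted in ascending order")
        prev = off
        # Emit 7 bits at a time, lowest first; set the high bit if more follow.
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)


def decode_varint_deltas(blob: bytes) -> list:
    """Inverse of encode_varint_deltas: rebuild the absolute offsets."""
    offsets = []
    prev = 0
    value = 0
    shift = 0
    for byte in blob:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            offsets.append(prev)
            value = 0
            shift = 0
    if shift:
        raise ValueError("truncated varint at end of input")
    return offsets
```

For the 11 offsets in the example above, every gap is below 128, so the 88-byte output would shrink to 11 bytes. The trade-off is that entry `n` can no longer be read directly by position: decoding has to start from the beginning or from a periodically stored absolute offset.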
Packaging with CPack is implemented, so you can generate installable packages for Debian- and RPM-based distros. All dependencies are assumed to be installed the same way.