/ScanBytes.cpp

A library and a CLI tool for scanning files for occurrences of certain bytes and writing their offsets to standard output.

Primary language: C++. License: The Unlicense.

ScanBytes Unlicensed work


ScanBytes is a tool and a library for quickly scanning files for occurrences of certain bytes.

It can be useful, for example, for creating an index of a CSV/TSV file or of any line-based file. A given line or record can then be fetched quickly by its index.
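To illustrate the indexing idea, here is a minimal sketch of how a list of newline offsets can serve as a line index. The functions `newlineOffsets` and `fetchLine` are hypothetical names made up for this illustration and are not part of the ScanBytes API; the sketch also works on an in-memory buffer rather than on a file.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not the ScanBytes API): records the offset of every '\n'.
std::vector<std::size_t> newlineOffsets(std::string_view data) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == '\n') offsets.push_back(i);
    }
    return offsets;
}

// Hypothetical helper: returns line `n` (0-based) without its trailing '\n'.
// Uses only the offsets index, so no rescan of the data is needed.
std::string_view fetchLine(std::string_view data,
                           const std::vector<std::size_t>& newlines,
                           std::size_t n) {
    // Line number newlines.size() is the (possibly empty) tail after the last '\n'.
    assert(n <= newlines.size());
    const std::size_t begin = (n == 0) ? 0 : newlines[n - 1] + 1;
    const std::size_t end = (n < newlines.size()) ? newlines[n] : data.size();
    return data.substr(begin, end - begin);
}
```

Once the index exists, fetching a line costs two lookups in the offsets vector, regardless of how large the data is.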

Features

  • Multithreading brings performance benefits when data fits in disk caches.
  • Two generic backends: one is JIT-compiled, the other is not. The JIT backend is supported only on certain platforms, currently only x86_64.
  • Specialized hardcoded backend for scanning certain common cases:
    • lines in a file
    • TSV
    • CSV
  • Automatic dispatching between backends.
  • Built-in benchmark.
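As a rough illustration of what a generic, non-JIT backend has to do, the sketch below scans a buffer against an arbitrary set of bytes using a 256-entry lookup table. This is an assumption about the general technique, not the actual ScanBytes implementation; the function name `scanBytes` and its signature are invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical table-driven scan (not the real ScanBytes code):
// returns the offset of every byte of `data` that occurs in `alphabet`.
std::vector<std::uint64_t> scanBytes(std::string_view data, std::string_view alphabet) {
    // One flag per possible byte value; true means "report this byte".
    std::array<bool, 256> wanted{};
    for (unsigned char c : alphabet) wanted[c] = true;

    std::vector<std::uint64_t> offsets;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (wanted[static_cast<unsigned char>(data[i])]) offsets.push_back(i);
    }
    // Sanity check: every recorded offset lies inside the buffer.
    assert(offsets.empty() || offsets.back() < data.size());
    return offsets;
}
```

A specialized backend for a fixed alphabet, such as a single newline byte, can replace the table lookup with a direct comparison, which is presumably why the common cases get their own hardcoded paths.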

Example

echo "The quick brown fox jumps over the lazy dog" > fox.txt
#     0123456789ABCDEF0123456789ABCDEF01234567
ScanBytes --alphabet " fh" s fox.txt | hd
00000000  01 00 00 00 00 00 00 00  03 00 00 00 00 00 00 00  |................|
00000010  09 00 00 00 00 00 00 00  0f 00 00 00 00 00 00 00  |................|
00000020  10 00 00 00 00 00 00 00  13 00 00 00 00 00 00 00  |................|
00000030  19 00 00 00 00 00 00 00  1e 00 00 00 00 00 00 00  |................|
00000040  20 00 00 00 00 00 00 00  22 00 00 00 00 00 00 00  | .......".......|
00000050  27 00 00 00 00 00 00 00                           |'.......|
00000058

As you can see, there is a lot of redundancy in the output. It can be compressed by encoding it into a proper data structure, but this is currently not implemented in C++.
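To give a feel for how much of that redundancy can be removed, here is a sketch of one simple option: storing the differences between successive offsets as LEB128-style variable-length integers. This is only an illustration chosen for its simplicity and is not necessarily the data structure the author has in mind; the names `deltaVarintEncode` and `deltaVarintDecode` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical encoder: writes each gap between successive offsets as a varint
// (7 payload bits per byte, high bit set on every byte except the last).
std::vector<std::uint8_t> deltaVarintEncode(const std::vector<std::uint64_t>& offsets) {
    std::vector<std::uint8_t> out;
    std::uint64_t prev = 0;
    for (std::uint64_t off : offsets) {
        assert(off >= prev);  // offsets must be sorted in ascending order
        std::uint64_t delta = off - prev;
        prev = off;
        while (delta >= 0x80) {
            out.push_back(static_cast<std::uint8_t>(delta | 0x80));
            delta >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(delta));
    }
    return out;
}

// Hypothetical decoder: reverses deltaVarintEncode.
std::vector<std::uint64_t> deltaVarintDecode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint64_t> offsets;
    std::uint64_t prev = 0;
    std::uint64_t delta = 0;
    unsigned shift = 0;
    for (std::uint8_t b : bytes) {
        delta |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        if (b & 0x80) {  // continuation bit: more bytes belong to this gap
            shift += 7;
            continue;
        }
        prev += delta;
        offsets.push_back(prev);
        delta = 0;
        shift = 0;
    }
    return offsets;
}
```

For the eleven offsets in the example above, every gap is smaller than 128, so this encoding needs one byte per offset: 11 bytes instead of the 88 bytes shown in the dump.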

Installation

Packaging with CPack is implemented, so you can generate an installable package for Debian-based and RPM-based distros. All the dependencies are assumed to be installed the same way.

Related projects