/hashtree

SHA256 library highly optimized for Merkle tree computations

Primary LanguageAssemblyMIT LicenseMIT

hashtree

Hashtree is a SHA256 library highly optimized for Merkle tree computation. It is based on Intel's implementation with a few modifications like hardcoding the scheduled words of the padding block.

The library exposes a single header file with the low level sha functions. They all have the following signature:

void sha256(unsigned char *output, unsigned char *input, uint64_t count)

where input is the buffer containing count chunks of 64 bytes to be hashed, and output is a pre-allocated buffer that will hold the count 32-byte digests. These are low level functions that do not perform any error check: the caller is responsible for checking that input is at least 64*count bytes long and that output is writable and at least 32*count bytes long. The caller is responsible for memory allocation and de-allocation of these buffers.

Dependencies

There are no dependencies besides the standard C header stdint.h. Benchmarks have a dependency on libm. Tests and benchmarks on x86-64 an extra dependency on cpuid.h is needed. An optional dependency on openssl allows to test and benchmark against openssl. The only build-time dependency is a GCC and GNU assembler compatible compiler like gcc and gas. On Mac OS X with newer Apple Silicon processors the library can be built with the default clang compiler.

Compilation

  • Start by cloning the repository
$ git clone https://github.com/potuz/hashtree.git
$ cd hashtree
  • To build the library
$ make

this produces a statically linked library libhashtree.a in the src directory.

  • To build tests or benchmarks or all respectively:
$ make test
$ make bench
$ make all
  • To test or benchmark against OPENSSL:
$ make clean
$ HAVE_OPENSSL=1 make all
  • To cross compile for ARMv8
$ CC=aarch64-linux-gnu-gcc make
  • To cross compile for Windows (benchmarks will not work)
$ make clean
$ CC=x86_64-w64-mingw32-gcc make test

Running Tests and Benchmarks

$ make test
$ ./src/test 
Test hash_sse_1...                              [ OK ]
Test hash_sse_1_multiple_blocks...              [ OK ]
Test hash_avx_1...                              [ OK ]
Test hash_avx_1_multiple_blocks...              [ OK ]
Test hash_avx_4...                              [ OK ]
Test hash_avx_4_6blocks...                      [ OK ]
Test hash_avx_8...                              [ OK ]
Test hash_avx_8_13blocks...                     [ OK ]
Test hash_shani...                              [ CPU does not support SHA-ni ]
Test hash_shani_13blocks...                     [ CPU does not support SHA-ni ]
Test hash_avx_16...                             [ OK ]
Test hash_avx_16_30blocks...                    [ OK ]

This is running in a CPU that does not support SHA extensions, that is why two tests fail. Your system may output a different combination.

To run benchmarks:

$ HAVE_OPENSSL=1 make all
$ ./src/bench 
[==========] Running 9 benchmarks.
[ RUN      ] sse.sse_x1_one_at_time
[       OK ] sse.sse_x1_one_at_time (mean 32.300ms, confidence interval +- 1.725613%)
[ RUN      ] sse.sse_x1
[       OK ] sse.sse_x1 (mean 31.837ms, confidence interval +- 0.327542%)
[ RUN      ] avx.avx_x1_one_at_time
[       OK ] avx.avx_x1_one_at_time (mean 32.299ms, confidence interval +- 0.464452%)
[ RUN      ] avx.avx_x1
[       OK ] avx.avx_x1 (mean 31.855ms, confidence interval +- 0.150833%)
[ RUN      ] avx.avx_x4
[       OK ] avx.avx_x4 (mean 12.368ms, confidence interval +- 1.758262%)
[ RUN      ] avx.avx_x8
[       OK ] avx.avx_x8 (mean 6.519ms, confidence interval +- 2.142718%)
[ RUN      ] shani.shani
[       OK ] shani.shani (mean -20.-559us, confidence interval +- -89188758235664.375000%)
[ RUN      ] shani.shani_one_at_time
[       OK ] shani.shani_one_at_time (mean -3281090326183.-673us, confidence interval +- -9846.737643%)
[ RUN      ] openssl.openssl_one_at_time
[       OK ] openssl.openssl_one_at_time (mean 30.519ms, confidence interval +- 0.330545%)
[==========] 9 benchmarks ran.
[  PASSED  ] 9 benchmarks.

The results for the SHA-ni benchmarks are spurious since the CPU does not support them. Here we see that the benchmark against openssl native implementation for this CPU runs at the same speed as the single buffer SSE and AVX implementation, while the AVX2 implementation 8 blocks at a time runs 5 times faster.

A benchmark on a cascade-lake supporting AVX-512:

./src/bench
[==========] Running 9 benchmarks.
[ RUN      ] sse.sse_x1_one_at_time
[       OK ] sse.sse_x1_one_at_time (mean 29.182ms, confidence interval +- 0.149473%)
[ RUN      ] sse.sse_x1
[       OK ] sse.sse_x1 (mean 28.833ms, confidence interval +- 0.074605%)
[ RUN      ] avx.avx_x1_one_at_time
[       OK ] avx.avx_x1_one_at_time (mean 29.205ms, confidence interval +- 0.138581%)
[ RUN      ] avx.avx_x1
[       OK ] avx.avx_x1 (mean 28.871ms, confidence interval +- 0.200034%)
[ RUN      ] avx.avx_x4
[       OK ] avx.avx_x4 (mean 11.078ms, confidence interval +- 0.140484%)
[ RUN      ] avx.avx_x8
[       OK ] avx.avx_x8 (mean 5.650ms, confidence interval +- 0.118668%)
[ RUN      ] avx.avx_x16
[       OK ] avx.avx_x16 (mean 2.413ms, confidence interval +- 0.223049%)
[ RUN      ] shani.shani
[       OK ] shani.shani (mean -4.-941us, confidence interval +- -1102817647134393.500000%)
[ RUN      ] shani.shani_one_at_time
[       OK ] shani.shani_one_at_time (mean 0.-140us, confidence interval +- -70078699044457400.000000%)
[==========] 9 benchmarks ran.
[  PASSED  ] 9 benchmarks.

We see that AVX-512 (x16) runs 12 times faster than a single block implementation. This is slightly better than a native SHA extension CPU were gains were about x10.

A similar benchmark on a Raspberry Pi 4 model B:

$ ./src/bench 
[==========] Running 3 benchmarks.
[ RUN      ] armv8.neon_x1_one_at_time
[       OK ] armv8.neon_x1_one_at_time (mean 79.853ms, confidence interval +- 0.157599%)
[ RUN      ] armv8.neon_x1
[       OK ] armv8.neon_x1 (mean 79.035ms, confidence interval +- 0.070254%)
[ RUN      ] armv8.neon_x4
[       OK ] armv8.neon_x4 (mean 58.356ms, confidence interval +- 0.076089%)
[==========] 3 benchmarks ran.
[  PASSED  ] 3 benchmarks.

We see that a ASIMD version 4 blocks at a time, while not that much of an improvement as in the x86-64, is still 27% faster.

Using the library

The library exposes several architecture dependent SHA implementations. It is the caller responsibility to choose the right one. This can be done at runtime once at application launch. For x86_64 systems for example one can use cpuid.h, see here for an example on how to choose an implementation.

Most vectorized implementations exploit the fact that independent branches in the Merkle tree can be hashed in "parallel" within one CPU, to take advantage of this, Merkleization algorithms that loop over consecutive tree layers hashing two blocks at a time need to be updated to pass the entire layer, or all consecutive blocks. A naive example on how to accomplish this can be found in this document

Some examples benchmarks running the algorithms in this library vs prysm's current implementation are

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12                  160       7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12              15      74012328 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4               68      26677965 ns/op
BenchmarkHashBalancePrysm-4              7     165434686 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4             121       9711482 ns/op
BenchmarkHashBalancePrysm-4             10     103716714 ns/op
PASS

Nim bindings

The library offers low-level bindings for Nim that can be installed using:

nimble install https://github.com/prysmaticlabs/hashtree/

or used in a package with:

requires "https://github.com/prysmaticlabs/hashtree/"

Rust bindings

At the top directory you can run

$ make rust_bindings

To run tests:

$ make rust_tests

See the examples directory for examples on how to use the library