Feature request: streaming API

Question

data-man opened this issue 2 years ago · 12 comments

Thank you for the project!

It would be nice to have something like this (similar to xxHash's streaming API):

komi_state_t* komi_createState(void);
komi_errorcode komi_freeState(komi_state_t* statePtr);
komi_errorcode komi_reset(komi_state_t* statePtr, uint64_t seed);
komi_errorcode komi_update(komi_state_t* statePtr, const void* input, size_t length);

Answer 1 · 2022-10-06T04:16:23.000Z

Hello! What are the use cases for a streamed version? At this point I'm not very keen on producing one. I also think that concatenating the strings in a temporary buffer and then hashing that buffer may not actually be much slower, because a streamed implementation may need to accumulate part of the input anyway (the same memcpy). The input-stitching logic may also consume a lot of cycles if "update" is called repeatedly on a series of small inputs.
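
To make the accumulation concern concrete, here is a minimal sketch of a buffered streaming update. All names (`stream_state`, `stream_init`, `stream_update`, `stream_final`, `mix_block`) and the mixing step are hypothetical stand-ins, not komihash's algorithm or API. The sketch only shows the partial-block memcpy and stitching logic that each "update" call has to run so that the result does not depend on how the input was split.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 8 /* hypothetical block size in bytes */

typedef struct {
    uint64_t h;               /* running hash state */
    unsigned char buf[BLOCK]; /* bytes that do not yet form a whole block */
    size_t fill;              /* number of valid bytes in buf */
    uint64_t total;           /* total number of bytes seen so far */
} stream_state;

/* Stand-in mixing step for one whole block. NOT komihash's round function.
   The word load is endianness-dependent, which is fine for a sketch. */
static uint64_t mix_block(uint64_t h, const unsigned char *p) {
    uint64_t w;
    memcpy(&w, p, BLOCK);
    h = (h ^ w) * 0x9E3779B97F4A7C15ULL;
    return h ^ (h >> 32);
}

void stream_init(stream_state *st, uint64_t seed) {
    assert(st != NULL);
    st->h = seed;
    st->fill = 0;
    st->total = 0;
}

void stream_update(stream_state *st, const void *input, size_t len) {
    const unsigned char *p = input;
    if (len == 0) return;
    st->total += len;
    if (st->fill > 0) {
        /* Stitch the new input onto the leftover bytes of the previous call. */
        size_t take = BLOCK - st->fill;
        if (take > len) take = len;
        memcpy(st->buf + st->fill, p, take);
        st->fill += take;
        p += take;
        len -= take;
        if (st->fill < BLOCK) return; /* still no whole block */
        st->h = mix_block(st->h, st->buf);
        st->fill = 0;
    }
    while (len >= BLOCK) { /* whole blocks are consumed without copying */
        st->h = mix_block(st->h, p);
        p += BLOCK;
        len -= BLOCK;
    }
    if (len > 0) { /* keep the tail for the next call */
        memcpy(st->buf, p, len);
        st->fill = len;
    }
}

uint64_t stream_final(const stream_state *st) {
    unsigned char tail[BLOCK] = {0}; /* zero-padded last partial block */
    memcpy(tail, st->buf, st->fill);
    uint64_t h = mix_block(st->h, tail);
    h = (h ^ st->total) * 0x9E3779B97F4A7C15ULL; /* total length is mixed in */
    return h ^ (h >> 29);
}
```

With this layout, feeding a message in one call or in many small calls yields the same value, at the cost of the extra branch and memcpy on every small update.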

Answer 2 · 2022-10-06T04:41:20.000Z

> What are the use-cases for the streamed version?

Read data from a file chunk by chunk and calculate a hash of every chunk.

Answer 3 · 2022-10-06T05:13:45.000Z

Okay, I'll think about a better way to implement that. In the meantime, you may take a look at PRVHASH64S (https://github.com/avaneev/prvhash). It offers about 8.4 GB/s of streamed hashing throughput and can produce a hash value of any required size. A streamed komihash might be twice as fast, but given storage system throughput, that difference is not very important.

Answer 4 · 2022-11-02T06:51:40.000Z

Hi! I've updated the project's page with info on "Sequential-Incremental Hashing". It can also be used to hash files, but it requires a fixed read block size (different read block sizes will produce different hash values).

Answer 5 · 2022-11-21T22:05:31.000Z

Hello!

> I've updated project's page with info on "Sequential-Incremental Hashing".

I think it's incorrect:

#include "komihash.h"
#include <stdio.h>

int main()
{
    const char* str1 = "123456";
    const char* str2 = "123";
    const char* str3 = "456";

    uint64_t h1 = komihash(str1, sizeof(str1), 0 );
    uint64_t h2 = komihash(str2, sizeof(str2), 0 );
    uint64_t h3 = komihash(str3, sizeof(str3), h2 );


    printf("h1: %x\n", h1);
    printf("h2: %x\n", h2);
    printf("h3: %x\n", h3);
}

h1: c84c16d8
h2: fbe4c41c
h3: 71481aea

Answer 6 · 2022-11-22T02:21:26.000Z

What do you mean by "incorrect"?

By the way, you should write it this way:

    uint64_t h1 = komihash(str1, sizeof(str1), 0 );
    uint64_t h2 = komihash(str2, sizeof(str2), h1 );
    uint64_t h3 = komihash(str3, sizeof(str3), h2 );

You missed passing h1 as the seed when hashing str2.
Also, the hashes are 64-bit, but you are printing them with %x. You should use %llx.
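
A minimal sketch of the printing fix, using the portable `PRIx64` macro from `<inttypes.h>` instead of `%llx`. The helper names `format_hash` and `message_length` are hypothetical. One further observation, not raised in the thread: in the snippets above, `sizeof(str1)` is applied to a `const char*`, so it yields the size of the pointer rather than the length of the string. `strlen` gives the byte count that was presumably intended.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Format a 64-bit hash as 16 lowercase hex digits.
   out must have room for at least 17 bytes (16 digits + terminator). */
int format_hash(uint64_t h, char *out, size_t out_size) {
    assert(out != NULL);
    return snprintf(out, out_size, "%016" PRIx64, h);
}

/* Byte count to pass as the message length for a C string.
   sizeof(s) here would be the pointer size, not the string length. */
size_t message_length(const char *s) {
    assert(s != NULL);
    return strlen(s);
}
```

A call such as `printf("h1: %016" PRIx64 "\n", h1);` prints all 64 bits on any platform where `uint64_t` exists, whereas `%x` expects an `unsigned int`.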

Answer 7 · 2022-11-22T02:27:51.000Z

I meant h3 should be equal to h1.

Answer 8 · 2022-11-22T03:07:07.000Z

You've misunderstood the concept. It's not streamed hashing. Sequential hashing assumes that the length of each item is also part of the message.

Answer 9 · 2022-11-22T03:16:58.000Z

> It's not a streamed hashing.

So I wonder why you closed this issue.

Answer 10 · 2022-11-22T03:26:51.000Z

You can hash all your files in blocks of, e.g., 1024 bytes (except the last block, which may be shorter). For that given block size, the result will be the same as with streamed hashing. Buffered streamed hashing is needed for "standardized" hashes. With komihash I do not see such a need. Streamed hashing is also slower.

Also, from a database point of view, when several independent values are concatenated, this incremental approach is preferable to streamed hashing, since each item's length is encoded implicitly.
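
The block-wise approach described here can be sketched as follows. `toy_hash` is a hypothetical stand-in for a seeded one-shot hash with the same argument order as the `komihash(msg, len, seed)` calls earlier in the thread. It is not komihash itself, and `hash_in_blocks` is likewise an illustrative name. Each block's hash is passed as the seed of the next block, so the block boundaries become part of the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in seeded hash (FNV-1a style loop plus a finalizer that mixes
   in the length). Assumption for illustration only, NOT komihash. */
uint64_t toy_hash(const void *data, size_t len, uint64_t seed) {
    const unsigned char *p = data;
    uint64_t h = seed ^ 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    h ^= (uint64_t)len; /* the item length is part of the result */
    h ^= h >> 32;
    h *= 0x9E3779B97F4A7C15ULL;
    return h ^ (h >> 29);
}

/* Hash data in fixed-size blocks, chaining each block's hash into the
   next block's seed. The last block may be shorter than `block`.
   For empty input, the seed is returned unchanged. */
uint64_t hash_in_blocks(const void *data, size_t len, size_t block,
                        uint64_t seed) {
    assert(block > 0);
    const unsigned char *p = data;
    uint64_t h = seed;
    while (len > 0) {
        size_t n = len < block ? len : block;
        h = toy_hash(p, n, h);
        p += n;
        len -= n;
    }
    return h;
}
```

Hashing "123456" with a block size of 3 is the same as hashing "123" and then "456" seeded with the first result. It is not expected to match a one-shot hash of "123456", which is the difference from streamed hashing discussed above.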

Answer 11 · 2022-12-06T17:49:24.000Z

I've implemented streamed hashing after all; please check it out. It turned out to be a bit faster than the base komihash() function.

Answer 12 · 2022-12-06T18:51:36.000Z

Awesome, thank you!