avaneev/komihash

Feature request: streaming API

data-man opened this issue · 12 comments

Thank you for the project!

It would be nice to have something like this (similar to xxhash):

komi_state_t* komi_createState(void);
komi_errorcode komi_freeState(komi_state_t* statePtr);
komi_errorcode komi_reset(komi_state_t* statePtr, uint64_t seed);
komi_errorcode komi_update(komi_state_t* statePtr, const void* input, size_t length);

Hello! What are the use-cases for the streamed version? At this point I'm not very keen on producing one. I also think that concatenating strings in a temporary buffer and then hashing that buffer may not actually be much slower, because a streamed implementation may need to accumulate part of the input anyway (the same memcpy). The input-stitching logic may consume a lot of cycles if "update" is called many times over a series of small inputs.
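To make the "accumulate and stitch" cost concrete, here is a minimal sketch of what a streamed update has to do. It is not komihash's implementation: toy_stream_t, toy_mix and the toy_stream_* functions are hypothetical names, and the mixing step is a placeholder. The point is only the buffering: leftover bytes from one update call must be copied aside and joined with the start of the next call before a full 8-byte word can be processed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical streaming state; not komihash's real layout. */
typedef struct {
    uint64_t h;            /* running hash state */
    unsigned char buf[8];  /* bytes waiting to form a full 8-byte word */
    size_t fill;           /* number of valid bytes in buf (always < 8) */
    uint64_t total;        /* total number of bytes seen so far */
} toy_stream_t;

/* Placeholder mixing step over one 8-byte word (native byte order). */
static void toy_mix(toy_stream_t *s, const unsigned char *word) {
    uint64_t w;
    memcpy(&w, word, sizeof w);                 /* alignment-safe load */
    s->h = (s->h ^ w) * 0x9E3779B97F4A7C15ULL;
    s->h ^= s->h >> 29;
}

static void toy_stream_init(toy_stream_t *s, uint64_t seed) {
    s->h = seed;
    s->fill = 0;
    s->total = 0;
}

static void toy_stream_update(toy_stream_t *s, const void *input, size_t len) {
    const unsigned char *p = input;
    if (len == 0) return;
    assert(input != NULL);
    s->total += len;
    if (s->fill > 0) {                          /* stitch onto leftover bytes */
        size_t need = sizeof s->buf - s->fill;
        size_t take = len < need ? len : need;
        memcpy(s->buf + s->fill, p, take);
        s->fill += take; p += take; len -= take;
        if (s->fill < sizeof s->buf) return;    /* still no full word */
        toy_mix(s, s->buf);
        s->fill = 0;
    }
    while (len >= 8) {                          /* full words straight from input */
        toy_mix(s, p);
        p += 8; len -= 8;
    }
    if (len > 0) {                              /* keep the tail for the next call */
        memcpy(s->buf, p, len);
        s->fill = len;
    }
}

static uint64_t toy_stream_final(const toy_stream_t *s) {
    toy_stream_t t = *s;                        /* work on a copy; s stays usable */
    unsigned char last[8] = {0};
    memcpy(last, t.buf, t.fill);                /* zero-padded final word */
    toy_mix(&t, last);
    t.h = (t.h ^ t.total) * 0x9E3779B97F4A7C15ULL;
    return t.h ^ (t.h >> 32);
}

/* One-shot helper: a single update over the whole message. */
static uint64_t toy_oneshot(const void *msg, size_t len, uint64_t seed) {
    toy_stream_t s;
    toy_stream_init(&s, seed);
    toy_stream_update(&s, msg, len);
    return toy_stream_final(&s);
}
```

With this shape, the result depends only on the byte stream, not on how it was split across update calls. The price is the branchy stitching path, which is taken on every call when inputs are small.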

What are the use-cases for the streamed version?

Read data from a file chunk by chunk and calculate a hash of every chunk.

Okay, I'll think about a better way to implement that. In the meantime, you may take a look at PRVHASH64S (https://github.com/avaneev/prvhash). It offers about 8.4 GB/s streamed hashing throughput and can produce a hash value of any required size. A streamed komihash may be twice as fast, but given storage system throughput, that's not as important.

Hi! I've updated project's page with info on "Sequential-Incremental Hashing". It can also be used to hash files, but it requires a fixed read block size (different read block sizes will produce different hash values).
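A rough sketch of hashing a file this way, assuming a fixed read block size and feeding each block's hash as the seed for the next block. toy_hash is a hypothetical FNV-1a-style stand-in with the same (message, length, seed) shape as komihash(), used only so the sketch compiles without komihash.h; hash_file_in_blocks is likewise an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in hash (assumption): seeded FNV-1a-style loop with the length folded in. */
static uint64_t toy_hash(const void *msg, size_t len, uint64_t seed) {
    const unsigned char *p = msg;
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h = (h ^ p[i]) * 1099511628211ULL;
    }
    return (h ^ (uint64_t)len) * 1099511628211ULL;
}

/* Hashes the rest of f in blocks of block_size bytes (the last block may be
 * shorter). Returns 0 on success, -1 on allocation or read failure. */
static int hash_file_in_blocks(FILE *f, size_t block_size, uint64_t *out) {
    assert(f != NULL && out != NULL && block_size > 0);
    unsigned char *buf = malloc(block_size);
    if (buf == NULL) return -1;

    uint64_t h = 0;                       /* initial seed */
    size_t n;
    while ((n = fread(buf, 1, block_size, f)) > 0) {
        h = toy_hash(buf, n, h);          /* previous hash seeds the next block */
    }

    int failed = ferror(f);
    free(buf);
    if (failed) return -1;
    *out = h;
    return 0;
}
```

Because each block's length is part of what gets hashed, the block size has to be agreed on up front: the same file read with 512-byte and 1024-byte blocks gives two different results.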

Hello!

I've updated project's page with info on "Sequential-Incremental Hashing".

I think it's incorrect:

#include "komihash.h"
#include <stdio.h>

int main()
{
    const char* str1 = "123456";
    const char* str2 = "123";
    const char* str3 = "456";

    uint64_t h1 = komihash(str1, sizeof(str1), 0 );
    uint64_t h2 = komihash(str2, sizeof(str2), 0 );
    uint64_t h3 = komihash(str3, sizeof(str3), h2 );


    printf("h1: %x\n", h1);
    printf("h2: %x\n", h2);
    printf("h3: %x\n", h3);
}

h1: c84c16d8
h2: fbe4c41c
h3: 71481aea

What do you mean "incorrect"?

By the way, you should write it this way:

    uint64_t h1 = komihash(str1, sizeof(str1), 0 );
    uint64_t h2 = komihash(str2, sizeof(str2), h1 );
    uint64_t h3 = komihash(str3, sizeof(str3), h2 );

You missed passing h1 as the seed when hashing str2.
Also, the hashes are 64-bit and you are printing them with %x. You should use %llx (or PRIx64 from <inttypes.h>).

I meant h3 should be equal to h1.

You've misunderstood the concept. It's not a streamed hashing. Sequential hashing assumes that the length of each item is also part of the message.
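A small sketch of that difference, with two fixes to the snippets above: sizeof applied to a const char* yields the pointer size, not the string length, so strlen is used instead, and the 64-bit values are printed with PRIx64. toy_hash is a hypothetical stand-in with the same (message, length, seed) shape as komihash(), so the printed values are not komihash values; swapping in komihash() should show the same behaviour.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in hash (assumption): seeded FNV-1a-style loop with the length folded in. */
static uint64_t toy_hash(const void *msg, size_t len, uint64_t seed) {
    const unsigned char *p = msg;
    uint64_t h = 14695981039346656037ULL ^ seed;
    assert(msg != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h = (h ^ p[i]) * 1099511628211ULL;
    }
    return (h ^ (uint64_t)len) * 1099511628211ULL;  /* item length is part of the message */
}

/* One item: the whole string hashed in a single call. */
static uint64_t hash_whole(const char *s) {
    return toy_hash(s, strlen(s), 0);
}

/* Two items: the first hash becomes the seed for the second. */
static uint64_t hash_chained(const char *first, const char *second) {
    uint64_t h = toy_hash(first, strlen(first), 0);
    return toy_hash(second, strlen(second), h);
}

static void print_comparison(void) {
    printf("whole:   %016" PRIx64 "\n", hash_whole("123456"));
    printf("chained: %016" PRIx64 "\n", hash_chained("123", "456"));
}
```

Calling print_comparison() prints two different values: "123" followed by "456" is treated as two items of length 3, not as one item of length 6.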

It's not a streamed hashing.

So I wonder why you closed this issue.

You can hash all your files in blocks of, e.g., 1024 bytes (the last block may be shorter). For this given block size, it will work the same as streamed hashing. Buffered streamed hashing is needed for "standardized" hashes; with komihash I do not see such a need. Streamed hashing is also slower.

Also, from a database point of view, when several independent values are concatenated, this incremental approach is preferable to streamed hashing, since each item's length is encoded implicitly.
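To illustrate the database point, the sketch below hashes two rows whose fields concatenate to the same bytes: ("ab", "c") and ("a", "bc"). Hashing the concatenation cannot tell them apart, while chaining the hash through each field can. toy_hash is again a hypothetical stand-in with komihash()'s (message, length, seed) shape, and the helper names and the 256-byte buffer are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in hash (assumption): seeded FNV-1a-style loop with the length folded in. */
static uint64_t toy_hash(const void *msg, size_t len, uint64_t seed) {
    const unsigned char *p = msg;
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h = (h ^ p[i]) * 1099511628211ULL;
    }
    return (h ^ (uint64_t)len) * 1099511628211ULL;
}

/* Sequential-incremental: each field is one item, seeded by the previous hash. */
static uint64_t hash_fields_sequential(const char *const *fields, size_t count) {
    uint64_t h = 0;
    for (size_t i = 0; i < count; i++) {
        h = toy_hash(fields[i], strlen(fields[i]), h);
    }
    return h;
}

/* Concatenation: what a streamed hash sees; field boundaries are lost. */
static uint64_t hash_fields_concatenated(const char *const *fields, size_t count) {
    char buf[256];
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(fields[i]);
        assert(used + n <= sizeof buf);   /* sketch only: fixed-size scratch buffer */
        memcpy(buf + used, fields[i], n);
        used += n;
    }
    return toy_hash(buf, used, 0);
}
```

For the two rows above, hash_fields_concatenated returns the same value for both, whereas hash_fields_sequential distinguishes them without needing an explicit separator or length prefix.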

I've implemented streamed hashing after all; please check it out. It turned out to be a bit faster than the base komihash() function.

Awesome, thank you!