Daniel-Liu-c0deb0t/triple_accel

Implement Jaccard index?

luizirber opened this issue · 5 comments

Hi @Daniel-Liu-c0deb0t!

I didn't dig down too much into the codebase, but I wonder if there is interest in adding a Jaccard index function (as described in Faster Population Counts Using AVX2 Instructions.
The implementation from the paper is at https://github.com/CountOnes/hamming_weight/blob/1dd7554c0fc39e01c9d7fa54372fd4eccf458875/src/avx_jaccard_index.c
and while it is a bit heavy on the macros I think it could be translated using the infra you already have in triple_accel?

(This would be incredibly useful for me, so I can start looking deeper into it if you're also interested =])

Hey @luizirber, I think it would be a bit difficult to integrate Jaccard distance into the current codebase, since it is mainly adapted for the Levenshtein distance calculation. It would likely be easier to make a separate crate for Jaccard index. I considered this feature before, but I thought it would be better to only focus on edit distance. Additionally, I'm current too busy with other projects to implement it. Sorry. Eventually, I would likely implement SIMD popcount in Rust in a separate project, since I don't believe anyone has done it yet and its very useful. If you have time to look into it, then I think it shouldn't be too hard to get a Rust port, as the code is not too long. The dense macros are just to make it work on multiple platforms.

No problem =]

Hey @luizirber, I am interested in implementing an SSE/AVX popcount routine in another project: https://github.com/natir/nuc2bit. Perhaps you can use the code (when I am done) to implement a Jaccard distance function?

Sounds great, and happy to see you working with Pierre =]

Hey, I wrote up a quick popcount implementation here using the shuffle-based algorithm. For Jaccard distance, you are going to have to modify the code to AND and OR the vectors before applying internal_popcount, in the innermost loop.

I haven't thoroughly benchmarked my implementation since I just finished it, but it should be pretty fast.