addaleax/levenshtein-sse

Potential performance improvement

NickStrupat opened this issue · 1 comments

Hi! Thanks for the awesome work here. I've been tinkering with it for the past couple of evenings.

I've been testing for potential performance improvements by switching from std::vector & your allocator to using blocks of memory allocated on the stack (via alloca). I'm seeing a 25%-ish improvement for short strings, but about 25%-ish degradation for longer strings. I'm wondering if you have any ideas why that might be?

Also, I haven't yet done a proper multi-thread test. I suspect the contention on _mm_malloc will yield a higher performance improvement when using alloca, since the threads don't need to sync on heap allcoation. If I get some time next week I will try to do that and I'd love to hear what you think of this approach.

The changes I made to achieve the stack allocation are here...
NickStrupat@58ee960

I like the idea, but switching to alloca unconditionally might not be ideal, because it’s completely possible to bring things to a stack overflow using a single call that way…

What we do in Node.js core a lot is using a fixed-size stack-allocated buffer for small strings, and deferring to a larger, heap-allocated buffer for longer ones.

Also, this library is generally better at handling long strings – for short ones, vectorization doesn’t really pay out, and there’s the additional overhead of JS → C++ calls.