ilanschnell/bitarray

copy_n performance

dgiger42 opened this issue · 10 comments

In copy_n(), it appears that each bit is copied individually whenever a and b are not both divisible by 8. I suspect that this scenario could be sped up significantly by using bit shifting to copy each word in two steps, rather than the 64 (or 32) steps that it currently takes.

Assuming this optimization provides a significant speedup, would it be worthwhile? It would make the function slightly more complex, but I think the speedup would be fairly large.

Yes, I've thought about this too. However, the added complexity has discouraged me from doing anything in this respect. I think the speedup would be significant to make it worthwhile, in particular since copy_n() is on such a low level that delete_n(), insert_n(), ... would benefit from the speedup. I just moved the definition of UINT64_BUFFER and UINT64_WORDS such that they can be used copy_n(): 6d76688

I just realized that there is even more low hanging fruit. Currently copy_n() only uses memmove() when both bitarrays start positions are modulo 8. This means that the operation del a[1:9] does not take advantage of memmove(), even though it could by copying seven bits (using setbit() and getbit()) and then using memmove().

I've made a little test in #138, so we could expect a speedup of about 25x. However, I also expect the complexity of copy_n() to increase quite a bit.

So, #138 has big and little endianness working as well as shifting by 1 to the left or right. Right now, the length of the bitarray has to be a multiple of 64. With a little extra work, it should be possible to lift this restriction. Then, for large bitarrays, the operations:

a.pop(0)
a.insert(0, value)

(as well as basically equivalent operations like del a[0], ...) should take be a lot faster.
There are certainly many more cases in which shift operations could provide a significant speedup, but I feel like there are too many special cases to consider. Also, I saw that the current testsuite does not cover what is currently in #138 (because the cases are too special, in particular the multiple of 64 restriction).

Having a fast insert and pop at the front of the array makes bitarray behave much better in cases where one would usually (an probably should) use the deque object.

What do you think?

Optimizing this is definitely more complex than I realized. One idea which (I think) could simplify it would be to copy the bits to a new array before writing them to the actual destination. That might eliminate the need to have separate cases for insertion and deletion, but it would require additional memory.

Additionally, I think optimizing copy_n() could speed up concatenation.

Yes, concatenation as well (when bytes aren't aligned).

I've been working on #139, which is in pretty good shape now. It is very fast (for all shifts), and helped me understand all the different cases much better (left vs. right shift, big vs. little endian, using chars and uint64).
I've done this (which took me basically all day) as an exercise to tackle copy_n() next.

@danielg1111 I just created #140, this adds about 120 lines of code and is fairly involved (but the speed improvements are quite significant also). This leaves copy_n() unchanged, and more optimizations in other places are still possible. For example, it doesn't speedup .extend() at all. Also in:

a[1001:2001] = bitarray(3000)

The expansion of the array (by 2000 elements) uses insert_n() and is now fast, while the (unaligned) copying of the 3000 new elements into the new space is still slow.

I'm not sure where to go with #140, I'm still thinking out a more general way to make copy_n() faster.

I created and merged #142 in favor of #140. I decided to leave copy_n() like it is, except that I added an assert(n <= 8); for the slow copying part, to ensure it isn't used anywhere. Other functions then align the arrays (using shift_r8()) before calling copy_n(). This was a lot of work, but I think the performance benefit makes it worth it.

@danielg1111 I just released bitarray 2.2.4 which includes the speedup. I was able to write a function copy2() as a drop in replacement for copy_n() which handles the shifting for all cases. Having this, made things a simpler, see b48f44b. I love to remove 100 lines of code in one commit!

Excellent! I'm glad you found a simpler way.