ilanschnell/bitarray

Feature: Accept list for __setitem__

Closed this issue · 10 comments

Set multiple indices provided as a list, e.g.

bitarray[[1,2,5]] = True

Today only bitarray[1:5] = True works.

Thanks for your suggestion. I guess this means the __setitem__ method should be able to take any iterable whose elements can be used as indicies. I don't think such a feature would find wide enough use for introducing the additional complexity. Just use:

for i in multiple_indices:
    a[i] = True

I'd like to see it compatible to numpy's advanced indexing. Not sure if they support iterable in general or only tuple and list.

Ah, thanks. I'm not sure how useful advanced indexing features would be for a one dimensional array. As I wish to keep bitarray low-level and simple, maybe it would make sense to create a (potentially separate) library which allows creating N-dimensional bitarrays (with all the numpy features, and relying on the existing one dimensional bitarray).

I agree, n dimensional would be an additional beast, but just setting a list of indices for one-dimensional would help a lot. Because looping in python is slow.

I don't want to add more functionality to the core bitarray libarray. Also Python's bytearray object does not allow list (or iterables) as an argument on the __setitem__ method. Therefore, I'm thinking about adding a utility function bitarray.util.setitems() which takes an iterable. The signature and docstring would look like this:

setitems(bitarray, iterable, value)

Iterate over indices and sets corresponding elements of bitarray to value.

So your initial example bitarray[[1,2,5]] = True could be written as:

from bitarray.util import setitems

setitems(a, [1, 2, 5], True)

What do you think?

I personally love the elegance of numpy's syntax - but the decision is up to you. I'm fine with your proposal.

What I wonder more is if some efficiency can be gained above a simple loop in C. My indices will be in random order (hashes for bloom filter) and this might result in many pages swaps.

I would expect the number of indices in the Bloom filter to be rather small, not more than maybe 10. One could sort those indices first, but I'm not sure if it's worth it.

For a Bloom filter, a function which checks if all of the hash indices are True should also be fast. In https://github.com/ilanschnell/bitarray/blob/master/examples/bloom.py this done by:

all(self.array[i] for i in self._hashes(key))

Having a getitems() function (which returns a bitarray of values corresponding to indices) would not be efficient, because as soon as an element is False, one cat terminate the loop. So having a C function which does the equivalent would probably help:

allitems(bitarray, iterable_of_indices)

Iterate over indices and return False as soon as corresponding element in bitarray is 0.

The number of hash values per item is low, but if you add thousands of items in a batch to a Bloom filter, you will have many assignents and then swapping of memory pages might become a limiting factor. But no need to provide this in setitems, as the caller can also take care for this.
Not directly related, but why processing a sorted array is faster might be an interesting read.

For getitems my use case is to identify which items from a batch of items (each with multiple hashes per item) have (probably) already been added. So getting the bitarray

[all(self.array[i] for i in self._hashes(item)) for item in batch]

which returns a mask for each item in the batch.

Related to getitems: As the number of hashes per item is low (at least for my Bloom filter use case), I fear more performance is lost by checking for early breaking instead of running branchless till the end.

allitems: A better signature for my use case would be allitems(bitarray, iterable_of_iterable_of_indices.
And even better for my use case: a function that returns a mask, which items have been added:

result = bitarray(len(batch))
for i in range(len(batch)):
    for j in self._hashes(batch[i]):
        result[i] |= ~self.array[j]
        self.array[j] = True
return result

here is what I do for 2D bitarray (.mx is 1D) using Cython and it is pretty fast... you can redo it for 1D will be much simpler .. cause you use only one slice.


	def __getitem__(self, slices not None):
		cdef IXTYPE ix, nix
		cdef CTYPE r,c, rng
		cdef int i = 0, j = 0

		assert slices[0] is not None
		assert slices[1] is not None

		# exact [r,c]
		# if type0 is int and type1 is int :
		if isinstance(slices[0], (int,np.integer)) and isinstance(slices[1], (int,np.integer)) :
			ix = slices[0] * self.ncols + slices[1]
			return getbit(self.mx,ix)
			# return self.mx[ix]

		if   isinstance(slices[0], slice) : rixs = range(*slices[0].indices(self.nrows)) 
		elif isinstance(slices[0], (list,np.ndarray)) : rixs = slices[0]
		elif isinstance(slices[0], (int,np.integer)) : rixs = [slices[0]]

		if   isinstance(slices[1], slice) : cixs = range(*slices[1].indices(self.ncols))
		elif isinstance(slices[1], (list,np.ndarray)) : cixs = slices[1]
		elif isinstance(slices[1], (int,np.integer)) : cixs = [slices[1]]

		#cant use clen for range()
		new = BinaryMatrix((len(rixs), len(cixs)))

		# print(rixs, cixs)
		for i,r in enumerate(rixs) :
			for j,c in enumerate(cixs) :
				ix = r * self.ncols + c
				nix = i * new.ncols + j
				# print(f'{ix},{nix},i:{i},j:{j},{r},{c}')
				# new.mx[nix] = self.mx[ix]
				setbit( new.mx, nix, getbit(self.mx, ix) )

		return new		


	def __setitem__(self, slices not None, value not None):
		cdef IXTYPE ix
		cdef CTYPE r,c
		cdef int i = 0, j = 0

		assert slices[0] is not None
		assert slices[1] is not None

		if isinstance(slices[0], (int,np.integer)) and isinstance(slices[1], (int,np.integer)) :
			ix = slices[0] * self.ncols + slices[1]
			setbit(self.mx,ix, value)
			# self.mx[ix] = 1
			return

		if   isinstance(slices[0], slice) : rixs = range(*slices[0].indices(self.nrows)) 
		elif isinstance(slices[0], (list,np.ndarray)) : rixs = slices[0]
		elif isinstance(slices[0], (int,np.integer)) : rixs = [slices[0]]

		if   isinstance(slices[1], slice) : cixs = range(*slices[1].indices(self.ncols))
		elif isinstance(slices[1], (list,np.ndarray)) : cixs = slices[1]
		elif isinstance(slices[1], (int,np.integer)) : cixs = [slices[1]]

		vtype = type(value)

		if vtype is int :
			for r in rixs :
				for c in cixs :
					ix = r * self.ncols + c
					setbit( self.mx, ix, value )

		elif vtype is list or vtype is np.ndarray :
			ncols = len(value[0])
			for i,r in enumerate(rixs) :
				for j,c in enumerate(cixs) :
					ix = r * self.ncols + c
					# print(f'{ix},i:{i},j:{j},{r},{c}')
					setbit( self.mx, ix, value[i][j] )

#204 has been merged, such that the original feature request

bitarray[[1,2,5]] = True

is working now.