eclipse/eclipse-collections

`Bag.top(bottom)Occurrences`

dxxvi opened this issue · 4 comments

dxxvi commented

The API documentation says that these methods return ListIterable<ObjectInPair<T>> but it didn't say if the element at the beginning of the returned list would have the most or the least occurrence. I think it's a bit unclear there. The unit test MutableBagTestCase.topOccurences says that the 1st element in the returned list has the most occurrences. Is there a unit test for this case (mentioned in the API doc) "In the event of a tie, all the items with the number of occurrences that match the occurrences of the last item will be returned"?

The implementation for those top(Bottom)Occurrences, if I'm not wrong, is in AbstractBag.occurrencesSortingBy. In this method, all elements in the bag are put into a list with occurrences. Then this list is sorted by some function. Then to get the top/bottom, e.g. 5, 5 first element in that list are returned. IMO, in some cases, this is inefficient: to get the top/bottom 5 occurrences, we have to sort the whole list.

motlin commented

the topOccurrences() method has an inefficient implementation that ought to be optimized. The current implementation sorts, as you said, making it O(n log n). The selection algorithm is O(n) and sorting would take O(k log k) in the absence of ties.

motlin commented

I took a look at Python's most_common and it takes the front element from a heap k times, making it O(n * k). This is a nice compromise in that it's faster than our implementation, but still pretty easy to implement.

I took a look at Python's most_common and it takes the front element from a heap k times, making it O(n * k). This is a nice compromise in that it's faster than our implementation, but still pretty easy to implement.

That's only decent if k < log(n). For k near n, it's impossible to do better than sorting (simply because the result has to be sorted) and we don't want O(n^2). So the algorithm needs to be smart based on k vs n.

I wonder how bad it would be to stuff the first k elements in a tree, then remove the worst one as new ones are added (the tree would have to be a list at each node to contain the equals).

motlin commented

Sorry that was a typo, I forgot the logarithm.

One implementation is to keep a min heap of fixed size k. Insert every (item, count) pair into it and remove the minimum whenever the heap size exceeds k. Inserting and removing in this heap is O(log k) since its fixed to size k. Handling ties is tricky in this implementation but can be done.

Another implementation is to heapify all items into a max heap, and then pop the max until we're done. This approach makes it easier to handle ties, since they come at the end. Heapify is O(n). Popping the max from a max heap is O(log n). So overall, this approach is O(k log n).