sangupta/ps

Issue in Solution for fastest-sorting-integers.md


The problem stated :

Sort a given huge array of bounded, non-duplicate, positive integers. Extend the problem to include negative integers. What are the different ways to reduce the memory complexity of the problem?

is not totally clear. What does it mean that the array of integers is bounded? Any finite array of integers will be bounded (by Max(array)).

There is a flaw in your proposed solution:

  1. Construct a boolean array of length N
  2. For every integer n in array, mark the boolean at index n in array as true
  3. Once the array iteration is complete, just iterate over the boolean array again to print all the indices that have the value set as true
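
In code, those three steps amount to something like this (a minimal sketch; the `flagSort` name and the bound parameter are illustrative, assuming non-duplicate, non-negative values no larger than a known `max`):

```java
// Minimal sketch of the three steps above, assuming non-duplicate,
// non-negative integers no larger than a known bound `max`.
static int[] flagSort(int[] input, int max) {
    boolean[] seen = new boolean[max + 1];   // step 1: flag array covering 0..max
    for (int n : input) {
        seen[n] = true;                      // step 2: mark each value's index
    }
    int[] sorted = new int[input.length];
    int i = 0;
    for (int v = 0; v <= max; v++) {         // step 3: scan all max+1 flags in order
        if (seen[v]) {
            sorted[i++] = v;
        }
    }
    return sorted;
}
```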

Steps 1 and 2 are both O(n); however, step 3 is going to take O(max), where max is the biggest integer in the given array.

If we have, for example, 100 integers between 1 and 1000, your algorithm would take 1000 steps for step 3, whereas a traditional mergesort would take ~100*ln(100) ≈ 461 comparisons.

Agreed that any given array would be bounded. The reason I specifically mentioned it was that the lower and upper bounds are known, so that the length and memory for a boolean array can be initialized.

Agreed that in step 3 we are doing 1000 lookups of O(1) each, increasing the time complexity of the problem. Also, the boolean array was meant only to illustrate a flag-based approach. The ideal solution, as noted in the article, is to use a sparse bit-array. So when the number of integers tends to be very high, and the bounds are known, a sparse bit-array would perform better.

I am updating the article with notes to make it more clear and explicit.

Thanks a bunch for the notes.

Just updated the article with notes from here.

Am closing this one for now - please reopen to discuss more. How I wish GitHub allowed embedding DISQUS comments at the bottom of markdown files.

Yes, totally agree on DISQUS comments in markdown files!

Regarding the complexity, it is true that the best algorithm is with sparse arrays. I think this is really good news, since we could use the same algorithm with decimals (which is usually the case in computers), and also with strings, since strings can be converted to numbers that keep their order (if you have a maximum allowed length for those strings). I'm not sure how practical that would be for strings, however, since the numbers would become quite big, so the "bounded" argument could not be applied.
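
To illustrate, a minimal sketch of such an order-preserving string-to-number mapping (the function name is mine; it pads to a maximum length and uses BigInteger precisely because the values blow up quickly, which is the practicality concern above):

```java
import java.math.BigInteger;

// Sketch: map a string of at most maxLen chars to an integer that preserves
// lexicographic order. Each char becomes a base-65537 digit, with 0 reserved
// as the implicit padding "digit" for shorter strings. The values grow as
// 65537^maxLen, which is why the "bounded" trick stops being practical.
static BigInteger toOrderedNumber(String s, int maxLen) {
    BigInteger base = BigInteger.valueOf(65537);
    BigInteger result = BigInteger.ZERO;
    for (int i = 0; i < maxLen; i++) {
        int digit = (i < s.length()) ? s.charAt(i) + 1 : 0;  // +1 keeps 0 free for padding
        result = result.multiply(base).add(BigInteger.valueOf(digit));
    }
    return result;
}
```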

However, it is wrong that it will be O(n); I think it will rather be O(n sqrt(log log n)) if it is the same implementation as the paper you are referring to.

The fact that we can't achieve O(n) is, I believe, because of insertion into a sparse array, which cannot be done in O(n) for n insertions, since inserting one element into a sparse array is O(log log N), where N is the biggest element in our array, and prev() and next() have the same complexity.

When we start using strings - and they go beyond a certain length/bound - we need to hash them to distribute them evenly across the integer space. That introduces the concept of error probability, which led to the creation of Bloom filters. I have talked about the same in another of my notes here.
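
For context, a minimal, illustrative Bloom filter sketch (this is not the implementation from that note; the class name and the hash mixing here are only for demonstration):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: hashing values into a fixed bit space
// introduces a tunable false-positive rate, but never false negatives.
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(indexFor(value, i));
        }
    }

    // true means "possibly present", false means "definitely absent"
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(value, i))) {
                return false;
            }
        }
        return true;
    }

    private int indexFor(String value, int seed) {
        int h = value.hashCode() * 31 + seed * 0x9E3779B1;  // simple demo mixing only
        return Math.floorMod(h, size);
    }
}
```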

Regarding the complexity and algorithm for a sparse array - let me give it some thought tomorrow to see if we can work up an O(1) solution for insertion. I was thinking of using a list of arrays, stored in a hash table based on bucket size.

Am opening the issue so that I remember to validate this.

Thanks a bunch for this great discussion.

Yes, interesting discussion !

I think you should introduce a variable K, which is the bound of the integers. Note that K >= N, since we do not allow duplicates.

https://en.wikipedia.org/wiki/Integer_sorting#Algorithms_for_small_keys says the best known algorithm is O(n√log log K), which will be bigger than O(n√log log n).

Thanks to you, I also learned about Bloom filters!

A friendly update

I worked up a SparseBitArray implementation in Java that uses buckets of int[] arrays inside an outer array to conserve memory.

Refer gist at: https://gist.github.com/sangupta/2601a9f81307ec4541ec31791e507cc2

As the buckets themselves are in an array, finding a bucket and referencing it is O(1). Finding the element within a bucket's sub-array is again O(1).
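
For readers of this thread, a rough sketch of that bucketed layout (illustrative only, not the exact gist code; the names and the assumption that the band size is a multiple of 32 are mine):

```java
// Rough sketch of the bucketed idea: an outer array of lazily allocated int[]
// buckets, where each bucket covers a fixed band of `bucketBits` indices.
// `bucketBits` is assumed to be a multiple of 32.
class SparseBitArraySketch {

    private final int bucketBits;    // number of bit indices covered by one bucket
    private final int[][] buckets;   // a bucket stays null until something in its band is set

    SparseBitArraySketch(int maxBits, int bucketBits) {
        this.bucketBits = bucketBits;
        this.buckets = new int[(maxBits + bucketBits - 1) / bucketBits][];
    }

    void set(int index) {
        int bucket = index / bucketBits;
        int offset = index % bucketBits;
        if (buckets[bucket] == null) {
            // allocation cost is proportional to the band size, paid at most once per bucket
            buckets[bucket] = new int[bucketBits / 32];
        }
        buckets[bucket][offset / 32] |= 1 << (offset % 32);
    }

    boolean isSet(int index) {
        int[] bucket = buckets[index / bucketBits];
        int offset = index % bucketBits;
        return bucket != null && (bucket[offset / 32] & (1 << (offset % 32))) != 0;
    }
}
```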

Am still to optimize the above code and then compare it with other implementations, namely brettwooldridge/SparseBitSet [1] and linked-list implementations based on the Wikipedia article [2].

1: https://github.com/brettwooldridge/SparseBitSet
2: https://en.wikipedia.org/wiki/Sparse_array

I have been checking my implementation above and to me it looks like O(1) for all operations - considering that I created one for 100 million elements and then tested the number of operations over 100 random cycles, setting a million elements each.

I think we can surely make a SparseBitArray O(1), at the cost of memory.

@edi9999 Friendly ping.

After various tests at my end, with random data, I am of the opinion that a SparseBitArray can be O(1) at the cost of memory.

Thoughts please.

Indeed, your solution seems to be O(1) for insertions and also for retrieving the values.

However, the method getNextSetBit(int fromIndex) is also needed, and that method needs to be O(1) too, because it will be called n times, I think.

Good job anyway.

@edi9999 - interesting one. We need the getNextSetBit() method only to display the values. Off the top of my head, I can reduce the lookups to O(m) worst case, where m is the band size. Let me think over how we can achieve this one.
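
A rough sketch of what that O(m) worst-case scan could look like over the bucketed layout sketched earlier in this thread (again illustrative, not the gist code):

```java
// Sketch of getNextSetBit() over the bucketed layout above: null (empty) buckets
// are skipped in a single step each, while a non-empty bucket is scanned bit by
// bit, which bounds the worst case per call at O(m), m being the band size.
int getNextSetBit(int fromIndex) {
    int i = fromIndex;
    while (true) {
        int bucket = i / bucketBits;
        if (bucket >= buckets.length) {
            return -1;                        // no further set bits
        }
        if (buckets[bucket] == null) {
            i = (bucket + 1) * bucketBits;    // jump over a whole empty band at once
            continue;
        }
        int end = (bucket + 1) * bucketBits;
        while (i < end) {
            if (isSet(i)) {
                return i;
            }
            i++;
        }
    }
}
```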

getNextSetBit() is the function that would be called to iterate over all the set values.

From the third step:

Once the array iteration is complete, just iterate over the boolean array again to print all the indices that have the value set as true

Or how would the iteration at the end happen in O(n)?

@edi9999 Sorry for the extreme delay here. I got the point, and have added the explicit constraint of O(X), where X is the difference between the min and max values in the array. Thus, the complexity tends to reach O(N) when X approaches N.
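
A minimal sketch of that range-bounded variant (illustrative; offsetting by min also covers the negative-integer extension from the original problem statement):

```java
// Sketch of the range-bounded variant: offsetting by min handles negative integers
// and keeps the flag array's size at X = max - min + 1 rather than the raw magnitude.
static int[] rangeSort(int[] input, int min, int max) {
    int range = max - min + 1;                // X in the discussion above
    boolean[] seen = new boolean[range];
    for (int n : input) {
        seen[n - min] = true;                 // shift so min maps to index 0
    }
    int[] sorted = new int[input.length];
    int i = 0;
    for (int v = 0; v < range; v++) {
        if (seen[v]) {
            sorted[i++] = v + min;            // shift back when emitting
        }
    }
    return sorted;
}
```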

Am closing this now. Feel free to reopen if the constraint does not fit in correctly.