dataplayer12/Fly-LSH

Could DenseFly process large-scale, sparse datasets?

Closed this issue · 4 comments

Dear author, I tried to apply the DenseFly algorithm to a large-scale, very sparse dataset with 21612 rows x 28065 columns and a density of about 0.017. My goal is to hash the rows and query their nearest neighbors efficiently. I notice that DenseFly first projects the data into a high-dimensional space (self.hash) and then reduces the hash dimension into a low-dimensional space (self.lowd_hashes). Does that mean I have to set the embedding_size variable larger than 28065? (Or reduce the column dimension to be smaller than embedding_size?)
Meanwhile, as far as I know, the precondition for first projecting the data into a high-dimensional space is that the data matrix is dense, so that the elements become more distinguishable once a sparse data matrix is lifted into the high-dimensional space. Does that mean DenseFly is not appropriate for processing sparse datasets?
Thank you very much.

Dear @BIOINSu, this is a very interesting question. I don't think DenseFly is a good algorithm for your problem, for two reasons:

  • You have very few samples (21k) compared to the dimensionality of the dataset (28k). This makes it difficult to find bins (which is what self.lowd_hashes does): even the low-d hashes will be quite high dimensional, so most of the bins will be empty. You would then need a larger search radius to get enough results, which eats into much of the computational advantage DenseFly provides (see the sketch after this list).

  • As you mentioned regarding the density precondition, the DenseFly algorithm works because a binomial distribution with enough samples approximates a Gaussian distribution. Here, the number of samples refers to the number of non-zero entries picked up by DenseFly's projection matrix. This means that, in principle, you could project your dataset to a high enough dimensionality (say, a million) to make the approximation work, but that would be impractical.
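
Here is a minimal numpy sketch of both points. The low-d hash length and the projection sampling ratio below are hypothetical values chosen only for illustration, not the defaults used in this repo:

```python
import numpy as np

n_samples, n_dims = 21612, 28065
density = 0.017

# Point 1: even a modest low-d hash defines more bins than there are
# samples, so most bins are necessarily empty.
lowd_bits = 16                      # hypothetical low-d hash length
n_bins = 2 ** lowd_bits
print(f"{n_bins} bins for {n_samples} samples "
      f"-> at most {n_samples / n_bins:.0%} of bins occupied")

# Point 2: each projection sums the input entries it samples, and the
# binomial-to-Gaussian approximation needs many non-zero terms in that
# sum. With a sparse input, the expected count is small.
sampling_ratio = 0.01               # hypothetical fraction of dims sampled
expected_nonzeros = sampling_ratio * n_dims * density
print(f"expected non-zero terms per projection: {expected_nonzeros:.1f}")
```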

May I ask whether the non-zero entries in your dataset are categorical or continuous real-valued floats? I will assume they are continuous in the discussion below.

I suggest a data transformation that might help you greatly. Since your dataset is so sparse, you can convert it from real-valued numbers to binary by placing a 1 at every location where the value is non-zero. This gives you a binary hash of 28065 dimensions. From there, you can simply index, for each row, the columns where the entry is non-zero. When comparing hashes, you can count the number of common entries between two rows and use that count as a similarity metric.
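
For instance, here is a minimal sketch of this transformation using scipy.sparse; the matrix shape and density mirror your dataset, but the data itself is random and the query row is arbitrary:

```python
import numpy as np
from scipy import sparse

# Random stand-in for the 21612 x 28065 matrix from the question.
X = sparse.random(21612, 28065, density=0.017, format="csr", random_state=0)

# Binarize: put a 1 wherever the original value is non-zero.
B = X.copy()
B.data[:] = 1.0

# The per-row indices of non-zero columns are already available in CSR
# form (B.indices / B.indptr), so no separate index structure is needed.

# For a query row q, the number of common non-zero columns with every
# row is a single sparse matrix-vector product.
q = B[0]                                  # query: row 0 (illustrative)
overlap = np.asarray((B @ q.T).todense()).ravel()
overlap[0] = -1                           # exclude the query itself
top10 = np.argsort(overlap)[::-1][:10]    # rows with most common entries
print(top10, overlap[top10])
```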

If the above method is not sufficient, you could use WTAHash, which is one of the algorithms we compare DenseFly against in the paper. It is implemented in this repo as the WTAHash class and should give you very good results. I also improved upon the original WTA Hash algorithm as presented by Yagnik et al.; this is implemented in the repo as WTAHash2 and gives better results than its original counterpart, so it is also worth a try. Finally, if you still want some of the benefits of high-dimensional projection, you could use a hybrid of FlyHash and WTAHash, implemented as FlyWTA. Its performance is intermediate between FlyHash and WTAHash on the datasets used in our paper, but I have not tested it on very high-dimensional data. These modified algorithms are not mentioned in our manuscript, but I decided to release them in case someone finds them useful.
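
For intuition, here is a small standalone sketch of the basic WTA hash idea (random feature windows, argmax codes). It is written from scratch for illustration and does not follow the exact interface of the WTAHash class in this repo; for sparse input you would densify the sampled columns first:

```python
import numpy as np

def wta_hash(X, n_codes=64, window=8, seed=0):
    """Winner-take-all codes: for each code, permute the feature axis
    and record the argmax within the first `window` permuted features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    codes = np.empty((n, n_codes), dtype=np.int64)
    for k in range(n_codes):
        idx = rng.permutation(d)[:window]   # random window of features
        codes[:, k] = np.argmax(X[:, idx], axis=1)
    return codes

# Similarity between WTA codes = number of positions where they agree.
X = np.random.default_rng(1).random((1000, 128))   # toy dense data
codes = wta_hash(X)
sims = (codes == codes[0][None, :]).sum(axis=1)
print(sims.argsort()[::-1][:5])             # 5 rows most similar to row 0
```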

I remember reading in a paper that dimensionality much greater than 1000 is considered too high for ANN algorithms, and that it is recommended to perform dimensionality reduction before applying ANN. Have you looked into reducing the dimensionality of your data?

Thanks for the reply!

Actually, the non-zero entries of my data matrix are either positive categorical values or continuous real-valued floats, depending on which normalization method is used during preprocessing.

Yes, there are methods that select a threshold to convert the elements from real-valued numbers to binary. However, based on my colleague's experimental results, simply counting the number of common entries for each row and using that as a similarity metric does not work well.

The WTA Hash algorithm is indeed another direction for us to consider. I will run more experiments with other hash algorithms such as WTAHash or FlyWTA, as you suggested.

Since the task is to find the nearest neighbor of each row, dimensionality reduction can only be applied to the columns. We are indeed also using appropriate methods to reduce the column dimension; those experiments are still in progress.

Thanks for the suggestions! I still have one small question: in the flylsh class, the function query_lowd_bins contains the line `valid_bins=np.flatnonzero((query_bin[None,:]^self.lowd_bins).sum(axis=1)<=2*search_radius)`. May I ask why the search_radius variable is multiplied by 2? Does the 2 have a special meaning?

Thank you!

Ah, I was hoping somebody would ask this question. Please see this comment, which I initially made in reference to the FlyHash algorithm. A fly hash has a fixed number of 1s in it, so whenever two fly hashes differ at one position, they must also differ at another, complementary position: if you modify a fly hash by taking a 1 out of some position, you must put that 1 at another position which initially held a 0.

For an easy illustration, consider the family of 4-dimensional fly hashes with exactly two 1s in each hash, and take one such hash, say H = [1,0,0,1]. To modify it into another valid hash, we first take one 1 away from H[0] or H[3], say H' = [1,0,0,0]. The L1 distance between H and H' is 1, but H' is not yet a valid fly hash, since a valid hash in this family must contain exactly two 1s; so we must add another 1 at either H'[1] or H'[2], say H'' = [1,0,1,0]. Now the L1 distance between H and H'' is 2. In general, since H[3] is 1 and H''[3] is 0, there must be another position x where H[x] is 0 and H''[x] is 1; in this example, x happens to be 2. For ease of understanding, you can uncomment the line I mentioned above and print out the L1 distances on random data: you will see that they are all even.
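
This is also exactly why the code compares against 2*search_radius: each "edit" to a fly hash (moving one 1) changes the L1 distance by 2, so a search radius of r edits corresponds to an L1 distance of 2r. A quick empirical check with numpy (the hash dimension and number of 1s below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_ones, n_hashes = 32, 8, 200          # illustrative hash family

# Sample random binary hashes with exactly `n_ones` 1s each.
hashes = np.zeros((n_hashes, dim), dtype=np.uint8)
for h in hashes:
    h[rng.choice(dim, size=n_ones, replace=False)] = 1

# Pairwise L1 (= Hamming) distances via XOR.
dists = (hashes[:, None, :] ^ hashes[None, :, :]).sum(axis=2)
print(np.unique(dists % 2))                 # -> [0]: every distance is even
```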

I see! Thanks a lot, your comment is very helpful. Closing the issue.