YingfanWang/PaCMAP

Speed up processing of a large dataset


I have a very large dataset (50 million rows by 768 features) that I am trying to use, and a test case with 1M rows took about 35 minutes. Scaling to 50M implies it would take over a day to finish. This is on a machine with 160GB of memory and 40 cores. Any suggestions on how to speed this up? I would prefer to fit the whole dataset rather than fit a subset and apply that model to the remaining data, but I will do that if necessary.

There is a subset of rows that I think are linked - would specifying n_neighbors help speed things up?
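For reference, here is a minimal sketch of the subsample-and-transform fallback mentioned above. The `save_tree` flag and the `transform(..., basis=...)` call are assumptions about recent pacmap releases, so check them against the version you have installed:

```python
import numpy as np
import pacmap

rng = np.random.default_rng(0)
# Stand-in for the full (50M, 768) float32 matrix; kept small so the sketch runs quickly.
X = rng.random((20_000, 768), dtype=np.float32)

# Fit on a random subsample; save_tree=True (assumed) keeps the internal index
# so new points can be embedded afterwards.
sample_idx = rng.choice(X.shape[0], size=5_000, replace=False)
X_sample = X[sample_idx]

reducer = pacmap.PaCMAP(n_components=2, save_tree=True)
emb_sample = reducer.fit_transform(X_sample, init="pca")

# Project the remaining rows into the already-fitted embedding.
rest_idx = np.setdiff1d(np.arange(X.shape[0]), sample_idx)
emb_rest = reducer.transform(X[rest_idx], basis=X_sample)
```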

Here are a few suggestions:

  • If you can figure out which rows you would like to be neighbors, that can be very helpful. Providing the nearest-neighbor pairs (pair_NN) to the PaCMAP instance can save about 25% of the runtime on large datasets (see the sketch after this list).
  • Using a smaller n_neighbors may provide some limited speedup, but I haven't profiled such cases before. By default, PaCMAP uses n_neighbors=10.
  • Reduce the number of iterations, num_iters. By default, PaCMAP uses num_iters=450. The last 200 iterations mostly optimize the local structure of the embedding and barely affect the general layout, so reducing this parameter to 250 can save you a lot of time at the cost of some local structure.
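Putting the suggestions above together, here is a minimal sketch. The `pair_neighbors` constructor argument is my assumption for how the precomputed pairs (pair_NN above) are passed in; verify the exact name and expected shape against your pacmap version:

```python
import numpy as np
import pacmap

# Stand-in for the real (50M, 768) float32 matrix; kept small so the sketch runs quickly.
X = np.random.default_rng(0).random((10_000, 768), dtype=np.float32)

# If you already know which rows should be neighbors, supply them as an
# (n_pairs, 2) int array of row indices; otherwise leave as None and PaCMAP
# will compute the pairs itself.
pair_neighbors = None  # e.g. np.asarray(known_pairs, dtype=np.int32)

reducer = pacmap.PaCMAP(
    n_components=2,
    n_neighbors=10,                 # try a smaller value; expect only a limited speedup
    num_iters=250,                  # down from the default 450; trades local detail for time
    pair_neighbors=pair_neighbors,  # assumed name for the precomputed pair_NN input
    verbose=True,
)
embedding = reducer.fit_transform(X, init="pca")
```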