Speed up processing a large dataset
Closed this issue · 1 comment
richardehughes commented
I have a very large dataset (50 million rows by 768 features) that I am trying to use, and a test case with 1M rows took about 35 minutes. Scaling to 50M implies it would take over a day to finish. This is on a machine with 160GB of memory and 40 cores. Any suggestions on how to speed this up? I would prefer to fit the whole dataset rather than fit a subset and apply that to the remaining data, but I will do that if necessary.
There is a subset of rows that I think are linked. Would specifying `n_neighbors` help speed things up?
hyhuang00 commented
Here are a few suggestions:
- If you are able to figure out which rows you would like to be neighbors, that can help a lot. Providing the nearest neighbor pairs (`pair_NN`) to the PaCMAP instance can save about 25% of the time on large datasets.
- Using a smaller `n_neighbors` may provide some limited speedup, but I haven't profiled such cases before. By default, PaCMAP uses `n_neighbors=10`.
- Reduce the number of iterations (`num_iters`). By default, PaCMAP uses `num_iters=450`. The last 200 iterations mostly optimize the local structure of the embedding and barely affect the general layout, so reducing this parameter to 250 can save you a lot of time at the cost of some local structure. (A sketch combining these suggestions follows below.)
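A minimal sketch of how these knobs could be combined, assuming a recent `pacmap` release where precomputed neighbor pairs are passed to the constructor as `pair_neighbors` (the comment above calls them `pair_NN`; the exact argument name and accepted pair format may vary by version). Exact k-NN from scikit-learn stands in for whatever approximate-nearest-neighbor index you would actually use at 50M rows:

```python
import numpy as np
import pacmap
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the real data (the issue describes ~50M x 768).
X = np.random.rand(100_000, 768).astype(np.float32)
n, n_neighbors = X.shape[0], 10  # 10 is PaCMAP's default n_neighbors

# Precompute nearest-neighbor pairs once. Exact k-NN is shown for brevity;
# at 50M rows you would use an ANN library (Annoy, FAISS, ...) instead.
knn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
_, idx = knn.kneighbors(X)               # idx[:, 0] is each point itself
rows = np.repeat(np.arange(n), n_neighbors)
pair_neighbors = np.stack(
    [rows, idx[:, 1:].ravel()], axis=1   # shape (n * n_neighbors, 2)
).astype(np.int32)

# Hand the pairs to PaCMAP and cut the iteration count as suggested above.
reducer = pacmap.PaCMAP(
    n_components=2,
    n_neighbors=n_neighbors,
    num_iters=250,   # default is 450; the last ~200 mostly refine local structure
    pair_neighbors=pair_neighbors,
)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (100000, 2)
```

The pair array itself stays manageable at full scale (50M rows × 10 neighbors × 2 int32 indices ≈ 4 GB), so on a 40-core machine the precomputation with a multithreaded ANN index is likely where most of the saved time comes from.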