obidam/pyxpcm

not fitting with a 5Gb dataset

Closed this issue · 3 comments

gmaze commented

Hello Sir, I've been working on the same dataset for a week now. I have to do this for 5 years of data, so I downloaded the three-year data first, making the total array size about 5 GB. The m.fit_predict(ds, features=features_in_ds, inplace=True) command now takes forever, along with a warning that says "Slicing is producing a large chunk".

[Screenshot: "Slicing is producing a large chunk" warning]

If possible, can you answer these important queries ?

  1. Is there a workaround for implementing the model on large datasets? If not, just answer the 2nd question, which is more important for me.
  2. How do I determine the optimum number of classes (the k value)? The tutorial doesn't show any way to find k using the BIC elbow method; we just took an arbitrary value, like k=12. Can you please tell me how to find a k value suitable for my dataset using pyXpcm?

Originally posted by @Priyanshu-Malik in #35 (comment)

gmaze commented

Hi @Priyanshu-Malik

Did you try changing the statistics backend from scikit-learn to dask-ml?
Install dask-ml and then use the option backend='dask_ml' when you instantiate your PCM.

If this still does not allow you to classify such a dataset, you can try to check where pyXpcm is stalling, following:
https://pyxpcm.readthedocs.io/en/latest/debug_perf.html

Then, my last advice is to subset your dataset to profiles that are REALLY independent from each other, i.e. not closer than a typical correlation scale. Information is often redundant in a high-resolution dataset, so, statistically speaking, the PCM does not need every profile to be relevant. You should be able to fit a PCM on a subset, and then easily predict classes for the full dataset.
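The fit-on-a-subset idea can be sketched with scikit-learn's GaussianMixture (the mixture model pyXpcm wraps), rather than the pyXpcm API itself; the array shapes and the stride below are illustrative assumptions, not values from this thread:

```python
# Sketch: fit on a decorrelated subset, then predict on the full dataset.
# Shapes and stride are made-up; in practice the stride should reflect
# the typical correlation scale of your data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_profiles, n_depths = 50_000, 30
profiles = rng.normal(size=(n_profiles, n_depths))  # stand-in for a (sample, depth) array

stride = 10                    # keep roughly 1 profile per correlation scale
subset = profiles[::stride]    # 5,000 profiles: much cheaper to fit

gmm = GaussianMixture(n_components=8, random_state=0).fit(subset)
labels_full = gmm.predict(profiles)  # predicting is cheap compared to fitting
print(labels_full.shape)
```

The key point is that predicting class labels is far cheaper than fitting, so only the fit needs the reduced, statistically independent subset.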

This is a classic question that depends too much on your specific problem. My advice is to compute the BIC correctly (using a correct number of independent samples/profiles) to get a hint of the statistically allowed range of k, and then select the k value that makes sense for your analysis.
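A minimal BIC elbow sketch, again using scikit-learn's GaussianMixture directly rather than the pyXpcm API; the synthetic data and the range of k values are assumptions for illustration only:

```python
# Sketch: scan k, compute BIC for each fitted mixture, keep the minimum.
# The synthetic data (3 well-separated clusters) is a stand-in for real
# independent profiles.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(300, 5)) for c in (-3.0, 0.0, 3.0)])

ks = range(1, 9)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in ks]
best_k = list(ks)[int(np.argmin(bics))]
print("BIC-selected k:", best_k)
```

Note that BIC assumes independent samples, which is exactly why the subsetting advice above matters: feeding correlated profiles inflates the apparent sample size and biases the selected k.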

Priyanshu-Malik commented

Thank you for the suggestions; I was able to classify this large dataset and fit it only over the Indian Ocean region.
I still have to follow the BIC part, but that's not my priority now. As you know, I was able to plot one "nc" file of a month last time. Now all I want to do is plot the subset data on the map for 12 months or more, but I'm getting an error while plotting the clusters, as can be seen here:
All the fit_predict, fit_proba, etc. commands executed without any error.

[Screenshot: plotting error traceback]

I can't figure out what this error really means or what I should do to fix it. Once I get a plot, I will be pretty much done with my project and with my queries about the pyXpcm package. Thank you in advance!

Here's the subset data summary, in case it helps:
[Screenshot: dataset summary]

gmaze commented

This has nothing to do with pyXpcm and is pretty obvious: LABELS have a time coordinate (100 x 350 x 12 = 420000)!
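The point can be illustrated with plain NumPy: a labels array that keeps its time dimension holds 100 x 350 x 12 = 420000 values and cannot be drawn as a single map. Only that product comes from the thread; the (lat, lon, time) layout and names below are assumptions:

```python
# Illustration of the shape problem behind the plotting error: a labels
# array with a time axis cannot be drawn as one 2-D lat/lon map.
import numpy as np

labels = np.zeros((100, 350, 12), dtype=int)  # stand-in for ds['PCM_LABELS']
print(labels.size)           # 420000 values in total: too many for one map

one_month = labels[:, :, 0]  # select a single time step first
print(one_month.shape)       # (100, 350): a plottable 2-D field
```

With xarray, the equivalent would be something like ds['PCM_LABELS'].isel(time=0).plot() (assuming the coordinate is named time), or a loop/facet over time to get one map per month.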
Closing this since the issue is solved.