ntanmayee/decoden

sklearn warnings


When running `run_decoden` I receive a lot of warnings:

```
/cluster/gjb_lab/mgierlinski/software/miniconda3/envs/decoden/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:874: RuntimeWarning: invalid value encountered in scalar divide
  if (previous_error - error) / error_at_init < tol:
/cluster/gjb_lab/mgierlinski/software/miniconda3/envs/decoden/lib/python3.10/site-packages/sklearn/decomposition/_nmf.py:1665: ConvergenceWarning: Maximum number of iterations 500 reached. Increase it to improve convergence.
```

They come from the sklearn library, so they might be difficult to trace, but it is likely something in the data passed to sklearn that causes them. I asked ChatGPT for help with this warning and here is what it suggested:

This warning indicates that an invalid value (e.g., NaN or infinity) was encountered during a scalar division operation in the Non-negative Matrix Factorization (NMF) module of the scikit-learn library. The specific line mentioned in the warning checks for the convergence of the NMF algorithm by comparing the relative change in the error with a user-specified tolerance level (tol).
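As a minimal illustration of how the first warning can arise (this is not DecoDen code; the zero values below are a contrived stand-in for an edge case such as a degenerate input):

```python
import numpy as np

# The convergence test quoted in the warning, in isolation. sklearn's error
# values are NumPy scalars, so a zero error_at_init yields NaN (plus the
# "invalid value encountered in scalar divide" RuntimeWarning) rather than
# an exception. Since NaN < tol is False, convergence is never detected and
# the loop runs on until max_iter, which also explains the ConvergenceWarning.
previous_error = np.float64(0.0)
error = np.float64(0.0)
error_at_init = np.float64(0.0)

converged = (previous_error - error) / error_at_init < 1e-4  # emits the warning
print(converged)  # False
```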

There might be several reasons for encountering such invalid values during the computation:

Input data issues: The input data could have missing values (NaNs), very large values, or other problematic features that could cause the algorithm to produce invalid results during computations. Make sure to preprocess your data by removing or imputing missing values, scaling or normalizing the data, and removing outliers if necessary.
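A minimal sketch of such a sanity check; `X` here is a random stand-in for whatever matrix DecoDen actually hands to sklearn:

```python
import numpy as np

# Illustrative only: X stands in for the real input matrix.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
X[3, 1] = np.nan  # a single bad value can derail the factorisation

print("NaNs:", np.isnan(X).sum())
print("infs:", np.isinf(X).sum())
print("negatives:", (X < 0).sum())  # NMF requires non-negative input

X_clean = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)  # crude imputation
```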

Initialization issues: NMF relies on the initialization of its factor matrices for the iterative optimization. If the initialization is poor or leads to numerical instability, the algorithm might not converge properly. By default, scikit-learn uses an NNDSVD-based initialization ('nndsvda' in recent releases), which is generally a good choice. However, you can try different strategies by setting the 'init' parameter when creating the NMF instance, for example 'random' or 'nndsvd'.
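For example, a sketch of forcing an explicit initialisation (the `n_components=3` is an arbitrary placeholder, not a DecoDen setting):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 4))

# Try an explicit initialisation strategy instead of the default.
model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)
H = model.components_
```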

Parameter choices: The choice of hyperparameters for the NMF algorithm, such as the number of components, regularization terms, and maximum number of iterations, can also impact convergence. Experiment with different hyperparameter settings to see if the issue persists.
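A sketch of a small sweep over two such hyperparameters; the values are placeholders, not DecoDen defaults:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 4))

# Compare how many iterations each setting needs and the final error.
for n_components in (2, 3):
    for max_iter in (500, 2000):
        model = NMF(n_components=n_components, max_iter=max_iter,
                    init="nndsvda", random_state=0)
        model.fit(X)
        print(n_components, max_iter, model.n_iter_, model.reconstruction_err_)
```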

Numerical precision issues: Sometimes, the computations might result in very small or very large intermediate values, which can cause numerical instability and produce invalid values. This is more common when working with high-dimensional data or when the algorithm is close to convergence. You can try increasing the 'tol' parameter to allow for a more relaxed convergence criterion, which might help prevent such issues.
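One way to combine a relaxed tolerance with an explicit check, sketched here with placeholder values, is to promote the ConvergenceWarning to an error so a non-converged run is caught instead of scrolling past:

```python
import warnings
import numpy as np
from sklearn.decomposition import NMF
from sklearn.exceptions import ConvergenceWarning

rng = np.random.default_rng(0)
X = rng.random((100, 4))

# Relax tol, and turn ConvergenceWarning into an exception we can handle.
model = NMF(n_components=3, tol=1e-3, max_iter=500, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    try:
        model.fit(X)
    except ConvergenceWarning:
        print("NMF did not converge; consider raising max_iter or tol")
```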

If the warning persists after addressing these potential causes, consider using other dimensionality reduction techniques, such as PCA or TruncatedSVD, which might be more stable for your specific dataset.
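For completeness, a sketch of those alternatives (again on a random stand-in matrix); note they solve a closed-form problem and so avoid NMF's iterative convergence issues, but at the cost of losing the non-negative decomposition that NMF provides:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((100, 4))

# Deterministic SVD-based reductions: no init, no convergence loop.
pca = PCA(n_components=3).fit(X)
svd = TruncatedSVD(n_components=3).fit(X)
print(pca.explained_variance_ratio_)
print(svd.explained_variance_ratio_)
```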

Thanks for the heads up, I'll look into it. It sounds like an issue with the NMF implementation in scikit-learn, and potentially a signal of low data quality that leads to the factorisation not converging. We'll discuss how best to communicate this to the user.