MaxHalford/prince

ValueError: array must not contain infs or NaNs

Ashur59 opened this issue · 5 comments

I am following this article (https://datascienceplus.com/parsing-html-and-applying-unsupervised-machine-learning-part-3-principal-component-analysis-pca-using-python/) for a project I am working on. I am running my Python script in Visual Studio Code on a Windows 10 machine with the following versions:
sklearn: 1.1.1
pip: 22.1.2
setuptools: 62.6.0
numpy: 1.23.1
scipy: 1.8.1
Cython: 0.29.28
pandas: 1.3.5
matplotlib: 3.5.2
joblib: 1.1.0
threadpoolctl: 3.0.0

The first block of code under the "Putting it Together" section of the article, namely:

pca.plot_row_coordinates(
    df2[numerical_features],
    ax=None,
    figsize=(10, 8),
    x_component=0,
    y_component=1,
    labels=None,
    color_labels=df2['Kcluster'],
    ellipse_outline=True,
    ellipse_fill=True,
    show_points=True
).legend(loc='center left', bbox_to_anchor=(1, 0.5))

gives the following error message even though none of the entries of the dataframe are NaNs or infs:

C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\prince\plot.py:45: RuntimeWarning: Degrees of freedom <= 0 for slice
  cov_matrix = np.cov(np.vstack((X, Y)))
C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\numpy\lib\function_base.py:2704: RuntimeWarning: divide by zero encountered in divide
  c *= np.true_divide(1, fact)
C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\numpy\lib\function_base.py:2704: RuntimeWarning: invalid value encountered in multiply
  c *= np.true_divide(1, fact)
Traceback (most recent call last):
  File "c:\Users\username\Desktop\observables\my_script.py", line 163, in <module>
    main()
  File "c:\Users\username\Desktop\observables\my_script.py", line 95, in main
    preprocessor(grand_df, pca)
  File "c:\Users\username\Desktop\observables\utilities.py", line 1815, in preprocessor
    ax = pca.plot_row_coordinates(df2[numerical_features],
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\prince\pca.py", line 238, in plot_row_coordinates
    x_mean, y_mean, width, height, angle = plot.build_ellipse(x[mask], y[mask])
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\prince\plot.py", line 46, in build_ellipse
    U, s, V = linalg.svd(cov_matrix, full_matrices=False)
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\scipy\linalg\_decomp_svd.py", line 108, in svd
    a1 = _asarray_validated(a, check_finite=check_finite)
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\scipy\_lib\_util.py", line 287, in _asarray_validated
    a = toarray(a)
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\numpy\lib\function_base.py", line 627, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

Is there a workaround for the zero degrees of freedom encountered when computing the covariance matrix?

Adding ddof=0 to the np.cov(np.vstack((X, Y))) call in plot.py makes the error go away. However, ddof=0 makes np.cov return the simple average (it divides by N instead of N - 1), and as a result two of my clusters (the ones with the fewest members) are not drawn in the plot.
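For anyone trying to reproduce this, here is a minimal sketch of what seems to be happening, assuming the affected clusters contain a single point each. The helper names ellipse_covariance and can_draw_ellipse are hypothetical and are not part of Prince; they only mirror the np.cov(np.vstack((X, Y))) call from the traceback. With the default ddof=1, one observation leaves N - 1 = 0 degrees of freedom, so the covariance is all NaN; with ddof=0 it is all zeros, which would explain an ellipse of zero size.

```python
import warnings

import numpy as np


def ellipse_covariance(x, y, ddof=1):
    """Hypothetical helper mirroring np.cov(np.vstack((x, y))) from the traceback."""
    with warnings.catch_warnings():
        # Silence the "Degrees of freedom <= 0" and divide-by-zero warnings.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.cov(np.vstack((x, y)), ddof=ddof)


# A cluster with a single member: N - ddof = 1 - 1 = 0 degrees of freedom.
x, y = np.array([0.3]), np.array([-1.2])

cov_default = ellipse_covariance(x, y)        # 0 * (1 / 0) -> every entry is NaN
cov_ddof0 = ellipse_covariance(x, y, ddof=0)  # divides by N = 1 -> every entry is 0


def can_draw_ellipse(x, y):
    """Hypothetical guard: only fit an ellipse to a finite, nonzero covariance."""
    if len(x) < 2:
        return False
    cov = ellipse_covariance(x, y)
    return bool(np.isfinite(cov).all() and np.any(cov != 0))
```

A guard like this would let the plotting loop skip the ellipse for very small clusters while still drawing their points, instead of passing a NaN matrix to linalg.svd.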

Hello there 👋

I apologise for not answering earlier. I was not maintaining Prince anymore. However, I have just refactored the entire codebase. This refactoring should have fixed many bugs.

I don’t have time and energy to check if this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version — that is, version 0.8.0 and onwards.

Hi,

I ran into the same error when applying MCA to a dataframe of N integer columns. It seems that the instruction in class MCA

one_hot = pd.get_dummies(X)

does not work with non-str values and does not throw any warnings. I solved it by casting the whole dataframe to str before fitting, like so:

mca = mca.fit(missing_matrix.astype(str))

Maybe a cast or a type check could be added to the method?

All the best!
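A small sketch of the behaviour described in this comment, using a made-up two-column dataframe (the column names and values are illustrative only). With default arguments, pd.get_dummies leaves integer columns untouched, so nothing is one-hot encoded; after casting to str, every column is expanded.

```python
import pandas as pd

# Illustrative data: two integer-coded categorical columns.
X = pd.DataFrame({"a": [1, 2, 1], "b": [0, 0, 3]})

as_is = pd.get_dummies(X)               # integer columns pass through unchanged
as_str = pd.get_dummies(X.astype(str))  # each column becomes one indicator per value

print(list(as_is.columns))   # ['a', 'b']
print(list(as_str.columns))  # ['a_1', 'a_2', 'b_0', 'b_3']
```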

Good catch @vblanes! I know how to fix that. Basically, pd.get_dummies only transforms str and category columns by default. But there's an option to apply it to all columns. I'll send a fix right now.
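As a rough illustration, and assuming the option in question is the columns parameter of pd.get_dummies (the actual change in Prince may look different), passing every column name forces the encoding regardless of dtype. The dataframe here is made up for the example.

```python
import pandas as pd

# Illustrative data: integer columns that should be treated as categorical.
X = pd.DataFrame({"a": [1, 2, 1], "b": [0, 0, 3]})

# Listing the columns explicitly encodes them whatever their dtype is.
one_hot = pd.get_dummies(X, columns=X.columns)

print(list(one_hot.columns))  # ['a_1', 'a_2', 'b_0', 'b_3']
```

This avoids the astype(str) cast on the caller's side, at the cost of treating every column as categorical.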

This is now fixed in version 0.10.7. We now ensure MCA works correctly, regardless of the types of the columns. It treats all the columns as categorical data.