mingzehuang/latentcor

ValueError in corr estimation between mixed types

Opened this issue · 17 comments

Hi!

I would like to use latentcor in my project, but I've run into an error in a block of the code which I cannot interpret.

You could recreate this issue with the following example:

```python
a = np.array([27.16, 68.78, 46.21, 39.4 , 20.86, 38.18, 33.16, 45.84, 29.36, 43.27])
b = np.array([32., 34., 32., 47., 31., 34., 34., 39., 36., 33.])
c = np.array([1., 2., 1., 2., 2., 2., 1., 2., 2., 1.])

t = np.column_stack((a, b, p))
latentcor(t, tps = get_tps(t, tru_prop=0.05))
```

Error message

```
ValueError                                Traceback (most recent call last)
Input In [49], in <cell line: 11>()
      7 t.shape
      9 get_tps(t)
---> 11 latentcor(t, tps = get_tps(t, tru_prop=0.05))

File .../latentcor/latentcor.py:494, in latentcor(X, tps, method, use_nearPD, nu, tol, ratio, showplot)
    492 comb_select = combs_cp == comb
    493 if comb == "00":
--> 494 R_lower[comb_select] = numpy.sin((numpy.pi / 2) * K_a_lower)
    495 else:
    496 K = K_a_lower[comb_select]

ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true
```
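For context, this NumPy error means the number of values on the right-hand side of the assignment does not match the number of `True` entries in the boolean mask. A minimal standalone illustration, unrelated to latentcor's internals:

```python
import numpy as np

out = np.zeros(4)
mask = np.array([True, False, False, False])  # selects exactly 1 element
out[mask] = np.array([1.0, 2.0, 3.0])         # 3 values -> the same ValueError
```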

I've run multiple tests, and it seems that this error arises when there are several continuous variables.
It would be great if you could explain why this is happening. Thank you!

Oleg

Hi, @Vlasovets , it seems that `t = np.column_stack((a, b, p))` should be `t = np.column_stack((a, b, c))`, shouldn't it? I assume you're talking about latentcor_py.
BTW, I'm so sorry for the very late reply. I was super busy last month...

I'm sorry, I see my bug... @Vlasovets, thank you very much for your test!!! I'll fix it ASAP!

hi, @mingzehuang!
thanks for replying!
yes, of course, it is just a typo; I meant three different vectors, namely a, b, c.
let me know if it works for you, because I haven't managed to solve it yet. And thanks a lot for bringing it to Python 👍

Hi, @Vlasovets I think I've fixed the bug in both the GitHub version and on PyPI. Let me know if anything needs to be further improved. Thank you for your test!

hi, @mingzehuang!
thanks for looking at this issue! May I ask if the example above works for you?
I've tried to run it again after re-installing the library, but it does not work.
The problem seems to be in this [line](<https://github.com/mingzehuang/latentcor_py/blob/master/latentcor/latentcor.py#:~:text=ipol_10(numpy.-,column_stack,-((K%2C%20zratio1)>).

It seems to be connected to the arrays having length 10.
I'm not familiar with what those ipol variables mean, but if you change my example so that the length of a/b/c is <10 or >10, the error disappears. I hope this helps you track the bug down. Thank you!

```python
a = np.array([27.16, 68.78, 46.21, 39.4 , 20.86, 38.18, 33.16, 45.84, 29.36, 43.27])
b = np.array([32., 34., 32., 47., 31., 34., 34., 39., 36., 33.])
c = np.array([1., 2., 1., 2., 2., 2., 1., 2., 2., 1.])

t = np.column_stack((a, b, c))

latentcor(t, tps = get_tps(t, tru_prop=0.05))
```

The error message now is different, though:
```
ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 7>()
      3 c = np.array([1., 2., 1., 2., 2., 2., 1., 2., 2., 1.])
      5 t = np.column_stack((a, b, c))
----> 7 latentcor(t, tps = get_tps(t, tru_prop=0.05))

File ~/kora/lib/python3.9/site-packages/latentcor/latentcor.py:502, in latentcor(X, tps, method, use_nearPD, nu, tol, ratio, showplot)
    500 R_lower[comb_select] = r_sol.batch(self = r_sol, K = K, comb = comb, zratio1 = zratio1, zratio2 = zratio2, tol = tol)
    501 elif method == "approx":
--> 502 R_lower[comb_select] = r_switch.r_approx(self = r_switch, K = K, zratio1 = zratio1, zratio2 = zratio2, comb = comb, tol = tol, ratio = ratio)
    503 K = numpy.zeros((p, p), dtype = numpy.float32)
    504 K[cp] = K_a_lower; R[cp] = R_lower

File ~/kora/lib/python3.9/site-packages/latentcor/latentcor.py:252, in r_switch.r_approx(self, K, zratio1, zratio2, comb, tol, ratio)
    250 out = numpy.full(len(K), numpy.nan); revcutoff = numpy.logical_not(cutoff)
    251 out[cutoff] = r_sol.batch(self = r_sol, K = K[cutoff], zratio1 = zratio1[ : , cutoff], zratio2 = zratio2[ : , cutoff], comb = comb, tol = tol)
--> 252 out[revcutoff] = r_switch.r_ml(self = r_switch, K = K[revcutoff] / bound[revcutoff], zratio1 = zratio1[ : , revcutoff], zratio2 = zratio2[ : , revcutoff], comb = comb)
    253 return out

File ~/kora/lib/python3.9/site-packages/latentcor/latentcor.py:240, in r_switch.r_ml(self, K, zratio1, zratio2, comb)
    238 if comb == "33":
    239 zratio2[0, : ] = zratio2[0, : ] / zratio2[1, : ]
--> 240 out = r_switch.ipol_switch(self = r_switch, comb = comb, K = K, zratio1 = zratio1, zratio2 = zratio2)
    241 return out

File ~/kora/lib/python3.9/site-packages/latentcor/latentcor.py:214, in r_switch.ipol_switch(self, comb, K, zratio1, zratio2)
    212 def ipol_switch(self, comb, K, zratio1, zratio2):
    213 if comb == "10":
--> 214 out = ipol_10(numpy.column_stack((K, zratio1[0, : ])))
    215 elif comb == "11":
    216 out = ipol_11(numpy.column_stack((K, zratio1[0, : ], zratio2[0, : ])))

File ~/kora/lib/python3.9/site-packages/scipy/interpolate/_interpolate.py:2528, in RegularGridInterpolator.__call__(self, xi, method)
   2525 for i, p in enumerate(xi.T):
   2526 if not np.logical_and(np.all(self.grid[i][0] <= p),
   2527 np.all(p <= self.grid[i][-1])):
-> 2528 raise ValueError("One of the requested xi is out of bounds "
   2529 "in dimension %d" % i)
   2531 indices, norm_distances, out_of_bounds = self._find_indices(xi.T)
   2532 if method == "linear":

ValueError: One of the requested xi is out of bounds in dimension 0
```
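For context, a minimal standalone illustration of this last error, assuming nothing about latentcor's interpolation grids: `RegularGridInterpolator` raises it whenever a query point lies outside the grid it was built on.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

grid = np.linspace(0.0, 1.0, 5)                       # grid covers [0, 1]
interp = RegularGridInterpolator((grid,), grid ** 2)  # 1-D interpolator
interp(np.array([[1.5]]))  # 1.5 is outside [0, 1] -> "out of bounds in dimension 0"
```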

Hi, @Vlasovets, it works well on my computer. The output is as follows:

```
ordinal levels between 4 and 10 will be approximated by either countinuous or truncated type.
(array([[ 1.00000000e+00, 3.08707986e-01, -4.31046120e-16],
[ 3.08707986e-01, 1.00000000e+00, 6.84747538e-01],
[-3.98478053e-16, 6.84747538e-01, 1.00000000e+00]]),
array([[ 1.0000000e+00, 3.0901700e-01, -1.3877788e-16],
[ 3.0901700e-01, 1.0000000e+00, 6.8543297e-01],
[-1.3877788e-16, 6.8543297e-01, 1.0000000e+00]], dtype=float32),
None,
array([[1. , 0.2 , 0. ],
[0.2 , 1. , 0.31111112],
[0. , 0.31111112, 1. ]], dtype=float32),
array([[nan, nan, 0.4],
[nan, nan, nan]]))
```

I'm not sure where your problem comes from, but I'm looking at it and trying to reproduce it :)

I have created a Colab notebook, so you can test this bug independently of your working environment:
https://colab.research.google.com/drive/1UHTG0fUFX7AEQsJKOA-qJ85AlKlcE483?usp=sharing

Hi, @Vlasovets,
I'm sorry, I found that it works well when we set `method='original'` but hits some problem when we set `method='approx'`. I think perhaps there is some version conflict when we use Colab to load some scipy functions. I'll fix it ASAP. But at least you can use it by setting `method='original'` right now, if your data set is not too big :)
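A minimal sketch of that interim workaround, reusing the names from the example above:

```python
# Force the exact-solution path instead of the interpolation-based one
# (argument names as in the latentcor signature shown in the tracebacks).
latentcor(t, tps = get_tps(t, tru_prop = 0.05), method = "original")
```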

Hi @Vlasovets,
I think the package works well; the only problem is that I didn't make it recognize binary variables coded as 1 and 2. They should be coded as 0 and 1 instead, so when I do `c - 1`, everything works (a sketch follows below the link):
https://colab.research.google.com/drive/1m1OOShk2HcihZPHS-IUoT642KznY4bUq?usp=sharing
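A minimal sketch of the recoding, reusing the arrays from the example above:

```python
# Shift the 1/2 coding of the binary variable down to 0/1 before stacking.
t = np.column_stack((a, b, c - 1))
latentcor(t, tps = get_tps(t, tru_prop = 0.05))
```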

hi, @mingzehuang!

thanks for taking a look at this!
after I encoded my binary variables with 0/1 notation and added `method='original'`, it worked out 👍

  1. However, now I'm experiencing a bigger issue: the results of latentcor in R and Python are not equal on my dataset.

So, when I run `latentcor(data, tps = clean_types, method ='original')`,
I get the following warning:

```
/python3.9/site-packages/statsmodels/stats/correlation_tools.py:90: IterationLimitWarning:
Maximum iteration reached.
```

I do not get this warning when I run the same command in R. The results differ, however, and it seems to be a convergence problem caused by imbalance in the data, i.e., some features have only a small number of distinct values.
I expect people will often face this issue, since data imbalance is typical for real data.
How would you approach this? Is it possible to increase the number of iterations? (A possible workaround is sketched after this list.) Thank you in advance!

You can recreate this bug by the same link:
https://colab.research.google.com/drive/1m1OOShk2HcihZPHS-IUoT642KznY4bUq?usp=sharing

  2. Also, I really like how you can access attributes via `lat_cor$R` in R, but in Python the output is a tuple() containing several arrays, and the user cannot know what those arrays represent if he/she has never worked with the R version. I would suggest storing the output as a dict(); one can then easily access the needed array by name, e.g.:

```python
clean_types = get_tps(data)
lat_cor = latentcor(data, tps = clean_types, method ='original')

lat_cor.keys()  # to see what is stored inside
lat_cor["R"]    # access the correlation matrix
```

Let me know if you need some clarification, I would be happy to help!
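For what it's worth, a hedged workaround sketch: judging by the file path in the warning, the iteration limit appears to come from statsmodels' correlation tools, so `corr_nearest` could be applied manually, with a larger iteration budget, to the pointwise estimate. Here `R_pointwise` is a stand-in for the correlation matrix taken from latentcor's output:

```python
import numpy as np
from statsmodels.stats.correlation_tools import corr_nearest

# n_fact scales statsmodels' maximum number of iterations (default 100).
R_pd = corr_nearest(np.asarray(R_pointwise), threshold=1e-7, n_fact=1000)
```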

Hi, @Vlasovets.
I see! The maximum iteration limit actually comes from the nearest positive definite adjustment. Let me try to turn it off and see if the output is correct :) And thank you for your suggestion on the keys, too. I'll try to set it up!

Hi, @Vlasovets, I've run latentcor in both R and Python, and the pointwise R (no positive definite correction) is consistent. So you can set `use_nearPD=False` if you don't need the positive definite correction right now. I'm going to find some way to deal with the iteration limit of the positive definite correction ASAP!
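A minimal usage sketch of that suggestion, assuming `data` and `clean_types` as defined earlier:

```python
# Skip the nearest-positive-definite correction and keep the pointwise estimate.
result = latentcor(data, tps = clean_types, method = "original", use_nearPD = False)
```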

@Vlasovets Also, as I just showed in your Colab, `method='approx'` works as well :)

Hi, @Vlasovets , I've made the output a dictionary and increased the iteration limit of the positive definite correction as well :) See the Colab :)

hi, @mingzehuang!

  1. I've tried to import an updated version of latentcor, but it is no longer possible with Python 3.7 (the default for Google Colab).
    It seems that in the latest (1.9.0) version of SciPy, which the latest version of latentcor requires, `RegularGridInterpolator` lives in the reorganized `scipy.interpolate._rgi` module.

So, the problem seems to be on Colab's side.
However, if I may suggest, it would be great to check which versions of Python are compatible with latentcor and document them.
E.g., you could add this line to your README:
[![](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/)

Otherwise, users might think that it works with any version by default.

  2. On my local machine with Python 3.9, I was able to import latentcor.

There are a few things I have spotted:

  • 2.1. When there is no variation in a variable, the corresponding error message is overshadowed by the following one:

```
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/latentcor/latentcor.py", line 382, in get_tps
    exit()
NameError: name 'exit' is not defined
```

I believe you should do `from sys import exit` to import this function (a sketch follows after this list).

  • 2.2. I was wondering if there is any specific reason for using numba==0.55.1 in the requirements?
    In my analysis, I found that after installing latentcor my current version of numba was downgraded, and as a result numpy was downgraded to v1.20.0 as well, which gives this error:

```
latentcor 0.2.3 requires numpy>=1.21, but you have numpy 1.20.0 which is incompatible.
```

This can be easily fixed with `pip install numba --upgrade`, but if this specific version of numba is not strictly required, I would relax the pin to something like `numba>=0.46.0`; otherwise, it might cause other bugs.
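Regarding 2.1, a hedged sketch of the fix; the surrounding get_tps logic is assumed, and `no_variation` is a placeholder for latentcor's actual zero-variance check:

```python
import sys

if no_variation:
    # Option A: what the bare exit() call presumably intended.
    sys.exit("At least one variable has no variation.")
    # Option B (arguably friendlier inside a library): raise, so callers can catch it.
    # raise ValueError("At least one variable has no variation.")
```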

Thank you for continuing to work on it, and especially for the dictionary output; that made it much easier to use :)

Hi, @Vlasovets
I've fixed them accordingly. Let me know if there are still any issues :)

hi, @mingzehuang!

Hope you're doing fine!

I ran it on a dataframe containing both a microbial count table and the respective covariates.
Here are the types I got after running the get_types() method:

```
array(['tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru',
'tru', 'tru', 'tru', 'bin', 'tru', 'bin', 'bin', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'tru', 'bin', 'bin',
'tru', 'tru', 'tru', 'bin', 'tru', 'ter', 'tru', 'tru', 'bin', 'tru', 'ter', 'tru', 'tru', 'bin', 'bin', 'ter', 'tru', 'tru',
'tru', 'bin', 'ter', 'tru', 'tru', 'tru', 'tru', 'tru', 'bin', 'tru', 'ter', 'ter', 'tru', 'ter', 'tru', 'bin', 'ter', 'ter',
'tru', 'tru', 'tru', 'tru', 'bin', 'bin', 'tru', 'bin', 'tru', 'tru', 'bin', 'tru', 'bin', 'ter', 'tru', 'tru', 'tru', 'tru',
'tru', 'bin', 'tru', 'ter', 'tru', 'ter', 'ter', 'bin', 'tru', 'tru', 'ter', 'tru', 'tru', 'bin', 'bin', 'tru', 'tru', 'tru',
'ter', 'ter', 'ter', 'tru', 'tru', 'bin', 'bin', 'tru', 'ter', 'ter', 'tru', 'bin', 'con', 'con', 'con', 'con', 'con', 'con',
'con', 'con', 'con', 'con', 'con', 'con', 'con', 'con'], dtype='<U3')
```

This seems correct: the bacterial counts are truncated or binary (if a bacterium appears in only one sample), and the last features are continuous covariates.

However, when I try to get a positive definite estimate (the "approx" method), I face the maximum iteration error.
Above, you said that the "approx" method works as intended for you in Python; could you give an example, please?

I've uploaded the .csv file with the data, so you can reproduce the error.

Thank you for your time!

```
../statsmodels/stats/correlation_tools.py:90: IterationLimitWarning:
Maximum iteration reached.
  warnings.warn(iteration_limit_doc, IterationLimitWarning)
```
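For reference, a sketch of the kind of call that triggers the warning above, with a hypothetical filename for the uploaded .csv and the function and parameter names used earlier in this thread:

```python
import pandas as pd

# Hypothetical filename for the uploaded data; get_types/latentcor as above.
df = pd.read_csv("counts_and_covariates.csv")
X = df.to_numpy()
result = latentcor(X, tps = get_types(X), method = "approx")  # emits IterationLimitWarning for me
```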