dpeerlab/Palantir

compute_gene_trends throws ValueError: X data must not contain Inf nor NaN

widsquid opened this issue · 45 comments

Thanks for creating this tool; it has been interesting so far. Sadly, after much success I now get this error from the trend computation function. I have checked for NAs, Infs and zeros. I have also tried passing a test imp_df of random floats (no NAs, NaNs, Infs or zeros) instead, with the same result. Any idea what could be the problem? I did a fresh install from a git clone yesterday. All my data comes straight from 10x output. Could the NA be coming from the pr_res data? Thanks so much.

gene_trends = palantir.presults.compute_gene_trends(pr_res, imp_df.loc[:, genes])

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
r = call_item()
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 620, in __call__
return self.func(*args, **kwargs)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py", line 289, in __call__
for func, args, kwargs in self.items]
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py", line 289, in <listcomp>
for func, args, kwargs in self.items]
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/palantir/presults.py", line 162, in gam_fit_predict
y_pred = gam.predict(pred_x)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/pygam.py", line 434, in predict
return self.predict_mu(X)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/pygam.py", line 414, in predict_mu
features=self.feature, verbose=self.verbose)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/utils.py", line 273, in check_X
name='X data', verbose=verbose)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/utils.py", line 171, in check_array
raise ValueError('{} must not contain Inf nor NaN'.format(name))
ValueError: X data must not contain Inf nor NaN
"""

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
/tmp/ipykernel_1341918/4064161686.py in <module>
----> 1 gene_trends = palantir.presults.compute_gene_trends(pr_res, imp_df.loc[:, genes])

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/palantir/presults.py in compute_gene_trends(pr_res, gene_exprs, lineages, n_splines, spline_order, n_jobs)
120 spline_order
121 )
--> 122 for gene in gene_exprs.columns
123 )
124

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1096
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
565 AsyncResults.get from multiprocessing."""
566 try:
--> 567 return future.result(timeout=timeout)
568 except CfTimeoutError as e:
569 raise TimeoutError from e

~/anaconda3/envs/env_palantir/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()

~/anaconda3/envs/env_palantir/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

ValueError: X data must not contain Inf nor NaN

Can you please run the function with n_jobs=1. This will help isolate where the issue is for better debugging.

Thanks so much for the reply. Here is the output. Let me know if there is any other info that might help.

gene_trends = palantir.presults.compute_gene_trends(pr_res, imp_df.loc[:, genes], n_jobs=1)

ValueError Traceback (most recent call last)
/tmp/ipykernel_1341918/2426816647.py in <module>
----> 1 gene_trends = palantir.presults.compute_gene_trends(pr_res, imp_df.loc[:, genes], n_jobs=1)

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/palantir/presults.py in compute_gene_trends(pr_res, gene_exprs, lineages, n_splines, spline_order, n_jobs)
120 spline_order
121 )
--> 122 for gene in gene_exprs.columns
123 )
124

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1083 # remaining jobs.
1084 self._iterating = False
-> 1085 if self.dispatch_one_batch(iterator):
1086 self._iterating = self._original_iterator is not None
1087

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
899 return False
900 else:
--> 901 self._dispatch(tasks)
902 return True
903

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
817 with self._lock:
818 job_idx = len(self._jobs)
--> 819 job = self._backend.apply_async(batch, callback=cb)
820 # A job can complete so quickly than its callback is
821 # called before we get here, causing self._jobs to

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
595 # Don't delay the application, to avoid keeping the input
596 # arguments in memory
--> 597 self.results = batch()
598
599 def get(self):

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
287 with parallel_backend(self._backend, n_jobs=self._n_jobs):
288 return [func(*args, **kwargs)
--> 289 for func, args, kwargs in self.items]
290
291 def reduce(self):

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
287 with parallel_backend(self._backend, n_jobs=self._n_jobs):
288 return [func(*args, **kwargs)
--> 289 for func, args, kwargs in self.items]
290
291 def reduce(self):

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/palantir/presults.py in gam_fit_predict(x, y, weights, pred_x, n_splines, spline_order)
160 if pred_x is None:
161 pred_x = x
--> 162 y_pred = gam.predict(pred_x)
163
164 # Standard deviations

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/pygam.py in predict(self, X)
432 containing predicted values under the model
433 """
--> 434 return self.predict_mu(X)
435
436 def _modelmat(self, X, term=-1):

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/pygam.py in predict_mu(self, X)
412 X = check_X(X, n_feats=self.statistics_['m_features'],
413 edge_knots=self.edge_knots_, dtypes=self.dtype,
--> 414 features=self.feature, verbose=self.verbose)
415
416 lp = self._linear_predictor(X)

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/utils.py in check_X(X, n_feats, min_samples, edge_knots, dtypes, features, verbose)
271 # basic diagnostics
272 X = check_array(X, force_2d=True, n_feats=n_feats, min_samples=min_samples,
--> 273 name='X data', verbose=verbose)
274
275 # check our categorical data has no new categories

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/utils.py in check_array(array, force_2d, n_feats, ndim, min_samples, name, verbose)
169 # check finite
170 if not(np.isfinite(array).all()):
--> 171 raise ValueError('{} must not contain Inf nor NaN'.format(name))
172
173 # check ndim

ValueError: X data must not contain Inf nor NaN

Can you please check whether there are any NA or inf values in imp_df? For example: imp_df.isna().sum() and np.isinf(imp_df).sum()
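A minimal sketch of such a check, assuming imp_df is a numeric pandas DataFrame of cells by genes. pandas DataFrames have no isinf method, so infinities are counted through NumPy here. The helper name count_non_finite and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def count_non_finite(df):
    """Return per-column counts of NaN and +/-Inf values in a numeric DataFrame."""
    values = df.to_numpy(dtype=float)
    return pd.DataFrame(
        {
            "n_nan": np.isnan(values).sum(axis=0),
            "n_inf": np.isinf(values).sum(axis=0),
        },
        index=df.columns,
    )


# Made-up stand-in for imp_df (cells x genes), not data from this thread
toy_df = pd.DataFrame(
    {"geneA": [0.1, np.nan, 0.3], "geneB": [1.0, np.inf, -np.inf]},
    index=["cell1", "cell2", "cell3"],
)
report = count_non_finite(toy_df)
print(report)
```

The same helper can be pointed at any other numeric table passed into the trend computation.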

Sure thanks so much.

[Two screenshots attached]

Can you please share the output of pr_res by running the palantir.plot.plot_palantir_results? Looks like it is an issue in there and not from imputation.

Do any of pr_res.pseudotime or pr_res.branch_probs contain infinities or NA?

This is puzzling! Do you mind sharing a pickle object of pr_res and imp_df? I can examine it in my end to better understand this.

sure, thanks for your efforts. please let me know if you do not successfully receive the files.
W

Sorry, can you please clarify what format the files are saved in and how I can load them?

Were you able to successfully unzip the file? My mistake, I should have said: two .obj files saved using pickle.

Looks like the pr_res file might be corrupted. Can you please share again?

Still the same issue.

What version of pickle are you using to save?

Sorry still the same issue

Perhaps you can save the information in there as text:
pr_res.pseudotime.to_csv('pt.csv')
pr_res.branch_probs.to_csv('branch_probs.csv')
pr_res.entropy.to_csv('entropy.csv')
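For whoever receives the CSV files, a short sketch of reading them back with pandas. The toy pseudotime and branch_probs objects are made-up stand-ins, and a temporary directory is used only so the example is self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for pr_res.pseudotime and pr_res.branch_probs
cells = ["cell1", "cell2", "cell3"]
pseudotime = pd.Series([0.0, 0.5, 1.0], index=cells)
branch_probs = pd.DataFrame(
    {"SRC": [0.6, 0.7, 0.9], "Peri": [0.4, 0.3, 0.1]}, index=cells
)

with tempfile.TemporaryDirectory() as tmp:
    pt_path = os.path.join(tmp, "pt.csv")
    bp_path = os.path.join(tmp, "branch_probs.csv")
    pseudotime.to_csv(pt_path)
    branch_probs.to_csv(bp_path)

    # index_col=0 restores the cell names as the index;
    # the single remaining column of pt.csv is turned back into a Series
    pt_loaded = pd.read_csv(pt_path, index_col=0).iloc[:, 0]
    bp_loaded = pd.read_csv(bp_path, index_col=0)

print(pt_loaded)
print(bp_loaded)
```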

Can you please email them to manu.n.setty@gmail.com? The above links are causing OneDrive to try to convert the files, which never finishes.

This looks like an issue with a subset of the branches. If you use the Peri or SRC lineages, the trends will be computed without issues.

For the Failed lineage, the maximum probability of any given cell is 0.4. When assigning cells to branches, our code looks for cells with a probability greater than 0.7:

br_cells = pr_res.branch_probs.index[pr_res.branch_probs.loc[:, branch] > 0.7]

You will need to tweak this line to get trends for this lineage.
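To see which lineages are affected before editing the package, a sketch along these lines may help. It assumes branch_probs is a cells by lineages DataFrame like pr_res.branch_probs; the helper name lineage_support and the toy probabilities are invented for illustration.

```python
import pandas as pd


def lineage_support(branch_probs, thresholds=(0.7, 0.4, 0.1)):
    """Per lineage: maximum probability and number of cells strictly above each threshold."""
    summary = pd.DataFrame({"max_prob": branch_probs.max(axis=0)})
    for t in thresholds:
        summary[f"n_cells_gt_{t}"] = (branch_probs > t).sum(axis=0)
    return summary


# Made-up probabilities (each row sums to 1), not data from this thread
toy_probs = pd.DataFrame(
    {
        "SRC": [0.5, 0.8, 0.9, 0.2],
        "Peri": [0.2, 0.1, 0.05, 0.75],
        "Failed": [0.3, 0.1, 0.05, 0.05],
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)
summary = lineage_support(toy_probs)
print(summary)
```

A lineage whose count at the threshold in use is zero has no cells assigned to it, which is the situation described for Failed.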

So you are correct: it works if I restrict the lineages to SRC and Peri, but it still does not work for Failed even if the threshold is set to > 0.1. Does that make sense?

Same result if I modify the threshold to 0.0. Would that be typical in a case like this? Also what does it say about this particular lineage if it is not calling high probability cells?

It is puzzling that lowering the threshold did not work - are you seeing the error at the same code location?

Low probability on a lineage indicates that it might not be a true terminal state.

There is a possibility that it is not a true terminal state but rather a failed trajectory. There are, however, interesting differentially expressed genes in that region. Is it still possible to plot those on a trajectory? I suppose lowering the threshold should allow this, if it works?

This is the error, seems like the same outcome at least (of NA or undefined values being generated):

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
r = call_item()
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 620, in __call__
return self.func(*args, **kwargs)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py", line 289, in __call__
for func, args, kwargs in self.items]
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py", line 289, in <listcomp>
for func, args, kwargs in self.items]
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/palantir/presults.py", line 162, in gam_fit_predict
y_pred = gam.predict(pred_x)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/pygam.py", line 434, in predict
return self.predict_mu(X)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/pygam.py", line 414, in predict_mu
features=self.feature, verbose=self.verbose)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/utils.py", line 273, in check_X
name='X data', verbose=verbose)
File "/home/w/anaconda3/envs/env_palantir/lib/python3.7/site-packages/pygam/utils.py", line 171, in check_array
raise ValueError('{} must not contain Inf nor NaN'.format(name))
ValueError: X data must not contain Inf nor NaN
"""

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
/tmp/ipykernel_2663296/3306120243.py in <module>
3 imp_df = pd.DataFrame(ad[:, genes].layers['MAGIC_imputed_data'],
4 index=ad.obs_names, columns=genes)
----> 5 gene_trends = palantir.presults.compute_gene_trends( pr_res, imp_df.loc[:, genes])#, lineages = ['SRC'], n_jobs = 1)
6 #gene_trends = palantir.presults.compute_gene_trends( pr_res, imp_df.loc[:, genes], n_jobs = 1)

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/palantir/presults.py in compute_gene_trends(pr_res, gene_exprs, lineages, n_splines, spline_order, n_jobs)
120 spline_order
121 )
--> 122 for gene in gene_exprs.columns
123 )
124

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1096
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())

~/anaconda3/envs/env_palantir/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
565 AsyncResults.get from multiprocessing."""
566 try:
--> 567 return future.result(timeout=timeout)
568 except CfTimeoutError as e:
569 raise TimeoutError from e

~/anaconda3/envs/env_palantir/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()

~/anaconda3/envs/env_palantir/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

ValueError: X data must not contain Inf nor NaN

Can you please check that br_cells contains a non-zero number of cells here, after you reduce the threshold and reinstall the package:

br_cells = pr_res.branch_probs.index[pr_res.branch_probs.loc[:, branch] > 0.7]
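One plausible way an empty br_cells could surface as this exact error is sketched below. It assumes, without confirmation from the code shown in this thread, that the prediction grid is built up to the maximum pseudotime of the selected cells; the grid construction and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for pr_res.pseudotime and pr_res.branch_probs
cells = ["cell1", "cell2", "cell3"]
pseudotime = pd.Series([0.1, 0.5, 0.9], index=cells)
branch_probs = pd.DataFrame({"Failed": [0.4, 0.2, 0.1]}, index=cells)

threshold = 0.7
br_cells = branch_probs.index[branch_probs.loc[:, "Failed"] > threshold]
print(len(br_cells))  # number of cells assigned to the branch

# The maximum of an empty selection is NaN ...
max_pt = pseudotime.loc[br_cells].max()
# ... and a grid built up to NaN is NaN everywhere (hypothetical grid)
grid = np.linspace(0, max_pt, 5)
print(np.isfinite(grid).all())
```

If the package behaves this way, a finite-value check on the prediction input would fail even though the expression data itself is clean.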

Hello again, sorry for the delay. I was able to get back to it and I have run the script again I believe with the modified threshold. Can we assume that the threshold is set to zero based on running the command in my attached image and seeing the 0.0 in that line? Thanks again.
[Two screenshots attached]

This should have worked. I can investigate this on my end - can you please reshare the files? Sorry, I cannot seem to locate them.

Certainly, my pleasure, thanks for taking a look.
W

FYI when computing gene trends, "SRC13" trajectory works alone, "Peri" works alone but they do not work together and "Failed" will not work alone or with any combination.

The links don't seem to be working - can you please email me again at manu.n.setty <AT> gmail.com

Hello again

The changes we discussed did work for me, and the gene trend computation was successful for all the lineages and genes you shared with me.

Yes - please pull the latest repo, make the changes and then reinstall. Please try starting your notebook server only after re-installing the package.
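One way to confirm which source file Python actually imports, and whether an edited line is present in it, is sketched below. The helper name find_in_module_source is made up, and it is demonstrated on the standard-library json module so that it runs anywhere; for this thread the arguments would presumably be "palantir.presults" and "br_cells".

```python
import importlib
import inspect


def find_in_module_source(module_name, needle):
    """Import a module and return its source path plus (line number, text) for matching lines."""
    module = importlib.import_module(module_name)
    path = inspect.getsourcefile(module)
    lines, _ = inspect.getsourcelines(module)
    hits = [(i + 1, line.strip()) for i, line in enumerate(lines) if needle in line]
    return path, hits


# Stand-in arguments: a standard-library module and a string known to occur in it
path, hits = find_in_module_source("json", "def dumps")
print(path)
for line_number, text in hits:
    print(line_number, text)
```

This reads the file on disk, so a kernel started before the reinstall may still be running the old code from memory, which is why restarting the notebook server matters.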

Some progress? After reinstalling and opening a new notebook, I got a second trajectory on the trend plots, but it ignored the third one.

['SRC13', 'Peri', 'Failed']
[Two screenshots attached]

No, good point, it does not. It looks like there are entries only for the two lineages that are plotted, with trends and std each (49 genes x 500 waypoints).
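A small sketch of how that inspection could be done. It assumes gene_trends is a dict keyed by lineage, with each entry holding "trends" and "std" DataFrames; the helper summarize_trends and the toy object are made up, with sizes much smaller than the real 49 x 500.

```python
import numpy as np
import pandas as pd


def summarize_trends(gene_trends, expected_lineages):
    """Return the table shapes per lineage and the expected lineages that are absent."""
    present = {
        lineage: {name: table.shape for name, table in tables.items()}
        for lineage, tables in gene_trends.items()
    }
    missing = [name for name in expected_lineages if name not in gene_trends]
    return present, missing


# Made-up stand-in for the gene_trends result (2 genes x 5 waypoints)
genes = ["g1", "g2"]
waypoints = np.linspace(0, 1, 5)


def toy_tables():
    return {
        "trends": pd.DataFrame(0.0, index=genes, columns=waypoints),
        "std": pd.DataFrame(0.0, index=genes, columns=waypoints),
    }


toy_trends = {"SRC13": toy_tables(), "Peri": toy_tables()}
present, missing = summarize_trends(toy_trends, ["SRC13", "Peri", "Failed"])
print(present)
print(missing)
```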

I sent some information to your gmail. Thanks again.
W

Thank you. With the clues you provided I was able to work out where the problem was coming from. I tested a number of terminal cell combinations and thresholds and determined that, with certain combinations of terminal cells, the probabilities were indeed dropping well below the threshold. We were also able to reproduce this error in another dataset that had previously resolved four trajectories, simply by setting an additional terminal cell too close to one of the existing terminal cells, thereby reducing the confidence of the existing trajectory to below the threshold, I suppose. In any case, thank you for your efforts.
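A toy illustration of that effect, under the simplifying assumption that branch probabilities behave like absorption probabilities of a random walk. This is not Palantir's actual model: the graph is a seven-state path, the walk is symmetric, and the function absorption_probs is invented for this sketch.

```python
import numpy as np


def absorption_probs(n_states, terminals):
    """Absorption probabilities for a symmetric random walk on the path 0..n_states-1.

    Row i, column j is the probability that a walk started in state i
    is absorbed at terminals[j].
    """
    terminals = list(terminals)
    transient = [s for s in range(n_states) if s not in terminals]
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        neighbours = [n for n in (s - 1, s + 1) if 0 <= n < n_states]
        P[s, neighbours] = 1.0 / len(neighbours)
    Q = P[np.ix_(transient, transient)]  # transient -> transient
    R = P[np.ix_(transient, terminals)]  # transient -> terminal
    B = np.linalg.solve(np.eye(len(transient)) - Q, R)
    probs = np.zeros((n_states, len(terminals)))
    probs[transient] = B
    probs[terminals, np.arange(len(terminals))] = 1.0
    return probs


two = absorption_probs(7, [0, 6])       # terminals at both ends of the path
three = absorption_probs(7, [0, 5, 6])  # extra terminal right next to state 6

# Highest probability of ending at state 6 among the non-terminal states
best_before = two[1:6, 1].max()
best_after = three[1:5, 2].max()
print(best_before, best_after)
```

In this toy setting the extra terminal at state 5 intercepts every walk heading for state 6, so no non-terminal state keeps a probability above 0.7 for that terminal.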