Compatibility issue with pandas in scFates 1.0.8 when using tl.test_fork
Opened this issue · 5 comments
Hi scFates team,
I am using scFates version 1.0.8 with pandas 1.5.3, the latest version that satisfies the <2.0 constraint for this release. However, I'm encountering the following error when running the test_fork() function:
TypeError: drop_duplicates() takes 1 positional argument but 2 were given.
It appears that drop_duplicates() no longer accepts the keep argument positionally, which makes it incompatible with the current scFates code. Here's the exact line in bifurcation_tools.py:
brcells = brcells.loc[brcells.index.drop_duplicates(False)]
I would suggest passing keep=False as a keyword argument, or otherwise adapting the call to be compatible with recent pandas versions. Alternatively, could you suggest a workaround?
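For reference, in recent pandas the keep argument of drop_duplicates() is keyword-only, so naming it restores compatibility. A minimal sketch on a toy index (not the scFates internals):

```python
import pandas as pd

idx = pd.Index(["a", "a", "b"])

# Old positional form fails on recent pandas:
#   idx.drop_duplicates(False)  # TypeError: takes 1 positional argument but 2 were given
# Keyword form works; keep=False drops every entry that appears more than once:
deduped = idx.drop_duplicates(keep=False)
print(list(deduped))  # ['b']
```

The same one-character-style change (`drop_duplicates(False)` → `drop_duplicates(keep=False)`) should work in bifurcation_tools.py.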
Thanks in advance for your help!
Best regards,
By the way, this is the code I tried to run:
scf.tl.test_fork(bdata,root_milestone="30",milestones=["28","70"],n_jobs=60,rescale=True)
Hi, thanks for bringing up this issue. I have pushed changes that allow pandas versions over 2.0, which should fix it.
However, I can't release it on PyPI yet; the account is locked out at the moment.
In the meantime, I would suggest installing scFates from GitHub with the tag v1.0.9.
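Assuming the repository lives at the usual GitHub location (LouisFaure/scFates — worth verifying against the project page), installing the tagged release directly from GitHub might look like:

```shell
# Install the v1.0.9 tag straight from GitHub into the current environment
# (repo URL assumed; adjust if the project lives elsewhere).
pip install "git+https://github.com/LouisFaure/scFates.git@v1.0.9"
```

This avoids waiting for the PyPI release; once the package is published there, a plain `pip install scfates --upgrade` should pick it up.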
It is a bit hard to figure out the issue; could you show the full error message?
How many cells do you have between milestones 106 and 62? You can check this by running scf.tl.subset_tree(adata, root_milestone="106", milestones=["62"]), which should subset the AnnData object to that path.
Thanks for your prompt response. The number of cells is shown below, followed by the code and the full error.
scf.tl.subset_tree(adata, root_milestone ="106",milestones=["62"])
#subsetting tree
node 123 selected as a root --> added
.uns['graph']['root'] selected root.
.uns['graph']['pp_info'] for each PP, its distance vs root and segment assignment.
.uns['graph']['pp_seg'] segments network information.
projecting cells onto the principal graph
finished (0:00:02) --> added
.obs['edge'] assigned edge.
.obs['t'] pseudotime value.
.obs['seg'] segment of the tree assigned.
.obs['milestones'] milestone assigned.
.uns['pseudotime_list'] list of cell projection from all mappings.
finished (0:00:00) --> tree extracted
--> added
.obs['old_milestones'], previous milestones from initial tree
# Print the number of cells in the subset
num_cells = adata.shape[0]
print("Number of cells between milestones 106 and 62:", num_cells)
Number of cells between milestones 106 and 62: 4379
scf.tl.linearity_deviation(adata,
start_milestone="106",
end_milestone="62",
percentiles=[20, 80],
n_jobs=1,n_map=1, plot=True,basis="X_umap")
ValueError Traceback (most recent call last)
Cell In[297], line 1
----> 1 scf.tl.linearity_deviation(adata,
2 start_milestone="106",
3 end_milestone="62",
4 percentiles=[20, 80],
5 n_jobs=1,n_map=1, plot=True,basis="X_umap")
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/scFates/tools/linearity_deviation.py:151, in linearity_deviation(adata, start_milestone, end_milestone, percentiles, n_jobs, n_map, plot, basis, copy)
148 n_jobs_map = n_jobs
149 n_jobs = 1
--> 151 rss = ProgressParallel(
152 total=n_map,
153 n_jobs=n_jobs_map,
154 use_tqdm=n_map > 1,
155 desc=" multi mapping",
156 file=sys.stdout,
157 )(delayed(lindev_map)(m) for m in range(n_map))
159 if plot:
160 import scanpy as sc
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/scFates/tools/utils.py:47, in ProgressParallel.__call__(self, *args, **kwargs)
40 def __call__(self, *args, **kwargs):
41 with tqdm(
42 disable=not self._use_tqdm,
43 total=self._total,
44 desc=self._desc,
45 file=self._file,
46 ) as self._pbar:
---> 47 return Parallel.__call__(self, *args, **kwargs)
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
1916 output = self._get_sequential_output(iterable)
1917 next(output)
-> 1918 return output if self.return_generator else list(output)
1920 # Let's create an ID that uniquely identifies the current call. If the
1921 # call is interrupted early and that the same instance is immediately
1922 # re-used, this id will be used to prevent workers that were
1923 # concurrently finalizing a task from the previous call to run the
1924 # callback.
1925 with self._lock:
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
1845 self.n_dispatched_batches += 1
1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
1848 self.n_completed_tasks += 1
1849 self.print_progress()
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/scFates/tools/linearity_deviation.py:135, in linearity_deviation.<locals>.lindev_map(m)
132 results = model.fit()
133 return results.resid_pearson
--> 135 rs = ProgressParallel(
136 total=len(X),
137 n_jobs=n_jobs,
138 use_tqdm=n_map == 1,
139 desc=" cells on the bridge",
140 file=sys.stdout,
141 )(delayed(get_resid)(x) for x in X)
143 rs = np.vstack(rs).T
144 return rs.mean(axis=1) / X_all.std(axis=0)
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/scFates/tools/utils.py:47, in ProgressParallel.__call__(self, *args, **kwargs)
40 def __call__(self, *args, **kwargs):
41 with tqdm(
42 disable=not self._use_tqdm,
43 total=self._total,
44 desc=self._desc,
45 file=self._file,
46 ) as self._pbar:
---> 47 return Parallel.__call__(self, *args, **kwargs)
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
1916 output = self._get_sequential_output(iterable)
1917 next(output)
-> 1918 return output if self.return_generator else list(output)
1920 # Let's create an ID that uniquely identifies the current call. If the
1921 # call is interrupted early and that the same instance is immediately
1922 # re-used, this id will be used to prevent workers that were
1923 # concurrently finalizing a task from the previous call to run the
1924 # callback.
1925 with self._lock:
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
1845 self.n_dispatched_batches += 1
1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
1848 self.n_completed_tasks += 1
1849 self.print_progress()
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/scFates/tools/linearity_deviation.py:129, in linearity_deviation.<locals>.lindev_map.<locals>.get_resid(x)
128 def get_resid(x):
--> 129 model = smf.ols(
130 formula="x ~ A + B - 1", data=pd.DataFrame({"x": x, "A": A, "B": B})
131 )
132 results = model.fit()
133 return results.resid_pearson
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/model.py:229, in Model.from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
223 design_info = design_info.subset(cols)
225 kwargs.update({'missing_idx': missing_idx,
226 'missing': missing,
227 'formula': formula, # attach formula for unpckling
228 'design_info': design_info})
--> 229 mod = cls(endog, exog, *args, **kwargs)
230 mod.formula = formula
231 # since we got a dataframe, attach the original
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:924, in OLS.__init__(self, endog, exog, missing, hasconst, **kwargs)
921 msg = ("Weights are not supported in OLS and will be ignored"
922 "An exception will be raised in the next version.")
923 warnings.warn(msg, ValueWarning)
--> 924 super().__init__(endog, exog, missing=missing,
925 hasconst=hasconst, **kwargs)
926 if "weights" in self._init_keys:
927 self._init_keys.remove("weights")
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:749, in WLS.__init__(self, endog, exog, weights, missing, hasconst, **kwargs)
747 else:
748 weights = weights.squeeze()
--> 749 super().__init__(endog, exog, missing=missing,
750 weights=weights, hasconst=hasconst, **kwargs)
751 nobs = self.exog.shape[0]
752 weights = self.weights
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/regression/linear_model.py:203, in RegressionModel.__init__(self, endog, exog, **kwargs)
202 def __init__(self, endog, exog, **kwargs):
--> 203 super().__init__(endog, exog, **kwargs)
204 self.pinv_wexog: Float64Array | None = None
205 self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/model.py:270, in LikelihoodModel.__init__(self, endog, exog, **kwargs)
269 def __init__(self, endog, exog=None, **kwargs):
--> 270 super().__init__(endog, exog, **kwargs)
271 self.initialize()
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/model.py:95, in Model.__init__(self, endog, exog, **kwargs)
93 missing = kwargs.pop('missing', 'none')
94 hasconst = kwargs.pop('hasconst', None)
---> 95 self.data = self._handle_data(endog, exog, missing, hasconst,
96 **kwargs)
97 self.k_constant = self.data.k_constant
98 self.exog = self.data.exog
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/model.py:135, in Model._handle_data(self, endog, exog, missing, hasconst, **kwargs)
134 def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
--> 135 data = handle_data(endog, exog, missing, hasconst, **kwargs)
136 # kwargs arrays could have changed, easier to just attach here
137 for key in kwargs:
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/data.py:675, in handle_data(endog, exog, missing, hasconst, **kwargs)
672 exog = np.asarray(exog)
674 klass = handle_data_class_factory(endog, exog)
--> 675 return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
676 **kwargs)
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/data.py:88, in ModelData.init(self, endog, exog, missing, hasconst, **kwargs)
86 self.const_idx = None
87 self.k_constant = 0
---> 88 self._handle_constant(hasconst)
89 self._check_integrity()
90 self._cache = {}
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/statsmodels/base/data.py:132, in ModelData._handle_constant(self, hasconst)
129 else:
130 # detect where the constant is
131 check_implicit = False
--> 132 exog_max = np.max(self.exog, axis=0)
133 if not np.isfinite(exog_max).all():
134 raise MissingDataError('exog contains inf or nans')
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/numpy/core/fromnumeric.py:2810, in max(a, axis, out, keepdims, initial, where)
2692 @array_function_dispatch(_max_dispatcher)
2693 @set_module('numpy')
2694 def max(a, axis=None, out=None, keepdims=np._NoValue, initial=np._NoValue,
2695 where=np._NoValue):
2696 """
2697 Return the maximum of an array or maximum along an axis.
2698
(...)
2808 5
2809 """
-> 2810 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
2811 keepdims=keepdims, initial=initial, where=where)
File ~/anaconda3/envs/secondenv/lib/python3.9/site-packages/numpy/core/fromnumeric.py:88, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
85 else:
86 return reduction(axis=axis, out=out, **passkwargs)
---> 88 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
From this subsetted dataset, can you run:
t_perc = [adata.obs.t.quantile(p / 100) for p in [20,80]]
print((adata.obs.t < t_perc[0]).sum())
print((adata.obs.t > t_perc[1]).sum())
print(((adata.obs.t > t_perc[0]) & (adata.obs.t < t_perc[1])).sum())
It could be that there are no cells in one of the three groups. That would be unlikely, but the zero-size-array error suggests exactly that.
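The diagnostic above can be reproduced on toy pseudotime values to see what a healthy three-way split looks like (the values here are made up for illustration; in the real case they come from adata.obs.t):

```python
import pandas as pd

# Toy pseudotime values standing in for adata.obs.t (hypothetical data).
t = pd.Series([0.0, 0.1, 0.2, 0.5, 0.6, 0.9, 1.0])

# Same split as the diagnostic snippet: 20th/80th percentile cutoffs.
t_perc = [t.quantile(p / 100) for p in [20, 80]]

early  = (t < t_perc[0]).sum()                       # cells before the start window
late   = (t > t_perc[1]).sum()                       # cells past the end window
bridge = ((t > t_perc[0]) & (t < t_perc[1])).sum()   # cells on the bridge

print(early, bridge, late)  # 2 3 2
```

If any of the three counts is 0 on the real data, the regression inside linearity_deviation receives an empty array, which would explain the "zero-size array to reduction operation" ValueError.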