st.map_new_data use subset of var_genes?
ccshao opened this issue · 8 comments
In the mapping step, some of the var_genes in adata_ref are not found in adata_new.
Would it be possible to add an option to set the var_genes to be used? Thanks!
Yes, you can try something like:
adata.obsm['var_genes'] = adata[:,your_gene_list].X.copy()
I got some additional errors after changing var_genes in uns and obsm. Might there be other values that need to be updated?
#- wt and ko are two AnnData objects; I want to map ko onto wt
#- Fgf17 from the wt var_genes is not found in ko
vgene = adata_wt.uns['var_genes'].intersection(adata_ko.var.index)
adata_wt.uns['var_genes'] = vgene
adata_wt.obsm['var_genes'] = adata_wt[:, vgene].X.copy()
adata_wt.obsm['var_genes'].shape
# (487, 749)
len(adata_wt.uns['var_genes'])
# 749
adata_wt.obsm['var_genes']
#- however, the error below still refers to 750 features
adata_combined = st.map_new_data(adata_wt, adata_ko)
Top variable genes are being used for mapping ...
method 'umap' is being used for mapping ...
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/umap/umap_.py in transform(self, X)
2055 dmat = pairwise_distances(
-> 2056 X, self._raw_data, metric=_m, **self._metric_kwds
2057 )
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, force_all_finite, **kwds)
1778
-> 1779 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1780
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1359 if effective_n_jobs(n_jobs) == 1:
-> 1360 return func(X, Y, **kwds)
1361
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _pairwise_callable(X, Y, metric, force_all_finite, **kwds)
1379 """
-> 1380 X, Y = check_pairwise_arrays(X, Y, force_all_finite=force_all_finite)
1381
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
160 "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 161 X.shape[1], Y.shape[1]))
162
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 749 while Y.shape[1] == 750
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-5-1d6046fd4e78> in <module>
----> 1 adata_combined = st.map_new_data(adata_wt, adata_ko)
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/stream/core.py in map_new_data(adata_ref, adata_new, use_radius)
4101 if('trans_umap' in adata_ref.uns_keys()):
4102 trans = adata_ref.uns['trans_umap']
-> 4103 adata_new.obsm['X_umap_mapping'] = trans.transform(input_data)
4104 adata_new.obsm['X_dr'] = adata_new.obsm['X_umap_mapping'].copy()
4105 else:
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/umap/umap_.py in transform(self, X)
2061 self._raw_data,
2062 metric=self._input_distance_func,
-> 2063 kwds=self._metric_kwds,
2064 )
2065 indices = np.argpartition(dmat, self._n_neighbors)[:, : self._n_neighbors]
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/umap/distances.py in pairwise_special_metric(X, Y, metric, kwds)
1258 return metric(_X, _Y, *kwd_vals)
1259
-> 1260 return pairwise_distances(X, Y, metric=_partial_metric)
1261 else:
1262 special_metric_func = named_distances[metric]
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, force_all_finite, **kwds)
1777 func = partial(distance.cdist, metric=metric, **kwds)
1778
-> 1779 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1780
1781
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1358
1359 if effective_n_jobs(n_jobs) == 1:
-> 1360 return func(X, Y, **kwds)
1361
1362 # enforce a threading backend to prevent data communication overhead
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _pairwise_callable(X, Y, metric, force_all_finite, **kwds)
1378 """Handle the callable case for pairwise_{distances,kernels}
1379 """
-> 1380 X, Y = check_pairwise_arrays(X, Y, force_all_finite=force_all_finite)
1381
1382 if X is Y:
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
159 raise ValueError("Incompatible dimension for X and Y matrices: "
160 "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 161 X.shape[1], Y.shape[1]))
162
163 return X, Y
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 749 while Y.shape[1] == 750
Yeah, you are right.
The current mapping procedure only works when the query dataset contains all the variable genes of the reference dataset, since that set of variable genes was used to define the original manifold. To be projected into the same manifold, the cells from the query data need to have the same set of genes.
In your case, Fgf17 was used to learn the WT (reference) space but does not exist in the KO (query) dataset, so theoretically the KO cells cannot be projected onto WT since they are missing one feature/gene.
The easiest solution here is to remove 'Fgf17' from the variable gene set in your WT data after the st.select_variable_genes() step. Then you can perform st.dimension_reduction() and the rest of the steps again. Removing one gene probably won't change the WT result much, but it will get the mapping procedure to work for KO.
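In code, the suggested fix might look roughly like this (a minimal sketch, assuming st.select_variable_genes() has already been run on adata_wt and that Fgf17 is the only variable gene missing from KO):
# drop the gene that is absent from KO from the WT variable gene set
vgenes = [g for g in adata_wt.uns['var_genes'] if g != 'Fgf17']
adata_wt.uns['var_genes'] = vgenes
adata_wt.obsm['var_genes'] = adata_wt[:, vgenes].X.copy()
# then rerun st.dimension_reduction(adata_wt) and the downstream STREAM steps so the
# reference manifold (and trans_umap) is relearned on the reduced gene set before mapping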
I hope the above is useful to you.
Here is another strange error for me.
As suggested, the wt and ko data now have the same genes. However, there is an error in st.map_new_data:
KeyError: "None of [Index(['n_counts', 'n_cells', 'pct_cells'], dtype='object')] are in the [columns]"
Both the wt and ko data have all three columns in var. Below is the code. Thanks!
1. Preprocess data
st.__version__
st.set_figure_params(dpi=300,
style='white',
figsize=[6.4, 5.8],
rc={'image.cmap': 'viridis'})
adata = st.read(file_name='../../01_data/wt_count.tsv.gz',
workdir='./stream_result')
st.add_metadata(adata, file_name='../../01_data/wt_meta.tsv')
st.cal_qc(adata, assay='rna')
st.filter_features(adata, min_n_cells=3)
adata_ko = st.read(file_name='../../01_data/ko_count.tsv.gz',
workdir='./mapping/')
st.add_metadata(adata_ko, file_name='../../01_data/ko_meta.tsv')
st.cal_qc(adata_ko, assay='rna')
st.filter_features(adata_ko, min_n_cells=3)
tgene = adata.var.index.intersection(adata_ko.var.index)
adata = adata[:, tgene].copy()
adata_ko = adata_ko[:, tgene].copy()
###Normalize gene expression based on library size
st.normalize(adata, method='lib_size')
###Logarithmize gene expression
st.log_transform(adata)
###Remove mitochondrial genes
st.remove_mt_genes(adata)
###Normalize gene expression based on library size
st.normalize(adata_ko, method='lib_size')
###Logarithmize gene expression
st.log_transform(adata_ko)
###Remove mitochondrial genes
st.remove_mt_genes(adata_ko)
st.write(adata, file_name="adata_basic.pkl")
st.write(adata_ko, file_name="adata_ko_basic.pkl")
#- !! then perform stream on adata and save as stream_result.pkl
2. Mapping
adata_ko = st.read(file_name='adata_ko_basic.pkl')
adata_wt = st.read(file_name='stream_result.pkl')
adata_ko
Out[16]:
AnnData object with n_obs × n_vars = 596 × 14481
obs: 'label', 'label_color', 'n_counts', 'n_genes', 'pct_genes', 'pct_mt'
var: 'n_counts', 'n_cells', 'pct_cells'
uns: 'workdir', 'label_color', 'assay'
adata_wt
Out[14]:
AnnData object with n_obs × n_vars = 487 × 14481
obs: 'label', 'label_color', 'n_counts', 'n_genes', 'pct_genes', 'pct_mt', 'kmeans', 'node', 'branch_id', 'branch_id_alias', 'branch_lam', 'branch_dist', 'S0_pseudotime', 'S1_pseudotime'
var: 'n_counts', 'n_cells', 'pct_cells'
uns: 'workdir', 'label_color', 'assay', 'var_genes', 'trans_umap', 'params', 'epg', 'flat_tree', 'seed_epg', 'seed_flat_tree', 'ori_epg', 'epg_obj', 'ori_epg_obj', 'stream_S1', 'scaled_marker_expr', 'transition_markers'
obsm: 'var_genes', 'X_umap', 'X_dr', 'X_spring', 'X_stream_S1'
#- !! I could not understand the error message.
adata_all = st.map_new_data(adata_wt, adata_ko)
Top variable genes are being used for mapping ...
method 'umap' is being used for mapping ...
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-17-d60b8ec659ab> in <module>
----> 1 adata_all = st.map_new_data(adata_wt, adata_ko)
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/stream/core.py in map_new_data(adata_ref, adata_new, use_radius)
4129 shared_var_key = [x for x in adata_new.var_keys() if x in adata_ref.var_keys()]
4130 adata_combined.obs = adata_combined.obs[shared_obs_key+['batch']]
-> 4131 adata_combined.var = adata_combined.var[shared_var_key]
4132 adata_combined.uns['workdir'] = adata_new.uns['workdir']
4133 for key in adata_ref.uns_keys():
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2910 if is_iterator(key):
2911 key = list(key)
-> 2912 indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
2913
2914 # take() does not accept boolean indexers
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1252 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1253
-> 1254 self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
1255 return keyarr, indexer
1256
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1296 if missing == len(indexer):
1297 axis_name = self.obj._get_axis_name(axis)
-> 1298 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1299
1300 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Index(['n_counts', 'n_cells', 'pct_cells'], dtype='object')] are in the [columns]"
Thanks for the feedback. This is indeed a bug in the function map_new_data().
The easiest solution is to rename the columns of adata_ko.var so that they are distinct from those of adata_wt.var before calling adata_all = st.map_new_data(adata_wt, adata_ko):
adata_ko.var.columns = 'ko_'+adata_ko.var.columns
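After renaming, the mapping call from before should then run without the KeyError:
adata_all = st.map_new_data(adata_wt, adata_ko)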
Sorry, I came back with another trivial issue. I successfully generated the combined adata but failed to plot batch in obs. It raised the error:
"TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one"
Plotting with the label variable works.
adata_all = st.map_new_data(adata_wt, adata_ko)
adata_all
AnnData object with n_obs × n_vars = 1083 × 14481
obs: 'label', 'label_color', 'n_counts', 'n_genes', 'pct_genes', 'pct_mt', 'node', 'branch_id', 'branch_id_alias', 'branch_lam', 'branch_dist', 'S0_pseudotime', 'S1_pseudotime', 'batch', 'batch2'
uns: 'workdir', 'epg', 'flat_tree', 'label_color', 'var_genes'
obsm: 'var_genes', 'X_dr'
st.plot_dimension_reduction(adata_all,
color=['batch'],
show_graph=True,
show_text=False,
fig_name='dimension_reduction.png')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-f0fa6b6ec554> in <module>
3 show_graph=True,
4 show_text=False,
----> 5 fig_name='dimension_reduction.png')
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/stream/core.py in plot_dimension_reduction(adata, n_components, comp1, comp2, comp3, color, key_graph, fig_size, fig_ncol, fig_legend_ncol, fig_legend_order, vmin, vmax, alpha, pad, w_pad, h_pad, show_text, show_graph, save_fig, fig_path, fig_name, plotly)
1477 # ax_i.get_legend().texts[0].set_text("")
1478 else:
-> 1479 vmin_i = df_plot[ann].min() if vmin is None else vmin
1480 vmax_i = df_plot[ann].max() if vmax is None else vmax
1481 sc_i = ax_i.scatter(df_plot_shuf['Dim'+str(comp1+1)], df_plot_shuf['Dim'+str(comp2+1)],
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
11467 return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
11468 return self._reduce(
> 11469 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
11470 )
11471
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4237 if isinstance(delegate, ExtensionArray):
4238 # dispatch to ExtensionArray interface
-> 4239 return delegate._reduce(name, skipna=skipna, **kwds)
4240
4241 else:
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in _reduce(self, name, skipna, **kwargs)
2081 if func is None:
2082 raise TypeError(f"Categorical cannot perform the operation {name}")
-> 2083 return func(skipna=skipna, **kwargs)
2084
2085 @deprecate_kwarg(old_arg_name="numeric_only", new_arg_name="skipna")
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in min(self, skipna, **kwargs)
2104 """
2105 nv.validate_min((), kwargs)
-> 2106 self.check_for_ordered("min")
2107
2108 if not len(self._codes):
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in check_for_ordered(self, op)
1444 if not self.ordered:
1445 raise TypeError(
-> 1446 f"Categorical is not ordered for operation {op}\n"
1447 "you can use .as_ordered() to change the "
1448 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
#- I tried ordering the category, but it did not help
adata_all.obs['batch2'] = adata_all.obs['batch'].cat.as_ordered()
st.plot_dimension_reduction(adata_all,
color=['batch2'],
show_graph=True,
show_text=False,
fig_name='dimension_reduction.png')
ValueError: 'c' argument must be a color, a sequence of colors, or a sequence of numbers, not WT_CATTCTAAGCGACAGT-1-ref
This error is seemingly caused by the latest version of pandas. You can either
1) change the data type of adata_all.obs['batch'] to string:
adata_all.obs['batch'] = adata_all.obs['batch'].astype(str)
or
2) downgrade your pandas to '1.0.5', and the error should be gone.
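For option 2, pinning the version with, for example, pip install pandas==1.0.5 in the env_stream environment (or the equivalent conda command) is one way to do the downgrade.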
We did notice some functions are broken due to the recent updates of several dependencies. We will try to fix these issues in the next release.
Thanks very much for all the tips! The first option works well.