Some suggestions about OV
Opened this issue · 8 comments
Hi,
Firstly, I'd like to commend the OV project for its contributions to scRNA analysis. I have a few suggestions that could potentially enhance its utility:
- Flexibility in ov.pp.preprocess: This function integrates several key processing steps. However, some steps, such as robust gene identification and gene filtering, are mandatory. It might be beneficial to offer more control here. For instance, adding a parameter such as robust_gene=True, threshold=0.05 could give users the option to toggle this feature. Similarly, the mandatory use of sc.pp.normalize_total(..., exclude_highly_expressed=True...) could be made optional with a control parameter.
- Expanding regress Functionality: Currently, the regress function seems limited to specific parameters like mito_perc and nUMIs. It would be advantageous to allow regression on other variables as per user requirements.
- Concerns with regress_and_scale: In the current implementation, I'm wondering if replacing adata = sc.pp.regress_out(adata, ['mito_perc', 'nUMIs']) with adata_mock = sc.pp.scale(adata_mock) at line 471 might be more appropriate. This change could potentially improve the function's performance or accuracy.
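To make the normalization point concrete, here is a minimal NumPy-only sketch of what such a toggle could look like. It is an illustration under assumptions, not OV's or scanpy's code: the function name normalize_total_sketch is hypothetical, and the logic follows scanpy's documented description of exclude_highly_expressed (a gene that exceeds max_fraction of the total counts in at least one cell is left out of the per-cell size factor).

```python
import numpy as np


def normalize_total_sketch(X, target_sum=1e4, exclude_highly_expressed=False,
                           max_fraction=0.05):
    """Hypothetical stand-in for a normalization step with an optional flag."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1)  # total counts per cell (rows are cells)
    if exclude_highly_expressed:
        # A gene counts as highly expressed if it exceeds max_fraction of
        # the total counts in at least one cell.
        highly = (X > max_fraction * totals[:, None]).any(axis=0)
        size = X[:, ~highly].sum(axis=1)
    else:
        size = totals
    size = np.where(size == 0, 1.0, size)  # avoid division by zero
    return X / size[:, None] * target_sum


# Toy matrix: 2 cells x 3 genes; gene 0 dominates the first cell.
counts = np.array([[90, 5, 5], [10, 45, 45]])
default = normalize_total_sketch(counts, target_sum=100)
excluded = normalize_total_sketch(counts, target_sum=100,
                                  exclude_highly_expressed=True,
                                  max_fraction=0.5)
```

With the flag off, every cell sums to target_sum. With it on, only the non-excluded genes sum to target_sum, so the two settings can give noticeably different values for the same data, which is why exposing the choice to the user seems worthwhile.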
I believe these enhancements could make OV even more flexible and user-friendly for diverse scRNA analysis scenarios. Looking forward to your thoughts on this.
Moreover, why not integrate Celltypist and BBKNN into OV? I think some of the excellent batch removal and annotation strategies you previously shared on WeChat Official Accounts could also be integrated into the OV workflow.
Thank you for your suggestions. Here are our responses:
- More parameters are indeed optional in preprocess. It is worth noting that we automatically calculate the highly variable genes, but instead of filtering automatically, we store the raw values in adata.layers['counts'], so you can use adata.layers['counts'] to get the raw values if you need them.
- No regression is implemented in omicverse. This is because regression interpolates over the dropout phenomenon, and it is not clear that this interpolation is really correct. Much of the literature expresses concern about this.
- Instead of modifying the original data, we store the scaled values in adata.layers['scaled'], which is used only for the PCA calculation.
- Celltypist is a good automatic annotation algorithm, but since its installation dependencies may conflict with the omicverse installation, we suggest that users run it on their own for now. We may add it in a future release.
- BBKNN is an outdated batch-correction algorithm, and its performance is too poor to be acceptable. We have compared the results of different batch-correction algorithms in detail in the tutorial on batch correction, where the best algorithm without a GPU is harmony and the best algorithm with a GPU is scVI.
Thanks again for your suggestions.
Zehua
Thanks for your kind reply, but it seems that I didn't express what I meant exactly. I appreciate the opportunity to clarify my suggestions:
- Flexibility in ov.pp.preprocess.
  I noticed that in the preprocessing step (ov.pp.preprocess), certain parameters are set by default, such as in the snippet from lines 372-379 in _preprocess.py:

      sc.pp.normalize_total(
          adata,
          target_sum=target_sum,
          exclude_highly_expressed=True,
          max_fraction=0.2,
      )

  Here, exclude_highly_expressed=True is automatically applied. In scanpy, the default of exclude_highly_expressed is False (https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html#scanpy.pp.normalize_total). Setting it to True may be the better choice based on your experience, but I think it may be more appropriate to stay consistent with the classic tutorial. Thus, my suggestion is to allow users to adjust this setting, perhaps through an additional parameter in ov.pp.preprocess. This could provide more control over the preprocessing based on specific data characteristics.
- Expanding regress Functionality.
  In the regress function (lines 437-452 in _preprocess.py), the parameters mito_perc and nUMIs are fixed targets for regression. I propose enhancing this function's flexibility by allowing users to specify which parameters to regress. This flexibility could be crucial for analyses where other variables might be more relevant.
- Concerns with regress_and_scale.
  Regarding the regress_and_scale function, particularly at line 471 in _preprocess.py, I wonder if the code adata_mock = scale(adata_mock) could be modified to adata_mock = sc.pp.scale(adata_mock). As someone relatively new to Python, I'm not sure if this change would be more appropriate or efficient, and would appreciate your insight on this.
Looking forward to your thoughts on this.
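The regression request in the second point could be sketched as follows. This is an illustration under stated assumptions, not the code in _preprocess.py: regress_out_sketch and its covariates argument are hypothetical names, the cell_cycle covariate is invented for the example, and ordinary least squares with an intercept stands in for whatever model the library actually fits.

```python
import numpy as np


def regress_out_sketch(X, covariates):
    """Return residuals of each gene after a least-squares fit on covariates.

    X          : (n_cells, n_genes) expression matrix
    covariates : (n_cells,) or (n_cells, n_covariates), chosen by the user
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(covariates, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    # Design matrix: intercept column plus the user-selected covariates.
    design = np.column_stack([np.ones(X.shape[0]), C])
    beta, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ beta


rng = np.random.default_rng(0)
n_cells = 50
mito_perc = rng.uniform(0, 10, n_cells)
n_umis = rng.uniform(1e3, 1e4, n_cells)
cell_cycle = rng.normal(size=n_cells)  # hypothetical extra covariate
expr = rng.normal(size=(n_cells, 4)) + 0.3 * mito_perc[:, None]

# The caller decides which variables to regress, instead of a fixed pair.
resid = regress_out_sketch(expr, np.column_stack([mito_perc, n_umis, cell_cycle]))
```

The only design point being illustrated is that the list of covariates is an argument rather than a constant, so the same function serves analyses where other variables matter more than mito_perc and nUMIs.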
Thanks for your advice. We will add more parameters in the next version.
Zehua
Following up on this issue, I have one question. Do the authors have any advice on saving RAM when analyzing huge datasets with OV? I'm not sure what kind of statistics are computed in pp.preprocess, but it takes a lot of RAM. Can gc.collect be used to release the memory?
Additionally, there are some mistakes in the tutorials (https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_single_batch/).
Similar problems occur in the other batch correction tutorials.
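On the gc.collect question, a minimal standard-library sketch of the general Python behavior may help. It says nothing about OV's internals, and the class BigIntermediate is a hypothetical stand-in for a large temporary result. The point is that gc.collect() can only reclaim objects that nothing references any more, so memory held by an object that is still in use is not released by it.

```python
import gc
import weakref


class BigIntermediate:
    """Hypothetical stand-in for a large temporary result."""

    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


kept = BigIntermediate(1_000_000)  # still referenced after the cleanup
temp = BigIntermediate(1_000_000)  # will be dropped

# Weak references let us observe whether each object is still alive.
kept_ref = weakref.ref(kept)
temp_ref = weakref.ref(temp)

del temp      # drop the last reference to the temporary
gc.collect()  # additionally clears unreachable reference cycles, if any

temp_freed = temp_ref() is None      # the dropped object is gone
kept_alive = kept_ref() is not None  # the referenced object survives
```

In practice this suggests deleting intermediate objects that are no longer needed before calling gc.collect(), rather than expecting the call alone to shrink memory use.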
If you want to reduce RAM usage, you might consider setting the argument backed='r' when reading h5ad files using sc.read or ov.read.
Can backed='r' help save memory in pp.preprocess? As I remember, it only speeds up reading the files. Anyway, I will try it.