Incorrect gene.annotation processing for sciPlex3

Question

Closed this issue 3 months ago · 1 comments

The gene annotation file is read with header=None, so its first line (which is a header) ends up as a data row in the gene annotation dataframe. The extra row then affects the dimensions of the whole dataset.

https://github.com/sanderlab/scPerturb/blob/fac49ee392f6873b50fad27550e82f6507158834/dataset_processing/SrivatsanTrapnell2020.py#L77C19-L78
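To make the effect concrete, here is a minimal sketch with pandas. The in-memory text is a hypothetical stand-in for the gene annotation file, and the space separator is an assumption; only the header line and the two Ensembl IDs come from the output shown in this issue.

```python
import io

import pandas as pd

# Hypothetical stand-in for the gene annotation file (assumed space-separated):
# one header line followed by two genes.
raw = (
    "id gene_short_name\n"
    "ENSG00000000003 TSPAN6\n"
    "ENSG00000000005 TNMD\n"
)

# header=None: pandas numbers the columns itself and keeps the header line as row 0.
wrong = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# header=0: the first line supplies the column names, leaving one row per gene.
right = pd.read_csv(io.StringIO(raw), sep=" ", header=0)

print(wrong.shape)  # (3, 2): one row too many
print(right.shape)  # (2, 2)
```

With header=None the table has one more row than there are genes, and its first entry is the literal string "id" rather than an Ensembl ID.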

Currently, sciPlex3's var looks like this:

srivatsan.var
Out[4]: 
                     ensembl_id   ncounts  ncells
gene_symbol                                      
nan          id gene_short_name   26582.0   23228
nan:1           ENSG00000000003      35.0      33
nan:2           ENSG00000000005  163109.0  116153

So the header line shows up as the first gene, and all the genes appear to be shifted by one row relative to their counts. This can drastically affect downstream tasks, since it is no longer clear which genes are expressed.
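A check along the following lines could catch this kind of shift before the data is saved. This is only a sketch: check_gene_annotation is a hypothetical helper, not part of the scPerturb scripts, and both the "id" column name and the "ENSG" prefix are assumptions that hold only for human Ensembl gene IDs.

```python
import pandas as pd


def check_gene_annotation(genes: pd.DataFrame, n_genes: int, id_col: str = "id") -> None:
    """Raise ValueError if the annotation table does not line up with the count matrix.

    Hypothetical helper: `id_col` and the "ENSG" prefix are assumptions.
    """
    # The annotation must have exactly one row per gene in the count matrix.
    if len(genes) != n_genes:
        raise ValueError(
            f"annotation has {len(genes)} rows, count matrix has {n_genes} genes"
        )
    # A header line read as data shows up as an entry that is not an Ensembl ID.
    bad = ~genes[id_col].astype(str).str.startswith("ENSG")
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} entries in '{id_col}' do not look like Ensembl gene IDs"
        )
```

Calling it with the annotation dataframe and the number of columns in the count matrix right after loading would fail as soon as the header line is read as a gene.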

Answer 1 · 2024-08-21T15:24:27.000Z

Fixed. The error has been corrected in the script, and the updated dataset will be included in v1.4 on Zenodo.