sanderlab/scPerturb

Incorrect gene.annotation processing for sciPlex3

Reading the gene annotation file with header=None treats the first line (which is a header) as a data row in the gene annotation dataframe. The extra row then changes the dimensions of the whole dataset.

https://github.com/sanderlab/scPerturb/blob/fac49ee392f6873b50fad27550e82f6507158834/dataset_processing/SrivatsanTrapnell2020.py#L77C19-L78
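For illustration, the effect of header=None can be reproduced on a small in-memory file. This is only a sketch: the tab separator, the two-column layout, and the toy contents are assumptions, not the actual sciPlex3 gene.annotation file or the code in the linked script.

```python
import io

import pandas as pd

# Hypothetical stand-in for the gene annotation file:
# one header line followed by one line per gene (tab-separated is an assumption).
raw = (
    "id\tgene_short_name\n"
    "ENSG00000000003\tTSPAN6\n"
    "ENSG00000000005\tTNMD\n"
)

# header=None: pandas assigns integer column names and keeps the header
# line as the first data row, so the frame has one row too many.
wrong = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# header=0: the first line is consumed as column names and only genes remain.
right = pd.read_csv(io.StringIO(raw), sep="\t", header=0)

print(wrong.shape)  # (3, 2)
print(right.shape)  # (2, 2)
print(wrong.iloc[0].tolist())  # ['id', 'gene_short_name']
```

With header=0 the row count matches the number of genes, which is what the rest of the processing presumably relies on.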

Currently, sciPlex3's var looks like this:

srivatsan.var
Out[4]: 
                     ensembl_id   ncounts  ncells
gene_symbol                                      
nan          id gene_short_name   26582.0   23228
nan:1           ENSG00000000003      35.0      33
nan:2           ENSG00000000005  163109.0  116153

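So all the genes are shifted by one row: the header line occupies the first entry, and the gene_symbol index shows nan instead of gene names. This can drastically affect downstream tasks, since it is no longer clear which genes are expressed.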
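A quick sanity check along these lines could flag the problem in an already processed var table. This is a sketch under assumptions: find_header_leak is a hypothetical helper, not part of scPerturb, and it assumes every valid ID in the ensembl_id column starts with "ENSG". The toy frame only mirrors the ensembl_id values and index labels shown in the output above.

```python
import pandas as pd


def find_header_leak(var: pd.DataFrame, column: str = "ensembl_id",
                     prefix: str = "ENSG") -> pd.DataFrame:
    """Return the rows whose ID does not start with the expected prefix.

    Hypothetical helper: a leaked header line shows up here because its
    text is not a valid Ensembl gene ID.
    """
    ids = var[column].astype(str)
    return var[~ids.str.startswith(prefix)]


# Toy frame mirroring the first rows of the var table shown above.
var = pd.DataFrame(
    {"ensembl_id": ["id gene_short_name", "ENSG00000000003", "ENSG00000000005"]},
    index=pd.Index(["nan", "nan:1", "nan:2"], name="gene_symbol"),
)

bad = find_header_leak(var)
print(bad)  # only the row holding the leaked header text
```

An empty result would suggest that no header line ended up among the gene rows. It does not by itself prove the annotation is aligned with the count matrix.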

Fixed. The error has been corrected in the script, and the updated dataset will be included in v1.4 on Zenodo.