umccr/RNAsum

Handle new ensembl gene annotations

skanwal opened this issue · 3 comments

Currently, we expect Ensembl gene IDs to have a format like ENSG00000223972. This only works with older Ensembl database versions. With Ensembl v105, the gene IDs look like ENSG00000223972.5, i.e. they carry a version number after the decimal point.
This errors out when joining the sample count matrix with the reference data matrix, because the reference uses Ensembl gene IDs without the version suffix.

We should strip the version from Ensembl gene IDs when reading in the counts information for a sample.
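A minimal sketch of the stripping step, written in Python purely for illustration (it is not the RNAsum implementation). The function name `strip_ensembl_version` is hypothetical; the only assumption is that the version is a trailing `.<digits>` suffix, as in ENSG00000223972.5.

```python
import re

# Matches a trailing ".<digits>" version suffix, e.g. ".5" in "ENSG00000223972.5".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_ensembl_version(gene_id):
    """Return the gene ID without its trailing version suffix (hypothetical helper)."""
    return _VERSION_SUFFIX.sub("", gene_id)


# Versioned IDs lose the suffix; unversioned IDs pass through unchanged.
versioned = strip_ensembl_version("ENSG00000223972.5")
unversioned = strip_ensembl_version("ENSG00000223972")
```

Because the pattern is anchored at the end of the string, an ID that ends in "_PAR_Y" is left untouched by this helper and has to be handled separately.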

Additionally, we have PAR_Y entries in GENCODE v39. From GENCODE release 25 onwards, the PAR genes and transcripts have "_PAR_Y" appended to their IDs. We have decided to keep only the ones without PAR_Y:

The gene annotation is the same in both ensembl and gencode files. The only exception is that the genes which are common to the human chromosome X and Y PAR regions can be found twice in the GENCODE GTF, while they are shown only for chromosome X in the Ensembl file.
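The PAR_Y decision can be sketched in the same way. This is an illustrative Python snippet, not the RNAsum code: `drop_par_y` is a hypothetical helper, the counts are made-up values, and the version numbers in the example IDs are only placeholders. It assumes counts are keyed by gene ID and that PAR_Y entries are recognisable by the "_PAR_Y" suffix.

```python
def drop_par_y(counts):
    """Keep only entries whose gene ID does not end in "_PAR_Y" (hypothetical helper)."""
    return {
        gene_id: count
        for gene_id, count in counts.items()
        if not gene_id.endswith("_PAR_Y")
    }


# Made-up counts: the same PAR gene appears once for chrX and once as a PAR_Y copy.
example_counts = {
    "ENSG00000223972.5": 3,
    "ENSG00000182378.14": 120,
    "ENSG00000182378.14_PAR_Y": 0,
}
filtered = drop_par_y(example_counts)
```

Dropping the PAR_Y rows before stripping versions avoids ending up with two rows that share the same unversioned gene ID.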

To work with Dragen 4.2.4, we need to port the changes for this issue to the master branch.