Unexpected Number of Unique Genes in Report

Question

Unexpected Number of Unique Genes in Report

OEveSalar opened this issue a year ago · 10 comments

Hi, I've run SQANTI3_QC on some long-read nanopore RNA-seq data and I'm seeing an unexpectedly high number of "Unique Genes" in the SQANTI3 report. Hoping you can help identify what I'm doing wrong!

My processing pipeline up until SQANTI is as follows:

Quality score filter reads
Pool samples
Mapped to Atlantic salmon genome (minimap2) retaining only primary alignments
Collapsed reads and merged samples using TAMA
Filtered collapsed models based on minimum read support parameters
Convert bed to gtf using TAMA
Run SQANTI3 with minimum inputs

When I manually count the number of transcripts and genes in the resultant bed and gtf files, I get 266,222 isoforms and 35,480 genes. The scripts I used to count these are below:

grep -w "transcript" ./filtered_transcriptome.bed | wc -l
grep -w "gene" ./filtered_transcriptome.bed | wc -l

However, when running SQANTI3_QC, the report says I have 266,222 Unique Isoforms (great), but 43,695 Unique Genes (not so great). Is there any reason this could be happening? Have I done something wrong leading up to using SQANTI3? The scripts I used to run SQANTI3 are below:

export PYTHONPATH=$PYTHONPATH:///cDNA_Cupcake/sequence/
python sqanti3_qc.py -o Transcriptome_wobble10 -d ./SQANTI/ --report both ./Final_Transcriptome/transcriptome_wobble10.gtf ./Ssal_v3.1_annotation.sorted.gtf ./Ssal_v3.1_genome.fa

The genome and annotation files correspond, have the same gene/transript IDs, and the genome.fasta is the one I mapped against initially. I don't know where the discrepancy is coming in. Hoping you can help.

Many thanks,
Oliver

Answer 1 · 2024-02-08T15:44:32.000Z

Hi @OEveSalar,

Thanks for using SQANTI! The number of unique genes is counted from the output classification.txt file, in the column "associated_gene" (column 7). The number of genes is different because in the SQANTI report, we are also counting the isoforms classified as "Novel genes". If you check column 7 in your classification.txt file, I think you will have 8215 isoforms classified as novel genes with names like: novelGene_Zdhhc20_AS, novelGene_Immt_AS, and similar names. These come from intergenic regions, antisense transcripts etc.

Let me know if this is solving this difference in your data :)

Best,
Carolina.

Answer 2 · 2024-02-22T08:43:34.000Z

Hello, @carolinamonzo
I also encountered the same problem，The number of genes in the _corrected.gtf and _corrected.gtf.cds.gff in the output of SQANTI3 QC is inconsistent. I still don’t quite understand why its number is higher.
I hope to get your reply.
Best

Answer 3 · 2024-02-23T08:55:05.000Z

Hi,

Only those transcripts with predicted as coding will appear in the cds.gff file, so that is why you have a smaller number of isoforms compared to the corrected.gtf that includes all the isoforms.

Answer 4 · 2024-02-23T08:59:14.000Z

You can find more information about the difference between those two files in the SQANTI3 documentation. https://github.com/ConesaLab/SQANTI3/wiki/Understanding-the-output-of-SQANTI3-QC

Answer 5 · 2024-02-24T11:47:04.000Z

Hi, @carolinamonzo

Here, there are 71,712 annotated genes, but my reference genome annotation file only has a total of 71,297 genes. How do I understand this?

Answer 6 · 2024-02-24T11:52:27.000Z

Hi, @alexpan00
In fact, cds.gff has more genes than corrected.gtf, Why?

Answer 7 · 2024-02-25T07:17:47.000Z

Sorry to bother you again, @carolinamonzo
awk '$3 == "transcript"' Gb_Inclufib_gene_corrected.gtf | awk '{print $12}' | sed 's/"//g' | sed 's/;//g' | sort -u | wc -l
76010
awk '$3 == "transcript"' Gb_Inclufib_gene_corrected.gtf.cds.gff | awk '{print $12}' | sed 's/"//g' | sed 's/;//g' | sort -u | wc -l
83854
Why does Gb_Inclufib_gene_corrected.gtf.cds.gff have more genes than this one?
Aand I treat isoform as Antisense and Intergenic as new genes to eliminate, but the number of differences between the two is still inconsistent.

Answer 8 · 2024-02-25T09:51:14.000Z

Hi @Tang-pro ,

The fact that you see more annotated genes than what you have in your reference could be because SQANTI3 associates isoforms classified as fusion with all of the genes that are overlapped. If the isoform overlaps g1 and g2, the associated gene for the isoform would be g1_g2. This results in new associated genes that would be considered annotated.

Regarding the number of genes in the cds. gff and corrected.gtf I was wrong initially. The cds.gff is an updated version of the corrected.gtf. The gene_id of the corrected.gtf gets updated in the cds.gff to match the gene_id in the SQANTI3 classification file. If you have already controlled that different intergenic and antisense isoforms belonging to the same gene in the original gtf are associated with different novel genes in the cds.gff, the differences could come from other isoforms that belong to the same gene in the original gtf and get assigned to different genes in cds.gff. This could happen if, for example, one of the isoforms is classified as genic and the other as genic intron.

I hope it is more clear now,
Alejandro.

Answer 9 · 2024-02-26T01:13:57.000Z

Hi, @alexpan00
First, associates isoforms classified as genic also show the ID of Gbar_A01G016840_Gbar_A01G016830, where two genes are fused together. This is why in addition to fusion, genic also contains this kind of ID.
awk '$6 == "genic"' Gb_Inclufib_gene_classification.txt | cut -f7 | grep "Gbar_.*_Gbar_.*"

Second, isoform structural_category, which are newly discovered isoforms annotated to known sites, and which are newly discovered isoforms annotated to new sites?

Answer 10 · 2024-02-26T08:20:59.000Z

Yes, indeed, genic isoforms can also overlap multiple genes.

In SQANTI3, those isoforms in the categories intergenic, genic_intron, or antisense will be labeled novel genes.

Alejandro