Question about gene_name/gene_id in gff3 file for tappAS
Hello,
SQANTI is really cool and useful!
I want to run tappAS with the created gff3 file. But I am a bit confused about the gff3 file.
- When I run:
# $collapsed_gff is the output from isoseq collapse
python /../SQANTI3/sqanti3_qc.py \
$collapsed_gff \
$reference_gtf \
$reference_genome \
--dir ${read_name}_out \
--isoAnnotLite \
--report both
The gff3 looks like this:
PB.7346.3 tappAS gene 1 1356 . - . ID=Serine threonine-protein phosphatase; Name=Serine threonine-protein phosphatase; Desc=Serine threonine-protein phosphatase; PosType=T
So gene names are used for ID, Name and Desc. This caused tappAS to fail because of the whitespace in the gene IDs.
- When I run sqanti3_qc.py without --isoAnnotLite and then run IsoAnnotLite.py afterwards, the gff3 looks like this (gene IDs only, in all fields):
PB.7346.1 tappAS gene 1 2057 . - . ID=PGSC0003DMG400027398; Name=PGSC0003DMG400027398; Desc=PGSC0003DMG400027398; PosType=T
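To tell which of the two cases a given file falls into, one option is to scan the gene rows for ID values that contain whitespace. The following is only a minimal sketch: it assumes the gff3 is tab-separated with the attributes in the ninth column (the rows above are shown with spaces), and the function name `ids_with_whitespace` is made up for illustration and is not part of SQANTI3 or tappAS.

```python
import re

def ids_with_whitespace(gff3_lines):
    """Return (transcript, ID) pairs for gene rows whose ID value contains whitespace."""
    flagged = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        # Assumption: 9 tab-separated columns, feature type in column 3.
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = re.search(r"(?:^|;\s*)ID=([^;]*)", cols[8])
        if match and re.search(r"\s", match.group(1).strip()):
            flagged.append((cols[0], match.group(1).strip()))
    return flagged

# The two gene rows quoted above, rebuilt with tabs between columns.
example = [
    "PB.7346.3\ttappAS\tgene\t1\t1356\t.\t-\t.\tID=Serine threonine-protein phosphatase; "
    "Name=Serine threonine-protein phosphatase; Desc=Serine threonine-protein phosphatase; PosType=T",
    "PB.7346.1\ttappAS\tgene\t1\t2057\t.\t-\t.\tID=PGSC0003DMG400027398; "
    "Name=PGSC0003DMG400027398; Desc=PGSC0003DMG400027398; PosType=T",
]
print(ids_with_whitespace(example))
```

Only the first row is reported, since its ID is a gene name containing spaces.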
Is there a way to get both the gene_id and the gene_name from the reference file (e.g. ID=PGSC0003DMG400027398; Name=Serine threonine-protein phosphatase;)?
Or could this be related to my reference gtf file?
My reference gtf file looks like this:
chr00 GLEAN exon 982993 983054 . - . transcript_id "PGSC0003DMT400003257"; gene_id "PGSC0003DMG400001290"; gene_name "Cytochrome P450 92B1";
Thank you!
Best,
Nadja
Hi @nadjano,
Please keep in mind that the gff3 that is output when using --isoAnnotLite is different from a standard gff3 file. There is more information about this in the wiki and on the tappAS website.
Normally, the ID field is correctly filled with the gene ID values from the classification.txt file from SQANTI3 (have a look at the example output in the repo). I would suggest examining the ID fields in the classification file, the input GTF and the reference transcriptome file, since you may be able to do some tweaking to correct this (for instance, if the reference does not have a properly defined gene ID field). You can use the example inputs and outputs in the repo for guidance.
Best,
Ángeles
Hi @aarzalluz,
thank you for the explanations and the examples!😄
From what I observed, when you run the QC with the --isoAnnotLite flag, gene names are used instead of gene IDs in the gff3 file. I can also see this in the example file Homo_sapiens_GRCh38_Ensembl_86.chr22.gff3: for the tappAS gene entries, ID and Name are identical, and the ID seems to be the gene name rather than the gene ID.
e.g.
ENST00000615943 tappAS gene 1 113 . - . ID=U2; Name=U2; Desc=U2 spliceosomal RNA; PosType=T
This seems to be because, when sqanti3_qc.py is run with --isoAnnotLite, the reference parser is run with -geneNameAsName2:
Lines 656 to 659 in fecf76a
I'm not entirely certain if this behavior is intentional, but it does pose challenges when dealing with tappAS and gene names that contain special characters or when working with species that have poor gene_name annotations.
In my case, it seems to work best to run sqanti3_qc.py without --isoAnnotLite (so that the IDs in the classification file match the IDs from the GTF file), then run IsoAnnotLite separately, and finally add my gene names to the gff3 file from the initial reference GTF file.
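The last step (adding gene names from the reference GTF to the gff3) could look roughly like the sketch below. It is not the script actually used here, only an illustration under some assumptions: both files are tab-separated, the GTF attributes follow the `key "value";` pattern shown in the reference line above, and only the Name field of tappAS gene rows is rewritten, so ID and Desc stay untouched. The function names and the transcript ID `PB.1.1` are hypothetical. Whether tappAS accepts whitespace in Name was not checked in this thread.

```python
import re

GTF_ATTR = re.compile(r'(\S+) "([^"]*)"')

def gene_names_from_gtf(gtf_lines):
    """Map gene_id -> gene_name from the attribute column of a GTF."""
    names = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(GTF_ATTR.findall(cols[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            # Keep the first name seen for each gene ID.
            names.setdefault(attrs["gene_id"], attrs["gene_name"])
    return names

def add_gene_names(gff3_lines, names):
    """Replace Name=<gene_id> with Name=<gene_name> on gene rows; leave other rows as they are."""
    out = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] == "gene":
            id_match = re.search(r"(?:^|;\s*)ID=([^;]*)", cols[8])
            name = names.get(id_match.group(1).strip()) if id_match else None
            if name is not None:
                # A function replacement keeps any backslashes in the name literal.
                cols[8] = re.sub(r"\bName=[^;]*", lambda m: "Name=" + name, cols[8])
        out.append("\t".join(cols))
    return out

# Reference line from the GTF above, plus a made-up gene row for the same gene ID.
gtf = ['chr00\tGLEAN\texon\t982993\t983054\t.\t-\t.\ttranscript_id "PGSC0003DMT400003257"; '
       'gene_id "PGSC0003DMG400001290"; gene_name "Cytochrome P450 92B1";']
gff3 = ["PB.1.1\ttappAS\tgene\t1\t2057\t.\t-\t.\tID=PGSC0003DMG400001290; "
        "Name=PGSC0003DMG400001290; Desc=PGSC0003DMG400001290; PosType=T"]

patched = add_gene_names(gff3, gene_names_from_gtf(gtf))
print(patched[0])
```

To apply this to real files, the lines can be read with `open(path)` and the patched lines written back out with a trailing newline each. Genes without a gene_name in the GTF keep their ID in the Name field.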
I hope this feedback is helpful, and I appreciate your assistance with this matter.
Best,
Nadja
Hi @nadjano, thanks for your feedback on this and for providing the solution! I'm sure it will be useful to users experiencing similar issues.
IsoAnnotLite was designed as a workaround to unlock tappAS analysis from SQANTI3, but a full version of this type of tool, IsoAnnot, is expected to be launched soon. This will enable annotating functional elements de novo in transcript models, including a more flexible, species-specific behaviour, and should replace IsoAnnotLite :)
Best,
Ángeles