The _corrected.gtf.cds.gff is not generated when running SQANTI3 QC module
Closed this issue · 6 comments
Dear developers,
I ran SQANTI3 QC on Plasmodium falciparum LR-data, using the following command:
python sqanti3_qc.py Plasmo_LR.gtf PlasmoDB-63_Pfalciparum3D7.gtf PlasmoDB-63_Pfalciparum3D7_Genome.fasta -o OUT --saturation --report both -n 10
All the expected output files were generated except for the _corrected.gtf.cds.gff
gtf file, that should contain CDS information based on GMST ORF predictions. I would appreciate your help in resolving this issue.
Thanks in advance.
Hello @Rombar123 sorry for the late reply! If that's okay with you, I could try to run SQANTI3 with your data to see what is the problem.
Kind regards!
Hi, I met the same problem. So do you know the reason? I finished sqanti3_qc.py successfully with my data. But there are no .cds.gff
file. There are only 8 files even no other folders in your wiki example.
In my annotation file(~/reference/merge/merge.combined.gtf), there are only 2 features in column 3 (transcripts
and exons
), and I don't know whether it will have an effect.
- Below is the screenshot of my outputs:
- Below are the two log files generated while I ran sqanti3:
37184315.err.txt
37184315.out.txt
- Below is the parameters file:
Version 5.2
Input ~/transcript.prediction/results/merge/merge.samples/tissues.combined.gtf
Annotation ~/reference/merge/merge.combined.gtf
Genome ~/reference/GRCh38.p14.genome.fa
MinRefLength 0
ForceIdIgnore False
Aligner minimap2
FLCount NA
Expression NA
Junction NA
CAGEPeak ~/software/SQANTI3-5.2.1/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed
PolyAMotif ~/software/SQANTI3-5.2.1/data/polyA_motifs/mouse_and_human.polyA_motif.txt
PolyAPeak NA
IsFusion False
PhyloP NA
SkipORF False
ORFInput NA
FASTAused False
Expression NA
GMAPindex NA
OutputPrefix tissues
OutputDirectory ~/transcript.prediction/results/merge/sqanti3/test
Coverage NA
CanonicalSites ATAC,GCAG,GTAG
PostTTSWindow 20
GeneName False
ReportType both
RunIsoAnnotLite False
isoAnnotGFF3 NA
ShortReads NA
ShortReadsBAMs NA
ratioTSSmetric max
- By the way, I also tested sqanti3_qc.py with your example tutorial(https://github.com/ConesaLab/SQANTI3/wiki/Tutorial:-running-SQANTI3-on-an-example-dataset). I followed your codes but got no report files. And it reported some errors while generated reports(.pdf and .html). (This is the log files: 37183052.err.txt , 37183052.out.txt )
So I think there maybe some potential bugs in your scripts. Because different problems generated with your example data and my data. The main outputs all generated but missing some files(such as.cds.gff
in my data and.pdf/.html
in your example data).
Hi, I checked your source code, and I found the reason. It is due to the split function. When the --chunk
parameter is set, the split function will be performed. But the .cds.gtf.gff
files are not in the combined list. Maybe you forget. So while I don't set the chunk
parameter, the .cds.gtf.gff
file will be outputted.
Below is part of your source code in sqanti_qc.py
, you just combine 5 files without .cds.gtf.gff
:
def combine_split_runs(args, split_dirs):
"""
Combine .faa, .fasta, .gtf, .classification.txt, .junctions.txt
I ran sqanti_qc.py with and without --chunk
parameter on the same data but I got different results. It is wired that the _classification.txt
files are not identical. Especially some important columns such as associated_gene
, RTS_stage
, coding
,ORF_seq
, ratio_TSS
. By the way, I sorted the two tables firstly by the transcript id column so the orders are the same.
Hi @dudududu12138 I've been reviewing this issue for the last week.
- I've check the source code for the .cds.gtf.gff file and you're right - that file was missing. I have a fix for that and i'm testing the code to solve it so it's generated too
- The differences between the classification files seems to be related to the parallelization itself: transcripts may be splitted between chunks and therefore some calculations are not done properly. For now, I haven't seen differences on the 4 important columns you mention, but I'll work on it to make sure the are no differences. I've opened a new issue (#340 ) with all the details I have about it, so any contribution you may have, please, post it there.
I'll keep this issue open until i merge the fix for the .cds.gtf.gff file, to keep it akin to the original issue
By the way ,I can got the .cds.gtf.gff
file without set --chunks
. But the gene_id
in .cds.gtf.gff
file are different from the gene_id
in *_corrected.gtf
.
This is the gene_id
in `*_corrected.gtf:
GL000008.2 StringTie transcript 83318 135086 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001"
GL000008.2 StringTie exon 83318 83545 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2 StringTie exon 83927 84014 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2 StringTie exon 85457 85477 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2 StringTie exon 85567 85625 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2 StringTie exon 129985 130583 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2 StringTie exon 133918 135086 . + . transcript_id "tissues_00000005"; gene_id "XLOC_000001";
This is the gene_id
in *_corrected.cds.gtf.gff
file:
GL000008.2 PacBio transcript 83318 135086 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2 PacBio exon 83318 83545 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2 PacBio exon 83927 84014 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2 PacBio exon 85457 85477 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2 PacBio exon 85567 85625 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2 PacBio exon 129985 130583 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2 PacBio exon 133918 135086 . + . transcript_id "tissues_00000005"; gene_id "novelGene_1";