The _corrected.gtf.cds.gff is not generated when running SQANTI3 QC module

Question

The _corrected.gtf.cds.gff is not generated when running SQANTI3 QC module

Closed this issue 2 months ago · 6 comments

Dear developers,

I ran SQANTI3 QC on Plasmodium falciparum LR-data, using the following command:

python sqanti3_qc.py Plasmo_LR.gtf PlasmoDB-63_Pfalciparum3D7.gtf PlasmoDB-63_Pfalciparum3D7_Genome.fasta -o OUT --saturation --report both -n 10

All the expected output files were generated except for the _corrected.gtf.cds.gff gtf file, that should contain CDS information based on GMST ORF predictions. I would appreciate your help in resolving this issue.

Thanks in advance.

Answer 1 · 2023-07-20T14:16:13.000Z

Hello @Rombar123 sorry for the late reply! If that's okay with you, I could try to run SQANTI3 with your data to see what is the problem.

Kind regards!

Answer 2 · 2024-09-29T06:56:27.000Z

Hi， I met the same problem. So do you know the reason? I finished sqanti3_qc.py successfully with my data. But there are no .cds.gff file. There are only 8 files even no other folders in your wiki example.
In my annotation file(~/reference/merge/merge.combined.gtf), there are only 2 features in column 3 (transcripts and exons), and I don't know whether it will have an effect.

Below is the screenshot of my outputs:

Below are the two log files generated while I ran sqanti3:

37184315.err.txt
37184315.out.txt

Below is the parameters file:

Version 5.2
Input  ~/transcript.prediction/results/merge/merge.samples/tissues.combined.gtf
Annotation     ~/reference/merge/merge.combined.gtf
Genome  ~/reference/GRCh38.p14.genome.fa
MinRefLength    0
ForceIdIgnore   False
Aligner minimap2
FLCount NA
Expression      NA
Junction        NA
CAGEPeak        ~/software/SQANTI3-5.2.1/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed
PolyAMotif      ~/software/SQANTI3-5.2.1/data/polyA_motifs/mouse_and_human.polyA_motif.txt
PolyAPeak       NA
IsFusion        False
PhyloP  NA
SkipORF False
ORFInput        NA
FASTAused       False
Expression      NA
GMAPindex       NA
OutputPrefix    tissues
OutputDirectory ~/transcript.prediction/results/merge/sqanti3/test
Coverage        NA
CanonicalSites  ATAC,GCAG,GTAG
PostTTSWindow   20
GeneName        False
ReportType      both
RunIsoAnnotLite False
isoAnnotGFF3    NA
ShortReads      NA
ShortReadsBAMs  NA
ratioTSSmetric  max

By the way, I also tested sqanti3_qc.py with your example tutorial(https://github.com/ConesaLab/SQANTI3/wiki/Tutorial:-running-SQANTI3-on-an-example-dataset). I followed your codes but got no report files. And it reported some errors while generated reports(.pdf and .html). (This is the log files: 37183052.err.txt , 37183052.out.txt )
So I think there maybe some potential bugs in your scripts. Because different problems generated with your example data and my data. The main outputs all generated but missing some files(such as .cds.gff in my data and .pdf/.html in your example data).

Answer 3 · 2024-10-01T03:22:03.000Z

Hi, I checked your source code, and I found the reason. It is due to the split function. When the --chunk parameter is set, the split function will be performed. But the .cds.gtf.gff files are not in the combined list. Maybe you forget. So while I don't set the chunk parameter, the .cds.gtf.gff file will be outputted.
Below is part of your source code in sqanti_qc.py, you just combine 5 files without .cds.gtf.gff:

def combine_split_runs(args, split_dirs):
    """
    Combine .faa, .fasta, .gtf, .classification.txt, .junctions.txt

Answer 4 · 2024-10-01T04:18:28.000Z

I ran sqanti_qc.py with and without --chunk parameter on the same data but I got different results. It is wired that the _classification.txt files are not identical. Especially some important columns such as associated_gene, RTS_stage, coding,ORF_seq, ratio_TSS. By the way, I sorted the two tables firstly by the transcript id column so the orders are the same.

Answer 5 · 2024-10-08T09:04:39.000Z

Hi @dudududu12138 I've been reviewing this issue for the last week.

I've check the source code for the .cds.gtf.gff file and you're right - that file was missing. I have a fix for that and i'm testing the code to solve it so it's generated too
The differences between the classification files seems to be related to the parallelization itself: transcripts may be splitted between chunks and therefore some calculations are not done properly. For now, I haven't seen differences on the 4 important columns you mention, but I'll work on it to make sure the are no differences. I've opened a new issue (#340 ) with all the details I have about it, so any contribution you may have, please, post it there.

I'll keep this issue open until i merge the fix for the .cds.gtf.gff file, to keep it akin to the original issue

Answer 6 · 2024-10-08T13:28:34.000Z

By the way ,I can got the .cds.gtf.gff file without set --chunks. But the gene_id in .cds.gtf.gff file are different from the gene_id in *_corrected.gtf.
This is the gene_id in `*_corrected.gtf:

GL000008.2      StringTie       transcript      83318   135086  .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001"
GL000008.2      StringTie       exon    83318   83545   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    83927   84014   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    85457   85477   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    85567   85625   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    129985  130583  .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    133918  135086  .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";

This is the gene_id in *_corrected.cds.gtf.gff file:

GL000008.2      PacBio  transcript      83318   135086  .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    83318   83545   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    83927   84014   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    85457   85477   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    85567   85625   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    129985  130583  .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    133918  135086  .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";