ConesaLab/SQANTI3

The _corrected.gtf.cds.gff is not generated when running SQANTI3 QC module

Closed this issue · 6 comments

Dear developers,

I ran SQANTI3 QC on Plasmodium falciparum LR-data, using the following command:

python sqanti3_qc.py Plasmo_LR.gtf PlasmoDB-63_Pfalciparum3D7.gtf PlasmoDB-63_Pfalciparum3D7_Genome.fasta -o OUT --saturation --report both -n 10

All the expected output files were generated except for the _corrected.gtf.cds.gff gtf file, that should contain CDS information based on GMST ORF predictions. I would appreciate your help in resolving this issue.

Thanks in advance.

Hello @Rombar123 sorry for the late reply! If that's okay with you, I could try to run SQANTI3 with your data to see what is the problem.

Kind regards!

Hi, I met the same problem. So do you know the reason? I finished sqanti3_qc.py successfully with my data. But there are no .cds.gff file. There are only 8 files even no other folders in your wiki example.
In my annotation file(~/reference/merge/merge.combined.gtf), there are only 2 features in column 3 (transcripts and exons), and I don't know whether it will have an effect.

  • Below is the screenshot of my outputs:

1727597232541

  • Below are the two log files generated while I ran sqanti3:

37184315.err.txt
37184315.out.txt

  • Below is the parameters file:
Version 5.2
Input  ~/transcript.prediction/results/merge/merge.samples/tissues.combined.gtf
Annotation     ~/reference/merge/merge.combined.gtf
Genome  ~/reference/GRCh38.p14.genome.fa
MinRefLength    0
ForceIdIgnore   False
Aligner minimap2
FLCount NA
Expression      NA
Junction        NA
CAGEPeak        ~/software/SQANTI3-5.2.1/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed
PolyAMotif      ~/software/SQANTI3-5.2.1/data/polyA_motifs/mouse_and_human.polyA_motif.txt
PolyAPeak       NA
IsFusion        False
PhyloP  NA
SkipORF False
ORFInput        NA
FASTAused       False
Expression      NA
GMAPindex       NA
OutputPrefix    tissues
OutputDirectory ~/transcript.prediction/results/merge/sqanti3/test
Coverage        NA
CanonicalSites  ATAC,GCAG,GTAG
PostTTSWindow   20
GeneName        False
ReportType      both
RunIsoAnnotLite False
isoAnnotGFF3    NA
ShortReads      NA
ShortReadsBAMs  NA
ratioTSSmetric  max

Hi, I checked your source code, and I found the reason. It is due to the split function. When the --chunk parameter is set, the split function will be performed. But the .cds.gtf.gff files are not in the combined list. Maybe you forget. So while I don't set the chunk parameter, the .cds.gtf.gff file will be outputted.
Below is part of your source code in sqanti_qc.py, you just combine 5 files without .cds.gtf.gff:

def combine_split_runs(args, split_dirs):
    """
    Combine .faa, .fasta, .gtf, .classification.txt, .junctions.txt

I ran sqanti_qc.py with and without --chunk parameter on the same data but I got different results. It is wired that the _classification.txt files are not identical. Especially some important columns such as associated_gene, RTS_stage, coding,ORF_seq, ratio_TSS. By the way, I sorted the two tables firstly by the transcript id column so the orders are the same.

Hi @dudududu12138 I've been reviewing this issue for the last week.

  1. I've check the source code for the .cds.gtf.gff file and you're right - that file was missing. I have a fix for that and i'm testing the code to solve it so it's generated too
  2. The differences between the classification files seems to be related to the parallelization itself: transcripts may be splitted between chunks and therefore some calculations are not done properly. For now, I haven't seen differences on the 4 important columns you mention, but I'll work on it to make sure the are no differences. I've opened a new issue (#340 ) with all the details I have about it, so any contribution you may have, please, post it there.

I'll keep this issue open until i merge the fix for the .cds.gtf.gff file, to keep it akin to the original issue

By the way ,I can got the .cds.gtf.gff file without set --chunks. But the gene_id in .cds.gtf.gff file are different from the gene_id in *_corrected.gtf.
This is the gene_id in `*_corrected.gtf:

GL000008.2      StringTie       transcript      83318   135086  .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001"
GL000008.2      StringTie       exon    83318   83545   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    83927   84014   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    85457   85477   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    85567   85625   .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    129985  130583  .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";
GL000008.2      StringTie       exon    133918  135086  .       +       .       transcript_id "tissues_00000005"; gene_id "XLOC_000001";

This is the gene_id in *_corrected.cds.gtf.gff file:

GL000008.2      PacBio  transcript      83318   135086  .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    83318   83545   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    83927   84014   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    85457   85477   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    85567   85625   .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    129985  130583  .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";
GL000008.2      PacBio  exon    133918  135086  .       +       .       transcript_id "tissues_00000005"; gene_id "novelGene_1";