EI-CoreBioinformatics/mikado

Missing coding regions

Closed this issue · 16 comments

Hi,

I'm having a problem with Mikado. It seems like do not translate the CDS on some transcripts. This is an example.

Mikado prepare generate this coordinates:

Avcr0020688	CATin	transcript	1	2623	.	-	.	gene_id "CATin_transcript:Avcr.Avcr0020688G5581.1.gene"; transcript_id "CATin_transcript:Avcr.Avcr0020688G5581.1"; gene_name "Avcr.Avcr0020688G5581"; transcript_name "Avcr.Avcr0020688G5581.1.pol"; gene_biotype "protein_coding"; transcript_biotype "protein_coding"; is_reference "True";
Avcr0020688	CATin	exon	1	2623	.	-	.	gene_id "CATin_transcript:Avcr.Avcr0020688G5581.1.gene"; transcript_id "CATin_transcript:Avcr.Avcr0020688G5581.1";
Avcr0020688	CATout	mRNA	1	2623	.	-	.	gene_id "CATout_Avcr_T0016849.gene"; transcript_id "CATout_Avcr_T0016849"; Name "CATout_Avcr_T0016849"; has_start_codon "True"; has_stop_codon "False";
Avcr0020688	CATout	CDS	1	2623	.	-	0	gene_id "CATout_Avcr_T0016849.gene"; transcript_id "CATout_Avcr_T0016849";
Avcr0020688	CATout	exon	1	2623	.	-	.	gene_id "CATout_Avcr_T0016849.gene"; transcript_id "CATout_Avcr_T0016849";

During mikado pick I also submit the bed file from transdecoder mikado_prepared.fasta.transdecoder.bed, which have thse coordinates for that transcript:

CATin_transcript:Avcr.Avcr0020688G5581.1	0	2623	ID=CATin_transcript:Avcr.Avcr0020688G5581.1.p1;GENE.CATin_transcript:Avcr.Avcr0020688G5581.1~~CATin_transcript:Avcr.Avcr0020688G5581.1.p1;ORF_type:3prime_partial_len:875_(+),score=56.00,Avcr_T0012173|99.4|0.0e+00	0	+	0	2622	0	1	2623	0

Mikado pick generates this coordinates whitout any errors in the log:

Avcr0020688	Mikado_loci	superlocus	1	2623	.	-	.	ID=Mikado_superlocus:Avcr0020688-:1-2623;Name=superlocus:Avcr0020688-:1-2623
Avcr0020688	Mikado_loci	ncRNA_gene	1	2623	11	-	.	ID=Avcr.Avcr0020688G1;Name=Avcr.Avcr0020688G1;multiexonic=False;superlocus=Mikado_superlocus:Avcr0020688-:1-2623
Avcr0020688	Mikado_loci	ncRNA	1	2623	11.0	-	.	ID=Avcr.Avcr0020688G1.1;Parent=Avcr.Avcr0020688G1;alias=CATin_transcript:Avcr.Avcr0020688G5581.1;gene_biotype=protein_coding;gene_name=Avcr.Avcr0020688G5581;is_reference=True;primary=True;transcript_biotype=protein_coding;transcript_name=Avcr.Avcr0020688G5581.1.pol
Avcr0020688	Mikado_loci	exon	1	2623	.	-	.	ID=Avcr.Avcr0020688G1.1.exon1;Parent=Avcr.Avcr0020688G1.1

Why is that?
What am I doing wrong?

Cheers
F

Hi @francicco

Looks like you have some ORFs loaded from the GFF and some you have loaded via the serialise step, the model selected is the one where the orf was indicated to have been loaded from the bed file. Firstly check that the serialise step correctly loaded the orfs (see serialise.log). You could also check the SQlite mikado.db https://mikado.readthedocs.io/en/stable/Usage/Serialise/#technical-details to see if the orf is loaded for this transcript. However given you have an identical model in the prepare GTF (transcript_id "CATout_Avcr_T0016849") with a CDS and this wasn't selected at the pick stage there is likely some specific reason to explain why the "non-coding" model was selected. If the ORFs were correctly loaded during serialise then I would suggest running mikado pick just on the region you are querying. You can do this using --regions . Also run with -lv DEBUG and see if that gives you any more useful info.

I'm assuming you have examples where the CDS was output and the ORF was loaded via serialise.

Note if you have ORFs in GFF file and others you are loading via serialise then you may want to only run the models through transdecoder that lack ORFs in the GFF. I'm not 100% sure what happens is you load an orf for a transcript via GFF AND also via serialise (i.e. which would be used).

David

Hi David,

I think I understood why. I used --update-reference and for some reason, the ORF from that transcript was stripped away, although it was correct. This is a problem I'm trying to understand how to solve. What I mean is that when I use gffread to check my reference it returns no errors and the protein sequences have no stop codon, with the exception of the last stop-codons. Sometimes there are Xs in the seq due to gaps on the scaffolds, and I would like to keep them. I generally receive a number of warnings such as

transcript:XM_024099223.1 has mixed splices
Error: Invalid ORF for CATin_transcript:XM_024085644.1 (1, 3591, phase 0), reason: 8 internal stop codons found
incorrect fusions of splice junctions

To name a couple
F

I think I understood why. I used --update-reference and for some reason, the ORF from that transcript was stripped away, although it was correct.

you mean the option

--reference-update Flag. If switched on, Mikado will prioritise transcripts marked as reference and will consider any other transcipt within loci only in reference to these reference transcripts. Novel loci will still be reported.

if your CATout models are not marked as reference and your CATin models ARE marked as reference, then for a loci where you have a mix of reference and non-reference models this option will pick a primary model from the reference set, the non-reference models will then be used for padding and alt splice variants. So in this specific case as CATout_Avcr_T0016849 is not a reference model it will lose out to CATin_transcript:Avcr.Avcr0020688G5581.1 and then wont be brought back as an alt splice model as it has an identical structure.

If you have loaded an orf for CATin_transcript:Avcr.Avcr0020688G5581.1 via serialise then from the info provided I cant really say why this was deemed not valid by mikado.

I would rerun pick just on that region and change your CATout to be a reference model or dont run with --reference-update, and see if the CATout_Avcr_T0016849 is picked with the ORF.

transcript:XM_024099223.1 has mixed splices
Error: Invalid ORF for CATin_transcript:XM_024085644.1 (1, 3591, phase 0), reason: 8 internal stop codons found
incorrect fusions of splice junctions

Difficult for me to comment on if gffread or mikado is wrong i.e. whether the model should translate or not (the splicing junctions might suggest it wouldn't). From memory I believe we have had examples where gffread was wrong perhaps in relation to the phase indicated but I don't recall details.

I'd need to understand what these errors mean in order to correct the annotation, otherwise I'll lose loci.
Would you help me with that?
Cheers
F

Sorry @francicco I cant really help with looking in detail at your data, but will try to comment if possible.

I just wanted to know what these warnings mean:

> transcript:XM_024099223.1 has mixed splices
> Error: Invalid ORF for XXXXXXX (1, 3591, phase 0), reason: 8 internal stop codons found
> incorrect fusions of splice junctions

Because I'm manually looking at them but I find nothing wrong
F

I would think the junction error is indicating some junctions are canonical on one strand and other junctions canonical on the other strand (i.e. what mikado regards as suspicious_splicing), and invalid orf that there are stop codons when translated in frame 0 i.e. the codon starting at the first base. If the above is from pick, you should be able to extract this transcript from the mikado prepare gff and check the junctions / translation.

These are from mikado prepare.

if you just run that 1 transcript throught prepare with --strip-faulty-cds then this might output the model but just with the CDS removed (as mikado finds it to be invalid) and you can compare the input and output intron-exon structure (they should be the same) and I would expect not all junctions to be canonical.

That's what I'm doing, but sometimes I can reconstruct the ORF anymore.
Does the mixed splices remove the transcript entirely?
F

Now I'm also getting this:

2021-06-07 17:16:41,909 - prepare - prepare.py:685 - ERROR - prepare - MainProcess - 'Invalid chromosome name! CATin_transcript:Hbur.Hbur000001G2277.1, ./mikado_shelf_00000.db, Hbur000001, (207237, 299840)'
Traceback (most recent call last):
  File "/work/tk19812/software/mikado-2.3.0/venv/lib/python3.8/site-packages/Mikado/preparation/prepare.py", line 682, in prepare
    perform_check(divide_by_chrom(), shelve_names, mikado_config, logger)
  File "/work/tk19812/software/mikado-2.3.0/venv/lib/python3.8/site-packages/Mikado/preparation/prepare.py", line 262, in perform_check
    raise KeyError("Invalid chromosome name! {}, {}, {}, {}".format(tid, shelf_name, chrom, key))
KeyError: 'Invalid chromosome name! CATin_transcript:Hbur.Hbur000001G2277.1, ./mikado_shelf_00000.db, Hbur000001, (207237, 299840)'
2021-06-07 17:16:41,912 - prepare - prepare.py:688 - ERROR - prepare - MainProcess - Mikado has encountered an error, exiting
2021-06-07 17:16:41,913 - prepare - prepare.py:705 - ERROR - prepare - MainProcess - Mikado prepare has encountered a fatal error. Please check the logs and, if there is a bug,report it to https://github.com/EI-CoreBioinformatics/mikado/issues

Is there a length limit for scaffolds id?
Cheers
F

@francicco

That error would seem to indicate the chromosome Hbur000001 in the gtf is not present in the reference fasta, perhaps this isn't the correct reference fasta file (also check the fa.fai file is correct).

It seems fine to me:

(venv) [tk19812@bp1-login01 Hbur.Correction]$ cut -f1 Hbur.v1.0.annotation.bed.gff3 | grep -v '#' | sort -u | while read scf;
     do echo $scf; grep "$scf" ../../Hbur.assembly.v1.5.fasta;
done | head
>Hbur000001
Hbur000002
>Hbur000002
Hbur000003
>Hbur000003
Hbur000004
>Hbur000004
Hbur000005
>Hbur000005

If the fasta file has the correct chr ids, make sure the fa.fai is also correct, you will get the same error if you have a preexisting fa.fai file where the ids are wrong (you could just delete the fa.fai and rerun prepare and it will regenerate this from the fasta)

OOOh, I'll have a look.
Thanks
F

It worked!
Thanks
F