EI-CoreBioinformatics/mikado

CAT to Mikado

francicco opened this issue · 8 comments

Hi Luca,

I have a question about how loci are defined. Let me explain.

I used Mikado to annotate a bunch of species [DONE]. I then used CAT to cross-annotate other species for which I don't have an annotation. CAT returns tables and annotations for all species. I think there is a problem with CAT, because the new gene_ids do not correspond to a specific gene/locus: several gene_ids point to the same genomic locus.
You can see this in the figure (upper blue track):
[Image: Screenshot 2021-04-30 at 10 03 21 AM]

I was wondering if there's a way to give Mikado all the CAT-annotated transcripts (BED/GFF3) and let it redefine the loci/genes, assigning new IDs.

Please help!
Cheers
F

PS. I'm sure they need to fix it, but that may take a lot of time, which I don't have. This is a rescue maneuver.

Hi @francicco

Unfortunately I have left the Earlham Institute and I am maintaining Mikado only in an unofficial capacity :-/ I do not have direct experience with CAT either (aside from it being a pain to install on our cluster, some years ago).

@swarbred , @ljyanesm any thoughts on this?

Best,
Luca

Hi Luca,

Forget about CAT itself. I just need to define loci given a set of transcripts, that's all.
F

Hi @francicco

Honestly we might use gffread for this ... @swarbred , @ljyanesm do we have anything in Minos or REAT for this?

Best

@francicco

> give all the CAT annotated transcripts (bed/gff3) and let Mikado redefine loci/genes, assigning new ids

You can provide them as input to Mikado, and Mikado will select from them. If you adjust the scoring and configuration, you can reduce the number of models that are filtered out by the requirements (splicing requirements, etc.). You will lose some models (which may or may not be desirable), but the final output will be "clustered" into genes. Run with the input models marked as reference; this prevents models from being removed at some of the prepare and pick stages. Running https://github.com/EI-CoreBioinformatics/minos is likely overkill for what you want to achieve, but it is what we use to integrate gene models from alternative sources.

Other light-touch approaches would be to derive the clustering via gffread or cuffcompare and then use it to redefine the genes. Another option, https://github.com/GenomeRIK/tama, could also help with clustering genes. All of these will likely require some additional work, i.e. you can't just provide your input files and get exactly what you need.
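
To make the "derive the clustering, then redefine the genes" idea concrete, here is a minimal Python sketch. It is not what gffread, cuffcompare, TAMA or Mikado do internally. It assumes a BED file with at least six columns, groups transcripts that sit on the same chromosome and strand and whose genomic spans overlap, and hands out new locus IDs. The function name `cluster_bed_into_loci` and the `locus.N` ID scheme are made up for illustration. Real tools generally look at exon or intron overlap rather than whole-transcript spans, so treat this as a rough first pass.

```python
from collections import defaultdict


def cluster_bed_into_loci(bed_lines, prefix="locus"):
    """Map each transcript name to a new locus ID.

    Simplifying assumption: two transcripts share a locus when they are on
    the same chromosome and strand and their spans overlap, directly or
    through a chain of overlapping transcripts.
    """
    by_key = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        strand = fields[5] if len(fields) > 5 else "."
        by_key[(chrom, strand)].append((start, end, name))

    locus_of = {}
    counter = 0
    for key in sorted(by_key):
        current_end = None
        for start, end, name in sorted(by_key[key]):
            # BED is 0-based, half-open: start >= current_end means no overlap.
            if current_end is None or start >= current_end:
                counter += 1
                current_end = end
            else:
                current_end = max(current_end, end)
            locus_of[name] = f"{prefix}.{counter}"
    return locus_of


example = [
    "chr1\t100\t500\ttxA\t0\t+",
    "chr1\t400\t900\ttxB\t0\t+",
    "chr1\t450\t800\ttxC\t0\t-",
    "chr1\t900\t1200\ttxD\t0\t+",
]
loci = cluster_bed_into_loci(example)
```

In this toy input, txA and txB end up in the same locus, txD starts a new one because it only touches txB's end coordinate, and txC gets its own locus because it is on the opposite strand. The resulting mapping could then be used to rewrite the gene IDs in the original annotation.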

Thanks a lot! I did this:

echo -e "$CATBED\tCAT\tTrue\t1" > Mikado.conf
mikado configure --list Mikado.conf --reference ../../../CompleteAssemblies/AssembledGenomes/$GENOME --codon-table 0 configuration.yaml
mikado prepare --json-conf configuration.yaml --procs 1
mikado serialise -p $THREADS --json-conf configuration.yaml
mikado pick --mode nosplit --prefix $TARGETSP -p $THREADS --json-conf configuration.yaml --subloci-out $TARGETSP.mikado.subloci.out.gff3 --output-dir $TARGETSP.Mikado

What do you think?
F

@francicco

That will select models from your input file; whether the results are what you want, I can't really say. If you are using the standard configs to score gene models rather than transcript assemblies, then Mikado's normal intrinsic and external metrics may not be ideal (which is why we have Minos to generate the evidence-based scoring metrics for use in Mikado). If it generates what you want, then great. As I mentioned, if you want Mikado to remove fewer models, run them as reference, i.e. adjust your --list file, and adjust the alternative splicing sections of the main Mikado config and the scoring so they are less restrictive.

--list LIST           Tab-delimited file containing rows with the following format: <file> <label> <strandedness(def. False)> <score(optional, def. 0)> <is_reference(optional, def. False)>
                        <exclude_redundant(optional, def. True)> <strip_cds(optional, def. False)> <skip_split(optional, def. False)> "strandedness", "is_reference", "exclude_redundant", "strip_cds" and
                        "skip_split" must be boolean values (True, False) "score" must be a valid floating number.

Would this conf solve the problem?

Hmel.v3.1.CATannotation.bed CAT True 0 True True True

F
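
Since the columns of the --list file are positional, it can help to label them explicitly before running anything. The Python sketch below takes the column order and defaults from the help text quoted above and nothing else. The helper names `list_row` and `describe_row` are hypothetical, and Mikado's actual parsing is not checked here. Read against that help text, the seventh value in the proposed row would land on `strip_cds`, which may not be what is intended.

```python
# Column order as given in the `--list` help text quoted above.
LIST_COLUMNS = (
    "file", "label", "strandedness", "score",
    "is_reference", "exclude_redundant", "strip_cds", "skip_split",
)


def list_row(file, label, strandedness=False, score=0, is_reference=False,
             exclude_redundant=True, strip_cds=False, skip_split=False):
    """Build one tab-delimited row, using the defaults stated in the help text."""
    values = (file, label, strandedness, score,
              is_reference, exclude_redundant, strip_cds, skip_split)
    return "\t".join(str(v) for v in values)


def describe_row(row):
    """Pair each value of an existing row with the column it falls under."""
    return dict(zip(LIST_COLUMNS, row.split()))


# The row proposed in the thread, as it appears here (whitespace-separated).
proposed = describe_row("Hmel.v3.1.CATannotation.bed CAT True 0 True True True")

# A row that marks the input as reference and leaves the CDS in place.
row = list_row("Hmel.v3.1.CATannotation.bed", "CAT", strandedness=True,
               score=0, is_reference=True)
```

Here `proposed["is_reference"]` is `"True"`, but so is `proposed["strip_cds"]`. Writing all eight columns out, as `list_row` does, avoids having to count positions.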

Also, CAT generates a lot of transcripts with stop codons in them or entire stretches of Ns. Maybe I can use Mikado to filter these out...
F
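
Independently of what Mikado does with such models, this kind of check is easy to sketch in Python once the CDS sequences have been extracted (for example to FASTA). The code below makes several assumptions: the standard genetic code's stop codons, a CDS that starts in frame at its first base, and an arbitrary threshold of 10 consecutive Ns. All function names are made up for illustration.

```python
import re

# Assumption: standard genetic code. Adjust for other codon tables.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_internal_stop(cds):
    """True if a stop codon occurs before the final codon of an in-frame CDS."""
    cds = cds.upper()
    codons = [cds[3 * i:3 * i + 3] for i in range(len(cds) // 3)]
    if len(cds) % 3 == 0:
        # The last codon may be the legitimate terminal stop, so skip it.
        codons = codons[:-1]
    return any(codon in STOP_CODONS for codon in codons)


def has_n_run(seq, min_run=10):
    """True if the sequence contains at least `min_run` consecutive Ns."""
    return re.search("N{%d,}" % min_run, seq.upper()) is not None


def keep_transcript(cds, min_n_run=10):
    """Keep a model only if it has no internal stop and no long N stretch."""
    return not has_internal_stop(cds) and not has_n_run(cds, min_n_run)


good = keep_transcript("ATGGGGCCCTGA")           # clean ORF
bad_stop = keep_transcript("ATGTAAGGGTGA")       # TAA in the second codon
bad_gap = keep_transcript("ATG" + "N" * 12 + "TGA")  # 12 Ns in a row
```

The IDs of the transcripts that fail the check could then be dropped from the BED/GFF3 before clustering or before giving the file to Mikado.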