HuffordLab/NAM-genomes

Some questions about gene prediction

MrbrilliantLL opened this issue · 5 comments

Hi,

I have some more questions about the gene-prediction part, mainly in 'Refining, merging and post-processing predictions':

  1. 'Step 1. Generate non-overlapping Braker set' gives no detailed description of the procedure. My understanding is that only the lines of BRAKER's gff3 file with mRNA tags are retained (excluding CDS and exon).

  2. 'Step 2. Combine Mikado and non overlapping BRAKER models to a WS' uses MAKER-P's gff3_merge tool to merge the two results. But when I ran it, gff3_merge simply concatenated the two files (equivalent to cat file1 file2 > file_combine). Is the real intention of this step to take the union of the gene models predicted by the two results (a union set for each locus rather than a plain merge of two files)?

  3. In the step of updating WS models by PASA: PASA can only update protein coding gene models, but the mikado file contains ncRNA annotations, so how are the ncRNA annotations handled?

Thank you for your help!

Hi @MrbrilliantLL,

Thanks for reaching out. I'll try to answer your questions below:

  1. For non-overlapping BRAKER gene models, we used bedtools. Specifically,
bedtools intersect -v -s -a BRAKER.gff -b mikado.gff > non.overlaping.BRAKER.gff

The reasoning was that, wherever the two sets overlap, we wanted to keep the evidence-based (Mikado) predictions in our final set rather than the ab initio (BRAKER) predictions.

  2. That's right. Since the overlapping models were removed, we planned to get the union set.
  3. We did not consider ncRNA predictions and focussed only on protein-coding gene predictions.

I hope this helps!
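
For readers who want to see the logic of answers 1 and 2 without running bedtools, here is a minimal Python sketch. It assumes the documented behaviour of `bedtools intersect`: `-v` reports only the entries in A that have no overlap in B, and `-s` counts an overlap only when both features are on the same strand. The `Feature` type, the helper names, and the toy coordinates are hypothetical and not taken from the NAM pipeline. bedtools evaluates each GFF line independently, so each toy feature here stands in for a single line rather than a whole gene model.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive (GFF3 convention)
    end: int     # 1-based, inclusive
    strand: str
    name: str


def overlaps_same_strand(a: Feature, b: Feature) -> bool:
    """True if a and b share a sequence, a strand, and at least one base."""
    return (a.seqid == b.seqid
            and a.strand == b.strand
            and a.start <= b.end
            and b.start <= a.end)


def non_overlapping(a_features: List[Feature], b_features: List[Feature]) -> List[Feature]:
    """Mimic `intersect -v -s`: keep A features with no same-strand overlap in B."""
    return [a for a in a_features
            if not any(overlaps_same_strand(a, b) for b in b_features)]


# Toy data (hypothetical): two Mikado models and three BRAKER models.
mikado = [
    Feature("chr1", 100, 500, "+", "mikado.1"),
    Feature("chr1", 2000, 2600, "-", "mikado.2"),
]
braker = [
    Feature("chr1", 400, 900, "+", "braker.1"),    # same-strand overlap with mikado.1: dropped
    Feature("chr1", 2100, 2500, "+", "braker.2"),  # overlaps mikado.2 on the opposite strand: kept
    Feature("chr1", 5000, 5800, "+", "braker.3"),  # no overlap at all: kept
]

kept = non_overlapping(braker, mikado)

# Once the overlapping BRAKER models are gone, plain concatenation already
# gives the union set, because no locus is represented by both sources.
working_set = mikado + kept
print([f.name for f in working_set])
```

In this toy case, concatenating the two lists is equivalent to taking the union per locus, which is why a simple merge of the two files is sufficient after the filtering step.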

Thanks,

Thank you for your quick response!

Is it only the updated WS models from PASA that are used in the subsequent TE-filtered and HCS gene models steps? Was the annotated ncRNA in mikado discarded?

The Methods section of the paper says: 'The HCS gene models were further classified based on homology to related species, and assigned coding and non-coding biotypes'. Is it in this step that ncRNA is distinguished?

I'll ping @Kapeel who might be able to answer your question about ncRNA prediction.

That's correct, the ncRNAs in Mikado were discarded. The assignment of biotypes for HCS gene models was based purely on whether the transcripts had a complete CDS (coding) or an incomplete CDS (non-coding).
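
As a rough illustration of that rule (not the pipeline's actual code), the sketch below labels a transcript by CDS completeness. The thread does not say how completeness was determined, so the criteria used here are assumptions: the CDS starts with ATG, ends with a stop codon, has a length that is a multiple of three, and contains no internal stop codon. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_complete_cds(cds: str) -> bool:
    """Assumed definition of 'complete': ATG start, terminal stop, in frame,
    and no premature stop codon."""
    seq = cds.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (codons[0] == "ATG"
            and codons[-1] in STOP_CODONS
            and not any(c in STOP_CODONS for c in codons[1:-1]))


def assign_biotype(cds: str) -> str:
    """Label a transcript 'coding' if its CDS is complete, else 'non-coding'."""
    return "coding" if has_complete_cds(cds) else "non-coding"


# Toy sequences (hypothetical), written as nucleotide CDS strings.
examples = {
    "complete": "ATGGCCTAA",        # start codon, one codon, stop codon
    "no_start": "GCCGCCTAA",        # missing ATG
    "no_stop": "ATGGCCGCC",         # missing terminal stop
    "internal_stop": "ATGTAAGCCTAA",  # premature stop codon
}
for label, seq in examples.items():
    print(label, assign_biotype(seq))
```

Note that under this rule "non-coding" means "lacks a complete CDS"; it is a different notion from the ncRNA predictions in Mikado, which were discarded earlier.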

I get it, thanks for all your help!