Questions regarding tsebra

Question

bauerlev opened this issue 6 months ago · 1 comment

Hello, I have a few questions regarding tsebra. I know it has its own GitHub repository, but that doesn't seem to be active, and since your group maintains both tools I'm hoping this is an appropriate place to ask.

What does each of these parameters in the config file actually mean? I can't find an answer anywhere in the documentation.

Allowed difference for each feature

Values have to be in [0,2]

e_1 0.0
e_2 0.5
e_3 0.096
e_4 0.02
e_5 0.18
e_6 0.18

Is there documentation on how the script get_longest_isoform works? We noticed multiple transcripts for a given locus after running braker3, so I tried this script. It helped, but there are still instances where there's more than one transcript for a given locus.

Thanks for your help! We've been having significantly better success with braker over maker and I'm very grateful.

Answer 1 · 2024-12-02T22:01:27.000Z

Hello @bauerlev,
I hope you find this helpful:

"Hi, I will upload a TSEBRA version with the keep-all option by the end of this week.

Your command line looks correct and it should work.

You might be correct that the configuration of the long-read version of TSEBRA isn't well suited to all species, as the amount of long-read data available during development was very limited. If you want to adjust the configuration, I would suggest that you try different values for intron_support, e_1, e_4, e_5, e_6.
The support values in the config file specify the minimum fraction that has to be supported by extrinsic evidence. If a transcript's evidence support is below these minimums for both its start/stop codons and its introns, it will be filtered out. For the current long-read configuration, this means that all transcripts must have either all introns or their stop codon supported. I would suggest decreasing intron_support if you want to change anything here. This can be especially helpful if you think that the sensitivity at the gene level is not high enough.
The e parameters are thresholds that allow some difference between the scores of two transcripts at the same locus. In short, the thresholds correspond to scores as follows: e_1: relative fraction of supported introns, e_2: relative fraction of supported stop codons, e_3: relative fraction of supported start codons, e_4: absolute intron support, e_5: absolute stop-codon support, e_6: absolute start-codon support. If you want to go more in-depth, you can take a look at our paper. I would try increasing e_1, e_4, e_5, e_6, especially if you want to keep more alternative isoforms per gene."
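
To make the role of the e thresholds more concrete, here is a minimal Python sketch of one plausible reading of the quoted description. It is not TSEBRA's actual code. The function `compare_transcripts`, the rule that the first score difference exceeding its threshold decides the comparison, and the example scores are all assumptions made for illustration. Only the parameter names, their meanings, and the threshold values come from this thread.

```python
# Illustrative sketch only; this is not TSEBRA's actual implementation.

# Meaning of each threshold, as listed in the quoted answer.
E_LABELS = {
    "e_1": "relative fraction of supported introns",
    "e_2": "relative fraction of supported stop codons",
    "e_3": "relative fraction of supported start codons",
    "e_4": "absolute intron support",
    "e_5": "absolute stop-codon support",
    "e_6": "absolute start-codon support",
}

# Threshold values copied from the config excerpt in the question.
THRESHOLDS = {
    "e_1": 0.0, "e_2": 0.5, "e_3": 0.096,
    "e_4": 0.02, "e_5": 0.18, "e_6": 0.18,
}


def compare_transcripts(scores_a, scores_b, thresholds):
    """Return "a" or "b" for the transcript that is kept, or "both".

    Assumed rule (hypothetical): walk the scores in the order e_1..e_6;
    the first score difference larger than its threshold removes the
    lower-scoring transcript. If no difference exceeds its threshold,
    both transcripts are kept.
    """
    for key in E_LABELS:
        diff = scores_a[key] - scores_b[key]
        if diff > thresholds[key]:
            return "a"
        if -diff > thresholds[key]:
            return "b"
    return "both"


# Made-up scores: the two transcripts differ only in the first score.
equal_scores = dict.fromkeys(E_LABELS, 0.5)
tx_a = dict(equal_scores, e_1=1.0)
tx_b = dict(equal_scores, e_1=0.8)

strict = compare_transcripts(tx_a, tx_b, THRESHOLDS)
relaxed = compare_transcripts(tx_a, tx_b, dict(THRESHOLDS, e_1=0.25))
print(strict, relaxed)
```

Under this assumed rule, e_1 = 0.0 removes the second transcript for any difference in the first score, while raising e_1 to 0.25 tolerates the 0.2 gap and keeps both. That matches the advice above that larger e values retain more alternative isoforms per gene.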

Gaius-Augustus/TSEBRA#13
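
The support filter described in the quoted answer can be sketched in a similar way. This is again a hedged illustration, not TSEBRA's code: the function `passes_support_filter`, its arguments, and the treatment of transcripts without introns are assumptions. Only the parameter name `intron_support` and the rule "either all introns or the stop codon supported" are taken from the answer.

```python
# Illustrative sketch only; this is not TSEBRA's actual implementation.

def passes_support_filter(supported_introns, total_introns,
                          stop_supported, intron_support=1.0):
    """Keep a transcript if enough introns OR its stop codon is supported.

    intron_support is the minimum fraction of introns that must be
    backed by extrinsic evidence. The default of 1.0 mirrors the
    "all introns" requirement described for the long-read configuration.
    """
    if total_introns > 0:
        intron_fraction = supported_introns / total_introns
    else:
        # Assumption: a transcript without introns has no intron support
        # and must rely on its stop codon.
        intron_fraction = 0.0
    return intron_fraction >= intron_support or bool(stop_supported)


# Made-up example: 3 of 4 introns supported, stop codon unsupported.
kept_default = passes_support_filter(3, 4, stop_supported=False)
kept_relaxed = passes_support_filter(3, 4, stop_supported=False,
                                     intron_support=0.75)
print(kept_default, kept_relaxed)
```

In this sketch, lowering `intron_support` from 1.0 to 0.75 lets the example transcript through, which is the mechanism by which decreasing the value could raise gene-level sensitivity.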