sestaton/Transposome

Reformat RepeatModeler-based library

davidecarlson opened this issue · 6 comments

Hi @sestaton

I have a repeat library issue very similar to one of the comments from another recent issue (#37).
I'm attempting to run Transposome on a set of Illumina short read sequences from several plant species. I have a good quality reference genome from a related species, from which I created a de novo repeat library using a combination of RepeatModeler, LTRharvest/LTRdigest, and TransposonPSI. The repeats were classified using the RepeatClassifier tool. This produces a FASTA header that looks something like:

my_sequence_info#TE_superfamily/TE

So an example header might be:

sequence1#LTR/Gypsy
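
To make the header layout concrete, here is a minimal Python sketch that splits a header of this shape into its parts. It is only an illustration of the input format described above: the function name and the labels `major` and `minor` are my own (hypothetical), and it does not reproduce what format_database.pl or Transposome do internally.

```python
def parse_classified_header(header):
    """Split 'name#A/B' into (name, A, B).

    'major' and 'minor' are illustrative labels for the text before and
    after the slash; they are not names used by any of the tools.
    """
    # Drop the leading '>' and any description after the first whitespace.
    fields = header.lstrip(">").split()
    token = fields[0] if fields else ""

    # Everything before '#' is the sequence name.
    name, sep, classification = token.partition("#")
    if not sep:
        # No classification present in this header.
        return name, None, None

    # The classification may or may not contain a '/'.
    major, _, minor = classification.partition("/")
    return name, major or None, minor or None


print(parse_classified_header(">sequence1#LTR/Gypsy"))
# ('sequence1', 'LTR', 'Gypsy')
```

Headers without a `#` or without a `/` come back with `None` in the missing positions, which makes unclassified entries easy to spot before any reformatting.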

This is obviously not the same as the RepBase format. I tried using the format_database.pl utility to reformat my headers, but I just ended up getting the same error message as listed in issue #37.

Ideally, I would like to be able to use my custom repeat library in the Transposome analysis. Do you have any suggestions for how I might reformat my repeat library headers to be used with Transposome?

Thanks very much for your time!
Dave

I will adjust that script to handle this format. Can you attach the file so I can test it? You can email it if you don't want to share it on here.

Thanks for the message.

Hi Dave,

Please try the format_database.pl script on your file again. Note that it will generate warnings but these can be ignored.

You will have to pull the latest release or code from the master branch because I've updated the annotation methods. Also, it would be helpful to see the log and results from your analysis because I cannot test without some sequence reads. You can email those files if you want (and don't mind).

Thanks,
Evan

To be clear, what I meant was that you will need to update your Transposome installation as well as get the latest version of that script in the separate transposome-scripts repo. The script uses Transposome methods, which have been updated for this issue.

Thanks.

Hi Evan,

Thanks for the update!
I used the new format_database.pl script to reformat my repeat library, then I reinstalled Transposome and ran the analysis on a very small (20K reads) dataset. The summary of my results is below.

Things seemed to work pretty well, although there is a large disparity between the total repeat fraction and the annotated repeat fraction. This could be due to the small number of reads I used, or it could be related to my library. I will email a tar file with the full results.
Thanks again!
Dave

Results - Total number of clustered reads: 11481.
Transposome::Annotation::annotate_clusters started at: 31-05-2017 17:18:53.
Transposome::Annotation::annotate_clusters completed at: 31-05-2017 17:30:12.
Results - Total sequences: 20000
Results - Total sequences clustered: 11481
Results - Total sequences unclustered: 8519
Results - Repeat fraction from clusters: 0.57405
Results - Singleton repeat fraction: 0.566498415306961
Results - Total repeat fraction: 0.81535
Results - Total repeat fraction from annotations: 0.574049999697482
Transposome::Annotation::clusters_annotation_to_summary started at: 31-05-2017 17:30:12.
Transposome::Annotation::clusters_annotation_to_summary completed at: 31-05-2017 17:30:12.
======== Transposome completed at: 31-05-2017 17:30:13. Elapsed time: 11 minutes, 27 seconds.
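
As a sanity check on the summary above, the following Python sketch recomputes the fractions from the logged counts. The relationships used here are inferred from the numbers in this log only; they are an assumption about how the values relate, not taken from the Transposome source, and the variable names are my own.

```python
# Values copied from the log above.
total_reads = 20000
clustered_reads = 11481
singleton_repeat_fraction = 0.566498415306961

# Reads that did not fall into any cluster.
unclustered_reads = total_reads - clustered_reads

# Fraction of all reads that were clustered.
cluster_fraction = clustered_reads / total_reads

# Assumption: the singleton repeat fraction is relative to the
# unclustered reads, so this recovers a whole number of reads.
singleton_repeats = round(singleton_repeat_fraction * unclustered_reads)

# Assumption: total repeat fraction counts clustered reads plus
# repetitive singletons, relative to all reads.
total_repeat_fraction = (clustered_reads + singleton_repeats) / total_reads

print(unclustered_reads)       # 8519
print(cluster_fraction)        # 0.57405
print(singleton_repeats)       # 4826
print(total_repeat_fraction)   # 0.81535
```

Under these assumptions the recomputed values match the log, and the gap between the total repeat fraction (0.81535) and the annotated fraction (about 0.57405) equals the singleton contribution, 4826 / 20000 = 0.2413.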

Great! I'll follow up about the results through email. I'll close this, but please comment if there are any issues.