Reformat RepeatModeler-based library

Question

davidecarlson opened this issue 8 years ago · 6 comments

I have a repeat library issue very similar to one of the comments from another recent issue (#37).
I'm attempting to run Transposome on a set of Illumina short-read sequences from several plant species. I have a good-quality reference genome from a related species, from which I created a de novo repeat library using a combination of RepeatModeler, LTRharvest/LTRdigest, and TransposonPSI. The repeats were classified using RepeatModeler's RepeatClassifier tool. This produces a FASTA header that looks something like:

my_sequence_info#TE_superfamily/TE

So an example header might be:

sequence1#LTR/Gypsy

This is obviously not the same as the RepBase format. I tried using the format_database.pl utility to reformat my headers, but I just ended up getting the same error message as listed in issue #37.

Ideally, I would like to be able to use my custom repeat library in the Transposome analysis. Do you have any suggestions for how I might reformat my repeat library headers to be used with Transposome?
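For reference, the header layout described above can be split into its parts with a few lines of Python. This is only a minimal sketch for inspecting a library before reformatting it; it is not what format_database.pl does, and the names `RepeatHeader`, `parse_header`, `tally_classifications`, `te_class`, and `te_family` are hypothetical labels chosen for illustration. It assumes the identifier is the first whitespace-delimited token of the header and that the classification follows a single `#`.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class RepeatHeader(NamedTuple):
    """Parts of a 'name#Class/Family' style header (field names are illustrative)."""
    name: str
    te_class: str
    te_family: Optional[str]


def parse_header(line: str) -> RepeatHeader:
    """Split a header such as '>sequence1#LTR/Gypsy' into its parts."""
    tokens = line.lstrip(">").split(None, 1)
    if not tokens:
        raise ValueError("empty header line")
    # Only the first token is treated as the identifier; any trailing
    # description after whitespace is ignored.
    name, sep, classification = tokens[0].partition("#")
    if not sep or not name or not classification:
        raise ValueError(f"header is not in name#Class/Family form: {line!r}")
    te_class, _, te_family = classification.partition("/")
    # A classification without a slash (for example 'Unknown') has no family part.
    return RepeatHeader(name, te_class, te_family or None)


def tally_classifications(lines: Iterable[str]) -> Counter:
    """Count (class, family) pairs over the header lines of a FASTA file."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            header = parse_header(line)
            counts[(header.te_class, header.te_family)] += 1
    return counts
```

Passing an open file handle for the library to `tally_classifications` gives a quick overview of which classifications are present, which can help spot headers that do not follow the expected layout.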

Thanks very much for your time!
Dave

Answer 1 · 2017-05-26T16:53:55.000Z

I will adjust that script to handle this format. Can you attach the file so I can test it? You can email it if you don't want to share it on here.

Thanks for the message.

Answer 2 · 2017-05-26T17:13:51.000Z

Dear Evan,

I have attached a gzipped copy of my custom repeat library. It's larger than 25 MB (37 MB), so I am sending it via Google Drive. Please let me know if you have any problems accessing it.

Thanks again for your help!
Dave

oelata_combined_repeat_lib.fa.gz

-- Dave Carlson Doctoral Student Ecology and Evolution Department Stony Brook University

Answer 3 · 2017-05-31T20:34:58.000Z

Hi Dave,

Please try the format_database.pl script on your file again. Note that it will generate warnings, but these can be ignored.

You will have to pull the latest release or code from the master branch because I've updated the annotation methods. Also, it would be helpful to see the log and results from your analysis because I cannot test without some sequence reads. You can email those files if you want (and don't mind).

Thanks,
Evan

Answer 4 · 2017-05-31T21:09:06.000Z

To be clear, what I meant was that you will need to update your Transposome installation as well as get the latest version of that script in the separate transposome-scripts repo. The script uses Transposome methods, which have been updated for this issue.

Thanks.

Answer 5 · 2017-05-31T23:20:51.000Z

Hi Evan,

Thanks for the update!
I used the new format_database.pl script to reformat my repeat library, then I reinstalled Transposome and ran the analysis on a very small (20K reads) dataset. The summary of my results is below.

Things seemed to work pretty well. There is a large disparity between the total repeat fraction and the annotated repeat fraction, but this could be due to the small number of reads I used (or to my library). I will email a tar file with the full results.
Thanks again!
Dave

Results - Total number of clustered reads: 11481.
Transposome::Annotation::annotate_clusters started at: 31-05-2017 17:18:53.
Transposome::Annotation::annotate_clusters completed at: 31-05-2017 17:30:12.
Results - Total sequences: 20000
Results - Total sequences clustered: 11481
Results - Total sequences unclustered: 8519
Results - Repeat fraction from clusters: 0.57405
Results - Singleton repeat fraction: 0.566498415306961
Results - Total repeat fraction: 0.81535
Results - Total repeat fraction from annotations: 0.574049999697482
Transposome::Annotation::clusters_annotation_to_summary started at: 31-05-2017 17:30:12.
Transposome::Annotation::clusters_annotation_to_summary completed at: 31-05-2017 17:30:12.
======== Transposome completed at: 31-05-2017 17:30:13. Elapsed time: 11 minutes, 27 seconds.
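The figures in this log are arithmetically consistent with one another, which may help when reading the gap between the two repeat fractions. The sketch below is inferred only from the numbers printed above, not from Transposome's source, so the formula is an assumption: it treats the total repeat fraction as the clustered reads plus the share of unclustered reads counted as repeats, divided by all reads. The function names are hypothetical.

```python
def cluster_fraction(clustered: int, total: int) -> float:
    """Fraction of all reads that were placed in clusters."""
    return clustered / total


def total_repeat_fraction(clustered: int, unclustered: int,
                          singleton_fraction: float) -> float:
    """Assumed relationship: clustered reads plus repeat-like singletons over all reads."""
    total = clustered + unclustered
    return (clustered + singleton_fraction * unclustered) / total


# Values copied from the log above.
clustered, unclustered = 11481, 8519
singleton_fraction = 0.566498415306961

print(cluster_fraction(clustered, clustered + unclustered))
print(total_repeat_fraction(clustered, unclustered, singleton_fraction))
```

Under that assumption, 11481 / 20000 reproduces the reported 0.57405, and adding the singleton contribution (about 4826 of the 8519 unclustered reads) reproduces the reported 0.81535. The difference between the total and annotated fractions would then correspond to the singleton reads rather than to unannotated clusters.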

Answer 6 · 2017-06-01T01:30:14.000Z

Great! I'll follow up about the results through email. I'll close this, but please comment if there are any issues.