Consensus Sequences

Question

Closed this issue 2 years ago · 4 comments

roperete commented 2 years ago

Dear Kevin,

I hope your summer is going well!

I have a question about reasonaTE:

Is there a possibility to retrieve the consensus sequences used for each transposon class?

Thanks :)
Kind regards, Alvaro

Answer 1 · 2022-07-23T19:18:13.000Z

Dear Alvaro,
summer is going well, I hope the same for you, even though it's too hot this year, at least for my gusto ;-).

To your question:
Yes, you can find it in our Zenodo Database.
https://zenodo.org/record/5518085#.YtxIZ4TP2Uk
"Classification.zip" essentially contains one large FASTA file, "TransposonDB.fasta".
This FASTA file contains all sequences, and each header also states the class (according to our proposed taxonomy).
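
A minimal sketch for looking at the headers of such a FASTA file in Python, using only the standard library. The exact header layout of "TransposonDB.fasta" is not described in this thread, so the sketch only splits records into (header, sequence) pairs and makes no assumption about where the class sits inside the header; the function name `read_fasta` is made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    The header is returned without the leading '>'; multi-line
    sequences are joined into a single string.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage (path is an assumption: wherever Classification.zip was unpacked):
#
#     with open("TransposonDB.fasta") as handle:
#         for header, sequence in read_fasta(handle):
#             print(header, len(sequence))
```

Printing a handful of headers this way should show how the class is encoded before writing any parsing logic for it.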

If you are interested in how exactly we created this large database and unified the taxonomy, you can refer to our paper: https://academic.oup.com/nar/article/50/11/e64/6541023
I think this excerpt essentially describes it:

Previous studies used small transposon sequence databases, each with different taxonomic schemes, which does not allow for a direct comparison. Therefore, we created TransposonDB (Figure 3, File F1), a large collection of transposon sequences that consists of ten databases: ConTEdb (53) (http://genedenovoweb.ticp.net:81/conTEdb/index.php), DPTEdb (54) (http://genedenovoweb.ticp.net:81/DPTEdb/browse.php?species=cpa&name=Carica_papaya_L.), mipsREdat-PGSB (55) (https://pgsb.helmholtz-muenchen.de/plant/recat/index.jsp), MnTEdb (56) (http://genedenovoweb.ticp.net:81/MnTEdb1/), PMITEdb (57) (http://pmite.hzau.edu.cn/download_mite/), RepBase (58) (https://www.girinst.org/repbase/, we use version 23.08 that was the last publicly available version before the paywall was introduced), RiTE (59) (https://www.genome.arizona.edu/cgi-bin/rite/index.cgi), Soyetedb (60) (https://www.soybase.org/soytedb/#bulk), SPTEDdb (61) (http://genedenovoweb.ticp.net:81/SPTEdb/browse.php?species=ptr&name=Populus_trichocarpa) and TrepDB (62) (http://botserv2.uzh.ch/kelldata/trep-db/downloadFiles.html). To create the database, the taxonomies were unified, duplicates were dropped and several filter rules were applied (Supplementary Table S1). Filtering included the removal of sequences with no label, the exclusion of fragments, contigs, satellites and RNA sequences. Moreover, only sequences with a length >100 bp and those including at least once each of the letters ‘A’, ‘C’, ‘G’ and ‘T’ were kept. To the best of our knowledge, this is the largest database of transposon sequences available. Since TransposonDB covers all relevant Eukaryotic kingdoms, it allows for the training and evaluation of a robust, cross-species hierarchical classification model (Supplementary Tables S2 and S3). Moreover, the database is balanced and covers sufficient examples for all taxonomic nodes (Supplementary Table S4). However, TransposonDB is still likely to be biased as most of the TEs are from plant genomes.
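
As an illustration only, the two sequence-level rules from the excerpt (length >100 bp, and each of A, C, G and T present at least once) could be sketched in Python as below. This is not the code used to build TransposonDB; the function name and the case-insensitive comparison are assumptions, and the other rules (missing labels, fragments, contigs, satellites, RNA sequences, duplicates) are not covered.

```python
REQUIRED_BASES = set("ACGT")


def passes_basic_filters(sequence: str, min_length: int = 100) -> bool:
    """Return True if the sequence is longer than min_length and
    contains each of A, C, G and T at least once.

    Lowercase input is upper-cased first (an assumption; the excerpt
    does not say how case is handled).
    """
    seq = sequence.upper()
    return len(seq) > min_length and REQUIRED_BASES.issubset(seq)
```

Note that the length check is strict, so a sequence of exactly 100 bp would be dropped under this reading of ">100 bp".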

Best regards,
Kevin

Answer 2 · 2022-07-25T13:26:19.000Z

Dear Kevin,

Thanks for the detailed and quick response!
This summer is indeed a bit too Mediterranean across the globe. I hope it serves as a wake-up call! In the meantime, time to drink delicious gazpacho! ;)

Then, I can see that TransposonDB.fasta contains a bunch of consensus sequences, with multiple of them assigned to a given taxonomic class (according to Wicker et al.). Thus there is not a single consensus seq. for each class, rather a lot of them in the database, if I understand correctly.

However RFSB is proof-checking the annotation done by reasonaTE, but there is not a (visible/stored) consensus generated by the de novo tools in reasonaTE, right?

Sorry for my confusion, still a newbie in the TE world :')

Thanks a bunch!

Answer 3 · 2022-07-26T12:31:38.000Z

Hey Alvaro,
I love tomato Gazpacho ;-).

Thus there is not a single consensus seq. for each class, rather a lot of them in the database, if I understand correctly.

Exactly. The sequences originate from many different databases; some are actual sequenced elements, some are consensus sequences. Moreover, you should know how we train our classifier. For instance, to train a classifier to distinguish between class 1/1 and class 1/2 transposons, we use all sequences in the DB that are labeled 1/1 or 1/2, or any subclass of these two classes.
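
To make the selection of training data concrete, here is a small Python sketch. It assumes that class labels are slash-separated paths in the taxonomy (so that "1/1/2" would be a subclass of "1/1"), which is inferred from the "1/1" and "1/2" notation above and not confirmed for the actual database; the function names are hypothetical and this is not the RFSB training code.

```python
from typing import Iterable, List, Tuple


def in_subtree(label: str, node: str) -> bool:
    """True if label is the node itself or one of its subclasses."""
    return label == node or label.startswith(node + "/")


def binary_training_set(
    records: Iterable[Tuple[str, str]], node_a: str, node_b: str
) -> List[Tuple[str, str]]:
    """Collect (sequence, target) pairs for a node_a vs. node_b classifier.

    records holds (label, sequence) pairs; sequences outside both
    subtrees are ignored.
    """
    examples = []
    for label, sequence in records:
        if in_subtree(label, node_a):
            examples.append((sequence, node_a))
        elif in_subtree(label, node_b):
            examples.append((sequence, node_b))
    return examples
```

The check appends a "/" before comparing prefixes so that a label such as "1/10" is not mistaken for a subclass of "1/1".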

However RFSB is proof-checking the annotation done by reasonaTE, but there is not a (visible/stored) consensus generated by the de novo tools in reasonaTE, right?

RFSB is not proof-checking. reasonaTE annotates, and RFSB then classifies each annotation into the taxonomy. RFSB does not decide whether an annotation is correct or not; it decides which class the annotated transposon most likely belongs to.
Some tools (as far as I remember, only RepeatModeler and RepeatMasker) provide a classification themselves, while others do not, as they are class-specific anyway (e.g. MiteHunter will hopefully only annotate MITE transposons ^^). If you want to know how RepeatMasker and RepeatModeler classified their findings, you can check the many output files they generate in their respective folders.

Did this clarify some of your questions?

Best, Kevin

Answer 4 · 2022-07-26T13:14:39.000Z

Hi Kevin,

This fills some gaps I had in my understanding of how the pipeline works.

Thanks a lot!