DrosophilaGenomeEvolution/TrEMOLO

Database of TE


Hello developers, I don’t know where to download a TE_DB for a species such as Homo sapiens. Could you suggest a way to obtain a TE database?
Thank you

TE_DB: "/path/to/database_TE.fasta" #Database of TE (a fasta file) [required]

Hi,

There are several ways to obtain a transposable element (TE) database. For instance, you can download the EMBL file from Dfam and convert it to a FASTA file, keeping only the sequences of interest, using a simple Python script (it relies on Biopython):

from Bio import SeqIO  # Biopython

input_embl = 'Dfam.embl'
output_fasta = 'Dfam_homo_sapiens.fasta'

count = 0  # running total of records written
with open(input_embl, "r") as input_handle, open(output_fasta, "w") as output_handle:
    for record in SeqIO.parse(input_handle, "embl"):
        # Keep only records whose organism annotation mentions Homo sapiens
        if "Homo sapiens" in record.annotations.get("organism", ""):
            # SeqIO.write returns the number of records written (1 here)
            count += SeqIO.write(record, output_handle, "fasta")

print(f"Number of sequences converted: {count}.")

You can also check RepBase.

Another method is to use RepeatMasker on the human genome and retrieve the masked sequences to use them as a database.
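As a rough illustration of the RepeatMasker route, the sketch below reads the annotation lines of a RepeatMasker `.out` file and cuts the corresponding intervals out of the genome. It assumes the usual whitespace-separated `.out` column layout (query name in column 5, begin and end in columns 6 and 7, strand in column 9 with `C` meaning complement, repeat name in column 10) and a genome already loaded as a dictionary of sequences. The function names, the header format, and the `min_len` threshold are hypothetical choices for this example, not part of TrEMOLO or RepeatMasker.

```python
# Translation table used to reverse-complement hits on the complement strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_repeatmasker_out(lines):
    """Yield (seq_id, start, end, strand, repeat_name) for each annotation line."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header lines, whose first field is not a score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        seq_id = fields[4]
        start, end = int(fields[5]), int(fields[6])  # 1-based, inclusive
        strand = "-" if fields[8] == "C" else "+"
        yield seq_id, start, end, strand, fields[9]


def extract_hits(genome, hits, min_len=100):
    """Return {header: sequence} for hits of at least min_len bases.

    genome is a dict mapping sequence IDs to sequence strings
    (an assumption of this sketch; load it however you prefer).
    """
    extracted = {}
    for seq_id, start, end, strand, name in hits:
        if seq_id not in genome or end - start + 1 < min_len:
            continue
        seq = genome[seq_id][start - 1:end]  # convert to 0-based slicing
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        extracted[f"{name}|{seq_id}:{start}-{end}({strand})"] = seq
    return extracted


def to_fasta_text(extracted):
    """Format the extracted sequences as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in extracted.items())
```

The resulting FASTA text can be written to a file and given as TE_DB. Short or fragmented hits are common in RepeatMasker output, so a length filter such as `min_len` is usually worth tuning to your data.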

However, be aware that TE databases might be incomplete or contain irrelevant sequences. This depends on the context of your research and your objectives.

Note also that the database does not have to contain only TE sequences. The goal is to provide the sequences you want to search for in a sample, that is, sequences that would be absent from your reference genome.
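Since the database is just a FASTA file, one way to combine TE sequences with any other sequences of interest is to merge several FASTA sources into one file while dropping duplicate IDs. The following is a minimal sketch under that assumption; the helper names (`read_fasta`, `merge_databases`, `to_fasta`) and the optional `min_len` filter are hypothetical and only illustrate the idea.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_databases(*fasta_texts, min_len=0):
    """Merge FASTA sources, keeping the first record seen for each ID."""
    seen, merged = set(), []
    for text in fasta_texts:
        for header, seq in read_fasta(text):
            seq_id = header.split()[0]  # ID is the first word of the header
            if seq_id in seen or len(seq) < min_len:
                continue
            seen.add(seq_id)
            merged.append((seq_id, seq))
    return merged


def to_fasta(records, width=60):
    """Format (seq_id, sequence) records as FASTA text with wrapped lines."""
    lines = []
    for seq_id, seq in records:
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

For example, the text of a Dfam-derived FASTA file and of a file holding other candidate sequences could both be passed to `merge_databases`, and the output of `to_fasta` written to the path given as TE_DB.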