HamleTE: a deep learning powered tool to annotate and classify transposable elements

Introduction
Install
Usage
- Basic usage
- Docker container
Output example
Questions, issues and requests
Updating

Latest updates

- 2024-03-29: Dockerfile update to use vsearch.

- 2024-03-29: Manual explaining in more detail how the program works.

- 2024-03-18: Improved LTR superfamily model.

- 2024-03-09: Improved class I/class II model.

- 2024-03-07: Check for ORFs in sequences option added.

- 2024-03-07: Improved LTR/non-LTR model.

- 2024-03-06: Calculate sequence entropy to filter out low complexity sequences.

- 2024-03-04: New mode 'r' - Use RepeatScout to find repeats.

- 2024-03-02: New non-LTR model - classification between LINE and SINE.

- 2024-02-13: New ClassI/ClassII model.

- 2024-02-07: New TE/non-TE model.

- 2024-02-02: Clustering software changed from cd-hit-est to vsearch; no clustering by default.

- 2024-02-01: Python version to 3.10.12; Tensorflow to 2.13; Models update; k-mer length to 14.

- 2023-10-24: Total base counts by TE group added to CNT table.

- 2023-10-24: Function to replace non "ACTGN" bases optimized.

- 2023-09-08: Video tutorials for installation and usage.

- 2023-08-04: Default cutoff value set to 0.5.

- 2023-08-04: Prediction table containing the accuracy for all classification levels.

Introduction

HamleTE is a deep learning-based tool with a workflow for finding and classifying transposable elements (TEs) in eukaryotic genomes. It uses Red to find genomic repeats and then relies on feature extraction by convolutional neural networks: six models classify each sequence as TE or non-TE and then assign the sequences classified as TEs down to the superfamily level.

Install

HamleTE can be installed either by creating a conda environment or manually. The first step is to download or clone this repo. To clone it run:

git clone --depth 1 https://github.com/Tiago-Minuzzi/HamleTE

Decompress the hamlete_models.tar.xz in the models directory using your favorite application or via command-line using:

tar xJvf hamlete_models.tar.xz
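If you would rather script this step, the same extraction can be sketched with Python's standard tarfile module. The function name and the assumed archive location (a models directory inside the repository) are illustrative and not part of HamleTE itself.

```python
import tarfile
from pathlib import Path


def extract_models(archive: Path, dest: Path) -> list[str]:
    """Extract a .tar.xz archive into dest and return the member names."""
    with tarfile.open(archive, mode="r:xz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names


if __name__ == "__main__":
    # Assumption: the script is run from HamleTE's directory and the
    # archive sits in the "models" subdirectory.
    models_dir = Path("models")
    archive = models_dir / "hamlete_models.tar.xz"
    if archive.exists():
        for name in extract_models(archive, models_dir):
            print(name)
    else:
        print(f"{archive} not found; run this from HamleTE's directory")
```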

After that, you can install the dependencies through a conda environment or manually.

Here is a video tutorial to install HamleTE: Installation video.

Conda environment

If you don't have conda installed, you can check how to install it on Miniconda's webpage or watch the installation tutorial for Linux here: Video tutorial link. With conda installed on your system, you can create a conda environment containing all dependencies by running:

conda env create -f hamlete_env.yml

Then, you can enter the conda environment with the command:

conda activate hamleTE

Download conda for Linux by clicking here.

Docker

To run HamleTE using a docker container, first build the image using the Dockerfile. Inside HamleTE's directory run:

docker build -t hamlete .

Newer versions of Docker use buildx to build images, so you may need to install it.

Example in Debian/Ubuntu based systems:

sudo apt install -y docker-buildx && docker buildx install

Then you can build the image using:

docker buildx build -t hamlete .

Manually

If you prefer a conda-free installation, it can be done manually by installing the dependencies below:

Python=3.10.12
biopython=1.81
h5py=3.9.0
Keras=2.13.1
numpy=1.24.3
orffinder=1.8
pandas=1.3.4
seqshannon=1.0.0
scikit-learn=1.2.2
scipy=1.10.1
tensorflow=2.13.0
vsearch=2.27.0
tomli=2.0.1
tqdm=4.64.1
protobuf=4.24.0
Red=2.0

For the manual installation, we suggest using a Python version management tool such as Pyenv together with a virtual environment to avoid dependency conflicts. You can run pip install -r requirements.txt to install the Python packages needed.
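After a manual installation, it can help to confirm that the installed Python packages match the pinned versions. The sketch below uses importlib.metadata from the standard library. It assumes that each name in the list above is also the package's distribution name on PyPI, which may not hold for every entry. Python itself, vsearch and Red are left out because they are not pip packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the dependency list (Python packages only).
# Assumption: these names are the PyPI distribution names.
PINNED = {
    "biopython": "1.81",
    "h5py": "3.9.0",
    "keras": "2.13.1",
    "numpy": "1.24.3",
    "orffinder": "1.8",
    "pandas": "1.3.4",
    "seqshannon": "1.0.0",
    "scikit-learn": "1.2.2",
    "scipy": "1.10.1",
    "tensorflow": "2.13.0",
    "tomli": "2.0.1",
    "tqdm": "4.64.1",
    "protobuf": "4.24.0",
}


def check_versions(pinned: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for each package that is missing or differs."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (wanted {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, wanted {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(PINNED)
    for name, problem in sorted(issues.items()):
        print(f"{name}: {problem}")
    if not issues:
        print("All pinned Python packages match.")
```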

To install Red, clone Red's github repository, change the name of your C++ compiler inside Red's makefile and compile the program.

To install vsearch, download the vsearch source and build it with the following commands:

wget https://github.com/torognes/vsearch/archive/v2.27.0.tar.gz
tar xzf v2.27.0.tar.gz
cd vsearch-2.27.0
./autogen.sh
./configure
make
make install  # as root or sudo make install
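Once Red and vsearch are built, a quick way to confirm that both are reachable is to look them up on the PATH. This is a minimal sketch using shutil.which. The executable names "Red" and "vsearch" are assumptions about how the binaries are named on your system.

```python
import shutil


def missing_tools(tools=("Red", "vsearch")):
    """Return the executables from tools that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Red and vsearch are both available.")
```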

Usage

The annotation mode is the default and is used to find TEs in genomes or transcriptomes. There is also a repeats mode, which uses RepeatScout to find repeats instead of Red. If you have a set of sequences or a TE library that you just want to classify, you can use the classifier mode by changing the mode flag. Below are the available options.

usage: hamleTE.py [-h] -f FASTA [-m MODE] [-c CUTOFF] [-k LABEL_CUTOFF]
                  [-b BATCH_VALUE] [-o OUTPUT_DIR] [-l LEN_KMER]
                  [--min_len MIN_LEN] [--clust] [--orf] [--nobar]

Find repeats in eukaryotic genomes and classify them using deep learning.

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        Genome or repeats/TEs fasta file.
  -m MODE, --mode MODE  Type (without quotation marks) 'a' for annotation mode,
                        'r' for 'repeats' mode using RepeatScout or
                        'c' for classifier mode. Default = a.
  -c CUTOFF, --cutoff CUTOFF
                        Cutoff value for TE identification. Value must be
                        between 0 and 1. Default = 0.9.
  -k LABEL_CUTOFF, --label_cutoff LABEL_CUTOFF
                        Cutoff value for TE classification. Value
                        must be between 0 and 1. Default = 0.5.
  -b BATCH_VALUE, --batch_value BATCH_VALUE
                        Set batch size. Default = 32, max = 250.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Set output directory to save results.
  -l LEN_KMER, --len_kmer LEN_KMER
                        Length of k-mer to find repeats in genomes. Default =
                        14.
  --min_len MIN_LEN     Minimum repeat sequence length. Default = 200.
  --clust               Cluster repeats. Slows down analysis considerably,
                        but reduces redundancy.
  --orf                 Check for sequences containing open reading frames (ORFs).
  --nobar               Disable progress bar.
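When HamleTE is called from a pipeline, it can be convenient to check the documented ranges before launching a run. The helper below is a hypothetical wrapper, not part of HamleTE: it builds the argument list from the options listed above and rejects values outside the stated limits. Treating the cutoff bounds as inclusive and using 1 as the smallest batch size are assumptions.

```python
import shlex


def build_command(fasta, mode="a", cutoff=0.9, label_cutoff=0.5,
                  batch_value=32, output_dir=None, clust=False, orf=False):
    """Build an argument list for hamleTE.py, checking the documented ranges."""
    if mode not in {"a", "r", "c"}:
        raise ValueError("mode must be 'a', 'r' or 'c'")
    for name, value in (("cutoff", cutoff), ("label_cutoff", label_cutoff)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    # The help text gives max = 250; the lower bound of 1 is an assumption.
    if not 1 <= batch_value <= 250:
        raise ValueError("batch_value must be between 1 and 250")

    cmd = ["python3", "hamleTE.py", "-f", str(fasta), "-m", mode,
           "-c", str(cutoff), "-k", str(label_cutoff), "-b", str(batch_value)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if clust:
        cmd.append("--clust")
    if orf:
        cmd.append("--orf")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command("genome.fasta", mode="r", clust=True)))
```

The returned list can be passed directly to subprocess.run once the printed command looks right.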

Basic usage

After activating HamleTE's conda environment (using conda activate hamlete), for genomes or transcriptomes, you can simply run:

python3 hamleTE.py -f genome.fasta

Video tutorial: Running annotation mode.

Clustering of repeats is disabled by default. To cluster the repeats and reduce redundancy, use the --clust flag. Example:

python3 hamleTE.py -f genome.fasta --clust

To use the repeats mode change the mode flag as follows:

python3 hamleTE.py -m r -f genome.fasta

To use the classifier mode change the mode flag as follows:

python3 hamleTE.py -m c -f my_TE_set.fasta

Video tutorial: Running classification mode.

Docker container

To run the docker container version, mount the directory containing your fasta files inside the container using the -v flag.

docker run -v /path/to/my/directory:/mnt -it hamlete hamleTE.py -f /mnt/genome.fasta -o /mnt/out_flow

Output example

Annotation mode

id	start-end	length	prediction_1	accuracy_1	prediction_2	accuracy_2	prediction_3	accuracy_3	prediction_final	accuracy_final
chrom1	4852-4968	117	TE	0.999	Retro	0.998	LTR	1.0	Gypsy	0.809
chrom2	88-1423	1336	TE	0.907	Retro	0.956	nonLTR	1.0	LINE	0.841
chrom3	1-1906	1906	TE	0.983	DNA	0.994	DNA	0.994	Tc1-Mariner	0.952
chrom4	1-1579	1579	TE	0.941	DNA	0.966	DNA	0.966	Helitron	0.979
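The annotation table can be loaded with Python's csv module for downstream work. The sketch below assumes the file is tab-separated, as the example suggests, and uses two of the rows above as inline data. The function name is illustrative.

```python
import csv
import io

# Two rows copied from the example above; columns are tab-separated.
EXAMPLE = (
    "id\tstart-end\tlength\tprediction_1\taccuracy_1\tprediction_2\taccuracy_2\t"
    "prediction_3\taccuracy_3\tprediction_final\taccuracy_final\n"
    "chrom1\t4852-4968\t117\tTE\t0.999\tRetro\t0.998\tLTR\t1.0\tGypsy\t0.809\n"
    "chrom2\t88-1423\t1336\tTE\t0.907\tRetro\t0.956\tnonLTR\t1.0\tLINE\t0.841\n"
)


def read_annotations(handle):
    """Yield one dict per row, with coordinates and length converted to int."""
    for row in csv.DictReader(handle, delimiter="\t"):
        start, end = (int(x) for x in row.pop("start-end").split("-"))
        row["start"], row["end"] = start, end
        row["length"] = int(row["length"])
        row["accuracy_final"] = float(row["accuracy_final"])
        yield row


if __name__ == "__main__":
    for rec in read_annotations(io.StringIO(EXAMPLE)):
        # In the example rows, end - start + 1 equals length, which suggests
        # 1-based inclusive coordinates.
        assert rec["end"] - rec["start"] + 1 == rec["length"]
        print(rec["id"], rec["start"], rec["end"], rec["prediction_final"])
```

To read a real result file, replace io.StringIO(EXAMPLE) with an open file handle.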

Classification mode

id	prediction_1	accuracy_1	prediction_2	accuracy_2	prediction_3	accuracy_3	prediction_final	accuracy_final
Seq_430	TE	1.0	Retro	0.582	LTR	0.516	Gypsy	0.999
Seq_835	TE	0.792	DNA	0.89	DNA	0.89	Tc1-Mariner	1.0
Seq_328	TE	0.966	Retro	1.0	LTR	1.0	Copia	0.705
Seq_102	TE	0.99	Retro	0.9	nonLTR	1.0	LINE	0.966
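Because the table reports a value for every classification level, a sequence can score high at the final level while being uncertain at an earlier one (Seq_430 above is an example). The sketch below is a hypothetical post-processing step, separate from HamleTE's own -c and -k cutoffs: it keeps only the sequences whose value meets a chosen threshold at every level. It assumes a tab-separated file.

```python
import csv
import io

# Rows copied from the classification mode example; tab-separated.
EXAMPLE = (
    "id\tprediction_1\taccuracy_1\tprediction_2\taccuracy_2\t"
    "prediction_3\taccuracy_3\tprediction_final\taccuracy_final\n"
    "Seq_430\tTE\t1.0\tRetro\t0.582\tLTR\t0.516\tGypsy\t0.999\n"
    "Seq_835\tTE\t0.792\tDNA\t0.89\tDNA\t0.89\tTc1-Mariner\t1.0\n"
    "Seq_328\tTE\t0.966\tRetro\t1.0\tLTR\t1.0\tCopia\t0.705\n"
    "Seq_102\tTE\t0.99\tRetro\t0.9\tnonLTR\t1.0\tLINE\t0.966\n"
)

LEVELS = ("1", "2", "3", "final")


def weakest_level(row):
    """Return (level, accuracy) for the lowest-scoring classification level."""
    scores = {level: float(row[f"accuracy_{level}"]) for level in LEVELS}
    level = min(scores, key=scores.get)
    return level, scores[level]


def confident_ids(handle, threshold):
    """Return the ids whose accuracy is at least threshold at every level."""
    return [row["id"] for row in csv.DictReader(handle, delimiter="\t")
            if weakest_level(row)[1] >= threshold]


if __name__ == "__main__":
    print(confident_ids(io.StringIO(EXAMPLE), threshold=0.7))
```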

Count table

The file ending with CNT.tsv contains the total count of TE groups and the total count of bases for each group in the annotation mode.

id	count	base_count
LTR|Gypsy	166	414936
nonLTR|LINE	106	318573
DNA|Helitron	97	280230
nonLTR|Penelope	51	75123
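The id column combines a broader group and a superfamily, so the counts can be rolled up one level. The sketch below assumes a tab-separated file in which the two parts of the id are separated by a single "|" character, and sums copies and bases per group using the example rows. The function name is illustrative.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the count table example; tab-separated.
EXAMPLE = (
    "id\tcount\tbase_count\n"
    "LTR|Gypsy\t166\t414936\n"
    "nonLTR|LINE\t106\t318573\n"
    "DNA|Helitron\t97\t280230\n"
    "nonLTR|Penelope\t51\t75123\n"
)


def totals_by_group(handle):
    """Sum count and base_count over the part of the id before the '|'."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(handle, delimiter="\t"):
        group, _, _superfamily = row["id"].partition("|")
        totals[group][0] += int(row["count"])
        totals[group][1] += int(row["base_count"])
    return {group: tuple(values) for group, values in totals.items()}


if __name__ == "__main__":
    for group, (count, bases) in totals_by_group(io.StringIO(EXAMPLE)).items():
        print(f"{group}\t{count}\t{bases}")
```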

Questions, issues and requests

If you have questions about usage, run into problems, or want to request a feature, feel free to open an issue in the issues section of HamleTE's GitHub page.

Updating

If you don't have the latest features, run the following command from the command-line inside HamleTE's folder on your machine:

git pull

Citation

If you have used HamleTE, please cite:

Minuzzi Freire da Fontoura Gomes, T. (2024). HamleTE: a deep learning powered tool to annotate and classify transposable elements. (v1.0). Zenodo. https://doi.org/10.5281/zenodo.10894746

To-do

Option to use RepeatScout instead of Red.
Add Docker install and usage tutorial.
Model for classifying LINEs in superfamilies.
Add model for Class II subclass classification.
Return log file.
Add GPU support.
