TrEMOLO

Transposable Elements MOvement detection using LOng reads

TrEMOLO uses long reads, either directly or through their assembly, to detect:

  • Global TE variations between two assembled genomes
  • Populational/somatic variation in TE insertion/deletion

Global variations, the insiders

Using a reference genome and an assembled one (preferably with long contigs or, even better, a chromosome-scale assembly), TrEMOLO extracts the insiders, i.e. variant transposable elements (TEs) present globally in the assembly, and tags them. Assemblers provide the most frequent haplotype at each locus, so an assembly represents only the "consensus" of all haplotypes present at each locus. You will obtain a set of files with the location of these variable insertions and deletions.

Populational variations, the outsiders

Through remapping of reads that have been used to assemble the genome of interest, TrEMOLO will identify the populational variations (and even somatic ones) within the initial dataset of reads, and thus of DNA/individuals sampled. These variant TEs are the outsiders, present only in a part of the population or cells. In the same way as for insiders, you will obtain a set of files with the location of these variable insertions and deletions.

Release Notes

Version 2.5.4

  • Update : R packages updated

    • bookdown - 0.38
    • rmarkdown - 2.26
  • Change : Modifications in rules.snk Files

    • The FIND_SV_ON_REF, FIND_TE_ON_REF rules have been replaced by LIFT_OFF.
  • Add : New Parameters in config.yaml for INSIDER

    • MINIMAP2:
      • PRESET_OPTION: 'asm5'
      • OPTION: '--cs'
  • Add : New Modules

Current limitations

  • In INSIDER_VARIANT mode, TE annotation on the REFERENCE (parameter INTEGRATE_TE_TO_GENOME) is suboptimal. Some TEs might not be annotated on the reference.

  • Difficulty in identifying the true positives concerning clipped insertions (SOFT, HARD)

Upcoming Features

Comprehensive TE Analysis

In our upcoming release, we will be expanding our analysis capabilities to include a comprehensive examination of Transposable Elements (TEs) within both reads and genomes. This enhancement will go beyond merely identifying INDELs to encompass a full spectrum analysis of TEs.

Requirements

TrEMOLO relies on numerous tools. We recommend using the Singularity installation to be sure to have all of them in the right configurations and versions.

Installation

Using Git

Once the requirements are fulfilled, clone the repository:

git clone https://github.com/DrosophilaGenomeEvolution/TrEMOLO.git

Using Singularity

Singularity installation Debian/Ubuntu with package

Compiling yourself

A Singularity container (version 3.10.0+ required) is available with all tools compiled in. The Singularity file is provided in this repo and can be compiled as follows:

sudo singularity build TrEMOLO.simg TrEMOLO/Singularity

YOU MUST BE ROOT for compiling

Alternatively, you can download a pre-compiled Singularity container from the following link:

Download TrEMOLO Singularity Container

Test TrEMOLO with singularity

singularity exec TrEMOLO.simg snakemake --snakefile TrEMOLO/run.snk --configfile TrEMOLO/test/tmp_config.yml
#OR
singularity run TrEMOLO.simg snakemake --snakefile TrEMOLO/run.snk --configfile TrEMOLO/test/tmp_config.yml

Pulling from SingularityHub

This option is disabled because Singularity Hub is currently read-only. We are looking for a Singularity repository to ease the use.

Configuration of the parameter file

TrEMOLO uses Snakemake to perform its analyses. You must first provide your parameters in a .yaml file (see the example in the config.yaml file). The parameters are:

# All paths can be relative or absolute, depending on your directory tree.
# Absolute paths are advised if you are not familiar with directory tree structures.
DATA:
    GENOME:          "/path/to/genome_file.fasta"      #genome (fasta file) [required]
    TE_DB:           "/path/to/database_TE.fasta"      #Database of TE (a fasta file) [required]
    REFERENCE:       "/path/to/reference_file.fasta"   #reference genome (fasta file) only if INSIDER_VARIANT = True [optional]
    SAMPLE:          "/path/to/reads_file.fastq"       #long reads (a fastq[.gz] file) only if OUTSIDER_VARIANT = True [optional]
    #At least, provide either REFERENCE or SAMPLE. Both can be provided
    WORK_DIRECTORY:  "/path/to/directory"         #name of output directory [optional, will be created as 'TrEMOLO_OUTPUT']

#At least, you must provide either the reference file, or the fastq file or both

CHOICE:
    PIPELINE:
        OUTSIDER_VARIANT: True  # outsiders, TE not in the assembly - population variation
        INSIDER_VARIANT: True   # insiders, TE in the assembly
        REPORT: True            # for getting a report.html file with graphics
    OUTSIDER_VARIANT:
        CALL_SV: "sniffles"     # possibilities for SV tools: sniffles
        INTEGRATE_TE_TO_GENOME: True # (True, False) Re-build the assembly with the OUTSIDER integrated in
        CLIPPED_READS: False # (True, False) Processing of clipped reads (SOFT, HARD)
    INSIDER_VARIANT:
        DETECT_ALL_TE: False    # detect ALL TEs on the genome assembly (parameter GENOME), not only new insertions. Warning: this may take several hours on big genomes
    INTERMEDIATE_FILE: True     # Keep the intermediate analysis files to process them later.


PARAMS:
    THREADS: 8 #number of threads for some task
    OUTSIDER_VARIANT:
        MINIMAP2:
            PRESET_OPTION: 'map-ont' # minimap2 option is map-ont by default (map-pb, map-ont)
            OPTION: '' # more option of minimap2 can be specified here
        SAMTOOLS_VIEW:
            PRESET_OPTION: ''
        SAMTOOLS_SORT:
            PRESET_OPTION: ''
        SAMTOOLS_CALLMD:
            PRESET_OPTION: ''
        TSD:
            SIZE_FLANK: 15  # flanking sequence size for calculation of TSD; put value > 4
        TE_DETECTION:
            CHROM_KEEP: "." # regular expression for chromosome filtering; for instance, for Drosophila "2L,2R,3[RL],X"; put "." to keep all chromosomes
            GET_SEQ_REPORT_OPTION: "-m 30" #sequence recovery file in the vcf
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80 -k 'INS|DEL'" # option for TrEMOLO/lib/python/parse_blast_main.py - don't put -c option
    INSIDER_VARIANT:
        PARS_BLN_OPTION: "--min-size-percent 80 --min-pident 80" # parameters for validation of insiders
        MINIMAP2:
            PRESET_OPTION: 'asm5' # minimap2 preset option is asm5 by default (asm5, asm10, asm20 etc)
            OPTION: '--cs'

The main parameters are:

  • GENOME : Assembly of the sample of interest (or mix of samples), fasta file.
  • TE_DB : A multi-FASTA file containing the canonical sequences of transposable elements. You can also add copy sequences, but the results will be more complex to interpret.
  • REFERENCE : Fasta file containing the reference genome of the species of interest.
  • WORK_DIRECTORY : Directory that will contain the output files. If the directory does not exist it will be created; default value is TrEMOLO_OUTPUT.
  • SAMPLE : File containing the reads used for the sample assembly.

You can use config_INSIDER.yaml for INSIDER-only analysis or config_OUTSIDER.yaml for OUTSIDER-only analysis. To analyse INSIDER, only the REFERENCE, the GENOME, the TE_DB and the WORK_DIRECTORY are required. To analyse OUTSIDER, only the SAMPLE, the GENOME, the TE_DB and the WORK_DIRECTORY are required.
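The requirements above can be checked before launching a run. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dictionary with the same nesting as config.yaml. The function `missing_data_entries` is a hypothetical helper and not part of TrEMOLO. WORK_DIRECTORY is left out of the check because the example configuration marks it as optional with a default value.

```python
# Hypothetical helper, not part of TrEMOLO: checks a config dict for the
# DATA entries that the documentation lists as required for each mode.

REQUIRED_ALWAYS = ("GENOME", "TE_DB")
REQUIRED_BY_MODE = {
    "INSIDER_VARIANT": "REFERENCE",   # insiders need the reference genome
    "OUTSIDER_VARIANT": "SAMPLE",     # outsiders need the long reads
}


def missing_data_entries(config):
    """Return the sorted list of DATA keys that are required but empty or absent."""
    data = config.get("DATA", {})
    pipeline = config.get("CHOICE", {}).get("PIPELINE", {})
    required = set(REQUIRED_ALWAYS)
    for mode, key in REQUIRED_BY_MODE.items():
        if pipeline.get(mode, False):
            required.add(key)
    return sorted(key for key in required if not data.get(key))


example = {
    "DATA": {"GENOME": "genome.fasta", "TE_DB": "te.fasta", "SAMPLE": "reads.fastq"},
    "CHOICE": {"PIPELINE": {"OUTSIDER_VARIANT": True, "INSIDER_VARIANT": True}},
}
print(missing_data_entries(example))  # REFERENCE is missing for the INSIDER mode
```

An empty list means that every entry needed by the enabled modes is present; the check does not verify that the files exist.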

Usage

snakemake --snakefile /path/to/TrEMOLO/run.snk --configfile /path/to/your_config.yaml
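When the pipeline is launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is only a sketch: `build_tremolo_command` is a hypothetical helper, and it uses nothing beyond the two Snakemake options shown above.

```python
import shlex
from pathlib import Path


def build_tremolo_command(tremolo_dir, config_file):
    """Return the snakemake argument list for a given TrEMOLO checkout and config."""
    snakefile = Path(tremolo_dir) / "run.snk"
    return ["snakemake", "--snakefile", str(snakefile), "--configfile", str(config_file)]


cmd = build_tremolo_command("/path/to/TrEMOLO", "/path/to/your_config.yaml")
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```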

For running tests

snakemake --snakefile TrEMOLO/run.snk --configfile TrEMOLO/test/tmp_config.yml

Output files summary 📂

Here is the structure of the output files obtained after running the pipeline.

WORK_DIRECTORY
├── params.yaml  ##**Your config file
├── LIST_HEADER_DB_TE.csv ##** list of names assigned to TE in the TE database (only if you have the characters "& ; / \ | ' : ! ? " in your TE database)
├── POSITION_ALL_TE.bed -> INSIDER/TE_DETECTION/POSITION_ALL_TE.bed ##**ALL TE ON GENOME, NOT ONLY INSERTIONS (ONLY IF PARAMETER "DETECT_ALL_TE" is True)
├── POSITION_TE_INOUTSIDER.bed
├── POSITION_TE_INSIDER.bed
├── POSITION_TE_OUTSIDER.bed
├── POS_TE_INSIDER_ON_REF.bed -> INSIDER/TE_DETECTION/INSERTION_TE_ON_REF.bed ##**POSITION TE INSIDER ON REFERENCE GENOME
├── POS_TE_OUTSIDER_ON_REF.bed ##**POSITION TE OUTSIDER ON REFERENCE GENOME
├── POSITION_TE_OUTSIDER_IN_NEO_GENOME.bed  ##**POSITION TE SEQUENCE ON BEST READS SUPPORT INTEGRATED IN GENOME
├── POSITION_TE_OUTSIDER_IN_PSEUDO_GENOME.bed  ##**POSITION TE SEQUENCE ON TE DATABASE (with ID) INTEGRATED IN GENOME
├── VALUES_TSD_ALL_GROUP.csv
├── VALUES_TSD_GROUP_OUTSIDER.csv
├── VALUES_TSD_INSIDER_GROUP.csv
├── TE_INFOS.bed ##**FILE CONTAINING ALL INFO ON TE INSERTIONS
├── DELETION_TE.bed -> INSIDER/TE_DETECTION/DELETION_TE.bed ##**TE DELETION POSITION ON GENOME
├── DELETION_TE_ON_REF.bed -> INSIDER/TE_DETECTION/DELETION_TE_ON_REF.bed ##**TE DELETION POSITION ON REFERENCE
├── SOFT_TE.bed -> OUTSIDER/TE_DETECTION/SOFT/SOFT_TE.bed ##**TE INSERTION FOUND IN SOFT READS
├── INSIDER ##**FOLDER CONTAINING INSIDER PROCESSING FILES
│   ├── FREQ_INSIDER
│   ├── TE_DETECTION
│   ├── TSD
│   │   └── TSD_TE.tsv
│   ├── TE_INSIDER_VR
│   └── VARIANT_CALLING
├── log  ##**log file to check if you have any error
├── OUTSIDER
│   ├── ET_FIND_FA
│   │   ├── TE_REPORT_FOUND_TE_NAME.fasta
│   │   ├── TE_REPORT_FOUND_blood.fasta
│   │   └── TE_REPORT_FOUND_ZAM.fasta
...
│   ├── FREQUENCY
│   │   ├── FREQUENCY_TE_INS_PRECISE.fasta
│   │   └── FREQUENCY_TE_INS.tsv
│   ├── INSIDER_VR
│   ├── MAPPING ##**FOLDER CONTAINING FILES OF MAPPING ON GENOME
│   ├── MAPPING_TO_REF ##**FOLDER CONTAINING FILES OF MAPPING ON REFERENCE GENOME
│   ├── TE_DETECTION
│   │   └── MERGE_TE
│   ├── TSD
│   │   └── TSD_TE.tsv
│   ├── TrEMOLO_SV_TE
│   │   ├── INS
│   │   ├── HARD
│   │   └── SOFT
│   ├── TE_TOWARD_GENOME ##**FOLDER CONTAINING ALL THE READS ASSOCIATED WITH THE TE
│   │   ├── NEO_GENOME.fasta   ##**GENOME CONTAINING TE OUTSIDER (the best sequence of svim/sniffles)
│   │   ├── PSEUDO_GENOME_TE_DB_ID.fasta   ##**GENOME CONTAINING TE OUTSIDER (the sequence of database TE and the ID of svim/sniffles)
│   │   ├── TRUE_POSITION_TE_PSEUDO.bed   ##**POSITION IN PSEUDO GENOME
│   │   ├── TRUE_POSITION_TE.fasta  ##**SEQUENCE INTEGRATED IN PSEUDO GENOME
│   │   ├── TRUE_POSITION_TE_NEO.bed  ##**POSITION IN NEO GENOME
│   │   └── TRUE_POSITION_TE_READS.fasta  ##**SEQUENCE INTEGRATED IN NEO GENOME
│   └── VARIANT_CALLING  ##**FOLDER CONTAINING FILES OF sniffles/svim
├── REPORT
│   ├── mini_report
│   └── report.html
└── SNAKE_USED
    ├── Snakefile_insider.snk
    └── Snakefile_outsider.snk
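
After a run, a quick way to see which of the main outputs were produced is to test for a few of the paths listed in the tree above. This is a minimal sketch: `summarize_outputs` is a hypothetical helper, the selection of files is arbitrary, and `TrEMOLO_OUTPUT` is used only because it is the documented default output directory.

```python
from pathlib import Path

# File names copied from the output tree; which ones exist depends on the
# modes enabled in the config, so a missing file is not necessarily an error.
EXPECTED_OUTPUTS = (
    "TE_INFOS.bed",
    "POSITION_TE_INSIDER.bed",
    "POSITION_TE_OUTSIDER.bed",
    "REPORT/report.html",
)


def summarize_outputs(work_directory):
    """Map each expected relative path to True if it exists, else False."""
    root = Path(work_directory)
    return {name: (root / name).exists() for name in EXPECTED_OUTPUTS}


for name, present in summarize_outputs("TrEMOLO_OUTPUT").items():
    print(f"{'found  ' if present else 'missing'}  {name}")
```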

Most useful output

The most useful output files are :

  • The html report in your_work_directory/REPORT/report.html with summary graphics, as shown here

The output file your_work_directory/TE_INFOS.bed gathers all the necessary information.

chrom start end TE|ID strand TSD pident psize_TE SIZE_TE NEW_POS FREQ (%) FREQ_OPTIMIZED (%) SV_SIZE ID_TrEMOLO TYPE
2R_RaGOO_RaGOO 16943971 16943972 roo|svim.INS.175 + GTACA 97.026 99.2 9006 16943978 28.5714 28.5714 9000 TE_ID_OUTSIDER.94047.INS.107508.0 INS
X_RaGOO_RaGOO 21629415 21629416 ZAM|Assemblytics_w_534 - CGCG 98.6 90.5 8435 21629413 11.1111 10.0000 8000 TE_ID_INSIDER.77237.Repeat_expansion.8 Repeat_expansion
  1. chrom : chromosome
  2. start : start position for the TE
  3. end : end position for the TE
  4. TE|ID : TE name and ID in SV.vcf,SV_SOFT.vcf,HARD.fasta and SV_INS_CLUST.bed (for OUTSIDER) or assemblytics_out.Assemblytics_structural_variants.bed (for INSIDER)
  5. strand : strand of the TE
  6. TSD : target site duplication (TSD) sequence
  7. pident : percentage of identical matches with TE
  8. psize_TE : percentage of size with TE in database
  9. SIZE_TE : TE size
  10. NEW_POS : position corrected with calculated TSD (only for OUTSIDER)
  11. FREQ : frequency, normalized
  12. FREQ_WITH_CLIPPED : frequency with clipped read (OUTSIDER only)
  13. SV_SIZE : size of the structural variant (may be larger than the size of the TE)
  14. ID_TrEMOLO : TrEMOLO ID of the TE
  15. TYPE : type of insertion can be HARD,SOFT (Warning : HARD, SOFT are often false positives),INS,INS_DEL... (INS_DEL is an insertion located on a deletion of the assembly)
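
The column list above maps directly onto a small parser. The sketch below is an illustration, not TrEMOLO code: `parse_te_infos_line` and the lower-case key names are hypothetical, and it assumes that every data field is non-empty and contains no internal whitespace, as in the two example lines. Real files may contain placeholder values that need extra handling.

```python
# Key names follow the numbered column list. Assumption: data fields contain
# no internal whitespace and none is empty, so splitting on whitespace is safe.
COLUMNS = (
    "chrom", "start", "end", "te_id", "strand", "tsd", "pident", "psize_te",
    "size_te", "new_pos", "freq", "freq_clipped", "sv_size", "id_tremolo", "type",
)


def parse_te_infos_line(line):
    """Turn one data line of TE_INFOS.bed into a dict with typed values."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # "TE|ID" holds the TE name and the structural variant ID, separated by "|".
    row["te_name"], row["sv_id"] = row["te_id"].split("|", 1)
    for key in ("start", "end", "size_te", "new_pos", "sv_size"):
        row[key] = int(row[key])
    for key in ("pident", "psize_te", "freq", "freq_clipped"):
        row[key] = float(row[key])
    return row


line = ("2R_RaGOO_RaGOO 16943971 16943972 roo|svim.INS.175 + GTACA 97.026 99.2 "
        "9006 16943978 28.5714 28.5714 9000 TE_ID_OUTSIDER.94047.INS.107508.0 INS")
row = parse_te_infos_line(line)
print(row["te_name"], row["type"], row["freq"])
```

Rows parsed this way can then be filtered, for example by keeping only insertions whose `freq` exceeds a chosen threshold.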

Modules

Modules are crucial tools in post-processing for analyses. They enable the extraction and visualization of complex information in an intuitive and accessible manner. With these modules, users can gain a deep understanding of data by directly visualizing outcomes in various graphical formats, thereby facilitating the interpretation and utilization of research results or analyses.

1 - Scatter Frequency

The "Scatter Frequency TE Tremolo" module provides a crucial graphical tool for researchers studying the evolution of transposable element (TE) insertion frequencies across generations. It clearly visualizes the dynamics of these genomic elements, offering valuable insights into their behavior and potential for adaptation or evolutionary change within populations over extended periods. For more details, please consult the full documentation at this link.

2 - ANALYSYS TE BLAST

This module enables the visualization of BLAST results concerning the newly detected transposable element insertions. It allows for the visual identification of specific structures such as LTR recombinations, transposable elements (TEs) inserted within other TEs, or more complex structures like clusters of TEs. This tool is crucial for genomic researchers aiming to deeply analyze the dynamics of TE insertions. For more details, please consult the full documentation at this link.

How to use TrEMOLO

What strategy to use

Example of result obtained with simulated data set

The choice of the right strategy depends on the context.

Context 1 : Strategy 2 is better

Context 2 : Strategy 1 is better

Licence and Citation

Mourdas MOHAMED.

This work is licensed under CC BY 4.0 for all docs and manuals. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

TrEMOLO is licensed under CeCILL-C and GPLv3.

If you use TrEMOLO, please cite:

Mohamed, M.; Sabot, F.; Varoqui, M.; Mugat, B.; Audouin, K.; Pélisson, A.; Fiston-Lavier, A.-S. & Chambeyron S. TrEMOLO: accurate transposable element allele frequency estimation using long-read sequencing data combining assembly and mapping-based approaches. Genome Biol 24, 63 (2023). (https://doi.org/10.1186/s13059-023-02911-2)

Mohamed, M.; Dang, N. .-M.; Ogyama, Y.; Burlet, N.; Mugat, B.; Boulesteix, M.; Mérel, V.; Veber, P.; Salces-Ortiz, J.; Severac, D.; Pélisson, A.; Vieira, C.; Sabot, F.; Fablet, M.; Chambeyron, S. A Transposon Story: From TE Content to TE Dynamic Invasion of Drosophila Genomes Using the Single-Molecule Sequencing Technology from Oxford Nanopore. Cells 2020, 9, 1776. (https://www.mdpi.com/2073-4409/9/8/1776)

The data used in the paper are available here on DataSuds.