This workflow computationally identifies putative horizontally transferred (HT) DNA transposons by searching across 251 mammalian species, focusing on elements present in any of 37 bat species.
Given the wide variety of DNA transposons and species involved, this is a broad-scale search that uses a priori search thresholds. The main steps are:
- Transposable element (TE) annotation using RepeatMasker in 253 mammalian genome assemblies
  - Detailed workflow in TE_annotation folder
- Local BLAST+ searches (blastn) of all DNA transposons with patchy species distributions against all mammal and other eukaryote genome assemblies, requiring at least 90% sequence identity and at least 90% query alignment
  - Detailed workflow in BLAST folder
- Generate species-specific consensus sequences for all candidate DNA transposons
  - Detailed workflow in TE_alignments folder
- Identify autonomous elements:
  - Use the EMBOSS getorf utility to identify open reading frames (ORFs) in all species-specific consensus sequences greater than 800 bp
  - Perform blastx searches against non-redundant proteins for the largest ORFs from each consensus sequence (can be performed remotely or in a web browser)
  - Once autonomous elements are identified, exclude any with fewer than 20 copies meeting the 90/90/90 criteria; exclude any non-autonomous elements with fewer than 100 copies
- Generate RAxML trees of species-specific consensus sequences for each element (optional)
  - Only useful if an element is present in 3 or more species; used to look for phylogenetic incongruence with the species tree. Because this project focused on putative HT specifically involving bats, and many elements were not found outside a single clade, this step was not particularly informative.
  - Detailed workflow in TE_alignments folder
- Identify and exclude deletion products of elements within the set of putative HT elements by clustering sequences via a) the cross_match utility of Phrap v0.990319 with default settings, and b) a modified CD-HIT search with the utility ClusterPartialMatchingSubs.pl using default settings.
  - Detailed workflow in TE_alignments folder
- Estimate the average age of a TE subfamily in a given species using the average modified Kimura 2-parameter (K2P) distance from the library consensus and the neutral mutation rate of the species. Modified K2P values were calculated with RepeatModeler's utilities.
  - Detailed workflow in TE_alignments folder
- Infer putative HT placement on bat phylogeny (see bat_phylogeny) based on presence/absence and average TE age across species.
  - The bat phylogeny is based on Foley et al. 2021 and Amador et al. 2018, with a combination of non-conflicting average or median divergence estimates from TimeTree (accessed 3 September 2021).
  - To be conservative, elements were assigned to the oldest possible branch based on presence/absence data within a given clade. For example, the four Myotis species in this tree form two sister-species pairs, M. brandtii + M. lucifugus and M. davidii + M. myotis; if only M. brandtii and M. myotis were searched and the element was found in both, it was assumed to also be present in the other two species, so the HT event would be inferred to have occurred in the ancestral Myotis lineage.
- Estimate the association between young (<50 My) TE accumulation in bats and species richness, and the association between putative HT diversity and species richness in bats.
  - Detailed workflow in association_modeling folder
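
The numeric thresholds in the steps above can be illustrated with a minimal Python sketch. It is not the code used in this repository. The `BlastHit` class and all function names are hypothetical, query alignment is assumed to mean the aligned query span divided by the query length, and the age calculation assumes the common approximation of K2P distance (as a fraction) divided by the neutral mutation rate (substitutions per site per year). Only the two BLAST thresholds stated above are checked, because the third component of the 90/90/90 criteria is not spelled out here.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Hypothetical container for the fields needed from one blastn hit."""
    pident: float  # percent identity of the alignment
    qstart: int    # first aligned position on the query
    qend: int      # last aligned position on the query
    qlen: int      # full length of the query sequence


def passes_blast_thresholds(hit, min_identity=90.0, min_query_cov=90.0):
    """Return True if a hit has >= 90% identity and >= 90% query alignment.

    Query alignment is assumed to be aligned query span / query length.
    """
    aligned = abs(hit.qend - hit.qstart) + 1
    query_cov = 100.0 * aligned / hit.qlen
    return hit.pident >= min_identity and query_cov >= min_query_cov


def passes_copy_number(n_copies, autonomous):
    """Apply the copy-number cutoffs: 20 for autonomous, 100 otherwise."""
    minimum = 20 if autonomous else 100
    return n_copies >= minimum


def estimate_age_my(mean_k2p, neutral_rate_per_site_per_year):
    """Estimate subfamily age in millions of years.

    Assumes age = mean K2P distance / neutral mutation rate, with the
    distance given as a fraction (0.05 for 5%), not a percentage.
    """
    return mean_k2p / neutral_rate_per_site_per_year / 1e6


if __name__ == "__main__":
    hit = BlastHit(pident=95.0, qstart=1, qend=950, qlen=1000)
    print(passes_blast_thresholds(hit))
    print(passes_copy_number(19, autonomous=True))
    # The mutation rate below is a placeholder value, not a rate from this study.
    print(estimate_age_my(0.05, 2.5e-9))
```

The per-species neutral mutation rate would need to come from the project's own data. The value in the usage example is only a placeholder to show the units.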