This repository contains scripts I wrote as a graduate research assistant while working on the Wyeomyia smithii Genome project for Dr. Elizabeth Cooper's Lab at UNC Charlotte. In this project I identified and aged Transposable Elements (TEs) within W.smithii and six other mosquito genome annotations, then I visualized these results.
- repeatModeler.slurm
Bash script used to identify the TE families in the Wyeomyia smithii genome assembly. Requires genome annotation of Wyeomyia smithii and the repeatmodeler module. I replicated with the 6 other mosquito species genome annotations.
- add-names.py
Python script to add species name identifiers to each header in the fasta files of TE families identified from repeatmodeler for each of the 7 species. The output from repeatmodeler is required for this script.
- get-families.py
Python script to extract the TE families that are in each genome. The fasta files with added species identifiers are required for this script. I repeated with the 6 other mosquito species genome annotations.
- get-family-counts.py
Python script to determine the number of each TE family found within each species. The output from get-families.py is required for this script. I repeated with the 6 other mosquito species.
- longest_transcripts.sh
Bash script to get primary transcripts for later input into orthofinder. This script requires proteome files for each investigated species.
- OrthoScript.sh
Bash script to run orthofinder on the primary transcripts found in the previous step. The output of interest is Species_Tree/SpeciesTree_rooted.txt
- iTOL annotation editor
Input SpeciesTree_rooted.txt into iTOL, then use the annotation editor to produce a multi-value bar-chart of TE family counts
- ltrharvest.sh
Bash script to run ltrharvst and locate unfragmented LTRs in the mosquito genome annotations. Genome annotations in fasta format are required for this script.
- gff-format.py
Python script to alter the format of the ltrharvest output to make it work with getfasta. The ltrharvest output is required for this script.
- getfasta.sh
Bash script to extract sequences from ltrharvest. The fasta genome annotations and the output from gff-format.py are required for this script.
- ltr-seqs.py
Python script to extract the long terminal repeats. The getfasta.sh output is required for this script.
- ltr-files.py
Python script to produce one file for each set of LTR sequences. The ltr-seqs.py output is required for this script.
- clustal-loop.sh
Bash script to align each of the LTR files we produced from the previous step. The directory containing all of the previous outputs is required for this script.
- TE_aging.R
R script to determine and visualize the age of the identified LTRs. The aligned outputs from the previous step are required for this script.