EndophyteGenomes

Scripts for Hill et al. (2023) doi:10.1093/gbe/evad038 🌿

Tapping culture collections for fungal endophytes: first genome assemblies for three genera and five species in the Ascomycota

Pipeline workflow

Bioinformatics analysis pipeline for:

Hill et al. (2023) Tapping culture collections for fungal endophytes: first genome assemblies for three genera and five species in the Ascomycota. Genome Biology and Evolution 15(3):evad038. doi:10.1093/gbe/evad038

The pipeline was written for and run on Queen Mary University of London's Apocrita HPC facility, which uses the Univa Grid Engine batch-queue system. As a result, many of the bash scripts (.sh file endings) specify core counts, run times and memory allocations that may need to be adapted for other platforms.
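
As a rough guide to what needs adapting, a Grid Engine job script header typically looks like the sketch below. The resource values and the final echo are illustrative assumptions, not copied from the scripts in this repository. On another scheduler, the `#$` lines would be replaced by that scheduler's own directives.

```bash
#!/bin/bash
#$ -cwd              # run the job from the current working directory
#$ -j y              # merge stderr into stdout
#$ -pe smp 4         # number of cores (illustrative value)
#$ -l h_rt=24:0:0    # maximum run time, hours:minutes:seconds (illustrative value)
#$ -l h_vmem=4G      # memory request per core (illustrative value)

# NSLOTS is set by Grid Engine to the number of allocated cores;
# fall back to 1 when the script is run outside the scheduler.
THREADS="${NSLOTS:-1}"
echo "Running with ${THREADS} thread(s)"
```

Passing `${NSLOTS}` to each tool's thread option keeps the requested and used core counts in step when the allocation is changed.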


1 Read trimming/basecalling

cd reads

Short-reads

  1. qsub trimmomatic.sh trims raw reads using Trimmomatic; requires NexteraPE-PE.fa file with adapter sequences downloaded from here (for Illumina NovaSeq 6000 151bp paired-end reads).
  2. qsub fastqc.sh checks trimmed read quality with FastQC.

Long-reads

qsub guppy.sh performs fast basecalling of raw MinION read data.

2 De novo genome assembly

cd denovo_assembly

  1. ./submit_assembly.sh makes a new directory and submits job scripts for each assembly tool: the short-read tools abyss.sh (ABySS), megahit.sh (MEGAHIT) and spades.sh (SPAdes), and the long-read tools flye.sh (Flye), raven.sh (Raven) and spades_hybrid (hybridSPAdes). For all tools except ABySS, these scripts also map short reads with BWA-MEM for polishing with Pilon, and remove sequences <200 bp with Seqtk for NCBI compliance.
  2. qsub -t 1-8 abyss_comp.sh compares the assembly stats to choose the 'best' k-mer size for ABySS (must be done after abyss.sh has finished for all k-mer sizes and strains), followed by short-read polishing with Pilon.
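
To illustrate the <200 bp filter mentioned in step 1, here is a minimal sketch in plain awk, so that it runs without Seqtk installed. The pipeline scripts themselves use Seqtk, where a comparable filter is `seqtk seq -L 200`. The function name `filter_short_contigs` is hypothetical.

```bash
#!/bin/bash
# Print only FASTA records whose sequence is at least min_len bases long.
# Handles sequences wrapped over several lines; output is one line per sequence.
filter_short_contigs() {
    local min_len="$1" fasta="$2"
    awk -v min="$min_len" '
        function emit() {
            if (header != "" && length(seq) >= min) { print header; print seq }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record: flush the previous one
        { seq = seq $0 }                               # accumulate wrapped sequence lines
        END { emit() }                                 # flush the last record
    ' "$fasta"
}
```

For example, `filter_short_contigs 200 assembly.fa > assembly_filtered.fa` would keep only contigs of 200 bp or longer.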

3 Assessment

cd assessment

Assembly tool comparison

  1. ./submit_assessment submits scripts for assembly quality statistics: quast.sh (QUAST) and busco.sh (BUSCO, which requires the ascomycota_odb10.2020-09-10 BUSCO dataset downloaded from here). It also submits blast.sh (BLAST), diamond.sh (DIAMOND) and read_mapping.sh, which maps reads with BWA-MEM, to produce input for BlobTools.
  2. qsub -t 1-15 blobtools.sh runs BlobTools (must be done after blast.sh, diamond.sh and read_mapping.sh have finished for the strain(s) in question).
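
Rather than waiting manually for the upstream jobs, Grid Engine can hold a job until named jobs have finished using `-hold_jid`. The sketch below only builds and prints such a command. It assumes the upstream jobs carry the default job names (the script file names); if the scripts set their own names with `-N`, those names would be needed instead.

```bash
#!/bin/bash
# Jobs that must finish before BlobTools runs, as a comma-separated list.
upstream="blast.sh,diamond.sh,read_mapping.sh"

# Build the submission as an array so each argument stays intact.
cmd=(qsub -hold_jid "$upstream" -t 1-15 blobtools.sh)

# Print the command for inspection; run "${cmd[@]}" instead to submit it.
echo "${cmd[*]}"
```

Note that the hold is released only when all tasks of the upstream array jobs have finished, which is stricter than waiting per strain.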

Contamination filtering

  1. qsub -t 1-15 remove_contam.sh removes contigs which BlobTools flagged as belonging to the wrong taxonomic class using Seqtk.
  2. qsub -t 1-15 ncbi_filter.sh uses bedtools to remove or trim contigs flagged as mitochondrial or adapter contamination during NCBI submission; requires strain_ncbi_remove.txt and strain_ncbi_trim.bed files.
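
The removal half of this step can be pictured with the sketch below. It assumes the removal list holds one contig name per line, which is a guess at the format of strain_ncbi_remove.txt rather than something documented here, and it uses plain awk instead of the Seqtk and bedtools calls in the actual scripts. The function name is hypothetical.

```bash
#!/bin/bash
# Print every FASTA record whose contig name is NOT in the removal list.
# The contig name is taken as the first word of the header, without ">".
drop_listed_contigs() {
    local list="$1" fasta="$2"
    awk -v list="$list" '
        BEGIN {
            # Load the names to drop (first field of each non-empty line).
            while ((getline line < list) > 0) {
                split(line, f, " ")
                if (f[1] != "") drop[f[1]] = 1
            }
        }
        /^>/ { name = substr($1, 2); keep = !(name in drop) }
        keep    # print header and sequence lines of kept records
    ' "$fasta"
}
```

Trimming by coordinates is a separate operation and is not covered by this sketch.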

Final quality statistics

  1. qsub quast_final.sh reruns QUAST on the contaminant-filtered assemblies.
  2. mkdir busco_final && qsub -t 1-15 busco_final.sh makes a new directory and reruns BUSCO on the contaminant-filtered assemblies.
  3. qsub -t 1-15 read_mapping_final.sh performs a final round of read mapping and produces mapping statistics with SAMtools to calculate both short- and long-read coverage.
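
As one way to turn mapping output into a coverage figure, the sketch below averages per-base depth. It assumes input in the three-column, tab-separated format written by `samtools depth` (contig, position, depth). Whether the repository's script derives coverage this way is not stated here, and the function name is hypothetical.

```bash
#!/bin/bash
# Read "contig<TAB>position<TAB>depth" lines on stdin and print the mean depth.
mean_depth() {
    awk -F '\t' '
        { total += $3 }
        END {
            if (NR > 0) printf "%.2f\n", total / NR
            else print "NA"    # no positions reported
        }
    '
}
```

With SAMtools installed, this could be used as `samtools depth -a mapped.bam | mean_depth`, where `-a` includes zero-depth positions so they count towards the mean.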

4 Annotation

cd annotation

Repeat masking

cd annotation/repeat_masking

  1. qsub -t 1-15 repeatmodeler makes a custom repeat library for each strain using RepeatModeler.
  2. qsub -t 1-15 repeatmasker.sh uses the custom repeat libraries to softmask assemblies using RepeatMasker.
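
Softmasking writes repeats as lowercase bases, so a quick sanity check after this step is to measure the lowercase fraction of an assembly. The sketch below is an illustrative check, not part of the repository, and the function name is hypothetical.

```bash
#!/bin/bash
# Print the fraction of bases in a FASTA file that are lowercase (softmasked).
softmasked_fraction() {
    # LC_ALL=C makes the [a-z] range match ASCII lowercase letters only.
    LC_ALL=C awk '
        /^>/ { next }                                   # skip header lines
        { total += length($0); masked += gsub(/[a-z]/, "") }
        END {
            if (total > 0) printf "%.4f\n", masked / total
            else print "NA"
        }
    ' "$1"
}
```

A value near zero for a genome expected to contain repeats would suggest that masking did not run as intended.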

Structural annotation

cd annotation/structural

qsub -t 1-15 funannotate.sh sorts and relabels contigs in the repeatmasked assembly before predicting gene models using funannotate. Requires protein and EST evidence downloaded from MycoCosm to be saved in this folder.

Functional annotation

cd annotation/functional

  1. qsub -t 1-15 eggnogmapper.sh runs eggNOG-mapper on predicted gene models.
  2. qsub -t 1-15 antismash.sh runs antiSMASH on predicted gene models.
  3. qsub -t 1-15 interproscan.sh runs InterProScan on predicted gene models.
  4. qsub -t 1-15 funannotate_annotate.sh maps results from the previous three programmes onto the structural annotation and produces the .sqn files for NCBI submission.
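
The `-t 1-15` option used throughout submits an array job with one task per strain; Grid Engine sets `SGE_TASK_ID` to the task number inside each task. One common way to map that number to a strain is to read the matching line of a list file, sketched below. The file name strains.txt and the function name are hypothetical, and the repository's scripts may do this differently.

```bash
#!/bin/bash
# Return the strain on line task_id of a file listing one strain per line.
strain_for_task() {
    local task_id="$1" strain_file="$2"
    sed -n "${task_id}p" "$strain_file"
}

# Inside a job script submitted with "qsub -t 1-15", this would be used as:
#   STRAIN=$(strain_for_task "$SGE_TASK_ID" strains.txt)
```

To rerun a single strain, the range can be narrowed, for example `-t 7` for the seventh line only.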

5 Phylogenetics

cd phylogenetics

This folder contains two files: lineages, which lists the 10 lineages for which trees must be built and the strains that belong to each, and markers, which lists the 13 genetic markers selected for building the trees.

Gene extraction

cd GenePull

./genepull.sh submits GenePull to extract selected genetic markers from the contaminant-filtered assemblies of each strain. Requires fasta files containing a single example sequence from a closely related taxon for each genetic marker being extracted.

Alignment

  1. file_prep.sh contains example one-liners for formatting sequence headers in each of the gene alignment fasta files so that they are identical across different genes (e.g. removing GenBank accessions, removing miscellaneous text after taxon names/vouchers, and replacing spaces with underscores).
  2. qsub -t 1-10 align.sh submits gene alignments using MAFFT and trimming using trimAl for each of the 10 lineages.
  3. Gene alignments are manually checked with AliView.
  4. qsub -t 1-10 concat.sh submits concatenation of all gene alignments with AMAS for each of the 10 lineages.
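
The header formatting in step 1 matters because concatenation matches sequences across genes by name. The sketch below shows the flavour of such a one-liner; it is not taken from file_prep.sh. It assumes headers of the hypothetical form `>AB123456.1 Genus species VOUCHER`, strips the leading accession and replaces spaces with underscores. Removing trailing text after the voucher depends on the header layout and is left out.

```bash
#!/bin/bash
# Normalise FASTA headers: drop a leading GenBank-style accession
# (1-2 capital letters, 5-8 digits, optional version) and
# replace runs of spaces with a single underscore. Sequence lines are untouched.
clean_headers() {
    sed -E '/^>/ { s/^>[A-Z]{1,2}[0-9]{5,8}(\.[0-9]+)? +/>/; s/ +/_/g; }' "$1"
}
```

For example, `clean_headers gene1.fasta > gene1_clean.fasta` would turn `>AB123456.1 Genus species CBS 000000` into `>Genus_species_CBS_000000`.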

ML tree building

qsub -t 1-10 raxmlng.sh submits RAxML-NG with bootstrapping until convergence or up to 1,000 replicates (whichever comes first) for each of the 10 lineages.
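
In RAxML-NG, bootstrapping until convergence with a cap on replicates is requested with `--bs-trees autoMRE{N}`. The sketch below builds and prints such a command for one lineage. The lineage name, alignment file name and substitution model are assumptions for illustration; the settings actually used are in raxmlng.sh.

```bash
#!/bin/bash
# Hypothetical lineage name and alignment file; GTR+G is an assumed model.
lineage="lineage1"
msa="${lineage}_concat.phy"
model="GTR+G"

# --all runs the ML search plus bootstrapping;
# autoMRE{1000} stops at convergence or after 1,000 replicates.
cmd=(raxml-ng --all --msa "$msa" --model "$model"
     --bs-trees "autoMRE{1000}" --threads "${NSLOTS:-1}" --prefix "$lineage")

# Print the command for inspection; run "${cmd[@]}" instead to execute it.
echo "${cmd[*]}"
```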

6 Data visualisation

plots.r


Citation

Hill et al. (2023) Tapping culture collections for fungal endophytes: first genome assemblies for three genera and five species in the Ascomycota. Genome Biology and Evolution 15(3):evad038. doi:10.1093/gbe/evad038