/Pathogenic_allodiploid_hybrids_of_Aspergillus_fungi

Holds code and files associated with Pathogenic allodiploid hybrids of Aspergillus fungi


Scripts and pipelines from Pathogenic allodiploid hybrids of Aspergillus fungi

This repository provides the command-line arguments and scripts needed to reproduce the analyses in Pathogenic allodiploid hybrids of Aspergillus fungi, in accordance with the STAR Methods of Current Biology.

If you use contents from this repository, please cite:



Determination of gene set completeness

Determines the number of complete, duplicated, fragmented, and missing BUSCO genes in a proteome.
Software: BUSCO

bash run_busco.sh proteome.faa

Determination of copy number variable regions

An exemplary Control-FREEC configuration file is provided. P-values for each copy number variable region were calculated using the original software authors' script assess_significance.R.
Software: Control-FREEC

freec -conf config.freec |& tee freec.log

In addition to Control-FREEC, we assessed whether CNVnator was accurate for our data. To do so, we ran CNVnator with the following wrapper script. The 1st argument should be the root file; the 2nd argument should be the tree file (the sorted BAM file); the 3rd argument should be the genome file (a file with scaffold lengths); the 4th argument should be the path to the scaffolds directory; the 5th argument should be the window size.
Software: [CNVnator](https://github.com/abyzovlab/CNVnator)
```bash
bash CNVnator_wrapper.sh file.root ./path_to_sorted_bam_file ./path_to_file_with_scaffold_lengths ./path_to_directory_with_scaffolds window_size
```

Predicting gene boundaries

Augustus is a widely used tool for predicting gene boundaries. We ran Augustus with default parameters, using the Aspergillus nidulans species training set.
Software: Augustus

augustus --species=aspergillus_nidulans

Genome assembly

To simplify the genome assembly process, we used the wrapper utility iWGS. iWGS was run with the recommended default parameters using Kmergenie, Trimmomatic, SPAdes, MaSuRCA, and QUAST.
Software: iWGS

Strain determination using taxonomically informative loci

To determine the evolutionary history of taxonomically informative loci -- i.e., determine that the parents of the hybrids are Aspergillus spinulosporus and a close relative of Aspergillus quadrilineatus -- we used the maximum likelihood software RAxML. To determine bipartition support, we used rapid bootstrap analysis. Below we provide an exemplary command used during tree search.
Software: RAxML

raxmlHPC -f a -m GTRGAMMAX -x 12345 -p 12345 -N 1000

Predicting orthologous groups of genes

To predict groups of orthologous genes for downstream phylogenetic analyses, we used a sequence similarity-based clustering approach. The following is an exemplary command.
Software: OrthoFinder

./orthofinder -os -M msa -I 1.5 -S blast -f directory_of_proteomes/
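
Downstream steps use only single-copy orthologous genes. As a minimal sketch of how such groups could be picked out, the function below filters a gene-count table with awk. It assumes a tab-delimited table with a header row, the orthogroup ID in column 1, one gene-count column per species, and a final Total column. That layout, the function name `single_copy_orthogroups`, and the file name in the usage comment are illustrative assumptions, not part of the original pipeline.

```bash
# single_copy_orthogroups COUNT_TABLE
# Hypothetical helper: print the IDs of orthogroups that have exactly one
# gene in every species. The table layout is an assumption (see the text above).
single_copy_orthogroups() {
    awk -F'\t' '
        NR == 1 { next }                  # skip the header row
        {
            ok = 1
            for (i = 2; i < NF; i++)      # columns 2..NF-1 hold per-species counts
                if ($i != 1) { ok = 0; break }
            if (ok && NF > 2) print $1    # need at least one species column
        }
    ' "$1"
}

# Example usage (the file name is a placeholder):
# single_copy_orthogroups gene_counts.tsv > single_copy_orthogroups.txt
```

The resulting list of IDs can then be used to select which orthogroup sequence files to align.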

Sequence alignment and trimming

To align and trim sequences for downstream analysis, we first created nucleotide multi-fasta files of single copy orthologous genes predicted by OrthoFinder. Then, we aligned and trimmed the sequences in the multi-fasta files using the commands shown here.
Software: MAFFT, trimAl

# Sequence alignment
mafft --maxiterate 1000 --genafpair input > output 

# Alignment trimming
trimal -in input -out output -gappyout
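
Because these two commands are run once per single-copy orthologous gene, a loop is a natural way to apply them to a whole directory. The sketch below reuses the MAFFT and trimAl options shown above. The function name `align_and_trim_all`, the `.fa` input extension, and the `.mafft` / `.mafft.trimal` output suffixes are assumptions made for illustration.

```bash
# align_and_trim_all IN_DIR OUT_DIR
# Hypothetical helper: for every IN_DIR/<name>.fa, write
#   OUT_DIR/<name>.mafft         (alignment)
#   OUT_DIR/<name>.mafft.trimal  (trimmed alignment)
align_and_trim_all() (
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for fasta in "$in_dir"/*.fa; do
        [ -e "$fasta" ] || continue    # no *.fa files: skip the unexpanded pattern
        name=$(basename "$fasta" .fa)
        # Sequence alignment (same options as above)
        mafft --maxiterate 1000 --genafpair "$fasta" > "$out_dir/$name.mafft" || exit 1
        # Alignment trimming (same options as above)
        trimal -in "$out_dir/$name.mafft" -out "$out_dir/$name.mafft.trimal" -gappyout || exit 1
    done
)

# Example usage (directory names are placeholders):
# align_and_trim_all single_copy_fastas/ alignments/
```

The function stops at the first file for which either tool fails, so a partial run is easy to spot.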

Creating a concatenated genome-scale data matrix of multiple sequence alignments

A custom script (link) was used to concatenate the aligned and trimmed sequences described in the section 'Sequence alignment and trimming'. The inputs are a list of alignment files to concatenate, a list of taxa to include, whether the sequences are proteins or nucleotides, and a prefix for output files. The outputs are a fasta file of the concatenated sequences with '.fa' appended to the end, a RAxML-style partition file, and a file that summarizes the occupancy of each gene from each alignment.
Software: Biopython

python create_concat_matrix.py -a alignment.list -c sequence_character -t taxa.list -p output_prefix

Original author: Jacob Steenwyk
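
To illustrate what a RAxML-style partition file records, the sketch below derives one from a list of alignment files: each alignment occupies a contiguous block of columns in the concatenated matrix, and each line names that block. This is not the author's script. The function name `write_partitions`, the use of the file name as the partition name, and the fixed `DNA` data type (nucleotide input) are assumptions.

```bash
# write_partitions ALIGNMENT_LIST
# Hypothetical helper: print one partition line per alignment, in list order,
# in the form "DNA, <file name> = <first column>-<last column>".
write_partitions() (
    start=1
    while IFS= read -r aln || [ -n "$aln" ]; do
        [ -n "$aln" ] || continue              # ignore blank lines in the list
        # Alignment length = number of characters in the first FASTA record.
        len=$(awk '
            /^>/ { if (seen) exit; seen = 1; next }
            { gsub(/[ \t\r]/, ""); n += length($0) }
            END { print n + 0 }
        ' "$aln")
        [ "$len" -gt 0 ] || continue           # skip empty alignments
        end=$((start + len - 1))
        printf 'DNA, %s = %d-%d\n' "$(basename "$aln")" "$start" "$end"
        start=$((end + 1))
    done < "$1"
)

# Example usage:
# write_partitions alignment.list > partitions.txt
```

The sketch takes the length of the first record only, so it relies on every sequence in a trimmed alignment having the same length.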

Genome-scale phylogenies of each parental genome and topology tests

Genome-scale phylogenies describing the evolutionary history of each subgenome were inferred using the exemplary command shown here.
Software: IQ-TREE

iqtree -s input.fa -seed 86924356 -st DNA -pre output -nt 24 -nbest 10 -m TEST -bb 5000

In addition, we conducted topology tests. In brief, these topology tests were used to determine if the topology inferred from one data matrix (from one parental genome) was equivalent to the topology inferred from the other data matrix (the other parental genome).

iqtree -s data_matrix_from_one_parent.fa -z Phylogenies_inferred_from_both_data_matrices.tres -n 0 -zb 10000 -zw -au -m GTR+F+I+G4

Reciprocal best blast hit (RBBH)

To conduct reciprocal best BLAST hit analysis, we used the following custom script. Part of the script relies on a resource from Harvard (link). The exemplary script performs RBBH between nucleotide sequences; changing lines 13, 14, 17, and 18 allows RBBH between protein sequences.
Software: BLAST+, Perl

bash RBBH.bash fasta_file_a fasta_file_b
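
For readers unfamiliar with the idea, the sketch below shows the core logic of a reciprocal best hit search on two BLAST tabular (`-outfmt 6`) result files, one for A searched against B and one for B searched against A. It is an awk-based illustration and not the Perl-based RBBH.bash script itself. The function names and the choice of bitscore (column 12) to rank hits are assumptions.

```bash
# best_hits BLAST_TABLE
# Print "query<TAB>subject" for the highest-bitscore hit of each query.
# Ties keep the first hit encountered.
best_hits() {
    awk -F'\t' '
        !($1 in score) || $12 + 0 > score[$1] { score[$1] = $12 + 0; hit[$1] = $2 }
        END { for (q in hit) print q "\t" hit[q] }
    ' "$1"
}

# rbbh A_VS_B_TABLE B_VS_A_TABLE
# Print sorted pairs "a<TAB>b" where b is the best hit of a and a is the best hit of b.
rbbh() (
    a_best=$(mktemp)
    b_best=$(mktemp)
    best_hits "$1" > "$a_best"
    best_hits "$2" > "$b_best"
    # Load the best hits of B, then keep the best hits of A that point back.
    awk -F'\t' '
        FILENAME == ARGV[1] { back[$1] = $2; next }
        ($2 in back) && back[$2] == $1
    ' "$b_best" "$a_best" | sort
    rm -f "$a_best" "$b_best"
)

# Example usage (table names are placeholders):
# rbbh a_vs_b.tsv b_vs_a.tsv > reciprocal_best_hits.tsv
```

A gene whose best hit does not point back to it is dropped, which is what distinguishes a reciprocal best hit from a one-way best hit.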

Ks values for every gene

This custom pipeline calculates Ks for every gene in one genome compared to its best BLAST hit in another genome. See ./Ks_pipeline/README for a detailed explanation of the concept and usage. Note that this pipeline is designed specifically for the Slurm job scheduler on ACCRE (Advanced Computing Center for Research and Education), Vanderbilt University's high-performance computing cluster.
Software: BLAST+, PAML, PAL2NAL, SAMtools, Perl

bash ./Ks_pipeline/execute.sh