This repository provides the command-line arguments and scripts needed to reproduce the analyses in "Pathogenic allodiploid hybrids of Aspergillus fungi", in accordance with the STAR Methods of Current Biology.
If you use contents from this repository, please cite:
Determines the number of complete, duplicated, fragmented, and missing BUSCO genes in a proteome.
Software: BUSCO
bash run_busco.sh proteome.faa
An exemplary Control-FREEC configuration file is provided. P-values for each copy number variable region were calculated using assess_significance.R, a script provided by the authors of Control-FREEC.
Software: Control-FREEC
freec -conf config.freec |& tee freec.log
In addition to Control-FREEC, we assessed whether CNVnator was accurate for our data. To do so, we ran CNVnator with the following wrapper script. The arguments, in order, are: (1) the root file; (2) the tree file; (3) the genome file; (4) the path to the scaffolds directory; and (5) the window size.
Software: [CNVnator](https://github.com/abyzovlab/CNVnator)
``` bash CNVnator_wrapper.sh file.root ./path_to_sorted_bam_file ./path_to_file_with_scaffold_lengths ./path_to_directory_with_scaffolds window_size ```
Augustus is a powerful and popular tool for predicting gene boundaries. We ran Augustus with default parameters, using the species training parameters for Aspergillus nidulans.
Software: Augustus
augustus --species=aspergillus_nidulans
To simplify the genome assembly process, we used the wrapper utility iWGS. iWGS was run with the recommended default parameters using Kmergenie, Trimmomatic, SPAdes, MaSuRCA, and QUAST.
Software: iWGS
To determine the evolutionary history of taxonomically informative loci (i.e., to determine that the parents of the hybrids are Aspergillus spinulosporus and a close relative of Aspergillus quadrilineatus), we used the maximum likelihood software RAxML. Bipartition support was assessed using rapid bootstrap analysis. Below we provide an exemplary command used during tree search.
Software: RAxML
raxmlHPC -f a -m GTRGAMMAX -x 12345 -p 12345 -N 1000
To predict groups of orthologous genes for downstream phylogenetic analyses, we used a sequence similarity-based clustering approach. The following is an exemplary command.
Software: OrthoFinder
./orthofinder -os -M msa -I 1.5 -S blast -f directory_of_proteomes/
To align and trim sequences for downstream analysis, we first created nucleotide multi-FASTA files of the single-copy orthologous genes predicted by OrthoFinder. We then aligned and trimmed the sequences in each multi-FASTA file using the commands shown here.
Software: MAFFT, trimAl
# Sequence alignment
mafft --maxiterate 1000 --genafpair input > output
# Alignment trimming
trimal -in input -out output -gappyout
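The first step described above, selecting single-copy orthologous genes before alignment, can be sketched in Python. This is only an illustration, not the code used in the study. It assumes a tab-delimited orthogroup table with one column per species and comma-separated gene identifiers in each cell, which resembles the orthogroup table written by OrthoFinder; check the layout against the output of your OrthoFinder version. The function name and the example identifiers are hypothetical.

```python
import csv
import io


def single_copy_orthogroups(handle):
    """Return {orthogroup: {species: gene}} for orthogroups that contain
    exactly one gene in every species column.

    Assumed layout (an assumption, verify against your files): the first
    column is the orthogroup ID, each further column is one species, and
    each cell holds comma-separated gene IDs.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    species = header[1:]
    selected = {}
    for row in reader:
        if not row:
            continue
        orthogroup, cells = row[0], row[1:]
        # Pad rows whose trailing empty cells were dropped.
        cells = cells + [""] * (len(species) - len(cells))
        genes = [[g.strip() for g in cell.split(",") if g.strip()] for cell in cells]
        if all(len(per_species) == 1 for per_species in genes):
            selected[orthogroup] = {sp: g[0] for sp, g in zip(species, genes)}
    return selected


# Hypothetical three-orthogroup table used only for illustration.
example = (
    "Orthogroup\tspeciesA\tspeciesB\n"
    "OG0000000\tA1, A2\tB1\n"  # duplicated in speciesA: rejected
    "OG0000001\tA3\tB2\n"      # one gene per species: kept
    "OG0000002\tA4\t\n"        # missing in speciesB: rejected
)
single_copy = single_copy_orthogroups(io.StringIO(example))
print(single_copy)
```

The gene identifiers returned for each orthogroup could then be used to pull the corresponding nucleotide sequences into one multi-FASTA file per orthogroup.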
A custom script (link) was used to concatenate the aligned and trimmed sequences described in section 'Sequence alignment and trimming'. The inputs and parameters are a list of alignment files to concatenate, a list of taxa to include, whether the sequences are proteins or nucleotides, and a prefix for output files. The outputs are a FASTA file of the concatenated sequences with '.fa' appended to the prefix, a RAxML-style partition file, and a file that summarizes the occupancy of each gene from each alignment.
Software: Biopython
python create_concat_matrix.py -a alignment.list -c sequence_character -t taxa.list -p output_prefix
Original author: Jacob Steenwyk
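To make the concatenation step concrete, here is a minimal standard-library sketch of the general idea: join per-gene alignments into one matrix, record RAxML-style partition boundaries, and count how many taxa are present for each gene. It is not create_concat_matrix.py and does not use Biopython. The function name, the '?' padding character for absent taxa, and the example data are assumptions made for illustration.

```python
def concatenate(alignments, taxa, data_type="DNA", missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of (gene_name, {taxon: aligned_sequence}) pairs.
    taxa: taxa to include, in output order.
    Returns (matrix, partitions, occupancy).
    The '?' padding for absent taxa is an assumption of this sketch.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    occupancy = []
    start = 1
    for gene, records in alignments:
        lengths = {len(seq) for seq in records.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        length = lengths.pop()
        present = 0
        for taxon in taxa:
            if taxon in records:
                pieces[taxon].append(records[taxon])
                present += 1
            else:
                # Taxon absent from this gene: pad with missing-data characters.
                pieces[taxon].append(missing * length)
        end = start + length - 1
        # RAxML-style partition line: data type, name, 1-based inclusive range.
        partitions.append(f"{data_type}, {gene} = {start}-{end}")
        occupancy.append((gene, present, len(taxa)))
        start = end + 1
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions, occupancy


# Hypothetical two-gene, two-taxon example.
alignments = [
    ("gene1", {"taxonA": "ATG-", "taxonB": "ATGC"}),
    ("gene2", {"taxonA": "GG"}),  # taxonB is missing for gene2
]
matrix, partitions, occupancy = concatenate(alignments, ["taxonA", "taxonB"])
for taxon, seq in matrix.items():
    print(f">{taxon}\n{seq}")
print("\n".join(partitions))
```

Tracking the running start coordinate while appending is what keeps the partition file in step with the concatenated sequence.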
To infer genome-scale phylogenies describing the evolutionary history of each subgenome, we used the exemplary command shown here.
Software: IQ-TREE
iqtree -s input.fa -seed 86924356 -st DNA -pre output -nt 24 -nbest 10 -m TEST -bb 5000
In addition, we conducted topology tests. In brief, these topology tests were used to determine if the topology inferred from one data matrix (from one parental genome) was equivalent to the topology inferred from the other data matrix (the other parental genome).
iqtree -s data_matrix_from_one_parent.fa -z Phylogenies_inferred_from_both_data_matrices.tres -n 0 -zb 10000 -zw -au -m GTR+F+I+G4
To conduct reciprocal best BLAST hit (RBBH) analysis, we used the following custom script. Part of the script relies on a resource from Harvard (link). The exemplary script performs RBBH between nucleotide sequences. Changing lines 13, 14, 17, and 18 of the script allows RBBH between protein sequences.
Software: BLAST+, Perl
bash RBBH.bash fasta_file_a fasta_file_b
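The logic of a reciprocal best hit search can be sketched as follows. This is an illustrative Python outline, not a translation of RBBH.bash. It assumes both searches were saved in BLAST+ tabular format (`-outfmt 6`), where the first two columns are the query and subject identifiers and the twelfth is the bit score, and it takes the hit with the highest bit score as the best hit. The actual script may rank or filter hits differently. The function names and example rows are hypothetical.

```python
def best_hits(blast_tabular):
    """Map each query to its highest-bit-score subject.

    Expects BLAST+ tabular text (-outfmt 6): column 1 is the query ID,
    column 2 the subject ID, and column 12 the bit score.
    """
    best = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep the first hit seen at the highest score (ties are not resolved further).
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def row(query, subject, bitscore):
    """Build one hypothetical outfmt 6 line; only IDs and bit score matter here."""
    filler = ["95.0", "100", "5", "0", "1", "100", "1", "100", "1e-40"]
    return "\t".join([query, subject] + filler + [str(bitscore)])


a_vs_b = "\n".join([row("a1", "b1", 190), row("a1", "b2", 120), row("a2", "b2", 160)])
b_vs_a = "\n".join([row("b1", "a1", 190), row("b2", "a1", 120)])
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this example a2's best hit is b2, but b2's best hit is a1, so only the a1/b1 pair is reciprocal.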
This custom pipeline calculates Ks for every gene in one genome compared to its best BLAST hit in another genome. See ./Ks_pipeline/README for a detailed explanation of the concept and usage. Note that this pipeline is designed specifically to run under the Slurm job scheduler on Vanderbilt University's high-performance computing cluster, ACCRE (Advanced Computing Center for Research and Education).
Software: BLAST+, PAML, PAL2NAL, SAMtools, Perl
bash ./Ks_pipeline/execute.sh