Adaptive evolution of protein-coding genes implicated in the macroevolution of beak shape

Scripts used in pipeline for the 'Adaptive evolution of protein-coding genes implicated in the macroevolution of beak shape. Pipeline can be split into two main types of scripts:

Filtering steps
PAML scripts

This pipeline contains external programs. The following lists scripts and the order implemented to obtain the results found in manuscript. Starting dataset of 12,018 protein-coding genes were retrieved using reciprocal BLAST.

Filtering steps.

Several preparatory steps were the implemented in lieu of filtering steps:
- trim_run.py : First used trimal (Capella-Gutiérrez et al, 2009) to convert fasta alignments into phylip files.
- Pretreatment.py: a script used to remove sequences in alignments where sequences were either too short or were more than 20% dissimilar compared to Gallus_gallus.
- Pairwise_dnds_test.py: a script used to generate pairwise dN/dS estimates. This was later used to filter saturated sequences.
Step 1
- calculate_no_species.py: Used to generate text file containing a list of alignments with less than 20 species.
- Filter_1.py: A script removing alignments with less than 20 species using list generated by calculate_no_species.py
Step 2
- Length_filter.py: if alignments were longer than 1500 amino acids (AA) and less than 150 AA, they were ignored. Alignments between these two numbers were piped into a text file.
- Filter_2.py: this script moved alignments between 1500 and 150 AA into another directory.
Step 3
- pairwise_dnds_test.py: calculates pairwise dN/dS between all species in each alignment.
- extract_dn.py, extract_dnds.py, extract_ds.py: extracts sequences with dN and dN/dS pairwise estimates equal to larger than 2 (dN, dN/dS ≥ 2), sequences with dS pairwise estimates equal to or larger than 5 (dS ≥ 5), and sequences with dS estimates larger than the pairwise dS estimate between Gallus gallus and Taeniopygia guttata, in each respective alignment. Each scripts produces a txt file with sequences that exceed the set saturation thresholds.
- Filter_4.py: Sequences in every alignment that aren't in the txt files are outwritten into a new directory and file.
Step 4
- PRANK_run.py: re-align all of the alignments.
Step 5
- filtering_cds_species.py: a script that updates all codon sequences according to the species still present in the corresponding protein alignment file.
Step 6
- Gblocks.py: Using an program called Gblocks to remove unreliable positions using the following parameters:
  - -k=y -n=y -p=t -v=32000
- Zorro_run.py: Using another program (ZORRO - a probabilisitc masking tool) to remove unreliable positions. Scores lower than 9 for each position were removed.
- group_files.py and Filter_5.py: Both of these scripts used in conjunction, were used to create a fake sequence, which took into account the unreliable positions flagged by both Gblocks and Zorro, for each alignment.
Step 7

PAL2NAL.py: utilising another external program called PAL2NAL to convert AA into codons and to mask unreliable positions.

Step 8

calculate_no_species.py: remove alignments with the less than 20 species again by creating a txt file with only alignments with more than 20 species.
Filter_1.py: Move alignments with more than 20 species into another directory.

Final step
- clean_alignment.py: cleave fake_sequence used for masking off the end of each alignment
Additional filtering steps
- update_zorro.py: to get an overall score for each sequence, not taking into account leading and lagging parts of the sequences that are masked during the step 6.
- Overall scores can then be used to remove additional alignments that don't meet a certain threshold.

Morphometric data from Cooney et al (2017)

The following data was curated from Cooney et al (2017) (https://www.nature.com/articles/nature21074) with the author's permission.

PAML pipeline

The following pipeline contains an shell script which will prepare alignments, create trees, mark branches on each alignment tree and run CODEML. Note, CODEML must be properly compiled prior and Data folder, including getTree.R script, must all be in the same directory before running the shell script.

Scripts in .sh file include:

clean_alignments.py
create_trees.py
mark_top1_branch.py
mark_top2_branch.py
mark_top3_branch.py
mark_binned_branch.py
prepare_codeml.py
runCODEML.py

LeebanY/avian-comparative-genomics

Adaptive evolution of protein-coding genes implicated in the macroevolution of beak shape

Filtering steps.

Morphometric data from Cooney et al (2017)

PAML pipeline