/LAR_phylogeny_gungor-et-al-2020

This repository contains a phylogenetic tree of LAR and other PIP family enzymes as shown in Güngör et al. 2020: "Azolla ferns testify: seed plants and ferns share a common ancestor for leucoanthocyanidin reductase enzymes". It also documents all code and intermediate files used to build that tree.

Quick links:

Phylogeny of LAR and LAR likes in plants:

Final alignment: raw & trimmed

Final complete fasta file used for the alignment which consists of:

Final figure as shown in Güngör et al. 2020

PIP enzymes and LAR phylogenetic tree

Guide through folders and files

The data folder contains (unaligned) fasta files, lists of sequence names, and aligned sequences in both trimmed and untrimmed versions. File names tend to be long, but are meant to reflect the history of that specific file. For example, 1kP_LAR_orthogroup_manual-selection-1_guidev4_aligned-mafft-linsi_trim-gt6-seq80.fasta contains sequences from the 1kP LAR orthogroup, from which a manual selection was taken. A set of guide sequences (sequences whose function has been verified) was then added. The combined sequences were aligned with mafft-linsi and trimmed with trimAl settings -gt .6 and -seq 80.
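The naming convention above can be unpacked programmatically. The following is a minimal sketch, assuming that underscores separate the successive processing steps in a file name, as the example suggests; the function name `filename_history` is hypothetical and not part of this repository.

```python
from pathlib import Path


def filename_history(filename: str) -> list[str]:
    """Split a data file name into its underscore-separated parts.

    Assumption: each underscore-separated part names either the source
    dataset or one processing step, as in the example from the text.
    """
    stem = Path(filename).stem  # drop the final extension (.fasta)
    return stem.split("_")


example = (
    "1kP_LAR_orthogroup_manual-selection-1_guidev4_"
    "aligned-mafft-linsi_trim-gt6-seq80.fasta"
)

# Print one part per line, from starting dataset to last processing step.
for part in filename_history(example):
    print(part)
```

Read top to bottom, the printed parts follow the order in which the file was produced: source orthogroup, manual selection, added guide sequences, alignment, and trimming.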

The analyses folder contains tree inferences. These are organised first in folders by starting dataset, and then in folders by alignment and trimming strategy. A single folder may still contain several tree inferences made with IQ-TREE. The final part of each filename summarises the settings used to create that particular tree file. Note that intermediate trees are just that: intermediate results. The fernLARclades_analyses directory contains tree inferences specifically on the fern LAR, WLAR1 and WLAR2 clades as shown in figure 8 of Güngör et al. 2020.
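To get an overview of which tree inferences exist for each dataset and strategy, the folder layout described above can be walked with a few lines of Python. This is only a sketch under stated assumptions: tree files are taken to end in `.treefile`, and the first two folder levels are taken to be the starting dataset and the alignment and trimming strategy. The function name `group_treefiles` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_treefiles(root):
    """Map (dataset, strategy) folder pairs to the tree files beneath them.

    Assumed layout: <root>/<dataset>/<strategy>/.../<name>.treefile
    Files that sit higher in the hierarchy get an empty string for the
    missing level(s).
    """
    root = Path(root)
    groups = defaultdict(list)
    for tree in sorted(root.rglob("*.treefile")):
        parts = tree.relative_to(root).parts
        dataset = parts[0] if len(parts) > 1 else ""
        strategy = parts[1] if len(parts) > 2 else ""
        groups[(dataset, strategy)].append(tree.name)
    return dict(groups)
```

Calling `group_treefiles("analyses")` from the repository root would then return a dictionary with one entry per dataset and strategy combination, each listing the tree files found there.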

The figures folder contains the final versions of the figures shown in Güngör et al. 2020 in several formats. These were made by importing a .treefile into iTOL, adding annotation manually, and downloading the result as an .svg file. These .svg files were then finalised in Inkscape to their published form and exported as pdf or png.

The workflows for which data is shared here are documented in Jupyter notebooks (*.ipynb). The workflow describing the final version of the tree is tree_building_workflow_v5. The other two workflows are explorative and should be interpreted as such. A blank version of the workflow is maintained here. Note that figures embedded in the Jupyter notebooks are not displayed properly online on GitHub. You may download the .ipynb files to display them locally, including images. Alternatively, an HTML export may be found here.

Finally, the condaenv.yaml file lists the names and versions of all software used in this project. This file may be used to recreate the exact software environment for this analysis using miniconda. To do so, issue a command such as: conda env create -f ./condaenv.yaml. One Perl script that is not included in the conda environment is stored in the opt directory.

Data sources used in this project

In building these trees, we used publicly available data exclusively, with the possible exception of two Azolla filiculoides sequences for which we manually corrected the automated annotation. Azolla automated annotations are available on FernBase. The manually annotated sequences used here were submitted to EBI's ENA under study accession number PRJEB39515. These sequences are also hosted in this GitHub repository as nucleotide and protein fasta files.

Notably, we used data made available by the 1000 plant transcriptomes project (1kP). First, we used the 1kP orthogroup extractor to extract a LAR orthogroup by providing it with the Vitis vinifera LAR sequence. Second, we used the online sample list viewer to create a subset of the 1kP PIP enzyme orthogroup, taking care to sample across the tree of all plants with extra attention to seed-free plants. The subset used here is online in Google Sheets, and the resulting lists are stored here in the data directory.

The 1kP project provides a wealth of sequence information on plant taxa for which few sequences are available from genome sequencing, let alone sequences whose function has been verified. We therefore gratefully made use of sequences collected in the literature and in online databases, most notably those in Koeduka's paper 'Functional evolution of biosynthetic enzymes that produce plant volatiles', published in 'Bioscience, Biotechnology, and Biochemistry' in 2018. Each of these sequences and their original accession number are listed in this fasta file.
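To inspect which sequences a fasta file in this repository lists, the header lines can be read with a small parser. The sketch below uses only the Python standard library and makes no assumption about how accession numbers are written inside the headers; it simply returns each header as is. The function name `read_fasta` and the demo records are hypothetical and not taken from the repository.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration only.
demo = """>seqA hypothetical description
MKT
AIL
>seqB
MSS
""".splitlines()

for header, sequence in read_fasta(demo):
    print(header, len(sequence))
```

To apply this to a real file, pass an open file handle instead of `demo`, for example `read_fasta(open(path))` with `path` pointing to one of the fasta files in the data folder.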