azolla_MYBs: An HTML repository from lauralwd

This repository contains a phylogenetic tree of R2R3 MYB transcription factors. It also documents all code and intermediate files used in inferring that tree. Many of these results are intermediate and should be treated as such. For the final results, please refer to the quick links listed below.

Manuscript DOI: preprint on bioRxiv

Repository DOI:

Quick links:

treefile
Main text figure png, pdf and Inkscape_svg.
Input sequences fasta
Aligned input sequences fasta, or png
Trimmed input sequences fasta or png

Final figure as shown in Dijkhuizen et al. 2021 with added MSA

The MSA shown below is not included in the manuscript because of size limitations. It shows the region of R2R3 MYBs used to distinguish the subfamilies as described by Jiang & Rao (2020). The figure actually included in the paper is available here.

Guide through folders and files

The data folder contains (unaligned) fasta files, lists of sequence names, and aligned sequences in both trimmed and untrimmed versions. File names reflect the history of that specific file and therefore tend to be rather long. For example, combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_aligned-mafft-einsi_trim-gt4.fasta contains a combination of sequences from the subfamilies I to VIII and sequences from Azolla filiculoides and Arabidopsis thaliana. Those sequences were then aligned with mafft-einsi and trimmed with a gap threshold of 0.4 (40%).
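As an illustration of this naming scheme, the sketch below splits such a file name into its processing history. It is a minimal example and not part of the repository: the function name describe_history is hypothetical, and the parsing rules (underscore-separated stages, with the prefixes aligned- and trim- marking the alignment and trimming steps) are assumptions inferred from the single example above, so other files may not follow them exactly.

```python
from pathlib import Path


def describe_history(filename):
    """Split a data file name into dataset, aligner, and trimming tags.

    Assumption: stages are separated by underscores, and the alignment and
    trimming stages start with 'aligned-' and 'trim-' respectively.
    """
    stem = Path(filename).stem  # drop the final extension, e.g. '.fasta'
    dataset, *stages = stem.split("_")

    history = {"dataset": dataset, "aligner": None, "trimming": None, "other": []}
    for stage in stages:
        if stage.startswith("aligned-"):
            history["aligner"] = stage[len("aligned-"):]
        elif stage.startswith("trim-"):
            history["trimming"] = stage[len("trim-"):]
        else:
            # Stages without a recognised prefix are kept as-is.
            history["other"].append(stage)
    return history


example = (
    "combi-I-to-VIII-Azfi-Arabidopsis_sequences_linear_"
    "aligned-mafft-einsi_trim-gt4.fasta"
)
print(describe_history(example))
```

For the example file this reports the dataset combi-I-to-VIII-Azfi-Arabidopsis, the aligner tag mafft-einsi, and the trimming tag gt4, which the text above describes as a gap threshold of 0.4.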

The analyses folder contains tree inferences and annotation information for use in iToL. These are organised first in folders per starting dataset, and then in folders per alignment and trimming strategy. Even so, a single folder may contain several tree inferences made with IQTree. The final part of each filename summarises the settings used to create that particular tree file. Note that intermediate trees are just that: intermediate results.

The figures folder contains the final versions of the figures shown in the manuscript in several formats. These were made by importing a .treefile in iToL, then adding annotation manually, and downloading the result as an .svg file. Annotation files for use in iToL can be found in the different directories in the analyses directory. These .svg files were then finalised in Inkscape to their published form and exported as pdf or png.

Jupyter notebooks

The workflows shared here are documented in Jupyter notebooks (*.ipynb). Most notebooks contain intermediate work that is shared for transparency and reproducibility purposes and should be treated as such. Alternatively, the git history may be explored for more information. Note that figures embedded in the Jupyter notebooks may not be displayed correctly online on GitHub. You may download the .ipynb files to display them locally, including images. Alternatively, an HTML export accompanies each Jupyter notebook file.

In step1_differentiate_subfamilies_VI_and_VII (html preview & ipynb preview) we gather R2R3 MYB sequences of subfamily VI & VII and reproduce findings by Jiang & Rao (2020).
In step2_classify-Azfi-RNAseq-targets (html preview & ipynb preview) we place several Azolla filiculoides sequences in the phylogeny of subfamily VI & VII R2R3 MYBs and compare the differentiating domains as described by Jiang & Rao (2020).
In step3_VI-subfam_in_azolla (html preview & ipynb preview) missing type VI sequences were identified in the Azolla filiculoides genome with HMMs and added to the phylogeny.
In step4_expanding-phylogeny (html preview & ipynb preview) the phylogenetic analysis was expanded with R2R3 MYB sequences from all subfamilies (I to VIII). Sequences were taken from the Jiang & Rao (2020) paper.
Finally, in step5_supplement-with-arabidopsis-sequences (html preview & ipynb preview) some uninformative and rogue sequences were removed, Arabidopsis thaliana sequences were added, more Azolla filiculoides sequences were added, and the tree was annotated with RNA-seq data for A. filiculoides.

A template version of the workflow is maintained here.

Finally, the envs directory contains conda environment export files detailing the names and versions of all software used in this project. These files may be used to recreate the exact software environment for this analysis using miniconda. To do so, run a command such as: conda env create -f ./condaenv.yaml

Data sources used in this project

In building these trees, we have made use of publicly available data exclusively. Most notably, we build upon the work of Jiang & Rao (2020). Azolla automated annotations are available on Fernbase. The manually re-annotated A. filiculoides R2R3 MYB sequence is made available in ENA and NCBI under accession number [....] . This sequence and all raw RNA-seq reads used in this project are also made available in ENA and NCBI under project accession number [....] .

All sequences taken from the several databases used here and their original accession numbers are listed in the data folder, organised in files per subfamily type. These sequences originate from several databases, each with a slightly different naming system. The Jiang & Rao (2020) paper lists each of the species used here, and where to find the right database to search for accession numbers. Those are predominantly:

NCBI nucleotide and protein.
Fernbase for Azolla filiculoides and Salvinia cucullata.
ConGenIE for Picea abies.
marchantia.info for Marchantia polymorpha.
UniProt for Arabidopsis thaliana sequences.

Authors

The analyses in this repository were conceived and executed by Dr. Henriette Schluepmann (orcid Utrecht University) and PhD candidate Laura Dijkhuizen (orcid Utrecht University website).
