This repository details our work towards placing Azolla filiculoides MIKC genes in a broader picture of MIKC transcription factor evolution. The phylogeny created here is featured in this preprint. Many of the results made publicly available here are intermediate and should be treated as such. For the final results, please refer to the quick links listed below.
manuscript DOI: preprint
Final input, output and intermediate files:
The five directories in this repository should be reasonably self-explanatory.
The data
directory contains input sequences and alignments.
Since I do not consider phylogenetic trees to be data, these are stored in the analyses
directory.
The docs
directory contains several HTML "print-outs" of the Jupyter notebooks, since the notebooks themselves may not render on GitHub.
For any questions about software versions used: conda environments used in the analyses documented here are available in the envs
directory.
If no specific conda environment is listed in the notebooks, then the default environment 'phylogenetics' was used.
Finally, the figures
directory contains two manually "polished" versions of the phylogenetic tree and annotating data for direct inclusion in a journal submission.
This repository contains many Jupyter notebooks in which my colleagues and I tried, step by step, to obtain a clearer phylogenetic signal of MIKC evolution and a better interpretation of the placement of fern sequences. In each iteration of this process, MIKC protein sequences were aligned and filtered, a tree was inferred, and conclusions were drawn. Each notebook describes one such iteration. Below I briefly list the main conclusion or improvement of each notebook, in chronological order.
In MIKC_tree_workflow-v1.ipynb
(ipynb
html
iToL)
we make a first version of a MIKC tree.
We sample MIKC(-like) genes from specific species across all major groups of plants as described in the 1kP project.
In MIKC_tree_workflow-v2.ipynb
(ipynb
html
iToL)
we replace some species with others, add bryophytes and lycophytes, and attempt to introduce an outgroup of non-plant sequences.
In MIKC_tree_workflow-basalclades-v1.ipynb
(ipynb
html
iToL)
we attempt to improve the phylogenetic signal by sampling fewer sequences in more recent branches; about a third compared to the previous workflow.
In MIKC_tree_workflow-basalclades-v2.ipynb
(ipynb
html
iToL_gt4)
I add Salvinia cucullata sequences from FernBase and remove some sequences that were behaving oddly due to overly large horizontal gaps in the alignment.
Finally the Chara globularis MADS1 sequence was added, like in the 1kP capstone paper, to serve as an outgroup.
In algal sequences.ipynb
(ipynb
html
)
I extract and align all algal sequences from the MIKC orthogroup from the 1kP project in an effort to identify those algal sequences which truly have all four domains: M, I, K and C.
We found that in this orthogroup, sequences almost always have the highly conserved M domain but often lack the IKC domains or part thereof.
Based on these results, we proceed only with algal sequences that contain all four domains.
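The four-domain check described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the notebook: the domain column ranges in `DOMAINS`, the `min_coverage` threshold and the toy sequences are all hypothetical, and in practice the domain boundaries would have to be read from the real alignment.

```python
# Hypothetical alignment column ranges (start, end) for each domain.
# These numbers are made up for a 12-column toy alignment.
DOMAINS = {"M": (0, 4), "I": (4, 6), "K": (6, 10), "C": (10, 12)}


def domain_coverage(seq, start, end, gap_chars="-."):
    """Fraction of non-gap characters of an aligned sequence within [start, end)."""
    region = seq[start:end]
    return sum(1 for c in region if c not in gap_chars) / len(region)


def has_all_domains(seq, domains=DOMAINS, min_coverage=0.5):
    """True if every domain region is covered by at least `min_coverage` residues."""
    return all(
        domain_coverage(seq, start, end) >= min_coverage
        for start, end in domains.values()
    )


# Toy aligned sequences (hypothetical, not real MIKC sequences).
toy_alignment = {
    "complete": "MGRGKIEIKRIE",
    "m_only": "MGRG--------",
}

kept = {name: seq for name, seq in toy_alignment.items() if has_all_domains(seq)}
print(sorted(kept))
```

A per-domain threshold, rather than an overall gap fraction, is used here so that a sequence with a complete M domain but no IKC region is still rejected.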
In MIKC_tree_workflow-basalclades-v3.ipynb
(ipynb
html
iToL)
I aim to identify non-MIKC sequences and remove these from the analyses;
hence I'm making a phylogeny of MIKC genes only, excluding sequences that have just an M domain.
In a separate notebook, I already did so for all algal sequences in the 1kP MIKC orthogroup.
These should confirm the placement of the CgMADS1 sequence and provide a solid root to the tree.
In MIKC_tree_workflow-basalclades-v4.ipynb
(ipynb
html
iToL)
I remove some sequences and run the tree with non-parametric bootstraps instead.
In MIKC_tree_workflow-basalclades-v5.ipynb
(ipynb
html
iToL)
I added gymnosperms and Azolla sequences.
Since the algal outgroup has proven to be stable but also very big, its size is trimmed down.
In MIKC_tree_workflow-basalclades-v6.ipynb
(ipynb
html
iToL-UFbootstrap
iToL-nonparametric)
some redundant Azolla sequences were removed again and a big clade of MIKC* sequences was removed.
MIKC* sequences are characterised by a longer C domain.
In MIKC_tree_workflow-basalclades-v7.ipynb
(ipynb
html)
several rogue taxa were removed: taxa that were poorly supported and changed position in the tree between different inferences.
Also, the algal outgroup was reduced in size, and I experimented with different extents of column-content trimming.
I varied the minimum sequence content per column from 10% to 50% and made UFBootstrap tree inferences:
iToL UFbootstrap
gt .1
gt .2
gt .3
gt .4
gt .5
Based on the alignments and the trees, I chose the 50% sequence content alignment and made a non-parametric tree, only to find that bootstrap support broke down completely:
itol-nonparametric.
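To make the column-content trimming concrete, here is a minimal plain-Python sketch. It is not the tool used in the notebooks; the function name `trim_columns`, the `min_content` parameter and the toy alignment are assumptions made for illustration. A column is kept only if at least `min_content` of the sequences have a residue (not a gap) at that position.

```python
def trim_columns(alignment, min_content=0.5, gap_chars="-."):
    """Keep only alignment columns where >= min_content of sequences have a residue.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for i in range(length):
        residues = sum(1 for seq in seqs if seq[i] not in gap_chars)
        if residues / len(seqs) >= min_content:
            keep.append(i)

    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Toy alignment (hypothetical sequences, for illustration only).
toy = {
    "seq_a": "MGR-KIE",
    "seq_b": "MGRG-IE",
    "seq_c": "M-R--IE",
    "seq_d": "MGR--I-",
}

# Compare how many columns survive at a few thresholds.
for threshold in (0.1, 0.3, 0.5):
    trimmed = trim_columns(toy, min_content=threshold)
    print(threshold, len(trimmed["seq_a"]))
```

Stricter thresholds remove more sparse columns, which reduces noise but also discards sites that may carry signal for a subset of the taxa.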
In MIKC_tree_workflow-basalclades-v8.ipynb
(ipynb
html)
sequences were removed from the alignment more stringently if they missed too many residues compared to the rest of the alignment.
Also, more algal sequences were added back into the dataset again to provide the tree with a more solid and confident outgroup.
Since bootstrap values on basal branches remained low and uninformative, I also experimented with alternative support assessments.
Stricter trimming of alignments and jackknifing were deemed unsuccessful, and a Bayesian method was not considered feasible given the deadline.
Finally, experimentation with transfer bootstraps proved to provide more informative tree support values.
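The stricter sequence filtering mentioned above can be sketched as follows. This is an illustrative sketch only: the function name `drop_gappy_sequences`, the `max_missing` threshold of 0.5 and the toy sequences are assumptions, not the settings used in the notebook.

```python
def drop_gappy_sequences(alignment, max_missing=0.5, gap_chars="-."):
    """Remove aligned sequences whose fraction of gap characters exceeds max_missing."""
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # skip empty records rather than dividing by zero
        missing = sum(1 for c in seq if c in gap_chars) / len(seq)
        if missing <= max_missing:
            kept[name] = seq
    return kept


# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = {
    "full": "MGRGKIEIKR",      # no gaps
    "partial": "MGRG---IKR",   # 30% gaps
    "fragment": "MG--------",  # 80% gaps
}

filtered = drop_gappy_sequences(toy_alignment, max_missing=0.5)
print(sorted(filtered))
```

Running such a filter after column trimming judges each sequence against the columns that are actually used for tree inference.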
In MIKC_tree_workflow-basalclades-v9.ipynb
(ipynb
html)
I tried to keep the best of versions 6, 7 and 8: a solid algal outgroup, strict trimming of sequences, and more informative support assessment.
More notably, I revisited alignment optimisation with prank
.
A set of sequences was aligned, and an ML tree was inferred with IQ-TREE as before.
Then this ML tree was used to re-align insertions and deletions with prank
, ideally making a clear and less noisy alignment.
A final tree was then made with IQ-TREE using 1000 non-parametric bootstraps, and transfer bootstrap support values were then calculated with booster
.
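To illustrate why transfer bootstrap support can be more informative than classical bootstrap support on deep branches, the sketch below computes the transfer bootstrap expectation for a single branch. It is a simplified illustration, not the booster implementation: trees are represented directly as lists of bipartitions (one side of each split, as a set of taxon names), and the function names and toy trees are assumptions. The support for a branch is 1 minus the mean transfer distance to the closest branch in each bootstrap tree, divided by p - 1, where p is the size of the smaller side of the branch's bipartition.

```python
def transfer_distance(split_a, split_b, taxa):
    """Number of taxa that must be moved to turn one bipartition into the other.

    Each split is given as the set of taxa on one side; either side may be used,
    because the complement is handled by the min() below.
    """
    diff = len(split_a ^ split_b)
    return min(diff, len(taxa) - diff)


def transfer_support(ref_split, bootstrap_trees, taxa):
    """Transfer bootstrap expectation for one reference branch.

    `bootstrap_trees` is a list of trees, each a list of splits (sets of taxa).
    """
    p = min(len(ref_split), len(taxa) - len(ref_split))
    if p < 2:
        raise ValueError("support is only defined for internal branches")
    total = 0
    for splits in bootstrap_trees:
        closest = min(transfer_distance(ref_split, s, taxa) for s in splits)
        # Terminal branches always lie within p - 1 transfers, so cap there.
        total += min(closest, p - 1)
    return 1 - (total / len(bootstrap_trees)) / (p - 1)


# Toy example with six hypothetical taxa.
taxa = frozenset("abcdef")
ref_split = frozenset("abc")
bootstrap_trees = [
    [frozenset("ab"), frozenset("abc"), frozenset("de")],  # contains the branch
    [frozenset("ab"), frozenset("abd"), frozenset("ef")],  # one taxon misplaced
]

support = transfer_support(ref_split, bootstrap_trees, taxa)
print(support)
```

In this toy case the branch is recovered exactly in only one of the two bootstrap trees, so classical bootstrap support would be 0.5, whereas the transfer support is 0.75 because the second tree is only one taxon away from recovering it.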
While gathering data to build these trees, we have relied heavily on publicly available data. Most sequences used here were generated by the 1kP project. Specifically, we used the 1kP orthogroup extractor to retrieve an orthogroup of MIKC sequences using the query 'At2g45660': Arabidopsis thaliana SOC1. To guide our way through the phylogeny, we collected a set of guide sequences, as shown in Zhang et al. (2019), and added these guide sequences to our data before alignment.
- The Azolla lab at Utrecht University
- A MYB phylogeny workflow, similar to this one and featured in the same preprint.
- A blank version of this workflow
The analyses in this repository were conceived and executed by Dr. Henriette Schluepmann (orcid, Utrecht University) and PhD candidate Laura Dijkhuizen (orcid, Utrecht University, website).