Code repository to generate an automatically updated Fern Tree of Life.
Please see the accompanying paper:
- Nitta JH, Schuettpelz E, Ramírez-Barahona S, Iwasaki W. An open and continuously updated fern tree of life (FTOL) https://doi.org/10.1101/2022.03.31.486640
All code is in R, and workflow is controlled with the targets package.
A docker image is available to run the code.
Docker tags match the version of FTOL; e.g., the image with tag 1.0.1 is the image used to generate FTOL v1.0.1.
Run the setup.R script to create the folder structure needed to store external files and download most of the data files automatically.
source("R/setup.R")
Alternatively, you can create the following folder hierarchy manually:
_targets
└── user
    ├── data_raw
    │   ├── ref_aln
    │   └── restez
    │       └── sql_db
    ├── intermediates
    │   ├── blast_sanger
    │   ├── iqtree
    │   │   ├── plastome
    │   │   ├── sanger
    │   │   ├── sanger_1
    │   │   ├── sanger_10
    │   │   ├── sanger_2
    │   │   ├── sanger_3
    │   │   ├── sanger_4
    │   │   ├── sanger_5
    │   │   ├── sanger_6
    │   │   ├── sanger_7
    │   │   ├── sanger_8
    │   │   ├── sanger_9
    │   │   └── sanger_fast
    │   ├── ref_seqs
    │   └── treepl
    │       ├── con
    │       ├── ml
    │       └── ts
    └── results
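The hierarchy above can also be created from R in one step. The following is a minimal sketch using only base R; `create_ftol_dirs()` is a hypothetical helper written for illustration and is not part of this repository (setup.R may do this differently).

```r
# Leaf folders of the hierarchy listed above, relative to _targets/user.
iqtree_dirs <- c("plastome", "sanger", paste0("sanger_", 1:10), "sanger_fast")
leaf_dirs <- c(
  file.path("data_raw", c("ref_aln", file.path("restez", "sql_db"))),
  file.path("intermediates", "blast_sanger"),
  file.path("intermediates", "iqtree", iqtree_dirs),
  file.path("intermediates", "ref_seqs"),
  file.path("intermediates", "treepl", c("con", "ml", "ts")),
  "results"
)

# Hypothetical helper (an assumption, not from this repository):
# creates every folder under <root>/_targets/user.
create_ftol_dirs <- function(root = ".") {
  paths <- file.path(root, "_targets", "user", leaf_dirs)
  for (p in paths) {
    # recursive = TRUE also creates any missing parent folders
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}
```

Calling `create_ftol_dirs()` from the project root should produce the same layout as the tree; folders that already exist are left untouched.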
Another folder called ftol_data (to store data files generated by this workflow that will be made available via the ftolr R package) also needs to be created in the project root. This folder is itself a repo that can be cloned from https://github.com/fernphy/ftol_data.
If setup.R was run successfully, it will have already downloaded and unzipped the input data files from FigShare.
Alternatively, you can do so manually following these instructions:
- Download ref_aln.tar.gz (reference alignments) and restez_sql_db.tar.gz (local GenBank database) from FigShare (https://doi.org/10.6084/m9.figshare.19474316)
- Unzip ref_aln.tar.gz and put the ref_aln folder in _targets/user/data_raw/
- Unzip restez_sql_db.tar.gz and put the bat and sql_logs folders in _targets/user/data_raw/restez/sql_db.
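The manual unpacking steps above can be sketched in base R as follows. `unpack_archive()` is a hypothetical helper, not part of this repository. The sketch assumes both archives have already been downloaded into the project root, that ref_aln.tar.gz contains a top-level ref_aln folder, and that restez_sql_db.tar.gz contains the bat and sql_logs folders at its top level; if the archives are laid out differently, move the extracted folders to the locations listed above.

```r
# Hypothetical helper (an assumption, not from this repository):
# extracts a .tar.gz archive into `dest`, creating `dest` if needed.
unpack_archive <- function(archive, dest) {
  stopifnot(file.exists(archive))
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  utils::untar(archive, exdir = dest)
  invisible(list.files(dest))
}

# Only unpack archives that are actually present in the working directory.
if (file.exists("ref_aln.tar.gz")) {
  unpack_archive("ref_aln.tar.gz", file.path("_targets", "user", "data_raw"))
}
if (file.exists("restez_sql_db.tar.gz")) {
  unpack_archive(
    "restez_sql_db.tar.gz",
    file.path("_targets", "user", "data_raw", "restez", "sql_db")
  )
}
```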
- The script to generate the local GenBank database (restez_sql_db.tar.gz) is setup_gb.R.
- The script (targets workflow) to generate the reference FASTA files for extracting target gene regions (ref_aln.tar.gz) is prep_ref_seqs_plan.R.
The data that result from these scripts have been made available on FigShare as described above, so these generally shouldn't need to be run.
- _targets.R defines the main workflow to generate FTOL. This can be run with targets::tar_make().
Note that this code was designed to be run on a multi-core machine, so the number of cores specified in the workflow may need to be changed to suit your machine.
The complete workflow takes 1-2 weeks to run; phylogenetic analysis accounts for by far most of that time.
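Before starting a run this long, it may help to check how many cores the machine offers. The sketch below uses `parallel::detectCores()` from base R; `choose_workers()` and the requested value of 8 are hypothetical and for illustration only. How the core count is actually set in this repository's workflow is not shown here.

```r
# Hypothetical helper (an assumption, not from this repository):
# caps a requested worker count at the cores available, leaving one free.
choose_workers <- function(requested, available = parallel::detectCores()) {
  if (is.na(available)) available <- 1L  # detectCores() can return NA
  usable <- max(1L, available - 1L)
  as.integer(min(requested, usable))
}

# 8 is an arbitrary illustrative request, not a value from the workflow.
n_workers <- choose_workers(requested = 8)
message("Workers to use: ", n_workers)

# Once the core count in the workflow has been adjusted accordingly,
# the pipeline can be started from the project root:
# targets::tar_make()
```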
- code: MIT