# btb-phylo

`btb-phylo` is APHA software that provides tools for performing phylogeny on processed bovine TB WGS data. It is also used in production to build the SNP matrices that serve the ViewBovine APHA app.

The software can run on any Linux EC2 instance within DEFRA's scientific computing environment (SCE) with read access to `s3-csu-003`. It downloads consensus files from S3, from which it builds SNP matrices and phylogenetic trees.
The full pipeline can be run with Docker; Docker is the only software that needs to be installed.

- Clone this GitHub repository:

```
git clone https://github.com/APHA-CSU/btb-phylo.git
```

- Run the following command from inside the cloned repository to run the pipeline inside a Docker container:

```
./btb-phylo.sh path/to/results/directory path/to/consensus/directory -c path/to/config/json -j 1 --with-docker
```
This will download the latest Docker image from DockerHub and run the full `btb-phylo` pipeline. Consensus files are downloaded from `s3-csu-003` and a SNP matrix is built using a single thread.

- `path/to/results/directory` is an absolute path to the local directory for storing results;
- `path/to/consensus/directory` is an absolute path to a local directory where consensus sequences are downloaded;
- `path/to/config/json` is an absolute path to the configuration file, in `.json` format, that specifies filtering criteria for including samples;
- `-j` is an optional argument setting the number of threads to use for building SNP matrices; if omitted, it defaults to the number of available CPU cores.
By default the results directory will contain:

```
.
├── metadata
│   ├── all_wgs_samples.csv
│   ├── deduped_wgs.csv
│   ├── filters.json
│   ├── metadata.json
│   └── passed_wgs.csv
├── multi_fasta.fas
├── snps.csv
└── snps.fas
```
- `all_wgs_samples.csv`: a CSV file containing metadata for all WGS samples in `s3-csu-003`;
- `deduped_wgs.csv`: a copy of `all_wgs_samples.csv` with duplicate submissions removed;
- `filters.json`: a `.json` file describing the filters used for choosing samples;
- `metadata.json`: a `.json` file containing metadata for a `btb-phylo` run;
- `passed_wgs.csv`: a copy of `deduped_wgs.csv` after filtering, i.e. WGS metadata for all samples included in phylogeny;
- `multi_fasta.fas`: a FASTA file containing consensus sequences for all samples included in the results;
- `snps.fas`: a FASTA file containing consensus sequences for all samples included in the results, where only SNP sites are retained;
- `snps.csv`: a SNP matrix.
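The SNP matrix in `snps.csv` can be loaded with a few lines of Python. The sketch below is illustrative only: it assumes the comma-separated `snp-dists` layout (a square matrix with sample names in the first row and first column), and the sample names and distances are made up:

```python
import csv
import io

# Hypothetical stand-in for the contents of snps.csv; the real file
# holds one row and column per sample included in the phylogeny.
example = """snp-dists 0.8.2,sample_A,sample_B,sample_C
sample_A,0,12,7
sample_B,12,0,9
sample_C,7,9,0
"""

rows = list(csv.reader(io.StringIO(example)))
names = rows[0][1:]  # sample names from the header row
# Build a {(sample_1, sample_2): distance} lookup from the square matrix
dist = {
    (row[0], other): int(value)
    for row in rows[1:]
    for other, value in zip(names, row[1:])
}

print(dist[("sample_A", "sample_B")])  # 12
```

To read the real file, replace `io.StringIO(example)` with `open("snps.csv")`.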
For example:

```
./btb-phylo.sh ~/results ~/consensus -c $PWD/example_config.json -j 1 --with-docker
```

This will run the full pipeline inside a Docker container with 4 samples, downloading consensus sequences to `~/consensus` and saving the results to `~/results`. The final output should be:

```
This is snp-dists 0.8.2
Will use 4 threads.
Read 4 sequences of length 831
```
- You must have `python3` and `python3-pip` installed. Using a virtual environment with either `venv` or `virtualenv` is recommended:

```
sudo apt install python3
sudo apt install python3-pip
```
- Clone this GitHub repository:

```
git clone https://github.com/APHA-CSU/btb-phylo.git
```

- Install the required Python packages:

```
cd btb-phylo
python setup.py install
```
- Install software dependencies:

```
sudo apt update
bash ./install/install.bash
```

`./install/install.bash` will install the following dependencies:

- `snp-sites` (installed with `apt`)
- `snp-dists` (installed from source to `~/biotools`, with a symlink in `/usr/local/bin`)
- `megacc` (installed with `apt` from a `.deb` file)
For example:

```
./btb-phylo.sh ~/results ~/consensus -c $PWD/example_config.json -j 1
```

This will run the full pipeline locally with 4 samples, downloading consensus sequences to `~/consensus` and saving the results to `~/results`. The final output should be:

```
This is snp-dists 0.8.2
Will use 4 threads.
Read 4 sequences of length 831
```
The full pipeline consists of six main stages:

1. Updating a local `.csv` file that contains metadata for every processed APHA bovine TB sample. The default path of this file is `./all_wgs_samples.csv`. When new samples are available in `s3-csu-003`, this file is updated with the new samples only.
2. Removing duplicate WGS submissions. Multiple samples may exist for a given submission, generally due to poor-quality data or inconclusive outcomes. This stage chooses one sample from each submission.
3. Filtering the samples by a set of criteria defined in either the configuration file or a set of command-line arguments. The metadata file for filtered samples is saved in the results directory.
4. "Consistifying" the samples with cattle and movement data. Designed for use with ViewBovine, this removes samples from the WGS, cattle and movement datasets that are not common to all three datasets.
5. Downloading consensus sequences for the filtered sample set from `s3-csu-003`. If a consistent directory is used for storing consensus sequences, then only new samples will be downloaded.
6. Performing phylogeny: detecting SNP sites using `snp-sites`, building a SNP matrix using `snp-dists` and optionally building a phylogenetic tree using `megacc`.
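The filtering stage can be thought of as testing each sample's metadata row against every criterion in turn. The sketch below is illustrative only and is not `btb-phylo`'s actual code: the helper name, the set of numerical columns and the field values are assumptions, with numerical criteria taken as [min, max] pairs and categorical criteria as lists of allowed strings (as described in the configuration-file section):

```python
# Illustrative sketch of the filtering idea; not btb-phylo's actual code.
NUMERICAL = {"pcMapped", "Ncount", "GenomeCov", "MeanDepth"}  # assumed subset

def passes_filters(sample: dict, criteria: dict) -> bool:
    """Return True if a metadata row satisfies every criterion."""
    for key, criterion in criteria.items():
        if key in NUMERICAL:
            low, high = criterion  # [min, max] pair
            if not (low <= float(sample[key]) <= high):
                return False
        else:
            if sample[key] not in criterion:  # list of allowed strings
                return False
    return True

sample = {"pcMapped": "97.2", "group": "B6-84", "flag": "BritishbTB"}
criteria = {"pcMapped": [95, 100], "group": ["B6-84", "B1-11"]}
print(passes_filters(sample, criteria))  # True
```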
Stages 1-6 described above can be run in isolation or in combination via a set of sub-commands.
```
usage: btb-phylo [-h] {update_samples,filter,de_duplicate,consistify,phylo,full_pipeline,ViewBovine} ...

positional arguments:
  {update_samples,filter,de_duplicate,consistify,phylo,full_pipeline,ViewBovine}
                        sub-command help
    update_samples      updates a local copy of "all_wgs_samples" metadata .csv file
    filter              filters wgs_samples.csv file
    de_duplicate        removes duplicated wgs samples from wgs_samples.csv
    consistify          removes wgs samples that are missing from cattle and movement data (metadata warehouse)
    phylo               performs phylogeny
    full_pipeline       runs the full phylogeny pipeline: updates full samples summary, filters samples and
                        performs phylogeny
    ViewBovine          runs phylogeny with default settings for ViewBovine

optional arguments:
  -h, --help            show this help message and exit
```
Get the full list of optional arguments for any sub-command:

```
python btb_phylo.py sub-command -h
```

where `sub-command` is one of `update_samples`, `filter`, `de_duplicate`, `consistify`, `phylo`, `full_pipeline`, `ViewBovine`.
Running the full pipeline:

- on all pass samples:

```
python btb_phylo.py full_pipeline path/to/results/directory path/to/consensus/directory
```

- on all pass samples + "consistified" with cattle and movement data:

```
python btb_phylo.py full_pipeline path/to/results/directory path/to/consensus/directory --cat_mov_path path/to/folder/with/cattle/and/movement/csvs
```

- filtering with a configuration file:

```
python btb_phylo.py full_pipeline path/to/results/directory path/to/consensus/directory --config path/to/config/file
```

- on a subset of samples:

```
python btb_phylo.py full_pipeline path/to/results/directory path/to/consensus/directory --sample_name AFT-61-00846-22 AF-12-02550-18 16-3828-08-a
```

- building a phylogenetic tree and filtering with a configuration file:

```
python btb_phylo.py full_pipeline path/to/results/directory path/to/consensus/directory --config path/to/config/file --build_tree
```

Other common optional arguments are:

- `--download_only`: optional switch to download consensus sequences without doing phylogeny;
- `-j`: the number of threads to use with `snp-dists`; default is 1.
`btb-phylo` provides a SNP matrix for ViewBovine APHA. Details of the ViewBovine phylogeny dataflow and sample selection are provided in ViewBovineDataFlow.md.

Updating the SNP matrix is triggered manually and should be run either weekly or on arrival of new processed WGS data.

- Ensure that the dev machine has write access to `s3-ranch-042`;
- Update a local copy of cattle and movement metadata (details of how to do this are in the ViewBovine readme);
- Run the ViewBovine update script:

```
bash ViewBovine/scipts/update.bash {path/to/directory/containing/cattle/and/movement/csvs}
```
This will use predefined filtering criteria to download new samples to a local directory in this repo, consistify the samples with cattle and movement data, and update the SNP matrix. It will then push the results up to `s3-ranch-042`.

By default the results directory will contain:

```
.
├── metadata
│   ├── CladeInfo.csv
│   ├── all_wgs_samples.csv
│   ├── cattle.csv
│   ├── consistified_wgs.csv
│   ├── deduped_wgs.csv
│   ├── filters.json
│   ├── metadata.json
│   ├── movement.csv
│   ├── passed_wgs.csv
│   └── report.csv
├── multi_fasta.fas
├── snps.csv
└── snps.fas
```
The configuration file specifies which filtering criteria should be used to choose samples. It is a `.json` file with the following format:

```
{
    "parameter_1": [criteria],
    "parameter_2": [criteria],
    .
    .
    .
    "parameter_n": [criteria]
}
```
Each `parameter` key should be one of the following: `Sample`, `GenomeCov`, `MeanDepth`, `NumRawReads`, `pcMapped`, `Outcome`, `flag`, `group`, `CSSTested`, `matches`, `mismatches`, `noCoverage`, `anomalous`, `Ncount`, `ResultLoc`, `ID`, `TotalReads`, `Abundance` (i.e. the column names in the `FinalOut.csv` output from `btb-seq`).
For numerical variables, e.g. `Ncount` and `pcMapped`, the criteria should be a maximum and minimum number. For categorical variables, e.g. `Sample`, `group` (clade) or `flag`, the criteria should be a list of strings.
See the example configuration, example_config.json, which selects 6 samples if: they have `pcMapped` > 95%; are in the B6-84, B1-11, B6-11 or B3-11 clades; are flagged either `BritishbTB` or `nonBritishbTB`; and have a maximum `Ncount` of 56000.
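Based on the criteria described above, a configuration along these lines would express that selection. This is a sketch only: the numerical bounds are written here as [min, max] pairs and the upper/lower limits (100 for `pcMapped`, 0 for `Ncount`) are assumptions, so check the exact shape against the repository's example_config.json:

```json
{
    "pcMapped": [95, 100],
    "group": ["B6-84", "B1-11", "B6-11", "B3-11"],
    "flag": ["BritishbTB", "nonBritishbTB"],
    "Ncount": [0, 56000]
}
```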
To perform phylogeny without any filters, i.e. on all pass samples, simply omit the `-c` option.