This pipeline, built with Snakemake, is a structural variant calling pipeline. The original pipeline, published in "The structural variation landscape in 492 Atlantic salmon genomes" (see references), was adapted to be more accessible to other research projects. Additionally, automatic cloud deployment (in active development) was added to simplify workloads. The inputs are:
- Reference genome database, saved as an `.fa` file
- Forward and reverse sequence files, saved as `.fq` files
This pipeline uses the following tools:
- `bwa`
  - reference genome indexing
  - local alignments
- `samtools`
  - reference genome indexing
  - sorting alignments
  - indexing BAM files
- `goleft`
  - index coverage visualizations
- `smoove`
  - variant calling on BAM files
  - genotyping of variant calls
- custom script for extracting gap regions
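To make the tool list concrete, a local run corresponds roughly to the following sequence of invocations. This is an illustrative sketch, not the pipeline's actual code: file names, output directories, and flag choices here are assumptions, and Snakemake manages the real intermediate files.

```shell
# index the reference for alignment (bwa) and for downstream tools (samtools)
bwa index reference.fa
samtools faidx reference.fa

# align the paired reads and sort the resulting alignments
bwa mem reference.fa forward.fq reverse.fq | samtools sort -o sample.sorted.bam
samtools index sample.sorted.bam

# coverage visualizations derived from the BAM index
goleft indexcov --directory indexcov_out/ sample.sorted.bam

# call structural variants, then genotype the calls
smoove call --name sample --fasta reference.fa --outdir smoove_out/ sample.sorted.bam
smoove genotype --name sample --fasta reference.fa --outdir smoove_out/ \
    --vcf smoove_out/sample-smoove.vcf.gz sample.sorted.bam
```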
To install the pipeline, simply run

```shell
$> git clone https://github.com/mchowdh200/animal_svs.git
```

Then, in your favorite Python development environment, run

```shell
$> cd animal_svs
$animal_svs> pip install -r requirements.txt
```
As of now, there are two available workflows:
- Local
- GCP VM instance deployment
To run the pipeline locally, additional dependencies are needed (if not already installed). To install them, run the following in your terminal:

```shell
$animal_svs> chmod u+x build.sh
$animal_svs> ./build.sh
```

This will install Miniconda and Snakemake, and add the bioconda channel.
After this installation is complete, edit the `config.yaml` file found in the `src` directory. Please provide full paths, such as `/Users/<user_name>/Documents/path/to/files/`, for all input files and output directories. Make sure that the `type` value under `run`, `deployment` is set to `'local'`.
An example config file for local deployment looks like:

```yaml
############### Parameters associated with the data ###############
input:
  # as of right now, we only support a single experiment
  # with a forward and reverse component
  samples:
    forward: '/home/<user_name>/animal_svs/data/samples/forward.fq'
    reverse: '/home/<user_name>/animal_svs/data/samples/reverse.fq'
  # reference genome database
  reference: '/home/<user_name>/animal_svs/data/reference.fa'
  # used to name output files. Not needed. If left blank,
  # defaults to the filename of the forward sequence
  sample_name: 'local-run'

############### Parameters associated with the run ###############
run:
  # temporary file store. Files will be deleted
  temp_dir: '/home/<user_name>/animal_svs/temp/'
  # output any issues/progress to this location
  logs_dir: '/home/<user_name>/animal_svs/logs/'
  # directory to save final output files
  output_dir: '/home/<user_name>/animal_svs/output/'
  # for deployment either locally or to the cloud
  deployment:
    # supported types: 'cloud', 'local'
    type: 'local'
    # number of cores to use. If left at 0, max cores are used
    cores: 0
  # NOTE: the rest of the yaml does not matter for local deployment
  # values below can have any value or you can remove them
```
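The `sample_name` fallback and the deployment check described in the config comments can be sketched as follows. This is a minimal illustration under assumptions, not the pipeline's actual code; the function names are hypothetical, and the config is shown as an already-parsed dict (the real pipeline would load it from YAML).

```python
from pathlib import Path

def resolve_sample_name(config: dict) -> str:
    """Return the configured sample name, falling back to the
    forward read's filename (without extension) when blank."""
    name = config["input"].get("sample_name") or ""
    if name.strip():
        return name
    forward = config["input"]["samples"]["forward"]
    return Path(forward).stem  # e.g. '/data/forward.fq' -> 'forward'

def check_deployment(config: dict) -> str:
    """Validate the deployment type declared in the config."""
    dep_type = config["run"]["deployment"]["type"]
    if dep_type not in ("local", "cloud"):
        raise ValueError(f"unsupported deployment type: {dep_type!r}")
    return dep_type

# parsed equivalent of a local-deployment config, with sample_name left blank
config = {
    "input": {
        "samples": {"forward": "/home/user/data/forward.fq",
                    "reverse": "/home/user/data/reverse.fq"},
        "reference": "/home/user/data/reference.fa",
        "sample_name": "",
    },
    "run": {"deployment": {"type": "local", "cores": 0}},
}

print(resolve_sample_name(config))  # blank name falls back to 'forward'
print(check_deployment(config))
```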
Finally, to run the pipeline, do:

```shell
$animal_svs> conda activate snakemake
(snakemake) $animal_svs> python run_pipeline.py
```

and the pipeline should start running.
Before you can run this pipeline, please make sure you have done the following:
- Make sure you have `python` (v3.5+) installed (installation instructions)
- Create a Google Cloud account
- Set up a project in Google Cloud
- Make sure you have `gcloud` available as a command on your machine and have the credentials to run `gcloud` commands
- Create a Google Cloud bucket with your data in it
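The credential and bucket prerequisites above can be verified with commands along these lines (the project and bucket names are placeholders for your own):

```shell
# authenticate and select the project the pipeline should deploy into
gcloud auth login
gcloud config set project <your-project-name>

# confirm the bucket exists and contains your input files
gsutil ls gs://<your-bucket-name>/
```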
Once you have done the steps above, you will want to edit the config file found in `src/config.yaml`.
An example config file for cloud deployment looks like:
```yaml
############### Parameters associated with the data ###############
input:
  # as of right now, we only support a single experiment
  # with a forward and reverse component
  samples:
    forward: 'forward.fq'
    reverse: 'reverse.fq'
  # reference genome database
  reference: 'reference.fa'
  # used to name output files. Not needed. If left blank,
  # defaults to the filename of the forward sequence
  sample_name: 'cloud-deployment'

############### Parameters associated with the run ###############
run:
  # temporary file store. Files will be deleted
  temp_dir: '../temp'
  # output any issues/progress to this location
  logs_dir: '../logs'
  # directory to save final output files
  output_dir: 'output'
  # for deployment either locally or to the cloud
  deployment:
    # supported types: 'cloud', 'local'
    type: 'cloud'
    # number of cores to use. If left at 0, max cores are used
    cores: 0
    # service is ONLY used if deployment is set to 'cloud'
    # currently supported services are: 'gcp'
    service: 'gcp'
    # if deployment is cloud, we assume that files are stored in the cloud,
    # therefore we need the bucket name. As of right now, we only support
    # storage on the same service as the deployment, so if the service is
    # 'gcp', data must be stored in a GCP bucket.
    # Do NOT add the gs:// or s3:// prefix
    bucket_name: '<your-bucket-name>'
    # the project name. Right now, only supported for GCP projects
    project_name: '<your-project-name>'
    # instance machine type
    # documentation for Google Cloud Compute Engine machine types can be seen here:
    # https://cloud.google.com/compute/docs/machine-types
    # NOTE: for extra large files, it's advised to get at least 16 GB of RAM,
    # if not more, and a disk at least as large as all input files combined
    gcp_instance:
      machine_type: 'e2-standard-4'
      disk_space: '40' # in GB
      ram_size: '16' # in GB
      # documentation for regions and zones found here:
      # https://cloud.google.com/compute/docs/regions-zones
      region: 'us-central1'
      zone: 'us-central1-a'
```
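For orientation, the `gcp_instance` parameters map roughly onto a `gcloud` invocation like the one below. This is a hand-written illustration (the instance name is made up, and `run_pipeline.py` performs the actual deployment for you), not a command you need to run:

```shell
gcloud compute instances create svs-pipeline-vm \
    --project <your-project-name> \
    --machine-type e2-standard-4 \
    --boot-disk-size 40GB \
    --zone us-central1-a
```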
Finally, to deploy the pipeline, do

```shell
$animal_svs> python run_pipeline.py
```
As of now, the pipeline relies on the environment/operating system it is run on. Because of this, some tools are known not to work on certain operating systems for local runs:
- `smoove`
  - The smoove dependency pulled by Snakemake does not have a build for either macOS or Windows, so neither of these OSes can run the pipeline locally.
bwa:
Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
Original paper:
Bertolotti, Alicia C., et al. "The structural variation landscape in 492 Atlantic salmon genomes." bioRxiv (2020).