Pipeline for https://github.com/lculibrk/Ploidetect
git clone https://github.com/lculibrk/Ploidetect-pipeline
cd ./Ploidetect-pipeline/
Under config/
are three yaml
-formatted files which specify some default settings, run parameters, and sample information.
This section will break down how to set it all up.
The simplest way to specify your samples is by providing a tab-separated file with the sample information. You can find an example of this file under config/sample_tsv.tsv. You can then use this file to generate a config compatible with the Ploidetect workflow:
python3 scripts/sampletsvtoyaml.py --tsv_file=/path/to/sample_tsv.tsv -o config/ -c False
The -c
option allows you to specify whether you should create a new file from the tsv, or to add the sample information to an existing samples file. Once this is done, proceed to the next section
If you'd rather do it manually, below are instructions for manually setting up your samples.yaml file.
In the config/samples.yaml
you will find the following lines:
bams:
COLO829-TestA:
somatic:
A36971: /projects/POG/POG_data/COLO829-TestA/wgs/biop1_t_A36971/merge/hg19a_bwa-mem-0.7.6a/A36971_2_lanes_dupsFlagged.bam
normal:
A36973: /projects/POG/POG_data/COLO829-TestA/wgs/blood1_n_A36973/merge/hg19a_bwa-mem-0.7.6a/A36973_1_lane_dupsFlagged.bam
Let's break down this entry further, line-by-line.
-
bams:
designates this as the block of samples and bams. Do not change -
COLO829-TestA:
is the sample name. For each case (tumor/normal pair), you have some kind of sample name. For example, patient ID could go here. -
somatic
andnormal
designate the somatic and matched normal libraries. Do not change -
A36971
andA36973
are the library names. Change these to a unique identifier for each sequenced library. They must be unique (don't re-use library names across different samples) -
/projects/yadda/yadda.bam
is the path to the bam file for the somatic and normal libraries. Undersomatic:
you will enterlibraryname: /path/to/the/somatic/bam/file.bam
, and do the same for the normal bam undernormal:
.
If you have multiple samples to run, add them as new lines, following the existing formatting. So after adding another sample, the config might look like this:
bams:
COLO829-TestA:
somatic:
A36971: /projects/POG/POG_data/COLO829-TestA/wgs/biop1_t_A36971/merge/hg19a_bwa-mem-0.7.6a/A36971_2_lanes_dupsFlagged.bam
normal:
A36973: /projects/POG/POG_data/COLO829-TestA/wgs/blood1_n_A36973/merge/hg19a_bwa-mem-0.7.6a/A36973_1_lane_dupsFlagged.bam
sample_2:
somatic:
A12345: /path/to/somatic.bam
normal:
A54321: /path/to/normal.bam
We're moving on to the config/parameters.yaml
section
Next we specify the genome name in the config.
# genome_name should match bams
genome_name: "hg19"
Change the "hg19"
in quotations to whatever you're using.
We also allow you to specify custom cytobands if you are not using hg19 or hg38, or if you want to use a different annotation than those on UCSC.
cyto_path: auto
Leave this on "auto" if using hg19 or hg38. If you'd like to use custom cytobands, this must point to a tab-separated file with five columns: chromosome, start, end, band name, and band annotation. Column names are optional.
This only applies if you're not running the pipeline with --use-singularity
(which is the recommended way, to be explained below).
Specify the version name of Ploidetect to use from the github repo in the ploidetect_ver
entry. Here we chose v1.4.2 (latest version). If you have a local clone on your filesystem, you can specify that using ploidetect_local_clone
. If you don't have a clone, leave it blank (i.e. the line should simply read ploidetect_local_clone:
). You can also specify a commit id, such as the one in the paper.
# ploidetect_ver should be a branch or tag. Overriden by ploidetect_local_clone.
ploidetect_ver: v1.4.2
# Leave ploidetect_local_clone blank or 'None' to download from github
ploidetect_local_clone:
We've recently added support for oxford nanopore data. Specify here whether your data are from short reads (Illumina, BGI) using short
or from long reads (Oxford nanopore) using ont
. While we don't foresee issues with using Pacific Biosciences data, it has not been tested.
sequence_type: "short"
OR
sequence_type: "ont"
Currently, we support hg19 and hg38 without requiring any user-specified files. If you are using hg19 or hg38, you can skip this subsection and move to section 2 without worry.
Genomes hosted on UCSC aside from hg19 and hg38 are compatible, requiring only a few files from the user. If you are using hg19 or hg38, but would like to use your own versions of these files, you may also specify them here. By default the workflow will use packaged files (SNP positions) or UCSC data (fasta and cytoband annotations).
If you are using a non-hg19/38 genome, you must include these lines in the config:
array_positions:
genome_name:
/path/to/snp/positions.txt
This file must be formatted as the files under resources/snp_arrays/*/SNP_array_positions.txt
, and include the positions of known SNPs in your genome. We used about 200,000 SNP positions to develop and benchmark Ploidetect. Your mileage may vary if you use a different number relative to the length of your genome.
Next, specify a path to your genome fasta file. The directory of the fasta must also contain the fasta index (.fai). You can generate this using samtools faidx
, for example.
genome:
genome_name:
/path/to/genome/fasta.fa
Finally, you must specify the chromosomes to be analyzed. For hg19 and hg38, this defaults to chr1-22 + X. Below is an example of the default hg19 chromosome configuration. The chromosome names must match the .bam that you use.
chromosomes:
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- X
Once the config has been dealt with, we then handle dependencies.
Ploidetect is most easily run using this Snakemake workflow, which take tumor/normal bam files and runs Ploidetect as well as its data preparation steps.
You will need to install snakemake
to run this workflow. Installation instructions for Snakemake can be found here. You can also run the following command to create a conda environment in the current working directory named venv
with Snakemake installed
make snakemake_env
source venv/bin/activate
Some form of package management is recommended to satisfy the dependencies of the workflow. Currently, Conda and singularity are supported. You only need one to run this workflow. If you followed the above instructions you will already have a conda environment active. We recommend Singularity, as that is the simplest way to satisfy the dependencies for Ploidetect and requires minimal input from you.
mamba can be used inplace of conda, and runs much, much faster.
Simply replace 'conda' with 'mamba' in every command after installing. If you installed conda and Snakemake using the Makefile (section 2.1) Mamba will already be installed and you can skip this step.
conda install mamba -c conda-forge
NOTE: You only need to do ONE OF 2.2.2 or 2.2.3
If you did not do step 1.1, run the below command:
conda install snakemake --yes
Then install the dependencies before running the workflow using the following:
snakemake --use-conda --conda-create-envs-only -j 1
If you installed mamba, you can use the below command instead:
snakemake --use-conda --conda-frontend mamba --conda-create-envs-only -j 1
Once complete, proceed to section 3
You don't need to do anything beforehand. The container will be automatically pulled for the run. Simply ensure that Singularity is installed on your system.
We recommend running this workflow with THREADS set to at least 24, to allow parallel processing of the chromosomes as much as possible. We highly recommend dry-running Snakemake once after you're done to ensure you have everything set up correctly by appending -n
to the end of your command. If you have a screen of green/yellow text, you're good. If there's red, then mistakes have been made.
Depending on how you completed section 2.2, you run the workflow using one of three options:
snakemake -p --restart-times 2 -j THREADS --use-conda
You may encounter an error to the tune of an R lazy-load database being corrupt. --restart-times 2
should deal with this, but if it doesn't, it should be solvable by just rerunning the command. if that doesn't, open an issue.
snakemake -p --restart-times 1 -j THREADS --use-singularity
If you encounter errors with singularity mode that involve files being not found, first check for typos in the file paths in question. If they're fine, then you probably need to bind the relevant directory. For example if you're running this under a different "root" directory form the data. ie. the data are stored under /data
and you're running this in /analysis
or something along those lines. In this situation you need to bind the directories:
snakemake -p --restart-times 1 -j THREADS --use-singularity --singularity-args "-B /path/to/where/the/bams/are/stored"
If you're running on a cluster with slurm
installed, you can have Snakemake submit the jobs like so:
mkdir logs_slurm/
snakemake \
-p \
--restart-times 2 \
-j THREADS \
--use-conda \
--cluster 'sbatch -t 2-00:00:00 --mem={resources.mem_mb} -c {resources.cpus} -o logs_slurm/{rule}_{wildcards} -e logs_slurm/{rule}_{wildcards} -p QUEUE_NAME'
mkdir logs_slurm/
snakemake \
-p \
--restart-times 2 \
-j THREADS \
--use-singularity \
--cluster 'sbatch -t 2-00:00:00 --mem={resources.mem_mb} -c {resources.cpus} -o logs_slurm/{rule}_{wildcards} -e logs_slurm/{rule}_{wildcards} -p QUEUE_NAME'
Use GSC bioapps details to automatically find configuration details and run the pipeline for a given ID and biop
Eg. run Ploidetect-pipeline for POG965 biop2.
snakemake -s Snakefile.gsc.smk -pr --config id=POG965 biopsy=biop2