snpArcher is a reproducible workflow optimized for nonmodel organisms and comparisons across datasets, built on the Snakemake workflow management system. It provides a streamlined approach to dataset acquisition, variant calling, quality control, and downstream analysis.
This fork of the original snpArcher adds new features, including:
- FASTQ and BAM files are marked as temporary so that the workflow scales to large datasets on smaller clusters (input FASTQ files and intermediate BAM files are removed on a per-sample basis)
- the quantize_cov module, implementing the mosdepth quantize method as in Laetsch et al. (2023)
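To illustrate what mosdepth's quantize mode does, here is a minimal sketch that merges per-base coverage into three labelled bins. The helper name quantize_bam, the bin labels, and the depth thresholds are assumptions made for illustration; they are not taken from the quantize_cov module, whose actual settings should be checked in the workflow itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of snpArcher): run mosdepth in quantize mode.
# Thresholds are passed in by the caller; no values here are the fork's defaults.
quantize_bam() {
    local bam="$1"        # input BAM/CRAM file
    local prefix="$2"     # output prefix; mosdepth writes <prefix>.quantized.bed.gz
    local min_depth="$3"  # lower bound of the "callable" bin
    local max_depth="$4"  # upper bound of the "callable" bin

    # "0:min:max:" defines three bins: [0,min), [min,max) and [max,inf).
    # MOSDEPTH_Q0..Q2 set the label written for each bin.
    MOSDEPTH_Q0=LOW_COVERAGE MOSDEPTH_Q1=CALLABLE MOSDEPTH_Q2=HIGH_COVERAGE \
        mosdepth --no-per-base --quantize "0:${min_depth}:${max_depth}:" "$prefix" "$bam"
}
```

For example, quantize_bam sample.bam sample 8 100 would label regions with depth 8 to 99 as CALLABLE, assuming those placeholder thresholds suit your data.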
Use the same procedure as for the original snpArcher repository. See the docs.
Briefly:
conda activate snakemake
snakemake -d .test/ecoli --cores 1 --use-conda
First, configure your repository for your analyses. Avoid editing the config files and committing those changes directly to the main branch; otherwise other users will be affected by your commits. To let each user keep their own settings in parallel, using branches is strongly recommended.
Each user should create their own branch and commit changes to config files only in that branch. This branch should NEVER be merged into the main branch. Each user can then work with their own settings in isolation. Another advantage is that users are not affected by upstream commits (i.e. updates to the pipeline) unless they explicitly merge the upstream main branch into their own branch.
git branch myownconfigname
git checkout myownconfigname
See also: Branching in a nutshell and the GitHub branching documentation.
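When you do want pipeline updates, the merge goes from main into your own branch, never the other way round. The sketch below shows one way to do this; the helper name merge_upstream_into is hypothetical, and it assumes the remote is called origin and its default branch is main.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of snpArcher): pull pipeline updates
# into a personal config branch.
# Assumes the remote is named "origin" and its default branch is "main".
merge_upstream_into() {
    local branch="$1"   # your own config branch, e.g. myownconfigname

    git fetch origin || return 1
    git checkout "$branch" || return 1

    # Merge main INTO the personal branch; the personal branch itself
    # is never merged back into main.
    git merge --no-edit origin/main
}
```

Calling merge_upstream_into myownconfigname from inside the repository keeps your config commits and adds the upstream changes on top. If upstream changed the same config lines you edited, git will stop and ask you to resolve the conflict by hand.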
You also need to create a working directory, either inside or outside the snpArcher directory. If it is outside, the branching recommendations above do not apply.
Within your working directory, copy the config/ directory and modify the settings (usually config.yaml and samples.csv) according to your analyses. You can add multiple sub-directories if you wish to run multiple datasets in parallel (e.g. <your working directory>/<your dataset>/). On the first run, Snakemake creates a .snakemake directory and installs the conda environments into it, which can take some time.
In addition, you can modify profiles/slurm/config.yaml in place (since you are on your own branch) or copy it to your working directory.
A recommended architecture is the following:
.
└── snparcher-dev/
    ├── config/
    │   └── config.yaml <default config file>
    ├── profiles/
    │   └── slurm/
    │       └── config.yaml <default slurm profile>
    ├── workflow/
    └── data/
        ├── config/
        │   └── config.yaml <your own config file>
        └── profiles/
            └── slurm/
                └── config.yaml <your own slurm profile>
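The layout above can be created with a few copy commands. The following is a minimal sketch; the helper name init_workdir and its two arguments are hypothetical, and it only assumes that the repository contains config/ and profiles/slurm/ as shown in the tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of snpArcher): create a working directory
# holding your own copies of the default config and slurm profile.
init_workdir() {
    local repo="$1"      # path to the snpArcher checkout
    local workdir="$2"   # working directory to create, e.g. <repo>/data

    # Stop early if the expected default directories are missing.
    [ -d "$repo/config" ] || { echo "missing $repo/config" >&2; return 1; }
    [ -d "$repo/profiles/slurm" ] || { echo "missing $repo/profiles/slurm" >&2; return 1; }

    mkdir -p "$workdir/profiles"

    # Copy the defaults only when no copy exists yet, so that settings
    # you have already edited are never overwritten.
    [ -e "$workdir/config" ] || cp -r "$repo/config" "$workdir/config"
    [ -e "$workdir/profiles/slurm" ] || cp -r "$repo/profiles/slurm" "$workdir/profiles/slurm"
}
```

Running init_workdir . data from the snpArcher directory would produce the data/ subtree shown above; you then edit data/config/config.yaml and samples.csv for your own analysis.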
Run the analyses from the snpArcher directory with the following command (--profile takes the directory that contains config.yaml, not the file itself):
snakemake --snakefile workflow/Snakefile --use-conda --cores <number of cores> --printshellcmds --profile <your working directory>/profiles/slurm -d <your working directory>/<your dataset>/
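Before submitting jobs, it can be useful to check the plan with Snakemake's --dry-run option. The wrapper below is a sketch that assembles the command for a given working directory and dataset and forwards any extra options; the name run_snparcher is hypothetical, and the profile is passed as the directory containing config.yaml, which is the form Snakemake's --profile option expects.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of snpArcher): build the snakemake command
# for one dataset. Must be called from the snpArcher directory.
run_snparcher() {
    local workdir="$1"   # your working directory
    local dataset="$2"   # dataset sub-directory inside the working directory
    local cores="$3"     # number of cores
    shift 3              # remaining arguments are passed through to snakemake

    snakemake --snakefile workflow/Snakefile --use-conda --cores "$cores" \
        --printshellcmds --profile "$workdir/profiles/slurm" \
        -d "$workdir/$dataset/" "$@"
}
```

For instance, run_snparcher data mydataset 8 --dry-run lists the jobs that would run without executing them; dropping --dry-run starts the real run.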
BAM files are now marked as temporary and are removed as soon as the last rule that uses them has finished. This should improve pipeline scalability by freeing storage space earlier in the run.
Laetsch, D. R., Bisschop, G., Martin, S. H., Aeschbacher, S., Setter, D., & Lohse, K. (2023). Demographically explicit scans for barriers to gene flow using gIMble. PLoS Genetics, 19(10), e1010999.
For usage instructions and complete documentation, please visit our docs.
A number of resequencing datasets have been run with snpArcher generating consistent variant calls, available via Globus in the Comparative Population Genomics Data collection. Details of data processing are described in our manuscript. If you use any of these datasets in your projects, please cite both the snpArcher paper and the original data producers.
- Cade D Mirchandani, Allison J Shultz, Gregg W C Thomas, Sara J Smith, Mara Baylis, Brian Arnold, Russ Corbett-Detig, Erik Enbody, Timothy B Sackton, A fast, reproducible, high-throughput variant calling workflow for population genomics, Molecular Biology and Evolution, 2023, msad270, https://doi.org/10.1093/molbev/msad270
- Also, make sure to cite the tools you used within snpArcher.