A SNP calling, haplotyping, and subsampling pipeline for amplicon sequence data, built with Snakemake. Written for the Tiger Salamander project in the Weisrock Lab at the University of Kentucky.
You can download this repository with:
$ git clone https://github.com/kelly-sovacool/tiger_salamander_project.git
I recommend using the Conda package manager. If you don't already have Conda installed, the fastest way to get up and running is to use the Miniconda Python 3 distribution, which includes Conda.
After installing Conda, change into the directory containing the repository. Then, create a Conda environment with:
$ conda env create --name tiger_salamander_project --file config/environment.yml
Conda will create an environment with all the dependencies specified in the environment file. To activate the environment, run:
$ source activate tiger_salamander_project
(On newer versions of Conda, use `conda activate tiger_salamander_project` instead.)
Alternatively, if you would prefer to use a different package management tool (e.g. `pip3`) and/or install packages system-wide, you can manually install the dependencies listed in `config/environment.yml`.
There are two main pipelines here: `haplotype_pipeline` and `snp_pipeline`. The general workflow is to call variants and haplotypes with the `haplotype_pipeline`, manually curate the alignments (optional), then call SNPs and create subsamples with the `snp_pipeline` before feeding the results into a program such as Structure.
The project directory structure is as follows:
.
├── haplotype_pipeline
│   ├── adapters
│   ├── results
│   └── rules
├── legacy_scripts
├── reference
└── snp_pipeline
    ├── haplotypes_curated
    ├── results
    └── scripts
If you use the default `config.yaml` file, place the Illumina fastq files in `haplotype_pipeline/data/illumina_fastq`, with a separate subdirectory for each sequencing run. Place the 454 haplotype fasta files in `haplotype_pipeline/data/454_haplotypes`.
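For example, the expected input layout can be sketched as follows. The run, sample, and locus names here (`run1`, `sample01`, `locus01`) are hypothetical placeholders, not names the pipeline requires:

```shell
# Sketch of the expected input layout; all file and run names are hypothetical.
mkdir -p haplotype_pipeline/data/illumina_fastq/run1
mkdir -p haplotype_pipeline/data/illumina_fastq/run2
mkdir -p haplotype_pipeline/data/454_haplotypes
# Each sequencing run gets its own subdirectory of fastq files:
touch haplotype_pipeline/data/illumina_fastq/run1/sample01_R1.fastq.gz \
      haplotype_pipeline/data/illumina_fastq/run1/sample01_R2.fastq.gz
# 454 haplotype fasta files go directly in 454_haplotypes:
touch haplotype_pipeline/data/454_haplotypes/locus01.fasta
```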
Finally, change into the `haplotype_pipeline` directory and run the pipeline on all available cores with:
$ snakemake -j
The haplotype pipeline will output haplotypes as single-locus fasta files in `haplotype_pipeline/haplotypes`.
If desired, these can be manually curated with a tool such as Geneious.
Place curated haplotypes in `snp_pipeline/haplotypes_curated` (alternatively, copy the files from `haplotype_pipeline/haplotypes`). Be sure to edit `config.yaml` so that it contains your email address and the absolute path to Structure on your DLX account. Run the pipeline with the same `snakemake` command as above from the `snp_pipeline` directory.
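As a rough illustration, the relevant `config.yaml` entries might look like the fragment below. The key names `email` and `structure_path` are assumptions for illustration only; check the shipped `config.yaml` for the actual keys:

```yaml
# Hypothetical sketch; key names may differ from the real config.yaml.
email: you@uky.edu
structure_path: /home/username/structure/structure  # absolute path on the DLX
```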
The `snp_pipeline` outputs filtered SNP sites as single-locus fasta files in `snp_pipeline/snp_sites_filtered`, and subsamples of the SNP data in Structure format, along with scripts for running Structure on the DLX, in `snp_pipeline/snp_subsamples`. The subsamples and scripts can be copied to your DLX account for running Structure with:
$ scp -r snp_pipeline/snp_subsamples username@server:~/
- This pipeline was written for a SNP-only analysis. The `filter_variants` rule in `rules/haplotype_illumina_data.smk` will filter out indels. If you wish to keep indels, you can remove the `--remove-indels` flag in that rule.
- Various scripts and small programs from before the time of Snakemake are now in `legacy_scripts/` for posterity.