/primergen

Species specific primers

Overview

Given a set of genome FASTA files, generate primer candidates (for qPCR) that are specific to each genome.

Approach

  1. Focus on coding sequences - first annotate each genome to extract protein coding sequences.
  2. For every genome G, concatenate the FASTA files of all the other genomes to make an “excluded” genome G’. Primers specific to G should not bind to G’.
  3. Index each genome and its cognate excluded genome using bowtie2.
  4. Propose primer candidates with parameters suitable for qPCR using primer3. Format the forward and reverse primers as “reads”.
  5. Align these primer “read” files to both the genome and the excluded genome.
  6. Filter out primer pairs with non-specific binding. I am using bowtie2 with a default alignment range of [0,5000]. This is meant to filter out short concordant alignments.
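
As a rough illustration of steps 2 and 4, here is a minimal sketch, not the pipeline's actual code. The function names, the FASTA suffix list, and the `(pair_id, forward, reverse)` tuple layout are all assumptions made for this example. The primers are written as two FASTA files with matching record IDs, so that an aligner can treat each primer pair as a pair of reads.

```python
from pathlib import Path

# Assumed set of FASTA file extensions; adjust to match your genome files.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def build_excluded_genome(genome_dir, target, out_path):
    """Concatenate every FASTA file in genome_dir except `target` into out_path.

    Returns the list of files that were concatenated (hypothetical helper).
    """
    genome_dir, target, out_path = Path(genome_dir), Path(target), Path(out_path)
    skip = {target.resolve(), out_path.resolve()}
    others = sorted(
        p for p in genome_dir.iterdir()
        if p.is_file() and p.suffix in FASTA_SUFFIXES and p.resolve() not in skip
    )
    with out_path.open("w") as out:
        for fasta in others:
            text = fasta.read_text()
            if not text:
                continue
            # Make sure the next header starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")
    return others


def write_primer_reads(pairs, fwd_path, rev_path):
    """Write primer pairs as two FASTA "read" files with matching record IDs.

    `pairs` is an iterable of (pair_id, forward_seq, reverse_seq) tuples
    (an assumed layout, not primer3's output format).
    """
    with open(fwd_path, "w") as fwd, open(rev_path, "w") as rev:
        for pair_id, forward, reverse in pairs:
            fwd.write(f">{pair_id}\n{forward}\n")
            rev.write(f">{pair_id}\n{reverse}\n")
```

Calling `build_excluded_genome("RUNNAME/genome", "RUNNAME/genome/G.fasta", "G.excluded.fasta")` would produce the excluded genome for `G.fasta`; the file names here are placeholders.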

Installation and setup notes

  • Set up conda with python 3.11 and snakemake 8.11.3. My initial choice of python 3.11 affects a lot of downstream decisions, because most conda packages still require python<3.11.
  • Installed the snakemake executor plugin for slurm so that slurm execution works.
  • I am using two custom conda environments.
    • bakta has weird dependency issues with python 3.11
      • bakta also has weird issues currently with its reliance on amrfinderplus. I have had to comment out the parts of the bakta code which handle this.
    • bowtie2 was easier to install in a separate environment for some reason.
  • I am using bakta because I have not been able to install prokka with my current setup.
  • Finally, this pipeline is set up to use slurm for parallelizing tasks.
  • Applied this patch to get a local env working: https://github.com/snakemake/snakemake/compare/main…ShogoAkiyama:snakemake:conda-url-env-bug
  • bakta performance is affected by a lot of network IO (oschwengers/bakta#282). Typical runs I’ve observed are ~30 minutes.

Usage

  1. Create a directory, say RUNNAME.
  2. Modify config.yaml to point run_path to RUNNAME.
  3. In RUNNAME/genome/ copy the individual genome FASTA files that are the targets for primer generation.
  4. If using slurm, modify the supplied sbatch file to reflect the partition parameters, and run it using sbatch run_primergen.sbatch
  5. If running from the command line, use something like:
    snakemake --snakefile primergen.snakefile -p\
     --rerun-incomplete\
     --cores 4\
     --use-conda\
     --configfile config.yaml\
     -j 100\
     --keep-incomplete --latency-wait 60
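
Steps 1 and 3 above can also be scripted. The following is a minimal sketch under stated assumptions: `prepare_run_dir` is a hypothetical helper, not part of this repository, and it only creates `RUNNAME/genome/` and copies the FASTA files into it. Editing `run_path` in config.yaml is still done by hand.

```python
import shutil
from pathlib import Path


def prepare_run_dir(run_path, fasta_files):
    """Create <run_path>/genome/ and copy the target genome FASTA files into it.

    Hypothetical helper for illustration; returns the copied file paths.
    """
    genome_dir = Path(run_path) / "genome"
    genome_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, fasta_files):
        dest = genome_dir / src.name
        shutil.copy2(src, dest)  # copies file contents and metadata
        copied.append(dest)
    return copied
```

For example, `prepare_run_dir("RUNNAME", ["genomes/A.fasta", "genomes/B.fasta"])` would leave both files under `RUNNAME/genome/`; the input paths are placeholders.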