A tool for assessing the inclusivity of primer and probe oligonucleotides from PCR assays against numerous reference genome sequences

Primary language: Python. License: MIT.

PCR_strainer

PCR_strainer is a tool for assessing the inclusivity of primer and probe oligonucleotides from diagnostic qPCR assays and amplicon sequencing schemes. It depends on thermonucleotideBLAST (TNTBLAST), which conducts local alignments between query oligonucleotides and subject sequences that include a thermodynamic assessment of the alignment. PCR_strainer parses and tabulates the TNTBLAST output to generate reports on assay performance and sequence variants in oligo sites.

You can read more about PCR_strainer and see how it has been applied in the following publications:

  1. Kuchinski KS, Jassem AN, Prystajecky NA. Assessing oligonucleotide designs from early lab developed PCR diagnostic tests for SARS-CoV-2 using the PCR_strainer pipeline. J Clin Virol. 2020 Oct;131:104581. doi: 10.1016/j.jcv.2020.104581. Epub 2020 Aug 21. PMID: 32889496; PMCID: PMC7441044.
  2. Kuchinski KS, Nguyen J, Lee TD, Hickman R, Jassem AN, Hoang LMN, Prystajecky NA, Tyson JR. Mutations in emerging variant of concern lineages disrupt genomic sequencing of SARS-CoV-2 clinical specimens. Int J Infect Dis. 2022 Jan;114:51-54. doi: 10.1016/j.ijid.2021.10.050. Epub 2021 Oct 29. PMID: 34757201; PMCID: PMC8555373.

PCR_strainer Setup

  1. Install TNTBLAST from: https://github.com/jgans/thermonucleotideBLAST
  2. Install Python (version >= 3.7)
  3. Install PCR_strainer:
$ python3 -m pip install pcr_strainer

PCR_strainer Usage

Usage example:

$ pcr_strainer -a <assay CSV file> -g <genomes FASTA file> -o <output dir>/<output name> [<optional args>]

Required arguments:

-a : path to assay details in CSV file
-g : path to target genomes in FASTA file
-o : path to output directory and name to append to output files

Optional arguments:

-t : minimum prevalence (%) that total error counts and oligo site variants must reach to be included in the reports (default = 0, which reports everything; min > 0, max < 100)
-m : minimum Tm (degrees C) for primers and probes (default = 45)
-p : molar concentration of primer oligos (uM) (default = 1, min > 0)
-P : molar concentration of probe oligos (uM) (default = 1, min > 0)
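
For scripted or batch runs, the command line above can be assembled in Python. The following is a minimal sketch that uses only the flags documented in this section; the function name `build_command` and the file names in the example are placeholders, not part of PCR_strainer.

```python
def build_command(assay_csv, genomes_fasta, output,
                  threshold=None, min_tm=None, primer_conc=None, probe_conc=None):
    """Assemble a pcr_strainer argument list from the documented flags."""
    cmd = ["pcr_strainer", "-a", assay_csv, "-g", genomes_fasta, "-o", output]
    # Optional flags are only added when a value is supplied, so the
    # tool's own defaults apply otherwise.
    optional = [("-t", threshold), ("-m", min_tm),
                ("-p", primer_conc), ("-P", probe_conc)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Placeholder file names for illustration only.
    cmd = build_command("assays.csv", "genomes.fasta", "results/run1", min_tm=50)
    print(" ".join(cmd))
    # To execute (requires pcr_strainer and TNTBLAST to be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when paths contain unusual characters.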

The assay CSV file:

PCR_strainer expects a CSV file where each line describes a PCR assay using the following format:

assay_name, forward_primer_name, forward_primer_seq, reverse_primer_name, reverse_primer_seq, probe_name, probe_seq

Example assay file entry:

BCCDC_SARS2_RdRP,BCCDC_RdRP_Fwd,TGCCGATAAGTATGTCCGCA,BCCDC_RdRP_Rev,CAGCATCGTCAGAGAGTATCATCATT,BCCDC_RdRP_Probe,TTGACACAGACTTTGTGAATG
  • all oligo sequences should be written in the 5' to 3' orientation
  • degenerate nucleotides are permitted in the assay oligo sequences
  • the probe name and probe sequence can be omitted for a conventional PCR
  • for amplicon sequencing schemes, enter each primer pair as its own line in the same file
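
An assay file in this format can also be generated programmatically. The sketch below builds rows in the column order described above and rejects characters outside the IUPAC nucleotide alphabet. The helper `assay_row` is hypothetical, and leaving off the two trailing probe fields for a conventional PCR is an assumption about what "omitted" means here; check it against your own PCR_strainer run.

```python
import csv
import sys

# IUPAC DNA codes, including degenerate nucleotides.
IUPAC = set("ACGTRYSWKMBDHVN")


def assay_row(assay_name, fwd_name, fwd_seq, rev_name, rev_seq,
              probe_name="", probe_seq=""):
    """Return one assay line as a list of fields (oligos written 5' to 3')."""
    for seq in (fwd_seq, rev_seq, probe_seq):
        bad = set(seq.upper()) - IUPAC
        if bad:
            raise ValueError(f"non-IUPAC characters in {seq!r}: {sorted(bad)}")
    row = [assay_name, fwd_name, fwd_seq, rev_name, rev_seq]
    # Assumption: a conventional PCR simply leaves off the probe fields.
    if probe_name and probe_seq:
        row += [probe_name, probe_seq]
    return row


if __name__ == "__main__":
    writer = csv.writer(sys.stdout, lineterminator="\n")
    # The example entry from this README.
    writer.writerow(assay_row(
        "BCCDC_SARS2_RdRP",
        "BCCDC_RdRP_Fwd", "TGCCGATAAGTATGTCCGCA",
        "BCCDC_RdRP_Rev", "CAGCATCGTCAGAGAGTATCATCATT",
        "BCCDC_RdRP_Probe", "TTGACACAGACTTTGTGAATG",
    ))
```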

The reference genomes:

PCR_strainer expects DNA sequences in FASTA format without spaces in the header. For single-stranded genomes, ensure all sequences represent the same sense (e.g. all coding strand). We recommend you filter your reference genomes to remove sequences containing degenerate nucleotides in target locations to limit false negatives; thermonucleotideBLAST does not expand degenerate nucleotide possibilities for the subject sequences.
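
One way to prepare a genomes file along these lines is sketched below. It replaces spaces in headers with underscores and drops sequences with more than a chosen number of non-ACGT characters. Note that this is deliberately stricter than the recommendation above: it counts ambiguous bases anywhere in the sequence, not only in target locations, because the oligo sites are not known before alignment. The function names and the `max_ambiguous` parameter are illustrative assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_records(records, max_ambiguous=0):
    """Drop records with too many non-ACGT bases; remove spaces from headers."""
    for header, seq in records:
        ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
        if ambiguous > max_ambiguous:
            continue
        yield header.replace(" ", "_"), seq


if __name__ == "__main__":
    # Tiny made-up input for illustration.
    example = [">seq 1", "ACGT", "ACGT", ">seq2", "ACNT"]
    for header, seq in clean_records(read_fasta(example)):
        print(f">{header}\n{seq}")
```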

The name of the output:

PCR_strainer generates four TSV files. The output name will be appended to these file names (no spaces). Including a file path before the output name will write output files to that directory.

PCR_strainer Reports

PCR_strainer generates three report files and a table of raw results from TNTBLAST.

assay_report

The assay_report indicates how many reference sequences are impacted by nucleotide mismatches and gaps across all oligos for each assay. Filter this table for rows with 0 in the total_errors column for a quick overview of assay inclusivity; this shows what percentage of the provided reference sequences had no gaps or mismatches against the provided assays.

COLUMN : DESCRIPTION

assay_name : The name of the assay from the assay CSV file

total_targets : The total number of reference sequences in the genomes file

detected_targets : The number of reference sequences in the genomes file in which thermonucleotideBLAST was able to identify all oligo sites and generate an amplicon

perc_detected : detected_targets as a percentage of total_targets

total_errors : The number of nucleotide errors across all of the assay's oligonucleotides; this includes gaps and the total number of unannealed nucleotides (including those impacted by nearby mismatches despite having complementary base pairing)

target_count : The number of reference sequences with the indicated number of errors for this assay

perc_of_detected : target_count as a percentage of detected_targets

perc_of_total : target_count as a percentage of total_targets
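
The zero-error filter described above can be done with the standard library. This sketch assumes the assay_report TSV has a header row using the column names listed in this section; the sample rows are invented toy values for illustration, not real results.

```python
import csv
import io


def perfect_match_rows(handle):
    """Return assay_report rows where total_errors is 0."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["total_errors"]) == 0]


if __name__ == "__main__":
    # Invented toy data; in practice, open the assay_report file instead.
    sample = (
        "assay_name\ttotal_targets\tdetected_targets\tperc_detected\t"
        "total_errors\ttarget_count\tperc_of_detected\tperc_of_total\n"
        "toy_assay\t100\t98\t98.0\t0\t90\t91.84\t90.0\n"
        "toy_assay\t100\t98\t98.0\t1\t8\t8.16\t8.0\n"
    )
    for row in perfect_match_rows(io.StringIO(sample)):
        print(row["assay_name"], row["perc_of_total"])
```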

variant_report

The variant_report provides information about locations in the provided reference sequences that are targeted by assay oligos, but contain gaps and mismatches. This report identifies common sequence variants in oligo sites, facilitating oligo re-design. In oligo site variant sequences, mismatched bases are written in lower case, deletions are indicated with dashes, and insertions are surrounded by parentheses.

COLUMN : DESCRIPTION

assay_name : The name of the assay from the assay file

oligo : Forward primer, reverse primer, or probe

oligo_name : Name of the oligo from the assay file

oligo_seq : The sequence of the oligo provided in the assay file

total_targets : The total number of reference sequences in the genomes file

detected_targets : The number of reference sequences in the genomes file in which thermonucleotideBLAST was able to identify all oligo sites and generate an amplicon

perc_detected : detected_targets as a percentage of total_targets

oligo_site_variant : The variant sequence at the oligo site, written in 'oligo sense', i.e. the same sense as the PCR oligo

oligo_errors : The number of nucleotide errors present at this variant site; this includes gaps and the total number of unannealed nucleotides (including those impacted by nearby mismatches despite having complementary base pairing)

target_count : The number of reference sequences with the indicated oligo site variant

perc_of_detected : target_count as a percentage of detected_targets

perc_of_total : target_count as a percentage of total_targets
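
The variant notation described above (lower case for mismatches, dashes for deletions, parentheses around insertions) can be tallied with a short parser. This is a sketch based only on that description; the example string is made up. It will not necessarily reproduce oligo_errors, which also counts unannealed nucleotides near mismatches.

```python
def count_variant_errors(variant):
    """Tally mismatches, deletions, and inserted bases in a variant string."""
    counts = {"mismatches": 0, "deletions": 0, "insertions": 0}
    in_insertion = False
    for ch in variant:
        if ch == "(":
            in_insertion = True
        elif ch == ")":
            in_insertion = False
        elif in_insertion:
            # Every base between parentheses is an inserted base.
            counts["insertions"] += 1
        elif ch == "-":
            counts["deletions"] += 1
        elif ch.islower():
            counts["mismatches"] += 1
    return counts


if __name__ == "__main__":
    # Made-up variant string: one mismatch, one deletion, one inserted base.
    print(count_variant_errors("TGCcGATA-GT(A)TG"))
```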

missed_seqs_report

The missed_seqs_report provides the names of target reference sequences in the genomes file that were not aligned by thermonucleotideBLAST. PCR_strainer provides the headers of these missed targets for troubleshooting assays with high percentages of missed targets (i.e. low perc_detected values). These targets are generally either a) poor quality, containing too many Ns in or around the oligo target sites, or b) too divergent from the oligos.

COLUMN : DESCRIPTION

assay_name : The name of the assay from the assay file

target : The FASTA header of the missed reference sequence in the genomes file

target_length : The length of the target sequence in nucleotides

total_Ns : The number of nucleotide positions in the target sequence represented by ambiguous N bases

perc_Ns : The percentage of the target sequence represented by ambiguous N bases
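
The three sequence-level columns above follow directly from the sequence itself. As a small sketch of the arithmetic (the function name is hypothetical, and counting lower-case n as N is an assumption):

```python
def n_content(seq):
    """Return (target_length, total_Ns, perc_Ns) for one sequence."""
    target_length = len(seq)
    total_ns = seq.upper().count("N")
    perc_ns = 100 * total_ns / target_length if target_length else 0.0
    return target_length, total_ns, perc_ns


if __name__ == "__main__":
    # 10 bases, 4 of them N.
    print(n_content("ACGTNNNNAC"))
```

A missed target with a high perc_Ns is more likely to fall under case a) above (poor quality) than case b) (divergence).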

PCR_results

This file contains the parsed and tabulated output from TNTBLAST. Each line describes the results from one assay against one reference sequence. One use for this data is to identify headers for genome sequences containing specific oligo site variants. For instance, imagine the variant_report identifies a forward primer site variant in 5% of genomes. The sequence for that forward primer site variant could be copied from the variant_report, then used to search the fwd_primer_site_seq column in the PCR_results file to identify headers for sequences containing this variant.
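
The lookup described above might be scripted as follows. This sketch assumes the PCR_results TSV has a header row that includes the fwd_primer_site_seq column named in this section. The `target` column and the rows in the example are invented for illustration; inspect your own file for the actual name of the header column.

```python
import csv
import io


def rows_with_variant(handle, variant, column="fwd_primer_site_seq"):
    """Return PCR_results rows whose oligo site sequence equals the variant."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Exact, case-sensitive match: lower case marks mismatched bases.
    return [row for row in reader if row[column] == variant]


if __name__ == "__main__":
    # Invented toy data; "target" is a hypothetical column name.
    sample = (
        "target\tfwd_primer_site_seq\n"
        "genome_1\tTGCCGATAAGTATGTCCGCA\n"
        "genome_2\tTGCCGATAAGTATGTCtGCA\n"
    )
    for row in rows_with_variant(io.StringIO(sample), "TGCCGATAAGTATGTCtGCA"):
        print(row["target"])
```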

---

Questions, feedback, and bug reports are welcome! kevin.kuchinski@bccdc.ca