/barCoder

barCoder will generate a list of sequences that can be directly synthesized and inserted into the target genomes via user-specified targeting vectors.

Primary LanguagePerlGNU General Public License v3.0GPL-3.0

#########################################
# This file is part of barCoder.
#
# barCoder is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3 of the License.
#
# barCoder is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with barCoder.  If not, see <https://www.gnu.org/licenses/>.
#
#########################################
# Introduction
#########################################
# This document describes how to use the barcode generation code. The code will
# generate a list of sequences that can be directly synthesized and inserted
# into the target genomes via user-specified targeting vectors. 
#
# Running the code
#
# This software was tested on CentOS 7, and requires at least 64 GB of RAM in 
# order to support BLAST searches against the "nt" database. A modern CPU is 
# recommended, however only the BLAST searches utilize multi-threaded
# processing. The default behavior is to use all cores available for BLAST. 
#
# Dependencies:
#
# * perl 5.8 or higher
# * Bioperl. Information on installing Bioperl can be found at
#   https://bioperl.org/INSTALL.html Specific Bioperl modules used in this program
#   include:
#	- StandAloneBlastPlus (bl2seq, blastn)
#	- EMBOSS
# * nt database from NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*
#
# Test data is located in the Projects directory.	
#
# The purpose of the code is to generate a set of barcodes that minimize the chances of 
#		non-unique detection.
#
# Generating unique barcodes: 
# 	A project contains a set of genomes that need to have barcodes generated
# for them, called "target genomes". These barcodes meet a minimum set of 
# requirements for PCR-based detection, including standard PCR parameters
# (Tm, length, etc.) and high specificity. The specificity includes low homology
# to the target genomes, other primers, and a second set of genomes called
# "other genomes". This second set of genomes includes things that do not need
# to be barcoded, but are expected to be present in the environment. The primers
# must not match these other genomes in order to minimize false positives. The 
# primers are also blasted against the NCBI nt database to maximize the odds of
# uniqueness.
#	Three primers are generated. The first primer is the same across a 
# single project and is one of the two primers needed for qPCR detection using 
# either SYBR Green or TaqMan technologies. The second qPCR primer is unique for
# each target genome in the project. For each target genome a second unique 
# sequence is generated for use as an optional probe sequence, needed only for
# TaqMan-based qPCR detection.
#	A filler sequence completes the barcode. This sequence is randomly 
# generated with 2 constraints: (1) the GC content matches that of the target
# genome and (2) there are no major secondary structures to interfere with 
# the PCR. For each target genome, a second set of 3 primers are generated as
# a backup in case any of the sets fail for some unforseen reason.
#	At the time of execution, the user can choose the number of barcodes to 
# generate per target genome. This provides multiple alternative barcodes in the
# case that 1 or more do not function well.
#
#
#########################################
# User Guide
#########################################
# The first step is to create a new project directory. This directory needs to 
# 	be placed under the Barcode/Projects/ directory. There are directories
#	there called BAsterne, BTK and YPco92 as examples.
#
# This folder should contain the following (some are optional):
#	(1) Subdirectory containing the target genomes. These can be in genbank or
#		fasta format. 
#	(2) Text file containing the script parameters, called "parameters.txt".
#		These include things like target Tm and blast threshold to count 
#		as a "match." This file should be copied from the BAsterne, BTK or YPco92  
#		project and tweaked as appropriate. Each parameter is described 
#		in the "parameters.txt" file in the project directory.
#	(3) Subdirectory containing other genomes (genbank or fasta 
#		format). The subdirectory is required, but genome files are optional, so 
#		the subdirectory may be left empty.
#	(4) (optional) List of primers to avoid from previous projects. Fasta
#		file listing primers. File should be named prevPrimers* (i.e.
#		can have prevPrimers1.fasta, prevPrimersProject2.fas, etc.)
#
# Usage: 
# 	from the Barcoder/ directory, type:
#
#	perl barcoder.pl "project_name" nBarcodes
#
#	Where:
#	- project_name = name of the subdirectory of the current project, located
#		at Barcode/Projects/project_name
#	- nBarcodes = number of barcodes to generate per target genome (it is 
#		useful to generate multiple in case some fail). 
# Outputs:
#	- primerList.fa: list of primers generated for barcodes
#	- barcodeList.fa: list of barcodes generated
#	- log files: complete description of what the script did (see
#	 	parameters.txt for some options)
#
#########################################