Cluster-pipeline

454 sequence clustering and identification

Primary language: Python

Description:

A pipeline for clustering and identifying barcode sequences produced by a
next-generation sequencer such as the Roche 454 or the Ion Torrent.


Basic usage:

pipeline.py --input_file [fasta_files] --outdir_dir [results_directory]

By default, the input files are clustered with Octupus (if present) and the non-singleton
clusters are identified by blasting them against the NCBI nucleotide database.

For a complete list of commands and options, see the manuals or run pipeline.py -h
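
As a minimal sketch, the basic call above can also be assembled and launched from
another Python script. Only the two options shown above are used. The helper names
build_pipeline_command and run_pipeline, the example file names, and the assumption
that several fasta files can follow a single --input_file are illustrative and are
not taken from the pipeline itself.

```python
import subprocess
import sys


def build_pipeline_command(fasta_files, results_directory, script="pipeline.py"):
    """Return the argument list for the basic pipeline call (hypothetical helper)."""
    fasta_files = list(fasta_files)
    if not fasta_files:
        raise ValueError("at least one fasta file is required")
    # Assumption: several input files are passed after a single --input_file.
    return ([sys.executable, script, "--input_file"]
            + fasta_files
            + ["--outdir_dir", results_directory])


def run_pipeline(fasta_files, results_directory):
    """Run the pipeline; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(build_pipeline_command(fasta_files, results_directory))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_pipeline_command(["sample_a.fasta", "sample_b.fasta"], "results")))
```

Building the command as a list rather than a single string avoids shell quoting
problems when file names contain spaces.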


Scripts:

This project contains the following scripts:

 - pipeline.py			The main script that takes the user
				files and executes the other scripts
				and programs.
 
 - paths.py			Searches the system for the tools and
				scripts that are needed to run the pipeline.
				The programs used are listed below.

 - filter.py			A small python script that can filter
				the input data based on size or remove
				duplicate sequences.

 - tag_fasta_files.py		When dealing with multiple datasets,
				each set can be tagged to identify and
				compare the different datasets during
				downstream analysis.

 - pick_otu_rep.py		Takes the OTU output files from
				the various cluster programs and picks
				a consensus or random sequence for
				identification.

 - cluster_to_txt.py		Normalizes the OTU information to a TSV file
				that lists the sequence names present in
				each cluster.

 - custom_blast_db.py		Uses the NCBI-blast+ program to set up
				a blast database based on a reference
				set and then blasts a set of sequences
				against the database to identify them.

 - blast.py			Identifies sequences by blasting them
				against the GenBank database.

 - filter_blast.py		Filters the GenBank or local blast results based
				on hit length or similarity.

 - cluster_freq.py		When dealing with multiple input files
				it takes the clustering results and
				compares how much each dataset has
				contributed to a certain cluster.
				
 - cluster_stat.py		Takes the cluster OTU file and retrieves
				basic information about the number of
				sequences used, the number of clusters and
				the size of the clusters.
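
To illustrate the kind of filtering described for filter.py above (by size, or by
removing duplicate sequences), here is a minimal sketch. It is not the code from
filter.py: it uses a small plain-Python fasta reader instead of Biopython so that it
runs on its own, and the names read_fasta, filter_records, min_length, max_length
and drop_duplicates are assumptions made for this example. Duplicates are assumed
to mean identical sequences, compared case-insensitively.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_records(records, min_length=0, max_length=None, drop_duplicates=False):
    """Keep records within the length bounds; optionally keep only the
    first copy of each sequence."""
    seen = set()
    kept = []
    for header, sequence in records:
        length = len(sequence)
        if length < min_length:
            continue
        if max_length is not None and length > max_length:
            continue
        if drop_duplicates:
            key = sequence.upper()
            if key in seen:
                continue
            seen.add(key)
        kept.append((header, sequence))
    return kept


# Made-up reads, for illustration only.
example = """>read1
ACGTACGT
>read2
ACG
>read3
acgtacgt
""".splitlines()

result = filter_records(read_fasta(example), min_length=5, drop_duplicates=True)
```

With these made-up reads, read2 is dropped for being shorter than 5 bases and read3
is dropped as a duplicate of read1, so only read1 remains.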

Requirements:

This project requires the following software / tools:

 - Python 2.7 or 3.2		The latest version of either Python 2.*
				or 3.* is needed to support the modules
				used in the scripts (most notably the
				'argparse' module).

 - Biopython 1.58		Biopython is used to deal with fasta
				sequences, multiple sequence alignments
				and GenBank files.

 - NCBI-Blast-2.2.25+		Blast+ is used to identify sequences
				with a custom-made reference database.

Any of the following cluster programs:
 - TGICL
 - Usearch 6.0 or newer 	(the older versions can be used by specifying
				--program usearch_old when using the pipeline).
 - Octupus (default)
 - cd-hit
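
The requirements above can be checked before running the pipeline. The following is
a minimal sketch of such a check and is not the code from paths.py. The executable
names in BLAST_TOOLS and CLUSTER_TOOLS are assumptions (the real names depend on how
each tool was installed), and the function names are made up for this example. It
relies on shutil.which, which needs Python 3.3 or newer.

```python
import importlib.util
import shutil
import sys

# Assumed executable names; adjust them to match the local installation.
BLAST_TOOLS = ["blastn", "makeblastdb"]
CLUSTER_TOOLS = ["tgicl", "usearch", "octupus", "cd-hit"]


def python_has_argparse(version_info=sys.version_info):
    """argparse ships with Python 2.7 and with Python 3.2 or newer."""
    major, minor = version_info[0], version_info[1]
    return (major == 2 and minor >= 7) or (major == 3 and minor >= 2) or major > 3


def find_tools(names, which=shutil.which):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


def check_requirements():
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not python_has_argparse():
        problems.append("Python 2.7 or 3.2+ is required")
    if importlib.util.find_spec("Bio") is None:
        problems.append("Biopython is not installed")
    for name, path in find_tools(BLAST_TOOLS).items():
        if path is None:
            problems.append("BLAST+ tool not found: " + name)
    # Any one of the cluster programs is enough.
    if not any(find_tools(CLUSTER_TOOLS).values()):
        problems.append("no cluster program found")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Collecting the problems in a list, rather than stopping at the first one, lets
all missing requirements be reported in a single run.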