seqpipeline (pipeline.py)

This is bioinformatics sequence-analysis pipelining software.

Currently, it is written in python, with a single R script to handle 'edgeR' differential expression.

The code is located at https://github.com/gladstone-institutes/seqpipeline

How to run it:

Make a new directory where you're going to run everything.
Download the test data (FASTQ reads) from http://gb.ucsf.edu/bio/public/kp-600/test_data/
- There are six of these files.
Put that test FASTQ data into a new folder named 'test_data'
You can now run one of the commands in the 'Makefile' for this project, which at the moment has two options:
- test_2groups
- test_3groups
If you run "make test_2groups", pipeline.py will be run and will generate an output file named 'script_test.sh'
You can then invoke that script by running "bash script_test.sh". That is how you actually run all the bioinformatics tools.
Finally, if you ran one of the 'make test_2groups' or 'make test_3groups' commands above, you can compare your output to the reference output in the "test_data" directory, for example: 2-group edgeR output.txt

Example command:

python2  ./pipeline.py --basedir="/data/projects/kp-600-b2b-osono-data-pipeline-run-feb-16/B-2016-11-November/test_data/" \
            --outdir="/data/projects/kp-600-b2b-osono-data-pipeline-run-feb-16/B-2016-11-November/" \
	--experiment-id="Test_3_Compare" \
	--sample-ids="X1,X2,Y1,Y2,Z3A,Z3B" \
	--groups="1,1,2,2,3,3" \
	--rna-samples=a1.mm9.chr19.fq.gz,a2.mm9.chr19.fq.gz,b1.mm9.chr19.fq.gz,b2.mm9.chr19.fq.gz,a3.mm9.chr19.fq.gz,b3.mm9.chr19.fq.gz \
	--species=mm9  --script="script_3_compare_test.sh"

To-do:

Currently, only RNA-seq has been properly debugged in the updated 'pipeline.py' program.

=========================================

Required software:

Tophat (splice-aware aligner)

Executable name: tophat (version 2.1.1)

To install: See details at: http://ccb.jhu.edu/software/tophat/index.shtml

Bowtie (non-spliced aligner)

Executable name: bowtie2 (version 2.2.4)

To install: See details at: http://bowtie-bio.sourceforge.net/index.shtml

BCP (ChIP-seq peak caller for everything except TF binding)

Executable name: BCP_HM (version 1.1)

To install: Available at https://cb.utdallas.edu/BCP/
           (Paper: http://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1002613 )

bam2bed (Part of the 'bedops' suite. Converts BAM regions to BED files.)

Executable name: bam2bed (version 2.4.20)

Note: This program is required only for running BCP in the CHIPseq pipeline--nothing else uses it.

To install: Available here: http://bedops.readthedocs.io/en/latest/content/installation.html
           (Paper: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract )

GEM (motif-aware ChIP-seq peak caller for TF binding sites)

Executable name: gem.jar (version 2.5)

To install: See instructions at: http://groups.csail.mit.edu/cgs/gem/

           (We were using version 2.5, but version 2.7+ is available now.)

htseq-count (reads -> gene-level counts)

Executable name: htseq-count (version 0.6.0)

```
To install: pip2.7 install HTseq
```

           (Note: must be installed via pip (or other package manager). Do not just copy the binaries--it will not work.)

samtools

Executable name: samtools (version 1.3)

To install: yum install samtools (on Redhat/CentOS)

           (Available through your package manager. Other examples: brew install samtools (Mac Homebrew), apt-get install samtools (Ubuntu))

edgeR (R library for differential expression)

Executable name: NA (version 3.14.0)

To install: source('https://bioconductor.org/biocLite.R'); biocLite('edgeR')
           (Available through Bioconductor.)

java

Executable name: java (version 1.8.0)

To install: Install via package manager or on Oracle's web site: https://java.com/en/
           (May also be possible to install using your package manager (e.g. 'yum install java'). Java is required to run 'gem.jar' and 'MarkDuplicates.jar')