/HugeSeq

For original Nature Biotechnology Publication (Q1 2012)

Primary LanguagePythonOtherNOASSERTION

#####################################
#                                   #
# HugeSeq                           #
# The Variant Detection Pipeline    #
#                                   #
#####################################

-- DEPENDENCIES

+ STANOVAR version 0.1

+ BEDtools version 2.17.0

+ BreakDancer version 1.1.2

+ BreakSeq Lite version 1.0

+ BWA version 0.7.4

+ CNVnator version 0.2.7

+ GATK version 3.2.2

+ JDK version 1.7.0_03

+ Modules Release 3.2.8

+ Perl

+ Picard Tools version 1.32

+ Pindel version 0.2.4t

+ Python version 2.7

+ Simple Job Manager version 1.0

+ Tabix version 0.2.6

+ vcftools version 0.1.12

+ zlib version 1.2.7

+ root version 5.34.30

+ r version 3.2.0


-- INSTALLATION

HugeSeq is a modular, computational pipeline that runs in a Unix environment in a highly parallel fashion. It was tested on Red Hat Enterprise Linux (RHEL) server v5.6 but it should work in most Linux servers. The batch system it currently supports out-of-the-box is Sun Grid Engine.


Batch System

Many of the clusters are already installed with Sun Grid Engine (SGE). For installing SGE, please refer to the vendor's manual. 

Running the analysis pipeline requires submitting many interdependent jobs to the batch scheduling system (e.g. Sun Grid Engine). Therefore, we developed a software program called SJM (Simple Job Manager) to simplify this process, including properly specifying the dependencies, tracking progress of the group of jobs, and responding properly if a job fails.

For batch systems other than SGE, it requires developing an adaptor in SJM. Please write to us for more details.

Modules Environment

To manage different versions of softwares and parameters in the modules, HugeSeq uses a Unix software package called Environment Modules, which provides for the dynamic modification of a user's environment via modulefiles.

To initiate Modules, modify your login profile such as .bash_profile to add the following:

. /path-to-Modules/default/init/sh

Supporting Tools

Install the required softwares, such as the aligners, variant callers and manipulation tools, defined in the software requirements section. For details, please refer to the individual software websites. The softwares are recommended to be installed separately under a single parent directory, such as ~/apps/BreakSeq and ~/apps/CNVnator.

Data Sets

HugeSeq depends on several public data sets for alignment, variant calling, and annotations. They are:
The reference genome (e.g. HG19 in FASTA format: hg19.fa)
The BWA index of the reference genome (e.g. hg19.fa.bwt, hg19.fa.ann, etc)
A .dict dictionary of the contig names and sizes (e.g. hg19.fa.dict)
A .fai fasta index file (e.g. hg19.fa.fai)
For creating .dict and .fai, please see here. All the indexes and dictionary should reside in the reference genome directory which contains the whole genome FASTA (e.g. hg19.fa)
The breakpoint junctions (i.e. BreakSeq library in FASTA format: bplib.fa)
The SNP annotation
UCSC Known Genes (knownGene)
dbSNP
SIFT (avsift)
RepeatMasker (buildver_rmsk.gff)
The STANOVAR application should be installed and corresponding module need to be defined. 


Download HugeSeq to your server.

Extract the programs from the compressed archive to a directory, such as ~/app. A directory like ~/app/HugeSeq will then be created, which contains the core program and its configuration. As described above, HugeSeq uses the Environment Modules package for configuration. Its modulefile is in the directory /path-to-HugeSeq/modulefiles/hugeseq named with its version, such as 1.0. To enable Modules to look up the modulefile for correct setting, modify the login profile as above and add the following:

export MODULEPATH=/path-to-HugeSeq/modulefiles:$MODULEPATH

In addition, modify the module file, such as /path-to-HugeSeq/modulefiles/hugeseq/2.0, and change all the programs' paths to the locations where you installed the required programs and the data paths to where you stored the datasets.

Logout and login again to your shell to activate the login profile with the latest configuration. You should now be able to run HugeSeq by loading its module:

> module load hugeseq/2.0

After loading the module, you can run HugeSeq simply by typing:

> hugeseq

For the usage of HugeSeq, please refer to the Usage section.

-- USAGE

usage: hugeseq [-h] --reads1 FILE [FILE ...] [--reads2 FILE [FILE ...]]
               --output DIR [--account STR] [--tmp DIR] [--readgroup STR]
               [--samplename STR] [--bam] [--variants TYPE [TYPE ...]]
               [--targeted] [--capture FILE [FILE ...]] [--relax_realignment]
               [--reference_calls] [--snp_hapcaller] [--indel_hapcaller]
               [--nosnpvqsr] [--noindelvqsr] [--vqsrchrom] [--nobinning]
               [--nocleanup] [--novariant] [--alignmentonly] [--cleanuponly]
               [--variantonly] [--donealign] [--donebinning] [--donecleanup]
               [--donegenotyping] [--donesnpvqsr] [--memory SIZE]
               [--queue NAME] [--email NAME] [--threads COUNT]
               [--jobfile FILE] [--submit]

Generating the job file for the HugeSeq variant detection pipeline

optional arguments:
  -h, --help            show this help message and exit
  --reads1 FILE [FILE ...]
                        The FASTQ file(s) for reads 1
  --reads2 FILE [FILE ...]
                        The FASTQ file(s) for reads 2, if paired-end
  --output DIR          The output directory
  --account STR         Accounting string for the purpose of cluster
                        accounting.
  --tmp DIR             The TMP directory for storing intermediate files
                        (default=output directory
  --readgroup STR       The read group annotation (Default:
                        @RG\tID:Default\tLB:Library\tPL:Illumina\tSM:SAMPLE)
  --samplename STR      The SM tag in the read group annotation (Default:
                        "SAMPLE" in
                        @RG\tID:Default\tLB:Library\tPL:Illumina\tSM:SAMPLE)
  --bam                 Support for aligned BAMs as input. By default input
                        (-r) is aligned again. Use --variantonly otherwise.
  --variants TYPE [TYPE ...]
                        gatk breakdancer cnvnator pindel breakseq (default to
                        all)
  --targeted            Use GATK in targeted sequencing mode (default: whole-
                        genome mode)
  --capture FILE [FILE ...]
                        Capture BED file(s) used for targeted genotyping
                        (default: void, separate multipe files with commas:
                        capture1.bed,capture2.bed,...)
  --relax_realignment   Relaxes GATKs realignment when dealing with badly
                        scored reads (default: false)
  --reference_calls     Store all reference calls from GATK (default: false)
                        in gVCF format in addition to a standard VCF file
                        containing only the variants (valid only for SNV
                        calling)
  --snp_hapcaller       Use GATK HaplotypeCaller to discover SNPs (default:
                        UnifiredGenotyper)
  --indel_hapcaller     Use GATK HaplotypeCaller to discover Indels (default:
                        UnifiredGenotyper)
  --nosnpvqsr           Do not perform VQSR SNPs (variant quality score
                        recalibration)
  --noindelvqsr         Do not perform VQSR on Indels (variant quality score
                        recalibration)
  --vqsrchrom           Perform VQSR on individual chromosomes (valid when
                        binning performed; default: VQSR on whole genome VCF)
  --nobinning           Do not bin the alignments by chromosomes
  --nocleanup           Do not clean up the alignments
  --novariant           Do not call variants
  --alignmentonly       Only align input FASTQ or BAM files (-r)
  --cleanuponly         Only clean up input BAM files (-r)
  --variantonly         Only call variants in input BAM files (-r)
  --donealign           Sequences already aligned using the pipeline
  --donebinning         Alignments already binned by chromosomes using the
                        pipeline
  --donecleanup         Alignments already cleaned using the pipeline
  --donegenotyping      Variants already called using the pipeline but VQSR is
                        not
  --donesnpvqsr         Processing is started after SNP VQSR (from Indel VQSR)
  --memory SIZE         Memory size (GB) per job (default: 12)
  --queue NAME          Queue for jobs (default: extended)
  --email NAME          Email address to receive emails for ending or aborting
                        last jobs in the queque
  --threads COUNT       Number of threads for alignment, only works for SGE
                        (default: 4)
  --jobfile FILE        The jobfile name (default: stdout)
  --submit              Submit the jobs