NGS-pipeline: A Python repository from seru71

# bin directory

Contains all the custom programs to run the pipeline and extract information from bams.

## Pipeline

The NGS pipeline can be run using one of two python scripts:

### pipeline.py

Main pipeline file for *single sample analysis*.

* Running the script

You can run the script using `python ~/bin/pipeline.py`.
A list of possible options will be presented. The only required option is `--bam`,
and it accepts glob expansions for inputting multiple files ('e.g. *.bam').

If you want to follow progress of the script, use the verbose option (`-vvvvv`).
In order to use multithreading, use the `-j` option (`-j 12`).

There are several options that need to be edited in the file, under the
*user definable options* section: the paths to picard and gat; the path to the reference
sequence fasta; the path to dbsnp; and the path to a bed file containing the regions
of the capture sequence.

* Outputs

The script will create one or more directories in the current folder, named after
the bam file prefix (`xxx.bam` will create directory `xxx`).

After finishing a few files will be in that directory, including the cleaned up
bam file (`xxx.gatk.bam`), the snp and indel files restricted by the exome bed file
(`xxx.snps` and `xxx.indels`), and the raw variant calls (`xxx.bcf`).

* Typical usage

For running the analysis with the default reference files (human g1k, dbsnp 132, and
Nimblegen enrichment bed +- 10bp), on all the bam files in the directory ~/data/new_project
using 12 cpus (half of astrakan).

mkdir ~/results/new_project
cd ~/results/new_project
python ~/bin/pipeline.py --bam '~/data/new_project/*.bam' -vvvvv -j 12

### pipeline_multi.py

Main pipeline script for *multi sample analysis*.

* Running the script

Options are very similar to the single sample analysis, but the bam input option now
accepts a directory and is renamed to `--bamdir`. There is also a new option `--groups`
that will accept a number indicating how many groups should the dataset be split for
multisample calling (I suggest that the dataset is split into groups of roughly 40
samples).

In addition, the *user definable options* section has two additional options: the path
to the hapmap vcf; and the path to 1000g OMNI dataset vcf.

* Outputs

The script will create one or more directories in the current folder, named after
the bam file prefix (`xxx.bam` will create directory `xxx`). This time these folders
will only contain the files up to the cleaned up bam file (`xxx.gatk.bam`)

In the directory where the script is run, additional files will be created with the
multisample results (`multisample.gatk.variants.vcf` for the multisample snp and indels,
and `multisample.gatk.variants.exome.vcf` for the variants restricted to the bed file).

* Typical usage

For running with the default settings on all the bam files present in the directory
~/data/new_project, splitting them into 3 distinct groups, using 12 cpus in astrakan:

mkdir ~/results/new_project
cd ~/results/new_project
python ~/bin/pipeline_multisample.py --bamdir ~/data/new_project --groups 3 -vvvvv -j 12

## Information about bam files

### gene_coverage.R

Calculates the coverage across all exons in a bam file, for a given set of genes

* Running the script

The script accepts a a bam file as the first option and any number of gene names, space
separated.

* Outputs

The script will write to standard out a gene coverage info for each gene, together with
individual exon coverage analysis for each gene. Exon position is derived from the ccds
database. Updating the database should be done in the apropriate section of the script.

* Typical usage

Calculate gene coverage for LPT1 in bam `xxx.bam`:

Rscript ~/bin/gene_coverage.R 'xxx.bam' LPT1

### region_coverage.R

Calulates coverage statistics for a specific region in the genome, along with exon coverage
for that region.

* Running the script

The script accepts a a bam file as the first option, chromosome number, start position,
end position, and minimum coverage wanted for the report.

* Outputs

The script will write to standard out exon coverage information for the specified region.

* Typical usage

Calculate exon statistics for `xxx.bam` in chr7, between pos 36399490 and 55889334,
with a minimum coverage of 8x:

Rscript ~/bin/region_coverage.R 'xxx.bam' 7 36399490 55889334 8

seru71/NGS-pipeline