picardmetrics

Run Picard tools and collate multiple metrics files.

Summary

picardmetrics runs 10 Picard tools on a BAM file.

You can find additional scripts in the scripts/ folder:

make_refFlat creates a refFlat file with (human) Gencode v19 gene annotations.
make_rRNA_intervals creates an intervals_list file with all human ribosomal RNA genes.
plot_picardmetrics.R shows how to read and plot the metrics.

Commands

$ picardmetrics
Usage: picardmetrics COMMAND
  run         Run the Picard tools on a given BAM file.
  collate     Collate metrics files for multiple BAM files.

$ picardmetrics run
Usage: picardmetrics run [-f FILE] [-r] <file.bam>
  -f FILE   The configuration file. (Default: ~/.picardmetricsrc)
  -r        The BAM file has RNA-seq reads. (Default: false)

$ picardmetrics collate
Usage: picardmetrics collate PREFIX <file.bam> [<file.bam> ...]

Installation

# Download the code.
git clone git@github.com:slowkow/picardmetrics.git

# Install the script to your preferred location.
cd picardmetrics
cp picardmetrics ~/bin/

# Copy and edit the configuration file to match your system.
cp picardmetricsrc ~/.picardmetricsrc
vim ~/.picardmetricsrc

You also need to install these dependencies:

Picard
samtools, which depends on htslib
stats
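
Before the first run, it can help to check that the command-line dependencies are on your PATH. This is only a minimal sketch: missing_deps is a hypothetical helper, not part of picardmetrics, and the command names java (Picard is a Java program) and stats are assumptions about how the tools are installed on your system.

```shell
# Hypothetical helper: print the name of every command that is not on PATH.
# It prints nothing when all of the given commands are found.
missing_deps() {
  for cmd in "$@"; do
    # command -v is the POSIX way to ask whether a command can be resolved.
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}
```

Call it as: missing_deps java samtools stats. Any name it prints still needs to be installed or added to your PATH.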

Examples

I've included two BAM files, each with 10,000 mapped reads, to illustrate the usage of picardmetrics. Please see the data/ folder.

Here are three examples of how you can run the program:

Run picardmetrics sequentially (in a for loop) on multiple BAM files.
Run in parallel with GNU parallel, using multiple processors or multiple servers.
Run in parallel with an LSF queue, distributing jobs to multiple servers.

Example 1: Sequential

Run the Picard tools on the provided example BAM files:

$ for f in data/project1/sample?/sample?.bam; do picardmetrics run -r "$f"; done
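
If you have more than a handful of files, you may want the loop to keep going when one file fails and to tell you which ones failed. The following is a sketch under that assumption; run_all is a hypothetical wrapper, not something picardmetrics provides, and it uses only the run -r invocation shown above.

```shell
# Hypothetical wrapper: run picardmetrics on each BAM file in turn,
# continue after a failure, and report the failed files on stderr.
# Returns non-zero if at least one file failed.
run_all() {
  failed=0
  for f in "$@"; do
    if ! picardmetrics run -r "$f"; then
      echo "FAILED: $f" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Call it as: run_all data/project1/sample?/sample?.bam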

Collate the generated metrics files:

$ picardmetrics collate data/project1 data/project1/sample?/sample?.sorted.bam

Example 2: GNU parallel

Run 2 jobs in parallel:

$ parallel -j2 picardmetrics run -r {} ::: data/project1/sample?/sample?.bam

If you have many files, or if you want to run jobs on multiple servers, it's a good idea to put the full paths in a text file.

Here, we have ssh access to server1 and server2. We're launching 16 jobs on server1 and 8 jobs on server2. You'll have to make sure that picardmetrics is in your PATH on all servers.

$ ls /full/path/to/data/project1/sample*/sample*.bam > bams.txt
$ parallel -S 16/server1,8/server2 picardmetrics run -r {} :::: bams.txt
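
The ls command above works because the glob already starts with a full path. If you only have relative paths, something like the following sketch could build the same list. abs_paths is a hypothetical helper, and it assumes the files are reachable under the same absolute path on every server (for example on a shared filesystem).

```shell
# Hypothetical helper: print the absolute path of each existing file,
# one per line. Paths that do not exist are skipped silently.
abs_paths() {
  for f in "$@"; do
    [ -e "$f" ] || continue
    # Resolve the directory part by changing into it in a subshell.
    dir=$(cd "$(dirname "$f")" && pwd) || continue
    printf '%s/%s\n' "$dir" "$(basename "$f")"
  done
}
```

Call it as: abs_paths data/project1/sample?/sample?.bam > bams.txt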

Example 3: LSF

I recommend you install and use asub to submit jobs easily. This command will submit a job for each BAM file to the myqueue LSF queue.

$ cat bams.txt | xargs -I {} echo picardmetrics run -r {} | asub -j picardmetrics -q myqueue
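
If asub is not available, one alternative is to generate one plain bsub command per BAM file and review the commands before submitting them. This is a sketch, not the method this project uses: bsub_commands is a hypothetical helper, it relies only on the standard bsub options -q (queue) and -J (job name), and it assumes the paths in bams.txt contain no spaces or quotes.

```shell
# Hypothetical helper: read BAM paths on stdin and print one bsub command
# per path for the queue given as the first argument. Blank lines are skipped.
bsub_commands() {
  queue=$1
  while IFS= read -r bam; do
    [ -n "$bam" ] || continue
    printf 'bsub -q %s -J picardmetrics "picardmetrics run -r %s"\n' "$queue" "$bam"
  done
}
```

Call it as: bsub_commands myqueue < bams.txt. Once the printed commands look right, pipe them to sh to submit the jobs.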
