Experimental web service for accessing and slicing bam/cram/vcf.bgzip data
This in an incubating proposal within the GA4GH data working group towards a pragmatic web protocol for accessing large genomic datasets like reads and variants. Please see this annotated slide deck for further introduction and context, and the working draft of the protocol specification.
First, make the htsnexus client tool ready. This is a ~50 SLOC Python script. You'll also need samtools and curl.
curl https://raw.githubusercontent.com/dnanexus-rnd/htsnexus/master/client/htsnexus.py > htsnexus
chmod +x htsnexus
Here are a few things you can do with htsnexus and samtools:
Download an ENCODE ChIP-Seq BAM
./htsnexus ENCODE ENCFF904PIO > ENCFF904PIO.bam
Slice a genomic range out of a Platinum Genomes BAM
./htsnexus -r chr12:111766922-111817529 platinum NA12878 > NA12878_ALDH2.bam
Count reads on chr21 in a 1000 Genomes BAM
./htsnexus -r 21 1000genomes_low_coverage NA20276 | samtools view -c -
id. with CRAM (samtools 1.2+ needed)
./htsnexus -r 21 1000genomes_low_coverage HG01102 CRAM | samtools view -c -
Stream reads from Heng Li's bamsvr and display as headered SAM
./htsnexus -r 11:10899000-10900000 lh3bamsvr EXA00001 | samtools view -h - | less -S
Slice a bgzipped VCF
./htsnexus -r 12:112204691-112247789 1000genomes 20130502_autosomes vcf | gzip -dc | grep rs671 | cut -f1-16
The htsnexus client tool simply emits BAM/CRAM/VCF to standard output, which can be redirected to a file or piped into samtools/bcftools. It delivers a well-formed BAM/CRAM/VCF file, with the proper header, even when slicing a genomic range. Here are the data accessions currently available:
namespace | accession | format |
---|---|---|
platinum Illumina Platinum Genomes stored at EBI |
NA12877 NA12878 NA12879 NA12881 NA12882 NA12883 NA12884 NA12885 NA12886 NA12887 NA12888 NA12889 NA12890 NA12891 NA12892 NA12893 | BAM |
ENCODE ChIP-seq data released by the ENCODE DCC in Jan 2016 |
ENCFF014ABI ENCFF024MPE ENCFF070QUN ENCFF090MZL ENCFF124VCI ENCFF137WND ENCFF180VYU ENCFF308BKD ENCFF373VCV ENCFF465GPJ ENCFF572JRO ENCFF630NYB ENCFF743FRI ENCFF800DAY ENCFF862PIC ENCFF866OLR ENCFF904PIO ENCFF929AIJ ENCFF946BKE ENCFF951SEJ | BAM |
lh3bamsvr Heng's examples |
EXA00001 EXA00002 | BAM |
1000genomes_low_coverage Low-coverage whole-genome sequencing from the 1000 Genomes Project |
2,535 individual accessions (example usage above) | BAM, CRAM |
1000genomes 1000 Genomes Project variant calls |
20130502_autosomes | VCF |
(We recognize that a directory API method is needed instead of this Markdown table...)
You can get a feel for how htsnexus works by running it in verbose mode:
$ ./htsnexus -v 1000genomes_low_coverage NA20276 > /dev/null
Query URL: http://htsnexus.rnd.dnanex.us/v1/reads/1000genomes_low_coverage/NA20276?format=BAM
Response: {
"namespace": "1000genomes_low_coverage",
"accession": "NA20276",
"urls": [
{
"url": "https://s3.amazonaws.com/1000genomes/phase3/data/NA20276/alignment/NA20276.mapped.ILLUMINA.bwa.ASW.low_coverage.20120522.bam"
}
],
"format": "BAM"
}
Piping: ['curl', '-LSs', '-H', 'range: bytes=0-836074732', '-H', 'referer: http://htsnexus.rnd.dnanex.us/v1/reads/1000genomes_low_coverage/NA20276?format=BAM', 'https://s3.amazonaws.com/1000genomes/phase3/data/NA20276/alignment/NA20276.mapped.ILLUMINA.bwa.ASW.low_coverage.20120522.bam']
Success
The htsnexus client makes the request to an API server. The server's JSON response gives the client another URL at which it can access the desired data, in this case a BAM file within the AWS mirror of 1000 Genomes. Then, the client delegates to curl to download that file. (Shelling out to curl is just a quick-and-dirty demo implementation strategy, of course.)
How about when we slice a genomic range? This is slightly more complicated.
$ ./htsnexus -v -r chr12:111766922-111817529 platinum NA12878 | wc -c
Query URL: http://htsnexus.rnd.dnanex.us/v1/reads/platinum/NA12878?format=BAM&referenceName=chr12&start=111766922&end=111817529
Response: {
"urls": [
{
"url": "data:application/octet-stream;base64,[704 base64 characters]"
},
{
"url": "https://dl.dnanex.us/F/D/8P6zFPZ0fy5z20bJzy32jbG4165F54Fv5fZFbzpK/NA12878_S1.bam",
"headers": {
"range": "bytes=81272945657-81275405960",
"referer": "http://htsnexus.rnd.dnanex.us/v1/reads/platinum/NA12878?format=BAM&referenceName=chr12&start=111766922&end=111817529"
}
},
{
"url": "data:application/octet-stream;base64,[40 base64 characters]"
}
],
"namespace": "platinum",
"accession": "NA12878",
"reference": "hg19",
"format": "BAM"
}
Piping: ['curl', '-LSs', '-H', 'range: bytes=81272945657-81275405960', '-H', 'referer: http://htsnexus.rnd.dnanex.us/v1/reads/platinum/NA12878?format=BAM&referenceName=chr12&start=111766922&end=111817529', 'https://dl.dnanex.us/F/D/8P6zFPZ0fy5z20bJzy32jbG4165F54Fv5fZFbzpK/NA12878_S1.bam']
Success
2460858
The server tells the client to access the Platinum Genomes BAM file from EBI, but furthermore to access a specific byte range containing the desired genomic range. Thus the server handles the genomic range lookup, so the client doesn't need index files. This will benefit genome browsers, which today sometimes have to fetch a ~100 MiB BAI file in order to take a far smaller BAM slice for visualization. (The server also gives the client the BAM header in an inline data URI prefixed to the byte range, facilitating delivery of a well-formed BAM.) CRAM slicing works in just the same way, differing only in the server-side index construction method, which the client is abstracted from.
Here's a diagram illustrating this core mechanic:
The htsnexus server maintains metadata and genomic range indices, but doesn't necessarily have to store or transport the data files themselves. This means it can be operated quite frugally, with heavy lifting offloaded to general-purpose services like S3. Architecturally, one htsnexus server could probably support thousands of clients engaged in a large scale analysis (not saying the current implementation is there). At the same time, should the need arise to generate the requested data on-the-fly, the server can instead direct the client to an endpoint providing that function, such as bamsvr above.
Slicing by htsnexus is imprecise, in that the result may contain some records outside of the requested range. This is a salient shortcoming, but part of an informed tradeoff, and rather easy to compensate for on the client.