TileDB-VCF
A C++ library for efficient storage and retrieval of genomic variant-call data using TileDB Embedded.
Features
- Easily ingest large amounts of variant-call data at scale
- Supports ingesting single-sample VCF and BCF files
- New samples are added incrementally, avoiding computationally expensive merging operations
- Allows for highly compressed storage using TileDB sparse arrays
- Efficient, parallelized queries of variant data stored locally or remotely on S3
- Export lossless VCF/BCF files or extract specific slices of a dataset
What's Included?
- Command line interface (CLI)
- APIs for C, C++, Python, and Java
- Integrates with Spark and Dask
Quick Start
The documentation website provides comprehensive usage examples, but here are a few quick exercises to get you started.
We'll use a dataset that includes 20 synthetic samples, each one containing over 20 million variants. We host a publicly accessible version of this dataset on S3, so if you have TileDB-VCF installed and you'd like to follow along, just swap out each uri value below for s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20. If you don't have TileDB-VCF installed yet, you can use our Docker images to test things out.
CLI
Export complete chr1 BCF files for a subset of samples:
tiledbvcf export \
  --uri vcf-samples-20 \
  --regions chr1:1-248956422 \
  --sample-names v2-usVwJUmo,v2-WpXCYApL
Create a TSV file containing all variants within one or more regions of interest:
tiledbvcf export \
  --uri vcf-samples-20 \
  --sample-names v2-tJjMfKyL,v2-eBAdKwID \
  -Ot --tsv-fields "CHR,POS,REF,S:GT" \
  --regions "chr7:144000320-144008793,chr11:56490349-56491395"
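When export commands like the ones above are generated from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of TileDB-VCF: `build_export_command` is a hypothetical helper, and it uses only the flags shown in the examples above.

```python
import shlex


def build_export_command(uri, sample_names, regions, tsv_fields=None):
    """Assemble a `tiledbvcf export` argument list (hypothetical helper)."""
    cmd = ["tiledbvcf", "export", "--uri", uri]
    # Samples and regions are passed as comma-separated lists.
    cmd += ["--sample-names", ",".join(sample_names)]
    if tsv_fields:
        # Mirrors the TSV example above: -Ot together with --tsv-fields.
        cmd += ["-Ot", "--tsv-fields", ",".join(tsv_fields)]
    cmd += ["--regions", ",".join(regions)]
    return cmd


cmd = build_export_command(
    uri="vcf-samples-20",
    sample_names=["v2-tJjMfKyL", "v2-eBAdKwID"],
    regions=["chr7:144000320-144008793", "chr11:56490349-56491395"],
    tsv_fields=["CHR", "POS", "REF", "S:GT"],
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming the `tiledbvcf` binary is on your PATH.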
Python
Running the same query in Python:
import tiledbvcf

# Open the dataset in read mode
ds = tiledbvcf.Dataset(uri="vcf-samples-20", mode="r")

# Read three attributes for two samples across two regions
ds.read(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    regions=["chr7:144000320-144008793", "chr11:56490349-56491395"],
    samples=["v2-tJjMfKyL", "v2-eBAdKwID"],
)
The query returns its results as a pandas DataFrame:
     sample_name  pos_start    fmt_GT
0    v2-nGEAqwFT  143999569  [-1, -1]
1    v2-tJjMfKyL  144000262  [-1, -1]
2    v2-tJjMfKyL  144000518  [-1, -1]
3    v2-nGEAqwFT  144000339  [-1, -1]
4    v2-nzLyDgYW  144000102  [-1, -1]
..           ...        ...       ...
566  v2-nGEAqwFT   56491395    [0, 0]
567  v2-ijrKdkKh   56491373    [0, 0]
568  v2-eBAdKwID   56491391    [0, 0]
569  v2-tJjMfKyL   56491392  [-1, -1]
570  v2-nzLyDgYW   56491365  [-1, -1]
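A common next step is to summarize the genotype column. The sketch below counts, per sample, the records whose genotype contains no missing allele. It is illustrative only: `count_called_genotypes` is a hypothetical helper, the rows are copied from the output above, and it assumes that -1 in fmt_GT marks a missing allele. It works on plain dictionaries so it runs without TileDB-VCF installed.

```python
from collections import Counter


def count_called_genotypes(records):
    """Count records per sample whose genotype has no missing allele.

    `records` is an iterable of dicts with "sample_name" and "fmt_GT" keys.
    Assumption: a value of -1 in fmt_GT marks a missing allele.
    """
    counts = Counter()
    for rec in records:
        if all(allele >= 0 for allele in rec["fmt_GT"]):
            counts[rec["sample_name"]] += 1
    return dict(counts)


# Three rows copied from the example output above.
rows = [
    {"sample_name": "v2-tJjMfKyL", "pos_start": 144000262, "fmt_GT": [-1, -1]},
    {"sample_name": "v2-eBAdKwID", "pos_start": 56491391, "fmt_GT": [0, 0]},
    {"sample_name": "v2-tJjMfKyL", "pos_start": 56491392, "fmt_GT": [-1, -1]},
]

called = count_called_genotypes(rows)
print(called)
```

With a real result, calling `to_dict("records")` on the returned DataFrame should yield rows in this shape.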
Want to Learn More?
- Check out the full documentation
- Motivation and idea behind storing VCF data in 3D sparse arrays
- Data Model
Code of Conduct
All participants in TileDB spaces are expected to adhere to high standards of professionalism in all interactions. This repository is governed by the standards and reporting procedures detailed in the TileDB core repository Code of Conduct.