Prototype for loading VCF datasets into SciDB, currently built around the 1000 Genomes dataset. This is a very early, unstable work in progress.
Part of the original prototype was adapted from scidb-genotypes by Douglas Slotta (NCBI). See: https://github.com/slottad/scidb-genotypes
Built to load 1000 Genomes data, or data with a very similar organization:
- Assumes a running SciDB 14.8 or newer, Python, and a C++ compiler. The larger the cluster, the faster this will run. On our modest 4-node cluster, we loaded all of the 1000 Genomes data at an average of 0.76 milliseconds per line (at 2504 samples per line).
- Install load_tools from www.github.com/paradigm4/load_tools
- Currently, all VCFs must contain the same number of samples in the same positions
- Currently, no two VCFs may have the same variant
- Both restrictions can be, and probably soon will be, relaxed. That needs a few more code paths in load_file.sh
- Run ./kg_loader/recreate_db.sh once initially to create all the target arrays; run it again to blow away all the data
- Run ./kg_loader/load_file.sh FILENAME
- Hang onto something
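
Because the loader currently requires every VCF to carry the same samples in the same positions and no two VCFs to share a variant, it may be worth checking a batch of files before loading them. The following is a minimal Python sketch of such a check; it is not part of this repository. The function names are hypothetical, and treating (CHROM, POS, REF, ALT) as the identity of a variant is an assumption, since the loader's own notion of "same variant" may differ. It keeps every variant key in memory, so for the full 1000 Genomes set it would need a lot of RAM or a per-chromosome run.

```python
import gzip
import subprocess


def open_vcf(path):
    """Open a plain or gzip-compressed VCF for text reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_samples_and_variants(path):
    """Return (list of sample names, set of variant keys) for one VCF."""
    samples, variants = None, set()
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("##") or not line.strip():
                continue  # skip meta-information and blank lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample columns follow FORMAT
            else:
                # Assumed variant identity: CHROM, POS, REF, ALT
                variants.add((fields[0], fields[1], fields[3], fields[4]))
    return samples, variants


def check_vcfs(paths):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    first_samples = None
    seen = {}  # variant key -> first file that contained it
    for path in paths:
        samples, variants = read_samples_and_variants(path)
        if first_samples is None:
            first_samples = samples
        elif samples != first_samples:
            problems.append(f"{path}: sample columns differ from {paths[0]}")
        for key in variants:
            if key in seen:
                problems.append(f"{path}: variant {key} also in {seen[key]}")
            else:
                seen[key] = path
    return problems


def load_files(paths, loader="./kg_loader/load_file.sh"):
    """Run the loader script once per file, stopping at the first failure."""
    for path in paths:
        subprocess.run([loader, path], check=True)
```

A possible way to use it: call check_vcfs on the list of files, print whatever it returns, and call load_files only when the list is empty.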
After the data is loaded, install shim and SciDBR, then run the examples and queries in vcf_toolkit.R. The schema is newer than that file, so not all queries will work right away. Working on it.
A slightly older version of this is packaged into the Bioinformatics AMI. Instructions for that are here: http://www.paradigm4.com/try_scidb/