Platinum Genomes

This repo contains the Platinum Genomes small variant truthset for samples NA12878 (also known as HG001) and NA12877. Platinum Genomes truthset variants were validated using haplotype inheritance information through a well-studied 17-member pedigree (CEPH 1463).

Truthsets

Truthsets are made up of a VCF of validated variant records and a BED file of confident regions. These files aren't huge (100s of MB) but are too large to play nicely with git and GitHub. Here are a few ways to download them:

AWS CLI

Truthset files are stored in an AWS S3 bucket called platinum-genomes. One way to download them is via the AWS CLI:

aws s3 cp s3://platinum-genomes/2017-1.0 pg2017 --recursive

To download without AWS credentials, add the --no-sign-request flag. You can also explore the bucket and download individual files with this S3 bucket display.

wget

Alternatively, use wget or similar with the file URIs in this repo, e.g.:

wget -xi files/2017-1.0.files

You can then use the relevant md5 checksum in each release to validate data integrity.
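As a minimal sketch of that integrity check, the helper below compares a file's MD5 digest with an expected value. The function name `verify_md5` is hypothetical, and the script assumes GNU coreutils `md5sum` is installed (on macOS, `md5 -q FILE` prints the digest instead). How each release lays out its checksums is not specified here, so the expected digest is passed in by hand.

```shell
# Verify one downloaded file against an expected MD5 hex digest.
# Assumption: GNU coreutils `md5sum` is on PATH.
verify_md5() {
    file="$1"
    expected="$2"
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}
```

Call it as `verify_md5 path/to/NA12878.vcf.gz <digest-from-release>`. If the checksums happen to be published in `md5sum` output format, `md5sum -c <checksum-file>` would do the same check in one step.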

Finally, truthset files can also be downloaded via FTP, e.g.:

wget ftp://platgene_ro:''@ussd-ftp.illumina.com/2017-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz

Usage

To compare a VCF against these truthsets, we recommend hap.py, which performs a sophisticated haplotype-aware comparison, over a simpler tool such as bcftools isec.
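A typical hap.py run takes the truth VCF first, then the query VCF, with the confident regions passed via `-f`, the reference FASTA via `-r` and an output prefix via `-o`. The wrapper below is only a sketch: the function name `compare_to_truthset` and the `HAPPY` variable are hypothetical conveniences, and every file path is a placeholder you would replace with your own.

```shell
# Sketch: compare a query VCF against a Platinum Genomes truthset with hap.py.
# HAPPY lets you point at a specific hap.py install; defaults to `hap.py` on PATH.
HAPPY="${HAPPY:-hap.py}"

compare_to_truthset() {
    truth_vcf="$1"    # truthset VCF, e.g. the downloaded NA12878.vcf.gz
    conf_bed="$2"     # matching confident-regions BED file
    query_vcf="$3"    # your own call set
    ref_fasta="$4"    # reference FASTA for the same genome build
    out_prefix="$5"   # prefix for hap.py output files
    "$HAPPY" "$truth_vcf" "$query_vcf" \
        -f "$conf_bed" \
        -r "$ref_fasta" \
        -o "$out_prefix"
}
```

For example: `compare_to_truthset NA12878.vcf.gz confident.bed my_calls.vcf.gz ref.fa results/na12878`. The truthset and reference must be on the same genome build.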

Applications wrapping hap.py and containing these truthsets are available on the following platforms:

Details

See the attached wiki for technical information.

Raw data

Sequencing data for NA12878, NA12877 and samples NA12889-NA12892 (grandparents) are available through the ENA:

BaseSpace users can access this data via a shared BaseSpace project:

Sequencing data for the remaining pedigree members is not consented for public release and so is made available through the dbGaP database:

Issues

Please open an issue for questions, bug reports and other feedback.

Citation

For further information or to cite Platinum Genomes resources, see:

  • Eberle, MA et al. (2017) A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research, 27:157-164. doi:10.1101/gr.210500.116