A companion resource for the publication 'A Collection of 2,280 Public Domain (CC0) Curated Human Genotypes'

Primary language: Python. License: MIT.

Curated Human Genotypes

As Direct-to-Consumer (DTC) Personal Genetic Testing (PGT) continues to attract users, many of them choose to make their genome data accessible through public repositories for further reuse. Personal genetic profiles can predict a range of health and ancestry traits and are a valuable resource for the scientific community as reference datasets. Using the Repositive platform (https://repositive.io), we retrieved 3,137 public domain files originating from 23andMe, a leading PGT provider. After filtering out duplicate, incomplete, or corrupted files, we obtained a curated dataset of 2,280 unique personal genetic profiles. To assess the validity of this dataset, we performed a principal component analysis of its SNP variation and compared it to the 1000 Genomes Project. This gave us a benchmark for ethnicity assessment in our curated collection, which was predicted to be 94.9% European. Although this dataset is modest in size compared to current major genome data aggregation projects, its full accessibility and licensing terms (CC0) make it a useful reference pool for validation purposes and control experiments.
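The filtering step described above could be sketched roughly as follows. This is an illustration only, not the procedure used for the publication: it assumes the usual 23andMe raw-data layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, and genotype), and the function names, the `min_snps` threshold, and the hash-based duplicate check are all hypothetical choices made for this example.

```python
import hashlib


def parse_genotype_text(text):
    """Parse 23andMe-style raw data into {rsid: genotype}.

    Raises ValueError if a data row does not have exactly four
    tab-separated fields (treated here as a sign of corruption).
    """
    calls = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}")
        rsid, _chromosome, _position, genotype = fields
        calls[rsid] = genotype
    return calls


def curate(profiles, min_snps=2):
    """Split profiles into kept names and rejected names with a reason.

    profiles maps a file name to its raw text. min_snps is a
    hypothetical completeness threshold; a real run would use a far
    larger value.
    """
    seen = {}  # content digest -> first file name with that content
    kept, rejected = [], {}
    for name, text in profiles.items():
        try:
            calls = parse_genotype_text(text)
        except ValueError:
            rejected[name] = "corrupted"
            continue
        if len(calls) < min_snps:
            rejected[name] = "incomplete"
            continue
        # Hash the sorted calls so header differences do not hide duplicates.
        digest = hashlib.sha256(repr(sorted(calls.items())).encode()).hexdigest()
        if digest in seen:
            rejected[name] = f"duplicate of {seen[digest]}"
            continue
        seen[digest] = name
        kept.append(name)
    return kept, rejected
```

Hashing the parsed calls rather than the raw bytes means two uploads of the same genotype with different header comments would still be flagged as duplicates, which seems closer to the notion of a "unique personal genetic profile".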

Data availability

All 2,280 ‘clean’ 23andMe genotype URLs are available in the file provided.

Code availability

A Python script that reads the 23andMe URLs and downloads the corresponding genotype files is also available.
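For readers who want a feel for what such a download step involves, here is a minimal standard-library sketch. It is not the script shipped with this repository: the function names, the one-URL-per-line input format, and the numbered output file names are assumptions made for illustration.

```python
import os
import shutil
import urllib.parse
import urllib.request


def read_urls(path):
    """Return the non-empty, non-comment lines of a URL list file."""
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


def local_name(index, url):
    """Build a local file name; the index prefix keeps names unique."""
    base = os.path.basename(urllib.parse.urlparse(url).path) or "genotype.txt"
    return f"{index:04d}_{base}"


def download_all(url_file, out_dir, timeout=30):
    """Download every URL listed in url_file into out_dir.

    Returns a list of (url, error message) pairs for failed downloads
    so that one unreachable file does not abort the whole run.
    """
    os.makedirs(out_dir, exist_ok=True)
    failures = []
    for index, url in enumerate(read_urls(url_file), start=1):
        target = os.path.join(out_dir, local_name(index, url))
        if os.path.exists(target):
            continue  # already fetched on an earlier run
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                with open(target, "wb") as out:
                    shutil.copyfileobj(response, out)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            if os.path.exists(target):
                os.remove(target)  # drop partial files
            failures.append((url, str(exc)))
    return failures
```

A call such as `download_all("genotype_urls.txt", "genotypes")` would then fetch every listed file, where `genotype_urls.txt` is a hypothetical name standing in for the URL file provided here. Skipping existing targets makes the run resumable after an interruption.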