diazale/1KGP_dimred

Number of samples in 1000 genomes

dkobak opened this issue · 3 comments

Your preprint says

The 1KGP contains genotype data of 3,450 individuals from 26 relatively distinct labeled populations

and when I run your script that's what I get too. At the same time, https://www.nature.com/articles/nature15393 says

Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations

Why 2504 vs 3450?

I believe the 2,504 refers to the people who were sequenced, not those who were genotyped. (The citation is there to give the 1KGP credit for the dataset.)

There's file-specific documentation here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/

The README for this specific file is README_Affy6_3450samples_merge.txt

Hmm. Maybe. I looked through the original paper again and skimmed the supplementary file https://media.nature.com/original/nature-assets/nature/journal/v526/n7571/extref/nature15393-s1.pdf but could not figure out what exactly the set of 3450 people is. I don't see it explained in README_Affy6_3450samples_merge.txt either... Does it include the 2504 people from the main paper plus an additional ~1000? The main text of the paper mentions things like

In addition, individuals and available first-degree relatives (generally, adult offspring) were genotyped using high-density SNP microarrays

To evaluate variant discovery power and genotyping accuracy, we also generated deep Complete Genomics data (mean depth = 47×) for 427 individuals (129 mother–father–child trios, 12 parent–child duos, and 16 unrelateds).

and I assume these are not part of the 2504 dataset, but I could not find exact numbers to check if it adds up to 3450.

Another piece of documentation (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/README_supporting_info_20141104) describes what that directory contains:

This contains vcf files for omni and affy6.0 genotypes for the 1000 genomes samples and other samples from the same populations.

If the IDs are the same for subjects across datasets, you could also check whether the sequencing IDs are a subset of the genotype IDs.
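A minimal sketch of that check, assuming both datasets are available as VCF files and that sample IDs are spelled identically in each. The helper names (`parse_chrom_line`, `vcf_sample_ids`, `compare_id_sets`) and the file names in the usage comment are hypothetical, not taken from the repository or the FTP site.

```python
import gzip


def parse_chrom_line(line):
    """Return the sample IDs from a VCF '#CHROM' header line."""
    # The first nine tab-separated columns are fixed VCF fields
    # (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT); samples follow.
    return line.rstrip("\n").split("\t")[9:]


def vcf_sample_ids(path):
    """Read sample IDs from a plain or gzip-compressed VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return parse_chrom_line(line)
    raise ValueError(f"no #CHROM header line found in {path}")


def compare_id_sets(genotyped, sequenced):
    """Count how the two sets of sample IDs overlap."""
    genotyped, sequenced = set(genotyped), set(sequenced)
    return {
        "both": len(genotyped & sequenced),
        "genotyped_only": len(genotyped - sequenced),
        "sequenced_only": len(sequenced - genotyped),
        "sequenced_subset_of_genotyped": sequenced <= genotyped,
    }


# Usage (hypothetical file names; substitute the real paths):
# genotyped = vcf_sample_ids("affy6_genotypes.vcf.gz")
# sequenced = vcf_sample_ids("phase3_sequenced.vcf.gz")
# print(compare_id_sets(genotyped, sequenced))
```

If `sequenced_only` comes out as zero, the sequenced individuals are all contained in the genotyped set, and `genotyped_only` gives the number of extra samples.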