GenomicGPS
is a software package for applying multilateration to genomic data. Multilateration is a technique used in the Global Positioning System (GPS). To determine the position of an aircraft, the signals from satellites are interpreted to the distances to them. Given these distances, it is possible to mathematically calculate the exact position of an aircraft. In genomic multilateration, investigators calculate genetic distances between their samples and reference samples, which are from data held in the public domain, and share this information with others. This sharing enables certain types of genomic analysis, such as identifying sample overlaps and close relatives, decomposing ancestry, and mapping of geographical origin without disclosing personal genome. Importantly, though, the shared information conceals individual genotypes, making it extremely difficult to reconstruct the personal genome. Thus, our method can be seen as a balance between open data sharing and privacy protection.
We measure the genetic distance to individuals in public data, which we call "distance vector". The software includes modules (1) to generate distance vectors given genomic data and (2) to detect overlapping samples given distance vectors.
Currently, Linux
or Mac
is required to run GenomicGPS
.
In order to download GenomicGPS
, you can clone this repository via the commands.
Before using 'git clone' command, please install the extension of git called Git Large File Storage (LFS) for cloning the reference file (>100MB). Since it has many different ways to install LFS for various OS, please refer to this page. Then, type the following to clone our repository.
$ git clone https://github.com/hanlab-SNU/GenomicGPS.git
$ cd GenomicGPS
(If you couldn't install LSF, don't worry; you can still clone the software. However, the
Reference.tar.gz
file will not be properly cloned due to large size usinggit clone
command. In that case, you can manually click and download thisReference.tar.gz
file from our website, and put that into the cloned directory.)
Some software packages must be installed.
First, please make sure that PLINK is installed. The installed plink path should be added to the system path.
You can verify by
$ plink --version
Also, you should install python3 and pip or anaconda for downloading the following necessary python packages :
- numpy
- pandas
- scipy
- datatable (It only supports MacOS, linux)
If you are using Python, you can install the required packages with:
$ pip install -U numpy pandas scipy datatable
If you are using Anaconda, you can install the required packages with:
$ conda install -c conda-forge numpy pandas scipy pip
$ pip install datatable
The input data has to be in PLINK
format.
Please ensure that the .bed/.bim/.fam
filesets or .map/.ped
filesets are all present in the same path. When you run the code, you should give the path and prefix of the data.
To run GenomicGPS
, we need a reference panel from public domain.
Don't worry, we've already prepared the reference dataset for you;
Reference.tar.gz
file contains 1000Genomes Phase 3 data
(2,504 samples and 176,504 pruned SNPs with MAF>5%) formatted for our software.
The user can specify how many SNPs (N
) and how many reference individuals (K
) to use.
Then, our software will randomly sample N
SNPs and K
individuals from our prepared reference dataset
(which will add a layer of security, because SNP sets and references will be different case by case).
The rule of thumb is that we'll need K>10
references and the SNP/reference ratio (N/K)
needs to be >20
. A simple choice we recommend is K=30
and N=1,000
.
Our first module is for generating a distance vector. There are two usages.
Researcher A
wants to calculate distance vectors of his samples and send this information to Researcher B
, so that B
can check if there are sample overlaps.
Then, Researcher A
runs the following:
$ cd 1.DV_Generator
$ chmod +x dv_gen.sh
$ ./dv_gen.sh -n N(#snps) -k K(#references) -d mydata1
This will generate
mydata1.input
: input file converted fromplink
formatmydata1.out
: Distance vectorsmydata1.ref
: Randomly sub-sampled reference data, which was used for distance vector calculationmydata1.ref.p
: The MAF information of the randomly sub-sampled reference data
Then, Researcher A
sends mydata1.out
, mydata1.ref
, and mydata1.ref.p
to Researcher B
.
Researcher A
also lets Researcher B
know the numbers of SNPs and references used (K
and N
).
DO NOT SEND
mydata1.input
, because this is genotype information of the samples that we want to conceal!!
Researcher B
also calculates distance vectors, but the difference is that he puts the reference information received from A
so that the reference information can be aligned between the two.
cd 1.DV_Generator
$ chmod +x dv_gen.sh
$ ./dv_gen.sh -n N(#snps) -k K(#references) -d mydata2 -r mydata1.ref -p mydata1.ref.p
Given -r
and -p
option, our software uses the reference information from these files instead of sampling from 1000Genomes data
.
This will generate
mydata2.input
: input file converted fromplink
formatmydata2.out
: Distance vectors
Our second module is for detecting sample overlap given distance vectors.
Researcher B
can now run the following:
$ cd 2.DV_Comp_Detct
$ chmod +x comp_det.sh
$ ./comp_det.sh -d1 mydata1.out -d2 mydata2.out -p mydata1.ref.p (+ optional : -t threshold)
This code compares two distance vectors to detect sample overlap. The reference allele frequency (.ref.p
) is used to calculate Σ (variance-covariance matrix). Then, the statistic and p-value will be calculated.
This will generate
mydata1_mydata2.outcomp_2data.stat
: Sample overlap detection statistic for all pairwise comparison. GivenN1
samples from ResearcherA
andN2
samples from ResearcherB
, this will beN1xN2
matrix.mydata1_mydata2.outcomp_2data.pvalue
: P-values of all pairwise comparison. This will beN1xN2
matrix.mydata1_mydata2.outcomp_2data.sigma
: The covariance matrix used for calculation. This will beKxK
matrix.
Also, the detected sample overlap will be printed:
- Indiv 0 from data 1 ---- Indiv 0 from data 2 | were collected from the same individual.
- Indiv 1 from data 1 ---- Indiv 1 from data 2 | were collected from the same individual.
- Indiv 17 from data 1 ---- Indiv 2 from data 2 | were collected from the same individual.
- Indiv 18 from data 1 ---- Indiv 3 from data 2 | were collected from the same individual.
- Indiv 19 from data 1 ---- Indiv 4 from data 2 | were collected from the same individual.
as for the pairs whose p-values exceeded the significance threshold specified by -t
.
If one hasn't decided the p-value threshold, then one can omit
-t
parameter. Then, our software will internally simulate the null hypothesis (unrelated pairs) and the alternative hypothesis (overlapping pairs) lots of times, and decide an appropriate threshold for you as the middlelog(P)
value between the null and alternative clusters. This procedure takes some time (~10 min). The determined threshold will be printed on the output screen.
Our package includes example dataset at
- PATH :
./sample_data/
To run the testrun,
$ chmod +x genomicgps.sh
$ ./genomicgps.sh -n N(#snps) -k K(#satellites) -d1 sample_data/mydata1 -d2 sample_data/mydata2
This testrun will generate distance vectors from both datasets and perform the overlapping detection test.
In our paper, we described the situation that an opponent wants to recover the genome information given the distance vector and reference dataset. We implemented the opponent as a greedy algorithm. The results showed that recovery of the genome information was very difficult. For anyone who wants to replicate the results, we provide the following JAVA code:
JAVA code : click >> Greedy Algorithm package
This project is licensed under the terms of the MIT license.
If you use the software genomicGPS
, please cite Kim, Kunhee, et al. "Genomic GPS: using genetic distance from individuals to public data for genomic analysis without disclosing personal genomes." Genome biology 20.1 (2019): 175.
- PLINK v1.9 | Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4. (2015) doi:10.1186/s13742-015-0047-8
- 1000 Genome Phase 3 data | A global reference for human genetic variation, The 1000 Genomes Project Consortium, Nature 526, 68-74 (2015) doi:10.1038/nature15393
This software was implemented by Chloe Soohyun Jang and Buhm Han. Please contact hanlab.snu@gmail.com