This repository contains tools for generating and interacting with linkage disequilibrium (LD) data, including
- Google Cloud Dataflow
pipelines to
- calculate LD from a Google Genomics variant set and write the results to Google Cloud Storage,
- load a set of LD results from Cloud Storage into a Cloud BigTable,
- and efficiently query a Cloud BigTable for LD data.
- The BigQuery schema for this data.
In addition, the tools have been used to create a publicly-available repository of LD on the 1000 Genomes Phase 3 data. See the example Datalab or the Google Genomics Cookbook data description for more details.
LD is the non-random association of alleles at distinct physical loci, and can arise due to many different factors in population genetics.
Knowing the extent of LD between variants is essential for filtering when using various population genetics algorithms, such as using principal component analysis to identify genome-wide population structure. Furthermore, LD between variants can be exploited when one is interested in an individual's genotype at a particular variant of interest, but that variant is not directly assayed in the experiment used (i.e. as a quick proxy for genotype imputation).
Calculating LD between a single pair of variants is conceptually simple. However, it is computationally burdensome for large data sets, as the number of calculations grows as O(N^2) for a data set of N variants.
This repository demonstrates the ability to perform the resource-heavy LD computation efficiently using Dataflow, as well as ways to interact with the resulting data.
-
git clone this repository.
-
If you have not already done so, follow the Google Genomics getting started instructions to set up your environment including installing gcloud and running
gcloud init
. -
If you have not already done so, follow the Dataflow getting started instructions to set up your environment for Dataflow.
-
Download the correct version of the ALPN documentation.
-
See the ALPN documentation for a table of which ALPN jar to use for your JRE version.
-
Then download the correct version from here.
-
Use a recent version of Apache Maven (e.g., version 3.3.3) to build this code:
cd linkage-disequilibrium
mvn package
Now, pipelines can be run on Google Compute Engine using the example commands listed within the pipeline files themselves (see, for example, WriteLdBigtable.java).
Because calculation of LD can be performed independently for all pairs of
variants of interest, it is an ideal candidate for parallelization. The
LinkageDisequilibrium.java
pipeline takes as input a Google Genomics variant
set and writes LD output to Cloud Storage as a file of LD results.
Each LD result is represented as a comma-separated line with 22 entries. Each variant is represented by eight values:
- chromosome
- start position (0-based half-open, like UCSC BED format)
- end position (0-based half open, like UCSC BED format)
- Google Genomics variant ID
- Semicolon-delimited list of rsids assigned to the variant
- The number of non-reference alleles defined for the variant
- The allele coded as 0 in the LD calculations
- The allele coded as 1 in the LD calculations
The final six values are:
- The number of chromosomes used in the LD calculation
- The number of chromosomes for which the query variant has the 1 allele
- The number of chromosomes for which the target variant has the 1 allele
- The number of chromosomes for which both the query and target variants have the 1 allele
- The allelic correlation coefficient measure of LD between the variants
- The D' measure of LD between the variants
There are multiple design decisions made in the LinkageDisequilibrium.java
pipeline that influence its results. Ensure your data is appropriate before
running the pipeline blindly.
The LD measures chosen to calculate were allelic correlation coefficient and D'. Note that each of these expects phased data as input.
The standard way to handle missing data is to calculate the correlation based only on individuals who have genotype calls at both variants of interest. This may be problematic if no-calls are correlated with the presence of a particular allele, but is an expected way to handle the missing data. This is the method used in the pipeline.
A comparison to two other strategies (imputation or sampling) supports this being the method that most accurately recovers the true underlying Pearson correlation.
Genotypic correlation is defined in the context of two alleles, and simply calculates the correlation between the counts of a given allele at each variant. Genotypic correlation is ill-defined in the context of variants with more than two distinct alleles.
Frequently, tri- or higher-allelic variants have two alleles that are present in nearly all individuals, with the other variants being extremely rare. As a result, we handle multiallelic sites by identifying the two most frequent alleles, and only using chromosomes that contain one of those two alleles in the LD calculation (i.e., effectively treating rare alleles as no-calls).
Different models can be used to model LD on sex chromosomes. This pipeline uses the same genotype counts as encoded in the Variant in its calculation of LD metrics.
Because the LD calculation pipeline writes its output as comma-separated lines, it is easy to load the data into BigQuery.
General instructions for loading data into BigQuery are available here. An example script for loading from CSV is available at load_data_from_csv.py. When using that script, the schema for the table of LD results is available in this repository in the ld_bigquery_schema_fields.txt file.
Consequently, LD data can be loaded into a BigQuery table with the following code snippet:
PROJECTID=<your-project-id>
DATASETID=<your-bigquery-dataset-id>
TABLE=<your-desired-bigquery-table-name>
DATA=<path-to-linkage-disequilibrium-result-data>
python path/to/load_data_from_csv.py \
$PROJECTID $DATASETID $TABLE schema/ld_bigquery_schema_fields.txt $DATA
Because BigTable allows efficient access to extremely large datasets indexed by
a single key, it is a natural choice for representation of LD data. The pipeline
WriteLdBigtable.java
provides an example in which Dataflow is used to read
data generated by the LinkageDisequilibrium.java
pipeline and writes the
results into a BigTable. The key for each BigTable row is designed so that all
LD results for a single query variant appear in a contiguous block of the table,
sorted by the location of the target variants, and results for query variants
are sorted by the location of query variants. This key design allows efficient
access to all LD results for a single variant or a single region of the genome.
See the Google Genomics Cookbook for an example command line for this pipeline.
Once a BigTable storing LD data has been created, a mechanism for accessing the results must be created. The QueryLdBigtable.java pipeline provides an example in which Dataflow is used to read a subset of data from an LD BigTable and write the results to GCS in the same format as it was originally written by the LinkageDisequilibrium.java pipeline.
See the Google Genomics Cookbook for an example command line for this pipeline.