bigquery-examples

The projects in this repository demonstrate working with genomic data via Google BigQuery. All examples are built upon public datasets.

You can execute these examples by:

Copying and pasting the queries into

the BigQuery Browser Tool
the bq Command-Line Tool
one of the many third-party tools that have integrated BigQuery

Running the chunks of R code within the RMarkdown files in R or RStudio
Running the chunks of Python code within the IPython Notebooks in IPython

With minor modification, you can run the same analyses on your own genomic data within BigQuery.
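As a rough illustration of the bq Command-Line Tool route, the sketch below assembles a `bq query` invocation for a query copied from one of the examples. The helper names `build_bq_command` and `to_shell` and the project ID `my-project` are hypothetical placeholders, not part of this repository; the command is only built here, not executed.

```python
import shlex


def build_bq_command(sql, project_id):
    """Return the argv list for running `sql` with the bq tool.

    `project_id` is the project that will be billed for the query
    (a placeholder value is used in the example below).
    """
    # --project_id is a global bq flag, so it goes before the "query" command.
    return ["bq", "--project_id=" + project_id, "query", sql]


def to_shell(argv):
    """Render an argv list as a single shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # "SELECT 1" stands in for a query pasted from one of the examples.
    print(to_shell(build_bq_command("SELECT 1", "my-project")))
```

The resulting argv list could be passed to `subprocess.run` on a machine where bq is installed and authenticated.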

Datasets

1000genomes

Sample analyses upon VCF data from the 1,000 Genomes Project

Project Name: google.com:biggene

pgp

Sample analyses upon the Personal Genome Project

Project Name: google.com:biggene

Getting Started

Set up a BigQuery project.

Follow the BigQuery instructions on how to sign up for BigQuery and set up a new API project.
You can use the Team tab to share a project with other people at your company.

You’ll need to enable billing, and you will be charged for any queries you execute.

See BigQuery pricing for more detail.
For example, queries within the 1,000 Genomes dataset that examine sample genotype columns will process approximately 1 TB of data per query, which costs about $5.00 (1,000 GB * $0.005 per GB processed).
If you would like to try a few queries before enabling billing, see the queries for the much smaller phenotypic dataset in the data story "Exploring the phenotypic data".
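The cost arithmetic above can be sketched as a small helper. The function name `estimate_query_cost` is hypothetical, and the default rate of $0.005 per GB is simply the figure quoted in this README; check BigQuery pricing for the current rate before relying on it.

```python
def estimate_query_cost(bytes_processed, usd_per_gb=0.005, bytes_per_gb=10**9):
    """Estimate the cost in USD of a query that scans `bytes_processed` bytes.

    Uses decimal units (1 TB = 1,000 GB), matching the example in the text.
    The default rate is the one quoted in this README, not a live price.
    """
    return bytes_processed / bytes_per_gb * usd_per_gb


# A query that scans about 1 TB, as in the 1,000 Genomes example.
print(f"${estimate_query_cost(10**12):.2f}")  # prints $5.00
```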

To add a dataset to your project:

Go to the BigQuery Browser Tool.
Click the drop-down icon beside your project name in the left navigator.
Pick ‘Switch to project’ in the menu, then ‘Display project...’ in the submenu.
Enter the project name (for example, google.com:biggene) in the ‘Add Project’ dialog.

Loading Variant Data into BigQuery

The Google Genomics API spec includes a not-yet-implemented import method that loads VCF files directly from Cloud Storage. Until an implementation of the method is available, you will need to transform your VCF data into JSON with a schema similar to what you see in these examples, and then load the JSON into BigQuery. See Preparing Data for BigQuery and BigQuery in Practice: Loading Data Sets That are Terabytes and Beyond for more detail.
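A minimal sketch of the VCF-to-JSON step might look like the following. It emits newline-delimited JSON, one variant per line, which is a format BigQuery can load. The output field names (`contig`, `position`, `reference_bases`, `call`, and so on) and the function names are assumptions made for illustration; they are not the schema used by the tables in these examples, so adjust them to match your target schema. Only the GT value of each sample is kept, and the INFO column is ignored.

```python
import json


def vcf_line_to_record(line, sample_names):
    """Convert one tab-separated VCF data line into a dict.

    The output field names are hypothetical, not the examples' schema.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt = fields[:7]
    # Column 9 (FORMAT) names the colon-separated values in each sample column.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    calls = []
    for name, sample in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, sample.split(":")))
        calls.append({"sample_id": name, "genotype": values.get("GT")})
    return {
        "contig": chrom,
        "position": int(pos),
        "id": None if variant_id == "." else variant_id,
        "reference_bases": ref,
        "alternate_bases": alt.split(","),
        "quality": None if qual == "." else float(qual),
        "filter": filt,
        "call": calls,
    }


def vcf_to_ndjson(lines):
    """Yield one JSON string per variant from an iterable of VCF lines."""
    sample_names = []
    for line in lines:
        if line.startswith("##"):
            continue  # skip meta-information lines
        if line.startswith("#CHROM"):
            # Sample names follow the nine fixed columns of the header line.
            sample_names = line.rstrip("\n").split("\t")[9:]
            continue
        if line.strip():
            yield json.dumps(vcf_line_to_record(line, sample_names))


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002",
        "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3\tGT:GQ\t0|0:48\t1|0:48",
    ]
    for row in vcf_to_ndjson(example):
        print(row)
```

Nesting the per-sample calls inside each variant record keeps one row per variant; a flat layout with one row per sample call is an equally valid choice, depending on the queries you plan to run.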

The mailing list

The Google Genomics Discuss mailing list is a good way to sync up with other people who use googlegenomics, including the core developers. You can subscribe by sending an email to google-genomics-discuss+subscribe@googlegroups.com, or just post using the web forum page.

Contributing changes

See CONTRIBUTING.

Licensing

See LICENSE.
