A scalable and efficient solution to storing and accessing genomic variants
First, load the extended manifest file for your array. This will become the probe_info table for your dataset. The WDL to upload the manifest file is in the GATK repo: https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore_wdl/ImportArrayManifest.wdl
The dataset will be created if it doesn't exist. The table must not already exist; otherwise duplicate data will be loaded. For manifest-schema, use the schema here: https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore_wdl/schemas/manifest_schema.json.
There are several steps to ingest the vcf array data into BigQuery.
To get around BigQuery query limits, extract the probe_info table to a CSV file in a Google Cloud Storage bucket. This file is used as input to the array ingest command.
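One way to do the export is the `bq extract` command. The sketch below only assembles the command line so it can be reviewed before running it. The project and dataset names are borrowed from the example further down, and the destination bucket and file name are hypothetical placeholders, not values from this repo.

```python
import shlex

def bq_extract_command(project, dataset, table, gcs_uri):
    """Build a `bq extract` command that exports a table to a CSV in GCS."""
    table_ref = f"{project}:{dataset}.{table}"
    return ["bq", "extract", "--destination_format=CSV", table_ref, gcs_uri]

# Hypothetical destination; replace with your own bucket and file name.
cmd = bq_extract_command(
    "spec-ops-aou",
    "aou_arrays_test",
    "probe_info",
    "gs://example-bucket/probe_info_export.csv",
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` once the values are confirmed.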
Assign a sequential integer id to each sample. If you want to process several samples, you can create a csv file where the first column is the integer id and the second column is the sample name. For example:
1,204126160095_R01C01
2,204126160095_R02C01
3,204126160095_R03C01
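A mapping file like the one above can be generated from a list of sample names. This is a minimal sketch; the `start_id` parameter is an assumption added here for continuing the numbering in a later batch, not something the ingest tool requires.

```python
import csv
import io

def write_sample_map(sample_names, out, start_id=1):
    """Write 'id,sample_name' rows with sequential integer ids."""
    writer = csv.writer(out, lineterminator="\n")
    for sample_id, name in enumerate(sample_names, start=start_id):
        writer.writerow([sample_id, name])

samples = [
    "204126160095_R01C01",
    "204126160095_R02C01",
    "204126160095_R03C01",
]
buf = io.StringIO()
write_sample_map(samples, buf)
print(buf.getvalue(), end="")
```

To write a real file, pass `open("sample_map.csv", "w", newline="")` instead of the in-memory buffer.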
Run the gatk ingest tool to convert the vcf file to 2 tsv files: one for the sample mapping and one for the array data.
./gatk CreateArrayIngestFiles -V <input-vcf> --probe-info-table <gs-location-of-probe-info-export> --ref-version 37
Copy the resulting tsv files to a Google Cloud Storage bucket for upload:
gsutil cp <output-of-gatk-command>*.tsv gs://broad-dsp-spec-ops/scratch/import/
Both of these steps (the gatk tool and the copy of the files) can be accomplished by running the CreateArrayImportTsvs.wdl script.
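For running the two steps by hand rather than through the WDL, the commands can be assembled as follows. This is a sketch only: the VCF name, the probe info export location, and the tsv file names are hypothetical, since the actual output names depend on the gatk tool.

```python
import shlex

def gatk_ingest_command(input_vcf, probe_info_uri, ref_version=37):
    """Build the CreateArrayIngestFiles command shown above."""
    return [
        "./gatk", "CreateArrayIngestFiles",
        "-V", input_vcf,
        "--probe-info-table", probe_info_uri,
        "--ref-version", str(ref_version),
    ]

def copy_command(tsv_paths, bucket_dir):
    """Build the gsutil command that copies the tsv files to the bucket."""
    return ["gsutil", "cp", *tsv_paths, bucket_dir]

# Hypothetical inputs and outputs, for illustration only.
ingest = gatk_ingest_command("sample.vcf", "gs://example-bucket/probe_info_export.csv")
copy = copy_command(
    ["sample_map.tsv", "raw_data.tsv"],
    "gs://broad-dsp-spec-ops/scratch/import/",
)
for argv in (ingest, copy):
    print(shlex.join(argv))
```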
Run the bq ingest script for array data. This script will import the sample and raw array data files for the table specified and then move the files to a "done" subdirectory.
./ingest/bq_ingest_arrays.sh <project-id> <dataset-name> <storage-location> <table-number>
For example:
./ingest/bq_ingest_arrays.sh spec-ops-aou aou_arrays_test gs://broad-dsp-spec-ops/scratch/import 2
WARNING
Do not add new files to this directory while the script is running, or they might be moved to the done directory without being processed. Do not load the same file more than once, or you will get duplicate entries in the database.
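A simple guard against reloading is to compare the file names waiting in the import directory with those already in the done subdirectory before running the ingest script. The sketch below assumes the two listings have already been obtained (for example with `gsutil ls`) and that a file is identified by its base name; both are assumptions, and the file names are hypothetical.

```python
import posixpath

def files_safe_to_load(pending_uris, done_uris):
    """Return pending files whose base name is not already in the done directory."""
    done_names = {posixpath.basename(uri) for uri in done_uris}
    return [uri for uri in pending_uris if posixpath.basename(uri) not in done_names]

# Hypothetical listings of the import directory and its done subdirectory.
pending = [
    "gs://broad-dsp-spec-ops/scratch/import/raw_001.tsv",
    "gs://broad-dsp-spec-ops/scratch/import/raw_002.tsv",
]
done = ["gs://broad-dsp-spec-ops/scratch/import/done/raw_001.tsv"]

safe = files_safe_to_load(pending, done)
print(safe)
```

Any file filtered out here was loaded before and should be removed from the import directory rather than ingested again.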
Here is a sample query you can use to create a cohort table for the samples you want to extract. (Soon there will be an option to pass this as a tsv).
CREATE OR REPLACE TABLE `spec-ops-aou.aou_preprod.cohort_20` AS
SELECT sample_id, sample_name FROM
(
SELECT sample_id, sample_name, RAND() as x
FROM `spec-ops-aou.aou_preprod.sample_list`
ORDER BY x
LIMIT 20
)
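To reuse the query for other tables or cohort sizes, it can be rendered from a small template. This is a sketch under the assumption that the table names come from a trusted source, because they are interpolated directly into the SQL text.

```python
def cohort_query(source_table, cohort_table, cohort_size):
    """Render the random-cohort query above for the given tables and size."""
    if not isinstance(cohort_size, int) or cohort_size <= 0:
        raise ValueError("cohort_size must be a positive integer")
    return (
        f"CREATE OR REPLACE TABLE `{cohort_table}` AS\n"
        "SELECT sample_id, sample_name FROM\n"
        "(\n"
        "  SELECT sample_id, sample_name, RAND() AS x\n"
        f"  FROM `{source_table}`\n"
        "  ORDER BY x\n"
        f"  LIMIT {cohort_size}\n"
        ")"
    )

query = cohort_query(
    "spec-ops-aou.aou_preprod.sample_list",
    "spec-ops-aou.aou_preprod.cohort_20",
    20,
)
print(query)
```

Because the samples are ordered by `RAND()`, each run selects a different random cohort and overwrites the cohort table.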
Once you have created a cohort table, you can run extract/raw_array_cohort_extract.wdl, using the extract/raw_array_cohort_extract.aou_demo_10.cloud.json file as an example of the inputs needed. This will create a temp table with the cohort data and create an output vcf for each shard in the export.
When running ./ingest/bq_ingest_arrays.sh, the gsutil mv command at the end uses the -m option to multi-thread the move. Locally, I get the error below (I think my Python environment has problems), but I have been able to run the command with this option from a web console. This error should not normally occur; if it does, we can always remove the -m option. When it occurs, it only means that not all of the ingested files were moved to the done directory. The data itself is typically loaded, provided the earlier bq load commands reported no errors.
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 811, in _callmethod
    conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ahaessly/google-cloud-sdk/platform/gsutil/gslib/command.py", line 2348, in run
    cls = copy.copy(class_map[caller_id])
  File "<string>", line 2, in __getitem__
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 815, in _callmethod
    self._connect()
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 802, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
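If the multi-threaded move fails this way, the leftover files can be moved without the -m option. The following is a rough sketch of that workaround, not what the ingest script does: it moves files one at a time and reports any that failed. The file names are hypothetical, and the `run` parameter exists only so the logic can be dry-run without calling gsutil.

```python
import subprocess

def move_serially(sources, done_uri, run=subprocess.run):
    """Move each file with plain `gsutil mv` (no -m); return the ones that failed."""
    failed = []
    for src in sources:
        result = run(["gsutil", "mv", src, done_uri])
        if result.returncode != 0:
            failed.append(src)
    return failed

# Dry run: record the commands instead of executing them.
calls = []

def fake_run(argv):
    calls.append(argv)
    return subprocess.CompletedProcess(argv, 0)

leftover = [
    "gs://broad-dsp-spec-ops/scratch/import/raw_001.tsv",
    "gs://broad-dsp-spec-ops/scratch/import/raw_002.tsv",
]
failed = move_serially(leftover, "gs://broad-dsp-spec-ops/scratch/import/done/", run=fake_run)
print(calls)
print(failed)
```

Dropping the `run=fake_run` argument makes the function call gsutil for real. Only move files whose bq load step completed without errors.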