fetch-dbgap-files

Code to fetch dbGaP files using sra-toolkit.

Primary language: Python. License: MIT.

Installation and setup

The Dockerfile can be used to build a Docker image for running the fetch.py script.

Alternatively, to run outside of the Docker image, you must install SRAToolkit. The code currently uses v3.0.10; other versions may work, but this is not guaranteed.

Preparation

Before running the script, you will need to use the dbGaP File Selector to select which files to download. From the My Requests section of the dbGaP authorized access webpage, locate the Data Access Request (DAR) for which you would like to download data. Then click on "Request Files" next to the DAR. On the new page, click on the "dbGaP File Selector" link.

Once in the dbGaP File Selector, select the files you would like to download. After you have made your selection, toggle "Selected" in the "Select" pane. You will need to download two files to use as input for the workflow:

  • "Cart file": the cart file listing the files to download, in sratoolkit kart format.
  • "Files Table": the manifest file listing the files that should be downloaded.

Local usage

The fetch.py Python script can be run locally to download dbGaP data.

Required inputs:

Argument    Description
--ngc       The path to the dbGaP project key for your dbGaP application
--cart      A cart file generated by the dbGaP File Selector
--manifest  A manifest file generated by the dbGaP File Selector
--outdir    The output directory where the data should be saved
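
As an illustration of how the required arguments fit together, the following minimal sketch assembles a fetch.py command line from Python. The helper names and the placeholder file names (prj_key.ngc, cart.krt, manifest.tsv) are assumptions made for this example and are not part of the repository; only the four flags come from the table above.

```python
import subprocess
import sys


def build_fetch_command(ngc, cart, manifest, outdir, script="fetch.py"):
    """Return the argument list for one fetch.py run with the required inputs."""
    return [
        sys.executable, script,
        "--ngc", ngc,            # dbGaP project key
        "--cart", cart,          # cart file from the dbGaP File Selector
        "--manifest", manifest,  # "Files Table" from the dbGaP File Selector
        "--outdir", outdir,      # where the downloaded data should end up
    ]


def run_fetch(ngc, cart, manifest, outdir):
    """Run fetch.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_fetch_command(ngc, cart, manifest, outdir), check=True)


# Example with placeholder (hypothetical) file names:
# run_fetch("prj_key.ngc", "cart.krt", "manifest.tsv", "downloads")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.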

Optional inputs:

Argument    Description
--prefetch  The path to the SRAToolkit prefetch binary
--untar     Flag that can be set if the script should untar any .tar or .tar.gz files into a directory with the same name as the archive (without extension). If set, the original .tar or .tar.gz archive will be deleted.
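
The behavior described for --untar can be pictured with the sketch below. It is not taken from fetch.py; it is one possible way to extract an archive into a sibling directory named after the archive and then delete the archive, using only the Python standard library. The function name untar_and_remove is hypothetical.

```python
import tarfile
from pathlib import Path


def untar_and_remove(archive):
    """Extract a .tar or .tar.gz archive into a same-named directory, then delete it.

    Returns the extraction directory, or None if the file is not a
    .tar or .tar.gz archive.
    """
    archive = Path(archive)
    name = archive.name
    if name.endswith(".tar.gz"):
        stem = name[: -len(".tar.gz")]
    elif name.endswith(".tar"):
        stem = name[: -len(".tar")]
    else:
        return None  # leave other files untouched

    target = archive.parent / stem
    target.mkdir(parents=True, exist_ok=True)
    # The default read mode detects gzip compression automatically.
    with tarfile.open(archive) as tar:
        tar.extractall(target)
    archive.unlink()  # the original archive is removed after extraction
    return target
```

For example, under these assumptions `untar_and_remove("downloads/phenotypes.tar.gz")` would leave the contents in `downloads/phenotypes/` and remove the archive itself.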

Because prefetch sometimes exits without error but without downloading all requested files, the script downloads the files and compares them against the manifest; if not all files were downloaded, it retries up to 3 times. Once all files are successfully downloaded, it copies them to the final requested outdir.
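
The download, check, and retry cycle described above could look roughly like the following sketch. This is an illustration under stated assumptions, not the code in fetch.py: the `download` callable stands in for whatever invokes prefetch, `expected_names` stands in for the file names read from the manifest, and the staging directory is a hypothetical intermediate location.

```python
import shutil
from pathlib import Path


def missing_files(expected_names, download_dir):
    """Return the expected file names that are not present under download_dir."""
    present = {p.name for p in Path(download_dir).rglob("*") if p.is_file()}
    return sorted(set(expected_names) - present)


def fetch_with_retries(download, expected_names, staging_dir, outdir, max_retries=3):
    """Call download() once, then retry up to max_retries times while files are missing.

    Files are copied to outdir only after every expected file is present.
    """
    for _ in range(1 + max_retries):
        download()  # hypothetical callable, e.g. a wrapper around prefetch
        missing = missing_files(expected_names, staging_dir)
        if not missing:
            break
    else:
        # The loop ended without a complete download.
        raise RuntimeError(f"still missing after {max_retries} retries: {missing}")

    shutil.copytree(staging_dir, outdir, dirs_exist_ok=True)
```

Checking the result against the manifest, rather than trusting the exit code, is what catches the case where prefetch reports success but leaves files out.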

Note that if the fetch.py script crashes for some reason, you will have to restart from the beginning.

Running the workflow

A WDL workflow is also provided to download the files. The WDL automatically untars the files and deletes the original archive (by passing the --untar argument to fetch.py under the hood). The inputs to the WDL are as follows:

Required inputs:

Argument          Description
ngc_file          The path to the dbGaP project key for your dbGaP application
cart_file         A cart file generated by the dbGaP File Selector
manifest_file     A manifest file generated by the dbGaP File Selector
output_directory  The output directory where the data should be saved

Optional inputs:

Argument  Description
disk_gb   The hard disk size of the instance to use for downloading and untarring. If downloading a large volume of files, you may need to increase this value. (Default: 50)
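
WDL runners such as Cromwell generally take inputs as a JSON object whose keys are prefixed with the workflow name. The sketch below builds such an object for the inputs listed above. The workflow name `fetch_dbgap_files` and the example paths are assumptions for illustration only; check the WDL itself for the actual workflow name.

```python
import json

# Hypothetical workflow name; replace it with the name declared in the WDL.
WORKFLOW_NAME = "fetch_dbgap_files"


def make_inputs(ngc_file, cart_file, manifest_file, output_directory, disk_gb=None):
    """Build a WDL inputs dictionary; disk_gb is included only when overridden."""
    inputs = {
        f"{WORKFLOW_NAME}.ngc_file": ngc_file,
        f"{WORKFLOW_NAME}.cart_file": cart_file,
        f"{WORKFLOW_NAME}.manifest_file": manifest_file,
        f"{WORKFLOW_NAME}.output_directory": output_directory,
    }
    if disk_gb is not None:
        inputs[f"{WORKFLOW_NAME}.disk_gb"] = disk_gb
    return inputs


# Example with placeholder paths, raising the disk size above the default of 50:
# print(json.dumps(make_inputs("prj_key.ngc", "cart.krt", "manifest.tsv",
#                              "output", disk_gb=200), indent=2))
```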

The workflow can be found on Dockstore.

Caveats

Note that the project key (--ngc or ngc_file) is sensitive; do not share it with people who are not covered by your dbGaP application as it will allow them to download data. We recommend that you do not put the project key file in a Terra/AnVIL workspace that you are planning to share with other people. Instead, store it in a more protected workspace that is only shared with people covered by the dbGaP application.