A file download tool for AnVIL/TDR data identified by Data Repository Service URIs (DRS URIs).
Operating System | DRS Downloader | Checksum |
---|---|---|
macOS | drs_downloader.pkg | checksums.txt |
Linux | drs_downloader | checksums.txt |
Windows | drs_downloader.exe | checksums.txt |
Download the latest drs_downloader
zip file for your operating system. Unzipping the downloaded file will provide a drs_downloader
executable file that can be run directly.
Supported OS Versions
Operating System | Supported Versions |
---|---|
macOS | 12 (Monterey), 13 (Ventura) |
Linux | Ubuntu 22.04 (Jammy Jellyfish) |
Windows | Windows 11 |
Notes:
- Testing was done on hardware running macOS Monterey and Ventura (Apple Silicon M1 chips), with Windows and Linux emulation through UTM.
- Due to hardware limitations with the ARM M1 chips, Windows 10 was not included in the list of tested operated systems as Microsoft does not currently provide a public Windows 10 ARM build.
- Ubuntu 20.04 (Focal Fossa) uses version 2.31 of the GNU C Library which appears to be incompatible with Python 3.10 requirement of version 2.35.
In order to verify that the downloaded file can be trusted checksums are provided in checksums.txt
. See below for examples of how to use this file.
Successful Verification
To verify the integrity of the binaries on macOS run the following command in the same directory as the downloaded file:
$ shasum -c checksums.txt --ignore-missing
drs_downloader.pkg: OK
If the shasum
command outputs OK
than the verification was successful and the executable can be trusted.
Unsuccessful Verification
Alternatively if the commad outputs FAILED
than the checksum did not match and the binary should not be run.
$ shasum -c checksums.txt --ignore-missing
drs_downloader.pkg: FAILED
shasum: WARNING: 1 computed checksum did NOT match
shasum: checksums.txt: no file was verified
In such a case please reach out to the contributors for assistance.
For Linux to run the exe you will have to grant the file higher permissions. you can do this by running:
chmod +x [filename]
For Mac, the binary is installed in /Applications
by default. To run drs_downloader
rather than /Applications/drs_downloader
every time, move the binary to an existing directory in the PATH
variable, eg:
sudo mv /Applications/drs_downloader /usr/local/bin/
Alternatively, you can add your current directory to the PATH variable so the binary is globally executable.
The downloader requires that a Google Cloud project be designated as the billing project. In order for the downloader to authenticate and set the desired billing project the gcloud CLI tool must first be installed:
- gcloud CLI — used to authenticate the downloader and set the billing project.
- Python (>= 3.10) — required for gcloud CLI functionality.
Upon running the following gcloud
command a browser window will open in which you may choose the Google account used for the billing project:
$ gcloud auth application-default login
Your browser has been opened to visit:
https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=...
You are now logged in as [rosalind@ohsu.edu].
Your current project is [terra-314159]. You can change this setting by running:
$ gcloud config set project PROJECT_ID
To change the billing project at any time you may use either the $ gcloud config set project PROJECT_ID
command or the built-in drs-downloader
command:
$ drs_downloader terra --project-id Project ID>
A manifest is a tsv file where at least one column contains a set of drs IDs, such as this minimal manifest file. These manifests can either be created by hand or downloaded from the AnVIL Data Explorer or a Terra workspace data page.
More on manifests according to DRS can be found here.
$ drs_downloader terra -m <manifest file> -d <destination directory>
-s, --silent
Disables all output to the terminal during and after downloading.
-d, --destination_dir TEXT
The directory or folder to download the DRS Objects to. Defaults to
/tmp/testing
if no value is provided.
-m, --manifest_path TEXT
The manifest file that contains the DRS Objects to be downloaded. Typically a TSV file with one row per DRS Object.
--drs-column-name TEXT
The value of the column in the manifest file containing the DRS Object IDs. Defaults to
pfb:ga4gh_drs_uri
if no value is provided.
--duplicate
downloads files and saves them into the specified directory even if there is already files with the same name already in the directory. Numbered naming is used to specify the order of duplicates downloaded to the directory. For example: 1st -> original_file 2nd -> original_file(1) 3rd-> original_file(2) ...
The below command is a basic example of how to structure a download command with all of the required arguments. It uses:
- a manifest file called
terra-data.tsv
with 10 DRS Objects. - a DRS column ID value of
pfb:ga4gh_drs_uri
within the manifest file to reference the DRS objects. It can be omitted since this is the default value used by the downloader. - a download directory called
DATA
as the destination
$ drs_downloader terra -m tests/fixtures/manifests/terra-data.tsv -d DATA
100%|████████████████████████████████| 10/10 [00:00<00:00, 56148.65it/s]
2022-11-21 16:56:49,595 ('HG03873.final.cram.crai', 'OK', 1351946, 1)
2022-11-21 16:56:49,595 ('HG04209.final.cram.crai', 'OK', 1338980, 1)
2022-11-21 16:56:49,595 ('HG02142.final.cram.crai', 'OK', 1405543, 1)
2022-11-21 16:56:49,595 ('HG01552.final.cram.crai', 'OK', 1296198, 1)
2022-11-21 16:56:49,595 ('NA18613.final.cram.crai', 'OK', 1370106, 1)
2022-11-21 16:56:49,595 ('HG00536.final.cram.crai', 'OK', 1244278, 1)
2022-11-21 16:56:49,595 ('HG02450.final.cram.crai', 'OK', 1405458, 1)
2022-11-21 16:56:49,595 ('NA20525.final.cram.crai', 'OK', 1337382, 1)
2022-11-21 16:56:49,595 ('NA20356.final.cram.crai', 'OK', 1368064, 1)
2022-11-21 16:56:49,595 ('HG00622.final.cram.crai', 'OK', 1254920, 1)
2022-11-21 16:56:49,595 ('done', 'statistics.max_files_open', 37)
After the download completes we can look in the DATA
directory to confirm that all 10 DRS Objects have been downloaded:
$ ls ./DATA
HG00536.final.cram.crai HG01552.final.cram.crai
HG02450.final.cram.crai HG04209.final.cram.crai
NA20356.final.cram.crai HG00622.final.cram.crai
HG02142.final.cram.crai HG03873.final.cram.crai
NA18613.final.cram.crai NA20525.final.cram.crai
Let's take a look at different manifest file called terra-different-header.tsv
. Namely the DRS header value is now drs_uri
so we will need to tell the downloader which column to find the DRS URI's in the manifest with the --drs-column-name
flag:
drs_downloader terra -m tests/fixtures/manifests/terra-different-header.tsv -d DATA --drs-column-name drs_uri
This will download the DRS Objects specified in the drs_uri
column into the DATA
directory just as before.
To see all available flags run the help
command:
$ drs_downloader terra --help
Usage: drs_download terra [OPTIONS]
Copy files from terra.bio
Options:
-s, --silent Display nothing.
-d, --destination_dir TEXT Destination directory. [default: /tmp/testing]
-m, --manifest_path TEXT Path to manifest tsv.
--duplicate allow duplicate downloads with same file name
--drs-column-name TEXT The column header in the TSV file associated
with the DRS URIs.Example: pfb:ga4gh_drs_uri
--help Show this message and exit.
This project is developed in partnership between The AnVIL Project, the Broad Institute, and the Ellrott Lab at Oregon Health & Science University. Development is lead by Brian Walsh with contributions from Matthew Peterkort and Liam Beckman. Special thanks to Michael Baumann at the Broad Institute for guidance and development recommendations.