EGA download client: pyEGA3
Overview
The pyEGA3 download client is a Python-based tool for viewing and downloading files from authorized EGA datasets. pyEGA3 uses the EGA Data API and has several key features:
- Files are transferred over secure HTTPS connections and received unencrypted, so no decryption is needed after download.
- Downloads resume from where they left off in the event that the connection is interrupted.
- pyEGA3 supports file segmenting and parallelized download of segments, improving overall performance.
- After download completes, file integrity is verified using checksums.
- pyEGA3 implements the GA4GH-compliant htsget protocol for download of genomic ranges for data files with accompanying index files.
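The integrity check mentioned in the list above can be reproduced by hand. The following is a minimal sketch, assuming the checksum in question is an MD5 of the unencrypted file (as listed by the files subcommand); the function names are hypothetical and this is not pyEGA3's own code.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file)
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def file_matches_checksum(path, expected_md5):
    """Compare a file's MD5 with an expected value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in blocks keeps memory use constant, which matters for files of many gigabytes.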
Tutorial video
A video tutorial demonstrating the usage of pyEGA3 from installation through file download is available here.
Requirements
- Python3 (download instructions)
Firewall ports
pyEGA3 makes HTTPS calls to the EGA AAI (https://ega.ebi.ac.uk:8443) and the EGA Data API (https://ega.ebi.ac.uk:8052). Ports 8443 and 8052 must both be reachable from the location where pyEGA3 is executed to avoid timeouts.
For Linux/Mac users, check if ports 8443 and 8052 are open by running the following commands:
openssl s_client -connect ega.ebi.ac.uk:8443
openssl s_client -connect ega.ebi.ac.uk:8052
If the ports are open, the commands should print CONNECTED to the terminal.
For Windows users, check if ports 8443 and 8052 are open by going to the following URLs:
- https://ega.ebi.ac.uk:8443/ega-openid-connect-server/
- https://ega.ebi.ac.uk:8052/elixir/central/stats/load
If the ports are open, both of the sites should load with no timeouts.
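Where neither openssl nor a browser is convenient, the same reachability check can be scripted. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and it only tests that a TCP connection can be opened, not that the TLS handshake or the service behind it works.

```python
import socket

EGA_HOST = "ega.ebi.ac.uk"
EGA_PORTS = (8443, 8052)  # EGA AAI and EGA Data API ports named above


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures
        return False
```

For example, `[port_is_reachable(EGA_HOST, port) for port in EGA_PORTS]` should contain only True values when both ports are open from the machine running pyEGA3.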
Installation and update
Using Pip3
- Install pyEGA3 using pip3.
  sudo pip3 install pyega3
- Update pyEGA3, if needed, using pip3.
  pip3 install pyega3 --upgrade
- Test your pip3 installation by running pyEGA3.
  pyega3 --help
Using conda (bioconda channel)
- Install pyEGA3 using conda.
  conda config --add channels bioconda
  conda config --add channels conda-forge
  conda install pyega3
- Update pyEGA3, if needed, using conda.
  conda update pyega3
- Test your conda installation by running pyEGA3.
  pyega3 --help
Using GitHub
- Clone the ega-download-client GitHub repository.
- Navigate to the directory where the repository was cloned.
  cd path/to/ega-download-client
- Three scripts are provided to install the required Python environment depending on the host operating system.
  - Linux (Red Hat): red_hat_dependency_install.sh
  - Linux (Debian): debian_dependency_install.sh
  - macOS: osx_dependency_install.sh
- Execute the script corresponding to the host operating system. For example, if using Red Hat Linux, run:
  sh red_hat_dependency_install.sh
- Test your GitHub installation by running pyEGA3.
  pyega3/pyega3.py --help
Usage - File download
usage: pyega3 [-h] [-d] [-cf CONFIG_FILE] [-sf SERVER_FILE] [-c CONNECTIONS]
              [-t]
              {datasets,files,fetch} ...

Download from EMBL EBI's EGA (European Genome-phenome Archive)

positional arguments:
  {datasets,files,fetch}
                        subcommands
    datasets            List authorized datasets
    files               List files in a specified dataset
    fetch               Fetch a dataset or file

optional arguments:
  -h, --help            show this help message and exit
  -d, --debug           Extra debugging messages
  -cf CONFIG_FILE, --config-file CONFIG_FILE
                        JSON file containing credentials/config
                        e.g.{"username":"user1","password":"toor"}
  -sf SERVER_FILE, --server-file SERVER_FILE
                        JSON file containing server config
                        e.g.{"url_auth":"aai url","url_api":"api url",
                        "url_api_ticket":"htsget url", "client_secret":"client
                        secret"}
  -c CONNECTIONS, --connections CONNECTIONS
                        Download using specified number of connections
  -t, --test            Test user activated
Testing pyEGA3 installation
We recommend that all fresh installations of pyEGA3 be tested. A test account has been created which can be used (-t) to test the following pyEGA3 actions:
List the datasets available to the test account
pyega3 -d -t datasets
List the files available in a test dataset
pyega3 -d -t files EGAD00001003338
Download a test file
pyega3 -d -t fetch EGAF00001775036
The test dataset (EGAD00001003338) is large (almost 1 TB), so please be mindful if you decide to test downloading the entire dataset. The test account does not require an EGA username and password because it contains publicly accessible files from the 1000 Genomes Project. The files in the test dataset can be used for troubleshooting and training purposes.
Defining credentials
To view and download files for which you have been granted access, pyEGA3 requires your EGA username (email address) and password saved to a credentials file.
Create a file called CREDENTIALS_FILE and place it in the directory where pyEGA3 will run. The credentials file must be in JSON format and must contain your registered EGA username (email address) and password provided by EGA Helpdesk.
An example CREDENTIALS_FILE is available here.
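As an illustration of the JSON format described above, the sketch below writes and re-reads a credentials file. The "username" and "password" keys follow the example shown in the pyega3 help text; the function names and the choice to restrict file permissions are assumptions made for this sketch, not pyEGA3 requirements.

```python
import json
import os


def write_credentials_file(path, username, password):
    """Write EGA credentials as JSON to the given path."""
    credentials = {"username": username, "password": password}
    # Request owner-only permissions (0600) at creation; effective on POSIX systems
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(credentials, handle, indent=2)


def load_credentials_file(path):
    """Load a credentials file and check that both expected keys are present."""
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    missing = {"username", "password"} - credentials.keys()
    if missing:
        raise ValueError(f"credentials file is missing keys: {sorted(missing)}")
    return credentials
```

Running `load_credentials_file` before starting a long download is a quick way to catch a malformed file, which is one possible cause of credential validation failures.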
Using pyEGA3 for file download
Replace <these values> with values relevant for your datasets.
Display authorized datasets
pyega3 -cf </Path/To/CREDENTIALS_FILE> datasets
Display files in a dataset
pyega3 -cf </Path/To/CREDENTIALS_FILE> files EGAD<NUM>
Download a dataset
pyega3 -cf </Path/To/CREDENTIALS_FILE> fetch EGAD<NUM> --saveto </Path/To/Output>
Download a single file
pyega3 -cf </Path/To/CREDENTIALS_FILE> fetch EGAF<NUM> --saveto </Path/To/Output>
List unencrypted md5 checksums for all files in a dataset
pyega3 -cf </Path/To/CREDENTIALS_FILE> files EGAD<NUM>
Save unencrypted md5 checksums to a file
nohup pyega3 -cf </Path/To/CREDENTIALS_FILE> files EGAD<NUM> > </Path/To/File/md5sums.txt>
Download a file or dataset using 5 connections
pyega3 -c 5 -cf </Path/To/CREDENTIALS_FILE> fetch EGAD<NUM> --saveto </Path/To/Output>
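When pyEGA3 is driven from a script, the commands above can be assembled programmatically. The helper below is a hypothetical sketch (it is not part of pyEGA3); it only builds the argument list, placing the global options (-c, -cf) before the subcommand as in the usage text.

```python
SUBCOMMANDS = ("datasets", "files", "fetch")


def build_pyega3_command(credentials_file, subcommand, identifier=None,
                         saveto=None, connections=None):
    """Build a pyega3 argument list; global options go before the subcommand."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand}")
    command = ["pyega3"]
    if connections is not None:
        command += ["-c", str(connections)]
    command += ["-cf", credentials_file, subcommand]
    if identifier is not None:
        command.append(identifier)
    if saveto is not None:
        command += ["--saveto", saveto]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with paths that contain spaces.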
Usage - Genomic range requests via htsget
usage: pyega3 fetch [-h] [--reference-name REFERENCE_NAME]
                    [--reference-md5 REFERENCE_MD5] [--start START]
                    [--end END] [--format {BAM,CRAM}]
                    [--max-retries MAX_RETRIES] [--retry-wait RETRY_WAIT]
                    [--saveto [SAVETO]]
                    identifier

positional arguments:
  identifier            Id for dataset (e.g. EGAD00000000001) or file (e.g.
                        EGAF12345678901)

optional arguments:
  -h, --help            show this help message and exit
  --reference-name REFERENCE_NAME, -r REFERENCE_NAME
                        The reference sequence name, for example 'chr1', '1',
                        or 'chrX'. If unspecified, all data is returned.
  --reference-md5 REFERENCE_MD5, -m REFERENCE_MD5
                        The MD5 checksum uniquely representing the requested
                        reference sequence as a lower-case hexadecimal string,
                        calculated as the MD5 of the upper-case sequence
                        excluding all whitespace characters.
  --start START, -s START
                        The start position of the range on the reference,
                        0-based, inclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --end END, -e END     The end position of the range on the reference,
                        0-based exclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --format {BAM,CRAM}, -f {BAM,CRAM}
                        The format of data to request.
  --max-retries MAX_RETRIES, -M MAX_RETRIES
                        The maximum number of times to retry a failed
                        transfer. Any negative number means infinite number of
                        retries.
  --retry-wait RETRY_WAIT, -W RETRY_WAIT
                        The number of seconds to wait before retrying a failed
                        transfer.
  --saveto [SAVETO]     Output file(for files)/output dir(for datasets)
  --delete-temp-files   Do not keep those temporary, partial files which were
                        left on the disk after a failed transfer.
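The semantics of --max-retries and --retry-wait described in the help text can be illustrated with a generic retry loop. This is a sketch of the described behaviour, not pyEGA3's implementation; the function name, the default values, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time


def run_with_retries(transfer, max_retries=5, retry_wait=5.0, sleep=time.sleep):
    """Call transfer() until it succeeds or the retry budget is used up.

    A negative max_retries means retry without limit.
    """
    retries_done = 0
    while True:
        try:
            return transfer()
        except Exception:
            # Give up once the allowed number of retries has been spent
            if 0 <= max_retries <= retries_done:
                raise
            retries_done += 1
            sleep(retry_wait)
```

With max_retries=2, a transfer is attempted at most three times: the initial attempt plus two retries.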
Using pyEGA3 for fetching a genomic range
Replace <these values> with values relevant for your datasets. Please note that htsget can only be used with files that have corresponding index files in EGA.
Download chromosome 1 for a BAM file
pyega3 -cf </Path/To/CREDENTIALS_FILE> fetch --reference-name 1 --format BAM --saveto </Path/To/Output> EGAF<NUM>
Download position 0-1000000 on chromosome 1 for a BAM file
pyega3 -cf </Path/To/CREDENTIALS_FILE> fetch --start 0 --end 1000000 --reference-name 1 --format BAM --saveto </Path/To/Output> EGAF<NUM>
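Because --start is 0-based inclusive and --end is 0-based exclusive, coordinates taken from 1-based, inclusive sources (such as positions in SAM or VCF records) need converting before use. The helper below is a hypothetical sketch of that conversion.

```python
def to_htsget_range(first_base, last_base):
    """Convert a 1-based, inclusive range to (start, end) for --start/--end.

    The result is 0-based with an inclusive start and an exclusive end.
    """
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    return first_base - 1, last_base
```

For example, the first million bases of a chromosome (1-based positions 1 to 1000000) map to --start 0 --end 1000000, which is the range used in the command above.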
Troubleshooting
First, please ensure you are using the most recent version of pyEGA3 by following instructions in the "Installation and update" section for updating pyEGA3.
Failure to validate credentials
Please ensure that your credentials are formatted correctly. Email addresses (usernames) are case-sensitive. If you have an EGA submission account, these credentials are different from your data access credentials. Please ensure you are using your data access credentials with pyEGA3.
Slow download speeds
Download speed can be optimized using the --connections parameter, which parallelizes download at the file level. If the --connections parameter is provided, all files larger than 100 MB will be downloaded using the specified number of parallel connections.
Using a very high number of connections will introduce overhead that can slow the download of the file. It is important to note that files are still downloaded sequentially, so using multiple connections does not mean downloading multiple files in parallel. We recommend trying with 30 connections initially and adjusting from there to get maximum throughput.
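To make the idea of file-level parallelism concrete, the sketch below splits one file into contiguous byte ranges, one per connection. It illustrates the general technique only; pyEGA3's actual segment sizes and scheduling may differ, and the function name is hypothetical.

```python
def split_into_segments(file_size, connections):
    """Split [0, file_size) into at most `connections` contiguous byte ranges.

    Each range is (start, end) with an inclusive start and an exclusive end.
    """
    if file_size < 0 or connections < 1:
        raise ValueError("file_size must be >= 0 and connections >= 1")
    if file_size == 0:
        return []
    segment_size = -(-file_size // connections)  # ceiling division
    return [(start, min(start + segment_size, file_size))
            for start in range(0, file_size, segment_size)]
```

Each range is fetched over its own connection, so more connections means smaller segments and more per-request overhead, which is why very high values can reduce throughput.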
File taking a long time to save
Please note that when a file is being saved, it goes through two processes. First, the downloaded file "chunks" are pieced back together to reconstruct the original file. Second, pyEGA3 calculates the checksum of the file to confirm the file downloaded successfully. Larger files will take more time to reconstruct and validate the checksum.
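The two passes described above (reassembly, then checksum) can be sketched as follows. This is an illustration under the assumption that chunks are stored as separate files and that the checksum is MD5; the function name is hypothetical and pyEGA3's internals may differ.

```python
import hashlib
import shutil


def merge_chunks(chunk_paths, output_path, block_size=1024 * 1024):
    """Concatenate chunk files in order, then return the MD5 of the result."""
    # Pass 1: piece the chunks back together in their original order
    with open(output_path, "wb") as output:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as chunk:
                shutil.copyfileobj(chunk, output)
    # Pass 2: read the reconstructed file again to compute its checksum
    digest = hashlib.md5()
    with open(output_path, "rb") as merged:
        for block in iter(lambda: merged.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Both passes read every byte of the file, so the time taken grows with file size and depends on disk speed rather than on the network.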
Further assistance
If, after troubleshooting an issue, you are still experiencing difficulties, please email EGA Helpdesk (helpdesk@ega-archive.org) with the following information:
- Attach the log file (pyega3_output.log) located in the directory where pyEGA3 is running.
- Indicate the compute environment you are running pyEGA3 in: compute cluster, single machine, other (please describe).
Attribution
Parts of pyEGA3 are derived from pyEGA developed by James Blachly.