EGA-archive/ega-download-client

Speed optimization

Closed this issue · 7 comments

Experiencing slow speeds with dataset EGAD00001004493. Similar issue to #27.

Description of the bug

I have a script that downloads the files associated with one patient from this dataset, analyzes them, then deletes them and moves on to the next patient. However, no matter how many --connections I tell it to use (anywhere from 5 to 50, for example), I never see speeds greater than 2MB/s (and frequently much slower). I regularly download files from SRA on this same machine at much higher speeds without issue. Is there something else I can try to increase the speed? Or is there something wrong with this specific dataset?
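For reference, each download in the script is a single pyega3 call along these lines (the accession and paths here are placeholders):

pyega3 -c 10 -cf credential_file.json fetch EGAF########### --saveto sample1.fastq.gz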

Used versions (please complete the following information)

  • Operating System version: CentOS7
  • Python version: 3.6.8
  • PyEGA3 version: 3.4.1
  • Please, confirm you have tested your PyEGA3 installation (follow instructions): Yes.

My issue does appear in the Troubleshooting section, but the --connections argument doesn't seem to provide any benefit.

To Reproduce

Try downloading an RNA-seq file from the above-mentioned dataset.

I was able to mostly "resolve" this by writing a script that simply repeats the download until it completes successfully. The downloads still fail very frequently (each file typically needs 2 or 3 retries) and the speeds are generally slower than I would expect (around 2MB/s), but it works eventually.
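The core of the workaround is an outer loop that re-runs pyega3 until the expected output file exists; a minimal sketch of the idea (placeholder accession and file name; the full script is shared further down):

while [ ! -f "sample1.fastq.gz" ]; do
  pyega3 -c 10 -cf credential_file.json fetch EGAF########### --saveto sample1.fastq.gz
done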

An example of the error I see most often:

[2021-09-22 12:25:03 -0700] retry attempt 1
[2021-09-22 12:25:03 -0700] Download starting [using 10 connection(s)]...
 70%|######9   | 328M/468M [02:43<01:09, 2.01MB/s]
[2021-09-22 12:27:46 -0700] 500 Server Error:  for url: https://ega.ebi.ac.uk:8052/elixir/data/files/EGAF00002258661?destinationFormat=plain
Traceback (most recent call last):
  File "/home/chughes/virtualPython368/lib64/python3.6/site-packages/pyega3/pyega3.py", line 494, in download_file_retry
    download_file(token, file_id, file_size, check_sum, num_connections, key, output_file)
  File "/home/chughes/virtualPython368/lib64/python3.6/site-packages/pyega3/pyega3.py", line 422, in download_file
    for part_file_name in executor.map(download_file_slice_, params):
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/chughes/virtualPython368/lib64/python3.6/site-packages/pyega3/pyega3.py", line 296, in download_file_slice_
    return download_file_slice(*args)
  File "/home/chughes/virtualPython368/lib64/python3.6/site-packages/pyega3/pyega3.py", line 282, in download_file_slice
    r.raise_for_status()
  File "/home/chughes/virtualPython368/lib64/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error:  for url: https://ega.ebi.ac.uk:8052/elixir/data/files/EGAF00002258661?destinationFormat=plain
[2021-09-22 12:28:46 -0700] retry attempt 2

pyega3 will retry this 5 times and then give up, but by restarting it with the script, the download eventually completes.

@chrishuges Could you please share the script? I'm wondering how you check the completion status...
We are having trouble downloading a large dataset (15TB), and I think that with the current rate of errors and failures it will take us about a year to download with manual monitoring and restarts.

@danarte Happy to. Keep in mind that this script is built around my own system and my own dataset. For my work, I download one patient's worth of files, process them, then delete the fastq files and any bam files I don't need before moving on to the next patient. I do this to save space. The code below is just for downloading the files, one patient at a time. If you want the processing code, let me know; I'm happy to share it.

Please go easy on my basic code! It simply checks whether the file has been created, and if it hasn't, it tries again. The raw files are stored as .slice files during downloading, so if there is no fastq.gz file, the download didn't complete correctly.

#! /bin/bash

# this script will process RNAseq data from the ICGC cohort in chunks per patient.
rawDataOutputDirectory="/path/to/where/you/want/to/store/and/process/your/data/"

#########################
# check for the data processing directory and move there
if [ ! -d "$rawDataOutputDirectory" ]; then
  printf "Data directory %s does not exist, creating it.\n" "$rawDataOutputDirectory"
  mkdir -p "$rawDataOutputDirectory"
else
  printf "Data directory %s exists, moving on.\n" "$rawDataOutputDirectory"
fi
cd "$rawDataOutputDirectory" || exit 1

##########################
# process the data files
for j in T{1..57} # these are my patient IDs, they each have a 'T#' identifier. Yours may not.
do
  ############ in the two lines below, you need to change the EGAD# to whatever it is for your own dataset
  # Sample_File.map is tab-delimited; column 3 holds the file name (prefixed with
  # the patient ID) and column 4 holds the EGAF file accession.
  egaFileId=($(awk 'BEGIN {FS="\t"; OFS="\t"} {print $3, $4}' EGAD##########/delimited_maps/Sample_File.map | grep "^${j}_" | awk 'BEGIN {FS="\t"} {print $2}'))
  rnaFileId=($(awk 'BEGIN {FS="\t"; OFS="\t"} {print $3, $4}' EGAD##########/delimited_maps/Sample_File.map | grep "^${j}_" | awk 'BEGIN {FS="\t"} {print $1}'))

  for i in $(seq 0 $(( ${#egaFileId[@]} - 1 )))
  do
    # ${rnaFileId[$i]::-4} drops the last four characters of the mapped file name
    # (e.g. a '.cip' extension) to get the expected decrypted output name.
    if [ -f "${rawDataOutputDirectory}${rnaFileId[$i]::-4}" ]; then
      printf "Raw file for %s already exists, skipping file.\n\n" "${rnaFileId[$i]::-4}"
    else
      # retry until the decrypted output file actually exists
      while [ ! -f "${rawDataOutputDirectory}${rnaFileId[$i]::-4}" ]
      do
        printf "File %s doesn't exist, attempting download.\n" "${rnaFileId[$i]::-4}"
        pyega3 -c 10 -cf "${rawDataOutputDirectory}credential_file.json" fetch "${egaFileId[$i]}" --saveto "${rawDataOutputDirectory}${rnaFileId[$i]::-4}"
      done
    fi
  done
done
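For context, the awk/grep pipeline above assumes Sample_File.map is tab-delimited, with the file name (prefixed by the patient ID) in column 3 and the EGAF file accession in column 4. A hypothetical excerpt (the first two columns are unused here):

...    ...    T1_tumor_rna.fastq.gz.cip    EGAF00001234567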

Hi @chrishuges - Thank you for reporting this issue, and apologies that you are experiencing suboptimal data transfer speeds. Faster speeds are theoretically possible, but the speed you see in practice depends on a number of factors, including your local network and infrastructure. Please see this note about finding an optimal number of connections, as using more connections doesn't necessarily equate to faster speeds! If you consistently see a maximum of 2MB/s, you could check with your local infrastructure maintainer (which might be you) to see if any bottleneck is throttling the combined speed to 2MB/s.
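One quick way to probe for an optimal value is to time a single test file at a few different connection counts, along these lines (a sketch with a placeholder credential path, using the file accession from the log above):

for c in 4 8 16 30; do
  echo "Testing with ${c} connection(s)..."
  time pyega3 -c "${c}" -cf credential_file.json fetch EGAF00002258661 --saveto "test_c${c}.fastq.gz"
done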

I will also note that, although there are already various retries built into pyega3 and the underlying API, we are currently integrating even more, so that in the future you won't have to retry manually on your side. I'm glad you have found a workaround to support your aims. We are in the final stages of testing a new version of our API, which should provide a more stable data download service, including more retries. Thank you for using pyega3 and for providing this valuable feedback!

Hi there! We have made some improvements to pyega3 that address this issue and have merged these changes into master. If you install pyega3 from GitHub, you should be able to take advantage of these updates. We have not yet done a full version release, but when we do, these features will also be available via pip3/conda install. Thank you for your patience while we improved pyega3!
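For example, installing straight from the master branch should work with something like:

pip3 install git+https://github.com/EGA-archive/ega-download-client.git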

Thank you for the update! The new tool works much better and I am no longer having issues with it.

Hi @chrishuges - Thank you so much for the feedback, and glad to hear the tool is working better for you!