EGA-archive/ega-download-client

Downloading slice of BAM recognized as complete but BAM download is incomplete

When downloading a slice of a BAM file, PyEGA3 downloads only a few KB of data and then marks the download as complete.

Used versions

  • Operating System version: Ubuntu 22.04 LTS (5.10.102.1-microsoft-standard-WSL2) on Windows 10 Enterprise 22H2 (19045.2364)
  • Python version: 3.11.0
  • PyEGA3 version: PyEGA3-5.0.1

To Reproduce

  1. Try to download a slice (reference 19) of BAM file EGAF00005572695:
     pyega3 -c 10 -cf ../ega_credentials.json fetch --max-retries -1 --format BAM --output-dir . -r 19 EGAF00005572695
  2. After a few minutes the download stops and is marked as complete (see pyega3_output.log).

Additional context

I have already contacted the EGA helpdesk about this problem, but they could not resolve it. I also checked whether port 8443 is open on my side, and it is. Looking forward to your reply.

Hi @JasperO98,

Were you referring to EGAF00005572698 or EGAF00005572695? The log file seems to be truncated as well. Do you still have the full logs?

I looked into this, and the log says the client successfully downloaded a 221 KB BAM file for accession EGAF00005572695. That file should contain all the reads matching the reference name 19, which was specified with the -r argument in the command. Do you still have the incomplete file you downloaded? Would you be able to send it to our helpdesk (helpdesk@ega-archive.org) for us to check?

Many thanks
Alegria

Hey @aaclan-ebi,
Sorry, I was referring to EGAF00005572695, and I have updated the log file in the initial issue message.
Yes, the download is reported as complete, but only 221 KB of data was downloaded, whereas I am expecting a file of 4 GB.
I have already sent the incomplete file to the helpdesk, but they were not able to help me.

I see. Thanks, @JasperO98, we will investigate this further.

@JasperO98 Could you please confirm whether the issue still exists? We have pushed updates to both the API and the client since this issue was created.

Yes, the problem persists with PyEGA3 5.0.2.
pyega3_output.log

@JasperO98 In the log you attached I do not see any sign that the client is actually downloading the file. Could you please check whether a directory named EGAF00005572695 already exists in the output directory? The client checks whether a subdirectory named after the file ID is already present at the download location and contains files; if so, it does not attempt to download the file. If the directory exists, please delete it and then retry downloading the slice.
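
For anyone scripting this cleanup step, here is a minimal sketch of removing the leftover per-file directory before retrying. The helper name `remove_previous_download` is hypothetical and is not part of pyega3; it only assumes the layout described above, i.e. a subdirectory named after the file ID inside the output directory.

```python
import shutil
from pathlib import Path


def remove_previous_download(output_dir: str, file_id: str) -> bool:
    """Delete <output_dir>/<file_id> if it exists; return True if it was removed.

    Hypothetical helper, not part of pyega3.
    """
    # Refuse anything that does not look like an EGA file accession, so an
    # empty or malformed ID can never resolve to the output directory itself.
    if not file_id.startswith("EGAF"):
        raise ValueError(f"unexpected file ID: {file_id!r}")
    target = Path(output_dir) / file_id
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `remove_previous_download(".", "EGAF00005572695")` before rerunning the fetch command would clear the directory from the earlier attempt, if one is present.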

I tried that just now and it results in the same issue.

Could you also try the following command (delete the previously downloaded data first)?

pyega3 -c 10 -cf ../ega_credentials.json fetch --max-retries -1 --format BAM --output-dir . -r chr19 EGAF00005572695

(Submitters might use different naming conventions when uploading files)
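
To illustrate the naming point, the sketch below lists the spellings one might try for the -r argument. The function `candidate_reference_names` is a hypothetical helper, and it covers only the common "19" versus "chr19" and "MT" versus "chrM" conventions; the names a given file really uses are defined in its header, so this is a guess list and not a lookup.

```python
def candidate_reference_names(name: str) -> list[str]:
    """Return plausible spellings of a reference sequence name.

    Hypothetical helper: the caller's own spelling always comes first.
    """
    # Strip a leading "chr" prefix (case-insensitive) to get the bare name.
    bare = name[3:] if name.lower().startswith("chr") else name
    if bare.upper() in ("M", "MT"):
        # The mitochondrial contig is commonly "MT" or "chrM".
        candidates = [name, "MT", "chrM", "M", "chrMT"]
    else:
        candidates = [name, bare, "chr" + bare]
    # Deduplicate while preserving order.
    return list(dict.fromkeys(candidates))


print(candidate_reference_names("19"))
print(candidate_reference_names("chr19"))
```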

Unfortunately, that also does not work.

pyega3 -c 10 -cf ../ega_credentials.json fetch --max-retries -1 --format BAM --output-dir . -r chr19 EGAF00005572695
[2023-09-29 13:00:59 +0200]
[2023-09-29 13:00:59 +0200] pyEGA3 - EGA python client version 5.0.2 (https://github.com/EGA-archive/ega-download-client)
[2023-09-29 13:00:59 +0200] Parts of this software are derived from pyEGA (https://github.com/blachlylab/pyega) by James Blachly
[2023-09-29 13:00:59 +0200] Python version : 3.11.5
[2023-09-29 13:00:59 +0200] OS version : Linux #2311-Microsoft Tue Nov 08 17:09:00 PST 2022
[2023-09-29 13:00:59 +0200] Server URL: https://ega.ebi.ac.uk:8443/v2
[2023-09-29 13:00:59 +0200] Session-Id: 325751053
[2023-09-29 13:01:00 +0200]
[2023-09-29 13:01:00 +0200] Authentication success for user 'j.ouwerkerk.1@erasmusmc.nl'
[2023-09-29 13:01:00 +0200] File Id: 'EGAF00005572695'(236223204951 bytes).
[2023-09-29 13:01:00 +0200] Total space : 7452.02 GiB
[2023-09-29 13:01:00 +0200] Used space : 6552.95 GiB
[2023-09-29 13:01:00 +0200] Free space : 899.07 GiB
Traceback (most recent call last):
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/io.py", line 96, in __get
    response.raise_for_status()
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error:  for url: https://ega.ebi.ac.uk:8443/v2/htsget/reads/EGAF00005572695?referenceName=chr19&format=BAM

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jasper/anaconda3/envs/pyega/bin/pyega3", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/pyega3/pyega3.py", line 150, in main
    execute_subcommand(args, data_client)
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/pyega3/libs/commands.py", line 22, in execute_subcommand
    fetch_data(args, data_client)
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/pyega3/libs/commands.py", line 49, in fetch_data
    file.download_file_retry(num_connections=args.connections,
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/pyega3/libs/data_file.py", line 307, in download_file_retry
    htsget.get(
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/io.py", line 81, in get
    manager.run()
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/protocol.py", line 143, in run
    self.__retry(self._handle_ticket_request)
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/protocol.py", line 114, in __retry
    method(*args)
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/io.py", line 147, in _handle_ticket_request
    first_piece = next(stream, "").decode(encoding)
                  ^^^^^^^^^^^^^^^^
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/io.py", line 107, in _stream
    response = self.__get(url, headers=headers, stream=True, timeout=self.timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jasper/anaconda3/envs/pyega/lib/python3.11/site-packages/htsget/io.py", line 100, in __get
    raise exceptions.ClientError(str(he), response.text)
htsget.exceptions.ClientError: 404 Client Error:  for url: https://ega.ebi.ac.uk:8443/v2/htsget/reads/EGAF00005572695?referenceName=chr19&format=BAM:{"htsget":{"timestamp":"2023-09-29T11:01:01.063+00:00","url":"http://ega.ebi.ac.uk/v2/htsget/reads/EGAF00005572695","error":"NotFound","message":"Sequence \"chr19\" not found"}}
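
The last line of the traceback carries the server's JSON error body after the URL: the error is "NotFound" with the message that sequence "chr19" was not found, so the request reached the server and was rejected for the reference name. As a minimal sketch, the body can be pulled out of such a message as follows. `parse_htsget_error` is a hypothetical helper, not part of htsget or pyega3, and it assumes the message has the same shape as the one above (free text followed by a single JSON object).

```python
import json


def parse_htsget_error(text: str) -> dict:
    """Extract the "htsget" error object from an error message string.

    Hypothetical helper; returns {} if no JSON body can be parsed.
    """
    start = text.find("{")  # the JSON body begins at the first brace
    if start == -1:
        return {}
    try:
        payload = json.loads(text[start:])
    except json.JSONDecodeError:
        return {}
    if not isinstance(payload, dict):
        return {}
    return payload.get("htsget", {})


# Message text copied from the traceback above.
message = (
    "404 Client Error:  for url: https://ega.ebi.ac.uk:8443/v2/htsget/reads/"
    "EGAF00005572695?referenceName=chr19&format=BAM:"
    '{"htsget":{"timestamp":"2023-09-29T11:01:01.063+00:00",'
    '"url":"http://ega.ebi.ac.uk/v2/htsget/reads/EGAF00005572695",'
    '"error":"NotFound","message":"Sequence \\"chr19\\" not found"}}'
)

details = parse_htsget_error(message)
print(details.get("error"), "-", details.get("message"))
```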