pepkit/geofetch

Disable downloading huge soft files

Closed this issue · 5 comments

Some of the soft files are bigger than 10 MB.
I think we should skip downloading them unless a particular argument is set.

Information about the soft files can be found here, e.g. https://ftp.ncbi.nlm.nih.gov/geo/series/GSE199nnn/GSE199233/soft/

Use 'requests.head' to get the size of the file, and fail the download if necessary.

@nsheff
We can't get file size information from the API (HEAD request).
What we can do is parse the web page using e.g. beautifulsoup. But we talked about this a few months ago, and the decision was not to do it.

No, don't scrape the website. Just construct the HTTP URL to the file, and then HEAD it.

e.g.:

curl -I https://ftp.ncbi.nlm.nih.gov/geo/series/GSE107nnn/GSE107227/soft/GSE107227_family.soft.gz
HTTP/1.1 200 OK
Date: Wed, 30 Nov 2022 18:53:19 GMT
Server: Apache
Last-Modified: Fri, 04 Nov 2022 05:17:32 GMT
ETag: "75b-5ec9e30d9cf04"
Accept-Ranges: bytes
Content-Length: 1883
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET,POST,PUT,OPTIONS
Access-Control-Allow-Headers: RANGE, Cache-control, If-None-Match, Content-Type
Access-Control-Expose-Headers: Content-Length, Content-Range, Content-Type
Content-Type: application/x-gzip
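
The URL in the curl example follows a regular pattern: the series directory is the accession with its last three digits replaced by "nnn". Constructing it could look roughly like the sketch below. The names `soft_url` and `GEO_SERIES_ROOT` are hypothetical (not taken from geofetch), and the handling of accessions with fewer than four digits is an assumption.

```python
import re

# Root of the GEO series tree, as seen in the URLs in this thread.
GEO_SERIES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def soft_url(gse: str) -> str:
    """Build the family SOFT URL for a GEO series accession such as 'GSE107227'."""
    match = re.fullmatch(r"GSE(\d+)", gse)
    if match is None:
        raise ValueError(f"not a GSE accession: {gse!r}")
    digits = match.group(1)
    # The parent directory masks the last three digits with 'nnn',
    # e.g. GSE107227 -> GSE107nnn. Assumption: accessions with three or
    # fewer digits fall under 'GSEnnn'.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_ROOT}/{stub}/{gse}/soft/{gse}_family.soft.gz"
```

Calling `soft_url("GSE107227")` gives the same URL that was passed to curl above.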

Yes, but there is no information about the size of the file.

Yes there is, it's under Content-Length: 1883

That is the file size in bytes.
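
A size check based on that header might look like the sketch below. It is only an illustration under stated assumptions: `MAX_SOFT_BYTES`, `content_length`, `is_too_large` and `head_headers` are hypothetical names, the 10 MB limit is the figure from the issue description, and the HEAD request uses the standard library's `urllib` rather than `requests` to keep the example dependency-free.

```python
import urllib.request
from typing import Mapping, Optional

# Hypothetical default limit: 10 MB, the threshold mentioned in the issue.
MAX_SOFT_BYTES = 10 * 1024 * 1024


def content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the size in bytes from Content-Length, or None if absent or invalid."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def is_too_large(headers: Mapping[str, str], max_bytes: int = MAX_SOFT_BYTES) -> bool:
    """True when the reported size exceeds max_bytes."""
    size = content_length(headers)
    # An unknown size is treated as acceptable here; a stricter policy
    # could reject it instead.
    return size is not None and size > max_bytes


def head_headers(url: str, timeout: float = 30.0):
    """Send a HEAD request and return the response headers (no body is downloaded)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers
```

With `requests`, the headers from `requests.head(url).headers` could be passed to `is_too_large` in the same way. For the response shown above (Content-Length: 1883), the check returns False.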

ohh, I see, my bad