schmitt-muc/SEN12MS

Unable to download dataset from command line


Hi, I'm working on a torchvision-style dataset that automatically downloads and checksums SEN12MS. I see that the dataset is hosted on https://dataserv.ub.tum.de/s/m1474000. However, when I try to download one of the files, I get an error message:

$ wget 'https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz'
--2021-06-10 21:01:24--  https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz
Resolving dataserv.ub.tum.de (dataserv.ub.tum.de)... 138.246.224.34, 2001:4ca0:800::8af6:e022
Connecting to dataserv.ub.tum.de (dataserv.ub.tum.de)|138.246.224.34|:443... connected.
ERROR: cannot verify dataserv.ub.tum.de's certificate, issued by ‘CN=DFN-Verein Global Issuing CA,OU=DFN-PKI,O=Verein zur Foerderung eines Deutschen Forschungsnetzes e. V.,C=DE’:
  Unable to locally verify the issuer's authority.
To connect to dataserv.ub.tum.de insecurely, use `--no-check-certificate'.

Clicking on the download button allows me to download through the web browser, but I would like to be able to download from the command line. Is this possible (without disabling security certificate checks)?
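One way to keep certificate verification enabled is to hand the issuing CA chain to the client explicitly. A minimal sketch in Python, assuming you have already saved the DFN-Verein chain as a local PEM file (the cafile path and output filename are placeholders):

import shutil
import ssl
import urllib.request

url = "https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz"

# Verify against the DFN-Verein chain instead of disabling checks entirely.
ctx = ssl.create_default_context(cafile="dfn-verein-chain.pem")  # placeholder path

with urllib.request.urlopen(url, context=ctx) as resp:
    with open("ROIs1158_spring_lc.tar.gz", "wb") as f:
        shutil.copyfileobj(resp, f)  # stream to disk rather than buffering in RAM

The same idea works for wget via its --ca-certificate option.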


Current workaround pointed out by @calebrob6:

$ wget "ftp://m1474000:m1474000@dataserv.ub.tum.de/ROIs1158_spring_lc.tar.gz"

Sorry for the late reply! I would prefer rsync:
"The data server also offers downloads with rsync (password m1474000):
rsync rsync://m1474000@dataserv.ub.tum.de/m1474000/"

Hi @schmitt-muc, when I run that command it doesn't download anything.

I'm trying to write a PyTorch data loader. Torchvision is able to automatically download and checksum datasets from a URL, but the FTP and rsync URLs don't work for this.

I have just checked (running Ubuntu 20.04 LTS from inside Windows 10 Enterprise using WSL2):
Running the command
$ rsync -chavzP --stats rsync://m1474000@dataserv.ub.tum.de/m1474000/ path/to/your/local/storage/folder
works. Of course you first have to enter the password m1474000, and of course retrieving the incremental file list takes ages, but it should do the job.

Yes, that seems to work, although I still can't download the data from Python without shelling out to the system rsync executable. A normal URL would be much nicer for users who aren't using rsync.
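For what it's worth, that shell-out looks roughly like the sketch below; it assumes a system rsync binary on PATH and uses rsync's standard RSYNC_PASSWORD environment variable to skip the interactive prompt:

import os
import subprocess

# Pass the daemon password via the environment to avoid the interactive prompt.
env = dict(os.environ, RSYNC_PASSWORD="m1474000")

subprocess.run(
    [
        "rsync", "-chavzP", "--stats",
        "rsync://m1474000@dataserv.ub.tum.de/m1474000/",
        "path/to/your/local/storage/folder",
    ],
    env=env,
    check=True,  # raise CalledProcessError if rsync exits non-zero
)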

Ah, now I understand. I suggest following Caleb Robinson's advice. At least for me wget -r "ftp://m1474000:m1474000@dataserv.ub.tum.de" does the job just fine and downloads the whole package automatically.

Yes, that URL works with wget but not with Python's urllib for some reason. Is there a working https:// option?

I have sent an inquiry to TUM's library, which hosts the data on their media server. The response won't make you too happy: there is definitely no https:// option, since even the .zip file you can download by clicking the Download button in the graphical interface is only created on the fly by an internal Nextcloud function. The only suggestion I got was to look into the Python libraries ftplib, wget, and urllib2 (urllib.request in Python 3), which are dedicated to FTP downloads.
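Following that suggestion, the ftplib sketch from earlier can be extended to checksum the file while it downloads, which is roughly what torchvision's download helpers do over HTTP. The expected MD5 below is a placeholder, not the file's real checksum:

import hashlib
from ftplib import FTP

EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"  # placeholder value

md5 = hashlib.md5()
with FTP("dataserv.ub.tum.de") as ftp:
    ftp.login(user="m1474000", passwd="m1474000")
    with open("ROIs1158_spring_lc.tar.gz", "wb") as f:

        def handle(chunk: bytes) -> None:
            f.write(chunk)
            md5.update(chunk)  # hash each chunk as it arrives, no second pass

        ftp.retrbinary("RETR ROIs1158_spring_lc.tar.gz", handle)

if md5.hexdigest() != EXPECTED_MD5:
    raise RuntimeError("MD5 mismatch: the download may be corrupted")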

There also seems to be a mirrored version on Google Cloud Storage, see https://gitlab.com/frontierdevelopmentlab/disaster-prevention/sen12ms:

$ gsutil -m rsync -r gs://fdl_floods_2019_data/SEN12MS path/to/your/local/storage/folder

Not sure whether this is of any help for you, though.
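If that mirror is still up, the bucket can also be read from Python with the google-cloud-storage package. A sketch, assuming the bucket is publicly readable and that the object path below matches the mirror's layout (both unverified):

from google.cloud import storage

# Anonymous client: no GCP credentials needed if the bucket is public.
client = storage.Client.create_anonymous_client()
bucket = client.bucket("fdl_floods_2019_data")

# The object path is a guess based on the gsutil command above.
blob = bucket.blob("SEN12MS/ROIs1158_spring_lc.tar.gz")
blob.download_to_filename("ROIs1158_spring_lc.tar.gz")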