Downloading fails for files with no Content-Disposition

Question

Opened this issue 2 years ago · 1 comment

Example packages:
1: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py
Sample URL: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList

2: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py
Sample URL: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip

Answer 1 · 2022-08-03T17:22:38.000Z

The second one is fixed by sending a browser-like User-Agent header, i.e., Wiley (the publisher) is trying to block automated downloads. I tested this with wget, but we should be able to do the same thing in Python.
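
For reference, here is a rough sketch of what the Python side could look like using only the standard library. This is not the retriever's actual download code; the function names `build_request` and `download` and the exact User-Agent string are made up for illustration, and I haven't confirmed which User-Agent values Wiley accepts.

```python
import shutil
import urllib.request

# Assumed value: any reasonably current browser User-Agent string should do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest_path, user_agent=BROWSER_USER_AGENT, timeout=60):
    """Stream the response body for `url` into the file at `dest_path`."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Copy in chunks so large archives are not held in memory.
            shutil.copyfileobj(response, out_file)
    return dest_path
```

Calling `download(url, "ecy1792-sup-0001-DataS1.zip")` with the Wiley URL above would be the equivalent of the wget test. The destination filename is passed in explicitly here, so nothing depends on the server sending a Content-Disposition header.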

As you mentioned earlier, the first one is a mess. Not only does the URL render as HTML, but the data itself isn't in the HTML; it's rendered by JavaScript, so I think you'd basically have to cut and paste the text out of the browser. I don't have any good ideas for this one other than emailing the data providers and asking them to provide a better option. We might be able to scrape it somehow, but I don't think it's worth it for one dataset.