Downloading fails for files with no Content-Disposition
Example packages:
1: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py
Sample url: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList
2: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py
Sample url: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip
The second one is fixed by spoofing a browser user agent, i.e., it's Wiley (the publisher) trying to block automated downloads. I tested this with wget, but we should be able to do the same thing in Python.
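For reference, a minimal sketch of what the Python side could look like, using only the standard library. This is not the retriever's actual download code: the `BROWSER_UA` string, the function names, and the `?file=` fallback are all assumptions for illustration, and I haven't confirmed which user agent strings Wiley accepts.

```python
import shutil
import urllib.request
from urllib.parse import parse_qs, urlparse

# Assumption: an illustrative browser User-Agent string; any common one may do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that identifies itself as a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fallback_filename(url, default="download"):
    """Guess a file name from the URL when no Content-Disposition is sent."""
    parsed = urlparse(url)
    # Hypothetical convention: some servers pass the name as ?file=...
    from_query = parse_qs(parsed.query).get("file")
    if from_query:
        return from_query[0]
    # Otherwise use the last path segment, if there is one.
    name = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return name or default


def download(url, dest=None, timeout=60):
    """Fetch url with a browser User-Agent and save it to disk."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # get_filename() returns None when Content-Disposition is absent.
        name = dest or resp.headers.get_filename() or fallback_filename(url)
        with open(name, "wb") as out:
            shutil.copyfileobj(resp, out)
    return name
```

Calling `download(url)` with the Wiley url above would save to `ecy1792-sup-0001-DataS1.zip` if the server sends no Content-Disposition, since the name is taken from the `file` query parameter.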
As you mentioned earlier, the first one is a mess. Not only does the url return an html page instead of a file, but the data itself isn't in the html; it's rendered by javascript, so I think you'd basically have to cut and paste the text out of the browser. I don't have any good thoughts on this one other than to email the data providers and ask them to provide a better option. We might be able to scrape it out somehow, but I don't think it's worth it for one dataset.