peterson-tim-j/AWAPer

Download link from BOM seems to have been blocked

Hi Tim et al. We have been using AWAPer with success, but now the BOM link seems to be blocked:

Example code

library(AWAPer)

setwd('S:/PRJ-MuttamaCreek/AWAP')

makeNetCDF_file(ncdfFilename = 'AWAP_2000_2020.nc',
                ncdfSolarFilename = 'AWAP_2000_2020_solar.nc',
                updateFrom = as.Date("2000-01-01"),
                updateTo = as.Date('2020-12-31'))

Starting to build both netCDF files.
... Testing downloading of AWAP precip. grid
... Getting grid gemoetry from file.
Error in readLines(a <- file(des.file.name)) : cannot open the connection
In addition: Warning messages:
1: In utils::download.file(url, des.file.name, quiet = T, mode = "wb") :
cannot open URL 'http://www.bom.gov.au/web03/ncc/www/awap/rainfall/totals/daily/grid/0.05/history/nat/2000010120000101.grid.Z': HTTP status was '403 Forbidden'
2: In readLines(a <- file(des.file.name)) :
cannot open file 'S:/PRJ-MuttamaCreek/AWAP/precip.20000101.grid': No such file or directory

Contact BOM?

May well be related to BOM's new policy.

You are entitled to use material on Bureau websites in accordance with the applicable terms above, noting that material such as Water Data is generally available under generous open access terms including a right to distribute and modify material. The use of any material on the Bureau websites obtained through use of automated or manual techniques including any form of scraping or hacking is prohibited.
'scraping' includes page, content, screen or web scraping amongst others, and is the process of extracting information from websites usually by converting unstructured website content (usually HTML) into structured data.

http://www.bom.gov.au/other/copyright.shtml

bomrang is now blocked from downloading historical weather data sets served over HTTP but, as far as I'm aware, none of the FTP requests that it makes are blocked.

More discussion: ropensci-archive/bomrang#137

Thanks Willem and Adam for identifying this.

I've done some digging and, following this, tried the lines below:

install.packages("RCurl")
library(RCurl)
url_str <- 'http://www.bom.gov.au/web03/ncc/www/awap/rainfall/totals/daily/grid/0.05/history/nat/2000010120000101.grid.Z'
dest <- getURL(url_str, verbose = TRUE, useragent = getOption("HTTPUserAgent"))
dest

and it returns: [1] "Potential automated request detected! We are making changes to our website therefore web scraping is no longer supported. Please contact us by filling in the details at http://reg.bom.gov.au/screenscraper/screenscraper_enquiry_form/ and we will get in touch with you."

Since I enjoy a challenge, I dug some more and found that I need to simulate a web browser (see here). So I tried the following and now 'dest' contains the contents of the .Z AWAP compressed grid file.

url_str <- 'http://www.bom.gov.au/web03/ncc/www/awap/rainfall/totals/daily/grid/0.05/history/nat/2000010120000101.grid.Z'
dest <- getBinaryURL(url_str, verbose = TRUE, .opts = list(useragent = "Mozila 5.0"))
dest

Could both of you test if this works for you?

If this hack works for both of you, then I'll try to edit the code and get AWAPer back to life.

Hi Tim,
seems to work:

dest
[1] 1f 9d 90 6e c6 bc 61 33 07 04 0e 1c 36 14 b8 91 f3 e6 4e 41
[21] 1b 39 62 28 c0 c3 86 cd 98 32 6e e8 94 91 03 22 46 0c 19 2e
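
For anyone else testing this, here is a rough sketch of an offline check on whatever bytes come back. The leading bytes `1f 9d` in the output above are the magic number of Unix `compress` (.Z) files, so they can be used to tell a real grid file from the text notice quoted earlier. The function name `classify_bom_response` is hypothetical and not part of AWAPer or RCurl, and it assumes the response has already been fetched as a raw vector (for example with `getBinaryURL`).

```r
# Hypothetical helper (not part of AWAPer): classify a raw response body.
# Assumes 'bytes' is a raw vector, e.g. the value returned by getBinaryURL().
classify_bom_response <- function(bytes) {
  stopifnot(is.raw(bytes))

  # Unix compress (.Z) files start with the two bytes 0x1f 0x9d.
  compress_magic <- as.raw(c(0x1f, 0x9d))
  if (length(bytes) >= 2 && identical(bytes[1:2], compress_magic)) {
    return("compressed_grid")
  }

  # rawToChar() fails on embedded nul bytes, so only treat nul-free
  # bodies as text before looking for the notice quoted in this thread.
  if (length(bytes) > 0 && !any(bytes == as.raw(0))) {
    body <- rawToChar(bytes)
    if (grepl("Potential automated request detected", body,
              fixed = TRUE, useBytes = TRUE)) {
      return("blocked_notice")
    }
  }

  "unknown"
}

# Example with the first bytes shown in the output above.
classify_bom_response(as.raw(c(0x1f, 0x9d, 0x90, 0x6e)))
```

This only inspects the bytes; it does not make any request itself, and a matching magic number is no guarantee that the rest of the file is a valid grid.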

However, I hope this gets you around the bigger issue, which is that BOM is not a fan of web scraping or similar. I had discussions with the BOM about this in 2012 or so (the same policy was in force then), so I have always been surprised that you were getting around it; I assumed you had discussed it with BOM and got permission. If that hasn't been done, we really should tackle this as a group (bomrang, AWAPer, bomWater) to make sure they sanction our tools. I thought bomrang was sanctioned?

Yes, we’re aware that we can spoof a browser in the user agent string. We already use a custom one that states that it’s bomrang version “XX”. Given my position as a public servant for state government, I’m not sure that I’m willing to change that string so that bomrang poses as a web browser to circumvent BOM’s policies.

bomrang isn’t official or sanctioned. But so far we’ve had no issues with our FTP-based functions, just the HTTP requesting functions.