Archives all ZIP files on http://gis.epa.ie/GetData/Download.
Check out this repository. Then install the dependencies:
yarn
or npm install
Head over to Mailback and create a new mailbox.
Then set the SCRAPER_MAILBACK_MAILBOX environment variable to the mailbox ID from above and run the index.js Node.js script:
SCRAPER_MAILBACK_MAILBOX=##### node index
If scraping or downloading stalls, press CTRL+C to quit and then restart the script. The tool is resumable: it checks the existing manifest and continues from where it left off.
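The resume behaviour described above might be sketched as follows. The manifest layout (an object keyed by document ID) and the function name `selectPending` are assumptions made for illustration; the actual index.js may store and check the manifest differently.

```javascript
// Hypothetical sketch: skip every document ID that already has a manifest entry.
// Assumes the manifest is a plain object keyed by document ID.
function selectPending(documentIds, manifest) {
  return documentIds.filter(
    (id) => !Object.prototype.hasOwnProperty.call(manifest, id)
  );
}
```

With a manifest that already contains entries for two of four IDs, only the remaining two are returned, so a restarted run does not request them again.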
- SCRAPER_MAILBACK_MAILBOX - Required. Mailback mailbox ID.
- SCRAPER_INDEX_URL - The document index page URL. Override for testing. Defaults to: http://gis.epa.ie/GetData/Download
- SCRAPER_DOWNLOAD_URL - The download request submit URL. Override for testing. Defaults to: http://gis.epa.ie/getdata/downloaddata
- SCRAPER_MAX_DOCS - The maximum number of documents to request. Override for testing. Defaults to nothing (i.e. request everything).
- SCRAPER_START_IDX - The starting index (zero-based). Override for testing or for retrying a particular file. Defaults to 0 (i.e. start from the beginning).
- SCRAPER_SKIP_DOWNLOAD - Skips downloading the files and only scrapes the download URLs. Set to any non-empty value (like "1"). Defaults to nothing (i.e. download ZIP files).
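As an illustration of how these variables and their defaults fit together, here is a minimal sketch. The function name `readConfig` and the shape of the returned object are hypothetical; only the variable names and default values come from the list above.

```javascript
// Hypothetical helper: the real index.js may read these variables differently.
function readConfig(env) {
  if (!env.SCRAPER_MAILBACK_MAILBOX) {
    throw new Error('SCRAPER_MAILBACK_MAILBOX is required');
  }
  return {
    mailbox: env.SCRAPER_MAILBACK_MAILBOX,
    indexUrl: env.SCRAPER_INDEX_URL || 'http://gis.epa.ie/GetData/Download',
    downloadUrl: env.SCRAPER_DOWNLOAD_URL || 'http://gis.epa.ie/getdata/downloaddata',
    // No limit unless SCRAPER_MAX_DOCS is set.
    maxDocs: env.SCRAPER_MAX_DOCS ? Number.parseInt(env.SCRAPER_MAX_DOCS, 10) : Infinity,
    // Zero-based index of the first document to request.
    startIdx: env.SCRAPER_START_IDX ? Number.parseInt(env.SCRAPER_START_IDX, 10) : 0,
    // Any non-empty value turns downloading off.
    skipDownload: Boolean(env.SCRAPER_SKIP_DOWNLOAD),
  };
}
```

Calling `readConfig(process.env)` at startup would fail fast when the mailbox ID is missing, rather than partway through a run.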
The index page shows the radio buttons in chunks, but all of them are in fact present in the page's HTML. The script fetches the HTML and grabs the document IDs from the radio buttons on the page.
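A rough sketch of that extraction step is shown below. It assumes each radio button is an `<input type="radio">` element whose `value` attribute holds the document ID; the actual markup of the index page, and the way index.js parses it, may differ. The function name `extractDocumentIds` is hypothetical, and a regular expression is used here only to keep the example dependency-free.

```javascript
// Hypothetical sketch: collect the value attribute of every radio input.
// Assumption: the document ID is stored in the radio button's value attribute.
function extractDocumentIds(html) {
  const ids = [];
  for (const [tag] of html.matchAll(/<input\b[^>]*>/gi)) {
    // Ignore inputs that are not radio buttons.
    if (!/\btype\s*=\s*["']?radio["']?/i.test(tag)) continue;
    const value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (value) ids.push(value[1]);
  }
  return ids;
}
```

Because the function works on the raw HTML, it also picks up radio buttons that the page hides behind its chunked navigation.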
The download request form includes a CAPTCHA, but it turned out to be a purely client-side check that can be bypassed completely by submitting the required fields directly to the target URL.
For reasons that are unclear, "X-Requested-With" has to be submitted as a form parameter (not as an HTTP header) along with the other fields.
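To make the distinction concrete, the sketch below builds a form body in which X-Requested-With is an ordinary field. The field names `FileId` and `Email`, the value `XMLHttpRequest`, and the function name are all assumptions for illustration; the real field names have to be taken from the download form itself.

```javascript
// Hypothetical sketch of the request body; only "X-Requested-With" is named in this README.
function buildDownloadRequestBody(documentId, email) {
  const body = new URLSearchParams();
  body.set('FileId', documentId); // hypothetical field name
  body.set('Email', email); // hypothetical field name
  // Sent as a form field in the body, not as an HTTP request header.
  body.set('X-Requested-With', 'XMLHttpRequest'); // assumed value
  return body;
}
```

The resulting `URLSearchParams` object serializes to `application/x-www-form-urlencoded` text via `toString()`, which can then be posted to the download request URL.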
- Fetches the index page and extracts all document IDs
- Submits the download request form for each document ID, using the Mailback mailbox email address
- Fetches the HTML content of the email from Mailback
- Extracts the first link in the HTML (it points to a ZIP file)
- Downloads the ZIP file (skipped when SCRAPER_SKIP_DOWNLOAD is set)
- Saves the ZIP file under the archive folder (skipped when SCRAPER_SKIP_DOWNLOAD is set)
- Updates the manifest.json file under the archive folder
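The link extraction step in the list above could look roughly like this. The function name `extractFirstLink` is hypothetical, and the sketch assumes the first `href` attribute in the email's HTML belongs to the ZIP link; index.js may parse the email differently.

```javascript
// Hypothetical sketch: return the first href found in the email HTML, or null.
function extractFirstLink(html) {
  const match = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/i.exec(html);
  if (!match) return null;
  // Ampersands in attribute values are usually escaped as &amp; in HTML.
  return match[1].replace(/&amp;/g, '&');
}
```

Returning `null` when no link is found lets the caller decide whether to retry fetching the email or to leave the document out of the manifest for a later run.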